Showing posts with label f3-statistics. Show all posts

June 08, 2013

Friendly rejoinder to genetiker

genetiker calls me "dumber than he thought" and responds to my criticism of his model. As always, I will disregard the name calling and deal with the (much more interesting) facts.

First, he writes:
In his post Dienekes takes the phylogeny I used for running the F4 ratio estimation program and shows that it won’t work for f3 statistics. 
No kidding.
No kidding indeed. Either genetiker believes in his phylogeny or he doesn't. The fact that the F4 ratio estimation program requires a phylogeny with that structure is meaningless: as I have shown, that phylogeny is wrong because it makes a prediction that is falsified by the data. Garbage in, garbage out, so the estimates obtained by genetiker with the wrong phylogeny are of course... wrong.

Second, he presents an even more elaborate phylogeny, where "V is Veddoids, C is Caucasoids, M is Mediterraneans, N is Nordics, G is Mongoloids, S is Sardinians, E is Europeans, and A is Amerindians."


This phylogeny is of course also wrong, for at least two reasons:

  • It ignores post-admixture drift in Europeans, i.e., the drift that has accumulated after E was formed by M+N. This drift is always traversed in the same direction from E to S and to A, so it contributes a constant positive term in the value of F3(E; S,A)
  • It proposes instantaneous formation of S, E, and A, e.g., the "Nordic" component in Europeans is symmetrically related to the "Nordic" component in Sardinians and Amerindians. genetiker clearly does not believe this, since he argues in his site (i) for I-M26 bearing "White Gods" coming to the Americas via the Canary islands, (ii) that mtDNA haplogroup X in the Americas is Caucasoid and so is (iii) Y-haplogroup C, which although "originally Veddoid" was carried by "Caucasoids" into the Americas. Now there's zero evidence that any of this has anything to do with Caucasoids, let alone Nordics in the Americas, but in any case it would be nice if genetiker harmonized his convoluted model of "Nordic" migrations with his phylogeny. In other words, his mental model of what happened isn't only inconsistent with the data, it's also inconsistent with itself. 
Finally, genetiker attempts to work out the mathematical details of his model, arriving at the conclusion that:
There are four paths from Europeans to Sardinians and four paths from Europeans to Amerindians, so there are sixteen path combinations.
This is of course wrong, because these paths are not independent; one actually needs to sum over 8 (=2^3) different trees for the different combinations of α, β and γ in the model; genetiker is therefore using wrong math applied to a wrong model. I believe his confusion stems from conflating admixture edges with drift edges.

It is not clear what he has aimed to accomplish with this "model", but let's analyze it properly: 

  • If α,β or 1-α,1-β  then because of the instantaneous derivation of S,E from M and N respectively there is no drift in the degenerated length-0 "path" E-to-S, and hence F3(E; S,A) = 0. So, we only have to consider the cases α, 1-β and 1-α, β:
  • If 1-γ then if α,1-β we have drift overlap MC, or if β,1-α we have drift overlap CN
  • If γ then if α,1-β we have drift overlap MC+CN, or if β,1-α we have drift overlap 0

So, in total we have a positive F3(E; S, A) statistic again, since we are summing over positive or zero drifts. If we also added the post-admixture drift in E, that statistic would be even higher, although this is not really necessary to falsify genetiker's model.

In any case, I still applaud genetiker for engaging with the data, and I'm happy to contribute to his continuing education!

June 05, 2013

Amerindian-like admixture in northern Europe is real

genetiker, a new genome blogger, questions the existence of Amerindian-like admixture in Europe. I am generally well-disposed to anyone who tries their hand at analysis of genetic data. On the other hand, if one accuses me of writing a series of posts "chock-full of stupidity", then there's a good chance I might respond. This should also be useful for anyone wishing to understand the evidence for this admixture.

genetiker proposes that the "Amerindian-like" admixture in North Europeans is misunderstood and can be in fact explained by the existence of "North European-like" admixture in Amerindians. In support of this, he presents the results of an F4 Ratio estimation analysis which suggests that there is "Nordic admixture of the Amerindian populations in the 10 to 20 percent range."

F4 Ratio estimation produces admixture estimates but does not prove the existence of such admixture. The admixture estimates are as good as the relationship proposed for a particular set of populations. If the relationship is nonsensical, so will be the admixture estimates.

According to genetiker, the following relationship holds, with A=Sardinian, B=Orcadian, C=Dai, and O=Yoruba, with X=Amerindians.


But, is the above consistent with the data? The existence of Amerindian-like admixture was argued by Patterson et al. (2012) on the basis of the following F3 statistic (right):

F3(European; Sardinian, Amerindian)

which is significantly negative for North Europeans. Now, consider the value of this statistic for genetiker's phylogeny.

F3(B = North European; A = Sardinian, X = Amerindian)


In the above figure I color-coded the path from B=North European to A=Sardinian (red) and from B=North European to X=Karitiana (green, if it goes via the supposed "North European" admixture, or blue, if it goes via the "Amerindian" admixture). The value of the F3 statistic is then the weighted sum of the overlap of the red/green and red/blue paths:

F3(B; A, X) = αBZ+(1-α)(BZ+ZW)

where BZ and ZW are drifts along the paths indicated in the figure. This statistic is then always positive, since the common segments in the graph are traversed in the same direction.
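The positivity claim can be checked numerically with a minimal sketch. The drift lengths below are arbitrary placeholders (the post does not give values for BZ and ZW), and `f3_under_model` is a hypothetical helper name used only for this illustration.

```python
def f3_under_model(alpha, bz, zw):
    """F3(B; A, X) = alpha*BZ + (1 - alpha)*(BZ + ZW) for the proposed phylogeny."""
    return alpha * bz + (1 - alpha) * (bz + zw)

# Placeholder drift lengths (assumed values, chosen only for illustration).
BZ, ZW = 0.01, 0.02

# Scan the admixture proportion alpha over [0, 1] in steps of 0.1.
values = [f3_under_model(i / 10, BZ, ZW) for i in range(11)]
all_positive = all(v > 0 for v in values)
print(all_positive)
```

Whatever non-negative drift lengths are substituted, the expression is a weighted sum of non-negative terms with weights between 0 and 1, so it cannot become negative for any choice of α.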

genetiker's model is thus falsified by the data: it predicts a positive f3(North European; Sardinian, Amerindian) statistic, but we in fact observe negative ones.

November 24, 2012

Assessment of Totonac and Bolivian samples using 'globe13'

I was on the lookout for some Affy 6.0 samples recently, and I discovered the data of the recent Watkins et al. (2012) paper, so I decided to run them through my globe13 calculator. A total of 49,233 SNPs were in common between that and my globe13 set, which is not much, but ought to be sufficient to discover the main features of these two population samples.


It appears that both samples are mainly "Amerindian", with the Bolivian sample having some more European admixture than the Totonac one.

Here are the population portraits, clearly showing that the "European" admixture in Bolivians comes from a subset of individuals.
For comparison, here are the ADMIXTURE results from the original paper that appear quite similar to my own. (Note that the individual ordering is probably not the same as my own):


The Mediterranean/North_European ratio of my own analysis suggests the likely "southern" (probably Spanish) origin of the European admixture in these populations.

UPDATE:

I also combined the two Amerindian populations with HGDP Karitiana, Sardinian, and French to calculate f3-statistics. Here are the significant ones:


So, admixture in the Bolivian sample is confirmed, while in the Totonac one it is not. I do think it's possible that the Totonac might have a little European admixture though which might be masked by their history of drift. Also notice the evidence for admixture in the French using all three Amerindian samples, with lowest f3(French; Amerindian, Sardinian) using the Karitiana reference.

November 22, 2012

ALDER signal of admixture in Ashkenazi Jews

(You can skip the first part if you want, and head straight to the RESULTS section)

Previous studies on uniparental markers have indicated that Ashkenazi Jews (AJ) were formed by admixture between a Near Eastern population and European host populations; the evidence for the former element seems pretty clear on the basis of Y-chromosomes where Jews possess a relatively high frequency of Y-haplogroup J1 (and a few others) that are quite rare in non-Jewish north/east Europeans. As for the latter, it seems probable on the basis of the location of Ashkenazi Jews on PCA plots where they tend to occupy an intermediate position between extant populations of the Levant (including Near Eastern Jews) and non-Jewish Europeans.

Anyone who has played around with genetic data will know that while AJ may be positioned in the aforementioned "intermediate" location within the "West Eurasian continuum" between Europe and Near East, they tend to form their own cluster at higher dimensions. And, indeed, this is why it's fairly easy for a clustering algorithm, such as my "Clusters Galore" (MCLUST/MDS) approach to pick out a very specific AJ cluster (e.g., here, or here, using a fastIBD approach). An Ashkenazi Jewish-specific cluster also pops out at higher K in ADMIXTURE analyses. This cluster may reflect endogamy within the AJ community until quite recent times.

One way of detecting admixture in a group is through the use of f3-statistics. The statistic f3(AJ; European, Near_East) could be negative, which would indicate admixture, but it is usually not, at least in the combinations of (European, Near_East) I've tried, and a non-negative value is consistent with either the presence or the absence of admixture.

A simple and intuitive way to see why post-admixture drift might mask the presence of admixture can be seen by means of a simple calculation. Remember that the f3-statistic's +/- sign depends on the +/- sign of quantities (c-a)*(c-b) where c is an allele frequency in the admixed (?) population we are investigating, and a, b in the two reference populations. We can pick a to be less than b with no loss of generality.

In the absence of strong drift (e.g., if all populations have a very large number of individuals), then the allele frequency c=xa+(1-x)b where x is the amount of admixture --between 0 and 1-- from group A and (1-x) from group B, and this c will be maintained little changed in the post-admixture phase. With the aid of a little algebra, we get that:

(c-a)*(c-b) = (xa+(1-x)b-a)*(xa+(1-x)b-b)
= (xa+b-xb-a)*(xa+b-xb-b)
= (1-x)(b-a)*x(a-b)
= x(x-1)(a-b)^2

and this is of course negative because x lies strictly between 0 and 1, so that x-1 is negative while x and (a-b)^2 are positive.

In a large population, this c will remain near-constant, because of the lack of strong drift. As long as it remains within the interval (a,b), then (c-a)*(c-b) will also remain negative, and so will the f3 statistic.

But, what if strong drift affects the admixed population? Allele frequencies fluctuate more wildly in smaller populations, so c might go outside the (a,b) interval. Without loss of generality, assume that c becomes greater than b, in which case (c-a)*(c-b) will become positive.

The f3-statistic averages over many SNPs, so, depending on (i) the initial differentiation of the mixing populations, which could be seen as b-a, and (ii) the amount of drift, which causes c to jump outside the (a, b) interval as discussed above, it is possible that the evidence for admixture may disappear.
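The masking effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the reference frequencies are drawn uniformly at random, drift is modelled as plain Wright-Fisher resampling, the population size (20 chromosomes), the 30 generations and the admixture fraction x = 0.5 are arbitrary choices, and `naive_f3` is simply the mean of (c-a)*(c-b) without the finite-sample correction that real f3 software applies.

```python
import random

def naive_f3(c, a, b):
    """Mean of (c - a) * (c - b) over SNPs (no finite-sample bias correction)."""
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)

def drift(p, n_chrom, generations, rng):
    """Wright-Fisher drift: resample n_chrom chromosomes every generation."""
    for _ in range(generations):
        p = sum(rng.random() < p for _ in range(n_chrom)) / n_chrom
    return p

rng = random.Random(1)
n_snps, x = 400, 0.5  # x = admixture fraction from group A (assumed value)
a = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
b = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]

# Immediately after admixture: c = x*a + (1 - x)*b at every SNP.
c0 = [x * ai + (1 - x) * bi for ai, bi in zip(a, b)]
# Strong post-admixture drift in a tiny population (assumed parameters).
c_drifted = [drift(ci, 20, 30, rng) for ci in c0]

f3_before = naive_f3(c0, a, b)
f3_after = naive_f3(c_drifted, a, b)
print(f3_before, f3_after)
```

Before drift the statistic equals x(x-1) times the mean of (a-b)^2, which is negative; after many generations in a very small population the added variance in c outweighs that term and the average turns positive, even though the population is admixed by construction.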

So, drift can obliterate the signal of admixture when we rely on allele frequency differences alone. But, there is a different signal of admixture that uses the decay of admixture linkage-disequilibrium, most recently discussed in the ALDER paper. The admixture LD signal may also disappear in time, but only because recombination confines it to increasingly short genetic distances. Thankfully, for admixture within the last few thousand years it still extends over distances large enough to be captured at the SNP density of existing genotyping platforms, which measure a few hundred thousand SNPs per individual.
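To make the decay idea concrete, here is a minimal sketch that recovers the number of generations from a synthetic, noise-free decay curve. It assumes the simplest model, weighted LD(d) = amplitude * exp(-n*d) with d in Morgans and n in generations; the amplitude, the distance bins and the function name `fit_decay_rate` are all illustrative assumptions, and ALDER itself fits a more elaborate model with an additive constant and jackknife standard errors.

```python
import math

def fit_decay_rate(distances_morgans, weighted_ld):
    """Least-squares slope of log(LD) against distance.

    Returns the number of generations n under LD(d) = amplitude * exp(-n * d).
    """
    ys = [math.log(v) for v in weighted_ld]
    n = len(ys)
    mean_d = sum(distances_morgans) / n
    mean_y = sum(ys) / n
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(distances_morgans, ys))
    sxx = sum((d - mean_d) ** 2 for d in distances_morgans)
    return -sxy / sxx

true_generations = 40.0   # assumed admixture time for the synthetic curve
amplitude = 3e-4          # assumed amplitude
distances = [0.005 * k for k in range(1, 41)]  # 0.5 cM to 20 cM, in Morgans
curve = [amplitude * math.exp(-true_generations * d) for d in distances]

estimate = fit_decay_rate(distances, curve)
print(round(estimate, 3))
```

The older the admixture, the faster the curve falls off with distance, which is why old events need dense markers at short genetic distances to remain detectable.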

METHODS

Naturally I was curious to see whether the admixture LD mechanism would produce the evidence of admixture that the f3-statistics did not. I combined three datasets in my possession (HGDP by Li et al., Behar et al., and Yunusbayev et al.) and identified sets of European and Semitic populations. (Remember that these sets are non-exhaustive, but presumably usable surrogates for the true mixing populations exist within them):

Abhkasians_Y, Adygei, Belorussian, Bulgarians_Y, Chechens_Y, Chuvashs, French, French_Basque, Georgians, Hungarians, Lezgins, Lithuanians, Mordovians_Y, North_Italian, North_Ossetians_Y, Orcadian, Romanians, Russian, Sardinian, Spaniards, Tuscan, Ukranians_Y

and:

Bedouin, Druze, Egyptans, Ethiopian_Jews, Ethiopians, Iraq_Jews, Jordanians, Lebanese, Morocco_Jews, Palestinian, Saudis, Sephardic_Jews, Syrians, Yemenese, Yemen_Jews

I used my Dodecad Project sample of AJ which numbers 36 individuals and is larger than any other usable public sample available to me.

(ALDER was run with default parameters, using the Rutgers recombination map for Illumina chips, and with the merged dataset prepared with a --geno 0.03 flag. Note that the Ashkenazi_D sample consists of individuals typed on different Illumina platforms from 23andMe and FamilyTreeDNA. The total number of SNPs considered was 527,165.)

RESULTS

I report below the tests for which ALDER reported "success" for the test with no warnings:



The median of all these estimates is 36.78 generations or 1070 years which corresponds to a calendar date of 910CE, assuming the sample's birthday was 1980, and a generation length of 29 years.
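The conversion from generations to a calendar date is simple arithmetic; a small sketch using the post's own assumptions (29 years per generation, 1980 as the samples' reference year) follows. The function name `admixture_date` is hypothetical.

```python
def admixture_date(generations, years_per_generation=29, reference_year=1980):
    """Convert an admixture time in generations to (years ago, calendar year)."""
    years_ago = generations * years_per_generation
    return years_ago, reference_year - years_ago

years_ago, calendar_year = admixture_date(36.78)
# Round to the nearest decade, as in the text.
print(round(years_ago, -1), round(calendar_year, -1))
```

Changing the assumed generation length shifts the date proportionally, so the calendar estimate is only as good as that assumption.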

Palamara et al. placed the beginning of demographic expansion of AJ in a similar timeframe (33 generations), following a severe founder effect reducing the population to ~270 individuals. Such a founder effect may have indeed served to produce positive f3-statistics, masking the presence of admixture, the occurrence of which appears to be substantiated on the basis of the ALDER test of admixture.

As for the levels of admixture, using a 1-ref analysis with the European populations, I get the following lower bounds:



I'd be interested in hearing people's opinions on the plausibility of these dates/proportions, as well as their potential historical associations; a lot of factors might affect these results, so perhaps this analysis could be improved in the future.

November 16, 2012

f3-statistics on craniometric data?

It occurred to me that the concept of f3-statistics, originally developed to detect admixture by exploiting allele frequency difference anti-correlations could very well be applied to craniometric data as well.

The basic idea is quite simple: suppose that for a metric trait, two populations A and B have mean value a and b and that a third population C is formed by mixture between A and B. Unlike allele frequencies where the admixed population's frequency will be between a and b immediately post-admixture, anthropometric traits may respond in unexpected ways to admixture (e.g., heterosis might cause first-generation offspring to exceed both their parents in height, rather than exhibit an intermediate value). I will leave the justification of the hypothesis that "mixed-origin offspring will possess intermediate metric traits" to the physical anthropologists, who may have gathered data on such things, and, for the present, I will take it for granted.

So, assuming that c, the mean trait in the mixed population, is between a and b, we can easily see that (c-a)(c-b) will be negative, and hence so will be the correlation coefficient (over many traits) between C-A and C-B, where by C-A I denote the k-long vector difference of mean trait values between populations C and A.

Going back to my analysis of Howells' dataset, I calculated population means for 57 traits over the NORMALIZED_DATA array of modern populations (in which sexual dimorphism has been removed and traits of different scale have been normalized in standard deviation units), and calculated 30*choose(29,2) = 12,180 correlations in total: each of the 30 populations in turn was expressed as a mixture of every pair of the remaining 29.
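The procedure can be sketched as follows. This is not the original analysis code: the toy trait means are made up, the correlation is assumed to be the ordinary Pearson coefficient, and `f3_like_correlations` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb, sqrt

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    su = sqrt(sum((ui - mu) ** 2 for ui in u))
    sv = sqrt(sum((vi - mv) ** 2 for vi in v))
    return cov / (su * sv)

def f3_like_correlations(means):
    """means: dict mapping population -> list of mean trait values.

    Returns (ref1, ref2, target, r) tuples sorted from most negative r upward,
    where r is the correlation between (target - ref1) and (target - ref2).
    """
    results = []
    for target in means:
        others = [p for p in means if p != target]
        for ref1, ref2 in combinations(others, 2):
            d1 = [c - a for c, a in zip(means[target], means[ref1])]
            d2 = [c - b for c, b in zip(means[target], means[ref2])]
            results.append((ref1, ref2, target, pearson(d1, d2)))
    return sorted(results, key=lambda r: r[3])

# Toy data (made up): C is the exact midpoint of A and B on four traits.
toy = {"A": [0.0, 1.0, -1.0, 2.0], "B": [1.0, -1.0, 0.5, 0.0]}
toy["C"] = [(a + b) / 2 for a, b in zip(toy["A"], toy["B"])]

ranked = f3_like_correlations(toy)
n_triples_howells = 30 * comb(29, 2)  # number of triples for 30 populations
print(ranked[0], n_triples_howells)
```

With an exact midpoint the two difference vectors point in opposite directions, so the top-ranked triple is the true mixture; real trait means will of course be much noisier.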

I list below, the top 20 anti-correlations, and highlight a few in bold (third population as mixture of first two):


BURIAT ANDAMAN PHILLIPI -0.54005191575771
EGYPT BURIAT NORSE -0.490018084440697
ANDAMAN ANYANG HAINAN -0.48323680182295
BURIAT ANDAMAN HAINAN -0.480939028739347
EGYPT BURIAT ZALAVAR -0.476445836100052
ANDAMAN ANYANG PHILLIPI -0.457902384166767
DOGON BURIAT PHILLIPI -0.416551851781419
BERG EASTER_I ZALAVAR -0.378996437433417
AUSTRALI BURIAT ARIKARA -0.375898166338775
BURIAT EASTER_I MOKAPU -0.37169703838378
ESKIMO ANDAMAN S_JAPAN -0.366611599944932
ESKIMO PERU N_JAPAN -0.354535077363928
TOLAI BURIAT ARIKARA -0.348110323746154
BERG EGYPT ZALAVAR -0.344843098962355
DOGON ESKIMO GUAM -0.344577928128792
TOLAI BURIAT GUAM -0.338804214799388
ESKIMO PHILLIPI GUAM -0.336537918547276
DOGON BURIAT HAINAN -0.332635954428392
TASMANIA BURIAT ARIKARA -0.331301837598433
ESKIMO PERU S_JAPAN -0.330302035072489

Some interesting ones:
  • Philippines as Buriat+Andaman; this makes sense if Philippines is the result of admixture between an "East Asian" and a "Negrito" population
  • Norse as Egypt+Buriat; the Howells "Egypt" sample is "Mediterranean" in the classical sense. Perhaps this involves the same "East Eurasian"-like signal of admixture detected by genetic methods? Similar signal also occurs for Zalavar (from Hungary)
  • Hainan as Andaman+Anyang; south Chinese as Neolithic Chinese+"Negrito"-like old south Chinese?
  • Arikara as Buriat+Australian; admixture between "Australoid" Paleo-Indians and "Mongoloid" ones? or between 1st wave Indians and later ones (sensu Reich et al. 2012)?
  • Guam as Tolai+Buriat; admixture between "Papuan"-like and East Asian-like people in Polynesia?
As with "normal" f3-statistics, absence of a negative correlation does not reject admixture; this may be especially the case here, because phenotypes may be affected by strong natural selection during the post-admixture period.

And, there are some difficult-to-interpret cases (e.g., Philippines as Buriat+Dogon) which may point to limitations of the method; for example, the Dogon may act as a stand-in for the "equatorial"-like physique of the true "Andaman"-like mixing element. Presumably such limitations can be overcome by limiting the analysis to "selectively neutral" traits, rather than the whole suite of 57 Howells variables used here.

I certainly think that the idea ought to be investigated further: it might be redundant when genetic data are available, but may prove useful in the analysis of admixture when such data do not exist, e.g., in anthropological data of prehistoric specimens from hot climates where archaeogenetic evidence may never materialize. 

November 08, 2012

Okinawans and admixture in East Asia

I don't use the Pan-Asian SNP Consortium data much, but the upcoming paper on the Ainu spurred me to give it a look, because it contains an Okinawan sample (JP-RK). I calculated all f3-statistics that involved this sample, and report the lowest f3-statistic for all populations in this set that appear to be admixed:


Several of these are interesting:
  • A set of Indonesian populations (ID prefix; Lamaholot, Lembata, Kambera, Manggarai) are mixed with Melanesians (AX-ME)
  • A set of Indian populations appear admixed (IN prefix). It seems that the Okinawan sample acts as a surrogate for "Asian" ancestry 
  • Filipino populations PI-UI and PI-UN (listed as Visaya, Chabakano and Tagalog) are seen as mixtures of Okinawans and PI-UB (Ilocano)
  • The three Singaporean populations (SG prefix) are seen as mixtures with Caucasoids (the SG-ID Tamil Indians with CEU), with Sunda Indonesians (SG-ML Malay with ID-SU), with Zhuang Chinese (SG-CH Singaporean Chinese with CN-CC Zhuang, northern)
  • Tai Yuan from Thailand with Mlabri (TH-TU with TH-MA)
  • Taiwanese (Hakka TW-HA and Minnan TW-HB) with CN-CC (Zhuang) and Jiamao (CN-JI)
  • Cantonese CN-GA  with Jiamao (CN-JI)
  • Uygur CN-UG with West Eurasians (CEU)
And, of course JPT and JP-ML (Japanese) are seen as a mixture of Okinawans and Mandarin Han (CN-SH) and Beijing Chinese (CHB).

An interesting question is whether the mainland East Asian Yayoi element in Japanese is more similar to Han (as the f3 statistic suggests) or to Koreans. Interestingly, Koreans themselves (KR-KR) appear admixed between Han (CN-SH) and Okinawans. So, it seems that whatever this Okinawan element represents was not limited to the isles of Japan.

I also calculated the D-statistic:

D(CN-SH, KR-KR; JP-RK, YRI) = -0.0154 (Z = -14.779)

which suggests indeed, that there is an excess of "Okinawan"-like ancestry in Koreans compared to the Chinese. This is very interesting, because it suggests that similarity between Koreans and Japanese is due to a common substratum in the two populations. 
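For readers unfamiliar with the statistic, here is a hedged sketch of how a D-statistic can be computed from allele frequencies, using the form I recall from Patterson et al. (2012): the sum over SNPs of (w-x)(y-z), divided by the sum of (w+x-2wx)(y+z-2yz). The frequencies below are made up, the population labels in the comments are only analogies, and no standard error (normally obtained by block jackknife) is computed.

```python
def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP allele frequencies (assumed formula, see text)."""
    num = sum((wi - xi) * (yi - zi) for wi, xi, yi, zi in zip(w, x, y, z))
    den = sum((wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
              for wi, xi, yi, zi in zip(w, x, y, z))
    return num / den

# Made-up frequencies at four SNPs.
w = [0.2, 0.7, 0.5, 0.9]   # plays the role of CN-SH
y = [0.4, 0.3, 0.8, 0.5]   # plays the role of JP-RK
z = [0.1, 0.6, 0.4, 0.8]   # plays the role of the outgroup (YRI)
# X is W shifted 20% of the way towards Y, mimicking extra Y-like ancestry.
x = [0.8 * wi + 0.2 * yi for wi, yi in zip(w, y)]

d = d_statistic(w, x, y, z)
print(d < 0)
```

In this toy case the population carrying the extra Y-like ancestry sits in the second position, and the statistic comes out negative, which is the same direction as the reported value for Koreans.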

November 03, 2012

Admixture in the Chuvash and the Uygur

I took the Behar et al. (2010) sample of Chuvash, excluding GSM536731 which has atypical ancestry, and merged it with the Li et al. HGDP French_Basque and Dai. The latter two populations don't show evidence of admixture according to either the f3-statistic or ALDER (Loh et al. 2012). (I used a --geno 0.03 flag in PLINK and extracted the subset of SNPs included in the Rutgers recombination map for Illumina chips.)

The f3-statistic f3(Chuvashs_16; French_Basque, Dai) was equal to -0.011311 (Z=-31.308), indicative of admixture.

I then ran an ALDER analysis:


Test SUCCEEDS (z=4.85, p=1.2e-06) for Chuvashs_16 with {French_Basque, Dai} weights

DATA: success (warning: decay rates inconsistent) 1.2e-06 Chuvashs_16 French_Basque Dai 4.85 3.78 5.18 50% 40.27 +/- 5.80 0.00032377 +/- 0.00006676 28.21 +/- 7.47 0.00004231 +/- 0.00000962 47.08 +/- 4.53 0.00016628 +/- 0.00003212

DATA: test status p-value test pop ref A ref B 2-ref z-score 1-ref z-score A 1-ref z-score B max decay diff % 2-ref decay 2-ref amp_exp 1-ref decay A 1-ref amp_exp A 1-ref decay B 1-ref amp_exp B

This indicates that the Chuvash can be seen as admixed, but with inconsistent decays: the one with the French Basque (=28.21) is younger than the one with the Dai (=47.08). I think this makes fairly good sense, because the Chuvash are descended from people who came to Europe during the 1st millennium AD and must have later mixed with Europeans, perhaps with eastern Slavs as these made their way eastward during the 2nd millennium AD.

I then carried out similar analyses on the HGDP Uygur. As expected f3(Uygur; French_Basque, Dai) = -0.023917 (Z = -60.362), indicative of admixture. The ALDER analysis:


Test SUCCEEDS (z=6.85, p=7.4e-12) for Uygur with {French_Basque, Dai} weights

DATA: success 7.4e-12 Uygur French_Basque Dai 6.85 4.47 7.39 15% 20.56 +/- 3.00 0.00036760 +/- 0.00003660 22.59 +/- 5.06 0.00010920 +/- 0.00002025 19.46 +/- 2.64 0.00007864 +/- 0.00000710

DATA: test status p-value test pop ref A ref B 2-ref z-score 1-ref z-score A 1-ref z-score B max decay diff % 2-ref decay 2-ref amp_exp 1-ref decay A 1-ref amp_exp A 1-ref decay B 1-ref amp_exp B

suggests a very recent admixture on both the European and East Asian side. It seems fairly clear that whatever admixture was taking place in Central Asia, perhaps for thousands of years, the present-day Uygur were formed, at least in part, by a fairly recent, perhaps post-Mongol admixture event.

October 27, 2012

Inter-relationships between 'world' components

In a previous post I calculated f3-statistics between my K=7 and K=12 ancestral components. The basic idea is to discover which component A can be seen as a mixture of two other components, B and C, in which case (assuming A does not have excessive drift), we expect a negative f3(A; B, C) statistic.

As part of my analysis of the world dataset, I calculated f3-statistics for each of the K=3 to K=12, that is, for some K, I tried to see if one of the K inferred components could be seen as a mixture of the remaining K-1. It turns out that no negative f3 statistics appeared at all, and this suggests that the components inferred by ADMIXTURE at each K tend to form an "orthogonal" set that are not mixtures of each other.

More generally, we can calculate f3 statistics where A, B, and C are components inferred from any of the K=3 to K=12 runs. There is a total of 75 such components, and hence 75*(74 choose 2) = 202,575 such f3 statistics. Since calculating these would take a while (and would become intractable as K increases further), I decided to calculate pairwise f3 statistics, i.e., statistics where A, B, and C are constrained to be from successive K, K+1 runs. The significant results can be seen in the spreadsheet.
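The count quoted above is easy to reproduce; the short sketch below does so, with `n_f3_statistics` as a hypothetical helper name.

```python
from math import comb

def n_f3_statistics(n_populations):
    """Number of f3(A; B, C) statistics: a target times an unordered pair of others."""
    return n_populations * comb(n_populations - 1, 2)

n_components = sum(range(3, 13))  # components from the K = 3 .. 12 runs
print(n_components, n_f3_statistics(n_components))
```

The count grows roughly as the cube of the number of components, which is why restricting the comparison to successive K, K+1 runs keeps things manageable.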

It might be worthwhile to develop an automated way of using these statistics to guide us in the interpretation of ADMIXTURE components. But, they are useful, in any case, as a source of information.

For example, consider the following (the third column represents the mixed population):

Atlantic_Baltic_6/globe6_Z Near_East_6/globe6_Z European_5/globe5_Z -0.013911 0.000084 -166.457

This means that the European component at K=5 can be seen as a mix of the Atlantic_Baltic and Near_East components at K=6. So, this suggests that the European component can be seen as "secondary", the product of admixture. But:

European_5/globe5_Z Amerindian_5/globe5_Z Atlantic_Baltic_6/globe6_Z -0.003964 0.000175 -22.588

This indicates conversely that the Atlantic_Baltic component at K=6 can be seen as a mix of the European and Amerindian components at K=5.

It would be very interesting to use f-statistics to guide one in the choice of an "orthogonal" set of ancestral populations, or to summarize the relationships between them in tree or network form. One could potentially use my ADMIXTURE to TreeMix script to do something like this, although as K increases, there is a combinatorial explosion in the total number of components with a probable runtime slowdown/memory usage blowup which might render this approach unusable, at least for large K.

October 10, 2012

The Indo-European invasion of the Baltic

In some recent posts, I showed that South Asian populations (North Indian Brahmins, South Indian Brahmins) can be seen as mixtures of West Eurasian and South Indian populations, but also that West Eurasians (Bulgarians, Greeks, Armenians, and French) can be seen as mixtures of South Asian and Sardinian populations.

This may seem strange, but can be explained if we understand how f3-statistics and rolloff actually work. These methods do not require pure or unadmixed ancestral populations, but exploit allele frequency differences in the reference populations together with either (i) allele frequencies in the mixed population, in the case of f3-statistics, or (ii) admixture linkage disequilibrium in the mixed population, in the case of rolloff.

If a and b are allele frequencies in two ancestral populations A and B that mix, then:

  • The frequency of a will shift towards b if A experiences gene flow from B
  • The frequency of a will randomly shift if A experiences gene flow from an "outgroup" population
  • The frequency of a will shift towards b if A experiences gene flow from a third population that is geographically and genetically intermediate between A and B

An application to the Europe-South Asia cline

I took the following set of populations, and calculated all 1,365 possible f3-statistics:
"FIN30"         "Lithuanians"   "Russian"       "Pathan"        "Balochi"       "North_Kannadi" "Polish_D"      "Russian_D"     "Mixed_Slav_D"  "Bulgarian_D"   "Serb_D"        "Ukrainian_D"   "Belorussian"   "Bulgarians_Y"  "Ukranians_Y"
In the following table, I report the lowest Z-scores for each target population (third column). So, for example, Polish_D can be seen as a mixture of Lithuanians and Balochi. Only negative scores are indicative of admixture. I highlight in bold the significant negative scores (Z less than -3)


Lithuanians North_Kannadi FIN30 0.001606 0.000259 6.193 280043
Ukrainian_D Belorussian Lithuanians 0.00078 0.000299 2.614 268493
Lithuanians North_Kannadi Russian -0.002738 0.000248 -11.045 279965
North_Kannadi Polish_D Pathan -0.006959 0.000229 -30.344 280220
North_Kannadi Bulgarians_Y Balochi -0.003636 0.000246 -14.781 281604
Pathan Ukrainian_D North_Kannadi 0.033802 0.000623 54.237 271858
Lithuanians Balochi Polish_D -0.001171 0.000178 -6.581 279519
Lithuanians Pathan Russian_D -0.001829 0.000166 -11.026 280658
Lithuanians Pathan Mixed_Slav_D -0.001715 0.0002 -8.594 277635
Lithuanians Balochi Bulgarian_D -0.001247 0.000313 -3.979 272342
Lithuanians Balochi Serb_D -0.00091 0.000377 -2.416 270807
Lithuanians Balochi Ukrainian_D -0.002222 0.000358 -6.211 270399
Lithuanians Balochi Belorussian -0.000897 0.00027 -3.325 273076
Balochi Polish_D Bulgarians_Y -0.001198 0.000185 -6.481 279632
Lithuanians Balochi Ukranians_Y -0.001727 0.000187 -9.236 278677
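The "Z less than -3" filter on the table above can be sketched as follows. The column interpretation (two references, target, f3, standard error, Z-score, SNP count) is my reading of the rows rather than something the table states, and `parse_rows` is a hypothetical helper.

```python
TABLE = """\
Lithuanians North_Kannadi FIN30 0.001606 0.000259 6.193 280043
Ukrainian_D Belorussian Lithuanians 0.00078 0.000299 2.614 268493
Lithuanians North_Kannadi Russian -0.002738 0.000248 -11.045 279965
North_Kannadi Polish_D Pathan -0.006959 0.000229 -30.344 280220
North_Kannadi Bulgarians_Y Balochi -0.003636 0.000246 -14.781 281604
Pathan Ukrainian_D North_Kannadi 0.033802 0.000623 54.237 271858
Lithuanians Balochi Polish_D -0.001171 0.000178 -6.581 279519
Lithuanians Pathan Russian_D -0.001829 0.000166 -11.026 280658
Lithuanians Pathan Mixed_Slav_D -0.001715 0.0002 -8.594 277635
Lithuanians Balochi Bulgarian_D -0.001247 0.000313 -3.979 272342
Lithuanians Balochi Serb_D -0.00091 0.000377 -2.416 270807
Lithuanians Balochi Ukrainian_D -0.002222 0.000358 -6.211 270399
Lithuanians Balochi Belorussian -0.000897 0.00027 -3.325 273076
Balochi Polish_D Bulgarians_Y -0.001198 0.000185 -6.481 279632
Lithuanians Balochi Ukranians_Y -0.001727 0.000187 -9.236 278677
"""

def parse_rows(text):
    """Split each row into named fields (column meanings are assumed)."""
    rows = []
    for line in text.strip().splitlines():
        ref1, ref2, target, f3, se, z, n_snps = line.split()
        rows.append({"refs": (ref1, ref2), "target": target,
                     "f3": float(f3), "se": float(se),
                     "z": float(z), "n_snps": int(n_snps)})
    return rows

rows = parse_rows(TABLE)
significant = [r["target"] for r in rows if r["z"] < -3]
not_significant = [r["target"] for r in rows if r["z"] >= -3]
print(len(significant), not_significant)
```

Only the targets whose lowest Z-score falls below -3 count as showing evidence of admixture under this threshold; the rest are simply not confirmed, which is not the same as being shown to be unadmixed.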

It is clear, that what I have described holds here: European populations appear like mixtures of Lithuanians and South Asians; conversely, South Asian populations appear like mixtures of Europeans and North Kannadi.

This does not mean that the populations that appear unadmixed (FIN30, Lithuanians, North_Kannadi, and Serbs) are in fact so, for at least two reasons:
  1. The f3 statistic can confirm, but cannot reject, the presence of admixture; in particular, it fails to find real admixture in highly drifted populations
  2. The f3 statistic exploits allele frequency correlations between populations, but the North Kannadi and Lithuanians/Finns occupy opposite ends of the studied cline, so their lack of a signal of admixture may be due to the non-existence of populations that are even more unadmixed than themselves.
In the case of South Indians, we are completely sure that this is the case. Reich et al. (2009) managed to show this not because there are any unadmixed Ancestral South Indians (ASI) left, but because they exploited the existence of the Onge, an isolated group from the Andaman Islands that was a sister group to the ASI. So, we can be fairly sure that southern Indians themselves have West Eurasian-like admixture, even the ones that are at the end of the West Eurasia-South India cline on its southern end.

The problem is: there is no isolated group of unadmixed Europeans left in existence that might serve a similar proxy function as the Onge did for South Asians.

Enter Pickrell et al. (2012) to the rescue. In that paper, the authors studied admixture in the Khoe-San of South Africa. Now, many of the Khoe-San sub-groups appeared to be admixed, but the "Ju|'hoan North" population appeared to be at the "end of the cline": it's impossible to detect admixture in them using allele frequency differences, because, quite simply, there are no populations that are less admixed than them: they're as pure descendants of "Ancestral Bushman" as exist on the earth today.

But, the clever thing is that we don't have to detect admixture only using allele frequency differences, but can also use admixture LD, i.e., exploit the correlation between linkage disequilibrium (the co-inheritance of physically separated markers on a chromosome) and allele frequency differences between populations. Pickrell et al. were able to do this not by conjuring up a more unadmixed population than the "Ju|'hoan North" one available to them, but by splitting up that population, and using one half to find allele frequency differences, and the other half to detect admixture LD.

Admixture LD signal in Lithuanians

Using the aforementioned idea, I set out to see whether Lithuanians, who occupy the European end of the Europe-South Asia cline, present such a signal of admixture LD. I used the Lithuanian_D sample from the Dodecad Project and the Balochi HGDP sample as reference populations (to calculate allele frequency differences), and the Behar et al. (2010) Lithuanians for admixture LD. There were only ~300k SNPs usable in this set, but this was sufficient to detect the signal of admixture LD:
The admixture time estimate is 200.350 +/- 61.608 generations, or 5,810 +/- 1,790 years. This is not very precise, probably because of the small number of SNPs and individuals used, but it certainly points to the Neolithic-to-Bronze Age for the occurrence of this admixture. The date is certainly reminiscent of the expansion of the Kurgan culture out of eastern Europe, or, the later Corded Ware culture of northern Europe.
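For readers unfamiliar with how such a date is obtained: admixture LD decays roughly exponentially with genetic distance, at a rate equal to the number of generations since admixture. The sketch below fits that rate on synthetic, noiseless data. It assumes a pure exponential with no constant offset and fits by least squares on the log scale, which is a simplification and not rolloff's own fitting procedure; the function name and all numbers are illustrative.

```python
import math

def fit_decay_rate(distances_morgans, weighted_ld):
    """Estimate n in LD(d) = A * exp(-n * d) by least squares on log(LD)."""
    xs = list(distances_morgans)
    ys = [math.log(v) for v in weighted_ld]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The slope of log(LD) against distance is -n (generations since admixture).
    return -sxy / sxx

# Synthetic curve (assumption: noiseless, no offset), 0.5 cM to 20 cM in Morgans.
true_generations = 200.0
distances = [0.005 * k for k in range(1, 41)]
curve = [0.01 * math.exp(-true_generations * d) for d in distances]
estimated_generations = fit_decay_rate(distances, curve)
```

On real data the curve is noisy and the fit is usually restricted to a distance window, so the recovered rate carries a standard error like the one quoted above.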

So, it may well appear that at least some of the people participating in these groups of cultures, were indeed influenced by the Indo-Europeans as they expanded from their West Asian homeland. These intruders mixed with eastern Europeans who vacillated during the late Neolithic between a northern Europeoid pole akin to Mesolithic hunter gatherers from Gotland and Iberia, and a widely dispersed Sardinian-like population that is in evidence at least in the Sweden-Italian Alps-Bulgaria triangle. The gradual appearance of non-mtDNA U related lineages in Siberia and Ukraine is most likely related to this phenomenon.

It would seem that the Proto-Indo-Europeans mixed with different substrata in the four directions of their expansion: Sardinian-like people in southern Europe, Lithuanian-like people in northern Europe, South Indian-like people in South Asia, and East Eurasians in Siberia and east central Asia. Extant groups are descendants of divergent Neolithic population groups, brought closer together (genetically) because of variable admixture with the PIE population and its early offshoots.

Conclusion

There are mutual signals of admixture across a Europe-South Asia cline: Europeans appear to be mixed with South Asians, and South Asians appear to be mixed with Europeans. The simplest explanation for this pattern involves expansion of a third, geographically and genetically intermediate population that affected both Europe and South Asia. We can use the signal of admixture LD to prove that this expansion affected some of the most unadmixed populations in Europe (e.g., Lithuanians), just as it did the most unadmixed populations of India (e.g., Dravidians).

It will be interesting to use these techniques to study signals of admixture in other "end of the line" populations such as Sardinians, South Indians, etc.

UPDATE I (rolloff analysis of Poles):

I have carried out rolloff analysis of my 25-strong Polish_D sample using Lithuanians and Pathans as references:
The signal is fairly distinct, and corresponds to 149.296 +/- 38.783 generations or 4,330 +/- 1,120 years. I am guessing that either the different reference population (Pathans vs. Balochi), or, more likely, the increased number of target individuals (25 vs. 10) has contributed to the narrowing down of the uncertainty. It will be interesting to explore this signal further with more population pairs.

UPDATE II (rolloff analysis of Finns):

I have also used the 1000 Genomes Finnish sample (FIN) in a similar manner as Lithuanians, using 15 individuals to estimate allele frequency differences and another 15 for admixture LD, with the Pathans as a South Asian reference population. There is a clear signal of admixture:
This dates to 104.967 +/- 14.797 generations, or 3,040 +/- 430 years. Finland came under the influence of both Europeans (and likely Indo-Europeans) during the Bronze Age period (a mixture of Battle Axe with local Comb Ceramic seems to have occurred), as well as likely non-European (and likely Uralic) intrusions during the same time frame, as part of the Seima-Turbino phenomenon. It will be interesting to repeat this analysis with an East Eurasian reference population to isolate potential signals of admixture dating to either the Comb Ceramic or Seima-Turbino episodes of migration.

(Note; added Oct 14): I carried out rolloff analysis using Nganassans as suggested in the above paragraph here.

UPDATE III (rolloff analysis of Ukrainians):

I have used the Yunusbayev et al. sample of Ukrainians, and estimated its admixture time using Lithuanians and Balochi as reference populations:
The admixture time estimate is 191.078 +/- 35.079 generations, or 5,540 +/- 1,020 years. It seems very similar to that in Lithuanians, with a smaller standard error, perhaps on account of either the larger number of SNPs or larger number of individuals.

It is tempting to associate this admixture signal with the Maikop culture which appeared at around this time. Assuming that North_European/West_Asian (or Lithuanian-like and Balochi-like) gene pools existed north and south of the Pontic-Caspian-Caucasus set of geographical barriers, then the Maikop culture which shows links to both the early Transcaucasian culture and those of Eastern Europe would have been an ideal candidate region for the admixture picked up by rolloff to have taken place. There are, of course, other possibilities.

UPDATE IV (rolloff analysis of Lithuanians with Pathan reference):

I repeated the first analysis of this post, but this time, I used Pathans, rather than Balochi as a reference population:
The admixture time estimate of 217.501 +/- 51.170 generations, or 6,310 +/- 1,480 years, appears to be similar to the original estimate of 5,810 +/- 1,790 years, so it does not appear that the choice of Balochi or Pathan as a reference population much affects this result.

October 07, 2012

rolloff analysis of Bulgarians as Sardinian+Pathan

Continuing my rolloff experiments, I have taken the Yunusbayev et al. sample of Bulgarians. This is interesting because of the recent evidence of a Sardinian-like individual from Iron Age Bulgaria, and also as a complement to a similar analysis on the Greeks. Bulgarians are Slavic speaking, but their ethnogenesis owes a great deal to the Bulgars, adding another potential element of complication. However, the paucity of East Eurasian admixture in Bulgarians, together with their Slavic language, probably suggests that this element represented a small elite that did not have a substantial role in the genetic formation of the Bulgarian population.

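The top f3 statistics can be seen below (columns: the two source populations, the target population, f3, standard error, Z-score, number of SNPs):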

Kshatriya_M Sardinian Bulgarians_Y -0.003813 0.000295 -12.918 237507
Velamas_M Sardinian Bulgarians_Y -0.003783 0.000285 -13.287 238276
Piramalai_Kallars_M Sardinian Bulgarians_Y -0.003693 0.000306 -12.061 238106
Kanjars_M Sardinian Bulgarians_Y -0.003643 0.000298 -12.227 237838
GIH30 Sardinian Bulgarians_Y -0.003638 0.000259 -14.028 240548
North_Kannadi Sardinian Bulgarians_Y -0.00355 0.000317 -11.187 237882
Muslim_M Sardinian Bulgarians_Y -0.003542 0.000333 -10.632 236964
Chamar_M Sardinian Bulgarians_Y -0.003505 0.000303 -11.585 238882
INS30 Sardinian Bulgarians_Y -0.003467 0.000264 -13.153 240279
Dharkars_M Sardinian Bulgarians_Y -0.003452 0.000309 -11.155 238211
Brahmins_from_Uttar_Pradesh_M Sardinian Bulgarians_Y -0.003448 0.000278 -12.42 238041
Indian_D Sardinian Bulgarians_Y -0.003411 0.000256 -13.308 241225
Iyer_D Sardinian Bulgarians_Y -0.003364 0.000291 -11.568 237509
Jatt_D Sardinian Bulgarians_Y -0.003327 0.000289 -11.513 236735
Pathan Sardinian Bulgarians_Y -0.003212 0.000239 -13.444 240969
Iyengar_D Sardinian Bulgarians_Y -0.003209 0.000308 -10.416 236840
Dusadh_M Sardinian Bulgarians_Y -0.003181 0.000313 -10.172 237512
Sindhi Sardinian Bulgarians_Y -0.003094 0.000239 -12.919 241268
Balochi Sardinian Bulgarians_Y -0.002804 0.00024 -11.686 240924
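Rows like these are easy to work with programmatically. The sketch below parses three of the rows above and recomputes the Z-score as f3 divided by its standard error. The column interpretation (two sources, target, f3, standard error, Z, SNP count) is an assumption read off the rows themselves, and the function name is hypothetical.

```python
def parse_f3_row(line):
    """Parse one whitespace-separated row: source1 source2 target f3 se z snps."""
    source1, source2, target, f3, se, z, snps = line.split()
    return {
        "sources": (source1, source2),
        "target": target,
        "f3": float(f3),
        "se": float(se),
        "z": float(z),
        "snps": int(snps),
    }

# Three rows copied from the table above.
rows = [
    "Kshatriya_M Sardinian Bulgarians_Y -0.003813 0.000295 -12.918 237507",
    "GIH30 Sardinian Bulgarians_Y -0.003638 0.000259 -14.028 240548",
    "Balochi Sardinian Bulgarians_Y -0.002804 0.00024 -11.686 240924",
]
records = [parse_f3_row(r) for r in rows]
for rec in records:
    # Z is the statistic divided by its standard error; small differences
    # from the printed Z come from rounding in the printed f3 and s.e.
    rec["z_recomputed"] = rec["f3"] / rec["se"]

# The most significant signal is the most negative Z, not the most negative f3.
most_significant = min(records, key=lambda r: r["z"])
```

Sorting by Z rather than by f3 matters here: the GIH30 row has a smaller f3 in absolute terms than the Kshatriya_M row, but a tighter standard error.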


To maximize the number of SNPs and number of individuals, I used the Sardinian+Pathan pair as reference populations. 509,395 SNPs were used for this experiment. The exponential fit can be seen below:
There was a technical issue with the jackknife which I am currently investigating, but the mean time of the admixture was estimated at 126.83004 generations, or 3,680 years. This is similar to the value of 3,850 years I obtained on the Greek sample.

If this date is accepted, then the interesting issue is why an individual from Bulgaria was Sardinian-like during the Iron Age. Possibly, either this individual was Sardinian-like in the broad sense, despite having minority West Asian admixture, or a few centuries after the admixture event, there was still an uneven distribution of the constituent elements, with most individuals still predominantly Sardinian-like. The indigenous element was probably the most numerous, so only part of it would have had the opportunity to admix with the intrusive West Asian-like population, and this influence would have spread to the population at large over time.

In any case, this evidence, such as it is, appears consistent with my idea about a Bronze Age invasion of Europe from Asia.

Naturally, only a broad sampling of ancient DNA variation from the Balkans, perhaps targeting different sites, cultures, times, social status, and physical types will be sufficient to track the early appearance of an intrusive population.

October 03, 2012

rolloff analysis of South Indian Brahmins as Armenian+Chamar

The first analysis of this population showed that there were negative f3(Brahmin; X, Y) signals when X was any of a variety of West European, Balkan, and West Asian populations, and Y either the Chamar or North Kannadi. In the first analysis I used Orcadians and North Kannadi. I have now carried out a new rolloff analysis on 470,559 SNPs, using Armenians_Y and Chamar_M as the reference populations.

The exponential fit can be seen below.
The admixture date is 142.814 +/- 15.010 generations, or 4,140 +/- 440 years, which seems to correspond quite well with commonly accepted dates for the formation of Indo-Iranian.

I have previously observed that:

These patterns can be well-explained, I believe, if we accept that Indo-Iranians are partially descended not only from the early Proto-Indo-Europeans of the Near East, but also from a second element that had conceivably "South Asian" affiliations. The most likely candidate for the "second element" is the population of the Bactria Margiana Archaeological Complex (BMAC). The rise and demise of the BMAC fits well with the relative shallowness of the Indo-Iranian language family and its 2nd millennium BC breakup, and has been assigned an Indo-Iranian identity on other grounds by its excavator. As climate change led to the decline and abandonment of BMAC sites, its population must have spread outward: to the Iranian plateau, the steppe, and into South Asia, reinforcing the linguistic differentiation that must have already begun over the extensive territory of the complex.
Quite possibly, as the West Asian element began mixing with the Sardinian-like population in Greece, another branch of the Indo-Europeans made its appearance east of the Caspian, in the territory of the BMAC, admixing with South Asian-like populations. Thus, it might seem that the Graeco-Aryan clade of Indo-European broke down during the Bronze Age, with one branch heading off to the Balkans, and another to the east. 

This scenario would also explain how the likely J2-bearing population associated with the earliest Proto-Indo-Europeans may have acquired the contrasting pattern I have previously described: the western (cis-Caspian) population would have admixed with R1b-bearers who occupy the "small arc" west and south of the Caspian, while the eastern (trans-Caspian) populations would have admixed with R1a-bearers who occupy the "large arc" in the flatlands north and east of the Caspian. It would also explain how the "western" branch (Graeco-Armenian) would have picked up Sardinian-like "Atlantic_Med" admixture, which is absent in the "eastern" Indo-Iranian branch.

At the same time, this scenario would explain the lack of "North European" admixture in the "western" branch (since this was shielded by the Caucasus and Black Sea from the northern Europeoids who may have lived north of these barriers), and its presence in the "eastern" branch (since the BMAC agriculturalists were in contact with presumably northern Europeoid groups inhabiting the steppelands, unhindered by any major physical barriers). (The relative absence of this admixture in the Graeco-Armenian branch may be advanced on the strength of its absence in Armenians, the evidence of a Sardinian-like Iron Age individual from Bulgaria, and the historical-era timing of admixture for the Greek population.)

It would be interesting to carry out similar experiments on Iranian groups, to see if they, too, present a similar pattern of admixture.

rolloff analysis of Greeks as Sardinian+Brahui

In a previous experiment, I showed that Greeks can be seen as composites of two alternate sets: either a Sardinia-South Asia mix, or a North European-Near East mix. I first studied the latter, which provided a historical-period estimate for the admixture time. I now turn to the former, and use Sardinians and Brahui as parental populations. This complements previous analyses on Armenians and French using similar reference populations. Since I used the Balochi and Burusho in the two previous experiments, I decided this time to use the Brahui, which is the third population which presents a significant f3(Greek; Sardinian, Brahui) signal along the Europe-West Asia axis.

473,174 SNPs were used in total. The exponential fit can be seen below.

The jackknife estimate is 132.890 +/- 35.527 generations, or 3,850 +/- 1,030 years. This spans the entirety of the Helladic period, with the mean being close to two often-cited dates for the "coming of the Greeks", corresponding to the destructions at the EHIII/MH boundary (c. 2100BC), and the spread of "Minyan ware" at c. 1900BC, although an earlier or later date is certainly possible.

(An alternative interpretation would relate the earliest Greeks to a Sardinian-like European population and the Asian component to a Luwian-like Anatolian population responsible for the well-known -nth and -ss toponyms in the Aegean.)

A signal of West Asian admixture during the Bronze Age is certainly consistent with my musings on the spread of metallurgy from the east during this time.

September 30, 2012

Armenians as Phrygian colonists, or, rolloff analysis of Armenians as a mixture of Sardinians+Balochi

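I analyze the Yunusbayev et al. Armenians_Y sample in a similar manner as the South Indian Brahmins. The 30 lowest f3 statistics are (columns: the two source populations, the target population, f3, standard error, Z-score, number of SNPs):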

Sardinian Velamas_M Armenians_15_Y -0.00349 0.000264 -13.23 239451
Sardinian Piramalai_Kallars_M Armenians_15_Y -0.003213 0.00028 -11.484 239389
Sardinian GIH30 Armenians_15_Y -0.002983 0.00023 -12.986 241310
Balochi Sardinian Armenians_15_Y -0.002837 0.000193 -14.681 241698
Sardinian Sindhi Armenians_15_Y -0.002794 0.000203 -13.757 241928
Sardinian Muslim_M Armenians_15_Y -0.002761 0.000295 -9.351 238639
Indian_D Sardinian Armenians_15_Y -0.002743 0.000224 -12.226 241916
Sardinian Kanjars_M Armenians_15_Y -0.002727 0.000281 -9.722 239240
Iyer_D Sardinian Armenians_15_Y -0.002718 0.000263 -10.322 238943
Brahui Sardinian Armenians_15_Y -0.002715 0.000196 -13.882 241885
Sardinian Dusadh_M Armenians_15_Y -0.002666 0.000281 -9.502 238994
Sardinian INS30 Armenians_15_Y -0.00265 0.000237 -11.162 240965
Iyengar_D Sardinian Armenians_15_Y -0.002624 0.000297 -8.847 238564
Sardinian Dharkars_M Armenians_15_Y -0.002501 0.000267 -9.359 239505
Sardinian North_Kannadi Armenians_15_Y -0.002463 0.000288 -8.545 239278
Sardinian Chamar_M Armenians_15_Y -0.002445 0.000275 -8.904 240102
Sardinian Kshatriya_M Armenians_15_Y -0.002372 0.000267 -8.897 239047
Pathan Sardinian Armenians_15_Y -0.00224 0.000199 -11.264 241759
Sardinian Brahmins_from_Uttar_Pradesh_M Armenians_15_Y -0.002189 0.00025 -8.774 239395
Jatt_D Sardinian Armenians_15_Y -0.001806 0.000273 -6.608 238465
Cypriots Kanjars_M Armenians_15_Y -0.001699 0.00026 -6.547 238237
Cypriots Velamas_M Armenians_15_Y -0.001642 0.000275 -5.965 238392
Cypriots Muslim_M Armenians_15_Y -0.001618 0.000279 -5.798 237762
Cypriots Dusadh_M Armenians_15_Y -0.001611 0.000279 -5.779 238096
GIH30 Cypriots Armenians_15_Y -0.001608 0.000223 -7.216 239819
Iyer_D Cypriots Armenians_15_Y -0.001562 0.000251 -6.217 238012
Sindhi Cypriots Armenians_15_Y -0.001544 0.000209 -7.383 240276
Cypriots North_Kannadi Armenians_15_Y -0.001464 0.000276 -5.298 238273
Cypriots Kshatriya_M Armenians_15_Y -0.001438 0.00026 -5.534 238076

I carried out rolloff analysis using the Balochi and Sardinians as references, for a total of 510,844 SNPs. Note that the Burusho were not used in this experiment, because they were culled for having more than 5% East Eurasian admixture, as per the procedure I follow.

The Balochi are very similar to the Burusho otherwise, and this also gives me the opportunity to see a Sardinian+Balochi population pair to complement a previous analysis of French as Sardinian+Burusho, which presented an f3 signal of quite similar intensity. The exponential fit is seen below.

The jackknife gives an age estimate of 113.194 +/- 14.674 generations, or 3,280 +/- 430 years, assuming a generation length of 29 years.
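The conversion from generations to years is plain arithmetic, sketched below. The 29-year generation length is the one stated above; rounding to the nearest ten years is an assumption inferred from the figures quoted, and the function name is hypothetical.

```python
GENERATION_YEARS = 29  # generation length assumed throughout these posts

def generations_to_years(generations, std_err, generation_years=GENERATION_YEARS):
    """Convert a date and its standard error from generations to years."""
    # Rounding to the nearest ten years is an assumption about presentation.
    years = round(generations * generation_years, -1)
    years_se = round(std_err * generation_years, -1)
    return int(years), int(years_se)

armenian_date = generations_to_years(113.194, 14.674)  # (3280, 430)
```

A different generation length rescales both the date and its standard error by the same factor, so the relative uncertainty is unaffected.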

I had somewhat expected the Armenians to show a more recent signal of admixture than the French, as they lived much closer to the boundary of Europe and Asia, and may have had more opportunity to admix between Sardinian-like populations of Europe and "West Asian"-like populations of Asia.

But, the inferred date also raises another possibility. Herodotus says of the Armenians who were part of the army of King Xerxes:
"the Armenians were equipped like Phrygians, being Phrygian colonists" (7.73)
Now, the Phrygians became masters of Central Anatolia during the tumultuous events near the end of the Bronze Age (12th century BC), following the collapse of the Hittite Empire. And, their ancestral homeland was in Thrace. And, there is fairly good evidence that Armenian is the closest language related to Greek within the Indo-European language family. And, we have some tantalising evidence that even during the Iron Age, the population of Thrace was Sardinian-like. And, the Armenians do contrast with their Caucasian neighbors in possessing ~10% of the Sardinian-like Atlantic_Med component that South and Northeast Caucasians lack.

All of the above combine to make a pretty compelling story. Could it be that Armenians preserve a legacy of admixture between a linguistically Indo-European speaking, genetically Sardinian-like population, which arrived in Asia Minor from the Balkans at the end of the Bronze Age, finally settling in the Armenian Highlands, and mixing with the local people they encountered?

The plot thickens. And, this is, certainly, a question that can be answered by ancient DNA research, e.g., by comparing the genomes of historical Phrygians and Armenians with those from Hittite-era, or earlier Anatolians.

September 29, 2012

rolloff analysis of South Indian Brahmins

Populations with 5+ individuals and with no more than 5% membership in the African or East Eurasian components at K=7 were retained. South Indian Brahmins were combined from the Iyer_D and Iyengar_D datasets of the Dodecad Project. Other populations were from the current version of the Old World dataset used for the K7b/K12b calculators.
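The retention rule can be written down as a small filter. This is a sketch under one reading of the rule, namely that the 5% cap applies to the African and East Eurasian components separately; the population names and numbers are made up for illustration.

```python
def retain_population(n_individuals, african, east_eurasian,
                      min_individuals=5, max_fraction=0.05):
    """Keep populations with enough individuals and little outside admixture."""
    # Assumption: the 5% threshold is applied to each component on its own.
    return (
        n_individuals >= min_individuals
        and african <= max_fraction
        and east_eurasian <= max_fraction
    )

# Hypothetical populations: (individuals, African fraction, East Eurasian fraction).
candidates = {
    "PopA": (12, 0.01, 0.02),
    "PopB": (4, 0.00, 0.01),   # too few individuals
    "PopC": (20, 0.00, 0.09),  # too much East Eurasian admixture
}
retained = [name for name, vals in candidates.items() if retain_population(*vals)]
```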

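The lowest f3 statistics were the following (columns: the two source populations, the target population, f3, standard error, Z-score, number of SNPs):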

English_D North_Kannadi South_Indian_Brahmin -0.006119 0.000339 -18.06 236533
North_Kannadi Orkney_1KG South_Indian_Brahmin -0.005987 0.000311 -19.223 237162
Irish_D North_Kannadi South_Indian_Brahmin -0.005958 0.000317 -18.817 237023
British_Isles_D Chamar_M South_Indian_Brahmin -0.005931 0.000334 -17.757 237112
Dutch_D North_Kannadi South_Indian_Brahmin -0.005914 0.00034 -17.416 236008
British_Isles_D North_Kannadi South_Indian_Brahmin -0.005914 0.000342 -17.273 236064
North_Kannadi Baleares_1KG South_Indian_Brahmin -0.005878 0.000367 -16.012 235743
German_D North_Kannadi South_Indian_Brahmin -0.005838 0.000306 -19.048 237156
Georgian_D North_Kannadi South_Indian_Brahmin -0.005827 0.000366 -15.924 235091
CEU30 North_Kannadi South_Indian_Brahmin -0.005818 0.000315 -18.461 237266
Greek_D North_Kannadi South_Indian_Brahmin -0.005812 0.000314 -18.527 237028
Orcadian North_Kannadi South_Indian_Brahmin -0.005807 0.000329 -17.649 236669
Austrian_D North_Kannadi South_Indian_Brahmin -0.005803 0.000371 -15.659 235187
North_Kannadi Pais_Vasco_1KG South_Indian_Brahmin -0.005796 0.000337 -17.193 235754
British_D North_Kannadi South_Indian_Brahmin -0.005794 0.000336 -17.243 236613
Mixed_Germanic_D North_Kannadi South_Indian_Brahmin -0.005778 0.000344 -16.794 236047
French North_Kannadi South_Indian_Brahmin -0.005773 0.000312 -18.493 237461
North_Kannadi Cornwall_1KG South_Indian_Brahmin -0.00577 0.000306 -18.846 237375
Armenians Chamar_M South_Indian_Brahmin -0.005759 0.000279 -20.674 238398
Orkney_1KG Chamar_M South_Indian_Brahmin -0.005757 0.000287 -20.042 238463
Hungarians North_Kannadi South_Indian_Brahmin -0.005756 0.000311 -18.482 237196
Serb_D North_Kannadi South_Indian_Brahmin -0.005751 0.000371 -15.499 235625
Iraq_Jews Chamar_M South_Indian_Brahmin -0.005742 0.000303 -18.968 237341
Greek_D Chamar_M South_Indian_Brahmin -0.005736 0.000293 -19.57 238374
North_Kannadi Aragon_1KG South_Indian_Brahmin -0.005733 0.000347 -16.526 235610
Armenians_15_Y Chamar_M South_Indian_Brahmin -0.005723 0.000293 -19.5 237884
German_D Chamar_M South_Indian_Brahmin -0.005719 0.000294 -19.483 238514
Orcadian Chamar_M South_Indian_Brahmin -0.005701 0.000305 -18.707 237880
French_Basque North_Kannadi South_Indian_Brahmin -0.005698 0.000333 -17.104 237119

Links between Western Europe and South Asia have turned up in many of the Project's analyses (e.g., the West European in Dodecad v3, or the Gedrosia component in K12b, or even earlier the "Dagestan" component in both West Europe and South Asia). 

Of course, we don't have to imagine a migration all the way from the British Isles to South Asia, any more than we have to imagine a migration from South America to Europe to explain the strong negative f3(European; Karitiana, Sardinian) signals previously detected. I don't know what to make of this tendency to minimize f3 for the "longest possible clines". 

In any case, I carried out rolloff analysis using Orcadians and North_Kannadi. This is not the strongest signal, but it is very close to it, and also has the twin advantages of involving public data (so the analysis can be repeated) and a large number of SNPs, which were 466,644 in total. The fit can be seen below:

This appears to be excellent visually. The inferred date from the jackknife is 110.155 +/- 11.345 generations, or 3,190 +/- 330 years, assuming as always a generation length of 29 years.

The obvious candidate for this admixture signal is of course the arrival of the Indo-Aryans into South Asia. 

September 24, 2012

rolloff analysis of French as a mixture of Sardinian+Burusho

I obtain f3(French; Sardinian, Burusho) = -0.002652 (Z=-13.541) on the basis of 446,917 SNPs. This is the strongest signal of admixture in the French that involves a population that is high on the "West_Asian" component whose influence I have been investigating.

I thus carried out rolloff analysis using the French as a target population and the Sardinians and Burusho as reference populations. The exponential fit can be seen below:

The jackknife gives 239.556 +/- 50.553 generations for this admixture, which corresponds (assuming a generation length of 29 years) to 6,950 +/- 1,470 years.
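The "+/-" figures in these analyses come from a jackknife, in which the estimate is recomputed with part of the data left out each time. The sketch below shows the plain delete-one version, assuming equally weighted blocks; the actual tools may use a weighted variant, and the leave-one-out dates here are invented for illustration.

```python
import math

def jackknife_se(leave_one_out_estimates):
    """Mean of delete-one estimates and the unweighted jackknife standard error."""
    n = len(leave_one_out_estimates)
    mean = sum(leave_one_out_estimates) / n
    # Delete-one jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((e - mean) ** 2 for e in leave_one_out_estimates)
    return mean, math.sqrt(variance)

# Hypothetical dates (in generations), each obtained with one block left out.
loo_dates = [236.0, 241.5, 238.2, 244.1, 239.0]
mean_date, date_se = jackknife_se(loo_dates)
```

The (n - 1) / n factor is what distinguishes this from an ordinary sample variance: delete-one estimates are highly correlated, so their raw spread understates the true uncertainty.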

Analysis of autosomal DNA from the Tyrolean Iceman and a Neolithic TRB farmer from Sweden has revealed an absence of the West Asian ancestral component and a Sardinian-like Neolithic population c. 5ka in Europe. This population may have extended at least to the Balkans in space and down to the Iron Age in time.

In my opinion, the simplest explanation for the evidence is that the admixture picked up by rolloff took place in West Asia itself c. 7ka, and then this population began its movement into Europe at some post-5ka time period.

Importantly, the K=12 Caucasus component appears as a mixture of the K=7 West_Asian and Southern components. The former (West_Asian) is the most important one in the Burusho, and the latter (Southern) is the most important one in Sardinians.

European Neolithic farmers, of presumably West Asian origin, possessed only Y-haplogroup G2a out of the wide variety of haplogroups found in West Asia today. They also lacked the West_Asian component which is modal in West Asia today. There is also physical anthropological evidence from Greece and Anatolia, for an introduction of new population elements during the Bronze Age.

These facts combine to make me believe that there were population movements across West Asia which preceded the Indo-European invasion of Europe during late pre-history. That event is then best seen as an extension of a broader Eurasian phenomenon that affected substantially both the western parts of Asia and Europe.

Taking all the evidence into account, I hypothesize that:
  • a "Southern"/"Atlantic_Med"/Sardinian-like population substratum existed in West Asia, and this spawned the early European Neolithic.
  • a new "West_Asian"/Burusho-like population arrived from the east, perhaps associated with the Halaf/Hassuna cultures, or from some other unknown center of dispersal in the Transcaucasus or Iran. Mobility may have been encouraged post-8.2 kiloyear event.
  • these two elements began mixing ~7 thousand years ago in West Asia
  • the admixed population expanded at some post-5ka period into Western Europe.
This scenario is also compatible with the lack of "Southern"/"Atlantic_Med" influences in the Indian subcontinent and Central Asia: if the West_Asian component originated to the east of the Sardinian-like population then it would not have the opportunity to incorporate "Southern" elements in its eastern expansion.

(Obviously, more rolloff analyses are needed to study these ideas; the current one took about 3 days, which was a little faster than I expected.)

Related (?): Is Burushaski Indo-European?

Image credit: Don Perrault (source)

September 17, 2012

Quantifying Karitiana-like admixture in Eurasia

Using the same dataset as in a previous experiment, I decided to calculate the extent of East Eurasian-like admixture in Eurasia.

First, I identified, using qp3Pop, a set of populations with significantly negative f3(Sardinian, Karitiana, Target) statistics:

This is actually a very helpful figure, as it shows how the f3 signal of admixture becomes weaker for more drifted populations (e.g., Finns) even if they have more of the investigated admixture than others (e.g., French).

It also shows that most West Eurasian populations appear admixed between Sardinians and Karitiana, whereas most East Eurasian ones (see spreadsheet) do not appear to be so, at least on the basis of the f3 test.

I next used qpF4Ratio to estimate the extent of this admixture. This depends on the following topology (Fig. 4 of Patterson et al. 2012):


I used: A=Papuan, B=Karitiana, C=Sardinian, and O=San, with X= any of the different investigated populations.

Note that this topology does not really hold for all X target populations whose admixture we are investigating. In particular, some populations have African admixture, hence O=San is not really an outgroup for them.
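To make the estimator concrete, here is a sketch of the f4 ratio under the topology above, written as f4(A, O; X, C) / f4(A, O; B, C) following Patterson et al. (2012). The allele frequencies are random numbers rather than real data, X is built as an exact mixture of B and C so that the ratio recovers the mixing proportion by construction, and A is made to correlate with B only so that the denominator is safely non-zero. All names and values are illustrative.

```python
import random

def f4(p_a, p_o, p_x, p_c):
    """f4(A, O; X, C): mean over SNPs of (a - o) * (x - c)."""
    return sum((a - o) * (x - c)
               for a, o, x, c in zip(p_a, p_o, p_x, p_c)) / len(p_a)

def f4_ratio(p_a, p_o, p_x, p_b, p_c):
    """Estimate the B-like ancestry fraction of X."""
    return f4(p_a, p_o, p_x, p_c) / f4(p_a, p_o, p_b, p_c)

rng = random.Random(1)
n_snps = 1000
freq_b = [rng.random() for _ in range(n_snps)]
freq_c = [rng.random() for _ in range(n_snps)]
freq_o = [rng.random() for _ in range(n_snps)]
# A shares half of its frequency with B (toy stand-in for "A is related to B").
freq_a = [0.5 * b + 0.5 * rng.random() for b in freq_b]

true_alpha = 0.1  # hypothetical B-like fraction in X
freq_x = [true_alpha * b + (1 - true_alpha) * c
          for b, c in zip(freq_b, freq_c)]
alpha_hat = f4_ratio(freq_a, freq_o, freq_x, freq_b, freq_c)
```

If X also carried ancestry related to the outgroup O, the numerator would pick up an extra term that the ratio cannot separate from the B-like signal, which is the failure mode described in the note above.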

In the following, you can see the admixture proportion estimates using the F4 ratio test:

It should be obvious now how admixture estimates using the f4 ratio method depend on an appropriate outgroup. The f3-statistics indicate that all the above-listed populations are admixed between a Sardinian-like and a Karitiana-like population. But, the estimate of admixture based on the f4 ratio becomes negative, because f4(Papuan, San; X, Sardinian) is negative in populations where X has African admixture.

So, the estimated Karitiana-like admixture of populations such as Spanish_D (est. 1.2%) is lower than the actual value, because Spanish_D includes African admixture. For the Portuguese_D (est. -3.3%), where African admixture is even more significant, the effect is even stronger, and a nonsensical negative admixture score appears.

The converse took place when the f4 ratio method was applied by Moorjani et al. (2011). In that case, negative f4 scores with CEU as a parental population were taken as evidence of African admixture. But, since CEU has Amerindian-like admixture, the estimates of African admixture in that paper were higher than the actual values.

It will be interesting to derive corrected African admixture estimates after taking into account that CEU have Amerindian-like admixture, and, conversely, corrected Karitiana-like admixture estimates after taking into account African admixture in some populations.

In any case, the data used for the above plots can be found in the spreadsheet, together with the list of all considered populations.

September 16, 2012

Greeks on the crossroads of Eurasia

I used the qp3Pop program of ADMIXTOOLS which implements a 3-population test of admixture (Patterson et al. 2012), using Greek_D as a target population and any pair of other populations as possible parental populations. My dataset is similar to that used for the K7b/K12b, but includes all the new data that has accumulated since those tools were released. The number of SNPs is 186,241, and I have also limited the analysis to 115 populations with 10+ individuals.

For more details on the f3 statistic, you should really read the linked paper. Briefly, you should remember the following:

  1. Significant negative f3 statistics indicate that the target population and the two parentals do not form a simple tree, but are related in a complex way
  2. Positive f3 statistics are consistent with either a simple tree or a history of admixture followed by genetic drift
  3. It is not necessary for the parental populations to be themselves unadmixed
The full set of results can be seen in the spreadsheet.
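The first point in the list above can be illustrated with a toy calculation. The sketch computes a naive f3 as the average of (c - a)(c - b) over SNPs, where c is the target frequency and a, b are the source frequencies. It omits the finite-sample correction that real implementations apply, uses random frequencies in place of real data, and builds the target as an exact 50/50 mixture with no subsequent drift; names are hypothetical.

```python
import random

def f3(p_target, p_src1, p_src2):
    """Naive f3(target; src1, src2): mean over SNPs of (c - a) * (c - b)."""
    return sum((c - a) * (c - b)
               for c, a, b in zip(p_target, p_src1, p_src2)) / len(p_target)

rng = random.Random(0)
n_snps = 1000
src1 = [rng.random() for _ in range(n_snps)]
src2 = [rng.random() for _ in range(n_snps)]
# Target formed as an exact 50/50 mixture of the two sources (no drift).
admixed = [0.5 * a + 0.5 * b for a, b in zip(src1, src2)]

f3_admixed_as_target = f3(admixed, src1, src2)  # negative: target lies between sources
f3_source_as_target = f3(src1, admixed, src2)   # positive: src1 is not intermediate
```

Adding drift to the target after the mixture adds a positive term to its f3, which is why a positive value cannot rule out admixture (the second point in the list).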

Below you can see the 30 most negative f3 statistics.


The first thing that immediately jumps out is that Sardinians participate in most of these comparisons. And, given the mounting evidence for a Sardinian-like population in prehistoric Europe, including the Balkans, it does appear likely that there is a Sardinian-like element in the ancestry of Greeks.

A different element that is paired up with Sardinians in the most negative f3 statistics consists of a variety of South Asian populations; these comparisons appear stronger than the Sardinian+East Asian ones. This dataset does not include Amerindian populations, for which the effect was strongest in the Patterson et al. paper. I suspect that South Asian populations give out stronger f3 statistics than East Asian ones, because South Asians are composed of a West Asian-like element and an Ancestral South Indian element which is related to East Eurasians. So, South Asians appear as a parental population on account of both the East Eurasian-shift effect observed by Patterson et al., as well as the West Asian-shift effect I've described in a few posts such as this.

A third set of significant comparisons involve Northern Europeans vs. Near Eastern populations, with extrema in the Baltic area and Arabia, which seems to correspond quite well with what I've called the "West Eurasian cline", with populations of northeastern Europe likely possessing a higher degree of continuity with the Mesolithic hunter-gatherers.

Overall, this exercise has convinced me that 2-way admixture models do not capture the complexity of Eurasian prehistory. The Greek population appears intermediate on a number of different clines, the two most important ones being between Sardinia and far Asia and between the Baltic and the Near East.

I will probably repeat this experiment with other populations from this set. I will also probably try to get some admixture dates using as many SNPs as possible, although rolloff appears to have fairly long running times, so I am not sure how practical that will be.

September 14, 2012

Inter-relationships between Dodecad K7b and K12b components

In a previous post I used leave-one-out to show how components inferred by ADMIXTURE could be related to each other.

One of the "problems" with ADMIXTURE and related analyses is that as the number of components K increases, additional components are formed by merging and/or splitting of components at lower K.

But, it turns out that thanks to the supervised mode, we can look at how components at different K are related to each other: we can treat, e.g., the K=12 ancestral populations as test data with the K=7 ancestral populations as references and vice versa.

I carried out precisely this procedure for my K7b/K12b components.

Below are the K12b components expressed as mixtures of the K7b ones:

And, the K7b ones expressed as mixtures of the K12b ones:


I have also calculated f3 statistics (using threepop) for all population triples using the K7b/K12b calculators. Most of the mixes inferred by ADMIXTURE appear significant, although I didn't hand-check each one. I report the significant ones below:

Populations (A;B,C) f3(A; B, C) s.e. Z-score

Atlantic_Baltic_K7b;Atlantic_Med_K12b,North_European_K12b -0.00287483 2.64051e-05 -108.874
African_K7b;East_African_K12b,Sub_Saharan_K12b -0.00241502 2.3253e-05 -103.858
East_Asian_K7b;East_Asian_K12b,Southeast_Asian_K12b -0.00218574 2.17614e-05 -100.441
Caucasus_K12b;West_Asian_K7b,Southern_K7b -0.00317634 4.12205e-05 -77.0573
West_Asian_K7b;Gedrosia_K12b,Caucasus_K12b -0.00209044 3.14454e-05 -66.4785
Siberian_K7b;East_Asian_K12b,Siberian_K12b -0.00166911 2.60228e-05 -64.1403
South_Asian_K7b;Gedrosia_K12b,South_Asian_K12b -0.00195015 3.35149e-05 -58.1876
East_Asian_K12b;East_Asian_K7b,Siberian_K7b -0.00191747 3.49244e-05 -54.9034
Atlantic_Baltic_K7b;Southern_K7b,North_European_K12b -0.00181747 3.63948e-05 -49.9377
East_African_K12b;Southern_K7b,African_K7b -0.00412496 0.000101701 -40.5598
Atlantic_Med_K12b;Southern_K7b,Atlantic_Baltic_K7b -0.00138679 3.68608e-05 -37.6222
East_Asian_K7b;Southeast_Asian_K12b,Siberian_K7b -0.00127133 3.92998e-05 -32.3495
Northwest_African_K12b;Southern_K7b,Sub_Saharan_K12b -0.00272013 0.000110067 -24.7133
Northwest_African_K12b;Southern_K7b,African_K7b -0.00255262 0.000107527 -23.7394
East_African_K12b;African_K7b,Atlantic_Med_K12b -0.00237833 0.000107306 -22.1639
East_African_K12b;African_K7b,Caucasus_K12b -0.00217732 0.000101003 -21.557
Caucasus_K12b;West_Asian_K7b,Atlantic_Med_K12b -0.000977923 4.573e-05 -21.3847
Caucasus_K12b;West_Asian_K7b,Northwest_African_K12b -0.00100154 4.86387e-05 -20.5915
East_African_K12b;Southern_K7b,Sub_Saharan_K12b -0.00247983 0.000122139 -20.3034
Caucasus_K12b;Southern_K7b,Gedrosia_K12b -0.00112749 5.91335e-05 -19.0669
East_Asian_K12b;Southeast_Asian_K12b,Siberian_K7b -0.00100305 5.44851e-05 -18.4097
Atlantic_Baltic_K7b;North_European_K12b,Caucasus_K12b -0.000534432 2.98199e-05 -17.922
Southern_K7b;Southwest_Asian_K12b,Atlantic_Med_K12b -0.000683711 4.08148e-05 -16.7515
East_Asian_K12b;East_Asian_K7b,Siberian_K12b -0.000651854 4.01206e-05 -16.2474
African_K7b;Gedrosia_K12b,Sub_Saharan_K12b -0.000738345 4.5676e-05 -16.1648
African_K7b;Southern_K7b,Sub_Saharan_K12b -0.000769896 4.8516e-05 -15.8689
South_Asian_K7b;South_Asian_K12b,Northwest_African_K12b -0.000598387 3.84069e-05 -15.5802
African_K7b;Sub_Saharan_K12b,Northwest_African_K12b -0.000602378 4.07154e-05 -14.7948
East_African_K12b;African_K7b,Southwest_Asian_K12b -0.00141216 0.000102079 -13.834
African_K7b;Sub_Saharan_K12b,North_European_K12b -0.000663712 4.87314e-05 -13.6198
African_K7b;South_Asian_K7b,Sub_Saharan_K12b -0.000598399 4.51811e-05 -13.2445
Southern_K7b;Southwest_Asian_K12b,Northwest_African_K12b -0.000577559 4.50096e-05 -12.8319
Siberian_K7b;East_Asian_K7b,Siberian_K12b -0.000403499 3.17418e-05 -12.7119
Atlantic_Baltic_K7b;West_Asian_K7b,Atlantic_Med_K12b -0.000520714 4.41022e-05 -11.807
East_African_K12b;African_K7b,Atlantic_Baltic_K7b -0.00122819 0.000106897 -11.4895
African_K7b;Sub_Saharan_K12b,Siberian_K7b -0.00051246 4.93477e-05 -10.3847
East_African_K12b;African_K7b,North_European_K12b -0.00103911 0.000106816 -9.72802
African_K7b;Sub_Saharan_K12b,Southeast_Asian_K12b -0.000469707 4.98071e-05 -9.43052
African_K7b;East_Asian_K12b,Sub_Saharan_K12b -0.000461359 4.9918e-05 -9.24235
Gedrosia_K12b;South_Asian_K7b,West_Asian_K7b -0.00047115 5.11259e-05 -9.2155
South_Asian_K7b;East_African_K12b,South_Asian_K12b -0.000384664 4.18056e-05 -9.20125
African_K7b;Sub_Saharan_K12b,Caucasus_K12b -0.000430657 4.69419e-05 -9.17425
African_K7b;Sub_Saharan_K12b,Southwest_Asian_K12b -0.000421792 4.64037e-05 -9.08962
Atlantic_Baltic_K7b;North_European_K12b,Northwest_African_K12b -0.000328259 3.62081e-05 -9.06589
African_K7b;Sub_Saharan_K12b,East_Asian_K7b -0.000446564 4.9569e-05 -9.00895
African_K7b;Sub_Saharan_K12b,Siberian_K12b -0.000437012 4.88062e-05 -8.95404
Northwest_African_K12b;African_K7b,Atlantic_Med_K12b -0.00115555 0.000131897 -8.76101
African_K7b;West_Asian_K7b,Sub_Saharan_K12b -0.000397507 4.57534e-05 -8.68804
African_K7b;Sub_Saharan_K12b,Atlantic_Baltic_K7b -0.000418044 4.81379e-05 -8.68431
African_K7b;South_Asian_K12b,Sub_Saharan_K12b -0.000393516 4.57123e-05 -8.60853
South_Asian_K7b;South_Asian_K12b,Southwest_Asian_K12b -0.000290753 3.88373e-05 -7.48644
South_Asian_K7b;West_Asian_K7b,South_Asian_K12b -0.000228331 3.63783e-05 -6.27657
Atlantic_Med_K12b;Southern_K7b,North_European_K12b -0.000329428 5.28014e-05 -6.239
East_African_K12b;Gedrosia_K12b,African_K7b -0.000596188 0.000102434 -5.8202
African_K7b;Sub_Saharan_K12b,Atlantic_Med_K12b -0.00023116 4.95629e-05 -4.66397
South_Asian_K7b;South_Asian_K12b,Atlantic_Med_K12b -0.000172605 4.09236e-05 -4.21775
Siberian_K12b;Atlantic_Med_K12b,Siberian_K7b -0.000166672 4.4065e-05 -3.78243
East_African_K12b;West_Asian_K7b,African_K7b -0.00034931 0.000103503 -3.37489
Atlantic_Baltic_K7b;Atlantic_Med_K12b,Siberian_K7b -0.000226988 7.32706e-05 -3.09795
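As a quick sanity check on the table above, the Z-score column is the f3 estimate divided by its standard error. The sketch below parses two of the listed rows and recomputes the ratio. The helper name `parse_row` is hypothetical, and the code assumes whitespace-separated columns in the order shown; it is not part of threepop.

```python
def parse_row(line):
    """Split a row of the form 'A;B,C f3 se Z' into its parts."""
    triple, f3, se, z = line.split()
    target, sources = triple.split(";")
    source1, source2 = sources.split(",")
    return target, source1, source2, float(f3), float(se), float(z)

# First and last rows copied from the table above.
rows = [
    "Atlantic_Baltic_K7b;Atlantic_Med_K12b,North_European_K12b -0.00287483 2.64051e-05 -108.874",
    "Atlantic_Baltic_K7b;Atlantic_Med_K12b,Siberian_K7b -0.000226988 7.32706e-05 -3.09795",
]

parsed = [parse_row(line) for line in rows]

for target, source1, source2, f3, se, z in parsed:
    # The recomputed ratio should agree with the reported Z-score
    # up to rounding of the printed f3 and s.e. values.
    print(target, source1, source2, f3 / se, z)
```

The weakest comparison listed has a Z-score of about -3.1, so every row in the table is at least three standard errors below zero.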

This leads to a very simple way of gauging whether an ancestral population is better seen as admixed or not: count the number of times it appears before the semicolon (as the admixed target A), and subtract the number of times it appears after the semicolon (as a source B or C). This may not be a perfect measure, but it captures the basic idea. When I do this, I get:

 [1,] East_African_K12b      7  
 [2,] African_K7b            7  
 [3,] South_Asian_K7b        4  
 [4,] Atlantic_Baltic_K7b    3  
 [5,] East_Asian_K12b        0  
 [6,] Caucasus_K12b          0  
 [7,] Northwest_African_K12b -2 
 [8,] East_Asian_K7b         -2 
 [9,] Siberian_K12b          -3 
[10,] Southeast_Asian_K12b   -4 
[11,] Gedrosia_K12b          -4 
[12,] Siberian_K7b           -4 
[13,] Southwest_Asian_K12b   -5 
[14,] South_Asian_K12b       -7 
[15,] North_European_K12b    -7 
[16,] West_Asian_K7b         -7 
[17,] Southern_K7b           -8 
[18,] Atlantic_Med_K12b      -8 
[19,] Sub_Saharan_K12b       -19

I think this looks reasonable; the components at the bottom usually appear as contributors to the admixture of other populations, and the components at the top usually appear admixed in terms of the other components. Of course, admixed components may themselves be useful if they represent regional mixes (such as the East African), but this is certainly a good way to supplement and interpret ADMIXTURE analysis.
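The counting rule described above (occurrences before the semicolon minus occurrences after it) can be sketched in a few lines of Python. The function name `admixture_scores` is hypothetical, and the three sample rows are copied from the table only to show the mechanics. Running it over the full list of rows is what would give the ranking shown above.

```python
from collections import Counter

def admixture_scores(rows):
    """Score each population: +1 per appearance as target A (before ';'),
    -1 per appearance as a source B or C (after ';')."""
    scores = Counter()
    for line in rows:
        triple = line.split()[0]          # e.g. 'A;B,C'
        target, sources = triple.split(";")
        scores[target] += 1
        for source in sources.split(","):
            scores[source] -= 1
    return scores

# Three rows from the table, used here only as a small example.
sample_rows = [
    "East_African_K12b;Southern_K7b,African_K7b -0.00412496 0.000101701 -40.5598",
    "African_K7b;East_African_K12b,Sub_Saharan_K12b -0.00241502 2.3253e-05 -103.858",
    "East_African_K12b;African_K7b,Atlantic_Med_K12b -0.00237833 0.000107306 -22.1639",
]

scores = admixture_scores(sample_rows)

# List populations from most "admixed-looking" to most "source-like".
for population, score in scores.most_common():
    print(population, score)
```

On this small sample East_African_K12b scores +1 (target twice, source once), while the other four populations score -1.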