October 10, 2012

The Indo-European invasion of the Baltic

In some recent posts, I showed that South Asian populations (North Indian Brahmins, South Indian Brahmins) can be seen as mixtures of West Eurasian and South Indian populations, but also that West Eurasians (Bulgarians, Greeks, Armenians, and French) can be seen as mixtures of South Asian and Sardinian populations.

This may seem strange, but can be explained if we understand how f3-statistics and rolloff actually work. These methods do not require pure or unadmixed ancestral populations, but exploit allele frequency differences in the reference populations together with either (i) allele frequencies in the mixed population, in the case of f3-statistics, or (ii) admixture linkage disequilibrium in the mixed population, in the case of rolloff.

If a and b are allele frequencies in two ancestral populations A and B that mix, then:

  • The frequency of a will shift towards b if A experiences gene flow from B
  • The frequency of a will randomly shift if A experiences gene flow from an "outgroup" population
  • The frequency of a will shift towards b if A experiences gene flow from a third population that is geographically and genetically intermediate between A and B
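The first and third cases can be illustrated numerically. The following is a minimal sketch, not the computation used in this post: it assumes a simple linear mixing model, uses made-up allele frequencies, and computes a naive f3 estimate as the mean of (c - a)(c - b) across SNPs, without the sample-size correction that real software applies. The helper names `mix` and `f3` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(p_recipient, p_donor, m):
    """Allele frequencies after a fraction m of the gene pool comes from the donor."""
    return (1.0 - m) * p_recipient + m * p_donor

def f3(target, ref1, ref2):
    """Naive f3(target; ref1, ref2): mean over SNPs of (c - a)(c - b)."""
    return float(np.mean((target - ref1) * (target - ref2)))

n_snps = 10_000
a = rng.uniform(0.05, 0.95, n_snps)   # made-up frequencies in population A
b = rng.uniform(0.05, 0.95, n_snps)   # made-up frequencies in population B

# Case 1: A receives 20% gene flow from B, so every frequency moves towards b.
a_from_b = mix(a, b, 0.2)

# Case 3: A receives gene flow from a population intermediate between A and B.
intermediate = 0.5 * (a + b)
a_from_mid = mix(a, intermediate, 0.2)

print(f3(a_from_b, a, b))     # negative: the target lies between the references
print(f3(a_from_mid, a, b))   # also negative, although B itself never contributed
```

In both cases the recipient ends up between the original A and B at every SNP, which is why the statistic is negative even when B was not the actual donor.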

An application to the Europe-South Asia cline

I took the following set of populations, and calculated all 1,365 possible f3-statistics:
"FIN30"         "Lithuanians"   "Russian"       "Pathan"        "Balochi"       "North_Kannadi" "Polish_D"      "Russian_D"     "Mixed_Slav_D"  "Bulgarian_D"   "Serb_D"        "Ukrainian_D"   "Belorussian"   "Bulgarians_Y"  "Ukranians_Y"
In the following table, I report the lowest Z-score for each target population (third column). So, for example, Polish_D can be seen as a mixture of Lithuanians and Balochi. Only negative scores are indicative of admixture. I highlight in bold the significant negative scores (Z less than -3).

Source1        Source2        Target         f3         StdErr    Z        SNPs
Lithuanians    North_Kannadi  FIN30          0.001606   0.000259  6.193    280043
Ukrainian_D    Belorussian    Lithuanians    0.00078    0.000299  2.614    268493
Lithuanians    North_Kannadi  Russian        -0.002738  0.000248  -11.045  279965
North_Kannadi  Polish_D       Pathan         -0.006959  0.000229  -30.344  280220
North_Kannadi  Bulgarians_Y   Balochi        -0.003636  0.000246  -14.781  281604
Pathan         Ukrainian_D    North_Kannadi  0.033802   0.000623  54.237   271858
Lithuanians    Balochi        Polish_D       -0.001171  0.000178  -6.581   279519
Lithuanians    Pathan         Russian_D      -0.001829  0.000166  -11.026  280658
Lithuanians    Pathan         Mixed_Slav_D   -0.001715  0.0002    -8.594   277635
Lithuanians    Balochi        Bulgarian_D    -0.001247  0.000313  -3.979   272342
Lithuanians    Balochi        Serb_D         -0.00091   0.000377  -2.416   270807
Lithuanians    Balochi        Ukrainian_D    -0.002222  0.000358  -6.211   270399
Lithuanians    Balochi        Belorussian    -0.000897  0.00027   -3.325   273076
Balochi        Polish_D       Bulgarians_Y   -0.001198  0.000185  -6.481   279632
Lithuanians    Balochi        Ukranians_Y    -0.001727  0.000187  -9.236   278677
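The count of 1,365 statistics and the Z-score threshold can be checked with a short sketch. It assumes that each of the 15 populations is used as a target against every unordered pair of the remaining 14, and that the Z-score is the f3 value divided by its standard error (fourth and fifth columns of the table). Because the table values are rounded, recomputed Z-scores can differ slightly in the last decimals.

```python
from itertools import combinations

populations = [
    "FIN30", "Lithuanians", "Russian", "Pathan", "Balochi",
    "North_Kannadi", "Polish_D", "Russian_D", "Mixed_Slav_D",
    "Bulgarian_D", "Serb_D", "Ukrainian_D", "Belorussian",
    "Bulgarians_Y", "Ukranians_Y",
]

# Each target is paired with every unordered pair of the other populations.
triples = [
    (ref1, ref2, target)
    for target in populations
    for ref1, ref2 in combinations([p for p in populations if p != target], 2)
]
print(len(triples))  # 15 * C(14, 2) = 1365

# A few rows copied from the table: (ref1, ref2, target, f3, standard error).
rows = [
    ("Lithuanians", "North_Kannadi", "FIN30", 0.001606, 0.000259),
    ("Lithuanians", "Balochi", "Polish_D", -0.001171, 0.000178),
    ("Lithuanians", "Balochi", "Serb_D", -0.00091, 0.000377),
]

results = {}
for ref1, ref2, target, f3_value, std_err in rows:
    z = f3_value / std_err
    results[target] = (z, z < -3)   # significant only if Z is below -3
    print(f"{target}: Z = {z:.2f}, significant admixture signal: {z < -3}")
```

Of these three rows, only Polish_D passes the Z < -3 threshold; Serb_D is negative but not significant, and FIN30 is positive.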

It is clear that the pattern I described holds here: European populations appear like mixtures of Lithuanians and South Asians; conversely, South Asian populations appear like mixtures of Europeans and North Kannadi.

This does not mean that the populations that appear unadmixed (FIN30, Lithuanians, North_Kannadi, and Serbs) are in fact so, for at least two reasons:
  1. A significantly negative f3 statistic confirms admixture, but a non-negative one does not rule it out; in particular, the statistic fails to find real admixture in highly drifted populations
  2. The f3 statistic exploits allele frequency correlations between populations, but the North Kannadi and Lithuanians/Finns occupy opposite ends of the studied cline, so their lack of an admixture signal may be due to the non-existence of populations that are even more unadmixed than themselves.
In the case of South Indians, we are completely sure that this is the case. Reich et al. (2009) managed to show this not because there are any unadmixed Ancestral South Indians (ASI) left, but because they exploited the existence of the Onge, an isolated group from the Andaman Islands that was a sister group to the ASI. So, we can be fairly sure that southern Indians themselves have West Eurasian-like admixture, even the ones that are at the end of the West Eurasia-South India cline on its southern end.

The problem is: there is no isolated group of unadmixed Europeans left in existence that might serve a similar proxy function as the Onge did for South Asians.

Enter Pickrell et al. (2012) to the rescue. In that paper, the authors studied admixture in the Khoe-San of southern Africa. Many of the Khoe-San sub-groups appeared to be admixed, but the "Juj'hoan North" population appeared to be at the "end of the cline": it is impossible to detect admixture in them using allele frequency differences because, quite simply, there are no populations that are less admixed than them: they're as pure descendants of "Ancestral Bushman" as exist on the earth today.

But the clever thing is that we can detect admixture not only using allele frequency differences, but also using admixture LD, i.e., by exploiting the correlation between linkage disequilibrium (the co-inheritance of physically separated markers on a chromosome) and allele frequency differences between populations. Pickrell et al. were able to do this not by conjuring up a more unadmixed population than the "Juj'hoan North" one available to them, but by splitting up that population, and using one half to find allele frequency differences, and the other half to detect admixture LD.
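The dating side of this idea can be sketched as follows. Admixture LD decays roughly exponentially with genetic distance, at a rate set by the number of generations since admixture: a curve of the form amplitude * exp(-n * d), with d in Morgans and n in generations. The example below is an illustration on simulated values only, not the rolloff software itself, which fits the curve by least squares and uses a jackknife for standard errors. The amplitude, noise level, and "true" date are all assumptions chosen for the simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

true_generations = 200.0                   # assumed value for the simulation
d_morgans = np.linspace(0.005, 0.20, 40)   # 0.5 cM to 20 cM

# Idealised admixture LD curve with mild multiplicative noise.
amplitude = 0.02
noise = rng.lognormal(mean=0.0, sigma=0.05, size=d_morgans.size)
weighted_ld = amplitude * np.exp(-true_generations * d_morgans) * noise

# On a log scale the decay is a straight line whose slope is -n.
slope, intercept = np.polyfit(d_morgans, np.log(weighted_ld), 1)
estimated_generations = -slope
print(round(estimated_generations, 1))
```

The log-linear fit is used here only because it keeps the sketch short; it would not cope with real curves that cross zero or include a constant offset.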

Admixture LD signal in Lithuanians

Using the aforementioned idea, I set out to see whether Lithuanians, who occupy the European end of the Europe-South Asia cline, present such a signal of admixture LD. I used the Lithuanian_D sample from the Dodecad Project and the Balochi HGDP sample as reference populations (to calculate allele frequency differences), and the Behar et al. (2010) Lithuanians for admixture LD. There were only ~300k SNPs usable in this set, but that was sufficient to detect the signal of admixture LD:
The admixture time estimate is 200.350 +/- 61.608 generations, or 5,810 +/- 1,790 years. This is not very precise, probably because of the small number of SNPs and individuals used, but it certainly points to the Neolithic-to-Bronze Age for the occurrence of this admixture. The date is certainly reminiscent of the expansion of the Kurgan culture out of eastern Europe, or the later Corded Ware culture of northern Europe.
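The conversion from generations to years, here and in the updates below, is consistent with a generation time of 29 years and rounding to the nearest 10 years. The post does not state this figure, so treat it as an inference from the reported numbers. A small sketch (the helper name `to_years` is hypothetical):

```python
YEARS_PER_GENERATION = 29  # inferred from the reported figures, not stated in the post

def to_years(generations, std_err):
    """Convert a rolloff estimate to years, rounded to the nearest 10."""
    years = round(generations * YEARS_PER_GENERATION, -1)
    error = round(std_err * YEARS_PER_GENERATION, -1)
    return int(years), int(error)

print(to_years(200.350, 61.608))  # (5810, 1790)
```

Note that this scales the standard error by a fixed generation time; any uncertainty in the generation time itself is not included.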

So, it may well be that at least some of the people participating in these groups of cultures were indeed influenced by the Indo-Europeans as they expanded from their West Asian homeland. These intruders mixed with eastern Europeans who vacillated during the late Neolithic between a northern Europeoid pole akin to Mesolithic hunter-gatherers from Gotland and Iberia, and a widely dispersed Sardinian-like population that is in evidence at least in the Sweden-Italian Alps-Bulgaria triangle. The gradual appearance of non-mtDNA U related lineages in Siberia and Ukraine is most likely related to this phenomenon.

It would seem that the Proto-Indo-Europeans mixed with different substrata in the four directions of their expansion: Sardinian-like people in southern Europe, Lithuanian-like people in northern Europe, South Indian-like people in South Asia, and East Eurasians in Siberia and east central Asia. Extant groups are descendants of divergent Neolithic population groups, brought closer together (genetically) because of variable admixture with the PIE population and its early offshoots.


There are mutual signals of admixture across a Europe-South Asia cline: Europeans appear to be mixed with South Asians, and South Asians appear to be mixed with Europeans. The simplest explanation for this pattern involves expansion of a third, geographically and genetically intermediate population that affected both Europe and South Asia. We can use the signal of admixture LD to prove that this expansion affected some of the most unadmixed populations in Europe (e.g., Lithuanians), just as it did the most unadmixed populations of India (e.g., Dravidians).

It will be interesting to use these techniques to study signals of admixture in other "end of the line" populations such as Sardinians, South Indians, etc.

UPDATE I (rolloff analysis of Poles):

I have carried out rolloff analysis of my 25-strong Polish_D sample using Lithuanians and Pathans as references:
The signal is fairly distinct, and corresponds to 149.296 +/- 38.783 generations or 4,330 +/- 1,120 years. I am guessing that either the different reference population (Pathans vs. Balochi) or, more likely, the increased number of target individuals (25 vs. 10) has contributed to the narrowing down of the uncertainty. It will be interesting to explore this signal further with more population pairs.

UPDATE II (rolloff analysis of Finns):

I have also used the 1000 Genomes Finnish sample (FIN) in a similar manner to the Lithuanians, using 15 individuals to estimate allele frequency differences and another 15 for admixture LD, with the Pathans as a South Asian reference population. There is a clear signal of admixture:
This dates to 104.967 +/- 14.797 generations, or 3,040 +/- 430 years. Finland came under the influence of both Europeans (and likely Indo-Europeans) during the Bronze Age period (a mixture of Battle Axe with local Comb Ceramic seems to have occurred), as well as likely non-European (and likely Uralic) intrusions during the same time frame, as part of the Seima-Turbino phenomenon. It will be interesting to repeat this analysis with an East Eurasian reference population to isolate potential signals of admixture dating to either the Comb Ceramic or Seima-Turbino episodes of migration.

(Note, added Oct 14): I carried out rolloff analysis using Nganasans, as suggested in the above paragraph, here.

UPDATE III (rolloff analysis of Ukrainians):

I have used the Yunusbayev et al. sample of Ukrainians, and estimated its admixture time using Lithuanians and Balochi as reference populations:
The admixture time estimate is 191.078 +/- 35.079 generations, or 5,540 +/- 1,020 years. It seems very similar to that in Lithuanians, with a smaller standard error, perhaps on account of either the larger number of SNPs or larger number of individuals.

It is tempting to associate this admixture signal with the Maikop culture, which appeared at around this time. Assuming that North_European/West_Asian (or Lithuanian-like and Balochi-like) gene pools existed north and south of the Pontic-Caspian-Caucasus set of geographical barriers, the Maikop culture, which shows links to both the early Transcaucasian culture and those of Eastern Europe, would have been an ideal candidate region for the admixture picked up by rolloff to have taken place. There are, of course, other possibilities.

UPDATE IV (rolloff analysis of Lithuanians with Pathan reference):

I repeated the first analysis of this post, but this time, I used Pathans, rather than Balochi as a reference population:
The admixture time estimate of 217.501 +/- 51.170 generations, or 6,310 +/- 1,480 years, appears to be similar to the original estimate of 5,810 +/- 1,790 years, so it does not appear that the choice of Balochi or Pathan as a reference population much affects this result.


J said...

This looks neat, but I don't understand it.

Does the confidence interval (+/- number of generations) decrease by the square root of the sample size (both individuals sampled and number of SNPs)?

Slumbery said...

This would mean that the Baltic region was reached by IE much earlier than West/South Europe. The difference from the other admixture dates is 2 millennia, way too much if you try to identify this as essentially the same migration wave. On the other hand, if this is not the same migration wave, then how do you know that they have anything to do with each other at all?

Also 5800 BP is not really Bronze Age. Do we have any bronze material from East Europe around this time?

Another question: how do you know that this particular admixture detected in the Lithuanians is IE? It is a plausible assumption that multiple groups migrated. Given how far the Baltic region is from the better studied "main road", this could easily be a signal of some never before identified migration. (For example, a Central Asian population that also contributed to modern South Asians.)

I think you jumped to too many and too detailed conclusions.

Dienekes said...

On the other hand, if this is not the same migration wave, then how do you know that they have anything to do with each other at all?

The dates have large confidence intervals. For example, the Polish_D rolloff analysis I just added has a date of 4,330 years.

We know that they have something to do with each other because they involve the same reference populations (=South Asians) + populations likely to carry the greatest influence from substrata in southern (=Sardinians) or northern (=Lithuanians) Europe.

But, certainly, they were probably not the "same" event; it would be more accurate to speak of a set of invasions that took place in Europe after its initial Neolithic settlement, much like the Americas were settled over centuries by different groups of Europeans who originated in different sub-parts of Europe and settled different sub-parts of the Americas.

Project "Magnus Ducatus Lituaniae" said...

Very interesting.

I have obtained similar results in my project:

Charles Nydorf said...

Thanks for the additional explanation. I was having trouble following some of the earlier Roll-off posts.

MOCKBA said...

Dienekes, I like the conclusions of rolloff tests which you publish - all of them - but I feel somewhat uneasy about the boundaries of the confidence intervals. It seems that the decay rates are strongly influenced by the very few "LD survivors" at approximately 1 cM scale? How well have we checked that the genetic distances are measured correctly in those individuals who display the LD? Individual variation in recombination rates is bound to happen. Local inversions suppress recombination, large deletions bring loci closer together, and mutations in natural recombination hotspots may reduce effective genetic distances too. When we are relying on rare survivors of recombination, any such effects which slow down recombination may contribute to systematic error.

Also, "plus-minus X generations" leaves an impression that the random error is equally likely to be positive or negative, but I suspect that the actual distribution of probabilities isn't symmetric.

Also, translation to "plus-minus years" may need to take into account statistical uncertainty about the effective length of a generation? In other words, the confidence interval, when expressed in years, should be proportionally wider than the CI expressed in generations, because of the added uncertainty about the conversion factor.

Lastly, even a perfectly correct CI would have one in 20 probability of things happening outside of this interval; repeat the test several times, and you end up with a virtual guarantee that one of your estimates is off...

MOCKBA said...

OK, I didn't get an answer, and I was also concerned about the markers with a low rate of recurrent mutagenesis (which may put them in correlation despite lack of IBD, artifactually increasing the rolloff signal). So I reread Moorjani et al., looking at how they tested ROLLOFF and what limitations they found.

Firstly, adjacent SNPs pairs turned out to be unreliable: "We do not show inter-SNP intervals of <0.5cM since we have found that at this distance admixture LD begins to be confounded by background LD, and so inferences are not reliable (exponential curve fitting does not include inter-SNP intervals at this scale)."

Testing of ROLLOFF was conducted on simulated African_European mixed individuals with high (20%) European admixture. As Figure 4 shows, even with this deep admixture, and even under simulated conditions of exact genetic distances and lack of recurrent mutagenesis, ages of admixture started falling out of confidence intervals after as few as 100 generations. They note that there is also a systematic upward bias in the ages of admixture when the admixture % is low, and/or admixture is old.

CIs are calculated using permutations with individual chromosomes dropped, one at a time, so if the estimated age of admixture is strongly influenced by something very local, such as a single inversion event, it might be manifested by a wider CI, and then caught by inspecting the permutations manually...

Unknown said...

Two remarks:

1. That the proto-Balts, and by inference presumably also the proto-Slavs and maybe the proto-Germanics, are descended from IEs that came from the east (Pontic steppe) rather than from the Balkans is also indicated by the fact that the K7b West_Asian component in Lithuanians is reasonably strong, while the K7b Southern component is virtually absent. This speaks against an IE derivation via the Balkans, at least for the Balto-Slavs, maybe also for the Germanics.

2. The lower admixture date in Poles relative to Lithuanians may be due to later additional admixture with Scythians, Sarmatians and the like, which quite possibly affected Slavs more than Balts.

Dienekes said...

while the K7b Southern component is virtually absent

Gok4-related farmer ancestry is estimated at 11-14% in Finns and North Russians by Skoglund et al. (2012). As I have emphasized before, absence of a component in an ADMIXTURE analysis indicates a minimum within a given set of populations, rather than complete absence.

At present, it is difficult to say much about the origins of the Corded Ware culture. But, its anthropological type is certainly related to that of the Neolithic inhabitants, and contrasts with both Mesolithic hunter-gatherers and Kurgan groups, so I would be much surprised if it was formed without input from the TRB population.

eurologist said...

At present, it is difficult to say much about the origins of the Corded Ware culture. But, its anthropological type is certainly related to that of the Neolithic inhabitants, and contrasts with both Mesolithic hunter-gatherers and Kurgan groups, so I would be much surprised if it was formed without input from the TRB population.

I agree 100%. In most places (but not all) Corded Ware was not intrusive, but rather based on local continuation.

I also see no relation between Corded Ware and Slavic or Baltic languages - there really is no geographic overlap, at the time. Conversely, proto-Germanic clearly shared a boundary with Uralic, and not with Slavic languages.

Unknown said...

Well, true, physically there is quite a difference between corded groups and the pit-grave people... And if the spread of IE languages was accompanied by the spread of the West_Asian component, then there must have been some gene flow. The question is, if this appears possible in the light of these differences. According to some dendrograms I've seen, Central European corded groups were rather similar to each other. The Polish groups just had a somewhat lower FI than the more western groups, though. Also Schwidetzky mentioned that the globular amphora people get broader faced towards the east and at least in this respect do resemble the pit-grave people. The spread of the globular amphora culture preceded and sort of prepared the spread of the corded ware. But I agree, the corded people can't be deduced from the pit-grave people and there is a lot of anthropological continuity with preceding groups.

As for the origin of the corded ware culture, I've read that the development and implementation of the typical corded ware burial rite preceded the spread of the corded ware pottery by two centuries and was without a clear point of origin. The corded ware pottery on the other hand clearly spread from Poland to the other locales, and this spread was connected with globular amphorae.

@ eurologist

Bold assertions! No geographic overlap between the corded ware and Balto-Slavic?? Well, the question is where we tentatively localise the proto-Balts and the proto-Slavs, and probably there are different methods to achieve this. At least the evidence from hydronyms and place names shows a lot of overlap between Balto-Slavs and corded ware. Germanic stands somewhat in between Balto-Slavic and Italo-Celtic... And a boundary with Uralic doesn't rule out another boundary with Balto-Slavic.

Unknown said...

And, @ Dienekes: The point I was trying to make doesn't hinge on the reality of the near absence of the Southern component in Balts, but rather on the relation West_Asian : Southern. If the West_Asian IEs took the route via Anatolia and/or the Balkans to the Baltic, then they must have largely avoided the locals on their way, otherwise the relation West_Asian : Southern in Lithuanians would be lower than it is. Of course, that's not impossible - the early farmers behaved exactly this way - but if the IEs instead sprung out of their West Asian homeland directly onto the Pontic plain, there would be no need to assume such avoiding behaviour. Admittedly, not a compelling argument.

Slumbery said...


Don't you think that this result questions the theory that the source of the steppe IE is the BMAC?
Ukraine was on the main road of steppe migration after this admixture until Medieval times. If the Eastern Steppe got a significant genetic impact from the South later, a Baluchi-referenced rolloff would pick that up.

Dienekes said...

Don't you think that this result questions the theory that the source of the steppe IE is the BMAC?

I am not sure what you are referring to. I think the BMAC was the source of Indo-Iranian, and hence (ultimately) of the Scythians who migrated to Europe during the Iron Age. The language of previous denizens of the European steppe is unknown, and may have included some Indo-Europeanized groups.

Slumbery said...

In earlier debates you questioned that there were IE on the steppe other than Indo-Iranians, so this is pretty much the same. But indeed the question was a bit too general and undefined. Let me rephrase it.

The Scythians lived in Ukraine for a considerable time, and the presence of Scythian-connected groups (Sarmatians, possibly Alans) outlived them. Should not a Ukrainian (Lithuanian vs. Baluchi) rolloff pick up a later signal because of this? BMAC is rather close to Baluchi.

I can imagine multiple answers why it should not pick it up (Scythian genetic impact is insignificant in current Ukrainians, BMAC was too different from Baluchi, Scythians are not from the BMAC, etc.), but I think this result at least raises the question.

Dienekes said...

The Scythians lived in Ukraine for a considerable time, and the presence of Scythian-connected groups (Sarmatians, possibly Alans) outlived them. Should not a Ukrainian (Lithuanian vs. Baluchi) rolloff pick up a later signal because of this? BMAC is rather close to Baluchi.

Eastern Slavs expanded into the eastern European plain in medieval times from Central Europe. I see no reason why they would be particularly related to the Scythians who lived there a thousand years earlier.

szopen said...

Eastern Slavs expanded into the eastern European plain in medieval times from Central Europe

Before the genetics, there were also several other theories, including the one of Slavs coming from the eastern steppes, or living somewhere in Ukraine between steppes and forests.

The idea that Slavs came from Central Europe (e.g. somewhere around Poland) was called in those days the "autochthonous theory" and was pretty much considered not just an unlikely, but even a ridiculous theory.

PS: got "internal error" while posting the comment. Hope I will not send the same comment twice...

Unknown said...

Actually it's striking how well the rolloff date for Poland, 2320 BC, fits with the beginning of the Bronze Age in southern Poland, with the Unetice and Mierzanowice cultures!

So if the expansion of IE was linked to the expansion of the West_Asian component, and I deem this idea to be quite convincing, then it would follow that, surprisingly, the Corded Ware wasn't yet IE.

There are two advantages in this insight:
1. There would no longer be any need to derive the Corded Ware IEs from the earlier West_Asianised Pit-Grave/Yamnaya culture.

2. The Corded Ware/Battle axe/Single Grave culture followed immediately on the genetically rather Southern TRB (with Gok4) and was apparently associated with R1a, but probably also with I1 in the north. It may be one of the first agents to diffuse the Amerindian component more widely in Europe. And since this component is also found in Basques, the beginning of its wider diffusion must have antedated the arrival of IEs in central Europe. A pre-IE Corded Ware would suit this well.