May 15, 2011

Genes and Languages in the Caucasus

If there was ever a paper that was the equivalent of a box of candy, this is probably it. I will update this post with my comments.

UPDATE I (Genealogical rate, Gene-language concordance, Ossetes): I seriously don't know where to begin with this paper. So, given the serendipitous appearance of an abstract on Y-chromosome mutation rates, here is a major new pro-genealogical rate quote from the new paper:
We found that “evolutionary” estimates of most clusters fall far outside the range of the respective linguistic dates, while “genealogical” estimates gave a good fit with the linguistic dates. At least two population events in the Caucasus are documented archaeologically, which allows additional comparison with these “historical” dates. In both cases, the historical (archaeological) date is similar to a genetic estimate based on the “genealogical” mutation rate (Supplementary Note 2).
And, here's a comparison of the linguistic and genetic (based on Y-chromosomes) trees from the paper:
The correspondence seems remarkable; the only major discrepancy is for Iranic (Indo-European) Ossetes who group with NW Caucasians genetically, which makes sense as the Ossetes are probably to a large extent NW Caucasians that underwent a language shift under the influence of the Alans.

Speaking of the Ossetes, their negligible R1a1-M198 frequency (0.4-0.8%) should be a warning that Iranic steppe nomads _do not equal_ R1a1. While a limited contribution of Alans to the Ossetes is expected, it is not expected that Ossetes will have two of the lowest M198 frequencies in the Caucasus: in all probability R1a1 was not particularly important among Alans, and, by implication (?), Sarmatians.
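Returning to the mutation-rate question raised at the start of this update, here is a minimal sketch of why the choice of rate matters so much. It dates a cluster from the average squared distance (ASD) between sampled STR haplotypes and an assumed founder haplotype. Everything in it is an assumption for illustration: the haplotypes are made up, the two rates are only rough versions of commonly cited "genealogical" and "evolutionary" values, both are applied per 25-year generation, and the ASD estimator is a textbook simplification that may differ from what the authors actually did.

```python
# Hypothetical STR haplotypes (repeat counts at 4 loci); NOT data from the paper.
founder = [14, 12, 23, 10]
sample = [
    [14, 12, 23, 10],
    [15, 12, 23, 10],
    [14, 12, 24, 10],
    [14, 13, 23, 11],
]

# Assumed per-locus mutation rates, both treated as per 25-year generation.
GENEALOGICAL_RATE = 2.1e-3  # assumption: roughly the father-son (pedigree) rate
EVOLUTIONARY_RATE = 6.9e-4  # assumption: roughly the "evolutionary" effective rate
YEARS_PER_GENERATION = 25   # assumption


def average_squared_distance(haplotypes, root):
    """Mean squared difference in repeat count from the root, over all loci."""
    total = 0
    for haplotype in haplotypes:
        total += sum((a - b) ** 2 for a, b in zip(haplotype, root))
    return total / (len(haplotypes) * len(root))


def age_in_years(asd, rate):
    """ASD grows by about `rate` per generation, so age = ASD / rate generations."""
    return asd / rate * YEARS_PER_GENERATION


asd = average_squared_distance(sample, founder)
print(f"ASD: {asd:.3f}")
print(f"genealogical age: {age_in_years(asd, GENEALOGICAL_RATE):.0f} years")
print(f"evolutionary age: {age_in_years(asd, EVOLUTIONARY_RATE):.0f} years")
```

Whatever the haplotypes, the two ages differ by the ratio of the two rates, roughly threefold with the values assumed here. That is why a cluster can fit the linguistic dates under one rate and fall far outside them under the other.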

UPDATE II (4 haplogroups for 4 language families):

The most interesting discovery in this paper is, of course, the correspondence between Y-chromosome haplogroups and language groups, thanks to the very large number of individuals tested and the deep phylogenetic resolution of the haplogroups:
Overall, the most frequent haplogroups in the Caucasus were G2a3b1-P303 (12%), G2a1a-P18 (8%), J1*-M267(xP58) (34%), and J2a4b*-M67(xM92) (21%), which together encompassed 73% of the Y chromosomes, while the other 24 haplogroups identified in our study comprise the remaining 27% (Table 2). ... haplogroup G2a3b1-P303 comprised at least 21% (and up to 86%) of the Y chromosomes in the Shapsug, Abkhaz and Circassians ... haplogroup G2a1a-P18 comprised at least 56% (and up to 73%) of the Digorians and Ironians (both from the Central Caucasus Iranic linguistic group), while not being found at more than 12% (average 3%) in other populations... haplogroup J2a4b*-M67(xM92) comprised 51-79% of the Y chromosomes in the Ingush and three Chechen populations (North-East Caucasus, Nakh linguistic group), while, in the rest of the Caucasus, its frequency was not higher than 9% (average 3%) ... haplogroup J1*-M267(xP58) comprised 44-99% of the Avar, Dargins, Kaitak, Kubachi, and Lezghins (South-East Caucasus, Dagestan linguistic group) but was less than 25% in Nakh populations and less than 5% in the rest of Caucasus.

Interestingly, G2a3 is one of the lineages of early Central European farmers, and of 2 medieval German knights. G2 is also, curiously, one of the West Eurasian lineages that are found in very small quantities in India, especially among upper caste Hindus. We are beginning to make connections across space and time, even though the patterns are far from clear yet.

The prevalence of J1*-M267(xP58) in Dagestan is well known (or suspected) from previous studies. Notice that J-P58, if we use the genealogical rate, has an age of ~5.4ky in Semitic groups, and this is in concordance with the 5,750 years ago origin of Semitic languages based on Bayesian phylogenetics. So, it is clear that part of haplogroup J1 was prevalent in ancient Semitic groups, and another, disjoint part in ancient Dagestani groups.

To make things more interesting, the Nakh groups (Ingush and Chechens) have J2a4b*-M67(xM92) as their modal haplogroup. Nakh is also a Northeast Caucasian language subfamily, like Dagestani, and indeed NE Caucasian is also called Nakho-Daghestanian. What did the early speakers of this family look like?

It would be tempting to think that Proto-Nakho-Dagestanians were J1-dominated, as J1 exists in both Nakh (16-25%) and Dagestani (58-99%) groups, whereas J2a4b-M67 (the Nakh modal haplogroup) is nearly completely absent in Dagestanians.

UPDATE III (No European influence):

Another interesting discovery of this study is the lack of European influence in the populations of the North Caucasus.
It seems that both R1a1a-M198 and I2a-P37 have a major barrier eastward at the Don river. Please note that the former is not strictly a European haplogroup, but it nonetheless experiences a massive drop in frequency, and is negligible everywhere except in Abkhaz-Circassians (NW Caucasus; 10.3-19.7%), with an outlier in Dargins (22%).

This seems to put a limit on the origin of any hypothetical movements across the Eurasian steppe east of the Don river, as haplogroup I2a-P37 is largely absent in Central Asia, and occurs 3 times in 1,525 individuals in this sample. So, while there have been proposals of a Central European origin of some steppe pastoralist groups, these are hard to reconcile with this picture.
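To put a number on "3 times in 1,525 individuals", here is a small sketch that converts the count into a frequency with an approximate 95% Wilson score interval. The counts come from the text above. Using the Wilson interval, and treating the pooled Caucasus sample as a single random sample, are my own simplifying assumptions, not something taken from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


I2A_COUNT = 3       # I2a-P37 carriers reported in the sample
SAMPLE_SIZE = 1525  # Caucasus individuals in the study

freq = I2A_COUNT / SAMPLE_SIZE
low, high = wilson_interval(I2A_COUNT, SAMPLE_SIZE)
print(f"I2a-P37 frequency: {freq:.2%}")
print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

The point estimate is about 0.2%, and even the upper end of the interval stays below 1%, so the near-absence of I2a-P37 is unlikely to be a sampling accident under these assumptions.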

UPDATE IV (Haplogroup G):

Two of the modal haplogroups in this paper are G2a1a-P18 (Iranic, 56-73%) and G2a3b1-P303 (NW Caucasians, 21-86%). Battaglia et al. (2008) also found a high frequency of G2a* in Georgians and Balkars (~30%, also modal in both populations). It appears that G2a is a mainly West (both NW and SW) Caucasian phenomenon within the context of this region.

UPDATE V (Starostin and Language depth):

The authors applied the methodology of the late Sergei Starostin to the problem of language time depth:
The present work employs Starostin’s methodology, and we made special efforts to create the high-quality linguistic databases required for this analysis. Thus, based on significantly extended and revised linguistic databases, we have applied a glotto-chronological approach to the North Caucasian languages. As a result, our study provides a unique opportunity to make direct comparisons of linguistic and genetic data from the same populations. Lexico-statistical methods have also been applied to a number of language families using a Bayesian approach to increase the statistical robustness of language classification (Gray and Atkinson, 2003; Kitchen et al., 2009; Greenhill et al., 2010). Using these methods with the Caucasus languages under study here will be the focus of future work.
It will certainly be interesting to see Bayesian phylogenetic methods applied to the Caucasus languages in the future, using the linguistic datasets developed here. The concordance of genetic and linguistic results in this paper, in addition to the many successes of the Gray and Atkinson (G&A) approach, is making it increasingly difficult to doubt our ability to estimate the age of language families in much the same way that biologists estimate the age of genetic variation.

See also the Tower of Babel project and the Evolution of Human Languages project at the Santa Fe Institute.

UPDATE VI (Haplogroup J2a):

I have recently speculated about a possible link between the Caucasus region and India based on the appearance of a "Dagestan" component in India, the clear West Asian origin of Ancestral North Indians, as well as a possible linguistic link between Northeast Caucasian, Hurrian, and Indo-European.

A problem with that theory is that the high J1*(xP58) frequency in Dagestan has no counterpart in South Asia. The current study, however, adds data on the Nakh part of the Nakho-Dagestanian (Northeast Caucasian) family, showing this to be J2a4b-M67 dominated. So, while I think that J1*(xP58) may have been present among Proto-Northeast Caucasians, these must have interacted with J2a folk.

J-M67 is clearly intrusive into the Central Caucasus, from the South where a much greater variety of J2a-related lineages is observed among Armenians, North Iranians, and Anatolian Turks.

We now have good coverage of J2a in the entirety of the West Asian region, with the exception of Azerbaijan, and a few patterns are beginning to emerge:
  1. The center of the J2a world is somewhere between eastern Turkey, Armenia, Azerbaijan, Iran, and Syria
  2. The Caucasus is a northern extension of this world, just as Greece and Italy are its main western extensions, with a strong extension into Central Asia as far as Xinjiang, and well into South Asia all the way to upper caste South Indian Hindus.
  3. In the Caucasus itself, J-M67 dominates among Nakh speakers, but with little other J2a-related variation.
  4. In comparison to Nakhs, J2a seems more varied in Georgians, among Ossetes, and among NW Caucasian speakers
It is hard to make any pronouncements on how J2a spread northwards from its Transcaucasian cradle, but I would think that the Kura-Araxes and Maikop cultures are fairly good candidates for that spread, with the former being J2a dominated, and the latter being more G2a dominated. I would not, however, dismiss a more recent spread of J2a into the region.

UPDATE VII (Absence of E1b1b1):

This haplogroup has a more Mediterranean distribution and is conspicuously absent in the North Caucasus. Unfortunately no downstream markers were typed, but (a) its presence in small amounts in NW Caucasians (1-1.7%) together with a similar low frequency (1.5%) in Georgians, (b) its absolute absence among Nakho-Dagestanians, except for one Lezghin, suggest to me that it arrived in the region from the west, and is probably a low-frequency trace of Ancient Greek colonies of the Black Sea, just as it is associated with Greek colonists in the West Mediterranean and Sicily.

UPDATE VIII (Haplogroups L and T):

There is a little haplogroup L in the North Caucasus. L-M27 and L-M317 seem concentrated in the Northwest, while L-M357 is found only in Nakh speakers. The detection of L-M357 in North but not South Iran may be related to this population, and also to the L-rich population of Syria, especially from the eastern inland area.

Haplogroup T has been the subject of a major recent paper. In this region, it is found in 2 NW Caucasians, 1 Ossete and a couple of Lezghins, but unfortunately with no fine phylogenetic resolution.

Mol Biol Evol (2011) doi: 10.1093/molbev/msr126

Parallel Evolution of Genes and Languages in the Caucasus Region

Oleg Balanovsky, Khadizhat Dibirova, Anna Dybo, Oleg Mudrak, Svetlana Frolova, Elvira Pocheshkhova, Marc Haber, Daniel Platt, Theodore Schurr, Wolfgang Haak, Marina Kuznetsova, Magomed Radzhabov, Olga Balaganskaya, Alexey Romanov, Tatiana Zakharova, David F. Soria Hernanz, Pierre Zalloua, Sergey Koshel, Merritt Ruhlen, Colin Renfrew, R. Spencer Wells, Chris Tyler-Smith, Elena Balanovska and The Genographic Consortium

We analyzed 40 SNP and 19 STR Y-chromosomal markers in a large sample of 1,525 indigenous individuals from 14 populations in the Caucasus and 254 additional individuals representing potential source populations. We also employed a lexicostatistical approach to reconstruct the history of the languages of the North Caucasian family spoken by the Caucasus populations. We found a different major haplogroup to be prevalent in each of four sets of populations that occupy distinct geographic regions and belong to different linguistic branches. The haplogroup frequencies correlated with geography and, even more strongly, with language. Within haplogroups, a number of haplotype clusters were shown to be specific to individual populations and languages. The data suggested a direct origin of Caucasus male lineages from the Near East, followed by high levels of isolation, differentiation and genetic drift in situ. Comparison of genetic and linguistic reconstructions covering the last few millennia showed striking correspondences between the topology and dates of the respective gene and language trees, and with documented historical events. Overall, in the Caucasus region, unmatched levels of gene-language co-evolution occurred within geographically isolated populations, probably due to its mountainous terrain.



matt said...

Ken is looking for how much M170 M438 etc are found. And how many are M26, or M423 or neither.

Onur said...

On the above trees there are 11 populations, but the abstract mentions 14 Caucasian populations investigated. What are the remaining 3 Caucasian populations? Also which populations from outside the Caucasus were investigated?

matt said...

Thanks for the I2a haploid group M170 M438 P37.2 stat of 3 samples.
There were previous reports of I2a M170 M438 P37.2- M436- found in Georgia,Armenia,Turkey. I think this is part of what Ken is looking for.

Ricardo Costa de Oliveira said...

J1 is the major haplogroup in the Caucasus with the most complex and diverse phylogenetic network in the region. Of all the major Caucasian haplogroups (G2a3b1, G2a1a, J2a4b), J1 (xP58) is also the most frequent haplogroup present around the Caspian Sea and the most frequent in Northern Iran. Another J1 (xP58) lineage, J1b M365, is a Northern Iranian haplogroup forming a distinct cluster with the Western European, Western Iberian Portuguese-Brazilian J1b haplotypes, which may be genetic testimony of the presence of a section of the Iranian-speaking Alans in Lusitania and Northwestern Iberia, well attested in historical documents and plausible in terms of their TMRCA on the Atlantic and the Caspian shores.

Dienekes said...

Ossetes have 1.3-3.9% J1*(xP58) in this study.

matt said...

Only 3? hits for I2* versus previous reports of I2* M170 M438 P37.2- M436- found in Georgia,Armenia,Turkey. In the future we should be getting more data from Iran etc. relative to the Caucasus?

Onur said...

This study is essentially a study of the highland Caucasus, the most isolated areas of the Caucasus, so it is not so surprising to find a clear correlation between languages and genetics in this study, especially as the Northwest and Northeast Caucasian language families are probably very old in the Caucasus (maybe so old as to be directly linked to the first Neolithic colonizations in the Caucasus).

But the lowland Caucasus and Transcaucasus populations, which are less isolated than the highland Caucasus populations, aren't included in this study. Previous genetic studies suggested that the Transcaucasus and lowland Caucasus populations have genetic structures correlated with geography rather than languages, and that within each region they are genetically very close to each other despite speaking languages from very different families and despite religious differences.

terryt said...

"The correspondence seems remarkable"

Stunning in fact.

"1.The center of the J2a world is somewhere between eastern Turkey, Armenia, Azerbaijan, Iran, and Syria"

Coincides pretty closely with the region where farming began.

Onur said...

The correspondence seems remarkable; the only major discrepancy is for Iranic (Indo-European) Ossetes who group with NW Caucasians genetically, which makes sense as the Ossetes are probably to a large extent NW Caucasians that underwent a language shift at the influence of the Alans.

Bear in mind that unlike Northwest and Northeast Caucasian language families, the Ossetian language is a late arrival to the region, so no wonder that it is very weakly correlated, if at all, with genetics, because as a general rule the more recent the language dispersal the more of an elite dominance type it becomes in most of the Old World (especially in the regions with a long agriculturalist past like West Asia, the Caucasus, South Asia, East Asia, North Africa, most of Europe and most of Southeast Asia, and also in most of post-Bantu expansion Sub-Saharan Africa).

pconroy said...


In terms of J2a and the putative West Asian origin of the Ancestral North Indians, what do you make of the recent K=12 results for the Irish?

Here is a chart I put together for 11 of the 17 Irish_D sample who identified themselves as being Irish - via the identity thread, or via email to me:

And here is the raw data:

You'll notice that my Father DOD098 has:
West Asian = 0.00
South Asian = 1.40

He is the only Irish_D member that I know of, who has no known non-Irish ancestry. He also has both the highest Basque and NE European. So I'm guessing that you would have to say that his South Asian component comes via NE Europe??

lars said...

perhaps legacy of some irish traveller/kalderash women at some point in the lineage??

Dean said...

Speaking of haplogroup I2a, I2a2, that's found in southeast Europe, is getting different opinions now. Some are stating that the South Slavs brought this haplogroup to southeast Europe, and it grew due to founder effect.

There is a challenge to the idea that modern Croatia is the post-glacial origin of this haplogroup, as opposed to somewhere near the ancient Slavic homeland in Ukraine, due to variance. The proto-Slavs might be these people's ancestors. Haplogroup R1a1 is a minority haplogroup in some of the south Slav countries, and if early Slavs were predominantly R1a1, one would think it would be present in higher frequencies.

tndl said...

Good question Dean. Who knows what the early Slavs or Slavic speakers were, but I'd wager a guess that they included different groups of people with different haplogroups. Looks like the I2a2 in the Balkans is very young if we can trust Nordtvedt, which means its post-glacial origin in Croatia is BUSTED.

Szilard Oberlaender said...

Interesting, but why is the work of Yunusbayev (2006) always mentioned, and never the research of Bulayeva (2008)? In Dagestan nobody trusts Yunusbayev. The data presented by these researchers differ radically. Excuse me, but Yunusbayev did not conduct research on the mountain people of Daghestan. Nobody in mountain Daghestan ever saw him. That is why his work does not specify at all from which of the mountaineers of Daghestan he took his data.

Szilard Oberlaender said...

That research is from 2006. Kazima Bulayeva (2008) has other research whose data differ strongly from Yunusbayev's results. See: Culture creates genetic structure in the Caucasus: Autosomal, mitochondrial, and Y-chromosomal variation in Daghestan, by Elizabeth E Marchani, W Scott Watkins, Kazima Bulayeva, Henry C Harpending and Lynn B Jorde.
Y-chromosome frequencies according to this work: Avars F (F*xH,I,J2) 0.61; J2 0.33; R1**(xR1a1) 0.06. In another work K. Bulayeva has written: Interestingly, haplogroup G occurs in the Avars (0.06) but not in the other highland groups. Haplogroup G is common in the Southern Caucasus. See "Genes and Languages in the Caucasus" in Dienekes' blog.

ina said...

I am a founder of the genetic study of Dagestan's indigenous ethnic groups, begun in 1976 at VIGG RAS, and because I am myself originally from one of the small ethnic groups there, I know the culture, history and demography of these groups very well. I also know that results on the ethnogenomic diversity of such populations depend entirely on the methods used to collect field data and DNA. Outside, non-Dagestani geneticists (Yunusbaev, Balanovsky, etc.), even when they went to rural monoethnic villages, collected data from anyone who simply agreed to give them blood samples. Because of the high inbreeding and the subdivision of villages into distinct kindreds, DNA sampling of this kind, without any preliminary collection of data on the history, demography and traditions of the particular ethnic group, often gives very strange results. There are monoethnic villages that I sampled long before Yunusbaev or the Balanovskys, and our results differ, because I work with the historical and cultural data listed above before collecting DNA, including deep genealogical data, in order to properly select unrelated members of the population from different kindreds. Otherwise a sample from a given population isolate may consist of members of one family from one of 7-8 kindreds, which will show an extremely high frequency of that kindred's ancestral haplogroup but cannot characterize the polymorphism of the whole population.
We have published DNA polymorphism data with Y haplogroups and mtDNA HVS1 since 2003 (Bulayeva et al., 2003; Marchani et al., 2008; Bulayeva et al., 2006; Tofanelli et al., 2009; etc.). Any questions are welcome.
Kazima Bulayeva