November 04, 2010

Clustering of European Y-STRs

Roewer et al. had previously discovered structure in European Y-chromosomes with Y-STRs. The new study, five years later, uses a huge database of population samples. While Y-SNPs defining haplogroups are safer because they avoid homoplasy, which can be a problem when only a few Y-STR markers are used, I believe that most major haplogroups can be distinguished even with few Y-STRs, so the paper's results are valid.

From the paper:
In a total of 33,010 males we identified 4176 different haplotypes, 2192 were unique, and 56 corresponded to 42% of the Y chromosomes
Interesting that such a small fraction of haplotypes corresponds to almost half the Y chromosomes. 7 Y-STRs are generally not sufficient to define monophyletic lineages (as the Cohen Modal Haplotype folks well know by now). It would be interesting to see what this fraction is expected to be under an assumption of reproductive equality, to assess the strength of social selection that I've speculated may be behind the mega-haplogroups we observe in the world today.
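As a rough illustration of what "expected under reproductive equality" could mean, here is a minimal sketch of a neutral Wright-Fisher simulation in which every male is equally likely to father each son. The constant population size, the infinite-alleles mutation model, and all parameter values are my assumptions, chosen only so the example runs quickly; none of it is calibrated to the YHRD data, and `simulate_neutral` and `top_k_share` are hypothetical helper names.

```python
import random
from collections import Counter


def top_k_share(haplotypes, k):
    """Fraction of chromosomes carried by the k most common haplotypes."""
    counts = Counter(haplotypes)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(haplotypes)


def simulate_neutral(pop_size, generations, mutation_rate, seed=0):
    """Wright-Fisher sketch: every male is equally likely to be a father.

    Each son copies the haplotype of a uniformly chosen father and, with
    probability mutation_rate, receives a brand-new haplotype label
    (infinite-alleles assumption, so homoplasy is ignored).
    """
    rng = random.Random(seed)
    population = [0] * pop_size  # start from a single founding haplotype
    next_label = 1
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            hap = rng.choice(population)
            if rng.random() < mutation_rate:
                hap = next_label
                next_label += 1
            children.append(hap)
        population = children
    return population


# Hypothetical, uncalibrated parameters (assumptions, not estimates).
sample = simulate_neutral(pop_size=500, generations=300, mutation_rate=0.01)
share_top5 = top_k_share(sample, 5)
share_top50 = top_k_share(sample, 50)
```

Comparing a share like this with the observed one (56 haplotypes covering 42% of chromosomes) would only be meaningful with realistic population sizes, mutation rates, and sampling, which this sketch does not attempt.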

Here is a synthetic map of Europe showing distribution of different clusters:


(a) Spatial distribution of the most frequent Y-STR haplotype clusters in Europe and neighboring regions.
(b) Spatial distribution of the Y-STR haplotype clusters accounting for the second highest frequency in western Europe.

Now, take a look at a map of predicted language distribution by Finnish scholar Kalevi Wiik for 5,500 BC:

The correspondence is not perfect, but it's close enough to merit study. The small differences can be ascribed to 7,500 years of history; for example, in 5,500 BC there were probably no Germanic speakers in Scandinavia.

Also of interest:
Two clusters were assigned to large areas of the Balkan Peninsula: (1) Croatia, Bosnia and Herzegovina, Serbia, Romania, Western and Eastern Hungary, and Central Ukraine: cluster 18; (2) continental Greece, Bulgaria, and Macedonia: cluster 2. Cluster 13 was assigned to Albania and to the western area of the Balkans 10 and cluster 11 to the Caucasus.

Forensic Science International: Genetics doi:10.1016/j.fsigen.2010.09.010

Geostatistical inference of main Y-STR-haplotype groups in Europe

Amalia Diaz-Lacava et al.

We examined the multifarious genetic heterogeneity of Europe and neighboring regions from a geographical perspective. We created composite maps outlining the estimated geographical distribution of major groups of genetically similar individuals on the basis of forensic Y-chromosomal markers. We analyzed Y-chromosomal haplotypes composed of 7 highly polymorphic STR loci, genotyped for 33,010 samples, collected at 249 sites in Europe, Western Asia and North Africa, deposited in the YHRD database (www.yhrd.org). The data set comprised 4176 different haplotypes, which we grouped into 20 clusters. For each cluster, the frequency per site was calculated. All geostatistical analysis was performed with the geographic information system GRASS-GIS. We interpolated frequency values across the study area separately for each cluster. Juxtaposing all 20 interpolated surfaces, we point-wisely screened for the highest cluster frequencies and stored it in parallel with the respective cluster label. We combined these two types of data in a composite map. We repeated this procedure for the second highest frequencies in Europe. Major groups were assigned to Northern, Western and Eastern Europe. North Africa built a separate region, Southeastern Europe, Turkey and Near East were divided into several regions. The spatial distribution of the groups accounting for the second highest frequencies in Europe overlapped with the territories of the largest countries. The genetic structure presented in the composite maps fits major historical geopolitical regions and is in agreement with previous studies of genetic frequencies, validating our approach. Our genetic geostatistical approach provides, on the basis of two composite maps, detailed evidence of the geographical distribution and relative frequencies of the most predominant groups of the extant male European population, examined on the basis of forensic Y-STR haplotypes. 
The existence of considerable genetic differences among geographic subgroups in Europe has important consequences for the statistical inference in forensic Y-STR haplotype analyses.
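The compositing step described in the abstract (screen all interpolated surfaces cell by cell, keep the highest frequency together with its cluster label, then repeat for the second highest) can be sketched as follows. This is only my reading of the abstract, not the authors' GRASS-GIS workflow: the interpolation itself is not reproduced, `composite_map` is a hypothetical name, and the toy grids are made-up numbers.

```python
def composite_map(surfaces, rank=0):
    """Return (labels, freqs) grids for the rank-th highest cluster per cell.

    surfaces maps a cluster label to a 2D list of interpolated frequencies;
    rank 0 picks the most frequent cluster in each cell, rank 1 the second.
    """
    cluster_ids = sorted(surfaces)
    first = surfaces[cluster_ids[0]]
    n_rows, n_cols = len(first), len(first[0])
    labels = [[None] * n_cols for _ in range(n_rows)]
    freqs = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            # Order clusters by their frequency in this cell, highest first.
            ranked = sorted(
                cluster_ids, key=lambda cid: surfaces[cid][r][c], reverse=True
            )
            winner = ranked[rank]
            labels[r][c] = winner
            freqs[r][c] = surfaces[winner][r][c]
    return labels, freqs


# Toy 2x2 surfaces for three clusters (values are invented for illustration).
surfaces = {
    6: [[0.40, 0.10], [0.30, 0.05]],
    17: [[0.10, 0.35], [0.20, 0.50]],
    4: [[0.20, 0.15], [0.25, 0.10]],
}
top_labels, top_freqs = composite_map(surfaces, rank=0)
second_labels, second_freqs = composite_map(surfaces, rank=1)
# top_labels    -> [[6, 17], [6, 17]]
# second_labels -> [[4, 4], [4, 4]]
```

Storing the label grid and the frequency grid side by side is what allows a map to show both which cluster dominates a cell (color) and how strongly (shading).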


40 comments:

  1. I looked at the map close-up, and the Peloponnese is green, like Egypt and southwest Asia. There is what appears to me to be a data point in the center of the peninsula. So the Peloponnese is different genetically than the rest of Greece?

  2. The Peloponnese seems to have a high frequency of haplogroup E1b in Greece, so this probably links it to other E1b-heavy populations. But I don't see any sample points in Egypt and only a few in the Levant, so I'm not sure what's going on. These interpolated surfaces are good visually, but it's always best to look at the raw data for particular sample points.

  3. How come the western half of Europe is in the same cluster with the Peloponnese, the eastern half of Turkey, the Levant, Egypt, Azerbaijan, most of Armenia and the eastern half of Libya while places in between are in completely different clusters from them?

  4. @onur,

    They just re-used the same colors I'm sure.

    Wow, look how Iceland with its one data point belongs in the Swedish type of cluster

  5. Hmm hard to interpret really. I suspect the devil is in the sensitivity. I am seeing:

    (1) Cluster 6 (green) seems to be the main basal structure from the Caucasus and Middle East to northern Scandinavia. Cluster 6 must be R1, which leads me to suspect the sensitivity. I would have expected I to be the basal group. Presumably the distinction between R1a (southern route) and R1b (northern route) is disguised in the combined green, which is why it looks different from the autosomal info.


    (2) Cluster 17 (red) appears to be a later overlay (later population wave) on both the green cluster 6 and blue (Finland/Lithuania) unknown cluster.

    (3) Cluster 20 (grey) seems to be the Balkans spreading out. Note the touches in southern Sardinia.

    (4) Clusters 1/13 (Blue) looks to be Turkish, and the Black Sea. Ottomans maybe. Interesting that the Greek Cypriots show green and the Turks blue.

    (5) The strong impact of northern African men on Turkey (or vice versa) is fascinating. Is this the Troy impact? Or is it more modern?

  6. are in completely

    I should have written "are completely in" here.

  7. Ugh, once again they only test western Switzerland even though the Swiss have already been shown to be genetically differentiated based on language (or geography). I really want to see them test northeastern Switzerland like Zurich (the biggest city), Schaffhausen, or St. Gallen. I wonder if they would cluster with the central Germans.

    It's interesting that, by the map, the Swiss cluster with the eastern Austrians, the northern and western Italians, and the very southern Germans, but they do NOT cluster with the French.

    It is also interesting that the people just north of the Alps should cluster with the people just south of the Alps, since the Alps are a geographic barrier. Perhaps this gives some credence to the Italo-Celtic connection; the Celts I'm referring to are the ones of the "Celtic homeland" involving Switzerland, Austria, and southern Germany.

    .................................................................................

    Question, if a data point appears in a certain color/cluster must it therefore belong to that color/cluster?

  8. This comment has been removed by the author.

  9. This comment has been removed by the author.

  10. This comment has been removed by the author.

  11. I'm only speculating here.
    In map (b), there is a sky blue color touching Italy, Switzerland and Austria, basically where the Celtic cultures of La Tène and Hallstatt (more or less) were.
    Following that map, it is no coincidence then that there was an Italo-Celtic language, and maybe Celts were more widely diffused on the peninsula than historical accounts tell us.

  12. This comment has been removed by the author.

  13. This comment has been removed by the author.

  14. They just re-used the same colors I'm sure.

    Doesn't make sense. Every color represents a specific Y-STR cluster.

    Clusters 1/13 (Blue) looks to be Turkish, and the Black Sea. Ottomans maybe.

    The Black Sea area of the Balkans is in the same cluster, so it can't be Turkish or Ottoman in origin. The golden cluster that exists in central Turkey also exists in almost all of Sweden, the eastern half of Norway and parts of Iceland, thus not exclusive to Turkey, so it too cannot be Turkish or Ottoman in origin.

    Interesting that the Greek Cypriots show green and the Turks blue.

    Cyprus wasn't sampled, its coloring is just interpolation.

    The strong impact of northern African men on Turkey (or vice versa) is fascinating.

    Y-STR clusters don't tell anything about overall genetics. Autosomal studies are much much more informative. So your conclusions are certainly unjustified.

  15. The Black Sea area of the Balkans is in the same cluster, so it can't be Turkish or Ottoman in origin.

    I've just noticed. Also most of Albania.

  16. @ Annie Mouse, the blue/ green island in the east Med is Crete not Cyprus. Cyprus is shown in blue.

    Some comments on the Balkans, looked at like a painting:

    The barriers between the groups in the Balkans seem much less distinct than those between the groups elsewhere. The Balkans also seems most complex, with the blue, red, grey, green and purple layers. Compare the Balkans to the solid green in western Europe or to the distinct green, yellow and blue blocks in Scandinavia.

    The red in the Balkans seems to overlay the blue and the grey. Notice the blue on both the Adriatic and the Black Sea coasts; and the grey on the Baltic coast and at 5 and 7 o'clock south from there.

    The blue and the grey look distinct from each other, like non-Balkan groups.

    The purple looks painted over the top as a top layer.

    So my guess is blue at the southern part and grey at the northern, with red over the blue and grey and purple over the top of the red.

    Clearly the western green looks much less present than the eastern red in the Balkans.

  17. Yes, there are two different green clusters: 6 and 7.

    Interesting to see that there are two (presumably different R1a haplotypes) red (17) and "Aubergine" (cluster 18). The area of 18 is largely made up of non-Slavic speaking people (Romania, Hungary) in addition to Serbs and Croats and goes all the way to the Ukraine. This may indicate that this region (like Eastern Germany) already had its own R1a before Slavic expansion, and likely before the advent of the Slavic language. The known high frequency of haplogroup I in Serbia/Croatia is also an indication for a persistence of local groups.

    On the other hand, (17) goes along with the traditional view of northward migrating Slavs and Balto-Slavs from between the Ukraine and the Urals.

    In map (b) with the second-highest frequencies, yellow (4) appears to be the "Germanic" haplogroup I2. Note, however, that this haplogroup and its distribution also predate the Germanic languages.

    People just haven't really moved all that much in the past 7,000 or so years.

  18. Dieneke, I'm very sorry for the multiple posting. To counteract redundancy in posting I've conflated all of my unpublished posts on this thread, so please don't publish them and also this post, only publish the conflated post that I will send after this one.

    I've asked you many times not to triple post. Opening up a text editor, combining ALL your thoughts into one post and then posting is not difficult.

  19. Not sure why Wiik's map is seen as important here. It is extremely speculative and speculation about European pre-history is not hard to find.

    But compare to any other map, for example just a basic geographical or bio-region map, and you'll also see approximate correspondences.

    There are things like mountain ranges and swamps involved, which mean that no matter who was living in the various regions of Europe, those regions are the regions where people also have more contact with each other.

    Best Regards
    Andrew

  20. Yes, there are two different green clusters: 6 and 7.

    7 (dark green) is only used in the second map, whose color coding is obviously entirely unrelated with the first one. In the first map the only green used is 6 (light green), which is the same in both Europe and the Middle East.

    the blue/ green island in the east Med is Crete not Cyprus. Cyprus is shown in blue

    True. But Cyprus wasn't sampled, so it is open to speculation. But considering the coloring of Turkey and Crete, Cyprus' interpolated coloring seems plausible. The point is, no cluster in Turkey is exclusive to Turkey, but they are all found also in either non-Turkish parts of Europe, non-Turkish countries lying east and south of Turkey or both.

  21. Question, if a data point appears in a certain color/cluster must it therefore belong to that color/cluster?

    If on the same map, should be so, just as a color/cluster at the same K of an ADMIXTURE analysis.

  22. ... just as a color/cluster at the same K...

    ... just like a color/cluster at the same K...

  23. They sampled 7 STRs and some of the samples they are assigning to 'cluster 6' - in particular the Jordan and I *think* it was Egypt or N. Africa are a different group from the rest of cluster 6 which are most certainly R1b (including the Armenian one). I know R1b is quite frequent in the Jordan highlands but I'm not sure the 'most frequent' cluster identified is even the same haplogroup as the rest of cluster 6. There was a link to the supplementary info but I think it was taken down.

  24. I'm highly skeptical of the timeline of the linguistic map at 5500 BCE. But I think it is fair to think it may have looked like that sometime around ca. 1300 BCE-2500 BCE, which would have the same genetic impact. So, it could be an IE origin cluster, just one that is several thousand years younger.

    Alternately, because it isn't very clear how much of a Y-DNA demographic impact IE expansion had in this region, it isn't implausible to think that the area labeled IE in the linguistic map may really correspond to the area that spoke languages in some now extinct LBK linguistic family around 5500 BCE. The initial cultural unity and common cultural origin of the LBK (per artifacts and crop/domesticated animal ancient DNA), apparent low level of assimilation of autochthonous peoples in the first thousand years or so of that expansion (per ancient DNA and physical anthropology), and rapid rate of demographic expansion across Europe in the early European Neolithic all point to probable linguistic unity among those populations.

    But, it seems more likely that the LBK language was not IE, and that instead Proto-IE was a creolization of a language in that extinct LBK language family and a Uralic language that was formed in the ethnically mixed Sredny Stog and Novodanylovka cultures initially in a very small North Pontic area ca. 4500 BCE to 3500 BCE that didn't really make big territorial gains into places where we have historical records until around 2500 BCE or later.

  25. Cluster 13 was assigned to Albania and to the western area of the Balkans

    It is clear from the first map that they misspelled "eastern" as "western" here.

  26. onur,

    Both maps use the same color coding - the colors that define the clusters are to some degree mixed or they partially overlap on the maps.

    Cluster 7 was found in Turkey (2.6%) and Syria (5%). I think the resolution is really poor and likely doesn't even properly distinguish between the two main stars of R1b (western Europe and Turkey). For example for cluster 6, the haplogroup found in 10% to 15% from Spain to Ireland is 14_13_29_24_11_13_13, while that found to 23% in Jordan is 14_13_29_23_11_11_12 (but also lumped under cluster 6). Compare that with 14_14_30_24_11_13_13 - which was assigned cluster 9, instead (Spain, Finland).
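To make the comparison of the three 7-STR strings above concrete, here is a small sketch that counts how many loci differ and by how many repeat units in total. It treats every position as an independent locus, which is a simplifying assumption on my part (the comment does not name the loci), and `compare` is a hypothetical helper, not anything from the paper.

```python
def parse_haplotype(text):
    """Turn a string like '14_13_29_24_11_13_13' into a list of repeat counts."""
    return [int(value) for value in text.split("_")]


def compare(a, b):
    """Return (number of differing loci, total absolute repeat difference)."""
    x, y = parse_haplotype(a), parse_haplotype(b)
    if len(x) != len(y):
        raise ValueError("haplotypes must have the same number of loci")
    diffs = [abs(p - q) for p, q in zip(x, y)]
    return sum(1 for d in diffs if d), sum(diffs)


iberian = "14_13_29_24_11_13_13"   # cluster 6, Spain to Ireland
jordan = "14_13_29_23_11_11_12"    # also cluster 6, Jordan
cluster9 = "14_14_30_24_11_13_13"  # cluster 9, Spain and Finland

print(compare(iberian, jordan))    # (3, 4)
print(compare(iberian, cluster9))  # (2, 2)
```

By this crude count, the Jordanian string is further from the Iberian one (3 loci, 4 steps) than the cluster 9 string is (2 loci, 2 steps), even though the first two share a cluster.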

  27. Both maps use the same color coding - the colors that define the clusters are to some degree mixed or they partially overlap on the maps.

    The color codings of the two maps do not correspond to each other. This makes me think that they are entirely unrelated or there is something that we don't know here.

  28. From the paper:

    "(a) composite map showing the 13 clusters (out of a total of 20 clusters) accounting for the highest frequency per tract of land; (b) composite map showing the clusters with the highest frequency per tract of land after excluding the most frequent clusters in the continental region of Western Europe (primarily covered by cluster 6); lighter and darker shading indicates higher and lower cluster frequency, respectively; dots indicate sampling sites"

  29. To clarify
    Cluster 11/13 seems to be the Black Sea and Turkey. There is a wave of expansion west into Greece etc that may be the Ottoman empire.

    Ooops on Cyprus.

  30. So...

    Haplogroup R1a is red.
    Haplogroup R1b and subclades are green.
    Haplogroup N is navy (Finland etc)
    Haplogroup J is pale blue (cluster 13).
    Haplogroup E is orange (cluster 3)
    Haplogroup I is possibly yellow (cluster 4) in Sweden.

    Not sure what the grey of the Balkan spread is. Could it be an underlay of a Haplogroup I subclade?

  31. Yes, yellow is I2, as I mentioned above. However, navy blue is just another R1b and not N, as I also mention above. Their resolution is so poor that their cluster system is off: the prime "haplogroup" that makes up blue is closer to the Iberian haplogroups that make up most of the green cluster than is the main Near Eastern haplogroup (which is nevertheless grouped under green). Look at the underlying STR data I posted above.

    "Haplogroup" in this context means the identifying string of 7 STRs - not ySNPs.

  32. To clarify
    Cluster 11/13 seems to be the Black Sea and Turkey. There is a wave of expansion west into Greece etc that may be the Ottoman empire.


    But it also includes most of Albania, which refutes your hypothesis, as Albania has never had any meaningful number of Turkish-speakers. Also the demographic impact of the Ottomans on the Balkans was considerably small.

  33. "However, navy blue is just another R1b and not N, as I also mention above."

    I think she means the left map.

    Do you think that blue in Finland and northern Russia is "just another R1b?" There is only 3.5% R1b in Finland.

    So I think she is talking about the left map and you are talking about the right map.

    And I don't think both maps use the same clusters.

    Yellow MAY be I2 on the right map, but more like I1 on the left one.

    I would even doubt if that is I2 on the right map.

    Why should a super tiny minority haplogroup like I2, that is at 4%-6% in Germany, be "representative" for Germany on such a map?

    I would rather think that is just another (a specially "Germanic" one) branch of R1b.

  34. Again - same color scheme on both maps. Both use the same clusters, both use the same colors. It says so in the paper - just read it! I am truly getting tired of this useless speculation by people who don't even read the paper.

    Think about it: this represents subsets of "real" y-DNA and projects them into questionable clusters - and most are represented in the percent range, only, because only the largest contribution is displayed. Look at the actual data sheet - or some of the examples I posted above.

    Why should a super tiny minority haplogroup like I2, that is at 4%-6% in Germany, be "representative" for Germany on such a map?

    You are plainly misinformed. In some regions of Germany, R1b, I2, and R1a are roughly equally represented. However, if you only count the most prevalent strains, I2 may be enhanced locally, i.e., in Scandinavia, since it does not have the 40,000 years of different R1b strains in that region.

    Just look at Scandinavia. Is it brighter yellow than Germany in (b) because it has more I2? No, it is brighter because it has (due to founder effects and drift) more of a particular STR variation, which dominates there and makes up a large percentage of the population. It is not that Germany has any less I2 - but it has a much larger diversity of I2 - and as such, less of what is counted in this rather imbecile analysis, grouped in cluster 4.

  35. Northern Russia is a different blue and a different "haplogroup" cluster.

  36. (b) composite map showing the clusters with the highest frequency per tract of land after excluding the most frequent clusters in the continental region of Western Europe (primarily covered by cluster 6)

    Just as I guessed when writing "there is something that we don't know here".

  37. That's what my sources claim:

    All values are percentages.

    Haplogroup  North Germany  East Germany  West Germany  South Germany  Denmark  Sweden
    I1          18             19.5          13            9.5            30.5     42
    I2a         1              1             2.5           5              0.5      0
    I2b         5              3             7             3              5        2
    R1a         23             25            9             9.5            12.5     23.5
    R1b         38             36            47            48.5           44.5     21
    G2a         3.5            4             5             7.5            1        0.5
    J2          4              2             5             5.5            3        1
    J1          0.5            0             0             1              0        0
    E1b1b       5.5            7.5           8             7.5            2.5      1
    T           1              1             1.5           1.5            0        0
    Q           2              1             0.5           0.5            0        0.5
    N1c1        1.5            1             1.5           0.5            1.5      7

    I don't know what to think about someone who would choose I2 as the representative cluster of Sweden.

  38. I apologize, I typed I2 when I meant I (in modern nomenclature). You are absolutely correct that I1 and I2 in modern nomenclature have different regions of prevalence - and perhaps cluster (4) in this study simply is some strange superposition of these two - similar to what appears to be happening with R1b.

  39. Hmmm interesting.

    Finland has a navy shading to a purple in Northern Russia.

    The second ranked haplogroup in Iberia is navy in Spain shading to purple in Portugal.

    There is also a similar-looking purple patch in Hungary (primary haplogroup). Anyone know the primary Y-Haplogroups in Hungary? Don't they have a Finno-Ugric connection?

  40. The pale blue cluster (in the eastern Balkans and central Anatolia) seems to be based solely on samples from the Dobrudja region in the Balkans...

    Regarding cluster 20 (grey), present on the northwestern Black Sea coast and in Slovakia, could somebody explain it?

