May 02, 2012

Drawing the human Y chromosome tree with SNPs

Terry (tdrobb@gmail.com), a poster at GENEALOGY-DNA-L, reported age estimates for various nodes of the Y-chromosome tree based on SNPs. These can be found in this PDF file and here (scroll down for UPDATE10). He used 1000 Genomes data and SNP counting to reach these estimates.

It will be nice to see others join the SNP bandwagon, because that is really the way forward in age estimation for Y-chromosome lineages. SNPs have an extremely low (=negligible) rate of back-mutation, but they occur at a much lower rate than Y-STR step mutations. On the other hand, there are at most a few hundred Y-STRs and only ~100 tested by commercial companies, while scientific datasets generally include at most a few dozen of them. The Y chromosome includes millions of mutable sites, and these will generally be reported both by the 1000 Genomes Project and by the plethora of full genome sequences that is about to become available.

Y-SNP based age estimation has the potential to greatly improve estimates by tightening confidence intervals substantially; there will, of course, be lingering uncertainty about parameters such as generation length, but Y chromosome mutation rates are likely to become very secure once full genome sequencing becomes so cheap that it can be applied to a number of father-son pairs.
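To illustrate why more SNPs should tighten the intervals: if mutations accumulate roughly as a Poisson process (an assumption made here for illustration, not something taken from Terry's write-up), the relative standard error of a count of n SNPs is about 1/sqrt(n). A minimal sketch in Python, using made-up counts:

```python
import math

def relative_error(snp_count):
    """Approximate relative standard error of a Poisson-distributed count."""
    if snp_count <= 0:
        raise ValueError("snp_count must be positive")
    return 1.0 / math.sqrt(snp_count)

# Hypothetical SNP counts on a branch, not taken from any dataset.
for n in (10, 100, 1000):
    print(n, "SNPs -> relative error of about", round(relative_error(n), 3))
```

This only covers the sampling noise in the count itself; uncertainty in the mutation rate or generation length would come on top of it.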

Looking at the inferred tree, what is striking is the great distance between haplogroup A1b and the rest of the tree, about 100,000 years. Note that these are not "relative" estimates of the kind published by the 1000 Genomes Project (based on "archaeologically" calibrating a node and estimating ages of other nodes by counting the relative number of SNPs), but "absolute" ones (dividing SNP counts by a mutation rate).
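The difference between the two kinds of estimate can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Terry's actual procedure; the function names, the mutation rate, the number of sites and the SNP counts are all hypothetical.

```python
def absolute_age(snp_count, rate_per_site_per_year, sites_compared):
    """Absolute estimate: SNPs on a lineage divided by the expected SNPs per year."""
    return snp_count / (rate_per_site_per_year * sites_compared)

def relative_age(snp_count, calib_snp_count, calib_age_years):
    """Relative estimate: scale a calibrated node's age by the ratio of SNP counts."""
    return calib_age_years * snp_count / calib_snp_count

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
RATE = 1e-9          # assumed mutations per site per year
SITES = 10_000_000   # assumed number of callable sites compared

print(absolute_age(500, RATE, SITES))
print(relative_age(500, calib_snp_count=1000, calib_age_years=100_000))
```

The absolute estimate stands or falls with the mutation rate, while the relative one inherits whatever error there is in the calibrating node's assumed age.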

(UPDATE: There is apparently an even more basal clade than A1b currently investigated; I have removed the link to an announcement regarding this clade, since there are issues regarding the release of this information)

Going back to the age estimates, I cannot help but notice the concordance between Terry's age estimate for the DE/CF split (55ky) and the mtDNA estimates for most mtDNA L3 subclades. Terry labels DE "African" and CF "Eurasian", but in fact DE is Afrasian and CF Eurasian. Together with the absence of any evidence for a post-70ka Out-of-Africa, I'd say that it is becoming increasingly clear that while modern humans can be ultimately traced to the Middle Stone Age in Africa, their major expansion that went on to colonize the entire world originated in Asia, and included a major episode of back-migration into Africa.

I also earnestly hope that the next set of Y chromosome papers on recent populations will forego the cost of testing hundreds of samples on Y-STRs and invest in full Y-chromosome sequencing of a few samples after an initial Y-SNP screening.

15 comments:

Hector said...

His tree is clearly messed up.
For instance NO and IJ are more closely related than NO and P etc etc.

Currently ascertainment bias is too great to construct a tree of this type. With complete Y genome sequencing there should, theoretically, not be any ascertainment bias, but the way things are right now, the bias stays.

Dienekes said...

For instance NO and IJ are more closely related than NO and P etc etc.

Yes, I did notice that, which is why I did not reproduce the tree here.

There are other oddities, e.g., the very low age estimate of G, probably because this is made mostly (only?) of Tuscans who belong to the common European subset thereof.

Still, I do think the age estimates have value for the big picture, and, of course, any effort to set the ball rolling with Y-SNP data is to be applauded.

Dr Rob said...

Apart from uncertainty about generation times, aren't TMRCA values still influenced by other assumptions about population, e.g. whether it was constant, exponentially growing, or contracting?

Vincent said...

Let's hope the author expands his explanation of his methods, so we can evaluate just HOW wrong the details are.

I agree that efforts towards a SNP-based molecular clock for the Y are worth applauding, but only if efforts give us better analysis than we have already.

These 1000 Genomes data are quite terrible in places, and the phylogenetic methods themselves are quite easy to get wrong. That "great distance between haplogroup A1b and the rest of the tree", for example, is quite possibly just an artifact of having constrained the tree to be ultrametric. E.g. http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/07-08/itb/slides/itb0708_slides_192-224.pdf

Perhaps the "big picture" is right after all, but without more detail it really is impossible to say.
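For readers unfamiliar with the term: an ultrametric tree puts every tip at the same distance from the root, which implies that in any triple of lineages the two largest pairwise distances are equal. A minimal Python sketch of that check follows, using hypothetical pairwise SNP differences; it is not a reconstruction of the method used for the tree under discussion.

```python
from itertools import combinations

def is_ultrametric(dist, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances are equal."""
    for a, b, c in combinations(list(dist), 3):
        d = sorted([dist[a][b], dist[a][c], dist[b][c]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

# Hypothetical pairwise SNP differences among three lineages X, Y and Z.
clock_like = {
    "X": {"X": 0, "Y": 4, "Z": 10},
    "Y": {"X": 4, "Y": 0, "Z": 10},
    "Z": {"X": 10, "Y": 10, "Z": 0},
}
uneven = {
    "X": {"X": 0, "Y": 4, "Z": 10},
    "Y": {"X": 4, "Y": 0, "Z": 14},
    "Z": {"X": 10, "Y": 14, "Z": 0},
}

print(is_ultrametric(clock_like))
print(is_ultrametric(uneven))
```

When real counts fail this check (through no-calls or uneven coverage, say), forcing them onto an ultrametric tree has to stretch or shrink some branches, and that is the kind of artifact being warned about.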

Lank said...

The absence of any corresponding mtDNA lineage to these ancient Y chromosomes in West Africa is notable. This suggests that there's not much meaning behind the geographic location of these ancient lineages within Africa, as they're simply relics of the ancient past. Further research will be required to find the precise homeland of anatomically modern humans.

The age of CT does appear to resemble L3 quite nicely. What surprises me is the close age estimates for BT and CT. Y-DNA BT clearly marks an African spread that preceded the great Eurasian expansion. If BT and CT are roughly contemporaneous, then that would further support an African origin for Y-DNA CT. It'll be interesting to see how this holds up with further sequencing.

wemakeamericahappen said...

Links are much appreciated. As is the work by the gentleman who did it. However, note that aside from the faults listed above, he also accepts as dogma that R1b is an easterly Hg in Anatolia post-LGM. In other words, that the Irish and Spanish magically became 90% R1b by a roving band of 2000 genocidal maniacs who either didn't intermix along their loooooong journey to Ireland or who flew to Ireland in an airplane (to get there quicker in 10,000 BC) and also had WMDs to wipe out the populations.

As my reductio ad absurdum shows, this is far from accepted dogma. The lack of R1b in India, its clear westerly cline, etc. do not bode in favor of this notion that R1b were IE farmers who in some incredible wave theory depopulated Western Europe with their awesomeness.

Ezr said...

@wemakeamericahappen

You are conveniently ignoring the fact that R1b is extremely frequent and even modal in much of NW China (Uyghurs) and South Siberia (Bashkirs), as well as in Bedouin groups in Jordan and Palestine. If its origin lies to the extreme West, how could it have become so frequent in the borders of Mongolia and Arabia without leaving much of a trace in the middle? The conundrum is even worse! The most parsimonious explanation is obviously that it originated between its two frequency extremes, in Anatolia or the South Caucasus, spread both east and west along a northern route, and then was swamped in the middle by other incoming hgs (E, J, R1a, G, C). Frequency means very little in these cases.

Dienekes said...

Terry sent me the following e-mail with some clarifications:

Hi Dienekes,
Thank you for your blog posting about my recent SNP work using the 1000 Genomes Project data. I hope it will encourage others to think about dating the Y-haplogroups by methods other than STRs.

Note, I have now changed the text to say "DE (Afrasian)", since that is more accurate.

Also, in response to one of your readers' comments that "For instance NO and IJ are more closely related than NO and P etc etc.": I did note that myself, and I did include the following explanation at the bottom of the webpage (but unfortunately not in the PDF file):

"Also be aware that only the count of nucleotide differences is used to construct the tree - if an inferred ordering of nucleotide differences was instead used, then such a tree would exactly represent the order in which branches occurred but at the cost of not being able to easily compute a timeline. So the above tree uses the count method which, subject to the error bars associated with the branch splits, may place a branch in slightly the wrong order. "

Also note there is a lot of no-call data in the samples I used, which undoubtedly is affecting the results. With more and better data, I expect things would improve. But I am hoping the principle of using SNP counts to produce a tree with a timeline is established and is a reasonable thing to do.

For me, one focus would be to get a better estimate for the Y-chromosome nucleotide mutation rate.

Best regards,
Terry

Ponto said...

Maybe he is just jumping the gun. What is needed is full genomic scanning of the Y chromosome of a representative sample of humanity and incorporating haplogroups obtained from ancient human remains. It is likely some extinct lines may be found in those ancient genomes.

Unfortunately the whole phylogenetic tree and dating has been set back by having Europeans tested first getting undue emphasis on the R1 and I haplogroups at the expense of all others. Look at the effort, time and money put into those mainly European haplogroups yet others are languishing.

mooreisbetter said...

EZR, I'm sorry, but I am with wemakeamericahappen.

To be an adherent to your "R1b out of Anatolia" theory, you have to believe that these R1b populations were strong enough to become the predominant Hg in the West, but so weak in the areas immediately next to the area of your purported expansion, that they were relatively swamped out by other Hgs in a short time, including ones that were there for much longer. It just doesn't make sense.

You can account for a pocket of a Hg with a grand migration. For example, we know historically that the Vandals, a Germanic nation, travelled from Northern Europe to North Africa in a few generations. Finding a pocket of Scandinavian Hgs in North Africa would be an example, were it to come to bear.

But you cannot account for massive numbers and total replacement from a grand migration, unless there is genetic evidence IN THE MIDDLE TOO.

It's not parsimonious -- indeed, it comes across as convenient -- to say that R1b is missing from Central Europe in grand numbers because it was later swamped out. That would assume several rounds of war, almost total replacement, and near genocide -- all of which are simply not supported by the archaeological record or common sense.

eurologist said...

So, according to this, DE split ~50,000 ya. Since E didn't make it into central/East Asia, and this is one of the most important early splits, in reality it easily could have occurred around 100,000 when some DE migrated East from Arabia. If one accepts this anchor, then one needs to apply a factor of two to all dates. (You can argue the same about CT, doesn't change things too much).

Then the R1a/R1b split becomes 40,000 ya, which seems reasonable for those who haven't given up hope that R1b will turn up in pre-neolithic Europe... ;)

Another item placing doubt on the timeline is the accumulation of splits right around LGM. That does not make sense.
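The "factor of two" argument above is just a rescaling of every date by the ratio of an anchor age to its estimated age. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the 20 ky figure for R1a/R1b is only what the comment implies (40 ky divided by two), not a value read off the original tree.

```python
def rescale_ages(ages_ky, anchor_node, anchor_age_ky):
    """Rescale all node ages so that anchor_node takes the given anchor age."""
    factor = anchor_age_ky / ages_ky[anchor_node]
    rescaled = {node: age * factor for node, age in ages_ky.items()}
    return rescaled, factor

# Ages in thousands of years, as quoted or implied in the comment above.
estimates_ky = {"DE": 50.0, "R1a/R1b": 20.0}

rescaled, factor = rescale_ages(estimates_ky, "DE", 100.0)
print(factor)
print(rescaled)
```

A single anchor of this kind moves every date in lockstep, so the result is only as good as the assumed age of the anchor.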

Maju said...

I find the work most interesting and the method something I have always craved (scrap STRs, count actual SNPs). However, the estimated timeline must be twice as old.

I calibrated by setting CF to 80 Ka (which is very well supported by the archaeological evidence of India, notably but not only Petraglia 2010 and Petraglia 2007) and I had to reduce the scale to 52% of its original length. This works so well that it not only makes CF and F (and E) expand in the expected time-frame of before 70 Ka ago but also makes IJ and derived lineages expand in the expected time-frame of 55 Ka ago (Ahmarian culture) and later (other Aurignacoid cultures). R1b could then have expanded in the time frame of the Gravettian.

Equally N1c and R1a would only have expanded as we approach the Holocene (earlier Eastern Europe was probably too cold).

I really think that the method produces a promising structure. It would also support, if my proposed alternative time-frame is correct, the hypothesis recently defended by Dienekes, of West Africans having some relative's introgression, which would be apparent in A1b (aka A0), too old (250 Ka ago) to be properly Homo sapiens (although not too distant either if we accept that our species coalesced in modern-like phenotypes c. the date of Omo: 190 Ka ago).

eurologist said...

Maju,

Seems we came to the same conclusion, independently. :)

Dr Rob said...

"Links are much appreciated. As is the work by the gentleman who did it. However, note that aside from the faults listed above, he also accepts as dogma that R1b is an easterly Hg in Anatolia post-LGM. In other words, that the Irish and Spanish magically became 90% R1b by a roving band of 2000 genocidal maniacs who either didnt intermix along their loooooong journey to Ireland or who flew to Ireland in an airplane (to get their quicker in 10,000 BC) and also had WMDs to wipe out the populations.

As my reductio ad absurdum shows, this is far from accepted dogma. The lack of R1b in India, its clear westerly cline, etc. do not bode in favor of this notion that R1b were IE farmers who in some incredible wave theory depopulated Western Europe with their awesomeness."

Nobody is making statements about 'strength' and/or ferocity. Quite simply, it was luck that R1b became so predominant in western Europe (through drift, founder effects and 'mutation surfing'). We are not even sure that its high presence represents an actual demographic event (!). This is merely the conclusion of 'genetic anthropologists' and the assumptions they posit. Theoretical population genetics might not be so optimistic about the equivalence of molecular and population history.

Moreover, you can argue black & blue about whether STRs are "more diverse" in Ireland or Turkey; however, it seems clear that all the European R1b SNPs are downstream from those further east, and the 'bridge' between the two is, unsurprisingly, the Balkans (e.g. see new data on Serbian SNPs). And India harbours a large diversity of R haplogroups - various R1a, R1*, R2, etc.

Thom said...

It's a very good picture but R2 isn't there in the tree. Very good picture!