October 05, 2012

Calibrated human Y-chromosome phylogeny (Wei et al. 2012)

It seems that the first paper from the Wellcome Trust Sanger Institute folks on updating the Y-chromosome phylogeny is out. Three abstracts on the topic were recently presented.

I had recently used the Complete Genomics data myself, and you might want to consult that post, since I resolved some of the chromosomes at a finer phylogenetic detail using the current ISOGG nomenclature. For example, the three Y-haplogroup I chromosomes all belong to I1a, so their shallow divergence from each other does not reflect on the entire I haplogroup. Similarly, all the R1b chromosomes belong to R1b1a2a1a1b, so their shallow divergence does not reflect on the entire R1b haplogroup.

An anomaly in my own PhyML analysis was NA19649, which was placed in the right part of the tree but showed an abnormally long branch length. This is explained by Wei et al.:
"Examination of the distribution of variants along the chromosome revealed an excess in one sample, GS19649, in the region 28,670,244 – 28,735,914 compared with the other individuals; this was associated with low coverage of this region, so we hypothesised that the excess arose from a combination of deletion of the region in this individual and mismapping of reads originating from other parts of the chromosome. We excluded from further consideration all variable sites from all individuals that fell into this region."
Of course, the Complete Genomics data are lacking in many of the haplogroups present in modern humans, so the authors added a more basal individual:
"We therefore generated additional sequence data from a haplogroup A individual, NA21313. This chromosome was derived for the markers M32, M190, M220, M144, M202, M305 and M219, and ancestral for P97, placing it in haplogroup A3b2*."
Note that this haplogroup is not the most basal in the Y-chromosome tree. It corresponds to A1b1b2b in current ISOGG terminology, and is several layers derived relative to the most basal A0 haplogroup, as well as an even more basal Y-chromosome which currently awaits publication. So, keep in mind that the age estimate of 101-115 thousand years does not include the most divergent Palaeoafrican lineages.

The PDF file at the journal website currently seems to be missing Table 1 and the supplementary materials. Here is Table 1, showing age inferences using different methods:

These results seem consistent with my post-70ka Out-of-Arabia scenario, preceded by a pre-100ka Out-of-Africa. Of particular interest:
"The second internal node examined was the multifurcation of haplogroups I, G and NR (i.e. K), the representation in this reduced set of individuals of a larger multifurcation also involving several F sublineages, H and J (Karafet et al. 2008). The minimal resolution of the lineages, even in a phylogeny based on sequencing of 8.97 Mb, implies a rapid expansion, which we date here at 41-52 KYA."
This seems like a very good signal corresponding to the Upper Paleolithic of Eurasia, and might suggest that haplogroups C and DE split off before this seminal event, which involved F descendants. Hopefully, by using the larger sample (but lower coverage) of the 1000 Genomes data, already announced by this team, we will learn more about missing branches of the phylogeny.

The authors also address the issue of mutation rate:
"We used a calibration based on direct measurement of the Y-chromosomal SNP mutation rate both in years and generations from a deep-rooting family (Xue et al. 2009) since this requires the minimum number of assumptions and has already been adopted in the literature (Cruciani et al. 2011). The measurement does, however, have wide confidence intervals since only a small number of mutations were observed, and these confidence intervals were not included in our consideration of times, which used the point estimate. Two lines of reasoning suggest that we may have more confidence in the point estimate than simple consideration of the number of mutations might suggest. First, it is consistent with other direct measurements of the human mutation rate, allowing for the expected higher mutation rate on the Y because of its permanent location in the mutation-prone male germ line (e.g. Roach et al. 2010). Second, it is consistent with the rate inferred from human-chimpanzee comparisons of the same sections of the Y chromosome: 1.3 x 10^-9 mutations/nucleotide/year for a 6.5 million year Y-chromosomal divergence time (Scally et al. 2012, Xue et al. 2009). Nevertheless, additional measurements of mutation rate are urgently needed to improve calibration."
I am pretty sure that as genome sequencing costs plummet, we will get ever better estimates of the Y-chromosome mutation rate. As I've said before, genealogists may have an important role to play here, by identifying very deep-rooting pedigrees. The Xue et al. (2009) rate uses a Chinese pedigree about two centuries deep, but European genealogists may well have documented pedigrees that go even deeper. Doing whole genome sequencing on two lines descended from two sons of a man who lived, e.g., in the 1600s, or even before, will shrink the confidence intervals of the mutation rate drastically, because of the great number of generations separating his living descendants.

Using deep-rooted pedigrees is important, because by sequencing the whole genomes of only two individuals, you get several dozen generations' worth of mutation. By contrast, sampling father-son pairs would require dozens of genomes to achieve a comparable amount of mutation.
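To put rough numbers on this, here is a minimal Python sketch of the arithmetic. It takes the 1x10^-9 per bp per year rate and the 8.97 Mb of sequence from the paper's abstract; the 400-year depth of the pedigree and the 30-year generation time are illustrative assumptions of mine, not figures from the paper, and the function names are made up for this example.

```python
import math

# Values quoted in the Wei et al. (2012) abstract.
RATE_PER_BP_PER_YEAR = 1e-9   # assumed male mutation rate
CALLABLE_BP = 8.97e6          # unique male-specific Y sequence analysed

# Illustrative assumptions (not from the paper).
YEARS_TO_COMMON_ANCESTOR = 400   # e.g. a man who lived in the 1600s
YEARS_PER_GENERATION = 30        # rough generation time


def expected_mutations(years_of_transmission):
    """Expected number of new mutations over the given number of years."""
    return RATE_PER_BP_PER_YEAR * CALLABLE_BP * years_of_transmission


def relative_poisson_error(expected_count):
    """Relative standard error of a Poisson count: 1 / sqrt(count)."""
    return 1.0 / math.sqrt(expected_count)


# Two living descendants: the two lines together span twice the pedigree depth.
deep_pedigree = expected_mutations(2 * YEARS_TO_COMMON_ANCESTOR)

# One father-son pair: a single generation of transmission.
father_son = expected_mutations(YEARS_PER_GENERATION)

# Father-son pairs needed to observe as many mutations as the deep pedigree.
pairs_needed = deep_pedigree / father_son

print(f"Deep pedigree, two genomes: {deep_pedigree:.2f} expected mutations")
print(f"One father-son pair:        {father_son:.2f} expected mutations")
print(f"Father-son pairs needed:    {pairs_needed:.1f}")
print(f"Relative error, deep pedigree: {relative_poisson_error(deep_pedigree):.0%}")
```

Under these assumptions, two genomes from the deep pedigree yield about 7 expected mutations, while a single father-son pair yields well under one, so roughly 27 pairs (over 50 genomes) would be needed to match it. The Poisson error term also shows why even deeper pedigrees help: the relative uncertainty of the rate only shrinks with the square root of the number of mutations observed.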

Genome Research doi: 10.1101/gr.143198.112

A calibrated human Y-chromosomal phylogeny based on resequencing

Wei Wei et al.

We have identified variants present in high-coverage complete sequences of 36 diverse human Y chromosomes from Africa, Europe, South Asia, East Asia and the Americas representing eight major haplogroups. After restricting our analysis to 8.97 Mb of unique male-specific Y sequence, we identified 6,662 high-confidence variants including SNPs, MNPs and indels. We constructed phylogenetic trees using these variants, or subsets of them, and recapitulated the known structure of the tree. Assuming a male mutation rate of 1x10^-9 per bp per year, the time depth of the tree (haplogroups A3-R) was about 101-115 thousand years, and the lineages found outside Africa dated to 57-74 thousand years, both as expected. In addition, we dated a striking Paleolithic male lineage expansion to 41-52 thousand years ago and the node representing the major European Y lineage, R1b, to 4-13 thousand years ago, supporting a Neolithic origin for these modern European Y chromosomes. In all, we provide a nearly 10-fold increase in the number of Y markers with phylogenetic information, and novel historical insights derived from placing them on a calibrated phylogenetic tree.



royking said...

Finally, a Y paper that dates nodes of the tree using high resolution complete sequences! Despite the well-grounded caution of the authors in dating S116-R1b1a2a1a1b, I would tend to see the lower estimates (4300-4500 years BP using the rho methods) as more accurate on account of ancient Y DNA results in Western Europe and due to the simplicity of rho estimates that appeals to my mathematician's mind. If so, this might confirm the widespread idea that S116 expands and perhaps even occurs in the Chalcolithic/Early Bronze Age of West Europe. The implications of this estimate for the linguistics and archaeology of Western Europe, if true, are enormous.

alex demontis said...

Good morning royking, I don't know much about genetics, but I am a linguist, and I wish you would go into deeper detail about what you mean by your last words. I am very interested in the movement and spread of languages between the 5th and 3rd millennium.