May 05, 2009

Alternate mtDNA phylogeny of N and M clades

I would be very interested in hearing what readers who are more up-to-date on mtDNA phylogenetics than I am think of this paper. Until now, we knew that West Eurasians belonged almost entirely to macro-haplogroup N, with the exception of low-frequency haplogroup M1 and a few erratics representing more recent admixture. East Eurasians, on the other hand, belonged to both macro-haplogroups N and M. What this paper seems to suggest is that haplogroup N itself has its own East vs. West structure, and the common West (or East) Eurasian haplogroups within N are phylogenetically related in addition to geographically co-existing.

J Mol Evol. 2008 Nov;67(5):465-87. Epub 2008 Oct 15.

PCA and clustering reveal alternate mtDNA phylogeny of N and M clades.

Alexe G, Satya RV, Seiler M, Platt D, Bhanot T, Hui S, Tanaka M, Levine AJ, Bhanot G.

Phylogenetic trees based on mtDNA polymorphisms are often used to infer the history of recent human migrations. However, there is no consensus on which method to use. Most methods make strong assumptions which may bias the choice of polymorphisms and result in computational complexity which limits the analysis to a few samples/polymorphisms. For example, parsimony minimizes the number of mutations, which biases the results to minimizing homoplasy events. Such biases may miss the global structure of the polymorphisms altogether, with the risk of identifying a "common" polymorphism as ancient without an internal check on whether it either is homoplasic or is identified as ancient because of sampling bias (from oversampling the population with the polymorphism). A signature of this problem is that different methods applied to the same data or the same method applied to different datasets results in different tree topologies. When the results of such analyses are combined, the consensus trees have a low internal branch consensus. We determine human mtDNA phylogeny from 1737 complete sequences using a new, direct method based on principal component analysis (PCA) and unsupervised consensus ensemble clustering. PCA identifies polymorphisms representing robust variations in the data and consensus ensemble clustering creates stable haplogroup clusters. The tree is obtained from the bifurcating network obtained when the data are split into k = 2,3,4,...,kmax clusters, with equal sampling from each haplogroup. Our method assumes only that the data can be clustered into groups based on mutations, is fast, is stable to sample perturbation, uses all significant polymorphisms in the data, works for arbitrary sample sizes, and avoids sample choice and haplogroup size bias. The internal branches of our tree have a 90% consensus accuracy. 
In conclusion, our tree recreates the standard phylogeny of the N, M, L0/L1, L2, and L3 clades, confirming the African origin of modern humans and showing that the M and N clades arose in almost coincident migrations. However, the N clade haplogroups split along an East-West geographic divide, with a "European R clade" containing the haplogroups H, V, H/V, J, T, and U and a "Eurasian N subclade" including haplogroups B, R5, F, A, N9, I, W, and X. The haplogroup pairs (N9a, N9b) and (M7a, M7b) within N and M are placed in nonnearest locations in agreement with their expected large TMRCA from studies of their migrations into Japan. For comparison, we also construct consensus maximum likelihood, parsimony, neighbor joining, and UPGMA-based trees using the same polymorphisms and show that these methods give consistent results only for the clade tree. For recent branches, the consensus accuracy for these methods is in the range of 1-20%. From a comparison of our haplogroups to two chimp and one bonobo sequences, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 +/- 14,000 years before present.
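
To get a feel for what the abstract describes, here is a toy sketch of the general idea of splitting samples along a principal component. It is not the authors' method: the paper uses consensus ensemble clustering for k = 2,3,...,kmax on 1737 complete sequences, whereas this sketch makes a single two-way split by the sign of the first principal component score. The sample names, the 0/1 mutation matrix, and the function names are all invented for illustration.

```python
# Toy sketch: split samples in two along the first principal component.
# All data below are invented for illustration; they are not from the paper.

def center(rows):
    """Subtract each column's mean so every site has mean zero."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def first_pc(centered, iters=200):
    """Approximate the leading principal axis by power iteration on X^T X."""
    n, p = len(centered), len(centered[0])
    v = [1.0 / (j + 1) for j in range(p)]  # fixed start vector keeps the run deterministic
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def split_by_first_pc(rows, names):
    """Return two groups of sample names, split by the sign of the PC1 score."""
    centered = center(rows)
    v = first_pc(centered)
    scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
    left = [name for name, s in zip(names, scores) if s < 0]
    right = [name for name, s in zip(names, scores) if s >= 0]
    return left, right

# Rows are hypothetical samples, columns are hypothetical polymorphic sites
# (1 = derived state present, 0 = absent).
toy_names = ["a1", "a2", "a3", "b1", "b2", "b3"]
toy_rows = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
]

left, right = split_by_first_pc(toy_rows, toy_names)
print(sorted([sorted(left), sorted(right)]))
```

Repeating the split inside each resulting group would give a bifurcating tree of the kind the abstract mentions, though the paper builds its tree from ensemble clusterings at increasing k rather than from recursive splits like this.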



terryt said...

"What this paper seems to suggest is that haplogroup N itself has its own East vs. West structure, and the common West (or East) Eurasian haplogroups within N are phylogenetically related in addition to geographically co-existing".

Maju and I have been arguing heatedly over this for some time, especially over the route N took to link the two centres. You may like to comment at these sites:


terryt said...

Correction. Maju and I agree that haplogroups H, V, H/V, J, T, and U came through India. It's the pre-R haplogroups, such as W and IX, that we argue about.

Maju has posted a nice mt haplogroup diagram at:

Maju said...

Does this paper scrap R? I cannot access the full text but that's what the abstract seems to suggest, with B and F (and I presume P as well), somehow taken out of R (though Western N1, X and W would remain as part of P). What happens with all the South Asian R subclades?

Anyhow, the core of the argument (from the abstract) would be that mtDNA SNPs are not true SNPs but allegedly mere convergent evolution accidents. This goes against the very definition of what is an SNP (virtually impossible to be replicated by chance in two distinct events in the time of human or even primate evolution) and casts a shadow of doubt over the whole Eurasian mtDNA tree as we know it.

All R sublineages (as defined before this paper) share two mutations at loci 12705 (coding region) and 16223 (control region). That these are truly phylogenetically meaningful had never been challenged before, so I'm terribly confused.

I wonder if that same argument could be used at other mtDNA defining loci, like those leading to M, N or, who knows?, L3 or H or whatever.

If someone can send me a copy of the full paper to lialdamiz .AT., I'd be truly thankful.

eurologist said...

To me it looks like the methods and results of this paper are solid, and much more in line with common sense than previous, rather immature works that should not even have been published.

Unfortunately, the migration figure in the addendum is (like almost always, these days) an atrocious cartoon of geographical migration misconceptions and misunderstandings.

Maju said...

I have already been sent the paper. Thanks. :)

Maju said...

I've been avidly reading the paper (thanks to eurologist for sending me a copy so fast) and I am anything but persuaded that you can define a haplogroup or clade without defining it by SNPs.

I have posted a more extended comment at Leherensuge but this is basically what I have to say: (1) how valid is mere PCA/k-means analysis for haplogroup lineage definition? and (2) how can you define haploid "clades" or "lineages" that do not even appear to share a common SNP at the root?
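
Point (2) can be made concrete with a small sketch: a clade defined the usual way should have at least one mutation carried by every member. The check below is only an illustration under stated assumptions. Positions 12705 and 16223 are the two mentioned earlier in this thread for R; every other position, the sample names, and the function name are made up.

```python
def shared_mutations(haplotypes):
    """Return the set of positions that are mutated in every sample."""
    sets = [set(positions) for positions in haplotypes.values()]
    return set.intersection(*sets) if sets else set()

# Hypothetical group whose members all carry 12705 and 16223,
# plus invented private mutations.
toy_with_root = {
    "sample_1": {12705, 16223, 101},
    "sample_2": {12705, 16223, 202, 303},
    "sample_3": {12705, 16223, 404},
}

# Hypothetical group put together by similarity alone:
# no single position is shared by all three samples.
toy_without_root = {
    "sample_4": {101, 202},
    "sample_5": {202, 303},
    "sample_6": {303, 101},
}

print(sorted(shared_mutations(toy_with_root)))
print(sorted(shared_mutations(toy_without_root)))
```

An empty result for a proposed group is the situation the question is about: the samples may cluster together, yet no mutation marks a common root.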

Additionally, as a realistic estimate for the Pan-Homo divergence is 7-8 million years, not just 5, the TMRCA estimate for L0'1 would be more on the order of 300 kya.
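
The arithmetic behind that figure can be sketched as follows, assuming the TMRCA scales linearly with the calibration date (a simplification; the paper's actual dating procedure may not rescale this simply). The 206,000-year TMRCA and the 5-million-year split are from the abstract; the function name is made up.

```python
def rescale_tmrca(tmrca_years, old_split_years, new_split_years):
    """Rescale a TMRCA linearly when the calibration date changes (assumed linear)."""
    return tmrca_years * new_split_years / old_split_years

PAPER_TMRCA = 206_000      # years, from the abstract
PAPER_SPLIT = 5_000_000    # chimp-human split assumed in the abstract

for split in (7_000_000, 8_000_000):
    rescaled = rescale_tmrca(PAPER_TMRCA, PAPER_SPLIT, split)
    print(f"split {split:,} years -> TMRCA about {round(rescaled):,} years")
```

With a 7-8 million year split this gives roughly 288,000 to 330,000 years, which is where the "order of 300 kya" comes from.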