November 30, 2011

ChromoPainter and fineSTRUCTURE

The paintmychromosomes.com site documents the software fairly well, although parts of it are still under construction. Link to paper and supporting information.

The authors use linkage information to discover fine-scale population structure. Note, however, that while haplotype information (i.e., the co-inheritance of marker states at the local genomic level) does indeed provide additional information, the inference of fine-scale population structure does not depend on it: such structure can also be obtained without it.

For example, a year ago, I showed how it is possible to infer K=64 clusters from the HGDP panel without using any linkage information, but setting the maximum possible number of clusters at 70. More recently, I inferred 20 clusters in North-Central Europe alone, 42 clusters in Africa alone, or even 124 clusters in my most ambitious run yet (out of a 150 maximum considered).

It should be noted that in all these experiments, the maximum number of clusters considered plays a significant role, as does the number of PCA/MDS dimensions considered, since MCLUST only finds the optimal number of clusters within the given limits. One of these days, I will try a mega-Clusters Galore exercise, with, e.g., 100 MDS dimensions and a maximum of 250 clusters. This may take a while to run, but it will show the limits of the Clusters Galore approach.
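
To make the point about limits concrete, here is a minimal sketch of BIC-based selection of the number of clusters under a cap. It is not MCLUST itself (MCLUST is an R package that fits multivariate Gaussian mixtures); it is a toy one-dimensional Gaussian mixture fitted by EM on simulated data, and the names `fit_gmm_1d`, `bic`, and `select_k` are hypothetical helpers invented for this illustration.

```python
import math
import random


def mixture_loglik(xs, weights, means, variances):
    """Log-likelihood of xs under a 1-D Gaussian mixture."""
    total = 0.0
    for x in xs:
        dens = sum(
            w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(max(dens, 1e-300))  # guard against underflow
    return total


def fit_gmm_1d(xs, k, iters=100, var_floor=1e-2):
    """Fit a k-component 1-D Gaussian mixture by EM; return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    variances = [max(sum((x - centre) ** 2 for x in xs) / n, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            norm = max(sum(dens), 1e-300)
            resp.append([d / norm for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor,
            )
    return mixture_loglik(xs, weights, means, variances)


def bic(xs, k):
    """BIC (lower is better) for a k-component fit: -2 logL + p ln n."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return -2.0 * fit_gmm_1d(xs, k) + n_params * math.log(len(xs))


def select_k(xs, k_max):
    """Return the K in 1..k_max with the lowest BIC."""
    return min(range(1, k_max + 1), key=lambda k: bic(xs, k))


# Simulated data (an assumption of this sketch): three well-separated groups.
rng = random.Random(0)
data = [rng.gauss(centre, 1.0) for centre in (0.0, 10.0, 30.0) for _ in range(60)]

print("cap 2:", select_k(data, 2))
print("cap 6:", select_k(data, 6))
```

With a cap of 2, the procedure cannot report more than two clusters, however many groups the data contain; only a cap at or above the true number gives the criterion room to find them. The same logic applies to the number of PCA/MDS dimensions retained.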

While I do think that using haplotype information may add some extra power for ancestry inference, it should be properly compared against MCLUST over PCA/MDS, i.e., a state-of-the-art clustering algorithm that has been shown to infer fine-scale structure without using any haplotype information. The claim that such structure "is only captured by the haplotype-based approach" is premature.


(UPDATE Jan 17, 2012) Lawson and Falush have compared MCLUST with fineSTRUCTURE here.


Inference of population structure using dense haplotype data

Daniel John Lawson et al.

The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this ‘chromosome painting’ can be summarized as a ‘coancestry matrix’, which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA), and model-based approaches such as STRUCTURE, in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability, and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and identify 226 populations reflecting differences on continental, regional, local and family scales. We present multiple lines of evidence that whilst many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels, and is only captured by the haplotype-based approach.
The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/

2 comments:

Daniel Falush said...

Thank you very much for your interest! We have noticed a massive surge in interest and downloads and were wondering why. The manuscript itself should come out very soon in PLoS Genetics and hopefully we will get more.

I don't doubt that you can get good clustering with unlinked markers on real data. Genetic data is extremely informative. However, it is difficult to evaluate exactly how good performance is since there isn't any truth to compare it to. We show, for example, that there is a great deal of real structure below the label level. Moreover, many labels just are not very good, so the fact that the algorithm does not find them should not be regarded as algorithmic failure.

With the SAME clustering algorithm (fineSTRUCTURE) and simulated dataset, we find that it takes 200 regions to find structure if markers are treated as unlinked compared to 75 regions with our haplotype based approach. We also do much better in matching halves of real individuals together.

It's correct that we have not compared MCLUST with our own model-based clustering approach (fineSTRUCTURE), but I would argue that this does not bear on the question of whether our haplotype-based painting algorithm (CHROMOPAINTER) extracts more information from the data than the unlinked model, which is in almost universal use currently. Indeed, the results above suggest it is approximately equivalent to tripling the size of the genome for very dense datasets.

Thanks again!
Daniel Falush

Dienekes said...

>> However, it is difficult to evaluate exactly how good performance is since there isn't any truth to compare it to. We show, for example, that there is a great deal of real structure below the label level. Moreover, many labels just are not very good, so the fact that the algorithm does not find them should not be regarded as algorithmic failure.

That is what I have also observed using MCLUST. When I find, for example, a few sub-components within Mozabites, or Sardinians, or many other populations, there is no way of knowing what those mean, since I have no extra information about the individuals, i.e., whether they were sampled in different locations, belonged to different clans etc.

>> With the SAME clustering algorithm (fineSTRUCTURE) and simulated dataset, we find that it takes 200 regions to find structure if markers are treated as unlinked compared to 75 regions with our haplotype based approach. We also do much better in matching halves of real individuals together.

That is interesting. I really think you should give MCLUST a try. It doesn't really deal in "regions", but it is usually able to find fine-scale population structure with a very small number of SNPs, e.g., 42 clusters in an African set with about 55k SNPs.

http://dienekes.blogspot.com/2011/03/clusters-galore-analysis-of-henn-et-al.html

I routinely run trials of the MDS/MCLUST combination on ~10k genome-wide marker sets, because MDS is computed much faster on them, so they give me a good idea of the structure of a set. Invariably, a lot of the structure that is present in the full marker set is preserved even with such a small number of random SNPs.

I don't know whether I have the computational power to run your software on equivalent datasets, but I'd love for someone to try this.

And, it is also possible to estimate the admixture proportions of individuals by combining MCLUST with ADMIXTURE, as I describe here.

http://dienekes.blogspot.com/2011/10/putting-it-all-together-dracos-for-fine.html