January 17, 2012

Comparison of MCLUST with fineSTRUCTURE

Dan Lawson has written up a comparison of fineSTRUCTURE and MCLUST and a PDF with further details. Dan first talked to me about doing this comparison in December, and it's unfortunate that I didn't try my new fastIBD method in time, so it could also be included in the analysis.

There are two parts to this type of structure inference:

  • Deriving a matrix of relationships between individuals (using PLINK IBS, ChromoPainter, or fastIBD, or ...)
  • Clustering these relationships (using fineSTRUCTURE, MCLUST, or ...)

Assessing the quality of the inferred structure is tricky, since these linkage-based methods tend to infer clusters at a finer scale than the population labels. It's not easy to know what, e.g., a couple of Sardinian clusters mean without finer-level details about the origin of the different Sardinian individuals. I tend to take a pragmatic view: if clusters correspond to real-world phenomena (as the Iberian or Armenian ones do), then they are of value.
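
To make the two-part split above concrete, here is a minimal Python sketch under loud assumptions. The toy genotypes are made up. The similarity is a simple identity-by-state average written from scratch, not PLINK's, ChromoPainter's, or fastIBD's actual computation. The clustering step is a plain k-medoids with a fixed number of clusters, used only as a stand-in for fineSTRUCTURE or MCLUST. All function names are hypothetical.

```python
from itertools import combinations

def ibs_similarity(g1, g2):
    """Mean identity-by-state of two genotype vectors coded 0/1/2."""
    # Each SNP scores 1 for identical genotypes, 0.5 for a one-allele
    # difference, and 0 for opposite homozygotes.
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

def distance_matrix(genotypes):
    """Part 1: pairwise relationship matrix, expressed as 1 - IBS."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1 - ibs_similarity(genotypes[i], genotypes[j])
    return dist

def assign(dist, medoids):
    """Label each individual with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, n_iter=20):
    """Part 2 (stand-in only): cluster the matrix around k medoids."""
    n = len(dist)
    # Deterministic start: individual 0, then repeatedly the individual
    # farthest from the medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(n_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members))
                           if members else medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids)

# Made-up genotypes: three individuals from each of two toy groups.
genotypes = [
    [0, 0, 1, 2, 2, 0, 1, 0],
    [0, 0, 1, 2, 2, 0, 0, 0],
    [0, 1, 1, 2, 2, 0, 1, 0],
    [2, 2, 1, 0, 0, 2, 1, 2],
    [2, 2, 0, 0, 0, 2, 1, 2],
    [2, 1, 1, 0, 0, 2, 1, 2],
]
labels = k_medoids(distance_matrix(genotypes), k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the two parts only communicate through the matrix, either one can be swapped out independently, which is what makes a comparison of clustering methods on the same matrix possible.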

The analysis of Lawson and Falush seems to identify the main issues quite well: MCLUST is much faster and performs as well, but requires tuning of the number of dimensions; fineSTRUCTURE, on the other hand, needs no such tuning, but is slower and requires a prior (which is good or bad depending on whether you're a Bayesian or not). Both clustering algorithms perform better in the presence of linkage information than in its absence.

One additional strength of MCLUST seems to be its ability to detect clusters of varying shape, and hence to discover recently admixed populations that form such clusters in PCA/MDS space. The simulated data of Lawson & Falush assume a biological model of splits/expansions, so it is not clear how their approach would handle lateral gene flow that results in "stretched" clusters of individuals.

I would love to see many different methods evaluated on a standard real-world dataset. Running ChromoPainter/fineSTRUCTURE is computationally very expensive, but I will try my hand at the Stanford HGDP set and the No1stOr2ndDegreeRelatives subset thereof, which consists of 940 individuals. If anyone wants to try alternative methods on the same real-world set, drop me an e-mail or write a comment, and I'll link to your analysis.

PS: I also have to applaud the quick response of Lawson and Falush to my idea of comparing MCLUST and fineSTRUCTURE. It is exactly the type of "open science" that I am a strong advocate for.

1 comment:

niko said...

Hello, I am completely new to inferring population structure from genetics, but I am quite experienced with clustering, which, as I understand it, is the underlying problem here. This applies in particular to the problem you mention: that of inferring the number K of clusters (where K lies somewhere between 1 and N, the number of individuals in your study).
While clustering by fitting mixture models (like MCLUST) is useful for assessing the population mixing, if you are more interested in finding an appropriate number of clusters, then I think the dendrogram yielded by hierarchical clustering would be more interpretable.
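
A minimal Python sketch of the dendrogram idea, under stated assumptions: the distance matrix is a made-up toy, average linkage is just one of several possible linkage rules, and reading K off the largest gap between successive merge heights is a common heuristic rather than anything prescribed above. The function names are hypothetical.

```python
def average_linkage(dist):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

def suggest_k(merges):
    """Heuristic: cut the dendrogram at the largest gap between merge heights."""
    heights = [h for _, _, h in merges]
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    # After the first cut + 1 merges, len(merges) - cut clusters remain.
    return len(merges) - cut

# Toy data: five individuals placed on a line, distance = absolute difference.
positions = [0, 1, 2, 50, 51]
dist = [[abs(p - q) for q in positions] for p in positions]

merges = average_linkage(dist)
for left, right, height in merges:
    print(left, right, height)
print("suggested K:", suggest_k(merges))  # suggested K: 2
```

The printed merge heights are the heights at which branches join in the dendrogram, so a long vertical stretch with no merges is what suggests a natural number of clusters.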