There are two parts to this type of structure inference:
- Deriving a matrix of relationships between individuals (using PLINK IBS, ChromoPainter, or fastIBD, or ...)
- Clustering these relationships (using fineSTRUCTURE, MCLUST, or ...)
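To make the two steps concrete, here is a minimal sketch on toy data. It is not PLINK's, ChromoPainter's, or fineSTRUCTURE's actual implementation: the 0/1/2 genotype coding, the function names `ibs_similarity`, `ibs_matrix`, and `threshold_clusters`, and the 0.8 cutoff are all assumptions made for illustration. The clustering step is a deliberately crude stand-in (connected components above a similarity cutoff), not a model-based method.

```python
from itertools import combinations

def ibs_similarity(g1, g2):
    """Proportion of alleles shared identical-by-state between two
    genotype vectors coded as 0/1/2 copies of an allele (assumed coding)."""
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)]
    return sum(shared) / (2 * len(shared))

def ibs_matrix(genotypes):
    """Step 1: derive a symmetric matrix of pairwise relationships."""
    n = len(genotypes)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = ibs_similarity(genotypes[i], genotypes[j])
    return sim

def threshold_clusters(sim, cutoff):
    """Step 2 (toy stand-in): label the connected components of the graph
    that links individuals whose similarity is at least `cutoff`."""
    n = len(sim)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and sim[i][j] >= cutoff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Four hypothetical individuals typed at six markers.
genotypes = [
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [2, 2, 1, 2, 2, 1],
    [2, 2, 2, 2, 1, 1],
]
sim = ibs_matrix(genotypes)
labels = threshold_clusters(sim, cutoff=0.8)
print(labels)
```

The point of the separation is that either step can be swapped out independently: a different relationship matrix can feed the same clustering routine, and vice versa.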
The analysis of Lawson and Falush seems to identify the main issues quite well: MCLUST is much faster and performs as well, but requires tuning the number of dimensions; fineSTRUCTURE, on the other hand, needs no such tuning, but is slower and requires a prior (which is good or bad depending on whether or not you're a Bayesian). Both clustering algorithms perform better with linkage information than without it.
One additional strength of MCLUST seems to be its ability to detect clusters of varying shape, and hence to discover recently admixed populations that form such clusters in PCA/MDS space. The simulated data of Lawson & Falush assume a biological model of splits/expansions, so it is not clear how their approach would handle lateral gene flow that results in "stretched" clusters of individuals.
I would love to see many different methods evaluated on a standard real-world dataset. Running ChromoPainter/fineSTRUCTURE is computationally very expensive, but I will try my hand at the Stanford HGDP set and the No1stOr2ndDegreeRelatives subset thereof, which consists of 940 individuals. If anyone wants to try alternative methods on the same real-world set, drop me an e-mail or write a comment, and I'll link to your analysis.
PS: I also have to applaud the quick response of Lawson and Falush to my idea of comparing MCLUST and fineSTRUCTURE. It is exactly the type of "open science" that I am a strong advocate for.
1 comment:
Hello, I am completely new to inferring population structure from genetics, but I am quite experienced with clustering, which, as I understand it, is the underlying problem here. In particular, I mean the problem you mention: that of inferring the number K of clusters (where K lies somewhere between 1 and N, the number of individuals in your study).
While clustering by fitting mixture models (like MCLUST) is useful for assessing the population mixing, if you are more interested in finding an appropriate number of clusters, then I think the dendrogram yielded by hierarchical clustering would be more interpretable.
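As a rough illustration of the dendrogram idea, the sketch below runs average-linkage agglomerative clustering on a toy distance matrix and reads a candidate K off the largest gap between successive merge heights. The function names `average_linkage` and `suggest_k`, the one-dimensional toy positions, and the largest-gap rule are all assumptions for illustration; the gap rule is only one heuristic for cutting a tree, not a recommendation from the comment above.

```python
def average_linkage(dist):
    """Agglomerative clustering with average linkage.
    Returns the merge history as (members_a, members_b, height) tuples."""
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = keys[x], keys[y]
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

def suggest_k(merges):
    """Heuristic: cut the tree just below the largest jump in merge height."""
    heights = [h for _, _, h in merges]
    if len(heights) < 2:
        return 1  # too few merges to compare heights
    n = len(heights) + 1
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    biggest = max(range(len(gaps)), key=gaps.__getitem__)
    # biggest + 1 merges happen below the cut, leaving this many clusters.
    return n - (biggest + 1)

# Five hypothetical individuals placed on a line; distance = absolute difference.
positions = [0.0, 1.0, 10.0, 11.0, 20.0]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = average_linkage(dist)
k = suggest_k(merges)
print(k)
```

On real data the merge heights rarely separate this cleanly, so the full list of heights (or a plotted dendrogram) is worth inspecting rather than trusting a single suggested K.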