By way of introduction, here is the command that literally made me
jump from my seat:
> MCLUST <- Mclust(X,G=1:36)
Warning messages:
1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
best model occurs at the min or max # of components considered
2: In Mclust(X, G = 1:36) :
optimal number of clusters occurs at max choice
It may look like gibberish, but this is what happened when I tried to apply
Model-based clustering as implemented in the R package
mclust, over the first few dimensions of Multidimensional Scaling (MDS) of my standard 36-population, 692-individual dataset I have been using in the
Dodecad Project.
But, let's take the story from the beginning...
The basic idea
When we look at an MDS or PCA plot, like the following MDS plot of the 11 HapMap-3 populations, it is obvious that individuals form clusters.
Here are dimensions 1 and 2:
West and East Eurasians form a cluster, and Africans form an elongated cluster towards West Eurasians. Gujaratis and Mexicans overlap between West and East Eurasians.
Here are dimensions 2 and 3:
Here, the Gujarati are shown to be quite different from the Mexicans.
We can use a standard clustering algorithm such as k-means to infer the existence of these clusters. This has two benefits:
- We don't have to visually inspect an exponential number of 2D scatterplots
- We can put some actual numbers on our visual impression of the existence of clusters
Actually,
k-means is not a very good way to find clusters. For two reasons:
- You have to specify k. But, how can you know which k is supported by the data, unless you look at an exponential number of 2D scatterplots?
- k-means, using the Euclidean distance measure prefers "spherical" clusters. But, as you can see, some populations, especially recently admixed ones form elongated clusters, stretched towards their two (or more) ancestral populations.
I had previously used
mclust, a model-based clustering algorithm to infer the existence of
14 different clusters in a standard worldwide craniometric dataset. This was 6 years ago, and only recently have geneticists been able to reach that level of resolution with genomic data.
But, for assessing ancestry, genomic data are obviously much better than craniometric ones: the latter reflect both genes and environmental/developmental factors.
So, while 6 years ago I had neither the computing power nor the data to push the envelope of fine-scale ancestry inference, today that's possible.
What mclust does (in a nutshell)
mclust has many bells and whistles for anyone willing to study it, but the basic idea is this: the program iterates between different k and different "forms" of clusters (e.g., spherical or ellipsoidal) and finds the best one.
Best is defined as the one that maximizes the Bayes Information Criterion. Without getting too technical, this tries to balance the "detail" of the model (how many parameters, e.g., k) it has, with its parsimony (how conservative it is in inferring the existence of phantom clusters).
How to combine mclust with PCA or MDS
mclust does not work on 0/1 binary SNP data; it needs scalar data such as skull measurements. However, that's not a problem, because you can convert 0/1 (or ACGT) SNP data into scalar variables using either MDS or PCA.
From a few hundred thousand SNPs, representing each individual, you get a few dozen numerical values placing the individual along each of the first few dimensions of MDS or PCA.
You can then run mclust over that reduced-dimensional representation. This is exactly what I attempted to do.
Clusters galore in HapMap-3 populations
I had previously used ADMIXTURE to infer
admixture in the HapMap-3 populations, reaching K=9. So, naturally, I wanted to see whether the approach I just described could do as well as ADMIXTURE.
I used about 177k SNPs after quality-control and Linkage-disequilibrium based pruning and ran MDS as implemented in
PLINK over a set of 275 individuals, 25 from each of the 11 HapMap-3 populations. I kept 11 dimensions, equal to the number of populations.
MDS took a few minutes to complete. Subsequently I ran mclust on the 275 individuals, allowing k to be as high as 11. Thus, if there were as many clusters as populations, I wanted mclust to find them. mclust finished running in a second. Here are the results (population averages):
The software esssentially rediscovered the existence of 10 different populations in the data, but was unable to split the Denver Chinese from the Beijing Chinese. Notice also a mysterious low-frequency component in the Maasai reminiscent of that which appeared in the previous ADMIXTURE experiment.
A question might arise why most of these populations look completely unadmixed? Even the Mexicans and African Americans get their own cluster. This is due to mclust's ability to use clusters of different shape. In particular, the "best" model was the one called "VVI", which allows for diagonal clusters of varying volume. In short, the software detected the presence of the elongated clusters associated with the admixed groups.
Indeed, the approach I am describing is not really measuring admixture. It is quantifying the probability that a sample is drawn from each of a set of inferred populations. Hence it is not really suitable for recently admixed individuals, but works like a charm in guessing the population labels of unlabeled individuals.
Clusters galore in Eurasia
Let's now see what clusters are inferred in the 36-population 692-individual dataset I commonly use in the Dodecad Project. This is done with 177k, 36 MDS dimensions retained, and allowing k to be as high as 36. This is what made me jump off my seat, and since I don't have enough colors to represent it, I'll put it in tabular form:
I could hardly believe this when I saw it, but the conclusion is inescepable: dozens of distinct populations can be inferred from unlabeled data of individuals that largely correspond, by a posteriori inspection to the individuals' population labels.
UPDATE: The above table has the average probabilities for the 36 clusters, but a better way might be to look at how many individuals are assigned to each cluster from each population:
For example, out of the 28 French individuals, 23 are assigned to cluster #1 (the French-CEU cluster), and 5 to cluster #3 (the North-Italian/Spanish/Tuscan cluster).
Some interesting observations:
- Some populations (e.g., CEU and French, or Belorussians and Lithuanians) remain unsplit even at K=36.
- Some populations are split into multiple components (e.g., Sardinians into 2)
- Some mini-clusters emerge (e.g., 4 clusters in Maasai, each of them corresponding to 8% of 25 = 2 individuals). These may correspond to pairs of relatives or very genetically close individuals.
Quantifying uncertainty
Naturally, we want to be able to assess how good a particular classification is. Fortunately, this is easy to do with mclust and its uncertainty feature. Looking at my 692-individual dataset, 687 have a less than 5% uncertainty level, and 682 have less than 1%. I did not inspect these fully, but some of them are "borderline" individuals who might belong on several components, e.g., a Frenchman who could either go to the CEU-French cluster #1 (36% probability) or the North/Central Italian-Spanish cluster #3 (64% probability).
Here is a dendrogram of the 36 components:
What does it all mean?
What this means, in short, is that the day of extremely fine-scale ancestry inference has arrived. We already had premonitions of this in the ability of researchers to place individuals within a few 100km of their place of birth in Europe. Now, it is clear that model-based clustering + MDS/PCA can infer ethnic/national identity, or something quite close to it.
This is obviously just the beginning. I allowed K to vary from 1 to 36, not really hoping that the optimal number of clusters would be 36. This raises the question: more than 36?
...
UPDATE:
I have followed up on this exciting new technique in the Dodecad Project blog: