For background on this type of analysis, please read:
- Clusters galore: extremely fine-scale ancestry inference
- Galore analysis improved, plus K=56 results of Clusters Galore analysis for Dodecad Project members (up to DOD236)
I've taken the Stanford HGDP dataset and extracted the markers it shares with HapMap-3, Behar et al. (2010), Rasmussen et al. (2010) and the 23andMe v2 genotyping platform, about 500k SNPs in total (I removed C/G and A/T SNPs as a precaution, and flipped the strand of discordant SNPs to the HapMap-3 standard where it differed from that of HGDP).
I removed SNPs with less than a 99% genotyping rate in any of the four data sources, retaining about 434k SNPs. Subsequently I applied linkage disequilibrium-based pruning on the HGDP set (PLINK parameters: --indep-pairwise 50 5 0.3), resulting in a final dataset of about 177k SNPs. In all analyses of the HGDP set, I followed the recommendations of Rosenberg et al. (2006), keeping the 940 individuals in common between their 952-individual panel and the Stanford data.
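The filtering steps above can be sketched as PLINK commands. The file names here are hypothetical, and the per-source genotyping-rate filter is simplified to a single merged file, so this is only an outline of the procedure, not the exact commands used:

```shell
# Hypothetical input: merged.ped/merged.map holding the ~500k shared SNPs.
# Drop SNPs with >1% missing genotypes (i.e., <99% genotyping rate):
plink --file merged --geno 0.01 --make-bed --out merged_qc
# LD-based pruning with the parameters quoted above:
plink --bfile merged_qc --indep-pairwise 50 5 0.3 --out pruned
# Keep only the pruned-in SNPs for the final ~177k-SNP dataset:
plink --bfile merged_qc --extract pruned.prune.in --recode --out HGDP_final
```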
Subsequently I ran multidimensional scaling (MDS) on the 940-individual/57-population/177k-SNP set in PLINK, and then applied model-based clustering, as implemented in mclust, over the first 42 MDS dimensions, with a maximum of 70 clusters. The optimal solution suggested by mclust contained 64 clusters (*).
Before I give the results, it might be worth looking at the pairwise MDS scatterplots for just the first 5 dimensions:
As you can see, clusteredness emerges in different dimensions. Rather than inspecting innumerable 2D combinations visually (and we would also need to inspect 3D and higher subspaces, because clusters may emerge there that are not discernible in 2D projections), we let mclust iterate over k, the number of clusters, and over different shapes, orientations, and volumes of clusters, using the well-known EM algorithm together with the Bayesian Information Criterion (BIC) to choose a good solution that maximizes detail without sacrificing parsimony.
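To make the EM-plus-BIC idea concrete, here is a toy sketch in Python rather than R. It fits one-dimensional Gaussian mixtures for several candidate numbers of clusters and picks the candidate with the best BIC. The data are synthetic and the model is far simpler than mclust's (a single dimension, no covariance families), so this only illustrates the selection principle, not the actual analysis:

```python
import math
import random

# Synthetic stand-in for one MDS dimension: two well-separated clusters.
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(100)] + \
       [random.gauss(10.0, 1.0) for _ in range(100)]

def gmm_bic(data, k, iters=60):
    """Fit a 1-D k-component Gaussian mixture by EM; return its BIC (lower is better)."""
    n = len(data)
    xs = sorted(data)
    mu = [xs[int((i + 0.5) * n / k)] for i in range(k)]   # quantile initialization
    mean = sum(data) / n
    var = [sum((x - mean) ** 2 for x in data) / n] * k    # start from overall variance
    w = [1.0 / k] * k
    loglik = 0.0
    for _ in range(iters):
        # E-step: per-point responsibilities and the log-likelihood
        resp = []
        loglik = 0.0
        for x in data:
            dens = [w[j] / math.sqrt(2 * math.pi * var[j])
                    * math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(k)]
            s = sum(dens)
            loglik += math.log(s)
            resp.append([dj / s for dj in dens])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(0.05, sum(r[j] * (x - mu[j]) ** 2
                                   for r, x in zip(resp, data)) / nj)  # variance floor
    p = 3 * k - 1   # free parameters: k means, k variances, k-1 weights
    return p * math.log(n) - 2 * loglik

best_k = min(range(1, 4), key=lambda k: gmm_bic(data, k))
print("number of clusters chosen by BIC:", best_k)
```

On this toy data BIC should prefer two clusters: a third component buys almost no extra likelihood but pays the full parameter penalty.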
Below you can see how many individuals are assigned to each of the 64 clusters from each of the 57 populations:
This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding, are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity. And there are a few populations, such as Tuscans and North Italians, that are not split. But the fact that all this was inferred from unlabeled individuals is remarkable.
I remember reading Rosenberg et al. (2002), "The genetic structure of human populations", which used structure, a model-based algorithm, on raw genetic data to infer the existence of 6 clusters corresponding to continental populations. How is it that so much more detail can be achieved today?
There are three reasons. First, dense genotyping data are much better than the few hundred microsatellites used by Rosenberg et al. (2002). Second, dimensionality reduction in the form of MDS removed most of the "noise" in the genotyping data and focused the analysis on dimensions capturing a lot of the distinctions between individuals. Third, a sophisticated clustering algorithm such as mclust, which can adapt to clusters of different shape, size, and orientation without human input, was able to produce this result. mclust is computationally expensive, but it works like a charm (in a few minutes) with a few dozen dimensions and about a thousand individuals, producing a clustering of obviously good quality.
How to repeat the experiment
If anyone wants to repeat this experiment they can do it easily. After you've managed to put the HGDP data into PLINK ped/map format, say in files HGDP.ped and HGDP.map (or any other data for that matter), just run
> plink --cluster --mds-plot d --file HGDP
where d is the number of dimensions you want to retain. This produces a plink.mds file with a header line; each line after that corresponds to an individual, whose projection onto the first d dimensions appears in columns 4 to d+3.
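For illustration, the plink.mds layout can be parsed as follows. The file contents and IDs below are made up; the real file has one row per HGDP individual, with FID, IID and SOL in the first three columns followed by the MDS coordinates C1, C2, ...:

```python
# Toy stand-in for a plink.mds file with d = 2 retained dimensions.
d = 2
mds_text = """FID IID SOL C1 C2
fam1 ind1 0 0.012 -0.034
fam2 ind2 0 -0.021 0.008
"""

lines = mds_text.strip().splitlines()
header = lines[0].split()          # FID, IID, SOL, C1, ..., Cd
coords = {}
for line in lines[1:]:
    fields = line.split()
    iid = fields[1]                # individual ID is the second column
    # columns 4..d+3 (zero-indexed 3..d+2) hold the MDS coordinates
    coords[iid] = [float(x) for x in fields[3:3 + d]]

print(coords["ind1"])
```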
Then, in R, after you install and load the mclust package (see the MCLUST page for limitations on its use and licensing information), you just run:
> MDS <- read.table("plink.mds", header=T)
> maxclust <- 70
> d <- 42   # the number of dimensions you retained in the PLINK step
> MCLUST <- Mclust(MDS[, 4:(d+3)], G=1:maxclust)
where maxclust is the maximum number of clusters you want to consider.
Then, if you run:
> MCLUST$z
you will see a table in which each row corresponds to an individual and each column to the probability that the individual belongs to the corresponding cluster.
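A hard cluster assignment is just the column with the highest probability in each row (mclust also reports this directly as the classification component of the fitted object). A toy sketch of that step, with a made-up probability table:

```python
# Toy posterior-probability table: 3 individuals x 2 clusters (rows sum to 1).
z = [
    [0.95, 0.05],
    [0.10, 0.90],
    [0.60, 0.40],
]
# Assign each individual to the most probable cluster (1-based labels).
assignments = [max(range(len(row)), key=row.__getitem__) + 1 for row in z]
print(assignments)  # [1, 2, 1]
```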
There's much more that you can do in R with the mclust package, but this is enough for anyone wanting to repeat the experiment in its basic form.
(*) The number of clusters in the optimal solution varied between 11 (with 2 dimensions retained) and 64 (with 42 dimensions retained). There was a secondary maximum of 60 clusters with 30 dimensions retained; retaining more than 42 dimensions (up to the 50 I examined) also resulted in a very high number of clusters, but I decided to keep the solution with 42 dimensions and 64 clusters, as it is enough to serve the purpose of this post.