November 29, 2010

Clusters galore in HGDP panel

For background on this type of analysis, please read:

I've taken the Stanford HGDP dataset and extracted the markers common to it and to HapMap-3, Behar et al. (2010), Rasmussen et al. (2010) and the 23andMe v2 genotyping platform, or about 500k SNPs in total (I removed C/G and A/T SNPs as a precaution and flipped strand in discordant ones to the HapMap-3 standard when it differed from that of HGDP).

I removed SNPs with less than 99% genotyping rate in any of the four data sources, and about 434k SNPs were retained. Subsequently I applied linkage disequilibrium-based pruning on the HGDP set (PLINK parameter: --indep-pairwise 50 5 0.3) resulting in a final dataset of about 177k SNPs. In all analyses of the HGDP set, I followed the recommendations of Rosenberg et al. (2006) keeping the 940 individuals in common between his 952-individual panel and the Stanford data.

Subsequently I ran multidimensional-scaling (MDS) on the 940 individual/57 population/177k SNP set in PLINK, and then I applied model-based clustering as implemented in mclust over the first 42 MDS dimensions, with a maximum number of clusters = 70. In total there were 64 clusters in the optimal solution suggested by mclust (*)

Before I give the results, it might be worth looking at the pairwise MDS scatterplots for just the first 5 dimensions:

As you can see, clusteredness emerges in different dimensions. Rather than inspecting innumerable 2D combinations visually (and indeed we should 3D, etc. as well, because clusters might emerge in 3D and higher subspaces that are not discernible in 2D projections), we let mclust iterate over k, the number of clusters, and different shapes, orientations, and volumes of clusters, using the well-known EM algorithm together with the Bayes Information Criterion to choose a good solution that maximizes detail without sacrficing parsimony.

Below you can see how many individuals are assigned to each of the 64 clusters from each of the 57 populations:

This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity. And, there are a few populations, such as Tuscans and North Italians that are not split. But, the fact that this was inferred from unlabeled individuals is remarkable.

I remember reading Rosenberg et al. (2002), "The genetic structure of human populations" (pdf) which used structure, a model-based algorithm on raw genetic data to infer the existence of 6 clusters corresponding to continental populations. How is it that so much more detail can be achieved today?

There are three reasons: First, dense genotyping data are much better than the few hundreds of microsatellites used by Rosenberg et al. (2002). Second, the use of dimensionality reduction in the form of MDS allowed us to remove most of the "noise" in the genotyping data and focus on dimensions capturing a lot of distinctions. Third, the use of a sophisticated clustering algorithm such as mclust which can adapt to clusters of different shape, size, and orientation without human input was able to produce this result. mclust is computationally expensive, but it works like a charm (in a few minutes) with a few dozen dimensions and about a thousand individuals, producing a clustering of obviously good value.

How to repeat the experiment

If anyone wants to repeat this experiment they can do it easily. After you've managed to put the HGDP data into PLINK ped/map format, say in files HGDP.ped and (or any other data for that matter), just run

> plink --cluster --mds-plot d --file HGDP

Where d is the number of dimensions you want to retain. This produces a plink.mds file in which there is a header line, and each each line after that corresponds to an individual: the individual's projection in the first d dimensions are in columns number 4 to d+3.

Then, in R, after you install and load the mclust package (see the MCLUST page for limitations on its use and licensing information), you just run:

> MDS <- read.table("plink.mds", header=T)
> maxclust <- 70
> MCLUST <- Mclust(MDS[, 4:(d+3)], G=1:maxclust)

where maxclust is the maximum number of clusters you want to consider.

Then, if you run:


you will see a table in which each line corresponds to an individual and each column to the probability that it belongs to the i-th cluster.

There's much more that you can do in R with the mclust package, but this is enough for anyone wanting to repeat the experiment in its basic form.

(*) The number of clusters in the optimal solution varied between 11 with 2 dimensions retained and 64 with 42 dimensions retained. There was a secondary maximum of 60 clusters with 30 dimensions retained; choosing more dimensions than 42 (up to 50 that I examined), also resulted in a very high number of clusters, but I've decided to keep the one with 42 dimensions and 64 clusters as it is enough to serve the purpose of this post.


  1. North Italians and Tuscans are on the same line, but also 2 French on 28. We know that more than four millions French are of Italian extraction. If the sample is representative, they would be 7%. What a pity that there aren’t Spaniards, otherwise it would be interesting to see their location.

  2. "This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity".

    Uh? After one removes too low samples (I used n=10 as treshold), the mono-clustering populations are just 17, while the multi-clustering ones are 26. Total: 43. Mono-clustering populations represent only a bare 40% of all relevant samples.

    Furthermore, some of the mono-cluster populations are clearly "tribal" and often alleged as "inbreeding", such as, in Europe: Basques, Sardinians, North Italians, Tuscans... but not French, Russian, Adigey nor Orcadian, all clear cases of multiple origins.

    Valid homogeneous clusters are also found in Africa when the populations are homogeneous (the two Pygmy groups, Mandenka, Yoruba), among some rather isolated East Asian groups (Uygur, Dai, She, Tu, Japanese), but not among more cosmopolitan groups such as Han, Mongola, Cambodians or Yakuts (at least two population layers at their origins).

    [No follow up. If you really need to contact me, follow my profile].

  3. Why were the Maasai and Ethiopians removed?

  4. Why were the Maasai and Ethiopians removed?

    They were not, because they are not in the HGDP panel to begin with.

  5. Dear dienekes,thank you for your job.
    Could i make the same analysis starting from an mds made with adegenet?
    Thank you for your patience

  6. I am not familiar with the way adegenet outputs its MDS representation.

    The MDS object in the given code is a N by D array where:

    N: number of individuals
    D: number of MDS dimensions

    As long as you are able to put your data in this format, it doesn't matter which software produces the MDS or PCA representation.

  7. Dear Dienekes,Thank you for your answer. Do you Think that i could use this method to define two (or more)cluster starting from population data instead individual data?

  8. Dear Dienekes,
    Thank you for your quickly answer.
    Do you think that I could use it with population data instead that individual?

  9. You can cluster a large number of populations instead of a large number of individuals.

  10. I have no experience of R or mclust, but I think there may be an error in the instructions. I get an error that Z is not found.

    > MDS <- read.table("plink.mds", header=T)
    > maxclust <- 70
    > MCLUST <- Mclust(Z[, 4:(d+3)], G=1:maxclust)


    > Z <- read.table("plink.mds", header=T)
    > maxclust <- 70
    > MCLUST <- Mclust(Z[, 4:(d+3)], G=1:maxclust)



  11. How do you decide on the maximum number of clusters to consider in mclust (i.e., where does maxclust=70 come from)?

  12. How do you decide on the maximum number of clusters to consider in mclust (i.e., where does maxclust=70 come from)?

    The number of inferred clusters should be lower than maxclust. So, if you set, e.g., maxclust=50, then MCLUST will give you a warning that the number of clusters occurs at Max choice, and thus it will be a good idea to increase maxclust and re-run it.

    You should make maxclust as high as your computing power allows.


Stay on topic. Be polite. Use facts and arguments. Be Brief. Do not post back to back comments in the same thread, unless you absolutely have to. Don't quote excessively. Google before you ask.