March 09, 2011

Clusters Galore analysis of Henn et al. (2011) data

The great thing about researchers putting their data online, like Henn et al. (2011) did, is that they can expect anyone with a computer, a bit of knowledge, and a bit of time, to study it, analyze it, play with it, and perhaps add a little value of their own.

As soon as I realized that there were 30 populations and 587 individuals in this dataset, most of them previously unsampled Africans, I had to get my hands on it and try my Galore approach. This can be summarized as dimensionality reduction via PCA/MDS, followed by MCLUST for unsupervised clustering of unlabeled individuals, with no a priori setting of the number of clusters K. (If you want to try it, instructions here)
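For readers who want to see what the dimensionality-reduction step involves, here is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. It assumes you already have a symmetric matrix of pairwise distances between individuals; the post does not say which distance measure or software produced the MDS coordinates, so the function name `classical_mds` and the toy input are illustrative only.

```python
import numpy as np


def classical_mds(dist, n_dims):
    """Embed n points in n_dims dimensions from an (n, n) distance matrix.

    Classical MDS: double-center the squared distances, then use the
    leading eigenvectors scaled by the square roots of their eigenvalues.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]       # indices of the largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None)    # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy input (hypothetical): distances between random 2-D points.
rng = np.random.default_rng(0)
points = rng.normal(size=(12, 2))
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(distances, n_dims=2)
```

When the input distances are Euclidean, as in the toy example, the embedding reproduces them exactly up to rotation and reflection; with genetic distances the leading dimensions only approximate them.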

As I have explained before, my favorite way of using the Galore method is by iterating over the number of retained MDS dimensions, seeing the optimal K chosen by MCLUST based on the Bayesian Information Criterion, and reporting the results for the number of dimensions that produces the highest K. Considering only the first 20 dimensions, the highest K was 42 clusters, obtained with 15 retained MDS dimensions.
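The scan over retained dimensions can be sketched roughly as follows. This is not MCLUST itself: it swaps in scikit-learn's `GaussianMixture` with a single (diagonal) covariance model, whereas MCLUST compares several covariance structures. Note also that scikit-learn's BIC is defined so that lower is better, the opposite sign convention to the one MCLUST reports. The function names, the `k_max` cap, and the synthetic coordinates are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def best_k_by_bic(coords, k_max, seed=0):
    """Return the number of mixture components with the lowest scikit-learn BIC."""
    best_k, best_bic = 1, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=seed)
        gmm.fit(coords)
        bic = gmm.bic(coords)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k


def galore_scan(coords, max_dims, k_max, seed=0):
    """For each d in 1..max_dims, cluster on the first d columns of coords.

    Returns (d with the highest BIC-optimal K, that K, dict of d -> K).
    Ties are broken in favor of the smallest d.
    """
    results = {d: best_k_by_bic(coords[:, :d], k_max, seed)
               for d in range(1, max_dims + 1)}
    best_d = max(results, key=lambda d: results[d])
    return best_d, results[best_d], results


# Synthetic stand-in for MDS coordinates: three groups separated in the
# first two dimensions, followed by two pure-noise dimensions.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
signal = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])
coords = np.hstack([signal, rng.normal(size=(120, 2))])
best_d, best_k, per_dim = galore_scan(coords, max_dims=4, k_max=5)
print(best_d, best_k, per_dim)
```

Capping K at `k_max` keeps the search finite; for a dataset of several hundred individuals the cap would need to be well above the number of clusters you expect to find.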

I have placed a RAR archive of scatterplots of the first 20 dimensions here. Below you can see the first 2 dimensions, which show a triangle with vertices anchored on Tuscans, San, and the bulk of Sub-Saharan Africans.

Here are the results of the Galore analysis, showing the number of individuals from each population assigned to each cluster.
I would say that the Galore approach had remarkable success in grouping unlabeled individuals into very meaningful clusters:
  • Some populations got their own exclusive clusters (e.g., Mandenka, Tuscans, and Mada)
  • A few clusters included individuals from related populations, e.g., #12 from two different groups of San, or #26-32 of various types of North and Saharan Africans
  • Some populations were split across different clusters; I think it is instructive to see which ones were: the quite diverse San, Hadza, and Sandawe, and also the quite heterogeneous North Africans. In the latter case Arab, Berber, and Sub-Saharan ancestry probably co-exist in various proportions in individuals.
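A population-by-cluster count table of the kind summarized above can be built from two parallel lists, one of population labels and one of cluster assignments. The sketch below uses only the Python standard library; the labels `PopA`, `PopB`, `PopC` and the helper names are hypothetical and are not taken from the Henn et al. data.

```python
from collections import Counter, defaultdict


def population_by_cluster(populations, clusters):
    """Count how many individuals of each population fall in each cluster."""
    counts = defaultdict(Counter)
    for pop, cluster in zip(populations, clusters):
        counts[pop][cluster] += 1
    return counts


def exclusive_populations(counts):
    """Populations that occupy a single cluster containing nobody else."""
    members = defaultdict(set)
    for pop, per_cluster in counts.items():
        for cluster in per_cluster:
            members[cluster].add(pop)
    return sorted(
        pop for pop, per_cluster in counts.items()
        if len(per_cluster) == 1 and members[next(iter(per_cluster))] == {pop}
    )


# Hypothetical toy assignments: PopA has its own cluster, PopB is split,
# and PopC shares a cluster with part of PopB.
pops = ["PopA", "PopA", "PopB", "PopB", "PopC", "PopC"]
assigned = [1, 1, 2, 3, 3, 3]
table = population_by_cluster(pops, assigned)
for pop in sorted(table):
    print(pop, dict(table[pop]))
print("exclusive:", exclusive_populations(table))
```

The same table also makes it easy to list split populations: they are the ones whose row has more than one nonzero entry.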
I anticipate that the ~55k SNPs included in the released data will be largely compatible with the datasets included in the Dodecad Project, and while that project's focus is on Eurasian populations, the availability of such rich and varied African data will surely be welcome, and allow me to frame the ancestry of African-admixed individuals more accurately.


Lank said...

Wow. This is great!

astenb said...

Very much looking forward to some of your African centered exercises.

pconroy said...


I would love to see you add one or two more populations to this analysis, namely:
1. Austronesians
2. Papuans

It may be that there is slight Austronesian admixture in San or Xhosa populations, from neighboring Madagascar?!

Unknown said...

OK This is what I think I am seeing.

Top is the San
Bottom is Yoruba-like (Bantu representative?)
Right is European/Out of Africa

There is a nice straight line leading from Out of Africa (Tuscan) to Yoruba or a closely related group, indicating either that something like this population was the source for Out of Africa, or that Yoruba is the main admixture partner for Out of Africa back in Africa. I favour the second explanation.

On the line are:
Tunisia (closest to Tuscan)
Sahara OCC
Fulani (West Africa)
Bulala (Chad).

Presumably the first 6 are Out of Africa admixture. Tunisia maybe directly from Italy? Egypt via the Middle East, Libya from Egypt and Morocco, and the others the long route via Spain.

Perhaps the Fulani are next because they have received Spanish admixture down the West coast of Africa? I am baffled by the Bulala result, maybe Libyan admixture.

The second line is:
San (northern South Africa)
Hadza (East Africa)
Sandawe (East Africa)
Maasai (East Africa)
Fulani (West Africa)

I am baffled by the Fulani in this set. Otherwise it looks like local admixture up the east coast.

The final line is:
San (northern South Africa)
MbutiPygmy (Central Africa)
BiakaPygmy (Central Africa)
Xhosa (South East Africa, Bantu)
Fang (West Africa, near Congo river outlet)
Kongo (Congo river, Bantu)
Bamoun (Cameroon)
Yoruba (West Africa)

These folk mostly live alongside each other so I think this is all just local admixture connecting San and Yoruba via the Congo.

I don't think we are seeing anything too ancient here.

Clay said...

What does the diversity of the San population tell us?

Sebi said...

This is indeed a very rich African dataset. It's interesting to see the position of the Fulani in this MDS plot. They seem partially Berber admixed. Several physical anthropologists have noted the Caucasoid strain in this Sahelian population long before the age of modern genetics.

Sebi said...

In essence, the shared Central Saharan ancestry you are postulating between Fulanis and Berbers is Eurasian influenced. Unmixed West Africans cluster like the Mandenka from Senegal, among others.