December 12, 2010

Human genetic variation: 124+ clusters with the Galore approach

The following analysis uses the same 2,230-individual/139-population dataset described here, analyzed with the Clusters Galore method that was introduced here, refined here, and previously applied to subsets of this data (HGDP and Behar et al. (2010)).

In short, this method exploits the clusteredness of individuals along different dimensions of the MDS representation of dense genotypic data. It uses a powerful model-based clustering algorithm (MCLUST) that can infer the existence of clusters of different size, shape, and orientation in the MDS space, and which automatically optimizes the Bayesian Information Criterion (BIC), balancing detail against parsimony.

The only parameter that I need to specify to MCLUST is the number of MDS dimensions to retain (for a more detailed analysis, see here), as extra dimensions may add "clusteredness" but also noise. To decide how many dimensions to retain, I run MCLUST empirically with different numbers of dimensions (from 2 to 50).

Below is the number of clusters that MCLUST infers as maximizing the Bayesian Information Criterion, as a function of how many MDS dimensions were retained. The maximum (124 clusters) was attained with 18 dimensions retained. I allowed as many as 150 clusters to be considered.
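As a rough illustration of the scan described above (not the actual R/MCLUST code used for this post), here is a minimal Python sketch. It assumes a hypothetical callback fit_bic(d, k) that returns the BIC of a k-cluster model fitted on the first d MDS dimensions, with larger values being better, as in MCLUST's convention. The toy_bic function is a made-up stand-in so that the example runs, and breaking ties in favor of fewer dimensions is my own assumption.

```python
from typing import Callable, Dict, Iterable, Tuple


def best_k_by_bic(bic_by_k: Dict[int, float]) -> int:
    """Return the number of clusters with the largest BIC (larger is better)."""
    return max(bic_by_k, key=lambda k: bic_by_k[k])


def scan_dimensions(
    fit_bic: Callable[[int, int], float], dims: Iterable[int], max_k: int
) -> Dict[int, int]:
    """For each number of retained dimensions, record the BIC-optimal K."""
    k_by_dim = {}
    for d in dims:
        bic_by_k = {k: fit_bic(d, k) for k in range(1, max_k + 1)}
        k_by_dim[d] = best_k_by_bic(bic_by_k)
    return k_by_dim


def pick_dimensions(k_by_dim: Dict[int, int]) -> Tuple[int, int]:
    """Pick the dimensionality that yields the most clusters.

    Ties go to the smaller number of dimensions (an assumption of this sketch).
    """
    d = max(k_by_dim, key=lambda dim: (k_by_dim[dim], -dim))
    return d, k_by_dim[d]


def toy_bic(d: int, k: int) -> float:
    """Hypothetical stand-in for a real model fit; NOT real data."""
    peak = 10 - abs(d - 5)  # pretend 5 dimensions support the most clusters
    return -float((k - peak) ** 2)


k_by_dim = scan_dimensions(toy_bic, range(2, 9), max_k=15)
best_dim, best_k = pick_dimensions(k_by_dim)
print(best_dim, best_k)  # 5 10
```

In the real analysis the scan would run over 2 to 50 dimensions with up to 150 clusters, and fit_bic would wrap an actual model-based clustering fit.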


The 124 clusters

The 2,230 individuals are assigned to 124 clusters. If we group them, a posteriori, by their populations, the result is a very sparse 139 by 124 matrix, in which the entry in row i and column j is the number of individuals from population i (of the 139) assigned to cluster j (of the 124).


Alternatively, we can have an array of the same size with the percentage of individuals assigned to each cluster.


I won't even bother to post a screenshot of this table, as it is huge (17,236 elements) and very sparse (96.5% empty), which again confirms the impression that the algorithm is able to discover the population structure effectively.
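For readers who want to build such tables themselves, here is a minimal Python sketch of the population-by-cluster cross-tabulation, its percentage version, and the sparsity calculation. The labels and assignments below are made-up toy data, and I am assuming that the percentages are taken within each population (row).

```python
from collections import Counter
from typing import List, Sequence, Tuple


def crosstab(
    populations: Sequence[str], clusters: Sequence[int]
) -> Tuple[List[str], List[int], List[List[int]]]:
    """Count individuals per (population, cluster) pair."""
    pops = sorted(set(populations))
    cls = sorted(set(clusters))
    counts = Counter(zip(populations, clusters))
    table = [[counts[(p, c)] for c in cls] for p in pops]
    return pops, cls, table


def row_percentages(table: List[List[int]]) -> List[List[float]]:
    """Convert counts to percentages within each population (row)."""
    result = []
    for row in table:
        total = sum(row)
        result.append([100.0 * v / total if total else 0.0 for v in row])
    return result


def sparsity(table: List[List[int]]) -> float:
    """Fraction of cells in the table that are empty (zero)."""
    cells = sum(len(row) for row in table)
    empty = sum(1 for row in table for v in row if v == 0)
    return empty / cells


# Toy example: five individuals, two populations, three clusters.
toy_pops = ["Greek", "Greek", "Greek", "Sardinian", "Sardinian"]
toy_clusters = [1, 1, 2, 3, 3]
pops, cls, table = crosstab(toy_pops, toy_clusters)
print(table)            # [[2, 1, 0], [0, 0, 2]]
print(sparsity(table))  # 0.5
print(139 * 124)        # 17236, the size of the full table in this post
```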

K=124 is only the beginning

If you've followed previous Clusters Galore analyses, you will note that some populations that belong to the same cluster in this one (e.g., Sephardic and Ashkenazi Jews) were split in a previous one.

In the Dodecad blog, I also showed how Assyrians and Armenians, who have never been split before by MCLUST, can nonetheless be very nicely distinguished from each other if one does not rely on the Bayesian Information Criterion (BIC) to guide the choice of K, but instead does an analysis that includes only them, with K=2.

It is important not to deify the BIC, and to consider it as a sort of a rough guide that tries to discover as many clusters as can be supported by the data. It's always possible that MCLUST infers some phantom clusters, e.g., between outliers. It's also possible that the BIC-optimal number of clusters misses some that actually exist.

A concrete example: splitting cluster #4

Cluster #4 in this analysis includes 2 of 10 Greeks, all 12 South Italians/Sicilians, 7 of 7 Ashkenazi Jews of the Dodecad Project, 17 of 21 Ashkenazi Jews from Behar et al. (2010), 12 of 16 Morocco Jews, and 18 of 19 Sephardic Jews from the same paper.

Yet, in the previous analyses, I was able to distinguish between almost all of these groups. Now, I will show that this is indeed possible, and that it's a good idea to follow up on clusters that encompass multiple populations, to uncover structure that may exist in them and that the BIC-based optimization may miss.

Now, let's look at Greeks, South Italians/Sicilians, Ashkenazi, Morocco, and Sephardic Jews in a regional analysis. Here are the first two dimensions of the regional MDS:

Notice that dimension 1 splits a couple of Moroccan Jews from the rest; I'd venture that these are probably close relatives in the genetic sense. Let's apply MCLUST to these first 2 dimensions; 4 clusters are inferred, as we might guess by looking at the figure:

As you can see, Greeks, South Italians/Sicilians and Sephardic Jews fall in cluster #1, Morocco Jews in #2, Ashkenazi Jews in #3, and the 2 possibly related Morocco Jews in #4.

But, that is not the end of the story. The power of this approach is that we don't have to rely on our eyes to infer clusters. Let's try, instead of using only 2 dimensions, to vary the number of dimensions and see how the number of clusters inferred varies:

The number of clusters inferred can be as high as 9, with 7 MDS dimensions retained. Even with 4 MDS dimensions retained, we get 8 clusters. Apparently, our monolithic cluster #4 has a lot of structure in it. So, let's look at what types of clusters are inferred with 7 MDS dimensions retained and 9 clusters:

There you have it: Cluster #1 is Greek/South Italian/Sicilian, and all the Jewish groups belong to largely disjoint sets of clusters, with even some within-group structure uncovered as well.

Rolling back

But, this is not the end of the story. If we don't prefer this level of detail, we can roll back, and examine the individuals with a smaller number of clusters. Here is what we get with the same 7 dimensions, but this time we fix 4 clusters.

There you go: cluster #1 is Greek/South Italian/Sicilian, cluster #2 is Sephardic, cluster #3 is Ashkenazi, and cluster #4 is Moroccan Jewish.

Conclusion

This is the first time, as far as I know, that it has been shown that in excess of 100 clusters with strong correspondence to actual populations can be inferred from genotypic data of unlabeled humans.

Drilling down into the populations participating in cluster #4 has shown that it is possible to increase this number by as much as an order of magnitude. Or, if you prefer, you can look at individuals at different levels of detail, to see both how populations group together and whether they have extremely fine-scale internal structure.

This type of analysis is very easy to do, and I encourage everyone to follow the instructions and try it.

25 comments:

Gioiello said...

What have you demonstrated by this? Of course Ashkenazi Jews don't have only a "Southern European component", which you find in Southern Italians/Sicilians and Greeks (anyway, the Greek component cannot be 100%, as you would probably like, if you know history: it could also be 0%), but also many other European components and probably North African ones, etc., and this carries Ashkenazim a little far, very little indeed.

Dienekes said...

As I've mentioned before, this type of analysis is not an admixture test, so it doesn't really try to estimate admixture in populations. It does, however detect population distinctiveness, and this might be due to a number of factors, such as isolation and/or a unique pattern of admixture.

Jack said...

I am still trying to disentangle my neurons...

mikej2 said...

The MCLUST is as good as the input data, reminding about an old truth in computing. It could be good if the MDS data and dimensions are reasonable. Anyway, MDS never represent the whole data. What I am waiting is a method that uses directly IBS-data for clustering without any modelling with data losses. Until that straightforward IBS-comparisons are only good for me.

George said...

Interesting to see that in this analysis Greeks fit into three clusters. I would expect the one who fits into the Romanian cluster to be a Vlach, and the two who cleave to Sicily Cretans, whereas the mainlanders would cluster with Tuscans. Is this so?

Lugus said...

I read this blog casually on a regular basis; as a non-geneticist, I don't really understand how your snp data is "coded" for statistical analysis, nor do I understand how such coding translates into principal components (what do the axes represent?). Also, is mclust utilizing pca? A simple explanation or a relevant paper would be much appreciated and would greatly expand my appreciation of this blog.

eurologist said...

Would be interesting to do the same with cluster D. Germans and Hungarians often come out very close together, and then you have the Utahns, Orcadians, half of French, and 1/3 of Scandinavians in that mix.

I have no doubt that they can be separated as well, but it would be interesting to see how easily and cleanly.

Dienekes said...

The MCLUST is as good as the input data, reminding about an old truth in computing. It could be good if the MDS data and dimensions are reasonable. Anyway, MDS never represent the whole data. What I am waiting is a method that uses directly IBS-data for clustering without any modelling with data losses. Until that straightforward IBS-comparisons are only good for me.

Ergo, you don't need the whole data to cluster unlabeled individuals into clusters that largely correspond to their population labels.

You also seem to misunderstand what clustering does. Clustering always prefers certain aspects of the data, namely the aspects in which individuals form detectible blobs.

A good clustering algorithm will identify these aspects if it is "given all the data", but it is just as likely to be lost in the noise. This is quite visible in my experiments when increasing the number of MDS dimensions retained does not, in general, improve the algorithm's ability to detect clusters.

So, this approach is quite good because it helps MCLUST by giving it aspects of the data (the first few dimensions) that capture most of the variation of the data, and that are known (empirically) to present clusters.

Dienekes said...

I read this blog casually on a regular basis; as a non-geneticist, I don't really understand how your snp data is "coded" for statistical analysis, nor do I understand how such coding translates into principal components (what do the axes represent?).


MDS is a technique that takes a distance matrix between all individuals (1-IBS distance matrix) and projects the individuals into a D-dimensional metric space. By looking at the distances in that D-dimensional space, you can reconstruct (with loss) their original distances. The good thing is that individuals tend to be clustered in the first few dimensions...

http://dienekes.blogspot.com/2010/12/human-genetic-variation-first-50.html

... and hence by doing this you are jump-starting any clustering algorithm you apply to the data.

Also, is mclust utilizing pca? A simple explanation or a relevant paper would be much appreciated and would greatly expand my appreciation of this blog.

Mclust is not utilizing PCA itself. It is applied on data generated by MDS (which is a technique similar to PCA).
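To make the "coding" question concrete, here is a minimal Python sketch of how a 1-IBS distance matrix might be computed. It assumes that each genotype is coded as the count (0, 1, or 2) of one of the two alleles at each SNP, with None for a missing call, and that IBS similarity is the average proportion of alleles shared over the SNPs genotyped in both individuals. The function names and toy genotypes are made up for illustration; the resulting matrix is what one would hand to an MDS routine.

```python
from typing import List, Optional, Sequence

Genotype = Sequence[Optional[int]]  # 0, 1, or 2 copies of an allele; None = missing


def ibs_distance(g1: Genotype, g2: Genotype) -> float:
    """Return 1 - IBS, using only SNPs genotyped in both individuals.

    Two individuals share 2, 1, or 0 alleles at a SNP when their allele counts
    differ by 0, 1, or 2, so the per-SNP distance is |a - b| / 2.
    """
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        raise ValueError("no SNPs genotyped in both individuals")
    return sum(abs(a - b) for a, b in shared) / (2 * len(shared))


def distance_matrix(genotypes: Sequence[Genotype]) -> List[List[float]]:
    """Symmetric matrix of pairwise 1 - IBS distances, with a zero diagonal."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = ibs_distance(genotypes[i], genotypes[j])
    return dist


# Toy genotypes for three individuals at four SNPs (not real data).
toy_genotypes = [
    [0, 1, 2, 2],
    [0, 1, 2, 2],
    [2, 1, 0, None],
]
dist = distance_matrix(toy_genotypes)
print(dist[0][1])  # 0.0 (identical genotypes)
```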

Dodecad Project said...

Would be interesting to do the same with cluster D. Germans and Hungarians often come out very close together, and then you have the Utahns, Orcadians, half of French, and 1/3 of Scandinavians in that mix.

http://tinypic.com/r/be91eb/7

5 MDS dimensions, 13 clusters

eurologist said...

Thanks - but it would have been better without the Lithuanians, Belorussians, and Russians for the question I asked. On that matrix, the populations I mentioned are not well distinguished, at all - due to introducing yet more variables. The results are almost trivial: Germans group with French, Scandinavians, and Hungarians. Who would have thought...

Of course, the very low number of Germans compared to others is a problem, as well.

David said...

Hey Dienekes, can you post a phylo tree of the North Euro clusters in that last image?

I'd run the "clusters galore" on my own data sets, but I can't do so now until the New Year.

Dienekes said...

Hey Dienekes, can you post a phylo tree of the North Euro clusters in that last image?

http://tinypic.com/r/be91eb/7
http://tinypic.com/r/262tufn/7

Dienekes said...

Interestingly, the two Hungarians who form cluster #11, which is the most divergent one, are the ones who have a bit of the "Altaic" component in the Hungarian "population portrait":

http://dodecad.blogspot.com/2010/11/admixture-analysis-of-eurasian.html

mikej2 said...

Dienekes "So, this approach is quite good because it helps MCLUST by giving it aspects of the data (the first few dimensions) that capture most of the variation of the data, and that are known (empirically) to present clusters."

I am not trying to prove that there could be a better way, but I simply don't understand why I see so many different results on MDS-plots using ALMOST SAME data. It can be explained by a comment that it is because the data is different, but then we agree that results are not trustworthy. Hey, I am only saying that I have to wait until I can feel confident about what I see (I am not blaming you), because now I cannot.

Dienekes said...

but I simply don't understand why I see so many different results on MDS-plots using ALMOST SAME data. It can be explained by a comment that it is because the data is different, but then we agree that results are not trustworthy.

I have no idea what "different results" you see. Like I said, everyone is free to try their hand at this with the publicly available datasets. And, I certainly do not agree that "results are not trustworthy".

Dean said...

Dienekes, would it be a good idea to ask Dodecad participants to list their geographic origin within their country? If you decide to continue doing Dodecad and your database grows, it would be interesting to see different patterns within countries. This can be done without revealing the identities of the participants.

Jack said...

Now that I think about it, why are Sicilians and Southern Italians pooled together?
In other studies seen on this website the two were separate. I wonder if next time these two populations could be analyzed separately.
How about Serbs and Cretans if the latter are not included in the Greek sample.

Dienekes said...

I may split Sicilians from South Italians when I have enough numbers.

horacioh said...

Foreseeable and very nice plot also.
Pre-Ashkenazim Jews belonged at one of the three nucleus of Jewish ancient populations, that evolving the called "Syrian-European nucleus" (Greek and Roman times of profuse and lavish proselytism with genetic inlays) like Sephardim are still today.
The "Babylonian and Persian nucleus" and "The Coptic nucleus" are the others of these three Ancient centers.

The Ashkenazim Jews Europeans Components (around 35 to 50%), are, a half from South Europe acquired in ancient times, while the named “Hellenistic Proselytism” in Greek, Anatolia and Rome including Women and Men (“mtDNA” and “Y” markers) as well.
The other half: a 1/4 of the Whole DNA markers was shared and inlaying with E. Europe and host populations along the Galut or Diaspora life also Y and mtDNA from these Jews Kazhars descendant as well. The other 1/4 part was mainly from converted women and it could be seen in the mtDNA almost exclusively.
The West Asian markers represents the time when A. J. were living in E. Europe and the contact with Turkish Jews Khazars – a mixed Jews people with West and Central Asia ancient components and I call "Medieval Age Four Nucleus East Europe" - and the inbred into religious restrictions in Middle Ages, and Modern Era, that was the norm, modifying the profile patterns from a “Coptic” - “European Syrian” Jewish ancient mix -Alexandria and other M.E.- returning and resulting take in part to a more "South East European Syrian like". We need to study and compared the DNA markers of Coptic Christians from Egypt, with Ashkenazim and Ethiopians and others groups of Jewish and non Jewish origin and too bones or teeth of Jews graves of Alexandria and others ancient Jews communities, because the losing and take up of haplotypes revealed ancient Egypt and Abyssinian components in moderns A.J. like mtDNA N1b (12% A.J.and 48% Ethiopians Jews and only 0.2% or minor in whole Europe) or “Y” DNA marker E3b1 Ashkenazim have 24% more than Sephardim 19%-20% as well. Note that “East African Y chromosomes in haplogroup E3b1-M78 which is abundant (38%), and may have originated in Ethiopia (Cruciani et al. 2004; Luis et al. 2004). East Africans E3b1 may have carried the latter to Egypt and, farther, to Europe via the Levantine corridor.

The Jews pre Ashkenazim arrived to Rome and Tuscany from Alexandria and others M. East regions mainly when the Muslims arrived to Egypt en the VII century AE, and from there to the North, they were heading to the Rhine river to the nowadays Lorena and Alsace, they got to oriental Europe where they mixed and got in contact with Khazars Jews.
Remember that the hyperhaploydia present in Ashkenazim is only compatible with a most ancient population about 8500 years -not beyond 1200 years like they are-, or a mixed with people along the Diaspora life without religious restriction that is not the case.
The A.J. hyperhaploydia and heterozygosis, - practically almost absent in Sephardic - as well a great L.D. That could cluster these A.J. populations everywhere you want (not common in isolated population, or for mtDNA coming in great rate from host population, and endogamy practice that Ashkenazim hold), That is showed and explained easily by MDS,at four Dimensions and more visible with a number samples over 400 for each population, the same for PCA.

pconroy said...

Dienekes,

I'd be very interested in how the Sicilians fare after a split from South Italians, as apart from Calabrians, I think the Sicilians have more Semitic ancestry and North African ancestry - at least this is what I'm seeing for the Sicilian samples I submitted.

Also, are you still assigning a population-level grouping when you have 5 samples from that group? If so, how many more do you need to form an Irish group, as I submitted 3 Irish samples?

Dienekes said...

The S_Italian_Sicilian sample doesn't seem to split along that direction for the time being.

I need 1 more Irish.

pconroy said...

Also, was the one South_Italian_Sicilian who clusters with the Sephardic Jews, submitted by me?

pconroy said...

I'd love to see an analysis of Irish, Scottish, English, Welsh, French, Netherlands, Belgium, Norway, Denmark - with maybe Sweden, Germany and Finland thrown in - to get a better understanding of who the Anglo-Saxons were, or were not?!

MikeNassau said...

If Mclust can distinguish Moroccan Jews from Sephardic Jews, could it be used to determine the source of the Mediterranean component in the Melungeon population? The presence of Mediterranean ancestry is confirmed by some Melungeons being diagnosed with Familial Mediterranean Fever. Some Melungeons believe this is the traditionally claimed Portuguese, but there are also claims of Sephardic Jewish or Crypto-Jewish ancestry, or Moorish or Morisco, or Turkish or Spanish.