November 27, 2010

Clusters galore: extremely fine-scale ancestry inference

By way of introduction, here is the command that literally made me jump from my seat:
> MCLUST <- Mclust(X,G=1:36)
Warning messages:
1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
best model occurs at the min or max # of components considered
2: In Mclust(X, G = 1:36) :
optimal number of clusters occurs at max choice
It may look like gibberish, but this is what happened when I tried to apply Model-based clustering as implemented in the R package mclust, over the first few dimensions of Multidimensional Scaling (MDS) of my standard 36-population, 692-individual dataset I have been using in the Dodecad Project.

But, let's take the story from the beginning...

The basic idea

When we look at an MDS or PCA plot, like the following MDS plot of the 11 HapMap-3 populations, it is obvious that individuals form clusters.

Here are dimensions 1 and 2:
West and East Eurasians form a cluster, and Africans form an elongated cluster towards West Eurasians. Gujaratis and Mexicans overlap between West and East Eurasians.

Here are dimensions 2 and 3:
Here, the Gujarati are shown to be quite different from the Mexicans.

We can use a standard clustering algorithm such as k-means to infer the existence of these clusters. This has two benefits:
  • We don't have to visually inspect an exponential number of 2D scatterplots
  • We can put some actual numbers on our visual impression of the existence of clusters
Actually, k-means is not a very good way to find clusters. For two reasons:
  • You have to specify k. But, how can you know which k is supported by the data, unless you look at an exponential number of 2D scatterplots?
  • k-means, using the Euclidean distance measure prefers "spherical" clusters. But, as you can see, some populations, especially recently admixed ones form elongated clusters, stretched towards their two (or more) ancestral populations.
I had previously used mclust, a model-based clustering algorithm to infer the existence of 14 different clusters in a standard worldwide craniometric dataset. This was 6 years ago, and only recently have geneticists been able to reach that level of resolution with genomic data.

But, for assessing ancestry, genomic data are obviously much better than craniometric ones: the latter reflect both genes and environmental/developmental factors.

So, while 6 years ago I had neither the computing power nor the data to push the envelope of fine-scale ancestry inference, today that's possible.

What mclust does (in a nutshell)

mclust has many bells and whistles for anyone willing to study it, but the basic idea is this: the program iterates between different k and different "forms" of clusters (e.g., spherical or ellipsoidal) and finds the best one.

Best is defined as the one that maximizes the Bayes Information Criterion. Without getting too technical, this tries to balance the "detail" of the model (how many parameters, e.g., k) it has, with its parsimony (how conservative it is in inferring the existence of phantom clusters).

How to combine mclust with PCA or MDS

mclust does not work on 0/1 binary SNP data; it needs scalar data such as skull measurements. However, that's not a problem, because you can convert 0/1 (or ACGT) SNP data into scalar variables using either MDS or PCA.

From a few hundred thousand SNPs, representing each individual, you get a few dozen numerical values placing the individual along each of the first few dimensions of MDS or PCA.

You can then run mclust over that reduced-dimensional representation. This is exactly what I attempted to do.

Clusters galore in HapMap-3 populations

I had previously used ADMIXTURE to infer admixture in the HapMap-3 populations, reaching K=9. So, naturally, I wanted to see whether the approach I just described could do as well as ADMIXTURE.

I used about 177k SNPs after quality-control and Linkage-disequilibrium based pruning and ran MDS as implemented in PLINK over a set of 275 individuals, 25 from each of the 11 HapMap-3 populations. I kept 11 dimensions, equal to the number of populations.

MDS took a few minutes to complete. Subsequently I ran mclust on the 275 individuals, allowing k to be as high as 11. Thus, if there were as many clusters as populations, I wanted mclust to find them. mclust finished running in a second. Here are the results (population averages):
The software esssentially rediscovered the existence of 10 different populations in the data, but was unable to split the Denver Chinese from the Beijing Chinese. Notice also a mysterious low-frequency component in the Maasai reminiscent of that which appeared in the previous ADMIXTURE experiment.

A question might arise why most of these populations look completely unadmixed? Even the Mexicans and African Americans get their own cluster. This is due to mclust's ability to use clusters of different shape. In particular, the "best" model was the one called "VVI", which allows for diagonal clusters of varying volume. In short, the software detected the presence of the elongated clusters associated with the admixed groups.

Indeed, the approach I am describing is not really measuring admixture. It is quantifying the probability that a sample is drawn from each of a set of inferred populations. Hence it is not really suitable for recently admixed individuals, but works like a charm in guessing the population labels of unlabeled individuals.

Clusters galore in Eurasia

Let's now see what clusters are inferred in the 36-population 692-individual dataset I commonly use in the Dodecad Project. This is done with 177k, 36 MDS dimensions retained, and allowing k to be as high as 36. This is what made me jump off my seat, and since I don't have enough colors to represent it, I'll put it in tabular form:

I could hardly believe this when I saw it, but the conclusion is inescepable: dozens of distinct populations can be inferred from unlabeled data of individuals that largely correspond, by a posteriori inspection to the individuals' population labels.

UPDATE: The above table has the average probabilities for the 36 clusters, but a better way might be to look at how many individuals are assigned to each cluster from each population:


For example, out of the 28 French individuals, 23 are assigned to cluster #1 (the French-CEU cluster), and 5 to cluster #3 (the North-Italian/Spanish/Tuscan cluster).

Some interesting observations:
  1. Some populations (e.g., CEU and French, or Belorussians and Lithuanians) remain unsplit even at K=36.
  2. Some populations are split into multiple components (e.g., Sardinians into 2)
  3. Some mini-clusters emerge (e.g., 4 clusters in Maasai, each of them corresponding to 8% of 25 = 2 individuals). These may correspond to pairs of relatives or very genetically close individuals.
Quantifying uncertainty

Naturally, we want to be able to assess how good a particular classification is. Fortunately, this is easy to do with mclust and its uncertainty feature. Looking at my 692-individual dataset, 687 have a less than 5% uncertainty level, and 682 have less than 1%. I did not inspect these fully, but some of them are "borderline" individuals who might belong on several components, e.g., a Frenchman who could either go to the CEU-French cluster #1 (36% probability) or the North/Central Italian-Spanish cluster #3 (64% probability).

Here is a dendrogram of the 36 components:



What does it all mean?

What this means, in short, is that the day of extremely fine-scale ancestry inference has arrived. We already had premonitions of this in the ability of researchers to place individuals within a few 100km of their place of birth in Europe. Now, it is clear that model-based clustering + MDS/PCA can infer ethnic/national identity, or something quite close to it.

This is obviously just the beginning. I allowed K to vary from 1 to 36, not really hoping that the optimal number of clusters would be 36. This raises the question: more than 36?

...

UPDATE:

I have followed up on this exciting new technique in the Dodecad Project blog:

12 comments:

Michael said...

Component 15 looks like a Neolithic Nostratic signature? Wow, great work!

Dienekes said...

Don't get carried away, #15 has the 2 Romanian Gypsies I discovered earlier.

Fanty said...

This is very interesting and promising! Good work Dienekes! :-)

mcdonald said...

Theprogram you describe is a clustering program, which is separate from an assigment program. Since you
have pre-defined populations, assigment is easy. Just calculate the
mahalanobis distance from the test point to
the center of each cluster, and choose the smallest one.

My program has always been able to accurately assign people to single populations if they are fact unmixed.
(It uses a simpler distance measure, since my clusters are pruned to be more or less round, but doing the full calculation only bothers the computer.)

The problem, which has no guaranteed correct solution, is assignment of mixed individuals to a large list of populations. A small list is done well by Admixture, or a simple least squares fit to the first few PC components. The number that can be fit depends on the number of high quality PCs. Admixture generates purported ancestral populations;
my program uses present ones.

Doug McDonald

Ponto said...

This may seem a silly question but here goes.

With those ethnic group splits, for example the Sardinians, do the numbers actually represent single individuals of one sub ethnic group or composites of individuals? I am finding it difficult to accept that a small group of 28 Sardinian Italians can be split into two separate sub ethnic groups particularly as the Sardinian Italians come from one part of the island of Sardinia.

Gioiello said...

Of course 15 is the component of Sindhi in Romanians, but all the rest is very good for me:
1)North Italians and Tuscans are the same: no Etruscan=Anatolian extraction
2)Spaniards are the same with us: the Latin world is the same: 700,000,000 who speak Latin and are the dominants by a genetic point of view (R1b and subclade)
3) my theory of an Italian origin of R1b is strengthened
4)Romanians (apart the Sindhi component) are the same with us, Latins they too
5) a close genetic origin of all Western Europe
6)Ashkenazim and Sephardim have a little bit to do each other.

I think it is enough.

Dienekes said...

Ponto: I've seen the Sardinians split with ADMIXTURE to. I've also seen a paper on village-level differentiation in Sardinia. But, without any additional biographical information it's hard to tell what these two clusters mean.

Theprogram you describe is a clustering program, which is separate from an assigment program. Since you
have pre-defined populations, assigment is easy. Just calculate the
mahalanobis distance from the test point to
the center of each cluster, and choose the smallest one.


I don't know what you mean by "pre-defined populations". There are pre-defined populations only in the sense that these are the populations found in the literature, my program does not use any of the population information.

Of course it's possible to assign individuals to populations by calculating the distance of individuals to the populations' centroids. But, the question has always been: to what extent do these populations emerge as distinct entities _from the data_

Dienekes said...

1)North Italians and Tuscans are the same: no Etruscan=Anatolian extraction

You are misinterpreting what the program does. If you notice, African Americans get their own cluster, would you conclude from that there is no "African extraction" for them?

Gioiello said...

Dienekes writes: “You are misinterpreting what the program does. If you notice, African Americans get their own cluster, would you conclude from that there is no "African extraction" for them?”

Dieneke, I have everything to learn from you on this matter, but North-Italians, Tuscans and Spaniards are on the same line, then they are the same at the autosomic level, then probably the components were on average the same. The problem of the admixture is when a population is put on an own line where none is, like Ashkenazim. Then this line is a false line, due to admixture where there is nobody else.
Perhaps Tuscans had some Etruscan input from Anatolia (but it could be also the contrary: that Lemnians and other Aegeans derived from Italy, otherwise we can’t explain why Etruscan is linked to Rhaetian and Camun: you know this is my theory), but now at autosomic level it is irrelevant.

Jack said...

About Etruscans,
I just found this map:
http://paleoglot.blogspot.com/2010/07/etruscan-entry-into-italy.html
A different spin to an old question.
By tha way what are groups 4 and 5?

Aaron said...

As others mentioned Cluster 15 is interesting. I can't find any common thread between Mozambite, Pathan, Sindhi, Saudis, Sephardic, Romanians, Egyptians, Syrians....anyone?

Andrew Oh-Willeke said...

Cluster 15 may be Lower Egyptian (i.e. Coptic) or Phoenecian or "Sea People." Clusters 14 and 15 include Syrians, Jordanian, Egyptians and a significant subcomponent of Saudis in similar proportions, which could have arisen during Bronze Age Egyptian rule of the Levant or from Phoenician participation in the sea trade which spanned the Levant and North Africa.

Egypt controlled most of the Levant at one time or another, and was pretty much the sole link of the Mozambites to the rest of the world until the Phoenicians plied the seas; the Egyptians introduced pastrolism to the Mozabites.

The Pathans (i.e Pashtuns), by strong oral tradition coroborated in part by the earliest histories, claim Israeli or Coptic Egyptian Jewish convert origins, probably ca. the 17th to 7th century BCE, which the frequency of cluster 15 in modern samples suggests might not have been found in the few available ancient DNA samples simply due to random chance, and could be a minority component of the largely indigenous Pathan population.

Sephardic Jews could also have a small Coptic Egyptian contribution or could derive their Cluster 15 links indirectly from a Levantine source population. Cluster 15's absence in other Jewish populations could plausibly be due to simple sampling issues with an Ashkenzai Jewish sample of just 19 and a Cluster 15 frequency of about 5%-10% in other populations where it appears.

In the Sindhi sample, Cluster 15 could have arisen from sailors in the ancient Indian Ocean sea trade documented in the historic era, or, less likely, could reflect a component of a Harappan substrate.

Dienkes has notes that the Romanians in Cluster 15 are both gypsies, and that gypsies have roots in the same population at the Sindhi (probably around 1000 CE); the Romani language is most close to the Sinhalese language out of all the languages of India. The diasporan gypsy population was also small enough to have strong founder effects.