December 15, 2010

Human genetic variation: the first ? components

This is part III of my series on human genetic variation; it is based on the same dataset as part I (Human genetic variation: the first 50 dimensions) and part II (Human genetic variation: 124+ clusters with the Galore approach).

I have run ADMIXTURE on the 139-population/2,230-individual dataset, starting from K=3 and increasing K for as long as the Bayesian Information Criterion (BIC) increases. There was a temporary dip in the BIC at K=4, which, surprisingly, I had also encountered when analyzing a worldwide craniometric dataset (for an updated version of that analysis look here).

Below is a plot of the BIC as a function of K, the number of clusters:


The above plot should clarify the ? in the post's title. The BIC seems to increase law-like up to K=15, and I have no idea when it will plateau. Certainly many clusters I've encountered in previous analyses with subsets of these populations have yet to appear, so who knows?
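The stopping rule described above (keep raising K while the BIC improves, without giving up at a single temporary dip like the one at K=4) can be sketched roughly as follows. This is only an illustration, not the script behind the plot: the function name `best_k`, the `patience` parameter, and the BIC values are all hypothetical, and the sketch assumes the convention used in this post that a higher BIC is better.

```python
def best_k(bic_by_k, patience=1):
    """Return the K with the highest BIC seen before the search stops.

    The search walks K upward and stops once the BIC has failed to beat
    the best value so far for more than `patience` consecutive steps.
    Assumes higher BIC is better (the convention used in this post).
    """
    best, best_bic, misses = None, float("-inf"), 0
    for k in sorted(bic_by_k):
        if bic_by_k[k] > best_bic:
            best, best_bic, misses = k, bic_by_k[k], 0
        else:
            misses += 1
            if misses > patience:
                break
    return best


# Hypothetical BIC values (NOT the ones in the plot): a dip at K=4,
# steady improvement, then two consecutive non-improvements at K=7 and K=8.
example = {3: 100.0, 4: 98.0, 5: 105.0, 6: 110.0, 7: 109.0, 8: 108.5, 9: 120.0}
print(best_k(example))              # tolerates the single dip at K=4
print(best_k(example, patience=0))  # gives up at the first dip
```

With `patience=0` the search would have stopped at K=3 because of the dip; allowing one miss carries it past K=4. A plateau, if it ever comes, would show up as a run of misses longer than `patience`.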

As long as I have a functional computer and enough RAM I will continue this analysis, and the updated results will be posted in this blog post (although I think that after a certain K, I may have to invent a new color palette to represent them, or resort to just posting the numbers).

K=3

At K=3, Sub-Saharan Africans, West Eurasians, and East Eurasians are distinguished.

K=4
At K=4, Native Americans get their own cluster (dark green).

K=5


At K=5, Australoids (Papuans and Melanesians) get their own cluster (pink) which shows some affinity with populations from South Asia.

K=6
At K=6, the East Eurasian cluster is split into a North Eurasian/Central Asian (light blue) one and an East Asian (pink) one.

K=7
At K=7, a South Asian (light blue) cluster emerges.

K=8
At K=8, the Caucasoid cluster is split into European-centered (orange) and West Asian-centered (light blue) components.

K=9
At K=9 the Mbuti, Biaka Pygmies, and San get their own cluster (Palaeoafricans), with the Biaka showing some admixture with other Sub-Saharan Africans.

K=10
At K=10, a cluster (light green) centered on Koryak, Chukchi, and Greenland emerges. Notice that this is also strongly represented among Athabask, much less so in Pima and Maya from Mexico, and not at all in Karitiana and Surui, the southernmost Amerindian groups. This component is probably related to Y-haplogroup C3b-P39.

It would be interesting to consider this in the light of the theory of a separate migration of Na-Dene speakers (of which Athapaskans are a part) into North America and their inferred relationship with Kets. The absence in Athabask of the "dark green" component, which is present in Kets, does not really invalidate this hypothesis, as the "dark green" component may postdate the expansion of Na-Dene speakers into the New World; however, the presence of the "light green" component in Athabask and its absence in other Amerindian groups is quite consistent with the two-migration model. No specific genetic relationship can be detected with Kets, however.

K=11
At K=11, the isolated Kalash of Pakistan get their own cluster, which also occurs at a high level in their neighbors.

K=12

At K=12, a Southeast Asian cluster (red) emerges, highest in Malay and Cambodians, and well-represented in Chinese ethnic minorities such as Dai and Lahu. Notice also that the East Asian component in Melanesians becomes "red", linking them to the Austronesians.

K=13


At K=13, a blue and a purple cluster supplant the previous West Asian cluster, with the blue one spilling to East Africa and the purple one to South Asia.

K=14
At K=14, the Karitiana, an Amerindian group from Brazil, get their own cluster (pink), which spills into other Amerindian groups, but not substantially into the more northern Pima and Athabask.

K=15

At K=15 the Papuans and Melanesians are split into beige (?) and yellow population-specific clusters. Hence, the Melanesians, or at least the Nasioi from Bougainville, from whom the HGDP sample comes, shown at previous K to be associated with both Southeast Asians and Papuans, have actually acquired a genetic distinctiveness of their own.

Notice also that the Karitiana component that appeared at K=14 has "folded back" into the Amerindian component, while a "West Asian" and a "Red Sea" component have appeared, the latter occurring in both Arabians and East Africans. As I've mentioned before, as K increases, ADMIXTURE has many roughly equiprobable choices in trying to represent the data.

At K=15 we are far from exhausting the available structure in modern humans.

Fst distances between components

Below is a table of genetic distances between the 15 inferred ancestral components:


As always, you should treat the chosen names for the components as helpful mnemonics; also, if a name used here has been used in a different ADMIXTURE analysis, with another set of populations and/or K, you should not assume that it reflects exactly the same entity.

Below is a dendrogram of hierarchical clustering of these 15 components with complete linkage. Once again, I emphasize that tree-like representations of human variation are not to be taken as anything other than a useful visualization of the data, as human populations did not evolve strictly tree-like, but have experienced lateral gene flow.

The tree shows clearly the four major divisions of mankind, which are separated quite distinctly from each other. From top to bottom: East Eurasians, West Eurasians, Australo-Melanesians, and Sub-Saharan Africans.

Once again, I emphasize that you should look at the table of Fst distances above, especially for closely related populations. For example, the Mediterranean component is joined to the Red Sea component in the dendrogram, but the table of distances shows that it is marginally closest to the North European (0.057), then equidistant to the Red Sea and West Asian ones (0.062), followed by the Indian (0.084) and the Kalash isolate (0.092). Do not rely on lossy representations like dendrograms when you can examine the actual distances themselves.
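Reading the distances directly is easy to do programmatically. The sketch below ranks the neighbors of the Mediterranean component using only the five values quoted in the paragraph above; the dictionary and function names are illustrative, and the full 15x15 table would be handled the same way, one row at a time.

```python
# Fst distances from the Mediterranean component, as quoted in the text above.
fst_from_mediterranean = {
    "North European": 0.057,
    "Red Sea": 0.062,
    "West Asian": 0.062,
    "Indian": 0.084,
    "Kalash": 0.092,
}


def ranked_neighbors(distances):
    """Return (name, distance) pairs sorted from closest to farthest.

    Ties are broken alphabetically so the order is deterministic.
    """
    return sorted(distances.items(), key=lambda item: (item[1], item[0]))


for name, fst in ranked_neighbors(fst_from_mediterranean):
    print(f"{name:15s} {fst:.3f}")
```

Unlike a dendrogram, which can show only one join per component, the sorted row keeps ties such as Red Sea and West Asian visible.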

For completeness' sake, here is also a dendrogram of the hierarchical clustering using the average linkage method:


There is some internal re-arrangement of branches within the major races, and the Amerindian population becomes detached from East Eurasians. Amerindians separated from East Eurasians fairly long ago, but their relationship to them is evidenced by the fact that they have their closest distances to East Asians and Siberians.
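The difference between complete and average linkage can be sketched on a three-component subset. This is a toy illustration in plain Python, not the software used for the dendrograms: the function names are hypothetical, and the three distances are the Mediterranean values quoted above (0.057, 0.062) plus the North European to West Asian value (0.036) that a reader cites from the same table in the comments below.

```python
from itertools import combinations

# Pairwise Fst between three components. 0.057 and 0.062 are quoted in the
# post; 0.036 is the value a commenter cites from the same table.
FST = {
    frozenset(["North European", "Mediterranean"]): 0.057,
    frozenset(["North European", "West Asian"]): 0.036,
    frozenset(["Mediterranean", "West Asian"]): 0.062,
}


def cluster_distance(a, b, linkage):
    """Distance between clusters a and b (tuples of component names)."""
    pair_dists = [FST[frozenset([x, y])] for x in a for y in b]
    if linkage == "complete":
        return max(pair_dists)                    # farthest pair decides
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(names, linkage):
    """Greedy bottom-up clustering; returns the merges as (a, b, height)."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], linkage),
        )
        merges.append((a, b, cluster_distance(a, b, linkage)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


names = ["North European", "Mediterranean", "West Asian"]
for linkage in ("complete", "average"):
    print(linkage)
    for a, b, height in agglomerate(names, linkage):
        print(f"  merge {a} + {b} at {height:.4f}")
```

With only three components both rules produce the same tree and differ only in the height of the last merge. With 15 components, those height differences can change which clusters join first, which is the kind of internal re-arrangement seen between the two dendrograms.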

Finally, here is an MDS plot of the 15 components based on the inter-component Fst distances:


The maximum Fst in humans is between the Palaeoafrican ancestral population (Pygmies and San) and the Papuan one at 0.346, with a close second, that between Palaeoafricans and Amerindians (0.333).

The average Fst between the 15 components is 0.167. Notice that these are Fst distances between inferred ancestral populations, not between extant human populations. As such, they can be expected to be somewhat higher than conventionally given Fst distances for human populations.
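As a rough sketch of how such an average is formed, assuming a plain unweighted mean over all unordered pairs of components (with 15 components that means 105 pairs): the three values below are not the full table, just the Mediterranean distances quoted earlier plus the North European to West Asian value cited in the comments, and the function name is hypothetical.

```python
from itertools import combinations

# Three-component subset of the Fst table (see the lead-in for sources).
FST = {
    frozenset(["North European", "Mediterranean"]): 0.057,
    frozenset(["North European", "West Asian"]): 0.036,
    frozenset(["Mediterranean", "West Asian"]): 0.062,
}


def mean_pairwise_fst(names, fst):
    """Unweighted mean of Fst over all unordered pairs of components."""
    pairs = list(combinations(names, 2))
    return sum(fst[frozenset(pair)] for pair in pairs) / len(pairs)


names = ["North European", "Mediterranean", "West Asian"]
print(len(list(combinations(range(15), 2))))    # pairs among 15 components
print(round(mean_pairwise_fst(names, FST), 4))  # mean for the 3-component subset
```

The subset mean is much lower than 0.167 because all three components are West Eurasian; the overall average is pulled up by the large inter-continental distances.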

However, the maximum distance also corresponds to the distance between extant populations: guided by this analysis, I carried out a separate ADMIXTURE run using Papuans and Mbuti Pygmies from the HGDP set, arriving at Fst=0.377. This is probably not the limit of genetic differentiation within our species though, as Australian Aborigines, who are one further step removed from Africa than Papuans, may be even more distant.
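For readers unfamiliar with the statistic, here is a minimal sketch of what an Fst between two populations measures. It uses the textbook definition Fst = (H_T - H_S) / H_T for biallelic loci, with two equally weighted populations and a ratio of sums across loci. This is an assumption for illustration; the estimator that ADMIXTURE itself uses is not described in this post, and the function name and allele frequencies below are hypothetical.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Fst = (H_T - H_S) / H_T over biallelic loci, two equally weighted populations.

    freqs_a, freqs_b: frequencies (0..1) of the same allele at the same
    loci in populations A and B.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need the same loci")
    h_s = 0.0  # summed mean within-population heterozygosity
    h_t = 0.0  # summed heterozygosity of the pooled population
    for p_a, p_b in zip(freqs_a, freqs_b):
        h_s += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        p_bar = (p_a + p_b) / 2
        h_t += 2 * p_bar * (1 - p_bar)
    # No variation at all in the pooled sample: define Fst as 0.
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical allele frequencies, for illustration only.
print(fst_two_populations([0.5, 0.2], [0.5, 0.2]))  # identical frequencies
print(fst_two_populations([0.1], [0.9]))            # strongly differentiated
```

Identical frequencies give an Fst of zero, and the value grows as the two populations' allele frequencies diverge.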

Downloads

For anyone interested in exploring this data further, I've made a RAR file of the ADMIXTURE plots at a better resolution, as well as the raw admixture proportions behind them.

This also includes a file of Fst distances between components, and information about the samples (note that ancestral populations are labeled Pop0, Pop1, etc. and 1, 2, etc. in the distance file included in the RAR).

34 comments:

Anonymous said...

Dienekes, the North European ancestral component seems to be closer to the West Asian (0.036) than to the Mediterranean (0.057) - and also closer than the Mediterranean is to West Asian (0.062).

I would guess that the genetic distance between the modern populations reflects the geographic distances and later admixture between the ancestral components - with NE closer to Med and W Asian closer to Med?

I wonder how this all relates to haplogroups. Is it possible that the program has identified the relation between h. I (North European) and h. J (West Asian) and 'ignored' the rest of the haplos in those regions? Or does the admixture of the haplos predate the formation of the regional ancestral components represented here? Is the Med further then because of the later OOA E3b?

Fascinating, thanks!

Jack said...

Again trying to disentangle my neurons.
I understand that with these diagrams NE Europe is closer to W Asia in the "cluster" sense not in the genetic distance sense of old PC1, PC2, PC3... plots.
Is that correct?
What a mess with all these colors.

Dienekes said...

I would guess that the genetic distance between the modern populations reflects the geographic distances and later admixture between the ancestral components - with NE closer to Med and W Asian closer to Med?

I'm in data dump mode, so I haven't quite thought about this yet.

Is it possible that the program has identified the relation between h. I (North European) and h. J (West Asian) and 'ignored' the rest of the haplos in those regions?

Strictly speaking the program does not deal in haplogroups or any non-autosomal markers. It's an open question whether the IJ relationship has a counterpart in the autosomal sense.

But, there is plentiful haplogroup I-M26 in Sardinia, so I'm not sure it's as simple as that.

I understand that with these diagrams NE Europe is closer to W Asia in the "cluster" sense not in the genetic distance sense of old PC1, PC2, PC3... plots.
Is that correct?


We can analyze genetic distance between populations as different composition (in terms of admixture proportions of the components) and difference between components (Fst as output by ADMIXTURE).

Living northern Europeans are closer to living southern Europeans. However, the NE _component_ appeared about equidistant to the SE and West Asian component in previous analyses. This "Mediterranean" component of this analysis is not quite the same as the "Southern European" of other analyses.

Unknown said...

I have prepared a small demo showing a PCA plot of 84 populations in a three-dimensional rotating animated display. This is a simple HTML5 animation I made in a hurry after reading a couple of HTML5 tutorials. It is somewhat jerky on my computers, but illustrates in a most graphic way how various populations are arranged in "lines" in the 3D space. It is quite interesting!

It is at
http://www.scs.illinois.edu/~mcdonald/PCA84pops.html

Once running, you can't stop it without killing the browser or tab.

The three dimensions are linear combinations of the first 16 PCA components. At one point in time the horizontal axis is component 1.

Doug McDonald

terryt said...

"The tree shows clearly the four major divisions of mankind, which are separated quite distinctly from each other. From top to bottom: East Eurasians, West Eurasians, Australo-Melanesians, and Sub-Saharan Africans".

Fascinating stuff Dienekes. Exactly as we would expect: the most distinct clusters are at the margins. Those four populations certainly demonstrate easily observable phenotypic differences from each other. Perhaps the west Asian population can be regarded as being a sort of 'middle population', a combination of movements back into the region from near the four marginal populations which have overlaid an 'original' west Asian, or middle, population.

"I would guess that the genetic distance between the modern populations reflects the geographic distances and later admixture between the ancestral components"

Apart from the 'four major divisions of mankind' all other populations are probably basically a mixture between the nearest two marginal populations and the 'middle' population. For example Central Asians are a mixture of 'East Eurasians' and 'West Eurasians', SE Asians are a mixture of 'East Eurasians' and 'Australo-Melanesians', South Asians are a mixture of 'Australo-Melanesians' and 'Sub-Saharan Africans', and Mediterraneans are a complex mix of 'West Eurasians' and 'Sub-Saharan Africans'. Populations from the 'middle' have contributed especially to South Asian and Mediterranean populations, and seemingly to the Native American population.

Dienekes said...

It is at
http://www.scs.illinois.edu/~mcdonald/PCA84pops.html


Looks awesome. Too bad the human perceptual system is limited to 3D because there is structure to be uncovered in higher dimensions as well. But, even with 3D I've discovered that you can detect 20 distinct clusters or so, and many of these are quite obvious in this animation.

GrIQ said...

"Mediterranean" is a very loose term from a genetic point of view. A Spaniard, a North Italian, a Greek and a Turk are very different genetically: the first two are closer to the French, while the Greek is closer to Balkan peoples and southern Italians and the Turk is closer to Levantines. So, how do we interpret this?

German Dziebel said...

"The maximum Fst in humans is between the Palaeoafrican ancestral population (Pygmies and San) and the Papuan one at 0.346, with a close second, that between Palaeoafricans and Amerindians (0.333)."

In the vast majority of studies, it's Amerindians that are the most divergent from Africans, but your data is in the ballpark.

"At K=9 the Mbuti, Biaka Pygmies, and San get their own cluster (Palaeoafricans)"

Aren't they supposed to segregate earlier, at K=1, if they are "paleo" Africans?

"The absence of the "dark green" component which is present in Kets in Athabask does not really invalidate this hypothesis, as the "dark green" component may postdate the expansion of Na Dene speakers into the New World."

Yes, Kets are likely to have admixed with neighboring Selkups to obliterate all the signs of an earlier Na-Dene link.

Umi said...

Wonderful Job Dienekes! Your experiments are truly amazing.

Gui S said...

Amazing work!
I find the case of the Japanese very interesting. Having the highest amount of the East Asian component, one of the lowest Southeast Asian in Asia Pacific and a small yet significant Melanesian component.
They stand out from all other Asian populations.

I wonder if that presence of the Melanesian component can be traced back to the Jômon population. In which case, getting data from the Ainu and Ryukyuan could be really interesting.

Anyways, keep up the good work, and congratulations for appearing in Nature!

Dienekes said...

Aren't they supposed to segregate earlier, at K=1, if they are "paleo" Africans?

The pattern of splits does not imply a phylogeny.

Gui S said...

Is there anything implied in the pattern of splits?

Dienekes said...

Is there anything implied in the pattern of splits?

In the order in which splits occur at successive K, no.

German Dziebel said...

"The pattern of splits does not imply a phylogeny."

Then what determines the order of Ks? The way you laid them out suggests that broader continental groups come first followed by individual populations.

pconroy said...

Dienekes,

Truly fascinating!

Is there any data available on Australian Aboriginals, how would they cluster?

AG said...

"I find the case of the Japanese very interesting. Having the highest amount of the East Asian component"

So Japanese is the "purest" yellow race. lol.

Anonymous said...

Living northern Europeans are closer to living southern Europeans. However, the NE _component_ appeared about equidistant to the SE and West Asian component in previous analyses. This "Mediterranean" component of this analysis is not quite the same as the "Southern European" of other analyses.

Could this Med component be older than the southern European component of other analyses, seeing as it is more distant from the northern European? Is it possible that the Med component became the southern European component by mixture with the north European and west Asian components?

That raises the question of whether it may be possible to identify ancestral components from different periods, with earlier components mixing or otherwise evolving to become later components. In other words, does the analysis always identify the oldest root components or may it sometimes identify components that are intermediate between the most ancient and the modern populations?

Could it be possible to trace the historical formation of components in this way?

Jack said...

How about this idea:
the Med component is the old European ice age (Atlantis) component.
The other in the NE was hidden somewhere in Eastern Europe or beyond, maybe not even "European" originally.
This might explain why the Sardinians appear so "western" and extreme, almost as if they originated in the Atlantic ocean.
Later they were wiped out or mixed with the eastern "hordes" in different parts of Europe.

GrIQ said...

At K=15, Spaniards are 49.35% North European and 40.39% Mediterranean.

Jack said...

Hypothesis: the roughly 50% NE element in Spaniards arrived later than the Med. I understand that a few people seem to think that R1b may have an "Eastern" origin.
Or maybe the two mingled, survived the ice age and expanded.
Mine is an attempt to justify this apparently very Western Med component.

princenuadha said...

Just as the hierarchical clustering suggests, there are consistently shorter distances between the East Eurasians (Central Siberians, Northeast Siberians, East Asians, and Southeast Asians). Even the Central Siberians are closer to the Southeast Asians than they are to the NEU, Kala, or WAS. Due to the East Eurasians' relative relatedness I'd say that the area was primarily populated by one migration route. India seems to be the bridge between east and west, and since the India cluster is closer to every East Eurasian group than any West Eurasian group, I think that the major migration route that populated East Eurasia was southern (or barely north of the Himalayas).

Something I found surprising was that the NA cluster is closer to the East Asian, Indian, and CSI clusters than to the Northeast Siberian cluster. I don't know how to interpret that though.

princenuadha said...

@ GrIQ

Where did you get those numbers? The RAR file? (I can't get it, again...)

Could you please tell me the east Eurasian components in the CEU?

@ Dienekes, Congrats on the Nature article.

Gui S said...

What to make of the closeness of the Northeastern European component to that of the Kalash in the MDS plot too?

BTW, I translated the locations of the components of the MDS plot on a colour wheel, and made a colour map of the world populations (I didn't include any recently admixed populations though) based on their percentage of each component and the colour associated with the component on the colour wheel.
See here:
http://i108.photobucket.com/albums/n9/biskui/renduk15b.jpg

Anonymous said...

The designations given to the clusters represent ancestor groups and are not directly transferable to modern populations.

The ancestor cluster called Mediterranean does not represent Modern Mediterranean groups that span countries in the Mare Nostrum, just an ancestor component which is found in Modern Mediterranean people in varying degrees. Same with the other cluster names. It is pretty obvious from the Fst distances of those ancient clusters which stayed in close proximity and which separated with increasing time. Among the Caucasoid ancestral components, the North European was in closer contact with the West Asian and separated at a recent date. This implies to me that ancient North Europeans, who no longer exist, actually resided in Central South Asia close to ancient West Asians rather than in Europe. However, Modern North Europeans would be closer genetically to Modern South Europeans than they are to Modern West Asians. I believe that has already been shown by Fst distances of those modern European peoples. The two things are not directly comparable. Apples with Oranges.

Anonymous said...

Ponto, yes that is what I was getting at.

The Fst distances of the ancient NE and Med are congruent with a geographical proximity of these groups and therefore with a Central South Asian (and Indo-European) origin of the NE.

<< Previous studies suggested a Paleolithic origin, but here we show that the geographical distribution of its [R1b's] microsatellite diversity is best explained by spread from a single source in the Near East via Anatolia during the Neolithic. >>

http://www.plosbiology.org/article/info:doi%2F10.1371%2Fjournal.pbio.1000285

<< New insights into recent human evolution can also be gained from the branch lengths; for example, the short internal branch lengths within the haplogroup R1b relative to the other haplogroups suggest a recent expansion of this European haplogroup (Balaresque, Bowden et al. 2010). >>

http://dienekes.blogspot.com/search/label/1000Genomes

For the Indo-European angle:

http://www.eupedia.com/europe/origins_haplogroups_europe.shtml#R1b

Wausar said...

I've found that networks tend to give a better visualization of distances than either trees or PC plots. Here's one generated using SplitsTree for K=15:

http://i54.tinypic.com/zumuy9.png

Anonymous said...

Hi! Nice info, but where can I see the number of samples for each country? Because it's probably not correct to include info about some countries that have 3 or 4 samples...

best regards

Dienekes said...

I've found that networks tend to give a better visualization of distances than either trees or PC plots. Here's one generated using SplitsTree for K=15:


Thanks for the tip, I'll look into it.

princenuadha said...

"The Fst distances of the ancient NE and Med are congruent with a geographical proximity of these groups and therefore with a Central South Asian (and Indo-European) origin of the NE."

The question of where NE is from is not a simple one. One question is whether NE split from WAS recently or whether they mixed recently. We don't know how it evolved, and it is just an inferred population to begin with.

My first guess was that the NE cluster evolved north of the Caucasus while WAS evolved in the northern part of the Middle East. That location for the NE makes sense to me because it is congruent with the genetic distances of the clusters and it is a good starting ground for the spread of the NE that is now in modern populations. All they would have to do is move west or northwest to northeastern Europe, then spread across the European plain. They'd be well represented in Russia and marginally represented in Central Asia. The Caucasus could then prevent them from going to the Middle East.

"Central South Asian"

Where is that? Sounds too close to the Kalash and too close to modern Afghans and Pakistanis.

clusteredmaps said...

Dienekes' facial analysis stated that particular ancient European skulls clustered with Australian Aborigines. Perhaps Caucasians are closer to Australian Aborigines and Melanesians than they are to East Asians? Can anyone prove or disprove this? I would like some sources on this fascinating subject.

Anonymous said...

South Central Asia: Where is that. Sounds too close to the kalash and too close to modern Afghans and Pakistanis.

Princenuadha, yes that is what I was getting at, that the Fst distances for the NE component identified in this analysis possibly imply a location rather close to the WAS and so maybe somewhere near Turkey or the Caucasus.

I meant that the Fst distances are compatible with the thesis that R1 originated in South Central Asia (east of the Caspian Sea and west of the Hindu Kush) and gradually made its way into northern Europe via Turkey and/ or the Caucasus.

See the map here:

http://www.eupedia.com/europe/neolithic_europe_map.shtml#R1b

Discussion here:

http://www.eupedia.com/europe/origins_haplogroups_europe.shtml#R1b

I was getting at the idea that the NE component identified in this analysis appears to date from the time at which the ancient R1b populations dwelt in Turkey and/or the Caucasus or somewhere around there anyway and R1a was to the east (maybe 5 or 10,000 years ago?)

But... Then again, maybe the NE component dates from an earlier period seeing as the NE component is common to both R1a and R1b populations and R1a and R1b split like 25,000 years ago? Perhaps the NE was proximate to the WAS that far back?

So I am not sure what, if anything, could be inferred from these Fst distances - but the proximity of the NE to the WAS rather than to the Med is interesting nevertheless! There seems to have been movement/change of the NE, WAS and/or Med.

It is an interesting question whether geographic locations, absolute or relative, could be inferred from the Fst of the ancient components. I saw a study that argued that it is possible to produce a location map of modern populations via a selection of SNPs.

Map:

http://scienceblogs.com/notrocketscience/Europegenetics.jpg

Quote:

They analysed single-letter differences in DNA ("single nucleotide polymorphisms" or SNPs) at about 200,000 places in each of the genomes. [...]

Zoom in closer, and the map even reveals distinct genetic cluster within Switzerland based on the language people speak. German-speaking Swiss cluster to the east, Italian speakers to the south and Francophiles to the west.

http://scienceblogs.com/notrocketscience/2008/09/european_genes_mirror_european_geography.php

If components from different times could be identified then it might be possible to track migrations from the relative distances?

Unknown said...

May I use two of the figures from this page in an online course I'll be doing in Coursera (https://www.coursera.org/course/geneticsevolution)? Happy to provide attribution, too.

Dienekes said...

Sure

Unknown said...

So despite insistence from Afrocentrists that the perceived phenotypical similarities between Austronesians (such as Andamanese or Papuans) and SSA populations make them close genetic relatives, the reality is Austronesians and SSA cluster farther from each other than some SSA and European populations? That's rich. Maybe now the cultists that follow Van Sertima, Clyde Winters, et al. can stop insisting Austronesian admixture in NA populations is proof that the first humans to reach the Americas were related to SSA populations.

NA populations clearly most closely resemble a predominantly EA population with some Austronesian admixture.