March 18, 2011

Analysis of 1000 Genomes + HapMap 3 data

A reader tipped me off to the availability of data from the 1000 Genomes Project genotyped on the Illumina Omni 2.5 chip. Out of 2.5 million or so SNPs, there are about 720,000 with rs-numbers in the working dataset. There are a few new populations in the data:
  • GBR (Great Britain)
  • FIN (Finland)
  • IBS (Iberian Spanish)
  • CLM (Colombians)
  • MXL (Mexican Americans from Los Angeles)
  • PUR (Puerto Ricans)
I've been rebuilding my various datasets to keep only common markers and high-quality SNPs and to account for linkage disequilibrium, so this analysis is based on about 133,000 markers. I also limited the number of individuals to 25 per population.
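The per-population cap can be sketched as follows. This is only an illustration: the post does not say how the 25 individuals were chosen, so the random sampling, the function name `cap_per_population`, and the `(individual_id, population)` input format are all assumptions.

```python
import random
from collections import defaultdict


def cap_per_population(samples, max_per_pop=25, seed=0):
    """Keep at most `max_per_pop` individuals from each population.

    `samples` is an iterable of (individual_id, population) pairs.
    Random selection is an assumption; the post does not describe
    how individuals were picked.
    """
    by_pop = defaultdict(list)
    for individual, population in samples:
        by_pop[population].append(individual)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept = {}
    for population, individuals in sorted(by_pop.items()):
        if len(individuals) > max_per_pop:
            individuals = rng.sample(individuals, max_per_pop)
        kept[population] = sorted(individuals)
    return kept
```

With 17 populations that each have at least 25 individuals, this yields the 17 x 25 = 425 individuals used in the joint dataset.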

I took the HapMap-3 data to make sure that the integration was correct and ran various analytical techniques over the joint dataset of 17 populations and 425 individuals.

Multidimensional Scaling

As expected, the three poles correspond to West Eurasians (top left, GBR, CEU, TSI), East Eurasians (bottom left, CHB, CHD, JPT), and Sub-Saharan Africans (YRI).

Other populations fall in between the three poles: for example, FIN is slightly removed from West Eurasians in an East Eurasian direction, Mexicans (MXL) and Gujarati Indians (GIH) lie between West and East Eurasians, and African Americans (ASW) and Maasai East Africans (MKK) lie between Sub-Saharan Africans and West Eurasians.
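For readers who want to see what an MDS projection involves, here is a minimal sketch of classical (metric) MDS using NumPy. It assumes a precomputed symmetric matrix of pairwise genetic distances between individuals; the post does not say which distance measure or software produced the plot, so the function name `classical_mds` and the input format are illustrative only.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Classical MDS: embed points so that Euclidean distances
    approximate the entries of the distance matrix `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

Each row of the result is one individual, and each column is one MDS dimension; the first two columns would be the axes of a plot like the one described above.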

Clusters Galore Analysis

I then used the Clusters Galore approach to cluster individuals. As I've mentioned before, individuals with quite distinct origins may overlap in the MDS representation, and the Galore approach is able to discover distinct clusters by looking at several dimensions at the same time and by using a state-of-the-art clustering algorithm, MCLUST.

As can be seen in the MDS plot, Mexicans and Gujarati Indians overlap, as well as African Americans and Maasai. Obviously these populations are completely different mixtures that happen to coincide in genomic space due to the relatedness of their ancestral components that intermixed at different times and in different continents.

Here are the results of the Galore analysis. With 20 MDS dimensions retained (the maximum I considered), there were 35 clusters in the MCLUST solution that maximized the Bayesian Information Criterion.


This is quite instructive:
  • Some populations (FIN and YRI) form their own very specific clusters #2 and #35
  • Some clusters join 2 or more populations. For example White Americans (CEU) and Britons (GBR) form cluster #1
  • Latinos form several clusters, especially the Mexicans. This should've been anticipated from the MDS plot where they are shown to be widely dispersed (quite variable). In essence, Latinos are not homogeneous populations but sets of individuals possessing variable admixture proportions
Note also that some populations that are folded into a single cluster in this analysis (e.g., Spanish and Tuscans in #3) can in fact be distinguished from each other, although not so easily in the first 20 dimensions considered here, as these are dominated by more salient features of the global genetic landscape.
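The model-selection step mentioned above (picking the MCLUST solution that maximizes the BIC) can be illustrated with a small sketch. MCLUST defines BIC as 2 log L minus (number of parameters) x log(n), so larger is better. The function names and the log-likelihood values below are hypothetical placeholders, not results from this analysis.

```python
import math


def mclust_style_bic(loglik, n_params, n_obs):
    """BIC in the MCLUST convention: 2*logL - n_params*log(n_obs).
    Larger values indicate a better trade-off of fit and complexity."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def best_cluster_count(candidates, n_obs):
    """Pick the number of clusters with the highest BIC.

    `candidates` maps a cluster count to a (loglik, n_params) pair
    obtained from fitting a mixture model with that many clusters.
    """
    scores = {
        g: mclust_style_bic(loglik, n_params, n_obs)
        for g, (loglik, n_params) in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical fits for 1, 2 and 3 clusters on 425 individuals.
example = {1: (-1000.0, 5), 2: (-900.0, 11), 3: (-895.0, 17)}
best, scores = best_cluster_count(example, n_obs=425)
```

In this made-up example the third cluster improves the likelihood only slightly, so the penalty term makes the two-cluster solution the one with the highest BIC.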

ADMIXTURE analysis

I then ran ADMIXTURE over the dataset for K=5.

Here are the admixture proportions corresponding to this plot:


This is quite instructive with respect to the absence of particular reference populations: Finns show East Eurasian influences in the form of "Native American" (1.5%) and "East Asian" (6.2%) elements. Clearly, we don't have to imagine Native Americans moving into Finland; these two components are stand-ins for the Siberian ancestors of the European Finns. Similarly, Spanish show African admixture (1.6%). This is also probably due to both North and Sub-Saharan African elements, but the absence of appropriate North African references makes the distinction impossible. Finally, the Maasai show European and African admixture. This may be due to the non-emergence of a specific East African component at this level of resolution, as well as the absence of appropriate West Asian Caucasoid groups that are more likely to have influenced them. The absence of West Asian reference populations also probably affects Tuscans, as their West Asian admixture may be misinterpreted as South Asian.
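Population-level percentages like the ones quoted above are typically obtained by averaging the per-individual ancestry fractions within each population. A minimal sketch, assuming the rows of ADMIXTURE's .Q output (one row per individual, K fractions per row) have already been read into lists and paired with population labels; the function name and the toy numbers in the test are illustrative only:

```python
from collections import defaultdict


def mean_proportions(q_rows, populations):
    """Average ancestry fractions per population.

    `q_rows` is a list of per-individual fraction lists (K values each),
    and `populations` gives the population label of each row.
    """
    sums = {}
    counts = defaultdict(int)
    for row, pop in zip(q_rows, populations):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        for k, value in enumerate(row):
            sums[pop][k] += value
        counts[pop] += 1
    # Divide each component total by the number of individuals.
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}
```

Because the fractions of each individual sum to 1, the averaged fractions of each population also sum to 1.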

Here are the Fst distances between components:


This is also instructive: the South Asian component, in the absence of relatively unadmixed South Asian references, is closer to Europeans than to East Asians. In fact, it is a composite of West Eurasian and indigenous South Asian population elements, the latter being distantly related to East Asians. Similarly, in the absence of Amerindian references, the Native American component (a bit of a misnomer) is equidistant to Europeans and East Asians. In fact, it is also a composite of West Eurasian and pre-Columbian American populations.
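To give a feel for what an Fst distance between two components measures, here is a simplified sketch that computes a ratio-of-averages Fst from two vectors of allele frequencies (one frequency per SNP for each component). This is an assumption-laden illustration: it omits sample-size corrections and is not necessarily the estimator that ADMIXTURE itself reports.

```python
def fst_ratio_of_averages(freqs_a, freqs_b):
    """Simplified Fst between two sets of allele frequencies.

    Numerator: squared frequency differences summed over SNPs.
    Denominator: probability that two alleles, one drawn from each
    group, differ, summed over SNPs.
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        numerator += (p_a - p_b) ** 2
        denominator += p_a * (1.0 - p_b) + p_b * (1.0 - p_a)
    if denominator == 0.0:
        raise ValueError("no polymorphic SNPs between the two groups")
    return numerator / denominator
```

Identical frequency vectors give 0, and components fixed for opposite alleles at every SNP give 1; a composite component, such as the ones discussed above, would fall closer to whichever sources contributed to it.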


Conclusion

The Omni 2.5 data seem to work fine, and genome bloggers can anticipate good things in the future from the 1000 Genomes Project, as many more populations are in the pipeline. The full-sequence data will probably be too much to handle for most hobbyists at the moment, but for anthropological investigations the 2.5 million SNPs will be more than enough.

The few experiments I carried out here also served to highlight the problems associated with using a limited number of reference populations. But, thankfully, this was a contrived problem aimed to make a point: there are now publicly available data for most major human populations, so the field is wide open for anyone interested in the study of human variation.

13 comments:

Eric said...

Hmm... Nothing new here. Once you increase K, the East African, North African, Middle Eastern, Siberian, etc. components would start appearing.

Cuah123 said...

"Latinos are not homogeneous populations but sets of individuals possessing variable admixture proportions"

Yes, I had posted this earlier this year. Depending on what part of California you are testing, you are going to get significantly different results. Santa Monica in the 1960s would have given you different results than the population living nearby in Venice today.

To further study the group, there should be a map made of which state each Mexican comes from.

truth said...

Actually, the Spanish do not have any Sub-Saharan admixture (as seen in your own Dodecad project, in which they have 0.1% West African, which is noise).

Anonymous said...

Thank you so much for this Dienekes. Please please do a high K analysis that includes the British.

Unknown said...

What this suggests to me is (at least) three different archaic human populations.

PRjibaro said...

Thanks for this article; now I have more knowledge of my nation's (Puerto Rico) genetic ancestry.

Anonymous said...

The results are a bit noisier than usual, but in particular the Latin American results seem to be just wrong, probably because of lack of a pure-blooded American Indian population. When using Admixture with only Mexicans and Eurasians/Africans, Mexicans form their own cluster and belong to it 75%, with the other 25% being European. But if an American Indian population is added, then the HapMap Mexicans belong only 50% to this American Indian cluster, the other 50% to Europeans. Also, they never showed Chinese in previous Admixture runs, or just a small fraction of 1%, which fits perfectly with Mexican y-dna/mtdna, but in this noisier Admixture run (I'm not knocking it, just pointing it out, I'm sure Dienekes didn't have time yet to add all the populations he would have wanted) all the Latinos have 1% or 2% or more East Asian, which is way too much. Also, there's just no way Colombians are 65% European, that would require 100% European ancestry on the y-dna side, and 30% European ancestry on the mtdna side.

Dienekes said...

"The results are a bit noisier than usual, but in particular the Latin American results seem to be just wrong, probably because of lack of a pure-blooded American Indian population."

Patience, this will be the subject of a new post in the next few days.

Dienekes said...

"Also, there's just no way Colombians are 65% European, that would require 100% European ancestry on the y-dna side, and 30% European ancestry on the mtdna side."

These are Colombians from Medellin (urban). The HGDP Colombians are indigenous. There is also a Colombian sample from Bryc et al. (2010). One of these days I may get around to comparing these three.

Anonymous said...

What's the exact address of the file with the genotype data?

Dienekes said...

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20110217_broad_omni_genotypes/

Kepler said...

"Latinos are not homogeneous populations but sets of individuals possessing variable admixture proportions"
Guys, are you discovering Latinos now?
Variance among Latinos is huge, even in one given country, even in one given region of that country...by the way, it was probably like that in Europe thousands of years ago.
So much for "races"...just more or less tenuous clusters where everyone has a lot of mixes but we cannot detect the degree of those mixes anymore.

Blair K. said...

Interesting, the small "Native American" and "East Asian" percentage Finns sometimes show. Have you ever seen this with Slovenians?