A reader tipped me off to the availability of data from the 1000 Genomes Project genotyped on the Illumina Omni 2.5 chip. Out of 2.5 million or so SNPs, there are about 720,000 with rs-numbers in the working dataset. There are a few new populations in the data:
- GBR (Great Britain)
- FIN (Finland)
- IBS (Iberian Spanish)
- CLM (Colombians)
- MXL (Mexican Americans from Los Angeles)
- PUR (Puerto Ricans)
I've been rebuilding my various datasets to account for common markers, high-quality SNPs, and linkage disequilibrium, so this is based on about 133,000 markers. I also limited the number of individuals to 25 per population.
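For readers who want to reproduce the per-population cap, here is a minimal sketch. The post does not say how the 25 individuals were chosen, so the random draw, the function name `cap_per_population`, and the `(individual_id, population)` input format are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def cap_per_population(samples, cap=25, seed=0):
    """Keep at most `cap` individuals per population label.

    samples: iterable of (individual_id, population) pairs (hypothetical format).
    The random draw is an assumption; the post does not describe the selection.
    """
    by_pop = defaultdict(list)
    for ind, pop in samples:
        by_pop[pop].append(ind)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for pop in sorted(by_pop):
        ids = by_pop[pop]
        # Populations at or below the cap are kept whole.
        chosen = ids if len(ids) <= cap else rng.sample(ids, cap)
        kept.extend((ind, pop) for ind in chosen)
    return kept
```

With 17 populations that each have at least 25 genotyped individuals, this yields the 17 × 25 = 425 individuals used below.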
I also brought in the HapMap-3 data, to check that the integration was correct, and ran various analytical techniques over the joint dataset of 17 populations and 425 individuals.
Multidimensional Scaling
As expected, the three poles correspond to West Eurasians (top left, GBR, CEU, TSI), East Eurasians (bottom left, CHB, CHD, JPT), and Sub-Saharan Africans (YRI).
Other populations fall between the three poles: for example, FIN are slightly removed from West Eurasians in an East Eurasian direction, Mexicans and Gujarati Indians (GIH) lie between West and East Eurasians, and African Americans (ASW) and Maasai East Africans (MKK) lie between Sub-Saharan Africans and West Eurasians.
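To make the MDS step concrete, here is a small pure-Python sketch of classical (Torgerson) MDS applied to a matrix of pairwise distances between individuals. The post does not say which MDS implementation or distance measure was used, so the choice of classical MDS, the power-iteration eigen-solver, and the function name `classical_mds` are assumptions. Real datasets would use an optimized tool rather than this toy.

```python
import math

def classical_mds(dist, dims=2, iters=200):
    """Embed n points in `dims` dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenpair of the (deflated) B.
        v = [math.sin(i + k + 1.0) for i in range(n)]  # deterministic start
        lam = 0.0
        for _ in range(iters):
            norm = math.sqrt(sum(x * x for x in v))
            if norm == 0.0:
                break
            v = [x / norm for x in v]
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            lam = sum(v[i] * w[i] for i in range(n))  # Rayleigh quotient
            v = w
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
        scale = math.sqrt(max(lam, 0.0))  # negative eigenvalues carry no coordinate
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords
```

Each returned column is one MDS dimension. Plotting the first two against each other gives the kind of picture described above, while later dimensions feed the clustering step.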
Clusters Galore Analysis
I then used the Clusters Galore approach to cluster individuals. As I've mentioned before, individuals with quite distinct origins may overlap in the MDS representation, and the Galore approach is able to discover distinct clusters by looking at several dimensions at the same time, and using a state-of-the-art clustering algorithm, MCLUST.
As can be seen in the MDS plot, Mexicans and Gujarati Indians overlap, as well as African Americans and Maasai. Obviously these populations are completely different mixtures that happen to coincide in genomic space due to the relatedness of their ancestral components that intermixed at different times and in different continents.
Here are the results of the Galore analysis. With 20 MDS dimensions retained (the maximum I considered) there were 35 clusters in the MCLUST solution that maximized the Bayesian Information Criterion.
This is quite instructive:
- Some populations (FIN and YRI) form their own very specific clusters #2 and #35
- Some clusters join 2 or more populations. For example White Americans (CEU) and Britons (GBR) form cluster #1
- Latinos form several clusters, especially the Mexicans. This should've been anticipated from the MDS plot where they are shown to be widely dispersed (quite variable). In essence, Latinos are not homogeneous populations but sets of individuals possessing variable admixture proportions
Note also that some populations that are folded into a single cluster in this analysis (e.g., Spanish and Tuscans in #3) can in fact be distinguished from each other, although not so easily in the first 20 dimensions considered here, as these are dominated by more salient features of the global genetic landscape.
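The model-selection step above ("the solution that maximized the BIC") can be sketched as follows. MCLUST defines BIC as 2·logL − p·ln(n), so larger is better. The candidate log-likelihoods and parameter counts in the usage example are made-up numbers for illustration, and the helper names are hypothetical.

```python
import math

def mclust_style_bic(log_likelihood, n_params, n_obs):
    """BIC in the MCLUST convention: 2*logL - n_params*ln(n_obs); higher is better."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

def pick_best(candidates, n_obs):
    """candidates maps a label (e.g. number of clusters) to (log_likelihood, n_params).

    Returns the label whose fitted model has the highest BIC.
    """
    return max(candidates,
               key=lambda label: mclust_style_bic(candidates[label][0],
                                                  candidates[label][1],
                                                  n_obs))

# Hypothetical fits for 1, 2 and 3 clusters on 425 individuals.
example = {1: (-1000.0, 5), 2: (-900.0, 11), 3: (-895.0, 17)}
best = pick_best(example, n_obs=425)
```

In this toy case the third cluster improves the likelihood only slightly, so the penalty term outweighs it and the two-cluster model is preferred. The same trade-off, applied across all candidate cluster counts and retained dimensions, is what singles out one solution.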
ADMIXTURE analysis
I then ran ADMIXTURE over the dataset for K=5.
Here are the admixture proportions corresponding to this plot:
This is quite instructive with respect to the absence of particular reference populations: Finns show East Eurasian influences in the form of "Native American" (1.5%) and "East Asian" (6.2%) elements. Clearly, we don't have to imagine Native Americans moving into Finland; these two components are stand-ins for the Siberian ancestors of the European Finns. Similarly, Spanish show African admixture (1.6%). This is probably due to both North and Sub-Saharan African elements, but the absence of appropriate North African references makes the distinction impossible. Finally, the Maasai show European and African admixture. This may be due to the non-emergence of a specific East African component at this level of resolution, as well as the absence of appropriate West Asian Caucasoid groups that are more likely to have influenced them. The absence of West Asian reference populations also probably affects Tuscans, as their West Asian admixture may be misinterpreted as South Asian.
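For anyone working from ADMIXTURE's raw output, the following sketch shows one way to get per-population figures like the ones quoted above. ADMIXTURE writes a `.Q` file with one row per individual and K whitespace-separated ancestry proportions. Averaging those rows within each population is an assumption about how the percentages were derived, and the function names are hypothetical.

```python
from collections import defaultdict

def parse_q(lines):
    """Parse the rows of an ADMIXTURE .Q file into lists of floats."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]

def mean_proportions(q_rows, populations):
    """Average ancestry proportions per population.

    q_rows: one list of K proportions per individual.
    populations: population label for each individual, in the same order.
    """
    if len(q_rows) != len(populations):
        raise ValueError("need exactly one population label per Q row")
    sums = {}
    counts = defaultdict(int)
    for row, pop in zip(q_rows, populations):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        for k, value in enumerate(row):
            sums[pop][k] += value
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in sums[pop]] for pop in sums}
```

The component columns come out unlabeled. Names such as "East Asian" or "Native American" are assigned afterwards by looking at which populations carry most of each component.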
Here are the Fst distances between components:
This is also instructive: the South Asian component, in the absence of relatively unadmixed South Asian references, is closer to Europeans than to East Asians. In fact, it is a composite of West Eurasian and indigenous South Asian population elements, the latter being distantly related to East Asians. Similarly, in the absence of Amerindian references, the Native American component (a bit of a misnomer) is equidistant to Europeans and East Asians. In fact, it is also a composite of West Eurasian and pre-Columbian American populations.
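The "composite component" argument can be illustrated with a toy calculation. The sketch below uses a simple heterozygosity-based Fst between two sets of allele frequencies, which is not necessarily the estimator ADMIXTURE uses for its own Fst table, and the allele frequencies are invented purely for illustration.

```python
def simple_fst(freqs_a, freqs_b):
    """Fst = sum(H_T - H_S) / sum(H_T) over loci, two equally weighted groups."""
    num = 0.0
    den = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        h_s = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2  # mean within-group
        p_bar = (pa + pb) / 2
        h_t = 2 * p_bar * (1 - p_bar)                      # pooled
        num += h_t - h_s
        den += h_t
    return num / den if den > 0 else 0.0

# Invented allele frequencies at four loci (illustration only).
european = [0.1, 0.8, 0.5, 0.3]
indigenous = [0.6, 0.2, 0.9, 0.7]
# An inferred component that is half European, half indigenous.
composite = [(e + i) / 2 for e, i in zip(european, indigenous)]

fst_unadmixed = simple_fst(european, indigenous)
fst_composite = simple_fst(european, composite)
```

Here `fst_composite` comes out well below `fst_unadmixed`: a component that already contains West Eurasian ancestry sits closer to Europeans than the unadmixed source would. That is the pattern described above for the South Asian and Native American components.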
Conclusion
The Omni 2.5 data seem to work fine, and genome bloggers can anticipate good things in the future from the 1000 Genomes Project, as many more populations are in the pipeline. The full-sequence data will probably be too much to handle for most hobbyists at the moment, but for anthropological investigations the 2.5 million SNPs will be more than enough.
The few experiments I carried out here also served to highlight the problems associated with using a limited number of reference populations. But, thankfully, this was a contrived problem aimed to make a point: there are now publicly available data for most major human populations, so the field is wide open for anyone interested in the study of human variation.