March 21, 2011

A note of caution on admixture estimates

I want to expand on a theme I touched upon briefly in a previous post: the importance of choosing appropriate parental populations in admixture analyses.

I will first show empirically the impact of this choice on the admixture proportions. Then, I will deal with a special and difficult case: the Indian Cline.

The not so easy case of Mexican Mestizos

Mexican Mestizos are a tri-hybrid population composed of European, Native American, and West African elements. These elements began interbreeding only in the last half millennium or so, and, hence, the process occurred in historical time.

Consider a sample of 25 Mexicans from the HapMap, with 25 Yoruba from the HapMap, 25 Iberian Spanish from the 1000 Genomes Project, and 14 Pima from the HGDP as parental populations. We obtain for our Mexican sample:
  • 59.7% European
  • 36.9% "Native American"
  • 3.4% African
Now, substitute the Pima with 21 Maya from the HGDP as representative of Native Americans. We now obtain:
  • 49.9% European
  • 47.3% "Native American"
  • 2.8% African
Notice that the Native American component has increased. We will see shortly why this is the case. But, let's run a final experiment with just the Mexicans, Spanish, and Yoruba, i.e., with no Native American samples. At K=3 we obtain:
  • 70% "Native American"
  • 29.7% European
  • 0.4% African
The "Native American" component has increased again! The explanation is simple: as we exclude less admixed Native American groups, Mexicans appear (comparatively) more Native American. The "Native American pole" has shifted, and so have the relative positions of populations between the poles.

In other words, what is labeled "Native American" in the three experiments is not the same: in the first one it is anchored on the more unadmixed Pima, in the last one on the more admixed Mexicans.

A color analogy is apt: imagine you had white and black paint, and you wanted to achieve a medium grey hue: you could mix equal parts white and black (1/2 each) to achieve this. Now, imagine that instead of white paint, you had a light grey hue. You would now have to mix greater amounts of light grey (more than 1/2) to achieve the same medium hue.
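
To make the paint analogy concrete, here is a minimal sketch in Python. It assumes that brightness mixes linearly on a scale from 0.0 (black) to 1.0 (white); the function name and the light grey value of 0.8 are arbitrary choices for illustration, not anything measured.

```python
def fraction_of_light_paint(target, light, dark=0.0):
    """Fraction of the lighter paint needed to reach `target` brightness.

    Brightness runs from 0.0 (black) to 1.0 (white), and mixing is
    assumed to be linear (an assumption made for this illustration).
    """
    if light <= dark:
        raise ValueError("light paint must be brighter than dark paint")
    if not dark <= target <= light:
        raise ValueError("target must lie between the two paints")
    return (target - dark) / (light - dark)


# Pure white and pure black: equal parts give a medium grey.
print(fraction_of_light_paint(0.5, light=1.0))

# Light grey (0.8, chosen arbitrarily) instead of white:
# more than half of the mix must now be the lighter paint.
print(fraction_of_light_paint(0.5, light=0.8))
```

The same arithmetic applies to admixture proportions: when the reference "pole" is itself partly mixed, the same individual needs a larger share of that pole to be explained, so the inferred component grows.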

The moral:
  • If you are going to study admixture, you'd better find unadmixed representatives of ancestral populations.
As we will now see, this is not always possible:

No unadmixed populations: the Indian Cline

What if the process of admixture had occurred for a thousand more years and all inhabitants of the New World had acquired a generous portion of European ancestry? We would then have no unadmixed native populations to use in the estimation of admixture proportions.

This is, in essence, the problem that Reich et al. (2009) had to deal with in the context of India. West Eurasian-like people have been arriving to the Indian subcontinent since at least Neolithic times and until quite recently. The caste system has served to barricade gene flow to some extent, but, nonetheless, the populations of India are, today, variable mixes of West Eurasians and indigenous Indians.

Even the Andamanese Islanders had evidence of the West Eurasian-like element (which Reich et al. termed Ancestral North Indian). Looking back to the Mexican example, the lack of unadmixed reference populations would inflate estimates of native ancestry.

To see whether this is the case, I took the 18 populations of the Indian Cline described by Reich et al. (2009) together with 25 Europeans from HapMap CEU and ran ADMIXTURE over the set. Below you can see the comparison between the "West Eurasian" component of ADMIXTURE and the Ancestral North Indian:

The cline is preserved in both representations, but the right column has smaller numbers than the left one, confirming our intuition about the use of admixed populations.

Below is a scatterplot of the two columns, with the regression equation on the chart:

The high R² value suggests that the two techniques are measuring the same underlying reality, but ADMIXTURE produces lower West Eurasian admixture (by about 38 percentage points) than the technique of Reich et al. (2009). Indeed, this is what we expect, as Reich et al. (2009) assign 38.8% ANI ancestry to the "most indigenous" group (the Mala) along the cline.

The position of populations along the cline is roughly the same, but the two sets of admixture proportions are shifted by about 38 percentage points with respect to each other.
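
For readers who want to repeat this kind of comparison on their own two columns, the following is a minimal sketch of an ordinary least-squares fit with its R² value, written in plain Python. The function names are hypothetical, and the numbers in the example are toy values chosen so the arithmetic is easy to follow; they are not the proportions from the comparison above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long columns with at least 2 rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a column with no variation cannot be fitted")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def predict(slope, intercept, x):
    """Apply the fitted line to a new x value."""
    return slope * x + intercept


# Toy columns (NOT real admixture proportions): y is exactly 2x + 1.
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r_squared = fit_line(toy_x, toy_y)
print(slope, intercept, r_squared)
print(predict(slope, intercept, 4.0))
```

With real data, `xs` would hold one method's proportion per population and `ys` the other's; the intercept then plays the role of the shift between the two scales.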

(Reich et al. (2009) removed 8 individuals from their dataset as well as 7 Pathans and 14 Sindhis as outliers. I used the recommendations of Rosenberg with respect to the Pathans and Sindhis, using his H971 set, and kept all the Indian individuals of Reich et al. (2009). As can be seen, the slightly different datasets did not greatly affect the correlation between admixture proportions.)

Reich et al. (2009) were able to infer the existence of ANI ancestry even in the most "indigenous" of Indian populations by exploiting the simple structure of the problem, namely:
  1. Admixture occurred between only 2 ancestral groups
  2. The 2 groups were related to extant human populations that are not part of the cline: CEU and Adygei for ANI and Onge for ASI
  3. There was treelike evolution of all studied groups except for the ANI-ASI admixture event
It is a beautiful result that showed that there are cases where the extent of admixture can be inferred even in the absence of unadmixed populations representative of the populations involved.
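
One family of estimators that relies on assumptions of this kind (two sources, outside relatives of each source, otherwise treelike history) is the f4 ratio. The sketch below is a generic illustration, not a reproduction of the procedure in Reich et al. (2009): the population labels A, O, B, C, X are placeholders, the allele frequencies are made up, and X is built as an exact per-SNP mixture of B and C so that the arithmetic can be checked by hand.

```python
def f4(p_a, p_b, p_c, p_d):
    """f4(A, B; C, D): average over SNPs of (a - b) * (c - d)."""
    n = len(p_a)
    if n == 0 or not (n == len(p_b) == len(p_c) == len(p_d)):
        raise ValueError("need four equally long, non-empty frequency lists")
    return sum((a - b) * (c - d)
               for a, b, c, d in zip(p_a, p_b, p_c, p_d)) / n


def f4_ratio(p_a, p_o, p_x, p_b, p_c):
    """Estimate the B-related share of X as f4(A,O;X,C) / f4(A,O;B,C).

    A: a relative of the B side, O: an outgroup,
    B and C: stand-ins for the two source sides, X: the admixed group.
    """
    denominator = f4(p_a, p_o, p_b, p_c)
    if denominator == 0:
        raise ValueError("denominator is zero; the ratio is undefined")
    return f4(p_a, p_o, p_x, p_c) / denominator


# Made-up allele frequencies at four SNPs (for illustration only).
p_a = [0.9, 0.2, 0.6, 0.3]
p_o = [0.5, 0.5, 0.5, 0.5]
p_b = [0.8, 0.3, 0.7, 0.2]
p_c = [0.4, 0.6, 0.5, 0.5]

# Build X as an exact mixture: alpha from B, the rest from C.
alpha = 0.4
p_x = [alpha * b + (1 - alpha) * c for b, c in zip(p_b, p_c)]

# Recovers alpha up to floating-point rounding in this idealized setup.
print(f4_ratio(p_a, p_o, p_x, p_b, p_c))
```

Note that the estimate never requires an unadmixed sample from the cline itself; it only needs groups outside the cline that are related to each side, which is what makes the approach possible when every population on the cline is mixed.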

Conclusion

Much more can be said on this issue, but let's summarize a couple of lessons:
  • The full extent of an admixture cline can be captured only if unadmixed populations on either side of the cline exist. Use as many populations as possible to capture the full extent of an admixture cline.
  • Use of an admixed population in lieu of an unadmixed native one inflates the inferred native component. Use native populations if possible instead of admixed ones.
  • Even in the absence of unadmixed native populations, it is sometimes possible to reconstruct the admixture proportions as per Reich et al. (2009).
Capturing the complexities of human prehistory from modern populations is tricky. Nonetheless, with increased coverage of human genetic diversity (there are already ~9k individuals in my database), new analytical techniques, and, hopefully, some archaeogenetic calibration, we are bound to learn much more about the distant human past in the not-so-distant future.

PS: The substantial correlation between the ANI-ASI proportions of Reich et al. (2009) and the "West Eurasian"-"South Asian" proportions in a K=2 ADMIXTURE analysis makes it possible to infer a person's ANI-ASI proportions from their ADMIXTURE results. Dodecad Project members of South Asian heritage should keep an eye on the Dodecad Project blog for that type of inference.

6 comments:

Cuah123 said...

Thanks Dienekes...

In the following study, they also ran into a similar issue:
Genome-wide patterns of population structure and admixture among Hispanic/Latino populations
Katarzyna Bryc

"Our results suggest future genome-wide association scans in Hispanic/Latino populations may require correction for local genomic ancestry at a subcontinental scale "


The Mexican identity is wrapped into a nationalistic and cultural identity. Other studies have shown that the European conquest of Mexico nearly wiped out the male native population, showing native mtDNA across several modern populations but differing Y-DNA for those same populations. Basically, European males killed Native males and ended up with the females.

I think it's going to be a daunting task, unless the studies are broken up into subcontinental regions, in some cases even neighborhoods. Mexico has been mixing for a longer amount of time than its northern neighbor. Imagine trying to get a picture of the structure of the US, with nearly everyone saying they are native.

Diogenes said...

Very interesting, I agree ADMIXTURE results should be analysed in a more scientific manner, theories formulated and results correlated then maybe checked with future available populations for validation.

I don't think Lithuanians and other Balts are "pure" Eastern Wave either. Otherwise it seems to reach too far in too large amounts...
Eurogenes' recent results on Baltic Finns seem to point out that the Neolithic Near Eastern element predominant in Finns is actually more akin to that of Scandinavians (more "Western") than to that of Baltic peoples.
This could imply maybe that Finnish Comb Ceramic Pottery culture was founded by an element from the West Neolithic wave coming from Denmark/Sweden/Pomerania, and there was then some minor fusion with a local now described as "Siberian" hunter-gatherer element. Comb ceramic pottery culture was possibly semi-agriculturalist, with much complementing by hunting because of the cold weather (rye as we know it being a later development). Pure hunter-gatherers have no need for pottery, and as they move constantly it doesn't last much either.
Archaeologically this culture seems to have expanded towards the Urals, which is tantalising.
Ugrian tongues can be found on the far side of the Urals. Would this imply the Proto-Uralic homeland may actually coincide with the Comb Ceramic homeland?

AP said...

"West Eurasian-like people have been arriving to the Indian subcontinent since at least Neolithic times and until quite recently."

How do we explain this [the %numbers are approximate]:

1. 60% of South Asia is mtDNA M. There is minimal M in West Eurasia. 40% is N, of which the most common is U. South Asian U is comprised mainly of U2a, U2b, and U2c, which are also absent in West Eurasia.

2. South Asian Y-DNA is H 20%; R1a1-M17 20%; L 10%; R2-M124 10%; O2a-M95 8%; J2-M172 8%; O3e-M134 5%; F*-M89 5%.

Did all the above except J2 and R1a1 disappear in West Eurasia if they have been arriving in the Indian subcontinent since at least Neolithic times? And how did L(xM,N), which was present in Neolithic West Eurasia, get left behind?

I do think that a big proportion of R1a1 and some J2 came from West Eurasia to South Asia, but I can't see how it could have transformed all of South Asia so much.

Cuah123 said...

In my opinion, if they are to do further testing, say in Guadalajara, Mexico, where the caste system and birth certificates denoted race, the testing should first cover people of unmixed Nahua descent. Second, the first conquest and colonist families circa 1550 to 1600 (these families are well known, mine being one of them: R1b U106 L48). Then the French incursion into Guadalajara, and last all other migrations up to modern times.
I'm not sure why these scientists are not using family well known history to test in Mexico, there are tons of records, like birth certificates and marriage.

Fanty said...

"I'm not sure why these scientists are not using family well known history to test in Mexico, there are tons of records, like birth certificates and marriage."

That's a hell of a lot of work for each single person. Several months of research for each person, minimum.

Plus, these documents are not really reliable.

For German documents on family history (usually existing for the past 500 years), it's estimated that about 8% of them are wrong (women got pregnant by other men, and their husbands never knew).

So, it would cost several million dollars, take 5 years, and 99.9% of the work and cost would be the research of family history. And in the end, the results would still be wrong, because the documents are not reliable.

andrew said...

@Fanty

If you are examining the whole community, you get efficiencies that you don't get from person-by-person studies of the same records.

Also, even significant error rates, so long as they involve women who got pregnant by other men from the same community, don't necessarily impact the results at a population genetic level very much, and even if all you do is to estimate population genetics as of ca. 1850 from population genetics as of ca. 2011, this is a significant accomplishment, and will also provide good empirical estimates of the constants to include in a mathematical model designed to produce Monte Carlo estimates of what the population genetic make up would have been ca. 1600.

A huge proportion of the error bars for estimates of old populations comes from inaccurate population modeling, and this could be much, much more accurate than that kind of estimation.