October 01, 2011

Further caution on admixture estimates: at the edges of variation

My recent analysis of the Yunusbayev et al. (2011) data revealed an interesting anomaly: the Armenians_Y sample tested much more "European" than the existing Armenians_D and Armenians (from Behar et al. (2010)) samples.

A related "anomaly" arose when some of my newer Dodecad tools, which are targeted to particular regions (e.g., Europe and West Asia for the newer euro7 calculator), were used by individuals from outside these regions (e.g., South Asians using euro7). Of course, I have cautioned against such use, but can we say something about why it is not a very good idea?

Suspecting a systematic effect, I decided to investigate.

My geometric intuition is encapsulated in the following figure:

Suppose that a cline has been inferred from B to A1. Suppose that dist(A1, B) = dist(A2, B). So, A1 and A2 differ from each other in an orthogonal direction relative to their difference from B.

Now, if we project A2 onto the BA1 line, we see that A2 appears "intermediate" between them. The converse would occur if we project A1 onto the BA2 line.

This is reminiscent of my criticism of Moorjani et al. (2011), in that shifts away from a linear cline cause spurious admixture to be inferred. But, it is more general:
  • A2 may differ from A1 because it has cryptic admixture from an unsampled group, or vice versa
  • A2 may differ from A1 because of random genetic drift
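
The projection argument above can be made concrete with a small Python sketch. It is only an illustration of the geometry, not of how any admixture software works internally: the two-dimensional coordinates, the 15-degree separation, and the helper name projected_fraction are all made up for the example.

```python
import math

def projected_fraction(b, a1, x):
    """Position t of the orthogonal projection of x onto the line b -> a1.

    t = 0 corresponds to b and t = 1 to a1, so (1 - t) can be read as the
    apparent "b" share of x along this cline.
    """
    dx, dy = a1[0] - b[0], a1[1] - b[1]
    px, py = x[0] - b[0], x[1] - b[1]
    return (px * dx + py * dy) / (dx * dx + dy * dy)

# Hypothetical coordinates: B at the origin, A1 and A2 equally far from B,
# but pointing in slightly different directions.
B = (0.0, 0.0)
A1 = (1.0, 0.0)
theta = math.radians(15)  # assumed angular separation between A1 and A2
A2 = (math.cos(theta), math.sin(theta))

# Project A2 onto the B-A1 cline: A2 lands between B and A1.
t = projected_fraction(B, A1, A2)
print(f"A2 on the B-A1 cline: {100 * (1 - t):.1f}% B, {100 * t:.1f}% A1")

# The converse: project A1 onto the B-A2 cline. A1 now looks intermediate.
t_rev = projected_fraction(B, A2, A1)
print(f"A1 on the B-A2 cline: {100 * (1 - t_rev):.1f}% B, {100 * t_rev:.1f}% A2")
```

In this toy setup both projections equal the cosine of the angle between the two directions, so whichever population is not used to define the cline picks up the same apparent share of B.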

We can go one step further and consider a population that is not only "off-cline", but beyond its edges. This can be seen in the following figure:

As you can see, now A2 (which is beyond the edge of the horizontal cline) cannot be projected between B and A1.
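
The beyond-the-edge case can be sketched in the same way. Here I assume, as a simplification, that a constrained fit behaves like a projection clamped to the interval between the two fixed populations, since mixture proportions cannot be negative. The coordinates and function names are hypothetical.

```python
def cline_position(b, a1, x):
    """Unclamped position of x along b -> a1 (0 at b, 1 at a1)."""
    dx, dy = a1[0] - b[0], a1[1] - b[1]
    px, py = x[0] - b[0], x[1] - b[1]
    return (px * dx + py * dy) / (dx * dx + dy * dy)

def constrained_fraction(b, a1, x):
    """Position clamped to [0, 1]; an assumed stand-in for a fit whose
    proportions must be non-negative and sum to one."""
    return min(1.0, max(0.0, cline_position(b, a1, x)))

# Hypothetical coordinates: A2 lies slightly beyond A1 and slightly off-cline.
B = (0.0, 0.0)
A1 = (1.0, 0.0)
A2 = (1.1, 0.1)

raw = cline_position(B, A1, A2)        # greater than 1: past the A1 edge
fit = constrained_fraction(B, A1, A2)  # clamped: A2 looks like "pure" A1
rev = constrained_fraction(B, A2, A1)  # A1 on the B-A2 cline: intermediate

print(f"A2 on B-A1: raw position {raw:.3f}, constrained {fit:.3f}")
print(f"A1 on B-A2: constrained {rev:.3f}, i.e. {100 * (1 - rev):.1f}% apparent B")
```

Under this assumption the population beyond the edge shows no B share at all when it is the test population, while the population inside the edge picks up an apparent B share when the roles are swapped.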

In order to examine this intuition, I carried out a few simple tests.

I set B to be HGDP North_Italians, A1 to be Behar et al. (2010) Armenians (with 3 outliers excluded), and A2 to be Yunusbayev et al. (2011) Armenians_Y.

1) Unsupervised analysis

North_Italian 100.0 0.0
Armenians_Y 1.3 98.7
Armenians 2.8 97.2

2) Supervised analysis: Armenians_Y as test population; North_Italian, Armenians fixed

North_Italian 100.0 0.0
Armenians_Y 0.6 99.4
Armenians 0.0 100.0

3) Supervised analysis: Armenians as test population; North_Italian, Armenians_Y fixed

North_Italian 100.0 0.0
Armenians_Y 0.0 100.0
Armenians 2.7 97.3


These results seem to confirm the geometric intuition.

Populations beyond the cline

Now, I will add Assyrians_D, a population that seems closely related to Armenians, but appears to be a little more "eastern" in most analyses. So, it is "beyond" the North_Italian-Armenian cline.

1) Unsupervised analysis

North_Italian 100.0 0.0
Armenians_Y 3.6 96.4
Armenians 5.3 94.7
Assyrians 0.6 99.4

2) Supervised analysis: Assyrians, Armenians as test populations; North_Italian, Armenians_Y fixed

North_Italian 100.0 0.0
Armenians_Y 0.0 100.0
Armenians 3.8 96.2
Assyrians 0.5 99.5

Again, the intuition is confirmed. A reasonable recommendation is to avoid mapping populations that are geographically outside the convex hull of the fixed populations.

Long clines

The effect described in this post is expected to abate in "long clines". For example, the amount of drift between Miaozu and She populations from East Asia is expected to be minuscule relative to the distance of either population to North Italians:

1) Unsupervised analysis

North_Italian 100 0
Miaozu 0 100
She 0 100

2) Supervised analysis: She as test population; North_Italian, Miaozu fixed

(identical)

3) Supervised analysis: Miaozu as test population; North_Italian, She fixed

(identical)
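
Why long clines are more forgiving can be seen from the same toy geometry. In the sketch below, B sits at the origin, A1 and A2 are both at distance L from B, and they are separated from each other by a fixed drift distance d. The apparent B share of A2 on the B-A1 cline is then 1 - cos(theta) = d^2 / (2 L^2), so it shrinks with the square of the cline length. The numbers and the function name are illustrative assumptions, not estimates from real data.

```python
import math

def apparent_b_fraction(cline_length, drift):
    """Apparent B share of A2 when projected onto the B -> A1 cline.

    A1 and A2 are both at distance `cline_length` from B and separated
    from each other by the chord length `drift`.
    """
    # The angle subtended at B by a chord of length `drift`.
    theta = 2.0 * math.asin(drift / (2.0 * cline_length))
    a1 = (cline_length, 0.0)
    a2 = (cline_length * math.cos(theta), cline_length * math.sin(theta))
    # Projection position of A2 along B -> A1 (B is the origin).
    t = (a2[0] * a1[0] + a2[1] * a1[1]) / (a1[0] ** 2 + a1[1] ** 2)
    return 1.0 - t

drift = 0.2  # assumed, fixed separation between A1 and A2
for length in (1.0, 2.0, 5.0, 10.0):
    share = apparent_b_fraction(length, drift)
    print(f"cline length {length:>4}: apparent B share {100 * share:.3f}%")
```

With the same drift between the two endpoint populations, a cline ten times longer gives a spurious share one hundred times smaller in this model, which is consistent with the Miaozu and She results being indistinguishable.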

Conclusion

In determining the relative position of individuals along clines it is useful to remember:
  • The position is most accurately determined when the edges of the cline are most securely "fastened". Use as many populations and individuals from the perimeter of the region under study as possible.
  • The position is most accurately determined when the cline is long; small deviations due to drift or incomplete sampling at the edges are minuscule compared to the length of the cline. Components marking continent-wide distances (e.g., Europeans vs. East Asians) are estimated more accurately than those marking shorter distances (e.g., Southern Europeans vs. West Asians).

The way forward

There is no simple solution to the problem identified in this post. For short clines (e.g., within Europe) that are not securely fastened (few individuals from outlying groups), we can expect relatively large systematic errors.

As an analogy, imagine trying to measure the height of a 5-year-old against the wall with a measuring tape and a book. If you don't keep the book steady, one of the endpoints of your measurement will be "wobbly". If you don't keep your measuring tape vertical, your measurement will be off.

What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.

7 comments:

Onur Dincer said...

So, how should we interpret the high average total "European" component ("East European" + "West European") percentage of the Yunusbayev Armenians compared to those of the Dodecad and Behar Armenians? Your small tests show that the Yunusbayev Armenians are genetically more distant from European populations than the Dodecad and Behar Armenians are. Is it because of the high average "South Asian" component percentage of the Yunusbayev Armenians (4.3%) compared to those of the Dodecad and Behar Armenians (2.8% and 1.7% respectively)? This may explain why the Yunusbayev Armenians are genetically more distant from Europeans than the Dodecad and Behar Armenians are despite having higher average total "European" component percentage than the Dodecad and Behar Armenians have.

Dienekes said...

So, how should we interpret the high average total "European" component ("East European" + "West European") percentage of the Yunusbayev Armenians compared to those of the Dodecad and Behar Armenians?

With caution. Since these Armenians were sampled at a different location than the Behar et al. Armenians, they are likely to be different from them due to drift. In the next iteration of the Dodecad analysis, which will include the new Y. populations, their position vis a vis Europeans will be better defined.

Dienekes said...

And, actually the euro7 results already indicate that drift and the phenomenon described in this post account for the elevated European percentage

http://dodecad.blogspot.com/2011/09/euro7-calculator.html

But, there will be more Eurasian-wide analyses that will incorporate the new Y. samples, as well as those from participants that have accumulated since the development of v3.

Onur Dincer said...

And, actually the euro7 results already indicate that drift and the phenomenon described in this post account for the elevated European percentage

http://dodecad.blogspot.com/2011/09/euro7-calculator.html


Unfortunately, there is no South Asian population and consequently no "South Asian" component in euro7. This makes its Armenian results less reliable than they would be in the presence of a "South Asian" component.

pconroy said...

My South Asian segments seem to have been subsumed into Far_Asian

Onur Dincer said...

My South Asian segments seem to have been subsumed into Far_Asian

Yeah, that is a common result in the populations analyzed.

Onur Dincer said...

Mait Metspalu (one of the lead authors of the Yunusbayev et al. and Behar et al. papers) has just told me via email that the Yunusbayev Armenian samples come from different parts of the Republic of Armenia. So the Yunusbayev Armenians are unlikely to have experienced genetic drift as a whole; they must be much more representative of the overall Armenian genetic variation than the Behar Armenians (even than just the non-outlier Behar Armenians), since, in contrast to the Yunusbayev Armenians, the Behar Armenians all come from a single city (Maikop) in Russia (Mr. Metspalu previously told me that).