My recent analysis of the Yunusbayev et al. (2011) data revealed an interesting anomaly: the Armenians_Y sample tested much more "European" than the existing Armenians_D and Armenians (from Behar et al. (2010)) samples.
A related "anomaly" was the use of some of my newer Dodecad tools that have been targeted to particular regions (e.g., Europe and West Asia for the newer euro7 calculator) by individuals from outside these regions (e.g., South Asia for euro7). Of course, I have cautioned against such use, but can we say something about why it is not a very good idea?
Suspecting a systematic effect, I decided to investigate.
My geometric intuition is encapsulated in the following figure:
Suppose that a cline has been inferred from B to A1. Suppose that dist(A1, B) = dist(A2, B). So, A1 and A2 differ from each other in an orthogonal direction relative to their difference from B.
Now, if we project A2 onto the BA1 line, we see that A2 appears "intermediate" between them. The converse would occur if we project A1 onto the BA2 line.
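The projection effect can be illustrated with a toy Euclidean sketch. This is not what the admixture software actually computes; the 2-D coordinates, the 20-degree angle between the two directions, and the helper name project_fraction are all assumptions made purely for illustration.

```python
import math

def project_fraction(p, b, a1):
    """Fraction of the way from b to a1 at which p's orthogonal projection lands."""
    d = [x - y for x, y in zip(a1, b)]   # direction of the inferred cline
    v = [x - y for x, y in zip(p, b)]    # position of p relative to b
    return sum(x * y for x, y in zip(v, d)) / sum(x * x for x in d)

# Hypothetical coordinates: B at the origin, A1 at unit distance along the cline.
B = (0.0, 0.0)
A1 = (1.0, 0.0)

# A2 is at the same distance from B as A1, but rotated 20 degrees off the cline.
theta = math.radians(20)
A2 = (math.cos(theta), math.sin(theta))

t = project_fraction(A2, B, A1)
print(f"A2 projects at {t:.3f} of the way from B to A1")
print(f"apparent B-like share of A2: {1 - t:.3f}")
```

In this toy setup the projection lands at cos(theta) of the way from B to A1, so A2 looks partly B-like even though it is exactly as far from B as A1 is.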
This is reminiscent of my criticism of Moorjani et al. (2011), in that shifts away from a linear cline cause spurious admixture to be inferred. But it is more general:
- A2 may differ from A1 because it has cryptic admixture from an unsampled group, or vice versa
- A2 may differ from A1 because of random genetic drift
As you can see, now A2 (which is beyond the edge of the horizontal cline) cannot be projected between B and A1.
In order to examine this intuition, I carried out a few simple tests.
I set B to be HGDP North_Italians, A1 to be Behar et al. (2010) Armenians (with 3 outliers excluded), and A2 to be Yunusbayev et al. (2011) Armenians_Y.
1) Unsupervised analysis
North_Italian 100.0 0.0
Armenians_Y 1.3 98.7
Armenians 2.8 97.2
2) Supervised analysis: Armenians_Y as test population; North_Italian, Armenians fixed
North_Italian 100.0 0.0
Armenians_Y 0.6 99.4
Armenians 0.0 100.0
3) Supervised analysis: Armenians as test population; North_Italian, Armenians_Y fixed
North_Italian 100.0 0.0
Armenians_Y 0.0 100.0
Armenians 2.7 97.3
These results seem to confirm the geometric intuition.
Populations beyond the cline
Now, I will add Assyrians_D, a population that seems closely related to Armenians, but appears to be a little more "eastern" in most analyses. So, it is "beyond" the North_Italian-Armenian cline.
1) Unsupervised analysis
North_Italian 100.0 0.0
Armenians_Y 3.6 96.4
Armenians 5.3 94.7
Assyrians 0.6 99.4
2) Supervised analysis: Assyrians, Armenians as test populations; North_Italian, Armenians_Y fixed
North_Italian 100.0 0.0
Armenians_Y 0.0 100.0
Armenians 3.8 96.2
Assyrians 0.5 99.5
Again, the intuition is confirmed. A reasonable recommendation is to avoid mapping populations that are geographically outside the convex hull of the fixed populations.
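A rough way to picture why a population beyond the cline is uninformative: admixture proportions cannot be negative or exceed 100%, so in a toy one-dimensional picture the fitted position gets pinned to the nearest endpoint. The sketch below assumes a simple Euclidean projection followed by clamping; the coordinates and the helper names raw_fraction and clamped_fraction are hypothetical and stand in for the real model fit.

```python
def raw_fraction(p, b, a1):
    """Unconstrained position of p along the line from b to a1 (0 at b, 1 at a1)."""
    d = [x - y for x, y in zip(a1, b)]
    v = [x - y for x, y in zip(p, b)]
    return sum(x * y for x, y in zip(v, d)) / sum(x * x for x in d)

def clamped_fraction(p, b, a1):
    """Same position, restricted to the segment, as a proportion must be."""
    return min(1.0, max(0.0, raw_fraction(p, b, a1)))

# Hypothetical coordinates: the cline runs from B to A1.
B = (0.0, 0.0)
A1 = (1.0, 0.0)
beyond = (1.2, 0.1)   # a population lying past the A1 end of the cline

raw = raw_fraction(beyond, B, A1)
fit = clamped_fraction(beyond, B, A1)
print(f"unconstrained position: {raw:.2f}")
print(f"reported A1 share after clamping: {fit:.2f}")
```

Under these assumptions the population past the edge simply reads as about 100% A1, and how far beyond the edge it lies is lost.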
Long clines
The effect described in this post is expected to abate in "long clines". For example, the amount of drift between Miaozu and She populations from east Asia is expected to be minuscule relative to the distance of either population to North Italians:
1) Unsupervised analysis
North_Italian 100 0
Miaozu 0 100
She 0 100
2) Supervised analysis: She as test population; North_Italian, Miaozu fixed
(identical)
3) Supervised analysis: Miaozu as test population; North_Italian, She fixed
(identical)
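The same toy geometry suggests why long clines are more forgiving. Assume A1 and A2 both sit at distance L from B and are separated from each other by a fixed distance d (standing in for drift or cryptic admixture). The angle between them as seen from B is then 2*asin(d / (2L)), and the spurious B-like share after projection is 1 - cos(angle), which works out to d^2 / (2 L^2). The value d = 0.3 and the cline lengths below are arbitrary illustrative numbers, not estimates from any dataset.

```python
import math

def apparent_share(d, L):
    """Spurious B-like share when A1 and A2 are both at distance L from B
    and a distance d apart from each other (requires d <= 2 * L)."""
    theta = 2.0 * math.asin(d / (2.0 * L))   # angle between B->A1 and B->A2
    return 1.0 - math.cos(theta)

d = 0.3   # fixed separation between A1 and A2 (arbitrary units)
shares = {L: apparent_share(d, L) for L in (1.0, 2.0, 5.0, 10.0)}

for L, s in shares.items():
    print(f"cline length {L:4.1f}: apparent B-like share {s:.4%}")
```

Because the spurious share falls off with the square of the cline length in this sketch, a tenfold longer cline shrinks it a hundredfold.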
Conclusion
In determining the relative position of individuals along clines it is useful to remember:
- The position is most accurately determined when the edges of the cline are most securely "fastened". Use as many populations and individuals from the perimeter of the region under study as possible.
- The position is most accurately determined when the cline is long; small deviations due to drift or incomplete sampling at the edges are minuscule compared to the length of the cline. Components marking continent-wide distances (e.g., Europeans vs. East Asians) are estimated more accurately than those marking shorter distances (e.g., Southern Europeans vs. West Asians).
There is no simple solution to the problem identified in this post. For short clines (e.g., within Europe) that are not securely fastened (few individuals from outlying groups), we can expect relatively large systematic errors.
As an analogy, imagine trying to measure the height of a 5-year-old against the wall with a measuring tape and a book. If you don't keep the book steady, one of the endpoints of your measurement will be "wobbly". If you don't keep your measuring tape vertical, your measurement will be off.
What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.