I was planning to do a post on the meaning of cluster separability, but this new paper actually demonstrates the main point I was going to investigate.
Over the last few years, studies such as this have shown that it is possible to distinguish between many European population groups.
However, clusters that are completely separable may still harbor a substantial amount of internal variation. To see this, consider the following example of two groups, each consisting of 3 individuals and 5 markers:
It is easy to see that group 1 is perfectly separable from group 2, in this case simply by looking at the first marker, where group 1 invariably has an A and group 2 invariably has a C.
But if we look for the "best match" of each individual, we see that, e.g., for individual a of group 1 the best match is d from group 2; for b it is e, and for c it is f.
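The scenario just described can be reproduced with a toy data set. This is only a minimal sketch: the genotypes below are made up for illustration and are just one of many arrangements that fit the description, and the names `genotypes`, `shared_markers`, and `best_match` are hypothetical. Similarity is taken to be the number of markers at which two individuals carry the same allele.

```python
# Hypothetical genotypes, invented for illustration: 2 groups, 3 individuals, 5 markers.
genotypes = {
    "a": ("group 1", "AGGGG"),
    "b": ("group 1", "ATTTT"),
    "c": ("group 1", "ACCAA"),
    "d": ("group 2", "CGGGG"),
    "e": ("group 2", "CTTTT"),
    "f": ("group 2", "CCCAA"),
}


def shared_markers(x, y):
    """Count the marker positions at which two individuals carry the same allele."""
    return sum(p == q for p, q in zip(x, y))


def best_match(name):
    """Return the other individual who shares the most markers with `name`."""
    own = genotypes[name][1]
    others = [n for n in genotypes if n != name]
    return max(others, key=lambda n: shared_markers(own, genotypes[n][1]))


# Alleles seen at the first marker in each group: a single, different allele per group
# means the first marker alone separates the two groups perfectly.
first_alleles = {
    group: {seq[0] for grp, seq in genotypes.values() if grp == group}
    for group in ("group 1", "group 2")
}

best_matches = {name: best_match(name) for name in genotypes}

print(first_alleles)
for name, match in best_matches.items():
    print(name, "->", match, "in", genotypes[match][0])
```

In this made-up data every individual's best match sits in the other group, even though the first marker assigns each individual to the correct group without error.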
Now, for very distant populations, the scenario described above will almost never occur. Using 10K SNPs from the HGDP data, for the post I was preparing, I was able to conclude, for example, that any two Oroqen are always closer to each other than either is to any Bantu sample.
The new study addresses the same question for closely related European groups, showing that an individual does not always have a member of his own group as his "best overall match" (BOM). Finns, for example, who appear the most distinct, have a Finnish BOM 39 out of 47 times, while some Finns have a Norwegian, German, or Polish BOM.
Moving into Central Europe, we see some counterintuitive results: no Austrian has an Austrian BOM, for example; their BOMs are instead British, Danish, Dutch, German, Italian, or Polish.
Sample sizes play a role, however. For example, 25 out of 51 Greeks have a German BOM, and only 7 out of 51 have a Greek BOM. But since the sample contains 983 Germans overall, a Greek is in fact 5.4 times more likely to match any given Greek than any given German.
Some markers and combinations of markers do differ between groups in a systematic way, like the A/C in the simple example above. Such markers allow us to separate groups and distinguish between them. But if we look at the overall genetic similarity between individuals, it turns out that members of one group may be more similar to some members of another group than to members of their own.
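The 5.4 figure can be checked with a little arithmetic. The sketch below assumes that each BOM count is simply divided by the size of the group the matches come from (51 Greeks, 983 Germans), which gives a rate per available candidate. The variable names are hypothetical and the counts are the ones quoted above.

```python
# Counts quoted in the post.
n_greeks = 51            # Greek sample size
n_germans = 983          # German sample size
greek_bom_is_greek = 7   # Greeks whose BOM is Greek
greek_bom_is_german = 25 # Greeks whose BOM is German

# Assumption: normalise each count by the number of candidates in that group,
# so that a large group is not favoured merely for being large.
rate_per_greek = greek_bom_is_greek / n_greeks
rate_per_german = greek_bom_is_german / n_germans

ratio = rate_per_greek / rate_per_german
print(round(ratio, 1))  # 5.4
```

Counting only the 50 other Greeks as candidates (an individual cannot be his own BOM) would move the ratio to about 5.5, so the conclusion does not hinge on that detail.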
So, if one were to be in a room with people from all over Europe, say during a meeting of the European Parliament, he might share some traits with people from his own country, but his best overall genetic match might be quite different.
Someone with the computing power and patience should carry out this investigation with the large HGDP dataset, to see which groups are strongly separable in the Oroqen-Bantu sense, and which ones are more weakly separable as in the European sense.
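As a rough sketch of what such a test could look like, one can call a group strongly separable when its largest within-group distance is smaller than the smallest distance from any of its members to an outsider. This is only one possible formalization of the Oroqen-Bantu observation. The function name `strongly_separable`, the group labels, and the distance matrix below are all hypothetical, and the function expects a group with at least two members and at least one outsider.

```python
def strongly_separable(dist, labels, group):
    """True if every pair inside `group` is closer than any member-to-outsider pair.

    dist   -- symmetric matrix (list of lists) of pairwise genetic distances
    labels -- group label of each individual, in the same order as `dist`
    """
    members = [i for i, g in enumerate(labels) if g == group]
    outsiders = [i for i, g in enumerate(labels) if g != group]
    within = [dist[i][j] for i in members for j in members if i < j]
    between = [dist[i][j] for i in members for j in outsiders]
    return max(within) < min(between)


# Made-up distances for four individuals, two per hypothetical group.
labels = ["X", "X", "Y", "Y"]
dist = [
    [0, 1, 5, 6],
    [1, 0, 5, 7],
    [5, 5, 0, 6],
    [6, 7, 6, 0],
]

result = {g: strongly_separable(dist, labels, g) for g in ("X", "Y")}
print(result)
```

In this made-up matrix group X passes the test while group Y does not, because the two Y individuals are farther from each other than one of them is from the X individuals. The property is therefore checked per group rather than per pair of groups.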
European Journal of Human Genetics doi: 10.1038/ejhg.2008.266
An evaluation of the genetic-matched pair study design using genome-wide SNP data from the European population
Timothy Tehva Lu et al.
Genetic matching potentially provides a means to alleviate the effects of incomplete Mendelian randomization in population-based gene–disease association studies. We therefore evaluated the genetic-matched pair study design on the basis of genome-wide SNP data (309 790 markers; Affymetrix GeneChip Human Mapping 500K Array) from 2457 individuals, sampled at 23 different recruitment sites across Europe. Using pair-wise identity-by-state (IBS) as a matching criterion, we tried to derive a subset of markers that would allow identification of the best overall matching (BOM) partner for a given individual, based on the IBS status for the subset alone. However, our results suggest that, by following this approach, the prediction accuracy is only notably improved by the first 20 markers selected, and increases proportionally to the marker number thereafter. Furthermore, in a considerable proportion of cases (76.0%), the BOM of a given individual, based on the complete marker set, came from a different recruitment site than the individual itself. A second marker set, specifically selected for ancestry sensitivity using singular value decomposition, performed even more poorly and was no more capable of predicting the BOM than randomly chosen subsets. This leads us to conclude that, at least in Europe, the utility of the genetic-matched pair study design depends critically on the availability of comprehensive genotype information for both cases and controls.