January 22, 2009

Best overall matching in Europeans

I was planning to do a post on the meaning of cluster separability, but this new paper actually demonstrates the main point I was going to investigate.

Over the last few years, studies such as this have shown that it is possible to distinguish between many European population groups.

However, clusters that are completely separable may still harbor a substantial amount of internal variation. To see this, consider the following example of two groups, each consisting of 3 individuals typed at 5 markers:

G1:

a: ACGTA
b: AGACT
c: ACATT

G2:

d: CCGTA
e: CGACT
f: CCATT

It is easy to see that group 1 is perfectly separable from group 2, in this case simply by looking at the first marker, where group 1 invariably has an A and group 2 invariably has a C.

But if we look for the "best match" of each individual, we see that, e.g., for individual a of group 1 the best match is d from group 2 (a and d differ only at the first marker, whereas a differs from b at four markers and from c at two). Likewise, for b it is e and for c it is f.
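The toy example can be checked mechanically. The following is only a minimal sketch: it assumes that "best match" means the other individual sharing the most markers, and the names `shared_markers` and `best_match` are made up for illustration.

```python
# Toy data from the example above: 3 individuals per group, 5 markers each.
genotypes = {
    "a": "ACGTA", "b": "AGACT", "c": "ACATT",   # group 1
    "d": "CCGTA", "e": "CGACT", "f": "CCATT",   # group 2
}
group = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}

def shared_markers(x, y):
    """Count the positions at which two marker strings agree."""
    return sum(m == n for m, n in zip(x, y))

def best_match(name):
    """Return the other individual sharing the most markers with `name`."""
    others = [o for o in genotypes if o != name]
    return max(others, key=lambda o: shared_markers(genotypes[name], genotypes[o]))

best = {name: best_match(name) for name in genotypes}
for name, match in best.items():
    where = "same group" if group[name] == group[match] else "other group"
    print(f"{name} -> {match} ({where})")
```

Every individual ends up with a best match in the other group, even though the first marker alone separates the two groups perfectly.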

Now, for very distant populations, the scenario described above will almost never occur. Using 10K SNPs from the HGDP data, for the post I was preparing, I was able to conclude, for example, that any pair of Oroqen is always closer to each other than an Oroqen is to any Bantu sample.

This new study answers this question in the case of closely related European groups, showing that it is not the case that an individual will always have a member of his own group as his "best overall match" (BOM). Finns, for example, who appear as most distinct, have a Finnish BOM some 39 (out of 47) times, while some Finns have a Norwegian, German, or Polish BOM.

Moving into Central Europe, we see some counterintuitive results: no Austrian has an Austrian BOM, for example; their BOMs are British, Danish, Dutch, German, Italian, and Polish instead.

Sample sizes play a role, however. For example, 25 out of 51 Greeks have Germans as their BOM, and only 7 out of 51 have Greek BOMs. But, since there is a sample of 983 Germans overall, after correcting for sample size Greeks are in fact 5.4 times more likely to match a Greek than a German.
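The 5.4 figure can be reproduced with a simple per-capita correction. This is a sketch under the assumption that the correction just divides each BOM count by the size of the sample it was drawn from; the paper's own procedure may differ in detail, and the variable names are mine.

```python
# Counts taken from the paragraph above.
greek_sample = 51     # Greeks in the study
german_sample = 983   # Germans in the study
greek_boms = 7        # Greeks whose BOM is another Greek
german_boms = 25      # Greeks whose BOM is a German

# BOMs per sampled individual of each candidate population (assumed correction).
rate_greek = greek_boms / greek_sample
rate_german = german_boms / german_sample

ratio = rate_greek / rate_german
print(round(ratio, 1))  # 5.4
```

Since a Greek cannot be his own BOM, one could argue for dividing by 50 rather than 51, which gives about 5.5 instead. Either way the raw counts understate how often Greeks match Greeks.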

Some markers and combinations of markers do differ between groups in a systematic way, like the A/C in the simple example above. Such markers allow us to separate groups, and distinguish between them. But, if we look at the overall genetic similarity between individuals, it turns out that members of one group may be more similar to some members of another than to their own.

So, if one were to be in a room with people from all over Europe, say during a meeting of the European Parliament, he might share some traits with people from his own country, but his best overall genetic match might be quite different.

Someone with the computing power and patience should carry out this investigation with the large HGDP dataset, to see which groups are strongly separable in the Oroqen-Bantu sense, and which ones are more weakly separable as in the European sense.
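To make the distinction concrete, here is a rough sketch of what a test for strong separability could look like. It assumes a symmetric reading of the Oroqen-Bantu criterion (every within-group pair is more similar than every cross-group pair) and uses a plain count of shared markers as the similarity; the function names are hypothetical.

```python
from itertools import combinations, product

def shared(x, y):
    """Count the positions at which two marker strings agree."""
    return sum(m == n for m, n in zip(x, y))

def strongly_separable(g1, g2):
    """True if every within-group pair is more similar than every cross-group pair."""
    within = [shared(x, y) for g in (g1, g2) for x, y in combinations(g, 2)]
    between = [shared(x, y) for x, y in product(g1, g2)]
    return min(within) > max(between)

# The two toy groups from earlier in the post: separable, but only weakly.
g1 = ["ACGTA", "AGACT", "ACATT"]
g2 = ["CCGTA", "CGACT", "CCATT"]
print(strongly_separable(g1, g2))      # False

# Two made-up, very distant groups for contrast.
far1 = ["AAAAA", "AAAAC"]
far2 = ["GGGGG", "GGGGT"]
print(strongly_separable(far1, far2))  # True
```

On real data the comparison would run over all pairs of individuals, which is where the computing power and patience come in.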

European Journal of Human Genetics doi: 10.1038/ejhg.2008.266

An evaluation of the genetic-matched pair study design using genome-wide SNP data from the European population

Timothy Tehva Lu et al.

Abstract

Genetic matching potentially provides a means to alleviate the effects of incomplete Mendelian randomization in population-based gene–disease association studies. We therefore evaluated the genetic-matched pair study design on the basis of genome-wide SNP data (309 790 markers; Affymetrix GeneChip Human Mapping 500K Array) from 2457 individuals, sampled at 23 different recruitment sites across Europe. Using pair-wise identity-by-state (IBS) as a matching criterion, we tried to derive a subset of markers that would allow identification of the best overall matching (BOM) partner for a given individual, based on the IBS status for the subset alone. However, our results suggest that, by following this approach, the prediction accuracy is only notably improved by the first 20 markers selected, and increases proportionally to the marker number thereafter. Furthermore, in a considerable proportion of cases (76.0%), the BOM of a given individual, based on the complete marker set, came from a different recruitment site than the individual itself. A second marker set, specifically selected for ancestry sensitivity using singular value decomposition, performed even more poorly and was no more capable of predicting the BOM than randomly chosen subsets. This leads us to conclude that, at least in Europe, the utility of the genetic-matched pair study design depends critically on the availability of comprehensive genotype information for both cases and controls.


6 comments:

Polak said...

Interesting stuff, but their table is very confusing. Here are the rates of BOMs for Poland after correcting for varying sample sizes.

1. Denmark - 42.5%
2. Poland - 31.5%
3. Norway - 11.1%
4. North Germany (DE1) - 6.2%
5. Czech Rep. - 4.3%
6. South Germany (DE2) - 2.0%
7. Netherlands - 1.4%
8. United Kingdom - 1%

Unless corrected, Poles end up having most BOMs in North Germany. And that's no wonder considering the number of Germans tested.

Polak said...

Results for Finland...

1. Finland - 90.37%
2. Poland - 6.68%
3. Norway - 2.08%
4. North Germany - 0.87%

Polak said...

I gotta say, some of these results are very curious. But often there are patterns to them, and sometimes they go both ways. Denmark and Norway...

1. Denmark - 42.61%
2. Poland - 15.39%
3. North Germany - 11.71%
4. Holland - 11.65%
5. Italy (Marche) - 10.26%
6. Norway - 4.81%
7. South Germany - 3.56%

1. Norway - 36.9%
2. Poland - 19.64%
3. Italy (Marche) - 14.76%
4. North Germany - 12.16%
5. Sweden - 12.16%
6. South Germany - 3.46%
7. Holland - 2.6%

AX said...

Dienekes, will you post your intended Oroqen-Bantu comparison post in the future or at least describe its methodology?

Dienekes said...

Dienekes, will you post your intended Oroqen-Bantu comparison post in the future or at least describe its methodology?

The sample consisted of 10K SNPs, basically every 65th one from the raw data file. The comparison was between Oroqen and one of the Bantu groups, which I don't recall. The score function was 0 for both alleles different, 0.5 for one allele the same, 1 for both alleles the same.

It was just something I did on the fly to see for myself, so I don't plan to post on it soon.
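For anyone who wants to try it, the score function described above can be sketched roughly as follows. The genotype encoding (two-letter strings such as "AG"), the averaging over SNPs, and the names `ibs_score` and `similarity` are assumptions for illustration, not the code that was actually used.

```python
from collections import Counter

def ibs_score(g1, g2):
    """0 for no shared alleles, 0.5 for one, 1 for both (allele order ignored)."""
    shared = sum((Counter(g1) & Counter(g2)).values())
    return shared / 2

def similarity(ind1, ind2, step=1):
    """Average the per-SNP score over every `step`-th SNP of two individuals."""
    pairs = list(zip(ind1, ind2))[::step]
    return sum(ibs_score(x, y) for x, y in pairs) / len(pairs)

# Two made-up individuals typed at four SNPs.
x = ["AA", "AG", "CC", "CT"]
y = ["AG", "GA", "TT", "CT"]
print(similarity(x, y))  # 0.625
```

Passing `step=65` would mimic taking every 65th SNP from the raw file.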

AX said...

Dienekes, thanks a lot for the explanation. It makes sense. Though, have the authors of the study calculated this relatedness in a similar manner or in a different way? If the latter, could you briefly explain how, or copy and paste from the study itself?