June 12, 2008

Classifier for 23andMe/deCODEme genotype data

As I mentioned in my previous post, the genotype data provided by companies such as 23andMe and deCODEme allow us to build ancestry assessment tools that use published genotype data from scientific studies.

I have built a simple classifier tool based on the panel of 300 markers of Price et al. (2008), which uses the frequency data supplied in this paper to assess the probability that an individual belongs to the "Northwest European", "Southeast European", or "Ashkenazi Jewish" categories.

The input is genotype values for a number of markers (e.g., the 169/192 markers in common between the deCODEme and 23andMe results) for an individual, and the output is a set of three probabilities for belonging to any of the three groups, summing up to 1.

Using the Greg and Lilly Mendel data that you can download from 23andMe, I came up with the following probabilities (NWE,SEE,AJ):

Greg: 0.89, 0.11, 0
Lilly: 1, 0, 0
(corrected June 15)

23andMe lists the similarity of these individuals to "Northern Europeans", "Southern Europeans", and "Near Easterners" as:

Greg: 67.84, 67.74, 67.15
Lilly: 67.85, 67.72, 67.11


So, at least for these two individuals the results of my calculator appear to be analogous to those reported by 23andMe, with Lilly seeming more "Northern" than her husband.

PS: Unfortunately my calculator cannot be released at present, as it's not a standalone program but rather relies on a bunch of different, minimally developed tools.

3 comments:

cacio said...

Intriguing. Are you weighting the marker values using the weights provided (frequencies) and then normalizing the three values to 1? Or are you using something more sophisticated?

I may try something similar if I have time. Thanks for the post.

cacio

Dienekes said...

The probability that a person from, say, the NW European group will have a particular multi-marker genotype is the product of the probabilities that they have the observed genotype at each individual SNP (assuming independence; I didn't look into the marker selection process in the paper, but I don't think they would have picked tightly linked markers).

The probability that they have the reference allele for a SNP is the frequency of that allele, which can be read off the authors' table. One minus that frequency is the probability that they don't have it.

So, if in a particular group the ref alleles are:

ACGT

and their frequencies are:
0.2,0.3,0.1,0.4

and the individual has a genotype:
ACCT

Then we calculate:
0.2*0.3*(1-0.1)*0.4

for him, and the same for all three populations. Finally you have to normalize things so that they add up to 1.
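
As a rough illustration of the calculation just described, here is a minimal Python sketch. It follows the simplified one-allele-per-marker example above, and it is not the actual calculator. The function names are hypothetical, and the frequencies for the second and third groups are invented purely to show the normalization step; only the first group's numbers come from the example.

```python
def genotype_likelihood(ref_alleles, ref_freqs, observed):
    """Product over markers of P(observed allele | group), assuming independence."""
    likelihood = 1.0
    for ref, freq, obs in zip(ref_alleles, ref_freqs, observed):
        # freq if the individual carries the reference allele, 1 - freq otherwise
        likelihood *= freq if obs == ref else 1.0 - freq
    return likelihood


def normalize(likelihoods):
    """Rescale per-group likelihoods so that they add up to 1."""
    total = sum(likelihoods.values())
    return {group: value / total for group, value in likelihoods.items()}


ref_alleles = "ACGT"
observed = "ACCT"

# Group 1 uses the frequencies from the example; groups 2 and 3 are made up.
group_freqs = {
    "group1": [0.2, 0.3, 0.1, 0.4],
    "group2": [0.5, 0.5, 0.5, 0.5],  # hypothetical
    "group3": [0.1, 0.2, 0.6, 0.3],  # hypothetical
}

likelihoods = {
    group: genotype_likelihood(ref_alleles, freqs, observed)
    for group, freqs in group_freqs.items()
}
probabilities = normalize(likelihoods)

for group, p in probabilities.items():
    print(group, round(p, 3))
```

For the first group this reproduces 0.2*0.3*(1-0.1)*0.4 from the example. Dividing by the sum treats the three groups as equally likely beforehand, which is an assumption of this sketch.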

Another important consideration is to work with log(P) and sum the logs rather than multiply the probabilities: if you multiply hundreds of probabilities, the product can underflow to 0, i.e., the machine can't represent such a small number.
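
The log trick can be sketched in Python as follows. This is one possible way to do it, not necessarily how the calculator works: the function names are hypothetical, the frequencies are assumed to lie strictly between 0 and 1 (log of 0 is undefined), and subtracting the largest log value before exponentiating is an extra step added here so that the normalization itself does not underflow.

```python
import math


def log_likelihood(ref_alleles, ref_freqs, observed):
    """Sum of log-probabilities over markers instead of a product of probabilities."""
    total = 0.0
    for ref, freq, obs in zip(ref_alleles, ref_freqs, observed):
        p = freq if obs == ref else 1.0 - freq
        total += math.log(p)  # assumes 0 < p < 1
    return total


def normalize_logs(log_likelihoods):
    """Turn per-group log-likelihoods into probabilities that add up to 1."""
    peak = max(log_likelihoods.values())
    # Shift by the largest value so that the best group maps to exp(0) = 1.
    shifted = {g: math.exp(v - peak) for g, v in log_likelihoods.items()}
    total = sum(shifted.values())
    return {g: v / total for g, v in shifted.items()}


# The direct product of many small probabilities underflows to zero...
direct_product = 1.0
for _ in range(400):
    direct_product *= 0.1
print(direct_product)

# ...while the sum of logs stays perfectly representable.
log_sum = 400 * math.log(0.1)
print(log_sum)

# Hypothetical log-likelihoods for three groups, far too small to exponentiate directly.
example_logs = {"group1": -1000.0, "group2": -1001.0, "group3": -1005.0}
print(normalize_logs(example_logs))
```

Only the differences between the log values matter for the final probabilities, which is why the shift by the largest value leaves the result unchanged.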

cacio said...

Thanks! I'll try playing around ...