Comments on Dienekes’ Anthropology Blog: How to fix 23andMe's Ancestry Composition

I would like to add two things: Firstly, I just l...

2012-12-19T13:17:03.353+02:00

I would like to add two things:

Firstly, I just learned that another commercial provider combines France and Germany into "Central European." I really don't know what they are thinking - clearly they are not.

Secondly, there is hope that things will improve. Fortunately, the GenoChip 2.0 Genographic Project asks for local information down to the village (!) level. (The bad news is that this is only asked uniparentally and only as far back as participants know, which is not ideal, but is perhaps a decent approximation of autosomal contribution, especially if good filters are used.)

@eurologist I was wondering about the validity of...

2012-12-12T15:33:06.117+02:00

@eurologist

I was wondering about the validity of those calculations… Thank you very much for the explanation.

Westgoth, Surely OT - but I could not leave this ...

2012-12-12T12:34:04.936+02:00

Westgoth,

Surely OT - but I could not leave this unanswered.

I smell Klyosov's breath.

This paper is useless, because it does not take into account four major limitations:

(i) back mutations are far more frequent for STRs than for SNPs
(ii) there is a saturation effect: mutation rates decrease quickly as STR copy numbers grow
(iii) the mutation rate varies extremely widely from locus to locus. It is almost impossible to establish a mean.
(iv) STR mutation rates have historically been overestimated by at least a factor of three, as modern work and comparison to ancient DNA have shown. The author uses a rate that is seven years old and thus likely completely outdated. Add to this a factor of 2-3 of uncertainty due to (iii).

It would also make sense to split (in particular l...

2012-12-12T11:29:42.770+02:00

It would also make sense to split countries (larger ones in particular) into regional domains - at the minimum four quadrants. Even if the division into quadrants is ambiguous, it will help, and in most countries people have an intuition for where such boundaries should lie, rather than drawing them straight down the middle. A fifth "central" domain might also make sense.

E.g., as I mentioned before, SW French have much less in common with Central Europeans than E French do; and Hungarians - in particular those who are not ethnic Romanians or from the extreme east - are very close to other Central Europeans and Germans, but are not close to typical Eastern Europeans (neither are Czechs or Slovenians). Differences of similar magnitude surely also exist in Poland, Ukraine, and Turkey.

It almost looks like they did not consult an expert on the European autosomal makeup. I know, one could argue that Fst is generally small in Europe - but by now we have sufficient information to exploit even the small differences to a very regional (I would say, about state/regional-level) resolution.

Yes, there should be an ethnicity option. People s...

2012-12-11T17:57:56.329+02:00

Yes, there should be an ethnicity option. People should be asked to give information about the ethnicity of their ancestors in the last few generations. Populations should be based on ethnic group.

Off-topic: http://arxiv.org/abs/1103.0878 Order o...

2012-12-11T11:51:43.710+02:00

Off-topic:
http://arxiv.org/abs/1103.0878

Order of precedence and age of Y-DNA haplotypes -

"[...] conclusion that on its migration to Europe the R1b population split once at about 6000 years ago in the teritory of Asia Minor while its final expansion into whole of Western and Central Europe, and separation into local populations occured around 4000 years ago during expansion of the archeological Bell Beaker culture."

@Eric Durand, Thank you for your reply, and looki...

2012-12-11T11:44:45.016+02:00

@Eric Durand,

Thank you for your reply. I look forward to seeing how Ancestry Composition will evolve over time.

23andMe tend to be arrogant about the apps they pr...

2012-12-11T05:19:59.836+02:00

23andMe tend to be arrogant about the apps they provide; some of the functions date back to 2008 and remain unrevised despite the greater number of reference groups now available.

Ancestry Composition does give people whose ancestry is known for hundreds of years, and matches one of the reference groups, confirmation of their ancestry, but at a cost. Finns lose their East Eurasian ancestry, South Asians lose most of their East Eurasian ancestry, and Anatolian Turks lose everything except their Middle Eastern ancestry, which for Anatolian Turks is only a minor contributor to an overall ancestry that is West Asian.

I seriously doubt 23andMe will do much about the problem of their Nationality Composition.

I think 23andMe needs to have an ethnicity option a...

2012-12-11T02:19:20.930+02:00

I think 23andMe needs to have an ethnicity option and not just country of birth. I am a Kurd from Turkey and there is no option to select Kurdistan or Kurd. Many Kurds from Turkey were likely included in the reference population for Turkey, which is wrong. It would be better for Kurds to select Iran rather than any other country, as our DNA matches best with Iranians.

Dear Dienekes, First of all, I’d like to thank y...

2012-12-11T00:51:44.652+02:00

Dear Dienekes,

First of all, I’d like to thank you for the constructive comments on our new admixture tool, Ancestry Composition.

We are aware that the people in the training set might experience overfitting. Overfitting is a common problem with almost any machine learning method, and we’ve known for a while that it would occur to some extent in our inference engine. We have taken steps to minimize the impact of overfitting, and are still actively working on reducing it further.

The main thing to do when trying to prevent overfitting is regularization. At the core of Ancestry Composition is a well known machine learning algorithm called Support Vector Machines (SVM). It is easy to implement L2-regularization for SVMs, and we did. Furthermore, the strength of this L2-regularization was chosen using cross-validation, an approach similar to your suggestion: we split the training data into 5 groups, training on 80% and testing on the remaining 20%.
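For readers who want to see the general idea, here is a minimal sketch of choosing an L2-regularization strength by 5-fold cross-validation. It is not the Ancestry Composition code: the linear SVM is a small pure-Python subgradient-descent implementation, the data are toy 2-D clusters, and every name in it (train_linear_svm, cross_validate, choose_lambda, the candidate values of lam) is an assumption made for illustration.

```python
import random

def train_linear_svm(X, y, lam, epochs=200, lr=0.05):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss:
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b)))."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0  # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # margin violated, so the hinge term is active
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def accuracy(w, b, X, y):
    """Fraction of points whose predicted sign matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += (1 if score >= 0 else -1) == yi
    return hits / len(X)

def cross_validate(X, y, lam, k=5, seed=0):
    """Mean held-out accuracy over k folds (k=5: train on 80%, test on 20%)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w, b = train_linear_svm([X[i] for i in train], [y[i] for i in train], lam)
        scores.append(accuracy(w, b, [X[i] for i in held_out],
                               [y[i] for i in held_out]))
    return sum(scores) / k

def choose_lambda(X, y, candidates, k=5):
    """Return the candidate with the best cross-validated accuracy, plus all scores."""
    scores = {lam: cross_validate(X, y, lam, k) for lam in candidates}
    return max(scores, key=scores.get), scores

# Toy data (an assumption): two overlapping 2-D Gaussian clusters labelled -1 and +1.
rng = random.Random(42)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (-1.0, 1.0) for _ in range(50)]
y = [-1] * 50 + [1] * 50

best_lam, cv_scores = choose_lambda(X, y, [0.001, 0.01, 0.1, 1.0])
print(best_lam, cv_scores)
```

The point of the design is that the regularization strength is judged only on individuals the model did not see during fitting, so a value that merely memorizes the training set gets no credit for it.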

L2-regularization enabled us to tackle overfitting for the most part, but not entirely. However, I’d like to stress that the problem does not affect all the training individuals, but only a fraction of them.

Still, we are aware that overfitting remains a problem for a part of our training set. We are exploring ways to mitigate its effect even further. Obviously, adding more training data will help, and we’re trying to gather as much public data as we can. We are also exploring computational solutions, not very different in spirit from your leave-n-out suggestion.

More on Ancestry Composition soon!

Eric Durand