There have been enough public reports by now to reinforce my initial suggestion that Ancestry Composition overfits to the training data (i.e., people with four grandparents from a reference population who filled in their ancestry survey). The result is that such people get 99-100% of their ancestry assigned to a single population, and the test essentially returns the customer-supplied population label instead of an ancestry estimate based on their actual DNA.
Now, this is not a problem for the majority of 23andMe customers, who either don't have 4 grandparents from the same country or have 4 grandparents from a colonial country such as the United States.
But the problem for the rest of the 23andMe community cannot be overlooked, because it is significant for people from the non-colonial countries that make up the reference populations. Ironically, the people who are actually making this type of analysis possible (those who dutifully filled in their ancestry survey) are the ones getting a raw deal.
I have seen talk of people retracting their ancestry survey answers in the hope of getting accurate results! I don't think that's the way to go, as it would lead to a race to the bottom: people might retract or change their answers hoping to improve their results, but if enough people do this, the training sample will shrink and become distorted, and the results will get worse for everybody!
How to solve the problem
In a world with infinite computing resources and a large number of samples, the problem could be solved optimally by leave-one-out retraining: for each of the N training samples, rebuild the ancestry predictive model using the remaining N-1 samples (N rebuilds in total, one per training sample), and then apply each rebuilt model to the sample that was left out.
Naturally, this would increase the computational cost of ancestry estimation roughly N-fold, so it does not seem practical.
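For concreteness, here is a minimal sketch of what such leave-one-out retraining might look like, assuming a generic classifier with fit/predict methods; the AncestryModel, genotypes, and labels names are hypothetical stand-ins, not anything from 23andMe's actual pipeline:

```python
# Minimal leave-one-out sketch (hypothetical names, not 23andMe's pipeline):
# for each of the N reference samples, refit the model on the other N-1
# samples and assign ancestry to the held-out sample with that refitted model.
import numpy as np

def leave_one_out_assignments(genotypes, labels, AncestryModel):
    genotypes, labels = np.asarray(genotypes), np.asarray(labels)
    n = len(labels)
    predictions = np.empty(n, dtype=object)
    for i in range(n):                      # N separate model fits
        keep = np.arange(n) != i            # everyone except sample i
        model = AncestryModel()             # any classifier with fit/predict
        model.fit(genotypes[keep], labels[keep])
        predictions[i] = model.predict(genotypes[i:i + 1])[0]
    return predictions
```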
An alternative approach would be to build the model only once (using all N training samples) and then incrementally update it for each training individual: adapt the parameters of the model by "virtually" taking that individual out, at a minor cost per individual. This depends on the feasibility of such an incremental update, and my suspicion is that it would be extremely difficult to do for the fairly complex model 23andMe uses in Ancestry Composition.
So, what would be a practical solution?
Partition the N training samples into G groups, each of which will have N/G individuals. Now, rebuild the model G times, each time using the N-N/G individuals outside one group, i.e., leaving one group out. Note that the initially proposed leave-one-out solution is a special case of the above with G=N (each group containing a single individual).
The computational cost of this solution will be somewhat less than G times the cost of building the full model with all N training samples: the model is built G times, but each time over a slightly smaller dataset (of N-N/G individuals).
Practically, G=10 would be a reasonable number of groups, which would, however, require the model to be built ten times. Whether or not this is practical for 23andMe, I don't know, but since they have to update their model periodically anyway, I think they ought to try this approach. If they already have idle CPU cycles, that's a great way to occupy them; and if they don't, then investing in processing power would be a good idea.
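To make the proposal concrete, here is a rough sketch of the leave-one-group-out scheme with G=10, written in scikit-learn terms purely for illustration; KFold, LinearSVC and the variable names are my stand-ins, not 23andMe's actual implementation:

```python
# Sketch of the leave-one-group-out scheme (illustrative only): partition the
# N reference samples into G groups, rebuild the model G times, and score each
# sample with the one model that never saw it. KFold just forms the G groups;
# LinearSVC is a stand-in for the real (and far more complex) ancestry model.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def leave_one_group_out_assignments(genotypes, labels, n_groups=10):
    genotypes, labels = np.asarray(genotypes), np.asarray(labels)
    predictions = np.empty(len(labels), dtype=object)
    folds = KFold(n_splits=n_groups, shuffle=True, random_state=0)
    for fit_idx, held_out_idx in folds.split(genotypes):
        model = LinearSVC()                 # G model fits instead of N
        model.fit(genotypes[fit_idx], labels[fit_idx])
        predictions[held_out_idx] = model.predict(genotypes[held_out_idx])
    return predictions
```

Each reference individual then receives an ancestry estimate from a model that never saw them, which is exactly the condition an ordinary customer is in.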
10 comments:
Dear Dienekes,
First of all, I’d like to thank you for the constructive comments on our new admixture tool, Ancestry Composition.
We are aware that the people in the training set might experience overfitting. Overfitting is a common problem with almost any machine learning method, and we’ve known for a while that it would occur to some extent in our inference engine. We have taken steps to minimize the impact of overfitting, and are still actively working on reducing it further.
The main thing to do when trying to prevent overfitting is regularization. At the core of Ancestry Composition is a well known machine learning algorithm called Support Vector Machines (SVM). It is easy to implement L2-regularization for SVMs, and we did. Furthermore, the strength of this L2-regularization was chosen using cross-validation, an approach similar to your suggestion: we split the training data into 5 groups, training on 80% and testing on the remaining 20%.
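For readers unfamiliar with the terminology, "an L2-regularized SVM with the strength chosen by 5-fold cross-validation" might look roughly like the following sketch in scikit-learn terms; the parameter grid and names are illustrative, not the actual Ancestry Composition code:

```python
# Rough illustration (not the actual Ancestry Composition code) of choosing
# the strength of L2 regularization for a linear SVM by 5-fold
# cross-validation: train on 80% of the reference data, test on the
# remaining 20%, rotate through the five splits, and keep the best C.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def fit_l2_regularized_svm(genotypes, labels):
    search = GridSearchCV(
        LinearSVC(penalty="l2"),                    # L2-regularized SVM
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # smaller C = stronger regularization
        cv=5,                                       # 5 groups: 80% train / 20% test
    )
    search.fit(genotypes, labels)
    return search.best_estimator_
```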
L2-regularization enabled us to tackle overfitting for the most part, but not entirely. However, I’d like to stress that the problem does not affect all the training individuals, but only a fraction of them.
Still, we are aware that overfitting remains a problem for a part of our training set. We are exploring ways to mitigate its effect even further. Obviously, adding more training data will help, and we’re trying to gather as much public data as we can. We are also exploring computational solutions, not very different in spirit from your leave-n-out suggestion.
More on Ancestry Composition soon!
Eric Durand
I think 23andMe needs to have an ethnicity option and not just country of birth. I am a Kurd from Turkey, and there is no option to select Kurdistan or Kurd. Many Kurds from Turkey were likely included in the reference population for Turkey, which is wrong. It would be better for Kurds to select Iran rather than any other country, as our DNA matches best with Iranians.
23andMe tend to be arrogant about the apps they provide; some of the functions date back to 2008 and remain unrevised despite the greater number of reference groups now available.
Ancestry Composition does give people whose ancestry has been known for hundreds of years, and matches one of the reference groups, confirmation of that ancestry, but at a cost: Finns lose their East Eurasian ancestry, South Asians lose most of their East Eurasian ancestry, and Anatolian Turks lose everything except their Middle Eastern ancestry, even though that is only a minor contributor to their overall, largely West Asian, ancestry.
I seriously doubt 23andMe will do much about the problem of their Nationality Composition.
@Eric Durand,
Thank you for your reply, and looking forward to seeing how Ancestry Composition will evolve over time.
Off-topic:
http://arxiv.org/abs/1103.0878
Order of precedence and age of Y-DNA haplotypes:
"[...] conclusion that on its migration to Europe the R1b population split once at about 6000 years ago in the teritory of Asia Minor while its final expansion into whole of Western and Central Europe, and separation into local populations occured around 4000 years ago during expansion of the archeological Bell Beaker culture."
Yes, there should be an ethnicity option. People should be asked to give information about the ethnicity of their ancestors in the last few generations. Populations should be based on ethnic group.
It would also make sense to split countries (in particular the larger ones) into regional domains - at minimum four quadrants. Even if such a division is somewhat arbitrary, it would help, and in most countries people have an intuition for where the boundaries should lie rather than running them straight down the middle. A fifth "central" domain might also make sense.
E.g., as I mentioned before, the SW French have much less in common with Central Europeans than the E French do; and Hungarians - in particular those who are not ethnic Romanians or from the extreme east - are very close to other Central Europeans and Germans, but not close to typical Eastern Europeans (and neither are Czechs or Slovenians). Differences of similar magnitude surely exist within Poland, Ukraine, and Turkey as well.
It almost looks like they did not consult an expert on European autosomal makeup. I know, one could argue that Fst is generally small within Europe - but by now we have sufficient information to exploit even those small differences down to roughly state/province-level resolution.
Westgoth,
Surely OT - but I could not leave this unanswered.
I smell Klyosov's breath.
This paper is useless, because it does not take into account four major limitations:
(i) backmutations are huge for STRs compared to SNPs
(ii) there is a saturation effect: mutation rates decrease quickly as STR copy numbers grow
(iii) the mutation rate varies extremely widely from locus to locus, so it is almost impossible to establish a meaningful mean
(iv) STR mutation rates have historically been overestimated by at least a factor of three, as modern work and comparisons to ancient DNA have shown. The author uses a rate that is seven years old and thus likely completely outdated. Add to this a factor of 2-3 of uncertainty due to (iii).
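As a rough back-of-the-envelope illustration of why (iii) and (iv) dominate the result: under a simple molecular-clock calculation, the age estimate is just the observed mutational distance divided by the assumed rate, so it scales inversely with the rate. The numbers in the sketch below are invented purely to show the scaling:

```python
# Back-of-the-envelope scaling only; the mutation counts and rates below are
# invented for illustration, not taken from the paper being discussed.
def clock_age_generations(mean_mutations_per_marker, rate_per_generation):
    # Simple molecular clock: the age estimate is inversely proportional to the rate.
    return mean_mutations_per_marker / rate_per_generation

high_rate = clock_age_generations(0.30, 0.0020)       # 150 generations
low_rate  = clock_age_generations(0.30, 0.0020 / 3)   # 450 generations
print(high_rate, low_rate)  # a 3x change in the assumed rate => 3x change in the date
```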
@eurologist
I was wondering about the validity of those calculations… Thank you very much for the explanation.
I would like to add two things:
Firstly, I just learned that another commercial provider combines France and Germany into "Central European." I really don't know what they are thinking - clearly they are not.
Secondly, there is hope that things will improve. Fortunately, the Genographic Project's GenoChip 2.0 asks for ancestral location information down to the village (!) level. (The bad news is that this is asked only for the uniparental lines, and only as far back as participants know, which is not ideal but perhaps a decent approximation of the autosomal contribution, especially if good filters are used.)