There have been enough public reports by now to reinforce my initial suggestion that Ancestry Composition overfits to the training data (i.e., people with four grandparents from a reference population who filled in their ancestry survey). As a result, such people get 99-100% of their ancestry assigned to a particular population, and the test essentially returns the customer-supplied population label instead of an ancestry estimate based on the person's actual DNA.
Now, this is not a problem for the majority of 23andMe customers, who don't have four grandparents from the same country, or who have four grandparents from a colonial country such as the United States.
But the problem cannot be overlooked for the rest of the 23andMe community: it is significant for people from the non-colonial countries that make up the reference populations. Ironically, the people who actually make this type of analysis possible (those who dutifully filled in their ancestry survey) are the ones getting a raw deal.
I have seen talk of people retracting their ancestry survey answers in the hope of getting accurate results! I don't think that's the way to go, as it would lead to a race to the bottom: people might retract or change their survey answers hoping to improve their own results, but if enough people do this, the training sample will shrink and become distorted, and the results will be worse for everybody!
How to solve the problem
In a world with infinite computing resources and a large number of samples, the problem could be solved optimally by leaving out each of the N training samples in turn: rebuild the ancestry predictive model using the remaining N-1 samples, N times (once for each training sample), and then apply each model to the sample it left out.
Naturally, this would have the effect of increasing the computational complexity of ancestry estimation approximately N-fold, so it does not seem practical.
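The leave-one-out scheme described above can be sketched as follows. The nearest-centroid classifier is purely illustrative, a toy stand-in for 23andMe's far more complex model; the function names are my own.

```python
# Leave-one-out sketch: rebuild the model N times, each time without one
# training sample, and predict that sample with the model that never saw it.
# The nearest-centroid "model" is a toy stand-in for a real ancestry model.

def nearest_centroid_predict(train, sample):
    """Predict a population label by distance to per-population centroids."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    best_label, best_dist = None, float("inf")
    for label, members in by_label.items():
        dim = len(members[0])
        centroid = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        dist = sum((c - s) ** 2 for c, s in zip(centroid, sample))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out(samples):
    """For each of the N samples, train on the other N-1 and predict it."""
    predictions = []
    for i, (features, _label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # model rebuilt without sample i
        predictions.append(nearest_centroid_predict(train, features))
    return predictions
```

The key property is that no individual's result ever comes from a model that was trained on that individual, which is exactly what removes the self-reporting feedback loop, at the cost of N model builds.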
An alternative approach would be to build the model only once (using all N training samples) and then incrementally update it for each training individual, adapting the model's parameters to "virtually" take that individual out. This depends on the feasibility of such an incremental update, which would need to incur only a minor cost per individual. My suspicion is that this type of incremental update will be extremely difficult for the fairly complex model used by 23andMe in their Ancestry Composition.
So, what would be a practical solution?
Partition the N training samples into G groups, each of which will have N/G individuals. Now, rebuild the model G times, each time using N-N/G individuals, i.e., leaving one group out. Note that the initially proposed solution (leaving one individual out) is the special case of the above with G=N, where each group contains a single individual.
The computational cost of this solution will be somewhat less than G times the cost of building the full model with all N training samples: you are building the model G times, but each time over a slightly smaller dataset of N-N/G individuals.
Practically, G=10 would be a reasonable number of groups, though it would require the model to be built ten times. Whether this is practical for 23andMe, I don't know, but since they have to update their model periodically anyway, I think they ought to try this approach. If they already have idle CPU cycles, that's a great way to use them; if they don't, then investing in processing power would be a good idea.