I had proposed basically the same solution about a month ago, and it's great that the issue is being addressed so soon after it first appeared. If any of the people who had written to me/commented on the topic get their new updated results and want to comment, feel free to do so in this post.

Ancestry Composition (AC) works by learning (training) a set of useful features from reference individuals with known ancestry (the training set) and then using these features to predict the ancestry of our customers.
Our set of reference individuals consists in part of customers who reported that their four grandparents were born in the same country. Remember that we also remove the outliers, i.e., people whose genetic ancestry doesn't match their survey answers. From this set, AC learns to associate certain haplotypes with their geographical origin. AC is then able to recognize similar haplotypes and thus to predict the ancestry of other customers.
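To make the panel-construction step concrete, here is a hypothetical sketch in Python. The field names (`grandparent_birth_countries`, `is_outlier`) are illustrative placeholders, not 23andMe's actual data schema:

```python
def build_reference_panel(customers):
    """Keep customers whose four grandparents share a birth country
    and whose genetic ancestry is not flagged as an outlier.

    Each customer is a dict with illustrative keys:
    'id', 'grandparent_birth_countries' (list of 4), 'is_outlier'.
    """
    panel = []
    for c in customers:
        countries = c["grandparent_birth_countries"]
        same_country = len(countries) == 4 and len(set(countries)) == 1
        if same_country and not c["is_outlier"]:
            # The shared birth country becomes the training label.
            panel.append({"id": c["id"], "label": countries[0]})
    return panel
```

The two filters mirror the twin requirements described above: a single grandparental birth country supplies the ancestry label, and outlier removal keeps mislabeled individuals out of the training set.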
However, when predicting the ancestry of reference individuals, AC suffers from overfitting, a problem common to many supervised learning methods: having already seen a reference individual's haplotypes during training, the model essentially recalls the training label rather than making a genuine prediction. As a consequence, AC predicts the ancestry of most reference individuals as being 100% from their grandparents’ birthplace.
We addressed this issue using a method inspired by cross-validation. We divided the training set into 5 folds, each containing 20% of the reference individuals. We then trained 5 AC models, in each of which one fold in turn is excluded from the set of reference individuals, so each of these models is trained on 80% of the reference individuals. Additionally, we retain the model that was trained using all the reference individuals. From this process, we end up with 6 different models from which we can predict the ancestry of our customers.
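The fold-splitting step can be sketched roughly as follows (in Python; `train_fn` is a placeholder for the actual AC training procedure, which is not public):

```python
import random

def train_fold_models(reference_ids, train_fn, k=5, seed=0):
    """Assign reference individuals to k folds, then train k models
    (each excluding one fold) plus one model trained on everyone.

    Returns (fold_of, models): a mapping from individual to fold,
    and a dict of k leave-one-fold-out models plus an 'all' model.
    """
    ids = list(reference_ids)
    random.Random(seed).shuffle(ids)          # randomize fold assignment
    fold_of = {rid: i % k for i, rid in enumerate(ids)}

    models = {}
    for fold in range(k):
        # Train on the 80% of individuals NOT in this fold.
        training_ids = [rid for rid in ids if fold_of[rid] != fold]
        models[fold] = train_fn(training_ids)
    models["all"] = train_fn(ids)             # the 6th model: everyone
    return fold_of, models
```

With k = 5 this yields exactly the 6 models described above: five trained on 80% of the panel each, and one trained on the full panel.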
Now, when predicting the ancestry of a customer, we start by figuring out whether he/she is a reference individual. If yes, we identify the fold to which the customer belongs, and we use the corresponding model for prediction. If not, we use the model trained on all of the reference data. This way, we ensure that AC was never trained using the haplotypes of the individual it tries to predict.
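The routing logic at prediction time might look like this (again a sketch: the `fold_of` mapping and `models` dictionary are illustrative names, and `predict` stands in for whatever inference AC actually runs):

```python
def predict_ancestry(customer_id, fold_of, models, features):
    """Route a customer to a model that never saw their haplotypes.

    Reference individuals use the model that excluded their fold;
    everyone else uses the model trained on all reference data.
    """
    if customer_id in fold_of:
        model = models[fold_of[customer_id]]  # model trained without this person
    else:
        model = models["all"]
    return model.predict(features)
```

The key invariant is that a reference individual is never scored by a model whose training set included them, which is exactly what removes the overfitting artifact.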
I am not sure how 23andMe plans to handle their Ancestry Composition feature in the future, but I would suggest that they periodically retrain it as they get more samples. According to a recent estimate, there are over 180,000 people in their database at the moment, a fraction of whom meet the twin requirements of: (i) having 4 grandparents from the same country, and (ii) not being an outlier. As this number increases over time, it might be a good idea to occasionally re-partition the sample and recalculate participants' ancestry composition results.
The fact that they are ready to roll out their updated results so soon after the initial ones tells me that they do have the computing power to do so, and it might be a good idea to update Ancestry Composition periodically, say on a quarterly basis or when a certain increase in the training set (say, 10%) is achieved. Eventually the admixture estimates may stabilize, in which case the way forward may involve rethinking the choice of ancestral populations currently in use.
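The 10% trigger is simple enough to state as code (a sketch of the suggested policy only, not anything 23andMe has announced):

```python
def should_retrain(current_size, size_at_last_training, growth_threshold=0.10):
    """Retrain once the reference panel has grown by a set fraction
    (10% here) since the last training run."""
    return current_size >= size_at_last_training * (1 + growth_threshold)
```

A time-based trigger (e.g., quarterly) could simply be OR-ed with this check; either way, retraining stays infrequent while the panel keeps pace with database growth.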