January 26, 2013

Ancestry Composition to be fixed

From the explanation at the relevant thread:
Ancestry Composition (AC) works by learning (training) a set of useful features from reference individuals with known ancestry (the training set) and then using these features to predict the ancestry of our customers.

Our set of reference individuals consists in part of customers who reported their 4 grandparents were born in the same country. Remember that we also remove the outliers, or people whose genetic ancestry doesn't match their survey answers. From this set, AC learns to associate certain haplotypes with their geographical origin. AC is then able to recognize similar haplotypes and thus to predict the ancestry of other customers.

However, when predicting the ancestry of reference individuals, AC suffers from overfitting, a problem common to many supervised learning methods. As a consequence, AC predicts the ancestry of most reference individuals as being 100% from their grandparents’ birthplace.

We addressed this issue using a method inspired from cross-validation. We divided the training set into 5 folds, each containing 20% of the reference individuals. We then trained 5 AC models in which each fold in turn is excluded from the set of reference individuals. So each of these models is learned using 80% of the reference individuals. Additionally, we retain the model that was trained using all the reference individuals. From this process, we end up with 6 different models from which we can predict the ancestry of our customers.

Now, when predicting the ancestry of a customer, we start by figuring out if he/she is a reference individual. If yes, we identify the fold in which the customer belongs, and we use the corresponding model for prediction. If not, we use the fold containing all of the reference data. This way, we ensure that AC was never trained using the haplotypes of the individual it tries to predict.
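
The scheme described above can be sketched in a few lines. This is only a toy illustration and not 23andMe's code: the names `assign_folds`, `train_models`, and `model_for` are hypothetical, and `train` stands in for whatever actually fits an AC model. The full model is stored under the key `None`, so a customer who is not in any fold falls through to it.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def assign_folds(reference_ids: Sequence[str], n_folds: int = 5,
                 seed: int = 0) -> Dict[str, int]:
    """Randomly split the reference individuals into n_folds near-equal folds."""
    ids = sorted(reference_ids)            # sort first so the split is reproducible
    random.Random(seed).shuffle(ids)
    return {rid: i % n_folds for i, rid in enumerate(ids)}


def train_models(reference_ids: Sequence[str], fold_of: Dict[str, int],
                 train: Callable[[List[str]], object],
                 n_folds: int = 5) -> Dict[Optional[int], object]:
    """Train one model per held-out fold, plus one on all references (key None)."""
    models: Dict[Optional[int], object] = {}
    for k in range(n_folds):
        # Model k never sees the individuals in fold k.
        models[k] = train([r for r in reference_ids if fold_of[r] != k])
    models[None] = train(list(reference_ids))
    return models


def model_for(customer_id: str, fold_of: Dict[str, int],
              models: Dict[Optional[int], object]) -> object:
    """Reference individuals get the model that excluded their own fold;
    everyone else (fold lookup returns None) gets the all-reference model."""
    return models[fold_of.get(customer_id)]


# Toy demo: the "model" is simply the set of IDs it was trained on.
refs = [f"ref{i}" for i in range(10)]
fold_of = assign_folds(refs)
models = train_models(refs, fold_of, train=frozenset)
```

In this toy run there are six models, each fold model is trained on 8 of the 10 reference IDs, and no reference individual appears in the training set of the model chosen to predict them.
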

I had proposed essentially the same solution about a month ago, and it's great that the issue is being addressed so soon after it first appeared. If any of the people who wrote to me or commented on the topic get their updated results and want to comment, feel free to do so in this post.

I am not sure how 23andMe plans to handle the Ancestry Composition feature in the future, but I would suggest that they update it periodically as they get more samples. According to a recent estimate, there are over 180,000 people in their database at the moment, a fraction of whom meet the twin requirements of (i) having 4 grandparents from the same country and (ii) not being an outlier. As this number increases over time, it might be a good idea to occasionally re-partition the sample and recalculate participants' Ancestry Composition results.

The fact that they are ready to roll out their updated results so soon after the initial ones tells me that they have the computing power to rerun the analysis. A reasonable schedule would be quarterly, or whenever the training set has grown by a certain amount (say, 10%). Eventually the admixture estimates may stabilize, in which case the way forward may involve rethinking the choice of ancestral populations currently in use.
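
As a rough sketch of such an update rule (my own illustration, not anything 23andMe has described), the check below triggers a retraining when either a quarter has passed or the reference set has grown by 10%. The function name `should_retrain` and the defaults of 91 days and 0.10 are assumptions chosen to match the numbers suggested above.

```python
from datetime import date, timedelta


def should_retrain(last_trained: date, last_size: int,
                   today: date, current_size: int,
                   max_age: timedelta = timedelta(days=91),
                   min_growth: float = 0.10) -> bool:
    """Return True if the models are a quarter old or the training set grew enough."""
    aged_out = (today - last_trained) >= max_age
    # Compare the absolute increase against the required fraction of the old size.
    grown = (current_size - last_size) >= min_growth * last_size
    return aged_out or grown
```

For example, with models trained on 1 January on 10,000 reference individuals, a check on 1 February with 11,500 references would trigger a retraining, while one with 10,500 would not.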


eurologist said...

As I mentioned before, they should also fix other issues. Currently, they combine France and Germany. That's about the same if not worse than combining Denmark and Hungary. Conversely, they are combining Hungary and Poland - which is not that different from combining Austria and the Ukraine...

Also, for many countries it would pay to divide them into 4-5 regions (NE, NW, SW, SE, central).

Clay said...

I look forward to the new results, since my first results could not be explained by my known ancestry. For example, my great-grandmother is from Finland, which shows up clearly enough in the results. But Scandinavian also shows up, even in the "conservative" interpretation, and there is no one in my family tree known to be from there. In the more speculative interpretation, Scandinavian is as prominent as Finn. So maybe there was Swedish admixture in my Finnish great grandmother. Maybe she is an "outlier" within the Finn population and this will clarify.

Charles Nydorf said...

I show up as .2 Finnish on the speculative sub-regional estimate. I'm pretty sure that these are really Lithuanian or Belarusian genes.

Dmytro said...

Major glitches seem to accompany the recalculations. My mother-in-law has now become more than half Sub-Saharan (she had 0% initially, which is correct for her), while my wife has none whatsoever.

pconroy said...

I see that my mother's slight Native American ancestry has returned again.