October 20, 2014

Ancestry Composition preprint

This is one of 23andMe's main ancestry tools, so it is nice to see its methodology described in detail.

Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
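To make the third stage more concrete, here is a minimal sketch of isotonic recalibration using the textbook pool-adjacent-violators algorithm. The abstract does not say how 23andMe applies isotonic regression, so this is only an illustration under assumptions: the function names `isotonic_fit` and `recalibrate` are hypothetical, and the inputs are assumed to be raw confidence scores paired with 0/1 flags saying whether each call was correct in cross-validation.

```python
def isotonic_fit(scores, outcomes):
    """Fit a non-decreasing step function from raw score to observed accuracy.

    scores:   raw confidence scores (assumed input, e.g. from the HMM stage)
    outcomes: 1 if the call was correct, 0 otherwise (assumed labels)
    Returns a list of (upper_score_bound, calibrated_value) steps.
    """
    pairs = sorted(zip(scores, outcomes))
    # Each block is [sum of outcomes, number of points, largest score in block].
    blocks = []
    for score, outcome in pairs:
        blocks.append([float(outcome), 1, score])
        # Pool with the previous block while the two share a score (tie)
        # or while their means would break the non-decreasing order.
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def recalibrate(model, score):
    """Map a raw score to the value of the first step whose bound covers it."""
    for upper, value in model:
        if score <= upper:
            return value
    # Scores above every bound get the top step's value.
    return model[-1][1]


if __name__ == "__main__":
    model = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
    print(model)
    print(recalibrate(model, 0.25))
```

The fitted steps never decrease, so a higher raw score can never be mapped to a lower calibrated confidence. That monotonicity is the property that makes isotonic regression a common choice for recalibration.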



When the first descriptions and results were reported, many of us were highly critical. In particular, the location process appeared far too broad and did not take known historic, prehistoric, and genetic data into account. For example, France (with a very different background) seemed to be grouped with Germany, while the much closer Hungary was placed in a general, ill-defined "Eastern European" group, although it is a classic example of Central Europe, along with much of Germany, the Czech Republic, Austria, Slovakia, Slovenia, and the northern portions of the Balkans.

It pits unlearned political or neo-geographic concepts against a long-standing historic, prehistoric, and of course also genetic reality.

The published process relies on "four grandparents with the same country-of-origin", which is a joke. Take Germany, for example. Four grandparents from NW Germany will be similar to the Dutch, four from the N to Danes/Swedes, four from the NE to NW Poles (and both with similarities to Balts), four from the SE to Czechs and SW Polish Slavs, four from the S to Austrians, Swiss, and N Italians, and four from the SE to SW French and Swiss.

What you need is coordinates of each grandparent - not this ridiculous concept of "country of origin."

As I commented on another 23andMe article recently discussed on this blog:
Unfortunately, I am not so sure about 23andMe's European subgroups. I have traced all my ancestors back to at least the 18th century, and some much further back, and the very large majority of them are Dutch, with some minor German import. Still, 23andMe states that my ancestors are 31% British/Irish, 19% Scandinavian, 12.6% German/French and 37% "broadly Northern European". I realize that it may be difficult to distinguish between European subgroups. If so, then it would be preferable not to distinguish. Still, it may also be a matter of proper analysis. A recent article supports the latter (http://www.nature.com/ng/journal/v46/n8/full/ng.3021.html).

Real Eric, did you really trace all lineages back far enough? Far enough would mean up to the generation that immigrated, otherwise how could you be sure? And all would mean not just the male lineages, but ALL ancestral lineages of all women as well. I don't want to suggest you didn't, but I guess there are many out there who don't fully realize this.

As for me, I doubt that this ancestry test could be of much use. I know where all my recent ancestors came from. What would interest me is how I'm related to ancient ethnic groups, because modern nations and modern ethnic groups are often a mix of several ancient ethnic groups. So if the 23andMe test (at best) just tells me which modern nations my ancestry derives from, I will know as much as I did before.

"and four from the SE to SW French and Swiss."

Sorry - this should be "and four from the SW to SE French and Swiss," of course.

"As for the person complaining about his Dutch ancestry being overlooked, what exactly is Dutch compared to SE English or Germans? Personally I think it unreasonable to try to separate what is inseparable."


Yes and no. I agree with you as long as the tools used are this blunt. However, if there were a large database based on grandparent (or better, great-grandparent) coordinates, then any Welsh, ancient English, Celtic, Roman, etc. heritage would be filtered out, and whatever Frisian, N Saxon, or Angle/Jute contribution now shows up as English would instead be sourced largely from roughly their original homelands.