October 20, 2011

Putting it all together: DRACOS for fine-scale admixture estimation

I have previously developed new techniques and tools for ancestry estimation:
• Clusters Galore
• Zombies
Clusters Galore allows very fine-scale ancestry estimation. In one of the most ambitious runs from last December, I was able to infer 124 different clusters on a global dataset. The downside is that individuals are assigned to single clusters: the MCLUST algorithm estimates the probability that each individual belongs to each cluster, but does not estimate admixture proportions.

The Dodecad Oracle tries to address this problem, by using simple geometry to estimate 2-way mixes between populations. Individuals are projected onto lines formed by population pairs. An individual X that can be expressed as a mixture of A and B will tend to fall on the line segment AB, or close to it: the distance between X and AB is a measure of the closeness of fit. There are two downsides to this approach:
• The limitation to two populations
• The fact that different "populations" may in fact be different samples from the same population (e.g., the Behar et al. (2010) Ashkenazy_Jews and the Dodecad Project Ashkenazi_D populations)
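The geometric idea behind the Oracle can be sketched in a few lines of Python. This is only an illustration, not the Oracle's actual code: the function name `project_onto_segment` is hypothetical, the points are assumed to be coordinates in whatever space the populations are compared in, and reading the position along the segment as the B fraction of a 2-way mix is an assumption of this sketch.

```python
import math

def project_onto_segment(x, a, b):
    """Project point x onto the segment AB.

    Returns (t, distance): t in [0, 1] is the position of the closest
    point along AB (0 means at A, 1 means at B), and distance is how far
    x lies from the segment, i.e. the closeness of fit.
    """
    ab = [bi - ai for ai, bi in zip(a, b)]
    ax = [xi - ai for ai, xi in zip(a, x)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        raise ValueError("A and B coincide, so they do not define a segment")
    # Scalar projection of AX onto AB, clamped so the point stays on the segment.
    t = sum(p * q for p, q in zip(ax, ab)) / denom
    t = max(0.0, min(1.0, t))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    distance = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, closest)))
    return t, distance

# Toy example: X sits a quarter of the way from A to B, one unit off the line.
t, distance = project_onto_segment((1.0, 1.0), (0.0, 0.0), (4.0, 0.0))
```

Under the assumption above, a small `distance` indicates that X is well described as a mix of A and B, and `t` would be read as the B share of that mix.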
The idea of Zombies is a powerful one. It allows one to convert the allele frequency data output by ADMIXTURE software to synthetic individuals representing the inferred ancestral populations. These individuals can then be used to estimate the ancestry of other individuals and populations. Not only does this have a tremendous performance benefit, but it also allows comparison across individuals and populations with the same "measuring stick".
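As a rough illustration of the zombie idea, the sketch below draws synthetic diploid genotypes from a list of allele frequencies for one ancestral population. It assumes each individual carries two independent allele draws per SNP (Hardy-Weinberg proportions) and that SNPs are sampled independently; the function name `make_zombies` and the toy frequencies are hypothetical, and the actual zombie-generation procedure may differ.

```python
import random

def make_zombies(freqs, n_individuals, seed=0):
    """Generate synthetic individuals from per-SNP allele frequencies.

    freqs: one allele frequency in [0, 1] per SNP, for a single
           inferred ancestral population.
    Returns a list of genotypes, each a list of allele counts (0, 1 or 2).
    """
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP (assumption: Hardy-Weinberg).
        genotype = [int(rng.random() < p) + int(rng.random() < p) for p in freqs]
        zombies.append(genotype)
    return zombies

# Toy example: three SNPs, five synthetic individuals.
zombies = make_zombies([0.0, 1.0, 0.5], n_individuals=5)
```

Because every zombie is drawn from the same frequencies, a handful of them can stand in for an inferred ancestral population when other individuals are analyzed against it.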

The DRACOS pipeline

Fine-scale admixture estimation can be achieved by putting together these three ideas. I have called this new technique DRACOS:
• Dimensionality Reduction
• Analysis into COmponents
• Structure estimation
Here are the steps of the full DRACOS pipeline:

1. Dimensionality Reduction: Use PCA or MDS to convert genotype data into a few principal components or MDS dimensions
2. Analysis into Components: Use MCLUST over the MDS/PCA representation to infer the presence of clusters at a fine scale
3. Identify sets of individuals that clearly belong to each of the clusters; one can use a filter based on posterior probability (e.g., greater than 0.99) and/or distance from the cluster centroid (e.g., the 30 closest individuals)
4. Convert these sets of cluster-typical individuals into zombies for use with ADMIXTURE; alternatively, their allele frequencies themselves can be used, as in DIYDodecad, or any other structure-like analysis.
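
Steps 3 and 4 can be sketched as follows. This is a minimal illustration rather than the actual pipeline: it assumes the clustering step has already produced, for one cluster, a posterior probability per individual, the individuals' coordinates in PCA/MDS space, and the cluster centroid. The function names, the 0/1/2 genotype coding, and the absence of missing genotypes are all assumptions of the sketch.

```python
import math

def typical_members(posteriors, coords, centroid, min_posterior=0.99, max_members=30):
    """Pick cluster-typical individuals for one cluster.

    Keeps individuals whose posterior probability exceeds min_posterior,
    then returns the indices of the max_members closest to the centroid.
    """
    def dist(point):
        return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))

    candidates = [i for i, p in enumerate(posteriors) if p > min_posterior]
    candidates.sort(key=lambda i: dist(coords[i]))
    return candidates[:max_members]

def allele_frequencies(genotypes, members):
    """Per-SNP allele frequency among the chosen members.

    genotypes[i][s] is the allele count (0, 1 or 2) of individual i at SNP s;
    missing data is assumed absent.
    """
    if not members:
        raise ValueError("no cluster-typical individuals were selected")
    n_snps = len(genotypes[0])
    return [sum(genotypes[i][s] for i in members) / (2 * len(members))
            for s in range(n_snps)]

# Toy example: four individuals, two SNPs, keep the two most typical.
posteriors = [0.999, 0.5, 0.995, 0.9999]
coords = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.2, 0.0)]
genotypes = [[0, 2], [1, 1], [2, 2], [2, 2]]
members = typical_members(posteriors, coords, (0.0, 0.0), max_members=2)
freqs = allele_frequencies(genotypes, members)
```

The resulting frequencies could then either be turned into zombies or be used directly, in the spirit of step 4.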

The DRACOS approach addresses all the drawbacks of the three individual methods:
1. Compared to Clusters Galore, it allows for admixture
2. It allows one to create zombies at a fine scale. ADMIXTURE alone cannot do this, both because of its O(K^2) running time and because it lacks the model-based sophistication of MCLUST as applied over the first few principal components.
3. Admixture can be estimated with any number of ancestral populations, not just two
There are of course drawbacks to the DRACOS approach as well; my recent post on increased error in short-range clines identifies the major issues with attempting to do admixture estimation at this level.

I have a few things running in parallel at this time, but I am pretty sure I will eventually release a DRACOS-based calculator on the Dodecad project page. I anticipate that such a tool, in conjunction with DIYDodecad's "byseg" and "target" modes, may be helpful to genealogists, as it has the potential to infer the geographical origin of segments of DNA at a finer level of detail.

Andrew Oh-Willeke said...

Have you done any proof of concept runs of DRACOS? How did they differ (or resemble) other approaches?

pconroy said...

Can't wait to see this - keep up the good work!

Dienekes said...

Have you done any proof of concept runs of DRACOS?

Of course

How did they differ (or resemble) other approaches?

DRACOS allows for fine-scale admixture estimation. The quadratic cost of ADMIXTURE and its idiosyncrasies (e.g., getting lost in likelihood space and finding local optima) prevent it from inferring fine-scale components on both performance and theoretical grounds.

DRACOS can convert any Clusters Galore or MCLUST run into synthetic individuals that can be used in ADMIXTURE. You can think of it as turboboosting ADMIXTURE by doing the cluster inference for it; ADMIXTURE is then used to do what it does well, i.e., estimating admixture proportions.

DRACOS-based results for long-range clines (e.g., Sub-Saharans vs. East Asians) are virtually identical to those produced by simple ADMIXTURE. For short-range ones, all the usual caveats enumerated in my recent post apply.