- Clusters Galore
- Zombies
- The Dodecad Oracle

*single* clusters: the MCLUST algorithm estimates the probability that each individual belongs to each cluster, but does not estimate admixture proportions.

The Dodecad Oracle tries to address this problem by using simple geometry to estimate 2-way mixes between populations. Individuals are projected onto the lines formed by population pairs. An individual X that can be expressed as a mixture of A and B will tend to fall on or near the line segment AB, and the distance between X and AB measures the closeness of fit. There are two downsides to this approach:

- The limitation to two populations
- The fact that different "populations" may in fact be different samples from the same population (e.g., the Behar et al. (2010) Ashkenazy_Jews and the Dodecad Project Ashkenazi_D populations)
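The projection idea can be illustrated with a few lines of Python. This is only a minimal sketch, not the Oracle's actual code: it assumes each population is represented by a single point (a hypothetical centroid in whatever coordinate space is used), and the function names `project_onto_segment` and `best_two_way_mix` are made up for the example.

```python
import math

def project_onto_segment(x, a, b):
    """Project point x onto segment AB.

    Returns (t, distance): t is the fraction of the way from A to B
    (clamped to [0, 1]), read here as the share of B in the mix;
    distance is how far x lies from the segment.
    """
    ab = [bi - ai for ai, bi in zip(a, b)]
    ax = [xi - ai for ai, xi in zip(a, x)]
    ab_len_sq = sum(d * d for d in ab)
    if ab_len_sq == 0.0:
        # A and B coincide, so no mixing line exists; treat as all A.
        return 0.0, math.dist(x, a)
    t = sum(p * q for p, q in zip(ax, ab)) / ab_len_sq
    t = max(0.0, min(1.0, t))
    foot = [ai + t * d for ai, d in zip(a, ab)]
    return t, math.dist(x, foot)

def best_two_way_mix(x, centroids):
    """Try every population pair; return (distance, pop_a, pop_b, share_b)."""
    names = sorted(centroids)
    best = None
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            t, dist = project_onto_segment(x, centroids[name_a], centroids[name_b])
            if best is None or dist < best[0]:
                best = (dist, name_a, name_b, t)
    return best
```

Calling `best_two_way_mix((3, 1), {"A": (0, 0), "B": (10, 0), "C": (0, 10)})` picks the pair A and B, with roughly 30% B. Two near-identical samples of the same population would give an almost zero-length segment, which is one way to see the second downside listed above.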

**The DRACOS pipeline**

Fine-scale admixture estimation can be achieved by putting together these three ideas. I have called this new technique DRACOS:

**D**imensionality **R**eduction, **A**nalysis into **CO**mponents, **S**tructure estimation

1. Dimensionality Reduction: Use PCA or MDS to convert genotype data into a few principal components or MDS dimensions

2. Analysis into Components: Use MCLUST over the MDS/PCA representation to infer the presence of clusters at a fine scale

3. Identify sets of individuals that clearly belong to each of the clusters; one can use a filter based on posterior probability (e.g., greater than 0.99) and/or distance from the cluster centroid (e.g., the 30 closest individuals)

4. Convert these sets of cluster-typical individuals into zombies for use with ADMIXTURE; alternatively, their allele frequencies themselves can be used, as in DIYDodecad, or any other structure-like analysis.
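Steps 3 and 4 can be sketched in plain Python. This is an illustration under stated assumptions rather than the code behind DRACOS: the posterior probabilities and reduced coordinates are taken as given (e.g., exported from an MCLUST run), the centroid is computed from the individuals that pass the posterior filter, genotypes are coded as 0/1/2 copies of one allele, and a zombie is assumed to be a synthetic individual drawn from a cluster's allele frequencies. All function names are hypothetical.

```python
import math
import random

def select_typical(posteriors, coords, cluster, min_posterior=0.99, n_closest=30):
    """Return indices of individuals that clearly belong to `cluster`.

    posteriors[i][k]: membership probability of individual i in cluster k
    coords[i]: individual i's coordinates in the reduced (PCA/MDS) space
    """
    members = [i for i, p in enumerate(posteriors) if p[cluster] > min_posterior]
    if not members:
        return []
    dims = len(coords[members[0]])
    # Assumption: the centroid is the mean of the individuals passing the filter.
    centroid = [sum(coords[i][d] for i in members) / len(members) for d in range(dims)]
    members.sort(key=lambda i: math.dist(coords[i], centroid))
    return members[:n_closest]

def allele_frequencies(genotypes, members):
    """Per-SNP frequency of the counted allele; genotypes[i][j] is 0, 1 or 2."""
    n_snps = len(genotypes[members[0]])
    return [sum(genotypes[i][j] for i in members) / (2 * len(members))
            for j in range(n_snps)]

def make_zombies(freqs, n_zombies, seed=0):
    """Simulate diploid genotypes: two independent allele draws per SNP."""
    rng = random.Random(seed)
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_zombies)]
```

Here both filters are applied in sequence (posterior first, then distance); using only one of them, as the "and/or" in step 3 allows, would be an equally reasonable choice.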

The DRACOS approach addresses all the drawbacks of the three individual methods:

- Compared to Clusters Galore, it allows for admixture
- It allows one to create zombies at a fine scale. ADMIXTURE cannot do this, both because of its O(K^2) running time and because it lacks the model-based sophistication of MCLUST as applied over the first few principal components.
- Admixture can be estimated with any number of ancestral populations, not just two
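To make the last point concrete, here is a minimal sketch of how admixture proportions over any number of ancestral populations might be estimated once the allele frequencies are held fixed. It uses a simple EM update and is not claimed to be what ADMIXTURE or DIYDodecad do internally; the function name `estimate_admixture` and the 0/1/2 genotype coding are assumptions made for the example.

```python
def estimate_admixture(genotype, freqs, n_iter=200, eps=1e-6):
    """Estimate admixture proportions for one individual by EM.

    genotype[j]: copies (0, 1 or 2) of the counted allele at SNP j
    freqs[k][j]: frequency of that allele in ancestral population k (held fixed)
    """
    k_pops = len(freqs)
    n_snps = len(genotype)
    # Keep frequencies away from 0 and 1 so no denominator can vanish.
    f = [[min(1.0 - eps, max(eps, p)) for p in row] for row in freqs]
    q = [1.0 / k_pops] * k_pops  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * k_pops
        for j, g in enumerate(genotype):
            # Probability that a random allele of this individual is the counted one.
            p_alt = sum(q[k] * f[k][j] for k in range(k_pops))
            p_ref = 1.0 - p_alt
            for k in range(k_pops):
                # Expected number of this SNP's two alleles that come from population k.
                new_q[k] += g * q[k] * f[k][j] / p_alt
                new_q[k] += (2 - g) * q[k] * (1.0 - f[k][j]) / p_ref
        q = [v / (2 * n_snps) for v in new_q]
    return q
```

The list `freqs` can hold as many populations as there are clusters, so nothing in the update is specific to two-way mixes.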

I have a few things running in parallel at this time, but I am pretty sure I will eventually release a DRACOS-based calculator on the Dodecad project page. I anticipate that such a tool, in conjunction with DIYDodecad's "byseg" and "target" modes, may be helpful to genealogists, as it has the potential to infer the geographical origin of segments of DNA at a finer level of detail.

Have you done any proof of concept runs of DRACOS? How did they differ from (or resemble) other approaches?

Can't wait to see this - keep up the good work!


*Have you done any proof of concept runs of DRACOS?*

Of course.

*How did they differ from (or resemble) other approaches?*

DRACOS allows for fine-scale admixture estimation. The quadratic cost of ADMIXTURE and its idiosyncrasies (e.g., getting lost in likelihood space and settling on local optima) prevent it from inferring fine-scale components, on both performance and theoretical grounds.

DRACOS can convert any Clusters Galore or MCLUST run into synthetic individuals that can be used in ADMIXTURE. You can think of it as turboboosting ADMIXTURE by doing the cluster inference for it; ADMIXTURE is then left to do what it does well, i.e., estimating admixture proportions.

DRACOS-based results for long-range clines (e.g., Sub-Saharans vs. East Asians) are virtually identical to those produced by simple ADMIXTURE. For short-range ones, all the usual caveats enumerated in my recent post apply.

http://dienekes.blogspot.com/2011/10/further-caution-on-admixture-estimates.html

Great work D, you're a machine! (in a good way :)
