
October 20, 2011

Putting it all together: DRACOS for fine-scale admixture estimation

I have previously developed new techniques and tools for ancestry estimation:
  • Clusters Galore
  • Zombies
  • The Dodecad Oracle
Clusters Galore allows very fine-scale ancestry estimation. In one of the most ambitious runs from last December, I was able to infer 124 different clusters on a global dataset. The downside is that each individual is assigned to a single cluster: the MCLUST algorithm estimates the probability that each individual belongs to each cluster, but does not estimate admixture proportions.

The Dodecad Oracle tries to address this problem by using simple geometry to estimate 2-way mixes between populations. Individuals are projected onto lines formed by population pairs. An individual X that can be expressed as a mixture of A and B will tend to fall on the line segment AB, or close to it: the distance between X and AB measures the closeness of fit. There are two downsides to this approach:
  • The limitation to two populations
  • The fact that different "populations" may in fact be different samples from the same population (e.g., the Behar et al. (2010) Ashkenazy_Jews and the Dodecad Project Ashkenazi_D populations)
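
The geometry behind the Oracle can be illustrated with a minimal Python sketch. This is not the Oracle's actual code: the function name `fit_two_way_mix` is hypothetical, and the sketch assumes that X, A and B are given as coordinate lists in the same space (for example, population averages over a few dimensions).

```python
import math

def fit_two_way_mix(x, a, b):
    """Project point x onto the segment ab.

    Returns (share_of_a, distance): the implied proportion of A in a
    two-way A/B mix, and the distance from x to the segment, which
    serves as the measure of fit (smaller is better).
    """
    ab = [bi - ai for ai, bi in zip(a, b)]
    ax = [xi - ai for ai, xi in zip(a, x)]
    denom = sum(c * c for c in ab)
    if denom == 0:
        raise ValueError("A and B coincide; no line is defined")
    # t is the position of the projection along AB: 0 at A, 1 at B.
    t = sum(p * q for p, q in zip(ax, ab)) / denom
    t = max(0.0, min(1.0, t))  # clamp so the projection stays on the segment
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return 1.0 - t, math.dist(x, closest)

# A point one quarter of the way from A to B, offset from the line by 3 units.
share_a, distance = fit_two_way_mix([1.0, 3.0], [0.0, 0.0], [4.0, 0.0])
print(share_a, distance)
```

Ranking all population pairs by the returned distance gives the best two-way fits, which also makes the first downside concrete: a third population has no place in this picture.
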
The idea of Zombies is a powerful one. It allows one to convert the allele frequencies output by the ADMIXTURE software into synthetic individuals representing the inferred ancestral populations. These individuals can then be used to estimate the ancestry of other individuals and populations. Not only does this have a tremendous performance benefit, but it also allows individuals and populations to be compared with the same "measuring stick".
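
As a rough illustration of the zombie idea, the Python sketch below draws synthetic genotypes from a vector of allele frequencies. It rests on assumptions of mine that are not spelled out above: each SNP is sampled independently, the two alleles of an individual are independent draws (Hardy-Weinberg proportions), and genotypes are coded as allele counts 0, 1 or 2. The name `make_zombies` is hypothetical.

```python
import random

def make_zombies(freqs, n_individuals, seed=0):
    """Draw synthetic individuals from one ancestral component.

    freqs: per-SNP frequency of the counted allele in that component.
    Returns a list of genotypes, each a list of allele counts (0, 1 or 2).
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    zombies = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP (assumed Hardy-Weinberg).
        genotype = [(rng.random() < p) + (rng.random() < p) for p in freqs]
        zombies.append(genotype)
    return zombies

zombies = make_zombies([0.0, 1.0, 0.5, 0.2], n_individuals=5)
for z in zombies:
    print(z)
```

A SNP with frequency 0 or 1 always yields the same genotype, so the zombies are "pure" representatives of the component wherever the component itself is fixed.
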

The DRACOS pipeline

Fine-scale admixture estimation can be achieved by putting together these three ideas. I have called this new technique DRACOS:
  • Dimensionality Reduction
  • Analysis into COmponents
  • Structure estimation
Here are the steps of the full DRACOS pipeline:

1. Dimensionality Reduction: Use PCA or MDS to convert genotype data into a few principal components or MDS dimensions
2. Analysis into Components: Use MCLUST over the MDS/PCA representation to infer the presence of clusters at a fine scale
3. Identify sets of individuals that clearly belong to each of the clusters; one can use a filter based on posterior probability (e.g., greater than 0.99) and/or distance from the cluster centroid (e.g., the 30 closest individuals)
4. Convert these sets of cluster-typical individuals into zombies for use with ADMIXTURE; alternatively, their allele frequencies themselves can be used, as in DIYDodecad, or in any other structure-like analysis.
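
Steps 3 and 4 can be sketched in a few lines of Python. The sketch assumes that steps 1 and 2 have already been run elsewhere and have produced, for each individual, its coordinates and its cluster posterior probabilities, plus one centroid per cluster. It applies both filters together, which is only one reading of the "and/or" above. The function names and the 0/1/2 genotype coding are my own illustrative choices, and missing genotypes are not handled.

```python
import math

def typical_members(posteriors, coords, centroids, min_prob=0.99, max_members=30):
    """Step 3: pick the individuals that clearly belong to each cluster.

    An individual qualifies for its most probable cluster if the posterior
    exceeds min_prob; of those, the max_members closest to the cluster
    centroid are kept. Returns {cluster index: [individual indices]}.
    """
    candidates = {}
    for i, probs in enumerate(posteriors):
        k = max(range(len(probs)), key=probs.__getitem__)
        if probs[k] > min_prob:
            candidates.setdefault(k, []).append(i)
    members = {}
    for k, idx in candidates.items():
        ranked = sorted(idx, key=lambda i: math.dist(coords[i], centroids[k]))
        members[k] = ranked[:max_members]
    return members

def allele_frequencies(genotypes, indices):
    """Step 4: per-SNP allele frequency among the chosen individuals."""
    n_snps = len(genotypes[indices[0]])
    return [sum(genotypes[i][j] for i in indices) / (2 * len(indices))
            for j in range(n_snps)]

# Toy data: 4 individuals, 2 clusters, 2 dimensions, 2 SNPs.
posteriors = [[0.995, 0.005], [0.6, 0.4], [0.001, 0.999], [0.999, 0.001]]
coords = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
centroids = [[0.4, 0.0], [5.0, 5.0]]
genotypes = [[2, 0], [1, 1], [0, 2], [1, 0]]

members = typical_members(posteriors, coords, centroids)
freqs = {k: allele_frequencies(genotypes, idx) for k, idx in members.items()}
print(members, freqs)
```

Individual 1 in the toy data is dropped because its best posterior is only 0.6; the resulting per-cluster frequencies are what would be turned into zombies or used directly.
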

The DRACOS approach addresses all the drawbacks of the three individual methods:
  1. Compared to Clusters Galore, it allows for admixture
  2. It allows one to create zombies at a fine scale. ADMIXTURE cannot do this, both because its running time is O(K^2) and because it lacks the model-based sophistication of MCLUST as applied over the first few principal components.
  3. Admixture can be estimated with any number of ancestral populations, not just two
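
To make the last point concrete, here is a minimal Python sketch of how admixture proportions over any number of components can be estimated once the component allele frequencies are fixed. It uses a textbook EM update for the standard admixture likelihood; I am not claiming this is what ADMIXTURE or DIYDodecad does internally, and the function name `estimate_admixture` is hypothetical. Genotypes are assumed to be coded as counts (0, 1 or 2) of the allele whose frequency is given.

```python
def estimate_admixture(genotype, freqs, n_iter=200, eps=1e-6):
    """Estimate admixture proportions for one individual by EM.

    genotype: allele counts (0, 1 or 2), one per SNP.
    freqs: freqs[k][j] is the allele frequency of SNP j in component k.
    Returns one proportion per component; the proportions sum to 1.
    """
    k = len(freqs)
    # Keep frequencies away from 0 and 1 so no division by zero can occur.
    f = [[min(max(p, eps), 1.0 - eps) for p in row] for row in freqs]
    q = [1.0 / k] * k  # start from equal proportions
    n_snps = len(genotype)
    for _ in range(n_iter):
        new_q = [0.0] * k
        for j, g in enumerate(genotype):
            # Probability that a random allele of this individual is the counted one.
            p1 = sum(q[m] * f[m][j] for m in range(k))
            p0 = 1.0 - p1
            for m in range(k):
                # Expected number of this individual's alleles at SNP j
                # that derive from component m.
                new_q[m] += (g * q[m] * f[m][j] / p1
                             + (2 - g) * q[m] * (1.0 - f[m][j]) / p0)
        q = [v / (2 * n_snps) for v in new_q]
    return q

# Three components; the test individual carries only the allele typical of the first.
freqs = [[0.9] * 20, [0.1] * 20, [0.5] * 20]
print(estimate_admixture([2] * 20, freqs))
```

Nothing in the update depends on the number of components, which is the contrast with the two-population geometry of the Oracle.
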
There are of course drawbacks to the DRACOS approach as well; my recent post on increased error in short-range clines identifies the major issues with attempting to do admixture estimation at this level.

I have a few things running in parallel at this time, but I am pretty sure I will eventually release a DRACOS-based calculator on the Dodecad project page. I anticipate that such a tool, in conjunction with DIYDodecad's "byseg" and "target" modes, may be helpful to genealogists, as it has the potential to infer the geographical origin of segments of DNA at a finer level of detail.