September 19, 2010

Playing with ADMIXTURE

(Last Update: 22 Sep)

I've been trying out ADMIXTURE recently. It's lightning fast compared to both frappe and STRUCTURE, its main competitors in the admixture estimate field, simple to use, and well-documented.

My immediate goal was to analyze the data from the recent Xing et al. (2010) paper. It's unfortunate that many recent papers do not put their data online, or hide it behind various institutional controls, but the data from that paper (a total of 40 populations typed for about a quarter million markers) is available online.

My main goal is to eventually update the EURO-DNA-CALC, making it more powerful and extending it with non-European populations. There are a few aspects that are particularly important:
  1. You can't assume that people will have the computing power and know-how to go through various steps to run ADMIXTURE themselves.
  2. The alternative of having people send me their genotype data is impossible because of legitimate privacy concerns and the obvious impossibility of accommodating a large number of requests.
The beauty of ADMIXTURE is that it provides allele frequency estimates for its inferred K ancestral populations. Thus, end users can side-step the task of running the full analysis (850 individuals x 250k markers): given those frequencies, only one individual's admixture proportions need to be estimated, which should make it possible to run the next version of EURO-DNA-CALC on modest machines.
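To illustrate the idea, here is a minimal sketch of how one person's admixture proportions could be estimated once the K sets of allele frequencies are held fixed. It uses a plain EM update for the usual binomial admixture model; this is a stand-in chosen for brevity, not ADMIXTURE's own optimizer and not necessarily what the next EURO-DNA-CALC will do. The function name, the 0/1/2 genotype coding, and the toy frequencies are all assumptions made for the example.

```python
def estimate_ancestry(genotype, freqs, iterations=200, eps=1e-9):
    """Estimate admixture proportions for one individual (hypothetical helper).

    genotype: list with the count (0, 1 or 2) of the reference allele at each
              SNP, or None where the genotype is missing.
    freqs:    K lists, each holding that ancestral population's frequency of
              the same reference allele at every SNP (kept fixed).
    Returns a list of K proportions that sum to 1.
    """
    k = len(freqs)
    typed = [j for j, g in enumerate(genotype) if g is not None]
    if not typed:
        raise ValueError("no non-missing genotypes")
    q = [1.0 / k] * k  # start from equal proportions
    for _ in range(iterations):
        new_q = [0.0] * k
        for j in typed:
            g = genotype[j]
            # Clamp frequencies away from 0 and 1 to avoid division by zero.
            p = [min(max(freqs[a][j], eps), 1.0 - eps) for a in range(k)]
            f = sum(q[a] * p[a] for a in range(k))  # mixed allele frequency
            for a in range(k):
                # Expected number of this SNP's two allele copies
                # that derive from ancestral population a.
                new_q[a] += (g * q[a] * p[a] / f
                             + (2 - g) * q[a] * (1.0 - p[a]) / (1.0 - f))
        q = [v / (2.0 * len(typed)) for v in new_q]
    return q


if __name__ == "__main__":
    # Toy data: two ancestral populations, four SNPs (made-up numbers).
    toy_freqs = [[0.9, 0.9, 0.1, 0.1],
                 [0.1, 0.1, 0.9, 0.9]]
    print(estimate_ancestry([2, 2, 0, 0], toy_freqs))
    print(estimate_ancestry([1, 1, 1, 1], toy_freqs))
```

Each iteration costs only K operations per SNP for a single genotype, which is why this step is cheap compared to the joint analysis of all 850 individuals.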

Here is a 10k SNP/K=7 run of ADMIXTURE on the aforementioned data, which had a running time of a few minutes on my machine. As you can see, 10k SNPs are already quite good at separating different groups of individuals. I will probably use more SNPs in the final version.



Feel free to leave comments on what features you'd like to see in the new version. I can't promise a timetable, but I will try to incorporate as many suggestions as I can.


UPDATE I (Sep 21):

Here is a run with all 246,554 SNPs for the 850 individuals. It looks like the figure published in the Xing et al. paper, except that I've kept the individuals in the order they appear in the genotype file, while the published version re-arranges them so that the different clusters appear contiguously. This run took several minutes. I estimate that the full run for K=12, i.e., the one needed to generate the other figure from the paper, will take about half a day, so I will probably leave it running overnight one of these days and post it as well.
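For anyone who wants the clusters to appear contiguously, as in the published figure, one simple option is to sort individuals by their dominant component before plotting. The sketch below assumes the admixture proportions have already been loaded as one list of K numbers per individual; the function name and the toy values are made up for the example, and the paper's own ordering may well have been done differently (e.g. by population label).

```python
def order_by_cluster(q_rows):
    """Return row indices grouped by dominant component (hypothetical helper).

    q_rows: one list of K admixture proportions per individual.
    Within each group, individuals with the largest share come first.
    """
    def sort_key(i):
        row = q_rows[i]
        top = max(range(len(row)), key=row.__getitem__)  # dominant component
        return (top, -row[top])
    return sorted(range(len(q_rows)), key=sort_key)


if __name__ == "__main__":
    # Toy proportions for four individuals and K=2 (made-up numbers).
    toy = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.95, 0.05]]
    print(order_by_cluster(toy))
```

The returned indices can then be used to re-order both the rows of proportions and the matching individual labels before drawing the bar plot.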


UPDATE II (Sep 22):

Here are the results for K=12 and 246,554 SNPs, which took (as I had estimated) about 10.5 hours to compute.

7 comments:

Spy said...

Good work, Dienekes! With more SNPs, perhaps you could even revive the EuropeanDNA2.0 categories (Southeastern, Iberian, Basque, Continental, Northeastern) — or perhaps give the user the choice of K each run. I would love it if you could collaborate with 23andme and make it one of their additional "Ancestry Lab" tools.

By the way, how does ADMIXTURE run so much faster? Does it still compute MLEs?

Salabencher said...

Support FTDNA data!

Fanty said...

"Support FTDNA data!"

Yeah! Second that!
I have useless 500k autosomal SNP "Family Finder" data of my own.

Which is totally useless, since I only got 5th-degree cousins, and they all live in the USA with descent from all over Europe.

And which is useless again, since the data is not compatible with the autosomal tests of other companies. :-(

Well, ok... FTDNA promises a "Population Finder" which shall come in "a few weeks", but they have been promising this for several months.

clusteredmaps said...

I wonder if this Admixture program could help with the Central Asian admixture estimates.

Fanty said...

It seems FTDNA has a beta of their "Population Finder" out.

It says I am:

85% Western European (+-11%)
15% European (+-11%)

hmm

John said...

Does ADMIXTURE only run in Linux or on a MacOS?

Dienekes said...

"Does ADMIXTURE only run in Linux or on a MacOS?"

Yes, AFAIK.