December 02, 2012

D-statistics on ADMIXTURE components

One of the most persistent questions I get as admin of the Dodecad Project is whether some low level of admixture (e.g., 0.7%) of some ancestral component is "noise" or "real".

I have hitherto advised all those who contacted me about this issue to (i) treat low levels of admixture with suspicion, and (ii) to run DIYDodecad in byseg mode; this might show whether this type of admixture is concentrated in some specific long segments, and is thus more likely to be "real" recent ancestry than low-level noise sprinkled across the genome that is more difficult to interpret.

Nonetheless, this was always unsatisfying to me, because it did not provide a way of quantifying one's confidence on the "reality" of the admixture evidence. Thus, I developed admixtureDstat.r an R script which calculates D-statistics of the form:

D(Pop1, Individual; Pop3, Outgroup)

If the individual can be seen as being drawn from population Pop1 but with some admixture from population Pop3, then this statistic will take significant negative values. For example, suppose that your main admixture component is "North_European", but you also have 1% "Siberian" admixture. You would want to calculate the following statistic:

D(North_European, YOU; Siberian, Palaeo_African)

which would tell you whether the Siberian admixture is "real" or not. (Of course, things are more complicated for those who might have both Siberian and African admixture, in which case their Siberian admixture would tend to make the D-statistic negative, and the African one positive, with the end result being a balance of the two processes).

There are of course many subtleties in the interpretation of D-statistics and I refer you to Green et al. (2010), Durand et al. (2011), and Patterson et al. (2012) for some of the technical details.

Using the script is quite simple, and only requires that you have R installed on your computer:
  • download standardize.r and admixtureDstat.r from here, saving them into some directory in your computer (henceforth, we will call this the "working directory"). If you have Genographic 2.0 data, you should also download hgdp.base.txt.
  • unzip your raw genotype data (from 23andMe, Family Finder, or Genographic 2.0) into the working directory. 
  • launch R and change the directory into the working directory (using the Menu in Windows, or setwd() in Unix-like operating systems). Enter in one line:
 source('admixtureDstat.r'); source('standardize.r')
  • In R, enter the command:

standardize('johndoe.txt', company='23andMe') 

  • The above command, will convert your data into a format understood by my script, writing a genotype.txt file in the working directory.  You should change johndoe.txt to whatever your unzipped raw data file is called, and the company should be one of '23andMe', 'ftdna', or 'geno2', or 'geno2new' depending on the source of your data. If you have used DIYDodecad before, you have already created a genotype.txt file, so you can skip this step.
  • Finally, you should have the four calculator files (with endings .par, .txt, .alleles, and .F) in the working directory. You can, for example, use the calculator files of the globe13, or if you have experience working with ADMIXTURE, you may make your own using your dataset. The .txt file will contain the names of the ancestral populations that you can use, so make sure you type them correctly if you decide to choose "listfile" mode (see below).
  • You are now all set to use the script! You can do this in either of two ways:
(1) outgroup mode:

In this mode, you specify an outgroup, i.e., one of the populations from the calculator, and the program cycles through all possible (Pop1, Pop3) pairs, outputs the D-statistics to the screen as it calculates them, and finally writes them to a dstat.txt file in the working directory.

To use this mode, you simply type:

admixtureDstat(parfile="globe13.par", outgroup="Palaeo_African")

The use of "Palaeo_African" as an outgroup is a reasonable choice for most non-Africans, since these are unlikely to have recent admixture from Sub-Saharan hunter-gatherer groups in which this component is represented.

Note that many of the D-statistics produced this way may have little meaning for you. For example, a person that is mostly European will get a very negative statistic of the form:

D(West_African, YOU; East_Asian, Palaeo_African)

But this will have little to do with your potential West_African or East_Asian ancestry, but rather with the relationships of populations (e.g., Europeans being more closely related to East Asians than West Africans). A little West_African/East_Asian ancestry will increase/decrease the value of this statistic, which will, however, remain strongly negative.

Instead, you should look at D-statistics that might be meaningful to you, e.g., if the following is negative:

D(North_European, YOU; East_Asian, Palaeo_African)

Then you might have some real East_Asian admixture.
(2) listfile mode:

In this mode, you write all the D-statistics you are interested in in a simple text file, e.g., listDstat.txt, in the order Pop1, Pop3, Outgroup, e.g.:
Mediterranean North_European Palaeo_African
North_European Siberian Palaeo_African
North_European West_Asian Palaeo_African
A reasonable choice is to calculate D-statistics where Pop1 is your most important component, e.g., North_European for someone from Finland, and Pop3 is a minor component whose "reality" you seek to investigate, e.g., Siberian.

Using the listfile mode will take less time (because you calculate a subset of D-statistics), and can be invoked as follows:

admixtureDstat(parfile="globe13.par", listfile="listDstat.txt")


The significance of D-statistics is assessed by the Z-scores, which are the last column of the output.    If they are greater than 3 in absolute value (i.e., less than -3 or greater than 3) then Z-scores are significant.

Other details: 

There are some additional options you might use. For example

   admixtureDstat(parfile="globe13.par", listfile="listDstat.txt", k=1000)

will use 1,000 SNPs for the block jackknife instead of the default 500. In general, there is little reason to mess with this parameter.

The screen output might be too wide for your R window, and you can fix this prior to running admixtureDstat by entering something like options(width=300) which allows more characters per line of screen output. In any case, you can see the program's output nicely formatted in the dstat.txt file in the working directory after it completes its run.


I will give an example of program usage using globe13 results. Take individual DOD133 whose results are seen below:

This individual is mostly Mediterranean (52.5%) and North_European (42%), but with small percentages of Amerindian (1.1%), Southwest_Asian (1.5%), Arctic (0.3%), and South_Asian (1.7%).

First, I calculate D(Mediterranean, DOD133; North_European, Palaeo_African) and D(North_European, DOD133; Mediterranean, Palaeo_African) to confirm the major admixture between Mediterranean and North_European. In listfile mode, I put the following in the listDstat.txt file:

Mediterranean North_European Palaeo_African
North_European Mediterranean Palaeo_African

The results are as follows:
Pop1 Pop3 Outgroup Dstat Z
Mediterranean North_European Palaeo_African -0.02399 -11.2
North_European Mediterranean Palaeo_African -0.033 -15.06

Ok, this confirms that DOD133 does indeed appear to be a mixture of North_European and Mediterranean. Now, let's take one of the minor components, e.g., South_Asian, and put the following in the listDstat.txt file:

Mediterranean South_Asian Palaeo_African
North_European South_Asian Palaeo_African

The results are now:

Pop1 Pop3 Outgroup Dstat Z
Mediterranean South_Asian Palaeo_African -0.01268 -5.98
North_European South_Asian Palaeo_African 0.00202 0.96

A possible interpretation for this pattern is that the individual does have some South_Asian-like admixture that is lacking in his Mediterranean component. Perhaps this reflects an ancient Central Asian population that migrated into both northern Europe and south Asia; some alleles from this population were incorporated into the Northern European gene pool, thus becoming part of what it means to be "northern European", so the evidence for admixture does not exist in the {North_European, South Asian} pair, since both of these contain gene flow from our hypothetical Central Asian population. There are many ways to interpret the observed patterns, and using admixtureDstat you can explore some of them.

Now, let's take another minor component, Arctic (0.3%):

Pop1 Pop3 Outgroup Dstat Z
Mediterranean Arctic Palaeo_African -0.02413 -9.48
North_European Arctic Palaeo_African 0.01028 3.95

This is an interesting pattern; the individual appears admixed with Arctic relative to Mediterranean, but North_European appears to be more Arctic than DOD133. A possible explanation is that this Arctic component represents ancestry that was mediated by a north European population, that as Patterson et al. (2012) have shown contain some "north Eurasian" ancestry.

Finally, let's take the Southwest_Asian minor component (1.5%), where the reverse situation applies:
Pop1 Pop3 Outgroup Dstat Z
Mediterranean Southwest_Asian Palaeo_African 0.00496 2.4
North_European Southwest_Asian Palaeo_African -0.01075 -5.05

So, in this case, this might represent ancestry common between Mediterranean and Southwest_Asian that contrasts with the North_European portion of the individual's genome.

I won't pretend that interpreting D-statistics is easy, but they are certainly a nice exploratory tool to have in one's arsenal, and I hope that they will prove useful.

TERMS OF USE: You are free to use and modify this tool for any non-commercial purpose, as long as you provide a link to Dienekes' Anthropology Blog or this blog post when you do so. You should probably also cite one of the aforementioned papers where D-statistics were discussed, as well as the ADMIXTURE paper.

UPDATE (Dec 3): You might want to try D-statistics using globe13anc, a new calculator that includes an Ancestral (chimp) outgroup.


Gui S said...

Very interesting development.
Just wanted to mention that you wrote
setwd('johndoe.txt', company='23andMe')
instead of
standardize('johndoe.txt', company='23andMe')
in the first part of your explanation.

Westgoth said...

[1,] "Pop1" "Pop3" "Outgroup" "Dstat" "Z"
[2,] "Mediterranean" "North_European" "Palaeo_African" "-0.01484" "-6.62"
[3,] "North_European" "Mediterranean" "Palaeo_African" "-0.01213" "-5.75"
[4,] "West_Asian" "North_European" "Palaeo_African" "-0.01634" "-7.42"
[5,] "North_European" "West_Asian" "Palaeo_African" "0.00678" "3.49"
[6,] "Mediterranean" "West_Asian" "Palaeo_African" "-0.01174" "-5.69"
[7,] "West_Asian" "Mediterranean" "Palaeo_African" "-0.03186" "-14.51"
[8,] "Mediterranean" "Southwest_Asian" "Palaeo_African" "0.00916" "4.34"
[9,] "North_European" "Southwest_Asian" "Palaeo_African" "-0.0064" "-3.02"
[10,] "West_Asian" "Southwest_Asian" "Palaeo_African" "-0.0176" "-8.01"
[11,] "Mediterranean" "Siberian" "Palaeo_African" "-0.01452" "-6.03"
[12,] "North_European" "Siberian" "Palaeo_African" "0.00813" "3.45"
[13,] "West_Asian" "Siberian" "Palaeo_African" "-0.00601" "-2.45"
[14,] "Mediterranean" "Amerindian" "Palaeo_African" "-0.02186" "-8.76"
[15,] "North_European" "Amerindian" "Palaeo_African" "0.01645" "6.78"
[16,] "West_Asian" "Amerindian" "Palaeo_African" "-0.00461" "-1.87"
[17,] "Mediterranean" "South_Asian" "Palaeo_African" "-0.00649" "-2.95"
[18,] "North_European" "South_Asian" "Palaeo_African" "0.00799" "3.76"
[19,] "West_Asian" "South_Asian" "Palaeo_African" "-0.00222" "-1.02"
[20,] "Mediterranean" "East_Asian" "Palaeo_African" "-0.01169" "-4.84"
[21,] "North_European" "East_Asian" "Palaeo_African" "0.00663" "2.85"
[22,] "West_Asian" "East_Asian" "Palaeo_African" "-0.00321" "-1.33"

So... the North_European component is more West_Asian, Siberian, Amerindian and South_Asian than I am...

Epicurious said...

[1,] "Pop1" "Pop3" "Outgroup" "Dstat" "Z"
[2,]"East_African" "Mediterranean" "Ancestral" -0.10803 -29.46
"East_African" "Mediterranean" "Australasian" -0.03937 -15.87
"East_African" "West_African" "Ancestral" -0.00663 -2.22
"East_African" "West_African" "Australasian" 0.07242 27.16

[1,] "Pop1" "Pop3" "Outgroup" "Dstat" "Z"
[2,] "East_African" "Arctic" "Ancestral" "-0.08653" "-23.73"
[3,] "East_African" "Siberian" "Ancestral" "-0.08349" "-23.08"
[4,] "East_African" "Palaeo_African" "Ancestral" "-0.00427" "-1.41"

[1,] "Pop1" "Pop3" "Outgroup" "Dstat" "Z"
[2,] "East_African" "Australasian" "Ancestral" "-0.07684" "-20.99"
[3,] "East_African" "Amerindian" "Ancestral" "-0.08622" "-23.94"
[4,] "East_African" "East_Asian" "Ancestral" "-0.07973" "-22.63"

[1,] "Pop1" "Pop3" "Outgroup" "Dstat" "Z"
[2,] "East_African" "Southwest_Asian" "Ancestral" "-0.12617" "-35.1"
[3,] "Southwest_Asian" "East_African" "Ancestral" "-0.01684" "-5.13"

Why the Z-score significance disprecancy for a Somali: pop1East_African, pop3West_African, outgroups ancestral,-2.22 /australasian, 27.16?

Dienekes said...

Australasian isn't an outgroup to {East African, West African}; Australasians are related to other non-Africans, and hence also to East Africans who have West Eurasian admixture.

Unknown said...

Could you do the analysis on my file. My membership number in Dodecad project is DOD865. I wanted to know if 0.1% African is real or not? Could you do it?