Showing posts with label D-statistics. Show all posts

February 22, 2013

ADMIXTOOLS 1.1 released

A new 1.1 version of ADMIXTOOLS has been released. From the description:
ADMIXTOOLS (Patterson et al. 2012) is a software package that supports formal tests of whether admixture occurred, and makes it possible to infer admixture proportions and dates. It can be downloaded for LINUX (see documentation). The software package also includes Affymetrix Human Origins Curated Dataset. Write to Arti Tandon if you have questions about the software and for scientific questions write to Nick Patterson. The new release fixes a serious bug in qpDstat. 
I've used this software before and posted some D-statistics from it on the blog, so if you find any that look strange, feel free to leave a comment. In any case, I'll be using the new version of qpDstat from now on.

UPDATE (Feb 22): Nick Patterson asked me to post the following for users of ADMIXTOOLS:
Choongwon Jeong of the University of Chicago found a serious bug in qpDstat (computes D-statistics) that sometimes returns D with an incorrect sign. If you use the program please download ADMIXTOOLS version 1.1 from the Reich lab web page. http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html

December 03, 2012

'globe13anc' calculator with chimp outgroup

I was thinking a bit about my suggestion to use Palaeo_African as an outgroup for D-statistic calculations using my new admixtureDstat script, and it occurred to me that it would be fairly easy to modify one of my calculators to include a sample that is indeed symmetrically related to all modern human groups.

To do this, I created an individual possessing the ancestral allele using hgdpGeo as a reference. According to the reference for this table:

Samples collected by the HGDP-CEPH from 1,043 individuals from around the world were genotyped for 657,000 SNPs at Stanford. Ancestral states for all SNPs were estimated using whole genome human-chimpanzee alignments from the UCSC database. For each SNP in the human genome (NCBI Build 35, UCSC database hg17), the allele at the corresponding position in the chimp genome (Build 2 version 1, UCSC database pantro2) was used as ancestral.
My new globe13anc calculator is simply a version of the latest globe13 one, but with an extra "Ancestral" component, so it has 13+1 = 14 ancestral components in total.

You can of course use globe13anc as any other calculator designed for DIYDodecad, and hopefully no one will get anything other than 0% for the "Ancestral" component :)

But, the main point of building this is to help you infer D-statistics with no suspicion that gene flow within the human species may affect the results; while the Khoesan of South Africa (where the Palaeo_African component is modal) are an approximate outgroup to the rest of mankind, there is evidence that even their most isolated groups have some external gene flow. So, using this "Ancestral" outgroup instead of Palaeo_African ought to make things cleaner for everyone.

December 02, 2012

D-statistics on ADMIXTURE components

One of the most persistent questions I get as admin of the Dodecad Project is whether some low level of admixture (e.g., 0.7%) of some ancestral component is "noise" or "real".

I have hitherto advised all those who contacted me about this issue to (i) treat low levels of admixture with suspicion, and (ii) run DIYDodecad in byseg mode; this might show whether this type of admixture is concentrated in some specific long segments, and is thus more likely to be "real" recent ancestry than low-level noise sprinkled across the genome that is more difficult to interpret.

Nonetheless, this was always unsatisfying to me, because it did not provide a way of quantifying one's confidence in the "reality" of the admixture evidence. Thus, I developed admixtureDstat.r, an R script which calculates D-statistics of the form:

D(Pop1, Individual; Pop3, Outgroup)

If the individual can be seen as being drawn from population Pop1 but with some admixture from population Pop3, then this statistic will take significantly negative values. For example, suppose that your main admixture component is "North_European", but you also have 1% "Siberian" admixture. You would want to calculate the following statistic:

D(North_European, YOU; Siberian, Palaeo_African)

which would tell you whether the Siberian admixture is "real" or not. (Of course, things are more complicated for those who might have both Siberian and African admixture, in which case their Siberian admixture would tend to make the D-statistic negative, and the African one positive, with the end result being a balance of the two processes).

There are of course many subtleties in the interpretation of D-statistics and I refer you to Green et al. (2010), Durand et al. (2011), and Patterson et al. (2012) for some of the technical details.
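To make the sign convention concrete, here is a minimal Python sketch (not the R script itself) of a D-statistic computed from per-SNP allele frequencies, using the ratio form given in Patterson et al. (2012) as I understand it. Whether admixtureDstat.r uses exactly this estimator is an assumption; the function name and the toy frequencies are hypothetical.

```python
def d_statistic(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) from per-SNP allele frequencies in [0, 1].

    Numerator:   sum of (p1 - p2) * (p3 - p4)
    Denominator: sum of (p1 + p2 - 2*p1*p2) * (p3 + p4 - 2*p3*p4)
    """
    if not (len(p1) == len(p2) == len(p3) == len(p4)):
        raise ValueError("all frequency lists must have the same length")
    num = 0.0
    den = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        num += (a - b) * (c - d)
        den += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den


# Toy data (hypothetical): four SNPs, outgroup fixed for the ancestral allele.
pop1 = [0.1, 0.9, 0.2, 0.8]
pop3 = [0.9, 0.1, 0.8, 0.2]
outgroup = [0.0, 0.0, 0.0, 0.0]

# An "individual" drawn purely from Pop1, and one with 10% Pop3 admixture.
pure = list(pop1)
admixed = [0.9 * a + 0.1 * c for a, c in zip(pop1, pop3)]

d_pure = d_statistic(pop1, pure, pop3, outgroup)
d_admixed = d_statistic(pop1, admixed, pop3, outgroup)
print(d_pure, d_admixed)
```

In this toy case the unadmixed individual gives D = 0, while the admixed one gives a negative D, which is the direction described above. Real genotype data would of course add sampling noise, which is why the Z-score matters.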

Using the script is quite simple, and only requires that you have R installed on your computer:
  • download standardize.r and admixtureDstat.r from here, saving them into some directory in your computer (henceforth, we will call this the "working directory"). If you have Genographic 2.0 data, you should also download hgdp.base.txt.
  • unzip your raw genotype data (from 23andMe, Family Finder, or Genographic 2.0) into the working directory. 
  • launch R and change the directory into the working directory (using the Menu in Windows, or setwd() in Unix-like operating systems). Enter in one line:
 source('admixtureDstat.r'); source('standardize.r')
  • In R, enter the command:

standardize('johndoe.txt', company='23andMe') 

  • The above command will convert your data into a format understood by my script, writing a genotype.txt file in the working directory. You should change johndoe.txt to whatever your unzipped raw data file is called, and the company should be one of '23andMe', 'ftdna', 'geno2', or 'geno2new', depending on the source of your data. If you have used DIYDodecad before, you have already created a genotype.txt file, so you can skip this step.
  • Finally, you should have the four calculator files (with endings .par, .txt, .alleles, and .F) in the working directory. You can, for example, use the calculator files of globe13, or if you have experience working with ADMIXTURE, you may make your own using your dataset. The .txt file will contain the names of the ancestral populations that you can use, so make sure you type them correctly if you decide to choose "listfile" mode (see below).
  • You are now all set to use the script! You can do this in either of two ways:
(1) outgroup mode:

In this mode, you specify an outgroup, i.e., one of the populations from the calculator, and the program cycles through all possible (Pop1, Pop3) pairs, outputs the D-statistics to the screen as it calculates them, and finally writes them to a dstat.txt file in the working directory.

To use this mode, you simply type:

admixtureDstat(parfile="globe13.par", outgroup="Palaeo_African")

The use of "Palaeo_African" as an outgroup is a reasonable choice for most non-Africans, since these are unlikely to have recent admixture from Sub-Saharan hunter-gatherer groups in which this component is represented.

Note that many of the D-statistics produced this way may have little meaning for you. For example, a person that is mostly European will get a very negative statistic of the form:

D(West_African, YOU; East_Asian, Palaeo_African)

This will have little to do with your potential West_African or East_Asian ancestry; rather, it reflects the relationships between populations (e.g., Europeans being more closely related to East Asians than to West Africans). A little West_African/East_Asian ancestry will increase/decrease the value of this statistic, which will, however, remain strongly negative.

Instead, you should look at D-statistics that might be meaningful to you, e.g., if the following is negative:

D(North_European, YOU; East_Asian, Palaeo_African)

Then you might have some real East_Asian admixture.
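The cycling through (Pop1, Pop3) pairs in outgroup mode can be pictured with a short Python sketch. It assumes that the pairs are ordered (so that both D(A, YOU; B, Outgroup) and D(B, YOU; A, Outgroup) are produced) and that the outgroup itself is excluded; both points are assumptions about the script's behaviour, and the function name is hypothetical.

```python
from itertools import permutations


def outgroup_mode_triples(populations, outgroup):
    """List every ordered (Pop1, Pop3, Outgroup) triple for a fixed outgroup."""
    if outgroup not in populations:
        raise ValueError("outgroup must be one of the calculator populations")
    candidates = [p for p in populations if p != outgroup]
    return [(pop1, pop3, outgroup) for pop1, pop3 in permutations(candidates, 2)]


# A reduced, illustrative population list (globe13 itself has 13 components).
populations = ["North_European", "Mediterranean", "Siberian", "Palaeo_African"]
triples = outgroup_mode_triples(populations, "Palaeo_African")
for pop1, pop3, out in triples:
    print(f"D({pop1}, YOU; {pop3}, {out})")
```

Under these assumptions a 13-component calculator would yield 12 x 11 = 132 statistics, which is why it helps to pick out only the ones that are meaningful for you.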
(2) listfile mode:

In this mode, you list all the D-statistics you are interested in within a simple text file, e.g., listDstat.txt, in the order Pop1, Pop3, Outgroup, e.g.:
Mediterranean North_European Palaeo_African
North_European Siberian Palaeo_African
North_European West_Asian Palaeo_African
A reasonable choice is to calculate D-statistics where Pop1 is your most important component, e.g., North_European for someone from Finland, and Pop3 is a minor component whose "reality" you seek to investigate, e.g., Siberian.

Using the listfile mode will take less time (because you calculate a subset of D-statistics), and can be invoked as follows:

admixtureDstat(parfile="globe13.par", listfile="listDstat.txt")
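Since a mistyped population name is the most likely way for listfile mode to go wrong, it can help to check the file before running the script. The following Python sketch parses text in the three-column layout shown above and validates the names against a list of known populations. It is only an illustration: the function name is hypothetical, and it does not reproduce how admixtureDstat.r reads the file.

```python
def parse_listfile(text, known_populations=None):
    """Parse lines of 'Pop1 Pop3 Outgroup' into a list of 3-tuples."""
    triples = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 names, got {len(fields)}")
        if known_populations is not None:
            unknown = [name for name in fields if name not in known_populations]
            if unknown:
                raise ValueError(f"line {line_number}: unknown population(s) {unknown}")
        triples.append(tuple(fields))
    return triples


listfile_text = """Mediterranean North_European Palaeo_African
North_European Siberian Palaeo_African
North_European West_Asian Palaeo_African
"""
known = {"Mediterranean", "North_European", "Siberian", "West_Asian", "Palaeo_African"}
parsed = parse_listfile(listfile_text, known)
print(parsed)
```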

Z-scores

The significance of D-statistics is assessed by the Z-scores, which are the last column of the output. If a Z-score is greater than 3 in absolute value (i.e., less than -3 or greater than 3), then the corresponding D-statistic is significant.
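As a small illustration of the |Z| > 3 rule, the Python snippet below applies it to two rows taken from the DOD133 example further down this post. The helper name and the row layout are hypothetical conveniences, not part of the script.

```python
def is_significant(z, threshold=3.0):
    """Return True when a Z-score exceeds the threshold in absolute value."""
    return abs(z) > threshold


# (Pop1, Pop3, Outgroup, Dstat, Z) rows from the DOD133 South_Asian example.
rows = [
    ("Mediterranean", "South_Asian", "Palaeo_African", -0.01268, -5.98),
    ("North_European", "South_Asian", "Palaeo_African", 0.00202, 0.96),
]
flags = [is_significant(row[4]) for row in rows]
for row, flag in zip(rows, flags):
    print(row[0], row[1], "significant" if flag else "not significant")
```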

Other details: 

There are some additional options you might use. For example

   admixtureDstat(parfile="globe13.par", listfile="listDstat.txt", k=1000)

will use 1,000 SNPs for the block jackknife instead of the default 500. In general, there is little reason to mess with this parameter.
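For readers curious about what a block jackknife does, here is a minimal Python sketch of the simple, equal-weight delete-one-block version applied to a ratio statistic such as D. It assumes that k is the number of consecutive SNPs per block and that the per-SNP numerator and denominator terms are already available; ADMIXTOOLS uses a weighted variant, and I make no claim that admixtureDstat.r matches this sketch in detail. All names and the toy data are hypothetical.

```python
import math


def block_jackknife_z(num_terms, den_terms, k=500):
    """Return (D, standard error, Z) using a delete-one-block jackknife.

    num_terms and den_terms are the per-SNP numerator and denominator
    contributions to D; k is the number of consecutive SNPs per block.
    """
    if len(num_terms) != len(den_terms):
        raise ValueError("numerator and denominator lists differ in length")
    blocks = [
        (sum(num_terms[i:i + k]), sum(den_terms[i:i + k]))
        for i in range(0, len(num_terms), k)
    ]
    n = len(blocks)
    if n < 2:
        raise ValueError("need at least two blocks")
    total_num = sum(b[0] for b in blocks)
    total_den = sum(b[1] for b in blocks)
    d_full = total_num / total_den
    # Recompute D with each block left out in turn.
    leave_one_out = [(total_num - bn) / (total_den - bd) for bn, bd in blocks]
    mean = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((x - mean) ** 2 for x in leave_one_out)
    se = math.sqrt(variance)
    return d_full, se, d_full / se


# Deterministic toy data: 4,000 SNPs with a slightly negative average signal.
num_terms = [-0.02 + 0.01 * ((i % 7) - 3) for i in range(4000)]
den_terms = [1.0] * 4000
d, se, z = block_jackknife_z(num_terms, den_terms, k=500)
print(d, se, z)
```

Blocks of neighbouring SNPs are dropped together, rather than single SNPs, because nearby markers are correlated; treating them as independent would make the standard error look smaller than it is.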

The screen output might be too wide for your R window, and you can fix this prior to running admixtureDstat by entering something like options(width=300) which allows more characters per line of screen output. In any case, you can see the program's output nicely formatted in the dstat.txt file in the working directory after it completes its run.

AN EXAMPLE

I will give an example of program usage using globe13 results. Take individual DOD133 whose results are seen below:



This individual is mostly Mediterranean (52.5%) and North_European (42%), but with small percentages of Amerindian (1.1%), Southwest_Asian (1.5%), Arctic (0.3%), and South_Asian (1.7%).

First, I calculate D(Mediterranean, DOD133; North_European, Palaeo_African) and D(North_European, DOD133; Mediterranean, Palaeo_African) to confirm the major admixture between Mediterranean and North_European. In listfile mode, I put the following in the listDstat.txt file:

Mediterranean North_European Palaeo_African
North_European Mediterranean Palaeo_African

The results are as follows:
Pop1             Pop3             Outgroup         Dstat      Z
Mediterranean    North_European   Palaeo_African   -0.02399   -11.2
North_European   Mediterranean    Palaeo_African   -0.033     -15.06


Ok, this confirms that DOD133 does indeed appear to be a mixture of North_European and Mediterranean. Now, let's take one of the minor components, e.g., South_Asian, and put the following in the listDstat.txt file:


Mediterranean South_Asian Palaeo_African
North_European South_Asian Palaeo_African


The results are now:

Pop1             Pop3             Outgroup         Dstat      Z
Mediterranean    South_Asian      Palaeo_African   -0.01268   -5.98
North_European   South_Asian      Palaeo_African   0.00202    0.96

A possible interpretation for this pattern is that the individual does have some South_Asian-like admixture that is lacking in his Mediterranean component. Perhaps this reflects an ancient Central Asian population that migrated into both northern Europe and south Asia; some alleles from this population were incorporated into the Northern European gene pool, thus becoming part of what it means to be "northern European", so the evidence for admixture does not exist in the {North_European, South Asian} pair, since both of these contain gene flow from our hypothetical Central Asian population. There are many ways to interpret the observed patterns, and using admixtureDstat you can explore some of them.

Now, let's take another minor component, Arctic (0.3%):

Pop1             Pop3             Outgroup         Dstat      Z
Mediterranean    Arctic           Palaeo_African   -0.02413   -9.48
North_European   Arctic           Palaeo_African   0.01028    3.95

This is an interesting pattern; the individual appears admixed with Arctic relative to Mediterranean, but North_European appears to be more Arctic than DOD133. A possible explanation is that this Arctic component represents ancestry that was mediated by a north European population, which, as Patterson et al. (2012) have shown, contains some "north Eurasian" ancestry.

Finally, let's take the Southwest_Asian minor component (1.5%), where the reverse situation applies:
Pop1             Pop3              Outgroup         Dstat      Z
Mediterranean    Southwest_Asian   Palaeo_African   0.00496    2.4
North_European   Southwest_Asian   Palaeo_African   -0.01075   -5.05

So, in this case, this might represent ancestry common between Mediterranean and Southwest_Asian that contrasts with the North_European portion of the individual's genome.

I won't pretend that interpreting D-statistics is easy, but they are certainly a nice exploratory tool to have in one's arsenal, and I hope that they will prove useful.

TERMS OF USE: You are free to use and modify this tool for any non-commercial purpose, as long as you provide a link to Dienekes' Anthropology Blog or this blog post when you do so. You should probably also cite one of the aforementioned papers where D-statistics were discussed, as well as the ADMIXTURE paper.

UPDATE (Dec 3): You might want to try D-statistics using globe13anc, a new calculator that includes an Ancestral (chimp) outgroup.

November 14, 2012

Pig genome + admixture into European wild boars

Of interest: 
The domestic pig (Sus scrofa) is a eutherian mammal and a member of the Cetartiodactyla order, a clade distinct from rodent and primates, that last shared a common ancestor with humans between 79 and 97 million years (Myr) ago1,2 (http://www.timetree.net). Molecular genetic evidence indicates that Sus scrofa emerged in South East Asia during the climatic fluctuations of the early Pliocene 5.3–3.5 Myr ago. Then, beginning ~10,000 years ago, pigs were domesticated in multiple locations across Eurasia3 (Frantz, L. A. F. et al., manuscript submitted).

also:
We found a clear signal for admixture between North Chinese and European populations of wild boars that we interpret as migrations across Eurasia during the later stage of the Pleistocene (Supplementary Table 24). Moreover, this hypothesis is further supported by the high value of concordance factor on the X chromosomes (Supplementary Table 20). The demographic analysis shows that the last glacial maximum (LGM)-induced bottleneck had similar magnitude in Europe and North China (Figure 2, main text). Together, these evidences suggest the existence of another (besides Asian + European) biogeographic zone for pigs, extending across North Eurasia. 
... 
There was a strong signal for admixture from Asian into European breeds. We found that European domestic breeds such as Landrace and Large White have a significant amount of Asian genetic material (Supplementary Table 24). This admixture is likely to be due to importation of Chinese breeds into Europe (especially UK) at the onset of the 'agricultural' revolution in the late 18th and 19th century.
Nature 491, 393–398 (15 November 2012) doi:10.1038/nature11622

Analyses of pig genomes provide insight into porcine demography and evolution

Martien A. M. Groenen et al.

For 10,000 years pigs and humans have shared a close and complex relationship. From domestication to modern breeding practices, humans have shaped the genomes of domestic pigs. Here we present the assembly and analysis of the genome sequence of a female domestic Duroc pig (Sus scrofa) and a comparison with the genomes of wild and domestic pigs from Europe and Asia. Wild pigs emerged in South East Asia and subsequently spread across Eurasia. Our results reveal a deep phylogenetic split between European and Asian wild boars ~1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.


November 08, 2012

Okinawans and admixture in East Asia

I don't use the Pan-Asian SNP Consortium data much, but the upcoming paper on the Ainu spurred me to give it a look, because it contains an Okinawan sample (JP-RK). I calculated all f3-statistics that involved this sample, and report the lowest f3-statistic for all populations in this set that appear to be admixed:


Several of these are interesting:
  • A set of Indonesian populations (ID prefix; Lamaholot, Lembata, Kambera, Manggarai) are mixed with Melanesians (AX-ME)
  • A set of Indian populations appear admixed (IN prefix). It seems that the Okinawan sample acts as a surrogate for "Asian" ancestry 
  • Filipino populations PI-UI and PI-UN (listed as Visaya, Chabakano and Tagalog) are seen as mixtures of Okinawans and PI-UB (Ilocano)
  • The three Singaporean populations (SG prefix) are seen as mixtures with other groups: the SG-ID Tamil Indians with Caucasoids (CEU), the SG-ML Malay with Sunda Indonesians (ID-SU), and the SG-CH Singaporean Chinese with northern Zhuang Chinese (CN-CC)
  • Tai Yuan from Thailand with Mlabri (TH-TU with TH-MA)
  • Taiwanese (Hakka TW-HA and Minnan TW-HB) with CN-CC (Zhuang) and Jiamao (CN-JI)
  • Cantonese CN-GA  with Jiamao (CN-JI)
  • Uygur CN-UG with West Eurasians (CEU)
And, of course JPT and JP-ML (Japanese) are seen as a mixture of Okinawans and Mandarin Han (CN-SH) and Beijing Chinese (CHB).

An interesting question is whether the mainland East Asian Yayoi element in Japanese is more similar to Han (as the f3 statistic suggests) or to Koreans. Interestingly, Koreans themselves (KR-KR) appear admixed between Han (CN-SH) and Okinawans. So, it seems that whatever this Okinawan element represents was not limited to the isles of Japan.
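For readers unfamiliar with f3-statistics, the idea behind the test used above can be sketched in a few lines of Python. The statistic f3(C; A, B) averages (c - a)(c - b) over SNPs, and a negative value is the signal that C is admixed between populations related to A and B. This sketch omits the finite-sample correction and the standard errors that ADMIXTOOLS computes, and the toy frequencies are hypothetical.

```python
def f3(target, source_a, source_b):
    """Uncorrected f3(target; A, B): the mean over SNPs of (c - a) * (c - b)."""
    if not (len(target) == len(source_a) == len(source_b)):
        raise ValueError("all frequency lists must have the same length")
    if not target:
        raise ValueError("no SNPs supplied")
    total = sum((c - a) * (c - b) for c, a, b in zip(target, source_a, source_b))
    return total / len(target)


# Toy allele frequencies: the target sits exactly halfway between A and B.
pop_a = [0.1, 0.9, 0.2, 0.8]
pop_b = [0.9, 0.1, 0.8, 0.2]
mixed = [0.5, 0.5, 0.5, 0.5]

f3_mixed = f3(mixed, pop_a, pop_b)   # target intermediate between the sources
f3_source = f3(pop_a, mixed, pop_b)  # an unadmixed source used as the target
print(f3_mixed, f3_source)
```

The intermediate population gives a negative value, while putting one of the sources in the target position gives a positive one; only the former is read as evidence of admixture.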

I also calculated the D-statistic:

D(CN-SH      KR-KR  :      JP-RK        YRI) =      -0.0154   (Z = -14.779)

which indeed suggests that there is an excess of "Okinawan"-like ancestry in Koreans compared to the Chinese. This is very interesting, because it suggests that the similarity between Koreans and Japanese is due to a common substratum in the two populations.

October 18, 2012

ADMIXTURE tracks Amerindian-like admixture in northern Europe

I have recently assembled a new "world" dataset of 4,280 individuals that I am currently incrementally analyzing with ADMIXTURE. But, I noticed an interesting pattern at K=4 that I wanted to share right away.

Four ancestral populations emerge at this level of resolution, which I have named: European, Asian, African, Amerindian. The names aren't important, and you can replace them with whatever you prefer.

The interesting thing about this K=4 analysis is that European populations show evidence of Amerindian admixture, consistent with the pattern inferred using f-statistics, where European populations show admixture between Sardinians and a Karitiana-like population.

This pattern may have emerged at previous ADMIXTURE analyses at this level of resolution, but thanks to the f3 evidence presented in previous posts, it is now clear that it is no quirk of ADMIXTURE, but indicative of a real (albeit still rather mysterious) pattern of gene flow that differentially affected European populations.

For example, the Irish_D population has 7.6% of the Amerindian component, and so do HGDP Orcadians. HGDP Sardinians have only 1.7% of it, which appears to be the minimum in Europe, with French_Basque having more at 4.6%.

Another interesting observation is that West Eurasian populations that show an excess of East Eurasian-like admixture appear to be doing so for two separate reasons. For example, HGDP Russians have 11.7% of Amerindian component, but also 4.5% of "Asian", and 1000 Genomes Finns have 3.3% Asian and 12% Amerindian. Behar et al. (2010) Turks, on the other hand, have 9.9% Asian and 2.2% Amerindian. All these populations are East Eurasian-shifted relative to Sardinians, a pattern which can also be observed by looking at the K=3 analysis, but for apparently different reasons.

The pattern for Near Eastern populations is also interesting. For example, Yunusbayev et al. (2011) Armenians have 0% of the Amerindian component, and 5.7% of the Asian, and all three HGDP Arab populations (Druze, Palestinian, Bedouin) also have 0% of the Amerindian component, with variable levels of the Asian.

It would appear that whatever process contributed Amerindian-like admixture to Europeans minimally affected Near Eastern populations, with Sardinians, who are demonstrably related to Neolithic Europeans (thanks to ancient DNA evidence), tilting towards the Near Eastern pattern. On the other hand, Near Eastern populations show evidence of Asian admixture, which probably involves unresolved East Asian/ASI ancestry, and will be resolved at higher K. Sardinians appear to be at the end of three clines: (i) Amerindian-like cline of Europe-Siberia-Americas, (ii) East Asian-like cline of Europe-Central Asia/Siberia-East Asia, (iii) ASI-like cline of Europe-Near East-South Asia. These are separate, but not independent, phenomena.

To confirm that the signal picked up by ADMIXTURE tracks the signal picked up by ADMIXTOOLS formal tests, I calculated the following D-statistic:

D(Sardinian, European, Karitiana, San)

where European is any population with a sample size of at least 10, and which belonged at 99% in the European+Amerindian components:


And, here is a scatterplot:
The correlation is clear, and the Pearson coefficient is -0.96. This means that populations with higher % Amerindian, as estimated by ADMIXTURE, also show stronger D-statistic evidence for admixture (i.e., more negative values of D).

What of the actual estimates of admixture produced by ADMIXTURE? Using the F4 ratio test, I recently showed that African admixture in Sardinians confounds estimates of Amerindian-like admixture in northern Europeans and vice versa (Amerindian-like admixture in northern Europeans confounds African admixture in Sardinians).

In that experiment, I "scrubbed" Sardinians to remove segments of African ancestry, and showed that estimates of Amerindian-like admixture in the CEU population diminished from 13.9% to 8.8%. The latter seems reasonably close to the 7.1% inferred by ADMIXTURE.

On balance, I would say that ADMIXTURE at K=4 provides a good proxy for the effect described in Patterson et al. (2012). Its results are more difficult to interpret, because its underlying model does not take into account evolutionary relationships between populations. On the other hand, it has the advantage of being able to handle multiple ancestral populations, and has consistently proven able to generate useful data that correlate well with those from other techniques of population genetics.

October 17, 2012

The tangled web of humanity

Indian populations are composed of two ancestral components: Ancestral North Indians (ANI) and Ancestral South Indians (ASI), discovered by Reich et al. (2009). In that paper, it was also shown that ASI forms a clade with East Eurasians, while ANI does so with West Eurasians.

Patterson et al. (2012) published a different pattern: non-Sardinian Europeans have North Eurasian-like ancestry that links them to Amerindian populations. It is thus possible that ASI and the East Eurasian-like admixture in North Europeans may share a common evolutionary history:


Now, consider a hypothetical population of the Indian Cline. A European population is related to it both via its ANI/West Eurasian ancestry and via its ASI ancestry, because the East_Eurasian component in Europeans shares a portion of ancestry (indicated by the red arrow) with ASI.

Sardinians lack (or have less of) this "red arrow" portion of ancestry. 

It is also possible that ANI itself may have some East_Eurasian ancestry, like Europeans do; this is not indicated in the figure. More on this later.

Consider the following D-statistic:

D(European, Sardinian, Indian, San)

As we shall see, this takes positive values, consistent with the idea of gene flow between Europeans and Indians to the exclusion of Sardinians. However, this gene flow may involve either the West Eurasian component in the ancestry of Indians (i.e., this component is more related to Europeans than to Sardinians), or the ASI component (which is related to Europeans via the common "red arrow" portions of ancestry).

We can figure out what is going on by trying different Indian populations along the Indian Cline, and seeing whether the D-statistic is inflated/deflated in populations of greater ANI/ASI ancestry.

Here are the results:


                Russian Orcadian French Lithuanians   ANI
Mala             0.0153   0.0120 0.0088      0.0131 38.86
Madiga           0.0153   0.0122 0.0091      0.0111 40.66
Chenchu          0.0157   0.0108 0.0088      0.0115 40.76
Bhil             0.0149   0.0115 0.0086      0.0124 42.96
Satnami          0.0166   0.0125 0.0091      0.0126 43.06
Kurumba          0.0156   0.0117 0.0095      0.0121 43.26
Kamsali          0.0139   0.0105 0.0088      0.0098 44.56
Vysya            0.0130   0.0099 0.0083      0.0102 46.26
Lodi             0.0143   0.0124 0.0092      0.0125 49.96
Naidu            0.0138   0.0104 0.0092      0.0108 50.16
Tharu            0.0150   0.0112 0.0095      0.0118 51.06
Velama           0.0126   0.0107 0.0083      0.0095 54.76
Srivastava       0.0144   0.0124 0.0091      0.0116 56.46
Meghawal         0.0131   0.0107 0.0088      0.0117 60.36
Vaish            0.0143   0.0144 0.0099      0.0128 62.66
Kashmiri_Pandit  0.0119   0.0116 0.0090      0.0116 70.66
Sindhi           0.0106   0.0112 0.0095      0.0111 73.76
Pathan           0.0098   0.0114 0.0087      0.0106 76.96

For each Indian Cline population, I list the ANI percentage, as estimated by Reich et al. (2009) in the last column, and the D-statistic of the above given form for different pairs of Indian and European populations.

We can plot the D-statistic vs. ANI for each of our European populations:




The correlation coefficients confirm the visual impression, that for the HGDP Russians there is a significantly negative relationship between ANI admixture in an Indian Cline population and the D-statistic:

   Russian   Orcadian    French Lithuanians
-0.8631118 0.08670188 0.1870127  -0.1889908

In other words, the evidence for gene flow between Russians and Indians is maximized when south Indian (ASI-rich) populations are used.
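The Russian coefficient can be checked directly from the rounded values in the table above. The Python snippet below copies the ANI column and the Russian column and computes the Pearson correlation by hand; because the table values are rounded to four decimals, the result should be close to, but need not exactly equal, the -0.863 reported above. The variable and function names are my own.

```python
import math

# ANI percentages and D(Russian, Sardinian, Indian, San) values from the table above,
# in the same row order (Mala ... Pathan).
ani = [38.86, 40.66, 40.76, 42.96, 43.06, 43.26, 44.56, 46.26, 49.96,
       50.16, 51.06, 54.76, 56.46, 60.36, 62.66, 70.66, 73.76, 76.96]
d_russian = [0.0153, 0.0153, 0.0157, 0.0149, 0.0166, 0.0156, 0.0139, 0.0130, 0.0143,
             0.0138, 0.0150, 0.0126, 0.0144, 0.0131, 0.0143, 0.0119, 0.0106, 0.0098]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_russian = pearson(ani, d_russian)
print(round(r_russian, 3))
```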

The lack of a clear pattern in the other three populations is itself interesting. One possible explanation involves East Eurasian-like admixture in the ANI, a conjecture which would make sense, given that all non-Sardinian continental West Eurasians seem to possess it.

If that is true, then as we go "south" along the Indian Cline, ASI related admixture inflates the D-statistic by increasing the "red arrow" overlap with the East Eurasian-like admixture in Europeans. As we go "north" along this cline, then the D-statistic decreases, due to ASI-reduction, but also increases, due to East Eurasian-like admixture in ANI, with an end result of no clear pattern in the superposition of processes.

In any case, this is an interesting example of a crisscrossing type of admixture where unrelated processes (east Eurasian-like admixture in Russians and ASI admixture in Indians) combine to present an unusual effect.

October 14, 2012

Differential relationship of ANI to Caucasus populations

The observation in Reich et al. (2009) that Ancestral North Indians (ANI) and CEU (HapMap White Utahns) form a clade to the exclusion of Adygei (a NW Caucasian HGDP population) has always puzzled me, because in my ADMIXTURE experiments, the dominant West Eurasian component in South Asia has always been one centered in the Caucasus rather than Europe, an observation also confirmed by Metspalu et al. (2011).

I have now used the qpDstat program of ADMIXTOOLS to calculate some D-statistics using a wide variety of West Asian populations that have appeared in the literature since 2009 (mainly Behar et al. 2010, and Yunusbayev et al. 2011), in addition to the Adygei. This analysis is based on 87,925 SNPs. I have kept SNPs included in the Rutgers map for Illumina chips, since most of the datasets merged with the Reich et al. (2009) dataset were genotyped on such chips, and applied a --geno 0.01 flag after merging the various datasets.
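The --geno 0.01 flag mentioned above is a PLINK filter that drops SNPs whose missing call rate exceeds 1%. Its effect can be sketched in Python as follows; the data layout (a dictionary of per-SNP genotype calls with None for a missing call) and all names are hypothetical, and PLINK itself works on its own binary file formats.

```python
def filter_by_missingness(genotypes_by_snp, max_missing=0.01):
    """Keep only SNPs whose fraction of missing calls does not exceed max_missing."""
    kept = {}
    for snp_id, calls in genotypes_by_snp.items():
        if not calls:
            continue  # no samples at all: nothing to keep
        missing_rate = sum(1 for call in calls if call is None) / len(calls)
        if missing_rate <= max_missing:
            kept[snp_id] = calls
    return kept


# Toy data: 200 samples per SNP, with differing numbers of missing calls.
genotypes = {
    "snp_complete": ["AA"] * 200,
    "snp_one_missing": ["AG"] * 199 + [None],        # 0.5% missing: kept
    "snp_three_missing": ["GG"] * 197 + [None] * 3,  # 1.5% missing: dropped
}
kept = filter_by_missingness(genotypes, max_missing=0.01)
print(sorted(kept))
```

Applying the filter after merging matters because a SNP absent from one of the merged chips shows up as missing in every sample from that dataset, and is therefore removed.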

The following populations were considered:
North_Kannadi, Sindhi, Pathan, Kashmiri_Pandit, Brahmins_from_Uttar_Pradesh_M, Iyer_D, Iyengar_D, CEU30, Onge, Adygei, Lezgins, Georgians, Ukranians_Y, Abhkasians_Y, Chechens_Y, North_Ossetians_Y, Armenians_Y, Kurds_Y, Iranians_19, Romanians_14, Bulgarians_Y, Greek_D
I calculated D-statistics of the form:

D(CEU30, non-CEU West Eurasian; South Asian, Onge)

I report, for each South Asian population, the score for non-CEU West Eurasian being Adygei, and the most negative Z-score:


It is clear that, while CEU are more related to Indian cline populations than Adygei are, at least for the case of the Pathans, they are less related to them than Georgians are. The Georgian population is one of the modal populations of the West Asian autosomal component.

The full set of results can be found here. It appears that North Ossetians (who are also from the NW Caucasus) follow the Adygei pattern, while Abkhazians, Lezgins, and Armenians appear more related to ANI than CEU are, similar to the Georgian pattern.

Interestingly, D(CEU, Iranian; South Asian, Onge) appears positive, and this is probably not because CEU are more related to ANI than Iranians, but because Iranians also have ASI admixture.

Ukrainians do not appear more closely related to ANI than CEU are, rather the opposite. This is consistent with the recent f3-statistics analysis of South Indian Brahmins, in which the strongest signals of admixture involved populations from Western Europe, the Balkans, and West Asia, but not from eastern Europe.

All the available evidence suggests that ANI is most related to populations of the South and NE Caucasus, and not to those of the NW Caucasus like Adygei. To confirm this, I calculated some additional D-statistics (also included in the spreadsheet):


All in all, this seems to be very consistent with my working model of Eurasian prehistory. It is also in agreement with proposals for a genetic relationship between Indo-European and NE Caucasian/Hurrian and/or early contacts between it and Kartvelian. No such relationship, as far as I can tell, has been seriously advanced with respect to NW Caucasian languages.

A valuable lesson from this analysis is that now that multiple West Asian populations have been genotyped, caution must be exercised when using the HGDP Adygei, because they are clearly not representative of the different language families (NE/S Caucasian and Indo-European) present in West Asia. Surprises may lurk even at the sub-1000km scale in a region as diverse as the Caucasus.

October 05, 2012

D-statistics reveal contrast between Yoruba and San in "Neandertal ancestry"

I have been exploring the HGDP version released by Patterson et al. (2012) in order to see whether patterns of "archaic Eurasian" admixture could be detected in living Africans. In a previous experiment, I looked into a surprising link between Denisovans and Africans. Now, I want to investigate possible differences in Neandertal ancestry within Africa itself. An ASHG 2012 abstract suggests that both Neandertal and Denisovan ancestry may be relevant to the African story.

Previous research has concluded that living African groups do not appear to have substantial differences in their apportionment of archaic Eurasian ancestry. This has led to the reasonable idea that the signal of Neandertal admixture in non-Africans was driven by the encounter of Out-of-Africans with a Neandertal population in Asia, perhaps in the Near East, during their early steps outside Africa, involving a single or limited episodes of admixture, although more complex models may be needed as of late.

I have long suspected that part of this signal is due to population structure in Africa itself, and to the possibility of archaic admixture in that continent. This hypothesis is feasible a priori, given the geographical and ecological diversity of Africa and its large surface area, and it has also found support in recent palaeoanthropological and genetic research. In my opinion, the well-known abundance of polymorphism in Africans vis-à-vis non-Africans is not only due to the Out-of-Africa bottleneck, but may also be due to an addition of polymorphism via admixture with divergent native African hominin groups.

Advancing a good case for this admixture is rendered difficult by two factors:

  1. The inability of methods relying on linkage disequilibrium to operate on old admixture events, due to the exponential decay of LD over time, which renders archaic-introgressed segments pitifully small at long time scales.
  2. The high temperatures prevalent in sub-Saharan Africa which render DNA preservation problematic, although, to be honest, I have not even seen many attempts to test this hypothesis on whatever prehistoric skeletal remains there do exist from the region.
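The first difficulty can be made concrete with a rough sketch. It assumes the standard approximation used by LD-based dating methods, in which admixture LD between two markers d Morgans apart decays as exp(-t*d) after t generations, and the common rule of thumb that the mean introgressed tract is about 1/t Morgans long. The generation counts below are illustrative values only, not estimates from this post.

```python
import math

def admixture_ld_remaining(generations, distance_morgans):
    """Fraction of the initial admixture LD left between two markers."""
    return math.exp(-generations * distance_morgans)

def mean_tract_length_cm(generations):
    """Approximate mean introgressed tract length, in centimorgans."""
    return 100.0 / generations

# Illustrative (assumed) admixture ages, from very recent to very old.
for t in (10, 100, 1000, 10000):
    ld = admixture_ld_remaining(t, 0.01)  # markers 1 cM apart
    tract = mean_tract_length_cm(t)
    print(f"{t:>6} generations: LD left at 1 cM = {ld:.3g}, mean tract ~ {tract:.3g} cM")
```

Under these assumptions, both the LD signal and the tract length shrink rapidly as the admixture event gets older, which is why very old events are hard to detect with LD-based methods.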
Why do African groups appear so little different in terms of possible "Neandertal admixture"? I conjecture that the answer lies in the idea that archaic African admixture will tend to even out the signal of Neandertal admixture. To use a geographical analogy, there is little distance difference (in relative terms) between Tokyo and Beijing from the vantage point of New York, but quite a lot from the vantage point of Hong Kong. Tests of archaic admixture rely on relative allele sharing between individuals or populations; consequently, the signal may be muddied by the occurrence of archaic admixture in Africans which, to use our geographical analogy, transposes them from Hong Kong to New York.

Now, consider the Z scores of the D-statistic of the form D(African1, African2, Neander, Outgroup) calculated using different panels and Outgroup being Chimp, Gorilla, or Orang. The raw numbers can be found in this spreadsheet.

Look at the Pearson correlations between the different panels:


While the Z-scores in most of the panels are strongly correlated with each other, the San panel #4 is strongly anti-correlated. An inspection of the raw numbers shows why this is the case. For example:


Surprisingly, the San appear more Neandertal-admixed than the Yoruba under all the Eurasian ascertainments and the Yoruba ascertainment, but less so under the San ascertainment!
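For readers who want to repeat the panel-to-panel comparison on the spreadsheet, here is a minimal sketch of the Pearson correlation step. The panel names and Z-scores below are made up purely to illustrate a sign-flipped panel; they are not the values from the spreadsheet.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical Z-scores of the same four D-statistics under three panels.
z_by_panel = {
    "panel_A": [2.1, -0.5, 1.3, 3.0],
    "panel_B": [1.8, -0.2, 1.0, 2.6],
    "panel_C": [-1.9, 0.6, -1.1, -2.8],  # sign-flipped relative to A and B
}

names = list(z_by_panel)
corr = {(a, b): pearson(z_by_panel[a], z_by_panel[b]) for a in names for b in names}

for a in names:
    print(a, " ".join(f"{corr[(a, b)]:+.3f}" for b in names))
```

A panel whose Z-scores consistently carry the opposite sign shows up as a row of strongly negative correlations against the others.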

A possible explanation for this pattern involves Eurasian back-migration into Africa combined with differential archaic African admixture.

The San may possess Eurasian ancestry, consistent with the positive D(San, Yoruba, Neander, Chimp) statistics for all panels except their own; the negative statistic for their own panel is due to their archaic African ancestry, which makes them less like Neandertals.

I conjecture that different archaic populations have contributed polymorphism to different African populations.

This question can be addressed empirically on the basis of whole genome sequence data. The Out-of-Africa bottleneck hypothesis suggests that reduced polymorphism in non-Africans is due to loss of variation as a limited number of founders exited Africa, carrying a subset of African variation. If Africans are descended primarily from the modern human groups left behind, then they will all carry the same "missing variation" set not found in Eurasians.

On the other hand, if, as I suggest, modern human groups encountered and admixed with different divergent African hominins, then different African populations will carry substantially disjoint sets of variants, reflecting deep population structure within Africa itself. Time will tell whether this prediction will prove to be true.

September 29, 2012

More on the surprising link between Africans and Denisovans

In a previous post, I showed that there is an unexpected link between Africans and Denisovans. Papuans appeared more "Denisovan" than other populations irrespective of SNP subset used, but Africans appeared more "Denisovan" than Eurasians for both a subset of SNPs polymorphic in Eurasians and monomorphic in Africans, as well as a subset of SNPs polymorphic in all 5 major populations.

In the current post, I explore this issue further by using the SNP ascertainment panels released by Patterson et al. (2012). In particular, I use panel #3, which involves 48,531 SNPs ascertained in a Papuan individual.

 MbutiPygmy French ; Denisova Chimp 0.0234 2.597
 Yoruba French ; Denisova Chimp 0.0303 3.970
 San French ; Denisova Chimp 0.0334 3.475
 BantuKenya French ; Denisova Chimp 0.0206 2.625
 MbutiPygmy Sardinian ; Denisova Chimp 0.0224 2.413
 Yoruba Sardinian ; Denisova Chimp 0.0293 3.683
 San Sardinian ; Denisova Chimp 0.0324 3.319
 BantuKenya Sardinian ; Denisova Chimp 0.0196 2.399
 MbutiPygmy Dai ; Denisova Chimp 0.0331 3.347
 Yoruba Dai ; Denisova Chimp 0.0401 4.694
 San Dai ; Denisova Chimp 0.0428 4.265
 BantuKenya Dai ; Denisova Chimp 0.0307 3.524
 MbutiPygmy Japanese ; Denisova Chimp 0.0366 3.833
 Yoruba Japanese ; Denisova Chimp 0.0436 5.256
 San Japanese ; Denisova Chimp 0.0463 4.724
 BantuKenya Japanese ; Denisova Chimp 0.0342 4.070
 MbutiPygmy Karitiana ; Denisova Chimp 0.0187 1.611
 Yoruba Karitiana ; Denisova Chimp 0.0252 2.320
 San Karitiana ; Denisova Chimp 0.0287 2.457
 BantuKenya Karitiana ; Denisova Chimp 0.0158 1.461
 MbutiPygmy Surui ; Denisova Chimp 0.0303 2.486
 Yoruba Surui ; Denisova Chimp 0.0368 3.260
 San Surui ; Denisova Chimp 0.0398 3.265
 BantuKenya Surui ; Denisova Chimp 0.0275 2.429

All of these are positive, and many of them are significant with a Z-score greater than 3. Africans appear more "Denisovan" than West/East Eurasians and Amerindians using this panel. So, perhaps, this is another indication of the "surprising link" I discovered in my previous post.
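To check the tally, here is a small sketch that parses the rows listed above and counts how many exceed |Z| > 3. It assumes the last two fields on each line are D and Z, in that order.

```python
# Rows copied from the panel #3 results above; the last two fields are
# assumed to be the D-statistic and its Z-score.
ROWS = """\
MbutiPygmy French ; Denisova Chimp 0.0234 2.597
Yoruba French ; Denisova Chimp 0.0303 3.970
San French ; Denisova Chimp 0.0334 3.475
BantuKenya French ; Denisova Chimp 0.0206 2.625
MbutiPygmy Sardinian ; Denisova Chimp 0.0224 2.413
Yoruba Sardinian ; Denisova Chimp 0.0293 3.683
San Sardinian ; Denisova Chimp 0.0324 3.319
BantuKenya Sardinian ; Denisova Chimp 0.0196 2.399
MbutiPygmy Dai ; Denisova Chimp 0.0331 3.347
Yoruba Dai ; Denisova Chimp 0.0401 4.694
San Dai ; Denisova Chimp 0.0428 4.265
BantuKenya Dai ; Denisova Chimp 0.0307 3.524
MbutiPygmy Japanese ; Denisova Chimp 0.0366 3.833
Yoruba Japanese ; Denisova Chimp 0.0436 5.256
San Japanese ; Denisova Chimp 0.0463 4.724
BantuKenya Japanese ; Denisova Chimp 0.0342 4.070
MbutiPygmy Karitiana ; Denisova Chimp 0.0187 1.611
Yoruba Karitiana ; Denisova Chimp 0.0252 2.320
San Karitiana ; Denisova Chimp 0.0287 2.457
BantuKenya Karitiana ; Denisova Chimp 0.0158 1.461
MbutiPygmy Surui ; Denisova Chimp 0.0303 2.486
Yoruba Surui ; Denisova Chimp 0.0368 3.260
San Surui ; Denisova Chimp 0.0398 3.265
BantuKenya Surui ; Denisova Chimp 0.0275 2.429
"""

records = []
for line in ROWS.splitlines():
    fields = line.split()
    records.append((fields[0], fields[1], float(fields[-2]), float(fields[-1])))

significant = [r for r in records if abs(r[3]) > 3]
print(f"{len(significant)} of {len(records)} comparisons have |Z| > 3")
print("all D positive:", all(d > 0 for _, _, d, _ in records))
```

On these rows, 14 of the 24 comparisons pass the |Z| > 3 threshold, and every D value is positive.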

This link may have been overlooked in previous analyses which found that Africans are less Denisovan than all Eurasian groups. But, as I argue in my previous post, this is potentially due to introgression of archaic African alleles into living Sub-Saharan Africans which shifted them away from Denisovans. So, the African story may involve admixture between a population somehow related to Denisovans (whether due to an early Out-of-Africa that affected them, or due to an Into-Africa event), and divergent native Palaeoafrican populations.

It would be worthwhile to follow up on these observations using the high-quality Denisovan genome recently published, to see how they might hold up.

September 27, 2012

A surprising link between Africans and Denisovans

I took the following populations from the version of the HGDP released by Patterson et al. (2012). I use the _AHOA suffix (Affymetrix Human Origins Array) to distinguish them from other versions of the same populations:
  • MbutiPygmy_AHOA 11
  • Italian_AHOA    11
  • Miao_AHOA       10
  • Papuan_AHOA     12
  • Karitiana_AHOA  8
I identified the following SNP subsets:
  • AFRICA: 67022 SNPs that were polymorphic in MbutiPygmy and monomorphic in the other populations
  • EURASIA: 94858 SNPs that were polymorphic in at least one non-African population and monomorphic in MbutiPygmy
  • AFRICA_EURASIA: 367051 SNPs that were polymorphic in both MbutiPygmy and at least one non-African population
  • ALL: 528931 SNPs that were polymorphic in at least one population
  • GLOBAL: 168640 SNPs that were polymorphic in all 5 populations
Note that the union of AFRICA, EURASIA, and AFRICA_EURASIA is the ALL set.
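The subset definitions can be sketched as a small classifier. This is only an illustration: it assumes "polymorphic" means that both alleles are observed within a population, and the `classify` helper and its input format are hypothetical. The counts are the ones listed above.

```python
NON_AFRICAN = ("Italian", "Miao", "Papuan", "Karitiana")

def classify(poly):
    """Return the subset labels for one SNP.

    poly maps a population name to True if the SNP is polymorphic there.
    """
    african = poly["MbutiPygmy"]
    non_african = any(poly[p] for p in NON_AFRICAN)
    labels = set()
    if african and not non_african:
        labels.add("AFRICA")
    if non_african and not african:
        labels.add("EURASIA")
    if african and non_african:
        labels.add("AFRICA_EURASIA")
    if african or non_african:
        labels.add("ALL")
    if african and all(poly[p] for p in NON_AFRICAN):
        labels.add("GLOBAL")
    return labels

# SNP counts as reported in the list above.
counts = {
    "AFRICA": 67022,
    "EURASIA": 94858,
    "AFRICA_EURASIA": 367051,
    "ALL": 528931,
    "GLOBAL": 168640,
}

# The three disjoint subsets should add up to ALL.
total = counts["AFRICA"] + counts["EURASIA"] + counts["AFRICA_EURASIA"]
print(total, total == counts["ALL"])
```

The three disjoint counts do sum to 528931, and GLOBAL is by construction a subset of AFRICA_EURASIA.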

Here is a Venn diagram of SNP sharing:


I then calculated all D-statistics of the following form:

D(Pop1, Pop2; Archaic, Chimp)

where Archaic is either Neandertal or Denisova, and Pop1, Pop2 is any possible pair of the modern populations. These D-statistics were calculated for all 5 SNP subsets.
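For readers unfamiliar with the statistic, here is a minimal sketch of a frequency-based D in the form given by Patterson et al. (2012), with a delete-one-block jackknife for the Z-score. It is a simplification, not the qpDstat implementation: blocks here are equal-sized runs of SNPs rather than genetic-distance blocks, the jackknife is unweighted, and the allele frequencies are simulated toy data.

```python
import math
import random

def d_statistic(sites):
    """D over sites given as (w, x, y, z) allele frequencies for
    Pop1, Pop2, Archaic, Outgroup."""
    num = sum((w - x) * (y - z) for w, x, y, z in sites)
    den = sum((w + x - 2 * w * x) * (y + z - 2 * y * z) for w, x, y, z in sites)
    return num / den

def jackknife_z(sites, n_blocks):
    """Return (D, Z) using an unweighted delete-one-block jackknife."""
    size = len(sites) // n_blocks
    blocks = [sites[i * size:(i + 1) * size] for i in range(n_blocks)]
    full = d_statistic([s for b in blocks for s in b])
    leave_one_out = [
        d_statistic([s for j, b in enumerate(blocks) if j != i for s in b])
        for i in range(n_blocks)
    ]
    mean = sum(leave_one_out) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((d - mean) ** 2 for d in leave_one_out)
    return full, full / math.sqrt(var)

# Toy data: Pop1 is nudged toward the archaic allele wherever the archaic
# genome carries the derived state (y = 1); the outgroup is ancestral (z = 0).
rng = random.Random(1)
sites = []
for _ in range(2000):
    x = rng.uniform(0.1, 0.8)
    y = float(rng.choice([0, 1]))
    w = x + 0.1 * y + rng.uniform(-0.05, 0.05)
    sites.append((w, x, y, 0.0))

d_toy, z_toy = jackknife_z(sites, 20)
print(f"D = {d_toy:.4f}, Z = {z_toy:.2f}")
```

With this sign convention, a positive D means Pop1 shares more alleles with the archaic genome than Pop2 does, and a negative D means the reverse.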

Below, you can find all D-statistics, followed by their Z-scores:

Pop1 Pop2 Archaic Chimp D-AFRICA D-EURASIA D-AFRICA_EURASIA D-ALL D-GLOBAL Z-AFRICA Z-EURASIA Z-AFRICA_EURASIA Z-ALL Z-GLOBAL
Italian_AHOA MbutiPygmy_AHOA Neander_AHOA Chimp_AHOA 0.3357 0.1231 0.0038 0.0297 -0.0041 19.575 6.041 1.099 8.052 -0.882
Italian_AHOA Miao_AHOA Neander_AHOA Chimp_AHOA 0 -0.0393 -9e-04 -0.0044 8e-04 0 -3.271 -0.251 -1.161 0.184
Italian_AHOA Karitiana_AHOA Neander_AHOA Chimp_AHOA 0 -0.0243 3e-04 -0.0019 0.0067 0 -1.765 0.063 -0.43 1.406
Italian_AHOA Papuan_AHOA Neander_AHOA Chimp_AHOA 0 -0.0348 -0.0088 -0.0113 -0.0017 0 -2.212 -2.027 -2.335 -0.341
MbutiPygmy_AHOA Miao_AHOA Neander_AHOA Chimp_AHOA -0.3357 -0.1646 -0.0046 -0.0331 0.0048 -19.575 -7.399 -1.208 -7.949 0.998
MbutiPygmy_AHOA Karitiana_AHOA Neander_AHOA Chimp_AHOA -0.3357 -0.147 -0.0036 -0.0311 0.0104 -19.575 -6.089 -0.821 -6.852 2.013
MbutiPygmy_AHOA Papuan_AHOA Neander_AHOA Chimp_AHOA -0.3357 -0.1467 -0.0114 -0.0387 0.0024 -19.575 -6.381 -2.661 -7.947 0.45
Miao_AHOA Karitiana_AHOA Neander_AHOA Chimp_AHOA 0 0.0165 0.0013 0.0027 0.0061 0 1.263 0.296 0.63 1.291
Miao_AHOA Papuan_AHOA Neander_AHOA Chimp_AHOA 0 8e-04 -0.0083 -0.0074 -0.0025 0 0.05 -1.987 -1.631 -0.532
Karitiana_AHOA Papuan_AHOA Neander_AHOA Chimp_AHOA 0 -0.0136 -0.0094 -0.0099 -0.0083 0 -0.804 -1.861 -1.853 -1.568
Italian_AHOA MbutiPygmy_AHOA Denisova_AHOA Chimp_AHOA 0.2354 -0.1057 -0.0102 -0.0014 -0.0147 13.01 -5.396 -2.806 -0.378 -3.156
Italian_AHOA Miao_AHOA Denisova_AHOA Chimp_AHOA 0 -0.0131 0.0023 0.0012 0.0058 0 -1.193 0.713 0.369 1.452
Italian_AHOA Karitiana_AHOA Denisova_AHOA Chimp_AHOA 0 -0.0046 -0.0024 -0.0025 -0.0012 0 -0.364 -0.563 -0.623 -0.255
Italian_AHOA Papuan_AHOA Denisova_AHOA Chimp_AHOA 0 -0.1248 -0.0349 -0.0421 -0.0193 0 -8.966 -7.553 -9.218 -3.71
MbutiPygmy_AHOA Miao_AHOA Denisova_AHOA Chimp_AHOA -0.2354 0.0808 0.0121 0.0023 0.0199 -13.01 3.832 3.24 0.617 4.251
MbutiPygmy_AHOA Karitiana_AHOA Denisova_AHOA Chimp_AHOA -0.2354 0.0939 0.0082 -7e-04 0.013 -13.01 4.085 1.853 -0.16 2.596
MbutiPygmy_AHOA Papuan_AHOA Denisova_AHOA Chimp_AHOA -0.2354 -0.0781 -0.0201 -0.0343 -0.0042 -13.01 -3.458 -4.554 -7.527 -0.795
Miao_AHOA Karitiana_AHOA Denisova_AHOA Chimp_AHOA 0 0.0092 -0.0051 -0.004 -0.0069 0 0.718 -1.221 -0.996 -1.503
Miao_AHOA Papuan_AHOA Denisova_AHOA Chimp_AHOA 0 -0.1158 -0.0391 -0.0455 -0.0253 0 -8.253 -8.608 -10.045 -5.24
Karitiana_AHOA Papuan_AHOA Denisova_AHOA Chimp_AHOA 0 -0.1242 -0.034 -0.0414 -0.0176 0 -7.973 -6.447 -7.892 -3.195

Some brief observations, before we get to the "main course" of this post:
  • Eurasians appear substantially Neandertal/Denisovan-admixed when SNPs polymorphic in Africans and monomorphic in Eurasians are used. I can think of no other explanation than archaic African admixture for this finding.
  • Papuans appear Denisovan-admixed across the board. 
  • For the GLOBAL set, population differences in Neandertal admixture are all non-significant. Given that the GLOBAL set includes SNPs likely to have existed in the ancestral modern humans, this indicates a fairly symmetrical relationship of these populations to Neandertals.
The most unexpected finding is, doubtless, the evidence that Africans have more Denisovan ancestry than all Eurasians (except Papuans) when SNPs polymorphic in non-Africans and monomorphic in Africans are used (EURASIA panel). I highlight some comparisons:

First of all, clearly Papuans have a special relationship with Denisovans compared to all the remaining 4 populations:

Italian_AHOA Papuan_AHOA ; Denisova_AHOA Chimp_AHOA -0.1248 -8.966
MbutiPygmy_AHOA Papuan_AHOA ; Denisova_AHOA Chimp_AHOA -0.0781 -3.458
Miao_AHOA Papuan_AHOA ; Denisova_AHOA Chimp_AHOA -0.1158 -8.253
Karitiana_AHOA Papuan_AHOA ; Denisova_AHOA Chimp_AHOA -0.1242 -7.973

But, look at this:

Italian_AHOA MbutiPygmy_AHOA ; Denisova_AHOA Chimp_AHOA -0.1057 -5.396
MbutiPygmy_AHOA Miao_AHOA ; Denisova_AHOA Chimp_AHOA 0.0808 3.832
MbutiPygmy_AHOA Karitiana_AHOA ; Denisova_AHOA Chimp_AHOA 0.0939 4.085

That's right: Mbuti Pygmies are actually closer to Denisovans than Eurasians are over the subset of SNPs that are polymorphic in Eurasians and monomorphic in the Mbuti.

I do not quite know what to make of this surprising signal. I can think of two explanations:
  1. An early Out-of-Africa movement that affected "Denisovans" and Papuans but not other Eurasians. Living Africans are pulled away from Denisovans because of their archaic African ancestry and towards them because of contributions from their ancestors to the Denisovan population. Hence, they appear less Denisovan-like in African-polymorphic sites (where there is an excess of archaic admixture in Africans) and more Denisovan-like in African-monomorphic sites.
  2. An Into-Africa movement of a population related to Denisovans, a kind of "reverse bottleneck" where a subset of Denisova-like variation entered Africa, hence leaving Eurasians polymorphic and Africans monomorphic.
I would like to stress that these results do not really depend on the choice of the MbutiPygmy population. I have also seen them when I carried out similar experiments using Yoruba and Mandenka.

I often get the feeling that the problem of human origins as it stands is one of too little data for too many variables. But, I am more or less convinced that admixture between very divergent populations of Homo heidelbergensis played a major role in shaping modern humans. 

UPDATE (29 Sep): I continue the investigation of this link in a new post.