Showing posts with label ADMIXTURE-experiments. Show all posts

March 01, 2015

Two observations on the ancestry of Armenians

I was thinking a bit about how to interpret the findings of the new Haber et al. preprint, and especially the idea that "29% of the Armenian ancestry may originate from an ancestral population best represented by Neolithic Europeans." I looked at the globe13 proportions and, strangely enough, I had estimated that the three Armenian samples (Armenian_D, Armenians, and Armenians_15_Y) have 28-29% of the Mediterranean component that is modal in Sardinians. This seems like a curious coincidence, which has raised my confidence that Haber et al. are picking up something real.

Looking back at my inferences of Armenian ancestry, it seems (according to globe13) to come completely from West_Asian, Mediterranean, and Southwest_Asian. The Mediterranean component seems real enough as it seems to match Sardinians/early European farmers well. I am not so sure about the Southwest Asian component which is modal in Yemen Jews and may represent population-specific drift in relatively recent Arabians. The West_Asian component is bimodal in Caucasus and Gedrosia, so it can't be the result of a very drifted population in either region (unless there is spooky action at a distance). 

Another curious finding is the lack of North_European in a latitudinal "column" of populations from Yemen, through the Levant, to the South Caucasus (Georgians and Armenians). It seems that North_European is the only one of the four major Caucasoid components that Armenians lack to any important degree. There is a rather abrupt change between the South Caucasus (~1%) and the North Caucasus (15-20%). It seems that the Greater Caucasus did act like a barrier to gene flow. The K=4 analysis of the same dataset that produced K=13 (globe13) also shows the same barrier: all three Armenian samples and Georgians have ~0% of "Amerindian" (which is surely correlated with "Ancient North Eurasian" ancestry and, via it, with North_European), but North Caucasians and Europeans have 4-10%. It's clear that this influence did not cross the Greater Caucasus, as Armenians and Georgians lack it.

December 02, 2012

D-statistics on ADMIXTURE components

One of the most persistent questions I get as admin of the Dodecad Project is whether some low level of admixture (e.g., 0.7%) of some ancestral component is "noise" or "real".

I have hitherto advised all those who contacted me about this issue (i) to treat low levels of admixture with suspicion, and (ii) to run DIYDodecad in byseg mode; this might show whether this type of admixture is concentrated in some specific long segments, and is thus more likely to be "real" recent ancestry than low-level noise sprinkled across the genome that is more difficult to interpret.

Nonetheless, this was always unsatisfying to me, because it did not provide a way of quantifying one's confidence in the "reality" of the admixture evidence. Thus, I developed admixtureDstat.r, an R script which calculates D-statistics of the form:

D(Pop1, Individual; Pop3, Outgroup)

If the individual can be seen as being drawn from population Pop1 but with some admixture from population Pop3, then this statistic will take significant negative values. For example, suppose that your main admixture component is "North_European", but you also have 1% "Siberian" admixture. You would want to calculate the following statistic:

D(North_European, YOU; Siberian, Palaeo_African)

which would tell you whether the Siberian admixture is "real" or not. (Of course, things are more complicated for those who might have both Siberian and African admixture, in which case their Siberian admixture would tend to make the D-statistic negative, and the African one positive, with the end result being a balance of the two processes).
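As a rough illustration of what such a statistic measures, here is a minimal sketch in Python of the allele-frequency form of the D-statistic described in Patterson et al. (2012). It is not the code of admixtureDstat.r: the function name d_stat and the toy frequencies are made up for illustration, and an individual's "frequency" at a SNP is assumed to be the genotype count divided by two.

```python
def d_stat(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) from per-SNP allele frequencies (equal-length lists).

    Numerator:   sum of (p1 - p2) * (p3 - p4)
    Denominator: sum of (p1 + p2 - 2*p1*p2) * (p3 + p4 - 2*p3*p4)
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        num += (a - b) * (c - d)
        den += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    return num / den


# Toy values (made up): "you" are shifted from Pop1 towards Pop3.
pop1 = [0.50, 0.20, 0.70]
you = [1.00, 0.50, 1.00]       # genotype counts 2, 1, 2 divided by two
pop3 = [0.90, 0.80, 0.95]
outgroup = [0.10, 0.10, 0.20]

d_example = d_stat(pop1, you, pop3, outgroup)
print(d_example)  # negative: "you" share extra alleles with Pop3 relative to Pop1
```

With only three SNPs the value means nothing by itself; the sign convention is the point. Real use needs many SNPs and a standard error.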

There are of course many subtleties in the interpretation of D-statistics and I refer you to Green et al. (2010), Durand et al. (2011), and Patterson et al. (2012) for some of the technical details.

Using the script is quite simple, and only requires that you have R installed on your computer:
  • download standardize.r and admixtureDstat.r from here, saving them into some directory in your computer (henceforth, we will call this the "working directory"). If you have Genographic 2.0 data, you should also download hgdp.base.txt.
  • unzip your raw genotype data (from 23andMe, Family Finder, or Genographic 2.0) into the working directory. 
  • launch R and change the directory into the working directory (using the Menu in Windows, or setwd() in Unix-like operating systems). Enter in one line:
 source('admixtureDstat.r'); source('standardize.r')
  • In R, enter the command:

standardize('johndoe.txt', company='23andMe') 

  • The above command will convert your data into a format understood by my script, writing a genotype.txt file in the working directory. You should change johndoe.txt to whatever your unzipped raw data file is called, and the company should be one of '23andMe', 'ftdna', 'geno2', or 'geno2new', depending on the source of your data. If you have used DIYDodecad before, you have already created a genotype.txt file, so you can skip this step.
  • Finally, you should have the four calculator files (with endings .par, .txt, .alleles, and .F) in the working directory. You can, for example, use the calculator files of the globe13, or if you have experience working with ADMIXTURE, you may make your own using your dataset. The .txt file will contain the names of the ancestral populations that you can use, so make sure you type them correctly if you decide to choose "listfile" mode (see below).
  • You are now all set to use the script! You can do this in either of two ways:
(1) outgroup mode:

In this mode, you specify an outgroup, i.e., one of the populations from the calculator, and the program cycles through all possible (Pop1, Pop3) pairs, outputs the D-statistics to the screen as it calculates them, and finally writes them to a dstat.txt file in the working directory.

To use this mode, you simply type:

admixtureDstat(parfile="globe13.par", outgroup="Palaeo_African")

The use of "Palaeo_African" as an outgroup is a reasonable choice for most non-Africans, since these are unlikely to have recent admixture from Sub-Saharan hunter-gatherer groups in which this component is represented.

Note that many of the D-statistics produced this way may have little meaning for you. For example, a person that is mostly European will get a very negative statistic of the form:

D(West_African, YOU; East_Asian, Palaeo_African)

But this will have little to do with your potential West_African or East_Asian ancestry, but rather with the relationships of populations (e.g., Europeans being more closely related to East Asians than West Africans). A little West_African/East_Asian ancestry will increase/decrease the value of this statistic, which will, however, remain strongly negative.

Instead, you should look at D-statistics that might be meaningful to you, e.g., if the following is negative:

D(North_European, YOU; East_Asian, Palaeo_African)

then you might have some real East_Asian admixture.

(2) listfile mode:

In this mode, you list the D-statistics you are interested in, one per line, in a simple text file, e.g., listDstat.txt, in the order Pop1, Pop3, Outgroup, e.g.:

Mediterranean North_European Palaeo_African
North_European Siberian Palaeo_African
North_European West_Asian Palaeo_African

A reasonable choice is to calculate D-statistics where Pop1 is your most important component, e.g., North_European for someone from Finland, and Pop3 is a minor component whose "reality" you seek to investigate, e.g., Siberian.

Using the listfile mode will take less time (because you calculate a subset of D-statistics), and can be invoked as follows:

admixtureDstat(parfile="globe13.par", listfile="listDstat.txt")
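If you would rather not type the list by hand, the following Python sketch builds the lines from a table of admixture proportions, pairing each major component (as Pop1) with each minor one (as Pop3). The helper listfile_lines and the 10% cutoff separating "major" from "minor" are hypothetical choices for illustration; the proportions are those of DOD133 from the example later in this post.

```python
def listfile_lines(proportions, outgroup, major_cutoff=10.0):
    """Return 'Pop1 Pop3 Outgroup' lines: every major component vs. every minor one.

    proportions: dict mapping component name -> percentage.
    major_cutoff: hypothetical threshold (in %) separating major from minor.
    """
    ranked = sorted(proportions.items(), key=lambda item: item[1], reverse=True)
    ranked = [(name, pct) for name, pct in ranked if name != outgroup]
    majors = [name for name, pct in ranked if pct >= major_cutoff]
    minors = [name for name, pct in ranked if 0 < pct < major_cutoff]
    return [f"{pop1} {pop3} {outgroup}" for pop1 in majors for pop3 in minors]


dod133 = {
    "Mediterranean": 52.5,
    "North_European": 42.0,
    "South_Asian": 1.7,
    "Southwest_Asian": 1.5,
    "Amerindian": 1.1,
    "Arctic": 0.3,
}

lines = listfile_lines(dod133, "Palaeo_African")
print("\n".join(lines))  # paste the output into listDstat.txt
```

The component names must match those in the calculator's .txt file exactly.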

Z-scores

The significance of D-statistics is assessed by the Z-scores, which are the last column of the output. A D-statistic is significant if its Z-score is greater than 3 in absolute value (i.e., less than -3 or greater than 3).

Other details: 

There are some additional options you might use. For example

   admixtureDstat(parfile="globe13.par", listfile="listDstat.txt", k=1000)

will use 1,000 SNPs for the block jackknife instead of the default 500. In general, there is little reason to mess with this parameter.
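For readers curious about where the Z-scores come from, here is a minimal Python sketch of a delete-one-block jackknife for a ratio statistic such as D. It assumes that k is the number of consecutive SNPs per block and that leftover SNPs after the last full block are simply ignored; I am not claiming this mirrors admixtureDstat.r line for line, and the simulated per-SNP terms are made up.

```python
import math
import random


def block_jackknife(num, den, k=500):
    """Delete-one-block jackknife for the ratio sum(num) / sum(den).

    num, den: per-SNP numerator and denominator terms, in genome order.
    k: number of consecutive SNPs per block (assumed meaning of k).
    Returns (estimate, standard_error, z_score).
    """
    n_blocks = len(num) // k
    if n_blocks < 2:
        raise ValueError("need at least two full blocks")
    # Simplification: SNPs after the last full block are ignored.
    block_num = [sum(num[b * k:(b + 1) * k]) for b in range(n_blocks)]
    block_den = [sum(den[b * k:(b + 1) * k]) for b in range(n_blocks)]
    total_num = sum(block_num)
    total_den = sum(block_den)
    estimate = total_num / total_den

    # Recompute the statistic with each block left out in turn.
    leave_one_out = [
        (total_num - bn) / (total_den - bd)
        for bn, bd in zip(block_num, block_den)
    ]
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - mean_loo) ** 2 for x in leave_one_out
    )
    se = math.sqrt(variance)
    return estimate, se, estimate / se


# Made-up per-SNP terms with a slightly negative mean, for illustration only.
rng = random.Random(1)
toy_num = [rng.gauss(-0.02, 0.05) for _ in range(5000)]
toy_den = [1.0] * 5000

est, se, z = block_jackknife(toy_num, toy_den, k=500)
print(est, se, z)
```

Blocks are used instead of single SNPs because neighbouring SNPs are correlated; leaving out whole blocks keeps the standard error from being underestimated.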

The screen output might be too wide for your R window, and you can fix this prior to running admixtureDstat by entering something like options(width=300) which allows more characters per line of screen output. In any case, you can see the program's output nicely formatted in the dstat.txt file in the working directory after it completes its run.

AN EXAMPLE

I will give an example of program usage using globe13 results. Take individual DOD133 whose results are seen below:



This individual is mostly Mediterranean (52.5%) and North_European (42%), but with small percentages of Amerindian (1.1%), Southwest_Asian (1.5%), Arctic (0.3%), and South_Asian (1.7%).

First, I calculate D(Mediterranean, DOD133; North_European, Palaeo_African) and D(North_European, DOD133; Mediterranean, Palaeo_African) to confirm the major admixture between Mediterranean and North_European. In listfile mode, I put the following in the listDstat.txt file:

Mediterranean North_European Palaeo_African
North_European Mediterranean Palaeo_African

The results are as follows:
Pop1 Pop3 Outgroup Dstat Z
Mediterranean North_European Palaeo_African -0.02399 -11.2
North_European Mediterranean Palaeo_African -0.033 -15.06


Ok, this confirms that DOD133 does indeed appear to be a mixture of North_European and Mediterranean. Now, let's take one of the minor components, e.g., South_Asian, and put the following in the listDstat.txt file:


Mediterranean South_Asian Palaeo_African
North_European South_Asian Palaeo_African


The results are now:

Pop1 Pop3 Outgroup Dstat Z
Mediterranean South_Asian Palaeo_African -0.01268 -5.98
North_European South_Asian Palaeo_African 0.00202 0.96

A possible interpretation for this pattern is that the individual does have some South_Asian-like admixture that is lacking in his Mediterranean component. Perhaps this reflects an ancient Central Asian population that migrated into both northern Europe and south Asia; some alleles from this population were incorporated into the Northern European gene pool, thus becoming part of what it means to be "northern European", so the evidence for admixture does not exist in the {North_European, South_Asian} pair, since both of these contain gene flow from our hypothetical Central Asian population. There are many ways to interpret the observed patterns, and using admixtureDstat you can explore some of them.

Now, let's take another minor component, Arctic (0.3%):

Pop1 Pop3 Outgroup Dstat Z
Mediterranean Arctic Palaeo_African -0.02413 -9.48
North_European Arctic Palaeo_African 0.01028 3.95

This is an interesting pattern; the individual appears admixed with Arctic relative to Mediterranean, but North_European appears to be more Arctic than DOD133. A possible explanation is that this Arctic component represents ancestry that was mediated by a north European population; as Patterson et al. (2012) have shown, northern Europeans contain some "north Eurasian" ancestry.

Finally, let's take the Southwest_Asian minor component (1.5%), where the reverse situation applies:
Pop1 Pop3 Outgroup Dstat Z
Mediterranean Southwest_Asian Palaeo_African 0.00496 2.4
North_European Southwest_Asian Palaeo_African -0.01075 -5.05

So, in this case, this might represent ancestry common between Mediterranean and Southwest_Asian that contrasts with the North_European portion of the individual's genome.

I won't pretend that interpreting D-statistics is easy, but they are certainly a nice exploratory tool to have in one's arsenal, and I hope that they will prove useful.

TERMS OF USE: You are free to use and modify this tool for any non-commercial purpose, as long as you provide a link to Dienekes' Anthropology Blog or this blog post when you do so. You should probably also cite one of the aforementioned papers where D-statistics were discussed, as well as the ADMIXTURE paper.

UPDATE (Dec 3): You might want to try D-statistics using globe13anc, a new calculator that includes an Ancestral (chimp) outgroup.

October 27, 2012

Inter-relationships between 'world' components

In a previous post I calculated f3-statistics between my K=7 and K=12 ancestral components. The basic idea is to discover which component A can be seen as a mixture of two other components, B and C, in which case (assuming A does not have excessive drift), we expect a negative f3(A; B, C) statistic.

As part of my analysis of the world dataset, I calculated f3-statistics for each of the K=3 to K=12, that is, for some K, I tried to see if one of the K inferred components could be seen as a mixture of the remaining K-1. It turns out that no negative f3 statistics appeared at all, and this suggests that the components inferred by ADMIXTURE at each K tend to form an "orthogonal" set that are not mixtures of each other.

More generally, we can calculate f3 statistics where A, B, and C are components inferred from any of the K=3 to K=12 runs. There is a total of 75 such components, and hence 75*(74 choose 2) = 202,575 such f3 statistics. Since calculating these would take a while (and would become intractable as K increases further), I decided to calculate pairwise f3 statistics, i.e., statistics where A, B, and C are constrained to be from successive K, K+1 runs. The significant results can be seen in the spreadsheet.
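To make the counting and the sign convention concrete, here is a small Python sketch. The f3 function is the plain frequency-based form, mean of (a - b)(a - c) over SNPs, without the sample-size correction or standard errors that a real three-population test computes; the toy frequencies are made up.

```python
import math


def f3(a, b, c):
    """Plain f3(A; B, C): mean over SNPs of (a - b) * (a - c)."""
    return sum((x - y) * (x - z) for x, y, z in zip(a, b, c)) / len(a)


# Components from the K=3 to K=12 runs: 3 + 4 + ... + 12.
n_components = sum(range(3, 13))
# Each component as target A, with an unordered pair (B, C) from the rest.
n_f3 = n_components * math.comb(n_components - 1, 2)
print(n_components, n_f3)

# Toy frequencies (made up): A is an exact 50/50 mix of B and C.
b = [0.10, 0.80, 0.40, 0.30]
c = [0.70, 0.20, 0.90, 0.30]
a = [0.5 * y + 0.5 * z for y, z in zip(b, c)]

f3_mixed = f3(a, b, c)     # target is the mixture: negative
f3_unmixed = f3(b, a, c)   # target is a source: positive
print(f3_mixed, f3_unmixed)
```

In real data, drift in A after the mixture adds a positive term, which is why a negative value is evidence of admixture but a positive one does not rule it out.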

It might be worthwhile to develop an automated way of using these statistics to guide us in the interpretation of ADMIXTURE components. But, they are useful, in any case, as a source of information.

For example, consider the following (the third column represents the mixed population):

Atlantic_Baltic_6/globe6_Z Near_East_6/globe6_Z European_5/globe5_Z -0.013911 0.000084 -166.457

This means that the European component at K=5 can be seen as a mix of the Atlantic_Baltic and Near_East components at K=6. So, this suggests that the European component can be seen as "secondary", the product of admixture. But:

European_5/globe5_Z Amerindian_5/globe5_Z Atlantic_Baltic_6/globe6_Z -0.003964 0.000175 -22.588

This indicates, conversely, that the Atlantic_Baltic component at K=6 can be seen as a mix of the European and Amerindian components at K=5.
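For anyone who wants to work with such lines programmatically, here is a minimal Python sketch that parses the two lines quoted above. The column meanings (two sources, the mixed population, f3, standard error, Z-score) follow the description in this post; the function name and the Z < -3 cutoff are my own choices for illustration.

```python
def parse_f3_line(line):
    """Split 'Source1 Source2 Mixed f3 s.e. Z' into a dictionary."""
    source1, source2, mixed, f3_value, se, z = line.split()
    return {
        "sources": (source1, source2),
        "mixed": mixed,
        "f3": float(f3_value),
        "se": float(se),
        "z": float(z),
    }


quoted_lines = [
    "Atlantic_Baltic_6/globe6_Z Near_East_6/globe6_Z European_5/globe5_Z -0.013911 0.000084 -166.457",
    "European_5/globe5_Z Amerindian_5/globe5_Z Atlantic_Baltic_6/globe6_Z -0.003964 0.000175 -22.588",
]

records = [parse_f3_line(line) for line in quoted_lines]
# Assumed cutoff: treat Z below -3 as significant evidence of mixture.
significant = [r for r in records if r["z"] < -3]

for r in significant:
    print(r["mixed"], "as a mix of", " + ".join(r["sources"]), "Z =", r["z"])
```

The Z-score is the f3 value divided by its standard error, up to rounding of the printed columns.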

It would be very interesting to use f-statistics to guide one in the choice of an "orthogonal" set of ancestral populations, or to summarize the relationships between them in tree or network form. One could potentially use my ADMIXTURE to TreeMix script to do something like this, although as K increases, there is a combinatorial explosion in the total number of components with a probable runtime slowdown/memory usage blowup which might render this approach unusable, at least for large K.

October 23, 2012

Ancient European DNA assessment with 'globe10'

I had previously assessed the same using globe4. See post on globe10 and associated spreadsheet.


The results appear similar to previous analyses overall, with the main features being the presence of "Southern" in Neolithic farmers (which peaks in the Near East), and its absence in hunter-gatherers. Some of the "Amerindian"-like admixture that was evident in globe4 has been "absorbed" by the Atlantic_Baltic (main European) component, but it is interesting that the Swedish hunter-gatherers (Ajv52/Ajv70) continue to show some Amerindian as well as other eastern (Australasian/South Asian) admixture that is lacking in the other samples. These individuals are outside the range of modern populations, but they overall tend to map to the most similar Atlantic_Baltic component with the addition of some eastern influences.

Also of interest is the fact that Oetzi is the only sample which shows a slice of West Asian (5.7%) admixture in this analysis. This was also the case in the previous one using K7b (1.4%). Gok4, on the other hand, the fellow Neolithic individual from Sweden, seems to lack this. The arrangement of the Big Three West Eurasian components (Southern/West Asian/Atlantic_Baltic) has subtly changed in this calculator, but it would be tempting, nonetheless, to see in the little West Asian admixture that Oetzi has but Gok4 and the Mesolithic samples seem to lack, something of the vanguard of the arrival of the West Asian component in Europe. Obviously more samples are needed, including ones from the most interesting regions of the Balkans and Anatolia.

October 21, 2012

Ancient European DNA assessment with 'globe4'

In a previous experiment, I showed that ADMIXTURE at K=4 tracks the same signal of Amerindian-like admixture detected with f-statistics. I encapsulated that analysis in the globe4 calculator over at the Dodecad Project blog, and decided to use it to assess a few ancient European autosomal samples:


Please note that a very variable number of SNPs was extracted from these various samples. These results should be viewed as indicative of possible patterns that might be confirmed by a more thorough analysis. Also, please consult the globe4 post for more details on the methodology behind it, and the interpretation of the 4 components.

With these various caveats, I would say that these results seem to make some sense and to be fairly consistent with the scenario of Patterson et al. (2012):

  • Oetzi and Gok4, the "farmers" seem to lack the Amerindian component
  • Ajv52, and Ajv70, the northern hunter-gatherers seem to possess it
  • Bra1, the Mesolithic Iberian seems to lack it as well
Bra1 also happens to be the most limited sample in terms of available SNPs. Nonetheless, this would appear broadly consistent with the idea that the "Amerindian"-like admixture in Europeans emanated from north-eastern Europe. Today, all continental Europeans seem to possess some of it, but this can be explained by migration of Ajv-like individuals and their mixtures into Western and Southern Europe from central or northern Europe for which there is ample historical and archaeological evidence (e.g., Italo-Celts, Germans, and Slavs, in addition to other, earlier phenomena).

A broader context

The absence of the Amerindian-like admixture in South Indian Brahmins and Armenians, and its paucity in Kurds and Iranians, might indicate that this type of ancestry was not represented in ancient Armenians and Indo-Iranians. Indeed, all these populations possess less of this admixture than those of the North Caucasus. Cypriots possess none of it either, whereas the Greek_D sample has a small 2.5% portion. In a previous analysis, I arrived at a historical-era estimate for North European admixture in Greeks, and this admixture presumably incorporates the signal of Amerindian-like admixture. Additionally, an Iron Age individual from Bulgaria will soon be announced as being Sardinian-like.

The sum of these factors leads me to believe that the signal of Amerindian-like admixture did not play an important role in the formation of the Graeco-Phrygians (and their Armenian relatives) and the Indo-Iranians, or at least did so to an insignificant degree. As the former expanded westward from the PIE homeland, and the latter eastward, they would have had little opportunity to encounter this type of admixture; rather, they would have admixed with Sardinian-like individuals in the west, and Ancestral South Indian (ASI)-like or East Asian individuals in the east.

On the other hand, as Indo-European groups expanded into eastern Europe, setting off a chain of events that would eventually transform most of the northern part of the continent, and, in historical times, much of the rest of it, they would have met with Ajv-like individuals carrying the signal of Amerindian-like admixture, as well as the Oetzi/Sardinian-like farmers that had spread all the way to Scandinavia by the late Neolithic. The population formed by this mixture would have carried with it the signal of Amerindian-like ancestry, and would then transpose it across the continent. The signal would become increasingly muted westward and southward, and indeed this is what we observe.

UPDATE: It is interesting to see that South Indian Brahmins (both the Metspalu et al. sample, and my Iyer_D and Iyengar_D samples) lack this admixture, while Uttar Pradesh Brahmins do not, given the rolloff evidence for a more recent admixture of the latter. This is consistent with a historical admixture event, after the migration of Brahmin groups southwards, as described in that post.

October 18, 2012

Relatives/duplicates in ADMIXTURE

The presence of relatives in a dataset tends to throw ADMIXTURE out, but this does not always happen. In particular, I've noticed that at low K, relatives do not appear to form their own hyper-specific clusters. A good example of this is the Yunusbayev et al. Armenians_Y sample (N=16) that happens to include what appears to be a common individual (or a twin?) with my own Armenian_D sample from the Dodecad Project. This was discovered the last time I ran ADMIXTURE, so I henceforth began using a subset of 15 Armenians (Armenians_15_Y) from that dataset whenever I also included my Dodecad sample.

In my current ongoing analysis of the world dataset, I included two versions of the Sakilli, Paniya, and Malayan samples, from Behar et al. and Chaubey et al. I believe that HarrappaDNA Project has previously identified that some of these are not exactly the same individuals, so I wanted to see what the ancestry of all these individuals was, to help me decide which ones to keep.

Here are the K=5 ancestral proportions of the Behar et al. Sakilli:


GSM536813 10.2 7.8 2.2  0 79.9
GSM536814  8.5 9.3 2.1  0 80.0
GSM536815  9.7 7.9 3.6  0 78.8
GSM536816  8.8 8.7 2.1  0 80.4

and of the Chaubey et al. Sakilli:

SAKD60 10.2 7.8 2.2  0 79.9
SAKD72  9.7 7.9 3.6  0 78.8
SAKD75  8.8 8.7 2.1  0 80.4
SAKD64  8.5 9.4 2.1  0 80.0

These appear to be the same individuals, which was confirmed by IBD analysis.
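The comparison above can also be done mechanically. Here is a minimal Python sketch that flags candidate duplicates by matching ancestral proportions across the two Sakilli sets; the tolerance of 0.15 percentage points is an arbitrary choice of mine to absorb rounding (note 9.3 vs. 9.4 in one pair), and a match is only a hint that still needs confirmation, e.g., by IBD analysis.

```python
behar_sakilli = {
    "GSM536813": (10.2, 7.8, 2.2, 0.0, 79.9),
    "GSM536814": (8.5, 9.3, 2.1, 0.0, 80.0),
    "GSM536815": (9.7, 7.9, 3.6, 0.0, 78.8),
    "GSM536816": (8.8, 8.7, 2.1, 0.0, 80.4),
}
chaubey_sakilli = {
    "SAKD60": (10.2, 7.8, 2.2, 0.0, 79.9),
    "SAKD72": (9.7, 7.9, 3.6, 0.0, 78.8),
    "SAKD75": (8.8, 8.7, 2.1, 0.0, 80.4),
    "SAKD64": (8.5, 9.4, 2.1, 0.0, 80.0),
}


def candidate_duplicates(set_a, set_b, tol=0.15):
    """Pairs whose proportions agree within tol (percentage points) in every component."""
    pairs = []
    for id_a, q_a in set_a.items():
        for id_b, q_b in set_b.items():
            if all(abs(x - y) <= tol for x, y in zip(q_a, q_b)):
                pairs.append((id_a, id_b))
    return pairs


pairs = candidate_duplicates(behar_sakilli, chaubey_sakilli)
for id_a, id_b in pairs:
    print(id_a, "<->", id_b)
```

At low K, unrelated individuals from the same population can have very similar proportions, so this screen works best as a first pass on small sets like these.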

The Malayan individuals also appear to be the same:

GSM536915 0.3 15.5 2.7  0 81.6
GSM536812 3.3 16.6 2.8  0 77.3

A382 0.3 15.5 2.7  0 81.6
MLYA383 3.3 16.6 2.8  0 77.3

But, as noticed by HAP, the Paniya individuals are not the same:

GSM536916 5.1 11.2 2.2 0.0 81.6
GSM536806 0.4 69.7 0.0 4.3 25.6
GSM536807 0.0 79.7 0.0 2.4 18.0
GSM536808 0.0 77.5 0.5 1.7 20.3

2953   D36 5.1 11.2 2.2 0.0 81.6
2954 PNYD9 0.0 19.8 2.5 0.6 77.1
2955 PNYD3 0.0 21.2 1.5 0.0 77.3
2956 PNYD1 0.0 21.7 2.7 0.3 75.2

As I move forward in my "world" analysis, I've decided to drop GSM536916 and the Chaubey et al. versions of Sakilli and Malayan. Thus, PANIYA will refer to the Southeast Asian-like individuals of the Behar et al. set, and Paniya_Ch to the South Asian-like individuals of the Chaubey et al. set, with one copy of the duplicated individual removed.

ADMIXTURE tracks Amerindian-like admixture in northern Europe

I have recently assembled a new "world" dataset of 4,280 individuals that I am currently incrementally analyzing with ADMIXTURE. But, I noticed an interesting pattern at K=4 that I wanted to share right away.

4 ancestral populations emerge at this level of resolution, which I have named: European, Asian, African, Amerindian. The names aren't important, and you can replace them with whatever you prefer. 

The interesting thing about this K=4 analysis is that European populations show evidence of Amerindian admixture, consistent with the pattern inferred using f-statistics, where European populations show admixture between Sardinians and a Karitiana-like population.

This pattern may have emerged at previous ADMIXTURE analyses at this level of resolution, but thanks to the f3 evidence presented in previous posts, it is now clear that it is no quirk of ADMIXTURE, but indicative of a real (albeit still rather mysterious) pattern of gene flow that differentially affected European populations.

For example, the Irish_D population has 7.6% of the Amerindian component, and so do HGDP Orcadians. HGDP Sardinians have only 1.7% of it, which appears to be the minimum in Europe, with French_Basque having more at 4.6%.

Another interesting observation is that West Eurasian populations that show an excess of East Eurasian-like admixture appear to be doing so for two separate reasons. For example, HGDP Russians have 11.7% of Amerindian component, but also 4.5% of "Asian", and 1000 Genomes Finns have 3.3% Asian and 12% Amerindian. Behar et al. (2010) Turks, on the other hand, have 9.9% Asian and 2.2% Amerindian. All these populations are East Eurasian-shifted relative to Sardinians, a pattern which can also be observed by looking at the K=3 analysis, but for apparently different reasons.

The pattern for Near Eastern populations is also interesting. For example, Yunusbayev et al. (2011) Armenians have 0% of the Amerindian component, and 5.7% of the Asian, and all three HGDP Arab populations (Druze, Palestinian, Bedouin) also have 0% of the Amerindian component, with variable levels of the Asian.

It would appear that whatever process contributed Amerindian-like admixture to Europeans minimally affected Near Eastern populations, with Sardinians, who are demonstrably related to Neolithic Europeans (thanks to ancient DNA evidence), tilting towards the Near Eastern pattern. On the other hand, Near Eastern populations show evidence of Asian admixture, which probably involves unresolved East Asian/ASI ancestry, and will be resolved at higher K. Sardinians appear to be at the end of three clines: (i) Amerindian-like cline of Europe-Siberia-Americas, (ii) East Asian-like cline of Europe-Central Asia/Siberia-East Asia, (iii) ASI-like cline of Europe-Near East-South Asia. These are separate, but not independent phenomena.

To confirm that the signal picked up by ADMIXTURE tracks the signal picked up by ADMIXTOOLS formal tests, I calculated the following D-statistic:

D(Sardinian, European, Karitiana, San)

where European is any population with a sample size of at least 10, and which belonged at 99% in the European+Amerindian components:


And, here is a scatterplot:

The correlation is clear, and the Pearson coefficient is -0.96. The negative sign is expected: populations with higher % Amerindian, as estimated by ADMIXTURE, have more negative D-statistics, i.e., stronger evidence for admixture.

What of the actual estimates of admixture produced by ADMIXTURE? Using the F4 ratio test, I recently showed that African admixture in Sardinians confounds estimates of Amerindian-like admixture in northern Europeans and vice versa (Amerindian-like admixture in northern Europeans confounds African admixture in Sardinians).

In that experiment, I "scrubbed" Sardinians to remove segments of African ancestry, and showed that estimates of Amerindian-like admixture in the CEU population diminished from 13.9% to 8.8%. The latter seems reasonably close to the 7.1% inferred by ADMIXTURE.

On balance, I would say that ADMIXTURE at K=4 provides a good proxy for the effect described in Patterson et al. (2012). Its results are more difficult to interpret, because its underlying model does not take into account evolutionary relationships between populations. On the other hand, it has the advantage of being able to handle multiple ancestral populations, and has consistently proven able to generate useful data that correlate well with those from other techniques of population genetics.

September 22, 2012

ADMIXTURE analysis of Schlebusch et al. (2012) data

The ADMIXTURE analysis of Schlebusch et al. (2012) did not include Eurasian references, but thanks to the fact that the authors have made their data publicly available, anyone can carry out additional analyses on it. I am sure that this data will be very useful in the future. The list of included populations, with sample sizes, is:


  • ColouredColesberg_Sch 20
  • ColouredWellington_Sch 20
  • Khomani_Sch 39
  • Karretjie_Sch 20
  • Khwe_Sch 17
  • GuiGhanaKgal_Sch 15
  • Juhoansi_Sch 18
  • Nama_Sch 20
  • Xun_Sch 19
  • SEBantu_Sch 20
  • SWBantu_Sch 12

As is my convention, the _Sch ending denotes that these populations are from the Schlebusch et al. paper.


As always with a new dataset, after processing it, I ran a quick test to make sure everything seemed to be alright. This time, I included the 220 individuals in the released datasets together with 28 HGDP Sardinians and 10 HGDP Dai, and ran a quick K=4 ADMIXTURE analysis:


These appear to make sense. The "green" Dai-like element in the Coloured samples is probably a stand-in for Indian ancestry in that population. The plot of individuals shows considerable variation within several populations:

September 14, 2012

Inter-relationships between Dodecad K7b and K12b components

In a previous post I used leave-one-out to show how components inferred by ADMIXTURE could be related to each other.

One of the "problems" with ADMIXTURE and related analyses is that as the number of components K increases, additional components are formed by merging and/or splitting of components at lower K.

But, it turns out that thanks to the supervised mode, we can look at how components at different K are related to each other: we can treat, e.g., the K=12 ancestral populations as test data with the K=7 ancestral populations as references and vice versa.

I carried out precisely this procedure for my K7b/K12b components.
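To give a feel for what a supervised estimate does, here is a minimal Python sketch: with the reference allele frequencies held fixed, a simple EM iteration finds the mixing proportions that best reproduce a test set of frequencies. This is an illustration under my own assumptions (the test component is represented by its expected allele counts, and plain EM stands in for ADMIXTURE's more sophisticated optimizer); the function name and toy frequencies are made up.

```python
def supervised_proportions(test_freq, ref_freqs, n_iter=200):
    """Estimate mixing proportions of fixed reference components by EM.

    test_freq: per-SNP allele frequencies of the test component.
    ref_freqs: list of K lists, one per reference component.
    """
    k = len(ref_freqs)
    q = [1.0 / k] * k                      # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * k
        for j, t in enumerate(test_freq):
            c1 = 2.0 * t                   # expected copies of the counted allele
            c0 = 2.0 - c1                  # expected copies of the other allele
            m1 = sum(q[i] * ref_freqs[i][j] for i in range(k))
            m0 = 1.0 - m1
            for i in range(k):
                f = ref_freqs[i][j]
                # Share of each allele copy attributed to component i.
                if m1 > 0:
                    new_q[i] += c1 * q[i] * f / m1
                if m0 > 0:
                    new_q[i] += c0 * q[i] * (1.0 - f) / m0
        total = sum(new_q)
        q = [v / total for v in new_q]
    return q


# Toy frequencies (made up): the test component is 25% ref_a and 75% ref_b.
ref_a = [0.10, 0.90, 0.50, 0.20]
ref_b = [0.80, 0.20, 0.50, 0.70]
test = [0.25 * x + 0.75 * y for x, y in zip(ref_a, ref_b)]

q_hat = supervised_proportions(test, [ref_a, ref_b])
print(q_hat)
```

A component that is not really a mixture of the references will still be assigned proportions, so the output describes the closest fit, not proof of mixture.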

Below are the K12b components expressed as mixtures of the K7b ones:

And, the K7b ones expressed as mixtures of the K12b ones:


I have also calculated f3 statistics (using threepop) for all population triples using the K7b/K12b calculators. Most of the mixes inferred by ADMIXTURE appear significant, although I didn't hand-check each one. I report the significant ones below:

Population f3(A; B, C) s.e. Z-score

Atlantic_Baltic_K7b;Atlantic_Med_K12b,North_European_K12b -0.00287483 2.64051e-05 -108.874
African_K7b;East_African_K12b,Sub_Saharan_K12b -0.00241502 2.3253e-05 -103.858
East_Asian_K7b;East_Asian_K12b,Southeast_Asian_K12b -0.00218574 2.17614e-05 -100.441
Caucasus_K12b;West_Asian_K7b,Southern_K7b -0.00317634 4.12205e-05 -77.0573
West_Asian_K7b;Gedrosia_K12b,Caucasus_K12b -0.00209044 3.14454e-05 -66.4785
Siberian_K7b;East_Asian_K12b,Siberian_K12b -0.00166911 2.60228e-05 -64.1403
South_Asian_K7b;Gedrosia_K12b,South_Asian_K12b -0.00195015 3.35149e-05 -58.1876
East_Asian_K12b;East_Asian_K7b,Siberian_K7b -0.00191747 3.49244e-05 -54.9034
Atlantic_Baltic_K7b;Southern_K7b,North_European_K12b -0.00181747 3.63948e-05 -49.9377
East_African_K12b;Southern_K7b,African_K7b -0.00412496 0.000101701 -40.5598
Atlantic_Med_K12b;Southern_K7b,Atlantic_Baltic_K7b -0.00138679 3.68608e-05 -37.6222
East_Asian_K7b;Southeast_Asian_K12b,Siberian_K7b -0.00127133 3.92998e-05 -32.3495
Northwest_African_K12b;Southern_K7b,Sub_Saharan_K12b -0.00272013 0.000110067 -24.7133
Northwest_African_K12b;Southern_K7b,African_K7b -0.00255262 0.000107527 -23.7394
East_African_K12b;African_K7b,Atlantic_Med_K12b -0.00237833 0.000107306 -22.1639
East_African_K12b;African_K7b,Caucasus_K12b -0.00217732 0.000101003 -21.557
Caucasus_K12b;West_Asian_K7b,Atlantic_Med_K12b -0.000977923 4.573e-05 -21.3847
Caucasus_K12b;West_Asian_K7b,Northwest_African_K12b -0.00100154 4.86387e-05 -20.5915
East_African_K12b;Southern_K7b,Sub_Saharan_K12b -0.00247983 0.000122139 -20.3034
Caucasus_K12b;Southern_K7b,Gedrosia_K12b -0.00112749 5.91335e-05 -19.0669
East_Asian_K12b;Southeast_Asian_K12b,Siberian_K7b -0.00100305 5.44851e-05 -18.4097
Atlantic_Baltic_K7b;North_European_K12b,Caucasus_K12b -0.000534432 2.98199e-05 -17.922
Southern_K7b;Southwest_Asian_K12b,Atlantic_Med_K12b -0.000683711 4.08148e-05 -16.7515
East_Asian_K12b;East_Asian_K7b,Siberian_K12b -0.000651854 4.01206e-05 -16.2474
African_K7b;Gedrosia_K12b,Sub_Saharan_K12b -0.000738345 4.5676e-05 -16.1648
African_K7b;Southern_K7b,Sub_Saharan_K12b -0.000769896 4.8516e-05 -15.8689
South_Asian_K7b;South_Asian_K12b,Northwest_African_K12b -0.000598387 3.84069e-05 -15.5802
African_K7b;Sub_Saharan_K12b,Northwest_African_K12b -0.000602378 4.07154e-05 -14.7948
East_African_K12b;African_K7b,Southwest_Asian_K12b -0.00141216 0.000102079 -13.834
African_K7b;Sub_Saharan_K12b,North_European_K12b -0.000663712 4.87314e-05 -13.6198
African_K7b;South_Asian_K7b,Sub_Saharan_K12b -0.000598399 4.51811e-05 -13.2445
Southern_K7b;Southwest_Asian_K12b,Northwest_African_K12b -0.000577559 4.50096e-05 -12.8319
Siberian_K7b;East_Asian_K7b,Siberian_K12b -0.000403499 3.17418e-05 -12.7119
Atlantic_Baltic_K7b;West_Asian_K7b,Atlantic_Med_K12b -0.000520714 4.41022e-05 -11.807
East_African_K12b;African_K7b,Atlantic_Baltic_K7b -0.00122819 0.000106897 -11.4895
African_K7b;Sub_Saharan_K12b,Siberian_K7b -0.00051246 4.93477e-05 -10.3847
East_African_K12b;African_K7b,North_European_K12b -0.00103911 0.000106816 -9.72802
African_K7b;Sub_Saharan_K12b,Southeast_Asian_K12b -0.000469707 4.98071e-05 -9.43052
African_K7b;East_Asian_K12b,Sub_Saharan_K12b -0.000461359 4.9918e-05 -9.24235
Gedrosia_K12b;South_Asian_K7b,West_Asian_K7b -0.00047115 5.11259e-05 -9.2155
South_Asian_K7b;East_African_K12b,South_Asian_K12b -0.000384664 4.18056e-05 -9.20125
African_K7b;Sub_Saharan_K12b,Caucasus_K12b -0.000430657 4.69419e-05 -9.17425
African_K7b;Sub_Saharan_K12b,Southwest_Asian_K12b -0.000421792 4.64037e-05 -9.08962
Atlantic_Baltic_K7b;North_European_K12b,Northwest_African_K12b -0.000328259 3.62081e-05 -9.06589
African_K7b;Sub_Saharan_K12b,East_Asian_K7b -0.000446564 4.9569e-05 -9.00895
African_K7b;Sub_Saharan_K12b,Siberian_K12b -0.000437012 4.88062e-05 -8.95404
Northwest_African_K12b;African_K7b,Atlantic_Med_K12b -0.00115555 0.000131897 -8.76101
African_K7b;West_Asian_K7b,Sub_Saharan_K12b -0.000397507 4.57534e-05 -8.68804
African_K7b;Sub_Saharan_K12b,Atlantic_Baltic_K7b -0.000418044 4.81379e-05 -8.68431
African_K7b;South_Asian_K12b,Sub_Saharan_K12b -0.000393516 4.57123e-05 -8.60853
South_Asian_K7b;South_Asian_K12b,Southwest_Asian_K12b -0.000290753 3.88373e-05 -7.48644
South_Asian_K7b;West_Asian_K7b,South_Asian_K12b -0.000228331 3.63783e-05 -6.27657
Atlantic_Med_K12b;Southern_K7b,North_European_K12b -0.000329428 5.28014e-05 -6.239
East_African_K12b;Gedrosia_K12b,African_K7b -0.000596188 0.000102434 -5.8202
African_K7b;Sub_Saharan_K12b,Atlantic_Med_K12b -0.00023116 4.95629e-05 -4.66397
South_Asian_K7b;South_Asian_K12b,Atlantic_Med_K12b -0.000172605 4.09236e-05 -4.21775
Siberian_K12b;Atlantic_Med_K12b,Siberian_K7b -0.000166672 4.4065e-05 -3.78243
East_African_K12b;West_Asian_K7b,African_K7b -0.00034931 0.000103503 -3.37489
Atlantic_Baltic_K7b;Atlantic_Med_K12b,Siberian_K7b -0.000226988 7.32706e-05 -3.09795

This leads to a very simple way of gauging whether an ancestral population is better seen as admixed or not: count the number of times it appears before the semi-colon, and subtract the number of times it appears after the semi-colon. This may not be a perfect measure, but it captures the basic idea. When I do this, I get:

 [1,] East_African_K12b      7  
 [2,] African_K7b            7  
 [3,] South_Asian_K7b        4  
 [4,] Atlantic_Baltic_K7b    3  
 [5,] East_Asian_K12b        0  
 [6,] Caucasus_K12b          0  
 [7,] Northwest_African_K12b -2 
 [8,] East_Asian_K7b         -2 
 [9,] Siberian_K12b          -3 
[10,] Southeast_Asian_K12b   -4 
[11,] Gedrosia_K12b          -4 
[12,] Siberian_K7b           -4 
[13,] Southwest_Asian_K12b   -5 
[14,] South_Asian_K12b       -7 
[15,] North_European_K12b    -7 
[16,] West_Asian_K7b         -7 
[17,] Southern_K7b           -8 
[18,] Atlantic_Med_K12b      -8 
[19,] Sub_Saharan_K12b       -19

I think this looks reasonable; the components at the bottom usually appear as contributors to the admixture of other populations, and the components at the top usually appear admixed in terms of the other components. Of course, admixed components may themselves be useful if they represent regional mixes (such as the East African), but this is certainly a good way to supplement and interpret ADMIXTURE analysis.
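For anyone who wants to repeat the tally, here is a minimal sketch in Python. It assumes each input line looks like the rows above ("X;A,B f3 stderr Z"); the function name is my own, and the three rows in the example are just a small excerpt of the table, so the scores it yields are not the full ranking.

```python
from collections import Counter

def admixture_scores(lines):
    """Score = (# times before the semi-colon) - (# times after it)."""
    scores = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        target, sources = fields[0].split(";")
        scores[target] += 1          # appears as the admixed population
        for source in sources.split(","):
            scores[source] -= 1      # appears as a contributing population
    return scores

# Three rows copied from the table above (an excerpt, not the whole list).
example_rows = [
    "Atlantic_Baltic_K7b;Atlantic_Med_K12b,North_European_K12b -0.00287483 2.64051e-05 -108.874",
    "African_K7b;East_African_K12b,Sub_Saharan_K12b -0.00241502 2.3253e-05 -103.858",
    "East_African_K12b;Southern_K7b,African_K7b -0.00412496 0.000101701 -40.5598",
]
example_scores = admixture_scores(example_rows)
```

Sorting the resulting counter by value gives a ranking of the same kind as the one above.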

August 30, 2012

Scrubbing Sardinians

In a series of posts, I showed that European populations have east Eurasian-like admixture, an element that appears to be lacking in Sardinians. I did this both on the basis of the 3-population test and a number of different comparisons between West Eurasian populations, as well as on the basis of the 4-population test.

The fact that f4(Sardinian, CEU, Asian, African) is negative was interpreted by Moorjani et al. (2011) as evidence that Sardinians have ~2.9% African admixture. As I pointed out at the time, this level of admixture was predicated on the assumption that CEU did not have Asian admixture, and this assumption now appears not to hold.

Of course, the above-mentioned paper also used an admixture LD based method (ROLLOFF) to date the African admixture in Sardinians, coming up with an estimate of ~71 generations. But, we should remember that ROLLOFF does not quantify the extent of this admixture.

Imagine walking along a Sardinian genome: the negative f4 signal is created both by occasional African-like segments you meet along the way and by the presence of East Eurasian SNPs in CEU in other locations where Sardinians may have no African admixture. The f4 signal is a genomewide average that is influenced by two different processes: punctuation by African segments whose length distribution can supply information about the time of their introgression; and the background genome that is lacking in the East Eurasian-like polymorphism present in CEU.

In this post, I will show that:
  • The admixture estimate of 2.9% is not robust, but depends on the choice of Asian population for f4 ancestry estimation, consistent with the idea that it is influenced by east Eurasian-like admixture that has affected northern European populations.
  • If Sardinians are "scrubbed" of any trace of African admixture, the negative f4(Sardinian, CEU, Asian, African) signal persists.

Estimates of African admixture in Sardinians depend on choice of Asian/American population

African ancestry in Sardinians was estimated by Moorjani et al. (2011), using the following ratio:

f4(San,Papuan; Sardinian,CEU) / f4(San,Papuan; YRI, CEU)

In Table S6 different ancestral populations were used for f4 ancestry estimation, and all results ranged between 2.9-3.4%.

The signal of east Eurasian-like admixture in northern Europe is strongest when Karitiana is used as an Asian/American reference. If the level of "African" admixture in Sardinians is driven, as I suspect, by the presence of east Eurasian-like admixture in northern Europe, then I expect the estimate to be higher when Karitiana rather than Papuans are used. And, indeed, this is what I observe:

f4(San,Papuan;Sardinian,CEU) = 0.00118099 (Z=10.6838)
f4(San,Papuan;YRI,CEU) = 0.0379664 (Z=88.2287)

(in all experiments I use a set of 28 Sardinians vs. 27 in the Moorjani et al. paper, a set of 112 CEU, 147 YRI, a set of 166,770 SNPs, and -k 200 for fourpop)

therefore, African admixture in Sardinians using Papuan reference = 0.00118099/0.0379664 = 3.1%

but


f4(San,Karitiana;Sardinian,CEU) =  0.00272141 (Z=22.7288)

f4(San,Karitiana;YRI,CEU) = 0.04449 (Z=100.19)

therefore, African admixture in Sardinians using Karitiana reference = 0.00272141/0.04449 = 6.1%

A ~2-fold difference in African admixture has resulted from a different choice of outgroup. This is unexpected if West Eurasians did not exchange genes with Papuans and Karitiana since their divergence, but expected if CEU received genes from an Asian population that was more like Karitiana and less like Papuans.
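The arithmetic above can be checked with a few lines of Python. This is only a sketch of the ratio calculation using the f4 values quoted in this post; the function and variable names are mine, not part of fourpop.

```python
def f4_ratio(f4_test, f4_reference):
    """Ancestry estimate as a ratio of two f4 statistics."""
    return f4_test / f4_reference

# f4(San,Papuan;Sardinian,CEU) / f4(San,Papuan;YRI,CEU)
african_papuan = f4_ratio(0.00118099, 0.0379664)

# f4(San,Karitiana;Sardinian,CEU) / f4(San,Karitiana;YRI,CEU)
african_karitiana = f4_ratio(0.00272141, 0.04449)

# How much the estimate changes with the choice of reference.
fold_difference = african_karitiana / african_papuan

print(f"Papuan reference:    {african_papuan:.1%}")
print(f"Karitiana reference: {african_karitiana:.1%}")
print(f"Fold difference:     {fold_difference:.2f}")
```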

Scrubbing Sardinians

Another way to demonstrate that east Eurasian-like admixture in CEU is inflating the perceived level of African-like admixture in Sardinians is to comprehensively "scrub" Sardinians of all traces of African ancestry by replacing segments of their DNA when there is even a hint of such ancestry with missing values.

Going back to the mental experiment of walking along the Sardinian genome, we are going to remove spots of even remote possibility of African admixture. It will be shown that CEU continues to have evidence of east Eurasian-like admixture using the scrubbed Sardinians, suggesting that it is not only African-like admixture in Sardinians generating this signal, but also East Eurasian-like admixture in CEU.

I used DIYDodecad to do this scrubbing, but one could potentially try any approach that can identify African segments, such as HAPMIX or PCA. I used the dataset assembled for K7b and K12b, and carried out a K=3 ADMIXTURE analysis, which resulted in 3 components centered on West Eurasia, Asia, and Africa. I chose not to use an African component from higher-K (e.g. the K7b calculator), because it is conceivable that African ancestry might be lurking in southern Caucasoid components inferred with these tools (e.g., the "Southern" component of K7b or the "Southwest Asian" one of K12b). The average African admixture in Sardinians using the K3b calculator is 0.9%, and for the subset of CEU used it is 0.2%.

Using the byseg mode of DIYDodecad, I created ancestry maps of the 28 HGDP Sardinians, and I only kept windows where the African admixture was exactly 0%. This is a very aggressive scrubbing, designed to remove virtually all African admixture from the population. For example, if a window has 99.9% West Eurasian admixture and 0.01% African, I will nonetheless remove it, even though chances are extremely high that the 0.01% represents only noise. I did not want to leave any doubt that any trace of identifiable African ancestry remained in my "scrubbed Sardinians".
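The filtering rule itself is simple, and can be sketched as follows. This is not DIYDodecad code and does not reproduce its byseg output format; the data layout (a list of genotype calls, a list of window index ranges, and one African fraction per window) is a hypothetical one chosen for illustration.

```python
def scrub_windows(genotypes, windows, african_fraction, missing=None):
    """Set to missing every SNP in a window with any African ancestry.

    genotypes:        list of genotype calls, one per SNP (hypothetical layout)
    windows:          list of (start, end) SNP index ranges, end exclusive
    african_fraction: estimated African fraction for each window
    """
    scrubbed = list(genotypes)  # leave the input untouched
    for (start, end), fraction in zip(windows, african_fraction):
        if fraction > 0.0:  # keep only windows that are exactly 0% African
            for i in range(start, end):
                scrubbed[i] = missing
    return scrubbed

# Toy example: the middle window has a tiny African estimate and is removed.
toy = scrub_windows(
    genotypes=[0, 1, 2, 1, 0, 2],
    windows=[(0, 2), (2, 4), (4, 6)],
    african_fraction=[0.0, 0.0001, 0.0],
)
```

The strict "greater than zero" threshold is what makes the scrubbing so aggressive: even a window whose African estimate is almost certainly noise is discarded.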

I am very confident that my scrubbed Sardinians do not have any hint of African ancestry, but you can decide for yourselves. I base my confidence on (a) the extreme nature of the scrubbing, which threw away much of the Sardinian genome in order to ensure that no hints of local African ancestry remained, (b) re-assessment of the scrubbed Sardinians with K3b showing that they are now 100% West Eurasian, and (c) ab initio ADMIXTURE analysis of CHB, YRI, CEU, and scrubbed Sardinians, demonstrating that the latter are 100% West Eurasian, while CEU has traces of 0.1% African and 0.3% Asian ancestry.

So, here are the results for the scrubbed Sardinians:

f4(San,Papuan;Sardinian_scrubbed,CEU) = 0.000678108 (Z=4.05225)
f4(San,Papuan;YRI,CEU) = 0.0379664 (Z=88.2287)
so scrubbed Sardinians with Papuan reference appear 0.000678108 / 0.0379664 = 1.8% African

and 

f4(San,Karitiana;Sardinian_scrubbed,CEU) = 0.00205526 (Z=11.2848)
f4(San,Karitiana;YRI,CEU) = 0.04449 (Z=100.19)
so scrubbed Sardinians with Karitiana reference appear 0.00205526/0.04449 = 4.6% African

Despite the thorough scrubbing, Sardinians continue to show African admixture using f4 ancestry estimation. This is consistent with the idea that much of the African ancestry inferred using f4 ancestry estimation in Sardinians is an artifact of not taking into account east Eurasian-like admixture in CEU.

Conversely, a significant signal of east Eurasian-like admixture in CEU persists whether one uses regular or scrubbed Sardinians:

With regular Sardinians

f4(San,Papuan;Sardinian,Karitiana) = 0.0084678 (Z=21.2137)
f4(San,Papuan;Sardinian,CEU) = 0.00118099 (Z=10.6838)

So, CEU appears = 0.00118099/0.0084678 = 13.9% East Eurasian

With scrubbed Sardinians

f4(San,Papuan;Sardinian_scrubbed,Karitiana) = 0.00774427 (SE=0.00056725, Z=13.6523)
f4(San,Papuan;Sardinian_scrubbed,CEU) = 0.000678108 (SE=0.000167341, Z=4.05225)

So, CEU appears = 0.000678108/0.00774427 = 8.8% East Eurasian

Conclusion

My "palimpsest" idea seems to be confirmed by the data. A first observation is that the level of African-like admixture in Sardinians depended on whether one used Papuans or Karitiana as an outgroup, suggesting that neither population was a true outgroup, and the signal of African admixture in Sardinians was driven in part by East Eurasian-like admixture in CEU. African admixture in Europe cannot be assessed accurately if one ignores the confounding effect of East Eurasian admixture.

When I aggressively scrubbed Sardinians so as to remove all traces of African ancestry, part of the African admixture fraction disappeared (expected, since African ancestry was removed from Sardinians), but a substantial part of it remained (unexpected, if the signal was driven only by African admixture, but expected, if it was driven in part by East Eurasian-like admixture in CEU). Conversely, using scrubbed Sardinians reduced, but did not eliminate, the admixture estimate for CEU.

August 27, 2012

3-population test and east Eurasian-like admixture in Europe or The Isle of Refuge

The 3-population test (Reich et al. 2009) allows one to detect the presence of admixture in a population X from two other populations A and B. The value

f3(X; A, B)

is negative when X does not appear to form a simple tree with A and B but appears to be a mixture of A and B.
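To see why a mixture produces a negative value, here is a minimal sketch in Python. It uses the plain definition of f3 as the average of (x - a)(x - b) over SNPs, applied to made-up population allele frequencies; it leaves out the finite-sample corrections that estimators working on real sample data need, and it is not the threepop implementation.

```python
def f3(x, a, b):
    """Naive f3(X; A, B): mean over SNPs of (x - a) * (x - b)."""
    return sum((xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b)) / len(x)

# Made-up allele frequencies for two source populations at three SNPs.
freq_a = [0.1, 0.8, 0.4]
freq_b = [0.9, 0.2, 0.6]

# X is constructed as a 30%/70% mixture of A and B, with no drift afterwards.
alpha = 0.3
freq_x = [alpha * ai + (1 - alpha) * bi for ai, bi in zip(freq_a, freq_b)]

f3_mixture = f3(freq_x, freq_a, freq_b)
```

For such an idealized mixture each term equals -alpha * (1 - alpha) * (a - b)^2, so the statistic cannot be positive. Drift in X after the mixture adds a positive term, which is why a positive value does not rule out admixture.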

In a previous entry, I noted that continental European populations, and especially northern Europeans, appear to have East Eurasian-like admixture on the basis of the 4-population test. The results of that test are more difficult to interpret, because the quantity f4(X, Y; A, B) can take significant negative or positive values depending on the relationships of populations X, Y with A, B. When A, B are East Eurasian and African populations respectively, and X, Y are West Eurasian ones, East Eurasian-like admixture in a northern European population will affect the f4 quantity in the same way as African-like admixture in a southern Caucasoid one. This is not a problem with the f3 test, although caution is needed: a negative value indicates deviation from "treeness" and admixture, but a positive one does not reject admixture.

The f3 statistics were calculated with the threepop program of TreeMix with -k 500 over a set of 598,467 SNPs.

I have used 3 Asian/American reference populations (Karitiana from South America, CHB Chinese, and Papuans) and calculated the following:

f3(West Eurasian 1; West Eurasian 2, Asian/American)

As noted above, negative values of this indicate that West Eurasian 1 can be seen as an admixed population of West Eurasian 2 + Asian/American. The set of 14 West Eurasian populations used is:
CEU, TSI, Tuscan, Orcadian, French, French_Basque, North_Italian, Bedouin, Palestinian, Druze, Mozabite, Adygei, Russian, Sardinian
I thus report 2*(14 choose 2)*3 = 546 values of f3. Hence, I did not privilege Sardinians as a reference point, but instead tried all pairs of West Eurasian populations, and 3 different American/Asian references. The results can be found in the spreadsheet.
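The count of 546 can be verified by enumerating the triples. The sketch below uses the population names listed above; the variable names are mine.

```python
from itertools import permutations

west_eurasians = [
    "CEU", "TSI", "Tuscan", "Orcadian", "French", "French_Basque",
    "North_Italian", "Bedouin", "Palestinian", "Druze", "Mozabite",
    "Adygei", "Russian", "Sardinian",
]
references = ["Karitiana", "CHB", "Papuan"]

# Ordered pairs matter: f3(A; B, ref) and f3(B; A, ref) are different tests.
triples = [
    (target, source, ref)
    for target, source in permutations(west_eurasians, 2)
    for ref in references
]

print(len(triples))  # 14 * 13 * 3
```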

Out of the 546 triples, 64 show an f3 score with Z less than or equal to -3, and are thus significant.

The following populations have such a score in at least one pairwise comparison, when they are set as West Eurasian 1, and thus appear to have east Eurasian-like admixture

CEU, Russian, French, Adygei, TSI, Tuscan, Orcadian, North_Italian, Palestinian

Note that east Eurasian-like admixture cannot be rejected for the other populations, but it can be confirmed for the above. Moreover, the mean strength of the observed effect for the significant comparisons was Z=-5.5 for Papuan reference, Z=-10.2 for CHB, and Z=-10.9 for Karitiana, again suggesting a northern origin of the east Eurasian-like admixture, albeit without so major a difference between Karitiana and CHB as in the 4-population test.

But, it is worth reading the raw data. For example, note above that of the Middle Eastern and North African populations, only Palestinians show a negative f3 score in any pairwise comparison. And actually they only do so for f3(Palestinian; Sardinian, Papuan) with a Z-score of -4.1. So, it appears that Palestinians have undergone admixture of a different sort than Europeans.

Significant differences were observed for Sardinians as West Eurasian 2 in 21 cases, for French Basque in 11 cases, for North_Italian and TSI in 6 cases, and for CEU, Orcadian, French, and Tuscan in 4 cases. So, other populations mostly appear east Eurasian-like admixed relative to Sardinians, and a couple of populations (Russian and Adygei) also appear so admixed relative to west Europeans.

Oetzi the Tyrolean Iceman

The fact that Europeans appear admixed with an east Eurasian-like element when compared with Sardinians does not mean that Sardinians may not also be admixed with this element. I used the genome of the Tyrolean Iceman (Keller et al. 2012) to test whether Sardinians appear east Eurasian-like admixed relative to the Iceman.

f3(Sardinian;Karitiana,Oetzi) = 5.36496e-06 (Z=0.00940612)

This might indicate no admixture, but f3 can only detect admixture; it cannot prove its absence. The f4 is suggestive:

f4(Sardinian,Oetzi;Karitiana,San) = -0.00221783 (Z=-3.06251)

You should probably not take my word for the above. It may appear that, contrary to expectation, Oetzi was more east Eurasian-like than modern Sardinians. Indeed, in my initial analysis of him with ADMIXTURE, I found that he was 2.8% East_Asian, which would point to an East Eurasian shift of Oetzi relative to Sardinians, and which might be consistent with the f4 result. On the other hand, the negative f4 score could be related to African-like gene flow. On balance I would say that Sardinians appear quite similar to Oetzi.

Gok4 and Ajv52

Furthermore, I carried out the same analysis on Neolithic samples from Sweden (Skoglund et al. 2012). The number of SNPs here is much smaller. Results are:

Gok4 (TRB farmer): f4(Sardinian,Gok4;Karitiana,San) = -0.00167365 (Z=-1.23616)
Ajv52 (PWC hunter-gatherer): f4(Sardinian,Ajv52;Karitiana,San) = -0.004676 (Z=-3.76048)

While I would not bet the farm on these results (because of the small number of SNPs and the fact that they're based on a single individual), they do seem to suggest that these Neolithic Swedes were east Eurasian shifted relative to Sardinians. For example, for my Swedish_D sample, I get f4(Sardinian, Swedish_D; Karitiana, San) = -0.00372751 (Z=-22.8715). The Z-score is stronger (probably because of the much larger number of SNPs), but the f4 value of Ajv52 is lower (more east Eurasian-like). Modern Swedish_D appears intermediate between Gok4 and Ajv52, so this may suggest that Mesolithic Europeans are, at least in part, the source of this element.

(Comparison with Brana-1 Mesolithic Iberian indicates a negative non-significant f4 score, but with an even smaller number of SNPs).

In sum, my experiments with ancient DNA samples from Europe suggest a little more east Eurasian-like shift relative to Sardinians (or conversely a little more African-like shift in Sardinians). Oetzi (who has the highest quality genome) appears to be so shifted, and Ajv52 (a Neolithic northern hunter-gatherer) appears to be so as well. I am sure that if we get more high quality ancient DNA from Europe, some clear pattern may emerge, but I would not speculate further on the basis of these initial results.

Isle of Refuge


The above set of experiments has revealed once more that "there's something about Sardinians." There is perhaps a reason for the fact that the arrival of population elements from continental Europe seems to have bypassed them to some degree, or, at least affected them least. However it was that continental Europeans got their east Eurasian-like shift, the great tank of European genetic variation does not seem to have achieved equilibrium with the little cup of Sardinia. Something stood in the way.

Sardinia is the westernmost of the large Mediterranean islands. It is more distant from mainland Europe/Asia than the other big islands (Cyprus, Crete, Sicily, and Corsica).

And, unlike islands much smaller than itself, its size has probably been instrumental in affording it a certain autonomy and continuity of population. Only Sicily is larger, but one can practically swim across the Strait of Messina to reach it from the Italian peninsula.

Hence, a combination of large size, western geographical location, and distance from the mainland have contributed to the continuity of its population. But, geography may not have been sufficient if other events had not taken place. Through a combination of favorable geography and historical contingency, the Sardinians made it to the present largely unscathed, and, among their other graces, can now help scientists figure out what happened to the rest of us.

August 26, 2012

Inter-relationships of the Dodecad K12b and world9 components

Pconroy made a most excellent suggestion in the comments of a previous post, so I decided to follow up on it. His idea is to see what Dodecad components look like when they're measured in terms of other components. So, I took the K12b components and carried out the following procedure:

I used each of the 12 different components as "test data" in a supervised ADMIXTURE analysis that used the other 11 components as "reference data". This simple procedure can show what each component appears to be made of, if it is seen in the context of the remaining components. It is a good way to demonstrate relationships between them.

Here are the results:


Some observations:

  • Gedrosia appears to be Caucasus + a slice of Siberian
  • Both Siberian and Southeast Asian appear to be wholly East Asian
  • East Asian on the other hand, appears to be mostly Southeast Asian + minority Siberian
  • Northwest African appears to be Caucasus + a minority Sub Saharan
  • Atlantic Med appears to be Caucasus + a slice of North European
  • North European appears to be Atlantic Med + Gedrosia with a slice of Siberian
  • South Asian appears to be Caucasus + East Asian
  • East African appears to be Sub Saharan + minority Caucasus
  • Southwest Asian appears to be Caucasus
  • Sub Saharan appears to be East African
  • Caucasus appears Atlantic Med + Gedrosia + slices of Northwest African and Southwest Asian
The most salient point about this analysis is the central position of the Caucasus component vis a vis the others, consistent with my womb of nations theory. Not only do all West Eurasian components (except the North European) appear substantially "Caucasus" in this analysis, but the Caucasus component itself shows links with four others.

It could be argued that these results represent a confluence of peoples from all over West Eurasia into the highlands of West Asia where the Caucasus component is modal. But, the Caucasus region is arguably the most linguistically diverse in West Eurasia, and many of its languages do not appear to have come from elsewhere. Also, the Near East (where the Caucasus component is the most important one in most populations) is the birthplace of agriculture, which has demonstrably affected most of West Eurasia. On balance, this analysis seems consistent with population expansions out of West Asia.

The following graph summarizes the relationship between the 12 components. I used color intensity of the edges to indicate admixture intensity:




Finally, a few points to remember: 
  • the South Asian component appears like a mix of Caucasus and East Asian; the latter probably acts as a stand-in for the Ancestral South Indians of Reich et al. (2009)
  • Similarly, the Gedrosia/Siberian influences on the North European component do not necessarily mean direct influences from these two regions; an explanation for these influences may intersect with the issue of East Eurasian-like ancestry in northern Europe
  • It is the Caucasus, rather than the Southwest Asian component, that seems to donate to the Northwest African and East African ones. That seems to flout geography, but probably indicates that the Southwest Asian component, with its strong Semitic associations (see distribution in K12b spreadsheet), represents a more specialized form of the more generalized Caucasus component.
  • Some components appear to be "terminal", affected but not much affecting: Southwest Asian, Northwest African, and Southeast Asian. These tend to appear at high K in admixture analyses, and probably represent either recent mixtures (Northwest African) or specialized forms of more generalized ones (Southwest Asian of Caucasus and Southeast Asian of East Asian)
  • Finally, remember that living populations show admixture proportions of many of these components. So, for example, the East African population often has Southwest Asian admixture, even though the East African component lacks it. And, as mentioned above, this may reflect the more generalized west Asian admixture that has affected East Africa, as well as the more specific Arabian admixture, associated e.g., with the spread of Semitic languages. Please refer to the K12b spreadsheet for admixture proportions of populations for the 12 components.
I have also done the same with the world9 calculator, which includes Amerindian and Australasian components. Here is how the world9 components are seen as mixtures of the remaining ones:



And, here is the graph showing how they seem to contribute to each other.


A few observations:

  • Amerindian appears wholly Siberian
  • East Asian appears Siberian + South Asian + slice of Australasian
  • African appears South Asian. I would attribute this to Africans being related to both West and East Eurasians approximately symmetrically, so in this type of experiment, South Asian (which is an ANI/ASI mix) appears like the best match
  • Atlantic_Baltic appears Caucasus_Gedrosia + Southern + slice of Amerindian
  • Australasian appears South Asian. I would guess that ASI and Australo-Melanesians share deep common ancestry from the earliest settlement of southern parts of Asia.
  • Siberian appears East Asian + slice of Amerindian
  • Caucasus_Gedrosia and Southern appear Atlantic_Baltic
  • South Asian appears an about equal mix of East Asian + Caucasus_Gedrosia + slice of Australasian
Raw data for these experiments can be found here.

August 25, 2012

ADMIXTURE 1.22 correcting Fst bias

In many of my experiments I have used ADMIXTURE versions prior to 1.22. According to the authors' website, version 1.22 (3/10/2012):
Fst estimates were upward biased; have now switched to the method of Reynolds et al. (1983).
This probably means that many of the Fst divergences reported here and in the Dodecad blog must be reduced. This is not really a big problem, since, biased or not, the reported numbers show the relative similarity of different components. But, I decided to investigate, so I re-ran the ADMIXTURE analysis that created the K7b calculator.


The correlation with the old (ver. 1.21) Fst values is very strong (+0.9993209) and the new values can be estimated from the old using the following regression:

New = 0.782324*Old + 0.009335
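For convenience, the regression can be applied to an old value like this. The function name is mine, the input of 0.1 is just an illustrative number rather than a reported Fst, and the coefficients were fitted on the K7b components only, so treat the output as a rough guide.

```python
def corrected_fst(old_fst):
    """Approximate ADMIXTURE 1.22 Fst from a pre-1.22 value (K7b regression)."""
    return 0.782324 * old_fst + 0.009335

# Illustrative input only, not a value taken from any of my tables.
example_new = corrected_fst(0.1)
print(round(example_new, 4))
```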

Of course it would be a good idea to re-run this type of analysis separately whenever the absolute values are important. For example, in a previous experiment, I suggested that Fst's between the K12a components were so low, that these components should not be interpreted as having diverged in very old (say, Upper Paleolithic) times, but rather in a more recent (post-glacial, and probably mostly Neolithic) time frame. Correction for this upward bias would probably strengthen that hypothesis which was one way of arguing in favor of the womb of nations theory.

August 23, 2012

Dodecad Project components and East Eurasian-like admixture

See Part 1, Part 2, and Part 3.

I went back to the Dodecad Project K7b and K12b calculators, and calculated f4 statistics of the form:

f4(Southern_K7b, X, East_Asian_K7b, African_K7b)

I wanted to see how the various components related to East Eurasians.

Here are the results:


Visually for the West Eurasian components:

This shows the relative ordering of the different components on the East Asian-African axis. Notice that of the mainly Caucasoid components the most Asian-shifted is the North European component, the most African shifted is the Southwest Asian one. This makes sense because of the admixture phenomenon I've been describing in this series, and also the proximity of Arabia (which is where the Southwest Asian component is modal) to Africa.

The existence of East Eurasian-like admixture in Europe is further supported by the following observation: both the Atlantic_Baltic and North_European components (which are the most East Asian-shifted) are mainly geographically distributed to the west of the West Asian, Caucasus, and Gedrosia components (which are less East Asian-shifted). This seems discordant with geography. On the other hand, the relative positions of the Caucasus, Southern, and Southwest Asian components vis a vis Africa are concordant with geography, as their centers of distribution are close to Africa along land migration routes, with Southwest Asia being closest both genetically and geographically, and Caucasus most distant.

Another observation is that the Atlantic_Med component, which is modal in Sardinians and Basques, is actually Asian-shifted relative to the Southern component (modal in Arabia). This might indicate the presence of some degree of East Eurasian-like admixture in Sardinia itself. So, while Sardinia may possess the minimum of this element in Europe, it may not do so in the wider Caucasoid world.

Unscrambling the omelette of West Eurasian origins is no easy task. Hopefully, new statistical methods and ancient DNA will help us achieve it.

August 21, 2012

4-population test and East Eurasian-like ancestry in Northern Europe

Update: This is the first part of my discussion on the topic. For part 2 go here; for part 3 here.

I decided to follow up on a hint in the recent Reich et al. (2012) paper on Native Americans to the effect that:
... east/central Asian admixture has affected northern Europeans to a greater extent than Sardinians (in our separate manuscript in submission, we show that this is a result of the different amounts of central/east Asian-related gene flow into these groups).
I used the implementation of the 4-population test of Reich et al. (2009) in the fourpop program of TreeMix. 255,020 SNPs common in the various datasets were used throughout, and blocks of 200 SNPs for standard error estimation.
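As a rough illustration of how a Z-score arises from a block-based standard error, here is a minimal sketch of a delete-one-block jackknife in Python. It is not the fourpop code: it uses the plain f4 estimator, the mean of (a - b)(c - d) over SNPs, on population allele frequencies, treats a shorter final block like any other, and all names and the toy inputs are assumptions for illustration.

```python
import math

def f4_block_jackknife(pa, pb, pc, pd, block_size=200):
    """Return (f4, standard error, Z) for f4(A,B;C,D) using SNP blocks."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    n = len(products)
    blocks = [products[i:i + block_size] for i in range(0, n, block_size)]
    g = len(blocks)
    if g < 2:
        raise ValueError("need at least two blocks for a jackknife")

    total = sum(products)
    estimate = total / n

    # Recompute the statistic with each block left out in turn.
    leave_one_out = [(total - sum(block)) / (n - len(block)) for block in blocks]
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((v - mean_loo) ** 2 for v in leave_one_out)
    se = math.sqrt(variance)

    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z

# Toy data: four SNPs in two blocks of two.
toy_f4, toy_se, toy_z = f4_block_jackknife(
    pa=[1, 1, 0, 0], pb=[0, 0, 0, 0],
    pc=[1, 1, 1, 1], pd=[0, 0, 0, 0],
    block_size=2,
)
```

Blocks of neighbouring SNPs are left out together because nearby SNPs are correlated, so treating each SNP as independent would understate the standard error.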

I used HGDP Sardinian, X, Han, San, with X being one of the following:
Armenian_D, Turkish_D, Russian_D, Polish_D, German_D, Irish_D, Greek_D, Finnish_D, Sicilian_D, Swedish_D, Portuguese_D, Lithuanian_D, Somali_D, AMHARA_Pa, Dai, Japanese, Kyrgyz_Bishkek_Ho, Mozabite, Bedouin, North_Italian, French_Basque, Tuscan, Russian, Orkney_1KG, Kent_1KG, Cornwall_1KG, Yoruba, Mbuti_Pygmies
As always, you can find a list of population sources at the bottom of the Dodecad blog.

As I have noted in my review of Moorjani et al., this test shows a superposition of a set of populations on the African-East Asian axis, so populations occupy different positions depending on whether they have African or East Asian admixture. It's a palimpsest. That paper ignored the Eastern ancestry in North Europeans, and used the CEU (a population of mainly North European origin) instead of Sardinians, hence generating inflated estimates of African ancestry in Southern Europeans.

Now that the Central/East Asian ancestry in northern Europeans seems to be recognized by some of the co-authors of the earlier paper, and using the Reich et al. (2012) framework, the different processes superimposed on the African-East Asian axis can probably be disentangled. Hopefully, we won't have to wait too long for the full treatment. Maybe it can go to the arXiv too!

In any case, here are the f4(Sardinian, X; Han, San) values for the different populations:

It is quite clear that North European populations are shifted towards East Asians. The Turkish_D sample, although not North European, is also so shifted, due to its Central Asian Turkic admixture. There are also a few cases of substantial African shift, such as the Bedouin and the Mozabite Berbers.

I have also rescaled the f4 statistics on a 0: Sardinian to 100: Japanese scale, including only West Eurasian populations that are East Asian-shifted relative to Sardinians:
I have calculated the correlation coefficient between the f4 statistics for this set of West_Eurasian populations and the sum of the Siberian+East Asian components of my K7b calculator on the same set of populations. This is +0.85, highly significant, and consistent with the idea that ADMIXTURE software and formal tests of admixture capture the same phenomenon. I also calculated the correlation coefficient with the Atlantic_Baltic component that is modal in Europeans, which is equal to +0.56 and confirms the higher East Eurasian shift in European populations.
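The rescaling and the correlation described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that the Sardinian end of the scale sits at an f4 value of zero (since f4(Sardinian, Sardinian; Han, San) is zero) and that the Japanese value is supplied as the other end.

```python
import numpy as np

def rescale_0_100(f4_values, f4_at_zero, f4_at_hundred):
    """Map f4 values linearly so that f4_at_zero -> 0 and f4_at_hundred -> 100."""
    f4_values = np.asarray(f4_values, dtype=float)
    return 100.0 * (f4_values - f4_at_zero) / (f4_at_hundred - f4_at_zero)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])
```

Given a vector of f4 statistics and a matching vector of summed ADMIXTURE component percentages for the same populations, `pearson_r` returns the kind of coefficient quoted in the text. Note that a linear rescaling of the f4 values does not change the correlation.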

(I have also repeated the above with the K12b calculator; the correlation between the f4 statistics and the Siberian+East Asian+Southeast Asian components is +0.76, and the North_European +0.84. The latter is higher than with the Atlantic_Baltic component (+0.56) which combines North and West European ancestry. It thus appears that the East Eurasian admixture in Europe is not a general feature of the oldest Europeans, but reflects a more recent phenomenon.)

Furthermore, I have carried out f4 regression ancestry estimation (Reich et al. 2009), using the f4(Sardinian, San; X, Han) statistics on the horizontal axis and the f4(Sardinian, X; San, Han) statistics on the vertical. An initial plot shows that while northern Europeans fall precisely on a line in this space, Turks and Armenians deviate substantially, and Greeks, Tuscans, and Basques less noticeably so:
The regression analysis shows a weak correlation (R^2=0.149). The southern Caucasoid populations from Armenia to Iberia appear roughly perpendicular to the north European cline, suggesting that they do not significantly partake in the same phenomenon as the northern groups.

I thus limited myself to the European populations falling between Russian and North_Italian, which form a near-perfect cline (R^2 = 0.9783).


Admixture estimation was performed on the triangle whose three corners were (from the regression equation):

LOW: (0.0151411, 0.0000000)
HIGH: (0.00000, 0.02049)

and

TEST: the f4 statistics for each test population
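The geometry of this step can be sketched as below. The line fit, its R^2, and the axis intercepts follow directly from the description: LOW lies on the horizontal axis and HIGH on the vertical axis, so they are read here as the two intercepts of the regression line. The last function is only one plausible reading of how a proportion is taken from the triangle, namely the fractional position of the test point's projection along the LOW-to-HIGH segment; that step is an assumption, and all function names are hypothetical.

```python
import numpy as np

def fit_cline(x, y):
    """Least-squares line y = intercept + slope * x, with its R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (intercept + slope * x)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(slope), float(intercept), float(1.0 - ss_res / ss_tot)

def axis_intercepts(slope, intercept):
    """Points where the fitted line crosses the horizontal and vertical axes."""
    low = np.array([-intercept / slope, 0.0])   # crossing of the horizontal axis
    high = np.array([0.0, intercept])           # crossing of the vertical axis
    return low, high

def position_on_cline(test, low, high):
    """Fraction of the way from LOW (0) to HIGH (1) of the projection of TEST.

    Assumed interpretation of the triangle step, not a documented procedure.
    """
    test = np.asarray(test, dtype=float)
    direction = np.asarray(high, dtype=float) - np.asarray(low, dtype=float)
    return float(np.dot(test - low, direction) / np.dot(direction, direction))

# Corner values quoted in the post.
LOW = np.array([0.0151411, 0.0])
HIGH = np.array([0.0, 0.02049])
```

Under this reading, a population whose pair of f4 statistics sits exactly at LOW scores 0, one at HIGH scores 1, and populations on the cline fall in between in proportion to their distance along it.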

Inferred admixture proportions using this method can be seen below:



I have repeated the above experiment using French_Basque instead of Sardinian, Mbuti instead of San, and Dai instead of Han:
Now, North_Italian shows no East-Eurasian-like admixture relative to French Basque, so there is one less row. The Basques appear to be Asian-shifted relative to Sardinians, so, overall, I would trust the former results more than the latter, but, in any case, the overall pattern seems fairly solid across a choice of reference populations.

While I would not take these results very literally in the absolute sense, I think they show quite well the relative ordering of populations, and are consistent with both my initial observation...
With respect to the Asian- and African- shift of West Eurasian populations, I note that northern Europeans (and Basques) are less African-shifted than southern Europeans, and, at the same time they are more Asian-shifted: the 16 least Asian-shifted populations have a coastline in the Mediterranean (excluding the Portuguese), while the 16 least African-shifted populations do not (excluding the French).
... as well as the remark in passing in the Reich et al. (2012) supplementary material mentioned above.

The prominent position that Sardinians have assumed in the genetic history of Europe puts in a new light the discovery of Veeramah et al. that Sardinians tend to be monomorphic at sites where mainland Europeans are polymorphic. It now appears that the reduced genetic polymorphism of Sardinians vis-à-vis mainland Europeans may not be due to them having undergone a "bottleneck" relative to mainland Europeans, but rather, at least in part, a consequence of admixture in the latter. Admixture matters.

Hopefully, geneticists will become more willing to interpret patterns of decreasing genetic diversity not only as a consequence of diversity-reducing "bottlenecks", but also of admixture in populations that are especially diverse.

In the case of Sardinians and Europeans we have been lucky in that East Asians continue to exist, helping us untangle their (or their relatives') contribution to the population history of Europe. But, in other cases (such as the introgression of archaic DNA into the modern human gene pool), latent population admixture between divergent populations may lead to misinterpretations of the direction of gene flow.

 The raw dump of fourpop output can be obtained from here (for Sardinian-Han-San) and here (for Basque-Mbuti-Dai).