Dienekes’ Anthropology Blog: Dodecad


March 01, 2015

Two observations on the ancestry of Armenians

I was thinking a bit about how to interpret the findings of the new Haber et al. preprint, and especially the idea that "29% of the Armenian ancestry may originate from an ancestral population best represented by Neolithic Europeans." I looked at the globe13 proportions, and strangely enough, I had estimated that the three Armenian samples (Armenian_D, Armenians, and Armenians_15_Y) have 28-29% of the Mediterranean component that is modal in Sardinians. This seems like a curious coincidence which has raised my confidence that Haber et al. is picking up on something real.

Looking back at my inferences of Armenian ancestry, it seems (according to globe13) to come completely from West_Asian, Mediterranean, and Southwest_Asian. The Mediterranean component seems real enough as it seems to match Sardinians/early European farmers well. I am not so sure about the Southwest Asian component which is modal in Yemen Jews and may represent population-specific drift in relatively recent Arabians. The West_Asian component is bimodal in Caucasus and Gedrosia, so it can't be the result of a very drifted population in either region (unless there is spooky action at a distance).

Another curious finding is the lack of North_European in a latitudinal "column" of populations from the Yemen, through the Levant to the South Caucasus (Georgians and Armenians). It seems that North_European is the only one of the four major Caucasoid components that Armenians lack to any important degree. There is a rather abrupt change between the South Caucasus (~1%) and the North Caucasus (15-20%). It seems that the Greater Caucasus did act like a barrier to gene flow. The K=4 analysis of the same dataset that produced K=13 (globe13) also shows the same barrier: all three Armenian samples and Georgians have ~0% of "Amerindian" (which is surely correlated to "Ancient North Eurasian" ancestry and via it with North_European), but North Caucasians and Europeans have 4-10%. It's clear that this influence did not cross the Greater Caucasus, as Armenians and Georgians lack it.

May 05, 2014

SPAMIX for spatial localization of admixed individuals

A new preprint on the bioRxiv suggests that it is possible to geographically localize a person's four grandparents. This is often a problem for persons of mixed ancestry, who tend to plot in PCAs at some average location between their ancestors (so someone who is Swedish+Italian+Spanish+Russian might end up somewhere in central Europe even though none of his ancestors are central European).

This has appeared shortly after the GPS method of Elhaik et al. (2014) which presents evidence of being more accurate than SPA, so it will be interesting to see a comparison between SPAMIX and GPS. My experience in the Dodecad Project suggests that this is a useful feature (the Dodecad Oracle could sometimes be used for this purpose and e.g., could infer that a person that had one Ashkenazi Jewish grandparent and 3 English ones was a ~3/4 British+~1/4 Jewish mix, but it is limited to mixtures of two populations, so it could not cope with the case of 3-4 grandparents with different origins). There is an under-appreciated pool of adoptees who would love a tool like that, and there are also obvious forensic implications if something like this really works.

bioRxiv doi: 10.1101/004713

Spatial localization of recent ancestors for admixed individuals

Wen-Yun Yang et al.

Ancestry analysis from genetic data plays a critical role in studies of human disease and evolution. Recent work has introduced explicit models for the geographic distribution of genetic variation and has shown that such explicit models yield superior accuracy in ancestry inference over non-model-based methods. Here we extend such work to introduce a method that models admixture between ancestors from multiple sources across a geographic continuum. We devise efficient algorithms based on hidden Markov models to localize on a map the recent ancestors (e.g. grandparents) of admixed individuals, joint with assigning ancestry at each locus in the genome. We validate our methods using empirical data from individuals with mixed European ancestry from the POPRES study and show that our approach is able to localize their recent ancestors within an average of 470Km of the reported locations of their grandparents. Furthermore, simulations from real POPRES genotype data show that our method attains high accuracy in localizing recent ancestors of admixed individuals in Europe (an average of 550Km from their true location for localization of 2 ancestries in Europe, 4 generations ago). We explore the limits of ancestry localization under our approach and find that performance decreases as the number of distinct ancestries and generations since admixture increases. Finally, we build a map of expected localization accuracy across admixed individuals according to the location of origin within Europe of their ancestors.


April 30, 2014

Nature Communications, the Genographic Project, Elhaik et al. re-discover zombies, the Oracle, etc. 3 years after the fact...

... and (sadly) do not care to cite my lowly blog.

From the new paper's Methods:

To infer the putative ancestral populations, we applied ADMIXTURE (ref. 46) in an unsupervised mode to the filtered data set. This analysis uses a maximum likelihood approach to determine the admixture proportions of the individuals in question assuming they emerged from K hypothetical populations. We speculated that our method will be the most accurate when populations have uniform admixture assignments. In choosing the value of K that seemed to best satisfy this condition, we experimented with different Ks ranging from 6 to 12. We identified a substructure at K=10 in which populations appeared homogeneous in their admixture composition. Higher values of K yielded noise that appeared as ancestry shared by very few individuals within the same populations. ADMIXTURE outputs the speculated allele frequencies of each SNP for each hypothetical population.

Using these data, we simulated 15 samples for each hypothetical population and plotted them in a PCA analysis with the Genographic populations. We observed that two hypothetical populations were markedly close to one another, suggesting they share the same ancestry and eliminated one of them to avoid redundancy. The remaining nine populations were considered the putative ancestral populations and were used in all further analyses.

Given nine admixture proportions for a sample of unknown geographic origin obtained using ADMIXTURE’s supervised approach with the nine putative ancestral populations, we calculated the Euclidean distance between its admixture proportions and the N reference populations (GEN). All reference populations were sorted in an ascending order according to their genetic distance from the sample.

I'm sure my readers and users of DIYDodecad know exactly why this is a carbon copy of the tools I developed for the Dodecad Project. But, in any case...

The most exciting use of "zombies" is to convert unsupervised ADMIXTURE runs into supervised ones. In unsupervised mode, ADMIXTURE treats all individuals alike, and tries to infer their ancestral proportions. In supervised mode, some individuals are treated as "fixed" (belonging 100% in one of K ancestral components), and the ancestry of the rest is inferred.

The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; whenever you want to estimate admixture proportions in new samples run supervised ADMIXTURE analysis using the zombie populations.
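As a rough illustration of the zombie-generation step, here is a minimal Python sketch. It assumes that the two alleles at each SNP are drawn independently from the component's allele frequency; the function name simulate_zombies and the five frequencies are made up for the example and are not read from real ADMIXTURE output.

```python
import random

def simulate_zombies(allele_freqs, n_individuals=15, seed=0):
    """Simulate diploid genotypes for one ancestral component.

    allele_freqs: frequency of the counted allele at each SNP.
    Returns a list of genotypes, each a list of 0/1/2 allele counts.
    """
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP give a count of 0, 1 or 2.
        genotype = [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        zombies.append(genotype)
    return zombies

# Hypothetical allele frequencies of one component at five SNPs.
component_freqs = [0.0, 0.25, 0.5, 0.9, 1.0]
zombies = simulate_zombies(component_freqs, n_individuals=15)
for genotype in zombies[:3]:
    print(genotype)
```

Writing the simulated genotypes out in a format that ADMIXTURE can read, and marking them as fixed reference individuals for a supervised run, is left out of the sketch.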

... and the first post on the Oracle which shows how to find proximity to a population by calculating Euclidean distance in the space of admixture proportions between reference populations and a test individual (and also considers mixtures of populations).
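The distance calculation can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the three reference populations and their proportions are invented, and the grid search over two-way mixtures is one simple way of considering mixtures, not necessarily the way the Oracle does it.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two vectors of admixture proportions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rank_by_distance(sample, references):
    """Reference populations sorted by ascending distance to the sample."""
    scored = [(euclidean(sample, props), name) for name, props in references.items()]
    return sorted(scored)

def best_two_way_mix(sample, references, steps=20):
    """Best-fitting mixture w*A + (1-w)*B over all pairs, with w on a grid."""
    best = None
    for (name_a, a), (name_b, b) in combinations(references.items(), 2):
        for i in range(steps + 1):
            w = i / steps
            mix = [w * x + (1 - w) * y for x, y in zip(a, b)]
            d = euclidean(sample, mix)
            if best is None or d < best[0]:
                best = (d, name_a, name_b, w)
    return best

# Hypothetical reference populations described by three components.
references = {
    "Pop_A": [0.8, 0.2, 0.0],
    "Pop_B": [0.1, 0.1, 0.8],
    "Pop_C": [0.3, 0.6, 0.1],
}
# A test individual built as 3/4 Pop_A + 1/4 Pop_B.
sample = [0.625, 0.175, 0.2]

ranking = rank_by_distance(sample, references)
best = best_two_way_mix(sample, references)
print(ranking)
print(best)
```

In this toy case the single closest population is Pop_A, while the two-way search recovers the 3/4 + 1/4 mixture, which is the same kind of inference as the one-grandparent example mentioned in the SPAMIX post.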

I am flattered that the zombie approach has been copied and tested, but I doubt that all of the paper's 32 authors were unaware of the previous publication of the gist of their "new" method.

Nature Communications 5, Article number: 3513 doi:10.1038/ncomms4513

Geographic population structure analysis of worldwide human populations infers their biogeographical origins

Eran Elhaik et al.

The search for a method that utilizes biological information to predict humans’ place of origin has occupied scientists for millennia. Over the past four decades, scientists have employed genetic data in an effort to achieve this goal but with limited success. While biogeographical algorithms using next-generation sequencing data have achieved an accuracy of 700 km in Europe, they were inaccurate elsewhere. Here we describe the Geographic Population Structure (GPS) algorithm and demonstrate its accuracy with three data sets using 40,000–130,000 SNPs. GPS placed 83% of worldwide individuals in their country of origin. Applied to over 200 Sardinians villagers, GPS placed a quarter of them in their villages and most of the rest within 50 km of their villages. GPS’s accuracy and power to infer the biogeography of worldwide individuals down to their country or, in some cases, village, of origin, underscores the promise of admixture-based methods for biogeography and has ramifications for genetic ancestry testing.


April 10, 2013

Closed-access story about DIY analysis tools

I find it a little odd that this story about DIY analysis tools, which (apparently) includes some quotes by myself, has now appeared in a closed-access publication. Had I known that to be the case, I doubt that I would have offered any response. It's probably not too late to make that item open access.

In any case here's what I had to say (in full) to the author of the piece:

I think that a plurality of tools from a number of different analysts is an unambiguously good thing, both for the creators of these tools and their users.

For the users it is good because they can obtain different assessments of their ancestry, so they learn to be skeptical of extraordinary or unexpected claims of any particular test, and also to be more convinced of results that recur across many different tests.

For the creators it is good because of both (i) the motivation to improve their tools driven by competition with other test creators, and also (ii) the feedback they get from users of their tests.

These tools are also good for science in general, because a plurality of eyes (test creators and users) examine genetic data trying to detect interesting patterns in them that might be missed by more narrowly-focused research. So, a whole ecosystem of ideas springs up from these tests, as people try to fit their results into a broader pattern of human history. This is complementary to academic research: less structured and more "noisy" in terms of ideas that don't pan out, but also more dynamic, fast-paced and democratic.

As for Dodecad, I have developed my calculators by utilizing standard population genetics software, as well as software developed by myself, making use of publicly accessible academic datasets together with data from volunteers; the latter is very useful, because it helps me fill in gaps in population coverage: either because some populations have not been sampled in the literature yet, or, if they have, because their data is not publicly accessible to everyone.

December 10, 2012

On the South Asian (?) ancestry of Daniel MacArthur

Razib investigates an unexpected region of South Asian admixture in Daniel MacArthur of Genomes Unzipped, and wonders why this has never been found before, despite the fact that his data has been publicly available for a while.

I was surprised about this myself, since I had studied this data when I was starting my ADMIXTURE experiments a couple of years ago. But looking back at that old experiment, it's immediately clear why Dr. MacArthur's column (highlighted) showed no evidence of South Asian admixture at the time: there was no South Asian ancestral population in that reference set!

Naturally, I was curious to see what would turn up if I ran this sample again through my most recent globe13 calculator, which I did using the "bychr" mode of DIYDodecad, which treats each of the 22 autosomes separately:

A clear outlier is indeed shown on chr10 which shows 20.51% "South_Asian" admixture; most of the other chromosomes lack this altogether, so this seems like a legitimate signal of admixture.

I next used the "byseg" mode of DIYDodecad in order to localize this admixture signal within chr10 and study it further. I also used the paint_byseg script to show how the top-4 components within chr10 varied along the length of the chromosome:

It does appear that a good portion of the first half of chr10 has "South_Asian" ancestry, with the signal close to ~50%, which is a fairly good indication that one half of the diploid genome in this region has this type of ancestry.

Interestingly, the South_Asian signal does not appear "constant" along this portion, but in some of its troughs, the "West_Asian" component shows a corresponding local peak. Now, this might be the case of one really long segment of ancestry which is interpreted sometimes as South_Asian, sometimes as West_Asian by the software, given that the South_Asian component inferred by ADMIXTURE is a composite of West_Asian-like Ancestral North Indians (ANI), and Ancestral South Indians (ASI). But, we can investigate this further by using globe4, which looks at the same chromosome at a lower level of resolution:

It does appear to me that a fairly convincing "Asian" signal exists in a good portion of this region. Note that "Asian" within the context of globe4 is a combination of East/South Eurasians and even Australasians; it is a generalized "Asian" component that captures some of the common ancestry of these populations.

So, on balance I would say that there does indeed appear to be evidence of South Asian ancestry within chr10 for this sample, and, moreover, this type of South Asian ancestry is probably partly ASI-related.

December 03, 2012

'globe13anc' calculator with chimp outgroup

I was thinking a bit about my suggestion to use Palaeo_African as an outgroup for D-statistic calculations using my new admixtureDstat script, and it occurred to me that it would be fairly easy to modify one of my calculators to include a sample that is indeed symmetrically related to all modern human groups.

To do this, I created an individual possessing the ancestral allele using hgdpGeo as a reference. According to the reference for this table:

Samples collected by the HGDP-CEPH from 1,043 individuals from around the world were genotyped for 657,000 SNPs at Stanford. Ancestral states for all SNPs were estimated using whole genome human-chimpanzee alignments from the UCSC database. For each SNP in the human genome (NCBI Build 35, UCSC database hg17), the allele at the corresponding position in the chimp genome (Build 2 version 1, UCSC database pantro2) was used as ancestral.

My new globe13anc calculator is simply a version of the latest globe13 one, but with an extra "Ancestral" component, so it has 13+1 = 14 ancestral components in total.
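The construction of such an individual can be sketched as follows in Python. This is an illustration only: the SNP identifiers, the two-column input, and the function name are hypothetical and do not reflect the actual layout of the hgdpGeo table.

```python
VALID_ALLELES = {"A", "C", "G", "T"}

def make_ancestral_individual(snps):
    """Return (snp_id, genotype) pairs homozygous for the ancestral allele.

    snps: iterable of (snp_id, ancestral_allele) tuples. SNPs whose
    ancestral state is missing or ambiguous are skipped.
    """
    individual = []
    for snp_id, ancestral in snps:
        if ancestral in VALID_ALLELES:
            individual.append((snp_id, ancestral + ancestral))
    return individual

# Hypothetical SNP identifiers and ancestral states, for illustration only.
example_snps = [("rs0001", "A"), ("rs0002", "G"), ("rs0003", "N"), ("rs0004", "T")]
ancestral_individual = make_ancestral_individual(example_snps)
for snp_id, genotype in ancestral_individual:
    print(snp_id, genotype)
```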

You can of course use globe13anc as any other calculator designed for DIYDodecad, and hopefully no one will get anything other than 0% for the "Ancestral" component :)

But, the main point of building this is to help you infer D-statistics with no suspicion that gene flow within the human species may affect the results; while the Khoesan of South Africa (where the Palaeo_African component is modal) are an approximate outgroup to the rest of mankind, there is evidence that even their most isolated groups have some external gene flow. So, using this "Ancestral" outgroup instead of Palaeo_African ought to make things cleaner for everyone.

December 02, 2012

D-statistics on ADMIXTURE components

One of the most persistent questions I get as admin of the Dodecad Project is whether some low level of admixture (e.g., 0.7%) of some ancestral component is "noise" or "real".

I have hitherto advised all those who contacted me about this issue to (i) treat low levels of admixture with suspicion, and (ii) run DIYDodecad in byseg mode; this might show whether this type of admixture is concentrated in some specific long segments, and is thus more likely to be "real" recent ancestry than low-level noise sprinkled across the genome that is more difficult to interpret.

Nonetheless, this was always unsatisfying to me, because it did not provide a way of quantifying one's confidence in the "reality" of the admixture evidence. Thus, I developed admixtureDstat.r, an R script which calculates D-statistics of the form:

D(Pop1, Individual; Pop3, Outgroup)

If the individual can be seen as being drawn from population Pop1 but with some admixture from population Pop3, then this statistic will take significant negative values. For example, suppose that your main admixture component is "North_European", but you also have 1% "Siberian" admixture. You would want to calculate the following statistic:

D(North_European, YOU; Siberian, Palaeo_African)

which would tell you whether the Siberian admixture is "real" or not. (Of course, things are more complicated for those who might have both Siberian and African admixture, in which case their Siberian admixture would tend to make the D-statistic negative, and the African one positive, with the end result being a balance of the two processes).

There are of course many subtleties in the interpretation of D-statistics and I refer you to Green et al. (2010), Durand et al. (2011), and Patterson et al. (2012) for some of the technical details.
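For readers who want to see the arithmetic, here is a minimal Python sketch of a D-statistic computed from allele frequencies, using the form given in Patterson et al. (2012): the sum of (w - x)(y - z) over SNPs, divided by the sum of (w + x - 2wx)(y + z - 2yz). Whether admixtureDstat.r uses exactly this estimator is an assumption, and the four-SNP frequencies below are invented. The individual is represented by expected allele frequencies; for a real person these would be genotype counts divided by two.

```python
def d_statistic(pop1, individual, pop3, outgroup):
    """D(Pop1, Individual; Pop3, Outgroup) from per-SNP allele frequencies."""
    numerator = 0.0
    denominator = 0.0
    for w, x, y, z in zip(pop1, individual, pop3, outgroup):
        numerator += (w - x) * (y - z)
        denominator += (w + x - 2 * w * x) * (y + z - 2 * y * z)
    return numerator / denominator

# Hypothetical allele frequencies at four SNPs.
pop1 = [0.1, 0.2, 0.9, 0.8]
pop3 = [0.9, 0.8, 0.1, 0.2]
outgroup = [0.5, 0.5, 0.5, 0.5]

unadmixed = list(pop1)                                        # drawn purely from Pop1
admixed = [0.75 * w + 0.25 * y for w, y in zip(pop1, pop3)]   # 25% Pop3 admixture

d_unadmixed = d_statistic(pop1, unadmixed, pop3, outgroup)
d_admixed = d_statistic(pop1, admixed, pop3, outgroup)
print(d_unadmixed, d_admixed)
```

In this toy case the unadmixed individual gives D = 0, and adding Pop3 ancestry pushes the statistic below zero, which is the sign convention described above.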

Using the script is quite simple, and only requires that you have R installed on your computer:

Download standardize.r and admixtureDstat.r from here, saving them into some directory on your computer (henceforth, we will call this the "working directory"). If you have Genographic 2.0 data, you should also download hgdp.base.txt.
Unzip your raw genotype data (from 23andMe, Family Finder, or Genographic 2.0) into the working directory.
Launch R and change the directory into the working directory (using the Menu in Windows, or setwd() in Unix-like operating systems). Enter in one line:

source('admixtureDstat.r'); source('standardize.r')

In R, enter the command:

standardize('johndoe.txt', company='23andMe')

The above command will convert your data into a format understood by my script, writing a genotype.txt file in the working directory. You should change johndoe.txt to whatever your unzipped raw data file is called, and the company should be one of '23andMe', 'ftdna', 'geno2', or 'geno2new', depending on the source of your data. If you have used DIYDodecad before, you have already created a genotype.txt file, so you can skip this step.
Finally, you should have the four calculator files (with endings .par, .txt, .alleles, and .F) in the working directory. You can, for example, use the calculator files of globe13, or if you have experience working with ADMIXTURE, you may make your own using your dataset. The .txt file will contain the names of the ancestral populations that you can use, so make sure you type them correctly if you decide to choose "listfile" mode (see below).
You are now all set to use the script! You can do this in either of two ways:

(1) outgroup mode:

In this mode, you specify an outgroup, i.e., one of the populations from the calculator, and the program cycles through all possible (Pop1, Pop3) pairs, outputs the D-statistics to the screen as it calculates them, and finally writes them to a dstat.txt file in the working directory.

To use this mode, you simply type:

admixtureDstat(parfile="globe13.par", outgroup="Palaeo_African")

The use of "Palaeo_African" as an outgroup is a reasonable choice for most non-Africans, since these are unlikely to have recent admixture from Sub-Saharan hunter-gatherer groups in which this component is represented.

Note that many of the D-statistics produced this way may have little meaning for you. For example, a person that is mostly European will get a very negative statistic of the form:

D(West_African, YOU; East_Asian, Palaeo_African)

But this will have little to do with your potential West_African or East_Asian ancestry, but rather with the relationships of populations (e.g., Europeans being more closely related to East Asians than West Africans). A little West_African/East_Asian ancestry will increase/decrease the value of this statistic, which will, however, remain strongly negative.

Instead, you should look at D-statistics that might be meaningful to you, e.g., if the following is negative:

D(North_European, YOU; East_Asian, Palaeo_African)

Then you might have some real East_Asian admixture.

(2) listfile mode:

In this mode, you write all the D-statistics that interest you in a simple text file, e.g., listDstat.txt, in the order Pop1, Pop3, Outgroup, e.g.:

Mediterranean North_European Palaeo_African
North_European Siberian Palaeo_African
North_European West_Asian Palaeo_African

A reasonable choice is to calculate D-statistics where Pop1 is your most important component, e.g., North_European for someone from Finland, and Pop3 is a minor component whose "reality" you seek to investigate, e.g., Siberian.

Using the listfile mode will take less time (because you calculate a subset of D-statistics), and can be invoked as follows:

admixtureDstat(parfile="globe13.par", listfile="listDstat.txt")
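If you have many minor components, the list can be generated rather than typed. The Python sketch below takes the largest component as Pop1 and pairs it with every component below a cutoff; the function name, the 5% cutoff, and the proportions are all hypothetical choices made for illustration.

```python
def listfile_lines(proportions, outgroup, minor_cutoff=5.0):
    """Build 'Pop1 Pop3 Outgroup' lines for each minor component.

    proportions: dict of component name -> percentage for one individual.
    Pop1 is the largest component; Pop3 runs over every component with a
    non-zero percentage below minor_cutoff.
    """
    major = max(proportions, key=proportions.get)
    lines = []
    for component, percent in proportions.items():
        if component != major and 0.0 < percent < minor_cutoff:
            lines.append(f"{major} {component} {outgroup}")
    return lines

# Hypothetical results for a mostly North_European individual.
proportions = {
    "North_European": 92.0,
    "Siberian": 1.0,
    "West_Asian": 3.5,
    "Mediterranean": 3.5,
    "East_Asian": 0.0,
}
lines = listfile_lines(proportions, outgroup="Palaeo_African")
print("\n".join(lines))
```

The printed lines can be saved as listDstat.txt in the working directory and passed to the script through the listfile argument.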

Z-scores

The significance of D-statistics is assessed by the Z-scores, which are the last column of the output. If a Z-score is greater than 3 in absolute value (i.e., less than -3 or greater than 3), then the corresponding D-statistic is significant.

Other details:

There are some additional options you might use. For example

admixtureDstat(parfile="globe13.par", listfile="listDstat.txt", k=1000)

will use 1,000 SNPs for the block jackknife instead of the default 500. In general, there is little reason to mess with this parameter.
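To make the role of the block size concrete, here is a minimal Python sketch of a delete-one-block jackknife for a D-statistic. It assumes that k is the number of consecutive SNPs per block and that blocks are treated as equally weighted; the script's actual jackknife may differ in detail. The allele frequencies are simulated with a fixed seed and do not come from any calculator.

```python
import math
import random

def d_terms(pop1, individual, pop3, outgroup):
    """Per-SNP numerator and denominator terms of the D-statistic."""
    rows = list(zip(pop1, individual, pop3, outgroup))
    nums = [(w - x) * (y - z) for w, x, y, z in rows]
    dens = [(w + x - 2 * w * x) * (y + z - 2 * y * z) for w, x, y, z in rows]
    return nums, dens

def jackknife_z(nums, dens, k=500):
    """Return (D, standard error, Z) using blocks of k consecutive SNPs."""
    total_num, total_den = sum(nums), sum(dens)
    d_full = total_num / total_den
    leave_one_out = []
    for start in range(0, len(nums), k):
        block_num = sum(nums[start:start + k])
        block_den = sum(dens[start:start + k])
        # D recomputed with this block removed.
        leave_one_out.append((total_num - block_num) / (total_den - block_den))
    n = len(leave_one_out)
    mean = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean) ** 2 for d in leave_one_out)
    se = math.sqrt(variance)
    return d_full, se, d_full / se

# Simulated frequencies for 5,000 SNPs (illustration only).
rng = random.Random(42)
n_snps = 5000
pop1 = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
pop3 = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
outgroup = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
# Individual modeled as 75% Pop1 + 25% Pop3 (expected frequencies).
individual = [0.75 * w + 0.25 * y for w, y in zip(pop1, pop3)]

nums, dens = d_terms(pop1, individual, pop3, outgroup)
d, se, z = jackknife_z(nums, dens, k=500)
print(d, se, z)
```

Larger blocks give fewer leave-one-out estimates and hence a noisier standard error, while smaller blocks may fail to account for correlation between neighboring SNPs.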

The screen output might be too wide for your R window, and you can fix this prior to running admixtureDstat by entering something like options(width=300) which allows more characters per line of screen output. In any case, you can see the program's output nicely formatted in the dstat.txt file in the working directory after it completes its run.

AN EXAMPLE

I will give an example of program usage using globe13 results. Take individual DOD133 whose results are seen below:

This individual is mostly Mediterranean (52.5%) and North_European (42%), but with small percentages of Amerindian (1.1%), Southwest_Asian (1.5%), Arctic (0.3%), and South_Asian (1.7%).

First, I calculate D(Mediterranean, DOD133; North_European, Palaeo_African) and D(North_European, DOD133; Mediterranean, Palaeo_African) to confirm the major admixture between Mediterranean and North_European. In listfile mode, I put the following in the listDstat.txt file:

Mediterranean North_European Palaeo_African
North_European Mediterranean Palaeo_African

The results are as follows:

Pop1 Pop3 Outgroup Dstat Z
Mediterranean North_European Palaeo_African -0.02399 -11.2
North_European Mediterranean Palaeo_African -0.033 -15.06

Ok, this confirms that DOD133 does indeed appear to be a mixture of North_European and Mediterranean. Now, let's take one of the minor components, e.g., South_Asian, and put the following in the listDstat.txt file:

Mediterranean South_Asian Palaeo_African
North_European South_Asian Palaeo_African

The results are now:

Pop1 Pop3 Outgroup Dstat Z
Mediterranean South_Asian Palaeo_African -0.01268 -5.98
North_European South_Asian Palaeo_African 0.00202 0.96

A possible interpretation for this pattern is that the individual does have some South_Asian-like admixture that is lacking in his Mediterranean component. Perhaps this reflects an ancient Central Asian population that migrated into both northern Europe and south Asia; some alleles from this population were incorporated into the Northern European gene pool, thus becoming part of what it means to be "northern European", so the evidence for admixture does not exist in the {North_European, South Asian} pair, since both of these contain gene flow from our hypothetical Central Asian population. There are many ways to interpret the observed patterns, and using admixtureDstat you can explore some of them.

Now, let's take another minor component, Arctic (0.3%):

Pop1 Pop3 Outgroup Dstat Z
Mediterranean Arctic Palaeo_African -0.02413 -9.48
North_European Arctic Palaeo_African 0.01028 3.95

This is an interesting pattern; the individual appears admixed with Arctic relative to Mediterranean, but North_European appears to be more Arctic than DOD133. A possible explanation is that this Arctic component represents ancestry that was mediated by a north European population, which, as Patterson et al. (2012) have shown, contains some "north Eurasian" ancestry.

Finally, let's take the Southwest_Asian minor component (1.5%), where the reverse situation applies:

Pop1 Pop3 Outgroup Dstat Z
Mediterranean Southwest_Asian Palaeo_African 0.00496 2.4
North_European Southwest_Asian Palaeo_African -0.01075 -5.05

So, in this case, this might represent ancestry common between Mediterranean and Southwest_Asian that contrasts with the North_European portion of the individual's genome.

I won't pretend that interpreting D-statistics is easy, but they are certainly a nice exploratory tool to have in one's arsenal, and I hope that they will prove useful.

TERMS OF USE: You are free to use and modify this tool for any non-commercial purpose, as long as you provide a link to Dienekes' Anthropology Blog or this blog post when you do so. You should probably also cite one of the aforementioned papers where D-statistics were discussed, as well as the ADMIXTURE paper.

UPDATE (Dec 3): You might want to try D-statistics using globe13anc, a new calculator that includes an Ancestral (chimp) outgroup.

November 30, 2012

Using Genographic 2.0 data with DIYDodecad

I have released a converter for Genographic 2.0 data at the Dodecad blog. This will allow you to use DIYDodecad with your Genographic 2.0 raw data download.

November 24, 2012

Assessment of Totonac and Bolivian samples using 'globe13'

I was on the lookout for some Affy 6.0 samples recently, and I discovered the data of the recent Watkins et al. (2012) paper, so I decided to run them through my globe13 calculator. A total of 49,233 SNPs were in common between that and my globe13 set, which is not much, but ought to be sufficient to discover the main features of these two population samples.

It appears that both samples are mainly "Amerindian", with the Bolivian sample having some more European admixture than the Totonac one.

Here are the population portraits, clearly showing that the "European" admixture in Bolivians comes from a subset of individuals.

For comparison, here are the ADMIXTURE results from the original paper that appear quite similar to my own. (Note that the individual ordering is probably not the same as my own):

The Mediterranean/North_European ratio of my own analysis suggests the likely "southern" (probably Spanish) origin of the European admixture in these populations.

UPDATE:

I also combined the two Amerindian populations with HGDP Karitiana, Sardinian, and French to calculate f3-statistics. Here are the significant ones:

So, admixture in the Bolivian sample is confirmed, while in the Totonac one it is not. I do think it's possible, though, that the Totonac have a little European admixture that is masked by their history of drift. Also notice the evidence for admixture in the French using all three Amerindian samples, with the lowest f3(French; Amerindian, Sardinian) obtained using the Karitiana reference.

October 29, 2012

Assessment of ancient European DNA with 'globe13'

Here is my assessment of ancient DNA from Europe using the globe13 calculator:

You can consult the spreadsheet for the distribution of these components in modern populations. As in previous analyses, the main distinction is between a Northern European-like Mesolithic population (Ajv52, Ajv70, and Bra1) and a Mediterranean-like Neolithic one (Oetzi and Gok4).

October 27, 2012

Inter-relationships between 'world' components

In a previous post I calculated f3-statistics between my K=7 and K=12 ancestral components. The basic idea is to discover which component A can be seen as a mixture of two other components, B and C, in which case (assuming A does not have excessive drift), we expect a negative f3(A; B, C) statistic.
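The sign behavior is easy to see in a small Python sketch. It uses the plain definition of f3(A; B, C) as the average over SNPs of (a - b)(a - c) on allele frequencies, without the finite-sample corrections that dedicated software applies, and the four-SNP frequencies are invented for illustration.

```python
def f3(target, source1, source2):
    """Plain f3(A; B, C): mean over SNPs of (a - b) * (a - c)."""
    terms = [(a - b) * (a - c) for a, b, c in zip(target, source1, source2)]
    return sum(terms) / len(terms)

# Hypothetical allele frequencies for components B and C at four SNPs.
comp_b = [0.1, 0.3, 0.8, 0.6]
comp_c = [0.7, 0.9, 0.2, 0.4]
# Component A built as an even mixture of B and C, with no drift of its own.
comp_a = [0.5 * b + 0.5 * c for b, c in zip(comp_b, comp_c)]

f3_mixed = f3(comp_a, comp_b, comp_c)    # A tested as a mixture of B and C
f3_source = f3(comp_b, comp_a, comp_c)   # B tested as a mixture of A and C
print(f3_mixed, f3_source)
```

Here the mixed component gives a negative value and the source component a positive one. Drift in A after the mixture adds a positive term, which is why a non-negative f3 does not rule out admixture.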

As part of my analysis of the world dataset, I calculated f3-statistics for each of the K=3 to K=12, that is, for some K, I tried to see if one of the K inferred components could be seen as a mixture of the remaining K-1. It turns out that no negative f3 statistics appeared at all, and this suggests that the components inferred by ADMIXTURE at each K tend to form an "orthogonal" set that are not mixtures of each other.

More generally, we can calculate f3 statistics where A, B, and C are components inferred from any of the K=3 to K=12 runs. There is a total of 75 such components, and hence 75*(74 choose 2) = 202,575 such f3 statistics. Since calculating these would take a while (and would become intractable as K increases further), I decided to calculate pairwise f3 statistics, i.e., statistics where A, B, and C are constrained to be from successive K, K+1 runs. The significant results can be seen in the spreadsheet.
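The count is easy to verify. A short Python check, which uses only the standard library:

```python
from math import comb

ks = range(3, 13)                  # K = 3, 4, ..., 12
n_components = sum(ks)             # each run at K contributes K components
# Choose the target A, then an unordered pair {B, C} from the remaining ones.
n_f3 = n_components * comb(n_components - 1, 2)
print(n_components, n_f3)          # 75 202575
```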

It might be worthwhile to develop an automated way of using these statistics to guide us in the interpretation of ADMIXTURE components. But, they are useful, in any case, as a source of information.

For example, consider the following (the third column represents the mixed population):

Atlantic_Baltic_6/globe6_Z Near_East_6/globe6_Z European_5/globe5_Z -0.013911 0.000084 -166.457

This means that the European component at K=5 can be seen as a mix of the Atlantic_Baltic and Near_East components at K=6. So, this suggests that the European component can be seen as "secondary", the product of admixture. But:

European_5/globe5_Z Amerindian_5/globe5_Z Atlantic_Baltic_6/globe6_Z -0.003964 0.000175 -22.588

This indicates conversely that the Atlantic_Baltic component at K=6 can be seen as a mix of the European and Amerindian components at K=5.

It would be very interesting to use f-statistics to guide one in the choice of an "orthogonal" set of ancestral populations, or to summarize the relationships between them in tree or network form. One could potentially use my ADMIXTURE to TreeMix script to do something like this, although as K increases, there is a combinatorial explosion in the total number of components with a probable runtime slowdown/memory usage blowup which might render this approach unusable, at least for large K.

October 23, 2012

Ancient European DNA assessment with 'globe10'

I had previously assessed the same using globe4. See post on globe10 and associated spreadsheet.

The results appear similar to previous analyses overall, with the main features being the presence of "Southern" in Neolithic farmers (which peaks in the Near East), and its absence in hunter-gatherers. Some of the "Amerindian"-like admixture that was evident in globe4 has been "absorbed" by the Atlantic_Baltic (main European) component, but it is interesting that the Swedish hunter-gatherers (Ajv52/Ajv70) continue to show some Amerindian as well as other eastern (Australasian/South Asian) admixture that is lacking in the other samples. These individuals are outside the range of modern populations, but they overall tend to map to the most similar Atlantic_Baltic component with the addition of some eastern influences.

Also of interest is the fact that Oetzi is the only sample which shows a slice of West Asian (5.7%) admixture in this analysis. This was also the case in the previous one using K7b (1.4%). Gok4, on the other hand, the fellow Neolithic individual from Sweden, seems to lack this. The arrangement of the Big Three West Eurasian components (Southern/West Asian/Atlantic_Baltic) has subtly changed in this calculator, but it would be tempting, nonetheless, to see in the little West Asian admixture that Oetzi has but Gok4 and the Mesolithic samples seem to lack, something of the vanguard of the arrival of the West Asian component in Europe. Obviously more samples are needed, including ones from the most interesting regions of the Balkans and Anatolia.

September 14, 2012

Inter-relationships between Dodecad K7b and K12b components

In a previous post I used leave-one-out to show how components inferred by ADMIXTURE could be related to each other.

One of the "problems" with ADMIXTURE and related analyses is that as the number of components K increases, additional components are formed by merging and/or splitting of components at lower K.

But, it turns out that thanks to the supervised mode, we can look at how components at different K are related to each other: we can treat, e.g., the K=12 ancestral populations as test data with the K=7 ancestral populations as references and vice versa.

I carried out precisely this procedure for my K7b/K12b components.

Below are the K12b components expressed as mixtures of the K7b ones:

And, the K7b ones expressed as mixtures of the K12b ones:

I have also calculated f3 statistics (using threepop) for all population triples using the K7b/K12b calculators. Most of the mixes inferred by ADMIXTURE appear significant, although I didn't hand-check each one. I report the significant ones below:

Population f3(A; B, C) s.e. Z-score

Atlantic_Baltic_K7b;Atlantic_Med_K12b,North_European_K12b -0.00287483 2.64051e-05 -108.874
African_K7b;East_African_K12b,Sub_Saharan_K12b -0.00241502 2.3253e-05 -103.858
East_Asian_K7b;East_Asian_K12b,Southeast_Asian_K12b -0.00218574 2.17614e-05 -100.441
Caucasus_K12b;West_Asian_K7b,Southern_K7b -0.00317634 4.12205e-05 -77.0573
West_Asian_K7b;Gedrosia_K12b,Caucasus_K12b -0.00209044 3.14454e-05 -66.4785
Siberian_K7b;East_Asian_K12b,Siberian_K12b -0.00166911 2.60228e-05 -64.1403
South_Asian_K7b;Gedrosia_K12b,South_Asian_K12b -0.00195015 3.35149e-05 -58.1876
East_Asian_K12b;East_Asian_K7b,Siberian_K7b -0.00191747 3.49244e-05 -54.9034
Atlantic_Baltic_K7b;Southern_K7b,North_European_K12b -0.00181747 3.63948e-05 -49.9377
East_African_K12b;Southern_K7b,African_K7b -0.00412496 0.000101701 -40.5598
Atlantic_Med_K12b;Southern_K7b,Atlantic_Baltic_K7b -0.00138679 3.68608e-05 -37.6222
East_Asian_K7b;Southeast_Asian_K12b,Siberian_K7b -0.00127133 3.92998e-05 -32.3495
Northwest_African_K12b;Southern_K7b,Sub_Saharan_K12b -0.00272013 0.000110067 -24.7133
Northwest_African_K12b;Southern_K7b,African_K7b -0.00255262 0.000107527 -23.7394
East_African_K12b;African_K7b,Atlantic_Med_K12b -0.00237833 0.000107306 -22.1639
East_African_K12b;African_K7b,Caucasus_K12b -0.00217732 0.000101003 -21.557
Caucasus_K12b;West_Asian_K7b,Atlantic_Med_K12b -0.000977923 4.573e-05 -21.3847
Caucasus_K12b;West_Asian_K7b,Northwest_African_K12b -0.00100154 4.86387e-05 -20.5915
East_African_K12b;Southern_K7b,Sub_Saharan_K12b -0.00247983 0.000122139 -20.3034
Caucasus_K12b;Southern_K7b,Gedrosia_K12b -0.00112749 5.91335e-05 -19.0669
East_Asian_K12b;Southeast_Asian_K12b,Siberian_K7b -0.00100305 5.44851e-05 -18.4097
Atlantic_Baltic_K7b;North_European_K12b,Caucasus_K12b -0.000534432 2.98199e-05 -17.922
Southern_K7b;Southwest_Asian_K12b,Atlantic_Med_K12b -0.000683711 4.08148e-05 -16.7515
East_Asian_K12b;East_Asian_K7b,Siberian_K12b -0.000651854 4.01206e-05 -16.2474
African_K7b;Gedrosia_K12b,Sub_Saharan_K12b -0.000738345 4.5676e-05 -16.1648
African_K7b;Southern_K7b,Sub_Saharan_K12b -0.000769896 4.8516e-05 -15.8689
South_Asian_K7b;South_Asian_K12b,Northwest_African_K12b -0.000598387 3.84069e-05 -15.5802
African_K7b;Sub_Saharan_K12b,Northwest_African_K12b -0.000602378 4.07154e-05 -14.7948
East_African_K12b;African_K7b,Southwest_Asian_K12b -0.00141216 0.000102079 -13.834
African_K7b;Sub_Saharan_K12b,North_European_K12b -0.000663712 4.87314e-05 -13.6198
African_K7b;South_Asian_K7b,Sub_Saharan_K12b -0.000598399 4.51811e-05 -13.2445
Southern_K7b;Southwest_Asian_K12b,Northwest_African_K12b -0.000577559 4.50096e-05 -12.8319
Siberian_K7b;East_Asian_K7b,Siberian_K12b -0.000403499 3.17418e-05 -12.7119
Atlantic_Baltic_K7b;West_Asian_K7b,Atlantic_Med_K12b -0.000520714 4.41022e-05 -11.807
East_African_K12b;African_K7b,Atlantic_Baltic_K7b -0.00122819 0.000106897 -11.4895
African_K7b;Sub_Saharan_K12b,Siberian_K7b -0.00051246 4.93477e-05 -10.3847
East_African_K12b;African_K7b,North_European_K12b -0.00103911 0.000106816 -9.72802
African_K7b;Sub_Saharan_K12b,Southeast_Asian_K12b -0.000469707 4.98071e-05 -9.43052
African_K7b;East_Asian_K12b,Sub_Saharan_K12b -0.000461359 4.9918e-05 -9.24235
Gedrosia_K12b;South_Asian_K7b,West_Asian_K7b -0.00047115 5.11259e-05 -9.2155
South_Asian_K7b;East_African_K12b,South_Asian_K12b -0.000384664 4.18056e-05 -9.20125
African_K7b;Sub_Saharan_K12b,Caucasus_K12b -0.000430657 4.69419e-05 -9.17425
African_K7b;Sub_Saharan_K12b,Southwest_Asian_K12b -0.000421792 4.64037e-05 -9.08962
Atlantic_Baltic_K7b;North_European_K12b,Northwest_African_K12b -0.000328259 3.62081e-05 -9.06589
African_K7b;Sub_Saharan_K12b,East_Asian_K7b -0.000446564 4.9569e-05 -9.00895
African_K7b;Sub_Saharan_K12b,Siberian_K12b -0.000437012 4.88062e-05 -8.95404
Northwest_African_K12b;African_K7b,Atlantic_Med_K12b -0.00115555 0.000131897 -8.76101
African_K7b;West_Asian_K7b,Sub_Saharan_K12b -0.000397507 4.57534e-05 -8.68804
African_K7b;Sub_Saharan_K12b,Atlantic_Baltic_K7b -0.000418044 4.81379e-05 -8.68431
African_K7b;South_Asian_K12b,Sub_Saharan_K12b -0.000393516 4.57123e-05 -8.60853
South_Asian_K7b;South_Asian_K12b,Southwest_Asian_K12b -0.000290753 3.88373e-05 -7.48644
South_Asian_K7b;West_Asian_K7b,South_Asian_K12b -0.000228331 3.63783e-05 -6.27657
Atlantic_Med_K12b;Southern_K7b,North_European_K12b -0.000329428 5.28014e-05 -6.239
East_African_K12b;Gedrosia_K12b,African_K7b -0.000596188 0.000102434 -5.8202
African_K7b;Sub_Saharan_K12b,Atlantic_Med_K12b -0.00023116 4.95629e-05 -4.66397
South_Asian_K7b;South_Asian_K12b,Atlantic_Med_K12b -0.000172605 4.09236e-05 -4.21775
Siberian_K12b;Atlantic_Med_K12b,Siberian_K7b -0.000166672 4.4065e-05 -3.78243
East_African_K12b;West_Asian_K7b,African_K7b -0.00034931 0.000103503 -3.37489
Atlantic_Baltic_K7b;Atlantic_Med_K12b,Siberian_K7b -0.000226988 7.32706e-05 -3.09795

This leads to a very simple way of gauging whether an ancestral population is better seen as admixed or not: count the number of times it appears before the semi-colon, and subtract the number of times it appears after the semi-colon. This may not be a perfect measure, but it captures the basic idea. When I do this, I get:

[1,] East_African_K12b 7
[2,] African_K7b 7
[3,] South_Asian_K7b 4
[4,] Atlantic_Baltic_K7b 3
[5,] East_Asian_K12b 0
[6,] Caucasus_K12b 0
[7,] Northwest_African_K12b -2
[8,] East_Asian_K7b -2
[9,] Siberian_K12b -3
[10,] Southeast_Asian_K12b -4
[11,] Gedrosia_K12b -4
[12,] Siberian_K7b -4
[13,] Southwest_Asian_K12b -5
[14,] South_Asian_K12b -7
[15,] North_European_K12b -7
[16,] West_Asian_K7b -7
[17,] Southern_K7b -8
[18,] Atlantic_Med_K12b -8
[19,] Sub_Saharan_K12b -19

I think this looks reasonable; the components at the bottom usually appear contributing to the admixture of other populations, and the components at the top usually appear admixed in terms of the other components. Of course admixed components may themselves be useful if they represent regional mixes (such as the East African), but this is certainly a good way to supplement and interpret ADMIXTURE analysis.
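The counting rule described above can be sketched in a few lines of Python. This is only an illustration: it uses five rows copied from the f3 table rather than the full list, and the function name `admixture_scores` is my own, not part of threepop or any other tool.

```python
from collections import Counter

# Five rows copied from the f3 table: "target;source1,source2 f3 s.e. Z"
ROWS = """\
Atlantic_Baltic_K7b;Atlantic_Med_K12b,North_European_K12b -0.00287483 2.64051e-05 -108.874
African_K7b;East_African_K12b,Sub_Saharan_K12b -0.00241502 2.3253e-05 -103.858
Caucasus_K12b;West_Asian_K7b,Southern_K7b -0.00317634 4.12205e-05 -77.0573
West_Asian_K7b;Gedrosia_K12b,Caucasus_K12b -0.00209044 3.14454e-05 -66.4785
East_African_K12b;Southern_K7b,African_K7b -0.00412496 0.000101701 -40.5598
"""


def admixture_scores(text):
    """Return {population: (times before the semi-colon) - (times after it)}."""
    scores = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue
        triple = line.split()[0]              # e.g. "A;B,C"
        target, sources = triple.split(";")
        scores[target] += 1                   # appears as the admixed population
        for source in sources.split(","):
            scores[source] -= 1               # appears as a contributing population
    return dict(scores)


scores = admixture_scores(ROWS)
# Highest scores first: these look "admixed"; lowest look like "sources".
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, score)
```

Running the same function over the complete table should reproduce the ranking listed above, since it applies exactly the subtraction described in the text.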

August 30, 2012

Scrubbing Sardinians

In a series of posts, I showed that European populations have east Eurasian-like admixture, an element that appears to be lacking in Sardinians. I did this both on the basis of the 3-population test and a number of different comparisons between West Eurasian populations, as well as on the basis of the 4-population test.

The fact that f4(Sardinian, CEU, Asian, African) is negative was interpreted by Moorjani et al. (2011) as evidence that Sardinians have ~2.9% African admixture. As I pointed out at the time this level of admixture was predicated on the assumption that CEU did not have Asian admixture, and this assumption now appears not to hold.

Of course, the above-mentioned paper also used an admixture LD based method (ROLLOFF) to date the African admixture in Sardinians, coming up with an estimate of ~71 generations. But, we should remember that ROLLOFF does not quantify the extent of this admixture.

Imagine walking along a Sardinian genome: the negative f4 signal is created both by occasional African-like segments you meet along the way, but also by the presence of East Eurasian SNPs in CEU in other locations where Sardinians may have no African admixture. The f4 signal is a genomewide average that is influenced by two different processes: punctuation by African segments whose length distribution can supply information about the time of their introgression; and, the background genome that is lacking in East Eurasian-like polymorphism present in CEU.

In this post, I will show that:

The admixture estimate of 2.9% is not robust, but depends on the choice of Asian population for f4 ancestry estimation, consistent with the idea that it is influenced by east Eurasian-like admixture that has affected northern European populations.
If Sardinians are "scrubbed" of any trace of African admixture, the negative f4(Sardinian, CEU, Asian, African) signal persists.

Estimates of African admixture in Sardinians depend on choice of Asian/American population

African ancestry in Sardinians was estimated by Moorjani et al. (2011), using the following ratio:

f4(San,Papuan; Sardinian,CEU) / f4(San,Papuan; YRI, CEU)

In Table S6 different ancestral populations were used for f4 ancestry estimation, and all results ranged between 2.9-3.4%.

The signal of east Eurasian-like admixture in northern Europe is strongest when Karitiana are used as an Asian/American reference. If the level of "African" admixture in Sardinians is driven, as I suspect, by the presence of east Eurasian-like admixture in northern Europe, then I expect this admixture to be highest when Karitiana instead of Papuans are used. And, indeed, this is what I observe:

f4(San,Papuan;Sardinian,CEU) = 0.00118099 (Z=10.6838)
f4(San,Papuan;YRI,CEU) = 0.0379664 (Z=88.2287)

(in all experiments I use a set of 28 Sardinians vs. 27 in the Moorjani et al. paper, a set of 112 CEU, 147 YRI, a set of 166,770 SNPs, and -k 200 for fourpop)

therefore, African admixture in Sardinians using Papuan reference = 0.00118099/0.0379664 = 3.1%

but

f4(San,Karitiana;Sardinian,CEU) = 0.00272141 (Z=22.7288)

f4(San,Karitiana;YRI,CEU) = 0.04449 (Z=100.19)

therefore, African admixture in Sardinians using Karitiana reference = 0.00272141/0.04449 = 6.1%

A ~2-fold difference in African admixture has resulted from a different choice of outgroup. This is unexpected if West Eurasians did not exchange genes with Papuans and Karitiana since their divergence, but expected if CEU received genes from an Asian population that was more like Karitiana and less like Papuans.
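The arithmetic behind the two estimates above can be checked with a short Python sketch. The f4 values are the ones reported in this post; the helper name `f4_ratio` is my own and is not part of fourpop.

```python
def f4_ratio(numerator, denominator):
    """Ancestry fraction estimated as the ratio of two f4 statistics."""
    if denominator == 0:
        raise ValueError("denominator f4 statistic must be non-zero")
    return numerator / denominator


# f4(San,Papuan;Sardinian,CEU) / f4(San,Papuan;YRI,CEU)
papuan = f4_ratio(0.00118099, 0.0379664)
# f4(San,Karitiana;Sardinian,CEU) / f4(San,Karitiana;YRI,CEU)
karitiana = f4_ratio(0.00272141, 0.04449)

print(f"Papuan reference:    {papuan:.1%}")
print(f"Karitiana reference: {karitiana:.1%}")
print(f"fold difference:     {karitiana / papuan:.2f}")
```

The only thing that changes between the two calls is the Asian/American population in the second slot, which is the point of the comparison.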

Scrubbing Sardinians

Another way to demonstrate that east Eurasian-like admixture in CEU is inflating the perceived level of African-like admixture in Sardinians is to comprehensively "scrub" Sardinians of all traces of African ancestry by replacing segments of their DNA when there is even a hint of such ancestry with missing values.

Going back to the mental experiment of walking along the Sardinian genome, we are going to remove spots of even remote possibility of African admixture. It will be shown that CEU continues to have evidence of east Eurasian-like admixture using the scrubbed Sardinians, suggesting that it is not only African-like admixture in Sardinians generating this signal, but also East Eurasian-like admixture in CEU.

I used DIYDodecad to do this scrubbing, but one could potentially try any approach that can identify African segments, such as HAPMIX or PCA. I used the dataset assembled for K7b and K12b, and carried out a K=3 ADMIXTURE analysis, which resulted in 3 components centered on West Eurasia, Asia, and Africa. I chose not to use an African component from higher-K (e.g. the K7b calculator), because it is conceivable that African ancestry might be lurking in southern Caucasoid components inferred with these tools (e.g., the "Southern" component of K7b or the "Southwest Asian" one of K12b). The average African admixture in Sardinians using the K3b calculator is 0.9%, and for the subset of CEU used it is 0.2%.

Using the byseg mode of DIYDodecad, I created ancestry maps of the 28 HGDP Sardinians, and I only kept windows where the African admixture was exactly 0%. This is a very aggressive scrubbing, designed to remove virtually all African admixture from the population. For example, if a window has 99.9% West Eurasian admixture and 0.01% African, I will nonetheless remove it, even though chances are extremely high that the 0.01% represents only noise. I did not want to leave any doubt that any trace of identifiable African ancestry remained in my "scrubbed Sardinians".
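The scrubbing rule can be illustrated with a toy Python sketch. Everything here is hypothetical: the window tuples, the component names, and the genotype coding are made up for the example and do not reflect the actual output format of DIYDodecad's byseg mode.

```python
MISSING = None  # stand-in for a missing genotype call


def scrub(genotypes, positions, windows, component="African"):
    """Mask every genotype that falls in a window whose `component` share is not exactly 0."""
    scrubbed = list(genotypes)
    for start, end, proportions in windows:
        if proportions.get(component, 0.0) == 0.0:
            continue  # window has exactly 0% of the component: keep it
        for i, pos in enumerate(positions):
            if start <= pos <= end:
                scrubbed[i] = MISSING
    return scrubbed


# Hypothetical toy data: six SNPs and three ancestry windows.
positions = [100, 250, 400, 550, 700, 850]
genotypes = [0, 1, 2, 1, 0, 2]
windows = [
    (0, 299, {"West_Eurasian": 1.0, "Asian": 0.0, "African": 0.0}),
    (300, 599, {"West_Eurasian": 0.9999, "Asian": 0.0, "African": 0.0001}),
    (600, 899, {"West_Eurasian": 0.98, "Asian": 0.02, "African": 0.0}),
]

scrubbed = scrub(genotypes, positions, windows)
print(scrubbed)
```

Note that the second window is masked even though its African share is tiny, which mirrors the deliberately aggressive threshold described above, while the third window is kept because only its Asian share is non-zero.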

I am very confident that my scrubbed Sardinians do not have any hint of African ancestry, but you can decide for yourselves. I base my confidence on (a) the extreme nature of the scrubbing, which threw away much of the Sardinian genome in order to ensure that no hints of local African ancestry remained, (b) re-assessment of the scrubbed Sardinians with K3b showing that they are now 100% West Eurasian, and (c) ab initio ADMIXTURE analysis of CHB, YRI, CEU, and scrubbed Sardinians, demonstrating that the latter are 100% West Eurasian, while CEU has traces of 0.1% African and 0.3% Asian ancestry.

So, here are the results for the scrubbed Sardinians:

f4(San,Papuan;Sardinian_scrubbed,CEU) = 0.000678108 (Z=4.05225)

f4(San,Papuan;YRI,CEU) = 0.0379664 (Z=88.2287)

so scrubbed Sardinians with Papuan reference appear 0.000678108 / 0.0379664 = 1.8% African

and

f4(San,Karitiana;Sardinian_scrubbed,CEU) = 0.00205526 (Z=11.2848)
f4(San,Karitiana;YRI,CEU) = 0.04449 (Z=100.19)

so scrubbed Sardinians with Karitiana reference appear 0.00205526/0.04449 = 4.6% African

Despite the thorough scrubbing, Sardinians continue to show African admixture using f4 ancestry estimation. This is consistent with the idea that much of the African ancestry inferred using f4 ancestry estimation in Sardinians is an artifact of not taking into account east Eurasian-like admixture in CEU.

Conversely, a significant signal of east Eurasian-like admixture in CEU persists whether one uses regular or scrubbed Sardinians:

With regular Sardinians:

f4(San,Papuan;Sardinian,Karitiana) = 0.0084678 (Z=21.2137)

f4(San,Papuan;Sardinian,CEU) = 0.00118099 (Z=10.6838)

So, CEU appears 0.00118099/0.0084678 = 13.9% East Eurasian

With scrubbed Sardinians:

f4(San,Papuan;Sardinian_scrubbed,Karitiana) = 0.00774427 (s.e.=0.00056725, Z=13.6523)

f4(San,Papuan;Sardinian_scrubbed,CEU) = 0.000678108 (s.e.=0.000167341, Z=4.05225)

So, CEU appears 0.000678108/0.00774427 = 8.8% East Eurasian

Conclusion

My "palimpsest" idea seems to be confirmed by the data. A first observation is that the level of African-like admixture in Sardinians depended on whether one used Papuans or Karitiana as an outgroup, suggesting that neither population was a true outgroup, and the signal of African admixture in Sardinians was driven in part by East Eurasian-like admixture in CEU. African admixture in Europe cannot be assessed accurately if one ignores the confounding effect of East Eurasian admixture.

When I aggressively scrubbed Sardinians so as to remove all traces of African ancestry, part of the African admixture fraction disappeared (expected, since African ancestry was removed from Sardinians), but a substantial part of it remained (unexpected, if the signal was driven only by African admixture, but expected, if it was driven in part by East Eurasian-like admixture in CEU). Conversely, using scrubbed Sardinians reduced, but did not eliminate, the East Eurasian admixture estimate for CEU.

August 26, 2012

Inter-relationships of the Dodecad K12b and world9 components

Pconroy made a most excellent suggestion in the comments of a previous post, so I decided to follow up on it. His idea is to see what Dodecad components look like when they're measured in terms of other components. So, I took the K12b components and carried out the following procedure:

I used each of the 12 different components as "test data" in a supervised ADMIXTURE analysis that used the other 11 components as "reference data". This simple procedure can show what each component appears to be made of, if it is seen in the context of the remaining components. It is a good way to demonstrate relationships between them.

Here are the results:

Some observations:

Gedrosia appears to be Caucasus + a slice of Siberian
Both Siberian and Southeast Asian appear to be wholly East Asian
East Asian on the other hand, appears to be mostly Southeast Asian + minority Siberian
Northwest African appears to be Caucasus + a minority Sub Saharan
Atlantic Med appears to be Caucasus + a slice of North European
North European appears to be Atlantic Med + Gedrosia with a slice of Siberian
South Asian appears to be Caucasus + East Asian
East African appears to be Sub Saharan + minority Caucasus
Southwest Asian appears to be Caucasus
Sub Saharan appears to be East African
Caucasus appears Atlantic Med + Gedrosia + slices of Northwest African and Southwest Asian
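The bookkeeping of the leave-one-out procedure described above can be sketched as follows. This only generates the twelve test/reference splits; the supervised ADMIXTURE run itself is not shown, and the function name `leave_one_out` is my own.

```python
# The twelve K12b components, as named in this post.
K12B = [
    "Gedrosia", "Siberian", "Northwest_African", "Southeast_Asian",
    "Atlantic_Med", "North_European", "South_Asian", "East_African",
    "Southwest_Asian", "East_Asian", "Caucasus", "Sub_Saharan",
]


def leave_one_out(components):
    """Yield (test, references): each component against all the remaining ones."""
    for test in components:
        yield test, [c for c in components if c != test]


splits = list(leave_one_out(K12B))
for test, refs in splits:
    print(f"{test}: test data, with {len(refs)} reference components")
```

Each split would then be handed to a supervised analysis in which the held-out component plays the role of "test data".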

The most salient point about this analysis is the central position of the Caucasus component vis a vis the others, consistent with my womb of nations theory. Not only do all West Eurasian components (except the North European) appear substantially "Caucasus" in this analysis, but the Caucasus component itself shows links with four others.

It could be argued that these results represent a confluence of peoples from all over West Eurasia into the highlands of West Asia where the Caucasus component is modal. But, the Caucasus region is arguably the most linguistically diverse in West Eurasia, and many of its languages do not appear to have come from elsewhere. Also, the Near East (where the Caucasus component is the most important one in most populations) is the birthplace of agriculture, which has demonstrably affected most of West Eurasia. On balance, this analysis seems consistent with population expansions out of West Asia.

The following graph summarizes the relationship between the 12 components. I used color intensity of the edges to indicate admixture intensity:

Finally, a few points to remember:

the South Asian component appears like a mix of Caucasus and East Asian; the latter probably acts as a stand-in for the Ancestral South Indians of Reich et al. (2009)
Similarly, the Gedrosia/Siberian influences on the North European component do not necessarily mean direct influences from these two regions; an explanation for these influences may intersect with the issue of East Eurasian-like ancestry in northern Europe
It is the Caucasus, rather than the Southwest Asian, component that seems to donate to the Northwest African and East African ones. That seems to flout geography, but probably indicates that the Southwest Asian component, with its strong Semitic associations (see distribution in K12b spreadsheet), represents a more specialized form of the more generalized Caucasus component.
Some components appear to be "terminal", affected but not much affecting: Southwest Asian, Northwest African, and Southeast Asian. These tend to appear at high K in admixture analyses, and probably represent either recent mixtures (Northwest African) or specialized forms of more generalized ones (Southwest Asian of Caucasus and Southeast Asian of East Asian)
Finally, remember that living populations show admixture proportions of many of these components. So, for example, the East African population often has Southwest Asian admixture, even though the East African component lacks it. And, as mentioned above, this may reflect the more generalized west Asian admixture that has affected East Africa, as well as the more specific Arabian admixture, associated e.g., with the spread of Semitic languages. Please refer to the K12b spreadsheet for admixture proportions of populations for the 12 components.

I have also done the same with the world9 calculator, which includes Amerindian and Australasian components. Here is how the world9 components are seen as mixtures of the remaining ones:

And, here is the graph showing how they seem to contribute to each other.

A few observations:

Amerindian appears wholly Siberian
East Asian appears Siberian + South Asian + slice of Australasian
African appears South Asian. I would attribute this to Africans being related to both West and East Eurasians approximately symmetrically, so in this type of experiment, South Asian (which is an ANI/ASI mix) appears like the best match
Atlantic_Baltic appears Caucasus_Gedrosia + Southern + slice of Amerindian
Australasian appears South Asian. I would guess that ASI and Australo-Melanesians share deep common ancestry from the earliest settlement of southern parts of Asia.
Siberian appears East Asian + slice of Amerindian
Caucasus_Gedrosia and Southern appear Atlantic_Baltic
South Asian appears an about equal mix of East Asian + Caucasus_Gedrosia + slice of Australasian

Raw data for these experiments can be found here.

August 22, 2012

East Eurasian-like ancestry in Northern Europe (part 3)

(This is the third part of the series. See part 1 and part 2.)

In the first two parts of the series, I showed that northern European populations show hints of East Eurasian ancestry when compared against Sardinians. I used Dai, Han, and Karitiana as reference populations for East Eurasia. In the current post, I extend this analysis by using HGDP Papuans and the Onge (Reich et al. 2009) from the Andaman Islands.

The f4 statistics using Karitiana, Papuan, and Onge populations can be found in this spreadsheet.

Below, you can see that they are all near perfectly correlated with each other.

The visual appraisal is confirmed when we calculate the correlation coefficients:

The fact that all three populations track the same signal is strong evidence for the direction of gene flow: from Asia into northern Europe. If the signal were present in only one of the three populations, then it could conceivably be an artefact of gene flow in the opposite direction (from northern Europeans to the affected population). But, the fact that all three populations show the same pattern would require northern European-like admixture in the Andaman Islands, Papua New Guinea and South America, which does not appear very parsimonious.

While the signals from the three populations are correlated, their intensity varies. The Z-scores provide a measure of this intensity. The mean Z-scores using a Karitiana, Papuan, and Onge reference across all populations are respectively -17.7, -8.0, and -6.0.

While I did not include the Han reference of part 1 in this analysis, inspection of the f4 statistics (which can be obtained at the bottom of that part), suggests that the Z-scores become more significant when using an Onge, Papuan, Han, and Karitiana reference in that order. For example, for the Finnish_D population, they are: -10.037, -13.2949, -23.9305, and -27.764 respectively.

It thus appears that the element contributing East Eurasian-like ancestry in northern Europeans was derived from the northern spectrum of East Eurasians; the Karitiana may live in South America today, but they trace their ancestors to northern Eurasia, having entered the Americas c. 15ka.

In my opinion, the signal has been formed by a superposition of a few factors:

The fact that Y-haplogroup R, the main lineage in modern northern Europeans has a common origin (Y-haplogroup P) with haplogroup Q, the main lineage in modern Amerindians, and many Siberians. We can hypothesize that the population that brought R into Europe was intermediate genetically across the Caucasoid-Mongoloid spectrum. In West Eurasia, this population admixed with the Palaeo-West Eurasians (Y-haplogroups IJ, G, and possibly LT), and contributed their DNA primarily to the northern Europeoids.
Other population movements of more regional impact, such as Y-haplogroup N, which affected mainly Uralic, Baltic, and East Slavic populations, as well as elements from the mixed West/East Eurasian mtDNA contact zone that ancient DNA analysis has revealed in Eastern Europe and Siberia.

The raw dumps of fourpop output for Papuan and Onge reference can be found here.

East Eurasian-like admixture in Northern Europe (part 2)

This is a continuation of my earlier post. Please refer to it for the methodology. A new part 3 can be found here.

I have repeated the experiment with a much larger set of populations:

English_D, British_D, Ukranians_Y, Karitiana, Spaniards, Sardinian, Serb_D, Mordovians_Y, Irish_D, French, Finnish_D, Chuvashs_16, Romanian_D, N_Italian_D, French_Basque, Austrian_D, Russian_D, Hungarians_19, Kent_1KG, German_D, Belorussian, Tuscan, Lithuanian_D, Orkney_1KG, Dutch_D, TSI30, Ukrainian_D, Bulgarians_Y, Bulgarian_D, Russian, Swedish_D, Pais_Vasco_1KG, French_D, Castilla_Y_Leon_1KG, Lithuanians, San, Polish_D, Romanians_14, Orcadian, Cornwall_1KG, Valencia_1KG, North_Italian, FIN30, Norwegian_D, CEU30

I used Sardinians as the Caucasoid reference population, Karitiana for Mongoloids, and San for Africans. The latter two were chosen because they live at maximally opposite corners of the Earth (South America vs. South Africa).

A first plot of the f4 statistics used for f4 regression ancestry estimation is seen below:

Clearly, some evidence of a cline is present, but several populations appear to deviate from it. In order to get the cleanest possible cline, I carried out the following greedy procedure: I calculate the correlation coefficient of this set, and iteratively remove one population that leads to the maximum improvement of the correlation, until no further improvement takes place. The following populations were removed with this procedure:

Spaniards, Serb_D, Romanian_D, N_Italian_D, Tuscan, TSI30, Bulgarians_Y, Bulgarian_D, Castilla_Y_Leon_1KG, Romanians_14, Valencia_1KG

This seems to make sense, as all these are southern European populations. Note that their removal does not mean that they do not partake in the same phenomenon as northern Europeans: they also exhibit Karitiana-shift relative to the Sardinians, but there are probably other confounding factors that make them fall "off-cline". Including them would diminish the clarity of the cline for Northern European populations. The regression of the remaining populations can be seen on the right:
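As an aside, the greedy pruning procedure described above can be sketched in Python. The data points and population names are hypothetical, and two details are my own assumptions rather than anything stated in the post: the sketch maximizes the absolute value of Pearson's r, and it requires an improvement larger than a small tolerance before removing a population.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def abs_r(points):
    xs, ys = zip(*points.values())
    return abs(pearson(xs, ys))


def greedy_prune(points, min_size=3, tol=1e-12):
    """Repeatedly drop the one population whose removal most improves |r|."""
    kept = dict(points)
    removed = []
    best = abs_r(kept)
    while len(kept) > min_size:
        # Score every possible single removal and take the best one.
        candidates = [
            (abs_r({k: v for k, v in kept.items() if k != name}), name)
            for name in kept
        ]
        new_best, drop = max(candidates)
        if new_best <= best + tol:
            break  # no removal improves the correlation any further
        best = new_best
        removed.append(drop)
        del kept[drop]
    return kept, removed, best


# Hypothetical pairs of f4 statistics: five populations on a line, one off it.
points = {
    "Pop_A": (1.0, 2.0), "Pop_B": (2.0, 4.0), "Pop_C": (3.0, 6.0),
    "Pop_D": (4.0, 8.0), "Pop_E": (5.0, 10.0), "Pop_F": (6.0, 0.0),
}
kept, removed, best = greedy_prune(points)
print(removed, round(best, 6))
```

Being greedy, the procedure only ever considers one removal at a time, so it is not guaranteed to find the subset with the highest possible correlation.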

f4 regression ancestry estimation results are shown on the left. These appear to be much higher than was the case with the Han and Dai in the previous experiment.

I can't say that I've made any obvious mistakes, but these admixture proportions are substantial, and call for an explanation. Whatever their true levels, I am fairly confident on at least a few points:

First, it is evident that northern Europeans have higher levels of this element than southern Europeans; the latter are not altogether deficient in it, but they fall "off-cline", making estimation of their admixture proportions more difficult.

Second, within northern Europe, there is a fairly clear east-west cline of diminishing Amerasian-like admixture. The minimum occurs in Sardinians and secondarily in Southwest Europe. Romance, Celtic, and Germanic populations all have less of it than Balto-Slavic and Uralic ones. And, some populations of northeastern Europe seem to have a noticeable excess of it.

The groups with the most Amerasian-like admixture possess Y-haplogroup N, a clear trace of eastern ancestry that is not shared by most Europeans. The arrival of this haplogroup, either with Comb Ceramic of the Baltic Neolithic or later with Seima Turbino Bronze Age expansions is probably responsible for the local excess in Northeastern Europe. The Chuvash are, of course, a Turkic population but of Finno-Ugrian genetic origin.

But, the presence of this element even in Western Europe cannot be explained on the basis of typically Mongoloid elements which are almost completely lacking there. If Mesolithic Europeans were themselves Asian-shifted, then this would account for the presence of the element, but not necessarily for its clinal manifestation. The double (north-south and east-west) cline shows every sign of an intrusive element. So, for the time being, I will propose that this is associated with late (e.g., Copper and Bronze Age) phenomena, such as the northern stream of the Bronze Age Indo-European invasion of Europe.

This may be due to either:

(i) northern Indo-European groups picking up some native east European or Siberian elements as they made their way into Europe, or
(ii) more likely, in my opinion, the Y-haplogroup R1 group of people, whose closest relatives are in Central/South Asia (R2), and whose more distant relatives (Q) are in Siberia and the Americas, having been from the beginning an "intermediate population" between West and East Eurasia. The R1 group of people in its R1b and R1a varieties first appears in Europe during the Copper Age, and is lacking in early Neolithic sites.

Eight years ago, and in a totally different context, I wrote:

Similarly, 9 out of 10 Basques are descended from a man who has also fathered 9 out of 10 Kets from Siberia and 9 out of 10 Maya Indians from America. That man, founder of haplogroup P thus has descendants who belong to two of the major human races (or three, if Amerindians are considered as separate from Asian Mongoloids)

...

In conclusion, human continental populations form groups of genetic and phenotypic similarity, and these groups can be considered races in the phenetic sense. However, these groups are not monophyletic, hence in the cladistic sense they should not be considered as valid taxa. Since the principle of common descent is generally applied in modern systematics (or at least it should!), I think it's best not to recognize human subspecies.

If these data pan out, it may be revealed that the European branch of the Caucasoids is actually a product of admixture too, with at least two of its constituent elements being the "Palaeo-West Eurasians" (Y-haplogroups G, IJ, possibly LT) and the "Neo-NW Eurasians" (Y-haplogroups N1 and R1), with the "Neo-Afrasians" (Y-haplogroup E1b1b) forming a third element.

(A raw dump of fourpop output can be found here).

August 21, 2012

4-population test and East Eurasian-like ancestry in Northern Europe

Update: This is the first part of my discussion on the topic. For part 2 go here; for part 3 here.

I decided to follow up on a hint in the recent Reich et al. (2012) paper on Native Americans to the effect that:

... east/central Asian admixture has affected northern Europeans to a greater extent than Sardinians (in our separate manuscript in submission, we show that this is a result of the different amounts of central/east Asian-related gene flow into these groups).

I used the implementation of the 4-population test of Reich et al. (2009) in the fourpop program of TreeMix. 255,020 SNPs common in the various datasets were used throughout, and blocks of 200 SNPs for standard error estimation.
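For readers unfamiliar with the statistic, here is a minimal Python sketch of an f4 value with a standard error from blocks of consecutive SNPs. It assumes a plain delete-one-block jackknife over per-SNP products of allele frequency differences, which may differ in detail from what fourpop does internally; the function names and the toy frequencies are hypothetical.

```python
import math


def f4_terms(a, b, c, d):
    """Per-SNP products (a - b) * (c - d) from four lists of allele frequencies."""
    return [(pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)]


def f4_block_jackknife(a, b, c, d, block_size=200):
    """Return (f4, standard error, Z) using a delete-one-block jackknife."""
    terms = f4_terms(a, b, c, d)
    n = len(terms)
    blocks = [terms[i:i + block_size] for i in range(0, n, block_size)]
    if len(blocks) < 2:
        raise ValueError("need at least two blocks of SNPs")
    total = sum(terms)
    estimate = total / n
    # Re-estimate f4 with each block of consecutive SNPs left out in turn.
    leave_out = [(total - sum(blk)) / (n - len(blk)) for blk in blocks]
    g = len(blocks)
    mean_lo = sum(leave_out) / g
    variance = (g - 1) / g * sum((x - mean_lo) ** 2 for x in leave_out)
    se = math.sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z


# Hypothetical toy frequencies for four populations at four SNPs.
a = [0.5, 0.1, 0.5, 0.1]
b = [0.1, 0.1, 0.1, 0.1]
c = [0.9, 0.9, 0.9, 0.9]
d = [0.4, 0.4, 0.4, 0.4]
est, se, z = f4_block_jackknife(a, b, c, d, block_size=1)
print(est, se, z)
```

With real data one would pass the full per-SNP frequency lists and leave `block_size` at 200, matching the block size used here. The point of grouping neighbouring SNPs is that they are correlated through linkage, so they should not be treated as independent observations.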

I used HGDP Sardinian, X, Han, San, with X being one of the following:

Armenian_D, Turkish_D, Russian_D, Polish_D, German_D, Irish_D, Greek_D, Finnish_D, Sicilian_D, Swedish_D, Portuguese_D, Lithuanian_D, Somali_D, AMHARA_Pa, Dai, Japanese, Kyrgyz_Bishkek_Ho, Mozabite, Bedouin, North_Italian, French_Basque, Tuscan, Russian, Orkney_1KG, Kent_1KG, Cornwall_1KG, Yoruba, Mbuti_Pygmies

As always, you can find a list of population sources at the bottom of the Dodecad blog.

As I have noted in my review of Moorjani et al., this test shows a superposition of a set of populations on the African-East Asian axis, so populations occupy different positions depending on whether they have African or East Asian admixture. It's a palimpsest. That paper ignored the Eastern ancestry in North Europeans, and used the CEU (a population of mainly North European origin) instead of Sardinians, hence generating inflated estimates of African ancestry in Southern Europeans.

Now that the Central/East Asian ancestry in northern Europeans seems to be recognized by some of the co-authors of the earlier paper, and using the Reich et al. (2012) framework, the different processes superimposed on the African-East Asian axis can probably be disentangled. Hopefully, we won't have to wait too long for the full treatment. Maybe it can go to the arXiv too!

In any case, here are the f4(Sardinian, X; Han, San) values for the different populations:

It is quite clear that North European populations are shifted towards East Asians. The Turkish_D sample is also shifted in this direction, due to its Central Asian Turkic admixture. There are also a few cases of substantial African shift, such as the Bedouin and the Mozabite Berbers.

I have also rescaled the f4 statistics on a 0: Sardinian to 100: Japanese scale, including only West Eurasian populations that are East Asian-shifted relative to Sardinians:
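The rescaling is just a linear map that sends the Sardinian value to 0 and the Japanese value to 100. A minimal sketch follows; the numbers in the usage note are made up for illustration and are not the values from my runs, and `rescale_0_100` is a hypothetical helper name.

```python
def rescale_0_100(f4_values, low="Sardinian", high="Japanese"):
    """Linearly rescale f4 values so that `low` maps to 0 and `high` to 100.

    f4_values is a dict of population name -> f4 value. The `low` entry is
    f4(Sardinian, Sardinian; Han, San), which is 0 by construction, but it is
    passed explicitly so the same helper works with other anchors.
    """
    span = f4_values[high] - f4_values[low]
    if span == 0:
        raise ValueError("the two anchor populations have the same f4 value")
    return {
        pop: 100.0 * (value - f4_values[low]) / span
        for pop, value in f4_values.items()
    }
```

For example, with the made-up input `{"Sardinian": 0.0, "Russian_D": -0.002, "Japanese": -0.02}`, Russian_D would land at about 10 on the rescaled axis.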

I have calculated the correlation coefficient between the f4 statistics for this set of West_Eurasian populations and the sum of the Siberian+East Asian components of my K7b calculator on the same set of populations. This is +0.85, highly significant, and consistent with the idea that ADMIXTURE software and formal tests of admixture capture the same phenomenon. I also calculated the correlation coefficient with the Atlantic_Baltic component that is modal in Europeans, which is equal to +0.56 and confirms the higher East Eurasian shift in European populations.
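The comparison above is an ordinary Pearson correlation between two vectors over the same populations. A minimal sketch, assuming the f4 values and the ADMIXTURE component sums have already been lined up in the same population order; the t-statistic helper is the textbook one for testing r against zero and is included only to show how significance could be checked.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def t_statistic(r, n):
    """t-statistic for H0: r = 0, with n - 2 degrees of freedom."""
    return float(r * np.sqrt((n - 2) / (1.0 - r ** 2)))
```

Here `x` would hold the f4 statistic per population and `y` the summed Siberian+East Asian percentages for the same populations.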

(I have also repeated the above with the K12b calculator; the correlation between the f4 statistics and the sum of the Siberian+East Asian+Southeast Asian components is +0.76, and the correlation with the North_European component is +0.84. The latter is higher than the correlation with the Atlantic_Baltic component (+0.56), which combines North and West European ancestry. It thus appears that the East Eurasian admixture in Europe is not a general feature of the oldest Europeans, but reflects a more recent phenomenon.)

Furthermore, I have carried out f4 regression ancestry estimation (Reich et al. 2009), with the f4(Sardinian, San; X, Han) statistic on the horizontal axis and the f4(Sardinian, X; San, Han) statistic on the vertical axis. An initial plot shows that while northern Europeans fall precisely on a line in this space, Turks and Armenians deviate substantially, and Greeks, Tuscans, and Basques less noticeably so:

The regression analysis shows a weak correlation (R^2=0.149). The southern Caucasoid populations from Armenia to Iberia appear roughly perpendicular to the north European cline, suggesting that they do not significantly partake in the same phenomenon as the northern groups.

I thus limited myself to the European populations falling between Russian and North_Italian, which form a near-perfect cline (R^2 = 0.9783).

Admixture estimation was performed on the triangle whose three corners were (from the regression equation):

LOW: (0.0151411, 0.0000000)
HIGH: (0.00000, 0.02049)

and

TEST: the f4 statistics for each test population
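The LOW and HIGH corners are simply the points where the fitted regression line crosses the horizontal and vertical axes. A minimal sketch of that step, assuming an ordinary least-squares fit of the vertical statistic on the horizontal one; `fit_cline` is a hypothetical helper name, and the sketch stops at the corners and does not reproduce the admixture estimate itself.

```python
import numpy as np

def fit_cline(x, y):
    """Least-squares line y = slope * x + intercept, with R^2 and axis corners.

    Returns (slope, intercept, r_squared, low, high), where `low` is the
    x-intercept (y = 0) and `high` is the y-intercept (x = 0).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    if slope == 0:
        raise ValueError("a flat line never crosses the horizontal axis")

    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    low = (float(-intercept / slope), 0.0)
    high = (0.0, float(intercept))
    return float(slope), float(intercept), float(r_squared), low, high
```

Feeding it the pair of f4 statistics for each population on the cline gives back the two corners and the R^2 of the fit.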

Inferred admixture proportions using this method can be seen below:

I have repeated the above experiment using French_Basque instead of Sardinian, Mbuti instead of San, and Dai instead of Han:

Now, North_Italian shows no East-Eurasian-like admixture relative to French Basque, so there is one less row. The Basques appear to be Asian-shifted relative to Sardinians, so, overall, I would trust the former results more than the latter, but, in any case, the overall pattern seems fairly solid across a choice of reference populations.

While I would not take these results very literally in the absolute sense, I think they show quite well the relative ordering of populations, and are consistent with both my initial observation...

With respect to the Asian- and African- shift of West Eurasian populations, I note that northern Europeans (and Basques) are less African-shifted than southern Europeans, and, at the same time they are more Asian-shifted: the 16 least Asian-shifted populations have a coastline in the Mediterranean (excluding the Portuguese), while the 16 least African-shifted populations do not (excluding the French).

... as well as the passing remark in the Reich et al. (2012) supplementary material mentioned above.

The prominent position that Sardinians have assumed in the genetic history of Europe puts in a new light the discovery of Veeramah et al. that Sardinians tend to be monomorphic at sites where mainland Europeans are polymorphic. It now appears that the reduced genetic polymorphism of Sardinians vis-à-vis mainland Europeans may not be due to them having undergone a "bottleneck" relative to mainland Europeans, but rather, at least in part, a consequence of admixture in the latter. Admixture matters.

Hopefully, geneticists will become more willing to interpret patterns of decreasing genetic diversity not only as a consequence of diversity-reducing "bottlenecks", but also as a consequence of admixture in the populations that are especially diverse.

In the case of Sardinians and Europeans we have been lucky in that East Asians continue to exist, helping us untangle their (or their relatives') contribution to the population history of Europe. But, in other cases (such as the introgression of archaic DNA into the modern human gene pool), latent population admixture between divergent populations may lead to misinterpretations of the direction of gene flow.

The raw dump of fourpop output can be obtained from here (for Sardinian-Han-San) and here (for Basque-Mbuti-Dai).