August 30, 2012

Denisova genome at high coverage (Meyer et al. 2012)

The high coverage Denisova genome had been made available online in February, and now the paper to accompany it has appeared in Science. I do appreciate the fact that the Max Planck folks did not wait until publication to share their data; anything that can accelerate the process of scientific discovery is good in my book. I'll post any comments on this paper as an update when I read it.

UPDATE I: The most exciting thing about this paper, even above and beyond the new information it provides, is the new technology of single-strand sequencing. From a ScienceNOW story:

Meyer's breakthrough came in developing a method to start the sequencing process with single strands of DNA instead of double strands, as is usually done. By binding special molecules to the ends of a single strand, the ancient DNA was held in place while enzymes copied its sequence. The result was a sixfold to 22-fold increase in the amount of Denisovan DNA sequenced from a meager 10-milligram sample from the girl's finger. The team was able to cover 99.9% of the mappable nucleotide positions in the genome at least once, and more than 92% of the sites at least 20 times, which is considered a benchmark for identifying sites reliably. 
Back in Leipzig, the mood is upbeat, as researchers pull fossil samples off the shelf to test anew with "Matthias's method." First on Paabo's list: Neandertal bone samples, to try to produce a Neandertal genome to rival that of the little Denisovan girl.

There are two reasons to be very happy: 

First, the new method will probably open new vistas in ancient DNA, as it represents an order of magnitude improvement in the amount of coverage that can be accomplished. 

Second, the authors were able to estimate the age of the specimen by cleverly measuring how much it differed from chimpanzees vs. how much we, living humans, so differ. As they explain in the supplement:
Based on the differences in branch length to the common ancestor of human and chimpanzee (1.13% to 1.27%; see Table S13), we estimate that the observed branch shortening corresponds to 73,614 – 82,421 (average 75,443) years assuming a human-chimpanzee divergence time of 6.5 million years.
Note that recent developments in dating human-chimpanzee divergence may push it to an older date than 6.5 million years. But, even if that date is accepted, the Denisova specimen is now the oldest Homo sequenced and our ability to get high-coverage DNA from a ~75ka specimen means that we may be getting DNA from other really old samples. Human evolutionary genetics is going to be very interesting in the coming years!
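The arithmetic behind the quoted estimate can be sketched as follows. This is a back-of-the-envelope check rather than the authors' procedure: it assumes the age is simply the fractional branch shortening times the human-chimpanzee divergence time, and the function name and the 8-million-year alternative are made up for illustration. With the rounded percentages it gives 73,450 to 82,550 years, close to but not identical with the published 73,614 – 82,421, presumably because the supplement works from unrounded values.

```python
def branch_shortening_age(shortening_fraction, divergence_years=6.5e6):
    """Age implied by a missing fraction of the branch to the human-chimp ancestor.

    Assumes a simple linear scaling (illustrative, not the paper's exact method).
    """
    return shortening_fraction * divergence_years


# Range quoted from the supplement: 1.13% to 1.27% branch shortening.
low = branch_shortening_age(0.0113)
high = branch_shortening_age(0.0127)
print(round(low), round(high))

# A hypothetical older divergence date scales the estimate proportionally.
older_low = branch_shortening_age(0.0113, divergence_years=8.0e6)
print(round(older_low))
```

The point of the second call is only that any revision of the human-chimpanzee date moves the specimen's age by the same factor.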

UPDATE II (East Eurasians more Neandertal than Europeans): 

There were hints of this in the previous papers, but they did not reach statistical significance. Now, it appears that they are confirmed. From the paper:
Interestingly, we find that Denisovans share more alleles with the three populations from eastern Asia and South America (Dai, Han, and Karitiana) than with the two European populations (French and  Sardinian) (Z=5.3). However, this does not appear to be due to Denisovan gene flow into the ancestors of present-day Asians, since the excess archaic material is more closely related to Neandertals than to Denisovans (Table S27). We estimate that the proportion of Neandertal ancestry in Europe is 24% lower than in eastern Asia and South America (95% C.I. 12-36%). 
This finding is important, because it shows that the simple model of modern humans expanding Out-of-Africa, interbreeding with Near Eastern Neandertals and carrying on to the rest of the world, carrying with them a fraction of Neandertal ancestry, is incomplete.

Some explanations for this finding are discussed on p. 41 of the supplement. The fact that Neandertals were a West Eurasian-distributed species is at great odds with the finding of greater Neandertal admixture in Asian/American populations. But, if the Iceman, and, by implication, Paleolithic Europeans were more similar to Neandertals still, a further complication is added. This may be consistent with ideas from palaeoanthropology about great levels of variation in late Pleistocene humans compared to recent ones. One can imagine that groups varied substantially in their proportions of Neandertal ancestry until fairly recent times, and that homogenizing gene flow then evened out, though not completely, what was initially a very uneven distribution.

UPDATE III (Mutation rate):

The paper appears in a transitional period in our understanding of mutation rates. So, while it presents a much better Denisova genome than the earlier published one, our understanding of how and when the Denisova population diverged from modern humans is now less clear. I have covered some of the mutation rate controversies recently on the basis of three papers: Kong et al., Sun et al., and Langergraber et al.

The discussion in Supplementary Note 10 summarizes the increased uncertainty about the topic:
An important date in human evolution is when the ancestors of modern humans diverged from Denisovans and their sister group the Neandertals. In the paper on the draft sequence of the Neandertal genome, we estimated this date for Neandertals (1). Since Denisovans are a sister group of Neandertals (2), they should have approximately the same population divergence; however, we never assessed this directly. Furthermore, the inference in the Neandertal genome paper was based on assumptions about mutation rates from early 2010. Since that time, better data have become available, lower mutation rates have been suggested, and the true value of the mutation rate has become less certain. It is important to obtain a new date estimate in light of this. 
Paleontological calibration can only take us so far; for example, Neandertaloid traits in the Atapuerca hominins suggest an early split with modern humans and therefore a low mutation rate, but they also suggest a much earlier human-chimp speciation time than commonly thought. I think that a technical solution to the problem will eventually be found, which will show why there is the 2-fold difference in mutation rate estimates.

UPDATE IV (Demographic History): 

The plot on the left shows inferred changes in population size for 12 different populations, using the PSMC approach of Li & Durbin.

One can see that the different populations seem to match quite well until ~750/375ky (depending on mutation rate), when the Denisovan population starts decreasing, and the population of the modern human groups starts increasing. Then at ~110/55ky, population sizes in modern humans begin diverging.

There are two ways to look at this: if one assumes tree-like divergence of populations, then obviously the fact that Denisova spends the period from 750/375 to 100/50ky at a much lower population size than modern humans speaks of an isolated population with limited genetic diversity.

But, as I've mentioned before in this blog, genetic diversity can be created by admixture. Take two populations that diverged a long time ago, even ones with low intra-population diversity, mix them, and the end result will be one very diverse population. In the absence of admixture, variation is generated by mutation, and culled by drift and selection. But, mutation is a random process that adds variation incrementally into the population, with new alleles appearing at a rate proportional to the number of breeding individuals times the per-genome mutation rate. Admixture, on the other hand, introduces a whole bunch of new alleles in a limited amount of time.

Here is what I think may have happened; I will use the older dates, as they currently make more sense to me:

  1. Homo heidelbergensis emerges in western parts of the Old World c. 750ky. Whatever adaptations gave heidelbergensis a bigger brain than erectus spread quickly throughout Europe and Africa. Admixture between European and African hominins at this time and/or expansion of the H. h. population lead to an increase in population size.
  2. Further east, heidelbergensis is less visible, and older erectus populations persist. The Denisovan population can then be seen as an eastern H. h. that had more limited opportunity to expand and/or experience gene flow, because of its remote location; the Denisovans were not unlike isolated Siberian groups of today: substantially less diverse than the bulk of mankind.
  3. Pre-100ka sees the rise of the modern humans. According to my "two deserts" theory, these were a population of AMH living in North Africa.
  4. Post-100ka population histories begin to diverge, but with all population sizes decreasing (consistent with the rise of behaviorally modern humans carrying a small subset of the genetic variation in the broader group of archaic H. sapiens / anatomically modern humans). This is the major bottleneck of modern human origins that has transformed us into a fairly homogeneous species.
  5. But, Africans and non-Africans follow different trajectories, with the former maintaining higher population sizes than the latter. This is probably related to the ecological calamities that befell Eurasians during the 100-50ka period (notably the drying up of the Sahara-Arabia belt post-70ka and/or the Toba eruption), and also to a partial breakdown of African population structure as modern humans expanded deeper into Sub-Saharan Africa and started mixing with pre-existing humans living there, consistent with signals of archaic admixture detected for this period.

Science DOI: 10.1126/science.1224344

A High-Coverage Genome Sequence from an Archaic Denisovan Individual

Matthias Meyer et al.

We present a DNA library preparation method that has allowed us to reconstruct a high-coverage (30X) genome sequence of a Denisovan, an extinct relative of Neandertals. The quality of this genome allows a direct estimation of Denisovan heterozygosity, indicating that genetic diversity in these archaic hominins was extremely low. It also allows tentative dating of the specimen on the basis of “missing evolution” in its genome, detailed measurements of Denisovan and Neandertal admixture into present-day human populations, and the generation of a near-complete catalog of genetic changes that swept to high frequency in modern humans since their divergence from Denisovans.


Scrubbing Sardinians

In a series of posts, I showed that European populations have east Eurasian-like admixture, an element that appears to be lacking in Sardinians. I did this both on the basis of the 3-population test and a number of different comparisons between West Eurasian populations, as well as on the basis of the 4-population test.

The fact that f4(Sardinian, CEU, Asian, African) is negative was interpreted by Moorjani et al. (2011) as evidence that Sardinians have ~2.9% African admixture. As I pointed out at the time, this level of admixture was predicated on the assumption that CEU did not have Asian admixture, and this assumption now appears not to hold.

Of course, the above-mentioned paper also used an admixture LD based method (ROLLOFF) to date the African admixture in Sardinians, coming up with an estimate of ~71 generations. But, we should remember that ROLLOFF does not quantify the extent of this admixture.

Imagine walking along a Sardinian genome: the negative f4 signal is created both by occasional African-like segments you meet along the way and by the presence of East Eurasian SNPs in CEU in other locations where Sardinians may have no African admixture. The f4 signal is a genomewide average that is influenced by two different processes: punctuation by African segments whose length distribution can supply information about the time of their introgression; and the background genome that is lacking in East Eurasian-like polymorphism present in CEU.

In this post, I will show that:
  • The admixture estimate of 2.9% is not robust, but depends on the choice of Asian population for f4 ancestry estimation, consistent with the idea that it is influenced by east Eurasian-like admixture that has affected northern European populations.
  • If Sardinians are "scrubbed" of any trace of African admixture, the negative f4(Sardinian, CEU, Asian, African) signal persists.

Estimates of African admixture in Sardinians depend on choice of Asian/American population

African ancestry in Sardinians was estimated by Moorjani et al. (2011), using the following ratio:

f4(San,Papuan; Sardinian,CEU) / f4(San,Papuan; YRI, CEU)

In Table S6 different ancestral populations were used for f4 ancestry estimation, and all results ranged between 2.9-3.4%.

The signal of east Eurasian-like admixture in northern Europe is strongest when Karitiana are used as an Asian/American reference. If the level of "African" admixture in Sardinians is driven, as I suspect, by the presence of east Eurasian-like admixture in northern Europe, then I expect this admixture to be highest when Karitiana instead of Papuans are used. And, indeed, this is what I observe:

f4(San,Papuan;Sardinian,CEU) = 0.00118099 (Z=10.6838)
f4(San,Papuan;YRI,CEU) = 0.0379664 (Z=88.2287)

(in all experiments I use a set of 28 Sardinians vs. 27 in the Moorjani et al. paper, a set of 112 CEU, 147 YRI, a set of 166,770 SNPs, and -k 200 for fourpop)

therefore, African admixture in Sardinians using Papuan reference = 0.00118099/0.0379664 = 3.1%


f4(San,Karitiana;Sardinian,CEU) =  0.00272141 (Z=22.7288)

f4(San,Karitiana;YRI,CEU) = 0.04449 (Z=100.19)

therefore, African admixture in Sardinians using Karitiana reference = 0.00272141/0.04449 = 6.1%

A ~2-fold difference in African admixture has resulted from a different choice of outgroup. This is unexpected if West Eurasians did not exchange genes with Papuans and Karitiana since their divergence, but expected if CEU received genes from an Asian population that was more like Karitiana and less like Papuans.
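The two ratios above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the f4 values already reported in this post (the helper name is mine); it does not recompute the f4 statistics themselves, which came from fourpop.

```python
def f4_ratio(numerator, denominator):
    """Ancestry estimate as a ratio of two f4 statistics."""
    return numerator / denominator


# f4(San,Papuan;Sardinian,CEU) / f4(San,Papuan;YRI,CEU)
papuan = f4_ratio(0.00118099, 0.0379664)

# f4(San,Karitiana;Sardinian,CEU) / f4(San,Karitiana;YRI,CEU)
karitiana = f4_ratio(0.00272141, 0.04449)

fold = karitiana / papuan
print(f"Papuan reference: {papuan:.1%}")
print(f"Karitiana reference: {karitiana:.1%}")
print(f"fold difference: {fold:.2f}")
```

The fold difference comes out just under 2, which is the "~2-fold" figure quoted above.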

Scrubbing Sardinians

Another way to demonstrate that east Eurasian-like admixture in CEU is inflating the perceived level of African-like admixture in Sardinians is to comprehensively "scrub" Sardinians of all traces of African ancestry by replacing segments of their DNA when there is even a hint of such ancestry with missing values.

Going back to the mental experiment of walking along the Sardinian genome, we are going to remove spots of even remote possibility of African admixture. It will be shown that CEU continues to have evidence of east Eurasian-like admixture using the scrubbed Sardinians, suggesting that it is not only African-like admixture in Sardinians generating this signal, but also East Eurasian-like admixture in CEU.

I used DIYDodecad to do this scrubbing, but one could potentially try any approach that can identify African segments, such as HAPMIX or PCA. I used the dataset assembled for K7b and K12b, and carried out a K=3 ADMIXTURE analysis, which resulted in 3 components centered on West Eurasia, Asia, and Africa. I chose not to use an African component from higher-K (e.g. the K7b calculator), because it is conceivable that African ancestry might be lurking in southern Caucasoid components inferred with these tools (e.g., the "Southern" component of K7b or the "Southwest Asian" one of K12b). The average African admixture in Sardinians using the K3b calculator is 0.9%, and for the subset of CEU used it is 0.2%.

Using the byseg mode of DIYDodecad, I created ancestry maps of the 28 HGDP Sardinians, and I only kept windows where the African admixture was exactly 0%. This is a very aggressive scrubbing, designed to remove virtually all African admixture from the population. For example, if a window has 99.9% West Eurasian admixture and 0.01% African, I will nonetheless remove it, even though chances are extremely high that the 0.01% represents only noise. I did not want to leave any doubt that any trace of identifiable African ancestry remained in my "scrubbed Sardinians".
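A minimal sketch of this scrubbing step, assuming per-window African fractions are already available as (start, end, fraction) tuples. That layout, the function name and the toy genotypes are hypothetical; this is not the DIYDodecad byseg output format, only an illustration of the rule "keep a window only if its African fraction is exactly zero".

```python
def scrub(genotypes, windows, threshold=0.0):
    """Return a copy of genotypes with SNPs in African-positive windows set to None.

    genotypes: list of 0/1/2 allele counts, one per SNP
    windows:   list of (start, end, african_fraction); end is exclusive
    """
    scrubbed = list(genotypes)
    for start, end, african in windows:
        if african > threshold:
            # Any trace of African ancestry: mark the whole window as missing.
            for i in range(start, end):
                scrubbed[i] = None
    return scrubbed


genotypes = [0, 1, 2, 1, 0, 2, 1, 1]
windows = [(0, 3, 0.0), (3, 6, 0.0001), (6, 8, 0.0)]
scrubbed = scrub(genotypes, windows)
print(scrubbed)
```

Even the 0.01% window in the toy example is removed, which mirrors how aggressive the real scrubbing was.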

I am very confident that my scrubbed Sardinians do not have any hint of African ancestry, but you can decide for yourselves. I base my confidence on (a) the extreme nature of the scrubbing, which threw away much of the Sardinian genome in order to ensure that no hints of local African ancestry remained, (b) re-assessment of the scrubbed Sardinians with K3b showing that they are now 100% West Eurasian, and (c) ab initio ADMIXTURE analysis of CHB, YRI, CEU, and scrubbed Sardinians, demonstrating that the latter are 100% West Eurasian, while CEU has traces of 0.1% African and 0.3% Asian ancestry.

So, here are the results for the scrubbed Sardinians:

f4(San,Papuan;Sardinian_scrubbed,CEU) = 0.000678108 (Z=4.05225)
f4(San,Papuan;YRI,CEU) = 0.0379664 (Z=88.2287)
so scrubbed Sardinians with Papuan reference appear 0.000678108 / 0.0379664 = 1.8% African


f4(San,Karitiana;Sardinian_scrubbed,CEU) = 0.00205526 (Z=11.2848)
f4(San,Karitiana;YRI,CEU) = 0.04449 (Z=100.19)
so scrubbed Sardinians with Karitiana reference appear 0.00205526/0.04449 = 4.6% African

Despite the thorough scrubbing, Sardinians continue to show African admixture using f4 ancestry estimation. This is consistent with the idea that much of the African ancestry inferred using f4 ancestry estimation in Sardinians is an artifact of not taking into account east Eurasian-like admixture in CEU.

Conversely, a significant signal of east Eurasian-like admixture in CEU persists whether one uses regular or scrubbed Sardinians:

With regular Sardinians

f4(San,Papuan;Sardinian,Karitiana) = 0.0084678 (Z=21.2137)
f4(San,Papuan;Sardinian,CEU) = 0.00118099 (Z=10.6838)

So, CEU appears = 0.00118099/0.0084678 = 13.9% East Eurasian

With scrubbed Sardinians

f4(San,Papuan;Sardinian_scrubbed,Karitiana) = 0.00774427 (SE=0.00056725, Z=13.6523)
f4(San,Papuan;Sardinian_scrubbed,CEU) = 0.000678108 (SE=0.000167341, Z=4.05225)

So, CEU appears = 0.000678108/0.00774427 = 8.8% East Eurasian
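For the scrubbed-Sardinian runs three numbers are given per test. Reading them as estimate, standard error and Z-score is my interpretation of the output, and it can be checked directly, since Z should then equal the estimate divided by the standard error. The sketch below does that check and recomputes the ratio; the dictionary layout is just for illustration.

```python
# (estimate, standard error, reported Z) for the scrubbed-Sardinian runs
runs = {
    "Karitiana": (0.00774427, 0.00056725, 13.6523),
    "CEU": (0.000678108, 0.000167341, 4.05225),
}

recomputed_z = {}
for name, (estimate, std_err, z_reported) in runs.items():
    # If the middle number is a standard error, this should match z_reported.
    recomputed_z[name] = estimate / std_err
    print(name, round(recomputed_z[name], 3), z_reported)

# Ratio of the two estimates, as in the text above.
east_eurasian_share = runs["CEU"][0] / runs["Karitiana"][0]
print(f"{east_eurasian_share:.1%}")
```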


My "palimpsest" idea seems to be confirmed by the data. A first observation is that the level of African-like admixture in Sardinians depended on whether one used Papuans or Karitiana as an outgroup, suggesting that neither population was a true outgroup, and the signal of African admixture in Sardinians was driven in part by East Eurasian-like admixture in CEU. African admixture in Europe cannot be assessed accurately if one ignores the confounding effect of East Eurasian admixture.

When I aggressively scrubbed Sardinians so as to remove all traces of African ancestry, part of the African admixture fraction disappeared (expected, since African ancestry was removed from Sardinians), but a substantial part of it remained (unexpected, if the signal was driven only by African admixture, but expected, if it was driven in part by East Eurasian-like admixture in CEU). Conversely, using scrubbed Sardinians reduced, but did not make disappear, the admixture estimate for CEU.

August 29, 2012

Pre-Neolithic dispersals into Arabia

The harsh climate of Arabia, periodically interrupted by more "green" periods, has probably meant that the population living there has occasionally been driven out as climate deteriorated, with new populations moving in as climate improved. In more recent times, technological invention (e.g., the camel, the deep water well, or even more recently the discovery of oil) has allowed people to subsist in the desert a little more "comfortably."

One interesting question is whether the current Arabian population derives entirely from early Levantine Neolithic peoples, or also from people who had ventured there prior to it. A new paper in AJPA suggests that living Arabians are not entirely the descendants of Neolithic peoples, but also preserve signals of pre-Neolithic input from the Near East, by studying the mtDNA haplogroup R2 (see map on left for its current distribution).

From the paper:

It is noteworthy, however, that these pre-Neolithic sites do not bear any technological traits analogous to Terminal Pleistocene (Epipalaeolithic) assemblages found in the Near East. The only germane possibility of a connection between Arabia and the Near East during this period comes from the Faw Well site at the western edge of the Rub’ Al Khali (Edens, 2001). Although undated, the Faw Well lithic assemblage bears a close resemblance to the Late Ahmarian of the Levant (20–17 ka). Perhaps it was this, or a subsequent pulse from the Levant, that provided the demographic input expressed by the genetic lineages documented in this article.   
The results from the three analyzed southern Arabian clades do not support population continuity from the first occupants more than 50 ka ago (Fernandes et al., 2012) but do suggest some continuity across the Pleistocene- Holocene boundary. Our analysis indicates that the observed population expansion 13–12 ka is probably the result of genetic input from the Near East a few thousand years before the (debated) arrival of the PPNB culture in Arabia. If, however, there was a population expansion southward through Arabia some 13–12 ka, we have not yet found its archaeological signatures. Both regions exhibit stone tool technologies with some overlapping features, so it is warranted to suppose that we may one day locate a firm link between southern Arabia and the Near East sometime during the Late Pleistocene. Given the vast amount of unexplored territory in Arabia and paucity of archaeological sites with numerical ages, future investigations (both archaeogenetical and archaeological) throughout the Peninsula will undoubtedly serve to shed more light on this question. 

American Journal of Physical Anthropology DOI: 10.1002/ajpa.22131

Pleistocene-Holocene boundary in Southern Arabia from the perspective of human mtDNA variation

Abdulrahim Al-Abri et al.

It is now known that several population movements have taken place at different times throughout southern Arabian prehistory. One of the principal questions under debate is if the Early Holocene peopling of southern Arabia was mainly due to input from the Levant during the Pre-Pottery Neolithic B, to the expansion of an autochthonous population, or some combination of these demographic processes. Since previous genetic studies have not been able to include all parts of southern Arabia, we have helped fill this lacuna by collecting new population datasets from Oman (Dhofar) and Yemen (Al-Mahra and Bab el-Mandab). We identified several new haplotypes belonging to haplogroup R2 and generated its whole genome mtDNA tree with age estimates undertaken by different methods. R2, together with other considerably frequent southern Arabian mtDNA haplogroups (R0a, HV1, summing up more than 20% of the South Arabian gene pool) were used to infer the past effective population size through Bayesian skyline plots. These data indicate that the southern Arabian population underwent a large expansion already some 12 ka. A founder analysis of these haplogroups shows that this expansion is largely attributed to demographic input from the Near East. These results support thus the spread of a population coming from the north, but at a significantly earlier date than presently considered by archaeologists. Our data suggest that some of the mtDNA lineages found in southern Arabia have persisted in the region since the end of the Last Ice Age.


August 28, 2012

Paleolithic Europeans may have been substantially Neandertal-admixed

In Oetzi the Neandertal Champion, I suggested a way to determine whether the higher Neandertal admixture in Oetzi suggested by Sams and Hawks is due to his Near Eastern Neolithic or European Paleolithic ancestry.

The basic idea is simple: first build a map of Oetzi's ancestry with a tool that distinguishes between the Near East and Europe. I did this with my West Eurasian cline (weac2) calculator. The output of this procedure is to calculate admixture proportions for Oetzi across the genome. In some windows along his chromosomes, he will appear to be 100% Atlantic_Baltic (European-centered component), in others 100% Near_East, and in some intermediate, or possessing some other component. At each window we have an Atlantic_Baltic and a Near_East admixture score.

Secondly, we need to calculate a score of Oetzi-Neandertal (Vindija) similarity in the same window. I used the Neandertal data from the Harvard HGDP, and my own copy of the Oetzi genome which I've created by intersecting a SNP file provided by Andreas Keller with the Stanford HGDP set of SNPs. In the end I combined Oetzi and Vindija in a common set of 37,320 SNPs, removing all SNPs with missing alleles. One could get more SNPs by not taking these various intersections and working with full genomes, but this set of SNPs suffices for my purposes.

In any case, I used 0/1/2 coding and took the absolute value of the Oetzi minus the Vindija value, normalizing by dividing with the number of SNPs in each window. That was my Score variable, and the lower the value the more Oetzi matches Vindija.
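The Score for one window can be sketched like this, assuming the two genotype vectors are already aligned on the same SNPs and contain no missing values. The function name and the toy genotypes are illustrative only.

```python
def window_score(oetzi, vindija):
    """Mean absolute difference of 0/1/2 genotypes over the SNPs in one window.

    0.0 means the two genomes match at every SNP; 2.0 means they are
    opposite homozygotes everywhere.
    """
    if len(oetzi) != len(vindija) or not oetzi:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    total = sum(abs(o - v) for o, v in zip(oetzi, vindija))
    return total / len(oetzi)


identical = window_score([0, 1, 2, 2], [0, 1, 2, 2])
opposite = window_score([0, 0, 2, 2], [2, 2, 0, 0])
mixed = window_score([0, 1, 2, 1], [0, 2, 2, 0])
print(identical, opposite, mixed)
```

Dividing by the number of SNPs keeps windows with different SNP counts on the same 0-to-2 scale.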

The idea is simple: does Oetzi tend to appear "Near_East" or "Atlantic_Baltic" in places along his genome where he is close to Neandertals?

I limited myself to windows where there were at least 10 SNPs common between Oetzi and Vindija, as well as windows where the sum of Atlantic_Baltic and Near_East was at least 95%, so there was good evidence that these two components were responsible for the whole diploid pair of segments. A total of 1,128 windows remained. The results are as follows:

Cor("Near_East", Score) = +0.082
Cor("Atlantic_Baltic", Score) = -0.079

These are small, but significant, and we should remember that relative levels of Near_East and Atlantic_Baltic vary for reasons unrelated to Neandertal ancestry in most of the genome. Here is a plot of the Score variable for the 1,128 windows, ordered from high-to-low:
There is a group of windows with particularly low Score, so perhaps these represent the strongest evidence for Neandertal ancestry in Oetzi. Any such ancestry may mostly consist of small segments, so my not-so-dense sieve formed by the small number of studied SNPs is probably missing a lot of Neandertal segments that may turn up with full genome comparisons.
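As a rough check on the claim that the two correlations are small but significant, one can convert each Pearson r into a t statistic for n = 1,128 windows. The sketch below uses a normal approximation for the p-value and treats the windows as independent, which is an assumption on my part (neighbouring windows may well be correlated), so take the p-values as indicative only.

```python
import math


def correlation_test(r, n):
    """t statistic for a Pearson r, with a two-sided normal-approximation p-value."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


n_windows = 1128
t_near_east, p_near_east = correlation_test(0.082, n_windows)
t_atl_baltic, p_atl_baltic = correlation_test(-0.079, n_windows)

print("Near_East", round(t_near_east, 2), round(p_near_east, 4))
print("Atlantic_Baltic", round(t_atl_baltic, 2), round(p_atl_baltic, 4))
```

Under these assumptions both p-values fall below 0.01.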

In any case, the median Atlantic_Baltic for all windows is 46.48%, and the median Near_East one is 52.78%. But, let's see how these numbers look when we consider the lowest quantiles of Score (=more Neandertal matching):

These numbers kinda speak for themselves. Of course, it is wrong to equate Near_East = Neolithic and Atlantic_Baltic = Paleolithic. On the other hand, the assumption that these two components possess a greater relationship with the Neolithic farmers and the Paleolithic Europeans respectively, seems justified.

So, it seems that in regions where Oetzi matches the Vindija genome, he tends to be "Atlantic_Baltic". The implication is that Paleolithic Europeans were more Vindija-like than incoming Neolithic ones from the Near East. Oetzi may have been more Neolithic farmer than Paleolithic hunter-gatherer across his whole genome, but the situation is reversed for regions suggestive of Neandertal ancestry.

The idea that Upper Paleolithic Europeans were admixed with Neandertals is not new. Its most recent prominent champion is Milford Wolpoff (figure on the left is from one of his most recent works, showing a Copper Age European male (top) sharing features with La Chapelle Neandertal (middle) at the exclusion of Herto (an archaic H. sapiens from Ethiopia), indicating a degree of continuity from Neandertals to more recent Europeans.)

Just how Neandertal-admixed were the Paleolithic Europeans? If Hawks is correct in his claim that Oetzi was ~5.5% Neandertal, and given that Oetzi appears to have been overall more than 50% incoming farmer and less than 50% local hunter-gatherer (conservatively), then it is easy to conclude that even ~10% Neandertal for Paleolithic Europeans may not be too far from the truth. We'll have to see what their actual ancient DNA looks like to confirm the hypothesis found in this post.

Finally, since it's my custom to resuscitate old physical anthropology when it matches modern observation, here's Carleton Coon's 1939 Races of Europe, from his "Statement of Aims and Proposals":
At any rate, the main conclusion of this study will be that the present races of Europe are derived from a blend of (A), food-producing peoples from Asia and Africa, of basically Mediterranean racial form, with (B), the descendants of interglacial and glacial food-gatherers, produced in turn by a blending of basic Homo sapiens, related to the remote ancestor of the Mediterraneans, with some non-sapiens species of general Neanderthaloid form. The actions and interactions of environment, selection, migration, and human culture upon the various entities within this amalgam, have produced the white race in its present complexity.
I'd say that if these results are confirmed by subsequent research, then "bullseye" is a good way to describe the above passage.

(But, I will not refrain from spoiling the fun a little bit, by pointing out that the potential high similarity of UP Europeans with the Vindija genome may be due, at least in part, to gene flow from UP Europeans-to-European Neandertals in that particular specimen. Now, such gene flow may have gradually made the European Neandertals more modern-like, and thus facilitated their eventual full absorption into the gene pool of subsequent Europeans. But, we don't know that for sure. A second Neandertal genome, preferably a pre-contact one, will be the decisive factor in determining the direction of gene flow conclusively.)

August 27, 2012

3-population test and east Eurasian-like admixture in Europe or The Isle of Refuge

The 3-population test (Reich et al. 2009) allows one to detect the presence of admixture in a population X from two other populations A and B. The value

f3(X; A, B)

is negative when X does not appear to form a simple tree with A and B but appears to be a mixture of A and B.

In a previous entry, I noted that continental European populations, and especially northern Europeans, appear to have East Eurasian-like admixture on the basis of the 4-population test. The results of that test are more difficult to interpret, because the quantity f4(X, Y; A, B) can take significant negative or positive values depending on the relationships of populations X, Y with A, B. When A, B are East Eurasian and African populations respectively, and X, Y are West Eurasian ones, East Eurasian-like admixture in a northern European population will affect the f4 quantity in the same way as African-like admixture in a southern Caucasoid one. This is not a problem with the f3 test, although caution is needed: a negative value indicates deviation from "treeness" and admixture, but a positive one does not reject admixture.
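To make the two statistics concrete, here is a naive sketch that computes them from per-SNP allele frequencies: f3(X; A, B) as the average of (x - a)(x - b) and f4(A, B; C, D) as the average of (a - b)(c - d). This is a simplification, not what threepop and fourpop do: it ignores the finite-sample bias correction and the block jackknife used to obtain Z-scores, and the toy frequencies are made up.

```python
def f3(x, a, b):
    """Naive f3(X; A, B): mean over SNPs of (x - a) * (x - b)."""
    return sum((xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b)) / len(x)


def f4(a, b, c, d):
    """Naive f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    return sum((ai - bi) * (ci - di)
               for ai, bi, ci, di in zip(a, b, c, d)) / len(a)


# Toy allele frequencies for two source populations at four SNPs.
pop_a = [0.1, 0.8, 0.3, 0.6]
pop_b = [0.7, 0.2, 0.9, 0.1]
# A 50/50 mixture of A and B.
mixed = [0.5 * ai + 0.5 * bi for ai, bi in zip(pop_a, pop_b)]

f3_mixed = f3(mixed, pop_a, pop_b)      # target is the admixed population
f3_source = f3(pop_a, mixed, pop_b)     # target is one of the sources
print(f3_mixed, f3_source)
```

With the admixed population as the target, every per-SNP term is -0.25 (a - b)^2, so the statistic is negative; with a source population as the target it is positive, which illustrates why only the negative direction is informative.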

The f3 statistics were calculated with the threepop program of TreeMix with -k 500 over a set of 598,467 SNPs.

I have used 3 Asian/American reference populations (Karitiana from South America, CHB Chinese, and Papuans) and calculated the following:

f3(West Eurasian 1; West Eurasian 2, Asian/American)

As noted above, negative values of this indicate that West Eurasian 1 can be seen as an admixed population of West Eurasian 2 + Asian/American. The set of 14 West Eurasian populations used is:
CEU, TSI, Tuscan, Orcadian, French, French_Basque, North_Italian, Bedouin, Palestinian, Druze, Mozabite, Adygei, Russian, Sardinian
I thus report 2*(14 choose 2)*3 = 546 values of f3. Hence, I did not privilege Sardinians as a reference point, but instead tried all pairs of West Eurasian populations, and 3 different American/Asian references. The results can be found in the spreadsheet.
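The count of 546 can be confirmed by enumerating the tests. The sketch below only builds the list of (West Eurasian 1, West Eurasian 2, reference) triples; it does not compute any f3 values.

```python
from itertools import permutations

west_eurasian = [
    "CEU", "TSI", "Tuscan", "Orcadian", "French", "French_Basque",
    "North_Italian", "Bedouin", "Palestinian", "Druze", "Mozabite",
    "Adygei", "Russian", "Sardinian",
]
references = ["Karitiana", "CHB", "Papuan"]

# Ordered pairs, because f3(X; Y, ref) and f3(Y; X, ref) are different tests.
triples = [
    (target, other, ref)
    for target, other in permutations(west_eurasian, 2)
    for ref in references
]
print(len(triples))
```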

Out of the 546 triples, 64 show an f3 score with Z less than or equal to -3, and are thus significant.

The following populations have such a score in at least one pairwise comparison, when they are set as West Eurasian 1, and thus appear to have east Eurasian-like admixture:

CEU, Russian, French, Adygei, TSI, Tuscan, Orcadian, North_Italian, Palestinian  
Note that east Eurasian-like admixture cannot be rejected for the other populations, but it can be confirmed for the above. Moreover, the mean strength of the observed effect for the significant comparisons was Z=-5.5 for Papuan reference, Z=-10.2 for CHB, and Z=-10.9 for Karitiana, again suggesting a northern origin of the east Eurasian-like admixture, albeit without so major a difference between Karitiana and CHB as in the 4-population test.

But, it is worth reading the raw data. For example, note above that of the Middle Eastern and North African populations, only Palestinians show a negative f3 score in any pairwise comparison. And actually they only do so for f3(Palestinian; Sardinian, Papuan) with a Z-score of -4.1. So, it appears that Palestinians have undergone admixture of a different sort than Europeans.

Significant differences were observed for Sardinians as West Eurasian 2 in 21 cases, for French Basque in 11 cases, for North_Italian and TSI in 6 cases, for CEU, Orcadian, French, and Tuscan in 4 cases. So, other populations appear east Eurasian-like admixed relative to Sardinians, and a couple of populations (Russian and Adygei) also appear so admixed relative to west Europeans.

Oetzi the Tyrolean Iceman

The fact that Europeans appear admixed with an east Eurasian-like element when compared with Sardinians does not mean that Sardinians may not also be admixed with this element. I used the genome of the Tyrolean Iceman (Keller et al. 2012) to test whether Sardinians appear east Eurasian-like admixed relative to the Iceman.

f3(Sardinian;Karitiana,Oetzi) = 5.36496e-06 (Z=0.00940612)

This might indicate no admixture, but f3 can only detect admixture; it cannot prove its absence. The f4 is suggestive:

f4(Sardinian,Oetzi;Karitiana,San) = -0.00221783 (Z=-3.06251)

You should probably not take my word for the above. It may appear that, contrary to expectation, Oetzi was more east Eurasian-like than modern Sardinians. Indeed, in my initial analysis of him with ADMIXTURE, I found that he was 2.8% East_Asian, which would point to an East Eurasian shift of Oetzi relative to Sardinians, and which might be consistent with the f4 result. On the other hand, the negative f4 score could be related to African-like gene flow. On balance I would say that Sardinians appear quite similar to Oetzi.
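The ambiguity between the two readings of a negative f4 can be illustrated with a toy calculation. The sketch below uses the naive form f4(X, Y; A, B) = mean of (x - y)(a - b) and independent, made-up allele frequencies for three source populations; it is not a model of the real populations, and all names and mixture fractions are assumptions. It only shows that a small East Eurasian-like contribution to Y and a small African-like contribution to X push the statistic in the same direction.

```python
import numpy as np


def f4(x, y, a, b):
    """Naive f4(X, Y; A, B): mean over SNPs of (x - y) * (a - b)."""
    x, y, a, b = (np.asarray(v, dtype=float) for v in (x, y, a, b))
    return float(np.mean((x - y) * (a - b)))


# Hypothetical, independent allele frequencies for three sources.
rng = np.random.default_rng(1)
n = 20000
west = rng.uniform(0.05, 0.95, n)
east = rng.uniform(0.05, 0.95, n)
african = rng.uniform(0.05, 0.95, n)

# Scenario 1: Y carries 5% East Eurasian-like ancestry, X is unadmixed.
y_east_shifted = 0.95 * west + 0.05 * east
case_east = f4(west, y_east_shifted, east, african)

# Scenario 2: X carries 5% African-like ancestry, Y is unadmixed.
x_african_shifted = 0.95 * west + 0.05 * african
case_african = f4(x_african_shifted, west, east, african)

print(case_east, case_african)  # both come out negative in this toy setup
```

Since the sign alone cannot separate the two scenarios, other evidence (such as the ADMIXTURE result mentioned above) is needed to favor one of them.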

Gok4 and Ajv52

Furthermore, I carried out the same analysis on Neolithic samples from Sweden (Skoglund et al. 2012). The number of SNPs here is much smaller. Results are:

Gok4 (TRB farmer): f4(Sardinian,Gok4;Karitiana,San) = -0.00167365 (Z=-1.23616)
Ajv52 (PWC hunter-gatherer): f4(Sardinian,Ajv52;Karitiana,San) = -0.004676 (Z=-3.76048)

While I would not bet the farm on these results (because of the small number of SNPs and the fact that they're based on a single individual), they do seem to suggest that these Neolithic Swedes were east Eurasian shifted relative to Sardinians. For example, for my Swedish_D sample, I get f4(Sardinian, Swedish_D; Karitiana, San) = -0.00372751 (Z=-22.8715). The Z-score is stronger (probably because of the much larger number of SNPs), but the f4 value of Ajv52 is lower (more east Eurasian-like). Modern Swedish_D appears intermediate between Gok4 and Ajv52, so this may suggest that Mesolithic Europeans may be, at least in part, the source of this element.

(Comparison with Brana-1 Mesolithic Iberian indicates a negative non-significant f4 score, but with an even smaller number of SNPs).

In sum, my experiments with ancient DNA samples from Europe suggest a little more east Eurasian-like shift relative to Sardinians (or conversely a little more African-like shift in Sardinians). Oetzi (who has the highest quality genome) appears to be so shifted, and Ajv52 (a Neolithic northern hunter-gatherer) appears to be so as well. I am sure that if we get more high quality ancient DNA from Europe, some clear pattern may emerge, but I would not speculate further on the basis of these initial results.

Isle of Refuge

The above set of experiments has revealed once more that "there's something about Sardinians." There is perhaps a reason for the fact that the arrival of population elements from continental Europe seems to have bypassed them to some degree, or, at least affected them least. However it was that continental Europeans got their east Eurasian-like shift, the great tank of European genetic variation does not seem to have achieved equilibrium with the little cup of Sardinia. Something stood in the way.

Sardinia is the westernmost of the large Mediterranean islands. It is more distant from mainland Europe/Asia than the other big islands (Cyprus, Crete, Sicily, and Corsica).

And, unlike islands much smaller than itself, its size has probably been instrumental in affording it a certain autonomy and continuity of population. Only Sicily is larger, but one can practically swim across the Strait of Messina to reach it from the Italian peninsula.

Hence, a combination of large size, western geographical location, and distance from the mainland has contributed to the continuity of its population. But, geography may not have been sufficient if other events had not taken place. Through a combination of favorable geography and historical contingency, the Sardinians made it to the present largely unscathed, and, among their other graces, can now help scientists figure out what happened to the rest of us.

Out-of-Iberia (Arenas et al. 2012)

A new paper argues that the SE-NW gradient of genetic variation in modern Europeans is consistent with a large Paleolithic contribution to the European gene pool if modern Europeans are principally descended from people who spent the last Ice Age in the Iberian refugium.

In my opinion, the question cannot be solved on the basis of modern populations alone: clines do not carry dates, and can be formed by accretion of different events operating under the constraints of a given geography. Ancient DNA has already begun to inform our view of the past: we now have data from Mesolithic Iberians, the presumed denizens of a pre-farming refugium, and they do not appear closely related to modern Iberians. Moreover, Europe as a whole shows discontinuity between Neolithic and Mesolithic populations, and even between Neolithic and modern ones.

If humans expanded from Iberia in postglacial times, and modern Europeans are largely descended from them, then it is strange that the gene pool of Mesolithic Europeans is so restricted: why didn't the Out-of-Iberians create a modern European-like mtDNA and Y chromosome gene pool in the thousands of years intervening between deglaciation and the available DNA samples?

Arguably, the ancient DNA record of Europe (except in the case of mtDNA) is still in its infancy and there may be more surprises to come. But, the way things look right now, Paleolithic genetic continuity does not seem warranted. As more ancient samples accumulate from different regions and different periods, we will see how clines of variation in modern populations were formed. But, if the history of the field is any guide, we're probably in for a few strange surprises.

Mol Biol Evol (2012) doi: 10.1093/molbev/mss203

Influence of admixture and Paleolithic range contractions on current European diversity gradients

Miguel Arenas et al.

Cavalli-Sforza and colleagues (1963) initiated the representation of genetic relationships among human populations with principal component analysis (PCA). Their study revealed the presence of a southeast–northwest (SE-NW) gradient of genetic variation in current European populations, which was interpreted as the result of the demic diffusion of early Neolithic farmers during their expansion from the Near East. However, this interpretation has been questioned, as PCA gradients can occur even when there is no expansion, and because the first PC axis is often orthogonal to the expansion axis. Here, we revisit PCA patterns obtained under realistic scenarios of the settlement of Europe, focusing on the effects of various levels of admixture between Paleolithic and Neolithic populations, and of range contractions during the Last Glacial Maximum (LGM). Using extensive simulations, we find that the first PC (PC1) gradients are orthogonal to the expansion axis, but only when the expansion is recent (Neolithic). More ancient (Paleolithic) expansions alter the orientation of the PC1 gradient due to a spatial homogenization of genetic diversity over time, and to the exact location of LGM refugia from which re-expansions proceeded. Overall we find that PC1 gradients consistently follow a SE-NW orientation if there is a large Paleolithic contribution to the current European gene pool, and if the main refuge area during the last ice age was in the Iberian Peninsula. Our study suggests that a SE-NW PC1 gradient is compatible with little genetic impact of Neolithic populations on the current European gene pool, and that range contractions have affected observed genetic patterns.


When Eurasians got lighter skin

My default position is to doubt all molecular dates until I understand how they were derived. Nonetheless, these results seem broadly consistent with the idea that Eurasian modern humans got lighter as their ancestors moved into more northern latitudes of the Old World and replaced Neandertals and other earlier Eurasian occupants, and then they got really lighter post-LGM, and then some got really really lighter with mutations in genes such as SLC24A4 (not studied here).

I suppose we will really find out who got what mutation when only through ancient DNA.

Mol Biol Evol (2012) doi: 10.1093/molbev/mss207

The timing of pigmentation lightening in Europeans

Sandra Beleza et al.

The inverse correlation between skin pigmentation and latitude observed in human populations is thought to have been shaped by selective pressures favoring lighter skin in order to facilitate vitamin D synthesis in regions far from the equator. Several candidate genes for skin pigmentation have been shown to exhibit patterns of polymorphism that overlap the geospatial variation in skin color. However, little work has focused on estimating the timeframe over which skin pigmentation has changed and on the intensity of selection acting on different pigmentation genes. To provide a temporal framework for the evolution of lighter pigmentation, we used forward Monte Carlo simulations coupled with a rejection sampling algorithm to estimate the time of onset of selective sweeps and selection coefficients at four genes associated with this trait in Europeans: KITLG, TYRP1, SLC24A5, and SLC45A2. Using compound haplotype systems consisting of rapidly evolving microsatellites linked to one SNP in each gene, we estimate that the onset of the sweep shared by Europeans and East Asians at KITLG occurred about 30,000 years ago, after the out-of-Africa migration, while the selective sweeps for the European-specific alleles at TYRP1, SLC24A5, and SLC45A2 started much later, within the last 11,000-19,000 years, well after the first migrations of modern humans into Europe. We suggest that these patterns were influenced by recent increases in size of human populations, which favored the accumulation of advantageous variants at different loci.


August 26, 2012

Inter-relationships of the Dodecad K12b and world9 components

Pconroy made a most excellent suggestion in the comments of a previous post, so I decided to follow up on it. His idea is to see what Dodecad components look like when they're measured in terms of other components. So, I took the K12b components and carried out the following procedure:

I used each of the 12 different components as "test data" in a supervised ADMIXTURE analysis that used the other 11 components as "reference data". This simple procedure can show what each component appears to be made of, if it is seen in the context of the remaining components. It is a good way to demonstrate relationships between them.

Here are the results:

Some observations:

  • Gedrosia appears to be Caucasus + a slice of Siberian
  • Both Siberian and Southeast Asian appear to be wholly East Asian
  • East Asian on the other hand, appears to be mostly Southeast Asian + minority Siberian
  • Northwest African appears to be Caucasus + a minority Sub Saharan
  • Atlantic Med appears to be Caucasus + a slice of North European
  • North European appears to be Atlantic Med + Gedrosia with a slice of Siberian
  • South Asian appears to be Caucasus + East Asian
  • East African appears to be Sub Saharan + minority Caucasus
  • Southwest Asian appears to be Caucasus
  • Sub Saharan appears to be East African
  • Caucasus appears Atlantic Med + Gedrosia + slices of Northwest African and Southwest Asian
The most salient point about this analysis is the central position of the Caucasus component vis a vis the others, consistent with my womb of nations theory. Not only do all West Eurasian components (except the North European) appear substantially "Caucasus" in this analysis, but the Caucasus component itself shows links with four others.

It could be argued that these results represent a confluence of peoples from all over West Eurasia into the highlands of West Asia where the Caucasus component is modal. But, the Caucasus region is arguably the most linguistically diverse in West Eurasia, and many of its languages do not appear to have come from elsewhere. Also, the Near East (where the Caucasus component is the most important one in most populations) is the birthplace of agriculture, which has demonstrably affected most of West Eurasia. On balance, this analysis seems consistent with population expansions out of West Asia.

The following graph summarizes the relationship between the 12 components. I used color intensity of the edges to indicate admixture intensity:

Finally, a few points to remember: 
  • the South Asian component appears like a mix of Caucasus and East Asian; the latter probably acts as a stand-in for the Ancestral South Indians of Reich et al. (2009)
  • Similarly, the Gedrosia/Siberian influences on the North European component do not necessarily mean direct influences from these two regions; an explanation for these influences may intersect with the issue of East Eurasian-like ancestry in northern Europe
  • It is the Caucasus, rather than the Southwest Asian component, that seems to donate to the Northwest African and East African ones. That seems to flout geography, but probably indicates that the Southwest Asian component, with its strong Semitic associations (see distribution in K12b spreadsheet) represents a more specialized form of the more generalized Caucasus component.
  • Some components appear to be "terminal", affected but not much affecting: Southwest Asian, Northwest African, and Southeast Asian. These tend to appear at high K in admixture analyses, and probably represent either recent mixtures (Northwest African) or specialized forms of more generalized ones (Southwest Asian of Caucasus and Southeast Asian of East Asian)
  • Finally, remember that living populations show admixture proportions of many of these components. So, for example, the East African population often has Southwest Asian admixture, even though the East African component lacks it. And, as mentioned above, this may reflect the more generalized west Asian admixture that has affected East Africa, as well as the more specific Arabian admixture, associated e.g., with the spread of Semitic languages. Please refer to the K12b spreadsheet for admixture proportions of populations for the 12 components.
I have also done the same with the world9 calculator, which includes Amerindian and Australasian components. Here is how the world9 components are seen as mixtures of the remaining ones:

And, here is the graph showing how they seem to contribute to each other.

A few observations:

  • Amerindian appears wholly Siberian
  • East Asian appears Siberian + South Asian + slice of Australasian
  • African appears South Asian. I would attribute this to Africans being related to both West and East Eurasians approximately symmetrically, so in this type of experiment, South Asian (which is an ANI/ASI mix) appears like the best match
  • Atlantic_Baltic appears Caucasus_Gedrosia + Southern + slice of Amerindian
  • Australasian appears South Asian. I would guess that ASI and Australo-Melanesians share deep common ancestry from the earliest settlement of southern parts of Asia.
  • Siberian appears East Asian + slice of Amerindian
  • Caucasus_Gedrosia and Southern appear Atlantic_Baltic
  • South Asian appears an about equal mix of East Asian + Caucasus_Gedrosia + slice of Australasian
Raw data for these experiments can be found here.

August 25, 2012

ADMIXTURE 1.22 correcting Fst bias

In many of my experiments I have used ADMIXTURE versions prior to 1.22. According to the authors' website, version 1.22 (3/10/2012):
Fst estimates were upward biased; have now switched to the method of Reynolds et al. (1983).
This probably means that many of the Fst divergences reported here and in the Dodecad blog must be reduced. This is not really a big problem, since, biased or not, the reported numbers show the relative similarity of different components. But, I decided to investigate, so I re-ran the ADMIXTURE analysis that created the K7b calculator.

The correlation with the old (ver. 1.21) Fst values is very strong (+0.9993209) and the new values can be estimated from the old using the following regression:

New = 0.782324*Old + 0.009335
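As a small convenience, the regression above can be wrapped in a function. This is only a sketch: the function name is made up, and the coefficients are the ones fitted on the K7b components, so applying them to other runs is an assumption.

```python
def corrected_fst(old_fst: float) -> float:
    """Approximate the ADMIXTURE 1.22 Fst from a pre-1.22 value.

    Uses the linear fit reported for the K7b components:
    New = 0.782324 * Old + 0.009335.
    """
    return 0.782324 * old_fst + 0.009335


# Example with a made-up old value of 0.1.
print(round(corrected_fst(0.1), 6))  # 0.087567
```

Note that the fit has a positive intercept, so for old values below about 0.043 it would return a larger number rather than a smaller one; extrapolating to very low Fst values is therefore not safe, and re-running the analysis is the better option.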

Of course it would be a good idea to re-run this type of analysis separately whenever the absolute values are important. For example, in a previous experiment, I suggested that Fst's between the K12a components were so low that these components should not be interpreted as having diverged in very old (say, Upper Paleolithic) times, but rather in a more recent (post-glacial, and probably mostly Neolithic) time frame. Correction for this upward bias would probably strengthen that hypothesis, which was one way of arguing in favor of the womb of nations theory.

Genes and Geography (Wang et al. 2012)

Gene-geography correlations have been explored before at a regional level. More recently, they were also studied at the global level with the SPA method. A new open access paper shows gene-geography correlations across the world.

These correlations arise from the fact that humans tend to intermarry with their neighbors, so alleles have a decreasing probability of being transmitted from a person at location X to future generations, the further we go from X. But, the more interesting cases are those which show a violation of the overall pattern. These can usually arise because of genetic isolation or long-distance migration. An example is that of the African hunter-gatherer groups:
When hunter-gatherer populations (!Kung, San, Biaka Pygmy, and Mbuti Pygmy) and Mbororo Fulani were included in the analysis, they appeared as isolated clusters on the PCA plots and greatly reduced the similarity between PCA maps and geographic maps (Figure S3, Table S7). The similarity score decreased from 0.790 to 0.548 after including all five of these populations in the analysis. This value, however, is still statistically significant, with a P-value of [...]; further, if we disregard the hunter-gatherer populations and Mbororo Fulani in Figure S3B and only examine the relative locations of the original 23 populations, we can still find a clear resemblance between genetic and geographic coordinates. Compared to the other 23 populations, the four hunter-gatherer populations appear as isolated groups at the south, and Mbororo Fulani appears at the north. These observations are clearer in plots with only one among the five outlier populations included at a time (Figure S3C–S3G), each of which also produces significant similarity scores between genetic and geographic coordinates (Figure S4, Table S7).
Figure S3 is very informative:

Observe that in Figure S3C, the Mbororo Fulani appear in the Balkans (!) relative to Sub-Saharan Africans. That is of course, due to their partial West Eurasian ancestry, but the magnitude of the difference is such that one suspects that it is not only due to this factor; if it were, then the Fulani would place somewhere between Europe and Central Africa.

The remaining figures (D-G) supply the explanation: the four hunter-gatherer groups appear well south of their actual locations; the Pygmy groups not in W/C Africa, but in S Africa; the Khoisan ones not in S Africa but in the Ocean well south of it.

Why does gene-geography correlation suffer such a violation in Africa? Figure S3 shows how different groups relate to W/C Africans. But, one could also use hunter-gatherers as an anchor point (i.e., place them where they actually live): in that case the W/C Africans would be the ones who would be pushed north towards the Mediterranean.

And, indeed, that is a good argument for the idea I've floated a few times, of substantial Eurasian back-migration into Africa: the genetic difference between African farmers and African hunter-gatherers dwarfs the geographic distance. This can easily be explained if we assume that back-migration from Eurasia affected the former much more than the latter. So, African farmers can be shown to be the outcome of mixture between two divergent elements: one Eurasian-like, one African hunter-gatherer-like. The latter could include both groups like existing African H-Gs but might also include other groups who had the misfortune of being completely absorbed before the Eye of Science set its sights on the African continent.

PLoS Genet 8(8): e1002886. doi:10.1371/journal.pgen.1002886

A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations

Chaolong Wang et al.

The spatial pattern of human genetic variation provides a basis for investigating the history of human migrations. Statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been used to summarize spatial patterns of genetic variation, typically by placing individuals on a two-dimensional map in such a way that pairwise Euclidean distances between individuals on the map approximately reflect corresponding genetic relationships. Although similarity between these statistical maps of genetic variation and the geographic maps of sampling locations is often observed, it has not been assessed systematically across different parts of the world. In this study, we combine genome-wide SNP data from more than 100 populations worldwide to perform a formal comparison between genes and geography in different regions. By examining a worldwide sample and samples from Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, we find that significant similarity between genes and geography exists in general in different geographic regions and at different geographic levels. Surprisingly, the highest similarity is found in Asia, even though the geographic barrier of the Himalaya Mountains has created a discontinuity on the PCA map of genetic variation.


August 24, 2012

Proto-Indo-European homeland in Neolithic Anatolia (Bouckaert et al.)

A new paper in Science uses Bayesian phylogeographic methods to model the spatial expansion of Indo-European languages from their Anatolian homeland. An informative video shows how the authors estimate the process took place across space and time:

There is also a podcast with Q.D. Atkinson on the new study, as well as a website by the authors on their research; the FAQ/Controversies section seems particularly useful.

I don't hold high hopes that, despite the mounting evidence, this will dissuade people from arguing for a steppe PIE origin. And, it shouldn't. Only a vigorous debate will resolve the issue conclusively. And, since IE languages appear in the archaeological record long after their split under any scenario, this may be one of those problems that will never be solved to everyone's satisfaction.

I don't agree with all the details of the authors' model, but certainly they place the PIE homeland near to where I believe it was. Resistance to an Anatolian origin will become more convincing if adherents of different homeland solutions manage to put their ideas in quantitative form. Expert opinion is valuable, but very knowledgeable linguists and/or archaeologists have placed the PIE homeland all the way from Central Europe to Bactria-Sogdiana and from the Pontic-Caspian steppe to Mesopotamia. So, one has to wonder why expert opinion has such a high variance, but every quantitative effort to solve the problem has come up with a single solution.

As I wrote recently:
My own working hypothesis would derive the earliest Proto-Indo-Europeans with groups living in Neolithic eastern Anatolia and northern Mesopotamia. There are details to be fleshed out, such as when this group of people reached the Balkans (pending ancient DNA from the region), and how they interfaced with the populations living in the north of the Black and Caspian seas (e.g., via a trans-Caucasus movement or a counterclockwise spread around the Caspian).
The current paper suggests a slightly different origin, in Southern Anatolia, perhaps influenced by the distribution of the historical Anatolian languages in the area when they were first put down in writing. But, I suspect that the transposition of Anatolian languages into the areas where they were first attested may have happened late in prehistory. In any case, whether the PIE homeland was in Southern or Eastern Anatolia, the results of this paper explicitly reject the Kurgan Pontic steppe hypothesis.

From the paper:
The distribution for the root location lies in the region of Anatolia in present-day Turkey. To quantify the strength of support for an Anatolian origin, we calculated the Bayes factors (21) comparing the posterior to prior odds ratio of a root location within the hypothesized Anatolian homeland (11) (Fig. 1, yellow polygon) with two versions of the steppe hypothesis—the initial proposed Kurgan steppe homeland (6) and a later refined hypothesis (7) (Table 1). Bayes factors show strong support for the Anatolian hypothesis under a RRW model.  
As the earliest representatives of the main Indo-European lineages, our 20 ancient languages might provide more reliable location information. Conversely, the position of the ancient languages in the tree, particularly the three Anatolian varieties, might have unduly biased our results in favor of an Anatolian origin. We investigated both possibilities by repeating the above analyses separately on only the ancient languages and only the contemporary languages (which excludes Anatolian). Consistent with the analysis of the full data set, both analyses still supported an Anatolian origin (Table 1). 
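To make the quoted definition concrete, a Bayes factor of this kind is the ratio of posterior odds to prior odds for one homeland region against another. The sketch below is a generic illustration, not the authors' calculation; the function name and all probabilities are hypothetical.

```python
def bayes_factor(post_a: float, post_b: float, prior_a: float, prior_b: float) -> float:
    """Posterior odds of region A over region B divided by their prior odds."""
    posterior_odds = post_a / post_b
    prior_odds = prior_a / prior_b
    return posterior_odds / prior_odds


# Hypothetical numbers: the prior puts equal mass on both regions,
# while the posterior puts 90% of root locations in A and 10% in B.
bf = bayes_factor(post_a=0.9, post_b=0.1, prior_a=0.5, prior_b=0.5)
print(bf)
```

Dividing by the prior odds matters when the candidate regions differ in size or in how much prior mass the model assigns them, since a larger region would otherwise look better supported by default.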

The West Asian origin of the Proto-Indo-Europeans makes excellent sense in the light of the genetic evidence. But, as I hint in the paragraph above, the tempo of their expansion into Europe remains to be clarified. I strongly suspect, on the basis of the Iceman and Swedish Neolithic TRB farmer (Gok4) whose DNA has been published, that the earliest Neolithic was not Indo-European, because these individuals lack the "West Asian" autosomal component.

But, when did the Indo-Europeans first set foot in Europe? Were they already present at the time of Dimini and Vinča in the Balkans? I tend to think that a reasonable proposition, because the 8.2 kiloyear event may have transposed a second set of Neolithic farmers into Europe, of Halafian origin. Or, did they appear later, during the Copper and Bronze Ages with the spread of metallurgy? Until we get ancient DNA from the Balkans and Anatolia, we won't know for sure. But, Y-haplogroups J2 and R1, so conspicuously absent from Neolithic Europe down to 5ka (and in the case of J2, completely missing from the record altogether), must have entered Europe at some point. Did they take the fast train into Europe post-5ka, or did they lurk in both Anatolia and Europe pre-5ka? Thanks to the BEAN project we might find out.

The idea that ~5ka something happened in Europe is also supported by the paper:
Despite support for an Anatolian Indo-European origin, we think it unlikely that agriculture serves as the sole driver of language expansion on the continent. The five major Indo-European subfamilies—Celtic, Germanic, Italic, Balto-Slavic, and Indo-Iranian—all emerged as distinct lineages between 4000 and 6000 years ago (Fig. 2 and fig. S1), contemporaneous with a number of later cultural expansions evident in the archaeological record, including the Kurgan expansion (5–7). 
So, while the deepest prehistory of Indo-European is firmly rooted in Anatolia during the early Neolithic, this is not inconsistent with something important happening in Europe c. 5ka. But this was a secondary phenomenon, not the earliest seat of the Indo-Europeans. Also, I would not particularly relate this to the Kurgan expansion, but more probably to the arrival of metallurgical "guilds" with higher social complexity.

Both horses and wheeled vehicles quickly spread far and wide because of their simplicity and utility; if they were first adopted by a particular people, they quickly spread beyond it. Metallurgy, on the other hand, requires specialized knowledge about a variety of technical subjects, as well as a complex network of people with distinct roles: miners, metalworkers, traders, warriors, administrators. As such, the people who invented it would have had a distinct advantage until their trade secrets were leaked, or too many Bronze weapons were in the hands of their enemies. During the Bronze Age, more and more people got access to weaponry, and by the end of it, wars were raging all across Western Eurasia.

We tend to think of the Neolithic farmers, but it is quite likely that people kept coming into Europe since its initial colonization. After all, the people who came to the Americas in 1492 were the vanguard of many others who followed them. The same must have happened in Europe as well: a continuous process of settlement by various groups at different times, at least until the Bronze and Iron Ages when everyone, all over West Eurasia, seems to have become very quarrelsome and more than willing to use their swords, spears, axes, and arrows to dissuade newcomers who ventured into their territory.

Coverage of the new paper elsewhere: NY Times, Nature, Gene Expression, John Hawks.

Science 24 August 2012: Vol. 337 no. 6097 pp. 957-960 DOI: 10.1126/science.1219669

Mapping the Origins and Expansion of the Indo-European Language Family 

Remco Bouckaert et al.


There are two competing hypotheses for the origin of the Indo-European language family. The conventional view places the homeland in the Pontic steppes about 6000 years ago. An alternative hypothesis claims that the languages spread from Anatolia with the expansion of farming 8000 to 9500 years ago. We used Bayesian phylogeographic approaches, together with basic vocabulary data from 103 ancient and contemporary Indo-European languages, to explicitly model the expansion of the family and test these hypotheses. We found decisive support for an Anatolian origin over a steppe origin. Both the inferred timing and root location of the Indo-European language trees fit with an agricultural expansion from Anatolia beginning 8000 to 9500 years ago. These results highlight the critical role that phylogeographic inference can play in resolving debates about human prehistory.


August 23, 2012

Or, maybe they speciated 3.7-6.6Ma ago? (Sun et al. 2012)

This has certainly been an eventful August in human origins research; if the Neandertal Wars weren't enough, a different issue that had simmered for a while now, the human autosomal sequence mutation rate, has now come to a full boil.

A couple of weeks ago, Langergraber et al. (2012) came out, and combined direct measurement of generation lengths in humans and other primates with the directly measured human autosomal sequence mutation rate to argue for an old 7-13Ma divergence between Pan and Homo.

Yesterday, Kong et al. (2012) independently derived a low direct mutation rate of 1.2x10^-8, and added the observation that older human fathers pass on more mutations to their offspring than younger ones. As I point out in my post on the topic, this has implications for the Homo-Pan divergence as well: if chimp dads are younger than human dads, they will tend to pass fewer mutations to their offspring. Thus, the chimp mutation rate (/generation) might be lower rather than equal to the human one, and this might push the speciation time even further back in time.

Today, a new paper has appeared in Nature Genetics which argues for an "intermediate" rate between the direct ~1-1.3x10^-8 rate and the widely used 2.5x10^-8 one: its rate estimate is 1.4–2.3x10^-8, and the corresponding Human-Chimp speciation time is 3.7-6.6 million years ago. Kari Stefansson is a co-author of the new paper, as he is of the Kong et al. one, which estimated the mutation rate at 1.2x10^-8.

The new paper builds what appears to be a very exhaustive model of microsatellite mutation:
Microsatellites have been widely used to make inferences about evolutionary history. However, the accuracy of these inferences has been limited by a poor understanding of the mutation process. We developed a new model of microsatellite evolution (Supplementary Note). This model can estimate the time to the most recent common ancestor (TMRCA) of two samples at a microsatellite by taking into account (i) the dependence of the mutation rate on allele length and parental age (Fig. 2a,c); (ii) the step size of mutations (Fig. 2b); (iii) the size constraints on allele length (Fig. 2d and Supplementary Figs. 8 and 9); and (iv) the variation in generation interval over history. In contrast to the generalized stepwise mutation model (GSMM), which predicts a linear increase of average squared distance (ASD) between microsatellite alleles over time, the new model predicts a sublinear increase (Fig. 3) and saturation of the molecular clock, due to the constraints on allele length. We also extended the model to estimate the sequence mutation rate, using the per-nucleotide diversity flanking each microsatellite as an additional datum. To implement the model, we used a Bayesian hierarchical approach, first generating global parameters common to all loci, followed by locus-specific parameters and finally the microsatellite alleles at each locus (Online Methods). We used Markov chain Monte Carlo to infer TMRCA and sequence mutation rate. 
I haven't delved deeply into the details of how the sequence mutation rate (per nucleotide/per generation) can be derived by exploiting the microsatellite rate. But, why would the rate estimated with the new method be different than the directly measured one? The authors propose some ideas:
We hypothesize that the lower mutation rate estimates from the whole-genome sequencing studies might be due to (i) the limited number of mutations detected in these studies, which explains why their confidence intervals overlap ours, (ii) possible underestimation of the false negative rate in the whole-genome sequencing studies or (iii) variability in the mutation rate across individuals, such that a few families cannot provide a reliable estimate of the population-wide rate.  
Apparently, the team behind Sun et al. became aware of the new Kong et al. paper after their own paper was accepted, so they attached the following note at the end of it, as well as a discussion in the supplement:
Note added in proof: After this paper was accepted, another study35 was published that independently estimates the human sequence mutation rate, using a direct measurement in contrast to the indirect measurement we report here. In spite of some key similarities between our results and those of Kong et al.35 (the male-to-female mutation rate ratio and the absence of an effect of mother's age), they estimate a considerably stronger effect of father's age and an overall sequence mutation rate below the range we infer. The discrepancies in the sequence mutation rate may be in part due to the fact that Kong et al. focus on a more intensively filtered subset of the human genome than we analyze here, but other factors are also likely to be at work (Supplementary Note). As an initial attempt to compare the two studies in terms of their implications for evolutionary history, we ran the same Bayesian inference procedure we developed in this paper (integrating over uncertainty in unknown parameters), now using the sequence-based estimates rather than the microsatellite-based estimates as input (Supplementary Note). Notably, the inferred dates based on the measurement of the sequence mutation rate are older and no longer in direct conflict with the inference that S. tchadensis is on the human lineage since the split from chimpanzees. The sequence- and microsatellite-based data sets are very different, and an important direction for future research will be to understand why the direct sequence–based mutation rate estimate is lower than the one inferred on the basis of microsatellites. 
All this leaves me rather perplexed. I guess one take-home lesson from the debate would be to avoid making strong statements about the past that are dependent on a particular mutation rate. The following table from the supplementary material pretty much says it all:

Notice that one of the two estimates is approximately double the other. Personally, I tend to favor the older dates, since they might "match" better with key developments: Out-of-Africa will become pre-100ka and consistent with the appearance of the Nubian technocomplex in Arabia, which seems to be the only real solid evidence of Out-of-Africa in the archaeological record. It would also be consistent with the appearance of modern humans in the Levant c. 100ka at Mt. Carmel, the first clear evidence of Homo sapiens in Eurasia. Moreover, it would explain the early appearance of Neandertaloid features in the Atapuerca hominins at c. 600ka, long before the inferred split of modern humans from Neandertals when the slowest rate is used.

But, my confidence in these correspondences is low until the controversy is resolved one way or another. If the 1.8x10^-8 rate of this paper is closer to the truth, then my money would be on the false negative rate, i.e., full genome sequencing is systematically overlooking SNPs that exist in the genomes.

Apparently, now, we have three rates to contend with: (i) the Icelandic 1.2x10^-8 rate (and other similar rates, such as the 1.36x10^-8 one); (ii) the 2.5x10^-8 one that has been very widely used in the literature; and (iii) the "1.82x10^-8 mutations per base pair per generation (90% CI 1.40–2.28 × 10^-8; Table 2)" from this paper. This may be disheartening, but all setbacks represent opportunities to learn something new, and now that the issue is out in the open, I'm sure that many "top dogs" will try to figure out what is going on.

Nature Genetics doi:10.1038/ng.2398

A direct characterization of human mutation based on microsatellites

James X Sun et al.

Mutations are the raw material of evolution but have been difficult to study directly. We report the largest study of new mutations to date, comprising 2,058 germline changes discovered by analyzing 85,289 Icelanders at 2,477 microsatellites. The paternal-to-maternal mutation rate ratio is 3.3, and the rate in fathers doubles from age 20 to 58, whereas there is no association with age in mothers. Longer microsatellite alleles are more mutagenic and tend to decrease in length, whereas the opposite is seen for shorter alleles. We use these empirical observations to build a model that we apply to individuals for whom we have both genome sequence and microsatellite data, allowing us to estimate key parameters of evolution without calibration to the fossil record. We infer that the sequence mutation rate is 1.4–2.3 × 10^-8 mutations per base pair per generation (90% credible interval) and that human-chimpanzee speciation occurred 3.7–6.6 million years ago.


Dodecad Project components and East Eurasian-like admixture

See Part 1, Part 2, and Part 3.

I went back to the Dodecad Project K7b and K12b calculators, and calculated f4 statistics of the form:

f4(Southern_K7b, X, East_Asian_K7b, African_K7b)

I wanted to see how the various components related to East Eurasians.

Here are the results:

Visually for the West Eurasian components:

This shows the relative ordering of the different components on the East Asian-African axis. Notice that of the mainly Caucasoid components the most Asian-shifted is the North European component, and the most African-shifted is the Southwest Asian one. This makes sense because of the admixture phenomenon I've been describing in this series, and also the proximity of Arabia (which is where the Southwest Asian component is modal) to Africa.
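
For readers unfamiliar with the statistic, here is a minimal sketch of how an f4 value of the form f4(A, B; C, D) is commonly computed: the average over SNPs of (a - b)(c - d), where a, b, c, d are allele frequencies in the four groups. This is the textbook definition, not necessarily the exact pipeline used for the numbers in this post, and the frequencies below are made-up toy values.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """Average of (a - b) * (c - d) over SNPs.

    Each argument is a list of allele frequencies (same allele, same SNP order)
    for one population or component.
    """
    n = len(freq_a)
    if n == 0 or not (n == len(freq_b) == len(freq_c) == len(freq_d)):
        raise ValueError("all frequency lists must be non-empty and equally long")
    total = sum((a - b) * (c - d)
                for a, b, c, d in zip(freq_a, freq_b, freq_c, freq_d))
    return total / n


# Toy (hypothetical) allele frequencies at three SNPs.
southern = [0.1, 0.5, 0.9]
x_component = [0.2, 0.5, 0.8]
east_asian = [0.3, 0.4, 0.7]
african = [0.1, 0.6, 0.9]

value = f4(southern, x_component, east_asian, african)
print(f"f4(Southern, X; East_Asian, African) = {value:.4f}")
```

With this argument order, a negative value means that X differs from Southern in the same direction that East Asians differ from Africans, i.e., X is Asian-shifted relative to Southern; a value near zero means no detectable shift either way.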

The existence of East Eurasian-like admixture in Europe is further supported by the following observation: both the Atlantic_Baltic and North_European components (which are the most East Asian-shifted) are mainly geographically distributed to the west of the West Asian, Caucasus, and Gedrosia components (which are less East Asian-shifted). This seems discordant with geography. On the other hand, the relative positions of the Caucasus, Southern, and Southwest Asian components vis a vis Africa are concordant with geography, as their centers of distribution are close to Africa along land migration routes, with Southwest Asia being closest both genetically and geographically, and Caucasus most distant.

Another observation is that the Atlantic_Med component, which is modal in Sardinians and Basques, is actually Asian-shifted relative to the Southern component (modal in Arabia). This might indicate the presence of some degree of East Eurasian-like admixture in Sardinia itself. So, while Sardinia may possess the minimum of this element in Europe, it may not do so in the wider Caucasoid world.

Unscrambling the omelette of West Eurasian origins is no easy task. Hopefully, new statistical methods and ancient DNA will help us achieve it.

More mutations in children of older fathers, and how it relates to human origins

Most of the coverage of the new Kong et al. paper has focused on the rising risk for inheritable diseases such as autism and schizophrenia in the children of older fathers. And, indeed, that is the larger story, and, perhaps, the more useful one for society.

But, for those of us interested in the origins of our species, there is another story:
We show that in our samples, with an average father’s age of 29.7, the average de novo mutation rate is 1.20 × 10−8 per nucleotide per generation.
This mutation rate is in line with other directly measured rates, and is about half the widely used 2.5x10^-8 rate used in evolutionary studies. Application of the low rate has led to a much older Human-Chimp divergence than was previously thought. That, in turn, will make mitochondrial Eve much older, because the mtDNA clock is calibrated on the Human-Chimp divergence. Practically every study of the last 10 years that looked at human origins and used the 2.5x10^-8 rate needs to be dusted off and brought up to date.

But there is yet another story. The beauty of the Langergraber et al. paper is that it inferred the Human-Chimp divergence on the basis of directly observed quantities: mutation rates and generation times. But, there was one quantity which they could not measure directly: the mutation rate in the apes. Thus, they used the mutation rate of humans for the apes as well; that is very reasonable, because presumably the same underlying chemical machinery affects the rate in humans and their simian friends. But, here's where things get complicated:

Mean human paternal ages are ~7 years older than chimp ones, and ~10 years older than gorilla ones. What this means is that, on average, chimp and gorilla dads are younger than human dads when they have babies. But the new Kong et al. paper reports:
Most notably, the diversity in mutation rate of single nucleotide polymorphisms is dominated by the age of the father at conception of the child. The effect is an increase of about two mutations per year. An exponential model estimates paternal mutations doubling every 16.5 years.
A back-of-the-envelope calculation suggests that the higher age of human fathers may contribute ~30-50% more mutation in humans than in chimps/gorillas. Conversely, the mutation rate used for chimps should not be the human one: it should be even lower.
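
The back-of-the-envelope figure can be reproduced with a few lines of Python. This is only a sketch: it assumes that the exponential model from Kong et al. (paternal mutations doubling every 16.5 years) carries over unchanged to chimps and gorillas, and it uses the ~7 and ~10 year paternal age gaps mentioned above. The function name is my own.

```python
def paternal_mutation_ratio(age_gap_years, doubling_time_years=16.5):
    """Ratio of paternal mutations for the older vs. the younger father,
    assuming paternal mutations double every `doubling_time_years`."""
    return 2.0 ** (age_gap_years / doubling_time_years)


# Paternal age gaps quoted in the post: humans vs. chimps, humans vs. gorillas.
chimp_ratio = paternal_mutation_ratio(7)
gorilla_ratio = paternal_mutation_ratio(10)

print(f"human vs. chimp fathers:   {100 * (chimp_ratio - 1):.0f}% more paternal mutations")
print(f"human vs. gorilla fathers: {100 * (gorilla_ratio - 1):.0f}% more paternal mutations")
```

The two ratios come out at roughly 1.34 and 1.52. Note that this applies to the paternal contribution only; since mother's age shows little or no effect, the excess in the total per-generation rate would be somewhat smaller.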

What are the implications of this?

The divergence of Humans from Chimps has been estimated by summing up mutations on two branches to their most recent common ancestor (MRCA). Younger chimp fathers = lower mutation rate / generation = Chimp-to-MRCA branch just got older.
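
To make the logic concrete, here is a rough sketch of the two-branch calculation. The observed divergence is the sum of the mutations accumulated on both branches, so the time to the MRCA is the divergence divided by the sum of the two per-year rates. All numbers below are illustrative assumptions of mine (a 1.2% pairwise divergence, 25-year generations for both species), not values taken from Langergraber et al. or Kong et al., apart from the two mutation rates discussed in this post.

```python
def divergence_time_years(divergence, mu_a, gen_a, mu_b, gen_b):
    """Time to the MRCA of two lineages.

    divergence : per-site sequence difference between the two species
                 (sum of both branches)
    mu_a, mu_b : per-generation mutation rates on each branch
    gen_a, gen_b : generation times in years on each branch
    """
    rate_per_year = mu_a / gen_a + mu_b / gen_b
    return divergence / rate_per_year


D = 0.012   # assumed human-chimp divergence per site (illustrative)
GEN = 25.0  # assumed generation time in years for both lineages (illustrative)

# Same rate on both branches: the widely used rate vs. the direct rate.
t_fast = divergence_time_years(D, 2.5e-8, GEN, 2.5e-8, GEN)
t_slow = divergence_time_years(D, 1.2e-8, GEN, 1.2e-8, GEN)

# Direct rate for humans, but a chimp rate lowered by the ~1.34x
# paternal-age factor (an assumption, see the calculation above).
t_slow_chimp = divergence_time_years(D, 1.2e-8, GEN, 1.2e-8 / 1.34, GEN)

for label, t in [("2.5e-8 on both branches", t_fast),
                 ("1.2e-8 on both branches", t_slow),
                 ("1.2e-8 human, lower chimp rate", t_slow_chimp)]:
    print(f"{label}: {t / 1e6:.1f} million years")
```

Under these toy inputs, halving the rate roughly doubles the date (about 6 vs. 12.5 million years), and lowering only the chimp rate pushes it back further still. The absolute numbers depend entirely on the assumed divergence and generation times; only the direction of the effect matters here.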

In other words, just as we learned that humans diverged from chimps ~7-13 million years ago, it may be that they did so even earlier.

Nature 488, 471–475 (23 August 2012) doi:10.1038/nature11396

Rate of de novo mutations and the importance of father’s age to disease risk

Augustine Kong et al.

Mutations generate sequence diversity and provide a substrate for selection. The rate of de novo mutations is therefore of major importance to evolution. Here we conduct a study of genome-wide mutation rates by sequencing the entire genomes of 78 Icelandic parent–offspring trios at high coverage. We show that in our samples, with an average father’s age of 29.7, the average de novo mutation rate is 1.20 × 10^-8 per nucleotide per generation. Most notably, the diversity in mutation rate of single nucleotide polymorphisms is dominated by the age of the father at conception of the child. The effect is an increase of about two mutations per year. An exponential model estimates paternal mutations doubling every 16.5 years. After accounting for random Poisson variation, father’s age is estimated to explain nearly all of the remaining variation in the de novo mutation counts. These observations shed light on the importance of the father’s age on the risk of diseases such as schizophrenia and autism.