December 09, 2014

"Ancient DNA: the first three decades" meeting papers

A bucketload of papers here. Some titles of interest:

  1. Where are the Caribs? Ancient DNA from ceramic period human remains in the Lesser Antilles
  2. Identification of kinship and occupant status in Mongolian noble burials of the Yuan Dynasty through a multidisciplinary approach
  3. The ancient Yakuts: a population genetic enigma
  4. Ancient mitochondrial DNA from the northern fringe of the Neolithic farming expansion in Europe sheds light on the dispersion process
  5. Mitochondrial DNA variation in the Viking age population of Norway
  6. Almost 20 years of Neanderthal palaeogenetics: adaptation, admixture, diversity, demography and extinction
  7. Screening ancient tuberculosis with qPCR: challenges and opportunities
  8. Parallel detection of ancient pathogens via array-based DNA capture
  9. Unravelling the complexity of domestication: a case study using morphometrics and ancient DNA analyses of archaeological pigs from Romania
  10. Ancient genomics
  11. Ancient population genomics and the study of evolution

December 06, 2014

African Genome Variation project paper

A choice quote:
To assess the effect of gene flow on population differentiation in SSA, we masked Eurasian ancestry across the genome (Supplementary Methods and Supplementary Note 6). This markedly reduced population differentiation, as measured by a decline in mean pairwise FST from 0.021 to 0.015 (Supplementary Note 6), suggests that Eurasian ancestry has a substantial impact on differentiation among SSA populations. We speculate that residual differentiation between Ethiopian and other SSA populations after masking Eurasian ancestry (pairwise FST = 0.027) may be a remnant of East African diversity pre-dating the Bantu expansion10.
I think this should be highlighted for a couple of reasons.

1. In too many papers to count, decreasing genetic diversity from East Africa was taken as evidence of an origin of H. sapiens in that locality and its expansion from there to Eurasia. This "East Africa=cradle of mankind" theory has, as far as I can tell, nothing really to stand on. Granted, the oldest anatomically modern human remains, dated to 200-150 thousand years ago, have been found in East Africa. But the fact that old sapiens have been found in East Africa and not elsewhere is easily explained by the excellent conditions for preservation (as opposed to, e.g., the deserts or rainforests of Africa or elsewhere), and by the extraordinary effort by palaeoanthropologists in that area. One also needs to overlook a century of physical anthropology that concluded that East Africa was a contact zone between Caucasoids and Sub-Saharan Africans. We now know that there is no deep lineage of humans in modern East Africans. Take out the Eurasian ancestry and only a paltry Fst=0.027 remains with other Sub-Saharan Africans, a fraction of the Fst between, say, Europeans and East Asians.

2. There is an enormous literature on phenotypic variation in Africans. The ultra-migrationism of old was replaced by an ultra-selectionism that sought to explain every phenotypic marker of Eurasian admixture in Africa not as evidence of such admixture, but as a parallel process of evolution whereby some Africans came to resemble some Eurasians through adaptation to similar environmental conditions.

This suggests that a large proportion of differentiation observed among African populations could be due to Eurasian admixture, rather than adaptation to selective forces (Supplementary Note 6).
This study also confirms the presence of Eurasian admixture in the Yoruba:
Our finding of ancient Eurasian admixture corroborates findings of non-zero Neanderthal ancestry in Yoruba, which is likely to have been introduced through Eurasian admixture and back migration, possibly facilitated by greening of the Sahara desert during this period13, 14.

Nature (2014) doi:10.1038/nature13997

The African Genome Variation Project shapes medical genetics in Africa

Deepti Gurdasani, Tommy Carstensen, Fasil Tekola-Ayele, Luca Pagani, Ioanna Tachmazidou, et al.

Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.


December 02, 2014

Remains of Richard III identified

From the paper:
Four of the modern relatives were found to belong to Y-haplogroup R1b-U152 (x L2, Z36, Z56, M160, M126 and Z192)13, 14 with STR haplotypes being consistent with them comprising a single patrilinear group. One individual (Somerset 3) was found to belong to haplogroup I-M170 (x M253, M223) and therefore could not be a patrilinear relative of the other four within the time span considered, indicating that a false-paternity event had occurred within the last four generations. 
In contrast to the Y-haplotypes of the putative modern relatives, Skeleton 1 belongs to haplogroup G-P287, with a corresponding Y-STR haplotype. Thus, the putative modern patrilinear relatives of Richard III are not genetically related to Skeleton 1 through the male line over the time period considered. However, this is not surprising, given an estimated average false-paternity rate of ~1–2% (refs 12, 17, 18). The putative modern relatives and Richard III are related through a male relative (Edward III) four generations up from Richard III (Fig. 1a and Supplementary Fig. 2), and a false-paternity event could have happened in any of the 19 generations separating Richard III and the 5th Duke of Beaufort, on either branch of the genealogy descending from Edward III. Indeed, even with a conservative false-paternity rate18 (see Supplementary Methods) the chance of a false-paternity occurring in this number of generations is 16%.
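
The 16% figure is easy to sanity-check. Below is a minimal sketch, assuming the same independent false-paternity rate in every generation (my simplification; the exact rate the authors used is in their Supplementary Methods and is not reproduced here, and the function names are mine):

```python
def prob_at_least_one_break(rate_per_generation, generations):
    """Chance that at least one false-paternity event occurs in a male line.

    Assumes each generation is independent and has the same rate.
    """
    return 1.0 - (1.0 - rate_per_generation) ** generations


def implied_rate(target_probability, generations):
    """Per-generation rate that would yield the given cumulative probability."""
    return 1.0 - (1.0 - target_probability) ** (1.0 / generations)


if __name__ == "__main__":
    generations = 19  # generations quoted in the paper excerpt
    for rate in (0.01, 0.02):  # the ~1-2% range quoted in the paper excerpt
        p = prob_at_least_one_break(rate, generations)
        print(f"rate {rate:.2%}: chance of a break = {p:.1%}")
    print(f"rate implied by 16%: {implied_rate(0.16, generations):.2%}")
```

Under these assumptions, the quoted 16% corresponds to a per-generation rate a little under 1%, i.e. at the low end of the 1–2% range.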

Nature Communications 5, Article number: 5631 doi:10.1038/ncomms6631

Identification of the remains of King Richard III

Turi E. King et al.


In 2012, a skeleton was excavated at the presumed site of the Grey Friars friary in Leicester, the last-known resting place of King Richard III. Archaeological, osteological and radiocarbon dating data were consistent with these being his remains. Here we report DNA analyses of both the skeletal remains and living relatives of Richard III. We find a perfect mitochondrial DNA match between the sequence obtained from the remains and one living relative, and a single-base substitution when compared with a second relative. Y-chromosome haplotypes from male-line relatives and the remains do not match, which could be attributed to a false-paternity event occurring in any of the intervening generations. DNA-predicted hair and eye colour are consistent with Richard’s appearance in an early portrait. We calculate likelihood ratios for the non-genetic and genetic data separately, and combined, and conclude that the evidence for the remains being those of Richard III is overwhelming.


November 25, 2014

E-M81 in Morocco

Hum Biol. 2014 May;86(2):105-12.

Phylogeography of E1b1b1b-M81 haplogroup and analysis of its subclades in Morocco.

Reguig A, Harich N, Barakat A, Rouba H.


In this study we analyzed 295 unrelated Berber-speaking men from northern, central, and southern Morocco to characterize frequency of the E1b1b1b-M81 haplogroup and to refine the phylogeny of its subclades: E1b1b1b1-M107, E1b1b1b2-M183, and E1b1b1b2a-M165. For this purpose, we typed four biallelic polymorphisms: M81, M107, M183, and M165. A large majority of the Berber-speaking male lineages belonged to the Y-chromosomal E1b1b1b-M81 haplogroup. The frequency ranged from 79.1% to 98.5% in all localities sampled. E1b1b1b2-M183 was the most dominant subclade in our samples, ranging from 65.1% to 83.1%. In contrast, the E1b1b1b1-M107 and E1b1b1b2a-M165 subclades were not found in our samples. Our results suggest a predominance of the E1b1b1b-M81 haplogroup among Moroccan Berber-speaking males with a decreasing gradient from south to north. The most prevalent subclade in this haplogroup was E1b1b1b2-M183, for which differences among these three groups were statistically significant between central and southern groups.


Paternal lineages and languages in the Caucasus

An interesting new study on Y chromosome and languages in the Caucasus. The distribution of haplogroups is on the left. The authors make some associations of haplogroups with language families:

  • R1b: Indo-European
  • R1a: Scytho-Sarmatian
  • J2: Hurro-Urartian
  • G2: Kartvelian

Hum Biol. 2014 May;86(2):113-30.

Human paternal lineages, languages, and environment in the Caucasus.

Tarkhnishvili D, Gavashelishvili A, Murtskhvaladze M, Gabelaia M, Tevzadze G.


Publications that describe the composition of the human Y-DNA haplogroup in different ethnic or linguistic groups and geographic regions provide no explicit explanation of the distribution of human paternal lineages in relation to specific ecological conditions. Our research attempts to address this topic for the Caucasus, a geographic region that encompasses a relatively small area but harbors high linguistic, ethnic, and Y-DNA haplogroup diversity. We genotyped 224 men that identified themselves as ethnic Georgian for 23 Y-chromosome short tandem-repeat markers and assigned them to their geographic places of origin. The genotyped data were supplemented with published data on haplogroup composition and location of other ethnic groups of the Caucasus. We used multivariate statistical methods to see if linguistics, climate, and landscape accounted for geographical differences in frequencies of the Y-DNA haplogroups G2, R1a, R1b, J1, and J2. The analysis showed significant associations of (1) G2 with well-forested mountains, (2) J2 with warm areas or poorly forested mountains, and (3) J1 with poorly forested mountains. R1b showed no association with environment. Haplogroups J1 and R1a were significantly associated with Daghestanian and Kipchak speakers, respectively, but the other haplogroups showed no such simple associations with languages. Climate and landscape in the context of competition over productive areas among different paternal lineages, arriving in the Caucasus in different times, have played an important role in shaping the present-day spatial distribution of patrilineages in the Caucasus. This spatial pattern had formed before linguistic subdivisions were finally shaped, probably in the Neolithic to Bronze Age. Later historical turmoil had little influence on the patrilineage composition and spatial distribution.
Based on our results, the scenario of postglacial expansions of humans and their languages to the Caucasus from the Middle East, western Eurasia, and the East European Plain is plausible.


November 07, 2014

Genome of Kostenki-14, an Upper Paleolithic European (Seguin-Orlando, Korneliussen, Sikora, et al. 2014)

A new paper in Science reports on the genome of Kostenki-14 (K14), an Upper Paleolithic European from Russia. This is now the third oldest Homo sapiens for which we have genetic data, after Ust'-Ishim (Siberia, 45 thousand years) and Tianyuan (China, 40 thousand years); Kostenki (European part of Russia) is 37 thousand years old. Of these three, Ust'-Ishim is both the earliest and the highest-coverage genome (Siberia is the gift that keeps on givin'), Tianyuan only has its chromosome 21 known, and K14 is a complete 2.42x coverage sequence (and, apparently, has good teeth after all these years; left).

The publication of the Tianyuan genome showed that populations related to East Asians and Oceanians existed in the world 40 thousand years ago. So, models based on modern humans that put the split of East Asians from Europeans to a much more recent time period were basically wrong (more on this a little below). The Ust'-Ishim genome showed that populations basal to both East Asians and Europeans existed in the world 45 thousand years ago. So, either East Asians and Europeans hadn't gone along their different paths yet, or, if they had, Ust'-Ishim happened to be a side branch and not the major East Asian and European lineages.

K14 may not be the oldest Upper Paleolithic human, but as of this writing it is the only Upper Paleolithic European that has been published, the next oldest being the Loschbour, Motala, and La Brana Mesolithic Europeans, who are about 1/5 of its age. The new paper shows that K14 was definitely European (or more correctly West Eurasian or Caucasoid), as it was more similar to modern Europeans than to East Asians or other non-West Eurasian populations. Thus, the morphological description of the sample as "Australoid" by some early anthropologists did not reflect its ancestral makeup. Also, this proves that Caucasoids existed 37,000 years ago, which most physical anthropologists would have believed anyway, but it is nice to have direct confirmation. This pushes the lower bound back from 24,000 years ago (the age of MA-1, which was West Eurasian according to the results of Raghavan et al.). It will be nice to push the lower bound further into the past, as there are much older bones (and plenty of teeth) from earlier Upper Paleolithic Europeans.

But there is a slight kink in the story, as K14 also belonged to Y-haplogroup C, which is predominantly East Asian/Oceanian/Native American today. So, maybe there is some distant link to these populations in its ancestry. But there is definitely a link to much more recent Europeans: the tiny percentage of living Europeans who have preserved K14's Y-chromosomal type (some of whom were doubtlessly told a few years back that they were descendants of Genghis Khan, before the phylogenetic structure of C was known), the La Brana hunter-gatherer from Mesolithic Spain, as well as Neolithic Europeans from Hungary.

The authors of the current paper also date the Neandertal admixture to 54 thousand years ago. This seems very compatible with the estimate of between 50 and 60 thousand years by Fu et al. (2014) based on the Ust'-Ishim genome (which is both earlier and better, so the chunks of Neandertal ancestry in it are probably longer and better defined).

The authors propose the following model for how various populations are related to each other:

This model is not formally tested, but at least it seems to derive Europeans as a 3-way mixture that is basically identical to that of Lazaridis et al., with some relabeling of populations (MHG=WHG and NEOL=EEF).

The model also includes Yeniseian Siberians as a mixture of MHG and East Asians (although it does not include actual East Asians). It's strange that Yeniseians apparently are given no ANE ancestry but only WHG/MHG. Both Raghavan et al. and Lazaridis et al. mentioned that ancestry related to MA-1 in living Siberians is diminished, but none at all?

The major new finding of this paper, however, is that K14 had Basal Eurasian ancestry, which was first proposed for EEF from Germany 7,000 years ago and is now postulated for Russian hunter-gatherers 37,000 years ago. I don't think many archaeologists would derive European farmers from Russia (Russia is actually one of the last places in Europe that became agricultural). So, maybe the hunter-gatherers from Russia had Basal Eurasian ancestry and this wasn't limited to the ancestors of the EEF? If they did, it's strange that Loschbour, La Brana, MA-1, Ust'-Ishim, Swedish Mesolithic (and maybe KO1?) didn't have it. So, either Kostenki was unique or there is an alternative explanation for its strangeness.

The evidence for the Basal Eurasian ancestry in K14 is summarized in the figure above in bullet point (b).

  • The statistic D(Mbuti, East Asia; HG, K14) is less than 0. So, there's some link between HG and East Asians. Is this because of Basal Eurasian admixture in K14 or due to some admixture between Caucasoids and Mongoloids after the time of K14? (this might cause the lower dates of European-East Asian splits alluded to above).
  • The statistic D(Mbuti, East Asia; NEOL, K14) is 0. So, East Asians don't "prefer" either Neolithic Europeans (NEOL) or K14. I guess the value of this statistic depends on how much Basal Eurasian the different populations have and what's the relationship between East Asians, K14, and the non-Basal Eurasian part in K14.
  • Finally, "NEOL component for K14 in ADMIXTURE". I think they are referring to the "Middle East" component (right). This may be Basal Eurasian ancestry, or maybe because K14 is so old, it pre-dates the European/Middle Eastern divide and its ancestry isn't attracted to either Europe or the Middle East, so it gets ancestry from both (and many other colors besides).
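
For readers who haven't met D-statistics before, here is a minimal sketch of how a statistic like D(Mbuti, East Asia; HG, K14) can be computed from per-SNP allele frequencies. It uses one common frequency-based form of the estimator, which is an assumption on my part; the paper's exact estimator and its standard errors are not reproduced here, and the function name and toy data are made up for illustration.

```python
def d_statistic(freqs):
    """D(W, X; Y, Z) from per-SNP allele frequencies.

    freqs: iterable of (w, x, y, z) tuples, each value the frequency of the
    same allele in populations W, X, Y, Z at one SNP (0/1 for single
    haploid genomes).

    With W as an outgroup, a negative value means X shares more alleles
    with Y than with Z; a positive value means X shares more with Z.
    """
    numerator = 0.0
    denominator = 0.0
    for w, x, y, z in freqs:
        numerator += (w - x) * (y - z)
        denominator += (w + x - 2 * w * x) * (y + z - 2 * y * z)
    if denominator == 0:
        raise ValueError("no informative sites")
    return numerator / denominator


if __name__ == "__main__":
    # Toy data only: columns are (outgroup, X, Y, Z) at three SNPs.
    toy = [(0, 1, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0)]
    print(d_statistic(toy))  # negative: X and Y share more alleles here
```

In practice significance is judged with a Z-score (typically from a block jackknife over the genome), which this sketch leaves out.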

It is fascinating how many new questions are both answered and raised each time a new genome gets published (and there has been a constant stream of these over the last couple of years).

Science DOI: 10.1126/science.aaa0114

Genomic structure in Europeans dating back at least 36,200 years

Andaine Seguin-Orlando, Thorfinn S. Korneliussen, Martin Sikora, et al.

The origin of contemporary Europeans remains contentious. We obtain a genome sequence from Kostenki 14 in European Russia dating to 38,700 to 36,200 years ago, one of the oldest fossils of Anatomically Modern Humans from Europe. We find that K14 shares a close ancestry with the 24,000-year-old Mal’ta boy from central Siberia, European Mesolithic hunter-gatherers, some contemporary western Siberians, and many Europeans, but not eastern Asians. Additionally, the Kostenki 14 genome shows evidence of shared ancestry with a population basal to all Eurasians that also relates to later European Neolithic farmers. We find that Kostenki 14 contains more Neandertal DNA that is contained in longer tracts than present Europeans. Our findings reveal the timing of divergence of western Eurasians and East Asians to be more than 36,200 years ago and that European genomic structure today dates back to the Upper Paleolithic and derives from a meta-population that at times stretched from Europe to central Asia.


October 22, 2014

High coverage genome from 45,000-year old Siberian (Ust'-Ishim)

This is the oldest full genome of a modern human published to date and it also comes from a time (45 thousand years ago) that coincides with the Upper Paleolithic revolution in Eurasia.

45 thousand years ago is probably close to when Eurasians started diverging from each other as they spread in all directions. So, we expect that a human from that time would be "undifferentiated Eurasian" and indeed this seems to be the case.

First the Y-chromosome:
The Y chromosome sequence of the Ust’-Ishim individual is similarly inferred to be ancestral to a group of related Y chromosomes (haplogroup K(xLT)) that occurs across Eurasia today6 (Supplementary Information section 9).
and mtDNA:
The Ust’-Ishim mtDNA sequence falls at the root of a large group of related mtDNAs (the ‘R haplogroup’), which occurs today across Eurasia (Supplementary Information section 8).
It is clear that this was a Eurasian individual:
Based on genotyping data for 87 African and 108 non-African individuals (Supplementary Information section 11), the Ust’-Ishim genome shares more alleles with non-Africans than with sub-Saharan Africans (|Z| = 41–89), consistent with the principal component analysis, mtDNA and Y chromosome results.
It was also more like East Asians than Europeans:
Among the non-Africans, the Ust’-Ishim genome shares more derived alleles with present-day people from East Asia than with present-day Europeans (|Z| = 2.1–6.4).
But, when they compared East Asians with La Brana and MA-1 they didn't see a difference:
However, when an ~8,000-year-old genome from western Europe (La Braña)9 or a 24,000-year-old genome from Siberia (Mal’ta 1)10 were analysed, there is no evidence that the Ust’-Ishim genome shares more derived alleles with present-day East Asians than with these prehistoric individuals (|Z| < 2). This suggests that the population to which the Ust’-Ishim individual belonged diverged from the ancestors of present-day West Eurasian and East Eurasian populations before—or simultaneously with—their divergence from each other. The finding that the Ust’-Ishim individual is equally closely related to present-day Asians and to 8,000- to 24,000-year-old individuals from western Eurasia, but not to present-day Europeans, is compatible with the hypothesis that present-day Europeans derive some of their ancestry from a population that did not participate in the initial dispersals of modern humans into Europe and Asia11.
So it seems that the Ust'-Ishim individual belonged to the same branch as Asians and WHG/ANE, and modern Europeans are less like it because they also have "Basal Eurasian" admixture, which they inherited via the EEF in the model of Lazaridis et al.

The authors could also get estimates of the mutation rate because this is a 45,000-year-old individual that hasn't experienced 45,000 years' worth of mutations:
Assuming that this corresponds to the number of mutations that have accumulated over around 45,000 years, we estimate a mutation rate of 0.43 × 10−9 per site per year (95% CI 0.38 × 10−9 to 0.49 × 10−9) that is consistent across all non-African genomes regardless of their coverage (Supplementary Information section 14). This overall rate, as well as the relative rates inferred for different mutational classes (transversions, non-CpG transitions, and CpG transitions), is similar to the rate observed for de novo estimates from human pedigrees (~0.5 × 10−9 per site per year14, 15) and to the direct estimate of branch shortening (Supplementary Information section 10). As discussed elsewhere14, 16, 17, these rates are slower than those estimated using calibrations based on the fossil record and thus suggest older dates for the splits of modern human and archaic populations.
This is a very direct confirmation of the "slow" autosomal rate of ~1.2 × 10−8 mutations/generation/bp, using an approach quite different from those used before to estimate it. The slower mutation rate implies that major splits in human history (such as the Out-of-Africa event) took place much earlier than the Upper Paleolithic revolution and the spread of humans across Eurasia. Modern humans probably established an early presence in the Levant/Arabia (consistent with Out-of-Arabia), and invented the Upper Paleolithic-related tools/behaviors there much later, and only then spread across Eurasia.
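
To see how the per-year rate in the paper lines up with the per-generation figure, here is a minimal sketch of the unit conversion. The generation time of 29 years is my assumption for illustration and is not taken from the paper; other plausible values shift the result accordingly.

```python
def per_generation_rate(rate_per_site_per_year, generation_time_years):
    """Convert a per-site, per-year mutation rate to per-site, per-generation."""
    return rate_per_site_per_year * generation_time_years


if __name__ == "__main__":
    point_estimate = 0.43e-9           # per site per year, from the quoted paper
    ci_low, ci_high = 0.38e-9, 0.49e-9  # 95% CI, from the quoted paper
    generation_time = 29.0              # years; an assumed value, not from the paper

    for label, rate in (("low", ci_low), ("point", point_estimate), ("high", ci_high)):
        print(label, per_generation_rate(rate, generation_time))
```

With that assumed generation time, the point estimate works out to roughly 1.2 × 10−8 per site per generation.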

The authors write:
we estimate that the admixture between the ancestors of the Ust’-Ishim individual and Neanderthals occurred approximately 50,000 to 60,000 years BP, which is close to the time of the major expansion of modern humans out of Africa and the Middle East.
This clinches the hypothesis of Neandertal introgression in Eurasians, as Ust'-Ishim has longer Neandertal segments than modern humans, as one might expect from an individual who experienced this admixture more recently in its evolutionary past than modern humans did. It's probably in the Middle East that the Levantine/Arabian modern humans that expanded Out-of-Africa more than 100 thousand years ago came into contact with Neandertals, admixed with them and later carried this ancestry to the rest of Eurasia. I tend to think that the AMH "colony" was first limited to Arabia and only later (post-70kya) expanded north as the climate deteriorated there. The authors estimate the common ancestor of non-African Y-chromosomes (including E, which is probably a back-migration to Africa) to around 70 thousand years ago which may coincide with the Arabian Exodus event.

Nature 514, 445–449 (23 October 2014) doi:10.1038/nature13810

Genome sequence of a 45,000-year-old modern human from western Siberia

Qiaomei Fu et al.

We present the high-quality genome sequence of a ~45,000-year-old modern human male from Siberia. This individual derives from a population that lived before—or simultaneously with—the separation of the populations in western and eastern Eurasia and carries a similar amount of Neanderthal ancestry as present-day Eurasians. However, the genomic segments of Neanderthal ancestry are substantially longer than those observed in present-day individuals, indicating that Neanderthal gene flow into the ancestors of this individual occurred 7,000–13,000 years before he lived. We estimate an autosomal mutation rate of 0.4 × 10−9 to 0.6 × 10−9 per site per year, a Y chromosomal mutation rate of 0.7 × 10−9 to 0.9 × 10−9 per site per year based on the additional substitutions that have occurred in present-day non-Africans compared to this genome, and a mitochondrial mutation rate of 1.8 × 10−8 to 3.2 × 10−8 per site per year based on the age of the bone.


October 21, 2014

Ancient DNA from prehistoric inhabitants of Hungary

A very interesting new article on Europe describes new data from ancient Hungary from the Neolithic to the Iron Age. It is open access, so go ahead and read it. I will update this entry with some comments after I read the paper myself.

UPDATE I (The petrous bone):

The authors write:
The endogenous DNA yields from the petrous samples exceeded those from the teeth by 4- to 16-fold and those from other bones up to 183-fold. Thus, while other skeletal elements yielded human, non-clonal DNA contents ranging from 0.3 to 20.7%, the levels for petrous bones ranged from 37.4 to 85.4% (Fig. 1).
This seems like a very exciting technical breakthrough that will increase DNA yields in future studies.


The Neolithic Hungarians are close to Sardinians (this has been replicated in study after study, so it's no longer a surprise when you find Neolithic Europeans that look like Sardinians).

What is surprising is that one Neolithic European, KO1, is with the hunter-gatherers (top of the plot). At some level you would expect to find some hunter-gatherers in the earliest Neolithic communities in Europe, as Europe wasn't empty land when the early farmers showed up. And KO1 appears to be one of those guys, "caught in the act" of first contact between the two groups.

The two Bronze Age samples are more like modern continental Europeans but not exactly like modern Hungarians. The Iron Age sample is in the no-man's land between Europe and the Caucasus, and his "Asian" Y chromosome and mtDNA seem to agree that this is no ordinary European.

UPDATE III (How they looked):

I really like the visualization of hair and eye color predictions of the last two columns of the table on the right. It seems that the ancient Hungarians had mainly brown hair with more variability after 5,000 years ago. They mostly had brown eyes except three individuals.

An interesting thing is that NE7, who seems to have light hair and blue eyes, is just like other Sardinian-like farmers of the Neolithic and also has the mtDNA haplogroup N1a1a1a that is ultra-typical for Neolithic people from Europe. So this is a warning not to conflate appearance with ancestry.

UPDATE IV (Y chromosomes):

As always, the supplement has many of the interesting details. Two Neolithic males were C6, which is the same "weird" haplogroup that the La Brana hunter-gatherer from Spain had. Two others were I2a, which is what Loschbour and the Swedish hunter-gatherers had. Strangely, no Neolithic males had G, which was found before in many Neolithic Europeans.

A new finding is that the Bronze Age individual BR2 belonged to haplogroup J2a1. I think this is the first time this has been found in ancient DNA and it falsifies the Phoenician sea-faring theory of the dispersal of this lineage.

Finally, the Iron Age Hungarian belonged to haplogroup N. I believe this was found in ancient Magyars from Hungary before, but apparently it existed there long before them.

Nature Communications 5, Article number: 5257 doi:10.1038/ncomms6257

Genome flux and stasis in a five millennium transect of European prehistory

Cristina Gamba et al.

The Great Hungarian Plain was a crossroads of cultural transformations that have shaped European prehistory. Here we analyse a 5,000-year transect of human genomes, sampled from petrous bones giving consistently excellent endogenous DNA yields, from 13 Hungarian Neolithic, Copper, Bronze and Iron Age burials including two to high (~22 × ) and seven to ~1 × coverage, to investigate the impact of these on Europe’s genetic landscape. These data suggest genomic shifts with the advent of the Neolithic, Bronze and Iron Ages, with interleaved periods of genome stability. The earliest Neolithic context genome shows a European hunter-gatherer genetic signature and a restricted ancestral population size, suggesting direct contact between cultures after the arrival of the first farmers into Europe. The latest, Iron Age, sample reveals an eastern genomic influence concordant with introduced Steppe burial rites. We observe transition towards lighter pigmentation and surprisingly, no Neolithic presence of lactase persistence.


October 20, 2014

Ancestry Composition preprint

This is one of the main ancestry tools of 23andMe so it is nice to see its methodology described in detail.


Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Y Durand et al.

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
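
The third stage is the easiest one to illustrate. Below is a minimal sketch of isotonic regression via the pool-adjacent-violators algorithm, the generic technique the abstract names for recalibrating confidence estimates. It is not 23andMe's implementation; the function name and the toy input are my own assumptions.

```python
def isotonic_fit(values, weights=None):
    """Nondecreasing least-squares fit via pool-adjacent-violators.

    values: observations ordered by raw confidence score (for calibration,
    typically 1 if the call was correct and 0 if it was not).
    Returns fitted values of the same length, which can be read as
    recalibrated confidences.
    """
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block holds [mean, total weight, number of points]
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            merged_mean = (mean1 * weight1 + mean2 * weight2) / total
            blocks.append([merged_mean, total, count1 + count2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


if __name__ == "__main__":
    # Toy outcomes sorted by increasing raw confidence (1 = correct call).
    print(isotonic_fit([0, 1, 0, 1, 1]))
```

The point of the monotone fit is that a higher raw score never maps to a lower calibrated confidence, while the calibrated values track the observed accuracy.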


October 10, 2014

Tomb II at Vergina belonged to Philip II and a possible Scythian wife

Remains of Alexander the Great's Father Confirmed Found
A team of Greek researchers has confirmed that bones found in a two-chambered royal tomb at Vergina, a town some 100 miles away from Amphipolis's mysterious burial mound, indeed belong to the Macedonian King Philip II, Alexander the Great's father. 
The anthropological investigation examined 350 bones and fragments found in two larnakes, or caskets, of the tomb. It uncovered pathologies, activity markers and trauma that helped identify the tomb's occupants.

Along with the cremated remains of Philip II, the burial, commonly known as Tomb II, also contained the bones of a woman warrior, possibly the daughter of the Skythian King Athea, Theodore Antikas, head of the Art-Anthropological research team of the Vergina excavation, told Discovery News.

October 09, 2014

~40 thousand year old cave art from Indonesia

The BBC website has some nice pictures of it.

Nature 514, 223–227 (09 October 2014) doi:10.1038/nature13422

Pleistocene cave art from Sulawesi, Indonesia

M. Aubert et al.

Archaeologists have long been puzzled by the appearance in Europe ~40–35 thousand years (kyr) ago of a rich corpus of sophisticated artworks, including parietal art (that is, paintings, drawings and engravings on immobile rock surfaces) [1, 2] and portable art (for example, carved figurines) [3, 4], and the absence or scarcity of equivalent, well-dated evidence elsewhere, especially along early human migration routes in South Asia and the Far East, including Wallacea and Australia [5, 6, 7, 8], where modern humans (Homo sapiens) were established by 50 kyr ago [9, 10]. Here, using uranium-series dating of coralloid speleothems directly associated with 12 human hand stencils and two figurative animal depictions from seven cave sites in the Maros karsts of Sulawesi, we show that rock art traditions on this Indonesian island are at least compatible in age with the oldest European art [11]. The earliest dated image from Maros, with a minimum age of 39.9 kyr, is now the oldest known hand stencil in the world. In addition, a painting of a babirusa (‘pig-deer’) made at least 35.4 kyr ago is among the earliest dated figurative depictions worldwide, if not the earliest one. Among the implications, it can now be demonstrated that humans were producing rock art by ~40 kyr ago at opposite ends of the Pleistocene Eurasian world.


September 28, 2014

43,500-year old Aurignacian north of the Alps

PNAS doi: 10.1073/pnas.1412201111

Early modern human settlement of Europe north of the Alps occurred 43,500 years ago in a cold steppe-type environment

Philip R. Nigst et al.

The first settlement of Europe by modern humans is thought to have occurred between 50,000 and 40,000 calendar years ago (cal B.P.). In Europe, modern human remains of this time period are scarce and often are not associated with archaeology or originate from old excavations with no contextual information. Hence, the behavior of the first modern humans in Europe is still unknown. Aurignacian assemblages—demonstrably made by modern humans—are commonly used as proxies for the presence of fully behaviorally and anatomically modern humans. The site of Willendorf II (Austria) is well known for its Early Upper Paleolithic horizons, which are among the oldest in Europe. However, their age and attribution to the Aurignacian remain an issue of debate. Here, we show that archaeological horizon 3 (AH 3) consists of faunal remains and Early Aurignacian lithic artifacts. By using stratigraphic, paleoenvironmental, and chronological data, AH 3 is ascribed to the onset of Greenland Interstadial 11, around 43,500 cal B.P., and thus is older than any other Aurignacian assemblage. Furthermore, the AH 3 assemblage overlaps with the latest directly radiocarbon-dated Neanderthal remains, suggesting that Neanderthal and modern human presence overlapped in Europe for some millennia, possibly at rather close geographical range. Most importantly, for the first time to our knowledge, we have a high-resolution environmental context for an Early Aurignacian site in Central Europe, demonstrating an early appearance of behaviorally modern humans in a medium-cold steppe-type environment with some boreal trees along valleys around 43,500 cal B.P.


A limited genetic link between Mansi and Hungarians

Mol Genet Genomics. 2014 Sep 26. [Epub ahead of print]

Y-SNP L1034: limited genetic link between Mansi and Hungarian-speaking populations.

Fehér T, Németh E, Vándor A, Kornienko IV, Csáji LK, Pamjav H.


Genetic studies noted that the Hungarian Y-chromosomal gene pool significantly differs from other Uralic-speaking populations. Hungarians show very limited or no presence of haplogroup N-Tat, which is frequent among other Uralic-speaking populations. We proposed that some genetic links need to be observed between the linguistically related Hungarian and Mansi populations. This is the first attempt to divide haplogroup N-Tat into subhaplogroups by testing new downstream SNP markers L708 and L1034. Sixty Northern Mansi samples were collected in Western Siberia and genotyped for Y-chromosomal haplotypes and haplogroups. We found 14 Mansi and 92 N-Tat samples from 7 populations. Comparative results showed that all N-Tat samples carried the N-L708 mutation. Some Hungarian, Sekler, and Uzbek samples were L1034 SNP positive, while all Mongolians, Buryats, Khanty, Finnish, and Roma samples yielded a negative result for this marker. Based on the above, L1034 marker seems to be a subgroup of N-Tat, which is typical for Mansi and Hungarian-speaking ethnic groups so far. Based on our time to most recent common ancestor data, the L1034 marker arose 2,500 years before present. The overall frequency of the L1034 is very low among the analyzed populations, thus it does not necessarily mean that proto-Hungarians and Mansi descend from common ancestors. It does provide, however, a limited genetic link supporting language contact. Both Hungarians and Mansi have much more complex genetic population history than the traditional tree-based linguistic model would suggest.


Levallois technology in Nor Geghi 1, Armenia

From the paper:
Empirical evidence supports the contention that Levallois technology is an inherent property of the Acheulian that evolves out of the existing, but previously separate technological systems of façonnage and débitage (7, 35), and shows that Acheulian bifacial technology and Levallois technology are homologous, reflecting an ancestor-descendant relationship (36). Rather than a “technical breakthrough” that spread from a single point of origin, Levallois technology resulted from the gradual synthesis of stone knapping behaviors shared among hominins in Africa and those indigenous to the Acheulian dispersal area in Eurasia (Fig. 1). Consequently, the development of Levallois technology within Late Acheulian contexts represents instances of technological convergence.

Science 26 September 2014: Vol. 345 no. 6204 pp. 1609-1613 DOI: 10.1126/science.1256484

Early Levallois technology and the Lower to Middle Paleolithic transition in the Southern Caucasus 

D. S. Adler et al.

ABSTRACT The Lower to Middle Paleolithic transition (~400,000 to 200,000 years ago) is marked by technical, behavioral, and anatomical changes among hominin populations throughout Africa and Eurasia. The replacement of bifacial stone tools, such as handaxes, by tools made on flakes detached from Levallois cores documents the most important conceptual shift in stone tool production strategies since the advent of bifacial technology more than one million years earlier and has been argued to result from the expansion of archaic Homo sapiens out of Africa. Our data from Nor Geghi 1, Armenia, record the earliest synchronic use of bifacial and Levallois technology outside Africa and are consistent with the hypothesis that this transition occurred independently within geographically dispersed, technologically precocious hominin populations with a shared technological ancestry.


September 23, 2014

ESHE 2014 abstracts

The abstracts for the European Society for the study of Human Evolution meeting that just took place are available in this PDF.

September 18, 2014

23andMe mega-study on different American groups

It's great to see that the massive dataset of 23andMe was used for a study like this that seeks to capture the landscape of ancestry of different American groups.

First, distribution of ancestry in African Americans:

The higher fraction of African ancestry in the south and of European ancestry in the north shouldn't be very surprising. There are some interesting loci of higher "Native American" ancestry; most African Americans don't seem to have a lot of this ancestry, but some apparently do.

Second, distribution of ancestry in "Latinos":

To my eye, this seems like more African ancestry in the eastern parts (presumably from Caribbean-type Latinos?) and more Native American ancestry in the west.

Third, distribution of ancestry in European Americans:

Overall, it seems that relatively few (less than 5%) of European Americans have more than 2% of either African or Native American ancestry in any of the states, so the breakdown of European ancestry into various subgroups is perhaps more interesting.

The distribution of African ancestry in European and African Americans is also interesting:

The existence of "African Americans" with virtually no African ancestry and of "European Americans" with as much as half African ancestry is probably due to either misreporting or some quite strange self-perception issues. The bulk of the African ancestry in European Americans seems to be in the sub-10% range (equivalent to less than 1 great-grandparent). It is possible that many of these individuals might not even be aware of the existence of such ancestors.
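
The great-grandparent comparison follows from the fact that the expected autosomal contribution of a single ancestor halves with each generation. A minimal sketch (the function name is illustrative; actual inherited fractions scatter around these expectations because of recombination):

```python
def expected_fraction(generations_back):
    """Expected autosomal share from one ancestor `generations_back` generations ago."""
    return 0.5 ** generations_back

ancestors = [
    ("parent", 1),
    ("grandparent", 2),
    ("great-grandparent", 3),
    ("great-great-grandparent", 4),
]
for label, g in ancestors:
    print(f"one {label}: {expected_fraction(g):.2%}")
```

One great-grandparent corresponds to an expected 12.5%, so an ancestry fraction below 10% is on average less than the contribution of a single great-grandparent.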

bioRxiv doi:

The genetic ancestry of African, Latino, and European Americans across the United States.

Katarzyna Bryc, Eric Durand, J Michael Macpherson, David Reich, Joanna Mountain

Over the past 500 years, North America has been the site of ongoing mixing of Native Americans, European settlers, and Africans brought largely by the Trans-Atlantic slave trade, shaping the early history of what became the United States. We studied the genetic ancestry of 5,269 self-described African Americans, 8,663 Latinos, and 148,789 European Americans who are 23andMe customers and show that the legacy of these historical interactions is visible in the genetic ancestry of present-day Americans. We document pervasive mixed ancestry and asymmetrical male and female ancestry contributions in all groups studied. We show that regional ancestry differences reflect historical events, such as early Spanish colonization, waves of immigration from many regions of Europe, and forced relocation of Native Americans within the US. This study sheds light on the fine-scale differences in ancestry within and across the United States, and informs our understanding of the relationship between racial and ethnic identities and genetic ancestry.


Murderous chimps

Nature 513, 414–417 (18 September 2014) doi:10.1038/nature13727

Lethal aggression in Pan is better explained by adaptive strategies than human impacts

Michael L. Wilson et al.

Observations of chimpanzees (Pan troglodytes) and bonobos (Pan paniscus) provide valuable comparative data for understanding the significance of conspecific killing. Two kinds of hypothesis have been proposed. Lethal violence is sometimes concluded to be the result of adaptive strategies, such that killers ultimately gain fitness benefits by increasing their access to resources such as food or mates [1, 2, 3, 4, 5]. Alternatively, it could be a non-adaptive result of human impacts, such as habitat change or food provisioning [6, 7, 8, 9]. To discriminate between these hypotheses we compiled information from 18 chimpanzee communities and 4 bonobo communities studied over five decades. Our data include 152 killings (n = 58 observed, 41 inferred, and 53 suspected killings) by chimpanzees in 15 communities and one suspected killing by bonobos. We found that males were the most frequent attackers (92% of participants) and victims (73%); most killings (66%) involved intercommunity attacks; and attackers greatly outnumbered their victims (median 8:1 ratio). Variation in killing rates was unrelated to measures of human impacts. Our results are compatible with previously proposed adaptive explanations for killing by chimpanzees, whereas the human impact hypothesis is not supported.


September 13, 2014

Ancient mtDNA from southern Africa related to San

Genome Biol Evol (2014) doi: 10.1093/gbe/evu202

First Ancient Mitochondrial Human Genome from a Pre-Pastoralist Southern African

Alan G. Morris et al.

The oldest contemporary human mitochondrial lineages arose in Africa. The earliest divergent extant maternal offshoot, namely haplogroup L0d, is represented by click-speaking forager peoples of Southern Africa. Broadly defined as Khoesan, contemporary Khoesan are today largely restricted to the semi-desert regions of Namibia and Botswana, while archeological, historical and genetic evidence promotes a once broader southerly dispersal of click-speaking peoples including southward migrating pastoralists and indigenous marine-foragers. Today extinct, no genetic data has been recovered from the indigenous peoples that once sustained life along the southern coastal waters of Africa pre-pastoral arrival. In this study we generate a complete mitochondrial genome from a 2,330 year old male skeleton, confirmed via osteological and archeological analysis as practicing a marine-based forager existence. The ancient mtDNA represents a new L0d2c lineage (L0d2c1c) that is today, unlike its Khoe-language based sister-clades (L0d2c1a and L0d2c1b) most closely related to contemporary indigenous San-speakers (specifically Ju). Providing the first genomic evidence that pre-pastoral Southern African marine foragers carried the earliest diverged maternal modern human lineages, this study emphasizes the significance of Southern African archeological remains in defining early modern human origins.


September 10, 2014

ASHG 2014 titles and abstracts

Some interesting titles from the ASHG 2014 conference.

UPDATE: I have added the abstracts.

The human X chromosome is the target of megabase wide selective sweeps associated with multi-copy genes expressed in male meiosis and involved in reproductive isolation. M. H. Schierup, K. Munch, K. Nam, T. Mailund, J. Y. Dutheil.
   The X chromosome differs from the autosomes in its hemizygosity in males and in its intimate relationship with the very different Y chromosome. It has a different gene content than autosomes and undergoes specific processes such as meiotic sex chromosome inactivation (MSCI) and XY body formation. Previous studies have shown that natural selection is more efficient against deleterious mutations and, in chimpanzee, that positive selection is prevalent. We show that in all great apes species, megabase wide regions of the X chromosome have severely reduced diversity (by more than 80%). These regions are partly shared among species and indicate a large number of strong selective sweeps that have occurred independently on the same set of targets in different great apes species. We use simulations and deterministic calculations to show that background selection or soft selective sweeps are unlikely to be responsible. The regions also bear all the hallmarks of selective sweeps such as an increased proportion of singletons and higher divergence among closely related populations. Human populations are differently affected, suggesting that a large fraction of sweeps are private to specific human populations. The regions of reduced diversity correlate strongly with the position of X-ampliconic regions, which are 100-500 kb regions containing multiple copies of genes that are solely expressed during male meiosis. We propose that the genes in these regions escape MSCI and participate in an intragenomic conflict with regions of similar function on the Y chromosome for transmission of sex chromosomes to the next generation, i.e. sex chromosome meiotic drive. Recent results from Neanderthal introgression into humans point to the same regions as showing no introgression, consistent with the above process leading to reproductive isolation.
Strikingly, the same regions of the X also show much reduced divergence between human and chimpanzee, suggesting either that this speciation process was indeed complex or that the same regions were under strong selection in the human chimpanzee ancestor.
New insights on human de novo mutation rate and parental age. W. S. W. Wong, B. Solomon, D. Bodian, D. Thach, R. Iyer, J. Vockley, J. Niederhuber.
 Germline mutations have a major role to play in evolution. Much attention has been given to studying the pattern and rate of human mutations using biochemical or phylogenetic methods based on closely related species. Massively parallel sequencing technologies have given scientists the opportunity to study directly measured de novo mutations (DNMs) at an unprecedented scale. Here we report the largest study (to our knowledge) of de novo point mutations in humans, in which we used whole genome deep sequencing (~60x) data from 605 family trios (father, mother and newborn). These trios represent the first group of approximately 2,700 trios who have undergone whole-genome sequencing (WGS) through our pediatric-based WGS research studies. The fathers' ages range from 17 to 63 years and the mothers' ages range from 17 to 43 years. We identified over 23,000 DNMs (~40 per newborn) in the autosomal chromosomes using a customized pipeline and infer that the mutation rate per basepair is around 1.2x10^-8 per generation, well within the reported range in previous studies. We were also able to confirm that the total number of DNMs in the newborn was directly proportional to the paternal age (P < 2x10^-16). Maternal age is shown to have a small but significant positive effect on the number of DNMs passed onto the offspring (P = 0.003), even after accounting for the paternal age. This contradicts the prior dogma that maternal age only has an effect on chromosomal abnormalities related to nondisjunction events. Furthermore, 5% (22 total) of newborns in the analyzed group were conceived with assisted reproductive technologies (ARTs), and these infants have on average 5 more DNMs (Bias corrected and accelerated bootstrap 95% Confidence Interval, 1.24 to 8.00) than those conceived naturally, after controlling for both parents' ages. Both parents' ages remain significant as independently correlated with DNMs even after the families that used ARTs were removed from the analysis.
Our study enhances current knowledge related to the human germline mutational rates.
Alignment to an ancestry specific reference genome discovers additional variants among 1000 Genomes ASW Cohort. R. A. Neff, J. Vargas, G. H. Gibbons, A. R. Davis.
   Whole genome sequencing studies across certain populations, such as those with African ancestry, are often underpowered due to a larger divergence between the common reference genome and the true genetic sequence of the population. However, a common reference genome is not designed to account for this divergence in population-specific studies. Strong signals from common (MAF>50%) single nucleotide polymorphisms (SNPs), insertion-deletions (indels), and structural variants (SVs) can make alignment and variant calling difficult by masking nearby variants with weaker genetic signals. We present the results generated from alignment to an African descent population-specific reference genome by applying variants present in a majority of individuals with African descent from all phases of the 1000 Genomes Project and the International HapMap Consortium. We identified 882,826 single nucleotide polymorphisms, short insertion-deletion events, and large structural variations present at MAF>50% in the population, representing 2.39 MB of genetic variation changed from hg19. We demonstrate that utilization of a population-specific reference improves variant call quality, coverage level, and imputation accuracy. We compared alignment of 27 African-American SW population (ASW) samples from the 1000 Genomes Phase 1 project between the population-specific and the hg19 reference. We discovered an additional 443,036 SNPs by alignment to the population specific reference in union across all samples, including thousands of exonic variants that are non-synonymous and are clinically relevant to the study of disease.
Using compressed data structures to capture variation in thousands of human genomes. S. A. McCarthy, Z. Lui, J. T. Simpson, Z. Iqbal, T. M. Keane, R. Durbin.
   Currently the most widely used approach to catalogue variation amongst a set of samples is to align the sequencing reads to a single linear reference genome. This principle has been at the core of the 1000 Genomes data processing pipeline since the pilot phase of the project. However, there is now an increased awareness of the limitations of this approach, such as alignment artefacts, reference bias and unobserved variation on non-reference haplotypes. The Burrows-Wheeler transform and FM-index are compact data structures that have been successfully used in sequence alignment and assembly. One of the key features of these structures is that they are a searchable and reference-free representation of the raw sequencing reads. Our project aims to build a web server based on BWT data structures containing all the reads from many thousands of samples so as to efficiently retrieve matching reads and information about samples and populations. Enticingly, it is expected that data storage for this system would plateau as we collect more data since most new sequencing reads will have already been observed. We expect this to enable powerful new ways to query variation data from thousands of individuals. For the first phase of this project, we include all 87 Tbp of the low-coverage and exome data from the 2,535 samples in 1000 Genomes Phase 3. We envisage this would provide a means for researchers to easily check the prevalence of any human sequence in a control set of thousands of putatively healthy samples. We present our approaches and initial benchmarks on variant sensitivity and specificity against truth datasets and explore several applications for these structures such as validation of short insertion/deletion and structural variant calls, and rapid searching for traces of viral DNA.
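
To make the data structures named in this abstract concrete, here is a minimal sketch of a Burrows-Wheeler transform and FM-index style backward search on a toy string. It is a textbook illustration, not the authors' server code; the function names and the example sequence are assumptions, and a real index would precompute the rank table instead of rescanning the string.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_str, pattern):
    """Backward search: how many times `pattern` occurs in the original text."""
    # first_row[c] = number of characters in the text strictly smaller than c
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # occurrences of ch in bwt_str[:i]; recomputed here for brevity
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(ch, lo)
        hi = first_row[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

index = bwt("GATTACAGATTACA")
hits = count_occurrences(index, "TTA")
print(index, hits)
```

The search touches only the transformed string, which is what makes the representation "searchable and reference-free" in the sense of the abstract.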
Second-generation PLINK: Rising to the challenge of larger and richer datasets. C. C. Chang, C. C. Chow, L. C. A. M. Tellier, S. Vattikuti, S. M. Purcell, J. J. Lee.
   PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.    To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.    The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
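
The "bit-level parallelism" mentioned here can be sketched in a few lines. This is only an illustration of the general idea (many samples packed into one machine word and counted with a population count); it is not PLINK's code or file format, and the helper names and genotype values are hypothetical.

```python
def pack(flags):
    """Pack an iterable of booleans into one integer, one bit per sample."""
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word

def popcount(word):
    """Number of set bits in the integer."""
    return bin(word).count("1")

# Hypothetical genotypes (number of alternate alleles) for 8 samples.
genotypes = [0, 1, 2, 1, 0, 0, 2, 1]
het = pack(g == 1 for g in genotypes)      # bit-plane of heterozygotes
hom_alt = pack(g == 2 for g in genotypes)  # bit-plane of alt homozygotes

# One popcount per bit-plane replaces a per-sample loop.
alt_allele_count = popcount(het) + 2 * popcount(hom_alt)
alt_freq = alt_allele_count / (2 * len(genotypes))
print(alt_allele_count, alt_freq)
```

In compiled code the same trick processes dozens of samples per instruction, which is one plausible source of the speedups the abstract reports.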
Integrated analysis of protein-coding variation in over 90,000 individuals from exome sequencing data. D. G. MacArthur, M. Lek, E. Banks, R. Poplin, T. Fennell, K. Samocha, B. Thomas, K. Karczewski, S. Purcell, P. Sullivan, S. Kathiresan, M. I. McCarthy, M. Boehnke, S. Gabriel, D. M. Altshuler, G. Getz, M. J. Daly, Exome Aggregation Consortium.
Exploring genetic variation and genotypes among millions of genomes. R. M. Layer, A. R. Quinlan.
   Rare, and thus largely unknown, variants are a major reason that, typically, less than 10% of the heritability of complex diseases currently can be explained by known genetic variation. While increasing the number of sequenced genomes may improve our ability to reveal this “hidden heritability,” the scale of the resulting dataset poses substantial storage and computational demands. Current efforts to sequence 100,000 genomes, and combined efforts that are likely to surpass 1 million genomes will identify hundreds of millions to billions of polymorphic loci. The minimum storage requirement for directly representing the variability found by these projects (1 bit per individual per variant, ignoring the necessary metadata) will range from terabytes to petabytes. Like most big-data problems, a balance must be found between optimizing storage and computational efficiency. For example, while compression can minimize storage by reducing file size, it can also cause inefficient computation since data must be decompressed before it can be analyzed. Conversely, highly structured data can reduce analysis times but typically require extra metadata that increase file size. Current variation storage schemes were not designed to quickly analyze massive datasets and fail to balance these competing goals. We present GENOTQ, an open source API and toolkit that reduces file size and data access time through use of a succinct data structure, a class of data structures that compress data such that operations can be performed without requiring the full decompression. Word aligned hybrid (WAH) bitmap compression is one such data structure that was developed to improve query times for relational databases. Binary values are encoded such that logical operations (AND, OR, NOT) can be performed on the compressed data. This encoding results in file sizes that are 20X smaller than uncompressed versions, and only 50% larger than the compressed version. 
Queries, such as finding shared variants among a subpopulation, are also 21X faster. Furthermore, representing the genotypes in this manner makes our method well suited to both distributed architectures like BigQuery and parallel processors like GPUs. We stress that this method is only part of a larger solution that would incorporate genomic annotations, medical histories, and pedigrees. Incorporating fast genotype queries with this web of metadata will provide a rich information source to both clinicians and researchers.
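
The key property claimed for the compressed bitmaps, logical operations performed without full decompression, can be shown with a simplified sketch. Plain run-length encoding is used here as a stand-in; it is not the word-aligned hybrid (WAH) scheme itself and not the GENOTQ API, and all names and data below are hypothetical.

```python
def rle_encode(bits):
    """Encode a 0/1 sequence as [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_decode(runs):
    """Expand [bit, run_length] pairs back into a flat 0/1 list."""
    return [bit for bit, n in runs for _ in range(n)]

def rle_and(a, b):
    """AND two equal-length run-length encoded bitmaps run by run."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)       # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step
        else:
            out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out

# Which of 10 samples carry both hypothetical variants?
carriers_variant_1 = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
carriers_variant_2 = [0, 0, 0, 1, 1, 0, 1, 1, 0, 0]
shared = rle_and(rle_encode(carriers_variant_1), rle_encode(carriers_variant_2))
print(shared)
```

The work done is proportional to the number of runs, not the number of samples, which is why sparse genotype data (long runs of non-carriers) suits this kind of encoding.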
Capture of 390,000 SNPs in dozens of ancient central Europeans reveals a population turnover in Europe thousands of years after the advent of farming. I. Lazaridis, W. Haak, N. Patterson, N. Rohland, S. Mallick, B. Llamas, S. Nordenfelt, E. Harney, A. Cooper, K. W. Alt, D. Reich.
   To understand the population transformations that took place in Europe since the early Neolithic, we used a DNA capture technique to obtain reads covering ~390 thousand single nucleotide polymorphisms (SNPs) from a number of different archaeological cultures of central Europe (Germany and Hungary). The samples spanned the time period from 7,500 BP to 3,500 BP (Early Neolithic to Early Bronze Age periods) and most of them were previously studied using mtDNA (Brandt, Haak et al., Science, 2013). The captured SNPs include about 360,000 SNPs from the Affymetrix Human Origins Array that were discovered in African individuals, as well as about 30,000 SNPs chosen for other reasons (that are thought to have been affected by natural selection, or to have phenotypic effects, or are useful in determining Y-chromosome haplogroups). By analyzing this data together with a dataset of 2,345 present-day humans and other published ancient genomes, we show that late Neolithic inhabitants of central Europe belonging to the Corded Ware culture were not a continuation of the earlier occupants of the region. Our results highlight the importance of migration and major population turnover in Europe long after the arrival of farming.
Insights into British and European population history from ancient DNA sequencing of Iron Age and Anglo-Saxon samples from Hinxton, England. S. Schiffels, W. Haak, B. Llamas, E. Popescu, L. Loe, R. Clarke, A. Lyons, P. Paajanen, D. Sayer, R. Mortimer, C. Tyler-Smith, A. Cooper, R. Durbin.
   British population history is shaped by a complex series of repeated immigration periods and associated changes in population structure. It is an open question however, to what extent each of these changes is reflected in the genetic ancestry of the current British population. Here we use ancient DNA sequencing to help address that question. We present whole genome sequences generated from five individuals that were found in archaeological excavations at the Wellcome Trust Genome Campus near Cambridge (UK), two of which are dated to around 2,000 years before present (Iron Age), and three to around 1,300 years before present (Anglo-Saxon period). Good preservation status allowed us to generate one high coverage sequence (12x) from an Iron Age individual, and four low coverage sequences (1x-4x) from the other samples.   By providing the first ancient whole genome sequences from Britain, we get a unique picture of the ancestral populations in Britain before and after the Anglo-Saxon immigrations. We use modern genetic reference panels such as the 1000 Genomes Project to examine the relationship of these ancient samples with present day population genetic data. Results from principal component analysis suggest that all samples fall consistently within the broader Northern European context, which is also consistent with mtDNA haplogroups. In addition, we obtain a finer structural genetic classification from rare genetic variants and haplotype based methods such as FineStructure. Reflecting more recent genetic ancestry, results from these methods suggest significant differences between the Iron Age and the Anglo-Saxon period samples when compared to other European samples. We find in particular that while the Anglo-Saxon samples resemble more closely the modern British population than the earlier samples, the Iron Age samples share more low frequency variation than the later ones with present day samples from southern Europe, in particular Spain (1000GP IBS). 
In addition the Anglo-Saxon period samples appear to share a stronger older component with Finnish (1000GP FIN) individuals. Our findings help characterize the ancestral European populations involved in major European migration movements into Britain in the last 2,000 years and thus provide more insights into the genetic history of people in northern Europe.
Fine-scale population structure in Europe. S. Leslie, G. Hellenthal, S. Myers, P. Donnelly, International Multiple Sclerosis Genetics Consortium.
   There is considerable interest in detecting and interpreting fine-scale population structure in Europe: as a signature of major events in the history of the populations of Europe, and because of the effect undetected population structure may have on disease association studies. Population structure appears to have been a minor concern for most of the recent generation of genome-wide association studies, but is likely to be important for the next generation of studies seeking associations to rare variants. Thus far, genetic studies across Europe have been limited to a small number of markers, or to methods that do not specifically account for the correlation structure in the genome due to linkage disequilibrium. Consequently, these studies were unable to group samples into clusters of similar ancestry on a fine (within country) scale with any confidence. We describe an analysis of fine-scale population structure using genome-wide SNP data on 6,209 individuals, sampled mostly from Western Europe. Using a recently published clustering algorithm (fineSTRUCTURE), adapted for specific aspects of our analysis, the samples were clustered purely as a function of genetic similarity, without reference to their known sampling locations. When plotted on a map of Europe one observes a striking association between the inferred clusters and geography. Interestingly, for the most part modern country boundaries are significant i.e. we see clear evidence of clusters that exclusively contain samples from a single country. At a high level we see: the Finns are the most differentiated from the rest of Europe (as might be expected); a clear divide between Sweden/Norway and the rest of Europe (including Denmark); and an obvious distinction between southern and northern Europe. We also observe considerable structure within countries on a hitherto unseen fine-scale - for example genetically distinct groups are detected along the coast of Norway. 
Using novel techniques we perform further analyses to examine the genetic relationships between the inferred clusters. We interpret our results with respect to geographic and linguistic divisions, as well as the historical and archaeological record. We believe this is the largest detailed analysis of very fine-scale human genetic structure and its origin within Europe. Crucial to these findings has been an approach to analysis that accounts for linkage disequilibrium.
The population structure and demographic history of Sardinia in relationship to neighboring populations. J. Novembre, C. Chiang, J. Marcus, C. Sidore, M. Zoledziewska, M. Steri, H. Al-asadi, G. Abecasis, D. Schlessinger, F. Cucca.
   Numerous studies have made clear that Sardinian populations are relatively isolated genetically from other populations of the Mediterranean, and more recently, intriguing connections between Sardinian ancestry and early Neolithic ancient DNA samples have been made. In this study, we analyze a whole-genome low-coverage sequencing dataset from 2120 Sardinians to more fully characterize patterns of genetic diversity in Sardinia. The study contains one subsample that contains individuals from across Sardinia and a second subsample that samples 4 villages from the more isolated Ogliastra region. We also merge the data with published reference data from Europe and North Africa. Overall Fst values of Sardinia to other European populations are low (less than 0.015); however, using a novel method for visualizing genetic differentiation on a geographic map, we formally show how Sardinia is more differentiated than would be expected given its geographic distance from the mainland, consistent with periods of isolation. Applications of the software Admixture show how Sardinian populations differ in the levels of recent admixture with mainland European populations and that there are only minor contributions from North African populations to Sardinian ancestry. Notably the Sardinians from Ogliastra contain a distinct genetic cluster with minimal evidence of recent admixture with mainland Europe. We found frequency-based f3 tests and the tree-based algorithm Treemix both also show minimal evidence of recent admixture. Given the relative isolation, one might expect to see a unique demographic history from neighboring populations. Using coalescent-based approaches, we find Sardinian populations have had more constant effective sizes over the past several thousand years than mainland European populations, which typically show evidence for rapid growth trajectories in the recent past. 
This unique demographic history has consequences for the abundance of putatively damaging and deleterious variants, and we use our data to address the prediction that the genetic architecture of disease traits is expected to involve fewer loci with a greater proportion of variants at common frequencies in Sardinia.
Population structure in African-Americans. S. Gravel, M. Barakatt, B. Maples, M. Aldrich, E. E. Kenny, C. D. Bustamante, S. Baharian.
   We present a detailed population genetic study of 4 African-American cohorts comprising over 6000 genotyped individuals across US urban and rural communities: two nation-wide longitudinal cohorts, one biobank cohort, and the 1000 genomes ASW cohort. Ancestry analysis reveals a uniform breakdown of continental ancestry proportions across regions and urban/rural status, with 79% African, 19% European, and 1.5% Native American/Asian ancestries, with substantial between-individual variation. The Native Ancestry proportion is higher than previous estimates and is maintained after self-identified Hispanics and individuals with substantial inferred Spanish ancestry are removed. This strongly supports direct admixture between Native Americans and African Americans on US territory, and linkage patterns suggest contact early after African-American arrival to the Americas. Local ancestry patterns and variation in ancestry proportions across individuals are broadly consistent with a single African-American population model with early Native American admixture and ongoing European gene flow in the South. The size and broad geographic sampling of our cohorts enables detailed analysis of the geographic and cultural determinants of finer-scale population structure. Recent Identity-by-descent analysis reveals fine-scale structure consistent with the routes used during slavery and in the great African-American migrations of the twentieth century: east-to-west migrations in the south, and distinct south-to-north migrations into New England and the Midwest. These migrations follow transit routes available at the time, and are in stark contrast with European-American relatedness patterns.
Genetic testing of 400,000 individuals reveals the geography of ancestry in the United States. Y. Wang, J. M. Granka, J. K. Byrnes, M. J. Barber, K. Noto, R. E. Curtis, N. M. Natalie, C. A. Ball, K. G. Chahine.
   The population of the United States is formed by the interplay of immigration, migration and admixture. Recent research (R. Sebro et al., ASHG 2013) has shed light on the U.S. demography by studying the self-reported ethnicity from the 2010 U.S. Census. However, self-reported ethnicity may not accurately represent true genetic ancestry and may therefore introduce unknown biases. Since launching its DNA service in May 2012, AncestryDNA has genotyped over 400,000 individuals from the United States. Leveraging this huge volume of DNA data, we conducted a large-scale survey of the ancestry of the United States. We predicted genetic ethnicity for each individual, relying on a rigorously curated reference panel of 3,000 single-origin individuals. Combining that with birth locations, we explored how various ethnicities are distributed across the United States. Our results reveal a distinct spatial distribution for each ethnicity. For example, we found that individuals from Massachusetts have the highest proportion of Irish genetic ancestry and individuals from New York have the highest proportion of Southern European genetic ancestry, indicating their unique immigration and migration histories. We also performed pairwise IBD analysis on the entire sample set and identified over 300 million shared genomic segments among all 400,000 individuals. From this data, we calculated the average amount of sharing for pairs of individuals born within the same state or from two different states. In general, we found the genetic sharing decreases as the geographic distance between two states increases. However, the pattern also varies substantially among the 50 states. In summary, our analysis has provided significant insight on the biogeographic patterns of the ancestry in the United States.
Statistical inference of archaic introgression and natural selection in Central African Pygmies. P. Hsieh, J. D. Wall, J. Lachance, S. A. Tishkoff, R. N. Gutenkunst, M. F. Hammer.
   Recent evidence from ancient DNA studies suggests that genetic material introgressed from archaic forms of Homo, such as Neanderthals and Denisovans, into the ancestors of contemporary non-African populations. These findings also imply that hybridization may have given rise to some of the adaptive novelties in anatomically modern humans (AMH) as they expanded from Africa into various ecological niches in Eurasia. Within Africa, fossil evidence suggests that AMH and a variety of archaic forms coexisted for much of the last 200,000 years. Here we present preliminary results leveraging high quality whole-genome data (>60X coverage) for three contemporary sub-Saharan African populations (Biaka, Baka, and Yoruba) from Central and West Africa to test for archaic admixture. With the current lack of African ancient DNA, especially in Central Africa due to its rainforest environment, our statistical inference approach provides an alternative means to understand the complex evolutionary dynamics among groups of the genus Homo. To identify candidate introgressive loci, we scan the genomes of 16 individuals and calculate S*, a summary statistic that was specifically designed by one of us (JDW) to detect archaic admixture. The significance of each candidate is assessed through extensive whole-genome level simulations using demographic parameters estimated by ∂a∂i to obtain a parametric distribution of S* values under the null hypothesis of no archaic introgression. As a complementary approach, top candidates are also examined by an approximate-likelihood computation method. The admixture time for each individual introgressive variant is inferred by estimating the decay of the genetic length of the diverged haplotype as a function of its underlying recombination rate. A neutrality test that controls for demography is performed for each candidate to test the hypothesis that introgressive variants rose to high frequency due to positive directional selection. 
Several genomic regions were identified by both selection and introgression scans, and we will discuss the possible genetic and functional properties of these “double-hits”. The present study represents one of the most comprehensive genomic surveys to date for evidence of archaic introgression to anatomically modern humans in Africa.
Inferences about human history and natural selection from 280 complete genome sequences from 135 diverse populations. S. Mallick, D. Reich, Simons Genome Diversity Project Consortium.
   The most powerful way to study population history and natural selection is to analyze whole genome sequences, which contain all the variation that exists in each individual. To date, genome-wide studies of history and selection have primarily analyzed data from single nucleotide polymorphism (SNP) arrays which are biased by the choice of which SNPs to include. Alternatively they have analyzed sequence data that have been generated as part of medical genetic studies from populations with large census sizes, and thus do not capture the full scope of human genetic variation. Here we report high quality genome sequences (~40x average) from 280 individuals from 135 worldwide populations, including 45 Africans, 26 Native Americans, 27 Central Asians or Siberians, 46 East Asians, 25 Oceanians, 46 South Asians, and 71 West Eurasians. All samples were sequenced using an identical protocol at the same facility (Illumina Ltd.). We modified standard pipelines to eliminate biases that might confound population genetic studies. We report novel inferences, as well as a high resolution map that shows where archaic ancestry (Neanderthal and Denisovan) is distributed throughout the world. We compare and contrast the genomic landscape of the Denisovan introgression into mainland Eurasians to that in island Southeast Asians. We are making this dataset fully available on Amazon Web Services as a resource to the community, coincident with the American Society of Human Genetics meeting.
Improved haplotype phasing using identity by descent. B. L. Browning, S. R. Browning.
   We present a new haplotype phasing method that achieves higher accuracy than existing methods. The method is based on the Beagle haplotype frequency model, but unlike the original Beagle phasing method, the new method incorporates genetic recombination, genotype error, and segments of identity by descent.     We compared the new haplotype phasing method to Beagle (r1230) and to SHAPEIT version 2 (r778) using Illumina Human 1M SNP data for chromosome 20. We phased 44 HapMap3 CEU trio offspring together with subsets of Wellcome Trust Case Control Consortium 2 controls (n=650, 1300, 2600, 5200). Phase error was measured at trio offspring genotypes on chromosome 20 that have phase determined by parental genotypes. The SHAPEIT “states” parameter was set at 6400 in order to increase its phasing accuracy.     The new haplotype phasing method produced haplotype switch error rates that were 20-25% lower than the error rates for the existing Beagle method and 1-7% lower than the error rates for SHAPEIT. The difference in switch error rates between the new method and SHAPEIT increased with increasing sample size.     The new haplotype phasing method will be incorporated into version 4 of the Beagle software package (
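For readers unfamiliar with the "switch error rate" used to compare the phasing methods above, here is a minimal sketch of how such a rate can be computed for one individual. It is an illustration only, not code from Beagle or SHAPEIT: the function name and the toy haplotypes are hypothetical, and it assumes the inferred genotypes agree with the truth at every heterozygous site.

```python
def switch_error_rate(true_h1, true_h2, inferred_h1):
    """Fraction of adjacent heterozygous-site pairs whose relative phase is wrong.

    true_h1, true_h2: the two true haplotypes (e.g. determined from parents in a trio),
                      as equal-length sequences of 0/1 alleles.
    inferred_h1:      one of the two statistically phased haplotypes.
    Assumption: inferred genotypes match the truth, so only phase can differ.
    """
    # Phase is only defined where the individual is heterozygous.
    het = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    if len(het) < 2:
        raise ValueError("need at least two heterozygous sites")
    # At each het site, does inferred haplotype 1 carry true haplotype 1's allele?
    matches = [inferred_h1[i] == true_h1[i] for i in het]
    # A switch error is a change of that answer between consecutive het sites.
    switches = sum(1 for prev, cur in zip(matches, matches[1:]) if prev != cur)
    return switches / (len(het) - 1)


# Toy data: five heterozygous sites and one homozygous site (the last one).
truth_1 = [0, 1, 0, 1, 0, 0]
truth_2 = [1, 0, 1, 0, 1, 0]
phased_1 = [0, 1, 1, 0, 1, 0]  # follows truth_1 for two sites, then truth_2
print(switch_error_rate(truth_1, truth_2, phased_1))  # 0.25
```

Note that a haplotype that is wrong everywhere in the same way (phased_1 equal to truth_2) scores zero: only changes of phase between neighbouring sites are counted.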
Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis. E. Y. Durand, N. Eriksson, C. Y. McLean.
   Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases. A large number of methods to detect IBD segments have been developed recently. However, IBD detection accuracy in non-simulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father-mother-child trios from the 23andMe, Inc. dataset. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false positive rate over 67% for 2-4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. We show that nearly all false positives arise due to allowing switch errors between haplotypes when detecting IBD, a necessity for retrieving long (> 6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that enables detection and filtering of false positive IBD segments on population-scale datasets. HaploScore scores IBD segments proportional to the number of switch errors they contain. Thus, it enables filtering of spurious segments reported due to GERMLINE being overly permissive to imperfect phasing. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative genotyping arrays using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can be readily adapted to improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.
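The pedigree logic in this abstract can be sketched in a few lines. This is a guess at the general idea, not the authors' pipeline: a segment that a child shares with an unrelated individual should also show up, at roughly the same position, between that individual and at least one of the child's parents. The function names, the centiMorgan intervals, and the 80% coverage threshold below are all hypothetical.

```python
def covered_fraction(segment, other):
    """Fraction of `segment` (start_cM, end_cM) that lies inside `other`."""
    start, end = segment
    other_start, other_end = other
    overlap = min(end, other_end) - max(start, other_start)
    return max(0.0, overlap) / (end - start)


def unsupported_fraction(child_segments, parent_segments, min_cover=0.8):
    """Share of child segments with no matching parental segment.

    child_segments:  IBD segments reported between a child and some individual X.
    parent_segments: IBD segments reported between either parent and the same X.
    min_cover:       assumed tolerance for fuzzy segment boundaries (hypothetical value).
    """
    if not child_segments:
        raise ValueError("no child segments to evaluate")
    unsupported = [
        seg for seg in child_segments
        if not any(covered_fraction(seg, p) >= min_cover for p in parent_segments)
    ]
    return len(unsupported) / len(child_segments)


# Toy example on one chromosome, positions in cM.
child = [(10.0, 13.0), (40.0, 43.0), (70.0, 72.5)]
parents = [(9.0, 14.0), (41.9, 50.0)]
print(unsupported_fraction(child, parents))  # two of three segments lack parental support
```

In this toy case only the first child segment is backed by a parental segment, so two thirds of the reported segments would be flagged as likely false positives.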
Parente2: A fast and accurate method for detecting identity by descent. S. Bercovici, J. M. Rodriguez, L. Huang, S. Batzoglou.
   Identity-by-descent (IBD) inference is the problem of establishing a direct and explicit genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is key to a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of both accurate and efficient IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: given a cohort’s size, a quadratic number of pairwise genome comparisons must be performed, in principle. Therefore, computation time and the false discovery rate can also scale quadratically. To enable practical large-scale IBD detection, we developed Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses an ensemble windowing approach to model complex linkage disequilibrium in the underlying studied population. Parente2 is applied directly on genotype data without the need to phase data prior to IBD inference. Through extensive simulations using real data, we evaluate Parente2’s performance. We show that Parente2 is superior to previous state-of-the-art methods, detecting pairs of related individuals sharing a 4 cM IBD segment with 99.9% sensitivity at a 0.1% false positive rate, and achieving 79.2% sensitivity at a 1% false positive rate for the more challenging case of pairs sharing a 2 cM IBD segment. Additionally, Parente2 is efficient, providing one to two orders of magnitude speedup compared to previous state-of-the-art methods. Parente2 is freely available at
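The quadratic scaling mentioned in this abstract is easy to make concrete. The sketch below only does the arithmetic; the cohort sizes are made up, and applying a per-pair false positive rate uniformly to every pair is a simplifying assumption of mine, not something the abstract states.

```python
def pairwise_comparisons(n_samples):
    """Number of unordered pairs among n_samples individuals: n(n-1)/2."""
    return n_samples * (n_samples - 1) // 2


def expected_false_pairs(n_samples, per_pair_false_positive_rate):
    """Expected number of falsely reported pairs if every pair were truly unrelated
    and each comparison erred independently at the given rate (an assumption)."""
    return pairwise_comparisons(n_samples) * per_pair_false_positive_rate


for n in (1_000, 10_000, 100_000):  # hypothetical cohort sizes
    print(n, pairwise_comparisons(n), expected_false_pairs(n, 0.001))
```

Ten times more samples means roughly a hundred times more comparisons, which is why even a 0.1% per-pair false positive rate produces a very large absolute number of spurious pairs in a big cohort.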
Fast PCA of very large samples in linear time. K. J. Galinsky, P. Loh, G. Bhatia, S. Georgiev, S. Mukherjee, N. J. Patterson, A. L. Price.
   Principal components analysis (PCA) is an effective tool for inferring population structure and correcting for population stratification in genetic data. Traditionally, PCA runs in O(MN^2 + N^3) time, where M is the number of variants and N is the number of samples. Here, we describe a new algorithm, fastpca, for approximating the top K PCs that runs in time O(MNK), making use of recent advances in random low-rank matrix approximation algorithms (Rokhlin et al. 2009). fastpca avoids computing the GRM and associated computational and memory storage costs, enabling PCA of very large datasets on standard hardware. We estimated the top 10 PCs of the WTCCC dataset (16k samples, 101k variants) in roughly 7 minutes while consuming 1GB of RAM, compared to 1 hour and 2.5GB for PLINK2. The fastpca approximation was extremely accurate (r^2>99% between all fastpca and PLINK2 PCs). The improvement in running time becomes even larger at larger sample sizes; for example, fastpca estimated the top 10 PCs of a simulated data set with 100k samples and 300k variants in 135 minutes using 8.5GB of RAM, vs. an estimated 350 hours and 85GB of RAM using PLINK2. A recently published O(MN^2) time method, flashpca, did not complete on this data set due to exceeding 40GB memory requirement. All of these analyses were based on LD-pruning SNPs with r^2>0.2, which leads to much more accurate PCs in simulations as compared to retaining all SNPs; more complex LD-adjustment strategies provide only a small further improvement.
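To see why skipping the N-by-N GRM saves time and memory, here is a toy sketch in plain Python. It is not fastpca and not the randomized algorithm the abstract cites: it is an ordinary power iteration for the single top PC, chosen only because it shows the same trick of multiplying by the genotype matrix and its transpose in turn instead of ever forming the sample-by-sample matrix. The function names and the tiny genotype matrix are hypothetical.

```python
import math
import random


def center_columns(genotypes):
    """Subtract each variant's mean so columns have mean zero."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in genotypes]


def top_pc(X, iterations=100, seed=0):
    """Approximate the top sample-space PC of X (N samples x M variants).

    Each iteration computes X (X^T u) with two matrix-vector products,
    costing O(MN), and never builds the N x N matrix X X^T.
    """
    n, m = len(X), len(X[0])
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(m)]  # w = X^T u
        u = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]  # u = X w
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
    return u


# Six samples, five variants (0/1/2 allele counts); samples 0-2 and 3-5
# are drawn by hand to resemble two differentiated populations.
G = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 0],
    [2, 2, 0, 1, 1],
    [0, 0, 2, 2, 0],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 1],
]
pc1 = top_pc(center_columns(G))
print([round(v, 3) for v in pc1])  # first three and last three samples get opposite signs
```

Extending this to the top K PCs would mean iterating on K vectors at once and re-orthogonalizing them, which is where the O(MNK) cost in the abstract comes from.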
Fast detection of IBD segments associated with quantitative traits in genome-wide association studies. Z. Wang, E. Kang, B. Han, S. Snir, E. Eskin.
   Recently, many methods have been developed to detect the identity-by-descent (IBD) segments between a pair of individuals. These methods are able to detect very small shared IBD segments between a pair of individuals up to 2 centimorgans in length. This IBD information can be used to identify recent rare mutations associated with phenotype of interest. Previous approaches for IBD association were applicable to case/control phenotypes. In this work, we propose a novel and natural statistic for the IBD association testing, which can be applied to quantitative traits. A drawback of the statistic is that it requires a large number of permutations to assess the significance of the association, which can be a great computational challenge. We make a connection between the proposed statistic and linear models so that it does not require permutations to assess the significance of an association. In addition, our method can control population structure by utilizing linear mixed models.
Long-range haplotype mapping in Hispanic/Latinos reveals loci for short stature. G. Belbin, D. Ruderfer, K. Slivinski, M.C. Yee, J. Jeff, O. Gottesman, E.A. Stahl, R.J.F. Loos, E.P. Bottinger, E.E. Kenny.
   The Hispanic/Latino (HL) population of Northern Manhattan represents a diverse recent diaspora population, with 95% of the individuals reporting having grandparents born outside of the United States. Of these, 43% report grandparents born in Puerto Rico, 23% the Dominican Republic, 13% Central America, and 5%, 4%, and 2% from Mexico, South America, and Europe respectively. Despite complex patterns of migration, admixture, and diversity, strong signatures of cryptic relatedness persist amongst HLs. We have detected long-range genomic tract sharing (>3cM), or identity-by-descent (IBD), across 5,194 HL in the Mount Sinai BioMe Biobank. We observed an average population level IBD sharing of 0.0025 in HL, which is 2.5- and 5-fold higher than that observed in BioMe European- and African-American populations, respectively. We hypothesize that these patterns of recent migration and genetic drift may drive some otherwise rare functional alleles to detectable frequency. We clustered groups of homologous IBD tracts (n=112,250) segregating in this HL population. We observed that IBD clusters represent a class of low frequency alleles (median minor allele frequency =0.0077, s.d.=0.0015). We performed a genome-wide association of the IBD clusters, or ‘population-based linkage’, to detect loci implicated in height, a highly heritable polygenic trait. 15 independent loci surpassed our empirically derived genome-wide significance threshold of less than 4.47x10^-4, 11 of which replicated in an independent cohort of BioMe HLs. Strikingly, two regions confer strong recessive effects. In the case of the top hit on 9q32 (MAF less than 0.005; p less than 8x10^-6), homozygous non-referent individuals were shorter by 6” or 10”, for men or women, respectively, compared to the population mean (5’ 7” and 5’ 2” for men and women, respectively). In addition, IBD haplotypes in the 9q32 cluster harbored a significant enrichment of Native American ancestry (p less than 1x10^-16). 
Finally, this interval contains a number of biologically compelling candidate genes, including COL27A1 and PALM2. This study demonstrates that rich population structure, rather than being a confounding factor in biomedical discovery efforts, may be leveraged to reveal novel genetic associations with complex human traits.
A haplotype reference panel of over 31,000 individuals and next-generation imputation methods. S. Das, on behalf of Haplotype Reference Consortium.
   Genotype imputation is now a key tool in the analysis of human genetic studies, enabling array-based genetic association studies to examine the millions of variants that are being discovered by advances in whole genome sequencing. Examining these variants increases power and resolution of genetic association studies and makes it easier to compare the results of studies conducted using different arrays. Genotype imputation improves in accuracy with increasing numbers of sequenced samples, particularly for low frequency variants. The goal of the Haplotype Reference Consortium is to combine haplotype information from ongoing whole genome sequencing studies to create a large imputation resource. To date, we have collected information on >31,500 sequenced whole genomes, aggregated over 20 studies of predominantly European ancestry, to create a very large reference panel of human haplotypes where ~50M genetic variants are observed 5 or more times. These haplotypes can be used to guide genotype imputation and haplotype estimation. In preliminary empirical evaluations, our panel provides substantial increases in accuracy relative to the 1000 Genomes Project Phase 1 reference panel and other smaller panels, particularly for variants with frequency less than 5%. I will describe our evaluation of strategies for merging haplotypes and variant lists across studies and advances in methods for genotype likelihood-based haplotype estimation that can be applied to 10,000s of samples. I will also summarize new methods for next generation imputation that perform faster and require less memory than contemporary methods while attaining similar levels of imputation accuracy. Our full resource is available to the community through imputation servers that enable scientists to impute missing variants in any study and respect the privacy of subjects contributing to the studies that constitute the Haplotype Reference Consortium. The majority of haplotypes will also be deposited in the European Genotype Archive.
A rare variant local haplotype sharing method with application to admixed populations. S. Hooker, G. T. Wang, B. Li, Y. Guan, S. M. Leal.
   With the advent of next generation sequencing there is great interest in studying the involvement of rare variants in complex trait etiology. For many complex traits sequence data is being generated on DNA samples from African Americans and Hispanics to elucidate rare variant associations. Analyses of admixed populations present special challenges due to spurious associations which can occur because of confounding. However, using information on admixture and local ancestry can also be highly beneficial and increase the power to detect associations in these populations. Here a local haplotype sharing (LHS) method (Xu and Guan 2014) was extended to test for rare variant (RV) associations in admixed populations. Previously the Weighted Haplotype and Imputation-based Test (WHAIT) (Li et al. 2010) was proposed to test for rare variant associations using haplotype data. The RV-LHS method, unlike WHAIT, does not require reconstruction of haplotypes, which can be both computationally intensive and error prone. Additionally the RV-LHS uses information on local ancestry which is particularly advantageous when analyzing admixed populations. Results will be shown from simulation studies performed for rare variant data from an admixed population. Both Type I and II errors are evaluated for the RV-LHS method. Additionally the power of the RV-LHS method is compared to WHAIT as well as several other non-haplotype-based rare variant association methods including the combined multivariate collapsing (CMC) (Li and Leal, 2008), Variable Threshold (VT) (Price et al. 2010) and Sequence Kernel Association Test (SKAT) (Wu et al. 2010). Several heart, lung and blood phenotypes were analyzed using sequence data on African-Americans from the NHLBI-Exome Sequencing Project to better evaluate the performance of the RV-LHS compared to other rare variant association methods.