Dienekes’ Anthropology Blog: Natural selection


January 18, 2017

Dysgenic trend in educational attainment in Iceland

This is a very important study which (if replicated in other countries, with more complex demography, less complete genealogy, but much larger sample sizes) bodes ill for the future. It should also prompt studies of the evolution of cognitive ability at longer time scales (beyond traditional genealogy). Much has been written about genetic differences between the human races, for example, with the "cold winters" theory proposed to explain them as a product of natural selection.

But, this assumes that these differences are long-standing and date to the time that modern humans left Africa for more northern (and colder) latitudes. There is good reason to doubt this explanation: ancient writers of the Mediterranean classical world predictably identified themselves as the optimum, but remarked on the spiritedness and dullness of northerners in contrast to the lack of spirit but intelligence of southerners, which seemingly contradicts present-day cognitive ability distributions. But, it may very well be that cognitive ability has changed dramatically over this time period; certainly the fact that one of its correlates (educational attainment) can change in a small isolated population (Icelanders) over a century does not add to one's confidence that this is a trait that has been stable for millennia (let alone since the time of harsh Ice Age winters). As more markers are discovered to predict cognitive ability in human populations and it becomes easier to study ancient ones, it might be possible to track this trait convincingly.

On the positive side, the pliability of the genetic influences on cognition undercuts arguments that possible differences in this trait among human races and ethnic groups are solidly entrenched and unalterable. Rather, they may be accidents of recent evolution which could, in principle, be reversed.

PNAS doi: 10.1073/pnas.1612113114

Selection against variants in the genome associated with educational attainment

Augustine Kong et al.

Epidemiological and genetic association studies show that genetics play an important role in the attainment of education. Here, we investigate the effect of this genetic component on the reproductive history of 109,120 Icelanders and the consequent impact on the gene pool over time. We show that an educational attainment polygenic score, POLYEDU, constructed from results of a recent study is associated with delayed reproduction (P < 10^-100) and fewer children overall. The effect is stronger for women and remains highly significant after adjusting for educational attainment. Based on 129,808 Icelanders born between 1910 and 1990, we find that the average POLYEDU has been declining at a rate of ∼0.010 standard units per decade, which is substantial on an evolutionary timescale. Most importantly, because POLYEDU only captures a fraction of the overall underlying genetic component, the latter could be declining at a rate that is two to three times faster.

Link
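To put the abstract's numbers in perspective, here is a back-of-the-envelope sketch in Python. It assumes (as a simplification of mine, not the paper's model) that the reported rate of about 0.010 standard units per decade holds constant over the whole 1910-1990 birth window, and that the "two to three times faster" multiplier applies uniformly. The function name is made up for illustration.

```python
def cumulative_decline(rate_per_decade, start_year, end_year, multiplier=1.0):
    """Total decline in standard units, assuming a constant linear rate."""
    decades = (end_year - start_year) / 10
    return rate_per_decade * decades * multiplier

# Figures quoted in the abstract.
RATE = 0.010              # POLYEDU decline, standard units (SU) per decade
START, END = 1910, 1990   # birth years covered by the study

polyedu_total = cumulative_decline(RATE, START, END)
full_low = cumulative_decline(RATE, START, END, multiplier=2)
full_high = cumulative_decline(RATE, START, END, multiplier=3)

print(f"POLYEDU over {END - START} years: {polyedu_total:.2f} SU")
print(f"Underlying component, if 2-3x faster: {full_low:.2f} to {full_high:.2f} SU")
```

Under these assumptions the score itself drops by roughly 0.08 standard units across the eight decades, and the underlying genetic component by roughly 0.16 to 0.24.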

May 11, 2016

74 loci associated with educational attainment

Other than the claim in the abstract that educational attainment is "mostly environmentally determined" (*), this seems like a very useful study, as it identifies 74 loci associated with educational attainment and explores their interesting biology.

The utility of this type of study does not consist so much in the ability to predict one's educational potential by looking at one's genotype (we're a long way off from that, and a traditional pencil-and-paper test will probably beat genetics for a long time to come). Rather, it helps move the culture forward, away from the polite ultra-egalitarianism of today's dominant worldview and towards a more scientific attitude concerning the limits of education. Such an attitude will necessarily acknowledge, whether it seems fair to us or not, that genes sometimes dictate that the smart but slothful kid should outperform his diligent but dull-witted peer.

It will certainly be very interesting to see what better methods or even larger sample sizes will bring in years to come.

(*) The heritability of educational attainment has been estimated to be 67-74% of Norwegian males of the 1940-1961 period. There is actually no "universal heritability" of a trait. In a third world country it may very well be that one's educational attainment is determined mostly by environmental effects such as whether you have access to a school within reasonable distance or to just enough food during development. In a modern country (like post-war Norway or a technologically advanced future utopia), environmental effects are expected to be minimal (as everyone will get the best of everything), and variation in educational attainment will simply be due to genes (+noise).
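The point that there is no "universal heritability" follows from the textbook definition of heritability as a variance ratio, h2 = V_G / (V_G + V_E). The sketch below uses made-up variance numbers purely for illustration, and it ignores gene-environment correlation and interaction. Holding genetic variance fixed and shrinking environmental variance pushes the ratio up.

```python
def heritability(var_genetic, var_environment):
    """Heritability as the genetic share of total phenotypic variance."""
    return var_genetic / (var_genetic + var_environment)

# Hypothetical variances in arbitrary units; only the ratio matters.
V_G = 1.0
scenarios = {
    "very unequal environments": 4.0,
    "moderately unequal environments": 1.0,
    "nearly uniform environments": 0.25,
}

for label, v_e in scenarios.items():
    print(f"{label}: h2 = {heritability(V_G, v_e):.2f}")
```

With these illustrative inputs the same genetic variance yields a heritability of 0.20, 0.50, or 0.80, depending only on how much the environment varies.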

Nature (2016) doi:10.1038/nature17671

Genome-wide association study identifies 74 loci associated with educational attainment

Aysu Okbay et al.

Educational attainment is strongly influenced by social and other environmental factors, but genetic factors are estimated to account for at least 20% of the variation across individuals [1]. Here we report the results of a genome-wide association study (GWAS) for educational attainment that extends our earlier discovery sample [1, 2] of 101,069 individuals to 293,723 individuals, and a replication study in an independent sample of 111,349 individuals from the UK Biobank. We identify 74 genome-wide significant loci associated with the number of years of schooling completed. Single-nucleotide polymorphisms associated with educational attainment are disproportionately found in genomic regions regulating gene expression in the fetal brain. Candidate genes are preferentially expressed in neural tissue, especially during the prenatal period, and enriched for biological pathways involved in neural development. Our findings demonstrate that, even for a behavioural phenotype that is mostly environmentally determined, a well-powered GWAS identifies replicable associated genetic variants that suggest biologically relevant pathways. Because educational attainment is measured in large numbers of individuals, it will continue to be useful as a proxy phenotype in efforts to characterize the genetic influences of related phenotypes, including cognition and neuropsychiatric diseases.

Link

May 08, 2016

Natural selection in Britain during the last 2,000 years

The latest ancient DNA studies from the British Isles (Schiffels et al., Martiniano et al., and Cassidy et al.) support continuity over the last 2,000 years. Sure, there were continued migrations like the arrival of the Anglo-Saxons, but these were very similar groups in the grand scheme of things.

But, while ancestrally the modern Briton is probably a descendant of the Britons of 2,000 years ago with some admixture from similar continental European groups, he is not the same, as (apparently) substantial genetic adaptation has continued to operate in Britain over the same period. A new preprint by Field, Boyle, Telis et al. makes the case for adaptation in a variety of traits in the ancestors of Britons over this period. Mind you, the genetic underpinnings of many important human traits known to have high heritability are currently unknown, but there is little doubt that selection would have affected traits beyond those detected in this study. I am quite curious to see whether the striking efflorescence of cultural achievement in Britain over the last half millennium could have (at least in part) a genetic underpinning.

Depigmentation is a trait whose genetic architecture is as well understood as any. The results of this study might surprise writers of decades and centuries past who supposed that the spectrum of pigmentation of modern Europeans was the result of admixture, in varying measure, between Xanthochrooi and Melanchrooi races of primordial antiquity. All indications seem to be that depigmentation of hair, skin, and eyes did not co-occur in such a hypothetical race, but rather arose in different parts of the Caucasoid range, only reaching a high combined frequency in northern Europe to form the physical type that is distinctive of the natives of that region. It would be quite interesting to see how these traits evolved in Fennoscandia and the Baltic, regions that sport even higher depigmentation than the British Isles. Traditionally, these areas were viewed as refuges of the Xanthochrooi, but it may very well turn out that for whatever reason selection has acted in that area as well, as it did in the Eastern European plain where rather dark Bronze Age steppe groups gave way to rather light-pigmented living eastern Slavs.

bioRxiv doi: http://dx.doi.org/10.1101/052084

Detection of human adaptation during the past 2,000 years

Yair Field, Evan A Boyle, Natalie Telis, Ziyue Gao, Kyle J Gaulton, David Golan, Loic Yengo, Ghislain Rocheleau, Philippe Froguel, Mark I McCarthy, Jonathan K Pritchard

Detection of recent natural selection is a challenging problem in population genetics, as standard methods generally integrate over long timescales. Here we introduce the Singleton Density Score (SDS), a powerful measure to infer very recent changes in allele frequencies from contemporary genome sequences. When applied to data from the UK10K Project, SDS reflects allele frequency changes in the ancestors of modern Britons during the past 2,000 years. We see strong signals of selection at lactase and HLA, and in favor of blond hair and blue eyes. Turning to signals of polygenic adaptation we find, remarkably, that recent selection for increased height has driven allele frequency shifts across most of the genome. Moreover, we report suggestive new evidence for polygenic shifts affecting many other complex traits. Our results suggest that polygenic adaptation has played a pervasive role in shaping genotypic and phenotypic variation in modern humans.

Link

May 02, 2016

Neandertal ancestry, going, going, ..., gone (?)

A deluge of new data from Upper Paleolithic Europe will give us all a lot to think about. It is incredible that Neandertal ancestry seems to have decreased over time in Europe (Oase1 is off-cline with lots of extra Neandertal ancestry from a recent genealogical Neandertal in the family tree). The functional form of the decrease seems pretty well approximated as linear.

The authors write:

Using one statistic, we estimate a decline from 4.3–5.7% from a time shortly after introgression to 1.1–2.2% in Eurasians today (Fig. 2).

This is remarkable because it shows that most of the Neandertal ancestry of the earliest AMH in Europe was gone by the Mesolithic. It really seems that Neandertal genes were bred out of the gene pool over time. Will this trend continue into the future? Perhaps only minute traces of Neandertal DNA will remain in humans in 10,000 more years. Some of Neandertal DNA may yet prove to be neutral or beneficial, so at the limit the percentage may be more than zero. Nonetheless, the historical trend does suggest that modern humans inherited mostly genetic garbage from Neandertals and evolution is more than halfway through the process of getting rid of it.
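The "more than halfway" remark can be checked against the quoted ranges with a little arithmetic. The Python sketch below is my own rough check, not a calculation from the paper. Pairing interval endpoints as if they were independent is a crude simplification, and the helper names are invented for illustration.

```python
# Ranges quoted from the paper (percent Neandertal ancestry).
EARLY = (4.3, 5.7)   # shortly after introgression
TODAY = (1.1, 2.2)   # present-day Eurasians

def fraction_removed(early, today):
    """Share of the initial Neandertal ancestry that is no longer present."""
    return 1 - today / early

def midpoint(bounds):
    """Centre of a (low, high) interval."""
    return sum(bounds) / 2

# Least and most removal compatible with the quoted ranges.
least = fraction_removed(EARLY[0], TODAY[1])   # lowest start, highest end
most = fraction_removed(EARLY[1], TODAY[0])    # highest start, lowest end
central = fraction_removed(midpoint(EARLY), midpoint(TODAY))

print(f"removed: {least:.0%} to {most:.0%} (midpoints: {central:.0%})")
```

At the interval midpoints about two thirds of the original Neandertal ancestry is gone, in line with "more than halfway"; the most conservative pairing of endpoints gives just under half, and the most generous about four fifths.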

As a corollary, there may have been other episodes of archaic admixture that are no longer detectable. Perhaps our modern human lineage has repeatedly admixed with other species, but traces of those admixtures are long gone by the action of natural selection. The reason for our relative homogeneity as a species may not be that we avoided intermixing with others, but that, sadly, most others had not much that was beneficial to offer to our ancestors.

Nature (2016) doi:10.1038/nature17993

The genetic history of Ice Age Europe

Qiaomei Fu et al.

Modern humans arrived in Europe ~45,000 years ago, but little is known about their genetic composition before the start of farming ~8,500 years ago. Here we analyse genome-wide data from 51 Eurasians from ~45,000–7,000 years ago. Over this time, the proportion of Neanderthal DNA decreased from 3–6% to around 2%, consistent with natural selection against Neanderthal variants in modern humans. Whereas there is no evidence of the earliest modern humans in Europe contributing to the genetic composition of present-day Europeans, all individuals between ~37,000 and ~14,000 years ago descended from a single founder population which forms part of the ancestry of present-day Europeans. An ~35,000-year-old individual from northwest Europe represents an early branch of this founder population which was then displaced across a broad region, before reappearing in southwest Europe at the height of the last Ice Age ~19,000 years ago. During the major warming period after ~14,000 years ago, a genetic component related to present-day Near Easterners became widespread in Europe. These results document how population turnover and migration have been recurring themes of European prehistory.

Link

March 20, 2016

Adaptation in the light of ancient genomes

Nature Communications 7, Article number: 10775 doi:10.1038/ncomms10775

Human adaptation and population differentiation in the light of ancient genomes

Felix M. Key, Qiaomei Fu, Frédéric Romagné, Michael Lachmann and Aida M. Andrés

The influence of positive selection sweeps in human evolution is increasingly debated, although our ability to detect them is hampered by inherent uncertainties in the timing of past events. Ancient genomes provide snapshots of allele frequencies in the past and can help address this question. We combine modern and ancient genomic data in a simple statistic (DAnc) to time allele frequency changes, and investigate the role of drift and adaptation in population differentiation. Only 30% of the most strongly differentiated alleles between Africans and Eurasians changed in frequency during the colonization of Eurasia, but in Europe these alleles are enriched in genic and putatively functional alleles to an extent only compatible with local adaptation. Adaptive alleles—especially those associated with pigmentation—are mostly of hunter-gatherer origin, although lactose persistence arose in a haplotype present in farmers. These results provide evidence for a role of local adaptation in human population differentiation.

Link

November 04, 2015

Selection against Neandertal deleterious alleles

Sampled Neandertals (from Europe, the Caucasus, and Siberia) certainly had lower effective population size than living humans, but I wonder what the comparison would be between ancient tribes of modern humans and Neandertals in the Near East where admixture presumably took place.

doi: http://dx.doi.org/10.1101/030387

The Genetic Cost of Neanderthal Introgression

Kelley Harris, Rasmus Nielsen

Approximately 2-4% of the human genome is in non-Africans comprised of DNA introgressed from Neanderthals. Recent studies have shown that there is a paucity of introgressed DNA around functional regions, presumably caused by selection after introgression. This observation has been suggested to be a possible consequence of the accumulation of a large amount of Dobzhansky-Muller incompatibilities, i.e. epistatic effects between human and Neanderthal specific mutations, since the divergence of humans and Neanderthals approx. 400-600 kya. However, using previously published estimates of inbreeding in Neanderthals, and of the distribution of fitness effects from human protein coding genes, we show that the average Neanderthal would have had at least 40% lower fitness than the average human due to higher levels of inbreeding and an increased mutational load, regardless of the dominance coefficients of new mutations. Using simulations, we show that under the assumption of additive dominance effects, early Neanderthal/human hybrids would have experienced strong negative selection, though not so strong that it would prevent Neanderthal DNA from entering the human population. In fact, the increased mutational load in Neanderthals predicts the observed reduction in Neanderthal introgressed segments around protein coding genes, without any need to invoke epistasis. The simulations also predict that there is a residual Neanderthal derived mutational load in non-African humans, leading to an average fitness reduction of at least 0.5%. Although there has been much previous debate about the effects of the out-of-Africa bottleneck on mutational loads in non-Africans, the significant deleterious effects of Neanderthal introgression have hitherto been left out of this discussion, but might be just as important for understanding fitness differences among human populations.
We also show that if deleterious mutations are recessive, the Neanderthal admixture fraction would gradually increase over time due to selection for Neanderthal haplotypes that mask human deleterious mutations in the heterozygous state. This effect of dominance heterosis might partially explain why adaptive introgression appears to be widespread in nature.

Link

doi: http://dx.doi.org/10.1101/030148

The Strength of Selection Against Neanderthal Introgression

Ivan Juric, Simon Aeschbacher, Graham Coop

Hybridization between humans and Neanderthals has resulted in a low level of Neanderthal ancestry scattered across the genomes of many modern-day humans. After hybridization, on average, selection appears to have removed Neanderthal alleles from the human population. Quantifying the strength and causes of this selection against Neanderthal ancestry is key to understanding our relationship to Neanderthals and, more broadly, how populations remain distinct after secondary contact. Here, we develop a novel method for estimating the genome-wide average strength of selection and the density of selected sites using estimates of Neanderthal allele frequency along the genomes of modern-day humans. We confirm that East Asians had somewhat higher initial levels of Neanderthal ancestry than Europeans even after accounting for selection. We find that there are systematically lower levels of initial introgression on the X chromosome, a finding consistent with a strong sex bias in the initial matings between the populations. We find that the bulk of purifying selection against Neanderthal ancestry is best understood as acting on many weakly deleterious alleles. We propose that the majority of these alleles were effectively neutral, and segregating at high frequency, in Neanderthals, but became selected against after entering human populations of much larger effective size. While individually of small effect, these alleles potentially imposed a heavy genetic load on the early-generation human-Neanderthal hybrids. This work suggests that differences in effective population size may play a far more important role in shaping levels of introgression than previously thought.

Link

April 21, 2015

PCA and natural selection

arXiv:1504.04543 [q-bio.PE]

Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data

Nicolas Duforet-Frebourg et al.

(Submitted on 8 Apr 2015)

Large-scale genomic data offers the perspective to decipher the genetic architecture of natural selection. To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis. We show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components. Looking at the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we consider the 1000 Genomes data (phase 1) after removal of recently admixed individuals resulting in 850 individuals coming from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 millions obtained with a low-coverage sequencing depth (3X). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and non-coding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). PCA-based statistics retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing project, especially in non-model species for which defining populations can be difficult. Genome scan based on PCA is implemented in the open-source and freely available PCAdapt software.

Link
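The core idea in the abstract above, correlating each variant with a principal component and looking for outliers, can be illustrated with a toy Python sketch. This is not the PCAdapt implementation: the data are simulated, SNP 0 is planted as a hypothetical strongly differentiated variant, only the first component is used, and every function name here is invented. The top component is found by plain power iteration to keep the example dependency-free.

```python
import math
import random

def simulate_genotypes(n_per_pop=40, n_snps=300, seed=1):
    """Two populations; SNP 0 is strongly differentiated, the rest mildly."""
    rng = random.Random(seed)
    freqs = []
    for j in range(n_snps):
        if j == 0:
            freqs.append((0.95, 0.05))        # hypothetical selected variant
        else:
            base = rng.uniform(0.3, 0.7)
            shift = rng.choice((-0.1, 0.1))   # mild drift between populations
            freqs.append((base + shift, base - shift))
    rows = []
    for pop in (0, 1):
        for _ in range(n_per_pop):
            # Genotype = number of copies of the allele (0, 1 or 2).
            rows.append([(rng.random() < f[pop]) + (rng.random() < f[pop])
                         for f in freqs])
    return rows

def center_columns(rows):
    """Subtract each SNP's mean genotype."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[g - m for g, m in zip(row, means)] for row in rows]

def top_pc_scores(x, iterations=50):
    """Power iteration on X X^T: each individual's (unit-scaled) PC1 score."""
    cols = list(zip(*x))
    v = [1.0 / (i + 1) for i in range(len(x))]  # deterministic start vector
    for _ in range(iterations):
        w = [sum(c * s for c, s in zip(col, v)) for col in cols]  # X^T v
        v = [sum(g * a for g, a in zip(row, w)) for row in x]     # X w
        norm = math.sqrt(sum(s * s for s in v))
        v = [s / norm for s in v]
    return v

def squared_correlation(a, b):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return 0.0 if va == 0 or vb == 0 else cov * cov / (va * vb)

genotypes = simulate_genotypes()
centered = center_columns(genotypes)
pc1 = top_pc_scores(centered)
r2 = [squared_correlation(col, pc1) for col in zip(*centered)]
ranked = sorted(range(len(r2)), key=lambda j: r2[j], reverse=True)
print("top candidate SNP indices:", ranked[:5])
```

Note that no population labels are passed to the scan; the component itself recovers the split, which is the appeal of the approach when populations are hard to define in advance.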

bioRxiv http://dx.doi.org/10.1101/018143

Fast principal components analysis reveals independent evolution of ADH1B gene in Europe and East Asia

Kevin J Galinsky et al.

Principal components analysis (PCA) is a widely used tool for inferring population structure and correcting confounding in genetic data. We introduce a new algorithm, FastPCA, that leverages recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using a new test for natural selection based on population differentiation along these PCs, we replicate previously known selected loci and identify three new signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984 has previously been associated to alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents.

Link

March 15, 2015

Natural selection and ancient European DNA

A new preprint on the bioRxiv studies the same data as the recent Haak et al. paper, but focuses on natural selection in Europe. Until recently, selection could only be studied by looking at modern populations, but since selection is genetic change over time effected by the environment, it's possible that studies like this will be the norm in the future.

The new study seems to confirm the results of Wilde et al. on steppe groups, as the Yamnaya had a very low frequency of the HERC2 derived "blue eye" allele and a lower frequency of the SLC45A2 "light skin" allele than any modern Europeans. The Yamnaya seem to have been fixed for the other SLC24A5 "light skin" allele which seems to have been at high frequency in all ancient groups save the "Western Hunter Gatherers".

It seems that light pigmentation traits had already existed in pre-Indo-European Europeans (both farmers and hunter-gatherers), and so long-standing philological attempts to correlate them with the arrival of light-pigmented Indo-Europeans from the steppe (or indeed anywhere), and to contrast them with darker pre-Indo-European inhabitants of Europe, were misguided. If anything, it seems that the "fairest of them all" were the Scandinavian hunter-gatherers, while Neolithic farmers and Western Hunter Gatherers carried light and dark pigmentation traits in various combinations.

It also seems that both the theory that lactose tolerance started with LBK farmers and the theory that it came to Europe from milk-drinking steppe Indo-Europeans were wrong, as this trait seems to be altogether absent in European hunter-gatherers, farmers, and Yamnaya, and makes a very timid appearance in the Late Neolithic/Bronze Age before shooting up in frequency to the present.

Another new development is the ability to predict "genetic height" from ancient DNA. I think this may be a little bit superfluous as you can predict "actual height" by measuring long bone lengths. On the other hand, actualized height depends not only on genetics but also on diet, disease, etc., so it's useful to look at genetic changes in such polygenic traits directly.

A big surprise was the presence of the derived EDAR allele in Swedish hunter-gatherers. This allele is very rare in modern Europeans and seems to have pleiotropic effects in East Asians. This raises the question of why this allele, which was so successful in East Asians, never "took hold" in Europeans. One possibility is that it never provided an advantage to Europeans (I don't think anyone really knows what it's actually good for). Another is that Swedish hunter-gatherers simply didn't contribute much ancestry to modern Europeans and so the allele never got the chance to rise in frequency by much.

bioRxiv http://dx.doi.org/10.1101/016477

Eight thousand years of natural selection in Europe

Iain Mathieson et al.

The arrival of farming in Europe beginning around 8,500 years ago required adaptation to new environments, pathogens, diets, and social organizations. While evidence of natural selection can be revealed by studying patterns of genetic variation in present-day people, these patterns are only indirect echoes of past events, and provide little information about where and when selection occurred. Ancient DNA makes it possible to examine populations as they were before, during and after adaptation events, and thus to reveal the tempo and mode of selection. Here we report the first genome-wide scan for selection using ancient DNA, based on 83 human samples from Holocene Europe analyzed at over 300,000 positions. We find five genome-wide signals of selection, at loci associated with diet and pigmentation. Surprisingly in light of suggestions of selection on immune traits associated with the advent of agriculture and denser living conditions, we find no strong sweeps associated with immunological phenotypes. We also report a scan for selection for complex traits, and find two signals of selection on height: for short stature in Iberia after the arrival of agriculture, and for tall stature on the Pontic-Caspian steppe earlier than 5,000 years ago. A surprise is that in Scandinavian hunter-gatherers living around 8,000 years ago, there is a high frequency of the derived allele at the EDAR gene that is the strongest known signal of selection in East Asians and that is thought to have arisen in East Asia. These results document the power of ancient DNA to reveal features of past adaptation that could not be understood from analyses of present-day people.

Link (pdf)

March 11, 2015

Genetic pacification of Western Europeans (?)

Evolutionary Psychology www.epjournal.net – 2015. 13(1): 230-243

Western Europe, State Formation, and Genetic Pacification

Peter Frost, Henry C. Harpending

Through its monopoly on violence, the State tends to pacify social relations. Such pacification proceeded slowly in Western Europe between the 5th and 11th centuries, being hindered by the rudimentary nature of law enforcement, the belief in a man’s right to settle personal disputes as he saw fit, and the Church’s opposition to the death penalty. These hindrances began to dissolve in the 11th century with a consensus by Church and State that the wicked should be punished so that the good may live in peace. Courts imposed the death penalty more and more often and, by the late Middle Ages, were condemning to death between 0.5 and 1.0% of all men of each generation, with perhaps just as many offenders dying at the scene of the crime or in prison while awaiting trial. Meanwhile, the homicide rate plummeted from the 14th century to the 20th. The pool of violent men dried up until most murders occurred under conditions of jealousy, intoxication, or extreme stress. The decline in personal violence is usually attributed to harsher punishment and the longer-term effects of cultural conditioning. It may also be, however, that this new cultural environment selected against propensities for violence.

Link (pdf)

February 13, 2015

Why do East Asians have more Neandertal ancestry than Europeans?

This is quite the paradox, because even though Neandertals are now known to have existed all the way to the Altai, they were still overall a West Eurasian-distributed species. As far as I can tell, three explanations have been proposed: (1) East Asians have at least one extra Neandertal admixture event on top of what all Eurasians share, (2) West Eurasians have at least one admixture event that reduces their Neandertal ancestry relative to what all Eurasians share, (3) Neither of them have any such events, but natural selection has acted to reduce Neandertal alleles more in Europeans than East Asians.

I don't really have an opinion on this highly technical subject, but a couple of papers in AJHG make the case for (1 or 2) and (1), respectively, and against (3).

AJHG doi:10.1016/j.ajhg.2015.01.006

Complex History of Admixture between Modern Humans and Neandertals

Benjamin Vernot, Joshua M. Akey

Recent analyses have found that a substantial amount of the Neandertal genome persists in the genomes of contemporary non-African individuals. East Asians have, on average, higher levels of Neandertal ancestry than do Europeans, which might be due to differences in the efficiency of purifying selection, an additional pulse of introgression into East Asians, or other unexplored scenarios. To better define the scope of plausible models of archaic admixture between Neandertals and anatomically modern humans, we analyzed patterns of introgressed sequence in whole-genome data of 379 Europeans and 286 East Asians. We found that inferences of demographic history restricted to neutrally evolving genomic regions allowed a simple one-pulse model to be robustly rejected, suggesting that differences in selection cannot explain the differences in Neandertal ancestry. We show that two additional demographic models, involving either a second pulse of Neandertal gene flow into the ancestors of East Asians or a dilution of Neandertal lineages in Europeans by admixture with an unknown ancestral population, are consistent with the data. Thus, the history of admixture between modern humans and Neandertals is most likely more complex than previously thought.

Link

AJHG doi:10.1016/j.ajhg.2014.12.029

Selection and Reduced Population Size Cannot Explain Higher Amounts of Neandertal Ancestry in East Asian than in European Human Populations

Bernard Y. Kim, Kirk E. Lohmueller

It has been hypothesized that the greater proportion of Neandertal ancestry in East Asians than in Europeans is due to the fact that purifying selection is less effective at removing weakly deleterious Neandertal alleles from East Asian populations. Using simulations of a broad range of models of selection and demography, we have shown that this hypothesis cannot account for the higher proportion of Neandertal ancestry in East Asians than in Europeans. Instead, more complex demographic scenarios, most likely involving multiple pulses of Neandertal admixture, are required to explain the data.

Link

September 10, 2014

ASHG 2014 titles and abstracts

Some interesting titles from the ASHG 2014 conference.

UPDATE: I have added the abstracts.

The human X chromosome is the target of megabase wide selective sweeps associated with multi-copy genes expressed in male meiosis and involved in reproductive isolation. M. H. Schierup, K. Munch, K. Nam, T. Mailund, J. Y. Dutheil.

The X chromosome differs from the autosomes in its hemizygosity in males and in its intimate relationship with the very different Y chromosome. It has a different gene content than autosomes and undergoes specific processes such as meiotic sex chromosome inactivation (MSCI) and XY body formation. Previous studies have shown that natural selection is more efficient against deleterious mutations and, in chimpanzee, that positive selection is prevalent. We show that in all great ape species, megabase wide regions of the X chromosome have severely reduced diversity (by more than 80%). These regions are partly shared among species and indicate a large number of strong selective sweeps that have occurred independently on the same set of targets in different great ape species. We use simulations and deterministic calculations to show that background selection or soft selective sweeps are unlikely to be responsible. The regions also bear all the hallmarks of selective sweeps such as an increased proportion of singletons and higher divergence among closely related populations. Human populations are differently affected, suggesting that a large fraction of sweeps are private to specific human populations. The regions of reduced diversity correlate strongly with the position of X-ampliconic regions, which are 100-500 kb regions containing multiple copies of genes that are solely expressed during male meiosis. We propose that the genes in these regions escape MSCI and participate in an intragenomic conflict with regions of similar function on the Y chromosome for transmission of sex chromosomes to the next generation, i.e. sex chromosome meiotic drive. Recent results from Neanderthal introgression into humans point to the same regions as showing no introgression, consistent with the above process leading to reproductive isolation.
Strikingly, the same regions of the X also show much reduced divergence between human and chimpanzee, suggesting either that this speciation process was indeed complex or that the same regions were under strong selection in the human chimpanzee ancestor.

New insights on human de novo mutation rate and parental age. W. S. W. Wong, B. Solomon, D. Bodian, D. Thach, R. Iyer, J. Vockley, J. Niederhuber.

Germline mutations have a major role to play in evolution. Much attention has been given to studying the pattern and rate of human mutations using biochemical or phylogenetic methods based on closely related species. Massively parallel sequencing technologies have given scientists the opportunity to study directly measured de novo mutations (DNMs) at an unprecedented scale. Here we report the largest study (to our knowledge) of de novo point mutations in humans, in which we used whole genome deep sequencing (~60x) data from 605 family trios (father, mother and newborn). These trios represent the first group of approximately 2,700 trios who have undergone whole-genome sequencing (WGS) through our pediatric-based WGS research studies. The fathers' ages range from 17 to 63 years and the mothers' ages range from 17 to 43 years. We identified over 23,000 DNMs (~40 per newborn) in the autosomal chromosomes using a customized pipeline and infer that the mutation rate per basepair is around 1.2x10^-8 per generation, well within the reported range in previous studies. We were also able to confirm that the total number of DNMs in the newborn was directly proportional to the paternal age (P less than 2x10^-16). Maternal age is shown to have a small but significant positive effect on the number of DNMs passed on to the offspring (P = 0.003), even after accounting for the paternal age. This contradicts the prior dogma that maternal age only has an effect on chromosomal abnormalities related to nondisjunction events. Furthermore, 5% (22 total) of newborns in the analyzed group were conceived with assisted reproductive technologies (ARTs), and these infants have on average 5 more DNMs (bias-corrected and accelerated bootstrap 95% confidence interval, 1.24 to 8.00) than those conceived naturally, after controlling for both parents' ages. Both parents' ages remain significant as independently correlated with DNMs even after the families that used ARTs were removed from the analysis.
Our study enhances current knowledge related to the human germline mutational rates.

Alignment to an ancestry specific reference genome discovers additional variants among 1000 Genomes ASW Cohort. R. A. Neff, J. Vargas, G. H. Gibbons, A. R. Davis.

Whole genome sequencing studies across certain populations, such as those with African ancestry, are often underpowered due to a larger divergence between the common reference genome and the true genetic sequence of the population. However, a common reference genome is not designed to account for this divergence in population-specific studies. Strong signals from common (MAF>50%) single nucleotide polymorphisms (SNPs), insertion-deletions (indels), and structural variants (SVs) can make alignment and variant calling difficult by masking nearby variants with weaker genetic signals. We present the results generated from alignment to an African descent population-specific reference genome by applying variants present in a majority of individuals of African descent from all phases of the 1000 Genomes Project and the International HapMap Consortium. We identified 882,826 single nucleotide polymorphisms, short insertion-deletion events, and large structural variations present at MAF>50% in the population, representing 2.39 MB of genetic variation changed from hg19. We demonstrate that utilization of a population-specific reference improves variant call quality, coverage level, and imputation accuracy. We compared alignment of 27 African-American SW population (ASW) samples from the 1000 Genomes Phase 1 project between the population-specific and the hg19 reference. We discovered an additional 443,036 SNPs by alignment to the population-specific reference in union across all samples, including thousands of exonic variants that are non-synonymous and are clinically relevant to the study of disease.
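
The core idea, swapping in the allele carried by a majority of the population wherever it differs from the common reference, can be illustrated with a minimal sketch. It handles SNPs only (the abstract also covers indels and structural variants), and the function name, tuple layout, and toy sequence are hypothetical rather than taken from the authors' pipeline.

```python
def apply_majority_alleles(reference, snps, threshold=0.5):
    """Return a copy of `reference` with majority alternate alleles applied.

    `snps` is an iterable of (position, ref_base, alt_base, alt_frequency)
    tuples with 0-based positions; this layout is an assumption of the sketch.
    """
    sequence = list(reference)
    for position, ref_base, alt_base, alt_frequency in snps:
        if sequence[position] != ref_base:
            raise ValueError(f"reference mismatch at position {position}")
        # Only alleles carried by more than `threshold` of the population
        # replace the reference base.
        if alt_frequency > threshold:
            sequence[position] = alt_base
    return "".join(sequence)


# Toy data, invented for illustration only.
toy_reference = "ACGTACGT"
toy_snps = [(1, "C", "T", 0.82), (4, "A", "G", 0.31), (6, "G", "A", 0.55)]
population_reference = apply_majority_alleles(toy_reference, toy_snps)
```

Reads from the population would then be aligned against `population_reference` instead of the original sequence, so that a majority allele no longer counts as a mismatch.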

Using compressed data structures to capture variation in thousands of human genomes. S. A. McCarthy, Z. Lui, J. T. Simpson, Z. Iqbal, T. M. Keane, R. Durbin.

Currently the most widely used approach to catalogue variation amongst a set of samples is to align the sequencing reads to a single linear reference genome. This principle has been at the core of the 1000 Genomes data processing pipeline since the pilot phase of the project. However, there is now an increased awareness of the limitations of this approach, such as alignment artefacts, reference bias and unobserved variation on non-reference haplotypes. The Burrows-Wheeler transform and FM-index are compact data structures that have been successfully used in sequence alignment and assembly. One of the key features of these structures is that they are a searchable and reference-free representation of the raw sequencing reads. Our project aims to build a web server based on BWT data structures containing all the reads from many thousands of samples so as to efficiently retrieve matching reads and information about samples and populations. Enticingly, it is expected that data storage for this system would plateau as we collect more data since most new sequencing reads will have already been observed. We expect this to enable powerful new ways to query variation data from thousands of individuals. For the first phase of this project, we include all 87 Tbp of the low-coverage and exome data from the 2,535 samples in 1000 Genomes Phase 3. We envisage this would provide a means for researchers to easily check the prevalence of any human sequence in a control set of thousands of putatively healthy samples. We present our approaches and initial benchmarks on variant sensitivity and specificity against truth datasets and explore several applications for these structures such as validation of short insertion/deletion and structural variant calls, and rapid searching for traces of viral DNA.
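
To make the data structures named above concrete, here is a small sketch of a Burrows-Wheeler transform with FM-index style backward search that counts how often a query occurs. It indexes a single toy string with a naive suffix sort, whereas the project described indexes a very large read collection; all function names are hypothetical.

```python
from collections import Counter


def build_bwt(text):
    """Return the BWT of text + '$' using a naive suffix sort (toy sizes only)."""
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # s[i - 1] wraps around to '$' when i == 0.
    return "".join(s[i - 1] for i in suffix_array)


def build_fm_index(bwt):
    """Return (first, occ): first[c] counts characters smaller than c,
    and occ[c][i] counts occurrences of c in bwt[:i]."""
    counts = Counter(bwt)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_occurrences(pattern, bwt, first, occ):
    """Count matches of `pattern` by backward search over the BWT."""
    lo, hi = 0, len(bwt)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("GATTACAGATTACA")
first, occ = build_fm_index(bwt)
hits = count_occurrences("ATTA", bwt, first, occ)
```

The query cost depends on the pattern length rather than on the size of the indexed text, which is what makes this kind of index attractive for checking the prevalence of a sequence across many samples.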

Second-generation PLINK: Rising to the challenge of larger and richer datasets. C. C. Chang, C. C. Chow, L. C. A. M. Tellier, S. Vattikuti, S. M. Purcell, J. J. Lee.

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
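
As a rough illustration of what bit-level parallelism means here, the sketch below packs genotypes into bit vectors so that one bitwise operation processes many samples at once. It uses Python integers as arbitrary-length bit vectors and an invented two-mask layout; it is not PLINK's actual file encoding or code.

```python
def pack_genotypes(genotypes):
    """Pack per-sample alternate-allele counts (0, 1 or 2) into two bit masks.

    Bit i of `het` is set when sample i is heterozygous, and bit i of `hom`
    when sample i is homozygous for the alternate allele. This layout is an
    assumption made for the sketch.
    """
    het = hom = 0
    for i, genotype in enumerate(genotypes):
        if genotype == 1:
            het |= 1 << i
        elif genotype == 2:
            hom |= 1 << i
    return het, hom


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(packed):
    """Total alternate alleles at a variant: one per het, two per hom."""
    het, hom = packed
    return popcount(het) + 2 * popcount(hom)


def shared_carriers(packed_a, packed_b):
    """Samples carrying an alternate allele at both variants, via one AND."""
    carriers_a = packed_a[0] | packed_a[1]
    carriers_b = packed_b[0] | packed_b[1]
    return popcount(carriers_a & carriers_b)


variant_a = pack_genotypes([0, 1, 2, 2, 0, 1])
variant_b = pack_genotypes([2, 0, 1, 0, 0, 1])
total_alt = alt_allele_count(variant_a)
both = shared_carriers(variant_a, variant_b)
```

In a compiled implementation the same pattern maps onto machine words and hardware popcount instructions, so a single instruction handles dozens of samples.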

Integrated analysis of protein-coding variation in over 90,000 individuals from exome sequencing data. D. G. MacArthur, M. Lek, E. Banks, R. Poplin, T. Fennell, K. Samocha, B. Thomas, K. Karczewski, S. Purcell, P. Sullivan, S. Kathiresan, M. I. McCarthy, M. Boehnke, S. Gabriel, D. M. Altshuler, G. Getz, M. J. Daly, Exome Aggregation Consortium.

Exploring genetic variation and genotypes among millions of genomes. R. M. Layer, A. R. Quinlan.

Rare, and thus largely unknown, variants are a major reason that, typically, less than 10% of the heritability of complex diseases currently can be explained by known genetic variation. While increasing the number of sequenced genomes may improve our ability to reveal this “hidden heritability,” the scale of the resulting dataset poses substantial storage and computational demands. Current efforts to sequence 100,000 genomes, and combined efforts that are likely to surpass 1 million genomes will identify hundreds of millions to billions of polymorphic loci. The minimum storage requirement for directly representing the variability found by these projects (1 bit per individual per variant, ignoring the necessary metadata) will range from terabytes to petabytes. Like most big-data problems, a balance must be found between optimizing storage and computational efficiency. For example, while compression can minimize storage by reducing file size, it can also cause inefficient computation since data must be decompressed before it can be analyzed. Conversely, highly structured data can reduce analysis times but typically require extra metadata that increase file size. Current variation storage schemes were not designed to quickly analyze massive datasets and fail to balance these competing goals. We present GENOTQ, an open source API and toolkit that reduces file size and data access time through use of a succinct data structure, a class of data structures that compress data such that operations can be performed without requiring the full decompression. Word aligned hybrid (WAH) bitmap compression is one such data structure that was developed to improve query times for relational databases. Binary values are encoded such that logical operations (AND, OR, NOT) can be performed on the compressed data. This encoding results in file sizes that are 20X smaller than uncompressed versions, and only 50% larger than the compressed version. 
Queries, such as finding shared variants among a subpopulation, are also 21X faster. Furthermore, representing the genotypes in this manner makes our method well suited to both distributed architectures like BigQuery and parallel processors like GPUs. We stress that this method is only part of a larger solution that would incorporate genomic annotations, medical histories, and pedigrees. Incorporating fast genotype queries with this web of metadata will provide a rich information source to both clinicians and researchers.
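
The key property described above, running logical operations directly on compressed bitmaps, can be shown with a simplified sketch. It uses plain run-length encoding as a stand-in, not the word-aligned layout of WAH and not the GENOTQ API; the function names and toy bit vectors are hypothetical.

```python
def rle_encode(bits):
    """Compress a list of 0/1 values into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(run) for run in runs]


def rle_and(a, b):
    """AND two run-length encoded bitmaps of equal length without decoding."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)  # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step
        else:
            out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return [tuple(run) for run in out]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a list of bits."""
    return [bit for bit, length in runs for _ in range(length)]


# Each vector marks which samples carry a given variant (toy data).
carriers_x = rle_encode([1, 1, 1, 0, 0, 0, 0, 1])
carriers_y = rle_encode([1, 0, 1, 1, 0, 0, 1, 1])
carry_both = rle_and(carriers_x, carriers_y)
```

The work done by `rle_and` scales with the number of runs rather than the number of samples, which is why long stretches of identical genotypes make such queries cheap.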

Capture of 390,000 SNPs in dozens of ancient central Europeans reveals a population turnover in Europe thousands of years after the advent of farming. I. Lazaridis, W. Haak, N. Patterson, N. Rohland, S. Mallick, B. Llamas, S. Nordenfelt, E. Harney, A. Cooper, K. W. Alt, D. Reich.

To understand the population transformations that took place in Europe since the early Neolithic, we used a DNA capture technique to obtain reads covering ~390 thousand single nucleotide polymorphisms (SNPs) from a number of different archaeological cultures of central Europe (Germany and Hungary). The samples spanned the time period from 7,500 BP to 3,500 BP (Early Neolithic to Early Bronze Age periods) and most of them were previously studied using mtDNA (Brandt, Haak et al., Science, 2013). The captured SNPs include about 360,000 SNPs from the Affymetrix Human Origins Array that were discovered in African individuals, as well as about 30,000 SNPs chosen for other reasons (that are thought to have been affected by natural selection, or to have phenotypic effects, or are useful in determining Y-chromosome haplogroups). By analyzing this data together with a dataset of 2,345 present-day humans and other published ancient genomes, we show that late Neolithic inhabitants of central Europe belonging to the Corded Ware culture were not a continuation of the earlier occupants of the region. Our results highlight the importance of migration and major population turnover in Europe long after the arrival of farming. * Contributed equally to this work.

Insights into British and European population history from ancient DNA sequencing of Iron Age and Anglo-Saxon samples from Hinxton, England. S. Schiffels, W. Haak, B. Llamas, E. Popescu, L. Loe, R. Clarke, A. Lyons, P. Paajanen, D. Sayer, R. Mortimer, C. Tyler-Smith, A. Cooper, R. Durbin.

British population history is shaped by a complex series of repeated immigration periods and associated changes in population structure. It is an open question, however, to what extent each of these changes is reflected in the genetic ancestry of the current British population. Here we use ancient DNA sequencing to help address that question. We present whole genome sequences generated from five individuals that were found in archaeological excavations at the Wellcome Trust Genome Campus near Cambridge (UK), two of which are dated to around 2,000 years before present (Iron Age), and three to around 1,300 years before present (Anglo-Saxon period). Good preservation status allowed us to generate one high coverage sequence (12x) from an Iron Age individual, and four low coverage sequences (1x-4x) from the other samples. By providing the first ancient whole genome sequences from Britain, we get a unique picture of the ancestral populations in Britain before and after the Anglo-Saxon immigrations. We use modern genetic reference panels such as the 1000 Genomes Project to examine the relationship of these ancient samples with present day population genetic data. Results from principal component analysis suggest that all samples fall consistently within the broader Northern European context, which is also consistent with mtDNA haplogroups. In addition, we obtain a finer structural genetic classification from rare genetic variants and haplotype based methods such as FineStructure. Reflecting more recent genetic ancestry, results from these methods suggest significant differences between the Iron Age and the Anglo-Saxon period samples when compared to other European samples. We find in particular that while the Anglo-Saxon samples resemble more closely the modern British population than the earlier samples, the Iron Age samples share more low frequency variation than the later ones with present day samples from southern Europe, in particular Spain (1000GP IBS).
In addition the Anglo-Saxon period samples appear to share a stronger older component with Finnish (1000GP FIN) individuals. Our findings help characterize the ancestral European populations involved in major European migration movements into Britain in the last 2,000 years and thus provide more insights into the genetic history of people in northern Europe.

Fine-scale population structure in Europe. S. Leslie, G. Hellenthal, S. Myers, P. Donnelly, International Multiple Sclerosis Genetics Consortium.

There is considerable interest in detecting and interpreting fine-scale population structure in Europe: as a signature of major events in the history of the populations of Europe, and because of the effect undetected population structure may have on disease association studies. Population structure appears to have been a minor concern for most of the recent generation of genome-wide association studies, but is likely to be important for the next generation of studies seeking associations to rare variants. Thus far, genetic studies across Europe have been limited to a small number of markers, or to methods that do not specifically account for the correlation structure in the genome due to linkage disequilibrium. Consequently, these studies were unable to group samples into clusters of similar ancestry on a fine (within country) scale with any confidence. We describe an analysis of fine-scale population structure using genome-wide SNP data on 6,209 individuals, sampled mostly from Western Europe. Using a recently published clustering algorithm (fineSTRUCTURE), adapted for specific aspects of our analysis, the samples were clustered purely as a function of genetic similarity, without reference to their known sampling locations. When plotted on a map of Europe one observes a striking association between the inferred clusters and geography. Interestingly, for the most part modern country boundaries are significant i.e. we see clear evidence of clusters that exclusively contain samples from a single country. At a high level we see: the Finns are the most differentiated from the rest of Europe (as might be expected); a clear divide between Sweden/Norway and the rest of Europe (including Denmark); and an obvious distinction between southern and northern Europe. We also observe considerable structure within countries on a hitherto unseen fine-scale - for example genetically distinct groups are detected along the coast of Norway. 
Using novel techniques we perform further analyses to examine the genetic relationships between the inferred clusters. We interpret our results with respect to geographic and linguistic divisions, as well as the historical and archaeological record. We believe this is the largest detailed analysis of very fine-scale human genetic structure and its origin within Europe. Crucial to these findings has been an approach to analysis that accounts for linkage disequilibrium.

The population structure and demographic history of Sardinia in relationship to neighboring populations. J. Novembre, C. Chiang, J. Marcus, C. Sidore, M. Zoledziewska, M. Steri, H. Al-asadi, G. Abecasis, D. Schlessinger, F. Cucca.

Numerous studies have made clear that Sardinian populations are relatively isolated genetically from other populations of the Mediterranean, and more recently, intriguing connections between Sardinian ancestry and early Neolithic ancient DNA samples have been made. In this study, we analyze a whole-genome low-coverage sequencing dataset from 2120 Sardinians to more fully characterize patterns of genetic diversity in Sardinia. The study contains one subsample that contains individuals from across Sardinia and a second subsample that samples 4 villages from the more isolated Ogliastra region. We also merge the data with published reference data from Europe and North Africa. Overall Fst values of Sardinia to other European populations are low (less than 0.015); however using a novel method for visualizing genetic differentiation on a geographic map, we formally show how Sardinia is more differentiated than would be expected given its geographic distance from the mainland, consistent with periods of isolation. Applications of the software Admixture show how Sardinia populations differ in the levels of recent admixture with mainland European populations and that there are only minor contributions from North African populations to Sardinian ancestry. Notably the Sardinians from Ogliastra contain a distinct genetic cluster with minimal evidence of recent admixture with mainland Europe. We found frequency-based f3 tests and the tree-based algorithm Treemix both also show minimal evidence of recent admixture. Given the relative isolation, one might expect to see a unique demographic history from neighboring populations. Using coalescent-based approaches, we find Sardinian populations have had more constant effective sizes over the past several thousand years than mainland European populations, which typically show evidence for rapid growth trajectories in the recent past. 
This unique demographic history has consequences for the abundance of putatively damaging and deleterious variants, and we use our data to address the prediction that the genetic architecture of disease traits is expected to involve fewer loci with a greater proportion of variants at common frequencies in Sardinia.

Population structure in African-Americans. S. Gravel, M. Barakatt, B. Maples, M. Aldrich, E. E. Kenny, C. D. Bustamante, S. Baharian.

We present a detailed population genetic study of 4 African-American cohorts comprising over 6000 genotyped individuals across US urban and rural communities: two nation-wide longitudinal cohorts, one biobank cohort, and the 1000 Genomes ASW cohort. Ancestry analysis reveals a uniform breakdown of continental ancestry proportions across regions and urban/rural status, with 79% African, 19% European, and 1.5% Native American/Asian ancestries, with substantial between-individual variation. The Native Ancestry proportion is higher than previous estimates and is maintained after self-identified Hispanics and individuals with substantial inferred Spanish ancestry are removed. This strongly supports direct admixture between Native Americans and African Americans on US territory, and linkage patterns suggest contact early after African-American arrival to the Americas. Local ancestry patterns and variation in ancestry proportions across individuals are broadly consistent with a single African-American population model with early Native American admixture and ongoing European gene flow in the South. The size and broad geographic sampling of our cohorts enables detailed analysis of the geographic and cultural determinants of finer-scale population structure. Recent Identity-by-descent analysis reveals fine-scale structure consistent with the routes used during slavery and in the great African-American migrations of the twentieth century: east-to-west migrations in the south, and distinct south-to-north migrations into New England and the Midwest. These migrations follow transit routes available at the time, and are in stark contrast with European-American relatedness patterns.

Genetic testing of 400,000 individuals reveals the geography of ancestry in the United States. Y. Wang, J. M. Granka, J. K. Byrnes, M. J. Barber, K. Noto, R. E. Curtis, N. M. Natalie, C. A. Ball, K. G. Chahine.

The population of the United States is formed by the interplay of immigration, migration and admixture. Recent research (R. Sebro et al., ASHG 2013) has shed light on the U.S. demography by studying the self-reported ethnicity from the 2010 U.S. Census. However, self-reported ethnicity may not accurately represent true genetic ancestry and may therefore introduce unknown biases. Since launching its DNA service in May 2012, AncestryDNA has genotyped over 400,000 individuals from the United States. Leveraging this huge volume of DNA data, we conducted a large-scale survey of the ancestry of the United States. We predicted genetic ethnicity for each individual, relying on a rigorously curated reference panel of 3,000 single-origin individuals. Combining that with birth locations, we explored how various ethnicities are distributed across the United States. Our results reveal a distinct spatial distribution for each ethnicity. For example, we found that individuals from Massachusetts have the highest proportion of Irish genetic ancestry and individuals from New York have the highest proportion of Southern European genetic ancestry, indicating their unique immigration and migration histories. We also performed pairwise IBD analysis on the entire sample set and identified over 300 million shared genomic segments among all 400,000 individuals. From this data, we calculated the average amount of sharing for pairs of individuals born within the same state or from two different states. In general, we found the genetic sharing decreases as the geographic distance between two states increases. However, the pattern also varies substantially among the 50 states. In summary, our analysis has provided significant insight on the biogeographic patterns of the ancestry in the United States.

Statistical inference of archaic introgression and natural selection in Central African Pygmies. P. Hsieh, J. D. Wall, J. Lachance, S. A. Tishkoff, R. N. Gutenkunst, M. F. Hammer.

Recent evidence from ancient DNA studies suggests that genetic material introgressed from archaic forms of Homo, such as Neanderthals and Denisovans, into the ancestors of contemporary non-African populations. These findings also imply that hybridization may have given rise to some of the adaptive novelties in anatomically modern humans (AMH) as they expanded from Africa into various ecological niches in Eurasia. Within Africa, fossil evidence suggests that AMH and a variety of archaic forms coexisted for much of the last 200,000 years. Here we present preliminary results leveraging high quality whole-genome data (>60X coverage) for three contemporary sub-Saharan African populations (Biaka, Baka, and Yoruba) from Central and West Africa to test for archaic admixture. With the current lack of African ancient DNA, especially in Central Africa due to its rainforest environment, our statistical inference approach provides an alternative means to understand the complex evolutionary dynamics among groups of the genus Homo. To identify candidate introgressive loci, we scan the genomes of 16 individuals and calculate S*, a summary statistic that was specifically designed by one of us (JDW) to detect archaic admixture. The significance of each candidate is assessed through extensive whole-genome level simulations using demographic parameters estimated by ∂a∂i to obtain a parametric distribution of S* values under the null hypothesis of no archaic introgression. As a complementary approach, top candidates are also examined by an approximate-likelihood computation method. The admixture time for each individual introgressive variant is inferred by estimating the decay of the genetic length of the diverged haplotype as a function of its underlying recombination rate. A neutrality test that controls for demography is performed for each candidate to test the hypothesis that introgressive variants rose to high frequency due to positive directional selection.
Several genomic regions were identified by both selection and introgression scans, and we will discuss the possible genetic and functional properties of these “double-hits”. The present study represents one of the most comprehensive genomic surveys to date for evidence of archaic introgression to anatomically modern humans in Africa.

Inferences about human history and natural selection from 280 complete genome sequences from 135 diverse populations. S. Mallick, D. Reich, Simons Genome Diversity Project Consortium.

The most powerful way to study population history and natural selection is to analyze whole genome sequences, which contain all the variation that exists in each individual. To date, genome-wide studies of history and selection have primarily analyzed data from single nucleotide polymorphism (SNP) arrays which are biased by the choice of which SNPs to include. Alternatively they have analyzed sequence data that have been generated as part of medical genetic studies from populations with large census sizes, and thus do not capture the full scope of human genetic variation. Here we report high quality genome sequences (~40x average) from 280 individuals from 135 worldwide populations, including 45 Africans, 26 Native Americans, 27 Central Asians or Siberians, 46 East Asians, 25 Oceanians, 46 South Asians, and 71 West Eurasians. All samples were sequenced using an identical protocol at the same facility (Illumina Ltd.). We modified standard pipelines to eliminate biases that might confound population genetic studies. We report novel inferences, as well as a high resolution map that shows where archaic ancestry (Neanderthal and Denisovan) is distributed throughout the world. We compare and contrast the genomic landscape of the Denisovan introgression into mainland Eurasians to that in island Southeast Asians. We are making this dataset fully available on Amazon Web Services as a resource to the community, coincident with the American Society of Human Genetics meeting.

Improved haplotype phasing using identity by descent. B. L. Browning, S. R. Browning.

We present a new haplotype phasing method that achieves higher accuracy than existing methods. The method is based on the Beagle haplotype frequency model, but unlike the original Beagle phasing method, the new method incorporates genetic recombination, genotype error, and segments of identity by descent. We compared the new haplotype phasing method to Beagle (r1230) and to SHAPEIT version 2 (r778) using Illumina Human 1M SNP data for chromosome 20. We phased 44 HapMap3 CEU trio offspring together with subsets of Wellcome Trust Case Control Consortium 2 controls (n=650, 1300, 2600, 5200). Phase error was measured at trio offspring genotypes on chromosome 20 that have phase determined by parental genotypes. The SHAPEIT “states” parameter was set at 6400 in order to increase its phasing accuracy. The new haplotype phasing method produced haplotype switch error rates that were 20-25% lower than the error rates for the existing Beagle method and 1-7% lower than the error rates for SHAPEIT. The difference in switch error rates between the new method and SHAPEIT increased with increasing sample size. The new haplotype phasing method will be incorporated into version 4 of the Beagle software package (http://faculty.washington.edu/browning/beagle/beagle.html).
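
The switch error rate used to compare the methods can be computed from a known (for example, trio-derived) phase and an inferred phase. The sketch below uses the common definition, switches between consecutive heterozygous sites divided by the number of opportunities for a switch; the abstract does not spell out its exact formula, so treat this as an assumption, and the function name and toy haplotypes are hypothetical.

```python
def switch_error_rate(true_haplotype, inferred_haplotype):
    """Fraction of adjacent heterozygous-site pairs whose relative phase is wrong.

    Both arguments list the 0/1 allele carried by one haplotype of the same
    individual at the same heterozygous sites, in genomic order.
    """
    if len(true_haplotype) != len(inferred_haplotype):
        raise ValueError("haplotypes must cover the same sites")
    if len(true_haplotype) < 2:
        raise ValueError("need at least two heterozygous sites")
    # agrees[i] is True when the inferred haplotype matches the truth at site i.
    agrees = [t == p for t, p in zip(true_haplotype, inferred_haplotype)]
    # A switch happens wherever agreement flips between neighbouring sites.
    switches = sum(1 for x, y in zip(agrees, agrees[1:]) if x != y)
    return switches / (len(agrees) - 1)


truth = [0, 1, 0, 1, 1, 0]
inferred = [0, 1, 1, 0, 0, 0]
rate = switch_error_rate(truth, inferred)
```

Because only changes in agreement are counted, a haplotype that is entirely swapped relative to the truth has a switch error rate of zero, which is the intended behaviour: the labels of the two haplotypes are arbitrary.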

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis. E. Y. Durand, N. Eriksson, C. Y. McLean.

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases. A large number of methods to detect IBD segments have been developed recently. However, IBD detection accuracy in non-simulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father-mother-child trios from the 23andMe, Inc. dataset. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false positive rate over 67% for 2-4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. We show that nearly all false positives arise due to allowing switch errors between haplotypes when detecting IBD, a necessity for retrieving long (> 6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that enables detection and filtering of false positive IBD segments on population-scale datasets. HaploScore scores IBD segments proportional to the number of switch errors they contain. Thus, it enables filtering of spurious segments reported due to GERMLINE being overly permissive to imperfect phasing. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative genotyping arrays using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can be readily adapted to improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.

Parente2: A fast and accurate method for detecting identity by descent. S. Bercovici, J. M. Rodriguez, L. Huang, S. Batzoglou.

Identity-by-descent (IBD) inference is the problem of establishing a direct and explicit genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is key to a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of both accurate and efficient IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: given a cohort’s size, a quadratic number of pairwise genome comparisons must be performed, in principle. Therefore, computation time and the false discovery rate can also scale quadratically. To enable practical large-scale IBD detection, we developed Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses an ensemble windowing approach to model complex linkage disequilibrium in the underlying studied population. Parente2 is applied directly on genotype data without the need to phase data prior to IBD inference. Through extensive simulations using real data, we evaluate Parente2’s performance. We show that Parente2 is superior to previous state-of-the-art methods, detecting pairs of related individuals sharing a 4 cM IBD segment with 99.9% sensitivity at a 0.1% false positive rate, and achieving 79.2% sensitivity at a 1% false positive rate for the more challenging case of pairs sharing a 2 cM IBD segment. Additionally, Parente2 is efficient, providing one to two orders of magnitude speedup compared to previous state-of-the-art methods. Parente2 is freely available at http://parente.stanford.edu/.

Fast PCA of very large samples in linear time. K. J. Galinsky, P. Loh, G. Bhatia, S. Georgiev, S. Mukherjee, N. J. Patterson, A. L. Price.

Principal components analysis (PCA) is an effective tool for inferring population structure and correcting for population stratification in genetic data. Traditionally, PCA runs in O(MN²+N³) time, where M is the number of variants and N is the number of samples. Here, we describe a new algorithm, fastpca, for approximating the top K PCs that runs in time O(MNK), making use of recent advances in random low-rank matrix approximation algorithms (Rokhlin et al. 2009). fastpca avoids computing the GRM and associated computational and memory storage costs, enabling PCA of very large datasets on standard hardware. We estimated the top 10 PCs of the WTCCC dataset (16k samples, 101k variants) in roughly 7 minutes while consuming 1GB of RAM, compared to 1 hour and 2.5GB for PLINK2. The fastpca approximation was extremely accurate (r²>99% between all fastpca and PLINK2 PCs). The improvement in running time becomes even larger at larger sample sizes; for example, fastpca estimated the top 10 PCs of a simulated data set with 100k samples and 300k variants in 135 minutes and 8.5GB of RAM, vs. an estimated 350 hours and 85GB of RAM using PLINK2. A recently published O(MN²) time method, flashpca, did not complete on this data set due to exceeding 40GB memory requirement. All of these analyses were based on LD-pruning SNPs with r²>0.2, which leads to much more accurate PCs in simulations as compared to retaining all SNPs; more complex LD-adjustment strategies provide only a small further improvement.
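
The idea of getting the top K PCs without ever forming the N x N GRM can be sketched with plain subspace (block power) iteration. To be clear, this is not the fastpca implementation and not the randomized algorithm of Rokhlin et al.; it is a simpler stand-in that shares the same O(MNK) cost per pass. It assumes the genotype matrix is stored as M rows of N mean-centered values and that K does not exceed its rank. All function names are hypothetical.

```python
import random


def times_X(X, Q):
    """X (M x N) times Q (N x K), giving an M x K matrix."""
    K = len(Q[0])
    return [
        [sum(x * Q[i][k] for i, x in enumerate(row)) for k in range(K)]
        for row in X
    ]


def times_Xt(X, Y):
    """Transpose of X (N x M) times Y (M x K), giving an N x K matrix."""
    N, K = len(X[0]), len(Y[0])
    out = [[0.0] * K for _ in range(N)]
    for row, y in zip(X, Y):
        for i, x in enumerate(row):
            for k in range(K):
                out[i][k] += x * y[k]
    return out


def orthonormalize(Q):
    """Modified Gram-Schmidt on the columns of Q (N x K)."""
    N, K = len(Q), len(Q[0])
    cols = [[Q[i][k] for i in range(N)] for k in range(K)]
    for k in range(K):
        for j in range(k):
            dot = sum(a * b for a, b in zip(cols[k], cols[j]))
            cols[k] = [a - dot * b for a, b in zip(cols[k], cols[j])]
        norm = sum(a * a for a in cols[k]) ** 0.5
        cols[k] = [a / norm for a in cols[k]]
    return [[cols[k][i] for k in range(K)] for i in range(N)]


def top_pcs(X, K, n_iter=50, seed=0):
    """Approximate the top K eigenvectors of X^T X using only products with X."""
    rng = random.Random(seed)
    N = len(X[0])
    Q = orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(N)])
    for _ in range(n_iter):
        # Each pass costs O(MNK); the N x N matrix X^T X is never stored.
        Q = orthonormalize(times_Xt(X, times_X(X, Q)))
    return Q


if __name__ == "__main__":
    # Toy centered matrix: 3 variants (rows) by 4 samples (columns).
    X = [[1.0, 1.0, -1.0, -1.0], [1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 0.0, 0.0]]
    for sample_loadings in top_pcs(X, 2):
        print(sample_loadings)
```

The sign of each PC is arbitrary, so comparisons against another implementation should use absolute correlations, as with the r² figure quoted in the abstract.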

Fast detection of IBD segments associated with quantitative traits in genome-wide association studies. Z. Wang, E. Kang, B. Han, S. Snir, E. Eskin.

Recently, many methods have been developed to detect the identity-by-descent (IBD) segments between a pair of individuals. These methods are able to detect very small shared IBD segments between a pair of individuals up to 2 centimorgans in length. This IBD information can be used to identify recent rare mutations associated with a phenotype of interest. Previous approaches for IBD association were applicable to case/control phenotypes. In this work, we propose a novel and natural statistic for IBD association testing, which can be applied to quantitative traits. A drawback of the statistic is that it requires a large number of permutations to assess the significance of the association, which can be a great computational challenge. We make a connection between the proposed statistic and linear models so that it does not require permutations to assess the significance of an association. In addition, our method can control population structure by utilizing linear mixed models.

Long-range haplotype mapping in Hispanic/Latinos reveals loci for short stature. G. Belbin, D. Ruderfer, K. Slivinski, M.C. Yee, J. Jeff, O. Gottesman, E.A. Stahl, R.J.F. Loos, E.P. Bottinger, E.E. Kenny.

The Hispanic/Latino (HL) population of Northern Manhattan represents a diverse recent diaspora population, with 95% of the individuals reporting having grandparents born outside of the United States. Of these, 43% report grandparents born in Puerto Rico, 23% the Dominican Republic, 13% Central America, and 5%, 4%, and 2% from Mexico, South America, and Europe, respectively. Despite complex patterns of migration, admixture, and diversity, strong signatures of cryptic relatedness persist amongst HLs. We have detected long-range genomic tract sharing (>3cM), or identity-by-descent (IBD), across 5,194 HL in the Mount Sinai BioMe Biobank. We observed an average population level IBD sharing of 0.0025 in HL, which is 2.5- and 5-fold higher than that observed in BioMe European- and African-American populations, respectively. We hypothesize that these patterns of recent migration and genetic drift may drive some otherwise rare functional alleles to detectable frequency. We clustered groups of homologous IBD tracts (n=112,250) segregating in this HL population. We observed that IBD clusters represent a class of low frequency alleles (median minor allele frequency = 0.0077, s.d. = 0.0015). We performed a genome-wide association of the IBD clusters, or ‘population-based linkage’, to detect loci implicated in height, a highly heritable polygenic trait. 15 independent loci surpassed our empirically derived genome-wide significance threshold of less than 4.47×10⁻⁴, 11 of which replicated in an independent cohort of BioMe HLs. Strikingly, two regions confer strong recessive effects. In the case of the top hit on 9q32 (MAF less than 0.005; p less than 8×10⁻⁶), homozygous non-referent individuals were shorter by 6” or 10”, for men or women, respectively, compared to the population mean (5’ 7” and 5’ 2” for men and women, respectively). In addition, IBD haplotypes in the 9q32 cluster harbored a significant enrichment of Native American ancestry (p less than 1×10⁻¹⁶). Finally, this interval contains a number of biologically compelling candidate genes, including COL27A1 and PALM2. This study demonstrates that rich population structure, rather than being a confounding factor in biomedical discovery efforts, may be leveraged to reveal novel genetic associations with complex human traits.

A haplotype reference panel of over 31,000 individuals and next-generation imputation methods. S. Das, on behalf of Haplotype Reference Consortium.

Genotype imputation is now a key tool in the analysis of human genetic studies, enabling array-based genetic association studies to examine the millions of variants that are being discovered by advances in whole genome sequencing. Examining these variants increases power and resolution of genetic association studies and makes it easier to compare the results of studies conducted using different arrays. Genotype imputation improves in accuracy with increasing numbers of sequenced samples, particularly for low frequency variants. The goal of the Haplotype Reference Consortium is to combine haplotype information from ongoing whole genome sequencing studies to create a large imputation resource. To date, we have collected information on >31,500 sequenced whole genomes, aggregated over 20 studies of predominantly European ancestry, to create a very large reference panel of human haplotypes where ~50M genetic variants are observed 5 or more times. These haplotypes can be used to guide genotype imputation and haplotype estimation. In preliminary empirical evaluations, our panel provides substantial increases in accuracy relative to the 1000 Genomes Project Phase 1 reference panel and other smaller panels, particularly for variants with frequency less than 5%. I will describe our evaluation of strategies for merging haplotypes and variant lists across studies and advances in methods for genotype likelihood-based haplotype estimation that can be applied to 10,000s of samples. I will also summarize new methods for next generation imputation that perform faster and require less memory than contemporary methods while attaining similar levels of imputation accuracy. Our full resource is available to the community through imputation servers that enable scientists to impute missing variants in any study and respect the privacy of subjects contributing to the studies that constitute the Haplotype Reference Consortium. The majority of haplotypes will also be deposited in the European Genotype Archive.

A rare variant local haplotype sharing method with application to admixed populations. S. Hooker, G. T. Wang, B. Li, Y. Guan, S. M. Leal.

With the advent of next generation sequencing, there is great interest in studying the involvement of rare variants in complex trait etiology. For many complex traits, sequence data is being generated on DNA samples from African Americans and Hispanics to elucidate rare variant associations. Analyses of admixed populations present special challenges due to spurious associations which can occur because of confounding. However, using information on admixture and local ancestry can also be highly beneficial and increase the power to detect associations in these populations. Here a local haplotype sharing (LHS) method (Xu and Guan 2014) was extended to test for rare variant (RV) associations in admixed populations. Previously, the Weighted Haplotype and Imputation-based Test (WHAIT) (Li et al. 2010) was proposed to test for rare variant associations using haplotype data. The RV-LHS method, unlike WHAIT, does not require reconstruction of haplotypes, which can be both computationally intensive and error prone. Additionally, the RV-LHS uses information on local ancestry, which is particularly advantageous when analyzing admixed populations. Results will be shown from simulation studies performed for rare variant data from an admixed population. Both Type I and II errors are evaluated for the RV-LHS method. Additionally, the power of the RV-LHS method is compared to WHAIT as well as several other non-haplotype-based rare variant association methods including the combined multivariate collapsing (CMC) (Li and Leal, 2008), Variable Threshold (VT) (Price et al. 2010) and Sequence Kernel Association Test (SKAT) (Wu et al. 2010). Several heart, lung and blood phenotypes were analyzed using sequence data on African-Americans from the NHLBI-Exome Sequencing Project to better evaluate the performance of the RV-LHS compared to other rare variant association methods.

July 29, 2014

Lethal mutations quantified

A very interesting new preprint on the arXiv (so it can be freely read). The founder population is the Hutterites. The key sentence:

Our approach indicates that on average, one in every two humans carries a recessive lethal allele on the autosomes that lead to lethality after birth and before reproductive age or to complete sterility.

arXiv:1407.7518 [q-bio.PE]

An estimate of the average number of recessive lethal mutations carried by humans

Ziyue Gao, Darrel Waggoner, Matthew Stephens, Carole Ober, Molly Przeworski

The effects of inbreeding on human health depend critically on the number and severity of recessive, deleterious mutations carried by individuals. In humans, existing estimates of these quantities are based on comparisons between consanguineous and non-consanguineous couples, an approach that confounds socioeconomic and genetic effects of inbreeding. To circumvent this limitation, we focused on a founder population with almost complete Mendelian disease ascertainment and a known pedigree. By considering all recessive lethal diseases reported in the pedigree and simulating allele transmissions, we estimated that each haploid set of human autosomes carries on average 0.29 (95% credible interval [0.10, 0.83]) autosomal, recessive alleles that lead to complete sterility or severe disorders at birth or before reproductive age when homozygous. Comparison to existing estimates of the deleterious effects of all recessive alleles suggests that a substantial fraction of the burden of autosomal, recessive variants is due to single mutations that lead to death between birth and reproductive age. In turn, the comparison to estimates from other eukaryotes points to a surprising constancy of the average number of recessive lethal mutations across organisms with markedly different genome sizes.
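
The step from the per-haploid estimate to a per-person figure is simple arithmetic, sketched below. Doubling 0.29 gives about 0.58 recessive lethal alleles per diploid individual on average. The conversion to a probability of carrying at least one such allele assumes a Poisson-distributed count, which is my assumption for illustration and not something the abstract states; under it the carrier probability comes out at roughly 0.44.

```python
import math

PER_HAPLOID_MEAN = 0.29        # point estimate quoted in the abstract
PER_HAPLOID_CI = (0.10, 0.83)  # 95% credible interval quoted in the abstract


def per_diploid_mean(per_haploid):
    """Expected number of recessive lethal alleles in a diploid individual."""
    return 2 * per_haploid


def prob_carrier(per_haploid):
    """P(at least one allele), assuming a Poisson count (assumption made here)."""
    return 1 - math.exp(-per_diploid_mean(per_haploid))


if __name__ == "__main__":
    for label, value in (
        ("lower", PER_HAPLOID_CI[0]),
        ("point", PER_HAPLOID_MEAN),
        ("upper", PER_HAPLOID_CI[1]),
    ):
        print(label, per_diploid_mean(value), prob_carrier(value))
```

The wide credible interval carries straight through, so the per-person figure is best read as an order-of-magnitude statement.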

Link

July 17, 2014

More selection on the X than in autosomes in humans

Mol Biol Evol (2014) doi: 10.1093/molbev/msu166

Evidence for Increased Levels of Positive and Negative Selection on the X Chromosome versus Autosomes in Humans

Krishna R. Veeramah et al.

Partially recessive variants under positive selection are expected to go to fixation more quickly on the X chromosome as a result of hemizygosity, an effect known as faster-X. Conversely, purifying selection is expected to reduce substitution rates more effectively on the X chromosome. Previous work in humans contrasted divergence on the autosomes and X chromosome, with results tending to support the faster-X effect. However, no study has yet incorporated both divergence and polymorphism to quantify the effects of both purifying and positive selection, which are opposing forces with respect to divergence. In this study, we develop a framework that integrates previously developed theory addressing differential rates of X and autosomal evolution with methods that jointly estimate the level of purifying and positive selection via modeling of the distribution of fitness effects (DFE). We then utilize this framework to estimate the proportion of nonsynonymous substitutions fixed by positive selection (α) using exome sequence data from a West African population. We find that varying the female to male breeding ratio (β) has minimal impact on the DFE for the X chromosome, especially when compared with the effect of varying the dominance coefficient of deleterious alleles (h). Estimates of α range from 46% to 51% and from 4% to 24% for the X chromosome and autosomes, respectively. While dependent on h, the magnitude of the difference between α values estimated for these two systems is highly statistically significant over a range of biologically realistic parameter values, suggesting faster-X has been operating in humans.
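
For readers unfamiliar with α, the quantity can be illustrated with the classic McDonald-Kreitman-style estimator, α = 1 − (Ds·Pn)/(Dn·Ps), where D and P are counts of fixed differences and polymorphisms and the subscripts n and s mark nonsynonymous and synonymous sites. This is a much simpler estimator than the DFE-based framework used in the paper and is swapped in here purely for illustration. The counts in the usage example are made up.

```python
def mk_alpha(dn, ds, pn, ps):
    """McDonald-Kreitman-style estimate: alpha = 1 - (Ds * Pn) / (Dn * Ps).

    dn, ds: nonsynonymous / synonymous fixed differences (divergence)
    pn, ps: nonsynonymous / synonymous polymorphisms
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero for the ratio to be defined")
    return 1 - (ds * pn) / (dn * ps)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(mk_alpha(dn=60, ds=100, pn=30, ps=100))
```

This simple version is known to be biased by slightly deleterious polymorphisms, which is one reason approaches that model the distribution of fitness effects, as in this study, are preferred.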

Link