Dienekes’ Anthropology Blog: America

May 27, 2016

The great migration of African Americans

PLoS Genet 12(5): e1006059. doi:10.1371/journal.pgen.1006059

The Great Migration and African-American Genomic Diversity
Soheil Baharian et al.

We present a comprehensive assessment of genomic diversity in the African-American population by studying three genotyped cohorts comprising 3,726 African-Americans from across the United States that provide a representative description of the population across all US states and socioeconomic status. An estimated 82.1% of ancestors to African-Americans lived in Africa prior to the advent of transatlantic travel, 16.7% in Europe, and 1.2% in the Americas, with increased African ancestry in the southern United States compared to the North and West. Combining demographic models of ancestry and those of relatedness suggests that admixture occurred predominantly in the South prior to the Civil War and that ancestry-biased migration is responsible for regional differences in ancestry. We find that recent migrations also caused a strong increase in genetic relatedness among geographically distant African-Americans. Long-range relatedness among African-Americans and between African-Americans and European-Americans thus track north- and west-bound migration routes followed during the Great Migration of the twentieth century. By contrast, short-range relatedness patterns suggest comparable mobility of ∼15–16km per generation for African-Americans and European-Americans, as estimated using a novel analytical model of isolation-by-distance.

Link

April 30, 2016

More on Kennewick Man

A new technical report re-analyzes the data of the Rasmussen et al. study on Kennewick Man and confirms that he is related to Native Americans. From the report:

We find the Kennewick sample has the highest shared similarity to Native American populations with the highest values observed being with populations from South America (Figure 7), in line with the observations from Rasmussen et al.

Hopefully this will end the campaign to put him back in the ground. I have added a horizontal line to the new study's Figure 7 to mark the population claiming the skeleton among the huge number considered, showing that there's no particularly strong relationship to it (the strongest connection is at the bottom of the figure).

The Rasmussen et al. and Novembre et al. studies are really science working at its best: simultaneously falsifying claims that Kennewick was some sort of Australoid (or even more implausibly Caucasoid) based on its craniofacial morphology, but not overreaching to validate emotional appeals to make him into an ancestor he wasn't. Thankfully, the way forward is to keep studying Kennewick Man (and modern Native Americans) with ever-better data and techniques which may turn up (who knows?) a real (rather than imagined) ancestral link.

Technical Report: Assessment of the genetic analyses of Rasmussen et al. (2015)

John Novembre, PhD, David Witonsky, Anna Di Rienzo, PhD

The primary aim of the analysis undertaken here (U.S. Army Corps of Engineers, St Louis District Contract #W912P9-16-P-0010) is to provide an independent validation of the genetic evidence underlying a recent publication by Morten Rasmussen and colleagues on July 23rd, 2015, in Nature (Vol 523:455–58). Based on our analysis of the Kennewick Man’s sequence data and Colville tribe genotype data generated by Rasmussen et al., we concur with the findings of the original paper that the sample is genetically closer to modern Native Americans than to any other population worldwide. We carried out several analyses to support this conclusion, including (i) principal component analysis (PCA; Patterson et al. 2006), (ii) unsupervised genetic clustering using ADMIXTURE (Alexander, Novembre, and Lange 2009), (iii) estimation of genetic affinity to modern human populations using f3 and D statistics (Patterson et al. 2012), and (iv) a novel approach based on the geographic distribution of rare variants. Importantly, these distinct analyses, spanning three non-overlapping subsets of the data, are each consistent with Native American ancestry.
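For readers unfamiliar with the f3 and D statistics named in the report, here is a minimal sketch of their allele-frequency form as defined in Patterson et al. 2012. The population names and toy frequencies are invented for illustration. The sketch leaves out the finite-sample bias correction and the block-jackknife standard errors used in real analyses, so it should not be read as the report's actual pipeline.

```python
def f3(c, a, b):
    """f3(C; A, B): average of (c - a) * (c - b) over SNPs.

    With C as an outgroup, larger values mean A and B share more drift.
    """
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)


def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP allele frequencies of four populations."""
    num = sum((wi - xi) * (yi - zi) for wi, xi, yi, zi in zip(w, x, y, z))
    den = sum(
        (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
        for wi, xi, yi, zi in zip(w, x, y, z)
    )
    return num / den


# Toy allele frequencies at four SNPs (hypothetical values).
outgroup = [0.5, 0.5, 0.5, 0.5]
ancient = [0.9, 0.1, 0.8, 0.2]
near_pop = [0.8, 0.2, 0.9, 0.1]   # built to resemble `ancient`
far_pop = [0.5, 0.5, 0.6, 0.4]    # built to resemble it less

f3_near = f3(outgroup, ancient, near_pop)
f3_far = f3(outgroup, ancient, far_pop)
d_value = d_statistic(far_pop, near_pop, ancient, outgroup)

print("outgroup f3 with near_pop:", round(f3_near, 3))
print("outgroup f3 with far_pop: ", round(f3_far, 3))
print("D(far, near; ancient, outgroup):", round(d_value, 3))
```

In this toy setup the outgroup f3 is higher for near_pop than for far_pop, and D is negative, meaning the second population in the first pair (near_pop) shares more alleles with the ancient sample. Ranking many populations by statistics of this kind is what underlies a comparison such as the report's Figure 7.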

Link

July 26, 2015

Paleoamericans galore

Two new papers in Nature and Science add to the debate on Native American origins. The first study (in Nature) detects that some Amazonians have a few percent ancestry from a group related to Australasians, which suggests that early Native Americans were not homogeneous but came in two flavors: the main one found all over the Americas and the Australasian-related one. The second study (in Science) looks at ancient "Paleoamerican"-postulated populations and finds that they don't have any particular relationship to Australasians. Thus, whatever population brought the "Paleoamerican" admixture into the Amazon, it remains to be found.

Nature (2015) doi:10.1038/nature14895

Genetic evidence for two founding populations of the Americas

Pontus Skoglund et al.

Genetic studies have consistently indicated a single common origin of Native American groups from Central and South America1, 2, 3, 4. However, some morphological studies have suggested a more complex picture, whereby the northeast Asian affinities of present-day Native Americans contrast with a distinctive morphology seen in some of the earliest American skeletons, which share traits with present-day Australasians (indigenous groups in Australia, Melanesia, and island Southeast Asia)5, 6, 7, 8. Here we analyse genome-wide data to show that some Amazonian Native Americans descend partly from a Native American founding population that carried ancestry more closely related to indigenous Australians, New Guineans and Andaman Islanders than to any present-day Eurasians or Native Americans. This signature is not present to the same extent, or at all, in present-day Northern and Central Americans or in a ~12,600-year-old Clovis-associated genome, suggesting a more diverse set of founding populations of the Americas than previously accepted.

Link

Science DOI: 10.1126/science.aab3884

Genomic evidence for the Pleistocene and recent population history of Native Americans

Maanasa Raghavan, Matthias Steinrücken, Kelley Harris, Stephan Schiffels, Simon Rasmussen, Michael DeGiorgio, Anders Albrechtsen, Cristina Valdiosera, María C. Ávila-Arcos, Anna-Sapfo Malaspinas et al.

How and when the Americas were populated remains contentious. Using ancient and modern genome-wide data, we find that the ancestors of all present-day Native Americans, including Athabascans and Amerindians, entered the Americas as a single migration wave from Siberia no earlier than 23 thousand years ago (KYA), and after no more than an 8,000-year isolation period in Beringia. Following their arrival to the Americas, ancestral Native Americans diversified into two basal genetic branches around 13 KYA, one that is now dispersed across North and South America and the other restricted to North America. Subsequent gene flow resulted in some Native Americans sharing ancestry with present-day East Asians (including Siberians) and, more distantly, Australo-Melanesians. Putative ‘Paleoamerican’ relict populations, including the historical Mexican Pericúes and South American Fuego-Patagonians, are not directly related to modern Australo-Melanesians as suggested by the Paleoamerican Model.

Link

June 18, 2015

Kennewick Man was a Native American

Nature (2015) doi:10.1038/nature14625

The ancestry and affiliations of Kennewick Man

Morten Rasmussen, Martin Sikora, Anders Albrechtsen, Thorfinn Sand Korneliussen, J. Víctor Moreno-Mayar, G. David Poznik, Christoph P. E. Zollikofer, Marcia S. Ponce de León, Morten E. Allentoft, Ida Moltke, Hákon Jónsson, Cristina Valdiosera, Ripan S. Malhi, Ludovic Orlando, Carlos D. Bustamante, Thomas W. Stafford Jr, David J. Meltzer, Rasmus Nielsen & Eske Willerslev

Kennewick Man, referred to as the Ancient One by Native Americans, is a male human skeleton discovered in Washington state (USA) in 1996 and initially radiocarbon-dated to 8,340–9,200 calibrated years before present (BP)1. His population affinities have been the subject of scientific debate and legal controversy. Based on an initial study of cranial morphology it was asserted that Kennewick Man was neither Native American nor closely related to the claimant Plateau tribes of the Pacific Northwest, who claimed ancestral relationship and requested repatriation under the Native American Graves Protection and Repatriation Act (NAGPRA). The morphological analysis was important to judicial decisions that Kennewick Man was not Native American and that therefore NAGPRA did not apply. Instead of repatriation, additional studies of the remains were permitted2. Subsequent craniometric analysis affirmed Kennewick Man to be more closely related to circumpacific groups such as the Ainu and Polynesians than he is to modern Native Americans2. In order to resolve Kennewick Man’s ancestry and affiliations, we have sequenced his genome to ~1× coverage and compared it to worldwide genomic data including the Ainu and Polynesians. We find that Kennewick Man is closer to modern Native Americans than to any other population worldwide. Among the Native American groups for whom genome-wide data are available for comparison, several seem to be descended from a population closely related to that of Kennewick Man, including the Confederated Tribes of the Colville Reservation (Colville), one of the five tribes claiming Kennewick Man. We revisit the cranial analyses and find that, as opposed to genome-wide comparisons, it is not possible on that basis to affiliate Kennewick Man to specific contemporary groups. We therefore conclude based on genetic comparisons that Kennewick Man shows continuity with Native North Americans over at least the last eight millennia.

Link

March 06, 2015

Craniofacial plasticity in ancient Peru

Anthropologischer Anzeiger doi:10.1127/anthranz/2015/0458

Craniofacial plasticity in ancient Peru

Jessica H. Stone; Kristen Chew; Ann H. Ross; John W. Verano

Numerous studies have utilized craniometric data to explore the roles of genetic diversity and environment in human cranial shape variation. Peru is a particularly interesting region to examine cranial variation due to the wide variety of high and low altitude ecological zones, which in combination with rugged terrain have created isolated populations with vastly different physiological adaptations. This study examines seven samples from throughout Peru in an effort to understand the contributions of environmental adaptation and genetic relatedness to craniofacial variation at a regional scale. Morphological variation was investigated using a canonical discriminant analysis and Mahalanobis D2 analysis. Results indicate that all groups are significantly different from one another with the closest relationship between Yauyos and Jahuay, two sites that are located geographically close in central Peru but in very different ecozones. The relationship between latitude/longitude and face shape was also examined with a spatial autocorrelation analysis (Moran’s I) using ArcMap and show that there is significant spatial patterning for facial measures and geographic location suggesting that there is an association between biological variation and geographic location.
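As a rough illustration of the spatial autocorrelation statistic mentioned in the abstract, here is a minimal sketch of Moran's I. The study itself used ArcMap; the function names, the chain-shaped neighbour weights, and the toy values below are assumptions made purely for illustration.

```python
def morans_i(values, weights):
    """Moran's I for values observed at n sites.

    weights[i][j] is the spatial weight between sites i and j (0 on the diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_sum) * cross / sum(d * d for d in dev)


def chain_weights(n):
    """Binary weights for n sites on a line: only adjacent sites are neighbours."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = chain_weights(4)
gradient = morans_i([1, 2, 3, 4], w)      # values change smoothly along the line
alternating = morans_i([1, 4, 1, 4], w)   # neighbours are maximally unlike

print("smooth gradient:", round(gradient, 3))
print("alternating:    ", round(alternating, 3))
```

Positive values indicate that nearby sites have similar measurements, and negative values indicate that neighbours differ. Real analyses usually use distance-based weights and a permutation or normal-approximation test for significance, neither of which is shown here.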

Link

January 18, 2015

Kennewick Man was Native American

First DNA tests say Kennewick Man was Native American

Genetic analysis is still under way in Denmark, but documents obtained through the federal Freedom of Information Act say preliminary results point to a Native-American heritage.

The researchers performing the DNA analysis “feel that Kennewick has normal, standard Native-American genetics,” according to a 2013 email to the U.S. Army Corps of Engineers, which is responsible for the care and management of the bones. “At present there is no indication he has a different origin than North American Native American.”

...

Willerslev’s Danish lab is a world leader in ancient DNA analysis. Last year, he and his colleagues reported the genome of the so-called Anzick boy, an infant buried 12,600 years ago in Montana. He, too, was a direct ancestor of modern Native Americans and a descendant of people from Beringia.

Until details of the Kennewick analysis are published, there’s no way to know what other relationships his genes will reveal, Kemp said. It may never be possible to link him to specific tribes, partly because so few Native Americans in the United States have had their genomes sequenced for comparison.

The recent publication of the Kostenki-14 genome, from an individual who has been described as morphologically Australoid but appears to be genetically European, should make us wary of interpreting the phenotypes of early specimens in terms of much later human populations. In the case of Europeans, it seems that the Caucasoid genetic lineage existed even before full Caucasoid morphology had evolved (at least in some specimens of Upper Paleolithic Europeans, as others had clear Caucasoid morphology).

I would not be surprised if the same was true for Native Americans, that is, the typical morphology of recent Native Americans was not present in their earliest predecessors, who, nonetheless, were part of the same evolving lineage of humans in the Americas. The Anzick-1 genome from the Clovis culture and several mtDNA results have not really turned up anything "exotic" in ancient inhabitants of the Americas, so it seems that the hypothesis of recent Native Americans being descended from a wave of people that replaced earlier inhabitants is losing ground with each new discovery.

September 18, 2014

23andMe mega-study on different American groups

It's great to see that the massive dataset of 23andMe was used for a study like this that seeks to capture the landscape of ancestry of different American groups.

First, distribution of ancestry in African Americans:

The higher fraction of African ancestry in the south and of European ancestry in the north shouldn't be very surprising. There are some interesting loci of higher "Native American" ancestry; most African Americans don't seem to have a lot of this ancestry, but some apparently do.

Second, distribution of ancestry in "Latinos":

To my eye, this seems like more African ancestry in the eastern parts (presumably from Caribbean-type Latinos?) and more Native American ancestry in the west.

Third, distribution of ancestry in European Americans:

Overall, it seems that relatively few (less than 5%) of European Americans have more than 2% of either African or Native American ancestry in any of the states, so the breakdown of European ancestry into various subgroups is perhaps more interesting.

The distribution of African ancestry in European and African Americans is also interesting:

The existence of "African Americans" with virtually no African ancestry and of "European Americans" with as much as half African ancestry is probably due to either misreporting or some quite strange self-perception issues. The bulk of the African ancestry in European Americans seems to be in the sub-10% range (equivalent to less than 1 great grandparent). It is possible that many of these individuals might not even be aware of the existence of such ancestors.
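The great-grandparent comparison rests on simple halving: an ancestor g generations back contributes, in expectation, 1/2^g of one's autosomal genome. A short sketch of that arithmetic follows (the function name is mine, and actual inherited fractions scatter around these expectations because of recombination):

```python
def expected_fraction(generations_back: int) -> float:
    """Expected autosomal ancestry from a single ancestor g generations back."""
    return 0.5 ** generations_back


ancestors = [
    ("parent", 1),
    ("grandparent", 2),
    ("great-grandparent", 3),
    ("great-great-grandparent", 4),
]

for label, g in ancestors:
    print(f"{label}: {expected_fraction(g):.2%}")
```

One great-grandparent corresponds to 12.5%, so ancestry in the sub-10% range is indeed less than a single great-grandparent's expected contribution.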

bioRxiv doi: http://dx.doi.org/10.1101/009340

The genetic ancestry of African, Latino, and European Americans across the United States.

Katarzyna Bryc, Eric Durand, J Michael Macpherson, David Reich, Joanna Mountain

Over the past 500 years, North America has been the site of ongoing mixing of Native Americans, European settlers, and Africans brought largely by the Trans-Atlantic slave trade, shaping the early history of what became the United States. We studied the genetic ancestry of 5,269 self-described African Americans, 8,663 Latinos, and 148,789 European Americans who are 23andMe customers and show that the legacy of these historical interactions is visible in the genetic ancestry of present-day Americans. We document pervasive mixed ancestry and asymmetrical male and female ancestry contributions in all groups studied. We show that regional ancestry differences reflect historical events, such as early Spanish colonization, waves of immigration from many regions of Europe, and forced relocation of Native Americans within the US. This study sheds light on the fine-scale differences in ancestry within and across the United States, and informs our understanding of the relationship between racial and ethnic identities and genetic ancestry.

Link

September 10, 2014

ASHG 2014 titles and abstracts

Some interesting titles from the ASHG 2014 conference.

UPDATE: I have added the abstracts.

The human X chromosome is the target of megabase wide selective sweeps associated with multi-copy genes expressed in male meiosis and involved in reproductive isolation. M. H. Schierup, K. Munch, K. Nam, T. Mailund, J. Y. Dutheil.

The X chromosome differs from the autosomes in its hemizygosity in males and in its intimate relationship with the very different Y chromosome. It has a different gene content than autosomes and undergoes specific processes such as meiotic sex chromosome inactivation (MSCI) and XY body formation. Previous studies have shown that natural selection is more efficient against deleterious mutations and, in chimpanzee, that positive selection is prevalent. We show that in all great ape species, megabase wide regions of the X chromosome have severely reduced diversity (by more than 80%). These regions are partly shared among species and indicate a large number of strong selective sweeps that have occurred independently on the same set of targets in different great ape species. We use simulations and deterministic calculations to show that background selection or soft selective sweeps are unlikely to be responsible. The regions also bear all the hallmarks of selective sweeps such as an increased proportion of singletons and higher divergence among closely related populations. Human populations are differently affected, suggesting that a large fraction of sweeps are private to specific human populations. The regions of reduced diversity correlate strongly with the position of X-ampliconic regions, which are 100-500 kb regions containing multiple copies of genes that are solely expressed during male meiosis. We propose that the genes in these regions escape MSCI and participate in an intragenomic conflict with regions of similar function on the Y chromosome for transmission of sex chromosomes to the next generation, i.e. sex chromosome meiotic drive. Recent results from Neanderthal introgression into humans point to the same regions as showing no introgression, consistent with the above process leading to reproductive isolation. 
Strikingly, the same regions of the X also show much reduced divergence between human and chimpanzee, suggesting either that this speciation process was indeed complex or that the same regions were under strong selection in the human chimpanzee ancestor.

New insights on human de novo mutation rate and parental age. W. S. W. Wong, B. Solomon, D. Bodian, D. Thach, R. Iyer, J. Vockley, J. Niederhuber.

Germline mutations have a major role to play in evolution. Much attention has been given to studying the pattern and rate of human mutations using biochemical or phylogenetic methods based on closely related species. Massively parallel sequencing technologies have given scientists the opportunity to study directly measured de novo mutations (DNMs) at an unprecedented scale. Here we report the largest study (to our knowledge) of de novo point mutations in humans, in which we used whole genome deep sequencing (~60x) data from 605 family trios (father, mother and newborn). These trios represent the first group of approximately 2,700 trios who have undergone whole-genome sequencing (WGS) through our pediatric-based WGS research studies. The fathers' ages range from 17 to 63 years and the mothers' ages range from 17 to 43 years. We identified over 23,000 DNMs (~40 per newborn) in the autosomal chromosomes using a customized pipeline and infer that the mutation rate per basepair is around 1.2x10^-8 per generation, well within the reported range in previous studies. We were also able to confirm that the total number of DNMs in the newborn was directly proportional to the paternal age (P < 2x10^-16). Maternal age is shown to have a small but significant positive effect on the number of DNMs passed onto the offspring (P = 0.003), even after accounting for the paternal age. This contradicts the prior dogma that maternal age only has an effect on chromosomal abnormalities related to nondisjunction events. Furthermore, 5% (22 total) of newborns in the analyzed group were conceived with assisted reproductive technologies (ARTs), and these infants have on average 5 more DNMs (Bias corrected and accelerated bootstrap 95% Confidence Interval, 1.24 to 8.00) than those conceived naturally, after controlling for both parents' ages. Both parents' ages remain significant as independently correlated with DNMs even after the families that used ARTs were removed from the analysis. 
Our study enhances current knowledge related to the human germline mutational rates.

Alignment to an ancestry specific reference genome discovers additional variants among 1000 Genomes ASW Cohort. R. A. Neff, J. Vargas, G. H. Gibbons, A. R. Davis.

Whole genome sequencing studies across certain populations, such as those with African ancestry, are often underpowered due to a larger divergence between the common reference genome and the true genetic sequence of the population. However, a common reference genome is not designed to account for this divergence in population-specific studies. Strong signals from common (MAF>50%) single nucleotide polymorphisms (SNPs), insertion-deletions (indels), and structural variants (SVs) can make alignment and variant calling difficult by masking nearby variants with weaker genetic signals. We present the results generated from alignment to an African descent population-specific reference genome by applying variants present in a majority of individuals with African descent from all phases of the 1000 Genomes Project and the International HapMap Consortium. We identified 882,826 single nucleotide polymorphisms, short insertion-deletion events, and large structural variations present at MAF>50% in the population, representing 2.39 MB of genetic variation changed from hg19. We demonstrate that utilization of a population-specific reference improves variant call quality, coverage level, and imputation accuracy. We compared alignment of 27 African-American SW population (ASW) samples from the 1000 Genomes Phase 1 project between the population-specific and the hg19 reference. We discovered an additional 443,036 SNPs by alignment to the population specific reference in union across all samples, including thousands of exonic variants that are non-synonymous and are clinically relevant to the study of disease.

Using compressed data structures to capture variation in thousands of human genomes. S. A. McCarthy, Z. Lui, J. T. Simpson, Z. Iqbal, T. M. Keane, R. Durbin.

Currently the most widely used approach to catalogue variation amongst a set of samples is to align the sequencing reads to a single linear reference genome. This principle has been at the core of the 1000 Genomes data processing pipeline since the pilot phase of the project. However, there is now an increased awareness of the limitations of this approach, such as alignment artefacts, reference bias and unobserved variation on non-reference haplotypes. The Burrows-Wheeler transform and FM-index are compact data structures that have been successfully used in sequence alignment and assembly. One of the key features of these structures is that they are a searchable and reference-free representation of the raw sequencing reads. Our project aims to build a web server based on BWT data structures containing all the reads from many thousands of samples so as to efficiently retrieve matching reads and information about samples and populations. Enticingly, it is expected that data storage for this system would plateau as we collect more data since most new sequencing reads will have already been observed. We expect this to enable powerful new ways to query variation data from thousands of individuals. For the first phase of this project, we include all 87 Tbp of the low-coverage and exome data from the 2,535 samples in 1000 Genomes Phase 3. We envisage this would provide a means for researchers to easily check the prevalence of any human sequence in a control set of thousands of putatively healthy samples. We present our approaches and initial benchmarks on variant sensitivity and specificity against truth datasets and explore several applications for these structures such as validation of short insertion/deletion and structural variant calls, and rapid searching for traces of viral DNA.
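To make the Burrows-Wheeler transform and FM-index idea concrete, here is a minimal sketch that builds both for a single toy sequence and counts occurrences of a query by backward search. It is only an illustration: the function names and the toy sequence are mine, the rotation-sorting construction is far too slow for real data, and the project described above indexes many reads rather than one string.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs only)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_string):
    """Return (first, occ): first[c] is the row where the block of c starts in
    the sorted first column; occ[c][i] counts c in bwt_string[:i]."""
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt_string) + 1) for ch in counts}
    for i, ch in enumerate(bwt_string):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_matches(pattern, first, occ, n):
    """Backward search: number of times pattern occurs in the indexed text."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


sequence = "GATTACAGATTACA"  # toy sequence, not real data
transformed = bwt(sequence)
first, occ = build_fm_index(transformed)

for query in ["ATTA", "ACA", "TTT"]:
    print(query, count_matches(query, first, occ, len(transformed)))
```

The search cost depends on the query length rather than on the size of the indexed text, which is what makes this family of structures attractive for checking whether a sequence has been seen before across thousands of samples.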

Second-generation PLINK: Rising to the challenge of larger and richer datasets. C. C. Chang, C. C. Chow, L. C. A. M. Tellier, S. Vattikuti, S. M. Purcell, J. J. Lee.

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
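The "bit-level parallelism" mentioned in the abstract can be illustrated with a small sketch: genotypes are packed two bits apiece into one integer, and allele counts are then obtained with a couple of masks and population counts instead of a per-sample loop. The encoding here (the two bits simply hold the alternate-allele count, with no missing-data code) is a simplification I chose for clarity; it is not PLINK's actual file format or code.

```python
def pack_genotypes(genotypes):
    """Pack genotypes (0, 1 or 2 alternate alleles) into one integer, 2 bits each."""
    word = 0
    for i, g in enumerate(genotypes):
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(word, n_samples):
    """Total alternate alleles across all packed samples, without unpacking."""
    low_mask = int("01" * n_samples, 2)   # selects the low bit of every 2-bit field
    low_bits = word & low_mask            # set where the genotype is 1
    high_bits = (word >> 1) & low_mask    # set where the genotype is 2
    return popcount(low_bits) + 2 * popcount(high_bits)


genotypes = [0, 1, 2, 2, 0, 1, 1, 0]
packed = pack_genotypes(genotypes)
print(alt_allele_count(packed, len(genotypes)), sum(genotypes))
```

In a compiled language the same masks operate on 64-bit machine words with a hardware popcount instruction, so dozens of genotypes are processed per operation; that is the general flavour of the speed-ups the abstract describes.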

Integrated analysis of protein-coding variation in over 90,000 individuals from exome sequencing data. D. G. MacArthur, M. Lek, E. Banks, R. Poplin, T. Fennell, K. Samocha, B. Thomas, K. Karczewski, S. Purcell, P. Sullivan, S. Kathiresan, M. I. McCarthy, M. Boehnke, S. Gabriel, D. M. Altshuler, G. Getz, M. J. Daly, Exome Aggregation Consortium.

Exploring genetic variation and genotypes among millions of genomes. R. M. Layer, A. R. Quinlan.

Rare, and thus largely unknown, variants are a major reason that, typically, less than 10% of the heritability of complex diseases currently can be explained by known genetic variation. While increasing the number of sequenced genomes may improve our ability to reveal this “hidden heritability,” the scale of the resulting dataset poses substantial storage and computational demands. Current efforts to sequence 100,000 genomes, and combined efforts that are likely to surpass 1 million genomes will identify hundreds of millions to billions of polymorphic loci. The minimum storage requirement for directly representing the variability found by these projects (1 bit per individual per variant, ignoring the necessary metadata) will range from terabytes to petabytes. Like most big-data problems, a balance must be found between optimizing storage and computational efficiency. For example, while compression can minimize storage by reducing file size, it can also cause inefficient computation since data must be decompressed before it can be analyzed. Conversely, highly structured data can reduce analysis times but typically require extra metadata that increase file size. Current variation storage schemes were not designed to quickly analyze massive datasets and fail to balance these competing goals. We present GENOTQ, an open source API and toolkit that reduces file size and data access time through use of a succinct data structure, a class of data structures that compress data such that operations can be performed without requiring the full decompression. Word aligned hybrid (WAH) bitmap compression is one such data structure that was developed to improve query times for relational databases. Binary values are encoded such that logical operations (AND, OR, NOT) can be performed on the compressed data. This encoding results in file sizes that are 20X smaller than uncompressed versions, and only 50% larger than the compressed version. 
Queries, such as finding shared variants among a subpopulation, are also 21X faster. Furthermore, representing the genotypes in this manner makes our method well suited to both distributed architectures like BigQuery and parallel processors like GPUs. We stress that this method is only part of a larger solution that would incorporate genomic annotations, medical histories, and pedigrees. Incorporating fast genotype queries with this web of metadata will provide a rich information source to both clinicians and researchers.
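A small sketch may help make the bitmap idea concrete. It stores one bit per individual per variant, answers a "shared by everyone in this subpopulation" query with a single AND per variant, and reproduces the raw-storage arithmetic. The bitmaps here are plain uncompressed Python integers, not WAH-compressed words, and the variant names, carrier lists and function names are invented for illustration.

```python
def carriers_to_bitmap(carrier_indices):
    """One bit per individual: bit i is set if individual i carries the variant."""
    bits = 0
    for i in carrier_indices:
        bits |= 1 << i
    return bits


def shared_by_all(bitmaps, subpop_mask):
    """Names of variants carried by every individual selected in subpop_mask."""
    return [
        name for name, bits in bitmaps.items()
        if (bits & subpop_mask) == subpop_mask
    ]


def raw_storage_bytes(n_individuals, n_variants):
    """Uncompressed size at 1 bit per individual per variant, metadata ignored."""
    return n_individuals * n_variants // 8


# Hypothetical data: six individuals (0..5) and three variants.
bitmaps = {
    "var_a": carriers_to_bitmap([0, 1, 2, 5]),
    "var_b": carriers_to_bitmap([1, 2]),
    "var_c": carriers_to_bitmap([0, 3, 4]),
}
subpopulation = carriers_to_bitmap([1, 2])  # individuals 1 and 2

print(shared_by_all(bitmaps, subpopulation))
print(raw_storage_bytes(100_000, 100_000_000) / 1e12, "TB")
```

With 100,000 genomes and 100 million variants the raw bit matrix is about 1.25 TB, consistent with the lower end of the range quoted in the abstract. The point of a scheme like WAH is that the same AND can be run on run-length-compressed words without decompressing them first.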

Capture of 390,000 SNPs in dozens of ancient central Europeans reveals a population turnover in Europe thousands of years after the advent of farming. I. Lazaridis, W. Haak, N. Patterson, N. Rohland, S. Mallick, B. Llamas, S. Nordenfelt, E. Harney, A. Cooper, K. W. Alt, D. Reich.

To understand the population transformations that took place in Europe since the early Neolithic, we used a DNA capture technique to obtain reads covering ~390 thousand single nucleotide polymorphisms (SNPs) from a number of different archaeological cultures of central Europe (Germany and Hungary). The samples spanned the time period from 7,500 BP to 3,500 BP (Early Neolithic to Early Bronze Age periods) and most of them were previously studied using mtDNA (Brandt, Haak et al., Science, 2013). The captured SNPs include about 360,000 SNPs from the Affymetrix Human Origins Array that were discovered in African individuals, as well as about 30,000 SNPs chosen for other reasons (that are thought to have been affected by natural selection, or to have phenotypic effects, or are useful in determining Y-chromosome haplogroups). By analyzing this data together with a dataset of 2,345 present-day humans and other published ancient genomes, we show that late Neolithic inhabitants of central Europe belonging to the Corded Ware culture were not a continuation of the earlier occupants of the region. Our results highlight the importance of migration and major population turnover in Europe long after the arrival of farming.

Insights into British and European population history from ancient DNA sequencing of Iron Age and Anglo-Saxon samples from Hinxton, England. S. Schiffels, W. Haak, B. Llamas, E. Popescu, L. Loe, R. Clarke, A. Lyons, P. Paajanen, D. Sayer, R. Mortimer, C. Tyler-Smith, A. Cooper, R. Durbin.

British population history is shaped by a complex series of repeated immigration periods and associated changes in population structure. It is an open question however, to what extent each of these changes is reflected in the genetic ancestry of the current British population. Here we use ancient DNA sequencing to help address that question. We present whole genome sequences generated from five individuals that were found in archaeological excavations at the Wellcome Trust Genome Campus near Cambridge (UK), two of which are dated to around 2,000 years before present (Iron Age), and three to around 1,300 years before present (Anglo-Saxon period). Good preservation status allowed us to generate one high coverage sequence (12x) from an Iron Age individual, and four low coverage sequences (1x-4x) from the other samples. By providing the first ancient whole genome sequences from Britain, we get a unique picture of the ancestral populations in Britain before and after the Anglo-Saxon immigrations. We use modern genetic reference panels such as the 1000 Genomes Project to examine the relationship of these ancient samples with present day population genetic data. Results from principal component analysis suggest that all samples fall consistently within the broader Northern European context, which is also consistent with mtDNA haplogroups. In addition, we obtain a finer structural genetic classification from rare genetic variants and haplotype based methods such as FineStructure. Reflecting more recent genetic ancestry, results from these methods suggest significant differences between the Iron Age and the Anglo-Saxon period samples when compared to other European samples. We find in particular that while the Anglo-Saxon samples resemble more closely the modern British population than the earlier samples, the Iron Age samples share more low frequency variation than the later ones with present day samples from southern Europe, in particular Spain (1000GP IBS). 
In addition the Anglo-Saxon period samples appear to share a stronger older component with Finnish (1000GP FIN) individuals. Our findings help characterize the ancestral European populations involved in major European migration movements into Britain in the last 2,000 years and thus provide more insights into the genetic history of people in northern Europe.

Fine-scale population structure in Europe. S. Leslie, G. Hellenthal, S. Myers, P. Donnelly, International Multiple Sclerosis Genetics Consortium.

There is considerable interest in detecting and interpreting fine-scale population structure in Europe: as a signature of major events in the history of the populations of Europe, and because of the effect undetected population structure may have on disease association studies. Population structure appears to have been a minor concern for most of the recent generation of genome-wide association studies, but is likely to be important for the next generation of studies seeking associations to rare variants. Thus far, genetic studies across Europe have been limited to a small number of markers, or to methods that do not specifically account for the correlation structure in the genome due to linkage disequilibrium. Consequently, these studies were unable to group samples into clusters of similar ancestry on a fine (within country) scale with any confidence. We describe an analysis of fine-scale population structure using genome-wide SNP data on 6,209 individuals, sampled mostly from Western Europe. Using a recently published clustering algorithm (fineSTRUCTURE), adapted for specific aspects of our analysis, the samples were clustered purely as a function of genetic similarity, without reference to their known sampling locations. When plotted on a map of Europe one observes a striking association between the inferred clusters and geography. Interestingly, for the most part modern country boundaries are significant i.e. we see clear evidence of clusters that exclusively contain samples from a single country. At a high level we see: the Finns are the most differentiated from the rest of Europe (as might be expected); a clear divide between Sweden/Norway and the rest of Europe (including Denmark); and an obvious distinction between southern and northern Europe. We also observe considerable structure within countries on a hitherto unseen fine-scale - for example genetically distinct groups are detected along the coast of Norway. 
Using novel techniques we perform further analyses to examine the genetic relationships between the inferred clusters. We interpret our results with respect to geographic and linguistic divisions, as well as the historical and archaeological record. We believe this is the largest detailed analysis of very fine-scale human genetic structure and its origin within Europe. Crucial to these findings has been an approach to analysis that accounts for linkage disequilibrium.

The population structure and demographic history of Sardinia in relationship to neighboring populations. J. Novembre, C. Chiang, J. Marcus, C. Sidore, M. Zoledziewska, M. Steri, H. Al-asadi, G. Abecasis, D. Schlessinger, F. Cucca.

Numerous studies have made clear that Sardinian populations are relatively isolated genetically from other populations of the Mediterranean, and more recently, intriguing connections between Sardinian ancestry and early Neolithic ancient DNA samples have been made. In this study, we analyze a whole-genome low-coverage sequencing dataset from 2120 Sardinians to more fully characterize patterns of genetic diversity in Sardinia. The study contains one subsample that contains individuals from across Sardinia and a second subsample that samples 4 villages from the more isolated Ogliastra region. We also merge the data with published reference data from Europe and North Africa. Overall Fst values of Sardinia to other European populations are low (less than 0.015); however using a novel method for visualizing genetic differentiation on a geographic map, we formally show how Sardinia is more differentiated than would be expected given its geographic distance from the mainland, consistent with periods of isolation. Applications of the software Admixture show how Sardinia populations differ in the levels of recent admixture with mainland European populations and that there are only minor contributions from North African populations to Sardinian ancestry. Notably the Sardinians from Ogliastra contain a distinct genetic cluster with minimal evidence of recent admixture with mainland Europe. We found frequency-based f3 tests and the tree-based algorithm Treemix both also show minimal evidence of recent admixture. Given the relative isolation, one might expect to see a unique demographic history from neighboring populations. Using coalescent-based approaches, we find Sardinian populations have had more constant effective sizes over the past several thousand years than mainland European populations, which typically show evidence for rapid growth trajectories in the recent past. 
This unique demographic history has consequences for the abundance of putatively damaging and deleterious variants, and we use our data to address the prediction that the genetic architecture of disease traits is expected to involve fewer loci with a greater proportion of variants at common frequencies in Sardinia.

Population structure in African-Americans. S. Gravel, M. Barakatt, B. Maples, M. Aldrich, E. E. Kenny, C. D. Bustamante, S. Baharian.

We present a detailed population genetic study of 4 African-American cohorts comprising over 6000 genotyped individuals across US urban and rural communities: two nation-wide longitudinal cohorts, one biobank cohort, and the 1000 Genomes ASW cohort. Ancestry analysis reveals a uniform breakdown of continental ancestry proportions across regions and urban/rural status, with 79% African, 19% European, and 1.5% Native American/Asian ancestries, with substantial between-individual variation. The Native Ancestry proportion is higher than previous estimates and is maintained after self-identified Hispanics and individuals with substantial inferred Spanish ancestry are removed. This strongly supports direct admixture between Native Americans and African Americans on US territory, and linkage patterns suggest contact early after African-American arrival to the Americas. Local ancestry patterns and variation in ancestry proportions across individuals are broadly consistent with a single African-American population model with early Native American admixture and ongoing European gene flow in the South. The size and broad geographic sampling of our cohorts enables detailed analysis of the geographic and cultural determinants of finer-scale population structure. Recent identity-by-descent analysis reveals fine-scale structure consistent with the routes used during slavery and in the great African-American migrations of the twentieth century: east-to-west migrations in the south, and distinct south-to-north migrations into New England and the Midwest. These migrations follow transit routes available at the time, and are in stark contrast with European-American relatedness patterns.

Genetic testing of 400,000 individuals reveals the geography of ancestry in the United States. Y. Wang, J. M. Granka, J. K. Byrnes, M. J. Barber, K. Noto, R. E. Curtis, N. M. Natalie, C. A. Ball, K. G. Chahine.

The population of the United States is formed by the interplay of immigration, migration and admixture. Recent research (R. Sebro et al., ASHG 2013) has shed light on the U.S. demography by studying the self-reported ethnicity from the 2010 U.S. Census. However, self-reported ethnicity may not accurately represent true genetic ancestry and may therefore introduce unknown biases. Since launching its DNA service in May 2012, AncestryDNA has genotyped over 400,000 individuals from the United States. Leveraging this huge volume of DNA data, we conducted a large-scale survey of the ancestry of the United States. We predicted genetic ethnicity for each individual, relying on a rigorously curated reference panel of 3,000 single-origin individuals. Combining that with birth locations, we explored how various ethnicities are distributed across the United States. Our results reveal a distinct spatial distribution for each ethnicity. For example, we found that individuals from Massachusetts have the highest proportion of Irish genetic ancestry and individuals from New York have the highest proportion of Southern European genetic ancestry, indicating their unique immigration and migration histories. We also performed pairwise IBD analysis on the entire sample set and identified over 300 million shared genomic segments among all 400,000 individuals. From this data, we calculated the average amount of sharing for pairs of individuals born within the same state or from two different states. In general, we found the genetic sharing decreases as the geographic distance between two states increases. However, the pattern also varies substantially among the 50 states. In summary, our analysis has provided significant insight on the biogeographic patterns of the ancestry in the United States.

Statistical inference of archaic introgression and natural selection in Central African Pygmies. P. Hsieh, J. D. Wall, J. Lachance, S. A. Tishkoff, R. N. Gutenkunst, M. F. Hammer.

Recent evidence from ancient DNA studies suggests that genetic material introgressed from archaic forms of Homo, such as Neanderthals and Denisovans, into the ancestors of contemporary non-African populations. These findings also imply that hybridization may have given rise to some of the adaptive novelties in anatomically modern humans (AMH) as they expanded from Africa into various ecological niches in Eurasia. Within Africa, fossil evidence suggests that AMH and a variety of archaic forms coexisted for much of the last 200,000 years. Here we present preliminary results leveraging high quality whole-genome data (>60X coverage) for three contemporary sub-Saharan African populations (Biaka, Baka, and Yoruba) from Central and West Africa to test for archaic admixture. With the current lack of African ancient DNA, especially in Central Africa due to its rainforest environment, our statistical inference approach provides an alternative means to understand the complex evolutionary dynamics among groups of the genus Homo. To identify candidate introgressive loci, we scan the genomes of 16 individuals and calculate S*, a summary statistic that was specifically designed by one of us (JDW) to detect archaic admixture. The significance of each candidate is assessed through extensive whole-genome level simulations using demographic parameters estimated by ∂a∂i to obtain a parametric distribution of S* values under the null hypothesis of no archaic introgression. As a complementary approach, top candidates are also examined by an approximate-likelihood computation method. The admixture time for each individual introgressive variant is inferred by estimating the decay of the genetic length of the diverged haplotype as a function of its underlying recombination rate. A neutrality test that controls for demography is performed for each candidate to test the hypothesis that introgressive variants rose to high frequency due to positive directional selection. 
Several genomic regions were identified by both selection and introgression scans, and we will discuss the possible genetic and functional properties of these “double-hits”. The present study represents one of the most comprehensive genomic surveys to date for evidence of archaic introgression to anatomically modern humans in Africa.

Inferences about human history and natural selection from 280 complete genome sequences from 135 diverse populations. S. Mallick, D. Reich, Simons Genome Diversity Project Consortium.

The most powerful way to study population history and natural selection is to analyze whole genome sequences, which contain all the variation that exists in each individual. To date, genome-wide studies of history and selection have primarily analyzed data from single nucleotide polymorphism (SNP) arrays which are biased by the choice of which SNPs to include. Alternatively they have analyzed sequence data that have been generated as part of medical genetic studies from populations with large census sizes, and thus do not capture the full scope of human genetic variation. Here we report high quality genome sequences (~40x average) from 280 individuals from 135 worldwide populations, including 45 Africans, 26 Native Americans, 27 Central Asians or Siberians, 46 East Asians, 25 Oceanians, 46 South Asians, and 71 West Eurasians. All samples were sequenced using an identical protocol at the same facility (Illumina Ltd.). We modified standard pipelines to eliminate biases that might confound population genetic studies. We report novel inferences, as well as a high resolution map that shows where archaic ancestry (Neanderthal and Denisovan) is distributed throughout the world. We compare and contrast the genomic landscape of the Denisovan introgression into mainland Eurasians to that in island Southeast Asians. We are making this dataset fully available on Amazon Web Services as a resource to the community, coincident with the American Society of Human Genetics meeting.

Improved haplotype phasing using identity by descent. B. L. Browning, S. R. Browning.

We present a new haplotype phasing method that achieves higher accuracy than existing methods. The method is based on the Beagle haplotype frequency model, but unlike the original Beagle phasing method, the new method incorporates genetic recombination, genotype error, and segments of identity by descent. We compared the new haplotype phasing method to Beagle (r1230) and to SHAPEIT version 2 (r778) using Illumina Human 1M SNP data for chromosome 20. We phased 44 HapMap3 CEU trio offspring together with subsets of Wellcome Trust Case Control Consortium 2 controls (n=650, 1300, 2600, 5200). Phase error was measured at trio offspring genotypes on chromosome 20 that have phase determined by parental genotypes. The SHAPEIT “states” parameter was set at 6400 in order to increase its phasing accuracy. The new haplotype phasing method produced haplotype switch error rates that were 20-25% lower than the error rates for the existing Beagle method and 1-7% lower than the error rates for SHAPEIT. The difference in switch error rates between the new method and SHAPEIT increased with increasing sample size. The new haplotype phasing method will be incorporated into version 4 of the Beagle software package (http://faculty.washington.edu/browning/beagle/beagle.html).
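The switch error rate quoted in the abstract can be illustrated with a small sketch. This is only the usual textbook definition, not the authors' evaluation code: the function name and the input encoding (the allele carried by the first haplotype at each heterozygous site, in genomic order) are assumptions made for illustration.

```python
def switch_error_rate(true_h1, est_h1):
    """Fraction of consecutive heterozygous-site pairs whose relative phase is wrong.

    true_h1, est_h1: alleles (0/1) on the first haplotype at heterozygous
    sites only, for the trio-determined truth and the statistical estimate.
    (Hypothetical encoding chosen for this sketch.)
    """
    if len(true_h1) != len(est_h1):
        raise ValueError("haplotypes must cover the same heterozygous sites")
    # At a heterozygous site the estimate is either in the same orientation
    # as the truth or flipped.
    same_orientation = [t == e for t, e in zip(true_h1, est_h1)]
    if len(same_orientation) < 2:
        return 0.0
    # A switch error is a change of orientation between adjacent sites.
    switches = sum(a != b for a, b in zip(same_orientation, same_orientation[1:]))
    return switches / (len(same_orientation) - 1)


truth = [0, 1, 0, 1, 1, 0]
estimate = [0, 1, 1, 0, 0, 0]
print(switch_error_rate(truth, estimate))
```

Note that an estimate that is flipped at every site has a switch error rate of zero under this definition, since only changes in orientation are counted.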

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis. E. Y. Durand, N. Eriksson, C. Y. McLean.

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases. A large number of methods to detect IBD segments have been developed recently. However, IBD detection accuracy in non-simulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father-mother-child trios from the 23andMe, Inc. dataset. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false positive rate over 67% for 2-4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. We show that nearly all false positives arise due to allowing switch errors between haplotypes when detecting IBD, a necessity for retrieving long (> 6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that enables detection and filtering of false positive IBD segments on population-scale datasets. HaploScore scores IBD segments proportional to the number of switch errors they contain. Thus, it enables filtering of spurious segments reported due to GERMLINE being overly permissive to imperfect phasing. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative genotyping arrays using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can be readily adapted to improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.
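The pedigree check described above can be sketched as follows. This is an illustration of the principle only (a true IBD segment between a child and an unrelated individual should also be carried by at least one parent), not the authors' pipeline; the function names, the coordinate units (cM), and the 80% coverage threshold are all assumptions.

```python
def supported_by_parent(child_seg, parent_segs, min_cover=0.8):
    """Return True if parental segments cover enough of the child's segment.

    child_seg: (start, end) in cM of a segment shared by the child and
        some other individual.
    parent_segs: (start, end) segments that either parent shares with the
        same individual.
    min_cover: hypothetical threshold on the covered fraction.
    """
    start, end = child_seg
    if end <= start:
        raise ValueError("segment must have positive length")
    covered = 0.0
    cursor = start  # everything left of the cursor is already counted
    for s, e in sorted(parent_segs):
        s, e = max(s, cursor), min(e, end)
        if e > s:
            covered += e - s
            cursor = e
    return covered / (end - start) >= min_cover


def false_positive_rate(calls):
    """calls: list of (child_seg, parent_segs) pairs; returns unsupported fraction."""
    unsupported = sum(not supported_by_parent(c, p) for c, p in calls)
    return unsupported / len(calls)


calls = [
    ((10.0, 14.0), [(9.0, 12.0), (11.5, 13.9)]),  # mostly covered by parents
    ((30.0, 32.5), []),                           # no parental support
    ((50.0, 53.0), [(0.0, 5.0)]),                 # parental segment elsewhere
]
print(false_positive_rate(calls))
```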

Parente2: A fast and accurate method for detecting identity by descent. S. Bercovici, J. M. Rodriguez, L. Huang, S. Batzoglou.

Identity-by-descent (IBD) inference is the problem of establishing a direct and explicit genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is key to a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of both accurate and efficient IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: given a cohort’s size, a quadratic number of pairwise genome comparisons must be performed, in principle. Therefore, computation time and the false discovery rate can also scale quadratically. To enable practical large-scale IBD detection, we developed Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses an ensemble windowing approach to model complex linkage disequilibrium in the underlying studied population. Parente2 is applied directly on genotype data without the need to phase data prior to IBD inference. Through extensive simulations using real data, we evaluate Parente2’s performance. We show that Parente2 is superior to previous state-of-the-art methods, detecting pairs of related individuals sharing a 4 cM IBD segment with 99.9% sensitivity at a 0.1% false positive rate, and achieving 79.2% sensitivity at a 1% false positive rate for the more challenging case of pairs sharing a 2 cM IBD segment. Additionally, Parente2 is efficient, providing one to two orders of magnitude speedup compared to previous state-of-the-art methods. Parente2 is freely available at http://parente.stanford.edu/.
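To make the quadratic scaling concrete: for a cohort of N individuals, the number of unordered pairs is given below. The cohort size of 100,000 is an illustrative value, not a figure from the abstract.

```latex
\binom{N}{2} = \frac{N(N-1)}{2},
\qquad
N = 10^{5} \;\Rightarrow\; \frac{10^{5}\,(10^{5}-1)}{2} = 4{,}999{,}950{,}000 \approx 5 \times 10^{9}\ \text{pairs}.
```

Doubling the cohort therefore roughly quadruples the number of pairwise comparisons.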

Fast PCA of very large samples in linear time. K. J. Galinsky, P. Loh, G. Bhatia, S. Georgiev, S. Mukherjee, N. J. Patterson, A. L. Price.

Principal components analysis (PCA) is an effective tool for inferring population structure and correcting for population stratification in genetic data. Traditionally, PCA runs in O(MN²+N³) time, where M is the number of variants and N is the number of samples. Here, we describe a new algorithm, fastpca, for approximating the top K PCs that runs in time O(MNK), making use of recent advances in random low-rank matrix approximation algorithms (Rokhlin et al. 2009). fastpca avoids computing the GRM and associated computational and memory storage costs, enabling PCA of very large datasets on standard hardware. We estimated the top 10 PCs of the WTCCC dataset (16k samples, 101k variants) in roughly 7 minutes while consuming 1GB of RAM, compared to 1 hour and 2.5GB for PLINK2. The fastpca approximation was extremely accurate (r²>99% between all fastpca and PLINK2 PCs). The improvement in running time becomes even larger at larger sample sizes; for example, fastpca estimated the top 10 PCs of a simulated data set with 100k samples and 300k variants in 135 minutes and 8.5GB of RAM, vs. an estimated 350 hours and 85GB of RAM using PLINK2. A recently published O(MN²) time method, flashpca, did not complete on this data set due to exceeding 40GB memory requirement. All of these analyses were based on LD-pruning SNPs with r²>0.2, which leads to much more accurate PCs in simulations as compared to retaining all SNPs; more complex LD-adjustment strategies provide only a small further improvement.
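The idea of approximating the top K PCs without forming the N×N relationship matrix can be sketched with a generic randomized range finder followed by a small SVD. This is not the fastpca implementation: the function name, the oversampling amount, and the number of power iterations are assumptions, and the input is a plain samples-by-variants NumPy array rather than a genotype file.

```python
import numpy as np


def randomized_pca(X, k, n_oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k principal components of X (samples x variants).

    Each product with X costs O(M * N * (k + n_oversample)), so the N x N
    matrix X X^T is never formed. Parameter defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)              # center each variant (column)
    n_variants = Xc.shape[1]
    width = k + n_oversample

    # Random projection, then an orthonormal basis for the sampled range.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((n_variants, width)))

    # Power iterations sharpen the basis towards the leading singular
    # vectors; re-orthonormalising at every step keeps this stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Exact SVD of the small (width x M) projected matrix.
    B = Q.T @ Xc
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    scores = (Q @ U_small)[:, :k] * s[:k]    # per-sample PC coordinates
    return scores, s[:k]


# Toy data with three strong axes of variation plus noise.
rng = np.random.default_rng(1)
latent = rng.standard_normal((60, 3)) * np.array([30.0, 20.0, 10.0])
X = latent @ rng.standard_normal((3, 200)) + 0.01 * rng.standard_normal((60, 200))
pcs, singular_values = randomized_pca(X, k=3)
print(pcs.shape)
```

On data with a clear gap between the leading and trailing singular values, as in the toy example, the approximate singular values should agree closely with those from a full SVD.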

Fast detection of IBD segments associated with quantitative traits in genome-wide association studies. Z. Wang, E. Kang, B. Han, S. Snir, E. Eskin.

Recently, many methods have been developed to detect the identity-by-descent (IBD) segments between a pair of individuals. These methods are able to detect very small shared IBD segments between a pair of individuals up to 2 centimorgans in length. This IBD information can be used to identify recent rare mutations associated with a phenotype of interest. Previous approaches for IBD association were applicable to case/control phenotypes. In this work, we propose a novel and natural statistic for the IBD association testing, which can be applied to quantitative traits. A drawback of the statistic is that it requires a large number of permutations to assess the significance of the association, which can be a great computational challenge. We make a connection between the proposed statistic and linear models so that it does not require permutations to assess the significance of an association. In addition, our method can control population structure by utilizing linear mixed models.

Long-range haplotype mapping in Hispanic/Latinos reveals loci for short stature. G. Belbin, D. Ruderfer, K. Slivinski, M.C. Yee, J. Jeff, O. Gottesman, E.A. Stahl, R.J.F. Loos, E.P. Bottinger, E.E. Kenny.

The Hispanic/Latino (HL) population of Northern Manhattan represents a diverse recent diaspora population, with 95% of the individuals reporting having grandparents born outside of the United States. Of these, 43% report grandparents born in Puerto Rico, 23% the Dominican Republic, 13% Central America, and 5%, 4%, and 2% from Mexico, South America, and Europe respectively. Despite complex patterns of migration, admixture, and diversity, strong signatures of cryptic relatedness persist amongst HLs. We have detected long-range genomic tract sharing (>3cM), or identity-by-descent (IBD), across 5,194 HL in the Mount Sinai BioMe Biobank. We observed an average population level IBD sharing of 0.0025 in HL, which is 2.5- and 5-fold higher than that observed in BioMe European- and African-American populations, respectively. We hypothesize that these patterns of recent migration and genetic drift may drive some otherwise rare functional alleles to detectable frequency. We clustered groups of homologous IBD tracts (n=112,250) segregating in this HL population. We observed that IBD clusters represent a class of low frequency alleles (median minor allele frequency = 0.0077, s.d. = 0.0015). We performed a genome-wide association of the IBD clusters, or ‘population-based linkage’, to detect loci implicated in height, a highly heritable polygenic trait. 15 independent loci surpassed our empirically derived genome-wide significance threshold of less than 4.47×10⁻⁴, 11 of which replicated in an independent cohort of BioMe HLs. Strikingly, two regions confer strong recessive effects. In the case of the top hit on 9q32 (MAF less than 0.005; p less than 8×10⁻⁶), homozygous non-referent individuals were shorter by 6” or 10”, for men or women, respectively, compared to the population mean (5’ 7” and 5’ 2” for men and women, respectively). In addition, IBD haplotypes in the 9q32 cluster harbored a significant enrichment of Native American ancestry (p less than 1×10⁻¹⁶). 
Finally, this interval contains a number of biologically compelling candidate genes, including COL27A1 and PALM2. This study demonstrates that rich population structure, rather than being a confounding factor in biomedical discovery efforts, may be leveraged to reveal novel genetic associations with complex human traits.

A haplotype reference panel of over 31,000 individuals and next-generation imputation methods. S. Das, on behalf of Haplotype Reference Consortium.

Genotype imputation is now a key tool in the analysis of human genetic studies, enabling array-based genetic association studies to examine the millions of variants that are being discovered by advances in whole genome sequencing. Examining these variants increases power and resolution of genetic association studies and makes it easier to compare the results of studies conducted using different arrays. Genotype imputation improves in accuracy with increasing numbers of sequenced samples, particularly for low frequency variants. The goal of the Haplotype Reference Consortium is to combine haplotype information from ongoing whole genome sequencing studies to create a large imputation resource. To date, we have collected information on >31,500 sequenced whole genomes, aggregated over 20 studies of predominantly European ancestry, to create a very large reference panel of human haplotypes where ~50M genetic variants are observed 5 or more times. These haplotypes can be used to guide genotype imputation and haplotype estimation. In preliminary empirical evaluations, our panel provides substantial increases in accuracy relative to the 1000 Genomes Project Phase 1 reference panel and other smaller panels, particularly for variants with frequency less than 5%. I will describe our evaluation of strategies for merging haplotypes and variant lists across studies and advances in methods for genotype likelihood-based haplotype estimation that can be applied to 10,000s of samples. I will also summarize new methods for next generation imputation that perform faster and require less memory than contemporary methods while attaining similar levels of imputation accuracy. Our full resource is available to the community through imputation servers that enable scientists to impute missing variants in any study and respect the privacy of subjects contributing to the studies that constitute the Haplotype Reference Consortium. The majority of haplotypes will also be deposited in the European Genotype Archive.

A rare variant local haplotype sharing method with application to admixed populations. S. Hooker, G. T. Wang, B. Li, Y. Guan, S. M. Leal.

With the advent of next generation sequencing there is great interest in studying the involvement of rare variants in complex trait etiology. For many complex traits sequence data is being generated on DNA samples from African Americans and Hispanics to elucidate rare variant associations. Analyses of admixed populations present special challenges due to spurious associations which can occur because of confounding. However using information on admixture and local ancestry can also be highly beneficial and increase the power to detect associations in these populations. Here a local haplotype sharing (LHS) method (Xu and Guan 2014) was extended to test for rare variant (RV) associations in admixed populations. Previously the Weighted Haplotype and Imputation-based Test (WHAIT) (Li et al. 2010) was proposed to test for rare variant associations using haplotype data. The RV-LHS method unlike WHAIT, does not require reconstruction of haplotypes which can be both computationally intensive and error prone. Additionally the RV-LHS uses information on local ancestry which is particularly advantageous when analyzing admixed populations. Results will be shown from simulation studies performed for rare variant data from an admixed population. Both Type I and II errors are evaluated for the RV-LHS method. Additionally the power of the RV-LHS method is compared to WHAIT as well as several other non-haplotype-based rare variant association methods including the combined multivariate collapsing (CMC) (Li and Leal, 2008), Variable Threshold (VT) (Price et al. 2010) and Sequence Kernel Association Test (SKAT) (Wu et al. 2010). Several heart, lung and blood phenotypes were analyzed using sequence data on African-Americans from the NHLBI-Exome Sequencing Project to better evaluate the performance of the RV-LHS compared to other rare variant association methods.

August 21, 2014

Tuberculosis is 6,000 years old

... and sea mammals (not Europeans) introduced it to the New World.

Nature (2014) doi:10.1038/nature13591

Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis

Kirsten I. Bos et al.

Modern strains of Mycobacterium tuberculosis from the Americas are closely related to those from Europe, supporting the assumption that human tuberculosis was introduced post-contact. This notion, however, is incompatible with archaeological evidence of pre-contact tuberculosis in the New World. Comparative genomics of modern isolates suggests that M. tuberculosis attained its worldwide distribution following human dispersals out of Africa during the Pleistocene epoch, although this has yet to be confirmed with ancient calibration points. Here we present three 1,000-year-old mycobacterial genomes from Peruvian human skeletons, revealing that a member of the M. tuberculosis complex caused human disease before contact. The ancient strains are distinct from known human-adapted forms and are most closely related to those adapted to seals and sea lions. Two independent dating approaches suggest a most recent common ancestor for the M. tuberculosis complex less than 6,000 years ago, which supports a Holocene dispersal of the disease. Our results implicate sea mammals as having played a role in transmitting the disease to humans across the ocean.

Link

July 26, 2014

Ancestry of Cubans

PLoS Genet 10(7): e1004488. doi:10.1371/journal.pgen.1004488

Cuba: Exploring the History of Admixture and the Genetic Basis of Pigmentation Using Autosomal and Uniparental Markers

Beatriz Marcheco-Teruel et al.

We carried out an admixture analysis of a sample comprising 1,019 individuals from all the provinces of Cuba. We used a panel of 128 autosomal Ancestry Informative Markers (AIMs) to estimate the admixture proportions. We also characterized a number of haplogroup diagnostic markers in the mtDNA and Y-chromosome in order to evaluate admixture using uniparental markers. Finally, we analyzed the association of 16 single nucleotide polymorphisms (SNPs) with quantitative estimates of skin pigmentation. In the total sample, the average European, African and Native American contributions as estimated from autosomal AIMs were 72%, 20% and 8%, respectively. The Eastern provinces of Cuba showed relatively higher African and Native American contributions than the Western provinces. In particular, the highest proportion of African ancestry was observed in the provinces of Guantánamo (40%) and Santiago de Cuba (39%), and the highest proportion of Native American ancestry in Granma (15%), Holguín (12%) and Las Tunas (12%). We found evidence of substantial population stratification in the current Cuban population, emphasizing the need to control for the effects of population stratification in association studies including individuals from Cuba. The results of the analyses of uniparental markers were concordant with those observed in the autosomes. These geographic patterns in admixture proportions are fully consistent with historical and archaeological information. Additionally, we identified a sex-biased pattern in the process of gene flow, with a substantially higher European contribution from the paternal side, and higher Native American and African contributions from the maternal side. This sex-biased contribution was particularly evident for Native American ancestry. Finally, we observed that SNPs located in the genes SLC24A5 and SLC45A2 are strongly associated with melanin levels in the sample.

Link

June 15, 2014

Genetic structure of Mexico

This article is free to read with registration.

Science 13 June 2014:
Vol. 344 no. 6189 pp. 1280-1285

The genetics of Mexico recapitulates Native American substructure and affects biomedical traits

Andrés Moreno-Estrada et al.

Mexico harbors great cultural and ethnic diversity, yet fine-scale patterns of human genome-wide variation from this region remain largely uncharacterized. We studied genomic variation within Mexico from over 1000 individuals representing 20 indigenous and 11 mestizo populations. We found striking genetic stratification among indigenous populations within Mexico at varying degrees of geographic isolation. Some groups were as differentiated as Europeans are from East Asians. Pre-Columbian genetic substructure is recapitulated in the indigenous ancestry of admixed mestizo individuals across the country. Furthermore, two independently phenotyped cohorts of Mexicans and Mexican Americans showed a significant association between subcontinental ancestry and lung function. Thus, accounting for fine-scale ancestry patterns is critical for medical and population genetic studies within Mexico, in Mexican-descent populations, and likely in many other populations worldwide.

Link

May 16, 2014

mtDNA D1 from 12-13 thousand year old Paleoamerican

This is interesting both because it's a >12,000 year old skeleton from the bottom of the sea (!) and because it establishes that an individual with clear Paleoamerican morphology belonged to a common modern Amerindian mtDNA haplogroup. Together with the recent publication of the Anzick genome, it seems that everything points towards continuity of Native Americans since the earliest settlement, rather than a more recent arrival of the ancestors of Native Americans that replaced an earlier "Paleoamerican" gene pool.

Science 16 May 2014: Vol. 344 no. 6185 pp. 750-754 DOI: 10.1126/science.1252619

Late Pleistocene Human Skeleton and mtDNA Link Paleoamericans and Modern Native Americans

James C. Chatters et al.

Because of differences in craniofacial morphology and dentition between the earliest American skeletons and modern Native Americans, separate origins have been postulated for them, despite genetic evidence to the contrary. We describe a near-complete human skeleton with an intact cranium and preserved DNA found with extinct fauna in a submerged cave on Mexico’s Yucatan Peninsula. This skeleton dates to between 13,000 and 12,000 calendar years ago and has Paleoamerican craniofacial characteristics and a Beringian-derived mitochondrial DNA (mtDNA) haplogroup (D1). Thus, the differences between Paleoamericans and Native Americans probably resulted from in situ evolution rather than separate ancestry.

Link

March 04, 2014

Admixture in US populations

An interesting blog post from 23andMe:

In an update to that work, our researcher Kasia Bryc found that about 4 percent of whites have at least 1 percent or more African ancestry.

Although it is a relatively small percentage, the percentage indicates that an individual with at least 1 percent African ancestry had an African ancestor within the last six generations, or in the last 200 years. This data also suggests that individuals with mixed parentage at some point were absorbed into the white population.

Looking a little more deeply into the data, Kasia also found that the percentage of whites with hidden African ancestry differed significantly from state to state. Southern states with the highest African American populations tended to have the highest percentages of hidden African ancestry. In South Carolina at least 13 percent of self-identified whites have 1 percent or more African ancestry, while in Louisiana the number is a little more than 12 percent. In Georgia and Alabama the number is about 9 percent. The differences perhaps point to different social and cultural histories within the South.

and:

Previous published studies estimate that on average African Americans had about 82 percent African ancestry and about 18 percent European ancestry. But in self-identified African Americans in 23andMe’s database, Kasia found the average amount of African ancestry was closer to 73 percent.

I don't think that is necessarily the average percentage in the general African American population as the subset of African Americans who take 23andMe tests may not be representative (e.g., it may come more from cities where African Americans may have more opportunity to admix with European Americans).

and:

On average Latinos had about 70 percent European ancestry, 14 percent Native American ancestry and 6 percent African ancestry. The remaining ancestry is difficult to assign because the DNA is either shared by a number of different populations around the world, or because it’s from understudied populations, such as Native Americans. Obviously that large “unassigned” percentage means that those “averages” could be higher. As with African Americans, looking at the regional and state-to-state numbers for self-identified Latinos, the differences are striking.

...

For example, some Latinos have no discernible Native American ancestry, while others have as much as 50 percent Native American ancestry. Latinos in states in the Southwest, bordering Mexico (New Mexico, Texas, California and Arizona), have the greatest percentage of Native American ancestry. Latinos in states with the largest proportion of African Americans in their population (South Carolina, Louisiana and Alabama) have the highest percentage of African ancestry.

23andMe may have a couple of orders of magnitude more sampled individuals than anything that appears in most published studies and it's great to see this being put to good use.

It'd be great if someone at 23andMe did some more analyses over their huge database. I can only imagine what a flashPCA with half a million individuals from around the world would look like; even if it told us nothing new about human history it would be quite a cool picture to look at.
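As a back-of-the-envelope check on the "six generations, or 200 years" figure quoted above, here is a minimal sketch. It assumes that a single ancestor contributes, on average, half as much of the autosomal genome with each generation back, and a generation time of 30 years; both the 30-year figure and the function names are my own assumptions, not anything from the 23andMe post. Actual inherited fractions vary around the expectation because of recombination, and ancestry from several more distant ancestors can add up to the same total.

```python
def expected_fraction(generations_back: int) -> float:
    """Expected share of the autosomal genome inherited from one ancestor
    who lived `generations_back` generations ago (1 = parent)."""
    return 0.5 ** generations_back


def max_generations(threshold: float) -> int:
    """Largest number of generations back at which a single ancestor is
    still expected to contribute at least `threshold` of the genome."""
    g = 0
    while expected_fraction(g + 1) >= threshold:
        g += 1
    return g


if __name__ == "__main__":
    YEARS_PER_GENERATION = 30  # assumed generation time, not from the post
    for g in range(1, 9):
        print(f"{g} generations back: {expected_fraction(g):.2%}")
    g = max_generations(0.01)
    print(f"1% threshold reached at {g} generations, "
          f"about {g * YEARS_PER_GENERATION} years")
```

Under these assumptions a single ancestor six generations back is expected to contribute about 1.6 percent, and one seven generations back about 0.8 percent, which is where the six-generation cutoff for "at least 1 percent" comes from.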

February 12, 2014

Ancient Clovis genome from Montana yields no surprises (Rasmussen et al. 2014)

Ancient DNA has consistently managed to surprise us, with pretty much no direct genetic continuity revealed between Pleistocene and modern populations anywhere in the world. So, it is refreshing to see that at least in the case of the Americas the people who lived there ~13 thousand years ago are clearly related to the people who lived there in pre-Columbian times, with no real evidence of subsequent gene flows from Eurasia (at least in the case of Central/South Americans).

Many people suspected this because of the difficulty of reaching the Americas from Eurasia: this must have limited gene flow between the two regions to a handful of migrants and a restricted set of time periods when geological and climatic conditions were advantageous. The much reduced genetic diversity of Native Americans also argues in favor of them being a relatively simple population, with low heterozygosity and a handful of unique "founder lineages" in both the Y-chromosome and mtDNA.

Nonetheless, there are also several theories in the realm of alternative history, involving Solutreans from Europe, trans-Pacific boat riders, bearded "White Gods", Minoans/Phoenicians/Atlanteans/Ancient Egyptians, "African" Olmecs, "Caucasoid" Paleo-Indians, lost Israelite tribes, to mention only a few of the most well-known ones.

The new study does not, of course, disprove any of the proposals in the preceding paragraph: one can still claim that diverse groups once inhabited the Americas and Rasmussen et al. (2014) just happened to chance upon one that looked just like modern native Americans. But, this certainly improves the odds of early "Native American simplicity", offering no evidence for the complexity postulated by many of the alternative theories.

Moreover, while the existence of other human groups in the Americas cannot be disproved by the study of a single ancient individual, what can be proved is the antiquity of the ancestors of Native Americans. Rather than being late arrivals from Asia after the initial colonization, perhaps with derived Mongoloid physical morphology, we now know that they were already there as early as ~13 thousand years ago. It is remarkable that a single ancient DNA sample can sweep away much of the nonsense that has been written on the topic in the past.

A piece in Nature News addresses some of the "ethics" debate that seems ever-present in studies involving Native American remains. I don't know how this study will be perceived by living Native Americans: a possibility is that they'll be more receptive to ancient DNA research now that a team of scientists has stretched the time depth of their ancestry in the Americas to the earliest studied sample, revealing themselves not to be the evil-doers that western scientists are generally assumed to be according to a certain kind of mentality. A different, and more alarming, possibility is that radical anti-science elements will be emboldened by these findings to claim that continuity with the earliest Americans (which in itself seems true enough) adds support to claims of ownership to pretty much all archaeological samples whose relationship to living Amerindians was hitherto uncertain in light of the many alternative theories.

In any case, it is remarkable that this ~13 thousand year old genome now exists while the genomes of modern native Americans that can be had for a fraction of the cost and technical difficulty do not. Indeed, not even genotype data exist from most Amerindian groups from the USA, which creates the rather bizarre state of affairs that the Anzick-1 genome had to be compared with native groups from several countries in the Western hemisphere except the one in which it was found.

Nature 506, 225–229 (13 February 2014) doi:10.1038/nature13025

The genome of a Late Pleistocene human from a Clovis burial site in western Montana

Morten Rasmussen et al.

Clovis, with its distinctive biface, blade and osseous technologies, is the oldest widespread archaeological complex defined in North America, dating from 11,100 to 10,700 14C years before present (BP) (13,000 to 12,600 calendar years BP). Nearly 50 years of archaeological research point to the Clovis complex as having developed south of the North American ice sheets from an ancestral technology. However, both the origins and the genetic legacy of the people who manufactured Clovis tools remain under debate. It is generally believed that these people ultimately derived from Asia and were directly related to contemporary Native Americans. An alternative, Solutrean, hypothesis posits that the Clovis predecessors emigrated from southwestern Europe during the Last Glacial Maximum. Here we report the genome sequence of a male infant (Anzick-1) recovered from the Anzick burial site in western Montana. The human bones date to 10,705 ± 35 14C years BP (approximately 12,707–12,556 calendar years BP) and were directly associated with Clovis tools. We sequenced the genome to an average depth of 14.4× and show that the gene flow from the Siberian Upper Palaeolithic Mal’ta population into Native American ancestors is also shared by the Anzick-1 individual and thus happened before 12,600 years BP. We also show that the Anzick-1 individual is more closely related to all indigenous American populations than to any other group. Our data are compatible with the hypothesis that Anzick-1 belonged to a population directly ancestral to many contemporary Native Americans. Finally, we find evidence of a deep divergence in Native American populations that predates the Anzick-1 individual.

Link

December 27, 2013

Reconstructing Native American migrations

Of wider interest might be the authors' estimation of the autosomal mutation rate as 1.44×10⁻⁸ mutations/bp/generation. Of course, this might depend on the archaeological calibration used (where/when did the bottleneck in the ancestry of Native Americans occur?). It might also depend on recent evidence that Native Americans are of mixed origin and thus did not really split from CHB/JPT; only part of their ancestry did. Nonetheless, this is another fairly "low" autosomal mutation rate.

(This was previously released as a preprint on the arXiv).

PLoS Genet 9(12): e1004023. doi:10.1371/journal.pgen.1004023

Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data

Simon Gravel et al.

Link