December 30, 2012

Trojan pottery across the Bronze/Iron Age boundary

Before the sack of Troy, the city looked east towards the powerful Hittite Empire. But this political powerhouse collapsed around the time that Troy was destroyed. Grave says the post-conflict pottery is Balkan in style because the Trojans were keen to align themselves with the people there, who had become the new political elite and powerbase in the region.

The collapse of the Late Bronze Age political and economic structures of the eastern Mediterranean undercut elite production spheres serving this network at Troy.

The people of Early Iron Age Troy shifted their focus to elaborating their own household ceramic traditions to re-establish their role in newly configured social and economic networks that now looked to the Balkans rather than Hittite Anatolia.

Journal of Archaeological Science Available online 12 November 2012

Cultural dynamics and ceramic resource use at Late Bronze Age/Early Iron Age Troy, northwestern Turkey

Peter Grave et al.

Changes in resource use over time can provide insight into technological choice and the extent of long term stability in cultural practices. In this paper we re-evaluate the evidence for a marked demographic shift at the inception of the Early Iron Age at Troy by applying a robust macroscale analysis of changing ceramic resource use over the Late Bronze and Iron Age. We use a combination of new and legacy analytical datasets (NAA and XRF), from excavated ceramics, to evaluate the potential compositional range of local resources (based on comparisons with sediments from within a 10 kilometer site radius). Results show a clear distinction between sediment-defined local and non-local ceramic compositional groups. Two discrete local ceramic resources have been previously identified and we confirm a third local resource for a major class of EIA handmade wares and cooking pots. This third source appears to derive from a residual resource on the Troy peninsula (rather than adjacent alluvial valleys). The presence of a group of large and heavy pithoi among the non-local groups raises questions about their regional or maritime origin.


December 28, 2012

Estonian Biocentre public data

The Estonian Biocentre (EBC) have put up all their free data in one convenient page. I have used most of these in my own experiments, and I must say that their public availability has been instrumental in enabling the type of "genome blogging" that I and others have engaged in over the last few years.

I have hitherto used some EBC data from GEO and some downloaded from the EBC site itself, so this is a good opportunity to rebuild all my datasets from a single source; as a bonus, the data has been lifted to build37/hg19, so I finally gave in and started to lift all my other datasets as well, using liftOver and the appropriate chain file.

FTDNA draft Y-chromosome phylogeny

A draft of the Y-chromosome phylogeny, including the newly discovered basal A00 clade, has been posted by FTDNA. Hopefully some progress can be made in the F portion of the tree, where currently there are subclades F1, F2, F3, G, H, and IJK. Determining the bifurcation structure within F will doubtless be instrumental in informing our understanding of the dispersal of F descendants, which, according to the age of this major Eurasian haplogroup, are closely linked to the Upper Paleolithic event in Eurasia.

Investing in whole-genome sequencing of one individual from each of these clades would be very helpful in determining this structure. Of course, there are already the Complete Genomics data which include haplogroup G and various IJK descendants, so we now need to identify some F1, F2, F3, and H samples and give them the WGS treatment.

December 27, 2012

Zoogeographic map of the world

I have written informally about the Sahara-Arabia belt in conjunction with my "two deserts" theory of modern human origins (=pre-100kya in North Africa, post-70kya from Arabia), so it's nice to see that it corresponds to some real zoogeographic entity derived from the distribution of thousands of species. So, perhaps an early evolution of modern humans in that area, followed by their dispersal and admixture with other hominins living in the Palearctic and Afrotropical regions might make sense.

Science DOI: 10.1126/science.1228282

An Update of Wallace's Zoogeographic Regions of the World

Ben G. Holt et al.

Modern attempts to produce biogeographic maps focus on the distribution of species and are typically drawn without phylogenetic considerations. Here, we generate a global map of zoogeographic regions by combining data on the distributions and phylogenetic relationships of 21,037 species of amphibians, birds, and mammals. We identify 20 distinct zoogeographic regions, which are grouped into 11 larger realms. We document the lack of support for several regions previously defined based on distributional data and show that spatial turnover in the phylogenetic composition of vertebrate assemblages is higher in the Southern than in the Northern Hemisphere. We further show that the integration of phylogenetic information provides valuable insight on historical relationships among regions, permitting the identification of evolutionarily unique regions of the world.


December 26, 2012

Variance of IBD sharing (Carmi et al. 2012)

Genetics doi: 10.1534/genetics.112.147215

The Variance of Identity-by-Descent Sharing in the Wright-Fisher Model

Shai Carmi et al.

Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of populations that have experienced recent genetic drift. Detection of these IBD segments has recently become feasible, enabling a wide range of applications from phasing and imputation to demographic inference. Here, we study the distribution of IBD sharing in the Wright-Fisher model. Specifically, using coalescent theory, we calculate the variance of the total sharing between random pairs of individuals. We then investigate the cohort-averaged sharing: the average total sharing between one individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing is distributed approximately normally and, surprisingly, the variance of this distribution does not vanish even for large cohorts, implying the existence of "hyper-sharing" individuals. The presence of such individuals has consequences for the design of sequencing studies, since, if they are selected for whole-genome sequencing, a larger fraction of the cohort can be subsequently imputed. We calculate the expected gain in power of imputation by IBD, and subsequently, in power to detect an association, when individuals are either randomly selected or are specifically chosen to be the hyper-sharing individuals. Using our framework, we also compute the variance of an estimator of the population size that is based on the mean IBD sharing and the variance in the sharing between inbred siblings. Finally, we study IBD sharing in an admixture pulse model, and show that in the Ashkenazi Jewish population the admixture fraction is correlated with the cohort-averaged sharing.
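
To make the "cohort-averaged sharing" of the abstract concrete, here is a minimal Python sketch. The toy sharing matrix, the function names, and the 1.5 standard deviation cutoff for calling someone "hyper-sharing" are my own illustrative assumptions, not anything taken from the paper.

```python
from statistics import mean, pstdev

def cohort_averaged_sharing(sharing):
    """For each individual, average total IBD sharing with everyone else.

    `sharing` is a symmetric matrix (list of lists); sharing[i][j] is the
    total IBD length shared by individuals i and j. The diagonal is ignored.
    """
    n = len(sharing)
    return [
        sum(sharing[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def hyper_sharers(sharing, z_cutoff=1.5):
    """Indices whose cohort-averaged sharing exceeds mean + z_cutoff * sd.

    The cutoff is an arbitrary illustrative choice.
    """
    avg = cohort_averaged_sharing(sharing)
    mu, sd = mean(avg), pstdev(avg)
    return [i for i, a in enumerate(avg) if sd > 0 and a > mu + z_cutoff * sd]

# Hypothetical toy data (cM): individual 0 shares a lot with everyone.
toy = [
    [0, 30, 28, 32, 31],
    [30, 0, 5, 6, 4],
    [28, 5, 0, 5, 6],
    [32, 6, 5, 0, 5],
    [31, 4, 6, 5, 0],
]
averages = cohort_averaged_sharing(toy)
top = hyper_sharers(toy)
```

In this made-up cohort, individual 0 is the one you would want to sequence first, since it shares the most with everyone else.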


“Mismodelling Indo-European Origins” Talk

Martin Lewis and Asya Pereltsvaig have been critical of the recent paper on Indo-European origins on the GeoCurrents blog, and they recently gave a talk at Stanford on the topic.

Some relevant past posts:

I think that the Indo-European question has been debated for more than two centuries without any clear resolution. Over the next few years, I think that one of two things will occur:

  • A clear unambiguous pattern of expansion mimicking the IE dispersal will appear in ancient DNA, providing the "smoking gun" for one of the different hypotheses.
  • No such Eurasian-wide pattern will emerge, and it will turn out that Indo-Europeanization was effected with minimum dispersal of populations.
I suspect the former will be the case, but it will nonetheless be interesting to see how the different parties coming from archaeology and linguistics will react to the (archaeo)genetic avalanche that will doubtlessly provide us with new information about the prehistoric past.

December 21, 2012

Estimating heterozygosity from low coverage sequencing data

arXiv:1212.4125 [q-bio.PE]

Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

Katarzyna Bryc, Nick Patterson, David Reich

High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual is limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step calling genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first by its performance on simulated sequence data, and secondly on real sequence data where we obtain estimates using low coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse world-wide populations, and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show filters can correct for the confounding by sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates, and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher coverage data.
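
A quick back-of-the-envelope calculation shows why low coverage blocks the direct counting of heterozygous sites. Assume error-free reads that each sample either allele with probability 1/2 (idealizations of mine; the paper models sequencing error and reference bias explicitly). A truly heterozygous site covered by n reads then shows both alleles with probability 1 - 2(1/2)^n:

```python
def prob_both_alleles_seen(n_reads):
    """Chance that n error-free reads at a heterozygous site show both alleles.

    Assumes each read samples either allele with probability 1/2.
    All reads agree with probability 2 * (1/2)**n, so the complement is returned.
    """
    if n_reads < 1:
        return 0.0
    return 1.0 - 2.0 * 0.5 ** n_reads

for n in (1, 2, 4, 8):
    print(n, prob_both_alleles_seen(n))
```

Under these assumptions, a heterozygous site covered by only two reads looks homozygous half of the time, so naive counting at low coverage badly underestimates heterozygosity.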


Y chromosome of Ramesses III

From the paper:
Genetic kinship analyses revealed identical haplotypes in both mummies (table 1); using the Whit Athey’s haplogroup predictor, we determined the Y chromosomal haplogroup E1b1a.

Ethiohelix has more.
Added to my compendium of ancient Y chromosome studies.

BMJ 2012; 345 doi:

Revisiting the harem conspiracy and death of Ramesses III: anthropological, forensic, radiological, and genetic study


Objective To investigate the true character of the harem conspiracy described in the Judicial Papyrus of Turin and determine whether Ramesses III was indeed killed.

Design Anthropological, forensic, radiological, and genetic study of the mummies of Ramesses III and unknown man E, found together and taken from the 20th dynasty of ancient Egypt (circa 1190-1070 BC).

Results Computed tomography scans revealed a deep cut in Ramesses III’s throat, probably made by a sharp knife. During the mummification process, a Horus eye amulet was inserted in the wound for healing purposes, and the neck was covered by a collar of thick linen layers. Forensic examination of unknown man E showed compressed skin folds around his neck and a thoracic inflation. Unknown man E also had an unusual mummification procedure. According to genetic analyses, both mummies had identical haplotypes of the Y chromosome and a common male lineage.

Conclusions This study suggests that Ramesses III was murdered during the harem conspiracy by the cutting of his throat. Unknown man E is a possible candidate as Ramesses III’s son Pentawere.


New isolates from Friuli-Venezia Giulia region

European Journal of Human Genetics advance online publication 19 December 2012; doi: 10.1038/ejhg.2012.229

Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity

Tõnu Esko et al.

Population genetic studies on European populations have highlighted Italy as one of genetically most diverse regions. This is possibly due to the country’s complex demographic history and large variability in terrain throughout the territory. This is the reason why Italy is enriched for population isolates, Sardinia being the best-known example. As the population isolates have a great potential in disease-causing genetic variants identification, we aimed to genetically characterize a region from northeastern Italy, which is known for isolated communities. Total of 1310 samples, collected from six geographically isolated villages, were genotyped at >145 000 single-nucleotide polymorphism positions. Newly genotyped data were analyzed jointly with the available genome-wide data sets of individuals of European descent, including several population isolates. Despite the linguistic differences and geographical isolation the village populations still show the greatest genetic similarity to other Italian samples. The genetic isolation and small effective population size of the village populations is manifested by higher levels of genomic homozygosity and elevated linkage disequilibrium. These estimates become even more striking when the detected substructure is taken into account. The observed level of genetic isolation in Friuli-Venezia Giulia region is more extreme according to several measures of isolation compared with Sardinians, French Basques and northern Finns, thus proving the status of an isolate.


Armenian origin of Hamshenis

Hum Biol. 2012 Aug;84(4):405-22

Paternal lineage analysis supports an Armenian rather than a Central Asian genetic origin of the Hamshenis.

Authors: Margaryan A, Harutyunyan A, Khachatryan Z, Khudoyan A, Yepiskoposyan L


The Hamshenis are an isolated geographic group of Armenians with a strong ethnic identity who, until the early decades of the twentieth century, inhabited the Pontus area on the southern coast of the Black Sea. Scholars hold alternative views on their origin, proposing Eastern Armenia, Western Armenia, and Central Asia, respectively, as their most likely homeland. To ascertain whether genetic data from the nonrecombining portion of the Y chromosome are supportive of any of these suggestions, we screened 82 Armenian males of Hamsheni descent for 12 biallelic and 6 microsatellite Y-chromosomal markers. These data were compared with the corresponding data set from the representative populations of the three candidate regions. Genetic difference between the Hamshenis and other groups is significant and backs up the hypothesis of the Armenian origin of the Hamshenis, indicating central historical Armenia as a homeland of the ancestral population. This inference is further strengthened by the results of admixture analysis, which does not support the Central-Asian hypothesis of the Hamshenis' origin. Genetic diversity values and patterns of genetic distances suggest a high degree of genetic isolation of the Hamshenis consistent with their retention of a distinct and ancient dialect of the Armenian language.


Complex speciation in primates (Mailund et al. 2012)

PLoS Genet 8(12): e1003125. doi:10.1371/journal.pgen.1003125

A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species

Thomas Mailund et al.

We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.


Q2 mtDNA haplogroup in Oceania

PLoS ONE 7(12): e52022. doi:10.1371/journal.pone.0052022

The Q2 Mitochondrial Haplogroup in Oceania

Chris A. Corser et al.

Many details surrounding the origins of the peoples of Oceania remain to be resolved, and as a step towards this we report seven new complete mitochondrial genomes from the Q2a haplogroup, from Papua New Guinea, Fiji and Kiribati. This brings the total to eleven Q2 genomes now available. The Q haplogroup (that includes Q2) is an old and diverse lineage in Near Oceania, and is reasonably common; within our sample set of 430, 97 are of the Q haplogroup. However, only 8 are Q2, and we report 7 here. The tree with all complete Q genomes is proven to be minimal. The dating estimate for the origin of Q2 (around 35 Kya) reinforces the understanding that humans have been in Near Oceania for tens of thousands of years; nevertheless the Polynesian maternal haplogroups remain distinctive. A major focus now, with regard to Polynesian ancestry, is to address the differences and timing of the ‘Melanesian’ contribution to the maternal and paternal lineages as people moved further and further into Remote Oceania. Input from other fields such as anthropology, history and linguistics is required for a better understanding and interpretation of the genetic data.


December 18, 2012

Genographic GenoChip paper (Elhaik et al. 2012)

... has been posted on the arXiv. I don't have time to comment on it at the moment, and any further thoughts will be posted as an update here. By the way, thanks to the authors for putting me in the acknowledgements section :)

On a related note, I have released a patch for Geno 2.0 data so that they can be used with my DIYDodecad tools. I have converted 3-4 files already using it, so it seems to work fine, but in one file there was a problem because there were a lot of manual line breaks; not sure if this is a general problem or it was caused by the submitter re-saving the file, but if you encounter it, you might want to try saving your .csv file in Unix file format, or using dos2unix to fix it.
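
If you don't have dos2unix at hand, the line-ending conversion can be approximated in a few lines of Python. This is only a sketch of the line-ending fix; the function names are mine, and it will not repair stray line breaks that fall in the middle of a record.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))
```

Calling convert_file with the path of your downloaded .csv and a new output path leaves the original untouched and writes a Unix-format copy.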

arXiv:1212.4116 [q-bio.PE]

The GenoChip: A New Tool for Genetic Anthropology

Eran Elhaik et al.

The Genographic Project is an international effort using genetic data to chart human migratory history. The project is non-profit and non-medical, and through its Legacy Fund supports locally led efforts to preserve indigenous and traditional cultures. In its second phase, the project is focusing on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide SNP genotyping, they were designed for medical genetic studies and contain medically related markers that are not appropriate for global population genetic studies. GenoChip, the Genographic Project's new genotyping array, was designed to resolve these issues and enable higher-resolution research into outstanding questions in genetic anthropology. We developed novel methods to identify AIMs and genomic regions that may be enriched with alleles shared with ancestral hominins. Overall, we collected and ascertained AIMs from over 450 populations. Containing an unprecedented number of Y-chromosomal and mtDNA SNPs and over 130,000 SNPs from the autosomes and X-chromosome, the chip was carefully vetted to avoid inclusion of medically relevant markers. The GenoChip results were successfully validated. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays for three continental populations. While all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The GenoChip is a dedicated genotyping platform for genetic anthropology and promises to be the most powerful tool available for assessing population structure and migration history.
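
For readers unfamiliar with FST, here is a rough sketch of the quantity whose distribution is being compared. It uses the textbook definition FST = (HT - HS) / HT at a single biallelic SNP for two equally weighted populations, averaged over SNPs; I am not claiming this is the estimator used in the paper, and the function names are mine.

```python
def fst_two_pops(p1, p2):
    """Wright's FST at one biallelic SNP from two equally weighted populations.

    p1, p2 are allele frequencies in [0, 1]. Returns 0.0 when the SNP is
    monomorphic overall (HT == 0), where FST is undefined.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)            # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pop
    if h_total == 0:
        return 0.0
    return (h_total - h_within) / h_total

def mean_fst(freqs1, freqs2):
    """Average of per-SNP FST values (one of several averaging conventions)."""
    values = [fst_two_pops(a, b) for a, b in zip(freqs1, freqs2)]
    return sum(values) / len(values)
```

A SNP with frequencies 0.25 and 0.75 gives FST = 0.25, while one with identical frequencies in both populations gives 0; a chip enriched for the former kind of marker will have a higher mean FST.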


December 15, 2012

Genotype imputation via matrix completion (Chi et al. 2012)

Link to Mendel.

Genome Res doi: 10.1101/gr.145821.112

Genotype imputation via matrix completion

Eric C. Chi et al.

Most current genotype imputation methods are model-based and computationally intensive, taking days to impute one chromosome pair on 1000 people. We describe an efficient genotype imputation method based on matrix completion. Our matrix completion method is implemented in Matlab and tested on real data from HapMap3, simulated pedigree data, and simulated low-coverage sequencing data derived from the 1000 Genomes Project. Compared to leading imputation programs, matrix completion as embodied in our program Mendel-Impute achieves comparable imputation accuracy while reducing run times significantly. Implementation in a lower-level language such as Fortran or C is apt to further improve computational efficiency.
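
To give a flavor of what "matrix completion" means, here is a toy Python sketch that fills in missing entries by fitting a rank-1 model with alternating least squares. This is emphatically not the algorithm in Mendel-Impute, and real genotype matrices are not rank 1; it only illustrates the general idea of recovering missing values from low-rank structure. The function name and the toy matrix are my own.

```python
def complete_rank1(matrix, iterations=200):
    """Fill missing entries (None) with a rank-1 fit u[i] * v[j].

    Alternating least squares: fix v and solve for each u[i] over the
    observed entries of row i, then fix u and solve for each v[j].
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            denom = sum(v[j] ** 2 for j in obs)
            if denom > 0:
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / denom
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            denom = sum(u[i] ** 2 for i in obs)
            if denom > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / denom
    # Keep observed entries as they are; only fill the gaps.
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]

# Hypothetical toy matrix whose second row is twice the first.
toy = [[1.0, 2.0], [2.0, None]]
filled = complete_rank1(toy)
```

Because the second row of the toy matrix is twice the first, the missing entry is recovered as approximately 4.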


Selective sweeps from standing variation or new mutation

A very interesting paper that addresses the question of whether a selective sweep proceeds from standing variation (i.e., an allele already exists in the population, perhaps for a long time, and becomes "advantageous" only when it is paired with the right environmental stimulus), or from a new mutation (i.e., the selection pressure begins first, and a new allele appears by mutation and gets positively selected).

This question is of interest to me, because it might help interpret the occurrence of alleles that may be selected in one core region -where, perhaps, the selection pressure is highest, or they've had the most time to increase in frequency- but also occur at low or even trace frequencies in many more regions.

If selection occurs from standing neutral variation, then the occurrence of the allele in a wide geographical region is not particularly noteworthy; presumably the allele occurred at such frequencies in many places, but became selected in a few.

On the other hand, if an allele occurs from de novo mutation, then its low-frequency occurrence outside its core region is evidence of gene flow, and perhaps a recent one. This gene flow may be facilitated by the selection pressure itself (i.e., when people move with the technology, e.g., milk, that creates this pressure in the first place).

PLoS Genet 8(10): e1003011. doi:10.1371/journal.pgen.1003011

Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation

Benjamin M. Peter et al.

An outstanding question in human genetics has been the degree to which adaptation occurs from standing genetic variation or from de novo mutations. Here, we combine several common statistics used to detect selection in an Approximate Bayesian Computation (ABC) framework, with the goal of discriminating between models of selection and providing estimates of the age of selected alleles and the selection coefficients acting on them. We use simulations to assess the power and accuracy of our method and apply it to seven of the strongest sweeps currently known in humans. We identify two genes, ASPM and PSCA, that are most likely affected by selection on standing variation; and we find three genes, ADH1B, LCT, and EDAR, in which the adaptive alleles seem to have swept from a new mutation. We also confirm evidence of selection for one further gene, TRPV6. In one gene, G6PD, neither neutral models nor models of selective sweeps fit the data, presumably because this locus has been subject to balancing selection.
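
For those who have not met Approximate Bayesian Computation before, the simplest variant (rejection ABC) can be sketched in a few lines. The toy below estimates a single binomial probability rather than anything about selective sweeps; the model, the tolerance, and all names are my own assumptions, chosen only to show the simulate-and-compare logic that the paper applies with selection statistics as summaries.

```python
import random

def abc_rejection(observed, n_trials, n_draws=20000, tolerance=2, seed=1):
    """Rejection ABC for a binomial success probability with a uniform prior.

    Draw p from the prior, simulate a dataset, and keep p whenever the
    simulated summary statistic (the success count) is within `tolerance`
    of the observed one. The kept values approximate the posterior.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()                                   # draw from the prior
        simulated = sum(rng.random() < p for _ in range(n_trials))
        if abs(simulated - observed) <= tolerance:         # compare summaries
            accepted.append(p)
    return accepted

# Hypothetical observation: 30 successes in 100 trials.
posterior = abc_rejection(observed=30, n_trials=100)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values cluster around 0.3, as they should. Model choice works the same way in principle: simulate under each competing model and see which one produces summaries close to the data more often.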


Local Origin of Domestic Pigs in the Upstream Region of the Yangtze River (Jin et al. 2012)

PLoS ONE 7(12): e51649. doi:10.1371/journal.pone.0051649

Mitochondrial DNA Evidence Indicates the Local Origin of Domestic Pigs in the Upstream Region of the Yangtze River

Long Jin et al.

Previous studies have indicated two main domestic pig dispersal routes in East Asia: one is from the Mekong region, through the upstream region of the Yangtze River (URYZ) to the middle and upstream regions of the Yellow River, the other is from the middle and downstream regions of the Yangtze River to the downstream region of the Yellow River, and then to northeast China. The URYZ was regarded as a passageway of the former dispersal route; however, this assumption remains to be further investigated. We therefore analyzed the hypervariable segments of mitochondrial DNA from 513 individual pigs mainly from Sichuan and the Tibet highlands and 1,394 publicly available sequences from domestic pigs and wild boars across Asia. From the phylogenetic tree, most of the samples fell into a mixed group that was difficult to distinguish by breed or geography. The total network analysis showed that the URYZ pigs possessed a dominant position in haplogroup A and domestic pigs shared the same core haplotype with the local wild boars, suggesting that pigs in group A were most likely derived from the URYZ pool. In addition, a region-wise network analysis determined that URYZ contains 42 haplotypes of which 22 are unique indicating the high diversity in this region. In conclusion, our findings confirmed that pigs from the URYZ were domesticated in situ.


December 13, 2012

Daniel MacArthur's chromosome 10

In a previous post I confirmed Razib's observation on the South Asian ancestry of part of Dan MacArthur's chromosome 10. Razib has a new post up in which he argues that this type of ancestry is not of Romani origin. I figured it was my turn to continue the series of gratis ancestry analysis, with two goals in mind: (i) to let people know that if you put your genome online, you may find interesting things about it, and (ii) to indeed figure out what is going on with this interesting case of unexpected admixture.

I used Beagle/fastIBD with default parameters and with the HapMap recombination map to figure out the mean IBD sharing between Dan's chr10 and a number of different populations, including most of my available South Asian references. I also included YRI as an appropriate outgroup, as well as CEU30 and the 1000 Genomes British populations, given that Dr. MacArthur is Australian and has a Scottish surname, so it's a good bet that he has plenty of British-type ancestry.

The mean chr10 sharing is plotted below:

Now, it's cool that the top-2 matches are Argyll and Orkney, both of which are part of Scotland. What is interesting, though, is that North_Kannadi squeezes in ahead of CEU, with a very respectable mean of 1.4cM, and a number of other Indian populations are not far behind, while most of the ones from Pakistan are not. I'd say this looks consistent with an "Indian" origin of this type of ancestry.

A useful control is to repeat this experiment with a different chromosome, the similar-sized chr9 which lacks evidence for South Asian ancestry:

It is now visually clear that the difference between the British populations and the South Asian ones is greatly diminished. And, while the North Kannadi were near the top of the order for chr10, they are near the bottom for chr9, even lower than the YRI outgroup at the "noise left end" of the spectrum. Even the highest ranked South Asian population has only about 1/3 of the estimated IBD sharing of the British ones.

Moreover, whereas in chr10 the top-ranked South Asian populations were often from the south, in chr9 the situation is reversed, with most of the top-ranked ones being from Pakistan. Again, this suggests that there is no real South Asian admixture here, but just some low-level sharing with the (more West Eurasian) populations of the northern part of South Asia.

So, to make a long story short, it does look to me like an excellent suggestion that there is some type of peninsular or even south Indian ancestry in the chr10 in question.
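
For anyone who wants to repeat this kind of comparison, the per-population means can be computed along the following lines. I am assuming that the IBD segments between the target and each reference individual have already been parsed into (individual, length in cM) pairs; the parsing itself, the function names, and the toy data are hypothetical and not my actual pipeline.

```python
from collections import defaultdict

def mean_sharing_by_population(segments, population_of):
    """Mean total IBD length (cM) shared with the target, per population.

    `segments` is a list of (individual_id, length_cM) pairs, one per
    detected IBD segment between the target and that individual.
    `population_of` maps every reference individual to a population label;
    individuals with no detected segments count as zero sharing.
    """
    total_by_individual = defaultdict(float)
    for individual, length in segments:
        total_by_individual[individual] += length

    sums = defaultdict(float)
    counts = defaultdict(int)
    for individual, population in population_of.items():
        sums[population] += total_by_individual.get(individual, 0.0)
        counts[population] += 1
    return {pop: sums[pop] / counts[pop] for pop in sums}

def rank_populations(means):
    """Population labels sorted from highest to lowest mean sharing."""
    return sorted(means, key=means.get, reverse=True)

# Made-up example with generic labels (not real results).
population_of = {"a1": "PopA", "a2": "PopA", "b1": "PopB", "o1": "Outgroup"}
segments = [("a1", 3.0), ("a1", 1.0), ("a2", 2.0), ("b1", 1.5)]
means = mean_sharing_by_population(segments, population_of)
ranking = rank_populations(means)
```

Running the same function on a control chromosome and comparing the two rankings is the whole experiment.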

December 12, 2012

Evidence for 6th millennium BC cheese

Nature (2012) doi:10.1038/nature11698

Earliest evidence for cheese making in the sixth millennium bc in northern Europe

Mélanie Salque et al.

The introduction of dairying was a critical step in early agriculture, with milk products being rapidly adopted as a major component of the diets of prehistoric farmers and pottery-using late hunter-gatherers [1-5]. The processing of milk, particularly the production of cheese, would have been a critical development because it not only allowed the preservation of milk products in a non-perishable and transportable form, but also it made milk a more digestible commodity for early prehistoric farmers [6-10]. The finding of abundant milk residues in pottery vessels from seventh millennium sites from north-western Anatolia provided the earliest evidence of milk processing, although the exact practice could not be explicitly defined [1]. Notably, the discovery of potsherds pierced with small holes appear at early Neolithic sites in temperate Europe in the sixth millennium BC and have been interpreted typologically as ‘cheese-strainers’ [10], although a direct association with milk processing has not yet been demonstrated. Organic residues preserved in pottery vessels have provided direct evidence for early milk use in the Neolithic period in the Near East and south-eastern Europe, north Africa, Denmark and the British Isles, based on the δ13C and Δ13C values of the major fatty acids in milk [1-4]. Here we apply the same approach to investigate the function of sieves/strainer vessels, providing direct chemical evidence for their use in milk processing. The presence of abundant milk fat in these specialized vessels, comparable in form to modern cheese strainers [11], provides compelling evidence for the vessels having been used to separate fat-rich milk curds from the lactose-containing whey. This new evidence emphasizes the importance of pottery vessels in processing dairy products, particularly in the manufacture of reduced-lactose milk products among lactose-intolerant prehistoric farming communities [6, 7].


Efficient moment-based inference of admixture parameters and sources of gene flow (Lipson et al. 2012)

My reading list keeps getting longer as another paper referenced by Loh et al. has now appeared on the arXiv, a day after the new Moorjani et al. paper on Romani origins. A number of papers from many of the same co-authors have appeared over the span of a couple of months, all of them containing interesting technical discussion on admixture parameter estimation, so perhaps this is a good place to make a list of them for easy reference:
This series of papers builds on earlier work, which can be found in the following:
The software introduced in the current paper (Lipson et al. 2012) can be found in the MixMapper page, and according to its description it is similar in spirit to the TreeMix software. Hopefully I'll be able to try it out for myself.

The most interesting thing about the current paper is, of course, its detection in Sardinians and Basques, albeit to a lesser extent, of the same "North Eurasian" ancestry found in northern Europeans by Patterson et al.

Sardinians did not appear to have such ancestry on the basis of the f3-statistic, but this might have been a consequence of the fact that they were the least admixed of the Europeans, so any application of f3(Sardinian; X, Amerindian) would not have given a negative result, because there does not exist any X less mixed with this Amerindian-like "North Eurasian" element than Sardinians.
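
The logic of this argument is easier to see with a toy calculation. In its simplest form, f3(C; A, B) is the average over SNPs of (c - a)(c - b), where a, b, c are allele frequencies; it goes negative when C's frequencies tend to lie between those of A and B. The sketch below uses this naive form without the finite-sample correction applied in practice, and the frequencies are invented.

```python
def f3(target, source1, source2):
    """Naive f3(target; source1, source2) from allele frequencies.

    Each argument is a list of frequencies for the same SNPs. The statistic
    is the average over SNPs of (c - a) * (c - b). This version omits the
    finite-sample bias correction used in practice.
    """
    products = [
        (c - a) * (c - b) for c, a, b in zip(target, source1, source2)
    ]
    return sum(products) / len(products)

# Hypothetical frequencies at three SNPs.
a = [0.1, 0.9, 0.2]
b = [0.9, 0.1, 0.8]
mixed = [(x + y) / 2 for x, y in zip(a, b)]  # 50/50 mixture of a and b

f3_mixed_target = f3(mixed, a, b)   # target intermediate between sources
f3_pure_target = f3(a, mixed, b)    # target at one extreme
```

The first value is negative and the second positive. A population sitting at one extreme of the cline, as I suggest Sardinians do among Europeans, cannot produce a negative f3 with any available X, even if it carries some of the ancestry in question.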

Also, the ALDER paper seems not to have been able to date this type of admixture because of its antiquity. I have tried myself using a 1-ref approach on Sardinians (using Sardinians and various other "eastern" populations as possible contributors) but without success. So, it will be interesting to read how this type of ancestry was detected in the current paper. Any further comments will be posted in this space as updates.

UPDATE I: On the left you can see the model proposed for Europe. A first observation is the absence of a primate outgroup, or indeed of representatives of African hunter-gatherers. This makes sense in the context of this paper, since all African hunter-gatherers have been shown now to have admixture from African farmers, so they cannot be used for the "scaffold" tree, as they are not unadmixed.

However, their type of admixture differs from the admixture found in all other populations. For example, Europeans are a mixture of "Ancient Western Eurasians" and a group related to "Ancient Northern Eurasians". African hunter-gatherers, on the other hand, are a mixture between a group related to the Mandenka-Yoruba clade, and (potentially diverse) sets of "Palaeoafricans". The latter are an outgroup to the rest of mankind, and as such admixture with them cannot be represented in this model; consequently Yoruba assume by default a position of unadmixed outgroup to the rest of mankind, a position which -for reasons mentioned before in this blog- I believe is not correct. What effect this might have on the rest of the tree is not yet clear to me.

arXiv:1212.2555 [q-bio.PE]

Efficient moment-based inference of admixture parameters and sources of gene flow

Mark Lipson, Po-Ru Loh, Alex Levin, David Reich, Nick Patterson, Bonnie Berger

(Submitted on 11 Dec 2012)

The recent explosion in available genetic data has led to significant advances in understanding the demographic histories of and relationships among human populations. It is still a challenge, however, to infer reliable parameter values for complicated models involving many populations. Here we present MixMapper, an efficient, interactive method for constructing phylogenetic trees including admixture events using single nucleotide polymorphism (SNP) genotype data. MixMapper implements a novel two-phase approach to admixture inference using moment statistics, first building an unadmixed scaffold tree and then adding admixed populations by solving systems of equations that express allele frequency divergences in terms of mixture parameters. Importantly, all features of the tree, including topology, sources of gene flow, branch lengths, and mixture proportions, are optimized automatically from the data and include estimates of statistical uncertainty. MixMapper also uses a new method to express branch lengths in easily interpretable drift units. We apply MixMapper to recently published data for HGDP individuals genotyped on a SNP array designed especially for use in population genetics studies, obtaining confident results for 30 populations, 20 of them admixed. Notably, we confirm a signal of ancient admixture in European populations---including previously undetected admixture in Sardinians and Basques---involving a proportion of 20-40% ancient northern Eurasian ancestry.


December 11, 2012

Y chromosome study of Italy (Brisighelli et al. 2012) incl. sample of Greek speakers from Salento

This is a wonderful new source of information on Y-chromosome variation in Italy, that also includes some samples of the linguistic minorities of Ladins and Griko speakers.

The latter is particularly interesting to me because these last Greeks of Magna Graecia are descended either from the ancient colonists or from medieval Eastern Roman settlers, and as such may represent a group of Greek descendants that (i) may have admixed to some extent with local Italic speakers, but (ii) will not have had an opportunity to experience much of the post-medieval gene flow that may have affected Greeks from the Aegean.

There may be something wrong with the presentation of the haplogroup frequencies on the left; in particular, based on the text, I think that what appears as R1* is in fact R1*(xR1a1).

In any case, here are my observations on the Grecani Salentini sample:
  • They, as well as the Messapi, possess the highest frequencies of E-M78. This ties them to the Balkans in a very obvious way; this haplogroup was also interpreted as a signal of Greek colonization in Sicily and Massalia. This seems like the most obvious explanation; note that Salento is in Messapia, so the high frequency in the non-Greek denizens of the region may be simply the result of language shift, since the remaining Greek speakers are presumably the last remnant of a once much more numerous population that was linguistically Italicized as have most other Greek speaking populations of Italy and Sicily.
  • Their highest frequency haplogroups are R1*(xR1a1) and J2. Both are fairly common haplogroups in both Greece and Italy, so only a fine-scale analysis would be able to differentiate between what might be pre-Greek and what is Greek in origin. In any case, I have proposed that these two haplogroups were typical of (albeit not limited to) the Graeco-Phrygo-Armenian clade, so their occurrence in this sample is not surprising.
  • There is an occurrence of I*(xM26) chromosomes. This requires finer phylogenetic resolution, but certainly the absence of M26 -which has a SW European distribution- is interesting to note.
  • Haplogroup G-M201 again requires finer-scale resolution, and could be anything from a relative of the Neolithic Italians (having been found in the Tyrolean Iceman) to much more recent events.
  • Within haplogroup J, the majority of the chromosomes belong to clade J2, with about a tenth of the frequency made up of J*(xM62, M172). Note that these are not necessarily J*(xJ1,J2) as indicated in the figure, since M62 defines only a part of the J1 lineage.
  • The absence of haplogroup R1a1 in this sample is perhaps the most interesting finding. This occurs at a frequency of ~10% in Greek samples from Greece and is fairly variable. I have previously observed that it was absent in the south stream of Indo-European based on its paucity in Armenians, Albanians, and its uneven distribution in Greeks. Its absence in the Italian Griko sample reinforces this idea. A caveat, however, is that the origin of the Greek settlement of Italy can be traced to southern Greece and western Anatolia, so it's still possible that some R1a1 was present in other areas of the Aegean basin since pre-medieval times.
The authors of the paper use many conventional labels of what is "Neolithic" and what is not (e.g., R1*(xR1a1) is claimed as Mesolithic). But, certainly, both age estimation of modern chromosomes (e.g., Wei et al. 2012) and the ancient Y chromosome studies cast doubt on this association. I would say that rather than being predominantly pre-Neolithic, it might appear that the Y-chromosome gene pool of Italy may have been formed in late Neolithic to medieval times, with the only lineages that can convincingly trace their ancestry to the Neolithic or earlier epochs being G and I-M26.

As for the Ladins, the high frequency (67.7%) of R1*(xR1a1) is consistent with what I believe to have been the main Italo-Celtic lineage.

Finally, I should point out the occurrence of a couple of haplogroup L samples; this haplogroup is more typical of populations much to the east, being the "eastern" cousin of the more "western" haplogroup T within the LT clade. Certainly a finer-scale resolution of these two L samples might be informative about their potential origins and/or the ancient distribution of this rather mysterious haplogroup.

PLoS ONE 7(12): e50794. doi:10.1371/journal.pone.0050794

Uniparental Markers of Contemporary Italian Population Reveals Details on Its Pre-Roman Heritage

Francesca Brisighelli et al.


According to archaeological records and historical documentation, Italy has been a melting point for populations of different geographical and ethnic matrices. Although Italy has been a favorite subject for numerous population genetic studies, genetic patterns have never been analyzed comprehensively, including uniparental and autosomal markers throughout the country.

Methods/Principal Findings

A total of 583 individuals were sampled from across the Italian Peninsula, from ten distant (if homogeneous by language) ethnic communities — and from two linguistic isolates (Ladins, Grecani Salentini). All samples were first typed for the mitochondrial DNA (mtDNA) control region and selected coding region SNPs (mtSNPs). This data was pooled for analysis with 3,778 mtDNA control-region profiles collected from the literature. Secondly, a set of Y-chromosome SNPs and STRs were also analyzed in 479 individuals together with a panel of autosomal ancestry informative markers (AIMs) from 441 samples. The resulting genetic record reveals clines of genetic frequencies laid according to the latitude slant along continental Italy – probably generated by demographical events dating back to the Neolithic. The Ladins showed distinctive, if more recent structure. The Neolithic contribution was estimated for the Y-chromosome as 14.5% and for mtDNA as 10.5%. Y-chromosome data showed larger differentiation between North, Center and South than mtDNA. AIMs detected a minor sub-Saharan component; this is however higher than for other European non-Mediterranean populations. The same signal of sub-Saharan heritage was also evident in uniparental markers.


Italy shows patterns of molecular variation mirroring other European countries, although some heterogeneity exists based on different analysis and molecular markers. From North to South, Italy shows clinal patterns that were most likely modulated during Neolithic times.


December 10, 2012

How to fix 23andMe's Ancestry Composition

There have been enough public reports by now that reinforce my initial suggestion that Ancestry Composition overfits to the training data (aka people with four grandparents from a reference population who filled their ancestry survey). The result of this is that such people get 99-100% of their ancestry assigned to a particular population, and the test essentially returns the customer-supplied population label instead of returning the person's ancestry based on his actual DNA.

Now, this is not a problem for the majority of 23andMe customers who don't have 4 grandparents from the same country, or have 4 grandparents from a colonial country such as the United States.

But, the problem for the rest of the 23andMe community cannot be overlooked, because it is significant for people from non-colonial countries who make up the reference populations. Ironically, the people who are actually making this type of analysis possible (people who dutifully filled in their ancestry survey) are the ones getting the raw end of the deal.

I have seen talk of people retracting their ancestry survey answers in the hope of getting some accurate results! I don't think that's the way to go, as that would lead to a race to the bottom: people might retract or change their ancestry survey answers in the hope of improving their results, but, if enough people do this, the training sample will be shrunk and distorted, so the results will be worse for everybody!

How to solve the problem

In a world with infinite computing resources, and a large number of samples, the problem could be solved optimally by leave-one-out: for each of the N training samples in turn, rebuild the ancestry predictive model using the remaining N-1 samples, and then apply it to the left-out sample. The model is thus rebuilt N times, once for each training sample.

Naturally, this would have the effect of increasing the computational complexity of ancestry estimation approximately N-fold, so it does not seem practical.

An alternative approach would be to build the model only once (using all N training samples) and incrementally update it for each training individual. This depends on the feasibility of such an incremental update which would incur a minor cost per individual -to adapt parameters of the model by "virtually" taking out the individual. My suspicion is that it will be extremely difficult to do this type of incremental update for the fairly complex model used by 23andMe in their Ancestry Composition.

So, what would be a practical solution?

Partition the N training samples into G groups, each of which will have N/G individuals. Now, rebuild the model G times, each time using N-N/G individuals, i.e., leaving one group out, and apply it to the individuals of the left-out group. Note that the initially proposed solution (i.e., leaving one out) is a special case of the above with G=N, where each group consists of a single individual.

The computational cost of this solution will be something less than G times the cost of building the full model with all N training samples. This is due to the fact that you are building the model G times, but over a slightly smaller dataset (of N-N/G individuals).

Practically, G=10 would be a reasonable number of groups, which would, however, require the model to be built ten times. Whether or not this is practical for 23andMe, I don't know, but since they have to periodically update their model, I think that they ought to try this approach. If they already have idle CPU cycles, that's a great way to occupy them, and if they don't, then investing in processing power would be a good idea.
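The leave-one-group-out scheme can be sketched in a few lines. This is not 23andMe's model, whose details I don't know: a toy nearest-centroid classifier stands in for it, and the function names and data below are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy stand-in for the ancestry model: one mean vector per label."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(model, X):
    """Assign each row of X to the label of the nearest centroid."""
    labels = list(model)
    centroids = np.stack([model[label] for label in labels])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.array([labels[i] for i in dists.argmin(axis=1)])

def held_out_predictions(X, y, G, seed=0):
    """Predict every training sample with a model that never saw it."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(len(y)), G)
    out = np.empty(len(y), dtype=y.dtype)
    for test_idx in groups:                 # G model builds in total
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False             # leave this group out
        model = fit_centroids(X[train], y[train])
        out[test_idx] = predict(model, X[test_idx])
    return out

# Hypothetical two-population data in a 2-D "genotype" space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
y = np.array(["A"] * 4 + ["B"] * 4)

pred = held_out_predictions(X, y, G=4)
print(pred)
```

Setting `G=len(y)` turns the same function into leave-one-out, at the cost of one model build per training sample.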

On the South Asian (?) ancestry of Daniel MacArthur

Razib investigates an unexpected region of South Asian admixture in Daniel MacArthur of Genomes Unzipped, and wonders why this has never been found before, despite the fact that his data was out in the public for a while.

I was surprised about this myself, since I had studied this data when I was starting my ADMIXTURE experiments a couple of years ago. But looking back at that old experiment, it's immediately clear why Dr. MacArthur's column (highlighted) showed no evidence of South Asian admixture at the time: there was no South Asian ancestral population in that reference set!

Naturally, I was curious to see what would turn up if I ran this sample again through my most recent globe13 calculator, which I did using the "bychr" mode of DIYDodecad, which treats each of the 22 autosomes separately:

A clear outlier is indeed shown on chr10 which shows 20.51% "South_Asian" admixture; most of the other chromosomes lack this altogether, so this seems like a legitimate signal of admixture.

I next used the "byseg" mode of DIYDodecad in order to localize this admixture signal within chr10 and study it further. Furthermore, I used the paint_byseg script in order to show how the top-4 components within chr10 varied along the length of the chromosome:

It does appear that a good portion of the first half of chr10 has "South_Asian" ancestry, with the signal close to 50%, which is a fairly good indication that one half of the diploid genome in this region has this type of ancestry.
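As a rough consistency check between the chromosome-wide and the local figures, one can ask what share of chr10 would need to carry the signal. The sketch below assumes, purely for illustration, that the component sits at exactly 50% inside the affected region and at 0% elsewhere, and that the chromosome-wide figure is a simple average over the chromosome; the variable names are mine.

```python
chrom_wide = 0.2051   # "South_Asian" fraction reported for chr10 as a whole
local_level = 0.5     # one of the two homologous copies in the affected region

# Share of the chromosome that must carry the signal under these assumptions.
implied_share = chrom_wide / local_level
print(f"{implied_share:.1%} of chr10")
```

The implied share is somewhat under half of the chromosome, which is in line with "a good portion of the first half" of chr10.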

Interestingly, the South_Asian signal does not appear "constant" along this portion, but in some of its troughs, the "West_Asian" component shows a corresponding local peak. Now, this might be the case of one really long segment of ancestry which is interpreted sometimes as South_Asian, sometimes as West_Asian by the software, given that the South_Asian component inferred by ADMIXTURE is a composite of West_Asian-like Ancestral North Indians (ANI), and Ancestral South Indians (ASI). But, we can investigate this further by using globe4, which looks at the same chromosome at a lower level of resolution:

It does appear to me that a fairly convincing "Asian" signal exists in a good portion of this region. Note that "Asian" within the context of globe4 is a combination of East/South Eurasians and even Australasians; it is a generalized "Asian" component that captures some of the common ancestry of these populations.

So, on balance I would say that there does indeed appear to be evidence of South Asian ancestry within chr10 for this sample, and, moreover, this type of South Asian ancestry is probably partly ASI-related.

Roma origins once more (Moorjani et al. 2012)

I had first noticed that this new paper by Moorjani et al. was referenced by Loh et al., and it has now been posted on arXiv. In the last week, a couple of other papers on the same topic (Mendizabal et al. on autosomal DNA and Rai et al. on a Y-chromosome founder lineage) have also appeared.

All three studies appear to converge on NW India as the place of origin of the European Roma, and on a recent admixture between this "Proto-Roma" population and Europeans. It will be interesting to see if there are any substantial differences between Moorjani et al. and Mendizabal et al. in the reconstruction of Roma origins. There is also an appendix on updates to rolloff and other topics of a technical nature that ought to be useful to readers irrespective of their interest in this particular population.

It'll probably take me a while to digest everything in this paper, but I will make one quick observation after (virtually) leafing through the article; the observation that {CEU, ANI} form a clade with Adygei as an outgroup is used to infer admixture proportions. I recently had a blog post on the differential relationship of ANI to Caucasus populations, in which I showed that while D(CEU, Adygei; South Asian, Onge) was positive, and significant in some cases -- indicating CEU being more closely related to ANI (Ancestral North Indians) than Adygei -- the reverse was the case for D(CEU, Georgian/Lezgin; South Asian, Onge).
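For readers unfamiliar with the D-statistic, here is a minimal sketch of the point estimate computed from population allele frequencies. The frequencies are made up, the function name is mine, and the block-jackknife standard error needed to judge significance is omitted.

```python
import numpy as np

def d_stat(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP population allele frequencies.

    Positive values indicate excess allele sharing between W and Y
    (or, equivalently, between X and Z).
    """
    w, x, y, z = map(np.asarray, (w, x, y, z))
    num = np.sum((w - x) * (y - z))
    den = np.sum((w + x - 2 * w * x) * (y + z - 2 * y * z))
    return float(num / den)

# Hypothetical frequencies at four SNPs: W tracks Y, while X and Z are flat.
w = np.array([0.8, 0.2, 0.7, 0.3])
x = np.array([0.5, 0.5, 0.5, 0.5])
y = np.array([0.9, 0.1, 0.8, 0.2])
z = np.array([0.5, 0.5, 0.5, 0.5])

print(d_stat(w, x, y, z))   # positive: W is closer to Y than X is
print(d_stat(x, w, y, z))   # same magnitude, opposite sign
```

Read this way, a positive D(CEU, Adygei; South Asian, Onge) means CEU shares more alleles with the South Asian population than Adygei does, and swapping the first two populations flips the sign.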

A second observation was inspired by the following figure:

High IBD sharing with Romanians makes sense, because there is good evidence (e.g., presence of Y-haplogroup E-V13) that the Roma picked up European ancestry in the Balkans. So, I'm fairly sure that we are seeing a real signal that the Roma have Romanian-like recent European ancestors. But, we ought to be vigilant, because it is possible that some Romanians may have Roma ancestry too!  This was the case in a couple of individuals from the Romanian sample of Behar et al. (2010).

This is a more general issue: IBD sharing occasionally involves strictly -or mostly- unidirectional gene flow, e.g., gene flow between European and African Americans largely went the EA->AA way, so an AA sharing a segment with an EA more often than not reflects EA->AA gene flow.

But, in other cases, the direction of gene flow is more obscure (so, e.g., sharing between German, Magyar, and Slavic speakers, and Jews in the old Austro-Hungarian Empire). This issue often comes up in the genealogical community, with a typical example being a couple of individuals (let's call them Klaus and Mikolaj) discovering a shared IBD segment, and Klaus thinking he's found a Polish ancestor, and Mikolaj a German one.

In any case, as the authors themselves note, it will be interesting to use more European reference populations, and this might indicate whether the Roma picked up European ancestry in one particular region, carrying it with them as they expanded into the Balkans and beyond, or whether they picked it up by interacting with different host populations (e.g., Greek Gypsies with Greeks, Romanian Gypsies with Romanians, and so on).

arXiv:1212.1696 [q-bio.PE]

Reconstructing Roma history from genome-wide data

Priya Moorjani et al.

The Roma people, living throughout Europe, are a diverse population linked by the Romani language and culture. Previous linguistic and genetic studies have suggested that the Roma migrated into Europe from South Asia about 1000-1500 years ago. Genetic inferences about Roma history have mostly focused on the Y chromosome and mitochondrial DNA. To explore what additional information can be learned from genome-wide data, we analyzed data from six Roma groups that we genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs). We estimate that the Roma harbor about 80% West Eurasian ancestry-deriving from a combination of European and South Asian sources- and that the date of admixture of South Asian and European ancestry was about 850 years ago. We provide evidence for Eastern Europe being a major source of European ancestry, and North-west India being a major source of the South Asian ancestry in the Roma. By computing allele sharing as a measure of linkage disequilibrium, we estimate that the migration of Roma out of the Indian subcontinent was accompanied by a severe founder event, which we hypothesize was followed by a major demographic expansion once the population arrived in Europe.


December 08, 2012

Main orientations of human genetic differentiation (Jay et al. 2012)

Mol Biol Evol (2012) doi: 10.1093/molbev/mss259

Anisotropic isolation by distance: the main orientations of human genetic differentiation

Flora Jay et al.

Genetic differentiation among human populations is greatly influenced by geography due to the accumulation of local allele frequency differences. However, little is known about the possibly different increment of genetic differentiation along the different geographical axes (north-south, east-west, etc). Here we provide new methods to examine the asymmetrical patterns of genetic differentiation. We analyzed genome-wide polymorphism data from populations in Africa (n = 29), Asia (n = 26), America (n = 9) and Europe (n = 38), and we found that the major orientations of genetic differentiation are north-south in Europe and Africa, east-west in Asia, but no preferential orientation was found in the Americas. Additionally, we showed that the localization of the individual geographic origins based on SNP data was not equally precise along all orientations. Confirming our findings, we obtained that in each continent, the orientation along which the precision is maximal corresponds to the orientation of maximum differentiation. Our results have implications for interpreting human genetic variation in terms of isolation by distance and spatial range expansion processes. In Europe for instance, the precise NNW-SSE axis of main European differentiation can not be explained by a simple Neolithic demic diffusion model without admixture with the local populations because in that case the orientation of greatest differentiation should be perpendicular to the direction of expansion. In addition to humans, anisotropic analyses can guide the description of genetic differentiation for other organisms and provide information on expansions of invasive species or the processes of plant dispersal.


December 07, 2012

23andMe Ancestry Composition

23andMe has launched its new Ancestry Composition feature, the workings of which are summarized -at a very high level- in this page.

I have already received some feedback from customers who also happen to be part of my Dodecad Project and who appear to be perplexed by their results. It is unfortunate that my own rules preclude me from discussing the details of these reports. I encourage people who want to discuss their ancestry composition to do so in the comments.

Without going into details, I would first advise that 23andMe make transparent the way in which 23andMe participants were selected as part of their training data. This is explained in their writeup with the following paragraph:
Most of the reference dataset comes from 23andMe members just like you. When someone tells us that they have four grandparents all born in the same country, and the country isn't a colonial nation like the US, Canada or Australia, they become candidates for inclusion in the reference dataset. We filter out all but one of any set of closely-related people, since they can distort the results. And we remove "outliers," people whose genetic ancestry doesn't seem to match up with their survey answers.
23andMe takes a "birthplace of grandparents" approach rather than an "ethnic origin" approach. This may be reasonable when the two tend to coincide but not appropriate at all when ethnic groups of different origins co-exist in a given territory. Contrary to the implicit belief expressed in the above paragraph, ethnic complexity is not limited to "colonial nations", and an approach that disregards ethnicity, language, and religion, and limits itself to "birthplace of grandparents" is bound to miss it.

The problem with supervised learning is that the end product is only as good as the labels. If the labels aren't good, or they're ambiguous, then you end up with a mess.

Let's take an example of an individual who reports "4 grandparents from Turkey." This may mean anything: a Mesopotamian Kurd within the boundaries of Turkey, a Central Anatolian Turk, a Cappadocian Greek, a Turkocretan, an Armenian from Cilicia, an ethnic Greek from European Turkey, or a Turkish-speaking Muslim from Skopje or Bulgaria. Some of these may interpret "Turkey" geographically; others ethnically. The label "Turkey" is polysemous, for a variety of reasons: it can be interpreted either geographically or ethnically, and in both these senses it has not been time-invariant.

I don't know how 23andMe built their reference populations, but I am ~100% sure that 4 grandparents from Turkey = "Middle Eastern" in their terminology. I am also fairly sure that their "Balkan" sample consists of individuals as different as Croats and Greeks. So what do these meta-population labels mean? Your guess is as good as mine: a balance of samples of different origins and different interpretations of these origins in whatever training set 23andMe assembled.

In my own project, I never include a priori labels of individuals in the inference of ancestral components. I deal with genotypes and individuals, not self-reported ancestral origins and labelled sets of individuals (populations). Components emerge from unsupervised learning over a set of individual genotypes, and it is only a posteriori that labels are assigned to the inferred components, by observation. Indeed, one could forego the assignment of labels altogether!

My amicable advice to 23andMe is to drop supervised learning altogether. It will only get worse as new customers (aka new test data) join in.

December 06, 2012

Romani origins and admixture (Mendizabal et al.)

I will comment on the paper in this space after I read it. For the time being here's a link to the press release.
"From a genome-wide perspective, Romani people share a common and unique history that consists of two elements: the roots in northwestern India and the admixture with non-Romani Europeans accumulating with different magnitudes during the out-of-India migration across Europe," Kayser said. "Our study clearly illustrates that understanding the Romani's genetic legacy is necessary to complete the genetic characterization of Europeans as a whole, with implications for various fields, from human evolution to the health sciences."
The results seem to complement a recent Y-chromosome study of the major founder lineage of European Roma H-M82.

Current Biology

Reconstructing the Population History of European Romani from Genome-wide Data

Isabel Mendizabal et al.

The Romani, the largest European minority group with approximately 11 million people [1], constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic [2] and genetic [3, 4, 5, 6, 7 and 8] studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.


Sneak peek at new version of 23andMe ancestry analysis via Jeff Probst

You can watch a 9min clip here.

Paternal haplogroup ("traces to France and Germany"):

Anyone care to speculate what that is? The foci in eastern India and absence in parts of NW Europe and the Balkans throw me off.

And what of his maternal haplogroup ("Northern Africa" "pastoralists" "Berbers"):

I would have guessed U6 or M1, but the focus east of the Caspian throws me off again. 23andMe may have potentially very large sample sizes, so perhaps their frequency maps may be even better than ones published in the literature, so I'm genuinely curious what this might be.

Various other info: his dad "top 1% Neandertal", no evidence of "Asian" or "Jewish".

Anyway, onto the main course, i.e., the new Ancestry composition:

One thing that I like about this is the assignment of a portion of ancestry to a "Nonspecific Northern European" group, which is a feature I haven't seen before. I am told that this feature will launch very soon, so it will be interesting to see how well it works across many individuals.

December 05, 2012

Y chromosomes in Iranians and Tajiks (Malyarchuk et al. 2013)

An interesting paper on Iranian and Tajik Y chromosomes. Iranian Y chromosomes were comprehensively studied by Grugni et al. but it is always good to have additional samples.

I have mentioned before the apparent distinction between west and east Iranians in terms of haplogroup J/R1a frequencies, with high ratios in Persians and Kurds, and low ones in Pathans, and this seems to be reinforced here; the Tajiks are speakers of Persian (hence "western") but trace their ancestry to the east of the modern country of Iran, and they fall in between Persians and eastern Iranians.

The absence of R1a in this Kurdish sample, coupled with high J frequency, parallels the situation in the Kurdish Anatolian settlement studied by Gokcumen et al., as well as the Georgian Kurmanji sample studied by Nasidze et al. On the other hand, R1a is present in the Kurmanji samples from Turkey and Turkmenistan in the latter study, as well as in the aforementioned Kurdish sample from Iran by Grugni et al. and the Kurdish sample from Turkmenistan studied by Wells et al. I'd say that there is potential variation of this haplogroup within Kurdish groups, which might be worth further exploration.

It would also be very interesting to study the haplogroup I chromosomes from this region. Do they represent historical introgression from Europe, or are they, perhaps, local basal clades that reinforce the idea of a relic distribution of I in West Asia, prior to the migration into Europe, that was recently suggested by the discovery of IJ* chromosomes in Iran by Grugni et al.?

Annals of Human Biology, 2013; Early Online: 1–7

Y-chromosome variation in Tajiks and Iranians

Boris Malyarchuk et al.

Aim: The purpose of this study was to characterize Y-chromosome diversity in Tajiks from Tajikistan and in Persians and Kurds from Iran.

Method: Y-chromosome haplotypes were identified in 40 Tajiks, 77 Persians and 25 Kurds, using 12 short tandem repeats (STR) and 18 binary markers.

Results: High genetic diversity was observed in the populations studied. Six of 12 haplogroups were common in Persians, Kurds and Tajiks, but only three haplogroups (G-M201, J-12f2 and L-M20) were the most frequent in all populations, comprising together 60% of the Y-chromosomes in the pooled data set. Analysis of genetic distances between Y-STR haplotypes revealed that the Kurds showed a great distance to the Iranian-speaking populations of Iran, Afghanistan and Tajikistan. The presence of Indian-specific haplogroups L-M20, H1-M52 and R2a-M124 in both Tajik samples from Afghanistan and Tajikistan demonstrates an apparent genetic affinity between Tajiks from these two regions.

Conclusions: Despite the marked similarities between Y-chromosome gene pools of Iranian-speaking populations, there are differences between them, defined by many factors, including geographic and linguistic relationships.


December 04, 2012

Armenian Y-STR haplotype data

The same data was studied by Herrera et al. (2011) where it was shown that haplogroup R2 was one of the distinguishing features of the Sasun community.

I decided to try the batch version of the Haplogroup Predictor on this data, and I include my results in this spreadsheet. This is useful as test data because real haplogroup assignments and Y-STR data are known for the same individuals, so they can be cross-checked against the predicted haplogroup.

Haplogroup prediction was made on the basis of the highest posterior probability and with equal priors. There were a few errors, some of which are understandable (for example, the 23-marker version does not include haplogroup R2, so the R2 samples were assigned to various other haplogroups).
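The decision rule described above can be sketched in a few lines. This is only an illustration of "highest posterior probability with equal priors", not the Haplogroup Predictor's actual code; the function name `predict_haplogroup` and the likelihood values are hypothetical, and the per-haplogroup likelihoods are assumed to have been computed elsewhere.

```python
def predict_haplogroup(likelihoods):
    """Pick the haplogroup with the highest posterior probability.

    `likelihoods` maps each candidate haplogroup to P(haplotype | haplogroup).
    With equal priors the prior cancels out, so each posterior is simply the
    likelihood divided by the sum of all likelihoods.
    """
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    posteriors = {hg: lik / total for hg, lik in likelihoods.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical likelihoods for one sample; the numbers are made up.
example = {"R1b1a2": 1e-12, "J2": 3e-12, "L": 6e-12}
best, posteriors = predict_haplogroup(example)
print(best, round(posteriors[best], 3))
```

Note that a haplogroup missing from the candidate set can never be chosen, which is why the R2 samples necessarily end up assigned to something else.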

A few other discrepancies aside (e.g., some R1b1a2 assigned to L), the overall performance seems robust, and one can probably use this tool for published Y-STR data, with the caveat that some predictions for the less frequent haplogroups (for which there were probably fewer training samples) may be off the mark.
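For anyone wanting to repeat this kind of cross-check on another dataset, a minimal sketch follows. It assumes two parallel lists of haplogroup labels per individual (known and predicted); the function name `cross_check` and the toy labels are hypothetical, and labels are compared as exact strings, so differing nomenclature depths would need to be harmonized first.

```python
from collections import Counter


def cross_check(known, predicted):
    """Compare known and predicted haplogroups for the same individuals.

    Returns the fraction of exact matches and a Counter of
    (known, predicted) pairs for the mismatches.
    """
    if len(known) != len(predicted):
        raise ValueError("known and predicted must have the same length")
    pairs = list(zip(known, predicted))
    correct = sum(1 for k, p in pairs if k == p)
    mismatches = Counter((k, p) for k, p in pairs if k != p)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, mismatches


# Toy labels, invented for illustration only.
known = ["R2", "R1b1a2", "J2", "J2"]
predicted = ["L", "L", "J2", "J2"]
accuracy, mismatches = cross_check(known, predicted)
print(accuracy)
for (k, p), n in mismatches.most_common():
    print(f"{k} predicted as {p}: {n}")
```

Listing the mismatched pairs by frequency makes systematic errors (such as a haplogroup the predictor does not model) stand out from one-off misassignments.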

Legal Medicine doi:10.1016/j.legalmed.2012.10.003

Sub-population structure evident in forensic Y-STR profiles from Armenian geographical groups 

Robert K. Lowery et al.

Over the course of its long history, Armenia has acted as both a source of numerous indigenous cultures and as a recipient of foreign invasions. As a result of this complex history among populations, the gene pool of the Armenian population may contain traces of historically well-documented ancient migrations. Furthermore, the regions within the historical boundaries of Armenia possess unique demographic histories, having hosted both autochthonous and specific exogenous genetic influences. In the present study, we analyze the Armenian population sub-structure utilizing 17 Y-chromosome short tandem repeat (Y-STR) loci of 412 Armenians from four geographically and anthropologically well-defined groups (Ararat Valley, Gardman, Lake Van and Sasun). To place the genetic composition of Armenia in a regional and historic context, we have compared the Y-STR profiles from these four Armenian collections to 18 current-day Eurasian populations and two ancient DNA collections. Our results illustrate regional trends in Armenian paternal lineages and locale-specific patterns of affinities with neighboring regions. Additionally, we observe a phylogenetic relationship between the Northern Caucasus and the group from Sasun, which offers an explanation for the genetic divergence of this group from other three Armenian collections. These findings highlight the importance of analyzing both general populations as well as geographically defined sub-populations when utilizing Y-STRs for forensic analyses and population genetics studies.


Disentangling the histories of mtDNA haplogroups M1 and U6

mtDNA haplogroups M1 and U6 are often mentioned in connection with Eurasian back-migration into Africa. The former is the only clade of the Asian haplogroup M which occurs in Africa at all; the latter is the only clade of the West Eurasian haplogroup U that does the same. These haplogroups also tend to co-exist in North and East Africa, although they are largely absent in sub-Saharan Africa. Different ideas have been offered for their occurrence, including a "Paleolithic" spread or a more recent one associated with the spread of Afroasiatic languages.

The new paper offers useful new data on this debate. The most important conclusion is that despite their oft-mentioned association, these two haplogroups appear to have distinct histories. One argument for this is their separate geographic distribution:

M1 (on panel A) is much more common in Northeast Africa and the Near East (including the Caucasus), whereas U6 (panel B) is more confined to Africa, and has its strongest peak in NW Africa, being rare in NE Africa.

An interesting aside is that all the mysterious M1 from the Caucasus belongs to subclade M1a, while the smaller M1b clade tends to co-occur with M1a in other parts of Africa and the Near East. This indicates a founder effect for the origin of Caucasian M1a, but leaves open the issue of the immediate origins of M1. Hopefully it will become possible to place this haplogroup within the broader M phylogeny in the future.

The Bayesian skyline plots also contrast M1 and U6 in terms of their demographic histories:

The authors argue that these histories are inconsistent with either a very early dispersal with the Dabban industry or a more recent spread with Afroasiatic. From the paper:
The transition from the Middle Palaeolithic to Upper Palaeolithic in North Africa is characterised by the appearance of the “Dabban”, an industry that is restricted to Cyrenaica in northeast Libya and represented at the caves of Hagfet ed Dabba and Haua Fteah [19]. Whilst a techno-typological shift occurred within the Dabban ~33 KYA [19], starker changes in the archaeological record occurred throughout North Africa and Southwest Asia ~23-20 KYA, represented by the widespread appearance of backed bladelet technologies. The appearance of these backed bladelet industries more or less coincides with the timing of the Last Glacial Maximum (LGM) (~23-18 KYA), including: ~21 KYA in Upper Egypt [20]; ~20 KYA at Haua Fteah with the Oranian [21]; the Iberomaurusian expansion in the Jebel Gharbi ~20 KYA [22]; and the first Iberomaurusian at Tamar Hat in Algeria ~20 KYA [23]. The earliest Iberomaurusian sites in Morocco appear to be only slightly younger ~18 KYA [24].
A disassociation of these haplogroups from the UP in North Africa might be consistent with my idea that the UP was in part a cultural revolution that spread not only with people, but often with ideas across a species that already had the "biological machinery" for behavioral modernity and was already established in both Africa and the Near East.

As for the connection to Afroasiatic, the authors detect a linguistic correlation with M1a, which, however, appears too old to have been involved directly in the spread of this language family:
Concerning haplogroup M1 individually, a significant correlation with languages was observed. Furthermore, within M1, it appears that the correlation is mostly due to M1a. However, given the small sample size of M1b, any potential signal correlating with language might not be detectable. Interestingly, M1a has a likely East African origin, but its coalescent age of ~21 KYA still largely predates that of the proto-AA. Maybe a sub-clade of M1a would still give a similar correlation, but there are not sufficient samples to allow splitting M1a into its various sub-clades, and to test for a correlation. Although we found a correlation, limited sample sizes do not allow drawing unambiguous connection between genes and languages. Furthermore, it is also possible that this putative sub-clade of M1 does not testify for the expansion of AA speaking people, but was already present among the people who inhabited the area before the spread of the AA languages.
Personally, I am in favor of an East African origin of Afroasiatic, as this makes sense of various lines of evidence, one of which is the African shift of the "Southwest_Asian" component that is modal in Semitic populations. I envision that M1 was geographically circumscribed in a NE African population after its much earlier arrival from Asia and piggybacked onto the expansion of Afroasiatic speakers, thus explaining the observed correlation. A good analogy would be with the expansion of, say, haplogroup H in the Americas, which piggybacked on the European colonization, even though the coalescence age of H predates the arrival of Europeans in the New World by many millennia.

BMC Evolutionary Biology 2012, 12:234 doi:10.1186/1471-2148-12-234

Divorcing the Late Upper Palaeolithic demographic histories of mtDNA haplogroups M1 and U6 in Africa

Erwan Pennarun et al.

Abstract (provisional)
A Southwest Asian origin and dispersal to North Africa in the Early Upper Palaeolithic era has been inferred in previous studies for mtDNA haplogroups M1 and U6. Both haplogroups have been proposed to show similar geographic patterns and shared demographic histories.

We report here 24 M1 and 33 U6 new complete mtDNA sequences that allow us to refine the existing phylogeny of these haplogroups. The resulting phylogenetic information was used to genotype a further 131 M1 and 91 U6 samples to determine the geographic spread of their sub-clades. No southwest Asian specific clades for M1 or U6 were discovered. U6 and M1 frequencies in North Africa, the Middle East and Europe do not follow similar patterns, and their sub-clade divisions do not appear to be compatible with their shared history reaching back to the Early Upper Palaeolithic. The Bayesian Skyline Plots testify to non-overlapping phases of expansion, and the haplogroups' phylogenies suggest that there are U6 sub-clades that expanded earlier than those in M1. Some M1 and U6 sub-clades could be linked with certain events. For example, U6a1 and M1b, with their coalescent ages of ~20,000-22,000 years ago and earliest inferred expansion in northwest Africa, could coincide with the flourishing of the Iberomaurusian industry, whilst U6b and M1b1 appeared at the time of the Capsian culture.

Our high-resolution phylogenetic dissection of both haplogroups and coalescent time assessments suggest that the extant main branching pattern of both haplogroups arose and diversified in the mid-later Upper Palaeolithic, with some sub-clades concomitantly with the expansion of the Iberomaurusian industry. Carriers of these maternal lineages have been later absorbed into and diversified further during the spread of Afro-Asiatic languages in North and East Africa.


Tomb of Genghis Khan found?

Newsweek has a story on the purported finding of the tomb of Genghis Khan. An excerpt:

A multidisciplinary research project uniting scientists in America with Mongolian scholars and archeologists has the first compelling evidence of the location of Khan’s burial site and the necropolis of the Mongol imperial family on a mountain range in a remote area in northwestern Mongolia. 
Among the discoveries by the team are the foundations of what appears to be a large structure from the 13th or 14th century, in an area that has historically been associated with this grave. Scientists have also found a wide range of artifacts that include arrowheads, porcelain, and a variety of building material. 
“Everything lines up in a very compelling way,” says Albert Lin, National Geographic explorer and principal investigator of the project, in an exclusive interview with Newsweek.
Whether this is the real thing or not, you gotta love that this has been made possible:
In a laboratory at the California Institute for Telecommunications and Information Technology at University of California, San Diego, Lin and his team combed through the massive volumes of ultrahigh-resolution satellite imagery and built 3-D reconstructions from radar scans in their search for clues to where Genghis Khan may be buried. As part of an unprecedented open-source project, thousands of online volunteers sifted through 85,000 high-resolution satellite images to identify any hidden structures or odd-seeming formations.
Apparently the local authorities are concerned about grave robbing, so it does not seem that the site will be excavated anytime soon. And, perhaps, the central position of Genghis Khan in modern Mongolian culture might make the disinterment of any human remains from the area a difficult proposition politically.

In any case, it would be great to read the headline "Genome sequence of Genghis Khan" in your Nature or Science news feed one fine evening a few years down the road, so let's keep our fingers crossed that it may yet happen.

PS: On an unrelated topic, I sometimes wonder why there has not been more work on "famous DNA". This would provide an incredible way of involving the public in cutting-edge science. It might also help historical research, and while the location of Genghis Khan's tomb is obscure, those of other famous potentates like Tamerlane, or the Ottoman Sultans, or a good number of European royals are not.

Of course, there may not be much scientific interest in many such persons, but if Einstein's brain continues to be the subject of reputable studies in good journals, why isn't Einstein's genome so studied? Or Newton's, Darwin's, Beethoven's, or any other intellectual giant's whose burial place is known? I'm not naive enough to think that such an approach would reveal a "genius gene" they all possessed, but still, it is not inconceivable that something of interest about their origins -if not their genetic predispositions- might turn up.

December 03, 2012

'globe13anc' calculator with chimp outgroup

I was thinking a bit about my suggestion to use Palaeo_African as an outgroup for D-statistic calculations using my new admixtureDstat script, and it occurred to me that it would be fairly easy to modify one of my calculators to include a sample that is indeed symmetrically related to all modern human groups.

To do this, I created an individual possessing the ancestral allele at each SNP, using hgdpGeo as a reference. According to the reference for this table:

Samples collected by the HGDP-CEPH from 1,043 individuals from around the world were genotyped for 657,000 SNPs at Stanford. Ancestral states for all SNPs were estimated using whole genome human-chimpanzee alignments from the UCSC database. For each SNP in the human genome (NCBI Build 35, UCSC database hg17), the allele at the corresponding position in the chimp genome (Build 2 version 1, UCSC database pantro2) was used as ancestral.
My new globe13anc calculator is simply a version of the latest globe13 one, but with an extra "Ancestral" component, so it has 13+1 = 14 ancestral components in total.
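The idea of adding an "Ancestral" component can be sketched as follows. This is not the code used to build globe13anc, and it does not reproduce the actual DIYDodecad file formats; it assumes a simplified in-memory layout in which each SNP has two listed alleles and each row of the frequency table gives the frequency of the first allele in every component. The function name and the choice to drop SNPs with an unknown ancestral state are assumptions made for illustration.

```python
def add_ancestral_component(snps, freqs, ancestral):
    """Append an 'Ancestral' column to a per-SNP allele frequency table.

    snps      : list of (snp_id, allele1, allele2)
    freqs     : list of rows; row i holds the frequency of allele1 of
                snps[i] in each existing component
    ancestral : dict mapping snp_id to the inferred ancestral allele

    The new component carries allele1 at frequency 1.0 when allele1 is
    ancestral and 0.0 when allele2 is ancestral. SNPs whose ancestral
    state is missing or matches neither allele are dropped (an assumption).
    """
    kept_snps, kept_freqs = [], []
    for (snp_id, a1, a2), row in zip(snps, freqs):
        anc = ancestral.get(snp_id)
        if anc == a1:
            extra = 1.0
        elif anc == a2:
            extra = 0.0
        else:
            continue  # ancestral state unknown or inconsistent
        kept_snps.append((snp_id, a1, a2))
        kept_freqs.append(list(row) + [extra])
    return kept_snps, kept_freqs


# Toy data with two existing components; all values are invented.
snps = [("rs1", "A", "G"), ("rs2", "C", "T"), ("rs3", "A", "C")]
freqs = [[0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]
ancestral = {"rs1": "A", "rs2": "T"}
new_snps, new_freqs = add_ancestral_component(snps, freqs, ancestral)
print(new_snps)
print(new_freqs)
```

Because the extra component is fixed at 0 or 1 for every SNP, it behaves like a single individual who is homozygous for the ancestral allele throughout.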

You can of course use globe13anc as any other calculator designed for DIYDodecad, and hopefully no one will get anything other than 0% for the "Ancestral" component :)

But the main point of building this is to help you infer D-statistics with no suspicion that gene flow within the human species may affect the results. While the Khoesan of South Africa (where the Palaeo_African component is modal) are an approximate outgroup to the rest of mankind, there is evidence that even their most isolated groups have received some external gene flow. So, using this "Ancestral" outgroup instead of Palaeo_African ought to make things cleaner for everyone.
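To make the role of the outgroup concrete, here is a minimal sketch of a frequency-based D-statistic. It is not the admixtureDstat script; it uses the standard form D = Σ(p1 − p2)(p3 − pO) / Σ(p1 + p2 − 2·p1·p2)(p3 + pO − 2·p3·pO), where all four lists give the frequency of the same allele per SNP. The function name is hypothetical, and no standard error (e.g., by block jackknife) is computed.

```python
def d_statistic(p1, p2, p3, p_out):
    """D(Pop1, Pop2; Pop3, Outgroup) from per-SNP allele frequencies.

    All four lists must refer to the same allele at the same SNPs.
    A value near 0 is expected if Pop1 and Pop2 are symmetrically
    related to Pop3; this sketch returns only the point estimate.
    """
    if not (len(p1) == len(p2) == len(p3) == len(p_out)):
        raise ValueError("all frequency lists must have the same length")
    num = 0.0
    den = 0.0
    for a, b, c, o in zip(p1, p2, p3, p_out):
        num += (a - b) * (c - o)
        den += (a + b - 2 * a * b) * (c + o - 2 * c * o)
    if den == 0:
        raise ValueError("no informative SNPs (denominator is zero)")
    return num / den


# Toy derived-allele frequencies, invented for illustration. With an
# "Ancestral" outgroup, the derived allele has frequency 0 at every SNP.
pop1 = [0.30, 0.10, 0.60]
pop2 = [0.20, 0.15, 0.40]
pop3 = [0.50, 0.05, 0.70]
outgroup = [0.0, 0.0, 0.0]
print(d_statistic(pop1, pop2, pop3, outgroup))
```

With the outgroup fixed at the ancestral state, each SNP's contribution depends only on the three human populations, so gene flow into the outgroup cannot bias the result, which is the motivation given above for preferring it to Palaeo_African.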