December 07, 2010

Medieval GDP per capita

On the left there are some interesting tables from the paper.

I don't know how accurate estimates of GDP per capita are that far in the past, as a certain number of assumptions must come into play. However, these numbers should give pause to anyone who makes broad statements about different populations' innate ability based on their present or recent economic output, especially when such inferences are pushed thousands of years into the past.

The paper also has some estimates of population and population growth rates which should be interesting for population geneticists in calibrating their models.


Medieval England Twice as Well Off as Today’s Poorest Nations
New research led by economists at the University of Warwick reveals that medieval England was not only far more prosperous than previously believed, it also actually boasted an average income that would be more than double the average per capita income of the world's poorest nations today.

Relationship between world craniometric clusters

Continuing the post on MCLUST analysis of the Howells' dataset, below is the Mahalanobis distance matrix between the centroids of the 14 clusters:


The most striking feature is, of course, the great morphological distance of the Neandertal cluster #6 to all modern human groups, which exceeds the maximum distance between modern human groups (Bushman vs. Moriori/Maori).
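As a reminder of what a Mahalanobis distance between two centroids measures, here is a minimal sketch in two dimensions. The centroids and covariance matrices are hypothetical illustration values, not Howells' data, and the function name is my own; the real analysis uses 57 measurements and would need a linear algebra library for the matrix inverse.

```python
import math

def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between 2-D points a and b under covariance cov."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form d^T * inverse(cov) * d, with the 2x2 inverse written out.
    d2 = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical centroids: with unit variances this is the Euclidean distance.
plain = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), [[1.0, 0.0], [0.0, 1.0]])
# Larger variances shrink the distance, because the same gap is less unusual.
scaled = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), [[9.0, 0.0], [0.0, 16.0]])
```

The point of the scaling is that a gap between centroids counts for more when the variation along that direction is small.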

December 05, 2010

World craniometric analysis with MCLUST revisited

I used MCLUST to cluster Howells' world craniometric dataset back in 2004.

Now, I am revisiting the issue by using MCLUST on the entirety of Howells' dataset, including both the "training" and "testing" datasets. Moreover, in the interest of transparency, I've placed all the necessary code to repeat the analysis online. So, feel free to repeat or improve my experiment, or indeed shred it to pieces, if you like (see Appendix).

The most interesting part of this analysis for me is the inclusion of several Upper Paleolithic skulls in Howells' testing dataset, and I will show how MCLUST assigns them to very meaningful clusters.

Let's start:

Part I
MCLUST on the training data (2,524 skulls with 57 measurements/skull)

Different types of output are included in the RAR file; here I will only show the number of skulls from each population assigned to each of the 15 clusters inferred by MCLUST. Note that this is one more cluster than in my previous analysis, because I have included 20 extra Maori skulls.


The 15 clusters might be labeled:

1: Caucasoid, 2: Amerindian, 3: East Asian, 4: Mokapu, 5: Easter Island, 6: Tasmania, 7: Australoid, 8: Ainu, 9: Santa Cruz, 10: Bushman, 11: Andaman, 12: Buriat, 13: Negroid, 14: Moriori/Maori, 15: Eskimo

Note that skulls that fall in clusters other than the expected ones are due in part to the limitations of the method and in part to outliers, some of which were detected by Howells himself.

Part II
MCLUST on the training+test data (3,048 skulls with 57 measurements/skull)

Now, let's add the 524 test skulls and repeat the MCLUST analysis with all 3,048 skulls. I recommend looking at the howells.test.txt file in the bundle you can download, because it contains extra information about each skull.



As there are over 189 different populations and individual skulls, I am showing here only the first part of this table, on the training populations. All the data can be found in the download bundle in the frequency_all.csv file.

The 14 clusters inferred in this analysis can be labeled:

1: Linear Caucasoid, 2: Negroid, 3: Amerindian, 4: Moriori/Maori, 5: Santa Cruz, 6: Neandertal, 7: Lateral Caucasoid, 8: Mokapu/Easter Island, 9: Bushman, 10: Australoid, 11: East Asian, 12: Buriat, 13: Andaman, 14: Eskimo

The test data contains various populations as well as many individual skulls. You can look at them in detail in the download bundle, but, here, I will focus on some "famous" skulls.

First of all, notice that cluster #6 is a Neandertal cluster. It includes La Ferrassie I, La Chapelle, Skhul V, Shanidar 1, and Djebel Irhoud 1. Of course, I am aware that there are controversies about some of these skulls, but MCLUST doesn't seem to have any doubts: they are all placed in cluster #6 with 100% probability, and no other skulls have any probability of belonging to this cluster.

Chancelade, which was described as Eskimoid in the early literature, is assigned to the Linear Caucasoid cluster, and so are Predmost III and IV, Mladec 1, and Abri Pataud. The inclusion of Mladec 1, the earliest complete European (>30ky), in the main Caucasoid cluster undermines the idea that Caucasoid morphology developed in the Holocene, or more recent ideas that Eurasians were supposedly undifferentiated as recently as 18,000 years ago.

Cro-Magnon 1 is assigned to the lateral Caucasoid cluster, and so is Afalou-bou-Rhummel 5. It is interesting that the lateral Caucasoid cluster is centered on the population of Berg, described as "Alpine" in the classical sense by Howells, with Alpines being conjectured by Coon to be a foetalized evolutionary development of the Upper Paleolithic population. Cro-Magnon 1 is long-skulled and broad-faced, but its overall suite of measurements places it squarely as a European.

Grimaldi, described by some as Negroid, is actually assigned to the Australoid cluster, and so are Markina Gora, Djebel Qafzeh 6 and Keilor.

There are many other individual skulls and populations in the data, so feel free to look at it yourself. Also, if you have any other data that has been measured in Howells' standard variables, I'll be happy to include them in an MCLUST analysis.

Appendix

In order to run the experiment you need to follow these steps:
  1. Download and install R
  2. Launch R and in the menu Packages->Install package(s) choose to install the mclust package
  3. Load the mclust package via the Packages->Load package menu
  4. Download my code and extract it in the directory of your choice in your computer
  5. Change the directory in R via the File->Change dir menu
  6. Enter the command source("code.r") in the command prompt. This will take a while and will produce a series of files as it runs. You may open code.r in a text editor to see exactly what it does and/or to modify it.
I have bundled Howells' data in the RAR file, but you can just as well download it from the repository instead.

UPDATE (Dec 7): A new post has Mahalanobis distances between the 14 clusters inferred in the MCLUST analysis.

December 04, 2010

Y-chromosome gene pool of Western Slavs

Interesting tidbit from the paper:
Age calculations based on evolutionary and pedigree mutation rates gave significantly different date estimates, 5.5–8.0 and 2.3–3.4 ky, respectively. In our opinion, the age calculations of the subcluster R1a1-WSL based on the pedigree mutation rate appear to be more consistent with the archeological record, as well as with the limited distribution of this Y-STR subcluster in Europe.
So, this paper, together with two other papers on Roma, and the one on Maronites, is added to my recent enumeration of cases where the pedigree (or germline, or genealogical) mutation rate gives better results than the "evolutionary" rate. Since both analysis of the Y-STR mutation model and empirical data suggest the superiority of the pedigree rate, it is perplexing that the evolutionary rate continues to persist in the literature.

Getting back to the paper:
Southern parts of present Poland were under Celtic influence. In the second century B.C., the Celts arrived in southern Poland via the Moravia and Bohemia regions, where they prevailed with their La Tène culture from the fifth century B.C. Therefore, it is probable that the R1a/R1b proportion varied in those regions according to the degree of influence of one population or another (i.e., Slavic or Celtic).
I recently suggested a possible Celtic or Germanic link with some R1b subclades, and the presence of both R-U106 and R-U152 clades in Western Slavs (from the Myres et al. paper) suggests that both processes may have been important. It will be interesting to see ancient DNA studies confirm or disprove these hypotheses about an ethnic affiliation of particular Y-chromosome lineages.

American Journal of Physical Anthropology DOI: 10.1002/ajpa.21253

Similarities and Distinctions in Y Chromosome Gene Pool of Western Slavs

Marcin Wozniak et al.

Analysis of Y chromosome Y-STRs has proven to be a useful tool in the field of population genetics, especially in the case of closely related populations. We collected DNA samples from 169 males of Czech origin, 80 males of Slovakian origin, and 142 males dwelling Northern Poland. We performed Y-STR analysis of 12 loci in the samples collected (PowerPlex Y system from Promega) and compared the Y chromosome haplotype frequencies between the populations investigated. Also, we used Y-STR data available from the literature for comparison purposes. We observed significant differences between Y chromosome pools of Czechs and Slovaks compared to other Slavic and European populations. At the same time we were able to point to a specific group of Y-STR haplotypes belonging to an R1a haplogroup that seems to be shared by Slavic populations dwelling in Central Europe. The observed Y chromosome diversity may be explained by taking into consideration archeological and historical data regarding early Slav migrations.

December 01, 2010

Human genetic variation: the first 50 dimensions

Here is a huge data dump for anyone interested in human variation. Part of the reason I started the Dodecad Project was to be able to analyze data on my own, rather than having to squint to make sense of a plot, to speculate about what might show up at higher dimensions, or with more clusters, to wonder how the inclusion of additional populations would affect the results, and so on.

The following dataset represents the culmination (so far) of my efforts.

Number of SNP markers: ~177,000 as in here
Populations: 139
Individuals: 2,230

In the RAR file (~11MB) you will find 49 scatterplots (5000x5000 pixels each) representing the first 50 dimensions of a multi-dimensional scaling analysis of this dataset, together with information about the samples and their sources. There is a plot of the 1st and 2nd dimensions, 2nd and 3rd, 3rd and 4th, and so on, until the 49th and 50th.

I don't believe Picasa allows such huge pics, so I've made a few smaller (still 1600x1600 pixels each) ones to give you an idea of what to expect. Note that the legend in these small ones is only partly visible.

In all plots, population labels have been placed on the population averages; these usually correspond to blobs of datapoints belonging to that population, but occasionally they are shifted due to the presence of outliers.
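To see why a label placed on the population average can drift away from the blob, here is a small sketch with made-up coordinates (not taken from the actual MDS output); the function name is hypothetical.

```python
from statistics import mean

def centroid(points):
    """Coordinate-wise mean of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))

# Four hypothetical individuals forming a tight blob around (1, 1) ...
blob = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.0, 1.0)]
# ... and the same population with a single outlier added.
with_outlier = blob + [(6.0, 6.0)]

label_without = centroid(blob)          # sits on the blob
label_with = centroid(with_outlier)     # pulled towards the outlier
```

A single outlier among five individuals is enough to move the label from about (1, 1) to about (2, 2), outside the blob itself.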

Before I proceed, it might be worth giving a visual representation of the three poles of human variation in its broadest context; these are Basques/Sardinians, Mbuti/Biaka Pygmies, and She. Well, these are marginally more toward the three poles than many others, but they will do:



Mbuti image by Mikael Strandberg; She image from Portraits of Chinese ethnic groups and links therein.

1 vs 2


3 vs 4

5 vs 6

7 vs 8

Inspection of these plots gives you an idea of why Clusters Galore works so well. It can detect "clusteredness" of individuals along multiple dimensions. It does not look at a series of 2D plots, but it considers proximity of individuals to each other along multiple dimensions, and adapts to the shape, size, and orientation of the clusters.

Y-chromosomes of Maronites from Lebanon

The freely available supplementary material contains a real treasure trove of Y-STR haplotypes for different populations of Lebanon and from Iran.

UPDATE: The paper uses the wrong Zhivotovsky et al. "evolutionary" mutation rate, so its age estimates are inflated 3-fold. Hence, its conclusion that religious differences were superimposed on an already structured population is also wrong, in my opinion.

They write, for example, that:
The Christian–Muslim split dated to 3475 (2000–6025) ybp for pooled Muslims and 3325 (1875–4225) ybp for pooled Christians.
Divide these by 3 and you get about 1.2ky which is quite close (given the huge confidence intervals, of course) to the arrival of Islam to the country. Once again, the genealogical mutation rate conforms with history, while the "evolutionary" one suggests a speculative scenario about the supposed long-term maintenance of structure on which the Islam-Christian distinction was superimposed.
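The division can be spelled out as follows. This is only the simple linear rescaling described above, under the assumption that the published ages scale inversely with the mutation rate by a factor of 3; applying the same factor to the interval bounds is my own simplification, and the variable names are hypothetical.

```python
RATE_RATIO = 3.0  # assumed ratio between the genealogical and "evolutionary" rates

def rescale(age_ybp, factor=RATE_RATIO):
    """Rescale a published age (years before present) by the rate ratio."""
    return age_ybp / factor

# Published point estimates and interval bounds, as quoted above (ybp).
published = {
    "pooled Muslims": (3475, 2000, 6025),
    "pooled Christians": (3325, 1875, 4225),
}

rescaled = {
    group: tuple(round(rescale(value)) for value in values)
    for group, values in published.items()
}
```

The rescaled point estimates come out at roughly 1.1 to 1.2 ky, with intervals of several centuries on either side.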

European Journal of Human Genetics advance online publication 1 December 2010; doi: 10.1038/ejhg.2010.177

Influences of history, geography, and religion on genetic structure: the Maronites in Lebanon

Marc Haber et al.

Cultural expansions, including of religions, frequently leave genetic traces of differentiation and in-migration. These expansions may be driven by complex doctrinal differentiation, together with major population migrations and gene flow. The aim of this study was to explore the genetic signature of the establishment of religious communities in a region where some of the most influential religions originated, using the Y chromosome as an informative male-lineage marker. A total of 3139 samples were analyzed, including 647 Lebanese and Iranian samples newly genotyped for 28 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y chromosome. Genetic organization was identified by geography and religion across Lebanon in the context of surrounding populations important in the expansions of the major sects of Lebanon, including Italy, Turkey, the Balkans, Syria, and Iran by employing principal component analysis, multidimensional scaling, and AMOVA. Timing of population differentiations was estimated using BATWING, in comparison with dates of historical religious events to determine if these differentiations could be caused by religious conversion, or rather, whether religious conversion was facilitated within already differentiated populations. Our analysis shows that the great religions in Lebanon were adopted within already distinguishable communities. Once religious affiliations were established, subsequent genetic signatures of the older differentiations were reinforced. Post-establishment differentiations are most plausibly explained by migrations of peoples seeking refuge to avoid the turmoil of major historical events.


Paleoamerican Morphology in the Context of European and East Asian Late Pleistocene Variation (Hubbe et al. 2010)

I will write more on this paper later, as it is very important, and not just about the origins of Native Americans.

American Journal of Physical Anthropology DOI: 10.1002/ajpa.21425

Paleoamerican morphology in the context of European and East Asian late Pleistocene variation: Implications for human dispersion into the new world

Mark Hubbe, Katerina Harvati, Walter Neves

Abstract

Early American crania show a different morphological pattern from the one shared by late Native Americans. Although the origin of the diachronic morphological diversity seen on the continents is still debated, the distinct morphology of early Americans is well documented and widely dispersed. This morphology has been described extensively for South America, where larger samples are available. Here we test the hypotheses that the morphology of Early Americans results from retention of the morphological pattern of Late Pleistocene modern humans and that the occupation of the New World precedes the morphological differentiation that gave rise to recent Eurasian and American morphology. We compare Early American samples with European Upper Paleolithic skulls, the East Asian Zhoukoudian Upper Cave specimens and a series of 20 modern human reference crania. Canonical Analysis and Minimum Spanning Tree were used to assess the morphological affinities among the series, while Mantel and Dow-Cheverud tests based on Mahalanobis Squared Distances were used to test different evolutionary scenarios. Our results show strong morphological affinities among the early series irrespective of geographical origin, which together with the matrix analyses results favor the scenario of a late morphological differentiation of modern humans. We conclude that the geographic differentiation of modern human morphology is a late phenomenon that occurred after the initial settlement of the Americas.


Breast-cancer causing mutation in Ashkenazi Jews came from Europeans

The interesting question is: why did this become frequent in AJ? If this mutation did indeed enter the AJ gene pool half a millennium ago, then it may be within the reach of genealogists and historians to uncover its origins.

European Journal of Human Genetics , (1 December 2010) | doi:10.1038/ejhg.2010.203

On the origin and diffusion of BRCA1 c.5266dupC (5382insC) in European populations

The BRCA1 mutation c.5266dupC was originally described as a founder mutation in the Ashkenazi Jewish (AJ) population. However, this mutation is also present at appreciable frequency in several European countries, which raises intriguing questions about the origins of the mutation. We genotyped 245 carrier families from 14 different population groups (Russian, Latvian, Ukrainian, Czech, Slovak, Polish, Danish, Dutch, French, German, Italian, Greek, Brazilian and AJ) for seven microsatellite markers and confirmed that all mutation carriers share a common haplotype from a single founder individual. Using a maximum likelihood method that allows for both recombination and mutational events of marker loci, we estimated that the mutation arose some 1800 years ago in either Scandinavia or what is now northern Russia and subsequently spread to the various populations we genotyped during the following centuries, including the AJ population. Age estimates and the molecular evolution profile of the most common linked haplotype in the carrier populations studied further suggest that c.5266dupC likely entered the AJ gene pool in Poland approximately 400–500 years ago. Our results illustrate that (1) BRCA1 c.5266dupC originated from a single common ancestor and was a common European mutation long before becoming an AJ founder mutation and (2) the mutation is likely present in many additional European countries where genetic screening of BRCA1 may not yet be common practice.


November 30, 2010

Cluster galore: re-analysis of Behar et al. (2010) data

I have re-analyzed the data of Behar et al. (2010) using my Clusters Galore method. See my previous post on the HGDP panel for some technical details.

Here are the 47 clusters of the optimal mclust solution over the MDS representation retaining 26 dimensions:

Each row has the number of individuals who are mapped to each of the 47 clusters. Here are a few comments:

The discovery that Jewish populations can be subdivided into numerous clusters is not inconsistent with Behar et al. (2010) and their observation of the existence of three major clusters in Jewish populations. This is a difference of detail.

Most clusters strongly map to single populations; many populations with "tribal" traditions and high levels of consanguinity are split into multiple clusters, suggesting the existence of sub-structure in them. And, there are a few clusters spanning several populations, such as #1 (Balto-Slavic), #22 (Syrians-Jordanians-Lebanese), #29 (Ethiopians and Ethiopian Jews), #25 (Romanians and Hungarians), #31 (Iranian and Iraqi Jews).

November 29, 2010

Clusters galore in HGDP panel

For background on this type of analysis, please read:

I've taken the Stanford HGDP dataset and extracted the markers common to it and to HapMap-3, Behar et al. (2010), Rasmussen et al. (2010) and the 23andMe v2 genotyping platform, or about 500k SNPs in total (I removed C/G and A/T SNPs as a precaution and flipped strand in discordant ones to the HapMap-3 standard when it differed from that of HGDP).
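The reason for dropping C/G and A/T SNPs is that their two alleles are each other's complement, so a strand flip cannot be detected from the alleles alone. A minimal sketch of that logic follows; the function names are hypothetical and this is not the script I actually used.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return COMPLEMENT[a1] == a2

def flip(alleles):
    """Express a pair of alleles on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)

def harmonize(alleles, reference):
    """Return alleles on the reference strand, or None if the SNP is dropped."""
    if is_strand_ambiguous(*alleles):
        return None                      # removed as a precaution
    if set(alleles) == set(reference):
        return tuple(alleles)            # already concordant
    flipped = flip(alleles)
    if set(flipped) == set(reference):
        return flipped                   # discordant, fixed by flipping strand
    return None                          # does not match on either strand

same = harmonize(("A", "G"), ("A", "G"))
fixed = harmonize(("T", "C"), ("A", "G"))
dropped = harmonize(("A", "T"), ("A", "T"))
```

Here `reference` stands for the alleles as reported on the HapMap-3 strand.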

I removed SNPs with less than 99% genotyping rate in any of the four data sources, and about 434k SNPs were retained. Subsequently I applied linkage disequilibrium-based pruning on the HGDP set (PLINK parameter: --indep-pairwise 50 5 0.3) resulting in a final dataset of about 177k SNPs. In all analyses of the HGDP set, I followed the recommendations of Rosenberg et al. (2006) keeping the 940 individuals in common between his 952-individual panel and the Stanford data.
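The genotyping-rate filter amounts to keeping a SNP only if it passes the 99% threshold in every data source. A sketch, assuming genotypes are given as lists of strings with "0" marking a missing call (a hypothetical encoding chosen for this example):

```python
def call_rate(genotypes, missing="0"):
    """Fraction of non-missing genotype calls for one SNP in one data source."""
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)

def passes_all_sources(per_source_genotypes, threshold=0.99):
    """Keep a SNP only if its call rate reaches the threshold in every source."""
    return all(call_rate(g) >= threshold for g in per_source_genotypes)

# Hypothetical SNP typed in two sources: 1 missing call out of 200, none out of 100.
kept = passes_all_sources([["AA"] * 199 + ["0"], ["AG"] * 100])
# Hypothetical SNP with 2 missing calls out of 100 in the first source.
removed = not passes_all_sources([["AA"] * 98 + ["0", "0"], ["AG"] * 100])
```

A SNP that fails in any one source is dropped from all of them, so that the merged dataset stays complete.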

Subsequently I ran multidimensional-scaling (MDS) on the 940 individual/57 population/177k SNP set in PLINK, and then I applied model-based clustering as implemented in mclust over the first 42 MDS dimensions, with a maximum number of clusters = 70. In total there were 64 clusters in the optimal solution suggested by mclust (*)

Before I give the results, it might be worth looking at the pairwise MDS scatterplots for just the first 5 dimensions:

As you can see, clusteredness emerges in different dimensions. Rather than inspecting innumerable 2D combinations visually (and indeed we should inspect 3D, etc. as well, because clusters might emerge in 3D and higher subspaces that are not discernible in 2D projections), we let mclust iterate over k, the number of clusters, and different shapes, orientations, and volumes of clusters, using the well-known EM algorithm together with the Bayesian Information Criterion to choose a good solution that maximizes detail without sacrificing parsimony.
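For readers who have not met the EM algorithm, here is a deliberately tiny sketch: two Gaussian components in one dimension, fitted to made-up numbers. It only illustrates the alternation between the E-step and the M-step; mclust works in many dimensions, over many values of k and several covariance forms, and its internals are not reproduced here.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    weights = [0.5, 0.5]
    means = [min(data), max(data)]           # crude but deterministic start
    variances = [overall_var, overall_var]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsed component
    return weights, means, variances

# Hypothetical data: two well-separated groups of four points each.
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = em_two_gaussians(data)
```

The posterior probabilities computed in the E-step are the same kind of quantity that mclust reports for each individual and cluster.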

Below you can see how many individuals are assigned to each of the 64 clusters from each of the 57 populations:


This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding, are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity. And, there are a few populations, such as Tuscans and North Italians, that are not split. But, the fact that this was inferred from unlabeled individuals is remarkable.

I remember reading Rosenberg et al. (2002), "The genetic structure of human populations" (pdf) which used structure, a model-based algorithm on raw genetic data to infer the existence of 6 clusters corresponding to continental populations. How is it that so much more detail can be achieved today?

There are three reasons: First, dense genotyping data are much better than the few hundreds of microsatellites used by Rosenberg et al. (2002). Second, the use of dimensionality reduction in the form of MDS allowed us to remove most of the "noise" in the genotyping data and focus on dimensions capturing a lot of distinctions. Third, a sophisticated clustering algorithm such as mclust, which can adapt to clusters of different shape, size, and orientation without human input, was able to produce this result. mclust is computationally expensive, but it works like a charm (in a few minutes) with a few dozen dimensions and about a thousand individuals, producing a clustering of obviously good value.

How to repeat the experiment

If anyone wants to repeat this experiment they can do it easily. After you've managed to put the HGDP data into PLINK ped/map format, say in files HGDP.ped and HGDP.map (or any other data for that matter), just run

> plink --cluster --mds-plot d --file HGDP

Where d is the number of dimensions you want to retain. This produces a plink.mds file in which there is a header line, and each line after that corresponds to an individual: the individual's coordinates in the first d dimensions are in columns 4 to d+3.
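If you want to inspect the file outside R, the layout just described can be parsed as in the sketch below. The header names (FID, IID, SOL, C1, C2) are what I expect from PLINK 1.x but should be treated as an assumption, the sample rows are invented, and the function name is hypothetical.

```python
def read_mds(lines, d):
    """Parse the text lines of a plink.mds file into (ids, coordinates)."""
    rows = [line.split() for line in lines if line.strip()]
    body = rows[1:]                                  # skip the header line
    ids, coords = [], []
    for fields in body:
        ids.append((fields[0], fields[1]))           # first two columns: IDs
        coords.append([float(v) for v in fields[3:3 + d]])  # columns 4 .. d+3
    return ids, coords

# Invented two-individual example with d = 2.
example = """FID IID SOL C1 C2
POP1 ind1 0 0.0123 -0.0456
POP2 ind2 0 -0.0789 0.0012
""".splitlines()

ids, coords = read_mds(example, d=2)
```

For a real file you would pass `open("plink.mds")` instead of the example lines.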

Then, in R, after you install and load the mclust package (see the MCLUST page for limitations on its use and licensing information), you just run:

> MDS <- read.table("plink.mds", header=TRUE)
> d <- 42   # must match the d passed to --mds-plot; 42 was used for this post
> maxclust <- 70
> MCLUST <- Mclust(MDS[, 4:(d+3)], G=1:maxclust)

where maxclust is the maximum number of clusters you want to consider.

Then, if you run:

> MCLUST$z

you will see a table in which each line corresponds to an individual, and the i-th column gives the probability that the individual belongs to the i-th cluster.
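The population-by-cluster count tables shown in these posts follow from that probability table by assigning each individual to its most probable cluster and counting. A sketch of that step, with invented probabilities and hypothetical function names (in R, the `classification` component of the Mclust result holds the same hard assignment):

```python
from collections import Counter

def hard_assign(z_row):
    """1-based index of the most probable cluster for one individual."""
    return max(range(len(z_row)), key=lambda i: z_row[i]) + 1

def cross_tabulate(populations, z):
    """Count individuals per (population, cluster) pair."""
    return Counter((pop, hard_assign(row)) for pop, row in zip(populations, z))

# Invented example: three individuals, two clusters.
populations = ["Basque", "Basque", "Sardinian"]
z = [[0.98, 0.02], [0.60, 0.40], [0.01, 0.99]]
table = cross_tabulate(populations, z)
```

Note that the hard assignment hides how confident the assignment is: the second individual above is only 60% likely to belong to cluster 1.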

There's much more that you can do in R with the mclust package, but this is enough for anyone wanting to repeat the experiment in its basic form.

(*) The number of clusters in the optimal solution varied between 11 with 2 dimensions retained and 64 with 42 dimensions retained. There was a secondary maximum of 60 clusters with 30 dimensions retained; choosing more dimensions than 42 (up to 50 that I examined), also resulted in a very high number of clusters, but I've decided to keep the one with 42 dimensions and 64 clusters as it is enough to serve the purpose of this post.

Human effective sex ratio: different at different time scales

The authors manage to harmonize the seemingly contradictory results of Keinan et al. and Hammer et al.

From the paper:
Recently, two studies estimated Q in order to detect sex biases in similar human populations [16,17] and found seemingly contradictory conclusions [25]. Using SNP data from the International HapMap Project [26], Keinan et al. found evidence for a male bias during the dispersal of modern humans out of Africa (Figure 1A) [17]. Hammer and colleagues, however, found evidence for a female bias throughout human history in six populations from the Human Genome Diversity Panel (HGDP) (Figure 1A) [16].

This figure from the paper shows the model inferred by the authors which resolves the seeming contradiction.

They write:
Long-term sex-biased processes, such as polygyny or higher female dispersal rates in ancestral human populations, likely caused the Qπ estimates found by Hammer et al.
but:
The male bias detected by Keinan et al. can be explained by a recent event associated with the out-of-Africa dispersal, as initially proposed by the authors. The Q ratios detected by Keinan et al. suggest a very strong male bias for the entire portion of the non-African lineage before the split of Asians from Europeans.

I am not entirely convinced of this explanation. The authors' model suggests a higher male/female ratio in Eurasians than in Africans due to male bias in the Eurasian lineage against an ancestral background of high female/male ratio (due to polygyny).

But, an alternative explanation is that the higher female/male ratio in Africans is due to the fact that they are descended from a relatively small number of males who overwhelmed the pre-existing African gene pool.

There are reasons to believe this is the case: Africa has the deepest lineages in the human Y-chromosome phylogeny (A and B), but the balance is made up entirely of haplogroup E chromosomes, the sister clade of Eurasian D. The extremely diverse Eurasian haplogroup F is represented only by some subclades in Africa, due to back-migration.

So, while Eurasian males are descended from the expansion of F and DE males, African males are largely descended from the expansion of E males. These are the Afrasians I've often spoken of, the common ancestors of Eurasians and Africans. In Africa, the Afrasians could take the women of the Paleo-Africans, but Eurasia was largely empty land, and the Eurasians could only take the women they've brought with them.
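For orientation, here is how the sex ratio moves a ratio of the kind discussed above. I am assuming that Q denotes the ratio of X-chromosome to autosomal effective population size and I use the standard textbook formulas for a population with Nm breeding males and Nf breeding females; the paper's actual estimators are more elaborate, and the numbers below are invented.

```python
def n_autosomal(nm, nf):
    """Autosomal effective population size for nm males and nf females."""
    return 4 * nm * nf / (nm + nf)

def n_x(nm, nf):
    """X-chromosome effective population size for nm males and nf females."""
    return 9 * nm * nf / (4 * nm + 2 * nf)

def q_ratio(nm, nf):
    """Ratio of X-chromosome to autosomal effective population size."""
    return n_x(nm, nf) / n_autosomal(nm, nf)

equal = q_ratio(1000, 1000)        # equal numbers of breeding males and females
female_bias = q_ratio(500, 1500)   # fewer breeding males, e.g. under polygyny
male_bias = q_ratio(1500, 500)     # fewer breeding females
```

With equal numbers the ratio is 3/4; an excess of breeding females pushes it above 3/4, and an excess of breeding males pushes it below.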


The American Journal of Human Genetics, 24 November 2010
doi:10.1016/j.ajhg.2010.10.021

Estimators of the Human Effective Sex Ratio Detect Sex Biases on Different Timescales

Leslie S. Emery

Determining historical sex ratios throughout human evolution can provide insight into patterns of genomic variation, the structure and composition of ancient populations, and the cultural factors that influence the sex ratio (e.g., sex-specific migration rates). Although numerous studies have suggested that unequal sex ratios have existed in human evolutionary history, a coherent picture of sex-biased processes has yet to emerge. For example, two recent studies compared human X chromosome to autosomal variation to make inferences about historical sex ratios but reached seemingly contradictory conclusions, with one study finding evidence for a male bias and the other study identifying a female bias. Here, we show that a large part of this discrepancy can be explained by methodological differences. Specifically, through reanalysis of empirical data, derivation of explicit analytical formulae, and extensive simulations we demonstrate that two estimators of the effective sex ratio based on population structure and nucleotide diversity preferentially detect biases that have occurred on different timescales. Our results clarify apparently contradictory evidence on the role of sex-biased processes in human evolutionary history and show that extant patterns of human genomic variation are consistent with both a recent male bias and an earlier, persistent female bias.


Y-chromosomes of Niger-Congo groups

Interesting that African farmers, like their European counterparts, seem to have dispersed rapidly at the beginning. Hopefully, ancient DNA analysis in Europe will be able to discover their Y-chromosomes, as the inference from modern populations is not as clear-cut as in the (more recent) spread of Bantu farmers.

Mol Biol Evol (2010) doi: 10.1093/molbev/msq312

Y-chromosomal variation in Sub-Saharan Africa: insights into the history of Niger-Congo groups

Cesare de Filippo et al.

Abstract

Technological and cultural innovations, as well as climate changes, are thought to have influenced the diffusion of major language phyla in sub-Saharan Africa. The most widespread and the richest in diversity is the Niger-Congo phylum, thought to have originated in West Africa ∼10,000 years ago. The expansion of Bantu languages (a family within the Niger-Congo phylum) ∼5,000 years ago represents a major event in the past demography of the continent. Many previous studies on Y chromosomal variation in Africa associated the Bantu expansion with haplogroup E1b1a (and sometimes its sub-lineage E1b1a7). However, the distribution of these two lineages extends far beyond the area occupied nowadays by Bantu speaking people, raising questions on the actual genetic structure behind this expansion. To address these issues, we directly genotyped 31 biallelic markers and 12 microsatellites on the Y chromosome in 1195 individuals of African ancestry focusing on areas that were previously poorly characterized (Botswana, Burkina Faso, D.R.C, and Zambia). With the inclusion of published data, we analyzed 2736 individuals from 26 groups representing all linguistic phyla and covering a large portion of Sub-Saharan Africa. Within the Niger-Congo phylum, we ascertain for the first time differences in haplogroup composition between Bantu and non-Bantu groups via two markers (U174 and U175) on the background of haplogroup E1b1a (and E1b1a7), which were directly genotyped in our samples and for which genotypes were inferred from published data using Linear Discriminant Analysis on STR haplotypes. No reduction in STR diversity levels was found across the Bantu groups, suggesting the absence of serial founder effects. In addition, the homogeneity of haplogroup composition and pattern of haplotype sharing between Western and Eastern Bantu groups suggest that their expansion throughout Sub-Saharan Africa reflects a rapid spread followed by backward and forward migrations. 
Overall, we found that linguistic affiliations played a notable role in shaping sub-Saharan African Y chromosomal diversity, although the impact of geography is clearly discernible.


November 27, 2010

Clusters galore: extremely fine-scale ancestry inference

By way of introduction, here is the command that literally made me jump from my seat:
> MCLUST <- Mclust(X,G=1:36)
Warning messages:
1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
best model occurs at the min or max # of components considered
2: In Mclust(X, G = 1:36) :
optimal number of clusters occurs at max choice
It may look like gibberish, but this is what happened when I tried to apply Model-based clustering as implemented in the R package mclust, over the first few dimensions of Multidimensional Scaling (MDS) of my standard 36-population, 692-individual dataset I have been using in the Dodecad Project.
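What the warning says is that the best-scoring number of clusters sits at the edge of the range that was searched, so the true optimum may lie beyond it. A sketch of that check follows; the BIC values are invented for illustration (using the convention that larger is better) and the function name is hypothetical.

```python
def best_k(bic_by_k):
    """Return (k with the highest BIC, whether k is at the edge of the range)."""
    k_best = max(bic_by_k, key=bic_by_k.get)
    at_boundary = k_best in (min(bic_by_k), max(bic_by_k))
    return k_best, at_boundary

# Invented scores: the best k is the largest one tried, so widen the search.
first_try = best_k({34: -1200.0, 35: -1150.0, 36: -1100.0})
# Invented scores after widening: the best k is now interior to the range.
second_try = best_k({36: -1100.0, 40: -1050.0, 44: -1075.0})
```

When the flag is true, the sensible next step is to rerun with a larger maximum number of clusters.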

But, let's take the story from the beginning...

The basic idea

When we look at an MDS or PCA plot, like the following MDS plot of the 11 HapMap-3 populations, it is obvious that individuals form clusters.

Here are dimensions 1 and 2:
West and East Eurasians form a cluster, and Africans form an elongated cluster towards West Eurasians. Gujaratis and Mexicans overlap between West and East Eurasians.

Here are dimensions 2 and 3:
Here, the Gujarati are shown to be quite different from the Mexicans.

We can use a standard clustering algorithm such as k-means to infer the existence of these clusters. This has two benefits:
  • We don't have to visually inspect an exponential number of 2D scatterplots
  • We can put some actual numbers on our visual impression of the existence of clusters
Actually, k-means is not a very good way to find clusters, for two reasons:
  • You have to specify k. But, how can you know which k is supported by the data, unless you look at an exponential number of 2D scatterplots?
  • k-means, using the Euclidean distance measure, prefers "spherical" clusters. But, as you can see, some populations, especially recently admixed ones, form elongated clusters, stretched towards their two (or more) ancestral populations.
I had previously used mclust, a model-based clustering algorithm, to infer the existence of 14 different clusters in a standard worldwide craniometric dataset. This was 6 years ago, and only recently have geneticists been able to reach that level of resolution with genomic data.

But, for assessing ancestry, genomic data are obviously much better than craniometric ones: the latter reflect both genes and environmental/developmental factors.

So, while 6 years ago I had neither the computing power nor the data to push the envelope of fine-scale ancestry inference, today that's possible.

What mclust does (in a nutshell)

mclust has many bells and whistles for anyone willing to study it, but the basic idea is this: the program iterates over different k and different "forms" of clusters (e.g., spherical or ellipsoidal) and picks the best combination.

Best is defined as the one that maximizes the Bayesian Information Criterion (BIC). Without getting too technical, this tries to balance the "detail" of the model (how many parameters it has, e.g., k) with its parsimony (how conservative it is in inferring the existence of phantom clusters).
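To make the trade-off concrete, here is a minimal sketch in R of the BIC in the sign convention mclust uses (larger is better): twice the log-likelihood, minus a penalty that grows with the number of parameters. The function name bic_mclust and the two sets of numbers are made up for illustration; they are not results from any real run.

```r
# BIC in the "larger is better" convention used by mclust:
#   BIC = 2 * log-likelihood - (number of parameters) * log(number of observations)
bic_mclust <- function(loglik, n_params, n_obs) {
  2 * loglik - n_params * log(n_obs)
}

n_obs <- 275  # hypothetical sample size

# Hypothetical simple model: slightly worse fit, few parameters
bic_simple <- bic_mclust(loglik = -1200, n_params = 10, n_obs = n_obs)

# Hypothetical complex model: slightly better fit, many more parameters
bic_complex <- bic_mclust(loglik = -1195, n_params = 40, n_obs = n_obs)

# The small gain in fit does not pay for 30 extra parameters,
# so the simpler model has the higher BIC and would be preferred.
print(c(simple = bic_simple, complex = bic_complex))
```

A more complex model only wins when its improvement in log-likelihood outweighs the penalty for the extra parameters, which is what guards against phantom clusters.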

How to combine mclust with PCA or MDS

mclust does not work on 0/1 binary SNP data; it needs continuous numerical data such as skull measurements. However, that's not a problem, because you can convert 0/1 (or ACGT) SNP data into continuous variables using either MDS or PCA.

From a few hundred thousand SNPs, representing each individual, you get a few dozen numerical values placing the individual along each of the first few dimensions of MDS or PCA.

You can then run mclust over that reduced-dimensional representation. This is exactly what I attempted to do.
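As a rough sketch of this pipeline in R, assuming the mclust package is installed: the example below simulates genotypes for two hypothetical populations, reduces them with base R's cmdscale (a stand-in for the MDS implemented in PLINK, not the same computation), and hands the coordinates to Mclust. All variable names and simulation parameters are illustrative assumptions.

```r
library(mclust)  # assumed to be installed

set.seed(1)
n_per_pop <- 30   # hypothetical individuals per population
n_snps    <- 500  # hypothetical number of SNPs

# Hypothetical allele frequencies for two simulated populations
p1 <- runif(n_snps, 0.1, 0.9)
p2 <- pmin(pmax(p1 + rnorm(n_snps, 0, 0.3), 0.05), 0.95)

# Genotypes coded as 0/1/2 copies of an allele; rows = individuals, columns = SNPs
geno <- rbind(
  matrix(rbinom(n_per_pop * n_snps, 2, rep(p1, each = n_per_pop)), nrow = n_per_pop),
  matrix(rbinom(n_per_pop * n_snps, 2, rep(p2, each = n_per_pop)), nrow = n_per_pop)
)
pop <- rep(c("pop1", "pop2"), each = n_per_pop)

# Step 1: reduce the SNP data to a few continuous dimensions (classical MDS)
X <- cmdscale(dist(geno), k = 4)

# Step 2: model-based clustering on the reduced representation,
# letting mclust choose the number of clusters (1 to 6) and the model form
fit <- Mclust(X, G = 1:6)

print(fit$G)          # number of clusters chosen by BIC
print(fit$modelName)  # covariance model chosen by BIC
print(table(pop, cluster = fit$classification))
```

If the optimal number of clusters lands at the top of the G range, mclust issues the warning shown at the start of this post, which is a hint to widen the range.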

Clusters galore in HapMap-3 populations

I had previously used ADMIXTURE to infer admixture in the HapMap-3 populations, reaching K=9. So, naturally, I wanted to see whether the approach I just described could do as well as ADMIXTURE.

I used about 177k SNPs after quality control and linkage-disequilibrium-based pruning, and ran MDS as implemented in PLINK over a set of 275 individuals, 25 from each of the 11 HapMap-3 populations. I kept 11 dimensions, equal to the number of populations.

MDS took a few minutes to complete. Subsequently I ran mclust on the 275 individuals, allowing k to be as high as 11. Thus, if there were as many clusters as populations, I wanted mclust to find them. mclust finished running in a second. Here are the results (population averages):
The software essentially rediscovered the existence of 10 different populations in the data, but was unable to split the Denver Chinese from the Beijing Chinese. Notice also a mysterious low-frequency component in the Maasai reminiscent of that which appeared in the previous ADMIXTURE experiment.

A question might arise: why do most of these populations look completely unadmixed? Even the Mexicans and African Americans get their own cluster. This is due to mclust's ability to use clusters of different shape. In particular, the "best" model was the one called "VVI", which allows for diagonal clusters of varying volume and shape. In short, the software detected the presence of the elongated clusters associated with the admixed groups.

Indeed, the approach I am describing is not really measuring admixture. It is quantifying the probability that a sample is drawn from each of a set of inferred populations. Hence it is not really suitable for recently admixed individuals, but works like a charm in guessing the population labels of unlabeled individuals.

Clusters galore in Eurasia

Let's now see what clusters are inferred in the 36-population, 692-individual dataset I commonly use in the Dodecad Project. This was done with 177k SNPs, retaining 36 MDS dimensions and allowing k to be as high as 36. This is what made me jump from my seat, and since I don't have enough colors to represent it, I'll put it in tabular form:

I could hardly believe this when I saw it, but the conclusion is inescapable: dozens of distinct populations can be inferred from unlabeled individual data, and, on a posteriori inspection, they largely correspond to the individuals' population labels.

UPDATE: The above table has the average probabilities for the 36 clusters, but a better way might be to look at how many individuals are assigned to each cluster from each population:


For example, out of the 28 French individuals, 23 are assigned to cluster #1 (the French-CEU cluster), and 5 to cluster #3 (the North-Italian/Spanish/Tuscan cluster).
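Both kinds of table can be built in a few lines of base R. The sketch below uses a tiny made-up matrix of membership probabilities z and made-up population labels pop; with a real mclust fit, the corresponding pieces would be fit$z and fit$classification.

```r
# Hypothetical population labels and cluster-membership probabilities
pop <- c("French", "French", "French", "Tuscan", "Tuscan")
z <- matrix(c(0.99, 0.01,
              0.95, 0.05,
              0.36, 0.64,
              0.02, 0.98,
              0.10, 0.90),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("cluster1", "cluster2")))

# Table 1: average membership probability per population
# (sum the probabilities within each population, then divide by its size)
avg_prob <- rowsum(z, pop) / as.vector(table(pop))
print(round(avg_prob, 3))

# Table 2: number of individuals from each population assigned to each cluster
# (each individual goes to its most probable cluster)
hard <- apply(z, 1, which.max)
counts <- table(pop, cluster = hard)
print(counts)
```

The count table is less sensitive to a few borderline individuals than the table of averages, since each individual contributes a whole vote to a single cluster.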

Some interesting observations:
  1. Some populations (e.g., CEU and French, or Belorussians and Lithuanians) remain unsplit even at K=36.
  2. Some populations are split into multiple components (e.g., Sardinians into 2)
  3. Some mini-clusters emerge (e.g., 4 clusters in Maasai, each of them corresponding to 8% of 25 = 2 individuals). These may correspond to pairs of relatives or very genetically close individuals.

Quantifying uncertainty

Naturally, we want to be able to assess how good a particular classification is. Fortunately, this is easy to do with mclust and its uncertainty feature. Looking at my 692-individual dataset, 687 individuals have an uncertainty below 5%, and 682 below 1%. I did not inspect these fully, but some of the rest are "borderline" individuals who might belong to several components, e.g., a Frenchman who could either go to the CEU-French cluster #1 (36% probability) or the North/Central Italian-Spanish cluster #3 (64% probability).
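For anyone who wants to reproduce this kind of count, the uncertainty of an individual is simply one minus its highest membership probability. Here is a minimal base R sketch on a made-up probability matrix z (three hypothetical individuals, two clusters); with a real mclust fit the same values are available directly as fit$uncertainty.

```r
# Hypothetical membership probabilities: rows = individuals, columns = clusters
z <- matrix(c(0.999, 0.001,
              0.970, 0.030,
              0.360, 0.640),  # a borderline individual, like the Frenchman above
            ncol = 2, byrow = TRUE)

# Uncertainty = 1 - probability of the most likely cluster
uncertainty <- 1 - apply(z, 1, max)
print(uncertainty)

# How many individuals are classified with less than 5% / 1% uncertainty?
n_below_5pct <- sum(uncertainty < 0.05)
n_below_1pct <- sum(uncertainty < 0.01)
print(c(below_5pct = n_below_5pct, below_1pct = n_below_1pct))

# Indices of the borderline individuals worth inspecting by hand
borderline <- which(uncertainty >= 0.05)
print(borderline)
```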

Here is a dendrogram of the 36 components:



What does it all mean?

What this means, in short, is that the day of extremely fine-scale ancestry inference has arrived. We already had premonitions of this in the ability of researchers to place individuals within a few hundred km of their place of birth in Europe. Now, it is clear that model-based clustering + MDS/PCA can infer ethnic/national identity, or something quite close to it.

This is obviously just the beginning. I allowed K to vary from 1 to 36, not really expecting that the optimal number of clusters would be 36. This raises the question: are there more than 36?

...

UPDATE:

I have followed up on this exciting new technique in the Dodecad Project blog:

November 26, 2010

ADMIXTURE on the shores of the Indian Ocean

I have applied Multidimensional Scaling and ADMIXTURE on a dataset of 15 populations:
Cambodians, Papuan, NAN_Melanesian, Gujarati, Malayan, Paniya, North_Kannadi, Sakilli, Singaporean Indians, Singaporean Chinese, Singaporean Malay, Yemenese, Saudis, Maasai, Ethiopians
These were collected from HGDP, Behar et al. (2010), HapMap-3, and the Singapore Genome Variation Project. There are 423 individuals in total (I've used samples of 25 individuals from the HapMap populations).

Here is the MDS plot:



At the bottom are the Papuans, relatively unadmixed Australoids. Close to them, but deviating towards East Eurasians are the NAN Melanesians; these are the Nasioi, Papuan speakers from Bougainville, which they inhabit together with Austronesian speakers.

At the top left are the Singaporean Chinese (CHS) who are Mongoloids. Deviating from them towards Indians are the Cambodians, a Southeast Asian group which according to physical anthropology is a basically Mongoloid population, but admixed with a pre-Mongoloid southern population element similar to that which has been preserved in India. Similar to them are the Singaporean Malay (MAS), another population that is basically Mongoloid but has absorbed Indian-like population elements.

The Singaporean Indians (INS), the North Kannada, the Sakilli and the Gujarati (GIH25) form the third population element in the region of interest.

The other two elements are the Caucasoids, represented here by the Saudis, with the Yemenese spread toward Africa, and the East Africans, comprising the more Caucasoid-admixed Ethiopians and the relatively unadmixed Maasai (MKK25).

These are the main population elements of our region of interest: Ethiopids and Australoids framing the Ocean on the west and east; the South Asians occupying India, and the Mongoloids occupying Southeast Asia, having absorbed the Indian-like former inhabitants of the region.

Here is a blowup of the middle part of the MDS plot, focusing on the Indians:
It's fairly clear that North Kannada and Sakilli (South Indians) occupy a place that is furthest from Caucasoids, while Gujarati and Singaporean Indians are positioned towards Caucasoids (to the top-right).

Let's now turn to ADMIXTURE to confirm the visual impression from the MDS:

Notice the following components:
  1. Light blue, Indian
  2. Dark blue, East African
  3. Light green, Southeast Asian
  4. Dark green, Chinese Mongoloid
  5. Pink, Arabian Caucasoid
  6. Red, Australoid
Finally, here is the table of Fst distances between these 6 inferred components:

Notice the small distance (0.023) between Chinese and Southeast Asian Mongoloids. The Indian component is equidistant between Caucasoids and Mongoloids, but as the MDS plot makes clear, and as the study of Y-chromosome and mtDNA polymorphisms has shown, the distinctive component in Indians is sui generis and not the result of admixture between Caucasoids and Mongoloids. And, finally, the Australoid component is clearly distant from all of the above.

Lexical borrowing in the history of Indo-European languages

This is an open access paper.

Proc. R. Soc. B doi: 10.1098/rspb.2010.1917

Networks uncover hidden lexical borrowing in Indo-European language evolution

Shijulal Nelson-Sathi et al.

Abstract

Language evolution is traditionally described in terms of family trees with ancestral languages splitting into descendent languages. However, it has long been recognized that language evolution also entails horizontal components, most commonly through lexical borrowing. For example, the English language was heavily influenced by Old Norse and Old French; eight per cent of its basic vocabulary is borrowed. Borrowing is a distinctly non-tree-like process—akin to horizontal gene transfer in genome evolution—that cannot be recovered by phylogenetic trees. Here, we infer the frequency of hidden borrowing among 2346 cognates (etymologically related words) of basic vocabulary distributed across 84 Indo-European languages. The dataset includes 124 (5%) known borrowings. Applying the uniformitarian principle to inventory dynamics in past and present basic vocabularies, we find that 1373 (61%) of the cognates have been affected by borrowing during their history. Our approach correctly identified 117 (94%) known borrowings. Reconstructed phylogenetic networks that capture both vertical and horizontal components of evolutionary history reveal that, on average, eight per cent of the words of basic vocabulary in each Indo-European language were involved in borrowing during evolution. Basic vocabulary is often assumed to be relatively resistant to borrowing. Our results indicate that the impact of borrowing is far more widespread than previously thought.

Link

Y-chromosomes of South Africans

Investigative Genetics 2010, 1:6

Development of a single base extension method to resolve Y chromosome haplogroups in sub-Saharan African populations

Thijessen Naidoo et al.

Abstract

Background: The ability of the Y chromosome to retain a record of its evolution has seen it become an essential tool of molecular anthropology. In the last few years, however, it has also found use in forensic genetics, providing information on the geographic origin of individuals. This has been aided by the development of efficient screening methods and an increased knowledge of geographic distribution. In this study, we describe the development of single base extension assays used to resolve 61 Y chromosome haplogroups, mainly within haplogroups A, B and E, found in Africa.

Results: Seven multiplex assays, which incorporated 60 Y chromosome markers, were developed. These resolved Y chromosomes to 61 terminal branches of the major African haplogroups A, B and E, while also including a few Eurasian haplogroups found occasionally in African males. Following its validation, the assays were used to screen 683 individuals from Southern Africa, including south eastern Bantu speakers (BAN), Khoe-San (KS) and South African Whites (SAW). Of the 61 haplogroups that the assays collectively resolved, 26 were found in the 683 samples. While haplogroup sharing was common between the BAN and KS, the frequencies of these haplogroups varied appreciably. Both groups showed low levels of assimilation of Eurasian haplogroups and only two individuals in the SAW clearly had Y chromosomes of African ancestry.

Conclusions: The use of these single base extension assays in screening increased haplogroup resolution and sampling throughput, while saving time and DNA. Their use, together with the screening of short tandem repeat markers would considerably improve resolution, thus refining the geographic ancestry of individuals.