December 21, 2005

Clusters strike back (II)

Rosenberg's 2002 study, which showed that individuals could be assigned to genetic clusters that corresponded closely to their race, sparked a lot of debate, since it demonstrated (even though those terms were not used) that "continental geographical origin" maps directly to genetic identity.

Serre and Pääbo subsequently disputed Rosenberg's claims, arguing that clusters become less distinct if a continuous geographical sampling scheme is chosen, as opposed to sampling from distinct populations, and also if an "uncorrelated alleles" model is used.

But, as I have explained before, the results obtained by S&P are an artefact of greatly reducing their sample size (to obtain a geographically uniform sample, as opposed to little "chunks" of individuals from different populations). Moreover, the assumption of uncorrelated allele frequencies is no better than the assumption of correlated allele frequencies used by Rosenberg. In human populations, allele frequencies tend to be similar in related populations, and are thus correlated to some degree.

Clusters emerge when one uses a sufficiently intelligent clustering technique, has a sufficient sample size, and a large number of informative markers. Absence of clusters does not prove absence of structure. [1]


Now, Rosenberg et al. have published a new article in PLoS Genetics with almost 1,000 loci, which systematically addresses the whole "clines vs. clusters" controversy. This is the knockout punch to the criticism of Serre and Pääbo:

Other factors besides sample size and number of markers, however, may influence clustering patterns. Serre and Pääbo [10] argued that the geographic dispersion of the sample and the assumption made about whether or not allele frequencies are correlated across populations had substantial influences on genetic clustering. They suggested that individuals are less strongly placed into clusters when the sample is more geographically uniform, and when allele frequencies are assumed to be uncorrelated. Consequently, they claimed that the geographic clusters obtained by Rosenberg et al. [3] were artifacts of the sampling design and of the use of a model of correlation among allele frequencies across populations. However, much of the geographic dispersion analysis of [10] was based on two datasets with 89 and 90 individuals and 20 loci, in general too little data for clustering to be apparent [3,4,9]. The remainder of their geographic analysis, as well as the source of their comments about uncorrelated frequencies, was a comparison to the Rosenberg et al. [3] results of several analyses of 261 individuals chosen to be equally distributed across the 52 populations studied. Serre and Pääbo's analyses assumed allele frequencies to be uncorrelated across populations, whereas Rosenberg et al. had assumed that they were correlated. Thus, although a difference in results was seen between the analyses in [10] and those in [3], the attribution of this difference specifically to a difference in geographic dispersion or to a difference in assumptions about allele frequency correlations is problematic, because both of these variables differed between studies, as did the number of individuals.
In agreement with the suggestion of [10], the assumption made about allele frequency correlations is also seen to have a substantial impact. Because large allele frequency correlations exist across populations, however, the basis for the supposition by [10] that allele frequencies are uncorrelated is questionable.
Rosenberg et al. studied how "clusteredness", which is 1 if individuals are assigned completely to a single cluster and 0 if they are equally assigned to all clusters, varied with each study design variable:
Holding the number of clusters, sample size, and allele frequency correlation model fixed, the general trend was that clusteredness was noticeably smaller for ten and 20 loci, and was larger for 50 or more loci (Figure 3). [DP: Better clustering with more loci]


When the number of loci, sample size, and correlation model were held constant, K = 2 (that is, two clusters) generally produced smaller clusteredness than did the larger values of K (Figures 3 and 4; Table 1). For the correlated allele frequencies model, K = 5 and K = 6 tended to have higher clusteredness than did K = 3 and K = 4, whereas the reverse was true for the uncorrelated model (Figure 4). [DP: Under the correlated model, K=3 and K=4 represent human genetic structure less clearly than a model with 5 continental clusters, or one with 6 that splits Northern from Southern Amerindians. In other words, the number of clusters or races is not arbitrary; some values of K fit the data better than others]

Holding the number of loci, number of clusters, and correlation model fixed, clusteredness was generally higher for the samples of size 250 and 500 than it was for the samples of size 100 (Figures 3 and 4; Table 1). [DP: Clusters emerge more clearly when larger sample sizes are used, because larger samples enable better estimation of model parameters]
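
The "clusteredness" measure compared in these passages can be sketched in a few lines. This is a minimal sketch, assuming the statistic is the average over individuals of the normalized distance between each individual's membership coefficients and the uniform vector (1/K, ..., 1/K). That form reproduces the two endpoints described above (1 for complete assignment to one cluster, 0 for equal assignment to all clusters). The function name and the toy inputs are mine, not the paper's.

```python
import math


def clusteredness(memberships):
    """Average clusteredness of a set of individuals.

    memberships: one list per individual, holding that individual's
    membership coefficients in each of the K clusters (summing to 1).
    Assumed form: mean over individuals of
    sqrt(K / (K - 1) * sum_k (q_k - 1/K)^2).
    """
    k = len(memberships[0])
    if k < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for q in memberships:
        # Squared distance from the "equally assigned" vector (1/K, ..., 1/K).
        spread = sum((q_k - 1.0 / k) ** 2 for q_k in q)
        # Rescale so that complete assignment to one cluster gives exactly 1.
        total += math.sqrt(k / (k - 1) * spread)
    return total / len(memberships)


# Hypothetical inputs for K = 3 clusters.
fully_assigned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
evenly_mixed = [[1.0 / 3, 1.0 / 3, 1.0 / 3]]
```

Individuals with intermediate membership (say, split between two neighboring clusters) score somewhere between the two extremes, which is why the average moves with the number of loci, the sample size, and K.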
And why do such robust clustering results emerge?
Loosely speaking, it is these small discontinuous jumps in genetic distance—across oceans, the Himalayas, and the Sahara—that provide the basis for the ability of STRUCTURE to identify clusters that correspond to geographic regions.
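
One way to picture these "small discontinuous jumps" is as a linear trend of genetic distance on geographic distance, plus a constant offset for pairs of populations separated by a barrier. The sketch below is a toy two-step fit under that assumption, not the analysis in the paper: it fits a line to same-side pairs, then measures how far cross-barrier pairs sit above that line on average. The function names are hypothetical, and all the numbers are made-up placeholders chosen only so that the arithmetic is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def barrier_jump(pairs):
    """Estimate the extra genetic distance associated with a barrier.

    pairs: (geographic_distance, genetic_distance, crosses_barrier) tuples.
    Returns (intercept, slope, jump), where the line is fitted to same-side
    pairs only and jump is the mean residual of the cross-barrier pairs.
    """
    same_side = [(g, d) for g, d, crosses in pairs if not crosses]
    across = [(g, d) for g, d, crosses in pairs if crosses]
    intercept, slope = fit_line([g for g, _ in same_side],
                                [d for _, d in same_side])
    residuals = [d - (intercept + slope * g) for g, d in across]
    return intercept, slope, sum(residuals) / len(residuals)


# Made-up population pairs: distance in thousands of km, genetic distance,
# and whether the pair straddles a barrier. Not real data.
toy_pairs = [
    (1.0, 0.014, False), (2.0, 0.018, False), (4.0, 0.026, False),
    (3.0, 0.042, True), (5.0, 0.050, True), (6.0, 0.054, True),
]
intercept, slope, jump = barrier_jump(toy_pairs)
```

A positive jump means that pairs on opposite sides of a barrier are more genetically distant than their geographic separation alone would predict, which is the pattern the quoted passage describes.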
And the obligatory (PC-mandated) statement on race, which does not deny its existence but holds that the clusters are real irrespective of one's definition of race:
Our evidence for clustering should not be taken as evidence of our support of any particular concept of “biological race.” In general, representations of human genetic diversity are evaluated based on their ability to facilitate further research into such topics as human evolutionary history and the identification of medically important genotypes that vary in frequency across populations. Both clines and clusters are among the constructs that meet this standard of usefulness: for example, clines of allele frequency variation have proven important for inference about the genetic history of Europe [15], and clusters have been shown to be valuable for avoidance of the false positive associations that result from population structure in genetic association studies [16]. The arguments about the existence or nonexistence of “biological races” in the absence of a specific context are largely orthogonal to the question of scientific utility, and they should not obscure the fact that, ultimately, the primary goals for studies of genetic variation in humans are to make inferences about human evolutionary history, human biology, and the genetic causes of disease.

PLoS Genetics Volume 1 | Issue 6 | DECEMBER 2005

Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure

Noah A. Rosenberg et al.

Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables—sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample—on the “clusteredness” of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.


[1] For example, some researchers in physical anthropology disputed the existence of races because of the discordance between different traits, such as the cephalic index and the facial index. W.W. Howells convincingly showed that once you used dozens of variables you could recreate the racial groups of traditional physical anthropology. However, since the clustering method he used was a simple Euclidean-distance one, he could not assign individuals successfully to major clusters (races). Thus, he accepted the validity of individual populations, but not of geographical aggregates of populations (races).

Now that computing power is cheap, we can easily apply a sophisticated model-based approach, similar to the program STRUCTURE used by geneticists, and assign individuals to their races, and even to smaller-order clusters.
[2] See also the first "Clusters strike back" post, about a different study that supports the validity of clusters as descriptors of human genetic variation.
