April 28, 2007

Prediction of Continent of Origin using randomly selected SNPs

A new article in BMC Genomics discusses the issue of predicting continental origin using randomly selected markers. The pdf is freely available.

One of the arguments of those who deny the existence of biological races is that their reality is subjective. Some extremists have argued that race is totally socially constructed; this is, however, disproven by the fact that socially constructed race is correlated with physical characteristics. Thus, rather than being separated from biology, the social phenomenon of race is rooted in biology.

A different argument holds that race is correlated with biology, but that the differences are "skin-deep", i.e., involve only superficial, visible (and by some strange logic unimportant) characteristics. According to the proponents of this view, the idea of biological race places an undue emphasis on a particular set of traits: it is the result of a subjective choice of traits as race-defining. Thus, the commonly recognized races of traditional physical anthropology are discounted as subjective organizations of the biological data: we could just as easily speak of a "lactose-intolerant race" according to this view.

In forensic science and admixture analysis, scientists often discover and use polymorphisms which exhibit large inter-population differences. Decoding DNA isn't free, so it makes sense to use the most informative, most "biased" markers when one is trying to discover the origin of a biological sample. For example, if Africans have 55% of gene version A and 45% of gene version B, and Europeans have 53% of A and 47% of B, it makes little sense to type this particular gene, since it can tell us almost nothing about whether a sample is European or African. A gene where Africans have 90% of A while Europeans have 5% of A would be much more useful. Race skeptics claim, as with the physical anthropological data, that to privilege such carefully chosen genes is to stress the differences between groups; the implication is that in randomly chosen genes these differences are minor.
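To make the arithmetic of this example concrete, here is a minimal sketch in Python. It assumes equal prior odds for the two groups and that we observe a single copy of version A; the function names are my own and hypothetical, not taken from any paper or library. With the weak marker, seeing A moves the probability of African origin from 50% to only about 51%; with the strong marker it moves to about 95%.

```python
def posterior_group1(freq_a_group1, freq_a_group2, prior_group1=0.5):
    """Bayes' rule: probability of group 1 after observing one copy of version A.

    Assumes (hypothetically) that only two groups are possible.
    """
    numerator = prior_group1 * freq_a_group1
    denominator = numerator + (1.0 - prior_group1) * freq_a_group2
    return numerator / denominator


def frequency_difference(freq_a_group1, freq_a_group2):
    """Absolute difference in the frequency of A, a crude informativeness score."""
    return abs(freq_a_group1 - freq_a_group2)


# Frequencies of version A from the example: (Africans, Europeans).
weak_marker = (0.55, 0.53)
strong_marker = (0.90, 0.05)

weak_posterior = posterior_group1(*weak_marker)
strong_posterior = posterior_group1(*strong_marker)

for name, marker, post in (("weak", weak_marker, weak_posterior),
                           ("strong", strong_marker, strong_posterior)):
    diff = frequency_difference(*marker)
    print(f"{name} marker: difference={diff:.2f}, P(African | A)={post:.3f}")
```

The absolute frequency difference used here is only one simple way to score a marker; other informativeness measures exist, and nothing above is specific to the paper discussed in this post.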

The new paper is one of many recent papers (you can click on the Clusters label to find more) that have found that no matter which genetic markers you choose (SNPs, STRs) and no matter how you choose them (randomly or based on their "informativeness"), it is relatively easy to classify DNA into the correct continental origin. Depending on the marker type (e.g., indel vs. microsatellite) and its informativeness (roughly, how much its distribution differs between populations), one may require more or fewer markers to achieve a high degree of accuracy. But the conclusion is the same: past a certain number of markers, classification of individuals according to continental origin becomes highly accurate.
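The point that accuracy depends on the number of markers, not on hand-picking them, can be illustrated with a toy simulation. The sketch below is not the paper's method or data: it invents two hypothetical populations whose allele frequencies at random markers differ only modestly, trains a simple naive Bayes classifier on a small sample, and measures accuracy for different marker counts. All parameters (the drift size, sample sizes, seed) are assumptions chosen for illustration.

```python
import math
import random


def simulate_frequencies(n_markers, rng, drift_sd=0.1):
    """Hypothetical allele-A frequencies for two populations at random markers."""
    freqs = []
    for _ in range(n_markers):
        ancestral = rng.uniform(0.1, 0.9)
        # Each population drifts a little from the shared ancestral frequency.
        pair = tuple(min(0.98, max(0.02, ancestral + rng.gauss(0.0, drift_sd)))
                     for _ in range(2))
        freqs.append(pair)
    return freqs


def sample_individual(freqs, pop, rng):
    """Diploid genotype: number of A alleles (0, 1 or 2) at each marker."""
    return [(rng.random() < f[pop]) + (rng.random() < f[pop]) for f in freqs]


def train(samples_by_pop, n_markers):
    """Estimate allele-A frequencies per population, with add-one smoothing."""
    model = []
    for genotypes in samples_by_pop:
        n_alleles = 2 * len(genotypes)
        model.append([(sum(g[m] for g in genotypes) + 1) / (n_alleles + 2)
                      for m in range(n_markers)])
    return model


def log_likelihood(genotype, freqs):
    """Naive Bayes score: markers are treated as independent of one another."""
    return sum(count * math.log(p) + (2 - count) * math.log(1.0 - p)
               for count, p in zip(genotype, freqs))


def classify(genotype, model):
    """Pick the population under which the genotype is most likely."""
    scores = [log_likelihood(genotype, freqs) for freqs in model]
    return scores.index(max(scores))


def accuracy(n_markers, seed=0, n_train=50, n_test=200):
    """Fraction of fresh simulated individuals assigned to their true population."""
    rng = random.Random(seed)
    freqs = simulate_frequencies(n_markers, rng)
    training = [[sample_individual(freqs, pop, rng) for _ in range(n_train)]
                for pop in (0, 1)]
    model = train(training, n_markers)
    correct = 0
    for pop in (0, 1):
        for _ in range(n_test):
            correct += classify(sample_individual(freqs, pop, rng), model) == pop
    return correct / (2 * n_test)


if __name__ == "__main__":
    for n in (5, 20, 50, 200):
        print(f"{n:4d} random markers: accuracy = {accuracy(n):.3f}")
```

In this toy setup no single marker separates the groups well, yet the combined evidence from many weakly informative markers does. How many markers are needed in practice depends on the real frequency differences, which is what the paper measures on HapMap data.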

Thus, the emergent pattern of variation is not at all subjectively constructed: it does not deal specifically with visible traits (randomly chosen markers could influence any trait, or none at all), nor does it privilege markers exhibiting large population differences. The structuring of humanity into more or less disjoint groups is not a subjective choice: it emerges naturally from the genomic composition of humans, irrespective of how you study this composition. Rather than proving that race is skin-deep, non-existent, or unimportant, modern genetic science is both proving that it does in fact exist and setting the foundation for the study of its true importance, which probably lies somewhere between the indifference of the sociologists and the hyperbole of the racists.

BMC Genomics

Geography and genography: prediction of continental origin using randomly selected single nucleotide polymorphisms


Dominic J Allocco et al.

Abstract
Background: Recent studies have shown that when individuals are grouped on the basis of genetic similarity, group membership corresponds closely to continental origin. There has been considerable debate about the implications of these findings in the context of larger debates about race and the extent of genetic variation between groups. Some have argued that clustering according to continental origin demonstrates the existence of significant genetic differences between groups and that these differences may have important implications for differences in health and disease. Others argue that clustering according to continental origin requires the use of large amounts of genetic data or specifically chosen markers and is indicative only of very subtle genetic differences that are unlikely to have biomedical significance.
Results: We used small numbers of randomly selected single nucleotide polymorphisms (SNPs) from the International HapMap Project to train naïve Bayes classifiers for prediction of ancestral continent of origin. Predictive accuracy was tested on two independent data sets. Genetically similar groups should be difficult to distinguish, especially if only a small number of genetic markers are used. The genetic differences between continentally defined groups are sufficiently large that one can accurately predict ancestral continent of origin using only a minute, randomly selected fraction of the genetic variation present in the human genome. Genotype data from only 50 random SNPs was sufficient to predict ancestral continent of origin in our primary test data set with an average accuracy of 95%. Genetic variations informative about ancestry were common and widely distributed throughout the genome.
Conclusion: Accurate characterization of ancestry is possible using small numbers of randomly selected SNPs. The results presented here show how investigators conducting genetic association studies can use small numbers of arbitrarily chosen SNPs to identify stratification in study subjects and avoid false positive genotype-phenotype associations. Our findings also demonstrate the extent of variation between continentally defined groups and argue strongly against the contention that genetic differences between groups are too small to have biomedical significance.

Link (pdf)
