- Genetic structure in Northern Europe with 250K SNPs
- Geography and Genetic structure in Europe (again)
- 500K SNP Europe-wide study of genetic structure
- In this study the first principal component of variation is along an east-west axis, rather than north-south as in previous studies. This is due to the limited number of southern European populations, and the great number of populations along an east-west axis from Spain to Russia. As I have mentioned before, the results of a principal components analysis are dataset-dependent.
- The nice technique of this paper is to infer the ancestry of an unknown sample (which could perhaps be a forensic case or a customer of an ancestry analysis test) using only summary statistics. Imagine that you have 1,000 individuals from different populations and want to guess the ancestry of an unknown test case. You could do a full STRUCTURE run on all 1,001 individuals, or you could exploit the information already garnered from an analysis of the 1,000 individuals (a PCA in this case) to test the 1,001st individual. This is much faster and more convenient, since the full STRUCTURE run is very time-consuming.
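To make the idea concrete, here is a minimal sketch of classifying a new sample from summary statistics alone. It is not the authors' method: I am assuming a plain PCA on additively coded genotypes (0/1/2) and a nearest-centroid rule, and the function names, the simulated allele frequencies and the populations "A" and "B" are all made up for illustration. The point is only that once the SNP means, PC loadings and population centroids are stored, the reference genotypes are no longer needed.

```python
import numpy as np

def fit_reference(genotypes, labels, n_components=2):
    """Reduce a reference panel to summary statistics:
    per-SNP means, PC loadings, and one centroid per population."""
    labels = np.asarray(labels)
    snp_means = genotypes.mean(axis=0)
    centered = genotypes - snp_means
    # Rows of vt are the principal axes of the centered genotype matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_components].T            # shape: (n_snps, n_components)
    scores = centered @ loadings              # reference samples in PC space
    centroids = {str(pop): scores[labels == pop].mean(axis=0)
                 for pop in np.unique(labels)}
    return snp_means, loadings, centroids

def predict_origin(sample, snp_means, loadings, centroids):
    """Project one new genotype vector onto the stored PCs and
    return the population with the nearest centroid."""
    score = (sample - snp_means) @ loadings
    return min(centroids, key=lambda pop: np.linalg.norm(score - centroids[pop]))

# Simulated toy panel (assumption): two populations with different allele frequencies.
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.2, 0.8, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.3, n_snps), 0.05, 0.95)
reference = np.vstack([
    rng.binomial(2, freq_a, size=(50, n_snps)),
    rng.binomial(2, freq_b, size=(50, n_snps)),
]).astype(float)
labels = ["A"] * 50 + ["B"] * 50

snp_means, loadings, centroids = fit_reference(reference, labels)

# The "1,001st individual": only the summary statistics are used from here on.
unknown = rng.binomial(2, freq_b, size=n_snps)
print(predict_origin(unknown, snp_means, loadings, centroids))
```

The expensive step (the SVD) is done once on the reference panel; testing each new individual is a single matrix-vector product.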
Some ethnic groups are clearly distinguishable from each other (e.g., Swedes vs. Spaniards); some groups are partitioned into fairly disjoint sets (Spain I vs. Catalans in Spain II); others mutually overlap (e.g., British and Irish); while others overlap asymmetrically (e.g., some former Yugoslavs in the Greek cluster, but not vice versa).

In this paper, the authors did a systematic study of the "ethnic distinctiveness" of their samples. In a first experiment, they used 80% of their data to identify the features of the various populations (e.g., Germans or Spaniards), and then tried to guess the origin of the remaining 20%:
It is clear that some nations appear to be distinct. For example, most test Spaniards (94.5%) are correctly guessed as Spaniards, with some (5.5%) guessed as French. Of course, this distinctiveness would be reduced if further populations (e.g. the Portuguese) were added to the analysis. More strongly, 99.1% of Norwegians are guessed correctly as Norwegians.
Other nations appear to be less distinct. For example, only 45.3% of Slovaks are guessed as Slovaks with most of the remaining ones guessed as Czechs (25%) or Hungarians (22%).
In some cases there is asymmetry of affiliation. For example, no Belgians are guessed as Germans but 10.2% of Germans are guessed as Belgians. Similarly 9.9% of Swedes are guessed as Norwegians, but only 1% of Norwegians are guessed as Swedes. While each case needs to be addressed individually, this observation is consistent with historical asymmetry in immigration patterns or ethnic identity formation. So, while e.g., the bulk of Germans (64.4%) are guessed correctly, sizeable minorities are guessed as Czechs, Belgians, or Scandinavians.
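The percentages quoted above are the rows of what is usually called a confusion matrix: for each true origin, the share of test individuals assigned to each guessed origin. A small sketch of how such a table is tabulated, and how the asymmetry shows up in it, follows. The function name and the toy labels are my own, and the counts are invented to mimic the Swedish/Norwegian pattern rather than taken from the paper.

```python
import numpy as np

def confusion_percentages(true_labels, predicted_labels):
    """Return (populations, table) where table[i, j] is the percentage of
    individuals truly from populations[i] that were guessed as populations[j]."""
    pops = sorted(set(true_labels) | set(predicted_labels))
    index = {pop: i for i, pop in enumerate(pops)}
    counts = np.zeros((len(pops), len(pops)))
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth], index[guess]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Leave rows at zero for populations that never occur as a true label.
    table = 100 * np.divide(counts, row_totals,
                            out=np.zeros_like(counts), where=row_totals > 0)
    return pops, table

# Invented toy test set: 1 of 10 Swedes is guessed as Norwegian, never the reverse.
truth = ["Swede"] * 10 + ["Norwegian"] * 10
guess = ["Swede"] * 9 + ["Norwegian"] + ["Norwegian"] * 10

pops, table = confusion_percentages(truth, guess)
i, j = pops.index("Swede"), pops.index("Norwegian")
print(table[i, j], table[j, i])  # unequal off-diagonal entries signal asymmetry
```

Each row sums to 100%, so an asymmetry means that the two off-diagonal cells for a pair of populations differ, exactly as with the Germans and Belgians above.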
I would speculate that large central European countries have historically (due to either prestige or geographical position) absorbed more diverse populations from neighboring nations, while smaller peripheral countries have mostly acted as sources of population, preserving their own genetic distinctiveness.
In a second experiment the authors guessed the origin of individuals, but excluding the country from which they actually originated.
Once again, it is clear that members of particular nations can mostly be mistaken for members of their closest neighbors. Almost all Spaniards are guessed as French; French mostly as Belgians but with sizable Spanish and UK minorities; UK as Belgians but with sizable French minorities; Norwegians mostly as Swedes but with some UK; Swedes mainly as Germans but many as Norwegians; most Poles as Russians, but some Slovaks or Czechs, and so on.
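The second experiment amounts to asking for the best match once the true country has been removed from the list of candidates. A sketch under the same nearest-centroid assumption as before (again my simplification, not necessarily what the authors did); the centroid coordinates below are hypothetical positions in a two-PC space, chosen only to illustrate the mechanics.

```python
import numpy as np

def nearest_population(score, centroids, exclude=None):
    """Return the population whose centroid is closest to `score`,
    optionally ignoring one population (e.g., the sample's true origin)."""
    candidates = {pop: c for pop, c in centroids.items() if pop != exclude}
    return min(candidates, key=lambda pop: np.linalg.norm(score - candidates[pop]))

# Hypothetical PC1/PC2 centroids, not values from the paper.
centroids = {
    "Spain":   np.array([-3.0, -2.0]),
    "France":  np.array([-1.5, -0.5]),
    "Germany": np.array([0.5, 0.5]),
    "Poland":  np.array([2.5, 0.8]),
}
sample = np.array([-2.8, -1.9])  # a made-up individual lying near the Spanish centroid

print(nearest_population(sample, centroids))                   # Spain
print(nearest_population(sample, centroids, exclude="Spain"))  # France
```

With the true country excluded, the guess falls to the nearest remaining centroid, which is why the results read like a map of each nation's closest genetic neighbors.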
The importance of these results can't be overstated. While it can be argued that some ethnic groups are spuriously distinctive only due to insufficient sampling of the geographical continuum, it is more difficult to do this for others. For example, it is now possible to identify particular ethnic groups, e.g., Norwegians, with great accuracy from DNA.
More markers and more populations will doubtless enhance our ability to distinguish European nations using DNA. But perfect accuracy is unlikely; in most European nations there are probably minorities which, for historical reasons, allied themselves with one country or political entity even though they were ultimately of a different genetic background from the majority population of that entity.
Nonetheless, at a time when, due to a sort of mental hysteresis, proclamations that "races are social constructs" are still routinely made, the discovery that not only races but even closely related ethnic groups (e.g., Norwegians and Swedes) can be distinguished with greater than 90% accuracy serves to illustrate the scientific irrelevance of the ethnic nihilists and to affirm that nations are, at least in part, genetic entities.
European Journal of Human Genetics (2008) 16, 1413–1429; doi:10.1038/ejhg.2008.210
Investigation of the fine structure of European populations with applications to disease association studies
Simon C Heath et al.
An investigation into fine-scale European population structure was carried out using high-density genetic variation on nearly 6000 individuals originating from across Europe. The individuals were collected as control samples and were genotyped with more than 300 000 SNPs in genome-wide association studies using the Illumina Infinium platform. A major East–West gradient from Russian (Moscow) samples to Spanish samples was identified as the first principal component (PC) of the genetic diversity. The second PC identified a North–South gradient from Norway and Sweden to Romania and Spain. Variation of frequencies at markers in three separate genomic regions, surrounding LCT, HLA and HERC2, were strongly associated with this gradient. The next 18 PCs also accounted for a significant proportion of genetic diversity observed in the sample. We present a method to predict the ethnic origin of samples by comparing the sample genotypes with those from a reference set of samples of known origin. These predictions can be performed using just summary information on the known samples, and individual genotype data are not required. We discuss issues raised by these data and analyses for association studies including the matching of case-only cohorts to appropriate pre-collected control samples for genome-wide association studies.