Analysis of East Asia Genetic Substructure: Population Differentiation and PCA Clusters Correlate with Geographic Distribution
Accounting for genetic substructure within European populations has been important in reducing type 1 errors in genetic studies of complex disease. As efforts to understand complex genetic disease are expanded to other continental populations, an understanding of genetic substructure within these continents will be useful in the design and execution of association tests. In this study, population differentiation (Fst) and Principal Components Analyses (PCA) are examined using >200K genotypes from multiple populations of East Asian ancestry (total 298 subjects). The population groups included those from the Human Genome Diversity Panel [Cambodian (CAMB), Yi, Daur, Mongolian (MGL), Lahu, Dai, Hezhen, Miaozu, Naxi, Oroqen, She, Tu, Tujia, and Xibo], HapMap (CHB and JPT), and East Asian or East Asian American subjects of Vietnamese (VIET), Korean (KOR), Filipino (FIL) and Chinese ancestry. Paired Fst (Weir and Cockerham) showed close relationships between CHB and several large East Asian population groups (CHB/KOR, 0.0019; CHB/JPT, 0.00651; CHB/VIET, 0.0065) with larger separation from FIL (CHB/FIL, 0.014). Low levels of differentiation were also observed between DAI and VIET (0.0045) and between VIET and CAMB (0.0062). Similarly, small Fsts were observed among different presumed Han Chinese populations originating in different regions of mainland China and Taiwan. For PCA, the first two PCs showed a pattern of relationships that closely followed the geographic distribution of the different East Asian populations. For example, the four corner groups were JPT, FIL, CAMB and MGL, with CHB forming the center group and KOR lying between CHB and JPT. Other small ethnic groups were also in rough geographic correlation with their putative origins.
These studies have also enabled the selection of a subset of East Asian substructure ancestry informative markers (EASTASAIMS) that may be useful for future genetic association studies in reducing type 1 errors and in identifying homogeneous groups.
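To make the idea of a paired Fst concrete, here is a minimal sketch that computes a simple heterozygosity-based Fst (ratio of sums over SNPs) from allele frequencies in two populations. This is a simplification, not the Weir and Cockerham estimator reported in the abstract: it assumes equal population weights and ignores sample-size corrections. The function name and the toy frequencies are hypothetical.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Simple ratio-of-sums Fst from per-SNP allele frequencies.

    Assumption: both populations are weighted equally and no
    sample-size correction is applied, so this is NOT the
    Weir and Cockerham estimator.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must cover the same SNPs")
    numerator = 0.0    # sum over SNPs of (H_T - H_S)
    denominator = 0.0  # sum over SNPs of H_T
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2.0
        # Expected heterozygosity in the pooled population.
        h_total = 2.0 * p_bar * (1.0 - p_bar)
        # Mean expected heterozygosity within the two populations.
        h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


# Toy allele frequencies (made up for illustration) at two SNPs.
toy_pop_x = [0.2, 0.5]
toy_pop_y = [0.8, 0.5]
toy_fst = fst_two_populations(toy_pop_x, toy_pop_y)
```

Identical frequencies give an Fst of 0 and fixed differences give 1, so values such as 0.0019 indicate populations whose allele frequencies are nearly the same across the genome.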
Worldwide Population Structure using SNP Microarray Genotyping
We genotyped 348 individuals sampled from 24 populations world-wide using the Affymetrix 250k NspI microarray chip. For context, we added matching genotypes from 210 HapMap individuals for a total of 250,823 loci genotyped in 543 individuals from 28 populations. We included populations from India and Daghestan to provide detail between the genetic poles of Western Europe, East Asia, and sub-Saharan Africa. With so many markers, principal components analyses reveal genetic differentiation between almost all identified populations in our sample. Northern and southern European populations (FST = 0.004, p < 0.01) are statistically distinguishable, as are upper and lower caste groups in India (FST = 0.005, p < 0.01). All individuals are accurately classified into continental groups, and even between closely related populations, genetic and self-classifications conflict for only a minority of individuals (e.g., ~2% between upper and lower Indian castes by k-means clustering). As expected, the HapMap CHB+JPT, CEU, and YRI samples are most similar to our East Asian, West European, and African samples, respectively. The HapMap CEU samples and our northern European ancestry samples were both collected from Utah. Although individual samples cannot be reliably classified into their collection of origin, the groups are statistically distinguishable despite their high similarity (FST = 0.0005, n.s.). Our Japanese group is also statistically distinguishable from the HapMap JPT group (FST = 0.006, p < 0.01), and in this comparison, most samples can be correctly classified. With such large numbers of genotypes, significant differences can be found even between very similar population samplings. Our results provide guidelines for researchers in selecting suitable control populations for case-control studies.
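As a rough illustration of how principal components analysis separates individuals by ancestry, the sketch below finds the leading principal component of a tiny genotype matrix (rows are individuals, columns are SNPs coded as 0/1/2 allele counts). It uses power iteration on the individual-by-individual Gram matrix purely to stay dependency-free. The abstract does not say how its PCA was computed, so the function names, the method, and the toy genotypes are all assumptions.

```python
import math


def center_columns(genotypes):
    """Subtract each SNP's mean allele count from its column."""
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in genotypes]


def first_pc_direction(genotypes, iterations=200):
    """Return a unit vector with one entry per individual.

    The entries are proportional to each individual's score on the
    first principal component (leading eigenvector of X X^T, where X
    is the column-centered genotype matrix). The overall sign is arbitrary.
    """
    x = center_columns(genotypes)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    # Deterministic, non-constant start vector for power iteration.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # no variation in the data
        v = [c / norm for c in w]
    return v


# Toy data (made up): individuals 0-2 and 3-5 come from two groups
# that differ in allele counts at most SNPs.
toy_genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]
toy_pc1 = first_pc_direction(toy_genotypes)
```

In this toy case the first three individuals receive PC1 values of one sign and the last three the opposite sign. With real data of this size, a linear algebra library would be the practical choice.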
Frequency distribution and selection in 4 pigmentation genes in Europe
Pigmentation is one of the more obvious forms of variation in humans, particularly in Europeans, where one sees more within-group variation in hair and eye pigmentation than in the rest of the world. We studied 4 genes (SLC24A5, SLC45A2, OCA2 and MC1R) that are believed to contribute to the pigment phenotypes in Europeans. SLC24A5 has a single functional variant that leads to lighter skin pigmentation. Data on 83 populations worldwide (including 55 from our lab) show the variant (at rs1426654) has almost reached fixation in Europe, Southwest Asia, and North Africa, has moderate to high frequencies (.2-.9) throughout Central Asia, and has frequencies of .1-.3 in East and South Africa. The variant is essentially absent elsewhere. SLC45A2 also has a single functional variant (at rs16891982) associated with light skin pigmentation in Europe. Data on 84 populations worldwide show the light skin allele is nearly fixed in Northern Europe but has lower frequencies in Southern Europe, the Middle East and Northern Africa. In Central Asia the frequency of the SLC45A2 variant declines more quickly than that of the SLC24A5 variant. It is absent in both East and South Africa. In OCA2 we typed 4 SNPs (rs4778138, rs4778241, rs7495174, rs12913832) with a haplotype associated with blue eyes in Europeans. This haplotype shows a Southeastern to Northwestern pattern in Europe with frequencies of .25 (.05 homozygous) in the Adygei to .85 (.75 homozygous) in the Danes. In MC1R we typed 5 SNPs (rs3212345, rs3212357, rs3212363, C_25958294_10, rs7191944) that cover the entire MC1R gene and found a predominantly European haplotype that ranges in frequency from .35 to .65 in Europe, reaching its highest levels in Southwest Asia and Northwestern Europe. Extended Haplotype Homozygosity (EHH) and normalized Haplosimilarity (nHS) show evidence of selection at SLC24A5 not only in our European and Southwest Asian populations but also in our East African populations.
Neither SLC45A2 nor OCA2 showed evidence of selection in either test. MC1R did not show evidence of selection for our European-specific haplotype, but we did see some evidence both upstream and downstream in our nHS test in Europe.
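For readers unfamiliar with EHH, the following minimal sketch computes it in its standard form: among chromosomes carrying the same core allele, the probability that two chromosomes chosen at random are identical across every marker from the core out to a given position. The function name and the toy haplotypes are hypothetical, and the nHS statistic and any normalization used in the study are not covered.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, end_index):
    """Extended haplotype homozygosity from the core to end_index.

    haplotypes: equal-length strings (one character per marker) that
    all carry the same core allele; filtering by core allele is left
    to the caller.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = sorted((core_index, end_index))
    # Group chromosomes by their alleles over the whole interval.
    counts = Counter(h[lo:hi + 1] for h in haplotypes)
    # Fraction of chromosome pairs that are identical over the interval.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy haplotypes (made up); the core marker is at index 0.
toy_haplotypes = ["AAAA", "AAAA", "AAAT", "AATT"]
toy_ehh_by_distance = [ehh(toy_haplotypes, 0, end) for end in range(4)]
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the haplotype. An allele whose EHH stays high over long distances, given its frequency, is a candidate for recent positive selection.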
Using principal components analysis to identify candidate genes for natural selection
Genetic markers that differentiate populations are excellent candidates for natural selection due to local adaptation, and may shed light on physiological pathways that underlie disorders with varying frequencies around the world. Principal Components Analysis (PCA) has emerged as a powerful tool for the characterization and analysis of the structure of genomewide datasets. In prior work, we described an algorithm that can be used to select small subsets of genetic markers (SNPs) that correlate well with population structure, as captured by PCA. Our method can be used to detect SNPs that differentiate individuals from different geographic regions, or even neighboring subpopulations. We set out to explore the nature and properties of the genes where population-differentiating SNPs reside, by analyzing the publicly available Human Genome Diversity Panel dataset (650,000 SNPs for 1,043 individuals, 51 populations). Applying our SNP selection algorithms, we chose small subsets of SNPs that almost perfectly reproduce worldwide population structure as identified by PCA. We determined SNP panels for population differentiation both within seven geographic regions and around the globe. We then explored the hypothesis that the selected SNPs attained their current worldwide allele frequency patterns as a response to the pressure of natural selection. Comparing our lists to recently published reports, we found a significant overlap with other genomewide scans for selection, thus validating our hypothesis. For example, EDAR (involved in the development of hair follicles) harbors the most differentiating SNPs in our world-wide panels. SNPs located in genes that are involved in skin and eye pigmentation (OCA2, MYO5C, HERC1, HERC2) are also among the top population-differentiating markers.
In East Asia, SNPs residing in the ADH cluster appear among the most important SNPs for population structure, while in Europe the same is true for genes that are involved in immune response to pathogens (CR1, DUOX2, TLR, and HLA). Finally, a comprehensive gene ontology analysis is presented.
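The abstract does not spell out the SNP selection algorithm, so the sketch below shows only one simple way to rank SNPs by how strongly they contribute to population structure captured by PCA: score each SNP by its squared loading on the first principal component. It should be read as an illustration of the general idea under that assumption, not as the authors' method. The function name and the toy genotypes are hypothetical.

```python
import math


def rank_snps_by_pc1_loading(genotypes, iterations=200):
    """Rank SNPs (columns) by squared loading on the first PC.

    genotypes: rows are individuals, columns are SNPs (0/1/2 counts).
    Returns (order, scores): SNP indices from highest to lowest score,
    and the per-SNP scores.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # SNP-by-SNP cross-product matrix X^T X of the centered data.
    cross = [[sum(x[i][j] * x[i][k] for i in range(n)) for k in range(m)]
             for j in range(m)]
    # Power iteration for the leading eigenvector (the PC1 loadings).
    v = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(cross[j][k] * v[k] for k in range(m)) for j in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # no variation in the data
        v = [c / norm for c in w]
    scores = [c * c for c in v]
    order = sorted(range(m), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (made up): SNPs 0 and 2 separate the two groups perfectly,
# while SNPs 1 and 3 do so only partially.
toy_genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]
toy_order, toy_scores = rank_snps_by_pc1_loading(toy_genotypes)
```

In this toy example the two SNPs that perfectly separate the groups receive the highest scores. Keeping only the top-ranked SNPs and rerunning PCA is one way to check whether a small panel reproduces the structure seen with all markers.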