July 08, 2008

PCA-informative markers for European American substructure

This work matters because, although it takes many thousands of markers to detect population structure among closely related groups, a much smaller subset of those markers captures almost all of the information in the full marker set.

Thus, from an economic standpoint, discovering substructure in an "unexamined" group requires a considerable initial investment: genotyping a large, representative sample at a large number of markers. Subsequent ancestry analysis, however, can use the smaller subset identified in that first pass to test new individuals cheaply.
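To make the "small subset captures most of the information" idea concrete, here is a toy sketch in plain Python. Everything in it is my own assumption for illustration: the simulated ancestry cline, the allele frequencies, the function names, and the choice to rank markers by their squared loading on the first principal component. It is not the paper's data or necessarily its selection algorithm, only a minimal instance of "find the structure with many markers, then reproduce it with a few".

```python
import random

def simulate_genotypes(n_people, n_informative, n_null, seed=0):
    """Toy 0/1/2 genotypes for individuals spread along a 1-D ancestry cline."""
    rng = random.Random(seed)
    genotypes = []
    for _ in range(n_people):
        t = rng.random()  # hypothetical position along the cline, in [0, 1)
        row = []
        for j in range(n_informative + n_null):
            # Informative markers change frequency along the cline; null ones do not.
            p = 0.1 + 0.8 * t if j < n_informative else 0.5
            row.append((rng.random() < p) + (rng.random() < p))
        genotypes.append(row)
    return genotypes

def center_columns(rows):
    """Subtract each marker's mean genotype."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - mu for x, mu in zip(row, means)] for row in rows]

def first_pc(X, iters=50, seed=1):
    """Power iteration on X^T X: returns (marker loadings, individual scores)."""
    rng = random.Random(seed)
    cols = list(zip(*X))
    v = [rng.gauss(0, 1) for _ in cols]
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]
        v = [sum(x * s for x, s in zip(col, scores)) for col in cols]
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in X]
    return v, scores

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

N_INF, N_NULL, K = 30, 100, 30
X = center_columns(simulate_genotypes(200, N_INF, N_NULL))
loadings, full_scores = first_pc(X)

# Rank markers by squared PC1 loading and keep the top K as the small panel.
ranked = sorted(range(len(loadings)), key=lambda j: loadings[j] ** 2, reverse=True)
panel = ranked[:K]

# Redo the PCA on the small panel only and compare the two PC1 axes.
X_small = [[row[j] for j in panel] for row in X]
_, small_scores = first_pc(X_small)
agreement = abs(pearson(full_scores, small_scores))  # the sign of a PC is arbitrary
recovered = sum(j < N_INF for j in panel)
print(f"{recovered}/{K} panel markers are truly informative; |r| = {agreement:.3f}")
```

The expensive step is the first PCA over all markers; once the panel is fixed, new individuals only need to be typed at the K selected markers.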

Once I look at the details of this paper, I will try to update EURO-DNA-CALC to use this new marker panel.

See the earlier paper by this group on PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations.

PLoS Genet 4(7): e1000114. doi:10.1371/journal.pgen.1000114

Tracing Sub-Structure in the European American Population with PCA-Informative Markers

Peristera Paschou et al.


Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals–307,315 autosomal SNPs). Individual variation lies across a continuum with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans only represent a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of this data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that can remove redundancy from the selected SNP panels and show that we can effectively remove correlated markers, thus increasing genotyping savings. Only 150–200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied on sets of ancestry informative markers selected with any method in order to select the most uncorrelated SNPs, and significantly decreases genotyping costs.
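The abstract does not spell out how the redundancy removal works, so the sketch below is not the authors' algorithm. It is a generic greedy filter of my own choosing: walk down a ranked marker list and drop any marker whose squared correlation with an already-kept marker reaches a threshold. The marker names, genotypes, and the 0.8 cutoff are all made up for illustration.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype columns (0/1/2 codes)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a monomorphic marker has no defined correlation; treat as 0
    return cov * cov / (va * vb)

def prune_redundant(columns, ranked, max_r2=0.8):
    """Keep markers in rank order, skipping any that are too correlated with a kept one."""
    kept = []
    for name in ranked:
        if all(r_squared(columns[name], columns[k]) < max_r2 for k in kept):
            kept.append(name)
    return kept

# Hypothetical genotypes for eight individuals at four markers.
columns = {
    "snp_a": [0, 1, 2, 2, 1, 0, 1, 2],
    "snp_b": [0, 1, 2, 2, 1, 0, 1, 2],  # exact copy of snp_a
    "snp_c": [2, 0, 1, 0, 2, 1, 0, 1],  # largely independent of snp_a
    "snp_d": [0, 1, 2, 2, 1, 0, 2, 2],  # differs from snp_a in one individual
}
ranked = ["snp_a", "snp_b", "snp_d", "snp_c"]  # assumed informativeness order

kept = prune_redundant(columns, ranked)
print(kept)  # -> ['snp_a', 'snp_c']
```

Because the filter only needs a ranked list and the genotypes, it could in principle sit behind any marker-selection method, which is the property the abstract highlights for the authors' own procedure.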

