April 07, 2008

PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations

Computer Program Reveals Anyone's Ancestry
"Now that we have found that the program works well, we hope to implement it on a much larger scale, using hundreds of thousands of SNPs and thousands of individuals," said Drineas, who was funded by an NSF CAREER award. "The program will be a valuable tool for understanding our genetic ancestry and targeting drugs and other medical treatments because it might be possible that these can affect people of different ancestry in very different ways."

Understanding our unique genetic makeup is a crucial step to unraveling the genetic basis for complex diseases. Although the human genome is 99 percent the same from human to human, it is that 1 percent that can have a major impact on our response to diseases, viruses, medications, and toxins. If researchers can uncover the minute genetic details that set each of us apart, biomedical research and treatments can be better customized for each individual, Drineas said.

This program will help people understand their unique backgrounds and aid historians and anthropologists in their study of where different populations originated and how humans became such a hugely diverse, global society.

The program was more than 99 percent accurate in trials and correctly identified the ancestry of hundreds of individuals. This included people from genetically similar populations (such as Chinese and Japanese) and genetically complex populations such as Puerto Ricans, who can come from a variety of backgrounds including Native American, European, and African ancestries.

"When we compared our findings to the existing datasets, only one individual was incorrectly identified and his background was almost equally close between Chinese and Japanese," Drineas said. Drineas explains that the results are preliminary, but extremely promising. The team is now working to test their program on a much larger data set.
PLoS Genet 3(9): e160. doi:10.1371/journal.pgen.0030160

PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations

Peristera Paschou, Elad Ziv, Esteban G. Burchard, Shweta Choudhry, William Rodriguez-Cintron, Michael W. Mahoney, Petros Drineas


Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure-informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
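For readers who want a feel for what "PCA-correlated SNPs" might mean in practice, here is a rough sketch in Python with NumPy. It is not the authors' code. It assumes that each SNP is scored by the sum of its squared loadings on the top principal components, that the highest-scoring SNPs are kept, and that a sign split on the first component can stand in for the "simple clustering algorithm" the abstract mentions. The function names, the synthetic allele frequencies, and the population sizes are all made up for illustration.

```python
import numpy as np


def pca_snp_scores(genotypes, n_components):
    """Score each SNP (column) by its squared loadings on the top components.

    Assumption: this scoring rule is one plausible reading of the abstract,
    not a confirmed description of the published algorithm.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center every SNP column
    # Rows of vt are the principal directions in SNP space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (vt[:n_components] ** 2).sum(axis=0)


def select_snps(genotypes, n_components, n_select):
    """Return the sorted column indices of the n_select highest-scoring SNPs."""
    scores = pca_snp_scores(genotypes, n_components)
    return np.sort(np.argsort(scores)[-n_select:])


# Hypothetical synthetic data: two populations, 100 SNPs coded as 0/1/2
# allele counts. Only the first 10 SNPs differ in allele frequency.
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_informative = 50, 100, 10
freq_a = np.full(n_snps, 0.5)
freq_b = np.full(n_snps, 0.5)
freq_a[:n_informative] = 0.9
freq_b[:n_informative] = 0.1
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b])

# No ancestry labels are used anywhere in the selection step.
selected = select_snps(genotypes, n_components=1, n_select=10)

# Stand-in clustering step: split individuals by the sign of their
# first-component score, computed on the selected SNPs only.
reduced = genotypes[:, selected].astype(float)
reduced -= reduced.mean(axis=0)
u, _, _ = np.linalg.svd(reduced, full_matrices=False)
labels = (u[:, 0] > 0).astype(int)

print("selected SNP columns:", selected)
print("cluster labels:", labels)
```

In this toy setup the selected columns should mostly be the ten differentiated SNPs, and the two clusters should line up with the two simulated populations. Real genotype data is far noisier and would need more components and a proper clustering method.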



  1. Wow! This is great news - I hope all who want to be part of this can be, and I hope we'll see useful results soon. It's also great for the human races scientifically, since many people say that there is no race at all - it's just a human construct. This can prove there is a genetic plan and origin for all. It's nice to be alive to see this - though it won't change a thing about people, but will address proven needs.

  2. Me again - I don't really like everyone caucasian claiming to be my ethnicity - especially if they don't look or act like me or if they are very fat. So I hope one day it'll be clear who shares my genetic makeup and who doesn't. And it would be nice if they have a set tribe name given to each group so they can be identified more easily.

