April 02, 2011

How many PCA dimensions to retain?

The paper below is quite useful for practitioners who want an automated way of selecting how many principal components to retain.

Personally, I have tried both visual inspection of clusters and the dimension selection procedure commonly used in the Clusters Galore approach. The former relies on human visual perception to assess variation and structure in 2D projections of the data, while the latter takes a more pragmatic approach: it selects as many components as are necessary for a model-based clustering algorithm to maximize its number of inferred clusters. In a sense, the latter procedure, which I prefer, assesses the benefit of extra dimensions by gauging whether they contribute noise or "clusteredness".
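To make the second procedure concrete, here is a rough sketch in Python. It is not the code used for the Clusters Galore analyses: scikit-learn's GaussianMixture with BIC-based selection stands in for the model-based clustering step, and the function names, the toy data, and the limits max_dims and max_clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def clusters_per_dimension(X, max_dims=10, max_clusters=12, seed=0):
    """For each number of retained PCs, return the cluster count preferred by BIC."""
    scores = PCA(n_components=max_dims).fit_transform(X)
    counts = {}
    for d in range(1, max_dims + 1):
        sub = scores[:, :d]  # keep only the first d principal components
        best_k, best_bic = 1, np.inf
        for k in range(1, max_clusters + 1):
            gm = GaussianMixture(n_components=k, covariance_type="full",
                                 random_state=seed).fit(sub)
            bic = gm.bic(sub)  # scikit-learn convention: lower BIC is better
            if bic < best_bic:
                best_k, best_bic = k, bic
        counts[d] = best_k
    return counts


def pick_dims(counts):
    """Smallest number of PCs at which the inferred cluster count peaks."""
    top = max(counts.values())
    return min(d for d, k in counts.items() if k == top)


# Toy data (hypothetical): three well-separated groups in 10 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=6.0, size=(3, 10))
X = np.vstack([c + rng.standard_normal((60, 10)) for c in centers])

counts = clusters_per_dimension(X, max_dims=4, max_clusters=5)
best_d = pick_dims(counts)
print(counts, best_d)
```

Extra dimensions that carry only noise tend to leave the cluster count flat or lower it, which is why the sketch keeps the smallest dimensionality at which the count peaks.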

I can definitely see the usefulness of the type of procedure described in this paper, and certainly the critique of the Tracy-Widom distribution is valuable, as is the study of the number of significant components in the presence of admixed individuals.

Heredity. 2011 Mar 30. [Epub ahead of print]

Investigating population stratification and admixture using eigenanalysis of dense genotypes.

Shriner D.

Center for Research on Genomics and Global Health, National Human Genome Research Institute, Bethesda, MD, USA.


Principal components analysis of genetic data is used to avoid inflation in type I error rates in association testing due to population stratification by covariate adjustment using the top eigenvectors and to estimate cluster or group membership independent of self-reported or ethnic identities. Eigendecomposition transforms correlated variables into an equal number of uncorrelated variables. Numerous stopping rules have been developed to identify which principal components should be retained. Recent developments in random matrix theory have led to a formal hypothesis test of the top eigenvalue, providing another way to achieve dimension reduction. In this study, I compare Velicer's minimum average partial test to a test on the basis of Tracy-Widom distribution as implemented in EIGENSOFT, the most widely used implementation of principal components analysis in genome-wide association analysis. By computer simulation of vicariance on the basis of coalescent theory, EIGENSOFT systematically overestimates the number of significant principal components. Furthermore, this overestimation is larger for samples of admixed individuals than for samples of unadmixed individuals. Overestimating the number of significant principal components can potentially lead to a loss of power in association testing by adjusting for unnecessary covariates and may lead to incorrect inferences about group differentiation. Velicer's minimum average partial test is shown to have both smaller bias and smaller variance, often with a mean squared error of 0, in estimating the number of principal components to retain. Velicer's minimum average partial test is implemented in R code and is suitable for genome-wide genotype data with or without population labels.
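For readers unfamiliar with Velicer's minimum average partial (MAP) test, the following is a minimal Python sketch of the classic form of the procedure applied to a correlation matrix. It is not the R code from the paper, and it ignores any genotype-specific preprocessing the paper may use; the function name velicer_map and the simulated data are assumptions for illustration. The idea is to partial out the first m components, compute the average squared partial correlation among the variables, and retain the m at which that average is smallest.

```python
import numpy as np


def velicer_map(R):
    """Velicer's MAP on a correlation matrix R.

    Returns (k, f): f[m] is the average squared partial correlation after
    removing the first m principal components, and k is the m minimizing f.
    """
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals = np.clip(eigvals[order], 0.0, None)
    loadings = eigvecs[:, order] * np.sqrt(eigvals)
    off = ~np.eye(p, dtype=bool)                 # mask for off-diagonal entries
    f = np.empty(p)
    f[0] = np.mean(R[off] ** 2)                  # m = 0: nothing partialled out
    for m in range(1, p):
        A = loadings[:, :m]
        C = R - A @ A.T                          # residual (partial) covariance
        d = np.sqrt(np.maximum(np.diag(C), 1e-12))  # floor guards against 0/0
        partial = C / np.outer(d, d)             # partial correlation matrix
        f[m] = np.mean(partial[off] ** 2)
    return int(np.argmin(f)), f


# Simulated example (hypothetical): 8 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
n = 2000
factors = rng.standard_normal((n, 2))
load = np.zeros((2, 8))
load[0, :4] = 0.8
load[1, 4:] = 0.8
X = factors @ load + 0.6 * rng.standard_normal((n, 8))
R = np.corrcoef(X, rowvar=False)

k, f = velicer_map(R)
print(k, np.round(f, 3))
```

Removing components beyond the true number starts to partial out unique variance, which inflates the partial correlations again; that rebound is what produces the minimum.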



Jim Bowery said...

You might try one of the more recent multi-dimensional visualization techniques, such as vector fusion.

Andrew Oh-Willeke said...

The theoretical issue here is huge, particularly given the popularity of programs like ADMIXTURE.

But, there may be a complicating issue. A cluster is valid to the extent that it doesn't blend with members of another cluster, and to the extent that the sample is large enough to capture the population it represents rather than simply elevating random quirks in the data to definitional status.

But, the evidence to date, such as the high level of precision with which genetics can be used to identify even village-level ancestry in parts of the world with stable populations for the last few centuries, suggests that the level of detail at which clustering becomes incoherent is very fine.

In theory, looking at admixture of populations from the Eastern side of a French region with the Western side of a French region is no less meaningful than looking at admixture between ANI and ASI populations in South Asia as a whole. It is simply a difference of scale.

To really get meaningful answers then, one needs more than just math. One needs to coherently articulate when and why one would care about one scale of admixture v. another in time, space and phylogeny, and to find a way to represent that numerically.

Generally, in practice, the greater the time depth of separation of genetic components prior to admixture, the more interesting they are to us. We find most interesting the most phylogenetically remote admixtures, particularly when they have been present long enough to reach near fixation in the localized admixed population, because these are traces of particularly notable pre-historic events in population history.

Hence, for example, we care more about looking at respective Yayoi, later Chinese, Jomon, Siberian and Austronesian contributions to Japanese population structure than about differentiations in the Jomon population between moderately separated pre-Yayoi villages.