Personally, I have tried three methods for choosing the number of principal components to retain:
- Tracy-Widom, which seems to retain more dimensions than are necessary, with a resulting reduction in clustering quality.
- A test of normality (such as Shapiro-Wilk), which tends to identify a smaller number of dimensions along which the data appear non-normally distributed and hence may contain useful information about population structure.
- A more pragmatic approach: picking the number of retained components that maximizes the number of clusters inferred by MCLUST.
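The normality-test idea from the list above can be sketched in a few lines of Python. This is only a rough illustration, not the exact procedure I ran: the function name `non_normal_components`, the `alpha` and `max_components` parameters, and the simulated genotypes are all assumptions made for the example. It centres each marker, takes principal component scores from an SVD, and flags the components whose scores Shapiro-Wilk rejects as normal.

```python
import numpy as np
from scipy.stats import shapiro


def non_normal_components(genotypes, alpha=0.05, max_components=10):
    """Flag leading PCs whose scores look non-normal under Shapiro-Wilk.

    genotypes: (individuals x markers) array. Names and defaults here are
    hypothetical choices for this sketch.
    """
    X = np.asarray(genotypes, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each marker
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # PC scores, one column per PC
    k = min(max_components, scores.shape[1])
    # shapiro() returns (statistic, p-value); keep only the p-value
    pvals = np.array([shapiro(scores[:, j])[1] for j in range(k)])
    return np.flatnonzero(pvals < alpha), pvals


# Simulated example: two populations with unrelated allele frequencies.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, 300)
freq_b = rng.uniform(0.1, 0.9, 300)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(20, 300)),
    rng.binomial(2, freq_b, size=(20, 300)),
])
kept, pvals = non_normal_components(geno)
print("components flagged as non-normal (0-based):", kept)
```

With strongly diverged populations the first component is bimodal, so it should be flagged. No multiple-testing correction is applied here, so some later components can be flagged by chance.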
Human Heredity Vol. 73, No. 2, 2012
Improved Eigenanalysis of Discrete Subpopulations and Admixture Using the Minimum Average Partial Test
Abstract: Principal components analysis of genetic data has benefited from advances in random matrix theory. The Tracy-Widom distribution has been identified as the limiting distribution of the lead eigenvalue, enabling formal hypothesis testing of population structure. Additionally, a phase change exists between small and large eigenvalues, such that population divergence below a threshold of FST is impossible to detect and above which it is always detectable. I show that the plug-in estimate of the effective number of markers in the EIGENSOFT software often exceeds the rank of the sample covariance matrix, leading to a systematic overestimation of the number of significant principal components. I describe an alternative plug-in estimate that eliminates the problem. This improvement is not just an asymptotic result but is directly applicable to finite samples. The minimum average partial test, based on minimizing the average squared partial correlation between individuals, can detect population structure at smaller FST values than the corrected test. The minimum average partial test is applicable to both unadmixed and admixed samples, with arbitrary numbers of discrete subpopulations or parental populations, respectively. Application of the minimum average partial test to the 11 HapMap Phase III samples, comprising 8 unadmixed samples and 3 admixed samples, revealed 13 significant principal components.
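To make the abstract's description of the minimum average partial test concrete, here is a minimal Python sketch of Velicer's general MAP procedure applied to correlations between individuals. It is not the paper's implementation: the function name `map_test`, the marker centring, and the way the individual-by-individual correlation matrix is built are assumptions for illustration. For each candidate number of components m, the first m components are partialled out and the squared off-diagonal partial correlations are averaged. The m with the smallest average is returned.

```python
import numpy as np


def map_test(genotypes):
    """Sketch of a minimum average partial (MAP) test.

    genotypes: (individuals x markers) array. Returns the number of
    components minimizing the average squared partial correlation between
    individuals, plus the full curve. Normalization choices are assumptions.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                       # centre each marker
    G = X @ X.T / X.shape[1]                     # covariance between individuals
    d = np.sqrt(np.diag(G))
    R = G / np.outer(d, d)                       # correlation between individuals
    n = R.shape[0]

    eigvals, eigvecs = np.linalg.eigh(R)         # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    loadings = eigvecs[:, order] * np.sqrt(eigvals)

    off_diag = ~np.eye(n, dtype=bool)
    avg_sq = [np.mean(R[off_diag] ** 2)]         # m = 0: nothing partialled out
    for m in range(1, n - 1):
        A = loadings[:, :m]
        C = R - A @ A.T                          # residual (partial) covariance
        v = np.diag(C)
        if np.any(v <= 1e-10):                   # residual variance exhausted
            break
        P = C / np.sqrt(np.outer(v, v))          # partial correlations
        avg_sq.append(np.mean(P[off_diag] ** 2))
    return int(np.argmin(avg_sq)), np.array(avg_sq)
```

Calling `map_test(geno)` on an individuals-by-markers genotype matrix returns the selected number of components and the curve of average squared partial correlations, which is worth plotting to see how sharp the minimum is.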