July 21, 2013

Fst paper + EIGENSOFT 5.0

Razib covers a new paper on the varieties of Fst and its dependence on rare variants; this should be a very useful read for anyone interested in this widely used measure of genetic differentiation.

There is also a new 5.0 release of the EIGENSOFT suite of applications, which can compute Fst (among other things). From the README file:
NEW features of EIGENSOFT version 5.0 include (see POPGEN/README):
-- New option lsqproject for PCA projection with large amounts of missing data
-- New options grmoutname and grmbinary to output genetic relationship matrix,
   compatible with GCTA software (v1.13)
-- Expanded options for LD regression in computing genetic relationship matrix
-- Bug fix for PLINK format files with out-of-order SNPs
NOTE: multi-threading is no longer supported. Users wishing to run in multi-threaded
mode are recommended to use EIG4.2
NOTE: fortran compiler is no longer required to build EIGENSOFT. (However, the
lapack and blas libraries must still be installed on the system)

Genome Research doi: 10.1101/gr.154831.113

Estimating and interpreting Fst: the impact of rare variants 

Gaurav Bhatia et al.

In a pair of seminal papers, Sewall Wright and Gustave Malécot introduced FST as a measure of structure in natural populations. In the decades that followed, a number of papers provided differing definitions, estimation methods, and interpretations beyond Wright's. While this diversity in methods has enabled many studies in genetics, it has also introduced confusion about how to estimate FST from available data. Considering this confusion, wide variation in published estimates of FST for pairs of HapMap populations is a cause for concern. These estimates changed, in some cases more than two-fold, when comparing estimates from genotyping arrays to those from sequence data (1000 Genomes Project Consortium 2010; International HapMap 3 Consortium 2010). Indeed, changes in FST from sequencing data might be expected due to population genetic factors affecting rare variants. While rare variants do influence the result, we show that this is largely through differences in estimation methods. Correcting for this yields estimates of FST that are much more concordant between sequence and genotype data. These differences relate to three specific issues: (1) estimating FST for a single SNP, (2) combining estimates of FST across multiple SNPs, and (3) selecting the set of SNPs used in the computation. Changes in each of these aspects of estimation may result in FST estimates that are highly divergent from one another. Here, we clarify these issues and propose solutions.
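To make issue (2) concrete, here is a rough Python sketch of how the choice of averaging can move the number. It is an illustration under my own assumptions, not the paper's code or anything taken from EIGENSOFT: the per-SNP numerator and denominator follow a Hudson-style estimator as I understand it, the function names are made up for this post, and the two SNPs are toy values (one common, one rare) rather than real data.

```python
from typing import Iterable, Tuple

# One SNP: (p1, p2, n1, n2) = sample allele frequencies in the two
# populations and the number of sampled chromosomes in each.
Snp = Tuple[float, float, int, int]


def hudson_terms(p1: float, p2: float, n1: int, n2: int) -> Tuple[float, float]:
    """Numerator and denominator of a Hudson-style Fst estimate for one SNP.

    Assumed form (not quoted from the post):
      num = (p1 - p2)^2 - p1(1-p1)/(n1-1) - p2(1-p2)/(n2-1)
      den = p1(1-p2) + p2(1-p1)
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def fst_ratio_of_averages(snps: Iterable[Snp]) -> float:
    """Sum numerators and denominators over SNPs, then divide once."""
    terms = [hudson_terms(*s) for s in snps]
    return sum(n for n, _ in terms) / sum(d for _, d in terms)


def fst_average_of_ratios(snps: Iterable[Snp]) -> float:
    """Compute a per-SNP ratio first, then take the plain mean."""
    ratios = []
    for s in snps:
        num, den = hudson_terms(*s)
        if den > 0:  # skip SNPs that are monomorphic in both samples
            ratios.append(num / den)
    return sum(ratios) / len(ratios)


# Toy data (hypothetical): one common SNP and one rare, near-private SNP.
toy_snps = [
    (0.50, 0.10, 100, 100),
    (0.01, 0.00, 100, 100),
]

roa = fst_ratio_of_averages(toy_snps)
aor = fst_average_of_ratios(toy_snps)
print(f"ratio of averages: {roa:.3f}")
print(f"average of ratios: {aor:.3f}")
```

With these toy inputs the rare SNP contributes a per-SNP ratio near zero, so the average of ratios is pulled down sharply, while the ratio of averages is dominated by the common SNP. That is the kind of sensitivity to rare variants and to SNP selection that the abstract describes.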



Seinundzeit said...

It is going to be fun, seeing Fst estimates. These have been sorely lacking across the genome blogosphere.

eurologist said...

We recently had a discussion related to this, regarding the question of whether F_ST can reasonably suggest that there is a second, anciently important Uralic population in Europe, in addition to Finns.

While I am not sure the "average" definition supported in this paper is functionally the best, a good separation of effects due to drift vs. rare, recent mutations is certainly extremely valuable.