August 11, 2012

HAPI-UR: a revolution in phasing speed?

I have primarily used Beagle and ShapeIT whenever I needed to do some phasing, but I've found that this is often impractical for datasets that are large in either the number of individuals or the number of SNPs.

So, it was with great pleasure that I encountered a new paper by Williams et al. in AJHG that proposes a new phasing algorithm for unrelated data that seems to be significantly faster than existing approaches, without sacrificing phasing accuracy.

The software for this will eventually appear here, and I'm sure I will try it for myself before long. Phasing is an important step in the fastIBD and ChromoPainter applications that I've used in recent months; for the dataset sizes I've been using, it appears to be no great overhead, but if I can get better accuracy in the same time and/or analyze bigger datasets, it will certainly be welcome. And perhaps other algorithms (besides phasing ones) that deal with quadratic comparisons between pairs of individuals can get something out of this to further improve performance.

I'm certainly looking forward to when people start using this for population genetics analyses on global datasets.

The American Journal of Human Genetics, Volume 91, Issue 2, 238-251, 10 August 2012

Phasing of Many Thousands of Genotyped Samples

Amy L. Williams, Nick Patterson, Joseph Glessner, Hakon Hakonarson and David Reich

Haplotypes are an important resource for a large number of applications in human genetics, but computationally inferred haplotypes are subject to switch errors that decrease their utility. The accuracy of computationally inferred haplotypes increases with sample size, and although ever larger genotypic data sets are being generated, the fact that existing methods require substantial computational resources limits their applicability to data sets containing tens or hundreds of thousands of samples. Here, we present HAPI-UR (haplotype inference for unrelated samples), an algorithm that is designed to handle unrelated and/or trio and duo family data, that has accuracy comparable to or greater than existing methods, and that is computationally efficient and can be applied to 100,000 samples or more. We use HAPI-UR to phase a data set with 58,207 samples and show that it achieves practical runtime and that switch errors decrease with sample size even with the use of samples from multiple ethnicities. Using a data set with 16,353 samples, we compare HAPI-UR to Beagle, MaCH, IMPUTE2, and SHAPEIT and show that HAPI-UR runs 18× faster than all methods and has a lower switch-error rate than do other methods except for Beagle; with the use of consensus phasing, running HAPI-UR three times gives a slightly lower switch-error rate than Beagle does and is more than six times faster. We demonstrate results similar to those from Beagle on another data set with a higher marker density. Lastly, we show that HAPI-UR has better runtime scaling properties than does Beagle so that for larger data sets, HAPI-UR will be practical and will have an even larger runtime advantage. HAPI-UR is available online (see Web Resources).
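For readers unfamiliar with the accuracy measure used in the abstract, here is a minimal sketch of how a switch-error rate can be computed for a single individual. This is not code from HAPI-UR or from the paper; the function name `switch_error_rate` and the 0/1 list encoding of haplotypes are assumptions made for illustration, and the sketch assumes the inferred haplotypes carry the same genotypes as the true ones.

```python
def switch_error_rate(true_h1, true_h2, inf_h1, inf_h2):
    """Fraction of consecutive heterozygous-site pairs whose phase flips.

    Hypothetical helper for illustration; haplotypes are equal-length
    lists of 0/1 alleles. inf_h2 is accepted for symmetry but is implied
    by inf_h1 at heterozygous sites when genotypes agree with the truth.
    """
    # Only heterozygous sites carry phase information.
    het = [i for i in range(len(true_h1)) if true_h1[i] != true_h2[i]]
    if len(het) < 2:
        return 0.0  # no pair of het sites, so no switch is possible

    # At each het site, record which true haplotype inf_h1 is copying.
    phase = [inf_h1[i] == true_h1[i] for i in het]

    # A switch error is a change of that assignment between neighbours.
    switches = sum(1 for a, b in zip(phase, phase[1:]) if a != b)
    return switches / (len(het) - 1)


# Toy example: the inferred haplotype follows true_h1 for two sites,
# then follows true_h2 for the remaining three (one switch, four gaps).
true_h1 = [0, 1, 0, 1, 0]
true_h2 = [1, 0, 1, 0, 1]
inf_h1 = [0, 1, 1, 0, 1]
inf_h2 = [1, 0, 0, 1, 0]
rate = switch_error_rate(true_h1, true_h2, inf_h1, inf_h2)
```

In the toy example there is one switch among four consecutive pairs of heterozygous sites, so `rate` is 0.25. Real evaluations average this over many individuals and need trusted haplotypes (for instance from trios) to serve as the truth.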


1 comment:

Mongoose said...

Hi there, I know you said to stay on topic and this has nothing to do with your blog or your area of expertise, but have you ever heard of a Chalcolithic or Early Bronze Age site called "Guschau" in Germany? There is a reference to it in RF Tylecote's History of Metallurgy but I can't get anything at all on Google and he doesn't have a citation. Supposedly it has some interesting slag from but that's all Tylecote said about it. Any help would be very much appreciated. (I'd comment on-topic but honestly I don't understand much of your blog, it's too technical for me.)