March 26, 2012

Similarity matrices and clustering (Lawson and Falush)

Lawson and Falush have a new review paper on clustering methods for haplotype data, covering their own ChromoPainter/fineSTRUCTURE methodology as well as the MCLUST/fastIBD approach that I started playing with a while back.

I won't have much time over the next few days to review this new work comprehensively, but I will add one data point to the discussion by pointing to my ChromoPainter and fastIBD analyses over the same dataset. I will also add any further comments to this blog post once I get the opportunity to read the paper.

Another point that needs to be made is how commendable the ChromoPainter folks' attitude towards the topic has been. Not only did they post their ChromoPainter preprint and software online months before their original paper was published, but they also quickly picked up my comments and suggestions on their paper to write their new review paper, making it available as a preprint as well. I'm guessing this saved a year or two over what would have been possible if all the formalities of "traditional" publishing had been observed. It's also a very nice example of the synergy between professional and amateur science that the Internet and social media have made possible.

Similarity matrices and clustering algorithms for population identification using genetic data


Daniel John Lawson and Daniel Falush

Abstract

A large number of algorithms have been developed to identify population structure from genetic data. Recent results show that the information used by both model-based clustering methods and Principal Components Analysis can be summarised by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that better exploit the power of genome data by taking linkage into account. We review several such matrices and evaluate their ‘information content’. A two-stage approach for population identification is to first construct a similarity matrix, and then perform clustering. We review a range of common clustering algorithms, and evaluate their performance through a simulation study. The clustering step can be performed either directly, or after using a dimension reduction technique such as Principal Components Analysis, which we find substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix, finding that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly in regards to robustness against features such as relatedness between individuals.
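
To make the two-stage idea in the abstract a bit more concrete, here is a minimal sketch in Python of "build a similarity matrix, reduce dimension, then cluster". It is only an illustration under loud assumptions, not the authors' implementation: the data are simulated toy genotypes from two made-up populations, markers are treated as independent (so this is nothing like the linkage-aware ChromoPainter matrix), the dimension reduction is just the leading principal component found by power iteration, and the clustering is a plain one-dimensional 2-means. All function names (`simulate_genotypes`, `similarity_matrix`, `leading_component`, `two_means`) are hypothetical.

```python
import random


def simulate_genotypes(n_per_pop, n_snps, rng):
    """Toy data (an assumption, not from the paper): two populations whose
    allele frequencies differ at each SNP; genotypes are 0/1/2 allele counts."""
    freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    freqs_b = [min(0.95, max(0.05, f + rng.choice([-0.3, 0.3]))) for f in freqs_a]
    rows = []
    for freqs in (freqs_a, freqs_b):
        for _ in range(n_per_pop):
            # Two independent allele draws per SNP give a count of 0, 1 or 2.
            rows.append([(rng.random() < f) + (rng.random() < f) for f in freqs])
    return rows


def similarity_matrix(genotypes):
    """Stage 1: pairwise similarity between individuals, treating markers as
    independent. Each SNP is centred and scaled by its sample standard
    deviation, which is one of several possible frequency weightings."""
    n = len(genotypes)
    scaled = []
    for column in zip(*genotypes):
        mean = sum(column) / n
        var = sum((g - mean) ** 2 for g in column) / n
        if var > 0:  # skip SNPs with no variation in the sample
            sd = var ** 0.5
            scaled.append([(g - mean) / sd for g in column])
    m = len(scaled)
    return [[sum(c[i] * c[j] for c in scaled) / m for j in range(n)]
            for i in range(n)]


def leading_component(sim, rng, iterations=200):
    """Dimension reduction: leading eigenvector of the similarity matrix,
    found by power iteration (a stand-in for a full PCA)."""
    n = len(sim)
    vec = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        vec = [sum(row[j] * vec[j] for j in range(n)) for row in sim]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


def two_means(scores, iterations=50):
    """Stage 2: 2-means clustering of individuals on their one-dimensional
    scores, with centroids initialised at the minimum and maximum score."""
    low, high = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iterations):
        labels = [0 if abs(s - low) <= abs(s - high) else 1 for s in scores]
        group0 = [s for s, lab in zip(scores, labels) if lab == 0]
        group1 = [s for s, lab in zip(scores, labels) if lab == 1]
        low = sum(group0) / len(group0)
        high = sum(group1) / len(group1)
    return labels


rng = random.Random(0)
genotypes = simulate_genotypes(n_per_pop=10, n_snps=200, rng=rng)
sim = similarity_matrix(genotypes)
pc1 = leading_component(sim, rng)
labels = two_means(pc1)
print(labels)
```

With populations this strongly differentiated, the first component alone should separate the two groups; real data with subtle structure, relatedness or linkage is exactly where the choices of similarity matrix and clustering method reviewed in the paper start to matter.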


Link
