February 13, 2009

Microsatellite-based molecular clock is pretty good

This paper aims to vindicate the utility of microsatellites (STRs) as a molecular clock, by comparing estimates of time from sequence alignments with those from STRs.

Sequence alignments are direct reads of the letters of DNA, e.g.,

AGCTTAC
AGCCTAC

Longer times of separation => more spots where sequences differ.
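As a minimal sketch of what "spots where sequences differ" means, the snippet below counts mismatched positions between the two toy sequences above. The function name `count_differences` is my own, and it assumes the sequences are already aligned to equal length; real divergence estimates also correct for things like multiple mutations at the same site.

```python
def count_differences(seq_a, seq_b):
    """Count aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# The two toy sequences from the text differ at a single position (T vs. C).
print(count_differences("AGCTTAC", "AGCCTAC"))  # 1
```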

Microsatellites are small stretches of DNA (a few bases) where a segment can exist in multiples, e.g.,

GATAGATAGATAGATAGATAGATAGATAGATAGATAGATA
GATAGATAGATAGATAGATAGATAGATA

The alleles for the above case are 10 and 7, respectively, since those are the numbers of repetitions of the GATA element.
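To make the allele naming concrete, here is a small sketch that counts back-to-back copies of a motif. The helper `count_repeats` is hypothetical and only handles a perfect repeat starting at the first base; actual genotyping infers repeat number from fragment length or from sequencing reads.

```python
def count_repeats(sequence, motif="GATA"):
    """Count consecutive copies of `motif` starting at the first base."""
    count = 0
    pos = 0
    while sequence.startswith(motif, pos):
        count += 1
        pos += len(motif)
    return count


# Build the two example alleles as 10 and 7 copies of GATA.
long_allele = "GATA" * 10
short_allele = "GATA" * 7
print(count_repeats(long_allele), count_repeats(short_allele))  # 10 7
```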

Theory suggests that the average squared distance (ASD), which is (10-7)^2 = 9 for this single pair of alleles, should scale linearly with time under the symmetric stepwise model, in which length increases and decreases are equiprobable and independent of allele length.
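The linear scaling can be illustrated with a toy simulation. This is only a sketch under my own assumptions, not the paper's method: two lineages start from the same allele at each locus, each mutates by +1 or -1 repeat with a made-up per-generation rate `mu`, and there are no range constraints. Under those assumptions the expected ASD is 2 * mu * generations, so doubling the time should roughly double the ASD.

```python
import random


def asd(alleles_a, alleles_b):
    """Average squared difference in repeat number across matched loci."""
    return sum((a - b) ** 2 for a, b in zip(alleles_a, alleles_b)) / len(alleles_a)


def simulate_asd(generations, n_loci=2000, mu=0.01, seed=0):
    """Evolve two lineages under a symmetric single-step model; return their ASD.

    mu, n_loci and the starting allele of 10 are illustrative assumptions.
    """
    rng = random.Random(seed)
    lineage_a = [10] * n_loci
    lineage_b = [10] * n_loci
    for _ in range(generations):
        for lineage in (lineage_a, lineage_b):
            for i in range(n_loci):
                if rng.random() < mu:
                    # Increases and decreases are equally likely, whatever the length.
                    lineage[i] += rng.choice((-1, 1))
    return asd(lineage_a, lineage_b)


print(asd([10], [7]))  # 9.0, the single-pair example from the text
for t in (100, 200, 400):
    # Simulated ASD next to the theoretical expectation 2 * mu * t.
    print(t, simulate_asd(t), 2 * 0.01 * t)
```

Adding length-dependent mutation rates or hard bounds on allele length to this toy model is one way to see how the linearity could break down.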

However, deviations from the stepwise model may upset this linearity: for example, if mutations are length-dependent (occurring more frequently when the original allele is 10 rather than 7), or if there are range restrictions on allele length, i.e., too few or too many repeats are "forbidden" by the chemistry of the mutation process.

What this paper does is show that time estimates from sequence divergence are linearly related to time estimates from microsatellites using ASD:
Sequence divergence and microsatellite ASD are linearly related: The regressions have correlation coefficients all greater than 0.97. Since sequence divergence is known to be proportional to tMRCA, microsatellite ASD is linear to tMRCA. Interestingly, however, the regression lines do not intersect the origin, a point we return to below.
The "information content" of the two types of system is quantified:
1 microsatellite is “worth” approximately 10 Kb of shotgun sequencing, which is expected to contain 10 nucleotide mutations between 2 modern humans.

Interestingly:
The microsatellite molecular clock appears to be linear for at least 2 million years ... Therefore, encouragingly, the duration of ASD linearity is at least 10 times that of theoretical predictions, suggesting range constraints are not as severe as previously imagined.
Of course, linearity may be exhibited only in the time window examined. This paper conclusively proves linearity across a particular range, but not for times outside this range (younger or older). For zero sequence divergence, the regression has non-zero ASD, which suggests a non-linear relationship for younger divergence times.

The authors give two explanations for the non-zero intercept: a technical one about heterozygous genotypes being miscalled as homozygous, and a theoretical one, which I prefer:
Alternatively, the relationship between ASD and tMRCA could be globally nonlinear, but easily linearizable in our time window. Whatever the cause for our observations, these results indicate that for population genetic analysis, it is important to use a calibration curve (such as Figure 1) to convert ASD to sequence divergence, correcting for the inflated estimate of divergence time from microsatellite ASD.

By "calibration curve" (Figure 1 of the paper), they mean a way to go from ASD to sequence divergence/time. In the time window considered, ASD increases at a constant rate with time, but the non-zero intercept suggests a different, indeed steeper, rate of increase for younger times.
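A rough sketch of how such a calibration might be used: fit a straight line of ASD against sequence divergence, then invert it, subtracting the intercept before dividing by the slope. The numbers below are invented purely for illustration and are not taken from the paper's Figure 1; `fit_line` and `asd_to_divergence` are my own helper names.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def asd_to_divergence(asd_value, slope, intercept):
    """Invert the calibration line to turn an ASD into a sequence divergence."""
    return (asd_value - intercept) / slope


# Made-up calibration points in arbitrary units (NOT data from the paper).
divergence = [1.0, 2.0, 3.0, 4.0]
asd_values = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(divergence, asd_values)
calibrated = asd_to_divergence(7.0, slope, intercept)
naive = 7.0 / slope  # ignores the intercept, so it overestimates divergence
print(slope, intercept, calibrated, naive)
```

With a positive intercept, the naive conversion always comes out larger than the calibrated one, which is the "inflated estimate" the authors warn about.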

Another interesting point raised is that microsatellites are less subject to ascertainment bias. If a single-letter site (a SNP) is found to be polymorphic in population A, it is not clear that it will also be polymorphic in population B. Therefore, if SNPs are "discovered" primarily in population A, then A will appear to be more "diverse" than B. This is not a problem with microsatellites, however, since they mutate much faster, so an STR that is polymorphic in A will also be polymorphic in B.

In conclusion, this is a very nice paper which vindicates the use of microsatellite-based age estimation, while raising some important concerns. In a few years, this whole debate may, however, be less important, since whole genomes will be sequenced economically.

However, with a back-of-the-napkin calculation, 150,000 polymorphic microsatellites in the human genome are "equivalent" (using 10 kb/microsatellite; see above) to 1.5 billion base pairs, or half the human genome. Probably many of these microsatellites won't be as informative or amenable to study as the small set sampled here, but in any case, this serves to demonstrate that microsatellites may still be relevant as an auxiliary source of information after whole genome sequencing becomes routine.
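The napkin arithmetic, spelled out. The genome size of 3 billion base pairs is a round figure I am assuming here; it is what "half the human genome" implies.

```python
n_microsatellites = 150_000        # polymorphic microsatellites, from the text
bp_per_microsatellite = 10_000     # 10 kb "worth" of sequence each, from the paper
genome_size_bp = 3_000_000_000     # assumed round figure for the human genome

equivalent_bp = n_microsatellites * bp_per_microsatellite
print(equivalent_bp)                   # 1500000000
print(equivalent_bp / genome_size_bp)  # 0.5
```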


Molecular Biology and Evolution, doi:10.1093/molbev/msp025

Microsatellites are molecular clocks that support accurate inferences about history

James X. Sun et al.

Microsatellite length mutations are often modeled using the generalized stepwise mutation process, which is a type of random walk. If this model is sufficiently accurate, one can estimate the coalescence time between alleles of a locus after a mathematical transformation of the allele lengths. When large-scale microsatellite genotyping first became possible, there was substantial interest in using this approach to make inferences about time and demography, but that interest has waned because it has not been possible to empirically validate the clock by comparing it to data in which the mutation process is well understood. We analyzed data from 783 microsatellite loci in human populations and 292 loci in chimpanzee populations, and compared them to up to one gigabase of aligned sequence data, where the molecular clock based upon nucleotide substitutions is believed to be reliable. We empirically demonstrate a remarkable linearity (r2 > 0.95) between the microsatellite average squared distance (ASD) statistic and sequence divergence. We demonstrate that microsatellites are accurate molecular clocks for coalescent times of at least two million years. We apply this insight to confirm that the African populations San, Biaka Pygmy, and Mbuti Pygmy have the deepest coalescent times among populations in the Human Genome Diversity Project. Furthermore, we show that microsatellites support unbiased estimates of population differentiation (FST) that are less subject to ascertainment bias than single nucleotide polymorphism (SNP) FST. These results raise the prospect of using microsatellite data sets to determine parameters of population history. When genotyped along with SNPs, microsatellite data can also be used to correct for SNP ascertainment bias.
