In this post I study how effectively the average squared distance (ASD) estimates the age of the most recent common ancestor (MRCA) of a pair of Y-STR haplotypes.
Each haplotype is a vector of nm allele values:
hta = (a1 a2 ... anm)
htb = (b1 b2 ... bnm)
The average squared distance is defined as:
ASD(hta, htb) = [(a1-b1)^2 + (a2-b2)^2 + ... + (anm-bnm)^2]/nm
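As a minimal sketch of this definition, the function below computes ASD for two haplotypes given as equal-length lists of repeat counts. The function name and the four allele values are made up for illustration and do not come from any real sample.

```python
def average_squared_distance(ht_a, ht_b):
    """Average squared difference in repeat counts across all markers."""
    if len(ht_a) != len(ht_b):
        raise ValueError("haplotypes must have the same number of markers")
    if not ht_a:
        raise ValueError("haplotypes must contain at least one marker")
    return sum((a - b) ** 2 for a, b in zip(ht_a, ht_b)) / len(ht_a)


# Hypothetical 4-marker haplotypes (illustrative values only).
ht_a = [14, 13, 30, 24]
ht_b = [15, 13, 28, 24]
print(average_squared_distance(ht_a, ht_b))  # (1 + 0 + 4 + 0) / 4 = 1.25
```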
If the MRCA lived g generations ago, then the expected value of ASD(hta, htb) is:
E[ASD(hta, htb)] = 2μg
where μ is the Y-STR mutation rate; the symmetric stepwise mutation model is assumed, in which a Y-STR allele increases or decreases by 1 repeat per generation with a probability of μ/2 each.
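For readers who want the intermediate step, the expectation follows from the variance of a symmetric random walk. This is a sketch, assuming that both lineages start from the same ancestral allele at each marker and that transmissions mutate independently. X_t and Y_t (the per-generation changes on the two lineages) are notation introduced here, and n_m stands for nm.

```latex
\begin{align*}
P(X_t = +1) &= P(X_t = -1) = \tfrac{\mu}{2}, \quad P(X_t = 0) = 1 - \mu
  \;\Rightarrow\; E[X_t] = 0,\; E[X_t^2] = \mu \\
a_i - b_i &= \sum_{t=1}^{g} X_t - \sum_{t=1}^{g} Y_t \\
E\big[(a_i - b_i)^2\big] &= \operatorname{Var}(a_i - b_i) = g\mu + g\mu = 2\mu g \\
E[\mathrm{ASD}(ht_a, ht_b)] &= \frac{1}{n_m} \sum_{i=1}^{n_m} E\big[(a_i - b_i)^2\big] = 2\mu g
\end{align*}
```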
The above equation allows us to estimate g as:
gest = ASD(hta, htb)/(2μest) (Eq. 1)
where μest is the estimated mutation rate. This post investigates how accurate gest is.
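A short sketch of Eq. 1 in code may help fix the units. The function name and the ASD value of 1.25 are hypothetical; the rate 0.0025 is the μest used in the simulation described in this post.

```python
def estimate_generations(asd, mu_est):
    """Eq. 1: generations to the MRCA from an ASD value and an estimated rate."""
    if mu_est <= 0:
        raise ValueError("mu_est must be positive")
    return asd / (2 * mu_est)


# Illustrative ASD value, not taken from real data.
g_est = estimate_generations(asd=1.25, mu_est=0.0025)
print(round(g_est))  # about 250 generations
```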
Two independent g-generation long chains of Y-chromosome transmissions are simulated, leading to two present-day haplotypes. Each haplotype is nm markers long.
Each marker has an estimated mutation rate μest=0.0025. This estimate is assumed to be derived from direct observation of nfs father-son pairs. To reflect the resulting sampling error, each marker mutates with a real mutation rate drawn as Binomial(nfs, μest)/nfs.
The mean of gest, its standard deviation, and its 95% interval (2.5th to 97.5th percentile) are computed over 10,000 simulation runs.
Results are presented for nm=10 or 50, representing typical values for a research paper and for commercial genealogical samples, respectively, and for nfs=1000 or 10000. For comparison, the mutation rate of DYS19, one of the most studied Y-STR loci, is based on 9,390 observations as of this writing, and the established rates of many other markers rest on far fewer observations.
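The simulation described above can be sketched as follows. This is not the original code, only one possible reading of the description: it assumes that the real rate of each marker is drawn once per run and shared by both chains, and that the 95% interval is taken from order statistics of the sorted estimates. To keep it fast with the standard library alone, mutations are placed by jumping between events with geometric waiting times rather than by looping over every generation; that shortcut, and all function names, are choices made here.

```python
import math
import random
import statistics


def binomial_draw(n_trials, p, rng):
    """Count successes in n_trials Bernoulli(p) trials by jumping between successes."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return n_trials
    log_q = math.log1p(-p)  # log(1 - p)
    successes, position = 0, 0
    while True:
        # Geometric waiting time (1, 2, ...) until the next success.
        position += int(math.log(1.0 - rng.random()) / log_q) + 1
        if position > n_trials:
            return successes
        successes += 1


def chain_offset(g, mu, rng):
    """Net change in repeat count along one chain of g transmissions."""
    n_mutations = binomial_draw(g, mu, rng)
    # Each mutation is +1 or -1 repeat with equal probability.
    return sum(rng.choice((-1, 1)) for _ in range(n_mutations))


def simulate_g_est(g, n_m, n_fs, mu_est, rng):
    """One simulation run: return g_est from Eq. 1 for a simulated pair."""
    total = 0
    for _ in range(n_m):
        # Real rate of this marker (assumption: shared by both chains).
        mu_real = binomial_draw(n_fs, mu_est, rng) / n_fs
        diff = chain_offset(g, mu_real, rng) - chain_offset(g, mu_real, rng)
        total += diff ** 2
    asd = total / n_m
    return asd / (2 * mu_est)


def summarize(g, n_m, n_fs, mu_est=0.0025, runs=10_000, seed=0):
    """Mean, s.d. and central 95% interval of g_est over many runs."""
    rng = random.Random(seed)
    estimates = sorted(simulate_g_est(g, n_m, n_fs, mu_est, rng) for _ in range(runs))
    low = estimates[int(0.025 * runs)]
    high = estimates[int(0.975 * runs) - 1]
    return statistics.fmean(estimates), statistics.stdev(estimates), (low, high)


if __name__ == "__main__":
    mean, sd, (low, high) = summarize(g=300, n_m=50, n_fs=10_000, runs=1_000)
    print(f"mean={mean:.0f}  s.d.={sd:.0f}  95% interval=({low:.0f}, {high:.0f})")
```

Because only the squared difference between the two haplotypes enters ASD, the ancestral allele value never needs to be simulated; tracking each chain's net offset is enough.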
The following table summarizes the simulation results:
| g | nm | nfs | Mean(gest) | s.d.(gest) | 95% C.I. |
It is evident that:
- The age estimate (Eq. 1) is unbiased, as Mean(gest) is quite close to the real g
- The standard deviation of the age estimate increases with g in absolute value, but decreases in relative value (s.d. (gest)/g).
- The standard deviation of the age estimate decreases with both nm (more markers) and nfs (better estimate of the mutation rate)
- Even for nm=50 and nfs=10000, there is considerable uncertainty about the TMRCA. For example, a most recent common ancestor who lived 300 generations ago can appear to be as young as 176 generations or as old as 456 generations, an interval 280 generations wide. If we add our uncertainty about generation length (e.g., 25 or 30 years), the interval runs from 176 × 25 = 4,400 years to 456 × 30 = 13,680 years, a span of 9,280 years that stretches from the Bronze Age to the Upper Paleolithic.
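The observations about the standard deviation can also be checked analytically. The following is a sketch under simplifying assumptions that the simulation does not make: the mutation rate μ is known exactly (no sampling error from nfs) and is the same for every marker. The symbols D (the allele difference at one marker), X_t (the change in one transmission) and λ = 2μg are introduced here for illustration, and n_m stands for nm.

```latex
% D = a_i - b_i is a sum of n = 2g independent per-transmission changes X_t,
% with P(X_t = +1) = P(X_t = -1) = \mu/2 and P(X_t = 0) = 1 - \mu.
\begin{align*}
E[X_t^2] &= E[X_t^4] = \mu, \qquad \lambda = n\mu = 2\mu g \\
E[D^2] &= n\mu = \lambda \\
E[D^4] &= n\mu + 3n(n-1)\mu^2 \\
\operatorname{Var}(D^2) &= E[D^4] - \lambda^2 = 2\lambda^2 + \lambda(1 - 3\mu) \\
\operatorname{Var}(\mathrm{ASD}) &= \frac{2\lambda^2 + \lambda(1 - 3\mu)}{n_m} \\
\frac{\mathrm{s.d.}(g_{est})}{g} &= \frac{\mathrm{s.d.}(\mathrm{ASD})}{\lambda}
  = \sqrt{\frac{1}{n_m}\left(2 + \frac{1 - 3\mu}{2\mu g}\right)}
\end{align*}
```

Under these assumptions the relative error falls as g grows but never drops below sqrt(2/nm), while the absolute error keeps growing with g. For g=300, μ=0.0025 and nm=50 the formula gives a relative error of about 0.23, or roughly 69 generations, which is of the same order as the 176 to 456 interval quoted in the list.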
While ASD provides an unbiased estimator of TMRCA for a pair of haplotypes, it can provide, at present, only a very imperfect estimate because of:
- Stochasticity of the mutation process itself
- Inaccurate knowledge of the mutation rate
- Inaccurate knowledge of the generation length
The age estimate is, in fact, probably even worse, since the current simulation did not take into account:
- Deviations from the stepwise symmetric mutation model (multi-step increases/decreases in number of repeats)
- Lineage or allele-dependent mutation rate
In a few years, when every bit of variable DNA on the Y-chromosome is sequenced routinely, including Y-STRs, Y-SNPs, and indel polymorphisms, it will be possible to provide better TMRCA estimates for a pair of Y-chromosomes. For Y-STRs it is important to determine the mutation rate in even larger samples than are currently available (~10,000).
There will always be some residual uncertainty, e.g., because we will never be able to determine the generation length for prehistoric cultures. However, our estimates are likely to be much better than the ones possible today, which are really not much better than guesses.
It is important to be skeptical of the narrow confidence intervals associated with many published age estimates. The assumptions on which these intervals are based are rarely stated explicitly, and the intervals may (inappropriately) reflect only one of the at least five types of uncertainty (see Discussion) that are at play.