Var = wg
This equation has two unknowns: g, the time to the Most Recent Common Ancestor, and w, the unknown effective mutation rate. Thus, to estimate g we must first estimate w.
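To make the dependence on w concrete, here is a minimal Python sketch of the variance-based estimate. It assumes that Var is the repeat-count variance averaged over loci, which the text does not spell out; the function name `tmrca_from_variance` and the toy haplotypes are made up for illustration.

```python
from statistics import mean, pvariance


def tmrca_from_variance(haplotypes, w):
    """Solve Var = w * g for g, given an assumed effective rate w.

    Assumption: Var is the population variance of repeat counts at each
    locus, averaged over all loci.
    """
    n_loci = len(haplotypes[0])
    per_locus_var = [pvariance([h[i] for h in haplotypes]) for i in range(n_loci)]
    return mean(per_locus_var) / w


# Toy data: three men typed at two loci (repeat counts).
toy_haplotypes = [[10, 12], [11, 12], [12, 12]]

# The same observed variance gives different ages for different assumed rates.
for w in (0.0025, 0.0021):
    print(f"w={w}: g = {tmrca_from_variance(toy_haplotypes, w):.1f} generations")
```

The lower the assumed w, the older the estimated ancestor, which is why w has to be pinned down before g can be.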
What I have argued in my posts of the Y-STR Series is that w is "close" to μ, the germline mutation rate. However, "close" means "close on average". For any particular genealogy and random sequence of mutations within that genealogy, w may differ substantially from its expected value. Thus, a particular group may accumulate variance at a rate much lower than μ, or even at a higher one.
I have carried out some simulations with nm-marker haplotypes, to see how w, or more precisely w/μ (the effective rate expressed in units of the germline rate) behaves. As always, results are averaged over 10,000 runs. The number of generations was kept at g=150 and the individual marker mutation rate at μ=0.0025/locus/generation.
The table has five columns:
- m: the growth constant; each man has m sons on average according to a Poisson process
- nm: the number of Y-STR loci
- E[w/μ]: the mean effective mutation rate
- s.d.: the standard deviation of the effective mutation rate
- Group size: the average number of present-day descendants
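The kind of simulation described above can be sketched roughly as follows. This is not the code behind the table, only a small illustration under stated assumptions: a symmetric single-step mutation model, Var taken as the per-locus population variance averaged over loci, and extinct lineages reported as `None` so that they can be excluded from the averages. The names `poisson` and `simulate_w_over_mu` are hypothetical.

```python
import math
import random
from statistics import mean, pvariance


def poisson(lam, rng):
    """Draw a Poisson(lam) variate (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_w_over_mu(m, n_loci, g, mu, rng):
    """Grow one genealogy for g generations and return w/mu, or None if it dies out."""
    # Repeat counts are stored relative to the founder's haplotype.
    population = [(0,) * n_loci]
    for _ in range(g):
        next_gen = []
        for hap in population:
            for _ in range(poisson(m, rng)):  # each man has Poisson(m) sons
                # Assumed model: each locus moves +1 or -1 with probability mu.
                son = tuple(
                    allele + rng.choice((-1, 1)) if rng.random() < mu else allele
                    for allele in hap
                )
                next_gen.append(son)
        population = next_gen
        if not population:
            return None  # the male line went extinct
    var = mean(pvariance([h[i] for h in population]) for i in range(n_loci))
    return var / g / mu  # w = Var / g, expressed in units of mu


if __name__ == "__main__":
    rng = random.Random(1)
    # Small parameters so the run finishes quickly; not the settings of the table.
    runs = [simulate_w_over_mu(1.2, 5, 20, 0.01, rng) for _ in range(50)]
    survived = [r for r in runs if r is not None]
    print(len(survived), "surviving runs, mean w/mu =", mean(survived))
```

Tracking every man individually becomes slow for fast-growing groups over 150 generations, so a full-scale run would need a more compact representation of the population.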
It's obvious, as in the previous experiments, that a fairly fast growth constant (m=1.075) is necessary to create a haplogroup with a large number of present-day descendants (~435k), and for such a group, the effective mutation rate is 0.84μ.
As nm, the number of markers, increases, the s.d. of the effective rate decreases. But note that the difference between nm=11 and nm=16 is minuscule. A large number of markers is welcome when available, but no substantial gain in accuracy is to be expected beyond 10-20 markers.
These results illustrate vividly that large haplogroups behave more "regularly" than small ones. For m=1 and with 16 markers it is w=0.31μ (s.d.=0.22), whereas for m=1.075 it is w=0.84μ (s.d.=0.12). So, while for the small group the standard deviation is 71% of the expected value, for the large group it is only 14%.
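The two percentages are simply the standard deviation divided by the mean (the coefficient of variation). A short check using the figures quoted above, with `relative_sd` as a made-up helper name:

```python
def relative_sd(mean_rate, sd):
    """Standard deviation expressed as a fraction of the mean."""
    return sd / mean_rate


# (label, E[w/mu], s.d.) for 16 markers, as quoted in the text.
for label, mean_rate, sd in [("m=1", 0.31, 0.22), ("m=1.075", 0.84, 0.12)]:
    print(f"{label}: s.d. is {relative_sd(mean_rate, sd):.0%} of the mean")
# m=1: s.d. is 71% of the mean
# m=1.075: s.d. is 14% of the mean
```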
Thus, for large groups likely to be made the object of a population study, not only is the effective mutation rate close to the germline rate, but its variability is also greatly reduced.
Interclade Age Estimation
I have also carried out simulations using the interclade method, considering a pair of haplotypes whose common ancestor is g=150 generations in the past.
This method produces an unbiased estimate, but it is obvious that the estimate is a very noisy one, much noisier than in the previous case. Even for nm=67 markers (not shown in table), the standard deviation remains at 0.22.
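A rough sketch of such a pairwise experiment is given below. The exact estimator is not stated in the text, so this assumes the usual one for a symmetric single-step model: the squared difference between the two haplotypes, averaged over loci, has expectation 2μg, so dividing it by 2g gives an estimate of w. The function names and the `n_pairs` parameter are hypothetical. Under this assumption, and approximating the number of mutations per locus as Poisson with λ=2μg, the s.d. of w/μ for a single pair is sqrt((1+2λ)/(λ·nm)), which comes to about 0.22 for nm=67, in line with the figure quoted above.

```python
import random
from statistics import mean, pstdev


def mutate_lineage(n_loci, g, mu, rng):
    """Evolve one lineage for g generations from the common ancestor."""
    hap = [0] * n_loci  # repeat counts relative to the ancestor
    for _ in range(g):
        for i in range(n_loci):
            if rng.random() < mu:
                hap[i] += rng.choice((-1, 1))  # assumed single-step mutation
    return hap


def interclade_w_over_mu(n_loci, g, mu, rng, n_pairs=1):
    """Estimate w/mu from n_pairs independent pairs of haplotypes.

    Assumed estimator: mean squared difference over loci, divided by 2*g.
    """
    sq_diffs = []
    for _ in range(n_pairs):
        a = mutate_lineage(n_loci, g, mu, rng)
        b = mutate_lineage(n_loci, g, mu, rng)
        sq_diffs.extend((x - y) ** 2 for x, y in zip(a, b))
    return mean(sq_diffs) / (2 * g) / mu


if __name__ == "__main__":
    rng = random.Random(42)
    # 200 runs rather than 10,000 to keep the example quick.
    runs = [interclade_w_over_mu(11, 150, 0.0025, rng) for _ in range(200)]
    print("mean w/mu:", mean(runs), "s.d.:", pstdev(runs))
```

Setting `n_pairs` above 1 averages over several fully independent pairs, which reduces the spread of the estimate.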
The above experiment considered only a pair of haplotypes. This is a worst-case scenario. In a best-case scenario, two groups (A and B) of Y-chromosomes coalesce to ancestors who lived immediately after the common ancestor of both groups, and each group expanded at a very fast rate. In such a case, each pair of haplotypes (one from group A and one from group B) is approximately independent of any other pair.
Unfortunately, it's impossible to determine whether or not such a scenario is valid, since it would entail determining the age of the two groups, i.e. the very thing we are trying to estimate, as well as the population demography of each group.
I did carry out an experiment with 10 completely independent pairs (rather than 1) coalescing to the common ancestor. The results are listed below:
Unlike the simple variance-based method, the interclade method can be used only when two groups can be shown to coalesce to a common ancestor (e.g. haplogroups D and E to a common YAP-bearing ancestor).
Its accuracy depends on (i) a large number of markers being used, and (ii) the two groups having been founded soon after their common ancestor and then having expanded rapidly, so as to approximate a star phylogeny. Without these assumptions (which can't be verified easily), its performance is actually not superior to that of a simple variance method.