## September 11, 2008

### The effective mutation rate with multiple Y-STRs

The relationship between Y-STR variance Var and time in generations g is:

Var = wg

This equation has two unknowns: g, the time to the Most Recent Common Ancestor, and w, the unknown effective mutation rate. Thus, to estimate g we must first estimate w.

What I have argued in my posts of the is that w is "close" to μ, the germline mutation rate. However, "close" means "on average close". For any particular genealogy and random sequence of mutations within that genealogy, it may be  fairly different from its expected value. Thus, a particular group may accumulate variance at a much lower rate or even faster than μ.

I have carried out some simulations with nm-marker haplotypes, to see how w, or more precisely w/μ (the effective rate expressed in units of the germline rate) behaves. As always, results are averaged over 10,000 runs. The number of generations was kept at g=150 and the individual marker mutation rate at μ=0.0025/locus/generation.

 m nm E[w/μ] s.d. Group size 1.000 1 0.31 0.47 82 1.000 6 0.31 0.26 82 1.000 11 0.31 0.23 81 1.000 16 0.31 0.22 83 1.025 1 0.58 0.52 900 1.025 6 0.58 0.28 892 1.025 11 0.59 0.24 902 1.025 16 0.58 0.22 885 1.075 1 0.84 0.33 434347 1.075 6 0.84 0.16 438545 1.075 11 0.84 0.13 439894 1.075 16 0.84 0.12 432672

The table has five columns:
• m: the growth constant; each man has m sons on average according to a Poisson process
• nm: the number of Y-STR loci
• E[w/μ]: the mean effective mutation rate
• s.d.: the standard deviation of the effective mutation rate
• Group size: the average number of present-day descendants

It's obvious, as in the previous experiments, that a fairly fast growth constant (m=1.075) is necessary to create a haplogroup with a large number of present-day descendants (~435k), and for such a group, the effective mutation rate is 0.84μ.

As nm, the number of markers, increases, the s.d. of the effective rate decreases. But, note: the difference between nm=11 and nm=16 is miniscule. A large number of markers is welcome when available, but no substantial gain is to be expected in terms of accuracy, beyond 10-20 markers.

These results illustrate vividly that large haplogroups behave more "regularly" than small ones. For m=1 and with 16 markers it is w=0.31 (s.d.=0.22), whereas for m=1.075 it is w=0.84 (s.d.=0.12). So, while for the small group, the standard deviation is 71% of the expected value, for the large group it is only 14%.

Thus, for large groups likely to be made the object of a population study, not only is the effective mutation rate close to the germline rate, but its variability is also greatly reduced.

I have also carried out simulations using the interclade method, considering a pair of haplotypes whose common ancestor is g=150 generations in the past.

 nm Estimated Age s.d. (times g) 1 1.01 1.91 6 1.01 0.76 11 1.00 0.55 16 0.99 0.45

This method produces an unbiased estimate but it is obvious that it is a very noisy one, much more than in the previous case. Even for nm=67 markers (not shown in table), the standard deviation remains at 0.22.

The above experiment considered only a pair of haplotypes. This is a worst-case scenario. In a best-case scenario, two groups (A and B) of Y-chromosomes coalesce to ancestors who lived immediately after the common ancestor of both groups, and each group expanded at a very fast rate. In such a case, each pair of haplotypes (one from group A and one from group B) is approximately independent of any other pair.

Unfortunately, it's impossible to determine whether or not such a scenario is valid, since it would entail determining the age of the two groups, i.e. the very thing we are trying to estimate, as well as the population demography of each group.

I did carry an experiment with 10 completely independent pairs (rather than 1) coalescing to the common ancestor. The results are listed below:

 nm Estimated Age s.d. (times g) 1 1.01 0.57 6 0.98 0.22 11 1.00 0.17 16 1.01 0.15

Unlike the simple variance-based method, the interclade method can be used only when two groups can be shown to coalesce to a common ancestor (e.g. haplogroups D and E to a common YAP-bearing ancestor).

Its accuracy depends on (i) a large number of markers used, (ii) the two groups founded soon after their common ancestor, and then expanding rapidly, to approximate a star phylogeny. Without these assumptions (which can't be verified easily), its performance is actually not superior to that of a simple variance method.

McG said...

The original Var = mG analysis was performed by M. Slatkin and in his analysis he assumed that the time back to the ancestor t was known. A subsequent reframing by Goldstein, Science, 2001, used ASD and stated that the ancestral modal values of each dys loci had to be known. It appears that 1. Ken is arguing that this is not necessary; somehow interclade analysis somehow produces all the variance needed???

I have made many R1b TMRCA analyses and for a long time I mixed 393 = 12 and 393 = 13 in my data sets. It turns out that 12 is ancestral to 13 and is the most probable value that M269 had. I find that mixing the two shortens the TMRCA for the older one done separately. So, I would argue what Ken is saying is really that if you have two sub clades with different ancestors and different modal values, you can't do an interclade analysis??? However, I believe that if you know the modal values of the original,common, ancestor than you can??? When Ken does an R1b analysis and includes 393 =12 and 393 =13, he will not estimate the correct TMRCA - it will be lower!!! In the case of s116+, I estimate the 393 = 13 group converges at about 8320 BP and the 393 =12 group converges at 14K BP.

An interclade analysis should only be made where the modal values of the common ancestor are known. Further this is also true for a TMRCA analysis of any group, if you don't know the modal values - your estimate will be in error.

Based on this I also question your use of the median value vice mode. I would suggest that that calculation will also be in error.

The Goldstein paper is at: Science, vol 291, 2 March 2001 and was co-authored with Stumpf; the Slatkin paper is at Genetics 139, 457 1995.

Dienekes Pontikos said...

Ken is arguing that this is not necessary; somehow interclade analysis somehow produces all the variance needed??

In the interclade method the ancestral value is not needed. The average squared distance between two descendant haplotypes is calculated, and this corresponds to 2g generations. Hence: ASD = 2*mu*g

The "regular" ASD compares haplotypes with an inferred ancestral haplotype (g generations ago). Hence ASD = mu*g.

Actually though, the inferred ancestral haplotype (using e.g. modal or median allele values) is often not the real one, and hence the resulting ASD is less than the real ASD (if the real ancestral alleles were used). Therefore ASD underestimates the age. However, as I argued in my posts, this underestimation is a little and not at all 3.6x as in the Zhivotovsky et al. rate.

Variance is another issue altogether and should not be confused with ASD. Variance compares descendant allele values with the mean allele value (this can actually be fractional). It is also the case that Var = mu*g for a perfect star phylogeny, where descendant haplotypes coalesce precisely to the ancestor who lived g generations in the past. Likewise, in real-life this isn't the case, and Y chromosomes coalesce to later common ancestors. Hence, Var/mu underestimates the age as well, but again very little, and not 3.6x.

McG said...

It is my understanding that ASD = Variance/2 is a mathematical identity??

Please consider the case I have presented: R1b and the fact that even though the modal value at 393 in most data sets is 13; the antecedent MRCA value is 12. Kens approach apparently works for I?? Why not R? As I showed the TMRCA difference between 12 and 13 for s116+ is 6K years. That is not a small difference. I know he doesn't work with the modal value, and based on the original derivations I'm not sure he is right?

Lastly, I think 3.6X is a fictitious number. I believe it depends on the dys loci being used. Some of Chandlers mutation rates for slower dys loci, are very similar to ZUV. 388 is the key here, since its rate appears to very different in I and R1b.

Dienekes said...

It is my understanding that ASD = Variance/2 is a mathematical identity??

Not really. Of course the term "variance" has been used way to liberally in related conversations on the subject. Variance does not rely on the estimation of an ancestral allele and there is no strict mathematical relationship between itself and ASD.

388 is the key here

I prefer not to use 388 personally. As I have said elsewhere, very slow-mutating markers provide very noisy age estimates. 388 has the additional problem that its mutation rate is haplogroup-specific.

McG said...

I'm probably testing your patience, so I'll make my last comment and leave it.

1. Fast mutators should be used very carefully. They enhance "variance" by having a higher number of multi-step mutations.

2. Re: slow mutators. I generally agree they have a large SD. If nothing else due to the smaller number of samples. There is a phase issue here also, I believe. A mutation has an associated period = 1/f., f= frequency = mutation rate. It takes time to average a slow mutation out over the set of entries.

3. All the published derivations use modal value. We have a problem in analyzing the R1b data set in that the modal value at 393 is not the ancestral value. It makes TMRCA analysis trickier and can yield large errors if it isn't compensated for.

4. We need a precise set of definitions of terms so that we can understand each other. Nordtvedts analyses are too terse for me and are based on fast mutators.