September 07, 2008

Reconstructing the ancestral allele value in a Y-STR locus

Sengupta et al. (2006) proposed using the median observed allele in a Y-STR locus as an estimate of the ancestral allele in that locus.

The average squared distance (ASD) computed from the median allele tends to be lower than the ASD computed from the real ancestral allele. So, while the equation ASD = μg is appropriate if the real ancestral allele is used (where μ is the mutation rate and g the number of generations since the MRCA), if the median allele is used, then ASD = wa·g is appropriate, where wa is the effective mutation rate.

Unfortunately, this effective rate depends on population history: it approaches the germline rate μ if the haplogroup attains a large size soon after the MRCA. As I have argued in many recent posts, for the large observed modern haplogroups it is very likely that the effective rate is close to the germline rate; yet the discrepancy between the two introduces an element of uncertainty into the calculation of the TMRCA (= g).
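As a minimal sketch of the calculation discussed above, the following Python computes the ASD of a sample around a reference allele and converts it to an age with ASD/μ. The sample values, the mutation rate `MU`, and the function names are all made up for illustration, and taking the lower of the two middle values as the median for an even-sized sample is an assumption.

```python
import statistics


def average_squared_distance(alleles, reference):
    """Mean of the squared repeat-count differences from the reference allele."""
    return sum((a - reference) ** 2 for a in alleles) / len(alleles)


def age_in_generations(alleles, reference, mutation_rate):
    """Age estimate ASD / rate, in generations."""
    return average_squared_distance(alleles, reference) / mutation_rate


# Hypothetical repeat counts at one Y-STR locus (not real data).
sample = [13, 13, 13, 14, 13, 12, 13, 13, 14, 13]

# Hypothetical per-generation mutation rate, chosen only for the example.
MU = 0.002

# Median allele as the stand-in for the ancestral allele
# (median_low keeps the result an observed integer repeat count).
median_allele = statistics.median_low(sample)

print(median_allele)
print(age_in_generations(sample, median_allele, MU))
```

Here three of the ten men differ from the median allele by one repeat, so ASD = 0.3 and the age comes out at about 150 generations for the assumed rate. Dividing by the germline rate in this way is only appropriate to the extent that the effective rate is close to it.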

Naturally, one may ask: how often does the median allele equal the real ancestral allele? In my simulations, I set g to 10, 100, or 300 generations, and the growth constant m to 1 or 1.02. I expected the median allele to equal the ancestral allele more often for a younger group (less time for the ancestral value to be obscured), and also for a more rapidly expanding group (a more star-like pattern of expansion).

g     m      % Correct
10    1      98.1
100   1      83.0
300   1      61.0
10    1.02   98.3
100   1.02   86.5
300   1.02   79.6

This intuition is essentially correct: the ancestral allele is estimated most accurately for younger and more rapidly expanding groups.
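An experiment of this kind could be sketched roughly as follows. This is not the simulation behind the numbers above, whose details are not given here. It assumes a single founder, a Poisson number of sons with mean m per man, symmetric single-step mutations at a hypothetical rate `mu`, and the lower median for even-sized samples. Lines that die out before generation g are simply rerun.

```python
import math
import random
import statistics


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_lineage(rng, g, m, mu):
    """Grow one founder's male line for g generations.

    Alleles are repeat counts relative to the founder (who is 0).
    Returns the alleles of the men in generation g, or [] if the line died out.
    """
    alleles = [0]
    for _generation in range(g):
        sons = []
        for allele in alleles:
            for _son in range(poisson(rng, m)):
                value = allele
                if rng.random() < mu:
                    value += rng.choice((-1, 1))  # symmetric single-step mutation
                sons.append(value)
        alleles = sons
        if not alleles:
            break
    return alleles


def fraction_median_correct(g, m, mu, trials, seed=0):
    """Share of surviving lineages whose median allele equals the ancestral one."""
    rng = random.Random(seed)
    correct = 0
    done = 0
    while done < trials:
        alleles = simulate_lineage(rng, g, m, mu)
        if not alleles:
            continue  # extinct line: nothing to sample, so rerun
        done += 1
        if statistics.median_low(alleles) == 0:
            correct += 1
    return correct / trials


# Hypothetical mutation rate; the result depends on it and on the model above.
print(fraction_median_correct(g=10, m=1.02, mu=0.002, trials=200, seed=1))
```

One caveat with this sketch: the most recent common ancestor of the survivors may be younger than the founder, so g is an upper bound on the true TMRCA rather than an exact value.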

It would also be worthwhile to see how other methods of estimating ancestral alleles (e.g., building a rooted haplotype tree) would perform. I have personally only carried out experiments in which the modal (most frequent) allele is used.

Using the modal allele tends to give the right guess slightly more often than the median, which yields a less biased estimate of the age. However, when the modal allele fails, it fails spectacularly: the median allele is conservative, being right in the middle of the observed alleles, whereas the modal allele may fall at either a very low or a very high number of repeats, which may be a long way off from the ancestral value in a particular case. Hence, the modal allele leads to age estimates with a higher variance.

For this reason, I am in favor of using the median allele as an estimator of the ancestral one.
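To illustrate how the modal allele can fail badly, here is a small sketch on a made-up sample in which the most frequent allele sits at the low end of the observed range. The data and names are hypothetical, and the lower median is again assumed for even-sized samples.

```python
import statistics


def average_squared_distance(alleles, reference):
    """Mean of the squared repeat-count differences from the reference allele."""
    return sum((a - reference) ** 2 for a in alleles) / len(alleles)


# Hypothetical sample: 10 repeats is the most frequent value,
# but it lies at the edge of the observed range.
sample = [10, 10, 10, 10, 11, 12, 12, 13, 13, 14, 14]

median_allele = statistics.median_low(sample)  # middle of the sorted sample
modal_allele = statistics.mode(sample)         # most frequent value

asd_median = average_squared_distance(sample, median_allele)
asd_modal = average_squared_distance(sample, modal_allele)

print(median_allele, round(asd_median, 2))
print(modal_allele, round(asd_modal, 2))
```

In this sample the median is 12 and the mode is 10, and the ASD around the mode is more than twice the ASD around the median, so the two choices would give very different ages. When two alleles tie for most frequent, `statistics.mode` returns the first one encountered; `statistics.multimode` lists all of them.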

Appendix: age estimates using median or real ancestral allele

Below are the age estimates (ASD/μ) using either the median or the (unknown) ancestral allele.

g     m      Age (median allele)   Age (ancestral allele)
10    1      5.4                   10.3
100   1      42                    99.7
300   1      114.9                 302
10    1.02   5.6                   10.3
100   1.02   56.8                  100
300   1.02   233.1                 297.5
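Since both columns divide an ASD by the same μ, the ratio of the two ages in each row can be read as a rough estimate of the ratio of the effective rate to the germline rate, if one accepts the two ASD equations given earlier. The short sketch below only does that arithmetic on the table's numbers; the function name is made up for illustration.

```python
# (g, m): (age from median allele, age from ancestral allele), copied from the table.
AGES = {
    (10, 1.0): (5.4, 10.3),
    (100, 1.0): (42.0, 99.7),
    (300, 1.0): (114.9, 302.0),
    (10, 1.02): (5.6, 10.3),
    (100, 1.02): (56.8, 100.0),
    (300, 1.02): (233.1, 297.5),
}


def effective_to_germline_ratio(age_median, age_ancestral):
    """Ratio of the two age estimates, i.e. of the two ASDs."""
    return age_median / age_ancestral


ratios = {key: effective_to_germline_ratio(*ages) for key, ages in AGES.items()}

for (g, m), ratio in ratios.items():
    print(f"g={g:>3}  m={m:<4}  ratio={ratio:.2f}")
```

At g = 300 the ratio rises from about 0.38 for m = 1 to about 0.78 for m = 1.02, which is in line with the effective rate moving toward the germline rate when the group expands.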

1 comment:

McG said...

I have always pretty much used the modal value also. But, since my research has concentrated on R1b, I have done some additional studies that help me make better estimates of TMRCA. In R1b we have a discontinuity. I believe it was caused by the great flood, but whatever. I can show that 393 = 12 is ancestral to 13 for 393. Higher diversity, longer TMRCA. I have worked with several data sets and shown that the TMRCA of the set of 13's is often significantly shorter than that of the 12's. For s116+, I get about 6K years difference. This was not readily apparent in the data sets. By a whopping amount, 13 is modal. What first got me intrigued was the Tarin Iberian and non-Iberian datasets. I noticed that the Iberian, which is older, has a higher percent of 12 than the non-Iberian: 3% in the non-Iberian and 9% in the Iberian. This factor got me thinking this way. So if a value of a DYS locus is antecedent, it doesn't mutate to the modal. I know there are back mutations but they are a lesser effect than this.

One other study I performed was to separate 391 10/11 entries. In my analysis of the Scottish highlands I found that 11,11 appeared to be the Pictish signature and 10,11 the Scotti (Erainn). In this case, both the Iberian and non-Iberian data sets have about the same amount of each. I separated the 391 10 and 11 sets of data and this time I could see no difference in TMRCA. The mutation from 10 to 11 or 11 to 10 occurred very early and each has maintained a fairly constant population percentage since then.

Other than 393, I have not observed any other DYS loci of this nature in R1b.