October 05, 2008

On the use of average squared distance (ASD) to estimate the time to most recent common ancestor (TMRCA) of a pair of Y-STR haplotypes

UPDATE: See second part here.

In this post I study the effectiveness of average squared distance (ASD) to estimate the age of the most recent common ancestor (MRCA) of a pair of Y-STR haplotypes.

Each haplotype is a vector of nm allele values:

hta = (a1 a2 ... anm)
htb = (b1 b2 ... bnm)

The average squared distance is defined as:

ASD(hta, htb) = [(a1-b1)^2 + (a2-b2)^2 + ... + (anm-bnm)^2]/nm

If the MRCA lived g generations ago, then the expected value of ASD(hta, htb) is:

E[ASD(hta, htb)] = 2μg

where μ is the Y-STR mutation rate; the symmetric stepwise mutation model is assumed, in which a Y-STR allele increases or decreases by 1 repeat per generation, each with probability μ/2. Under this model, each allele difference ai-bi is the sum of roughly 2μg independent ±1 steps accumulated across the two lineages, so its expected square is 2μg, which gives the expectation above.

The above equation allows us to estimate g as:

gest = ASD(hta, htb)/(2μest) (Eq. 1)

where μest is the estimated mutation rate. This post investigates how accurate gest is.
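Eq. 1 can be written down directly in code. A minimal Python sketch (the 10-marker allele values below are made up purely for illustration):

```python
def asd(hta, htb):
    """Average squared distance between two haplotypes (lists of repeat counts)."""
    assert len(hta) == len(htb)
    return sum((a - b) ** 2 for a, b in zip(hta, htb)) / len(hta)

def g_est(hta, htb, mu_est):
    """TMRCA estimate in generations (Eq. 1): ASD / (2 * mu_est)."""
    return asd(hta, htb) / (2 * mu_est)

# Hypothetical 10-marker haplotypes differing by one step at three markers:
hta = [14, 13, 30, 24, 11, 13, 12, 16, 9, 11]
htb = [15, 13, 29, 24, 11, 14, 12, 16, 9, 11]
print(g_est(hta, htb, 0.0025))
```

With three single-step differences over ten markers, ASD is 0.3 and the estimate is 0.3/0.005 = 60 generations.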


Two independent g-generation-long chains of Y-chromosome transmissions are simulated, leading to two present-day haplotypes. Haplotypes are nm markers long.

Each marker has an estimated mutation rate μest=0.0025, assumed to be derived from direct observation of nfs father-son pairs. Hence, in the simulation each marker's real mutation rate is drawn from Binomial(nfs, μest)/nfs, reflecting the sampling error of a rate estimated from nfs pairs.

Estimates of the mean gest, its standard deviation, and the 95% C.I. (2.5th-97.5th percentiles) are presented over 10,000 simulation runs.
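The setup just described can be sketched as follows. This is a simplified reimplementation, not the original simulation code; the parameter values come from the text:

```python
import random

def simulate_gest(g, nm, nfs, mu_est=0.0025, rng=random):
    """One simulation run: two independent g-generation transmission
    chains from a common ancestor, symmetric stepwise mutation model."""
    asd_total = 0
    for _ in range(nm):
        # The marker's "real" rate: estimated from nfs father-son pairs,
        # so it carries binomial sampling error around mu_est.
        mu = sum(rng.random() < mu_est for _ in range(nfs)) / nfs
        # Net allele difference accumulated over the 2*g transmissions.
        diff = 0
        for _ in range(2 * g):
            if rng.random() < mu:
                diff += rng.choice((-1, 1))
        asd_total += diff ** 2
    # Eq. 1, applied to the simulated pair.
    return (asd_total / nm) / (2 * mu_est)
```

Averaging simulate_gest(100, 10, 1000) over many runs should reproduce the first row of the table: a mean near 100 with a large spread.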

Results are presented for nm=10 or 50, representing typical marker counts for a research paper or a commercial genealogical sample, respectively, and for nfs=1000 or 10000. The mutation rate of DYS19, one of the most studied Y-STR loci, is based on 9,390 observations as of this writing, and many other markers have established mutation rates based on far fewer samples.


The following table summarizes the simulation results:

g nm nfs Mean(gest) s.d. (gest) 95% C.I.
100 10 1000 99 71 20-280
200 10 1000 201 131 40-540
300 10 1000 301 190 60-780
400 10 1000 399 243 80-1000
500 10 1000 500 298 120-1240
600 10 1000 601 354 140-1500
100 10 10000 101 65 20-260
200 10 10000 202 112 60-480
300 10 10000 302 159 80-700
400 10 10000 399 204 120-900
500 10 10000 499 248 140-1120
600 10 10000 601 297 180-1320
100 50 1000 100 33 48-176
200 50 1000 201 58 108-332
300 50 1000 301 84 164-492
400 50 1000 401 110 224-648
500 50 1000 500 133 276-792
600 50 1000 601 160 336-960
100 50 10000 100 28 52-164
200 50 10000 200 50 116-308
300 50 10000 299 71 176-456
400 50 10000 400 91 244-600
500 50 10000 500 113 308-744
600 50 10000 601 132 372-892

It is evident that:
  • The age estimate (Eq. 1) is unbiased, as Mean(gest) is quite close to the real g.
  • The standard deviation of the age estimate increases with g in absolute value, but decreases in relative value (s.d.(gest)/g).
  • The standard deviation of the age estimate decreases with both nm (more markers) and nfs (better estimate of the mutation rate).
  • Even for nm=50 and nfs=10000, there is considerable uncertainty about the TMRCA. For example, a 300-generation most recent common ancestor can appear as young as 176 generations or as old as 456 generations, a range of 280 generations. If we add our uncertainty about generation length (e.g., 25 or 30 years), the interval stretches from about 4,400 to 13,680 years, a span of 9,280 years, reaching from the Bronze Age to the Upper Paleolithic.


While ASD provides an unbiased estimator of TMRCA for a pair of haplotypes, it is, at present, a very imprecise one, because of:
  1. Stochasticity of the mutation process itself
  2. Inaccurate knowledge of the mutation rate
  3. Inaccurate knowledge of the generation length

The age estimate is, in fact, probably even worse, since the current simulation did not take into account:
  1. Deviations from the stepwise symmetric mutation model (multi-step increases/decreases in number of repeats)
  2. Lineage or allele-dependent mutation rate


In a few years, when every bit of variable DNA on the Y-chromosome is routinely sequenced, including Y-STRs, Y-SNPs, and indel polymorphisms, it will be possible to provide better TMRCA estimates for a pair of Y-chromosomes. For Y-STRs, it is important to determine the mutation rates in even larger samples than are currently available (~10,000).

There will always be some residual uncertainty, e.g., because we will never be able to determine the generation length for prehistoric cultures. However, our estimates are likely to be much better than the ones possible today, which are really not much better than guesses.

It is important to be skeptical of the narrow confidence intervals associated with many published age estimates. The assumptions on which these intervals are based are rarely stated explicitly, and may (inappropriately) admit only one of the at least five types of uncertainty listed above.


McG said...

I do not use ASD, as I have mentioned before. My biggest concern is multi-steps, which you did not include in your model. My estimates of two known events, the Kerchner Family ancestor and the Clan Gregor founder, are more accurate than I would expect?? I use the slower mutators such that mutations outside of +/- 1 are less than 5% of the total number of mutations. I then do a simple count assuming that each mutation, regardless of step size, is one mutation.

We have here two issues, precision and accuracy. Accuracy is did I hit the right barn; precision is what is the cluster size around whatever barn I hit.

I start with .00069 per gen from Zhiv (so I may be aiming at the wrong barn). I divide that by 30 to get 2.3 X 10^-5 as my average mutation rate over the 7 dys loci ZUL used. I then use the rootsweb database of about 22K entries for the first 12 dys loci and count the number of mutations at the same 7 dys loci. I then compute the TMRCA average over these same dys loci. Given that TMRCA and count number from the data set for each dys loci, I compute the mutation rate for each dys loci.

I do not have any idea what precision I have at the end of all this; I do know, for two known cases, I have high accuracy. If I use the Chandler rates, I get about 1/3 of TMRCA for these two examples. The key with the counting is to count only unique transmission events, as Charles points out at his website. There are two close cousins in the MacGregor Ian Cam data set, one is 16876 I believe. I use the first 10 dys loci there. In the Kerchner data set I use ASD-derived mutation estimates for the faster mutators?

McG said...

Some add'l comments:

1. Stochasticity of the mutation process itself: I've observed what I would call a phase effect in that, especially for slow mutators, the time of occurrence is important, and if a very slow mutator just occurred, then estimates will be biased to essentially a higher mutation rate for that dys loci. Ergo, large data sets with many entries and long TMRCAs will provide more accurate estimates.
2. Inaccurate knowledge of the mutation rate: basically the same comment as above.
3. Inaccurate knowledge of the generation length: this is a nasty issue; certainly, life spans are longer these days, but procreation periods for "Patriarchs" lasted longer in the past. I basically think 25 years is too short for some populations. However, for Hispanic populations 25 may be more appropriate?
4. Multi-step mutations: estimates are they occur at the 5% level, and changes up to 4 steps have been observed. This is my strongest argument against ASD; certainly faster mutators will be overestimated as to mutation rate.
5. Lineage or allele-dependent mutation rate: I don't know about lineage, but I do know that travel to a different country with a different diet and temperature profile appears to increase the number of mutations. I have also seen different rates for the same dys loci in different haplogroups (388 in I1a and R1b, for example). Also, if the rate were fixed at a dys loci, then we could expect that the distribution around the modal would be symmetrical. This is rarely true. I've seen skewness in both directions.

Lastly, models themselves are estimates of the physical process. Is there an underlying plan to all of this? If so, for what purpose were these channel properties we observe created?? Einstein had quite a bit of trouble with quantum mechanics and the apparent roll of the dice in some models of the underlying process??

Jheald said...


You need to be quite careful here. "Unbiased", in statistics, is what language learners would call a "false friend" - a word which looks as if it should mean one thing, but in fact turns out to mean something rather different.

What unbiasedness tells you is that if you have 100 independent measurements of an estimator for an unknown parameter, then the average of those 100 estimates should be close to the true value of the parameter. And this is what your simulation has confirmed.

But that is not the question that comes up in genetics.

In genetics we have /one/ measurement of the estimator, and we want to ask whether that is likely to be close to the true value of the parameter. And the answer (at least in this case) is No. If g_est comes in at 99, the median value of g is actually likely to be substantially more than 99.

I sketched out the maths of why this is the case for Ken back in March of this year.

You can test it by running a slightly different simulation: Repeatedly pick a "true" value of g at random. Evolve the system to produce a g_est. If g_est is not 99 (or whatever value), then forget it. Otherwise, note the value of "true" g which in this case led to getting g_est=99. Keep going until you've accumulated a collection of different possible "true" values of g. Now look at the distribution of g in that collection.

You will find that neither its median nor its mean are 99; in fact they will both almost certainly be substantially larger.
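Jheald's proposed check translates into a simple rejection sampler. A sketch under simplifying assumptions (a uniform draw for the true g, a known common mutation rate, and a tolerance band around the target gest rather than an exact match):

```python
import random
import statistics

def stepwise_gest(g, nm=10, mu=0.0025, rng=random):
    """gest for one simulated pair of haplotypes whose MRCA lived g
    generations ago (symmetric stepwise model, known mutation rate)."""
    asd_total = 0
    for _ in range(nm):
        diff = 0
        for _ in range(2 * g):  # two independent transmission chains
            if rng.random() < mu:
                diff += rng.choice((-1, 1))
        asd_total += diff ** 2
    return (asd_total / nm) / (2 * mu)

def true_g_given_gest(target, tol=25, runs=1500, g_max=400, rng=random):
    """Keep the 'true' g values whose simulated gest lands near target."""
    kept = []
    for _ in range(runs):
        g = rng.randint(1, g_max)
        if abs(stepwise_gest(g, rng=rng) - target) <= tol:
            kept.append(g)
    return kept
```

Because the spread of gest grows with g, the median of the kept g values tends to sit above the target, which is the point being made here.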

Dienekes said...

But we don't have one estimator of the age, we have nm of them, since each Y-STR provides an independent estimate of the age, and we are taking the average of them.

Note that for the case considered in this post, the two estimators you list at the bottom of your post are equivalent, since μ is taken to be the same for all markers. But, I agree that it is better to create an estimator by averaging the nm marker-specific estimates rather than by dividing squared distance with the sum of mutation rates.
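The equivalence noted here is easy to verify numerically. A minimal check with made-up allele values for two hypothetical 5-marker haplotypes:

```python
hta = [14, 13, 30, 24, 11]
htb = [16, 13, 29, 24, 12]
mu = 0.0025  # common per-marker rate

d2 = [(a - b) ** 2 for a, b in zip(hta, htb)]

# Estimator A: average of the nm marker-specific age estimates.
est_a = sum(d / (2 * mu) for d in d2) / len(d2)

# Estimator B: total squared distance over twice the sum of the rates.
est_b = sum(d2) / (2 * mu * len(d2))

print(est_a, est_b)  # identical when mu is the same for every marker
```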

As for the simulation you are proposing, I am not sure what you are getting at. gest is rarely exactly the same as the true age (g), since it's not an integer. I would appreciate it if you elaborate on that point.

Jheald said...

To clarify: by estimator above, I meant a value of gest -- so each single simulated history gives one estimate, not nm of them.

Even though gest is "unbiased", one value of gest is not a very good indicator of g.

As to the simulation, okay so collect all the runs that produce a ''gest'' between 98.5 and 99.5, and then look at the distribution of ''g'' values that gave rise to them. My claim is that the median of that distribution may be nearer perhaps 120 than 100.

dienekesp said...

I see what you mean now. This post addresses P(gest | g) and you are speaking of P(g | gest). From Bayes' rule, P(g | gest) = P(gest | g) P(g) / P(gest).

To derive the posterior distribution, we need an assumption of P(g). But it's not clear what an appropriate distribution of P(g) is. It should be different depending on our prior knowledge (e.g. two random human Y chromosomes vs. two Y chromosomes within a particular haplogroup vs. two Y chromosomes from a single surname).

In general, P(g) will probably be a unimodal shape, with low P(g) for small g (small chance of immediate relatives), then increasing, then decreasing (to 0 for g > 2,000 for example).

So, while E[gest | g] can be calculated under the symmetric stepwise model as a single value, E[g | gest] also depends on the prior belief, which in turn depends on background knowledge (e.g., SNPs) or population history (e.g., early or late expansion of the population will shape the P(g) distribution differently).

So, I agree that E[g | gest] is not necessarily equal to gest, but what it is will depend on the prior chosen. Is there a reason why it would necessarily be greater than gest in general?

Jheald said...

Okay. Let me come back to the "prior" P(g) in a moment. The reason I would expect to see the result is actually in the "Likelihood" part, L(g; gest) = P(gest | g).

Now, you have calculated P(gest | g) as a distribution on gest for various fixed values of g. But what we need for Bayes' formula is to think what the curve looks like as a function of g, for a fixed value of gest. Hence my suggestion to re-run the simulation for lots of values of g, noting the distribution of g for runs that produce a particular value of gest.

Why will that curve tend to have its centre of gravity at a larger value of g than gest?

The answer is because the spread in gest (as you have calculated) increases with g.

So suppose you're looking for gest=200. That's quite a rare outcome from g=100, because when g=100 the s.d you've found for gest equals 71. But it's a more common outcome from g=300, because that's associated with an s.d for gest of 190.

When you plot more values up as a function of g, you thus get a likelihood curve which has more of its weight skewed towards values of g which are greater than the fixed value of gest you're considering.

Of course, for a full Bayesian analysis you then need to multiply this likelihood function L(g;gest) by a prior probability P(g|I).

Bruce Walsh has suggested using a geometric distribution, based on the probability for two lines to coalesce in a randomly-breeding population of fixed size N.
More sophisticated models could also be used (for example, allowing the effective population size to vary with time - see e.g. BATWING, BEAST).

But it's worth noting (1) if the prior is comparatively flat across the area of interest, then the final probability may be predominantly determined by the shape of the likelihood curve - changes in the detail of the prior may make very little difference; (2) we can experiment with the effect of different prior models, to see whether they do make much difference (how sensitive are our conclusions to the details of the prior model?) In any case, we can calculate the likelihood independently of the prior, and it is often a good idea to look at the likelihood and the prior separately, to see how what the data itself tells us compares to what we think we know beforehand.

Of course, if we knew for certain that, say, g<60, and that was reflected in our prior, then that would remain the Bayesian conclusion whatever value of gest we observed. But if our prior is less informative, it is quite likely that the shape of the final probability P(g | gest) will reflect the shape of the likelihood L(g; gest) and therefore somewhat prefer values of g larger than gest.

dienekesp said...

This is what E[g | gest] looks like for a Uniform(1,2500) prior for g. I generated 1 million g's from this prior and noted the corresponding gest's, and then plotted the expected g, given that gest is between 1 and 100 generations, 101 and 200, and so on.
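The experiment can be sketched as follows. This is a simplified, scaled-down reimplementation, not the original million-run setup; nm=10 and the known common mutation rate are assumptions:

```python
import random
from collections import defaultdict

def stepwise_gest(g, nm=10, mu=0.0025, rng=random):
    """gest for one simulated pair under the symmetric stepwise model."""
    asd_total = 0
    for _ in range(nm):
        diff = 0
        for _ in range(2 * g):
            if rng.random() < mu:
                diff += rng.choice((-1, 1))
        asd_total += diff ** 2
    return (asd_total / nm) / (2 * mu)

def posterior_mean_by_bin(runs=20000, g_max=2500, bin_width=100, rng=random):
    """E[g | gest falls in each bin], with g drawn from Uniform(1, g_max)."""
    bins = defaultdict(list)
    for _ in range(runs):
        g = rng.randint(1, g_max)
        b = int(stepwise_gest(g, rng=rng) // bin_width)
        bins[b].append(g)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}
```

Plotting the bin means against gest should show the same qualitative behaviour as the figure: E[g | gest] tracks gest only loosely, and flattens as gest approaches the prior's upper limit.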

Jheald said...

Nice. I think you can now see why the word "unbiased" always makes me dubious!

The fall-off after gest~1200 presumably shows the effect of the hard limit you placed on the possible values of g (i.e., not larger than 2500).

What was the value of nm for these runs? The effect should still be there, but more and more reduced, as nm increases.