Comments on Dienekes’ Anthropology Blog: On the use of average squared distance (ASD) to estimate the time to most recent common ancestor (TMRCA) of a pair of Y-STR haplotypes

Nice. I think you can now see why the word "unbia...

2008-10-13T12:08:00.000+03:00

Nice. I think you can now see why the word "unbiased" always makes me dubious!

The fall-off after gest~1200 presumably shows the effect of the hard limit you placed on the possible values of g. (ie, not larger than 2500).

What was the value of nm for these runs? The effect should still be there, but more and more reduced, as nm increases.

This is what E[g | gest] looks like for a Uniform(...

2008-10-13T00:15:00.000+03:00

This is what E[g | gest] looks like for a Uniform(1,2500) prior for g. I generated 1 million g's from this prior and noted the corresponding gest's, and then plotted the expected g, given that gest is between 1 and 100 generations, 101 and 200, and so on.

Okay. Let me come back to the "prior" P(g) in a m...

2008-10-10T17:09:00.000+03:00

Okay. Let me come back to the "prior" P(g) in a moment. The reason I would expect to see the result is actually in the "Likelihood" part, L(gest;g) = P(gest | g).

Now, you have calculated P(gest | g) as a distribution on gest for various fixed values of g. But what we need for Bayes' formula is to think what the curve looks like as a function of g, for a fixed value of gest. Hence my suggestion to re-run the simulation for lots of values of g, noting the distribution of g for runs that produce a particular value of gest.

Why that curve will tend to have its centre of gravity at a larger value of g than gest?

The answer is because the spread in gest (as you have calculated) increases with g.

So suppose you're looking for gest=200. That's quite a rare outcome from g=100, because when g=100 the s.d you've found for gest equals 71. But it's a more common outcome from g=300, because that's associated with an s.d for gest of 190.

When you plot more values up as a function of g, you thus get a likelihood curve which has more of its weight skewed towards values of g which are greater than the fixed value of gest you're considering.

Of course, for a full Bayesian analysis you then need to multiply this likelihood function L(g;gest) by a prior probability P(g|I).

Bruce Walsh has suggested using a geometric distribution, based on the probability for two lines to coalesce in a randomly-breeding population of fixed size N.
http://nitro.biosci.arizona.edu/ftdna/models.html#Effective
More sophisticated models could also be used (for example, allowing the effective population size to vay with time - see eg BATWING, BEAST).

But it's worth noting (1) if the prior is comparatively flat across the area of interest, then the final probability may be predominantly determined by the shape of the likelihood curve - changes in the detail of the prior may make very little difference; (2) we can experiment with the effect of different prior models, to see whether they do make much difference (how sensitive are our conclusions to the details of the prior model?) In any case, we can calculate the likelihood independently of the prior, and it is often a good idea to look at the likelihood and the prior separately, to see how what the data itself tells us compares to what we think we know beforehand.

Of course, if we knew for certain that say g<60, if that was reflected in our prior, then that would remain the Bayesian conclusion whatever value of gest we observed. But if our prior is less informative, it is quite likely that the shape of final probability p(g|gest) will reflect the shape of the likelihood L(gest;g) and therefore somewhat prefer values of g larger then gest.

I see what you mean now. This post addresses the P...

2008-10-10T14:34:00.000+03:00

I see what you mean now. This post addresses the P(gest | g) and you are speaking of P(g | gest). From Bayes' rule, this is = P(gest | g) P(g) / P(gest)

To derive the posterior distribution, we need an assumption of P(g). But it's not clear what an appropriate distribution of P(g) is. It should be different depending on our prior knowledge (e.g. two random human Y chromosomes vs. two Y chromosomes within a particular haplogroup vs. two Y chromosomes from a single surname).

In general, P(g) will probably be a unimodal shape, with low P(g) for small g (small chance of immediate relatives), then increasing, then decreasing (to 0 for g > 2,000 for example).

So, while E[gest | g] can be calculated under the symmetric stepwise model as a single value, E[g | gest] also depends on the prior belief, which in turn depends on background knowledge (e.g. SNPs) or popultion history (e.g. early or late expansion of the population will shape the P(g) distribution differently).

So, I agree that E[g | gest] is not necessarily equal to gest, but what it is will depend on the prior chosen. Is there a reason why it would necessarily be greater than gest in general?

To clarify: by estimator above, I meant a value o...

2008-10-09T19:32:00.000+03:00

To clarify: by estimator above, I meant a value of gest -- so each single simulated history gives one estimator, not nm of them.

Even though gest is "unbiased", one value of gest is not a very good indicator of g

As to the simulation, okay so collect all the runs that produce a ''gest'' between 98.5 and 99.5, and then look at the distribution of ''g'' values that gave rise to them. My claim is that the median of that distribution may be nearer perhaps 120 than 100.

But we don't have one estimator of the age, we hav...

2008-10-09T19:00:00.000+03:00

But we don't have one estimator of the age, we have nm of them, since each Y-STR provides an independent estimate of the age, and we are taking the average of them.

Note that for the case considered in this post, the two estimators you list at the bottom of your post are equivalent, since μ is taken to be the same for all markers. But, I agree that it is better to create an estimator by averaging the nm marker-specific estimates rather than by dividing squared distance with the sum of mutation rates.

As for the simulation you are proposing, I am not sure what you are getting at. gest is rarely exactly the same as the true age (g), since it's not an integer. I would appreciate it if you elaborate on that point.

Dienekes:You need to be quite careful here. "Unbi...

2008-10-09T18:27:00.000+03:00

Dienekes:

You need to be quite careful here. "Unbiased", in statistics, is a word that in modern languages might be called a "false friend" - i.e. a word which looks as if it should mean one thing, but in fact turns out to mean something rather different.

What unbiasedness tells you is that if you have 100 independent measurements of an estimator for an unknown parameter, then the average of those 100 estimates should be close to the true value of the parameter. And this is what your simulation has confirmed.

But that is not the question that comes up in genetics.

In genetics we have /one/ measurement of the estimator, and we want to ask whether that is likely to be close to the true value of the parameter. And the answer (at least in this case) is No. If g_est comes in at 99, the median value of g is actually likely to be substantially more than 99.

I sketched out the maths of why this is the case for Ken back in March of this year,
http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2008-03/1204658813

You can test it by running a slightly different simulation: Repeatedly pick a "true" value of g at random. Evolve the system to produce a g_est. If g_est is not 99 (or whatever value), then forget it. Otherwise, note the value of "true" g which in this case led to getting g_est=99. Keep going until you've accumulated a collection of different possible "true" values of g. Now look at the distribution of g in that collection.

You will find that neither its median nor its mean are 99; in fact they will both almost certainly be substantially larger.

Some add'l comments: 1. Stochasticity of the muta...

2008-10-06T17:28:00.000+03:00

Some add'l comments: 1. Stochasticity of the mutation process itself: I've observed what I would call a phase effect in that, especially for slow mutators, the time of occurrence is important and if a very slow mutator just occurred, then estimates will be biased to essentially a higher mutation rate for that dys loci. ergo; large data sets with many entries and long TMRCA's will provide more accurate estimates. 2. Inaccurate knowledge of the mutation rate: basically the same comment as above. 3. Inaccurate knowledge of the generation length: this is a nasty issue; certainly, life spans are longer these days but procreation periods for "Patriarchs" lasted longer in the past. I basically think 25 years is too short for some populations. However, for hispanic populations 25 may be more appropriate? 4. Multi-steps mutations: Estimates are they occur at the 5% level and changes up to 4 steps have been observed. This is my strongest argument against ASD, certainly faster mutators will be overestimated as to mutation rate. 5. Lineage or allele-dependent mutation rate: I don't know about lineage but I do know that travel to a different country with a different diet and temperature profile appears to increase the number of mutations. I have also seen different rates for the same dys loci in different haplogroups (388 in I1a and R1b for example). Also, if the rate were fixed at a dys loci, then we could expect that the distribution around the modal would be symmetrical. This is rarely true. I've seen skewness in both directions.

Lastly, models themselves are estimates of the physical process. Is there an underlying plan to all of this? If so, for what purpose were these channel properties we observe created?? Einstein had quite a bit of trouble with quantum mechanics and the apparent roll of the dice in some models of the underlying process??

I do not use ASD, as I have mentioned before. My ...

2008-10-06T13:50:00.000+03:00

I do not use ASD, as I have mentioned before. My biggest concern is multi-steps, which you did not include in your model. My estimates of two known events, the Kerchner Family ancestor and the Clan Gregor founder are more accurate than I would expect?? I use the slower mutators such that mutations outside of +/- 1 are less than 5% of the total number of mutations. I then do a simple count assuming that each mutation, regardless of step size, is one mutation. We have here two issues, precision and accuracy. Accuracy is did I hit the right barn, precision is what is the cluster size around whatever barn I hit. I start with .00069 per gen from Zhiv (so I may be aiming at the wrong barn). I divide that by 30 to get 2.3 X 10^-5 as my average mutation rate over 7 dys loci ZUL used. I then use the rootsweb database of about 22K entries for the first 12 dys loci and count the number of mutations at the same 7 dys loci. I then compute the TMRCA average over these same dys loci. Given that TMRCA and count number from the data set for each dys loci, I compute the mutation rate for each dys loci. I do not have any idea what precision I have at the end of all this, I do know, for two known cases, I have high accuracy. If I use the Chandler rates, I get about 1/3 of TMRCA for these two examples. The key with the counting is to count only unique Transmission events as Charles points out at his website. There are two close cousins in the MacGregor Ian Cam data set, one is 16876 I believe. I use the first 10 dys loci there. In the Kerchner data set I use ASD derived mutation estimates for the faster mutators?