Additions to this entry at the bottom (last update July 29)
In recent years, most population genetics papers have used an evolutionary mutation rate for Y chromosome microsatellites (STRs) of 0.00069/locus/generation. This rate was proposed by Zhivotovsky et al. (2004), and defended in Zhivotovsky et al. (2005), and especially in Zhivotovsky, Underhill and Feldman (2006) (henceforth Z.U.F.).
This mutation rate is smaller than the observed germline mutation rate by a factor of 3-4. The germline mutation rate is observed by counting mutations directly, e.g., in father-son pairs, or in known pedigrees. Zhivotovsky et al. have provided two pieces of evidence in favor of their evolutionary rate:
- Study of accumulation of STR variation in populations with known founding events, namely Bulgarian Roma and Maori, in their 2004 paper.
- Simulations indicating a 3.6x discrepancy between the two rates in their 2006 paper, which is due to multiple bottlenecks in a haplogroup's history.
We need to obtain good estimates of the mutation rate in order to pinpoint in time the common ancestor of a set of Y chromosomes. A factor of 3, especially for relatively recent events, may correspond to the difference between early historical and late Paleolithic events. Thus, I decided to look into the matter myself to be convinced, one way or another, of what the evolutionary mutation rate must be.
The following assumptions, following Z.U.F., are made:
- A man has 0, 1, 2, ... sons according to a Poisson process with mean m=1.
- A step mutation (increase or decrease by 1 repeat) occurs with a mutation rate of µ=0.0025 (see note 1).
- STR variance of the man's descendants is measured after g generations.
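The three assumptions above can be sketched as a small forward simulation. This is a minimal sketch in Python with NumPy, not the code used for the results reported in this entry. The function names, the symmetric 50/50 split between gaining and losing a repeat, and the use of the population (not sample) variance are assumptions made for illustration. The `harvest` helper follows the approach described in note 2: keep generating Patriarchs until enough of them have surviving descendants.

```python
import numpy as np

def simulate_patriarch(g, m=1.0, mu=0.0025, rng=None):
    """Follow one Patriarch's male line for g generations.

    Returns (size, variance) of his descendants at generation g,
    or None if the line went extinct before then.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Repeat count relative to the Patriarch's allele -> number of men carrying it.
    counts = {0: 1}
    for _ in range(g):
        nxt = {}
        for allele, n in counts.items():
            # The sum of n independent Poisson(m) draws is Poisson(n * m).
            sons = rng.poisson(n * m)
            if sons == 0:
                continue
            mutants = rng.binomial(sons, mu)   # sons with a one-step mutation
            up = rng.binomial(mutants, 0.5)    # assumed: gains and losses equally likely
            down = mutants - up
            for a, k in ((allele, sons - mutants), (allele + 1, up), (allele - 1, down)):
                if k:
                    nxt[a] = nxt.get(a, 0) + int(k)
        if not nxt:
            return None  # no male descendants left
        counts = nxt
    alleles = np.array(list(counts.keys()), dtype=float)
    weights = np.array(list(counts.values()), dtype=float)
    mean = np.average(alleles, weights=weights)
    variance = np.average((alleles - mean) ** 2, weights=weights)
    return int(weights.sum()), float(variance)

def harvest(n_surviving, g, m=1.0, mu=0.0025, seed=None):
    """Generate Patriarchs until n_surviving of them have descendants at generation g."""
    rng = np.random.default_rng(seed)
    results = []
    while len(results) < n_surviving:
        outcome = simulate_patriarch(g, m=m, mu=mu, rng=rng)
        if outcome is not None:
            results.append(outcome)
    return results
```

For example, `harvest(10_000, 100)` would return 10,000 `(size, variance)` pairs whose variances can then be averaged, either over all lineages or over a subset.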
Patriarch vs. MRCA
A consequence of the time-forward methodology of simulation is that a Patriarch may not be the Most Recent Common Ancestor (MRCA) of his descendants g generations into the future. Trivially, if a Patriarch has only one son, then that son, not the Patriarch, is the MRCA of his descendants. But even if the Patriarch has many sons, and his group of descendants grows, it is possible (due to the randomness of the fathering process) that at some generation only 1 descendant will survive.
Suppose that the Patriarch has lived in generation 0, and the MRCA lived in generation i. Thus, STR variance in the descendants at generation g (today) has accumulated over a time span of g-i generations, since, of course, at the generation i (of the MRCA), STR variance is zero.
Now, if we use a time-forward methodology from known foundation events (e.g. the arrival of the Roma in Bulgaria, or the Maori in New Zealand), it is perfectly right to see how STR variance accumulates from the known foundational event. We would then divide the accumulated STR variance by the known time span to determine an effective evolutionary mutation rate, similar to Zhivotovsky et al. (2004).
But when the foundational event is unknown and we are trying to estimate its age, we can only go as far back as the MRCA, since at his time variance is zero. Therefore, by dividing accumulated variance by the evolutionary mutation rate of Z.U.F., we are over-estimating the time to the MRCA.
For example, with g=100, the average STR variance for the descendants of N=10,000 Patriarchs is 0.0755. But, if we average only those Patriarchs who are also the MRCA of their descendants, we obtain a value of 0.0824, or about 9% higher.
In general, the over-estimate (as a percentage) decreases as g increases: as g increases, the average number of descendants of a Patriarch increases, making them much less susceptible to a variance-reset type of bottleneck described here.
Thus, while the age difference between the MRCA and the Patriarch is real, its effect in the age estimate is not very pronounced. There is, however, a second, and much more serious problem, with the Z.U.F. rates when applied to evolutionary studies.
Prolific vs. Non-Prolific Patriarchs: an Observation Selection effect
Patriarchs starting at generation 0 will have a very variable number of descendants at generation g. By averaging over all of them, we are estimating the average STR variance in the descendants of men who lived g generations ago.
Now, consider how this average changes if we average only over the k most "prolific" men (with the most descendants) out of all the N=10,000 Patriarchs:
It is clear that the STR variance in the descendants of the most "prolific" Patriarchs is much higher than in the descendants of the least "prolific" ones. In fact, for the most prolific Patriarchs, variance accumulates near the germline mutation rate, and not at the lower evolutionary effective rate.
Below is the cumulative percentage of the descendants of the k most prolific Patriarchs, with k from 1 to N.
It can be seen that e.g., from the most prolific half of the Patriarchs stems 84% of the descendants. And this, assuming no social inequality in the number of progeny, i.e. each man having the exact same average probability (m=1) of fathering a son. Thus, in reality, the more prolific Patriarchs may have an even larger fraction of the descendants.
Why is this important? Because, in population studies, scientists are likely to observe (in the finite samples they collect) multiple descendants only of the most prolific of the Patriarchs. For the vast majority of the Patriarchs, who have few descendants, we are likely to sample few or none of their descendants.
This means that there is an inherent observation selection effect in the types of Patriarchs we are likely to study: they are much more likely to be among the prolific ones. Coupling this observation with the knowledge that STR variance in the descendants of prolific Patriarchs accumulates near the germline mutation rate (0.69µ for the 100 most prolific ones in my experiment), we, once again, conclude that the STR variance in haplogroups likely to be made the object of scientific study accumulates near the germline mutation rate, and at the very least, faster than the evolutionary rate of Z.U.F.
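The two summaries used in this section (the share of all descendants stemming from the k most prolific Patriarchs, and the average variance over those k Patriarchs) can be computed from per-lineage sizes and variances as follows. This is a minimal sketch; the function names and the input format (parallel sequences of sizes and variances) are assumptions for illustration.

```python
import numpy as np

def cumulative_descendant_share(sizes):
    """Fraction of all descendants stemming from the k most prolific
    Patriarchs, for k = 1..N (index k-1 holds the value for k)."""
    ordered = np.sort(np.asarray(sizes, dtype=float))[::-1]  # most prolific first
    return np.cumsum(ordered) / ordered.sum()

def mean_variance_of_top_k(sizes, variances, k):
    """Average STR variance over the k Patriarchs with the most descendants."""
    top = np.argsort(sizes)[::-1][:k]
    return float(np.mean(np.asarray(variances, dtype=float)[top]))
```

With three toy lineages of sizes 1, 3 and 6, the most prolific one alone accounts for 60% of the descendants, and the top two for 90%.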
Z.U.F. have also proposed two additional demographic scenaria under which a higher effective mutation rate would be observed:
- A sudden jump in the size of the haplogroup after it appears
- An expanding population (m>1)
Moreover, it is reasonable to assume that in stratified human societies, a few males, (leaders, or conquerors), or groups of closely related males may have generated a disproportionate number of descendants in the short-term.
- The age difference between the Patriarch and the MRCA indicates that Variance/0.00069 overestimates the age of the MRCA somewhat (but not very much).
- A prolific Patriarch's descendants are more likely to be sampled by scientists, and tend to have a higher STR variance. Hence, Variance/0.00069 overestimates the age of the MRCA, perhaps substantially.
- Demographic factors, such as population growth or short-term success by related males, indicate that Variance/0.00069 overestimates the age of the MRCA.
1 Z.U.F. used a germline mutation rate of µ=0.001. For the purposes of simulation, this is not an important difference, as they themselves note. I choose the rate of 0.0025 because it is closer to the actual human germline mutation rate for STRs.
2 Z.U.F. generated 50,000 men and then averaged over the men who had descendants. I, on the other hand, generate as many men as it takes to harvest at least N men with descendants, to ensure that I average a substantially large number of such men.
Editorial change (Jul 22): "exceeds", erroneously written in paragraph 2, changed to "is smaller than".
Update (July 23):
To further elucidate how the observation selection effect may make lineages seem older than they really are, I carried out another small experiment (g=110, N=10,000, m=1).
The age of each group is inferred by dividing the accumulated variance by the evolutionary rate of 0.0006944 (=μ/3.6).
The average variance over all N in this experiment is 0.0867, thus, the average inferred age is 125 generations, close to the truth (110 generations), allowing for the correction in age between the Patriarch and the TMRCA.
However, if we calculated the average variance over ten groups of 1,000 lineages (out of all N=10,000) according to the number of descendants, we see, as described above, that more "prolific" lineages have accumulated more variance, whereas less "prolific" ones have accumulated less variance than the overall average of 0.0867.
Thus, over the 10% most populous lineages (right of the figure), the average inferred age is 209 generations, or a 90% overestimate of the true age!
But, as I mentioned, it is precisely these populous lineages (which don't just have "some" descendants today, but thousands and millions of them) that are likely to be studied, because they are the only ones that have enough representatives in a sample of 100-1,000 men, typically seen in a population study, to allow for an age estimate via a variance calculation.
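The grouping used in this experiment can be sketched as follows: sort the lineages by number of descendants, split them into equal groups, and convert each group's average variance into an inferred age. This is a minimal sketch with assumed function names; only the rate μ/3.6 and the overall variance of 0.0867 come from the text above.

```python
import numpy as np

MU = 0.0025           # germline rate used in the simulations
ZUF_RATE = MU / 3.6   # evolutionary rate, about 0.0006944

def inferred_age(variance, rate=ZUF_RATE):
    """Age in generations implied by a variance at a given rate."""
    return variance / rate

def ages_by_size_group(sizes, variances, n_groups=10, rate=ZUF_RATE):
    """Inferred age per group, from least to most prolific lineages."""
    order = np.argsort(sizes)  # ascending number of descendants
    groups = np.array_split(np.asarray(variances, dtype=float)[order], n_groups)
    return [inferred_age(group.mean(), rate) for group in groups]

# The overall average variance of 0.0867 gives about 125 generations.
overall_age = inferred_age(0.0867)
```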
Update (July 24): Haplogroup sizes
The number of a Patriarch's descendants after g generations is a random variable which depends on the parameters m (the population growth constant), and g, the number of generations.
Scientists typically look at haplogroups with thousands or millions of existing members. Are such haplogroups produced in the types of simulations performed by Z.U.F.?
I estimate the average size of the haplogroups produced by Z.U.F. for different g=10,20,...,700 and m=1.
It is evident that this number increases linearly with g at a rate estimated to be 0.5/generation [This was also noted by Z.U.F. who state: "the average size of the surviving haplogroups increased each generation by a value rapidly approaching 0.5"] However, this means, that the average haplogroup at 700 generations has a size of ~350 men.
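The 0.5/generation slope can also be checked without simulation, using the standard theory of branching processes. With Poisson(m) sons per man, the probability that a line is extinct by generation g is obtained by iterating the offspring generating function exp(m(q-1)) starting from q=0, and the expected number of descendants per Patriarch (extinct lines included) is m^g. Their ratio gives the average size of the surviving haplogroups. The sketch below uses assumed function names and is only a check on the numbers quoted above.

```python
import math

def survival_probability(g, m=1.0):
    """Probability that a Patriarch still has male-line descendants at generation g."""
    q = 0.0  # extinction probability at generation 0
    for _ in range(g):
        q = math.exp(m * (q - 1.0))  # Poisson(m) generating function evaluated at q
    return 1.0 - q

def mean_surviving_size(g, m=1.0):
    """Expected haplogroup size at generation g, given that it has not gone extinct."""
    return m ** g / survival_probability(g, m)

size_700 = mean_surviving_size(700)
slope = (mean_surviving_size(700) - mean_surviving_size(600)) / 100
```

For m=1 this gives a size of roughly 350 at g=700 and a slope very close to 0.5 per generation, in line with both my estimate and the statement quoted from Z.U.F.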
Thus, not only is the average variance estimated by Z.U.F. inappropriate because of an observation selection effect (averaging over small and large haplogroups alike), but it seems to miss the relevant observations altogether, i.e. the really large haplogroups numbering in the hundreds of thousands or millions. Yet it is precisely for such large haplogroups that it has often been used in the literature.
How can we produce "realistic" haplogroup sizes, close to those likely to become an object of scientific study in contemporary human populations? We can either:
- increase the number of initial representatives, i.e. start with many related men with identical Y chromosomes rather than just 1, or we can
- increase the population growth constant m to something higher than 1, i.e. a growing population.
Indeed, Z.U.F. produce some such large haplogroups in some of their simulations (Fig. 1 asterisks, Fig. 2 squares/diamonds), all of which show -predictably- a higher effective rate than their 3.6x slower rate.
They caution against such large haplogroup sizes ["population size exceeds 1 million by generation 1000, which is not realistic for many local tribes."]. Granted, if one looks only at local tribes that never grow to large numbers.
And yet, some or all of the co-authors of Z.U.F. did not limit their use of the 3.6x slower rate to local tribes: Cinnioglu et al. (2004), Sengupta et al. (2006), and King et al. (2008) all apply the 0.00069 rate to populations (and haplogroups) that have grown to much more than 1 million in less time, thus severely overestimating their age.
Update (July 24): Variance of a large haplogroup
Following the previous observations, naturally, I wanted to see for myself what the STR variance of an ancient lineage with a large number of modern descendants actually looks like. My target size is 1,000,000, which is about 20% of modern Greek males.
I consider two cases:
- Expansion commencing in the Late Bronze Age (g=120 or 1,600BC with a generation length of 30)
- Expansion commencing in the early Neolithic (g=300 or 7,000BC)
I harvest N=1,000 haplogroups for each of these cases. I set the growth constant at m=1.100694 for the Bronze Age, and m=1.039122 for the Neolithic. This ensures that enough "large" haplogroups will be generated during simulation. Naturally, the overall population grows at a smaller rate, but the successful lineages will grow much faster than the population average.
Note that I harvest only haplogroups whose MRCA lived in the specified time span. Also, I harvest haplogroups whose final size is between 750,000 and 1,250,000 to match my target size of 1,000,000. Indeed, the average size of the harvested haplogroups is 964,327 for the Bronze Age, and 979,693 for the Neolithic.
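As an arithmetic aside, both growth constants quoted above agree to six decimals with the g-th root of 100,000. In other words, the expected number of descendants per founder, m^g, is about 100,000 in both scenarios. Whether the constants were actually chosen this way is an assumption; the sketch below only checks the arithmetic, and the function name is made up for illustration.

```python
def implied_growth_constant(total_growth, g):
    """Per-generation growth constant m such that m**g equals total_growth."""
    return total_growth ** (1.0 / g)

bronze_m = implied_growth_constant(100_000, 120)     # about 1.100694
neolithic_m = implied_growth_constant(100_000, 300)  # about 1.039122
```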
Here are the results:
- ~1 million descendants of a Bronze Age (120 generations ago) ancestor have an STR variance of 0.269 +/- 0.087
- ~1 million descendants of a Neolithic (300 generations ago) ancestor have an STR variance of 0.629 +/- 0.156
Dividing these variances by the germline mutation rate (μ=0.0025) gives:
- Bronze Age: 107.6 generations, or a 10% underestimate
- Neolithic: 251.6 generations, or a 16% underestimate
Dividing them by the Z.U.F. evolutionary rate (0.00069) gives:
- Bronze Age: 389.9 generations, or a 225% overestimate
- Neolithic: 911.6 generations, or a 203% overestimate
Let's look at some concrete examples of age estimates in the literature, where I compare my own (first) estimates with the published ones. Here is how my estimates are derived:
For a Bronze Age ancestor (g=120) it is: 0.269 =(approx) 0.9 μg
For a Neolithic ancestor (g=300) it is: 0.629 =(approx) 0.84 μg
Thus, the correction multiplier, if the variance is between 0.269 and 0.629 is between 0.84 and 0.9; I will use the midpoint 0.87. If the variance is less than 0.269, then I use 0.9. If the variance is more than 0.629 then I use 0.84. Of course, the correction factor could be expressed more accurately as a function of the variance.
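The rule just described can be written out as a small function. This is a minimal sketch: the function names are made up, the 30-year generation length is the one I prefer, and taking the year 2000 as "the present" when converting to a calendar date is an assumption made only for this illustration.

```python
MU = 0.0025            # germline rate used in the simulations
GENERATION_YEARS = 30  # generation length preferred in this entry
PRESENT_YEAR = 2000    # assumption: reference year for converting to BC dates

def correction_multiplier(variance):
    """Fraction of the germline rate at which variance is taken to accumulate."""
    if variance < 0.269:
        return 0.9
    if variance > 0.629:
        return 0.84
    return 0.87  # midpoint, used anywhere between the two simulated cases

def age_generations(variance):
    """Age in generations: variance / (multiplier * mu)."""
    return variance / (correction_multiplier(variance) * MU)

def age_ky_bc(variance):
    """Age as thousands of years BC under the assumptions above."""
    years_ago = age_generations(variance) * GENERATION_YEARS
    return (years_ago - PRESENT_YEAR) / 1000.0
```

For instance, a variance of 0.26 gives about 116 generations, or roughly 1.5 ky BC. As noted, a smoother version would make the multiplier a continuous function of the variance instead of three fixed values.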
Note that the generation length preferred by these authors is 25, by me it is 30. All ages are ky BC.
Cinnioglu et al. (2004)
In this paper, an evolutionary rate of 0.0007 is used.
E-M78 is dated to 400BC, only a couple of centuries after the historical Greek colonization. E-M78 reaches its maximum in the Peloponese, a major source of Greek colonists.
I-P37 and J-M12 are dated to 1,100BC and 1,200BC, at around the time that e.g. the Phrygians from the Balkans are believed to have migrated to Asia Minor. I-P37 and J-M12 reach their maxima in areas north of Greece where the Phrygians are said to have originated.
Sengupta et al. (2006)
|Haplogroup||Variance||Published age (ky BC)||My estimate (ky BC)|
|R-M17 (upper caste)||0.26||7.3||1.5|
Thus, all the exogenous West Asian lineages in India have post-Neolithic ages, with R-M17 having a suggestive age of 1,500BC coinciding with the suggested date for the Indo-Aryans.
King et al. (2008)
|Haplogroup||Variance||Published age (ky BC)||My estimate (ky BC)|
|J-M12 (Nea Nikomedeia)||0.18||4.7||0.4|
|E-V13 (Lerna Franchthi)||0.25||7.2||1.3|
These are very localized samples, so they should not be interpreted as reflecting expansion times in Greece itself. However, they do suggest a Bronze Age expansion of E-V13 and a much later arrival of E-V13 in Crete.
Note that for Crete, the 1,000,000-haplogroup size assumption is a substantial overestimate, so my age estimates are also substantial underestimates.
Update (July 25): R-M17 in South Siberia
Derenko et al. (2006) "Contrasting patterns of Y-chromosome variation in South Siberian populations from Baikal and Altai-Sayan regions" calculate the variance of R-M17 chromosomes in South Siberia, using the Z.U.F. rate, arriving at an age of 11.3kya corresponding to a value of 0.31. This corresponds to 2,300BC according to my estimate (see previous update).
Recently Bouakaze et al. (Int J Legal Med (2007) 121:493–499) reported the presence of R-M17 chromosomes in ancient inhabitants of South Siberia and the Andronovo culture (2,500BC-1,500BC).
The Andronovo culture is widely believed to be of Eastern European ultimate origin, reflecting the eastward movement of the Kurgan culture, and is associated by some with the ancestors of the Indo-Iranians.
In the Balkans, again in Z.U.F. years, the age of R-M17 is 15.8kya corresponding to variation of 0.44, corresponding to ~4,000BC according to my estimate.
Update (July 25): Baltic Y chromosomes
Lappalainen et al. (2008) use the Z.U.F. rate to estimate the antiquity of lineages in the Baltic region. Dates are ky BC.
1,000BC for I1a in the Baltic region is within the time frame of the emergence of the Germanic people who did experience a strong demographic growth.
1,500BC for N3 shows a rather late time for Finno-Ugrians. However, it must be noted that smaller demographic sizes would impose more drift, and hence a slower accumulation of variance. Therefore, this time is probably underestimated.
1,900BC for R1a1 is consistent with the northern edge of the expansion of R1a1. Once again, reduced variance may also be influenced by smaller population numbers, making this a possible underestimate.
Update (July 25): Southeastern Europe (the Balkans)
Pericic et al. (2005) use the Z.U.F. rate to estimate ages of Y-chromosome lineages in the Balkans. Dates are ky BC.
|J-M241 (without Kosovars)||1||0.8AD|
Thus, Balkan haplogroup I seems related to a Bronze Age origin, with R-M17 being substantially older, and deriving perhaps from northern Balkan Neolithic or alternatively intrusive Kurgan populations. J-M241 seems to be quite young, similar to J-M12 in Nea Nikomedeia (see discussion of King et al. (2008) above).
The young ages of J-M12 and J-M241 also explain the striking inverse correlation between it and J-M410, which makes sense if it expanded later. A fairly late expansion also explains its under-representation in Southern Italy and Anatolia: it appears to be a rather young and "Epirotic" clade that was too late in coming to significantly participate in the historical Greek colonization.
Update (July 26): E3b in Cyprus and Southern Italy
Capelli et al. (2005) [Population Structure in the Mediterranean Basin: A Y Chromosome Perspective] study Y-chromosome variation in many Mediterranean populations including Cyprus. I use a mutation rate of 0.0018 for the six markers used in this study (Quintana-Murci et al. AJHG 68(2) pp. 537 - 542 ). Ages are in ky BC.
I come up with an age of 1.4ky BC for E3b in Cyprus, which is consistent with Mycenaean and later Greek settlements on the island.
I also looked at Southern Italian Y chromosomes. I removed those with values other than (13,12) in (DYS19,DYS388), since these values are universal in Greek E-V13, in order to remove possible contamination from non-E-V13 chromosomes. The resulting age is 900BC, once again very close to the historical Greek colonization of Magna Graecia.
Update (July 26): A more elaborate population growth model
Z.U.F. also propose (Fig. 2 triangles) a more elaborate population growth with:
- m=1.002 before 400 generations
- m=1.012 from 400 to 14 generations ago
- m=1.12 from 14 to 8 generations ago
- m=1.25 from 8 generations ago to current time
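The piecewise schedule above can be expressed as a function of "generations ago" and plugged into a forward simulation in place of a constant m. This is a minimal sketch: the function names are made up, and the treatment of the boundary generations (400, 14 and 8 are assigned to the later, faster-growing phase) is an assumption, since the list does not say which phase they belong to.

```python
def growth_constant_for(generations_ago):
    """Growth constant m for the generation that lies this many generations in the past."""
    if generations_ago > 400:
        return 1.002
    if generations_ago > 14:
        return 1.012
    if generations_ago > 8:
        return 1.12
    return 1.25

def expected_growth(g):
    """Expected descendants per founder after g generations (extinct lines included)."""
    factor = 1.0
    for ago in range(g, 0, -1):  # from the founder's generation down to the present
        factor *= growth_constant_for(ago)
    return factor
```

Note that `expected_growth` averages over all founders, most of whose lines go extinct, so it is much smaller than the average size of the surviving descent groups.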
I ran a simulation (g=1000, N=10,000) with this population growth model. The average size of the descent groups of the MRCAs is 692,982 men. Averaged over all of them, the variance is 1.37.
- With the germline mutation rate, an estimate of 549 generations (45% underestimate)
- With the Z.U.F. evolutionary rate, an estimate of 1,988 generations (99% overestimate)
- With the germline mutation rate: 776, 747, 668 generations
- With the Z.U.F. evolutionary rate: 2,810, 2,707, 2,419 generations
Thus, one can estimate that STR variance since the time of the MRCA accumulates at a rate of ~0.75μ / generation.
And, yet, the 0.00069 rate has been used to date Paleolithic events, e.g., by Semino et al. (2004) [Am. J. Hum. Genet. 74:1023–1034, 2004], leading to general age overestimates.
Update (July 29)
My discussion is continued in Haplogroup sizes and observation selection effects (continued)