July 21, 2008

How Y-STR variance accumulates: a comment on Zhivotovsky, Underhill and Feldman (2006)

An important erratum for this post.

Additions to this entry at the bottom (last update July 29)

In recent years, in most population genetics papers, an evolutionary mutation rate for Y chromosome microsatellites (STRs) of 0.00069/locus/generation has been used. This rate was proposed by Zhivotovsky et al. (2004) (pdf), and defended in Zhivotovsky et al. (2005), and especially Zhivotovsky, Underhill and Feldman (2006) (henceforth Z.U.F.)

This mutation rate is smaller than the observed germline mutation rate by a factor of 3-4. The germline mutation rate is observed by counting mutations directly, e.g., in father-son pairs, or in known pedigrees. Zhivotovsky et al. have provided two pieces of evidence in favor of their evolutionary rate:
• Study of accumulation of STR variation in populations with known founding events, namely Bulgarian Roma and Maori, in their 2004 paper.
• Simulations indicating a 3.6x discrepancy between the two rates in their 2006 paper, which is due to multiple bottlenecks in a haplogroup's history.
I was always apprehensive about what the "right" mutation rate should be:
We need to obtain good estimates of the mutation rate in order to pinpoint in time the common ancestor of a set of Y chromosomes. A factor of 3, especially for relatively recent events may correspond to a difference between early historical and late Paleolithic events.
Thus, I decided to look into the matter myself to be convinced -one way or another- of what the evolutionary mutation rate must be.

Methodology

The following assumptions, following Z.U.F. are made:
• A man has 0, 1, 2, ... sons, drawn from a Poisson distribution with mean m=1.
• A step mutation (increase or decrease by 1 repeat) occurs with a mutation rate of µ=0.0025.¹
• STR variance of the man's descendants is measured after g generations.
Results are averaged over N men who have descendants after g generations. I will call such men "Patriarchs". Thus, I generate random family trees for men until I have harvested N=10,000 of them who have living descendants today.²
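The assumptions above can be sketched as a small forward simulation. This is only an illustration, not the code behind the numbers in this post: the function names (poisson, simulate_lineage, str_variance, harvest) are hypothetical, STR variance is taken to be the population variance of repeat offsets from the Patriarch's allele (an assumption), and a man-by-man loop like this is practical only for small g and N.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_lineage(g, m=1.0, mu=0.0025, rng=random):
    """Repeat offsets of a man's living descendants after g generations.

    Returns an empty list if the line died out.
    """
    alleles = [0]  # the founder's repeat count is the reference point
    for _ in range(g):
        next_generation = []
        for allele in alleles:
            for _ in range(poisson(m, rng)):      # number of sons
                if rng.random() < mu:             # one-step mutation, up or down
                    next_generation.append(allele + rng.choice((-1, 1)))
                else:
                    next_generation.append(allele)
        alleles = next_generation
        if not alleles:
            break
    return alleles


def str_variance(alleles):
    """Population variance of repeat counts (assumed definition)."""
    n = len(alleles)
    mean = sum(alleles) / n
    return sum((a - mean) ** 2 for a in alleles) / n


def harvest(n_patriarchs, g, m=1.0, mu=0.0025, seed=0):
    """Keep simulating men until n_patriarchs of them have living descendants."""
    rng = random.Random(seed)
    patriarchs = []
    while len(patriarchs) < n_patriarchs:
        alleles = simulate_lineage(g, m, mu, rng)
        if alleles:
            patriarchs.append(alleles)
    return patriarchs
```

For example, harvest(100, 20) returns 100 lists of repeat offsets, one per Patriarch, and the mean of str_variance over those lists plays the role of the average variance discussed below.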

Patriarch vs. MRCA

A consequence of the time-forward methodology of simulation, is that a Patriarch may not be the Most Recent Common Ancestor (MRCA) of his descendants g generations into the future. Trivially, if a Patriarch has only one son, then, that son -not the Patriarch- is the MRCA of his descendants. But, even if the Patriarch has many sons, and his group of descendants grows, it is possible (due to randomness of the fathering process) that at some generation only 1 descendant will survive.

Suppose that the Patriarch has lived in generation 0, and the MRCA lived in generation i. Thus, STR variance in the descendants at generation g (today) has accumulated over a time span of g-i generations, since, of course, at the generation i (of the MRCA), STR variance is zero.
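One way to locate generation i in a simulated family tree is to record, for every man, which man of the previous generation fathered him, and then walk back from the living descendants until their ancestors collapse to a single man. The sketch below assumes that bookkeeping; the names simulate_with_parents and mrca_generation are hypothetical, and mutation is left out because only the genealogy matters here.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_with_parents(g, m=1.0, rng=random):
    """Simulate g generations and return the father links, or None if extinct.

    parents[t][j] is the index, within generation t, of the father of
    man j in generation t + 1. Generation 0 holds only the Patriarch.
    """
    parents = []
    size = 1
    for _ in range(g):
        links = [i for i in range(size) for _ in range(poisson(m, rng))]
        if not links:
            return None
        parents.append(links)
        size = len(links)
    return parents


def mrca_generation(parents):
    """Generation of the most recent common ancestor of the last generation."""
    ancestors = set(range(len(parents[-1])))  # everyone alive today
    for t in range(len(parents) - 1, -1, -1):
        if len(ancestors) == 1:
            return t + 1
        ancestors = {parents[t][j] for j in ancestors}
    return 0  # the Patriarch himself is the MRCA
```

With i = mrca_generation(parents), the variance seen today has had only g - i generations to accumulate.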

Now, if we use a time-forward methodology from known foundation events (e.g. the arrival of the Roma in Bulgaria, or the Maori in New Zealand), it is perfectly right to see how STR variance accumulates from the known foundational event. We would then divide the accumulated STR variance by the known time span to determine an effective evolutionary mutation rate, similar to Zhivotovsky et al. (2004).

But when the foundational event is unknown and we are trying to estimate its age, we can only go as far back as the MRCA, since at his time variance is zero. Therefore, by dividing accumulated variance by the evolutionary mutation rate of Z.U.F., we are over-estimating the time to the MRCA.

For example, with g=100, the average STR variance for the descendants of N=10,000 Patriarchs is 0.0755. But, if we average only those Patriarchs who are also the MRCA of their descendants, we obtain a value of 0.0824, or about 9% higher.

In general, the over-estimate (as a percentage) decreases as g increases: as g increases, the average number of descendants of a Patriarch increases, making them much less susceptible to a variance-reset type of bottleneck described here.

Thus, while the age difference between the MRCA and the Patriarch is real, its effect in the age estimate is not very pronounced. There is, however, a second, and much more serious problem, with the Z.U.F. rates when applied to evolutionary studies.

Prolific vs. Non-Prolific Patriarchs: an Observation Selection effect

Patriarchs starting at generation 0 will have a very variable number of descendants at generation g. By averaging over all of them, we are estimating the average STR variance in the descendants of men who lived g generations ago.

Now, consider how this average changes if we average only over the k most "prolific" men (with the most descendants) out of all the N=10,000 Patriarchs:

k        Average Variance
100      0.1721
1000     0.1407
2500     0.1219
5000     0.1033
10000    0.0755

It is clear that the STR variance in the descendants of the most "prolific" Patriarchs is much higher than in the descendants of the least "prolific" ones. In fact, for the most prolific Patriarchs, variance accumulates near the germline mutation rate, and not at the lower evolutionary effective rate.

Below is the cumulative percentage of the descendants of the k most prolific Patriarchs, with k from 1 to N.

It can be seen that e.g., from the most prolific half of the Patriarchs stems 84% of the descendants. And this, assuming no social inequality in the number of progeny, i.e. each man having the exact same average probability (m=1) of fathering a son. Thus, in reality, the more prolific Patriarchs may have an even larger fraction of the descendants.
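Both quantities in this section (the average variance over the k most prolific Patriarchs, and the share of all descendants that stems from them) are easy to compute once each Patriarch's descendants are available as a list of repeat counts. A minimal sketch, with hypothetical function names and the population variance assumed as the variance definition:

```python
def str_variance(alleles):
    """Population variance of repeat counts (assumed definition)."""
    n = len(alleles)
    mean = sum(alleles) / n
    return sum((a - mean) ** 2 for a in alleles) / n


def mean_variance_of_top_k(lineages, k):
    """Average STR variance over the k Patriarchs with the most descendants."""
    ranked = sorted(lineages, key=len, reverse=True)[:k]
    return sum(str_variance(alleles) for alleles in ranked) / len(ranked)


def descendant_share_of_top_k(lineages, k):
    """Fraction of all living descendants who stem from the k most prolific Patriarchs."""
    sizes = sorted((len(alleles) for alleles in lineages), reverse=True)
    return sum(sizes[:k]) / sum(sizes)
```

Each element of lineages is one Patriarch's list of descendant repeat counts; calling descendant_share_of_top_k for k from 1 to N traces out the cumulative curve described above.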

Why is this important? Because, in population studies, scientists are likely to observe (in the finite samples they collect) multiple descendants only of the most prolific of the Patriarchs. Thus, for the vast majority of the Patriarchs with few descendants, we are likely to sample none, or few, of their descendants.

This means that there is an inherent observation selection effect in the types of Patriarchs we are likely to study: they are much more likely to be among the prolific ones. Coupling this observation with the knowledge that STR variance in the descendants of prolific Patriarchs accumulates near the germline mutation rate (0.69µ for the 100 most prolific ones in my experiment), we, once again, conclude that the STR variance in haplogroups likely to be made the object of scientific study accumulates near the germline mutation rate, and at the very least, faster than the evolutionary rate of Z.U.F.

Closing Remarks

Z.U.F. have also proposed two additional demographic scenaria under which a higher effective mutation rate would be observed:
• A sudden jump in the size of the haplogroup after it appears
• An expanding population (m>1)
Both factors seem reasonable for post-Holocene human populations. It is well known that -whatever temporary setbacks there were- mankind has overall experienced a substantial population growth in recent millennia. Thus, an expanding population seems like a fair assumption.

Moreover, it is reasonable to assume that in stratified human societies, a few males, (leaders, or conquerors), or groups of closely related males may have generated a disproportionate number of descendants in the short-term.

In summary:
• The age difference between the Patriarch and the MRCA indicates that Variance/0.00069 overestimates the age of the MRCA somewhat (but not very much).
• A prolific Patriarch's descendants are more likely to be sampled by scientists, and tend to have a higher STR variance. Hence, Variance/0.00069 overestimates the age of the MRCA, perhaps substantially.
• Demographic factors, such as population growth, or short-term success by related males, indicate that Variance/0.00069 overestimates the age of the MRCA.
In view of the above, and keeping in mind both the stochastic factors that cause STR variance to fluctuate around its expected value, as well as uncertainties in demographic history, I do believe that ages calculated with the evolutionary mutation rate of 0.00069/locus/generation are significantly overestimated.

1 Z.U.F. used a germline mutation rate of µ=0.001. For the purposes of simulation, this is not an important difference, as they themselves note. I choose the rate of 0.0025 because it is closer to the actual human germline mutation rate for STRs.
2 Z.U.F. generated 50,000 men and then averaged over the men who had descendants. I, on the other hand, generate as many men as it takes to harvest at least N men with descendants, to ensure that I average a substantially large number of such men.

Editorial change (Jul 22): erroneously written "exceeds", in paragraph 2, changed to "is smaller than".

Update (July 23):

To further elucidate how the observation selection effect may make lineages seem older than they really are, I carried out another small experiment (g=110, N=10,000, m=1).

The age of each group is inferred by dividing the accumulated variance by the evolutionary rate of 0.0006944 (=μ/3.6).

The average variance over all N in this experiment is 0.0867, thus, the average inferred age is 125 generations, close to the truth (110 generations), allowing for the correction in age between the Patriarch and the TMRCA.

However, if we calculated the average variance over ten groups of 1,000 lineages (out of all N=10,000) according to the number of descendants, we see, as described above, that more "prolific" lineages have accumulated more variance, whereas less "prolific" ones have accumulated less variance than the overall average of 0.0867.

Thus, over the 10% most populous lineages (right of the figure), the average inferred age is 209 generations, or a 90% overestimate of the true age!

But, as I mentioned, it is precisely these populous lineages (which don't just have "some" descendants today, but thousands and millions of them) that are likely to be studied, because they are the only ones that have enough representatives in a sample of 100-1,000 men, typically seen in a population study, to allow for an age estimate via a variance calculation.

Update (July 24): Haplogroup sizes

The number of a Patriarch's descendants after g generations is a random variable which depends on the parameters m (the population growth constant), and g, the number of generations.

Scientists typically look at haplogroups with thousands or millions of existing members. Are such haplogroups produced in the types of simulations performed by Z.U.F.?

I estimate the average size of the haplogroups produced by Z.U.F. for different g=10,20,...,700 and m=1.

It is evident that this number increases linearly with g at a rate estimated to be 0.5/generation [This was also noted by Z.U.F. who state: "the average size of the surviving haplogroups increased each generation by a value rapidly approaching 0.5"] However, this means, that the average haplogroup at 700 generations has a size of ~350 men.
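The 0.5/generation slope can also be checked without any random simulation, assuming sons are Poisson-distributed as in the Methodology. The probability that a line is extinct by generation g follows from iterating the offspring generating function exp(m(s-1)), and since the expected number of descendants over all lines is m^g, the mean size of the surviving lines is m^g divided by the survival probability. The function names below are hypothetical.

```python
import math


def survival_probability(g, m=1.0):
    """Probability that one man's male line is still alive after g generations."""
    extinct = 0.0  # probability of extinction by generation 0
    for _ in range(g):
        # Poisson(m) offspring: generating function f(s) = exp(m * (s - 1))
        extinct = math.exp(m * (extinct - 1.0))
    return 1.0 - extinct


def mean_surviving_size(g, m=1.0):
    """Expected number of living descendants, given that the line survived."""
    return m ** g / survival_probability(g, m)


if __name__ == "__main__":
    for g in (100, 300, 700):
        print(g, round(mean_surviving_size(g), 1))
```

Because this is an exact recursion rather than a simulation, it is cheap to evaluate for any g, which makes it a convenient sanity check on simulated haplogroup sizes.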

Thus, not only is the average variance estimated by Z.U.F. inappropriate because of an observation selection effect (averaging over small and large haplogroups alike), but it seems to miss the relevant observations altogether, i.e. the really large haplogroups numbering in the hundreds of thousands or millions. Yet it is precisely for such large haplogroups that it has often been used in the literature.

How can we produce "realistic" haplogroup sizes, close to those likely to become an object of scientific study in contemporary human populations? We can either:
• increase the number of initial representatives, i.e. start with many related men with identical Y chromosomes rather than just 1, or we can
• increase the population growth constant m to something higher than 1, i.e. a growing population.
Yet, both these changes have the same effect, namely the accumulation of variance at a higher rate than the Z.U.F. rate.

Indeed, Z.U.F. produce some such large haplogroups in some of their simulations (Fig. 1 asterisks, Fig. 2 squares/diamonds), all of which show -predictably- a higher effective rate than their 3.6x slower rate.

They caution against such large haplogroup sizes ["population size exceeds 1 million by generation 1000, which is not realistic for many local tribes."]. Granted, -- if one looks at local tribes never growing to large numbers.

And yet, some or all of the co-authors of Z.U.F. did not limit their use of the 3.6x slower rate to local tribes: Cinnioglu et al. 2004 (pdf), Sengupta et al. (2006), King et al. (2008) all apply the 0.00069 rate for populations (and haplogroups) that have grown to much more than 1 million in less time, thus overestimating severely their age.

Update (July 24): Variance of a large haplogroup

Following the previous observations, naturally, I wanted to see for myself what the STR variance of an ancient lineage with a large number of modern descendants actually looks like. My target size is 1,000,000, which is about 20% of modern Greek males.

I consider two cases:
• Expansion commencing in the Late Bronze Age (g=120 or 1,600BC with a generation length of 30)
• Expansion commencing in the early Neolithic (g=300 or 7,000BC)

I harvest N=1,000 haplogroups for each of these cases. I set the growth constant at m=1.100694 for the Bronze Age, and m=1.039122 for the Neolithic. This ensures that enough "large" haplogroups will be generated during simulation. Naturally, the overall population grows at a smaller rate, but the successful lineages will grow much faster than the population average.

Note that I harvest only haplogroups whose MRCA lived in the specified time span. Also, I harvest haplogroups whose final size is between 750,000 and 1,250,000 to match my target size of 1,000,000. Indeed, the average size of the harvested haplogroups is 964,327 for the Bronze Age, and 979,693 for the Neolithic.

Here are the results:
• ~1 million descendants of a Bronze Age (120 generations ago) ancestor have an STR variance of 0.269 +/- 0.087
• ~1 million descendants of a Neolithic (300 generations ago) ancestor have an STR variance of 0.629 +/- 0.156
If we used the germline mutation rate (μ=0.0025) we would estimate the ages of these haplogroups as:
• Bronze Age: 107.6 generations, or a 10% underestimate
• Neolithic: 251.6 generations, or a 16% underestimate
On the other hand, if we used the evolutionary rate of 0.00069 of Z.U.F., our estimates would be:
• Bronze Age: 389.9 generations, or a 225% overestimate
• Neolithic: 911.6 generations, or a 203% overestimate
It is clear that the Z.U.F. rate of 0.00069 substantially overestimates the ages of large recent haplogroups, whereas the germline rate underestimates them by a little.
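The arithmetic behind these estimates is simply variance divided by the assumed rate, compared against the true number of generations. A short sketch (function names are hypothetical; the variances and rates are the ones quoted above):

```python
GERMLINE_RATE = 0.0025   # per locus per generation, as used in this post
ZUF_RATE = 0.00069       # evolutionary rate of Z.U.F.


def inferred_age(variance, rate):
    """Age in generations implied by a variance under a given mutation rate."""
    return variance / rate


def percent_error(estimate, true_age):
    """Signed error of an age estimate, as a percentage of the true age."""
    return 100.0 * (estimate - true_age) / true_age


if __name__ == "__main__":
    cases = [("Bronze Age", 0.269, 120), ("Neolithic", 0.629, 300)]
    rates = [("germline", GERMLINE_RATE), ("Z.U.F.", ZUF_RATE)]
    for label, variance, true_g in cases:
        for rate_name, rate in rates:
            age = inferred_age(variance, rate)
            error = percent_error(age, true_g)
            print(f"{label}, {rate_name}: {age:.1f} generations ({error:+.0f}%)")
```

A negative percentage is an underestimate and a positive one an overestimate.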

Let's look at some concrete examples of age estimates in the literature, where I compare my own (first) estimates with the published ones. Here is how my estimates are derived:

For a Bronze Age ancestor (g=120) it is: 0.269 =(approx) 0.9 μg

For a Neolithic ancestor (g=300) it is: 0.629 =(approx) 0.84 μg

Thus, the correction multiplier, if the variance is between 0.269 and 0.629 is between 0.84 and 0.9; I will use the midpoint 0.87. If the variance is less than 0.269, then I use 0.9. If the variance is more than 0.629 then I use 0.84. Of course, the correction factor could be expressed more accurately as a function of the variance.

Note that the generation length preferred by these authors is 25, by me it is 30. All ages are ky BC.
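The rule just described can be written down as a small function. This is a sketch under stated assumptions: the function names are hypothetical, the reference year of 2008 used to turn years-before-present into a calendar year is my assumption, and the missing year zero is ignored. Since the tabulated variances are rounded to two decimals, results may differ slightly from the figures in the tables.

```python
def correction_factor(variance):
    """Multiplier on the germline rate, chosen by the observed variance."""
    if variance < 0.269:
        return 0.9
    if variance > 0.629:
        return 0.84
    return 0.87  # midpoint, used between the two simulated cases


def corrected_age_generations(variance, mu=0.0025):
    """Age in generations: variance / (factor * mu)."""
    return variance / (correction_factor(variance) * mu)


def calendar_year(variance, generation_years=30, present_year=2008, mu=0.0025):
    """Approximate calendar year of the MRCA; negative values are BC."""
    years_ago = corrected_age_generations(variance, mu) * generation_years
    return present_year - years_ago
```

For example, a variance of 0.18 gives 80 generations, or 2,400 years at 30 years per generation, which lands at roughly 400BC.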

Cinnioglu et al. (2004)

In this paper, an evolutionary rate of 0.0007 is used.

Haplogroup   Variance   Cinnioglu   Dienekes
E-M78        0.18       4.4         0.4
G-P15        0.35       10.5        2.9
I-P37        0.23       6.2         1.1
J-M12        0.24       6.6         1.2
J-M67        0.33       9.8         2.6
R-M269       0.33       9.8         2.6

E-M78 is dated to 400BC, only a couple of centuries after the historical Greek colonization. E-M78 reaches its maximum in the Peloponnese, a major source of Greek colonists.

I-P37 and J-M12 are dated to 1,100BC and 1,200BC, at around the time that e.g. the Phrygians from the Balkans are believed to have migrated to Asia Minor. I-P37 and J-M12 reach their maxima in areas north of Greece where the Phrygians are said to have originated.

Sengupta et al. (2006)

Haplogroup           Variance   Sengupta   Dienekes
J2-M410              0.38       11.7       3.3
R-M17                0.39       12         3.4
R-M17 (upper caste)  0.26       7.3        1.5
G-P15                0.29       8.5        2
J-M241               0.38       11.8       3.3

Thus, all the exogenous West Asian lineages in India have post-Neolithic ages, with R-M17 having a suggestive age of 1,500BC coinciding with the suggested date for the Indo-Aryans.

King et al. (2008)

Haplogroup (sample)        Variance   King   Dienekes
J-M12 (Nea Nikomedeia)     0.18       4.7    0.4
E-V13 (Sesklo/Dimini)      0.24       6.6    1.2
E-V13 (Lerna Franchthi)    0.25       7.2    1.3
J-M92 (Crete)              0.14       3.1    0.1 AD
J-M319 (Crete)             0.14       3.1    0.1 AD
E-V13 (Crete)              0.09       1.1    0.8 AD

These are very localized samples, so they should not be interpreted as reflecting expansion times in Greece itself. However, they do suggest a Bronze Age expansion of E-V13 and a much later arrival of E-V13 in Crete.

Note that for Crete, the 1,000,000-haplogroup size assumption is a substantial overestimate, so my age estimates are also substantial underestimates.

Update (July 25): R-M17 in South Siberia

Derenko et al. (2006) "Contrasting patterns of Y-chromosome variation in South Siberian populations from Baikal and Altai-Sayan regions" calculate the variance of R-M17 chromosomes in South Siberia, using the Z.U.F. rate, arriving at an age of 11.3kya, corresponding to a variance of 0.31. This corresponds to 2,300BC according to my estimate (see previous update).

Recently Bouakaze et al. (Int J Legal Med (2007) 121:493–499) reported the presence of R-M17 chromosomes in ancient inhabitants of South Siberia and the Andronovo culture (2,500BC-1,500BC).

The Andronovo culture is widely believed to be of Eastern European ultimate origin, reflecting the eastward movement of the Kurgan culture, and is associated by some with the ancestors of the Indo-Iranians.

In the Balkans, again in Z.U.F. years, the age of R-M17 is 15.8kya, corresponding to a variance of 0.44, or ~4,000BC according to my estimate.

Update (July 25): Baltic Y chromosomes

Lappalainen et al. (2008) use the Z.U.F. rate to estimate the antiquity of lineages in the Baltic region. Dates are ky BC.

Haplogroup   Lappalainen   Dienekes
I1a          5.7           1
N3           6.8           1.5
R1a1         8.7           1.9

1,000BC for I1a in the Baltic region is within the time frame of the emergence of the Germanic people who did experience a strong demographic growth.
1,500BC for N3 shows a rather late time for Finno-Ugrians. However, it must be noted that smaller demographic sizes would impose more drift, and hence a slower accumulation of variance. Therefore, this time is probably underestimated.
1,900BC for R1a1 is consistent with the northern edge of the expansion of R1a1. Once again, reduced variance may also be influenced by smaller population numbers, making this a possible underestimate.

Update (July 25): Southeastern Europe (the Balkans)

Pericic et al. (2005) use the Z.U.F. rate to estimate ages of Y-chromosome lineages in the Balkans. Dates are ky BC.

Haplogroup                  Pericic   Dienekes
I1b* (xM26)                 8.1       2
E3b1α                       5.3       0.9
R-M17                       13.8      3.8
R-M269                      9.6       2.3
J-M241 (without Kosovars)   1         0.8 AD

Thus, Balkan haplogroup I seems related to a Bronze Age origin, with R-M17 being substantially older, and deriving perhaps from northern Balkan Neolithic or alternatively intrusive Kurgan populations. J-M241 seems to be quite young, similar to J-M12 in Nea Nikomedeia (see discussion of King et al. (2008) above).

The young ages of J-M12 and J-M241 also explain the striking inverse correlation between it and J-M410, which makes sense if it expanded later. A fairly late expansion also explains its under-representation in Southern Italy and Anatolia: it appears to be a rather young and "Epirotic" clade that was too late in coming to significantly participate in the historical Greek colonization.

Update (July 26): E3b in Cyprus and Southern Italy

Capelli et al. (2005) [Population Structure in the Mediterranean Basin: A Y Chromosome Perspective] study Y-chromosome variation in many Mediterranean populations including Cyprus. I use a mutation rate of 0.0018 for the six markers used in this study (Quintana-Murci et al. AJHG 68(2) pp. 537 - 542 ). Ages are in ky BC.

I come up with an age of 1.4ky BC for E3b in Cyprus, which is consistent with Mycenaean and later Greek settlements on the island.

I also looked at Southern Italian Y chromosomes. I removed those with values other than (13,12) in (DYS19,DYS388), since these are universal in Greek E-V13, in order to remove possible contamination from non E-V13 chromosomes. The resulting age is 900BC, once again very close to the historical Greek colonization of Magna Graecia.

Update (July 26): A more elaborate population growth model

Z.U.F. also propose (Fig. 2 triangles) a more elaborate population growth with:
• m=1.002 before 400 generations ago
• m=1.012 from 400 to 14 generations ago
• m=1.12 from 14 to 8 generations ago
• m=1.25 from 8 generations ago to current time
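To plug this schedule into a forward simulation, the growth constant has to be looked up generation by generation. The sketch below is one way to encode it; the function names are hypothetical, and how the boundary generations (400, 14 and 8 generations ago) are assigned is my assumption, since the list above does not say which side they fall on.

```python
def growth_constant(generations_ago):
    """Growth constant m for the generation born `generations_ago` before today."""
    if generations_ago > 400:
        return 1.002
    if generations_ago > 14:   # 400 down to 15 generations ago
        return 1.012
    if generations_ago > 8:    # 14 down to 9 generations ago
        return 1.12
    return 1.25                # the last 8 generations


def growth_schedule(g):
    """Growth constants for g simulated generations, oldest first."""
    return [growth_constant(t) for t in range(g, 0, -1)]
```

A simulation of g=1000 generations would then draw each man's sons with mean growth_schedule(1000)[t] in simulated generation t.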

I ran a simulation (g=1000, N=10,000) with this population growth model. The average size of the descent groups of the MRCAs is 692,982 men. Averaged over all of them, the variance is 1.37.
• With the germline mutation rate, an estimate of 549 generations (45% underestimate)
• With the Z.U.F. evolutionary rate, an estimate of 1,988 generations (99% overestimate)
If we limit ourselves only to the 10, 1000, 5000 most prolific MRCAs (out of the N=10,000), we obtain ages (respectively):
• With the germline mutation rate: 776, 747, 668 generations
• With the Z.U.F. evolutionary rate: 2,810, 2,707, 2,419 generations

Thus, one can estimate that STR variance since the time of the MRCA accumulates at a rate of ~0.75μ / generation.

And, yet, the 0.00069 rate has been used to date Paleolithic events, e.g., by Semino et al. (2004) [Am. J. Hum. Genet. 74:1023–1034, 2004], leading to general age overestimates.

Update (July 29)

My discussion is continued in Haplogroup sizes and observation selection effects (continued)

Maju said...

I have a question. First you say:

This [ZUF] mutation rate exceeds the observed germline mutation rate by a factor of 3-4.

This means that ZUF are dating clades 3-4 times more recently than they would by the germline MR, right? If the mutation rate is smaller then age is larger, obviously, as each mutation would need more statistical time to happen.

But then you conclude that:

I do believe that ages calculated with the evolutionary mutation rate of 0.00069/locus/generation are significantly overestimated.

I was following your post pretty well (I believe) until I stumbled upon such conclusions, which were exactly the opposite of what I was understanding from your logic.

I think the inflexion point is when you say:

we, once again, conclude that the STR variance in haplogroups likely to be made the object of scientific study accumulates near the germline mutation rate, and at the very least, faster than the evolutionary rate of Z.U.F.

Slower, right? Smaller mutation rate: longer time per mutation (slower): larger clade ages.

What am I getting wrong in all that? If I am...

Dienekes said...

You're right, the first sentence should read "is smaller". The conclusions are the same.

pconroy said...

So Dienekes, correct me if I'm wrong, you're saying that the TMRCA of a given haplogroup may be overestimated by a factor of at least 3?

So this would mean that the TMRCA for R-M222 may not be on the order of 1,500 years ago, but 500 years ago?

Or R1b may not be 10,000 years ago, but 3,000 years ago?

Dienekes said...

I wouldn't jump to any conclusions about particular haplogroups at this point. In general, population history, haplogroup substructure, and possible constraints on the mutation model make it (in my mind) a very risky business to estimate the age of old and widespread haplogroups using STRs.

At this point, all I can say is that whenever the 0.00069/locus/generation rate is used in a paper, I would be very skeptical of any attempted historical/archaeological correlations.

For example Cruciani et al. (2007) "Tracing Past Human Male Movements in Northern/Eastern Africa and Western Eurasia: New Clues from Y-Chromosomal Haplogroups E-M78 and J-M12" say that E-V13 marks a Bronze Age expansion in SE Europe, whereas King et al. (2008) "Differential Y-chromosome Anatolian Influences on the Greek and Cretan Neolithic" place E-V13 in the Mesolithic-Neolithic.

The difference between the two is that King et al. (2008) uses the "evolutionary rate", whereas Cruciani et al. (2007) amend it upwards somewhat to arrive at the later (Bronze Age) dates.

arborist said...

The time at which all the different non-African haplogroups coalesce has been estimated to be about 50,000 years corresponding to the Out-of-Africa event. With the higher mutation rate proposed by Dienekes, this would have been only 17,000 years ago.

dienekesp said...

There are several reasons why this is not the case: first, the correction factor seems to diminish as _g_ increases (0.9 at 120 generations, 0.84 at 300 generations).

Second, before the adoption of the farming economy, population sizes were small, hence there was more genetic drift removing variance, and the large haplogroup assumption did not hold.

Third, the stepwise mutation model may not scale as well for large _g_ since there are chemical constraints on the number of repeats that an STR can have. It has been observed, e.g., that low repeat scores tend to "freeze", mutating more slowly, while high repeat scores tend to back-mutate to a lower value more frequently.

In view of the above, I would say that the quoted figure of 17,000 years is not valid. Indeed, I would not -and did not- venture to use STR variance to date any pre-Neolithic events.

Finally, note that it is possible for haplogroups to exist long before the MRCAs within haplogroups.

Maju said...

In view of the above, I would say that the quoted figure of 17,000 years is not valid. Indeed, I would not -and did not- venture to use STR variance to date any pre-Neolithic events.

But that basically means you can measure nothing of relevance with them. Obviously population has been growing globally since Neolithic, there have been no major changes other than some migrational spread since then.

Dienekes said...

There have been no major changes since the Neolithic?

Maju said...

I wouldn't say that with those words but it has been mostly expansion everywhere. Sure you can identify a handful of migrations in the Y-DNA (Bantu, Indo-Europeans) but it's not like there has been loads of drift (and therefore fixation) since then.

Hope you understand what I mean now.

Ebizur said...

Dienekes wrote,

"The young ages of J-M12 and J-M241 also explain the striking inverse correlation between it and J-M410, which makes sense if it expanded later. A fairly late expansion also explains its under-representation in Southern Italy and Anatolia: it appears to be a rather young and "Epirotic" clade that was too late in coming to significantly participate in the historical Greek colonization."

I've enjoyed reading about your conversions from Zhivotovsky's hypothetical "evolutionary mutation rates" to observed "pedigree mutation rates," and I don't mean to sound too critical of your estimates, but how would you explain the significant presence of haplogroup J-M241 in India and Nepal if it is an "Epirotic" clade that has expanded so recently? Something having to do with Alexander's invasion or the Yavanas? But Alexander was a Macedonian, and "Yavana," in its strictest sense, is supposed to refer to Ionians. Within Europe, haplogroup J-M241 is presently found most commonly in Albania, is it not? What historical movement could have brought haplogroup J-M241 to Albania and Nepal in such great quantities at a date so recent as that which you have proposed for this clade's expansion?

Dienekes said...

This is the dating of J-M241 _in the studied populations_ and not the dating of J-M241 in general.

The MRCA of J-M241 in the Balkans isn't the same as the MRCA of J-M241 in India. Both of them are descended from a common ancestor (the M241 guy), but I am not dating _him_.

Ebizur said...

dienekes wrote,

"This is the dating of J-M241 _in the studied populations_ and not the dating of J-M241 in general.

The MRCA of J-M241 in the Balkans isn't the same as the MRCA of J-M241 in India. Both of them are descended from a common ancestor (the M241 guy), but I am not dating _him_."

Then you are presuming that J-M241 had produced several descendants, at least one of which went to northern Greece (and Albania, etc.) and produced prolific progeny in that region, and at least one other of which went to northern India/Nepal and whose descendants experienced a separate period of population growth there. Where is the location of haplogroup J-M241's ultimate origin? Is the origin of J-M241 as a whole even determinable if one allows the sort of scenario that you are describing?

If we start accepting the idea that a clade could have expanded separately in two distant locations, aren't we opening the floodgates to ideas like "R1b in Iberia is Basque, but R1b in Armenia is Indo-European" and "Neanderthal DNA did introgress into European Homo sapiens sapiens, but archaic Y-DNA and mtDNA lineages have been lost due to genetic drift"?

Dienekes said...

The ultimate origin of J-M12 is usually placed in the Near East, however there is little of J-M12 there. I wouldn't speculate on this point.

aren't we opening the floodgates to ideas like "R1b in Iberia is Basque, but R1b in Armenia is Indo-European"

A haplogroup need not have expanded in only one place. J1, for example, multiplied among Arabians and Northeast Caucasian speakers.

terryt said...

Ebizur asked: "aren't we opening the floodgates to ideas like 'R1b in Iberia is Basque, but R1b in Armenia is Indo-European' and 'Neanderthal DNA did introgress into European Homo sapiens sapiens, but archaic Y-DNA and mtDNA lineages have been lost due to genetic drift'?"

So what precisely do you see to be the problem with those ideas?

Ebizur said...

"So what precisely do you see to be the problem with those ideas?"

I just think that it would negate the usefulness of Y-DNA and mtDNA testing for the purpose of identifying one's "deep ancestry." If we allow a single haplogroup to have evolved in two widely separated, genetically distinct (in terms of their autosomes) populations, then it greatly reduces the significance of the determination that an individual belongs to that haplogroup.

For example, if R1b is allowed to be both Basque and Indo-European in origin, then the discriminatory power, or scientific significance, of the "haplogroup R1b" designation is diminished, because the majority of the DNA of the average member of these populations (the autosomes) could have evolved independently and may be greatly dissimilar to each other. A haplogroup is much more meaningful if it indicates a shared evolutionary origin, i.e. an origin in a particular endogamous group that has developed autosomal genetic distinctiveness to match the Y-DNA or mtDNA "signature."

Besides, allowing so many separate expansion events (e.g. hypothesizing that haplogroup J2b1-M241 expanded to become a major haplogroup in at least two separate populations in two widely separated regions) makes the possibility of "hidden archaic admixture" that much more plausible, because we can't know what other haplogroup(s) might have been lurking in one area or the other when J2b1-M241 managed to gain a foothold and expand there.

Maju said...

I don't understand well how this last bit of discussion arose but it's clear that it's statistically nigh-on-impossible that an SNP arose twice in human biological history. When there is more than one defining SNP, then the situation is even more clear. Hence R1b means one single common male ancestor via purely paternal line for all carriers of that haplogroup.

But that doesn't mean that, after the clade arose, different descendants could not have got different ethnic identities, lived in different geographic locations or whatever. A haplogroup is not the same as ethnicity. In theory there is no objection to some R1b being Basque and some R1b being Indo-European, no matter they share the same common ancestor. In fact all ethnicities have their own salad of haplogroups and related haplos are found among very different ethnic groups.

Ebizur said...

I started this last bit of discussion because Dienekes had estimated the age of STR variation of J-M241 in the data of Pericic et al. 2005 (without Kosovars) on Balkan populations (I think they were mostly populations of the former Yugoslavia) to be 800 AD, while estimating the age of STR variation of J-M241 in the Indian data of Sengupta et al. 2006 to be 3,300 BC. These dates could be taken to suggest that haplogroup J2b1-M241 actually evolved among the ancestors of the Indians, and the presence of this haplogroup in a large percentage of modern males of populations in the Balkans does not really indicate anything significant about the ancient ancestry or genetic affinities of these Balkan populations (i.e., the shared presence of J-M241 between the Balkans and India does not indicate that there is any genetic connection between these two regions except maybe one male ancestor who very recently introduced J-M241 into the Balkans).

The shared presence of a haplogroup in a large percentage of the members of any two or more populations is most significant if it indicates a shared period of evolution (both genetic and cultural) of those populations. Otherwise, haplogroups are really quite meaningless except on an individual scale for the purpose of genealogy (and even then, they can only help one find genealogical matches in the direct paternal or direct maternal line).

Maju said...

"The shared presence of a haplogroup in a large percentage of the members of any two or more populations is most significant if it indicates a shared period of evolution (both genetic and cultural) of those populations. Otherwise, haplogroups are really quite meaningless except on an individual scale for the purpose of genealogy (and even then, they can only help one find genealogical matches in the direct paternal or direct maternal line)."

It depends. It especially depends on timelines. It's not the same if the haplogroup is 500, 5000 or 50,000 years old. The degree of cultural connection would obviously be much different. Also, especially for Paleolithic events, some "random" founder effect or fixation via drift becomes much more likely, making the cultural connection even less clear.

terryt said...

"If we allow a single haplogroup to have evolved in two widely separated, genetically distinct ... populations". As Maju says, that's impossible. However descendants of the common ancestor may have become widely spread before becoming sort of fixed in two separate populations, which is what appears to have happened in the examples mentioned here.

"the discriminatory power, or scientific significance, of the 'haplogroup R1b' designation is diminished, because the majority of the DNA of the average member of these populations (the autosomes) could have evolved independently and may be greatly dissimilar to each other". I have maintained that position all along. As Maju also says, "all ethnicities have their own salad of haplogroups" so a single haplogroup cannot be used as a marker for a particular ethnicity. Human groups have been mixing for as long as we have historical records (with occasional episodes of genocide) and probably were doing so long before we have these records.

Ebizur said...

terryt said,

"'If we allow a single haplogroup to have evolved in two widely separated, genetically distinct ... populations.' As Maju says, that's impossible. However descendants of the common ancestor may have become widely spread before becoming sort of fixed in two separate populations, which is what appears to have happened in the examples mentioned here."

It is most definitely not impossible; in fact, it is exactly what Dienekes and others have argued for, that a particular haplogroup could have diversified and increased its proportion of the population in two or more separate regions at two or more different times. Notice that I used the word "evolved" (i.e. change and development; diversification) rather than "originated" (i.e. occurrence of the original mutation in the haplogroup's patriarch/founder; first appearance of the relevant SNP).

If we find evidence for this sort of scenario in the history of a particular haplogroup at the current level of phylogenetic resolution, then it is imperative that more detailed research be done to determine markers for subclades that developed in each of the distinct populations among which this haplogroup has become common, so that belonging to a particular subclade of that haplogroup will actually have some significance in relation to the overall genetic/ethnic background of an individual testee.

terryt said...

Exactly. Haplogroup distribution shows that their members have been moving around the earth considerably. I maintain it's reasonable to assume such movement goes back even beyond the evolution of our modern haplogroups.

R said...

J2b2-M241

Aromuns (Albania, Macedonia, Romania) ..... 3.5 (1500BC) // E Bosch et al 2006
Europe .... 3.2 (1200BC) // ftdna, ysearch
India ..... 5.7 (3700BC) // Sengupta et al 2006
India ..... 4.9 (2900BC) // Jobling et al. 2006
Nepal ..... 4.2 (2200BC) // Jobling et al. 2006
South Asia (India, Malay, Nepal, Sri Lanka, Afghan) .... 5.6 (3600BC) // Jobling et al. 2006

* Klyesov's formula + Regular formula

McG said...

I have been studying TMRCAs using Zhiv's equation. I find that for short-term family relationships it works well, e.g. the Kerchner family (350 years old) and Clan Gregor (1300 AD). I have also estimated TMRCAs for older sets of data and generally find my dates to be 2 to 3X those of Ken Nordtvedt, who uses Chandler's rates. Note that Chandler's estimates use many haplogroups, which I believe is incorrect. It appears that modal values and rates at some dys loci change over time. For a long time I used ASD to compute TMRCAs. I now feel it overcounts mutations at a dys locus by as much as a factor of 3 compared with simpler counting algorithms in which multi-step mutations are taken into account. I make no assumptions about population growth. I have read, but not closely studied, your blog post on this subject; I have found in the past that the assumptions are critical. I would like your first impression of my comments.

Dienekes said...

>> I find that for short term family relationships it works well, e.g. Kerchner family (350 years old), Clan Gregor (1300AD).

Two questions: are the founders of these clans also the MRCAs of their descendants? In other words, did they both have at least two sons with patrilineal descendants?

Also, how many patrilineal descendants are there (if such estimates exist).

McG said...

1st question: I believe that the convergence is not to a man but to a haplotype. The Jefferson case illustrated that. In the Kerchner case, I estimate that Adam is the founder, not Frederick. I believe that Frederick may have been an only son, but he had three sons, I believe. There are now 10 patrilineal descendants. Google Charles Kerchner and you'll find his site and analysis, all laid out very carefully. Please note the distinction between Unique Transmission events and transmittals.

2. In Clan Gregor, there are currently over 60 "genetic" MacGregors, descendants of the founder. The clan chieftain is hereditary and doesn't necessarily depict the actual number of scions. I think there is only one first descendant but I'll check and correct if I'm wrong.

Tuuli Lappalainen said...

Very interesting results, although I still need to think about your analysis in a bit more detail before I dare to say whether I agree with you. I hope you'll publish these soon.

I just wanted to comment on your recalculation of our dates for the Baltic region (Lappalainen et al. 2008), and the historical connections of N3. Actually a few linguists have recently suggested that the Uralic expansion may be as recent as 2000 BC, which would agree with your dates very nicely. It would indeed be quite interesting if BOTH linguistic and genetic dates become independently updated...

Dienekes said...

>> Actually a few linguists have recently suggested that the Uralic expansion may be as recent as 2000 BC, which would agree with your dates very nicely.

That is very interesting, this is a related link that someone sent me recently.

>> I hope you'll publish these soon.

I don't have plans to publish my results. I will, however, release my source code once I've tidied it up a bit and written some basic instructions, so that others can check its accuracy and repeat my calculations for any set of parameters they choose.

McG said...

I would like to begin a detailed discussion of the ZUL derivation and especially its assumptions. I'm not sure if this is the right forum - but I don't know where else to go. Some of my comments may appear trivial and can be immediately dismissed, but I want to make sure that ZUL's model is correct, let alone his effective evolutionary rate.

First, Population. The population he models is a segment of the total population: the Y chromosome population that produces new Y chromosomes, which may or may not have a mutation. This excludes all females, all non-reproductive Y chromosomes (for any reason, e.g. all girls), etc. So, when we're talking about migrations of Gypsies, Maoris etc., we are only really discussing the Y chromosomes that produce new Y chromosomes.

Given these intrinsic assumptions, the first major assumption in the analysis is that the mean number of sons produced by a Y chromosome is 1. I can present two examples where this is not true. Now, I'm not blind to statistical assumptions and the fact that you are working with 10K Patriarchs and my examples may just be extremes within that population, but here goes anyway. First, the well documented Kerchner family which the Chandler rates don't model. I estimate that m = 1.6+ for this population. Second, my own heritage which I have good documentation for back to 1650 AD. I am the 11th generation and the average number of sons born down my line is 2.2+.

We may be looking at a little bit of the "Genghis Khan" phenomenon, in which some 5 to 6% of the Asian population is attributable to him??? I don't know. The point is: what is the "scientific" basis for assuming that a Poisson process with a mean of 1 models Y chromosome reproduction down a male line???

McG said...

Addendum to my previous comment: m = 1 is the least m can be if the population being studied is the successful Y chromosomes that reproduce - by definition. For m less than 1 and approaching zero, we have to include a much larger Y Chromosome population???

Dienekes said...

>> the first major assumption in the analysis is that the mean number of sons produced by a Y chromosome is 1.

The mean number of sons is 1 in a constant-sized population.

However, the mean number of sons _given that the line has survived to the present_ is not 1. These are the men we are interested in.

For example, in the beginning generation, each man has one son on average. But, over men who do have at least one son, the average number of sons is ~1.58.

And, if we are actually interested in men who are also MRCAs, then in the first generation they have ~2.39 sons on average (because they must have at least 2 sons if they are MRCAs).

So, while men have 1 son on average, if you look at the family trees of the men who have patrilineal descendants in the present, you will not see 1 son on average, but more.

Moreover, people are usually interested in the big family trees, because these are more noticeable. Thus, if you look at men who don't just have patrilineal descendants today but a lot of them, you will find an even higher average number of sons.
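The two averages quoted above can be checked directly from the Poisson distribution. The following is a minimal sketch, assuming sons are Poisson-distributed with mean m=1 as in the post; the function name `truncated_poisson_mean` is made up for illustration. The second figure conditions only on a man having at least two sons, which is the simplification that reproduces the ~2.39 quoted above; it does not additionally require that both sons' lines survive.

```python
import math

def truncated_poisson_mean(m: float, k_min: int) -> float:
    """Return E[X | X >= k_min] for X ~ Poisson(m)."""
    # Probabilities P(X = k) for the excluded values k = 0 .. k_min - 1
    below = [math.exp(-m) * m**k / math.factorial(k) for k in range(k_min)]
    p_at_least = 1.0 - sum(below)
    # E[X] = m; remove the part of the mean contributed by the excluded values
    partial_mean = m - sum(k * p for k, p in enumerate(below))
    return partial_mean / p_at_least

# Men with at least one son
print(round(truncated_poisson_mean(1.0, 1), 2))  # 1.58
# Men with at least two sons (a necessary condition for being an MRCA)
print(round(truncated_poisson_mean(1.0, 2), 2))  # 2.39
```

Raising `k_min` further shows the same effect for larger family trees: the more sons one requires, the higher the conditional average.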

McG said...

I would still like to explore the Poisson distribution if I may and I thank you for your answer. It would seem the first questions I raised are consistent with your modelling? The Poisson distribution is a special case of the H/T binomial distribution when the probability of one of the events is rare. This is certainly our case since, over 37 dys loci, the P of no mutation is of the order .997 to .998? If I observe the properties of the allele histogram distribution of different dys loci I observe that generally it is more normal than uniform and it is usually skewed high or low. An exception is CDYa and b, which are approaching a uniform distribution, which is an expected result of a binomial process? So many of the distributions of dys loci do not meet a uniform dist. but appear to more fit a normal? (A good set of data to observe these distributions is the R.L.Tarin, Jr. data sets, on - line, at world families network.). So why are most dys loci not approaching a uniform distribution, which I believe should be the expected result??

Another general question about this process is: are the numbers of mutations at a dys locus constrained?? One apparent constraint is the maximum allele length attainable. It appears it is a number less than 50? Even more interesting is the question: are the numbers of mutations at each dys locus constrained relative to other dys loci? Goldstein and Stumpf's paper appears to say yes (Science, 2 March 2001, vol. 291). They argue, see eqn. 4, that any dys locus can be used to estimate TMRCA, whereas in practice several are used and then averaged. This equation suggests that there is a balance between the number of mutations any one dys locus can have relative to the other dys loci. Do you have any idea what this constraint implies, beyond what I have said here???

Thank you for your thoughts. Next I plan to show how the ZUL rates, not the Chandler rates, predict the correct TMRCA for the Kerchner and Clan Gregor data sets.

McG said...

First, a couple of comments on ASD/VAR as methods of counting mutations. I no longer use these techniques; instead I select dys loci where 95% of the allele values are at the modal value and its two neighboring values. This is, in part, to accommodate multi-step mutations, which have been estimated at up to 5% of all mutations. Using the ASD/VAR squared-differences counting technique significantly overcounts mutations. I also, currently, try not to use palindromic dys loci, e.g. 385a,b, due to the way they occasionally mutate.
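To make the overcounting claim concrete, here is a small sketch comparing the two ways of tallying mutations at a single locus. The allele values are invented for illustration and are not taken from any data set mentioned here; the function names are likewise hypothetical. It only shows the arithmetic: a single two-step change adds 4 to the sum of squared differences but 1 to a simple count of mutated chromosomes.

```python
def count_mutated(alleles, modal):
    """Simple count: chromosomes whose allele differs from the modal value."""
    return sum(1 for a in alleles if a != modal)

def sum_squared_diff(alleles, modal):
    """ASD-style tally: sum of squared differences from the modal value."""
    return sum((a - modal) ** 2 for a in alleles)

# Hypothetical repeat counts at one locus; modal value is 14.
# One chromosome (16) is two steps away from the modal value.
alleles = [14, 14, 14, 15, 14, 16, 14, 13]
modal = 14

print(count_mutated(alleles, modal))     # 3
print(sum_squared_diff(alleles, modal))  # 6
```

Whether the squared tally is really an overcount depends on whether the mutation rate being used is defined per mutation event or per single-repeat step, which is an assumption each method makes differently.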

I will now show that the Chandler rates do not predict known dates for sets of data. The ZUL rates appear to do much better, and at the present time I cannot explain why. I have studied your ZUL analysis and am hard-pressed to identify any errors in analysis. All I can say is I don't think it is the right model for mutations???

My approach is simple: I use a modified form of the Slatkin/Goldstein equation: TMRCA = (1/N) x (1/#DL) x Sum over DL of (#mu/musubr), where N = number of Y chromosomes; #DL = number of dys loci; #mu = number of mutations at a dys locus; musubr = mutation rate of that dys locus.

The Kerchner data set has a mutation at 390, 439, 449 (2), GATA H4, 576, CDYa, CDYb. (Note that some of these dys loci do not meet my criterion, and I estimated their mutation rates using ASD, even though I think they're off. Fortunately, as we will show, these dys loci contribute only a few years to the calculation.) The ZUL mutation rate values for these dys loci (all in 10^-5) are: 4.17, 3.75, 12.67, 4.17, 9.75, 16.2, 16.2. Using the ten Kerchner entries and 37 dys loci, I compute a TMRCA of 305 years, which appears to identify Adam, not Frederick, as the founder. Using Chandler's rates and a 30-year generation-to-year conversion I get 109 years, which is way too short!!!

For the Ian Cam analysis, covering all genetic descendants of the Clan Gregor founder, who was born about 1300 AD, I only use the first 12 dys loci. Among the 58 chromosomes I evaluated, I find mutations at the following dys loci (locus: count): 390: 1; 19: 1; 385a: 1; 385b: 2; 439: 4; 389i: 2; 389ii: 4. I omitted 3 mutations that are of close relatives: two Stirlings at 385b, and two 389i and 389ii who are close cousins. To the best of my knowledge that gave me unique transmission events.

Again the ZUL rates for these mutations are: 390: 4.17; 19: 1.17; 385a: 1.67; 385b: 4.90; 439: 3.75; 389i: 1.83; 389ii: 4.67. Plugging these values into the equation with 12 dys loci and 58 entries I get a TMRCA = 1273 AD. This is within 25 years of the date of the "reputed" founder and in fact could be his father? Chandler's rates estimate 1641 as the founder's date.
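The two calculations in this comment can be reproduced with a few lines of code. This is only a sketch of the arithmetic as described, under two assumptions that the comment does not state explicitly: the rates (given in units of 10^-5) are treated as per locus per year, since the results are quoted in years, and the "present" is taken to be 2008. The function and variable names are invented for illustration.

```python
def tmrca_years(mutations, rates_e5, n_chromosomes, n_loci_typed):
    """TMRCA = (1/N) * (1/#DL) * sum over loci of (#mu / rate).

    mutations: locus -> number of mutations counted at that locus
    rates_e5:  locus -> mutation rate in units of 10^-5 (assumed per year)
    """
    total = sum(count / (rates_e5[locus] * 1e-5)
                for locus, count in mutations.items())
    return total / (n_chromosomes * n_loci_typed)

# Kerchner: 10 entries typed at 37 loci
kerchner_mut = {"390": 1, "439": 1, "449": 2, "GATA H4": 1,
                "576": 1, "CDYa": 1, "CDYb": 1}
kerchner_rate = {"390": 4.17, "439": 3.75, "449": 12.67, "GATA H4": 4.17,
                 "576": 9.75, "CDYa": 16.2, "CDYb": 16.2}

# Clan Gregor (Ian Cam): 58 entries, first 12 loci
gregor_mut = {"390": 1, "19": 1, "385a": 1, "385b": 2,
              "439": 4, "389i": 2, "389ii": 4}
gregor_rate = {"390": 4.17, "19": 1.17, "385a": 1.67, "385b": 4.90,
               "439": 3.75, "389i": 1.83, "389ii": 4.67}

PRESENT_YEAR = 2008  # assumption: year of the post

kerchner = tmrca_years(kerchner_mut, kerchner_rate, 10, 37)
gregor = tmrca_years(gregor_mut, gregor_rate, 58, 12)

print(round(kerchner))               # 305 (years before present)
print(round(gregor))                 # 735 (years before present)
print(round(PRESENT_YEAR - gregor))  # 1273 (AD)
```

With these inputs the sketch returns the 305 years and 1273 AD quoted above, which suggests the reading of the equation and of the units is the one intended.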

For whatever reason, the ZUL rates provide me with precision and accuracy in these calculations. All the other calculations of TMRCAs for s21, 116+, and Tarin's Iberian data sets yield estimates consistent with traditional estimates!! (Note that in the Kerchner calculation the contribution of CDYa and b is about 25 years in total in the ZUL calculation.)

If I have one regret here, it is that I cannot identify the error in logic in the VAR/Chandlers rate approach. I just believe it gives incorrect answers. Bob

reletomp said...

no need for calibration or estimation of the germline mutation rate ever.
if 1000 father/son pairs are observed and the mutation rate for an STR is found to be .002, then this is it, folks.

Dr Rob said...

Dienekes, your calculations are likely to be more accurate than those using the Zhiv method. Based on the archaeological evidence, it is only from the Bronze Age / Late Neolithic that populations really began to grow, and thus to 'shape' the overall molecular profile of modern Europe.

Levitylab said...

I'm 13/12 at DYS19 and DYS388, respectively; the Greek Modal, and my highest 12 marker match frequency at FTDNA is for Greece at 3.7%. My paternal line is from northern Germany, near Hamburg.