November 11, 2011

Falsification in action

On the GENEALOGY-DNA-L list, I am an occasional critic of Anatole Klyosov's Y-STR based age estimation methodology. As I have mentioned before, I am boycotting Y-STRs because they are simply worthless for the student of prehistory due to their poor qualities as molecular clocks and lack of any clear correspondence with population movements.

Nonetheless, Klyosov's professional credentials and substantial "dna genealogy" paper production may lead some to give undue attention to his work, which is characterized by very narrow confidence intervals and rather imaginative archaeological reconstructions.

Klyosov resurfaced on GENEALOGY-DNA-L, taking a swipe at my criticism of his narrow confidence intervals:
Instead of walking in circles considering "bushy trees" all these years and complaining on "huge confidence intervals", one better take ACTUAL genealogy data, ACTUAL haplotype datasets, and compare actual dates with those resulted from DNA genealogy. This will show what ACTUAL margins of error looks like. With "bushy trees", they should be first subdivided on separate branches, and each branch should be analyzed individually.

Thankfully, ancient DNA analysis has arrived and can be used to falsify Klyosov's assertions. In December 2010 he discussed the possibility that some E1b1b1 subclades may have played a role in wiping out the "Bell Beakers":
However, E-V13 is already out, since it was formed around 2600 ybp (Lutak and Klyosov, Proceedings, 2009, April, pp. 639-669). E-V65 is out on the same reason (2625 ybp). E-V22 is a good candidate, with its common ancestor around 5075 ybp (ibid). E1b1b1a1-V12 also could be there, with its common ancestor of 4300+/-680 ybp. E3b1, as Adams et al (2008) called them (it is apparently E-81), has a common ancestor in Iberia around 4825 ybp (Klyosov, Proceedings, 2009, March, pp. 390-421), which nicely fit to the concept.
The recent publication of 7,000-year-old E-V13 from Neolithic Spain indicates that this haplogroup was in existence at least that long ago, and hence could not have been formed 2,600 years before present. Klyosov's estimate is off by a factor of at least 2.5, which is consistent with my assertions that Y-STR based age estimates carry huge confidence intervals, and inconsistent with his self-assurance that they do not.
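The size of the discrepancy can be checked with a few lines of arithmetic. The sketch below uses only the two ages quoted above; the variable names are mine, and the ancient sample's age is taken as the round figure of 7,000 years.

```python
# Ages in years before present (ybp), both taken from the text above.
estimated_age_ybp = 2600        # Y-STR based formation date quoted for E-V13
ancient_sample_age_ybp = 7000   # approximate age of the Neolithic E-V13 sample

# A haplogroup must be at least as old as the oldest sample that carries it,
# so this ratio is a lower bound on how far off the estimate is.
underestimate_factor = ancient_sample_age_ybp / estimated_age_ybp

print(f"Underestimated by a factor of at least {underestimate_factor:.2f}")
# prints "Underestimated by a factor of at least 2.69"
```

Since the haplogroup could be considerably older than its oldest known carrier, the true factor may well be larger.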

I see nothing wrong in advancing speculative hypotheses based on the available evidence. I've advanced some of my own ideas for the spread of E-V13 that appear to be less plausible in the light of the ancient DNA evidence, even though a historical, Greek-mediated spread of a subset of E-V13 as proposed by Di Gaetano et al. and King et al. is still possible.

What is certainly wrong is to have over-confidence in one's assertions and not to admit the limitations of Y-STR based age estimates when they are staring us in the face on both theoretical and empirical grounds.


Pascvaks said...

"In every Odyssey, the Sirens sing a song so irresistible that none can hear it and escape."

Indeed, I wonder what Sirens' song Anatole Klyosov hears?

mooreisbetter said...

Dienekes, I'm happy you posted another post on STRs, because I think I speak for many when I say that they are poorly understood.

Can you explain in easy-to-understand terms what STRs are? Don't STRs indicate SNPs?

Also, why they are unreliable. For example, when we get a DNA test back, and it says DYS 390 is 12, or whatever, can you give us an explanation of what that means and why results are unreliable?

I've always wondered, for example, if one of those numbers can mutate DOWN as well as up, for example.

If that is the case - that there can be deletions as well as additions, then I don't see how they are reliable except for genealogy tests for close relatives.

I appreciate a quick primer -- with real life examples -- even from Klyosov's work. Again, my discipline is history - this is new to me.

Unknown said...

@ mooreisbetter

There isn't a direct causal relationship between SNPs and STRs; the latter are just inherited if a man has a son who gets a new SNP. The STRs of course continue to mutate on their own at a faster rate. If a man had two sons, they could each have a different SNP, one the same as the father, the other the new SNP, but still have the same set of STRs. The descendants of each son can continue to carry their respective SNPs but the more rapidly mutating STRs over time can become very different. Moreover, each STR locus has a different mutation rate, so some of those mutate more rapidly than others. The effect of all this is that you get a range of values for each locus for every SNP. Many range values for loci overlap for many SNPs, making prediction unreliable. You can see some of these observed ranges for several SNPs here:
As you correctly point out, STR counts can go up as well as down, which is why these changes are called indels, short for insertion/deletion. As a result, the values tend to oscillate within a range with observed upper and lower limits.
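The effect of counts going both up and down can be illustrated with a small simulation. This is only a sketch under stated assumptions: one-step changes that are equally likely to go up or down, and a starting value, mutation rate, and number of generations that are all invented for illustration rather than taken from any study.

```python
import random

def simulate_lineage(start_repeats, generations, mutation_rate, rng):
    """Follow one STR locus down a single male line.

    Returns (final_repeat_count, number_of_mutation_events).
    """
    repeats = start_repeats
    events = 0
    for _ in range(generations):
        if rng.random() < mutation_rate:
            # Assumption: one repeat gained or lost, with equal probability.
            repeats += rng.choice((-1, 1))
            repeats = max(repeats, 1)  # keep the count from dropping below 1
            events += 1
    return repeats, events

# All parameters below are hypothetical.
rng = random.Random(42)
lineages = [simulate_lineage(12, 2000, 0.002, rng) for _ in range(500)]

mean_events = sum(events for _, events in lineages) / len(lineages)
mean_net_change = sum(abs(repeats - 12) for repeats, _ in lineages) / len(lineages)

print(f"mean mutation events per lineage:  {mean_events:.2f}")
print(f"mean net change from the ancestor: {mean_net_change:.2f}")
```

Because a step up can be cancelled by a later step down, the net change visible in a modern test can never exceed the number of mutations that actually happened, and is usually smaller. Part of the history is erased, which is one reason the "moving targets" are hard to date.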
In a study such as, say, Mike Weale's Anglo Saxon Mass Migration, SNPs and STRs are compared. If the source population has a lot of one SNP, R1b for example, and the host population also has a lot of R1b, then the new admixed population has a lot of R1b. The only way of telling which members of the new admixed population descend from the hosts and which descend from the source is by comparing STRs. It only works if the source and host populations have some significant observed differences. The new admixed population should, in theory, still preserve some record of these differences in the STR values but, as they go both up and down, you are comparing moving targets. Hence the results are always given in terms of degree of probability. There's nothing hard and fast about it. You can't say Man A is definitely from the host population and Man B definitely from the source population. You may find these short explanatory video clips of help:

Styan said...

I have been especially interested in history and archaeology since I was a child. In recent years, I have become very interested in the effort to obtain historical information from DNA evidence. I will also attempt to answer mooreisbetter’s question.

An SNP is a small change to an otherwise stable part of the Y chromosome. It is generally thought to have occurred only once, and any man who has it must be a male line descendant of the man in whom it originated. An SNP test shows that a man definitely does or does not belong to a particular haplogroup. The problem is that the SNPs for each group have to be found individually and tested individually.

An STR is a change to a more variable part of the Y chromosome, involving the number of repeats of a part of the chromosome. They have the advantage that the same STR tests can be applied to any human Y chromosome and results will be produced. The problem is that almost every possible STR figure occurs in more than one haplogroup. I have spent a lot of time trying to understand STRs, mainly studying the 67 STR haplotypes available from the website Ysearch. It is a very complicated subject. Different STRs vary greatly in their degree of stability or tendency to mutate. The most stable STRs stay the same in many different haplogroups. The most stable is DYS 472, which is hardly ever anything other than 8. The least stable frequently vary even within the same group, and often vary within the same range in many different groups. STRs generally do not vary freely within a wide range. Especially the most stable usually have one common figure found in many different groups. Variant figures are most frequently one more or one less than the most frequent figure. Figures usually become rarer as you move away from the most frequent figure. I think they more frequently move up than down. They usually change by one at a time, but some can jump by several at once. Some (e.g. DYS 437) vary frequently between two figures, with others only rarely appearing. DYS 425 is usually 12, but most haplogroups have a branch in which 425 = null. This STR is rarely anything other than 12 or null.

I think a good way for a beginner to see how this works is to obtain some results for perhaps 5-10 members of two different haplogroups and copy the figures into a table. You will see that some STRs have the same figure in all members of both groups, some have one figure in one group and a different figure in the other, and some also vary within one or both groups.
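The table exercise described above can be sketched in a few lines. Every repeat count below is invented for illustration (only the value 8 for DYS 472 follows the description earlier in this comment), and the function name and category labels are hypothetical.

```python
def classify_markers(group_a, group_b):
    """Sort each STR into one of the three patterns described above."""
    result = {}
    for marker in group_a[0]:
        values_a = {haplotype[marker] for haplotype in group_a}
        values_b = {haplotype[marker] for haplotype in group_b}
        if len(values_a) == 1 and len(values_b) == 1:
            if values_a == values_b:
                result[marker] = "same in both groups"
            else:
                result[marker] = "differs between groups"
        else:
            result[marker] = "varies within a group"
    return result

# Invented haplotypes: three men from each of two hypothetical haplogroups.
group_a = [
    {"DYS472": 8, "DYS388": 12, "DYS390": 23},
    {"DYS472": 8, "DYS388": 12, "DYS390": 24},
    {"DYS472": 8, "DYS388": 12, "DYS390": 24},
]
group_b = [
    {"DYS472": 8, "DYS388": 15, "DYS390": 24},
    {"DYS472": 8, "DYS388": 15, "DYS390": 25},
    {"DYS472": 8, "DYS388": 15, "DYS390": 24},
]

classification = classify_markers(group_a, group_b)
for marker, pattern in classification.items():
    print(f"{marker}: {pattern}")
```

With this made-up data, the first marker is no help in separating the groups, the second separates them cleanly, and the third is the kind that varies even among close relatives.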

The way to identify haplogroups from STRs is to find the ones that have one figure in one group and a different figure in another. One STR is generally not enough because almost every possible STR figure occurs in more than one group. However, if two sets of figures have the same unusual figures for enough STRs, we can be fairly certain that they belong to the same haplogroup. The more stable STRs are best for identifying haplogroups, while the less stable are useful for identifying individuals and families. Which of the more stable STRs are useful for identifying the haplogroup varies depending on which groups we are trying to distinguish. The website to which Dienekes’ blog has a link has a haplogroup predictor that works more or less according to these principles.

The reasons stated above mean that it is useful to have figures for a large number of STRs, such as can be found in Ysearch and the DNA projects of Family Tree DNA. Unfortunately many scientific papers give data for only 8 or 12 STRs. These are of little use for identifying haplogroups. People often do statistical tests on such data, in the hope of working out the age of groups. These have been much criticized by Dienekes because uncertainty about the mutation rates of STRs means that different experts can produce very different age estimates for the same group.
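To see why the choice of mutation rate matters so much, consider a deliberately simplified estimator: the age in generations is the average number of mutations per marker divided by the mutation rate per marker per generation. This is not Klyosov's method or any published one, and the observed value, both rates, and the 25-year generation length are assumptions chosen only for illustration.

```python
def estimate_age_years(mean_mutations_per_marker, rate_per_generation,
                       years_per_generation=25):
    """Simplified linear estimate: mutations accumulated / mutation rate."""
    generations = mean_mutations_per_marker / rate_per_generation
    return generations * years_per_generation

# Hypothetical input: the same dataset, analysed with two assumed rates.
observed = 0.2          # average mutations per marker (invented)
faster_rate = 0.002     # assumed mutations per marker per generation
slower_rate = 0.0007    # a second, slower assumed rate

age_fast = estimate_age_years(observed, faster_rate)
age_slow = estimate_age_years(observed, slower_rate)

print(f"age with the faster rate: {age_fast:.0f} years")
print(f"age with the slower rate: {age_slow:.0f} years")
print(f"ratio between the two:    {age_slow / age_fast:.2f}")
```

In this simple model the estimate scales inversely with the rate, so two analysts who disagree about the rate by some factor will disagree about the age by the same factor, however many haplotypes they have.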