January 20, 2010

Y-haplogroup prediction software accuracy tested

The authors tested Athey's Haplogroup Predictor and the University of Arizona Haplogroup Classifier.

From the paper:
These results represent a high probability of error, and a bias towards the R* haplogroup, so it is most likely that results based on the haplogroup predictions of these software systems are weakened. For cases in which sex bias in multiethnic populations is estimated by this method, an overestimation of the European component is expected. Haplogroup determination by SNP analysis remains the best approach, considering the low reliability of prediction of software available.

The adequate LR+ for the Q1a3a and DE haplogroups could be explained by a lower diversity within each group. Especially in the Q1a3a case, which is a relatively recent haplogroup, the homogeneity is the result of its young evolutionary age, given that the time lapse in which the haplotypes spread away from the haplogroup founder is rather short [14].
This is a very important result, as haplogroup prediction software has sometimes been applied to large collections of Y-STR haplotypes found in the forensic literature. It is important to be cautious about its use, especially for short haplotypes.

Int J Legal Med. 2010 Jan 15. [Epub ahead of print]

Software for Y-haplogroup predictions: a word of caution.

Muzzio M, Ramallo V, Motti JM, Santos MR, López Camelo JS, Bailliet G.

The development of online software designed for genetic studies has been exponentially growing, providing numerous benefits to the scientific community. However, they should be used with care, since some require adjustments. The efficiency of two programs for haplogroup prediction was tested with 119 samples of known haplotypes and haplogroups from Argentine populations. Quantitative estimates of the predictive quality of both software systems were computed with the uncertainty coefficient; and sensitivity, specificity, positive, and negative likelihood ratios were also calculated to assert the reliability of both programs, showing high probabilities of assigning an incorrect haplogroup.
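To make the abstract's metrics concrete, here is a minimal sketch of how sensitivity, specificity, and the positive and negative likelihood ratios (LR+ and LR-) can be computed for one haplogroup treated as "target versus everything else". The function name `one_vs_rest_metrics` and the toy labels are assumptions made up for illustration. They are not the paper's data or the authors' code.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, target):
    """Diagnostic metrics for one haplogroup versus all others.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    tp = fp = fn = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == target and guess == target:
            tp += 1  # correctly predicted as target
        elif truth != target and guess == target:
            fp += 1  # wrongly pulled into target
        elif truth == target and guess != target:
            fn += 1  # target sample that was missed
        else:
            tn += 1  # correctly kept out of target
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one target and one non-target sample")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # LR+ = sensitivity / (1 - specificity); infinite if there are no false positives.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    # LR- = (1 - sensitivity) / specificity; infinite if specificity is zero.
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


# Invented toy data: SNP-confirmed haplogroups versus software predictions.
true_hg = ["R"] * 5 + ["Q"] * 3 + ["E"] * 2
pred_hg = ["R"] * 5 + ["Q", "Q", "R"] + ["R", "J"]

metrics_r = one_vs_rest_metrics(true_hg, pred_hg, "R")
metrics_q = one_vs_rest_metrics(true_hg, pred_hg, "Q")
```

In this toy set every true R sample is found, but two non-R samples are also called R, so specificity for R drops and its LR+ is modest. That is the kind of pattern a bias towards R would produce.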



  2. I found it accurate, and I don't belong to R*. STRs mutate as they will, sometimes in ways uncharacteristic of the actual haplogroup. I have one STR marker that is 6-7 allele values below the modal value. No one else has it except a genetic cousin whose line has been separate from mine for nearly 500 years. That odd value doesn't alter the prediction.

    It is a prediction after all. Weather bureaus predict weather with far less accuracy. Those bureaus are still around and predicting.

  3. There are 2 problems with Athey's haplogroup predictor:

    1. The source data are not clearly described in his papers. He lists some websites as data sources in his publications, but it is very difficult to get the full list of haplotypes and haplogroups Athey used for his Predictor. Additionally, some of the databases he lists contain predicted haplogroups for some samples... So it's possible that Athey's Predictor predicts based on something that was previously predicted... not very reliable, I must say...

    2. The algorithm for percentage similarity calculation is misleading, as it gives the value "100%" quite often. AFAIR this value reflects the similarity to the modal haplotype. However, this ignores the non-zero probability that the haplotype could have evolved on another haplogroup background...

    To give you some examples: I have tested more than 150 Polish samples for which both Y-SNP data and 12 Y-STRs are known. Things were OK with R1b. With R1a there was one haplotype that was in fact K*, but its Y-STRs fitted perfectly into R1a, so the Predictor missed it. But everything went bananas when it came to E1b1b1... :-) The Predictor was not able to recognize correctly even one of a dozen chromosomes from this haplogroup - they went to J2 and other HGs according to the Predictor. On the other hand, some J* chromosomes were "predicted" to be E1b1b...

    OK, one may argue that 12 Y-STRs is not enough to make good predictions but still, I was getting values like 100% match for obviously mistaken predictions, and that's misleading, especially for lay persons who use the Predictor software.
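The point about a "100%" match ignoring other haplogroup backgrounds can be sketched with a toy Bayesian calculation. This is not how the Predictor works internally. The haplogroup labels `HG_A` and `HG_B`, the allele frequencies, and the equal priors are all invented for illustration. Only the marker names DYS393 and DYS390 are real.

```python
def haplogroup_posterior(haplotype, freq_tables, priors, floor=1e-6):
    """Posterior probability of each haplogroup given a Y-STR haplotype.

    Assumes markers are independent within a haplogroup (a simplification).
    `floor` is a hypothetical small frequency for alleles never observed.
    """
    joint = {}
    for hg, table in freq_tables.items():
        likelihood = 1.0
        for marker, allele in haplotype.items():
            likelihood *= table[marker].get(allele, floor)
        joint[hg] = likelihood * priors[hg]
    total = sum(joint.values())
    return {hg: value / total for hg, value in joint.items()}


# Invented allele frequencies per haplogroup (not real population data).
freq_tables = {
    "HG_A": {"DYS393": {13: 0.8, 14: 0.2}, "DYS390": {24: 0.7, 25: 0.3}},
    "HG_B": {"DYS393": {13: 0.6, 12: 0.4}, "DYS390": {24: 0.5, 23: 0.5}},
}
priors = {"HG_A": 0.5, "HG_B": 0.5}

# This haplotype is identical to the modal haplotype of HG_A.
sample = {"DYS393": 13, "DYS390": 24}
posterior = haplogroup_posterior(sample, freq_tables, priors)
```

Even though the sample matches the HG_A modal haplotype exactly, HG_B keeps roughly a third of the posterior probability here, because the same alleles are also common on that background. A score that compares the sample against one modal haplotype alone cannot show this.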

