It is important to critically evaluate published data sets, because otherwise even data sets as flawed as the one from Nasidze & Stoneking (2001) may be innocently incorporated into further analyses by other authors, thereby yielding a growing avalanche of a priori flawed results (e.g., Bulayeva et al. 2003; Vernesi et al. 2004). The lessons to be learnt from this case are the following:
1. Researchers should not rush into publication with poor quality DNA sequences and without checking for artificial patterns in the data.
2. Any major inference from mtDNA data must be accompanied by the primary data, which should be submitted to GenBank before submission (and made available there as soon as the paper is published electronically), and additionally displayed in the paper or as an electronic supplement (either in the form of a diagram or a table) in order to enable quick evaluation by referees and readers.
3. Major analyses carried out and interpreted in the paper should be displayed in the form of tables and diagrams.
4. Referees of a submitted manuscript should routinely request additional data and details about analyses that are necessary to check results fully (although this may be difficult to achieve in view of the overabundance of mtDNA papers submitted to journals).
It is nice to see that some researchers take the time to critically evaluate existing results and methodologies.
Ann Hum Genet (early view)
Quality Assessment of DNA Sequence Data: Autopsy of A Mis-Sequenced mtDNA Population Sample
H.-J. Bandelt and T. Kivisild
Published DNA data sets constitute a body of sequencing results resting in silico that are supposed to reflect the variation of (once) living cells. In cases where the DNA variation reported is suspected to be fraught with artefacts, an autopsy of the full body of data is needed to clarify the amount and causes of mis-sequencing. In this paper we elaborate on strategies that allow a clear-cut identification of the problems in severely flawed mtDNA data. This approach is applied, by way of example, to a data set of HVS-I sequences from the Caucasus, published by Nasidze & Stoneking in 2001. These data bear numerous ambiguous nucleotide positions and suffer from an even higher number of phantom mutations, indicating that severe biochemical problems adversely influenced those sequencing results at the time. Furthermore, systematic omission of sequences with a long C-stretch (incurred by a transition at position 16189) must have severely biased the data set. Since no complete correction of these data has appeared to date, this example of mis-sequencing necessitates circumstantial evidence that is bullet-proof.
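Two of the symptoms named in the abstract, ambiguous nucleotide positions and a possible deficit of sequences carrying the 16189 transition, can be screened for with very simple counts. The Python sketch below is only an illustration under stated assumptions and is not the authors' procedure: the function names, the toy sequences, and the variant notation (for example "16189C") are all hypothetical, and it does not attempt to detect phantom mutations, which requires the kind of phylogenetic comparison the paper describes.

```python
# Minimal screening sketch. All names and toy data here are hypothetical
# and are not taken from Bandelt & Kivisild.

UNAMBIGUOUS = set("ACGT")


def count_ambiguous(seq):
    """Count characters that are not A, C, G or T (e.g. N, R, Y)."""
    return sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)


def ambiguity_report(samples):
    """Map each sample ID to its number of ambiguous positions."""
    return {sample_id: count_ambiguous(seq) for sample_id, seq in samples.items()}


def variant_frequency(variant_sets, variant):
    """Fraction of samples whose variant set contains `variant`."""
    if not variant_sets:
        return 0.0
    carriers = sum(1 for variants in variant_sets.values() if variant in variants)
    return carriers / len(variant_sets)


# Toy sequences (not real HVS-I data): sample ID -> sequence string.
toy_sequences = {
    "s1": "ACGTNACGT",
    "s2": "ACGTACGTA",
    "s3": "ACRYNACGT",
}

# Toy variant lists: sample ID -> set of differences from a reference.
toy_variants = {
    "s1": {"16189C", "16223T"},
    "s2": {"16311C"},
    "s3": set(),
    "s4": {"16189C"},
}

print(ambiguity_report(toy_sequences))
print(variant_frequency(toy_variants, "16189C"))
```

A data set with many ambiguous calls per sequence, or with a 16189C frequency well below that of comparable published samples, would merit a closer look. Neither count proves mis-sequencing on its own.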