I have been dabbling in variant calling lately, so it is great to see a new paper in Human Genetics that takes one through the entire process from sequence data to SNPs, comparing different pieces of software for mapping, variant calling, and filtering along the way. It's an invaluable read for anyone wanting to jump onto the FGS train.
With that said, it would probably be a good idea for everyone publishing full sequence data to also take the time to publish VCF files alongside their FASTQ or BAM files. While nothing can beat learning about and carrying out the process from start to finish for yourself, not everyone has the bandwidth, storage, or CPU power to handle the task. With a little help and a little reading, I have managed to cope with the ancient DNA data that's become available over the last half year or so, but a "just the SNPs" download option would be appreciated by most people.
2012, DOI: 10.1007/s00439-012-1213-z
A beginners guide to SNP calling from high-throughput DNA-sequencing data
André Altmann, Peter Weber, Daniel Bader, Michael Preuß, Elisabeth B. Binder and Bertram Müller-Myhsok
High-throughput DNA sequencing (HTS) is of increasing importance in the life sciences. One of its most prominent applications is the sequencing of whole genomes or targeted regions of the genome such as all exonic regions (i.e., the exome). Here, the objective is the identification of genetic variants such as single nucleotide polymorphisms (SNPs). The extraction of SNPs from the raw genetic sequences involves many processing steps and the application of a diverse set of tools. We review the essential building blocks for a pipeline that calls SNPs from raw HTS data. The pipeline includes quality control, mapping of short reads to the reference genome, visualization and post-processing of the alignment including base quality recalibration. The final steps of the pipeline include the SNP calling procedure along with filtering of SNP candidates. The steps of this pipeline are accompanied by an analysis of a publicly available whole-exome sequencing dataset. To this end, we employ several alignment programs and SNP calling routines for highlighting the fact that the choice of the tools significantly affects the final results.
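To give a feel for the last step the abstract mentions (filtering SNP candidates), here is a minimal sketch in Python. It is not the paper's method and it is not a substitute for real tools. It assumes a plain-text, tab-separated VCF, treats a record as a SNP when REF and every ALT allele is a single A/C/G/T base, and uses a QUAL cutoff of 30 that I picked purely for illustration. The names `is_snp`, `filter_snps`, and `min_qual` are my own.

```python
def is_snp(ref, alt):
    """Return True when REF and every ALT allele are single A/C/G/T bases."""
    bases = {"A", "C", "G", "T"}
    return ref in bases and all(allele in bases for allele in alt.split(","))


def filter_snps(vcf_lines, min_qual=30.0):
    """Yield header lines unchanged, plus SNP records with QUAL >= min_qual.

    min_qual is an illustrative threshold, not a recommended value.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Meta-information and column header lines pass through as-is.
            yield line
            continue
        fields = line.split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        if qual == ".":
            # A missing quality value cannot be compared; skip the record.
            continue
        if is_snp(ref.upper(), alt.upper()) and float(qual) >= min_qual:
            yield line


# Tiny made-up example: one good SNP, one deletion, one low-quality SNP,
# and one multi-allelic SNP.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\t.\t.",
    "1\t2000\t.\tAT\tA\t99\t.\t.",
    "1\t3000\t.\tC\tT\t10\t.\t.",
    "1\t4000\t.\tG\tA,T\t45.5\t.\t.",
]
kept = list(filter_snps(example))
for record in kept:
    print(record)
```

In practice one would filter on much more than QUAL (depth, strand bias, mapping quality, and so on), which is exactly the kind of choice the paper compares across tools.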