Award details

An integrated variant calling pipeline for third-generation sequencing technologies

Reference BB/I02593X/1
Principal Investigator / Supervisor Professor Gerard Lunter
Co-Investigators / Co-Supervisors Professor Gil McVean
Institution University of Oxford
Department Wellcome Trust Centre for Human Genetics
Funding type Research
Value (£) 626,048
Status Completed
Type Research Grant
Start date 30/04/2012
End date 30/04/2016
Duration 48 months

Abstract

Background

The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. Current tools will have to be adapted to deal with emerging 3rd-generation platforms, such as Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular. A particular technical challenge for many third-generation platforms is their relatively high indel error rate (inserted or missed bases) compared to, for example, the Illumina and SOLiD technologies.

Aims

This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide the relevant tools to the academic community. Building upon our current tool set (including the read mapper Stampy), it proposes to do this by developing a generic and integrated tool chain. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profiles of existing platforms as well as future 3rd-generation platforms. Parameterizing the tool chain with the resulting statistical error model allows it to cope transparently with technology improvements, as well as with new technologies, provided their error profiles fit within the generic statistical framework.
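The idea of annotating each read position with local mismatch, insertion, and deletion probabilities can be illustrated with a minimal sketch. This is not the proposed file format or the actual transducer machinery; all names are hypothetical, and the likelihood below ignores indel paths for brevity (a full transducer would sum over them).

```python
# Illustrative sketch only: a per-base error annotation in the spirit of
# the transducer representation described above. All names are hypothetical.
import math
from dataclasses import dataclass

@dataclass
class BaseError:
    p_mismatch: float   # probability the called base is wrong
    p_insertion: float  # probability this base is a spurious insertion
    p_deletion: float   # probability a true base was dropped before this one

@dataclass
class Read:
    bases: str
    errors: list  # one BaseError per base

def log_likelihood_match(read: Read, ref: str) -> float:
    """Log-probability that `read` was generated from `ref` with no true
    variants. Indel states are ignored here for brevity; a full transducer
    model would also sum over insertion and deletion paths."""
    ll = 0.0
    for base, ref_base, err in zip(read.bases, ref, read.errors):
        if base == ref_base:
            ll += math.log(1.0 - err.p_mismatch)
        else:
            # a mismatch error produces one of the 3 alternative bases
            ll += math.log(err.p_mismatch / 3.0)
    return ll
```

A read whose bases match the reference exactly then scores a much higher likelihood than one containing an unexplained mismatch, which is the signal a variant caller weighs against the prior probability of a true variant.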

Summary

Rationale

The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. The fast-paced nature of HTS research has led to the development of numerous tools of varying and often ill-understood quality, usually with a narrow application range in terms of sequencing technology and experimental design. This situation is unlikely to improve with the emergence of 3rd-generation platforms (including, for instance, Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular). A particular technical challenge for many third-generation HTS platforms is their relatively high indel error rate (inserted or missed bases) compared to 2nd-generation (particularly Illumina and SOLiD) technologies, requiring the development of new tools. We have recently developed a read mapper ('Stampy') and an integrated SNP and indel caller ('Platypus'), both of which were designed to cope with indel errors and mutations. Recently published and unpublished data show that our tools are state-of-the-art for Illumina data.

Aims

This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide the relevant tools to the academic community. Building upon our current tool set, it proposes to do this by developing an integrated tool chain, built on solid statistical principles and applicable to a range of experimental designs. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profiles of the Illumina, SOLiD and 454 platforms, as well as of currently known 3rd-generation platforms. Current technologies do not provide rich error models; in particular, no current technology annotates reads with per-base indel error rates. To compute these from existing data, as well as to tune factory-provided error models to the particular conditions of a library, lane or flow cell, a recalibration tool will also be developed. Parameterizing the tool chain with the resulting statistical error model allows it to cope transparently with technology improvements, as well as with new technologies, provided their error profiles fit within the generic statistical framework. In addition, we will continue to develop the current tool set to cope with a larger range of variants, and to widen the spectrum of experimental designs to which it is applicable.

Work plan overview

We will first show feasibility by targeting three currently widely used platforms: Illumina and 454 (both available in-house at the WTCHG), and SOLiD (for which we have access to data through the 1000 Genomes project, to which both applicants contribute). Following successful development of the tool chain for these platforms, and having established a standard for representing uncertainty in sequence reads, we will adapt these tools for 3rd-generation platforms. Since the ability to deal successfully with indel errors will be crucial here, we will be helped by our previous experience in developing the read mapper 'Stampy', which shows particularly good sensitivity and specificity for indels.
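The recalibration step described above amounts to estimating empirical error rates from observed alignments, stratified by some sequence context. The sketch below is a hypothetical illustration of that idea, not the proposed tool: it counts indel errors per context (for example, the length of the preceding homopolymer run) against a trusted reference, with simple add-one smoothing. The input format and function name are assumptions.

```python
# Hypothetical sketch of the recalibration idea: turn counts of observed
# indel errors per sequence context into empirical per-context error rates.
from collections import defaultdict

def recalibrate_indel_rates(observations):
    """`observations` yields (context, is_indel) pairs, e.g. the preceding
    homopolymer length and whether an indel error was seen at that position
    when aligning against a trusted reference. Returns an empirical indel
    error rate per context, with add-one smoothing so that contexts with
    few observations do not get a rate of exactly zero."""
    counts = defaultdict(lambda: [0, 0])  # context -> [indel errors, total]
    for context, is_indel in observations:
        counts[context][0] += int(is_indel)
        counts[context][1] += 1
    return {c: (errs + 1) / (total + 2) for c, (errs, total) in counts.items()}
```

In practice such tables would be estimated per library, lane or flow cell, as the text describes, and then used to parameterize the generic error model.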

Impact Summary

Beneficiaries

The proposed tool chain is expected to be beneficial wherever high sensitivity and specificity in identifying polymorphisms from high-throughput sequence data are required. This is the case in many clinical settings, as well as in many research settings, a large fraction of which fall within the remit of the BBSRC. In addition, in large projects that push the technology to its limits, for instance large sequencing-based GWAS studies in which, for reasons of economy or statistical power, maximal information must be extracted from relatively low-coverage sequencing per individual, accurate variant calls or genotype likelihoods as provided by the proposed tool chain are essential. Providers of sequencing technology are also expected to benefit from the proposed standardization of uncertainty in sequencing reads, since this will enhance inter-operability and ease the transition to new generations of sequencing platforms. Similarly, the generic tool chain will allow technology providers to focus on the sequencing platform rather than on downstream software development. The project outcomes are intended and expected to be beneficial for providers of analysis services for high-throughput sequencing data; examples are the PI's group at the Wellcome Trust Centre for Human Genetics, and the group of Dr. Mario Caccamo at TGAC in Norwich. Groups in this position need trusted, comprehensive and broadly applicable analysis tools in order to help their clients effectively.

Impact

By enabling users to analyze heterogeneous data across multiple sequencing platforms in a uniform manner, the project outcomes will reduce the time between sequencing and analysis, interpretation and results. By enabling researchers to switch to other technologies, cost savings or increases in power may be achieved. The higher sensitivity and lower false-positive rates suggested by our initial results, and intended as outcomes of the planned tool chain, will in a clinical context lead to fewer missed genetic variants that may be causative for the phenotype under study, and to higher rates of correct diagnosis. Access to a trusted and sensitive generic analysis pipeline will enable bioinformaticians to provide end-users with the required analysis results more quickly and more confidently, saving costs and reducing turnaround times. The proposed technology is primarily *enabling*: a substantial fraction of the expected impact can be summarized as increased efficiency. This includes cases where uptake of a new technology is brought forward by the availability of analysis tools, and the effect of this could be significant. To a lesser extent, the proposed project may enable research that would otherwise not have been contemplated, by providing access to efficient sequencing technology appropriate to the research question.

People

The two PDRAs trained in this project will acquire sought-after skills in statistical modelling, software development, and bioinformatics research and analysis. They should be well placed for a further career in research, software engineering or data analysis. The WTCHG is a world-class research centre and the second-largest UK sequencing centre, employing around 500 scientists and staff, a large fraction of whom work in bioinformatics, and it has an excellent track record of springboarding young researchers into future careers.
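The genotype likelihoods mentioned above, which allow information to be pooled across individuals in low-coverage designs, can be illustrated with a deliberately simplified sketch. This is not the method of Platypus or of the proposed tool chain; it assumes a single biallelic site, a uniform per-base error rate, and that an error at this site produces the other allele, and the function name is hypothetical.

```python
# Illustrative only: diploid genotype log-likelihoods at one biallelic site,
# given observed read bases and a per-base error probability. Simplified:
# a base-calling error is assumed to yield the other allele.
import math

def genotype_log_likelihoods(read_bases, p_error, ref="A", alt="C"):
    """Return log-likelihoods for the genotypes ref/ref, ref/alt, alt/alt."""
    lls = []
    for n_alt in (0, 1, 2):          # number of alt alleles in the genotype
        f = n_alt / 2.0              # expected fraction of alt-carrying reads
        ll = 0.0
        for b in read_bases:
            # probability of observing the alt base from this genotype
            p_alt_obs = f * (1 - p_error) + (1 - f) * p_error
            ll += math.log(p_alt_obs if b == alt else 1 - p_alt_obs)
        lls.append(ll)
    return lls
```

Even when per-individual coverage is too low to call a confident genotype, these likelihoods can be combined with population-level information, which is the rationale given above for low-coverage GWAS designs.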
Committee Closed Committee - Genes & Developmental Biology (GDB)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme