Award details

Development of a Rapid Processing Pipeline and Graph-based Visualization for the Analysis of Next Generation Sequencing Data

Reference BB/J019275/1
Principal Investigator / Supervisor Dr Anton Enright
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department Enright Group
Funding type Research
Value (£) 188,768
Status Completed
Type Research Grant
Start date 01/07/2012
End date 31/10/2014
Duration 28 months

Abstract

We propose to develop a high-performance system for the processing, analysis and visualization of next-generation sequencing (NGS) data. Currently, many issues associated with NGS data analysis make these data a significant challenge for most laboratories to deal with. We propose to build a highly optimised data processing system for these data, based on our extensive experience of computational biology algorithm development and visualization technologies. The system will be modular, with components written in C/C++, Java OpenGL and OpenCL where appropriate. Raw sequencing data will be processed by Reaper, an ultra-fast read processing engine that de-multiplexes sample barcodes and removes adapter contamination, polyA contamination and low-complexity sequence. Reads are then examined for redundancy by the Tally algorithm, which collates sequence data and produces QC metrics for further analysis. Cleaned, processed reads are scanned against the genome to determine their point of origin. We intend to produce a fast parallel mapping tool based on the Burrows-Wheeler transform, using parallel optimisation and GPU hardware acceleration. After mapping, annotated reads are produced together with further QC data. Reads will also be cross-mapped to each other by a GPU-accelerated suffix-array algorithm that allows rapid computation of read-read similarities across a specified locus. These data will be passed to the visualization engine, which allows read-graph topology analysis and custom assembly routines, together with further visual QC. Depending on user requirements, a range of results can be produced, including transcript expression summaries, differential expression across loci, read-read assemblies and graphs, and tracks for genomic visualization of reads within the IGV tool or the UCSC browser. We believe this combination of high-performance algorithms with visualization and a graphical interface will be of great benefit to the community.
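
The read-cleaning and redundancy steps described above can be illustrated with a minimal Python sketch. It is not the Reaper or Tally implementation: the function names `trim_adapter` and `collapse_reads`, the `min_overlap` parameter and the example adapter string are all assumptions made for illustration, and only exact (mismatch-free) adapter matches are handled.

```python
from collections import Counter


def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read (hypothetical helper, exact matches only).

    A full copy of the adapter anywhere in the read is cut off together with
    everything after it. Otherwise, a prefix of the adapter found at the very
    end of the read is removed if it is at least `min_overlap` bases long.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Look for a partial adapter at the 3' end, longest candidate first.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def collapse_reads(reads):
    """Collapse identical sequences into (sequence, count) pairs.

    Pairs are sorted by descending count, then alphabetically, so the most
    redundant sequences come first.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    adapter = "TGGAATTCTCGG"  # arbitrary example adapter, not taken from the proposal
    raw = [
        "ACGTACGTTGGAATTC",     # ends with a partial adapter
        "ACGTACGTTGGAATTCTCGG",  # contains the full adapter
        "GGGGCCCC",             # no adapter
    ]
    cleaned = [trim_adapter(r, adapter) for r in raw]
    for seq, n in collapse_reads(cleaned):
        print(seq, n)
```

In this toy input the first two reads trim to the same sequence, so collapsing reports it with a count of 2. Per-sequence counts of this kind are one example of a redundancy metric that could feed later QC.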

Summary

Over the last decade or so there has been an explosion of biological data emanating from new laboratory analysis platforms. These data are increasingly complex and large-scale. DNA sequencing in particular has revolutionized the biomedical and biological sciences over the last decade. The recent availability of new DNA sequencing platforms means that orders of magnitude more data can be produced relative to what was possible just a few years ago. These advances have further changed the way we think about scientific approaches to basic, applied and clinical research. For example, the ability to sequence the whole genome of many related organisms has allowed large-scale comparative and evolutionary studies to be performed that were until recently unimaginable. RNA sequencing can also be used for gene-expression analyses, to determine which genes are active in any given state or at any given time. In gene-expression studies, RNA sequencing can identify and quantify rare genes without prior knowledge and can provide information regarding sequence variation in the identified genes. When combined with 'pull-down' technologies, these approaches can also answer important questions regarding gene regulation, such as transcription factor or microRNA target binding. These advances in technology, however, come with significant analytical challenges, in particular with respect to the sheer scale of data now being produced. For example, a single run of an Illumina Solexa GA-2 machine produces approximately 100Gb of sequence data. A number of approaches exist for the analysis of these data; however, they are usually slow and extremely computationally intensive, requiring large-memory computers or high-performance computing clusters to analyse the data effectively. How best to analyse this information is an ongoing and active discussion.
One approach to resolving some of these issues is both to develop fast, optimal algorithms for data analysis and to visualise and analyse data as network graphs. This proposal is to develop an optimised system for the analysis of such data. It will involve the development of extremely fast and optimised algorithms for processing the data, for which we have already created prototypes. We will utilise the relatively new field of GPU hardware acceleration to allow these algorithms to run significantly faster on the specialised hardware of a consumer 3D graphics card. Data processed through the system will be visualised using a customised 3D visualisation environment designed around the existing BioLayout Express3D system. These sequence graphs have already proved themselves useful in identifying novel sequence elements and aiding the assembly of their consensus sequences, in many cases helping to identify where issues lie. Furthermore, we intend to harness the power of correlation analysis for working with RNA-seq data, providing an integrated solution for moving from primary sequence data through to co-expression analysis of tags-per-gene summaries. In doing so, however, we will also provide network and alignment-based views of the primary data that underpin the summary analyses. This will provide novel ways for users to see their data and how reads interact with each other and the genome itself. The entire system will be modular, and each module will be accessed from a graphical user interface written in Java that gives the user control over analysis modules and allows rapid analysis of large-scale datasets, from the primary data to genome/gene level analyses.
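
As a rough illustration of the correlation analysis mentioned above, the following Python sketch builds a small co-expression graph from tags-per-gene counts. It is a sketch under stated assumptions, not the method used in the project or in BioLayout Express3D: the use of Pearson correlation, the 0.9 threshold, the function names `pearson` and `coexpression_edges` and the toy counts are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence has zero variance, where the
    coefficient is undefined.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpression_edges(counts, threshold=0.9):
    """Return graph edges (gene1, gene2, r) for gene pairs with r >= threshold.

    `counts` maps a gene name to its tag counts across samples
    (hypothetical input layout; every gene needs the same number of samples).
    """
    edges = []
    for g1, g2 in combinations(sorted(counts), 2):
        r = pearson(counts[g1], counts[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


if __name__ == "__main__":
    # Toy tags-per-gene counts across four samples (made-up numbers).
    toy_counts = {
        "geneA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8],
        "geneC": [4, 3, 2, 1],
    }
    for g1, g2, r in coexpression_edges(toy_counts):
        print(g1, g2, round(r, 3))
```

With these toy counts only geneA and geneB are linked, because geneC is anti-correlated with both. Each resulting edge list is the kind of input a network-graph viewer could lay out, with genes as nodes and correlations as edge weights.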

Impact Summary

The advent of NGS technology presents significant challenges in terms of data magnitude and complexity. Tools and techniques that can deal with such data are urgently required. The system for analysis and visualization described in this proposal would be of great assistance to a large number of researchers. Computational tools and techniques have a significant impact on the way biological science is being performed in the post-genomic era. Freely available tools and software allow researchers throughout the UK and worldwide to quickly adapt new technology to their own research goals. In particular, we hope to develop a powerful analysis system that is simple and intuitive to use and will minimise the requirement for expert bioinformatics support, thereby helping to bridge the gap between wet and dry research. On an economic and societal level this proposal could have significant secondary benefits, allowing the application of new sequencing technologies to many different biological research problems. Benefits also exist for human and animal health as clinicians and veterinary scientists adopt NGS technologies for diagnostic purposes and to explore population variation and its impact on disease. Likewise, the tools and resources we describe in this proposal will likely be of benefit to the pharmaceutical and agricultural sectors. Both the Freeman and Enright laboratories have extensive networks of collaborators throughout the UK and beyond, and the research described will support many new and existing collaborations. We envisage that these collaborations will improve both communication and scientific effectiveness across the institutes involved. The Roslin Institute is a world-leading agricultural research institute and the European Bioinformatics Institute is a world leader in delivering computational tools, resources and research to the international biological community.
Both applicants are experts in the delivery of usable tools via the provision of intuitive human interfaces. We will promote and support the proposed research through a number of avenues, including publications, training and outreach activities. The EBI has a well-developed outreach team who promote our resources across the UK and worldwide through online tutorials, local presentations and as part of a travelling roadshow. The Roslin Institute and the European Molecular Biology Laboratory (EMBL) both have technology transfer offices. Both applicants have previously been involved in patent applications, technology transfer and company formation. Indeed, both were founding members of Fios Genomics Ltd., a data analysis company that has just received significant investment. Where appropriate, commercialisation options will be explored. Independently, both applicants are actively involved in a range of teaching activities across the United Kingdom and worldwide. This allows us to promote computational techniques, tools and resources to biologists who may not have a strong background in bioinformatics or computational biology. The EBI has a dedicated Industry program that brings together leaders from industry with bioinformaticians. This provides a platform to inform industry of recent advances in research and also to learn what the specific needs and requirements of industry in the UK and worldwide are. We foresee significant benefits from this research to clinical science. Both applicants have previous successful collaborations with clinical groups, including the University of Cambridge teaching hospital (Addenbrooke's, NHS Foundation Trust) and the Royal Infirmary of Edinburgh (NHS Lothian). It is possible therefore that such interactions may produce findings that result in the creation of novel therapeutic or diagnostic procedures with a potential for significant impact on human health.
Committee Not funded via Committee
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme