BBSRC Portfolio Analyser
Award details
COMPUTATIONAL METHODS FOR MICROBIAL NEXT GENERATION RE-SEQUENCING DATA
Reference
BB/M001121/1
Principal Investigator / Supervisor
Professor David Robertson
Co-Investigators / Co-Supervisors
Professor Douglas Kell, Dr Mattia Prosperi, Dr Avraam Tapinos
Institution
The University of Manchester
Department
School of Biological Sciences
Funding type
Research
Value (£)
274,155
Status
Completed
Type
Research Grant
Start date
01/09/2014
End date
31/08/2017
Duration
36 months
Abstract
The aim of this project is to address a specific set of unsolved theoretical problems in the fields of metagenomics and microbiology/virology-associated sequencing projects. We will tackle practical problems, such as the need to make more efficient computer pipelines for analysing and assembling NGS data, with particular emphasis on de novo assembly. Data management and processing of NGS short read data from microbes is usually done with reference to existing genomes. However, due to high levels of variation, the available algorithms can fail to align homologous reads or perform poorly in regions with frequent insertions or deletions or where genome architecture is highly variable. In this proposal we will investigate and implement novel methods for alignment and assembly with and without the use of reference genomes. To achieve this we will use a novel approach for efficient analysis of NGS data by harnessing the speed and accuracy of existing time series data compression/mining techniques. Our approach will make use of these time series representation techniques to compress the individual NGS reads to lower dimensions, thereby reducing the size of the data to be processed and analysed. Working with this transformed representation of sequence reads will speed up the data analysis and improve the accuracy of any results by enabling the use of more thorough heuristics. Existing clustering and indexing approaches and similarity evaluation methods will be used to determine how the reads are linked to either single or multiple genome assemblies, depending on the sample. Pairwise similarity levels of the reads will provide the statistical information required to assess the result of the assembly. We also aim to introduce new methods for visualising NGS alignments graphically, for example, in three-dimensional space. The proposed method will provide clear and precise visual information, for example, visually representing which regions are or are not covered by the assembly.
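As a rough illustration of the idea of recasting reads as numerical series and compressing them, the following is a minimal sketch only. The abstract does not name a specific numeric encoding or time series technique, so the base-to-number mapping (`BASE_STEP`), the cumulative-sum encoding, and the use of Piecewise Aggregate Approximation (PAA) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
# Hypothetical sketch: the encoding and the choice of PAA are assumptions
# made for illustration, not the method stated in the award record.

# Assumed numeric step contributed by each nucleotide.
BASE_STEP = {"A": 2, "C": -1, "G": 1, "T": -2}


def read_to_series(read):
    """Encode a read as a cumulative-sum numeric series (a 'DNA walk')."""
    series = []
    total = 0
    for base in read.upper():
        total += BASE_STEP.get(base, 0)  # ambiguous bases such as N add 0
        series.append(total)
    return series


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each of `segments` frames."""
    n = len(series)
    if not 1 <= segments <= n:
        raise ValueError("segments must be between 1 and len(series)")
    reduced = []
    for i in range(segments):
        start = i * n // segments
        end = (i + 1) * n // segments
        frame = series[start:end]
        reduced.append(sum(frame) / len(frame))
    return reduced


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


if __name__ == "__main__":
    read_a = "ACGTACGTACGTACGT"
    read_b = "ACGTACGAACGTACGT"  # one substitution relative to read_a
    low_a = paa(read_to_series(read_a), 4)
    low_b = paa(read_to_series(read_b), 4)
    # Compare the reads in the reduced (4-dimensional) space.
    print(low_a, low_b, euclidean(low_a, low_b))
```

Each 16-base read is reduced to four values here, so pairwise comparison touches a quarter of the original positions. Whether a distance in the reduced space is a useful proxy for sequence similarity depends on the encoding and the transform chosen.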
Summary
The overwhelming majority of life that has existed or exists is invisible to the naked eye (collectively termed the microbes, or microorganisms) and, including the viruses, forms large and complex communities. Characterising the species present, genome composition and genetic variation in these communities has been a major focus of 'metagenomics', the genomic study of mixed samples from the environment, or from animals or humans, for example, from an animal's gut or a soil microbial ecosystem. Contemporary sequencing technologies (next generation sequencing, NGS) have massively parallelized the determination of nucleotide order within genetic material, resulting in our ability to rapidly sequence different microbes. This introduces the potential to explore microbial communities and genetic diversity on an unprecedented scale. Computational methods play a central role in the analysis, alignment and assembly of NGS data. However, the amount of data being generated is outstripping our ability to analyse them routinely, let alone carry out appropriate comparative analysis. This lack of software arises because most research effort is being directed at assembling single complete genomes from next generation sequence data. However, with microbes, many interesting questions concern the diversity of sequences present in a community and population variation, revealed by 'ultra-deep' sequencing. Emerging approaches aim to build a de novo assembly of the NGS reads (each read is an individual sequence fragment corresponding to a region of a genome) in a similar fashion to a jigsaw puzzle where a picture is constructed by joining all the matching pieces together. In de novo assembly the genome sequence is constructed by allocating matching short reads together. The majority of the existing de novo assembly approaches for NGS data make extensive use of the de Bruijn graph method.
However, building de Bruijn graphs for very large NGS data sets is very demanding because it requires substantial computational resources. In this project we propose to develop novel computational methods, based on compressing the individual NGS reads by recasting them as numerical sequences (and working with this transformed/compressed data directly), that will be generically useful for all types of microbial data sets. In order to do this we will explore novel methods for representing short-read sequence data graphically and apply established mathematical approaches for efficient data mining. The particular problem we will address is the assembly of NGS data sets where the variation in the sample needs to be considered in the analysis. In metagenomic data, variation between reads corresponds to both distinct microbial species and variation within individual species or viral populations. A particularly important focus is the ability to assemble a genome without a reference sequence for comparison (de novo assembly), as an appropriate reference genome is frequently not available for many microbes and, even when a reference is available, genome architecture can vary within a species.
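For readers unfamiliar with the de Bruijn graph method mentioned in the summary, a minimal sketch of the construction step is given here. It is a generic textbook illustration, not code from the project; the function name `de_bruijn_edges` and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: each k-mer adds an edge prefix -> suffix.

    Nodes are (k-1)-mers. The result maps each node to the list of nodes
    that follow it in at least one read (duplicates kept as multi-edges).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads; real data sets contain millions of reads.
    graph = de_bruijn_edges(["ACGTC", "GTCAA"], k=3)
    for node, successors in sorted(graph.items()):
        print(node, "->", successors)
```

The graph stores an entry for every distinct (k-1)-mer in the input, so memory use grows with the diversity of the sample, which is one reason the construction becomes demanding for very large or highly variable data sets.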
Impact Summary
Many scientists are using next-generation sequencing technologies in their research. In most projects it is sufficient to combine the short-read data arising from the specific sequencing platform into a consensus sequence, and a number of best-practice computational methods exist. However, in the case of next generation re-sequencing projects where the aim is to study depth of variation (for example, in a viral infection of an animal or human) or metagenomic projects where the aim is to study a community of microbes including viruses, there remains a dearth of appropriate computational methods. The particular nature of metagenomic data (large sizes and complexity) produces its own challenges as well as some unprecedented opportunities (highlighted in the BBSRC's 2010 Review of Next Generation Sequencing). For the full potential of metagenomics studies to be realised there is a need for novel computational tools for NGS data analysis. Our approach will reduce the complexity of NGS data sets, permitting the implementation of more rigorous algorithms and, as a consequence, improvements to data storage and analysis, and to the reliability of the results that can be obtained from NGS data sets. The wider exploitation of metagenomics approaches will have a number of beneficiaries: (i) Researchers in academia and not-for-profit organisations who study biodiversity in environmental or organism samples. (ii) Those detecting and monitoring microbial and viral pathogens in agriculture (plants or animals). (iii) Public-health researchers who are interested in detecting novel or existing pathogens. (iv) Researchers in the commercial sector in the form of companies developing screening techniques, environmental monitoring etc. Our primary form of communication to potential beneficiaries will be through the joint media of presentations at major conferences and peer-reviewed papers in open access journals. Such traditional media are important to ensure the quality of the research.
We will also engage directly with individuals, institutions and companies that are likely to find our research applicable, starting with the collaborative partners we have listed here. Conference presentations are also very effective for establishing contact with other potential beneficiaries and collaborators in academia or industry.
Committee
Research Committee C (Genes, development and STEM approaches to biology)
Research Topics
Microbiology, Technology and Methods Development
Research Priority
X – Research Priority information not available
Research Initiative
X - not in an Initiative
Funding Scheme
X – not Funded via a specific Funding Scheme