Award details

Development of Novel Computational Strategies to Store and Interpret Next Generation Sequencing Data and Their Application to Multi-Genomic Analyses

Reference	BB/I01585X/1
Principal Investigator / Supervisor	Professor Michael Sternberg
Co-Investigators / Co-Supervisors	Dr Sarah Butcher, Professor William Knottenbelt, Dr Come Raczy
Institution	Imperial College London
Department	Life Sciences
Funding type	Skills
Value (£)	99,932
Status	Completed
Type	Training Grants
Start date	01/09/2012
End date	31/08/2016
Duration	48 months

Abstract

unavailable

Summary

AIM OF THE PHD PROJECT - High-throughput sequencing (HTS) of genomes and transcriptomes will make sequencing data available for numerous samples across many species. However, exploiting this information is hampered by major difficulties in storing the large data sets, transferring them between sites, and visualising them. The aims of this cross-disciplinary PhD project are: (1) to develop novel data reduction methods that streamline the storage and analysis of large, complex multi-genomic data; (2) to develop tools that produce compact visualisations; and (3) to use these tools to mine a biological dataset and investigate specific points of biological interest.

DATA REDUCTION - The first challenge will be to achieve a major reduction in the size of the data without losing the critical metadata associated with each sequenced base (i.e. the quality of the data, or even the original read). We will need to develop novel data reduction algorithms, since traditional lossless compression techniques are unsuitable for HTS data: they do not combine rapid decoding starting from any point in the stream with rapid mutual comparison of several compressed streams. Additionally, current DNA compression methods (DNACompress, LCA, and DNAzip) primarily consider a single genome. Here we will exploit the repeatability and consistency of sequencing technologies: applying the same technology and method to very similar genome sequences is likely to produce strong similarities in systematic deviations (sequencing errors, variations in coverage, etc.). This would make differential compression and other de-duplication techniques highly efficient across the whole data set. The second challenge will be to design protocols to improve data transfer. A large number of scientists will be querying consolidated data sets from several locations around the world. We need to provide efficient storage that supports real-time partial extraction of data at various resolutions, similar to the functionality provided by BigBed and BigWig. In addition to data format definitions, it will be necessary to define protocols that efficiently support the distributed nature of the work.

VISUALISATION - Existing genome browsers are not suited to large-scale comparative genomics studies, as at best they support simultaneous visualisation of a small number of genomes. Visualising a large number of genomes will require new concepts for the navigation and visualisation of genomic data. The data reduction techniques we will develop lead naturally towards compact data visualisation, with the ability to use interactive thresholds and cut-offs to display comparative features and to toggle between data subsets. Once the right queries have been presented to the appropriate databases and the results aggregated, the remaining step is to present the data in a meaningful way.

APPLICATION - Our currently favoured exemplar dataset comes from genomic and transcriptomic studies of the obligate fungal pathogen of barley, Blumeria graminis hordei, and other closely related fungi. A large collaborative effort including Butcher and Spanu (Imperial) is underway with BBSRC support (BB/E000983/1; BB/H001646/1). Several completed genomes (in the >120 Mbase range) are available and several others are underway with international collaborators; there are also transcriptomes. We will use the developed computational tools to study phenotypic variation between species. Other biological topics that can be explored include the analysis of strain data for plant and animal pathogens and cross-genomic studies of related bacteria.
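
As an illustration only, the differential compression idea described under DATA REDUCTION can be sketched in a few lines of Python. This is not the project's method: the function names, the block size, and the restriction to substitutions between equal-length sequences are all assumptions made for the example. It stores only the positions where a sample differs from a closely related reference, grouped into fixed-size blocks so that any block can be decoded without reading the rest of the stream.

```python
from typing import Dict, List, Tuple

# Hypothetical block size; a real format would choose this to balance
# index overhead against random-access granularity.
BLOCK = 8

# Maps block index -> list of (offset within block, substituted base).
Diffs = Dict[int, List[Tuple[int, str]]]


def delta_encode(reference: str, sample: str, block: int = BLOCK) -> Diffs:
    """Record only the bases where sample differs from reference."""
    if len(reference) != len(sample):
        # Assumption of this sketch: substitutions only, no indels.
        raise ValueError("sequences must have equal length")
    diffs: Diffs = {}
    for pos, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base:
            diffs.setdefault(pos // block, []).append((pos % block, sample_base))
    return diffs


def decode_block(reference: str, diffs: Diffs, block_index: int,
                 block: int = BLOCK) -> str:
    """Rebuild one block of the sample without touching any other block."""
    start = block_index * block
    chunk = list(reference[start:start + block])
    for offset, base in diffs.get(block_index, []):
        chunk[offset] = base
    return "".join(chunk)


def decode_all(reference: str, diffs: Diffs, block: int = BLOCK) -> str:
    """Rebuild the whole sample by decoding every block in order."""
    n_blocks = (len(reference) + block - 1) // block
    return "".join(decode_block(reference, diffs, i, block)
                   for i in range(n_blocks))
```

Calling `delta_encode(reference, sample)` on two near-identical sequences yields a small dictionary of differences, and `decode_block` then recovers any single block directly from the reference plus that dictionary. Indels, quality scores, and read-level metadata, all of which the summary identifies as necessary, are deliberately left out of this sketch.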

Committee	Not funded via Committee
Research Topics	X – not assigned to a current Research Topic
Research Priority	X – Research Priority information not available
Research Initiative	X - not in an Initiative
Funding Scheme	Training Grant - Industrial Case

BBSRC Portfolio Analyser

BBSRC Portfolio Analyser will be decommissioned at the end of 2025.
Other UKRI reporting services are available to you providing details of BBSRC investments: UKRI - What we have funded
