BBSRC Portfolio Analyser
Award details
Development of Novel Computational Strategies to Store and Interpret Next Generation Sequencing Data and Their Application to Multi-Genomic Analyses
Reference
BB/I01585X/1
Principal Investigator / Supervisor
Professor Michael Sternberg
Co-Investigators / Co-Supervisors
Dr Sarah Butcher, Professor William Knottenbelt, Dr Come Raczy
Institution
Imperial College London
Department
Life Sciences
Funding type
Skills
Value (£)
99,932
Status
Completed
Type
Training Grants
Start date
01/09/2012
End date
31/08/2016
Duration
48 months
Abstract
unavailable
Summary
AIM OF THE PHD PROJECT - High throughput sequencing (HTS) of genomes and transcriptomes will lead to the availability of sequencing data for numerous samples across many species. However, there are major problems in exploiting this information because the large data sets are difficult to store, to transfer between sites, and to visualise. The aims of this cross-disciplinary PhD project are (1) to develop novel data reduction methods that streamline the storage and analysis of large, complex multi-genomic data; (2) to develop visualisation tools that produce compact visualisations; and (3) to use these tools to mine a biological dataset and investigate specific points of biological interest.
DATA REDUCTION - The first challenge will be to achieve a major reduction in the size of the data without losing the critical metadata associated with each sequenced base (i.e. the quality of the data or even the original read). We will need to develop novel data reduction algorithms, since traditional lossless compression techniques are unsuitable for HTS data: they do not support both rapid decoding from any point in the stream and rapid mutual comparison of several compressed streams. Additionally, current DNA compression methods (DNACompress, LCA, and DNAzip) are designed primarily around a single genome. Here we will exploit the repeatability and consistency of sequencing technologies: applying the same technology and method to very similar genome sequences is likely to produce strongly similar systematic deviations (sequencing errors, variations in coverage, etc.). This would make differential compression and other de-duplication techniques highly efficient across the whole dataset. The second challenge will be to design protocols that improve data transfer. A large number of scientists will be querying consolidated data sets from several locations around the world. We need to provide efficient storage that supports real-time partial extraction of data at various resolutions, similar to the functionality provided by BigBed and BigWig. In addition to data format definitions, it will be necessary to define the protocols that will efficiently support the distributed nature of the work.
VISUALISATION - Existing genome browsers are not suited to large-scale comparative genomics studies, as at best they support simultaneous visualisation of a small number of genomes. Visualising a large number of genomes will require new concepts for the navigation and visualisation of genomic data. The data reduction techniques we will develop lead naturally towards compact data visualisation, with the ability to use interactive thresholds and cut-offs to display comparative features and to toggle between data subsets. Once the right queries have been presented to the appropriate databases and the results aggregated, the remaining step is to present the data in a meaningful way.
APPLICATION - Our current favoured exemplar dataset comes from genomic and transcriptomic studies of the obligate fungal pathogen of barley, Blumeria graminis hordei, and other closely related fungi. A large collaborative effort including Butcher and Spanu (Imperial) is underway with BBSRC support (BB/E000983/1; BB/H001646/1). Several completed genomes (in the >120 Mbase range) are available, along with transcriptomes, and several other genomes are underway with international collaborators. We will use the developed computational tools to study phenotypic variation between species. Other biological topics that can be explored include the analysis of strain data from plant and animal pathogens and cross-genomic studies of related bacteria.
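ILLUSTRATION - The differential compression idea described under DATA REDUCTION can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's method: the function names (diff_encode, diff_decode_window), the toy sequences, and the restriction to equal-length sequences that differ only by substitutions are all hypothetical choices made for illustration. Each sample is stored as a sorted list of differences against a shared reference, and any window can be reconstructed without decoding the whole sequence, loosely mirroring the partial extraction requirement.

```python
from bisect import bisect_left

def diff_encode(reference: str, sample: str) -> list[tuple[int, str]]:
    """Store a sample as (position, base) substitutions against a reference.

    Assumption for this sketch: both sequences have the same length and
    differ only by substitutions (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]

def diff_decode_window(reference: str, diffs: list[tuple[int, str]],
                       start: int, end: int) -> str:
    """Reconstruct sample[start:end] from the reference and the sorted diffs."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    window = list(reference[start:end])
    positions = [pos for pos, _ in diffs]  # already sorted by construction
    i = bisect_left(positions, start)      # first diff at or after start
    while i < len(diffs) and diffs[i][0] < end:
        pos, base = diffs[i]
        window[pos - start] = base
        i += 1
    return "".join(window)

# Toy data: two hypothetical strains that each differ from the reference by one base.
reference = "ACGTACGTACGTACGT"
samples = {
    "strain_a": "ACGTACGAACGTACGT",
    "strain_b": "ACGTACGTACGTTCGT",
}
encoded = {name: diff_encode(reference, seq) for name, seq in samples.items()}
```

With closely related genomes the difference lists stay short, which is where the expected savings would come from; quality scores, indels, and multi-resolution summaries are outside the scope of this sketch.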
Committee
Not funded via Committee
Research Topics
X – not assigned to a current Research Topic
Research Priority
X – Research Priority information not available
Research Initiative
X - not in an Initiative
Funding Scheme
Training Grant - Industrial Case