Award details

Integrating genomes and proteomes on the cloud

Reference BB/K004123/1
Principal Investigator / Supervisor Professor Andrew Jones
Co-Investigators / Co-Supervisors
Institution University of Liverpool
Department Institute of Integrative Biology
Funding type Research
Value (£) 116,287
Status Completed
Type Research Grant
Start date 25/12/2012
End date 24/06/2014
Duration 18 months

Abstract

Shotgun proteomics studies are now routinely used to identify large numbers of proteins from complex mixtures and can provide semi-quantitative information about each protein. Typically, mass spectrometry (MS) data are queried against a local protein sequence database downloaded, and potentially locally modified, from one date-stamped release of predicted gene sequences by the genome curators. In many published proteomics studies, the protein accessions do not correspond with the latest genome release, and results are difficult to interpret correctly or integrate with other data sets. As such, proteomics results are generally disconnected from the genome, since repeating analyses when new gene models are produced is computationally and labour intensive. MS data can also provide evidence for protein isoforms to aid genome annotation. However, most MS data sets are not mined for this information because these searches are too computationally expensive in most settings. Cloud computing offers a powerful new model in which access to parallel compute clusters can be provided and charged on demand for the length of time required. In related projects, we are already developing open-source software and data standards to support various tasks in proteomics. In this project, we will make existing tools available within the Galaxy framework - a genomics-focussed pipeline infrastructure for cloud computing - and build several new tools, specifically focussed on integrating proteomics data with the corresponding genome. These cloud-based software tools will enable fast searches to identify alternative splice products, alternative gene models, sequence polymorphisms and a range of modifications, as well as supporting label-free quantitative approaches. We will provide a publicly available cloud implementation for users to test the software and an image of the software that can be easily instantiated on a new cloud instance, or local cluster, set up by other groups.
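
To illustrate the accession mismatch described above, here is a minimal sketch of re-mapping previously identified peptides onto a newer release of gene models. It is not the project's software: the function names, accessions and sequences are hypothetical, and exact substring matching stands in for a real re-analysis of the MS data.

```python
from typing import Dict, List, Tuple


def map_peptides(peptides: List[str], proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """For each peptide, list the accessions whose sequence contains it."""
    return {
        pep: sorted(acc for acc, seq in proteins.items() if pep in seq)
        for pep in peptides
    }


def compare_releases(
    peptides: List[str],
    old_release: Dict[str, str],
    new_release: Dict[str, str],
) -> Dict[str, Tuple[str, List[str]]]:
    """Classify each peptide by how its protein assignment changes between releases."""
    old_hits = map_peptides(peptides, old_release)
    new_hits = map_peptides(peptides, new_release)
    report = {}
    for pep in peptides:
        if not new_hits[pep]:
            status = "lost"        # no protein in the new release contains the peptide
        elif new_hits[pep] == old_hits[pep]:
            status = "unchanged"   # same accessions in both releases
        else:
            status = "remapped"    # peptide still matches, but under different accessions
        report[pep] = (status, new_hits[pep])
    return report


# Hypothetical toy releases: accession -> protein sequence.
OLD_RELEASE = {"GENE1.v1": "MAKGRWLLK", "GENE2.v1": "MSTPEPTIDEK", "GENE3": "MVLSPADK"}
NEW_RELEASE = {"GENE1.v2": "MAKGRWLLK", "GENE2.v2": "MSTPEPSIDEK", "GENE3": "MVLSPADK"}
PEPTIDES = ["GRWLLK", "STPEPTIDEK", "VLSPADK"]

report = compare_releases(PEPTIDES, OLD_RELEASE, NEW_RELEASE)
for peptide, (status, accessions) in report.items():
    print(peptide, status, accessions)
```

In practice a changed gene model may call for a full re-search of the spectra, which is the computationally expensive step the project aims to move to the cloud.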

Summary

Research in the life sciences has been greatly facilitated by high-profile efforts to sequence the genomes of humans, model organisms, crops, livestock and organisms causing infectious disease. Each genome sequence is made of DNA containing many thousands of genes, each one providing the code for one (or more) proteins - the functional molecules in the cell. In each cell, different genes are switched on or off to provide the set of proteins required at that point in time, depending on local signals and stresses. To understand what is happening in a particular cell at one point in time, the genome sequence is not informative on its own; researchers must study the set of proteins that have been produced, a field termed proteomics. In proteomics, mass spectrometry (MS) is used to identify and quantify which proteins are present in a sample. MS reports the masses of fragments of proteins, which require processing by specialised software to interpret. MS data are searched against a database containing the organism's gene sequences (after they have been translated into proteins) to find exact matches - confirming that a particular protein is present in the sample. The success of proteomics is thus reliant upon gene sequences having been correctly identified within the genome. Identifying gene sequences correctly (termed genome annotation) can be difficult, since in many genomes the genes are interspersed with considerable amounts of non-coding DNA. Genome annotation can be facilitated by MS data, since the confirmation that a protein sequence has been experimentally observed gives strong evidence that the gene sequence was correctly predicted. Sequentially searching for matches to many different hypothetical predictions can produce an incremental improvement in gene sequences, but such searches require significant computing time and so are difficult to perform without specialised high-performance computing (HPC).
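
The search described above can be sketched in a few lines: translate a predicted gene sequence into protein, cut it into the peptides a trypsin digest would produce, and check which of them were observed. This is a simplified illustration under stated assumptions, not the project's pipeline. Real search engines match measured fragment masses rather than peptide strings, and the gene model, observed peptides and function names below are hypothetical.

```python
from typing import Iterable, List

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def tryptic_digest(protein: str) -> List[str]:
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        at_end = i + 1 == len(protein)
        if residue in "KR" and (at_end or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def supporting_peptides(cds: str, observed: Iterable[str]) -> List[str]:
    """Return predicted peptides of a gene model that were experimentally observed."""
    observed_set = set(observed)
    return [pep for pep in tryptic_digest(translate(cds)) if pep in observed_set]


# Hypothetical gene model (codons spaced for readability) and observed peptides.
GENE_MODEL = "".join(
    "ATG GCT GAA GTT CTG AAA TCT CCG GAT GGT ACC CGT TGG CTG TTT TAA".split()
)
OBSERVED = ["MAEVLK", "SPDGTR", "LLNNK"]

evidence = supporting_peptides(GENE_MODEL, OBSERVED)
print(translate(GENE_MODEL), evidence)
```

Repeating this check for many alternative gene models, as the paragraph describes, multiplies the work by the number of hypotheses tested, which is why such searches become expensive at genome scale.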
Another challenge in proteomics is that curators of genome sequences release new sets of genes at frequent intervals. Scientists tend to quantify a set of proteins based on one particular set of gene sequences, and it is time consuming to repeat their analysis if a new set is released - meaning that most publicly available proteomics data sets refer to old gene sequences, which are difficult to compare with the latest release. These challenges would be made simpler if HPC and appropriate software were more readily available for proteomics. Cloud computing is a new model of running software on remote computers over the Internet, where access to processing time or data storage can be scaled up or down on demand to the user's requirement. Several commercial providers allow access to clouds whereby software can run in parallel on tens, hundreds or thousands of computers, with the cost passed back to the user based only on their usage. In the past, we have developed software that allows MS data to be queried directly against genome sequences to facilitate genome annotation efforts. In this project, we will develop a software toolkit to run in a cloud computing environment to allow proteomics scientists access to HPC from their desktops. The software will allow MS data to be used to search for different forms of a protein, for example specific to one individual or one cell type, since such searches are too computationally challenging to perform on a standard PC. We will allow users to repeat searches against new or different gene model sets to help improve genome annotation and to allow scientists to perform their quantitative analyses against the latest set of gene models, making it easier for other scientists to interpret their data or integrate with their own results. These advances will have potential benefits for a huge range of research areas in the health and life sciences that make use of proteomics techniques.

Impact Summary

The potential impact of this project falls into the following categories: potential for commercial exploitation, and indirect impact through improvements in genome annotation and proteomic data analysis capabilities.

Commercial exploitation

We have a track record of working with commercial partners in proteome bioinformatics, especially through developments within the Proteomics Standards Initiative - for example, several data standards developed in this context have been implemented in industrial software, including Mascot (Matrix Science, UK), the market-leading search engine. All software released in this project will be developed and maintained through the Google Code Subversion repository, which remains open-source at all times. We typically use the Apache 2.0 licence, which allows code to be employed without restriction in other open- or closed-source products. As such, we envisage that proteomics software vendors or pharmaceutical research and development teams may wish to take up our tools for in-house pipelines deployed on a private cloud instance.

Indirect impacts

Public genome databases are used in a broad variety of contexts, including by commercial research and development teams, so any software that is capable of improving genome annotation has the potential for significant (indirect) impact on health and/or the economy. To date, our proteogenomic pipeline has been used primarily for improving the genome annotation of apicomplexan pathogens. As one example of a potential for impact, several pharmaceutical companies are searching for vaccine candidates in the Apicomplexa. The identification of a new protein isoform in one of these organisms (as provided by our software) could be an important advance if it elicits an immune response in the host.
The pipeline is currently challenging to run for smaller facilities without access to high-performance computing, so a cloud-enabled implementation will greatly increase the uptake of this software, with potential impacts across a range of areas. Shotgun proteomics analyses that provide quantitative values on proteins are used in various pre-clinical settings, for example in early screening stages to search for biomarkers of disease. The associated data analysis can be computationally challenging and is rarely repeated if gene models change, meaning that most previously published studies are now out of date. Cloud computing provides a route for rapidly repeating previous data analyses or performing new analyses with a much wider range of search parameters or against different search databases. This could potentially improve biomarker identification pipelines.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme