Award details

Galaxy Workflows for Proteomics Informed by Transcriptomics (PIT)

Reference BB/K016075/1
Principal Investigator / Supervisor Professor Conrad Bessant
Co-Investigators / Co-Supervisors Dr Jun Fan, Dr DA Matthews
Institution Queen Mary University of London
Department School of Biological and Chemical Sciences
Funding type Research
Value (£) 108,443
Status Completed
Type Research Grant
Start date 22/04/2013
End date 21/06/2014
Duration 14 months

Abstract

Popular software for the identification of proteins from MS/MS spectra (e.g. Mascot, OMSSA, MaxQuant) searches acquired spectra against a peptide list derived from the proteome of the species under study. This makes the protein identification problem tractable by constraining the peptide search space, but limits the application of proteomics to those few species for which there is a high-quality, well-annotated genome, and does not inherently find variant peptides that differ from the reference proteome. We have recently shown that these limitations can be overcome by generating a sample-specific protein database from transcriptomes assembled de novo from RNA-seq short reads acquired from the same sample. This technique, which we refer to as proteomics informed by transcriptomics (PIT), has recently (Sept 2012) been accepted for publication in Nature Methods. We showed that, for case studies including adenovirus-infected HeLa cells, the PIT approach identified >95% of the peptides found using a traditional proteomics search against a reference proteome and, thanks to the more tightly defined search space and the ability to match sample-specific variant peptides, detected several hundred additional peptides that were present. The aim of the proposed project is to use our unique experience of the PIT approach to develop a suite of Galaxy-based workflows that allow the typical bench scientist to perform the data analysis needed to support PIT and to extract biologically relevant information from the results obtained. The workflows will be composed of numerous individual tools: some are already available for Galaxy (e.g. Trinity for de novo transcript assembly, and getORF for deriving protein sequences from the transcripts), others need to be "wrapped" for use in Galaxy (e.g. the OMSSA protein search engine), and a further set must be specially written for PIT (primarily for data integration, downstream analysis and reporting).
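
The step of deriving candidate protein sequences from assembled transcripts can be illustrated with a minimal Python sketch. It is not the project's code and does not reproduce getORF itself: it assumes the standard genetic code, translates each transcript in all six reading frames, and keeps every stop-to-stop segment of at least `min_len` residues. The function names and the `min_len` parameter are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation tables are needed for the sample).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(transcript, min_len=5):
    """Collect stop-to-stop protein segments from all six reading frames."""
    seq = transcript.upper()
    orfs = set()
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.add(segment)
    return sorted(orfs)


if __name__ == "__main__":
    # Toy transcript, not real data.
    for protein in six_frame_orfs("ATGGCCAAGTTGTAA", min_len=4):
        print(protein)
```

The resulting sequences would form the sample-specific database that a search engine is then run against, in place of a reference proteome.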

Summary

Identifying which proteins are present in a given biological sample, and in what quantities, is essential to understanding many biological processes. A technique called "shotgun proteomics" has become the method of choice for tackling this problem. In a shotgun proteomics analysis, proteins are first broken down into more easily analysable segments (peptides) using a cleavage enzyme, then separated using liquid chromatography (LC), prior to individual injection into a tandem mass spectrometer (MS/MS), which breaks peptides into fragments, producing a spectrum of product ions that can be considered as a fingerprint for each peptide. Software is used to match the acquired spectra to peptides and these peptide identifications are then used to infer the presence of proteins. Working out which peptide is represented by each of the acquired spectra is clearly a crucial part of shotgun proteomics. In theory, because we understand the principles of peptide fragmentation, it should be possible to take any peptide spectrum and work out the sequence of the peptide from which it came. In practice this is usually too difficult because the combination of imperfect MS/MS spectra and the huge number of peptides that could potentially exist makes incorrect identifications very likely. To circumvent this problem, protein identification software seeks to match peptide spectra only to those peptide sequences that might reasonably be expected to be in the sample. Currently this is done by searching against the sequences of all proteins that the species under study is known to produce (the "proteome"), downloaded from an online database (e.g. UniProt). However, high-quality proteomes are only available for a small number of species. What if you want to do proteomics on a sample from a species for which a proteome is not available, or on a sample from an experiment involving multiple species, or unknown species?
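
The idea of restricting matches to peptides that might reasonably be in the sample can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not how Mascot, OMSSA or MaxQuant work internally: it assumes trypsin as the cleavage enzyme (cutting after K or R unless the next residue is P), uses monoisotopic residue masses, and filters candidates by intact peptide mass only, whereas real search engines go on to score the fragment spectrum. All function names, the `min_len` filter and the `tol_ppm` tolerance are hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residues


def tryptic_digest(protein, min_len=6):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_len]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def candidate_peptides(observed_mass, proteins, tol_ppm=10.0):
    """Return (protein name, peptide) pairs within tol_ppm of observed_mass."""
    hits = []
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            mass = peptide_mass(peptide)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, peptide))
    return hits


if __name__ == "__main__":
    # Toy database with a made-up sequence, not real data.
    database = {"toy_protein": "MAGICPEPTIDEKLLLLLLLR"}
    observed = peptide_mass("LLLLLLLR")
    print(candidate_peptides(observed, database))
```

Only peptides derivable from the supplied database are ever considered, which is what keeps the search tractable and also why a peptide absent from that database cannot be identified.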
We recently developed (and tested, and published) a solution to this problem, which we call proteomics informed by transcriptomics (PIT). The key to PIT is the creation of a sample-specific list of proteins that may be present, derived from gene transcripts found in the sample. Transcripts are copies of genes that are used to make proteins, so by knowing which transcripts are present in a sample we can predict which proteins might be present. The transcripts are found by using a next-generation sequencing technique called RNA-seq. Until very recently, RNA-seq involved mapping short reads to a reference genome, but software is now available that can assemble transcripts de novo. The PIT approach therefore makes it possible to identify and quantify proteins in complex samples when a reference proteome (or genome) is not available. This opens many new areas of research for species that do not have well-annotated genomes (which include many pests, pathogens and plants), and also for experiments where proteins from multiple species are present (so-called "metaproteomics") or where the proteome is changing (e.g. during viral infection). There are also a number of additional spin-off benefits, such as the ability to find protein variants that are specific to the individual under study (i.e. not present in any reference proteome), and the possibility of annotating genomes. Currently, the main challenge of the PIT approach is the complexity of the data analysis necessary to integrate the transcriptomic and proteomic data and report results in a way that is useful to biologists. The aim of this proposal is therefore to put together a suite of easy-to-use, connected software tools that enable the typical bench scientist to perform the necessary data analysis within an acceptable timescale with no bioinformatics support. To help achieve this we plan to implement the software within the popular Galaxy framework. Galaxy provides an easy-to-use web browser interface and can take advantage of powerful computing resources.

Impact Summary

As a fundamental methodology that substantially improves our ability to study proteins and understand genomes, the potential beneficiaries of the PIT approach that will be facilitated by the proposed software development are broad and numerous. As already mentioned, the concept of PIT analysis emanated from the infectious diseases community, following the realisation that traditional proteomics was not well suited to many of the studies that they were undertaking, especially with non-model organisms such as mosquito and bat. If we take virology as an example where PIT can bring new insights, the improved understanding of viruses that PIT can provide clearly has great potential to impact on human health, animal welfare, public policy and the economy. This is just one example among many others, including food security and industrial biotechnology, which are of intense interest to both academia and industry (as evidenced by the supplied letters of support). This proposal will also help bolster the UK's position in proteomics research. Despite proteomics being a very competitive area, BBSRC funding has helped the UK to establish several internationally competitive research groups, both in laboratory proteomics and proteome informatics. This has led to commercial activities, including the formation of the very successful proteomics software companies Matrix Science and Nonlinear Dynamics. With continued investment we see no reason why the UK cannot retain its leading position in proteomics, and in this case also help bolster our expertise in the increasingly important area of data integration. We expect some benefits to be realised within the lifetime of the project itself, as researchers at Bristol are already doing PIT analysis and our proposed software will be made available to them as it is developed. This will allow them to get more out of their data in shorter timescales (as well as helping us refine our software in response to their feedback). Scientific benefits will extend further once the PIT workflows are made generally available towards the end of the project, and any societal benefits that follow from novel scientific insights would become apparent in subsequent years.

Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme