Award details

An Integrated Open Source Software Resource for Quantitative Proteomics

Reference	BB/I000909/1
Principal Investigator / Supervisor	Mr Henning Hermjakob
Co-Investigators / Co-Supervisors
Institution	EMBL - European Bioinformatics Institute
Department	Proteomics Services Team
Funding type	Research
Value (£)	228,575
Status	Completed
Type	Research Grant
Start date	01/11/2010
End date	31/10/2015
Duration	60 months

Abstract

The aim of the project is for a consortium of proteome informatics experts from the EBI and the Universities of Manchester, Liverpool and Cranfield to deliver an open-source software workbench for quantitative proteomics which makes it simple for bench scientists to analyse their data using state-of-the-art methods, submit to public repositories and re-analyse public data sets. We will work with the Proteomics Standards Initiative to define the new standard for quantitative data (mzQuantML) and provide on-going support for standards for mass spectra (mzML), transition design (TraML) and identifications (mzIdentML). These standards will underpin the software toolkit, which will be entirely independent of the analysis platform used. The software toolkit will have a common user interface, integrating a number of existing and new resources, developed at different sites. The software will provide high-quality identification data through the integration of algorithms previously developed for using multiple search engines, re-scoring identifications and inferring the presence or abundance of proteins where there is ambiguity. We will incorporate standards-compliant quantitation tools based on Cranfield's existing X-Tracker platform and Manchester's SILACanalyser. We also commit to implementing improved algorithms and new methods as they appear in the literature. We will integrate Cranfield's MRMaid transition design into PRIDE and provide support for TraML in MRMaid and X-Tracker. The software will incorporate a statistical analysis module, allowing the user to interpret the effects of using different methods and optimise the parameters of the algorithms. It will provide a simple mechanism for upload of a complete experiment to PRIDE, including spectra (mzML), identifications (mzIdentML) and quantitations (mzQuantML) within a wrapper capturing experimental metadata. The software will also provide support for reanalysis of reposited datasets in PRIDE.

Summary

In a scientific sense, a living system such as a plant, animal, organ or cell can be considered to be a complex machine. The basic components that make up this machine are molecules, of which there are several main types - genes, proteins and metabolites. To understand how these molecules work together to produce the complex living systems that we see around us we need to have analytical methods capable of detecting and quantifying these molecules. This proposal deals with one aspect of this analysis - proteomics - the science of identifying and quantifying proteins. The most popular approach in proteomics is to simplify a sample by separating all the proteins, digesting those proteins with an enzyme into much smaller components (peptides) and then analysing all these peptides with mass spectrometry (MS). Identification of proteins can then be carried out by computational analysis of the mass spectrum acquired from each peptide - peptides are usually mapped to proteins by comparison of observed spectra to those in a database. Protein quantity is typically calculated from mass spectral peak intensities, or by simply considering how many peptides have been observed from each protein. Within this general analytical schema there are a great many variations according to the laboratory that is doing the analysis, the samples being analysed, or the overall aim of the experiment. Factors that may differ between experimental protocols include the protein separation method (some people use gels, others liquid chromatography), different types of mass spectrometry, different search databases (some are simulated from protein sequences, others are libraries of experimentally acquired spectra), and different methods of quantitation (for instance there are various methods of labelling which are used to distinguish peptides from different samples during the analysis). This plethora of quantitative proteomic methods has two major disadvantages for proteomics practitioners. 
Firstly, it is a challenge to devise standard data formats for sharing proteomic data because there are so many experimental parameters to capture and different parameters are required for different protocols. Secondly, for each different protocol it can be necessary to perform a different computational analysis of the data - this has led to the development of many different software tools, particularly for quantitative proteomics in which each tool can be specific to a particular type of mass spectrometer, a particular type of labelling or a particular quantitation algorithm. The resulting array of incompatible software is bewildering to the typical proteomics practitioner, and because effort is spread across many tools there is limited resource to optimise the robustness and usability of each individual tool. In the work described in this proposal the four main centres of proteome informatics expertise in the UK aim to work together to develop an integrated suite of analysis and statistical processing tools for all popular variants of quantitative proteomics. The software will cover the whole range of quantitative proteomic data analysis, from extracting abundance data from the original MS spectra through to statistical analysis and deposition of results into the public proteomic data repository, PRIDE. A key component needed to get this working will be standard data formats to link each step of the data analysis. We will therefore be making a substantial contribution to the completion of the necessary quantitative data standards as part of this project. Overall, we aim to produce a robust, easy-to-use, standards-compliant software suite that will prove invaluable for proteomics practitioners seeking to analyse and share their quantitative proteomic data, regardless of the specific quantitative protocol they use.

Impact Summary

The major direct beneficiaries beyond academic researchers are pharmaceutical and biotechnology companies engaged in proteomic research, vendors of mass spectrometers, and companies involved in developing software for proteomic data analysis. As evidenced by our letters of support from industrial collaborators, there is considerable interest in proteomics. Many pharmaceutical companies have now outsourced their proteomic research or use data in the public domain in their analyses, not least from the various HUPO projects (as captured in PRIDE). The industry therefore stands to benefit significantly if we can bring about a major increase in the amount of quantitative data deposited in public databases, with sufficient metadata to draw conclusions about its validity and reliability. We also anticipate that this application will move the field of proteomic analysis forward, such that it is simple to analyse quantitative data with a variety of tools, regardless of the experimental method employed. This improved accessibility and increased confidence in obtained results may have the effect of rejuvenating proteomics within industrial settings, overcoming existing concerns about achieving reproducible results. We predict that some of this impact will be realised within the five-year duration of the project, helping to cement the UK's reputation as one of the world leaders in proteomics. In the longer term, findings emanating from proteomic analyses have great potential to improve health and quality of life, presenting economic opportunities across a broad range of sectors. In the health sector alone, quantitative proteomics can be used in discovery pipelines for new drugs and vaccines, or for biomarkers of disease states that can be used as the basis of new diagnostic products. While nucleic acid-based techniques are currently preferred due to a perceived greater reproducibility and simpler analysis, the fact remains that proteins are the functional molecules in cells.
Indeed, some clinically relevant tissues are only really accessible via proteomics, such as plasma, where there is no RNA component to study, yet this remains a relatively easy sample to obtain and analyse. Studies of comparative genomics or transcriptomics will only ever be indirect indicators of cellular states or processes. Countless studies have demonstrated poor correlations between the level of cellular RNA and the corresponding abundance of protein, and many signalling events are controlled by post-translational modifications. There is a growing realisation that a system-wide approach to biological research is required to understand the complexities of living organisms, and quantitative proteomics is a key tool in the armoury of those wishing to follow this approach. Over the course of the project we expect to see this systems approach gain momentum in application areas such as pharmaceuticals, bioprocessing, plant science (including research intended to mitigate climate change) and food science. As detailed in the case for support, we have a comprehensive plan for communication and engagement to ensure that the work carried out has the maximum possible impact on the beneficiaries mentioned above. In addition to passive dissemination via the project web site we will also be presenting our software at as many conferences and seminars as possible. Furthermore, the proposed training courses will present a perfect opportunity to engage the user community, particularly as we plan to hold at least one alongside the EBI/BSPR proteomics conference, which is traditionally attended by many industry-based scientists and vendors of proteomic hardware and software. It should also be noted that all four members of the consortium are already working with industrial partners and therefore have a direct route via which to publicise the work to these organisations and garner feedback to ensure that the software suite produced has maximum relevance to their needs.

Committee	Research Committee C (Genes, development and STEM approaches to biology)
Research Topics	X – not assigned to a current Research Topic
Research Priority	X – Research Priority information not available
Research Initiative	Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme	X – not Funded via a specific Funding Scheme