Award details

Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics

Reference	BB/M023818/1
Principal Investigator / Supervisor	Professor Andrew Jones
Co-Investigators / Co-Supervisors	Professor Robert Beynon
Institution	University of Liverpool
Department	Institute of Integrative Biology
Funding type	Research
Value (£)	266,753
Status	Completed
Type	Research Grant
Start date	30/09/2015
End date	31/03/2019
Duration	42 months

Abstract

Tandem Mass Spectrometry (MS/MS) coupled to Liquid Chromatography (LC) is the primary technique used in proteomics. The most common approach is LC separation of tryptic fragments derived from a proteome digestion, followed by tandem MS of the peptides. This entire workflow is conceived as a series of discrete steps, some chemical, some instrumental, some informatics and some statistical. Existing software concentrates on subcomponents of the workflow, and comprises a series of deterministic, self-contained steps. No methods propagate uncertainty from one step to the next, nor do they borrow strength either within or across steps - this starkly contrasts with recent advancements in processing RNA-seq data. We propose to translate the whole protein quantification pipeline into a rigorous statistical framework underpinned by Bayesian methodology. The new framework will enable us to integrate evidence across all experimentally acquired datasets, and allow us to borrow strength from unused structure within a proteomics workflow, including digestion dynamics. Our proposed pipeline consists of three synergistic developments: (1) Utilisation of all unidentified (peptide) features, as well as identified features, to infer the most likely mixture of proteins present in a sample; (2) Differential quantification of complex mixtures of known proteoforms; (3) Discovery of unknown proteoforms and all modifications (PTMs) carried by their quantification signatures. These advancements will elicit a step-change in quantification sensitivity and interpretation at the proteoform level for the first time. We will disseminate this end-to-end analysis solution within the user-centric, standards-compliant ProteoSuite package, and as a Galaxy workflow for high-throughput pipelines.

Summary

Research in the life sciences is being driven forward by cutting-edge techniques for studying the molecules acting in cells. The functional molecules in cells are proteins - the expression, activity and interactions of particular proteins in any given cell define its structure and what it is capable of doing. As one example, we are often interested in studying what proteins are present in diseased cells and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the disease process and guide the search for new drug targets. The technologies used to study proteins on a large scale are collectively called proteomics. The main method used in proteomics is mass spectrometry (MS), which can measure the molecular weight and abundance of molecules. The majority of proteomics workflows perform a step of protein digestion prior to MS. The result of digestion is that all the proteins become broken up into small chains, called peptides. This step has become common, because peptides are easier to analyse by MS, due to their lower mass, producing simpler data to interpret. The set of peptides is then identified and often quantified across different conditions (e.g. disease versus healthy cells). We often know that a peptide was derived from a specific parent protein, and so we can use the identity and quantification of that peptide as a proxy measure for the behaviour of the protein across our samples of interest, and as such these workflows are called "bottom-up". One issue with the digestion of proteins is that some proteins break down more quickly than others - for some proteins/peptides digestion is incomplete, producing unreliable quantification data, which at present is not fully understood or compensated for by the analysis software. While bottom-up studies dominate the field, they currently have several significant drawbacks.
Proteins are molecules that tend to exist in multiple different, related forms in the cells, which have been called proteoforms - through the gene encoding the protein being processed in different ways (alternative splicing), or through the addition of functionally important chemical groups, called post-translational modifications (PTMs). Since only one or a few peptides are different between different proteoforms, they are far more challenging (or impossible with current techniques) to quantify accurately. Current practice in proteomics generally ignores this problem - losing vast amounts of data about the true nature of the molecules in the system. There are MS techniques for studying intact proteins and their proteoforms (called top-down methods), but at present these do not function in high-throughput mode, and thus are typically used for targeted studies on a small number of proteins. In order to make a step change in the quantification and discovery of proteoforms, we will develop an integrated suite of analysis techniques using a powerful statistical technique called Bayesian modelling. With Bayesian approaches, the problem at hand is simulated many thousands of times probabilistically. By interpreting the range of different conclusions reached, we can get an idea of how certain we are about the results, which is crucial given the subtle nature of the evidence within the MS datasets. In essence, our computational techniques will deliver the same quality of data about individual proteoforms (including novel discovery of PTMs) as top-down techniques, but based on bottom-up (peptide-focussed) workflows - thus, for the first time, enabling highly accurate proteoform-level discovery and quantification in high-throughput mode. To ensure rapid and wide uptake of our new methods, we will integrate our advancements into a freely available software suite we are developing, ProteoSuite.
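
As a purely illustrative sketch of the idea of "simulating many thousands of times" and reading off the certainty of a result, the Python example below draws samples from the posterior distribution of one peptide's mean log2 abundance ratio (disease versus healthy). It is not the project's model: the normal-normal conjugate form, the function names, the noise and prior settings, and the example measurements are all assumptions made for illustration.

```python
import math
import random
import statistics


def posterior_samples(log_ratios, noise_sd=0.5, prior_mean=0.0,
                      prior_sd=2.0, n_draws=10000, seed=0):
    """Sample the posterior of the mean log2 ratio.

    Assumes (hypothetically) normally distributed measurement noise with
    known standard deviation and a normal prior on the mean, so that the
    posterior is also normal and can be sampled directly.
    """
    rng = random.Random(seed)
    n = len(log_ratios)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / noise_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * statistics.fmean(log_ratios))
    post_sd = math.sqrt(post_var)
    return [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]


def credible_interval(samples, level=0.95):
    """Return the central interval containing roughly `level` of the samples."""
    ordered = sorted(samples)
    lower_index = int((1 - level) / 2 * len(ordered))
    upper_index = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lower_index], ordered[upper_index]


# Made-up log2 ratios for one peptide across four replicate comparisons.
peptide_log_ratios = [0.9, 1.2, 0.7, 1.1]

draws = posterior_samples(peptide_log_ratios)
low, high = credible_interval(draws)
# Fraction of simulated outcomes in which the peptide is more abundant.
prob_up = sum(d > 0 for d in draws) / len(draws)

print(f"95% credible interval for log2 ratio: [{low:.2f}, {high:.2f}]")
print(f"Probability the peptide is up-regulated: {prob_up:.3f}")
```

The spread of the sampled values, summarised here by the credible interval and the fraction of draws above zero, is what expresses how certain the conclusion is; a fuller model would sample jointly over many peptides and proteoforms so that each can borrow strength from the others.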

Impact Summary

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics vendors. The proposed Bayesian Quantitative Proteomics platform will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses amongst systems biology researchers. There is potential for direct impact through the licensing of some or all of our software tools developed, as we are working towards with Waters Inc. for other packages. There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our aim is to establish a powerful platform for differential proteoform analysis and discovery enabling a wealth of new investigations in the biological sciences and translational medicine. Due to its success and further substantial promise, the BBSRC, UK research councils and industry have invested greatly in the systems biology approach. The potential improvements yielded by our workflow will therefore have a clear dissemination route to the public through reduced resources, costs and overheads required for discoveries realised with systems approaches in environmental, biological and biomedical science, and the characterisation of those discoveries. The PDRAs employed on this grant will benefit significantly from exposure to the wealth of proteome informatics expertise we will bring together, particularly since the PDRAs will be encouraged to play a significant role in public dissemination.
All staff will benefit through being engaged within an international, cutting-edge, interdisciplinary project.

Committee	Research Committee C (Genes, development and STEM approaches to biology)
Research Topics	Technology and Methods Development
Research Priority	X – Research Priority information not available
Research Initiative	X - not in an Initiative
Funding Scheme	X – not Funded via a specific Funding Scheme

Associated awards:

BB/M024954/1 Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics

BB/M024954/2 Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics
