Award details

ProteoFormer - a software toolkit for top-down proteomics

Reference BB/L018462/1
Principal Investigator / Supervisor Professor Andrew Jones
Co-Investigators / Co-Supervisors Professor Claire Eyers
Institution University of Liverpool
Department Institute of Integrative Biology
Funding type Research
Value (£) 110,220
Status Completed
Type Research Grant
Start date 01/07/2014
End date 30/11/2015
Duration 17 months

Abstract

Most proteomics workflows employ a "bottom-up" method, in which the identity and intensity of peptide ions measured by mass spectrometry (MS) are used as a proxy measure for the parent protein, since signals from analysis of intact proteins were too complex to interpret on older, low-resolution instruments. Recent improvements mean that the isotope pattern of large molecular weight species can now be resolved, enabling direct analysis of intact proteins - "top-down" proteomics. A major advantage of top-down analysis is the ability to characterise closely related proteins (proteoforms), resulting from paralogs, alternative splicing or different post-translational modifications (PTMs) on proteins. MS data from intact proteins can be hugely complex to interpret, and the majority of software tools have been developed for peptide data. In this project, we will develop software to tackle two related problems. Firstly, ionisation of intact proteins generates high-charge state ions, often with overlapping signals. We have previously developed the seaMass software, which is able to identify, separate and de-charge overlapping isotope patterns in complex LC-MS peptide data robustly under ion counting noise. We will adapt seaMass for detecting high-charge state, intact protein data, producing de-charged MS1 and MS2 spectra. Secondly, the traditional sequence database search used in peptide identification (in which modifications must be specified in advance) is not suitable for intact protein data. We are developing the Proteoformer software, which converts a fragment spectrum into a set of peptide regular expressions (pep regexes) that are used to search a protein database index. Following identification of the correct database protein, we can dynamically identify the PTMs and processing events that have occurred de novo. Both software packages will be provided with a graphical interface through our Proteosuite software, and will work with open data standards.
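
The arithmetic behind "de-charging" can be illustrated with a minimal sketch. It is not the seaMass algorithm, which the abstract describes as separating overlapping isotope patterns robustly under noise; it only shows how an observed m/z value and a charge state map back to a neutral mass, assuming positive-mode ions of the form [M + zH]^z+. All function names are hypothetical, and the example mass is an arbitrary illustrative value.

```python
PROTON_MASS = 1.007276466   # Da, mass of one proton
ISOTOPE_SPACING = 1.00335   # Da, approximate 13C - 12C mass difference


def mz_from_mass(neutral_mass: float, charge: int) -> float:
    """m/z of an [M + zH]^z+ ion for a given neutral mass and charge."""
    return neutral_mass / charge + PROTON_MASS


def neutral_mass_from_mz(mz: float, charge: int) -> float:
    """Invert mz_from_mass: recover the neutral mass from m/z and charge."""
    return charge * (mz - PROTON_MASS)


def charge_from_isotope_spacing(mz_a: float, mz_b: float) -> int:
    """Estimate charge from two adjacent isotope peaks.

    Adjacent isotope peaks of the same ion are separated by roughly
    ISOTOPE_SPACING / z in m/z, so the spacing reveals the charge.
    """
    return round(ISOTOPE_SPACING / abs(mz_b - mz_a))


def decharge(peaks):
    """Map a list of (mz, charge) pairs to neutral masses."""
    return [neutral_mass_from_mz(mz, z) for mz, z in peaks]


if __name__ == "__main__":
    # Hypothetical intact protein observed at three charge states.
    mass = 16950.0
    peaks = [(mz_from_mass(mass, z), z) for z in (18, 19, 20)]
    for (mz, z), m in zip(peaks, decharge(peaks)):
        print(f"z={z:2d}  m/z={mz:.4f}  neutral mass={m:.4f}")
```

Signals from different charge states of the same protein collapse onto a single neutral mass after this conversion, which is why a de-charged spectrum is simpler to interpret than the raw one.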

Summary

Research in the life sciences is being driven forward by cutting-edge techniques for studying the molecules acting in cells. The functional molecules in cells are proteins - the expression, activity and interactions of particular proteins in any given cell define its structure and what it is capable of doing. As one example of these techniques, we are often interested in studying what proteins are present in diseased cells and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the overall disease process and aid the search for new drug targets. The set of technologies used to study proteins on a large scale is called proteomics, as the complete set of proteins present in a given cell or sample of interest is described as the "proteome". The main method used in proteomics is mass spectrometry (MS). MS is essentially a technique for calculating the molecular weight of molecules, and it can also provide information about the abundance of a given molecule in the sample. MS has been available for many decades, but recent years have seen huge strides in the ability to perform high-throughput workflows, studying tens of thousands of molecules in any given sample, and great advances in instrument resolution, such that molecules of almost identical mass can be differentiated. The majority of proteomics workflows perform a step of protein digestion prior to MS. The result of digestion is that all the proteins in the sample are broken up in a predictable manner into small chains, called peptides. This step has become common because peptides, having lower mass, are easier to analyse by MS and produce data that are simpler to interpret. The set of peptides is then identified and often quantified across different conditions (e.g. disease versus healthy cells).
We often know in advance that a peptide was derived from a specific parent protein, and so we can use the identity and quantification of that peptide as a proxy measure for the behaviour of the protein across our samples of interest; such workflows are called "bottom-up". However, while bottom-up studies dominate the field, they have a significant drawback. Proteins are molecules that tend to exist in multiple different, related forms in cells, which have been called proteoforms. Proteins may become activated or deactivated by the addition or removal of one or more chemical groups, called post-translational modifications (PTMs). Many important diseases have been shown to be associated with dysfunction of PTMs, including cancer and neurodegeneration. Bottom-up studies have a severely limited ability to study these important proteoforms, since the small pieces of a protein (the peptides) cannot tell us which proteoform we are looking at, only that one of the many possible proteoforms is present. Our groups are pioneering techniques to study intact proteins by MS, in so-called top-down studies. Recent developments in MS instruments enable much larger molecules to be studied effectively, opening up the possibility of studying each proteoform in the state in which it is present in the cell. However, MS produces very complex and often overlapping signals from closely related proteins, which are difficult to interpret. Software for identification and quantification of peptide signals from MS data has been developed for over twenty years, but research on interpreting protein-level data is quite limited. We specialise in developing software for proteomics and we are devising a new algorithm that, firstly, will simplify the complex MS data signal coming from the instrument, and secondly, will confidently identify the proteins present, including all PTMs on the different proteoforms we detect.
The software will help to advance the field of top-down proteomics, making this a more broadly accessible method, and thus improving our ability to study proteins in cells, for a wide range of applications.

Impact Summary

Our developments will have impacts through the following routes:
- The development of seaMass-TD and Proteoformer will make it more straightforward for top-down analysis to be performed on a much wider range of instruments, producing high-quality and reliable results. This will open up this important technology for studying proteins in their native state in the cell, for basic and applied research across numerous domains.
- Our software has the potential to increase sales of mass spectrometers capable of performing top-down analysis. In particular, locally we are working with Waters to develop software compatible with their data, since current software has typically been designed for mass spectrometers produced by Thermo.
- We will explore routes for commercialisation of Proteoformer and seaMass-TD, as discussed in the Pathways to Impact document.
- We will work with international consortia aimed at data sharing and standardisation in proteomics - the Proteomics Standards Initiative (PSI), ProteomeXchange and EBI's PRIDE database - to ensure that current standards can appropriately handle top-down data, and that researchers can submit data to the leading public repositories for community re-analysis.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme