Award details

Data Fusion and Inductive Transfer for Organelle Proteomics

Reference BB/K00137X/1
Principal Investigator / Supervisor Professor Kathryn Lilley
Co-Investigators / Co-Supervisors Dr Sean Holden
Institution University of Cambridge
Department Biochemistry
Funding type Research
Value (£) 120,427
Status Completed
Type Research Grant
Start date 01/09/2012
End date 28/02/2014
Duration 18 months

Abstract

Organelle proteomics has attracted considerable attention in recent years because organelles carry out defined cellular processes. The datasets produced are, in general, rich, high-quality sources of data. Such datasets have been mined using a variety of statistical methods and are beginning to be explored using more robust machine learning (ML) methods, which have been shown to yield significant improvements in protein-organelle prediction quality over previous methods. However, inherent issues still limit the optimal application of many contemporary ML methods, namely the limited number of (1) proteins and features/channels available, (2) organelle markers and (3) organelle classes. In addition to the high-quality data produced by experimental high-throughput mass spectrometry-based methods, we wish to develop a toolkit that will enable the exploitation of other data sources from which to predict protein localisation, such as protein amino acid sequences, protein-protein interaction partners, conserved signal peptide motifs and cell biological data, to strengthen the predictions made using organelle proteomics approaches. We will:
1. Review and comparatively assess current ML methods that have been applied independently to experimental and sequence data to predict protein-organelle localisation.
2. Apply data fusion methods to combine experimental data, e.g. LOPIT data, with protein sequences from public annotation repositories, to augment the information used to create protein-organelle classifiers via contemporary ML methods. Further data, e.g. protein interactions and imaging data, will also be considered.
3. Apply the inductive transfer methodology, which models protein-organelle assignments independently on different data sources, to subsequently reinforce the classification predicted from experimental data.
4. Implement the above innovative approaches in pRoloc, an open-source framework for organelle data analysis.
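As a rough illustration of the data fusion aim above, the sketch below standardises two feature sources, concatenates them column-wise (simple "early fusion") and classifies unlabelled proteins by their nearest class centroid, using marker proteins as training labels. It is a minimal sketch under stated assumptions, not the method used in the project and not the pRoloc API: the function names (zscore, fuse, nearest_centroid), the choice of a nearest-centroid classifier and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def zscore(x):
    """Standardise each column to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    return (x - x.mean(axis=0)) / sd

def fuse(*sources):
    """Early fusion: standardise each source, then concatenate its columns."""
    return np.hstack([zscore(s) for s in sources])

def nearest_centroid(train_x, train_y, test_x):
    """Assign each test row to the class whose mean profile is closest."""
    labels = np.asarray(train_y)
    classes = sorted(set(train_y))
    centroids = np.vstack([train_x[labels == c].mean(axis=0) for c in classes])
    # Euclidean distance from every test row to every class centroid
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]

# Made-up toy data: 6 proteins; the first 4 are organelle markers.
gradient = np.array([[0.90, 0.10], [0.80, 0.20], [0.10, 0.90],
                     [0.20, 0.80], [0.85, 0.15], [0.15, 0.85]])  # fraction profiles
sequence = np.array([[1.00], [0.90], [0.00],
                     [0.10], [0.95], [0.05]])  # hypothetical sequence-derived score
markers = ["mitochondrion", "mitochondrion", "ER", "ER"]

fused = fuse(gradient, sequence)
predicted = nearest_centroid(fused[:4], markers, fused[4:])
print(predicted)
```

Standardising each source before concatenation keeps a source with larger numeric ranges from dominating the distance calculation; a real analysis would also need to weight sources and validate against held-out markers.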

Summary

A protein must be localised to its intended sub-cellular location in order to function and to interact with its correct binding partners and substrates. Determining sub-cellular protein location is particularly desirable to biologists for two reasons. First, it can help elucidate a protein's role within the cell, as proteins are spatially organised according to their function and the specificity of their molecular interactions. Second, it refines knowledge of cellular processes by pinpointing certain activities to specific organelles. Organelle proteomics, the systematic study of proteins and their assignment to organelles, is a rapidly growing field, and many high-throughput approaches for determining protein sub-cellular locations have been developed to date. One reason for the growth of the field is the significant correlation between disease classes and sub-cellular localisations. It is well established that loss-of-function effects in disease can be attributed to abnormal protein localisation. For example, in many types of carcinoma cells, nuclear-cytoplasmic transport, which is essential for normal cell function, has been found to be defective as a result of abnormal protein localisation. The datasets produced are, in general, rich, high-quality sources of data. They have been mined using a variety of statistical methods and are beginning to be explored using more robust machine learning (ML) methods, which have been shown to yield significant improvements in protein-organelle predictions over previous statistical methods. However, inherent issues still limit the optimal application of such contemporary ML methods: (1) the limited number of proteins and features/channels available, (2) the limited number of organelle markers and (3) the limited number of organelle classes.
In this proposal we aim to improve protein-organelle association by applying state-of-the-art ML methods that exploit all available sources of information to assign a protein accurately to its sub-cellular compartment. In addition to the high-quality data produced by experimental (high-throughput mass spectrometry-based) methods, we wish to develop a toolkit that will enable the exploitation of other data sources from which to predict protein localisation, such as protein amino acid sequences, protein-protein interaction partners, conserved signal peptide motifs and imaging data, to strengthen the predictions made on high-quality gradient-based data. Many researchers have found that, in many situations, training statistical models on multiple related data sources is better than training models on each data source individually. The development of a framework that enables the fusion and transfer of knowledge from multiple data sources will lead to the creation of optimal organelle proteomics datasets, which will be deposited in a public-access proteomics data repository (PRIDE). The toolkit will be freely available to the whole proteomics community. The work proposed in this grant will be carried out by a multidisciplinary team bringing together expertise in state-of-the-art gradient-based proteomics approaches (Lilley), contemporary machine learning methods (Holden and Trotter), bioinformatics and code development (Gatto), and applied mathematics (Simpson). The majority of this team have worked together previously on organelle proteomics bioinformatics projects that have resulted in the release of new toolkits for organelle proteomics data analysis.

Impact Summary

Organelle proteomics has attracted considerable attention in recent years because organelles carry out defined cellular processes. The datasets produced are, in general, rich, high-quality sources of data. Such datasets have been mined using a variety of statistical methods and are beginning to be explored using more robust machine learning (ML) methods, which have been shown to yield significant improvements in protein-organelle prediction quality over previous methods. However, inherent issues still limit the optimal application of many contemporary ML methods, namely the limited number of (1) proteins and features/channels available, (2) organelle markers and (3) organelle classes. In addition to the high-quality data produced by experimental high-throughput mass spectrometry-based methods, we wish to develop a toolkit that will enable the exploitation of other data sources from which to predict protein localisation, such as protein amino acid sequences, protein-protein interaction partners, conserved signal peptide motifs and high-quality imaging data, to strengthen the predictions made on high-quality gradient-based data. The work proposed in this grant will be carried out by a multidisciplinary team bringing together expertise in state-of-the-art gradient-based proteomics approaches (KSL), contemporary machine learning methods (SBH, MWBT), bioinformatics and code development (LG), and applied mathematics (LMS).
Who will benefit from this research?
These developments will benefit the proteomics field as a whole. Computational biologists will also benefit from the open-source organelle proteomics analysis methods. Cell biologists, both academic and within the pharmaceutical sector, will also benefit, as this proposal underpins the interface between modern "omics" technologies and more classical cell biological methodologies.
How will they benefit from this research?
The toolkit will not only ensure optimal mining of proteomics data produced by widely used gradient-based proteomics approaches but will also provide a benchmark against which new methods can be added as the technology progresses. The sophisticated ML methods will be made available for the statistical programming environment R, and such methods are expected to be applicable in other "omics" areas of research owing to the inherently cross-disciplinary nature of the computer science, mathematics and machine learning that underpin many areas of computational biology. Lastly, fully characterised organelle proteomics datasets will be deposited in publicly accessible databases (PRIDE).
What will be done to ensure that they have the opportunity to benefit from this research?
The algorithms and tools developed in this proposal will be implemented in the R statistical programming environment (www.r-project.org) and will be deposited in the Bioconductor suite of bioinformatics software. The algorithms will be implemented as independent modules compatible with the pRoloc analysis framework (developed by the PI's laboratory), to form a freely available open-source toolkit for the analysis of organelle proteomics data. It is envisaged that manuscripts describing this work will be submitted to high-impact journals with a large general readership, such as Nature Methods and Nature Biotechnology. The PI is invited to give numerous talks at the major proteomics conferences worldwide and will endeavour to publicise the work described here at such events. Dr Lilley and Dr Gatto also teach on several proteomics workshops (organised by the Biochemical Society, EMBO and the Bioconductor consortium) in which the tools they develop are disseminated to participants.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority Technology Development for the Biosciences
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme