Award details

Automated identification of optimal data-specific organelle clusters using freely available protein annotations

Reference BB/L018497/1
Principal Investigator / Supervisor Dr Laurent Gatto
Co-Investigators / Co-Supervisors Professor Kathryn Lilley
Institution University of Cambridge
Department Biochemistry
Funding type Research
Value (£) 114,678
Status Completed
Type Research Grant
Start date 01/05/2014
End date 31/10/2015
Duration 18 months

Abstract

Localisation of proteins inside cells is of paramount importance to study their function, refine our comprehension of sub-cellular processes and organisation, and understand the effect of perturbations at the sub-cellular level. Various dedicated experimental designs based on biochemical separation and quantitative mass-spectrometry have been described and refined over the years. The major breakthrough in organelle proteomics data analysis has been the application of state-of-the-art supervised machine learning (ML) techniques. These techniques utilise the quantitative profiles of the proteins and permit optimal classification of proteins of unknown localisation based on the definition of sub-cellular markers. These markers represent proteins of known localisation, identified through manual database mining, literature search and, most crucially, expert curation. However, manual curation of a dataset containing thousands of proteins, although currently the most reliable solution, is an extremely time-consuming task. Furthermore, the quest for tens of highly reliable markers per organelle favours large, well characterised organelles at the expense of smaller, less studied compartments, leading to systematic under-representation of the true organelle diversity in the experimental data. Our project proposes a major shift in the analysis of organelle proteomics data by abandoning supervised ML, which requires rigid sets of highly reliable markers, and instead employing unsupervised and semi-supervised approaches that rely on the vast amount of freely available database annotations such as, for example, the Gene Ontology. These novel approaches will allow us to (1) automate the analysis of our datasets without the expensive manual curation and (2) assess the true cellular diversity that underpins such experiments at a much finer scale. These techniques will be made accessible in the frame of the open source pRoloc framework for organelle proteomics data analysis.
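
As a rough illustration of the marker-based supervised classification described in this abstract, the following minimal Python sketch assigns a protein of unknown localisation to the organelle whose marker profiles it most resembles. It is only a sketch under stated assumptions: the nearest-centroid rule is a deliberately simple stand-in rather than the method used in the project or in pRoloc, and the function names, organelle labels and profile values are all hypothetical.

```python
from math import sqrt

def centroid(profiles):
    """Mean quantitative profile of a list of marker profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(profile, markers):
    """Return the organelle whose marker centroid is closest to `profile`.

    `markers` maps an organelle name to the profiles of its marker proteins.
    Nearest-centroid is an illustrative assumption, not the project's method.
    """
    centroids = {org: centroid(ps) for org, ps in markers.items()}
    return min(centroids, key=lambda org: euclidean(profile, centroids[org]))

# Hypothetical toy data: each profile is a protein's relative abundance
# across three fractions of a separation gradient.
markers = {
    "mitochondrion": [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]],
    "ER": [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1]],
}
unknown = [0.15, 0.25, 0.60]
print(classify(unknown, markers))
```

The sketch also makes the limitation discussed above concrete: a protein can only ever be assigned to an organelle that already has curated markers, so compartments without markers are invisible to the classifier.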

Summary

Organelle proteomics is the systematic study of proteins and their assignment to sub-cellular compartments such as organelles and macro-molecular complexes. It is a field growing in importance and popularity, and over the last few years it has gained a large amount of attention due to the role played by organelles in carrying out defined cellular processes. The most information-rich datasets are generated using high accuracy mass-spectrometry (MS), a technique that allows the proteome content of complex biological samples to be identified and quantified. These datasets are high-quality, rich sources of data that have been mined using a variety of robust supervised statistical machine learning (ML) methods, which have been shown to yield valuable protein-organelle predictions (BBSRC: BB/G024618/1 and BB/H024247/1). These classification methods require a set of tens of expert-curated ground-truth marker proteins of known localisation and then match proteins of unknown localisation to organelles based on the resemblance of their MS data to those of the marker proteins. However, there are still inherent issues that limit the optimal application of such contemporary classification methods: (1) the limited number of organelle markers and the reliance on time-consuming manual curation and (2) the limited number of organelle classes, which systematically underestimates the sub-cellular diversity recorded in the datasets. In this proposal we aim to improve protein-organelle association by applying different state-of-the-art methods that remove the need for ground-truth marker proteins and accurately assign proteins to a broader set of sub-cellular compartments. These unsupervised approaches will look specifically for patterns in the organelle proteomics data. We will also make use of vast amounts of freely available protein annotation data such as the Gene Ontology.
These annotations, while prone to erroneous or misleading information, are available for tens of thousands of proteins, describing all organelles identified so far. The amount of annotation data makes it possible to overcome its uncertainty and to investigate the sub-cellular environment at a much more meaningful level of diversity. In addition, the proposed methods will allow complete automation of the data analysis, thus permitting the treatment of more and bigger datasets. The development of a framework that will support this annotation to guide the extrapolation and elucidation of patterns in the MS data will lead to the creation of optimal organelle proteomics datasets, which will be deposited in a public access proteomics data repository through the main ProteomeXchange submission portal. These tools will be made freely available as open-source software for the use of the whole proteomics community. The work proposed in this grant will be implemented by a multidisciplinary team bringing together expertise in state-of-the-art mass-spectrometry-based proteomics approaches (KSL), database annotation (CD), contemporary pattern recognition methods (AP, TB, SBH and LG), computational bioinformatics and code development (LG) and applied mathematics (LMS). LG, KSL and LMS have worked together previously on organelle proteomics grants that resulted in the release of the current state-of-the-art toolkits for organelle proteomics data analysis.

Impact Summary

Who will benefit from this research?

The developments proposed in this project will benefit the organelle proteomics community in particular, as we will develop and share improved tools to analyse such data. The proteomics field as a whole will also benefit, as our methods and software, although focused on organelle proteomics data, have a much wider scope and can be applied in other fields. Computational biologists will also benefit from the open-source organelle proteomics analysis methods and the quality software that will be distributed to the wider community. Cell biologists, both academic and within the pharmaceutical sector, will also benefit immensely, as this proposal underpins the interface between modern omics technologies and more classical cell biological methodologies. Our work is targeted at experimentalists who will use our tools to analyse their data, as well as computational scientists and developers who want to re-use or adapt our methods and software infrastructure for new projects and topics.

How will they benefit from this research?

The toolkit will support deeper mining of proteomics data produced by widely used gradient-based proteomics approaches, enabling unprecedented insight into the underlying sub-cellular diversity of these data. In addition, it will provide a benchmark upon which to add new data analysis methods as the technology and data annotation progress. The sophisticated statistical machine learning methods will be made available for the statistical programming environment R and the Bioconductor project and will inter-operate with existing complementary software. Our novel methods are expected to be applicable in other omics areas of research due to the inherently cross-disciplinary nature of the computer science, mathematics and machine learning that underpin many areas of computational biology.
Lastly, the project will contribute knowledge and scientific advancement through the dissemination of data and the improvement of analyses of complex multivariate data, facilitating the interpretation and understanding of relevant biological processes. Fully characterised organelle proteomics datasets will be deposited in publicly accessible databases (via the ProteomeXchange portal) upon publication of the peer-reviewed research outputs, and the detailed analysis methodologies will be documented and distributed with software releases to facilitate application of our methods to new datasets and use cases. The research staff will benefit from the multi-disciplinary research environment and extend their national and international research network through on-going collaborations. In addition to the benefits of improved tools and data, the academic beneficiaries will also be invited to workshops that will be organised in the frame of the European FP7 project to promote our approaches.

What will be done to ensure that they have the opportunity to benefit from this research?

The algorithms and tools developed in this proposal will be implemented in the R statistical programming environment (www.r-project.org) and will be deposited to the Bioconductor suite of bioinformatics software. The algorithms will be implemented as independent modules that will be contributed to, and compatible with, the current pRoloc analysis framework (developed by LG and LMS in BBSRC: BB/H024247/1 and BB/G024618/1), to form a freely available open-source toolkit for the analysis of organelle proteomics data. It is envisaged that manuscripts describing this work will be submitted to high impact journals with a large general readership, such as Nature Methods and Nature Biotechnology. KSL, LG and CD are invited to give numerous talks at top proteomics and computational conferences worldwide, and will endeavour to publicise the work described here at such events.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme