Award details

PROCESS - Proteomics data Collection, Software and Standards to support open access and long term management of data

Reference BB/K020145/1
Principal Investigator / Supervisor Mr Henning Hermjakob
Co-Investigators / Co-Supervisors Dr Juan Antonio Vizcaino
Institution EMBL - European Bioinformatics Institute
Department Proteomics Services Team
Funding type Research
Value (£) 284,794
Status Completed
Type Research Grant
Start date 05/12/2013
End date 04/12/2016
Duration 36 months

Abstract

The PROCESS project will provide the framework needed for sharing, public deposition and re-analysis of experimental proteomics data based on mass spectrometry (MS). The framework is developed in the context of the Proteomics Standards Initiative (PSI), in which the applicants take a lead role. It comprises the evolution and maintenance of standard data formats (mzML, mzIdentML, mzQuantML and mzTab), controlled vocabularies (PSI-MS and PSI-MOD) and software tools (PRIDE Inspector, PSI Validator, and Java programming interfaces to the standards). The standards will be developed to keep pace with evolving proteomics technology, including the capability to handle ambiguity in protein modification sites and protein grouping, data-independent acquisition in MS, and top-down proteomics. We will also develop international standards for compression of proteomics data sets to ensure that software performance and database architectures can scale up to the outputs of the newest instruments. The PRIDE database at the EBI is the primary public database for experimental proteomics data. It has recently initiated a (potentially huge) raw data archive service for the community, in which the PSI standards play a central role. The PROCESS outputs will ensure that the wider research community has long-term access to experimental proteomics data for re-use across a range of purposes. New modules will be created in the PRIDE Inspector software for data visualisation and analysis, and further development of the programming interfaces will help bioinformatics developers to build tools for re-analysis of data sets.

Summary

Proteomics is the science of studying large numbers of proteins - the key molecules that perform the functional roles in cells - and it is the natural partner to genomics, the study of the genes that encode those proteins. Proteomic studies are performed in laboratories all over the world, investigating disease processes, as well as the basic function of cells in humans, animals, plants and microorganisms. Proteins are challenging molecules to work with, but the technology of mass spectrometry (MS) has developed over many years, such that it is now possible to identify and quantify many hundreds, or even thousands, of proteins simultaneously in one type of sample compared with another. For example, one can test how a cell responds during a disease process compared to a healthy cell, allowing us to begin understanding the complex and dynamic molecular changes. MS can produce very large raw data sets, running to many gigabytes for a single sample. The raw files are processed, often in two stages by different software packages that first identify and then quantify the proteins that were analysed by the instrument. In the past, the raw data files were encoded in a data format specific to each instrument vendor, effectively tying scientists to the software provided with the instrument, which has not always been the optimal solution for analysing the data. A global consortium of academics and industrial researchers, called the Proteomics Standards Initiative (PSI), has collaborated to agree open access standards for storing raw data, protein identification data and quantitative data. These standards mean that open-source (and free) software can now be developed, capable of analysing data arising from any type of instrument.
It also means that data sets generated at high cost can be deposited in a public repository, such as the PRIDE database, hosted at the European Bioinformatics Institute, allowing their re-use for integration and interpretation of data from other studies, improving our knowledge about genomes and biological systems, and improving software tools in this field. In this project, we are requesting support so that the PSI standard data formats can continue to be maintained and evolve as new proteomics techniques are described in the literature. We are also developing interfaces so that other groups can develop new software packages easily, using the PSI standards as inputs and outputs. The standards are being used as part of a recently released raw data archive within PRIDE, which will store very large amounts of data for the entire scientific community. As such, we are working on software to make it straightforward for research labs to deposit and visualise data in PRIDE, as well as optimising the way in which data are compressed and stored, so that the system can scale for the needs of the next generation of instruments. These developments of PSI standards, software and PRIDE are essential for making sure that proteomics data are open access for all researchers and not restricted to the small number of laboratories with specialised, expensive software.

Impact Summary

The direct beneficiaries include:
- Vendors of commercial software, including UK SMEs Matrix Science and Nonlinear Dynamics, will benefit (see Pathways to Impact).
- Vendors of instruments will benefit, through increased compatibility of their raw data with a range of analysis software and easier deposition of data into PRIDE. Letters of support from Waters and AB Sciex demonstrate their commitment to PROCESS.
- Numerous pharmaceutical companies use mass spectrometry for analysis of proteins or metabolites. They will benefit through easier connectivity between software packages and more data in the public domain for re-analysis.
- Research councils and charities funding research will benefit through the potential for increased impact of the (proteomics) projects they fund, as public data deposition becomes straightforward and expected of all projects.

As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits, as PROCESS will help the field to become less fragmented and data analysis to become more straightforward. These benefits could be realised in any area of basic biology, biomedical or clinical science, for example leading to new drugs or biomarkers being discovered.

Staff employed will benefit from:
- Exposure to numerous international collaborations, through the PSI (see letters of support)
- New collaborations with industry, particularly in relation to the shared development of software (see letters of support)

Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme