BBSRC Portfolio Analyser
Award details
PRIDE Converter - Efficient Database Deposition of Mass Spectrometry Data
Reference
BB/I024204/1
Principal Investigator / Supervisor
Mr Henning Hermjakob
Co-Investigators / Co-Supervisors
Dr Jyoti Choudhary, Dr Juan Antonio Vizcaino
Institution
EMBL - European Bioinformatics Institute
Department
Proteomics Services Team
Funding type
Research
Value (£)
107,645
Status
Completed
Type
Research Grant
Start date
01/01/2012
End date
31/12/2012
Duration
12 months
Abstract
While the extension of standards support in PRIDE Converter requires significant new development, the key technical challenge is the support for large submissions, in terms of both the number and the size of processed files. Currently, each individual input file needs to be processed interactively, an unrealistic demand for submissions that often comprise dozens or hundreds of files. Objective a) is to modularise PRIDE Converter to allow usage in batch mode, automating multi-file processing based on templates. Another technical challenge is the size of individual files, which can often reach several gigabytes. Currently, PRIDE Converter uses an in-memory data model (DOM), requiring all data to be held in the main memory of the computer. The overall memory consumption is 2-3 times the size of the PRIDE XML output file. Thus, standard desktops with 4 GB of main memory quickly become insufficient to process large files. We quite frequently have to perform custom conversions for data depositors on large-memory EBI machines, a process which is clearly not scalable. Objective b) is to redevelop the PRIDE Converter memory management to overcome this limitation.
PRIDE Converter has so far had more than 150 minor updates, usually in response to user feedback (http://code.google.com/p/pride-converter/updates/). However, two years after the original release, the tool has become difficult to maintain, and urgently needed major updates are impossible without a major redevelopment. While the basic technologies (Java, XML) will remain unchanged, the redevelopment will modularise the source code, implementing a strict MVC (model-view-controller) concept and allowing reuse of components in both the interactive mode and the new batch processing mode. As with the current PRIDE Converter, the project will be completely open source, allowing users to freely adapt the modular system to their particular needs.
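To illustrate the memory limitation of the in-memory DOM model described above, the following is a minimal sketch of the general alternative of event-based (streaming) XML reading, using the standard Java StAX API. It is not taken from PRIDE Converter, and the abstract does not state which technique the redevelopment uses. The class name, method name, and the sample element name `spectrum` are assumptions chosen for illustration only.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of PRIDE Converter.
class StreamingElementCounter {

    /**
     * Counts start tags with the given local name while reading one parser
     * event at a time, so the whole document is never held in memory as a tree.
     */
    static int countElements(Reader xml, String elementName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances to the following event and returns its type.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny made-up sample; a real input would be a file reader over a large XML file.
        String sample = "<run><spectrum id=\"1\"/><spectrum id=\"2\"/></run>";
        System.out.println(countElements(new StringReader(sample), "spectrum"));
    }
}
```

With this style of reading, memory use depends on the size of the element currently being processed rather than on the size of the whole file, which is the property a multi-gigabyte conversion needs.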
Summary
Public availability of biological data has been of paramount importance in the rapid development of molecular biology. However, the amount of proteomics data in public domain repositories is regrettably still quite low in comparison with other disciplines such as genomics and transcriptomics. One of the most prominent public repositories for proteomics data is the PRoteomics IDEntifications database (PRIDE, http://www.ebi.ac.uk) at the EBI, in Cambridge. At present, proteomics journals are increasingly mandating deposition of MS data in public repositories in general, and in PRIDE in particular, to support the publication of related manuscripts. At the same time, funding agencies (such as BBSRC and the Wellcome Trust in the UK) are clearly supporting this trend as a way to maximise the value of the funds provided. However, in practical terms, this public data-sharing policy cannot succeed unless the research community is provided with reliable and 'user-friendly' submission tools that can efficiently capture the technical and biological metadata. The submission tool PRIDE Converter (http://code.google.com/p/pride-converter) was developed with that idea in mind. It is open source, platform-independent software, and a big part of its success can be attributed to its easy-to-use graphical user interface (GUI) component. PRIDE Converter is currently by far the most comprehensive and popular tool of this kind, since it has made the submission of MS data a much easier and more straightforward process. It has been the key factor in the huge growth in data content in PRIDE over the last two years and has become the de facto submission tool for PRIDE for most researchers. From Jan 2009 to Sept 2010, PRIDE received 243 data depositions, comprising more than 63.6 million mass spectra, through PRIDE Converter.
The redevelopment proposed here is based on user feedback gathered by PRIDE curators in direct exchange with data depositors, as well as on discussions with journal editors and on the recent development of community standards for mass spectrometry (MS). Beyond the technical objectives a-b (see 'Technical Summary'), we urgently need to implement support for current community data standards. mzML, for the representation of mass spectra, has recently been published and supersedes mzData, which is currently supported by PRIDE Converter. mzIdentML, for the representation of protein and peptide identifications, has recently been released and is already supported by Mascot 2.3 and other tools. Key objective c) of this proposal is to implement full mzML/mzIdentML support in PRIDE Converter. In addition to standards for data representation, the HUPO Proteomics Standards Initiative has published a series of 'Minimum Requirements' documents, describing the metadata items which should be reported for proteomics experiments. Currently, adherence to these 'Minimum Requirements' documents is not validated by PRIDE Converter. Key objective d) is to implement such validation, and also to make adherence to these requirements efficient by providing a template mechanism for repetitive submission processes. This will make reuse of the data more feasible and will allow more reliable global re-analysis of data (meta-analysis studies). In its current form, PRIDE Converter provides only rudimentary support for quantitative MS technologies, which are quickly becoming the standard proteomics approach. Key objective e) aims to implement lightweight PRIDE Converter support for quantitative proteomics tools. The final objective f) is the standardisation of protein inference. Currently, proteomics search tools usually select one of a range of equivalent protein choices for a given peptide set. We will standardise this process as much as possible between different search tools, to ensure optimised comparability of proteomics data.
Impact Summary
The proposed work directly supports BBSRC's data sharing policy (http://www.bbsrc.ac.uk/web/FILES/Policies/data_sharing_policy.pdf) by reducing the 'activation energy' needed to start a virtuous circle of lower resistance to data deposition, more public data availability, more re-use of public data, and resulting awareness of the benefits of publicly available data, not only for the community, but also for the data producer through improved visibility and citations. As described in the previous section, the primary beneficiaries are academic proteomics researchers, both as data producers and consumers. However, the pharmaceutical industry, and life-science-based industry more generally, also stands to benefit from more, and more useful, proteomics data. As PRIDE follows a strict open data policy, no IP restrictions will limit the usefulness of the data to the commercial sector. In recent years, the pharmaceutical industry in particular has also become much more open to collaboration and even public data release in areas considered precompetitive. In some instances, public data release is hampered more by the necessary time and effort than by IP considerations. Thus, reducing the effort needed for public data release might even be a step towards tapping the vast treasure of currently private data generated, often in high quality, by the commercial sector, and releasing it into the public domain.
Committee
Research Committee C (Genes, development and STEM approaches to biology)
Research Topics
Technology and Methods Development
Research Priority
X – Research Priority information not available
Research Initiative
Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme
X – not Funded via a specific Funding Scheme