Award details

Building the PTM map of the human genome through commensal computing

Reference BB/L005239/1
Principal Investigator / Supervisor Professor Andrew Jones
Co-Investigators / Co-Supervisors Professor Leszek Gasieniec
Institution University of Liverpool
Department Institute of Integrative Biology
Funding type Research
Value (£) 232,708
Status Completed
Type Research Grant
Start date 15/02/2014
End date 14/02/2017
Duration 36 months

Abstract

We are developing a crowd-sourcing tool for massively parallel re-analysis of mass spectral data from proteomics studies, called the Human Proteome Modifier (HPM). The HPM tool features an unrestricted search for all types of variable modifications on proteins, such as post-translational modifications (PTMs) as well as chemical artefacts. These searches are not routinely performed by most proteomics groups because of the CPU time required. HPM will function as a browser-embedded application (running on PCs, tablets or phones) which will make use of a small amount of client-side CPU time while users are browsing websites, using social media applications or playing games. We have adapted this computing model from the concept of "parasitic computing" (stealing CPU time) to "commensal computing", since users will be aware that their CPU time is being used for the public good (human genome annotation) at no noticeable cost to them. The HPM tool will be available for proteomics labs to upload new data sets for unrestricted modification searching, and for re-analysis of all (human proteome) data sets available in public databases. The results will be fed into a database we will develop, called HPM-DB, which will be mined by the Human Proteome Project with the aim of discovering all experimentally observable modification sites on proteins. Visualisation software will be provided for specialists to analyse spectral-level evidence, and for non-specialists to appreciate the strength of evidence for a given PTM site identified by HPM. Proteomics groups will also use HPM-DB to learn about the frequencies of all types of modifications that can occur on proteins. We will work with industrial collaborators to embed HPM in distributed games, to increase the uptake of the tool, maximise the amount of CPU time available for data analysis and employ the lay public as problem solvers.
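To give a sense of what an unrestricted modification search involves, the sketch below illustrates the delta-mass matching step at its core: the difference between an observed precursor mass and the mass of an unmodified candidate peptide is compared against a table of known modification masses. This is a simplified illustration only, not the HPM implementation; the modification table, tolerance and example masses are illustrative, and a real search engine would also score fragment ions.

```typescript
// Minimal sketch of the precursor-level delta-mass idea behind an
// "open" (unrestricted) modification search. Illustrative only.

interface Modification {
  name: string;
  deltaMass: number; // monoisotopic mass shift in Daltons
}

// A tiny illustrative table; a production tool would use a full PTM list (e.g. Unimod).
const MODIFICATIONS: Modification[] = [
  { name: "Phosphorylation", deltaMass: 79.96633 },
  { name: "Acetylation",     deltaMass: 42.01057 },
  { name: "Oxidation",       deltaMass: 15.99491 },
];

// Report which modifications (if any) could explain the observed mass shift.
function explainMassShift(
  observedMass: number,    // precursor mass measured by the instrument
  theoreticalMass: number, // mass of the unmodified candidate peptide
  tolerancePpm = 10        // assumed instrument mass accuracy
): Modification[] {
  const delta = observedMass - theoreticalMass;
  const tolerance = (tolerancePpm / 1e6) * observedMass;
  return MODIFICATIONS.filter(
    (mod) => Math.abs(delta - mod.deltaMass) <= tolerance
  );
}

// Example: a peptide observed ~80 Da heavier than expected suggests phosphorylation.
console.log(explainMassShift(2100.9663, 2021.0));
```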

Summary

In recent years, the concept of "crowd-sourcing" has emerged as an exciting new paradigm for engaging large groups of people to solve a common task - one particularly high-profile example is Wikipedia. Crowd-sourcing can also be applied to data analysis, engaging many distributed machines to solve a problem. This is a hugely exciting development for science, since massive data sets are now commonplace, and it is a major unsolved problem how research organisations should fund the computing equipment to analyse this explosion of data. The traditional route has been to purchase large farms of computers (clusters) in a dedicated location. This route is expensive to purchase and expensive to keep up to date: a cluster of 10 computers purchased in 2003 would have similar computing power to a single modern desktop PC, available today at a fraction of the cost. An alternative model that has received attention recently is cloud computing, in which companies such as Amazon and Google provide access to massive compute farms hosted in distributed locations on a pay-as-you-use basis. This model is attractive for high-powered, short-term jobs, as purchasing 1 hour of analysis time on 1000 computers costs approximately the same as 1000 hours on 1 computer. It does not ultimately save any cost in real terms, though, since the service providers aim to profit from providing the clusters. The crowd-sourcing model instead takes advantage of the fact that devices containing CPUs are now ubiquitous - not just in PCs, but also in tablets and mobile phones - and that the vast majority of CPU time on these devices goes unused.

In this application, we are going to put the crowd-sourcing model to work to help annotate the human genome. The completion of the genome sequence was an important scientific landmark, but the crucial task now is to study the functional units within the genome: the genes, and the protein(s) encoded by each gene. We wish to understand the basic function of each protein, what happens if a protein malfunctions (for example if the gene encoding it contains a mutation in some individuals), and how these proteins change in the cell. An important process that happens to proteins is post-translational modification. These are chemical changes that happen after the protein has been produced from the genetic code, altering its function - making it active or inactive, and influencing which other proteins it can interact with. The genetic code gives us no clues as to which sites in proteins can or will be modified with particular chemical groups, and so we must study these modifications experimentally. Mass spectrometry is widely used to study proteins on a very large scale, with a single experiment producing data on thousands of proteins at once. The computational analysis of the data is difficult to perform optimally, so many researchers ignore data on protein modifications because they do not have access to sufficient computing power to analyse them properly.

In this project, we are going to build a tool that runs in any browser platform (PC, tablet, phone etc.), which will perform massive analysis of proteomics data. Our tool can be embedded in social media platforms, such as Facebook, so that the public can get personally involved in an important scientific endeavour, simply by having a near-silent application running in browser windows they already have open, or by playing an interactive game we will build that maps the problem to a solvable puzzle. This will provide us with a very large amount of CPU time for analysing the data fully, as well as engaging human brains to interpret challenging data. All results will be fed back into the genome annotation effort, so we can start to fully understand how every protein encoded in the human genome can be modified in different cell types. Other researchers will be able to mine this important data for their own studies in a wide variety of biological and biomedical contexts.
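To make the computing model concrete, the sketch below outlines how a browser-embedded client of this kind might operate: running inside a Web Worker on the host page, it repeatedly requests a small batch of work from a coordinating server, searches it locally, and returns the results. The server address, endpoint names and payload shapes are hypothetical placeholders for illustration, not the actual HPM design.

```typescript
// Minimal sketch of a commensal-computing client loop, as it might run
// inside a dedicated Web Worker embedded in a host web page.
// Endpoint URLs and payload shapes are hypothetical placeholders.

interface WorkUnit {
  id: string;
  spectra: number[][];          // batch of spectra to search
  candidatePeptides: string[];  // peptides to match against
}

interface ResultPayload {
  workUnitId: string;
  matches: unknown[];           // identifications found for this batch
}

const SERVER = "https://example.org/hpm"; // placeholder server address

async function fetchWorkUnit(): Promise<WorkUnit | null> {
  const response = await fetch(`${SERVER}/work-unit`);
  return response.ok ? ((await response.json()) as WorkUnit) : null;
}

async function submitResult(result: ResultPayload): Promise<void> {
  await fetch(`${SERVER}/result`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(result),
  });
}

// Placeholder for the actual spectrum search (e.g. delta-mass matching as sketched above).
function searchBatch(unit: WorkUnit): unknown[] {
  return [];
}

// Keep requesting small batches of work while the page (and the worker) is alive.
// Running in a Worker keeps the host page responsive; the pause when no work is
// available keeps the CPU cost unobtrusive for the user.
async function commensalLoop(): Promise<void> {
  for (;;) {
    const unit = await fetchWorkUnit();
    if (unit === null) {
      await new Promise((resolve) => setTimeout(resolve, 60_000)); // no work: wait a minute
      continue;
    }
    const matches = searchBatch(unit);
    await submitResult({ workUnitId: unit.id, matches });
  }
}

commensalLoop();
```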

Impact Summary

- Both large pharma and smaller biotech SMEs will see direct benefits from the provision of HPM-DB as a resource for studying human proteins in a wide variety of contexts.
- Commercial software developers working in proteomics will benefit from HPM-DB through the provision of very high-quality data sets for training their peptide / modification identification algorithms.
- Research councils and charities funding computationally intensive Life Sciences research will see indirect benefits if the crowd-sourcing model can be effectively deployed for data analysis, with potentially enormous savings in high-performance computing costs.
- The staff employed on the project will benefit through the development of skills and understanding in this cutting-edge software project.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Crowd Sourcing for the Biological Sciences (CSBS) [2013]
Funding Scheme X – not funded via a specific Funding Scheme