Award details

ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange

Reference BB/L024128/1
Principal Investigator / Supervisor Professor Andrew Jones
Co-Investigators / Co-Supervisors
Institution University of Liverpool
Department Institute of Integrative Biology
Funding type Research
Value (£) 206,036
Status Completed
Type Research Grant
Start date 25/06/2014
End date 24/12/2016
Duration 30 months

Abstract

The Distributed Annotation System (DAS) has long been the workhorse for integrating external data sources into Ensembl. The UCSC Genome Browser has developed and switched to the more modern and efficient 'TrackHub' technology, and Ensembl now provides preliminary support for TrackHubs as well. A particular challenge is the up-to-date integration of mass spectrometry (MS)-based proteomics information, due to the use of different search databases in different labs and regular updates to genome assemblies. However, high-quality proteogenomics integration has the potential for huge benefits: reliable identification of isoforms, detection of post-translational modifications (PTMs) and quantitative protein expression information are prominent examples. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver in the international ProteomeXchange (PX) consortium. In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The proteomics data sets will be reprocessed at different levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation, and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and on model organisms represented in Ensembl and Ensembl Genomes, such as mouse, rat and Arabidopsis, and on eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
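
As a rough illustration of the quantification level mentioned above, spectral counting estimates a protein's relative abundance from the number of identified spectra (peptide-spectrum matches, PSMs) assigned to it. The sketch below is a minimal example under stated assumptions, not the ProteoAnnotator implementation: the function names, the (spectrum ID, protein accession) input format and the simple normalisation by total count are all hypothetical choices made for illustration.

```python
from collections import Counter


def spectral_counts(psms):
    """Count PSMs per protein.

    `psms` is an iterable of (spectrum_id, protein_accession) pairs;
    this input shape is an assumption made for the example.
    """
    return Counter(protein for _spectrum_id, protein in psms)


def relative_abundance(counts):
    """Divide each protein's count by the total number of counted PSMs."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {protein: n / total for protein, n in counts.items()}


# Hypothetical identifications: three spectra, two proteins.
example_psms = [("s1", "P1"), ("s2", "P1"), ("s3", "P2")]
example_counts = spectral_counts(example_psms)
example_abundance = relative_abundance(example_counts)
```

Real pipelines typically also correct for protein length and for peptides shared between proteins; both are left out here to keep the sketch short.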

Summary

For researchers in the Life Sciences, it is imperative that they are able to access and view the human genome, and the genomes of model organisms and human pathogens, in an efficient and user-friendly way via the Internet. The genome itself is annotated with information about the locations and functions of genes, and with quantitative data about genes and other elements within the genome. The UK-based Ensembl project is a leading genome browser, used by thousands of researchers every day. The value of genomic information is greatly increased when it is integrated with, and can be directly viewed alongside, other biological data sources such as proteomics: a set of technologies devoted to the identification and quantification of proteins, the functional molecules encoded by genes. From a technical point of view, the large size of modern biological data sets makes it challenging to integrate them efficiently into genome browsers. A technology called DAS (Distributed Annotation System) is the prevalent technology used by genome browsers to integrate external data, but it can no longer support much-needed new features or scale to the sizes of modern data sets. Another genome browser, the UCSC Genome Browser, has developed a more modern and efficient technology called 'TrackHubs', specifically designed for large-scale data sets. Both UCSC and Ensembl have developed initial support for this technology, but there are still limitations for many users, and Ensembl's support remains incomplete. In the 'ProteoGenomics' project, we first want to further develop the 'TrackHub' technology, expanding its scope of usage in Ensembl and making it easier for researchers around the world to discover and use TrackHubs containing different types of research data. Ensembl's TrackHub technology will be extended to proteomics data for the first time, improving the provision of non-genomics biological information in this widely used resource.
In the project, we are going to build technology to integrate proteomics data with the genome data held in Ensembl in a dynamic and effective way. With this aim in mind, we will use public MS proteomics data submitted to and available in one of the main repositories in the world, the UK-based resource PRIDE, which is also one of the leaders of the ProteomeXchange Consortium of proteomics resources. We will reanalyse the data in PRIDE via our ProteoAnnotator pipeline to provide updated or complementary information to the results originally submitted by the research team that generated the data. We are pioneering techniques for extracting more value from the same data, to understand how proteins vary in their abundance and in the chemical modifications that occur on them and alter their function. These are two types of results that research groups submitting data to PRIDE often do not generate initially. Through this data reuse and the extraction of new biological findings, the value of the submitted datasets will increase. In addition, 'ProteoGenomics' will provide a portal for datasets from the recently started Human Proteome Project (HPP), giving the global research community a single entry point to these datasets.

Impact Summary

The direct beneficiaries include:
- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source under the Apache 2.0 licence.
- Research councils and charities funding research, which will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, through the reprocessing of datasets, it will be possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.
Staff employed will benefit from:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics) and exposure to new collaborations.
Committee Not funded via Committee
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme