Award details

PIT-DB: A Resource for Sharing, Annotating and Analysing Translated Genomic Elements

Reference BB/M020118/1
Principal Investigator / Supervisor Professor Conrad Bessant
Co-Investigators / Co-Supervisors Dr DA Matthews
Institution Queen Mary University of London
Department Sch of Biological and Chemical Sciences
Funding type Research
Value (£) 122,711
Status Completed
Type Research Grant
Start date 31/08/2015
End date 30/08/2016
Duration 12 months

Abstract

We previously developed PIT (Proteomics Informed by Transcriptomics), a methodology in which a given sample is analysed by both RNA-seq and proteomic mass spectrometry (MS) followed by integration of the acquired data to provide genome-wide information about which genomic elements are transcribed and translated within a given sample. Unlike traditional shotgun proteomics this does not require prior knowledge of the sequences that may be expressed, so provides an unbiased analysis that is as suited to finding novel translated genomic elements (TGEs) as it is to finding established proteins. This type of analysis is getting a lot of attention thanks to recent studies that have questioned the accuracy of widely accepted genome annotations and have found evidence that there are many other molecules translated from RNA - not just proteins. To make the processing of data from PIT experiments tractable for the typical lab scientist we have developed Galaxy-based data analysis workflows that integrate RNA-seq and MS data to produce uniform output files containing information about all the observed TGEs. We now have a growing collection of results from experiments on several species, and our aim in this project is to produce a web-accessible database called PIT-DB for sharing these results and results collected by other groups around the world. PIT-DB will be created using standard methods for developing databases and web front ends, but additional work will be done to pool submitted data to build up evidence of TGEs over multiple experiments. This is expected to provide large numbers of novel genome annotations backed up by significant experimental evidence. We will conduct a small validation experiment to check for the existence of a number of novel TGEs from the database.
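The pooling step described above, in which evidence for each TGE is accumulated across multiple experiments, could be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields (`experiment_id`, `species`, `tge_sequence`, `peptide_count`), the function name `pool_evidence`, and the example values are all hypothetical and are not taken from PIT-DB or from the PIT workflow output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TgeObservation:
    """One TGE reported by one experiment (hypothetical record layout)."""
    experiment_id: str
    species: str
    tge_sequence: str
    peptide_count: int  # peptides supporting this TGE in this experiment


def pool_evidence(observations):
    """Group observations by (species, sequence) and accumulate evidence.

    Returns a dict mapping each (species, tge_sequence) key to the set of
    experiments that observed it and the total supporting peptide count.
    """
    pooled = defaultdict(lambda: {"experiments": set(), "peptides": 0})
    for obs in observations:
        key = (obs.species, obs.tge_sequence)
        pooled[key]["experiments"].add(obs.experiment_id)
        pooled[key]["peptides"] += obs.peptide_count
    return dict(pooled)


# Made-up example data: the same TGE seen in two experiments, another in one.
observations = [
    TgeObservation("exp1", "human", "MKTAYIAK", 2),
    TgeObservation("exp2", "human", "MKTAYIAK", 3),
    TgeObservation("exp2", "human", "MSLLTEVE", 1),
]
pooled = pool_evidence(observations)
```

In this sketch, a TGE observed in several independent experiments ends up with a larger experiment set and peptide total than one seen only once, which is one simple way the accumulated evidence could be ranked.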

Summary

The publication of the human genome in 2001 was rightly hailed as a major scientific achievement, but over a decade later we are still far from a complete understanding of the structure of the genome and the role of the various elements within it. While protein coding regions of the genome were identified and used to annotate the genome soon after it was sequenced, many more exotic genomic elements have subsequently attracted interest including pseudogenes, non-coding RNAs and short open reading frames (sORFs). In recent years, post-genomic bioanalytical techniques such as RNA-seq transcriptomics (which tells us which genomic elements are expressed) and mass spectrometry-based proteomics (which tells us which of the expressed elements are translated into peptides or proteins) have helped refine our understanding of the human genome at a fundamental level. Just this year, two proteomics studies published in Nature caused a stir by showing that no experimental evidence could be found for the expression of several genomic elements widely accepted to code for protein, while other regions of the genome that were not previously thought to be protein coding were in fact found to produce proteins. If this is the situation for the intensively studied human genome, we must assume that the genome annotations for less studied species (so-called non-model organisms) are even less accurate. We recently developed (and tested, and published) a methodology called proteomics informed by transcriptomics (PIT) that rapidly generates large numbers of genome annotations underpinned by multiple sources of experimental evidence. In PIT, every sample is analysed using both RNA-seq and proteomic mass spectrometry, and the data from these two analyses are integrated to provide a list of observed proteins and any other translated genomic elements (TGEs), together with the detailed transcriptomic and spectral evidence that underpins these observations.
The beauty of PIT compared with traditional proteomics is that no prior sequence knowledge is needed, so novel TGEs (be they proteins or other more exotic features) can be detected. RNA-seq can be used by itself to rapidly generate genome annotations without prior knowledge, but without PIT's mass spectrometry step the confidence in these annotations is limited and there is no guarantee that transcribed elements actually get translated. In a recent BBSRC TRDF project we developed easy-to-use web-based software workflows, implemented in the popular Galaxy platform, to process the data from PIT experiments in a repeatable way with uniformly formatted output files. This has proven very useful for answering individual biological questions, but there is currently no meaningful way to share the results of PIT experiments. In this project we propose to plug this gap by developing PIT-DB, a web-accessible database of results produced by PIT. This publicly available database will immediately be populated with data from experiments conducted on various species at the University of Bristol, but other groups will be actively encouraged to submit their own data. Having data from multiple PIT experiments in one database will deliver exciting new scientific insights. As well as simply allowing researchers to share their results from individual PIT experiments, PIT-DB will pool information about individual novel TGEs from multiple experiments so evidence can be accumulated for each individual TGE. Improving the quality of results by using data from replicate experiments is a fundamental concept in science and the utility of doing this on a community-wide basis has been repeatedly demonstrated by other bioinformatics databases such as Ensembl, UniProt and PRIDE.
As well as being of interest individually, the well evidenced TGEs in PIT-DB will provide large numbers of experimentally derived (as opposed to computationally predicted) genome annotations for all of the species for which data is present in the database.

Impact Summary

The principal groups who will benefit from this project are:

1. Researchers from academia and industry seeking a better understanding of genomes. Genomics is now the cornerstone of a large proportion of biological research, covering a wide range of applications from medicine and food science through to ecology and industrial biotechnology. In all these areas a detailed understanding of the structure and function of the genome of the species under study is important in answering key research questions. The increased understanding of the genome that PIT-DB provides will accelerate progress towards answering these questions. Given the wide range of biological areas in which genomics is used, this will translate into impact across a broad range of strategically important research areas across BBSRC's remit, including bioenergy, infectious diseases, food security, healthy ageing, animal welfare and synthetic biology.

2. Industry. The range of companies that stand to benefit directly from PIT-DB is large and diverse. To give just three examples: (i) Small companies dedicated to the discovery of biomarkers and drug targets will find that the plentiful supply of experimentally derived novel translated genomic elements (TGEs) provides a valuable source of material for new research projects. (ii) Pharmaceutical companies who are becoming increasingly interested in the possible role of novel TGEs such as fusion proteins will benefit from having access to a large catalogue of such TGEs. (iii) The agri-food industry, in which the analysis of genomes from a wide range of species including farmed animals, parasites, pathogens and multiple cultivars of popular crops is a core activity, will have the opportunity to benefit from the improved annotations of these genomes afforded by pooling of data within PIT-DB. There will also be a general benefit from the greater transparency and repeatability of published PIT experiments, through the easy sharing of results. This will give industry more confidence in the findings of such research, increasing the likelihood of this research being translated into economic benefit.

3. General public. The ultimate beneficiaries of this project should be the general public, for whom the improved biological insight revealed by the groups above has the potential to lead to new medical treatments, increased food security, greener energy and an improved economy. It is impossible to predict which, if any, of these benefits will come to fruition but by making significant amounts of otherwise difficult-to-access PIT results freely available to researchers via an intuitive web-based user interface we aim to make a significant contribution towards this goal.

Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme