Award details

Improving the rat reference genome annotation and building community engagement

Reference BB/K009524/1
Principal Investigator / Supervisor Dr Jennifer Harrow
Co-Investigators / Co-Supervisors Professor Tim Hubbard, Dr Stephen Searle
Institution Wellcome Trust Sanger Institute
Department Computational Genomics
Funding type Research
Value (£) 592,697
Status Completed
Type Research Grant
Start date 01/04/2013
End date 31/08/2016
Duration 41 months

Abstract

The genome represents a complete description of an organism. However, to understand the functioning of the genes and regulatory elements, and to design sensible molecular biological experiments to test hypotheses, the genome sequence must be related to the extant functional data for that organism. In particular, the set of genes must be accurately annotated. An updated genome assembly for rat (Rnor5.0) has recently been released. This improved assembly is more complete and has longer contigs, making it a better substrate for generating both automatic and manual gene annotation. We propose to create a comprehensive, evidence-based gene annotation set for rat. This will be a combination of manual annotation in targeted loci and genome-wide automatic annotation produced using the established Ensembl annotation system. Manual annotation provides the most in-depth annotation of a locus, generating all transcripts for which there is evidence. Automatic annotation provides rapid genome-wide gene annotation. Together they provide the most useful, cost-effective gene set for researchers. Manual annotation will be targeted at loci chosen by the community as important for rat-based research, or where user feedback suggests automatic annotation has failed to generate good models. It will be performed using the established Otterlace/ZMap annotation tools. A community annotation jamboree will be organized to further increase the amount of manual annotation possible. An established process, used successfully in the ENCODE project, will merge the manual and automatic annotation for each Ensembl release. The gene set will be made available through the Ensembl website and via the other access methods to Ensembl (BioMart data-mining interface, Perl API, flat file dumps, MySQL database), and for Ensembl tools such as the Variant Effect Predictor. The gene set will be further annotated in each release by Ensembl's comparative genomics, variation and functional genomics pipelines.
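As a purely illustrative aid, the idea of combining manual and automatic annotation can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Ensembl/HAVANA merge process referred to above: the data layout (a locus name mapped to a list of transcript model identifiers), the function name `merge_annotation`, and the rule that manual models are listed first are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy data model: locus name -> list of transcript model IDs.
Annotation = Dict[str, List[str]]


def merge_annotation(manual: Annotation, automatic: Annotation) -> Annotation:
    """Toy merge of two annotation sets.

    At each locus, keep every manual model first, then append any
    automatic model that is not already present. Loci covered by only
    one source are carried over unchanged.
    """
    merged: Annotation = {}
    for locus in sorted(set(manual) | set(automatic)):
        models = list(manual.get(locus, []))
        for model in automatic.get(locus, []):
            if model not in models:
                models.append(model)
        merged[locus] = models
    return merged


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    manual = {"LocusA": ["manual_1"]}
    automatic = {"LocusA": ["auto_1", "manual_1"], "LocusB": ["auto_2"]}
    print(merge_annotation(manual, automatic))
```

In this sketch, a locus that was manually annotated keeps its manual models at the front of the list, while a locus with only automatic annotation is carried over as-is. A real merge would compare transcript structures rather than identifiers.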

Summary

Rats have been used in research for over 100 years as a model to examine physiology and behaviour and to provide insight into human disease. Owing to its well-characterised physiology, the rat is also the favoured rodent model used in the pharmaceutical industry for the assessment of drug efficacy and toxicity. In 2004 the first reference rat genome sequence was made public, and this has changed the direction of research using the rat as a model organism, enabling identification of rat genes associated with specific diseases. The first release of the rat genome sequence was not of high quality and contained many gaps and missing genes. It was updated in 2012 by the Baylor College of Medicine Human sequencing group, which integrated sequence generated from new sequencing technologies and so increased the amount of sequence covered in the genome. Recently, new experimental techniques have enabled scientists to knock out genes in the rat genome, making it possible to observe what happens to the rat when a gene is deleted. As a result, it is essential that the genes targeted for this type of genetic experiment are correctly identified, i.e. "annotated", on the rat genome. The main aim of this project is to correctly identify all the rat genes on the new release of the reference rat genome. This will be achieved through a combination of two strategies. Initially the genes will be identified using state-of-the-art bioinformatic programs and pipelines developed by the Ensembl gene build team. The genes are identified by matches on the genome to known rat proteins, to other transcribed data such as mRNAs and ESTs, or to conserved proteins from other species. As this is an automatic pipeline, there may be complex gene families that cannot be correctly identified and require manual inspection. The HAVANA team have been involved in manual annotation of the human, mouse and zebrafish reference genomes and have developed specialist in-house tools to help accurate identification of genes within different genomes.
Since manual inspection is expensive and time-consuming, the manual effort will be targeted on complex gene families and genes of specific interest to the rat scientific research community. Engaging with the community will be essential to receive feedback about the targeting of annotation as well as to generate community participation in the manual inspection of genes of interest. Over 22,000 protein-coding genes were predicted on the original rat assembly, and community input could therefore improve and refine these gene models. Automatic annotation identifies around 70% of genes correctly, so the aim would be to use bioinformatics analysis and feedback from researchers to target the 30% of genes that are incorrectly annotated and improve them. The HAVANA team have previously worked with pig researchers on a community annotation project to identify immune-response genes on the pig genome. Approximately 8% of protein-coding genes were annotated by these researchers, who used the HAVANA annotation tools remotely on their own laptops in their labs after attending a workshop on how to use the in-house tools. Regular contact with the professional annotators ensured the resulting models were consistent among all researchers and adhered to the guidelines produced by the HAVANA group. This model of community annotation will be presented to the rat community as an opportunity to improve the annotation of rat genes. The reference rat genes can be viewed via the internet using the Ensembl genome browser. This reference gene set will be updated approximately every three months, and updates from the manual annotation effort will be merged into the automatic gene set by the Ensembl gene builders. In addition, any new rat-specific data that helps with identifying new genes, such as transcriptome data from new sequencing technologies, can be integrated into this complex gene-building pipeline.

Impact Summary

This proposal will generate a more accurate and complete annotation of the gene structures contained in the rat genome than is currently available. Accurate knowledge of the gene structures of an organism is a fundamental requirement for the interpretation of many types of experimental biological datasets, and so this research is important to all individuals who carry out research concerning rats. The open availability of the data generated and of the software code and tools to access it will ensure its use is maximized. The beneficiaries of this research will include those researching the basic biology of rats and those using rats as a model for humans in order to better understand human physiology and disease. This group includes the pharmaceutical industry, where the rat is an important model organism in drug development. These groups will benefit from this research by having a more reliable and complete gene set to use in their analysis. This will enable them to design more precise experiments and better interpret experimental data. An improved gene annotation will also lead to more accurate and complete identification of orthologous genes in other organisms such as human and will enable detailed comparisons of gene structures. When using the rat as a model for human diseases or physiology in drug development research, it is important to know how similar the biology of the two species is. This in turn depends on how similar the genes in each species are, including their structure and regulatory features. An improved gene annotation will facilitate this analysis. Research using the rat as a model organism has an important role in the understanding of human disease and in the development of new drugs. The research therefore has the potential to contribute to improved health of the UK population.
The pharmaceutical industry is a major generator of wealth in the UK, so this research also has the potential to improve its research output and through that help improve the competitiveness of this sector of the UK economy. The community engagement aspects of this proposal will specifically enable UK researchers in both academia and companies to propose priorities for gene annotation improvements based on the priorities of their research and allow them to engage with expert annotators. The improved gene set resulting from this research could also potentially provide a starting point for commercial companies producing experimental reagents for other researchers. The international importance of this research will also encourage links to other international rat resources and databases, such as the US-funded Rat Genome Database (RGD), the EU-funded EURATools (http://euratools.rns4u.com/) and its follow-on project, the EURAtrans consortium (http://www.euratrans.eu/). Such links will enhance the access of UK researchers to other large-scale basic research projects on rat and the data they generate. Finally, the staff trained on this project will gain valuable expertise in computational methods for handling genome data and biological expertise around vertebrate gene structure. These bioinformatics skills, particularly in the use of high-throughput biological data, are in great demand both in academia and industry. This now includes the health sector, where genome data is increasingly used in medical diagnostics.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme