Award details

A long term resource to maximise the potential of laboratory mouse strains for medical research

Reference BB/M000281/1
Principal Investigator / Supervisor Dr David Adams
Co-Investigators / Co-Supervisors Dr Jennifer Harrow, Dr Thomas Keane
Institution Wellcome Trust Sanger Institute
Department Computational Genomics
Funding type Research
Value (£) 674,759
Status Completed
Type Research Grant
Start date 01/03/2015
End date 28/02/2018
Duration 36 months

Abstract

The genome represents a complete description of an organism. However, to understand the functioning of the genes and regulatory elements, and to design molecular biological experiments to test hypotheses, the genome sequence must be related to the extant functional data for that organism. In particular, the set of genes must be accurately annotated. The first chromosome sequences for the laboratory mouse strains are soon to be released by the Mouse Genomes Project at the Wellcome Trust Sanger Institute. The main aim of this proposal is to take these sequences and create strain-specific annotation, with targeted manual annotation in regions where the automated processes fail. We propose to create a comprehensive evidence-based set of gene annotations for twelve laboratory mouse strains. This will be a combination of manual annotation at targeted loci and genome-wide automatic annotation. Manual annotation provides the most accurate annotation of a locus, generating every transcript for which there is evidence. Automatic annotation provides rapid genome-wide gene annotation. Together, they provide the most useful, cost-effective gene set for researchers. Manual annotation will be targeted at loci chosen by the community as important for medical research, or where user feedback suggests automatic annotation has failed to generate good models. It will be performed using the established Otterlace/ZMap annotation tools. An established process, used successfully in the ENCODE project, will merge the manual and automatic annotation for each Ensembl release. The gene set will be made available through the Ensembl website, via the other access methods to Ensembl (BioMart data-mining interface, Perl API, flat file dumps, MySQL database), through MGI, and for Ensembl tools, e.g. the Variant Effect Predictor. The gene set will be further annotated in each release by Ensembl's comparative genomic, variation and functional genomic pipelines.

Summary

Our key aim is to explore the relationship between genetic variation and medically relevant human disease phenotypes. One way to do this is to assess the genetic differences between long-established laboratory mouse strains. Laboratory mouse strains display many important disease phenotypes, such as resistance to various forms of cancer (e.g. liver, lung, and skin cancer) and to bacterial and viral infection, and are used as models for many human diseases. The foundation for studying the genetic differences in these strains is having accurate genome sequences. In this project, we will first generate genome sequences for the most commonly used laboratory mouse strains and then use these sequences and knowledge of the gene structures to determine the genetic cause of observed differences in disease response and behaviour between these strains. By combining sequence and phenotypic data we will determine whether sequence variants are likely to be contributing to disease susceptibility. The main aim of this project is to correctly identify all the genes on the newly completed release of genome sequences of 12 laboratory mouse strains. This will be achieved through a combination of two strategies. Initially, the genes will be identified using state-of-the-art bioinformatic programs and pipelines. The genes are identified by matches on the genome to known mouse proteins, to other transcribed data such as mRNAs and ESTs, or to conserved proteins from other species. As this is an automatic pipeline, there will be complex gene families that cannot be correctly identified and that require manual inspection. The HAVANA team have been involved in manual annotation of the human, mouse and zebrafish reference genomes and have developed specialist in-house tools to help accurate identification of genes within different genomes. Since manual inspection is expensive and time-consuming, the manual effort will be targeted on complex gene families and genes of specific interest to the mouse scientific research community.
Engaging with the community will be essential to receive feedback about targeting of annotation as well as to generate community participation in the manual inspection of genes of interest. Automatic annotation identifies around 70% of genes correctly; the aim is therefore to use bioinformatics analysis and feedback from researchers to target the remaining 30% of genes, which are incorrectly annotated, and improve them.

Impact Summary

The most obvious beneficiaries of the genome sequences and annotation generated will be the mouse genetics community involved in mapping complex disease-related traits, researchers mapping mutations in crosses involving the wild-derived strains, and researchers using crosses to identify modifiers of mutations. Complete genome sequence and annotation are needed to explore the relationship between genetic and phenotypic variation at a number of levels. First, they are a starting point for exploring how sequence and gene structure variation impinges on gene function. The new gene structures that this project will identify will provide a resource for examining sequence function, particularly in those regions, identified by the ENCODE project, that are either transcribed or implicated in gene regulation. Importantly, complete sequence will allow unambiguous assignment of function to specific nucleotide differences. Second, the sequence will accelerate the identification of genes involved in the increasingly large number of phenotypes available for inbred strains. To date, more than 2,000 loci that contribute to quantitative variation have been identified, with only a small number characterized at a molecular level. The de novo assemblies and corresponding annotation data will obviate the need to re-sequence candidate genes identified in genetic analysis of complex traits. Third, in combination with accumulating expression, proteomic and metabolomic data sets, accurate genome annotation of multiple mouse strains will markedly improve our ability to understand gene function. A systems biology approach will be possible, in which the integration of genetic and functional genomic data provides a path to inferring causal associations between genes and disease.

Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X – not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme