Award details

Developing proteome-bioinformatics methods for a large scale refinement of gene models in Apicomplexan parasites

Reference BB/G010781/1
Principal Investigator / Supervisor Professor Andrew Jones
Co-Investigators / Co-Supervisors Professor Jonathan Wastling
Institution University of Liverpool
Department Veterinary Preclinical Science
Funding type Research
Value (£) 414,712
Status Completed
Type Research Grant
Start date 01/07/2009
End date 30/06/2012
Duration 36 months

Abstract

The research programme will develop software tools that allow proteome data to be dynamically integrated with the genome for several important apicomplexan parasites that pose significant global challenges to human and animal health. Novel informatics tools will be created that can both re-query mass spectra each time a new set of gene models is released and actively inform the process of gene prediction by integrating protein expression evidence directly into gene finders. The proteome re-querying software will incorporate multiple database search engines, employing robust statistical procedures to control the false discovery rate. Improved algorithms for sequencing peptides de novo from mass spectra will be incorporated to make identifications that are problematic for standard searches, for example to find previously unpredicted genes. Visualisation and querying interfaces at the genome databases will be extended to display the probability that a predicted gene model is correct, as estimated by the modified gene finder. New proteome data will be generated for four species (Toxoplasma gondii, Neospora caninum, Cryptosporidium parvum and Eimeria tenella), using a selective technique to identify N-terminal peptides for the first time across several Apicomplexa. Algorithms will be developed for making robust identifications of N-terminal peptides and for aligning them against the genome to find new genes and to predict the correct start codons. The pipeline software will be deployed for all apicomplexan genomes hosted within GeneDB (Sanger Centre) and ApiDB (University of Pennsylvania) to make considerable improvements to the existing gene models. The software will have wide applicability, for both genome and proteome researchers, and will be freely available for use by other groups and database centres.
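The abstract does not say which statistical procedure will be used to control the false discovery rate. As an illustration only, the sketch below assumes the widely used target-decoy approach, in which spectra are searched against real (target) and reversed or shuffled (decoy) sequences and the FDR at a score threshold is estimated as decoys / targets. The `PSM` class, its field names and the `accept_at_fdr` function are hypothetical and are not taken from the project's software.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """A peptide-spectrum match (hypothetical record layout)."""
    spectrum_id: str
    peptide: str
    score: float      # assumption: a higher score means a better match
    is_decoy: bool    # True if the match came from the decoy database


def accept_at_fdr(psms, max_fdr=0.01):
    """Return the target PSMs accepted at the given estimated FDR.

    PSMs are ranked by score; at each rank the FDR is estimated as
    (decoys so far) / (targets so far). The longest ranked prefix whose
    estimate does not exceed max_fdr is kept.
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p.is_decoy]


if __name__ == "__main__":
    demo = [
        PSM("s1", "PEPTIDEA", 10.0, False),
        PSM("s2", "PEPTIDEB", 9.0, False),
        PSM("s3", "PEPTIDEC", 8.0, False),
        PSM("s4", "AEDITPEP", 7.0, True),
        PSM("s5", "PEPTIDED", 6.0, False),
        PSM("s6", "BEDITPEP", 5.0, True),
    ]
    for psm in accept_at_fdr(demo, max_fdr=0.25):
        print(psm.spectrum_id, psm.peptide, psm.score)
```

Combining results from multiple search engines, as the abstract proposes, would need an additional step to put their scores on a common scale before a threshold like this is applied; that step is not shown here.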

Summary

The central aim of this work is to develop novel computational tools that allow scientists to derive the maximum possible information from large collections of gene and protein data that have recently been acquired by large publicly funded research programmes. In the first instance our work will facilitate studies on an important group of parasites that have a devastating impact on animal and human health. The parasites can cause disease in cattle, sheep and poultry, and have great economic significance to the UK farming industry, as well as to public health. Substantial investment has been made to discover the complete genome sequences of these organisms, providing the framework for understanding their biology at the molecular level. An important first step with any new genome sequence is to locate the precise position and structure of each gene, often within large regions of non-coding sequence. This can prove highly problematic since the starting positions of genes cannot always be predicted with accuracy and many genes are split into a series of coding regions interspersed with non-coding regions. Several software programs have been developed that can find and predict gene structures, but although partially successful, these computer-generated gene models often require considerable refinement. Within a single genome, there may be tens of thousands of genes, and as such, manual verification of all gene structures is not feasible for most organisms. Following on from genome sequencing projects are studies of the complete set of proteins expressed at one time, termed the proteome. Proteins are the functional units of cells and, in this context, they are studied in particular to understand the basic biology of the parasite and the mechanisms by which the parasite invades its host, causing disease. Individual proteins may represent drug targets or vaccine candidates, so identifying new proteins could be highly significant.
Techniques have been developed that are capable of identifying a high proportion of the proteins present in a sample (proteomics). However, proteomics techniques have a limitation in that it is difficult to identify a protein unless the corresponding gene sequence has first been located in the genome. Computational tools are therefore required that allow proteins to be discovered, even if the corresponding gene has not yet been found. In this proposal, we will develop several software packages that allow existing proteome data sets to play an active role in the discovery of new genes, and in re-modelling existing gene structures where they do not correspond with the protein evidence. Several proteome data sets already exist for these organisms but at present they are under-exploited because of the difficulty of generating correct gene models. We will also use a novel laboratory technique that can specifically detect a certain fragment of each protein from one end of the sequence. The sequences of these fragments will be aligned against the genome, using newly developed software tools, to find the exact starting position of many genes in four parasite species. The software will be deployed as part of the publicly accessible databases where the genomes are hosted, providing an important resource for the large community of researchers who work on these organisms. It will also ensure that highly valuable proteome data does not become obsolete as the gene models change over time. We anticipate that a number of new genes and proteins will be discovered, and that a significant proportion of existing gene models will be improved. These improvements will be of immediate benefit to researchers who are working to understand the mechanism by which these parasites cause disease and will help the search for new drug targets and vaccines.
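The idea of aligning protein end fragments against the genome to locate gene start positions can be sketched as follows. This is a deliberately simplified illustration, not the project's software: it assumes the standard genetic code, exact matching of a peptide within a six-frame translation, no introns inside the matched region, and a start codon that is either the first codon of the peptide or the codon immediately upstream (the case where the initiator methionine has been removed). All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG-ordered table.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def candidate_starts(genome, peptide):
    """Find ATG start codons supported by an N-terminal peptide.

    Returns (strand, start, end) tuples, where start and end are 0-based,
    half-open coordinates of the start codon on the forward strand.
    """
    genome = genome.upper()
    n = len(genome)
    results = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                pep_start = frame + 3 * pos  # coordinate on the scanned strand
                if seq[pep_start:pep_start + 3] == "ATG":
                    codon = pep_start        # peptide retains its initiator Met
                elif pep_start >= 3 and seq[pep_start - 3:pep_start] == "ATG":
                    codon = pep_start - 3    # initiator Met assumed removed
                else:
                    codon = None             # no adjacent ATG: not reported
                if codon is not None:
                    if strand == "+":
                        results.append((strand, codon, codon + 3))
                    else:
                        results.append((strand, n - codon - 3, n - codon))
                pos = protein.find(peptide, pos + 1)
    return results


if __name__ == "__main__":
    toy_genome = "CCATGGCTAAAGGTTAAC"  # made-up sequence for illustration
    print(candidate_starts(toy_genome, "MAKG"))
    print(candidate_starts(toy_genome, "AKG"))
```

A real pipeline would also have to cope with peptides that span splice junctions, sequencing errors and alternative start codons, none of which this sketch attempts.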
Committee Closed Committee - Genes & Developmental Biology (GDB)
Research Topics Animal Health, Microbiology, Technology and Methods Development
Research Priority X - Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X - not Funded via a specific Funding Scheme