Award details

An Integrated CATH Resource for the Postgenomic Era

Reference BB/F010451/1
Principal Investigator / Supervisor Professor Christine Orengo
Co-Investigators / Co-Supervisors Professor David Jones
Institution University College London
Department Structural Molecular Biology
Funding type Research
Value (£) 816,263
Status Completed
Type Research Grant
Start date 01/09/2008
End date 28/02/2014
Duration 66 months

Abstract

The CATH integrated resource will combine data on domain structures classified in CATH with predicted sequence relatives in the genomes. In addition, 3D models will be built for genome sequences and protein interactions, where possible. Functional information extracted from public sources will be integrated for each family and inherited between relatives, using safe thresholds. We will also open up the resource to the sequence-based community through tightly integrated prediction tools (PSIPRED). To update this information regularly, we will develop more sensitive methods and robust Grid-based workflows for classifying structures, predicting structures in genome sequences, 3D modelling and integration of functional annotations. We will keep pace with the worldwide genomics initiatives by expanding our domain boundary recognition suite to include additional algorithms. Sensitivity in homologue detection will also be increased using neural-network-based approaches. Coverage of structural predictions will be significantly improved by exploiting multiple structural alignments, built for each CATH family, to improve the sensitivity and accuracy of HMM and threading methods. Information on conserved structural positions will also improve the homology modelling protocols used to build 3D models for genome sequences. Finally, protocols for integrating functional information will be improved and extended to incorporate data generated by new in-house methods being developed in related projects. The new integrated CATH resource will be available to biologists via new web pages that allow users to browse the resource in a much more intuitive manner, moving easily from structural family data to related sequences and their associated functions, and to view available structures or 3D models highlighted to show conserved residue positions and surface features, such as electrostatics. The data and methods will also be available via DAS and Web services.

Summary

The success of the worldwide genome initiatives has given us the protein sequences for more than 300 species, including human and mouse. The challenge now is to predict the functions of these proteins and how they interact with each other to give the diverse biological repertoires observed in nature. The three-dimensional structure of a protein is much harder to determine than its sequence, which explains why fewer than 25,000 structures are known compared with ~2.5 million non-redundant sequences. However, structural data often give more profound insights into the mechanisms by which proteins act and interact. Also, because structure is more conserved than sequence, we can detect more distant relationships, giving clearer insights into how proteins evolve. A number of structural classifications exist to group proteins by their structural similarity, and they are particularly valuable for understanding how changes in the sequences and structures of relatives can modify functions. Since we cannot experimentally characterise all proteins, being able to accurately predict functions from related proteins is essential for understanding biological systems and determining the causes of and remedies for disease. The CATH classification is one of the most widely used and comprehensive of these structural family resources. It has expanded 12-fold since it was established in 1993 and is now accessed by biologists nearly 1 million times per month over the web. The only other resource of this kind is SCOP, which classifies a similar number of protein structures. The two resources employ different approaches: SCOP relies largely on manual inspection to identify remote structural similarities, whilst CATH applies automated algorithms, with manual inspection used to validate only the hardest cases. This use of carefully validated automated approaches will ensure that CATH can cope with the massive flood of data expected over the next decade.
The worldwide structural genomics initiatives are currently solving the structures for protein families for which no structural information exists. Although these initiatives are very welcome because they are expanding our knowledge of protein structures, they necessitate faster and much more sensitive automatic methods for CATH, as well as a greater degree of manual validation. In this project we will develop much more efficient ways of classifying these structures to keep pace with the structural genomics initiatives. Since very few proteins have known structures, CATH will bring much wider benefits to the biological community if structural data can be predicted for the millions of sequences not yet structurally characterised. We have already developed very robust technologies for predicting which genome sequences can be assigned to CATH structural families. International competitions have shown these to be amongst the best performing in the world. Using these techniques we can predict structures for up to 80% of proteins in some organisms. In this project, we therefore propose to develop an integrated resource that combines information on structural families with structural predictions for all sequences in the genomes. We also have methods to integrate any available functional information for the proteins. Furthermore, our in-house modelling techniques can provide reasonable 3D models for many of these sequences, which will help biologists to understand the functional properties of the proteins and to determine the functional networks in which they participate. The integrated CATH resource we plan will present biologists with structural data for any protein of interest, combined with comprehensive functional data and highly intuitive web pages that help them to view the structures in the context of all the available functional data. By integrating data in this way, this resource will ultimately enrich our understanding of biological systems.
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme