Award details

Support for the SUPERFAMILY protein domain resource.

Reference BB/G022771/1
Principal Investigator / Supervisor Professor Julian Gough
Co-Investigators / Co-Supervisors
Institution University of Bristol
Department Computer Science
Funding type Research
Value (£) 684,410
Status Completed
Type Research Grant
Start date 01/02/2010
End date 31/01/2015
Duration 60 months

Abstract

The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP hierarchy and searched against all completely sequenced genomes using hidden Markov models. The resource contains four main components accessed by end users: a database of over 14 million domain assignments, a library of over 14 thousand hidden Markov models, numerous analysis tools, and a web interface to all of these. The Structural Classification of Proteins (SCOP) database classifies the proteins of solved 3D structure in the PDB. Domains are defined as minimum units of evolution, and are hierarchically grouped into superfamilies and families. There are 3,464 families contained in 1,777 superfamilies, totalling 97,178 domain definitions. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 14 million domains. SUPERFAMILY currently has comprehensive inclusion of genomes, but advances in sequencing technology are rapidly increasing the number that need to be included. The detection and classification of domains in genome sequences is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies each domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not just implement cutting-edge software: its development process also involves creating new algorithms and contributing to the development of HMM technology. The analysis tools are an essential part of the resource, enabling users inexperienced in computational work to share the deeper benefits of data-mining, comparative genomics and visualisation, which are usually accessible only to more expert users.

Summary

The SUPERFAMILY resource detects and classifies protein domains of known structure in genome sequences. Small proteins are a single unit, but larger proteins can be made up of multiple subunits, which we call domains. Domains are modular evolutionary blocks which are assembled into whole proteins via duplication and recombination. X-ray crystallography and NMR experiments provide the 3D structure of proteins at atomic resolution, allowing the domains to be grouped into related families which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool which can detect the presence of domains in the sequences of proteins. Some years ago the first complete genome was experimentally characterised, giving us a list of all the sequences of the proteins which make up that organism. Subsequently the human genome was sequenced, and we now have the complete protein sequences of nearly 1,000 organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. Our knowledge of domain families is not complete, so the assignments from the hidden Markov models cover only about half of the protein sequences, but this is still extremely valuable information. The data produced by the SUPERFAMILY analysis can be used, for example, by biologists working on specific proteins in the laboratory, by larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life. The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website.
Many tools and ways of browsing the data allow different organisms, proteins and domains to be compared, helping researchers to answer biological questions. The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.
Committee Closed Committee - Biomolecular Sciences (BMS)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme