Award details

A Greatly Expanded CATH-Gene3D with Functional Fingerprints to Characterise Proteins

Reference BB/K020013/1
Principal Investigator / Supervisor Professor Christine Orengo
Co-Investigators / Co-Supervisors Professor Gerard Kleywegt, Dr Alexey Murzin
Institution University College London
Department Structural Molecular Biology
Funding type Research
Value (£) 612,409
Status Completed
Type Research Grant
Start date 01/01/2014
End date 31/12/2017
Duration 48 months

Abstract

(1) CATHifier. We will build the CATHifier platform for classifying structures and sequences in CATH-Gene3D (referred to below as CATH). It will comprise a better homologue predictor exploiting more powerful sequence matching (meta-methods), structure matching (meta-methods) and function matching (text mining). We will improve the machine-learning SVM that combines these data. CATHifier will comprise RESTful web services and Cloud-based workflows, and we will make it available via webservers for user queries. The web services will also export CATH data to PDB and InterPro.

(2) More sensitive methods for functional classification. Our functional classification/prediction tool (FunFamer) will be improved, eg by using MDA data, better detecting conserved residues, looking for 3D co-localisation of conserved residues and exploiting conserved 3D motifs.

(3) FunFamer webserver for function prediction. Most proteins are not experimentally characterised, so function prediction is a major aim of the project. High Performance Compute (HPC) strategies will handle the vast datasets biologists are generating using next-generation sequencing. A recent BBSRC T&R pilot ported FunFamer to HPC facilities, ie Amazon and UCL Legion (5500 nodes). We used infrastructure-on-demand services, ie the Amazon EC2 compute cloud and Amazon S3 storage service. Amazon virtual machines can be used in a cluster-like scenario via Sun SGE's 'cloud adapter' software or in parallel. We will improve scheduling for large datasets and explore using Hadoop and related strategies. A major aim will be intuitive web pages displaying functions. No other resources identify structure-based functional families. We will show 3D structures highlighting functional sites conserved in both sequence and structure. We have begun this but more work is needed, eg to make the site more intuitive, align query proteins against FunFams, and display mutations close to functional sites or splice variations modifying function.

Summary

Millions of proteins are being sequenced that have no known function. New CATH methods will predict their functions. Whilst other resources also predict function, CATH-Gene3D (referred to below as CATH) provides unique information on structurally conserved features linked to function. Structure data reveals how proteins perform their function and why the function changes if the protein is modified by mutations or other genetic variations. Protein function information is key to understanding biological systems and, by extension, drug design, protein engineering and disease.

CATH is a world leading resource that classifies proteins evolved from the same ancestral protein into evolutionary families. Currently, CATH classifies 15 million protein domains into 2600 families. Family data is valuable because evolutionary relatives (called homologues) tend to have similar 3D structures and perform similar functions. Thus the benefit of CATH is the ability to infer properties between homologues. This is important because, of the millions of proteins currently known (>20 million), less than 5% have experimentally determined functions. Even in the organism of greatest interest to us, human, <10% of proteins have known functions. Because characterising proteins can be slow and very expensive, it will not be possible to study all these proteins experimentally. Therefore, biologists use CATH to predict the function of a protein based on the family to which it belongs.

Proteins are made up of 'domains' - on average two per protein. These are independently folded entities that act together to confer the function of the whole protein. CATH classifies proteins at the level of the domain and currently classifies ~70% of domains found in nature. Domains are the building blocks of proteins - a few thousand of them are combined in different ways to give the 20 million proteins, or more, in nature. Our group develops methods for predicting domain functions. This allows functions of whole proteins to be deduced from the functions of their constituent domains. Thus functions can be suggested for proteins made from any combination of domains. CATH uses information on the 3D structure of the domain to give more accurate family classifications, as structure is more highly conserved during evolution than sequence. Even more importantly, structure can reveal how the protein performs its function and whether the protein loses its function if a mutation occurs at a particular site.

We will expand CATH by 100%. Since manual validation is very time consuming, we will develop better methods for automatically recognising distant homologues. We will continuously release data (CATH-B), prior to manual curation, so that biologists can benefit from the information much sooner. We will collaborate with the other major structure classification, SCOP, to develop common classification strategies and provide complementary information on families.

We will improve the accuracy of functional inheritance across a family. We need to do this because in some families, especially those occurring more frequently in nature, the functions can change in some relatives. We will improve accuracy by characterising important positions in the domain that are conserved across functionally similar relatives. We can build patterns of these positions to recognise other domains that share such patterns and are likely to have similar functions. We will make it easy for biologists to use our web search tool to determine if a protein belongs to one of these functional families. We will set this up on the Cloud so that biologists can quickly search CATH with the massive datasets they obtain using new sequencing technologies. These technologies capture proteins expressed under different conditions. Our web pages will report their functions and variations in the protein which could modify function, causing disease.

Impact Summary

We will maintain and develop a world leading resource for protein domain structure classification (CATH-Gene3D, henceforth referred to as CATH) which combines 3D structure data, tens of millions of sequences predicted to belong to CATH families and extensive information on protein functions. We will improve the purity of functional classification and thereby increase the value of the resource for basic biosciences and for the agricultural and biomedical communities. CATH already has a very well developed website and this will be extended to provide more detailed information on protein functions and, in particular, residue sites on the protein surface likely to be important for function. The new web pages will therefore inform protein design or rationalise the impacts of genetic variation, eg in different plant or animal strains. For example, a single residue mutation in the Rubisco protein, affecting allostery, can alter the catalytic efficiency of this enzyme in rice and promote survival in arid regions. CATH is already widely used - the website now receives nearly 2 million web page accesses per month from ~61,000 unique visitors, and the original CATH Structure paper has now been cited 1986 times (all CATH publications together have been cited 6413 times).

Communities in which CATH has an impact

Basic bioscience researchers: Evidenced by the fact that CATH is one of the 8 member groups of InterPro - a consortium of major protein family resources at the EBI. Several European networks of excellence (Biosapiens, EMBRACE, IMPACT, ENFIN) included the CATH group to provide structural/functional annotations for genome sequences.

Structural biologists: Evidenced by the fact that major protein structure repositories (PDB) link directly to CATH; a major structural genomics initiative (PSI) in the United States selected CATH as the structural resource for target selection.

Biomedical researchers: Evidenced by the fact that CATH is used to provide information on protein functions, protein networks and the impacts of SNPs for large consortia researching neuropathic pain (London Pain Consortium, Europain).

Other evidence of impact is given by the range of support letters, including letters from directors of major institutes (eg RCSB, EBI), companies undertaking genome annotation (eg Synthetic Genomics) and users of the data.

Research fields in which CATH will have an impact

Agricultural and food security - Protein sequencing initiatives are providing increasing amounts of data for plants, crops, cattle and the bacteria that interact with these hosts and cause damage. The data and tools we will develop (eg information on conserved positions involved in function) will explain variations between strains and help identify suitable strains to improve yields, taste or colour or to cope with environmental conditions, eg drought, pests and pathogens.

Protein design and biotech industries - Modification of proteins in pathways can yield new sources of materials and energy (ie biofuels). New proteins can be designed to build synthetic pathways. The functional family (FunFam) data can be used to constrain conserved structural core positions in the protein and identify positions more tolerant to change and useful for new designs.

Health - Knowledge of structural details in the active sites of proteins and identification of conserved 3D features is valuable for drug design. Another major benefit will be the use of conservation data in FunFams to rationalise the impact of genetic variations (eg SNPs, splice variations) on protein functions and disease susceptibility. This will inform both diagnostic strategies and drug design. The FunFam server will characterise the functional repertoire of metagenomes from human cavities, eg the gut, and thereby help explain the role of commensal bacteria in promoting health.

Other - CATH has been widely used to teach students about protein structure and evolution.
Committee Closed Committee - Biochemistry & Cell Biology (BCB)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme