Award details

Increasing the Coverage and Accuracy of CATH for Comparative Genomics and Variant Interpretation

Reference BB/R015201/1
Principal Investigator / Supervisor Dr Sameer Velankar
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department Protein Data Bank in Europe
Funding type Research
Value (£) 105,225
Status Completed
Type Research Grant
Start date 11/01/2019
End date 10/01/2020
Duration 12 months

Abstract

The UCL PDRA will spend ~50% of their time maintaining CATH's computational platforms (ie the software, hardware, databases and web services required to process a constantly increasing amount of data), manually validating remote homologues and new folds, and developing programs to generate derived data for CATH-Plus (eg multiple structure alignments, 3D templates). The remaining time will be spent improving the accuracy of CATH data, improving web pages/APIs and building new features:
- Export DomChop Platform to EBI: modify CATH's DomChop platform to run with SCOP data and move it to the EBI (in collaboration with PDBe). This will require removing or replacing all local dependencies (comprising scripts, databases, HPC and web services).
- Expand FunFams: rework the agglomerative clustering algorithm to speed up clustering so that all domain relatives in superfamilies can be regularly clustered into FunFams. Several strategies will be explored, eg using fast, rough clustering (MMseqs2.0) to guide sequence cluster comparisons, improving the throughput of profile comparisons, improving the batching of HPC jobs and using predictions of likely cluster merges. The faster method will enable FunFams of 'Enzyme Units', with new pipelines to identify domains contributing to enzyme active sites.
- Downloadable implementation of CATH-MDA-Annotate: develop a workflow providing external access to CATH tools and data, allowing users to annotate their own sequence datasets (eg full genome annotation). This will take the form of low-dependency, open source software that is easy to download, install and run.
- Expand multiple structure alignments and site characterisation: build software for analysing multiple structure superpositions to identify conserved positions in the buried core or around known or predicted functional sites.
- Extend API for FunSite data: expand the existing FunFam API to include annotations (in Stockholm format) from structure analyses (eg conserved positions in ligand binding pockets).
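As an illustration of the "rough clustering guides the comparisons" strategy mentioned for FunFams, the sketch below runs a greedy agglomerative merge only within pre-computed coarse groups, so that expensive comparisons are never made across groups. It is a minimal sketch under stated assumptions, not the CATH implementation: the function names (`identity`, `cluster_similarity`, `guided_agglomerative`), the toy identity score standing in for a profile comparison, the average-linkage rule and the 0.6 threshold are all hypothetical, and the coarse groups are assumed to have been produced beforehand by a fast tool such as MMseqs2.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions.

    Hypothetical stand-in for a real (and far more expensive) profile comparison.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_similarity(c1, c2, seqs):
    """Average-linkage similarity between two clusters of sequence ids."""
    scores = [identity(seqs[i], seqs[j]) for i in c1 for j in c2]
    return sum(scores) / len(scores)


def guided_agglomerative(seqs, coarse_groups, threshold=0.6):
    """Greedily merge the most similar pair of clusters until none reaches
    `threshold`, comparing clusters only inside the same coarse group."""
    result = []
    for members in coarse_groups:
        clusters = [[m] for m in members]  # start with singletons
        while len(clusters) > 1:
            best_pair, best_score = None, threshold
            for a, b in combinations(range(len(clusters)), 2):
                score = cluster_similarity(clusters[a], clusters[b], seqs)
                if score >= best_score:
                    best_pair, best_score = (a, b), score
            if best_pair is None:  # nothing left above the threshold
                break
            a, b = best_pair  # a < b, so deleting b leaves index a valid
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        result.extend(sorted(c) for c in clusters)
    return sorted(result)


# Invented example data: five domain sequences in two coarse groups.
example_seqs = {
    "d1": "ACDEFG",
    "d2": "ACDEFA",
    "d3": "ACDQQQ",
    "d4": "WWWWWW",
    "d5": "WWWWWY",
}
example_groups = [["d1", "d2", "d3"], ["d4", "d5"]]
print(guided_agglomerative(example_seqs, example_groups))
```

Restricting comparisons to coarse groups reduces the number of pairwise comparisons from all pairs in the superfamily to all pairs within each group, at the cost of never merging relatives that the rough clustering placed in different groups.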

Summary

Evolution has given rise to families of protein domains where relatives are linked through speciation events or duplication events in the same genome. Extensive domain duplication and shuffling gives rise to multi-domain proteins whose functions vary with their domain composition. The CATH classification takes the domain as the primary evolutionary unit and classifies relatives having significantly similar structures and sequence patterns. Currently there are 5500 CATH superfamilies containing 93 million domains. Previous funding allowed us to hugely increase the number of domains in CATH. We want to keep expanding this data, and even bigger expansions are expected as new technologies make it easier to solve structures and capture sequence data. We will improve the accuracy of our domain data by working with other classification experts (Alexey Murzin of SCOP) to establish a shared domain recognition platform for new domains at the European Bioinformatics Institute, with difficult assignments jointly validated by CATH/SCOP experts. This data will be public and valuable for other resources (eg SCOPe, ECOD). CATH has been established for 22 years and is renowned for providing accurate structural annotations for biological analyses. More recently it significantly increased its value to the biology community by providing functional predictions. Although the structural core of a superfamily is highly conserved, variations away from the core cause changes in function. CATH addresses this by grouping evolutionary relatives likely to have highly similar functions and structures into functional families (FunFams). Information about structures and functions can therefore be accurately inherited between relatives in a FunFam. This is important as fewer than 10% of domains have been experimentally characterised.
We verified in silico that FunFams can accurately model structures of uncharacterised relatives, and the ability of FunFams to inherit functional information between relatives has been validated by an international competition, CAFA. We will make the FunFams much more comprehensive and increase the accuracy of FunFams for enzymes. Extending our FunFam library will allow us to predict more accurate multi-domain annotations in genome sequences. This will help biologists comparing the genomes of organisms occupying different environmental niches, as identification of diverse domain combinations can hint at changes in the functional repertoires of the organisms and different abilities to exploit compounds in their environments. Because relatives in FunFams are so structurally conserved, we can align and superpose them to extract the characteristics of this conserved structural core and use this information to build a '3D core-template'. These templates will help solve the structures of many more relatives, since powerful new structural biology techniques (eg cryo-EM) can use core libraries like these to model the structures of uncharacterised proteins from electron dispersion data. In another exciting development for CATH, we will harness the structural data and the additional power that comes from 200-fold greater sequence data to find residue sites in proteins that have been conserved throughout evolution for their functional importance. We will characterise these sites. We already predict functional sites well from conservation patterns in sequence data, but including structural data can help distinguish the type of site (eg a site binding a compound or another protein) and identify additional residues involved in the functional mechanism. This data is valuable for protein design and for understanding why mutations near these sites affect the protein and cause disease.
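To make "conservation patterns in sequence data" concrete, the sketch below scores each column of a multiple sequence alignment by normalised Shannon entropy and reports the highly conserved columns. It is only an illustration under stated assumptions: the summary does not say which conservation measure CATH uses, and the function names (`column_conservation`, `conserved_positions`), the 0.9 cutoff, the 20-letter normalisation and the example alignment are all hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one score per alignment column.

    1.0 means a single residue type in the column; lower means more variable.
    Gaps ('-') are ignored, so a column with one non-gap residue also scores 1.0.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # assumption: normalise by the 20 standard amino acids
    scores = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != "-"]
        if not residues:  # all-gap column carries no signal
            scores.append(0.0)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def conserved_positions(alignment, cutoff=0.9):
    """Zero-based indices of columns whose conservation score reaches `cutoff`."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s >= cutoff]


# Invented four-sequence alignment, for illustration only.
example_alignment = ["HGDK", "HADR", "HSDK", "H-DE"]
print(conserved_positions(example_alignment))
```

A sequence-only score like this cannot say what kind of site a conserved column belongs to; that is the gap the summary expects structural data to fill.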
We will disseminate our data via webpages and other web mechanisms and develop e-videos and training material for the new features. We'll also build more efficient mechanisms for scanning our website and for biologists to install our tools on their own computers to analyse genome data.

Impact Summary

CATH is a world-leading resource for protein domains, unique in combining 3D structures with millions of sequences predicted to belong to CATH families and extensive functional information. We will improve the accuracy of the domain assignments and predicted functional sites, thereby increasing the value of CATH for basic biosciences and the agricultural and biomedical communities. The CATH webpages and webservers are highly accessed, with 33,747 unique visitors per month and ~1.5 million hits per month (ie all files), measured using awstats, which is better than webalizer at distinguishing 'human' users from 'robots'. This is a more appropriate metric than Google Analytics, which uses very strict criteria for 'human' interaction and, more problematically, does not record API interactions at all. Over the last 6 months CATH has served an average of 1 million web pages/month to humans on web browsers. Taking all traffic into account (eg data downloads, API calls, web robots), CATH has served an average of 3.5 million pages/month. The average session duration is up by 10% and the pages per session are up by 5%, demonstrating that users are spending more time on the site and looking at more pages. CATH web pages and scientific data are accessed from 179 different countries, with the top ten being United States (16%), India (12%), United Kingdom (11%), China (11%), Germany (4%), Spain (3%), France (3%), Japan (3%), Italy (2%) and Canada (2%). The original CATH paper has been cited 2653 times and all CATH papers together have been cited 7789 times. CATH has been endorsed as an ELIXIR UK resource (only 5 UK data resources are endorsed) and is the only UK resource with ELIXIR Europe-wide 'Core Resource' status; only 14 resources have similar status across Europe. ELIXIR is a European initiative providing endorsement (but not funding) for computational resources supporting the biology community.
CATH also has impact in directly supplying data to the following resources, accessed by structural, experimental and computational biologists:
- CATH domain structure annotations are used by PDB and provided via the PDBe and RCSB websites. PDBe has ~50,000 unique visitors/month.
- Partner in InterPro: Gene3D structural annotations are disseminated by InterPro, which has ~86,000 unique visitors/month from nearly every country in the world.
- Contributor to UniProt annotations, also widely accessed.
- Partner in the Genome3D resource: an integrated resource of UK structural bioinformatics resources providing structural annotations and 3D models for key model organisms, including human, mouse and representatives from Pfam families. Web access to Genome3D is well distributed across Europe, Asia and the Americas.
The impact of CATH data on biology communities is reflected in the fact that since 2002 CATH has been a partner in 7 EU funded European Initiatives, 2 NIH funded consortia for structural genomics and 2 UK funded initiatives (eFamily (MRC) and the London Pain Consortium (Wellcome Trust)). Current partnerships include the DDIP consortium for developmental fly interactome (BBSRC), Genome3D (BBSRC funded, structural annotations) and FunPDBe (BBSRC funded, functional site annotations). All these projects use CATH data and tools for structural and functional annotations.
Links to Industry: Nearly 20% of CATH's unique visitors per month are from commercial IP addresses. Pharmaceutical companies also use CATH tools for structure analysis (eg the CATH structure comparison tool has been purchased by Celltech, Pfizer India and Lilly). CATH was a founding resource of the UCL company Inpharmatica, involved in predicting structures and functions for proteins via the 'Biopendium'. Inpharmatica was acquired by Galapagos in 2006. Other evidence of impact is given by the range of support letters, including letters from directors of major institutes and centres and from companies undertaking drug design.
CATH has also been widely used to teach students about proteins.
Committee Not funded via Committee
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme