Award details

IDA2GO - Improving Domain Annotation and Representation within InterPro

Reference BB/K004328/1
Principal Investigator / Supervisor Ms Sarah Hunter
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department Sequence Database Group
Funding type Research
Value (£) 120,338
Status Completed
Type Research Grant
Start date 01/11/2012
End date 30/04/2014
Duration 18 months

Abstract

IDA2GO will improve the annotation and representation of domain information within InterPro. The member databases which make up the InterPro resource each have their own biological focus and signature methodology. InterPro aims to provide a consensus view of their data, but achieving this for protein domain information is complex because each database defines domains differently. Whilst the definitions often overlap, there are many cases where they differ substantially. An imperfect, compromise solution (where some databases' definitions are favoured over others) is currently used to generate domain architectures on the InterPro website. This makes the data difficult to interpret, and it is currently not possible to perform sophisticated analysis on it (e.g. searching for proteins that contain a particular set of domains). We intend to use graph theory to accurately represent InterPro's domain architectures. This would allow an in-depth analysis of the domain information contained within InterPro for arguably the first time. A user-friendly query interface that is tightly integrated into the existing InterPro website will also be produced. In addition, we will collaborate with the Gene Ontology (GO) consortium to improve the annotation of domains and domain architectures using the Gene Ontology. At present, annotation of InterPro's domains is relatively sparse (both in coverage and depth of annotation) compared with the annotation of protein families. This is due to the inherent difficulties in annotating domains, as they are frequently found in different functional contexts. We will mitigate this by manually mapping GO terms to domains with additional qualifiers describing how the domain contributes to the protein's function. We will also perform an automatic mapping of GO terms to the domain architectures produced in the first part of the project. Together, these approaches will greatly improve the coverage and utility of InterPro2GO.
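
As an illustration only, the kind of graph-based representation and "contains a particular set of domains" query mentioned in the abstract might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual data model: the protein and domain identifiers are hypothetical placeholders, and the choice of an adjacency graph plus an inverted index is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical toy data: protein ID -> ordered list of domain IDs (N- to C-terminus).
# These identifiers are placeholders, not real InterPro or UniProtKB accessions.
ARCHITECTURES = {
    "P1": ["DOM_A", "DOM_B", "DOM_C"],
    "P2": ["DOM_A", "DOM_C"],
    "P3": ["DOM_B", "DOM_C", "DOM_A"],
    "P4": ["DOM_D"],
}


def build_graph(architectures):
    """Return (edges, index) for a set of domain architectures.

    edges: (upstream domain, downstream domain) -> proteins with that adjacency
    index: domain -> proteins containing that domain
    """
    edges = defaultdict(set)
    index = defaultdict(set)
    for protein, domains in architectures.items():
        for domain in domains:
            index[domain].add(protein)
        # Each pair of neighbouring domains becomes a directed edge.
        for left, right in zip(domains, domains[1:]):
            edges[(left, right)].add(protein)
    return edges, index


def proteins_with_all(index, wanted):
    """Proteins containing every domain in `wanted`, in any order."""
    sets = [index.get(domain, set()) for domain in wanted]
    return set.intersection(*sets) if sets else set()


edges, index = build_graph(ARCHITECTURES)
print(sorted(proteins_with_all(index, ["DOM_A", "DOM_C"])))  # ['P1', 'P2', 'P3']
print(sorted(edges[("DOM_B", "DOM_C")]))                     # ['P1', 'P3']
```

The inverted index answers set-membership queries, while the edge map preserves the order in which domains occur, which a plain set of domains per protein would lose.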

Summary

Protein domains are discrete, stable structures within proteins. They typically form distinct operational units with responsibility for specific functions, such as binding a given molecule or catalysing a specific step in an enzymatic reaction. To fully understand a protein's biological role, it is necessary to understand domain distribution, evolution and function. The core concept of InterPro is that if two proteins look similar (structurally and/or at the sequence level), there is a strong possibility that they will have a similar or identical function. The similarities and differences between proteins that have the same function or structure can be modelled; InterPro calls the resultant predictive models "signatures". InterPro uses signatures from several different databases (each of which has a particular niche or biological focus) to predict information about proteins. InterPro integrates signatures if they appear to represent the same protein family, domain or site. In addition, concise information about the signatures and the types of proteins they match is added, including terms from the Gene Ontology (GO), a controlled vocabulary that is used to describe biological functions, processes and the subcellular localisation of genes in a standardised way. InterPro regularly calculates the presence of domains in sequences from the UniProtKB protein knowledgebase. It makes this information available through websites and software tools. However, the manner in which these data are displayed and calculated is sub-optimal and can lead to confusion for the biologists attempting to use them. Similarly, because domains can be found in proteins which have quite different overall functions, it is difficult to accurately annotate individual domains with GO terms.
The IDA2GO project intends to improve the way that domains are represented and annotated within the InterPro database so that scientists are able to utilise these data for the functional annotation of genomes, for the discovery of novel domains and for a better understanding of how proteins evolve.

Impact Summary

The InterPro database has a large number of users of both its website (~50,000 unique IPs served per month) and the InterProScan search software (21.3 million searches performed at EBI in 2011 alone). This user base comprises both academic and commercial scientists with a range of research questions. The biggest "traditional" usage of InterPro has been the high-throughput functional annotation of genome sequencing projects. InterPro has the benefit of a comprehensive set of protein signatures for predicting protein function and sequence features, as well as trusted annotations via the association of Gene Ontology terms. The IDA2GO project promises to benefit these users considerably. The domain architecture data that will be generated could be used in quality control for gene coding predictions and for transfer of annotation between orthologs. The new GO term associations for domains and domain architectures should increase coverage of functional annotation of gene products. The data within InterPro covers many areas of taxonomy and, as such, can be utilised by a wide range of biologists, including crop researchers, drug developers and microbiologists. For those researchers covering novel areas of biology (e.g. metagenomics), where proteins are not as thoroughly functionally characterised as in more established areas, domain information is of particular utility: even though a protein's particular function may be unknown, the presence of a well-understood set of individual domains can give insights into its potential role. It is hoped that the project will also have benefits for evolutionary biologists, particularly those studying how domains shuffle and adapt over time and across species. Presenting domain architectures in a quick, easy-to-use graphical interface should allow better exploitation of the wealth of information held within InterPro.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Structural Biology, Technology and Methods Development
Research Priority Technology Development for the Biosciences
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme