Award details

Keeping pace with protein sequence annotation; consolidating and enhancing Pfam and InterPro's methodologies for functional prediction

Reference BB/L024136/1
Principal Investigator / Supervisor Dr Robert Finn
Co-Investigators / Co-Supervisors Dr Alex Bateman
Institution EMBL - European Bioinformatics Institute
Department Sequence Database Group
Funding type Research
Value (£) 545,328
Status Completed
Type Research Grant
Start date 01/08/2014
End date 31/07/2017
Duration 36 months

Abstract

Pfam and InterPro are two widely used databases containing thousands of protein signatures. Both databases provide websites and services so that user-submitted protein sequences can be searched for identification of conserved functional modules. In this proposal, we intend to improve the accuracy of functional annotation provided by Pfam and InterPro by annotating catalytic and ligand-binding residues for sequences in Pfam, and offering on-the-fly functional residue predictions as part of the InterProScan software. We will also use iPfam to expand the protein interaction information in both resources to the residue level. The domain- and ligand-binding data will be used in combination with other signatures to improve the accuracy of GO term assignment via the InterPro2GO pipeline. We will apply new approaches to expand and improve existing Pfam families, and annotate and integrate these families into InterPro, together with signatures from other member databases, improving and extending annotative coverage. Methods for calculating, storing and propagating this additional tier of functional residue information in Pfam and InterPro will be developed, with future computational scalability key to the design. Existing web interfaces will be extended to enable discovery of these new data. Open source libraries for the graphical representation of the data will also be produced and shared. Mechanisms for producing meaningful, representative multiple sequence alignments for displaying functional residue data will be designed. We will implement a range of web services both to provide large-scale, programmatic access and to facilitate data exchange between the two databases and source databases. The strong links between InterPro, the Gene Ontology and UniProt ensure that all annotations produced as a result of this project will be propagated to a large spectrum of protein resources, thus improving researchers' capability to predict protein function.

Summary

New technologies, developed in the last few years, have greatly increased the amount of biological sequence information that it is possible for laboratories to produce. As a result, there is now a very large and ever-growing amount of sequence data entering public databases. The overwhelming majority of these sequences have not been examined by scientists, nor is there any experimental information to suggest what their function might be. The Pfam and InterPro resources help plug this gap, using probabilistic models to predict the function of proteins by examining their amino acid sequences. Pfam is arguably the best known, and one of the largest, producers of such models. InterPro, meanwhile, does not produce models directly, but takes them from Pfam and 10 other complementary databases, integrating them and adding functional information. InterPro is regularly run against the full contents of the main public repository for protein sequences, the UniProt Knowledgebase, so that its functional predictions can be transferred. In order that InterPro and Pfam can continue to cover the growing number of sequences and remain accurate in their predictions, new models need to be made and integrated, existing models need to be checked and the proteins that they match evaluated. One aim of the project is to support this effort. Another aim is to look at other prediction methods, not currently used by either Pfam or InterPro, that identify the individual amino acids in a protein sequence that are responsible for the protein's functions. We will add this functionality to the resources and use it to make their predictions more accurate. This will in turn improve the quality of information associated with large numbers of proteins in the UniProt Knowledgebase. Adding to the resources in this way will require changes to some of the underlying software.
At the same time, we will update the InterPro and Pfam websites, so that users can easily see the new and improved data, and understand what it means. Finally, we will prepare and organise training materials and courses to introduce new users to the resources and educate existing users about the new and updated features.

Impact Summary

Pfam and InterPro are long-established bioinformatics resources that are widely used to predict the function of protein sequences. Commercial and academic scientists with a wide variety of research focuses (e.g. human, animal and plant health) use both resources. In particular, these services are regularly used in the annotation of genomes and metagenomes. Data produced by InterPro and Pfam are consumed by a number of internationally important databases, such as Ensembl, Ensembl Genomes, UniProtKB, and model organism-specific databases (including Vectorbase, Pombase, Flybase, Wormbase, TAIR and MGI). These databases, in turn, serve many hundreds of thousands of users on a monthly basis. There are also a variety of widely used analysis platforms (such as DAVID, Blast2GO and CDD) that incorporate InterPro and Pfam's data and/or search software. Additionally, as evidenced by the large number of sequence searches and visits to Pfam and InterPro's websites, a significant number of users also choose to access these resources directly. One vital impact of the project will be the continued provision of annotation for sequences entering public protein databases, in the face of ever-increasing data volume. Improved annotation of proteins by adding new, residue-level function prediction methodologies to Pfam and InterPro is also critical, since it will allow very fine-grained analysis of proteins (for example, distinguishing catalytically inactive enzymes from their active counterparts). This will improve accuracy of annotation, and help to remove misleading information from the public databases. Augmenting users' capabilities to visualise relevant functional traits on sequences and multiple sequence alignments is highly important, since it will allow them to scrutinise conservation of such traits across families, which will help power annotation transfer based on homology. The benefits of this project will be felt almost immediately.
While InterPro has a two-month release cycle, it provides annotation for UniProtKB on a monthly basis. This project would feed into that annotation pipeline. In the medium term, further benefits will come from the inclusion of improved and/or new families into Pfam and from their subsequent integration into InterPro. This will be supplemented by the addition of the novel residue-level annotations and interaction data. Users will be able to explore these data more effectively as the modifications to the user interfaces and viewers are implemented. There will also be long-term benefits, in that the new functionalities added to the resources will continue to be offered following the project's completion. Erroneous annotation already in the public databases will be corrected by the inclusion of more accurate data produced by this project. It is anticipated that we would employ two different types of scientist to work on this project. Firstly, a scientific data curator would be required in order to add to and improve the content of both resources. Data curation is a highly specialised career, but the skills learned as a curator can be transferred to other sectors. For example, curators gain exceptional scientific writing skills and typically attain the ability to precis complex scientific information into a format easily understandable by others, without losing accuracy. These skills are particularly useful in positions requiring scientific (or other) communication. Curators also gain data management and mining expertise, which can be useful in a range of jobs, not limited to scientific fields. A software engineer would also be employed to implement necessary changes to the infrastructure. This may necessitate training in programming languages and software frameworks. Both staff members would be expected to learn how to present their work to others, regardless of their audience's background knowledge or expertise.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme