Award details

Embracing new technologies to streamline, improve and sustain InterPro and its contributing databases

Reference BB/F010508/1
Principal Investigator / Supervisor Dr Rolf Apweiler
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department Sequence Database Group
Funding type Research
Value (£) 678,472
Status Completed
Type Research Grant
Start date 01/04/2008
End date 31/12/2012
Duration 57 months

Abstract

InterPro is an integrated documentation resource for protein families, domains and functional sites, which unifies results from 10 major signature databases into a single resource. The integration process and domain/family annotation are done manually by biologists, ensuring high standards of data quality and consistency. The accompanying software, InterProScan, integrates the individual searching and post-processing algorithms into a single package. InterProScan data is supplemented with GO annotations using InterPro2GO mappings, making it a powerful protein functional classification tool. The data and tools are currently accessible for searching via a Web interface and downloading from the FTP site. Although already used extensively by the scientific community, InterPro and its contributing databases have a number of internal and external limitations. Internally, they suffer from a lack of funding, which stunts the growth and further development of the databases. Externally, the core databases need to keep up with new technologies, provide links to new databases, and continually improve the interface and data accessibility for their users; currently, this is not being done. This project aims to streamline the current data-production procedures for InterPro and its member databases, improve coordination of activities to make better use of resources, and ensure that new technologies are embraced to drive the project into the future. These activities will enable the databases to provide new data to the public more rapidly, improve and speed up protein match production with InterProScan, and enhance data access through improved Web interfaces and Web services. The latter will provide much-needed programmatic access to the data, which will facilitate more complex data analyses, and thus more efficient use of the wealth of scientific content held within the databases.

Summary

New DNA sequencing technologies have led to a flood of new data being submitted to sequence databases by individual scientists, genome sequencing projects and metagenomics projects. These sequences enter the databases with little or no annotation, limiting their usefulness to the scientific community. This has inspired the development of new tools for automatic annotation of the encoded protein sequences. One of the most successful developments in this area has been the production of so-called protein 'signatures', diagnostic methods that are able to characterise newly determined sequences in terms of the protein families to which they belong and/or the structural or functional domains they contain. Protein signature approaches have been adopted by a number of databases, and ten of the top such resources are integrated into the InterPro database. InterPro, together with its accompanying protein analysis software tool, InterProScan, is now one of the leading protein functional classification resources in the world. However, despite its success, InterPro and its partners are currently suffering from a lack of financial support. The level of funding required to maintain and improve a database of this size is often underestimated. The amount of incoming data is increasing exponentially, and databases now struggle to provide their data to the public in a timely way, while at the same time maintaining the necessary high standards of data quality. Moreover, as they become more popular, and user demands increase, these core databases endure mounting pressure not only to keep up with the expanding volume of data and growing community requirements, but also to be early adopters of newly emerging technologies. This proposal aims to resolve these issues by embracing new technologies to enhance and further develop InterPro and its source databases. It aims to streamline production processes both to provide more regular data releases and to better cope with increased volumes of data.
With more formalised Consortium activities and coordination thereof, we will make more efficient use of resources and share tasks to ensure long-term sustainability of the databases. Specifically we aim to:
- Streamline data production procedures to enable a faster turn-around time for releasing the data;
- Develop and integrate new annotation tools and standards to make the rate-limiting annotation step quicker and easier, and share tasks, such as annotation, to remove redundancy in effort;
- Work closely together to improve quality-assurance procedures for protein matches;
- Coordinate the upgrade of InterProScan and other HMM-based databases to the latest HMMer version;
- Improve the InterProScan protein domain-finding software;
- Exploit new technologies for database linking and data exchange; and
- Extend the functionality of the Web interface to better meet the needs of the user community.

The planned improvements to InterProScan and the protein match procedures will improve the quality, as well as the speed of protein functional classification; streamlining the production processes will enable the databases to get new protein domains and families out to the public as soon as they become available. New technologies will facilitate easier linking between different databases, and will provide the public with access to data from different sources. They will also open the door to more complex analyses, by providing improved programmatic access to the data. In addition, these new processes and technologies will allow InterPro and its member databases to cope with the ever-increasing flood of new data and make it accessible to the public in more regular releases. Ultimately, these improvements will make InterPro and its partners easier and more efficient to maintain, paving the way to a more sustainable future and increasing their benefit and usefulness to the scientific community.
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme