Award details

Expanding the knowledge of structures and functional information through the SIFTS resource

Reference BB/M011674/1
Principal Investigator / Supervisor Dr Sameer Velankar
Co-Investigators / Co-Supervisors Professor Gerard Kleywegt, Dr Maria J. Martin, Ms Claire O'Donovan
Institution EMBL - European Bioinformatics Institute
Department Protein Data Bank in Europe
Funding type Research
Value (£) 515,730
Status Completed
Type Research Grant
Start date 02/03/2015
End date 01/03/2018
Duration 36 months

Abstract

The proposed work will enhance the biological functionality and relevance of the SIFTS resource by extending its value-added annotations to span data from genomes to systems biology. We also plan to carry out the work necessary to ensure long-term robustness and sustainability by consolidating the existing processes and infrastructure. Currently, SIFTS has allowed the enrichment of ~33K UniProtKB protein sequences with structure information through the mapping of PDB sequences onto UniProtKB sequence entries based on sequence identity and organism information. We plan to improve the provision and exploitation of structure information by extending the mapping procedures to include protein isoforms and variants, along with sequences with high residue identity (based on the UniRef90 set). The UniProt database infrastructure will be updated to provide specific sequence annotations for particular isoforms and variants that are currently in free-text format. Other major enhancements include the incorporation of genomic information, including variation data based on unique protein mapping (isoforms/variants), and linking it to Ensembl and ENA identifiers. We also plan to map the residue-level annotation of manually curated ligand binding sites available in UniProtKB to the PDB structures and vice versa. Additionally, we will develop the infrastructure to provide information on interface residues for all assemblies annotated in the PDB, and we will evaluate ways of including this information in UniProtKB entries. PDBe will also provide PubMed identifiers based on text mining of full-text open publications, in collaboration with the Europe PMC team at EMBL-EBI. The procedure to map Pfam annotations to PDB structures will be replaced with one that uses the HMMER server, allowing more up-to-date Pfam cross-reference information for all PDB structures. We will also update the data export mechanism and API to make the enhanced annotations available to our users.

Summary

Over the last decade we have seen a rapid increase in the amount and diversity of biological data. At the beginning of this process, the challenge before the scientific community was to create the infrastructure needed to collect, manage and make these data available efficiently to the research community. This has transformed life-science research into a data-driven scientific field. But the scientific community quickly realized that, apart from having these data available, the real challenge is to add significantly to the biological context of these data and make this knowledge available to researchers. This is especially true for the increasing amount of data on three-dimensional structures of macromolecules. Macromolecular structure data can provide great insights into the functional mechanisms of macromolecules. Integrating them with other biological data can yield a better understanding of life and disease processes, leading to better intervention strategies through the design of new drug molecules. Macromolecular structure data can also be used to predict the effects of genetic variation, found naturally in the population, on the function of macromolecules, again leading to a better understanding of genetic diseases. Providing biological context for the increasing amount of macromolecular structure data is therefore critical if we want to exploit these data and add value to the increasing amount of genomic and proteomic information. The SIFTS resource links macromolecular structure data (archived in two publicly available databases, PDB and EMDB) to their biological context by integrating annotations from different biological databases, mainly through linking them to UniProt, a publicly available database of protein sequences that is at the forefront of protein annotation. The resource was established in 2002 and has evolved over the years by integrating an increasing number of protein-related annotations from different databases.
Before the SIFTS resource was established, every major biological data resource or research laboratory had to establish its own processes and complex infrastructure to derive the information linking macromolecular structure data to other databases. With rapid advances in sequencing technology, an increasing amount of variation and isoform information is now becoming available. It is critical that the SIFTS resource is extended to map these variants and isoforms onto macromolecular structures and that the mappings are made freely available for the benefit of the life-science research community. This will require the SIFTS resource to update its processes and infrastructure to include genomic and variation information for the first time. These data, and the extended annotations for related uncharacterised sequences, will be useful for developing methodologies for predicting structure-function relationships. These considerations and user requests have contributed to the proposed developments. The main objectives of the proposed project are:
1. Enhance the annotations available in the SIFTS resource to include genomic and variation information.
2. Increase coverage of protein sequence space by including isoforms, variants and related uncharacterised sequences.
3. Implement a mechanism to provide sequence annotations specific to isoforms and variation in the UniProtKB database.
4. Develop the necessary infrastructure to include value-added structure-based annotations on ligand binding sites and assembly interface residues.
5. Consolidate the software processes and the database infrastructure for long-term sustainability.

Impact Summary

To remain at the forefront of data-driven life-science research, it is critical that researchers are able to take advantage of the diversity of datasets available to them. Integrating the wide range of knowledge in these datasets and providing ways to deliver it in a harmonized and timely manner are fundamental to achieving this goal. Over the last decade, the challenge for the bioinformatics community has been to find ways to reliably integrate diverse datasets and to find robust ways to transfer value-added annotations from one domain to another to help the knowledge economy. SIFTS is one such resource that, at its core, focuses on the integration of structure- and sequence-based annotations by mapping sequences from macromolecular structure data in the PDB database to sequence information in the UniProt Knowledgebase (UniProtKB). The value of these data is evident from their wide use, mainly by major bioinformatics databases such as RCSB PDB, PDBj, SCOP, CATH, CREDO, PSI-SBKB, PISCES and ProtCID. The data are also central to the EBI strategy for integrating macromolecular structure information in its biological context and are used by all EBI resources (Ensembl, UniProt, PDBe, PDBsum, Pfam, InterPro, IntAct, ChEMBL, ChEBI and Reactome) for integrating structure data. SIFTS is also central to the PDB annotation policies, providing an up-to-date mapping between two widely used scientific resources, PDB and UniProtKB. SIFTS is the only resource of structural data that is updated weekly with each PDB release, and it provides a reliable and robust mechanism for other databases and individual researchers to obtain up-to-date mapping information. This has resulted in an efficient mechanism that spares each database and researcher the duplicated effort of establishing a similar process.
The impact lies in exploiting the data management expertise in PDB and UniProt, achieving efficiency and letting other databases and researchers concentrate on their area of interest while deriving maximum benefit from structural data. To maximise the use of SIFTS, we provide users with data in various formats and services, including XML, comma- and tab-delimited files, DAS servers, and the PDBe and UniProt websites. We plan to provide our users with new and extended REST APIs in SIFTS and UniProt for easy programmatic access to these data. SIFTS data have been widely used by a variety of users, ranging from genome scientists to systems biologists, and from structural bioinformaticians to drug-design communities. The planned enhancements of the resource will extend the benefits for existing users and will engage new users with an interest in predicting structure-function relationships. They will also help researchers develop methods to explain the effects of variants on protein structure and function, leading to a better understanding of life processes. SIFTS data are essential in deriving maximum benefit from macromolecular structure information in the genomic and proteomic context. Some of this research will lead to the design of drug molecules or to a better understanding of genetic diseases, contributing to better health, and/or to the design of efficient enzymes to help industrial processes, directly contributing to the UK economy. Such a combination of skills, in software development and experience with life-science data, is critical if the UK is to remain competitive in the age of the knowledge economy and data-driven biology. Apart from the economic benefits, the proposed work will contribute directly to the professional development of staff. The software developers named on the proposal are experienced software engineers with many years of experience in database and software development.
They will benefit from experience in handling diverse biological datasets and in developing methods to integrate such diverse data.
Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme