Award details

Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam

Reference BB/S020039/1
Principal Investigator / Supervisor Professor Christine Orengo
Co-Investigators / Co-Supervisors
Institution University College London
Department Structural Molecular Biology
Funding type Research
Value (£) 26,806
Status Current
Type Research Grant
Start date 01/05/2020
End date 30/04/2024
Duration 48 months

Abstract

InterPro and Pfam are preeminent complementary resources in the field of protein research. InterPro draws its information from a compendium of 13 expert member databases, including Pfam, enabling classification of protein sequences into families and prediction of functional domains and sites. Pfam generates protein families, with each curated entry represented by an alignment and a profile hidden Markov model (HMM). In light of the sheer volume of novel protein sequences being constantly discovered, especially through metagenomics, this proposal sets out key developments to further improve the functionality and scalability of these resources. We will enhance coverage of environmentally derived sequences (MGnify database, Tara Oceans and MMETSP projects) by generating families for the largest novel sequence clusters. We will incorporate de novo structural models and produce deep sequence alignments (using metagenomics sequences) necessary for the detection of co-evolutionary residues, which in turn will be used for structural modelling. The websites will facilitate visualization of these structural models and display co-variance contact sites. We will use a combination of known structures and models to classify additional Pfam entries into clans, as well as review domain boundaries. To increase InterPro coverage and functional annotations, we will integrate new resources (CATH FunFams) to provide sub-domain classifications, improve annotations (especially domains of unknown function) and maximise member database integrations. To enable scaling and refine annotations, we will adopt a new algorithm (TreeGrafter) in InterProScan, harmonise PANTHER and FunFams-based Gene Ontology terms within InterPro, and evaluate the performance of an upgraded version of the HMMER software. Finally, we will focus annotation efforts on eight genomes of agricultural importance, including chicken, salmon, and wheat, generating thousands of Pfam and InterPro entries.

Summary

Proteins are biological macromolecules that perform a diverse array of crucial functions, from enzymes (e.g. the entities responsible for fermentation) to transporters (e.g. hemoglobin in the blood) to mechanical structures (e.g. actin and myosin in muscle). Proteins are synthesized as linear polymers of building blocks called amino acids. They usually fold into complex three-dimensional (3D) structures, and typically interact with other proteins and molecules to perform their function. Knowledge of protein sequences can facilitate insights into hitherto undiscovered enzymes with potential applications in the biotechnology sector, or novel drugs of interest to the pharmaceutical industry. Detailed understanding of the functional architecture of proteins, including the arrangement of amino acids in a 3D structure, enables scientists to diagnose diseases as well as design more effective enzymes. These days, our ability to generate new protein sequences based on modern high-throughput DNA sequencing (HTS) techniques far outstrips our ability to functionally characterise them. Thus, most sequences are annotated computationally: similarities between new sequences and the few experimentally characterised examples are identified and used to infer function. More recently, HTS has been applied directly to environmental samples to discover previously uncultured bacteria and single cell eukaryotes, and to enable the reconstruction of large and complex genomes, such as those of plants. Such approaches are correcting many of the historical biases in the protein sequence databases. However, for humankind to understand and utilise these data, sequences need to be functionally annotated, which is best accomplished using the information gleaned from sets of related sequences (known as protein families).
InterPro is a world-leading protein family resource that merges information from 13 different specialist databases to present the user with comprehensive functional analysis of sequences. One of its member databases, Pfam, is a collection of protein domain families containing functional annotations. Both InterPro and Pfam are well-established primary resources in the field of protein research. In this application, we propose crucial developments to both of these resources in order to augment their utility, functionality and scalability, as well as uniquely position them to tackle imminent advances in the field. We will leverage pre-established links with other protein databases and concurrently build additional pipelines to develop and exchange the latest information between these existing and new resources. We will improve coverage of protein sequences originating from environmental sources by building families for novel sets (or clusters) of related proteins. Considering the fundamental association between protein structure and function, we will develop a pipeline that will not only import structural models for Pfam entries and present them via the website, but will also ensure that the models remain up to date. To increase coverage and functional annotations in both resources, we will integrate new resources to provide sub-domain classifications, and improve annotations through combined literature searches and enhanced curation tools. To refine annotations, we will adopt a new algorithm called TreeGrafter in InterProScan (our software package that performs automatic annotations of protein sequences), and integrate controlled vocabularies for protein attributes from databases like PANTHER with those already in InterPro. We will evaluate the performance of an upgraded version of the HMMER software that is widely used to build protein families, including Pfam, to improve future scalability.
Finally, we will focus on eight genomes of agricultural importance, including chicken, salmon, and wheat, by systematically annotating 2000 associated entries in Pfam and, by extension, InterPro.

Impact Summary

The field of protein research has witnessed an explosion in novel protein sequences due to advances in sequencing technologies. However, these sequences are meaningless without functional annotation. This proposal focuses on the world-leading protein databases, InterPro and Pfam, which are routinely used for protein annotation. Due to their extensive use by researchers worldwide, this application will impact most BBSRC strategic priorities - especially agriculture and food security, industrial biotechnology, and bioscience for health. To maximise the impact of these resources, we propose to exploit multiple computational approaches to (i) improve annotation of metagenomics datasets and eukaryotic marine microbes; (ii) provide co-evolutionary structural models for Pfam entries using deep alignments to build additional models and permit their visualization; (iii) integrate and improve annotations from current and new InterPro databases, such as PANTHER, CDD, and CATH FunFams; (iv) improve scaling and refine annotations by adopting new algorithms and software, like TreeGrafter and HMMER4, and reconcile Gene Ontology terms across databases; (v) systematically annotate eight genomes of agricultural importance. These developments will ensure users in the UK and the world over can derive the maximum benefit from these resources while further cementing their position as exceptional databases of immense importance to the scientific community at large. Developing new pipelines to build new entries for proteins derived from metagenomics provides a unique exploitable opportunity for InterPro and Pfam. The fact that these resources will extend coverage of marine eukaryotic microbes will have significant, far-reaching impacts on other fields and analytical disciplines.
This is especially true for the UK Darwin Tree of Life project, which forms part of a global initiative to sequence all eukaryotic species, aiming to revolutionize our understanding of biology, evolution and biodiversity. However, this will only be realised through detailed and accurate functional annotation, such as that provided by InterPro and Pfam. The agricultural sector represents another area of considerable impact. Providing comprehensive functional annotations for proteins from widely farmed animal and plant species in the UK and worldwide will facilitate insights into the molecular basis of biological features including yield characteristics, capacity to resist disease and tolerance to the vagaries of nature. This will lead to socioeconomic benefits, through maximising land utilisation for growing crops such as wheat and sugar beet (the latter providing nearly 30% of the world's annual sugar production and forming an important source for bioethanol and animal feed), or enhancing the global aquaculture market, projected to reach $20 billion by 2022, where salmon is a substantial component. Furthermore, the project outputs will be of exceptional value to the commercial sector, eventually benefiting the public. Improved annotations of proteins originating from microbes will lead to new discoveries, such as novel antibiotics for humans and livestock, higher agricultural yields from the understanding of ecological interplay (e.g. food chain microbes), and expanded discovery of novel enzymes (e.g. psychrophilic enzymes for detergents) or those with novel catalytic functionality. We will ensure impact on all academic and industrial audiences by the publication of software, data, and peer-reviewed articles. To ensure that resource developments are disseminated as widely as possible, we will deliver onsite training and webinars, participate in community workshops, and produce online training materials.
We will leverage our professional networks and collaborations, conference platforms and social media channels to further publicise key developments. The public sector will also be engaged, via specific events and the publication of non-specialist articles and interviews.
Committee Research Committee B (Plants, microbes, food & sustainability)
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme