Award details

18-BBSRC-NSF/BIO: CIBR: Implementing an explicit phylogenetic framework for large-scale protein sequence annotation

Reference BB/T010541/1
Principal Investigator / Supervisor Dr Maria J. Martin
Co-Investigators / Co-Supervisors Dr Alex Bateman, Dr Robert Finn
Institution EMBL - European Bioinformatics Institute
Department MSCB Macromolec, structural and chem bio
Funding type Research
Value (£) 402,187
Status Completed
Type Research Grant
Start date 30/03/2020
End date 29/03/2023
Duration 36 months

Abstract

Advances in DNA sequencing technologies are rapidly expanding our knowledge of protein sequences, but only a small fraction of these proteins has been experimentally characterised. The UniProt protein knowledgebase aims to maximise the utility of protein sequence data to the scientific community: it not only presents the sequences but also provides "annotations", i.e. data that help to infer functional information about those sequences, such as predicted active sites. The current approach to large-scale annotation of proteins in UniProt, called UniRule, relies on ad hoc rules to define sets of proteins that should be annotated similarly. While these rules implicitly utilise information about evolutionary relationships (e.g. membership in a protein family), they do not model functional evolution explicitly and are thus limited in the specificity of annotations they can express: for example, a protein family may have two or more distinct subtypes that perform related but distinguishably different functions. Here, we propose to implement an explicit evolutionary approach to large-scale sequence annotation. We will build upon previous work (1) on evolutionary modelling of gain and loss of protein functions (represented as Gene Ontology terms, GO) in gene families; (2) on software to reconstruct the evolutionary history of any arbitrary protein sequence by placing it in the context of a phylogenetic tree. This will allow the specificity of the annotations transferred to be decided on the basis of the evolutionary difference between the characterised protein and the protein to be annotated. Broadening the range of annotations, coupled with increasing numbers of sequences, will present key technical challenges that will be addressed during the course of this work. Overall, implementing this approach within UniProt will integrate the large-scale annotation systems already used in the UniProt and GO projects and result in increased specificity and coverage of annotations in UniProt.

Summary

Proteins are the primary molecular machines that perform the instructions encoded in our genomes. Proteins ultimately shape the response of our cells, tissues, organs, and bodies to the surrounding environment, either directly (e.g. muscle contraction) or through their functional outputs (e.g. the electrical signals along dendrites that produce a nerve impulse or action potential). Therefore, understanding the functional role(s) performed by each protein is critical to research and development in many areas of science, particularly biology, medicine and applied biotechnology. The rapid increase in throughput of next-generation sequencing technologies has important ramifications, in that our ability to sequence an organism's genome and determine the proteins it encodes far outpaces our ability to experimentally characterise the function of a protein. Thus, for every functionally characterised protein, there are now many thousands of proteins that will never be experimentally characterised. Molecular biology increasingly relies on our ability to computationally group related sequences and to transfer functional annotations from the few experimentally characterised proteins to related, yet uncharacterised, proteins. Knowledge about proteins has been collected and stored in public databases like UniProt, a world-leading resource on protein sequences and function. Currently, there are over 150 million sequences in UniProt, with the number doubling every two years. Therefore, it is crucial to develop new and reliable computational methods for inferring protein function that can be scaled to billions of sequences. We aim to implement an annotation system that incorporates evolutionary information, permitting the level of annotation transfer to be tuned accordingly, while also ensuring scalability and speed of annotation that meet current and future demands.
This new annotation system will integrate the most innovative features present in two pre-existing methods that are currently used in producing world-class resources. The Gene Ontology (GO) Consortium has developed software for explicit evolutionary modelling of GO annotation gain and loss along specific branches of phylogenetic trees, and has applied it to inferring GO annotations for experimentally uncharacterised proteins. UniProt has developed the UniRule system, which applies annotation "rules" that combine information on protein families and domains (from the InterPro resource) with a range of other types of information, such as taxonomy, to make more precise and informative annotations. Our goal is to create a next-generation, large-scale annotation system that merges the two approaches, and to implement this annotation system in the UniProt resource, thereby increasing the quality of functional annotations in the database for the benefit of the scientific community. We propose three specific aims to achieve this goal: (1) convert existing UniRule rules into explicit evolutionary models, (2) integrate software to apply the evolutionary models (TreeGrafter) into the UniProt annotation pipeline, and (3) develop software for ongoing curation of new evolutionary models of additional annotation types and protein families. The result will be an annotation pipeline based on explicit evolutionary principles, which will enable seamless sharing of information between the UniProt and GO curation processes, and substantially improve the accuracy, comprehensiveness and informativeness of inferred protein annotations in public databases.
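
As a purely illustrative sketch (not the GO Consortium's software, TreeGrafter, or UniRule), the idea of modelling annotation gain and loss along branches of a tree can be shown with a toy example. The `Node` class, the `propagate` function, the tree shape and the annotation labels below are all hypothetical and chosen only for illustration: each leaf inherits the terms of its ancestors, plus any terms gained and minus any terms lost on the branches leading to it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy gene-family tree (hypothetical structure)."""
    name: str
    gained: set = field(default_factory=set)    # terms gained on the branch leading to this node
    lost: set = field(default_factory=set)      # terms lost on the branch leading to this node
    children: list = field(default_factory=list)


def propagate(node, inherited=frozenset()):
    """Return {leaf name: inferred terms}, applying gains and losses from root to leaf."""
    # A loss on a branch overrides anything inherited or gained on that same branch.
    terms = (set(inherited) | node.gained) - node.lost
    if not node.children:
        return {node.name: terms}
    result = {}
    for child in node.children:
        result.update(propagate(child, frozenset(terms)))
    return result


# Made-up family: one function gained at the root, refined in subfamily A,
# and lost in subfamily B.
tree = Node("family_root", gained={"hydrolase activity"}, children=[
    Node("subfamily_A", gained={"substrate X binding"}, children=[
        Node("protein_1"),
        Node("protein_2"),
    ]),
    Node("subfamily_B", lost={"hydrolase activity"}, children=[
        Node("protein_3"),
    ]),
])

annotations = propagate(tree)
for protein, terms in sorted(annotations.items()):
    print(protein, sorted(terms))
```

In this toy model, members of subfamily A receive the more specific annotation while the member of subfamily B does not inherit the family-level function, which is the kind of subfamily-level distinction that a family-wide rule alone cannot express.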

Impact Summary

The field of protein research has witnessed an explosion in novel protein sequences due to advances in sequencing technologies. However, our ability to understand their role(s) within a cell largely relies on our ability to functionally annotate them. In modern molecular biology, the vast majority of functional annotations are made computationally by identifying similarities between the few experimentally characterised sequences and uncharacterised ones, followed by the transfer of annotations. This proposal aims to build on two recent developments in the field of computational functional inference: (i) evolutionary modelling of Gene Ontology (GO) annotations using phylogenetic trees; (ii) annotation "rules" applied by the UniRule system to combine information on protein families with other data types like taxonomy. Merging these two approaches within the UniProt production framework will increase the number and quality of functional annotations, benefiting the entire scientific community. The impact of this project will be facilitated by the high profile of both the UniProt and GO resources that provide the distinct annotation functionalities. This impact is measurable in terms of the numbers of users, citations, and educational materials. The UniProt website receives 700,000 visitors and 6 million hits per month. Similarly, the GO resource has been cited over 70,000 times. One of the major impacts expected from this project will be the increased accuracy, comprehensiveness and informativeness of inferred UniProt annotations, including GO annotations. This will benefit the broad user community of UniProt, encompassing academia and biotechnological industries. The improved functional annotations will help academics ranging from evolutionary biologists to those studying pathogenesis.
To ensure dissemination of the proposed new developments to users, we will create a new online training module focused on the use of evolutionary models in bioinformatics to explore the basic principles of molecular evolution that enable adaptation of gene functions within a gene family. We will also investigate the provisioning of online "learning pathways", which will group different combinations of training modules according to a user's research background. People at over 400,000 unique IP addresses have accessed the EMBL-EBI online training resource, amply demonstrating the demand for bioinformatics training. The project outputs will be of exceptional value to the commercial sector as well, eventually benefiting the public. For example, improved annotations of proteins will lead to the discovery of novel antibiotics for humans and livestock, and higher agricultural yields by helping to understand the mechanisms underlying abiotic resistance. More precise sub-family annotations provided by this work will facilitate discovery of novel enzymes with distinct functionalities, i.e. the ability to break down alternative substrates or to substitute for processes that are currently performed via chemical (organic) synthesis. We will also disseminate project outputs to academic and industrial audiences via the publication of software, data, and peer-reviewed articles. We will leverage our professional networks and collaborations, conference platforms and social media channels to further publicise key developments. The public sector will also be engaged, via specific events and the publication of non-specialist articles and interviews. This transatlantic project involves four staff members directly, as well as other staff members within the three participating groups. It will expose all teams to different ways of working, as well as strengthen international collaborations, especially in the field of bioinformatics.
All staff members will receive relevant scientific training and opportunities to develop their interpersonal skills.
Committee Not funded via Committee
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme