Award details

Machine Learning Approaches to Predict Enzyme Function

Reference BB/I00596X/1
Principal Investigator / Supervisor Dr John Mitchell
Co-Investigators / Co-Supervisors
Institution University of St Andrews
Department Chemistry
Funding type Research
Value (£) 265,111
Status Completed
Type Research Grant
Start date 01/09/2011
End date 31/12/2014
Duration 40 months

Abstract

The key idea in our work is to identify the reaction mechanism, if any, catalysed enzymatically by a protein structure. Here, the reaction mechanisms are the 260 distinct entries in MACiE. The possible predictions are that the enzyme catalyses a given one of these reactions, or that it catalyses no enzyme reaction in our knowledge base. Our work, including a study of convergently evolved analogous pairs of enzymes, suggests that the full stepwise chemical reaction mechanism contains information critical to recognising similarities between enzymes. Our main machine learning method is Random Forest, which is simply a forest made up of many different randomly generated decision trees. Randomness is introduced in two ways. Firstly, each tree is based on a bootstrap sample of size N drawn from the N known proteins, chosen with replacement, so that some proteins appear more than once and others not at all in the set from which a given tree is built. Secondly, the descriptor used to make the split at each node is chosen from a small random subset of the descriptors, drawn afresh for each node. Once grown, the trees are used to make predictions for unseen data. Random Forest can predict either a categorical or a continuous variable. Here, our interest is in classification; the class assigned to a new protein is the one that receives the most votes from the trees in the forest. After predicting the reaction mechanism, we will apply chemoinformatics, docking and Ultrafast Shape Recognition to suggest substrates for each enzyme reaction identified. Docking is a computational filter, reducing the number of candidates by more than an order of magnitude. Rescoring will use our novel Random Forest-based RF-Score function. We will use fingerprint-based chemoinformatics methods to retain only molecules with the correct chemical functionalities needed to undergo the reaction mechanisms identified, and Ultrafast Shape Recognition as a scaffold-hopping method to identify molecules of suitable shape.
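The two sources of randomness and the voting step described in the abstract can be illustrated with a minimal Python sketch. It is not the project's software: the function names, the data set size, the descriptor counts and the class labels (written in the style of MACiE identifiers) are all hypothetical, and no actual trees are grown. It only shows a bootstrap sample of N out of N proteins, a fresh random descriptor subset for one node, and the majority vote across trees.

```python
import random
from collections import Counter


def bootstrap_sample(n_items, rng):
    """Draw n_items indices from range(n_items) with replacement."""
    return [rng.randrange(n_items) for _ in range(n_items)]


def random_descriptor_subset(n_descriptors, subset_size, rng):
    """Pick a fresh random subset of descriptor indices for one node split."""
    return rng.sample(range(n_descriptors), subset_size)


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data set size; the abstract only calls it N.
n_proteins = 1000
sample = bootstrap_sample(n_proteins, rng)
left_out = n_proteins - len(set(sample))
print(f"{left_out} of {n_proteins} proteins do not appear in this tree's sample")

# Hypothetical descriptor counts: 7 candidates out of 50 at this node.
subset = random_descriptor_subset(n_descriptors=50, subset_size=7, rng=rng)
print("descriptors considered at this node:", sorted(subset))

# Hypothetical votes from five trees; the labels are illustrative only.
votes = ["M0002", "M0002", "no reaction", "M0013", "M0002"]
print("assigned class:", majority_vote(votes))
```

Because sampling is with replacement, roughly a third of the proteins are typically absent from any one tree's sample, which is part of what makes the trees differ from one another.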

Summary

Proteins are amongst the most important of all molecules in biological systems. They are crucial to organisms, which use them to carry out a huge variety of essential functions: catalysis, transport, storage, motor functions, signalling, chaperoning folding, regulation, molecular recognition, structural roles, and DNA repair. As proteins are so ubiquitous in biology, understanding their properties is essential if we want to know about biological processes. This project is focused on one of the most significant of all protein functions: enzyme catalysis. Enzymes catalyse, or facilitate, the chemical reactions that occur in living organisms. Understanding how they work is both interesting in itself and useful in areas as diverse as drug design, diagnostics, biofuels, food science and laundry. This project is about the relationship between the structure of a protein and the enzyme function it carries out. We aim to predict the catalytic functionality from knowledge of the protein structure. In order to achieve this, we will use machine learning methods, and in particular a technique called Random Forest. The forest consists of several hundred 'decision trees', each of which is essentially a flow diagram. We will train them to learn patterns in the known properties of existing enzyme structures and the chemistry of the steps comprising the reactions they catalyse. The way in which we will generate the trees involves computer-simulated dice-rolling. This will ensure that they are all different, though based on the same underlying information. Each decision tree then makes a prediction of the catalytic function of a protein whose function is unknown. These predictions are treated as votes as to the function of the protein. This voting process produces a consensus of many decision trees and maximises the use of the information contained in the underlying data, generating results which are much more accurate than those of any one decision tree.
The prediction of enzyme function is immensely important for a number of reasons. Firstly, being able to predict enzyme function more accurately will improve the functional annotation of genomes and reduce the current risk of misannotations being propagated through bioinformatics databases. Rapid developments in structural genomics, the high-throughput structure determination of diverse proteins from a wide variety of organisms, mean that many structures are available for enzymes whose functions are not yet known. Secondly, this project will allow us to recognise chemical similarities between evolutionarily unrelated enzymes that catalyse similar steps, though not necessarily similar overall reactions. Thirdly, this work will help us to understand the key determinants of the complex relationship between protein structure, function and evolution, particularly in terms of catalysis of reaction steps. Fourthly, the project will facilitate the design of new enzymes with either novel functions or carefully modified versions of existing functions. This project sits at an interface between disciplines, combining chemistry, biology and computer science. A wide range of skills and expertise is necessary to increase our understanding of catalysis, which has long been an important academic goal. Commercially, this work lays a foundation that is directly useful to the pharmaceutical and biotechnology industries, where enzymes are used both as diagnostics and as therapeutics; to the agrochemical industry, whose products often target enzymes; to the development of biofuels, which needs robust enzymes to improve productivity and reduce costs; to laundry, where enzymes are already used in everyday products; and to the nutrition and food industries. In particular, this project will aid in the design of new and repurposed enzymes.

Impact Summary

The key beneficiaries are companies in the pharmaceutical, biotechnology, and medical technology sectors; other possible beneficiary fields are biofuels, foods, agrochemicals, and 'home and personal care'. This work centres on new aspects of function prediction, complementary to those used elsewhere, and we envisage that our methods will take their place amongst the arsenal of tools in the workflows for protein function prediction and gene annotation. We expect our methods to be most valuable when used alongside other state-of-the-art techniques for predicting protein function from sequence and structure. One element of the strategy for increasing the impact of our function prediction work is to encourage its use in private sector R&D. This naturally includes large pharmaceutical companies, but we are particularly keen to see SMEs, biotechnology and smaller medical technology companies, many of which cannot afford large in-house computational resources, make use of our predictive models. A key aspect of this is eliminating any IP-related barriers to the commercial use both of our predictive models and of MACiE. Our function prediction software and models will be freely available under a Creative Commons licence. The IP status is that all data in MACiE are public domain. Almost always, these are published, or very soon to be published, by their authors. We are prepared to embargo data pre-publication, but not afterwards. The database itself is copyrighted, and we may in future include a light-touch Open Data Commons licence. This is intended only to prevent extreme cases of plagiarism, such as copying the entire database and passing it off as the work of others, and we positively encourage the use of our predictive models, and also data from MACiE, in commercial research and development. The second part of our strategy is to increase visibility.
Dr Mitchell is in the fortunate position of receiving regular invitations to speak both directly to pharmaceutical, chemical and other commercial organisations (Pfizer, GSK, Unilever, Syngenta, Schering-Plough etc.), and also at conferences designed for the pharmaceutical industry (Improving Solubility 2008; ADMET 2009; Improving Solubility 2009; ADMET Europe 2010; UK-QSAR spring 2010). Here we can discuss our work in formal presentations and through informal networking. Other ideas for impact in the shorter term include authoring articles describing our work on function prediction and the related work on MACiE. There would be three specific target groups. One of these is research and development scientists in pharmaceutical and biotechnology companies. To reach them, an article in Drug Discovery Today or a similar 'trade magazine' would be an appropriate medium. The second target audience is young people (particularly the 16-21 age group) with an interest in science. While this already happens on a small scale via UCAS open days and the like, we are particularly interested in opening up discussion of science through the blogosphere (see, for instance, http://baoilleach.blogspot.com/2009/05/how-do-enzyme-mechanisms-evolve.html). We are also very aware of the benefits of including relevant parts of our own research in undergraduate teaching material. The third target group is the broader public, who could be reached through general interest magazine or newspaper articles, as well as through blogs and other internet-based content. We also hope to secure a slot to present the work to the public at a local event such as those organised by Cafe Science Dundee. The University of St Andrews is a partner in the 'Create and Inspire' public engagement training days for young scientists at Sensation science centre. We believe that MACiE has potential as an educational resource. As well as undergraduate teaching, it could also be a valuable resource for year 13 chemistry teaching in school sixth forms (e.g., Salters' A-level module 'Thread of Life').
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Industrial Biotechnology, Structural Biology, Technology and Methods Development
Research Priority Technology Development for the Biosciences
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme