Award details

Random Forest Prediction of Protein-Ligand Binding Affinities

Reference BB/G000247/1
Principal Investigator / Supervisor Dr John Mitchell
Co-Investigators / Co-Supervisors
Institution University of Cambridge
Department Chemistry
Funding type Research
Value (£) 80,714
Status Completed
Type Research Grant
Start date 01/01/2009
End date 31/12/2009
Duration 12 months

Abstract

Unlike knowledge-based methods, Random Forest affinity prediction will use binding affinities as well as 3D structures. We will take hundreds of protein-ligand complexes with binding affinities from our own PLD, and from the PDBbind, AffinDB, LPDB, Binding MOAD, BindingDB and KiBank databases. Most will form the training data, but we will withhold an external validation set. Random Forest is an ensemble of decision trees generated stochastically so that all are different, though based on the same underlying data. Random Forest can handle large numbers of descriptors even when some are uninformative, can measure the importance of each descriptor, and is resistant to overfitting. For regression, the prediction is averaged over all the trees. Processing PDB structures, defining atom types and preparing histograms of atom type pairwise distance distributions are all handled by our existing BLEEP software. Predictive Random Forest models will be built using the randomForest package from the statistical suite R. Our descriptors will be counts of atom type pairs interacting in distance ranges, for example hydroxyl oxygen interacting with amide nitrogen at a distance of 3.0-3.5 Å. We will use fewer than 40 atom types; their definitions can be revised during the project. The more data we have, the more specific we can make our descriptors, by adjusting atom type definitions and histogram bin sizes. We will build Random Forests with 500 trees using the training set. Performance in predicting out-of-bag data, that is, those data not selected to build a given tree, reflects a model's quality. We will measure the importance of individual descriptors by replacing them with random noise and recording the resultant drop in accuracy. We will also test our models on the independent external validation sets. We will build models for the overall diverse dataset of protein-ligand complexes and for specific families, such as serine proteinases, aspartic proteinases and sugar binding proteins.
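The descriptor scheme described above, counts of atom type pairs within distance ranges, can be illustrated with a minimal Python sketch. This is not the BLEEP software: the function name `pair_count_descriptors`, the atom type labels, the coordinates and the bin edges are all hypothetical, chosen only to show the idea of binning protein-ligand atom pair distances by type.

```python
import math
from collections import Counter

def pair_count_descriptors(protein_atoms, ligand_atoms,
                           bin_edges=(2.0, 2.5, 3.0, 3.5, 4.0)):
    """Count protein-ligand atom type pairs falling in each distance bin.

    Each atom is (atom_type, (x, y, z)) with coordinates in angstroms.
    The bin edges are illustrative assumptions, not the project's values.
    """
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            # Find the half-open bin [lo, hi) that contains this distance.
            for lo, hi in zip(bin_edges, bin_edges[1:]):
                if lo <= d < hi:
                    counts[(p_type, l_type, lo, hi)] += 1
                    break
    return counts

# Made-up toy "complex": two protein atoms and two ligand atoms.
protein = [("N_amide", (0.0, 0.0, 0.0)), ("O_carbonyl", (5.0, 0.0, 0.0))]
ligand = [("O_hydroxyl", (3.2, 0.0, 0.0)), ("C_aromatic", (0.0, 3.7, 0.0))]

desc = pair_count_descriptors(protein, ligand)
for key, n in sorted(desc.items()):
    print(key, n)
```

Each key of the resulting counter would correspond to one descriptor column in the training table; narrowing the bins or splitting atom types produces more, and more specific, descriptors, which is the trade-off against dataset size that the abstract mentions.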

Summary

The binding affinity between a small molecule ligand and the protein with which it interacts is not easy to calculate. Indeed, its computational prediction remains one of the most important and difficult unsolved problems in computational biochemical science. Most medicines, and many other molecules with uses ranging from agrochemicals to deodorants, are ligands that bind to proteins. The proteins may be human, or may come from a pathogenic or undesirable organism such as a bacterium. It would be very beneficial to be able to predict binding affinities using a computer, because the alternative experimental approach of making very many molecules and assaying them against the relevant protein or proteins is difficult, expensive and time-consuming. The computer calculates an estimated binding affinity using a mathematical formula known as a scoring function. The development of suitable scoring functions for ranking possible three-dimensional protein-ligand interaction geometries, and especially for accurate prediction of protein-ligand binding affinities, remains a considerable challenge. The scoring function must capture all the important aspects of the interaction in order to give an accurate and reliable prediction of the binding affinity. In order to develop better scoring functions, we are looking to the fields of machine learning and informatics, and will require the known binding affinities and structures of numerous well-characterised protein-ligand complexes. Fortunately, many hundreds of protein-ligand complexes have both structures and binding affinities available. The method we will use is called Random Forest. The forest is a set of several hundred 'decision trees', each of which is basically a flow diagram. We will train them to learn patterns in the known properties of existing protein-ligand complexes, their binding affinities and their patterns of atom-atom interaction distances.
The way in which we will generate the trees involves computer-simulated dice-rolling. This will ensure that they are all different, though based on the same underlying information. The decision trees then each make a prediction of the unknown binding affinity. These predictions are averaged to give the final computed value. This averaging over many decision trees maximises the use of the information contained in the underlying data and produces results which are much more accurate than those of any one decision tree. Our models will be validated by using them to predict binding affinities of protein-ligand complexes that the algorithm has not seen before. This ensures that the computer is not simply learning the idiosyncrasies of the data on which it is being trained.
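The randomised tree generation, averaging and validation on unseen data described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the R randomForest package named in the abstract: each "tree" here is a single-split stump on one randomly chosen descriptor, and the data, function names and seed are made-up assumptions.

```python
import random
from statistics import mean

def fit_stump(X, y, feature):
    """Best single-threshold split on one feature (a depth-1 regression tree)."""
    best = None
    values = sorted({row[feature] for row in X})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # feature is constant in this bootstrap sample
        m = mean(y)
        return (feature, 0.0, m, m)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, t, ml, mr = stump
    return ml if row[feature] <= t else mr

def fit_forest(X, y, n_trees=500, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)  # the "dice-rolling"
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(len(X[0]))
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feature)
        forest.append((stump, set(idx)))  # remember which rows were used
    return forest

def predict(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_stump(s, row) for s, _ in forest)

def oob_predictions(forest, X):
    """Predict each training row using only trees that never saw it."""
    preds = []
    for i, row in enumerate(X):
        votes = [predict_stump(s, row) for s, used in forest if i not in used]
        preds.append(mean(votes) if votes else None)
    return preds

# Made-up toy data: descriptor 0 is informative, descriptor 1 is not.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [4.0 + 0.2 * i for i in range(20)]

forest = fit_forest(X, y)
oob = oob_predictions(forest, X)
rmse = (sum((p - t) ** 2 for p, t in zip(oob, y)) / len(y)) ** 0.5
print("out-of-bag RMSE:", round(rmse, 3))
```

Because every tree is built from a different bootstrap sample, each training row is left out of roughly a third of the trees, so the out-of-bag error gives an internal estimate of predictive quality without touching the external validation set.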
Committee Closed Committee - Biomolecular Sciences (BMS)
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme