Award details

CATH-FunL: Improving Gene Target Selection by Predicting Functional Modules in Biological Systems

Reference BB/M020088/1
Principal Investigator / Supervisor Professor Christine Orengo
Co-Investigators / Co-Supervisors
Institution University College London
Department Structural Molecular Biology
Funding type Research
Value (£) 113,199
Status Completed
Type Research Grant
Start date 01/06/2015
End date 31/05/2016
Duration 12 months

Abstract

CATH-FunL will be a new tool in the CATH-Gene3D resource for prioritising proteins in a large query set generated by a high-throughput experiment. It will allow biologists to identify a subset of genes likely to be associated with a biological system of interest, for more detailed experimental characterisation. Users will specify a biological process, e.g. by providing a set of proteins or GO terms known to be associated with the process. The novel feature of FunL will be its ability to identify network modules within this prioritised list that are enriched in the known proteins or the relevant GO terms. CATH-FunL will use protein interaction/association data for ten model organisms (including human) from a range of public sources (e.g. iRefIndex). Data from each source will be transformed into similarity matrices using our well-established kernel-based approaches. These matrices will then be combined and transformed into a final matrix, and a new in-house method, COMPASS, will be applied to give a prioritised list of genes. We already have a pilot FunL platform, built to handle small query datasets (<100). This will be significantly improved by including the more powerful COMPASS method for ranking the proteins and another novel method for identifying enriched network modules in the ranked list, to better prioritise the proteins. COMPASS applies partial least squares regression to prioritise targets more effectively. Functional enrichment analysis will be enabled by annotating proteins in the network with our in-house CATH-Gene3D functional family data. CATH-FunL will be re-engineered to be robust to multiple queries from groups submitting large datasets. This will be done by pre-calculating the underlying matrices and protein functional annotation data on the UCL 5500-node compute farm (Legion) and on the Cloud. We will also explore porting user queries to the Cloud and running the whole project externally on Google Cloud (https://cloud.google.com).
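To make the idea of a module "enriched in the known proteins" concrete, the sketch below scores one candidate module with a one-sided hypergeometric test. This is only an illustration: the abstract does not say which enrichment statistic FunL uses, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           module_size: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated proteins in a module.

    The module is modelled as `module_size` proteins drawn without
    replacement from `population` proteins, of which `annotated` are
    known to be associated with the process of interest.
    (Assumed statistic; not taken from the CATH-FunL description.)
    """
    if not 0 <= annotated <= population:
        raise ValueError("annotated must lie between 0 and population")
    if not 0 <= module_size <= population:
        raise ValueError("module_size must lie between 0 and population")

    first = max(hits, 0)                   # at least zero hits is certain
    last = min(annotated, module_size)     # cannot exceed either count
    total = comb(population, module_size)  # all possible modules of this size
    tail = sum(
        comb(annotated, k) * comb(population - annotated, module_size - k)
        for k in range(first, last + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 200 ranked proteins, 20 of them known to be
    # involved in the process, and a 10-protein module containing 6 of them.
    p = hypergeom_enrichment_p(population=200, annotated=20,
                               module_size=10, hits=6)
    print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests that the module contains more known proteins than would be expected by chance. The same calculation applies if "annotated" counts proteins carrying a given GO term instead of user-supplied known proteins.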

Summary

In recent decades, a marked increase in data availability has revolutionised the study of biology. Advances in experimental techniques mean that we now have an abundance of information about the genes and proteins in our cells and their interactions. This unprecedented volume of data presents a challenge for biologists: how best to combine and exploit different data sources to gain meaningful biological insights. CATH-FunL is a tool designed to address this problem. FunL will allow users to predict novel proteins ('targets') likely to be associated with a set of proteins they are interested in, for example known components in a protein signalling pathway. CATH-FunL will also allow users to gain further insight into these predicted targets by organising and annotating the list of predicted genes. Finally, CATH-FunL will provide intuitive visualisations of the predicted targets and the functional relations between them. CATH-FunL's prediction methods are based on the well-documented concept of guilt-by-association. Much of the data produced by modern experimental techniques can be used to infer whether proteins participate in the same biological process, that is, whether they are functionally associated. Evidence for functional association comprises physical binding between proteins, correlation in expression patterns and numerous other, more indirect indicators. Guilt-by-association methods represent this information as a network of functional associations between proteins and attempt to use the structure of the network to predict new associations. The simplest methods make predictions based only on the direct network neighbours of a protein. This, however, ignores the rich information present in the overall topology of the network: for example, groups of proteins relating to the same function are known to form densely connected clusters within the network, with fewer connections to other proteins.
FunL aims to exploit this type of structure using a powerful and well-studied approach known as graph kernels. CATH-FunL will integrate a large volume of protein interaction/association information from several public repositories and from our own in-house tools for protein association prediction. These data will be represented as networks, combined and then transformed into a ranked list of potential targets using kernel-based methods, based on a set of query proteins and a set of known proteins provided by the user. Query proteins will be ranked by the strength of their association to the known proteins. FunL will provide further insight into the target proteins by providing information about their function. Functional annotation is often performed using terms from the Gene Ontology (GO). However, on average, fewer than 10% of genes in an organism have been experimentally characterised, so GO annotations can be sparse or unreliable for many proteins. We will therefore supplement experimental GO annotations with predicted annotations from state-of-the-art, in-house, sequence-based prediction methods. Once the target list has been computed, CATH-FunL will organise the list into functionally coherent sub-groups. This will allow users to detect potential patterns in the predicted targets and to focus on particular biological processes of interest to them. Because much of the computational work involved in this clustering will already have been done by FunL at the query stage, this provides a very efficient way of classifying the proteins in the target list. Finally, FunL will visualise the results in an intuitive way. We will both use network-based visualisations and explore more innovative approaches related to the kernel-based methods.
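The difference between direct-neighbour scoring and kernel-based ranking can be illustrated with a toy network. The sketch below is not the FunL or COMPASS method: the summary does not name the graph kernel used, so a truncated von Neumann-style walk kernel on a degree-normalised adjacency matrix is assumed here, and the network, function names and parameters (`alpha`, `steps`) are all hypothetical.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def neighbour_scores(adj: Matrix, known: Sequence[int],
                     queries: Sequence[int]) -> Dict[int, float]:
    """Direct guilt-by-association: total edge weight from a query to known proteins."""
    return {q: sum(adj[q][k] for k in known) for q in queries}


def normalise(adj: Matrix) -> Matrix:
    """Symmetric degree normalisation: A[i][j] / sqrt(deg_i * deg_j)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / (deg[i] * deg[j]) ** 0.5 if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square matrix product, kept dependency-free for the sketch."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_kernel(adj: Matrix, alpha: float = 0.5, steps: int = 4) -> Matrix:
    """Assumed kernel: sum over k = 0..steps of alpha**k * S**k, S = normalised adjacency."""
    s = normalise(adj)
    n = len(adj)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    kernel = [row[:] for row in identity]
    power = identity
    for k in range(1, steps + 1):
        power = matmul(power, s)           # S**k: weighted walks of length k
        for i in range(n):
            for j in range(n):
                kernel[i][j] += alpha ** k * power[i][j]
    return kernel


def kernel_scores(kernel: Matrix, known: Sequence[int],
                  queries: Sequence[int]) -> Dict[int, float]:
    """Association of each query to the known set: summed kernel similarity."""
    return {q: sum(kernel[q][k] for k in known) for q in queries}


def rank(scores: Dict[int, float]) -> List[int]:
    """Queries ordered from strongest to weakest association."""
    return sorted(scores, key=lambda q: scores[q], reverse=True)


# Toy network: proteins 0, 1 and 2 form a dense cluster; 3, 4 and 5 trail off it.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
N = 6
ADJ: Matrix = [[0.0] * N for _ in range(N)]
for i, j in EDGES:
    ADJ[i][j] = ADJ[j][i] = 1.0

KNOWN = [0, 1]      # proteins already linked to the process
QUERIES = [5, 3]    # proteins from the user's experiment

if __name__ == "__main__":
    print("direct-neighbour scores:", neighbour_scores(ADJ, KNOWN, QUERIES))
    ranked = rank(kernel_scores(walk_kernel(ADJ), KNOWN, QUERIES))
    print("kernel ranking:", ranked)
```

In this toy case neither query protein is a direct neighbour of a known protein, so direct-neighbour scoring cannot separate them. The walk kernel also counts indirect paths, so protein 3, which sits next to the known cluster, ranks above the more distant protein 5.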
In summary, CATH-FunL will allow users to combine their own datasets of experimentally analysed genes with information from heterogeneous publicly available repositories and our in-house functional annotation datasets to gain valuable functional insights into biological processes they are interested in.

Impact Summary

Who will benefit from the research
As described already, FunL will address BBSRC strategic areas by aiding experimental groups involved in high-throughput studies, e.g. those generating next-generation sequencing data and proteomics data. There are a number of such groups that we work with already, e.g. on ageing, pain, fly development and cancer, who would be willing to continue testing the CATH-FunL tool for us. However, apart from experimental groups involved in high-throughput 'functional genomics' style studies, groups involved in high-throughput structural biology will also find the tool valuable. For example, we collaborate with two large structural genomics consortia who use CATH-Gene3D functional annotations to guide selection of suitable targets in metagenomics studies, a priority area for the BBSRC. Structural genomics groups such as these, and structural biologists more generally, will clearly benefit from using FunL to guide their selection of new targets for structure determination. Perhaps more valuably, they will also be able to use FunL to suggest possible interactors for proteins they are interested in. Knowing the interactors for a protein target can considerably aid solubilisation and purification of proteins during the crystallisation process, especially where these proteins form stable complexes with the target protein. Another potentially large group of beneficiaries is researchers in industry. There is growing interest in industry in exploiting protein networks to aid target selection. For example, identification of network modules enriched in highly expressed genes that correlate with a particular phenotype can suggest suitable targets for drug design.
Here, the links between FunL and the CATH-Gene3D superfamilies will be particularly valuable, as researchers will be able to identify the domain constituents in an enriched module (CATH domain IDs will be reported alongside the GO functions of node proteins). This will help in determining whether a poly-pharmacological strategy could be employed, e.g. where a weakly binding drug increases in efficacy because it targets multiple copies of a particular CATH domain within a protein network module. In this context, development of the new CATH-FunL tool will benefit from a collaboration between the Orengo group and computational researchers at GlaxoSmithKline (GSK) on a project exploring how drug poly-pharmacology can be enhanced by targeting specific domains within protein network modules. This project has EU funding, which supports a Marie Curie Fellow, Dr Aurelio Garcia, within the Orengo group for the next two years. All the data generated by CATH-FunL (i.e. predicted protein interactions/associations, similarity matrices produced by the kernel-based analysis of the graph network topology, and GO functional annotations of proteins in the model organisms used by FunL) will be freely available to all users to download from the CATH-Gene3D site. The PDRA who will be working on the project, Sonja Lehtinen, is already experienced in protein network generation and analysis, having worked as a PhD student in the Orengo group for the last 3 years. She developed the powerful COMPASS tool, which is competitive with, and in some cases outperforms, the widely used GeneMANIA algorithm, which also exploits protein network topology for target prioritisation. This one-year project will give Sonja the opportunity to extend her network analysis skills by developing a novel clustering approach to detect modules in networks. It will also give her experience of using a large compute farm and Google Cloud, and of web-page construction.
All these skills are likely to be valuable when seeking future academic or industry-based posts, as there is a shortage of skilled researchers in this area and a significant demand for this expertise to analyse large-scale functional genomics data, such as next-generation sequencing and proteomics data.

Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme