Award details

Automated Biological Event Extraction from the Literature for Drug Discovery

Reference BB/G013160/1
Principal Investigator / Supervisor Professor Sophia Ananiadou
Co-Investigators / Co-Supervisors Professor Junichi Tsujii
Institution The University of Manchester
Department Computer Science
Funding type Research
Value (£) 288,468
Status Completed
Type Research Grant
Start date 01/09/2009
End date 31/08/2012
Duration 36 months

Abstract

In establishing drug target confidence, it is essential to have evidence of the type of relationship between the target and key protein-bioprocesses. However, the primary starting point for target choice, and the context for interpretation of all pre-clinical observations, is the literature. Text mining (TM) is ideally suited to support the discovery of reliable drug targets. But for TM systems to help researchers understand the role proteins play in biological processes, they have to extract, normalise and identify the context of complex relationships between genes, diseases and their underlying bioprocesses. Our TM techniques will recognise diverse surface forms in text describing bioprocesses and will link them with events and the proteins associated with them. Our methods are based on a combination of advanced semantic text mining (deep parsing, named entity recognition) and machine learning techniques, with which we shall automatically identify events involving proteins, such as decrease [in concentration], phosphorylation and ubiquitination. Bioprocesses such as angiogenesis are composed of individual events described in the literature. We propose to identify these bioprocesses automatically and to link them with the associated events. A combination of kernel methods with knowledge resources and annotated texts (evaluated by biologists) will be used to learn automatically which events are linked with the bioprocesses underlying higher-level processes. We shall concentrate on angiogenesis as an example. We shall thereby produce and make available a text mining service for researchers working in drug discovery. Both the software tools used for event extraction and the annotated texts used for training will be made available. Co-funded by EPSRC under the RCUK Cross-Council Funding Agreement.

Summary

The development of new drugs is both expensive and time-consuming: it can take over a decade for a new drug to be proven effective and safe, even with the many advances we have seen in the life sciences. From a batch of promising early candidates, only a few will eventually be approved. The longer a candidate lasts before being found unusable (attrition), the higher the cost, especially if clinical trials have been involved. Attrition rates run at ca. 90%, and attrition is thus ruinously costly to the pharmaceutical industry, so there is an urgent need to reduce its impact. UK researchers, leading in biological and pharmaceutical research, would benefit greatly from means to identify as early as possible drug candidates that are likely to fail, preferably long before the clinical stage is reached. Another current area of concern is how drugs may be targeted to groups of individuals: not every individual responds in the same way to the same drug. If we can discover which genes are implicated in this, then we can hope both to focus on the more promising drug candidates and to find ways of tailoring treatments to (groups of) individuals. Unfortunately, however, scientists are faced with a severe knowledge gap: no scientist can keep up, using traditional means, with the vast amount of experimental data, and especially its massive associated literature, that is being (and has been) generated in the life sciences. Moreover, much knowledge is hidden in the literature: it has been shown that entirely new knowledge has been available for discovery in the literature, often for many years, but that the vastness of the literature has prevented researchers from achieving the required level of information retrieval that is the first step in linking and synthesizing it into new, previously unsuspected knowledge.
The main target of information finding is the MEDLINE resource, which currently contains some 17 million abstracts: this is seemingly large but is nevertheless a fraction of the information and hidden knowledge contained in the associated full-text scientific articles. The proposed project is designed to help scientists overcome this knowledge gap by developing automatic means to filter information and to synthesise new knowledge from the scientific literature. As a direct link between a protein (or a number of proteins) and a physiological or pathophysiological process is not always described explicitly in a text, we must hunt for indirect evidence. This involves looking for indications of biological processes that are associated with proteins. When writing, biologists essentially describe 'events', such as phosphorylation, that are involved in higher-order bioprocesses such as angiogenesis. By identifying and extracting such events, and the particular biological entities (proteins, diseases) involved, we can collect many fragments of information about bioprocesses from many thousands of texts. These fragments can then be used to find new knowledge by establishing associations among the fragments. To achieve such extraction of fragments for knowledge finding, powerful semantic text mining techniques are required that can handle the special languages of biologists, and that can achieve appropriate levels of abstraction far beyond mere word search. This project will customise the generic tools of the National Centre for Text Mining and carry out research to find the best ways of extracting events concerning biological processes from the literature. AstraZeneca will be closely involved, both in terms of informing the research and in providing practical domain expertise, requirements, data and concrete evaluation scenarios. Their interest is also manifest in a substantial cash contribution to the project.
The result of this programme will be a text mining service to academic researchers, offered by NaCTeM, supporting them in their task of discovering protein-bioprocess associations from the literature.
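
The idea of collecting event "fragments" and then associating them can be illustrated with a minimal sketch. This is not the project's method, which relies on deep parsing, named entity recognition and machine learning; it is a toy dictionary lookup over sentences. The lexicons, the function names extract_fragments and associate, and the example text are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, for illustration only (not from the project).
PROTEINS = {"VEGF", "VEGFR2", "HIF1A"}
EVENT_TRIGGERS = {
    "phosphorylation": "Phosphorylation",
    "phosphorylates": "Phosphorylation",
    "ubiquitination": "Ubiquitination",
    "decrease": "Decrease",
}
BIOPROCESSES = {"angiogenesis"}


def extract_fragments(text):
    """Return (event_type, protein, bioprocess) tuples, one sentence at a time.

    bioprocess is None when the sentence names no known bioprocess.
    """
    fragments = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        proteins = [t for t in tokens if t in PROTEINS]
        events = [EVENT_TRIGGERS[t.lower()] for t in tokens
                  if t.lower() in EVENT_TRIGGERS]
        processes = [t.lower() for t in tokens if t.lower() in BIOPROCESSES]
        for event in events:
            for protein in proteins:
                for process in processes or [None]:
                    fragments.append((event, protein, process))
    return fragments


def associate(fragments):
    """Count how often each protein shares a sentence with each bioprocess."""
    return Counter((protein, process)
                   for _, protein, process in fragments
                   if process is not None)


text = ("Phosphorylation of VEGFR2 promotes angiogenesis. "
        "Ubiquitination of HIF1A was observed. "
        "VEGF phosphorylates VEGFR2 during angiogenesis.")
fragments = extract_fragments(text)
associations = associate(fragments)
```

Sentence-level co-occurrence is used here only because it is the simplest possible stand-in; it ignores which protein is actually the theme of which event, which is the gap that the semantic techniques described in the summary are meant to close.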
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme Industrial Partnership Award (IPA)