Award details

Mining term associations from literature to support knowledge discovery in biology

Reference BB/C007360/1
Principal Investigator / Supervisor Professor Goran Nenadic
Co-Investigators / Co-Supervisors Professor John Keane
Institution The University of Manchester
Department Computer Science
Funding type Research
Value (£) 192,905
Status Completed
Type Research Grant
Start date 01/01/2006
End date 30/06/2009
Duration 42 months

Abstract

In this project we propose combining various text mining approaches to establish associations among biological terms. Our aim is to support biological knowledge discovery and develop novel text mining techniques to extract and present non-trivial knowledge and term associations (e.g. related proteins, their molecular functions, localisations, etc.). More specifically, the objectives of the proposal are: to implement text-based methods for determining term similarity from large document collections; to investigate, implement and evaluate a novel term kernel method for biological text mining; to identify, implement and evaluate suitable kernel-based technologies for solving user-elicited biological text mining scenarios; and to make the tools available to the wider research community via the National Centre for Text Mining.

Terms are vital for processing scientific texts. A term is the lexical realisation of a concept. It can be a single word (e.g. protein) or a multiword phrase (e.g. son of sevenless). We will focus on extracting term relationships as a basis for text mining. We will combine lexical, syntactic and contextual similarities extracted from the literature. Measurement of lexical term similarities will be based on considering substrings that are shared among bio-terms. In addition, we will investigate the use of string and subsequence kernels for this task. Measurement of syntactic similarity will be based on co-occurrence of terms within term enumerations, coordinations and conjunctions, i.e. in expressions where a sequence of terms appears as a single syntactic unit. Contextual term similarities will be measured by comparing the contexts in which terms appear. The context of a term will be represented by a regular expression containing different elements, such as part-of-speech and syntactic tags, terminological and additional ontological information, and lemmatised contextual elements. Contexts will be mined automatically from documents, linguistically normalised and biologically generalised, and then compared using a vector representation. We will select sensible weighting schemes and test their performance in detecting term associations using existing resources for validation.

The endpoint of mining term similarities is appropriate representation, analysis and visualisation of information in order to support biologists in knowledge discovery. By combining these similarities, terms can be linked into semantic networks and further used for text mining. Based on term similarities, we will develop a novel term kernel for biological text mining. Once we have such a kernel, we can use the whole gamut of emergent kernelised data mining methods. The technologies to be investigated for supporting knowledge discovery include term clustering, classification, principal component analysis, regression, and correlation. For example, discovery of correlations between textual and non-text information derived from post-genomic techniques such as expression array and sequence analysis is a powerful hypothesis generation method. For instance, entities that appear similar from the results of text mining might behave very differently under a particular set of experimental conditions; this suggests the experiment is uncovering something that was previously unknown and is worthy of further investigation. At present, there are no good tools for detecting these types of patterns; we will develop such tools.

We will demonstrate the utility of these technologies in solving user-elicited biological text mining scenarios. Scenarios are small-scale but real-world problems that we can help to solve using term-based text mining. These scenarios will include, but not be limited to, the following: compound toxicity prediction, quantification and classification; linking genes from quantitative trait loci and expression array data using the literature, etc. These scenarios will be defined and evaluated in close collaboration with biologists.
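
As an illustration only, the idea of combining a lexical similarity (shared substrings) with a contextual similarity (vector comparison of contexts) might be sketched as follows. This is a minimal sketch and not the project's method: the use of character trigrams with a Dice coefficient, plain token counts as context vectors, cosine comparison, the equal weights, and all function names are assumptions introduced here for illustration. The abstract does not specify any of them, and the combined score is not claimed to be a valid kernel.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term, n=3):
    """Return the multiset of character n-grams of a lower-cased term."""
    s = term.lower()
    if len(s) < n:
        return Counter([s])
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def lexical_similarity(a, b, n=3):
    """Dice coefficient over shared character n-grams (assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((ga & gb).values())  # multiset intersection
    total = sum(ga.values()) + sum(gb.values())
    return 2.0 * shared / total if total else 0.0


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def contextual_similarity(contexts_a, contexts_b):
    """Compare two terms via the tokens seen in their contexts.

    Each argument is a list of contexts; each context is a list of
    already normalised tokens (hypothetical input format).
    """
    u = Counter(tok for ctx in contexts_a for tok in ctx)
    v = Counter(tok for ctx in contexts_b for tok in ctx)
    return cosine(u, v)


def combined_similarity(a, b, contexts, w_lex=0.5, w_ctx=0.5):
    """Weighted sum of lexical and contextual similarity (assumed weights)."""
    lex = lexical_similarity(a, b)
    ctx = contextual_similarity(contexts.get(a, []), contexts.get(b, []))
    return w_lex * lex + w_ctx * ctx


if __name__ == "__main__":
    # Toy contexts, invented purely for illustration.
    contexts = {
        "protein kinase C": [["activate", "phosphorylation", "membrane"]],
        "protein kinase A": [["activate", "phosphorylation", "camp"]],
    }
    print(combined_similarity("protein kinase C", "protein kinase A", contexts))
```

In such a sketch, the syntactic similarity described above (co-occurrence in enumerations and coordinations) would be a third component, and the weighting scheme would be tuned against existing validation resources rather than fixed.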

Summary

unavailable
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme