Award details

Biological information extraction for genome and superfamily annotation

Reference BB/C507253/1
Principal Investigator / Supervisor Professor David Jones
Co-Investigators / Co-Supervisors Professor Bernard Buxton, Dr David Corney
Institution University College London
Department Computer Science
Funding type Research
Value (£) 264,049
Status Completed
Type Research Grant
Start date 01/11/2004
End date 31/03/2008
Duration 41 months

Abstract

There are two aspects of the proposed research: first, creating, populating and annotating biological databases, and second, creating the software to do so via information extraction. The proposed system will use templates to perform information extraction from biological texts, to create new database records. A template is a representation of a text pattern that allows the system to identify the information of interest. It consists of a number of predefined slots to be filled in by the system from information contained in the text. One major drawback of existing information extraction tools is that new templates must be created by hand for each information extraction task, and this requires considerable computational expertise, as well as time. Our goal is a system that will be able to generate its own templates automatically. We propose a sequence of research steps, moving from manual template creation to fully automatic template creation. The aim is to systematically reduce the dependence on technical expertise, either in the form of a technical expert, or in requiring biological researchers to spend time developing the required computational knowledge. The end result will be a system that can create its own templates for biological (as opposed to computational) users.

As the system develops, we will apply it to a variety of biological problems that we have identified with our collaborators. These include superfamily assignment, genome annotation, gene product characterisation and pathway inference. In each case, the system will be validated by comparison with known results in existing databases, and by being used to extend well-understood domains, before being deployed by the biological research partners to work on new domain areas. This sequence will provide confidence in the system's abilities.

We plan to represent documents and templates in a single format, using hidden Markov models (HMMs). Each node in the model will represent a word, a part of speech (noun, verb etc.) or a semantic biological category (protein, gene, metabolite etc.). A template is a sub-graph within this model, and so the problem becomes one of learning to identify useful sub-graphs automatically. We shall apply different machine learning technologies to the problem, including supervised learning, semi-supervised learning and active learning. In supervised learning, a labelled example is a sentence or document fragment that a domain expert has chosen as being of interest (positive example) or irrelevance (negative example). It takes considerable time and expertise to create a set of such labelled examples, and this must be repeated for each new domain. In contrast, unlabelled examples are easy to acquire in large quantities. Semi-supervised learning makes use of both labelled and unlabelled examples to model the domain. Active learning systems work by generating potential solutions, and asking the user to evaluate them. In this way, the user's expert knowledge can be incorporated within the system, without requiring the user to learn new technical specification languages, or spend time learning to use a complex tool.
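
As an illustration only, the template idea described in the abstract can be sketched in Python. This is not the project's software: it simplifies a template to a linear sequence of nodes rather than a sub-graph of an HMM, and it matches by exact comparison rather than by any learned model. The names Token, Node and fill_template are hypothetical, and the toy sentence, its tags and the entity names "ProteinA" and "geneB" are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    word: str
    pos: str                        # part of speech, e.g. "NOUN", "VERB"
    category: Optional[str] = None  # semantic category, e.g. "protein"


@dataclass(frozen=True)
class Node:
    kind: str                   # "word", "pos" or "category"
    value: str
    slot: Optional[str] = None  # slot name to fill when this node matches

    def matches(self, token: Token) -> bool:
        if self.kind == "word":
            return token.word.lower() == self.value
        if self.kind == "pos":
            return token.pos == self.value
        if self.kind == "category":
            return token.category == self.value
        raise ValueError(f"unknown node kind: {self.kind}")


def fill_template(template, sentence):
    """Return the slot fillers of the first contiguous match, or None."""
    n = len(template)
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        if all(node.matches(tok) for node, tok in zip(template, window)):
            return {node.slot: tok.word
                    for node, tok in zip(template, window) if node.slot}
    return None


# Hypothetical template: <protein> <verb> <gene>
TEMPLATE = [
    Node("category", "protein", slot="agent"),
    Node("pos", "VERB", slot="action"),
    Node("category", "gene", slot="target"),
]

# Hand-tagged toy sentence; the tags are assumptions, not tagger output.
SENTENCE = [
    Token("The", "DET"),
    Token("ProteinA", "NOUN", "protein"),
    Token("regulates", "VERB"),
    Token("geneB", "NOUN", "gene"),
    Token(".", "PUNCT"),
]

record = fill_template(TEMPLATE, SENTENCE)
print(record)
```

In this sketch each filled template becomes a small dictionary that could serve as a candidate database record. Writing the TEMPLATE list by hand corresponds to the manual template creation that the proposal aims to move away from.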

Summary

unavailable
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics X – not assigned to a current Research Topic
Research PriorityX – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme Industrial Partnership Award (IPA)