BBSRC Portfolio Analyser
Award details
Biological information extraction for genome and superfamily annotation
Reference
BB/C507253/1
Principal Investigator / Supervisor
Professor David Jones
Co-Investigators / Co-Supervisors
Professor Bernard Buxton, Dr David Corney
Institution
University College London
Department
Computer Science
Funding type
Research
Value (£)
264,049
Status
Completed
Type
Research Grant
Start date
01/11/2004
End date
31/03/2008
Duration
41 months
Abstract
There are two aspects of the proposed research: first, creating, populating and annotating biological databases, and second, creating the software to do so via information extraction. The proposed system will use templates to extract information from biological texts and create new database records. A template is a representation of a text pattern that allows the system to identify the information of interest. It consists of a number of predefined slots to be filled in by the system from information contained in the text. One major drawback of existing information extraction tools is that new templates must be created by hand for each information extraction task, which requires considerable computational expertise as well as time. Our goal is a system that can generate its own templates automatically. We propose a sequence of research steps, moving from manual template creation to fully automatic template creation. The aim is to systematically reduce the dependence on technical expertise, whether that takes the form of a technical expert or of biological researchers spending time developing the required computational knowledge. The end result will be a system that can create its own templates for biological (as opposed to computational) users.
As the system develops, we will apply it to a variety of biological problems that we have identified with our collaborators. These include superfamily assignment, genome annotation, gene product characterisation and pathway inference. In each case, the system will be validated by comparison with known results in existing databases, and by being used to extend well-understood domains, before being deployed by the biological research partners to work on new domain areas. This sequence will provide confidence in the system's abilities.
We plan to represent documents and templates in a single format, using hidden Markov models (HMMs). Each node in the model will represent a word, a part of speech (noun, verb, etc.) or a semantic biological category (protein, gene, metabolite, etc.). A template is a sub-graph within this model, and so the problem becomes one of learning to identify useful sub-graphs automatically. We shall apply different machine learning technologies to the problem, including supervised learning, semi-supervised learning and active learning. In supervised learning, a labelled example is a sentence or document fragment that a domain expert has marked as being of interest (positive example) or irrelevant (negative example). It takes considerable time and expertise to create a set of such labelled examples, and this must be repeated for each new domain. In contrast, unlabelled examples are easy to acquire in large quantities. Semi-supervised learning makes use of both labelled and unlabelled examples to model the domain. Active learning systems work by generating potential solutions and asking the user to evaluate them. In this way, the user's expert knowledge can be incorporated within the system without requiring the user to learn new technical specification languages or spend time learning to use a complex tool.
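To make the idea of a template with slots concrete, the following is a minimal Python sketch and not the project's implementation. It assumes a much simpler setting than the abstract proposes: a template is a fixed linear sequence of constraints rather than a sub-graph of an HMM, and tokens arrive already tagged. The names `Token`, `Element`, `fill_template` and `BINDS`, the slot names, and the protein names `ProtA` and `ProtB` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    word: str
    pos: str                         # part of speech, e.g. "NOUN", "VERB"
    category: Optional[str] = None   # semantic category, e.g. "protein"


@dataclass(frozen=True)
class Element:
    """One template position: constrains a token and may name a slot."""
    word: Optional[str] = None
    pos: Optional[str] = None
    category: Optional[str] = None
    slot: Optional[str] = None

    def matches(self, token: Token) -> bool:
        # A constraint left as None matches anything.
        return all(
            want is None or want == have
            for want, have in (
                (self.word, token.word.lower()),
                (self.pos, token.pos),
                (self.category, token.category),
            )
        )


def fill_template(template, tokens):
    """Return the filled slots for the first contiguous match, else None."""
    n = len(template)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(e.matches(t) for e, t in zip(template, window)):
            return {e.slot: t.word for e, t in zip(template, window) if e.slot}
    return None


# Hypothetical hand-written template: <protein> "binds" <protein>
BINDS = [
    Element(category="protein", slot="agent"),
    Element(word="binds", pos="VERB"),
    Element(category="protein", slot="target"),
]

# Hypothetical pre-tagged sentence: "Experiments show ProtA binds ProtB"
sentence = [
    Token("Experiments", "NOUN"),
    Token("show", "VERB"),
    Token("ProtA", "NOUN", "protein"),
    Token("binds", "VERB"),
    Token("ProtB", "NOUN", "protein"),
]

record = fill_template(BINDS, sentence)
```

Here `record` holds the filled slots that would become a new database record. Each `Element` mixes the three node types named in the abstract (word, part of speech, semantic category). The hand-written `BINDS` template corresponds to the manual starting point of the proposed research, which aims to learn such patterns automatically.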
Summary
unavailable
Committee
Closed Committee - Engineering & Biological Systems (EBS)
Research Topics
X – not assigned to a current Research Topic
Research Priority
X – Research Priority information not available
Research Initiative
X - not in an Initiative
Funding Scheme
Industrial Partnership Award (IPA)