Award details

Biological information extraction for genome and superfamily annotation

Reference	BB/C507253/1
Principal Investigator / Supervisor	Professor David Jones
Co-Investigators / Co-Supervisors	Professor Bernard Buxton, Dr David Corney
Institution	University College London
Department	Computer Science
Funding type	Research
Value (£)	264,049
Status	Completed
Type	Research Grant
Start date	01/11/2004
End date	31/03/2008
Duration	41 months

Abstract

There are two aspects of the proposed research: first, creating, populating and annotating biological databases, and second, creating the software to do so via information extraction. The proposed system will use templates to perform information extraction from biological texts, to create new database records. A template is a representation of a text pattern that allows the system to identify the information of interest. It consists of a number of predefined slots to be filled in by the system from information contained in the text. One major drawback of existing information extraction tools is that new templates must be created by hand for each information extraction task, and this requires considerable computational expertise, as well as time.

Our goal is a system that will be able to generate its own templates automatically. We propose a sequence of research steps, moving from manual template creation to fully automatic template creation. The aim is to systematically reduce the dependence on technical expertise, whether in the form of a technical expert or of biological researchers having to spend time developing the required computational knowledge. The end result will be a system that can create its own templates for biological (as opposed to computational) users.

As the system develops, we will apply it to a variety of biological problems that we have identified with our collaborators. These include superfamily assignment, genome annotation, gene product characterisation and pathway inference. In each case, the system will be validated by comparison with known results in existing databases, and by being used to extend well-understood domains, before being deployed by the biological research partners to work on new domain areas. This sequence will provide confidence in the system's abilities.

We plan to represent documents and templates in a single format, using hidden Markov models (HMMs). Each node in the model will represent a word, a part of speech (noun, verb, etc.) or a semantic biological category (protein, gene, metabolite, etc.). A template is a sub-graph within this model, and so the problem becomes one of learning to identify useful sub-graphs automatically.

We shall apply different machine learning technologies to the problem, including supervised learning, semi-supervised learning and active learning. In supervised learning, a labelled example is a sentence or document fragment that a domain expert has chosen as being of interest (positive example) or irrelevant (negative example). It takes considerable time and expertise to create a set of such labelled examples, and this must be repeated for each new domain. In contrast, unlabelled examples are easy to acquire in large quantities. Semi-supervised learning makes use of both labelled and unlabelled examples to model the domain. Active learning systems work by generating potential solutions and asking the user to evaluate them. In this way, the user's expert knowledge can be incorporated within the system, without requiring the user to learn new technical specification languages or spend time learning to use a complex tool.
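
As a purely illustrative sketch of the template idea described in the abstract, the Python fragment below treats a template as a short sequence of elements, each matching either a literal word or a semantic biological category, with some elements filling named slots. It is not the project's software and does not use HMMs: the lexicon, the Element class, the BINDS template and the example sentence are all hypothetical, and a real system would learn such templates rather than have them written by hand.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy lexicon mapping tokens to semantic categories.
# This is an assumption for illustration, not data from the project.
LEXICON = {
    "p53": "protein",
    "MDM2": "protein",
    "BRCA1": "gene",
}


@dataclass(frozen=True)
class Element:
    kind: str                    # "word" matches a literal token; "category" matches a semantic class
    value: str                   # the word (lowercase) or the category name to match
    slot: Optional[str] = None   # name of the slot this element fills, if any


def match_at(template, tokens, start):
    """Try to match the template at tokens[start:]; return the filled slots or None."""
    if start + len(template) > len(tokens):
        return None
    slots = {}
    for element, token in zip(template, tokens[start:]):
        if element.kind == "word":
            if token.lower() != element.value:
                return None
        elif LEXICON.get(token) != element.value:
            return None
        if element.slot is not None:
            slots[element.slot] = token
    return slots


def extract(template, sentence):
    """Slide the template over a whitespace-tokenised sentence and collect one record per match."""
    tokens = sentence.rstrip(".").split()
    records = []
    for start in range(len(tokens)):
        slots = match_at(template, tokens, start)
        if slots is not None:
            records.append(slots)
    return records


# Hand-written example template: <protein:agent> binds <protein:target>
BINDS = [
    Element("category", "protein", slot="agent"),
    Element("word", "binds"),
    Element("category", "protein", slot="target"),
]

records = extract(BINDS, "We show that MDM2 binds p53 in vitro.")
print(records)
```

Each returned dictionary corresponds to one candidate database record, with the slot names as fields. Writing BINDS by hand is exactly the manual step the proposal aims to remove.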

Summary

unavailable

Committee	Closed Committee - Engineering & Biological Systems (EBS)
Research Topics	X – not assigned to a current Research Topic
Research Priority	X – Research Priority information not available
Research Initiative	X - not in an Initiative
Funding Scheme	Industrial Partnership Award (IPA)

BBSRC Portfolio Analyser

BBSRC Portfolio Analyser will be decommissioned at the end of 2025.
Other UKRI reporting services are available to you providing details of BBSRC investments: UKRI - What we have funded
