Award details

Continued development of the ChEBI database and ontology for improved interoperability with biomedical resources

Reference: BB/G022747/1
Principal Investigator / Supervisor: Professor Christoph Steinbeck
Co-Investigators / Co-Supervisors: Professor Michael Ashburner
Institution: EMBL - European Bioinformatics Institute
Department: Chemoinformatics and Metabolism
Funding type: Research
Value (£): 542,326
Status: Completed
Type: Research Grant
Start date: 19/10/2009
End date: 18/10/2012
Duration: 36 months

Abstract

ChEBI is a freely available dictionary of molecular entities focused on 'small' chemical compounds. The primary motivation behind ChEBI was to provide a high-quality, thoroughly annotated controlled vocabulary to promote the correct and consistent use of unambiguous biochemical terminology throughout the molecular biology databases at the EBI and worldwide. However, this aim could not be achieved outside of a wider context of general chemistry and chemical nomenclature. The scope of ChEBI therefore encompasses not only 'biochemical compounds' but also pharmaceuticals, agrochemicals, laboratory reagents, isotopes and subatomic particles. ChEBI is designed as a relational database and is implemented on an Oracle database server. A number of utility applications, implemented mainly in Java and Unix scripts, provide additional functionality around the database, such as the loading of data from external sources. Specialized web-based interfaces provide both public access to the data and restricted access to the annotation tool. ChEBI stores 2D or 3D structural diagrams as connection tables in MDL molfile format. One entity can have one or more connection tables. One-dimensional strings such as IUPAC InChI, IUPAC InChIKey and SMILES are automatically derived from the default connection tables. Every ChEBI entry contains a list of parent and child entries and the names of the relationships between them. ChEBI can be accessed via the web and Web Services. The entire ChEBI dataset is available for download in four different formats from the FTP server. The structures of molecular entities from ChEBI are made available to the PubChem database. Feedback to the ChEBI team is provided via a SourceForge Forum. A tool for user-driven direct deposition of chemical data to ChEBI is under development.
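The entry structure described in the abstract (one entity, one or more connection tables with one marked as default, plus named links to parent and child entries) can be pictured with a minimal Python sketch. This is an illustration only and not the actual ChEBI schema: the class names, field names and the placeholder identifiers "EX:1" and "EX:2" are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConnectionTable:
    """One structural diagram, held as MDL molfile text (hypothetical model)."""
    molfile: str
    is_default: bool = False


@dataclass
class Relationship:
    """A named link to another entry, e.g. name='is_a' (hypothetical model)."""
    name: str
    target_id: str


@dataclass
class Entry:
    """An entity with its structures and ontology links (hypothetical model)."""
    entry_id: str
    name: str
    connection_tables: List[ConnectionTable] = field(default_factory=list)
    parents: List[Relationship] = field(default_factory=list)
    children: List[Relationship] = field(default_factory=list)

    def default_table(self) -> Optional[ConnectionTable]:
        """Return the connection table flagged as default, or None if absent."""
        for table in self.connection_tables:
            if table.is_default:
                return table
        return None


# Placeholder data: the identifiers and molfile strings are not real records.
example = Entry(
    entry_id="EX:1",
    name="example entity",
    connection_tables=[
        ConnectionTable(molfile="<2D molfile text>", is_default=True),
        ConnectionTable(molfile="<3D molfile text>"),
    ],
    parents=[Relationship(name="is_a", target_id="EX:2")],
)
```

In this sketch, derived strings such as InChI or SMILES would be computed from `example.default_table()`; that derivation needs a cheminformatics toolkit and is deliberately left out.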

Summary

Today, the biological sciences are generating an enormous amount of data aimed at tackling fundamental questions such as 'What is the molecular basis for life?', 'How do organisms work?' and 'How does disease arise and how can it be treated?'. While research in the past has often followed a reductionist approach - studying the parts to understand the whole - today it increasingly follows a systems approach, integrating insights from the past into a holistic model of an organism (systems biology). The goal is to use such a model to perform a computer simulation of the organism and to use this to answer questions such as 'What effect will the addition of compound X (for example a drug) have on the organism?'. Isolated approaches to organizing the data in particular fields in the molecular sciences - information on genes (the code of life), proteins (an organism's chemical factories) and small molecules such as sugars or drugs - will hamper synergistic insights. Instead, databases in the biological sciences, which are used to parametrize such simulations, need to be interlinked and interoperable, allowing seamless movement amongst them. Because biological databases are generated on a worldwide basis by diverse communities, their integration creates obvious challenges. Besides simple technical questions of interoperability, the bioscience community is therefore working on common data models and standards. Scientists create rules on how to name and encode scientific information in a computer (semantics) and how a particular piece of such information relates to the scientific concepts in its surroundings (ontologies). This results in ontological chains such as 'A fox is_a mammal, which in turn is_a animal', from which the computer automatically reasons that a fox is an animal, even if this is not explicitly stated.
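The kind of automatic reasoning described above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular ontology reasoner works: the dictionary layout and the function name `ancestors` are assumptions made for this example, and only the two 'is_a' statements from the fox example are stored.

```python
from typing import Dict, Set

# Only the two explicitly stated facts: fox is_a mammal, mammal is_a animal.
IS_A: Dict[str, Set[str]] = {
    "fox": {"mammal"},
    "mammal": {"animal"},
}


def ancestors(term: str, is_a: Dict[str, Set[str]]) -> Set[str]:
    """Collect every term reachable from `term` by following is_a links."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, set()):
            if parent not in found:  # guard against revisiting terms
                found.add(parent)
                stack.append(parent)
    return found


# "fox is_a animal" was never stored, yet it follows from the chain.
fox_is_animal = "animal" in ancestors("fox", IS_A)
```

Here `fox_is_animal` evaluates to `True` although no direct fox-to-animal statement exists in `IS_A`; the conclusion comes from following the chain of stored relationships.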
Besides the so-called 'is_a' relationship between entities in ontologies, there exists a whole range of other relationships such as 'is_part_of', some of which may be relevant only in certain fields of knowledge. Since ontologies can be complex and those of neighbouring fields may be interlinked, they allow machines to reason about the world. The database Chemical Entities of Biological Interest (ChEBI) provides the bioscience community with semantic and ontological information as well as stable identifiers for small chemical compounds (a class that includes most drugs). Areas such as drug discovery or systems biology bring together information about the morphology of cells, genes and proteins, as well as the small molecules that act on these. The interlinking between these pieces of information in databases is typically performed through stable identifiers assigned to entities such as single genes, proteins or small molecules by standardization bodies and database providers. In addition, formal and so-called 'trivial' names are assigned and associated with both the entity and the stable identifier in the database. ChEBI acts as a resource for such names and stable identifiers in the area of small molecules of biological interest. For this purpose it is widely used in the bioscience community, whose members send formal requests for the assignment of identifiers for particular small molecule entities to the ChEBI team. The team then performs the assignment, publishes the information into the public domain and informs the requesting party that the request has been fulfilled. Also acting as an ontology, ChEBI puts small molecule structures and their structural properties into an ontological context. It makes statements such as 'D-Glucose is_a D-aldohexose, which is_a ... [various is_a relationships omitted] ... which is_a monosaccharide, which is_a sugar.' Again, ontological chains such as the one above allow computers to make statements about the world (of chemistry in this case) that have not been explicitly coded elsewhere. This is useful, for example, in the field of text mining, the computer-based re-discovery of knowledge in the printed literature.
Committee: Closed Committee - Engineering & Biological Systems (EBS)
Research Topics: Structural Biology, Technology and Methods Development
Research Priority: X – Research Priority information not available
Research Initiative: Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme: X – not Funded via a specific Funding Scheme