Award details

From data to knowledge / the ONDEX System for integrating Life Sciences data sources

Reference BB/F006063/1
Principal Investigator / Supervisor Professor Anil Wipat
Co-Investigators / Co-Supervisors Dr Phillip Lord, Professor David Lydall, Professor Darren Wilkinson
Institution Newcastle University
Department Computing Sciences
Funding type Research
Value (£) 625,264
Status Completed
Type Research Grant
Start date 01/04/2008
End date 31/10/2011
Duration 43 months

Abstract

The current ONDEX system enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques. It uses a semantically rich Core data structure based on graphs, has explicit support for workflow and can bring together information from structured databases and unstructured sources such as sequence data and free text. Extensions for Systems Biology include:

Enhancing the ONDEX Core:
- Methods to map data into the core data structures to exploit synteny and sequence similarity for applications needing comparative analysis of genetic and genomic organisation of multiple organisms.
- Techniques for probabilistic interpretation of relations, allowing uncertainty in the integrated data and in biological relationships to be modelled, and combining relations using probabilistic models such as naive Bayesian and Bayesian graphical Gaussian approaches.

Exploiting the ONDEX data graph:
- A graph structure analysis toolkit, using standard and advanced graph analysis algorithms, that traverses the data graph and allows modules representing common structural and functional components to be identified.

Populating the ONDEX model:
- Orchestrating data integration and analysis steps in ONDEX applications, using Taverna workflows and services (myGrid), including the running of workflows. Using Taverna will allow ONDEX to retain data on workflow provenance, which can be used to track, verify and validate data.
- Enhanced text mining methods to extract and map terms from text in databases and online literature sources, to detect synonymy and ambiguity, and to identify and extract biologically relevant relations.

Exposing ONDEX to tools:
- New data access interfaces to allow ONDEX data to be used by third party tools, e.g. within workflows, and data export tools to provide easy access to ONDEX data for users of Cytoscape and for export in standard systems biology model exchange formats (e.g. SBML, BioPAX etc).
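
The abstract mentions combining relations using naive Bayesian models. As a rough illustration of that idea only, and not of how ONDEX implements it, the Python sketch below combines independent pieces of evidence for a relation into a single posterior probability. The function name `combine_evidence`, the prior of 0.1, the gene names and the likelihood ratios are all hypothetical values chosen for the example.

```python
import math


def combine_evidence(prior, likelihood_ratios):
    """Naive Bayes combination of independent evidence for one relation.

    prior: assumed prior probability that the relation is real.
    likelihood_ratios: one value per evidence source, each being
        P(evidence | relation real) / P(evidence | relation not real).
    Returns the posterior probability that the relation is real.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    # Work in log-odds so that independent sources simply add up.
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical data graph: each edge carries the likelihood ratios
# contributed by the data sources that support (or contradict) it.
edges = {
    ("geneA", "geneB"): [4.0, 2.5],  # two supporting sources
    ("geneA", "geneC"): [0.5],       # one weakly contradicting source
}

posteriors = {
    edge: combine_evidence(0.1, ratios) for edge, ratios in edges.items()
}

for edge, probability in posteriors.items():
    print(edge, round(probability, 3))
```

The independence assumption is what makes the model "naive": sources that share underlying data would be double counted, which is one reason the abstract also lists richer graphical models.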

Summary

The biological sciences generate many different types of data from different specialist disciplines (e.g. genetics, biochemistry, molecular biology). Bringing data together coherently is a major undertaking in any systems biology project. While new databases of biological thesauri and classification systems (ontologies) for the component parts of biology make it easier to link specialist databases, this only solves part of the problem of data integration for systems biologists, who need a much richer body of information. For example, there are many different ways that biological components can be related (e.g. by function, location, size), which need to be captured, and information about the provenance (history or source) of data can be important when the data are interpreted. New types of information are also important in systems biology, including descriptions of the biological processes and pathways for metabolism and information flow. Many of these have been created by extracting information from the scientific literature to form the basis for the predictive dynamic models and simulations of system function. Because systems biology has a need for complex data integration and scientific text mining that is not met by readily available bioinformatics software in the biological research community, a prototype system (ONDEX) has been developed by Rothamsted Research. This project will combine ONDEX with leading technologies in workflow, graph analysis and text mining, to develop a powerful and professional tool that will underpin systems biology research. Three systems biology research projects, run by our BBSRC-funded systems biology centre partners, will drive the development of ONDEX and will validate new features on real scientific problems. The biological areas addressed are bioenergy crops, yeast metabolome models and telomere function in ageing.
The research partners bring important technical expertise that will enhance ONDEX with new capabilities known to be required by systems biologists at their centres. These include:
* Extensions to methods that map data into ONDEX to broaden the range of data that can be integrated and capture more of the information about it (the metadata).
* State of the art text mining capabilities, for extracting biological concepts and relationships from online text to enable new data buried in the scientific literature to be extracted and structured into models and databases.
* Extensions to handle the statistical uncertainty inherent in many biological relationships, to enable new relationships to be identified in the integrated datasets using modern statistical inference techniques.
* Enhanced graphical visualisations of the complex network of relationships to accommodate new information and scale to huge data networks, to enable a better understanding of new interactions, and better ways of interrogating the data in a richly integrated dataset.
* Exploitation of the latest in distributed computing techniques and scientific workflows to simplify, automate and scale the complex task of integration.
* Extended range of data interfaces relevant to both programmers and users to enable shared access over the Internet of the integrated datasets, which are important information resources in their own right.
A number of actions and engineering developments will make ONDEX easier to use by biologists and support uptake in new areas of systems biology. These include new training resources, workshops for users and developers and providing direct help for new applications through an outreach programme. At the end of the project ONDEX will be delivered to existing and new users in a well-engineered and robust form. This should allow it to be more readily used by a greatly expanded user and developer community, which should in turn make it sustainable in the long term as an open software project.
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Bioenergy, Industrial Biotechnology, Systems Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Systems Approaches to Biological Research (SABR) [2007]
Funding Scheme X – not Funded via a specific Funding Scheme