BBSRC Portfolio Analyser
Award details
Mining protein interaction data and its context from the scientific literature
Reference
BB/H016694/1
Principal Investigator / Supervisor
Professor David Robertson
Co-Investigators / Co-Supervisors
Dr Robert Hernandez, Professor Goran Nenadic
Institution
The University of Manchester
Department
Life Sciences
Funding type
Skills
Value (£)
75,281
Status
Completed
Type
Training Grants
Start date
01/10/2010
End date
30/09/2014
Duration
48 months
Abstract
unavailable
Summary
The main archive of life sciences literature currently contains more than 17 million references and grows by approximately 2,000 articles every day. This information is invaluable and represents a rich source of knowledge for academic, biomedical and industrial researchers. However, its current size, let alone its future size, is making it virtually impossible for individual scientists to keep pace with publications in their own area, let alone related ones. It is therefore likely that a significant amount of scientific effort is spent re-discovering phenomena that might already have been studied in similar experiments. This has led to the generation of extensive secondary data sets mined from the published literature, e.g., for yeast (Reguly et al., 2006; J. Biol. 5:11), microbes (Rajagopala et al., 2008; Bioinformatics 24:2622) and HIV (Pinney et al., 2009; AIDS 23:549), among others. In recent years much emphasis has been placed on using text mining to identify protein interactions, and several relatively successful systems have been developed in this area (Krallinger et al., 2008; Genome Biol. 9:S4). However, the extracted information is typically represented as simple interacting pairs, with limited background information to characterise the interaction: little attempt is made to capture the context of such information (e.g. experimental conditions, methods used, how reliable it is, what the nature of the interaction is, etc.). Furthermore, literature-curated data can be problematic, as it can contain curation errors and redundant data. In addition, a diverse collection of experimental methods will have been used to determine interactions. In this project we propose to study the way findings, experiments and knowledge about protein interactions are presented in the literature, and in particular how contextual information that details an interaction is encoded and presented.
The aim will be to put interaction data into its semantic and biological context. To do this, we will implement a text mining framework to extract contextual information from full-text articles, and link and contrast it with data in other (structured) resources. The knowledge extracted will be characterised by both qualitative and quantitative features. Qualitative attributes will model experimental context (e.g. outcomes, interaction types, conditions, constraints, methods, model organisms, etc.). We will explore and, if necessary, customise existing modelling frameworks (including, for example, PSI-MI and EXPO) to represent experimental context extracted from the literature. Quantitative measures will represent features that may be indicative of data quality or relevance for a specific data set. We will explore bibliometrics assigned to protein interactions, such as the number of citations and mentions, peaks and changes over time, and association with specific entities such as experimental methods, model systems, drug associations and outcomes. To achieve these aims, the student will develop a generic framework in which interaction data will be systematically collected from the literature and then integrated, explored and visualised. The specific methodology will follow a hybrid approach that combines existing biomedical resources, e.g., terminological dictionaries and ontologies, with a rule-based approach to bootstrap data-set-specific patterns, while suitable machine-learning-based methods will be developed to improve the coverage of the information extracted. The information will be presented via interaction networks augmented with context data, which will facilitate more biologically informed exploration of protein-protein relationships.
Importantly, the general framework developed for placing biological 'facts' in context will be applicable across biological and text-mining domains, but will be implemented and evaluated in a specific context in collaboration with the industrial partner.
Committee
Not funded via Committee
Research Topics
X – not assigned to a current Research Topic
Research Priority
X – Research Priority information not available
Research Initiative
X – not in an Initiative
Funding Scheme
Training Grant - Industrial Case