Award details

The Lazarus Project: Resurrecting data and knowledge from life science articles by crowd-sourcing

Reference BB/L005298/1
Principal Investigator / Supervisor Professor Steve Pettifer
Co-Investigators / Co-Supervisors Professor Teresa Attwood, Professor Carole Goble, Professor Robert Stevens
Institution The University of Manchester
Department Computer Science
Funding type Research
Value (£) 481,203
Status Completed
Type Research Grant
Start date 30/06/2014
End date 29/06/2017
Duration 36 months

Abstract

Lazarus aims to harness the crowd of scientists reading life-science articles to recover the swathes of legacy data buried in charts, tables, diagrams and free-text, to liberate processable data into a shared resource that benefits the community. Scientific articles are 'stories that persuade with data', but their historical format makes accessing the data for validation or analysis difficult: small molecules are typically represented as illustrations; biochemical properties as tables or graphs; protein/DNA sequences are buried amongst text; references and citations have arcane formats; and other objects of biological interest are referred to by ambiguous names. Capturing such data necessitates the familiar drudgery of re-typing figures from tables, chasing citations through digital libraries, redrawing molecules by hand... tedious, error-prone, wasteful and currently wasted processes. Mass-mining methods (text mining, optical recognition) to automate such tasks aren't yet sufficiently reliable to be used without human validation, and are generally disallowed by the licenses under which articles are published. Without 'human computation', existing knowledge is thus destined to remain entombed in the literature. Lazarus' objectives are to harness a percentage of paper readership and leverage the Utopia document-reading platform with which any PDF from any publisher can be read. We aim to harness individuals' 'microtasks' of extracting data or annotating articles for personal use, and pool them for reuse; cross-validate and feed back annotations to better train the crowd and improve data quality; produce an open-access, restriction-free, searchable and processable resource for use by computational and analytical pipelines; create a web-based observatory, gathering per-article metrics; observe and steer the crowd toward data-resurrection campaigns. Lazarus' methods combine data extraction micro-task design, task observation, crowd engagement and data reuse.

Summary

The scientific literature is one of the most important knowledge-resources for the life sciences, with over 200k articles downloaded each day from Elsevier's Science Direct system alone. Covering over 20k journals, two new papers per minute are added to 22 million or so existing articles indexed by PubMed. For most scientists, reading, analysing and organising their personal library of articles is a daily task that forms a fundamental part of their scientific process. As the rate of publishing accelerates, the need for computational support to work out which articles to read, and how to interpret, reproduce and validate the claims they contain, is growing. However, traditional publications are aimed at consumption by humans -- they are 'stories that persuade with data' -- and their combination of nuanced natural language and complex figures does not make them easily amenable to processing by machine. In the life-science literature, drug-like molecules are typically represented as illustrations; biochemical properties as tables or graphs; protein/DNA sequences are buried amongst text; references and citations have arcane formats; and other objects of biological interest are referred to by ambiguous names. Capturing such data necessitates the familiar drudgery of re-typing figures from tables, chasing citations through digital libraries, redrawing molecules by hand: all of these are tedious, error-prone, wasteful and currently wasted processes that are carried out by scientists on a regular basis. Mass-mining methods (text mining, optical recognition) to automate such tasks are not yet sufficiently reliable to be used without human validation, and are generally disallowed by the licenses under which articles are published. Thus, without the 'human computation' possible through crowd-sourcing, existing knowledge is destined to remain entombed in the literature.
The Lazarus Project aims to harness the crowd of scientists reading life-science articles to resurrect the swathes of legacy data buried in charts, tables, diagrams and free-text, to liberate processable data into a shared resource that benefits the community. Lazarus aims to harness activities that are currently carried out by individuals for their own purposes (annotating, cross-referencing articles with databases, organising collections of articles). Our approach is to extend the functionality of an existing literature-enhancement platform that is currently designed for individual use. Utopia Documents is a PDF-reader that enhances the experience of reading life-science literature: it analyses documents on the fly, linking their content to online resources, and helps users explore associated data and knowledge bases. It has a number of 'convenience' features, such as extracting data from tables, reconstructing molecules from images or 'Markush-like' representations, or navigating citations, that make interacting with the content of an article more efficient. Its counterpart, Utopia Library, provides complementary functions for collections, providing automated recommendation, legitimate copyright/license-sensitive acquisition and sharing of articles, and sophisticated 'semantic' classification and organisation of personal libraries. Lazarus aims to enhance the Utopia tools such that the micro-tasks already performed by individuals can be harnessed at a crowd scale and repurposed for crowd consumption. As a result, scientists will benefit from richer, more searchable literature, and more accessible data; publishers will benefit from enriched content, without the need to develop new in-house infrastructures; data integration initiatives will benefit from access to a rich literature/data-linking resource.

Impact Summary

Lazarus has the potential for exceptionally broad impact in the Life Sciences and beyond. While the UK leads globally in terms of open access policies, the scientific community is in desperate need of tools to exploit the potential of these recent changes, and to make the most of the knowledge currently locked in the literature. The recent acquisition of Mendeley -- holders of the largest 'independent' collection of bibliographic metadata and citation network data -- by commercial publishing giant Elsevier makes the creation of an open, freely accessible repository of knowledge from the literature ever more pressing. BBSRC's investment in biology generates much data that are under-exploited, and making available "what we didn't know we already knew" will have immediate and long-term benefits to biological science. Biologists lack tools tuned to aggregate, integrate and mine the data and insights currently locked in the scientific literature; this project addresses this need. This project has the potential to make an impact on the "reduction, refinement, and replacement" of animal experiments. By making the data on experiments published in the literature more available, replication can be avoided. Although in its pilot phases this project focuses on three areas of life sciences (pathways, pharmacology and sequence/structure analysis), these are merely case studies designed to enable the fine-tuning of the crowd-sourcing approach and the underlying technology; the resulting platform and approach will be applicable in any life science domain. Scientists, whether in academia or industry, will benefit from richer, more searchable literature, and more straightforward access to the data and concepts that are currently sequestered in papers. Tasks that they are presently required to perform manually and repeatedly will be simplified, reducing the time wasted and increasing the quality of the results.
The data generated by UK-funded research, past and future, will be more open and accessible to human and machine consumption. Scientific publishers of all scales, whether commercial or scholarly, will benefit from enriched content, without the need to develop new in-house infrastructures. Data integration initiatives and primary life science databases will benefit from open access to a rich literature/data-linking resource, the content of which has been validated by the crowd. The pharmaceutical/biotech industry will benefit from a system that allows them to 'join up' their in-house knowledge, linking their scientists' reading habits to their in-house knowledge-bases.
Committee Research Committee A (Animal disease, health and welfare)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative Crowd Sourcing for the Biological Sciences (CSBS) [2013]
Funding Scheme X – not Funded via a specific Funding Scheme