Award details

PLUTo: Phyloinformatic Literature Unlocking Tools. Software for making published phyloinformatic data discoverable, open, and reusable

Reference BB/K015702/1
Principal Investigator / Supervisor Professor Matthew Wills
Co-Investigators / Co-Supervisors
Institution University of Bath
Department Biology and Biochemistry
Funding type Research
Value (£) 118,765
Status Completed
Type Research Grant
Start date 01/02/2014
End date 01/08/2015
Duration 18 months

Abstract

While there are well-established and excellent repositories for molecular sequence data (NCBI), there are no comparable resources for alignments or morphological data (Dryad is the best), still less for trees or other meta-data (measures of tree support, indices of homoplasy, etc.). These data remain locked down into PDFs, and are currently not machine-readable. This is hugely detrimental to many biological disciplines. We will develop and perfect tools (PLUTo) enabling researchers to unlock phyloinformatic data from published PDFs. These will generate Newick/NeXML tree files (with branch lengths and support metrics) by interpreting SVG and other graphics, and parsing the text/legends for other data. We will use AMI2 extraction technology, based on PDFBox, JUMBO and AMI-code. This is presently in prototype. The full code system (PLUTo) will comprise AMI2 and SOLR. The beta will be presented to BMC, PLoS and EuPMC staff/boards. We will also contact selected TA publishers to seek CC0 extraction agreements. The corpus of data will then be checked and annotated in detail by a data clerk (Bath) and via PyBossa (an OKF crowdsourcing community platform). We will explore the possibility of EuPMC and publisher-adopted installation for sustainable CC0 tree extraction. We will develop annotation tools for testing and validating PLUTo on new content. We will set up a PLUTo server based on SOLR (OKF already has Pubcrawler with extracted bibliographic metadata for 25 million STM publications in CKAN). PLUTo content will be uploaded to the OKF CKAN/Datahub (following the model for data.gov.uk). We will use the corpus to address key questions in phyloinformatics and systematics. Which clades are the foci of phylogenetic research and what types of data are being used? Importantly, how does this research effort relate to the diversity of clades? Are some groups disproportionately under-sampled? Is the quality of phylogenetic data variable across higher taxa?
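
As an illustration of the Newick output described above, the following is a minimal sketch of serialising a tree with branch lengths and support values. It is not PLUTo or AMI2 code: the `Node` class and the `to_newick` function are hypothetical names introduced here, and writing internal-node support in the label position is a common convention assumed for this example, not one stated in the abstract. Labels containing Newick punctuation would also need quoting, which this sketch does not handle.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node for illustration; not part of PLUTo or AMI2."""
    name: str = ""
    length: Optional[float] = None   # branch length leading to this node
    support: Optional[float] = None  # support value for an internal node
    children: List["Node"] = field(default_factory=list)


def _subtree(node: Node) -> str:
    """Return the Newick text for one node and everything below it."""
    if node.children:
        text = "(" + ",".join(_subtree(child) for child in node.children) + ")"
        # Assumed convention: support is written where the internal label goes.
        if node.support is not None:
            text += format(node.support, "g")
        elif node.name:
            text += node.name
    else:
        text = node.name
    if node.length is not None:
        text += ":" + format(node.length, "g")
    return text


def to_newick(root: Node) -> str:
    """Serialise a whole tree as a Newick string terminated by ';'."""
    return _subtree(root) + ";"


# A three-taxon tree in which the clade (B, C) has support 95.
example = Node(children=[
    Node("A", 0.1),
    Node(children=[Node("B", 0.2), Node("C", 0.3)], length=0.05, support=95),
])
print(to_newick(example))  # (A:0.1,(B:0.2,C:0.3)95:0.05);
```

In a real extraction pipeline the tree topology and labels would first have to be recovered from the figure graphics; that step is outside the scope of this sketch.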

Summary

Phylogenetic data, and the trees inferred from them, represent a hugely valuable resource for evolutionary biological research. The data are often expensive and time-consuming to acquire, and the results from analyses of these data - typically trees - represent a vast investment of effort and expertise across the global community of bioinformaticians and systematists. Trees, and their underlying character data, are often repurposed in other areas of biology; notably in evolutionary studies that seek to test patterns of genomic evolution or macroevolutionary trends. Despite their enormous value, recent research by the PDRA estimates that less than 4% of the phylogenetic trees published in 2010 are available in machine-readable form. Our proposal stands at the leading edge of content mining technology. We will create Open Source 'data liberation' software tools that will allow us to unlock the greater proportion of phyloinformatic data from where they are currently buried in the literature. These will include phylogenetic trees, branch lengths and support values (extracted from the SVG content of PDF files), analytical methods and indices of data quality (from figure legends and the main body of the text) and the underlying molecular and morphological character data. We will also derive full bibliographic and geographical data for each source paper. We will test, refine and perfect these tools by applying them to PLoS, BMC, Elsevier, Wiley and Springer online content from the 21st Century. Once the data are extracted, we will ensure that their immense interdisciplinary (evolutionary biology, ecology, ethology, palaeobiology and conservation) and legacy potential is realised by making them available online in an explicitly open manner. We will also use the data ourselves in order to address several related questions concerning research effort, phyloinformatic data quality and the progress of systematic research.
While there is renewed interest in, and emphasis on, curating underlying research data and results (exemplified by projects such as TreeBASE, Dryad, BMC's partnership with LabArchives, and FigShare), these ventures rely upon author submission, which is rarely mandated by journals. Uptake has been slow and coverage is woeful. The data archiving success of NCBI/GenBank for nucleotide sequences (N.B., not alignments, trees or other results, and certainly not morphology) is the exception rather than the rule in the Biological Sciences. For the foreseeable future, therefore, there is a pressing need to retrospectively gather data from the published literature. This project is extremely novel in its scale and ambition. If successful in re-extracting the majority of phylogenetic data from the last decade, the software will easily be adapted and modified by others to suit the data re-extraction needs of other areas of science. This will better harness the billions of pounds of research money hitherto invested into obtaining and analysing data, only for it to have been locked down and subsequently obfuscated in PDF publications when projects are completed. The project is also widely trans-disciplinary, bringing together a macroevolutionary phylogeneticist (Wills), a chemoinformaticist (Murray-Rust), and a young, up-and-coming researcher (Mounce). The potential wider benefits of this project are vast and diverse; content mining techniques are estimated to be capable of generating up to £200 billion annually in added value for Europe alone. We cannot claim to generate those benefits directly, but we will create open tools and generate open data that will greatly facilitate other commercial, industrial and academic ventures.

Impact Summary

Academic Impact

This project will have international academic impact in five areas.
1. Our results will be indispensable for any researcher conducting a systematic review of the phylogenetic literature. We will have identified where all papers containing phylogenetic trees have been published in the last decade, and be able to extract trees, their meta-data and underlying character data from many of these publications. These resources will be repurposed in many additional projects.
2. Our resources will be invaluable for evolutionary biologists, ecologists, ethologists and palaeobiologists needing to test evolutionary hypotheses against a phylogeny. There is also vast potential for developing phylogenetically-informed indices of conservation priorities.
3. The project will complement and enhance the published literature by providing discoverable, open, reusable data. Our resources will also complement projects elsewhere, especially the Assembling and Visualising the Tree of Life NSF projects. In particular, we will author tools that will enable much of the backlog of phyloinformatic data to be liberated from the literature. We are aware of no strategy to achieve this implemented elsewhere.
4. All systematic researchers will benefit from the project, as it relieves them from the responsibility of submitting their trees and meta-data to a separate repository (e.g., TreeBASE). The woeful coverage of such repositories (<4%) speaks to the inefficiency with which self-archiving captures the overall research investment. Making the data within a paper available and re-usable also increases the probability that the paper will be cited.
5. The new tools and data will revolutionise the process of supertree construction. They will integrate with other tools under development for this purpose; notably the Supertree Toolkit (STK).
Economic and Societal Impact

A recent appraisal of the economic potential of content mining techniques as applied to the scientific literature estimated that the value to Europe's economy could be £200 billion annually. Our project will develop cutting-edge technologies, considerably more advanced than straightforward text-mining, because we are also extracting images and amalgamating both techniques. Our proposal will promote technological progress in this area by providing open source software tools that can be applied transferably to other problems. A recent JISC report found that data mining techniques can result in substantial cost savings, productivity gains and innovative service development. The same report also found considerable potential for societal benefit; most significantly the provision of better visualisation techniques for large volumes of data. These new tools will ultimately allow researchers "to better convey research findings and other complex ideas to general audiences". New developments in content mining technology within academia also highlight the need for a fresh appraisal of UK Copyright law. Currently there are no exceptions allowed for research purposes. However, the independent Hargreaves Review of intellectual property and growth suggested that exceptions should be made, particularly where there are clearly identified scientific benefits. Thus, our findings (and the benefits from them) will lend considerable weight to these Hargreaves recommendations. This has the potential to influence legislative change that will affect UK society as a whole.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme