Award details

pubmed2ensembl: a resource for linking biological literature to genome sequences

Reference BB/G000093/1
Principal Investigator / Supervisor Dr Casey Bergman
Co-Investigators / Co-Supervisors Professor Goran Nenadic
Institution The University of Manchester
Department Life Sciences
Funding type Research
Value (£) 99,333
Status Completed
Type Research Grant
Start date 23/02/2009
End date 22/02/2010
Duration 12 months

Abstract

Advances in DNA sequencing technology have drastically increased the rate of production of genomic sequence data, thereby accelerating the rate of biological discovery and publication. Genomic data are well served by genome portals, and PubMed provides widely used access to the biomedical literature. However, essentially no effort has been made to systematically integrate genome sequences directly with the biological literature, despite the fact that these are the two most heavily relied upon sources of information for many biologists. The ability to navigate directly between genomes and the biomedical literature, and to perform cross- and multi-lingual queries using both textual and genomic constraints, would greatly aid experimental and computational researchers alike, and would provide a unique and much-needed bridge between two of the fastest growing sources of biological information. We propose to develop an open-access resource called pubmed2ensembl that links biological literature directly to genomes, allowing integrated queries over genomic and textual information via human and programmatic web interfaces. We will use both human-curated and automatically extracted gene-publication links to populate the pubmed2ensembl database, including a novel source of links based on an automated method (called text2seq) that extracts DNA sequences from text and maps them to genomes. Queries to the pubmed2ensembl system will be executed using genome- or text-based data types and will return data types in the same or the complementary domain. The capability for such cross- and multi-lingual queries over text and genomic data will be a novel and defining feature of the pubmed2ensembl system. Our system will also uniquely leverage comparative genomic data to allow cross- and multi-species retrieval of text-based information, thereby supporting one of the most common workflows in the life sciences: using published results from model organisms to guide further biological research.

Summary

Due to advances in technology, the rate of discovery and publication in the field of biology is increasing rapidly. Approximately 500,000 articles are published annually on biological research, and advanced computational systems are now needed to fully access and interpret this wealth of biological information. On an equally grand scale, the complete genetic blueprint for a large number of species has recently been made available to the scientific community through international genome sequencing projects. These genome projects have in large part driven the explosion in biological publication; however, essentially no work has been done to develop computational systems that provide integrated access to genome sequences and the biomedical literature. This project seeks to overcome this critical limitation in access to biological information by developing a computational resource, called pubmed2ensembl, that will directly integrate genomic data with the biomedical literature, providing biological researchers with a unique bridge between two of the fastest growing sources of biological information. Our system will allow experimental and computational researchers alike to perform 'cross-lingual' and 'multi-lingual' queries using both textual and genomic information (e.g. querying textual data using genomic information as constraints). Additionally, our system will allow direct navigation to the literature from genome sequences, allowing researchers to browse the published literature as they would any other genomic feature (e.g. genes). pubmed2ensembl will be open-access, accessible by both human and programmatic interfaces, and will be integrated with established bioinformatics services and resources (such as the Ensembl Genome Browser).
By coupling the accumulated knowledge in millions of published articles directly with genome sequences, pubmed2ensembl will provide a critical and much-needed resource to decode biological processes encoded in genomes.
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme