Award details

NextGenPartiGene: next generation transcriptome assembly annotation and exploitation toolkit

Reference BB/I023585/1
Principal Investigator / Supervisor Professor Mark Blaxter
Co-Investigators / Co-Supervisors Dr Martin Jones
Institution University of Edinburgh
Department School of Biological Sciences
Funding type Research
Value (£) 123,949
Status Completed
Type Research Grant
Start date 01/08/2011
End date 31/01/2013
Duration 18 months

Abstract

Next generation sequencing technologies have qualitatively changed the way we acquire and analyse transcriptomes by making it possible to generate vast amounts of sequence data very cheaply. As the sequencing effort required to generate transcriptome-scale data has decreased, the bioinformatics effort required to analyse and annotate them has grown proportionally larger. The combined effects of increased affordability of sequencing and decentralization of sequencing facilities mean that the bulk of the burden of analysis falls on researchers who are not bioinformatics specialists. These conditions create a need for a user-friendly, robust transcriptome analysis package that can handle the volume of data produced by next-gen technologies. Our existing transcriptomics pipeline, PartiGene, is designed for last-generation sequencing technologies and written using last-generation programming techniques. We propose to develop and release a complete replacement, NextGenPartiGene, which will be built on modern programming technology and will incorporate best-practice transcriptome analysis. NextGenPartiGene will run completely within a web browser, allowing data sharing to be built in as a core feature, and will combine third party applications (for assembly and annotation) with custom visualization tools to provide a complete transcriptomics analysis and data mining workflow. NextGenPartiGene will be built using the Grails web framework, allowing rapid development and straightforward deployment, and where possible will use parallelization to take advantage of multiple processor cores and speed up analysis. The database schema will be designed from scratch to cope with the expected volumes of data, and will take advantage of the full-text indexing integrated in PostgreSQL 8.3 to offer comprehensive searching of annotations.

Summary

Biologists have access to ever improving toolkits with which to ask probing questions of the natural world. One revolutionary development that has taken place over the last forty years is the advent of DNA sequencing. We now have the ability to decipher the genome sequence (or 'genetic blueprint') of any organism, and from this work out how it ticks. About five years ago, this genomics revolution stepped up a gear, with the introduction of DNA sequencing technologies that increased the rate of genome sequencing, and reduced the cost, many, many fold. These 'next generation' technologies have suddenly made it possible for many researchers to start using genome sequencing in their work. However, as with any new technology, new solutions bring new problems. In the case of genome sequencing it is a 'rich person's' problem: researchers can now generate hundreds to thousands of times as much data as they used to, in a small fraction of the time, but they do not have the computer tools to process and understand it. The reduced cost of sequencing also means that many researchers who can now afford to use this technology do not have the long training in computing required to successfully analyse the floods of data. We propose to develop a set of easy-to-use tools, which we call NextGenPartiGene, using 'next generation' computing frameworks, that will alleviate this problem. We are focussing on the problem of working out what genes an organism is using (or 'expressing'), and what it is that these genes are likely to be doing. By sampling only the expressed genes of an organism (or a part of an organism, such as a leaf or a particular tissue type) it is possible to build up a detailed picture of the kinds of biochemical pathways the organism is running (what it can eat and what wastes it produces), and how experimental interventions change these pathways.
We will build the NextGenPartiGene toolkit using an emerging model for such projects: the idea that much of the hard work is done by a server computer, running clever programmes behind the scenes, and that this server is driven by a client, accessed through a standard web browser. By building this client-server toolkit, we will be able to guide researchers with vast amounts of next-generation sequencing data down the best-practice, tried-and-tested paths to full and fruitful analysis. This means they will be able to extract maximum information from their data, and maximum value from their research funding. We will release the NextGenPartiGene tools as open-access software, so that others are free both to use them and to modify and improve them to fit their needs.

Impact Summary

NextGenPartiGene is envisaged as an enabling tool. The beneficiaries of this research will be mainly, in the first instance, academics and small to medium enterprise companies using next generation sequencing approaches in the analysis of novel species or novel treatments of well-studied species. By building efficient, fit-for-purpose and open-access tools, we will promote best practice across the field. As we are releasing the software openly, it will not bring direct financial benefit (i.e. through intellectual property rights) to ourselves or to the University, but it will facilitate the exploitation of these tools by such users. By taking on the weight of constructing and testing usable software, we free such beneficiaries to better produce the outcomes they are qualified to deliver, be they improved biological understanding, or better exploitation of a biological or biotechnological resource. In particular, the need for discovery, development and testing of new crop organisms, whether they are animals, plants, fungi or other eukaryotes, for goals of biofuels production, ecological remediation and food security assurance, will be aided by more efficient and trustworthy bioinformatics tools. Genomics and transcriptomics are now a first port of call in development of novel organisms for exploitation, whether to understand their basic biology and biochemistry, to unravel the mechanisms behind desirable traits, or to develop genetic markers for assisted breeding programmes. NextGenPartiGene can be a key resource in achieving these goals. In particular, by reducing the time and resource needed to turn raw data into mineable databases, it will increase the efficiency and productivity of next generation transcriptomics approaches across the board. Our tools will also promote data sharing between users, thus giving them enhanced ability to fruitfully cooperate on shared projects.
By offering a unified solution, we allow collaborating institutions, organisations and companies either to open their analyses to outside scrutiny (via the open API of the NextGenPartiGene suite or the web browser), or simply to merge datasets produced independently (because the underlying data structure will be the same).
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority Technology Development for the Biosciences
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme