Award details

Blobtoolkit: Identification and analysis of non-target data in all Eukaryotic genome projects

Reference BB/P024459/1
Principal Investigator / Supervisor Dr Guy Cochrane
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department OMICs
Funding type Research
Value (£) 173,760
Status Completed
Type Research Grant
Start date 07/07/2017
End date 31/07/2021
Duration 49 months

Abstract

Many next-generation genome datasets derive from a mixture of taxa - either because the mixture is a biologically relevant unit (symbionts, organisms with associated metabiomes), or because the sample was, or became, contaminated. Separation of reads into bins corresponding to distinct organisms is essential for analysis, as mixed assemblies result in erroneous inferences - e.g. of species physiology, horizontal gene transfer, and holobiont biology. Unfortunately, public databases are already contaminated by sequences with incorrect taxonomic assignments. We propose to develop BlobToolKit, based on our successful Blobtools, both to clean the existing public databases and to ensure that future submissions are correctly annotated. BlobToolKit will use a range of algorithms to delineate distinct sequence and read bins in next-generation data, and use these separate bins for independent analyses. BlobToolKit will include an interactive visualisation platform that will facilitate exploration of assembly data, and thus the generation of high-quality assemblies.
BlobToolKit use modes will be delivered by distinct packaging of the core software:
* It will be used by researchers assembling de novo, as part of high-quality assembly pipelines - delivered through a command line version, accessible through an API.
* It will be used by authors, editors, reviewers and database curators as a quality check before submission, acceptance of manuscripts or accessioning - delivered through a cloud-based Galaxy instance.
* It will be used by databases to display interactive graphical reports on accessioned genomes and support data reuse - delivered through API integration with databases.
Core development will be carried out in Edinburgh, and integration with service delivery through the European Nucleotide Archive will be delivered from the EMBL-EBI, Hinxton. We will develop training and outreach materials to promote uptake of BlobToolKit in the research community.

Summary

Genomics has become one of the cornerstones of biology. Knowing an organism's genome sequence immediately allows us to work out what kinds of biology it is able to do, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow science to collaborate and build upon its successes. Genome sequencing is getting easier, as technologies such as new, high-throughput sequencers and advanced computing improve by leaps and bounds. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.
When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.
How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.
We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both during the process of making new genomes for the first time (where it will separate out the different organisms in the mix of raw sequence data), and during reanalyses of existing genome assemblies. BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in "legacy" genomes already submitted to public databases, and to have in place a system that prevents further contamination happening. BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users.
We will further promote BlobToolKit by publishing our results in open access journals, giving presentations and workshops at relevant meetings, holding discussions with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community to a culture in which contamination in genome assembly is understood and expected, and in which freely available, versatile software tools that assist in flagging and preventing contamination in the public record are widely known.

Impact Summary

We and others have identified a critical issue with contamination and incorrect attribution of genomic sequences in the public databases. To rectify this legacy problem and to reduce its impact on future data submissions we propose a toolkit, BlobToolKit, that aids producers and users in identifying and correctly classifying such data.
How will BlobToolKit impact science and industry? This work will have impact beyond the purely academic sphere of those generating genome sequences. In particular we envisage impacts in:
* Clinical science and delivery, as pathogens and other possibly harmful species will be correctly identified;
* Food production, where the improvement of methods of fermentation by microbes, such as in brewing and cheese manufacture, requires access to accurately attributed sequence data;
* Crop science, as data relevant to invasive and pathogenic species will be available for monitoring, control and eradication programmes;
* Livestock health, as data relevant to emerging threats to the production of crop and livestock species from novel or imported pathogens will be available for monitoring and eradication programmes;
* Biofuel species development, where yield optimisation depends upon a clear mechanistic understanding of the genomics of the species to hand and its relatives, free from contaminant sequence;
* Drug discovery, where the process of initial lead definition will not be fatally misled by misattributed sequence;
* Bioprospecting, where correct linkage between sequences and the organisms they derive from will speed identification of useful bioactives;
* Biotechnology, where the engineering of synthetic pathways requires accurate identification and characterisation of genomic material and its attribution to the correct sources.
We also recognise that SMEs are beginning to generate genome assemblies for target species, and BlobToolKit will aid these in generating high-quality data on which future investment can be based.
The toolkit will be available under an appropriate open software license, permitting installation on local servers as well as on private cloud computing systems.
How will prospective users become informed about BlobToolKit? By delivering BlobToolKit in standalone, cloud, and database-proximal versions, and by developing novel interactive visualisations, we will ensure that it has wide uptake and open availability. We will deliver BlobToolKit-enabled assessments of public data via a plugin to the ENA web data services. This will reach tens of thousands of data users per year. By flagging sequences with suspect annotation, we will improve the sequence search results, and interpretation of downloaded data, for many tens of thousands more. Overall, the toolkit will serve to correct the scientific record at source, and provide an independent measure of data quality and reliability for future reuse. Ultimately we hope that BlobToolKit will become part of the hidden but essential infrastructure that supports UK and global bioscience, whether academic or commercial. Users will know that the data they are using have been screened by BlobToolKit, and will expect BlobToolKit stamps of credibility on data they access and exploit.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme