Award details

2020BBSRC-NSF/BIO: REDEFINE - Development of efficient, large-scale metagenomics sequence comparison algorithms to facilitate novel genomic insights

Reference BB/W002965/1
Principal Investigator / Supervisor Dr Robert Finn
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department Genome Assembly and Annotation
Funding type Research
Value (£) 498,881
Status Current
Type Research Grant
Start date 11/01/2022
End date 10/01/2025
Duration 36 months

Abstract

Over the past five years, assembled metagenomic sequence data have been generated at such accelerated rates that they have overtaken the volume of sequence data from isolate microbial genomes. Furthermore, the number of distinct bacterial strains recovered from metagenomes already matches the order of magnitude of isolate strains (100,000s). This massive and continual expansion in the size and number of metagenomics datasets is increasingly yielding metagenome-assembled genomes (MAGs). Thus, there is an urgent need for new tools and resources that enable large-scale genome comparisons, as no existing approach scales sufficiently to handle both large queries and large reference databases. In this proposal, we will extend the functionality of sourmash, a widely used tool for performing sequence comparison using MinHash approaches, and apply it to datasets in MGnify. To achieve the necessary scalability, we will speed up search through algorithmic optimisations (e.g. heuristics, caching), precalculation of sketches for reference databases and horizontal scaling across multiple compute nodes. Additional improvements will be achieved via implementations in Rust, the multi-paradigm programming language. Collectively, these will allow us to perform large-scale database comparisons that will enable: (1) Detection of contaminating contigs in MAGs and reference genomes; (2) Removal of redundancies between MAG collections by intra- and inter-dataset comparisons; (3) Assignment of taxonomy to MAGs by converting MinHash distances into evolutionary distances; (4) Profiling of short-read metagenomics datasets to detect novelty and permit matching to MAG collections. These tools will be made available to the community as standalone software packages and via new web interfaces in MGnify, which will provide unparalleled access to the MGnify MAG catalogues. These tools will also be applied to MGnify data to improve MAG quality and help prioritise datasets for analysis.
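
As an illustration of aim (3), the sketch below converts a MinHash-estimated Jaccard similarity into a distance that approximates the per-base mutation rate. The abstract does not say which conversion the project will use; the formula here is the published Mash distance, chosen only as a familiar example. The function name mash_distance and the choice to return 1.0 when nothing is shared are assumptions made for this example.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Convert a Jaccard similarity between k-mer sets into a Mash-style distance.

    Illustrative only: uses D = -(1/k) * ln(2j / (1 + j)), the Mash formula,
    which is not necessarily the conversion used in this project.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    if k <= 0:
        raise ValueError("k must be a positive k-mer length")
    if jaccard == 0.0:
        # No shared k-mers: report the maximum distance (assumed convention).
        return 1.0
    # 2j / (1 + j) is at most 1, so the logarithm is never positive.
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


if __name__ == "__main__":
    for j in (1.0, 0.9, 0.5, 0.1):
        print(f"jaccard={j:.2f}  distance={mash_distance(j, k=21):.4f}")
```

Identical sketches give a distance of zero, and the distance grows as fewer k-mers are shared, so thresholds on it can in principle be mapped to taxonomic ranks.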

Summary

Microbes are ubiquitous and perform essential roles that help sustain life on earth, for example environmental oxygenation, soil nutrient cycling to support plant growth, and facilitating animal digestion. They also cause many diseases in plants and animals and can rapidly evolve to exploit new niches and/or combat antimicrobials. A relatively new field, metagenomics is a culture-independent method that applies sophisticated DNA sequencing technologies to analyse the total microbial genetic material from any environment. It is now possible to reassemble the millions of short DNA sequences to produce representations of the microbial genomes in a sample, termed metagenome-assembled genomes (MAGs), especially for bacteria. While this approach remains computationally expensive, the computer algorithms used to recover these genomes have been substantially improved to increase the accuracy of MAGs. Just in the past five years, many large-scale studies, including our own, have successfully applied these techniques to cumulatively generate millions of MAGs. This has provided scientists with novel insights into the ~99% of organisms that are yet to be experimentally cultured and has dramatically expanded the Tree of Life. These MAGs are reshaping our understanding of microbial community structure and the functional capacities of constituent members. This explosion in MAG numbers nevertheless presents new challenges. These large-scale analyses can generate genomes at magnitudes that match GenBank's large genome collection, which is derived from traditional techniques of sequencing experimentally isolated microbes. Such genome collections have taken decades to build and are managed by large data centres. Yet there is now a need for groups to routinely perform comparisons between new MAG collections and such large reference genome collections.
We propose to use a particular class of algorithm called MinHash, which rapidly estimates the similarity between two sets based on the number of shared entities, in our case short sequences. Most implementations of this approach have focused on the rapid comparison of one genome to another. In this proposal, we aim to use a range of computational techniques to enable the comparison of a large query dataset to a large reference database, with a view to applying it to microbial genomes, MAG collections and metagenomic sequences. We will develop and apply this tool to a range of datasets, particularly those housed in MGnify, a leading database of metagenomic data. The key applications are the identification of errors in MAGs that were introduced by the computational methods, data reduction by identifying duplicate MAGs between datasets, the rapid incorporation of MAGs into catalogues of genomes that have been found in a particular environment, taxonomic classification of MAGs (by converting similarity distances to evolutionary distances), and the profiling of metagenome datasets to determine which genomes are likely to be found. The latter set of profiles will also enable the delineation of datasets that are poorly characterised by MAG/genome collections and their prioritisation for analysis (i.e. MAG generation). The outputs of this proposal are manifold. The first is a suite of software tools and associated workflows that can be installed and run on the computer command line. The application of the tool will lead to multiple new data outputs (refined MAGs, improved catalogues and metagenomic profiles), which will be made available via MGnify's web interfaces. To provide rapid access to these MAG catalogues, we will also deploy new web interfaces (implementing the new tools) that allow users to compare their own MAGs against established collections. This will not only democratise scientific research but also reduce the need for data duplication.
We will also use specific use cases to demonstrate the utility of our tools and provide training and support for their use.
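
To make the MinHash idea described above concrete, here is a minimal sketch of a bottom-k MinHash comparison between two sequences. It is a teaching example under stated assumptions, not the sourmash implementation: the function names (kmers, hash_kmer, sketch, estimate_jaccard), the BLAKE2b hash and the default k and sketch size are all chosen for illustration, and reverse-complement (canonical) k-mers are deliberately ignored.

```python
import hashlib


def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping k-mers (short sequences) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int = 21, size: int = 500) -> list:
    """Bottom-k sketch: keep only the `size` smallest k-mer hash values."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]


def estimate_jaccard(sketch_a: list, sketch_b: list, size: int = 500) -> float:
    """Estimate the Jaccard similarity of two k-mer sets from their sketches.

    Takes the `size` smallest hashes of the merged sketches and counts how
    many of them occur in both sketches.
    """
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union_bottom:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    hits = sum(1 for h in union_bottom if h in shared)
    return hits / len(union_bottom)


if __name__ == "__main__":
    a = sketch("ACGTACGT", k=3)
    b = sketch("ACGTTT", k=3)
    print(estimate_jaccard(a, b))
```

Because each sketch has a fixed maximum size regardless of genome length, sketches for a reference database can be computed once and stored, which is what makes precalculation and large query-versus-database comparisons practical.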
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Microbiology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative UK BBSRC-US NSF/BIO (NSFBIO) [2014]
Funding Scheme X – not Funded via a specific Funding Scheme