Award details

19-BBSRC-NSF/BIO genomeRxiv: a microbial whole-genome database & diagnostic marker design resource for classification, identification & data sharing

Reference BB/V010417/1
Principal Investigator / Supervisor Dr Leighton Pritchard
Co-Investigators / Co-Supervisors
Institution University of Strathclyde
Department Inst of Pharmacy and Biomedical Sci
Funding type Research
Value (£) 354,186
Status Current
Type Research Grant
Start date 08/03/2021
End date 07/03/2024
Duration 36 months

Abstract

We will extend and enhance the capabilities of LINbase to produce the genomeRxiv web server, providing:
1. Greatly increased capacity and functionality for genome classification and identification.
2. Novel capabilities, e.g. users may instantly and easily obtain and share the precise identity of newly sequenced genomes without revealing the genome sequence, even to genomeRxiv, maintaining confidentiality for commercially or otherwise sensitive organisms while retaining findability.

LINbase circumscribes groups of organisms by assigning Life Identification Numbers (LINs) to genome sequences in the database. LINs express genome similarity based on average nucleotide identity (ANI), providing a neutral genome similarity framework (conceptually similar to GPS coordinates) independent of taxonomic rank, to which users can "pin" circumscriptions of any named species or any other monophyletic genome-similarity group (from now on simply referred to as "group") below the rank of genus. These circumscriptions permit precise identification by placing newly sequenced genomes within them.

We will maximise database utility by making improvements in capacity, precision, and functionality to turn it into genomeRxiv:
1. Increase the number of genome sequences from approximately 8,000 to all prokaryotic genomes in NCBI's GenBank and JGI's Integrated Microbial Genomes (IMG) System (almost 500,000) and automatically import new genomes as they are released.
2. Maximise precision of classification and identification by pushing the resolution of LINs towards outbreak-level resolution.
3. Automatically classify bacteria based on validly published named species, genome phylogeny-based species clusters, and genome similarity-based clusters (cliques).
4. Automate diagnostic marker design specific to genomeRxiv classifications.
5. Increase speed of genome identification, and number of simultaneous users.
6. Improve the user interface.
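The LIN idea described in the abstract can be illustrated with a minimal Python sketch. It assumes that each LIN position corresponds to one ANI threshold, that a new genome inherits the LIN of its most similar database genome up to the last threshold its ANI meets, and that it receives the next unused number at the following position. The threshold values, the function names `assign_lin` and `in_group`, and the tuple representation are hypothetical choices made for illustration; they are not taken from LINbase or genomeRxiv.

```python
from typing import Iterable, Sequence, Tuple

# Illustrative ANI thresholds (percent), one per LIN position.
# These values are assumptions for the example, not the LINbase scheme.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0, 99.9]


def assign_lin(
    best_match_lin: Sequence[int],
    ani: float,
    existing_lins: Iterable[Sequence[int]],
) -> Tuple[int, ...]:
    """Derive a LIN for a new genome from its closest match and their ANI."""
    # Number of leading positions whose threshold the ANI meets.
    shared = sum(1 for t in THRESHOLDS if ani >= t)
    if shared == len(THRESHOLDS):
        # Indistinguishable at the resolution of this toy scheme.
        return tuple(best_match_lin)
    prefix = tuple(best_match_lin[:shared])
    # Numbers already used at the first differing position by LINs
    # that share the same prefix.
    used = [lin[shared] for lin in existing_lins if tuple(lin[:shared]) == prefix]
    new_value = max(used) + 1 if used else 0
    # Remaining positions are filled with zeros.
    return prefix + (new_value,) + (0,) * (len(THRESHOLDS) - shared - 1)


def in_group(lin: Sequence[int], group_prefix: Sequence[int]) -> bool:
    """A group is modelled here as a LIN prefix; membership is a prefix match."""
    return tuple(lin[: len(group_prefix)]) == tuple(group_prefix)


existing = [(0, 0, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0)]
new_lin = assign_lin(existing[0], 96.5, existing)
print(new_lin, in_group(new_lin, (0, 0, 0, 0)))
```

Modelling a group as a LIN prefix is what makes "pinning" a circumscription cheap in this sketch: identifying a new genome reduces to one prefix comparison once its LIN is known.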

Summary

Precise identification of microorganisms that impact on society and the environment is a prerequisite for maintaining a healthy society and a healthy environment and for combating diseases, in addition to providing a sound empirical core for understanding microbiology. The DNA sequencing revolution has created the opportunity to use genome sequences of cultured and uncultured microorganisms for fast and precise identification. However, precise identification is impossible without reference databases that precisely circumscribe classes of microorganisms with their unique characteristics, and rapid identification is impossible without fast algorithms that can handle the deluge of genomes being sequenced. Therefore, we will enhance our current web server to develop genomeRxiv, which will provide a database of hundreds of thousands of accurately catalogued and classified public genome sequences, supplying the basic and applied research community with precise and accurate identification of unknown isolates based on their genome sequences alone. A unique new feature will give the academic, industrial, and government communities the ability to identify and announce sequenced genomes without having to share the sequences themselves, providing confidentiality for commercially and ethically sensitive organisms (e.g. production engineering strains, potential bioweapons, and benefit sharing with indigenous communities). genomeRxiv will also enable practical application of its classification scheme by providing the capability to design molecular diagnostic tools to detect specific groupings of bacteria, including high-impact microorganisms, directly in the environment. We are uniquely placed to develop genomeRxiv by leveraging the computational tools and platforms that we have already developed and by integrating them into the new web server.
We will combine the highly resolved classification framework of Life Identification Numbers (PIs Vinatzer and Heath), the speed and computational efficiency of sourmash (PI Brown), and the precision and filtering of pyani (PI Pritchard), with the collaborative crowdsourcing framework of the LINbase web server (PIs Vinatzer and Heath).
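The award text does not say how genomes will be compared without sharing their sequences. One plausible route, given that sourmash builds on MinHash-style k-mer sketches, is to exchange only a compact set of k-mer hashes. The Python sketch below illustrates that general idea under stated assumptions: it does not use the sourmash API, SHA-256 stands in for the hash function, reverse-complement handling is omitted, and the names `kmer_hashes`, `sketch`, and `jaccard` and the parameters `k` and `scaled` are hypothetical.

```python
import hashlib
from typing import Set

MAX_HASH = 2**64  # hashes are truncated to 64 bits below


def kmer_hashes(seq: str, k: int = 5) -> Set[int]:
    """Hash every k-mer of a sequence to a 64-bit integer."""
    seq = seq.upper()
    return {
        int.from_bytes(hashlib.sha256(seq[i : i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def sketch(seq: str, k: int = 5, scaled: int = 2) -> Set[int]:
    """Keep only hashes in the lowest 1/scaled of the hash space.

    Only this subset of hashes would be shared, not the sequence itself.
    """
    limit = MAX_HASH // scaled
    return {h for h in kmer_hashes(seq, k) if h < limit}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two sketches (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


genome_a = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
genome_b = "ACGTACGTTGCAACGTAGCTAGCTAGGATCGA"  # one substitution
print(jaccard(sketch(genome_a, scaled=1), sketch(genome_b, scaled=1)))
```

Because every sketch keeps the same fraction of the hash space, two parties can compare sketches computed independently. Real genomes would use much longer k-mers and a much larger `scaled` value than this toy example.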
Committee Not funded via Committee
Research Topics Microbiology
Research Priority X – Research Priority information not available
Research Initiative UK BBSRC-US NSF/BIO (NSFBIO) [2014]
Funding Scheme X – not Funded via a specific Funding Scheme