Award details

Bilateral NSF/BIO-BBSRC: A Metagenomics Exchange - enriching analysis by synergistic harmonisation of MG-RAST and the EBI Metagenomics Portal

Reference BB/N018354/1
Principal Investigator / Supervisor Dr Robert Finn
Co-Investigators / Co-Supervisors Dr Guy Cochrane
Institution EMBL - European Bioinformatics Institute
Department Sequence Database Group
Funding type Research
Value (£) 817,225
Status Completed
Type Research Grant
Start date 03/07/2017
End date 02/03/2021
Duration 44 months

Abstract

Metagenomics is a widely used approach to investigate the composition of microbial communities. With the development of modern sequencing platforms, the bottleneck is rarely sequence data generation but rather its analysis. MG-RAST (MGR) and EBI Metagenomics (EMG) are the two world-leading metagenomics analysis platforms. These analysis platforms employ distinct, yet complementary, approaches for the functional characterisation of metagenomic sequences. However, their pipelines closely align in the early stages of analysis, such as quality control. Unlike for other datatypes, there is no mandate for researchers to submit metagenomics data to an analysis platform. Furthermore, resources such as MGR are not linked to an INSDC member, such as the European Nucleotide Archive (ENA). Currently, metagenomics sequence data, associated contextual metadata and derived functional and taxonomic assignments are disjointed within the field. Consequently, it is virtually impossible to navigate these cumbersome datasets. We propose to solve this problem by developing a 'Metagenomics Exchange' (ME), which builds upon ENA technologies, to provide a registry of metagenomics datasets. MGR and EMG will use this registry to discover new datasets and publish their derived annotations, using tools and RESTful APIs to push information to and pull information from the registry. With the ME in place, we will populate it with existing datasets, developing the tools necessary to identify equivalent datasets. MGR and EMG will standardise on common analysis components and utilise the ME to enable crosstalk between pipelines, reducing computational overhead. The two teams will also exchange technical knowledge, such as data storage solutions and pipeline containerisation. The websites will be harmonised to seamlessly present federated analysis results from both platforms, thereby enriching interpretation. We will investigate optimal pipeline solutions that may pave the way for a unified pipeline.
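
As an illustration only, the registry idea described in the abstract could be sketched as below. This is not the ME implementation: the class and function names, the in-memory storage, the placeholder accessions, and the rule that two datasets are equivalent when their sets of sequences produce the same checksum are all assumptions made for this example. A real service would sit behind RESTful endpoints and build on ENA technologies.

```python
import hashlib
from collections import defaultdict


def dataset_key(sequences):
    """Order-independent, case-insensitive checksum of a dataset.

    Assumed equivalence rule for this sketch: two datasets are equivalent
    when they contain the same sequences, whatever the order or letter case.
    """
    digest = hashlib.sha256()
    for seq in sorted(s.upper() for s in sequences):
        digest.update(seq.encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


class ExchangeRegistry:
    """Hypothetical in-memory stand-in for a registry of datasets."""

    def __init__(self):
        # checksum -> {platform name: that platform's local accession}
        self._records = defaultdict(dict)

    def push(self, platform, accession, sequences):
        """Register a platform's copy of a dataset; return its registry key."""
        key = dataset_key(sequences)
        self._records[key][platform] = accession
        return key

    def pull(self, key):
        """Return every platform accession recorded for a registry key."""
        return dict(self._records.get(key, {}))

    def equivalents(self):
        """Return the keys that more than one platform has registered."""
        return {k: dict(v) for k, v in self._records.items() if len(v) > 1}


# Usage with made-up placeholder accessions and toy sequences.
registry = ExchangeRegistry()
key_emg = registry.push("EMG", "emg-placeholder-1", ["ACGT", "ggca"])
key_mgr = registry.push("MGR", "mgr-placeholder-1", ["GGCA", "ACGT"])
registry.push("MGR", "mgr-placeholder-2", ["TTTT"])

linked = registry.pull(key_emg)
shared = registry.equivalents()
```

Here both platforms hold the same two sequences, so they resolve to one registry key and either platform can discover the other's accession through `pull`. The third dataset is known to only one platform and therefore does not appear in `equivalents`.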

Summary

Micro-organisms are found in virtually all environments. Typically, they form the base of the food chain (such as plankton in the sea) and play essential roles in their ecosystems. There is often a complex interplay between different micro-organisms, with some organisms requiring that others be present in order for them to exist. An imbalance within a community can have severe effects, such as disease in the human gut or the inability of plants to grow efficiently in soil. An understanding of the composition and interplay within the communities allows us to potentially manipulate them. Thus, there is intense research into micro-organism communities in many different fields, such as improving livestock yields, the recovery from bacterial infections using fecal transplants and the efficient production of biofuels. Many of these communities also contain important proteins that could be useful to the biotechnological and pharmaceutical industries, such as enzymes involved in the production of antibiotics. Metagenomics is the study of these different micro-organism communities, which is achieved by isolating the DNA from the organisms within an environmental sample (e.g. water, soil, animal stool), sequencing the DNA, and then analysing it computationally to decode which organisms are present and the functions they might be performing. This computation is complicated: (1) there is a huge amount of data; (2) the sequence data is a jumbled mix of fragments from different organisms; (3) decoding the DNA is hard, as typically >90% of organisms within a sample are not well characterised. This proposal brings together three major resources within the field of metagenomics data archiving and analysis. The European Nucleotide Archive (ENA) is a repository of DNA sequence data. Importantly, ENA also captures metagenomic contextual data, such as where and when the sample was taken and how the DNA was extracted and sequenced.
The EBI Metagenomics portal (EMG, UK) and MG-RAST (MGR, US) are two metagenomics sequence analysis platforms. They are the only free-to-use services to which researchers can upload sequence data and have it analysed without restriction. Despite the widespread use of metagenomics, the community currently lacks standards to ensure that metagenomics sequence data and the derived functional and taxonomic information are deposited within a database of record. Consequently, navigating between metagenomics datasets is very difficult, even for experienced users. As EMG and MGR offer slightly different, yet complementary, analysis services, there is often the desire to have a metagenomics dataset analysed by both resources. However, the number of equivalent datasets between the two resources is unknown. Unless a user has prior knowledge about equivalent projects, they remain disconnected. Also, sequence data submitted to MGR may not necessarily be deposited in ENA. We propose to set up a computational framework, termed Metagenomics Exchange (ME), to enable metagenomics datasets and the results of their analysis to be linked. All sequences will become available to the research community via ENA and analysis results will be automatically exchanged between EMG and MGR. The ME will be implemented to enable other metagenomics analysis providers to join, and so that it can be used by researchers wishing to perform large scale analyses. We will also investigate ways that our own pipelines can be enhanced through the use of the ME, for example by sharing software and processing tasks. This will lead to computational savings, increasing the capacity for metagenomics analysis. We will also establish a knowledge transfer forum, enabling the exchange of ideas on a range of topics, from hardware solutions to algorithms.
Finally, we will undertake a research program to investigate the optimal combination of pipeline analysis components, and whether a single, unified analysis pipeline could be engineered.

Impact Summary

The use of metagenomics is widespread, with applications in diverse fields, e.g. agriculture, food manufacture, the elucidation of both antibiotic products and antibiotic resistance mechanisms, bioenergy, crop yields and animal/human health. Consequently, metagenomics data continues to grow exponentially, with ever increasing demands on community analysis services. As yet, the field lacks systematic co-ordination and organisation of sequence data and derived functional and taxonomic information. We propose to solve this through the development of the Metagenomics Exchange (ME), which will primarily address the key area of data driven bioscience, but also have significant influences on many of the strategic priorities for the BBSRC and NSF. The impact of both the EBI Metagenomics (EMG) and MG-RAST (MGR) analysis platforms on academic research is already evident. Both provide robust, specialised analyses and access to significant amounts of compute (~55 million CPU hours/year). The ME will catalogue information about different metagenomic datasets and their analyses, enabling users from both academic and industrial sectors to rapidly discover them. Moreover, EMG and MGR will collect and present results from each other's platform, ensuring that a user is presented with all available analyses (saving user time/effort). To reduce duplication and to minimise differences, EMG and MGR will standardise on common parts of their pipelines. This will improve consistency and, as the project matures, allow crosstalk between the analysis pipelines. Crosstalk will also reduce computational overhead, allowing greater throughput for the community. The EMG and MGR websites collectively have hundreds of thousands of individual visitors per year. Steps to harmonise the websites will improve user experience for both new and existing users.
Our objective in improving data discoverability via the ME is to allow metagenomics results to reach a broader life science community, where individuals may otherwise be unaware of the data. It is also important to note that this project establishes a new collaboration, enabling MGR and EMG to become more aligned. Knowledge transfer between the groups will expand both UK and US skills in high throughput bioinformatics analysis. The staff employed on this grant will receive hands-on training from members of the Finn, Cochrane and Meyer teams. All the institutes have excellent training schemes and career development courses and the staff will be working in world class laboratories of internationally renowned scientists. They will have opportunities to present their work within the groups, between the groups and at international conferences. Both technical developments and research findings will be presented at conferences and published in peer-reviewed journals. Information about all the resources, especially the new ME, will be disseminated to the community via peer-reviewed journals, conference presentations, a specialist workshop, and online training materials. We will also engage with the non-specialist and public domains via non-scientific literature, social media (blogs and tweets) and by attending meetings aimed at a range of audiences. These activities will maximise dissemination into the academic, industrial and third-party communities. MGR and EMG will leverage their links to the industrial sectors to ensure that these sectors' needs are met. Indeed, the biotechnology industry may benefit the most from the implementation of the ME, as it is frequently engaged in identifying catalytic activities across multiple datasets. The ME will enhance the translation of metagenomics research to industrial applications.
In the longer term, the knowledge gained from understanding complex communities will have significant impacts on the UK, US and world economies, from more efficient industrial enzymes, through improved soil conditions and crop yields, to healthcare solutions derived from comparing diseased and healthy states.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative X – not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme