Award details

EBI Metagenomics Portal - Towards a better understanding of community metabolism

Reference BB/M011453/1
Principal Investigator / Supervisor Professor Tom Curtis
Co-Investigators / Co-Supervisors Professor Darren Wilkinson
Institution Newcastle University
Department Sch of Engineering
Funding type Research
Value (£) 251,060
Status Completed
Type Research Grant
Start date 01/04/2015
End date 10/05/2018
Duration 37 months

Abstract

EBI-MP is a global portal for the metagenomics research community. Offering data submission, archiving and sharing functions, community standards-compliant curation, and functional and taxonomic diversity analyses, the service has attracted a growing user-base of UK, European and global researchers. We intend to improve the pipeline infrastructure to offer analysis provenance by modularising pipeline components, defining a dependency tree between modules, and module versioning. Subsequently, we will perform updates to reference databases and analysis software, and enable our users to reanalyse their results with the updated pipeline. We will improve the range of taxonomic annotations provided by the resource, moving beyond 16S rRNA-based analyses. We will also investigate the application of the UniPept approach to taxonomic classification for metagenomic datasets. We will add pathway information to the functional annotation provided by EBI-MP, using the latest version of InterProScan to provide KEGG, MetaCyc and UniPathway links, and develop a tool to visualise the catalytic potential of a sample, highlighting reactions where there is support for the existence of constituent proteins. We will implement CRAM compressed sequence data formats within the system to increase the speed of upload of data to EBI-MP and to facilitate internal processing and storage. We will also design and build data discovery tools that provide a full range of search functions across the sample, contextual and analysis data, and provide these tools as web services and via the website. Finally, we will develop mathematically sound methods to estimate the depth of sequencing required to capture a specific fraction of diversity, and to normalise samples so that they can be compared in statistically meaningful ways. These analyses will be provided via the website, along with visualisation tools capable of producing heatmaps and PCA plots for sample comparison.
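
The abstract does not say which statistical methods will be used to estimate sequencing depth or to normalise samples. Purely as an illustration of the kind of calculation involved, the sketch below uses Good's coverage estimator (one minus the proportion of reads that belong to taxa seen exactly once) and simple relative-abundance scaling. Both are standard techniques substituted here for illustration and are not the project's own methods; the function names and example counts are hypothetical.

```python
def goods_coverage(counts):
    """Good's coverage estimate: 1 - (singleton taxa / total reads).

    `counts` holds the number of reads assigned to each observed taxon.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads")
    # Taxa observed exactly once suggest diversity not yet sampled.
    singletons = sum(1 for c in counts if c == 1)
    return 1.0 - singletons / total


def relative_abundance(counts):
    """Scale raw counts to proportions so samples of different depth are comparable."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads")
    return [c / total for c in counts]


# Hypothetical per-taxon read counts for one sample (not real data).
sample_counts = [10, 5, 1, 1, 3]
print(goods_coverage(sample_counts))
print(relative_abundance(sample_counts))
```

A coverage value close to 1 suggests that further sequencing would uncover few additional taxa. Proportional scaling is the simplest form of normalisation and ignores the compositional effects that more rigorous methods are designed to address.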

Summary

Metagenomics is a rapidly expanding field, where modern sequencing technologies are applied to the DNA isolated from an environment, such as soil, seawater or animal gut. This allows us to analyse the DNA from the collection of microorganisms that inhabit the environment, many of which may not have previously been sequenced due to their fastidious nature, making it difficult to culture them in the laboratory. Substantial amounts of sequencing data are produced, which typically only reflect a small fraction of the total DNA. However, after analysis, this DNA provides clues to the taxonomic diversity and functional potential that is found in that environment. Metagenomics is an exemplar of what is referred to as data-driven biology. The cost of DNA sequencing has fallen rapidly, such that most research groups have access to affordable institutional sequencing facilities. We now run the risk of each research group producing its own metagenomics analysis platform and storage repository. To help prevent such a fragmented situation, and to help reduce overall costs (both computational and staff), we have developed and implemented a centralised pipeline to provide a metagenomics data analysis and archiving platform. The EBI Metagenomics Portal (EBI-MP) has been developed as a collaborative UK effort at the EMBL-European Bioinformatics Institute (EMBL-EBI). The portal was launched in 2011 and has increasingly become established as a world leader in metagenomic analysis. Users submit their high-throughput metagenomic nucleotide sequence data, accompanied by contextual data describing their samples and experiments in a controlled and consistent manner. Once analysed by the portal, results including taxonomic and functional annotations can be visualised alongside project and sample descriptions, and can be downloaded for further analysis. 
In order to keep up with community needs, and to help cement the UK's position as a leader in the field of metagenomics research, the proposed project aims to further develop the portal. This will include the addition of analysis provenance, to ensure that analyses can be re-run in the future as new and updated tools and algorithms become available, without overwriting existing analysed data, which may form the basis of an existing publication. Further enhancements include extending the taxonomic and functional analyses to achieve a better picture of the microbial communities sampled, their composition and biological functions. The portal will also be developed to allow more sophisticated searches of its data, so that users can discover samples or environments based on the kind of microbe or protein function found there. We will also add the ability to perform statistically rigorous cross-sample comparisons that will allow analysis results from different samples to be compared in a scientifically meaningful way, and provide visualisation tools for such comparisons. With today's modern sequencing technologies producing ever more data, better data compression is essential to speed up data transfer both into EBI-MP and internally within the resource. To this end, we will implement industry-standard compression data structures that have been developed at EMBL-EBI.

Impact Summary

The use of metagenomics is widespread, with its application in diverse fields such as agriculture, food manufacture, the elucidation of antibiotic resistance mechanisms, bioenergy production, and animal/human health. The EBI Metagenomics Portal (EBI-MP) covers data submission, archiving and sharing functions, community standards-compliant curation, and rich functional and taxonomic diversity analyses. Launched in 2011, the resource has become a world leader in metagenomics data analysis, attracting a growing userbase across the UK, European and global communities. The impact on academic research is already in effect, with the EBI-MP providing both a robust analysis platform and access to a large compute resource. Both of these features are often lacking within academia. Thus, the EBI-MP is making metagenomics analysis available to more researchers, and relieving a significant bottleneck between obtaining sequence data and results. One vital impact of the project will be continued support for archiving and analysis of metagenomic data in the face of ever-increasing data volumes. The proposed work provides a number of mechanisms, including CRAM-based sequence compression and a tightly controlled way of updating analysis algorithms, by which the pipeline can be made more efficient, with higher throughput and the ability to scale. Improved sample analyses, through updated reference databases and extended taxonomic and functional analyses, are also critical, since they will increase the usefulness of EBI-MP to researchers and better meet the community's needs. These benefits will be felt in the short term, and will also persist into the longer term, as updates and improvements are made throughout the course of the project. In the medium term, these developments should allow the EBI-MP to grow with increasing demand, without significantly increasing the computational overhead.
This will be achieved by the incorporation of more efficient algorithms, thereby increasing throughput. Updating the reference databases will facilitate a more in-depth functional and taxonomic analysis, as more diverse organisms are represented in them. The infrastructural changes to the pipeline will also allow other tools to be more easily incorporated into the analysis platform, not only providing scientific exposure to the tool developer, but also enriching the analysis results. Our objective of improving data discoverability, by linking from other databases to the EBI-MP, will allow metagenomics results to reach a broad life science community, who may be unaware of the data. It is important to note that in this project we are also establishing new collaborations that cross scientific disciplines (EMBL-EBI, Newcastle University and OeRC). This should expose our own staff to novel approaches and scientific challenges. From these collaborations, we also aim to produce statistical protocols to provide additional confidence and information about the data. Cross-sample analyses will inevitably provide researchers with a significantly deeper understanding of complex communities. In the medium to longer term, the knowledge gained from understanding complex communities will have significant impacts for the UK. The impacts could range from the economic, such as more efficient industrial enzymes, to improved soil conditions providing greater crop yields, to healthcare solutions from comparing diseased and healthy states. One of the key areas will be the translation of metagenomics to industry. Through our industrial connections both at EMBL-EBI and Newcastle University, we will engage with this sector to establish its requirements.
To ensure our users are able to utilise the new features, we will provide online training material, publish in scientific and non-scientific literature, attend meetings and conferences aimed at a range of audiences, and run training workshops, to maximise dissemination into the academic, industrial and third-party communities.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Microbiology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme