Award details

A computational cloud framework for the study of gene families

Reference BB/N023145/1
Principal Investigator / Supervisor Professor Anthony Hall
Co-Investigators / Co-Supervisors Professor Alistair Darby, Dr Ritesh Krishna
Institution Earlham Institute
Department Research Faculty
Funding type Research
Value (£) 147,326
Status Completed
Type Research Grant
Start date 01/03/2017
End date 31/08/2018
Duration 18 months

Abstract

Genomics is moving from the study of single reference genomes to the study of multiple genomes from the same species, allowing us to uncover the pan-genome. The pan-genome, or supra-genome, comprises a core set of genes common to all strains and a non-core set found in only a subset of strains or in a single strain. Non-core genes commonly appear as an expansion of an existing gene family. Studying these gene families is important because each family can be understood in terms of its function and its evolution across strains. For example, NB-LRR genes, or resistance genes (R-genes), in plants are responsible for disease resistance and show expansion and contraction across accessions. Critically, new R-genes are potentially important targets for plant breeders. As the field moves towards the study of pan-genomes using next-generation sequencing techniques, there is a timely need for appropriate data analysis software. Given the large size of the datasets, it is preferable that the software runs on community-accessible cloud resources. We propose to build an open-source, cloud-enabled software toolkit to analyze gene-family datasets. We propose to use the BBSRC-funded iPlant-UK compute infrastructure as the cloud platform. iPlant-UK is maintained by a dedicated team of experts and offers large compute resources with easy-to-use graphical interfaces for bioinformaticians and bench biologists. Further, we plan to apply the software in two case studies: 1) extracting the pan-NB-LRRome for bread wheat, and 2) studying gene families in the tsetse fly in the context of its role as a disease vector.
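
The core and non-core distinction described in the abstract can be illustrated with a minimal sketch. This is not the project's toolkit: the function name `partition_pan_genome`, the strain labels and the family identifiers are all hypothetical, and the sketch assumes that each strain's genes have already been assigned to gene families.

```python
from collections import Counter


def partition_pan_genome(families_by_strain):
    """Split gene families into core, non-core and strain-specific sets.

    families_by_strain maps a strain name to the collection of gene-family
    identifiers found in that strain (hypothetical, pre-computed input).
    """
    n_strains = len(families_by_strain)

    # Count in how many strains each family occurs (once per strain).
    counts = Counter(
        family
        for families in families_by_strain.values()
        for family in set(families)
    )

    pan = set(counts)
    core = {f for f, n in counts.items() if n == n_strains}
    strain_specific = {f for f, n in counts.items() if n == 1}

    return {
        "pan": pan,                      # every family seen in any strain
        "core": core,                    # families present in all strains
        "non_core": pan - core,          # families missing from some strain
        "strain_specific": strain_specific,  # families seen in one strain only
    }


# Toy input with made-up strain and family names.
example = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2", "fam3", "fam5"},
}
result = partition_pan_genome(example)
```

In this toy input, `fam1` and `fam2` occur in every strain and so form the core set, while `fam3`, `fam4` and `fam5` are non-core. Real analyses would also need a step that assigns genes to families, which this sketch leaves out.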

Summary

Life science research is increasingly becoming a data-intensive discipline. New high-throughput sequencing technologies produce vast amounts of digital data that need to be analyzed efficiently to discover interesting patterns and make new biological discoveries. The large volume of data creates a problem of its own, as storing and analyzing it requires large computing resources and sophisticated computing skills. Many biology labs struggle to own and maintain large computing clusters. Cloud computing frameworks have emerged as a feasible alternative, providing large computing power on a pay-as-you-go model, and are increasingly making inroads into mainstream biological data analysis. iPlant-UK is a cloud initiative funded by BBSRC to make large computing resources available free of charge to UK researchers. The iPlant-UK cloud is specifically tailored to the computing requirements of the life sciences community and provides access to large computing infrastructure through a web browser. Through this proposal, we want to develop a computational toolkit for the analysis of gene family datasets. An example of a gene family in plants is the R genes, also known as resistance genes, which are responsible for pathogen recognition and disease resistance responses. To understand a gene family in a species, one must first catalogue all members of the family and then understand their function with respect to each other, as well as to related species. These datasets are generated through next-generation sequencing techniques and are usually large in volume. We aim to develop specialized software for the analysis of gene family datasets on the iPlant-UK compute cloud. This way, we can give researchers access to a specialized tool on a large and free computing resource.
Further, we want to simplify the use of the toolkit by providing graphical user interfaces accessible through web browsers, so that wet-lab biologists can focus on their core research rather than worry about complex computation on a cloud platform. The code developed in this project will be publicly available free of charge under an open-source license. By building the workflow in iPlant, we will ensure its sustainability and visibility beyond this proposal.

Impact Summary

The principal beneficiaries of this grant are research scientists in academia and industry engaged in the study of gene families in various species. Complete cataloguing and functional understanding of gene families have many important applications in academia and industry, particularly in pharmaceutical and agricultural settings. The availability of the toolkit on a publicly accessible cloud will increase its usability by the global research community, while tackling many hurdles posed by Big Data analysis. The toolkit includes various components that can be used independently of this proposal by researchers analyzing next-generation sequencing datasets. All the components will be hosted in a high-performance computing environment, making them attractive because of greatly reduced execution times. The proposed toolkit includes analysis pipelines that are fully traceable, supporting the sharing and reproducibility of results; this will benefit collaborators and reviewers. The iPlant cloud used in this proposal is free for researchers and is maintained by a dedicated team, resulting in substantial cost savings for research institutions. This proposal enables the sharing of tools and execution platforms, in addition to the standard sharing of data, meeting an important goal of funding bodies.
Committee Research Committee A (Animal disease, health and welfare)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme