Award details

The SPRINT approach to network biology

Reference BB/J019283/1
Principal Investigator / Supervisor Professor Peter Ghazal
Co-Investigators / Co-Supervisors Mr Terence Sloan
Institution University of Edinburgh
Department Sch of Biomedical Sciences
Funding type Research
Value (£) 257,288
Status Completed
Type Research Grant
Start date 01/10/2012
End date 30/09/2014
Duration 24 months

Abstract

SPRINT is an R package that allows easy access to HPC for the analysis of high-throughput "omics" data using the statistical programming language R. SPRINT provides functional interfaces that are as close as possible to existing R user interfaces, so that biologists can obtain maximum performance from HPC platforms with minimal changes to existing analysis workflows and without requiring specialist knowledge of HPC. This project aims to promote the accessibility and usability of SPRINT. This will be achieved by working on four different levels: functionality, access, availability and dissemination. New functions will be added to deliver a critical step-change in the functionality needed for machine learning algorithms: the distance function, which is core to many clustering algorithms; the Hamming distance, for use with string data such as that produced by genotyping or next generation sequencing; and an optimised version of a standard function, the Robust Multi-Array Average expression measure (RMA). Better access will be offered by providing ready-to-use SPRINT installations on the national supercomputing service HECToR and on the Amazon EC2 cloud. Direct assistance with installation and user training will also be offered to research groups wanting to exploit their local HPC facilities. Availability of SPRINT will be improved on Linux/UNIX and non-Linux/UNIX platforms. SPRINT currently relies on the MPICH2 MPI library; we will adapt the software to ensure it can also run on other implementations of MPI-2, in particular OpenMPI. SPRINT has been successfully installed on the Apple Mac architecture. We will improve and fully port the software to the Mac architecture. We will also investigate the installation and porting of SPRINT onto Windows platforms. Extensive efforts will be made to promote the use of SPRINT through an active programme of dissemination, including hands-on workshops across the UK to train users and advise research groups on installing and using SPRINT.
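To make the Hamming distance mentioned above concrete, the following is a minimal serial sketch in Python. It is not SPRINT code and does not reflect SPRINT's R interface or its MPI implementation; the function names `hamming` and `pairwise_hamming` and the sample strings are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(seqs: list[str]) -> list[list[int]]:
    """Return the symmetric matrix of Hamming distances between all pairs."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    # Each unordered pair is compared once and mirrored across the diagonal.
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming(seqs[i], seqs[j])
    return dist


# Hypothetical genotype strings, invented for illustration only.
genotypes = ["AACGT", "AACGA", "TACGA"]
matrix = pairwise_hamming(genotypes)
```

For n strings the loop performs n(n-1)/2 comparisons, none of which depends on another. That independence is what makes a distance matrix a natural candidate for distribution across many processors as datasets grow.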

Summary

The aim of this project is to promote the accessibility and usability of the Simple Parallel R INTerface (SPRINT). SPRINT is an innovative parallel tool kit for performing computationally challenging analysis workflows on post-genomic data using High Performance Computing (HPC). Specifically, we propose to improve support for manipulating very large datasets, such as those from next generation sequencing, and to add the functionality required to implement machine learning approaches, such as clustering, for analysing complex, data-dependent experiments such as time series. SPRINT and R are open source tools, free for all to use. SPRINT does not require any expert knowledge of HPC or parallel programming. It allows R users to run their analysis workflows on any HPC platform with minimal alterations to their existing scripts while still obtaining maximum performance. SPRINT is implemented using the standard parallel programming tools C and MPI. It has two main components: an intelligent HPC harness that manages all aspects of accessing and working with HPC, and a library of parallel R functions. SPRINT is fully scalable and portable. It is designed to use any number of nodes, from two to thousands, and runs on any HPC platform, from multi-core desktop to server, local cluster, supercomputer or cloud. SPRINT can tackle very large datasets, including datasets larger than the internal memory of the computer. SPRINT is flexible: further functions can be added to its library, and it is open to external contributions from the research community. The functions currently included in SPRINT have been selected for their importance in the analysis of highly parallel, high-throughput post-genomic data, and network biology in general, and through prioritisation by R users. The objective of improving the accessibility and usability of SPRINT will be achieved by working on four different levels: functionality, access, availability and dissemination.
- New functionality will be added to support machine learning approaches and next generation sequencing.
- Central, ready-to-use SPRINT installations, accessible to all, will be made on public HPC resources: HECToR, the UK supercomputing service, and the Amazon EC2 cloud service. Help with local installation will also be provided.
- Technical improvements will be made to allow SPRINT to run on different types of computers, such as UNIX/Linux, Apple Mac or Windows platforms.
- The use and usability of SPRINT will be promoted through an active programme of dissemination. This will be achieved through hands-on workshops across the UK to train users, advise research groups and promote SPRINT use.

SPRINT is the first application to recognise that parallelisation and HPC access for biologists are the key next step in supporting high quality bioinformatics resources that can respond to next generation sequencing and other high-throughput technologies. SPRINT stands out as the only tool that combines ease of use, the ability to perform complex statistical analyses including data-dependent problems, the capacity to tackle very large datasets, even those larger than the physical memory of the computer, and the ability to run on any HPC platform, including cloud, with good scalability. This project will benefit the biological community, in both academia and industry, by providing a tool kit that enables researchers to exploit HPC and to perform currently intractable analyses on large high-throughput genomic data. The wider R community will also benefit from these technology developments, because the SPRINT functionality can be used in generic statistical analyses.

Impact Summary

The SPRINT framework is an R package which aims to overcome limitations on data size and analysis time by providing easy access to High Performance Computing (HPC). The statistical programming language and environment R is commonly used by both industry and academia and is becoming the lingua franca of statistical computing. R has been extended with a large number of problem-specific packages; it is distributed free under a GNU General Public License and is available from the Comprehensive R Archive Network (CRAN). The main beneficiaries of these technology developments are bioscience researchers, in both industry and academia, who use R to process data from high-throughput, highly parallel "omics" experiments. These are now essential tools of biological research. Technologies such as microarrays and next generation sequencing are becoming routinely used in life science laboratories for many applications, such as the discovery and validation of new drug targets, or fundamental research, using systems and network biology approaches, into the complex nature of and relationships between the various levels of an organism, to further our understanding of the healthy system. These technologies generate unprecedented amounts of data, which are increasingly difficult to store and process because appropriate tools are lacking. The requirements for the analysis and interpretation of such data are also particularly sophisticated and specialised. It is of crucial importance that adequate resources are provided to the community to fully analyse and extract biological knowledge from these data. Failure to do so will reduce possible economic and societal benefits and so limit the practical and applied advances of genomics. SPRINT provides the biological community with a tool kit enabling them to exploit HPC to perform currently intractable analyses on large high-throughput "omics" experiments.
SPRINT removes current bottlenecks in the analysis of very large datasets such as those from time series and next generation sequencing experiments. It allows the processing of datasets previously too large to tackle, and enables analyses that previously took too long to perform or relied on algorithms that were computationally too demanding. In particular, the proposed developments aim to provide support for machine learning algorithms. SPRINT is user friendly and aimed at biological scientists, not HPC experts. However, SPRINT is designed by HPC experts, giving the software an industrial quality not usually found in open source software. The scalability of SPRINT offers the added potential of future-proofing analysis workflows against increasing data size. SPRINT is platform independent and can be used on multi-core desktops, local clusters, servers, supercomputers or in the cloud. Moreover, the R community across all disciplines, and the scientific community at large, will also benefit from this technology development, because the analysis methods considered here are generic and can be applied to a wide variety of areas.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme