Award details

Comparative methods for second generation sequence analysis

Reference BB/K004344/1
Principal Investigator / Supervisor Dr Andrew Meade
Co-Investigators / Co-Supervisors Professor Mark Pagel
Institution University of Reading
Department Sch of Biological Sciences
Funding type Research
Value (£) 101,596
Status Completed
Type Research Grant
Start date 01/04/2013
End date 31/03/2014
Duration 12 months

Abstract

Comparative methods are computationally intensive, owing to the growth of large biological data sets and the complexity of the underlying statistical models. This is creating a gap between the data and models biologists would like to analyse and the computing resources available to them. High performance computers can be used to bridge this gap but they are unrealistic for many biologists, due to their expense, rarity, running costs and technical requirements. General Purpose computing on Graphics Processing Units (GPGPU) has the potential to solve these problems, as the hardware is cheap, ubiquitous, has low technical requirements and requires a 20th of the power of traditional computing. Converting programs to run on GPGPU requires a significant rewrite to take advantage of the hardware's massively parallel nature. This project will convert a comparative methods package, BayesTraits, for GPGPU use. The majority of the program run time (>99%) is concentrated in the likelihood function, which calculates the probability of observing the comparative data given a phylogeny and model parameters. The likelihood function is based on a phylogenetic generalised least squares (GLS) calculation for continuous data and a continuous time Markov model for discrete data. These are fundamentally different calculations: the GLS method is dominated by matrix operations (inversions, powers and multiplications), while the continuous time Markov model is a mix of matrix powers and a pruning function to collapse the likelihood through the phylogeny. The OpenCL framework will be used for development as it offers a hardware independent programming environment, with a high degree of portability. BayesTraits is a general purpose comparative methods package and must deal effectively with a wide range of data sets, so tuning the program to work well across varied data sets and complex models will be important.
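
The discrete-data likelihood described above can be illustrated with a minimal sketch. It is not BayesTraits code: it assumes a two-state trait, uses the closed-form transition probabilities of a two-state continuous time Markov chain in place of a general matrix power, and all names (Node, transition_matrix, partial_likelihoods, tree_likelihood) and the rates a and b are hypothetical. The recursion follows the standard pruning idea of collapsing partial likelihoods from the tips towards the root.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A tree node; branch_length is the branch leading to this node."""
    branch_length: float = 0.0
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def transition_matrix(a: float, b: float, t: float) -> List[List[float]]:
    """P(t) for a two-state chain with rates 0->1 = a and 1->0 = b."""
    total = a + b
    pi0, pi1 = b / total, a / total  # stationary frequencies
    e = math.exp(-total * t)
    return [[pi0 + pi1 * e, pi1 * (1.0 - e)],
            [pi0 * (1.0 - e), pi1 + pi0 * e]]


def partial_likelihoods(node: Node, tip_states: Dict[str, int],
                        a: float, b: float) -> List[float]:
    """Likelihood of the data below `node`, for each state at `node`."""
    if not node.children:
        observed = tip_states[node.name]
        return [1.0 if s == observed else 0.0 for s in (0, 1)]
    partial = [1.0, 1.0]
    for child in node.children:
        below = partial_likelihoods(child, tip_states, a, b)
        p = transition_matrix(a, b, child.branch_length)
        for s in (0, 1):
            # Sum over the child's possible states, then multiply across children.
            partial[s] *= sum(p[s][k] * below[k] for k in (0, 1))
    return partial


def tree_likelihood(root: Node, tip_states: Dict[str, int],
                    a: float, b: float) -> float:
    """Likelihood of the tip data, assuming stationary root frequencies."""
    total = a + b
    root_prior = [b / total, a / total]
    partial = partial_likelihoods(root, tip_states, a, b)
    return sum(root_prior[s] * partial[s] for s in (0, 1))


# Illustrative tree ((mouse:0.1, rat:0.1):0.4, human:0.5); branch lengths are made up.
tree = Node(children=[
    Node(0.4, children=[Node(0.1, "mouse"), Node(0.1, "rat")]),
    Node(0.5, "human"),
])
print(tree_likelihood(tree, {"mouse": 1, "rat": 1, "human": 0}, a=0.5, b=0.5))
```

Every internal node repeats the same small matrix-vector products, and a real analysis evaluates this function across many sites, trees and parameter proposals. That regular, repeated arithmetic is the kind of work that maps onto parallel hardware.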

Summary

Biological data from more than one species must be analysed in an evolutionary context, taking into account the species' evolutionary histories, as the data are not independent. For example, a trait found in mice will have a higher probability of being found in rats than in humans, as mice share a more recent common ancestor with rats than with humans. If the evolutionary histories, known as phylogenies, are not accounted for, an incorrect result can be found. Analysing data in an evolutionary context is called comparative methods. Current DNA sequencing technology creates very large data sets, both in terms of the number of species and types of data. Comparative methods use computationally complex mathematical models to combine the phylogeny with the data of interest, and more complex mathematical models are being developed. The increase in the volume of data and complexity of the models is creating a gap between the ideas biologists would like to test and the computational power needed to perform the analysis. A single analysis can take weeks or even months on a desktop computer, and this is currently a rate limiting step in biological research. Supercomputers can be used to solve these issues but are expensive to buy and run, are rare and complex, and require a large amount of technical knowledge to use. Supercomputers also require large amounts of electricity to power and cool them. The hardware used to play computer games, found in PCs and games consoles, has the potential to offer a solution to this problem. The vast computing power needed to generate 3D images can now be applied to solve other problems. A recent study, analysing medical data, showed how a PC with a number of graphics cards, costing $5300, could outperform a $4.6 million supercomputer. This project aims to vastly accelerate comparative methods analysis by using graphics hardware. A popular comparative methods package, BayesTraits, will be converted to use a range of graphics hardware.
While graphics hardware has a large amount of computing power, it can be hard to utilise as it is designed, primarily, to perform a very different task. This makes developing programs for graphics hardware more complex and time consuming than traditional computer programming. Converting comparative methods programs to use graphics hardware will give biologists access to the effective computer hardware and software required to analyse the vast quantities of data being generated, allowing them to explore large data sets, answer complex questions and develop new insights into biological systems. It will eliminate the large technical hurdle associated with supercomputers and is cost effective, costing hundreds or thousands of pounds instead of millions. Graphics cards require 1/20th of the power of traditional computers, making them more environmentally friendly.

Impact Summary

Using graphics hardware to accelerate comparative methods will have a diverse range of impacts. Comparative methods are gaining ground in cultural research areas, and BayesTraits has been used in a number of diverse fields including linguistics and anthropology. Languages and cultures have many similarities with species: they mutate and adapt over time, are heritable and compete for resources. Researchers also ask similar questions: what is the rate of change, what is the ancestral state, and are there correlations in the data? BayesTraits has been used to analyse a range of cultural data, including identifying the rate at which meanings evolve through Indo-European languages, investigating how cultures evolve and sustain complex social systems and showing how marriage systems and wealth transfer at marriage are correlated across cultures. While supercomputers are widely used in scientific research they are limited to countries with large research budgets. The top 500 supercomputers in the world are shared between 31 countries, with the United States owning 56% of them. This leaves over 160 countries without access to high-performance computing. Personal supercomputing, offered by GPGPU, has the ability to change this situation, giving researchers across the world access to cheap and powerful computing. It is estimated that 2% of the world's total energy is consumed by computer equipment. Graphics cards require 1/10th of the power of traditional supercomputers, making them an excellent green alternative. In 2007 the University of Reading purchased a supercomputer, ThamesBlue, rated as the 36th fastest supercomputer in the world, consisting of over 700 nodes. One of its primary tasks was to analyse biological data. It cost an estimated £25,000 a month in electricity to run, including cooling. Currently, 40 of the latest graphics cards have the same computational power and would fit into 10 nodes, requiring no more than £250 a month in electricity to run.
These examples are not quite comparable, as the supercomputer is much easier to program and the comparison assumes that any program would use all of the capabilities of the graphics cards, which is hard to achieve. It does, however, serve to highlight how energy efficient and powerful this technology is.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority Systems Approach to Biological research, Technology Development for the Biosciences
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme