Award details

Practical statistical alignment

Reference BB/C509566/1
Principal Investigator / Supervisor Professor Jotun Hein
Co-Investigators / Co-Supervisors Professor Gerard Lunter
Institution University of Oxford
Department Statistics
Funding type Research
Value (£) 356,116
Status Completed
Type Research Grant
Start date 01/02/2005
End date 30/06/2008
Duration 41 months

Abstract

The evolution of sequences is stochastic, and modelling this process is essential for the analysis of sequences and genomes. The first component of such models describes the evolution of a single nucleotide, amino acid or codon as a continuous-time Markov process on 4, 20 or 61 states, characterised by an instantaneous rate matrix. Such a model is then extended to the complete sequence by assuming independence among positions and that substitutions are the only class of events that need to be modelled. The first and simplest such model was presented in 1969 by Jukes and Cantor. In 1981, Felsenstein presented an algorithm central to analysing a set of sequences related by a phylogeny. The second component models insertions and deletions; this was first attempted in 1986 (Bishop and Thompson) and further developed by Thorne, Kishino and Felsenstein in 1991 and 1992. Introducing insertions and deletions transforms the process from one on a finite set (nucleotides, sequences of fixed length) to one on an infinite set (the set of all sequences). Empirically determined sequences are of finite length, but the postulated ancestral sequence can be of arbitrary length. The probability of the sequences at the leaves of a given phylogeny can be determined by dynamic programming that exploits the independence of evolution along the sequence. This allows the analysis of 3-4 sequences of length 300-500, which was not possible just 2 years ago. However, real data require these limits to be extended. The dynamic programming algorithm sums over all possible histories relating the extant sequences, and it is this summation over all possible histories that becomes computationally limiting. However, Markov chain Monte Carlo (MCMC) methods allow summation over a representative set of ancestral histories, so the likelihood function can still be evaluated. At present, this allows the analysis of up to 10 sequences.
This will be extended by better MCMC algorithms. Beyond these computational developments, more realistic models allowing long insertions and deletions and heterogeneity of evolution across positions will be developed and applied. These techniques will be implemented in user-friendly software to support investigations of biological sequences that are hard to align. It will be possible to define several priors based on biological knowledge (for example a coalescent prior, a molecular clock prior, rate heterogeneity along the sequence, information on secondary structure elements, etc.) and to set all the parameters of the MCMC sampling methods (although recommended parameter sets for different investigations will be given), and several options will be offered for how the data should be analysed. The software package will also provide graphical output for easy visualisation. It will also be thoroughly tested on real biological data, such as the HOMSTRAD database and mammalian genomes.
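
As an illustration of the substitution-only component described in the abstract, the following is a minimal Python sketch, not the project's software. It combines the Jukes and Cantor transition probabilities with a recursive likelihood calculation over a phylogeny, in the spirit of Felsenstein's 1981 algorithm, for a single nucleotide site. The function names, the nested-list tree encoding and the uniform root distribution are assumptions made for this example; insertions and deletions are not modelled.

```python
import math

STATES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability of going from state i to state j along a
    branch of length t (t measured in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def partial_likelihoods(node):
    """Return {state: P(leaves below node | node is in that state)}.

    Hypothetical tree encoding for this sketch: a leaf is a one-character
    string such as "A"; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # Observed leaf: probability 1 for the observed state, 0 otherwise.
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, t in node:
        child_like = partial_likelihoods(child)
        for s in STATES:
            # Sum over the child's possible states, weighted by the
            # transition probability along the connecting branch.
            result[s] *= sum(jc69_prob(s, c, t) * child_like[c] for c in STATES)
    return result

def site_likelihood(root):
    """Likelihood of one site, assuming a uniform distribution at the root."""
    like = partial_likelihoods(root)
    return sum(0.25 * like[s] for s in STATES)

# Three leaves: ((A:0.1, A:0.1):0.05, C:0.2)
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("C", 0.2)]
print(site_likelihood(tree))
```

Under the independence assumption stated in the abstract, the likelihood of a fixed alignment would be the product of such per-site values over its columns. The summation over alignments and ancestral histories that the project addresses is not shown here.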

Summary

unavailable
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not funded via a specific Funding Scheme