Award details

Fast supertree construction using quartet joining

Reference BB/G024707/1
Principal Investigator / Supervisor Dr Peter Foster
Co-Investigators / Co-Supervisors Dr Tobias Hill, Dr Mark Wilkinson
Institution The Natural History Museum
Department Life Sciences
Funding type Research
Value (£) 121,618
Status Completed
Type Research Grant
Start date 01/07/2009
End date 31/03/2011
Duration 21 months

Abstract

Supertree methods are widely used for constructing large phylogenetic trees, and have become central to reconstructing the tree of life. Existing supertree methods are ad hoc and have a number of drawbacks. For example, the most widely used supertree method, matrix representation with parsimony (MRP), has biases associated with the shapes of the input trees, may produce a supertree with relationships that conflict with all the input trees, and may show 'unsupported groups' that are not present in any of the input trees. Most supertree methods, including MRP, require searching tree space. A recently proposed supertree method appears to be a much-needed fast and flexible alternative. This method, quartet joining (QJ), grows a supertree by using the information contained in quartets in the input trees to infer the placement of new leaves. It is very fast, with a complexity of O(n log n). Initial testing appears promising, and we propose to: enhance the efficiency of the method by allowing it to use more of the information contained in the input trees; increase its speed through parallelization and by allowing grafting of subtrees onto the growing supertree; increase the realism of the construction by allowing it to use support information in the input trees; allow optimization of the trade-off between speed and accuracy via dataset-dependent automatic tuning of the number of quartets from the input trees consulted to place a new leaf; and enhance ease of use by producing standalone applications. We will test the method and evaluate the effects of these enhancements using simulated and empirical data, and compare its performance and accuracy with other supertree methods, especially the widely used MRP.
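To make the notion of a quartet concrete: a quartet is the unrooted topology that a tree induces on four of its leaves, written ab|cd when an internal branch separates a and b from c and d. The sketch below is not the QJ method itself, only an illustration of how the quartets of one input tree could be enumerated. It assumes trees are given as nested Python tuples of leaf names; the function names `clades` and `quartets` are hypothetical and chosen for this example.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf_set, list_of_clades) for a nested-tuple tree.

    A leaf is any non-tuple value; an internal node is a tuple of children.
    This representation is an assumption made for illustration.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade rooted at this internal node
    return leaves, found


def quartets(tree):
    """Return the set of resolved quartets ab|cd induced by the tree.

    Each quartet is a frozenset holding two frozensets of two leaves.
    A four-leaf subset is resolved if some clade contains exactly two
    of its leaves; otherwise it is unresolved and is skipped.
    """
    leaves, all_clades = clades(tree)
    result = set()
    for four in combinations(sorted(leaves), 4):
        q = frozenset(four)
        for clade in all_clades:
            inside = q & clade
            if len(inside) == 2:
                result.add(frozenset([inside, q - inside]))
                break
    return result


if __name__ == "__main__":
    example = ((("A", "B"), "C"), ("D", "E"))
    for quartet in quartets(example):
        left, right = (sorted(side) for side in quartet)
        print("".join(left) + "|" + "".join(right))
```

This brute-force enumeration visits every four-leaf subset, so it is only practical for small trees; it is meant to show what information a quartet carries, not how an efficient implementation would obtain it.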

Summary

That all kinds of organisms that have ever lived are related through common ancestry and descent in one Tree of Life is one of the major insights of biological science. Knowledge of these phylogenetic relationships helps scientists to understand how the great diversity of life we see today has originated, provides a framework for inferring how living things have evolved, and allows testing of hypotheses that seek to explain this diversity and identify the mechanisms that have generated it. Phylogenetic relationships can be inferred using morphology but are increasingly inferred from DNA or amino acid sequence data. However, the inferred phylogeny of a single gene may differ from (be incongruent with) the true species phylogeny, either due to errors in the inference or because the gene tree is not identical to the species tree. The latter can arise when, for example, genes are transferred horizontally between species, as has happened in the development of antibiotic resistance in some bacteria, or when genes are duplicated and subsequently lost. This raises questions of how best to do phylogenomics (the phylogenetic analysis of genomic-scale data), with two alternative strategies currently being pursued: (1) combining all genes into a single analysis and (2) building a supertree, a synthesis of the individual gene trees. Supertree methods can be considered a 'divide-and-conquer' approach in which a large phylogenetic problem is decomposed into smaller problems whose solutions are then combined to give a global solution. Underpinning this is the expectation that individual gene trees can be more easily or effectively analysed because they are smaller and because they include only those taxa for which particular genes are available.
This also assumes that the information in the individual trees can be combined efficiently, but unfortunately the supertree methods that are currently most relied upon in practice have a number of obviously undesirable properties, such as producing supertrees that contradict relationships that are true of every input tree (and which therefore must be true if any input tree is true). We propose to develop a new supertree method that uses logical inference to make species phylogenies from collections of gene trees, to implement it in software, and to test it with simulations and empirical data. In this method a supertree is grown by adding leaves; the inference about where to put new leaves is given by 'quartets' in the input trees, which can be considered the quanta of phylogenetic information. The new method is needed to enable researchers to make best use of the rapidly expanding number of complete genome sequences, which may be of relevance to understanding the evolution of metabolic pathways and of drug resistance, and to drug discovery, epidemiology, and diversification studies linked to historical climate change. Technical advances have brought massive increases in the rate of production of new genomic data; complete genomes of prokaryotes can now be produced in an afternoon. Advances are now needed in the methods used to analyse this flood of data, and we aim to replace ad hoc methods with better-founded alternatives. Based on its logical foundation, its flexibility, and the speed of its computation, we expect that this will be a method of choice in phylogenomic analysis, but this needs to be confirmed through simulation to show its properties and determine its error rates, and through empirical tests that will provide proof of concept.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme