Award details

PopSeqle: Software for Population Sequence data to Lower Errors

Reference: BB/N02317X/1
Principal Investigator / Supervisor: Professor Cock van Oosterhout
Co-Investigators / Co-Supervisors: Professor Federica Di Palma, Dr Graham Etherington
Institution: University of East Anglia
Department: Environmental Sciences
Funding type: Research
Value (£): 151,363
Status: Completed
Type: Research Grant
Start date: 05/09/2016
End date: 04/03/2018
Duration: 18 months

Abstract

PopSeqle is a fast, user-friendly software tool for quality control (QC) checks of population-based sequence data. Uniquely, the software locates irregular sequence regions in multiple aligned genome assemblies by identifying regions with deviating values for their summary statistics (pi, Fst, Ho, Ts/Tv, dN/dS, Ka/Ks, Patterson's D, fd). Assembly artefacts and sequencing errors are identified by performing wavelet transform analyses, which locate peaks and valleys in the population genetic signal across the sequence space. The wavelet transform is a relatively new mathematical method similar to the Fourier transform, and it has been used successfully with NGS data in the past. By using population genetic theory, PopSeqle is able to quantify the signal in multiple sequence alignments that is caused by evolutionary forces and separate it from the signal caused by errors. Fundamentally, evolutionary forces act on all individuals in the population (and often over considerable timescales). In contrast, errors in the sequence data do not comply with population genetic rules, and this enables us to discriminate between the population genetic signatures of evolutionary forces and those of errors. In addition, PopSeqle will employ a sliding-window approach to help visualise and identify outlying regions. Following this initial check, the software will direct the user to appropriate further QC checks and/or downstream analyses to investigate the regions of interest. The software algorithms will be evaluated, customised and improved by running them on both simulated and empirical datasets. The ultimate aim of the PopSeqle project is to help improve the quality of NGS data and analyses to the benefit of a wide research community.
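
The sliding-window idea in the abstract can be illustrated with a minimal sketch. This is not PopSeqle code: the function names (`pairwise_pi`, `window_pi`, `flag_outliers`), the median/MAD outlier rule and the 3.5 threshold are assumptions chosen for illustration, and Python is used here only for readability. The sketch computes nucleotide diversity (pi) per window across a set of aligned, equal-length sequences and flags windows whose value deviates strongly from the rest.

```python
from itertools import combinations
from statistics import median


def pairwise_pi(seqs):
    """Mean proportion of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    length = len(seqs[0])
    if not pairs or length == 0:
        return 0.0
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


def window_pi(alignment, size, step):
    """Return (start, pi) for each window along an alignment of equal-length sequences."""
    length = len(alignment[0])
    return [
        (start, pairwise_pi([s[start:start + size] for s in alignment]))
        for start in range(0, length - size + 1, step)
    ]


def flag_outliers(values, threshold=3.5):
    """Indices of values whose robust z-score (median/MAD based) exceeds threshold.

    The threshold is an illustrative assumption, not a PopSeqle setting.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: anything that differs from it stands out.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]


# Toy alignment: four sequences, one of which differs at every site in 20..29.
base = "ACGT" * 10
odd = base[:20] + "TGCATGCATG" + base[30:]
windows = window_pi([base, base, base, odd], size=10, step=10)
flagged = flag_outliers([pi for _, pi in windows])
```

In this toy case only the third window (starting at position 20) is flagged. A flagged window is a candidate for follow-up, not a verdict: as the abstract notes, an outlying region may reflect an error or a region of genuine evolutionary interest.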

Summary

Currently, quality control (QC) checks of NGS data typically rely on the use of a single reference genome. This makes it very difficult to identify errors in sequence and assembly and to distinguish such artefacts from genuine biological phenomena such as genome rearrangements, copy number variants and aneuploidy. To date, no software has been developed for QC of whole genome population sequence data that incorporates population genetic theory to identify errors. The principal aim of this proposal is to develop a fast, user-friendly software platform (PopSeqle) for QC checks of population-based sequence data. PopSeqle uses a population genetic framework to identify potential irregularities in NGS genome / transcriptome assemblies by identifying regions with outlying values for their summary statistics (e.g. pi, FST, Ho). Such regions represent either sequence or assembly errors, or parts of the genome or transcriptome that are of genuine evolutionary genetic interest. The proposed project is timely, as NGS technologies have only recently gained the capability to generate affordable genome-level sequence data from many individuals. PopSeqle will be developed in the new programming language 'Julia', and it will identify errors by performing a relatively novel 'wavelet transform analysis' to locate peaks and valleys in a signal of population genetic summary statistics across the sequence space. Scientists working on genome datasets of plants and plant pathogens will test the beta version of the new software. These researchers will provide feedback that will help us optimise the software algorithms, thereby ensuring stakeholder relevance. Finally, the software, handbook and training video will be uploaded to the TGAC website, and workshops will be organised to demonstrate the PopSeqle software to end-users, thereby promoting staff training potential and increasing value for money.
This project will facilitate new interactions between research staff, postdocs, and PhD students on the NRP and elsewhere who work with NGS data of crops and crop pathogens, thereby enhancing research that is relevant to the BBSRC strategy.
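
The wavelet step mentioned in the summary can be sketched in a similarly hedged way. The summary does not say which wavelet PopSeqle uses, so the Haar wavelet below is an assumption chosen because it is the simplest to write out, and the function names (`haar_step`, `haar_details`, `largest_detail`) are hypothetical. The sketch decomposes a signal of per-window summary statistics into detail coefficients at successive scales and reports where the sharpest local change occurs.

```python
import math


def haar_step(signal):
    """One level of the Haar transform: (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_details(signal, levels):
    """Detail coefficients for each level, finest scale first."""
    out = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        out.append(detail)
    return out


def largest_detail(details):
    """(level, index, coefficient) of the largest-magnitude detail coefficient."""
    return max(
        ((lvl, i, c) for lvl, d in enumerate(details) for i, c in enumerate(d)),
        key=lambda item: abs(item[2]),
    )


# Toy signal of a summary statistic per window, with a single spike at position 5.
signal = [1, 1, 1, 1, 1, 9, 1, 1]
details = haar_details(signal, levels=2)
level, index, coeff = largest_detail(details)
```

Here the largest coefficient sits at the finest level (level 0), index 2, which covers positions 4 and 5 of the signal, so the spike is localised both in position and in scale. Coefficients at coarser levels respond to broader peaks and valleys, which is what makes a multi-scale decomposition useful for signals of this kind.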

Impact Summary

The proposed software PopSeqle has great potential to improve the quality of NGS data in the analysis of large datasets consisting of multiple individuals of one or more populations. The rapidly increasing number of de novo genome sequence assemblies enables us to incorporate population genetic theory in the QC checks. The PopSeqle software will set a new benchmark in QC of whole genome population sequence data by incorporating information across multiple de novo genome assemblies. The new software will fit into the existing bioinformatics pipeline between the currently implemented QC tests, which are based on read depth and other assembly- and sequence-quality statistics, and the downstream genetic analysis. In other words, the proposed project allows us to perform a novel QC step to check the quality of population-based whole genome sequence data. To accomplish this, we propose to incorporate population genetic theory and a wavelet transform analysis. We believe this is a very powerful approach to identify errors in genome assemblies that have hitherto gone unnoticed, for example artefacts due to mis-assembly (which may resemble genuine genome rearrangements). Similarly, highly diverged alleles (in populations with large effective population size, or of genes under balancing selection) may end up being mapped on different scaffolds and, vice versa, copy number variants may be erroneously collapsed. Using the information from multiple de novo assemblies and analysing it in a population genetic framework will significantly improve the quality of NGS data. Over 10 years ago, the PI implemented population genetic algorithms to check the genotyping accuracy of SSR loci (microsatellite loci) and developed the software Micro-Checker. That software set a new benchmark in QC of these genetic markers and has received more than 5,400 citations and 18,000 downloads.
As with the launch of Micro-Checker, we believe that the proposed PopSeqle software will fundamentally change the way in which QC of NGS data is performed in the future.

Committee: Research Committee A (Animal disease, health and welfare)
Research Topics: Technology and Methods Development
Research Priority: X – Research Priority information not available
Research Initiative: Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme: X – not Funded via a specific Funding Scheme