Award details

Next generation imputation for huge data sets

Reference BB/L020726/1
Principal Investigator / Supervisor Dr John Hickey
Co-Investigators / Co-Supervisors
Institution University of Edinburgh
Department The Roslin Institute
Funding type Research
Value (£) 453,932
Status Completed
Type Research Grant
Start date 24/10/2014
End date 23/04/2018
Duration 42 months

Abstract

Realising the potential of sequencing livestock genomes will require sequence data for huge numbers of animals. This can only be achieved if the cost of acquiring sequence is much lower than at present. One approach to reducing cost is to use low-coverage sequencing and infer the missing data by imputation. Existing imputation algorithms for livestock cannot use such probabilistic data because they were designed for SNP-chip genotypes, which are known with near certainty. These algorithms, as well as probabilistic approaches based on Hidden Markov Models (HMMs), will also be unable to cope with the computational demands of the millions of animals that will be sequenced. This proposal will develop a generic imputation algorithm that is (1) flexible in utilising multiple types of genomic and ancillary information (e.g. pedigree), (2) scalable to data sets with millions of animals, and (3) accurate in livestock settings. Development will start with new heuristic approaches that accommodate probabilistic data obtained from low-coverage sequencing. Once these heuristic principles have been applied, the resulting data will be suitable for the application of an HMM, yielding a novel hybrid algorithm. The heuristic component will target large haplotypes shared by many individuals in livestock populations by capitalising on pedigree and on abundant, large families. The probabilistic component will target genomic regions where haplotypes are too short for the heuristic component to work effectively, or where information (e.g. pedigree) is unreliable. This will create synergy between the scalability and computational efficiency of heuristic algorithms and the robustness of the HMM. The hybrid algorithm will be benchmarked by comparing its performance with that of existing algorithms on data sets from large industry populations, huge simulated populations, and small prototype data sets. Software implementing the algorithm will be provided for ease of use.

Summary

Knowledge gained from genome sequencing has great potential for increasing the direction and rate of genetic change in livestock breeding, and for biological discovery in animal science. However, huge numbers of individuals will need to be sequenced to unlock this potential, and the current cost of sequencing for livestock is several hundreds or thousands of pounds per individual. This will remain a barrier to using these data routinely until the unit cost is of the order of tens of pounds. One promising approach to reducing costs whilst maintaining the quality of the resulting data is low-coverage next-generation sequencing (lcNGS). With lcNGS, large numbers of individuals can have their sequences sampled at low cost per individual, but each individual sequence will have substantial missing information. Accuracy is restored by inferring the missing data using a process known as imputation. In livestock, this process is made more efficient by the pedigree structures of the populations. Imputation using single nucleotide polymorphism (SNP) data from chips has been successfully applied in livestock. However, these methods are not optimal for imputation from lcNGS data for several reasons. (i) SNP-chip genotypes are highly accurate and data points are missing only occasionally due to technical issues. In contrast, lcNGS data carry much less certainty about the true genotype at a particular locus, and the missing data are randomly spread over the whole genome. (ii) SNP-chip genotypes cover only a small fraction of the genetic variation present in the genome in comparison to sequence data, so the computational techniques for imputing sequence data need to be much more efficient for practical use. (iii) The range of data produced by lcNGS is rapidly evolving, requiring next-generation imputation algorithms to be very flexible.
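To illustrate point (i), the following is a minimal sketch of why a genotype observed through a handful of reads is probabilistic rather than certain. It is not taken from the project: the function name `genotype_probabilities`, the simple binomial read model, the flat prior over genotypes, and the 1% sequencing error rate are all assumptions made for illustration.

```python
from math import comb


def genotype_probabilities(ref_reads, alt_reads, error_rate=0.01):
    """Return P(genotype | reads) for genotypes carrying 0, 1 or 2 alternative alleles.

    Hypothetical illustration: assumes a biallelic locus, independent reads,
    a binomial read model, and a flat prior over the three genotypes.
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alternative allele, per genotype.
    p_alt = (error_rate, 0.5, 1.0 - error_rate)
    likelihoods = [
        comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for p in p_alt
    ]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# One read supports several genotypes; many reads narrow the choice down.
one_read = genotype_probabilities(ref_reads=0, alt_reads=1)
many_reads = genotype_probabilities(ref_reads=10, alt_reads=10)
no_reads = genotype_probabilities(ref_reads=0, alt_reads=0)
```

Under these assumptions, a locus with no reads gives equal probability to all three genotypes, and a single read still leaves considerable doubt between the heterozygote and one homozygote. An imputation method for lcNGS data has to work with this kind of uncertainty rather than with a called genotype.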
The proposed imputation algorithm will address these issues from a novel direction by combining two approaches: heuristic and probabilistic. Heuristic algorithms use basic principles of inheritance and so are fast and accurate. They are well-suited to animal breeding since they use pedigree to make inferences from the abundance of closely-related individuals from large families, with large portions of the genome shared between pairs of individuals. However, heuristic methods can fail if such data are lacking or unreliable across all or parts of the genome. Probabilistic algorithms primarily use Hidden Markov Models to mimic inheritance statistically and are computationally more demanding, slower, and inherently less accurate than heuristic algorithms. They have been developed primarily for application to human populations in which the pedigree structures, for example small sibships, are not well-suited to exploiting the power of heuristic algorithms. The proposed algorithm will obtain synergy from combining the two approaches, as they have complementary strengths in the recovery of information and in computational efficiency. The overall objective is therefore to develop a generic imputation system that is capable of imputing in data sets of the order of millions of animals and can cope with the wide variety of data types that may appear from lcNGS. New heuristic approaches will be adopted to develop data that can be integrated with probabilistic approaches and combined into a novel hybrid algorithm. Efficient data handling and storage frameworks, and a user interface, will be developed to ensure the algorithm is computationally efficient, easy to use, and readily available to users. The algorithm will be benchmarked using a range of real and simulated data sets, including historical, real SNP-chip data, to ensure it remains backwards compatible with current and previous technology.
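As a rough illustration of what "basic principles of inheritance" can mean in a heuristic step, the sketch below fills in a missing offspring genotype only when the parents' genotypes determine it, and leaves it missing otherwise. It is not the project's algorithm: the function names, the 0/1/2 genotype coding, and the use of `None` for missing values are assumptions made for this example.

```python
def infer_from_parents(sire, dam):
    """Return the offspring genotype if the parents determine it, else None.

    Hypothetical illustration: genotypes are coded as the number of
    alternative alleles (0, 1 or 2) and None marks a missing genotype.
    """
    if sire is None or dam is None:
        return None  # nothing can be inferred from a missing parent
    if sire == 1 or dam == 1:
        return None  # a heterozygous parent may pass on either allele
    # Each homozygous parent transmits one copy of the allele it carries.
    return sire // 2 + dam // 2


def fill_offspring(sire_genotypes, dam_genotypes, offspring_genotypes):
    """Keep observed offspring genotypes and fill gaps where the rule applies."""
    return [
        observed if observed is not None else infer_from_parents(sire, dam)
        for sire, dam, observed in zip(
            sire_genotypes, dam_genotypes, offspring_genotypes
        )
    ]


filled = fill_offspring(
    sire_genotypes=[0, 2, 1, 2],
    dam_genotypes=[2, 2, 0, None],
    offspring_genotypes=[None, None, None, 1],
)
```

In this example the first two loci are resolved by the rule, the third stays missing because the sire is heterozygous, and the fourth keeps its observed value. Loci that such rules cannot resolve are the kind of gaps a probabilistic component would be left to handle.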
The availability of the algorithm will enable breeders to accumulate sequence data on millions of animals at low unit cost, and in turn prompt greater accuracy of selection and innovation in breeding goals.

Impact Summary

This project will develop a practical tool enabling sequence to be imputed from a wide variety of sources, opening up the potential for generating huge volumes of sequence information at low cost. It will develop fundamental scientific knowledge primarily in bioinformatics applied to genomics. The outcomes will be beneficial for:
(i) The academic community. Scientifically, the project constitutes a novel approach for combining heuristic and probabilistic imputation methods into a single scalable, flexible, and accurate imputation algorithm. This algorithm will enable the generation of large volumes of sequence information at low cost and will have the flexibility to handle new types of genomic information as they emerge. This will enable larger and hence more powerful experiments than are currently feasible, and a greater ability to combine data obtained with old technologies with data obtained with new ones. The direct application of the method will benefit researchers in animal genetics (both natural and commercial populations) and those who study isolated human populations. Methodological developments will benefit plant and human geneticists concerned with outbred populations. The prototype data generated in this project will be a unique resource for livestock researchers and evolutionary biologists.
(ii) Breeding companies, breed societies, and levy boards. As indicated by the attached letters of support from four representatives of the livestock production industry (covering the three economically most important livestock species in the UK), a successful outcome of the project is expected to open new possibilities that will be highly beneficial to breeding companies and organisations that carry out genetic evaluations of domestic livestock. Such organisations will be provided with the tool so it can be embedded within their research, development and operational pipelines. This will increase the efficiency and sustainability of genetic improvement in the long term. We also anticipate similar application in pedigreed companion animal populations in the future.
(iii) Commercial sequence and genotype providers. Companies providing SNP or sequence data will be able to use imputation to add value to the data that they generate.
(iv) Society. All members of society who work to improve or depend upon the competitiveness and sustainability of agriculture will benefit from the downstream practical applications outlined above. The application of the algorithm by breeding organisations will lead to faster and more sustainable genetic progress, leading to healthier food, and food production that is more resource efficient and affordable. Increased efficiencies in agriculture have direct societal benefits in greater food security with less environmental impact.
(v) UK science base. The proposed algorithm will provide a platform for increased R&D capabilities in the UK, maintaining its scientific reputation and associated institutions, with increased capability for sustainable agricultural production.
(vi) Training. The proposed research will be embedded within training courses that the PI is regularly invited to give, and the post-doc working on the project will have the opportunity to be trained at a world-class institute in a cutting-edge area of research.
(vii) Policy. Sequence data is expensive, but the research and practical benefits are potentially large. Therefore much investment will be made in sequence data in the livestock sector in the coming years. To maximise the efficiency of investment, a co-ordinated national and perhaps international effort may be needed. The method to be developed in this proposal could enhance and underpin such an effort.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Animal Health, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme