Award details

Scalable causal gene network inference via genetic node ordering

Reference BB/M020053/1
Principal Investigator / Supervisor Dr Tom Michoel
Co-Investigators / Co-Supervisors Professor Albert Tenesa
Institution University of Edinburgh
Department The Roslin Institute
Funding type Research
Value (£) 148,695
Status Completed
Type Research Grant
Start date 01/09/2015
End date 28/02/2017
Duration 18 months

Abstract

Genome-wide association studies have uncovered the genetic architecture of numerous complex traits in model organisms, crops, livestock species and humans. A major challenge now is to understand the molecular mechanisms that explain genetic associations. Because the majority of trait-associated loci lie in non-coding genomic regions, it is hypothesized that they play a gene regulatory role and that genetic variation affects the status of molecular networks of interacting genes, proteins and metabolites, which collectively control physiological phenotypes. The aim of this proposal is to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. To achieve this we will develop a novel statistical method for reconstructing causal gene networks based on total genetic node ordering, implement the method in unique, ultra-fast software for genome-scale causal network reconstruction, and validate the method in silico using benchmark datasets from human and pig. The proposed method will be based on three components: pairwise Mendelian randomization tests to establish the most likely causal direction between two correlated gene expression traits, graph-theoretical concepts to derive a total causal ordering of nodes from the pairwise orderings, and penalized linear regression to reconstruct a sparse maximum-likelihood Bayesian causal gene network from the inferred total genetic node ordering.

Summary

The aim of this proposal is to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. To achieve this we will: (i) develop a novel statistical method for reconstructing causal gene networks based on total genetic node ordering; (ii) implement the method in unique, ultra-fast software for genome-scale causal network reconstruction; (iii) validate the method in silico using benchmark datasets from human and pig. Genetic differences between individuals cause variation in phenotypes. This principle underpins genome-wide association studies (GWAS), which map the genetic architecture of complex traits by measuring genetic variation on a genome-wide scale across many individuals. A major challenge in GWAS is to understand the molecular mechanisms that explain the statistical association between quantitative trait loci (QTLs) and phenotypes. Because the majority of QTLs lie in non-coding genomic regions and presumably play a gene-regulatory role, it is hypothesized that genetic variation affects the status of molecular networks of interacting genes, proteins and metabolites, which collectively control physiological phenotypes. Since comprehensive, experimentally verified, cell-type-specific networks of molecular biological interactions are lacking, statistical and computational methods that reconstruct causal trait-associated networks from omics data are essential to study the impact of genetic variation on gene regulatory networks. Causal gene networks consist of directed interactions between genes and are usually modelled as Bayesian networks, which assume that the expression level of a gene is normally distributed around a linear combination of the expression levels of its causal regulators and that no gene can affect its own expression, either directly or indirectly via an extended cycle of interactions.
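The Bayesian network assumption described above can be written compactly. The notation below is an assumption introduced here for illustration and does not come from the proposal: x_i is the expression level of gene i, Pa(i) is the set of its causal regulators, b_{ij} are regression weights and \sigma_i^2 is a gene-specific noise variance.

```latex
% Linear-Gaussian model for each gene given its causal regulators
x_i \mid \{x_j\}_{j \in \mathrm{Pa}(i)} \;\sim\;
  \mathcal{N}\Big( \sum_{j \in \mathrm{Pa}(i)} b_{ij}\, x_j,\; \sigma_i^2 \Big),
  \qquad i = 1, \dots, n

% Acyclicity: the directed graph with edges j -> i for j in Pa(i) has no
% directed cycles, so the joint density factorises over the genes
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\big(x_i \mid \{x_j\}_{j \in \mathrm{Pa}(i)}\big)
```

The factorisation is what makes the likelihood a sum of per-gene terms once the parent sets are known.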
Current state-of-the-art algorithms for learning the structure and parameters of a Bayesian network from experimental data rely on local optimization, where a model is improved one edge at a time. Such algorithms are feasible for systems of a few hundred genes, but modern sequencing technologies measure the abundance of orders of magnitude more RNA molecules, and increased sample sizes mean that ever more of those are detected as variable across individuals. Developing a scalable method to reconstruct causal gene networks from whole-genome genotype and transcriptome data measured across many individuals is therefore an open problem of outstanding interest. Statistical theory permits one exception to the intractability of the large-scale causal network inference problem: if there exists a total ordering of the nodes in the network, such that the parents of any node can be found among the nodes ranked before it, then the problem reduces to a set of independent, tractable optimization problems, one for each node. In genetics, pairs of gene expression traits can be causally ordered using genotype data. This is based on the principle of Mendelian randomization, which states that because genotypes of unlinked SNPs are inherited independently, if gene A is causal for gene B, then the association between the expression of gene B and an eQTL of gene A must be mediated by the expression of gene A, that is, it should vanish once the expression of gene A is conditioned on. Here we propose to use graph-theoretical concepts to derive a total causal ordering of nodes based on pairwise Mendelian randomization tests. We will then use penalized linear regression to reconstruct a sparse maximum-likelihood Bayesian causal gene network from the inferred total genetic node ordering. Preliminary results support the hypothesis that this method will lead to a dramatic reduction in computational cost, a higher model likelihood score and better biological validation, compared to current methods based on local optimization techniques.
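The last two steps described above (a total ordering from pairwise evidence, then one sparse regression per gene) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's software: the pairwise `scores` matrix stands in for the output of Mendelian randomization tests, the greedy ordering rule is a simple heuristic chosen here for brevity, the lasso is solved by plain coordinate descent, and all function names and the simulated three-gene chain are hypothetical.

```python
import random


def centre(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def total_order(scores):
    """Greedy total ordering from pairwise evidence (illustrative heuristic).

    scores[i][j] is a hypothetical score that gene i is causal for gene j.
    """
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        # Place next the gene with the largest net "upstream" evidence
        # relative to the genes that have not been placed yet.
        best = max(
            remaining,
            key=lambda i: sum(scores[i][j] - scores[j][i]
                              for j in remaining if j != i),
        )
        order.append(best)
        remaining.remove(best)
    return order


def soft_threshold(z, t):
    """Shrink z towards zero by t (the lasso proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(columns, y, lam, n_sweeps=200):
    """Minimise (1/2n)*||y - sum_k b_k x_k||^2 + lam*sum_k |b_k|
    by cyclic coordinate descent. columns[k] is predictor k."""
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)
    norms = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_sweeps):
        for k, col in enumerate(columns):
            if norms[k] == 0.0:
                continue
            # Correlation of predictor k with the partial residual.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[k] * norms[k]
            new = soft_threshold(rho, lam) / norms[k]
            delta = new - beta[k]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[k] = new
    return beta


def network_from_order(data, order, lam):
    """Regress each gene on the genes ranked before it; keep non-zero weights.

    data[g] holds centred expression values of gene g across samples.
    Returns {(parent, child): weight}; edges only point forward in the
    ordering, so the resulting graph is acyclic by construction.
    """
    edges = {}
    for pos, child in enumerate(order):
        parents = order[:pos]
        if not parents:
            continue
        beta = lasso_cd([data[p] for p in parents], data[child], lam)
        for p, b in zip(parents, beta):
            if b != 0.0:
                edges[(p, child)] = b
    return edges


# Simulated chain 0 -> 1 -> 2 (purely illustrative data).
rng = random.Random(0)
x0 = [rng.gauss(0, 1) for _ in range(200)]
x1 = [0.8 * a + 0.3 * rng.gauss(0, 1) for a in x0]
x2 = [0.8 * b + 0.3 * rng.gauss(0, 1) for b in x1]
data = [centre(x0), centre(x1), centre(x2)]

# Hypothetical pairwise evidence favouring 0 before 1 before 2.
scores = [[0.0, 0.9, 0.8],
          [0.1, 0.0, 0.9],
          [0.2, 0.1, 0.0]]

order = total_order(scores)
edges = network_from_order(data, order, lam=0.1)
```

Because each gene's regression uses only genes ranked before it, the per-gene problems are independent and could be run in parallel, which is the source of the scalability argument made in the text.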

Impact Summary

This project proposes to develop a novel method and software tool to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. The academic impact of the project will extend well beyond the immediate professional circle of the applicants and includes all researchers who perform systems genetics studies to understand the fundamental molecular mechanisms that connect genetic variation to phenotypic variation. Researchers at private commercial companies in the biotechnological and pharmaceutical sectors also have a strong interest in the research described in this proposal. They often face the challenge that candidate disease genes reported by genome-wide association studies are not directly druggable. The ability to reconstruct causal gene networks, in order to generate hypotheses on causal upstream regulators of lead candidate genes and on the potential downstream side-effects of affecting them via existing or novel drugs, is essential in modern drug target discovery research. Researchers at both commercial and academic organizations will benefit from this project through the availability of a novel software tool to reconstruct causal gene networks, applicable to the size of contemporary datasets and packaged in a user-friendly toolbox that will integrate seamlessly with existing data analysis pipelines for the R and MATLAB statistical computing environments. The applicants are committed to an open access policy for all software developed during this project. Under the conditions of the GNU General Public License (GPL), anyone will be allowed to use and distribute the developed software. No active commercialisation through licensing of the software as a for-profit product is therefore planned. The applicants strongly believe that both the academic and private research sectors will benefit most from open software development.
Although this is unlikely to lead to the creation of a new commercialisable product, the scientific knowledge gained from developing and benchmarking the novel software will be exploited. The Roslin Institute is committed to knowledge exchange, and commercial companies can benefit from the knowledge gained in this project through consultancy agreements with the applicants. The PI, with support from Edinburgh Research and Innovation, has already entered such an agreement with the SME Clinical Gene Networks AB (CGN), to oversee the reconstruction of gene networks surrounding identified genomic risk loci for cardiovascular disease. An important impact of this project will be the training of a highly skilled postdoctoral research associate for academic or non-academic professions alike. There is currently a great demand for computational scientists to assist in the analysis of "big data" in academic and non-academic life science organisations, but few computational scientists possess the necessary experience of working with molecular biological data. Through working on this project and performing the benchmark analyses on human and pig test datasets, the postdoctoral research associate will be trained in biological data analysis and at the end of the project will be well prepared for a cross-disciplinary research career.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Systems Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme