Award details

Deciphering the regulatory genetic code

Reference BB/V00736X/1
Principal Investigator / Supervisor Professor Jussi Taipale
Co-Investigators / Co-Supervisors
Institution University of Cambridge
Department Biochemistry
Funding type Research
Value (£) 1,543,608
Status Current
Type Research Grant
Start date 22/07/2021
End date 21/07/2026
Duration 60 months

Abstract

Many questions about how DNA or protein sequence determines the activity of macromolecules remain unanswered. A classic example is the sequence-to-structure problem (protein folding), but we also do not understand how the affinity of macromolecular interactions is defined (structure-to-affinity), or how DNA sequence determines when and where genes are expressed (sequence-to-expression). This project will tackle the sequence-to-expression problem, namely: how mammals read their genomes, and how changes in DNA sequence lead to differences in gene expression and ultimately phenotype. Our approach divides a large problem into smaller subproblems that can be solved by coordinated experimentation. We have previously developed several novel high-throughput biology and automated biochemical methods that allow us to determine high-resolution binding specificities of mammalian transcription factors (TFs; Jolma et al., Cell 2013; Nitta et al., eLife 2015; Yin et al., Science 2017), and analyzed the effect of TF-TF and TF-nucleosome interactions on their binding specificity (Jolma et al., Nature 2015; Zhu et al., Nature 2018). We propose to use these technologies to measure TF binding specificities alone, and in combination with other TFs, nucleosomes, and histone-modifying enzymes. The results from the biochemical assays will then be analysed computationally to build a predictor that receives a DNA sequence as input, extracts its features, and predicts the effect on transcription. For the modeling, we will use two main approaches: convolutional neural networks, which can approximate arbitrary input-output functions, and a logistic regression model that is biochemically interpretable. The ultimate aim is to move beyond observations towards an understanding of transcriptional regulation based on biochemical principles; in other words, to decipher the genetic code of gene expression.
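As a rough illustration of the interpretable branch of the modeling described above, the following minimal Python sketch scores a DNA sequence against a single TF binding motif and passes the best score through a logistic function. It is not the project's actual model: the motif matrix `PWM`, the function names `best_motif_score` and `predict_activity`, and the `weight` and `bias` values are all hypothetical, chosen only to show how a sequence variant could change a predicted activity.

```python
import math

# Hypothetical log-odds position weight matrix for a 4-bp motif (ACGT).
# The values are invented for illustration, not measured specificities.
PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
]


def best_motif_score(seq, pwm):
    """Return the highest motif score over all windows of the sequence."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def predict_activity(seq, pwm, weight=1.0, bias=-2.0):
    """Logistic model on one feature (best motif score); output lies in (0, 1).

    weight and bias are assumed values; in practice they would be fitted to data.
    """
    z = weight * best_motif_score(seq, pwm) + bias
    return 1.0 / (1.0 + math.exp(-z))


reference = "TTACGTTT"  # contains a perfect match to the hypothetical motif
variant = "TTAGGTTT"    # single-base change (C to G) inside the motif

print(predict_activity(reference, PWM))
print(predict_activity(variant, PWM))
```

In this toy setting the single-base variant weakens the best motif match and so lowers the predicted activity. Each parameter maps to a recognisable quantity (a motif score, a weight, a baseline), which is the sense in which such a model is interpretable.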

Summary

Our project aims to understand how the cell reads the information written in its genome. The project is a close collaboration between biological and computational scientists, who will work together to understand the basic mechanisms by which cells tell when and where the genes written in their DNA should be active. In other words, we seek to understand 'the second genetic code'. The first genetic code, which describes how DNA sequence is converted to protein sequence, was decoded more than 50 years ago, and the first draft of the human genome, which describes the sequence of the chemical letters A, C, G and T found in all human cells, was published in 2001. However, just knowing the order of the letters is not enough to understand how they instruct cells to function and to grow. Advances in DNA sequencing have also allowed the entire genomes of individual humans to be sequenced, and we have learnt, for instance, that the sequences of unrelated individuals differ by approximately 10 million DNA bases. These specific variants, and the mechanisms by which they act, are largely unknown. This is because most of the changes do not affect protein structure. Instead, the variations are presumed to affect the amount of protein made in particular cells, by affecting the DNA binding of proteins called transcription factors. Our research project aims to understand how transcription factors read the genomic code. This will also help us to understand how mutations or variations in DNA sequences change the activity of genes. The work is basic research that uses novel high-throughput methods and computational data analysis tools based on artificial intelligence. The project will first generate vast amounts of data in the laboratory, and then analyse and interpret the data using tailor-made computer programs. The work will have immediate benefits to the scientific community in terms of deeper understanding, novel methods and computational tools.
The work will also lead to increased understanding of the function of cells during growth and development. In a wider context, the proposed project is part of a broader effort to use advanced genomic and computational tools to understand the basis of human and animal biology. Genetic variants that are located between genes are so common that most plants, animals and people have many of them. Understanding their function will greatly benefit the scientific community through a deeper understanding of biological principles and the development of novel experimental methods and computational tools, and eventually through applications such as plant and animal breeding. In addition, understanding the effect of non-coding genetic variants will help to explain human variation, for example in lifespan and health.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Structural Biology, Systems Biology
Research Priority X – Research Priority information not available
Research Initiative X – not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme