Award details

The Dundee Resource for Sequence Analysis and Structure Prediction

Reference BB/R014752/1
Principal Investigator / Supervisor Professor Geoffrey Barton
Co-Investigators / Co-Supervisors
Institution University of Dundee
Department School of Life Sciences
Funding type Research
Value (£) 794,889
Status Current
Type Research Grant
Start date 01/09/2018
End date 31/08/2023
Duration 60 months

Abstract

A major hurdle in managing large sequence sets is using the raw sequence data in the context of structure and evolution to inform our knowledge and understanding of biological systems. An essential prerequisite is accurate and reliable software tools to make structural and functional predictions from the sequence data. Here, we will continue robust support for the secondary structure prediction and solvent accessibility server "JPred", which performs up to 500,000 predictions per month for scientists in 200 countries. The JABAWS platform provides sophisticated access to 8 multiple sequence alignment methods, 4 disorder predictors, an RNA secondary structure predictor and 18 methods for conservation calculation from alignment. Since 2010, JABAWS has served 18,000 jobs/month on average. In order to expand the scope and ease of use of JABAWS, we will migrate it to the "Slivka" Python-based technology developed in this proposal. The prototype "ProteoCache", built on Cassandra NoSQL technology, stores JPred results for >300,000 proteins including proteomes for human and model organisms. We will migrate the prototype into production and so enable fast access to JPred predictions as well as new data on protein-protein and protein-ligand interactions in context with population variation (SNVs). ProIntVar includes research code for the analysis of population variants (SNVs) across protein families and at protein-protein and protein-ligand interfaces. We will migrate ProIntVar into a production Resource with interfaces to Jalview, the web and its own API. The users of the Dundee Resource are very diverse, from experimental biologists to bioinformaticians who write their own software. Accordingly, we will develop extensive manuals and e-learning materials to inform and educate potential users at all levels and we will run regular training courses.

Summary

This resource application is focused on supporting and maintaining computer tools and techniques developed at the University of Dundee that are in daily use by thousands of biological scientists throughout the UK and the world. The resource will not only ensure that these tools are readily available to all scientists, but also improve the ability of scientists and students to use them through better interfaces and via regular face-to-face training courses and other on-line materials. The tools focus on the analysis of protein sequences and structures, which are briefly introduced here. The plans to make a plant, animal or micro-organism are encoded in the molecule DNA and known as its genome. The genome can be represented as a long word made up of four different letters (A, C, G, T). The genome may be a few thousand letters long for a virus, up to several billion letters for plants and animals. The genome is divided up into regions called genes, which are translated by complex molecular machines into other molecules such as proteins. Humans and other animals have 20,000-30,000 genes that code for proteins, and each protein is made up of a sequence of 20 different amino acid types joined together in a chain. Protein sequences from an organism vary in length from a few amino acids to several thousand and can be represented as a word made up of 20 different letter types. The protein chain folds up into a complex three-dimensional shape that is defined primarily by its sequence. The shape of the protein, its "conformation", dictates the biological function of the protein, so understanding the conformation of a protein is vitally important to understanding the protein function. Over recent years there have been huge advances in technology to sequence DNA and so the genomes of many different organisms have been determined. As a consequence, the sequences of several million proteins are now known, but fewer than 150,000 have had their detailed three-dimensional structures worked out.
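
The two "alphabets" described above can be illustrated with a short Python sketch. It is not part of the Dundee Resource tools. It assumes the standard genetic code, reads the DNA coding strand directly (skipping the intermediate messenger RNA step), and uses a hypothetical helper name, `translate`.

```python
# Illustrative sketch only: a gene as a word over the four-letter DNA alphabet,
# and the protein it encodes as a word over the 20-letter amino acid alphabet.
# Assumes the standard genetic code; "translate" is a hypothetical helper name.

BASES = "TCAG"
# Amino acids (one-letter codes) for all 64 codons, in TCAG order; "*" = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map every three-letter codon to its amino acid.
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence into a protein sequence.

    Reads the sequence three letters at a time from the first letter and
    stops at the first stop codon or at the end of the sequence.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAATGA"))
```

Each group of three DNA letters selects one of the 20 amino acid letters, which is why a protein "word" is roughly a third of the length of the gene region that encodes it.
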
The computational tools that will make up this resource help to bridge this information gap by classifying protein sequences and making predictions of protein structure that can guide biologists to design more efficient and effective experiments. A major objective of the proposal is to provide support, maintenance and training for the popular JPred protein structure prediction server, which performs up to 500,000 predictions monthly for scientists in 200 countries, and for other techniques that we have developed. Web sites are good for humans to interact with, but less useful for computer software to interface to. Since our tools are useful for large analyses that might be done on many thousands of proteins, the new resource will also support a novel "web services" interface to the tools. Web services allow a program or application to be run remotely from within a program. For example, I might have a program running on my desktop computer, but call for an intensive calculation to be done on a remote high-performance computer system. We will develop our new framework for web services called "Slivka" that makes installation of web services easier. A key part of the new resource will be to store the results of analyses and predictions for many organisms in an innovative database called the ProteoCache.
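
The web services idea described above can be sketched in Python using only the standard library. This is a generic illustration and does not represent the Slivka, JABAWS or JPred interfaces: the handler, the function names and the trivial "analysis" (counting amino acids) are all assumptions made for the example, and the "remote" server runs on the local machine as a stand-in for a high-performance system.

```python
# Illustrative sketch only: a program calling a calculation on a "remote" service.
# Everything here is hypothetical; it is not the Slivka, JABAWS or JPred API.
import json
import threading
import urllib.request
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer


class CompositionHandler(BaseHTTPRequestHandler):
    """Stand-in for a remote analysis service: counts letters in a sequence."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        sequence = self.rfile.read(length).decode("ascii").strip().upper()
        result = {"length": len(sequence), "composition": dict(Counter(sequence))}
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


def start_demo_server():
    """Start the stand-in service on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), CompositionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_composition(base_url, sequence):
    """Client side: send a sequence to the service and return its JSON result."""
    request = urllib.request.Request(
        base_url, data=sequence.encode("ascii"), method="POST"
    )
    # An empty ProxyHandler avoids routing the local request through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    server = start_demo_server()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    print(remote_composition(url, "MKVLAAK"))
    server.shutdown()
    server.server_close()
```

The calling program only needs the service address and the sequence; the calculation itself happens wherever the service runs, which is what makes analyses over many thousands of proteins practical to script.
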

Impact Summary

The Dundee Resource will support a set of tools that will be widely used by the international biological sciences community. This has impact on all areas of academic BBSRC research as well as research funded by the MRC and other research councils that involves genome or protein sequences. Users of the Dundee Resource span academia across all biological subject areas and researchers in the pharmaceutical, agrochemical, agricultural and animal breeding industries where the analysis of protein sequences and their functional context is important to the economic success of the company. As such, the Dundee Resource will have both Economic and Societal impacts by increasing the speed, accuracy and depth of inference possible from sequence data and so increasing the competitiveness of its users in academia and industry. Improved competitiveness of the users of the resource across such a wide range of academic and industrial domains is likely to lead to improved competitiveness for the UK. The ProteoCache will significantly increase the speed at which scientists can access key information about their proteome of choice and apply this in experimental design or interpretation. The new ProIntVar resource will make available recent advances in the use of population variation data in studying the function of specific residues in protein families. This has potential impact in the fields of drug discovery as well as in guiding the design of experiments to modify the function of proteins for industrial processes. The Dundee Resource, particularly when coupled with the Jalview sequence analysis workbench, will also be important in teaching students in life sciences disciplines both basic and advanced sequence analysis. This educational role will enhance the knowledge and expertise of future generations of biologists and technologists working in academia and industry across all molecular life sciences disciplines in the UK.
Further beneficiaries will be attendees at the annual training workshop that will be run to teach potential users both the scientific background to the methods in the Dundee Resource and the practical use of the tools on their specific problems. The training workshops will be open to graduate students, postdocs, academics and members of industry. For those who can't attend the workshops, the on-line e-learning materials will provide similar information backed by informal email support. The Dundee Resource for Protein Sequence Analysis and Structure Prediction is aimed at accelerating scientific discovery and maximising the benefit of investment in sequence data generation. However, when coupled with visualisations in Jalview, some of the tools, such as secondary structure and disorder prediction, could be explained to schoolchildren and the general public. We have experience of public outreach through the annual "Doors Open Day" at Dundee and through the development of our GenomeScroller exhibit (www.genomescroller.org), which provides an exciting backdrop on which to explain the human genome, how big it is and how much (and how little) is understood about how it functions. In Year 2 of this grant and subsequent years, we will display and explain outputs of the new Dundee Resource in order to introduce a new audience to the power and excitement of bioinformatics research.
Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme