Award details

Signal-based image registration and mixed modelling for differential analysis of large scale cross-omics datasets

Reference BB/K004158/1
Principal Investigator / Supervisor Professor Andrew Dowsey
Co-Investigators / Co-Supervisors Professor Garth Cooper, Professor Warwick Dunn
Institution The University of Manchester
Department Medical and Human Sciences
Funding type Research
Value (£) 120,346
Status Completed
Type Research Grant
Start date 28/01/2013
End date 27/07/2014
Duration 18 months

Abstract

We propose to develop a generalised algorithm for aligning complex experimental designs of proteomic and metabolomic LC-MS and GC-MS data for the large-scale studies that are necessary to ensure the success of the systems biology approach. By basing the alignment in the complete raw signal domain, simultaneously compensating for differential expression, and providing a GPU-accelerated implementation, we anticipate significantly improved robustness and accuracy, and increased reporting of biochemical features, while maintaining throughput. This method will also allow, for the first time, the downstream use of functional mixed modelling (FMM) methodology for differential analysis that will mine far deeper into the proteome and metabolome than is visible with current data processing algorithms, compensate for confounding effects and present full posterior distributions of statistical certainty. In particular, it will enable the integrated analysis of proteomics and metabolomics datasets for the first time with a universal method that simultaneously models the interdependencies between them. We will employ a groupwise image registration approach with a physics-based deformation model. This will provide a tractable order of complexity to take into account the full raw data of the whole collection of datasets. The success of this approach is reliant on specialist modelling of the systematic bias and variation inherent in LC-MS and GC-MS. An accelerated FMM approach will then be developed using a variational Bayes formulation for incorporation directly into the alignment process. We believe this is key to (a) avoiding local optima, as the posterior probabilities for these will be low, and (b) reducing the complexity of FMM to realise a tractable integrated alignment. The groupwise registration and FMM will be packaged for use by the community as a novel discovery engine, together with its comprehensive validation on large-scale cross-omics datasets.

Summary

Biologists increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control the function of every living organism. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans, as well as how they break down under disease and how this breakdown can be averted. Often the systems biology approach starts with a 'snapshot' of a particular biological sample. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Due to technical and biological variation, it is necessary to analyse the sample a number of times to get reliable readings. Furthermore, proteins, metabolites and metals require different sample preparation, different chromatography approaches and different types of mass spectrometry instrumentation. These all add different kinds of bias and variation, which make it extremely challenging to infer links between compounds, especially if the compounds are from different classes. To make matters worse, many snapshots are needed to capture different 'angles' of the biological process under investigation, and the instrumental conditions themselves are not entirely reproducible over time. All this has led systems biology to become a progressively computational discipline.
Since the datasets are so large, however, the existing computational techniques tend to convert the rich raw data from mass spectrometry output to a symbolic representation of compounds too early on. We instead advocate that all the data across the samples should be modelled together as raw data, so statistical 'strength' can be borrowed across the collection when making decisions about whether a compound or compound interaction truly exists in the data and at what level of confidence. Unfortunately, the chromatographic step is particularly variable, so corresponding compounds have to be matched to each other before or during analysis. We propose to do this directly on the raw data, so that far fewer compounds are missed than when trying to detect them in each dataset in isolation. Furthermore, we propose that with the right 'mixed model', applied to the aligned raw data, we can separate out the systematic biases in the data even though they are confounded by their intermixed correlations. This will provide high quality evidence for interactions across sample classes and fuel advancements in the systems biology field.
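
As a purely illustrative aside, the idea of matching corresponding compounds directly on raw signals across a whole collection of runs can be sketched in a few lines of Python. This is a toy under loud assumptions, not the method proposed in this award: it uses synthetic one-dimensional chromatograms, rigid integer shifts in place of a physics-based deformation model, and a simple group-mean template. All function names (gaussian_peak, best_shift, groupwise_align) are hypothetical.

```python
import numpy as np

def gaussian_peak(t, centre, width=2.0, height=1.0):
    """Synthetic chromatographic peak (assumption: Gaussian shape)."""
    return height * np.exp(-0.5 * ((t - centre) / width) ** 2)

def best_shift(signal, template, max_shift):
    """Integer shift (in samples) that maximises correlation with the template."""
    shifts = list(range(-max_shift, max_shift + 1))
    scores = [np.dot(np.roll(signal, s), template) for s in shifts]
    return shifts[int(np.argmax(scores))]

def groupwise_align(signals, max_shift=10, n_iter=5):
    """Align every run to the evolving group mean rather than to one reference run."""
    signals = np.asarray(signals, dtype=float)
    shifts = np.zeros(len(signals), dtype=int)
    for _ in range(n_iter):
        # Build the template from the current alignment of the whole collection.
        aligned = np.stack([np.roll(s, k) for s, k in zip(signals, shifts)])
        template = aligned.mean(axis=0)
        # Re-estimate each run's shift against the shared template.
        shifts = np.array([best_shift(s, template, max_shift) for s in signals])
    aligned = np.stack([np.roll(s, k) for s, k in zip(signals, shifts)])
    return aligned, shifts

# Three synthetic runs of the same two compounds with retention-time drift.
t = np.arange(120)
drifts = [-2, 0, 3]
runs = [gaussian_peak(t, 50 + d) + gaussian_peak(t, 80 + d, height=0.6) for d in drifts]
aligned, shifts = groupwise_align(runs)
print("estimated shifts:", shifts.tolist())
```

Because the template is rebuilt from all runs at each iteration, no single run is privileged as the reference, which is the sense in which the alignment is groupwise. A real LC-MS or GC-MS alignment would need non-rigid warping and a model of instrument noise, neither of which is attempted here.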

Impact Summary

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry. The discovery engine will increase the amount of usable data extracted from LC-MS and GC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses. The proposed discovery engine could be seen to be in direct competition with products from software vendors and instrument manufacturers. In fact, we perceive a symbiotic relationship with user-centric discovery packages such as Progenesis (Nonlinear Dynamics, Newcastle, UK). The majority of development time for these packages is spent on data import/export, graphical interface, workflow, and results presentation. They also expose interfaces to popular search engines for feature identification, including Mascot (Matrix Science, London, UK), which is an essential source of complementary information for a discovery platform. We will therefore investigate the commercialisation of our methods, which could potentially occur in the short to medium term. Nevertheless, we are committed to providing our methods freely for academic use. To maximise dissemination and utility to the academic community, we will pursue the interfacing of our discovery engine into the open-source ProteoSuite package of our collaborator Dr Andy Jones, University of Liverpool (see letter of support). There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability.
Our stated aim is to enable reliable and precise statistical evidence from large-scale cross-omics experiments, such as those using a Systems Biology approach, which are becoming increasingly essential. This improvement will filter down to the public through a reduction in the resources, costs and overheads required for environmental, biological and biomedical discoveries and the characterisation of those discoveries. Since the system will identify multiple covariant effects, it is also reasonable to believe that tertiary biological processes could be identified which would otherwise go unnoticed. This has the potential to deliver further novel discoveries and characterise potentially interfering processes, therefore avoiding subsequent misallocation of resources. The PDRA employed on this grant will be encouraged to spearhead public dissemination and will benefit from the unique intensive cross-disciplinary interaction at CADET, which brings together proteomics, metabolomics and bioinformatics expertise in the same facility, working towards the same goal.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority Technology Development for the Biosciences
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme