Award details

eHive-RPC: A Remote Procedure Call Public Interface for eHive

Reference BB/M020398/1
Principal Investigator / Supervisor Mr Andrew Yates
Co-Investigators / Co-Supervisors Dr Paul Flicek
Institution EMBL - European Bioinformatics Institute
Department Ensembl Group
Funding type Research
Value (£) 132,015
Status Completed
Type Research Grant
Start date 01/10/2015
End date 31/03/2017
Duration 18 months

Abstract

eHive-RPC will provide a system for making efficient Remote Procedure Calls (RPC) within concurrent pipelines. eHive is a distributed processing system built by the Ensembl Project and used to control single pipelines on large-scale compute farms. RPC allows inter-process communication: local code can trigger the execution of code in another address space. Because there is no limit on how long the remote task may take, waiting synchronously for the response is inefficient. Instead, developers use asynchronous solutions that allow the local work to be queued. Some solutions immediately hand back a token that identifies a unit of work and require the client to poll periodically to ask whether the work is finished. Polling is also inefficient, as clients cannot proceed the moment the remote call finishes. Our system solves this problem by extending the eHive workflow system. eHive is essentially event-driven and controls concurrent work units using semaphores. In this context, a blocking work unit can represent a remote call and another, blocked, work unit can be local; the latter is unblocked through an inter-pipeline semaphore system when the RPC call has completed. The remote server will communicate task completion, including appropriate error reporting. eHive-RPC also requires efficient two-way data transfer between local and remote servers. Here we aim to extend an existing architecture called accumulators and move towards the transparent transfer of data between caller and remote server. Our implementation will be integrated into the existing eHive project as both a client and a server, making any eHive pipeline an RPC target. We expect the system to be generic and usable by any client or workflow engine, e.g. Galaxy or Taverna. We also aim to disseminate eHive knowledge by hosting two training courses alongside extensive documentation of the protocols and message formats developed by this project.
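The semaphore pattern described in the abstract — a blocked local work unit released when the blocking remote call reports completion, instead of the client polling for a result token — can be sketched in a few lines. This is a minimal illustration, not the eHive or eHive-RPC API; all names here (`SemaphoredJob`, `remote_call`, `release_one`) are hypothetical.

```python
# Minimal sketch of the semaphore-gated RPC pattern: a local job is blocked
# by a semaphore count; each completing remote work unit decrements the
# count, and the local job is released when it reaches zero. Names are
# illustrative, not the eHive API.
import threading

class SemaphoredJob:
    """A local work unit that stays blocked until its semaphore reaches zero."""
    def __init__(self, blocker_count):
        self.remaining = blocker_count
        self.lock = threading.Lock()
        self.ready = threading.Event()

    def release_one(self):
        # Called when one blocking (remote) work unit reports completion.
        with self.lock:
            self.remaining -= 1
            if self.remaining == 0:
                self.ready.set()

def remote_call(job, payload, results):
    # Stands in for work executed in another address space; on completion
    # it decrements the semaphore rather than making the caller poll.
    results.append(payload * 2)
    job.release_one()

results = []
local = SemaphoredJob(blocker_count=2)   # blocked by two remote calls
for p in (1, 2):
    threading.Thread(target=remote_call, args=(local, p, results)).start()

local.ready.wait()                       # event-driven: no polling loop
print(sorted(results))                   # -> [2, 4]
```

The client thread does no periodic polling: it sleeps on the event and wakes the moment the last remote unit signals completion, which is the efficiency argument the abstract makes against token-plus-polling designs.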

Summary

Ensembl developed 'eHive' as a production system that manages and optimizes the running of tasks (called 'jobs') on a compute cluster that may have thousands of Central Processing Units (CPUs). A CPU is the hardware within a computer that carries out the instructions of a computer program by performing the basic input and output operations of the system. Each computer has one or more CPUs, and some compute clusters comprise many thousands of CPUs distributed amongst many computers. With so many computers and CPUs, it is important that jobs are sent to these CPUs in a fair and efficient manner, especially when many users are competing for the same resources. Clusters usually rely on a central queuing system that holds a list of all the jobs that need to be run and can give individual computers in the cluster explicit instructions about which job to execute. This type of queuing system works well if each job takes an hour or more to complete. However, when jobs complete faster than they can be scheduled, e.g. in minutes or less, a processing bottleneck is created. The usual way to solve the bottleneck is to implement another system on top of the scheduler that 'batches' similar jobs together to make operations more efficient. eHive's novel solution to the issue of job queuing is to move away from central job scheduling: eHive is a 'distributed' processing system based on 'autonomous agents' with the behavioural structure of honeybees, hence the name 'eHive'. eHive maintains the ability to monitor and track jobs via a central 'blackboard'. Workers are efficiently created on a compute cluster, known as a meadow, with no specific task assigned to them. Once running, each worker contacts the blackboard, finds the most suitable kind of job, specializes to claim work, and runs multiple jobs of this type in a row. Workers are able to re-specialize to claim other types of job once they exhaust their original designation.
Each worker regularly updates its status in the blackboard to allow other workers to optimize the overall job distribution. The benefits of eHive are (a) a reduction in the overhead of individual job processing, (b) an increase in the maximum number of tasks that can be running at any one time, (c) an increase in the tolerance to faults in the compute cluster, and (d) support for complicated processes running in parallel. Although eHive was originally designed for the purposes of Ensembl, its functionality is applicable to all data types with large compute requirements. In this project we aim to transform the possibilities of eHive further, by developing a 'Remote Procedure Call (RPC)' system for eHive. This will allow jobs to run on remote clusters as well as local clusters, thereby expanding the use of eHive to multiple compute clusters and cloud computing services. This will enable wider use of eHive within data-intensive fields in the life sciences and beyond.
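The worker behaviour described above — workers start with no assigned task, claim the most suitable job type from the blackboard, batch jobs of that type, then re-specialize when the type is exhausted — can be sketched as follows. This is a toy model under stated assumptions, not eHive code: the blackboard here is an in-memory dictionary, and "most suitable" is simplified to "deepest queue" (real eHive also weighs resources and priorities).

```python
# Hypothetical sketch of the blackboard/worker-specialization pattern.
# All names (blackboard, pick_specialisation, run_worker) are illustrative,
# not the eHive API.
from collections import deque

blackboard = {                      # outstanding jobs, keyed by job type
    "align": deque(["a1", "a2"]),
    "annotate": deque(["n1"]),
}

def pick_specialisation(board):
    # Claim the job type with the most outstanding work — a stand-in for
    # eHive's suitability rules. Returns None when the board is empty.
    candidates = [k for k, q in board.items() if q]
    return max(candidates, key=lambda k: len(board[k])) if candidates else None

def run_worker(board, log):
    # Specialize, drain all jobs of that type in a row, then re-specialize.
    while (kind := pick_specialisation(board)) is not None:
        while board[kind]:
            job = board[kind].popleft()
            log.append((kind, job))   # "run" the job

log = []
run_worker(blackboard, log)
print(log)   # drains the deeper 'align' queue first, then 'annotate'
```

Running several such workers against a shared (and suitably locked) blackboard gives the batching-without-central-scheduler behaviour the summary describes: each worker pulls work rather than waiting to be told what to run.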

Impact Summary

The last decade has seen the advancement of laboratory techniques that enable research data to be produced more cheaply and quickly, e.g. 'Next-Generation' DNA sequencing. In addition, new laboratory techniques now make it possible to probe new areas of biology, such as gene regulation through epigenetic mechanisms. These rapid improvements in methodology impact many different disciplines in the life sciences, from basic research to applied areas such as plant and animal breeding. The primary beneficiaries of this proposed development for eHive will be Bioinformaticians in academia and industry, both in the UK and beyond, including those supporting research by analyzing data, and those producing and maintaining archives and data resources for the research community. Bioinformaticians are actively developing new algorithms and more efficient means of handling and processing these data, so that they can be interpreted quickly and accurately. Processing these data requires large-scale compute, and managing these data on compute clusters is an increasing challenge. World-leading pharmaceutical companies, bioinformatics service companies, and animal breeding companies have in-house Bioinformaticians to produce customized analyses of private data, and therefore have the expertise to use software such as eHive. Evidence that EMBL-EBI supports these areas includes our long-standing Industry Programme and the more recent announcement of the Centre for Therapeutic Target Validation (CTTV). Suppliers of open source and commercial 'omics tools (e.g. Taverna and Galaxy) will also benefit from the access to compute farms and software that our eHive development will provide.
Enabling research in these areas impacts socio-economic outcomes, contributing both to basic research that promotes understanding and to benefits for the wider public: improved health care for humans and animals, and productivity increases in agriculture. How will these users benefit? In this grant we propose to undertake some major enhancements to eHive. These enhancements will set the stage for the use of eHive in a wider context than is currently possible, and we believe this will make it a more appealing and more accessible tool for Bioinformaticians and Bioscientists who wish to run large-scale compute. Enabling eHive to communicate across more than one compute cluster will bring about novel use-cases for eHive:
- For time-critical work, being able to redirect certain job types to a second cluster (e.g. a compute cloud) will allow the required workload to be completed within the limited time frame. Running a pipeline within only one compute cluster can be a disadvantage when that cluster is being heavily used by other users, or when its capacity is small.
- Being able to run a pipeline that spans more than one compute cluster will enable the user to run sections of their pipeline on the most appropriate cluster for that type of job. Some compute clusters are optimized for a particular type of job and may work well for one type of pipeline but not another: compute clusters may be optimized for the number of jobs running simultaneously, the number and size of files stored on disk (several small files versus few large files), multi-threaded jobs, etc.
Support for eHive is growing and we now have users in both academia (the Roslin Institute, EMBL-EBI, Gramene at Cold Spring Harbor, USA) and industry (Eagle Genomics). Having a common job scheduling tool promotes efficiency, as one group can focus on development of the tool and all groups benefit from the improvements.
We support outreach activities, and there is demand for training workshops (Eagle Genomics, pers. comm. with Kathryn Beal). Our mailing list is open, and developers from any background are free to ask questions, offer help, and share opinions on the use of eHive.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme