Acquire: A Distributed, Fault-Tolerant Work Sharing Library

Acquire is a distributed queuing system that uses distributed work queues (WorkQueue) to schedule and run packets of computational work (WorkPacket) on remote computing resources. Each WorkPacket can contain a single unit of computation, multiple child WorkPackets (with their own WorkQueue), or a computation that dynamically generates further WorkPackets (with their own WorkQueue). This allows a dynamic tree of computation to be generated and distributed over large, heterogeneous and geographically remote computing resources.

Processing of WorkPackets is dynamic, fault-tolerant and adaptive/energy-aware.

Dynamic
Processing of a WorkPacket can lead to the creation of child WorkPackets, which can be scheduled using their own, dynamically created WorkQueues. This allows a single WorkPacket to grow into a tree of WorkPackets, with these packets generated, scheduled and processed dynamically depending on the results of previous calculations.
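
The dynamic behaviour described above can be illustrated with a minimal, single-process sketch. This is not Acquire's API: the `run` callback, `submit`, `process_all` and the `make_sum_packet` helper are hypothetical names chosen for illustration, and a real WorkQueue would dispatch packets to remote resources rather than run them in a local loop.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class WorkPacket:
    """Hypothetical packet: `run` returns (result, child_packets)."""
    name: str
    run: Callable[[], Tuple[Any, List["WorkPacket"]]]


@dataclass
class WorkQueue:
    """Hypothetical in-process stand-in for a distributed WorkQueue."""
    pending: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)

    def submit(self, packet: WorkPacket) -> None:
        self.pending.append(packet)

    def process_all(self) -> dict:
        while self.pending:
            packet = self.pending.popleft()
            result, children = packet.run()
            self.results[packet.name] = result
            if children:
                # Children are scheduled on their own, dynamically created queue.
                child_queue = WorkQueue()
                for child in children:
                    child_queue.submit(child)
                self.results.update(child_queue.process_all())
        return self.results


def make_sum_packet(lo: int, hi: int) -> WorkPacket:
    """Sum range(lo, hi), splitting into child packets when the range is large."""
    def run():
        if hi - lo <= 2:
            return sum(range(lo, hi)), []  # leaf: compute directly
        mid = (lo + hi) // 2
        # Interior node: no result of its own, only further work.
        return None, [make_sum_packet(lo, mid), make_sum_packet(mid, hi)]
    return WorkPacket(f"sum[{lo}:{hi}]", run)


queue = WorkQueue()
queue.submit(make_sum_packet(0, 8))
results = queue.process_all()
total = sum(value for value in results.values() if value is not None)
```

Here the shape of the tree is not known when the first packet is submitted; it emerges as each packet decides, from its own inputs, whether to compute directly or to generate children.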

Fault Tolerant
If processing of a WorkPacket fails, then the error can be reported to the parent WorkQueue. The failed WorkPacket can be inspected to see if a result can be salvaged. If not, then either a copy of the WorkPacket from its last sane state can be rescheduled, or the WorkQueue can send the failed WorkPacket up to the WorkQueue’s own parent WorkQueue.
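
One way to picture this failure-handling path is the sketch below. It is an assumption-laden illustration rather than Acquire's implementation: the `salvage` method, the `partial_result` key, the `max_retries` limit and the `escalate` method are all hypothetical, and the "last sane state" is modelled simply as a deep copy of the packet's state taken before processing starts.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class WorkPacket:
    """Hypothetical packet: mutable `state` plus a function that processes it."""
    state: dict
    run: Callable[[dict], Any]

    def salvage(self) -> Optional[Any]:
        # Assumed convention: a failed packet may leave a usable partial result.
        return self.state.get("partial_result")


class WorkQueue:
    def __init__(self, parent: Optional["WorkQueue"] = None, max_retries: int = 2):
        self.parent = parent
        self.max_retries = max_retries  # hypothetical retry limit
        self.escalated = []             # failures that reached the root queue

    def process(self, packet: WorkPacket) -> Any:
        checkpoint = copy.deepcopy(packet.state)  # last sane state
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return packet.run(packet.state)
            except Exception as error:
                last_error = error
                salvaged = packet.salvage()
                if salvaged is not None:
                    return salvaged  # a result could be recovered
                # Otherwise reschedule a copy from the last sane state.
                packet.state = copy.deepcopy(checkpoint)
        return self.escalate(packet, last_error)

    def escalate(self, packet: WorkPacket, error: Exception) -> None:
        # Pass the failed packet up the queue tree; the root records it.
        if self.parent is None:
            self.escalated.append((packet, error))
            return None
        return self.parent.escalate(packet, error)


attempts = {"count": 0}

def flaky(state: dict) -> int:
    attempts["count"] += 1
    if attempts["count"] == 1:
        state["x"] = None  # corrupt the state, then fail
        raise RuntimeError("node lost")
    return state["x"] * 2

def broken(state: dict) -> int:
    raise RuntimeError("always fails")

root = WorkQueue()
queue = WorkQueue(parent=root)
flaky_packet = WorkPacket({"x": 21}, flaky)
result = queue.process(flaky_packet)   # succeeds on the retry
bad_packet = WorkPacket({}, broken)
bad_result = queue.process(bad_packet)  # exhausts retries and is escalated
```

Restoring the checkpoint before each retry matters in the first case: the failed attempt corrupted `state["x"]`, so a retry from the corrupted state would fail again.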

Adaptive/Energy-aware
Each WorkPacket provides estimates of the computational complexity of its contents, and can optionally supply different implementations that would allow processing on different types of computational resource (e.g. CPU, GPU). WorkQueues use this data, together with historical data gained from processing previous WorkPackets of the same type and computational complexity, to bias the scheduling of the WorkPacket towards either low-speed/low-energy compute resources or high-speed/high-energy resources, depending on the current demands of the user (e.g. “weekend job”, try to minimise energy consumption, or “urgent job”, try to minimise time to completion). Note that the user’s demands can be dynamic, so the user can change the urgency of the job whilst it is being processed. This would allow a weekend job that has not finished by Monday morning to be changed to an urgent job so that the results would be ready for processing on Monday afternoon.

In addition to using historical data from the processing of WorkPackets, the WorkQueues would factor into the scheduling decision historical and real-time data on the energy consumption and CPU availability of the different compute resources under their control, together with historical and real-time data on the projected CO2 cost of each unit of electricity. This would allow a WorkQueue to optimise against energy consumption, CO2 production and time to completion, depending on the user’s current preference.
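
The trade-off described above could be expressed as a weighted score over candidate resources. The sketch below is only one possible formulation under stated assumptions, not the scheduling algorithm used by Acquire: the `Resource` fields, the `urgency` value in the range 0 to 1, the `co2_weight` term, and all of the numbers in the example are hypothetical, and complexity is reduced to a single operation count.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of a compute resource."""
    name: str
    ops_per_second: float   # assumed throughput (e.g. from historical data)
    watts: float            # assumed average power draw while busy
    kg_co2_per_kwh: float   # assumed carbon intensity of its electricity


def estimate(resource: Resource, operations: float) -> Tuple[float, float, float]:
    """Return (seconds, kWh, kg CO2) for running `operations` on `resource`."""
    seconds = operations / resource.ops_per_second
    kwh = resource.watts * seconds / 3.6e6  # 1 kWh = 3.6e6 joules
    return seconds, kwh, kwh * resource.kg_co2_per_kwh


def choose_resource(resources: List[Resource], operations: float,
                    urgency: float, co2_weight: float = 0.0) -> Resource:
    """Pick the resource with the lowest weighted cost.

    urgency = 0.0 favours low energy ("weekend job");
    urgency = 1.0 favours short time to completion ("urgent job").
    """
    estimates = {r.name: estimate(r, operations) for r in resources}
    # Normalise each quantity by its maximum so the terms are comparable.
    max_s = max(e[0] for e in estimates.values()) or 1.0
    max_kwh = max(e[1] for e in estimates.values()) or 1.0
    max_co2 = max(e[2] for e in estimates.values()) or 1.0

    def score(resource: Resource) -> float:
        seconds, kwh, co2 = estimates[resource.name]
        return (urgency * seconds / max_s
                + (1.0 - urgency) * kwh / max_kwh
                + co2_weight * co2 / max_co2)

    return min(resources, key=score)


resources = [
    Resource("low-power-cpu", ops_per_second=1e9, watts=50.0, kg_co2_per_kwh=0.4),
    Resource("gpu-node", ops_per_second=1e10, watts=1500.0, kg_co2_per_kwh=0.1),
]
operations = 3.6e12  # complexity estimate supplied by the WorkPacket

weekend_choice = choose_resource(resources, operations, urgency=0.0)
# The user raises the urgency on Monday morning; remaining work is re-scored.
urgent_choice = choose_resource(resources, operations, urgency=1.0)
```

Because the score is recomputed each time a packet is scheduled, a change in the user's urgency only needs to change one input; packets that have not yet been dispatched would then be placed according to the new preference.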