Developing a precise and efficient algorithm for Adaptive Checkpoint Scheduling of Virtual Machine Replication on Multi-core

Student : Angad Singh (3035124764)

Supervisor : Dr. Cui H.M.

Today, there is a growing demand for deploying online services on virtualized infrastructures. Online services process more and more requests concurrently, which requires virtual machines (VMs) to use more and more virtual CPUs on multi-core hardware. As cloud computing grows, hardware failures have become more common, so the need for high availability is rising.

However, high availability is hard to obtain. Previously, high availability was only possible using commercial hardware or application-specific replication. Several solutions have been proposed to make high availability commonplace and to make virtual machines more fault tolerant. One of the most common solutions is checkpoint recovery. The entire state of the virtual machine is copied at a very high frequency, and the changes are propagated to the backup virtual machine almost instantaneously. While the changes are being copied, the virtual machine is paused and no output is released to the user. The output is released once the copy has completed successfully. This process is carried out at fixed intervals.
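The checkpoint step described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyVM`, `checkpoint`, and the dictionary-based memory are hypothetical names invented for illustration and are not taken from Remus or PLOVER.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a primary VM: memory maps page number to content."""
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)
    pending_output: list = field(default_factory=list)

    def write(self, page, value):
        # Any write marks the page as dirty until the next checkpoint.
        self.memory[page] = value
        self.dirty.add(page)

    def emit(self, packet):
        # Output is buffered, not released, until a checkpoint completes.
        self.pending_output.append(packet)


def checkpoint(vm, backup_memory):
    """Copy dirty pages to the backup, then release the buffered output.

    The VM is treated as paused for the duration of this call.
    """
    for page in vm.dirty:
        backup_memory[page] = vm.memory[page]
    copied = len(vm.dirty)
    vm.dirty.clear()
    released, vm.pending_output = vm.pending_output, []
    return copied, released


vm = ToyVM()
backup = {}
vm.write(0, "a")
vm.write(1, "b")
vm.emit("reply-1")
copied, released = checkpoint(vm, backup)
```

In a fixed-interval scheme, `checkpoint` would be called once per interval regardless of how many pages are dirty, which is where the overhead comes from.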

The problem with this method is that it can cause significant overhead because of the large amount of data that needs to be copied and transferred, even on uniprocessor VM setups. Adaptive checkpointing has been studied before, but the field is vast, and the process can be made more efficient to improve both user experience and fault tolerance.

The aim of the project was to investigate this area further and to develop metrics and algorithms for adaptive checkpointing that decrease the overhead, especially on multi-core systems.

The final algorithm shows promising results and is able to reduce the runtime of network bound applications.

Project Schedule and Progress

This is the tentative schedule for the project.

Activity	Expected Time Frame	Progress
Analysis of existing software	Oct 2017	100%
Invent/Explore relevant metrics and algorithms	Nov 2017 - Dec 2017	100%
Implementation with Remus	Jan 2018	100%
Experimentation	Feb 2018	100%
Results Analysis and Comparison	Mar 2018	100%
Completion	Apr 2018	100%

Methodology

The project was carried out in two phases. The first phase consisted of a workload analysis of several different applications on PLOVER. The applications selected for this analysis were Mongoose, Apache (using PHP), Tomcat, an FTP (File Transfer Protocol) server, and the NAS parallel benchmarks. The applications were tested with checkpoint intervals ranging from 10ms to 500ms, and properties such as the number of dirtied memory pages and the runtime were compared. After this analysis, an algorithm was developed to reduce the system overhead and runtime. The second phase consisted of integrating the algorithm with PLOVER and comparing its performance with that of unmodified PLOVER.
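To make the idea of an adaptive interval concrete, here is a minimal sketch of one possible heuristic. It is not the algorithm developed in this project: the function `next_interval_ms`, the `target_pages` threshold, and the halving and 1.5x growth factors are all assumptions chosen for illustration. Only the 10ms to 500ms bounds come from the analysis described above.

```python
MIN_INTERVAL_MS = 10    # lower bound of the interval range tested in the analysis
MAX_INTERVAL_MS = 500   # upper bound of the interval range tested in the analysis


def next_interval_ms(current_ms, dirty_pages, target_pages=1000):
    """Hypothetical heuristic for choosing the next checkpoint interval.

    If the last epoch dirtied more pages than the (assumed) target, the
    interval is halved so that each checkpoint has less to copy; otherwise
    it grows by 50% so that checkpoints happen less often. The result is
    clamped to the tested range.
    """
    if dirty_pages > target_pages:
        proposed = current_ms / 2
    else:
        proposed = current_ms * 1.5
    return max(MIN_INTERVAL_MS, min(MAX_INTERVAL_MS, proposed))


# Example: a write-heavy epoch shortens the interval, a quiet one lengthens it.
after_busy_epoch = next_interval_ms(100, dirty_pages=5000)
after_quiet_epoch = next_interval_ms(100, dirty_pages=10)
```

Other policies are equally plausible, for example lengthening the interval under heavy writes to amortize the pause. Which direction helps depends on the workload, which is what the phase-one analysis was meant to establish.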

Results

Source Code : https://github.com/angad2102/Plover/tree/Adaptive-Plover

The Algorithm

The algorithm shows promising results and reduces the runtime of processing-intensive applications. Performance is also better for applications that combine processing and network communication. For network-intensive applications, the algorithm performs the same as Plover in its default mode.