Title: Towards a Single System Image for High-Performance Java
Abstract: Multithreaded programming in Java is an attraction to programmers writing high-performance code if their programs can exploit the multiplicity of resources in a cluster environment. Unfortunately, the common platforms today have not made it possible to allow a single Java program to span multiple computing nodes, let alone dynamically to cling on to additional nodes or migrate some of its executing code from one node to another for load balancing reasons during runtime. As a result, programmers resort to parallel Java programming using such devices as MPI. Ideally, a "Single System Image" (SSI) offered by the cluster is all that is needed. But the purist's idea of SSI is not at all easy to achieve if indeed it can be achieved. A workable SSI solution will likely require a major concerted effort by designers on all fronts, from those of the hardware and OS to those looking after the upper or middleware layers. In the meantime, partial SSI implementations offer limited but useful capabilities and help clear the way for more complete SSI implementations in the future. We present in this talk our brave attempts to provide partial SSI in a cluster for the concurrent Java programmers, and discuss how the design of the Java Virtual Machine (JVM) has made it possible (or has given us some of the troubles). At the core of our present design are a thread migration mechanism that works for Java threads compiled in Just-In-Time (JIT) mode, and an efficient global object space that enables cross-machine access of Java objects. We close with some thoughts on what can be done next to popularize our or similar approaches. The following gives an overview of the system we have implemented which will serve as a major guiding example for our discussion.
SSI can be implemented at different layers such as the hardware, the OS, or as a middleware. We choose to implement SSI as a middleware by extending the JVM, resulting is a "distributed JVM" (DJVM). Our DJVM supports the scheduling of Java threads on cluster nodes and provides locality transparency to object accesses and I/O operations. The semantics of a thread's execution on the DJVM will be preserved as if it were executed in a single node.
The DJVM needs to extend the three main building blocks of the single-node JVM: the execution engine, the thread scheduler, and the heap. Similar to the single-node JVM runtime implementation, the execution engine can be classified into four types: interpreter-based engine, JIT compiler-based engine, mixed-mode execution engine, and static compiler-based engine. There are two modes of thread scheduling that can be found in current DJVM prototypes: static thread distribution and dynamic thread migration. The heap in the DJVM needs to provide a shared memory space for all the Java threads scattered in various cluster nodes, and there are two main approaches: realizing a heap by adopting a distributed shared memory system or by extending the heap.
Our proposed DJVM exploits the power of the JIT compiler to achieve the best possible performance. We introduce a new cluster-aware Java execution engine to support the execution of distributed Java threads in JIT compiler mode. The results we obtained show a major improvement in performance over the old interpreter-based implementation. A dynamic thread migration mechanism is implemented to support the flexible distribution of Java threads so that Java threads can be migrated from one node to another during execution.
Among the many challenges in realizing a migration mechanism for Java threads, the transferring of thread contexts between cluster nodes requires the most careful and meticulous design. In a JIT-enabled JVM, the JVM stack of a Java thread becomes a native stack and no longer remains bytecode oriented. We solve the problem of transformation of native Java thread context directly inside the JIT compiler. We extend the heap in the JVM to a Global Object Space (GOS) that creates an SSI view for all Java threads inside an application. The GOS support enables location transparent access not only to data objects but also to I/O objects. As the GOS is built inside JVM, we exploit the JVM runtime components such as the JIT compiler and the threaded I/O functions to minimize the remote object access overheads in a distributed environment.
As the execution speed in JIT mode is typically much faster than that in interpreter mode, the cost to access an object in a JIT compiler enabled DJVM becomes relatively high, which in turn puts pressure on the efficiency of the heap implementation. To reduce the communication overheads, optimizing caching protocols are used in our design. We employ an "adaptive object home migration protocol" to address the problem of frequent write accesses to a remote object, and a timestamp-based fetching protocol to prevent the redundant fetching of remote objects, among other optimization tricks.