JAPONICA: Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture (11/2011-10/2013)

Processor hardware has undergone seismic changes in recent years. Having run up against the “clock-speed wall”, the industry entered the multicore era, sustaining performance growth by replicating processor cores on a single chip. Further scaling leads to multi-socket multicore CPU configurations (e.g., 2- and 4-socket AMD Opteron and Intel Core i7 systems, each CPU with 2, 4 or 6 cores). These feature high-speed interconnects such as QuickPath Interconnect (QPI) and are cache-coherent Non-Uniform Memory Access (NUMA) machines. Recently, the use of vector-based coprocessors, such as GP-GPUs and IBM Cell processors, as compute devices has been quickly gaining a foothold in HPC, leading to heterogeneous many-core architectures. Coupling these accelerators with traditional CPUs can deliver much higher performance with less space and power consumption. GPUs in particular are gaining momentum as a mainstream component for building petaflop-scale machines. Tianhe-1A, recently built by China and ranked the world’s fastest supercomputer, epitomizes such architectures: it couples massively parallel GPUs with multicore CPUs (14,336 Xeon CPUs and 7,168 Nvidia M2050 GPUs) to deliver a record-breaking 2.5 petaflops. Achieving such performance with CPUs alone would require more than 50,000 CPUs and consume three times more power. Two other supercomputers on the Nov. 2010 Top500 list, Nebulae and TSUBAME 2.0 (ranked 3rd and 4th respectively), are also heterogeneous systems built from Intel Xeon CPUs coupled with GPUs.

Due to such a hybridized memory model, achieving good performance is non-trivial. This can be observed in the rather low efficiencies attained by the aforesaid two systems (Nebulae: 42.59% and TSUBAME: 52.11%), far lower than other comparable systems on the list (e.g., Jaguar: 75.46% and Roadrunner: 75.74%). Application optimization thus focuses on minimizing off-chip data traffic through maximal reuse of on-chip data and, more importantly, on reducing the data traffic passing through the PCI-E bus bottleneck. To unleash the full GPU compute power, it is necessary to spawn hundreds to thousands of threads for massive parallelism. Synchronizing these threads to maintain a consistent view of memory is a great challenge. Traditional lock-based programming is poorly suited to handling hundreds of threads: coarse-grained locking over-serializes the code, while fine-grained locking is notoriously error-prone. Software transactional memory (STM) helps here by executing threads speculatively and concurrently, and resolving memory conflicts afterwards. But existing STMs were not designed for GPUs; many foreseeable issues call for a retrofit.

We propose a user-friendly software ecosystem running atop the heterogeneous coprocessing architecture, unifying the programming style and language for transparent use of both CPU and GPU, automatically parallelizing loops, scheduling workloads efficiently across CPU and GPU resources, and providing scalable software coherency for shared data access in the highly-threaded runtime environment. We propose Java as the target language for unifying CPU and GPU programming. While Java may look suboptimal for high-performance computing, we believe it remains an attractive choice in view of its overwhelming popularity, portability, ease of software engineering and much-improved execution speed. Our methodology is to dynamically detect the data-parallel parts of the Java code (typically loop bodies) and translate them to OpenCL C/C++ for GPU execution. With the computation-intensive components offloaded to GPU cores, the relative slowness of Java compared to C/C++ is no longer a concern. The unconverted parts of the Java program can fully utilize the CPU multicore resources with their multithreaded workload. System-wide data consistency issues are handled by a GPU-friendly design of software transactional memory. This proposal will bring several significant contributions to GPU-based heterogeneous architectures: (1) general (object-oriented) developers can harness such architectures with minimal changes to their programming habits; (2) the proposed software engine will extend the range of portable and scalable applications on such architectures, including those with non-trivial multithreaded data sharing patterns.
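
To make the loop-to-kernel mapping concrete, below is a minimal sketch pairing a data-parallel Java loop with an illustrative OpenCL C kernel of the kind such a translator could emit; the saxpy example and the kernel text are our own illustration, not actual translator output.

    // A data-parallel Java loop of the kind the translator targets
    // (DOALL: no cross-iteration dependencies).
    public class SaxpyExample {
        static void saxpy(float a, float[] x, float[] y) {
            for (int i = 0; i < x.length; i++) {
                y[i] = a * x[i] + y[i];
            }
        }

        // Illustrative translation: one GPU work-item replaces one loop iteration.
        static final String KERNEL =
            "__kernel void saxpy(float a, __global const float* x,\n" +
            "                    __global float* y, int n) {\n" +
            "    int i = get_global_id(0);   // loop index -> work-item id\n" +
            "    if (i < n) y[i] = a * x[i] + y[i];\n" +
            "}";
    }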

JAPONICA (Java Multithreading Auto-Parallelization and Memory Consistency Support on GPU-based Heterogeneous Multicore Systems)

Following the CPU's shift to multicore designs, heterogeneous multicore systems have become the inevitable trend for sustaining performance growth. For compute-intensive tasks, adding a graphics accelerator chip (GPU) with hundreds of compute cores alongside the traditional CPU saves both money and energy. Hybrid processor architectures that integrate multiple CPU cores and GPU cores on one chip will become the standard for next-generation processors. Meanwhile, the gulf between software and hardware is becoming ever more apparent: designing a software system that effectively exploits all the CPU and GPU cores in parallel has become the thorniest problem. Today, to use a GPU for computation, programmers must understand its novel memory architecture and master a completely new low-level programming interface. Moreover, there is still no ideal answer to how GPU memory and CPU memory should access and synchronize with each other. Traditional software locks, being mutually exclusive by nature, can hardly support the thousands of threads running concurrently on a GPU. Transactional memory, which keeps memory data consistent and correct without mutual exclusion, brings new hope to this dilemma. However, existing transactional memory systems are mostly designed for CPUs; research on GPUs remains scarce. These technology gaps pose great obstacles to the wide adoption of heterogeneous multicore systems. In this project, we propose a new runtime platform, Japonica (Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture), which transparently ports multithreaded Java programs to GPU-based heterogeneous multicore systems. With the dynamic runtime support our platform provides, application developers can seamlessly use CPU and GPU resources while keeping to the Java programming style they are familiar with. Japonica has the following features: (1) automatic translation of Java bytecode into OpenCL code; (2) automatic parallelization of loops with non-deterministic data dependencies; (3) dynamic load scheduling and load balancing via task migration between CPU and GPU; (4) a global virtual shared memory that gives threads on all cores a synchronized and consistent view of the data; and (5) speculative parallel multithreaded execution based on transactional execution on both CPU and GPU. The techniques we propose will be general, and will bring strong programmability and scalability to accelerator-based heterogeneous multicore computing. Japonica can find wide use in both scientific computing and commercial applications.

We aim at seamless support for executing a Java program with multiple threads scheduled concurrently on both the CPU and the GPU, and we propose a CPU-GPU software ecosystem called Japonica (Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture) to achieve this goal.

  • Unified execution model and work scheduling: We propose a Java-to-OpenCL code translator with a loop analyzer that performs automatic loop parallelization in a speculative manner, and a scheduling runtime that assigns workloads to CPU and GPU with symbiosis awareness and dynamic load balancing support.

  • Virtual shared memory between CPU and GPU: We will investigate mechanisms along with optimization techniques such as lazy-copy and object migration to support transparent sharing of Java objects and arrays between CPU and GPU memory (a lazy-copy sketch follows this list).

  • GPU-friendly software-level speculative synchronization: We will examine and unify concepts of thread-level speculation and software transactional memory to develop a scalable memory consistency protocol for the massively parallel GPU threads as well as the CPU threads. Space-efficient hierarchical conflict detection methods and a resource-aware transaction retrying methodology will be studied.
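
As a hedged illustration of the lazy-copy idea from the second bullet, the sketch below keeps a single valid copy of a shared array and defers host-device transfers across the PCI-E bus until the other side actually touches the data. The class name, the opaque device handle, and the transfer stubs are hypothetical, not part of any existing runtime.

    // Hypothetical lazy-copy wrapper for a CPU-GPU shared array.
    final class SharedFloatArray {
        enum Location { HOST, DEVICE }

        private final float[] hostCopy;
        private long deviceBuffer;            // opaque GPU buffer handle (assumed)
        private Location valid = Location.HOST;

        SharedFloatArray(int n) { hostCopy = new float[n]; }

        // Called before CPU threads read or write the array.
        float[] forCpu() {
            if (valid == Location.DEVICE) {
                copyDeviceToHost();           // lazy: pay for the transfer only now
                valid = Location.HOST;
            }
            return hostCopy;
        }

        // Called before a kernel launch that uses the array.
        long forGpu() {
            if (valid == Location.HOST) {
                copyHostToDevice();
                valid = Location.DEVICE;
            }
            return deviceBuffer;
        }

        // Transfer stubs; a real implementation would call the OpenCL runtime,
        // e.g. clEnqueueReadBuffer / clEnqueueWriteBuffer.
        private void copyDeviceToHost() { /* ... */ }
        private void copyHostToDevice() { /* ... */ }
    }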

System Design

The Japonica system consists of a frontend (code translator) and a backend (runtime system).

  • Code Translator: We propose to tackle the challenge of translating Java bytecode to OpenCL at runtime. OpenCL adopts a JIT compilation model, so programs written in OpenCL can be executed on a heterogeneous platform comprising CPUs, GPUs and other processors. This gives developers the opportunity to use the multiple heterogeneous compute resources effectively (a runtime-compilation sketch follows this list).

  • Unified Runtime System: The OpenCL runtime (device) will execute tasks submitted by the Java threads running on the JVM (host). Because multiple versions of each executable are kept, workloads can move across the host-device boundary for adaptive rerun or load rebalancing. Host and device will be able to share data objects. We use the JVM Tool Interface (JVM TI) to export C pointers to Java objects and copy their memory into buffers allocated in GPU memory (object marshalling may be needed). To resolve data consistency problems in the “(GPU) threading within (Java) threading” model, we will deploy a Java STM on the CPU and make it co-work with a software-level speculative coherence protocol running on the GPU. The consistency protocol is tailor-made for the GPU, with lightweight bookkeeping and conflict detection via Bloom filtering.
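
As a hedged sketch of the JIT model mentioned in the Code Translator bullet, the snippet below compiles a generated kernel string at runtime from Java, assuming the third-party JOCL bindings (org.jocl); Japonica's actual translator pipeline may differ.

    import static org.jocl.CL.*;
    import org.jocl.*;

    public class JitCompileSketch {
        public static void main(String[] args) {
            // Kernel source produced by the bytecode-to-OpenCL translator.
            String source =
                "__kernel void saxpy(float a, __global const float* x,\n" +
                "                    __global float* y, int n) {\n" +
                "    int i = get_global_id(0);\n" +
                "    if (i < n) y[i] = a * x[i] + y[i];\n" +
                "}";

            setExceptionsEnabled(true);
            cl_platform_id[] platform = new cl_platform_id[1];
            clGetPlatformIDs(1, platform, null);
            cl_device_id[] device = new cl_device_id[1];
            clGetDeviceIDs(platform[0], CL_DEVICE_TYPE_GPU, 1, device, null);

            cl_context_properties props = new cl_context_properties();
            props.addProperty(CL_CONTEXT_PLATFORM, platform[0]);
            cl_context ctx = clCreateContext(props, 1, device, null, null, null);

            // JIT-compile the generated source for the chosen device.
            cl_program prog = clCreateProgramWithSource(ctx, 1, new String[]{source}, null, null);
            clBuildProgram(prog, 0, null, null, null, null);
            cl_kernel kernel = clCreateKernel(prog, "saxpy", null);
            // ... allocate buffers, set kernel arguments, enqueue the kernel ...
        }
    }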
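
The sketch below illustrates signature-based conflict detection via Bloom filtering, as used by the consistency protocol above: each transaction hashes the addresses it reads and writes into fixed-size bit vectors, so two transactions can be checked for conflicts by bitwise intersection rather than by comparing full read/write logs. The filter size and hash functions here are illustrative choices, not the protocol's actual parameters.

    import java.util.BitSet;

    final class TxSignature {
        private static final int BITS = 1024;     // power of two; space/precision trade-off
        private final BitSet readSig  = new BitSet(BITS);
        private final BitSet writeSig = new BitSet(BITS);

        // Cheap multiplicative hash; two seeds give two filter positions per address.
        private static int hash(long addr, int seed) {
            long h = (addr + seed) * 0x9E3779B97F4A7C15L;
            return (int) ((h >>> 40) & (BITS - 1));
        }

        void logRead(long addr)  { readSig.set(hash(addr, 1));  readSig.set(hash(addr, 2)); }
        void logWrite(long addr) { writeSig.set(hash(addr, 1)); writeSig.set(hash(addr, 2)); }

        // Conflict if my writes intersect the other's reads or writes, or my
        // reads intersect the other's writes (RAW/WAR/WAW). Bloom filters may
        // report false positives (needless aborts) but never false negatives.
        boolean conflictsWith(TxSignature other) {
            return writeSig.intersects(other.readSig)
                || writeSig.intersects(other.writeSig)
                || readSig.intersects(other.writeSig);
        }
    }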

GPU-TLS: loop parallelization engine for Japonica

GPU-TLS is a framework for speculative loop parallelization on GPUs. We propose a warp-based synchronization method to eliminate unnecessary inter-thread synchronization, and a loop-chopping method to reduce the dependency-checking overhead for loops with large iteration counts. A hybrid approach with eager intra-warp and lazy inter-warp dependency checking is also proposed to reduce the misspeculation penalty for unparallelizable loops.

Efficient dependency checking schemes in GPU-TLS (a simplified sketch of the PRW scheme follows the list):

  • PRW: dependency checking using Precise Read & Write sets
  • BDC: Balanced Dependency Checking
  • SW: dependency checking using Summarized Writes
  • SR: dependency checking using Summarized Reads
  • SRW: dependency checking using Summarized Reads and Writes
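
Below is a simplified Java sketch of the PRW idea: loop iterations are chopped into chunks executed speculatively in parallel, each chunk logs the exact addresses it reads and writes, and a later chunk must be re-executed if it read a location that an earlier chunk wrote (a cross-chunk RAW dependency). The real GPU-TLS engine performs this checking on the GPU with warp-level granularity; only the checking logic is shown here, and the class names are ours.

    import java.util.HashSet;
    import java.util.Set;

    final class ChunkLog {
        final Set<Integer> reads  = new HashSet<>();   // addresses read by this chunk
        final Set<Integer> writes = new HashSet<>();   // addresses written by this chunk
    }

    final class PrwChecker {
        // chunks[i] covers earlier iterations than chunks[j] for i < j.
        static boolean hasCrossChunkRaw(ChunkLog[] chunks) {
            for (int earlier = 0; earlier < chunks.length; earlier++) {
                for (int later = earlier + 1; later < chunks.length; later++) {
                    for (Integer addr : chunks[later].reads) {
                        if (chunks[earlier].writes.contains(addr)) {
                            return true;   // misspeculation: the later chunk must rerun
                        }
                    }
                }
            }
            return false;                  // speculation succeeded; commit in chunk order
        }
    }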

Other Related Projects

  • JESSICA3 (9/2006-4/2009): aims at supporting memory-intensive applications (e.g., data mining, search engines, computational biology, and scientific applications) on commodity clusters.
  • JESSICA4 (9/2009-8/2011): a new version of JESSICA based on software transactional memory. JESSICA4 will be ported to our new multicore cluster. We are looking for new research students to join this project (see JESSICA4 Recruit).

Last Modification: Sept 25, 2011, by Dr. C.L. Wang