Big-Little Heterogeneous Computing with Polymorphic GPU Kernels

 (Project Goals and Status, Updated on April 30, 2018)

Project PI: C.L. Wang

 

Project members: Hao Wu, Billy Lin, Pengfei Xu, Ji Zhouren 

Research Assistants: Zhang Longtao, Hu Shiyu

 

 

Hao Wu, Ph.D., 2/2012-

Xu Pengfei, PhD, 9/2017-

Huanxin Lin (Billy), PhD, 9/2014-

Ji Zhouren, PhD, 9/2018-

 

 

Zhang Longtao, RA

Hu Shiyu, RA 

 

 

 

 

 

 

Objective 1: Develop a GPU-based big-little heterogeneous computing ecosystem.

To support collaborative, heterogeneous multiprocessing, we have built a new compiler and runtime framework called Japonica. Japonica allows a high-level sequential Java program to be parallelized across heterogeneous compute devices, including multi-core CPUs, “big” GPUs (e.g., standard desktop or server-grade Nvidia and AMD GPUs), and embedded “little” GPUs (e.g., the Adreno GPU in mobile phones). Japonica enables fine-grained and dynamic control over resource partitioning and scheduling via OpenCL kernel transformations.

Japonica first uses a frontend compiler, JavaR, to parse the Java code and extract the code snippets containing the loops to be parallelized. The extracted code is converted to C and analyzed by the PPCG (Polyhedral Parallel Code Generation) compiler to generate OpenCL kernel code. The PPCG-based processing involves: (1) calculating the loop bounds and memory access patterns; (2) adding memory/buffer management code for transferring a flexible amount of data to/from the GPU; (3) wrapping all irregular data accesses in function calls so that they do not confuse PPCG’s code-generation backend (PPCG can only apply affine mappings to the subscript expressions of the statements in a loop nest); and (4) parallelizing the loops without data dependencies and wrapping dynamic loops with additional dependency-checking code. The Japonica ecosystem also implements a JOCL runtime (Java bindings for OpenCL) for distributing workload from the host machine to GPU devices based on symbiosis-aware kernel scheduling (discussed in Objective 4). Each for-loop block is represented by one JaTask, a Java interface that encapsulates all the GPU resources and functions needed for OpenCL kernel execution, in particular the data partition and kernel submission operations; a sketch of such an interface is given below. At the current stage, Japonica can generate basic OpenCL code to run on standard discrete GPUs (Nvidia GTX 980, Nvidia GTX 1080, AMD R9 290X, AMD Vega 64) and mobile GPUs (Adreno 530/540, as used in Snapdragon-based mobile phones).
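To make the JaTask abstraction concrete, the following is a minimal Java sketch of what such an interface might look like. The method names (buildKernel, partitionData, submitKernel, collectResults) are illustrative assumptions for exposition and are not the actual Japonica API.

// Illustrative sketch only: the actual JaTask interface in Japonica may differ.
// Each parallelized for-loop block is wrapped in one JaTask, which owns the
// OpenCL resources (buffers, kernel object) and the partition/submit logic.
public interface JaTask {
    // Compile or look up the OpenCL kernel generated by PPCG for this loop.
    void buildKernel(String kernelSource, String kernelName);

    // Split the loop's iteration space and transfer the corresponding
    // data slice to the target device (CPU, "big" GPU, or mobile GPU).
    void partitionData(int deviceId, long beginIteration, long endIteration);

    // Enqueue the kernel slice on the chosen device via the JOCL runtime.
    void submitKernel(int deviceId);

    // Copy results back and release device buffers.
    void collectResults();
}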

Objective 2: Runtime Profiling-based Analysis.

To achieve Objective 2, we propose several lightweight profiling techniques that can be deployed at runtime to detect: (1) non-deterministic data dependencies (i.e., loops with non-affine loop bounds or subscripts); (2) control-flow (branch) divergence; and (3) runtime resource usage under different resource partitioning and kernel reshaping methods. To detect non-deterministic data dependencies, we generate our own inspector kernels using ANTLR, intentionally removing unnecessary compute-intensive code (based on the dead-code elimination commonly used in compilers) and keeping only the code needed to detect inter-loop data dependencies or track control-flow divergence. We design special on-GPU profiling code (implementation status to be confirmed with Wu Hao) to examine the runtime resource usage of OpenCL kernels under different degrees of kernel slicing and kernel stretching (discussed in detail in Objective 3). We extract these runtime features using our own profiling methods and also collect hardware performance counters with CodeXL to validate our model (the exact set of hardware performance counters used is still to be confirmed).

Our profiling techniques have been applied to detect non-deterministic dependencies during kernel execution. In our paper “Lightweight Dependency Checking for Parallelizing Loops with Non-Deterministic Dependency on GPU” (ICPADS’16), we report two lightweight dependency checking schemes, namely two-pass inspector-executor (Two-Pass) and warp-intrinsic speculative execution (WISE), which can check loops with non-deterministic WAW, RAW, and WAR dependencies. Our schemes feature linear work complexity for memory operations, lower memory consumption than previous work, and minimal false positives by leveraging the lockstep execution of the GPU's SIMD lanes. Experiments show that our schemes achieve a 2.2× speedup over existing solutions in dependency-free cases, while taking only about 20% of the time of existing solutions when loop-carried dependencies cannot be statically disproven. A simplified sketch of the Two-Pass idea is given below.
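For illustration, the following is a simplified, sequential Java sketch of the idea behind the Two-Pass scheme for RAW dependencies. In the actual scheme both passes run as lightweight inspector kernels on the GPU; the class and variable names here (hasLoopCarriedRaw, writeIdx, readIdx) are purely illustrative assumptions.

// Simplified, sequential illustration of the Two-Pass idea.
// writeIdx[i] / readIdx[i] are the array subscripts written / read by
// loop iteration i; n is the size of the accessed array.
public final class TwoPassInspector {
    public static boolean hasLoopCarriedRaw(int[] writeIdx, int[] readIdx, int n) {
        int[] firstWriter = new int[n];
        java.util.Arrays.fill(firstWriter, -1);
        // Pass 1: record the earliest iteration that writes each element.
        for (int i = 0; i < writeIdx.length; i++) {
            if (firstWriter[writeIdx[i]] == -1) {
                firstWriter[writeIdx[i]] = i;
            }
        }
        // Pass 2: a loop-carried RAW dependency exists if some iteration reads
        // an element that an earlier iteration has written.
        for (int i = 0; i < readIdx.length; i++) {
            int w = firstWriter[readIdx[i]];
            if (w != -1 && w < i) {
                return true;
            }
        }
        return false;
    }
}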

Objective 3: Polymorphic Kernel Transformation.

Polymorphic kernel transformation reshapes computation at compile time through kernel code modification. The ultimate goal is to achieve a good mix of kernels running concurrently on a single GPU that maximizes the use of all functional units (e.g., vector/scalar, branch, and memory units) at all times. We propose two kernel transformation techniques, namely kernel slicing and kernel stretching. Kernel slicing cuts a large kernel into multiple smaller kernels, while kernel stretching merges multiple work-groups from the same kernel into a larger work-group. Both are used to adjust the number of work-groups of a given kernel by restricting the static resource usage of each work-group, taking the GPU’s hardware limitations into account.

In the OpenCL programming model, the basic unit of work on a compute device is a single work-item (a thread, in Nvidia terminology). A group of work-items executing the same code forms a work-group. OpenCL assumes that the underlying device consists of multiple Compute Units (CUs). When executing a kernel, work-groups are mapped to CUs by occupancy-based round-robin scheduling. Once a work-group is dispatched to a CU, it is further broken down into hardware-schedulable units known as wavefronts (64 work-items on AMD GPUs) and dispatched to SIMD units for execution. In a modern AMD GPU, each CU contains four SIMD units, each of which can schedule up to 10 concurrent wavefronts. The occupancy-based scheduler stops mapping work-groups to a CU when the next work-group does not have the hardware resources (e.g., registers, shared memory) it needs to run on the CU, or when the hardware limit (at most 40 wavefronts per CU) has been reached.

With kernel slicing, we prevent a kernel from launching big “elephant” work-groups that occupy most of the resources in each CU and make it impossible to mix its execution with other kernels. Kernel slicing carefully controls the factors that determine a kernel’s work-group footprint, including the number of work-items (threads) executed in each work-group and the number of registers and amount of shared memory used by all threads in the same work-group. Kernel stretching, on the other hand, merges multiple work-groups from the same kernel into one larger work-group. By increasing the workload of each work-group, kernel stretching increases the data reuse rate and reduces context-switch overhead, which is favorable for compute-bound kernels. We have shown that kernel slicing and kernel stretching can calibrate the runtime resource usage of individual kernels and make our predictive model more accurate. This also enables the Japonica scheduler to exert more fine-grained control over concurrent kernel execution, resulting in better performance. The occupancy arithmetic that slicing manipulates is illustrated in the sketch below.
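The following Java sketch illustrates the occupancy arithmetic that kernel slicing manipulates. The hardware figures (40 wavefronts per CU, 64 KiB of local memory per CU, 256 KiB of vector registers per CU, 64-thread wavefronts) are typical GCN-class values used here for illustration only, and the helper names are assumptions rather than Japonica code.

// Illustrative occupancy calculation for an AMD GCN-style Compute Unit.
public final class CuOccupancy {
    static final int WAVE_SIZE = 64;
    static final int MAX_WAVES_PER_CU = 40;
    static final int LDS_PER_CU = 64 * 1024;          // local memory, bytes
    static final int VGPR_BYTES_PER_CU = 256 * 1024;  // vector register file, bytes

    // How many work-groups of this shape fit on one CU at the same time?
    public static int workGroupsPerCu(int wgThreads, int ldsPerWg, int vgprBytesPerThread) {
        int wavesPerWg  = (wgThreads + WAVE_SIZE - 1) / WAVE_SIZE;
        int byWaves     = MAX_WAVES_PER_CU / wavesPerWg;
        int byLds       = ldsPerWg == 0 ? Integer.MAX_VALUE : LDS_PER_CU / ldsPerWg;
        int byRegisters = VGPR_BYTES_PER_CU / (wgThreads * vgprBytesPerThread);
        return Math.min(byWaves, Math.min(byLds, byRegisters));
    }

    public static void main(String[] args) {
        // An "elephant" work-group: 1024 threads and 32 KiB local memory.
        // Only 2 fit per CU, leaving little room to co-run other kernels.
        System.out.println(workGroupsPerCu(1024, 32 * 1024, 64));  // prints 2
        // After slicing into 256-thread work-groups with 8 KiB local memory,
        // more (and smaller) work-groups can interleave with other kernels.
        System.out.println(workGroupsPerCu(256, 8 * 1024, 64));    // prints 8
    }
}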

Objective 4: Symbiosis-aware Kernel Scheduling.

Based on the kernel transformation techniques discussed in Objective 3, we further propose a symbiosis-aware kernel scheduling policy to support concurrent kernel execution on GPUs. By running multiple kernels simultaneously within a Compute Unit (CU), we are able to maximize the dynamic resource usage and improve the overall execution time. We propose a metric, the symbiotic factor, which guides the scheduler to co-schedule kernels with complementary resource requirements and to avoid co-scheduling “mismatched” kernels at the same time. To avoid sweeping through all pairwise combinations of kernels and measuring their runtimes, we built a simple predictive performance model for concurrent kernel execution. As we aim at runtime optimization, we mainly use four parameters to build the model for predicting a kernel’s execution time: (1) the number of work-groups that can be active at the same time on a single GPU (#WG active per GPU), (2) the local memory used by each work-group (LD/WG, in bytes), (3) the number of threads in execution (T), and (4) the number of work-groups that can be dispatched to a CU at the same time (#WG active per CU). A simplified sketch of how complementary resource usage could be scored is given below.
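As an illustration of the idea behind the symbiotic factor, the Java sketch below scores a pair of kernels by how little they overlap on each functional unit. The scoring formula and the profile fields are simplified stand-ins for the actual metric and are not taken from the Japonica implementation.

// Illustrative scoring of kernel pairs for co-scheduling. The per-kernel busy
// fractions (vector ALU, scalar ALU, memory unit) come from profiling.
public final class SymbiosisScore {
    public static final class KernelProfile {
        final double vectorBusy;  // fraction of time on the vector ALUs
        final double scalarBusy;  // fraction of time on the scalar unit
        final double memoryBusy;  // fraction of time on load/store units
        public KernelProfile(double v, double s, double m) {
            vectorBusy = v; scalarBusy = s; memoryBusy = m;
        }
    }

    // Higher score = more complementary resource usage, i.e. less predicted
    // contention when the two kernels share a Compute Unit.
    public static double score(KernelProfile a, KernelProfile b) {
        double vectorContention = Math.max(0, a.vectorBusy + b.vectorBusy - 1.0);
        double scalarContention = Math.max(0, a.scalarBusy + b.scalarBusy - 1.0);
        double memoryContention = Math.max(0, a.memoryBusy + b.memoryBusy - 1.0);
        return -(vectorContention + scalarContention + memoryContention);
    }

    public static void main(String[] args) {
        KernelProfile computeBound = new KernelProfile(0.9, 0.1, 0.2);
        KernelProfile memoryBound  = new KernelProfile(0.2, 0.1, 0.8);
        // Complementary pair: little overlap on any single functional unit.
        System.out.println(score(computeBound, memoryBound));   // -0.1
        // Mismatched pair: both saturate the vector ALUs.
        System.out.println(score(computeBound, computeBound));  // -0.8
    }
}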

To characterize an individual kernel’s runtime behavior, we use CodeXL to extract the number of vector/scalar ALU instructions per work-item, and from these derive the percentage of time that each functional unit is occupied by the kernel executing its vector/scalar instructions and its memory load/store instructions, respectively. Our model further quantifies the time delay caused by contention on the GPU’s scalar/vector units and memory load/store units when multiple kernels run on the same GPU; a simplified sketch of this contention-based estimate is given at the end of this section. We evaluated our symbiosis-aware kernel scheduling policy using five benchmarks from the Parboil benchmark suite and three benchmarks from a deep learning framework. A total of 56 pairwise combinations of the 8 kernels, with different slicing and stretching factors, have been tested. Our results show that the new scheduling scheme achieves up to a 1.66x speedup on the AMD R9 290X compared to the default concurrent kernel execution.
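For illustration, the sketch below shows one simplified way such contention could translate into a predicted co-run time. The formula is an assumption made for exposition only and omits the work-group and local-memory terms of the actual model.

// Illustrative prediction of co-run execution time from per-kernel unit occupancies.
public final class CoRunPredictor {
    // soloTimeA/B: measured standalone execution times (ms).
    // busyA/B: fraction of its own runtime each kernel keeps a given unit busy,
    //          one entry per functional unit (vector, scalar, memory).
    public static double predictedCoRunTime(double soloTimeA, double soloTimeB,
                                            double[] busyA, double[] busyB) {
        double stretch = 1.0;
        for (int u = 0; u < busyA.length; u++) {
            double demand = busyA[u] + busyB[u];  // combined demand on unit u
            stretch = Math.max(stretch, demand);  // the most contended unit dominates
        }
        // The co-run finishes no earlier than the longer solo run, stretched by contention.
        return Math.max(soloTimeA, soloTimeB) * stretch;
    }
}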