Supervisor: Prof. C.L. WANG

We have a few PhD (or 4-year ???) positions open for self-motivated and academically strong students this year. If you are interested in one of the projects, please contact me at clwang@cs.hku.hk. Interviews will be arranged for qualified students.

  • Talk (title in Chinese) by C.L. Wang (05/10/2014): PPT
  • (Keynote) Perspectives of GPU Computing in the New Big Data Era, 2017 Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC2017), Taichung, Taiwan. May 26, 2017.  (photos)
   

 

 

  1 Big-Little Heterogeneous Computing with Polymorphic GPU Kernels (RGC project: 11/2016-10/2019)

     
   

We will develop a GPU-based big-little Heterogeneous Multi-Processing (HMP) ecosystem that exploits integrated and discrete GPUs through a big.LITTLE scheduling approach. Based on OpenCL, data-parallel kernels can be mapped to either GPU, as well as the CPU, according to their memory access patterns and footprints and the processing capacity of the target device. We will investigate techniques for efficient runtime profiling-based analysis, including lightweight checking for data dependencies, profiling for branch divergence, and tracking of memory access ranges and patterns, leveraging the integrated GPUs. We introduce a new concept called polymorphic kernel transformation, which aims at generating task objects that can adapt to the target compute device through code and data layout transformations. A sketch of the device-selection idea appears below; more details can be found here.
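To make the big.LITTLE mapping concrete, here is a minimal OpenCL host-side sketch that routes a kernel to the integrated or discrete GPU based on its estimated memory footprint. The threshold and heuristic are illustrative assumptions only, not the project's actual scheduling policy:

    /* Minimal sketch: choose the "little" (integrated) or "big" (discrete)
     * GPU for a kernel from its estimated memory footprint. */
    #include <CL/cl.h>
    #include <stddef.h>

    cl_device_id pick_device(size_t footprint_bytes)
    {
        cl_platform_id plats[8]; cl_uint np = 0;
        cl_device_id integrated = NULL, discrete = NULL;
        clGetPlatformIDs(8, plats, &np);
        for (cl_uint p = 0; p < np; p++) {
            cl_device_id devs[8]; cl_uint nd = 0;
            if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_GPU, 8, devs, &nd) != CL_SUCCESS)
                continue;
            for (cl_uint d = 0; d < nd; d++) {
                cl_bool unified = CL_FALSE;  /* integrated GPUs share host memory */
                clGetDeviceInfo(devs[d], CL_DEVICE_HOST_UNIFIED_MEMORY,
                                sizeof unified, &unified, NULL);
                if (unified) integrated = devs[d]; else discrete = devs[d];
            }
        }
        /* Small working sets stay on the integrated GPU (no PCIe transfer);
         * large, bandwidth-bound kernels go to the discrete GPU. */
        if (footprint_bytes < (64u << 20) && integrated)  /* 64 MB, illustrative */
            return integrated;
        return discrete ? discrete : integrated;
    }

On integrated GPUs the host and device share physical memory, so small kernels avoid the PCIe transfer cost entirely; this asymmetry is what the big.LITTLE mapping exploits. The real scheduler would also weigh access patterns, branch divergence, and profiled device capacity, as described above.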

PhD (or 4-year ???): solid background in compiler techniques (e.g., loop parallelization, dependency checking) and GPU hardware architecture (Nvidia or AMD GPUs), and good experience in GPU programming (CUDA or OpenCL). Some background in Deep Learning is a plus.
1-2 RAs: you can apply at any time if you have the above experience (especially in OpenCL compilation techniques).

Two other related projects:

ThumbDL: Deep Learning Using Embedded/Integrated GPUs (new, 2018-)

This project aims to optimize the performance of various Deep Learning models (e.g., ConvNets (AlexNet), RNNs (LSTM), autoencoders) to offer a competitive, energy-efficient solution for lightweight deep learning computations on embedded and integrated GPUs, such as Intel Iris Pro Graphics and Qualcomm Adreno GPUs. The ultimate goal is to enable traditional Deep Learning tools (e.g., Caffe) to run efficiently on mobile clients such as surveillance cameras, drones, robots, and AR/VR headsets for gesture and eye tracking. The OpenCL programming model will be used for the project. Various model compression techniques will be applied to optimize DL performance; one such technique is sketched below.
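As one example of model compression, the following OpenCL C kernel sketches a dense layer with 8-bit quantized weights that are dequantized on the fly. The kernel name, signature, and quantization scheme are illustrative assumptions, not ThumbDL code:

    /* Illustrative kernel: dense layer with 8-bit quantized weights,
     * one work-item per output neuron. Quantized weights take 4x less
     * memory bandwidth than fp32, which is the scarce resource on
     * embedded/integrated GPUs. */
    __kernel void dense_int8(__global const uchar *w,   /* quantized weights  */
                             __global const float *x,   /* input activations  */
                             __global float       *y,   /* output activations */
                             const float scale,         /* dequantize factor  */
                             const int   zero_point,
                             const int   in_dim)
    {
        int o = get_global_id(0);           /* output neuron index */
        float acc = 0.0f;
        for (int i = 0; i < in_dim; i++)    /* dequantize on the fly */
            acc += ((float)w[o * in_dim + i] - zero_point) * scale * x[i];
        y[o] = acc;
    }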

  • Michael Cheng, Performance Optimization on TensorFlow Framework on Mobile GPU using OpenCL (based on Qualcomm Adreno GPUs)
  • JI Zhuoran, Deep Learning on Mobile Devices with GPUs

JAPONICA: Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture

GPUs open up new opportunities for accelerating Java programs for high-speed big data analytics. In this new project, we will develop a portable Java library and runtime environment, "Japonica+", to support GPU acceleration of auto-parallelized loops with non-deterministic data dependencies. The runtime can parallelize a sequential Java program (with non-deterministic data dependencies) into parallel workloads, either Java threads or OpenCL x86 kernels running on the CPU, together with OpenCL kernels running on the GPU, utilizing all CPU and GPU computing resources. Tasks to be done: (1) automatic translation from Java bytecode to OpenCL; (2) auto-parallelization of loops with non-deterministic data dependencies (see our GPU-TLS paper and the sketch below); (3) dynamic load scheduling and rebalancing via task migration between CPU and GPU; (4) virtual shared memory support between the host and multiple GPU cards. The whole project will be developed on recent Nvidia K40 and Pascal-based GPU cards.
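To illustrate item (2), here is a minimal OpenCL C kernel in the spirit of speculative loop parallelization: each work-item executes one iteration whose write target is unknown until runtime, logs its write address, and raises a conflict flag that tells the host to discard the speculative results and replay the loop sequentially. The names and the deliberately simplified conflict check are assumptions, not the actual GPU-TLS implementation:

    /* Speculative execution of: for (i) a[idx[i]] += b[i];
     * where idx[] makes the dependency non-deterministic. */
    __kernel void spec_loop(__global const float *a,     /* committed state   */
                            __global float *a_shadow,    /* speculative copy  */
                            __global const float *b,
                            __global const int *idx,     /* runtime subscripts */
                            __global int *wlog,          /* one record per iter */
                            __global int *conflict)      /* flag polled by host */
    {
        int i = get_global_id(0);
        int t = idx[i];                    /* dependency known only at runtime */
        wlog[i] = t;                       /* log the speculative write address */
        a_shadow[t] = a[t] + b[i];         /* speculative write to shadow copy  */
        /* Simplified conflict test against the previous iteration only; a real
         * TLS runtime checks the whole write log (see the GPU-TLS paper) and,
         * on conflict, discards the shadow copy and re-runs sequentially. */
        if (i > 0 && idx[i - 1] == t)
            atomic_or(conflict, 1);
    }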

     
   
  • Hao Wu, Weizhi Liu, Cho-Li Wang, "Fine-grained Concurrent Kernel Execution on SM/CU Level in GPGPU Computing".
  • Hao Wu, Weizhi Liu, Cho-Li Wang, "A Performance Model for Fine-grained Concurrent Kernel Execution of GPGPU Computing".
  • Huanxin Lin, Cho-Li Wang, "On-GPU Thread-Data Remapping for Branch Divergence Reduction".
  • Hongyuan Liu, King Tin Lam, Huanxin Lin, Cho-Li Wang, "Lightweight Dependency Checking for Parallelizing Loops with Non-Deterministic Dependency on GPU", ICPADS 2016. Best Paper Award. (pdf)
  • Guodong Han, Chenggang Zhang, King Tin Lam, and Cho-Li Wang, "Java with Auto-Parallelization on Graphics Coprocessing Architecture", 42nd International Conference on Parallel Processing (ICPP 2013), October 1-4, 2013, Lyon, France. (pdf)
  • Chenggang Zhang, Guodong Han, Cho-Li Wang, "GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs", 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013), May 13-16, 2013, Delft, The Netherlands. (pdf)
   


Efficient and Reliable Parameter Server Architecture for Large-scale Distributed Deep Learning (planning)

Looking for PhD students (or 4-year ???) interested in large-scale system development. Background: Spark, Flink, Hadoop, parameter server systems (e.g., Adam, Angel). We have two PhD students working on this topic. Our research focuses on the design of new synchronization models, data replication and consistency mechanisms, and fault-tolerance support; a sketch of one common synchronization model appears below. You are welcome to join us.
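For a flavor of the synchronization-model design space, here is a minimal shared-memory sketch of the stale synchronous parallel (SSP) model widely used in parameter servers: a worker may run ahead of the slowest worker by at most a bounded number of clock ticks. All names and constants are illustrative assumptions standing in for a distributed implementation:

    #include <pthread.h>

    #define NWORKERS  4
    #define STALENESS 2                      /* max clock gap allowed */

    static int clock_of[NWORKERS];           /* per-worker iteration clocks */
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    static int min_clock(void)
    {
        int m = clock_of[0];
        for (int i = 1; i < NWORKERS; i++)
            if (clock_of[i] < m) m = clock_of[i];
        return m;
    }

    /* Called by worker `w` at the end of each iteration. Blocks only when
     * the worker is more than STALENESS ticks ahead of the slowest worker,
     * so fast workers keep computing instead of idling at a barrier (BSP)
     * while parameter staleness stays bounded (unlike fully async). */
    void ssp_advance(int w)
    {
        pthread_mutex_lock(&mu);
        clock_of[w]++;
        pthread_cond_broadcast(&cv);         /* our progress may unblock others */
        while (clock_of[w] > min_clock() + STALENESS)
            pthread_cond_wait(&cv, &mu);
        pthread_mutex_unlock(&mu);
    }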

 

  2 Software Architecture for Fault-Tolerant Multicore Computing with Hybridized Non-Volatile Memories (HK GRF: 9/2015-8/2018)
     
In this project, we propose a new multicore architecture with a two-level (on-chip and off-chip) memory hierarchy containing both non-volatile STT-RAM and volatile SRAM/DRAM. We will investigate the challenges in designing system software architectures and the associated programming model for reliable big data computing on such hybridized memory hardware. Specifically, we plan to modify the Linux kernel to provide native NVM management for the upper layers, and to develop a data-centric fault-tolerant software system for reliable MapReduce-like programming. A sketch of the basic persistence primitive appears below.
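As background for the logging and persistence mechanisms studied here (cf. NVCL below), the following C sketch shows a cache-line-granularity durable update on NVRAM: persist an undo record first, then apply and persist the in-place write. The names and the simplified undo-log protocol are illustrative assumptions, not the project's API:

    #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct { void *addr; uint64_t old_val; } log_rec_t;

    /* Flush every cache line covering [p, p+n), then order with a fence;
     * without this, updates may linger in the volatile CPU cache and be
     * lost on power failure even though the backing memory is non-volatile. */
    static void persist(const void *p, size_t n)
    {
        const char *c = (const char *)p;
        for (size_t off = 0; off < n; off += 64)
            _mm_clflush(c + off);
        _mm_sfence();
    }

    /* Durable 8-byte update: make the undo record durable first, then
     * apply and persist the new value. Recovery rolls back any update
     * whose undo record exists but whose data flush may not have landed. */
    void durable_store(uint64_t *nv_addr, uint64_t new_val, log_rec_t *log_slot)
    {
        log_slot->addr = nv_addr;
        log_slot->old_val = *nv_addr;
        persist(log_slot, sizeof *log_slot);
        *nv_addr = new_val;
        persist(nv_addr, sizeof *nv_addr);
    }

A full system also needs commit records and log-space management; differential logging as in NVCL reduces the write traffic by logging only the cache lines that actually changed.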

Reference: Project Webpage

Recent progress:

  • Mingzhe Zhang, King Tin Lam, Xin Yao, Cho-Li Wang. SIMPO: A Scalable In-Memory Persistent Object Framework Using NVRAM for Reliable Big Data Computing, to appear in ACM Transactions on Architecture and Code Optimization (TACO).
  • Mingzhe Zhang, Xin Yao and C.L. Wang, "NVCL: Exploiting NVRAM in Cache-Line Granularity Differential Logging with User-Aware Persistence Guarantee".

PhD: strong background in OS kernels and in fault-tolerance protocols.


 
     
  3 Crocodiles: Scalable Cloud-on-Chip Runtime Support with Software Coherence for Future 1000-Core Tiled Architectures, HKU 716712E, 9/2012-8/2015, supported by HK RGC.
     
Scaling up to 1,000-core parallelism requires a fairly radical rethinking of how to design system software. With a growing number of cores, providing hardware-level cache coherence becomes increasingly complicated and costly, leading researchers to advocate abandoning it if future many-core architectures are to stay inherently scalable. That means software now has to take on the role of ensuring data coherence among cores. In this research, we address these issues and propose novel methodologies to build a scalable "Cloud on Chip" (CoC) runtime platform, dubbed Crocodiles (Cloud Runtime with Object Coherence On Dynamic tILES), for future 1000-core tiled processors. Crocodiles involves the development of two important software subsystems: (1) a cache coherence protocol and (2) DVFS-based power management, sketched below. (Currently, 3 PhD students are working on this project.)
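To show the mechanism underneath the DVFS work, here is a minimal C sketch that sets a core's frequency through the Linux cpufreq sysfs interface (this requires the userspace governor and root privileges). The function name is an illustrative assumption; the policy deciding when and how far to scale, e.g. latency-aware state transitions, is the research problem:

    #include <stdio.h>

    /* Write a target frequency (in kHz) for one core via cpufreq sysfs. */
    int set_core_khz(int cpu, long khz)
    {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
        FILE *f = fopen(path, "w");
        if (!f) return -1;           /* no permission or wrong governor */
        fprintf(f, "%ld", khz);
        fclose(f);
        return 0;
    }

    /* Usage example: drop core 2 to 800 MHz during a memory-bound phase:
     *     set_core_khz(2, 800000);
     */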
     
PhD (or 4-year ???): strong background in OS kernels; full knowledge of the memory subsystem (cache/DRAM, paging) and cache coherence protocols. Requires a strong background in software distributed shared memory systems (e.g., TreadMarks, JiaJia, JUMP) and programming experience with multicore power management systems.
   

 

Publications and recent efforts:

  • Z. Lai, K. T. Lam, C.-L. Wang, and J. Su, "Latency-aware DVFS for Efficient Power State Transitions on Many-core Architectures", Journal of Supercomputing, Vol. 71, No. 7, pp. 2720-2747, July 2015.
  • Z. Lai, K. T. Lam, C.-L. Wang, and J. Su, "PoweRock: Power Modeling and Flexible Dynamic Power Management for Many-core Architectures", IEEE Systems Journal, pp. 1-13, 20 January 2016.
  • Z. Lai, K. T. Lam, C.-L. Wang, and J. Su, "Power and Performance Analysis of the Graph 500 Benchmark on the Single-chip Cloud Computer", International Conference on Cloud Computing and Internet of Things (CCIOT '14), Changchun, China, pp. 9-13, December 2014.
  • Z. Lai, K. T. Lam, C.-L. Wang, and J. Su, "A Power Modeling Approach for Many-Core Architectures", 10th International Conference on Semantics, Knowledge and Grids (SKG '14), Beijing, China, pp. 128-132, 27-29 August 2014.
  • "Rhymes: A Shared Virtual Memory System for Non-Coherent Tiled Many-Core Architectures", ICPADS 2014.
  • "Latency-aware Dynamic Voltage and Frequency Scaling on Many-core Architecture for Data-intensive Applications", CloudCom-Asia 2013, Fuzhou, China, Dec. 16-18, 2013.
     
  4 OS-1K: New Operating System for Manycore Systems
     
   

Traditional operating systems are based on the sequential execution model developed in the 1960s. Such operating systems cannot address new many-core parallel hardware architectures without major redevelopment. For instance, how can you harness the power of a next-generation manycore processor with more than 1,000 cores? We will investigate various perspectives on future OS design towards this goal. We are developing an x86-based full-system simulator based on gem5. (Currently, one M.Sc. student is working on this project; we are modifying the gem5 simulator to model a 1,000-core chip.)

PhD: must have experience in OS kernel development. Experience with Barrelfish or sccLinux will be quite helpful. The student is also required to have good knowledge of multicore hardware architecture.

  • TrC-MC: Software Transactional Memory on multicore.
    • Kinson Chan, King Tin Lam, Cho-Li Wang, "Cache-efficient Adaptive Concurrency Control Techniques for Software Transactional Memory on Multi-CMP Systems", submitted to Concurrency and Computation: Practice and Experience, 2015.
    • K. Chan, K. T. Lam, and C.-L. Wang, "Cache Affinity Optimization Techniques for Scaling Software Transactional Memory Systems on Multi-CMP Architectures", The 14th International Symposium on Parallel and Distributed Computing, June 29-July 1, 2015, Limassol, Cyprus.

     

Updated: May 08, 2015