© Systems Research Group, Department of Computer Science, The University of Hong Kong.
Directed Point: An Efficient Communication Subsystem for Cluster Computing
Chun-Ming Lee, Anthony Tam, Cho-Li Wang
Download: Interested persons, please fill in the download information form to obtain the DP source code.

Abstract: In this paper, we present a new communication subsystem, Directed Point (DP), for parallel computing in a low-cost cluster of PCs. The DP model emphasizes a high abstraction level of interprocess communication in a cluster. It provides a simple application programming interface, with syntax and semantics similar to the UNIX I/O calls, to shorten the learning period. DP achieves low latency and high bandwidth based on (1) an efficient communication protocol, Directed Messaging, that fully utilizes the data portion of the packet frame; (2) efficient buffer management based on a new data structure, the Token Buffer Pool, that reduces memory copy overhead and exploits cache locality; and (3) light-weight messaging calls, based on the Intel x86 call gate, that minimize the cross-domain overhead. The implementation of DP on a Pentium (133 MHz) cluster running Linux 2.0 achieves 46.6 microseconds round-trip latency and 12.2 MB/s bandwidth on a 100 Mbit/s Fast Ethernet network. DP is a loadable kernel module and can be easily installed or uninstalled.

1. Our Ideas: We believe that whether a communication facility is implemented at kernel level or user level is not the major factor behind communication performance degradation; poor performance comes from poor architectural design, optimization techniques, and implementation. Thus, a kernel-level communication subsystem can also achieve low-latency and high-bandwidth communication. Programmability is important: it eases application development and allows higher-level programming tools to be built on top. We are also concerned with delivering good performance at the application interface level, rather than only at the network interface level. In addition, multi-process support is important, since a cluster is a multi-user, multitasking environment. Finally, a fast communication subsystem should be portable across different network technologies.

2. DP Architecture: Only two layers are implemented in the kernel:
No protocol handling layer in kernel:
Figure 1: DP Architecture

3. Optimizing Techniques:
Efficient buffer management:
a. Token Buffer Pool (TBP) as the receiving buffer (see the sketch after this list):
1. A dedicated buffer area for receiving data
2. Shared by the user process and the kernel
3. Reduces extra memory copy time; arrived messages can be used directly in place
b. Outgoing messages are deposited directly into the NIC's memory:
1. Reduces intermediate buffering overhead for outgoing messages
Light-weight messaging calls:
1. A fast entrance to the kernel
2. Avoid the extra software overheads incurred by typical system calls when switching from user mode to kernel mode
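To make the buffer-management idea concrete, the following is a minimal user-space sketch of a Token-Buffer-Pool-style receive area in C. It is illustrative only: the names (tbp_token, tbp_deposit, tbp_acquire, tbp_release), the fixed sizes, and the single-producer/single-consumer layout are assumptions, not the actual DP kernel data structure, which is pre-allocated in pinned kernel memory and shared with the user process.

/* Illustrative sketch of a Token Buffer Pool (TBP)-style receive area.
 * Assumption: each token holds one maximum-size Ethernet payload plus a
 * length and an ownership flag; the real DP structure is kernel-resident
 * and mapped into the user process, which this sketch does not model. */
#include <stddef.h>
#include <string.h>

#define TBP_TOKENS 64
#define TBP_MTU    1500

struct tbp_token {
    volatile int  full;          /* set by the depositing (kernel) side */
    size_t        len;           /* valid bytes in buf */
    unsigned char buf[TBP_MTU];  /* message payload, consumed in place */
};

struct tbp {
    struct tbp_token tokens[TBP_TOKENS];
    unsigned head;               /* next token the consumer inspects */
    unsigned tail;               /* next token the producer fills */
};

/* Producer side: deposit an incoming message into the next free token. */
int tbp_deposit(struct tbp *p, const void *msg, size_t len)
{
    struct tbp_token *t = &p->tokens[p->tail % TBP_TOKENS];
    if (t->full || len > TBP_MTU)
        return -1;               /* pool full or oversized message */
    memcpy(t->buf, msg, len);
    t->len  = len;
    t->full = 1;
    p->tail++;
    return 0;
}

/* Consumer side: hand back a pointer to the oldest message so it can be
 * used without a second copy; the caller releases the token when done. */
struct tbp_token *tbp_acquire(struct tbp *p)
{
    struct tbp_token *t = &p->tokens[p->head % TBP_TOKENS];
    return t->full ? t : NULL;
}

void tbp_release(struct tbp *p, struct tbp_token *t)
{
    t->full = 0;
    p->head++;
}

The point the sketch tries to capture is the one stated above: the receiving side reads each message in the buffer where it was deposited, instead of paying an extra copy into a separate user buffer.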
Figure 2: Transmit Process & Receive Process Overview

4. Directed Point Abstraction Model: To provide a higher level of abstraction, we view the interprocess communication in a cluster as a directed graph. In this directed graph, each vertex represents a communicating process. Each process may create multiple endpoints to identify its communication channels, and a directed edge connecting two endpoints at different processes represents a uni-directional communication channel. The DP model can describe any interprocess communication pattern in the cluster, so the programmer or a parallel programming tool can easily translate the directed graph into SPMD code (an illustrative encoding of such a graph is sketched below).
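As an illustration of the abstraction only (the type and field names dp_endpoint, dp_edge, nid, and port are hypothetical, not part of the DP API), a directed edge can be written down as a pair of (node, endpoint) tuples:

/* Illustrative encoding of the DP communication graph; not DP internals.
 * A vertex is a process on a node; an edge is a uni-directional channel
 * from one endpoint of the sender to one endpoint of the receiver. */
struct dp_endpoint {
    int nid;    /* node (host) identifier */
    int port;   /* endpoint number within the process */
};

struct dp_edge {
    struct dp_endpoint src;   /* sending endpoint */
    struct dp_endpoint dst;   /* receiving endpoint */
};

/* Example: a ring of four processes, each sending to its right neighbour. */
static const struct dp_edge ring[4] = {
    { {0, 1}, {1, 1} },
    { {1, 1}, {2, 1} },
    { {2, 1}, {3, 1} },
    { {3, 1}, {0, 1} },
};

Each edge in such a table maps onto one send/receive pair in the SPMD program, which is what makes the translation mechanical.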
Figure 3: Directed Point Abstraction Model

5. DP Programming Interfaces: DP provides a small set of new APIs that conform to the semantics and syntax of the UNIX I/O calls, which makes the translation from the abstraction model to code straightforward. The DP API consists of one new system call, user-level function calls, lightweight messaging calls, and traditional UNIX I/O calls. The DP API and the function of each call are shown in Table 1. All DP I/O calls follow the syntax and semantics of the UNIX I/O calls: just as a file descriptor identifies a target device or file in UNIX I/O, a descriptor is used to abstract the communication channel. Once the connection is established, a sequence of read and write operations is performed to receive and send data. The DP API thus provides a handy and familiar interface for network programming; since the UNIX I/O calls have been widely used for development in the past, there is little burden of learning a new API.
Table 1: The DP API
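Table 2 gives an example program; to convey the UNIX-I/O flavour described above without reproducing it, here is a minimal sketch. The prototypes dp_open(), dp_write(), dp_read(), and dp_close() and their argument lists are assumptions standing in for the calls listed in Table 1, declared here only so the sketch compiles; the authoritative names and signatures are those in Table 1.

/* Hedged sketch of DP-style code in the UNIX I/O idiom.
 * The call names and argument lists below are NOT confirmed by this page;
 * they stand in for the actual entries of Table 1. */
#include <stdio.h>
#include <string.h>

/* Hypothetical prototypes (no implementation given here). */
int dp_open(int peer_nid, int endpoint);
int dp_write(int fd, const void *buf, int len);
int dp_read(int fd, void *buf, int len);
int dp_close(int fd);

int main(void)
{
    char msg[] = "hello";
    char buf[1500];

    /* Open a communication channel much like opening a file. */
    int fd = dp_open(1 /* peer node id */, 7 /* endpoint */);
    if (fd < 0)
        return 1;

    /* Send and receive with read/write-style calls on the descriptor. */
    dp_write(fd, msg, (int)strlen(msg) + 1);
    int n = dp_read(fd, buf, (int)sizeof buf);
    if (n > 0)
        printf("received %d bytes: %s\n", n, buf);

    dp_close(fd);
    return 0;
}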
Table 2: Example Program

6. Sending and Receiving: Figure 4 shows a simplified diagram illustrating the interaction between the user processes and the network component. The network component is the Digital 21140 controller.
In this example, we show two communicating processes in user space. When a process wants to transmit a message, it issues either a traditional write system call or the DP lightweight messaging call dp_write(). The process then switches from user space to kernel space, and the corresponding DP I/O operation for transmission is triggered: it parses the NART to find the network address for the given NID and builds the network packet frame. The packet frame is deposited directly into the buffer pointed to by the TX descriptor, without intermediate buffering. Afterwards, DP signals the network hardware that a message is ready to be injected onto the network. Up to this point, execution is in the context of the calling user process or thread and consumes its host CPU time. Once the message is stored at the TX descriptor, the network component takes the message and injects it onto the network without CPU intervention. When a message arrives, an interrupt is raised and the interrupt handler is activated. The interrupt handler calls the MDR, which examines the Directed Message header to identify the destination process. If the process is found, its TBP is used to store the incoming message. In our implementation, the TBP is pre-allocated and locked in physical memory, so it is guaranteed that the destination process will obtain the message without any delay. (A hedged sketch of this receive-side dispatch step follows.)
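The following C sketch captures the dispatch step just described. The structure names (dm_header, proc_entry), the fields, and the linear lookup are assumptions made for illustration; tbp_deposit refers to the earlier TBP sketch, and none of this is the actual DP kernel code.

/* Hedged sketch of the receive-side dispatch described above.
 * Names and layouts are illustrative, not the actual DP kernel code. */
#include <stddef.h>

struct dm_header {            /* Directed Message header (assumed fields) */
    int dst_nid;              /* destination node id */
    int dst_port;             /* destination endpoint on that node */
    int len;                  /* payload length in bytes */
};

struct tbp;                                          /* see the earlier TBP sketch */
int tbp_deposit(struct tbp *p, const void *msg, size_t len);

struct proc_entry {
    int         port;         /* endpoint registered by a local process */
    struct tbp *tbp;          /* that process's pre-allocated, pinned TBP */
};

/* Called for each arriving frame, conceptually from the interrupt handler. */
int dp_dispatch(struct proc_entry *table, int entries,
                const struct dm_header *hdr, const void *payload)
{
    for (int i = 0; i < entries; i++) {
        if (table[i].port == hdr->dst_port) {
            /* Deposit straight into the destination process's TBP so the
             * process can consume the message without further copying. */
            return tbp_deposit(table[i].tbp, payload, (size_t)hdr->len);
        }
    }
    return -1;                /* no matching endpoint: drop the message */
}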
7. Performance Measurement: In this section, we report the performance of DP and compare it with other research results. Benchmark tests are designed to measure round-trip latency and bandwidth, and micro-benchmark tests provide an in-depth timing analysis to justify the performance of DP.

7.1 The Testing Environment
All benchmark measurements are obtained with a software approach; the programs are compiled with gcc using the -O2 optimization option, and the tests run in multi-user mode. For more accurate timing, we built our timing function on the Pentium TSC, a counter incremented by one on every CPU clock cycle, which allows execution time to be measured at very high resolution (a generic sketch of such a timer appears below).

7.2 Bandwidth and Round-trip Latency
To measure the round-trip latency, we implemented an SPMD version of a Ping-Pong program using the DP API. The Ping-Pong program measures the round-trip time for message sizes from 1 byte to 1500 bytes. Timing results are obtained by averaging this message-exchange operation over 1000 iterations.
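For reference, a TSC-based timer of the kind described in 7.1 can be built around the x86 rdtsc instruction. This is a generic sketch with GCC-style inline assembly, not the group's actual benchmark code; the 133 MHz clock rate is taken from the test machine mentioned in the abstract.

/* Generic TSC-based timing sketch (x86, GCC inline assembly); not the
 * actual DP benchmark code. Cycle counts are divided by the CPU clock
 * rate to convert to seconds. */
#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    const double cpu_hz = 133e6;   /* assumed: the Pentium 133 MHz test machine */
    unsigned long long start = rdtsc();

    /* ... code under measurement, e.g. one ping-pong iteration ... */

    unsigned long long end = rdtsc();
    printf("elapsed: %.3f microseconds\n",
           (double)(end - start) / cpu_hz * 1e6);
    return 0;
}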
Figure 5: Latency: Fast Ethernet: Sending 1 byte one-way back-to-back latency: P2-450: 20.80us, P-133: 24.80us, P2-233: 26.78us, AMD K6/2-350: 31.87us
Figure 6: Latency: Fast Ethernet: Sending 1 Byte to 1500 Bytes
Figure 7: Latency: ATM vs. Fast Ethernet, measured on Pentium 133 MHz PCs, Sending 1 Byte to 1500 Bytes
Figure 8: Comparison of Communication Bandwidth between ATM and Fast Ethernet using DP

8. Cost Model and Performance Analysis: To identify the overheads along the communication path, we propose a cost model that includes five parameters:
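As an illustrative reading only (an assumption, not necessarily the paper's exact parameterization), a five-parameter breakdown of the time between two processes P1 and P2 can be written as the sum of user-level send, kernel-level send, network, kernel-level receive, and user-level receive components:

    T_oneway ≈ O_u,send + O_k,send + T_net + O_k,recv + O_u,recv

Figures 9 to 13 show the measured breakdowns.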
Figure 9: Breakdown of communication time between two processes P1 and P2
Figure 10: Overheads breakdown in sending 1500 bytes using Fast Ethernet
Figure 11: Overheads breakdown in sending 1 byte using Fast Ethernet
Figure 12: Overhead Breakdown Comparison Between ATM & Fast Ethernet, Measured on Pentium 133 MHz PCs: Overheads in sending 1500 bytes
Figure 13: Overhead Breakdown Comparison Between ATM & Fast Ethernet
Figure 14: Performance Comparison with other implementations
Research Students: David Lee (Graduated), Raymond Wong, Anthony Tam, Wenzhang Zhu.
Technical Papers: