FOLD3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models

Fanxin Li, Shixiong Zhao*, Yuhao Qing, Xusheng Chen, Xiuxian Guan, Sen Wang, Gong Zhang, Heming Cui

Abstract—Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavy-weight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining the training performance (e.g., in Megatron, data parallelism caused all GPUs idle for over 44% of the training time) over commodity cloud networks. To deserialize these communicational and computational tasks, we propose the AIAO scheduling (for 3D parallelism) which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communicational tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments' computational tasks. We realized this idea in our FOLD3D training system. Extensive evaluation shows FOLD3D eliminated most of the all-GPU 44% idle time in Megatron (caused by data parallelism), leading to 25.2%-42.1% training throughput improvement compared to four notable baselines over various settings; FOLD3D's high performance scaled to many GPUs.

Index Terms—Deep Learning, distributed training, GPU, DNN, 3D parallelism, pipeline parallelism, Machine Learning

1 INTRODUCTION

The high modeling capacity of a large DNN (e.g., GPT3 [1] with 175 billion parameters) has made training or fine-tuning (essentially, training) such a model prevalent and frequent on commodity clouds. Various cloud tenants frequently train or fine-tune such a large DNN for broad applications [2], [3], [4], [5], [6], [7] with their own private datasets and application needs. The working-set memory (i.e., in-GPU memory, unless otherwise specified) needed for training the model far exceeds the capacity of individual accelerators (e.g., GPUs), which has spurred parallel techniques that split a DNN model across devices.

3D parallelism [8], [9] (Figure 1a) is a crucial DNN training technique that combines and orchestrates three parallelism dimensions. Tensor parallelism (TP) splits a single DNN operator (often too large to fit in one device) over devices. Pipeline parallelism (PP) [10], [11] places different operator sets (i.e., pipeline stages) of a DNN model over devices and pipelines the execution of multiple micro-batches (i.e., splits of a single SGD batch which is a set of training inputs for each SGD [12] iteration) to reduce devices’ idling time, as Figure 1 shows. Data parallelism (DP) replicates the model across devices, lets each replica handle one micro-batch, and synchronizes the gradients produced by all micro-batches after finishing one SGD batch [13].

Overall, the end-to-end performance of 3D parallel training is determined by two phases: a configuration phase and a scheduling phase. First, given a DNN model and an AI cluster of N GPU devices connected by hierarchical inter-links (e.g., NVLink [14] within a host and RDMA [15] across hosts), the configuration phase determines the number of splits in TP t, the number of splits in PP p, and the number of DP replicas d, where t * p * d = N. Second, given the above 3D configuration, the scheduling phase determines the order in which the devices actually execute the computational tasks of each micro-batch and the communicational tasks between devices (TP sync, PP sync, and DP sync in Figure 1). The two phases collectively decide the effective total GPU ALU utilization, under the bounds of per-GPU memory and the heterogeneous inter-links.
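
As a minimal sketch of the search space that the configuration phase explores, the snippet below enumerates every (t, p, d) triple satisfying t * p * d = N. The function name and the `gpus_per_host` parameter are hypothetical, and restricting t to divisors of the host size is an assumption that reflects the common practice of keeping TP inside one host; real planners such as Megatron-PTD or Piper additionally rank the candidates with a cost model, which is not shown here.

```python
def enumerate_3d_configs(n_gpus, gpus_per_host=8):
    """Return all (t, p, d) with t * p * d == n_gpus.

    Assumption (not from the paper): t must divide gpus_per_host so that
    each tensor-parallel group stays inside a single host.
    """
    configs = []
    for t in range(1, gpus_per_host + 1):
        if gpus_per_host % t != 0 or n_gpus % t != 0:
            continue
        rest = n_gpus // t  # GPUs left for the PP and DP dimensions
        for p in range(1, rest + 1):
            if rest % p == 0:
                configs.append((t, p, rest // p))
    return configs


configs = enumerate_3d_configs(64)
for t, p, d in configs[:5]:
    print(f"t={t} p={p} d={d}")
```

Every returned triple is a legal placement; which one is fastest depends on the model and the inter-links, which is exactly what the configuration phase decides.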

Many recent works [8], [16], [17] focus on finding an optimal 3D configuration. For example, to place DNN models that are too large to fit in one device, while TP and PP are both suitable for splitting a model, Megatron-PTD [8] and Piper [16] prefer TP over PP in the presence of fast inter-links such as NVLink (often available within a host), as TP often achieves higher computational efficiency [8] in such cases. Conversely, for inter-links such as RDMA and Ethernet, PP is favored [8], [16]. Specifically, the RDMA or Ethernet in the highest-end commodity cloud (e.g., AWS) is up to 400Gbps for the entire cluster. Moreover, the networks in the same commodity cloud are shared by many tenants. Even for the top-of-the-line AWS cloud, the network bandwidth for a single tenant is often merely up to 70 or 80 Gbps (confirmed in [6]).

Unfortunately, despite much effort in optimizing the configuration phase, existing 3D training systems are inevitably trapped in a serialization problem in the scheduling phase, where heavy communication blocks the computation and leaves devices idle. Specifically, as shown in

![A 3D configuration example of a DNN model with 12 layers (A-L)](image)

Figure 1: a) 3D Parallelism. Each gray box is a GPU device. Sync stands for the communications that synchronize each parallelism dimension. b) A conceptual illustration of the serialization problem and our idea. The gradient computational tasks are represented by the backward passes of DNN training (with forward passes omitted and the full scheduling being shown in Figure 1). In AIAO scheduling, a copy of the DNN gradients produced by a micro-batch is sliced into four segments (with distinct colors), and the same segments (i.e., the same colored ones) are grouped together during pipelining. Compared with the FIFO-based scheduling, our AIAO scheduling moves the DP sync tasks off the performance critical path, at the cost of longer total lifecycles of activation checkpoints and a larger peak GPU memory.

Empirically, in the four most notable baseline systems we extensively evaluated (e.g., Megatron v3.0 [26], the latest 3D parallel training system released by Nvidia in May 2022), the serialization problem makes the DP sync communication tasks heavily block the computational tasks of the next training step and causes GPUs to idle for up to 44% of the total training time, over a 256 A100 cluster with 200Gbps inter-host links (an extremely private cluster setting that limits the number of cores per GPU to 32). Nevertheless, because of the serialization problem, the performance of these 3D parallel training systems is inevitably capped by the sum of computational and communicational costs (see §3.3).
of input micro-batches, our AIAO essentially needs all input micro-batches (all-in) enqueued, so as to allow the grouped scheduling of segment sub-tasks corresponding to these micro-batches. The full scheduling depicted in Figure 3 and §3.2 shows that within each segment of the model, our AIAO separately groups the forward pass and backward pass of the pipelined computational tasks on this segment and schedules the grouped tasks according to segments’ forward-backward dependencies. To alleviate serialization, a segment’s DPsync task is scheduled to be overlapped with both the backward and forward pass computational tasks in a mirrored way.

However, one challenge for the above AIAO scheduling is that altering the FIFO principle inevitably lengthens the total activation lifecycles and increases the peak memory usage of GPUs (§5.5).

Our second observation, which addresses this challenge, is that there exists an invariant architectural opportunity for any pipeline schedule: all micro-batches share the same size of computation window (the sum of one micro-batch’s forward pass and backward pass), which allows offloading critical activation checkpoints to the host memory regardless of the order in which micro-batches are enqueued and dequeued. By doing so, the increased memory burden is shifted from GPUs to hosts, as the CPU memory on a host is orders of magnitude larger than the memory capacity of each GPU (§2.1). Leveraging this observation, the AIAO scheduling is accompanied by two key memory squeezing mechanisms (§4): an intra-segment offloading mechanism and an inter-segment lazy communication mechanism, making the AIAO scheduling incur negligible extra GPU memory burden when training large DNNs (Table 2).

We implemented the AIAO scheduling in Fold3D based on Megatron [26], a well-engineered and open source 3D training system, by adding 5371 LoC. We compared Fold3D against Megatron-SP [28] (v3.0.2, the latest release), Megatron-PTD [8] (v2.5.0), DeepSpeed Zero3 [29] (DSpeedZ3), and DeepSpeed 3D [19] (DSpeed3D), covering three notable and open source 3D training systems and one state-of-the-art data parallel training system (DSpeedZ3). Both Fold3D and Megatron-SP are enabled with sequence parallelism (§5), one of the latest memory squeezing techniques [28] complementary to 3D parallelism. Our evaluation was done both over a high-profile cluster (256 A100 GPUs) and a middle-profile cluster (64 V100 GPUs). The numbers of GPUs evaluated are comparable to the latest works [16], [17] that study 3D training. We evaluated all five notable large Transformer [30] based models [1], [29], [31], [32], [33] evaluated by recent systems [8], [16], [29]. The extensive evaluation shows that:

- Fold3D is high-performance on commodity cloud networks. Fold3D achieved 25.2%-42.1% higher throughput than the baselines with all systems being deployed on both the A100 cluster and the V100 cluster.
- Fold3D’s high performance is robust. Under various stringent model shapes (e.g., a slim model with a large layer number and a small layer size), Fold3D’s high performance was consistently observed.
- Fold3D is scalable. Our scalability evaluation over 256 A100 GPUs shows that Fold3D’s performance gain over Megatron was stable (~31%) from 64 GPUs to 256 GPUs with the model’s scale increased correspondingly (i.e., weak scaling, see Figure 2). When each tenant trains a GPT-3 instance, Fold3D can save the tenant about 100,000 kWh of electricity over 256 A100 GPUs, which is roughly equal to the electricity used by 100 families per year or over the lifetime of tens of electric cars (§6.5).
- The increase (relaxing) of Fold3D’s memory consumption is moderate. Table 2 shows that for each host with eight A100 GPUs, Fold3D consumed in total 8.1GB-17.3GB extra CPU memory, while Fold3D’s GPU memory usage was comparable to baselines’.

Our contributions are as follows. We take the first step to systematically summarize (Figure 1) and quantitatively model (§3.3) the serialization problem in existing 3D training systems. We propose the idea of folding, design the AIAO scheduling, and practically realize it in Fold3D. Leveraging these contributions, we maximally overlap computation and communication tasks in 3D parallel training. Fold3D can help many more researchers and enterprises enjoy the benefits of training and fine-tuning large DNN models on commodity clouds. We believe Fold3D can benefit various emerging large DNN paradigms such as Mixture-Of-Experts (MoE) [34], [35], Pathways Language Model (PaLM) [36] and Multi-Modal Learning [37], because 3D parallelism is the foundation for these paradigms to scale up; meanwhile, we envision that it would be challenging to fuse the AIAO scheduling with these new paradigms (§5.7), and we leave this for future work. Our code and evaluation results are released at github.com/hku-systems/fold3d.

2 BACKGROUND AND MOTIVATION

Existing large models with billions of parameters trained on 3D parallel training systems are mainly stacked up with homogeneous blocks (e.g., transformer block). The repeated structure of these models is their fundamental advantage to obtain better model capacity (and thus higher accuracy) by simply scaling up the model size [30], [31]. In this paper, same as Megatron [8], we assume all models are repeatedly stacked transformer models.

2.1 Parallelism Dimensions

ML model training proceeds with iterations of forward and backward pass computations on micro-batches of a dataset. However, fitting existing large models into a single GPU for training is unrealistic [25], which has driven the development of parallel training systems that cope with two crucial requirements. First, a system should fit the model’s parameters and intermediate results (e.g., activation maps) into a GPU’s memory. Even on the top-of-the-line AWS AI clusters, each GPU’s memory is limited (e.g., an A100 GPU has 40GB or 80GB memory), while the CPU memory on a host is orders of magnitude larger (e.g., Terabytes) and much cheaper than a GPU’s memory. Second, a system should be capable of scaling up the training to more GPUs. Certainly, both requirements should be met as efficiently (more effective FLOPS per GPU) as possible.

**Data Parallelism (DP).** In data parallelism [38], each worker has a copy of the model, and the dataset is split across workers. The workers synchronize their gradients periodically via an all-reduce [13] communication (i.e., DP sync) to maintain a consistent version of the parameters. For a large model which does not fit in a single worker, although pure DP practice [19], [39], [40] can be used to train it with various optimizations, it incurs excessive extra costs on the critical path for offloading activations and optimizer states to CPU memory [27] and NVMe storage [11], or sharding them across GPUs [29]. Moreover, its scalability [8] is bounded by communication on low-end networks and the size of a total batch (a set of data for producing each parameter update).

**Tensor Parallelism (TP).** Tensor (Model) parallelism [29] partitions input and parameter tensors of a layer (e.g., transformer multi-head self-attention layer [30]) across GPUs. Within each repeated (transformer) block (a set of layers), during both forward pass and backward pass, TP requires an all-reduce communication (i.e., TP sync) to aggregate the tensors between repeated blocks, which typically lies on the critical path and is network-intensive. Therefore, TP is usually deployed across GPUs within the same server to use fast intra-server GPU-to-GPU links (e.g., NVLink [32]). TP is mainly adopted as a complementary technique to help existing parallel training systems [8], [19], [25], [41] support larger transformer layers.

**Pipeline Parallelism (PP).** Pipeline (model) parallelism [10], [11], [20], [22], [43] shards the layers of a model across multiple GPUs; each shard is called a pipeline stage, and activation tensors are propagated between stages via point-to-point communication (i.e., PP sync). A total batch is split into micro-batches (a micro-batch is the minimum unit for each GPU’s forward and backward pass computations); execution is then pipelined across micro-batches. In the AFAB (all-forward-all-backward) scheduling, when used on symmetric models, each stage first executes the forward passes for all micro-batches of a total batch, followed by the backward passes for all micro-batches. For its simplicity and easy-to-integrate nature, AFAB scheduling is widely adopted by systems such as HetPipe [45] and DeepSpeed 3D [19].

**1F1B Scheduling.** PipeDream [10] proposes the 1F1B scheduling, where one backward pass immediately pre-empts the execution as soon as its required forward pass is finished (for the last stage) or the backward passes it depends on are finished (for other stages). 1F1B scheduling is adopted by Megatron [8] and recent pure pipeline parallel training systems [22] (e.g., Out-Of-Order [21]) with a flush inserted. Compared with AFAB scheduling, 1F1B scheduling has a smaller GPU memory footprint. Nevertheless, we embrace a CPU offloading scheme [4] of activation checkpoints to make the AIAO scheduling of FOLD3D not abuse GPU memory.
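
The memory difference between the two FIFO-based schedules can be illustrated with a small sketch. The code below generates the per-stage operation order of AFAB and of a flushed, non-interleaved 1F1B schedule, then counts how many micro-batches hold live activations at the same time. The function names are hypothetical, and the warm-up rule (p - stage - 1 forward passes before the first backward pass) is the commonly used non-interleaved form, assumed here rather than taken from this paper.

```python
def afab_schedule(m):
    """All forward passes first, then all backward passes."""
    return [("F", k) for k in range(m)] + [("B", k) for k in range(m)]


def one_f_one_b_schedule(m, p, stage):
    """Flushed, non-interleaved 1F1B order for one stage (0-indexed)."""
    warmup = min(p - stage - 1, m)  # assumed warm-up depth
    ops = [("F", k) for k in range(warmup)]
    next_f, next_b = warmup, 0
    while next_f < m:  # steady state: one forward, then one backward
        ops.append(("F", next_f))
        next_f += 1
        ops.append(("B", next_b))
        next_b += 1
    while next_b < m:  # cool-down: drain the remaining backward passes
        ops.append(("B", next_b))
        next_b += 1
    return ops


def peak_in_flight(ops):
    """Peak number of micro-batches whose activations are still alive."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


m, p = 8, 4
print("AFAB peak:", peak_in_flight(afab_schedule(m)))
for stage in range(p):
    print("1F1B stage", stage, "peak:", peak_in_flight(one_f_one_b_schedule(m, p, stage)))
```

Under these assumptions, AFAB keeps all m micro-batches alive at its peak, whereas 1F1B keeps at most p - stage of them, which is why 1F1B needs less activation memory.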

Overall, despite the differences between the above scheduling algorithms in terms of how forward passes and backward passes are interleaved, all existing pipeline scheduling algorithms are FIFO-based, as shown in Figure 1b. Specifically, viewing the forward pass computational tasks and the backward pass computational tasks separately, micro-batches are executed in the order in which they are fed into the execution queue, and later enqueued micro-batches need to wait for the dequeuing of the previous micro-batches’ tasks. In this paper, instead of following this FIFO principle, FOLD3D enqueues all the micro-batches to conduct its subtle scheduling (§3.2) of computational tasks that alleviates the serialization problem in existing 3D parallel training systems.

2.2 3D Parallelism

Despite many efforts made towards all three aforementioned scaling dimensions of parallel training, no single scaling dimension can scale indefinitely. The reason is that a single scaling dimension may be bounded by various scaling efficiency bounds [8]: TP incurs frequent and high-volume intra-server communication tasks and thus is only suitable within a server (host); DP is bounded by cross-server DP sync communication tasks and the total batch size [8]; PP is bounded by bubbles and the total number of layers [10]. 3-Dimensional Parallelism (3D) combines all three dimensions so that when one dimension reaches its scaling efficiency bounds, a 3D parallel system can scale along other dimensions.

**The Serialization Problem.** Existing 3D parallel training systems all suffer from the serialization problem: most of the communicational tasks are serialized after the computational tasks and scattered along the performance critical path. When these systems are deployed on commodity clouds with networks of a few hundred Gbps, the serialization problem becomes much more pronounced. Compared to the reported experiments from Megatron [8] over a cluster of 256 A100 GPUs with 1.6Tbps dedicated inter-host links, the per-GPU hardware utilization sharply dropped from around 140 TFLOPs to 67.4 TFLOPs when training a GPT3-18B model over a 200Gbps cloud network. Note that in the commodity cloud, each tenant can only get 70 to 80 Gbps bandwidth [6].

Fig. 3: FOLD3D’s Architecture. L(A) and L(B) mean layer A and layer B respectively. The gray boxes are FOLD3D’s runtime components.

3 FOLD3D SYSTEM

3.1 Architecture Overview

Figure 3 shows the architecture of FOLD3D. To deploy a model for training, a user needs to feed the model and the 3D parallelism configuration (generated by Megatron-PTD [8] or Piper [16]) into FOLD3D [44], and FOLD3D will automatically select a proper segment number [4] and generate an AIAO schedule for the model. Empirically, we find that the 3D parallelism configuration generated by Piper is also optimal for FOLD3D (see §6.2).

For each GPU device, FOLD3D launches one runtime containing an executor, a communicator, and an offloader. On each GPU, FOLD3D’s executor runs a partition of the AIAO scheduling as shown in Figure 4 and assigns the communication (sync) tasks to FOLD3D’s communicator. The communicator schedules the communication tasks. The offloader manages activation checkpoints [8].

Executor. Each executor is a dedicated main process which manages one GPU device, controls the computation scheduling, and interacts with FOLD3D’s communicator and offloader. In particular, the executor runs a static AIAO scheduling, performs the training computation tasks (including both forward pass and backward pass computations), and assigns three types of communication (sync) tasks (including DPsync, TPsync, and PPsync) to the communicator. Meanwhile, the executor informs the offloader of the execution status before each computation task starts, so that the offloader can manage the checkpoint offloading/prefetching without hurting training performance on the critical path.

Communicator. Each communicator is the executor’s child thread, which receives communication tasks and schedules the tasks to the underlying communication library (Algorithm 2). We implemented a preemptive communication scheduling mechanism in the communication library. Specifically, the latency-sensitive communication tasks (PPsync) can preempt the all-reduce communication tasks (DPsync, TPsync) to avoid being blocked by the all-reduce tasks [5].
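
One way to picture the preemptive communication scheduling is a priority queue in which all-reduce tasks are issued in chunks, so that a latency-sensitive PPsync task arriving later can jump ahead of the chunks that have not started yet. The sketch below is only an illustration under that assumption: the `CommQueue` class, the chunking granularity, and the priority values are hypothetical and do not describe FOLD3D's actual communication library.

```python
import heapq
from itertools import count

# Assumed priorities: a lower value is served first. PPsync outranks the
# all-reduce tasks (DPsync, TPsync), mirroring the preemption rule above.
PRIORITY = {"PPsync": 0, "TPsync": 1, "DPsync": 1}


class CommQueue:
    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that keeps FIFO order per priority

    def submit(self, kind, name, chunks=1):
        """Enqueue a task, split into `chunks` independently issued pieces."""
        for c in range(chunks):
            heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), f"{name}#{c}"))

    def pop(self):
        """Return the next chunk to hand to the network."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = CommQueue()
queue.submit("DPsync", "dp(H,G)", chunks=3)
started = queue.pop()              # the first all-reduce chunk is already on the wire
queue.submit("PPsync", "pp(mb1)")  # a latency-sensitive transfer arrives
remaining = [queue.pop() for _ in range(len(queue))]
print(started, remaining)
```

In this toy run the PPsync transfer is served before the two pending all-reduce chunks, so it waits for at most one chunk rather than the whole all-reduce.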

Offloader. Each offloader is the executor’s child thread, which coordinates with the executor and offloads the activation checkpoints to the CPU memory after they are generated in forward passes and prefetches them back to GPU memory before they are required in backward passes.

3.2 AIAO Scheduling

We propose AIAO (Figure 4), a new 3D parallel scheduling algorithm that co-schedules and parallelizes the computation and communication tasks to fully (but not overly) utilize both the GPU devices’ computation capacity and the networks’ communication bandwidth. AIAO works in three steps: first, it folds a model into segments (step 1); second, it pipelines each segment across all pipeline stages (step 2); third, it schedules the communication tasks to maximally parallelize them with the computation tasks (step 3). Same as Megatron, AIAO is bulk synchronous [46], where a pipeline flush is inserted for parameter update to retain the convergence guarantee of Stochastic Gradient Descent [47] training.

Step 1. As shown in Figure 4, the first step is the folding of all layers: given a 3D parallelism configuration with PP stage number denoted as $p$, each model is divided (folded) into a number (denoted as $ns$, inferred in §4) of segments, and each segment is further divided into $p$ stages. For example, if one model has 12 layers (from $A$ to $L$, alphabetically), $ns=2$, and $p=3$, FOLD3D assigns GPU 0 with layers $(A, B), (G, H)$; GPU 1 with layers $(C, D), (I, J)$; GPU 2 with layers $(E, F), (K, L)$. Although Megatron [8] already has a segmenting scheme, Megatron’s scheme differs from FOLD3D’s in purpose. Megatron’s scheme is designed to reduce bubbles in its 1F1B pipeline, whereas the segmenting scheme in FOLD3D’s AIAO scheduling is designed to unleash the potential of overlapping DPsync with computation tasks. Moreover, FOLD3D’s segmenting scheme is used to balance the reduced DPsync cost (by overlapping) against the increased PPsync tasks (by folding).
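
The folding in step 1 can be sketched in a few lines. The helper below is hypothetical (its name and signature are not from FOLD3D) and assumes that the layer count is divisible by $ns \times p$; it reproduces the 12-layer example above.

```python
def fold_layers(layers, ns, p):
    """Map layers to GPUs: ns segments, each split into p pipeline stages."""
    block, remainder = divmod(len(layers), ns * p)
    if remainder:
        raise ValueError("layer count must be divisible by ns * p")
    placement = {gpu: [] for gpu in range(p)}
    for seg in range(ns):
        for gpu in range(p):
            start = (seg * p + gpu) * block
            placement[gpu].append(tuple(layers[start:start + block]))
    return placement


placement = fold_layers(list("ABCDEFGHIJKL"), ns=2, p=3)
for gpu, blocks in placement.items():
    print("GPU", gpu, blocks)
```

Each GPU ends up holding one block per segment, so a pass over one segment touches every GPU once before the next segment starts.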

Step 2. The fundamental idea of step 2 is that a model layer’s gradient should be obtained as early as possible (thus the layer’s computation tasks should be scheduled in a bundle), which decouples this layer’s DPsync communication task from its computation tasks, so that the DPsync task can be scheduled to overlap with other layers’ computation tasks. Therefore, in the second step, AIAO schedules the computation tasks in a bundled and spiral way, where each segment is further split and pipelined across all stages during the default injection of multiple micro-batches in any PP-enabled training schedule [42]. For example, in Figure 4, during the forward passes of AIAO, the first segment is further partitioned into $p$ (three) stages, and the first segment $((A, B), (C, D), (E, F))$ is injected with nine micro-batches (defined by the user) in a pipelined way in their forward passes. The following segments are then executed in order.

During AIAO’s backward passes, reversely, the last segment is executed first $((L, K), (J, I), (H, G))$ in the pipeline. The reason is that a backward pass must always start from the last layer of a DNN model [48]. Meanwhile, in each stage, after finishing the second segment’s backward pass computation tasks (which take roughly twice the time of their corresponding forward passes, due to the activation checkpointing used to save GPU memory in all PP-enabled training systems’ schedules [22]), the DPsync tasks (e.g., DPsync(H, G)) of all layers in this segment can immediately be launched, and the first segment’s computation tasks can start in parallel. When \(ns = 1\), AIAO essentially becomes GPipe’s AFAB schedule. When \(ns = x\), the DPsync tasks of \(x - 1\) segments can be overlapped with computations, and only the DPsync of one segment (the first) cannot be overlapped with other tasks.
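
The ordering described in step 2 can be traced for a single stage with a short sketch. The function names are hypothetical, and the trace ignores timing and cross-stage dependencies; it only records the order in which one stage issues forward passes, backward passes, and DPsync launches, and then reports which DPsync tasks still have backward work behind them to overlap with.

```python
def aiao_stage_trace(ns, m):
    """Order of tasks on one stage for one training iteration."""
    trace = []
    for seg in range(ns):  # forward passes: first segment first
        trace += [("F", seg, k) for k in range(m)]
    for seg in reversed(range(ns)):  # backward passes: last segment first
        trace += [("B", seg, k) for k in range(m)]
        # the segment's gradients are complete, so its DPsync can be launched
        trace.append(("DPsync", seg, None))
    return trace


def overlappable_dp_syncs(trace):
    """Segments whose DPsync is launched while backward work still remains."""
    segments = []
    for i, (kind, seg, _) in enumerate(trace):
        if kind == "DPsync" and any(k == "B" for k, _, _ in trace[i + 1:]):
            segments.append(seg)
    return segments


print(overlappable_dp_syncs(aiao_stage_trace(ns=1, m=3)))
print(overlappable_dp_syncs(aiao_stage_trace(ns=2, m=3)))
print(overlappable_dp_syncs(aiao_stage_trace(ns=4, m=3)))
```

With one segment nothing can overlap, which matches the AFAB case; with $ns$ segments, $ns - 1$ DPsync tasks have later backward passes to hide behind, and only the first segment's DPsync is left exposed.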

Step 3. An ideal case of overlapping DPsync tasks with computation is that the DPsync tasks of a segment’s layers are faster than another segment’s backward pass computation tasks. For instance, DPsync(H, G) finishes no later than the backward pass of (B, A). In this case, only the first segment’s DPsync tasks lie on AIAO’s performance critical path, because the first segment’s next forward pass should wait until these DPsync tasks finish, which is the optimal case as discussed in [22].

However, in a commodity cloud’s network, the finish time of a segment’s DPsync tasks (e.g., DPsync(H, G)) can be longer than the overlapped backward pass (e.g., the backward pass of (B, A)). Therefore, FOLD3D truncates the longer part of the DPsync tasks and overlaps this part with a corresponding forward pass (e.g., the forward pass of (A, B)) in the next iteration (details are in [4]). For instance, in Table 2 for FOLD3D, the time spent in the “DPsync” column of all segments mostly overlapped with the time spent in the “Bwd” column of all (other) segments; for Megatron, these two columns were serialized in its training performance critical path.

3.3 Performance Modeling

**Critical Path Analysis.** In conventional 3D parallel training systems [8, 16, 19], the execution time (i.e., defined as the critical path) of one iteration processing a whole data batch can be divided into computation time \(T_{comp}\), communication time \(T_{comm}\), and bubble time \(T_{bubble}\). Generally, the performance model used for evaluating 3D configurations in Megatron, Alpa and Piper can be unified as:

\[
T_d = T_{comp} + T_{comm} + T_{bubble} \tag{1}
\]

FOLD3D overlaps communication with computation, and the communication time in FOLD3D can be further divided into overlapping time \(T_{ol}^{comm}\) and non-overlapping time \(T_{no}^{comm}\). The critical path of FOLD3D is thus defined as:

\[
T_d = \max(T_{comp}, T_{ol}^{comm}) + T_{no}^{comm} + T_{bubble} \tag{2}
\]
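
To make the difference between the two critical-path models concrete, the sketch below evaluates both for one set of made-up timings. The function names and all numbers are hypothetical and serve only to illustrate the arithmetic; they are not measurements.

```python
def serialized_iteration_time(t_comp, t_comm, t_bubble):
    """Conventional model: communication is serialized after computation."""
    return t_comp + t_comm + t_bubble


def overlapped_iteration_time(t_comp, t_comm_ol, t_comm_no, t_bubble):
    """Overlapped model: only the non-overlapping part of communication is exposed."""
    return max(t_comp, t_comm_ol) + t_comm_no + t_bubble


# Hypothetical timings in arbitrary units: 8 units of communication,
# of which 6 can be overlapped with computation and 2 cannot.
t_comp, t_bubble = 10, 1
t_comm_ol, t_comm_no = 6, 2

t_serial = serialized_iteration_time(t_comp, t_comm_ol + t_comm_no, t_bubble)
t_overlap = overlapped_iteration_time(t_comp, t_comm_ol, t_comm_no, t_bubble)
print(t_serial, t_overlap)
```

As long as the overlapping communication does not exceed the computation time, the overlapped model hides it entirely; once it does, the excess reappears on the critical path through the max term.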

As formulated in recent work [8, 16], the computation time \(T_{comp}\) is orthogonal to the scheduling strategy and relates only to the given 3D configuration. Therefore, given the same DNN model and 3D configuration, the \(T_{comp}\) of FOLD3D should be the same as that of Megatron and any other 3D parallel training system.

\(T_{bubble}\) is the bubble time in the pipeline, and is calculated as the sum of startup times of all pipeline stages. The startup time of a pipeline stage is defined as the sum of forward and backward times of its first micro-batch.

We denote the pipeline stage number as \(p\). According to the segment-based scheduling, the bubble in FOLD3D consists of \(p - 1\) forward passes and \(p - 1\) backward passes of a segment’s micro-batch. Given the segment number \(ns\) and the micro-batch number \(ms\), \(T_{bubble}\) is:

\[
T_{bubble} = (p - 1) \times \frac{T_{comp}}{ns \times ms} \tag{3}
\]
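
A quick numeric check of the bubble formula shows how folding shrinks the bubble. The function name and the input values below are hypothetical.

```python
def bubble_time(p, t_comp, ns, ms):
    """Bubble time: (p - 1) forward-plus-backward slots of one segment's micro-batch."""
    return (p - 1) * t_comp / (ns * ms)


# Hypothetical setting: 4 pipeline stages, 8 micro-batches, 96 time units of compute.
bubbles = {ns: bubble_time(p=4, t_comp=96.0, ns=ns, ms=8) for ns in (1, 2, 4)}
print(bubbles)
```

Doubling the segment number halves the bubble because each startup slot now covers only one segment of a micro-batch rather than the whole stage.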

Data parallelism can be performed inside a host and across hosts. Thus, we define the intra-host data parallel size \(d_{intra}\) as the number of GPUs in the same data parallel group on a host, and the inter-host data parallel size \(d_{inter}\) as the number of hosts in a data parallel group. Assuming \( w \) is the total model parameter size that all GPUs in a host contain, the data parallel communication time \( T_{dp}^{\text{comm}} \) is \( \frac{2(d_{\text{inter}} - 1)w}{d_{\text{inter}}r} \), where \( r \) is the network bandwidth of a host. We assume the traditional all-reduce [18] communication for the DP sync here. The data parallel communication of all \( ns \) segments except for the first segment in a stage can overlap with the computation.
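
The all-reduce cost above is easy to evaluate directly. In the sketch below the function names are hypothetical, and the inputs (16 GB of parameters per host, four hosts per data parallel group, a 200 Gbps link taken as 25 GB/s) are assumed values chosen only to show the arithmetic.

```python
def dp_allreduce_time(w_bytes, d_inter, r_bytes_per_s):
    """All-reduce time across hosts: 2 * (d_inter - 1) * w / (d_inter * r)."""
    if d_inter < 1:
        raise ValueError("d_inter must be at least 1")
    return 2 * (d_inter - 1) * w_bytes / (d_inter * r_bytes_per_s)


def overlappable_dp_time(t_dp, ns):
    """Share of the DP sync time belonging to all segments except the first."""
    return t_dp * (ns - 1) / ns


t_dp = dp_allreduce_time(w_bytes=16e9, d_inter=4, r_bytes_per_s=25e9)
print(f"T_dp = {t_dp:.2f} s, overlappable with ns=4: {overlappable_dp_time(t_dp, 4):.2f} s")
```

With a single host per group the cost is zero, and as the group grows the factor (d_inter - 1) / d_inter approaches one, so the cost approaches 2w / r.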

In FOLD3D, a tensor is transmitted in both the forward and backward passes of a micro-batch. The tensor size equals the activation size \( a \) of a single layer in the model. Assume there are total \( bs \) tensors transmitted in a stage, the pipeline parallel communication time \( T_{pp}^{\text{comm}} \) is \( \frac{bsa}{d_{\text{inter}}r} + \frac{s}{s + 1}T_{dp}^{\text{comm}} \). The pipeline parallel communication between stages can be transmitted asynchronously. The pipeline parallel communication of all \( ms \) micro-batches except for the first micro-batch of the first segment in a stage can overlap with the computation. The overlapping communication \( T_{ol}^{\text{comm}} \) is:

\[
T_{ol}^{\text{comm}} = \frac{ms - 1}{ms}T_{pp}^{\text{comm}} + \frac{s - 1}{s}T_{dp}^{\text{comm}} - T_{\text{comp}} \tag{4}
\]

Thus, the non-overlapping communication time \( T_{no}^{\text{comm}} \) is:

\[
T_{no}^{\text{comm}} = \frac{T_{pp}^{\text{comm}}}{ms} + \frac{T_{dp}^{\text{comm}}}{s} + \frac{T_{lp}^{\text{comm}}}{s} \tag{5}
\]

**Memory Analysis.** The major difference between AIAO scheduling and FIFO-based scheduling is that an activation checkpoint on average incurs a longer lifecycle in AIAO, making the peak working-set memory of FOLD3D larger than that of the other FIFO-based 3D training systems. We take Megatron’s interleaved 1F1B scheduling [9], [16] as an example. The peak memory consumption for activation checkpoints of Megatron is:

\[
(p \cdot (ns + 1) - 1) \cdot \text{SizeOf}(\text{checkpoints}) \tag{6}
\]

The peak memory consumption for activation checkpoints of FOLD3D is:

\[
ms \cdot ns \cdot \text{SizeOf}(\text{checkpoints}) \tag{7}
\]

Overall, Megatron’s memory consumption for stashing activation checkpoints depends only on the pipeline stage number \( p \) and the segment number \( ns \), while FOLD3D’s total memory consumption for stashing the activation checkpoints is proportional to the number of micro-batches \( (ms) \), which makes FOLD3D’s memory consumption larger than Megatron’s in most cases. Nevertheless, FOLD3D’s offloading mechanism shifts this extra memory burden to the CPU memory, making FOLD3D incur negligible extra GPU memory usage (see Table 2).
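
The two peak-memory expressions can be compared numerically. The sketch below uses hypothetical function names and an assumed setting (p = 4, ns = 2, ms = 16), and measures memory in units of one checkpoint.

```python
def megatron_peak_checkpoints(p, ns, checkpoint_size=1):
    """Peak checkpoint memory of interleaved 1F1B: (p * (ns + 1) - 1) * size."""
    return (p * (ns + 1) - 1) * checkpoint_size


def fold3d_peak_checkpoints(ms, ns, checkpoint_size=1):
    """Peak checkpoint memory of AIAO before offloading: ms * ns * size."""
    return ms * ns * checkpoint_size


meg = megatron_peak_checkpoints(p=4, ns=2)
fold = fold3d_peak_checkpoints(ms=16, ns=2)
print(meg, fold, fold - meg)
```

In this setting AIAO stashes 32 checkpoints against 11 for interleaved 1F1B; the difference is the amount that the offloading mechanism is meant to move to CPU memory.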

4 FOLD3D RUNTIME

FOLD3D takes a model’s shape setting, the training hyperparameters, and a 3D parallelism setting as inputs and trains the model with 3D parallelism on a GPU cluster. The model shape setting includes hidden size, number of attention heads, number of layers, etc. The training hyperparameters include learning rate, weight decay, etc. The 3D parallelism setting includes pipeline parallelism size \( p \), tensor parallelism size \( t \) and data parallelism size \( d \).

FOLD3D automatically determines AIAO’s segment number for the given model and parallelism setting in order to reach a high training performance. Specifically, an ideal segment number should balance the DP sync (network bandwidth-hungry) and PP sync (latency-sensitive) tasks, and should overlap the communication and computation tasks as much as possible [4, 5], as shown in Figure 4. If the number of segments \( (ns) \) is larger, FOLD3D can move more DP sync tasks off the critical path, and AIAO’s pipeline bubble ratio can decrease. However, these benefits do not come for free: increasing the segment number to \( ns \) invokes the PP sync tasks \( ns - 1 \) more times (Figure 4). Although FOLD3D overlaps PP sync tasks with computation tasks through asynchronous transfer [5], the PP sync tasks may still block the DP sync tasks, because these two kinds of communication tasks contend for the same network. Therefore, FOLD3D determines a near-optimal segment number \( (ns) \) heuristically: it increases \( ns \) until the combination of PP sync and DP sync tasks exceeds the computation time being overlapped.
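
The stopping rule of this heuristic can be sketched as a simple loop. The cost model below is an assumption made for illustration and is not FOLD3D's: it charges a fixed PP sync cost per segment, treats (ns - 1) / ns of the DP sync as the part to be overlapped, and compares their sum against the whole computation time. The function and parameter names are hypothetical.

```python
def choose_segment_number(t_comp, t_dp_sync, t_pp_sync_per_segment, max_ns):
    """Grow ns until PP sync plus overlapped DP sync exceeds the computation time."""
    best = 1
    for ns in range(2, max_ns + 1):
        overlapped_dp = t_dp_sync * (ns - 1) / ns  # assumed overlappable share
        pp_total = t_pp_sync_per_segment * ns      # assumed linear growth with ns
        if overlapped_dp + pp_total > t_comp:
            break
        best = ns
    return best


ns = choose_segment_number(t_comp=11.0, t_dp_sync=8.0, t_pp_sync_per_segment=1.0, max_ns=8)
print("chosen ns:", ns)
```

In a real system the per-candidate costs would come from profiling rather than from a closed-form model, but the shape of the search (increase ns, stop at the first candidate whose communication no longer fits) stays the same.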

The executor realizes the AIAO scheduling, given the 3D parallelism strategy \((p, t, d)\) and the segment number \( ns \). Algorithm 1 describes the executor’s logic: it invokes all sync tasks based on the current injected micro-batch ID and the current GPU’s segment ID to determine its upcoming communication and computation tasks’ interleaving. It first executes all micro-batches’ (in this training iteration) forward passes (line 3) and then executes all micro-batches’ backward passes (line 13). After that, the pipeline flush (line 9) is performed to synchronize all the gradients along the DP dimension and update the model parameters. During the computation tasks, the executor assigns the generated communication tasks to the communicator with each communicated object reference and its current execution status.

**Algorithm 1: FOLD3D Executor**

**Input:** Training iteration \( T \); Micro-batch number \( m \); Segment number \( ns \).

1. for \( i = 1 \) to \( T \) do
   2. for \( j = 1 \) to \( ns \) do
      3. \( \text{seg} \leftarrow \text{getSegment}(j) \);
      4. // wait for line 13 in DPComm to finish
      5. \( \text{seg}.\text{flush}() \);
      6. for \( k = 1 \) to \( m \) do
         7. // prepared by recv() in PPComm
         8. \( \text{input} \leftarrow \text{seg}.\text{getForwardInput}(k) \);
         9. \( \text{output} \leftarrow \text{seg}.\text{runForward}() \);
   10. for \( j = ns \) to \( 1 \) do
      11. \( \text{seg} \leftarrow \text{getSegment}(j) \);
      12. for \( k = 1 \) to \( m \) do
         13. // prepared by recv() in PPComm
         14. \( \text{input} \leftarrow \text{seg}.\text{getBackwardInput}(k) \);
         15. \( \text{output} \leftarrow \text{seg}.\text{runBackward}() \);
         16. // invoke send() in PPComm
         17. \( \text{seg}.\text{setBackwardOutput}(k, \text{output}) \);
      18. // invoke line 6 in DPComm
      19. \( \text{seg}.\text{setBackwardDone}() \);

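
The order in which Algorithm 1 visits segments and micro-batches within one iteration can be sketched as follows. This is only an enumeration of the loop structure, not FOLD3D’s executor: the name `aiao_compute_order` and the `(pass, segment, micro-batch)` tuples are hypothetical, and all communication is omitted.

```python
from typing import List, Tuple

Task = Tuple[str, int, int]  # (pass type, segment ID, micro-batch ID)


def aiao_compute_order(num_segments: int, num_micro_batches: int) -> List[Task]:
    """List the computation tasks of one iteration in the order the executor loops run them."""
    order: List[Task] = []
    # Forward passes: segments in ascending order, all micro-batches of a segment together.
    for seg in range(1, num_segments + 1):
        for mb in range(1, num_micro_batches + 1):
            order.append(("fwd", seg, mb))
    # Backward passes: segments in descending order, again grouped per segment.
    for seg in range(num_segments, 0, -1):
        for mb in range(1, num_micro_batches + 1):
            order.append(("bwd", seg, mb))
    return order


for task in aiao_compute_order(num_segments=2, num_micro_batches=2):
    print(task)
```

Grouping all micro-batches of a segment together is what lets a segment’s gradient synchronization start as soon as that segment’s last backward pass finishes, while other segments are still computing.
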
Algorithm 2 describes FOLD3D’s communicator logic. It splits the DP.sync task of each segment into two subsets. The first subset is launched as soon as it is generated and is overlapped with the upcoming segment’s backward pass (line 11). The second subset is overlapped with the corresponding forward pass computation (line 13). The scheduling outcome is depicted in Figure 4. The FOLD3D communicator automatically decides the split ratio of the two subsets based on the computation times of the backward and forward pass tasks collected at runtime.
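One possible way to derive such a split is sketched below. It assumes, purely for illustration, that the ratio \( r \) is the backward pass’s share of the measured backward plus forward time, and that a DP.sync task consists of gradient buckets that can be assigned to either subset; neither detail is stated in the text, and the names `split_ratio` and `split_dp_sync` are hypothetical.

```python
from typing import List, Tuple


def split_ratio(bwd_time: float, fwd_time: float) -> float:
    """Assumed rule: share of the sync volume to hide behind backward computation."""
    return bwd_time / (bwd_time + fwd_time)


def split_dp_sync(bucket_bytes: List[int], r: float) -> Tuple[List[int], List[int]]:
    """Assign a leading run of buckets to the backward-overlapped subset, up to a share r of the bytes."""
    budget = r * sum(bucket_bytes)
    bwd_part: List[int] = []
    fwd_part: List[int] = []
    used = 0
    for size in bucket_bytes:
        # Once a bucket spills over to the forward subset, all later buckets follow it.
        if not fwd_part and used + size <= budget:
            bwd_part.append(size)
            used += size
        else:
            fwd_part.append(size)
    return bwd_part, fwd_part


r = split_ratio(bwd_time=2.0, fwd_time=1.0)
bwd_part, fwd_part = split_dp_sync([100] * 8, r)
print(len(bwd_part), len(fwd_part))  # prints 5 3
```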

The FOLD3D communicator issues pipeline parallel send() and recv() operations per forward or backward pass (lines 21, 24). At the beginning of each computation task, the communicator issues the recv() operation for the next computation task’s input tensors. The recv() operation should finish before the next computation task starts. At the end of each computation task, the communicator calls send() with the output tensors to transfer them to the next pipeline stage. The communicator also issues TP.sync tasks during the forward and backward passes. The FOLD3D executor waits until the TP.sync tasks finish and then continues the computation.

**Algorithm 2: FOLD3D Communicator**

<table>
<thead>
<tr>
<th>Procedure DPComm():</th>
</tr>
</thead>
<tbody>
<tr>
<td>for i = 1 to T do</td>
</tr>
<tr>
<td>\( DP_{fwd} \leftarrow \emptyset \);</td>
</tr>
<tr>
<td>for j = ns to 1 do</td>
</tr>
<tr>
<td>\( \text{seg} \leftarrow \text{getSegment}(j) \);</td>
</tr>
<tr>
<td>\( \text{seg}.\text{waitBackwardDone}() \);</td>
</tr>
<tr>
<td>\( \text{task} \leftarrow \text{seg}.\text{DP.sync}() \);</td>
</tr>
<tr>
<td>\( \text{task}_{fwd}, \text{task}_{bwd} \leftarrow \text{task}.\text{split}(r) \);</td>
</tr>
<tr>
<td>\( DP_{fwd}.\text{append}(\text{task}_{fwd}) \);</td>
</tr>
<tr>
<td>// finish backward DP.sync task</td>
</tr>
<tr>
<td>\( \text{task}_{bwd}.\text{launch}() \);</td>
</tr>
<tr>
<td>end for</td>
</tr>
<tr>
<td>for j = 1 to ns do</td>
</tr>
<tr>
<td>\( \text{task}_{fwd} \leftarrow DP_{fwd}.\text{popLast}() \);</td>
</tr>
<tr>
<td>// finish forward DP.sync task</td>
</tr>
<tr>
<td>\( \text{task}_{fwd}.\text{launch}() \);</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Procedure PPComm():</th>
</tr>
</thead>
<tbody>
<tr>
<td>for i = 1 to T do</td>
</tr>
<tr>
<td>for j = 1 to ns do // j = ns to 1 for backward</td>
</tr>
<tr>
<td>\( \text{seg} \leftarrow \text{getSegment}(j) \);</td>
</tr>
<tr>
<td>\( \text{input} \leftarrow \text{recv}(1) \);</td>
</tr>
<tr>
<td>\( \text{seg}.\text{setInput}(1, \text{input}) \);</td>
</tr>
<tr>
<td>for k = 1 to m - 1 do</td>
</tr>
<tr>
<td>\( \text{input} \leftarrow \text{recv}(k + 1) \);</td>
</tr>
<tr>
<td>\( \text{seg}.\text{setInput}(k + 1, \text{input}) \);</td>
</tr>
<tr>
<td>\( \text{output} \leftarrow \text{seg}.\text{getOutput}(k) \);</td>
</tr>
<tr>
<td>\( \text{send}(k, \text{output}) \);</td>
</tr>
<tr>
<td>\( \text{output} \leftarrow \text{seg}.\text{getOutput}(m) \);</td>
</tr>
<tr>
<td>\( \text{send}(m, \text{output}) \);</td>
</tr>
</tbody>
</table>
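The receive-ahead pattern of PPComm for a single segment can be traced with a short sketch. It only lists the order in which recv() and send() are issued for \( m \) micro-batches; the function name `pp_comm_order` and the `(operation, micro-batch)` tuples are hypothetical, and no actual communication takes place.

```python
from typing import List, Tuple


def pp_comm_order(num_micro_batches: int) -> List[Tuple[str, int]]:
    """Order of PP send/recv operations issued for one segment, following PPComm's loop."""
    m = num_micro_batches
    ops: List[Tuple[str, int]] = [("recv", 1)]  # the first input must arrive before any compute
    for k in range(1, m):
        ops.append(("recv", k + 1))  # fetch the next input while micro-batch k computes
        ops.append(("send", k))      # ship micro-batch k's output once it is available
    ops.append(("send", m))          # the last output has no later recv to pair with
    return ops


print(pp_comm_order(3))
```

With three micro-batches, each recv for micro-batch k + 1 is issued before the send for micro-batch k, which is how the input of the next computation task can be ready by the time the current one finishes.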

Intra-Segment Offloading. The offloader incorporates both activation checkpointing and CPU offloading. Activation checkpointing is essential in existing PP-enabled training systems: it greatly reduces the GPU memory footprint (i.e., the activation tensors) at the cost of extra re-computation time on GPU ALUs. Activation checkpointing stashes only the output activations (i.e., checkpoints) of selected layers; the remaining activations are recomputed in the backward pass by running the forward pass again.

The offloader decides on a set of activation checkpoints that does not delay the progress of the backward pass while achieving the minimum peak memory footprint. Compared to Megatron’s 1F1B scheduling, FOLD3D’s AIAO scheduling inevitably incurs a larger GPU memory footprint, because AIAO requires the forward passes of all segments to finish before any backward pass starts (see Figure 3). The activation size may still exceed the GPU memory even with activation checkpointing enabled in FOLD3D. Therefore, the offloader offloads activation checkpoints to CPU memory, a common technique adopted from existing systems (e.g., DeepSpeed [19]). All checkpoints generated by a micro-batch are offloaded during the next micro-batch’s forward pass. The offloader then pre-fetches the checkpoints of each micro-batch from CPU memory during the previous micro-batch’s backward pass. Evaluation shows that FOLD3D’s pre-fetching and offloading were overlapped with computation tasks and caused negligible training slowdown (Table 4).

Inter-Segment Lazy Communication. For all micro-batches of segment \( i \) (e.g., segment 0 in red in Figure 4), the last pipeline stage of this segment (stage 2) has to send the output tensors (of layer F) to segment \( i + 1 \)’s first stage (stage 0). However, the execution of segment \( i + 1 \) on stage 0 will not start until segment \( i \) finishes. With naive GPU-to-GPU direct communication, the stashed output tensors would cause extra GPU memory consumption (e.g., the output tensors of micro-batches 0-8 would be stashed in stage 0’s GPUs in Figure 3B). We shift this extra GPU memory to CPU hosts through FOLD3D’s inter-segment lazy communication mechanism, in which the output tensors from the last stage’s GPUs (e.g., stage 2’s GPUs in Figure 3B) are sent directly to the remote CPU memory of the first pipeline stage (stage 0), and these tensors are lazily loaded from the remote CPU memory into GPU memory when they are used in the related computational tasks.

Overall, these runtime algorithms do not affect the bulk synchronous training convergence for three reasons. First, each segment collects the gradients along the DP dimension through DP.sync tasks (see lines 10, 13 in Algorithm 2) and each segment has gradients with respect to all samples in an iteration. Second, FOLD3D ensures that the gradients of each segment are updated to the model parameters before the next iteration of this segment begins (see line 4 in Algorithm 1). Third, FOLD3D does not alter tensors transmitted between GPUs or tensors fed into the model.

5 IMPLEMENTATION
5.1 Preemptive communication scheduling
In FOLD3D’s scheduling, DP.sync tasks are overlapped not only with computation but also with PP.sync tasks. Concurrent DP.sync and PP.sync tasks may contend for network bandwidth, slowing both down. Although most of the DP.sync and PP.sync tasks are overlapped with computation in FOLD3D, the PP.sync tasks during the pipeline warmup period still stay on the performance critical path.

FOLD3D incorporates preemptive communication scheduling to ensure that the PP.sync time does not increase when overlapping with DP.sync tasks. Specifically, when a PP.sync task arrives, FOLD3D pauses the DP.sync tasks in the same node. For a PP.sync send task, only the DP.sync tasks currently sending data are stopped. For a PP.sync receive task, the DP.sync tasks only stop receiving data when the PP.sync task starts to receive data. When a DP.sync task stops receiving data, it first saves the data already received in its buffer, and then sends an interruption signal to the corresponding sender.

5.2 CPU Offloading

FOLD3D incorporates CPU offloading to mitigate the increased GPU memory burden caused by FOLD3D’s scheduling. The activation tensors are continuously moved from GPU memory to CPU memory during the forward pass and moved back to GPU memory in the backward pass. When transferring data from CPU memory to the GPU, the GPU requires the CPU memory page to be pinned (page-locked). Otherwise, a temporary pinned page is created, and the data is first copied to the pinned page and then transferred to the GPU. This is because the OS may swap an unpinned page to disk if the page is inactive. For efficient data transfer, FOLD3D preallocates a pinned CPU memory buffer at the start of training to store the activation tensors. The buffer size is determined by profiling the total activation tensor size in an iteration.

We assign a dedicated CUDA stream to CPU offloading so that the CPU offloading does not block the computation and communication tasks. In the backward pass, the activations are moved back to the GPU in the order in which they are used. FOLD3D synchronizes the computation stream with the offloading stream before each activation is used for recomputation to ensure data correctness.

When sequence parallelism is enabled alongside tensor parallelism, the activations are partitioned across the tensor parallel ranks. As a result, each GPU only offloads its own activation partitions to the CPU memory.

The kernel launch overhead becomes significant when we invoke a GPU kernel for each activation tensor to be transferred. FOLD3D introduces a technique named batched CPU offloading to reduce the kernel launch overhead. Specifically, in the forward pass, a layer’s output tensor saved as the checkpoint is not moved out of GPU memory immediately after it is used by the next layer. Instead, we batch multiple activation tensors together and transfer them between GPU memory and CPU memory in a single kernel. Tensors are also moved back to GPU in batches. By doing so, we can achieve higher PCIe utilization and reduce the CPU offloading time.
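The grouping step of batched CPU offloading might look like the following sketch. It assumes that consecutive checkpoint tensors are packed until a byte threshold is reached; the threshold `max_batch_bytes` and the function name `batch_tensors` are hypothetical, since the text does not say how FOLD3D sizes its batches, and the actual GPU-CPU copy is left out.

```python
from typing import List


def batch_tensors(tensor_bytes: List[int], max_batch_bytes: int) -> List[List[int]]:
    """Group tensor indices into consecutive batches of at most max_batch_bytes each."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, size in enumerate(tensor_bytes):
        # Close the current batch if adding this tensor would exceed the threshold.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)  # an oversized tensor still gets a batch of its own
        current_bytes += size
    if current:
        batches.append(current)
    return batches


MB = 1 << 20
print(batch_tensors([40 * MB] * 5, max_batch_bytes=100 * MB))  # prints [[0, 1], [2, 3], [4]]
```

In this example five transfers collapse into three, so the per-transfer launch cost is paid three times instead of five.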

6 EVALUATION

Testbeds. We performed the experiments on two clusters. The first cluster is a public commodity cloud consisting of 8 nodes with in total 64 NVIDIA V100 GPUs. Each node is an AWS EC2 p3dn.24xlarge instance which has 96 vCPUs, 1.2TB memory and 8 Nvidia Tesla V100 GPUs (each has 32GB memory and 125 FP16 TFLOPs). GPUs in a node are connected by NVLink, and nodes are connected over a 100Gbps network. The second cluster is a private laboratory cloud containing 32 nodes with a total of 256 NVIDIA A100 GPUs. Each node has 128 Intel 6248R CPUs, 2.0TB memory and 8 Nvidia A100 GPUs (each has 40GB memory and 312 FP16 TFLOPs). GPUs in a node are connected by NVLink, and nodes are connected over 200Gbps Infiniband. Unless otherwise specified, we used 16 nodes with a total of 128 A100 GPUs as our default testbed.

Baselines. We took Megatron v3.0 (Megatron-SP) [28], Megatron v2.5 (Megatron-PTD) [8], DeepSpeed 3D (DSpeed3D) [19], and DeepSpeed ZeRO3 (DSpeedZ3) [29] as our baselines. Megatron-SP is the latest 3D parallel training system that was reported to achieve almost linear scaling efficiency. Megatron-PTD is the system used in Megatron’s earlier paper [8]. DSpeedZ3 is a powerful data parallel training system that incorporates a set of memory optimization techniques. Microsoft’s DSpeed3D is a well-engineered system which extends data parallelism optimized by DeepSpeed ZeRO with tensor parallelism [25] and pipeline parallelism [10, 24] to break the scaling efficiency bounds of data parallelism. We ran these two DeepSpeed systems in a DeepSpeed v0.5.5 environment. Sequence parallelism [28] was integrated into Megatron, DSpeed3D and FOLD3D to reduce the activation size and support larger models.

Baseline settings. We used two 3D parallel configurations for the experiments. The first configuration was chosen following the instructions provided in Megatron-PTD. Specifically, given a DNN model, we first scaled along the tensor parallel dimension within hosts and then along the pipeline parallel dimension until the model’s parameters and activations fit into GPU memory. Then, we scaled along the data parallel dimension to use up all GPUs. The second configuration was selected by Piper, which proposed an efficient optimization algorithm to find the best 3D parallel configuration under its own 3D parallel performance model. For both Megatron-PTD and Megatron-SP, we adopted the interleaved schedule introduced in the Megatron paper to reduce the pipeline bubble. Since Megatron provides no definite guidance for selecting the interleaved schedule, we selected the best one by trial, choosing the schedule with the highest throughput. Megatron-PTD/Megatron-SP can overlap most of the PP.sync tasks with computation when enabling the interleaved schedule, but the PP.sync tasks during the pipeline warmup period have to stay on the performance critical path. DSpeed3D has to leave all the PP.sync tasks on the performance critical path since it adopts the 1F1B scheduling. This is because when overlapping PP.sync tasks with computation in 1F1B scheduling, two simultaneous send/recv operations between a pair of GPUs may potentially cause deadlock [53].
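The first configuration rule can be sketched as a small search. The memory check `fits_in_memory(tp, pp)` is a hypothetical stand-in for real profiling, growth in powers of two is an assumption, and the function name is invented for illustration; the result is returned in the (DP, PP, TP) order used in Table 2.

```python
from typing import Callable, Tuple


def megatron_ptd_style_config(
    num_gpus: int,
    gpus_per_node: int,
    fits_in_memory: Callable[[int, int], bool],
) -> Tuple[int, int, int]:
    """Return (dp, pp, tp): grow TP inside a node, then PP, until the model fits; DP uses the rest."""
    tp, pp = 1, 1
    # Step 1: scale tensor parallelism, but never beyond one host.
    while not fits_in_memory(tp, pp) and tp < gpus_per_node:
        tp *= 2
    # Step 2: scale pipeline parallelism across hosts.
    while not fits_in_memory(tp, pp) and tp * pp * 2 <= num_gpus:
        pp *= 2
    if not fits_in_memory(tp, pp):
        raise ValueError("model does not fit on the given cluster")
    # Step 3: the remaining GPUs form the data parallel dimension.
    dp = num_gpus // (tp * pp)
    return dp, pp, tp


# Toy memory model (assumption): the model fits once it is split across at least 32 GPUs.
def toy_fits(tp: int, pp: int) -> bool:
    return tp * pp >= 32


print(megatron_ptd_style_config(128, 8, toy_fits))  # prints (4, 4, 8)
```

Because TP and PP stop growing as soon as the model fits, this rule yields the smallest TP and PP sizes and the largest DP size that the cluster allows.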

Models and Datasets. We evaluated five giant transformer models which cover all the large transformer models evaluated by recent large model training systems [8]. Specifically, we covered major pretraining transformer models (GPT [1], BERT [31], CPM [32], Turing-NLG [29] and T5 [33]) and their respective datasets. GPT and Turing-NLG use the OpenWebText [51] dataset, BERT uses the Wikipedia [52] dataset, CPM uses the WuDao Corpus [53] dataset, and T5 uses

We report throughput with the TFLOPs metric, whose approximation formulas (for transformer blocks) are from Megatron [8] for fair comparisons; further specifications can be found in their paper.

Model Configurations. Table 1 shows all model settings used in this paper. Each model’s configuration is the same as the official specifications or the settings evaluated by previous works. Moreover, to better understand FOLD3D, we evaluated GPT-3 models with various model shapes and parameter sizes. We will specify how these settings are selected when they are used. Unless otherwise specified, the micro-batch size used in our experiments was 4, which was large enough to saturate a GPU’s computation while leaving enough GPU memory for the memory footprint during training.

We focus on four questions. §6.1: How does FOLD3D perform compared to the baselines? §6.2: How does FOLD3D perform with different parallel configurations? §6.3: How robust is FOLD3D’s high performance under different batch sizes and network bandwidths? §6.4: How effective are FOLD3D’s individual techniques, and does FOLD3D preserve training convergence?

6.1 End-to-End Performance

Table 2 shows the five training systems’ per-GPU throughput when training GPT-3 models on 64 V100 GPUs and 128 A100 GPUs, together with the detailed settings (e.g., model, batch size and parallel configuration) and various breakdown results. Column “ExCMem.” stands for the extra CPU memory brought by FOLD3D’s novel AIAO scheduling: it is the peak CPU memory occupied by the offloaded checkpoint activations in a host, and excludes the CPU memory used by the Python program and the PyTorch library. This column is not applicable to Megatron and DSpeed3D, since these systems only store the activation tensors in GPU memory, nor to the pure data parallel training system DSpeedZ3, because the column evaluates the CPU memory usage caused by pipeline parallelism. Besides the extra CPU memory, we have also evaluated the total CPU memory usage of all systems, and we report the results in §6.1. Columns “Bubble” and “PPsync” are not applicable to DSpeedZ3 because it is a pure data parallel training system.

All models’ shape configurations used in this evaluation followed the models’ original papers. To compare with Megatron-PTD, we evaluated GPT-3 18B, which is the model Megatron-PTD used in its paper on 256 GPUs. We then used GPT-3 39B to evaluate FOLD3D’s performance on larger models, and compared the results with Megatron-SP. We selected the batch size that achieved the shortest training time for the model to converge. Details of the batch size selection are elaborated in §6.3.

Overall, FOLD3D achieved the highest throughput (116.1 TFLOPS and 51.1 TFLOPS) on both A100 and V100 clusters; to the best of our knowledge, this throughput is higher than the highest per-GPU throughput publicly reported on 64 V100 GPUs with similar models. Specifically, the highest per-GPU throughput publicly reported by DeepSpeed is 41.4 TFLOPS [29] when training a model of the same size, but with a much faster cross-server link of 800Gbps (Nvidia DGX-2) than the 100Gbps network in our evaluation. FOLD3D achieved 31.5%-42.1% speedup over Megatron-SP on 128 A100 GPUs and 25.2%-33.0% over Megatron-SP on 64 V100 GPUs.
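As a quick arithmetic check of these figures, the sketch below computes the utilization (measured throughput divided by the theoretical FP16 peak given in the testbed description) and the gap to DeepSpeed’s reported number. It reads 116.1 and 51.1 TFLOPS as the A100 and V100 results respectively, which is an interpretation of the sentence above; the helper name `utilization_percent` is hypothetical.

```python
def utilization_percent(measured_tflops: float, peak_tflops: float) -> float:
    """Ratio of measured per-GPU throughput to the theoretical peak, in percent."""
    return measured_tflops / peak_tflops * 100.0


# Peak FP16 throughput per GPU, from the testbed description.
a100_util = utilization_percent(116.1, 312.0)
v100_util = utilization_percent(51.1, 125.0)
print(f"A100: {a100_util:.1f}%  V100: {v100_util:.1f}%")  # A100: 37.2%  V100: 40.9%

# Relative gap between 51.1 TFLOPS and DeepSpeed's reported 41.4 TFLOPS on V100 GPUs.
gap_percent = (51.1 / 41.4 - 1.0) * 100.0
print(f"{gap_percent:.1f}% higher")  # 23.4% higher
```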

Table 2 reveals that FOLD3D’s high performance came from reducing (overlapping) both the DPsync and PPsync communications on the performance critical path. We observed that the network was saturated by “DPsync” for all five training systems. For both Megatron and FOLD3D, the throughput mainly depends on the sum of “Fwd.”, “Bwd.”, “DPsync”, and “PPsync” on the training performance critical path. However, FOLD3D’s “Fwd.” and “Bwd.” are overlapped with most of the DPsync and PPsync tasks (see Figure 4). For example, on the V100 cluster, FOLD3D reduced the DPsync time on the performance critical (non-overlapped) path from 2.50s to 0.67s and the PPsync time on the performance critical path from 0.95s to 0.41s.

FOLD3D outperformed the baselines under both the parallel configurations derived by Megatron-PTD and Piper. This is because DPsync took a large portion of the iteration time for these parallel configurations. Megatron-PTD increases the PP and TP sizes until the model split fits into GPU memory, and then enlarges DP to use all GPUs. In such a case, the PP size and TP size are minimized while the DP size is maximized. Meanwhile, Megatron-PTD used most of the GPU memory to accommodate the model parameters and their corresponding gradients, which leads to an extremely large volume of gradients to be synchronized on each GPU. The large per-GPU gradient volume and the large DP size lead to the substantial DPsync time of Megatron-PTD.

Even though Piper automatically finds the best parallel configuration that maximizes the training throughput, DPsync still accounted for 28.6% of the training time in our evaluation. This is because the decrease of DPsync time always comes with the increase of PPsync time and pipeline bubble time, which inevitably increases the overall iteration time. In 3D parallel training, the way to mitigate the DPsync time is to increase the PP size. By doing so, both the DP size and the gradients needed to be synchronized per GPU decrease. However, both the PPsync time and the pipeline bubble increase as well.

We have collected both GPU and CPU memory usage during evaluation. The memory usage (i.e., the sum of GPU memory and CPU memory usage) of FOLD3D is larger than that of the baselines, and the extra memory overhead comes from FOLD3D’s novel AIAO scheduling. The AIAO scheduling requires FOLD3D to store the checkpoint activations generated during the forward pass of all the micro-batches. Since the checkpoints are further offloaded

TABLE 2: Breakdown of performance critical path for each system training GPT-3. Tuples in the (3D) and (Seg.) columns stand for (DP, PP, TP) and the segment number, respectively. Fwd. stands for forward computing time and Bwd. stands for backward computing time. Thrp. stands for per-GPU throughput in TFLOPs and Util% stands for the ratio of the measured throughput to the theoretical peak throughput provided by Nvidia. GMem. stands for the peak GPU memory usage (in GB) of a GPU. ExCMem. stands for the extra CPU memory brought by the novel AIAO scheduling of FOLD3D and represents the peak CPU memory (in GB) occupied by the offloaded checkpoint activations in a host. Both PPsync and DPsync contain only non-overlapped communication time. n/a means the column is not applicable to the system.

TABLE 3: Breakdown of performance critical path for each system training five models. Column name meanings are the same as Table 2. n/a means the column is not applicable to the system.

We further plot the breakdown results of FOLD3D and Megatron-SP for GPT-3 18B and GPT-3 39B in Figure 5. The results generally matched our performance modeling in §3.3. For the two models, the DP.sync time of FOLD3D was reduced by 68.9% and 72.6% compared to Megatron-SP. In our performance modeling, the DP.sync time is reduced by 75% when the segment number is 4. We attribute this discrepancy to the fact that not all GPUs start the DP.sync tasks at exactly the same time. The computation time of FOLD3D is also slightly larger than that of Megatron-SP for both models. We attribute the slowdown to the overlapping of DP.sync tasks with computation. As revealed by a recent study, when the all-reduce operation in DP.sync is overlapped with DNN computation, the all-reduce operation contends for GPU resources with the DNN computation. However, these effects made the real iteration time less than 5% larger than the modeled one in our evaluation, so the performance modeling remains useful for estimating the real performance of FOLD3D.
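The 75% figure for 4 segments is consistent with a model in which only one segment’s share of the DP.sync time stays on the critical path, i.e., a reduction of \( 1 - 1/ns \). The sketch below uses that formula as an assumption inferred from this single data point, since the model of §3.3 is not reproduced in this section; the function name is hypothetical.

```python
def modeled_dp_sync_reduction(num_segments: int) -> float:
    """Assumed model: all but one segment's DP.sync is hidden behind computation."""
    return 1.0 - 1.0 / num_segments


modeled = modeled_dp_sync_reduction(4)
measured_reductions = [0.689, 0.726]  # reductions reported above for the two GPT-3 models

for measured in measured_reductions:
    gap = modeled - measured
    print(f"modeled {modeled:.1%}, measured {measured:.1%}, gap {gap:.1%}")
```

Under this assumption the measured reductions fall short of the modeled value by roughly 2 to 6 percentage points.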

In Table 3, we show four systems’ per-GPU throughput for another four models. We only show the results for each model under the Piper setting. FOLD3D achieved 1.25x to 1.33x speedup over Megatron-SP, which further confirms that FOLD3D’s gain holds for different models. The four models differ from GPT-3 models only in the pre-process and post-process layers. As with GPT-3, the majority of each of the four models is composed of transformer blocks. The models mainly differ in the hidden sizes of the transformer blocks. The hidden size determines the ratio between computation, DP.sync and PP.sync. We further show the results for each model on 64 V100 GPUs in Figure 6.

We conducted a weak scaling study to evaluate FOLD3D’s performance on large-scale clusters. Following the common practice of the baselines, weak scaling tests a system’s throughput when scaling to train a larger model with more GPUs. Figure 7 shows that FOLD3D’s throughput was consistently ∼31% higher than both Megatron-SP and DSpeedZ3, and that FOLD3D outperformed the baselines at all the scales we evaluated. The reason is that when the model size and the number of GPUs grow, both computation and communication time increase accordingly. We believe FOLD3D will be able to overlap most of the communication tasks with the computation tasks even at larger scales, e.g., 512 or thousands of GPUs; in contrast, the baselines again left all tasks serialized on the performance critical path.

6.2 Evaluation of Parallel Configurations

In this section, we evaluated the performance of FOLD3D on different parallel configurations. In particular, we fixed the size of one parallel dimension and changed the combination of the other two. We conducted the experiments for model GPT-3 39B on both 64 V100 GPUs and 128 A100 GPUs. For each cluster, the batch size used is the same as the one in Table 2.

6.2.1 PP vs. DP

In Figure 8, we evaluated the impact of PP and DP sizes on FOLD3D’s performance. We set the TP size to 8 (the number of GPUs in a host), which is a common setting for large model training (see Table 3). The Megatron paper’s lesson on these two degrees of parallelism is that DP should always be favored over PP on their 1.6 Tbps network. However, we found that DP and PP should be balanced to reach the peak throughput in our evaluation. This is because a large PP size incurs longer PP.sync time and longer bubble time, while a large DP size increases the DP.sync time (even when most of the DP.sync is overlapped with computation in FOLD3D).

6.2.2 PP vs. TP

In Figure 9, we evaluated the impact of PP and TP sizes on FOLD3D’s performance. We set the DP size to 2 for 64 V100 GPUs and to 4 for 128 A100 GPUs. Overall, our evaluation shows that within a host (TP size less than or equal to 8), on both the V100 GPU cluster and the A100 GPU cluster, TP is preferable to PP. This is because within a host, the GPU-to-GPU links are fast enough that the scaling efficiency of TP (mainly bounded by TP.sync communications) can surpass the efficiency of PP (mainly bounded by flush bubbles). A TP size greater than 8 means cross-server TP (because there are 8 GPUs in a node), which is much slower than TP within a server. The same conclusion is reported in Megatron [8].

Fig. 9: Throughput per GPU of FOLD3D under different (PP size, TP size) combinations. TP is more preferred than PP within a host.

6.2.3 DP vs. TP
In Figure 10, we evaluated the impact of DP and TP sizes on the training throughput of FOLD3D. We set the PP size to 4. The figure shows that within both V100 and A100 clusters, TP is preferable to DP. This is because the TP.sync time was shorter than the DP.sync time. When a model’s size increases (e.g., by increasing the number of DNN layers), the DP.sync time increases faster than the TP.sync time, so TP would still be preferable.

Fig. 10: Throughput per GPU under different (DP size, TP size) combinations. TP is more preferred than DP within a host.

6.2.4 Impact of Segments
We evaluated the impact of segment number selection on FOLD3D’s performance for models GPT-3 39B and GPT-3 14B. For both models, we used the 3D parallel configurations derived by Piper. Since DP and PP should be balanced on commodity cloud networks, selecting a proper segment number in FOLD3D is crucial. Although an extremely large segment number brings a larger overlapping ratio of DP.sync communication tasks with computation tasks [54], it also increases the PP.sync costs, as each segment needs to be pipelined across all pipeline stages. We found that in most of our experiments, the best segment number was 2 to 4 for various models, which matched the conclusions we drew from [54]: a segment number in this range retains the speedup from a highly overlapped portion of communication tasks (50-75%) without incurring too much PP.sync cost.

Fig. 11: How the number of segments affects the final throughput. Given a model and its 3D parallelization configuration, FOLD3D’s runtime is able to find the optimal segment.

6.3 Ablation Study

Fig. 12: (a) Training loss curves under different batch sizes. (b) Throughput per GPU under different batch sizes. (c) Training time required for the training loss to reach 3.3 (the minimum training loss achieved by the model).

The selection of batch size when training large models involves a trade-off between system training throughput and convergence efficiency [56]. When enlarging the batch size, GPUs can achieve higher ALU utilization, but the convergence efficiency becomes lower due to the decrease of the gradient noise scale [57]. To demonstrate the relationship between the convergence efficiency and batch size, we trained the GPT-3 39B model under different batch sizes. For each batch size we evaluated, we selected the best learning rate and other hyperparameters following approaches from existing works [1]. Figure 12a plots the training loss curves under different batch sizes. When increasing the batch size, the model has to be trained for more epochs although the training throughput increases. Thus, the higher throughput brought by a larger batch size does not necessarily shorten training time. The observed relationship between convergence efficiency and batch size also matches a recent study [58].

We first evaluated the performance of FOLD3D and the baselines under different batch sizes. The result is shown in Figure 12b. When increasing the batch size from 256 to 1024, FOLD3D’s throughput improvement over Megatron decreased from 31.5% to 10.7%, and its improvement over DSpeedZ3 decreased from 48.2% to 27.2%. This is because the computation time and the overall iteration time increase with the batch size, while the DP.sync time is independent of the batch size and stays roughly the same across various batch sizes. The share of the DP.sync time in the iteration time thus decreased, and so did FOLD3D’s improvement. Although the improvement of FOLD3D over the baselines decreased when enlarging the batch size, we found that the relatively small batch size (256) achieved the shortest training time for the model to attain the desired training loss, even for Megatron and DSpeedZ3. Figure 12c shows the total training time needed for the training loss to reach 3.3 (the minimum training loss that can be achieved by the given GPT-3 model) under each batch size for all systems.

Our evaluation on the AWS cloud shows that the network bandwidth for a single node ranges from 70 to 80 Gbps. We thus evaluated the throughput of FOLD3D and the baselines under 25, 40, 100 and 200 Gbps networks on 128 A100 GPUs. Similar to the approach stated above, we chose the best batch size for these systems under each network bandwidth. The best batch sizes for 25, 40, 100 and 200 Gbps networks are 1024, 1024, 512 and 256 respectively. As the bandwidth decreases, the batch size has to be enlarged to achieve higher system throughput and shorten the overall training time. The result is shown in Figure 13. FOLD3D outperformed Megatron-SP by 30.4% to 36.7%, and outperformed DSpeed3D by 38.2% to 45.6%. This is because the ratio of the DP.sync time to the computation time stayed in the range from 22.3% to 29.6% under all the bandwidths we evaluated: each decrease in bandwidth came with a larger batch size, and thus led to both longer computation time and longer DP.sync time.

TABLE 4: Cost of CPU activation offloading in different models and runtime configurations. Mi. batch stands for micro-batch size and Off. Time stands for the time cost of offloading operations.

<table>
<thead>
<tr>
<th>Model</th>
<th>Mi. batch</th>
<th>Hid. Size</th>
<th>Acti. Size</th>
<th>Fwd. Time</th>
<th>Bwd. Time</th>
<th>Off. Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>BERT</td>
<td>4</td>
<td>1244</td>
<td>56.6M</td>
<td>2901.7</td>
<td>7902.3</td>
<td>1472.5</td>
</tr>
<tr>
<td>GPT-3 39B</td>
<td>4</td>
<td>8192</td>
<td>64.0M</td>
<td>2901.7</td>
<td>7902.3</td>
<td>1472.5</td>
</tr>
<tr>
<td>CPM</td>
<td>4</td>
<td>5120</td>
<td>40.0M</td>
<td>1673.2</td>
<td>4707.9</td>
<td>1097.3</td>
</tr>
</tbody>
</table>


6.4 Effectiveness Analysis

Activation Checkpointing. The activation checkpointing technique [49] in both FOLD3D and Megatron was necessary for conducting all our experiments, and all the baseline systems enabled activation checkpointing to support the training of large models. When we tried to disable activation checkpointing in our experiments, all default configurations ran into out-of-memory exceptions, indicating that activation checkpointing was necessary.

Checkpoint CPU Offloading. The data transfer rate between the accelerator and the CPU memory, which is bounded by the PCIe bandwidth, tends to grow roughly in proportion to the accelerator’s throughput. In our test environments, the total GPU-to-CPU data transfer rate is 29.6GB/s for 8 A100 GPUs and 13.6GB/s for 8 V100 GPUs, while an A100 GPU’s peak throughput is 312 TFLOPs and a V100 GPU’s peak throughput is 125 TFLOPs.
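The proportionality claim can be checked against the numbers quoted above with a few lines of arithmetic. This is only a sanity check on the reported figures; the variable names are chosen for illustration.

```python
# Figures quoted in the paragraph above.
transfer_gb_per_s = {"A100": 29.6, "V100": 13.6}  # total GPU-to-CPU rate for 8 GPUs
peak_tflops = {"A100": 312.0, "V100": 125.0}      # per-GPU FP16 peak throughput

bandwidth_ratio = transfer_gb_per_s["A100"] / transfer_gb_per_s["V100"]
compute_ratio = peak_tflops["A100"] / peak_tflops["V100"]

print(f"transfer rate ratio (A100/V100): {bandwidth_ratio:.2f}")  # 2.18
print(f"peak compute ratio (A100/V100):  {compute_ratio:.2f}")    # 2.50
```

The two ratios are close (about 2.2x versus 2.5x), so the transfer rate scales roughly, though not exactly, with the compute throughput across the two GPU generations.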

We evaluated the effectiveness of our CPU Offloading of activation checkpoints [54]. Table 4 shows the microevents when training three models on the V100 setting. In particular, given the micro-batch size and hidden size, we collected the activation checkpoint offloading/prefetching time and collected the forward and backward pass time between two activation checkpoints. Overall, the total time of offloading checkpoints to CPUs and fetching them from CPUs was lower than the forward and backward pass time between two activation checkpoints in FOLD3D. Therefore, the activation checkpoint offloading costs were overlapped with computation and caused negligible impact on FOLD3D’s performance.
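The comparison described above can be reproduced from the Table 4 values. The sketch below lists the rows in table order as (forward time, backward time, offloading time), in whatever time unit the table uses, and checks that the offloading time stays below the computation time it is meant to overlap with; the helper name `offload_is_hidden` is hypothetical.

```python
from typing import List, Tuple

# Rows of Table 4 in order: (Fwd. Time, Bwd. Time, Off. Time), unit as in the table.
table4_rows: List[Tuple[float, float, float]] = [
    (2901.7, 7902.3, 1472.5),
    (2901.7, 7902.3, 1472.5),
    (1673.2, 4707.9, 1097.3),
]


def offload_is_hidden(fwd: float, bwd: float, off: float) -> bool:
    """True if the offload/prefetch time fits under the forward + backward time between checkpoints."""
    return off < fwd + bwd


for fwd, bwd, off in table4_rows:
    share = off / (fwd + bwd)
    print(f"hidden={offload_is_hidden(fwd, bwd, off)}, offload/compute={share:.1%}")
```

For every row the offloading time is below 20% of the combined forward and backward time, which is consistent with the claim that it can be overlapped with computation.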

Our evaluation covered both a typical large batch size (2048) and a small batch size (256), as shown in Figure 12b. In principle, both the activation checkpoint size and the number of FLOPs per iteration in FOLD3D are proportional to the batch size, and both our settings and common practice [29] match this principle. In particular, the ratio between the computation time and the checkpoint offloading time remains roughly the same for all the batch sizes we evaluated. Therefore, the checkpoint offloading time remains smaller than the computation time under various batch sizes, and FOLD3D’s offloader causes a negligible performance penalty.
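The batch-size argument above can be sketched as a small model in which both the checkpoint volume and the FLOPs scale linearly with the batch size, so the batch size cancels out of the offload-to-compute ratio. Everything here is an assumption made for illustration: the function name, the per-sample checkpoint size, the per-sample FLOPs, and the 40% GPU utilization are hypothetical; only the 1.7 GB/s per-GPU transfer rate is derived from the 13.6 GB/s over 8 V100 GPUs reported earlier.

```python
import math


def offload_to_compute_ratio(batch_size, ckpt_bytes_per_sample, flops_per_sample,
                             pcie_bytes_per_s, gpu_flops_per_s):
    """Ratio of checkpoint transfer time to compute time for one GPU.

    Both terms are assumed to scale linearly with batch_size.
    """
    # Each checkpoint crosses PCIe twice: offload to CPU, then prefetch back.
    transfer_time = 2 * batch_size * ckpt_bytes_per_sample / pcie_bytes_per_s
    compute_time = batch_size * flops_per_sample / gpu_flops_per_s
    return transfer_time / compute_time


# Hypothetical workload parameters (not measurements from the paper).
params = dict(
    ckpt_bytes_per_sample=8e6,     # assumed checkpoint bytes per sample
    flops_per_sample=5e12,         # assumed FLOPs per sample between checkpoints
    pcie_bytes_per_s=13.6e9 / 8,   # 13.6 GB/s shared by 8 V100 GPUs
    gpu_flops_per_s=0.4 * 125e12,  # assumed 40% of the V100 peak
)

small = offload_to_compute_ratio(256, **params)
large = offload_to_compute_ratio(2048, **params)
print(f"batch 256:  {small:.3f}")
print(f"batch 2048: {large:.3f}")
```

A ratio below 1 means the transfer can be hidden behind computation. In this model the ratio depends only on the per-sample quantities and the hardware rates, not on the batch size.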

Training Convergence. FOLD3D is designed to preserve the same training convergence as the baselines and to remain transparent (§3) to the training workload. Still, to verify FOLD3D’s convergence, we trained GPT-3 39B using both FOLD3D and Megatron-SP on 128 A100 GPUs (Figure 14). We kept the same training parameters (learning rate, batch size, and random seeds) for both systems. The results show that FOLD3D followed the same convergence curve as Megatron-SP, while reducing the loss noticeably faster in wall-clock time because FOLD3D finishes each training iteration faster.

6.5 Lessons Learned

**Fig. 13:** Throughput under different network bandwidths.

**TABLE 4:** Cost of CPU activation offloading in different models and runtime configurations. Mi. batch stands for micro-batch size and O/F stands for the time cost of offloading operations.

FOLD3D has two limitations. First, like Megatron [8], [28], our system is mainly designed for large DNN models with a repeated, stacked structure. Nevertheless, whereas the baselines’ papers each evaluated one or two large models, we evaluated five notable and typical large models with repeated blocks. FOLD3D and all existing 3D parallel training systems (Megatron [8], [28] and DSPEED3D [19]) are currently not designed to support DNN models with heterogeneous layers (i.e., layers that do not share the same structure or input tensor shape). For instance, all 3D parallel systems including FOLD3D are not suitable for ResNet models, because the layers in a ResNet model differ in structure and input tensor shape: a typical ResNet model reduces the input tensor shape and increases the number of convolution filters layer by layer. FOLD3D is not suitable for DNNs with heterogeneous layers for two reasons. The first reason, which also applies to Megatron and DSPEED3D, is that heterogeneity easily leads to unbalanced computation across pipeline stages. When splitting a model like ResNet into pipeline stages, the last stage with a linear layer will have extremely heavyweight computation compared to the other stages. This imbalance makes 3D parallelism fundamentally unsuitable for DNNs with heterogeneous layers. Researchers may need to use the other two parallel dimensions or invent a new parallel dimension to replace the pipeline parallel dimension. We leave this open problem for future work. The second reason, specific to FOLD3D, is that FOLD3D’s segment slicing requires the segments to be homogeneous, so that a segment’s computational and communicational tasks can align with those of other segments to maximize FOLD3D’s effectiveness.

The second limitation is that the CPU offloading mechanism in FOLD3D consumes more CPU memory than Megatron. Fortunately, on commodity clouds, CPU memory is cheap and extensible compared with GPU memory.
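To make the pipeline imbalance caused by heterogeneous layers concrete, the following sketch splits a list of per-layer costs into contiguous stages and reports the ratio of the slowest stage to the average stage. The helper names, the equal-layer-count split, and the per-layer costs (including a final layer eight times heavier than the others) are hypothetical and chosen only for illustration; they are not measurements of ResNet or of any system discussed here.

```python
def stage_costs(layer_costs, num_stages):
    """Split layers into contiguous stages of (nearly) equal layer count.

    Returns the summed cost of each stage.
    """
    n = len(layer_costs)
    bounds = [i * n // num_stages for i in range(num_stages + 1)]
    return [sum(layer_costs[bounds[i]:bounds[i + 1]]) for i in range(num_stages)]


def imbalance(costs):
    """Slowest stage cost divided by mean stage cost; 1.0 means perfectly balanced."""
    return max(costs) / (sum(costs) / len(costs))


# Hypothetical per-layer costs in arbitrary units.
homogeneous = [1.0] * 16              # repeated, identical blocks
heterogeneous = [1.0] * 15 + [8.0]    # one much heavier final layer

homo_stages = stage_costs(homogeneous, 4)
hetero_stages = stage_costs(heterogeneous, 4)
print(homo_stages, imbalance(homo_stages))
print(hetero_stages, imbalance(hetero_stages))
```

In a synchronous pipeline the slowest stage bounds the iteration time, so an imbalance well above 1 leaves the other stages idle. A cost-aware partitioner can do better than the equal-count split used here, but a single layer heavier than the average stage cost still limits the balance that any contiguous split can reach.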

We believe FOLD3D and Megatron are complementary to each other. Megatron is optimized for training on dedicated clusters with ultra-fast networks [59]. On such dedicated clusters, we envision that FOLD3D’s gain over Megatron will decrease, because these clusters’ Tbps networks are unlikely to be a bottleneck. FOLD3D, in contrast, is designed for high-performance training of large models on commodity clouds for a wide range of users, labs, and enterprises that do not have access to such dedicated clusters. FOLD3D’s much-improved throughput on commodity clouds (with many GPUs and sub-100Gbps networks) shows its value in saving these users substantial financial resources and energy. Moreover, the available network bandwidth on commodity clouds is often lower than what the cloud provider claims. When we ran our experiments on V100-100G in a dedicated, quiet AWS cluster, network monitoring showed that the peak network bandwidth available to each AWS tenant (us) could be only 70 to 80Gbps (see §6.3), making FOLD3D especially desirable.

7 Related Work

A large body of systems work studies parallel techniques for DNN training, along data parallelism [19], [29], [60], [61], [62], pipeline parallelism [10], [11], [13], [20], [21], [22], [43], [63], [64], and tensor parallelism [25], [65], [66], [67], leading up to the emergence of 3D parallel training systems [8], [9], [17]. Building on these three foundational dimensions, various parallelism techniques have emerged, including optimizer parallelism (e.g., DeepSpeed ZeRO [29]), token parallelism (e.g., TeraPipe [64]), and sequence parallelism [28]. All these techniques are complementary to 3D parallelism, with each targeting an extreme case of large DNN training (e.g., token parallelism is for extremely long sequence training). In this paper, we focus on optimizing the 3D parallel dimensions, which serve as the foundation for today’s large DNNs to scale efficiently to billions of parameters.

**Systems for data parallel training**. Data parallelism [19], [29], [60], [61], [62], [68] is widely adopted for distributed DNN training. Some data parallel training systems, such as P3 [69] and TicTac [40], adopt priority scheduling to overlap the data parallel communication with both forward and backward computation: gradients of front layers are scheduled ahead of those of rear layers to maximize overlapping. BytePS [70] unifies all-reduce and parameter servers to utilize heterogeneous resources in a cluster. ZeRO [29] reduces the memory usage of data parallelism by sharding model parameters and gradients across GPUs. Compared with P3 and TicTac, which only work for pure data parallel training, FOLD3D further tackles the challenges of overlapping communication with computation in 3D parallel training, which are non-trivial as we discussed in §4. FOLD3D can incorporate techniques like BytePS.
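The priority scheduling idea mentioned above can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions of our own (a single communication channel, one time unit per gradient transfer, no preemption), not the implementation of P3 or TicTac; the function name and the ready times are hypothetical.

```python
import heapq


def send_order(ready_times, priority):
    """Simulate one channel that sends one gradient at a time (1 time unit each).

    ready_times[i] is when layer i's gradient becomes available.
    With priority=True, the lowest layer index among the ready gradients is
    sent first; otherwise gradients are sent in the order they became ready.
    """
    events = sorted((t, i) for i, t in enumerate(ready_times))
    heap, order, now, k = [], [], 0.0, 0
    while k < len(events) or heap:
        if not heap and events[k][0] > now:
            now = events[k][0]          # channel idles until the next gradient
        while k < len(events) and events[k][0] <= now:
            t, i = events[k]
            heapq.heappush(heap, (i, t) if priority else (t, i))
            k += 1
        item = heapq.heappop(heap)
        order.append(item[0] if priority else item[1])
        now += 1.0                      # one transfer occupies the channel
    return order


# The backward pass produces gradients from the rear layer (3) to the front (0).
ready = [1.5, 1.0, 0.5, 0.0]
fifo = send_order(ready, priority=False)
prio = send_order(ready, priority=True)
print("FIFO:    ", fifo)
print("priority:", prio)
```

With these ready times, the priority policy sends layer 0's gradient before layer 2's, so the front layer, which the next forward pass needs first, is synchronized earlier than under FIFO.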

**Systems for pipeline parallel training**. Pipeline parallelism [10], [11], [13], [20], [21], [22], [43], [63], [64], [71], [72], [73], [74] is commonly used for training large DNN models. TeraPipe [64] performs fine-grained pipeline parallelism across tokens in a single training sequence for Transformer-based models. vPipe [22] balances the memory usage and computation across pipeline stages. HetPipe [10] supports training on a set of heterogeneous GPUs with pipeline parallelism. These optimizations solely in the pipeline parallelism dimension are orthogonal to FOLD3D. When combining existing pipeline parallel systems with data parallelism, none of them can overlap data parallel communication with computation, and the data parallel communication of these systems is serialized after pipeline parallel computation.

**Automatic partitioning**. FlexFlow [75], Placeto [25], REGAL [77], Alpa [17] and Piper [16] automatically partition a model over multiple devices by transforming the parallelization optimization problem into a cost minimization problem. Among these works, Piper efficiently finds a near-optimal strategy for 3D parallelization combined with memory-saving techniques. However, these works focus on finding an optimal parallelization configuration, while FOLD3D proposes a new 3D parallel scheduling. Note that both FOLD3D and Megatron used the 3D configuration strategy produced by Piper [16]. We believe FOLD3D and Piper are orthogonal.

OOO [21] proposes a new training task splitting paradigm that splits the gradient computations of output and weights in the backward propagation, so that a smaller bubble size during pipelining can be achieved. However, OOO achieves this at the cost of a larger memory overhead, because it must stash all layers’ gradient outputs (ΣO) in GPU memory for a longer duration. OOO is not designed for 3D parallelism, because when OOO is combined with TP, ΣO grows explosively as each layer’s gradient output is gathered from all GPUs of the TP dimension. OOO has to keep this heavy ΣO in GPU memory until all tasks of a layer’s backward pass finish. Therefore, OOO is orthogonal to all 3D parallel training systems including FOLD3D.
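For intuition about the split described above, consider a single linear layer. The notation below (input X, weight W, output Y, loss L) is introduced here for illustration only and is not taken from the OOO paper; it shows the standard pair of backward computations that such a split separates.

```latex
% Illustrative notation (assumed): X = layer input, W = weight, Y = output, L = loss.
\begin{align}
  Y &= XW, \\
  \frac{\partial L}{\partial X} &= \frac{\partial L}{\partial Y}\, W^{\top}, \\
  \frac{\partial L}{\partial W} &= X^{\top}\, \frac{\partial L}{\partial Y}.
\end{align}
```

The gradient with respect to X is needed immediately by the preceding layer, whereas the gradient with respect to W is only needed for the weight update and can be scheduled later. Deferring it means that the output gradient must stay in memory until the weight gradient is computed, which is consistent with the memory overhead noted above.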

There are also various pioneering works that target new training paradigms based on Transformer-like models, introducing sparsely activated DNN training. Pathways [78] is a recent Multiple Program Multiple Data (MPMD) training framework (proposed by Google) that runs multiple training tasks/programs (i.e., each task/program is a single SGD procedure; tasks may share parameters with each other) to fully exploit a cluster’s heterogeneous GPU resources. Still, Pathways is complementary to Single Program Multiple Data systems including Megatron, DeepSpeed, and FOLD3D, because within each training task (program), the 3D parallelism technique is still essential for scaling to a large number of GPUs/TPUs with heterogeneous inter-links between devices.

Besides, Mixture-of-Experts (MoE) [34], [67] extends Transformer models with many sparsely activated experts. Many emerging training systems, such as FasterMoE [35] and Tutel [79], aim to accelerate MoE workloads.

8 Conclusion

We present the FOLD3D system, which maximally overlaps communication and computation tasks in 3D parallel training of large DNN models on commodity clouds. By folding a model into segments, FOLD3D conducts AIAO to achieve an all-parallel scheduling between communication and computation tasks. FOLD3D can benefit most people who need to train and fine-tune large DNN models.

Acknowledgments

We thank all reviewers for their valuable comments. The work is supported in part by the Huawei Flagship Research Grant in 2021, the HKU-SCF FinTech Academy R&D Funding Scheme in 2021 and 2022, HK RIF (R7030-22), HK ITF (GHP/169/20S2Z), and the PuiJiang Lab (Heming Cui is a courtesy researcher in this lab).

References

\begin{thebibliography}{1}
\bibitem[8]{8} “Microsoft/deepspeed,” \url{https://github.com/microsoft/DeepSpeed}
\bibitem[14]{14} “Infiniband and remote dma (rdma) interfaces,” \url{https://www.kernel.org/doc/html/v5.11/driver-api/infiniband.html}
\end{thebibliography}


Fanxin Li received the BE degree from Xi’an Jiaotong University in 2019. He is currently working toward the PhD degree at The University of Hong Kong. His research interests include distributed machine learning and cloud computing.

Shixiong Zhao received his bachelor’s degree from HKU and his master’s degree from HKUST. He is currently a PhD student in Computer Science at HKU, under the supervision of Prof. Heming Cui. His research interests include distributed systems for high performance computing, distributed systems and system security. He is a student member of IEEE.

Yuhao Qing received his bachelor’s degree from City University of Hong Kong. He is currently a PhD student in Computer Science at HKU, under the supervision of Prof. Heming Cui. His research interests include machine learning systems and cloud computing.

Xusheng Chen received his bachelor’s degree from HKU. He is currently a PhD student in Computer Science at HKU, under the supervision of Prof. Heming Cui. His research interests include distributed consensus protocols, distributed systems and system security.

Xiuxian Guan received his Bachelor's degree in the University of Science and Technology of China (USTC), Hefei, China, in 2005, the M.S. degree from the Chinese Academy of Sciences (CAS), Beijing, China, in 2008, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2014, all in computer science. From 2014 to 2019, he was a lecturer and then an associate professor at Chongqing University, Chongqing, China. Currently, he is a senior researcher at Huawei, Hongkong. His research interests include information-centric networking, Federated Learning and AI for System.

Sen Wang received the B.S. degree from the University of Science and Technology of China (USTC), Hefei, China, in 2005, the M.S. degree from the Chinese Academy of Sciences (CAS), Beijing, China, in 2008, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2014, all in computer science. From 2014 to 2019, he was a lecturer and then an associate professor at Chongqing University, Chongqing, China. Currently, he is a senior researcher at Huawei, Hong Kong. His research interests include information-centric networking, federated learning, and AI for systems.

Gong Zhang is a chief architect and research scientist, and the director of the Huawei Future Network Theory Lab. His major research directions are network architecture and large-scale distributed systems. He has more than 20 years of experience in system architecture for networks, distributed systems, and communication systems. He holds more than 90 global patents.

Heming Cui is an Associate Professor in Computer Science of HKU. His research interests include operating systems, programming languages, distributed systems, and cloud computing, with a particular focus on building software infrastructures and tools to improve reliability and security of real-world software. He is a member of IEEE.