2.3 Related Models

The lack of the central unifying model in parallel computation as compared to the unique von Neumann model in the field of sequential computation has resulted in a long debate of selecting the representative model(s) for parallel computation. Consequently, there exist enormous number of models for parallel computation. Some of them look different, but were shown to be quantitatively equivalent [14]. Some of them look similar, but they could be completely different. In general, models have tended towards undesirable extremes. On the one hand, they are of highly theoretical qualities but be unrealistic or difficult to map onto real machines. At the other extreme, models may be too machine-oriented or complex which limit their long-term usage and portability.

Bear in mind that the definition of the model is "simply an abstract view of a system or a part of a system, obtained by removing details in order to allow one to discover and work with the basic principles" [43]. On the other hand, the complexity of designing and analyzing parallel systems requires that models be used at various levels of abstraction that are highly related to the application characteristics. Strictly speaking, it is hard to make a head-to-head comparison on models as they may involve different levels of abstraction. As our quest for a performance model is to have effective exploitation of commodity clusters for high-performance computation, so we focus our exposition of models that are based on the same architectural foundation as compared to our communication model, as well as targeting to a similar programming abstraction as we are.

2.3.0.0.1 The BSP Model

It is the first to be called as bridging model [105]. Essentially, it agrees with the generic architectural model described above, but requires an extension that provides efficient global synchronization on all processors. The performance of the BSP-style program can be characterized by three parameters: p, L and g, where p stands for the number of processors; L is the cost of global synchronization in unit of time step; and g corresponds to network throughput in terms of the ratio between the number of local computational operations performed per second by all processors, to the total number of data words delivered per second by the router. A parallel algorithm is expressed in BSP model as a sequence of parallel supersteps. Each superstep consists of a sequence of local computation steps plus any message exchanges, followed by a global synchronization. The cost of a single superstep phase is represented by the formula

, where w is the maximum cost of the local computation on each processor, and h is the maximum number of packets both sent and received by any processor. The cost of the parallel algorithm is just the sum of each individual superstep cost that comprising the algorithm.

Although BSP model does not explicitly stress on data locality, the gh parameter shows us how the importance of data distribution (locality) in influencing performance. Furthermore, the gh parameter implicitly captures the contention issue but inadequately, as in reality, g may be affected under congestion condition. One important performance feature it has missed out is the communication cost related to the message size, as it does not distinguish between a message of length kb and k messages of length b, but in reality, this could be a significant factor to the performance prediction and analysis. Another limitation of this model is the restricted framework - supersteps, as some parallel applications cannot fit into this programming structure, e.g. task-parallel model. The global synchronization operation between supersteps would impose a stringent requirement on the cluster communication system. This is because almost all commodity clusters are not coupled with hardware synchronization primitive, any realization of the global synchronization has to be done by software approach, which means it would contend with normal data communications. Thus, this becomes a costly overhead, especially this operation appears once in each superstep.

2.3.0.0.2 The LogP Model

The LogP model [30] tends to be more empirical and network-oriented. Its includes four parameters to characterize the system: L, o, g and P, where P stands for the number of processors involved; o represents the software overhead associated with the transmission/reception of message; L is the upper bound on the hardware delay in transmitting a fixed but small size message between two endpoints; and g is the minimal time interval between consecutive messaging events at a processor, which corresponds to the network throughput available to the processor. By simply exposes these architectural parameters, we can directly derive the performance/cost when using it to analyze parallel algorithms.

An interesting concept of LogP model is the idea of finite capacity of the network, such that no more than certain amount of messages ( $\left\lceil \frac{L}{g}\right\rceil$ ) can be in transit from any processor or to any processor at any time. And any attempts to exceed the limit will stall the processor. However, the model does not provide any clear idea on how to quantify, avoid and take advantage of this information in algorithm design. Similarly, LogP model does not address on the issue of message size, even the worst is the assumption of all messages are of ``small'' size; however, this has been addressed on their follow-up study [31].

Despite of the shortcoming, this model is the pioneer model that breaks the synchrony of parallel execution as oppose to the PRAM model [37], even though it is not the first to do so. Consequently, other studies tried to extend its capabilities to support more constructive features. For examples, LogGP model [4] augments the LogP model with a linear model for long messages; LoGPC model [67] further extends the LogGP model to include contention analysis using queuing model on the k-ary n-cubes network; LogPQ model [103] augments the LogP model on the stalling issue of the network constraint by adding buffer queues in the communication lines.

2.3.0.0.3 The Postal Model

The Postal model [8] is similar to LogP model with the exception of more abstractly expressing the network. The system is characterized by two parameters: n and $\lambda$ , where n stands for the number of processors and $\lambda$ represents the communication latency. The communication latency $\lambda$ is expressed as a ratio between total time spent in transmitting the message from sender to receiver with the time spent by the sender in initiating the transfer. This ratio captures both the software and hardware costs, and effectively reduces the dimension of analysis. Similarly, to simplify the analysis, this model sacrifices the performance accuracy by neglecting the importance of message size over communication latency. Therefore, their cost models are better for asymptotic analysis than for prediction, when porting the resulting algorithm to a particular platform, significant efforts have to be made for tuning the algorithm for performance.

2.3.0.0.4 The C $^{3}$ Model

The $C^{3}$ model [41] comes with the BSP superstep notion and also requires to have synchronization events between supersteps. However, on the selection of the performance parameter set, this model adopts a tactic that lies midway between the Postal and LogP models, which explicitly expressing the costs spent in sending and receiving messages. The unique feature of this model is the introduction of the congestion measures to the performance set, which measure the congestion over communication links ( $C_{l}$ ) and congestion at the processors ( $C_{p}$ ). The authors admitted that congestion is difficult to evaluate, and they approached this problem by a rather phenomenal way.

Observed that congestion depends on the total amount of data sent between all processor pairs (cong). This model relates the link congestion by simply estimated the cost as the per-processor delay in routing $L_{a}$ packets across the bisection width (b), which is shared by all processor pairs (cong), i.e. $C_{l}=L_{a}*\left\lceil \frac{cong}{b}\right\rceil$ . And the processor congestion is estimated as $C_{p}=L_{a}*\left\lceil \frac{cong}{p}\right\rceil *h$ , where h is the average distance between processors. Their rationale is that a message of size $L_{a}$ traversing a distance h links would compete for the resources with other messages at each of the

intermediate processors, therefore is slowed down by a factor of $\left\lceil \frac{cong}{p}\right\rceil$ at each processor. However, it is easy to find out that these congestion measures are quite unrealistic. For example, the many-to-one and one-to-many communications suffer with the same degree of link and processor congestion, which is obviously not true in real networks.

2.3.0.0.5 The CCM model

Motivated by the inadequacy of BSP model and the restrictive framework, the Collective Computing Model (CCM) [86] transforms the BSP-superstep framework to support more high-level programming model, such as MPI and PVM. Although CCM follows the superstep terminology of the BSP model, it waives the requirement of global synchronization between supersteps, but combines the message exchanges and synchronization properties into the execution of a collective communication function. As a result, this model provides a finite set ( $\vec{F}$ ) of collective communication functions, which sincerely maps to the collective operations found in MPI. Besides, it also provides a set ( $\vec{T}_{F}$ ) of cost functions for each collective function in $\vec{F}$ , such that performance analysis can be made on these cost functions.

As this model is aiming for a higher level programming model, its abstraction is more closely resembled to those common high-level message-passing programming interfaces; therefore, it consists of a larger set of performance parameters. As these performance parameters are directly related to some concrete operations, quantitative analysis is therefore possible, and the prediction quality is usually high, albeit the intricacy of the analysis. Besides, the parameter set can be a useful tool for evaluating message-passing software. However, this approach only contributes minimally in designing efficient message-passing library as they cannot provide information to guide on the design process, as they assume that the abstract machine supports these high-level primitives. Since this model is oriented to a high-level model, it can actually be built atop of existing abstract models. For example, to derive the $\vec{T}_{F}$ set, one can measure the performance of those collective operations directly out of the boxes, or we can determine $\vec{T}_{F}$ from LogP model or from $C^{3}$ model.

2.3.0.0.6 Phase Parallel Model

This model [118] is similar to the BSP model, but functionally closed to the CCM model. Under this model, a parallel execution is divided into sequence of phases. The next phase begins only after all operations in the current phase have finished, however, there are no synchronization primitive to enforce this synchrony. There are three types of phase: (1) Parallelism phase - performs process management; (2) Computation phase - executes local computation; (3) Interaction phase - executes interaction operation. There is no stringent framework in confining the sequence of phases, such that an interaction phase or another computation phase can follow a computation phase. However, different interaction operations, e.g. point-to-point communication or collective communication, may take different times. There is a general cost formula for an interaction operation:

which reflects that the cost of interaction depends on the message length (m), startup overhead ( $t_{0}(n)$ ) for an operation involves n processors, and the asymptotic bandwidth ( $r_{\infty }(n)$ ) under this communication profile. Likewise, to derive these formulae for different interactions, the authors performed direct measurements on the target machines [117].

Table 2.3: Summary of related abstract models which built on the same architectural foundation. With simple message-passing model, the programming interface is based on the send and receive operations, or their variants. While for high-level message-passing model, the programming interface includes simple point-to-point communication as well as complex collective communications.

Related Models Programming Model Abstract Model Architectural Model Characteristics

Bulk-Synchronous Parallel PRAM or
Message-Passing Model or BSP programming [65] (p, g, L, h-relations) generic architecture + synchronization primitive Tightly-synchronized computation mode

Postal Model Message-Passing Model (n, $\lambda$ ) generic architecture Full connectivity;
simultaneous I/O

LogP Message-Passing Model (L, o, g, P) generic architecture Implicit network constraint; overlapping of computation and communication

$C^{3}$ Model Message-Passing Model (p, s, h, l, $C_{l}$ (), $C_{p}$ () ) generic architecture + synchronization primitive Introduces congestion overheads

LogGP Message-Passing Model (L, o, g, G, P) generic architecture Support long messages transfer

Phase Parallel Model High-level Message-Passing Model ( $\vec{t}_{0}$ , $\vec{t}_{c}$ , $t_{f}$ , $\vec{t}_{p}$ , w, $\sigma$ , $\alpha$ ) generic architecture Includes costs for collective operations and process management

LoGPC Message-Passing Model LogGP Model + $C_{n}$ () generic architecture Contention factor on k-ary n-cubes network

Collective Computing Model High-level Message-Passing Programming Model - MPI (P, $\vec{F}$ , $\vec{T}_{F}$ , $\vec{P}$ , $\vec{T}_{P}$ ) generic architecture or BSP machine or LogP machine Includes costs for collective operations and process management

Our Cluster Model Message-Passing Model (p, $O_{s}$ (), $g_{s}$ (), $O_{r}$ (), $g_{r}$ (), $U_{r}$ (), L(), $B_{L}$ , memcpy() ) generic architecture + packet-switched network Packet-oriented;
explicitly exposes network constraint;
facilitates communication overlapping