Load Balancing in Parallel Computers: Theory and Practice is about the essential software technique of load balancing in distributed memory message-passing parallel computers, also called multicomputers. Each processor has its own address space and has to communicate with other processors by message passing. In general, a direct, point-to-point interconnection network is used for the communications. Many commercial parallel computers are of this class, including the Intel Paragon, the Thinking Machines CM-5, and the IBM SP2.
Load Balancing in Parallel Computers: Theory and Practice presents a comprehensive treatment of the subject using rigorous mathematical analyses and practical implementations. The focus is on nearest-neighbor load balancing methods, in which every processor at every step is restricted to balancing its workload with its direct neighbors only. Nearest-neighbor methods are iterative in nature, because a global balanced state is reached through processors' successive local operations. Since nearest-neighbor methods have a relatively relaxed requirement for the spread of local load information across the system, they are flexible in allowing one to control the balancing quality, effective at preserving communication locality, and easily scaled in parallel computers with a direct communication network.
Load Balancing in Parallel Computers: Theory and Practice serves as an excellent reference source and may be used as a text for advanced courses on the subject.
The raw power of computers has kept increasing by leaps and bounds, but our ability to harness that power has not kept pace. We are perhaps too accustomed to solving problems sequentially, especially when using the computer. The gap must be bridged by advanced software. A huge amount of effort has been devoted by researchers worldwide to the development of software techniques for parallel computing. These researchers all share the common goal of making the use of parallel computers much less formidable and enabling the user to fully exploit their power. One such essential software technique is load balancing, which is the subject of this book. Load balancing aims at improving the performance of parallel computers by equalizing the workloads of processors automatically during the execution of parallel programs.
This book is about load balancing in distributed memory message-passing parallel computers, also called multicomputers. Each processor has its own address space and has to communicate with other processors by message passing. In general, a direct, point-to-point interconnection network is used for the communications. Many commercial parallel computers are of this class, including the Intel Paragon, the Thinking Machines CM-5, and the IBM SP2. This book presents a comprehensive treatment of the subject using rigorous mathematical analyses and practical implementations. The focus is on nearest-neighbor load balancing methods, in which every processor at every step is restricted to balancing its workload with its direct neighbors only. Nearest-neighbor methods are iterative in nature, because a global balanced state can be reached through processors' successive local operations. Since nearest-neighbor methods have a relatively relaxed requirement for the spread of local load information across the system, they are flexible in allowing one to control the balancing quality, effective at preserving communication locality, and easily scaled in parallel computers with a direct communication network.
In the design and analysis of nearest-neighbor load balancing algorithms, the two most important performance metrics are stability and efficiency. Stability measures the ability of the algorithm to coerce any initial workload distribution into a global balanced state in the static workload model, and the ability to bound the variance of processors' workloads in the dynamic workload model. Efficiency measures the time cost of arriving at the global balanced state or of reducing the variance to a certain level. The objective of this work is to design nearest-neighbor algorithms that have good stability and efficiency characteristics.
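These two metrics can be made concrete with a small simulation harness. The following sketch is our own illustration, not code from the book, and all names in it are hypothetical: the maximum deviation from the global average measures distance from the balanced state, and the number of iterations needed to push that deviation below a tolerance is a simple proxy for efficiency (failure to converge signals instability).

```python
def imbalance(loads):
    """Maximum deviation of any processor's load from the global average."""
    avg = sum(loads) / len(loads)
    return max(abs(w - avg) for w in loads)

def steps_to_balance(loads, step, tol=1e-6, max_steps=10_000):
    """Apply the balancing iteration `step` until the imbalance drops
    below `tol`; the iteration count is an efficiency measure.
    Returning None signals that the iteration failed to converge."""
    for k in range(max_steps):
        if imbalance(loads) < tol:
            return k
        loads = step(loads)
    return None

# An illustrative local-averaging step on a ring of processors: each
# processor replaces its load by the mean of itself and its two neighbors.
def ring_average_step(loads):
    n = len(loads)
    return [(loads[(i - 1) % n] + loads[i] + loads[(i + 1) % n]) / 3
            for i in range(n)]

print(steps_to_balance([10.0, 0.0, 0.0, 0.0], ring_average_step))
```

Any nearest-neighbor scheme can be plugged in as `step`, so the same harness compares different methods under the same stability and efficiency yardsticks.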
Two of the best-known nearest-neighbor load balancing algorithms are the dimension exchange and the diffusion methods. With the dimension exchange method, a processor balances its workload with its nearest neighbors one at a time, taking them in turn. With the diffusion method, a processor communicates simultaneously with all its nearest neighbors in order to reach a local balance. These two methods are rigorously analyzed in this book, resulting in optimal tunings of the methods for a number of popular interconnection networks. On the practical side, the two methods are implemented on multicomputers with different characteristics and evaluated in applications with different behaviors. The methods are shown to be both effective and efficient.
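As a concrete illustration (our own sketch, not the book's code), here is one sweep of the dimension exchange method simulated sequentially on a hypercube, the network for which the method is most naturally stated. The exchange parameter `lam` controls how each pair splits its loads; `lam = 0.5` reduces to pairwise averaging, which balances a hypercube exactly in a single sweep.

```python
def dimension_exchange_sweep(loads, lam=0.5):
    """One sweep of dimension exchange on a hypercube of 2**d processors.

    In dimension k, processor i pairs with neighbor j = i XOR 2**k and the
    pair rebalances:  w_i <- (1 - lam)*w_i + lam*w_j  (and symmetrically
    for w_j), so lam = 0.5 is plain pairwise averaging.
    """
    n = len(loads)
    d = n.bit_length() - 1
    assert 1 << d == n, "a hypercube needs a power-of-two processor count"
    loads = list(loads)
    for k in range(d):               # one dimension at a time
        for i in range(n):
            j = i ^ (1 << k)
            if i < j:                # visit each pair once
                wi, wj = loads[i], loads[j]
                loads[i] = (1 - lam) * wi + lam * wj
                loads[j] = lam * wi + (1 - lam) * wj
    return loads

# With lam = 0.5, one full sweep balances the hypercube exactly:
print(dimension_exchange_sweep([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

On a real multicomputer the pair exchanges within one dimension happen in parallel; the sequential loop here only mimics that schedule for study.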
For other popular networks (the ring, the chain, the mesh, the torus, and the k-ary n-cube), we derive the optimal exchange parameters in closed form and establish several important relationships between the efficiencies of these structures using circulant matrix theory. Based on these relationships, we conclude that the dimension exchange method favors high-dimensional networks.
With the diffusion method, a processor balances its workload with those of all its nearest neighbors at the same time, rather than one by one as in the dimension exchange method. Its efficiency depends on a diffusion parameter, which characterizes the behavior of a local balance operation. We analyze the diffusion method using circulant matrix theory and derive the optimal values of the diffusion parameter for the k-ary n-cube and its variants. Through statistical simulation, we show significant improvements due to the optimal exchange and diffusion parameters.
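In symbols, one diffusion step updates every processor as w_i <- w_i + alpha * sum over neighbors j of (w_j - w_i), where alpha is the diffusion parameter. The sketch below is our own illustrative code, not the book's; alpha = 1/3 is merely a workable choice for this small ring, not a claimed optimum.

```python
def diffusion_step(loads, neighbors, alpha):
    """One synchronous diffusion step:
    w_i <- w_i + alpha * sum_{j in N(i)} (w_j - w_i)."""
    return [loads[i] + alpha * sum(loads[j] - loads[i] for j in neighbors[i])
            for i in range(len(loads))]

# A six-processor ring: each processor's neighbor list.
n = 6
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
loads = [12.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(30):
    loads = diffusion_step(loads, ring, alpha=1 / 3)
print([round(w, 3) for w in loads])  # every load approaches the average, 2.0
```

Because every pairwise transfer is antisymmetric, the total workload is conserved at each step, which is why the iteration can only converge to the global average.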
Furthermore, we analyze the dimension exchange and diffusion methods under different workload models and system characteristics. We show that the optimally tuned dimension exchange algorithm outperforms the diffusion method in achieving a global balanced state in both the one-port and the all-port communication models. The strength of the diffusion method is in load sharing (i.e., keeping all processors busy without necessarily balancing their loads) in the all-port communication model.
The last application is parallel combinatorial optimization. We experiment with the dimension exchange and the diffusion methods for distributing dynamically generated workloads at run-time. Their performance is evaluated in the solution of set partitioning problems on two distributed memory parallel computers. Both methods are found to lead to an almost linear speedup in a system with 32 processors, and to a speedup of 146.8 in a system with 256 processors; they give the best results among all the methods we tried.
The authors are very grateful to Burkhard Monien, Ralf Diekmann, Reinhard Lüling and Stefan Tschoeke for their valuable comments and contributions to both the theoretical and the experimental aspects of this research. Thanks also go to Erich Koester and other associates of AG-Monien group for their considerate arrangements in both academic and non-academic affairs while the first author was in Germany.
Many other people have contributed to this book. We thank Professor Kai Hwang for the foreword, and Professors Dimitri P. Bertsekas, Tony Chan, Vipin Chaudhary, Henry Cheung, Francis Chin, Andrew Choi, George Cybenko, F. Meyer auf der Heide, David Nassimi, and Loren Schwiebert for their valuable input at various stages of this project. Special thanks go to our families, who suffered through many long nights of being neglected; their love, patience, and support have meant a lot to us. To Jiwen, who assisted us wholeheartedly from beginning to end, we are greatly indebted.