How to achieve a good Linpack performance result?
HKU SRG Experiences
Performance benchmarking is an art in itself. Our suggestions may not
necessarily help you achieve the best performance on your own cluster, but we
will be happy to work with you to find a better solution. If you have
other methods that could help us further improve the Linpack performance on
Gideon, they are most welcome.
Prepared by Sit Yiu Fai (Nov. 12, 2002)
Here is a guide to running Linpack on Gideon:
Things needed:
- A good compiler
- MPI
- BLAS library
- HPL source (Linpack)
1) Some people report that the code generated by newer versions of gcc (>2.96) is significantly slower (>10% in some cases) than that of previous
versions. Some distributions, e.g. RedHat 7.3, ship these inefficient compilers, so an older version of gcc has to be
installed. The Intel Compiler is also a good choice.
2) The most common MPI implementations are MPICH and LAM/MPI. LAM/MPI should have better
performance than MPICH, and it was used in our test. GAMMA/MPI could outperform LAM/MPI, but the maximum number of nodes is
about 120 (max. 252 active ports; each process needs 1 active port for
short messages and 1 for long messages).
3) The BLAS library used is an auto-tuned library called ATLAS, which tunes the parameters of the computation to fit the CPU architecture. People
from DELL said that the optimization done by ATLAS is so complete that you
won't see major performance improvement from further compiler-based
optimization; using gcc or the Intel Compiler makes no difference in the performance of
the generated code.
4) HPL is the source for the Linpack benchmark; it needs MPI and a BLAS library. This is the easiest part once MPI and ATLAS are installed (an example build and run sequence is sketched below).
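For illustration, a minimal build and run sequence might look as follows. The install paths, architecture name, and process count are assumptions for this sketch and will differ on your cluster; the variable names (CC, MPdir, MPlib, LAdir, LAlib, HPL_OPTS) are those used in the HPL Make.<arch> templates.

# Sketch of building HPL against LAM/MPI and ATLAS (paths below are assumed)
cd ~/hpl
cp setup/Make.Linux_PII_CBLAS Make.Linux_ATLAS
# Edit Make.Linux_ATLAS; the relevant variables are roughly:
#   CC       = gcc                       (or icc for the Intel Compiler)
#   MPdir    = /usr/local/lam            (assumed LAM/MPI install prefix)
#   MPinc    = -I$(MPdir)/include
#   MPlib    = $(MPdir)/lib/libmpi.a $(MPdir)/lib/liblam.a
#   LAdir    = /usr/local/atlas/lib      (assumed ATLAS install prefix)
#   LAlib    = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
#   HPL_OPTS = -DHPL_CALL_CBLAS          (call ATLAS through its C interface)
make arch=Linux_ATLAS
# Run under LAM/MPI (assumed 64 processes; ~/lamhosts lists the nodes)
cd bin/Linux_ATLAS
lamboot -v ~/lamhosts
mpirun -np 64 ./xhpl
lamhalt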
Tuning of Linpack
There are many parameters in the benchmark. Some of them have a significant influence on the resultant performance, while others have little. For
Gideon (and probably other clusters with fast CPUs but a slow network), the most important parameters are the problem size (N) and the block size (NB). The larger
the problem size, the better the performance; this parameter is limited by the available memory in the systems. For the block size,
a multiple of 80 seems to be the best.
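As a rough illustration of how memory limits N (the node count and memory size below are assumed figures, not Gideon's actual configuration), the N x N matrix of 8-byte doubles should take up roughly 80% of the total memory:

# Rough sizing sketch, assuming 64 nodes with 1 GB of memory each
echo 'sqrt(0.80 * 64 * 1024^3 / 8)' | bc -l        # ~= 82897
# Round N down to a multiple of the chosen NB, e.g. 82880 for NB = 80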
Another parameter to look at is BCAST, which specifies how messages are broadcast in the cluster. For Gideon and other similar Fast
Ethernet clusters, 1, 4, and 5 should be the best choices.
Look-ahead depth and the swapping threshold have some minor impact on the performance, too.
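These parameters are all set in the HPL.dat input file. The relevant lines might look like the excerpt below (the values shown are illustrative, not the exact settings used on Gideon):

1            # of problems sizes (N)
82880        Ns
1            # of NBs
80           NBs
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold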
TCP also has some parameters to tune, and these have some influence on the result too. What one needs to do is increase the buffer sizes for
TCP connections:
echo 8388608 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
echo 65536 > /proc/sys/net/core/rmem_default
echo 65536 > /proc/sys/net/core/wmem_default
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "4194304 4194304 4194304" > /proc/sys/net/ipv4/tcp_mem
echo 1 > /proc/sys/net/ipv4/route/flush
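These values are reset on reboot and must be applied on every compute node. One way to make them persistent (assuming root access on each node) is to add the equivalent entries to /etc/sysctl.conf, which are loaded at boot or with sysctl -p:

# /etc/sysctl.conf entries equivalent to the echo commands above
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
net.ipv4.tcp_mem = 4194304 4194304 4194304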
Created by Cho-Li Wang, Nov. 16, 2002