How to achieve a good Linpack performance result?
HKU SRG Experiences
Performance benchmarking is an art in itself. Our suggestions may not
necessarily help you achieve the best performance on your own cluster, but we
will be happy to work with you to find a better solution. If you have
other methods that could help us further improve the Linpack performance on
Gideon, they are most welcome.
Prepared by Sit Yiu Fai (Nov. 12, 2002)
Here is a guide to running Linpack on Gideon:
Things needed:
- A good compiler
- MPI
- BLAS library
- HPL source (Linpack)
1) Some people report that the code generated by newer versions of gcc (>2.96) is significantly slower (>10% in some cases) than that of previous
versions. Some distributions, e.g. RedHat 7.3, ship these inefficient compilers, so an older version of gcc has to be
installed. The Intel Compiler is also a good choice.
2) The most common MPI implementations are MPICH and LAM/MPI. LAM/MPI should have better
performance than MPICH, and it was used in our test. GAMMA/MPI could outperform LAM/MPI, but the maximum number of nodes is
about 120 (max. 252 active ports; each process needs 1 active port for
short messages and 1 for long messages).
3) The BLAS library used is an auto-tuned library called ATLAS, which tunes the parameters of the computation to fit the CPU architecture. People
from DELL said that the optimization done by ATLAS is so complete that you
won't see major performance improvement from further compiler-based
optimization; using gcc or the Intel Compiler makes no difference in the performance of
the generated code.
4) HPL is the source for the Linpack benchmark; it needs MPI and a BLAS library. This is the easiest part once MPI and ATLAS are installed (an example build and run sequence is sketched below).
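For illustration, a minimal build and run sequence might look as follows. The install paths, architecture name, and process count are assumptions for this sketch and will differ on your cluster; the variable names (CC, MPdir, MPlib, LAdir, LAlib, HPL_OPTS) are those used in the HPL Make.<arch> templates.

# Sketch of building HPL against LAM/MPI and ATLAS (paths below are assumed)
cd ~/hpl
cp setup/Make.Linux_PII_CBLAS Make.Linux_ATLAS
# Edit Make.Linux_ATLAS; the relevant variables are roughly:
#   CC       = gcc                       (or icc for the Intel Compiler)
#   MPdir    = /usr/local/lam            (assumed LAM/MPI install prefix)
#   MPinc    = -I$(MPdir)/include
#   MPlib    = $(MPdir)/lib/libmpi.a $(MPdir)/lib/liblam.a
#   LAdir    = /usr/local/atlas/lib      (assumed ATLAS install prefix)
#   LAlib    = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
#   HPL_OPTS = -DHPL_CALL_CBLAS          (call ATLAS through its C interface)
make arch=Linux_ATLAS
# Run under LAM/MPI (assumed 64 processes; ~/lamhosts lists the nodes)
cd bin/Linux_ATLAS
lamboot -v ~/lamhosts
mpirun -np 64 ./xhpl
lamhalt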
Tuning of Linpack
There are many parameters in the benchmark. Some of them have a significant influence on the resultant performance, while others have little. For
Gideon (and probably other clusters with fast CPUs but a slow network), the most important parameters are the problem size (N) and the block size (NB). The larger
the problem size, the better the performance; this parameter is limited by the available memory in the systems. For the block size,
a multiple of 80 seems to be the best.
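As a rough illustration of how memory limits N (the node count and memory size below are assumed figures, not Gideon's actual configuration), the N x N matrix of 8-byte doubles should take up roughly 80% of the total memory:

# Rough sizing sketch, assuming 64 nodes with 1 GB of memory each
echo 'sqrt(0.80 * 64 * 1024^3 / 8)' | bc -l        # ~= 82897
# Round N down to a multiple of the chosen NB, e.g. 82880 for NB = 80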
Another parameter to look at is BCAST, which specifies how messages are broadcast in the cluster. For Gideon and other similar Fast
Ethernet clusters, 1, 4, and 5 should be the best choices.
Look-ahead depth and the swapping threshold have some minor impact on the performance, too.
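These parameters are all set in the HPL.dat input file. The relevant lines might look like the excerpt below (the values shown are illustrative, not the exact settings used on Gideon):

1            # of problems sizes (N)
82880        Ns
1            # of NBs
80           NBs
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold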
TCP also has some parameters to tune, and these have some influence on the result too. What one needs to do is increase the buffer sizes for
TCP connections:
echo 8388608 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
echo 65536 > /proc/sys/net/core/rmem_default
echo 65536 > /proc/sys/net/core/wmem_default
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "4194304 4194304 4194304" > /proc/sys/net/ipv4/tcp_mem
echo 1 > /proc/sys/net/ipv4/route/flush
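These values are reset on reboot and must be applied on every compute node. One way to make them persistent (assuming root access on each node) is to add the equivalent entries to /etc/sysctl.conf, which are loaded at boot or with sysctl -p:

# /etc/sysctl.conf entries equivalent to the echo commands above
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
net.ipv4.tcp_mem = 4194304 4194304 4194304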
Created by Cho-Li Wang, Nov. 16, 2002