FuzzyCmeans case
Section contributed by UVT
During the migration of the Fuzzy C-means code to BlueGene/P several problems had to be addressed in order to ensure a proper scalability and to generate an optimized code (at least partially).
The first problem we encountered was a significant initial loss of efficiency when using a large number of processing units. The cause proved to be the image partitioning style, which induced intensive communication between processors. In the first BlueGene/P experiments, we observed a significant difference between the total time corresponding to a grid-like partitioning and that corresponding to a horizontal (or vertical) partitioning. This difference was mainly caused by the time spent in send/receive operations, which was influenced both by the amount of data exchanged between processors holding neighboring image slices and by the topology of the communicating processors. For instance, with 1024 processors, a 32 × 32 partitioning led to a total of 2603 bytes transferred by send/receive operations, while a 2 × 512 partitioning required 22042 bytes. With 512 processors, a 16 × 32 partitioning led to around 3834 bytes transferred between processors, while the horizontal partitioning (1 × 512) led to the transfer of 43930 bytes. We therefore changed the image partitioning style to reduce the cost of the communication operations as much as possible, as illustrated by the sketch below.
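To illustrate the effect of the partitioning style, the following minimal sketch (not the project's actual code; the image size is a hypothetical parameter) estimates the halo exchanged by one interior tile under a p × q decomposition, assuming one-pixel-wide borders on each side. It is only meant to show why a square process grid keeps the per-processor boundary much shorter than a strip partitioning, not to reproduce the figures quoted above.

#include <stdio.h>

/* Halo size (in pixels) exchanged by one interior tile of an H x W image
   decomposed over a p_rows x q_cols process grid, assuming one-pixel-wide
   borders on all four sides of the tile. */
static long halo_per_tile(long height, long width, int p_rows, int q_cols)
{
    return 2L * (height / p_rows + width / q_cols);
}

int main(void)
{
    const long H = 1024, W = 1024;   /* hypothetical image size */

    /* 1024 processors: square grid vs. nearly-strip partitioning */
    printf("32 x 32 : %ld border pixels per tile\n", halo_per_tile(H, W, 32, 32));
    printf("2 x 512 : %ld border pixels per tile\n", halo_per_tile(H, W, 2, 512));

    /* 512 processors: grid vs. horizontal strips */
    printf("16 x 32 : %ld border pixels per tile\n", halo_per_tile(H, W, 16, 32));
    printf("1 x 512 : %ld border pixels per tile\n", halo_per_tile(H, W, 1, 512));
    return 0;
}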
On the other hand, the migration to BlueGene/P required identifying several ways to optimize the generated code. As a first optimization stage, we compiled the source code with the floating point unit enabled and some minimal memory primitives, for better memory management and access. This increased performance by almost 60% (using the “-O3 -qarch=450d” flags). The second optimization stage enabled vector routine optimizations, the high-order transformation module and inter-procedural analysis (which improves memory exchange performance across procedures). The flags used in this stage were “-O3 -qhot -qipa=level=2 -qarch=450d”. This led to a further performance improvement of about 60% for standard operations and of more than 1000% for vector operations (especially for images with a large number of spectral bands).
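The loops that benefit most from the second optimization stage are the per-pixel operations over the spectral bands, such as the distance computation between a pixel and a cluster centre. The sketch below is a simplified, illustrative version of such a loop (the names and types are assumptions, not taken from the actual Fuzzy C-means sources); with “-qhot” the compiler can unroll and vectorize it, which is where the large gains on many-band images come from.

/* Squared Euclidean distance in spectral space between a pixel and a
   cluster centre; an illustrative example of the kind of loop that the
   vectorizing optimizations accelerate on images with many bands. */
static double spectral_distance2(const double *pixel,
                                 const double *centre,
                                 int n_bands)
{
    double d2 = 0.0;
    for (int b = 0; b < n_bands; ++b) {
        double diff = pixel[b] - centre[b];
        d2 += diff * diff;
    }
    return d2;
}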
In the experiments conducted so far, the high-speed interconnect of BlueGene/P was configured in its full-mesh topology in order to have a network specification similar to that of the InfraGrid cluster. Although we configured the same topology, the BlueGene/P full mesh allows a node to communicate directly with six other nodes (at a distance of one hop) and with the remaining neighbors over two or more hops (depending on the position of the neighboring compute node in the rack). Up to now, we have not tried to optimize the mapping between the image slices and the CPUs; the mapping was controlled by the BlueGene/P management system, which in some cases can add extra overhead on the communication side.
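Should the slice-to-CPU mapping be tuned in the future, one standard way to express it is an MPI Cartesian communicator with rank reordering enabled, which lets the MPI runtime place neighboring tiles on nodes that are close in the physical mesh. The fragment below is a sketch of this idea only, not part of the current code, which leaves the mapping to the BlueGene/P system.

#include <mpi.h>

/* Create a 2D process grid for a p_rows x q_cols image decomposition.
   With reorder = 1 the MPI runtime may permute ranks so that grid
   neighbors are also close in the physical interconnect. */
void create_process_grid(int p_rows, int q_cols, MPI_Comm *grid_comm)
{
    int dims[2]    = { p_rows, q_cols };
    int periods[2] = { 0, 0 };   /* non-periodic image decomposition */
    int reorder    = 1;          /* allow rank reordering onto the network */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, grid_comm);
}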