Topology-aware Optimization of Communications
for Parallel Matrix Multiplication
on Hierarchical Heterogeneous HPC Platforms
Tania Malik, Vladimir Rychkov, Alexey Lastovetsky
Jean-Noël Quintin
School of Computer Science and Informatics
University College Dublin
Belfield, Dublin 4, Ireland
tania.malik@ucdconnect.ie, {vladimir.rychkov, alexey.lastovetsky}@ucd.ie
Extreme Computing R&D
Bull SAS
Paris, France
jean-noel.quintin@bull.net
Abstract—Communications on hierarchical heterogeneous
HPC platforms can be optimized based on topology information.
For MPI, as a major programming tool for such platforms, a
number of topology-aware implementations of collective operations have been proposed for optimal scheduling of messages.
This approach improves communication performance and does
not require modification of the application source code. However, it is
applicable to collective operations only and does not affect
the parts of the application that are based on point-to-point
exchanges. In this paper, we address the problem of efficient
execution of data-parallel applications on interconnected clusters
and present a topology-aware optimization that improves data
partition by taking into account the entire communication flow
of the application. This approach is also non-intrusive to the
source code but application-specific. For illustration, we use
parallel matrix multiplication, where the matrices are partitioned
into irregular 2D rectangles assigned to different processors
and arranged in columns, and the processors communicate over
this partitioning vertically and horizontally. By rearranging the
rectangles, we can minimize communications between different
levels of the network hierarchy. Finding the optimal arrangement
is NP-complete, therefore, we propose a heuristic based on
evaluation of the communication flow on the given topology.
We demonstrate the correctness and efficiency of the proposed
approach by experimental results on multicore nodes and interconnected heterogeneous clusters.
Index Terms—heterogeneous clusters; topology-aware communications; data partitioning; matrix multiplication
I. INTRODUCTION
Modern HPC platforms are composed of highly heterogeneous computing devices connected by a complex hierarchical
network. To execute data-parallel scientific applications efficiently on such platforms, it is necessary to balance the load of
processors and to minimize the cost of communications. The
former can be achieved by partitioning data between the processors in proportion to their speed. The latter can be achieved
by reducing the number and volume of communications, by
optimal mapping of the data to the processors, and by optimal
scheduling of communications. In this work, we optimize the
communication performance, assuming that the data have been
optimally partitioned between the processors so that the total
volume of communicated data is minimized.
Communications between the processes of parallel applications executed on heterogeneous platforms suffer from multiple message hops, non-optimal routes and traffic congestion,
which significantly affect performance. Communication operations can be optimized if topology information is available. Indeed, topology information has been used for optimal scheduling of messages in MPI collective communication operations
on heterogeneous HPC platforms [1], [2], [3]. However,
parallel applications based on point-to-point exchanges need
another solution, which could take into account the application
communication flow. If all point-to-point communications are
performed over a virtual topology of processes, it can be
optimally mapped onto the physical topology of processors, and
this will minimize communication cost of the application [4],
[5], [6].
In this work, we target dedicated heterogeneous HPC platforms with two-level network hierarchy, such as interconnected clusters. The processors of such platforms are connected by a network that can be represented as a two-level
rooted tree with faster communications within sub-trees and
slower communications between them. This topology information
can be taken into account, when the application data is mapped
to the processors, in order to minimize message hops and
reduce network contention.
To optimize communications in scientific data-parallel MPI
applications, we take into account both topology information
and application communication flow. Performance of data-parallel applications, especially those designed for heterogeneous platforms, highly depends on a balanced workload, which
is achieved by partitioning the data in proportion to the speed
of processors. In turn, heterogeneous data partitioning
affects the application communication flow and has to be
taken into account in topology-aware optimization. Assuming
that the workload is balanced among the processors, we
propose to rearrange the given heterogeneous data partition
in order to reduce the number of message hops and network
contention. This rearrangement is based on network topology
and communication flow of the application. This approach is
also non-intrusive to the source code but application-specific.
As a case study, we consider a parallel matrix multiplication application for heterogeneous platforms that is based
on the Scalable Universal Matrix Multiplication Algorithm
(SUMMA) [7]. SUMMA is designed for homogeneous multiprocessors and implemented using MPI. It distributes the
workload evenly between the processors, mapping dense
matrices onto a 2D grid of processors. Each processor receives one rectangle of matrices and participates in two MPI
communicators that combine all processors in the same row
and column. The communication flow consists of multiple
broadcasts of matrix elements over these communicators. If
SUMMA is executed on a hierarchical network of processors,
its performance may be lower than expected due to higher
communication cost. When network topology is known, this
problem can be solved by using topology-aware broadcasts.
In [8] and [9], SUMMA was adapted for heterogeneous platforms, with matrices being partitioned into irregular 2D rectangles in proportion to the speed of processors. The rectangles,
and hence the processors, are arranged in columns. In columns,
the processors communicate the same way as in the original
algorithm. In the horizontal direction, the partitioning, and hence
the communication flow, is irregular. Usually, irregular communications between processors are implemented via point-to-point operations. Non-blocking point-to-point operations
additionally allow for overlapping communications and computations, which can significantly improve the performance of
heterogeneous algorithms. For parallel applications based on
point-to-point exchanges, like heterogeneous SUMMA, there
has been no solution proposed yet that could use topology
information to minimize communication cost.
In this work, we propose to rearrange the rectangles of
the matrix partition in order to minimize communications
between different levels of the network hierarchy. Finding
the optimal arrangement is an NP-complete combinatorial
optimization problem, therefore, it can be solved by using
heuristics. We propose a heuristic based on evaluation of
the application communication flow on the given network
topology. We demonstrate the accuracy and efficiency of the
proposed solution in experiments on interconnected clusters.
The rest of the paper is organized as follows. In Section II,
we briefly describe parallel matrix multiplication algorithms
for heterogeneous platforms, and then we overview existing
approaches in topology-aware optimization of communications
for MPI applications. In Section III, we formulate the problem
of topology-aware optimization of communications in parallel
matrix multiplication in terms of application communication
flow and network topology. In Section IV, we present a
heuristic that finds the communication-optimal matrix partition. In Section V, we evaluate the heuristic in experiments
on computing clusters with two-level network hierarchy.
II. RELATED WORK
In this section, we describe parallel matrix multiplication
algorithms based on SUMMA, with the emphasis on their
communication flow. We consider these algorithms because
of their applicability to a wide range of HPC platforms. These
algorithms can be executed on platforms whose processors do not form
a 2D grid. The workload in these algorithms
can be balanced by irregular matrix partitioning, proportional
to the speed of processors. The volume of communications
can be minimized. Communication performance of parallel
matrix multiplication on modern hierarchical HPC platforms
can be improved further by taking into account information
about network topology. However, to the best of the authors’
knowledge, all existing modifications of SUMMA are topology-unaware.
In this section, we also overview related work on topology-aware optimization of communications. In this area, the main
directions are topology-aware MPI collectives and virtual MPI
topologies. These approaches are quite generic and applicable
to certain classes of parallel applications (which are based on
collective operations) and HPC platforms (for example, BlueGene/L). However, they cannot be applied to heterogeneous
matrix multiplication algorithms. In this work, we formulate
the problem of optimization of communications in terms of
application communication flow and network topology.
A. Parallel Matrix Multiplication on Heterogeneous Platforms
The Scalable Universal Matrix Multiplication Algorithm
(SUMMA) [7] is designed for homogeneous platforms and
implements parallel matrix multiplication C = A × B. In
this algorithm, dense matrices are partitioned over a 2D
grid of processors. Each processor is a part of two MPI
communicators that combine all processors in the same row
and column. To take advantage of processor cache, a blocking
factor, b, has been introduced, so that each matrix consists
of b × b blocks. The algorithm iterates over the columns of
blocks of matrix A and over the rows of blocks of matrix
B. At each iteration, a column of blocks (the pivot column),
A^(b), is broadcast horizontally, and a row of blocks (the pivot
row), B^(b), is broadcast vertically. Then, matrix C is updated
on all processors in parallel: C_i += A^(b) × B^(b). At the end
of each iteration, the pivot column and row move horizontally
and vertically respectively.
The update operation can be performed efficiently by invoking a highly optimized general matrix multiplication (GEMM)
routine, available for most HPC platforms. This operation can
be considered as a computation kernel of the application because it represents the computation performance of the entire
application. Fig. 1 shows the communication flow of SUMMA,
which consists of the broadcasts in the row and column
communicators. The broadcasts pass the pivot column and row
in rings, pipelining computations and communications.
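To make the iteration structure concrete, the following sketch simulates one SUMMA pass serially in Python/NumPy. It is illustrative only: in the real algorithm the pivot column and row are broadcast over the row and column MPI communicators rather than sliced from full local copies of A and B.

import numpy as np

def summa_sketch(A, B, b):
    """C = A x B accumulated as one rank-b update per block step."""
    n = A.shape[0]                 # assume square n x n matrices with b | n
    C = np.zeros((n, n))
    for k in range(0, n, b):
        A_piv = A[:, k:k+b]        # pivot column of blocks (broadcast along rows)
        B_piv = B[k:k+b, :]        # pivot row of blocks (broadcast along columns)
        C += A_piv @ B_piv         # local update: C_i += A^(b) x B^(b)
    return C

n, b = 8, 2
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_sketch(A, B, b), A @ B)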
Heterogeneous modifications of SUMMA are based on the
approach to optimization of linear algebra computations on
heterogeneous platforms proposed in [10]. In this approach, to
balance the load of heterogeneous processors, the matrices are
partitioned into uneven rectangles such that faster processors
will process larger rectangles. Ideally, this way each processor
should receive the workload proportional to its computing
power. In the case of heterogeneous SUMMA, the amount of
computations related to the i-th rectangle will be proportional
to d_i, the number of blocks it contains.

Fig. 1: Communication flow of SUMMA

A number of efficient
matrix partitioning algorithms have been proposed [10], [8],
[11], [12] returning rectangular matrix partitions of different
shapes. However, the most popular heterogeneous matrix multiplication algorithms implement column-based partitioning,
when processors are arranged into columns, and all processors
in a column are allocated rectangles of the same width. The
widths of all the columns sum to the width of the matrix.
The heights of rectangles in a column sum to the height of
the matrix. More elaborate irregular matrix partitioning, such
as [13], is out of the scope of this paper. An overview of the
heterogeneous column-based algorithms is presented in [9].
The column-based algorithms have evolved in two main directions: minimization of the volume of communications, and
data partitioning based on accurate computation performance
models of processors.
In [8], the algorithm minimizing the total volume of communication was presented. It arranges the processes into columns
and sets the rectangles' dimensions (m_i, n_i), using the relative
cycle times of processors as input. The total volume of
communication is proportional to the sum of half-perimeters
Σ_{i=1}^{p} (m_i + n_i). The shape and ordering of rectangles are
calculated to minimize this sum. The algorithm returns the
optimal number of columns, the optimal number of rectangles
in each column and the optimal dimensions of rectangles. The
resulting rectangles are sorted in the order of increasing area,
d_i = m_i × n_i, with the shape as square as possible.
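As a hedged illustration of the quantity the BR partitioner minimizes, the following Python sketch computes the half-perimeter sum for two hypothetical partitions of an 8 × 8 block matrix among four equal-speed processors; the rectangle dimensions are invented for the example.

def total_comm_volume(rects):
    # rects: list of (m_i, n_i) rectangle dimensions, in b x b blocks;
    # the total communication volume is proportional to sum(m_i + n_i).
    return sum(m + n for m, n in rects)

square_like = [(4, 4)] * 4   # 2 x 2 grid of squares on an 8 x 8 block matrix
strips      = [(8, 2)] * 4   # full-height column strips of the same area
print(total_comm_volume(square_like))  # 32
print(total_comm_volume(strips))       # 40: squarer shapes communicate less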
Fig. 2 describes the communication flow of the heterogeneous SUMMA presented in [8], which will be denoted by BR
for the rest of the text. It consists of one-to-all non-blocking
point-to-point communications in the horizontal and vertical directions.

Fig. 2: Communication flow of heterogeneous SUMMA: one-to-all

In the horizontal direction, these communications are
irregular: each processor holding a part of the pivot column
sends multiple messages of different sizes to all processors
in the horizontal direction whose rectangles overlap with
the sender’s rectangle. The size of each message is equal
to the block size times the height of the overlap between
the sender and receiver. In other words, the overlap is the
maximum part of the pivot column required on the receiver
to perform its local update operation. It should be noted that
this communication pattern does not scale as the number of
communicating processors increases.
The main shortcoming of the BR algorithm is that it
uses a simplistic performance model of processors, where processor speed is represented by a single positive number.
This approach may fail to balance the load, especially for
highly heterogeneous platforms and self-adaptable applications. A more reliable solution is data partitioning based on
accurate performance models, such as functional performance
model (FPM) [14]. It is built empirically and integrates many
important features characterizing the performance of both the
architecture and the application.
Under the functional performance model, the speed of each
process is represented by a continuous function of problem
size. The speed is defined as the number of computation units
performed by the process per one time unit. The computation
unit can be defined differently for different applications, but it
is required not to vary during the execution of the application.
For SUMMA-based matrix multiplication, it can be defined as
one update of one b × b matrix block: C_{b×b} += A_{b×b} × B_{b×b}.
In this case, the problem size assigned to a process is given
by the number of b × b blocks. The amount of computations
assigned to the process is proportional to the area of the
rectangle formed by these blocks.
The processor speed is found experimentally by measuring the execution time over a range of problem sizes. This
time can be found by benchmarking the full application.
This benchmarking can be done more efficiently by using
a serial code whose speed of execution is the same
as that of the application but whose execution time
is significantly less. A benchmark made of one such core
computation can be representative of the performance of the
whole application and can be used as a kernel. The speed
function of the application can be built more efficiently by
timing this kernel. For SUMMA-based matrix multiplication,
one update of a rectangle, C_i += A^(b) × B^(b), implemented
by highly optimized GEMM and performed many times for
different pivot rows and columns, can be used as a kernel.
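A minimal sketch of how such a speed function might be sampled, assuming NumPy's GEMM-backed matrix multiply as the kernel; the function name and the sampled sizes are illustrative, not the FuPerMod API.

import time
import numpy as np

def kernel_speed(m, n, b, reps=5):
    # Speed, in b x b block updates per second, of one rectangle update
    # C += A_piv x B_piv on an m x n rectangle of b x b blocks.
    C = np.zeros((m * b, n * b))
    A_piv = np.random.rand(m * b, b)   # local part of the pivot column
    B_piv = np.random.rand(b, n * b)   # local part of the pivot row
    t0 = time.perf_counter()
    for _ in range(reps):
        C += A_piv @ B_piv             # highly optimized GEMM under the hood
    elapsed = (time.perf_counter() - t0) / reps
    return (m * n) / elapsed           # computation units per time unit

# One point of the speed function per problem size (area in blocks):
speed_function = {m * m: kernel_speed(m, m, b=64) for m in (2, 4, 8, 16)}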
The problem of data partitioning using functional performance models was formulated and solved in [14]. In [9], FPM-based data partitioning was applied to the BR algorithm. We
will refer to this modification of matrix multiplication as FPM-BR for the rest of the text. For the FPM-BR algorithm, another
communication scheme was implemented, which consists of
non-blocking point-to-point communications in rings, in the horizontal and vertical directions (see Fig. 3).

Fig. 3: Communication flow of heterogeneous SUMMA: ring

In contrast to the
one-to-all communication flow, each processor communicates
only with the processors from its neighbouring columns and
rows. In the horizontal direction, the partitioning is irregular, and
the processor holding a part of the pivot column sends multiple messages
to the column of processors on its right. These messages can be addressed to the
same processor. The size of each message is equal to the block
size times the height of the overlap between all rectangles in
the horizontal direction. Here the overlap is the maximum part
of the pivot column that can be transmitted over the ring of
processors.
Table I summarizes the above-mentioned matrix multiplication algorithms based on SUMMA. The FPM-BR algorithm
better balances the workload and minimizes the total volume
of communications. However, none of the algorithms takes
into account the underlying network topology, so they
are not communication-optimal. In this work, we propose to
rearrange a given heterogeneous data partition in order to
reduce the number of message hops and network contention.
TABLE I: Comparison of some SUMMA-based algorithms

Algorithm   Data partitioning   Comm. vol.   Comm. flow
SUMMA       homogeneous         –            broadcasts
BR          constant speeds     min          nb-p2p one-to-all
FPM-BR      speed functions     min          nb-p2p one-to-all/ring
B. Topology-Aware Optimization of Communications
A broad overview of optimization of communications on
heterogeneous HPC platforms is given in [15], where all
existing techniques were classified as performance or topology
aware. The main goal of topology-aware optimization is to
reduce communication traffic and contention by placing communicating tasks on physically nearby processors. Communication traffic is quantified by the number of links a message
traverses. Contention is caused by multiple messages sharing
a network link.
A number of topology-aware implementations of MPI collective communication operations have been proposed. In [1],
two-level communication graphs were constructed for efficient
execution of collective operations on interconnected clusters. Clusters communicated via selected nodes (coordinators),
which formed the inter-cluster communicator. All nodes within
a cluster communicated with the cluster coordinator, forming
the intra-cluster communicator. These optimized implementations sent the minimal amount of data over the slow wide-area
links.
In [2] and [16], collective operations were optimized
for multilevel hierarchical heterogeneous networks and Grids.
In [3], a hierarchical approach was applied to optimize collectives for multi-core clusters: inter- and intra-node communications were overlapped, using offloading and pipelining techniques. Homogeneous supercomputers with complex network
topologies, like BlueGene and Cray, can also benefit from
topology-aware collectives [17].
MPI implementations try to exploit target architectures as
efficiently as possible by using the most suitable communication channels and best algorithms for collective communication operations. Therefore, many existing MPI applications
can be executed efficiently on hierarchical heterogeneous HPC
platforms, without any modifications of the source code.
However, the approach of topology-aware collectives does not
address applications based on point-to-point exchanges. In this
case, it is necessary to ensure that frequently communicating
processes are placed close to each other. This closeness is
application-specific.
The problem of topology-aware optimization of point-topoint communications can be solved by introducing a graph
that represents the application communication flow and is
mapped onto the network topology. In [4], this approach was
applied to the mesh and graph virtual MPI topologies and
SMP clusters. In [5], it was applied to the mesh topology on
BlueGene/L. In [6], a tool for automatic profile-guided process
placement was developed for interconnected clusters. In all
these works, the heterogeneity of processors was not taken
into account, and therefore the processes could be mapped freely
to processors in order to minimize the communication cost.
In our work, we focus on efficient execution of data-parallel
applications on hierarchical heterogeneous HPC platforms.
Their performance highly depends on balanced workload,
which is achieved by partitioning the data in proportion to
the speed of processors. In our case, the process placement
is determined by data partitioning algorithms. Assuming that
the workload is balanced among the processors, we propose
to rearrange the given heterogeneous data partition in order to
reduce the number of message hops and network contention.
This rearrangement is based on network topology and communication pattern of the application. This approach is also
non-intrusive to the source code but application-specific.
III. COMMUNICATION-OPTIMAL MATRIX PARTITIONING
In this section, we formulate the problem of communicationoptimal matrix partitioning for heterogeneous SUMMA on
interconnected heterogeneous HPC clusters. To minimize communication cost, we use information about the network topology and the application communication flow.
In our target platform, interconnected heterogeneous HPC
clusters, the network can be represented as a two-level rooted
tree with faster communications within sub-trees (clusters)
and slower communications between them. Within each cluster, a
single network switch provides no-contention point-to-point
communications, appropriately forwarding packets between
sources and destinations. Inter-cluster links may be shared
by multiple processors from different clusters communicating
with each other.
Our goal is to minimize communication cost of the parallel
application that implements the FPM-BR matrix multiplication
algorithm. In this application, each processor is assigned a matrix rectangle of the area and shape that balance the workload
and minimize the communication volume. The communication
flow of this application is based on non-blocking point-to-point
communications in rings. Changing the position of a rectangle
within the matrix does not affect the load balance and the
communication volume, but the rectangles can be arranged
so as to minimize the cost of communications between the
processors. This forms the optimization problem we solve in
this work.
Since column widths are different, we cannot move a
rectangle to another column unless the whole columns are
interchanged. In a column, there are no restrictions on interchanges of rectangles. Together, these constraints limit the solution space of our
optimization problem to a certain number of combinations. Let
c be the number of columns and r_i be the number of rectangles
in column i, 1 ≤ i ≤ c. Then the number of combinations will
be equal to the product r_1! × … × r_c!. Which arrangement of
rectangles is communication-optimal? This is an NP-complete
problem.
We performed an exhaustive search by running the application with all possible arrangements of rectangles on a
small platform of three interconnected heterogeneous clusters. Each cluster consisted of several heterogeneous nodes,
which were assigned rectangles proportional to their speed.
From exhaustive search, we found several arrangements that
reduced (Fig. 4) and increased (Fig. 5) the communication
cost (different colors/fillings correspond to different clusters).
We observed some regularity in the communication-optimal
arrangements, which was related to the topology. In the
optimal arrangements, the rectangles were grouped by clusters,
whereas, in the worst cases, the rectangles assigned to the same
cluster were dispersed vertically and horizontally. With the
optimal arrangements, the application, which is based on non-blocking point-to-point communications in rings, performs
fewer inter-cluster communications in the horizontal and vertical
directions.
The factorial design of the exhaustive search leads to a large
number of trials, which becomes infeasible for large platforms.
If topology information is available, we can avoid exhaustive
search by applying some heuristic that efficiently finds a near-optimal arrangement. Heuristic search requires estimating
the communication cost incurred by each data partitioning.
Communication cost can be estimated by taking into account
the application communication flow and the network topology.
Using the observations from the exhaustive search, we propose
a cost function for the FPM-BR matrix multiplication with
the ring communication flow and two-level network hierarchy.
The main goal of this function is to characterize the number
and volume of inter-cluster communications incurred by an
arrangement of matrix rectangles. This function should increase
monotonically with the number and volume of inter-cluster communications.
In the FPM-BR-ring algorithm, the point-to-point communications in the vertical direction are related to matrix
B (see Fig. 3). The volume of communications in each
column is proportional to the column width. The number of
communicating clusters in the vertical direction remains the
same for any arrangement of matrix rectangles. The number of
inter-cluster communications is proportional to the number of
message hops between clusters. In the communication-optimal
arrangements, the rectangles are grouped by clusters in each
column. In this configuration, the number of message hops
between clusters is minimal in each column. In the worst cases,
the rectangles belonging to the same group are dispersed.
To estimate the inter-cluster communication cost, we take
the upper bound of the number of hops made to send the pivot
row over the ring in the column. The rightmost column of the
optimal arrangement in Fig. 6 illustrates the upper bound of
the number of hops. Namely, when the pivot row is on the
top of the matrix, there is only one communication between
clusters: between the processors holding the second and third
rectangles. The same happens when the pivot row is in the
third rectangle: the part of pivot row is sent between the
processors from different clusters that hold the fourth and first
rectangles. In other cases, when the pivot row is in second
and fourth rectangles, two inter-cluster communications are
performed.
We define the cost function for the inter-cluster communications related to matrix B as follows:

cost_B = Σ_{i=1}^{c} h(i) × v(i),

where variable i iterates over the columns of matrix rectangles, and
functions h and v return the number of inter-cluster communications in a column and the column width respectively.

Fig. 4: Some of the communication-optimal arrangements

Fig. 5: Some of the worst case arrangements

Fig. 6: Inter-cluster communications related to matrix B

Fig. 7: Inter-cluster communications related to matrix A

This estimate is equivalent to the integral of h over the space of
columns of matrix b × b blocks: cost_B = ∫ h(x) dx, where
variable x iterates over columns of matrix blocks. The cost of
the arrangements in Fig. 6 is calculated as follows:
Worst case: (1 × 12) + (2 × 12) + (3 × 9) = 63
Optimal: (1 × 12) + (2 × 12) + (2 × 9) = 54
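The following Python sketch is one plausible reading of cost_B, with the hop count h taken as the upper bound over pivot positions: the pivot row travels the ring and skips exactly one cyclic edge, so a same-cluster edge is skipped whenever one exists. The encoding of columns as cluster-label sequences is our own; it reproduces the hop counts 1, 2 and 3 discussed around Fig. 6.

def hops_upper_bound(clusters):
    # clusters: cluster ids of the rectangles in one column, top to bottom,
    # treated cyclically because the pivot row is forwarded around a ring.
    n = len(clusters)
    edges = [clusters[i] != clusters[(i + 1) % n] for i in range(n)]
    boundaries = sum(edges)
    # One edge (into the pivot holder) is always skipped; the upper bound
    # over pivot positions skips a same-cluster edge if any exists.
    return boundaries if not all(edges) else boundaries - 1

def cost_B(columns, widths):
    # columns: cluster sequences per column; widths: column widths v(i).
    return sum(hops_upper_bound(col) * w for col, w in zip(columns, widths))

assert hops_upper_bound(["X", "Y"]) == 1            # two clusters, one hop
assert hops_upper_bound(["X", "Y", "Y", "X"]) == 2  # grouped (cyclically)
assert hops_upper_bound(["X", "Y", "X", "Y"]) == 3  # fully dispersed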
The point-to-point communications in the horizontal direction are related to matrix A (see Fig. 3). The number
of communicating clusters and the volume of inter-cluster
communications depend on the arrangement. The number of
inter-cluster communications along the pivot column varies.
The volume of inter-cluster communications is proportional
to the height of overlaps of matrix rectangles. The overlap is
the maximum part of the pivot column that can be transmitted
over the ring of processors in the horizontal direction. Fig. 7
illustrates both the numbers of inter-cluster communications in
overlaps and the heights of overlaps. In the optimal arrangement, the rectangles assigned to the same cluster are grouped
in rows as much as possible, while in the worst case, they are
scattered over the matrix.
Similarly to the communications related to matrix B, we
use the upper bound of the number of inter-cluster communications. For example, in the optimal arrangement in Fig. 7,
the number of inter-cluster communications over the upper
part of matrix A varies from one to two, depending on the
location of the pivot column. If the pivot column is in the
second column of matrix rectangles, there will be two inter-cluster communications in the two top rings.
We define the cost function for the inter-cluster communications related to matrix A as follows:

cost_A = Σ_{i=1}^{o} h(i) × v(i),

where variable i iterates over the o overlaps of matrix rectangles, and functions h and v return the number of inter-cluster
communications in an overlap and the height of the overlap.
This estimate is equivalent to the integral of h over the space
of rows of b × b matrix blocks: cost_A = ∫ h(x) dx, where
variable x iterates over rows of matrix blocks. The cost of the
arrangements in Fig. 7 is calculated as follows:

Worst case: 2 × (11 + 3 + 3 + 3 + 4 + 2 + 6) = 64
Optimal: 1 × (6 + 8) + 2 × (1 + 9 + 2 + 6) = 50

To conclude, the inter-cluster communication cost associated
with arrangement M is represented by two values,
(cost_A(M), cost_B(M)), constituting a point in Euclidean
space. The problem of finding the communication-optimal
arrangement can be formulated as minimization of the Euclidean norm: ‖(cost_A(M), cost_B(M))‖ → min. This norm
represents a combined cost and can be used to compare any
two arrangements. The combined costs of the above arrangements
are √(64² + 63²) = 89.80 and √(50² + 54²) = 73.59,
respectively. Table II summarizes the execution time and the
inter-cluster communication cost for the worst and optimal
cases.
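A one-line numeric check of the combined-cost comparison, using the values quoted above (hypot computes the Euclidean norm):

from math import hypot

worst, optimal = hypot(64, 63), hypot(50, 54)
print(worst, optimal)          # ~89.8 and ~73.59: the optimal arrangement wins
assert optimal < worst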
TABLE II: Exhaustive search experimental results

                     Cost                    Exec time (sec)
                     Worst case   Optimal    Worst case   Optimal
Exhaustive search    89.80        73.59      6.00         2.78

In the next section, we show how these intuitive, observation-based cost functions can be used in a heuristic
solution of the combinatorial problem of topology-aware optimization of communication cost in the heterogeneous matrix
multiplication application.
IV. HEURISTIC SEARCH
In this section, we use information about the network
topology and the application communication flow in a heuristic
that efficiently constructs a near-optimal arrangement. Our
heuristic does not require running the application to collect
information about its communication performance. The main
idea of our heuristic approach is to reduce the search space
of rectangle arrangements and find the one that minimizes the
inter-cluster communication cost of the application.
Our heuristic can be summarized as follows. First, we
group the rectangles assigned to the same subnetwork in
columns. This will minimize the inter-cluster communication
cost related to matrix B. Then, we rearrange the groups of
rectangles in columns to minimize the inter-cluster communications related to matrix A. Let us present the rationale for
such a solution and describe the solution in detail.
A. Rationale
Finding the optimal arrangement is complicated by irregularity of communications over rows, which is related to
matrix A. We propose to apply the cost function cost_A not to
the whole matrix but to some of its columns. In such a
way, cost_A(A_1, …, A_i) estimates the cost of communications
between the first i columns of rectangles. Here A_i is the i-th
column of matrix rectangles. We will construct the near-optimal
arrangement by minimizing this cost function for
successive submatrices that consist of two, three or more
columns of rectangles: (A_1, A_2), (A_1, A_2, A_3), ….
Let us assume that the rectangles in the first i − 1
columns have been rearranged to minimize the cost:
cost_A(A_1, …, A_{i−1}) = min. With these columns fixed, we
can estimate the cost of i columns, with different permutations
of rectangles in the i-th column. The permutation providing
the minimal combined cost can be added to the solution.
This approach reduces the number and volume of inter-cluster communications but does not guarantee finding a global
minimum. It allows us to test a significantly smaller number
of combinations of rectangles, which is equal to the sum (not
the product) of permutations: r_2! + … + r_c!, where r_i is the
number of rectangles in column i.
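For concreteness, a quick comparison of the two search-space sizes, using hypothetical column sizes of five rectangles each:

from math import factorial, prod

r = [5, 5, 5, 5, 5]                               # rectangles per column
exhaustive = prod(factorial(ri) for ri in r)      # 120^5 = 24,883,200,000
greedy     = sum(factorial(ri) for ri in r[1:])   # 4 * 120 = 480
print(exhaustive, greedy)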
We observed that in communication-optimal arrangements
the matrix rectangles assigned to the same network subtree
were grouped (Fig. 4). Indeed, this minimizes the number
of hops between subnetworks, g_i, in each column i, and
therefore, reduces the communication cost related to matrix B,
cost_B. If the rectangles in columns are grouped, we will have
to estimate cost_A(A_1, …, A_i) for a significantly smaller number
of combinations. For g_i communicating clusters in column i,
there will be g_i! permutations of the grouped matrix rectangles.
The number of combinations of groups, g_i!, is significantly
smaller than the number of combinations of individual rectangles, r_i!.
In one of the optimal cases, namely, in the left picture
of Fig. 4, one group of rectangles in the third column is
split between the top and bottom of the column. In this case,
the upper bound on the number of inter-cluster communications remains minimal, two, providing better communication
performance of the parallel matrix multiplication application.
Consideration of such cases significantly increases the search
space and complicates the construction of the near-optimal
arrangement. We exclude such cases from the search and
only test permutations of the non-split groups of rectangles in
each column. Nevertheless, our heuristic can find arrangements
close to the one in the right picture of Fig. 4.
B. Formal Description
Our heuristic is summarized in Algorithm 1. First, in each
column, we group the rectangles by subnetworks. We denote
each permutation of the groups in a column i as A^k_i, k =
1, …, g_i!, and the permutation with the minimum submatrix cost
as A*_i, cost_A(A_1, …, A*_i) = min. Let us show how the near-optimal arrangement is constructed by selecting the optimal
permutations for each column. For the submatrix consisting of
only one column (A_1), we have nothing to test because there
are no communications in the horizontal direction. Therefore,
we accept this column as the optimal permutation (A*_1 := A_1)
and add it to the resulting arrangement.

Algorithm 1 Heuristic for the communication-optimal arrangement

for each column i := 1 to c do
    group rectangles by clusters → g_i groups
end for
A*_1 := A_1
for each column i := 2 to c do
    generate group permutations → A^1_i, …, A^{g_i!}_i
    find k such that cost_A(A*_1, …, A*_{i−1}, A^k_i) = min
    A*_i := A^k_i
end for
Let us assume that we have found the optimal permutations
in the first i − 1 columns, and hence cost_A(A*_1, …, A*_{i−1}) =
min. We add another column of rectangles and estimate the
communication cost for the extended submatrix, trying different permutations A^k_i. The permutation with the minimal cost,
A*_i, such that cost_A(A*_1, …, A*_{i−1}, A*_i) = min, is added to the
resulting arrangement. We repeat this step for all columns of
rectangles. The final arrangement significantly reduces
the communication cost of parallel matrix multiplication.
In total, this heuristic requires testing g_2! + … + g_c!
arrangements of submatrices. This is significantly smaller than
the solution space of the exhaustive search, which is equal
to the product of the numbers of permutations of rectangles
in each column, r_1! × … × r_c!. In addition, this heuristic
does not require running the application or any benchmarks
to compare the communication cost of the application for
different arrangements. Instead, it uses the information about
network topology and application communication flow.
By minimizing the cost, this algorithm reduces the number
and volume of inter-cluster communications. However, it does
not guarantee finding the global minimum, and therefore,
it provides only a near-optimal solution. By fixing the
communication-optimal submatrices, we reduce the search
space but may lose the optimal solution. A*_i, the intermediate
result of the search, may change if we rearrange groups in one
of the first i − 1 columns, and this configuration altogether
may be a better solution. Nevertheless, for the small platform
used in Section III, the result of the heuristic search coincided
with the communication-optimal arrangement found by the
exhaustive search.
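A structural Python sketch of Algorithm 1, under stated assumptions: cost_A_prefix is a caller-supplied callback evaluating cost_A on the first i columns of a candidate arrangement (its implementation depends on the overlap geometry), and each column is given as a list of cluster groups that stay contiguous.

from itertools import permutations

def heuristic_arrangement(columns, cost_A_prefix):
    # columns: per column, a list of rectangle groups (one group per cluster).
    # cost_A_prefix: assumed callback returning cost_A of a column prefix.
    best = [columns[0]]                      # A*_1 := A_1, nothing to test
    for groups in columns[1:]:
        # Try all g_i! orders of the non-split groups in this column and
        # keep the one minimizing the cost of the extended submatrix.
        chosen = min(permutations(groups),
                     key=lambda perm: cost_A_prefix(best + [list(perm)]))
        best.append(list(chosen))
    return best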
V. EXPERIMENTAL RESULTS
In this section, we demonstrate that, with the near-optimal
arrangement found by the heuristic, the communication performance of the heterogeneous matrix multiplication application
can be significantly improved.
In our experiments, we used FuPerMod, a software tool
for optimal data partitioning on dedicated heterogeneous HPC
platforms [18]. In addition to the programming interface for
balancing the computational workload in data-parallel scientific applications, this tool provides the heterogeneous matrix
multiplication algorithm that we referred to as FPM-BR in
Section II-A. We improve the communication performance of
this algorithm on a two-level network hierarchy, using topology
information. We compare the execution time of this algorithm
with the topology-unaware data partitioning and with the
communication-optimal data partitioning obtained from the
heuristic search.

TABLE III: Hardware specifications

Cluster      Site       Processor         Cores   Memory
Edel         Grenoble   2.27 GHz Xeon     8       24 GB
Sol          Sophia     2.6 GHz Opteron   4       4 GB
Stremi       Reims      1.7 GHz Opteron   24      48 GB
Paradent     Rennes     2.5 GHz Xeon      8       32 GB
Taurus       Lyon       2.3 GHz Xeon      12      32 GB
Chimint      Lille      2.4 GHz Xeon      8       16 GB
Chinqchint   Lille      2.83 GHz Xeon     8       8 GB
We performed experiments on the Grid’5000 infrastructure in
France, which consists of ten geographically distributed sites
interconnected by the Renater network. Within each site
there are different clusters. Table III shows the specification
of the experimental platform. Each experiment was carried
out on a set of clusters from different sites. We had a
priori information about the network topology. To increase
heterogeneity, we used different numbers of threads on each
node. The execution time was the mean of 30 runs.
Fig. 8: Matrix partitioning for four interconnected clusters of
heterogeneous processors: 32 nodes (original vs. heuristic)
Fig. 9: Matrix partitioning for four interconnected clusters of
heterogeneous processors: 90 nodes (original vs. heuristic)
A. Inter-Cluster Experiments
In these experiments, we use four clusters with different
numbers of nodes. We spawn one MPI process per node, with
different numbers of threads to increase the heterogeneity. The
block size is 64 in all experiments. Table IV shows the inter-cluster communication cost and the total execution time for
the following experiments:
• 16 nodes: We start with 16 nodes: 4 nodes on the “Edel”
  cluster, 5 nodes on the “Taurus” cluster, 3 nodes on the
  “Sol” cluster, and 4 nodes on the “Chimint” cluster. The
  problem size is 16384, which is the area of a square
  128 × 128 matrix, given in the number of b × b blocks.
• 32 nodes: In another set of experiments, we use 32 nodes
  in total: 9 nodes on the “Chinqchint” cluster, 8 nodes on
  the “Edel” cluster, 9 nodes on the “Paradent” cluster, and
  6 nodes on the “Taurus” cluster. The problem size is
  32400. Fig. 8 shows the original topology-unaware data
  partitioning and the arrangement returned by the heuristic.
• 90 nodes: We extend the experiments to 90 nodes on
  four clusters: 25 nodes on the “Chinqchint” cluster, 22
  nodes on the “Stremi” cluster, 29 nodes on the “Paradent”
  cluster, and 13 nodes on the “Taurus” cluster. The problem
  size is 90,000. Fig. 9 shows the original topology-unaware
  data partitioning and the heuristic solution.

TABLE IV: Inter-cluster experimental results

             Cost                       Exec time (sec)
Nodes   Orig    Heuristic   Ratio   Orig     Heuristic   Ratio
16      533     432         1.23    58.00    42.58       1.36
32      868     710         1.22    119.30   88.30       1.35
90      1719    1263        1.36    400.80   297.83      1.34
Experimental results show that the arrangement found by
the proposed heuristic improves the total execution time by
more than 30%. As expected, the rectangles assigned to the
same cluster are grouped in columns and, as much as possible,
in rows, which minimizes the inter-cluster communications.
In the horizontal direction, the heuristic minimizes not only the
number of communications between clusters but also the number of communicating clusters. Indeed, the topology-unaware
partitions in both figures result in communications between
all four clusters, which are performed in rings along all rows
of matrix A. With the heuristic-based partitions, three clusters
on average are involved in the ring communications.
B. Homogeneous Inter-Node Experiments
In parallel matrix multiplication algorithms for homogeneous platforms, matrices are usually partitioned into a Cartesian grid. If the number of processors is not a perfect square,
such partitioning will result in non-square matrix blocks. In this case, the volume
of communications is not optimal. It is minimized in such
SUMMA-based algorithms as BR.

Fig. 10: Matrix partitioning for four homogeneous multi-core
nodes, with 24 MPI processes per node (original vs. heuristic)

TABLE V: Homogeneous inter-node experimental results

             Cost                      Exec time (sec)
Nodes   Orig   Heuristic   Ratio   Orig   Heuristic   Ratio
4       336    199         1.68    3.85   3.17        1.21
We performed experiments on a homogeneous multi-core
cluster, which is characterized by faster communications between the processes running on the same node and slower
communications between the processes running on different
nodes. Each node of the Stremi cluster has 24 cores. We
spawned 24 MPI processes on four nodes. The problem size is
1024 and the block size is 64. Table V shows the communication
cost and the total execution time of the application with
different arrangements of matrix rectangles. Fig. 10 shows
how matrices are partitioned by the BR algorithm and our
heuristic. These results show that the proposed topology-aware optimization technique can improve the communication
performance on homogeneous platforms as well.
VI. CONCLUSION
In this paper, we presented a heuristic approach aimed at
minimizing the communication cost of heterogeneous matrix multiplication using information about network topology and application communication flow. Our heuristic approach provides
an efficient near-optimal solution of the combinatorial problem
of mapping the application communications onto the network
topology. We validated this approach with experiments on
Grid’5000.
ACKNOWLEDGMENT
Experiments were carried out on Grid’5000 developed under
the INRIA ALADDIN development action with support from
CNRS, RENATER and several Universities as well as other
funding bodies (see https://www.grid5000.fr). We are grateful
to Arnaud Legrand who provided sample code for minimising
the total volume of communication.
REFERENCES

[1] T. Kielmann, R. F. Hofman, H. E. Bal, A. Plaat, and R. A. Bhoedjang,
“MagPIe: MPI’s collective communication operations for clustered wide
area systems,” in ACM Sigplan Notices, vol. 34, no. 8. ACM, 1999,
pp. 131–140.
[2] N. Karonis, B. De Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan, “Exploiting hierarchy in parallel computer networks to optimize
collective operation performance,” in Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International,
2000, pp. 377–384.
[3] T. Ma, G. Bosilca, A. Bouteiller, and J. Dongarra, “HierKNEM: An
adaptive framework for kernel-assisted and topology-aware collective
communications on many-core clusters,” in Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, 2012, pp.
970–982.
[4] J. Traff, “Implementing the MPI process topology mechanism,” in
Supercomputing, ACM/IEEE 2002 Conference, 2002, pp. 28–28.
[5] T. Agarwal, A. Sharma, A. Laxmikant, and L. Kale, “Topology-aware
task mapping for reducing communication contention on large parallel
machines,” in Parallel and Distributed Processing Symposium, 2006.
IPDPS 2006. 20th International, 2006, p. 10.
[6] H. Chen, W. Chen, J. Huang, B. Robert, and H. Kuhn, “MPIPP: An
automatic profile-guided parallel process placement toolset for SMP
clusters and multiclusters,” in Proceedings of the 20th Annual International Conference on Supercomputing, ser. ICS ’06. New York, NY,
USA: ACM, 2006, pp. 353–360.
[7] R. A. Van De Geijn and J. Watts, “SUMMA: scalable universal matrix
multiplication algorithm,” Concurrency: Practice and Experience, vol. 9,
no. 4, pp. 255–274, 1997.
[8] O. Beaumont, V. Boudet, F. Rastello, and Y. Robert, “Matrix multiplication on heterogeneous platforms,” IEEE Trans. Parallel Distrib. Syst.,
vol. 12, no. 10, pp. 1033–1051, 2001.
[9] D. Clarke, A. Lastovetsky, and V. Rychkov, “Column-based matrix partitioning for parallel matrix multiplication on heterogeneous processors
based on functional performance models,” in Euro-Par 2011: Parallel
Processing Workshops, ser. LNCS. Springer Berlin Heidelberg, 2012,
vol. 7155, pp. 450–459.
[10] A. Kalinov and A. Lastovetsky, “Heterogeneous distribution of computations while solving linear algebra problems on networks of heterogeneous computers,” in 7th International Conference on High Performance
Computing and Networking Europe (HPCN’99), 1999.
[11] E. Dovolnov, A. Kalinov, and S. Klimov, “Natural block data decomposition for heterogeneous clusters,” in IPDPS 2003, April 2003.
[12] A. Lastovetsky, “On grid-based matrix partitioning for heterogeneous
processors,” in 6th International Symposium on Parallel and Distributed
Computing (ISPDC 2007). Hagenberg, Austria: IEEE Computer Society,
5-8 July 2007, pp. 383–390.
[13] A. DeFlumere, A. Lastovetsky, and B. Becker, “Partitioning for parallel matrix-matrix multiplication with heterogeneous processors: The
optimal solution,” in Parallel and Distributed Processing Symposium
Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International, 2012,
pp. 125–139.
[14] A. Lastovetsky and R. Reddy, “Data partitioning with a functional
performance model of heterogeneous processors,” International Journal
of High Performance Computing Applications, vol. 21, no. 1, pp. 76–90,
2007.
[15] K. Dichev and A. Lastovetsky, “Optimization of collective communication for heterogeneous HPC platforms.” Wiley-Interscience, 2013.
[16] C. Coti, T. Herault, and F. Cappello, “MPI applications on grids:
a topology aware approach,” in Euro-Par 2009 Parallel Processing.
Springer, 2009, pp. 466–477.
[17] E. Solomonik, A. Bhatele, and J. Demmel, “Improving communication
performance in dense linear algebra via topology aware collectives,” in
Proceedings of 2011 International Conference for High Performance
Computing, Networking, Storage and Analysis, ser. SC ’11. New York,
NY, USA: ACM, 2011, pp. 77:1–77:11.
[18] D. Clarke, Z. Zhong, V. Rychkov, and A. Lastovetsky, “FuPerMod: A
framework for optimal data partitioning for parallel scientific applications on dedicated heterogeneous HPC platforms,” in Parallel Computing
Technologies, ser. LNCS. Springer Berlin Heidelberg, 2013, vol. 7979,
pp. 182–196.