Table of Contents

Abstract
1 Introduction
    1.1 Background
    1.2 Heuristic Algorithm
    1.3 Motivation and Goals
    1.4 Approach
2 Theory
    2.1 System Model and Problem Definition
    2.2 ITERLP Algorithm
    2.3 ITERLP2 Algorithm
3 Experiments
    3.1 Simulations with random parameters
    3.2 Clusters
4 Results
    4.1 Simulations with random parameters
    4.2 Clusters
5 Conclusion
    5.1 Contributions
    5.2 Future Work
Bibliography
Abstract
Divisible Load Theory (DLT) is an effective tool for blueprinting data-intensive computational problems; it is a mathematical framework that is applied using Divisible Load Scheduling (DLS) algorithms, but has yet to be proved optimal in all cases. A heuristic algorithm known as ITERLP rapidly produces near-optimal divisible load schedules with result collection. Though the results are near-optimal, two underlying issues were not addressed: scheduling divisible loads on cluster networks whose nodes share identical network bandwidth, and the cost of latency. ITERLP performs well when processor nodes have varying network bandwidth; however, it can struggle when every processor node shares the same bandwidth but the nodes vary in computation speed. ITERLP also does not factor in the cost of latency; taking the solution time calculated by ITERLP and then factoring in latency shows that the time worsens in all cases. In this thesis, several modifications are made to the ITERLP algorithm, followed by numerous tests that investigate more realistic cases. These modifications factor in latency and try different sorting algorithms over the communication, computation, and latency parameters. It is shown that, when latency is factored in, the modified algorithm, ITERLP2, produces near-optimal times and outperforms ITERLP in all realistic cases. It is also shown that some sorting algorithms perform better than others in certain cases; however, no single sorting algorithm is best suited for all cases.
Chapter 1
Introduction
1.1 Background
A divisible load is a computable entity that can be arbitrarily partitioned into an infinite number
of independent load fractions. Once partitioned, load fractions may then be allocated to a set of
computer processor nodes to be computed and then collected back. A few applications that fulfill
the divisibility property include: simulations, matrix computations, database searching, Kalman
filtering, as well as dataset, image, and signal processing [1, 2, 3, 4, 5]. Loads that are randomly or
arbitrarily partitioned do not always achieve the optimal solution and may, by consequence, result
in a worse time than that of a single processor node alone; so to properly partition a divisible load,
there needs to be a Divisible Load Schedule (DLS) to ensure that tasks can be run as efficiently
as possible in parallel on the given resources. Divisible Load Theory (DLT) is an effective tool for
blueprinting data-intensive computational problems; it is a mathematical framework that is applied
using Divisible Load Scheduling (DLS) algorithms, but has yet to be proved optimal in all cases
[1, 4, 5].
DLT has become more popular in recent years through the use of volunteer computing; users may volunteer their PCs' computational power to help form a grid network used to perform numerous calculations in parallel [4]. SETI (Search for Extraterrestrial Intelligence) has a volunteer computing network named SETI@home that is used for processing enormous amounts of signal data. Initially, SETI did not possess the resources to build or buy a computer powerful enough to process the signal data, so it compensated for the lack of convenient access to a supercomputer by establishing a volunteer computing network that allows users to enlist their home computers to do the job [6].
On the other side of the parallel computing spectrum, UNB Fredericton, as well as eight other Atlantic Canadian universities, are members of the Atlantic Computational Excellence Network (ACEnet); this network provides multiple clusters for the purpose of High Performance Computing (HPC) and research [7]. Cluster networks are generally used as a pool of computer processors designed for parallel processing. A reasonable assumption is that bandwidth and latency are constant between all processors in a given cluster; however, the computational power of each processor may vary.
The simplest network model used in DLT is the star network model with one-port communication; it has a master processor at the center holding the load to be distributed, and a set of worker processors connected at the points of the star. One-port communication enforces that only one transmission is allowed at a given time between the master and any of the associated worker processors [1].
This ensures that no overlapping of communication occurs in either the allocation of load, or the
collection of results; however, when two or more worker processors have already received their allocated loads, they may compute them independently in parallel. Another simplification applied is
the one-round approach; although the multi-round approach allows for pipelining multiple sets of
smaller messages, the cost of latency would be too great [8, 9]. The one-round approach ensures
that a divisible load is split up only once and that latency occurs only twice for each load fraction,
once for allocation, and once for result collection.
All nodes Finish Simultaneously (AFS) is a policy that assumes no idle times occur and all
processor nodes participate in the optimal solution [1, 4, 5, 8, 9, 10]. The AFS policy cannot be
applied because it is unrealistic and oversimplified; though it would be simple to assume that all
processor nodes finish simultaneously, it is more important for them to finish at various time slices
to ensure that the idle time between communication of results is minimized, and that collection
can actually occur. A few papers that do assume the AFS policy use it to help simplify describing
DLT and conducting proofs [9, 10, 11].
Existing algorithms for scheduling divisible loads with result collection in this model include
FIFO (First In First Out), LIFO (Last In First Out), and OPT (Optimal Solution). Each algorithm
uses a linear program for a given allocation and collection order to solve a minimization problem with
several constraints. The solution to the linear program gives all load fractions for each processor
node considered, as well as the optimal time [1, 4, 5]. The FIFO algorithm uses the same order for
both allocation and collection, whereas the LIFO algorithm has the collection in reverse order of the
allocation. The OPT algorithm finds the optimal solution by permuting all possible combinations of
allocation and collection for m processor nodes. By solving a linear program for each permutation,
the optimal load fractions and times are found. In practice, the optimal solution is rarely attainable, since for m processor nodes there are (m!)² possible linear programs to be solved [1, 5, 8]. To alleviate the time taken to find the optimal solution, heuristic algorithms have been proposed [1, 4, 5, 8, 9].
A recent DLS heuristic algorithm known as ITERLP was proposed in [1] in April 2008. It rapidly produces near-optimal divisible load schedules including result collection. Many simulations were performed to compare ITERLP, FIFO, and LIFO to the optimal solution. It was found that ITERLP invariably produces the nearest-optimal performance no matter the level of heterogeneity of the system, the number of processor nodes, or the size of the allocated and collected data [1].
Though the results are near-optimal, two underlying issues were not addressed: scheduling divisible loads on cluster networks whose nodes share identical network bandwidth, and the cost of latency. ITERLP performs well when processor nodes have varying network bandwidth; however, it can pose a problem when every processor node shares the same bandwidth but the nodes vary in computation speed. The second problem is that ITERLP does not factor in the cost of latency; taking the solution time calculated by ITERLP and then factoring in latency shows that the time worsens in all cases. Another recent heuristic algorithm, known as SPORT, was proposed in [5] in August 2008.
1.2 Heuristic Algorithm
A heuristic algorithm known as ITERLP was used in [1] to quickly calculate near-optimal divisible load schedules with result collection. ITERLP is a heuristic in the sense that it usually produces a near-optimal solution and computes much more quickly than OPT. Although the solution from ITERLP is not guaranteed to be optimal, it occasionally finds the optimal solution. In terms of cost compared to OPT, ITERLP only requires solving up to m² linear programs in a given iteration, as opposed to OPT, which solves (m!)² linear programs for m processor nodes. Although ITERLP is able to produce near-optimal divisible load schedules, two underlying issues were not addressed: the limitation of sorting processor nodes only by communication parameters, and the cost of latency. ITERLP performs well when processor nodes vary in communication parameters; however, it can pose a problem on cluster networks whose processor nodes share identical communication parameters. ITERLP does not factor in the cost of latency because divisible loads are assumed in [1, 4] to be sufficiently large to ignore it. Experimental results from taking the solution time calculated by ITERLP and
then factoring in the cost of latency will show that the solution worsens and is not always near-optimal.
1.3 Motivation and Goals
The last paragraph of [1] mentions future work to create an algorithm with similar performance but better cost characteristics than ITERLP. This thesis investigates improving the ITERLP algorithm by adding latency parameters and by trying different methods for sorting processor nodes in a new heuristic algorithm named ITERLP2. The limitations of ITERLP are shown when latency is factored in, and when communication parameters are the same for all processor nodes. Multiple scenarios in the cluster experiments will show the cases where each sorting algorithm is best suited. Finally, results from experimentation are expected to show that ITERLP2 outperforms ITERLP in all realistic cases.
1.4 Approach
The ITERLP algorithm was implemented using C++ and a linear program solver (COIN-Clp) [12]. The algorithm was modified to accommodate latency using two different methods: one adds latency into the equations to be solved in the linear program, and the other factors in latency after the linear programs have been solved. Rather than sorting only by communication parameters, other sorting algorithms for computation, communication then computation, and latency were implemented. Experiments with the new algorithm, ITERLP2, were run using different configurations and orderings for the communication, computation, and latency parameters to show that the goals are met.
Chapter 2
Theory
2.1 System Model and Problem Definition
Figure 2.1: Heterogeneous star network H.
Consider the divisible load J to be allocated, computed and collected on a heterogeneous star
network H = (P, I, E, C, L) illustrated in Fig. 2.1. The set of m + 1 processors is denoted by
P = {p0 , ..., pm }, and the set of m network links denoted by I = {i1 , ..., im }; the links stem from
the master processor p0 to the worker processors p1 , ..., pm . The set of computational parameters
for each of the worker processors is E = {E1 , ..., Em }, and the set of communicational parameters
for each of the network links is C = {C1 , ..., Cm }. The new parameter for the latency on each of
the network links is L = {L1 , ..., Lm }. The value for Ck is the reciprocal of the bandwidth speed
of network link ik , and Ek is the reciprocal of the processor speed of processor pk . The values
for Ek and Ck are measured in time units per unit load; Ek represents the time to process a unit
load, and Ck represents the time to transmit a unit load from p0 to pk over network link ik . The
value for Lk measures the latency (the total time, in time units, taken to initiate a connection) between processor p0 and pk over network link ik. For non-negativity, it is assumed for all k ∈ {1, ..., m} that Ek > 0, Ck > 0, and Lk ≥ 0. This new system model is based upon the model from [1].
The values of C and E are already known ahead of time by the master processor p0 . Using these
known values, processor p0 calculates and divides up the entire load J into a set of load fractions
α1 , ..., αm ; once divided, processor p0 allocates the load fractions α1 , ..., αm to the respective worker
processors p1 , ..., pm to be computed.
Figure 2.2: Divisible load schedule for m = 3, σa = {2, 1, 3}, σc = {1, 3, 2}.
Each of the worker processors goes through three discrete phases: allocation, computation, and
collection. To begin, the allocation phase involves the master processor p0 sending load fractions
to a worker processor; the computation phase commences when the entire load fraction has been
allocated to a worker processor. After having finished computing, the collection phase has worker
processors send results back. Whether it is the allocation or collection phase, the master processor
p0 can only communicate to one processor at a time, and result collection can occur only after the
entire load has been allocated; however, any number of worker processors that have already received
load may compute in parallel [1]. The magnitude of data in the results to be returned is considered
proportional to the size of the load initially allocated. This assumption is only made when dealing
with image and video processing, matrix manipulation, or any computation that incorporates the
use of linear transformations [1, 2, 4, 5]. So given a load fraction αk , the associated returned result is
equal to δαk , where 0 ≤ δ ≤ 1. The constant value for δ is application dependent and all processors
share that value for a given load J. Given a load fraction αk, the time to transmit it from p0 to pk is equal to αk Ck plus the cost of latency Lk, the computation time on worker processor pk is equal to αk Ek, and finally the time to transmit the results back from pk to p0 is equal to δαk Ck plus the cost of latency Lk. For a given load fraction αk to go through all three phases (allocation, computation, and collection), the cost of latency is therefore incurred on both the allocation and the collection phase, because the master processor p0 can only establish one communication at a time between itself and any given worker processor.
Fig. 2.2 illustrates a simple example of a divisible load schedule with 3 processors showing all 3 phases; the x values represent the idle time from when a processor pk finishes computing until it begins transferring its result, and the y value represents the idle time from when the last load fraction finishes allocating until the first result begins transferring.
Allocation and collection order are denoted by the permuted sets σa and σc respectively. σa[k] and σc[k] represent the processor number at index k ∈ {1, ..., m}. To find the index of a given processor, σa(l) and σc(l) are two functions that, given a processor number, return its index. Fig. 2.2 shows an example load schedule with an order of allocation σa = {2, 1, 3} and order of collection σc = {1, 3, 2}.
ITERLP uses a linear cost model to find a solution, using only the parameters for communication and processing speed with respect to the load fractions; however, for ITERLP2 to accommodate the cost of latency, the model must include a fixed cost for the time to initiate a connection, thus forming an affine cost model. The linear program below is a modified version of the one from [1, 4, 5] that accommodates the cost of latency using an affine cost model.

Given a divisible load J and a heterogeneous star network H = (P, I, E, C, L), the allocation and collection orders (σa, σc) are tried by the linear program below and solved to find the optimal load fractions α = {α1, ..., αm} and the associated time T. Divisible load scheduling on a heterogeneous star network is characterized by the following linear program:
Minimize \zeta = 0\alpha_1 + \cdots + 0\alpha_m + T

Subject to:

\sum_{j=1}^{\sigma_a(k)} \left[ \alpha_{\sigma_a[j]} C_{\sigma_a[j]} + L_{\sigma_a[j]} \right] + \alpha_k E_k + \sum_{j=\sigma_c(k)}^{m} \left[ \delta \alpha_{\sigma_c[j]} C_{\sigma_c[j]} + L_{\sigma_c[j]} \right] \le T, \qquad k = 1, \ldots, m \qquad (2.1)

\sum_{j=1}^{m} \left[ \alpha_{\sigma_a[j]} C_{\sigma_a[j]} + L_{\sigma_a[j]} \right] + \sum_{j=1}^{m} \left[ \delta \alpha_{\sigma_c[j]} C_{\sigma_c[j]} + L_{\sigma_c[j]} \right] \le T \qquad (2.2)

\sum_{j=1}^{m} \alpha_j = J \qquad (2.3)

T \ge 0, \quad \alpha_k \ge 0, \qquad k = 1, \ldots, m \qquad (2.4)
The LHS (left-hand side) of constraint 2.1 comprises the total time spent transmitting load fractions to processor pk and to all processors that receive load before it, the computation time of processor pk, and the total time spent transmitting results back from processor pk and from all processors whose results are collected after it. The no-overlap model is satisfied if the processing time T is greater than or equal to the time given on the LHS for all m processors. The one-port model is enforced by constraint 2.2, whose LHS is a lower bound on the communication time and latency cost over both allocation and collection. Constraint 2.3 ensures that the entire load is distributed amongst the processors. Non-negativity of the objective variables is ensured by constraint 2.4 [1].
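To make the affine cost model concrete, the following C++ sketch evaluates the left-hand sides of constraints 2.1 and 2.2 for a fixed set of load fractions and fixed orders, and returns the smallest feasible T. It is illustrative only and is not the thesis implementation; the function and parameter names are assumptions.

#include <algorithm>
#include <vector>

// Smallest T satisfying constraints 2.1 and 2.2 for fixed load fractions alpha,
// allocation order sigmaA, and collection order sigmaC (0-based processor indices).
// C, E, L, and delta are the system parameters of Section 2.1.
double scheduleTime(const std::vector<int>& sigmaA, const std::vector<int>& sigmaC,
                    const std::vector<double>& alpha, const std::vector<double>& C,
                    const std::vector<double>& E, const std::vector<double>& L,
                    double delta) {
    const int m = static_cast<int>(alpha.size());
    double T = 0.0;

    // Constraint 2.1: allocation up to and including processor k, computation on k,
    // then collection of k and of everything collected after it.
    for (int k = 0; k < m; ++k) {
        const int posA = static_cast<int>(std::find(sigmaA.begin(), sigmaA.end(), k) - sigmaA.begin());
        const int posC = static_cast<int>(std::find(sigmaC.begin(), sigmaC.end(), k) - sigmaC.begin());
        double t = 0.0;
        for (int j = 0; j <= posA; ++j)                          // allocation phase
            t += alpha[sigmaA[j]] * C[sigmaA[j]] + L[sigmaA[j]];
        t += alpha[k] * E[k];                                    // computation phase
        for (int j = posC; j < m; ++j)                           // collection phase
            t += delta * alpha[sigmaC[j]] * C[sigmaC[j]] + L[sigmaC[j]];
        T = std::max(T, t);
    }

    // Constraint 2.2: one-port lower bound -- all allocations plus all collections.
    double onePort = 0.0;
    for (int j = 0; j < m; ++j)
        onePort += (alpha[sigmaA[j]] * C[sigmaA[j]] + L[sigmaA[j]])
                 + (delta * alpha[sigmaC[j]] * C[sigmaC[j]] + L[sigmaC[j]]);

    return std::max(T, onePort);
}

In the algorithms themselves the load fractions are of course not fixed in advance; they are the variables of the linear program above, and a solver such as COIN-Clp returns both the fractions and T.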
2.2 ITERLP Algorithm
The heuristic algorithm ITERLP finds a solution by iteratively solving linear programs. To begin, the processors are first sorted by their communication parameters Ck. Starting with the first two processors, all four permutations of σa and σc are passed into the linear program defined by constraints 2.1 through 2.4 and solved. After solving for the first two processors, the next processor is added and the linear program is solved again; the new processor is tried in every position of the optimal σa and σc sequences from the previous iteration. When solving the linear programs, ITERLP ignores latency and keeps track of the times before latency is added; only when ITERLP has finished computing is the solution time adjusted for latency. The algorithm requires at most k² linear programs to be solved in a given iteration with k processors.
Using the summation below, the total complexity of ITERLP in the worst case is calculated to be O(m³) for m processor nodes. By comparison, OPT permutes all possible combinations of allocation and collection for k processors, solving (k!)² linear programs. ITERLP will halt execution and not proceed to the next iteration if, in any given iteration, processor k is allocated zero load.
\sum_{k=2}^{m} k^2 = \frac{1}{6} m(m+1)(2m+1) - 1 = O(m^3) \qquad (2.5)
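The closed form in (2.5) can be checked directly against the iteration-by-iteration count of linear programs; a minimal sketch (not part of the thesis code):

#include <cstdio>

int main() {
    for (int m = 2; m <= 32; ++m) {
        long long lps = 0;
        for (int k = 2; k <= m; ++k)
            lps += 1LL * k * k;                                           // at most k^2 LPs in iteration k
        long long closedForm = 1LL * m * (m + 1) * (2 * m + 1) / 6 - 1;   // right-hand side of (2.5)
        std::printf("m = %2d: %lld LPs, closed form %lld\n", m, lps, closedForm);
    }
    return 0;
}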
2.3 ITERLP2 Algorithm
Similarly to the heuristic algorithm ITERLP, ITERLP2 finds a solution by iteratively solving linear programs. The complexity of ITERLP2 is the same as that of ITERLP, O(m³). To begin, the processors are first shuffled into a random order so that ties between identical parameters are broken fairly. ITERLP2 uses three new sorting configurations in addition to sorting by increasing value of Ck (increasing time needed to transfer a load fraction); these new sort methods are used to show the limitation of ITERLP sorting only by Ck and to compare where each method is best suited. The new methods are: sorting by increasing value of Ck then increasing value of Ek (increasing time needed to compute), sorting by increasing value of Ek, and sorting by increasing value of Lk (increasing time needed to establish a connection). After solving for the first two processors, the next processor is added by trying it in every position of the optimal σa and σc sequences from the previous iteration, and the linear program is solved again. Two separate solution times are tracked: one for the best time in the current iteration, and another for the best time overall. ITERLP2 has two different configurations for latency: the first, ITERLP2A, includes latency in the constraint equations of the linear program before solving and tracks the resulting times; the second, ITERLP2B, ignores latency when solving the linear program, but after each linear program is solved, latency is factored in and the adjusted times are tracked. It might be interesting to see whether one approach performs better than the other.
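As an illustration, the four sort configurations amount to shuffling the processors and then sorting them with different comparators. The sketch below is hypothetical C++; the Node structure and the sortNodes helper are assumptions, not the thesis implementation.

#include <algorithm>
#include <random>
#include <vector>

struct Node { double C, E, L; int id; };   // per-processor parameters (assumed layout)

// Shuffle first so that ties between identical parameters are broken fairly,
// then apply one of the four sort configurations described above.
void sortNodes(std::vector<Node>& nodes, int mode, std::mt19937& rng) {
    std::shuffle(nodes.begin(), nodes.end(), rng);
    switch (mode) {
        case 0:  // by increasing C (the original ITERLP order)
            std::stable_sort(nodes.begin(), nodes.end(),
                [](const Node& a, const Node& b) { return a.C < b.C; });
            break;
        case 1:  // by increasing C, then increasing E
            std::stable_sort(nodes.begin(), nodes.end(),
                [](const Node& a, const Node& b) { return a.C != b.C ? a.C < b.C : a.E < b.E; });
            break;
        case 2:  // by increasing E
            std::stable_sort(nodes.begin(), nodes.end(),
                [](const Node& a, const Node& b) { return a.E < b.E; });
            break;
        case 3:  // by increasing L
            std::stable_sort(nodes.begin(), nodes.end(),
                [](const Node& a, const Node& b) { return a.L < b.L; });
            break;
    }
}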
After the addition of latency, the results in Chapter 4 show that, in some cases, ITERLP2 may produce better solution times with fewer processors. For example, ITERLP2 finds the solution time for four processors, and then a fifth processor with a much higher latency cost is added; since the fifth processor has such high latency, it is given a very small fraction of the load and consequently produces a solution time worse than with four processors. No matter the size of the load fraction given to a processor, latency is a fixed cost that is applied even when the allocated load is zero.
For example, given the optimal sequences σa¹ = {2, 1} and σc¹ = {1, 2} after the first iteration, the third processor would be added to the sequences in the following manner: Σa² = {{3, 2, 1}, {2, 3, 1}, {2, 1, 3}} and Σc² = {{3, 1, 2}, {1, 3, 2}, {1, 2, 3}}. In the second iteration, Σa² and Σc² represent all possible sequences for allocation and collection respectively.
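A small sketch of how such candidate sequences can be generated, by inserting the new processor at every position of the previous best order (illustrative only; ITERLP2 then pairs each allocation candidate with each collection candidate and solves one linear program per pair):

#include <cstddef>
#include <vector>

// Insert processor p at every position of the previous best order.
// For example, {2, 1} with p = 3 yields {3,2,1}, {2,3,1}, {2,1,3}, matching the text above.
std::vector<std::vector<int>> candidateSequences(const std::vector<int>& best, int p) {
    std::vector<std::vector<int>> out;
    for (std::size_t pos = 0; pos <= best.size(); ++pos) {
        std::vector<int> seq(best);
        seq.insert(seq.begin() + pos, p);
        out.push_back(seq);
    }
    return out;
}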
Table 2.1: Results for 3 processors sorting by communication parameter C, with C = {100, 125, 150}, E = {1000, 700, 850}, L = {10, 7, 9}, δ = 0.5, rounded to 3 decimal places

Algorithm    σa           σc           α                        T
OPT          {1, 2, 3}    {1, 3, 2}    {0.430, 0.263, 0.307}    436.033
ITERLP       {1, 2, 3}    {2, 1, 3}    {0.330, 0.371, 0.299}    444.258
ITERLP2A     {1, 2, 3}    {1, 3, 2}    {0.430, 0.263, 0.307}    436.033
ITERLP2B     {1, 2, 3}    {1, 2, 3}    {0.308, 0.392, 0.299}    437.559
Tables 2.1 and 2.2 show the results for iterations 2 and 3, respectively, of an example with the following parameters: C = {100, 125, 150, 175}, E = {1000, 700, 850, 500}, L = {10, 7, 9, 8}, and δ = 0.5. ITERLP2A finds the optimal sequence in both iterations, and the remaining algorithms are near-optimal. Though it appears that ITERLP is performing near-optimally, adding more processors results in a solution time that deviates further and further from the optimal time; this is confirmed by the experimental results in Chapter 4.
Table 2.2: Results for 4 processors sorting by communication parameter C, with C = {100, 125, 150, 175}, E = {1000, 700, 850, 500}, L = {10, 7, 9, 8}, δ = 0.5, rounded to 3 decimal places

Algorithm    σa              σc              α                               T
OPT          {1, 2, 3, 4}    {1, 3, 2, 4}    {0.217, 0.308, 0.184, 0.292}    352.196
ITERLP       {1, 2, 4, 3}    {4, 2, 1, 3}    {0.257, 0.290, 0.187, 0.266}    368.724
ITERLP2A     {1, 2, 3, 4}    {1, 3, 2, 4}    {0.217, 0.308, 0.184, 0.292}    352.196
ITERLP2B     {1, 2, 3, 4}    {1, 2, 3, 4}    {0.219, 0.278, 0.212, 0.291}    354.228
Chapter 3
Experiments
3.1 Simulations with random parameters
Prior to experimentation, a few preliminary tests were conducted to find a good balance for the system parameters. The values chosen for C and E in Tables 3.1 and 3.2 are based on the values used in the experiments of [1], with the addition of the latency values L; values are chosen randomly within the ranges, and the ranges themselves are based on actual values for current processors and networks handling arbitrarily large divisible loads.
Table 3.1: Parameters for simulation using 5 processors with δ = 0.5

Case   C values     E values        L values
1      [1, 100]     [100, 1000]     [0.01, 10]
2      [1, 100]     [100, 1000]     [0.1, 10]
3      [1, 100]     [100, 1000]     [1, 10]
4      [1, 100]     [100, 1000]     [1, 100]
5      [1, 100]     [1000, 10000]   [0.01, 10]
6      [1, 100]     [1000, 10000]   [0.1, 10]
7      [1, 100]     [1000, 10000]   [1, 10]
8      [1, 100]     [1000, 10000]   [1, 100]
9      [10, 1000]   [100, 1000]     [0.01, 10]
10     [10, 1000]   [100, 1000]     [0.1, 10]
11     [10, 1000]   [100, 1000]     [1, 10]
12     [10, 1000]   [100, 1000]     [1, 100]
13     [1, 100]     [1, 1000]       [0.01, 10]
14     [1, 100]     [1, 1000]       [0.1, 10]
15     [1, 100]     [1, 1000]       [1, 10]
16     [1, 100]     [1, 1000]       [1, 100]
The experiments from Table 3.1 include the optimal solution OPT for comparison with ITERLP and ITERLP2 on 5 processors; however, the experiments from Table 3.2 do not include the optimal solution, due to the time required to solve (16!)² ≈ 4.4 · 10²⁶ and (32!)² ≈ 6.9 · 10⁷⁰ linear programs for 16 and 32 processors respectively. Each case includes 100 simulation runs, with each run repeated 4 times to cover all sorting algorithms (by C, by E, by C then E, and by L) for both ITERLP2 A and B; ITERLP is only used once per run, sorting by C. It is expected that the results will reveal that ITERLP2 A and B outperform ITERLP in all cases.
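As an illustration, the parameters for a single simulation run could be drawn uniformly from the ranges in Tables 3.1 and 3.2 along the following lines; this is a sketch under assumed names, not the generator actually used in the experiments.

#include <random>
#include <vector>

struct Params { std::vector<double> C, E, L; };   // one value per processor (assumed layout)

// Draw m processor parameters uniformly at random from the given ranges,
// e.g. case 1 of Table 3.1: C in [1, 100], E in [100, 1000], L in [0.01, 10].
Params randomCase(int m, double cLo, double cHi, double eLo, double eHi,
                  double lLo, double lHi, std::mt19937& rng) {
    std::uniform_real_distribution<double> dC(cLo, cHi), dE(eLo, eHi), dL(lLo, lHi);
    Params p;
    for (int k = 0; k < m; ++k) {
        p.C.push_back(dC(rng));
        p.E.push_back(dE(rng));
        p.L.push_back(dL(rng));
    }
    return p;
}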
Table 3.2: Parameters for simulation using 16 and 32 processors with δ = 0.5

Case   # Processors   C values     E values        L values
17     16             [1, 100]     [100, 1000]     [0.01, 10]
18     16             [1, 100]     [100, 1000]     [1, 10]
19     16             [1, 100]     [1000, 10000]   [0.01, 10]
20     16             [1, 100]     [1000, 10000]   [1, 10]
21     16             [10, 1000]   [100, 1000]     [0.01, 10]
22     16             [10, 1000]   [100, 1000]     [1, 10]
23     16             [1, 100]     [1, 1000]       [0.01, 10]
24     16             [1, 100]     [1, 1000]       [1, 10]
25     32             [1, 100]     [100, 1000]     [0.01, 10]
26     32             [1, 100]     [100, 1000]     [1, 10]
27     32             [1, 100]     [1000, 10000]   [0.01, 10]
28     32             [1, 100]     [1000, 10000]   [1, 10]
29     32             [10, 1000]   [100, 1000]     [0.01, 10]
30     32             [10, 1000]   [100, 1000]     [1, 10]
31     32             [1, 100]     [1, 1000]       [0.01, 10]
32     32             [1, 100]     [1, 1000]       [1, 10]

3.2 Clusters
Experiments with clusters are done with 2, 4, and 8 clusters. Each cluster Uk has from 2 up to a possible 32 processor nodes. The total number of processor nodes is restricted to a maximum of 64 so that the experiments complete under realistic time constraints. The values for C, E, and L are defined to be the same for all processor nodes in a given cluster; the goal is to simulate realistic grids of clusters for which load schedules are to be calculated. Table 3.3 lists 5 different cases with a total of 16 subcases amongst them. Each case should pose a problem for ITERLP, because it sorts only by C and ignores latency; results are expected to show that ITERLP2 A and B outperform ITERLP due to these limitations.
Table 3.3: Cluster experiments with δ = 0.5 (each cluster Uk lists its C, E, and L values)

Case 1 (2 clusters, 4 or 8 nodes/cluster):
  U1: C = 150, E = 300,  L = 2
  U2: C = 100, E = 1000, L = 25

Case 2 (2 clusters, 8 or 16 nodes/cluster):
  U1: C = 150, E = 800,  L = 5
  U2: C = 100, E = 2000, L = 5

Case 3 (4 clusters, 4 or 8 nodes/cluster):
  U1: C = 100, E = 800,  L = 5
  U2: C = 100, E = 2000, L = 15
  U3: C = 100, E = 1500, L = 10
  U4: C = 100, E = 3000, L = 20

Case 4 (4 clusters, 8 or 16 nodes/cluster):
  U1: C = 150, E = 300,  L = 5
  U2: C = 100, E = 3000, L = 5
  U3: C = 150, E = 1000, L = 5
  U4: C = 100, E = 2000, L = 5

Case 5 (8 clusters, 4 or 8 nodes/cluster):
  U1: C = 110, E = 400,  L = 5
  U2: C = 100, E = 500,  L = 5
  U3: C = 110, E = 450,  L = 5
  U4: C = 100, E = 1000, L = 5
  U5: C = 110, E = 1000, L = 2
  U6: C = 100, E = 500,  L = 10
  U7: C = 110, E = 500,  L = 2
  U8: C = 100, E = 1000, L = 10
The following conjectures can be made about the five cases. Case 1 will cause ITERLP to choose a very distant cluster with better bandwidth, although the slower-bandwidth cluster is much closer and has faster processors. Case 2 will make ITERLP choose a cluster with better bandwidth, but with slower processors, in the same network. Case 3 gives an example that will show the limitation of sorting only by communication parameters when bandwidth is the same for all processor nodes. Case 4 is an extended version of case 2 with two additional clusters. Case 5 gives a wide range of parameters and should give some varying results.
Chapter 4
Results
4.1 Simulations with random parameters
Simulations with ranges of randomly generated system parameters were conducted for m = 5, 16, 32 and δ = 0.5. The following algorithms were used when simulating with m = 5: OPT, ITERLP, ITERLP2A, and ITERLP2B. The ITERLP2 algorithms were run with four different sorting configurations (by C, by C then E, by E, and by L). When m = 16 or m = 32, the time taken to calculate OPT would be unrealistic, so only ITERLP, ITERLP2A, and ITERLP2B were used. There were 32 cases in total, with 100 simulation runs for each configuration of algorithm and sorting order.
The average percent deviation of each algorithm variant's solution time T_VARIANT from OPT is represented by ΔT_VARIANT and calculated as below:

\Delta T_{\mathrm{VARIANT}} = \frac{\mathrm{Mean}(T_{\mathrm{VARIANT}}) - \mathrm{Mean}(T_{\mathrm{OPT}})}{\mathrm{Mean}(T_{\mathrm{OPT}})} \times 100\% \qquad (4.1)
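The deviation in (4.1) is straightforward to compute from the recorded solution times; a minimal sketch (an assumed helper, not the thesis code):

#include <numeric>
#include <vector>

// Average percent deviation of a variant's solution times from OPT, as in (4.1).
double percentDeviation(const std::vector<double>& variantT, const std::vector<double>& optT) {
    const double meanVariant = std::accumulate(variantT.begin(), variantT.end(), 0.0) / variantT.size();
    const double meanOpt     = std::accumulate(optT.begin(), optT.end(), 0.0) / optT.size();
    return (meanVariant - meanOpt) / meanOpt * 100.0;
}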
Figure 4.1: Simulation results.
The results for the first 16 cases are depicted in Fig. 4.1. The graphs show the average percent deviation ΔT_VARIANT of the variant algorithms ITERLP, ITERLP2A, and ITERLP2B from OPT. ITERLP is compared to both ITERLP2 algorithms sorting by C and by E in the left and right graphs respectively. Cases 4, 8, 12, and 16 all show anomalous points where both ITERLP2 algorithms produce solution times lower than OPT; this was due to the values for L being too high in relation to C and E. The problem was that OPT permutes all possible combinations of 5 processors; some processors were allocated zero load but were still assigned the cost of latency. This let ITERLP2A and ITERLP2B produce much better schedules than OPT, only because they were able to terminate early with fewer processors when zero load was allocated. Figure 4.2 illustrates a simulation run from case 8; it shows that OPT schedules load to 2 processors, but incurs the latency of 5 processors, even though 3 of them are allocated zero load.

Another anomaly can be seen in cases 9, 10, and 11. Since these cases have similar results, one graph is shown for case 11 in Fig. 4.6. It appears that ITERLP outperforms both of the ITERLP2 algorithms that sort by E or by L. These cases show that when the values of the communication parameters are too high in relation to the computational parameters, sorting by E or by L is not suitable. Though the results in these cases are near-optimal for both ITERLP2 algorithms when sorting by C or by C then E, in realistic cases ITERLP should not be outperforming the ITERLP2 algorithms that sort by E or by L. The reason is that the ITERLP2 algorithms are designed to find the best solution times with the latency cost factored in, whereas simply adjusting ITERLP's solution for latency is expected to fail.
Figure 4.2: Anomalous results.

The remainder of the results, in cases 1, 2, 3, 5, 6, 7, 13, 14, and 15, show the limitations that ITERLP has when latency is ignored. They also show that ITERLP2A and ITERLP2B both outperform ITERLP in all realistic cases. Overall, ITERLP2A performed best, with the nearest-optimal times.
All even-numbered cases from cases 17 through 32 are shown in Figures 4.3, 4.4, 4.5, and 4.7. For trending purposes, each figure has two graphs that show the effect of increasing the number of processor nodes from 16 to 32. Each graph uses a boxplot; each box has a whisker from the top representing the maximum value, a whisker from the bottom representing the minimum value, and a line passing left-to-right through the box representing the mean value.

The performance of ITERLP in Figures 4.3, 4.4, and 4.7 appears to follow the same downward trend when the number of processor nodes is increased from 16 to 32. The results from case 26 in Fig. 4.3 show that ITERLP produced a median solution time that is almost 3 times the worst solution times produced by either ITERLP2 algorithm. ITERLP clearly fails to perform near-optimally and is, by far, outperformed by ITERLP2A and ITERLP2B in these cases. Both ITERLP2 algorithms perform well in all six graphs; although the performance does improve, the solution time is not halved when the number of processor nodes is doubled to 32. No matter what sorting method the ITERLP2 algorithms use, performance stays fairly consistent, with median values that vary by no more than 30 time units.
Figure 4.3: Simulation results. (a) Case 18: 16 processors. (b) Case 26: 32 processors.

Figure 4.4: Simulation results. (a) Case 20: 16 processors. (b) Case 28: 32 processors.

Figure 4.5: Simulation results. (a) Case 22: 16 processors. (b) Case 30: 32 processors.
Fig. 4.5 shows results for cases 24 and 32, which follow a trend similar to what is seen in cases 9, 10, and 11; they also share the same C, E, and L parameters as case 11. Results for case 11 are shown in Fig. 4.6 for comparison with Fig. 4.5. The results for both ITERLP2 algorithms show again that when the values of the communication parameters are too high in relation to the computational parameters, sorting by E or by L is not suitable. Again, it is unrealistic for ITERLP to have better performance than ITERLP2A and ITERLP2B sorted by C or by C then E.

Figure 4.6: Case 11 results.
Figure 4.7: Simulation results. (a) Case 24: 16 processors. (b) Case 32: 32 processors.
Though there were a few anomalies with certain system parameters, the results clearly show that, with realistic ranges of random values and the cost of latency factored in, ITERLP does not perform near-optimally; ITERLP has clear limitations when latency is ignored. The simulations also show that both ITERLP2 algorithms outperform ITERLP in all realistic cases, and that ITERLP2A performed best, with the nearest-optimal times overall.
4.2 Clusters
The following results include all algorithms except OPT. Experiments were conducted for 2, 4, and 8 clusters with m = 4, 8, 16, or 32 processor nodes in each cluster, and up to a possible 64 processors per experiment. The ITERLP2 algorithms were run with four different sorting configurations (by C, by C then E, by E, and by L). Figures 4.8, 4.9, 4.10, 4.11, and 4.12 show results from all five cluster experiments; there are two graphs per figure, with one cluster simulation run in each.
Figure 4.8: Case 1 cluster results. (a) 8 processors. (b) 16 processors.
Results for case 1 are quite interesting, because they show that ITERLP, and both ITERLP2 algorithms sorting by C or by C then E, get progressively worse when the number of processors is increased from 8 to 16. The solution time of ITERLP appears to get much worse, whereas both ITERLP2 algorithms seem to worsen only slightly. Despite the poor solution times when sorting by C or by C then E, the results are the complete reverse for both ITERLP2 algorithms when sorting by E or by L; those results not only improve when going from 8 to 16 processors, but also outperform ITERLP by roughly a factor of three. Fig. 4.8 shows that, with 16 processors and either ITERLP2 algorithm, the worst solution times when sorting by E or by L are around 180 time units, as opposed to around 560 for ITERLP. The parameters used in this cluster experiment clearly favour sorting by E or by L.
Case 2 produced very uniform results in Fig. 4.9 for all algorithms; the results do show that ITERLP is outperformed by ITERLP2A and ITERLP2B, but not by much. Another thing to note is that, when going from 16 to 32 processors, the results worsen when sorting by C or by C then E, and there is no change when sorting by E or by L. The reason the results worsen with 32 processors is that, on average, 16 processors were chosen for the solution; this was the worst choice because their computational parameters were much higher than those of the other 16 processors that were not chosen. Overall for this case, whether ITERLP2A or ITERLP2B is used, any sort order is sufficient, and ITERLP is still outperformed by both ITERLP2 algorithms.
Figure 4.9: Case 2 cluster results. (a) 16 processors. (b) 32 processors.

Figure 4.10: Case 3 cluster results. (a) 16 processors. (b) 32 processors.
Results from case 3 are among the most interesting of the cluster experiments. This experiment was expected to show the limitation of sorting only by C with ITERLP. That expectation was confirmed, and the results also show that, with either ITERLP2 algorithm, sorting by C is the worst choice. Based on the parameters chosen, it was expected that sorting by E, by C then E, or by L would give the same, nearest-optimal results. The results also show that the performance of ITERLP goes from mediocre with 16 processors to more than twice as bad with 32 processors. Fig. 4.10 shows that ITERLP is once again outperformed by both ITERLP2 algorithms.
Cases 4 and 5 show similar results, with some interesting irregularities. When going from 32 to 64 processors, some solution times did not change, others worsened, and the rest improved only slightly. Figures 4.11 and 4.12 show that sorting by E or by L is the best choice of ordering with both ITERLP2 algorithms, even though the times are worse with 64 processors.

Figure 4.11: Case 4 cluster results. (a) 32 processors. (b) 64 processors.

ITERLP is especially worsened when going from 32 to 64 processors in Fig. 4.11, with a solution time that is double that of ITERLP2B sorting by E. Though the results worsen in some cases for ITERLP2A and ITERLP2B, ITERLP is still outperformed; as well, ITERLP2B appears to produce the nearest-optimal times.
Figure 4.12: Case 5 cluster results. (a) 32 processors. (b) 64 processors.
It was unexpected that ITERLP2B would produce better times than ITERLP2A in all of the cluster cases, since in the random simulations ITERLP2A produced the nearest-optimal times.
Chapter 5
Conclusion
In this thesis a modified version of the heuristic algorithm ITERLP, named ITERLP2, was created to investigate other sorting methods and the cost of latency. Random simulations and cluster experiments have shown that, in realistic cases, ITERLP fails to produce near-optimal times when latency is introduced; however, ITERLP2A and ITERLP2B consistently produce near-optimal performance regardless of the level of heterogeneity, the number of processor nodes, or the magnitude of the result set. The cluster experiments have shown where each of the sorting methods is best suited, but no method is best suited for all cases. ITERLP2A performs best in the random simulations, whereas ITERLP2B performs best in the cluster experiments. Finally, results from experimentation have shown that ITERLP2 outperforms ITERLP in all realistic cases.
5.1 Contributions
As a contribution to DLT, it has been shown that ITERLP has limitations when it sorts only by communication parameters and neglects the cost of latency. As well, a modified and improved heuristic algorithm, ITERLP2, has been created that is more realistic because it factors in the cost of latency and considers other sorting methods. It has been shown that ITERLP2 outperforms ITERLP in all realistic cases.
5.2 Future Work
When the value of m becomes very large, ITERLP2 does not effectively schedule divisible loads
and could take a great deal of time to complete. To help get around this limitation, one can produce
an application of DLS involving clusters of clusters, whereby each cluster can be considered an equivalent processor [1, 9, 11]. A load schedule can be created by treating the clusters of clusters as a multilevel processor tree that is solved recursively [1]. This is done by initially solving with the set of equivalent processors in the first level of the tree. After each cluster in the first level has had a load fraction calculated for it, each child processor or cluster below it has its parent's load fraction split up again. This process repeats until the leaf nodes have been reached in all branches.
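One possible shape for such a recursive scheme is sketched below; it is speculative C++ only (the ClusterNode structure and the naive proportional one-level split are assumptions standing in for a real DLS algorithm such as ITERLP2 applied to the equivalent processors at each level).

#include <cstddef>
#include <vector>

struct ClusterNode {
    double C, E, L;                     // equivalent-processor parameters of this cluster
    std::vector<ClusterNode> children;  // sub-clusters or individual processors
    double load = 0.0;                  // load fraction assigned to this subtree
};

// Naive placeholder for one level of scheduling: split the load in proportion to speed (1/E).
// A real implementation would run a DLS algorithm such as ITERLP2 over the children instead.
std::vector<double> scheduleOneLevel(const std::vector<ClusterNode>& children, double load) {
    double total = 0.0;
    for (const ClusterNode& c : children) total += 1.0 / c.E;
    std::vector<double> fractions;
    for (const ClusterNode& c : children) fractions.push_back(load * (1.0 / c.E) / total);
    return fractions;
}

// Recursively split a load down a multilevel tree of equivalent processors.
void scheduleTree(ClusterNode& node, double load) {
    node.load = load;
    if (node.children.empty()) return;                 // leaf processor: nothing left to split
    std::vector<double> fractions = scheduleOneLevel(node.children, load);
    for (std::size_t i = 0; i < node.children.size(); ++i)
        scheduleTree(node.children[i], fractions[i]);  // the parent's fraction is split again
}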
Bibliography
[1] O. Beaumont, A. Ghatpande, H. Nakazato, H. Watanabe. Divisible Load Scheduling with Result
Collection on Heterogeneous Systems. (Heterogeneous Computation Workshop IPDPS, pp. 1-8,
2008).
[2] S. K. Chan, V. Bharadwaj, D. Ghose. Large matrix-vector products on distributed bus networks with communication delays using the divisible load paradigm: performance analysis and
simulation. (Math. Comput. Simul. 58, 1 (Dec. 2001), 71-92).
[3] S. Bataineh, T. Hsiung, T.G. Robertazzi. Closed Form Solutions for Bus and Tree Networks of
Processors Load Sharing a Divisible Job (IEEE Trans. Comput., vol. 43, no. 10, pp. 1184-1196,
1994).
[4] A. Ghatpande, H. Nakazato, O. Beaumont, H. Watanabe. Analysis of Divisible Load Scheduling
with Result Collection on Heterogeneous Systems (IEICE Transactions on Communications
2008, vol. E91-B, no. 7, pp. 2234-2243, 2008).
[5] A. Ghatpande, H. Nakazato, O. Beaumont, H. Watanabe. SPORT: An Algorithm for Divisible
Load Scheduling with Result Collection on Heterogeneous Systems (IEICE Transactions on
Communications 2008, vol. E91-B, no. 8, pp. 2571-2588, 2008).
[6] SETI@home. SETI@home: Search for Extraterrestrial Intelligence at Home (http://setiathome.ssl.berkeley.edu, August 2008).
[7] ACEnet wiki. ACEnet website (http://wiki.ace-net.ca, May, 2008).
[8] O. Beaumont, L. Marchal, Y. Robert. Scheduling divisible loads with return messages on heterogeneous master-worker platforms (Proc. 12th International Conference on High Performance Computing, Goa, India, December 2005, vol. 3769, pp. 498-507).
[9] O. Beaumont, H. Casanova, A. Legrand, Y. Robert, Y. Yang. Scheduling Divisible Loads on
Star and Tree Networks: Results and Open Problems (IEEE Transactions on Parallel and
Distributed Systems, vol. 16, no. 3, pp. 207-218, March 2005).
[10] T. G. Robertazzi. Ten Reasons to Use Divisible Load Theory. (Computer, vol. 36, no. 5, pp.
63-68, May, 2003).
[11] D. Yu, T.G. Robertazzi. Divisible Load Scheduling for Grid Computing (Proc. of the IASTED
International Conference on Parallel and Distributed Computing and Systems, Los Angeles,
Nov, 2003).
[12] COIN-Clp. COmputational INfrastructure for Operations Research (http://www.coin-or.org,
May, 2008).