Strappaveccia Francesco tesi
Alma Mater Studiorum - Università di Bologna
PhD PROGRAMME IN
COMPUTER SCIENCE
Cycle XXVII
Competition sector (Settore Concorsuale): 01/B1
Scientific-disciplinary sector (Settore Scientifico Disciplinare): INF/01
MANY-CORE ALGORITHMS FOR COMBINATORIAL
OPTIMIZATION
Presented by: Francesco Strappaveccia
PhD Coordinator: Prof. Paolo Ciaccia
Supervisor: Prof. Vittorio Maniezzo
Final examination year 2015
“May the wind under your wings bear you where the sun sails and the moon walks”
J. R. R. Tolkien
UNIVERSITY OF BOLOGNA
Abstract
Facoltà di Scienze Naturali, Fisiche e Matematiche
DISI, Dipartimento di Informatica, Scienze ed Ingegneria
Doctor of Philosophy
Many-core Algorithms for Combinatorial Optimization
by Francesco Strappaveccia
Combinatorial Optimization (CO) is becoming ever more crucial these days. From
natural sciences to economics, passing through the administration of urban centers and
personnel management, methodologies and algorithms with a strong theoretical
background and consolidated real-world effectiveness are increasingly in demand,
in order to quickly find good solutions to complex strategic problems. Resource
optimization is, nowadays, a fundamental basis for building the foundations of
successful projects. From the theoretical point of view, CO rests on stable and
strong foundations, which allow researchers to face ever more challenging problems.
However, from the application point of view, it seems that the rate of theoretical
developments cannot keep pace with that of modern hardware technologies,
especially in the processor industry.
We are witnessing a technological ‘golden era’, where an enormous amount of computational capability is available at an affordable cost, even at the consumer
market level. It is also true that, to fully exploit these devices, a complete shift
of focus is necessary in thinking about and approaching optimization problems. In this
work we propose new parallel algorithms, designed to exploit the new parallel
architectures available.
In our research, we found that exposing the inherent parallelism of some resolution techniques (like Dynamic Programming) yields remarkable computational
benefits, lowering execution times by more than an order of magnitude
and making it possible to address instances of sizes that were previously out of reach.
We approached four notable CO problems: the Packing Problem, the Vehicle Routing
Problem, the Single Source Shortest Path Problem and a Network Design problem.
For each of these problems we propose a collection of effective parallel solution
algorithms, either for solving the full problem (Guillotine Cuts and SSSPP) or for
enhancing a fundamental part of the solution method (VRP and ND).
We support our claims by presenting computational results for all problems, either
on standard benchmarks from the literature or, when possible, on data from real-world applications, where speed-ups of one order of magnitude are usually attained,
not uncommonly scaling up to factors of 40 X.
Acknowledgements
I would like to thank first of all Professor Vittorio Maniezzo for his support during
this journey. Dr. Marco Antonio Boschetti for his precious theoretical and
human support. Prof. Marcus Poggi and Prof. Geir Hasle for their advice and
precise observations on my work. My family, for their love, since the beginning.
Sara for her love, patience and understanding in the deepest dejection and in the
brightest joy (this would not have been possible without you). Alessandro, for
his sincere and lasting friendship, over the last ten years. Francesco, Stefano and
Rossano for the sincere and “politically correct” friendship. Matteo and Samuele
for their friendship in this foreign territory from via Machiavelli to the “TwinPeaks-TriDomino-Jenga-Poker” evenings. Elena for every piece of advice, suggestion (scientific
or not) and lunch together.
Contents
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 High Performance Computing
  2.1 Definition
  2.2 Main fields of application
    2.2.1 Simulation
    2.2.2 Bioinformatics
    2.2.3 Large Scale Data Visualization and Management
    2.2.4 Combinatorial Optimization
  2.3 High Performance Computing Programming Models
    2.3.1 Message Passing (MPI)
      2.3.1.1 MPI Execution Model
    2.3.2 Shared Memory (OpenMP)
      2.3.2.1 OpenMP Execution Model
    2.3.3 GPGPU Computing (CUDA, OpenCL)
      2.3.3.1 NVIDIA CUDA
      2.3.3.2 OpenCL
3 Combinatorial Optimization
  3.1 Definition
  3.2 Linear Problems (LP)
  3.3 Integer Linear Problems (ILP)
  3.4 Nonlinear Problems (NPL)
  3.5 Constraints Satisfaction Problems (CSP)
  3.6 Solution and Approximation Methodologies
    3.6.1 Simplex Algorithm
    3.6.2 Lagrangean Relaxation
    3.6.3 Column Generation
    3.6.4 Dynamic Programming
    3.6.5 Branch and Bound, Branch and Cut Techniques
4 Rectangular Knapsack, 2D Unconstrained Guillotine Cuts
  4.1 Introduction
  4.2 The 2D-GCP: notation, definitions and algorithms
    4.2.1 Principle of Normal Patterns
    4.2.2 A Dynamic Programming algorithm for the 2D-GCP
  4.3 The GPU porting of the dynamic programming algorithm
    4.3.1 Exploiting the inherent parallelism
    4.3.2 Recurrence Parallelization on GPU
    4.3.3 Parallel Reduction for the Max Operation
    4.3.4 Matrix Update
    4.3.5 GPU Algorithm
  4.4 Computational results
    4.4.1 Test instances
    4.4.2 Experiments
  4.5 Considerations and Future Work
5 Vehicle Routing Problem
  5.1 Introduction
  5.2 Problem Description and Mathematical Formulations
    5.2.1 Two Index Formulation
    5.2.2 Set Partitioning Formulation
  5.3 Dynamic Programming Relaxations for the Pricing Problem
    5.3.1 q-Route Relaxation
    5.3.2 through-q-Route Relaxation
    5.3.3 ng-Route Relaxation
  5.4 Parallel Relaxations on GPU
    5.4.1 GPU q-Route
      5.4.1.1 Exposing Parallelism
      5.4.1.2 Algorithm Description
    5.4.2 GPU through-q-route
      5.4.2.1 Exposing Parallelism
      5.4.2.2 Algorithm Description
    5.4.3 GPU ng-Route
      5.4.3.1 Exposing Parallelism
      5.4.3.2 Dominance Management
      5.4.3.3 Active Sets
      5.4.3.4 Algorithm Description
    5.4.4 Asymmetric Relaxations
  5.5 Computational Results
  5.6 Considerations and Future Work
6 Single Source Shortest Path Problem
  6.1 Introduction
  6.2 Problem Definition
    6.2.1 Single Source Shortest Path Problem
    6.2.2 Multi-Modal Networks
    6.2.3 Label Constrained Shortest Path Problem
    6.2.4 Time-Dependent Networks
  6.3 Serial Algorithm
    6.3.1 Time-Dependent Augmented Dijkstra Algorithm
  6.4 GPU Algorithm
    6.4.1 GPU Dijkstra Algorithm
      6.4.1.1 Frontier Set
      6.4.1.2 GPU Implementation
    6.4.2 Dynamic Frontier Definition
    6.4.3 GPU Time-Dependent Algorithm
  6.5 Shared Memory Algorithm
    6.5.1 OpenMP Porting
  6.6 Computational Results
    6.6.1 Data Sets Creation
    6.6.2 Results
  6.7 Considerations and Future Work
7 Membership Overlay Problem
  7.1 Introduction
  7.2 Membership Overlay Problem
    7.2.1 Definition and Mathematical Formulation
    7.2.2 MIP Formulation
  7.3 Linear and Lagrangean SMOP Relaxations
    7.3.1 LP Relaxation
    7.3.2 Lagrangean Relaxation
  7.4 Subgradient Algorithm
  7.5 Distributed Subgradient
    7.5.1 Algorithm foundations
    7.5.2 Quasi-constant step size update
    7.5.3 Algorithm description
  7.6 Computational Results
  7.7 Shared Memory Algorithm
  7.8 Distributed Memory Algorithm
  7.9 GPU Algorithm
  7.10 Computational Results
  7.11 Considerations and Future Work
8 Conclusions
Bibliography
List of Figures
2.1 OpenMP logo.
2.2 Fork-Join Execution Model.
2.3 CUDA logo.
2.4 CUDA execution and memory models.
2.5 OpenCL logo.
4.1 Computation of function V(x, y).
4.2 Parallel tasks executed exploiting the decomposition based on antidiagonals.
4.3 Reduction Tree.
4.4 Naive reduction.
4.5 Naive reduction gives rise to bank conflicts (each column of the shared memory represents a memory bank).
4.6 Strided access allows a fast reduction.
4.7 Strided access avoids bank conflicts.
4.8 Matrix update in the GPU computing approach proposed.
4.9 randcut6000c source.
4.10 randcut6000c solution.
4.11 testcut8000 solution.
4.12 gcut13 solution.
5.1 f(q,i) q-path computation and 2-vertices loops avoiding.
5.2 ng-Path example.
5.3 Stage q parallel computation.
5.4 NG path 3-D states space.
6.1 Types of Routes.
6.2 Automaton managing four states.
6.3 Speed Profiles.
6.4 Test Path in Berlin.
6.5 Serial times vs OpenMP times for the Berlin instance.
6.6 Speed-Up for different h values for the Los Angeles instance.
7.1 P2P network with 4 nodes.
7.2 Edges bandwidth values.
7.3 Edges p’ values.
7.4 Edges lower bound l.
7.5 Edges upper bound u.
7.6 MOP Optimal Solution.
7.7 Nodes deploy in a eight core processor.
7.8 Performances evaluation.
List of Tables
2.1 CUDA Memory Hierarchy.
4.1 Computational results on gcut instances.
4.2 Computational results on testcut instances.
4.3 Computational results on randcut instances.
5.1 Computational results for the symmetric q-path and through-q-route relaxations.
5.2 Computational results for the asymmetric q-path and through-q-route relaxations.
5.3 Computational results for the symmetric ng-path relaxation.
5.4 Computational results for the asymmetric ng-path relaxation.
6.1 Test instances dimensions.
6.2 Computational results on CPU.
6.3 Computational results on GPU.
7.1 Gaps and execution times for LP, STD, DIST.
7.2 Speed-Ups of the CUDA, MPI and OpenMP algorithms.
To my family.
To Vera, Marcello and Augusta.
To Sara.
Chapter 1
Introduction
Over the years, Moore’s Law has brought the microprocessor industry to a turning point. In fact, doubling the computational capability of a
CPU every eighteen months was becoming a huge engineering problem, due to
increasing operating frequencies, power consumption and expensive cooling systems.
For instance, the heat density on the surface of a high-frequency microprocessor
can be compared to the heat density on an equal area of the Sun’s surface.
Thus, the solution has been to decrease the operating frequencies and to add more
computation units inside the same silicon die; Moore’s Law, having shifted its
perspective from doubling the frequencies to doubling the number of computation
units inside the processor, therefore remains valid.
This paradigm shift forced a massive change in software engineering and in computer science in general. In fact, from building an operating system to designing
new algorithms, the high level of parallelism of the new platforms is an ever more
important aspect to consider. Also, the academic research community is every year more interested in exploiting this enormous new availability of
computing resources, aware that this can be a great opportunity to reach new
and significant goals, unexpected until a few years ago. On the other hand, taking
advantage of these architectures requires a high level of expertise and a deep
knowledge of the features implemented on these devices.
A new kind of parallel Lagrangean Relaxation, related to a network design problem (the Membership Overlay Problem), was analyzed and implemented
in [1]. This algorithm achieved a large speed-up when implemented with three
of the main parallel paradigms available (MPI, OpenMP and CUDA). This result
encouraged a deeper investigation of this new approach to designing optimization
algorithms, and this topic was chosen as the main focus of this Ph.D. thesis.
Moreover, Combinatorial Optimization is still only marginally affected by this new
approach, and the analysis of the most widely used resolution methods, like Dynamic
Programming, Branch and Bound techniques or Column Generation, from this
new parallel perspective is in its infancy. State-of-the-art microprocessors
now make it possible to treat problems never approached before because of the size of the data,
the enormous amount of computation required for their solution, or the
granularity of the computation.
The contribution of this thesis is a collection of parallel algorithms designed to
improve, sometimes significantly, the execution times of the methods cited above.
We approached some of the most notable problems in CO.
The Rectangular Knapsack (Unconstrained 2D Guillotine Cuts Problem), for which a Dynamic
Programming algorithm designed for a many-core platform (GPU) achieved a speed-up of
22 X on the biggest instances known in the literature and a speed-up of 30 X on
two new sets of instances, generated to test the real effectiveness of the method.
The Vehicle Routing Problem, for which we provide six many-core algorithms for
speeding up the relaxations (q-routes, through-q-routes and ng-routes)
of the pricing problem inside the Column Generation exact method, based
on the Set Partitioning formulation of the problem. In this case, we achieved a
maximum speed-up of 40 X for the asymmetric version of the ng-routes relaxation,
an average speed-up of 20 X for the q-routes and through-q-routes methods and
an average speed-up of 10 X for the ng-routes.
The Single Source Shortest Path Problem, for solving the Earliest Arrival Problem
in a Multi-Modal, Time-Dependent environment. For this problem, we provide a
multi-core (CPU) algorithm, achieving a maximum speed-up of 3.5 X on
instances based on real-world data.
A Network Design Problem, the Membership Overlay, relating to the Peer-to-Peer
network model. For this problem, we propose a parallel subgradient algorithm
for solving the Lagrangean Dual Problem of the problem’s Lagrangean
relaxation, achieving a 29 X speed-up on the larger instances.
This thesis is structured as follows: in Chapter 2 we will briefly review the High
Performance Computing field and the parallel programming models most used at present.
In Chapter 3 we will give some definitions for Combinatorial Optimization problems and we will briefly describe some solution methodologies.
In Chapter 4 we will propose a GPU algorithm for solving the Two-Dimensional
Guillotine Cutting Problem, with computational results and two new sets of instances.
In Chapter 5 we will describe the most effective pricing strategies for Column
Generation algorithms for solving some classes of the VRP, proposing the corresponding
parallel algorithms with computational results.
In Chapter 6 we will address the Single Source Shortest Path Problem in a Time-Dependent Multi-Modal environment, proposing two parallel algorithms (CPU and
GPU) and a new set of instances.
In Chapter 7 we will propose three parallel subgradient algorithms for three platforms, CPU,
GPU and cluster, to obtain a valid bound for the Membership Overlay Problem.
In Chapter 8 we will draw the conclusions of our work and give some guidelines
for future research.
Chapter 2
High Performance Computing
2.1 Definition
By High Performance Computing (HPC) we mean the use of computers for high-throughput computation, for solving large problems, or for getting results faster. The
term applies to desktop computers, servers or clusters characterized by
multiple computation units that can cooperate. Supercomputers were introduced in
the 1960s and were designed primarily by Seymour Cray at Control Data Corporation (CDC), and later at Cray Research. While the supercomputers of the 1970s
used only a few processors, in the 1990s machines with thousands of processors
began to appear, and by the end of the 20th century massively parallel supercomputers with tens of thousands of “off-the-shelf” or commercial processors were
marketed [2].
The design of systems with a massive number of processors generally takes one
of two paths. In one approach, e.g., grid computing, the processing power of
a large number of computers in distributed, diverse administrative domains is
opportunistically used whenever a computer is available. In the other approach, a
large number of processors are used in close proximity to each other, for instance in
a computer cluster. The use of multi-core processors combined with centralization
is an emerging direction [3]. In the early years of the 21st century, another emerging
trend is to evaluate the “performance per watt” factor, or the sustainability of a
supercomputer, which deserves a separate ranking [4]. The energy consumption of a
machine with 100,000 or more cores is obviously relevant. To address this
aspect as well, GPU computing, or heterogeneous computing, seems to be the right
way. The most recent supercomputers in the Top 500 list [5] are in most cases
composed not only of conventional processors but also, inside every computation node, of
several “accelerators”, dramatically more energy-efficient than the conventional
CPU.
2.2 Main fields of application
The need for such a large amount of computational capability is restricted to specific research areas like physics, chemistry, engineering, etc. In the next sections,
we will give a brief list of the applications of HPC. Nowadays, however, parallelism
is a common factor shared by systems ranging from the conventional computer to the most advanced smartphones, and the same paradigms used to exploit HPC structures
can be applied to smaller systems, enhancing their performance, sometimes dramatically.
2.2.1 Simulation
Simulation is the imitation of the operation of a real-world process or system over
time [6]. The act of simulating something first requires that a model be developed;
this model represents the key characteristics or behaviors of the selected physical
or abstract system or process. The model represents the system itself, whereas
the simulation represents the operation of the system over time.
Simulation is used in many fields, such as engineering design, building design, geology, climatology and video games. Simulation can be used to show the possible
real effects of alternative conditions or to predict the behavior of a system (for example, a building during an earthquake or an electronic device like a smartphone).
Simulation is also used when the real system cannot be studied directly, because
it is inaccessible, dangerous, not yet built or simply virtual.
More specifically, a computer simulation models a real-life or hypothetical situation
on a computer so that it can be studied to see how the system works. By changing
variables in the simulation, predictions may be made about the behavior of the
system. It is a tool to virtually investigate the studied system under changing
conditions. Computer simulation has become a useful part of modeling many
natural systems in physics, chemistry and biology, as well as in engineering. For
example, computer simulation of the behavior of
a system is particularly useful in the field of network traffic analysis.
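The distinction between model and simulation drawn above can be made concrete with a minimal sketch. The Python code below is purely illustrative and is not taken from this thesis: it assumes a hypothetical network buffer whose model is a single update rule (arrivals, overflow, service), while the simulation is the repeated application of that rule over time. The function name simulate_buffer and all parameter values are assumptions chosen for the example.

```python
def simulate_buffer(arrivals_per_step, served_per_step, steps, capacity):
    """Run a hypothetical buffer model over time.

    Returns the queue length after each time step and the total
    number of packets dropped because the buffer was full.
    """
    queue = 0
    dropped = 0
    history = []
    for _ in range(steps):
        queue += arrivals_per_step              # packets arriving in this step
        if queue > capacity:                    # overflow: excess packets are lost
            dropped += queue - capacity
            queue = capacity
        queue = max(0, queue - served_per_step)  # packets forwarded in this step
        history.append(queue)
    return history, dropped


# Change one variable (the arrival rate) and observe the predicted behavior.
for rate in (3, 5, 7):
    history, dropped = simulate_buffer(rate, served_per_step=5, steps=10, capacity=20)
    print(f"arrival rate {rate}: final queue {history[-1]}, dropped {dropped}")
```

Changing only the arrival rate shows whether the buffer stays empty or saturates and starts dropping packets, which is the kind of prediction under changing conditions that the paragraph above describes.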
Key issues in simulation include acquisition of valid source information about
the relevant selection of key characteristics and behaviors, the use of simplifying
approximations and assumptions within the simulation, and fidelity and validity of
the simulation outcomes. Traditionally, mathematical models are used to formally
describe systems, models that attempt to find analytical solutions enabling the
prediction of the behavior of the system from a set of parameters and initial
conditions. Computer simulation is often used as an adjunct to, or substitution for,
modeling systems for which simple closed form analytic solutions are not possible.
There are many different types of computer simulation; the common feature they
all share is the attempt to generate a sample of representative scenarios for a
model in which a complete enumeration of all possible states would be prohibitive
or impossible [7]. These are some research areas that use HPC:
• Computational Fluid Dynamics: is a branch of fluid mechanics that
uses numerical methods and algorithms to solve and analyze problems that
involve fluid flows. Computers are used to perform the calculations required
to simulate the interaction of liquids and gases with surfaces defined by
boundary conditions. With high-speed supercomputers, better solutions can
be achieved. Ongoing research yields software that improves the accuracy
and speed of complex simulation scenarios such as transonic or turbulent
flows. Initial validation of such software is performed using a wind tunnel
with the final validation coming in full-scale testing, e.g. flight tests [8].
• Computational Chemistry: is a branch of chemistry that uses principles
of computer science to assist in solving chemical problems. It uses the results
of theoretical chemistry, incorporated into efficient computer programs, to
calculate the structures and properties of molecules and solids [9].
• Multi Agent Systems: are systems composed of multiple interacting intelligent agents within an environment. Multi agent systems can be used
to solve problems that are difficult or impossible for an individual agent
or a monolithic system to solve. Intelligence may include some methodic,
functional, procedural or algorithmic search, find and processing approach.
The study of multi-agent systems is concerned with the development and
analysis of sophisticated Artificial Intelligence problem-solving and control
architectures for both single-agent and multiple-agent systems [10].
• Computational Astrophysics: refers to the methods and computing tools
developed and used in astrophysics research. It is both a specific branch of
theoretical astrophysics and an interdisciplinary field relying on computer
science, mathematics, and wider physics. Computational astrophysics is
most often studied through an applied mathematics or astrophysics program.
Well-established areas of astrophysics employing computational methods include magnetohydrodynamics, astrophysical radiative transfer, stellar and
galactic dynamics, and astrophysical fluid dynamics [11].
• Computational Physics: is the study and implementation of numerical
algorithms to solve problems in physics for which a quantitative theory already exists. It is often regarded as a sub-discipline of theoretical physics but
some consider it an intermediate branch between theoretical and experimental physics. It is a subset of computational science (or scientific computing),
which covers all of science rather than just physics. Theoretical physicists
provide very precise mathematical theory describing how a system will behave. Unfortunately, it is often the case that solving the theory’s equations
ab initio in order to produce a useful prediction is not practical. This is especially true with quantum mechanics, where only a handful of simple models
admit closed-form, analytic solutions. In cases where the equations can only
be solved approximately, computational methods are often used [12].
As we can see, these scientific disciplines need an enormous amount of computing time, due
to the intensive and complex calculations required to solve the models
on which they are based.
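The idea mentioned in this section, generating a sample of representative scenarios when enumerating every state is prohibitive, can be sketched as follows. This is a minimal illustration under stated assumptions, not a method from this thesis: the toy system has 40 independent components (about 10^12 joint states), and the names estimate_working_probability and exact_working_probability, as well as all numerical values, are hypothetical. The closed-form value is computed only to check the sample estimate, which is possible here because the toy model is deliberately simple.

```python
import math
import random


def estimate_working_probability(n_components, p_up, k_required, n_samples, seed=0):
    """Estimate P(at least k_required of n_components are up) by random sampling."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    hits = 0
    for _ in range(n_samples):
        # One scenario: each component is independently up with probability p_up.
        up = sum(1 for _ in range(n_components) if rng.random() < p_up)
        if up >= k_required:
            hits += 1
    return hits / n_samples


def exact_working_probability(n_components, p_up, k_required):
    """Binomial tail probability, used here only as a reference value."""
    return sum(
        math.comb(n_components, j) * p_up**j * (1 - p_up) ** (n_components - j)
        for j in range(k_required, n_components + 1)
    )


estimate = estimate_working_probability(40, 0.9, 35, n_samples=20000)
exact = exact_working_probability(40, 0.9, 35)
print(f"sampled: {estimate:.3f}  exact: {exact:.3f}")
```

Each sampled scenario is independent of the others, so the loop over samples is a natural candidate for parallel execution; that observation is a general property of this kind of sampling rather than a result reported here.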
2.2.2 Bioinformatics
Bioinformatics is an interdisciplinary field, involving disciplines ranging from data mining
and machine learning to operational research, that develops and improves methods for
storing, retrieving, organizing and analyzing biological data. A major activity in
bioinformatics is to develop software to generate useful biological knowledge (e.g.
GROMACS [13]). Bioinformatics has become a fundamental tool for many areas
of biology, mainly the ones characterized by strong mathematical and statistical
aspects. In experimental molecular biology, for instance, bioinformatics techniques
such as image and signal processing allow the extraction of useful results from large
amounts of raw data. In the field of genetics and genomics, it aids in sequencing
and annotating genomes and their observed mutations [14].
Bioinformatics tools also aid in the comparison of genetic and genomic data, a task
that otherwise would not be possible, given the enormous amount of data to analyze. Another notable contribution is the simulation and modeling of DNA, RNA
and proteins in general, as well as of molecular interactions, using precise mathematical models and the computational resources now made available by the microprocessor industry.
The peculiarity of this research field is the design of computationally intensive methods
to enhance the disciplines mentioned before, influenced by the enormous complexity of the problems treated. Some examples of the methodologies involved include
pattern recognition, data mining, machine learning, and visualization. Bioinformatics has made remarkable contributions to sequence alignment, gene finding,
genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution.
Another challenging goal is to develop software and hardware designed following patterns inspired by the biological world itself, or better suited to specific
biological analysis tasks (networks, processors, etc.).
2.2.3
Large Scale Data Visualization and Management
Data visualization is the study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes
or variables for the units of information [15]. Data visualization and management
has become an active area of research, teaching and development, and the field
has gained further interest with the new HPC technologies. Moreover, considering the
enormous quantity of data available nowadays, a correct interpretation of data, based
on solid statistical and mathematical models, has become strategic, from marketing campaign planning to corporate quantitative analysis. Some related areas
are:
• Data Acquisition: is the sampling of the real world to generate data
that can be manipulated by a computer. Sometimes abbreviated DAQ or
DAS, data acquisition typically involves acquisition of signals and waveforms
and processing the signals to obtain desired information. The components
of data acquisition systems include appropriate sensors that convert any
measurement parameter to an electrical signal, which is acquired by data
acquisition hardware and stored in digital form [16].
• Data Analysis: is the process of studying and summarizing data with the
goal of extracting useful information. Data analysis is closely related to data
mining, but data mining tends to focus on larger data sets with less emphasis
on making inferences, and often uses data that was originally collected for
a different purpose. In statistical applications, it is common to divide data
analysis into descriptive statistics, exploratory data analysis, and inferential
statistics (or confirmatory data analysis). For example, exploratory data
analysis focuses on discovering new features in the data, whereas confirmatory
data analysis aims at confirming or refuting existing hypotheses [17].
• Data Mining: is the process of sorting through large amounts of data
and picking out relevant information [18]. It is commonly used in business
intelligence and finance, but it is also used in the sciences to extract useful
information from enormous data sets (e.g. large-scale physics experiments).
It has been described as the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data [19] and the science
of extracting useful information from large data sets or databases [20]. In
relation to enterprise resource planning, data mining is the statistical and
logical analysis of large sets of transaction data, looking for patterns that
can aid decision making.
2.2.4
Combinatorial Optimization
Combinatorial Optimization (CO) is a field that consists of finding an optimal solution
of a constrained problem [21]. In many problems, complete search or enumeration of the solutions is not possible. CO deals with optimization problems in
which the set of feasible solutions is discrete, or can be reduced to a discrete one, and in
which the goal is to find the best solution. This field is characterized by
NP-Hard / NP-Complete problems: finding the optimal or a near-optimal solution of this kind of problem requires a large amount of computing time
and large-scale data structures to manage the frequent combinatorial explosion of
the data involved. This field is treated in more depth in Chapter 3, which analyzes and
describes various kinds of combinatorial problems and their formulations. Combinatorial Optimization involves some of the research areas cited earlier: many
topics related to physics, chemistry and engineering require the solution of linear
or non-linear constrained problems (e.g. lattice problems in particle physics).
2.3
High Performance Computing Programming
Models
In this section we give a short introduction to, and description of, the “de facto”
standards used for the implementation of parallel applications. The spectrum
of parallel programming models is enormously wide, and many models are bound to
specific hardware and software vendors. The three paradigms described in
the following are the most used and studied for their portability, effectiveness and
generality. MPI, OpenMP, CUDA and OpenCL are “de facto” standards
because most parallel infrastructures implement these models, and
the most important parallel libraries (for instance BLAS [22]) and software packages (for
instance OpenFOAM [23], Quantum ESPRESSO [24] and many more) are based on
them.
2.3.1
Message Passing (MPI)
Message Passing Interface (MPI), proposed in 1992 by William Gropp and Ewing
Lusk, is a standardized and portable message-passing system designed by a group
of researchers from academia and industry to operate on a wide variety of parallel
computers [25]. The standard defines the syntax and semantics of a core of library
routines useful to a wide range of users writing portable message-passing programs
in Fortran 77, Fortran 90 or C. Several well-tested
and efficient implementations of MPI exist, including some that are free and in the public
domain. These fostered the development of a parallel software industry, and
encouraged the development of portable and scalable large-scale parallel applications.
MPI provides parallel hardware vendors with a clearly defined base set of routines
that can be efficiently implemented. As a result, hardware vendors can build upon
this collection of standard low-level routines to create higher-level methods for
the distributed-memory communication environment supplied with their parallel
machines. MPI provides a simple-to-use portable interface for the basic user,
yet powerful enough to allow programmers to use the high-performance message
passing operations available on advanced machines.
After twenty years, MPI is still the most used and effective parallel programming
model for clusters. Formally, MPI is defined as a “message-passing application
programmer interface, together with protocol and semantic specifications for how
its features must behave in any implementation” [26]. MPI is not a “de jure”
standard but, thanks to its portability and its well-defined behavior and interface, it
has become the “de facto” standard for distributed-memory architectures. MPI is an
API that sits at layers 5 to 7 of the ISO/OSI communication stack. The main benefit
of using MPI is its portability across parallel environments: typically,
the MPI implementation provided by each vendor is optimized for the hardware
on which the application runs. Moreover, MPI can coexist with code written for
other programming models inside the same application. A common case is
the hybridization with OpenMP, in order to easily exploit the shared-memory
nodes inside the cluster (multi-core CPUs), although MPI itself allows the
programmer to manage a shared-memory computation node. In recent years,
hybridization with the language extensions used to manage
the GPUs or many-core devices inside the computation node (CUDA or OpenCL) has also become common.
2.3.1.1
MPI Execution Model
The MPI interface is meant to provide virtual topology, synchronization, and communication functionalities between a set of processes (that have been mapped to
nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings), plus a few language-specific features. MPI programs
always work with processes, but programmers commonly refer to the processes as
processors. Typically, for maximum performance, each CPU (or core in a multi-core machine) will be assigned just a single process. This assignment happens at
runtime through the agent that starts the MPI program, normally called mpirun
or mpiexec. MPI library functions include, but are not limited to:
• point-to-point, rendezvous-type send/receive operations,
• choosing between a Cartesian or graph-like logical process topology,
• exchanging data between process pairs (send/receive operations),
• combining partial results of computations (gather and reduce operations),
• synchronizing nodes (barrier operation),
• obtaining network-related information, such as the number of processes in the computing session,
the identity of the processor that a process is mapped to, and
the neighboring processes accessible in a logical topology.
In more detail, the MPI runtime system creates n processes, called tasks, and each
of these tasks:
• runs an independent copy of the application on the node where it executes,
• has local memory,
• can be mapped on a different processor,
• has a unique index (rank) among the tasks created by the user.
Each process is part of a Communicator. An MPI Communicator is an object
connecting groups of processes. Each communicator:
• has a name,
• has a cardinality,
• assigns to each of its processes an index that is valid within the communicator itself,
• arranges its processes in an ordered topology,
• treats all of its processes as equivalent.
MPI optimizes the deployment of the communicators, also recognizing when a
data transfer or a communication concerns only a specific communicator.
Each MPI application is a single executable, managed by the run-time system
that organizes the communication among the application’s processes. The MPI
primitives can be blocking or non-blocking with respect to the application execution; the
non-blocking ones are recommended whenever they can be exploited. The primitives also
have different modes of communication:
• Standard: automatic synchronization and buffering,
• Buffered: buffering defined by the user,
• Synchronous: strict rendezvous,
• Ready: immediate communication without handshaking.
At the time of writing, the current version of MPI is 3.0.
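The way MPI tasks use their index to split the work, and then combine partial results with a reduce operation, can be sketched without any MPI library. The following Python fragment is only a sequential simulation under stated assumptions: the names `local_slice` and `spmd_sum` are hypothetical, the ranks are simulated by a loop rather than by real processes, and no MPI routine is called.

```python
# Sequential simulation of an MPI-style SPMD computation (no MPI library is used).
from functools import reduce
import operator


def local_slice(data, rank, size):
    """Return the contiguous block of `data` owned by task `rank` out of `size` tasks."""
    base, extra = divmod(len(data), size)
    # The first `extra` ranks receive one additional element each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return data[start:stop]


def spmd_sum(data, size):
    """Each simulated rank sums its own block; a final reduce combines the partial sums."""
    partials = [sum(local_slice(data, rank, size)) for rank in range(size)]
    return reduce(operator.add, partials, 0)


if __name__ == "__main__":
    values = list(range(1, 11))
    print(spmd_sum(values, size=4))
```

In a real MPI program every rank would execute the same code on its own block, and the final combination would be performed by a collective reduce operation on the communicator.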
2.3.2
Shared Memory (OpenMP)
Figure 2.1: OpenMP logo.
OpenMP (Open Multiprocessing) is an API that supports multi-platform shared-memory
multiprocessing programming in C, C++, and Fortran [27], on most
processor architectures and operating systems, including Solaris, AIX, HP-UX,
GNU/Linux, Mac OS X, and Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence runtime behavior. OpenMP is managed by the nonprofit technology consortium
OpenMP Architecture Review Board (or OpenMP ARB), formed by a group
of computer hardware and software vendors: AMD, IBM, Intel, Cray, HP, Fujitsu,
NVIDIA, NEC, Microsoft, Texas Instruments, Oracle Corporation, and more [28].
OpenMP offers a portable, scalable programming model that provides
users with a minimal and flexible interface for developing parallel applications for
a wide spectrum of machines, from desktop computers to clusters.
OpenMP has achieved considerable success thanks to some interesting characteristics:
• great emphasis on structured parallel programming,
• a modest learning curve: the compiler takes care of the most difficult aspects of
shared-memory parallel programming (thread synchronization, etc.),
• portability: OpenMP implementations are natively available for the most commonly
used programming languages,
• incremental parallelization of serial code, obtained by adding only a few simple
compiler directives.
The most important aspect of OpenMP is its constant development and support
by the main actors of the software and hardware industry, which makes it a
‘de facto’ standard for shared-memory parallel programming.
OpenMP directives allow programmers to tell the compiler which
instructions to execute in parallel and how to divide the workload among threads.
These compiler directives are extremely flexible and do not require rewriting the
code when the platform or the compiler changes. Moreover, if a compiler or a platform
does not support OpenMP, the directives are ignored and the serial code remains unchanged.
2.3.2.1
OpenMP Execution Model
OpenMP supports the Fork-Join programming model [29]. Under this approach,
the program starts as a single thread of execution, just like a sequential program.
The thread that executes this code is referred to as the initial thread. Whenever
an OpenMP parallel construct is encountered by a thread while it is executing the
program, it creates a team of threads (fork), becomes the master of the team, and
collaborates with the other members of the team to execute the code dynamically
enclosed by the construct. At the end of the construct, only the original thread,
or master of the team, continues; all others terminate (join). Each portion of code
enclosed by a parallel construct is called a parallel region. A thread is a run-time
entity capable of independently executing a stream of instructions [30].
In more detail, once the operating system executes an OpenMP application:
• a process is created for the program,
• the necessary system resources are instantiated: memory pages and registers.
If more threads are spawned, they share the resources created by the operating system, including the same memory address space. Each thread needs only a few
resources of its own:
• a program counter,
• private memory locations for its specific data (registers and execution stack).
Multiple threads can be executed by the same processor or core through context
switching. On a multi-core processor, multiple threads can run in parallel,
properly orchestrated. OpenMP hides from the programmer a considerable number
of the interactions among threads, providing a friendlier
development environment.
Figure 2.2: Fork-Join Execution Model.
The Fork/Join Execution Model implemented by OpenMP, depicted in Figure 2.2, can be described as
follows:
• the application starts in serial mode, with only one thread, the initial
thread,
• when a code section marked by OpenMP directives is reached, a
team of threads is spawned (Fork),
• the initial thread becomes the master thread and coordinates and cooperates
with the other threads in the parallel section,
• at the end of the region marked by the OpenMP directives, the master thread continues the execution and the others terminate (Join).
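The Fork-Join steps listed above can be illustrated with a rough analogy. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor` rather than OpenMP, so it is an assumption-laden illustration and not OpenMP code: the function names `square`, `parallel_region` and `run` are hypothetical, and, unlike an OpenMP master thread, the initial thread here only waits for the team instead of sharing the work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Unit of work executed by one member of the thread team."""
    return x * x


def parallel_region(work_items, num_threads=4):
    # Fork: entering the `with` block creates a team of worker threads.
    with ThreadPoolExecutor(max_workers=num_threads) as team:
        # The work items are shared among the threads; result order is preserved.
        results = list(team.map(square, work_items))
    # Join: leaving the `with` block waits for all workers and releases them.
    return results


def run():
    data = list(range(8))            # serial section, initial thread only
    squares = parallel_region(data)  # parallel section
    return sum(squares)              # serial section again


if __name__ == "__main__":
    print(run())
```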
In parallel regions, the developer can orchestrate the interaction among threads at a higher level of abstraction,
leaving the lower-level implementation
details to the compiler.
The OpenMP model implements, inside a single multi-core/multi-threaded processing unit, the same characteristics described for MPI. One of the most used
and effective techniques for developing parallel applications on massively parallel architectures is to mix, or hybridize, MPI code, which handles the inter-node
communication and workload, with OpenMP code, which implements the intra-node or intra-core communication
and work-sharing. This approach improves the
scalability of the application. The current release of OpenMP is 3.0.
2.3.3
GPGPU Computing (CUDA, OpenCL)
Since 2004, when Intel decided to cancel the development of its last single-core
processor in order to concentrate its efforts on multi-core architectures, we have
observed a radical change in the design of new generations of CPUs. The focus
shifted from increasing the processor’s frequency to implementing parallel architectures. Following this trend, the market became populated by
relatively affordable devices with great, natively parallel, computational resources.
General-purpose computing on graphics processing units (GPGPU) is the use of a graphics processing unit (GPU),
which typically handles computation only for computer graphics, to perform computation commonly handled by the CPU. Moreover, the use of multiple GPUs in
the same computer or computation node adds a further level of parallelism to the already parallel
nature of these devices.
OpenCL is currently the most widely used open GPU computing framework.
Its proprietary counterpart is NVIDIA’s CUDA. More generally, this
approach is called many-core computing. A GPU differs from an ordinary
CPU in that it is composed of hundreds of simple computational
units, whereas an ordinary CPU has fewer but more sophisticated cores. This
different perspective in the architecture’s design makes the GPU particularly well suited
to dense computations that involve thousands of small entities
and a fine granularity of computation.
2.3.3.1
NVIDIA CUDA
Figure 2.3: CUDA logo.
In 2007, NVIDIA introduced CUDA (Compute Unified Device Architecture) [31],
a parallel, general-purpose programming model designed to exploit the great computational resources available on a GPU. CUDA is implemented as an extension
of the C/C++ or Fortran programming languages. NVIDIA GPUs follow the
SIMT (Single Instruction Multiple Threads) execution model, in which the same
instruction is executed concurrently on multiple data by different threads.
The CUDA programming model offers a higher level of abstraction than a straight
transposition of the hardware architecture of the GPU. At the top level the model
works on a kernel, which is the portion of code executed asynchronously on the
GPU, while the rest of the user code is executed directly on the CPU. The kernel is
executed in parallel by a user-defined number of threads. The threads are grouped
into blocks, which are subsets of threads of equal cardinality. The blocks are in turn
logically arranged into a grid, which is a 1-, 2- or 3-dimensional array of blocks
(see Figure 2.4a).
Figure 2.4: CUDA execution and memory models. (a) CUDA Execution Model. (b) CUDA Memory Model.
Every block is logically executed on one SM (Streaming Multiprocessor) of the GPU, which consists of a set of simple arithmetic cores
called CUDA Cores (the Fermi architecture has 32 CUDA Cores per SM; Kepler has 192
and calls the multiprocessor SMX; Maxwell has 128 and calls it SMM).
Every block can hold up to 1024 threads [31], even though it actually executes them simultaneously in sets of only 32 threads at a time, called Warps.
Figure 2.4a shows this model in detail.
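The relation between grid, blocks, threads and warps can be made concrete with a few lines of arithmetic. The following is a Python sketch, not CUDA code: the helper names `launch_config` and `global_index` are hypothetical, the default of 256 threads per block is an arbitrary choice, and the index formula is the usual one-dimensional mapping (block index times block size plus thread index within the block).

```python
WARP_SIZE = 32                # threads executed together, as stated in the text
MAX_THREADS_PER_BLOCK = 1024  # per-block limit, as stated in the text


def launch_config(n_elements, threads_per_block=256):
    """Return (number of blocks, warps per block) needed to cover n_elements."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    # Round up so that every element is covered by at least one thread.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks, warps_per_block


def global_index(block_idx, block_dim, thread_idx):
    """One-dimensional global index of a thread inside the grid."""
    return block_idx * block_dim + thread_idx


if __name__ == "__main__":
    print(launch_config(1000))
    print(global_index(block_idx=3, block_dim=256, thread_idx=10))
```

Since the number of blocks is rounded up, the last block may contain threads whose global index exceeds the number of elements; a kernel normally guards against this with a bounds check.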
There is a memory hierarchy on these devices, aimed at minimizing both the accesses to the off-chip device memory (named Global Memory in this context) and the
transfers between host and device memory via the PCI-Express bus. In fact, besides the core registers, the GPU
has an on-board, low-latency memory module accessible to all threads of each
single block. Table 2.1 summarizes the features of CUDA’s memory architecture. Access to Global Memory is a significant performance bottleneck
in the design of a GPU-accelerated application. Global Memory
is the only writable memory type visible to the entire grid, and it is sometimes needed for
synchronization and data sharing among the blocks, even though its access
latency is very high (400-800 clock cycles). There are many techniques to
hide this latency, but they are mainly application-specific, and thus cannot be
applied when the characteristics of the considered algorithm do not allow
them. It is therefore crucial to properly manage the access to global memory. The
effectiveness of this management is measured by the achieved memory bandwidth,
which quantifies the amount of relevant data that the application can get from
the global memory per second.
Table 2.1: CUDA Memory Hierarchy.
Memory Type   Scope       Lifetime      Access      Location   Cached
Global        Grid+host   Application   R-W         Off-chip   No
Shared        Block       Kernel        R-W         On-chip    N/A
Local         Thread      Thread        R-W         Off-chip   No
Register      Thread      Thread        R-W         On-chip    N/A
Constant      Grid+host   Application   Read Only   Off-chip   Yes
Texture       Grid+host   Application   Read Only   Off-chip   Yes
2.3.3.2
OpenCL
Figure 2.5: OpenCL logo.
The Open Computing Language (OpenCL) [32] is a heterogeneous programming
framework created and sponsored by the nonprofit technology consortium Khronos
Group and adopted by some of the most important technology actors, like Intel,
AMD, NVIDIA, ARM and Apple. OpenCL is a framework for developing applications
that execute across a broad spectrum of device types made by different vendors. It
supports a wide range of levels of parallelism and efficiently maps to homogeneous
or heterogeneous, single- or multiple-device systems consisting of CPUs, GPUs, and
other types of devices (e.g. FPGA, APU). The OpenCL specification defines both
a device-side language (based on C99) and a host management layer, composed of
a purpose-built API, for the devices in a system.
The programming language used to write computation kernels is based on C99
with some limitations and additions. It omits function pointers, recursion, bit fields, variable-length arrays, and the standard C99 header files. The language
is extended to support parallelism with vector types and operations, synchronization
primitives, and functions to manipulate work-items and work-groups. It also implements memory region qualifiers: global, local, constant, and private. To enhance productivity,
many built-in functions have been added (e.g. trigonometric functions). One of the
most interesting and useful features of OpenCL is Device Fission, which allows instructions to be queued to a specific section or set of cores of the processor.
This feature allows the device to be split into multiple areas assigned to different
tasks, maximizing its utilization or customizing the application for specific types
of processors. The current release of OpenCL is 2.0.
Chapter 3
Combinatorial Optimization
3.1
Definition
In mathematics and computer science, an optimization problem is the problem
of finding the best solution among all feasible solutions. Optimization problems can be
divided into two categories, depending on whether the variables are continuous or
discrete. An optimization problem with discrete variables is known as a combinatorial optimization problem [33]. In a combinatorial optimization problem, we are
looking for an object such as an integer, a permutation or a graph from a finite (or
possibly countably infinite) set. As anticipated in Chapter 1, this chapter
explains the Combinatorial Optimization field in more detail. In general,
we can describe an optimization problem as:
\[
\begin{aligned}
z = \min\ & f(x) \\
\text{s.t. } & g_i(x) \le 0, \quad i = 1, \dots, m \\
& h_i(x) = 0, \quad i = 1, \dots, p
\end{aligned}
\tag{3.1}
\]
where:
• $f : \mathbb{R}^N \to \mathbb{R}$ is the objective function to be minimized (maximized) over
the variable $x$,
• $g_i(x) \le 0$ are the inequality constraints,
• $h_i(x) = 0$ are the equality constraints.
In the next sections we will describe some common problems and some of the most
used and known resolution methods.
3.2
Linear Problems (LP)
A Linear Problem (LP), can be expressed in this way (canonical form):
\[
\begin{aligned}
z = \min\ & c x \\
\text{s.t. } & A x \le b \\
& x \ge 0
\end{aligned}
\tag{3.2}
\]
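where:
• $A$ is the matrix of the coefficients of the constraints that describe the convex feasible set,
• $c$ and $b$ are the vector of costs and the vector of known terms, respectively,
• $x$ is the vector of decision variables (the solution of the linear problem).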
The problem is characterized by being constrained only by linear inequalities.
Another way to describe the problem is:
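\[
\begin{aligned}
z = \min\ & \sum_{j=1}^{n} c_j x_j \\
\text{s.t. } & \sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, \dots, m \\
& x_j \ge 0, \quad j = 1, \dots, n
\end{aligned}
\tag{3.3}
\]
where:
• $a_{ij}$ are the elements of $A$,
• $b_i$ and $c_j$ are the elements of $b$ and $c$ respectively,
• $x_j$ are the elements of the linear solution $x$.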
3.3
Integer Linear Problems (ILP)
\[
\begin{aligned}
z = \min\ & c x \\
\text{s.t. } & A x \le b \\
& x \ge 0 \\
& x \ \text{integer}
\end{aligned}
\tag{3.4}
\]
Problems of this kind are characterized by an additional constraint imposing that the
solution $x$ be composed only of integer numbers. Two related classes of problems
are:
• MILP (Mixed Integer Linear Problems): where only some of the $x_j$ variables are required to be integer,
• Zero-One Problems: where $x_j \in \{0, 1\}$.
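For very small Zero-One Problems the formulation can be checked directly by complete enumeration. The sketch below is only illustrative: the function name `solve_zero_one` and the instance data are hypothetical, and the running time grows as $2^n$, which is exactly why enumeration is not a practical solution method.

```python
from itertools import product


def solve_zero_one(c, A, b):
    """Minimize c.x subject to A.x <= b with x in {0,1}^n, by complete enumeration."""
    best_x, best_z = None, None
    for x in product((0, 1), repeat=len(c)):
        # Check every constraint row: sum_j a_ij * x_j <= b_i.
        feasible = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b)
        )
        if not feasible:
            continue
        z = sum(c_j * x_j for c_j, x_j in zip(c, x))
        if best_z is None or z < best_z:
            best_x, best_z = x, z
    return best_x, best_z


if __name__ == "__main__":
    # Hypothetical instance with three binary variables and three constraints.
    c = [-5, -4, -3]
    A = [[2, 3, 1], [4, 1, 2], [3, 4, 2]]
    b = [5, 11, 8]
    print(solve_zero_one(c, A, b))
```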
3.4
Nonlinear Problems (NLP)
They are distinguished by the presence of non-linear constraints inside the system.
A simple example can be:
\[
\begin{aligned}
z = \min\ & x_1 + x_2 \\
\text{s.t. } & x_1^2 + x_2^2 \ge 1 \\
& x_1^2 + x_2^2 \le 2 \\
& x_1 \ge 0 \\
& x_2 \ge 0
\end{aligned}
\tag{3.5}
\]
Generally, a non-linear problem can be described in this way:
\[
\begin{aligned}
z = \min\ & f(x) \\
\text{s.t. } & g_i(x) \le 0, \quad i \in I = \{1, \dots, m\} \\
& h_j(x) = 0, \quad j \in J = \{1, \dots, p\}
\end{aligned}
\tag{3.6}
\]
where:
• $f : \mathbb{R}^N \to \mathbb{R}$ is the objective function to be minimized (maximized) over
the variable $x$,
• $x \in \mathbb{R}^N$ is the vector of decision variables; the model is non-linear when $f$ or at least one of the constraint functions is non-linear,
• $g_i(x) \le 0$ are the inequality constraints,
• $h_j(x) = 0$ are the equality constraints.
3.5
Constraints Satisfaction Problems (CSP)
Constraint Satisfaction arose mainly from Artificial Intelligence. A Constraint
Satisfaction Problem (CSP) is a problem defined as a set of objects whose state must
satisfy a number of constraints or relations among the problem’s decision variables
[34]. Constraint programming is called “programming” in a double sense: not
only “mathematical programming”, meaning the declaration of constraints and
decision variables, but also “computer programming”, meaning
the programming of a search strategy. The methods used to solve this kind of problem
are various: Branch and Bound algorithms, Backtracking, Local Search, typically
all embedded in a properly designed solver.
Formally, a constraint satisfaction problem is defined as a triple $\langle X, D, C \rangle$, where
$X$ is a set of variables, $D$ is a domain of values, and $C$ is a set of constraints.
Every constraint is in turn a pair $\langle t, R \rangle$ (usually represented as a matrix), where
$t$ is an $n$-tuple of variables and $R$ is an $n$-ary relation on $D$. An evaluation of the
variables is a function from the set of variables to the domain of values, $v : X \to D$.
An evaluation $v$ satisfies a constraint $\langle (x_1, \dots, x_n), R \rangle$ if $(v(x_1), \dots, v(x_n)) \in R$.
A solution is an evaluation that satisfies all constraints [34].
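The triple of variables, domain and constraints maps naturally onto a small backtracking search. The sketch below is a minimal illustration and not a full constraint solver: the function names `consistent` and `solve_csp` are hypothetical, the only constraint type handled is the binary relation "the two variables take different values", and the instance is the well-known map coloring of the Australian states.

```python
def consistent(var, value, assignment, neighbors):
    """True if giving `value` to `var` violates no 'different values' constraint."""
    return all(assignment.get(other) != value for other in neighbors[var])


def solve_csp(variables, domain, neighbors, assignment=None):
    """Backtracking search; returns a complete evaluation v: X -> D, or None."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domain:
        if consistent(var, value, assignment, neighbors):
            assignment[var] = value
            result = solve_csp(variables, domain, neighbors, assignment)
            if result is not None:
                return result
            del assignment[var]  # undo the choice and try the next value
    return None


# X: the regions; C: adjacent regions must receive different colors.
NEIGHBORS = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
    "T": [],
}

if __name__ == "__main__":
    print(solve_csp(list(NEIGHBORS), ["red", "green", "blue"], NEIGHBORS))
```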
Constraint Programming (CP) has been used to solve combinatorial problems like:
• Eight queens puzzle,
• Map coloring problem,
• Sudoku and many other logic puzzles,
• DNA sequencing,
• Scheduling.
CP often deals with problems that are really hard to solve using canonical OR or CO
techniques.
3.6
Solution and Approximation Methodologies
3.6.1
Simplex Algorithm
Following some notable theoretical results of linear algebra, it can be shown that
for a linear program, if the objective function has a minimum value on the feasible
region, then it attains this value at (at least) one of the extreme points of the convex
set described by the inequalities constraining the problem [35]. It has also been
proven that the number of extreme points of the convex set is finite, but this
number is extremely large even for small linear programs, which makes
the enumeration of these points impractical. It can also be shown that, if the current extreme point is not a minimum, there exists
an edge leaving it along which the objective function
decreases. If the edge is finite, it leads to
another extreme point where the objective function has a smaller value; otherwise
the objective function is unbounded and the linear program has no finite optimal solution.
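The extreme-point property can be verified on a tiny instance by enumerating the vertices explicitly. The sketch below is not the simplex algorithm: it is a brute-force check, restricted to two variables and to a bounded, non-empty feasible region (both assumptions), with the hypothetical names `extreme_points` and `best_extreme_point` and a textbook-style instance chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def extreme_points(A, b):
    """Vertices of {x in R^2 : A x <= b, x >= 0}, found by intersecting constraint pairs."""
    rows = [(Fraction(a1), Fraction(a2), Fraction(bi)) for (a1, a2), bi in zip(A, b)]
    # Non-negativity written as -x1 <= 0 and -x2 <= 0.
    rows += [(Fraction(-1), Fraction(0), Fraction(0)),
             (Fraction(0), Fraction(-1), Fraction(0))]
    points = set()
    for (a1, a2, r1), (b1, b2, r2) in combinations(rows, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel constraints: no single intersection point
        # Cramer's rule for the 2x2 system formed by the two active constraints.
        x1 = (r1 * b2 - a2 * r2) / det
        x2 = (a1 * r2 - b1 * r1) / det
        if all(p * x1 + q * x2 <= r for p, q, r in rows):
            points.add((x1, x2))
    return points


def best_extreme_point(c, A, b):
    """Return the vertex minimizing c.x and its objective value."""
    best = min(extreme_points(A, b), key=lambda p: c[0] * p[0] + c[1] * p[1])
    return best, c[0] * best[0] + c[1] * best[1]


if __name__ == "__main__":
    # min -3 x1 - 5 x2  s.t.  x1 <= 4,  2 x2 <= 12,  3 x1 + 2 x2 <= 18,  x >= 0
    print(best_extreme_point([-3, -5], [(1, 0), (0, 2), (3, 2)], [4, 12, 18]))
```

With m constraints and n variables the number of candidate intersections grows combinatorially, which is why the simplex algorithm moves from vertex to vertex instead of enumerating them all.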
The simplex algorithm proposed by Dantzig [36] exploits these results by walking along edges of the polytope to extreme points with lower and lower objective
values, until the minimum value is reached or an unbounded edge is visited, in which case the problem has no finite optimal solution. The great contribution of this algorithm is that it always terminates (provided that cycling is prevented by a suitable pivoting rule), because the number of vertices of the polytope
is finite; moreover, since we always move between vertices in the direction
that improves the objective function, we hope that the number of vertices visited will
be small. The solution of a linear program is accomplished in two steps. In the
first step, we need to find an extreme point to start from. The possible outcomes of
the first step are either that a basic feasible solution is found or that the feasible region
is empty. In the latter case the linear program is called infeasible. In the second
step, the simplex algorithm is applied using the basic feasible solution found as
a starting point. The possible outcomes of the second step are either an optimal
basic feasible solution of the linear program or an infinite edge along which the objective
function is unbounded.
Due to the wide spectrum of applications of this algorithm and its variations, several commercial and free software packages implementing this method have been proposed. One of the
most used and effective is the IBM ILOG CPLEX solver [37]. It also implements methods for solving MIP problems, Quadratic Problems and Quadratically
Constrained Problems. Its free and open-source counterpart is COIN-OR (Computational Infrastructure for Operations Research) [38], a project developed
and maintained by the operations research community and aimed at providing a free
environment for developing and testing OR algorithms.
3.6.2
Lagrangean Relaxation
Lagrangean relaxation, proposed by Polyak [39] and used for the first time by
Held et al. [40] and Held and Karp [41, 42], is a relaxation method which
approximates a difficult optimization problem by a ‘simpler’ one. Solving the
relaxed problem can give an approximate solution to the original one.
The method penalizes violations of inequality constraints using Lagrangean multipliers, which impose a cost on constraint violations in the objective function.
In practice, this relaxed problem can often be solved more easily than the original one, and good multipliers can be found with iterative algorithms like the Subgradient method [43], providing useful
information for the solution of the original problem. For a minimization problem, the problem of maximizing the Lagrangean function over the multipliers (the dual variables) is the Lagrangean dual problem.
Under certain conditions regarding the convexity of the function
and the constraints, the optimal values of the primal and of the Lagrangean dual
problems coincide, i.e. there is no Duality Gap.
Taking as example a linear problem:
\[
\begin{aligned}
z = \min\ & c x \\
\text{s.t. } & A_1 x \le b_1 \\
& A_2 x \le b_2 \\
& x \ge 0
\end{aligned}
\tag{3.7}
\]
where the constraints in A2 are considered ‘difficult’. We can relax the problem
adding a penalty λ and bringing the constraints in the objective function:
\[
\begin{aligned}
z = \min\ & c x + \lambda (b_2 - A_2 x) \\
\text{s.t. } & A_1 x \le b_1 \\
& x \ge 0
\end{aligned}
\tag{3.8}
\]
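This relaxed problem is different from the initial one, but its optimal value can be used as a
bound on the optimal value of the original problem inside other solution techniques. With the sign convention of (3.8), it is a valid lower bound for every $\lambda \le 0$, since the term $\lambda (b_2 - A_2 x)$ is then non-positive for every $x$ that satisfies $A_2 x \le b_2$.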
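The bounding property can be checked numerically on a toy instance. The sketch below makes several assumptions that are not in the text: it uses a zero-one problem with a single relaxed constraint $a x \le b$ instead of a linear program, because the relaxed problem then separates by variable and can be solved by inspection; it keeps the sign convention of (3.8), so the multiplier is taken non-positive; and the names `lagrangean_bound` and `brute_force_optimum`, as well as the data, are hypothetical.

```python
from itertools import product


def lagrangean_bound(c, a, b, lam):
    """L(lam) = min over x in {0,1}^n of c.x + lam * (b - a.x), for lam <= 0."""
    value = lam * b
    for c_j, a_j in zip(c, a):
        reduced = c_j - lam * a_j  # coefficient of x_j in the relaxed objective
        if reduced < 0:
            value += reduced       # setting x_j = 1 lowers the relaxed objective
    return value


def brute_force_optimum(c, a, b):
    """Optimal value of min c.x s.t. a.x <= b, x in {0,1}^n, by enumeration."""
    return min(
        sum(c_j * x_j for c_j, x_j in zip(c, x))
        for x in product((0, 1), repeat=len(c))
        if sum(a_j * x_j for a_j, x_j in zip(a, x)) <= b
    )


if __name__ == "__main__":
    c, a, b = [-10, -7, -4], [5, 4, 3], 8
    optimum = brute_force_optimum(c, a, b)
    for lam in (-3.0, -2.0, -1.75, -1.0, 0.0):
        print(lam, lagrangean_bound(c, a, b, lam), "<=", optimum)
```

Scanning a few multipliers, as done here, is only a stand-in for the subgradient method, which would update the multiplier iteratively in search of the tightest bound.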
3.6.3
Column Generation
One of the most difficult aspects of many linear programs is that the number of decision variables is too large to be considered explicitly. Since most of the
variables will be non-basic and take a value of zero in the optimal solution,
only a subset of the variables needs to be considered when solving the problem. Column
generation [44, 45] exploits this idea: it generates only the variables which have
the potential to improve the objective function, i.e. variables with negative reduced cost (assuming without loss of generality that the problem is a minimization
problem).
The original problem is split into two problems: the master problem and the
subproblem. The master problem (core problem) contains only a subset of the variables
of the original problem. The subproblem (pricing problem) is a new problem
created to identify a new variable. The objective function of
the subproblem is the reduced cost of the new variable with respect to the current
dual variables. The process iterates as follows:
• The master problem is solved considering only the selected variables, which yields the dual prices of the constraints of the
master problem. This information is then used in the objective function
of the subproblem.
• The subproblem is solved. If its objective value is
negative, a variable with negative reduced cost has been identified. This
variable is added to the master problem, and the master problem is
solved again. Solving the master problem with the new variable generates
a new set of dual values, and the process is repeated until no variable with negative reduced
cost is identified. If, instead, the subproblem returns a
solution with non-negative reduced cost, we can conclude that the current solution
of the master problem is optimal.
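The pricing step can be sketched for the cutting stock problem, a classic application of column generation. The fragment below covers only the subproblem and rests on several assumptions: each column is a cutting pattern of cost 1, so its reduced cost is one minus the dual value of the pieces it contains; the dual prices are hypothetical numbers, since no master problem is actually solved here; widths are positive integers; and the function name `price_pattern` is hypothetical. The subproblem is then an unbounded knapsack, solved by dynamic programming.

```python
from fractions import Fraction


def price_pattern(widths, duals, roll_width):
    """Return (pattern, reduced cost) of the most attractive cutting pattern."""
    best = [0] * (roll_width + 1)       # best[cap]: max dual value within capacity cap
    choice = [None] * (roll_width + 1)  # item added last at capacity cap, if any
    for cap in range(1, roll_width + 1):
        best[cap] = best[cap - 1]
        for i, (width, dual) in enumerate(zip(widths, duals)):
            if width <= cap and best[cap - width] + dual > best[cap]:
                best[cap] = best[cap - width] + dual
                choice[cap] = i
    # Walk back through the table to recover how many pieces of each width are cut.
    pattern = [0] * len(widths)
    cap = roll_width
    while cap > 0:
        if choice[cap] is None:
            cap -= 1
        else:
            pattern[choice[cap]] += 1
            cap -= widths[choice[cap]]
    return pattern, 1 - best[roll_width]


if __name__ == "__main__":
    # Hypothetical dual prices, as if returned by a restricted master problem.
    duals = [Fraction(1, 5), Fraction(1, 3), Fraction(1, 2)]
    pattern, reduced_cost = price_pattern([3, 5, 7], duals, roll_width=16)
    # A negative reduced cost means the pattern should enter the master problem.
    print(pattern, reduced_cost)
```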
In many cases, this approach makes tractable large linear or integer programs that
had previously been considered intractable. An example of a problem where this
technique is effective is the cutting stock problem, where the number of all possible
feasible cuts is too large to enumerate. Column generation has also been applied to
many other problems, such as crew scheduling and vehicle routing.
3.6.4
Dynamic Programming
Dynamic programming (DP) is a technique for solving complex problems
by dividing them into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure (e.g. the
Shortest Path Problem). When applicable, the method takes far less time than
naive methods. The basic idea behind dynamic programming is simple: in general,
to solve a given problem, we need to solve different parts of the problem (subproblems) and then combine their solutions to compute
the global one. Often, many of these subproblems are identical. The dynamic
programming approach seeks to solve each subproblem only once, thus reducing
the computations required. Once a subproblem has been solved, its solution is
stored.
Algorithms of this kind work iteratively: whenever a previously computed solution is needed during a computation step, it is simply retrieved from the stored ones. This
approach is especially useful when the number of repeating subproblems grows
exponentially as a function of the size of the input. Dynamic programming
rests on R. Bellman’s principle
of optimality [46]: “An optimal policy has the property that whatever the initial
state and initial decision are, the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first decision”. More generally,
all DP recursions (or recurrences) are based on the Bellman equation, which describes
the transition from one state to another:
$$V(x) = \max_{a \in \Gamma(x)} \{ F(x, a) + \beta V(T(x, a)) \} \tag{3.9}$$
This approach is also used for solving optimization problems or for computing
valid bounds (e.g. for the VRP or the Cutting Stock Problem). The main disadvantage of
this class of algorithms is the extensive use of memory: the state space created
by the recursion is often too big to manage. For some classes of problems, like
the 0-1 Knapsack Problem, dynamic programming is a very fast and simple solution
method, but only for instances of limited size.
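The 0-1 Knapsack case mentioned above can be sketched in a few lines. The code below is a textbook formulation, not an implementation taken from this work: the state is the residual capacity, the decision is whether to take the current item, and the table has one entry per capacity value, which also shows why memory grows with the size of the instance. The function name and the sample data are illustrative.

```python
def knapsack_01(values, weights, capacity):
    """Optimal value of a 0-1 knapsack with integer weights (textbook DP)."""
    # best[c]: best value achievable with capacity c using the items seen so far
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Scan capacities downwards so that each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


optimum = knapsack_01([60, 100, 120], [10, 20, 30], 50)
```

Each entry of `best` is a stored subproblem solution that later steps simply look up, in the spirit of the Bellman recursion (3.9) with β = 1.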
3.6.5 Branch and Bound, Branch and Cut Techniques
A branch-and-bound (BB) algorithm [47] consists of a systematic enumeration of
solutions, where large subsets of fruitless candidates are discarded, typically using
bounds on the optimal solution value. The search is performed as a tree search or with similar
approaches. The key idea of the BB algorithm, stated here for a minimization problem, is: if the lower bound for some node
is greater than the upper bound for some other node, then the former node may be safely discarded
from the search (pruning). In particular, any node whose lower bound is greater than the best
upper bound achieved during the search can be discarded. The branching phase
is the generation of other (improved) solutions from a node; new bounds are then
computed on these new solutions (bounding). The search stops when the solution
set is reduced to a single element, or when the upper bound matches the lower
bound. Either way, the value found is the minimum (or the maximum) of
the function.
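A minimal sketch of the pruning rule, written for the 0-1 Knapsack Problem, may help. Since this is a maximization problem, the roles of the bounds are swapped with respect to the description above: a node is discarded when its upper bound does not exceed the value of the best solution found so far. The bound used here (the fractional relaxation) and all names are choices made for this example, and weights are assumed to be positive.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Depth-first branch and bound for the 0-1 knapsack (illustrative sketch)."""
    # Sort items by decreasing value density so the fractional bound is easy to compute.
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)
    best = 0  # value of the incumbent (best feasible solution found so far)

    def upper_bound(k, value, room):
        # Fill the residual capacity greedily, allowing one fractional item.
        for i in range(k, n):
            if w[i] <= room:
                room -= w[i]
                value += v[i]
            else:
                return value + v[i] * room / w[i]
        return value

    def search(k, value, room):
        nonlocal best
        best = max(best, value)          # every node is a feasible partial selection
        if k == n or upper_bound(k, value, room) <= best:
            return                       # pruning: this subtree cannot beat the incumbent
        if w[k] <= room:
            search(k + 1, value + v[k], room - w[k])   # branch: take item k
        search(k + 1, value, room)                     # branch: skip item k

    search(0, 0, capacity)
    return best


optimum = knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50)
```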
Chapter 4
Rectangular Knapsack, 2D Unconstrained Guillotine Cuts
In this chapter we investigate the application of GPU computing to the two-dimensional guillotine cutting problem, using a dynamic programming approach.
We show a possible implementation and discuss a number of technical issues.
Computational results on test instances available in the literature and on new
larger instances show the effectiveness of the dynamic programming approach
based on GPU computing for this problem.
4.1 Introduction
The Two-Dimensional Guillotine Cutting Problem (2D-GCP) consists in cutting
a rectangular surface, called stock rectangle or master surface, into a number of
smaller rectangular pieces, each with a given size and value, using guillotine cuts.
A guillotine cut on a rectangle consists in a cut from one edge of the rectangle to
the opposite edge which is parallel to the two remaining edges. A feasible cutting
pattern must be obtained by applying a sequence of guillotine cuts to the original
master surface or to the rectangles obtained in the previous cuts.
The number of pieces of each size and value to be cut can be unconstrained or
constrained within a given minimum and maximum value. The objective is to
maximize the total value of the pieces cut. The related problem of minimizing
the amount of waste produced can be trivially converted into this maximization
problem by taking the value of a piece to be proportional to its area.
A special case of the 2D-GCP is the staged guillotine cutting, where at each stage
we associate a cut direction, i.e., the cuts alternate at each stage between being
parallel to the x-axis and being parallel to the y-axis. In many practical applications the number of stages is restricted and we say that the guillotine cutting is
k-staged. Often, the number of stages is only two or three (i.e., two- or three-staged
guillotine cutting).
The 2D-GCP can be found in several industrial settings. For example, glass plates
are cut into smaller pieces to produce windows or wood sheets are cut to produce
furniture, and so on.
According to the classification of [48] the 2D-GCP corresponds to the Two-Dimensional
Rectangular Single Large Object Packing Problem.
The 2D-GCP is a well-known problem. It was first considered by
Gilmore and Gomory ([49, 50]), who discussed possible mathematical formulations
and proposed dynamic programming algorithms to solve the two- and multi-stage versions of the problem.
In [51], Herz proposed a recursive tree-search procedure where the search space
is reduced by means of a preliminary discretization using the so-called canonical
dissections. Moreover, Herz pointed out an error in an algorithm proposed by [50].
Christofides and Whitlock ([52]) considered the constrained version of the 2D-GCP
and described a tree-search algorithm, where a valid upper bound is computed by
a dynamic programming procedure applied to the unconstrained version of the
2D-GCP. Christofides and Whitlock also used the canonical dissections, but they
used the term normal patterns for them.
Beasley [53] proposed heuristic and exact algorithms based on dynamic programming to solve the unconstrained version of the 2D-GCP. He considered both the
staged version and the general non-staged version of the problem and used the
normal patterns to reduce the search space.
Focusing our attention on the approaches for the 2D-GCP based on dynamic
programming, recently, Cintra et al. [54] proposed an improved implementation
of the algorithm proposed by Beasley [53] and Russo et al. [55] proposed an
improved version of the algorithm of Gilmore and Gomory ([50]).
GPU Computing used for optimization has a recent and yet sparse literature.
Due to its structure, it seems to be fit to be combined with dynamic programming
procedures. In fact Kats et al. [56] and Lund et al. [57] proved the effectiveness of
this new paradigm of massively parallel processors on the very well known All-Pairs
Shortest Path Problem (APSP) solved with the Floyd-Warshall algorithm. Boyer
et al. [58] proved that GPU computing is effective also in the resolution of the
Knapsack Problem (KP) using the well-known dynamic programming algorithm
proposed by Bellman [46].
This chapter presents one of the first contributions applying GPU computing to
state-of-the-art research algorithms. We consider the unconstrained and non-staged version of the 2D-GCP and we focus on the implementation of a dynamic
programming algorithm using a GPU computing approach, to gauge the effectiveness of this new paradigm on optimization problems. We have chosen the dynamic
programming algorithm proposed by Cintra et al. [54] because it is quite clean and well
suited to our purpose. The other best performing approach, namely the algorithm
proposed by Russo et al. [55], is surely interesting and effective, but it is quite
complex and might divert the reader's attention to aspects that are beyond
the scope of this chapter, which is the implementation of a dynamic programming
algorithm using GPU computing.
We have organized this chapter as follows. In Section 4.2 we describe the problem
and we introduce the dynamic programming algorithm used. In Section 4.3 we
present the GPU porting of the dynamic programming algorithm for the 2D-GCP.
Computational results are reported and discussed in Section 4.4. Finally, in Section
4.5 we draw some conclusions and we give some ideas for future developments.
4.2 The 2D-GCP: notation, definitions and algorithms
A large rectangular master surface M = (W, H) of width W and height H must
be cut into a number of smaller rectangular pieces chosen from n types of pieces
available. Let P = {1, . . . , n} be the index set of piece types. Each piece of type
i ∈ P has dimensions (wi , hi ) and value vi . The orientation of the pieces is fixed
(i.e., no rotations are allowed). The objective is to construct a guillotine cutting
pattern of M that maximizes the total value of pieces cut, using the given piece
types.
The master surface M is located in the positive quadrant of the Cartesian coordinate system with its origin (bottom left-hand corner) placed in position (0, 0)
and with its bottom and left-hand edges parallel to the x-axis and the y-axis,
respectively. The position of a piece within M is defined by the coordinates of its
bottom left-hand corner, referred to as the origin of the piece.
4.2.1 Principle of Normal Patterns
The origin of a piece of type i ∈ P can be located at every integer point (x, y) of the
master surface such that 0 ≤ x ≤ W − wi and 0 ≤ y ≤ H − hi . However, this set
of points (x, y) can be reduced by applying the discretization principle proposed
by Herz [51], who, as mentioned, speaks of canonical dissections, and used by
Christofides and Whitlock [52], who first introduced the term normal patterns. The
principle of normal patterns has since been used by Beasley [53] and many
other authors.
The principle of normal patterns is based on the observation that, in a given
feasible cutting pattern, the position where any piece is cut can be moved to the
left and/or downward as much as possible until its left-hand edge and its bottom
edge are both adjacent to other cut pieces or to the edges of the master surface.
Let X and Y denote the subsets of all x-positions and y-positions, respectively,
where a piece can be positioned applying the principle of normal patterns. These
sets can be computed as follows:
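$$X = \Big\{ x = \sum_{k \in P} w_k \xi_k \,:\, 0 \le x \le W,\ \xi_k \ge 0 \text{ integer},\ k \in P \Big\} \tag{4.1}$$
and
$$Y = \Big\{ y = \sum_{k \in P} h_k \xi_k \,:\, 0 \le y \le H,\ \xi_k \ge 0 \text{ integer},\ k \in P \Big\} \tag{4.2}$$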
The x-positions contained in X are sorted so that given xi , xj ∈ X, we have xi < xj
for each 1 ≤ i < j ≤ |X|. The y-positions contained in Y are also similarly sorted.
A simple dynamic programming recursion for computing X and Y is described
both by Christofides [52] and by Cintra [54].
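As a minimal sketch of such a recursion (not necessarily the exact procedure of [52] or [54]), the sets (4.1) and (4.2) can be obtained with a reachability table: a position is a normal pattern if it is 0 or if subtracting some piece size leads to another normal pattern. The function name is hypothetical, and piece sizes are assumed to be positive integers.

```python
def normal_patterns(sizes, limit):
    """Sorted positions in 0..limit that are non-negative integer combinations of sizes."""
    reachable = [True] + [False] * limit   # position 0 corresponds to all multipliers equal to 0
    for x in range(1, limit + 1):
        reachable[x] = any(s <= x and reachable[x - s] for s in sizes)
    return [x for x in range(limit + 1) if reachable[x]]


# X for piece widths {3, 5} on a master surface of width 10
X = normal_patterns([3, 5], 10)   # [0, 3, 5, 6, 8, 9, 10]
```

The same call with the piece heights and H gives the set Y. The list is returned in increasing order, as required by the sorting convention stated above.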
4.2.2 A Dynamic Programming algorithm for the 2D-GCP
The 2D-GCP considered in this chapter can be solved using the following dynamic
programming algorithm, originally proposed by Cintra et al. [54], and based on
the recurrence formula proposed by Beasley [53].
Let V (w, h) be the value of an optimal guillotine pattern of a rectangle of size
(w, h), evaluated by means of the following recurrence formula:
$$V(w, h) = \max \left\{ \begin{array}{l} v(w, h) \\ \max\{ V(w', h) + V(p(w - w'), h) : w' \in X \text{ and } 0 < w' \le \frac{w}{2} \} \\ \max\{ V(w, h') + V(w, q(h - h')) : h' \in Y \text{ and } 0 < h' \le \frac{h}{2} \} \end{array} \right\} \tag{4.3}$$
where p(w) = max{x ∈ X : x ≤ w}, q(h) = max{y ∈ Y : y ≤ h} and where
v(w, h) denotes the value of the most valuable piece that can be cut in a rectangle
of size (w, h) (v(w, h) = 0 if no piece can be cut in such rectangle). Since the only
x and y-positions considered are the ones contained in the subsets X and Y , for
the sake of ease, we use the notation V (xi , yj ) and V (i, j) interchangeably. The
optimal solution value is V (p(W ), q(H)) = V (|X|, |Y |).
Let cut(i, j) be the position of the optimal cut within a rectangle of size (xi , yj ),
xi ∈ X and yj ∈ Y . It is equal to 0 if guillotine cuts are not applied, it is > 0
if a horizontal cut is applied in position cut(i, j), and it is < 0 if a vertical cut is
applied in position −cut(i, j).
A pseudocode for the dynamic programming algorithm for the 2D-GCP proposed
by Cintra et al. [54] is as follows:
Algorithm DP-2D-GCP (W, H, w, h, v)
    // Compute the sets X and Y using the normal pattern principle
    // Initialization
    for i = 1 to |X| do
        for j = 1 to |Y| do
            V(i, j) = max({v_k : k ∈ P, w_k ≤ x_i and h_k ≤ y_j} ∪ {0})
            cut(i, j) = 0
    // Recurrence
    for i = 2 to |X| do
        for j = 2 to |Y| do
            for i′ = 1 to max{k : x_k ∈ X, x_k ≤ x_i / 2} do
                i′′ = max{k : x_k ∈ X, x_k ≤ x_i − x_i′}
                if V(i, j) ≤ V(i′, j) + V(i′′, j) then
                    V(i, j) = V(i′, j) + V(i′′, j)
                    cut(i, j) = −i′
            for j′ = 1 to max{k : y_k ∈ Y, y_k ≤ y_j / 2} do
                j′′ = max{k : y_k ∈ Y, y_k ≤ y_j − y_j′}
                if V(i, j) ≤ V(i, j′) + V(i, j′′) then
                    V(i, j) = V(i, j′) + V(i, j′′)
                    cut(i, j) = j′
The algorithm starts by computing the sets X and Y using the normal pattern
principle. Then, it initializes V (i, j), for every xi ∈ X and yj ∈ Y , by setting
V (i, j) = v(xi , yj ). The recurrence tries to improve each value V (i, j) applying to
the rectangle of size (xi , yj ) a horizontal or a vertical guillotine cut. If the sum
of the values associated to the resulting two rectangles improves over V (i, j), its
value is updated and the applied cut is saved in cut(i, j).
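The recurrence can be sketched as a short sequential program. The code below is an illustrative reading of recurrence (4.3), not the implementation used in the experiments. It makes a few assumptions of its own: indices are 0-based, the position 0 is skipped as a cut position (in line with the condition 0 < w′), a cut is recorded only when it strictly improves the value, and the cut is stored as a position rather than as an index. All function names are hypothetical.

```python
from bisect import bisect_right


def normal_patterns(sizes, limit):
    """Sorted positions in 0..limit that are non-negative integer combinations of sizes."""
    reachable = [True] + [False] * limit
    for x in range(1, limit + 1):
        reachable[x] = any(s <= x and reachable[x - s] for s in sizes)
    return [x for x in range(limit + 1) if reachable[x]]


def dp_2d_gcp(W, H, pieces):
    """pieces: list of (width, height, value) with positive integer sizes.

    Returns (optimal value, cut stored for the largest rectangle). The cut is 0
    if no cut is applied, negative for a vertical cut at x = -cut and positive
    for a horizontal cut at y = cut.
    """
    X = normal_patterns([w for w, _, _ in pieces], W)
    Y = normal_patterns([h for _, h, _ in pieces], H)
    # Initialization: most valuable single piece fitting in (x, y), or 0.
    V = [[max((v for w, h, v in pieces if w <= x and h <= y), default=0)
          for y in Y] for x in X]
    cut = [[0] * len(Y) for _ in X]
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            # Vertical cuts at positions X[a] with 0 < X[a] <= x / 2.
            for a in range(1, bisect_right(X, x // 2)):
                b = bisect_right(X, x - X[a]) - 1      # index of p(x - X[a])
                if V[a][j] + V[b][j] > V[i][j]:
                    V[i][j] = V[a][j] + V[b][j]
                    cut[i][j] = -X[a]
            # Horizontal cuts at positions Y[a] with 0 < Y[a] <= y / 2.
            for a in range(1, bisect_right(Y, y // 2)):
                b = bisect_right(Y, y - Y[a]) - 1      # index of q(y - Y[a])
                if V[i][a] + V[i][b] > V[i][j]:
                    V[i][j] = V[i][a] + V[i][b]
                    cut[i][j] = Y[a]
    return V[-1][-1], cut[-1][-1]


# Four 2x2 pieces (value 3 each) beat the single 4x4 piece of value 10.
value, first_cut = dp_2d_gcp(4, 4, [(2, 2, 3), (4, 4, 10)])
```

On this small instance the sketch returns the value 12 with a vertical first cut at x = 2.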
Figure 4.1: Computation of function V (x, y): (a) not using normal patterns; (b) using normal patterns.
The usage of the subsets X and Y instead of the full sets of the available positions
can heavily reduce the computational complexity. In Figure 4.1 we show the
difference between the computation of function V (x, y) not using and using the
normal pattern principle. Normal patterns are shown by "+" icons adjacent to the
corresponding rows and columns. The figure depicts how the computation of the
V (w, h) values is made based on all gray cells, each representing a partial solution.
It is evident how the number of partial solutions needed for the computation is
much lower using normal patterns than otherwise.
4.3 The GPU porting of the dynamic programming algorithm
The dynamic programming algorithm for the 2D-GCP proposed by Cintra et al.
[54] and described in Section 4.2.2 is natively suitable for an implementation using
GPU computing, due to its matrix-like structure.
In this section we show how to exploit the inherent parallelism of this algorithm,
we describe the parallelization process, the porting to a GPU environment and its
possible implementation using a CUDA model.
4.3.1 Exploiting the inherent parallelism
The objective of the algorithm is to compute the different V (i, j) values. In order to
do this, we need all the intermediate solutions evaluated in the previous iterations
of the dynamic recursion (see Figures 4.1a and 4.1b), as described in Section 4.2.2.
This process can be effectively decomposed into independent tasks, as shown in
the following.
Each index d, corresponding to a stage of the dynamic recursion and such that
1 ≤ d ≤ |X| + |Y| − 1, implicitly identifies an anti-diagonal of matrix
V. The anti-diagonal is composed of the set of cells $A_d = \{(k, d - k + 1) : 1 \le k \le |X|,\ 1 \le d - k + 1 \le |Y|,\ k = 1, \dots, d\}$. At each stage d = 1, . . . , |X| + |Y| − 1, we
can concurrently compute the values V(i, j) belonging to the corresponding anti-diagonal $A_d$ (i.e., (i, j) ∈ $A_d$) (see Figures 4.2a and 4.2b). The cells of $A_d$ are mutually independent because, by recurrence 4.3, V(i, j) depends only on values V(i′, j) with i′ < i and V(i, j′) with j′ < j, all of which lie on earlier anti-diagonals. Moreover, to evaluate each
V(i, j), we can perform a parallel max operation over the solution values of the
composing subproblems. Notice that other approaches based on decomposing the
problem by columns or rows do not allow such an effective concurrent evaluation
of the V(i, j) values.
Figure 4.2: Parallel tasks executed exploiting the decomposition based on anti-diagonals: (a) without discretization; (b) with discretization.
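The stage structure can be made concrete with a small sketch that lists the cells of each anti-diagonal, using the same 1-based indices as the definition of the set of cells given above. The function names are hypothetical; on a GPU the cells of one stage would be processed concurrently, whereas here they are only enumerated.

```python
def anti_diagonal(d, nx, ny):
    """Cells (i, j), 1-based, with i + j = d + 1 inside an nx-by-ny matrix."""
    return [(k, d - k + 1) for k in range(1, d + 1)
            if k <= nx and 1 <= d - k + 1 <= ny]


def wavefront_order(nx, ny):
    """One list of independent cells per stage d = 1, ..., nx + ny - 1."""
    return [anti_diagonal(d, nx, ny) for d in range(1, nx + ny)]


stages = wavefront_order(4, 3)
# stages[0] == [(1, 1)], stages[1] == [(1, 2), (2, 1)], and so on
```

No stage contains more than min(nx, ny) cells, which matches the maximum number of blocks needed per anti-diagonal.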
4.3.2 Recurrence Parallelization on GPU
The parallel implementation of the recursion has been designed to fit the CUDA
programming model. As described in Section 4.3.1, we can exploit two different granularities of parallelism, one for evaluating the V(i, j) values belonging to the anti-diagonal $A_d$, and one for the max operation required for finding the cut that
maximizes the V(i, j) value. The mapping onto CUDA is then straightforward:
1. parallelism among the blocks of the grid, to concurrently evaluate all cells of $A_d$.
2. parallelism among the threads of each block, to compute the reduction (max operation) for each V(i, j).
At each recursion stage d, 1 ≤ d ≤ |X| + |Y | − 1, each element (i, j) of the
anti-diagonal Ad is assigned to a different GPU block in order to concurrently
evaluate the corresponding V (i, j) value. Therefore, the maximum number of
blocks required to compute an anti-diagonal is b = min{|X|, |Y |}.
A block computing cell (i, j) runs threads to evaluate Beasley’s recursive formula
4.3:
$$\begin{aligned} V^X_{ij}(i') &= V(i', j) + V(i''(i, i'), j), & x_{i'} \in X \text{ and } 0 < x_{i'} \le \frac{x_i}{2} \\ V^Y_{ij}(j') &= V(i, j') + V(i, j''(j, j')), & y_{j'} \in Y \text{ and } 0 < y_{j'} \le \frac{y_j}{2} \end{aligned} \tag{4.4}$$
where $i''(i, i') = \max\{k : x_k \in X,\ x_k \le x_i - x_{i'}\}$ and $j''(j, j') = \max\{k : y_k \in Y,\ y_k \le y_j - y_{j'}\}$.
Ideally, the block evaluating the values $V^X_{ij}(i')$ and $V^Y_{ij}(j')$ has one thread for each
index i′ and j′. The values obtained can be stored in the same shared memory location. This step is crucial for exploiting the global memory bandwidth: for
example, the thread evaluating $V^X_{ij}(i')$, instead of first loading the two values V(i′, j)
and V(i′′(i, i′), j) into the shared memory and then adding them, which would waste a large number of memory cycles, performs the addition within a single instruction
requiring two global memory accesses, doubling the memory bandwidth for each
kernel call.
As mentioned before, the CUDA programming model allows a maximum of β threads per block to be spawned (the value of β depends on the hardware configuration, e.g., β = 1024 for Fermi and Kepler and β = 512 for older architectures);
therefore, each thread may have to evaluate more than one value $V^X_{ij}(i')$ or
$V^Y_{ij}(j')$ at each stage. In particular, a thread of index t ≤ ⌊β/2⌋ evaluates every value $V^X_{ij}(i')$ where i′ % ⌊β/2⌋ = t (i.e., the remainder of the division of i′ by
⌊β/2⌋), whereas a thread of index t > ⌊β/2⌋ evaluates every value $V^Y_{ij}(j')$ where
⌊β/2⌋ + j′ % ⌊β/2⌋ = t. For the sake of simplicity, we define $T_x = \{1, \dots, n_{tx} = \lfloor \beta/2 \rfloor\}$,
$T_y = \{\lfloor \beta/2 \rfloor + 1, \dots, n_{ty} = \beta\}$, and the following sets of positions assigned to each
thread:
$$\begin{aligned} P^X_{ij}(t) &= \{ i' : x_{i'} \in X,\ 0 < x_{i'} \le \tfrac{x_i}{2},\ i' \,\%\, \lfloor \beta/2 \rfloor = t \}, & t \in T_x \\ P^Y_{ij}(t) &= \{ j' : y_{j'} \in Y,\ 0 < y_{j'} \le \tfrac{y_j}{2},\ \lfloor \beta/2 \rfloor + j' \,\%\, \lfloor \beta/2 \rfloor = t \}, & t \in T_y \end{aligned} \tag{4.5}$$
While each thread evaluates the values $V^X_{ij}(i')$, $i' \in P^X_{ij}(t)$, or $V^Y_{ij}(j')$, $j' \in P^Y_{ij}(t)$,
it also performs a max operation among them. The resulting maximum value
is used in the parallel reduction for the max operation among threads,
described in the next section, which uses the values saved in the shared memory.
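The cyclic distribution of positions among threads can be simulated sequentially. The sketch below is only an illustration of the idea behind the position sets (4.5): it uses 0-based thread and position indices, which differs from the 1-based convention of the text, and a single group of threads instead of the two halves. Function names and data are hypothetical.

```python
def assign_positions(num_positions, num_threads):
    """Cyclic distribution: thread t gets positions t, t + num_threads, ..."""
    return [list(range(t, num_positions, num_threads)) for t in range(num_threads)]


def partial_maxima(candidates, num_threads):
    """Each simulated thread keeps the maximum of its own candidates (0 if it has none)."""
    return [max((candidates[p] for p in positions), default=0)
            for positions in assign_positions(len(candidates), num_threads)]


# Ten candidate values (e.g. sums of two subproblem values) handled by four threads.
local_max = partial_maxima([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], 4)   # [7, 2, 9, 6]
```

The per-thread maxima are the values that would be written to shared memory and then combined by the reduction of the next section.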
Figure 4.3: Reduction Tree.
Figure 4.4: Naive reduction.
4.3.3 Parallel Reduction for the Max Operation
It is possible to exploit the parallelism among the threads of a block to compute the
reduction required by the max operation for each V(i, j). A reduction combines
all the elements in a collection into a single one, using an associative two-input,
one-output operator, which in our case is the max operator.
In general, a reduction is an operation of low arithmetic intensity, but it has some
critical aspects that must be analyzed to obtain an efficient parallel algorithm. It is in fact crucial
to be able to exploit all the computational resources available on the GPU device.
The most effective solution we found to parallelize this max operation is a tree-based approach. We can represent the max operation as a binary tree, where
the operations of each level can be done in parallel (Figure 4.3). The complexity of this
algorithm becomes O(N/P + log N), where N is the number of elements to be reduced
and P the number of processors. In our case N = P and the complexity is
O(log N).
Figure 4.5: Naive reduction gives rise to bank conflicts (each column of the
shared memory represents a memory bank).
Figure 4.6: Strided access allows a fast reduction.
Figure 4.7: Strided access avoids bank conflicts.
The shared memory is organized into 32 banks, each 32 bits wide,
and its latency is significantly lower (≈ 4 memory cycles) than that of the
global memory. The shared memory is the only means of communication
among threads of the same block and, given its bank-like structure, bank conflicts
are the most important aspect to avoid in order to enhance kernel performance. A
bank conflict occurs when two or more threads of the same warp access the same memory
bank, thus forcing the accesses to be serialized. Avoiding this is an implementation
constraint which imposes a specific management of thread communication.
A naive algorithm implemented with interleaved
addressing, as shown in Figures 4.4 and 4.5, creates a large number of conflicts, serializing most of the operations
in the shared memory. On the contrary, a strided access (see Figures 4.6 and 4.7)
resolves this problem, leading to conflict-free access to the shared memory, to
a maximization of the device memory bandwidth and, finally, to a good parallel
execution of the threads [59]. In our kernel, once the loading stage described
in Section 4.3.2 is finished, we can perform this reduction step, retrieving the
maximum value for V(i, j).
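The conflict-free scheme can be simulated sequentially as follows. In this sketch, which is an illustration and not the CUDA kernel itself, "thread" t combines elements t and t + s, and the stride s is halved at each level of the tree. The list `shared` stands in for the shared memory of a block, the number of elements is assumed to be a power of two, and the function name is hypothetical.

```python
def reduce_max(values):
    """Tree-based max reduction; returns (maximum, number of tree levels).

    Assumes len(values) is a power of two.
    """
    shared = list(values)            # stands in for the block's shared memory
    stride = len(shared) // 2
    levels = 0
    while stride >= 1:
        # On a GPU all threads t < stride run this step in parallel and then
        # synchronize; no thread reads a location written in the same step.
        for t in range(stride):
            if shared[t] < shared[t + stride]:
                shared[t] = shared[t + stride]
        stride //= 2
        levels += 1
    return shared[0], levels


maximum, levels = reduce_max([3, 9, 1, 7, 4, 8, 2, 6])   # (9, 3)
```

Eight elements are reduced in three levels, in line with the O(log N) complexity stated above. At every level the active threads read consecutive locations, which is what avoids the bank conflicts of the naive scheme.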
4.3.4 Matrix Update
The normal pattern principle reduces the set of values V(i, j) to be evaluated at each stage, which improves the performance of the corresponding serial algorithm presented in Section 4.2.2. Unfortunately, the same
approach is not suitable for the GPU, because the resulting algorithm would induce
sparse, non-linear, and serialized accesses to the global memory.
New NVIDIA architectures, such as Fermi or Kepler, partially alleviate the problem
of coalesced access to global memory, that is, the need to load data which
reside in adjacent positions of the global memory. In any case, a more
efficient memory access can be obtained by transforming the double i, j indexing
of the V values into a one-index access, which maximizes the bandwidth on the
PCI-Express bus, structuring the data in a way more suitable for coalesced
access.
A plain and sequential access to the partial solution values required for evaluating
each V(i, j) can be obtained by filling the gaps of the discretization induced by the normal
patterns (see Figure 4.1a), as described below. Moreover, in order to maximize
coalescence, we store the V matrix twice, in two one-dimensional arrays: one in
row-major order, $V_{rows}$, and one in column-major order, $V_{cols}$. When the algorithm
evaluates the values $V^X_{ij}(i')$ it uses array $V_{rows}$, whereas when it evaluates the values
$V^Y_{ij}(j')$ it uses array $V_{cols}$ (see Section 4.3.2 and the pseudocode in Section 4.3.5).
Using these structures, we can take advantage of the discretization by evaluating
V(i, j) only at the normal pattern positions, while using a complete
matrix representation, by means of the arrays $V_{rows}$ and $V_{cols}$ defined for every
integer position 0 ≤ x ≤ W and 0 ≤ y ≤ H, when we compute the maxima.
Given two adjacent normal pattern positions $x_i, x_{i+1} \in X$ and $y_j, y_{j+1} \in Y$, we
have that $V(x_i, y_j) = V(x, y)$ for every $x_i \le x < x_{i+1}$ and $y_j \le y < y_{j+1}$.
Therefore, when the value $V(x_i, y_j)$ is evaluated, it can be directly copied into every
integer position $x_i \le x < x_{i+1}$ and $y_j \le y < y_{j+1}$ (see Figures 4.8a and 4.8b). This
fills the gaps in the matrix storage and yields a fully linear access when
computing the corresponding maximum.
Figure 4.8: Matrix update in the proposed GPU computing approach: (a) matrix structure; (b) update for the value V(i, j).
As mentioned, the value $V(x_i, y_j)$ is copied into both linearized matrices $V_{rows}$
and $V_{cols}$.
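The update step can be sketched on two flat lists standing in for the linearized matrices. This is an illustration under stated assumptions: the row length is taken as W + 1 and the column length as H + 1, so that all integer positions 0..W and 0..H fit in the arrays, and the function name and sample sizes are hypothetical.

```python
def update_block(v_rows, v_cols, W, H, x_lo, x_hi, y_lo, y_hi, value):
    """Copy value into every integer position x_lo <= x < x_hi, y_lo <= y < y_hi."""
    for x in range(x_lo, x_hi):
        for y in range(y_lo, y_hi):
            v_rows[y * (W + 1) + x] = value   # row-major copy: x varies fastest
            v_cols[x * (H + 1) + y] = value   # column-major copy: y varies fastest


W, H = 4, 3
v_rows = [0] * ((W + 1) * (H + 1))
v_cols = [0] * ((W + 1) * (H + 1))
# Adjacent normal patterns x_i = 2, x_{i+1} = 4 and y_j = 1, y_{j+1} = 3: fill the gap with V = 7.
update_block(v_rows, v_cols, W, H, 2, 4, 1, 3, 7)
```

After the call, scanning a row of `v_rows` or a column of `v_cols` touches consecutive memory locations, which is the access pattern sought for the x-cuts and the y-cuts respectively.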
4.3.5 GPU Algorithm
In this section we present a pseudocode that fully details the structure of the GPU algorithm.
The core of the algorithm is the dynamic stage recursion, which is implemented
in a main loop, working at each iteration on an anti-diagonal d. Every element
V(i, j), (i, j) ∈ $A_d$, is assigned to a different block, and the blocks concurrently evaluate their elements
by expression 4.3. Each block splits the computation among its threads: half of them
consider the values $V^X_{ij}(i')$ and the remaining ones consider the values $V^Y_{ij}(j')$ (see
Section 4.3.2). At the end, each block performs the reduction corresponding to the
max operation of expression 4.3 and updates the corresponding entries of both
linearized matrices $V_{rows}$ and $V_{cols}$. The cut positions are saved in the linearized
matrix $V_{cut}$.
Algorithm GPU DP-2D-GCP (W, H, w, h, v)
    // Compute the sets X and Y using the normal pattern principle
    // Initialization
    for i = 1 to |X| do
        for j = 1 to |Y| do
            V(i, j) = max({v_k : k ∈ P, w_k ≤ x_i and h_k ≤ y_j} ∪ {0})
            // Initialize the linearized matrices V_cols and V_rows
            for x = x_i to x_{i+1} − 1 do
                for y = y_j to y_{j+1} − 1 do
                    V_rows[yW + x] = V(i, j)
                    V_cols[xH + y] = V(i, j)
                    V_cut[xH + y] = 0
    // Recurrence
    for d = 1 to |X| + |Y| − 1 do
        for each (i, j) ∈ A_d do
            // CUDA Kernel: each (i, j) is assigned to a different block
            // Threads evaluate V^X_max = max{V^X_ij(i′) : x_i′ ∈ X, 0 < x_i′ ≤ x_i / 2}
            for each thread t ∈ T_x do
                V′[t] = 0, C′[t] = nil    // Shared Memory Initialization
                for each i′ ∈ P^X_ij(t) do
                    if V′[t] < V_rows[y_j W + x_i′] + V_rows[y_j W + (x_i − x_i′)] then
                        V′[t] = V_rows[y_j W + x_i′] + V_rows[y_j W + (x_i − x_i′)]
                        C′[t] = −x_i′
            // Reduction: at the end V′[1] = V^X_max
            for s = ⌊n_tx / 2⌋ to 1, s = ⌊s/2⌋ do
                if (t ≤ s) and (V′[t] < V′[t + s]) then
                    V′[t] = V′[t + s], C′[t] = C′[t + s]
            // Threads evaluate V^Y_max = max{V^Y_ij(j′) : y_j′ ∈ Y, 0 < y_j′ ≤ y_j / 2}
            for each thread t ∈ T_y do
                V′[t] = 0, C′[t] = nil    // Shared Memory Initialization
                for each j′ ∈ P^Y_ij(t) do
                    if V′[t] < V_cols[x_i H + y_j′] + V_cols[x_i H + (y_j − y_j′)] then
                        V′[t] = V_cols[x_i H + y_j′] + V_cols[x_i H + (y_j − y_j′)]
                        C′[t] = y_j′
            // Reduction: at the end V′[n_tx + 1] = V^Y_max
            for s = ⌊n_ty / 2⌋ to 1, s = ⌊s/2⌋ do
                if (t ≤ n_tx + s) and (V′[t] < V′[t + s]) then
                    V′[t] = V′[t + s], C′[t] = C′[t + s]
            // Keep the better of the two partial maxima
            if V′[1] > V′[n_tx + 1] then
                MaxV = V′[1], MaxC = C′[1]
            else
                MaxV = V′[n_tx + 1], MaxC = C′[n_tx + 1]
            // Update V_cols and V_rows
            for x = x_i to x_{i+1} − 1 do
                for y = y_j to y_{j+1} − 1 do
                    V_rows[yW + x] = MaxV
                    V_cols[xH + y] = MaxV
                    V_cut[xH + y] = MaxC
The initialization section at the beginning of the algorithm includes the setup of
the $V_{rows}$, $V_{cols}$, and $V_{cut}$ data structures. To simplify the presentation, this setup is shown in the same loops, where
we access the matrices linearized by rows, but in the
actual implementation we initialize $V_{rows}$ by rows and $V_{cols}$ and $V_{cut}$ by columns.
In any given iteration d of the recurrence, the concurrent execution of the blocks is
managed by the GPU scheduler, and at the end of the iteration a synchronization
is performed (i.e., iteration d + 1 starts only after every block has completed
iteration d).
Notice that the reduction performed within each block also requires a synchronization
among its threads.
4.4 Computational results
This section reports the computational results and, in particular, the speed-up
factors obtained running the serial and the parallel versions of the algorithm.
The code was implemented in C/C++ with CUDA extensions, using the NVCC
compiler by Nvidia for the GPU version and Microsoft Visual Studio 2010
with full optimization for the serial version.
All tests were run on an Intel i7 920 Bloomfield quad-core @2.8 GHz with 6 Gigabytes of RAM and two different graphics cards: an Nvidia GeForce GTX 570
Fermi with 1 Gigabyte of GDDR5 RAM and 480 CUDA Cores @1.464 GHz and
an Nvidia GeForce GTX 770 Kepler with 2 Gigabytes of GDDR5 RAM and 1536
CUDA Cores @1.046 GHz.
We tested the algorithm on two different hardware configurations to compare the
scalability of the proposed method on two different generations of GPUs.
4.4.1 Test instances
We tested our algorithm on three different instance sets. The first set is the well-known gcut set, generated by Beasley [53] and upgraded by Cintra et al. [54].
In order to check the effectiveness and the correctness of the parallel algorithm,
we generated the other two sets, named testcut and randcut, by means of two opposite methodologies,
as follows.
Every instance of the testcut set is composed of three types of items, whose
w and h dimensions are as follows:
• 25% of items with h ∈ [1, H/4[ and w ∈ [1, W/4[.
• 50% of items with h ∈ [H/4, H/2[ and w ∈ [W/4, W/2[.
• 25% of items with h ∈ [H/2, 3H/4[ and w ∈ [W/2, 3W/4[.
H and W are the dimensions of the master bin.
The instances of the randcut set are composed of items obtained from randomly
generated guillotine cuts on the original master bin. The peculiarity of this set
is that the optimal value of the objective function z is equal to the bin area, W × H, giving us
the possibility to check the correctness of our algorithm on big instances with
a known optimal solution value. These two new sets are available on the website
www.sitoistanze.com, together with the images of all the instances.
4.4.2 Experiments
The objective of our experiments is the comparison of our implementations of the
sequential and of the GPU versions of the algorithm, described in sections 4.2.2
and 4.3.5, respectively. The GPU version was run with 256 threads per block.
In Tables 4.1, 4.2, and 4.3 we report the computational results obtained on the
three considered sets of instances. For each instance we indicate its Name, the
size of the master surface H and W, the number of item types n, the numbers of
normal pattern positions |Y| and |X|, the optimal solution value zOPT, and the percentage waste Waste%. For the algorithms we report the computing times TCPU
of the sequential version and TGPU of the GPU version, and the resulting SpeedUp,
defined as the ratio between TCPU and TGPU, i.e., SpeedUp = TCPU / TGPU.
In Table 4.1 we omitted the first eleven instances because their computing times are not significant; in fact, the times required to solve these instances are below
0.0001 seconds. Analyzing the results, we can highlight the method’s scalability
as a function of the dimension of the instances. The drop in the GTX 570’s performance beyond the 8000x8000 dimensions of the master bin is the effect of the device’s
limited amount of on-board memory. Beyond those dimensions, we are forced to keep
some data structures in the system’s main memory, which increases the time required
to access these structures and significantly slows down the computation. The
tests on the GTX 770 graphics card avoid this problem, thanks to the greater amount
of on-board memory available. As we can see, in this case the speed-up factor is
almost constant.
Figures 4.9 and 4.10 display the original randomly generated instance randcut6000c
and its solution, respectively. As we can see, the computed solution is almost the
same, except for the positions of the cuts on the bin. Figure 4.11 displays the solution of testcut8000,
while Figure 4.12 shows the solution of gcut13.
                          Instances                               Core i7 920    NVIDIA GTX 570         NVIDIA GTX 770
Name     H      W      n    |Y|    |X|    zOPT         Waste%     TCPU (s)       TGPU (s)   SpeedUp     TGPU (s)   SpeedUp
gcut12   1000   1000   50   155    124    979,986      2.001      0.016          0.011      1.455 X     0.011      1.455 X
gcut13   3000   3000   32   1457   2310   8,997,780    0.025      12.199         0.681      17.913 X    0.574      6.959 X
gcut14   3500   3500   42   2390   2861   12,245,410   0.037      26.754         1.390      19.247 X    1.168      21.251 X
gcut15   3500   3500   52   2422   2933   12,246,032   0.032      27.627         1.440      19.185 X    1.206      22.907 X
gcut16   3500   3500   62   2559   2943   12,248,836   0.010      28.985         1.514      19.145 X    1.267      22.878 X
gcut17   3500   3500   82   2676   2953   12,248,892   0.009      30.015         1.580      18.997 X    1.331      22.555 X
Table 4.1: Computational results on gcut instances.
CPU: Core i7 920. GPUs: NVIDIA GTX 570 and NVIDIA GTX 770. Times in seconds.

                                                                i7 920    GTX 570            GTX 770
Name          H      W      n    |Y|   |X|   z_OPT     Waste%   T_CPU     T_GPU   SpeedUp    T_GPU   SpeedUp
testcut6000   6000   6000   80   4881  5384  35985098  0.041    163.145   6.797   24.003 X   6.314   25.839 X
testcut6500   6500   6500   80   5610  6350  42241403  0.020    230.397   9.442   24.401 X   8.724   26.410 X
testcut7000   7000   7000   80   6242  6757  48997730  0.004    295.356   11.669  25.311 X   10.900  27.097 X
testcut7500   7500   7500   80   4988  6441  56201826  0.085    253.937   10.066  25.227 X   9.795   25.925 X
testcut8000   8000   8000   100  7365  7426  63993589  0.010    437.347   17.004  25.720 X   16.733  26.137 X
testcut8500   8500   8500   80   8413  8040  72249152  0.001    518.404   35.351  14.664 X   16.826  30.744 X
testcut9000   9000   9000   80   7495  8546  80980280  0.024    569.479   37.899  15.026 X   20.868  27.290 X
testcut9500   9500   9500   80   8784  8124  90231106  0.020    673.266   45.342  14.849 X   24.807  27.140 X
testcut10000  10000  10000  80   9365  9426  99998425  0.001    866.644   57.579  15.051 X   32.283  26.845 X

Table 4.2: Computational results on testcut instances.
CPU: Core i7 920. GPUs: NVIDIA GTX 570 and NVIDIA GTX 770. Times in seconds.

                                                                 i7 920    GTX 570            GTX 770
Name           H      W      n   |Y|   |X|   z_OPT      Waste%   T_CPU     T_GPU   SpeedUp    T_GPU   SpeedUp
randcut6000a   6000   6000   34  4971  5747  36000000   0.000    180.898   7.275   24.866 X   6.770   26.722 X
randcut6000b   6000   6000   32  5456  5566  36000000   0.000    187.933   7.700   24.407 X   7.103   26.458 X
randcut6000c   6000   6000   36  5630  4844  36000000   0.000    169.182   7.039   24.035 X   6.452   26.223 X
randcut6500a   6500   6500   28  4967  4645  42250000   0.000    161.617   6.572   24.592 X   6.081   26.579 X
randcut6500b   6500   6500   52  6255  6311  42250000   0.000    255.481   10.249  24.927 X   9.385   27.223 X
randcut6500c   6500   6500   40  6187  5783  42250000   0.000    234.749   9.436   24.878 X   8.640   27.169 X
randcut7000a   7000   7000   44  6631  6669  49000000   0.000    306.135   12.083  25.336 X   11.190  27.358 X
randcut7000b   7000   7000   38  5956  6242  49000000   0.000    261.815   10.514  24.902 X   9.814   26.679 X
randcut7000c   7000   7000   32  6793  6494  49000000   0.000    278.242   12.096  23.003 X   11.225  24.788 X
randcut7500a   7500   7500   44  6993  6981  56250000   0.000    364.229   14.384  25.322 X   13.425  27.132 X
randcut7500b   7500   7500   32  6564  6565  56250000   0.000    334.090   12.850  25.999 X   12.258  27.256 X
randcut7500c   7500   7500   32  6651  6507  56250000   0.000    331.501   12.843  25.812 X   12.247  27.069 X
randcut8000a   8000   8000   28  7876  5639  64000000   0.000    363.668   14.174  25.657 X   14.183  25.642 X
randcut8000b   8000   8000   36  7967  6499  64000000   0.000    378.706   16.019  23.641 X   15.863  23.874 X
randcut8000c   8000   8000   34  7465  6935  64000000   0.000    417.253   16.281  25.628 X   16.027  26.035 X
randcut8500a   8500   8500   46  8023  8100  72250000   0.000    527.499   23.914  22.058 X   19.392  27.202 X
randcut8500b   8500   8500   42  8098  7748  72250000   0.000    523.007   22.965  22.774 X   18.768  27.867 X
randcut8500c   8500   8500   36  8208  6967  72250000   0.000    483.054   21.413  22.559 X   17.411  27.744 X
randcut9000a   9000   9000   44  8728  8547  81000000   0.000    613.783   41.284  14.867 X   23.287  26.357 X
randcut9000b   9000   9000   42  8712  9000  81000000   0.000    657.307   43.418  15.139 X   24.206  27.155 X
randcut9000c   9000   9000   36  8558  7075  81000000   0.000    540.588   32.943  16.410 X   19.733  27.395 X
randcut9500a   9500   9500   40  9256  6813  90250000   0.000    623.986   37.462  16.657 X   22.143  28.180 X
randcut9500b   9500   9500   36  8661  9349  90250000   0.000    743.840   49.040  15.168 X   27.395  27.152 X
randcut9500c   9500   9500   36  7931  8533  90250000   0.000    656.044   42.975  15.266 X   24.012  27.322 X
randcut10000a  10000  10000  36  8587  8382  100000000  0.000    723.389   48.010  15.067 X   27.788  26.032 X
randcut10000b  10000  10000  42  9298  9586  100000000  0.000    882.447   60.378  14.615 X   32.539  27.120 X
randcut10000c  10000  10000  38  8601  9618  100000000  0.000    810.420   53.605  15.118 X   31.043  24.906 X

Table 4.3: Computational results on randcut instances.
Figure 4.9: randcut6000c source.
Figure 4.10: randcut6000c solution.
Figure 4.11: testcut8000 solution.
Figure 4.12: gcut13 solution.
4.5
Considerations and Future Work
In this chapter we presented a parallel algorithm for solving the Unconstrained
Two-Dimensional Guillotine Cutting Problem (2D-GCP), especially designed for
running on a GPGPU platform. We demonstrated the effectiveness of this method by achieving, in the best case, a 30X speed-up factor over the serial version, exploiting the
native matrix-like structure of the problem and the fine-grained computation required by the dynamic programming algorithm. We also provided two new sets of
test instances for this problem.
Our future aims are to extend the algorithm to solve the staged version of
the problem and, eventually, to use GPU computing to enhance other algorithms designed for similar packing problems (e.g., Bin Packing, the Constrained Two-Dimensional Guillotine Cutting Problem, etc.).
Chapter 5
Vehicle Routing Problem
In this chapter we investigate the application of GPU computing to some of
the most effective pricing strategies based on Dynamic Programming (q-route,
through-q-route and ng-route relaxations) for Column Generation methods for
the Vehicle Routing Problem. We propose the parallel versions of these algorithms in a massively parallel environment, discussing the implementation choices
and evaluating the speed-up factors on literature test instances with respect to the
serial version.
5.1
Introduction
The Vehicle Routing Problem (VRP) is among the most studied problems in combinatorial optimization, and retains unabated interest both because, though simple
to state, it enjoys intriguing mathematical properties and because it can be readily
specialized into problems of primary economic interest. The literature on the core
problem variants and on the possible real-world variations has grown enormously since the seminal paper which introduced it [60], and includes dedicated books [61], [62] and,
more recently, dedicated working groups of research associations [63].
The core problem can be quickly introduced as finding a least cost set of routes to
service a number of customers from a central depot, given a cost matrix specifying
the cost for going from any customer to any other one and from the depot to each
customer. The problem can then be complicated at will, by adding constraints
suggested by real-world applications. A widely adopted constraint assumes that
the routes are to be traveled by trucks in order to deliver goods to or collect goods from
customers, so the total amount of goods loaded on each truck cannot exceed
its capacity, in weight or in volume. This gives rise to the Capacitated VRP
variant (CVRP). Alternatively, each customer can ask either for a delivery or
for a collection of goods, yielding the Pickup and Delivery variant (PDVRP). In
case all deliveries of each route are to be made first, then all collections, we have
the CVRP with backhauls (VRPB). A further, quite common constraint considers
feasible time windows for the visits at the customers (TWVRP). Moreover, in
small-area settings each truck could go back to the depot to reload (multitrip
VRP, MTVRP), while in bigger areas it is common for the vehicles of the fleet
to use several depots (multidepot VRP, MDVRP). The vehicles of the fleet can be
all equal or different from one another (heterogeneous fleet CVRP, HVRP), may
not be required to return to the depot they started from (open CVRP, OVRP),
could be required to repeat the same routes with a given periodicity over the
planning horizon (periodic VRP, PVRP), etc.
Furthermore, all listed constraints, and many more coming from operational practice, can be freely combined to model actual use cases. For example, a recent work
on city logistics operational optimization ([64]) models its problem as a CVRP with
time windows, multi-trip, heterogeneous fleet and pickup and delivery.
Given its theoretical and practical relevance, the VRP has witnessed a wealth of diverse
approaches for its solution and still fosters a lively research community studying
either exact or heuristic methods or, more recently, both. A detailed survey is
clearly out of scope for this paper; in the following we recall just a few contributions.
Heuristic approaches have a seminal work in [65] and went through tailored heuristics, such as [66], then metaheuristics, such as tabu search [67], simulated annealing [68], ant colony [69], genetic and in general evolutionary algorithms [70],
variable neighborhood search [71] and PSO [72], just to name a few.
Exact approaches are of more direct interest for this paper. Again, different approaches have been used, ranging from dynamic programming [73] to branch and
bound [74], from branch and cut [75] to column generation [76]. In all cases, a
central feature is the ability to compute tight lower bounds. Different approaches for computing bounds have also been proposed; recently, bounds based on nonelementary paths,
such as q-routes [74] and most notably ng-routes [77, 78], appear to be particularly
effective for the Capacitated VRP and the VRP with Time Windows.
This chapter reports on the results obtained by implementing the computation of the q-routes, through-q-routes and ng-routes relaxations on a GPU parallel architecture.
GPUs are enjoying increasing interest in the optimization community, given the
possibility to significantly speed up tasks at the core of any approach of interest
and thus, ultimately, to achieve substantial efficiency improvements [79]. Applications
to combinatorial optimization problems have so far been reported for the knapsack
problem [58] and for the Two-Dimensional Guillotine Cutting Problem [80]. This
is the first work porting state-of-the-art vehicle routing optimization components
to GPU, specifically proposing a GPU implementation of the ng-relaxation.
The implementation on GPU of an optimization algorithm is a complex task that
involves the study of tailored data structures and corresponding routines. This
chapter reports in detail the choices we made to achieve the most efficient parallel
implementation of the q-routes, through-q-routes and ng-routes routines and substantiates this with computational results on standard problem benchmarks from
the literature.
5.2
Problem Description and Mathematical Formulations
The CVRP, in its basic version, consists in finding the least-cost set of routes
to be travelled by m homogeneous vehicles of identical capacity Q, in order to
service each of n customers, whose index set is V_1. All routes start and return
to a common depot, conventionally indexed by 0. Let V = V_1 ∪ {0}. Input data
consist of the requests q_i, i = 1, ..., n, and of the travelling costs c_ij, i = 0, ..., n,
j = 0, ..., n, between each pair of customers and between each customer and the
depot.
The problem can thus be defined on a complete weighted graph G = (V, A, C),
where A = [(i, j)], i, j ∈ V, and C = [c_ij], i, j ∈ V, is the corresponding, possibly
asymmetric, cost matrix. In real-world applications, G is typically an overlay graph
superimposed on an actual road network: nodes in V correspond to geocoded
facilities, while arcs in A correspond to least-cost paths, computed according to
the metric to minimize (distance, time, ...).
The problem can be formulated in different ways; we refer the reader to [61] for a
thorough overview. The following two subsections introduce the formulations and
the notation we use in the rest of the paper.
5.2.1
Two Index Formulation
The two index formulation associates a decision variable xij ∈ {0, 1} to each arc
(i, j) ∈ A, specifying whether or not in the optimal solution there is a vehicle
travelling directly from node i ∈ V to node j ∈ V .
Different variants of this formulation are possible; the most compact ones require
computing, for each subset of nodes S ⊆ V, the minimum number of vehicles
needed to service set S, which is denoted by r(S).
The formulation is as follows.
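\begin{align}
z_{CVRP} = \min\ & \sum_{i \in V} \sum_{j \in V} c_{ij} x_{ij} \tag{5.1} \\
\text{s.t.}\ & \sum_{i \in V} x_{ij} = 1, && j \in V_1 \tag{5.2} \\
& \sum_{j \in V} x_{ij} = 1, && i \in V_1 \tag{5.3} \\
& \sum_{j \in V} x_{0j} = m \tag{5.4} \\
& \sum_{i \notin S} \sum_{j \in S} x_{ij} \ge r(S), && S \subseteq V_1,\ S \neq \emptyset \tag{5.5} \\
& x_{ij} \in \{0, 1\}, && i, j \in V \tag{5.6}
\end{align}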
Constraints 5.2 and 5.3 impose that exactly one vehicle arrives at and leaves each
customer, constraint 5.4 specifies the number of available vehicles (not all of which
need to be used), and constraints 5.5, the so-called capacity-cut constraints, impose
both the connectivity of the solution and the vehicle capacity requirements (see
[61]).
5.2.2
Set Partitioning Formulation
The set partitioning formulation, originally proposed by [81], associates a decision
variable xℓ to each feasible vehicle route, that is, to each route that can be travelled
by a vehicle, leaving the depot, servicing a subset of customers that collectively
do not exceed the vehicle capacity and finally returning to the depot.
Let R be the index set of all feasible routes, let c_ℓ be the cost of route ℓ ∈ R, and
let a_iℓ be a binary coefficient, which is equal to 1 iff node i ∈ V belongs to route
ℓ ∈ R.
The formulation is as follows.
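\begin{align}
z_{SP} = \min\ & \sum_{\ell \in R} c_\ell x_\ell \tag{5.7} \\
\text{s.t.}\ & \sum_{\ell \in R} a_{i\ell} x_\ell = 1, && i \in V_1 \tag{5.8} \\
& \sum_{\ell \in R} x_\ell = m \tag{5.9} \\
& x_\ell \in \{0, 1\}, && \ell \in R \tag{5.10}
\end{align}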
Constraints 5.8 ensure that each customer is serviced by exactly one feasible route
and constraint 5.9 imposes the fleet cardinality. It is noteworthy that, if the
cost matrix satisfies the triangle inequality, equalities 5.8 can be turned into
greater-than-or-equal inequalities, thus turning the problem into an extended set
covering problem, which is computationally easier to deal with.
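To make the set partitioning model concrete, the following Python sketch enumerates the route set R of a tiny instance by brute force and then selects m routes that cover every customer exactly once. It is only an illustration under stated assumptions: the function names, the five-node instance on a line and the exhaustive enumeration are hypothetical and are not the column generation approach used in this chapter, which exists precisely because R is far too large to enumerate on real instances.

```python
from itertools import combinations, permutations


def route_cost(subset, d):
    """Cheapest depot -> customers -> depot tour over `subset` (brute force)."""
    return min(
        d[0][p[0]] + sum(d[a][b] for a, b in zip(p, p[1:])) + d[p[-1]][0]
        for p in permutations(subset)
    )


def feasible_routes(demand, Q, d):
    """Enumerate R: customer subsets whose total demand does not exceed Q."""
    customers = range(1, len(demand))
    routes = []
    for k in range(1, len(demand)):
        for subset in combinations(customers, k):
            if sum(demand[i] for i in subset) <= Q:
                routes.append((frozenset(subset), route_cost(subset, d)))
    return routes


def solve_set_partitioning(demand, Q, d, m):
    """Pick exactly m routes covering each customer exactly once (5.8)-(5.10)."""
    routes = feasible_routes(demand, Q, d)
    all_customers = frozenset(range(1, len(demand)))
    best_cost, best_routes = float("inf"), None
    for choice in combinations(routes, m):
        covered = [i for subset, _ in choice for i in subset]
        # Equalities (5.8): no customer missing and none covered twice.
        if len(covered) == len(all_customers) and set(covered) == all_customers:
            cost = sum(c for _, c in choice)
            if cost < best_cost:
                best_cost = cost
                best_routes = [sorted(subset) for subset, _ in choice]
    return best_cost, best_routes


# Hypothetical instance: depot at 0, customers at 1, 2, -1, -2 on a line.
x = [0, 1, 2, -1, -2]
d = [[abs(a - b) for b in x] for a in x]
demand = [0, 1, 1, 1, 1]
solution = solve_set_partitioning(demand, Q=2, d=d, m=2)
print(solution)
```

On this instance the two routes serve the customers on each side of the depot separately.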
5.3
Dynamic Programming Relaxations for the
Pricing Problem
In Section 5.1 we briefly described some techniques, heuristic or
exact, for solving the VRP. For the exact algorithms based on Column
Generation (CG), it is necessary to solve the pricing problem in order to select
the most interesting columns. The pricing problem is itself NP-hard: it is
the Elementary Shortest Path Problem with Resource Constraints (ESPPRC). The
problem is formally defined as follows. Let 𝒫 be the set of paths of G such that each path P ∈
𝒫 starts from 0, visits a set of vertices V_P ⊆ V, delivers q_P units of product
and ends at vertex σ_P ∈ V_P without loops. The ESPPRC can be solved with a
Dynamic Programming recursion expressed as follows: we define a state-space
graph 𝒳 = {(X, i) : X ⊆ V, i ∈ V′} and functions f(X, i), ∀(X, i) ∈ 𝒳, where
f(X, i) is the cost of the least-cost path P that visits the set of customers X,
ends at customer i ∈ X, and satisfies Σ_{j∈X} q_j ≤ Q. The
exact DP algorithm cannot be applied in practice because of the dimension of the state-space
graph 𝒳, which grows exponentially with the number of customers. Christofides et al. [82] proposed the State-Space Relaxation, a
procedure whereby the state space associated with the DP recurrence is relaxed to
compute valid lower bounds to the original problem. The three relaxations
introduced in the next sections of this chapter are relaxations of the
ESPPRC based on this principle.
In [74], [77], [83], [84], [85], effective and reliable relaxations based
on Dynamic Programming recurrences were proposed. These recurrences, like all dynamic
programming algorithms, trade space for time, enumerating all the interesting
solutions of the relaxed problem. In these cases the elementarity constraint is
relaxed, and the aim is to find interesting, almost elementary paths that can serve as the
basis for the creation of feasible solutions or as a good starting point for the computation
of valid and tight lower bounds. In the next sections we describe three of these
methods (q-route, through-q-route and ng-route) and analyze the intrinsic
characteristics of each one.
5.3.1
q-Route Relaxation
The q-route relaxation, described by Christofides et al. [74], aims to find
routes without loops of two vertices. Let f(q, i) be the cost of the least-cost
path P = (0, i_1, ..., i_k), i_k = i (not necessarily simple), from the depot 0 to
customer i with total load q = Σ_{h=1}^{k} q_{i_h}. Such a path is called a q-path. A q-path
with the additional edge {0, i} is called a q-route and has cost f(q, i) + d_{0i}. The
requirement that the path contain no loops formed by three consecutive
vertices (i.e., of the form i, j, i) can be imposed as follows. Let π(q, i) be the vertex just prior to i on the
path corresponding to f(q, i), and let φ(q, i) be the cost of the least-cost path ending
at vertex i with the constraint that the vertex γ(q, i) preceding i is not equal to
π(q, i). The recurrence can be formalized as follows: for a given value of q, let
h(j, i) be the cost of the least-cost path from 0 to i with j just prior to i and without
loops. Then:
\[
h(j, i) =
\begin{cases}
f(q - q_i, j) + d_{ji}, & \text{if } \pi(q - q_i, j) \neq i \\
\phi(q - q_i, j) + d_{ji}, & \text{otherwise}
\end{cases}
\tag{5.11}
\]
Given the function h, the functions f and φ can be computed for the given q as follows:
\[
\begin{cases}
f(q, i) = \min_{j \neq i} \{ h(j, i) \} \\
\pi(q, i) = j^*
\end{cases}
\tag{5.12}
\]
where j* is the value of j, the predecessor, corresponding to the above minimum;
\[
\begin{cases}
\phi(q, i) = \min_{k \neq \pi(q, i),\ k \neq i} \{ h(k, i) \} \\
\gamma(q, i) = k^*
\end{cases}
\tag{5.13}
\]
where k* is the value of k corresponding to the above minimum. The initialization
of f, φ, π and γ is f(q, i) = φ(q, i) = ∞ for q such that q ≠ q_i and, for q such that q = q_i:
\[
\begin{cases}
f(q, i) = d_{0i} \\
\pi(q, i) = 0 \\
\phi(q, i) = \infty
\end{cases}
\tag{5.14}
\]
Informally, we can say that the method builds a minimum-cost non-elementary path
delivering a quantity q of goods, with a dynamic programming recursion that extends
a path ending at j to a node i only when i is not the direct predecessor of j on that
path (falling back on the second-best path φ otherwise), thus finding a path without loops of two nodes.
The algorithm can be described as follows:
Algorithm Q-PATHS (N, Q, q_i, d)
  // Data Structures
  f, φ, π, γ
  // Initialization
  for q = 0 to Q do
    for i = 1 to N do
      if q_i(i) == q
        then f(q, i) = d(0, i), π(q, i) = 0
             φ(q, i) = ∞, γ(q, i) = ∞
        else f(q, i) = ∞, π(q, i) = −1
             φ(q, i) = ∞, γ(q, i) = ∞
  // q-Paths
  for q = 0 to Q do
    for i = 1 to N with q_i(i) < q do   // states with q ≤ q_i(i) keep their initial values
      for j = 1 to N, j ≠ i do
        if π(q − q_i(i), j) ≠ i
          then h(j, i) = f(q − q_i(i), j) + d(j, i)
          else h(j, i) = φ(q − q_i(i), j) + d(j, i)
      // Minima calculation for each i
      f(q, i) = min_{j≠i} {h(j, i)}
      π(q, i) = j*
      φ(q, i) = min_{k≠π(q,i), k≠i} {h(k, i)}
      γ(q, i) = k*
  return f, φ, π, γ
Using this method we can find shortest paths that respect the request of each
node i and the vehicle's capacity Q, but the routes, although they avoid loops of two
vertices, are not necessarily elementary (Figure 5.1b). The q-route relaxation can be
computed in pseudo-polynomial time, with a complexity of O(n²Q). Figure 5.1a
shows graphically the computation of a single f(q, i) value. In the case
of the asymmetric VRP, where d_ij ≠ d_ji, we compute the relaxation once for the matrix d
and once for its transpose d^T.
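The recursion (5.11)-(5.14) can be sketched sequentially in Python as follows. This is a minimal illustration rather than the implementation evaluated in this chapter: it assumes positive integer demands, keeps the two smallest values of h(j, i) incrementally instead of storing the whole h vector, and uses a hypothetical four-node instance on a line with unit demands.

```python
INF = float("inf")


def q_paths(n, Q, demand, d):
    """f[q][i] / phi[q][i]: best and second-best (different predecessor) costs
    of a q-path ending at customer i with load q; pi / gamma: predecessors."""
    f = [[INF] * (n + 1) for _ in range(Q + 1)]
    phi = [[INF] * (n + 1) for _ in range(Q + 1)]
    pi = [[-1] * (n + 1) for _ in range(Q + 1)]
    gamma = [[-1] * (n + 1) for _ in range(Q + 1)]
    for i in range(1, n + 1):  # initialization (5.14)
        if demand[i] <= Q:
            f[demand[i]][i] = d[0][i]
            pi[demand[i]][i] = 0
    for q in range(1, Q + 1):
        for i in range(1, n + 1):
            prev = q - demand[i]
            if prev <= 0:
                continue  # keep the initial values
            for j in range(1, n + 1):
                if j == i:
                    continue
                # (5.11): fall back on phi when the best path at j comes from i
                base = f[prev][j] if pi[prev][j] != i else phi[prev][j]
                h = base + d[j][i]
                if h < f[q][i]:  # new minimum (5.12); old one becomes phi
                    phi[q][i], gamma[q][i] = f[q][i], pi[q][i]
                    f[q][i], pi[q][i] = h, j
                elif h < phi[q][i]:  # new second minimum (5.13)
                    phi[q][i], gamma[q][i] = h, j
    return f, phi, pi, gamma


# Hypothetical instance: depot 0 and customers 1, 2, 3 on a line, unit demands.
n, Q = 3, 3
demand = [0, 1, 1, 1]
d = [[abs(a - b) for b in range(n + 1)] for a in range(n + 1)]
f, phi, pi, gamma = q_paths(n, Q, demand, d)
print(f[3][1], pi[3][1])
```

On this instance the cheapest unrestricted walk of load 3 ending at customer 1 would be 0, 1, 2, 1; the recursion rejects it because of the loop 1, 2, 1 and returns the cost of an elementary path instead.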
5.3.2
through-q-Route Relaxation
The through-q-route relaxation [74] is an enhancement of the q-route relaxation:
using the functions f and φ, we can build a better route by combining two paths
produced by the q-path recursion. Formally, let ψ(q, i) be the value of the least-cost
route without loops that starts at the depot, passes through i and returns
to the depot with a total load of q. Such a route is called a through-q-route.
Figure 5.1: f(q,i) q-path computation and avoidance of 2-vertex loops. (a) f(q,i) q-path computation. (b) q-path avoiding a 2-vertex loop.
The ψ(q, i) values are computed as follows:
\[
\psi(q, i) = \min_{q_i \le \bar{q} \le (q + q_i)/2}
\begin{cases}
f(\bar{q}, i) + f(q + q_i - \bar{q}, i), & \text{if } \pi(\bar{q}, i) \neq \pi(q + q_i - \bar{q}, i) \\
\min \{ f(\bar{q}, i) + \phi(q + q_i - \bar{q}, i),\ \phi(\bar{q}, i) + f(q + q_i - \bar{q}, i) \}, & \text{otherwise}
\end{cases}
\tag{5.15}
\]
The algorithm can be described as follows:
Algorithm THROUGH-Q-ROUTES (N, Q, q_i, f, φ, π, γ)
  // Data Structures
  ψ
  // Initialization
  for q = 0 to Q do
    for i = 1 to N do
      ψ(q, i) = ∞
  // Through-q-routes
  min1, min2 // Minima
  for i = 1 to N do
    for q = 0 to Q do
      min1 = ∞, min2 = ∞
      for q_i(i) ≤ q̄ ≤ (q + q_i(i))/2 do
        back = q + q_i(i) − q̄
        if π(q̄, i) ≠ π(back, i)
          then min2 = f(q̄, i) + f(back, i)
          else if f(q̄, i) + φ(back, i) ≤ φ(q̄, i) + f(back, i)
            then min2 = f(q̄, i) + φ(back, i)
            else min2 = φ(q̄, i) + f(back, i)
        if min2 ≤ min1
          then min1 = min2
      // ψ Update
      ψ(q, i) = min1
  return ψ
More informally, this recurrence selects, among all the paths for a given q, the
best combination of two paths that starts and ends at the depot. Once the
q-route relaxation has been computed, the through-q-route function ψ(q, i) can be computed in
pseudo-polynomial time with a complexity of O(n²Q²).
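The combination step of equation 5.15 can be sketched in Python as below. The block repeats a compact q-path recursion only to stay self-contained. Two points are assumptions of this sketch and not statements from the text: demands are positive integers, and two paths that both reach i directly from the depot are allowed to be combined, so that the single-customer route 0, i, 0 receives a finite value.

```python
INF = float("inf")


def q_paths(n, Q, demand, d):
    """Compact q-path recursion with 2-cycle elimination (5.11)-(5.14)."""
    f = [[INF] * (n + 1) for _ in range(Q + 1)]
    phi = [[INF] * (n + 1) for _ in range(Q + 1)]
    pi = [[-1] * (n + 1) for _ in range(Q + 1)]
    for i in range(1, n + 1):
        if demand[i] <= Q:
            f[demand[i]][i], pi[demand[i]][i] = d[0][i], 0
    for q in range(1, Q + 1):
        for i in range(1, n + 1):
            prev = q - demand[i]
            if prev <= 0:
                continue
            for j in range(1, n + 1):
                if j == i:
                    continue
                h = (f[prev][j] if pi[prev][j] != i else phi[prev][j]) + d[j][i]
                if h < f[q][i]:
                    phi[q][i] = f[q][i]
                    f[q][i], pi[q][i] = h, j
                elif h < phi[q][i]:
                    phi[q][i] = h
    return f, phi, pi


def through_q_routes(n, Q, demand, f, phi, pi):
    """psi[q][i]: cheapest pair of q-paths meeting at i with total load q."""
    psi = [[INF] * (n + 1) for _ in range(Q + 1)]
    for i in range(1, n + 1):
        qi = demand[i]
        for q in range(qi, Q + 1):
            best = INF
            qbar = qi
            while 2 * qbar <= q + qi:
                back = q + qi - qbar  # load of the second path, i included
                # Assumption: a shared depot predecessor (0) is not a loop.
                same = pi[qbar][i] == pi[back][i] and pi[qbar][i] != 0
                if not same:
                    cand = f[qbar][i] + f[back][i]
                else:
                    cand = min(f[qbar][i] + phi[back][i],
                               phi[qbar][i] + f[back][i])
                best = min(best, cand)
                qbar += 1
            psi[q][i] = best
    return psi


# Hypothetical instance: depot 0 and customers 1, 2, 3 on a line, unit demands.
n, Q = 3, 3
demand = [0, 1, 1, 1]
d = [[abs(a - b) for b in range(n + 1)] for a in range(n + 1)]
f, phi, pi = q_paths(n, Q, demand, d)
psi = through_q_routes(n, Q, demand, f, phi, pi)
print(psi[3][2])
```

The two paths share customer i, so their loads add up to q + q_i; this is why the second path is looked up at load q + q_i − q̄.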
In the asymmetric case, we use the f and φ functions together with the π and γ
predecessors computed by the asymmetric q-path procedure described above. We call f^fw and φ^fw
the functions obtained from the q-path computation using the d matrix, along with
the predecessor matrices π^fw and γ^fw. In the same fashion, we call f^bw, φ^bw,
π^bw and γ^bw the ones computed with the d^T matrix. In this case, we can extend
the algorithm as follows:
Algorithm ASY-THROUGH-Q-ROUTES (N, Q, q_i, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw)
  // Data Structures
  ψ
  // Initialization
  for q = 0 to Q do
    for i = 1 to N do
      ψ(q, i) = ∞
  // Through-q-routes
  min1, min2 // Minima
  for i = 1 to N do
    for q = 0 to Q do
      min1 = ∞, min2 = ∞
      for q_i(i) ≤ q̄ ≤ (q + q_i(i))/2 do
        back = q + q_i(i) − q̄
        if π^fw(q̄, i) ≠ π^bw(back, i)
          then min2 = f^fw(q̄, i) + f^bw(back, i)
          else if f^fw(q̄, i) + φ^bw(back, i) ≤ φ^fw(q̄, i) + f^bw(back, i)
            then min2 = f^fw(q̄, i) + φ^bw(back, i)
            else min2 = φ^fw(q̄, i) + f^bw(back, i)
        if min2 ≤ min1
          then min1 = min2
      // ψ Update
      ψ(q, i) = min1
  return ψ
5.3.3
ng-Route Relaxation
Righini and Salani [83–85] proposed a DP relaxation based on the construction
of elementary paths in a decreasing state space. Baldacci et al. [77] generalized
this idea and proposed a more effective relaxation for the pricing problem,
the ng-route relaxation. The main weakness of the other methods described
above is that they allow cycles longer than two vertices. Procedures to avoid bigger
loops are computationally expensive and can address loops of three or four vertices
only. The ng-route relaxation partially solves this problem by introducing supplementary information into the DP algorithm, which allows the recurrence to 'remember'
an arbitrary number of nodes during the state-expansion phase and so avoids the
creation of loops over a significant number of nodes.
The algorithm has specific rules for each kind of Vehicle Routing
Problem to which it is applied (VRPTW, CVRP, ...). We consider only the
version for the Capacitated Vehicle Routing Problem (CVRP). The ng-route
relaxation can be described as follows. Let N_i ⊆ V be a set of selected customers
for vertex i (according to some criterion), such that i ∈ N_i and |N_i| ≤ ∆(N_i),
where ∆(N_i) is the cardinality of the selected neighbors of i plus i itself. The sets
N_i allow us to associate with each path P = (0, i_1, ..., i_k) the subset Π(P) ⊆ V(P)
containing customer i_k and every customer i_r, r = 1, ..., k − 1, of P that belongs
to all the sets N_{i_{r+1}}, ..., N_{i_k} associated with the customers i_{r+1}, ..., i_k visited after i_r.
The set Π(P) is defined as:
\[
\Pi(P) = \left\{ i_r : i_r \in \bigcap_{s=r+1}^{k} N_{i_s},\ r = 1, \ldots, k-1 \right\} \cup \{ i_k \}.
\tag{5.16}
\]
An ng-path (q, i, NG) is a not necessarily elementary path P = (0, i_1, ..., i_{k−1}, i_k =
i) starting from the depot, visiting a subset of customers (possibly more than once)
such that NG = Π(P), and ending at customer i such that i ∉ Π(P′), where P′ =
(0, i_1, ..., i_{k−1}). We denote by f(q, i, NG) the cost of the least-cost ng-path
(q, i, NG). An ng-route is an ng-path (q, i, NG) plus the edge from i to
the depot, and its cost is f(q, 0, NG) = f(q, i, NG) + d_{i0}. The functions
f(q, i, NG) can be computed on the graph H = (Φ, Ψ), where:
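\[
\Phi = \Big\{ (i, q, NG) : q_i \le q \le Q,\ \forall NG \subseteq N_i \text{ s.t. } i \in NG \wedge \sum_{j \in NG} q_j \le q,\ \forall i \in V' \Big\},
\tag{5.17}
\]
\[
\Psi = \big\{ ((j, q', NG'), (i, q, NG)) : \forall (j, q', NG') \in \Psi^{-1}(i, q, NG),\ \forall (i, q, NG) \in \Phi \big\},
\tag{5.18}
\]
where
\[
\Psi^{-1}(i, q, NG) = \big\{ (j, q - q_i, NG') : \forall NG' \subseteq N_j \text{ s.t. } j \in NG' \text{ and } NG' \cap N_i = NG \setminus \{i\},\ \forall j \in \Gamma_i^{-1} \big\}.
\tag{5.19}
\]
The function f(i, q, NG) can be computed using the DP recursion:
\[
f(i, q, NG) = \min_{(j, q', NG') \in \Psi^{-1}(i, q, NG)} \{ f(j, q', NG') + d_{ji} \}, \quad \forall (i, q, NG) \in \Phi.
\tag{5.20}
\]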
It is important to notice that the ∆ parameter is critical for this relaxation: the larger
the cardinality of the neighborhood sets, the better the bound obtained. Increasing the
sets' cardinality, however, inevitably leads to a combinatorial explosion of the recursion's
states. Martinelli, Pecin and Poggi [78] address this problem by combining the
Decremental State Space relaxation proposed by Righini with the computation of
the ng-route relaxation: the exact relaxation is used only when necessary, while
a heuristic procedure based on the q-route relaxation finds the most promising
routes to insert in the CG algorithm.
Algorithm NG-PATHS (N, Q, d, q_i, N_i)
  // Data Structures
  f, π_n, π_ng
  // Initialization
  for q = 0 to Q do
    for i = 1 to N do
      if q_i(i) == q
        then NG = {i}, NGlist(q, i).Add(NG)
             f(q, i, NG) = d(0, i), π_n(q, i, NG) = 0, π_ng(q, i, NG) = 0
        else f(q, i, NG) = ∞, π_n(q, i, NG) = −1, π_ng(q, i, NG) = −1 for every NG
  // ng-Paths
  for q = 0 to Q do
    for i = 1 to N with q_i(i) < q do
      for j = 1 to N do
        for ng = 0 to |NGlist(q − q_i(i), j)| do
          NG = NGlist(q − q_i(i), j, ng)
          if i ∉ NG
            then NG_new = (N_i ∩ NG) ∪ {i}
                 if f(q − q_i(i), j, NG) + d_ji ≤ f(q, i, NG_new)
                   then f(q, i, NG_new) = f(q − q_i(i), j, NG) + d_ji
                        Add NG_new to NGlist(q, i)
                        π_n(q, i, NG_new) = j
                        π_ng(q, i, NG_new) = NGlist(q − q_i(i), j, ng)
  return f, π_n, π_ng
In the algorithm we have introduced π_n and π_ng, which keep track of the predecessor
node j ∈ Γ_i^{-1} and of the predecessor set Π(P), P = (0, i_1, ..., i_{k−1}), respectively.
We also introduce dominance among the labels of the recursion. In fact, we
can obtain the same label (i, q, NG) by expanding two different predecessor states
(j, q − q_i, NG_j) and (k, q − q_i, NG_k). Obviously, the dominant label is the one
with the lower function value f.
The asymmetric extension of this relaxation is straightforward: we compute
f^fw(i, q, NG) using the d matrix and f^bw(i, q, NG) using its transpose
d^T.
Figure 5.2: ng-Path example.
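The ng-path recursion with label dominance can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation discussed in this chapter: labels are stored in a dictionary keyed by (q, i, NG), they are expanded forward (each label pushes its successors) rather than pulled backward as in equation 5.20, demands are positive integers, and the neighborhood sets `neigh` and the four-node line instance are hypothetical.

```python
INF = float("inf")


def ng_paths(n, Q, demand, d, neigh):
    """Return f[(q, i, NG)] -> least cost and pred[(q, i, NG)] -> previous label.

    neigh[i] is the set N_i and must contain i.
    """
    f, pred = {}, {}
    by_stage = [[] for _ in range(Q + 1)]  # labels grouped by load q
    for i in range(1, n + 1):
        if demand[i] <= Q:
            key = (demand[i], i, frozenset([i]))
            f[key], pred[key] = d[0][i], None
            by_stage[demand[i]].append(key)
    for q in range(1, Q + 1):
        for key in by_stage[q]:  # stage q is final: demands are positive
            _, j, ng = key
            for i in range(1, n + 1):
                if i in ng or q + demand[i] > Q:
                    continue  # i is still "remembered", or capacity exceeded
                new_ng = frozenset((ng & neigh[i]) | {i})
                new_key = (q + demand[i], i, new_ng)
                cost = f[key] + d[j][i]
                if cost < f.get(new_key, INF):  # dominance among equal labels
                    if new_key not in f:
                        by_stage[new_key[0]].append(new_key)
                    f[new_key], pred[new_key] = cost, key
    return f, pred


def best_route_cost(f, d, q):
    """Cheapest ng-route of load q: ng-path cost plus the edge back to 0."""
    return min(c + d[k[1]][0] for k, c in f.items() if k[0] == q)


# Hypothetical instance: depot 0 and customers 1, 2, 3 on a line, unit demands.
n, Q = 3, 3
demand = [0, 1, 1, 1]
d = [[abs(a - b) for b in range(n + 1)] for a in range(n + 1)]
full = {i: frozenset({1, 2, 3}) for i in range(1, n + 1)}
single = {i: frozenset({i}) for i in range(1, n + 1)}
bound_full = best_route_cost(ng_paths(n, Q, demand, d, full)[0], d, Q)
bound_single = best_route_cost(ng_paths(n, Q, demand, d, single)[0], d, Q)
print(bound_full, bound_single)
```

With N_i equal to the whole customer set every ng-path is elementary, while with N_i = {i} a customer is forgotten as soon as the path leaves it, so loops such as 1, 2, 1 reappear and the value is a weaker bound.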
5.4
Parallel Relaxations on GPU
In this section we describe the parallel algorithms designed to run these relaxations
on a GPU. Dynamic Programming algorithms ported to this kind of device have
given good results in many applications: Boyer et al. [58] showed that a
consumer GPU can achieve a 20X speed-up factor in solving the Knapsack
Problem. Harish et al. [86] and Buluç et al. [87] presented a broad spectrum of
graph problems (Depth-First Search, etc.) that benefit from the use of
a GPU. Ortega-Arranz et al. [88] and Kumar et al. [89] showed that algorithms
such as Bellman-Ford and Dijkstra for the Single Source Shortest Path
Problem (SSSPP) can also be enhanced by the use of a many-core processor.
GPGPU has given remarkable results also for the All Pairs Shortest Path
Problem (APSPP), solved by means of the Floyd-Warshall algorithm by Katz et
al. [56] and Lund et al. [57]. Maniezzo et al. [80] proved the effectiveness of
many-core platforms also for supply chain problems. All the
cited methods are based on the Dynamic Programming paradigm; indeed, the fine
computational granularity and the matrix-like data structures that often characterize
these algorithms fit particularly well on many-core architectures, where every
thread executes a relatively simple computational kernel. We decided to use the
CUDA parallel programming model because it is currently the best trade-off among
portability, usability and reliability. Nvidia also provides a good environment for
debugging and profiling applications (the Nsight debugger for Microsoft Visual Studio,
etc.) with effective functionalities.
5.4.1
GPU q-Route
The first relaxation that we consider is the q-route relaxation. In the next subsections we will expose the parallelism inside the method and we will describe the
proposed algorithm for the GPU.
5.4.1.1
Exposing Parallelism
Equations 5.11 and 5.12 describe the expansion rules for the creation of a new
state f(q, i). This is a backward recursion, because each new state is built
from the states of previous stages. This peculiarity has a very useful
consequence: all the states f(q, i) of stage q can be calculated
independently of one another, each using the states f(q − q_i, j), j ∈ Γ_i^{-1}, of
stage q − q_i, with i, j = 1, 2, ..., N. We can therefore evaluate all the f(q, i) states
of stage q at the same time, as depicted in Figure 5.3.
Figure 5.3: Stage q parallel computation.
5.4.1.2
Algorithm Description
In this section we provide the pseudo-code of the parallel algorithm and of the
computation kernel for the q-path recursion. Following the CUDA programming model, we assign one thread block to each state f(q, i) and T threads
to each block. We distribute the workload in this way so that we can
compute in parallel not only the states of the recursion (intra-grid parallelism),
but also the min operation required to compute each of them. In fact, we use the threads
of each block to evaluate in parallel the reductions (min operations) described in
equations 5.12 and 5.13 (intra-block parallelism), using the method suggested
in [59]. All the data structures provide linear access to their elements, that is,
all matrices are stored in row-major order. For the legibility of the algorithm,
we nevertheless use two indexes for the matrices.
Algorithm GPU Q-PATHS (N, Q, q_i, d)
1.  // Data Structures
2.  f, φ, π, γ
3.  // Kernel Setup
4.  BLOCKS B = N − 1, THREADS T, SHARED-MEM Sh[2 ∗ (N − 1)]
5.  // Main Loop
6.  for q = 0 to Q do
7.    Q-PATHS-KERNEL<<<B, T, Sh>>>(Q, N, q, q_i, d, f, φ, π, γ)
8.  return f, π, φ, γ
Algorithm Q-PATHS-KERNEL (Q, N, q, q_i, d, f, φ, π, γ)
1.  min1, min2 // Variables Initialization
2.  pred1, pred2
3.  h_sh[N − 1], π_sh[N − 1]
4.  thdidx = threadIdx.x, Nthds = blockDim.x, nodeidx = blockIdx.x
5.  slack = N % Nthds, times = N / Nthds
6.  if thdidx ≤ slack
7.    then times++
8.  // Shared Memory Initialization
9.  for t = 0 to times do
10.   h_sh[thdidx + t ∗ Nthds] = ∞
11.   π_sh[thdidx + t ∗ Nthds] = −1
12. syncthreads()
13. // Partial Minima Computation
14. for t = 0 to times do
15.   j = thdidx + t ∗ Nthds
16.   if π(q − q_i(nodeidx), j) ≠ nodeidx
17.     then h_sh[j] = f(q − q_i(nodeidx), j) + d(j, nodeidx)
18.          π_sh[j] = j
19.     else h_sh[j] = φ(q − q_i(nodeidx), j) + d(j, nodeidx)
20.          π_sh[j] = j
21. syncthreads()
22. // First reduction to get the f(q, nodeidx) value
23. GPU REDUCTION(h_sh, π_sh, times)
24. min1 = h_sh[0], pred1 = π_sh[0], h_sh[0] = ∞, π_sh[0] = −1
25. // Second reduction to get the φ(q, nodeidx) value
26. GPU REDUCTION(h_sh, π_sh, times)
27. min2 = h_sh[0], pred2 = π_sh[0]
28. // Matrices update
29. if nodeidx ≠ thdidx
30.   then f(q, nodeidx) = min1, π(q, nodeidx) = pred1
31.        φ(q, nodeidx) = min2, γ(q, nodeidx) = pred2
Algorithm GPU REDUCTION (h_sh, π_sh, times)
1.  // Keeping only the most promising value for each thread
2.  for t = 0 to times do
3.    if h_sh[thdidx] > h_sh[thdidx + t ∗ Nthds]
4.      then f_swap = h_sh[thdidx]
5.           h_sh[thdidx] = h_sh[thdidx + t ∗ Nthds]
6.           h_sh[thdidx + t ∗ Nthds] = f_swap
7.           // Update Predecessor
8.           π_swap = π_sh[thdidx]
9.           π_sh[thdidx] = π_sh[thdidx + t ∗ Nthds]
10.          π_sh[thdidx + t ∗ Nthds] = π_swap
11. syncthreads()
12. // Reduction
13. for s = Nthds/2 downto 1, s /= 2 do
14.   if thdidx < s
15.     then if h_sh[thdidx + s] < h_sh[thdidx]
16.       then f_swap = h_sh[thdidx]
17.            h_sh[thdidx] = h_sh[thdidx + s]
18.            h_sh[thdidx + s] = f_swap
19.            // Update Predecessor
20.            π_swap = π_sh[thdidx]
21.            π_sh[thdidx] = π_sh[thdidx + s]
22.            π_sh[thdidx + s] = π_swap
23.   syncthreads()
Lines 6-8 of the main procedure, GPU Q-PATHS, form the main loop of the
algorithm: for each stage q of the Dynamic Programming recurrence, the for cycle
launches the computation kernel on the GPU. We also define the size of the shared
memory for each block of the grid. This size is chosen to store the entries of the
h(i) vector defined in equation 5.11, one for each node in $\Gamma_i^{-1}$,
together with the predecessor node of each entry (in our case the number of
predecessors is N − 1, assuming that the digraph G is complete). Once the shared
memory has been initialized (lines 9-11 of the Q-PATHS-KERNEL procedure), we
compute the function value of each predecessor and store it in the hsh array,
together with the node's index in πsh (lines 14-21), according to equation 5.11.
In lines 23 and 26 we compute the minimum values for the f and φ functions by
calling the GPU REDUCTION procedure twice. This procedure, a device function in
the implementation, is based on the one described in [59]; we modified it so that
it swaps values instead of overwriting them, in order to keep track of all the
values inside the shared memory. In lines 2-10 each thread pre-selects the most
promising among the values assigned to it, and lines 13-23 then perform the
reduction that finds the minimum. Every thread handles one or more values
according to the indices assigned in lines 4-7 of the main kernel, which distribute
the problem nodes among the threads of the block. In line 24 of the main kernel we
store the first minimum and its predecessor and reset the result of the first
reduction. Finally, in lines 29-31, we update the data structures with the new
function values.
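The effect of swapping instead of overwriting can be illustrated with a small sequential sketch in Python. It is not the device function of the implementation: the inner loop over `tid` merely emulates the threads of one block, the padding to a power of two is an assumption made for the example, and the names `swap_min_reduction` and `two_smallest` are hypothetical.

```python
import math


def swap_min_reduction(h, pred):
    """Move the minimum of h (and its predecessor) to position 0 by swapping."""
    size = 1
    while size < len(h):
        size *= 2
    # Pad to a power of two so that the halving scheme covers every slot.
    h.extend([math.inf] * (size - len(h)))
    pred.extend([-1] * (size - len(pred)))
    s = size // 2
    while s > 0:
        for tid in range(s):  # each iteration stands for one GPU thread
            if h[tid + s] < h[tid]:
                h[tid], h[tid + s] = h[tid + s], h[tid]
                pred[tid], pred[tid + s] = pred[tid + s], pred[tid]
        s //= 2
    return h[0], pred[0]


def two_smallest(values):
    """Return (min1, pred1), (min2, pred2) with two successive reductions."""
    h = list(values)
    pred = list(range(len(values)))
    first = swap_min_reduction(h, pred)
    # Reset the winner: no other value was lost, because values were swapped.
    h[0], pred[0] = math.inf, -1
    second = swap_min_reduction(h, pred)
    return first, second


result = two_smallest([7.0, 3.0, 9.0, 4.0, 8.0])
```

Because every comparison exchanges the two entries, the array always holds the same multiset of values, so the second reduction can still find the second-best value after the first minimum has been reset.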
5.4.2
GPU through-q-route
In this section we describe the parallel algorithm for the through-q-route relaxation. This method, as described in 5.3.2, is based on the function values obtained
from the q-paths relaxation. To minimize the memory transactions between
the CPU and the GPU, we first compute the q-paths and then, keeping the results in
GPU memory, run the GPU algorithm for the through-q-routes.
5.4.2.1
Exposing Parallelism
According to equation 5.15, the computation is strictly combinatorial and each
ψ(q, i) function value is independent of the others. For each node i of the graph
(each column of the matrix) we compute in parallel all the function values along
the quantity dimension q, which is often the larger and computationally more
expensive one. In the next paragraph we provide the pseudo-code for the GPU
kernel together with a brief description of the algorithm.
5.4.2.2
Algorithm Description
In this section we describe the GPU kernel designed to compute the through-q-route relaxation.
Algorithm GPU THROUGH-Q ROUTES (N, Q, f, qi, φ, π, γ)
1. // Data Structures
2. ψ
3. // Kernel Setup
4. BLOCKS B = Q, THREADS T, SHARED-MEM Sh[T]
5. // Main Loop
6. for i = 1 to N do
7.     THROUGH-Q-ROUTES-KERNEL<<< B, T, Sh >>>(Q, i, qi, d, f, φ, π, γ, ψ)
8. return ψ
Algorithm THROUGH-Q-ROUTES-KERNEL(Q, N, i, qi, f, φ, π, γ, ψ)
1. sh[T] // Shared memory
2. NThds = blockDim.x, thdidx = threadIdx.x
3. q = blockIdx.x, startidx = qi(i), endidx = (q + qi(i))/2, diff = endidx − startidx
4. // Shared Memory Initialization
5. sh[thdidx] = ∞
6. times = diff / NThds, slack = diff % NThds
7. if thdidx < slack
8.     then times++
9. syncthreads()
10. for t = 0 to times do
11.     q̄ = thdidx + startidx + (t ∗ NThds)
12.     if π(q̄, i) ≠ π(q + qi(i) − q̄, i)
13.         then fnew = f(q̄, i) + f(q + qi(i) − q̄, i)
14.              if sh[thdidx] > fnew
15.                  then sh[thdidx] = fnew
16.         else φa = f(q̄, i) + φ(q + qi(i) − q̄, i)
17.              φb = φ(q̄, i) + f(q + qi(i) − q̄, i)
18.              φnew = 0, (φa < φb) ? φnew = φa : φnew = φb
19.              if sh[thdidx] > φnew
20.                  then sh[thdidx] = φnew
21. syncthreads()
22. // Parallel reduction
23. for s = NThds/2 to 0, s /= 2 do
24.     if thdidx < s
25.         then if sh[thdidx + s] < sh[thdidx]
26.             then sh[thdidx] = sh[thdidx + s]
27.     syncthreads()
28. syncthreads()
29. // Data Update
30. if thdidx == 0
31.     then ψ(q, i) = sh[0]
In line 3 we define the data indices for the block associated with q; the diff
variable gives the range of indices for the block. In line 5 we initialize the
shared memory and in lines 6-8, as in the q-paths kernel, we define the number of
indices handled by each thread of the block. In lines 10-20 we compute the partial
results and store them in shared memory, according to equation 5.15; in line 11 we
define the index q̄ for each thread. Once the partial results have been computed,
we perform a standard parallel reduction (lines 23-28) in shared memory, as
described in [59]. In this case we do not need to keep all the values in shared
memory, since only the minimum for each state is required. Finally, thread 0
updates the ψ(q, i) function value (lines 30-31).
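The work done by one block of the kernel, that is the value ψ(q, i) for a fixed node i, can be sketched sequentially in Python as follows. This is only an illustration under stated assumptions: `f`, `phi` and `pi` are hypothetical one-dimensional lists holding f(·, i), φ(·, i) and π(·, i) for the fixed node, and the upper end of the range of q̄ is taken as inclusive, which the pseudo-code leaves open.

```python
import math


def through_q_route(q, q_i, f, phi, pi):
    """Sequential sketch of psi(q, i) for one node i with demand q_i."""
    best = math.inf
    total = q + q_i
    for q_bar in range(q_i, total // 2 + 1):
        rest = total - q_bar
        if pi[q_bar] != pi[rest]:
            # Different predecessors: the two best paths can be joined.
            candidate = f[q_bar] + f[rest]
        else:
            # Same predecessor: one of the two halves uses its second-best path.
            candidate = min(f[q_bar] + phi[rest], phi[q_bar] + f[rest])
        best = min(best, candidate)
    return best


# Hypothetical data for a node with demand 2 and load levels 0..6.
inf = math.inf
f = [inf, inf, 5.0, 6.0, 7.0, 8.0, 9.0]
phi = [inf, inf, 6.0, 7.0, 9.0, 10.0, 11.0]
pi = [-1, -1, 1, 2, 3, 1, 1]
psi = through_q_route(6, 2, f, phi, pi)
```

On the GPU the loop over q̄ is split among the threads of the block and the final `min` becomes the parallel reduction in shared memory.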
5.4.3
GPU ng-Route
Unlike the other relaxations, the ng-route is computationally more expensive, but
it is numerically more effective than the previous two. The main difficulties of
this method are the efficient management of the NG sets and of the dominance among
them. In fact, the NG set associated with each f(q, i) stage, and its cardinality, are dynamic. Dynamic data structures are undesirable in a GPGPU environment, since searching
and managing them severely degrades performance. In the next
sections we describe our strategies for addressing these problems and then a
parallel algorithm for the GPU.
5.4.3.1
Exposing Parallelism
The first problem that emerges when porting this relaxation to a many-core
platform is the dynamic nature of the NG 'dimension' of the recurrence. A static
data structure is almost mandatory for exploiting the computational capabilities of
these devices. We addressed the problem as follows. As equations 5.17 and 5.19
show, the NG set of each stage (i, q) is completely contained in the set $N_i$
chosen for node i; moreover, every NG set contains its own node i. From these
observations it follows that the complete enumeration of all the possible NG sets
for the stage (i, q) is the set $NGP_i \subseteq P(N_i)$, that is, the subset of
the power set of $N_i$ composed only of the subsets that contain i.
The cardinality of the power set of $N_i$ is $2^{\Delta(N_i)}$, but the cardinality
of the NG dimension is $2^{\Delta(N_i)-1}$, because we take into account only the
sets containing i. For reasonable values of $\Delta(N_i)$ (10-14) we can easily
enumerate all the possible sets for the NG dimension, which makes this dimension
static too. As illustrated in figure 5.4, all the states and stages of the
recursion can be described as a 3-dimensional cube: the vehicle's capacity Q, the
customers/nodes i and the indices of the NG sets. The third dimension contains only
the indices of these sets: as described in the following sections, the sets can be
indexed and the indices used to define the dimension. For our purposes, in order to
expose the parallelism of the method, we can reformulate equation 5.20 in its
forward form:
$$f(k, q + q_i, NG) = \min_{(i, q, NG') \in \Psi(k, q + q_i, NG)} \{ f(i, q, NG) + d_{ik} \}, \quad \forall (k, q + q_i, NG) \in \Phi. \qquad (5.21)$$
From equation 5.21 we can observe that all the states f(k, q + qk, NG) can be computed independently, starting from all the states (i, q, NG′) in the set of
stages q.
5.4.3.2
Dominance Management
To manage the dominance among the states during the expansion, we decided to
pre-compute the transitions among the NG sets of each node. We create a transition
map that takes as input the NG set of the starting node, the index of the starting
node j and the end node i, and returns the index of the new NG set among the
enumerated ones of i. We indexed the NG sets of each node i by means of a bitmap.
As described before, we enumerate all the possible NG sets of each node, obtained
from the corresponding $N_i$ through its power set. We define the mask for the NG
set as $mask = \{b_0, b_1, \ldots, b_h\}$, $h = 2^{\Delta(N_i)-1}$, $b \in \{0, 1\}$,
and the mask for each NG is defined as:
$$mask(h) = \begin{cases} 1 & \text{if } NG(k) \in N_i,\ k = 0, \ldots, |NG| \\ 0 & \text{otherwise} \end{cases} \qquad (5.22)$$
Based on this mask, we can define the index for the NG:
$$NG_{index} = \sum_{h=0}^{\Delta(N_i)-1} b_h \cdot 2^h \qquad (5.23)$$
For example, given $NG = \{7, 2, 4, 6\} \in NGP_7$, $N_7 = \{7, 2, 4, 6, 8, 10\}$ and $\Delta(N_7) = 6$,
we have $mask_{NG} = \{1, 1, 1, 1, 0, 0\}$ and the index $NG_{index} = 1 \cdot 2^0 + 1 \cdot 2^1 + 1 \cdot 2^2 + 1 \cdot 2^3 + 0 \cdot 2^4 + 0 \cdot 2^5 = 1 + 2 + 4 + 8 + 0 + 0 = 15$.
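The indexing of equation 5.23 can be reproduced with a few lines of Python. This is a minimal sketch, assuming that the set $N_i$ is stored as an ordered list whose position h corresponds to bit h of the mask; the function name `ng_index` is hypothetical.

```python
def ng_index(ng_set, neighbourhood):
    """Index of an NG set: bit h is set when neighbourhood[h] belongs to it."""
    return sum(1 << h for h, node in enumerate(neighbourhood) if node in ng_set)


# The example of the text: NG = {7, 2, 4, 6} inside N_7 = {7, 2, 4, 6, 8, 10}.
n_7 = [7, 2, 4, 6, 8, 10]
index = ng_index({7, 2, 4, 6}, n_7)
```

With this data `index` is 15, the value obtained in the worked example.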
Each combination of ones and zeros of the NG mask is thus given a unique index.
This unique index lets us know in advance the index of the new NG set created by
the expansion to another state. Given this indexing of the NG sets, we can define
the transition map among the NG sets of each node: given $NG_{index}$, the starting
node j and the set $N_i$ of the destination node i, the mapping function returns
the index $NG'_{index}$ of the set $NG' \in NGP_i$.
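A transition map of this kind could be pre-computed as in the following Python sketch. It assumes the usual ng-route extension rule, which is not restated in this section: a label with set NG at node j can be extended to node i only if i is not in NG, and the new set is $(NG \cap N_i) \cup \{i\}$. All function names are hypothetical.

```python
def ng_index(ng_set, neighbourhood):
    """Index of an NG set: bit h is set when neighbourhood[h] belongs to it."""
    return sum(1 << h for h, node in enumerate(neighbourhood) if node in ng_set)


def ng_set_from_index(index, neighbourhood):
    """Inverse of ng_index: rebuild the NG set from its bit mask."""
    return {node for h, node in enumerate(neighbourhood) if (index >> h) & 1}


def build_transition_map(j, n_j, i, n_i):
    """Map each NG index of node j to the NG index reached at node i."""
    table = {}
    for index in range(1 << len(n_j)):
        ng = ng_set_from_index(index, n_j)
        if j not in ng or i in ng:
            # Either not a valid set for j, or the extension to i is forbidden.
            continue
        new_ng = (ng & set(n_i)) | {i}  # assumed extension rule
        table[index] = ng_index(new_ng, n_i)
    return table


# Hypothetical neighbourhoods: N_7 = [7, 2, 4] and N_2 = [2, 4, 9].
transitions = build_transition_map(7, [7, 2, 4], 2, [2, 4, 9])
```

Once the table is built, the expansion of a label reduces to a lookup, so no set operation has to be performed on the device.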
Figure 5.4: NG path 3-D states space.
5.4.3.3
Active Sets
Exploiting equation 5.16, we can pre-compute the sets NG ∈ NGPi that are active
during the recursion. In fact, not all the enumerated NG sets are actually used
by the method. For each stage q, we can retrieve all the NG sets involved in
the computation simply by running the recursion once, without computing f, before
the main part of the algorithm in which the ng-relaxation is used, exploiting the
transition map defined in the previous paragraph. This property allows us to
apply a modified version of the so-called 'thread compaction', described in [86]
and analyzed in [90]. This method consists in creating a mask that lets the GPU
spawn only the threads that are useful for the computation. Using this technique
together with the property described above, we can consider only the states
actually useful for the relaxation and size the device's resources accordingly,
avoiding the overhead induced by idle threads. The active sets of each stage,
together with the indexing of the NG sets, also improve the serial version of the
method: we avoid the overhead of the dominance checks and drastically reduce the
computation at each stage. We propose a modified version of the serial algorithm
in the next paragraph.
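The pre-computation of the active sets can be pictured as a dry run of the recursion that records which (load, node, NG set) combinations are reachable, without evaluating any cost. The Python sketch below is only illustrative: it stores the sets explicitly instead of their indices, it assumes strictly positive demands and the extension rule $(NG \cap N_i) \cup \{i\}$, and the name `active_sets` and the data layout are hypothetical.

```python
def active_sets(demands, capacity, neighbourhoods):
    """Return {(load, node): set of reachable NG sets}; no cost is computed."""
    # Initial labels: the path depot -> i, whose NG set is {i}.
    active = {(q_i, i): {frozenset([i])}
              for i, q_i in demands.items() if q_i <= capacity}
    for q in range(capacity + 1):
        for j in demands:
            for ng in list(active.get((q, j), ())):
                for i, q_i in demands.items():
                    if i in ng or q + q_i > capacity:
                        continue  # forbidden or infeasible extension
                    new_ng = frozenset((ng & set(neighbourhoods[i])) | {i})
                    active.setdefault((q + q_i, i), set()).add(new_ng)
    return active


# Hypothetical instance: two customers with demands 1 and 2, capacity 3.
demands = {1: 1, 2: 2}
neighbourhoods = {1: [1, 2], 2: [2, 1]}
active = active_sets(demands, 3, neighbourhoods)
```

In this toy instance only four (load, node) pairs carry a label, while a full enumeration would allocate a slot for every load level, node and NG set.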
5.4.3.4
Algorithm Description
In this section we first describe the enhancement of the serial algorithm based on
the previous considerations, then we propose a parallel kernel for the GPU.
Algorithm NG-PATHS 2 (N, Q, d, qi, ActiveSets, TransMap)
1. // Data Structures
2. f, πn, πng
3. // Initialization
4. for q = 0 to Q do
5.     for i = 1 to N do
6.         for ng = 0 to |NGlist(i, q)| do
7.             if qi(i) == q
8.                 then f(q, i, 0) = d(0, i), πn(q, i, 0) = 0, πng(q, i, 0) = 0
9.                 else f(q, i, 0) = ∞, πn(q, i, 0) = −1, πng(q, i, 0) = −1
10. // ng-Paths
11. for q = 0 to Q do
12.     for i = 1 to N do
13.         for j = 1 to N do
14.             distij = dist(i, j)
15.             for h = 0 to |ActiveSets(q − qi(i), j)| do
16.                 NGindex = ActiveSets(q − qi, j, h)
17.                 NG′index = TransMap(i, j, NGindex)
18.                 if f(q, i, NG′index) > f(q − qi(i), j, NGindex) + distij
19.                     then f(q, i, NG′index) = f(q − qi(i), j, NGindex) + distij
20.                          πn(q, i, NG′index) = j
21.                          πng(q, i, NG′index) = NGindex
22. return f, πn, πng
In lines 15-17 we introduce the ActiveSets and TransMap data structures, which
give us the indices of the NG sets to update.
Algorithm GPU NG-PATHS (N, Q, d, qi, ActiveSets, TransMap)
1. // Data Structures
2. f, πn, πng, SLabelsN, SLabelsNG, ActThds
3. // The variables f, πn, πng are initialized in the same fashion of the serial algorithm
4. // Number of Active Threads for each set of stages q
5. for q = 0 to Q do
6.     for i = 1 to N do
7.         ActThds(q) += |ActSets(q, i)|
8. for q = 0 to Q do // Initialize StartLabels
9.     for i = 1 to N do
10.         for h = 0 to |ActSets(q, i)| do
11.             NG = ActSets(q, i, h)
12.             SLabelsN(q).Add(i)
13.             SLabelsNG(q).Add(NG)
14. // Main Loop
15. for q = 0 to Q do
16.     // Kernel Setup
17.     THDS = T
18.     BLK.y = N − 1, BLK.x = |ActThds(q)|/THDS
19.     NG-PATHS-KERNEL<<< BLKS, THDS >>>
20.         (q, qi, Map, d, f, πn, πng, ActSets, SLabelsNodes, SLabelsNG)
21. return f, πn, πng
Algorithm NG-PATHS-KERNEL(q, qi, Map, d, f, πn, πng, ActSets, SLabelsN, SLabelsNG)
1. idx = blockIdx.x ∗ blockDim.x + threadIdx.x
2. i = blockIdx.y
3. j = SLabelsN(q, idx), NGindex = SLabelsNG(q, idx)
4. dist = d(j, i)
5. NG′index = Map(i, j, NGindex)
6. if f(q + qi(i), i, NG′index) > f(q, j, NGindex) + dist
7.     then f(q + qi(i), i, NG′index) = f(q, j, NGindex) + dist
8.          πn(q + qi(i), i, NG′index) = j
9.          πng(q + qi(i), i, NG′index) = NG
Inside the main procedure, GPU NG-PATHS, lines 5-7 count the number of active
labels for each stage q. Using this value, inside the main loop of the procedure,
we spawn the necessary number of threads for each iteration of the loop (line 19).
In lines 8-13 we initialize the StartLabelsNodes and StartLabelsNG structures,
which contain, for each q, the indices of the active NG sets of each node. The main
loop is described in lines 15-20. As in the serial version of the algorithm, we
return the array f with the function values, the array πn of the predecessor nodes
and the array πng of the predecessor paths (line 21).
Inside the GPU kernel, NG-PATHS-KERNEL, lines 1-3 define the indices of the label;
line 5 uses the transition map to find the index of the new NG set for the
expanded label; lines 6-9 update the new label, if necessary. The operations in
these lines are implemented using the atomicMin() CUDA primitive, which manages the
concurrent update of a single variable by several threads and thus avoids race
conditions and inconsistent results.
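The host-side preparation of the start labels for one stage q can be sketched in Python as follows. It is an illustration, not the thesis code: `active_sets` is a hypothetical dictionary standing in for ActSets(q, ·), and the number of blocks is computed with a ceiling division, an assumption made here so that a partially filled last block is still launched.

```python
import math


def build_start_labels(active_sets, num_nodes, threads_per_block):
    """Flatten the active (node, NG index) pairs of one stage q.

    active_sets maps a node to the list of its active NG indices.
    Returns the two flat label arrays and the number of blocks along x.
    """
    labels_node = []
    labels_ng = []
    for i in range(1, num_nodes + 1):
        for ng_index in active_sets.get(i, []):
            labels_node.append(i)
            labels_ng.append(ng_index)
    # One thread per active label, rounded up to whole blocks.
    blocks_x = math.ceil(len(labels_node) / threads_per_block)
    return labels_node, labels_ng, blocks_x


# Hypothetical stage with three active labels and two threads per block.
nodes, ngs, blocks = build_start_labels({1: [1, 5], 3: [4]}, 3, 2)
```

With a rounded-up grid, a thread whose global index is not smaller than the number of labels would have to return immediately; this guard is likewise an assumption of the sketch.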
5.4.4
Asymmetric Relaxations
The asymmetric case of the VRP is characterized by $d_{ij} \neq d_{ji}$. For the
q-path and ng-path relaxations we are forced to compute the relaxation twice, once
using the d matrix (forward) and once using its transpose $d^T$ (backward). For
the through-q-route relaxation we have to use the output of the asymmetric
computation of the q-paths, as described in the ASY THROUGH-Q ROUTES algorithm. We propose an effective parallel approach to solve this version of the VRP
that exploits all the parallel features of a GPU.
A stream is a sequence of operations that execute in issue order on the GPU.
More intuitively, a GPU can execute multiple kernels and memory transactions
concurrently, overlapping the operations (e.g. transferring the data for kernel 2
from the host to the GPU while kernel 1 is executing). This is possible because
different engines manage the execution and the memory transfer operations.
Depending on the number of simultaneous streams supported by the GPU (typically 4),
we can hide a memory transaction behind the computation of a kernel and then
process the loaded data with another kernel. This feature allows us to compute the
two relaxations (forward and backward) concurrently on the same GPU, introducing
another level of parallelism (among kernels). In our case we do not use the
streams to hide the memory transactions between the CPU and the GPU, but to
execute the same kernel with different data on the same GPU, in a typical SIMD
approach.
In the following we give the pseudo-code for these algorithms. The kernels are the
same as those described in the previous paragraph. The main difference is the use
of the cudaMemcpyAsync() primitive, which operates on page-locked (pinned) host
memory; for its characteristics we refer to the official CUDA documentation.
Algorithm GPU-Q-PATHS ASY (N, Q, qi, d, d^T)
1. // Data Structures
2. f^fw, φ^fw, π^fw, γ^fw
3. f^bw, φ^bw, π^bw, γ^bw
4. // Stream Initialization
5. Stream FW, Stream BW
6. // Kernel Setup
7. BLOCKS B = N − 1, THREADS T, SHARED-MEM Sh[2 ∗ (N − 1)]
8. // Main Loop
9. for q = 0 to Q do
10.     Q-PATHS-KERNEL<<< B, T, Sh, FW >>>(Q, q, qi, d, f^fw, φ^fw, π^fw, γ^fw)
11.     Q-PATHS-KERNEL<<< B, T, Sh, BW >>>(Q, q, qi, d^T, f^bw, φ^bw, π^bw, γ^bw)
12. return f^fw, π^fw, φ^fw, γ^fw, f^bw, π^bw, φ^bw, γ^bw
Algorithm GPU NG-PATHS ASY (N, Q, d, d^T, qi, ActSets^fw, ActSets^bw, Map^fw, Map^bw)
1. // Data Structures
2. f^fw, πn^fw, πng^fw, SLabelsN^fw, SLabelsNG^fw, ActThds^fw
3. f^bw, πn^bw, πng^bw, SLabelsN^bw, SLabelsNG^bw, ActThds^bw
4. // The variables f, πn, πng are initialized in the same fashion of the serial algorithm
5. // Number of Active Threads for each set of stages q
6. for q = 0 to Q do
7.     for i = 1 to N do
8.         ActThds^fw(q) += |ActSets^fw(q, i)|
9.         ActThds^bw(q) += |ActSets^bw(q, i)|
10. // Initialize StartLabels
11. for q = 0 to Q do
12.     for i = 1 to N do
13.         for h = 0 to |ActSets^fw(q, i)| do
14.             NG = ActSets^fw(q, i, h)
15.             SLabelsN^fw(q).Add(i)
16.             SLabelsNG^fw(q).Add(NG)
17.         for h = 0 to |ActSets^bw(q, i)| do
18.             NG = ActSets^bw(q, i, h)
19.             SLabelsN^bw(q).Add(i)
20.             SLabelsNG^bw(q).Add(NG)
21. // Stream Initialization
22. Stream FW, Stream BW
23. // Main Loop
24. for q = 0 to Q do
25.     // Kernel Setup
26.     THDS = T
27.     BLK.y = N − 1, BLK.x = |ActThds^fw(q)|/THDS
28.     NG-PATHS-KERNEL<<< BLKS, THDS, 0, FW >>>
29.         (N, Q, q, qi, Map^fw, d, f^fw, πn^fw, πng^fw, ActSets^fw,
30.          SLabelsN^fw, SLabelsNG^fw)
31.     BLK.y = N − 1, BLK.x = |ActThds^bw(q)|/THDS
32.     NG-PATHS-KERNEL<<< BLKS, THDS, 0, BW >>>
33.         (N, Q, q, qi, Map^bw, d^T, f^bw, πn^bw, πng^bw, ActSets^bw,
34.          SLabelsN^bw, SLabelsNG^bw)
35. return f^fw, πn^fw, πng^fw, f^bw, πn^bw, πng^bw
The through-q-route algorithm only changes the data structures in input:
Algorithm GPU THROUGH-Q-ROUTES ASY (N, Q, qi, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw)
1. // Data Structures
2. ψ
3. // Kernel Setup
4. BLOCKS B = Q, THREADS T, SHARED-MEM Sh[T]
5. // Main Loop
6. for i = 1 to N do
7.     THROUGH-Q-ROUTES-KERNEL ASY<<< B, T, Sh >>>(Q, i, qi, f^fw, φ^fw, π^fw,
           γ^fw, f^bw, φ^bw, π^bw, γ^bw, ψ)
8. return ψ
Algorithm THROUGH-Q-ROUTES-KERNEL ASY (Q, N, i, qi, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw, ψ)
1. sh[T] // Shared memory
2. NThds = blockDim.x, thdidx = threadIdx.x
3. q = blockIdx.x, startidx = qi(i), endidx = (q + qi(i))/2, diff = endidx − startidx
4. // Shared Memory Initialization
5. sh[thdidx] = ∞
6. times = diff / NThds, slack = diff % NThds
7. if thdidx < slack
8.     then times++
9. syncthreads()
10. for t = 0 to times do
11.     q̄ = thdidx + startidx + (t ∗ NThds)
12.     if π^fw(q̄, i) ≠ π^bw(q + qi(i) − q̄, i)
13.         then fnew = f^fw(q̄, i) + f^bw(q + qi(i) − q̄, i)
14.              if sh[thdidx] > fnew
15.                  then sh[thdidx] = fnew
16.         else φa = f^fw(q̄, i) + φ^bw(q + qi(i) − q̄, i)
17.              φb = φ^fw(q̄, i) + f^bw(q + qi(i) − q̄, i)
18.              φnew = 0, (φa < φb) ? φnew = φa : φnew = φb
19.              if sh[thdidx] > φnew
20.                  then sh[thdidx] = φnew
21. syncthreads()
22. // Parallel reduction
23. for s = NThds/2 to 0, s /= 2 do
24.     if thdidx < s
25.         then if sh[thdidx + s] < sh[thdidx]
26.             then sh[thdidx] = sh[thdidx + s]
27.     syncthreads()
28. syncthreads()
29. // Data Update
30. if thdidx == 0
31.     then ψ(q, i) = sh[0]
5.5
Computational Results
In this section we report the experimental results of our algorithms. Each table
reports the execution time of a single run of the methods. All the GPU times
include the time needed to load and store the data between the CPU and the GPU. We
chose to include these times because the relaxations are called repeatedly inside
a CG algorithm, so the load and store times have a significant impact on
performance.
The test machine is a workstation equipped with an Intel i7 4820K @3.9 GHz with
32 Gigabytes of RAM and an Nvidia GTX TITAN with 2688 CUDA cores @837 MHz
and 6 Gigabytes of GDDR5 RAM, provided by SINTEF [91]. The data sets are
those of the VRPLIB [92], and the bigger instances are those provided by
[93]. We report the speed-up factor between the serial and the parallel versions of
the methods, defined as the ratio between the execution times of the serial and the
parallel algorithms: SpeedUp = Time_Serial / Time_Parallel.
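As a worked instance of this ratio, the short Python sketch below applies it to the combined q-path and through-q-route times reported for the V1200-2500 instance (73.118 on the CPU and 2.935 on the GPU). The function name `speed_up` is hypothetical.

```python
def speed_up(time_serial, time_parallel):
    """Speed-up factor: serial execution time over parallel execution time."""
    return time_serial / time_parallel


# Combined times of the V1200-2500 instance (CPU and GPU columns of table 5.1).
combined = speed_up(73.118, 2.935)
```

The result is about 24.9; the small difference from the tabulated 24.914 X is presumably due to the speed-ups having been computed from unrounded times.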
Instances                         CPU Algorithm Times                  GPU Algorithm Times                  Speed-Up
Name         Nodes  Capacity   q-path  t-q-route  q + through-q   q-path  t-q-route  q + through-q   q-paths    t-q-route  q + through-q
V560-1200      560      1200    1.045      2.325          3.370    0.140      0.210          0.350    7.455 X    11.055 X        9.615 X
V600-900       600       900    0.874      1.311          2.185    0.101      0.128          0.228    8.692 X    10.257 X        9.568 X
V640-1400      640      1400    1.576      4.695          6.271    0.177      0.284          0.461    8.897 X    16.527 X       13.596 X
V720-1500      720      1500    2.215      6.677          8.892    0.251      0.363          0.614    8.831 X    18.369 X       14.475 X
V760-900       760       900    1.857      2.340          4.197    0.162      0.149          0.311   11.483 X    15.717 X       13.513 X
V800-1700      800      1700    4.805     11.497         16.302    0.362      0.511          0.873    5.126 X    22.489 X       18.663 X
V840-900       840       900    3.027      2.932          5.959    0.209      0.165          0.373   14.514 X    17.807 X       15.967 X
V880-1800      880      1800    7.004     15.241         22.245    0.486      0.627          1.113   14.424 X    24.289 X       19.985 X
V960-2000      960      2000   10.046     23.353         33.399    0.628      0.837          1.465   15.995 X    27.910 X       22.802 X
V1040-2100    1040      2100   13.166     29.640         42.806    0.869      0.994          1.862   15.158 X    29.831 X       22.987 X
V1120-2300    1120      2300   16.832     40.154         56.986    1.061      1.276          2.336   15.869 X    31.479 X       24.392 X
V1200-2500    1200      2500   21.092     52.026         73.118    1.328      1.607          2.935   15.878 X    32.385 X       24.914 X

Table 5.1: Computational results for the symmetric q-path and through-q-route relaxations.
Instances                             CPU Algorithm Times                  GPU Algorithm Times                  SpeedUp
Name             Nodes  Capacity   q-path  t-q-route  q + through-q   q-path  t-q-route  q + through-q   q-paths    t-q-route  q + through-q
A034-02f            34      1000    0.008      0.031          0.039    0.024      0.007          0.031    0.334 X     4.722 X        1.278 X
A036-03f            36      1000    0.015      0.031          0.031    0.024      0.007          0.031    0.619 X     4.557 X        0.999 X
A039-03f            39      1000    0.015      0.031          0.046    0.026      0.007          0.033    0.579 X     4.213 X        1.382 X
A045-03f            45      1000    0.015      0.047          0.062    0.026      0.009          0.034    0.578 X     5.505 X        1.799 X
A048-03f            48      1000    0.016      0.047          0.063    0.025      0.009          0.035    0.634 X     4.986 X        1.817 X
A056-03f            56      1000    0.015      0.063          0.078    0.026      0.011          0.037    0.584 X     5.777 X        2.131 X
A065-03f            65      1000    0.016      0.093          0.109    0.026      0.013          0.039    0.606 X     7.388 X        2.796 X
A071-03f            71      1000    0.031      0.078          0.109    0.026      0.014          0.040    1.184 X     5.642 X        2.724 X
Balman859-1000     859      1000    6.100      3.026          9.126    0.393      0.122          0.515   15.514 X    24.781 X       17.710 X
Balman859-2000     859      2000   13.385     17.113         30.498    0.791      0.502          1.292   16.929 X    34.102 X       23.597 X

Table 5.2: Computational results for the asymmetric q-path and through-q-route relaxations.
Instances                        CPU Algorithm Times
Name         Nodes  Capacity     NG:8    NG:10    NG:12    NG:13    NG:14
A-n62-k8        62       100    0.047    0.093    0.281    0.421    0.717
A-n63-k10       63       100    0.047    0.094    0.234    0.359    0.546
A-n64-k9        64       100    0.047    0.093    0.249    0.375    0.515
A-n80-k10       80       100    0.078    0.187    0.484    0.733    1.061
B-n50-k8        50       100    0.031    0.094    0.296    0.484    0.827
B-n68-k9        68       100    0.094    0.234    0.639    0.842    1.373
B-n78-k10       78       100    0.110    0.343    0.998    1.482    2.511
E-n51-k5        51       160    0.063    0.125    0.359    0.593    0.920
E-n76-k7        76       220    0.156    0.421    1.279    2.043    3.260
E-n76-k8        76       180    0.125    0.296    0.921    1.420    2.215
E-n76-k10       76       140    0.078    0.187    0.530    0.811    1.216
E-n76-k14       76       100    0.031    0.078    0.203    0.297    0.390
E-n101-k8      101       200    0.272    0.795    2.371    3.619    5.523
E-n101-k14     101       112    0.110    0.280    0.765    1.108    1.623
F-n135-k7      135      2210    9.516   30.342   91.853      Out      Out
M-n121-k7      121       200    0.671    2.169    9.345   19.407   38.876
M-n151-k12     151       200    0.577    1.591    4.181    6.692   10.358
M-n200-k16     200       200    0.983    2.901    7.301   10.905   15.928
M-n200-k17     200       200    0.998    2.902    7.332   10.888   15.943
P-n50-k8        50       120    0.031    0.047    0.109    0.171    0.250
P-n70-k10       70       135    0.047    0.125    0.359    0.530    0.749

                 GPU Algorithm Times
Name             NG:8    NG:10    NG:12    NG:13    NG:14
A-n62-k8        0.004    0.007    0.021    0.047    0.080
A-n63-k10       0.003    0.007    0.020    0.046    0.070
A-n64-k9        0.003    0.007    0.021    0.034    0.066
A-n80-k10       0.005    0.012    0.036    0.067    0.113
B-n50-k8        0.003    0.007    0.024    0.042    0.095
B-n68-k9        0.006    0.021    0.068    0.088    0.167
B-n78-k10       0.010    0.032    0.109    0.199    0.290
E-n51-k5        0.004    0.008    0.025    0.062    0.084
E-n76-k7        0.010    0.026    0.081    0.194    0.305
E-n76-k8        0.008    0.019    0.071    0.139    0.222
E-n76-k10       0.005    0.012    0.036    0.067    0.124
E-n76-k14       0.003    0.006    0.018    0.044    0.065
E-n101-k8       0.017    0.046    0.182    0.306    0.528
E-n101-k14      0.007    0.018    0.061    0.098    0.170
F-n135-k7       0.480    1.634    5.694      Out      Out
M-n121-k7       0.045    0.246    1.203    2.710    5.731
M-n151-k12      0.034    0.099    0.323    0.556    0.978
M-n200-k16      0.064    0.198    0.611    0.956    1.524
M-n200-k17      0.064    0.197    0.617    0.966    1.527
P-n50-k8        0.002    0.005    0.012    0.024    0.044
P-n70-k10       0.004    0.009    0.035    0.048    0.089

                 SpeedUp
Name               NG:8       NG:10       NG:12       NG:13       NG:14
A-n62-k8       12.796 X    12.510 X    13.646 X     8.875 X     8.947 X
A-n63-k10      14.030 X    12.974 X    11.958 X     7.775 X     7.813 X
A-n64-k9       14.766 X    13.358 X    11.906 X    10.969 X     7.830 X
A-n80-k10      14.888 X    15.004 X    13.425 X    10.927 X     9.356 X
B-n50-k8       10.431 X    12.835 X    12.530 X    11.433 X     8.715 X
B-n68-k9       14.925 X    10.928 X     9.441 X     9.530 X     8.220 X
B-n78-k10      11.542 X    10.754 X     9.192 X     7.442 X     8.659 X
E-n51-k5       16.475 X    15.543 X    14.089 X     9.607 X    10.916 X
E-n76-k7       15.388 X    16.476 X    15.775 X    10.533 X    10.680 X
E-n76-k8       16.346 X    15.811 X    12.894 X    10.228 X     9.975 X
E-n76-k10      14.684 X    15.352 X    14.536 X    12.174 X     9.773 X
E-n76-k14      10.375 X    12.597 X    11.361 X     6.688 X     5.998 X
E-n101-k8      15.826 X    17.271 X    13.042 X    11.824 X    10.453 X
E-n101-k14     15.621 X    15.862 X    12.535 X    11.344 X     9.520 X
F-n135-k7      19.811 X    18.565 X    16.131 X         Out         Out
M-n121-k7      14.871 X     8.817 X     7.767 X     7.160 X     6.783 X
M-n151-k12     16.731 X    16.083 X    12.952 X    12.038 X    10.587 X
M-n200-k16     15.292 X    14.629 X    11.946 X    11.408 X    10.450 X
M-n200-k17     15.501 X    14.759 X    11.887 X    11.272 X    10.443 X
P-n50-k8       14.299 X     9.918 X     8.804 X     7.001 X     5.685 X
P-n70-k10      12.401 X    13.832 X    10.362 X    10.954 X     8.439 X

Table 5.3: Computational results for the symmetric ng-path relaxation.
Instances                      CPU Algorithm Times
Name       Nodes  Capacity     NG:8    NG:10    NG:12    NG:13    NG:14
A034-02f      34      1000    0.422    1.045    3.151    5.055    8.096
A036-03f      36      1000    0.406    1.014    3.230    4.711    7.629
A039-03f      39      1000    0.406    1.185    3.417    5.289    8.205
A045-03f      45      1000    0.546    1.404    4.259    6.942   12.137
A048-03f      48      1000    0.671    1.809    5.835    9.625   18.205
A056-03f      56      1000    0.905    2.730    7.784   14.367   21.918
A065-03f      65      1000    1.154    3.213    9.423   15.398   25.365
A071-03f      71      1000    1.326    3.869   11.684   19.625   30.639

               GPU Algorithm Times
Name           NG:8    NG:10    NG:12    NG:13    NG:14
A034-02f      0.022    0.035    0.086    0.128    0.199
A036-03f      0.023    0.037    0.086    0.127    0.194
A039-03f      0.024    0.038    0.091    0.146    0.221
A045-03f      0.027    0.053    0.138    0.238    0.443
A048-03f      0.030    0.079    0.239    0.427    0.956
A056-03f      0.041    0.114    0.331    0.687    1.102
A065-03f      0.053    0.122    0.398    0.720    1.298
A071-03f      0.057    0.160    0.572    1.109    1.813

               SpeedUp
Name             NG:8       NG:10       NG:12       NG:13       NG:14
A034-02f     19.055 X    29.825 X    36.821 X    39.555 X    40.595 X
A036-03f     17.849 X    27.397 X    37.417 X    37.189 X    39.292 X
A039-03f     17.256 X    31.399 X    37.441 X    36.216 X    37.103 X
A045-03f     19.884 X    26.447 X    30.809 X    29.206 X    27.377 X
A048-03f     22.587 X    22.900 X    24.374 X    22.565 X    19.041 X
A056-03f     22.124 X    23.856 X    23.514 X    20.910 X    19.898 X
A065-03f     21.691 X    26.385 X    23.661 X    21.391 X    19.547 X
A071-03f     23.425 X    24.127 X    20.428 X    17.703 X    16.904 X

Table 5.4: Computational results for the asymmetric ng-path relaxation.
In table 5.1 we report the speed-up factors for the q-paths and through-q-routes algorithms. We used the biggest CVRP instances in the literature for
two reasons:
1. the execution times of these algorithms on the VRPLIB instances are negligible
because of the small size of the instances,
2. we want to highlight the scalability of the methods on very difficult instances.
In fact, for bigger instances the global speed-up factor obtained is substantial:
24 X for the V1200-2500 instance. We report the combined speed-up for
the two methods because computing the through-q-routes requires the results of the
q-paths. However, the through-q-routes algorithm is the one achieving
the best performance.
For the asymmetric instances (table 5.2), as before, the best results are obtained
for the big instances.
In table 5.3 we report the results of the NG relaxation for the symmetric CVRP.
In this case we evaluate the scalability of the method over different values
of the ∆ parameter (8, 10, 12, 13, 14), using the VRPLIB instances because of the
size reached by the data structures during the computation. The performance of the
method degrades with bigger neighborhood sets, because of the increasing number of atomic operations among the labels and the longer data
transfers from and to the GPU, while remaining, in most cases, above the 10
X factor. In the asymmetric case (table 5.4) we obtain the maximum speed-up,
40 X. Here we can exploit the full computational capabilities of the
device, using all the levels of parallelism described before.
5.6
Considerations and Future Work
In this chapter we highlighted the great advantages that a parallel algorithm can
bring to these pricing strategies for the VRP. In fact, inside a CG method, the
pricing problem is the most computationally expensive routine used to generate
the columns to insert into the master problem. The use of a GPU appears to be a
very good alternative in terms of execution time, portability and affordability
(the device used is a high-end gaming GPU). Also notable is the performance
obtained on the biggest instances, which are the hardest to solve and the
ones requiring significant computation time. As future work we plan to explore
the implementation of these methods with OpenCL, with the aim of making them
portable across the most common parallel processors of different vendors,
and to exploit these relaxations, or some of their variants, to design
more powerful and effective algorithms for solving instances from different classes
of VRPs.
Chapter 6
Single Source Shortest Path Problem
6.1
Introduction
The ever more complex systems and ecosystems represented by urban centers
and industrialized countries are growing fast and, nowadays, exploiting the
infrastructures that compose these systems in an optimal way is more and more
difficult. From public transport to the transportation and delivery of goods, the related problems are increasingly strategic, and dealing with them is the focus of many
studies by academia and private companies.
Providing high-quality decision support tools is mandatory in order to preserve resources from the environmental point of view and to enhance the quality of life in densely
populated areas. Nowadays, for instance, a large urban center has different kinds,
or modes, of transportation covering the city area, which allow people to move easily
from one location to another. Likewise, in highly industrialized countries transportation is characterized by multiple networks (railways, motorways, air routes,
...) that permit a fast movement of people and goods. The layered networks
composing these transportation infrastructures have different properties and characteristics, which makes the related mathematical models more complex and the consistent and effective resolution of optimization problems on these models a non-trivial
challenge.
Chapter 6. SSSPP
Routing is a widely explored research topic that has its origin in the 1950s with
the well-known Dijkstra's algorithm [94], for finding the shortest path in a
directed graph, and the Bellman-Ford algorithm [95], which is computationally more
expensive but also handles negative edge weights. More generally, by routing we mean
finding the 'best' path with respect to one or more aspects of the journey: mileage,
cost, fuel consumption, time, number of transportation modes used and others. In
fact, we shift the focus from finding the shortest path in a geographical network to
optimizing a route across different layers with respect to other factors. One aspect,
mainly related to people and goods transportation, is the arrival time at a
certain destination. In a public transport or goods delivery scenario, the earlier
the arrival time, the better the quality of service (QoS).
Over the years many enhancements to the basic algorithms cited above have been
proposed, achieving good results on a large spectrum of routing problems. However,
when the query is strictly tied to a temporal dimension, algorithms using
bi-directional search, contraction hierarchies or a heuristic that computes a
completion bound for the solution, like A* [96], are not applicable, because the
network values (mainly the edge costs) or the topology change as a function
of time. In this case we can only solve the routing problem with an
'augmented' version of the Shortest Path algorithm, which is computationally
expensive for large instances.
In this chapter we propose two parallel algorithms, implemented on both CPU
(multi-core, shared-memory platform) and GPU, to solve the Earliest Arrival
Problem in a Time-Dependent Multi-Modal Network, and we report the performance
obtained compared to the serial version.
6.2
Problem Definition
In this section we define all the peculiarities of the problem, starting from the
definition of the Earliest Arrival Problem and showing that it can be solved using
an augmented version of the Shortest Path algorithm. We also define a Multi-Modal
Network and the algorithm that manages routing in this type of graph. Finally, we
add the time dimension to the model and describe the changes it introduces.
Definition 6.1. (Earliest Arrival Problem). Given a time-independent or
time-dependent network and source and target points s and t in the network, we ask
for a route in the network with the following properties:
1. the route starts at s,
2. the route ends at t,
3. the length (travel time) of every other route satisfying properties (1)–(2)
is greater than or equal to the length of this route.
In other words, among all the possible routes in the network from s to t, we seek the
route with the minimum cost for arriving at t. As mentioned before, the cost of the
route can be any aspect, from fuel consumption to travel time, or a mix of
several of these criteria. In this work we cover only the optimization with respect
to a single criterion.
Analyzing Definition 6.1, we can easily notice that, by replacing the optimization
criterion with any weight function on the edges of the network, this problem
becomes a Single Source Shortest Path Problem. We define the SSSP Problem
first, then we propose its Multi-Modal and Time-Dependent version.
6.2.1
Single Source Shortest Path Problem
Shortest Path is a deeply investigated problem in Combinatorial Optimization.
This problem and its variations have been a subject of research for about five
decades and can be found, often as a subproblem, in a wide range of applications. In
our case, shortest paths are the basis of the problem we are discussing.
Definition 6.2. (Single Source Shortest Path Problem). Given a weighted, directed
graph G = (V, E), a source node s ∈ V, a target node t ∈ V and a weight
function w(e) for each edge $e = (v_a, v_b)$, with $v_a, v_b \in V$, we ask for a path
$P = [v_1, \ldots, v_k]$ with the following properties:
1. the path begins at s, thus $v_1 = s$,
2. the path ends at t, thus $v_k = t$,
3. Len(P) is minimal among all paths satisfying (1)–(2).
We define $\mathrm{Len}(P) = \sum_{i=1}^{k-1} w(v_i, v_{i+1})$, the length of the path P.
The SSSPP has some common variants, depending on the number of sources
and targets considered:
• Many-To-Many Shortest Path Problem. This is a generalization of the Shortest
Path Problem. Instead of one node s and one node t we are given a set of source
nodes S ⊆ V and a set of target nodes T ⊆ V. We now ask for a shortest
path $P_{s,t}$ for each pair (s, t) ∈ S × T. In multi-modal routing the Earliest
Arrival Problem actually transforms into this version of the problem.
• One-To-All Shortest Path Problem. This is a special case of the Many-To-Many
Shortest Path Problem where S is a singleton set consisting of one
source node s and T = V is the set of all nodes. Hence, we are asking for
shortest paths $P_{s,v}$ to every node v ∈ V. Because the union of the edge sets of
all resulting paths $P_{s,v}$, v ∈ V, forms a tree, we may also say that we
compute a shortest path tree.
• All-Pairs Shortest Path Problem. This is the version of the Many-To-Many
Shortest Path Problem where both S and T are the complete node set V
of the graph. Solving the All-Pairs Shortest Path Problem automatically
yields solutions for all instances of the Shortest Path Problem in the
graph. In most cases this problem is solved with the Floyd-Warshall dynamic
programming algorithm [97].
All these problems can be solved using the same algorithm, often by executing it
multiple times. So far we have not taken the time dimension into account; we
will extend the model after the introduction of Multi-Modal Networks.
We can consider the SSSPP a special case of a Multi-Modal Network with only
one mode. In this case the solution algorithm for the routing problem without
time-dependencies is equivalent to the Dijkstra algorithm.
Algorithm Unimodal Routing(s, t, V, E, w())
1.  vals // Tentative distance values
2.  preds // Predecessors vector
3.  Queue Q = null // Priority queue
4.  // Data structures initialization
5.  Q.Insert(s)
6.  for i = 1 to |V| do
7.      if i == s
8.          then vals[i] = 0, preds[i] = 0
9.          else vals[i] = ∞, preds[i] = ∞
10. // Algorithm
11. while Q ≠ null do
12.     n = Q.first()
13.     if n == t
14.         then break
15.         else val_n = vals[n]
16.     for each succ ∈ Γ_n do
17.         cost_e = w((n, succ))
18.         if vals[succ] > val_n + cost_e
19.             then vals[succ] = val_n + cost_e
20.                 preds[succ] = n
21.                 Q.Insert(succ)
The proposed algorithm is a straightforward implementation of Dijkstra's algorithm.
In line 3 we initialize a priority queue implemented with a heap, ordered
by the tentative values, and in line 5 we insert the start node s into the heap. In
lines 6-9 we initialize, for each node, the tentative value and the predecessor used
to retrieve the shortest path.
Lines 11-21 contain the core of the algorithm, a while loop that ends once
the queue is empty. In line 12 we extract the root from the heap; in lines 13-15
we check whether we have reached the target t, otherwise we read the value of the
label of the current node n. The for-each statement in lines 16-21 evaluates
the new tentative value for each successor succ of n and updates it if the
current label of the successor is greater, keeping track of the predecessor n and
inserting the successor node succ into the queue (line 21), which reorders itself
to maintain the heap property.
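As an illustration, the Unimodal Routing listing can be sketched in Python as follows. This is a minimal sketch under our own assumptions, not the implementation used in this work: the graph is an adjacency dictionary `successors` (a hypothetical name), the heap is the standard `heapq` module, and stale heap entries are skipped instead of being updated in place.

```python
import heapq
import math

def unimodal_routing(source, target, successors):
    """Dijkstra with early exit; successors maps node -> list of (succ, cost)."""
    vals = {node: math.inf for node in successors}   # tentative distances
    preds = {node: None for node in successors}      # predecessor of each node
    vals[source] = 0
    queue = [(0, source)]                            # binary heap keyed by distance
    while queue:
        val_n, n = heapq.heappop(queue)
        if val_n > vals[n]:
            continue                                 # stale entry, already improved
        if n == target:
            break
        for succ, cost in successors[n]:
            if vals[succ] > val_n + cost:
                vals[succ] = val_n + cost
                preds[succ] = n
                heapq.heappush(queue, (vals[succ], succ))
    # Walk the predecessors back from the target to rebuild the path.
    path, node = [], target
    while node is not None and vals[target] < math.inf:
        path.append(node)
        node = preds[node]
    return vals[target], path[::-1]

# Small illustrative graph (assumed data): every node appears as a key.
graph = {
    "a": [("b", 7), ("c", 2)],
    "b": [("d", 1)],
    "c": [("b", 3), ("d", 8)],
    "d": [],
}
distance, path = unimodal_routing("a", "d", graph)
```

On this example the direct edge from a to b is discarded in favour of the detour through c, which is the label update performed in lines 18-21 of the listing.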
6.2.2
Multi-Modal Networks
In this section, we define a Multi-Modal Network and the other concepts on which
routing for this type of network relies. First, we give the definition of a
Multi-Modal or Multi-Layer Network:
Definition 6.3. (Multi-Modal Network). A Multi-Modal Network is a graph G = (V, E, M) where:
1. $M = \{M^1, M^2, \ldots, M^n\}$ is the set of modes, or layers, composing the network, where $M^i = (V^i, E^i)$ is the graph representing mode (layer) i, $V^i$
is the set of vertices of i, $E^i$ is the set of edges of i and n = |M|,
2. $V = \bigcup_{i=1}^{n} V^i$, with $V^i \cap V^j = \emptyset$ for $i \neq j$,
3. $v^i \in V^i$ is a vertex of mode i,
4. $e^i \in E^i$ is an edge connecting two nodes of the same mode i, $e^i = (v_h^i, v_k^i)$,
5. $e^t \in E^t$ is an edge connecting two nodes of different modes i and j, $e^t = (v_h^i, v_k^j)$ with $i \neq j$ (transition edge),
6. $E = \bigcup_{i=1}^{n} E^i \cup E^t$ is the set of edges of G, where $E^i \cap E^j = \emptyset$ for $i \neq j$.
The graph G is a multi-modal or multi-layer graph composed of n modes or layers.
For each graph $M^i$ we can define a cost function $w^i(e^i)$ for the edges of the
mode. For the transition edges, $w^t(e^t)$ is equal to 0.
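One possible in-memory representation of Definition 6.3 is sketched below. It is only an illustration under our own assumptions: vertices are tagged with their mode so that the vertex sets stay disjoint, every edge carries a label (the mode name, or the hypothetical label "transfer" for transition edges), and the function name `build_multimodal` and the sample data are invented for the example.

```python
from collections import defaultdict

def build_multimodal(layers, transitions):
    """layers: mode -> list of (u, v, cost); transitions: list of ((mode_i, u), (mode_j, v))."""
    edges = defaultdict(list)          # (mode, vertex) -> [((mode, vertex), label, cost)]
    for mode, layer_edges in layers.items():
        for u, v, cost in layer_edges:
            # Tagging each vertex with its mode keeps the vertex sets disjoint.
            edges[(mode, u)].append(((mode, v), mode, cost))
            edges[(mode, v)]           # make sure sink vertices appear as keys
    for (mode_i, u), (mode_j, v) in transitions:
        if mode_i == mode_j:
            raise ValueError("a transition edge must connect two different modes")
        edges[(mode_i, u)].append(((mode_j, v), "transfer", 0))  # w^t(e^t) = 0
        edges[(mode_j, v)]
    return dict(edges)

# Assumed toy instance: a walking layer and a bus layer joined at two stops.
network = build_multimodal(
    layers={
        "walk": [("home", "stop_a", 5), ("stop_b", "office", 4)],
        "bus": [("stop_a", "stop_b", 10)],
    },
    transitions=[
        (("walk", "stop_a"), ("bus", "stop_a")),
        (("bus", "stop_b"), ("walk", "stop_b")),
    ],
)
```

Keeping the label on every edge is a design choice made here because the label-constrained routing of the following sections needs it to decide which transitions are allowed.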
For instance, the network of public transport in an urban area (roads, bus lines,
tram lines, trains, boats, etc. . . ), can be modelled with a multi-modal network.
Another example can be the different types of transportation for a delivery service
(postal service for instance), that has different types of networks to deliver the
goods (ship, aeroplane, etc. . . ).
The routing problem for this type of network can be seen as a Shortest Path Problem
across the various connected modes of the network, using the relative $w^i()$
weight functions to evaluate the cost of each edge. In this case we will
find a route from the source s to the target t that passes through the various
modes of the network. In this scenario, a path is also allowed to combine
several types of transportation (car, bus, tram, boat) in the same route.
Figure 6.1: Types of Routes. (a) Not Convenient Route; (b) Convenient Route.
Obviously, this kind of solution is not desirable. A 'well formed' route is
a route that passes only through certain modes, depending on the state of
the traveler. For instance, assume that we are tourists in Rome and we want to
visit some of the most beautiful places in the city. We are on foot and we want
a route that uses public transport and the road network to reach
the places we have chosen. The algorithm, as it is, can give us undesirable results,
finding that the most convenient path between the Colosseum and the Vatican
Museums is to take the metro up to a certain point, then take a car, then a
bus, then a car again to reach the Museums.
A path like this is clearly not desirable: we would like a path that uses
only public transportation to reach our destination (Figure 6.1). Another side effect
of using this algorithm on these types of networks is the following: if, while we
are traveling on a train, the railway intersects a road that would bring us to the
destination faster, the algorithm can suggest taking that road even though the
train does not stop there. In general, we are forced to evaluate only paths that
are compatible with the state (foot, car, bicycle, etc.) of the traveler.
6.2.3
Label Constrained Shortest Path Problem
Barrett [98] proposed an algorithm that solves the problem of undesirable routes
by associating with the graph an automaton that, treating the modes and the states
as part of a DFA (Deterministic Finite Automaton), regulates the transitions among
the modes of the network. We give a brief definition of a language and of an
automaton, then we describe the algorithm based on them and its implications.
Definition 6.4. (Regular Languages). Let Σ be an alphabet. Then a language L
over Σ is regular if and only if it conforms to the following construction rules:
1. the empty language ∅ is regular,
2. for each σ ∈ Σ the singleton language {σ} is regular,
3. if $L_1$ and $L_2$ are regular languages, then $L_1 \cup L_2$, $L_1 \cdot L_2$ and $L_1^*$ are also
regular languages.
To describe a regular language L we can use formalisms such as regular expressions
or automata. In our case we give a brief definition of the DFA describing the
language L.
Definition 6.5. (DFA, Deterministic Finite Automaton). A Deterministic Finite
Automaton describing the language L is given by a 5-tuple A = (Q, Σ, δ, q0, F)
where:
1. Q is the set of states composing the automaton,
2. Σ is the alphabet, a finite set of symbols,
3. δ is the transition function δ : Q × Σ → P(Q), where each δ(q, σ) contains at
most one state when the automaton is deterministic,
4. q0 is the initial state,
5. F ⊆ Q is the set of final states.
We say that a word w is accepted by A if and only if there exists a path from q0 to
some qf ∈ F, regulated by δ, whose transitions spell w.
By Kleene's Theorem [99], each regular language L can be described by a finite
automaton, in the sense that a word w ∈ Σ∗ is accepted by this automaton if and
only if w ∈ L. The language L and the automaton A can therefore be used
interchangeably.
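The acceptance condition of Definition 6.5 can be illustrated with a short Python sketch. The automaton below is a hypothetical example of our own (a traveler who walks, optionally rides a bus, and walks again); the state and label names are assumptions and do not come from [98].

```python
def accepts(delta, start, finals, word):
    """Run a deterministic automaton; delta maps (state, symbol) -> next state."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False            # no transition defined: the word is rejected
        state = delta[(state, symbol)]
    return state in finals

# Hypothetical traveler automaton: walk, optionally ride the bus, then walk again.
delta = {
    ("foot", "walk"): "foot",
    ("foot", "bus"): "riding",
    ("riding", "bus"): "riding",
    ("riding", "walk"): "arrived",
    ("arrived", "walk"): "arrived",
}
finals = {"foot", "arrived"}
```

With this automaton the word walk, bus, bus, walk is accepted, while a route that boards a second bus after getting off is rejected because no transition is defined for it.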
Given the definitions of language and automaton, we can define the Label
Constrained Shortest Path Problem:
Definition 6.6. (Label Constrained Shortest Path Problem). Given an alphabet
Σ, a language L ⊂ Σ∗, a weighted, directed graph G = (V, E) with Σ-labeled
edges and source and target nodes s, t ∈ V, we ask for a shortest path P from s
to t such that the sequence of labels along the edges of the path forms a word of L.
Thus, given $P = [v_1, \ldots, v_k]$, it has to hold that:
$\mathrm{label}((v_1, v_2))\,\mathrm{label}((v_2, v_3)) \cdots \mathrm{label}((v_{k-1}, v_k)) \in L$. (6.1)
This definition needs no restrictions on the language L but, for our purposes,
a regular language is sufficient to model the transition among the modes of the
network. In [98] the following theorem is proven:
Theorem 6.7. The Label Constrained Shortest Path Problem restricted to Regular
Languages, RegL-CSPP, can be solved in deterministic polynomial time.
An algorithm for solving this problem operates on a product graph between the
graph G and the automaton A that describes the allowed transitions among the
modes.
Definition 6.8. (Product Network). Given a Σ-labeled graph G = (V, E) and
a non-deterministic finite automaton A = (Q, Σ, δ, q0, F), the product network
$G^\times = (V^\times, E^\times)$ is defined as follows:
1. The node set consists of product-nodes $(v, q) \in V^\times$ where v ∈ V and q ∈ Q.
2. An edge $e^\times = ((v_1, q_1), (v_2, q_2))$ between two product-nodes is included in $E^\times$
if and only if $e = (v_1, v_2) \in E$ and, for the label σ = label(e) ∈ Σ, there exists
a transition $q_2 \in \delta(q_1, \sigma)$ in the automaton. The weight of $e^\times$ is set to the
weight of e and $\mathrm{label}(e^\times)$ is set to σ.
The resulting graph is uni-modal.
In [98] the following result is proven:
Theorem 6.9. The RegL-CSPP for a Σ-labeled graph G = (V, E) from source
s ∈ V to target t ∈ V and a regular language L ⊆ Σ∗ can be reduced to the
Shortest Path Problem as follows:
1. construct a finite automaton A = (Q, Σ, δ, S, F) describing L, where S is the
set of starting states,
2. construct the product network $G^\times = G \times A$,
3. solve the Many-To-Many Shortest Path Problem for $G^\times$ using:
$S = \bigcup_{q_s \in S} \{(s, q_s)\}, \quad T = \bigcup_{q_f \in F} \{(t, q_f)\}$ (6.2)
where S and T on the left-hand sides are, respectively, the set of the sources and
the set of the targets inside the product graph (under the union, S still denotes
the starting states of A),
4. from all resulting paths pick the one having minimal length.
Let $P = [(v_1, q_1), \ldots, (v_k, q_k)]$ be the shortest path obtained by the algorithm induced
by Theorem 6.9. Then the length of the path in G is the same as the length
Len(P) in $G^\times$. The actual path in G can be obtained by omitting the 'automaton
part' of the product-nodes, thus yielding $[v_1, \ldots, v_k]$. On the other hand, the word
conforming to L along the path can be obtained by concatenating the edge labels:
$\mathrm{word}(P) = \mathrm{label}((v_1, q_1), (v_2, q_2)) \cdots \mathrm{label}((v_{k-1}, q_{k-1}), (v_k, q_k))$ (6.3)
The product graph can be built in time O(|G| · |A|), which is
polynomial. Hence, the algorithm induced by Theorem 6.9 runs in polynomial
time. The space required to store the product graph
G× is also in O(|G| · |A|), which is extremely expensive for large instances.
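To make Definition 6.8 concrete, the following Python sketch builds the product network explicitly. It is a minimal sketch under our own assumptions: edges are given as (v1, v2, label, weight) tuples, `delta` maps a (state, label) pair to a set of states, and all names and data are hypothetical.

```python
def product_network(edges, delta):
    """edges: list of (v1, v2, label, weight); delta: (state, label) -> set of states."""
    product = {}
    for v1, v2, label, weight in edges:
        for (q1, sigma), next_states in delta.items():
            if sigma != label:
                continue
            for q2 in next_states:
                # The product edge inherits weight and label from the graph edge.
                product.setdefault((v1, q1), []).append(((v2, q2), label, weight))
    return product

# Assumed toy data: the automaton has no transition for the label "car".
edges = [("s", "x", "walk", 4), ("x", "t", "bus", 3), ("s", "t", "car", 2)]
delta = {("q0", "walk"): {"q0"}, ("q0", "bus"): {"q1"}, ("q1", "bus"): {"q1"}}
product = product_network(edges, delta)
```

The nested loops visit every pair of graph edge and automaton transition, which mirrors the O(|G| · |A|) bound on construction time and storage discussed above; the edge labeled "car" disappears because no state allows it.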
The memory cost of the product graph can be avoided by using the
transition graph of the automaton A directly. We can compute the
shortest paths for the Many-To-Many SPP implicitly, using a constrained version of
the Dijkstra algorithm regulated by the transition graph of the automaton.
Figure 6.2: Automaton managing four states.
In this case, the memory complexity of G× becomes O(|G| + |A|), which considerably
reduces the memory space required. For instance, we can assume that a traveler
on foot can use public transport and walk through the streets; then
the relative automaton will allow moving through the various modes, like
bus, trains or subways, and the costs of the streets will be computed considering
the average speed of a walking person.
A bike trip, instead, must be constrained to certain streets and transport networks.
It is not allowed to ride a bicycle on a motorway or to bring it on a bus, whereas
it is allowed to ride in urban centers or on secondary roads and to bring the
bicycle on a train or a subway. All these aspects must be modeled by the automaton
that, during the evaluation of the successor states and the computation of the new
tentative costs, allows the algorithm to expand a state or not. Here we propose
an augmented version of the algorithm Unimodal Routing, taking into account a
Multi-Modal Network and an automaton.
Algorithm RegL-CSPP(s, t, V, E, w(), A)
1.  Queue SQ = null // Queue of the generated states of G×
2.  // Queue initialization
3.  for each (s, q_s) ∈ S do
4.      SQ.Insert((s, q_s), ((0, 0), 0), 0)
5.  // Main cycle
6.  while SQ ≠ null do
7.      ((n, s), (p, q, val(p, q)), f(n, s)) = SQ.First()
8.      if (n, s) ∈ T
9.          then reached++
10.         if reached == |T|
11.             then break
12.     for each edge e = (n, succ), succ ∈ Γ_n do
13.         for each q′ ∈ δ(s, label(e)) do
14.             if (succ, q′) ∉ SQ
15.                 then SQ.Insert((succ, q′), ((n, s), f(n, s)), f(n, s) + w(e))
16.                 else if f((n, s)) + w(e) < f((succ, q′))
17.                     then SQ.Update((succ, q′), ((n, s), f(n, s)),
18.                         f((n, s)) + w(e))
In the algorithm RegL-CSPP we introduce the triple ((v, q), (p, r, f(p, r)), f(v, q)),
indicating a certain node v in a state q and its value. The triple also keeps track
of the predecessor, given by node p, state r and value f(p, r), in order to retrieve
the path at the end of the computation.
In lines 3-4 we initialize the queue SQ, which collects all the G× states produced
during the computation, ordered by their value f. As in the algorithm Unimodal
Routing, we implemented it with a heap. More specifically, in line 4 we insert
all the states for the starting point s ∈ S, setting the predecessors and the values
to 0.
In lines 8-11 we check whether all the target states t ∈ T have been reached; in
this case we stop the computation. In lines 12-18 we update the data for each
successor of n. In line 13 we check whether the transition along the edge e is
allowed by the transition function δ of A. If it is allowed, we check whether the
state has already been generated: if not, we insert the new triple in SQ (line 15),
otherwise we update the existing triple with the new value if the new path
improves the solution (lines 17-18).
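The implicit exploration of the product graph can be sketched in Python as follows. This is an illustration under our own assumptions rather than the implementation used in this work: unlike the listing, it stops at the first accepted target state popped from the heap (which is the cheapest one) instead of waiting for all target states, and it re-inserts improved states instead of updating them in place. All names and data are hypothetical.

```python
import heapq
import math

def regl_cspp(source, target, successors, delta, start_states, final_states):
    """successors: node -> [(succ, label, cost)]; delta: (state, label) -> set of states."""
    best = {(source, q): 0 for q in start_states}     # f values of generated states
    pred = {state: None for state in best}
    queue = [(0, source, q) for q in start_states]
    heapq.heapify(queue)
    while queue:
        f, n, q = heapq.heappop(queue)
        if f > best[(n, q)]:
            continue                                  # stale heap entry
        if n == target and q in final_states:
            # First accepted target state popped is the cheapest one.
            path, state = [], (n, q)
            while state is not None:
                path.append(state[0])
                state = pred[state]
            return f, path[::-1]
        for succ, label, cost in successors.get(n, []):
            for q_next in delta.get((q, label), ()):  # product edges built on the fly
                if f + cost < best.get((succ, q_next), math.inf):
                    best[(succ, q_next)] = f + cost
                    pred[(succ, q_next)] = (n, q)
                    heapq.heappush(queue, (f + cost, succ, q_next))
    return math.inf, []

# Assumed toy data: the cheap "car" edge is not allowed by the automaton.
successors = {
    "s": [("x", "walk", 4), ("t", "car", 2)],
    "x": [("t", "bus", 3)],
}
delta = {("q0", "walk"): {"q0"}, ("q0", "bus"): {"q1"}, ("q1", "bus"): {"q1"}}
cost, path = regl_cspp("s", "t", successors, delta, {"q0"}, {"q0", "q1"})
```

Only the product states actually reached are stored, so the memory used depends on the explored part of G× rather than on the full product graph.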
6.2.4
Time-Dependent Networks
In time-dependent routing we no longer have constant weights assigned to
the edges. To accommodate time-dependency, we replace the edge weights
by arbitrary functions f from some function space F. The shortest s-t path in a
time-dependent model then depends on the departure time $\tau_s$ from the source node.
This might result in shortest paths of different length for different departure times
or, in general, even in a completely different route. In the simplest case, we can
describe these time functions as periodic functions $f : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ with period Π,
meaning that $f(\tau) = f(\tau \bmod \Pi)$ for each value τ.
To make the model more realistic, we need to express F as a set of piecewise
linear functions. A periodic function $f : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ is called piecewise linear if it
consists of a finite number of segments of linear functions. Let f be a piecewise
linear function; then f can be described by a finite set P of interpolation points,
where each interpolation point $p_i \in P$ consists of a departure time τ and an
associated function value f(τ).
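As an illustration of the road-network case, the following Python sketch evaluates a periodic piecewise linear function by linear interpolation between consecutive interpolation points, wrapping around the period. The function name, the profile values and the 24-hour period expressed in minutes are assumptions made only for this example.

```python
import bisect

def travel_time(points, period, tau):
    """Evaluate a periodic piecewise linear function given sorted (time, value) points."""
    tau = tau % period                       # f(tau) = f(tau mod period)
    times = [t for t, _ in points]
    i = bisect.bisect_right(times, tau) - 1  # last interpolation point at or before tau
    t0, f0 = points[i]                       # i == -1 wraps to the last point
    t1, f1 = points[(i + 1) % len(points)]
    if i == -1:
        t0 -= period                         # previous point lies in the previous period
    if i == len(points) - 1:
        t1 += period                         # next point lies in the next period
    if t1 == t0:
        return f0
    return f0 + (f1 - f0) * (tau - t0) / (t1 - t0)

# Hypothetical travel-time profile of one road, in minutes over a 24-hour period.
profile = [(0, 10.0), (480, 30.0), (600, 12.0)]
```

For instance, halfway between midnight and the 8:00 peak the sketch returns a travel time halfway between 10 and 30 minutes, and a query one full period later returns the same value.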
The value of f for an arbitrary time τ is then computed by interpolation. This is
done differently for time-dependent road networks and for public transportation
networks. In road networks we interpolate linearly between two subsequent
interpolation points, whereas the travel time function along a public transportation
edge is interpreted as follows: first we have to wait for the next train or aeroplane
to depart, and then we add its pure travel time along that edge.
Hence, for an arbitrary time point τ we use the nearest interpolation point $\tau_i$ in
the future and interpolate by the formula:
$f(\tau) = -\gamma \cdot (\tau_i - \tau) + f(\tau_i)$ (6.4)
with γ ∈ [−1, 0] as the fixed gradient for f. Figure 6.3 shows the two types of
functions, for the road network and for the public transport network.
Figure 6.3: Speed Profiles. (a) Road Network Speed Profile; (b) Public Transport Speed Profile.
To extend the Dijkstra algorithm to manage time-dependency we need to add as input
the time $\tau_s$ of the day at which the trip starts. There are two types of routing
queries that we can perform on a time-dependent graph:
• Time Queries: When computing time queries, the time-dependent version of
Dijkstra's algorithm is almost identical to the time-independent version
illustrated in Algorithm RegL-CSPP. Only the following two changes to the
algorithm are needed:
1. we need to supply a departure time τ as additional input, as mentioned
before,
2. to evaluate the edge weights, we have to consider the current time at
which we encounter the respective edge. Let e = (v, w) be an edge
whose weight has to be evaluated; then the time at which we
evaluate the function $f_e$ of the edge e is the departure time τ plus the
travel time along the path to v.
• Profile Queries: The previously described version of the time-dependent
query algorithm yields shortest paths only for one particular departure time
τ. While this seems to be a canonical generalization of the time-independent
case, there is another type of query in time-dependent graphs, where we are
interested in the shortest path not only at one time point, but at all times
of the day.
For example, in a railway network we state 8 o'clock as departure time τ for
a query. Suppose there is a train departing at 8:00 that reaches our destination
in 2 hours, and another train departing at 9 o'clock that takes
only 1 hour and 10 minutes. Taking the second train would be a suboptimal
solution to the Earliest Arrival Problem (since the arrival time is 10 minutes
later), but its pure travel time is 50 minutes shorter. So it may be useful
to present the user with the travel time for each possible departure
time τ < Π. In other words, the result of the query should itself be a piecewise
linear function f, where each interpolation point represents a shortest
path for that particular time.
In our work we consider only Time Queries. The other types of queries
on a time-dependent graph are extensions of this type of query, or have it as their
foundation, so a parallel algorithm for this problem can be useful for all the
others.
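A time query can be sketched in Python as a Dijkstra search whose edge costs are functions of the arrival time at the tail node. The sketch below is only an illustration under our own assumptions: periodicity is ignored, each edge carries a hypothetical callable `cost_fn(tau)`, and the timetable reproduces the two trains of the example above with times in minutes after midnight.

```python
import heapq
import math

def timetable_cost(departures, tau):
    """Waiting time until the next departure plus the ride time of that departure."""
    for dep, ride in sorted(departures):
        if dep >= tau:
            return (dep - tau) + ride
    return math.inf                           # no departure left today

def earliest_arrival(source, target, successors, tau_start):
    """Time query: successors maps node -> [(succ, cost_fn)], cost_fn(tau) -> duration."""
    arrival = {source: tau_start}
    queue = [(tau_start, source)]
    while queue:
        tau, n = heapq.heappop(queue)
        if tau > arrival.get(n, math.inf):
            continue                          # stale heap entry
        if n == target:
            return tau
        for succ, cost_fn in successors.get(n, []):
            tau_succ = tau + cost_fn(tau)     # edge evaluated at the arrival time at n
            if tau_succ < arrival.get(succ, math.inf):
                arrival[succ] = tau_succ
                heapq.heappush(queue, (tau_succ, succ))
    return math.inf

# The 8:00 train takes 120 minutes, the 9:00 train takes 70 minutes.
trains = [(480, 120), (540, 70)]
network = {
    "home": [("station", lambda tau: 15)],
    "station": [("destination", lambda tau: timetable_cost(trains, tau))],
}
```

Leaving home at 7:45 catches the 8:00 train and arrives at 10:00, while leaving five minutes later forces the 9:00 train and an arrival at 10:10, which is the trade-off that profile queries make visible for every departure time.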
6.3
Serial Algorithm
Dean [100] proposed the extended version of the Dijkstra algorithm for
time-dependent graphs. In this section we propose the final version of the algorithm
to compute the Shortest Path in a Multi-Modal Network with Time-Dependencies.
This extension of the problem is particularly hard to solve, because some of the
most famous and effective techniques for speeding up the computation
of Shortest Paths either cannot be exploited or require pre-computation phases.
Algorithms like Contraction Hierarchies [101], bi-directional search, Arc-Flag [102,
103] or ALT (A* with Landmarks and Triangle Inequality) [104, 105] are not
adaptable or need a long pre-computation for larger graphs. Pajor [106]
showed that almost all of these methods are of no use in the time-dependent case.
6.3.1
Time-Dependent Augmented Dijkstra Algorithm
Algorithm T-D-RegL-CSPP(s, t, V, E, f_time, A, τ_start)
1.  Queue SQ = null // Queue of the generated states of G×
2.  // Queue initialization
3.  for each (s, q_s) ∈ S do
4.      SQ.Insert((s, q_s), ((0, 0), τ_0), τ_start)
5.  // Main cycle
6.  while SQ ≠ null do
7.      ((n, h), (p, q, τ_{p,q}), τ_{n,h}) = SQ.First()
8.      if (n, h) ∈ T
9.          then reached++
10.         if reached == |T|
11.             then break
12.     for each succ ∈ Γ(n) do
13.         e = (n, succ)
14.         for each q′ ∈ δ(h, label(e)) do
15.             if (succ, q′) ∉ SQ
16.                 then SQ.Insert((succ, q′), ((n, h), τ_{n,h}), τ_{n,h} + f_time(e, τ_{n,h}))
17.                 else if τ_{n,h} + f_time(e, τ_{n,h}) < τ_{succ,q′}
18.                     then SQ.Update((succ, q′), ((n, h), τ_{n,h}), τ_{n,h} + f_time(e, τ_{n,h}))
We introduce the function f_time(), which calculates the time needed to travel
along the edge e. This function can be a collection of methods created to evaluate
the traveling time based on various factors: the road type (primary, secondary,
etc.), the type of public transport (bus, subway, etc.), the moment of the day
(traffic profiles or other factors), and the state of the automaton (depending on
whether we are traveling on foot or on a bicycle). The algorithm is equivalent to
RegL-CSPP, but here the cost of each edge is a travel time, evaluated inside the
multi-graph G at the moment the edge is entered. Depending on the
granularity of the problem, the time can be expressed in minutes or seconds.
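The role of f_time() as a collection of evaluation methods can be illustrated with the following Python sketch. Everything in it is a hypothetical assumption made for the example (the edge fields, the speeds in metres per second, the rush-hour window and the slowdown factor); it only shows how the edge type, the time of day and the traveler state can jointly determine the travel time in seconds.

```python
def f_time(edge, tau, traveler_state):
    """Travel time in seconds of `edge` when entered at time `tau` (seconds of the day)."""
    if edge["kind"] == "road":
        speeds = {"foot": 1.25, "bicycle": 4.0, "car": edge["car_speed"]}
        speed = speeds[traveler_state]
        rush_hour = 7 * 3600 <= tau % 86400 < 9 * 3600
        if traveler_state == "car" and rush_hour:
            speed *= 0.5                      # assumed rush-hour slowdown for cars
        return edge["length"] / speed
    if edge["kind"] == "transit":
        # Wait for the next scheduled departure, then add the ride time.
        waits = [dep - tau for dep in edge["departures"] if dep >= tau]
        return min(waits) + edge["ride"] if waits else float("inf")
    raise ValueError(f"unknown edge kind: {edge['kind']}")

road = {"kind": "road", "length": 1000.0, "car_speed": 10.0}
bus_line = {"kind": "transit", "departures": [28800, 32400], "ride": 600}
```

A single dispatch function keeps the routing loop unchanged when a new edge type or traffic profile is added, which is one possible reason to organize the evaluation this way.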
6.4
GPU Algorithm
In this section, we propose an extension of the parallel Dijkstra algorithm proposed
in [88] for computing the EAP in a Multi-Modal Time-Dependent Graph, and its
porting to a GPU using the CUDA [107] programming model. First, we present
the basic algorithm, then we introduce a new approach to compute the frontier
set $F_i$ that enables the parallelism of the method.
6.4.1
GPU Dijkstra Algorithm
We can distinguish two parallelization alternatives that can be applied to Dijkstra's
algorithm. The first one parallelizes the internal operations of the sequential
Dijkstra algorithm, while the second one performs several Dijkstra algorithms on
disjoint sub-graphs in parallel [108]. Our approach focuses on the first solution.
The key to the parallelization of a single sequential Dijkstra algorithm resides in the
inherent parallelism of its loops. In each iteration, the outer loop selects a node
from which to compute new distance labels. Inside this loop, the inner loop relaxes
the outgoing edges of the node in order to update the old distance labels.
Parallelizing the outer loop implies computing, in each iteration i, a frontier set
$F_i$ of nodes that can be settled in parallel without affecting the correctness of the
algorithm. The main problem here is to identify this set of nodes v whose tentative
distance Val(v) from the source s is already the shortest distance. Crauser et al. [109] and
Crobak et al. [110] proposed two solutions addressing this problem. Parallelizing
the inner loop implies traversing simultaneously the outgoing edges of the frontier
node. One of the algorithms presented in [111] is an example of this parallelization
approach.
Following the approach in [109], we explain the method for identifying the frontier
set $F_i$ and maximizing its cardinality. Clearly, the bigger
the frontier set, the higher the level of parallelism of the method.
6.4.1.1
Frontier Set
In each iteration i, the sequential method calculates the minimum tentative distance
of the nodes belonging to the unsettled set $U_i$. The node whose tentative distance
is equal to this minimum value can be settled and becomes the frontier node.
Its outgoing edges are traversed to relax the distances of the adjacent nodes.
In order to parallelize the algorithm, we need to identify which nodes can
be settled and used as frontier nodes at the same time. Martin et al. [112]
insert into the frontier set $F_{i+1}$ all nodes with this minimum tentative distance,
with the aim of processing them simultaneously. Crauser et al. [109] introduce a
more aggressive enhancement, augmenting the frontier set with nodes with a longer
tentative distance. In each iteration i, for each node of
the unsettled set, u ∈ $U_i$, the algorithm computes the sum of:
1. its tentative distance,
2. the minimum cost of its outgoing edges.
Afterwards, it calculates the minimum of these computed values. Finally, those
nodes whose tentative distance is lower than or equal to this minimum value can
be settled and become the frontier set.
We introduce $\Delta_i$ as the threshold value computed in each iteration i, such that
any unsettled node u with val(u) ≤ $\Delta_i$ can be safely settled. The bigger the $\Delta_i$
value, the more parallelism is exploited. However, depending on the particular
graph being processed, the use of a very ambitious $\Delta_i$ may induce overheads that
destroy any performance gain with respect to sequential execution.
The basic parallel Dijkstra method follows the idea proposed by Crauser [109] of
incrementing each $\Delta_i$. For every node v ∈ V, the minimum weight of its outgoing
edges, that is, $\Delta_v = \min\{w(v, z) : (v, z) \in E\}$, is calculated in a pre-computation
phase. For each iteration i of the external loop, having all tentative distances of
the nodes in the unsettled set, we define
$\Delta_i = \min\{val(u) + \Delta_u : u \in U_i\}$ (6.5)
More generally, we insert into the frontier set $F_{i+1}$ every node v ∈ $U_i$ with val(v) ≤ $\Delta_i$.
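Equation (6.5) and the resulting frontier set can be illustrated with a small Python sketch. It is only a sequential illustration under our own assumptions: `min_out` stands for the pre-computed $\Delta_v$ values, and the tentative distances are invented for the example.

```python
import math

def frontier_step(vals, unsettled, min_out):
    """Return (delta_i, frontier) for one iteration of the Crauser-style criterion."""
    reached = [u for u in unsettled if vals[u] < math.inf]
    if not reached:
        return math.inf, set()
    # Threshold of equation (6.5): tentative distance plus cheapest outgoing edge.
    delta_i = min(vals[u] + min_out[u] for u in reached)
    frontier = {u for u in reached if vals[u] <= delta_i}
    return delta_i, frontier

vals = {"a": 0, "b": 2, "c": 3, "d": 9, "e": math.inf}
min_out = {"a": 1, "b": 4, "c": 2, "d": 1, "e": 5}    # cheapest outgoing edge per node
delta_i, frontier = frontier_step(vals, {"b", "c", "d", "e"}, min_out)
```

Here the threshold is determined by node c (3 + 2 = 5), so both b and c can be settled in the same iteration, while d has to wait for a later one.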
6.4.1.2
GPU Implementation
Having defined the concept used for exposing the inherent parallelism inside the
Dijkstra algorithm, we provide the complete pseudo-code of the method on the
GPU.
Algorithm GPU Dijkstra(s, t, V, E, w(), ∆_v)
1.  vals, preds, U_i, F_i
2.  ∆_i = ∞
3.  // Data structures initialization
4.  for i = 1 to |V| do
5.      if i == s
6.          then vals[i] = 0, preds[i] = 0
7.          else vals[i] = ∞, preds[i] = ∞
8.  // GPU Algorithm
9.  Initialize<<<>>>(U_i, F_i) // Frontier and unsettled nodes initialization
10. Initialize(∆_i) // Threshold initialization
11. while ∆_i ≠ ∞ do
12.     // Relax the frontier nodes
13.     RELAX-KERNEL<<<>>>(vals, preds, F_i, U_i)
14.     // Update ∆_i
15.     ∆_i = DELTA-UPDATE-KERNEL<<<>>>(vals, U_i, ∆_v)
16.     // Update F_{i+1}
17.     FRONTIER-KERNEL<<<>>>(∆_i, F_i, U_i, ∆_v, vals)
Algorithm RELAX-KERNEL(vals, preds, F_i, U_i)
1.  n_idx = thread.id
2.  if F_i[n_idx] == true
3.      then for each u ∈ Γ_{n_idx} do
4.          if U_i[u] == true
5.              then Start Atomic Operations
6.                  if vals[u] > vals[n_idx] + w(n_idx, u)
7.                      then vals[u] = vals[n_idx] + w(n_idx, u)
8.                          preds[u] = n_idx
9.                  End Atomic Operations
Algorithm FRONTIER-KERNEL(∆_i, F_i, U_i, ∆_v, vals)
1.  n_idx = thread.id
2.  F_i[n_idx] = false
3.  if U_i[n_idx] == true ∧ vals[n_idx] ≤ ∆_i
4.      then U_i[n_idx] = false, F_i[n_idx] = true
The main procedure, GPU Dijkstra, uses three kernels to relax the nodes and
create the frontier and unsettled node sets. In lines 3-10 we initialize the data
structures used by the algorithm; lines 11-17 are the main loop, which stops once
all the nodes in the graph have been relaxed or the node t has been reached. The
kernel function RELAX-KERNEL relaxes all the nodes inside $F_i$ (line 13), and the
DELTA-UPDATE-KERNEL updates the $\Delta_i$ value using a parallel reduction. This
procedure is a modified version of the reduce3 procedure taken from the CUDA SDK
that comes along with the CUDA package from Nvidia. FRONTIER-KERNEL is the kernel
function that creates the F set for the next iteration.
The RELAX-KERNEL procedure updates the tentative distances of the successors of the
nodes inside the F set. Each thread processes a node $n_{idx}$, relaxing all its
unsettled successor nodes u ∈ $\Gamma_{n_{idx}}$. The relaxation at line 6 is an atomic
operation among the threads, to avoid race conditions. We need to make this
operation atomic because other threads can update the same memory location
(the same unsettled node) at the same time, generating inconsistent reads or writes.
The FRONTIER-KERNEL kernel generates the $U_{i+1}$ and $F_{i+1}$ sets. Each thread
is assigned to a node and checks whether the tentative distance of the node is less
than or equal to the $\Delta_i$ threshold and whether the node is in the $U_i$ set. In
this case the kernel removes the node from the $U_i$ set and inserts it in the $F_i$ set.
The DELTA-UPDATE-KERNEL is a simple procedure implemented to avoid a
data transfer between the CPU and the GPU. Basically, it is a parallel reduction
among the nodes in the Ui set that finds the minimum of val(u) + ∆u, as in (6.5).
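The interplay of the three kernels can be emulated sequentially in Python, with one loop iteration standing for one GPU thread. This is a sketch under our own assumptions, not the CUDA code of this work: there are no atomic operations because nothing runs concurrently, sink nodes get an infinite $\Delta_v$, and the early exit on the target t is omitted.

```python
import math

def gpu_style_dijkstra(source, successors):
    """Sequential emulation of the relax / delta-update / frontier kernels."""
    nodes = list(successors)
    vals = {v: math.inf for v in nodes}
    preds = {v: None for v in nodes}
    vals[source] = 0
    # Pre-computation: cheapest outgoing edge of every node (inf for sinks).
    min_out = {v: min((c for _, c in successors[v]), default=math.inf) for v in nodes}
    unsettled = {v: v != source for v in nodes}
    frontier = {v: v == source for v in nodes}
    while True:
        # RELAX-KERNEL: one "thread" per frontier node relaxes its unsettled successors.
        for n in nodes:
            if frontier[n]:
                for u, cost in successors[n]:
                    if unsettled[u] and vals[u] > vals[n] + cost:
                        vals[u] = vals[n] + cost
                        preds[u] = n
        # DELTA-UPDATE-KERNEL: reduction over the unsettled nodes.
        delta_i = min((vals[u] + min_out[u] for u in nodes if unsettled[u]),
                      default=math.inf)
        if delta_i == math.inf:
            break
        # FRONTIER-KERNEL: settle every unsettled node within the threshold.
        for n in nodes:
            frontier[n] = unsettled[n] and vals[n] <= delta_i
            if frontier[n]:
                unsettled[n] = False
    return vals, preds

graph = {
    "a": [("b", 7), ("c", 2)],
    "b": [("d", 1)],
    "c": [("b", 3), ("d", 8)],
    "d": [],
}
vals, preds = gpu_style_dijkstra("a", graph)
```

On a real GPU the two `for n in nodes` loops become kernel launches with one thread per node, and the comparison and update inside the relaxation must be atomic, as discussed above.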
6.4.2 Dynamic Frontier Definition
We enhanced the Frontier-Creation-Kernel to address the time dependency of our model. We need to evaluate the ∆v values at each iteration i. The main problem is that, because of the time dependency, we cannot evaluate a priori the minimum cost among the outgoing edges of the vertex v.
The only way to pre-calculate the value would be to evaluate, for each node, the minimum cost among its edges for each second of the day (the granularity of our problem), which would require a large amount of memory and a long computation time.
We define the set Ri as the set of the nodes u ∈ Ui whose tentative distance has been updated from the initial ∞ value, that is, all the possible members of the Fi+1 set at iteration i + 1. For each node r ∈ Ri we evaluate, at time τ, the outgoing minimum cost edge.
\[
\Delta_v = \min \{\, c = f_{time}(r, u, h, \tau_{r,u,h}) : r \in R_i,\ u \in \Gamma(r) \cap U_i,\ (r, u) \in E,\ h \in A \,\} \tag{6.6}
\]
where r is a node in the Ri set, u is a successor of r in the Ui set and h is a state of the automaton A. ftime is the time function evaluating the cost of the edge from r to u, in the state h of the automaton, at time τ. The ∆v values are computed at every iteration i and then take part in the evaluation of the ∆i threshold, which the FRONTIER-KERNEL uses to create the Fi set.
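As an illustration of equation 6.6, the sketch below evaluates, for every reached node r ∈ Ri, the cheapest outgoing edge towards an unsettled successor at the node's current time τ. It is a sequential Python sketch under simplifying assumptions: the automaton state h is omitted, and dynamic_delta_v, succ, base and the toy cost function f_time are hypothetical stand-ins.

```python
import math

def dynamic_delta_v(reached, unsettled, succ, f_time, tau):
    """Return {r: min cost of edges (r, u), u unsettled, evaluated at tau[r]}."""
    delta_v = {}
    for r in reached:
        costs = [f_time(r, u, tau[r]) for u in succ[r] if u in unsettled]
        delta_v[r] = min(costs, default=math.inf)
    return delta_v

# Toy time-dependent cost: edges cost twice as much during a "rush hour"
# between 08:00 and 09:00 (times in seconds from midnight).
base = {(0, 1): 60, (0, 2): 90, (1, 2): 40}

def f_time(r, u, t):
    rush = 8 * 3600 <= t < 9 * 3600
    return base[(r, u)] * (2 if rush else 1)

succ = {0: [1, 2], 1: [2], 2: []}
result = dynamic_delta_v(reached={0, 1}, unsettled={0, 1, 2}, succ=succ,
                         f_time=f_time, tau={0: 7 * 3600, 1: 8 * 3600 + 60})
```

Because the minimum is taken only over the reached nodes and only at their current time, no table of per-second minima has to be stored.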
6.4.3 GPU Time-Dependent Algorithm
Porting the algorithm to the GPU environment requires some modifications to its data structures. We cannot use dynamic data structures, which are performance killers on the GPU. To keep track of the tentative distances and the predecessors, we use two bi-dimensional arrays, times (called vals in the listings below) and preds, in which one dimension is the cardinality |V| and the other is the number h of states in which the traveler can be (foot, car, etc.). For each state of the traveler, we have an automaton Ah that regulates the transitions among the modes of the network. For a more readable notation, we call A the macro-automaton composed of all the Ah automata, one for each type of traveler.
Algorithm GPU T-D-RegL-CSPP (s, t, V, E, ftime(), ∆v)
1.  vals, preds, Ui, Fi, Ri
2.  ∆i = ∞
3.  for h = 1 to |TravelerStates| do
4.      for i = 1 to |V| do
5.          if i == s
6.              then vals[h][i] = 0, preds[h][i] = 0
7.              else vals[h][i] = ∞, preds[h][i] = ∞
8.  // GPU Algorithm
9.  // Frontier, Unsettled Nodes and R initialization
10. Initialize<<<>>>(Ui, Fi, Ri)
11. Initialize(∆i) // Threshold Initialization
12. while ∆i ≠ ∞ do
13.     // Relax the frontier nodes
14.     RELAX-KERNEL<<<>>>(vals, preds, Fi, Ui, A)
15.     // Update ∆v
16.     ∆v = DYNAMIC-∆v-KERNEL<<<>>>(vals, Ui, Ri, A)
17.     // Update ∆i
18.     ∆i = DELTA-UPDATE-KERNEL<<<>>>(vals, Ui, ∆v)
19.     // Update Fi+1
20.     FRONTIER-KERNEL<<<>>>(∆i, Fi, Ui)
Algorithm DYNAMIC-∆v-KERNEL(Ui, Ri, ∆v, vals, A, τ)
1.  nidx = thread.x.id
2.  hidx = thread.y.id
3.  if Ri[nidx] == true
4.      then τ = vals[hidx][nidx]
5.          for each u ∈ Γnidx ∩ Ui do
6.              edge e = (nidx, u)
7.              for each q′ ∈ δ(hidx, label(e)) do
8.                  if ∆v[nidx] > ftime(e, τ)
9.                      then ∆v[nidx] = ftime(e, τ)
Algorithm RELAX-KERNEL(vals, preds, Fi, Ui, A)
1.  nidx = thread.x.id
2.  hidx = thread.y.id
3.  if Fi[nidx] == true
4.      then τ = vals[hidx][nidx]
5.          for each w ∈ Γnidx do
6.              edge e = (nidx, w)
7.              for each q′ ∈ δ(hidx, label(e)) do
8.                  if Ui[w] == true
9.                      then Start Atomic Operations
10.                         vals[hidx][w] = min(vals[hidx][w], vals[hidx][nidx] + ftime(e, τ))
11.                         preds[hidx][w] = nidx
12.                     End Atomic Operations
The GPU T-D-RegL-CSPP algorithm behaves like the original one, except for the Dynamic-∆v-Kernel at line 16, described in 6.4.2. The Delta-Update-Kernel and Frontier-Kernel are the same as those described in 6.4.1.2; only the input data change. In Relax-Kernel we introduced the automaton A that regulates, at line 7, the transition to the nodes w ∈ Γ. Here we have two indexes to address the vals and preds matrices.
The hidx index represents the traveler's state and the nidx index the node to expand. In this case we use a bi-dimensional block indexing to create the CUDA computational grid on the GPU. Lines 9-12 contain the atomic updates of the tentative distances and the predecessors. The τ variable (line 4) represents the current time, used to evaluate the edge cost with the ftime function in line 10. The Dynamic-∆v-Kernel exploits the same bi-dimensional indexing used for the Relax-Kernel: in line 5 we select the successors of the nidx vertex that can be in the i + 1 iteration, subject to the automaton (line 7); we evaluate the cost of the edge at line 8, using the ftime function, and update the ∆v values, if necessary, in line 9.
6.5 Shared Memory Algorithm
Analyzing the serial algorithm in detail, we can observe that, for each traveler state h, we can compute a T-D-RegL-CSPP in a state-space of labels at least as large as the graph's node set V. From this perspective, we can describe the state-space as a bi-dimensional matrix with h rows, one for each traveler state, and |V| columns. For each state h, we can compute a Time-Dependent-RegL-CSPP, as described by the algorithm T-D-RegL-CSPP, completely decoupled from the others. In fact, each state has independent data structures (the h-th row of the matrices of the predecessors and the tentative distances) and an independent priority queue. Given these peculiarities, we can rewrite the algorithm as follows:
Algorithm T-D-RegL-CSPP2 (s, t, V, E, ftime, A, τstart)
1.  Queue Q = null // Generated States of G× queue
2.  vals, preds
3.  // Data Structures Initialization
4.  for h = 1 to |TravellerStates| do
5.      for n = 1 to |V| do
6.          if n == s
7.              then vals[h][n] = τstart, preds[h][n] = 0
8.              else vals[h][n] = ∞, preds[h][n] = ∞
9.  // Algorithm
10. for h = 1 to |TravellerStates| do
11.     // Queue Initialization
12.     Q.Insert((s))
13.     while Q ≠ null do
14.         n = Q.First()
15.         τ = vals[h][n]
16.         if (n, h) ∈ T
17.             then break
18.         for each u ∈ Γn do
19.             edge e = (n, u)
20.             for q′ ∈ δ(h, label(e)) do
21.                 if vals[h][u] > vals[h][n] + ftime(e, τ)
22.                     then vals[h][u] = vals[h][n] + ftime(e, τ)
23.                         preds[h][u] = n
24.                         if Q.InQueue(u)
25.                             then Q.Update(u)
26.                             else Q.Insert(u)
In lines 4-8 we initialize the tentative distance matrix vals and the predecessor matrix preds. For each traveler state h we initialize the start node to τstart and its predecessor to 0.
In lines 10-26 we evaluate, for each traveler state, the T-D-RegL-CSPP as done in the original algorithm. As we can see, every state h is independent of the others, allowing us to evaluate the states in parallel.
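The per-state search can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: it assumes that travel times satisfy the FIFO property (leaving later never means arriving earlier), it replaces Q.Update with lazy re-insertion into a binary heap, and the names td_dijkstra_state, solve_all_states, succ, label and delta are hypothetical.

```python
import heapq
import math

def td_dijkstra_state(h, s, succ, f_time, delta, label, tau_start, targets=()):
    """Time-dependent label-setting search for one traveller state h."""
    vals = {s: tau_start}   # tentative arrival times (missing key = infinity)
    preds = {s: None}
    queue = [(tau_start, s)]
    while queue:
        tau, n = heapq.heappop(queue)
        if tau > vals.get(n, math.inf):
            continue                      # stale queue entry, skip it
        if (n, h) in targets:
            break
        for u in succ.get(n, ()):
            # The automaton must allow this edge label in state h.
            if not delta(h, label(n, u)):
                continue
            arrival = tau + f_time(n, u, tau)
            if arrival < vals.get(u, math.inf):
                vals[u] = arrival
                preds[u] = n
                heapq.heappush(queue, (arrival, u))  # lazy Q.Update
    return vals, preds

def solve_all_states(states, *args, **kwargs):
    # Each state only touches its own vals/preds, so the runs are independent.
    return {h: td_dijkstra_state(h, *args, **kwargs) for h in states}

# Toy instance: edge (0, 2) is a "car" edge, the others are "foot" edges.
succ = {0: [1, 2], 1: [2], 2: []}
cost = {(0, 1): 10, (1, 2): 10, (0, 2): 5}
label = lambda n, u: "car" if (n, u) == (0, 2) else "foot"
delta = lambda h, lab: {h} if (h == "car" or lab == "foot") else set()
f_time = lambda n, u, t: cost[(n, u)]
res = solve_all_states(["foot", "car"], 0, succ, f_time, delta, label, 100)
```

Since every call works on private dictionaries, the loop in solve_all_states can be distributed over threads or processes without synchronization.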
6.5.1 OpenMP Porting
The OpenMP API [113] seems to be the best option to parallelize the algorithm proposed in the previous paragraph. In fact, we can easily parallelize the method using the fork/join programming model, assigning a traveler state to each thread. With a few pre-processor directives, we can exploit the parallelism inside multi-core/multi-threaded CPUs.
Algorithm T-D-RegL-CSPP OMP (s, t, V, E, ftime, A, τstart)
1.  Queue Q = null // Generated States of G× queue
2.  vals, preds
3.  // Data Structures Initialization
4.  for h = 1 to |TravellerStates| do
5.      for n = 1 to |V| do
6.          if n == s
7.              then vals[h][n] = τstart, preds[h][n] = 0
8.              else vals[h][n] = ∞, preds[h][n] = ∞
9.  // Algorithm
10. SetThreads(|TravellerStates|)
11. # start parallel region
12. h = GetThreadID()
13. // Queue Initialization
14. Q.Insert((s))
15. while Q ≠ null do
16.     n = Q.First()
17.     τ = vals[h][n]
18.     if (n, h) ∈ T
19.         then break
20.     for each u ∈ Γ(n) do
21.         edge e = (n, u)
22.         for q′ ∈ δ(h, label(e)) do
23.             if vals[h][u] > vals[h][n] + ftime(e, τ)
24.                 then vals[h][u] = vals[h][n] + ftime(e, τ)
25.                     preds[h][u] = n
26.                     if Q.InQueue(u)
27.                         then Q.Update(u)
28.                         else Q.Insert(u)
29. # end parallel region
As we can see, the modifications to the code are minimal. We introduce only the pre-processor directives, written in pseudo-code, in lines 10-12. In line 10 we spawn a thread for each state. In line 11 we declare the parallel section (fork). In line 12 we simply get the thread id and use it to index the state. Finally, in line 29 we close the parallel section (join).
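The same one-worker-per-state pattern can be mimicked outside OpenMP. The sketch below uses Python's concurrent.futures as a stand-in for the parallel region; solve_state is a hypothetical placeholder for the per-state search, and because of CPython's global interpreter lock the sketch illustrates the fork/join structure rather than an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_state(h):
    # Placeholder for the per-state search: returns private results for state h.
    vals = {"state": h, "cost": 10 * (h + 1)}
    return vals

def solve_states_parallel(num_states):
    # Fork: one worker per traveller state (SetThreads + parallel region).
    with ThreadPoolExecutor(max_workers=num_states) as pool:
        results = list(pool.map(solve_state, range(num_states)))
    # Join: leaving the with-block waits for every worker to finish.
    return results

results = solve_states_parallel(4)
```

As in the pseudo-code, the worker index plays the role of the thread id and selects the state to process; no data is shared between workers.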
6.6 Computational Results
We tested our parallel methods, reporting the Speed-Up factor with respect to the serial version of the algorithm. Our test machine is a workstation equipped with an Intel Core i7 920 @ 2.9 GHz with 6 Gigabytes of RAM and an Nvidia GTX570 GPU with 1.280 Gigabytes of GDDR5 memory on board and 480 CUDA Cores @ 1.464 GHz. For each instance we report the serial time, the corresponding parallel time and the Speed-Up factor obtained. For each instance, we evaluated the performance for an increasing number of traveller states h (2, 3, 4, 6, 8) and, consequently, for a bigger state-space for the algorithm. The time horizon is one day and the granularity of time is expressed in seconds (one day = 86400 seconds). The Speed-Up factor is computed as SpeedUp = Time_serial / Time_parallel.
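As a small worked example of the formula, the following Python snippet computes the Speed-Up for the Berlin instance with h = 8, using the serial, OpenMP and CUDA times reported in Tables 6.2 and 6.3. The function name speed_up is only illustrative.

```python
def speed_up(time_serial, time_parallel):
    """Speed-Up factor: values above 1 mean the parallel version is faster."""
    return time_serial / time_parallel

# Berlin instance, h = 8 (times taken from Tables 6.2 and 6.3).
omp = speed_up(18.62190, 5.84201)
gpu = speed_up(18.62190, 90.10293)
print(f"OpenMP: {omp:.1f}x, CUDA: {gpu:.2f}x")
```

A factor below 1, as obtained for the CUDA time, means that the parallel version is slower than the serial one.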
6.6.1 Data Sets Creation
The data set used as benchmark for the methods is a set of 10 instances generated as follows:
• we selected 10 urban centres from the OSM [114] repository,
• we extracted the corresponding graph, keeping track of the street classes (primary, secondary, etc.),
• we created 4 dense random graphs of 500000, 400000, 300000 and 200000 nodes, for the modes over the geographical network,
• we connected all the nodes of the random graphs to the geographical network with transition edges,
• we connected the modes among them with transition edges.
We used the street classes from OSM to calculate the journey time, using the speed limits of each class and the speed profile described in 6.2.4 as penalties. For the other modes, which we assume to represent the public transport networks (metro, bus, trams, etc.), we used a constant speed limit and the speed profiles for the public transport described in the previous sections. The path is calculated as depicted in figure 6.4, from one extreme point of the area to another, to force the method to visit the whole graph.
Figure 6.4: Test Path in Berlin
6.6.2 Results
In this section we provide the experimental results for the instances described above. First, we give some data on the dimensions of the instances, then the computational times for the serial implementation, the GPU implementation and the parallel CPU implementation. We indicate with Φ = |TravelerStates| · |V| the dimension of the state-space.
Figure 6.5: Serial times vs OpenMP times for the Berlin instance
Figure 6.6: Speed-Up for different h values for the Los Angeles instance
Instance      Nodes     Edges      Modes  Φ (h = 2)  Φ (h = 3)  Φ (h = 4)  Φ (h = 6)  Φ (h = 8)
Berlin        1594184   26360404   5      3188368    4782552    6376736    9565104    12753472
London        1962806   27134087   5      3925612    5888418    7851224    11776836   15702448
Los Angeles   1845707   26957936   5      3691414    5537121    7382828    11074242   14765656
Melbourne     1636709   26438780   5      3273418    4910127    6546836    9820254    13093672
Milan         1515734   26154972   5      3031468    4547202    6062936    9094404    12125872
Moscow        1621233   26423233   5      3242466    4863699    6484932    9727398    12969864
New York      1650314   26519156   5      3300628    4950942    6601256    9901884    13202512
Paris         1654535   26457090   5      3309070    4963605    6618140    9927210    13236280
Rome          1490676   26098377   5      2981352    4472028    5962704    8944056    11925408
Tokyo         2600054   29232779   5      5200108    7800162    10400216   15600324   20800432
Table 6.1: Test instances dimensions
Serial
Instance      h=2       h=3       h=4       h=6        h=8
Berlin        3.90491   7.19491   9.16738   12.35290   18.62190
London        4.27784   7.43718   9.51450   13.39980   20.09110
Los Angeles   4.31991   6.99828   9.12968   12.88160   20.42397
Melbourne     4.00122   6.65853   8.29722   10.70190   16.39150
Milan         3.98671   6.97139   9.37055   12.58200   18.68020
Moscow        3.91440   6.70570   9.01145   11.84520   18.17520
New York      4.02293   6.91773   9.29118   12.68630   19.21360
Paris         4.17275   7.10199   9.21067   12.81390   19.63470
Rome          3.88276   6.86815   8.94360   12.50710   18.54690
Tokyo         2.91007   6.02751   9.13981   13.20910   16.69850

OpenMP
Instance      h=2       h=3       h=4       h=6       h=8
Berlin        4.00861   4.83804   4.64429   5.04666   5.84201
London        4.12370   4.51029   4.80851   5.20207   6.24187
Los Angeles   4.22763   4.33566   4.63512   5.00295   5.91230
Melbourne     3.97519   4.33418   4.68507   5.03967   5.89474
Milan         4.05487   4.42006   4.72360   5.10111   5.82905
Moscow        3.99721   4.29074   4.65833   4.92247   5.77115
New York      3.85927   4.25159   4.71184   4.96807   5.82088
Paris         4.03054   4.39365   4.73818   5.13069   5.87103
Rome          3.76575   4.09594   4.49787   5.02301   5.71587
Tokyo         2.53121   4.39309   4.97617   5.24188   5.91094

SpeedUp
Instance      h=2     h=3     h=4     h=6     h=8
Berlin        1.0 X   1.5 X   2.0 X   2.4 X   3.2 X
London        1.0 X   1.6 X   2.0 X   2.6 X   3.2 X
Los Angeles   1.0 X   1.6 X   2.0 X   2.6 X   3.5 X
Melbourne     1.0 X   1.5 X   1.8 X   2.1 X   2.8 X
Milan         1.0 X   1.6 X   2.0 X   2.5 X   3.2 X
Moscow        1.0 X   1.6 X   1.9 X   2.4 X   3.1 X
New York      1.0 X   1.6 X   2.0 X   2.6 X   3.3 X
Paris         1.0 X   1.6 X   1.9 X   2.5 X   3.3 X
Rome          1.0 X   1.7 X   2.0 X   2.5 X   3.2 X
Tokyo         1.1 X   1.4 X   1.8 X   2.5 X   2.8 X
Table 6.2: Computational results on CPU
Serial
Instance      h=2       h=3       h=4       h=6        h=8
Berlin        3.90491   7.19491   9.16738   12.35290   18.62190
London        4.27784   7.43718   9.51450   13.39980   20.09110
Los Angeles   4.31991   6.99828   9.12968   12.88160   20.42397
Melbourne     4.00122   6.65853   8.29722   10.70190   16.39150
Milan         3.98671   6.97139   9.37055   12.58200   18.68020
Moscow        3.91440   6.70570   9.01145   11.84520   18.17520
New York      4.02293   6.91773   9.29118   12.68630   19.21360
Paris         4.17275   7.10199   9.21067   12.81390   19.63470
Rome          3.88276   6.86815   8.94360   12.50710   18.54690
Tokyo         2.91007   6.02751   9.13981   13.20910   16.69850

CUDA
Instance      h=2        h=3        h=4        h=6        h=8
Berlin        15.29472   35.38592   45.31485   60.45892   90.10293
London        15.85713   34.75631   44.36134   61.98234   100.42816
Los Angeles   14.84052   35.02742   44.23401   62.09213   100.90235
Melbourne     14.94752   34.46021   46.28461   61.45213   88.16702
Milan         15.95023   33.67302   42.01237   62.23768   90.56330
Moscow        14.95064   31.10204   43.47502   60.14586   91.09123
New York      15.88630   36.67120   44.65321   60.27451   92.34551
Paris         15.68335   34.05063   45.75230   63.35672   91.76812
Rome          15.95033   33.12574   45.78124   61.01203   90.01445
Tokyo         14.12753   35.69305   42.87932   60.36789   86.13599
Table 6.3: Computational results on GPU
6.7 Considerations and Future Work
As emerged from the experimental tests, the GPU algorithm is not effective for this particular problem. The main reason is that the geographical network is a very sparse graph and the cardinality of the frontier set Fi is too small to allow the GPU to exploit its massive parallelism. The multi-core OpenMP version, instead, yields results close to the theoretical speed-up for a quad-core processor on the bigger instances. The other bottleneck for the GPU performance is the load balancing inside the device, where most of the computational kernel is composed of serial operations. Recent Nvidia chips (devices of compute capability 3.5 and higher, which include the Maxwell series) partially solve this problem through a feature called Dynamic Parallelism, which allows a kernel function to launch another kernel and makes load balancing over non-uniform data structures (e.g. adjacency lists) more effective. As soon as possible we will provide a method exploiting this feature.
Chapter 7
Membership Overlay Problem
In this chapter we consider a parallel version of the Subgradient Method, used to solve the Lagrangean Dual Problem. We chose a network design problem, the Membership Overlay Problem (MOP), which arises in Peer-to-Peer networks. We designed three parallel algorithms for the GPU Computing, Shared Memory and Distributed Memory environments, exploiting CUDA, OpenMP and MPI, respectively.
7.1 Introduction
Peer-to-Peer (P2P) networks currently account for a conspicuous part of internet data traffic. This model is the counterpart of the well-known and widely studied client-server model. A large number of network applications (legal or not) have already adopted this network model.
The proliferation of this kind of network has brought some challenging problems related to the P2P model to the attention of the academic community. In the literature, for example, there is no precise and coherent definition of the model, and a precise and exhaustive description is hard to find.
Informally speaking, a P2P network is a totally decentralized network, composed of peers that share, exchange and distribute data and information. The success of this kind of network is related to the anonymity of its members, which allows, in some cases, trades that are not entirely legal.
Another peculiarity of P2P is the dynamic topology of the network. A peer can connect to the network for a limited time and share its data or its bandwidth with the others. Once it disconnects, the topology changes and some routes for the data packets or connections must be modified to maintain the network performance.
This strictly dynamic feature is definitely interesting and has raised critical problems for some applications. The P2P paradigm can be a fundamental part of large-scale distributed computation infrastructures or grid computing applications. The dynamic nature of the network topology implies a significant degradation of the performance or a non-optimal configuration of communications and connections, compromising the system scalability.
Most P2P applications are based on the TCP/IP communication protocol and, virtually, all the peers are connected to each other. A P2P network, instead, lives at the application level of the ISO/OSI stack and has its own routing and topology.
In this scenario, network design problems seem critical, mainly the ones dealing with the optimization of the connections to enhance the network's performance.
The Membership Overlay Problem (MOP) is one of the network design problems that can arise in a P2P environment. The problem consists in the creation of an overlay network that maximizes the bandwidth throughput among the peers inside the P2P application.
In this chapter we describe a Lagrangean Relaxation for the problem and a Distributed Subgradient for solving the corresponding Lagrangean Dual Problem, originally proposed by Boschetti et al. [115]. We propose three parallel algorithms based on this Subgradient, designed to exploit three de facto standard parallel programming models, CUDA, OpenMP and MPI, for GPU, Shared Memory and Distributed Memory environments respectively.
7.2 Membership Overlay Problem
In this section we illustrate the Membership Overlay Problem, giving first an informal definition and then a mathematical formulation that describes its inherent characteristics. MOP is a network design problem: it focuses on the creation of a network topology that satisfies precise performance, fault tolerance or efficiency constraints.
Network Design is currently one of the most interesting and studied classes of problems in the CO field. It affects many real-world applications such as supply chain logistics or telecommunications.
The problem consists in maximizing the throughput of a P2P network where the nodes (peers) are described by a percentage of on-line time and a bandwidth to the internet. The edges are the connections among these nodes, each with a capacity. The solution of the problem is a subset of these edges that maximizes the throughput, creating an overlay network that defines a topology among the nodes.
Before the mathematical formulation, we give an example to illustrate the problem.
Figure 7.1 depicts a graph representing a network with the characteristics cited above. Without loss of generality, from now on we represent the network as a graph with nodes and edges.
Figure 7.1: P2P network with 4 nodes (Node 1: w = 1024, p = 0.70; Node 2: w = 512, p = 0.90; Node 3: w = 1024, p = 0.60; Node 4: w = 2048, p = 0.90).
Every node has a probability p of being on-line and a bandwidth w, representing the bandwidth of its internet connection. The graph is complete, due to the TCP/IP protocol. The edges are not oriented, and we can assume that they are bi-directional. Each edge has a capacity b equal to the minimum of the w values of the nodes that it connects, as described in figure 7.2.
Figure 7.2: Edges bandwidth values.
Every edge has a probability p′ equal to the product of the p probabilities of its endpoints (figure 7.3). Once the graph is characterized, we need to set other parameters that constrain the problem's solution. Birattari et al. [116] and Rardin and Uzsoy [117] proposed these parameters. To guarantee a minimum QoS (Quality of Service), it is necessary to set a bandwidth lower bound l for each edge, 14 Kbps for instance, as shown in figure 7.4.
Figure 7.3: Edges p’ values.
Figure 7.4: Edges lower bound l.
Similarly, it is necessary to set a bandwidth upper bound u for each edge, to avoid the congestion of the node and of the network. Following the results cited above, we can set this value to 256 Kbps (figure 7.5).
Figure 7.5: Edges upper bound u.
Figure 7.6 shows the optimal solution for the proposed example.
Figure 7.6: MOP Optimal Solution.
In more detail, the value of an edge is given by uij p′ij and the optimal value of the problem is the sum of these values over the edges in the solution, Σ uij p′ij. In figure 7.6 we show the problem's solution, with the selected edges colored in green. The final value is 775.68 and the edge (2, 3) has been rejected from the solution. The resulting sub-graph is the overlay network that maximizes the throughput of the P2P network.
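The arithmetic of the example can be reproduced with a few lines of Python. The node values are those of the four-node example (Node 1: w = 1024, p = 0.70; Node 2: w = 512, p = 0.90; Node 3: w = 1024, p = 0.60; Node 4: w = 2048, p = 0.90) with u = 256 on every edge; the variable names are only illustrative.

```python
from itertools import combinations

w = {1: 1024, 2: 512, 3: 1024, 4: 2048}   # node bandwidths
p = {1: 0.70, 2: 0.90, 3: 0.60, 4: 0.90}  # node on-line probabilities
u_bound = 256                             # per-edge upper bound (Kbps)

edges = list(combinations(sorted(w), 2))          # complete graph
b = {(i, j): min(w[i], w[j]) for i, j in edges}   # edge capacity
p_edge = {(i, j): p[i] * p[j] for i, j in edges}  # edge probability p'

solution = [e for e in edges if e != (2, 3)]      # edge (2, 3) is rejected
value = sum(u_bound * p_edge[e] for e in solution)
# Bandwidth used at each node must not exceed the node's own bandwidth.
used = {i: sum(u_bound for e in solution if i in e) for i in w}
print(round(value, 2), used)
```

Node 2 is the binding one: with w = 512 it can carry only two edges at 256 each, so its least valuable edge, (2, 3) with p′ = 0.54, is the one left out.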
7.2.1 Definition and Mathematical Formulation
We are given a graph G = (V, E), with n = |V|, where the vertices are the peers of the P2P network and the edges are the possible connections among the vertices. Two nodes i and j are connected by the edge (i, j) if they can send messages to each other using the underlying routing structure (typically the internet). Each node i can enter and exit the network, according to the P2P model, and, when it is on-line, can share a limited amount of bandwidth. Each node is characterized by two weights, pi and wi: respectively, the connection time, expressed as a fraction (1 = always on-line, 0 = never on-line) as described by Saroiu et al. [118], and the available bandwidth of its connection. δ(i) is the set of neighbors of the node i.
The Membership Overlay Problem consists in finding a sub-graph G′ = (V, E′) of G. An edge of G′ indicates that its two endpoints decide to allocate a part of their bandwidth to communicate with each other. If bi and bj are the bandwidths of nodes i and j, the available bandwidth for the edge (i, j) is bij = min{bi, bj}. The two bandwidth values can be equal to the wi and wj values, saturating the nodes, or lower, depending on the other nodes in the graph. In any case, we define an upper bound uij and a lower bound lij to avoid the network's congestion and to guarantee a minimum QoS, respectively.
The graph G′ obtained will have the following properties:
1. the global throughput is maximized;
2. the diameter of G′ is logarithmic, and the graph is connected;
3. the total bandwidth used by each node i is less than or equal to bi.
From these first considerations, we can assume that each node i does not need global knowledge of the graph, but only the state of the nodes in δ(i).
Following the definition given by [117], we can classify MOP as a control problem, which ‘must be solved frequently and it involves decision over a relatively short horizon’. Solutions for this kind of problem ‘must be obtained in near real time, algorithms must run in fractions of seconds’ and ‘quality matters somewhat less’ than speed.
This kind of problem is often linked to critical applications in robotics or automation in general, as well as in high-precision tools or car control units. Taking these factors into account, it seems mandatory to design fast and effective algorithms that give a good solution to the problem in a short execution time.
In the next section we provide a Mixed Integer formulation for the static version of MOP (SMOP); from this formulation we derive a polynomial upper bound and a relaxation framework.
7.2.2 MIP Formulation
The Static Membership Overlay Problem can be formulated as follows. We have two sets of decision variables, {xij} and {ξij}, (i, j) ∈ E. The continuous variable xij defines the bandwidth between i and j, with 0 ≤ xij ≤ uij. The binary variable ξij is 1 if the edge (i, j) is used to connect the nodes, 0 otherwise. The MIP formulation is the following:
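\[
\begin{aligned}
z_{SMOP} = \max\ & \sum_{(i,j) \in E} p_{ij} x_{ij} && && (1) \\
\text{s.t.}\ & \sum_{j \in \delta(i)} x_{ij} \le b_i, && i \in V && (2) \\
& l_{ij} \xi_{ij} \le x_{ij} \le u_{ij} \xi_{ij}, && (i,j) \in E && (3) \\
& \xi_{ij} \in \{0, 1\}, && (i,j) \in E && (4)
\end{aligned}
\tag{7.1}
\]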
where pij = pi × pj for each edge (i, j) ∈ E and δ(i) represents the neighborhood of i in G (in our case δ(i) = V \ {i}, since G is complete). The objective function (1) maximizes the total bandwidth, given by the sum of the bandwidths assigned to each connection (i, j) ∈ E weighted by their up-times (on-line times) p. The constraints (2) ensure that the bandwidth assigned to the node i does not exceed the limit bi. The constraints (3) force xij to lie between lij and uij when the edge is selected and to be 0 otherwise.
The Static Membership Overlay Problem is NP-hard: setting lij = uij for each edge (i, j) ∈ E, SMOP can be described as a Multidimensional Knapsack Problem:
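\[
\begin{aligned}
z_{MKP} = \max\ & \sum_{(i,j) \in E} p_{ij} u_{ij} \xi_{ij} && && (5) \\
\text{s.t.}\ & \sum_{j \in \delta(i)} u_{ij} \xi_{ij} \le b_i, && i \in V && (6) \\
& \xi_{ij} \in \{0, 1\}, && (i,j) \in E && (7)
\end{aligned}
\tag{7.2}
\]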
This is the generalization of the 0-1 Knapsack Problem in which the bin has more than one dimensional constraint and the objective is to maximize the use of the bin.
7.3 Linear and Lagrangean SMOP Relaxations
7.3.1 LP Relaxation
The SMOP LP relaxation, zLP , is the following:
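\[
\begin{aligned}
z_{LP} = \max\ & \sum_{(i,j) \in E} p_{ij} x_{ij} && && (8) \\
\text{s.t.}\ & \sum_{j \in \delta(i)} x_{ij} \le b_i, && i \in V && (9) \\
& l_{ij} \xi_{ij} \le x_{ij} \le u_{ij} \xi_{ij}, && (i,j) \in E && (10) \\
& 0 \le \xi_{ij} \le 1, && (i,j) \in E && (11)
\end{aligned}
\tag{7.3}
\]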
We relaxed the integrality constraint (4), making it continuous: 0 ≤ ξij ≤ 1. This relaxation gives an upper bound on the integer solution of SMOP, which we will use to evaluate the effectiveness of the Lagrangean bound described in the next section. We can easily compute the LP relaxation using a linear solver (CoinMP or CPLEX).
7.3.2 Lagrangean Relaxation
The Lagrangean relaxation of SMOP can be formulated by associating a non-negative penalty λi with each constraint (2). The formulation, called zLR(λ), is the following:
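\[
\begin{aligned}
z_{LR}(\lambda) = \max\ & \sum_{(i,j) \in E} p'_{ij} x_{ij} + \sum_{i \in V} b_i \lambda_i && && (12) \\
\text{s.t.}\ & l_{ij} \xi_{ij} \le x_{ij} \le u_{ij} \xi_{ij}, && (i,j) \in E && (13) \\
& \xi_{ij} \in \{0, 1\}, && (i,j) \in E && (14)
\end{aligned}
\tag{7.4}
\]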
that is equivalent to the problem:
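\[
\begin{aligned}
z_{LR}(\lambda) = \max\ & \sum_{(i,j) \in E} p'_{ij} x_{ij} + \sum_{i \in V} b_i \lambda_i && && (15) \\
\text{s.t.}\ & 0 \le x_{ij} \le u_{ij}, && (i,j) \in E && (16)
\end{aligned}
\tag{7.5}
\]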
where p′ij = pij − λi − λj and the {ξij} variables are not necessary.
Given λ, the optimal value of zLR(λ) is computed according to the following observations:
• if p′ij ≥ 0 we use all the available bandwidth and the edge is part of the solution: ξij = 1 and xij = uij,
• if p′ij < 0, the connection is discarded: ξij = 0 and xij = 0.
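The two observations give a closed-form evaluation of zLR(λ), sketched below in Python. The function name solve_lagrangean and the dictionary-based data layout are assumptions made for illustration; the instance is the four-node example of Section 7.2, with bi = wi and uij = 256.

```python
from itertools import combinations

def solve_lagrangean(edges, p, u, b, lam):
    """Evaluate z_LR(lambda) of problem (7.5); return (value, x)."""
    x = {}
    value = sum(b[i] * lam[i] for i in b)
    for (i, j) in edges:
        reduced = p[(i, j)] - lam[i] - lam[j]            # p'_ij
        x[(i, j)] = u[(i, j)] if reduced >= 0 else 0.0   # all or nothing
        value += reduced * x[(i, j)]
    return value, x

# Four-node example: node bandwidths and on-line probabilities.
b = {1: 1024, 2: 512, 3: 1024, 4: 2048}
prob = {1: 0.70, 2: 0.90, 3: 0.60, 4: 0.90}
edges = list(combinations(sorted(b), 2))
p = {(i, j): prob[i] * prob[j] for i, j in edges}
u = {e: 256 for e in edges}

z0, x0 = solve_lagrangean(edges, p, u, b, {i: 0.0 for i in b})
z1, x1 = solve_lagrangean(edges, p, u, b, {1: 0.0, 2: 0.6, 3: 0.0, 4: 0.0})
```

With all the penalties at zero every edge is selected and the bound is loose; penalizing the saturated node 2 removes the edge (2, 3) and lowers the bound to the value of the solution shown in the example.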
The associated Lagrangean Dual Problem can be described as follows:
\[
z_{LR}(\lambda^*) = \min \{ z_{LR}(\lambda) : \lambda \ge 0 \} \tag{7.6}
\]
7.4 Subgradient Algorithm
The Subgradient algorithm proposed by Shor in [43] and successfully used by [39–42] is an iterative procedure that, at each iteration k, computes a new approximation λk+1 of the Lagrangean multipliers in such a way that, for k → +∞, λk is an optimal or near-optimal solution of the corresponding Lagrangean Dual. Let xk be the solution of cost zLR(λk) obtained at iteration k by solving problem 7.5 with λ = λk. The Lagrangean multipliers can be updated as follows:
\[
\lambda_i^{k+1} = \max\{0,\ \lambda_i^k + \alpha^k g_i^k\}, \quad i \in V \tag{7.7}
\]
where:
\[
g_i^k = \sum_{j \in \delta(i)} x_{ij}^k - b_i, \quad i \in V \tag{7.8}
\]
is the i-th component of the subgradient g^k and α^k is the length of the step along the search direction given by the subgradient itself. Several versions of the step size α^k update have been proposed in the literature. In this chapter we consider the standard one proposed by Polyak [119] and a constant one (called quasi-constant), designed for a fully distributed algorithm and proposed by [115].
The standard update rule can be described as follows:
\[
\alpha^k = \beta^k \frac{\bar{z} - z_{LR}(\lambda^k)}{\|g^k\|^2} \tag{7.9}
\]
where z̄ is an overestimate of zLR(λ∗). Polyak proved the convergence of the method for ε ≤ β^k ≤ 2. In many applications it is possible to avoid the overestimate of the Lagrangean function and use instead:
\[
\alpha^k = \beta^k \frac{0.001\, z_{LR}(\lambda^k)}{\|g^k\|^2} \tag{7.10}
\]
Usually, the β^k parameter is initialized with a problem-dependent value and updated (e.g. β^{k+1} = 0.5β^k) if zLR(λ^k) has not improved after an arbitrary number of iterations.
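The pieces introduced so far can be combined in a short Python sketch of the subgradient loop: the closed-form solution of problem 7.5, the subgradient (7.8), a step size in the style of (7.10) and the projected update (7.7). It is a sketch under stated assumptions, not the implementation evaluated in this chapter: the parameters scale, patience and iterations are hypothetical, and the instance is the four-node example of Section 7.2 with bi = wi and uij = 256.

```python
from itertools import combinations

def subgradient(edges, p, u, b, iterations=200, beta=1.0, scale=0.01,
                patience=10):
    lam = {i: 0.0 for i in b}
    best, stall = float("inf"), 0
    for _ in range(iterations):
        # Solve the relaxed problem (7.5) for the current multipliers.
        red = {e: p[e] - lam[e[0]] - lam[e[1]] for e in edges}
        x = {e: (u[e] if red[e] >= 0 else 0.0) for e in edges}
        z = sum(red[e] * x[e] for e in edges) + sum(b[i] * lam[i] for i in b)
        if z < best - 1e-9:
            best, stall = z, 0
        else:
            stall += 1
            if stall >= patience:      # halve beta when the bound stalls
                beta, stall = 0.5 * beta, 0
        # Subgradient (7.8): bandwidth used at node i minus its capacity.
        g = {i: sum(x[e] for e in edges if i in e) - b[i] for i in b}
        norm2 = sum(v * v for v in g.values())
        if norm2 == 0:
            break
        alpha = beta * scale * z / norm2
        # Projected update (7.7): multipliers must stay non-negative.
        lam = {i: max(0.0, lam[i] + alpha * g[i]) for i in b}
    return best, lam

b = {1: 1024, 2: 512, 3: 1024, 4: 2048}
prob = {1: 0.70, 2: 0.90, 3: 0.60, 4: 0.90}
edges = list(combinations(sorted(b), 2))
p = {(i, j): prob[i] * prob[j] for i, j in edges}
u = {e: 256 for e in edges}
best, lam = subgradient(edges, p, u, b)
```

Every value of zLR(λ) with λ ≥ 0 is an upper bound on the SMOP optimum, so the best bound returned can never fall below the value of a feasible solution; how quickly it approaches the optimum depends on the chosen step parameters.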
Our implementation of the standard subgradient algorithm implements the following update rule for the Lagrangean penalties:
\[
\lambda_i^{k+1} = \max\left\{0,\ \lambda_i^k + \beta\, \frac{0.01\, z_{LR}(\lambda)}{\|g\|^2}\, g_i^k \right\} \tag{7.11}
\]
7.5 Distributed Subgradient
The Lagrangean relaxation proposed in 7.3.2 does not take into account the dynamic aspect of a P2P network, since it considers the topology of the network to be static. In a P2P environment, instead, the nodes join and leave the network dynamically, so no node can have a consistent view of the graph's status and the optimization process described above is not effective.
This aspect makes it impossible to build an optimization process based on global parameters (e.g. the subgradient g).
Analyzing the P2P environment in depth, we can observe that:
• each node has a local knowledge of the network: it is aware only of the status of its neighbors,
• each node can optimize its parameters only by looking at the behavior of its neighbors,
• the potentially infinite time horizon of the network leads to a continuous optimization, based on the changing graph topology (e.g. selecting a new set of connections because some nodes have left the network),
• each node, computing its local optimization step, can contribute to the global optimization of the graph in an asynchronous fashion.
From these observations we can draw some guidelines for the design of a distributed algorithm:
• the optimization step is local: each node computes its values through its knowledge of the network and its connections,
• the optimization process must be asynchronous,
• the optimization process must be distributed among the nodes of the graph,
• once a node has computed its optimization step, it communicates its results to the others and the overlay network is created. When one or more nodes leave the network, each node optimizes again, considering the new topology.
These considerations allow us to outline the behavior of a distributed subgradient:
• each node computes its Lagrangean penalty λi, according to its local knowledge,
• each node computes its part of the global objective function,
• each node updates its penalty using a local method (e.g. ‖g‖ cannot be used),
• each node computes its optimization step using the updated penalties received from the other nodes.
Following the previous considerations, we report a dynamic, asynchronous and fully distributed subgradient. This approach does not solve the Lagrangean problem of the original graph G, but a local problem for each node of the graph. Given:
\[
G_h(V_h, E_h) \tag{7.12}
\]
containing only the node h and its neighborhood δ(h), with:
\[
V_h = \delta(h) \cup \{h\}, \qquad E_h = \{(i, j) \in E : i, j \in V_h\} \tag{7.13}
\]
and a set of Lagrangean penalties λ = {λ0, . . . , λk} with k = 1, . . . , |Vh| for each subgraph Gh(Vh, Eh), h ∈ V, we solve the following problem:
\[
\begin{aligned}
z_{LR}^{h}(\lambda) = \max\ & \sum_{(i,j) \in E_h} \frac{1}{2}\, p'_{ij} x_{ij} + b_h \lambda_h && && (17) \\
\text{s.t.}\ & 0 \le x_{ij} \le u_{ij}, && (i,j) \in E_h && (18)
\end{aligned}
\tag{7.14}
\]
This problem is derived directly from 7.5, considering only the node h. The global value of the objective function is given by zLR(λ) = Σ_{h∈V} z^h_LR(λ). The coefficient 1/2 is included to avoid the double evaluation of the node and its edges.
7.5.1 Algorithm foundations
The proposed formulation suggests an algorithm that follows these steps:
Step 1 at each iteration k, each node h requests the Lagrangean penalties from its neighbors i ∈ δ(h),
Step 2 exploiting the new penalties, the node h optimizes locally, solving the problem z^h_LR(λ) and computing its solution xij,
Step 3 node h updates its penalty λh: λh = max{0, λh + α^k_h gh}.
Exchanging the $\lambda_h$ penalties at every step is clearly impractical because of
the computational cost of the communications. To avoid this problem and improve
the algorithm's performance, [115] proposed a two-level optimization process:
I Level (Core Optimization) An internal loop in which h optimizes its $z^h_{LR}(\lambda)$
value, using its $\lambda_h$ penalty and keeping constant the penalties $\lambda_i$ from its
neighborhood,
II Level (External Optimization) once every node $h \in G$ has completed its optimization step, each penalty is updated and sent to the neighborhood; once the
communication is complete, each node can perform another
Core Optimization step.
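To make the saving concrete, the following back-of-the-envelope count uses the DIST settings reported in Section 7.6 (ExtMaxIter = 500, InnerMaxIter = 20). Counting one message per ordered pair of neighbours in each exchange round is a simplifying assumption of ours.

```latex
\begin{align*}
\text{local iterations per node} &= \mathit{ExtMaxIter} \times \mathit{InnerMaxIter} = 500 \times 20 = 10000,\\
\text{messages per exchange round} &= \sum_{h \in V} |\delta(h)| = 2|E|,\\
\text{one exchange per iteration} &:\quad 10000 \cdot 2|E| \ \text{messages},\\
\text{two-level scheme} &:\quad 500 \cdot 2|E| \ \text{messages}.
\end{align*}
```

Under these assumptions the two-level scheme exchanges 20 times fewer messages for the same number of local subgradient iterations (10000, which is also the MaxIter value used for STD in Section 7.6).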
7.5.2 Quasi-constant step size update
The main problem afflicting the distributed subgradient is the update of the $\lambda_h$
penalties. We can't use a standard step update rule like the one described in 7.4:
node h doesn't have a consistent knowledge of the network, and performing
additional communications would degrade the method's performance.
It is necessary to use an update rule that exploits constant and local parameters. This
rule is named the quasi-constant step size rule:
Step 1 initially $\alpha^0 = \alpha_{start}$, an arbitrary small value,
Step 2 once the bound is improved, the step size is increased by means of a
constant defined a priori: $\alpha^{k+1} = \gamma \alpha^k$, with $\gamma > 1$,
Step 3 if the bound is not improved for a given number of iterations, the step size
is reduced: $\alpha^{k+1} = \max\{\gamma' \alpha^k, \alpha_{min}\}$, where $0 < \gamma' < 1$ and $0 < \alpha_{min} < \alpha_{start}$;
otherwise $\alpha^{k+1} = \alpha^k$.
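The three steps translate almost literally into code. The C sketch below is one possible reading, in which the caller supplies the number of iterations elapsed without improvement; `StepRule`, `next_step` and the field names are hypothetical, and how the stall counter is reset is deliberately left to the caller because the rule above does not specify it.

```c
#include <assert.h>

/* Parameters of the quasi-constant step size rule (names are ours). */
typedef struct {
    double gamma_up;    /* gamma  > 1    : growth factor after an improvement */
    double gamma_down;  /* gamma' in (0,1): shrink factor after a stall       */
    double alpha_min;   /* lower bound on the step size                       */
    int    patience;    /* stalled iterations tolerated before shrinking      */
} StepRule;

/* Return alpha_{k+1} given alpha_k.
 * improved : nonzero if the bound improved at iteration k
 * stalled  : iterations without improvement counted by the caller           */
double next_step(double alpha, int improved, int stalled, const StepRule *r)
{
    assert(r->gamma_up > 1.0);
    assert(r->gamma_down > 0.0 && r->gamma_down < 1.0);
    if (improved)
        return r->gamma_up * alpha;                    /* Step 2 */
    if (stalled > r->patience) {                       /* Step 3 */
        double shrunk = r->gamma_down * alpha;
        return (shrunk > r->alpha_min) ? shrunk : r->alpha_min;
    }
    return alpha;                                      /* unchanged otherwise */
}
```

With the illustrative values $\gamma = 2$, $\gamma' = 0.5$, $\alpha_{min} = 0.25$ and a patience of 3 iterations, a step of 1 becomes 2 after an improvement, stays at 1 after two stalled iterations, and drops to 0.5 after four.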
The subgradient $g^k_h$ is also computed using only the local solution $x_h$.
The rule for updating the Lagrangean penalties is:
\[
\lambda^{k+1}_h = \max\{0, \lambda^k_h + \alpha^k_h g^k_h\} \tag{7.15}
\]
7.5.3 Algorithm description
We can summarize the two-level optimization procedure as follows:
Algorithm Inner-Subgradient(h, $\lambda'$)
1. Initialize $\lambda_i = \lambda'_i$ for each $i \in V_h = \delta(h) \cup \{h\}$
2. while it < InnerMaxIter do
3.     Solve $z^h_{LR}(\lambda)$
4.     Update only the node h penalty: $\lambda_h = \max\{0, \lambda_h + \alpha_h g_h\}$
5.     if $z^h_{LR}(\lambda) < z^h_{LR}(\lambda')$
6.         then $\lambda'_h = \lambda_h$
Algorithm External-Optimization()
1. while et < ExtMaxIter do
2.     for h = 1 to |V| do
3.         Inner-Subgradient(h, $\lambda$)
4.     Send each $\lambda_h$ to each $j \in \delta(h)$
The External-Optimization algorithm manages the external optimization step,
calling the Inner-Subgradient procedure for each node of the graph (line 3). Each
node h performs its optimization step, solving its Lagrangean problem $z^h_{LR}(\lambda)$
(line 3 of Inner-Subgradient), updating its penalty (line 4) and re-optimizing with
the new $\lambda$ for a given number of iterations. At the end of each iteration of
External-Optimization, the best penalties of each node are sent to the other
peers of the network (line 4).
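The two procedures can be sketched in C as follows. This is an illustration under several assumptions of ours, not the thesis implementation: the graph is stored as dense row-major matrices, the step size is constant, the local problem is restricted to the edges incident to h, and every node works on a private copy of the penalties that is refreshed only at the end of each external iteration (the role of the Send at line 4). All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_N = 64 };       /* capacity of the toy buffers (hypothetical) */

typedef struct {
    size_t n;              /* number of nodes |V|                        */
    const double *p;       /* n*n row-major profits p_ij, 0 if no edge   */
    const double *u;       /* n*n row-major upper bounds u_ij            */
    const double *b;       /* right-hand sides b_h                       */
} Instance;

/* Solve z^h_LR(lam) on the edges incident to h; *deg gets sum_j x_hj. */
double local_bound(const Instance *g, size_t h, const double *lam, double *deg)
{
    double z = g->b[h] * lam[h];
    *deg = 0.0;
    for (size_t j = 0; j < g->n; ++j) {
        double red = g->p[h * g->n + j] - lam[h] - lam[j];
        double x = (j != h && red > 0.0) ? g->u[h * g->n + j] : 0.0;
        z += 0.5 * red * x;
        *deg += x;
    }
    return z;
}

/* Inner-Subgradient: only lam[h] changes; the best penalty found is kept. */
double inner_subgradient(const Instance *g, size_t h, double *lam,
                         double alpha, int inner_max_iter)
{
    double deg, best_lam = lam[h];
    double best_z = local_bound(g, h, lam, &deg);
    for (int it = 0; it < inner_max_iter; ++it) {
        double moved = lam[h] + alpha * (deg - g->b[h]);  /* g_h = deg - b_h */
        lam[h] = (moved > 0.0) ? moved : 0.0;
        double z = local_bound(g, h, lam, &deg);
        if (z < best_z) { best_z = z; best_lam = lam[h]; }
    }
    lam[h] = best_lam;
    return best_z;
}

/* External-Optimization: nodes work on private copies, then publish. */
void external_optimization(const Instance *g, double *lam, double alpha,
                           int ext_max_iter, int inner_max_iter)
{
    double next[MAX_N], view[MAX_N];
    assert(g->n <= MAX_N);
    for (int et = 0; et < ext_max_iter; ++et) {
        for (size_t h = 0; h < g->n; ++h) {
            memcpy(view, lam, g->n * sizeof *lam);   /* penalties known to h */
            inner_subgradient(g, h, view, alpha, inner_max_iter);
            next[h] = view[h];
        }
        memcpy(lam, next, g->n * sizeof *lam);       /* send lam_h to peers  */
    }
}
```

On a two-node toy instance with a single edge ($p = 4$, $u = 1$) and $b = (0.25, 2)$, calling `external_optimization` with $\alpha = 1$, two external and eight inner iterations ends with $\lambda = (4, 0)$, and the two local bounds sum to 1, which is the LP optimum $4 \cdot 0.25$ of this instance.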
7.6 Computational Results
In this section we report the computational results of the distributed subgradient algorithm, compared with the standard one and with the LP relaxation of the
problem, computed using CPLEX and CoinMP.
The test sets are provided by Boschetti et al. [115].
We call STD the standard subgradient algorithm and DIST the distributed one.
The standard subgradient is implemented using the standard update rule proposed
by Polyak,
\[
\alpha^k = \beta^k \, \frac{0.001\, z_{LR}(\lambda^k)}{\lVert g^k \rVert},
\]
with these parameters:
• $\beta^0 = 0.005$, start step,
• $\Delta_k = 75$, number of iterations before computing a smaller step: $\beta^{k+1} = 0.95\,\beta^k$,
• MaxIter = 10000, max number of iterations,
• Stop Condition: if the bound is not improved by at least 0.1% in the last
3000 iterations, the method is stopped.
The parameters of the DIST algorithm are:
• $\alpha_{start} = 0.0025$,
• $\alpha_{min} = 0.000005$,
• $\gamma = 1.005$,
• $\gamma' = 0.95$,
• ExtMaxIter = 500,
• InnerMaxIter = 20,
• $\Delta_k = 10$.
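A quick calculation with the DIST values above gives a feeling for how slowly the quasi-constant rule of Section 7.5.2 moves the step. The figures below are our own arithmetic on the listed parameters, not results from the experiments.

```latex
\begin{align*}
\text{one improvement:} \quad & \alpha^{k+1} = \gamma\,\alpha^{k} = 1.005 \times 0.0025 = 0.0025125,\\
\text{one reduction:} \quad & \alpha^{k+1} = \max\{\gamma'\alpha^{k}, \alpha_{min}\} = \max\{0.95 \times 0.0025,\ 0.000005\} = 0.002375,\\
\text{improvements needed to double } \alpha: \quad & n \ge \frac{\ln 2}{\ln 1.005} \approx 138.98 \ \Rightarrow\ n = 139,\\
\text{reductions from } \alpha_{start} \text{ down to } \alpha_{min}: \quad & n \ge \frac{\ln(0.000005 / 0.0025)}{\ln 0.95} \approx 121.2 \ \Rightarrow\ n = 122.
\end{align*}
```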
The CPLEX version is 11.2 and the CoinMP version is 1.7. We report the average
value of the instances in each set, the execution times and the Gap% between the
LP relaxation and the results computed by the subgradient algorithms, where
\[
\mathit{Gap} = 100 \times \frac{z_{LRopt} - z_{LP}}{z_{LP}}.
\]
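As a reading aid for the Gap% column, the formula can be inverted on the first row of Table 7.1 (grafo50a-14, $z_{LP} = 8181.31$, DIST Gap% $= 0.2810$). The table reports set averages, so reading them as if they belonged to a single instance is an approximation on our part.

```latex
\[
z_{LRopt} - z_{LP} = \frac{\mathit{Gap}}{100}\, z_{LP}
= \frac{0.2810}{100} \times 8181.31 \approx 22.99,
\qquad
z_{LRopt} \approx 8181.31 + 22.99 = 8204.30 .
\]
```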
The test machine is equipped with an Intel Core i7 920 @ 2.8 GHz and 6 Gigabytes
of RAM.
Problem Group   zLP (AVG)  LP CoinMP t (s)  LP CPLEX t (s)  STD Time (s)  STD Gap%  DIST Time (s)  DIST Gap%
grafo50a-14       8181.31             0.06            0.04          0.61    0.0100           0.04     0.2810
grafo100a-14     18898.20             0.23            0.05          2.78    0.0020           0.14     0.4070
grafo250a-14     70508.01             3.69            0.35          7.89    0.0200           2.55     0.5430
grafo500a-14    151190.40            92.96            1.91         52.65    0.0100           9.82     0.3800
grafo750a-14    238698.34           761.80            4.79        150.09    0.0330          37.50     0.0000
grafo1000a-14   316579.60          3203.72           10.95        330.12    0.0150          52.01     0.0001
grafo50a-128      8271.55             0.06            0.02          0.62    0.0010           0.05     0.2950
grafo100a-128    21850.80             0.18            0.05          2.43    0.0021           0.17     0.1180
grafo250a-128    68858.90             3.85            0.41          7.89    0.0250           2.53     0.3900
grafo500a-128   153809.50           108.38            1.90         52.65    0.0100           9.83     0.0060
grafo750a-128   242610.80           770.96            5.03        150.09    0.0343          37.56     0.0010
grafo1000a-128  336517.60          2734.91            9.06        330.12    0.0124          52.06     0.0001
grafo50b-14       1092.54             0.03            0.02          0.64    0.0300           0.03     0.5640
grafo100b-14      2298.66             0.09            0.08          2.65    0.0010           0.16     0.4630
grafo250b-14      5447.14             0.72            1.35          8.56    0.0110           2.58     0.5850
grafo500b-14     11198.03             4.93           12.37         52.65    0.0060           9.96     0.4740
grafo750b-14     16524.70            17.17           47.03        110.09    0.0231          38.04     0.4350
grafo1000b-14    22448.11            50.93          164.76        200.15    0.0260          53.78     0.4330
grafo50b-128      1565.68             0.03            0.01          0.65    0.0100           0.04     0.4320
grafo100b-128     3243.71             0.09            0.09          2.38    0.0020           0.16     0.3780
grafo250b-128     8110.50             0.71            1.40          8.54    0.0200           2.57     0.5330
grafo500b-128    16062.21             3.98           13.76         52.65    0.0140           9.94     0.4920
grafo750b-128    23748.90            12.85           52.10        150.09    0.0190          37.96     0.4840
grafo1000b-128   32271.93            31.71          204.95        223.12    0.0050          53.77     0.4950
Table 7.1: Gaps and execution times for LP, STD, DIST.
It is straightforward to notice that CPLEX outperforms CoinMP on the a sets
(10.95 against 3203.72 on grafo1000a-14), confirming its efficiency, while on the
larger b sets the ranking is reversed. The proposed Lagrangean relaxations provide
good quality bounds, competitive with the LP relaxation of the problem: all the
reported gaps are below 0.6%. For the
biggest instances the execution times are comparable to those of CPLEX and, in
some cases, such as the grafo1000b-128 set, they are better (53.77 for DIST
against 204.95 for CPLEX). As expected, the STD algorithm obtains smaller gaps than DIST
on most sets, since it has a global knowledge of the network and uses global parameters, as described
in 7.4; for our purposes, however, this approach is not applicable.
7.7 Shared Memory Algorithm
The proposed shared memory algorithm exploits the inherent parallelism of
the distributed subgradient. The problem granularity is straightforward: each
node h is an independent entity, and we can map a subset of nodes
to each thread spawned in a parallel loop. We used OpenMP because it is a de facto standard for the shared memory parallel programming model and for its
portability. The simplicity of implementation and the non-invasive pre-processor
directives enable fast code deployment, adding only a few lines of code to the
serial version.
Each node has its own data structures to compute its $z_{LR}(\lambda)$ value and a private array $\lambda_{node}$ storing the penalties of the other nodes. The
processor used for testing the algorithm, an Intel Core i7 @ 2.8 GHz, implements
Hyper-Threading [120], a proprietary Intel technology that lets a quad-core processor act like a processor with twice as many cores (8 in our case).
In our tests, the best speed-up was obtained by spawning 8 threads,
theoretically one thread per core.
The load balancing is managed entirely by OpenMP, which assigns an equal
number of nodes to each thread, as shown in Figure 7.7 for an instance with
1000 nodes.
As mentioned before, the insertion of the OpenMP directives is not invasive, and
the parallel algorithm has few modifications compared to the serial one.
Figure 7.7: Nodes deployment on an eight-core processor (CPU with cores 0 to 7, 125 nodes assigned to each core).
Algorithm Inner-Subgradient(h, $\lambda'$)
1. Initialize $\lambda_i = \lambda'_i$ for each $i \in V_h = \delta(h) \cup \{h\}$
2. while it < InnerMaxIter do
3.     Solve $z^h_{LR}(\lambda)$
4.     Update only the node h penalty: $\lambda_h = \max\{0, \lambda_h + \alpha_h g_h\}$
5.     if $z^h_{LR}(\lambda) < z^h_{LR}(\lambda')$
6.         then $\lambda'_h = \lambda_h$
Algorithm External-Optimization-OMP()
1. while et < ExtMaxIter do
2.     # parallel for private(h, $\lambda_{node}$)
3.     for h = 1 to |V| do
4.         Inner-Subgradient(h, $\lambda_{node}$)
5.     # end parallel for
6.     Update each $\lambda_{node}$ array with the new $\lambda_h$ penalties
The differences between the two algorithms are minimal: in line 2 we added the
OpenMP directives, opening a parallel for (fork) and declaring the $\lambda_{node}$ and h
variables private to each thread. Each thread, once the indexes h have been
partitioned, computes the Inner-Subgradient.
We don't need explicit synchronization directives because the end of the parallel
region is an implicit synchronization point for the threads.
At line 5 of the main procedure the parallel region is closed (join); the algorithm
then updates the $\lambda_{node}$ array of each node with the new $\lambda_h$ penalties (line 6).
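A hedged C sketch of the parallel outer loop follows. To stay short, the per-node routine is reduced to a single projected subgradient step with a constant step size, and instead of giving each thread a private copy of the penalties the sketch reads the shared array `lam` and writes each new penalty to its own slot of `next`, which avoids races in the same way. The identifiers are ours; the pragma is guarded by `_OPENMP` so that the code also builds, and gives the same result, when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_N = 64 };   /* capacity of the toy buffers (hypothetical) */

/* One projected subgradient step for node h against the penalties in lam.
 * p, u: n*n row-major profits and upper bounds; b: right-hand sides.      */
static double node_step(size_t n, const double *p, const double *u,
                        const double *b, const double *lam, size_t h,
                        double alpha)
{
    double deg = 0.0;
    for (size_t j = 0; j < n; ++j)
        if (j != h && p[h * n + j] - lam[h] - lam[j] > 0.0)
            deg += u[h * n + j];              /* x_hj at its upper bound */
    double moved = lam[h] + alpha * (deg - b[h]);
    return (moved > 0.0) ? moved : 0.0;
}

/* External loop: the for over h is the parallel region of the listing. */
void external_optimization_omp(size_t n, const double *p, const double *u,
                               const double *b, double *lam, double alpha,
                               int ext_max_iter)
{
    double next[MAX_N];
    assert(n <= MAX_N);
    for (int et = 0; et < ext_max_iter; ++et) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
        for (long h = 0; h < (long)n; ++h) {
            /* lam is only read here and next[h] has a single writer,
             * so no explicit synchronisation is needed.               */
            next[h] = node_step(n, p, u, b, lam, (size_t)h, alpha);
        }
        /* implicit barrier at the end of the parallel for (join) */
        memcpy(lam, next, n * sizeof *lam);   /* publish the new penalties */
    }
}
```

On the toy instance with one edge ($p = 4$, $u = 1$), $b = (0.25, 2)$ and $\alpha = 1$, four external iterations move $\lambda$ from $(0, 0)$ to $(3, 0)$, whatever the number of threads.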
7.8 Distributed Memory Algorithm
In this case we used the hybridization of MPI with OpenMP described in 2.3.1,
where the shared memory model is used to enhance the intra-node performance
of the message passing model. The algorithm exploits another level of parallelism:
the graph can be divided among the nodes of the cluster and, inside each
node, among threads with OpenMP.
Each MPI task, deployed on a different node of the cluster, performs the External
Optimization step on a given subset of nodes $V_{task}$. The $V_{task}$ set is then processed
in parallel, in the same fashion described in 7.7.
At the end of each inner cycle, once each task has computed the $z^h_{LR}(\lambda)$ values of its
$V_{task}$ subset of nodes, we need a synchronization primitive (barrier) to synchronize
the tasks and broadcast consistent penalty values.
The communication step is implemented with a broadcast communication primitive that updates the other tasks with the new Lagrangean penalties. The computational tests have been conducted on an experimental cluster implemented with
Microsoft HPC 2008 Server, with:
• 33 HP PCs with: Intel Pentium E6800 @ 3.2 GHz, 4 Gbyte of RAM,
• Windows 7 Professional Edition 64 bit, for each node,
• network switch: HP Procurve 2650.
The bottleneck of this cluster is the slow network that connects the machines (computation nodes). For our purposes, it is sufficient to show the scalability
of the method in a message passing environment.
In Section 7.10 we will observe that, beyond a certain number of compute nodes,
the method no longer scales, and the execution times are slowed down by
the network and the communications.
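The text does not say how the subsets $V_{task}$ are chosen; a contiguous block partition is one natural option and is what the sketch below assumes. `task_range` is a hypothetical helper of ours, written in plain C so that it can be tried without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous block partition of n nodes among n_tasks tasks.
 * Task `rank` receives the half-open range [*lo, *hi); the first
 * n % n_tasks tasks get one extra node, so sizes differ by at most 1. */
void task_range(size_t n, size_t n_tasks, size_t rank, size_t *lo, size_t *hi)
{
    assert(n_tasks > 0 && rank < n_tasks);
    size_t base = n / n_tasks, extra = n % n_tasks;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}
```

With 1000 nodes and 15 tasks (the number of compute nodes used in Section 7.10), tasks 0 to 9 receive 67 nodes each and tasks 10 to 14 receive 66. In an MPI program each task would call the helper with its own rank, process its range with the shared memory loop of Section 7.7, and then take part in the barrier and broadcast described above.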
7.9 GPU Algorithm
In this section we propose an algorithm for running the method on
a many-core platform. As in the other chapters, we used the Nvidia CUDA parallel programming model for its reliability and its more effective programming and
debugging tools.
For this algorithm we need to explore the Inner-Subgradient procedure in depth. It is
necessary to break the steps of the Core Optimization into small pieces
and design four kernels for executing the method on a GPU.
First we propose an extended version of the Core Optimization; then we
design the GPU algorithm.
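Algorithm Inner-Subgradient-Extended(h, $\lambda'$)
1. while k < InnerMaxIter do
2.     if $z^h_{LR}(\lambda^k) < z^h_{LR}(\lambda^{k-1})$ // If the bound is improved
3.         then Compute subgradient: $g^k_h = \sum_{j \in \delta(h)} x^k_{hj} - b_h$
4.             Compute $\alpha^k$: $\alpha^k = \alpha^{k-1} \gamma$
5.             Save best $\lambda_h$ and $X^k_h$
6.             Update $\lambda^k_h$: $\max\{0, \lambda^{k-1}_h + \alpha^k g^k_h\}$
7.             Update $\lambda^k$ with $\lambda^k_h$
8.             // Update X:
9.             for each $j \in \delta(h)$ do
10.                $(p_{hj} - \lambda^k_h - \lambda_j > 0)\ ?\ x_{hj} = u_{hj} : x_{hj} = 0$
11.        else Compute subgradient: $g^k_h = \sum_{j \in \delta(h)} x^k_{hj} - b_h$
12.            if $\Delta > \Delta_k$
13.                then $\alpha^k = \max\{\alpha^{k-1} \gamma', \alpha_{min}\}$
14.            Update $\lambda^k_h$: $\max\{0, \lambda^{k-1}_h + \alpha^k g^k_h\}$
15.            Update $\lambda^k$ with $\lambda^k_h$
16.            // Update X:
17.            for each $j \in \delta(h)$ do
18.                $(p_{hj} - \lambda^k_h - \lambda_j > 0)\ ?\ x_{hj} = u_{hj} : x_{hj} = 0$
19.            $\Delta$++
20.    k++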
In line 2 the algorithm checks whether the bound has improved. If so, the subgradient $g_h$ is computed (line 3), the step $\alpha$ is updated (line 4) and the penalties
and the solution are saved (line 5). In lines 6-10 the algorithm updates the Lagrangean
penalties and the current solution (lines 9-10). Otherwise, the subgradient $g_h$ is computed anyway, together with the other parameters, without saving the penalties
and the solution. In lines 4 and 12-13 the step $\alpha$ is updated following
the quasi-constant rule described before.
The main idea behind the GPU algorithm is to execute concurrently
each iteration k of each node h of the problem. Following the CUDA programming
model, we can assign a node h to each computation block and compute in parallel
each step of the Inner-Subgradient algorithm.
To minimize the communications between the HOST and the GPU, three support kernels have been implemented that manipulate the data structures on the device
and manage the communication of the Lagrangean penalties.
Algorithm External-Optimization-GPU
1. BLKS = h
2. THDS = n
3. Sharedmem = THDS
4. while et < ExtMaxIter do
5.     Update-X-Kernel<<<BLKS, THDS>>>($\lambda'$, $x_{ij}$)
6.     while k < InnerMaxIter do
7.         Compute-zLR-Kernel<<<BLKS, THDS, Sharedmem>>>($z_{LR}(\lambda')$, $x_{ij}$)
8.         Inner-Subgradient-Kernel<<<BLKS, THDS, Sharedmem>>>($\lambda'$, $z_{LR}(\lambda')$, $x_{ij}$)
9.     Update-Lambda-Kernel<<<BLKS, THDS>>>($\lambda$, $\lambda'$)
The peculiarity of this algorithm is the shift in design from executing the relaxation
of a certain number of nodes in parallel to executing the same relaxation step for
all the nodes in parallel. The Update-X-Kernel (line 5) updates the $x_{ij}$ solution of
the Lagrangean problem in parallel for each node (the number of blocks is the same
in each kernel) at the beginning of each Core Optimization step. The Compute-zLR-Kernel (line 7) evaluates, for each node h, its current objective function value.
Finally, the Update-Lambda-Kernel (line 9), at the end of the Core Optimization,
sends the updated penalties to each node.
Algorithm Inner-Subgradient-Kernel($\lambda'$, $z_{LR}(\lambda')$, $x_{ij}$)
1. h = blockIDx
2. thidx = threadIDx
3. THDS = blockDIMx
4. // Shared Memory Initialization
5. shared[thidx] = 0
6. times = |V| / THDS
7. remainder = |V| % THDS
8. if thidx < remainder
9.     then times++
10. Thread-Synchronization()
11. if $z^h_{LR}(\lambda^k) < z^h_{LR}(\lambda^{k-1})$
12. then
13.     // Compute subgradient: $g^k_h = \sum_{j \in \delta(h)} x^k_{hj} - b_h$
14.     for t = 0 to times do
15.         index = h * V + (thidx + t * THDS)
16.         shared[thidx] += $x_{ij}$[index]
17.     Thread-Synchronization()
18.     // Reduction
19.     for s = THDS/2 to 0, s /= 2 do
20.         if thidx < s
21.             then shared[thidx] += shared[thidx + s]
22.         Thread-Synchronization()
23.     $g^k_h$ = shared[0]
24.     // Compute $\alpha^k$: $\alpha^k = \alpha^{k-1} \gamma$
25.     // Save best $\lambda_h$ and $X^k_h$
26.     // Update $\lambda^k_h$
27.     $\lambda[h] = \max\{0, \lambda^{k-1}_h + \alpha^k g^k_h\}$
28.     // Update $\lambda^k$ with $\lambda^k_h$
29.     // Update X:
30.     for t = 0 to times do
31.         index = h * V + (thidx + t * THDS)
32.         $(p[index] - \lambda[h] - \lambda[thidx + t * THDS] > 0)\ ?\ x[index] = u_{hj} : x[index] = 0$
33. else // Compute subgradient: $g^k_h = \sum_{j \in \delta(h)} x^k_{hj} - b_h$
34.     for t = 0 to times do
35.         index = h * V + (thidx + t * THDS)
36.         shared[thidx] += $x_{ij}$[index]
37.     Thread-Synchronization()
38.     // Reduction
39.     for s = THDS/2 to 0, s /= 2 do
40.         if thidx < s
41.             then shared[thidx] += shared[thidx + s]
42.         Thread-Synchronization()
43.     $g^k_h$ = shared[0]
44.     if $\Delta > \Delta_k$
45.         then $\alpha^k = \max\{\alpha^{k-1} \gamma', \alpha_{min}\}$
46.     // Update $\lambda^k_h$
47.     $\lambda[h] = \max\{0, \lambda^{k-1}_h + \alpha^k g^k_h\}$
48.     // Update $\lambda^k$ with $\lambda^k_h$
49.     // Update X:
50.     for t = 0 to times do
51.         index = h * V + (thidx + t * THDS)
52.         $(p[index] - \lambda[h] - \lambda[thidx + t * THDS] > 0)\ ?\ x[index] = u_{hj} : x[index] = 0$
53.     $\Delta$++
The algorithm spawns a block for each node h. In lines 14-23 and 34-43 the subgradient is computed using a modified version of the parallel reduction suggested
by [59]. If the bound is improved, the new $\alpha$ is computed (line 24), the
node's Lagrangean penalty is updated (line 27) and the new solution for
node h is created (lines 30-32). Otherwise, in the else branch of the if-then-else statement,
starting at line 33, the instructions are the same; the only difference is in the
update of the step used for the penalty (lines 44-45).
The data structures are indexed in a row-major fashion. The index value in lines
15, 31, 35 and 51 is the one-dimensional address of the element accessed by
thread thidx.
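The accumulation and reduction of lines 14-23 can be traced on the CPU. The C sketch below emulates one CUDA block serially: each pass of a `thidx` loop stands for the work of one thread, and the end of such a loop plays the role of Thread-Synchronization(). The function name and the compile-time constant `THDS` are ours; as in the listing, the tree reduction assumes that the number of threads is a power of two, and the subtraction of $b_h$ follows the formula in the comment at line 13.

```c
#include <assert.h>
#include <stddef.h>

enum { THDS = 4 };   /* threads per block; must be a power of two */

/* Emulate, for block h, the subgradient g_h = sum_j x_hj - b_h.
 * x is the n*n row-major solution matrix, so row h starts at h * n.  */
double block_subgradient(const double *x, size_t n, size_t h, double b_h)
{
    double shared[THDS];

    /* Each "thread" accumulates the entries thidx, thidx + THDS, ... */
    for (size_t thidx = 0; thidx < THDS; ++thidx) {
        shared[thidx] = 0.0;
        for (size_t j = thidx; j < n; j += THDS)
            shared[thidx] += x[h * n + j];        /* index = h*|V| + j */
    }
    /* -- barrier: all partial sums are in shared memory -- */

    /* Tree reduction: halve the number of active threads each round. */
    for (size_t s = THDS / 2; s > 0; s /= 2) {
        for (size_t thidx = 0; thidx < s; ++thidx)
            shared[thidx] += shared[thidx + s];
        /* -- barrier at the end of each round -- */
    }
    return shared[0] - b_h;
}
```

Within a reduction round the threads read slots `s` to `2s - 1` and write slots `0` to `s - 1`, so the serial order of the emulation does not change the result. For a row $(1, 0, 2, 0.5, 0, 1.5)$ and $b_h = 2$ the partial sums are $(1, 1.5, 2, 0.5)$, then $(3, 2)$, then 5, and the function returns 3.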
The notable performance of this method is due to the use of parallel reductions,
which seem to fit the cluster processors of the GPU very well, together with
the use of the shared memory. The Kepler architecture of our test device, with 192 CUDA cores per
cluster processor, seems to manage the reductions very well,
thanks to the large number of cores in the same cluster (preliminary tests on a
Fermi GPU, with clusters of 32 CUDA cores, showed that the reduction step was
the bottleneck of the method).
7.10 Computational Results
We present the computational results relative to the Speed-Up factors obtained
on the different parallel platforms considered. The testbed machine used for the
OpenMP and the CUDA algorithms is the same used in Section 7.6, equipped with
an Nvidia GeForce GTX 770 with 2 Gigabytes of GDDR5 memory. The cluster
used for the MPI algorithm is the one described in 7.8.
The tuning parameters are the same used in 7.6.
We report the SpeedUp factor as the ratio between the serial time of the algorithm and the parallel one, $\mathit{SpeedUp} = \mathit{Time}_{serial} / \mathit{Time}_{parallel}$.
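For instance, taking the grafo500a-14 row of Table 7.2 (serial time 9.828 s) and the MPI time of the smallest set grafo50a-14, the definition gives the values below. They agree with the tabulated factors up to the rounding of the reported times; a factor below 1 means that the parallel run is slower than the serial one.

```latex
\begin{align*}
\text{grafo500a-14:} \quad
& \mathit{SpeedUp}_{OpenMP} = \frac{9.828}{2.445} \approx 4.02, \quad
  \mathit{SpeedUp}_{MPI} = \frac{9.828}{1.993} \approx 4.93, \quad
  \mathit{SpeedUp}_{CUDA} = \frac{9.828}{0.415} \approx 23.7,\\
\text{grafo50a-14:} \quad
& \mathit{SpeedUp}_{MPI} = \frac{0.041}{0.650} \approx 0.06 .
\end{align*}
```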
Times are set averages (AVG) in seconds; OpenMP, MPI and CUDA are the parallel times.
Problem Group   Serial Time (s)  OpenMP (s)  MPI (s)  CUDA (s)  SpeedUp OpenMP  SpeedUp MPI  SpeedUp CUDA
grafo50a-14               0.041       0.024    0.650     0.023         1.708 X      0.063 X       1.754 X
grafo100a-14              0.144       0.055    0.829     0.024         2.618 X      0.174 X       5.937 X
grafo250a-14              2.550       0.652    1.522     0.171         3.911 X      1.675 X      14.873 X
grafo500a-14              9.828       2.445    1.993     0.415         4.020 X      4.931 X      23.697 X
grafo750a-14             37.509       9.370    5.747     1.343         4.003 X      6.527 X      27.933 X
grafo1000a-14            52.015      13.378    6.390     1.799         3.888 X      8.140 X      28.919 X
grafo50a-128              0.050       0.025    0.642     0.027         2.000 X      0.078 X       1.864 X
grafo100a-128             0.171       0.058    0.837     0.029         2.948 X      0.204 X       5.964 X
grafo250a-128             2.537       0.665    1.544     0.171         3.815 X      1.643 X      14.805 X
grafo500a-128             9.837       2.098    1.989     0.421         4.689 X      4.946 X      23.350 X
grafo750a-128            37.566       9.281    6.729     1.354         4.048 X      5.583 X      27.754 X
grafo1000a-128           52.062      13.467    7.400     1.798         3.866 X      7.035 X      28.953 X
grafo50b-14                0.03       0.022    0.643     0.025         1.655 X      0.057 X       1.456 X
grafo100b-14               0.16       0.049    0.889     0.028         3.388 X      0.187 X       6.022 X
grafo250b-14               2.58       0.654    1.535     0.172         3.948 X      1.682 X      14.986 X
grafo500b-14               9.96       2.136    1.998     0.426         4.667 X      4.989 X      23.384 X
grafo750b-14              38.04       9.335    6.792     1.365         4.075 X      5.601 X      27.860 X
grafo1000b-14             53.78      13.387    7.400     1.823         4.018 X      7.268 X      29.502 X
grafo50b-128              0.041       0.023    0.680     0.027         1.783 X      0.060 X       1.493 X
grafo100b-128             0.164       0.055    0.899     0.029         2.971 X      0.182 X       5.752 X
grafo250b-128             2.578       0.641    1.531     0.174         4.022 X      1.684 X      14.799 X
grafo500b-128             9.947       2.340    1.873     0.429         4.251 X      5.311 X      23.202 X
grafo750b-128            37.960       9.381    6.834     1.362         4.046 X      5.555 X      27.863 X
grafo1000b-128           53.778      13.729    7.123     1.816         3.917 X      7.550 X      29.615 X
Table 7.2: Speed-Ups of the CUDA, MPI and OpenMP algorithms.
The experimental results clearly show that the GPU algorithm obtains the best
Speed-Up factor among the proposed solutions, lowering the execution time by more
than one order of magnitude on the sets with 250 nodes or more (up to about 29.6X
on the 1000-node sets). The effectiveness of this method is a consequence
of the massively parallel architecture of the GPU, which allows us to
execute a large number of operations in parallel. The other two solutions also obtained good results, confirming the intrinsically parallel nature of the
distributed subgradient.
As for the MPI algorithm, we are aware that the cluster used is not
the best option to test our method. We are confident that, executing the algorithm
on a faster infrastructure (mainly on the communication side), the execution times
can be lower. We used only 15 compute nodes because, as shown in Figure 7.8a, a
larger number of compute nodes leads to a degradation of the performance, due
to communications.
The OpenMP algorithm has the advantage of being almost identical to the
serial version, thanks to the design of the OpenMP directives, which allow
good results to be achieved by modifying only a few parts of the original code.
Figure 7.8: Performances evaluation. (a) MPI performance degradation for grafo1000a-128; (b) platforms comparison.
7.11 Considerations and Future Work
In this chapter we presented three parallel algorithms for finding a bound to the
Membership Overlay Problem, relative to the Peer-2-Peer network model. The
results reported show the efficiency of the GPU algorithm, executed on a mid-level
consumer device. The other algorithms also showed good results, confirming the
method's high scalability. We plan to apply this new kind of relaxation to other
Combinatorial Optimization problems, to verify the reliability and the generality
of the method.
Chapter 8
Conclusions
In this Thesis we presented new optimization algorithms designed to run on the
state-of-the-art, inherently parallel processors available on the market. The need
to exploit these new architectures in Combinatorial Optimization is becoming ever
more urgent. The advent of technologies and paradigms such as cloud computing or big
data, which contain notable CO problems at their core, together with the
increasing size of real-life instances (vehicle routing, resource scheduling and
optimization, network design, etc.), requires consistent and reliable developments
on both theoretical and practical issues. Moreover, the computation of solutions for
these problems must be really fast in most cases, from decision support software
to scientific applications. This work has provided parallel methodologies for solving, or enhancing the solution methods of, some important problems in the literature,
methodologies that can be applied to a wide spectrum of real-world situations.
The speed-up factors obtained are suggestive of the potential behind these architectures. Not only is hardware parallelism growing, but the quantity of
memory has also gone far beyond 8/16 gigabytes of RAM for each computation node or
workstation, and 2/4 gigabytes of memory for accelerators. Better exploiting
the computational resources of these devices, also at the dawn of the exa-scale
era, can bring an effective enhancement in both industrial and academic fields.
Our envisioned future work includes the porting of these algorithms to a larger
number of hardware platforms, such as AMD devices or FPGAs, by exploiting the
OpenCL parallel programming model to take advantage also of the peculiar characteristics of these devices (faster computation, better double precision throughput, larger number of cores, etc.). Another interesting development related
to the evolution of processors is the insertion of a many-core processor (GPU) in
the same silicon die of a canonical multi-core CPU. This solution is called APU,
Accelerated Processing Unit. HSA, Heterogeneous System Architecture, proposed
by AMD, is a new environment designed specifically to exploit these processors.
The great advantage of this approach is avoiding the PCI Express communications
between the HOST and the GPU, thus exploiting the whole system memory. In
this technological period, it is mandatory to take into account the massively parallel implementation of these processors during the design of new methodologies
and algorithms. Our aim is to design, starting from the well established
theoretical foundations of CO, new algorithms targeted to run on these new devices.
Bibliography
[1] Francesco Strappaveccia. A parallel algorithm for a distributed lagrangean
relaxation for the membership overlay problem, 2011. Master Thesis.
[2] Wikipedia. Supercomputer — Wikipedia, the free encyclopedia, 2015. URL
http://en.wikipedia.org/wiki/Supercomputer.
[Online; accessed 23-January-2015].
[3] Allan R Director-Hoffman. Supercomputers: directions in technology and
applications. National Academy Press, 1989.
[4] Top 500. http://www.green500.org/, 2015.
[5] Top 500. http://www.top500.org/, 2014.
[6] Jerry Banks et al. Handbook of simulation. Wiley Online Library, 1998.
[7] Wikipedia.
Simulation — Wikipedia, the free encyclopedia, 2015.
URL http://en.wikipedia.org/wiki/Simulation. [Online; accessed 23-January-2015].
[8] Wikipedia. Computational fluid dynamics — Wikipedia, the free encyclopedia,
2015. URL http://en.wikipedia.org/wiki/Computational_fluid_dynamics.
[Online; accessed 23-January-2015].
[9] Wikipedia. Computational chemistry — Wikipedia, the free encyclopedia,
2015. URL http://en.wikipedia.org/wiki/Computational_chemistry.
[Online; accessed 23-January-2015].
[10] Wikipedia. Multi-agent systems — Wikipedia, the free encyclopedia, 2015.
URL http://en.wikipedia.org/wiki/Multi-agent_system. [Online; accessed 23-January-2015].
[11] Wikipedia. Computational astrophysics — Wikipedia, the free encyclopedia,
2015. URL http://en.wikipedia.org/wiki/Computational_astrophysics.
[Online; accessed 23-January-2015].
[12] Wikipedia. Computational physics — Wikipedia, the free encyclopedia,
2015.
URL http://en.wikipedia.org/wiki/Computational_physics.
[Online; accessed 23-January-2015].
[13] Berk Hess, Carsten Kutzner, David Van Der Spoel, and Erik Lindahl. Gromacs 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. Journal of chemical theory and computation, 4(3):435–447,
2008.
[14] Wikipedia. Bioinformatics — Wikipedia, the free encyclopedia, 2015. URL
http://en.wikipedia.org/wiki/Bioinformatics. [Online; accessed 23-January-2015].
[15] Michael Friendly and Daniel J Denis. Milestones in the history of thematic
cartography, statistical graphics, and data visualization. Seeing Science:
Today American Association for the Advancement of Science, 2008.
[16] Wikipedia. Data acquisition — Wikipedia, the free encyclopedia, 2015.
URL http://en.wikipedia.org/wiki/Data_acquisition.
[Online; accessed 23-January-2015].
[17] Wikipedia. Data analysis — Wikipedia, the free encyclopedia, 2015. URL
http://en.wikipedia.org/wiki/Data_analysis.
[Online; accessed 23-January-2015].
[18] Ian H Witten and Eibe Frank. Data Mining: Practical machine learning
tools and techniques. Morgan Kaufmann, 2005.
[19] Gregory Piateski and William Frawley. Knowledge discovery in databases.
MIT press, 1991.
[20] David J Hand, Heikki Mannila, and Padhraic Smyth. Principles of data
mining. MIT press, 2001.
[21] Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency,
volume 24. Springer, 2003.
[22] Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff.
A set of level 3 basic linear algebra subprograms. ACM Transactions on
Mathematical Software (TOMS), 16(1):1–17, 1990.
[23] Hrvoje Jasak, Aleksandar Jemcov, and Zeljko Tukovic. Openfoam: A c++
library for complex physics simulations. In International workshop on coupled methods in numerical dynamics, volume 1000, pages 1–20, 2007.
[24] Paolo Giannozzi, Stefano Baroni, Nicola Bonini, Matteo Calandra, Roberto
Car, Carlo Cavazzoni, Davide Ceresoli, Guido L Chiarotti, Matteo Cococcioni, Ismaila Dabo, et al. Quantum espresso: a modular and open-source
software project for quantum simulations of materials. Journal of Physics:
Condensed Matter, 21(39):395502, 2009.
[25] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: portable
parallel programming with the message-passing interface, volume 1. MIT
press, 1999.
[26] Sayantan Sur, Matthew J Koop, and Dhabaleswar K Panda. High-performance
and scalable mpi over infiniband with reduced memory usage:
an in-depth performance analysis. In Proceedings of the 2006 ACM/IEEE
conference on Supercomputing, page 105. ACM, 2006.
[27] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Wiley
Plus/Blackboard Stand-alone to accompany Operating Systems Concepts with
Java (Wiley Plus Products). John Wiley & Sons, 2006.
[28] OpenMP Architecture Review Board. OpenMP application program interface
version 3.0, may 2008. URL http://www.openmp.org/mp-documents/spec30.pdf.
[29] Jack B Dennis and Earl C Van Horn. Programming semantics for multiprogrammed computations. Communications of the ACM, 9(3):143–155, 1966.
[30] Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP:
portable shared memory parallel programming, volume 10. MIT press, 2008.
[31] Nvidia Developer Zone. https://developer.nvidia.com/, 2014.
[32] Khronos Group. Opencl, august 2009. URL https://www.khronos.org/opencl/.
[33] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Prentice-Hall, Inc., Upper Saddle River, NJ, USA,
1982. ISBN 0-13-152462-3.
[34] Francesca Rossi, Peter Van Beek, and Toby Walsh. Handbook of constraint
programming. Elsevier, 2006.
[35] George B Dantzig. Linear programming and extensions. Princeton university
press, 1998.
[36] George B Dantzig, Alex Orden, Philip Wolfe, et al. The generalized simplex method for minimizing a linear form under linear inequality restraints.
Pacific Journal of Mathematics, 5(2):183–195, 1955.
[37] Ilog CPLEX. http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/, 2014.
[38] Robin Lougee-Heimer. The common optimization interface for operations
research: Promoting open-source software in the operations research community. IBM Journal of Research and Development, 47(1):57–66, 2003.
[39] Boris Teodorovich Polyak. The conjugate gradient method in extremal problems. USSR Computational Mathematics and Mathematical Physics, 9(4):
94–112, 1969.
[40] Michael Held, Philip Wolfe, and Harlan P Crowder. Validation of subgradient optimization. Mathematical programming, 6(1):62–88, 1974.
[41] Michael Held and Richard M Karp. The traveling-salesman problem and
minimum spanning trees. Operations Research, 18(6):1138–1162, 1970.
[42] Michael Held and Richard M Karp. The traveling-salesman problem and
minimum spanning trees: Part ii. Mathematical programming, 1(1):6–25,
1971.
[43] Naum Zuselevich Shor, Krzysztof Kiwiel, and Andrzej Ruszcaynski. Minimization methods for non-differentiable functions. Springer-Verlag New
York, Inc., 1985.
[44] George L Nemhauser and Laurence A Wolsey. Integer and combinatorial
optimization, volume 18. Wiley New York, 1988.
[45] Karla Hoffman and Manfred Padberg. Lp-based combinatorial problem solving. Annals of Operations Research, 4(1):145–194, 1985.
[46] Richard Bellman. Dynamic programming. Princeton University Press, 1957.
[47] Ailsa H Land and Alison G Doig. An automatic method of solving discrete
programming problems. Econometrica: Journal of the Econometric Society,
pages 497–520, 1960.
[48] G. Wäscher, H. Haussner, and H. Schumann. An improved typology of
cutting and packing problems. European Journal of Operational Research,
183(3):1109–1130, 2007.
[49] P. Gilmore and R. Gomory. Multistage cutting problems of two and more
dimensions. Operations Research, 13:94–119, 1965.
[50] P. Gilmore and R. Gomory. The theory and computation of knapsack functions. Operations Research, 14:1045–1074, 1966.
[51] J. C. Herz. Recursive computational procedure for two-dimensional
stock cutting. IBM Journal of Research and Development, 16(5):462–469,
September 1972. ISSN 0018-8646. doi: 10.1147/rd.165.0462. URL
http://dx.doi.org/10.1147/rd.165.0462.
[52] N. Christofides and C. Whitlock. An algorithm for two-dimensional cutting
problems. Operations Research, 25:30–44, 1977.
[53] J. E. Beasley. Algorithms for unconstrained two-dimensional guillotine cutting. Journal of the Operational Research Society, 36:297–306, 1985.
[54] G.F. Cintra, F.K. Miyazawa, Y. Wakabayashi, and E.C. Xavier. Algorithms
for two-dimensional cutting stock and strip packing problems using dynamic
programming and column generation. European Journal of Operational
Research, 191(1):61–85, 2008. ISSN 0377-2217. doi:
http://dx.doi.org/10.1016/j.ejor.2007.08.007. URL
http://www.sciencedirect.com/science/article/pii/S0377221707008831.
[55] Mauro Russo, Antonio Sforza, and Claudio Sterle. An improvement
of the knapsack function based algorithm of gilmore and gomory for
the unconstrained two-dimensional guillotine cutting problem.
International Journal of Production Economics, 145(2):451–462, 2013.
ISSN 0925-5273. doi: http://dx.doi.org/10.1016/j.ijpe.2013.04.031. URL
http://www.sciencedirect.com/science/article/pii/S0925527313001953.
[56] Gary J. Katz and Joseph T. Kider, Jr. All-pairs shortest-paths for
large graphs on the gpu. In Proceedings of the 23rd ACM
SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, GH ’08,
pages 47–55, Aire-la-Ville, Switzerland, 2008. Eurographics Association.
ISBN 978-3-905674-09-5. URL
http://dl.acm.org/citation.cfm?id=1413957.1413966.
[57] Ben D. Lund and Justin W. Smith. A multi-stage CUDA kernel for
Floyd-Warshall. CoRR, abs/1001.4108, 2010. URL
http://arxiv.org/abs/1001.4108.
[58] Vincent Boyer, Didier El Baz, and Moussa Elkihel. Solving knapsack problems on GPU. Computers & Operations Research, 39(1):42–47, 2012.
[59] Mark Harris. Optimizing parallel reduction in cuda. Nvidia developer
zone, 2009. URL
http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[60] G. Dantzig and J. Ramser. The truck dispatching problem. Management Science,
6(1):80–91, 1959.
[61] Paolo Toth and Daniele Vigo, editors. The Vehicle Routing Problem. Society
for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001. ISBN
0-89871-498-2.
[62] Bruce L. Golden, S. Raghavan, and Edward A. Wasil, editors. The Vehicle Routing
Problem: Latest Advances and New Challenges, volume 43 of Operations
Research/Computer Science Interfaces Series. Springer, 2008. ISBN 978-0-387-77778-8.
[63] VeroLog Web Site. http://www.verolog.eu/, 2014.
[64] Marco Boschetti and Vittorio Maniezzo. A set covering based matheuristic
for a real-world city logistics problem. International Transactions
in Operational Research, pages n/a–n/a, 2014. ISSN 1475-3995. doi:
10.1111/itor.12110. URL http://dx.doi.org/10.1111/itor.12110.
[65] G. Clarke and J.R. Wright. Scheduling of vehicles from a central depot to a
number of delivery points. Operations Research, 12:568–581, 1964.
[66] M. L. Fisher and R. Jaikumar. A generalized assignment heuristic for vehicle
routing. Networks, 11:109–124, 1981.
[67] Paolo Toth and Daniele Vigo. The granular tabu search and its application
to the vehicle-routing problem. INFORMS J. on Computing, 15(4):333–346,
2003. ISSN 1526-5528.
[68] Wen-Chyuan Chiang and Robert A. Russell. Simulated annealing metaheuristics for the vehicle routing problem with time windows. Annals of
Operations Research, 63(1):3–27, 1996. ISSN 0254-5330.
[69] Luca Maria Gambardella, Éric Taillard, and Giovanni Agazzi. Macs-vrptw:
A multiple colony system for vehicle routing problems with time windows.
In New Ideas in Optimization, pages 63–76. McGraw-Hill, 1999.
[70] Christian Prins. A simple and effective evolutionary algorithm for the vehicle
routing problem. Computers and Operations Research, 31:2004, 2001.
[71] Jari Kytöjoki, Teemu Nuortio, Olli Bräysy, and Michel Gendreau. An efficient variable neighborhood search heuristic for very large scale vehicle
routing problems. Computers & OR, 34(9):2743–2757, 2007.
[72] Yannis Marinakis and Magdalene Marinaki. A hybrid particle swarm optimization algorithm for the open vehicle routing problem. In Marco Dorigo,
Mauro Birattari, Christian Blum, Anders Lyhne Christensen, Andries P. Engelbrecht, Roderich Groß, and Thomas Stützle, editors, Swarm Intelligence,
volume 7461 of Lecture Notes in Computer Science, pages 180–187. Springer
Berlin Heidelberg, 2012. ISBN 978-3-642-32649-3.
[73] Wei Ou and Bao-Gang Sun. A dynamic programming algorithm for vehicle
routing problems. In Proceedings of the 2010 International Conference on
Computational and Information Sciences, ICCIS ’10, pages 733–736, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4270-6.
[74] N. Christofides, A. Mingozzi, and P. Toth. Exact algorithms for the vehicle routing problem based on spanning tree and shortest path relaxation. Math.
Programming, 10:255–280, 1981.
[75] Jens Lysgaard, Adam N. Letchford, and Richard W. Eglese. A new branch-and-cut algorithm for the capacitated vehicle routing problem. Mathematical
Programming A, 100(2):423–445, 2004.
[76] Martin Desrochers, Jacques Desrosiers, and Marius Solomon. A new optimization algorithm for the vehicle routing problem with time windows.
Operations Research, 40(2):342–354, 1992.
[77] R. Baldacci, A. Mingozzi, and R. Roberti. New route relaxation and pricing
strategies for the vehicle routing problem. Operations Research, 59:1269–1283,
2011.
[78] Rafael Martinelli, Diego Pecin, and Marcus Poggi. Efficient elementary and
restricted non-elementary route pricing. European Journal of Operational
Research, 239(1):102–111, 2014.
[79] André R. Brodtkorb, Trond R. Hagen, Christian Schulz, and Geir Hasle. Gpu
computing in discrete optimization. part i: Introduction to the gpu. EURO
Journal on Transportation and Logistics, 2(1-2):129–157, 2013. ISSN 2192-4376.
[80] Marco A. Boschetti, Vittorio Maniezzo, and Francesco Strappaveccia. Using
gpu computing for solving the two-dimensional guillotine cutting problem.
Submitted for publication, 2014.
[81] M. L. Balinski and R. E. Quandt. On an integer program for a delivery
problem. Operations Research, 12(2):300–304, 1964. ISSN 0030-364X.
[82] N. Christofides, A. Mingozzi, and P. Toth. State-space relaxation procedures for
the computation of bounds to routing problems. Networks, 11(2):145–164,
1981.
[83] G. Righini and M. Salani. Symmetry helps: bounded bi-directional dynamic
programming for the elementary shortest path problem with resource constraints. Discrete Optimization, 3(3):255–273, 2006.
[84] G. Righini and M. Salani. Decremental state space relaxation strategies and
initialization heuristics for solving the orienteering problem with time windows with dynamic programming. Computers and Operations Research,
36(4):1191–1203, 2009.
[85] G. Righini and M. Salani. New dynamic programming algorithms for the resource
constrained elementary shortest path. Networks, 51(3):155–170, 2008.
[86] Pawan Harish, Vibhav Vineet, and PJ Narayanan. Large graph algorithms
for massively multithreaded architectures. Centre for Visual Information
Technology, I. Institute of Information Technology, Hyderabad, India, Tech.
Rep. IIIT/TR/2009/74, 2009.
[87] Aydın Buluç, John R. Gilbert, and Ceren Budak. Solving path problems on
the gpu. Parallel Computing, 36(5):241–253, 2010.
[88] Hector Ortega-Arranz, Yuri Torres, Diego R Llanos, and Arturo Gonzalez-Escribano. A new gpu-based approach to the shortest path problem. In
High performance computing and simulation (HPCS), 2013 international
Conference on, pages 505–511. IEEE, 2013.
[89] Sumit Kumar, Alok Misra, and Raghvendra Singh Tomar. A modified parallel approach to single source shortest path problem for massively dense
graphs using cuda. In Computer and Communication Technology (ICCCT),
2011 2nd International Conference on, pages 635–639. IEEE, 2011.
[90] Christian Schulz. Efficient local search on the gpu—investigations on the
vehicle routing problem. Journal of Parallel and Distributed Computing,
73(1):14–31, 2013.
[91] SINTEF. http://www.sintef.no/, 2014.
[92] VRPLIB. http://www.or.deis.unibo.it/index.html, 2014.
[93] Feiyue Li, Bruce Golden, and Edward Wasil. Very large-scale vehicle routing: new test problems, algorithms, and results. Computers & Operations
Research, 32(5):1165–1179, 2005.
[94] Edsger W Dijkstra. A note on two problems in connexion with graphs.
Numerische mathematik, 1(1):269–271, 1959.
[95] Richard Bellman. On a routing problem. Technical report, DTIC Document,
1956.
[96] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for
the heuristic determination of minimum cost paths. Systems Science and
Cybernetics, IEEE Transactions on, 4(2):100–107, 1968.
[97] Robert W Floyd. Algorithm 97: shortest path. Communications of the
ACM, 5(6):345, 1962.
[98] Chris Barrett, Riko Jacob, and Madhav Marathe. Formal-language-constrained
path problems. SIAM Journal on Computing, 30(3):809–837, 2000.
[99] Stephen Cole Kleene. Representation of events in nerve nets and finite automata. Technical report, DTIC Document, 1951.
[100] Brian C Dean. Continuous-time dynamic shortest path algorithms. PhD
thesis, Massachusetts Institute of Technology, 1999.
[101] Robert Geisberger, Peter Sanders, Dominik Schultes, and Daniel Delling.
Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In Experimental Algorithms, pages 319–333. Springer, 2008.
[102] Ulrich Lauther. Slow preprocessing of graphs for extremely fast shortest
path calculations. In Lecture at the Workshop on Computational Integer
Programming at ZIB, volume 10, page 11, 1997.
[103] Ulrich Lauther. An extremely fast, exact algorithm for finding shortest
paths in static networks with geographical background. Geoinformation
und Mobilität - von der Forschung zur praktischen Anwendung, 22:219–230,
2004.
[104] Andrew V Goldberg and Chris Harrelson. Computing the shortest path:
A* search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 156–165. Society for Industrial and Applied Mathematics, 2005.
[105] Andrew V Goldberg and Renato Fonseca F Werneck. Computing point-to-point shortest paths from external memory. In ALENEX/ANALCO, pages
26–40, 2005.
[106] Thomas Pajor. Multi-modal Route Planning. PhD thesis, Universität Karlsruhe, Institut für Theoretische Informatik, 2009.
[107] Nvidia. Nvidia developer zone, 2007. https://developer.nvidia.com/, Accessed: 2013-12-03.
[108] Dhirendra Pratap Singh and Nilay Khare. A study of different parallel
implementations of single source shortest path algorithms. International
Journal of Computer Applications, 54(10):26–30, 2012.
[109] Andreas Crauser, Kurt Mehlhorn, Ulrich Meyer, and Peter Sanders. A parallelization of dijkstra’s shortest path algorithm. In Mathematical Foundations
of Computer Science 1998, pages 722–731. Springer, 1998.
[110] Joseph R Crobak, Jonathan W Berry, Kamesh Madduri, and David A Bader.
Advanced shortest paths algorithms on a massively-multithreaded architecture. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007.
IEEE International, pages 1–8. IEEE, 2007.
[111] Marios Papaefthymiou and Joseph Rodrigue. Implementing parallel
shortest-paths algorithms. DIMACS Series in Discrete Mathematics and
Theoretical Computer Science, 30:59–68, 1997.
[112] Pedro J Martín, Roberto Torres, and Antonio Gavilanes. Cuda solutions
for the sssp problem. In Computational Science–ICCS 2009, pages 904–913.
Springer, 2009.
[113] Leonardo Dagum and Ramesh Menon. Openmp: an industry standard api
for shared-memory programming. Computational Science & Engineering,
IEEE, 5(1):46–55, 1998.
[114] Steve Coast. Open street map, 2004. http://www.openstreetmap.org/.
[115] Marco A Boschetti, Vittorio Maniezzo, and Matteo Roffilli. A fully distributed lagrangean solution for a peer-to-peer overlay network design problem. INFORMS Journal on Computing, 23(1):90–104, 2011.
[116] Mauro Birattari, Luis Paquete, Thomas Stützle, and Klaus Varrentrapp.
Classification of metaheuristics and design of experiments for the analysis of
components. Technical Report AIDA-01-05, 2001.
[117] Ronald L Rardin and Reha Uzsoy. Experimental evaluation of heuristic
optimization algorithms: A tutorial. Journal of Heuristics, 7(3):261–304,
2001.
[118] Stefan Saroiu, P Krishna Gummadi, and Steven D Gribble. Measurement
study of peer-to-peer file sharing systems. In Electronic Imaging 2002, pages
156–170. International Society for Optics and Photonics, 2001.
[119] Boris Teodorovich Polyak. Minimization of unsmooth functionals. USSR
Computational Mathematics and Mathematical Physics, 9(3):14–29, 1969.
[120] Deborah T Marr, Frank Binns, David L Hill, Glenn Hinton, David A Koufaty, J Alan Miller, and Michael Upton. Hyper-threading technology architecture and microarchitecture. Intel Technology Journal, 6(1), 2002.