Alma Mater Studiorum - Università di Bologna

DOTTORATO DI RICERCA IN INFORMATICA
Ciclo XXVII

Settore Concorsuale di afferenza: 01/B1
Settore Scientifico disciplinare: INF/01

MANY-CORE ALGORITHMS FOR COMBINATORIAL OPTIMIZATION

Presentata da: Francesco Strappaveccia

Coordinatore Dottorato: Prof. Paolo Ciaccia
Relatore: Prof. Vittorio Maniezzo

Esame finale anno 2015

“May the wind under your wings bear you where the sun sails and the moon walks”
J. R. R. Tolkien

UNIVERSITY OF BOLOGNA

Abstract

Facoltà di Scienze Naturali, Fisiche e Matematiche
DISI, Dipartimento di Informatica, Scienze ed Ingegneria

Doctor of Philosophy

Many-core Algorithms for Combinatorial Optimization

by Francesco Strappaveccia

Combinatorial Optimization is becoming ever more crucial these days. From the natural sciences to economics, passing through urban administration and personnel management, methodologies and algorithms with a strong theoretical background and consolidated real-world effectiveness are more and more requested, in order to quickly find good solutions to complex strategic problems. Resource optimization is nowadays a fundamental ground on which to build successful projects. From the theoretical point of view, CO rests on stable and strong foundations that allow researchers to face ever more challenging problems. From the application point of view, however, it seems that the rate of theoretical development cannot keep up with that enjoyed by modern hardware technologies, especially with reference to the processor industry. We are witnessing a technological “golden era”, in which an enormous amount of computational capability is available at an affordable cost, even at the consumer market level. It is also true that, to fully exploit these devices, a complete focus shift is necessary in thinking about and approaching optimization problems. In this work we propose new parallel algorithms, designed to exploit the new parallel architectures available.
In our research, we found that, by exposing the inherent parallelism of some resolution techniques (like Dynamic Programming), the computational benefits are remarkable, lowering execution times by more than an order of magnitude and allowing us to address instances with dimensions not tractable before. We approached four notable CO problems: the Packing Problem, the Vehicle Routing Problem, the Single Source Shortest Path Problem and a Network Design problem. For each of these problems we propose a collection of effective parallel solution algorithms, either for solving the full problem (Guillotine Cuts and SSSPP) or for enhancing a fundamental part of the solution method (VRP and ND). We support our claims by presenting computational results for all problems, either on standard benchmarks from the literature or, when possible, on data from real-world applications, where speed-ups of one order of magnitude are usually attained, not uncommonly scaling up to 40 X factors.

Acknowledgements

I would like to thank first of all Professor Vittorio Maniezzo for his support during this journey; Dott. Marco Antonio Boschetti for his precious theoretical and human support; Prof. Marcus Poggi and Prof. Geir Hasle for their advice and punctual observations on my work. My family, for their love, since the beginning. Sara, for her love, patience and understanding in the deepest dejection and in the brightest joy (this would not have been possible without you). Alessandro, for his sincere and lasting friendship over the last ten years. Francesco, Stefano and Rossano for their sincere and “politically correct” friendship. Matteo and Samuele for their friendship in this foreign territory, from via Machiavelli to the “TwinPeaks-TriDomino-Jenga-Poker” evenings. Elena, for all her advice and suggestions (scientific or not) and the lunches together.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction

2 High Performance Computing
   2.1 Definition
   2.2 Main fields of application
      2.2.1 Simulation
      2.2.2 Bioinformatics
      2.2.3 Large Scale Data Visualization and Management
      2.2.4 Combinatorial Optimization
   2.3 High Performance Computing Programming Models
      2.3.1 Message Passing (MPI)
         2.3.1.1 MPI Execution Model
      2.3.2 Shared Memory (OpenMP)
         2.3.2.1 OpenMP Execution Model
      2.3.3 GPGPU Computing (CUDA, OpenCL)
         2.3.3.1 NVIDIA CUDA
         2.3.3.2 OpenCL

3 Combinatorial Optimization
   3.1 Definition
   3.2 Linear Problems (LP)
   3.3 Integer Linear Problems (ILP)
   3.4 Nonlinear Problems (NLP)
   3.5 Constraint Satisfaction Problems (CSP)
   3.6 Solution and Approximation Methodologies
      3.6.1 Simplex Algorithm
      3.6.2 Lagrangean Relaxation
      3.6.3 Column Generation
      3.6.4 Dynamic Programming
      3.6.5 Branch and Bound, Branch and Cut Techniques

4 Rectangular Knapsack, 2D Unconstrained Guillotine Cuts
   4.1 Introduction
   4.2 The 2D-GCP: notation, definitions and algorithms
      4.2.1 Principle of Normal Patterns
      4.2.2 A Dynamic Programming algorithm for the 2D-GCP
   4.3 The GPU porting of the dynamic programming algorithm
      4.3.1 Exploiting the inherent parallelism
      4.3.2 Recurrence Parallelization on GPU
      4.3.3 Parallel Reduction for the Max Operation
      4.3.4 Matrix Update
      4.3.5 GPU Algorithm
   4.4 Computational results
      4.4.1 Test instances
      4.4.2 Experiments
   4.5 Considerations and Future Work

5 Vehicle Routing Problem
   5.1 Introduction
   5.2 Problem Description and Mathematical Formulations
      5.2.1 Two Index Formulation
      5.2.2 Set Partitioning Formulation
   5.3 Dynamic Programming Relaxations for the Pricing Problem
      5.3.1 q-Route Relaxation
      5.3.2 through-q-Route Relaxation
      5.3.3 ng-Route Relaxation
   5.4 Parallel Relaxations on GPU
      5.4.1 GPU q-Route
         5.4.1.1 Exposing Parallelism
         5.4.1.2 Algorithm Description
      5.4.2 GPU through-q-route
         5.4.2.1 Exposing Parallelism
         5.4.2.2 Algorithm Description
      5.4.3 GPU ng-Route
         5.4.3.1 Exposing Parallelism
         5.4.3.2 Dominance Management
         5.4.3.3 Active Sets
         5.4.3.4 Algorithm Description
      5.4.4 Asymmetric Relaxations
   5.5 Computational Results
   5.6 Considerations and Future Work

6 Single Source Shortest Path Problem
   6.1 Introduction
   6.2 Problem Definition
      6.2.1 Single Source Shortest Path Problem
      6.2.2 Multi-Modal Networks
      6.2.3 Label Constrained Shortest Path Problem
      6.2.4 Time-Dependent Networks
   6.3 Serial Algorithm
      6.3.1 Time-Dependent Augmented Dijkstra Algorithm
   6.4 GPU Algorithm
      6.4.1 GPU Dijkstra Algorithm
         6.4.1.1 Frontier Set
         6.4.1.2 GPU Implementation
      6.4.2 Dynamic Frontier Definition
      6.4.3 GPU Time-Dependent Algorithm
   6.5 Shared Memory Algorithm
      6.5.1 OpenMP Porting
   6.6 Computational Results
      6.6.1 Data Sets Creation
      6.6.2 Results
   6.7 Considerations and Future Work

7 Membership Overlay Problem
   7.1 Introduction
   7.2 Membership Overlay Problem
      7.2.1 Definition and Mathematical Formulation
      7.2.2 MIP Formulation
   7.3 Linear and Lagrangean SMOP Relaxations
      7.3.1 LP Relaxation
      7.3.2 Lagrangean Relaxation
   7.4 Subgradient Algorithm
   7.5 Distributed Subgradient
      7.5.1 Algorithm foundations
      7.5.2 Quasi-constant step size update
      7.5.3 Algorithm description
   7.6 Computational Results
   7.7 Shared Memory Algorithm
   7.8 Distributed Memory Algorithm
   7.9 GPU Algorithm
   7.10 Computational Results
   7.11 Considerations and Future Work

8 Conclusions

Bibliography

List of Figures

2.1 OpenMP logo.
2.2 Fork-Join Execution Model.
2.3 CUDA logo.
2.4 CUDA execution and memory models.
2.5 OpenCL logo.
4.1 Computation of function V(x, y).
4.2 Parallel tasks executed exploiting the decomposition based on anti-diagonals.
4.3 Reduction Tree.
4.4 Naive reduction.
4.5 Naive reduction gives rise to bank conflicts (each column of the shared memory represents a memory bank).
4.6 Strided access allows a fast reduction.
4.7 Strided access avoids bank conflicts.
4.8 Matrix update in the GPU computing approach proposed.
4.9 randcut6000c source.
4.10 randcut6000c solution.
4.11 testcut8000 solution.
4.12 gcut13 solution.
5.1 f(q,i) q-path computation and 2-vertices loops avoiding.
5.2 ng-Path example.
5.3 Stage q parallel computation.
5.4 NG path 3-D states space.
6.1 Types of Routes.
6.2 Automaton managing four states.
6.3 Speed Profiles.
6.4 Test Path in Berlin.
6.5 Serial times vs OpenMP times for the Berlin instance.
6.6 Speed-Up for different h values for the Los Angeles instance.
7.1 P2P network with 4 nodes.
7.2 Edges bandwidth values.
7.3 Edges p’ values.
7.4 Edges lower bound l.
7.5 Edges upper bound u.
7.6 MOP Optimal Solution.
7.7 Nodes deployed on an eight-core processor.
7.8 Performance evaluation.

List of Tables

2.1 CUDA Memory Hierarchy.
4.1 Computational results on gcut instances.
4.2 Computational results on testcut instances.
4.3 Computational results on randcut instances.
5.1 Computational results for the symmetric q-path and through-q-route relaxations.
5.2 Computational results for the asymmetric q-path and through-q-route relaxations.
5.3 Computational results for the symmetric ng-path relaxation.
5.4 Computational results for the asymmetric ng-path relaxation.
6.1 Test instances dimensions.
6.2 Computational results on CPU.
6.3 Computational results on GPU.
7.1 Gaps and execution times for LP, STD, DIST.
7.2 Speed-Ups of the CUDA, MPI and OpenMP algorithms.

To my family. To Vera, Marcello and Augusta. To Sara.

Chapter 1

Introduction

Moore’s Law has, over the years, brought the microprocessor industry to a turning point.
In fact, doubling the computational capabilities of a CPU every eighteen months was becoming a huge engineering problem, due to increasing clock frequencies, power consumption and expensive cooling systems. For instance, the heat density on the surface of a high-frequency microprocessor can be compared to the heat density over the same area of the Sun's surface. The solution has been to decrease clock frequencies and to add more computation units inside the same silicon die; Moore's Law thus shifted its perspective from doubling frequencies to doubling the number of computation units inside the processor, and remains valid. This paradigm shift forced a massive change in software engineering and in computer science in general: from building an operating system to designing new algorithms, the high level of parallelism of the new platforms is an ever more important aspect to consider. The academic research community, too, is every year more interested in exploiting this enormous new availability of computing resources, aware that it can be a great opportunity to reach new and significant goals, unexpected until a few years ago. On the other hand, taking advantage of these architectures requires a high level of expertise and a deep knowledge of the features implemented on these devices. A new kind of parallel Lagrangean relaxation, related to a network design problem (the Membership Overlay Problem), was analyzed and implemented in [1]; the algorithm achieved a great speed-up by implementing it with three of the main parallel paradigms available (MPI, OpenMP and CUDA). This result encouraged a deeper investigation of this new approach to designing optimization algorithms, and this topic was chosen as the main focus for this Ph.D. thesis.
Moreover, Combinatorial Optimization is still only marginally affected by this new approach, and the analysis of the most exploited resolution methods, like Dynamic Programming, Branch and Bound techniques or Column Generation, under this new parallel perspective is in its infancy. State-of-the-art microprocessors now allow treating problems never approached before, due to their data dimensions, the enormous amount of computation required for their solution, or the computation's granularity. The contribution of this thesis is a collection of parallel algorithms designed to enhance, sometimes significantly, the execution times of the methods cited above. We approached some of the most notable problems in CO.

The Rectangular Knapsack (Unconstrained 2D Guillotine Cuts Problem), achieving, through a Dynamic Programming algorithm designed for a many-core platform (GPU), a speed-up of 22 X on the biggest instances known in the literature and a 30 X speed-up factor on two new sets of instances, generated to test the real effectiveness of the method.

The Vehicle Routing Problem, for which we provide six many-core algorithms enhancing the relaxations (q-route, through-q-route and ng-route relaxations) of the pricing problem inside the Column Generation exact method, based on the Set Partitioning formulation of the problem. In this case, we achieved a maximum speed-up of 40 X for the asymmetric version of the ng-route relaxation, an average speed-up of 20 X for the q-route and through-q-route methods and an average speed-up of 10 X for the ng-routes.

The Single Source Shortest Path Problem, for solving the Earliest Arrival Problem in a multi-modal, time-dependent environment. For this problem, we provide a multi-core (CPU) algorithm, achieving a maximum 3.5 X speed-up factor on instances based on real-world data.

A Network Design Problem, the Membership Overlay Problem, relative to the Peer-to-Peer network model.
For this problem, we propose a parallel subgradient algorithm for solving the Lagrangean Dual Problem arising from the problem's Lagrangean relaxation, achieving a 29 X speed-up on the bigger instances.

This thesis is structured as follows. In Chapter 2 we will briefly review the High Performance Computing field and the currently most used parallel programming models. In Chapter 3 we will give some definitions for Combinatorial Optimization problems and shortly describe some solution methodologies. In Chapter 4 we will propose a GPU algorithm for solving the Two-Dimensional Guillotine Cutting Problem, with computational results and two new sets of instances. In Chapter 5 we will describe the most effective pricing strategies for Column Generation algorithms for solving some classes of the VRP, proposing the corresponding parallel algorithms with computational results. In Chapter 6 we will address the Single Source Shortest Path Problem in a time-dependent multi-modal environment, proposing two parallel algorithms (CPU and GPU) and a new set of instances. In Chapter 7 we will propose three parallel subgradient algorithms for three platforms (CPU, GPU and cluster) to obtain a valid bound for the Membership Overlay Problem. In Chapter 8 we will draw the conclusions of our work and give some guidelines for future research.

Chapter 2

High Performance Computing

2.1 Definition

By High Performance Computing we mean the use of computers for high-throughput computation, for solving large problems, or for getting results faster. "High performance" is relative to desktop computers: servers or clusters characterized by multiple computation units that can cooperate. Supercomputers were introduced in the 1960s and were designed primarily by Seymour Cray at Control Data Corporation (CDC), and later at Cray Research.
While the supercomputers of the 1970s used only a few processors, in the 1990s machines with thousands of processors began to appear, and by the end of the 20th century massively parallel supercomputers with tens of thousands of "off-the-shelf" commercial processors were marketed [2]. The design of systems with a massive number of processors generally takes one of two paths. In one approach, e.g. grid computing, the processing power of a large number of computers in distributed, diverse administrative domains is opportunistically used whenever a computer is available. In the other approach, a large number of processors are used in close proximity to each other, for instance in a computer cluster. The use of multi-core processors combined with centralization is an emerging direction [3]. In the early years of the 21st century, another emerging trend is to evaluate the "performance per watt" factor, i.e. the sustainability of a supercomputer, which deserves a separate ranking [4]. The energy consumption of a machine with 100,000 or more cores is obviously relevant. To address this aspect as well, GPU computing, or heterogeneous computing, seems to be the right way: the most recent supercomputers in the Top 500 list [5] are in most cases composed not only of canonical processors but, inside every computation node, also of several "accelerators", dramatically more energy-efficient than canonical CPUs.

2.2 Main fields of application

The need for such a large amount of computational capability is restricted to specific research areas like physics, chemistry and engineering. In the next sections, we will give a brief list of the applications of HPC. Nowadays, however, parallelism is a common factor that brings together the canonical computer and the most advanced smartphones, and the same paradigms used to exploit HPC structures can be applied on smaller systems, enhancing their performance, sometimes dramatically.
2.2.1 Simulation

Simulation is the imitation of the operation of a real-world process or system over time [6]. The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors of the selected physical or abstract system or process. The model represents the system itself, whereas the simulation represents the operation of the system over time. Simulation is used in many fields, such as engineering design, building design, geology, climatology and video games. Simulation can be used to show the eventual real effects of alternative conditions or to predict the behavior of a system (for example, a building during an earthquake, or an electronic device like a smartphone). Simulation is also used when the real system cannot be directly studied, because it is inaccessible, dangerous, not yet built or simply virtual. More specifically, a computer simulation models a real-life or hypothetical situation on a computer so that it can be studied to see how the system works. By changing variables in the simulation, predictions may be made about the behavior of the system; it is a tool to virtually investigate the studied system under changing conditions. Computer simulation has become a useful part of modeling many natural systems in physics, chemistry and biology, as well as in engineering. For example, the effectiveness of exploiting computers for simulating the behavior of a system is particularly useful in the field of network traffic analysis. Key issues in simulation include the acquisition of valid source information about the relevant selection of key characteristics and behaviors, the use of simplifying approximations and assumptions within the simulation, and the fidelity and validity of the simulation outcomes.
Traditionally, mathematical models are used to formally describe systems: models that attempt to find analytical solutions enabling the prediction of the behavior of the system from a set of parameters and initial conditions. Computer simulation is often used as an adjunct to, or substitute for, modeling systems for which simple closed-form analytic solutions are not possible. There are many different types of computer simulation; the common feature they all share is the attempt to generate a sample of representative scenarios for a model in which a complete enumeration of all possible states would be prohibitive or impossible [7]. These are some research areas that use HPC:

• Computational Fluid Dynamics: a branch of fluid mechanics that uses numerical methods and algorithms to solve and analyze problems that involve fluid flows. Computers are used to perform the calculations required to simulate the interaction of liquids and gases with surfaces defined by boundary conditions. With high-speed supercomputers, better solutions can be achieved. Ongoing research yields software that improves the accuracy and speed of complex simulation scenarios such as transonic or turbulent flows. Initial validation of such software is performed using a wind tunnel, with the final validation coming in full-scale testing, e.g. flight tests [8].

• Computational Chemistry: a branch of chemistry that uses principles of computer science to assist in solving chemical problems. It uses the results of theoretical chemistry, incorporated into efficient computer programs, to calculate the structures and properties of molecules and solids [9].

• Multi Agent Systems: systems composed of multiple interacting intelligent agents within an environment. Multi agent systems can be used to solve problems that are difficult or impossible for an individual agent or a monolithic system to solve.
Intelligence may include some methodic, functional, procedural or algorithmic search, find and processing approach. The study of multi-agent systems is concerned with the development and analysis of sophisticated Artificial Intelligence problem-solving and control architectures for both single-agent and multi-agent systems [10].

• Computational Astrophysics: the methods and computing tools developed and used in astrophysics research. It is both a specific branch of theoretical astrophysics and an interdisciplinary field relying on computer science, mathematics, and wider physics. Computational astrophysics is most often studied through an applied mathematics or astrophysics program. Well-established areas of astrophysics employing computational methods include magnetohydrodynamics, astrophysical radiative transfer, stellar and galactic dynamics, and astrophysical fluid dynamics [11].

• Computational Physics: the study and implementation of numerical algorithms to solve problems in physics for which a quantitative theory already exists. It is often regarded as a sub-discipline of theoretical physics, but some consider it an intermediate branch between theoretical and experimental physics. It is a subset of computational science (or scientific computing), which covers all of science rather than just physics. Theoretical physicists provide very precise mathematical theories describing how a system will behave. Unfortunately, it is often the case that solving a theory's equations ab initio in order to produce a useful prediction is not practical. This is especially true in quantum mechanics, where only a handful of simple models admit closed-form, analytic solutions. In cases where the equations can only be solved approximately, computational methods are often used [12].
As we can see, these scientific disciplines need an enormous amount of computing time, due to the intensive and complex kinds of calculation required to solve the models on which they are based.

2.2.2 Bioinformatics

Bioinformatics is an interdisciplinary field involving disciplines from data mining and machine learning to operational research, that develops and improves methods for storing, retrieving, organizing and analyzing biological data. A major activity in bioinformatics is to develop software to generate useful biological knowledge (e.g. GROMACS [13]). Bioinformatics has become a fundamental means for many areas of biology, mainly the ones characterized by strong mathematical and statistical aspects. In experimental molecular biology, for instance, bioinformatics techniques such as image and signal processing allow the extraction of useful results from large amounts of raw data. In the field of genetics and genomics, it aids in sequencing and annotating genomes and their observed mutations [14]. Bioinformatics tools also aid in the comparison of genetic and genomic data, a task that otherwise would not be possible, given the enormous amount of data to analyze. Another notable contribution is the simulation and modeling of DNA, RNA and proteins in general, as well as of molecular interactions, using precise mathematical models and the computational resources made available by the microprocessor industry. The peculiarity of this research field is to design computationally intensive methods to enhance the disciplines mentioned before, influenced by the enormous complexity of the problems treated. Some examples of involved methodologies include pattern recognition, data mining, machine learning, and visualization.
Bioinformatics has brought remarkable contributions to sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution. Another challenging goal is to develop software and hardware designed following patterns inspired by the biological world itself, or more appropriate to precise biological analysis tasks (networks, processors, etc.).

2.2.3 Large Scale Data Visualization and Management

Data visualization is the study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information [15]. Data visualization and management has become an active area of research, teaching and development; indeed, this field became especially interesting with the new HPC technologies. Also, considering the enormous quantity of data available nowadays, correct data interpretation based on solid statistical and mathematical models has become strategic, from marketing campaign planning to corporate quantitative analysis. Some related areas are:

• Data Acquisition: the sampling of the real world to generate data that can be manipulated by a computer. Sometimes abbreviated DAQ or DAS, data acquisition typically involves the acquisition of signals and waveforms and the processing of those signals to obtain the desired information. The components of data acquisition systems include appropriate sensors that convert any measured parameter to an electrical signal, which is acquired by data acquisition hardware and stored in digital form [16].

• Data Analysis: the process of studying and summarizing data with the goal of extracting useful information.
Data analysis is closely related to data mining, but data mining tends to focus on larger data sets with less emphasis on making inferences, and often uses data originally collected for a different purpose. In statistical applications, it is common to divide data analysis into descriptive statistics, exploratory data analysis, and inferential statistics (or confirmatory data analysis). For example, exploratory data analysis focuses on discovering new features in the data, whereas confirmatory data analysis aims at confirming or refuting existing hypotheses [17]. • Data Mining: the process of sorting through large amounts of data and picking out relevant information [18]. It is usually used in business intelligence and finance, but it is also being used in the sciences to extract useful information from enormous data sets (e.g. large-scale physics experiments). It has been described as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [19] and the science of extracting useful information from large data sets or databases [20]. In relation to enterprise resource planning, data mining is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making. 2.2.4 Combinatorial Optimization Combinatorial Optimization is a field that consists of finding an optimal solution to a constrained problem [21]. In many problems, complete search or enumeration of the solutions is not possible. CO focuses on optimization problems in which the set of feasible solutions is discrete or can be reduced to a discrete one, and in which the goal is to find the best solution. This field is characterized by dealing
with NP-Hard / NP-Complete problems and, obviously, the optimal or near-optimal solution of these kinds of problems requires a large amount of computing time and large-scale data structures to manage the often combinatorial explosion of the data involved. This field will be treated more deeply in Chapter 3, analyzing and describing various kinds of combinatorial problems and their formulations. Combinatorial Optimization involves some of the research areas cited earlier. In fact, many topics related to physics, chemistry and engineering need the solution of linear or nonlinear constrained problems (e.g. lattice problems in particle physics). 2.3 High Performance Computing Programming Models In this section we give a short introduction to and description of the “de facto” standards used for the implementation of parallel applications. The spectrum of parallel programming models is indeed enormously wide, often bound to specific types of hardware and software vendors; the three paradigms described in the following are the most used and studied, for their portability, effectiveness and generality of model. MPI, OpenMP, CUDA and OpenCL are “de facto” standards because the large majority of parallel infrastructures implements these models, and the most important parallel libraries (for instance BLAS [22]) and software packages (for instance OpenFOAM [23], Quantum ESPRESSO [24] and many more) are based on them. 2.3.1 Message Passing (MPI) Message Passing Interface (MPI), whose standardization effort started in 1992 with major contributions from William Gropp and Ewing Lusk, is a standardized and portable message-passing system designed by a group of researchers from academia and industry to operate on a wide variety of parallel computers [25]. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran 77, Fortran 90 or the C programming language.
Several well-tested and efficient implementations of MPI exist, including some that are free and in the public domain. These fostered the development of a parallel software industry and thereby encouraged the development of portable and scalable large-scale parallel applications. MPI provides parallel hardware vendors with a clearly defined base set of routines that can be efficiently implemented. As a result, hardware vendors can build upon this collection of standard low-level routines to create higher-level methods for the distributed-memory communication environment supplied with their parallel machines. MPI provides a simple-to-use portable interface for the basic user, yet one powerful enough to allow programmers to use the high-performance message-passing operations available on advanced machines. After twenty years, MPI is still the most used and effective parallel programming model for clusters. Formally, MPI is defined as a “message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation” [26]. MPI is not a “de iure” standard but, due to its portability and well-defined behavior and interface, it has become the “de facto” standard for distributed-memory architectures. MPI is an API covering layers 5 to 7 of the ISO-OSI communication stack. The benefit of using MPI is its complete portability in every parallel environment. In fact, each MPI implementation provided by a vendor is optimized for the hardware where the application runs. Moreover, MPI allows the coexistence of portions of code written with other programming models inside the same application; common is the hybridization with OpenMP, in order to easily exploit the shared-memory nodes inside the cluster (multi-core CPUs). MPI itself, in any case, allows the programmer to manage a shared-memory computation node.
In recent years, hybridization with specific language extensions used to manage GPUs or many-core devices inside the computation node (CUDA or OpenCL) has also become common. 2.3.1.1 MPI Execution Model The MPI interface is meant to provide virtual topology, synchronization, and communication functionalities between a set of processes (that have been mapped to nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings), plus a few language-specific features. MPI programs always work with processes, but programmers commonly refer to the processes as processors. Typically, for maximum performance, each CPU (or core in a multi-core machine) will be assigned just a single process. This assignment happens at runtime through the agent that starts the MPI program, normally called mpirun or mpiexec. MPI library functions include, but are not limited to: • point-to-point, rendezvous-type send/receive operations, • choosing between a Cartesian or graph-like logical process topology, • exchanging data between process pairs (send/receive operations), • combining partial results of computations (gather and reduce operations), • synchronizing nodes (barrier operation), as well as obtaining network-related information such as: • the number of processes in the computing session, • the identity of the processor that a process is currently mapped to, • the neighboring processes accessible in a logical topology. More in detail, the MPI runtime system creates n processes called tasks, and each of these tasks: • creates an independent copy of the application in the node where it is executing, • has local memory, • can be mapped on a different processor, • has a unique index among the tasks created by the user. Each process is part of a Communicator. An MPI Communicator is an object connecting groups of processes.
Each communicator: • has a name, • has a cardinality, • assigns to each of its processes an index proper to the communicator itself, • arranges the processes in an ordered topology; • each process is equivalent to the others. MPI optimizes the deployment of the communicators, also understanding when a data transfer or a communication is relative only to a specific communicator. Each MPI application is a single executable, managed by the run-time system that organizes the communication among the application’s processes. The MPI primitives are blocking or non-blocking for the application execution; obviously, the non-blocking ones are recommended when exploitable. The primitives can also have different modes of communication: • Standard: automatic synchronization and buffering, • Buffered: buffering defined by the user, • Synchronous: strict rendezvous, • Ready: instant communication without hand-shaking. The current version of MPI is 3.0. 2.3.2 Shared Memory (OpenMP) Figure 2.1: OpenMP logo. OpenMP (Open Multiprocessing) is an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran [27], on most processor architectures and operating systems, including Solaris, AIX, HP-UX, GNU/Linux, Mac OS X, and Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence runtime behavior. OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board (or OpenMP ARB), represented by a group of computer hardware and software vendors: AMD, IBM, Intel, Cray, HP, Fujitsu, NVIDIA, NEC, Microsoft, Texas Instruments, Oracle Corporation, and more [28]. OpenMP offers a portable, scalable programming model that provides users with a minimal and flexible interface for developing parallel applications for a wide spectrum of machines, from desktop computers to clusters.
OpenMP obtained a reasonable success due to some interesting characteristics: • a great emphasis on structured parallel programming, • its modest learning curve: the compiler bears all the most difficult aspects of shared-memory parallel programming (thread synchronization, etc.), • portability: OpenMP libraries are available natively for the most common and used programming languages, • incremental parallelization of the serial code, by adding only some simple pre-processor directives. The most important aspect of OpenMP is the constant development and support from the most important actors of the software and hardware industry, making it a ‘de facto’ standard for shared-memory parallel programming. The OpenMP directives allow programmers to indicate to the compiler which instructions to execute in parallel and how to divide the workload among threads. These compiler directives are extremely flexible and do not require rewriting the code when the platform or compiler changes; besides, if a compiler or a platform does not support OpenMP, the serial code remains valid as it is. 2.3.2.1 OpenMP Execution Model OpenMP supports the Fork-Join programming model [29]. Under this approach, the program starts as a single thread of execution, just like a sequential program. The thread that executes this code is referred to as the initial thread. Whenever an OpenMP parallel construct is encountered by a thread while it is executing the program, it creates a team of threads (fork), becomes the master of the team, and collaborates with the other members of the team to execute the code dynamically enclosed by the construct. At the end of the construct, only the original thread, or master of the team, continues; all others terminate (join). Each portion of code enclosed by a parallel construct is called a parallel region. A thread is a run-time entity capable of independently executing an instruction stream [30].
In more detail, once the operating system executes an OpenMP application: • a process is created for the program, • the necessary system resources are instantiated: memory pages and registers. If more threads are spawned, they will share the resources created by the operating system, as well as the same memory address space. Each thread needs only a few resources: • a program counter, • private memory locations for its specific data (registers and execution stack). Multiple threads can be executed by the same processor or core through context-switching procedures. Multiple threads in a multi-core processor can run in a parallel fashion, properly orchestrated. OpenMP makes a considerable number of interactions among the threads transparent to the programmer, providing a friendlier development environment. Figure 2.2: Fork-Join Execution Model. The Fork/Join Execution Model implemented by OpenMP can be described as follows and is depicted in Figure 2.2: • the application starts in serial mode, with only one thread, the initial thread, • once the code sections where the OpenMP directives are located are reached, a team of threads is spawned (Fork), • the initial thread becomes the master thread and coordinates and cooperates with the other threads in the parallel section, • at the end of the OpenMP directives, the master thread continues the execution and the others are terminated (Join). In parallel regions, the developer can orchestrate the thread interactions at a higher abstraction level, leaving the lower-level implementation details to the compiler. The OpenMP model implements, inside the single multi-core/multi-threaded elaboration unit, the same characteristics described for MPI.
One of the most used and effective techniques to develop parallel applications on massively parallel architectures is to mix, or hybridize, MPI code, which handles the inter-node communications and workload, with the intra-node communications and work-sharing, implemented with OpenMP. This approach guarantees greater scalability of the application. The current release of OpenMP is 3.0. 2.3.3 GPGPU Computing (CUDA, OpenCL) Since 2004, when Intel decided to cancel the development of its last single-core processor in order to concentrate its efforts on multi-core architectures, we have observed a radical change in the design of new generations of CPUs. The focus, indeed, shifted from the increase of the processor’s frequency to the implementation of parallel architectures. Following this trend, the market got populated by relatively affordable devices with great, natively parallel, computational resources. General-purpose computing on graphics processing units (GPGPU) is the exploitation of the graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation commonly handled by the CPU. Moreover, the use of multiple GPUs in the same computer or computation node further parallelizes the already parallel nature of these devices. OpenCL is currently the most used open GPU computing language; the proprietary counterpart framework is NVIDIA’s CUDA. More in general, this approach is called many-core computation, which differs from ordinary CPU computing in that a GPU is composed of hundreds of simple computational units, whereas an ordinary CPU has fewer but more sophisticated cores. This different perspective in the architecture’s design makes the GPU the silver bullet for highly dense computations involving many thousands of small entities and a fine-grained granularity of computation.
2.3.3.1 NVIDIA CUDA Figure 2.3: CUDA logo. In 2007, NVIDIA introduced CUDA (Compute Unified Device Architecture) [31], a parallel, general-purpose programming model designed to exploit the great computational resources available on a GPU. CUDA is implemented as an extension of the C/C++ or Fortran programming languages. NVIDIA GPUs follow the SIMT (Single Instruction Multiple Threads) execution model, in which the same instruction is executed concurrently on multiple data by different threads. CUDA programming implements a higher abstraction model than a straight transposition of the hardware architecture of the GPU. The model works at the top level on a kernel, which is the portion of code executed asynchronously on the GPU, the rest of the user code being directly executed on the CPU. The kernel is executed in parallel on a user-defined number of threads. The threads are grouped into blocks, which are equal-cardinality subsets of threads. The blocks are in turn logically arranged into a grid, which is a 1-, 2- or 3-dimensional array of blocks (see Figure 2.4a). Figure 2.4: CUDA execution and memory models ((a) execution model, (b) memory model). Every block is logically executed on a different SM (Streaming Multiprocessor) of the GPU, which consists of a set of simple arithmetic cores called CUDA Cores (the Fermi architecture counts 32 CUDA Cores per SM; Kepler counts 192, calling the unit SMX; Maxwell counts 128, calling it SMM). Every block can accept up to 1024 threads [31], even though it will actually simultaneously execute sets of only 32 threads at a time, called Warps. Figure 2.4a shows this model in detail. There is a memory hierarchy on these devices, aimed at minimizing the communications between the host memory and the device memory (whose largest portion is named Global Memory in this context), which travel via the PCI-Express bus.
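The grid/block/thread decomposition determines which data element each thread touches: the standard CUDA index arithmetic is global index = blockIdx * blockDim + threadIdx, with a guard because the last block may be only partially full. The loops below emulate that arithmetic in plain Python (no GPU involved) on a vector addition, the canonical first kernel:

```python
def vector_add_emulated(a, b, block_dim=32):
    # Grid of ceil(n / block_dim) blocks, each holding block_dim threads.
    n = len(a)
    grid_dim = (n + block_dim - 1) // block_dim
    c = [0] * n
    for block_idx in range(grid_dim):           # each block is scheduled on some SM
        for thread_idx in range(block_dim):     # the block's threads (issued in warps of 32)
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n:                           # guard: the last block may exceed n
                c[i] = a[i] + b[i]              # one element per thread
    return c
```

On a real GPU the two loops disappear: every (block_idx, thread_idx) pair runs concurrently as a hardware thread, and the kernel body is just the three guarded lines.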
In fact, besides the per-core registers, the GPU has an on-board, low-latency memory module (the Shared Memory) accessible to all the threads of a single block. Table 2.1 summarizes the features of CUDA’s memory architecture. The access to Global Memory is a significant bottleneck for performance enhancement in the design of a GPU-accelerated application. The Global Memory is the only memory type visible to the entire grid, and sometimes it is needed for the synchronization and data sharing among the blocks, even though its access latency is very high (400-800 memory cycles). There are many techniques used to hide this latency, but they are mainly application-specific, thus often impossible to apply if the peculiar characteristics of the considered algorithm do not allow them. It is therefore crucial to properly manage the access to global memory. The effectiveness of this management is measured by the achieved memory bandwidth, which quantifies the amount of relevant data which the application can get from the global memory per second. Table 2.1: CUDA Memory Hierarchy.

Memory Type | Scope     | Lifetime    | Access    | Location | Cached
Global      | Grid+host | Application | R-W       | Off-chip | No
Shared      | Block     | Kernel      | R-W       | On-chip  | N/A
Local       | Thread    | Thread      | R-W       | Off-chip | No
Register    | Thread    | Thread      | R-W       | On-chip  | N/A
Constant    | Grid+host | Application | Read Only | Off-chip | Yes
Texture     | Grid+host | Application | Read Only | Off-chip | Yes

2.3.3.2 OpenCL Figure 2.5: OpenCL logo. The Open Computing Language (OpenCL) [32] is a heterogeneous programming framework created and sponsored by the nonprofit technology consortium Khronos Group and adopted by some of the most important technology actors, like Intel, AMD, NVIDIA, ARM and Apple. OpenCL is a framework for developing applications that execute across a broad spectrum of device types made by different vendors.
It supports a wide range of levels of parallelism and efficiently maps to homogeneous or heterogeneous, single- or multiple-device systems consisting of CPUs, GPUs, and other types of device (e.g. FPGAs, APUs, . . . ). The OpenCL definition offers both a device-side language (based on C99) and a host management layer, composed of ad-hoc designed APIs, for the devices in a system. The programming language used to write computation kernels is based on C99 with some limitations and additions. It omits the use of function pointers, recursion, bit fields, variable-length arrays, and standard C99 header files. The language is extended for parallelism with vector types and operations, synchronization primitives, and functions to manipulate work-items/groups. It also implements memory region qualifiers: global, local, constant, and private. To enhance productivity, many built-in functions have been added (e.g. trigonometric functions). One of the most interesting and useful OpenCL peculiarities is Device Fission, which allows queuing instructions to a specific section or set of cores of the processor. This feature allows splitting the device into multiple areas assigned to different tasks, maximizing its utilization or customizing the application for specific types of processors. The current release of OpenCL is 2.0. Chapter 3 Combinatorial Optimization 3.1 Definition In mathematics and computer science, an optimization problem is the problem of finding the best among all feasible solutions. Optimization problems can be divided into two categories, depending on whether the variables are continuous or discrete. An optimization problem with discrete variables is known as a combinatorial optimization problem [33]. In a combinatorial optimization problem, we are looking for an object such as an integer, a permutation or a graph from a finite (or possibly countably infinite) set.
As anticipated in Chapter 1, in this chapter we explain the Combinatorial Optimization field in more detail. In general, we can describe an optimization problem as:

z = min f(x)
s.t. g_i(x) ≤ 0, i = 1, . . . , m    (3.1)
     h_i(x) = 0, i = 1, . . . , p

where: • f(x) : R^N → R is the objective function to be minimized (maximized) over the variable x, • g_i(x) ≤ 0 are the inequality constraints, • h_i(x) = 0 are the equality constraints. In the next sections we will describe some common problems and some of the most used and well-known resolution methods. 3.2 Linear Problems (LP) A Linear Problem (LP) can be expressed in this way (canonical form):

z = min cx
s.t. Ax ≤ b    (3.2)
     x ≥ 0

where: • A is the matrix of coefficients that describe the convex set, • c, b are the vectors of costs and known coefficients, • x is the linear solution. The problem is characterized by being constrained only by linear inequalities. Another way to describe the problem is:

z = min Σ_{j=1..n} c_j x_j
s.t. Σ_{j=1..n} a_ij x_j ≤ b_i, i = 1, . . . , m    (3.3)
     x_j ≥ 0, j = 1, . . . , n

where: • a_ij are the elements of A, • b_i and c_j are the elements of b and c respectively, • x_j are the elements of the linear solution x. 3.3 Integer Linear Problems (ILP)

z = min cx
s.t. Ax ≤ b    (3.4)
     x ≥ 0
     x integer

This kind of problem is characterized by an additional constraint imposing that the solution x must be composed only of integer numbers. Two sub-classes of ILP are: • MILP: where only some x_j variables are integer, • Zero-One Problems: where x_j ∈ {0, 1}. 3.4 Nonlinear Problems (NLP) They are distinguished by the presence of nonlinear constraints inside the system. A simple example can be:

z = min x_1 + x_2
s.t. x_1^2 + x_2^2 ≥ 1
     x_1^2 + x_2^2 ≤ 2    (3.5)
     x_1 ≥ 0, x_2 ≥ 0

Generally, a nonlinear problem can be described in this way:

z = min f(x)
s.t. g_i(x) ≤ 0, i ∈ I = {1, . . . , m}    (3.6)
     h_j(x) = 0, j ∈ J = {1, . . . , p}

where:
• f(x) : R^N → R is the objective function to be minimized (maximized) over the variable x ∈ R^N, • g_i(x) ≤ 0 are the inequality constraints, • h_j(x) = 0 are the equality constraints, and at least one of f, g_i, h_j is nonlinear, which is what makes the model nonlinear. 3.5 Constraint Satisfaction Problems (CSP) Constraint Satisfaction arose mainly from Artificial Intelligence. A Constraint Satisfaction Problem (CSP) is a problem defined as a set of objects that must satisfy a number of constraints or relations among the problem’s decision variables [34]. Constraint programming is “programming” in a double sense: not only “mathematical programming”, in the sense of declaring constraints and decision variables, but also “computer programming”, in the sense of programming a search strategy. The methods used to solve this kind of problem are various: Branch and Bound algorithms, Backtracking, Local Search, typically all embedded in a properly designed solver. Formally, a constraint satisfaction problem is defined as a triple ⟨X, D, C⟩, where X is a set of variables, D is a domain of values, and C is a set of constraints. Every constraint is in turn a pair ⟨t, R⟩ (usually represented as a matrix), where t is an n-tuple of variables and R is an n-ary relation on D. An evaluation of the variables is a function from the set of variables to the domain of values, v : X → D. An evaluation v satisfies a constraint ⟨(x_1, . . . , x_n), R⟩ if (v(x_1), . . . , v(x_n)) ∈ R. A solution is an evaluation that satisfies all constraints [34]. Constraint Programming has been used to solve combinatorial problems like: • the Eight Queens puzzle, • the Map Coloring problem, • Sudoku and many other logic puzzles, • DNA sequencing, • Scheduling. Often CP deals with problems that are really hard to solve using canonical OR or CO techniques.
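As a concrete instance of the ⟨X, D, C⟩ triple, a small map-coloring CSP with binary inequality constraints can be solved by the plain backtracking search mentioned above; the four-region graph and three-color domain below are a made-up illustrative instance:

```python
def backtrack(variables, domain, constraints, assignment=None):
    # Generic CSP backtracking: constraints is a list of (u, v) pairs,
    # each meaning "u and v must take different values" (binary inequality).
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return assignment                 # every variable assigned: a solution
    var = next(v for v in variables if v not in assignment)
    for value in domain:
        assignment[var] = value
        consistent = all(assignment[u] != assignment[v]
                         for u, v in constraints
                         if u in assignment and v in assignment)
        if consistent:
            result = backtrack(variables, domain, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]               # undo the choice and try the next value
    return None                           # no value works: backtrack further up

# A tiny 4-region map with 3 colors (hypothetical instance).
regions = ["A", "B", "C", "D"]
borders = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
solution = backtrack(regions, ["red", "green", "blue"], borders)
```

Here X = regions, D = the three colors, and C = one inequality constraint per border; the evaluation returned by `backtrack` satisfies every constraint, exactly as in the formal definition.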
3.6 Solution and Approximation Methodologies 3.6.1 Simplex Algorithm Following some notable theoretical results of linear algebra, it can be shown that, for a linear program, if the objective function has a minimum value in the feasible region, then it attains this value at (at least) one of the extreme points of the convex set described by the inequalities constraining the problem [35]. It has also been proven that there is a finite number of extreme points in the convex set, but the number of extreme points is extremely large even for small linear programs, making the enumeration of these points inapplicable. It can also be shown that there exists an edge connecting an extreme point to another one where the objective function decreases, if the starting point is not a minimum. If the edge is finite, it brings to another extreme point where the objective function has a smaller value; otherwise, the objective function is unbounded and the linear program has no solution. The simplex algorithm proposed by Dantzig [36] exploits these results by walking along edges of the polytope to extreme points with lower and lower objective values, until the minimum value is reached or an unbounded edge is visited, concluding that the problem has no solution. The great contribution given by this algorithm is that it always terminates, because the number of vertices in the polytope is finite; moreover, since we jump between vertices always in the same direction (that of the objective function), we hope that the number of vertices visited will be small. The solution of a linear program is accomplished in two steps. In the first step, we need to find an extreme point from which to start. The possible results of the first step are either that a basic feasible solution is found or that the feasible region is empty. In the latter case the linear program is called infeasible. In the second step, the simplex algorithm is applied using the basic feasible solution found as a starting point.
The possible results from the second step are either an optimal feasible solution to the linear program or an infinite edge on which the objective function is unbounded. Due to the wide spectrum of applications of this algorithm and of its variations, commercial and free software packages implementing this method have been proposed. The most used and effective is the IBM ILOG CPLEX solver [37]. It also implements methods for solving MIP problems, Quadratic Problems and Quadratically Constrained Problems. Its free and open-source counterpart is COIN-OR, the Computational Infrastructure for Operations Research [38], a project developed and maintained by the operations research community, aimed at providing a free environment for developing and testing OR algorithms. 3.6.2 Lagrangean Relaxation Lagrangean relaxation, proposed by Polyak [39] and used for the first time by Held et al. [40] and Held and Karp [41, 42], is a relaxation method which approximates a difficult optimization problem by a ‘simpler’ one. Solving the relaxed problem can give an approximate solution to the original one. The method penalizes violations of inequality constraints using a Lagrangean multiplier, which imposes a cost on constraint violations in the objective function. In practice, this relaxed problem can often be solved more easily than the original one by using polynomial algorithms like the Subgradient method [43], providing useful information for its solution. The problem of maximizing the Lagrangean function of the dual variables is the Lagrangean dual problem (if we are searching for the function’s minimum). Under certain conditions regarding the function’s convexity and the constraints, we can state that the solutions of the primal and dual Lagrangean problems are the same, avoiding the Duality Gap. Taking as an example a linear problem:

z = min cx
s.t. A1 x ≤ b1
     A2 x ≤ b2    (3.7)
     x ≥ 0

where the constraints in A2 are considered ‘difficult’.
We can relax the problem by adding a penalty λ and bringing the difficult constraints into the objective function:

z = min cx + λ(b2 − A2 x)
s.t. A1 x ≤ b1    (3.8)
     x ≥ 0

This is a relaxed problem, different from the initial one, but it can be used as a bound on the optimal value inside other solution techniques. 3.6.3 Column Generation One of the most difficult aspects related to many linear programs is that the number of decision variables is too large for all of them to be considered explicitly. Since most of the variables will be non-basic, assuming a value of zero in the optimal solution, only a subset of the variables needs to be considered when solving the problem. Column generation [44, 45] exploits this idea: it generates only the variables which have the potential to improve the objective function, i.e., variables with negative reduced cost (assuming, without loss of generality, that the problem is a minimization problem). The original problem is split into two problems: the master problem and the subproblem. The master problem is composed only of a subset of variables selected from the original problem (core problem). The subproblem is a new problem created to identify a new variable (pricing problem). The objective function of the subproblem is the reduced cost of the new variable with respect to the current dual variables. The process behaves iteratively as follows: • The master problem is solved taking into consideration only the selected variables; we are then able to obtain dual prices for each of the constraints in the master problem. This information is then utilized in the objective function of the subproblem. • The subproblem is solved and, if its objective value is negative, a variable with negative reduced cost has been identified. This variable is then added to the master problem, and the master problem is solved again.
Solving the master problem with the new variable will generate a new set of dual values, and the process is repeated until no negative reduced cost variables are identified. On the other hand, if the subproblem returns a solution with non-negative reduced cost, we can conclude that the solution to the master problem is optimal. In many cases, this approach makes tractable large linear or integer programs that had previously been considered intractable. An example of a problem where this technique is effective is the cutting stock problem, where the number of all possible feasible cuts is intractable. Additionally, column generation has been applied to many problems such as crew scheduling, vehicle routing, etc. 3.6.4 Dynamic Programming Dynamic programming (DP) is a technique used for solving complex problems by dividing them into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure (e.g. the Shortest Path Problem). When applicable, the method takes far less time than naive methods. The basic idea behind dynamic programming is simple: in general, to solve a given problem we need to solve different parts of the problem (subproblems), and then we combine the solutions provided by the subproblems to compute the global one. Often, many of these subproblems are identical. The dynamic programming approach seeks to solve each subproblem only once, trying to reduce the computations required. Once a subproblem has been solved, its solution is stored. This kind of algorithm works iteratively: whenever, during a computation step, a previously computed solution is needed, it is simply retrieved from the ones stored. This approach is especially useful when the number of repeating subproblems grows exponentially as a function of the size of the input. The Dynamic Programming method is based on R.
Bellman’s principle of optimality [46]: “An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision”. More generally, all DP recursions (or recurrences) are based on the Bellman equation, describing the transitions from one state to another:

V(x) = max_{a ∈ Γ(x)} { F(x, a) + βV(T(x, a)) }                       (3.9)

This approach is also used for solving optimization problems or for computing valid bounds (e.g., for the VRP or the Cutting Stock Problem). The main disadvantage of this class of algorithms is the extensive use of memory: the state space created by the recursion is often too big to manage. For some classes of problems, like the 0-1 Knapsack, dynamic programming is a very fast and simple solution method, but only for instances of limited size.

3.6.5 Branch and Bound, Branch and Cut Techniques

A branch-and-bound (BB) algorithm [47] consists of a systematic enumeration of solutions, where large subsets of fruitless candidates are discarded, typically using bounds on the optimal solution. The search is usually performed as a tree search. The key idea of the BB algorithm is the following: in a minimization problem, if the lower bound of some node is greater than the upper bound of some other node, then the former node may be safely discarded from the search (pruning). More generally, any node whose lower bound is greater than the best upper bound (the incumbent) achieved during the search can be discarded. The branching phase generates new subproblems from a node; new bounds are then computed on these subproblems (bounding). The search stops when the set of candidate solutions is reduced to a single element, or when the upper bound matches the lower bound. Either way, the value found is the minimum (or the maximum) of the objective function.
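To make the pruning rule concrete, the sketch below (our own illustration, not code from this work) applies branch and bound to the 0-1 knapsack problem, using the fractional (LP) relaxation of the remaining capacity as the upper bound at each node.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Item { int w, v; };  // weight and value of an item

// Upper bound at a node: value collected so far plus the LP (fractional)
// relaxation of the remaining items, which are assumed sorted by density.
static double fractionalBound(const std::vector<Item>& it, size_t k,
                              int cap, int value) {
    double b = value;
    for (size_t i = k; i < it.size() && cap > 0; ++i) {
        int take = std::min(cap, it[i].w);
        b += (double)it[i].v * take / it[i].w;  // possibly a fractional item
        cap -= take;
    }
    return b;
}

static void branch(const std::vector<Item>& it, size_t k, int cap,
                   int value, int& best) {
    best = std::max(best, value);           // update the incumbent
    if (k == it.size()) return;
    if (fractionalBound(it, k, cap, value) <= best) return;  // prune the node
    if (it[k].w <= cap)                      // branch 1: take item k
        branch(it, k + 1, cap - it[k].w, value + it[k].v, best);
    branch(it, k + 1, cap, value, best);     // branch 2: skip item k
}

int knapsackBB(std::vector<Item> items, int capacity) {
    // Sorting by value density makes the fractional bound valid and tight.
    std::sort(items.begin(), items.end(), [](const Item& a, const Item& b) {
        return (long long)a.v * b.w > (long long)b.v * a.w;
    });
    int best = 0;
    branch(items, 0, capacity, 0, best);
    return best;
}
```

A node is abandoned as soon as its optimistic bound cannot beat the incumbent, which is exactly the discarding of "fruitless candidates" described above.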
Chapter 4

Rectangular Knapsack, 2D Unconstrained Guillotine Cuts

In this chapter we investigate the application of GPU computing to the two-dimensional guillotine cutting problem, using a dynamic programming approach. We show a possible implementation and we discuss a number of technical issues. Computational results on test instances available in the literature and on new, larger instances show the effectiveness of the dynamic programming approach based on GPU computing for this problem.

4.1 Introduction

The Two-Dimensional Guillotine Cutting Problem (2D-GCP) consists in cutting a rectangular surface, called stock rectangle or master surface, into a number of smaller rectangular pieces, each with a given size and value, using guillotine cuts. A guillotine cut on a rectangle is a cut from one edge of the rectangle to the opposite edge, parallel to the two remaining edges. A feasible cutting pattern must be obtained by applying a sequence of guillotine cuts to the original master surface or to the rectangles obtained by previous cuts. The number of pieces of each size and value to be cut can be unconstrained or constrained within a given minimum and maximum value. The objective is to maximize the total value of the pieces cut. The related problem of minimizing the amount of waste produced can be trivially converted into this maximization problem by taking the value of a piece to be proportional to its area. A special case of the 2D-GCP is staged guillotine cutting, where a cut direction is associated with each stage, i.e., the cuts alternate at each stage between being parallel to the x-axis and being parallel to the y-axis. In many practical applications the number of stages is restricted, and we say that the guillotine cutting is k-staged. Often, the number of stages is only two or three (i.e., two- or three-staged guillotine cutting). The 2D-GCP can be found in several industrial settings.
For example, glass plates are cut into smaller pieces to produce windows, wood sheets are cut to produce furniture, and so on. According to the classification of [48], the 2D-GCP corresponds to the Two-Dimensional Rectangular Single Large Object Packing Problem. The 2D-GCP is a well-known problem; it was considered for the first time by Gilmore and Gomory ([49, 50]), who discussed possible mathematical formulations and proposed dynamic programming algorithms to solve the two- and multi-stage versions of the problem. In [51], Herz proposed a recursive tree-search procedure where the search space is reduced by means of a preliminary discretization using the so-called canonical dissections. Moreover, Herz pointed out an error in an algorithm proposed by [50]. Christofides and Whitlock ([52]) considered the constrained version of the 2D-GCP and described a tree-search algorithm, where a valid upper bound is computed by a dynamic programming procedure applied to the unconstrained version of the 2D-GCP. Christofides and Whitlock also used the canonical dissections, but they used the term normal patterns for them. Beasley [53] proposed heuristic and exact algorithms based on dynamic programming to solve the unconstrained version of the 2D-GCP. He considered both the staged version and the general non-staged version of the problem and used the normal patterns to reduce the search space. Focusing our attention on the approaches for the 2D-GCP based on dynamic programming, Cintra et al. [54] recently proposed an improved implementation of the algorithm proposed by Beasley [53], and Russo et al. [55] proposed an improved version of the algorithm of Gilmore and Gomory ([50]). GPU computing used for optimization has a recent and still sparse literature. Due to its structure, it seems well suited to being combined with dynamic programming procedures. In fact Kats et al. [56] and Lund et al.
[57] proved the effectiveness of this new paradigm of massively parallel processors on the very well known All-Pairs Shortest Path Problem (APSP), solved with the Floyd-Warshall algorithm. Boyer et al. [58] proved that GPU computing is effective also in the resolution of the Knapsack Problem (KP), using the well-known dynamic programming algorithm proposed by Bellman [46]. This chapter presents one of the first contributions applying GPU computing to state-of-the-art research algorithms. We consider the unconstrained and non-staged version of the 2D-GCP and we focus on the implementation of a dynamic programming algorithm using a GPU computing approach, to gauge the effectiveness of this new paradigm on optimization problems. We have chosen the dynamic programming algorithm proposed by Cintra et al. [54] because it is quite clean and well suited for our purpose. The other best performing approach, namely the algorithm proposed by Russo et al. [55], is certainly interesting and effective, but it is quite complex and it might divert the attention of the reader to aspects that are beyond the scope of this chapter, which is the implementation of a dynamic programming algorithm using GPU computing. We have organized this chapter as follows. In Section 4.2 we describe the problem and we introduce the dynamic programming algorithm used. In Section 4.3 we present the GPU porting of the dynamic programming algorithm for the 2D-GCP. Computational results are reported and discussed in Section 4.4. Finally, in Section 4.5 we draw some conclusions and we give some ideas for future developments.

4.2 The 2D-GCP: notation, definitions and algorithms

A large rectangular master surface M = (W, H) of width W and height H must be cut into a number of smaller rectangular pieces chosen from n types of pieces available. Let P = {1, . . . , n} be the index set of piece types. Each piece of type i ∈ P has dimensions (wi, hi) and value vi.
The orientation of the pieces is fixed (i.e., no rotations are allowed). The objective is to construct a guillotine cutting pattern of M that maximizes the total value of the pieces cut, using the given piece types. The master surface M is located in the positive quadrant of the Cartesian coordinate system, with its origin (bottom left-hand corner) placed in position (0, 0) and with its bottom and left-hand edges parallel to the x-axis and the y-axis, respectively. The position of a piece within M is defined by the coordinates of its bottom left-hand corner, referred to as the origin of the piece.

4.2.1 Principle of Normal Patterns

The origin of a piece of type i ∈ P can be located at every integer point (x, y) of the master surface such that 0 ≤ x ≤ W − wi and 0 ≤ y ≤ H − hi. However, this set of points (x, y) can be reduced by applying the discretization principle proposed by Herz [51], who - as mentioned - speaks of canonical dissections, and used by Christofides [52], who introduced for the first time the term normal patterns. The principle of normal patterns has thereafter been used by Beasley [53] and many other authors. The principle of normal patterns is based on the observation that, in a given feasible cutting pattern, the position where any piece is cut can be moved to the left and/or downward as much as possible, until its left-hand edge and its bottom edge are both adjacent to other cut pieces or to the edges of the master surface. Let X and Y denote the subsets of all x-positions and y-positions, respectively, where a piece can be positioned applying the principle of normal patterns. These sets can be computed as follows:

X = { x = Σ_{k ∈ P} wk ξk : 0 ≤ x ≤ W, ξk ≥ 0 integer, k ∈ P }        (4.1)

and

Y = { y = Σ_{k ∈ P} hk ξk : 0 ≤ y ≤ H, ξk ≥ 0 integer, k ∈ P }        (4.2)

The x-positions contained in X are sorted so that, given xi, xj ∈ X, we have xi < xj for each 1 ≤ i < j ≤ |X|. The y-positions contained in Y are sorted in the same way.
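As a concrete illustration (our own sketch, not code from this work), the set X of (4.1) can be computed by a simple reachability recursion over the piece widths; Y is obtained analogously from the heights and H.

```cpp
#include <cassert>
#include <vector>

// Sketch of the recursion behind (4.1): an x-position belongs to X iff it is
// a non-negative integer combination of the piece widths not exceeding W.
std::vector<int> normalPatterns(int W, const std::vector<int>& widths) {
    std::vector<char> reachable(W + 1, 0);
    reachable[0] = 1;                  // all multipliers xi_k = 0 gives x = 0
    for (int x = 0; x <= W; ++x) {
        if (!reachable[x]) continue;
        for (int w : widths)           // extend every reachable position
            if (x + w <= W) reachable[x + w] = 1;
    }
    std::vector<int> X;                // returned already sorted, as required
    for (int x = 0; x <= W; ++x)
        if (reachable[x]) X.push_back(x);
    return X;
}
```

For example, with W = 10 and widths {3, 5} this yields X = {0, 3, 5, 6, 8, 9, 10}: every other x-position can be discarded from the recurrence.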
A simple dynamic programming recursion for computing X and Y is described both by Christofides [52] and by Cintra [54].

4.2.2 A Dynamic Programming algorithm for the 2D-GCP

The 2D-GCP considered in this chapter can be solved by the following dynamic programming algorithm, originally proposed by Cintra et al. [54] and based on the recurrence formula proposed by Beasley [53]. Let V(w, h) be the value of an optimal guillotine pattern of a rectangle of size (w, h), evaluated by means of the following recurrence formula:

V(w, h) = max { v(w, h),
                max{ V(w′, h) + V(p(w − w′), h) : w′ ∈ X and 0 < w′ ≤ w/2 },
                max{ V(w, h′) + V(w, q(h − h′)) : h′ ∈ Y and 0 < h′ ≤ h/2 } }     (4.3)

where p(w) = max{x ∈ X : x ≤ w}, q(h) = max{y ∈ Y : y ≤ h}, and where v(w, h) denotes the value of the most valuable piece that can be cut in a rectangle of size (w, h) (v(w, h) = 0 if no piece can be cut in such a rectangle). Since the only x- and y-positions considered are the ones contained in the subsets X and Y, for the sake of ease we use the notations V(xi, yj) and V(i, j) interchangeably. The optimal solution value is V(p(W), q(H)) = V(|X|, |Y|). Let cut(i, j) be the position of the optimal cut within a rectangle of size (xi, yj), xi ∈ X and yj ∈ Y. It is equal to 0 if no guillotine cut is applied, it is > 0 if a horizontal cut is applied in position cut(i, j), and it is < 0 if a vertical cut is applied in position −cut(i, j). A pseudocode for the dynamic programming algorithm for the 2D-GCP proposed by Cintra et al. [54] is as follows:

Algorithm DP-2D-GCP(W, H, w, h, v)
1.  // Compute the sets X and Y using the normal pattern principle
2.  // Initialization
3.  for i = 1 to |X| do
4.    for j = 1 to |Y| do
5.      V(i, j) = max{ {vk : k ∈ P, wk ≤ xi and hk ≤ yj} ∪ {0} }
6.      cut(i, j) = 0
7.  // Recurrence
8.  for i = 2 to |X| do
9.    for j = 2 to |Y| do
10.     for i′ = 1 to max{k : xk ∈ X, xk ≤ xi/2} do
11.       i′′ = max{k : xk ∈ X, xk ≤ xi − xi′}
12.
      if V(i, j) ≤ V(i′, j) + V(i′′, j)
13.       then V(i, j) = V(i′, j) + V(i′′, j)
14.            cut(i, j) = −i′
15.     for j′ = 1 to max{k : yk ∈ Y, yk ≤ yj/2} do
16.       j′′ = max{k : yk ∈ Y, yk ≤ yj − yj′}
17.       if V(i, j) ≤ V(i, j′) + V(i, j′′)
18.         then V(i, j) = V(i, j′) + V(i, j′′)
19.              cut(i, j) = j′

The algorithm starts by computing the sets X and Y using the normal pattern principle. Then, it initializes V(i, j), for every xi ∈ X and yj ∈ Y, by setting V(i, j) = v(xi, yj). The recurrence tries to improve each value V(i, j) by applying to the rectangle of size (xi, yj) a horizontal or a vertical guillotine cut. If the sum of the values associated with the two resulting rectangles improves over V(i, j), its value is updated and the applied cut is saved in cut(i, j).

(a) Not using normal patterns (b) Using normal patterns
Figure 4.1: Computation of function V(x, y).

The usage of the subsets X and Y instead of the full sets of available positions can heavily reduce the computational complexity. In Figure 4.1 we show the difference between the computation of function V(x, y) without and with the normal pattern principle. Normal patterns are shown by “+” icons adjacent to the corresponding rows and columns. The figure depicts how the computation of the V(w, h) values is based on all gray cells, each representing a partial solution. It is evident how the number of partial solutions needed for the computation is much lower using normal patterns than otherwise.

4.3 The GPU porting of the dynamic programming algorithm

The dynamic programming algorithm for the 2D-GCP proposed by Cintra et al. [54] and described in Section 4.2.2 is natively suitable for an implementation using GPU computing, due to its matrix-like structure.
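Before turning to the GPU porting, it may help to see recurrence (4.3) in executable form. The following serial sketch is our own illustration, not the thesis implementation: to keep it short it enumerates all integer positions instead of the normal-pattern subsets X and Y, so p(·) and q(·) reduce to the identity; this is exactly the inefficiency that the discretization removes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Piece { int w, h, v; };  // width, height, value of a piece type

// Value of an optimal unconstrained guillotine pattern for a (W, H) surface,
// computed by recurrence (4.3) over all integer positions.
int guillotineDP(int W, int H, const std::vector<Piece>& pieces) {
    std::vector<std::vector<int>> V(W + 1, std::vector<int>(H + 1, 0));
    for (int w = 1; w <= W; ++w)
        for (int h = 1; h <= H; ++h) {
            int best = 0;
            for (const Piece& p : pieces)        // v(w, h): best single piece
                if (p.w <= w && p.h <= h) best = std::max(best, p.v);
            for (int w1 = 1; w1 <= w / 2; ++w1)  // vertical guillotine cuts
                best = std::max(best, V[w1][h] + V[w - w1][h]);
            for (int h1 = 1; h1 <= h / 2; ++h1)  // horizontal guillotine cuts
                best = std::max(best, V[w][h1] + V[w][h - h1]);
            V[w][h] = best;
        }
    return V[W][H];
}
```

On a 3×3 surface with a single 1×1 piece of value 1, the recursion cuts the surface into nine unit squares and returns 9.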
In this section we show how to exploit the inherent parallelism of this algorithm: we describe the parallelization process, the porting to a GPU environment, and a possible implementation using the CUDA model.

4.3.1 Exploiting the inherent parallelism

The objective of the algorithm is to compute the different V(i, j) values. In order to do this, we need all the intermediate solutions evaluated in the previous iterations of the dynamic recursion (see Figures 4.1a and 4.1b), as described in Section 4.2.2. This process can be effectively decomposed into independent tasks, as shown in the following. Each index d, corresponding to a stage of the dynamic recursion and such that 1 ≤ d ≤ |X| + |Y| − 1, implicitly identifies a corresponding anti-diagonal of matrix V. The anti-diagonal is composed of the set of cells Ad = {(k, d − k + 1) : 1 ≤ k ≤ |X|, 1 ≤ d − k + 1 ≤ |Y|}. At each stage d = 1, . . . , |X| + |Y| − 1, we can concurrently compute the values V(i, j) belonging to the corresponding anti-diagonal Ad (i.e., (i, j) ∈ Ad) (see Figures 4.2a and 4.2b), and, to evaluate each V(i, j), we can perform a parallel max operation over the solution values of the composing subproblems. Notice that other approaches, based on decomposing the problem by columns or rows, do not allow such an effective concurrent evaluation of the V(i, j) values.

(a) Without discretization (b) With discretization
Figure 4.2: Parallel tasks executed exploiting the decomposition based on anti-diagonals.

4.3.2 Recurrence Parallelization on GPU

The parallel implementation of the recursion has been designed to fit the CUDA programming model. As described in Section 4.3.1, we can exploit two different granularities of parallelism: one for evaluating the V(i, j) values belonging to the anti-diagonal Ad, and one for the max operation required for finding the cut that maximizes the V(i, j) value. The mapping onto CUDA then becomes straightforward:

1.
inter-grid parallelism among blocks, to concurrently evaluate all cells of Ad;
2. intra-block parallelism among threads, to compute the reduction (max operation) for each V(i, j).

At each recursion stage d, 1 ≤ d ≤ |X| + |Y| − 1, each element (i, j) of the anti-diagonal Ad is assigned to a different GPU block, in order to concurrently evaluate the corresponding V(i, j) value. Therefore, the maximum number of blocks required to compute an anti-diagonal is b = min{|X|, |Y|}. A block computing cell (i, j) runs threads to evaluate Beasley's recursive formula 4.3:

V_ij^X(i′) = V(i′, j) + V(i′′(i, i′), j),   xi′ ∈ X and 0 < xi′ ≤ xi/2
V_ij^Y(j′) = V(i, j′) + V(i, j′′(j, j′)),   yj′ ∈ Y and 0 < yj′ ≤ yj/2        (4.4)

where i′′(i, i′) = max{k : xk ∈ X, xk ≤ xi − xi′} and j′′(j, j′) = max{k : yk ∈ Y, yk ≤ yj − yj′}. Ideally, the block evaluating the values V_ij^X(i′) and V_ij^Y(j′) has a thread for each index i′ and j′. The obtained values can be stored in the same shared memory locations. This step is crucial to maximize the global memory bandwidth: for example, the thread evaluating V_ij^X(i′), instead of first loading the two values V(i′, j) and V(i′′(i, i′), j) into shared memory and then performing the addition, wasting a large number of memory cycles, performs the addition within one instruction requiring two global memory accesses, doubling the memory bandwidth for each kernel call. As mentioned before, the CUDA programming model allows spawning a maximum of β threads per block (the value of β depends on the hardware configuration, e.g., β = 1024 for Fermi and Kepler and β = 512 for older architectures); therefore each thread can be forced to evaluate more than one value V_ij^X(i′) or V_ij^Y(j′) at each stage. In particular, a thread of index t ≤ ⌊β/2⌋ evaluates every value V_ij^X(i′) where i′ % ⌊β/2⌋ = t (i.e., the remainder of the division of i′ by ⌊β/2⌋).
Whereas a thread of index t > ⌊β/2⌋ evaluates every value V_ij^Y(j′) where ⌊β/2⌋ + j′ % ⌊β/2⌋ = t. For the sake of ease, we define Tx = {1, . . . , ntx = ⌊β/2⌋}, Ty = {⌊β/2⌋ + 1, . . . , nty = β}, and the following sets of positions assigned to each thread:

P_ij^X(t) = {i′ : xi′ ∈ X, 0 < xi′ ≤ xi/2, i′ % ⌊β/2⌋ = t},               t ∈ Tx
P_ij^Y(t) = {j′ : yj′ ∈ Y, 0 < yj′ ≤ yj/2, ⌊β/2⌋ + j′ % ⌊β/2⌋ = t},       t ∈ Ty        (4.5)

While each thread evaluates the values V_ij^X(i′), i′ ∈ P_ij^X(t), or V_ij^Y(j′), j′ ∈ P_ij^Y(t), it also performs a max operation among them. The resulting maximum value is used in the parallel reduction for the max operation among threads, described in the next section, which uses the values saved in the shared memory.

Figure 4.3: Reduction Tree.
Figure 4.4: Naive reduction.

4.3.3 Parallel Reduction for the Max Operation

It is possible to exploit an intra-block parallelism among threads to compute the reduction required by the max operation for each V(i, j). The reduction combines all the elements in a collection into a single one, using an associative two-input, one-output operator, which in our case is the max operator. In general, a reduction is a low-intensity arithmetic operation, but it has some critical aspects to analyze in order to obtain an efficient parallel algorithm; it is in fact crucial to be able to exploit all the computational resources available on the GPU device. The most effective solution we found to parallelize this max operation is a tree-based approach. We can represent the max operation as a binary tree, where the operations at each level can be performed in parallel (Figure 4.3). The complexity of this algorithm becomes O(N/P + log N), where N is the number of elements to reduce and P the number of processors. In our case N = P and the complexity is O(log N).

Figure 4.5: Naive reduction gives rise to bank conflicts (each column of the shared memory represents a memory bank).
Figure 4.6: Strided access allows a fast reduction.
Figure 4.7: Strided access avoids bank conflicts.

The shared memory is subdivided into banks, small arrays of 32 locations of 32-bit words each, and its latency is significantly lower (≈ 4 memory cycles) than that of the global memory. The shared memory is the only means to enable communication among threads of the same block and, given its bank-like structure, bank conflicts are the most important aspect to avoid for enhancing kernel performance. A bank conflict occurs when multiple threads of the same warp access the same memory bank, thus forcing a serialization of the accesses. Avoiding this is an implementation constraint which imposes a specific management of thread communication. A naive algorithm, as shown in Figures 4.4 and 4.5, implemented with interleaved addressing, creates a large number of conflicts, serializing most of the operations in the shared memory. On the contrary, a strided access (see Figures 4.6 and 4.7) resolves this problem, leading to conflict-free access to the shared memory, to a maximization of the device memory bandwidth and, finally, to a good parallel execution of the threads [59]. In our kernel, once the loading stage described in Section 4.3.2 is finished, we can perform this reduction step, retrieving the maximum value for V(i, j).

4.3.4 Matrix Update

The normal pattern principle allows reducing the set of values V(i, j) to be evaluated at each stage; this results in an improvement of the performance of the corresponding serial algorithm, presented in Section 4.2.2. Unfortunately, the same approach is not suitable for the GPU, because the resulting algorithm would induce sparse, non-linear, and serialized accesses to the global memory. New NVIDIA architectures, such as Fermi or Kepler, partially mitigate the problem of coalesced access to global memory, that is, the requirement of loading data which resides in adjacent positions of the global memory.
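As an aside, the strided tree reduction of Section 4.3.3 can be simulated serially to check its logic. In the sketch below (our own illustration, not the thesis kernel) each iteration of the outer loop stands for one synchronized level of the reduction tree, with "thread" t combining positions t and t + s; the consecutive indices touched at each level are what avoids bank conflicts on a real GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Serial simulation of a strided tree-based max reduction over a shared
// memory array whose size is assumed to be a power of two.
int stridedMaxReduce(std::vector<int> shared) {
    size_t n = shared.size();
    for (size_t s = n / 2; s >= 1; s /= 2)   // log2(n) synchronized steps
        for (size_t t = 0; t < s; ++t)       // "threads" 0..s-1 act in parallel
            shared[t] = std::max(shared[t], shared[t + s]);
    return shared[0];                        // the maximum ends up in front
}
```

With the stride halving at each step, active threads always read contiguous locations, which is the access pattern of Figures 4.6 and 4.7.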
In any case, a more efficient memory access can be obtained by transforming the double (i, j) indexing of the V values into a single-index access, which maximizes the bandwidth on the PCI-Express bus, structuring the data in a way more suitable for coalesced access. A plain and sequential access to the partial solution values required for evaluating each V(i, j) can be obtained by filling the discretization induced by the normal patterns (see Figure 4.1a), as described below. Moreover, in order to maximize coalescence, we store the V matrix twice, in two mono-dimensional arrays: one in row-major order, Vrows, and one in column-major order, Vcols. When the algorithm evaluates the values V_ij^X(i′) it uses the matrix Vrows, whereas when it evaluates the values V_ij^Y(j′) it uses the matrix Vcols (see Section 4.3.2). Using these structures, we can take advantage of the discretization by evaluating V(i, j) only for the normal pattern positions, and we can use a complete matrix representation, by means of the arrays Vrows and Vcols defined for every integer position 0 ≤ x ≤ W and 0 ≤ y ≤ H, when we compute the maxima. Given two adjacent normal pattern positions xi, xi+1 ∈ X and yj, yj+1 ∈ Y, we have that V(xi, yj) = V(x, y) for every xi ≤ x < xi+1 and yj ≤ y < yj+1. Therefore, when the value V(xi, yj) is evaluated, it can be directly copied into every integer position xi ≤ x < xi+1 and yj ≤ y < yj+1 (see Figures 4.8a and 4.8b); this permits filling the gaps in the matrix storage and obtaining a fully linear access when computing the corresponding maximum. As mentioned, the value V(xi, yj) is then copied into both linearized matrices Vrows and Vcols.

(a) Matrix structure (b) Update for the value V(i, j)
Figure 4.8: Matrix update in the proposed GPU computing approach.

4.3.5 GPU Algorithm

In this section we present a pseudo-code where the structure of the GPU algorithm is fully detailed.
The core of the algorithm is the dynamic stage recursion, which is implemented in a main loop working, at each iteration, on an anti-diagonal d. Every element V(i, j), (i, j) ∈ Ad, is assigned to a different block, which concurrently evaluates it by expression 4.3. Each block splits the computation among threads: half of them consider the values V_ij^X(i′) and the remaining ones consider the values V_ij^Y(j′) (see Section 4.3.2). At the end, each block performs the reduction corresponding to the max operation of expression 4.3 and updates the corresponding entries of both linearized matrices Vrows and Vcols. The cut positions are saved in the linearized matrix Vcut.

Algorithm GPU-DP-2D-GCP(H, W, w, h, v)
1.  // Compute the sets X and Y using the normal pattern principle
2.  // Initialization
3.  for i = 1 to |X| do
4.    for j = 1 to |Y| do
5.      V(i, j) = max{ {vk : k ∈ P, wk ≤ xi and hk ≤ yj} ∪ {0} }
6.      // Initialize the linearized matrices Vcols and Vrows
7.      for x = xi to xi+1 − 1 do
8.        for y = yj to yj+1 − 1 do
9.          Vrows[yW + x] = V(i, j)
10.         Vcols[xH + y] = V(i, j)
11.         Vcut[xH + y] = 0
12. // Recurrence
13. for d = 1 to |X| + |Y| − 1 do
14.   for each (i, j) ∈ Ad do
15.     // CUDA kernel: each (i, j) is assigned to a different block
16.     // Threads evaluate VmaxX = max{V_ij^X(i′) : xi′ ∈ X, 0 < xi′ ≤ xi/2}
17.     for each thread t ∈ Tx do
18.       V′[t] = 0, C′[t] = nil // Shared memory initialization
19.       for each i′ ∈ P_ij^X(t) do
20.         if V′[t] < Vrows[yj W + xi′] + Vrows[yj W + (xi − xi′)]
21.           then V′[t] = Vrows[yj W + xi′] + Vrows[yj W + (xi − xi′)]
22.                C′[t] = −xi′
23.     // Reduction: at the end V′[1] = VmaxX
24.     for s = ⌊ntx/2⌋ to 1, s = ⌊s/2⌋ do
25.       if (t ≤ s) and (V′[t] < V′[t + s])
26.         then V′[t] = V′[t + s], C′[t] = C′[t + s]
27.     // Threads evaluate VmaxY = max{V_ij^Y(j′) : yj′ ∈ Y, 0 < yj′ ≤ yj/2}
28.     for each thread t ∈ Ty do
29.       V′[t] = 0, C′[t] = nil // Shared memory initialization
30.       for each j′ ∈ P_ij^Y(t) do
31.         if V′[t] < Vcols[xi H + yj′] + Vcols[xi H + (yj − yj′)]
32.           then V′[t] = Vcols[xi H + yj′] + Vcols[xi H + (yj − yj′)]
33.                C′[t] = yj′
34.     // Reduction: at the end V′[ntx + 1] = VmaxY
35.     for s = ⌊nty/2⌋ to 1, s = ⌊s/2⌋ do
36.       if (t ≤ ntx + s) and (V′[t] < V′[t + s])
37.         then V′[t] = V′[t + s], C′[t] = C′[t + s]
38.     if V′[1] < V′[ntx + 1]
39.       then MaxV = V′[ntx + 1], MaxC = C′[ntx + 1]
40.       else MaxV = V′[1], MaxC = C′[1]
41.     // Update Vcols and Vrows
42.     for x = xi to xi+1 − 1 do
43.       for y = yj to yj+1 − 1 do
44.         Vrows[yW + x] = MaxV
45.         Vcols[xH + y] = MaxV
46.         Vcut[xH + y] = MaxC

The initialization section at the beginning of the algorithm includes the setup of the Vrows, Vcols, and Vcut data structures. To simplify the presentation, this is shown within the same loops where we access the matrices linearized by rows, but in the actual implementation we initialize Vrows by rows and Vcols and Vcut by columns. In any given iteration d of the recurrence, the concurrent execution of the blocks is managed by the GPU scheduler, and at the end of the iteration a synchronization is performed (i.e., iteration d + 1 is performed only after every block has ended iteration d). Notice that the reduction performed by each block also requires a synchronization among its threads.

4.4 Computational results

This section reports the computational results and, in particular, the speed-up factors obtained by running the serial and the parallel versions of the algorithm. The code was implemented in C/C++ with CUDA extensions, using the NVCC compiler by Nvidia for the GPU version and Microsoft Visual Studio 2010 with full optimization for the serial version.
All tests were run on an Intel i7 920 Bloomfield quad-core @2.8 GHz with 6 Gigabytes of RAM and two different graphics cards: an Nvidia GeForce GTX 570 Fermi with 1 Gigabyte of GDDR5 RAM and 480 CUDA cores @1.464 GHz, and an Nvidia GeForce GTX 770 Kepler with 2 Gigabytes of GDDR5 RAM and 1536 CUDA cores @1.046 GHz. We tested the algorithm on two different hardware configurations to compare the scalability of the proposed method on two different generations of GPUs.

4.4.1 Test instances

We tested our algorithm on three different instance sets. The first set is the well-known gcut set, generated by Beasley [53] and upgraded by Cintra et al. [54]. In order to check the effectiveness and the correctness of the parallel algorithm, we generated the other two sets, named testcut and randcut, by means of two opposite methodologies, as follows. Every instance of the testcut set is composed of three types of items, whose w and h dimensions are as follows:
• 25% of items with h ∈ [1, H/4[ and w ∈ [1, W/4[.
• 50% of items with h ∈ [H/4, H/2[ and w ∈ [W/4, W/2[.
• 25% of items with h ∈ [H/2, 3H/4[ and w ∈ [W/2, 3W/4[.
H and W are the dimensions of the master bin. The instances of the randcut set are composed of items retrieved from randomly generated guillotine cuts on the original master bin. The peculiarity of this set is that the optimal objective function value z is equal to the bin area, W × H, giving us the possibility to check the correctness of our algorithm on big instances with known optimal solution values. These two new sets are available on the website www.sitoistanze.com, together with the images of all the instances.

4.4.2 Experiments

The objective of our experiments is the comparison of our implementations of the sequential and of the GPU versions of the algorithm, described in Sections 4.2.2 and 4.3.5, respectively. The GPU version was run with 256 threads per block.
In Tables 4.1, 4.2, and 4.3 we report the computational results obtained on the three considered sets of instances. For each instance we indicate its Name, the size of the master surface, H and W, the number of item types n, the numbers of normal pattern positions |Y| and |X|, the optimal solution value zOPT, and the percentage of waste Waste%. For the algorithms we report the computing times TCPU of the sequential version and TGPU of the GPU version, and the resulting speed-up, defined as the ratio between TCPU and TGPU, i.e., SpeedUp = TCPU/TGPU. In Table 4.1 we ignored the first eleven instances due to the insignificant computational time they require; in fact, the times required to solve these instances are below 0.0001 seconds. Analyzing the results, we can highlight the method's scalability as a function of the dimension of the instances. The GTX 570's decreasing performance beyond the 8000×8000 dimensions of the master bin is the effect of the device's limited amount of on-board memory: beyond those dimensions, we are forced to keep some data structures in the system's main memory, increasing the time required to access them and significantly slowing down performance. The tests on the GTX 770 graphics card avoid this problem, due to the greater amount of on-board memory available; as we can see, in this case the speed-up factor is almost constant. Figures 4.9 and 4.10 display the original randomly generated instance randcut6000c and its solution, respectively. As we can see, the computed solution is almost the same, except for the positions of the cuts on the bin. Figure 4.11 displays the solution of testcut8000, while Figure 4.12 shows the solution of gcut13.
Table 4.1: Computational results on gcut instances.

Name     H     W     n    |Y|   |X|   zOPT        Waste%   TCPU     TGPU(570)  SpeedUp(570)  TGPU(770)  SpeedUp(770)
gcut12   1000  1000  50   155   124   979,986     2.001    0.016    0.011      1.455 X       0.011      1.455 X
gcut13   3000  3000  32   1457  2310  8,997,780   0.025    12.199   0.681      17.913 X      0.574      6.959 X
gcut14   3500  3500  42   2390  2861  12,245,410  0.037    26.754   1.390      19.247 X      1.168      21.251 X
gcut15   3500  3500  52   2422  2933  12,246,032  0.032    27.627   1.440      19.185 X      1.206      22.907 X
gcut16   3500  3500  62   2559  2943  12,248,836  0.010    28.985   1.514      19.145 X      1.267      22.878 X
gcut17   3500  3500  82   2676  2953  12,248,892  0.009    30.015   1.580      18.997 X      1.331      22.555 X

Table 4.2: Computational results on testcut instances.

Name          H      W      n    |Y|   |X|   zOPT      Waste%   TCPU     TGPU(570)  SpeedUp(570)  TGPU(770)  SpeedUp(770)
testcut6000   6000   6000   80   4881  5384  35985098  0.041    163.145  6.797      24.003 X      6.314      25.839 X
testcut6500   6500   6500   80   5610  6350  42241403  0.020    230.397  9.442      24.401 X      8.724      26.410 X
testcut7000   7000   7000   80   6242  6757  48997730  0.004    295.356  11.669     25.311 X      10.900     27.097 X
testcut7500   7500   7500   80   4988  6441  56201826  0.085    253.937  10.066     25.227 X      9.795      25.925 X
testcut8000   8000   8000   100  7365  7426  63993589  0.010    437.347  17.004     25.720 X      16.733     26.137 X
testcut8500   8500   8500   80   8413  8040  72249152  0.001    518.404  35.351     14.664 X      16.826     30.744 X
testcut9000   9000   9000   80   7495  8546  80980280  0.024    569.479  37.899     15.026 X      20.868     27.290 X
testcut9500   9500   9500   80   8784  8124  90231106  0.020    673.266  45.342     14.849 X      24.807     27.140 X
testcut10000  10000  10000  80   9365  9426  99998425  0.001    866.644  57.579     15.051 X      32.283     26.845 X
Table 4.3: Computational results on randcut instances (same columns as Table 4.1).

Name           H      W      n   |Y|   |X|   z_OPT      Waste%  T_CPU    T_GPU(570)  SpeedUp(570)  T_GPU(770)  SpeedUp(770)
randcut6000a   6000   6000   34  4971  5747  36000000   0.000   180.898  7.275       24.866 X      6.770       26.722 X
randcut6000b   6000   6000   32  5456  5566  36000000   0.000   187.933  7.700       24.407 X      7.103       26.458 X
randcut6000c   6000   6000   36  5630  4844  36000000   0.000   169.182  7.039       24.035 X      6.452       26.223 X
randcut6500a   6500   6500   28  4967  4645  42250000   0.000   161.617  6.572       24.592 X      6.081       26.579 X
randcut6500b   6500   6500   52  6255  6311  42250000   0.000   255.481  10.249      24.927 X      9.385       27.223 X
randcut6500c   6500   6500   40  6187  5783  42250000   0.000   234.749  9.436       24.878 X      8.640       27.169 X
randcut7000a   7000   7000   44  6631  6669  49000000   0.000   306.135  12.083      25.336 X      11.190      27.358 X
randcut7000b   7000   7000   38  5956  6242  49000000   0.000   261.815  10.514      24.902 X      9.814       26.679 X
randcut7000c   7000   7000   32  6793  6494  49000000   0.000   278.242  12.096      23.003 X      11.225      24.788 X
randcut7500a   7500   7500   44  6993  6981  56250000   0.000   364.229  14.384      25.322 X      13.425      27.132 X
randcut7500b   7500   7500   32  6564  6565  56250000   0.000   334.090  12.850      25.999 X      12.258      27.256 X
randcut7500c   7500   7500   32  6651  6507  56250000   0.000   331.501  12.843      25.812 X      12.247      27.069 X
randcut8000a   8000   8000   28  7876  5639  64000000   0.000   363.668  14.174      25.657 X      14.183      25.642 X
randcut8000b   8000   8000   36  7967  6499  64000000   0.000   378.706  16.019      23.641 X      15.863      23.874 X
randcut8000c   8000   8000   34  7465  6935  64000000   0.000   417.253  16.281      25.628 X      16.027      26.035 X
randcut8500a   8500   8500   46  8023  8100  72250000   0.000   527.499  23.914      22.058 X      19.392      27.202 X
randcut8500b   8500   8500   42  8098  7748  72250000   0.000   523.007  22.965      22.774 X      18.768      27.867 X
randcut8500c   8500   8500   36  8208  6967  72250000   0.000   483.054  21.413      22.559 X      17.411      27.744 X
randcut9000a   9000   9000   44  8728  8547  81000000   0.000   613.783  41.284      14.867 X      23.287      26.357 X
randcut9000b   9000   9000   42  8712  9000  81000000   0.000   657.307  43.418      15.139 X      24.206      27.155 X
randcut9000c   9000   9000   36  8558  7075  81000000   0.000   540.588  32.943      16.410 X      19.733      27.395 X
randcut9500a   9500   9500   40  9256  6813  90250000   0.000   623.986  37.462      16.657 X      22.143      28.180 X
randcut9500b   9500   9500   36  8661  9349  90250000   0.000   743.840  49.040      15.168 X      27.395      27.152 X
randcut9500c   9500   9500   36  7931  8533  90250000   0.000   656.044  42.975      15.266 X      24.012      27.322 X
randcut10000a  10000  10000  36  8587  8382  100000000  0.000   723.389  48.010      15.067 X      27.788      26.032 X
randcut10000b  10000  10000  42  9298  9586  100000000  0.000   882.447  60.378      14.615 X      32.539      27.120 X
randcut10000c  10000  10000  38  8601  9618  100000000  0.000   810.420  53.605      15.118 X      31.043      24.906 X

Figure 4.9: randcut6000c source.
Figure 4.10: randcut6000c solution.

4.5 Considerations and Future Work
In this chapter we presented a parallel algorithm for solving the Unconstrained Two-Dimensional Guillotine Cutting Problem (2D-GCP), specifically designed to run on a GPGPU platform. We proved the effectiveness of this method, achieving in the best case a 30X speed-up factor over the serial version, by exploiting the native matrix-like structure of the problem and the fine-grained computation required by the dynamic programming algorithm. We also provided two new sets of test instances for this problem.

Figure 4.11: testcut8000 solution.
Figure 4.12: gcut13 solution.

Our future aims are to extend the algorithm to solve the staged version of the problem and, eventually, to use GPU computing to enhance other algorithms designed for similar packing problems (e.g., Bin Packing, the Constrained Two-Dimensional Guillotine Cutting Problem, etc.).

Chapter 5
Vehicle Routing Problem
In this chapter we investigate the application of GPU computing to some of the most effective pricing strategies based on Dynamic Programming (the q-route, through-q-route and ng-route relaxations) for Column Generation methods for the Vehicle Routing Problem.
We propose parallel versions of these algorithms for a massively parallel environment, discussing the implementation choices and evaluating the speed-up factors with respect to the serial versions on test instances from the literature.

5.1 Introduction
The Vehicle Routing Problem (VRP) is among the most studied problems in combinatorial optimization, and retains unabated interest both because, though simple to state, it enjoys intriguing mathematical properties, and because it specializes readily into problems of primary economic interest. The literature on the core problem variants and on the possible real-world variations has grown huge since the seminal paper that introduced it [60], and includes dedicated books [61], [62] and, more recently, dedicated working groups of research associations [63]. The core problem can be quickly introduced as finding a least-cost set of routes to service a number of customers from a central depot, given a cost matrix specifying the cost of going from any customer to any other one and from the depot to each customer. The problem can then be complicated at will, by adding constraints suggested by real-world applications. A commonly added constraint assumes that the routes are to be traveled by trucks in order to deliver goods to, or collect goods from, the customers, so that the total amount of goods loaded on each truck cannot exceed its capacity, in weight or in volume. This gives rise to the Capacitated VRP variant (CVRP). Alternatively, each customer can ask either for a delivery or for a collection of goods, yielding the Pickup and Delivery variant (PDVRP). In case all deliveries of each route are to be made first, followed by all collections, we have the CVRP with backhauls (VRPB). A further quite common constraint considers feasible time windows for the visits at the customers (TWVRP).
Moreover, in small-area settings each truck could go back to the depot to reload (multi-trip VRP, MTVRP), while in bigger areas it is common for the vehicles of the fleet to use several depots (multi-depot VRP, MDVRP). The vehicles of the fleet can be all equal or differ among themselves (heterogeneous fleet CVRP, HVRP), they may not be required to return to the depot they started from (open CVRP, OVRP), or they may be required to repeat the same routes with a given periodicity over the planning horizon (periodic VRP, PVRP), etc. Furthermore, all the listed constraints, and many more coming from operational practice, can be freely combined to model actual use cases. For example, a recent work on city logistics operational optimization [64] models its problem as a CVRP with time windows, multiple trips, heterogeneous fleet, and pickup and delivery. Given its theoretical and practical relevance, the VRP has witnessed a wealth of diverse solution approaches and still fosters a lively research community studying exact methods, heuristic methods or, more recently, both. A detailed survey is clearly out of scope for this chapter; in the following we recall just a few contributions. Heuristic approaches trace back to the seminal work in [65] and went through tailored heuristics, such as [66], then metaheuristics, such as tabu search [67], simulated annealing [68], ant colony optimization [69], genetic and in general evolutionary algorithms [70], variable neighborhood search [71] and PSO [72], just to name a few. Exact approaches are of more direct interest for this chapter. Again, different approaches have been used, ranging from dynamic programming [73] to branch and bound [74], from branch and cut [75] to column generation [76]. In all cases, a central feature is the ability to compute tight lower bounds. Among the different approaches for computing bounds, recently, bounds based on non-elementary paths,
such as q-routes [74] and most notably ng-routes [77, 78], appear to be particularly effective for the Capacitated VRP and the VRP with Time Windows. This chapter reports on the results obtained by implementing the q-route, through-q-route and ng-route relaxation computations on a GPU parallel architecture. GPUs are enjoying increasing interest in the optimization community, given the possibility of significantly speeding up tasks at the core of any approach of interest and thus of ultimately achieving substantial efficiency improvements [79]. Applications to combinatorial optimization problems have so far been reported for the knapsack problem [58] and for the Two-Dimensional Guillotine Cutting Problem [80]. This is the first work porting state-of-the-art vehicle routing optimization components to the GPU, specifically proposing a GPU implementation of the ng-route relaxation. The implementation on GPU of an optimization algorithm is a complex task that involves the study of tailored data structures and corresponding routines. This chapter reports in detail the choices we made to achieve the most efficient parallel implementation of the q-route, through-q-route and ng-route routines, and substantiates this with computational results on standard problem benchmarks from the literature.

5.2 Problem Description and Mathematical Formulations
The CVRP, in its basic version, consists in finding the least-cost set of routes to be travelled by m homogeneous vehicles of identical capacity Q in order to service each of n customers, whose index set is V_1. All routes start from and return to a common depot, conventionally indexed by 0. Let V = V_1 ∪ {0}. The input data consist of the requests q_i, i = 1, ..., n, and of the travelling costs c_ij, i, j = 0, ..., n, between each pair of customers and between each customer and the depot.
The problem can thus be defined on a complete weighted graph G = (V, A, C), where A = {(i, j) : i, j ∈ V} and C = [c_ij], i, j ∈ V, is the corresponding, possibly asymmetric, cost matrix. In real-world applications, G is typically an overlay graph superimposed on an actual road network: nodes in V correspond to geocoded facilities, while arcs in A correspond to least-cost paths, computed according to the metric to be minimized (distance, time, ...). The problem can be formulated in different ways; we refer the reader to [61] for a thorough overview. The following two subsections introduce the formulations and the notation used in the rest of the chapter.

5.2.1 Two Index Formulation
The two index formulation associates a decision variable x_ij ∈ {0, 1} with each arc (i, j) ∈ A, specifying whether or not in the optimal solution a vehicle travels directly from node i ∈ V to node j ∈ V. Different variants of this formulation are possible; the most compact ones require computing, for each subset of nodes S ⊆ V, the minimum number of vehicles needed to service S, denoted by r(S). The formulation is as follows.

z_CVRP = min ∑_{i∈V} ∑_{j∈V} c_ij x_ij                        (5.1)
s.t.  ∑_{i∈V} x_ij = 1,   j ∈ V_1                             (5.2)
      ∑_{j∈V} x_ij = 1,   i ∈ V_1                             (5.3)
      ∑_{j∈V} x_0j = m                                        (5.4)
      ∑_{i∉S} ∑_{j∈S} x_ij ≥ r(S),   S ⊆ V_1, S ≠ ∅           (5.5)
      x_ij ∈ {0, 1},   i, j ∈ V                               (5.6)

Constraints (5.2) and (5.3) impose that exactly one vehicle arrives at and leaves each customer, constraint (5.4) specifies the number of available vehicles (not all of which need to be used), and constraints (5.5), the so-called capacity-cut constraints, impose both the connectivity of the solution and the vehicle capacity requirements (see [61]).
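As an illustration of constraints (5.2)-(5.5), the following Python sketch (not part of the thesis code; instance data and the `feasible` helper are made up) checks a candidate arc solution on a toy instance, taking r(S) as the common bin-packing lower bound ceil(∑_{i∈S} q_i / Q):

```python
import math
from itertools import combinations

# Hypothetical toy check of the two-index constraints (5.2)-(5.5): given arc
# indicators x[i][j] on V = {0, 1, ..., n}, verify degree, fleet-size and
# capacity-cut feasibility (exponential in n, illustration only).

def feasible(x, q, Q, m):
    n = len(x) - 1
    V1 = range(1, n + 1)
    # (5.2)/(5.3): exactly one vehicle enters and leaves each customer
    if any(sum(x[i][j] for i in range(n + 1)) != 1 for j in V1):
        return False
    if any(sum(x[i][j] for j in range(n + 1)) != 1 for i in V1):
        return False
    # (5.4): m vehicles leave the depot
    if sum(x[0][j] for j in V1) != m:
        return False
    # (5.5): capacity-cut constraint for every nonempty S ⊆ V1
    for r in range(1, n + 1):
        for S in combinations(V1, r):
            rS = math.ceil(sum(q[i] for i in S) / Q)
            if sum(x[i][j] for i in range(n + 1) if i not in S for j in S) < rS:
                return False
    return True

# Two routes 0-1-2-0 and 0-3-0 for n = 3 customers, demands 4, 3, 5, Q = 8.
x = [[0, 1, 0, 1],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
print(feasible(x, {1: 4, 2: 3, 3: 5}, Q=8, m=2))  # True
```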
5.2.2 Set Partitioning Formulation
The set partitioning formulation, originally proposed by [81], associates a decision variable x_ℓ with each feasible vehicle route, that is, with each route that can be travelled by a vehicle leaving the depot, servicing a subset of customers whose requests collectively do not exceed the vehicle capacity, and finally returning to the depot. Let R be the index set of all feasible routes, let c_ℓ be the cost of route ℓ ∈ R, and let a_iℓ be a binary coefficient equal to 1 iff node i ∈ V belongs to route ℓ ∈ R. The formulation is as follows.

z_SP = min ∑_{ℓ∈R} c_ℓ x_ℓ                      (5.7)
s.t.  ∑_{ℓ∈R} a_iℓ x_ℓ = 1,   i ∈ V_1           (5.8)
      ∑_{ℓ∈R} x_ℓ = m                           (5.9)
      x_ℓ ∈ {0, 1},   ℓ ∈ R                     (5.10)

Constraints (5.8) ensure that each customer is serviced by exactly one feasible route, and constraint (5.9) imposes the fleet cardinality. It is noteworthy that, if the cost matrix satisfies the triangle inequality, equalities (5.8) can be turned into greater-than-or-equal inequalities, thus turning the problem into an extended set covering problem, which is computationally easier to deal with.

5.3 Dynamic Programming Relaxations for the Pricing Problem
In Section 5.1 we briefly described some techniques to find a solution, heuristic or exact, to the VRP. For the exact algorithms in particular, it is necessary to solve the pricing problem in order to select the most interesting columns for the Column Generation (CG) algorithm. The pricing problem is itself an NP-Hard problem, the Elementary Shortest Path Problem with Resource Constraints (ESPPRC). It is formally defined as follows: let P be the set of paths of G such that each path P ∈ P starts from 0, visits a set of vertices V_P ⊆ V without loops, delivers q_P units of product, and ends at vertex σ_P ∈ V_P.
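To make the set-partitioning formulation (5.7)-(5.10) concrete, the following Python sketch (illustrative only; instance data and the `solve_sp` helper are made up) solves a toy CVRP by brute force over the enumerated feasible routes:

```python
from itertools import combinations, permutations

# Illustrative brute force over the set-partitioning formulation (5.7)-(5.10):
# enumerate every capacity-feasible route, then every m-subset of routes that
# partitions the customers, and keep the cheapest. Exponential, toy sizes only.

def route_cost(route, d):
    tour = (0,) + route + (0,)
    return sum(d[a][b] for a, b in zip(tour, tour[1:]))

def solve_sp(n, m, q, Q, d):
    customers = set(range(1, n + 1))
    routes = []  # (served set, cost) for each feasible route
    for r in range(1, n + 1):
        for subset in combinations(customers, r):
            if sum(q[i] for i in subset) > Q:
                continue
            cost = min(route_cost(p, d) for p in permutations(subset))
            routes.append((frozenset(subset), cost))
    best = None
    for chosen in combinations(routes, m):           # constraint (5.9)
        covered = [i for s, _ in chosen for i in s]
        if sorted(covered) == sorted(customers):     # constraints (5.8)
            total = sum(c for _, c in chosen)
            best = total if best is None else min(best, total)
    return best

# Symmetric toy instance: 3 customers, unit demands, 2 vehicles of capacity 2.
d = [[0, 2, 3, 3],
     [2, 0, 1, 4],
     [3, 1, 0, 2],
     [3, 4, 2, 0]]
print(solve_sp(n=3, m=2, q={1: 1, 2: 1, 3: 1}, Q=2, d=d))   # 12
```

The optimum pairs route 0-1-2-0 (cost 6) with route 0-3-0 (cost 6); a column generation method replaces this enumeration with the pricing relaxations described next.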
The ESPPRC can be solved with Dynamic Programming recursions expressed as follows: we define a state-space graph X = {(X, i) : X ⊆ V, i ∈ V′} and functions f(X, i), ∀(X, i) ∈ X, where f(X, i) is the cost of the least-cost path P that visits the set of customers X, ends at customer i ∈ X, and satisfies ∑_{j∈X} q_j ≤ Q. As we can see, the exact DP algorithm cannot be applied because of the dimension of the state-space graph X. Christofides et al. [82] proposed the State-Space Relaxation, a procedure whereby the state space associated with the DP recurrence is relaxed in order to compute valid lower bounds on the original problem. The three relaxations that we introduce in the next sections of this chapter are relaxations of the ESPPRC based on this principle. Effective and reliable relaxations based on Dynamic Programming recurrences were proposed in [74], [77], [83], [84] and [85]. These recurrences, as all dynamic programming algorithms, trade space for time, enumerating all the interesting solutions of the relaxed problem. In these cases the elementarity constraint is relaxed, and the aim is to find interesting almost-elementary paths that can be the basis for the creation of feasible solutions, or a good starting point for the computation of valid and tight lower bounds. In the next sections we describe three of these methods (q-route, through-q-route and ng-route) and analyze the intrinsic characteristics of each one.

5.3.1 q-Route Relaxation
The q-route relaxation, described in Christofides et al. [74], aims at finding routes without loops of two vertices. Define f(q, i) as the cost of the least-cost path P = (0, i_1, ..., i_k), i_k = i, (not necessarily simple) from the depot 0 to customer i with total load q = ∑_{h=1}^{k} q_{i_h}. Such a path is called a q-path. A q-path with the additional edge (0, i) is called a q-route and has cost f(q, i) + d_0i.
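Before introducing the loop-elimination machinery, the basic q-path idea can be sketched in a few lines of Python (illustrative only; data and the `q_paths` helper are made up, and two-vertex loops are not yet forbidden):

```python
# Minimal sketch of the basic q-path recursion: f(q, i) is the least cost of a
# (not necessarily simple) path from the depot 0 to i delivering exactly q
# units, built stage by stage over the load dimension.
INF = float("inf")

def q_paths(n, Q, q, d):
    # f[load][i]; f[q_i][i] = d(0, i), everything else built by the recursion
    f = [[INF] * (n + 1) for _ in range(Q + 1)]
    for load in range(1, Q + 1):
        for i in range(1, n + 1):
            if load == q[i]:
                f[load][i] = min(f[load][i], d[0][i])
            prev = load - q[i]
            if prev >= 1:
                best = min((f[prev][j] + d[j][i]
                            for j in range(1, n + 1) if j != i), default=INF)
                f[load][i] = min(f[load][i], best)
    return f

d = [[0, 2, 3], [2, 0, 1], [3, 1, 0]]   # depot 0 and customers 1, 2
f = q_paths(n=2, Q=4, q={1: 1, 2: 1}, d=d)
# cheapest q-route through customer 2 with load 2: path 0-1-2 plus edge (2, 0)
print(f[2][2] + d[2][0])   # (2 + 1) + 3 = 6
```

Note that for larger loads this recursion happily builds paths such as 0-1-2-1, which is exactly the two-vertex loop that the π and φ functions below are introduced to forbid.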
The requirement that the path contain no loops of two consecutive vertices can be enforced as follows: let π(q, i) be the vertex just prior to i on the path corresponding to f(q, i), and let φ(q, i) be the cost of the least-cost path ending at vertex i with the constraint that the vertex γ(q, i) preceding i is not equal to π(q, i). The recurrence can be formalized as follows: for a given value of q, let h(j, i) be the cost of the least-cost loop-free path from 0 to i with j just prior to i. Then:

h(j, i) = f(q − q_i, j) + d_ji,   if π(q − q_i, j) ≠ i
          φ(q − q_i, j) + d_ji,   otherwise                    (5.11)

Given the function h, the functions f and φ can be computed for the given q as follows:

f(q, i) = min_{j≠i} {h(j, i)},   π(q, i) = j*                  (5.12)

where j* is the index of the predecessor j attaining the above minimum;

φ(q, i) = min_{k≠π(q,i), k≠i} {h(k, i)},   γ(q, i) = k*        (5.13)

where k* is the value of k attaining the above minimum. The initialization of f, φ, π and γ is f(q, i) = φ(q, i) = ∞ for q ≠ q_i, and:

f(q, i) = d_0i,  π(q, i) = 0,  φ(q, i) = ∞,   for q = q_i      (5.14)

Informally, the method builds minimum non-elementary paths delivering a quantity q of goods through a dynamic programming recursion that extends a path ending at j only towards nodes that do not have j itself as direct predecessor, thus producing paths without loops of two nodes. The algorithm can be described as follows:

Algorithm Q-PATHS(N, Q, q_i, d)
1. // Data structures
2. f, φ, π, γ
3. // Initialization
4. for q = 0 to Q do
5.   for i = 1 to N do
6.     if q_i(i) == q
7.       then f(q, i) = d(0, i), π(q, i) = 0
8.            φ(q, i) = ∞, γ(q, i) = ∞
9.       else f(q, i) = ∞, π(q, i) = −1
10.           φ(q, i) = ∞, γ(q, i) = ∞
11. // q-paths
12. for q = 0 to Q do
13.   for i = 1 to N do
14.     for j = 1 to N do
15.       if π(q − q_i(i), j) ≠ i
16.         then h(j, i) = f(q − q_i(i), j) + d(j, i)
17.         else h(j, i) = φ(q − q_i(i), j) + d(j, i)
18.     // Minima calculation for each i
19.     f(q, i) = min_{j≠i} {h(j, i)}
20.     π(q, i) = j*
21.     φ(q, i) = min_{k≠π(q,i), k≠i} {h(k, i)}
22.     γ(q, i) = k*
23. return f, φ, π, γ

Using this method we can find shortest paths respecting the request of each node i and the vehicle capacity Q, but the routes, although avoiding loops of two vertices, are still not elementary (Figure 5.1b). The q-route relaxation can be computed in pseudo-polynomial time, with a complexity of O(n²Q). Figure 5.1a graphically illustrates the computation of a single f(q, i) value. In the case of the asymmetric VRP, where d_ij ≠ d_ji, we compute the relaxation once for the matrix d and once for its transpose dᵀ.

5.3.2 through-q-Route Relaxation
The through-q-route relaxation [74] is an enhancement of the q-route relaxation: using the functions f and φ we can find a better route by combining two paths retrieved by the q-path relaxation. Formally, let ψ(q, i) be the value of the least-cost loop-free route starting at the depot, passing through i, and finishing back at the depot with a total load of q. Such a route is a through-q-route.

Figure 5.1: f(q,i) q-path computation (a) and two-vertex loop avoidance (b).

The ψ(q, i) values are computed as follows:

ψ(q, i) = min_{q_i ≤ q̄ ≤ (q+q_i)/2} of
  f(q̄, i) + f(q + q_i − q̄, i),                                    if π(q̄, i) ≠ π(q + q_i − q̄, i)
  min{ f(q̄, i) + φ(q + q_i − q̄, i), φ(q̄, i) + f(q + q_i − q̄, i) },  otherwise      (5.15)

The algorithm can be described as follows:

Algorithm THROUGH-Q-ROUTES(N, Q, q_i, f, φ, π, γ)
1. // Data structures
2. ψ
3. // Initialization
4. for q = 0 to Q do
5.   for i = 1 to N do
6.     ψ(q, i) = ∞
7. // through-q-routes
8. min1, min2 // minima
9. for i = 1 to N do
10.   min1 = ∞, min2 = ∞
11.   for q = 0 to Q do
12.     for q_i(i) ≤ q̄ ≤ (q + q_i(i))/2 do
13.       back = (q + q_i(i))/2 − q̄
14.       if π(q̄, i) ≠ π(back, i) ∧ f(q̄, i) + f(back, i) ≤ min1
15.         then min1 = f(q̄, i) + f(back, i)
16.         else if f(q̄, i) + φ(back, i) ≤ φ(q̄, i) + f(back, i)
17.           then min2 = f(q̄, i) + φ(back, i)
18.                if min2 ≤ min1
19.                  then min1 = min2
20.           else min2 = φ(q̄, i) + f(back, i)
21.                if min2 ≤ min1
22.                  then min1 = min2
23.     // ψ update
24.     ψ(q, i) = min1
25. return ψ

More informally, this recurrence selects, among all the paths computed for a given q, the best combination of two paths that starts and ends at the depot. Once the q-route relaxation has been computed, the through-q-route function ψ(q, i) can be computed in pseudo-polynomial time with a complexity of O(n²Q²). In the asymmetric case, we use the f and φ functions together with the π and γ matrices computed by the asymmetric q-path procedure described above. We call f^fw and φ^fw the functions obtained from the q-path computation using the matrix d, along with the predecessor matrices π^fw and γ^fw; in the same fashion, we call f^bw, φ^bw, π^bw and γ^bw those computed with the matrix dᵀ. In this case, we can extend the algorithm as follows:

Algorithm ASY-THROUGH-Q-ROUTES(N, Q, q_i, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw)
1. // Data structures
2. ψ
3. // Initialization
4. for q = 0 to Q do
5.   for i = 1 to N do
6.     ψ(q, i) = ∞
7. // through-q-routes
8. min1, min2 // minima
9. for i = 1 to N do
10.   min1 = ∞, min2 = ∞
11.   for q = 0 to Q do
12.     for q_i(i) ≤ q̄ ≤ (q + q_i(i))/2 do
13.       back = (q + q_i(i))/2 − q̄
14.       if π^fw(q̄, i) ≠ π^bw(back, i) ∧ f^fw(q̄, i) + f^bw(back, i) ≤ min1
15.         then min1 = f^fw(q̄, i) + f^bw(back, i)
16.         else if f^fw(q̄, i) + φ^bw(back, i) ≤ φ^fw(q̄, i) + f^bw(back, i)
17.           then min2 = f^fw(q̄, i) + φ^bw(back, i)
18.                if min2 ≤ min1
19.                  then min1 = min2
20.           else min2 = φ^fw(q̄, i) + f^bw(back, i)
21.                if min2 ≤ min1
22.                  then min1 = min2
23.     // ψ update
24.     ψ(q, i) = min1
25. return ψ

5.3.3 ng-Route Relaxation
Righini and Salani [83–85] proposed a DP relaxation based on the construction of elementary paths in a decremental state space; Baldacci et al.
[77] proposed a more effective relaxation for the pricing problem, generalizing Righini's idea: the ng-route relaxation. The main problem afflicting the methods described above is that they allow cycles longer than two vertices, and the procedures to avoid larger loops are computationally expensive and can address loops of three or four vertices only. The ng-route relaxation partially solves this problem by introducing supplementary information into the DP algorithm, allowing the recurrence to 'remember' an arbitrary number of nodes during the state-expansion phase and thus avoiding the creation of loops involving a significant number of nodes. The algorithm has specific rules according to the kind of Vehicle Routing Problem to which it is applied (VRPTW, CVRP, ...); we consider only the one pertaining to the Capacitated Vehicle Routing Problem (CVRP). The ng-route relaxation can be described as follows. Let N_i ⊆ V be a set of selected customers for vertex i (chosen according to some criterion), such that i ∈ N_i and |N_i| ≤ Δ(N_i), where Δ(N_i) is the cardinality of the selected neighbors of i plus i itself. The sets N_i allow us to associate with each path P = (0, i_1, ..., i_k) the subset Π(P) ⊆ V(P) containing customer i_k and every customer i_r, r = 1, ..., k−1, of P that belongs to all the sets N_{i_{r+1}}, ..., N_{i_k} associated with the customers i_{r+1}, ..., i_k visited after i_r. The set Π(P) is defined as:

Π(P) = { i_r : i_r ∈ ⋂_{s=r+1}^{k} N_{i_s}, r = 1, ..., k−1 } ∪ {i_k}.      (5.16)

An ng-path (q, i, NG) is a not-necessarily-elementary path P = (0, i_1, ..., i_{k−1}, i_k = i) starting from the depot, visiting a subset of customers (possibly more than once) such that NG = Π(P), and ending at customer i such that i ∉ Π(P′), where P′ = (0, i_1, ..., i_{k−1}). We denote by f(q, i, NG) the cost of the least-cost ng-path (q, i, NG).
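The "memory" set Π(P) of equation (5.16) can be illustrated with a small Python sketch (the neighbourhoods and the `ng_memory` helper are made-up examples, not thesis code):

```python
# Illustrative computation of the set Π(P) of equation (5.16): the customers of
# the path that the ng-state still "remembers". Neighbourhoods N_i are given.

def ng_memory(path, N):
    """path = (i_1, ..., i_k), depot excluded; N[i] = neighbourhood of i."""
    k = len(path)
    mem = {path[-1]}                       # i_k always belongs to Π(P)
    for r in range(k - 1):                 # i_r, r = 1, ..., k-1
        if all(path[r] in N[path[s]] for s in range(r + 1, k)):
            mem.add(path[r])
    return mem

# Hypothetical neighbourhoods: each customer remembers itself and a few others.
N = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}, 4: {3, 4}}
print(ng_memory((1, 2, 3), N))   # Π = {2, 3}: customer 1 is forgotten at 3
```

Because customer 1 drops out of the memory once the path reaches customer 3 (1 ∉ N_3), the relaxation would allow the path to revisit customer 1 from there: this is exactly the controlled non-elementarity of ng-paths.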
We define an ng-route as an ng-path (q, i, NG) plus the edge from i back to the depot; its cost is f(q, 0, NG) = f(q, i, NG) + d_i0. The functions f(q, i, NG) can be computed on the graph H = (Φ, Ψ), where:

Φ = { (i, q, NG) : q_i ≤ q ≤ Q, ∀NG ⊆ N_i s.t. i ∈ NG ∧ ∑_{j∈NG} q_j ≤ q, ∀i ∈ V′ },      (5.17)

Ψ = { ((j, q′, NG′), (i, q, NG)) : ∀(j, q′, NG′) ∈ Ψ⁻¹(i, q, NG), ∀(i, q, NG) ∈ Φ },      (5.18)

where

Ψ⁻¹(i, q, NG) = { (j, q − q_i, NG′) : ∀NG′ ⊆ N_j s.t. j ∈ NG′ and NG′ ∩ N_i = NG \ {i}, ∀j ∈ Γ_i⁻¹ }.      (5.19)

The function f(i, q, NG) can be computed using the DP recursion:

f(i, q, NG) = min_{(j, q′, NG′) ∈ Ψ⁻¹(i, q, NG)} { f(j, q′, NG′) + d_ji },   ∀(i, q, NG) ∈ Φ.      (5.20)

It is necessary to notice that the parameter Δ is critical for this relaxation: the bigger the cardinality of the neighborhood sets, the better the bound obtained, but a larger cardinality inevitably brings a combinatorial explosion of the recursion's states. Martinelli, Pecin and Poggi [78] brilliantly solved this problem by mixing the Decremental State Space relaxation proposed by Righini with the computation of the ng-route relaxation, using the exact relaxation only when necessary and letting a heuristic procedure based on the q-route relaxation find the most promising routes to insert into the CG algorithm.

Algorithm NG-PATHS(N, Q, d, q_i, N_i)
1. // Data structures
2. f, π_n, π_ng
3. // Initialization
4. for q = 0 to Q do
5.   for i = 1 to N do
6.     for ng = 0 to |NG_list(q, i)| do
7.       if q_i(i) == q
8.         then NG = {i}, NG_list(q, i).Add(NG)
9.              f(q, i, NG) = d(0, i), π_n(q, i, NG) = 0, π_ng(q, i, NG) = 0
10.        else f(q, i, NG) = ∞, π_n(q, i, NG) = −1, π_ng(q, i, NG) = −1
11. // ng-paths
12. for q = 0 to Q do
13.   for i = 1 to N do
14.     for j = 1 to N do
15.       for ng = 0 to |NG_list(q − q_i(i), j)| do
16.         NG = NG_list(q − q_i(i), j, ng)
17.         if i ∉ NG
18.           then NG_new = (N_i ∩ NG) ∪ {i}
19.                if f(q − q_i(i), j, NG) + d_ji ≤ f(q, i, NG_new)
20.                  then f(q, i, NG_new) = f(q − q_i(i), j, NG) + d_ji
21.                       Add NG_new to NG_list(q, i)
22.                       π_n(q, i, NG_new) = j
23.                       π_ng(q, i, NG_new) = NG_list(q − q_i(i), j, ng)
24. return f, π_n, π_ng

In the algorithm we have introduced π_n and π_ng, which keep track of the predecessor node j ∈ Γ_i⁻¹ and of the predecessor path Π(P), P = (0, i_1, ..., i_{k−1}), respectively. We also introduce dominance among the labels of the recursion: we can obtain the same label (i, q, NG) by expanding from two different predecessor states (j, q − q_j, NG_j) and (k, q − q_k, NG_k); obviously, the dominant label is the one with the lower function value f. The asymmetric extension of this relaxation is trivial: we compute f^fw(i, q, NG) using the matrix d and f^bw(i, q, NG) using its transpose dᵀ.

Figure 5.2: ng-Path example.

5.4 Parallel Relaxations on GPU
In this section we describe the parallel algorithms designed to run these relaxations on a GPU. Dynamic Programming algorithms ported to this kind of device have given good results in many applications: Boyer et al. [58] proved that on a consumer GPU it is possible to obtain a 20X speed-up factor in solving the Knapsack Problem. Harish et al. [86] and Buluç et al. [87] showed that an extensive spectrum of graph problems (depth-first search, etc.) benefits from the use of a GPU. Ortega-Arranz et al. [88] and Kumar et al. [89] proved that algorithms such as Bellman-Ford and Dijkstra for the Single Source Shortest Path Problem (SSSPP) can also be enhanced by a many-core processor. GPGPU has given remarkable results also for the All Pairs Shortest Path Problem (APSPP), solved by means of the Floyd-Warshall algorithm (Katz et al. [56], Lund et al. [57]). Maniezzo et al.
[80] proved the effectiveness of many-core platforms also for supply-chain problems. All the cited methods are based on the Dynamic Programming paradigm; indeed, the fine computation granularity and the matrix-like data structures that often characterize these algorithms fit particularly well the many-core architecture, where every thread executes a relatively simple computational kernel. We decided to use the CUDA parallel programming model because it currently offers the best trade-off among portability, usability and reliability; Nvidia also provides a good environment for debugging and profiling applications (the Nsight debugger for Microsoft Visual Studio, etc.) with effective functionalities.

5.4.1 GPU q-Route
The first relaxation that we consider is the q-route relaxation. In the next subsections we expose the parallelism inside the method and we describe the proposed GPU algorithm.

5.4.1.1 Exposing Parallelism
Equations (5.11) and (5.12) describe the expansion rules for the creation of a new state f(q, i). As we can see, this is a backward recursion, because we create the new state from the previous stages. This peculiarity enables a very interesting property of the recursion: all the states f(q, i) of stage q can be calculated independently, using all the states f(q − q_i, j), j ∈ Γ_i⁻¹, of stage q − q_i. In fact, we can evaluate all the f(q, i) states of stage q at the same time, as depicted in Figure 5.3.

Figure 5.3: Stage q parallel computation.

5.4.1.2 Algorithm Description
In this section we provide the pseudo-code of the parallel algorithm and of the computation kernel for the q-path algorithm. According to the CUDA programming model, we assign a thread block to each state f(q, i) and T threads to each block.
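This mapping is sound precisely because of the stage-wise independence described above: every state of stage q reads only earlier stages, so the states of a stage can be computed in any order, or all at once. A small serial Python sketch (illustrative only, with randomly generated data and a made-up helper) verifies this order-independence:

```python
import random

# Toy demonstration of the intra-stage parallelism of the q-path recursion:
# every state f(q, i) of stage q reads only stages < q, so within a stage the
# nodes can be processed in any order (one CUDA block per state on the GPU).
INF = float("inf")

def q_paths_order(n, Q, q, d, node_order):
    f = [[INF] * (n + 1) for _ in range(Q + 1)]
    for i in range(1, n + 1):
        f[q[i]][i] = d[0][i]
    for load in range(1, Q + 1):
        updates = {}                      # stage results, written only at the end
        for i in node_order:              # any permutation gives the same stage
            prev = load - q[i]
            if prev >= 1:
                updates[i] = min((f[prev][j] + d[j][i]
                                  for j in range(1, n + 1) if j != i), default=INF)
        for i, v in updates.items():
            f[load][i] = min(f[load][i], v)
    return f

random.seed(0)
n, Q = 4, 6
q = {i: random.randint(1, 3) for i in range(1, n + 1)}
d = [[random.randint(1, 9) * (a != b) for b in range(n + 1)] for a in range(n + 1)]
order = list(range(1, n + 1))
shuffled = order[::-1]
assert q_paths_order(n, Q, q, d, order) == q_paths_order(n, Q, q, d, shuffled)
print("stage results are order-independent")
```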
We assign the workload in this manner to compute in parallel not only the states of the recursion (intra-grid parallelism), but also the min operation required to compute each state: the threads of each block evaluate in parallel the reductions (min operations) described in equations (5.12) and (5.13) (intra-block parallelism), using the method suggested in [59]. All data structures are accessed linearly, i.e., all matrices are stored in row-major order; for the legibility of the algorithms, we nevertheless use two indices for the matrices.

Algorithm GPU-Q-PATHS(N, Q, q_i, d)
1. // Data structures
2. f, φ, π, γ
3. // Kernel setup
4. BLOCKS B = N − 1, THREADS T, SHARED-MEM Sh[2 ∗ (N − 1)]
5. // Main loop
6. for q = 0 to Q do
7.   Q-PATHS-KERNEL<<<B, T, Sh>>>(Q, q, q_i, d, f, φ, π, γ)
8. return f, π, φ, γ

Algorithm Q-PATHS-KERNEL(Q, N, q, q_i, d, f, φ, π, γ)
1. min1, min2 // variables initialization
2. pred1, pred2
3. h_sh[N − 1], π_sh[N − 1]
4. thd_idx = threadIdx.x, Nthds = blockDim.x, node_idx = blockIdx.x
5. slack = N % Nthds, times = N / Nthds
6. if thd_idx ≤ slack
7.   then times++
8. // Shared memory initialization
9. for t = 0 to times do
10.   h_sh[thd_idx + t ∗ Nthds] = ∞
11.   π_sh[thd_idx + t ∗ Nthds] = −1
12. syncthreads()
13. // Partial minima computation
14. for t = 0 to times do
15.   j = thd_idx + t ∗ Nthds
16.   if π(q − q_i(node_idx), j) ≠ node_idx
17.     then h_sh[j] = f(q − q_i(node_idx), j) + d(j, node_idx)
18.          π_sh[j] = j
19.     else h_sh[j] = φ(q − q_i(node_idx), j) + d(j, node_idx)
20.          π_sh[j] = j
21. syncthreads()
22. // First reduction, to get the f(q, i) value
23. GPU-REDUCTION(h_sh, π_sh)
24. min1 = h_sh[0], pred1 = π_sh[0], h_sh[0] = ∞, π_sh[0] = −1
25. // Second reduction, to get the φ(q, i) value
26. GPU-REDUCTION(h_sh, π_sh)
27. min2 = h_sh[0], pred2 = π_sh[0]
28. // Matrices update
29. if node_idx ≠ thd_idx
30.   then f(q, i) = min1, π(q, i) = pred1
31.        φ(q, i) = min2, γ(q, i) = pred2

Algorithm GPU-REDUCTION(h_sh, π_sh, times)
1. // Keep only the most promising value for each thread
2. for t = 0 to times do
3.   if h_sh[thd_idx] > h_sh[thd_idx + t ∗ Nthds]
4.     then f_swap = h_sh[thd_idx]
5.          h_sh[thd_idx] = h_sh[thd_idx + t ∗ Nthds]
6.          h_sh[thd_idx + t ∗ Nthds] = f_swap
7.          // Update predecessor
8.          π_swap = π_sh[thd_idx]
9.          π_sh[thd_idx] = π_sh[thd_idx + t ∗ Nthds]
10.         π_sh[thd_idx + t ∗ Nthds] = π_swap
11. syncthreads()
12. // Reduction
13. for s = Nthds/2 to 0, s /= 2 do
14.   if thd_idx < s
15.     then if h_sh[thd_idx + s] < h_sh[thd_idx]
16.       then f_swap = h_sh[thd_idx]
17.            h_sh[thd_idx] = h_sh[thd_idx + s]
18.            h_sh[thd_idx + s] = f_swap
19.            // Update predecessor
20.            π_swap = π_sh[thd_idx]
21.            π_sh[thd_idx] = π_sh[thd_idx + s]
22.            π_sh[thd_idx + s] = π_swap
23.   syncthreads()

Lines 6-8 of the main procedure, GPU-Q-PATHS, form the main loop of the algorithm: inside the for cycle we iteratively call, for each stage q of the Dynamic Programming recurrence, the computation kernel for the GPU. We also define the dimension of the shared memory for each block of the grid: it is designed to store all the Γ_i⁻¹ entries of the vector h(i) defined in equation (5.11), plus the predecessor node of each entry (in our case, the number of predecessors is equal to N − 1, assuming that the digraph G is complete). Once the shared memory is initialized (lines 9-11 of the Q-PATHS-KERNEL procedure), we calculate the function value of each predecessor and store it in the h_sh array, together with the node's index in π_sh (lines 14-21), according to equation (5.11). In lines 23 and 26 we calculate the minima for the f and φ functions; to compute these values, we call the GPU-REDUCTION procedure twice. This procedure, a device function in the implementation, is based on the one
VRP 73 described in [59], we modified it in order to keep trace of the values inside the shared memory, swapping instead of overwriting the values themselves. In lines 2-10 we pre-calculate the significant values for each thread, keeping only the most promising, and then we perform the reduction to find the minimum (lines 13-23). Every thread compute one or more values thanks to the indices that we assigned in the lines 4-7 of the main kernel, calculating the occurrence of the thread index inside the total number of the problem nodes. Finally, in lines 29-31, we update the data structures with the new function values. In line 24 of the main kernel, we reset the result of the first reduction and we store the first minimum and its predecessor. 5.4.2 GPU through-q-route In this section we will describe the parallel algorithm for the through-q-route relaxation. This method, as described in 5.3.2, is based on the function values obtained from the q-paths relaxation. In order to minimize the memory transaction between the CPU and the GPU, we will compute the q-path and, keeping the results on the GPU memory, we will perform the GPU algorithm for the through-q-routes. 5.4.2.1 Exposing Parallelism According to the equation 5.15, we can see the nature of the computation is strictly combinatorial and each ψ(q, i) function value is independent from the others. We decided to compute in parallel all the function values for each i node of the graph (each column of the matrix) which is the dimension, the quantity dimension q, often bigger and computationally more expensive. In the next paragraph we will provide the pseudo-code for the GPU kernel and we will give a brief description of the algorithm. 5.4.2.2 Algorithm Description In this section we will describe the GPU kernel designed to compute the throughq-route relaxation on a GPU. Chapter 5. VRP 74 Algorithm GPU THROUGH-Q ROUTES (N, Q, f , qi , φ, π, γ) 1. // Data Structures 2. ψ 3. // Kernel Setup 4. 
BLOCKS B = Q, THREADS T, SHARED-MEM Sh[T]
5. // Main loop
6. for i = 1 to N do
7.    THROUGH-Q-ROUTES-KERNEL<<<B, T, Sh>>>(Q, i, q_i, d, f, φ, π, γ, ψ)
8. return ψ

Algorithm THROUGH-Q-ROUTES-KERNEL(Q, N, i, q_i, f, φ, π, γ, ψ)
1. sh[T] // shared memory
2. Nthds = blockDim.x, thdidx = threadIdx.x
3. q = blockIdx.x, start_idx = q_i(i), end_idx = (q + q_i(i))/2, diff = end_idx − start_idx
4. // Shared memory initialization
5. sh[thdidx] = ∞
6. times = diff / Nthds, slack = diff % Nthds
7. if thdidx < slack
8.    then times++
9. syncthreads()
10. for t = 0 to times do
11.    q̄ = thdidx + start_idx + t·Nthds
12.    if π(q̄, i) ≠ π(q + q_i(i) − q̄, i)
13.       then f_new = f(q̄, i) + f(q + q_i(i) − q̄, i)
14.            if sh[thdidx] > f_new
15.               then sh[thdidx] = f_new
16.       else φ_a = f(q̄, i) + φ(q + q_i(i) − q̄, i)
17.            φ_b = φ(q̄, i) + f(q + q_i(i) − q̄, i)
18.            φ_new = min(φ_a, φ_b)
19.            if sh[thdidx] > φ_new
20.               then sh[thdidx] = φ_new
21. syncthreads()
22. // Parallel reduction
23. for s = Nthds/2 to 0, s /= 2 do
24.    if thdidx < s
25.       then if sh[thdidx + s] < sh[thdidx]
26.               then sh[thdidx] = sh[thdidx + s]
27.    syncthreads()
28. syncthreads()
29. // Data update
30. if thdidx == 0
31.    then ψ(q, i) = sh[0]

In line 3 we define the data indices for block q; the diff variable describes the range of indices handled by the block. In line 5 we initialize the shared memory and in lines 6-8, as in the q-paths kernel, we define the indices for each thread of the block. In lines 10-20 we compute the partial results and store them in shared memory, according to equation 5.15; in line 11 we define the index q̄ for each thread. Once the partial results are computed, we perform a standard parallel reduction (lines 23-28) in shared memory, as described in [59]. In this case we do not need to keep all the values inside the shared memory: only the minimum for each state is required. Finally, thread 0 writes the ψ(q, i) function value (lines 30-31).

5.4.3 GPU ng-Route

Unlike the previous relaxations, the ng-route is computationally more expensive but numerically more effective. The main difficulties afflicting this method are the efficient management of the NG sets and of the dominance among them: the NG set and its cardinality are dynamic for each stage f(q, i). Dynamic data structures are undesirable in a GPGPU environment, since searches and updates inside such structures are performance killers. In the next sections we describe our strategies for addressing these problems and then a parallel algorithm for the GPU.

5.4.3.1 Exposing Parallelism

The first problem that emerges when porting this relaxation to a many-core platform is the dynamic nature of the NG 'dimension' of the recurrence: a static data structure is almost mandatory to exploit the computational capabilities of these devices. We addressed the problem as follows. As equations 5.17 and 5.19 show, the NG set of each stage (i, q) is completely contained in the set N_i chosen for node i, and each NG set contains the node i itself. From these observations, the complete enumeration of all the possible NG sets at stage (i, q) is the set NGP_i ⊆ P(N_i), the subset of the power set of N_i composed only of the subsets containing i. The cardinality of the power set of N_i is 2^Δ(N_i), but for the NG dimension the cardinality is 2^(Δ(N_i)−1), because we take into account only the sets containing i. For reasonable Δ(N_i) (10-14) we can easily enumerate all the possible sets of the NG dimension, making this dimension static as well. As illustrated in figure 5.4, all the states and stages of the recursion can then be described as a 3-dimensional cube: the vehicle capacity Q, the customers/nodes i and the set indices NG.
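The enumeration of the NG dimension just described is small enough to precompute. A minimal Python sketch (function name and sample data are illustrative, not from the implementation) that lists every subset of N_i containing i and checks the 2^(Δ(N_i)−1) count:

```python
from itertools import combinations

def enumerate_ng_sets(N_i, i):
    """All subsets of the neighbourhood N_i that contain node i (the NGP_i set)."""
    others = [v for v in N_i if v != i]
    sets_with_i = []
    for r in range(len(others) + 1):
        for combo in combinations(others, r):
            sets_with_i.append({i, *combo})
    return sets_with_i

N_7 = [7, 2, 4, 6, 8, 10]            # example neighbourhood of node 7, Delta(N_7) = 6
ngp_7 = enumerate_ng_sets(N_7, 7)
print(len(ngp_7))                    # 2^(Delta - 1) = 2^5 = 32 candidate NG sets
```

With Δ(N_i) = 14 the enumeration is still only 2^13 = 8192 sets per node, which is what makes the static 3-D layout feasible.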
In the third dimension NG we consider only the indices of these sets: as we describe in the following sections, we can index the sets and use these indices to define the dimension. For our purposes, in order to expose the parallelism of the method, we can reformulate equation 5.20 in its forward form:

f(k, q + q_k, NG) = min over (i, q, NG′) ∈ Ψ(k, q + q_k, NG) of { f(i, q, NG′) + d_ik },  ∀(k, q + q_k, NG) ∈ Φ.   (5.21)

Exploiting equation 5.21 we can easily observe that all the states f(k, q + q_k, NG) can be computed independently from the states (i, q, NG′) of the stage set q.

5.4.3.2 Dominance Management

To simplify the management of the dominances among the states during the expansion, we pre-compute the transitions among the NG sets of each node. We basically create a transition map that, taking in input the NG set index of the starting node, the starting node j and the end node i, gives in output the index of the new NG set among the enumerated ones of i. We index the NG sets of each node i through a bitmap. As described before, we enumerate all the possible NG sets of each node, obtained from the relative N_i using its power set. The mask of an NG set is defined as

mask = {b_0, b_1, . . . , b_(Δ(N_i)−1)},  b_h ∈ {0, 1},

where each bit is set according to

b_h = 1 if the h-th element of N_i belongs to NG, and b_h = 0 otherwise, h = 0, . . . , Δ(N_i) − 1.   (5.22)

Based on this mask, we define the index of the NG set as

NG_index = Σ (h = 0 to Δ(N_i)−1) b_h · 2^h.   (5.23)

For example, given NG = {7, 2, 4, 6} ∈ NGP_7 and N_7 = {7, 2, 4, 6, 8, 10}, so that Δ(N_7) = 6, we have mask_NG = {1, 1, 1, 1, 0, 0} and the index

NG_index = 1·2^0 + 1·2^1 + 1·2^2 + 1·2^3 + 0·2^4 + 0·2^5 = 1 + 2 + 4 + 8 = 15.

Each combination of the 1s and 0s of the NG map thus receives a unique index, which lets us know in advance the index of the new NG set created by the expansion to another state.
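The bitmap of equations 5.22-5.23 can be sketched in a few lines of Python (the function name is illustrative); it reproduces the worked example for node 7:

```python
def ng_index(N_i, NG):
    """Bitmap index of an NG set relative to the ordered neighbourhood N_i:
    bit h is set exactly when the h-th element of N_i belongs to NG."""
    return sum(1 << h for h, node in enumerate(N_i) if node in NG)

N_7 = [7, 2, 4, 6, 8, 10]
print(ng_index(N_7, {7, 2, 4, 6}))   # mask {1,1,1,1,0,0} -> 1+2+4+8 = 15
```

Note that the index depends on the ordering chosen for N_i, so the same ordering must be used consistently when building the transition map.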
Given this indexing of the NG sets, we can define the transition map among the NG sets of each node: given NG_index, the starting node j and the set N_i of the destination node i, the mapping function returns the index NG′_index of the set NG′ ∈ NGP_i.

Figure 5.4: NG path 3-D states space.

5.4.3.3 Active Sets

Exploiting equation 5.16 we can pre-compute the sets NG ∈ NGP_i that are active during the recursion: not all the enumerated NG sets are effectively used by the method. For each stage q, we can retrieve all the NG sets involved in the computation simply by running the recursion once, without computing f, before the main part of the algorithm in which the ng-relaxation is used, exploiting the transition map defined in the previous paragraph. This property allows us to apply a modified version of the technique called 'threads compaction', described in [86] and analyzed in [90], which consists in creating a mask that lets the GPU spawn only the threads useful for the computation. Using this technique, we take into consideration only the states effectively useful for the relaxation and calibrate the device resources on them, avoiding the overhead induced by idle threads. The active sets for each stage, together with the indexing of the NG sets, also enhance the serial version of the method: we avoid the overhead of the dominance checks and drastically reduce the computation at each stage. We propose a modified version of the serial algorithm in the next paragraph.

5.4.3.4 Algorithm Description

In this section we first describe the enhancement of the serial algorithm based on the previous considerations, then we propose a parallel kernel for the GPU.

Algorithm NG-PATHS 2(N, Q, d, q_i, ActiveSets, TransMap)
1. // Data structures
2. f, π_n, π_ng
3. // Initialization
4. for q = 0 to Q do
5.    for i = 1 to N do
6.       for ng = 0 to |NG_list(i, q)| do
7.          if q_i(i) == q
8.             then f(q, i, 0) = d(0, i), π_n(q, i, 0) = 0, π_ng(q, i, 0) = 0
9.             else f(q, i, 0) = ∞, π_n(q, i, 0) = −1, π_ng(q, i, 0) = −1
10. // ng-paths
11. for q = 0 to Q do
12.    for i = 1 to N do
13.       for j = 1 to N do
14.          dist_ij = dist(i, j)
15.          for h = 0 to |ActiveSets(q − q_i(i), j)| do
16.             NG_index = ActiveSets(q − q_i(i), j, h)
17.             NG′_index = TransMap(i, j, NG_index)
18.             if f(q, i, NG′_index) > f(q − q_i(i), j, NG_index) + dist_ij
19.                then f(q, i, NG′_index) = f(q − q_i(i), j, NG_index) + dist_ij
20.                     π_n(q, i, NG′_index) = j
21.                     π_ng(q, i, NG′_index) = NG_index
22. return f, π_n, π_ng

In lines 15 and 16 we introduced the ActiveSets and TransMap data structures, which give us the NG set indices to update.

Algorithm GPU NG-PATHS(N, Q, d, q_i, ActiveSets, TransMap)
1. // Data structures
2. f, π_n, π_ng, SLabels_N, SLabels_NG, ActThds
3. // The variables f, π_n, π_ng are initialized in the same fashion as in the serial algorithm
4. // Number of active threads for each set of stages q
5. for q = 0 to Q do
6.    for i = 1 to N do
7.       ActThds(q) += |ActiveSets(q, i)|
8. // Initialize StartLabels
9. for q = 0 to Q do, for i = 1 to N do
10.    for h = 0 to |ActiveSets(q, i)| do
11.       NG = ActiveSets(q, i, h)
12.       SLabels_N(q).Add(i)
13.       SLabels_NG(q).Add(NG)
14. // Main loop
15. for q = 0 to Q do
16.    // Kernel setup
17.    THDS = T
18.    BLK.y = N − 1, BLK.x = |ActThds(q)| / THDS
19.    NG-PATHS-KERNEL<<<BLKS, THDS>>>
20.       (q, q_i, Map, d, f, π_n, π_ng, ActiveSets, SLabels_N, SLabels_NG)
21. return f, π_n, π_ng

Algorithm NG-PATHS-KERNEL(q, q_i, Map, d, f, π_n, π_ng, ActiveSets, SLabels_N, SLabels_NG)
1. idx = blockIdx.x · blockDim.x + threadIdx.x
2. i = blockIdx.y
3. j = SLabels_N(q, idx), NG_index = SLabels_NG(q, idx)
4. dist = d(j, i)
5. NG′_index = Map(i, j, NG_index)
6. if f(q + q_i(i), i, NG′_index) > f(q, j, NG_index) + dist
7.    then f(q + q_i(i), i, NG′_index) = f(q, j, NG_index) + dist
8.         π_n(q + q_i(i), i, NG′_index) = j
9.
π_ng(q + q_i(i), i, NG′_index) = NG_index

Inside the main procedure GPU NG-PATHS, in lines 5-7 we count the number of active labels for each stage q; using this value, inside the main loop of the procedure, we spawn the necessary number of threads at each iteration (line 19). In lines 8-13 we initialize the SLabels_N and SLabels_NG structures, containing, for each q, the indices of the active NG sets of each node. The main loop is described in lines 15-20. As in the serial version of the algorithm, we return the array f with the function values, together with the array of predecessor nodes π_n and the array of predecessor paths π_ng (line 21). Inside the GPU kernel NG-PATHS-KERNEL, in lines 1-3 we define the indices of the label; in line 5, using the transition map, we find the index of the new NG set for the expanded label; and in lines 6-9 we update the new label if necessary. The operations in these lines are implemented with the atomicMin() CUDA primitive, which manages the concurrent update of a single variable by several threads simultaneously, avoiding race conditions and inconsistent results.

5.4.4 Asymmetric Relaxations

The asymmetric case of the VRP is characterized by d_ij ≠ d_ji. For the q-paths and ng-paths, we are forced to compute the relaxation twice, once using the d matrix (forward) and once using the transpose matrix d^T (backward). For the through-q-route relaxation, we have to use the output of the asymmetric computation of the q-paths, as described in the ASY THROUGH-Q-ROUTES algorithm. We propose an effective parallel approach to this version of the VRP that exploits all the parallel features of a GPU. A stream is a sequence of operations that execute in issue order on the GPU. More intuitively, a GPU can execute multiple kernels and memory transactions concurrently, overlapping the operations (e.g. transferring the data for kernel 2 from the host to the GPU while executing kernel 1). This is possible because different engines manage the execution and the memory-transfer operations. Depending on the number of simultaneous streams supported by the GPU (typically 4), we can hide a memory transfer behind the computation of one kernel and then process the loaded data with another kernel. This feature allows us to compute the two relaxations (forward and backward) concurrently on the same GPU, introducing another level of parallelism (among kernels). In our case we do not use the streams to hide the memory transfers between the CPU and the GPU, but to execute the same kernel on different data on the same GPU, in a typical SIMD fashion. In the following we propose the pseudo-code for these algorithms. The kernels are the same described in the previous paragraphs; the main difference is the use of the cudaMemcpyAsync() primitive, which requires page-locked (pinned) memory, for whose characteristics we refer to the official CUDA documentation.

Algorithm GPU-Q-PATHS ASY(N, Q, q_i, d, d^T)
1. // Data structures
2. f^fw, φ^fw, π^fw, γ^fw
3. f^bw, φ^bw, π^bw, γ^bw
4. // Stream initialization
5. Stream FW, Stream BW
6. // Kernel setup
7. BLOCKS B = N − 1, THREADS T, SHARED-MEM Sh[2(N − 1)]
8. // Main loop
9. for q = 0 to Q do
10.    Q-PATHS-KERNEL<<<B, T, Sh, FW>>>(Q, q, q_i, d, f^fw, φ^fw, π^fw, γ^fw)
11.    Q-PATHS-KERNEL<<<B, T, Sh, BW>>>(Q, q, q_i, d^T, f^bw, φ^bw, π^bw, γ^bw)
12. return f^fw, π^fw, φ^fw, γ^fw, f^bw, π^bw, φ^bw, γ^bw

Algorithm GPU NG-PATHS ASY(N, Q, d, d^T, q_i, ActSets^fw, ActSets^bw, Map^fw, Map^bw)
1. // Data structures
2. f^fw, π_n^fw, π_ng^fw, SLabels_N^fw, SLabels_NG^fw, ActThds^fw
3. f^bw, π_n^bw, π_ng^bw, SLabels_N^bw, SLabels_NG^bw, ActThds^bw
4. // The variables f, π_n, π_ng are initialized in the same fashion as in the serial algorithm
5.
// Number of active threads for each set of stages q
6. for q = 0 to Q do
7.    for i = 1 to N do
8.       ActThds^fw(q) += |ActSets^fw(q, i)|
9.       ActThds^bw(q) += |ActSets^bw(q, i)|
10. // Initialize StartLabels
11. for q = 0 to Q do
12.    for i = 1 to N do
13.       for h = 0 to |ActSets^fw(q, i)| do
14.          NG = ActSets^fw(q, i, h)
15.          SLabels_N^fw(q).Add(i)
16.          SLabels_NG^fw(q).Add(NG)
17.       for h = 0 to |ActSets^bw(q, i)| do
18.          NG = ActSets^bw(q, i, h)
19.          SLabels_N^bw(q).Add(i)
20.          SLabels_NG^bw(q).Add(NG)
21. // Stream initialization
22. Stream FW, Stream BW
23. // Main loop
24. for q = 0 to Q do
25.    // Kernel setup
26.    THDS = T
27.    BLK.y = N − 1, BLK.x = |ActThds^fw(q)| / THDS
28.    NG-PATHS-KERNEL<<<BLKS, THDS, 0, FW>>>
29.       (N, Q, q, q_i, Map^fw, d, f^fw, π_n^fw, π_ng^fw, ActSets^fw,
30.        SLabels_N^fw, SLabels_NG^fw)
31.    BLK.y = N − 1, BLK.x = |ActThds^bw(q)| / THDS
32.    NG-PATHS-KERNEL<<<BLKS, THDS, 0, BW>>>
33.       (N, Q, q, q_i, Map^bw, d^T, f^bw, π_n^bw, π_ng^bw, ActSets^bw,
34.        SLabels_N^bw, SLabels_NG^bw)
35. return f^fw, π_n^fw, π_ng^fw, f^bw, π_n^bw, π_ng^bw

The through-q-route algorithm only changes the input data structures:

Algorithm GPU THROUGH-Q-ROUTES ASY(N, Q, q_i, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw)
1. // Data structures
2. ψ
3. // Kernel setup
4. BLOCKS B = Q, THREADS T, SHARED-MEM Sh[T]
5. // Main loop
6. for i = 1 to N do
7.    THROUGH-Q-ROUTES-KERNEL ASY<<<B, T, Sh>>>(Q, i, q_i, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw, ψ)
8. return ψ

Algorithm THROUGH-Q-ROUTES-KERNEL ASY(Q, N, i, q_i, f^fw, φ^fw, π^fw, γ^fw, f^bw, φ^bw, π^bw, γ^bw, ψ)
1. sh[T] // shared memory
2. Nthds = blockDim.x, thdidx = threadIdx.x
3. q = blockIdx.x, start_idx = q_i(i), end_idx = (q + q_i(i))/2, diff = end_idx − start_idx
4. // Shared memory initialization
5. sh[thdidx] = ∞
6. times = diff / Nthds, slack = diff % Nthds
7. if thdidx < slack
8.    then times++
9. syncthreads()
10. for t = 0 to times do
11.    q̄ = thdidx + start_idx + t·Nthds
12.    if π^fw(q̄, i) ≠ π^bw(q + q_i(i) − q̄, i)
13.       then f_new = f^fw(q̄, i) + f^bw(q + q_i(i) − q̄, i)
14.            if sh[thdidx] > f_new
15.               then sh[thdidx] = f_new
16.       else φ_a = f^fw(q̄, i) + φ^bw(q + q_i(i) − q̄, i)
17.            φ_b = φ^fw(q̄, i) + f^bw(q + q_i(i) − q̄, i)
18.            φ_new = min(φ_a, φ_b)
19.            if sh[thdidx] > φ_new
20.               then sh[thdidx] = φ_new
21. syncthreads()
22. // Parallel reduction
23. for s = Nthds/2 to 0, s /= 2 do
24.    if thdidx < s
25.       then if sh[thdidx + s] < sh[thdidx]
26.               then sh[thdidx] = sh[thdidx + s]
27.    syncthreads()
28. // Data update
29. if thdidx == 0
30.    then ψ(q, i) = sh[0]

5.5 Computational Results

In this section we report the experimental results of our algorithms. Each table reports the execution time of a single run of the methods. All the GPU times include the load and store times for the data transfers between the CPU and the GPU: we chose to take these times into account because the relaxations are called repeatedly inside a CG algorithm, where the load and store times have a significant impact on performance. The test machine is a workstation equipped with an Intel i7 4820K @3.9 GHz with 32 Gigabytes of RAM and an Nvidia GTX TITAN with 2688 CUDA cores @837 MHz and 6 Gigabytes of GDDR5 RAM, provided by SINTEF [91]. The data sets are those of the VRPLIB [92]; the bigger instances are those provided by [93]. We report the speed-up factor between the serial and the parallel versions of the methods, defined as the ratio SpeedUp = Time_Serial / Time_Parallel.
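As a quick illustration of the metric, the speed-up of the q-paths on the largest symmetric instance can be recomputed from the timings reported in table 5.1 (a trivial Python sketch; the function name is illustrative):

```python
def speed_up(t_serial, t_parallel):
    """Speed-up factor as defined above: serial time over parallel time."""
    return t_serial / t_parallel

# q-paths on V1200-2500 (table 5.1): 21.092 s serial vs 1.328 s on the GPU
print(round(speed_up(21.092, 1.328), 2))   # ~15.88, matching the ~16 X reported
```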
Instance      Nodes  Cap.   CPU times: q-path / t-q-route / q+t   GPU times: q-path / t-q-route / q+t   Speed-up (X): q-path / t-q-route / q+t
V560-1200      560   1200        1.045 /  2.325 /  3.370               0.140 / 0.210 / 0.350                 7.455 / 11.055 /  9.615
V600-900       600    900        0.874 /  1.311 /  2.185               0.101 / 0.128 / 0.228                 8.692 / 10.257 /  9.568
V640-1400      640   1400        1.576 /  4.695 /  6.271               0.177 / 0.284 / 0.461                 8.897 / 16.527 / 13.596
V720-1500      720   1500        2.215 /  6.677 /  8.892               0.251 / 0.363 / 0.614                 8.831 / 18.369 / 14.475
V760-900       760    900        1.857 /  2.340 /  4.197               0.162 / 0.149 / 0.311                11.483 / 15.717 / 13.513
V800-1700      800   1700        4.805 / 11.497 / 16.302               0.362 / 0.511 / 0.873                 5.126 / 22.489 / 18.663
V840-900       840    900        3.027 /  2.932 /  5.959               0.209 / 0.165 / 0.373                14.514 / 17.807 / 15.967
V880-1800      880   1800        7.004 / 15.241 / 22.245               0.486 / 0.627 / 1.113                14.424 / 24.289 / 19.985
V960-2000      960   2000       10.046 / 23.353 / 33.399               0.628 / 0.837 / 1.465                15.995 / 27.910 / 22.802
V1040-2100    1040   2100       13.166 / 29.640 / 42.806               0.869 / 0.994 / 1.862                15.158 / 29.831 / 22.987
V1120-2300    1120   2300       16.832 / 40.154 / 56.986               1.061 / 1.276 / 2.336                15.869 / 31.479 / 24.392
V1200-2500    1200   2500       21.092 / 52.026 / 73.118               1.328 / 1.607 / 2.935                15.878 / 32.385 / 24.914

Table 5.1: Computational results for the symmetric q-path and through-q-route relaxations.
Instance          Nodes  Cap.   CPU times: q-path / t-q-route / q+t   GPU times: q-path / t-q-route / q+t   Speed-up (X): q-path / t-q-route / q+t
A034-02f            34   1000        0.008 /  0.031 /  0.039               0.024 / 0.007 / 0.031                 0.334 /  4.722 /  1.278
A036-03f            36   1000        0.015 /  0.031 /  0.031               0.024 / 0.007 / 0.031                 0.619 /  4.557 /  0.999
A039-03f            39   1000        0.015 /  0.031 /  0.046               0.026 / 0.007 / 0.033                 0.579 /  4.213 /  1.382
A045-03f            45   1000        0.015 /  0.047 /  0.062               0.026 / 0.009 / 0.034                 0.578 /  5.505 /  1.799
A048-03f            48   1000        0.016 /  0.047 /  0.063               0.025 / 0.009 / 0.035                 0.634 /  4.986 /  1.817
A056-03f            56   1000        0.015 /  0.063 /  0.078               0.026 / 0.011 / 0.037                 0.584 /  5.777 /  2.131
A065-03f            65   1000        0.016 /  0.093 /  0.109               0.026 / 0.013 / 0.039                 0.606 /  7.388 /  2.796
A071-03f            71   1000        0.031 /  0.078 /  0.109               0.026 / 0.014 / 0.040                 1.184 /  5.642 /  2.724
Balman859-1000     859   1000        6.100 /  3.026 /  9.126               0.393 / 0.122 / 0.515                15.514 / 24.781 / 17.710
Balman859-2000     859   2000       13.385 / 17.113 / 30.498               0.791 / 0.502 / 1.292                16.929 / 34.102 / 23.597

Table 5.2: Computational results for the asymmetric q-path and through-q-route relaxations.
Instance     Nodes Cap.  CPU times (Δ = 8/10/12/13/14)          GPU times (Δ = 8/10/12/13/14)          Speed-up X (Δ = 8/10/12/13/14)
A-n62-k8       62  100   0.047/0.093/0.281/0.421/0.717          0.004/0.007/0.021/0.047/0.080          12.796/12.510/13.646/ 8.875/ 8.947
A-n63-k10      63  100   0.047/0.094/0.234/0.359/0.546          0.003/0.007/0.020/0.046/0.070          14.030/12.974/11.958/ 7.775/ 7.813
A-n64-k9       64  100   0.047/0.093/0.249/0.375/0.515          0.003/0.007/0.021/0.034/0.066          14.766/13.358/11.906/10.969/ 7.830
A-n80-k10      80  100   0.078/0.187/0.484/0.733/1.061          0.005/0.012/0.036/0.067/0.113          14.888/15.004/13.425/10.927/ 9.356
B-n50-k8       50  100   0.031/0.094/0.296/0.484/0.827          0.003/0.007/0.024/0.042/0.095          10.431/12.835/12.530/11.433/ 8.715
B-n68-k9       68  100   0.094/0.234/0.639/0.842/1.373          0.006/0.021/0.068/0.088/0.167          14.925/10.928/ 9.441/ 9.530/ 8.220
B-n78-k10      78  100   0.110/0.343/0.998/1.482/2.511          0.010/0.032/0.109/0.199/0.290          11.542/10.754/ 9.192/ 7.442/ 8.659
E-n51-k5       51  160   0.063/0.125/0.359/0.593/0.920          0.004/0.008/0.025/0.062/0.084          16.475/15.543/14.089/ 9.607/10.916
E-n76-k7       76  220   0.156/0.421/1.279/2.043/3.260          0.010/0.026/0.081/0.194/0.305          15.388/16.476/15.775/10.533/10.680
E-n76-k8       76  180   0.125/0.296/0.921/1.420/2.215          0.008/0.019/0.071/0.139/0.222          16.346/15.811/12.894/10.228/ 9.975
E-n76-k10      76  140   0.078/0.187/0.530/0.811/1.216          0.005/0.012/0.036/0.067/0.124          14.684/15.352/14.536/12.174/ 9.773
E-n76-k14      76  100   0.031/0.078/0.203/0.297/0.390          0.003/0.006/0.018/0.044/0.065          10.375/12.597/11.361/ 6.688/ 5.998
E-n101-k8     101  200   0.272/0.795/2.371/3.619/5.523          0.017/0.046/0.182/0.306/0.528          15.826/17.271/13.042/11.824/10.453
E-n101-k14    101  112   0.110/0.280/0.765/1.108/1.623          0.007/0.018/0.061/0.098/0.170          15.621/15.862/12.535/11.344/ 9.520
F-n135-k7     135 2210   9.516/30.342/91.853/Out/Out            0.480/1.634/5.694/Out/Out              19.811/18.565/16.131/Out/Out
M-n121-k7     121  200   0.671/2.169/9.345/19.407/38.876        0.045/0.246/1.203/2.710/5.731          14.871/ 8.817/ 7.767/ 7.160/ 6.783
M-n151-k12    151  200   0.577/1.591/4.181/6.692/10.358         0.034/0.099/0.323/0.556/0.978          16.731/16.083/12.952/12.038/10.587
M-n200-k16    200  200   0.983/2.901/7.301/10.905/15.928        0.064/0.198/0.611/0.956/1.524          15.292/14.629/11.946/11.408/10.450
M-n200-k17    200  200   0.998/2.902/7.332/10.888/15.943        0.064/0.197/0.617/0.966/1.527          15.501/14.759/11.887/11.272/10.443
P-n50-k8       50  120   0.031/0.047/0.109/0.171/0.250          0.002/0.005/0.012/0.024/0.044          14.299/ 9.918/ 8.804/ 7.001/ 5.685
P-n70-k10      70  135   0.047/0.125/0.359/0.530/0.749          0.004/0.009/0.035/0.048/0.089          12.401/13.832/10.362/10.954/ 8.439

Table 5.3: Computational results for the symmetric ng-path relaxation. ('Out' marks runs exceeding the available memory.)

Instance     Nodes Cap.  CPU times (Δ = 8/10/12/13/14)          GPU times (Δ = 8/10/12/13/14)          Speed-up X (Δ = 8/10/12/13/14)
A034-02f       34 1000   0.422/1.045/3.151/5.055/8.096          0.022/0.035/0.086/0.128/0.199          19.055/29.825/36.821/39.555/40.595
A036-03f       36 1000   0.406/1.014/3.230/4.711/7.629          0.023/0.037/0.086/0.127/0.194          17.849/27.397/37.417/37.189/39.292
A039-03f       39 1000   0.406/1.185/3.417/5.289/8.205          0.024/0.038/0.091/0.146/0.221          17.256/31.399/37.441/36.216/37.103
A045-03f       45 1000   0.546/1.404/4.259/6.942/12.137         0.027/0.053/0.138/0.238/0.443          19.884/26.447/30.809/29.206/27.377
A048-03f       48 1000   0.671/1.809/5.835/9.625/18.205         0.030/0.079/0.239/0.427/0.956          22.587/22.900/24.374/22.565/19.041
A056-03f       56 1000   0.905/2.730/7.784/14.367/21.918        0.041/0.114/0.331/0.687/1.102          22.124/23.856/23.514/20.910/19.898
A065-03f       65 1000   1.154/3.213/9.423/15.398/25.365        0.053/0.122/0.398/0.720/1.298          21.691/26.385/23.661/21.391/19.547
A071-03f       71 1000   1.326/3.869/11.684/19.625/30.639       0.057/0.160/0.572/1.109/1.813          23.425/24.127/20.428/17.703/16.904
Table 5.4: Computational results for the asymmetric ng-path relaxation.

In table 5.1 we report the speed-up factors for the q-paths and through-q-routes algorithms. We used the biggest CVRP instances in the literature for two reasons: 1. the execution times of these algorithms on the VRPLIB are negligible because of the small instance dimensions; 2. we want to highlight the scalability of the method on very difficult instances. In fact, for the bigger instances the global speed-up factor obtained is substantial, 24 X for the V1200-2500 instance. We report the unified speed-up for both methods because computing the through-q-routes requires the results of the q-paths; nevertheless, the through-q-routes is the algorithm achieving the best performance. For the asymmetric instances, table 5.2, as before, the best results are obtained on the big instances. In table 5.3 we report the results of the NG relaxation for the symmetric CVRP. In this case we evaluate the scalability of the method across different values of the Δ parameter (8, 10, 12, 13, 14), using the VRPLIB instances because of the data-structure dimensions reached during the computations. It is easy to notice that the performance of the method degrades with bigger neighborhood sets, because of the increasing number of atomic operations among the labels and the bigger data transfer times from and to the GPU, but in most cases it remains above the 10 X factor. In the asymmetric case, table 5.4, we obtain the maximum speed-up, 40 X: here we can exploit all the computational capabilities of the device and all the available levels of parallelism, as described before.

5.6 Considerations and Future Work

In this chapter we highlighted the great advantages that a parallel algorithm can bring to these pricing strategies for the VRP.
In fact, inside a CG method, the pricing problem is the most computationally expensive routine used to generate the columns to insert into the master. The use of a GPU appears to be a very good alternative in terms of execution time, portability and affordability (the device used is a high-end gaming GPU). Also notable is the performance obtained on the biggest instances, the hardest to solve and the ones taking the most elaboration time. The planned future work is to explore the implementation of these methods in OpenCL, with the aim of making them portable to the most common parallel processors of different vendors, and, obviously, to exploit these relaxations, or some of their variants, in order to design more powerful and effective algorithms for solving instances from different classes of VRPs.

Chapter 6

Single Source Shortest Path Problem

6.1 Introduction

The ever more complex systems and eco-systems represented by urban centers and industrialized countries are growing fast and, nowadays, optimally exploiting the infrastructures composing these systems is more and more difficult. From public transport to goods transportation and delivery, the related problems are increasingly strategic, and dealing with them is the focus of many studies and research efforts from academia and private companies. Providing high-quality decision support tools is mandatory to preserve resources from the environmental point of view and to enhance the quality of life in densely populated areas. By now, for instance, a great urban center offers different kinds, or modes, of transportation covering the city area, allowing people to move easily from one location to another. Also in highly industrialized countries, transportation is characterized by multiple networks (railways, motorways, air routes, . . . ) that permit a fast displacement of people and goods.
The layered networks composing these transportation infrastructures have different properties and characteristics, making the related mathematical models more complex, and the consistent and effective resolution of optimization problems on these models a non-trivial challenge.

Routing is a widely explored research topic, originating in the late fifties with the well-known Dijkstra algorithm [94], for finding the shortest path inside an oriented graph, and the Bellman-Ford algorithm [95], computationally more expensive but more effective for other purposes. More generally, by routing we mean finding the 'best' path with respect to one or more aspects of the journey: mileage, cost, fuel consumption, time, number of transportation modes used and others. The focus thus shifts from finding the shortest path in a geographical network to optimizing a route among different layers with respect to other factors. One aspect, mainly related to the transportation of people and goods, is the arrival time at a certain destination: in a public transport or goods delivery scenario, the earlier the arrival time, the better the quality of service. Over the years many enhancements to the basic algorithms cited above have been proposed, achieving good results on a large spectrum of routing problems. However, in cases where the query is strictly related to a temporal dimension, algorithms using bi-directional search, contraction hierarchies or a heuristic completion bound such as A* [96] are not applicable, because the network values (mainly the edge costs) or the topology change as a function of time. In this case we can only solve the routing problem using an 'augmented' version of the shortest path algorithm which, beyond certain dimensions, is computationally expensive.
In this chapter we propose two parallel algorithms, implemented both on CPU (multi-core, shared-memory platform) and on GPU, to solve the Earliest Arrival Problem in a time-dependent multi-modal network, reporting the performance obtained compared to the serial version.

6.2 Problem Definition

In this section we define all the peculiarities of the problem, starting from the definition of the Earliest Arrival Problem and showing that it can be solved using an augmented version of the shortest path algorithm. We also define a multi-modal network and the algorithm to manage routing in this type of graph; finally, we add the time dimension to the model, describing the changes it introduces.

Definition 6.1 (Earliest Arrival Problem). Given a time-independent or time-dependent network and source and target points s and t in the network, we ask for a route in the network with the following properties:
1. the route starts at s,
2. the route ends at t,
3. the length (travel time) of every other route satisfying properties (1)-(2) is greater than or at least equal.

In other words, among all the possible routes in the network from s to t, we seek the route with the minimum cost of arriving at t. As mentioned before, the route cost can be any aspect, from fuel consumption to travel time, or a mix of several such criteria; in this chapter we cover only the optimization of a single criterion. Analyzing definition 6.1, we can easily notice that, substituting any edge weight function for the optimization criterion, this problem becomes a Single Source Shortest Path Problem. We first define the SSSP Problem, then we propose the multi-modal and time-dependent versions.

6.2.1 Single Source Shortest Path Problem

The shortest path is a deeply investigated problem in Combinatorial Optimization.
This problem and its variations have been subject of research for about five decades and can be found, often as a subproblem, in a wide plethora of applications. In our case, shortest paths are the basis for the problem we are discussing.

Definition 6.2. (Single Source Shortest Path Problem). Given a weighted, directed graph G = (V, E), a source node s ∈ V, a target node t ∈ V and a weight function w(e) for each edge e = (v_a, v_b), v_a ∈ V and v_b ∈ V, we ask for a path P = {v_1, ..., v_k} with the following properties:
1. The path begins at s, thus v_1 = s,
2. the path ends at t, thus v_k = t,
3. P is minimal.

We define Len(P) = Σ_{i=1}^{k-1} w(v_i, v_{i+1}), the length of the path P. The SSSPP has some common declinations, depending on the number of sources and targets considered:

• Many-To-Many Shortest Path Problem. This is a generalization of the Shortest Path Problem. Instead of single nodes s and t we are given a set of source nodes S ⊆ V and a set of target nodes T ⊆ V. We now ask for a shortest path P_{s,t} for each pair (s, t) ∈ S × T. In multi-modal routing the Earliest Arrival Problem will actually transform into this version of the problem.

• One-To-All Shortest Path Problem. This is a special case of the Many-To-Many Shortest Path Problem where S is a singleton set consisting of one source node s and T = V is the set of all nodes. Hence, we are asking for a shortest path P_v to every node v ∈ V. Because the edge set of all resulting paths ∪_{v ∈ V} P_{s,v} forms a tree, we might also say that we compute a shortest path tree.

• All-Pairs Shortest Path Problem. This is a version of the Many-To-Many Shortest Path Problem where both S and T are the complete node set V of the graph. Having the All-Pairs Shortest Path Problem solved automatically yields solutions for all instances of the Shortest Path Problem in the graph. For this problem the Floyd-Warshall dynamic programming algorithm [97] is used in most cases.
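The All-Pairs variant mentioned above can be illustrated with a minimal Floyd-Warshall sketch in Python; this is not the thesis implementation, and the small 4-node graph is hypothetical data chosen only for the example.

```python
INF = float("inf")

def floyd_warshall(n, edges):
    """All-Pairs Shortest Paths by dynamic programming.

    dist[i][j] is iteratively improved by allowing intermediate
    nodes 0..k; after the last iteration it holds the shortest
    distance from i to j (INF if j is unreachable from i).
    """
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Small directed example graph (hypothetical data).
edges = {(0, 1): 3.0, (1, 2): 4.0, (0, 2): 10.0, (2, 3): 1.0}
dist = floyd_warshall(4, edges)
```

After the k = 1 pass the direct edge (0, 2) of cost 10 is replaced by the path 0-1-2 of cost 7, which is the kind of improvement the triple loop performs.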
All these problems can be solved using the same algorithm, often executing it multiple times. In this paragraph we do not take the time dimension into account; we will extend the model after the introduction of multi-modal networks. We can consider the SSSPP a special case of a multi-modal network with only one mode. In this case the solution algorithm for the routing problem without time dependencies is equivalent to the Dijkstra algorithm.

Algorithm Unimodal Routing(s, t, V, E, w())
1. vals // Tentative distance values
2. preds // Predecessors vector
3. Queue Q = null // Priority queue
4. // Data structures initialization
5. Q.Insert(s)
6. for i = 1 to |V| do
7.   if i == s
8.     then vals[i] = 0, preds[i] = 0
9.     else vals[i] = ∞, preds[i] = ∞
10. // Algorithm
11. while Q ≠ null do
12.   n = Q.First()
13.   if n == t
14.     then break
15.     else val_n = vals[n]
16.   for each succ ∈ Γ_n do
17.     cost_e = w((n, succ))
18.     if vals[succ] > val_n + cost_e
19.       then vals[succ] = val_n + cost_e
20.         preds[succ] = n
21.         Q.Insert(succ)

The proposed algorithm is a straightforward implementation of Dijkstra's. In line 3 we declare a priority queue, implemented with a heap ordered by the tentative values; in line 5 we insert the start node s into the heap. In lines 6-9 we initialize the tentative value of each node and the predecessors used to retrieve the shortest path. Lines 11-21 contain the core of the algorithm, a while loop that ends once the queue is empty. In line 12 we extract the root of the heap; in lines 13-15 we check whether we have reached the target t, otherwise we extract the cost of the label of the current node n. The for-each statement in lines 16-21 evaluates the new tentative value of each successor succ of n and updates it if the successor's label is greater, keeping track of the predecessor n and inserting the successor node succ into the queue (line 21), which reorders itself maintaining the heap properties.
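The Unimodal Routing pseudocode above can be sketched in Python with the standard heapq module; a minimal sketch, assuming an adjacency-list graph, and using lazy deletion of stale heap entries in place of a decrease-key operation. The 5-node graph is hypothetical.

```python
import heapq

def unimodal_routing(s, t, succ, w):
    """Dijkstra's algorithm following the Unimodal Routing pseudocode.

    succ[n] lists the successors of n, w[(n, m)] is the edge weight.
    Returns (vals, preds); vals[t] is the shortest distance s -> t.
    """
    vals = {n: float("inf") for n in succ}
    preds = {n: None for n in succ}
    vals[s] = 0.0
    q = [(0.0, s)]                      # heap ordered by tentative value
    while q:
        val_n, n = heapq.heappop(q)
        if n == t:
            break
        if val_n > vals[n]:             # stale entry, skip it
            continue
        for m in succ[n]:
            cost = w[(n, m)]
            if vals[m] > val_n + cost:  # relaxation step
                vals[m] = val_n + cost
                preds[m] = n
                heapq.heappush(q, (vals[m], m))
    return vals, preds

# Hypothetical 5-node example.
succ = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
w = {(0, 1): 2.0, (0, 2): 5.0, (1, 3): 2.0, (2, 3): 1.0, (3, 4): 3.0}
vals, preds = unimodal_routing(0, 4, succ, w)
```

Following preds back from the target reconstructs the path 0, 1, 3, 4, exactly as the preds vector of the pseudocode is meant to be used.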
6.2.2 Multi-Modal Networks

In this section we define a Multi-Modal Network and the other foundations on which routing for this type of network relies. First, we give the definition of a Multi-Modal, or Multi-Layer, Network:

Definition 6.3. (Multi-Modal Network). Given a graph G = (V, E, M) where:
1. M = {M^1, M^2, ..., M^n} is the set of modes, or layers, composing the network, where M^i = (V^i, E^i) is the graph representing mode (or layer) i, V^i is the set of vertices of i, E^i is the set of edges of i and n = |M|,
2. V = ∪_{i=1}^{n} V^i, with V^i ∩ V^j = ∅, i ≠ j,
3. v^i ∈ V^i is a vertex of mode i,
4. e^i ∈ E^i is an edge connecting two nodes of the same mode i, e^i = (v^i_h, v^i_k),
5. e^t ∈ E^t is an edge connecting two nodes of different modes i and j, e^t = (v^i_h, v^j_k), with i ≠ j (transition edge),
6. E = (∪_{i=1}^{n} E^i) ∪ E^t is the set of edges of G, where E^i ∩ E^j = ∅.

The graph G is a multi-modal, or multi-layer, graph composed of n modes or layers. For each graph M^i we can define a cost function w^i(e^i) for the mode's edges. For the transition edges, w^t(e^t) is equal to 0. For instance, the public transport network of an urban area (roads, bus lines, tram lines, trains, boats, etc.) can be modelled with a multi-modal network. Another example is the set of transportation types of a delivery service (the postal service, for instance), which uses different types of networks to deliver the goods (ship, aeroplane, etc.). The routing problem for this type of network can be seen as a Shortest Path Problem among the various, connected, modes of the network, using the relative weight functions w^i() to evaluate the cost of each edge. In this case we will find a route passing through the various modes of the network from the source s to the target t. In this scenario, we are also allowed to choose paths exploiting various types of transportation: car, bus, trams, and boats together in the same path.
Figure 6.1: Types of Routes. (a) Not Convenient Route; (b) Convenient Route.

Obviously, this kind of solution is not desirable: a 'well formed' route passes only through certain types of modes, depending on the state of the traveler. For instance, assume that we are tourists in Rome and we want to visit some of the most beautiful places in the city. We are on foot and we want a route that uses the public transport and road networks, allowing us to reach the places we have chosen. The algorithm, as it is, can give us undesirable results, finding that the most convenient path between the Colosseum and the Vatican Museums is taking the metro up to a certain point, then taking the car, then another bus, then the car again to reach the Museums. It is straightforward that a path like this is not desirable; we would like a path using only public transportation to reach our destination (Figure 6.1). Another side effect of using this algorithm on these types of networks is that, while we are traveling on a train whose railway intersects a road that could bring us to destination faster, without a stop for the train, the algorithm can suggest that we take that road. In general, we are forced to evaluate paths that are compatible with the state (foot, car, bicycle, etc.) of the traveler.

6.2.3 Label Constrained Shortest Path Problem

Barrett [98] proposed an algorithm to solve the problem of undesirable routes by associating an automaton to the graph: treating the modes and the states as part of a DFA (Deterministic Finite Automaton), it regulates the transitions among the modes of the network. We will give a brief definition of a language and an automaton; then we will describe the algorithm based on them and its implications.

Definition 6.4. (Regular Languages). Let Σ be an alphabet. Then a language L over Σ is regular if and only if it conforms to the following construction rules:
1. The empty language ∅ is regular,
2.
for each σ ∈ Σ the singleton language {σ} is regular,
3. if L_1 and L_2 are regular languages, then L_1 ∪ L_2, L_1 · L_2 and L_1* are also regular languages.

To describe a regular language L we can use formalisms like regular expressions or automata. In our case we will give a brief definition of the DFA describing the language L.

Definition 6.5. (DFA, Deterministic Finite Automaton). A Deterministic Finite Automaton describing the language L is given by a 5-tuple A = (Q, Σ, δ, q_0, F) where:
1. Q is the set of states composing the automaton,
2. Σ is the alphabet, a finite set of symbols,
3. δ : Q × Σ → P(Q) is the transition function,
4. q_0 is the initial state,
5. F is the set of final states.

We say that a word w is accepted by A if and only if there exists a path from q_0 to some q_f ∈ F regulated by δ. By Kleene's Theorem [99] each regular language L can be described by a finite automaton, in the sense that every word w ∈ Σ* belonging to L is accepted by this automaton. The language L and the automaton A can be interchanged. Given the definitions of language and automaton, we can define the Label Constrained Shortest Path Problem:

Definition 6.6. (Label Constrained Shortest Path Problem). Given an alphabet Σ, a language L ⊂ Σ*, a weighted, directed graph G = (V, E) with Σ-labeled edges and source and target nodes s, t ∈ V, we ask for a shortest path P from s to t where the sequence of labels along the edges of the path forms a word of L. Thus, given P = [v_1, ..., v_k], it must hold that:

label((v_1, v_2)) label((v_2, v_3)) ... label((v_{k-1}, v_k)) ∈ L.   (6.1)

This definition imposes no restrictions on the language L but, for our purposes, a regular language is sufficient to model the transitions among the modes of the network. In [98] the following theorem is proven:

Theorem 6.7. The Label Constrained Shortest Path Problem restricted to Regular Languages, RegL-CSPP, can be solved in deterministic polynomial time.
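Definition 6.5 can be illustrated with a small Python automaton over mode labels; a minimal sketch with hypothetical labels ('walk', 'bus'), not the thesis implementation. The transition map returns a set of next states, so the same structure also covers the non-deterministic case used later for the product network.

```python
class DFA:
    """Finite automaton A = (Q, Sigma, delta, q0, F) over mode labels.

    delta maps (state, symbol) -> set of next states; a word is
    accepted if the transitions lead from q0 to a final state.
    """
    def __init__(self, delta, q0, finals):
        self.delta = delta
        self.q0 = q0
        self.finals = finals

    def accepts(self, word):
        current = {self.q0}
        for symbol in word:
            nxt = set()
            for q in current:
                nxt |= self.delta.get((q, symbol), set())
            current = nxt
            if not current:          # no path regulated by delta
                return False
        return bool(current & self.finals)

# Hypothetical pedestrian automaton: walking and riding the bus are
# allowed and can be alternated freely; any other label is rejected.
delta = {
    ("foot", "walk"): {"foot"},
    ("foot", "bus"): {"bus"},
    ("bus", "bus"): {"bus"},
    ("bus", "walk"): {"foot"},
}
pedestrian = DFA(delta, "foot", {"foot", "bus"})
```

In the label-constrained setting, the word fed to accepts() is exactly the sequence of edge labels along a candidate path, as in equation (6.1).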
An algorithm solving this problem operates on the product graph between the automaton A, describing the allowed transitions among the modes, and the graph G.

Definition 6.8. (Product Network). Given a Σ-labeled graph G = (V, E) and a non-deterministic finite automaton A = (Q, Σ, δ, q_0, F), the product network G× = (V×, E×) is defined as follows:
1. The node set consists of product-nodes (v, q) ∈ V× where v ∈ V and q ∈ Q.
2. An edge e× = ((v_1, q_1), (v_2, q_2)) between two product-nodes is included in E× if and only if e = (v_1, v_2) ∈ E and there is a label σ ∈ Σ for which a transition q_2 ∈ δ(q_1, σ) exists in the automaton. The weight of e× is set to the weight of e and label(e×) is set to σ.

The resulting graph is uni-modal. In [98] the following is proven:

Theorem 6.9. The RegL-CSPP for a Σ-labeled graph G = (V, E) from source s ∈ V to target t ∈ V and a regular language L ⊆ Σ* can be reduced to the Shortest Path Problem as follows:
1. Construct a finite automaton A = (Q, Σ, δ, S, F) describing L, where S is the set of starting states,
2. construct the product network G× = G × A,
3. solve the Many-to-Many Shortest Path Problem for G× using:

S = ∪_{q_s ∈ S} (s, q_s),   T = ∪_{q_f ∈ F} (t, q_f)   (6.2)

where S and T are, respectively, the set of sources and the set of targets inside the product graph,
4. from all resulting paths pick the one having minimal length.

Let P = [(v_1, q_1), ..., (v_k, q_k)] be the shortest path obtained by the algorithm induced by Theorem 6.9. Then the length of the path in G is the same as the length Len(P) in G×. The actual path in G can be obtained by omitting the 'automaton part' of the product-nodes, thus yielding [v_1, ..., v_k]. On the other hand, the word conforming to L along the path can be obtained by concatenating the edge labels: word(P) = label((v_1, q_1), (v_2, q_2)) ...
label((v_{k-1}, q_{k-1}), (v_k, q_k)).   (6.3)

The creation of the product graph can be computed in time O(|G| · |A|), which is also polynomial. Hence, the algorithm induced by Theorem 6.9 runs in polynomial time. The space required to store the product graph G× is also in O(|G| · |A|), which is extremely expensive for large instances. This memory complexity can be easily avoided using the transition graph of the automaton A: we can compute the shortest paths for the Many-to-Many SPP implicitly, using a constrained version of the Dijkstra algorithm regulated by the transition graph of the automaton.

Figure 6.2: Automaton managing four states.

In this case, the memory complexity of G× becomes O(|G| + |A|), consistently reducing the memory space required. For instance, we can assume that a traveller on foot can use the public transport and walk through the streets; then the relative automaton will allow moving through the various modes, like buses, trains or subways, and the costs of the streets will be computed considering the average walking speed. A bike trip, instead, must be constrained to the streets or the transport networks: it is not allowed to ride a bicycle on a motorway or to bring it on a bus, while it is allowed to ride in urban centers or on secondary roads and to bring the bicycle on a train or a subway. All these aspects must be modeled by the automaton which, during the evaluation of the successor states and the computation of the new tentative costs, allows the algorithm to expand a state or not. Here we propose an augmented version of the algorithm Unimodal Routing, taking into account a Multi-Modal Network and an automaton.

Algorithm RegL-CSPP(s, t, V, E, w(), A)
1. Queue SQ = null // Queue of generated states of G×
2. // Queue initialization
3. for each (s, q_s) ∈ S do
4.   SQ.Insert((s, q_s), ((0, 0), 0), 0)
5. // Main cycle
6. while SQ ≠ null do
7.
((n, s), (p, q, f(p, q)), f(n, s)) = SQ.First()
8.   if (n, s) ∈ T
9.     then reached++
10.      if reached == |T|
11.        then break
12.  for each edge e = (n, succ), succ ∈ Γ_n do
13.    for each q′ ∈ δ(s, label(e)) do
14.      if (succ, q′) ∉ SQ
15.        then SQ.Insert((succ, q′), ((n, s), f(n, s)), f(n, s) + w(e))
16.        else if f((n, s)) + w(e) < f((succ, q′))
17.          then SQ.Update((succ, q′), ((n, s), f(n, s)), f((n, s)) + w(e))

In the algorithm RegL-CSPP we introduce the triple ((v, q), (p, r, f(p, r)), f(v, q)) indicating a certain node v in a state q and its value, also keeping track of the predecessor triple, of node p, state r and value f(p, r), to retrieve the path at the end of the computation. In lines 3-4 we initialize the queue SQ, collecting all the G× states produced during the computation, ordered by their value f. As in the algorithm Unimodal Routing, we implemented it with a heap. More specifically, in line 4 we initialize all the states for the starting point s ∈ S, setting the predecessors and the values to 0. In lines 8-11 we check whether all the target states t ∈ T have been reached, in which case we stop the computation. In lines 12-17 we update the data of each successor of n. In line 13 we check whether the transition over the edge e is allowed by the transition function δ of A; if allowed, we check whether the state has already been generated: if not, we insert the new triple into SQ (line 15), otherwise we update the existing triple with the new value, if the new path improves the solution (lines 16-17).

6.2.4 Time-Dependent Networks

In time-dependent routing we no longer have constant weights assigned to the edges. To accommodate time-dependency, we replace the edge weights by arbitrary functions f from some function space F. The shortest s-t-path in a time-dependent model then depends on the departure time τ_s at the source node.
This might result in shortest paths of different lengths for different departure times or, in general, even completely different routes. In the simplest case, we can describe these time functions as periodic functions f : R⁺₀ → R⁺₀ with period Π, meaning that f(τ) = f(τ mod Π) for each value τ. To make the model more realistic, we express F as a set of piecewise linear functions: a periodic function f : R⁺₀ → R⁺₀ is called piecewise linear if it consists of a finite number of segments of linear functions. Let f be a piecewise linear function; then f can be described by a finite set P of interpolation points, where each interpolation point p_i ∈ P consists of a departure time τ_i and an associated function value f(τ_i). The value of f for an arbitrary time τ is then computed by interpolation. This is done differently for time-dependent road networks and public transportation networks. Whereas in road networks we interpolate linearly between two subsequent interpolation points, the travel time function along a public transportation edge is interpreted as follows: first we have to wait for the next train or aeroplane to depart, and then we have to add its mere travel time along that edge.

Figure 6.3: Speed Profiles. (a) Road Network Speed Profile; (b) Public Transport Speed Profile.

Hence, for some arbitrary time point τ we use the nearest interpolation point τ_i in the future and interpolate by the formula:

f(τ) = -γ · (τ_i - τ) + f(τ_i)   (6.4)

with γ ∈ [-1, 0] as the fixed gradient of f. In figure 6.3 we show the two types of functions for the road network and the public transport network. To extend the Dijkstra algorithm to manage time-dependency, we need to add as input the time τ_s of the day at which the trip starts.
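The two interpolation schemes described above can be sketched as follows; a minimal Python example with hypothetical profile data, treating the road-network case (linear interpolation between subsequent interpolation points) and the public-transport case (wait for the next departure, then add its mere travel time, i.e. the gradient -1 of equation (6.4)).

```python
PI = 86400.0  # period: one day, in seconds

def road_travel_time(points, tau):
    """Road edge: linear interpolation between the two subsequent
    interpolation points enclosing tau. points is a sorted list of
    (departure_time, travel_time) pairs starting at time 0."""
    tau = tau % PI
    # close the profile periodically with the first point shifted by PI
    pts = list(points) + [(points[0][0] + PI, points[0][1])]
    for (t0, f0), (t1, f1) in zip(pts, pts[1:]):
        if t0 <= tau <= t1:
            alpha = (tau - t0) / (t1 - t0)
            return f0 + alpha * (f1 - f0)

def transit_travel_time(departures, tau):
    """Public-transport edge: wait for the next departure in the
    future, then add its travel time. departures is a sorted list
    of (departure_time, travel_time) pairs."""
    tau = tau % PI
    for t_dep, travel in departures:
        if t_dep >= tau:
            return (t_dep - tau) + travel
    # wrap around to the first departure of the next day
    t_dep, travel = departures[0]
    return (PI - tau) + t_dep + travel

# Hypothetical data: a congestion bump at 8:00, and two trains.
profile = [(0.0, 600.0), (28800.0, 1200.0), (43200.0, 600.0)]
timetable = [(28800.0, 7200.0), (32400.0, 4200.0)]
```

Querying the road profile at 4:00 (14400 s) returns the midpoint value between the free-flow and congested travel times; querying the timetable before 8:00 charges the waiting time plus the 8:00 train's travel time.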
There are two types of routing queries that we can perform on a time-dependent graph:

• Time Queries: The time-dependent version of Dijkstra's algorithm for computing time queries is almost identical to the time-independent version, as illustrated in Algorithm RegL-CSPP. The only two changes that need to be made are the following:
1. we need to supply a departure time τ as additional input, as mentioned before,
2. to evaluate the edge weights, we have to consider the current time at which we encounter the respective edge. Let e = (v, w) be an edge whose weight has to be evaluated; then the time at which we evaluate the function f_e of the edge e is the departure time τ plus the travel time along the path to v.

• Profile Queries: Using the previously described version of the time-dependent query algorithm yields only shortest paths for one particular departure time τ. While this seems to be the canonical generalization of the time-independent case, there is another type of query in time-dependent graphs, where we are interested in the shortest paths not only at one time point, but at all times of day. For example, in a railway network we state 8 o'clock as departure time τ for a query. Say there is a train departing at 8:00 that takes 2 hours to reach our destination, but there is another train departing at 9 o'clock that takes only 1 hour and 10 minutes. Taking the second train would be a suboptimal solution to the Earliest Arrival Problem (since the arrival time is 10 minutes later), but its sheer travel time is 50 minutes shorter. So it can be useful to present the user with the travel time for each possible departure time τ < Π. In other words, the result of the query should be a piecewise linear function f itself, where each interpolation point represents a shortest path for that particular time.

In our work we will consider only Time Queries.
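A time query can be sketched by replacing the constant weight w(e) with the edge's travel-time function evaluated at the time the edge is encountered; a minimal Python sketch, not the thesis code, assuming a hypothetical f_time(edge, tau) that returns the travel time of an edge entered at time tau.

```python
import heapq

def time_query(s, t, succ, f_time, tau_start):
    """Time-dependent Dijkstra: arr[n] is the earliest arrival time
    at node n when departing from s at tau_start. Each edge weight
    is evaluated at the current time at the edge's tail node."""
    arr = {n: float("inf") for n in succ}
    arr[s] = tau_start
    q = [(tau_start, s)]
    while q:
        tau, n = heapq.heappop(q)
        if n == t:
            return tau
        if tau > arr[n]:                 # stale heap entry
            continue
        for m in succ[n]:
            # current time at n, plus the time to travel (n, m) now
            cand = tau + f_time((n, m), tau)
            if cand < arr[m]:
                arr[m] = cand
                heapq.heappush(q, (cand, m))
    return arr[t]

# Hypothetical network: edge (0, 1) is slow during a 'rush hour'.
succ = {0: [1], 1: [2], 2: []}

def f_time(edge, tau):
    if edge == (0, 1) and 100.0 <= tau < 200.0:
        return 50.0      # congested
    return 10.0          # free flow

arrival_free = time_query(0, 2, succ, f_time, 0.0)
arrival_rush = time_query(0, 2, succ, f_time, 150.0)
```

The same source and target give different arrival times purely because of the departure time, which is exactly what distinguishes a time query from the time-independent case.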
In fact, the other types of queries on a time-dependent graph are extensions of, or have as their foundation, this type of query, so proposing a parallel algorithm for this problem can be useful for all the others.

6.3 Serial Algorithm

Dean [100] proposed the extended version of the Dijkstra algorithm for time-dependent graphs. In this section we propose the final version of the algorithm to compute the Shortest Path in a Multi-Modal Network with time dependencies. This extension of the problem is particularly hard to solve, because we cannot exploit some of the most famous and effective techniques for speeding up the computation of shortest paths, or we need pre-computation phases. Algorithms like Contraction Hierarchies [101], bi-directional search, Arc-Flag [102, 103] or ALT (A* with Landmarks and Triangle inequality) [104, 105] are not adaptable, or need long pre-computation times for larger graphs. Pajor [106] showed that almost all of these methods are ineffective in the time-dependent case.

6.3.1 Time-Dependent Augmented Dijkstra Algorithm

Algorithm T-D-RegL-CSPP(s, t, V, E, f_time, A, τ_start)
1. Queue SQ = null // Queue of generated states of G×
2. // Queue initialization
3. for each (s, q_s) ∈ S do
4.   SQ.Insert((s, q_s), ((0, 0), τ_0), τ_start)
5. // Main cycle
6. while SQ ≠ null do
7.   ((n, h), (p, q, τ_{p,q}), τ_{n,h}) = SQ.First()
8.   if (n, h) ∈ T
9.     then reached++
10.      if reached == |T|
11.        then break
12.  for each succ ∈ Γ(n) do
13.    e = (n, succ)
14.    for q′ ∈ δ(h, label(e)) do
15.      if (succ, q′) ∉ SQ
16.        then SQ.Insert((succ, q′), ((n, h), τ_{n,h}), τ_{n,h} + f_time(e, τ_{n,h}))
17.        else if τ_{n,h} + f_time(e, τ_{n,h}) < τ_{succ,q′}
18.          then SQ.Update((succ, q′), ((n, h), τ_{n,h}), τ_{n,h} + f_time(e, τ_{n,h}))

We introduce the function f_time(), calculating the time needed to travel the edge e. This function can be a collection of methods created to evaluate the traveling time, based on various factors: the road type (primary, secondary, etc.
), the type of public transport (bus, subway, etc.), the moment of the day (traffic profiles or other factors), and the state of the automaton (depending on whether we are traveling on foot or by bicycle). The algorithm is equivalent to RegL-CSPP, but here the time is computed for each query inside the multi-graph G. Depending on the granularity of the problem, the time can be expressed in minutes or seconds.

6.4 GPU Algorithm

In this section we propose an extension of the parallel Dijkstra algorithm proposed by [88] for computing the EAP in a Multi-Modal Time-Dependent Graph, and its porting to a GPU using the CUDA [107] programming model. First we present the basic algorithm; then we introduce a new approach to compute the frontier set F_i to enable the parallelism of the method.

6.4.1 GPU Dijkstra Algorithm

We can distinguish two parallelization alternatives that can be applied to Dijkstra's algorithm. The first one parallelizes the internal operations of the sequential Dijkstra algorithm, while the second one runs several Dijkstra computations over disjoint sub-graphs in parallel [108]. Our approach focuses on the first solution. The key to the parallelization of a single sequential Dijkstra algorithm resides in the inherent parallelism of its loops. At each iteration, the outer loop selects a node for which to compute new distance labels. Inside this loop, the algorithm relaxes its outgoing edges in order to update the old distance labels: that is the inner loop. Parallelizing the outer loop implies computing, at each iteration i, a frontier set F_i of nodes that can be settled in parallel without affecting the correctness of the algorithm. The main problem here is to identify this set of nodes v whose tentative distances val(v) from the source s are guaranteed to be the final shortest distances. Crauser et al. [109] and Crobak et al. [110] proposed two solutions addressing this problem.
Parallelizing the inner loop implies traversing simultaneously the outgoing edges of the frontier node. One of the algorithms presented in [111] is an example of this parallelization approach. Following the approach in [109], we explain the method for identifying the frontier set F_i and maximizing its cardinality: it is straightforward to notice that the bigger the frontier set, the higher the level of parallelism of the method.

6.4.1.1 Frontier Set

At each iteration i, the method calculates the minimum tentative distance of the nodes belonging to the unsettled set U_i. The node whose tentative distance is equal to this minimum value can be settled and becomes the frontier node; its outgoing edges are traversed to relax the distances of the adjacent nodes. In order to parallelize the algorithm, we need to identify which nodes can be settled and used as frontier nodes at the same time. Martín et al. [112] insert into the frontier set F_{i+1} all nodes with this minimum tentative distance, with the aim of processing them simultaneously. Crauser et al. [109] introduce a more aggressive enhancement, augmenting the frontier set with nodes with longer tentative distances. At each iteration i the algorithm computes, for each node of the unsettled set u ∈ U_i, the sum of:
1. its tentative distance,
2. the minimum cost of its outgoing edges.
Afterwards, it calculates the minimum of these computed values. Finally, those nodes whose tentative distance is lower than or equal to this minimum value can be settled, becoming the frontier set. We introduce Δ_i as the threshold value computed at each iteration i, such that any unsettled node u with val(u) ≤ Δ_i can be safely settled. The bigger the Δ_i value, the more parallelism is exploited. However, depending on the particular graph being processed, the use of a very ambitious Δ_i may induce overheads that destroy any performance gain with respect to the sequential execution.
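The Crauser criterion described above can be sketched in a few lines of Python; a minimal illustration on hypothetical iteration data, not the thesis CUDA code.

```python
def crauser_frontier(vals, unsettled, delta_v):
    """Crauser-style frontier extraction.

    vals[u]: tentative distance of u; delta_v[u]: precomputed minimum
    weight among the outgoing edges of u. Every unsettled node whose
    tentative distance does not exceed delta_i can be settled safely,
    because no shorter path to it can still appear.
    """
    delta_i = min(vals[u] + delta_v[u] for u in unsettled)
    frontier = {u for u in unsettled if vals[u] <= delta_i}
    return delta_i, frontier

# Hypothetical iteration state.
vals = {1: 4.0, 2: 6.0, 3: 9.0}
delta_v = {1: 3.0, 2: 1.0, 3: 2.0}
delta_i, frontier = crauser_frontier(vals, {1, 2, 3}, delta_v)
```

With the classical rule only node 1 (the minimum) would be settled; the threshold Δ_i = min(4+3, 6+1, 9+2) = 7 also admits node 2, enlarging the frontier that can be processed in parallel.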
The basic parallel Dijkstra method follows the idea proposed by Crauser [109] for incrementing each Δ_i. For every node v ∈ V, the minimum weight of its outgoing edges, that is, Δ_v = min {w(v, z) : (v, z) ∈ E}, is calculated in a pre-computation phase. At each iteration i of the external loop, having the tentative distances of all the nodes in the unsettled set, we define

Δ_i = min {(val(u) + Δ_u) : u ∈ U_i}   (6.5)

Then we insert into the frontier set F_{i+1} every node v with val(v) ≤ Δ_i.

6.4.1.2 GPU Implementation

Having defined the concepts used for exposing the inherent parallelism of the Dijkstra algorithm, we provide the complete pseudo-code of the method on the GPU.

Algorithm GPU Dijkstra(s, t, V, E, w(), Δ_v)
1. vals, preds, U_i, F_i
2. Δ_i = ∞
3. // Data structures initialization
4. for i = 1 to |V| do
5.   if i == s
6.     then vals[i] = 0, preds[i] = 0
7.     else vals[i] = ∞, preds[i] = ∞
8. // GPU algorithm
9. Initialize<<<>>>(U_i, F_i) // Frontier and unsettled nodes initialization
10. Initialize(Δ_i) // Threshold initialization
11. while Δ_i ≠ ∞ do
12.   // Relax the frontier nodes
13.   RELAX-KERNEL<<<>>>(vals, preds, F_i, U_i)
14.   // Update Δ_i
15.   Δ_i = DELTA-UPDATE-KERNEL<<<>>>(vals, U_i, Δ_v)
16.   // Update F_{i+1}
17.   FRONTIER-KERNEL<<<>>>(Δ_i, F_i, U_i)

Algorithm RELAX-KERNEL(vals, preds, F_i, U_i)
1. n_idx = thread.id
2. if F_i[n_idx] == true
3.   then for each u ∈ Γ_{n_idx} do
4.     if U_i[u] == true
5.       then Start Atomic Operations
6.         if vals[u] > vals[n_idx] + w(n_idx, u)
7.           then vals[u] = vals[n_idx] + w(n_idx, u)
8.             preds[u] = n_idx
9.         End Atomic Operations

Algorithm FRONTIER-KERNEL(Δ_i, F_i, U_i, Δ_v, vals)
1. n_idx = thread.id
2. F_i[n_idx] = false
3. if U_i[n_idx] == true ∧ vals[n_idx] ≤ Δ_i
4.   then U_i[n_idx] = false, F_i[n_idx] = true

The main procedure, GPU Dijkstra, uses three kernels to relax the nodes and to create the frontier and unsettled node sets.
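The main loop above can be emulated sequentially to check its logic; the sketch below is plain Python, not CUDA, where each 'kernel' is a pass over all node indices, mirroring what the GPU threads do over the boolean F and U arrays. The small graph and the precomputed Δ_v values are hypothetical.

```python
INF = float("inf")

def gpu_dijkstra_emulation(s, n, succ, w, delta_v):
    """Sequential emulation of the three-kernel loop:
    relax the frontier, reduce the threshold, rebuild the frontier."""
    vals = [INF] * n
    preds = [-1] * n
    vals[s] = 0.0
    U = [True] * n            # unsettled flags
    F = [False] * n           # frontier flags
    U[s], F[s] = False, True
    while True:
        # RELAX-KERNEL: relax the outgoing edges of frontier nodes
        for v in range(n):
            if F[v]:
                for u in succ[v]:
                    if U[u] and vals[u] > vals[v] + w[(v, u)]:
                        vals[u] = vals[v] + w[(v, u)]
                        preds[u] = v
        # DELTA-UPDATE-KERNEL: reduction over the unsettled set
        delta_i = min((vals[u] + delta_v[u] for u in range(n) if U[u]),
                      default=INF)
        if delta_i == INF:
            break
        # FRONTIER-KERNEL: settle every node below the threshold
        for v in range(n):
            F[v] = False
            if U[v] and vals[v] <= delta_i:
                U[v], F[v] = False, True
    return vals, preds

succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
w = {(0, 1): 1.0, (0, 2): 4.0, (1, 3): 2.0, (2, 3): 1.0}
delta_v = {0: 1.0, 1: 2.0, 2: 1.0, 3: INF}  # min outgoing edge weights
vals, preds = gpu_dijkstra_emulation(0, 4, succ, w, delta_v)
```

On the GPU each of the three passes is launched as a kernel over all node indices; only the relaxation step needs atomicity, since two frontier nodes may relax the same successor concurrently.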
In lines 3-10 we initialize the data structures used by the algorithm; lines 11-17 form the main loop, which stops once all the nodes in the graph have been relaxed or the node t has been reached. The kernel function RELAX-KERNEL relaxes all the nodes inside F_i (line 13); DELTA-UPDATE-KERNEL updates the Δ_i value using a parallel reduction (this procedure is a modified version of the reduce3 procedure taken from the CUDA SDK that comes along with the CUDA package from Nvidia); FRONTIER-KERNEL is the kernel function that creates the F set of the next iteration. The RELAX-KERNEL procedure updates the tentative distances of the nodes inside the F set. Each thread elaborates a node n_idx, relaxing all its unsettled successor nodes u ∈ Γ_{n_idx}. The relaxation at line 6 is an atomic operation among the threads, needed to avoid race conditions: at the same time, other threads could update the same memory location (the same unsettled node), generating inconsistent reads or writes. The FRONTIER-KERNEL kernel generates the U_{i+1} and F_{i+1} sets. Each thread is assigned to a node and checks whether the node's tentative distance is less than or equal to the Δ_i threshold and whether the node is in the U_i set; in this case the kernel inserts the node into the frontier set. The DELTA-UPDATE-KERNEL is a simple procedure implemented to avoid a data transfer between the CPU and the GPU: basically, it is a parallel reduction among the nodes in the U_i set, computing the Δ_i threshold of equation (6.5).

6.4.2 Dynamic Frontier Definition

We enhanced the frontier-creation kernel to address the time-dependency of our model: we need to evaluate the Δ_v values at each iteration i. The main problem is that we cannot evaluate a priori the minimum cost among the outgoing edges of the vertex v, because of the time dependency.
The only way to pre-calculate these values would be to evaluate the minimum cost among the outgoing edges of each node for each second of the day (the granularity of our problem), leading to a great amount of memory used and a long computation time. We define the set R_i as the set of the nodes u ∈ U_i whose tentative distances have been updated from the initial ∞ value: all possible members of the F_{i+1} set at iteration i + 1. For each node r ∈ R_i we evaluate, at time τ, the minimum-cost outgoing edge:

Δ_v = min {f_time(r, u, h, τ_{r,u,h}) : r ∈ R_i, u ∈ Γ(r) ∩ U_i, (r, u) ∈ E, h ∈ A}   (6.6)

where r is the node in the R_i set, u is a successor of r in the U_i set and h a state of the automaton A. f_time is the time function evaluating the cost of the edge from r to u, in the state h of the automaton, at time τ. The Δ_v values are computed at every iteration i; then they take part in the evaluation of the Δ_i threshold used, as described in the FRONTIER-KERNEL, to create the F_{i+1} set.

6.4.3 GPU Time-Dependent Algorithm

The porting to the GPU environment implies some modifications to the data structures used by the algorithm. We cannot use dynamic data structures, performance killers on the GPU; to keep track of the tentative distances and the predecessors, we implement two bi-dimensional arrays, times and preds, in which one dimension is the cardinality |V| and the other is the number h of states the traveler can be in (foot, car, etc.). For each state of the traveler we have an automaton A_h regulating the transitions among the modes of the network. For a more readable notation, we will call A the macro-automaton composed of all the automata A_h, one for each type of traveler.

Algorithm GPU T-D-RegL-CSPP(s, t, V, E, f_time(), Δ_v)
1. vals, preds, U_i, F_i, R_i
2. Δ_i = ∞
3. for h = 1 to |TravelerStates| do
4.   for i = 1 to |V| do
5.     if i == s
6.       then vals[h][i] = 0, preds[h][i] = 0
7.       else vals[h][i] = ∞, preds[h][i] = ∞
8. // GPU algorithm
9.
// Frontier, unsettled nodes and R initialization
10. Initialize<<<>>>(U_i, F_i, R_i)
11. Initialize(Δ_i) // Threshold initialization
12. while Δ_i ≠ ∞ do
13.   // Relax the frontier nodes
14.   RELAX-KERNEL<<<>>>(vals, preds, F_i, U_i, A)
15.   // Update Δ_v
16.   Δ_v = DYNAMIC-Δ_v-KERNEL<<<>>>(vals, U_i, R_i, A)
17.   // Update Δ_i
18.   Δ_i = DELTA-UPDATE-KERNEL<<<>>>(vals, U_i, Δ_v)
19.   // Update F_{i+1}
20.   FRONTIER-KERNEL<<<>>>(Δ_i, F_i, U_i)

Algorithm DYNAMIC-Δ_v-KERNEL(U_i, R_i, Δ_v, vals, A, τ)
1. n_idx = thread.x.id
2. h_idx = thread.y.id
3. if R_i[n_idx] == true
4.   then τ = vals[h_idx][n_idx]
5.     for each u ∈ Γ_{n_idx} ∩ U_i do
6.       edge e = (n_idx, u)
7.       for each q′ ∈ δ(h_idx, label(e)) do
8.         if Δ_v[n_idx] > f_time(e, τ)
9.           then Δ_v[n_idx] = f_time(e, τ)

Algorithm RELAX-KERNEL(vals, preds, F_i, U_i, A)
1. n_idx = thread.x.id
2. h_idx = thread.y.id
3. if F_i[n_idx] == true
4.   then τ = vals[h_idx][n_idx]
5.     for each w ∈ Γ_{n_idx} do
6.       edge e = (n_idx, w)
7.       for each q′ ∈ δ(h_idx, label(e)) do
8.         if U_i[w] == true
9.           then Start Atomic Operations
10.            vals[h_idx][w] = min(vals[h_idx][w], vals[h_idx][n_idx] + f_time(e, τ))
11.            preds[h_idx][w] = n_idx
12.          End Atomic Operations

The GPU T-D-RegL-CSPP algorithm has the same behavior as the original one, except for the DYNAMIC-Δ_v-KERNEL at line 16, described in 6.4.2. The DELTA-UPDATE-KERNEL and FRONTIER-KERNEL are the same as described in 6.4.1.2, only the input data change. In RELAX-KERNEL we introduced the automaton A, which regulates, at line 7, the transitions to the nodes w ∈ Γ_{n_idx}. Here we have two indexes to address the vals and preds matrices: the h_idx index represents the traveler's state and the n_idx index the node to expand. In this case we use bi-dimensional block indexing to create the CUDA computational grid on the GPU. In lines 9-12 we have the atomic updates of the tentative distances and the predecessors.
The τ variable (line 4) represents the current time, used to evaluate the edge cost with the ftime function in line 10. The Dynamic-∆v-Kernel exploits the same bi-dimensional indexing used for the Relax-Kernel: in line 5 we select the successors of the nidx vertex that can be in the i + 1 iteration, regulated by the automaton (line 7); we evaluate the cost of the edge at line 8, using the ftime function, updating, if necessary, the ∆v values in line 9.

6.5 Shared Memory Algorithm

Analyzing the serial algorithm in detail, we can observe that we can compute, for each traveler state h, a T-D-RegL-CSPP in a state-space of labels at least equal to the dimension of the graph's node set V. Under this perspective, we can describe the state-space as a bi-dimensional matrix with h rows, one for each traveler state, and |V| columns. For each state h, we can compute a Time-Dependent-RegL-CSPP, as described by the algorithm T-D-RegL-CSPP, completely decoupled from the others. In fact we will have independent data structures (the row h in the matrices for the predecessors and the tentative distances) and an independent priority queue. Having noticed these peculiarities, we can re-write the algorithm exploiting the characteristics described above:

Algorithm T-D-RegL-CSPP2(s, t, V, E, ftime, A, τstart)
 1. Queue Q = null  // queue of the generated states of G×
 2. vals, preds
 3. // Data Structures Initialization
 4. for h = 1 to |TravellerStates| do
 5.   for n = 1 to |V| do
 6.     if n == s
 7.       then vals[h][n] = τstart, preds[h][n] = 0
 8.       else vals[h][n] = ∞, preds[h][n] = ∞
 9. // Algorithm
10. for h = 1 to |TravellerStates| do
11.   // Queue Initialization
12.   Q.Insert((s))
13.   while Q ≠ null do
14.     n = Q.First()
15.     τ = vals[h][n]
16.     if (n, h) ∈ T
17.       then break
18.     for each u ∈ Γ(n) do
19.       edge e = (n, u)
20.       for q′ ∈ δ(h, label(e)) do
21.         if vals[h][u] > vals[h][n] + ftime(e, τ)
22.           then vals[h][u] = vals[h][n] + ftime(e, τ)
23.             preds[h][u] = n
24.             if Q.InQueue(u)
25.               then Q.Update(u)
26.               else Q.Insert(u)

In lines 4-8 we initialize the tentative distance matrix vals and the predecessor matrix preds. For each traveler state h we initialize the start node to τstart and its predecessor to 0. In lines 10-26 we evaluate, for each traveler state, the T-D-RegL-CSPP as done in the original algorithm. As we can see, every state h is independent from the others, allowing us to evaluate the states in parallel.

6.5.1 OpenMP Porting

The OpenMP API [113] seems to be the best option to parallelize the algorithm proposed in the previous paragraph. In fact, we can easily parallelize the method, using the fork/join programming model and assigning a traveler state to each thread. With a few pre-processor directives, we can exploit the parallelism inside multi-core/multi-threaded CPUs.

Algorithm T-D-RegL-CSPP OMP(s, t, V, E, ftime, A, τstart)
 1. Queue Q = null  // queue of the generated states of G×
 2. vals, preds
 3. // Data Structures Initialization
 4. for h = 1 to |TravellerStates| do
 5.   for n = 1 to |V| do
 6.     if n == s
 7.       then vals[h][n] = τstart, preds[h][n] = 0
 8.       else vals[h][n] = ∞, preds[h][n] = ∞
 9. // Algorithm
10. SetThreads(|TravellerStates|)
11. # start parallel region
12. h = GetThreadID()
13. // Queue Initialization
14. Q.Insert((s))
15. while Q ≠ null do
16.   n = Q.First()
17.   τ = vals[h][n]
18.   if (n, h) ∈ T
19.     then break
20.   for each u ∈ Γ(n) do
21.     edge e = (n, u)
22.     for q′ ∈ δ(h, label(e)) do
23.       if vals[h][u] > vals[h][n] + ftime(e, τ)
24.         then vals[h][u] = vals[h][n] + ftime(e, τ)
25.           preds[h][u] = n
26.           if Q.InQueue(u)
27.             then Q.Update(u)
28.             else Q.Insert(u)
29. # end parallel region

As we can see, the modifications to the code are minimal: we only introduce the pre-processor directives, in pseudo-code, in lines 10-12. In line 10 we spawn a thread for each state, and in line 12 we simply get the thread id and use it to index the state.
In line 11 we open the parallel region (fork) and, finally, in line 29 we close it (join).

6.6 Computational Results

We tested our parallel methods, reporting the speed-up factor with respect to the serial version of the algorithm. Our test machine is a workstation equipped with an Intel Core i7 920 @ 2.9 GHz with 6 Gigabytes of RAM and an Nvidia GTX 570 GPU with 1.28 Gigabytes of GDDR5 memory on board and 480 CUDA cores @ 1.464 GHz. For each instance we report the serial time, the corresponding parallel time and the speed-up factor obtained. For each instance, we evaluated the performance for an increasing number of traveller states h (2, 3, 4, 6, 8) and, consequently, for a bigger state-space for the algorithm. The time horizon is one day and the granularity of time is expressed in seconds (one day = 86400 seconds). The speed-up factor is computed as: SpeedUp = Timeserial / Timeparallel.

6.6.1 Data Sets Creation

The data set used as benchmark for the methods is a set of 10 instances generated as follows:

• we selected 10 urban centres from the OSM [114] repository,
• we extracted the relative graph, keeping track of the street classes (primary, secondary, etc.),
• we created 4 dense random graphs of 500000, 400000, 300000 and 200000 nodes, for the modes over the geographical network,
• we connected all the nodes of the random graphs to the geographical network with transition edges,
• we connected the modes among them with transition edges.

We used the street classes from OSM to calculate the journey time, using the speed limits relative to each class and the speed profiles described in 6.2.4 as penalties. For the other modes, which we assume represent the public transport networks (metro, bus, trams, etc.), we used a constant speed limit and the speed profiles for public transport described in the previous sections.
The path is calculated as depicted in figure 6.4, from one extreme point of the area to the other, to force the method to visit the whole graph.

Figure 6.4: Test Path in Berlin

6.6.2 Results

In this section we provide the experimental results for the instances described above. First, we give some data relative to the instance dimensions, then the corresponding computational times for the serial implementation, the GPU implementation and the parallel CPU implementation. We indicate with Φ = TravelerStates × |V| the dimension of the state-space.

Figure 6.5: Serial times vs OpenMP times for the Berlin instance
Figure 6.6: Speed-Up for different h values for the Los Angeles instance

Instance     Nodes    Edges     Modes  Φ(h=2)   Φ(h=3)   Φ(h=4)    Φ(h=6)    Φ(h=8)
Berlin       1594184  26360404  5      3188368  4782552  6376736   9565104   12753472
London       1962806  27134087  5      3925612  5888418  7851224   11776836  15702448
Los Angeles  1845707  26957936  5      3691414  5537121  7382828   11074242  14765656
Melbourne    1636709  26438780  5      3273418  4910127  6546836   9820254   13093672
Milan        1515734  26154972  5      3031468  4547202  6062936   9094404   12125872
Moskow       1621233  26423233  5      3242466  4863699  6484932   9727398   12969864
New York     1650314  26519156  5      3300628  4950942  6601256   9901884   13202512
Paris        1654535  26457090  5      3309070  4963605  6618140   9927210   13236280
Rome         1490676  26098377  5      2981352  4472028  5962704   8944056   11925408
Tokyo        2600054  29232779  5      5200108  7800162  10400216  15600324  20800432
Table 6.1: Test instance dimensions

Serial times (seconds):
Instance     h=2      h=3      h=4      h=6       h=8
Berlin       3.90491  7.19491  9.16738  12.35290  18.62190
London       4.27784  7.43718  9.51450  13.39980  20.09110
Los Angeles  4.31991  6.99828  9.12968  12.88160  20.42397
Melbourne    4.00122  6.65853  8.29722  10.70190  16.39150
Milan        3.98671  6.97139  9.37055  12.58200  18.68020
Moskow       3.91440  6.70570  9.01145  11.84520  18.17520
New York     4.02293  6.91773  9.29118  12.68630  19.21360
Paris        4.17275  7.10199  9.21067  12.81390  19.63470
Rome         3.88276  6.86815  8.94360  12.50710  18.54690
Tokyo        2.91007  6.02751  9.13981  13.20910  16.69850

OpenMP times (seconds):
Instance     h=2      h=3      h=4      h=6      h=8
Berlin       4.00861  4.83804  4.64429  5.04666  5.84201
London       4.12370  4.51029  4.80851  5.20207  6.24187
Los Angeles  4.22763  4.33566  4.63512  5.00295  5.91230
Melbourne    3.97519  4.33418  4.68507  5.03967  5.89474
Milan        4.05487  4.42006  4.72360  5.10111  5.82905
Moskow       3.99721  4.29074  4.65833  4.92247  5.77115
New York     3.85927  4.25159  4.71184  4.96807  5.82088
Paris        4.03054  4.39365  4.73818  5.13069  5.87103
Rome         3.76575  4.09594  4.49787  5.02301  5.71587
Tokyo        2.53121  4.39309  4.97617  5.24188  5.91094

Speed-Up:
Instance     h=2    h=3    h=4    h=6    h=8
Berlin       1.0 X  1.5 X  2.0 X  2.4 X  3.2 X
London       1.0 X  1.6 X  2.0 X  2.6 X  3.2 X
Los Angeles  1.0 X  1.6 X  2.0 X  2.6 X  3.5 X
Melbourne    1.0 X  1.5 X  1.8 X  2.1 X  2.8 X
Milan        1.0 X  1.6 X  2.0 X  2.5 X  3.2 X
Moskow       1.0 X  1.6 X  1.9 X  2.4 X  3.1 X
New York     1.0 X  1.6 X  2.0 X  2.6 X  3.3 X
Paris        1.0 X  1.6 X  1.9 X  2.5 X  3.3 X
Rome         1.0 X  1.7 X  2.0 X  2.5 X  3.2 X
Tokyo        1.1 X  1.4 X  1.8 X  2.5 X  2.8 X
Table 6.2: Computational results on CPU

CUDA times (seconds):
Instance     h=2       h=3       h=4       h=6       h=8
Berlin       15.29472  35.38592  45.31485  60.45892  90.10293
London       15.85713  34.75631  44.36134  61.98234  100.42816
Los Angeles  14.84052  35.02742  44.23401  62.09213  100.90235
Melbourne    14.94752  34.46021  46.28461  61.45213  88.16702
Milan        15.95023  33.67302  42.01237  62.23768  90.56330
Moskow       14.95064  31.10204  43.47502  60.14586  91.09123
New York     15.88630  36.67120  44.65321  60.27451  92.34551
Paris        15.68335  34.05063  45.75230  63.35672  91.76812
Rome         15.95033  33.12574  45.78124  61.01203  90.01445
Tokyo        14.12753  35.69305  42.87932  60.36789  86.13599

Table 6.3: Computational results on GPU (serial times as in Table 6.2)

6.7 Considerations and Future Work

As emerged from the experimental tests, for this particular problem the GPU algorithm is not effective. The main reason is that the geographical network is a really sparse graph, and the cardinality of the frontier set Fi is too small to allow the GPU to unleash its massive parallelism. The multi-core/OpenMP version, instead, brings results near to the theoretical speed-up for a quad-core processor on the bigger instances. The other bottleneck for the GPU performance is the load balancing inside the device, where a large part of the computational kernels is composed of serial operations.
The newer Nvidia chips (from the Kepler GK110 onwards, including the Maxwell series) partially resolve this problem with a feature called Dynamic Parallelism, which allows a kernel function to call another kernel, making the load balancing over non-uniform data structures (e.g. adjacency lists) more effective. As soon as possible we will provide a method exploiting this feature.

Chapter 7
Membership Overlay Problem

In this chapter we consider a parallel version of the Subgradient Method, used to solve the Dual Lagrangean Problem. We chose a network design problem, the Membership Overlay Problem (MOP), relative to Peer-to-Peer networks. We designed three parallel algorithms for the GPU Computing, Shared Memory and Distributed Memory environments, exploiting CUDA, OpenMP and MPI respectively.

7.1 Introduction

Peer-to-Peer (P2P) networks currently represent a conspicuous part of the internet data traffic. This model is the counterpart of the well-known and well-studied client-server model, and a large number of network applications (legal or not) already adopt it. The proliferation of this kind of networks has brought to the attention of the academic community some challenging problems related to the P2P model. The literature, for example, lacks a precise and coherent definition, and a precise and exhaustive description is hard to find. Informally speaking, a P2P network is a totally decentralized network, composed of peers that share, exchange and distribute data and information. The success of this kind of networks is related to the anonymous identity of the members, allowing, in some cases, not entirely legal trades. Another peculiarity of P2P is the dynamic topology of the network. A peer can connect to the network for a limited time and share its data or its bandwidth with the others.
Once it disconnects, the topology changes and some routes for the data packets or connections must be modified to maintain the network performance. This strictly dynamic feature is definitely interesting and has raised critical problems for some applications. The P2P paradigm can be a fundamental part of large-scale distributed computation infrastructures or grid computing applications. The dynamic nature of the network topology implies a significant degradation of the performance or a non-optimal configuration of communications and connections, compromising the system scalability. Most P2P applications are based on the TCP/IP communication protocol and, virtually, all the peers are connected to each other. A P2P network, however, sits at the application level of the ISO/OSI stack, with its own routing and topology. In this scenario, network design problems become critical, mainly the ones dealing with the optimization of the connections for enhancing the network's performance. The Membership Overlay Problem (MOP) is one of these network design problems that can arise in relation to a P2P environment. The problem consists in the creation of an overlay network that maximizes the bandwidth throughput among the peers inside the P2P application. In this chapter we will describe a Lagrangean Relaxation for the problem and a Distributed Subgradient for solving the relative Lagrangean Dual Problem, proposed originally by Boschetti et al. [115]. We will propose three parallel algorithms based on this Subgradient, designed to exploit three 'de-facto' standard parallel programming models, CUDA, OpenMP and MPI, relative to the GPU, Shared Memory and Distributed Memory environments respectively.

7.2 Membership Overlay Problem

In this section we will illustrate the Membership Overlay Problem, giving first an informal definition and then a mathematical formulation, describing its inherent characteristics.
MOP is a network design problem: it focuses on the creation of a network topology following precise performance, fault tolerance or efficiency constraints. Network Design is currently one of the most interesting and studied classes of problems in the CO field. It affects many real-world applications, like supply chain logistics or telecommunications. The problem consists in maximizing the throughput of a P2P network where the nodes (peers) are described by a percentage of on-line time and a bandwidth to the internet. The edges are the connections among these nodes, each with a capacity. The solution of the problem is a subset of these edges that maximizes the throughput, creating an overlay network and defining a topology among the nodes. Before the mathematical formulation, we will give an example to illustrate the problem. Figure 7.1 depicts a graph representing a network with the characteristics cited before; without loss of generality, from now on we will represent the network as a graph with nodes and edges.

Figure 7.1: P2P network with 4 nodes (node 1: w=1024, p=0.70; node 2: w=512, p=0.90; node 3: w=1024, p=0.60; node 4: w=2048, p=0.90).

Every node has a percentage, p, of on-line time and a bandwidth, w, representing the bandwidth of its internet connection. The graph is complete, due to the TCP/IP protocol. The edges are not oriented, and we can assume they are bi-directional. Each edge has a capacity b equal to the minimum value w of the nodes that it connects, as described in figure 7.2. Every edge has a probability, p′, equal to the product of the p probabilities of its extremes (figure 7.3). Once the graph is characterized, we need to set other parameters that constrain the problem's solution. Birattari et al. [116] and Rardin and Uzsoy [117] proposed these parameters.
To guarantee a minimum QoS (Quality of Service), it is necessary to set a bandwidth lower bound, l, for each edge, of 14 Kbps for instance, as shown in figure 7.4.

Figure 7.2: Edge bandwidth values b (b = min of the endpoint bandwidths, e.g. b = 512 for edge (1, 2)).
Figure 7.3: Edge p′ values (p′ = product of the endpoint probabilities, e.g. p′ = 0.63 for edge (1, 2)).
Figure 7.4: Edge lower bound l (l = 14 on every edge).

Similarly, it is necessary to set a bandwidth upper bound, u, for each edge, to avoid the congestion of the nodes and of the network. For the results cited above, we can set this value to 256 Kbps (figure 7.5).

Figure 7.5: Edge upper bound u (u = 256 on every edge).

Figure 7.6 describes the optimal solution for the proposed example. More in detail, the value of an edge is given by uij p′ij, and the optimal value of the problem is the sum of these values over the edges in solution: Σ uij p′ij. In figure 7.6 we show the problem's solution, coloring the selected edges in green. The final value is 775.68 and the edge (2, 3) has been rejected from the solution. The resulting sub-graph is the overlay network that maximizes the throughput for the P2P network.

Figure 7.6: MOP Optimal Solution (selected edge values: 161.28, 207.36, 107.42, 161.28, 138.34).
7.2.1 Definition and Mathematical Formulation

Given a graph G = (V, E) with n = |V|, the vertices are the peers of the P2P network and the edges the possible connections among the vertices. Two nodes i and j are connected by the edge (i, j) if both nodes can send messages to each other, using the underlying routing structure (typically the internet). Each node i can enter and exit the network, according to the P2P model, and, when it is on-line, it can share a limited amount of bandwidth. Each node is characterized by two weights: pi and wi, respectively the connection time, expressed as a percentage (1 = always on-line, 0 = never on-line) as described by Saroiu et al. [118], and the available bandwidth of its connection. δ(i) is the neighbor set of node i. The Membership Overlay Problem consists in finding a sub-graph G′ = (V, E′) of G. An edge of G′ indicates that its two nodes decide to allocate a part of their bandwidth to communicate between them. If bi and bj are the bandwidths of nodes i and j, the available bandwidth for the edge (i, j) is bij = min{bi, bj}. The two bandwidth values can be equal to the wi and wj values, saturating the nodes, or lower, in relation to the other nodes in the graph. In any case, we will define an upper bound, uij, and a lower bound, lij, to guarantee a minimum QoS and to avoid the network's congestion, respectively. The graph G′ obtained will have the following peculiarities:

1. the global throughput is maximized;
2. the graph G′ diameter is logarithmic, creating a connected graph;
3. the total bandwidth used by each node i is less than or equal to bi.

From these first considerations, we can assume that each node i does not need a global knowledge of the graph, but only the state of the nodes in δ(i). Following the definition given by [117], we can classify MOP as a control problem that 'must be solved frequently and it involves decision over a relatively short horizon'.
Solutions for these kinds of problems 'must be obtained in near real time, algorithms must run in fractions of seconds' and 'quality matters somewhat less' than speed. This kind of problems is often linked to critical applications in robotics or automation in general, as well as in high-precision tools or car control units. Taking into account these factors, it seems mandatory to design fast and effective algorithms for giving a good solution to the problem in a short execution time. In the next section we will provide a Mixed Integer formulation for the static version of MOP (SMOP), from which a polynomial upper bound and a relaxation framework will be derived.

7.2.2 MIP Formulation

The Static Membership Overlay Problem can be formulated as follows. We have two sets of decision variables, {xij} and {ξij}, (i, j) ∈ E. The continuous variables xij define the bandwidth between i and j, with 0 ≤ xij ≤ uij. The binary variables ξij are 1 if the edge (i, j) is used to connect the nodes, 0 otherwise. The MIP formulation is the following:

  zSMOP = max Σ(i,j)∈E pij xij                        (1)
  s.t.  Σj∈δ(i) xij ≤ bi,            i ∈ V            (2)
        lij ξij ≤ xij ≤ uij ξij,     (i, j) ∈ E       (3)      (7.1)
        ξij ∈ {0, 1},                (i, j) ∈ E       (4)

where pij = pi × pj for each edge (i, j) ∈ E, and δ(i) represents the neighborhood of i in G (in our case δ(i) = V \ {i}, since G is complete). The objective function (1) maximizes the total bandwidth, given by the sum of all the bandwidths assigned to each connection (i, j) ∈ E, weighted by their up-times (on-line times) p. The constraints (2) ensure that the bandwidth assigned to node i does not exceed the limit bi. The Static Membership Overlay Problem is NP-Hard: setting lij = uij for each edge (i, j) ∈ E, SMOP can be described as a Multidimensional Knapsack Problem:

  zMKP = max Σ(i,j)∈E pij uij ξij                     (5)
  s.t.  Σj∈δ(i) uij ξij ≤ bi,        i ∈ V            (6)      (7.2)
        ξij ∈ {0, 1},                (i, j) ∈ E       (7)

that is, the generalization of the 0-1 Knapsack where the bin has more than one dimensional constraint and the objective is to maximize the use of the bin.

7.3 Linear and Lagrangean SMOP Relaxations

7.3.1 LP Relaxation

The SMOP LP relaxation, zLP, is the following:

  zLP = max Σ(i,j)∈E pij xij                          (8)
  s.t. Σj∈δ(i) xij ≤ bi,             i ∈ V            (9)
       lij ξij ≤ xij ≤ uij ξij,      (i, j) ∈ E       (10)     (7.3)
       0 ≤ ξij ≤ 1,                  (i, j) ∈ E       (11)

We relaxed the integrality constraint (4), making it continuous: 0 ≤ ξij ≤ 1. This relaxation gives an upper bound on the integer SMOP solution, a bound that we will use to evaluate the effectiveness of the Lagrangean bound described in the next section. We can easily compute the LP relaxation using a linear solver (CoinMP or CPLEX).

7.3.2 Lagrangean Relaxation

The Lagrangean relaxation of SMOP can be formulated by associating a non-negative penalty λi to each constraint (2). The formulation, called zLR(λ), is the following:

  zLR(λ) = max Σ(i,j)∈E p′ij xij + Σi∈V bi λi         (12)
  s.t.    lij ξij ≤ xij ≤ uij ξij,   (i, j) ∈ E       (13)     (7.4)
          ξij ∈ {0, 1},              (i, j) ∈ E       (14)

which is equivalent to the problem:

  zLR(λ) = max Σ(i,j)∈E p′ij xij + Σi∈V bi λi         (15)     (7.5)
  s.t.    0 ≤ xij ≤ uij,             (i, j) ∈ E       (16)

where p′ij = pij − λi − λj and the {ξij} variables are no longer necessary.
Given λ, the optimal value of zLR(λ) is computed according to the following observations:

• if p′ij ≥ 0 we use all the available bandwidth, and the edge will be part of the solution: ξij = 1 and xij = uij;
• if p′ij < 0, the connection is discarded: ξij = 0 and xij = 0.

The associated Lagrangean Dual Problem can be described as follows:

  zLR(λ*) = min{zLR(λ) : λ ≥ 0}                                (7.6)

7.4 Subgradient Algorithm

The subgradient algorithm, proposed by Shor in [43] and successfully used by [39-42], is an iterative procedure that, at each iteration k, computes a new approximation λ^{k+1} of the Lagrangean multipliers in such a way that, for k → +∞, λ^k is an optimal or near-optimal solution of the corresponding Lagrangean Dual. Let x^k, of cost zLR(λ^k), be the solution obtained at iteration k by solving problem 7.5 with λ = λ^k. The Lagrangean multipliers can be updated as follows:

  λi^{k+1} = max{0, λi^k + α^k gi^k},   i ∈ V                  (7.7)

where:

  gi^k = Σj∈δ(i) xij^k − bi,   i ∈ V                           (7.8)

is the i-th component of the subgradient g^k, and α^k is the length of the step along the search direction given by the subgradient itself. Several versions of the step size α^k update have been proposed in the literature. In this chapter we will consider the standard one proposed by Polyak [119] and a constant one (called quasi-constant), proposed by [115] for a fully distributed algorithm. The standard update rule can be described as follows:

  α^k = β^k (z̄ − zLR(λ^k)) / ‖g^k‖²                            (7.9)

where z̄ is an overestimate of zLR(λ*). Polyak proved the convergence of the method for ε ≤ β^k ≤ 2. Instead of an overestimate of the Lagrangean function, it is possible, in many applications, to substitute it with:

  α^k = β^k 0.001 zLR(λ^k) / ‖g^k‖²                            (7.10)

Usually, the β^k step is initialized with a value dependent on the considered problem and updated (e.g. β^{k+1} = 0.5 β^k) if, after an arbitrary number of iterations, zLR(λ^k) has not improved.
Our implementation of the standard subgradient algorithm uses the following update rule for the Lagrangean penalties:

  λi^{k+1} = max{0, λi^k + β^k (0.01 zLR(λ^k) / ‖g^k‖²) gi^k}          (7.11)

7.5 Distributed Subgradient

The Lagrangean relaxation proposed in 7.3.2 does not take into account the dynamic aspect of a P2P net, as it considers the topology of the network static. In fact, in a P2P environment the nodes join and exit the net dynamically, no node can have a consistent view of the graph's status, and the optimization process described above is not effective. This aspect makes it impossible to build an optimization process based on global parameters (e.g. the subgradient g). Analyzing the P2P environment in depth, we can observe that:

• each node has a local knowledge of the net: it is aware only of the status of its neighbors,
• each node can optimize its parameters only by looking at the behavior of its neighbors,
• the potentially infinite time horizon of the net brings to a constant optimization, based on the changing graph topology (e.g. selecting a new set of connections because of the exit of some nodes from the net),
• each node, computing its local optimization step, can contribute to the global optimization of the graph, in an asynchronous fashion.

Aware of these observations, we can draw some guidelines for the design of a distributed algorithm:

• the optimization step is local: each node computes its values through its knowledge of the net and its connections,
• the optimization process must be asynchronous,
• the optimization process must be distributed among the nodes of the graph,
• once each node has computed its optimization step, it communicates its results to the others and the overlay network is created. When one or more nodes exit the net, each node optimizes again, considering the new topology.
These considerations allow us to outline the behavior of a distributed subgradient:

• each node computes its Lagrangean penalty λh according to its local knowledge,
• each node computes its part of the global objective function,
• each node updates its penalty exploiting a local method (e.g. ‖g‖ cannot be considered),
• each node computes its optimization step using the updated penalties from the other nodes.

Following the previous considerations, we report a dynamic, asynchronous and fully distributed subgradient. This approach does not solve the Lagrangean problem on the original graph G, but on each node of the graph. Given

  Gh(Vh, Eh)                                                   (7.12)

containing only the node h and its neighborhood δ(h), with:

  Vh = δ(h) ∪ {h}
  Eh = {(i, j) ∈ E : i, j ∈ Vh}                                (7.13)

and a set of Lagrangean penalties λ = {λ1, ..., λk}, k = 1, ..., |Vh|, for each subgraph Gh(Vh, Eh), h ∈ V, we solve the following problem:

  zLR^h(λ) = max Σ(i,j)∈Eh (1/2) p′ij xij + bh λh     (17)     (7.14)
  s.t.     0 ≤ xij ≤ uij,   (i, j) ∈ Eh               (18)

This problem is derived directly from 7.5, considering only the node h. The global value of the objective function is given by zLR(λ) = Σh∈V zLR^h(λ). The coefficient 1/2 is included to avoid the double evaluation of each node and its edges.

7.5.1 Algorithm foundations

The proposed formulation suggests an algorithm with the following steps:

Step 1: at each iteration k, each node h requests the Lagrangean penalties from its neighborhood i ∈ δ(h);
Step 2: exploiting the new penalties, the node h optimizes locally, solving the problem zLR^h(λ) and computing its solution xij;
Step 3: node h updates its penalty λh: λh = max{0, λh + αh^k gh}.

It is easy to notice that we cannot exchange the λh penalties at every step, because of the computational cost of the communications.
To avoid this problem and to enhance the algorithm's performance, [115] proposed a two-level optimization process:

I Level (Core Optimization): an internal loop in which h optimizes its zLR^h(λ) value, using its λh penalty and keeping constant the penalties λi of its neighborhood;
II Level (External Optimization): once every node h ∈ G has completed its optimization step, each penalty is updated and sent to the neighborhood; then, once the communication is completed, each node can perform another Core Optimization step.

7.5.2 Quasi-constant step size update

The main problem afflicting the distributed subgradient is the update of the λh penalties. We cannot use a standard step update rule, like the one described in 7.4: the node h does not have a consistent knowledge of the network, and performing additional communications would degrade the method's performance. It is necessary to use an update rule that exploits constant and local parameters. This rule is named the quasi-constant step size rule:

Step 1: initially α^0 = αstart, an arbitrary small value;
Step 2: when the bound improves, the step size is incremented, α^{k+1} = γ α^k, with γ > 1 a constant defined a priori;
Step 3: if the bound is not improved for a given number of iterations, the step size is reduced, α^{k+1} = max{γ′ α^k, αmin}, where 0 < γ′ < 1 and 0 < αmin < αstart; otherwise α^{k+1} = α^k.

The subgradient gh^k is also computed using only the local solution xh. The rule for updating the Lagrangean penalties is:

  λh^{k+1} = max{0, λh^k + αh^k gh^k}                          (7.15)

7.5.3 Algorithm description

We can summarize the two-level optimization procedure as follows:

Algorithm Inner-Subgradient(h, λ′)
1. Initialize λi = λ′i for each i ∈ Vh = δ(h) ∪ {h}
2. while it < InnerMaxIter do
3.   Solve zLR^h(λ)
4.   Update only the node h penalty: λh = max{0, λh + αh gh}
5.   if zLR^h(λ) < zLR^h(λ′)
6.     then λ′h = λh

Algorithm External-Optimization()
1. while et < ExtMaxIter do
2.   for h = 1 to |V| do
3.     Inner-Subgradient(h, λ)
4.   Send each λh to each j ∈ δ(h)

The External-Optimization algorithm manages the external optimization step, calling the Inner-Subgradient procedure for each node of the graph (line 3). Each node h performs its optimization step, solving its Lagrangean problem zLR^h(λ) (line 3), updating its penalty (line 4) and re-optimizing with the new λ for a given number of iterations. At the end of the main procedure External-Optimization, the best penalties of each node are sent to the other peers of the network (line 4).

7.6 Computational Results

In this section we report the computational results for the distributed subgradient algorithm, compared to the standard one and to the LP relaxation of the problem, computed using CPLEX and CoinMP. The test sets are provided by Boschetti et al. [115]. We call STD the standard subgradient algorithm and DIST the distributed one. The standard subgradient is implemented using the standard update rule proposed by Polyak, α^k = β^k 0.001 zLR(λ^k) / ‖g^k‖², with these parameters:

• β^0 = 0.005, start step,
• ∆k = 75, number of iterations before computing a smaller step: β^{k+1} = 0.95 β^k,
• MaxIter = 10000, max number of iterations,
• stop condition: if the bound has not improved by at least 0.1% in the last 3000 iterations, the method is stopped.

The parameters of the DIST algorithm are:

• αstart = 0.0025,
• αmin = 0.000005,
• γ = 1.005,
• γ′ = 0.95,
• ExtMaxIter = 500,
• InnerMaxIter = 20,
• ∆k = 10.

The CPLEX version is 11.2 and the CoinMP one is 1.7. We report the average value of the instances in each set, the execution times and the Gap% (Gap = 100 × (zLRopt − zLP) / zLP) between the LP relaxation and the results computed by the subgradient algorithms. The test machine is equipped with an Intel Core i7 920 @ 2.8 GHz and 6 Gigabytes of RAM.
Name             zLP (avg)   CoinMP t   CPLEX t    STD t   STD Gap%   DIST t   DIST Gap%
grafo50a-14        8181.31       0.06      0.04     0.61     0.0100     0.04     0.2810
grafo100a-14      18898.20       0.23      0.05     2.78     0.0020     0.14     0.4070
grafo250a-14      70508.01       3.69      0.35     7.89     0.0200     2.55     0.5430
grafo500a-14     151190.40      92.96      1.91    52.65     0.0100     9.82     0.3800
grafo750a-14     238698.34     761.80      4.79   150.09     0.0330    37.50     0.0000
grafo1000a-14    316579.60    3203.72     10.95   330.12     0.0150    52.01     0.0001
grafo50a-128       8271.55       0.06      0.02     0.62     0.0010     0.05     0.2950
grafo100a-128     21850.80       0.18      0.05     2.43     0.0021     0.17     0.1180
grafo250a-128     68858.90       3.85      0.41     7.89     0.0250     2.53     0.3900
grafo500a-128    153809.50     108.38      1.90    52.65     0.0100     9.83     0.0060
grafo750a-128    242610.80     770.96      5.03   150.09     0.0343    37.56     0.0010
grafo1000a-128   336517.60    2734.91      9.06   330.12     0.0124    52.06     0.0001
grafo50b-14        1092.54       0.03      0.02     0.64     0.0300     0.03     0.5640
grafo100b-14       2298.66       0.09      0.08     2.65     0.0010     0.16     0.4630
grafo250b-14       5447.14       0.72      1.35     8.56     0.0110     2.58     0.5850
grafo500b-14      11198.03       4.93     12.37    52.65     0.0060     9.96     0.4740
grafo750b-14      16524.70      17.17     47.03   110.09     0.0231    38.04     0.4350
grafo1000b-14     22448.11      50.93    164.76   200.15     0.0260    53.78     0.4330
grafo50b-128       1565.68       0.03      0.01     0.65     0.0100     0.04     0.4320
grafo100b-128      3243.71       0.09      0.09     2.38     0.0020     0.16     0.3780
grafo250b-128      8110.50       0.71      1.40     8.54     0.0200     2.57     0.5330
grafo500b-128     16062.21       3.98     13.76    52.65     0.0140     9.94     0.4920
grafo750b-128     23748.90      12.85     52.10   150.09     0.0190    37.96     0.4840
grafo1000b-128    32271.93      31.71    204.95   223.12     0.0050    53.77     0.4950

Table 7.1: Gaps and execution times for LP, STD, DIST (times in seconds, averaged over each instance set).

It is straightforward to notice that CPLEX outperforms CoinMP in execution time, confirming its efficiency. The Lagrangean relaxations proposed provide good quality bounds, competitive with the LP relaxation of the problem. For the biggest instances the execution times are comparable to the CPLEX ones, and in some cases, like the grafo1000b-128 set, the execution times are even better than the CPLEX ones.
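The Gap% column of Table 7.1 follows the formula given in Section 7.6; as a small worked check (the numeric values below are illustrative, not taken from the table):

```c
/* Gap% between a Lagrangean bound and the LP relaxation:
   Gap = 100 * (z_LRopt - z_LP) / z_LP. */
double gap_percent(double z_lr_opt, double z_lp) {
    return 100.0 * (z_lr_opt - z_lp) / z_lp;
}
```

For instance, a bound of 100.5 against an LP value of 100.0 gives a gap of 0.5%.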
Obviously, the STD algorithm obtains better results than DIST, since it has a global knowledge of the network and uses global parameters, as described in 7.4; but for our purposes this approach is not applicable.

7.7 Shared Memory Algorithm

The shared memory algorithm proposed exploits the inherent parallelism of the distributed subgradient. The problem granularity is straightforward: each node h is an independent entity, and we can map a subset of nodes onto each thread spawned in a parallel cycle. We used OpenMP because it is a de facto standard for the shared memory parallel programming model and for its portability. The simplicity of implementation and the non-invasive pre-processor calls to the APIs enable fast code deployment, adding only a few lines of code to the serial version.

Each node has its own data structures to compute its zLR(λ) value and a private array λnodes for storing the penalties relative to the other nodes. The processor used for testing the algorithm, an Intel Core i7 @ 2.8 GHz, implements Intel's proprietary Hyper-Threading technology [120], which gives a quad-core processor the ability to act like a processor with twice the cores (in our case 8 cores). In our tests we report the best speed-up value, obtained spawning 8 threads, theoretically one thread per core. The load balancing is totally managed by OpenMP, which assigns to each thread an equal number of nodes, as shown in figure 7.7 for an instance with 1000 nodes. As mentioned before, the insertion of the OpenMP directives is not invasive and the parallel algorithm differs little from the serial one.

Figure 7.7: Node deployment on an eight-core processor (cores 0-7, 125 nodes each).

Algorithm Inner-Subgradient(h, λ')
1. Initialize λ_i = λ'_i for each i ∈ V_h = δ(h) ∪ {h}
2. while it < InnerMaxIter do
3.     Solve zLR^h(λ)
4.
Update only the node h penalty: λ_h = max{0, λ_h + α_h g_h}
5.     if zLR^h(λ) < zLR^h(λ')
6.         then λ'_h = λ_h

Algorithm External-Optimization-OMP()
1. while et < ExtMaxIter do
2.     # parallel for private(h, λnode)
3.     for h = 1 to |V| do
4.         Inner-Subgradient(h, λnode)
5.     # end parallel for
6.     Update each λnode array with the new λ_h penalties

The differences between the two algorithms are minimal: at line 2 we added the OpenMP directives, opening a parallel for (fork) and declaring the λnode array and the h variable private to each thread. Each thread, once the indexes h have been partitioned, computes the Inner-Subgradient. We do not need explicit synchronization directives because the end of the parallel region is an implicit synchronization point for the threads. At line 6 of the main procedure, once the parallel region is closed (join), the algorithm updates the λnode array of each node with the new λ_h penalties.

7.8 Distributed Memory Algorithm

In this case we used the hybridization of MPI with OpenMP described in 2.3.1, where the shared memory model is used to enhance the intra-node performance of the message passing model. The algorithm exploits another level of parallelism, with the possibility to divide the graph among the cluster's nodes and, inside each node, with OpenMP. Each MPI task, deployed on a different cluster node, performs the External Optimization step of a given subset of nodes Vtask. The Vtask set is then processed in parallel, in the same fashion described in 7.7. At the end of each inner cycle, once each task has computed the zLR^h(λ) values relative to its Vtask subset of nodes, we need a synchronization primitive (barrier) to synchronize the tasks and broadcast consistent penalty values. The communication step is implemented with a broadcast primitive that updates the other tasks with the new Lagrangean penalties.
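The fork/join structure of External-Optimization-OMP can be sketched in plain C as follows. This is a minimal sketch, not the thesis code: `inner_subgradient` is a hypothetical placeholder for the Core Optimization of node h, and the arrays are sized by an arbitrary constant. Compiled without `-fopenmp` the pragma is simply ignored and the loop runs serially, with identical results.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_NODES 1000

static double lambda[MAX_NODES];   /* shared best penalties, one per node */

/* Hypothetical stand-in for the Inner-Subgradient procedure: in the real
   method it would optimize zLR^h(lambda) and return node h's new penalty. */
static double inner_subgradient(int h, const double *lambda_node) {
    return lambda_node[h] + 0.001;  /* placeholder for the real optimization */
}

void external_optimization_omp(int n_nodes, int ext_max_iter) {
    double next[MAX_NODES];
    for (int et = 0; et < ext_max_iter; et++) {
        /* fork: the loop index h is private to each thread; lambda is only
           read inside the region, so every node sees a consistent snapshot */
        #pragma omp parallel for
        for (int h = 0; h < n_nodes; h++)
            next[h] = inner_subgradient(h, lambda);
        /* implicit barrier at the end of the parallel for (join); then the
           new penalties are published for the next external iteration */
        for (int h = 0; h < n_nodes; h++)
            lambda[h] = next[h];
    }
}
```

Because each iteration writes only its own slot of `next`, no explicit synchronization is needed beyond the implicit barrier, mirroring the observation in the text.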
The computational tests have been conducted on an experimental cluster running Microsoft HPC 2008 Server with:

• 33 HP PCs, each with an Intel Pentium E6800 @ 3.2 GHz and 4 GBytes of RAM;
• Windows 7 Professional Edition 64 bit on each node;
• network switch: HP Procurve 2650.

The bottleneck of this cluster is the slow network that connects the machines (computation nodes). For our purposes, however, it is sufficient to show the scalability of the method in a message passing environment. In section 7.10 we will observe that, over a certain number of compute nodes, the method does not scale anymore, and the execution times are slowed down by the network and the communications.

7.9 GPU Algorithm

In this section we propose a many-core algorithm for running the method on a GPU. As in the other chapters, we used the Nvidia CUDA parallel programming model for its reliability and its more effective programming and debugging tools. For this algorithm we need to explore the Inner-Subgradient procedure more deeply: it is necessary to break the steps of the Core Optimization into small pieces and design four kernels for executing the method on a GPU. First we propose an extended version of the Core Optimization and then we design the GPU algorithm.

Algorithm Inner-Subgradient-Extended(h, λ')
1. while k < InnerMaxIter do
2.     if zLR^h(λ^k) < zLR^h(λ^{k−1})  // If the bound is improved
3.         then Compute subgradient: g_h^k = Σ_{j∈δ(h)} x_hj^k − b_h
4.         Compute α^k: α^k = α^{k−1} γ
5.         Save best λ_h and X_h^k
6.         Update λ_h^k: max{0, λ_h^{k−1} + α^k g_h^k}
7.         Update λ^k with λ_h^k
8.         // Update X:
9.         for each j ∈ δ(h) do
10.            (p_hj − λ_h^k − λ_j > 0) ? x_hj = u_hj : x_hj = 0
11.        else Compute subgradient: g_h^k = Σ_{j∈δ(h)} x_hj^k − b_h
12.        if ∆ > ∆^k
13.            then α^k = max{α^{k−1} γ', α_min}
14.        Update λ_h^k: max{0, λ_h^{k−1} + α^k g_h^k}
15.        Update λ^k with λ_h^k
16.        // Update X:
17.        for each j ∈ δ(h) do
18.            (p_hj − λ_h^k − λ_j > 0) ? x_hj = u_hj : x_hj = 0
19.
20.
∆++
k++

In line 2 the algorithm checks whether the bound has been improved. If so, the subgradient g_h is computed (line 3), the step α is updated (line 4), and the penalties and the solution are saved (line 5). In lines 6-10 the algorithm updates the Lagrangean penalties and the actual solution (lines 9-10). Otherwise, the subgradient g_h is computed anyway, together with the other parameters, but without saving the penalties and the solution. In lines 4 and 12-13 the α step update is computed following the quasi-constant rule described before.

The main idea behind the GPU algorithm is to execute in a concurrent fashion each iteration k of each node h in the problem. Following the CUDA programming model, we can assign a node h to each computation block and compute in parallel each step of the Inner-Subgradient algorithm. To minimize the communications between the HOST and the GPU, three support kernels have been implemented that manipulate the data structures on the device and manage the communication of the Lagrangean penalties.

Algorithm External-Optimization-GPU
1. BLKS = |V|
2. THDS = n
3. Sharedmem = THDS
4. while et < ExtMaxIter do
5.     Update-X-Kernel<<<BLKS, THDS>>>(λ', x_ij)
6.     while k < InnerMaxIter do
7.         Compute-zLR-Kernel<<<BLKS, THDS, Sharedmem>>>(zLR(λ'), x_ij)
8.         Inner-Subgradient-Kernel<<<BLKS, THDS, Sharedmem>>>(λ', zLR(λ'), x_ij)
9.     Update-Lambda-Kernel<<<BLKS, THDS>>>(λ, λ')

The peculiarity of this algorithm is the design shift from executing the relaxation of a certain number of nodes in parallel to executing the same relaxation step for all the nodes in parallel. The Update-X-Kernel (line 5) updates the x_ij solution of the Lagrangean problem in parallel for each node (the number of blocks is the same in each kernel) at the beginning of each Core Optimization step. The Compute-zLR-Kernel (line 7) evaluates, for each node h, its actual objective function value.
The Update-Lambda-Kernel (line 9), finally, at the end of the Core Optimization, sends the updated penalties to each node.

Algorithm Inner-Subgradient-Kernel(λ', zLR(λ'), x_ij)
1. h = blockIDx
2. thidx = threadIDx
3. THDS = blockDIMx
4. // Shared Memory Initialization
5. shared[thidx] = 0
6. times = |V| / THDS
7. remainder = |V| % THDS
8. if thidx < remainder
9.     then times++
10. Thread-Synchronization()
11. if zLR^h(λ^k) < zLR^h(λ^{k−1})
12.    then
13.    // Compute subgradient: g_h^k = Σ_{j∈δ(h)} x_hj^k − b_h
14.    for t = 0 to times do
15.        index = h * V + (thidx + t * THDS)
16.        shared[thidx] += x_ij[index]
17.    Thread-Synchronization()
18.    // Reduction
19.    for s = THDS/2; s > 0; s /= 2 do
20.        if thidx < s
21.            then shared[thidx] += shared[thidx + s]
22.        Thread-Synchronization()
23.    g_h^k = shared[0]
24.    // Compute α^k: α^k = α^{k−1} γ
25.    // Save best λ_h and X_h^k
26.    // Update λ_h^k
27.    λ[h] = max{0, λ_h^{k−1} + α^k g_h^k}
28.    // Update λ^k with λ_h^k
29.    // Update X:
30.    for t = 0 to times do
31.        index = h * V + (thidx + t * THDS)
32.        (p[index] − λ[h] − λ[thidx + t * THDS] > 0) ? x[index] = u_hj : x[index] = 0
33.    else
34.    // Compute subgradient: g_h^k = Σ_{j∈δ(h)} x_hj^k − b_h
35.    for t = 0 to times do
36.        index = h * V + (thidx + t * THDS)
37.        shared[thidx] += x_ij[index]
38.    Thread-Synchronization()
39.    // Reduction
40.    for s = THDS/2; s > 0; s /= 2 do
41.        if thidx < s
42.            then shared[thidx] += shared[thidx + s]
43.        Thread-Synchronization()
44.    g_h^k = shared[0]
45.    if ∆ > ∆^k
46.        then α^k = max{α^{k−1} γ', α_min}
47.    // Update λ_h^k
48.    λ[h] = max{0, λ_h^{k−1} + α^k}
49.    // Update λ^k with λ_h^k
50.    // Update X:
51.    for t = 0 to times do
52.        index = h * V + (thidx + t * THDS)
53.        (p[index] − λ[h] − λ[thidx + t * THDS] > 0) ? x[index] = u_hj : x[index] = 0
54.    ∆++

The algorithm spawns a block for each node h. In lines 14-23 and 35-44 the subgradient is computed using a modified version of the parallel reduction suggested by [59].
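The subgradient sum in the kernel is a standard shared-memory tree reduction in the style of [59]. The following plain-C sketch mimics what one CUDA block does (it is illustrative only, not the device code): `shared[]` plays the role of the block's shared memory, `THDS` the block size, and each simulated "thread" `thidx` first accumulates a strided slice of the input, then the halving loop combines partial sums so that `shared[0]` ends up holding the full sum.

```c
#define THDS 8   /* block size; must be a power of two for this scheme */

double block_reduce_sum(const double *x, int n) {
    double shared[THDS] = {0.0};
    /* strided accumulation: thread thidx sums x[thidx], x[thidx+THDS], ... */
    for (int thidx = 0; thidx < THDS; thidx++)
        for (int i = thidx; i < n; i += THDS)
            shared[thidx] += x[i];
    /* tree reduction; on a GPU each step is followed by __syncthreads(),
       here the serial loops make the ordering explicit */
    for (int s = THDS / 2; s > 0; s /= 2)
        for (int thidx = 0; thidx < s; thidx++)
            shared[thidx] += shared[thidx + s];
    return shared[0];   /* the full sum, read by a single thread */
}
```

On the GPU the outer `thidx` loops run concurrently across the block's threads, which is why every halving step needs a synchronization barrier before the next one reads the partial sums.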
When the bound is improved, the new α is computed (line 24), the node's Lagrangean penalty is updated (line 27) and the new solution for node h is created (lines 30-32). Otherwise, in the else branch of the if-then-else statement starting at line 33, the instructions are the same; the only difference is in the penalty update.

The data structures are indexed in a row-major fashion. The index value in lines 15, 31, 36 and 52 is the one-dimensional address of the structures accessed by thread thidx. The notable performance of this method comes from the use of parallel reductions, which fit very well the cluster processors of the GPU, together with the use of the shared memory. The Kepler architecture of our test device, with 192 CUDA cores per cluster processor, manages the reductions very well, thanks to the great number of cores in the same cluster (preliminary tests on a Fermi GPU, with clusters of 32 CUDA cores, showed that the reduction step was the bottleneck of the method).

7.10 Computational Results

We present the computational results relative to the Speed-Up factors obtained on the different parallel platforms considered. The testbed machine used for the OpenMP and CUDA algorithms is the same used in paragraph 7.6, with an Nvidia GeForce GTX 770 with 2 GigaBytes of GDDR5 memory. The cluster used for the MPI algorithm is the one described in 7.8. The tuning parameters are the same used in 7.6. We report the SpeedUp factor as the ratio between the serial time of the algorithm and the parallel one: SpeedUp = Time_serial / Time_parallel.

Name             Serial t    OMP t    MPI t   CUDA t    OMP SpeedUp   MPI SpeedUp   CUDA SpeedUp
grafo50a-14         0.041    0.024    0.650    0.023        1.708 X       0.063 X        1.754 X
grafo100a-14        0.144    0.055    0.829    0.024        2.618 X       0.174 X        5.937 X
grafo250a-14        2.550    0.652    1.522    0.171        3.911 X       1.675 X       14.873 X
grafo500a-14        9.828    2.445    1.993    0.415        4.020 X       4.931 X       23.697 X
grafo750a-14       37.509    9.370    5.747    1.343        4.003 X       6.527 X       27.933 X
grafo1000a-14      52.015   13.378    6.390    1.799        3.888 X       8.140 X       28.919 X
grafo50a-128        0.050    0.025    0.642    0.027        2.000 X       0.078 X        1.864 X
grafo100a-128       0.171    0.058    0.837    0.029        2.948 X       0.204 X        5.964 X
grafo250a-128       2.537    0.665    1.544    0.171        3.815 X       1.643 X       14.805 X
grafo500a-128       9.837    2.098    1.989    0.421        4.689 X       4.946 X       23.350 X
grafo750a-128      37.566    9.281    6.729    1.354        4.048 X       5.583 X       27.754 X
grafo1000a-128     52.062   13.467    7.400    1.798        3.866 X       7.035 X       28.953 X
grafo50b-14         0.030    0.022    0.643    0.025        1.655 X       0.057 X        1.456 X
grafo100b-14        0.160    0.049    0.889    0.028        3.388 X       0.187 X        6.022 X
grafo250b-14        2.580    0.654    1.535    0.172        3.948 X       1.682 X       14.986 X
grafo500b-14        9.960    2.136    1.998    0.426        4.667 X       4.989 X       23.384 X
grafo750b-14       38.040    9.335    6.792    1.365        4.075 X       5.601 X       27.860 X
grafo1000b-14      53.780   13.387    7.400    1.823        4.018 X       7.268 X       29.502 X
grafo50b-128        0.041    0.023    0.680    0.027        1.783 X       0.060 X        1.493 X
grafo100b-128       0.164    0.055    0.899    0.029        2.971 X       0.182 X        5.752 X
grafo250b-128       2.578    0.641    1.531    0.174        4.022 X       1.684 X       14.799 X
grafo500b-128       9.947    2.340    1.873    0.429        4.251 X       5.311 X       23.202 X
grafo750b-128      37.960    9.381    6.834    1.362        4.046 X       5.555 X       27.863 X
grafo1000b-128     53.778   13.729    7.123    1.816        3.917 X       7.550 X       29.615 X

Table 7.2: Speed-Ups of the CUDA, MPI and OpenMP algorithms (times in seconds, averaged over each instance set).

The experimental results clearly show that the GPU algorithm obtains the best Speed-Up factor among the proposed solutions, lowering the execution time by more than one order of magnitude. The effectiveness of this method is a consequence of the massively parallel architecture of the GPU, which allows us to execute a large number of operations in a parallel fashion. The other two solutions also obtained good results, proving the intrinsically parallel nature of the distributed subgradient. As far as the MPI algorithm is concerned, we are aware that the cluster used is not the best option to test our method.
We are confident that, executing the algorithm on a faster infrastructure (mainly on the communication side), the execution times can be lowered further. We used only 15 compute nodes because, as shown in figure 7.8a, a larger number of compute nodes leads to a degradation of the performance due to communications. The OpenMP algorithm has the quality of being almost identical to the serial version, thanks to the non-invasive design of the OpenMP directives, which allows good results to be achieved by modifying only a few parts of the original code.

Figure 7.8: Performance evaluation. (a) MPI performance degradation for grafo1000a-128; (b) platforms comparison.

7.11 Considerations and Future Work

In this chapter we presented three parallel algorithms for finding a bound to the Membership Overlay Problem, relative to the Peer-2-Peer network model. The results reported show the efficiency of the GPU algorithm, executed on a mid-level consumer device. The other algorithms also showed good results, proving the method's high scalability. We plan to apply this new kind of relaxation to other Combinatorial Optimization problems, to verify the reliability and the generality of the method.

Chapter 8

Conclusions

In this Thesis we presented new optimization algorithms designed to run on the state-of-the-art, inherently parallel processors available on the market. The need to exploit these new architectures in Combinatorial Optimization is becoming ever more urgent. The advent of technologies and paradigms like cloud computing and big data, containing at their core notable CO problems, together with the increasing dimension of real-life instances (vehicle routing, resource scheduling and optimization, network design, etc.), requires consistent and reliable developments on both theoretical and practical issues. Moreover, in most cases the computation of solutions for these problems must be really fast, from decision support software to scientific applications.
This work has provided parallel methodologies for solving, or for enhancing the solution methods of, some important problems in the literature, methodologies that can be applied to a wide spectrum of real-world situations. The speed-up factors obtained are suggestive of the potential behind these architectures. Not only is the hardware parallelism growing, but the quantity of memory has also gone far beyond the 8/16 gigabytes of RAM for each computation node or workstation, and 2/4 gigabytes of memory for accelerators. Exploiting the computational resources of these devices in a better way, also at the dawn of the exa-scale era, can bring an effective enhancement in both industrial and academic fields. Our envisioned future work includes the porting of these algorithms to a larger number of hardware platforms, like AMD devices or FPGAs, by exploiting the OpenCL parallel programming model to take advantage of the peculiar characteristics of these devices (faster computation, better double precision throughput, larger number of cores, etc.). Another interesting development related to the processors' evolution is the insertion of a many-core processor (GPU) on the same silicon die as a canonical multi-core CPU. This solution is called APU, Accelerated Processing Unit. HSA, Heterogeneous System Architecture, proposed by AMD, is a new environment designed specifically to exploit these processors. The great advantage of this approach is avoiding the PCI Express communications between the HOST and the GPU, thus exploiting the whole system memory. In this technological period it is mandatory to take into account the massively parallel implementation of these processors during the design of new methodologies and algorithms. Our aim is to design, starting from the well-established theoretical foundations of CO, new algorithms targeted to run on these new devices.

Bibliography

[1] Francesco Strappaveccia.
A parallel algorithm for a distributed Lagrangean relaxation for the membership overlay problem, 2011. Master Thesis.

[2] Wikipedia. Supercomputer — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Supercomputer. [Online; accessed 23-January-2015].

[3] Allan R. Hoffman, director. Supercomputers: directions in technology and applications. National Academy Press, 1989.

[4] Green 500. http://www.green500.org/, 2015.

[5] Top 500. http://www.top500.org/, 2014.

[6] Jerry Banks et al. Handbook of simulation. Wiley Online Library, 1998.

[7] Wikipedia. Simulation — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Simulation. [Online; accessed 23-January-2015].

[8] Wikipedia. Computational fluid dynamics — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Computational_fluid_dynamics. [Online; accessed 23-January-2015].

[9] Wikipedia. Computational chemistry — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Computational_chemistry. [Online; accessed 23-January-2015].

[10] Wikipedia. Multi-agent system — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Multi-agent_system. [Online; accessed 23-January-2015].

[11] Wikipedia. Computational astrophysics — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Computational_astrophysics. [Online; accessed 23-January-2015].

[12] Wikipedia. Computational physics — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Computational_physics. [Online; accessed 23-January-2015].

[13] Berk Hess, Carsten Kutzner, David Van Der Spoel, and Erik Lindahl. Gromacs 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. Journal of Chemical Theory and Computation, 4(3):435–447, 2008.

[14] Wikipedia. Bioinformatics — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Bioinformatics. [Online; accessed 23-January-2015].
[15] Michael Friendly and Daniel J. Denis. Milestones in the history of thematic cartography, statistical graphics, and data visualization. Seeing Science: Today, American Association for the Advancement of Science, 2008.

[16] Wikipedia. Data acquisition — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Data_acquisition. [Online; accessed 23-January-2015].

[17] Wikipedia. Data analysis — Wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/wiki/Data_analysis. [Online; accessed 23-January-2015].

[18] Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

[19] Gregory Piateski and William Frawley. Knowledge discovery in databases. MIT Press, 1991.

[20] David J. Hand, Heikki Mannila, and Padhraic Smyth. Principles of data mining. MIT Press, 2001.

[21] Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer, 2003.

[22] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990.

[23] Hrvoje Jasak, Aleksandar Jemcov, and Zeljko Tukovic. OpenFOAM: A C++ library for complex physics simulations. In International Workshop on Coupled Methods in Numerical Dynamics, volume 1000, pages 1–20, 2007.

[24] Paolo Giannozzi, Stefano Baroni, Nicola Bonini, Matteo Calandra, Roberto Car, Carlo Cavazzoni, Davide Ceresoli, Guido L. Chiarotti, Matteo Cococcioni, Ismaila Dabo, et al. Quantum ESPRESSO: a modular and open-source software project for quantum simulations of materials. Journal of Physics: Condensed Matter, 21(39):395502, 2009.

[25] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: portable parallel programming with the message-passing interface, volume 1. MIT Press, 1999.

[26] Sayantan Sur, Matthew J. Koop, and Dhabaleswar K. Panda.
High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 105. ACM, 2006.

[27] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Wiley Plus/Blackboard Stand-alone to accompany Operating Systems Concepts with Java (Wiley Plus Products). John Wiley & Sons, 2006.

[28] OpenMP Architecture Review Board. OpenMP application program interface version 3.0, May 2008. URL http://www.openmp.org/mp-documents/spec30.pdf.

[29] Jack B. Dennis and Earl C. Van Horn. Programming semantics for multiprogrammed computations. Communications of the ACM, 9(3):143–155, 1966.

[30] Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming, volume 10. MIT Press, 2008.

[31] Nvidia Developer Zone. https://developer.nvidia.com/, 2014.

[32] Khronos Group. OpenCL, August 2009. URL https://www.khronos.org/opencl/.

[33] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1982. ISBN 0-13-152462-3.

[34] Francesca Rossi, Peter Van Beek, and Toby Walsh. Handbook of constraint programming. Elsevier, 2006.

[35] George B. Dantzig. Linear programming and extensions. Princeton University Press, 1998.

[36] George B. Dantzig, Alex Orden, Philip Wolfe, et al. The generalized simplex method for minimizing a linear form under linear inequality restraints. Pacific Journal of Mathematics, 5(2):183–195, 1955.

[37] ILOG CPLEX. http://www-01.ibm.com/software/commerce/optimization/cplexoptimizer/, 2014.

[38] Robin Lougee-Heimer. The common optimization interface for operations research: Promoting open-source software in the operations research community. IBM Journal of Research and Development, 47(1):57–66, 2003.

[39] Boris Teodorovich Polyak. The conjugate gradient method in extremal problems.
USSR Computational Mathematics and Mathematical Physics, 9(4):94–112, 1969.

[40] Michael Held, Philip Wolfe, and Harlan P. Crowder. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88, 1974.

[41] Michael Held and Richard M. Karp. The traveling-salesman problem and minimum spanning trees. Operations Research, 18(6):1138–1162, 1970.

[42] Michael Held and Richard M. Karp. The traveling-salesman problem and minimum spanning trees: Part II. Mathematical Programming, 1(1):6–25, 1971.

[43] Naum Zuselevich Shor, Krzysztof Kiwiel, and Andrzej Ruszcaynski. Minimization methods for non-differentiable functions. Springer-Verlag New York, Inc., 1985.

[44] George L. Nemhauser and Laurence A. Wolsey. Integer and combinatorial optimization, volume 18. Wiley, New York, 1988.

[45] Karla Hoffman and Manfred Padberg. LP-based combinatorial problem solving. Annals of Operations Research, 4(1):145–194, 1985.

[46] Richard Bellman. Dynamic programming. Princeton University Press, 1957.

[47] Ailsa H. Land and Alison G. Doig. An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, pages 497–520, 1960.

[48] G. Wäscher, H. Haussner, and H. Schumann. An improved typology of cutting and packing problems. European Journal of Operational Research, 183(3):1109–1130, 2007.

[49] P. Gilmore and R. Gomory. Multistage cutting problems of two and more dimensions. Operations Research, 13:94–119, 1965.

[50] P. Gilmore and R. Gomory. The theory and computation of knapsack functions. Operations Research, 14:1045–1074, 1966.

[51] J. C. Herz. Recursive computational procedure for two-dimensional stock cutting. IBM Journal of Research and Development, 16(5):462–469, September 1972. ISSN 0018-8646. doi: 10.1147/rd.165.0462. URL http://dx.doi.org/10.1147/rd.165.0462.

[52] N. Christofides and C. Whitlock. An algorithm for two-dimensional cutting problems. Operations Research, 25:30–44, 1977.

[53] J. E. Beasley.
Algorithms for unconstrained two-dimensional guillotine cutting. Journal of the Operational Research Society, 36:297–306, 1985.

[54] G. F. Cintra, F. K. Miyazawa, Y. Wakabayashi, and E. C. Xavier. Algorithms for two-dimensional cutting stock and strip packing problems using dynamic programming and column generation. European Journal of Operational Research, 191(1):61–85, 2008. ISSN 0377-2217. doi: http://dx.doi.org/10.1016/j.ejor.2007.08.007. URL http://www.sciencedirect.com/science/article/pii/S0377221707008831.

[55] Mauro Russo, Antonio Sforza, and Claudio Sterle. An improvement of the knapsack function based algorithm of Gilmore and Gomory for the unconstrained two-dimensional guillotine cutting problem. International Journal of Production Economics, 145(2):451–462, 2013. ISSN 0925-5273. doi: http://dx.doi.org/10.1016/j.ijpe.2013.04.031. URL http://www.sciencedirect.com/science/article/pii/S0925527313001953.

[56] Gary J. Katz and Joseph T. Kider, Jr. All-pairs shortest-paths for large graphs on the GPU. In Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH '08, pages 47–55, Aire-la-Ville, Switzerland, 2008. Eurographics Association. ISBN 978-3-905674-09-5. URL http://dl.acm.org/citation.cfm?id=1413957.1413966.

[57] Ben D. Lund and Justin W. Smith. A multi-stage CUDA kernel for Floyd-Warshall. CoRR, abs/1001.4108, 2010. URL http://arxiv.org/abs/1001.4108.

[58] Vincent Boyer, Didier El Baz, and Moussa Elkihel. Solving knapsack problems on GPU. Computers & Operations Research, 39(1):42–47, 2012.

[59] Mark Harris. Optimizing parallel reduction in CUDA. Nvidia Developer Zone, 2009. URL http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.

[60] G. Dantzig and J. Ramser. The truck dispatching problem. Management Science, 6(1):80–91, 1959.

[61] Paolo Toth and Daniele Vigo, editors. The Vehicle Routing Problem. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001.
ISBN 0-89871-498-2.

[62] Bruce L. Golden, S. Raghavan, and Edward A. Wasil, editors. The Vehicle Routing Problem: Latest Advances and New Challenges, volume 43 of Operations Research/Computer Science Interfaces Series. Springer, 2008. ISBN 978-0-387-77778-8.

[63] VeroLog Web Site. http://www.verolog.eu/, 2014.

[64] Marco Boschetti and Vittorio Maniezzo. A set covering based matheuristic for a real-world city logistics problem. International Transactions in Operational Research, 2014. ISSN 1475-3995. doi: 10.1111/itor.12110. URL http://dx.doi.org/10.1111/itor.12110.

[65] G. Clarke and J. R. Wright. Scheduling of vehicles from a central depot to a number of delivery points. Operations Research, 12:568–581, 1964.

[66] M. L. Fisher and R. Jaikumar. A generalized assignment heuristic for vehicle routing. Networks, 11:109–124, 1981.

[67] Paolo Toth and Daniele Vigo. The granular tabu search and its application to the vehicle-routing problem. INFORMS Journal on Computing, 15(4):333–346, 2003. ISSN 1526-5528.

[68] Wen-Chyuan Chiang and Robert A. Russell. Simulated annealing metaheuristics for the vehicle routing problem with time windows. Annals of Operations Research, 63(1):3–27, 1996. ISSN 0254-5330.

[69] Luca Maria Gambardella, Éric Taillard, and Giovanni Agazzi. MACS-VRPTW: A multiple ant colony system for vehicle routing problems with time windows. In New Ideas in Optimization, pages 63–76. McGraw-Hill, 1999.

[70] Christian Prins. A simple and effective evolutionary algorithm for the vehicle routing problem. Computers & Operations Research, 31, 2004.

[71] Jari Kytöjoki, Teemu Nuortio, Olli Bräysy, and Michel Gendreau. An efficient variable neighborhood search heuristic for very large scale vehicle routing problems. Computers & Operations Research, 34(9):2743–2757, 2007.

[72] Yannis Marinakis and Magdalene Marinaki. A hybrid particle swarm optimization algorithm for the open vehicle routing problem.
In Marco Dorigo, Mauro Birattari, Christian Blum, Anders Lyhne Christensen, Andries P. Engelbrecht, Roderich Groß, and Thomas Stützle, editors, Swarm Intelligence, volume 7461 of Lecture Notes in Computer Science, pages 180–187. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-32649-3.

[73] Wei Ou and Bao-Gang Sun. A dynamic programming algorithm for vehicle routing problems. In Proceedings of the 2010 International Conference on Computational and Information Sciences, ICCIS '10, pages 733–736, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4270-6.

[74] N. Christofides, A. Mingozzi, and P. Toth. Exact algorithms for the vehicle routing problem based on spanning tree and shortest path relaxation. Mathematical Programming, 10:255–280, 1981.

[75] Jens Lysgaard, Adam N. Letchford, and Richard W. Eglese. A new branch-and-cut algorithm for the capacitated vehicle routing problem. Mathematical Programming A, 100(2):423–445, 2004.

[76] Martin Desrochers, Jacques Desrosiers, and Marius Solomon. A new optimization algorithm for the vehicle routing problem with time windows. Operations Research, 40(2):342–354, 1992.

[77] R. Baldacci, A. Mingozzi, and R. Roberti. New route relaxation and pricing strategies for the vehicle routing problem. Operations Research, 59:1269–1283, 2011.

[78] Rafael Martinelli, Diego Pecin, and Marcus Poggi. Efficient elementary and restricted non-elementary route pricing. European Journal of Operational Research, 239(1):102–111, 2014.

[79] André R. Brodtkorb, Trond R. Hagen, Christian Schulz, and Geir Hasle. GPU computing in discrete optimization. Part I: Introduction to the GPU. EURO Journal on Transportation and Logistics, 2(1-2):129–157, 2013. ISSN 2192-4376.

[80] Marco A. Boschetti, Vittorio Maniezzo, and Francesco Strappaveccia. Using GPU computing for solving the two-dimensional guillotine cutting problem. Submitted for publication, 2014.

[81] M. L. Balinski and R. E. Quandt. On an integer program for a delivery problem.
Operations Research, 12(2):300–304, 1964. ISSN 0030-364X.
[82] N. Christofides, A. Mingozzi, and P. Toth. State-space relaxation procedures for the computation of bounds to routing problems. Networks, 11(2):145–164, 1981.
[83] G. Righini and M. Salani. Symmetry helps: bounded bi-directional dynamic programming for the elementary shortest path problem with resource constraints. Discrete Optimization, 3(3):255–273, 2006.
[84] G. Righini and M. Salani. Decremental state space relaxation strategies and initialization heuristics for solving the orienteering problem with time windows with dynamic programming. Computers & Operations Research, 36(4):1191–1203, 2009.
[85] G. Righini and M. Salani. New dynamic programming algorithms for the resource constrained elementary shortest path. Networks, 51(3):155–170, 2008.
[86] Pawan Harish, Vibhav Vineet, and P. J. Narayanan. Large graph algorithms for massively multithreaded architectures. Technical Report IIIT/TR/2009/74, Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India, 2009.
[87] Aydın Buluç, John R. Gilbert, and Ceren Budak. Solving path problems on the GPU. Parallel Computing, 36(5):241–253, 2010.
[88] Hector Ortega-Arranz, Yuri Torres, Diego R. Llanos, and Arturo Gonzalez-Escribano. A new GPU-based approach to the shortest path problem. In High Performance Computing and Simulation (HPCS), 2013 International Conference on, pages 505–511. IEEE, 2013.
[89] Sumit Kumar, Alok Misra, and Raghvendra Singh Tomar. A modified parallel approach to single source shortest path problem for massively dense graphs using CUDA. In Computer and Communication Technology (ICCCT), 2011 2nd International Conference on, pages 635–639. IEEE, 2011.
[90] Christian Schulz. Efficient local search on the GPU—investigations on the vehicle routing problem. Journal of Parallel and Distributed Computing, 73(1):14–31, 2013.
[91] SINTEF. http://www.sintef.no/, 2014.
[92] VRPLIB.
http://www.or.deis.unibo.it/index.html, 2014.
[93] Feiyue Li, Bruce Golden, and Edward Wasil. Very large-scale vehicle routing: new test problems, algorithms, and results. Computers & Operations Research, 32(5):1165–1179, 2005.
[94] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.
[95] Richard Bellman. On a routing problem. Technical report, DTIC Document, 1956.
[96] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[97] Robert W. Floyd. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.
[98] Chris Barrett, Riko Jacob, and Madhav Marathe. Formal-language-constrained path problems. SIAM Journal on Computing, 30(3):809–837, 2000.
[99] Stephen Cole Kleene. Representation of events in nerve nets and finite automata. Technical report, DTIC Document, 1951.
[100] Brian C. Dean. Continuous-time dynamic shortest path algorithms. PhD thesis, Massachusetts Institute of Technology, 1999.
[101] Robert Geisberger, Peter Sanders, Dominik Schultes, and Daniel Delling. Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In Experimental Algorithms, pages 319–333. Springer, 2008.
[102] Ulrich Lauther. Slow preprocessing of graphs for extremely fast shortest path calculations. In Lecture at the Workshop on Computational Integer Programming at ZIB, volume 10, page 11, 1997.
[103] Ulrich Lauther. An extremely fast, exact algorithm for finding shortest paths in static networks with geographical background. Geoinformation und Mobilität - von der Forschung zur praktischen Anwendung, 22:219–230, 2004.
[104] Andrew V. Goldberg and Chris Harrelson. Computing the shortest path: A* search meets graph theory. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 156–165.
Society for Industrial and Applied Mathematics, 2005.
[105] Andrew V. Goldberg and Renato Fonseca F. Werneck. Computing point-to-point shortest paths from external memory. In ALENEX/ANALCO, pages 26–40, 2005.
[106] Thomas Pajor. Multi-modal Route Planning. PhD thesis, Universität Karlsruhe, Institut für Theoretische Informatik, 2009.
[107] Nvidia. Nvidia developer zone, 2007. https://developer.nvidia.com/, accessed 2013-12-03.
[108] Dhirendra Pratap Singh and Nilay Khare. A study of different parallel implementations of single source shortest path algorithms. International Journal of Computer Applications, 54(10):26–30, 2012.
[109] Andreas Crauser, Kurt Mehlhorn, Ulrich Meyer, and Peter Sanders. A parallelization of Dijkstra's shortest path algorithm. In Mathematical Foundations of Computer Science 1998, pages 722–731. Springer, 1998.
[110] Joseph R. Crobak, Jonathan W. Berry, Kamesh Madduri, and David A. Bader. Advanced shortest paths algorithms on a massively-multithreaded architecture. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8. IEEE, 2007.
[111] Marios Papaefthymiou and Joseph Rodrigue. Implementing parallel shortest-paths algorithms. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 30:59–68, 1997.
[112] Pedro J. Martín, Roberto Torres, and Antonio Gavilanes. CUDA solutions for the SSSP problem. In Computational Science – ICCS 2009, pages 904–913. Springer, 2009.
[113] Leonardo Dagum and Ramesh Menon. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science & Engineering, 5(1):46–55, 1998.
[114] Steve Coast. OpenStreetMap, 2004. http://www.openstreetmap.org/.
[115] Marco A. Boschetti, Vittorio Maniezzo, and Matteo Roffilli. A fully distributed Lagrangean solution for a peer-to-peer overlay network design problem. INFORMS Journal on Computing, 23(1):90–104, 2011.
[116] Mauro Birattari, Luis Paquete, Thomas Stützle, and Klaus Varrentrapp.
Classification of metaheuristics and design of experiments for the analysis of components. Technical Report AIDA-01-05, 2001.
[117] Ronald L. Rardin and Reha Uzsoy. Experimental evaluation of heuristic optimization algorithms: A tutorial. Journal of Heuristics, 7(3):261–304, 2001.
[118] Stefan Saroiu, P. Krishna Gummadi, and Steven D. Gribble. Measurement study of peer-to-peer file sharing systems. In Electronic Imaging 2002, pages 156–170. International Society for Optics and Photonics, 2001.
[119] Boris Teodorovich Polyak. Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics, 9(3):14–29, 1969.
[120] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, and Michael Upton. Hyper-Threading Technology architecture and microarchitecture. Intel Technology Journal, 6(1), 2002.
