Dynamic Scheduling of Network Updates

Xin Jin†  Hongqiang Harry Liu⋆  Rohan Gandhi∧  Srikanth Kandula◦  Ratul Mahajan◦  Ming Zhang◦  Jennifer Rexford†  Roger Wattenhofer×

Microsoft Research◦  Princeton University†  Yale University⋆  Purdue University∧  ETH Zurich×
Abstract— We present Dionysus, a system for fast, consistent
network updates in software-defined networks. Dionysus encodes
as a graph the consistency-related dependencies among updates at
individual switches, and it then dynamically schedules these updates based on runtime differences in the update speeds of different
switches. This dynamic scheduling is the key to its speed; prior
update methods are slow because they pre-determine a schedule,
which does not adapt to runtime conditions. Testbed experiments
and data-driven simulations show that Dionysus improves the median update speed by 53–88% in both wide area and data center
networks compared to prior methods.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design—Network communications; C.2.3 [Computer-Communication Networks]: Network Operations—Network management
Keywords: Software-defined networking; network update
1. Introduction
Many researchers have shown the value of centrally controlling
networks. This approach can prevent oscillations due to distributed
route computation [1]; ensure that network paths are policy compliant [2, 3]; reduce energy consumption [4]; and increase throughput [5, 6, 7, 8, 9]. Independent of their goal, such systems operate
by frequently updating the data plane state of the network, either
periodically or based on triggers such as failures. This state consists of a set of rules that determine how switches forward packets.
A common challenge faced in all centrally-controlled networks
is consistently and quickly updating the data plane. Consistency
implies that certain properties should not be violated during network updates, for instance, packets should not loop (loop freedom)
and traffic arriving at a link should not exceed its capacity (congestion freedom). Consistency requirements impose dependencies on
the order in which rules can be updated at switches. For instance,
for congestion freedom, a rule update that brings a new flow to a
link must occur after an update that removes an existing flow if the
link cannot support both flows simultaneously. Not obeying update
ordering requirements can lead to inconsistencies such as loops,
blackholes, and congestion.
Current methods for consistent network updates are slow because they are based on static ordering of rule updates [9, 10, 11,
12]. They pre-compute an order in which rules must be updated,
and this order does not adapt to runtime differences in the time
it takes for individual switches to apply updates. These differences inevitably arise because of disparities in switches’ hardware
and CPU load and the variability in the time it takes the centralized controller to make remote procedure calls (RPCs) to switches.
In B4, a centrally-controlled wide area network, the ratio of the
99th percentile to the median delay to change a rule at a switch
was found to be over five (5 versus 1 second) [8]. Further, some
switches can “straggle,” taking substantially more time than average (e.g., 10-100x) to apply an update. Current methods can stall
in the face of straggling switches.
The speed of network updates is important because it determines
the agility of the control loop. If the network is being updated in
response to a failure, slower updates imply a longer period during which congestion or packet loss occurs. Further, many systems
update the network based on current workload, both in the wide
area [8, 9] and the data center [5, 6, 7], and their effectiveness is
tied to how quickly they adapt to changing workloads. For example, recent works [8, 9] argue for frequent traffic engineering (e.g.,
every 5 minutes) to achieve high network utilization; slower network updates would lower network utilization.
We develop a new approach for consistent network updates. It
is based on the observations that i) there exist multiple valid rule
orderings that lead to consistent updates; and ii) dynamically selecting an ordering based on update speeds of switches can lead to
fast network updates. Our approach is general and can be applied to
many consistency properties, including all the ones that have been
explored by prior work [9, 10, 11, 12, 13].
We face two main challenges in practically realizing our approach. The first is devising a compact way to represent multiple
valid orderings of rule updates; there can be exponentially many
such orderings. We address this challenge using a dependency
graph in which nodes correspond to rule updates and network resources, such as link bandwidth and switch rule memory capacity,
and (directed) edges denote dependencies among rule updates and
network resources. Scheduling updates in any order, while respecting dependencies, guarantees consistent updates.
The second challenge is scheduling updates based on dynamic
behavior of switches. This problem is NP-complete in the general
case, and making matters worse, the dependency graph can also
have cycles. To schedule efficiently, we develop greedy heuristics
based on preferring critical paths and strongly connected components in the dependency graph [14].
We instantiate our approach in a system called Dionysus and
evaluate it using experiments on a modest-sized testbed and large-scale simulations. Our simulations are based on topology and traffic data from two real networks, one wide-area network and one
data center network. We show that Dionysus improves the median
network update speed by 53–88%. We also show that its faster
updates lower congestion and packet loss by over 40%.
Figure 1: Rule update times on a commodity switch. (a) Inserting single-priority rules. (b) Inserting random-priority rules. (c) Modifying rules in a switch with 600 single-priority rules. (d) Modifying 100 rules in a switch with concurrent control plane load.
2. Motivation
Our work is motivated by the observations that the time to update
switch rules varies widely and that not accounting for this variation
leads to slow network updates. We illustrate these observations using measurements from commodity switches and simple examples.
2.1 Variability in update time
Several factors lead to variable end-to-end rule update times, including switch hardware capabilities, control load on the switch,
the nature of the updates, RPC delays (which include network path
delays), etc. [7, 8, 15, 16]. To illustrate this variability, we perform
controlled experiments on commodity switches. In these experiments, RPC delays are negligible and identical switch hardware
and software are used, yet significant variability is evident.
The experiments explore the impact of four factors: i) the number of rules to be updated; ii) the priorities of the rules; iii) the
types of rule updates (e.g., insertion vs. modification); and iv)
control load on the switch. We measure switches from two different vendors and observe similar results. Figure 1 shows results
for one switch vendor. We build a customized switch agent on the
switch and obtain confirmation of rule updates in both the control
and data planes. The control plane confirmation is based on the
switch agent verifying that the update is installed in the switch’s
TCAM (ternary content addressable memory), and the data plane
confirmation is based on observing the impact of the update in the
switch’s forwarding behavior (e.g., changes in which interface a
packet is sent out on).
Figure 1(a) shows the impact of the number of rules by plotting
the time to add different numbers of rules. Here, the switch has no
control load besides rule updates, the switch starts with an empty
TCAM, and all rule updates correspond to adding new rules with
the same priority. We see that, as one might expect, the update time grows linearly with the number of rules being updated, with a per-rule update time of 3.3 ms.
Figure 1(b) shows the impact of priorities. As above, the switch
has no load and starts with an empty TCAM. The difference is that
the inserted rules are assigned random priorities. We see that the
per-rule update time is significantly higher than before. The slope
of the line increases as the number of rules increases, and the per-rule update time reaches 18 ms when inserting 600 rules.
This variability stems from the fact that TCAM packing algorithms do different amounts of work, depending on the TCAM’s
current content and the type of operation performed. For instance,
the TCAM itself does not encode any rule priority information. The
rules are stored from top to bottom in decreasing priority and when
multiple rules match a packet, the one in the highest position is
chosen. Thus, when a new rule is inserted, it may cause existing
rules to move in the table. Although the specific packing algorithms are proprietary and vary across vendors, the intrinsic design
of a TCAM makes the update time variable.
Figure 2: A network update example. Each link has 10 units of capacity; flows are labeled with their sizes.
Figure 1(c) shows the impact of the type of rule update. Rather
than inserting rules into an empty TCAM, we start with 600 rules
of the same priority and measure the time for rule modifications.
We modify only match fields or actions, not rule priorities. The
graph is nearly linear, with a per-rule modification latency of 11 ms.
This latency is larger than the per-rule insertion latency because a
rule modification requires two operations in the measured switch:
inserting the new rule and deleting the old rule.
Finally, Figure 1(d) shows the impact of control load, by engaging the switch in different control activities while updates are
performed. Here, the switch starts with the 600 same-priority rules
and we modify 100 of them. Control activities performed include
reading packet and byte counters on rules via the OpenFlow protocol, querying SNMP counters, reading switch information with CLI commands, and running the BGP protocol (which SDN systems use as a backup [8]). We see that although the update operations are identical (modifying 100 rules), the update time varies widely, with the
99th percentile 10 times larger than the median. Significant rule
update time variations are also reported in [8, 16].
In summary, we find that even in controlled conditions, switch
update time varies significantly. While some sources of this variability can be accounted for statically by update algorithms (e.g.,
number of rule updates), others are inherently dynamic in nature
(e.g., control plane load and RPC delays). Accounting for these
dynamic factors ahead of time is difficult. Our work thus focuses
on adapting to them at runtime.
2.2 Consistent updates amid variability
We illustrate the downside of static ordering of rule updates with
the example of Figure 2. Each link has a capacity of 10 units and
each flow’s size is marked. The controller wants to update the network configuration from Figure 2(a) to 2(b). Assume for simplicity
that the network uses tunnel-based routing and all necessary tunnels
have already been established. So, moving a flow requires updating
only the ingress switch.
If we want a congestion-free network update, we cannot update
all the switches in “one shot” (i.e., send all update commands simultaneously). Since different switches will apply the updates at
different times, such a strategy may cause congestion at some links.
For instance, if S1 applies the update for moving F1 before S2 moves F2 and S4 moves F4, link S1-S5 will be congested.
Figure 3: An example in which a completely opportunistic approach to scheduling updates leads to a deadlock. Each link has 10 units of capacity; flows are labeled with their sizes. If F2 is moved first, F1 and F4 get stuck.
Ensuring that no link is congested requires us to carefully order
the updates. Two valid orderings are:
Plan A: [F3 → F2] [F4 → F1]
Plan B: [F4] [F3 → F2 → F1]
Plan A mandates that F2 be done after F3 and F1 be done after F4. Plan B mandates that F1 be done after F2 and that F2 be done after F3. In both plans, F3 and F4 have no pre-requisites and can be done anytime and in parallel.¹
Which plan is faster? In the absence of update time variability, if all updates take unit time, Plan A will take 2 time units and Plan B will take 3. However, with update time variability, no plan is a clear winner. For instance, if S4 takes 3 time units to move F4, and other switches take 1, Plan A will take 4 time units and Plan B will take 3. On the other hand, if S2 is slow and takes 3 time units to move F2, while other switches take 1, Plan A will take 4 time units and Plan B will take 5.
Now consider a dynamic plan that first issues updates for F3 and F4, issues an update for F2 as soon as F3 finishes, and issues an update for F1 as soon as F2 or F4 finishes. This plan dynamically selects between the two static plans above and will thus equal
update. Practically implementing such plans for arbitrary network
topologies and updates is the goal of our work.
3. Dionysus Overview
We achieve fast, consistent network updates through dynamic
scheduling of rule updates. As in the example above, there can be
multiple valid rule orderings that lead to consistent updates. Instead
of statically selecting an order, we implement on-the-fly ordering
based on the realtime behavior of the network and the switches.
Our focus is on flow-based traffic management applications for
the network core (e.g., ElasticTree, MicroTE, B4, SWAN [4, 6, 8,
9]). As is the case for these applications, we assume that any forwarding rule at a switch matches at most one flow, where a flow is
(a subset of) traffic between ingress and egress switches that uses
either single or multiple paths. This assumption does not hold in
networks that use wild-card rules or longest prefix matching. Increasingly, such rules are being moved to the network edge or even
hosts [17, 18, 19], keeping the core simple with exact match rules.
The primary challenge is to tractably explore valid orderings.
One difficulty is that there are combinatorially many such orderings. Conceivably, one may formulate the problem as an ILP (Integer Linear Program). But this approach would be too slow and
does not scale to large networks with many flows. It is also static and not incrementally computable; one has to rerun the ILP every time the switch behaviors change. Another difficulty is that the extreme approach of being completely opportunistic about rule ordering does not always work. In such an approach, the controller will immediately issue any updates that are not gated (per consistency requirements) on any other update. While this approach works for the simple example in the previous section, in general, it can result in deadlocks (that are otherwise resolvable). Figure 3 shows an example. Since F2 can be moved without waiting for any other flow movement, an opportunistic approach might make that move. But at this point, we are stuck, because no flow can be moved to its destination without overloading at least some link. This is avoidable if we move other flows first. It is because of such possibilities that current approaches carefully plan transitions, but they err on the side of not allowing any runtime flexibility in rule orderings.

¹Some consistent update methods [9] use stages, a more rigid version of static ordering. They divide updates into multiple stages, and all updates in the previous stage must finish before any update in the next stage can begin. In this terminology, Plan A is a two-stage solution in which the first stage will update F3 and F4 and the second will update F2 and F1. Plan B is a three-stage solution. Since SWAN minimizes the number of stages, it will prefer Plan A.

Figure 4: Our approach. (The dependency graph generator takes the current state, the target state, and the consistency property as input; the update scheduler applies the resulting rule updates to the network.)

Figure 5: Example dependency graphs. (a) Dependency graph for Figure 2. (b) Dependency graph for Figure 3.
We balance planning and opportunism using a two-stage approach,
shown in Figure 4. In the first stage, we generate a dependency
graph that compactly describes many valid orderings. In the second
stage, we schedule updates based on the constraints imposed by the
dependency graph. Our approach is general in that it can maintain
any consistency property that can be described using a dependency
graph, which includes all properties used in prior work [8, 9, 11].
The scheduler is independent of the consistency property.
Figure 5(a) shows a simplified view of the dependency graph for
the example of Figure 2. In the graph, circular nodes denote update
operations, and rectangular nodes represent link capacity resources.
The numbers within rectangles indicate the current free capacity
of resources. A label on an edge from an operation to a resource
node shows the amount of resource that will be released when the
operation completes. For example, link S2-S5 has 0 free capacity,
and moving F3 will release a capacity of 10 to it. Labels on edges from resource to operation nodes show the amount of free resource needed to conduct these operations. As moving F1 requires 5 units of free capacity on link S1-S5, F1 cannot move until F2 or F4 finishes.
Given the dependency graph in Figure 5(a), we can dynamically generate good schedules. First, we observe that F3 and F4 don't depend on other updates, so they can be scheduled immediately. After F3 finishes, we can schedule F2. Finally, we schedule F1 once one of F2 or F4 finishes. From this example, we see that the
dependency graph captures dependencies but still leaves scheduling
flexibility, which we leverage at runtime to implement fast updates.
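For illustration, this dependency graph can be captured with a few dictionaries and scheduled greedily. The sketch below is ours; the edge amounts follow the flow sizes in Figure 2, and operations are treated as completing in lock-step rounds rather than asynchronously.

    # Figure 5(a) as data: what each move needs from and releases to link resources.
    free     = {"S2-S5": 0, "S1-S5": 0}
    needs    = {"Mv.F1": {"S1-S5": 5}, "Mv.F2": {"S2-S5": 5}, "Mv.F3": {}, "Mv.F4": {}}
    releases = {"Mv.F1": {}, "Mv.F2": {"S1-S5": 5}, "Mv.F3": {"S2-S5": 10}, "Mv.F4": {"S1-S5": 5}}

    pending, rounds = set(needs), []
    while pending:
        # Issue every operation whose resource needs are currently satisfied.
        ready = [o for o in pending if all(free[r] >= amt for r, amt in needs[o].items())]
        assert ready, "no operation can be issued (deadlock)"
        for o in ready:
            for r, amt in needs[o].items():
                free[r] -= amt
            for r, amt in releases[o].items():      # credited once the move completes
                free[r] += amt
            pending.remove(o)
        rounds.append(sorted(ready))
    print(rounds)   # [['Mv.F3', 'Mv.F4'], ['Mv.F1', 'Mv.F2']]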
There are two challenges in dynamically scheduling updates.
The first is to resolve cycles in the dependency graph. These arise
due to complex dependencies between rules. For example, Figure 5(b) shows that there are cycles in the dependency graph for
the example of Figure 3. Second, at any given time, multiple subsets of rule updates can be issued, and we need to decide which
ones to issue first. As described later, the greedy heuristics we use
for these challenges are based on critical-path scheduling and the
concept of SCC (strongly connected component) in graph theory.
4. Network State Model
This section describes the model of network forwarding state that
we use in Dionysus. The following sections describe dependency
graph generation and scheduling in detail.
The network G consists of a set of switches S and a set of directed links L. A flow f is from an ingress switch s_i to an egress switch s_j with traffic volume t_f, and its traffic is carried over a set of paths P_f. The forwarding state of f is defined as R_f = {r_{f,p} | p ∈ P_f}, where r_{f,p} is the traffic load of f on path p. The network state NS is then the combined state of all flows, i.e., NS = {R_f | f ∈ F}. For example, consider the network in Figure 6(a) that is forwarding a flow across two paths, with 5 units of traffic along each. Here, t_f = 10, P_f = {p1 = S1→S2→S3→S5, p2 = S1→S2→S5}, and R_f = {r_{f,p1} = 5, r_{f,p2} = 5}.
The state model above captures both tunnel-based forwarding
that is prevalent in WANs and also WCMP (weighted cost multipath) forwarding that is prevalent in data center networks. In tunnel-based forwarding, a flow is forwarded along one or more tunnels.
The ingress switch matches incoming traffic to the flow, based on
packet headers, and splits it across the tunnels based on configured
weights. Before forwarding a packet along a tunnel, the ingress
switch tags the packet with the tunnel identifier. Subsequent switches
only match on tunnel tags and forward packets, and the egress
switch removes the tunnel identifier. Representing tunnel-based
forwarding in our state model is straightforward. Pf is the set of
tunnels and the weight of a tunnel is rf,p /tf .
In WCMP forwarding, switches at every hop match on packet
headers and split flows over multiple next hops with configured
weights. Shortest-path and ECMP (equal cost multipath) forwarding are special cases of WCMP forwarding. To represent WCMP
routing in our state model, we first calculate the flow rate on link l as r_f^l = Σ_{l ∈ p, p ∈ P_f} r_{f,p}. Then at switch s_i, the weight for next hop s_j is w_{i,j} = r_f^{l_ij} / Σ_{l ∈ L_i} r_f^l, where l_ij is the link from s_i to s_j and L_i is the set of links starting at s_i. For instance, in Figure 6(a), w_{1,2} = 1, w_{1,4} = 0, w_{2,3} = 0.5, w_{2,5} = 0.5.
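As a concrete check, the following Python sketch (our own representation, not Dionysus code) computes the per-link rates and WCMP weights for the flow of Figure 6(a):

    from collections import defaultdict

    # Forwarding state of flow f in Figure 6(a): two paths, 5 units of traffic each.
    R_f = {("S1", "S2", "S3", "S5"): 5.0, ("S1", "S2", "S5"): 5.0}

    # Per-link rate: r_f^l is the sum of r_{f,p} over the paths p that traverse link l.
    rate = defaultdict(float)
    for path, r in R_f.items():
        for link in zip(path, path[1:]):
            rate[link] += r

    # WCMP weight at switch si for next hop sj: the rate on link (si, sj) divided
    # by the total rate on links leaving si.
    def wcmp_weight(si, sj):
        out = sum(r for (a, _), r in rate.items() if a == si)
        return rate[(si, sj)] / out if out else 0.0

    print(wcmp_weight("S1", "S2"), wcmp_weight("S1", "S4"),
          wcmp_weight("S2", "S3"), wcmp_weight("S2", "S5"))   # 1.0 0.0 0.5 0.5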
5. Dependency Graph Generation
As shown in Figure 4, the dependency graph generator takes as
input the current state NS_c, the target state NS_t, and the consistency property. The network states include the flow rates, and as in
current systems [4, 6, 8, 9], we assume that flows obey this rate as a
result of rate limiting or robust estimation. A static input to Dionysus is the rule capacity of each switch, relevant in settings where
this resource is limited. Since Dionysus manages all rule additions
and removals, it then knows how much rule capacity is available on
each switch at any given time. This information is used to ensure that rule capacity is not exceeded at any switch.

Figure 6: Example of building the dependency graph for updating flow f from current state (a) to target state (b). (c) Dependency graph using tunnel-based rules (Table 1). (d) Dependency graph using WCMP-based rules (Table 2).

Index   Operation
A       Add p3 at S1
B       Add p3 at S4
C       Add p3 at S5
D       Change weight at S1
E       Delete p2 at S1
F       Delete p2 at S2
G       Delete p2 at S5
Table 1: Operations to update f with tunnel-based rules.

Index   Operation
X       Add weights with new version at S2
Y       Change weights, assign new version at S1
Z       Delete weights with old version at S2
Table 2: Operations to update f in WCMP forwarding.
Given NS_c and NS_t, it is straightforward to compute the set of
operations that would update the network from N Sc to N St . The
goal of dependency graph generation is to inter-link these operations based on the consistency property. Our dependency graph has
three types of nodes: operation nodes, resource nodes, and path
nodes. Operation nodes represent addition, deletion, or modification of a forwarding rule at a switch, and resource nodes correspond
to resources such as link capacity and switch memory and are labeled with the amount of resource currently available. An edge between two operation nodes captures an operation dependency and
implies that the parent operation must be done before the child. An
edge between a resource and an operation node captures a resource
dependency. An edge from a resource to an operation node is labeled with the amount of resource that must be available before the
operation can occur. An edge from an operation to a resource node
is labeled with the amount of the resource that will be freed by that
operation. There are no edges between resource nodes.
Path nodes help group operations and link capacity resources on
a path. Path nodes can connect to operation nodes as well as to
resource nodes. An edge between an operation and a path node can be either an operation dependency (unweighted) or a resource dependency (weighted). The various types of links connecting different types of nodes are detailed in Figure 7.

Figure 7: Links and relationships among path, operation, and resource nodes; RD indicates a resource dependency and OD indicates an operation dependency.
During scheduling, each path node that frees link resources has
a label committed that denotes the amount of traffic that is moving
away from the path; when the movement finishes, we use committed
to update the free resource of its child resource nodes. We do not need to keep committed for path nodes that require resources, because we always reduce the free capacity on their parent resource nodes before we move traffic onto the path.
In this paper, we focus on four consistency properties from prior
work [13] and show how our dependency graphs capture them. The
properties are i) blackhole-freedom: no packet should be dropped
at a switch (e.g., due to a missing rule); ii) loop-freedom: no packet
should loop in the network; iii) packet coherence: no packet should
see a mix of old and new rules in the network; and iv) congestion-freedom: traffic arriving at a link should be below its capacity. We
believe that our dependency graphs are general enough to describe
other properties as well, which may be proposed in the future.
We now describe dependency graph generation. We first focus
on tunnel-based forwarding without resource limits and then discuss WCMP forwarding and resource constraints.
Tunnel-based forwarding: Tunnel-based forwarding offers loop
freedom and packet coherence by design; it is not possible for packets to loop or to see a mix of old and new rules during updates. We
defer discussion of congestion freedom until we discuss resource
constraints. The remaining property, blackhole freedom, is guaranteed as long as we ensure that i) a tunnel is fully established before
the ingress switch puts any traffic on it, and ii) all traffic is removed
from the tunnel before the tunnel is deleted.
A dependency graph that encodes these constraints can be built
as follows. For each flow f, using NS_c and NS_t, we first calculate
the tunnels to be added and deleted and generate a path node for
each. Then, we generate an operation node for every hop, adding
an edge from each of them to the path node (or from the path node
to each of them), denoting adding (or deleting) this tunnel at the
switch. Then, we generate an operation node that changes the tunnel weights to those in NS_t at the ingress switch. To ensure blackhole freedom, we add an edge from each path node that adds new
tunnels to the operation node that changes tunnel weights, and an
edge from the operation node that changes tunnel weights to each
path node that deletes old tunnels.
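A simplified sketch of this construction follows (our own representation; resource nodes, which are added under "Resource constraints" below, are omitted here):

    # Dependency edges for one flow in a tunnel-based network. An edge (u, v)
    # means u must complete before v: tunnel additions precede the weight change,
    # and the weight change precedes tunnel deletions.
    def build_tunnel_graph(ingress, added, deleted):
        edges = []
        change = ("change-weight", ingress)
        for tid, switches in added.items():
            path = ("path", tid)
            for sw in switches:                     # install the tunnel at every hop ...
                edges.append((("add", tid, sw), path))
            edges.append((path, change))            # ... before traffic is put on it
        for tid, switches in deleted.items():
            path = ("path", tid)
            edges.append((change, path))            # drain traffic from the tunnel ...
            for sw in switches:                     # ... before removing it anywhere
                edges.append((path, ("del", tid, sw)))
        return edges

    # Figure 6 / Table 1: add p3 at S1, S4, S5 (A-C); change weights at S1 (D);
    # delete p2 at S1, S2, S5 (E-G).
    edges = build_tunnel_graph("S1", {"p3": ["S1", "S4", "S5"]},
                               {"p2": ["S1", "S2", "S5"]})
    for parent, child in edges:
        print(parent, "->", child)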
We use the example in Figure 6 to illustrate the steps above. Initially, we set the tunnel weights on p1 and p2 with 0.5 and 0.5
respectively. In the target state, we add tunnel p3 , delete tunnel
p2 , and change the tunnel weights to 0.5 on p1 and 0.5 on p3 . To
generate the dependency graph for this transition, we first generate
path nodes for p2 and p3 and the related switch operations as in
Table 1. Then we add edges from the tunnel-addition operations
(A, B and C) to the corresponding path node (p3), and edges to
the tunnel-deletion operations (E, F and G) from the correspond-
Algorithm 1 Dependency graph for packet coherence in a WCMP network
– v0: old version number
– v1: new version number
1: for each flow f do
2:   s* = GetIngressSwitch(f)
3:   o* = GenRuleModifyOp(s*, v1)
4:   for si ∈ GetAllSwitches(f) − s* do
5:     if si has multiple next-hops then
6:       o1 = GenRuleInsertOp(si, v1)
7:       o2 = GenRuleDeleteOp(si, v0)
8:       Add edge from o1 to o*
9:       Add edge from o* to o2
ing path node (p2). Finally, we add an edge from the path node of
the added path (p3) to the weight-changing operation (D) and from
D to the path node for the path to be deleted (p2). The resulting
graph is shown in Figure 6(c). The resource nodes in this graph are
discussed later.
WCMP forwarding: With NS_c and NS_t, we calculate for each
flow the weight change operations that update the network from
N Sc to N St . We then create dependency edges between these operations based on the consistency property. Algorithm 1 shows how
to do that for packet-coherence, using version numbers [10, 11]. In
this approach, the ingress switch tags each packet with a version
number and downstream switches handle packets based on the embedded version number. This tagging ensures that each packet either uses the old configuration or the new configuration, and never a
mix of the two. The algorithm generates three types of operations:
i) the ingress switch tags packets with the new version number and
uses new weights (Line 3); ii) downstream switches have rules for
handling the packets with the new version number and new weights
(Line 6); and iii) downstream switches delete rules for the old version number (Line 7). Packet coherence is guaranteed if Type i
operation occurs after Type ii (Line 8) and Type iii operations occur after Type i (Line 9). Line 5 is an optimization; no changes are
needed at switches that have only one next hop for the flow in both
the old and new configurations.
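For concreteness, a direct transcription of Algorithm 1 follows (a sketch; the single next-hop map per switch is a simplification, and the operation tuples are our own encoding):

    # Packet-coherence dependency edges for WCMP flows (Algorithm 1).
    def wcmp_coherence_edges(flows, v0, v1):
        edges = []                                   # (parent, child) pairs
        for f in flows:
            s_star = f["ingress"]
            o_star = ("modify", s_star, v1)          # tag packets with v1, new weights
            for si in f["switches"]:
                if si == s_star or len(f["next_hops"].get(si, [])) <= 1:
                    continue                         # Line 5: single next hop, skip
                o1 = ("insert", si, v1)              # new-version rules downstream
                o2 = ("delete", si, v0)              # old-version rules to remove
                edges.append((o1, o_star))           # type ii before type i (Line 8)
                edges.append((o_star, o2))           # type i before type iii (Line 9)
        return edges

    # Flow f of Figure 6: besides the ingress S1, only S2 splits over two next hops,
    # so the result is X -> Y and Y -> Z of Table 2.
    f = {"ingress": "S1",
         "switches": ["S1", "S2", "S3", "S4", "S5"],
         "next_hops": {"S1": ["S2", "S4"], "S2": ["S3", "S5"], "S3": ["S5"], "S4": ["S5"]}}
    print(wcmp_coherence_edges([f], v0=0, v1=1))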
We use the example in Figure 6 again to illustrate the algorithm
above. For flow f , we need to update the flow weights at S1 from
[(S2, 1), (S4, 0)] to [(S2, 0.5), (S4, 0.5)], and weights at S2 from
[(S3, 0.5), (S5, 0.5)] to [(S3, 1), (S5, 0)]. This translates to three
operations (Table 2): add new weights with new version numbers
at S2 (X), change to new weights and new version numbers at S1
(Y), and delete old weights at S2 (Z). We connect X to Y and Y
to Z as shown in Figure 6(d).
Blackhole-freedom and loop-freedom do not require version numbers. For the former, we must ensure that every switch that may receive a packet from a flow always has a rule for it. For the latter, we must ensure that downstream switches (per the new configuration) are updated before updating a switch to new rules [13]. These conditions are easy to encode in a dependency graph. Due to space constraints, we omit a detailed description of graph construction.
Resource constraints: We introduce resource nodes to the graph
corresponding to resources of interest, including link bandwidth
and switch memory. These nodes are labeled with their current free
amount or with infinity if that resource can never be a bottleneck.
We connect link bandwidth nodes with other nodes as follows.
For each path node and bandwidth node for links along the path: if
the traffic on the path increases, we add an edge from the bandwidth
node to the path node with a label indicating the amount of traffic
increase; if the traffic decreases, we add edges in the other direction. For a tunnel-based network, we add an edge from each path
node on which traffic increases to the operation node that changes
weight at the ingress switch with a label indicating the amount of traffic increase; similarly, we add an edge in the other direction if the traffic decreases.

Figure 8: Critical-path scheduling. C has larger CPL than B, and is scheduled.

For a WCMP network, we add an edge
from each path node on which traffic increases to each operation
node that adds weights with new versions with a label indicating
the amount of increase; similarly, we add an edge from the operation node that changes weight at the ingress switch to each path
node on which traffic decreases with a label indicating the amount
of decrease. This difference arises because tunnels offer packet coherence by design, while WCMP networks need version numbers.
Connecting switch memory resource nodes with other nodes is straightforward. We add an edge from a resource node to an operation node if the operation consumes that switch memory, with a weight indicating the amount of consumption; we add an edge from an operation node to a resource node if the operation releases that switch memory, with a weight indicating the amount of release.
For example, in Figure 6(c) node D, which changes tunnel weights
at S1, increases 5 units of traffic on p3 which includes link S1-S4
and S4-S5, and decreases 5 units of traffic on p2 which includes
link S1-S2 and S2-S5. Node A that adds tunnel p3 consumes 1
rule at S1. In Figure 6(d), we link p3 to X and link X to Y. X and Y together have essentially the same effect as D in Figure 6(c).
Post-processing: After generating the dependency graph, we reduce it by deleting edges from non-bottlenecked resources. For
each resource node Ri, we check the edges to its child nodes Nj. If the free resource Ri.free is no smaller than Σ_j lij, where lij is the edge weight, we delete all the edges from Ri to its children and decrease the free capacity by Σ_j lij. The idea is that Ri has
enough free resource to accommodate all operations that need it, so
it’s not a bottleneck resource and the scheduling will not consider
it. For example, if S1-S4 has over 5 units of free capacity, we can
delete the edge from S1-S4 to p3 in Figures 6(c) and 6(d).
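Written out, the pruning step looks as follows (a sketch over a dictionary representation of our choosing):

    # Post-processing: delete edges from resource nodes that are not bottlenecks.
    # free[r] is the spare amount of resource r; out_edges[r] maps each child node
    # of r to the edge weight (the amount of r that child requires).
    def prune_unbottlenecked(free, out_edges):
        for r, children in out_edges.items():
            demand = sum(children.values())
            if free[r] >= demand:          # r can satisfy every child at once
                free[r] -= demand
                children.clear()           # drop all edges from r

    # Figures 6(c)/(d): if link S1-S4 has more than 5 units free, its edge to p3 is removed.
    free = {"S1-S4": 6.0}
    out_edges = {"S1-S4": {"p3": 5.0}}
    prune_unbottlenecked(free, out_edges)
    print(free, out_edges)                 # {'S1-S4': 1.0} {'S1-S4': {}}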
6. Dionysus Scheduling
We now describe how updates are scheduled in Dionysus. First,
we discuss the hardness of the scheduling problem, which guided
our approach. Then, we describe the scheduling algorithm for the special case where the dependency graph is a DAG (directed acyclic
graph). Finally, we extend this algorithm to handle cycles.
6.1 The hardness of the scheduling problem
Scheduling is a resource allocation problem, that is, how to allocate available resources to operations to minimize the update time.
For example, resource node R1 in Figure 8 has 5 units of free resource. It cannot cover both B and C. We must decide to schedule
i) B, ii) C, or iii) part of B and C. Every time we make a scheduling decision, we decide how to allocate a resource to its child operations and which parent operation to execute to obtain a resource.
Additional constraints on scheduling are placed by dependencies
between operations.
We can prove the following about network update scheduling;
proofs of both theorems are in our technical report [20].
T HEOREM 1. In the presence of both link capacity and switch
memory constraints, finding a feasible update schedule is NP-complete.
Symbol          Description
Oi              Operation node i
Rj              Resource node j
Rj.free         Free capacity of Rj
Pk              Path node k
Pk.committed    Traffic that is moving away from path k
lij             Edge weight from node i to j
Table 3: Key notation in our algorithms.
Algorithm 2 ScheduleGraph(G)
1: while true do
2:   UpdateGraph(G)
3:   Calculate CPL for every node
4:   Sort nodes by CPL in decreasing order
5:   for unscheduled operation node Oi ∈ G do
6:     if CanScheduleOperation(Oi) then
7:       Schedule Oi
8:   Wait for time t or for all scheduled operations to finish
The hardness stems from the fact that memory constraints involve integers and memory cannot be allocated fractionally. Scheduling is simpler if we only have link capacity constraints, but finding
the fastest schedule is still hard because of the huge search space.
T HEOREM 2. In the presence of link capacity constraints, but
no switch memory constraints, finding the fastest update schedule
is NP-complete.
6.2 Scheduling DAGs
We first consider the special case of a DAG. Scheduling a DAG
is, expectedly, simpler:
L EMMA 1. If the dependency graph is a DAG, finding a feasible update schedule is in P.
While it is easy to find a feasible solution for a DAG, we want to
find a fast one. Different scheduling orders lead to different finishing times. For example, if all operations take the same amount of
time in Figure 8, scheduling C before B will be faster.
We use critical-path scheduling. The intuition is that the critical
path decides the completion time, and we thus want to schedule
operations on the critical path first. Since resource nodes and path
nodes in the dependency graph are only used to express constraints,
we assign weight w=0 to them when calculating critical paths; for
operation nodes, we assign weight w=1. With this, we calculate a
critical-path length CPL for each node i as:

CPL_i = w_i + max_{j ∈ children(i)} CPL_j      (1)

To calculate CPL for all the nodes in the graph, we first topologically sort all the nodes and then iterate over them in reverse topological order, applying Equation 1. In Figure 8, for example, CPL_D = 1, CPL_C = 2, CPL_B = 1, CPL_A = 3. The CPL for each node can be computed efficiently in linear time.
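A small sketch of this computation (children maps each node to its child nodes; weights are 1 for operation nodes and 0 for resource and path nodes; Python's graphlib is assumed to be available):

    from graphlib import TopologicalSorter   # Python 3.9+

    # CPL_i = w_i + max over children j of CPL_j, computed children-first.
    def critical_path_lengths(children, weight):
        # Passing `children` as the "predecessor" map makes static_order() emit
        # child nodes before their parents, i.e., reverse topological order.
        cpl = {}
        for n in TopologicalSorter(children).static_order():
            cpl[n] = weight[n] + max((cpl[c] for c in children[n]), default=0)
        return cpl

    # Operation nodes of Figure 8, with path and resource nodes contracted away.
    children = {"A": ["C"], "C": ["D"], "B": [], "D": []}
    print(critical_path_lengths(children, {n: 1 for n in children}))
    # {'D': 1, 'B': 1, 'C': 2, 'A': 3} (key order may differ)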
Algorithm 2 shows how Dionysus uses CPL to schedule updates,
with key notations summarized in Table 3. Each time we enter the
scheduling phase, we first update the graph with finished operations
and delete edges from unbottlenecked resources (Line 2). Then, we calculate CPL for every node (Line 3) and sort nodes in decreasing order of CPL (Line 4). Then, we iterate over operation nodes and schedule them if their operation and resource dependencies are satisfied (Lines 6, 7). Finally, the scheduler waits for some time for all scheduled operations to finish before starting the next round (Line 8).
To simplify presentation, we first show the related pseudocode of CanScheduleOperation(Oi) and UpdateGraph(G) for tunnel-based networks and describe them below. Then, we briefly describe
how the WCMP case differs.
Algorithm 3 CanScheduleOperation(Oi)
// Add tunnel operation node
 1: if Oi.isAddTunnelOp() then
 2:   if Oi.hasNoParents() then
 3:     return true
 4:   Rj ← parent(Oi) // AddTunnelOp only has 1 parent
 5:   if Rj.free ≥ lji then
 6:     Rj.free ← Rj.free − lji
 7:     Delete edge Rj → Oi
 8:     return true
 9:   return false
// Delete tunnel operation node
10: if Oi.isDelTunnelOp() then
11:   if Oi.hasNoParents() then
12:     return true
13:   return false
// Change weight operation node
14: total ← 0
15: canSchedule ← false
16: for path node Pj ∈ parents(Oi) do
17:   available ← lji
18:   if Pj.hasOpParents() then
19:     available ← 0
20:   else
21:     for resource node Rk ∈ parents(Pj) do
22:       available ← min(available, lkj, Rk.free)
23:   for resource node Rk ∈ parents(Pj) do
24:     lkj ← lkj − available
25:     Rk.free ← Rk.free − available
26:   total ← total + available
27:   lji ← lji − available
28: if total > 0 then
29:   canSchedule ← true
30: for path node Pj ∈ children(Oi) do
31:   Pj.committed ← min(lij, total)
32:   lij ← lij − Pj.committed
33:   total ← total − Pj.committed
34: return canSchedule
CanScheduleOperation (Algorithm 3): This function decides if
an operation Oi is ready to be scheduled and updates the resource
levels for resource and path nodes accordingly. If Oi is a tunnel
addition operation, we can schedule it either if it has no parents
(Lines 2, 3) or its parent resource node has enough free resource
(Lines 4–8). If Oi is a tunnel deletion operation, we can schedule it if it has no parents (Lines 11–12); tunnel deletion operations
do not have resource nodes as parents because they always release
(memory) resources. If Oi is a weight change operation, we gather
all free capacities on the paths where traffic increases and move traffic to them (Lines 14–34). We iterate over each parent path node
and obtain the available capacity (available) of the path (Lines 16–
27). This capacity limits the amount of traffic that we can move to
this path. We sum them up to total, which is the total traffic we
can move for this flow (Line 26). Then, we iterate over child path
nodes (Lines 30–33). Finally, we decrease Pj .committed traffic
on the path represented by Pj (Line 31).
UpdateGraph (Algorithm 4): This function updates the graph before scheduling based on operations that successfully finished in
the last round. We get all such operations and update related nodes
in the graph (Lines 1–22). If the operation node adds a tunnel,
we delete the node and its edges (Lines 2, 3). If the operation node
deletes a tunnel, it frees rule space. So, we update the resource node
(Lines 5, 6) and delete it (Line 7). If the operation node changes
weight, for each child path node, we release resources to links on
it (Lines 11–12) and delete the edge if all resources are released
(Lines 13, 14). We reset the amount of traffic that is moving away
from this path, Pj .committed, to 0 (Line 15). If we have moved
Algorithm 4 UpdateGraph(G)
 1: for finished operation node Oi ∈ G do
      // Finish add tunnel operation node
 2:   if Oi.isAddTunnelOp() then
 3:     Delete Oi and all its edges
      // Finish delete tunnel operation node
 4:   else if Oi.isDelTunnelOp() then
 5:     Rj ← child(Oi)
 6:     Rj.free ← Rj.free + lij
 7:     Delete Oi and all its edges // DelTunnelOp only has 1 child
      // Finish change weight operation node
 8:   else
 9:     for path node Pj ∈ children(Oi) do
10:       for resource node Rk ∈ children(Pj) do
11:         ljk ← ljk − Pj.committed
12:         Rk.free ← Rk.free + Pj.committed
13:         if ljk = 0 then
14:           Delete edge Pj → Rk
15:       Pj.committed ← 0
16:       if lij = 0 then
17:         Delete Pj and its edges
18:     for path node Pj ∈ parents(Oi) do
19:       if lji = 0 then
20:         Delete Pj and its edges
21:     if Oi.hasNoParents() then
22:       Delete Oi and its edges
23: for resource node Ri ∈ G do
24:   if Ri.free ≥ Σj lij then
25:     Ri.free ← Ri.free − Σj lij
26:     Delete all edges from Ri
all the traffic away from this path, we delete this path node (Lines
16, 17). Similarly, we check all the parent path nodes (Lines 18–
20). If we have moved all the traffic into a path, we delete the path
node (Lines 19, 20). Finally, if all parent path nodes are removed,
the weight change for this flow finishes; we remove it from the
graph (Line 22). After updating the graph with finished operations,
we check all resource nodes (Lines 23–26). We delete edges from
unbottlenecked resources (Lines 24–26).
WCMP network: Algorithms 3 and 4 for WCMP-based networks
differ in two respects. First, WCMP networks do not have tunnel
add or delete operations. Second, unlike tunnel-based networks
that can simply change the weights at the ingress switches, WCMP
networks perform a two-phase commit using version numbers to
maintain packet coherence (nodes X and Y in Figure 6(d)). The code related to the weight change operation in the two algorithms has minor differences accordingly.
6.3 Handling cycles
Cycles in the dependency graph pose a challenge because inappropriate scheduling can lead to deadlocks where no progress can
be made, as we saw for Figure 5(b) if F2 is moved first. Further, many cycles may intertwine, which makes the problem
even more complicated. For instance, A, B and C are involved in
several cycles in Figure 9.
We handle dependency graphs with cycles by first transforming
them into a virtual DAG and then using the DAG scheduling algorithm above. We use the concept of a strongly connected component (SCC), a subgraph where every node has a path to every other
node [14]. One can think of an SCC as a set of intertwined cycles.
If we view each SCC as a virtual node in the graph, then the graph
becomes a virtual DAG, which is called the component graph in
graph theory. We use Tarjan’s algorithm [21] to efficiently find all
SCCs in the dependency graph. Its time complexity is O(|V|+|E|), where |V| and |E| are the numbers of nodes and edges.
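As a sketch of this step (using networkx for the SCC computation, which is an assumption about available tooling; any linear-time SCC algorithm such as Tarjan's works), the dependency graph is condensed into a component graph, and each virtual node is weighted by the number of operation nodes it contains:

    import networkx as nx   # assumed available; provides SCC and condensation routines

    def is_operation(node):
        # In this sketch, operation nodes are tuples tagged "op"; resource and
        # path nodes use other tags.
        return isinstance(node, tuple) and node[0] == "op"

    # Collapse each SCC into one virtual node; the result (the component graph)
    # is a DAG, so the DAG scheduler applies, with CPL weights equal to the
    # number of operation nodes inside each virtual node.
    def virtual_dag(edges):
        dag = nx.condensation(nx.DiGraph(edges))          # "members" holds each SCC
        weight = {n: sum(1 for m in dag.nodes[n]["members"] if is_operation(m))
                  for n in dag.nodes}
        return dag, weight

    # An illustrative cycle between two moves and two link resources.
    edges = [(("op", "Mv.F1"), ("res", "S4-S5")), (("res", "S4-S5"), ("op", "Mv.F4")),
             (("op", "Mv.F4"), ("res", "S1-S5")), (("res", "S1-S5"), ("op", "Mv.F1"))]
    dag, weight = virtual_dag(edges)                      # one 4-node SCC, weight 2
    print(weight)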
Figure 9: A deadlock example where the target state is valid but no feasible solution exists.
Algorithm 5 RateLimit(SCC, k∗)
1: O∗ ← weight change nodes ∈ SCC
2: for i = 0; i < k∗ && O∗ ≠ ∅; i++ do
3:   Oi ← O∗.pop()
4:   for path node Pj ∈ children(Oi) do
       // fi is the corresponding flow of Oi
5:     Rate limit flow fi by lij on path Pj
6:     for resource node Rk ∈ children(Pj) do
7:       Rk.free ← Rk.free + lij
8:     Delete Pj and its edges
With each SCC being a virtual node, we can use critical-path
scheduling on the component graph. While calculating CPLs, we
use the number of operation nodes in an SCC as the weight of the
corresponding virtual node, which makes the scheduler prefer paths
with larger SCCs.
We make two modifications to the scheduling algorithm to incorporate SCCs. The first is that the for loop at Line 5 in Algorithm 2
iterates over all nodes in the virtual graph. When a node is selected,
if it is a single node, we directly call CanScheduleOperation(Oi ).
If it is a virtual node, we iterate over the operation nodes in its SCC
and call the functions accordingly. We use centrality [22] to decide
the order of the iteration; the intuition is that a central node of an
SCC is on many cycles, and if we can schedule this node early,
many cycles will disappear and we can finish the SCC quickly.
We use the popular outdegree-based definition of centrality, but
other definitions may also be used. The second modification is that
when path nodes consume link resources or tunnel add operations
consume switch resources, they can only consume resources from
nodes that either are in the same SCC or are independent nodes (not
in any SCC). This heuristic prevents deadlocks caused by allocating resources to nodes outside the SCC (“Mv. F2”) before nodes in
the SCC are satisfied as in Figure 5(b).
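The iteration order within an SCC can be computed directly from the graph; a minimal sketch using the out-degree definition (node names and edges here are illustrative):

    # Order an SCC's nodes by out-degree centrality, so nodes that sit on many
    # cycles are attempted first; ties are broken arbitrarily.
    def centrality_order(scc_nodes, out_edges):
        return sorted(scc_nodes, key=lambda n: len(out_edges.get(n, ())), reverse=True)

    print(centrality_order(["n1", "n2", "n3"],
                           {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}))
    # ['n2', 'n1', 'n3'] -- n2 has the largest out-degree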
Deadlocks: The scheduling algorithm resolves most cycles without deadlocks (§9). However, we may still encounter deadlocks in
which no operations in the SCC can make any progress even if the
SCC has obtained all resources from outside nodes. This can happen because (1) given the hardness of the problem, our scheduling
algorithm, which is basically an informed heuristic, doesn’t find the
feasible solution among the combinatorially many orderings and
gets stuck, or (2) there does not exist a feasible solution even if the
target state is valid, like the example in Figure 9. One should note
that deadlocks stem from the need for consistent network updates.
Previous solutions face the same challenge but are much slower and
cause more congestion than Dionysus (§9.4).
Our strategy for resolving deadlocks is to reduce flow rates (e.g.,
by informing rate limiters). Reducing flow rate frees up link capacity; and reducing it to zero on a path allows removal of the
tunnel, which in turn frees up switch memory. Freeing up these
resources allows some of the operations that were earlier blocked
on resources to go through. In the extreme case, if we rate limit
all the flows involved in the deadlocked SCC, the deadlock can be
resolved in one step. However, this extreme remedy leads to excessive throughput loss. It is also unnecessary because often rate
limiting a few strategically selected flows suffices.
We thus rate limit a few flows to begin with, which enables some
operations in the SCC to be scheduled. If that does not fully resolve
the SCC, we rate limit a few more, until the SCC is fully resolved.
The parameter k∗ determines the maximum number of flows that
we rate limit each time, and it controls the tradeoff between the
time to resolve the deadlock and the amount of throughput loss.
Algorithm 5 shows the procedure to resolve deadlocks for tunnel-based networks. It iterates over up to k∗ weight change nodes in the SCC, each of which corresponds to a flow (Lines 2–8). The order of iteration is based on centrality value as above. We can prove that as long as the target state is valid (i.e., no resource is oversubscribed), we can fully resolve a deadlock by calling the procedure at most ⌈|O∗|/k∗⌉ times. While we skip the formal proof, it is based on the observation that each time RateLimit is called, followed by ScheduleGraph, the size of the SCC reduces by at least
k∗ operation nodes. We find experimentally that often the number
of steps needed is a lot fewer than this bound.
We use Figure 9 to illustrate deadlock resolution. Let k∗ = 1. The procedure first selects node A. It reduces 4 units of traffic on path P6 and 4 units on P7, which releases 4 units of free capacity to R1 and 4 units to R2, and deletes P6 and P7. At this point, node A has no children and thus does not belong to the SCC any more. After this, we call ScheduleGraph(G) to continue the update. It schedules C, and partially schedules B (i.e., moves 4 units of traffic from path P3 to P4). After C finishes, it schedules the remainder of operation B and finishes the update. Finally, for node A and its corresponding flow fA, we increase its rate on P5 as soon as R3 receives free capacity released by P4.
7. Implementation
We have implemented a prototype of Dionysus with 5,000+ lines
of C# code. It receives current state from the network and target
state from applications as input, generates a dependency graph,
and schedules rule updates. We implemented dependency graph
generators for both tunnel-based and WCMP networks and all the
scheduling algorithms discussed above. For accurate control plane
confirmations of rule updates (not available in most OpenFlow agents
today), we run a custom software agent on our switches.
8. Testbed Evaluation
We evaluate Dionysus using testbed experiments in this section
and using large-scale simulations in the next section. We use two
update cases, a WAN TE case and a WAN failure recovery case. To
show its benefits, we compare Dionysus against SWAN [9], a static
solution.
Methodology: Our testbed consists of 8 Arista 7050T switches as
shown in Figure 10(a). It emulates a WAN scenario. The switches
are connected by 10 Gbps links. With the help of our switch agents,
we log the time of sending updates and receiving confirmation. We
use VLAN tags to implement tunnels and use prefix-splitting to
implement weights when a flow uses multiple tunnels. We let S2
and S4 be straggler switches and inject 500 ms latency for rule
updates on them. The remaining switches update at their own pace.
WAN TE case: In this experiment, the update is triggered by a
traffic matrix change. TE calculates a new allocation for the new
matrix, and we update the network accordingly. A simplified dependency graph for this update is shown in Figure 10(b). Numbers
in the circles correspond to the switch to which the rule update is
sent. For example, the operation node with annotation "S8" means a rule update at switch S8. The graph contains a cycle that includes nodes "S8", "S3-S6", "S1" and "S8-S6". Careless scheduling, e.g., one that schedules node "S3" before "S1", may cause a deadlock. There are also operation dependencies for this update: to move a flow at S6, we have to install a new tunnel at S8 and S7; after the movement finishes, we delete an old tunnel at S5 and S7.

Figure 10: Testbed setup. (a) Testbed topology. (b) Dependency graph for the WAN traffic engineering case. (c) Dependency graph for the WAN failure recovery case. Path nodes are removed from the dependency graphs ((b) and (c)) for brevity.
Figure 11 shows the time series of this experiment. The x-axis is
the time, and the y-axis is the switch index. A rectangle represents a
rule update on a switch (y-axis) for some period (x-axis). Different
rectangular patterns show different rule update operations (add rule,
change rule, or delete rule). Rule updates on straggler switches, S2
and S4, take longer than those on other switches. But even on non-straggler switches, the rule update time varies—the lengths of the
rectangles are not identical—between 20 and 100 ms.
Dionysus dynamically performs the update as shown in Figure
11(a). First it finds the SCC and schedules node “S1”. It also
schedules “S2”, “S8” and “S7” as they don’t have any parents. After they finish, Dionysus schedules “S6” and “S8”, then “S3”, “S5”
and “S7”. Rather than waiting for “S2,” which is a straggler, Dionysus schedules “S4” after “S3” finishes—“S3” releases enough capacity for it. Finally Dionysus schedules “S5”. The update finishes
in 842 ms.
SWAN uses a static, multi-step solution to perform the update
(Figure 11(b)). It first installs the new tunnel (node “S8” and “S7”).
Then, it adjusts tunnel weights with a congestion-free plan with the
minimal number of steps, as follows:
Step 1: “S1”, “S6”, “S2”
Step 2: “S4”, “S8”
Step 3: “S3”, “S5”
Due to stragglers S2 and S4, SWAN takes a long time on both
Steps 1 and 2. Finally, SWAN deletes the old tunnel (nodes “S5” and “S7”). It does not overlap the tunnel addition and deletion steps with the weight change steps. The whole update takes 1241 ms,
47% longer than Dionysus.
WAN failure recovery case: In this experiment, the network update is triggered by a topology change. Link S3-S8 fails; flows
that use this link rescale their traffic to other tunnels. This causes
link S1-S8 to get overloaded by 50%. To address this problem, TE
calculates a new traffic allocation that eliminates the link overload.

Figure 11: Time series for testbed experiment of WAN TE. (a) Dionysus. (b) SWAN.
The simplified dependency graph for this network update is shown
in Figure 10(c). To eliminate the overload on link S1-S8, a flow
at S1 is to be moved away, which depends on several other rule
updates. Doing all the rule updates in one shot is undesirable as it
may cause more link overloads and affect more flows. For example,
if “S1” finishes faster than “S3” and “S4”, then it causes 50% link
overload on links S3-S4 and S4-S7 and unnecessarily brings congestion to flows on these links. We present extensive results in §9.3
to show that one-shot updates can cause excessive congestion.
Figure 12(a) shows the time series of the update performed by
Dionysus. It first schedules nodes “S7”, “S5” and “S2”. After “S7”
and “S5” finish, a new tunnel is established and it safely schedules
“S8”. Then it schedules “S3”, “S5” and “S6”. Although “S2” is on
a straggler switch and is delayed, Dionysus dynamically schedules
“S4” once “S3” finishes. Finally, it schedules “S1”. It finishes
the update in 808 ms, which eliminates the overload on S1-S8, as
shown in Figure 12(c).
Figure 12(b) shows the time series of the update performed by
SWAN. It first installs the new tunnel (node “S7” and “S5”), then
calculates an update plan with minimal steps as follows.
Step 1: node “S2”, node “S8”
Step 2: node “S3”, node “S4”
Step 3: node “S1”
This static plan does not adapt, and it is delayed by straggler switches
at both Steps 1 and 2. It misses the opportunity to dynamically reorder rule updates. It takes 1299 ms to finish the update and eliminate the link overload, 61% longer than Dionysus.
9. Large-Scale Simulations
We now conduct large-scale simulations to show that Dionysus
can significantly improve update speed, reduce congestion, and effectively handle cycles in dependency graphs. We focus on congestion freedom as the consistency property, a particularly challenging
property and most relevant for the networks we study.
9.1 Datasets and methodology
Wide area network: This dataset is from a large WAN that interconnects O(50) sites. Inter-site links have tens to hundreds of Gbps
capacity. We collect traffic logs on routers and aggregate them into
site-to-site flows over 5-minute intervals. The flows are classified
into 3 priorities: interactive, elastic and background [9]. We obtain
288 traffic matrices on a typical working day, where each traffic
matrix consists of all the site-to-site flows in one interval.
The network uses tunnel-based routing, and we implement the
TE algorithm of SWAN [9] which maximizes network throughput
and approximates max-min fairness among flows of the same priority. The TE algorithm produces the network configuration for
successive intervals and we compute the time to update the network
from one interval to the next.
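The per-interval evaluation implied by this methodology can be sketched as follows; `compute_te` and `simulate_update` are hypothetical placeholders for the SWAN TE algorithm and the update-time simulator.

```python
# Sketch of the per-interval evaluation loop (function names are
# placeholders, not the actual implementation).
def evaluate(traffic_matrices, topology, compute_te, simulate_update):
    """Yield the simulated update time for each pair of consecutive intervals."""
    current_config = compute_te(topology, traffic_matrices[0])
    for tm in traffic_matrices[1:]:          # 288 matrices -> 287 transitions
        target_config = compute_te(topology, tm)
        yield simulate_update(current_config, target_config)
        current_config = target_config
```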
Figure 13: Dionysus is faster than SWAN and close to OneShot. (a) WAN TE; (b) Data center TE.
Data center network: This dataset is from a large data center network with several hundred switches. The topology has 3 layers: ToR (Top-of-Rack), Agg (Aggregation), and Core. Links between
switches are 10 Gbps. We collect traffic traces by logging the
socket events on all servers and aggregate them into ToR-to-ToR
flows over 5-minute intervals. As for the WAN, we obtain 288 traffic matrices for a typical working day.
Due to the large scale, we do elephant-flow routing [5, 6, 7]. We
choose the 1500 largest flows, which account for 40–60% of all
traffic. We use an LP to calculate their traffic allocation and use
ECMP for other flows. This method improves the total throughput
by up to 30% as compared to using ECMP for all flows. We run TE
and update WCMP weights for elephant flows every interval. Since
mice flows use default ECMP entries, nothing is updated for them.
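A minimal sketch of the elephant-flow selection step described above, picking the largest flows and checking what fraction of total traffic they cover; the LP-based allocation itself is elided, and the flow data structure is an assumption.

```python
import heapq

def select_elephants(tor_flows, k=1500):
    """Pick the k largest ToR-to-ToR flows (assumed {flow_id: bytes} dict)
    and report the fraction of total traffic they account for."""
    elephants = heapq.nlargest(k, tor_flows.items(), key=lambda kv: kv[1])
    total = sum(tor_flows.values())
    covered = sum(size for _, size in elephants) / total if total else 0.0
    return dict(elephants), covered  # elephants get LP-computed weights;
                                     # remaining mice stay on default ECMP
```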
For both settings, we leave 10% scratch capacity on links to aid
transitions [9], and we use 1500 as the switch rule memory size. This
memory size means that the memory slack (i.e., unused capacity) is
at least 50% in our experiments in §9.2 and §9.3. In §9.4, we study
the impact of memory limitation by reducing memory size.
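Memory slack here is the unused rule capacity relative to the rules installed (following the 10% example in §9.4, where a table of size 1100 loaded with 1000 rules has 10% slack); a one-line helper makes the definition explicit.

```python
def memory_slack(table_size, rules_installed):
    """Unused rule capacity relative to installed rules, e.g. 1500 vs 1000 -> 0.5."""
    return (table_size - rules_installed) / rules_installed

assert memory_slack(1500, 1000) == 0.5             # >= 50% slack in Sections 9.2 and 9.3
assert abs(memory_slack(1100, 1000) - 0.1) < 1e-9  # the 10% slack setting in Section 9.4
```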
Alternative approaches: We compare Dionysus with two alternative approaches. First, OneShot sends all updates in one shot. It
does not maintain any consistency, but serves as the lower bound
for update time. Second, SWAN is the state-of-the-art approach in
maintaining congestion freedom [9]. It uses a heuristic to divide
the update into multiple phases based on memory constraints so
that each intermediate phase can fit all rules in switches. SWAN
may rate limit flows in intermediate phases when the paths in the network cannot carry all the traffic. Between consecutive phases, it
uses a linear program to calculate a congestion-free multi-step plan
based on capacity constraints.
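As a rough, hypothetical illustration of memory-constrained phasing (not SWAN's actual heuristic), the toy sketch below greedily packs rule additions into phases so that no switch exceeds its free rule space, assuming space freed by deletions becomes available between phases.

```python
def greedy_phases(additions, free_space):
    """Toy phasing (illustration only): pack rule additions {switch: count}
    into phases so per-switch free rule space is never exceeded; space is
    assumed to be reclaimed between phases as superseded rules are deleted."""
    phases, pending = [], dict(additions)
    while pending:
        phase, used = {}, {s: 0 for s in free_space}
        for switch, count in list(pending.items()):
            take = min(count, free_space[switch] - used[switch])
            if take > 0:
                phase[switch] = take
                used[switch] += take
                pending[switch] -= take
                if pending[switch] == 0:
                    del pending[switch]
        if not phase:
            raise RuntimeError("no progress: not enough free rule space")
        phases.append(phase)
    return phases
```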
Rule update time: The rule update time at switches is based on
switch measurement results (§2). We show results under both a normal setting and a straggler setting. In the former, we use the median rule update time from §2; in the latter, we draw rule update times from the CDF in §2. We use 50 ms as the RTT in the WAN scenario.
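A minimal sketch of how per-rule update latencies could be drawn in the two settings; the empirical CDF values below are placeholders, not the measured numbers from §2.

```python
import bisect, random

# Placeholder empirical CDF of per-rule update latency (ms); the real
# values come from the switch measurements in Section 2.
LATENCY_MS = [5, 10, 20, 50, 100, 500]
CUM_PROB   = [0.2, 0.5, 0.8, 0.9, 0.97, 1.0]
MEDIAN_MS  = 10

def rule_update_time(straggler_setting, rng=random):
    """Normal setting: median latency. Straggler setting: sample from the CDF."""
    if not straggler_setting:
        return MEDIAN_MS
    return LATENCY_MS[bisect.bisect_left(CUM_PROB, rng.random())]
```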
Figure 12: Time series for testbed experiment of WAN failure recovery. (a) Dionysus; (b) SWAN; (c) Link utilization on link S1-S8.
9.2 Update time
Figure 14: In WAN failure recovery, Dionysus significantly reduces oversubscription and update time as compared to SWAN. OneShot, while fast, incurs huge oversubscription. (a) Link oversubscription; (b) Update time.
WAN TE: Figure 13(a) shows the 50th, 90th, and 99th percentile update time across all intervals for the WAN TE scenario. Dionysus
outperforms SWAN in both normal and straggler settings. In the
normal setting, Dionysus is 57%, 49%, and 52% faster than SWAN
in the 50th, 90th, and 99th percentile, respectively. The gain is mainly from pipelining: in every step, different switches receive different numbers of rules to update and thus take different amounts of time to finish. While SWAN has to wait for the switch with the most rules to finish, Dionysus begins to issue new operations as soon as some switches finish.
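The toy calculation below contrasts the two behaviors: with a barrier after every step, each step lasts as long as its slowest switch, whereas pipelining lets a switch move on as soon as its own rules are installed. It ignores cross-switch dependencies, so it only illustrates the source of the gain, not Dionysus's actual schedule.

```python
# Toy comparison (ignores dependencies): per-step rule counts per switch,
# with a fixed per-rule install time.
PER_RULE_MS = 10
steps = [            # rules to install at switches A, B, C in each step
    {"A": 8, "B": 2, "C": 1},
    {"A": 1, "B": 6, "C": 2},
]

# Barrier-synchronized plan: every step waits for its slowest switch.
barrier_ms = sum(max(rules.values()) * PER_RULE_MS for rules in steps)

# Fully pipelined (best case): each switch works through its own rules
# back to back, so the update finishes when the busiest switch does.
totals = {}
for rules in steps:
    for sw, n in rules.items():
        totals[sw] = totals.get(sw, 0) + n
pipelined_ms = max(totals.values()) * PER_RULE_MS

print(barrier_ms, pipelined_ms)  # 140 vs 90 in this toy example
```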
In the straggler setting, Dionysus reduces update time even more.
It is 88%, 84%, and 81% faster than SWAN in the 50th, 90th, and 99th percentile, respectively. This advantage arises because stragglers provide more opportunities for dynamic scheduling, which SWAN cannot leverage. Dionysus also performs close to OneShot. It is only
25% and 13% slower than OneShot in the 90th percentile in normal
and straggler settings, respectively.
Data center TE: Figure 13(b) shows results for the data center TE
scenario. Again, Dionysus significantly outperforms SWAN. In the
normal setting, it is 53%, 48%, and 40% faster than SWAN in the
50th, 90th, and 99th percentile; in the straggler setting, it is 81%,
74%, and 67% faster. Data center TE takes more time because
it involves a two-phase commit across multiple switches for each
flow; WAN TE only needs to update the ingress switch if all tunnels
are established.
9.3 Link oversubscription
We use a WAN failure recovery scenario to show that Dionysus
can reduce link oversubscription and shorten recovery time. We use
the same topology and traffic matrices as in the WAN TE case. For
each traffic matrix, we first use TE to calculate a state NS0. Then we fail a randomly selected link, which causes the ingress switches to move traffic away from the failed tunnels to the remaining ones. For example, if flow f originally uses tunnels T1, T2 and T3 with weights w1, w2 and w3, and the failed link causes T1 to break, then f carries its traffic using T2 and T3 with weights w2/(w2+w3) and w3/(w2+w3). We denote the network state that emerges after the failure and rescaling as NS1. Since rescaling is a local action, NS1 may have overloaded links. The TE calculates a new state
NS2 to eliminate congestion. The network update that we study is the update from NS1 to NS2.
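A minimal sketch of the local rescaling described above: when a tunnel breaks, its weight is dropped and the remaining tunnel weights are renormalized, exactly as in the w2/(w2+w3) example.

```python
def rescale_weights(tunnel_weights, failed_tunnels):
    """Drop failed tunnels and renormalize the rest, e.g.
    {'T1': w1, 'T2': w2, 'T3': w3} with T1 failed ->
    {'T2': w2/(w2+w3), 'T3': w3/(w2+w3)}."""
    alive = {t: w for t, w in tunnel_weights.items() if t not in failed_tunnels}
    total = sum(alive.values())
    if total == 0:
        raise ValueError("flow has no usable tunnel left")
    return {t: w / total for t, w in alive.items()}

print(rescale_weights({"T1": 0.5, "T2": 0.3, "T3": 0.2}, {"T1"}))
# {'T2': 0.6, 'T3': 0.4}
```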
If the initial state NS1 already has congestion, there will be no congestion-free update plan. For Dionysus and SWAN, we generate plans in which, during updates, no oversubscribed link carries more load than its current load. In such plans, the capacity of congested links is virtually increased to its current load, to make each link appear non-congested. For Dionysus, we increase the weight of overloaded links to the overloaded amount in the CPL calculation (Equation 1). Then, Dionysus will prefer operations that move traffic away from overloaded links. For SWAN, we use the linear program to compute the plan such that total oversubscription across all
links is minimized at each step. Of all possible static plans, this
modification makes SWAN prefer one that minimizes congestion
quickly. OneShot operates as before because it does not care about
congestion.
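A small sketch of the "virtual capacity" adjustment described above; link loads and capacities are assumed to be given as dicts, and the extra weight attached to each congested link for the critical-path calculation is simply its overload amount.

```python
def virtual_capacities_and_weights(capacity, load):
    """Virtually raise the capacity of overloaded links to their current
    load, and record the overload amount as the extra weight used when
    computing critical-path lengths."""
    virtual_cap, overload_weight = {}, {}
    for link, cap in capacity.items():
        current = load.get(link, 0.0)
        virtual_cap[link] = max(cap, current)
        overload_weight[link] = max(0.0, current - cap)
    return virtual_cap, overload_weight

caps = {"S1-S8": 10.0, "S3-S4": 10.0}
loads = {"S1-S8": 15.0, "S3-S4": 6.0}
print(virtual_capacities_and_weights(caps, loads))
# ({'S1-S8': 15.0, 'S3-S4': 10.0}, {'S1-S8': 5.0, 'S3-S4': 0.0})
```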
Figure 14 shows the update time and link oversubscription—the
amount of data above capacity arriving at a link. Dionysus has the
least oversubscription among the three. OneShot, while quick, has
huge oversubscription. SWAN incurs 1.49 GB and 2.04 GB oversubscription in the 99th percentile in normal and straggler settings,
respectively. As even high-end switches today only have hundreds
of MB buffer [23], such oversubscription will cause heavy packet
loss. Dionysus reduces oversubscription to 0.88 GB and 1.19 GB,
which are 41% and 42% less than SWAN. For update time, Dionysus is 45% and 82% faster than SWAN in the 99th percentile in
normal and straggler settings, respectively.
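The oversubscription metric above can be computed by integrating the traffic arriving above a link's capacity over the course of the update; a minimal sketch, assuming the simulator exposes per-link arrival rates over fixed time slots.

```python
def oversubscription_bytes(arrival_gbps, capacity_gbps, slot_seconds):
    """Total data (in bytes) arriving above link capacity during an update,
    given per-slot arrival rates for one link."""
    over_gbits = sum(max(0.0, rate - capacity_gbps) * slot_seconds
                     for rate in arrival_gbps)
    return over_gbits * 1e9 / 8  # Gbit -> bytes

# e.g. a 10 Gbps link receiving 15 Gbps for two 100 ms slots:
print(oversubscription_bytes([15.0, 15.0, 8.0], 10.0, 0.1))  # 1.25e8 bytes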
9.4 Deadlocks
We now study the effectiveness of Dionysus in handling circular dependencies, which can lead to deadlocks. First, we show that, as mentioned in §3, completely opportunistic scheduling can lead to frequent deadlocks even in a setting that is not resource-constrained. Then, we show the effectiveness of Dionysus in handling resource-constrained settings.
Figure 15: Opportunistic scheduling frequently deadlocks. Dionysus and SWAN have no deadlocks.
Figure 15 shows the percentage of network updates finished by Dionysus, SWAN, and an opportunistic approach without deadlocks, that is, without having to reduce flow rates during updates. The opportunistic approach immediately issues any updates that do not violate consistency (§3), instead of planning using a dependency graph. The data in the figure corresponds to the WAN and data center TE scenarios in §9.2, where the memory slack was over 50%. We do not show results for OneShot; it does not deadlock by design as it does not worry about consistency.
We see that the planning-based approaches, Dionysus and SWAN, lead to no deadlocks, but the opportunistic approach deadlocks 90% of the time for WAN TE and 70% of the time for data center TE. It performs worse for WAN TE because the WAN topology is less regular than the data center topology, which leads to more complex dependencies.
We now evaluate Dionysus and SWAN in resource-constrained settings. To emulate such a setting, instead of using 1500 as the memory size, we vary switch memory slack; 10% memory slack means we set the memory size to 1100 when the switch is loaded with 1000 rules. We show three metrics in the WAN TE setup: (1) the percentage of cases that deadlock and use rate limiting to finish the update, (2) the throughput loss caused by rate limiting (i.e., the product of the limited rate and the rate-limited time), and (3) the update time. We set k∗ = 5 in Algorithm 5 for Dionysus.
Figure 16: Dionysus only occasionally runs into deadlocks and uses rate limiting, and experiences little throughput loss. It also consistently outperforms SWAN in update time. (a) Percentage of rate-limited cases; (b) Throughput loss; (c) Update time.
Figure 16 shows the results for the straggler setting. The results with the normal setting are similar. Figure 16(a) shows the percentage of cases that use rate limiting under different levels of memory slack. Dionysus only occasionally runs into deadlocks and resorts to rate limiting more sparingly than SWAN. Even with only 2% memory slack, Dionysus uses rate limiting in fewer than 10% of cases. SWAN, on the other hand, uses rate limiting in more than 80% of the cases. This difference is because the heuristics in Dionysus strategically account for dependencies during scheduling. SWAN uses simplistic metrics, such as the amount of traffic that a tunnel carries and the number of hops of the tunnel, to decide which tunnel to add or delete.
Figure 16(b) shows the throughput loss. The throughput loss with SWAN can be as high as 20 GB, while that with Dionysus is only tens of MB. Finally, Figure 16(c) shows the update time. Dionysus is 60%, 145%, and 84% faster than SWAN in the 90th percentile under 2%, 6%, and 10% memory slack, respectively.
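The throughput-loss metric above can be sketched directly, interpreting "limited rate" as the amount of rate held back from a flow while the limit is in place; the per-flow rate-limit records are assumed to come from the simulator.

```python
def throughput_loss_gb(rate_limits):
    """Total throughput loss in GB, where each record is
    (limited_rate_gbps, original_rate_gbps, duration_seconds)."""
    lost_gbits = sum((orig - limited) * dur
                     for limited, orig, dur in rate_limits
                     if orig > limited)
    return lost_gbits / 8  # Gbit -> GB

print(throughput_loss_gb([(2.0, 5.0, 10.0)]))  # 3 Gbps held back for 10 s -> 3.75 GB
```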
10. Related Work
In the domain of distributed protocols, there is a lot of work on
avoiding transient misbehavior during network updates. Much of it focuses on maintaining properties like loop freedom for specific protocols or scenarios. For example, Francois et al. [24], John et al. [25]
and Kushman et al. [26] focus on BGP, Francois et al. [27, 28]
and Raza et al. [29] focus on link-state protocols, and Vanbever et
al. [30] focus on migration from one routing protocol to another.
With the advent of SDN, many recent works propose solutions to
maintain different consistency properties during network updates.
Reitblatt et al. [11] provide a theoretical foundation and propose
a two-phase commit protocol to maintain packet coherence. Katta
et al. [12] and McGeer et al. [31] propose solutions to reduce the
memory requirements to maintain packet coherence. SWAN [9],
zUpdate [10] and Ghorbani and Caesar [32] provide solutions for
congestion-free updates. Noyes et al. [33] propose a model checking based approach to generate update orderings that maintain invariants specified by the operator. Mahajan and Wattenhofer [13]
present an efficient solution for maintaining loop freedom. As mentioned earlier, unlike these works, the key characteristic of our approach is dynamic scheduling, which leads to faster updates.
Mahajan and Wattenhofer [13] also analyze the nature of dependencies among switches induced by different consistency properties and outline a general architecture for consistent updates. We
build on their work by developing a concrete system.
Some works develop approaches that spread traffic such that the
network stays congestion-free after a class of common failures [34,
35], and thus no network-wide updates are needed to react to these
failures. These approaches are complementary to our work. They
help reduce the number of network updates needed. But network updates are still needed to adjust to changing traffic demands and to react to failures that are not handled by these approaches.
Dionysus ensures that these updates will be fast and consistent.
11. Conclusion
Dionysus enables fast, consistent network updates in SDNs. The
key to its speed is dynamic scheduling of updates at individual
switches based on runtime differences in their update speeds. We
showed using testbed experiments and data-driven simulations that
Dionysus improves the median network update speed by 53–88% over static scheduling. These faster updates translate to a more
nimble network that reacts faster to events like failures and changes
in traffic demand.
Acknowledgements We thank Srinivas Narayana, Meg Walraed-Sullivan, our shepherd Brighten Godfrey, and the anonymous SIGCOMM reviewers for their feedback on earlier versions of this paper. Xin Jin and Jennifer Rexford were partially supported by NSF
grant TC-1111520 and DARPA grant MRC-007692-001.
12. References
[1] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and
J. van der Merwe, “Design and implementation of a routing control
platform,” in USENIX NSDI, 2005.
[2] A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, J. Rexford,
G. Xie, H. Yan, J. Zhan, and H. Zhang, “A clean slate 4D approach to
network control and management,” SIGCOMM CCR, 2005.
[3] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. Gude, N. McKeown,
and S. Shenker, “Rethinking enterprise network control,” IEEE/ACM
Trans. Networking, vol. 17, no. 4, 2009.
[4] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma,
S. Banerjee, and N. McKeown, “ElasticTree: Saving energy in data
center networks,” in USENIX NSDI, 2010.
[5] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and
A. Vahdat, “Hedera: Dynamic flow scheduling for data center
networks,” in USENIX NSDI, 2010.
[6] T. Benson, A. Anand, A. Akella, and M. Zhang, “MicroTE: Fine
grained traffic engineering for data centers,” in ACM CoNEXT, 2011.
[7] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma,
and S. Banerjee, “DevoFlow: Scaling flow management for
high-performance networks,” in ACM SIGCOMM, 2011.
[8] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh,
S. Venkata, J. Wanderer, J. Zhou, M. Zhu, et al., “B4: Experience
with a globally-deployed software defined WAN,” in ACM
SIGCOMM, 2013.
[9] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri,
and R. Wattenhofer, “Achieving high utilization with software-driven
WAN,” in ACM SIGCOMM, 2013.
[10] H. H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. A.
Maltz, “zUpdate: Updating data center networks with zero loss,” in
ACM SIGCOMM, 2013.
[11] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker,
“Abstractions for network update,” in ACM SIGCOMM, 2012.
[12] N. P. Katta, J. Rexford, and D. Walker, “Incremental consistent
updates,” in ACM SIGCOMM HotSDN Workshop, 2013.
[13] R. Mahajan and R. Wattenhofer, “On consistent updates in software
defined networks,” in ACM SIGCOMM HotNets Workshop, 2013.
[14] C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cormen,
Introduction to Algorithms. The MIT press, 2001.
[15] C. Rotsos, N. Sarrar, S. Uhlig, R. Sherwood, and A. W. Moore,
“OFLOPS: An open framework for OpenFlow switch evaluation,” in
Passive and Active Measurement Conference, 2012.
[16] A. Ferguson, A. Guha, C. Liang, R. Fonseca, and S. Krishnamurthi,
“Participatory networking: An API for application control of SDNs,”
in ACM SIGCOMM, 2013.
[17] Nicira, “Network virtualization for cloud data centers.”
http://tinyurl.com/c9jbkuu.
[18] M. Casado, T. Koponen, S. Shenker, and A. Tootoonchian, “Fabric:
A retrospective on evolving SDN,” in ACM SIGCOMM HotSDN
Workshop, 2012.
[19] B. Raghavan, M. Casado, T. Koponen, S. Ratnasamy, A. Ghodsi, and
S. Shenker, “Software-defined Internet architecture: Decoupling
architecture from infrastructure,” in ACM SIGCOMM HotNets
Workshop, 2012.
[20] X. Jin, H. H. Liu, R. Gandhi, S. Kandula, R. Mahajan, M. Zhang,
J. Rexford, and R. Wattenhofer, “Dionysus: Dynamic scheduling of
network updates (extended version),” in Microsoft Research
Technical Report MSR-TR-2014-79, 2014.
[21] R. Tarjan, “Depth-first search and linear graph algorithms,” SIAM
Journal on Computing, vol. 1, no. 2, 1972.
[22] M. Newman, Networks: An Introduction. Oxford University Press,
2009.
[23] Arista, “Arista 7500 series technical specifications.”
http://tinyurl.com/lene8sw.
[24] P. Francois, O. Bonaventure, B. Decraene, and P.-A. Coste,
“Avoiding disruptions during maintenance operations on BGP
sessions,” Network and Service Management, IEEE Transactions on,
vol. 4, no. 3, 2007.
[25] J. P. John, E. Katz-Bassett, A. Krishnamurthy, T. Anderson, and
A. Venkataramani, “Consensus routing: The Internet as a distributed
system,” in USENIX NSDI, 2008.
[26] N. Kushman, S. Kandula, D. Katabi, and B. M. Maggs, “R-BGP:
Staying connected in a connected world,” in USENIX NSDI, 2007.
[27] P. Francois, M. Shand, and O. Bonaventure, “Disruption free
topology reconfiguration in OSPF networks,” in INFOCOM, 2007.
[28] P. Francois and O. Bonaventure, “Avoiding transient loops during the
convergence of link-state routing protocols,” IEEE/ACM Trans.
Networking, vol. 15, no. 6, 2007.
[29] S. Raza, Y. Zhu, and C.-N. Chuah, “Graceful network state
migrations,” IEEE/ACM Trans. Networking, vol. 19, no. 4, 2011.
[30] L. Vanbever, S. Vissicchio, C. Pelsser, P. Francois, and
O. Bonaventure, “Lossless migrations of link-state IGPs,”
IEEE/ACM Trans. Networking, vol. 20, no. 6, 2012.
[31] R. McGeer, “A safe, efficient update protocol for OpenFlow
networks,” in ACM SIGCOMM HotSDN Workshop, 2012.
[32] S. Ghorbani and M. Caesar, “Walk the line: Consistent network
updates with bandwidth guarantees,” in ACM SIGCOMM HotSDN
Workshop, 2012.
[33] A. Noyes, T. Warszawski, and N. Foster, “Toward synthesis of
network updates,” in Workshop on Synthesis (SYNT), 2013.
[34] Y. Wang, H. Wang, A. Mahimkar, R. Alimi, Y. Zhang, L. Qiu, and
Y. R. Yang, “R3: Resilient routing reconfiguration,” in ACM
SIGCOMM, 2010.
[35] H. H. Liu, S. Kandula, R. Mahajan, M. Zhang, and D. Gelernter,
“Traffic engineering with forward fault correction,” in ACM
SIGCOMM, 2014.