zUpdate: Updating Data Center Networks with Zero Loss

Hongqiang Harry Liu (Yale University)
Xin Wu (Duke University)
Ming Zhang (Microsoft)
Lihua Yuan (Microsoft)
Roger Wattenhofer (Microsoft)
David A. Maltz (Microsoft)
ABSTRACT
Datacenter networks (DCNs) are constantly evolving due to various
updates such as switch upgrades and VM migrations. Each update
must be carefully planned and executed in order to avoid disrupting many of the mission-critical, interactive applications hosted in
DCNs. The key challenge arises from the inherent difficulty of synchronizing changes to many devices, which may result in unforeseen transient link load spikes or even congestion. We present
one primitive, zUpdate, to perform congestion-free network updates under asynchronous switch and traffic matrix changes. We
formulate the update problem using a network model and apply our
model to a variety of representative update scenarios in DCNs. We
develop novel techniques to handle several practical challenges in
realizing zUpdate as well as implement the zUpdate prototype
on OpenFlow switches and deploy it on a testbed that resembles
real DCN topology. Our results, from both real-world experiments
and large-scale trace-driven simulations, show that zUpdate can
effectively perform congestion-free updates in production DCNs.
Categories and Subject Descriptors: C.2.1 [Computer Communication Networks]: Network Architecture and Design–Network
communications. C.2.3 [Computer Communication Networks]:
Network Operations.
Keywords: Data Center Network, Congestion, Network Update.
1. INTRODUCTION
The rise of cloud computing platform and Internet-scale services
has fueled the growth of large datacenter networks (DCNs) with
thousands of switches and hundreds of thousands of servers. Due
to the sheer number of hosted services and underlying physical
devices, DCN updates occur frequently, whether triggered by the
operators, applications, or sometimes even failures. For example,
DCN operators routinely upgrade existing switches or onboard new
switches to fix known bugs or to add new capacity. For applications, migrating VMs or reconfiguring load balancers are considered the norm rather than the exception.
Despite their prevalence, DCN updates can be challenging and
distressing even for the most experienced operators. One key reason
is the complex nature of the updates themselves.
An update usually must be performed in multiple steps, each of
which is well planned to minimize disruptions to the applications.
Each step can involve changes to a myriad of switches, which if not
properly coordinated may lead to catastrophic incidents. Making
matters even worse, there are different types of update with diverse
requirements and objectives, forcing operators to develop and follow a unique process for each type of update. Because of these
reasons, a DCN update may take hours or days to carefully plan
and execute, while still running the risk of spiraling into an operational nightmare.
This stark reality calls for a simple yet powerful abstraction for
DCN updates, which can relieve the operators from the nitty-gritty,
such as deciding which devices to change or in what order, while
offering a seamless update experience to the applications. We identify three essential properties of such an abstraction. First, it should
provide a simple interface for operators to use. Second, it should
handle a wide range of common update scenarios. Third, it should
provide certain levels of guarantee which are relevant to the applications.
The seminal work by Reitblatt et al. [17] introduces two abstractions for network updates: per-packet and per-flow consistency.
These two abstractions guarantee that a packet or a flow is handled
either by the old configuration before an update or by the new configuration after an update, but never by both. To implement such
abstractions, they proposed a two-phase commit mechanism which
first populates the new configuration to the middle of the network
and then flips the packet version numbers at the ingress switches.
These abstractions can preserve certain useful trace properties, e.g.,
loop freedom during the update, as long as these properties hold before
and after the update.
While the two abstractions are immensely useful, they are not the
panacea for all the problems during DCN updates. In fact, a DCN
update may trigger network-wide traffic migrations, in which case
many flows’ configurations have to be changed. Because of the
inherent difficulty in synchronizing the changes to the flows from
different ingress switches, the link load during an update could get
significantly higher than that before or after the update (see the example in §3). This problem may be further exacerbated when the application traffic also fluctuates independently of the changes to
switches. As a result, nowadays operators are completely in the
dark about how badly links could be congested during an update,
not to mention how to come up with a feasible workaround.
This paper introduces one key primitive, zUpdate, to perform
congestion-free network-wide traffic migration during DCN updates. The letter “z” means zero loss and zero human effort. With
zUpdate, operators simply need to describe the end requirements
of the update, which can easily be converted into a set of input con-
straints to zUpdate. Then zUpdate will attempt to compute and
execute a sequence of steps to progressively meet the end requirements from an initial traffic matrix and traffic distribution. When
such a sequence is found, zUpdate guarantees that there will be
no congestion throughout the update process. We demonstrate the
power and simplicity of zUpdate by applying it to several realistic, complex update scenarios in large DCNs.
To formalize the traffic migration problem, we present a network
model that can precisely describe the relevant state of a network —
specifically the traffic matrix and traffic distribution. This model
enables us to derive the sufficient and necessary conditions under which the transition between two network states will not incur any congestion. Based on that, we propose an algorithm to
find a sequence of lossless transitions from an initial state to an
end state which satisfies the end constraints of an update. We
also illustrate by examples how to translate the high-level, human-understandable update requirements into the corresponding mathematical constraints which are compliant with our model.
zUpdate can be readily implemented on existing commodity
OpenFlow switches. One major challenge in realizing zUpdate
is the limited flow and group table sizes on those switches. Based
on the observation that ECMP works sufficiently well for most of
the flows in a DCN [19], we present an algorithm that greedily
consolidates such flows to make efficient use of the limited table
space. Furthermore, we devise heuristics to reduce the computation
time and the switch update overhead.
We summarize our contributions as follows:
• We introduce the zUpdate primitive to perform congestion-free network updates under asynchronous switch and traffic
matrix changes.
• We formalize the network-wide traffic migration problem using a network model and propose a novel algorithm to solve
it.
• We illustrate the power of zUpdate by applying it to several
representative update scenarios in DCNs.
• We handle several practical challenges, e.g. switch table size
limit and computation complexity, in implementing zUpdate.
• We build a zUpdate prototype on top of OpenFlow [3]
switches and Floodlight controller [1].
• We extensively evaluate zUpdate both on a real network testbed and in large-scale simulations driven by the topology and traffic demand from a large production DCN.

2. DATACENTER NETWORK
Topology: A state-of-the-art DCN typically adopts a FatTree or Clos topology to attain high bisection bandwidth between servers. Figure 2(a) shows an example in which the switches are organized into three layers from the top to the bottom: Core, Aggregation (Agg) and Top-of-Rack (ToR). Servers are connected to the ToRs.
Forwarding and routing: In such a hierarchical network, traffic traverses a valley-free path from one ToR to another: first going upwards and then downwards. To limit the forwarding table size, the servers under the same ToR may share one IP prefix, and forwarding is performed based on each ToR's prefix. To fully exploit the redundant paths, each switch uses ECMP to evenly split traffic among multiple next hops. The emergence of Software Defined Networks (SDN) allows the forwarding tables of each switch to be directly controlled by a logically centralized controller, e.g., via the OpenFlow APIs, dramatically simplifying routing in DCNs.
Flow and group tables on commodity switches: An (OpenFlow) switch forwards packets by matching packet headers, e.g., source and destination IP addresses, against entries in the so-called flow table (Figure 6). A flow entry specifies a pattern used for matching and actions taken on matching packets. To perform multipath forwarding, a flow entry can direct a matching packet to an entry in a group table, which can further direct the packet to one of its multiple next hops by hashing on the packet header. Various hash functions may be used to implement different load balancing schemes, such as ECMP or Weighted-Cost-Multi-Path (WCMP). To perform pattern matching, the flow table is made of TCAM (Ternary Content Addressable Memory), which is expensive and power-hungry. Thus, commodity switches have limited flow table size, usually between 1K and 4K entries. The group table typically has 1K entries.

3. NETWORK UPDATE PROBLEM
We surveyed the operators of several production DCNs about the typical update scenarios and listed them in Table 1. One common problem that makes these DCN updates hard is that they all have to deal with so-called network-wide traffic migration, where the forwarding rules of many flows have to be changed. For example, in a switch firmware upgrade, in order to avoid impacting the applications, operators would move all the traffic away from a target switch before performing the upgrade. Taking VM migration as another example, to relocate a group of VMs, all the traffic associated with the VMs will be migrated as well.

Scenario                        Description
VM migration                    Moving VMs among physical servers.
Load balancer reconfiguration   Changing the mapping between a load balancer and its backend servers.
Switch firmware upgrade         Rebooting a switch to install a new version of firmware.
Switch failure repair           Shutting down a faulty switch to prevent failure propagation.
New switch onboarding           Moving traffic to a new switch to test its functionality and compatibility.

Table 1: The common update scenarios in production DCNs.

[Figure 1: Transient load increase during traffic migration. Panels (a) initial, (b) final, and (c) transient show flows f1 and f2 entering at ingress switches s1 and s2 and the loads they place on links l1 and l2.]
Such network-wide traffic migration, if not done properly, could
lead to severe congestion. The fundamental reason is the
difficulty in synchronizing the changes to the flows from different
ingress switches, causing certain links to carry significantly more
traffic during the migration than before or after the migration. We
illustrate this problem using a simple example in Figure 1. Flows
f1 and f2 enter the network from ingress switches s1 and s2 respectively. To move the traffic distribution from the initial one in
(a) to the final one in (b), we need to change the forwarding rules
in both s1 and s2 . As shown in (c), link l2 will carry the aggregate
traffic of f1 and f2 if s1 is changed before s2 . A similar problem will
occur if s2 is changed first. In fact, this problem cannot be solved
by the two-phase commit mechanism proposed in [17].
A modern DCN hosts many interactive applications such as search
and advertisement which require very low latencies. Prior research [5]
reported that even small losses and queuing delays could dramatically elevate the flow completion time and impair the user-perceived
[Figure 2: This example shows how to perform a lossless firmware upgrade through careful traffic distribution transitions. Panels: (a) ECMP initial D0, (b) ECMP final D2', (c) zUpdate final D2, (d) transition from D0 to D2, (e) zUpdate intermediate D1, (f) transition from D0 to D1; annotations mark the switch expecting upgrade and the overloaded links.]
performance. Thus, it is critical to avoid congestion during DCN
updates.
Performing lossless network-wide traffic migration can be highly
tricky in DCNs, because it often involves changes to many switches
and its impact can ripple throughout the network. To avoid congestion, operators have to develop a thoughtful migration plan in
which changes are made step-by-step and in an appropriate order.
Furthermore, certain update (e.g., VM migration) may require coordination between servers and switches. Operators, thus, have to
carefully calibrate the impact of server changes along with that of
switch changes. Finally, because each update scenario has its distinctive requirements, operators today have to create a customized
migration plan for each scenario.
Due to the reasons above, network-wide traffic migration is an
arduous and complicated process which could take weeks for operators to plan and execute while some of the subtle yet important
corner cases might still be overlooked. Thus, risk-averse operators sometimes deliberately defer an update, e.g., leaving switches
running out-of-date, buggy firmware, because the potential damage from the update may outweigh the gains. Such a tendency would
severely hurt the efficiency and agility of the whole DCN.
Our goal is to provide a primitive called zUpdate to manage the
network-wide traffic migration for all the DCN updates shown in
Table 1. In our approach, operators only need to provide the end
requirements of a specific DCN update, and then zUpdate will
automatically handle all the details, including computing a lossless
(perhaps multi-step) migration plan and coordinating the changes
to different switches. This would dramatically simplify the migration process and minimize the burden placed on operators.
4. OVERVIEW
In this section, we illustrate by two examples how asynchronous
switch and traffic matrix changes lead to congestion during traffic
migration in a DCN and how to prevent the congestion through a
carefully-designed migration plan.
Switch firmware upgrade: Figure 2(a) shows a FatTree [4] network where the capacity of each link is 1000. The numbers above
the core switches and those below the ToRs are traffic demands.
The number on each link is the traffic load and the arrow indicates
the traffic direction. This figure shows the initial traffic distribu-
tion D0 where each switch uses ECMP to evenly split the traffic
among the next hops. For example, the load on link ToR1 →AGG1
is 300, half of the traffic demand ToR1 →ToR2 . The busiest link
CORE1 →AGG3 has a load of 920, which is the sum of 620 (demand CORE1 →ToR3/4 ), 150 (traffic AGG1 →CORE1 ), and 150
(traffic AGG5 →CORE1 ). No link is congested.
Suppose we want to move the traffic away from AGG1 before
taking it down for firmware upgrade. A naive way is to disable link
ToR1 →AGG1 so that all the demand ToR1 →ToR2 shifts to link
ToR1 →AGG2 whose load becomes 600 (as shown in Figure 2(b)).
As a result, link CORE3 →AGG4 will have a load of 1070 by combining 300 (traffic AGG2 →CORE3 ), 150 (traffic AGG6 →CORE3 ),
and 620 (demand CORE3 →ToR3/4 ), exceeding its capacity.
Figure 2(c) shows the preceding congestion can be prevented
through a proper traffic distribution D2 , where ToR5 forwards 500
traffic on link ToR5 →AGG5 and 100 traffic on link ToR5 →AGG6
instead of using ECMP. This reduces the load on link CORE3 →AGG4
to 970, right below its capacity.
However, to transition from the initial D0 (Figure 2(a)) to D2
(Figure 2(c)), we need to change the traffic split ratio on both ToR1
and ToR5 . Since it is hard to change two switches simultaneously,
we may end up with a traffic distribution shown in Figure 2(d)
where link CORE1 →AGG3 is congested, when ToR5 is changed
before ToR1 . Conversely, if ToR1 is changed first, we will have the
traffic distribution D2′ in Figure 2(b) where link CORE3 →AGG4 is
congested.
Given the asynchronous changes to different switches, it seems
impossible to transition from D0 (Figure 2(a)) to D2 (Figure 2(c))
without causing any loss. Our basic idea is to introduce an intermediate traffic distribution D1 as a stepping stone, such that the
transitions D0 →D1 and D1 →D2 are both lossless. Figure 2(e) is
such an intermediate D1 where ToR1 splits traffic by 200:400 and
ToR5 splits traffic by 450:150. It is easy to verify that no link is
congested in D1 since the busiest link CORE1 →AGG3 has a load
of 945.
Furthermore, when transitioning from D0 (Figure 2(a)) to D1 , no
matter in what order ToR1 and ToR5 are changed, there will be no
congestion. Figure 2(f) gives an example where ToR5 is changed
before ToR1 . The busiest link CORE1 →AGG3 has a load of 995.
Although not shown here, we verified that there is no congestion
[Figure 3: This example shows how to avoid congestion by choosing the proper traffic split ratios for switches. Panels: (a) ECMP initial T0, (b) ECMP final T2, (c) ECMP transient T1, (d) zUpdate transient T1; annotations mark the failed link and the overloaded link.]
if ToR1 is changed first. Similarly, the transition from D1 (Figure 2(e)) to D2 (Figure 2(c)) is lossless regardless of the change
order of ToR1 and ToR5 .
In this example, the key challenge is to find the appropriate D2
(Figure 2(c)) that satisfies the firmware upgrade requirement (moving traffic away from AGG1 ) as well as the appropriate D1 (Figure 2(e)) that bridges the lossless transitions from D0 (Figure 2(a))
to D2 . We will explain how zUpdate computes the intermediate
and final traffic distributions in §5.3.
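To make the claim concrete, the following sketch recomputes the worst-case load on CORE1→AGG3 during the D0→D1 transition by taking, per flow, the worse of its old and new contribution regardless of the order in which ToR1 and ToR5 are changed; the per-flow contributions are read off Figure 2 and the Python layout is purely illustrative.

# Worst-case load on CORE1->AGG3 during the D0 -> D1 transition of Figure 2.
# Contributions (read off the figure; illustrative numbers):
#   - 620 of demand CORE1->ToR3/4 is unaffected by the change.
#   - ToR1's demand reaches CORE1->AGG3 via AGG1: 150 under D0 (ECMP),
#     100 under D1 (split 200:400 at ToR1, then halved at AGG1).
#   - ToR5's demand reaches CORE1->AGG3 via AGG5: 150 under D0, 225 under D1 (split 450:150).
CAPACITY = 1000
FIXED = 620
contrib = {"ToR1": (150, 100), "ToR5": (150, 225)}  # (old, new) load per flow

# Since each ToR may still be on its old rules or already on its new rules,
# the worst transient load is FIXED plus the max of old/new per flow.
worst = FIXED + sum(max(old, new) for old, new in contrib.values())
print(worst)              # 620 + 150 + 225 = 995
assert worst <= CAPACITY  # D0 -> D1 is lossless regardless of the change order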
Load balancer reconfiguration: Figure 3(a) shows another FatTree network where the link capacity remains 1000. One of the
links CORE2 ↔AGG3 is down (a common incident in DCN). Two
servers S1′ and S2′ are sending traffic to two services S1 and S2 ,
located in Container1 and Container2 respectively. The labels on
some links, for example, ToR5 →AGG3 , are in the form of “l1 +l2 ”,
which indicate the traffic load towards Container1 and Container2
respectively.
Suppose all the switches use ECMP to forward traffic; Figure 3(a)
shows the traffic distribution under the initial traffic matrix T0 where
the load on the busiest link CORE1 →AGG1 is 920, which is the
sum of 600 (the demand CORE1 →ToR1/4 ) and 320 (the traffic on
link AGG3 →CORE1 towards Container1 ). No link is congested.
To be resilient to the failure of a single container, we now want
S1 and S2 to run in both Container1 and Container2 . For S2 , we
will instantiate a new server under ToR3 and reconfigure its load
balancer (LB) (not shown in the figure) to shift half of its load from
ToR9 to ToR3 . For S1 , we will take similar steps to shift half of its
load from ToR3 to ToR9 . Figure 3(b) shows the traffic distribution
under the final traffic matrix T2 after the update. Note that the
traffic on link ToR5 →AGG3 is “160+160” because half of it goes
to the S1 under ToR2 in Container1 and the other half goes to the
S1 under ToR9 in Container2 . It is easy to verify that there is no
congestion.
However, the reconfiguration of the LBs of S1 and S2 usually
cannot be done simultaneously because they reside on different devices. Such asynchrony may lead to a transient traffic matrix T1
shown in Figure 3(c) where S2 's LB is reconfigured before S1 's.
This causes link ToR6 →AGG3 to carry “160+160” traffic, half of
which goes to the S2 in Container1 , and further causes congestion on
link CORE1 →AGG1 . Although not shown here, we have verified
that congestion will happen if S1 ’s LB is reconfigured first.
The congestion above is caused by asynchronous traffic matrix
changes. Our basic idea to solve this problem is to find the proper
traffic split ratios for the switches such that there will be no congestion under the initial, final or any possible transient traffic matrices during the update. Figure 3(d) shows one such solution where
ToR5 and ToR6 send 240 traffic to AGG3 and 400 traffic to AGG4 ,
and other switches still use ECMP. The load on the busiest link
CORE1 →AGG1 now becomes 960 and hence no link is congested
under T1 . Although not shown here, we have verified that, given
such traffic split ratios, the network is congestion-free under the
initial T0 (Figure 3(a)), the final T2 (Figure 3(b)), and the transient
traffic matrix where S1 ’s LB is reconfigured first.
Generally, the asynchronous reconfigurations of multiple LBs
could result in a large number of possible transient traffic matrices, making it hard to find the proper traffic split ratios for all
the switches. We will explain how to solve this problem with
zUpdate in §6.2.
The zUpdate process: We provide zUpdate(T0 , D0 , C) to perform lossless traffic migration for DCN updates. Given an initial
traffic matrix T0 , zUpdate will attempt to compute a sequence
of lossless transitions from the initial traffic distribution D0 to the
final traffic distribution Dn which satisfies the update requirements C. Dn would then allow an update, e.g., upgrading a switch or reconfiguring an LB, to be executed without incurring any loss.

[Figure 4: The high-level working process of zUpdate: formalize update requirement (Sec 6); build network state model (Sec 5.1, 5.2); confine search space (Sec 7.3); define search objective (Sec 7.4); compute transition plan (Sec 5.3); implement transition plan (Sec 7.1, 7.2).]
Figure 4 shows the overall workflow of zUpdate. In the following, we will first present a network model for describing lossless traffic distribution transition (§5.1, 5.2) and an algorithm for
computing a lossless transition plan (§5.3). We will then explain
how to represent the constraints in each update scenario (§6). After
that, we will show how to implement the transition plan on switches
with limited table size (§7.1, 7.2), reduce computational complexity by confining search space (§7.3), and reduce transition overhead
by picking a proper search objective function (§7.4).
5. NETWORK MODEL
This section describes a network model under which we formally define the traffic matrix and traffic distribution as the inputs
to zUpdate. We use this model to derive the sufficient and necessary conditions for a lossless transition between two traffic distributions. In the end, we present an algorithm for computing a lossless
transition plan using an optimization programming model.
Notation      Description
V             The set of all switches.
E             The set of all links between switches.
G             The directed network graph G = (V, E).
$e_{v,u}$     A directed link from switch v to u.
$c_{v,u}$     The link capacity of $e_{v,u}$.
f             A flow from an ingress to an egress switch.
$s_f$         The ingress switch of f.
$d_f$         The egress switch of f.
$p_f$         A path taken by f from $s_f$ to $d_f$.
$G_f$         The subgraph formed by all the $p_f$'s.
T             The traffic matrix of the network.
$T_f$         The flow size of f in T.
$l^f_{v,u}$   The traffic load placed on $e_{v,u}$ by f.
D             A traffic distribution $D := \{l^f_{v,u} \mid \forall f, e_{v,u} \in E\}$.
$r^f_{v,u}$   A rule for f on $e_{v,u}$: $r^f_{v,u} = l^f_{v,u}/T_f$.
R             All rules in the network: $R := \{r^f_{v,u} \mid \forall f, e_{v,u} \in E\}$.
$R^f_v$       Rules for f on switch v: $\{r^f_{v,u} \mid \forall u : e_{v,u} \in E\}$.
$R^f$         Rules for f in the network: $\{r^f_{v,u} \mid \forall e_{v,u} \in E\}$.
D(T)          All feasible traffic distributions which fully deliver T.
$D_C(T)$      All traffic distributions in D(T) which satisfy constraints C.
P(T)          $P(T) := \{(D_1, D_2) \mid \forall D_1, D_2 \in D(T)$, the direct transition from $D_1$ to $D_2$ is lossless$\}$.

Table 2: The key notations of the network model.
5.1 Abstraction of Traffic Distribution
Network, flow and traffic matrix: A network is a directed graph
G = (V, E), where V is the set of switches, and E is the set
of links between switches. A flow f enters the network from an
ingress switch (sf ) and exits at an egress switch (df ) through one
or multiple paths (pf ). Let Gf be the subgraph formed by all the
pf ’s. For instance, in Figure 5, suppose f takes the two paths (1,
2, 4, 6, 8) and (1, 2, 4, 7, 8) from switch1 to switch8 , then Gf is
comprised of switches {1, 2, 4, 6, 7, 8} and the links between them.
A traffic matrix T defines the size of each flow Tf .
Traffic distribution: Let $l^f_{v,u}$ be f's traffic load on link $e_{v,u}$. We define $D = \{l^f_{v,u} \mid \forall f, e_{v,u} \in E\}$ as a traffic distribution, which represents each flow's load on each link. Given a T, we call D feasible if it satisfies:

$\forall f: \sum_{u \in V} l^f_{s_f,u} = \sum_{v \in V} l^f_{v,d_f} = T_f$   (1)

$\forall f, v \in V \setminus \{s_f, d_f\}: \sum_{u \in V} l^f_{v,u} = \sum_{u \in V} l^f_{u,v}$   (2)

$\forall f, e_{v,u} \notin G_f: l^f_{v,u} = 0$   (3)

$\forall e_{v,u} \in E: \sum_{\forall f} l^f_{v,u} \le c_{v,u}$   (4)
Equations (1) and (2) guarantee that all the traffic is fully delivered,
(3) means a link should not carry f ’s traffic if it is not on the paths
from sf to df , and (4) means no link is congested. We denote
D(T ) as the set of all feasible traffic distributions under T .
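As an illustration (not part of the paper's formulation), the feasibility conditions (1)∼(4) translate directly into a conservation-and-capacity check; the Python sketch below assumes a simple, hypothetical dictionary layout for D, the flows, and the link capacities.

# Minimal sketch of the feasibility check for a traffic distribution D under T,
# following equations (1)-(4). Data structures are hypothetical:
#   D[f][(v, u)]     -> load of flow f on link (v, u)
#   flows[f]         -> dict with 'src', 'dst', 'size' (T_f) and 'subgraph' (set of links G_f)
#   capacity[(v, u)] -> capacity of link (v, u)
EPS = 1e-6

def is_feasible(D, flows, capacity):
    for f, info in flows.items():
        loads = D.get(f, {})
        out_src = sum(l for (v, u), l in loads.items() if v == info['src'])
        in_dst = sum(l for (v, u), l in loads.items() if u == info['dst'])
        # (1) all of T_f leaves the ingress and reaches the egress
        if abs(out_src - info['size']) > EPS or abs(in_dst - info['size']) > EPS:
            return False
        # (2) flow conservation at every intermediate switch
        switches = {v for (v, u) in loads} | {u for (v, u) in loads}
        for w in switches - {info['src'], info['dst']}:
            out_w = sum(l for (v, u), l in loads.items() if v == w)
            in_w = sum(l for (v, u), l in loads.items() if u == w)
            if abs(out_w - in_w) > EPS:
                return False
        # (3) no load outside f's subgraph G_f
        if any(l > EPS for e, l in loads.items() if e not in info['subgraph']):
            return False
    # (4) no link carries more than its capacity
    for e, cap in capacity.items():
        if sum(D.get(f, {}).get(e, 0.0) for f in flows) > cap + EPS:
            return False
    return True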
Flow rule: We define a rule for a flow f on link $e_{v,u}$ as $r^f_{v,u} = l^f_{v,u}/T_f$, which is essentially the value of $l^f_{v,u}$ normalized by the flow size $T_f$. We also define the set of rules in the whole network as $R = \{r^f_{v,u} \mid \forall f, e_{v,u} \in E\}$, the set of rules of a flow f as $R^f = \{r^f_{v,u} \mid \forall e_{v,u} \in E\}$, and the set of rules for a flow f on a switch v as $R^f_v = \{r^f_{v,u} \mid \forall u : e_{v,u} \in E\}$.
Given T , we can compute the rule set R for the traffic distribution D and vice versa. We use D = T × R to denote the correspondence between D and R. We call R feasible if its corresponding
D is feasible. In practice, we will install R into the switches to realize the corresponding D under T because R is independent from
the flow sizes and can be directly implemented with the existing
switch functions. We will discuss the implementation details in §8.
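Since D = T × R is an element-wise scaling by flow size, converting between the two representations is straightforward; the following sketch (same hypothetical dictionary layout as above) shows both directions.

# Sketch of the D = T x R correspondence: rules are per-flow link loads
# normalized by the flow size, so conversion back and forth is element-wise.
def rules_from_distribution(D, sizes):
    """r^f_{v,u} = l^f_{v,u} / T_f (flows with zero size carry no traffic)."""
    return {f: {e: (l / sizes[f] if sizes[f] > 0 else 0.0)
                for e, l in loads.items()}
            for f, loads in D.items()}

def distribution_from_rules(R, sizes):
    """l^f_{v,u} = T_f * r^f_{v,u}."""
    return {f: {e: sizes[f] * r for e, r in ratios.items()}
            for f, ratios in R.items()}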
5.2 Lossless Transition between Traffic Distributions
To transition from D1 to D2 under a given T , we need to change
the corresponding rules from R1 to R2 on all the switches. A basic
requirement for a lossless transition is that both D1 and D2 are
feasible: D1 ∈ D(T ) ∧ D2 ∈ D(T ). However, this requirement is
insufficient due to asynchronous switch changes as shown in §4.
We explain this problem in more detail using an example in Figure 5, which is the subgraph $G_f$ of f from ToR1 to ToR8 in a small Clos network. Each of switches 1-5 has two next hops towards ToR8. Thus $l^f_{7,8}$ depends on f's rules on switches 1-5. When the switches are changed asynchronously, each of them could be using either old or new rules, resulting in $2^5$ potential values of $l^f_{7,8}$.

Generally, the number of potential values of $l^f_{v,u}$ grows exponentially with the number of switches which may influence $l^f_{v,u}$. To guarantee a lossless transition under an arbitrary switch change order, Equation (4) must hold for any potential value of $l^f_{v,u}$, which is computationally infeasible to check in a large network.
is computationally infeasible to check in a large network.
To solve the state explosion problem, we leverage the two-phase
commit mechanism [17] to change the rules of each flow. In the first
phase, the new rules of f (Rf,2 ) are added to all the switches while
f ’s packets tagged with an old version number are still processed
with the old rules (Rf,1 ). In the second phase, sf tags f ’s packets
with a new version number, causing all the switches to process the
packets with the new version number using Rf,2 .
To see how two-phase commit helps solve the state explosion
problem, we observe that the subgraph Gf of a flow f (Figure 5)
has multiple layers and the propagation delay between two adjacent
layers is almost a constant. When there is no congestion, the queuing and processing delays on switches are negligibly small. Sup-
[Figure 5: Two-phase commit simplifies link load calculations. The figure shows f's subgraph from ToR1 (ingress) to ToR8 (egress) across the ToR, Agg and Core layers, with constant per-layer delays τ0, τ1, ...; shaded boxes are switches already using the new rules and unshaded boxes are switches still using the old rules.]
pose switch1 flips to the new rules at time 0, switch4 will receive
the packets with the new version number on both of its incoming
interfaces at τ0 + τ1 and flip to the new rules at the same time.
It will never receive a mix of packets with two different version
numbers. Moreover, all the switches in the same layer will flip to
the new rules simultaneously. This is illustrated in Figure 5 where
switch4 and switch5 (in shaded boxes) just flipped to the new rules
while switch6 and switch7 (in unshaded boxes) are still using the
old rules. Formally, we can prove
LEMMA 1. Suppose a network uses two-phase commit to transition the traffic distribution of a flow f from $D_1$ to $D_2$. If $G_f$ satisfies the following three conditions:

i) Layered structure: all switches in $G_f$ can be partitioned into sets $L_0, \dots, L_m$, where $L_0 = \{s_f\}$, $L_m = \{d_f\}$, and $\forall e_{v,u} \in G_f$, if $v \in L_k$ then $u \in L_{k+1}$;

ii) Constant delay between adjacent layers: $\forall e_{v,u} \in G_f$, let $s_{v,u}$ and $r_{v,u}$ be the sending rate of v and the receiving rate of u, and $\delta_{v,u}$ be the delay from the time when $s_{v,u}$ changes to the time when $r_{v,u}$ changes; suppose $\forall v_1, v_2 \in L_k$ and $e_{v_1,u_1}, e_{v_2,u_2} \in G_f$: $\delta_{v_1,u_1} = \delta_{v_2,u_2} = \delta_k$;

iii) No switch queuing or processing delay: given a switch u and $\forall e_{v,u}, e_{u,w} \in G_f$: if $r_{v,u}$ changes from $l^1_{v,u}$ to $l^2_{v,u}$ simultaneously, then $s_{u,w}$ changes from $l^1_{u,w}$ to $l^2_{u,w}$ immediately at the same time;

then $\forall e_{v,u} \in G_f$, $s_{v,u}$ and $r_{v,u}$ are either $l^1_{v,u}$ or $l^2_{v,u}$ during the transition.

PROOF. See appendix.
Two-phase commit reduces the number of potential values of $l^f_{v,u}$ to just two, but it does not completely solve the problem. In fact, when each f is changed asynchronously via two-phase commit, the number of potential values of $\sum_{\forall f} l^f_{v,u}$ in Equation (4) will be $2^n$, where n is the number of flows. To further reduce the complexity in checking (4), we introduce the following:

LEMMA 2. When each flow is changed independently, a transition from $D_1$ to $D_2$ is lossless if and only if:

$\forall e_{v,u} \in E: \sum_{\forall f} \max\{l^{f,1}_{v,u}, l^{f,2}_{v,u}\} \le c_{v,u}$   (5)

PROOF. At any snapshot during the transition, $\forall e_{v,u} \in E$, let $F^1_{v,u}$ / $F^2_{v,u}$ be the set of flows with the old/new load values. Due to two-phase commit, $F^1_{v,u} \cup F^2_{v,u}$ contains all the flows on $e_{v,u}$.

$\Rightarrow$: Construct $F^1_{v,u}$ and $F^2_{v,u}$ as follows: f is put into $F^1_{v,u}$ if $l^{f,1}_{v,u} \ge l^{f,2}_{v,u}$, otherwise it is put into $F^2_{v,u}$. Because the transition is congestion-free, we have:

$\sum_{f \in F^1_{v,u}} l^{f,1}_{v,u} + \sum_{f \in F^2_{v,u}} l^{f,2}_{v,u} = \sum_{\forall f} \max\{l^{f,1}_{v,u}, l^{f,2}_{v,u}\} \le c_{v,u}$

Hence, (5) holds.

$\Leftarrow$: When (5) holds, we have:

$\sum_{f \in F^1_{v,u}} l^{f,1}_{v,u} + \sum_{f \in F^2_{v,u}} l^{f,2}_{v,u} \le \sum_{\forall f} \max\{l^{f,1}_{v,u}, l^{f,2}_{v,u}\} \le c_{v,u}$

Thus no link is congested at any snapshot during the transition.
Lemma 2 means we only need to check Equation (5) to ensure
a lossless transition, which is now computationally feasible. Note
that when the flow changes are dependent, e.g., the flows on the
same ingress switch are tagged with a new version number simultaneously, (5) will be a sufficient condition. We define P(T ) as
the set of all pairs of feasible traffic distributions (D1 , D2 ) which
satisfy (5) under traffic matrix T .
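Checking condition (5) amounts to summing, on every link, the worse of each flow's old and new load; a minimal sketch of this check follows (same hypothetical data layout as before).

# Sketch of the lossless-transition check of Lemma 2 (Equation (5)):
# a transition D1 -> D2 is safe under arbitrary asynchronous flow flips
# iff, on every link, the per-flow maxima of old and new loads fit.
def transition_is_lossless(D1, D2, capacity):
    for e, cap in capacity.items():
        worst = sum(max(D1.get(f, {}).get(e, 0.0), D2.get(f, {}).get(e, 0.0))
                    for f in set(D1) | set(D2))
        if worst > cap + 1e-6:
            return False
    return True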
5.3 Computing Transition Plan
Given T0 and D0 , zUpdate tries to find a feasible Dn which
satisfies constraints C and can be transitioned from D0 without loss.
The search is done by constructing an optimization programming
model M.
In the simplest form, M is comprised of Dn as the variable and
two constraints: (i) Dn ∈ DC (T0 ); (ii) (D0 , Dn ) ∈ P(T0 ). Note
that (i) & (ii) can be represented with equations (1)∼(5). We defer the discussion of constraints C to §6. If such a Dn is found,
the problem is solved (by the definitions of DC (T0 ) and P(T0 ) in
Table 2) and a lossless transition can be performed in one step.
Algorithm 1: zUpdate(T0, D0, C)

// D0 is the initial traffic distribution
// If D0 satisfies the constraints C, return D0 directly
if D0 ∈ DC(T0) then
    return [D0];
// The initial # of steps is 1, N is the max # of steps
n ← 1;
while n ≤ N do
    M ← new optimization model;
    D[0] ← D0;
    for k ← 1, 2, ..., n do
        D[k] ← new traffic distribution variable;
        M.addVariable(D[k]);
        // D[k] should be feasible under T0
        M.addConstraint(D[k] ∈ D(T0));
    for k ← 1, 2, ..., n do
        // Transition D[k−1] → D[k] is lossless
        M.addConstraint((D[k−1], D[k]) ∈ P(T0));
    // D[n] should satisfy the constraints C
    M.addConstraint(D[n] ∈ DC(T0));
    // An objective is optional
    M.addObjective(objective);
    if M.solve() = Successful then
        return D[1 → n];
    n ← n + 1;
return []  // no solution is found
However, sometimes we cannot find a Dn which satisfies the two
constraints above. When this happens, our key idea is to introduce
a sequence of intermediate traffic distributions (D1 , . . . , Dn−1 ) to
bridge the transition from D0 to Dn via n steps. Specifically,
zUpdate will attempt to find Dk (k = 1, . . . , n) which satisfy:
(I) Dn ∈ DC (T0 ); (II) (Dk−1 , Dk ) ∈ P(T0 ). If such a sequence
is found, it means a lossless transition from D0 to Dn can be performed in n steps. In this general form of M, Dk (k = 1, . . . , n)
are the variables and (I) & (II) are the constraints.
Algorithm 1 shows the pseudocode of zUpdate(T0 , D0 , C).
Since we do not know how many steps are needed in advance, we
will search from n = 1 and increment n by 1 until a solution is
found or n reaches a predefined limit N . In essence, we aim to
minimize the number of transition steps to save the overall transition time. Note that there may exist many solutions to M; we will
show how to pick a proper objective function to reduce the transition overhead in §7.4.
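The search can be sketched as an outer loop over the number of steps n around an LP solve; build_model and solve_lp below are hypothetical placeholders for the constraint construction (equations (1)∼(5) plus C) and the solver, not the prototype's actual API.

# Sketch of Algorithm 1's outer loop: try plans of increasing length until
# a lossless multi-step transition is found or the step limit N is reached.
# build_model/solve_lp are hypothetical placeholders, not the paper's actual code.
def z_update(T0, D0, constraints, N, build_model, solve_lp):
    if satisfies(D0, constraints, T0):        # D0 already meets C: nothing to do
        return [D0]
    for n in range(1, N + 1):
        # Variables D[1..n]; constraints: each D[k] feasible under T0,
        # each hop (D[k-1], D[k]) lossless (Eq. 5), and D[n] in D_C(T0).
        model = build_model(T0, D0, constraints, steps=n)
        solution = solve_lp(model)
        if solution is not None:
            return solution                   # [D1, ..., Dn]
    return []                                 # no lossless plan within N steps

def satisfies(D, constraints, T0):
    """Hypothetical check that D lies in D_C(T0); details depend on C."""
    return all(c(D, T0) for c in constraints)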
6. HANDLING UPDATE SCENARIOS
In this section, we apply zUpdate to various update scenarios
listed in Table 1. Specifically, we will explain how to formulate the
requirements of each scenario as zUpdate’s input constraints C.
6.1 Network Topology Updates
Certain update scenarios, e.g., switch firmware upgrade, switch
failure repair, and new switch on-boarding, involve network topology changes but no traffic matrix change. We may use zUpdate
to transition from the initial traffic distribution D0 to a new traffic
distribution D∗ which satisfies the following requirements.
Switch firmware upgrade & switch failure repair: Before the operators shut down or reboot switches for firmware upgrade or failure repair, they want to move all the traffic away from those switches to avoid disrupting the applications. Let U be the set of candidate switches. The preceding requirement can be represented as the following constraints C on the traffic distribution D∗:

$\forall f, u \in U, e_{v,u} \in E: l^{f,*}_{v,u} = 0$   (6)
which forces all the neighbor switches to stop forwarding traffic to
switch u before the update.
New device on-boarding: Before the operators add a new switch
to the network, they want to test the functionality and performance
of the new switch with some non-critical production traffic. Let
u0 be the new switch, Ftest be the test flows, and Gf (u0 ) be the
subgraph formed by all the pf ’s which traverse u0 . The preceding
requirement can be represented as the following constraints C on
the traffic distribution D∗ :
$\forall f \in F_{test}, e_{v,u} \notin G_f(u_0): l^{f,*}_{v,u} = 0$   (7)

$\forall f \notin F_{test}, e_{v,u_0} \in E: l^{f,*}_{v,u_0} = 0$   (8)
where (7) forces all the test flows to only use the paths through u0 ,
while (8) forces all the non-test flows not to traverse u0 .
Restoring ECMP: A DCN often uses ECMP under normal conditions, but WCMP during updates. After an upgrade or testing is completed, operators may want to restore ECMP in the network. This can simply be represented as the following constraints C on D∗:

$\forall f, v \in V, e_{v,u_1}, e_{v,u_2} \in G_f: l^{f,*}_{v,u_1} = l^{f,*}_{v,u_2}$   (9)
which forces switches to evenly split f ’s traffic among next hops.
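As an illustration, constraints of the form (6) can be emitted mechanically once the set of switches to drain is known; the sketch below uses a hypothetical representation of a constraint as a (flow, link) pair whose load is forced to zero.

# Sketch: emit the (flow, link) pairs whose load is forced to zero when
# draining a set of switches U (Equation (6)). Links are (v, u) tuples;
# the representation is illustrative, not the paper's actual encoding.
def drain_constraints(flows, links, drain_set):
    return [(f, (v, u)) for f in flows
            for (v, u) in links if u in drain_set]

# Example: drain AGG1 before its firmware upgrade.
flows = ["ToR1->ToR2", "ToR5->ToR2"]
links = [("ToR1", "AGG1"), ("ToR1", "AGG2"), ("ToR5", "AGG5")]
print(drain_constraints(flows, links, {"AGG1"}))
# -> [('ToR1->ToR2', ('ToR1', 'AGG1')), ('ToR5->ToR2', ('ToR1', 'AGG1'))]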
6.2 Traffic Matrix Updates
Certain update scenarios, e.g., VM migration and LB reconfiguration, will trigger traffic matrix changes. Let T0 and T1 be the initial and final traffic matrices. We may use zUpdate to transition from the initial traffic distribution D0 to a new D∗ whose corresponding rule set R∗ is feasible under T0, T1, and any possible transient traffic matrices during the update.

As explained in §4, the number of possible transient traffic matrices can be enormous when many LBs (or VMs) are being updated. It is thus computationally infeasible even to enumerate all of them. Our key idea to solve this problem is to introduce a maximum traffic matrix Tmax that is "larger" than T0, T1 and any possible transient traffic matrices, and to only search for a D∗ whose corresponding R∗ is feasible under Tmax.

Suppose that during the update process the real traffic matrix T(t) is a function of time t. We define $\forall f: T_{max,f} := \sup_t(T_f(t))$, where sup means the upper bound over time t. We derive the following:

LEMMA 3. Given a rule set R∗, if $T_{max} \times R^* \in D(T_{max})$, we have $T(t) \times R^* \in D(T(t))$.

PROOF. Because $T_f(t) \le T_{max,f}$ and $T_{max} \times R^* \in D(T_{max})$, $\forall e_{v,u} \in E$ we have:

$\sum_{\forall f} T_f(t) \times r^{f,*}_{v,u} \le \sum_{\forall f} T_{max,f} \times r^{f,*}_{v,u} \le c_{v,u}$

Hence, $T(t) \times R^* \in D(T(t))$.

Lemma 3 says that if R∗ is feasible under Tmax, it is feasible throughout the update process. This means, before updating the traffic matrix from T0 to T1, we may use zUpdate to transition from D0 into a D∗ whose corresponding R∗ is feasible under Tmax. This leads to the following constraints C on D∗:

$D^* = T_0 \times R^*$   (10)

$T_{max} \times R^* \in D(T_{max})$   (11)

Here Tmax is specified by the application owners who are going to perform the update. In essence, Lemma 3 enables the operators to migrate multiple VMs in parallel, saving the overall migration time, while not incurring any congestion.
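In practice, Tmax can be computed as the per-flow upper bound over every traffic matrix that may occur during the update; the sketch below also shows the optional error margin η discussed in §7.5 (the dictionary layout and flow names are illustrative).

# Sketch: build the maximum traffic matrix T_max as the per-flow upper bound
# over every traffic matrix that may occur during the update (initial, final,
# and transient), optionally inflated by an error margin eta > 1 (see Sec 7.5).
def max_traffic_matrix(matrices, eta=1.0):
    flows = set().union(*(m.keys() for m in matrices))
    return {f: eta * max(m.get(f, 0.0) for m in matrices) for f in flows}

# Example usage with hypothetical flow names:
T0 = {"f1": 300, "f2": 600}
T1 = {"f1": 500, "f2": 400}
print(max_traffic_matrix([T0, T1], eta=1.1))   # {'f1': 550.0, 'f2': 660.0}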
7. PRACTICAL ISSUES
In this section, we discuss several practical issues in implementing zUpdate including switch table size, computational complexity, transition overhead, and unplanned failures and traffic matrix
variations.
7.1 Implementing zUpdate on Switches
The output of zUpdate is a sequence of traffic distributions, each of which can be implemented by installing its corresponding flow table entries into the switches. Given a flow f's traffic distribution on a switch v, $\{l^f_{v,u}\}$, we compute a weight set $W^f_v$ in which $w^f_{v,u} = l^f_{v,u} / \sum_{u_i} l^f_{v,u_i}$. In practice, a zUpdate flow f is the collection of all the 5-tuple flows from the same ingress to the same egress switches, since all these 5-tuple flows share the same set of paths. $W^f_v$ can then be implemented on switch v by hashing each of f's 5-tuple flows into one next hop in f's next hop set using WCMP. As in [17], the version number used by two-phase commit can be encoded in the VLAN tag.
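Computing a weight set is a plain normalization of per-next-hop loads; a small sketch follows (the dictionary layout is hypothetical, and the example numbers mirror the 0.25/0.75 weight set of Figure 6).

# Sketch: turn a flow's per-link loads at switch v into a WCMP weight set.
#   loads_at_v: {next_hop: load placed on link (v, next_hop) by the flow}
def wcmp_weights(loads_at_v):
    total = sum(loads_at_v.values())
    if total == 0:
        # No traffic through v for this flow; fall back to even (ECMP) weights.
        n = len(loads_at_v)
        return {u: 1.0 / n for u in loads_at_v} if n else {}
    return {u: load / total for u, load in loads_at_v.items()}

# Example: a weight set of {0.25, 0.75} over two interfaces.
print(wcmp_weights({"intf1": 250, "intf2": 750}))   # {'intf1': 0.25, 'intf2': 0.75}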
Figure 6 shows an example of how to map the zUpdate flows
to the flow and group table entries on an OpenFlow switch. The
zUpdate flow (ver2 , s2 , d2 ) maps to the switch flow table entry
(vlan2 , s2 , d2 ) which further points to group2 in the switch group
table. group2 implements this flow’s weight set {0.25, 0.75, −}
using the SELECT group type and the WCMP hash function.
[Figure 6: Implementing zUpdate on an OpenFlow switch. The figure shows the zUpdate table before consolidation (per-version, per-prefix weight sets), the switch flow table (match fields vlanID/srcPfx/dstPfx mapping to a group ID, with wildcard entries per destination prefix), and the switch group table (SELECT groups with WCMP action buckets over interfaces).]
7.2 Limited Flow and Group Table Size
As described in §2, a commodity switch has limited flow table
size, usually between 1K and 4K entries. However, a large DCN
may have several hundred ToR switches, elevating the number
of zUpdate flows beyond 100K, far exceeding the flow table size.
Making things even worse, because two-phase commit is used, a
flow table may hold two versions of the entry for each flow, potentially doubling the number of entries. Finally, the group table size
on commodity switches also poses a challenge, since it is around
1K (sometimes smaller than the flow table size).
Our solution to this problem is motivated by one key observation: ECMP works reasonably well for most of the flows in a DCN.
During transition, there usually exist only several bottleneck links
on which congestion may arise. Such congestion can be avoided by
adjusting the traffic distribution of a small number of critical flows.
This allows us to significantly cut down the number of flow table
entries by keeping most of the flows in ECMP.
Consolidating flow table entries: Let S be the flow table size
and n be the number of ToR switches. In a flow table, we will always have one wildcard entry for the destination prefix of each ToR
switch, resulting in n wildcard entries in the table. Any flow that
matches a wildcard entry will simply use ECMP. Figure 6 shows
an example where the switch flow table has three wildcard entries
for destinations d1 , d2 and d3 . Since the weight set of zUpdate
flows (ver1 , s4 , d3 ) and (ver2 , s4 , d3 ) is {−, 0.5, 0.5}, they both
map to one wildcard entry (*, *, d3 ) and use ECMP.
Suppose we need to consolidate k zUpdate flows into the switch
flow table. Excluding the wildcard entries, the flow table still has
S − n free entries (note that S is almost certainly larger than n).
Therefore, we will select S − n critical flows and install a specific entry for each of them while forcing the remaining non-critical
flows to use the wildcard entries (ECMP). This is illustrated in Figure 6 where the zUpdate flows (ver1 , s1 , d1 ) and (ver1 , s3 , d2 )
map to specific and wildcard entries in the switch flow table respectively. To resolve matching ambiguity in the switch flow table,
a specific entry, e.g., (vlan1 , s1 , d1 ), always has higher priority
than a wildcard entry, e.g., (*, *, d1 ).
The remaining question is how to select the critical flows. Suppose $D^f_v$ is the traffic distribution of a zUpdate flow f on switch v; we calculate the corresponding $\bar{D}^f_v$, which is f's traffic distribution if it uses ECMP. We use $\delta^f_v = \sum_u |l^f_{v,u} - \bar{l}^f_{v,u}|$ to quantify the "penalty" we pay if f is forced to use ECMP. To minimize the penalty caused by the flow consolidation, we pick the top S − n flows with the largest penalty as the critical flows. In Figure 6, there are 3 critical flows whose penalty is greater than 0 and 5 non-critical flows whose penalty is 0.
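Critical-flow selection is then a top-k by penalty; the sketch below assumes a hypothetical per-switch layout of WCMP loads and next hop sets.

# Sketch: pick the S - n critical flows on switch v whose WCMP distribution
# deviates most from ECMP; everything else falls back to the wildcard (ECMP) entries.
def select_critical_flows(flow_loads, next_hops, free_entries):
    """flow_loads: {flow: {next_hop: wcmp load}}; next_hops: {flow: [hops]}"""
    penalties = {}
    for f, loads in flow_loads.items():
        hops = next_hops[f]
        total = sum(loads.get(u, 0.0) for u in hops)
        ecmp = total / len(hops) if hops else 0.0
        # delta_v^f = sum_u |l^f_{v,u} - ECMP load on (v, u)|
        penalties[f] = sum(abs(loads.get(u, 0.0) - ecmp) for u in hops)
    ranked = sorted(penalties, key=penalties.get, reverse=True)
    return [f for f in ranked[:free_entries] if penalties[f] > 0]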
Because of two-phase commit, each zUpdate flow has two versions. We follow the preceding process to consolidate both versions of the flows into the switch flow table. As shown in Fig-
ure 6, zUpdate flows (ver1 , s4 , d3 ) and (ver2 , s4 , d3 ) share the
same wildcard entry in the switch flow table. In contrast, zUpdate
flows (ver1 , s1 , d1 ) and (ver2 , s1 , d1 ) map to one specific entry
and one wildcard entry in the switch flow table, respectively.
On some switches, the group table size may not be large enough
to hold the weight sets of all the critical flows. Let T be group table
size and m be the number of ECMP group entries. Because a group
table must at least hold all the ECMP entries, T is almost always
greater than m. After excluding the ECMP entries, the group table
still has T − m free entries. If S − n > T − m, we follow the
preceding process to select T − m critical flows with the largest
penalty and install a group entry for each of them while forcing the
remaining non-critical flows to use ECMP.
After flow consolidation, the real traffic distribution D̃ may deviate from the ideal traffic distribution D computed by zUpdate.
Thus, an ideal lossless transition from D0 to Dn may not be feasible due to the table size limits. To keep the no-loss guarantee, zUpdate will check the real loss of transitioning from D̃0 to D̃n
after flow consolidation and return an empty list if loss does occur.
7.3 Reducing Computational Complexity
In §5.3, we construct an optimization programming model M
to compute a lossless transition plan. Let |F |, |V |, and |E| be
the number of flows, switches and links in the network and n be
the number of transition steps. The total number of variables and
constraints in M is O(n|F ||E|) and O(n|F |(|V | + |E|)). In a
large DCN, it could take a long time to solve M.
Given a network, |V | and |E| are fixed and n is usually very
small, so the key to shortening the computation time is to reduce |F |.
Fortunately in DCNs, congestion usually occurs only on a small
number of bottleneck links during traffic migration, and such congestion may be avoided by just manipulating the traffic distribution
of the bottleneck flows that traverse those bottleneck links. Thus,
our basic idea is to treat only the bottleneck flows as variables while
fixing all the non-bottleneck flows as constants in M. This effectively reduces |F | to be the number of bottleneck flows, which is
far smaller than the total number of flows, dramatically improving
the scalability of zUpdate.
Generally, without solving the (potentially expensive) M, it is
difficult to precisely know the bottleneck links. To circumvent this
problem, we use a simple heuristic called ECMP-Only (or ECMP-O) to roughly estimate the bottleneck links. In essence, ECMP-O mimics how operators perform traffic migration today by solely
relying on ECMP.
For a network topology update (§6.1), the final traffic distribution D∗ must satisfy Equations (6)∼(8), each of which is in the form of $l^{f,*}_{v,u} = 0$. To meet each constraint $l^{f,*}_{v,u} = 0$, we simply remove the corresponding u from f's next hop set on switch v. After that, we compute D∗ by splitting each flow's traffic among its remaining next hops using ECMP. Finally, we identify the bottleneck links as: i) the congested links during the one-step transition from D0 to D∗ (violating Equation (5)); ii) the congested links under D∗ after the transition is done (violating Equation (4)).
For traffic matrix update (§6.2), ECMP-O does not perform any
traffic distribution transition, and thus congestion can arise only
during traffic matrix changes. Let Tmax be the maximum traffic
matrix and Recmp be the ECMP rule set; we simply identify the bottleneck
links as the congested links under Dmax = Tmax × Recmp .
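Estimating bottleneck links under ECMP-O then reduces to accumulating per-link loads under the ECMP rule set and flagging capacity violations; the sketch below uses the same illustrative data layout as the earlier checks.

# Sketch of ECMP-O bottleneck estimation for a traffic matrix update:
# compute each link's load under D_max = T_max x R_ecmp and flag overloads.
def bottleneck_links(T_max, ecmp_rules, capacity):
    """ecmp_rules: {flow: {link: fraction of the flow carried on that link}}"""
    load = {}
    for f, size in T_max.items():
        for e, frac in ecmp_rules.get(f, {}).items():
            load[e] = load.get(e, 0.0) + size * frac
    return [e for e, cap in capacity.items() if load.get(e, 0.0) > cap]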
7.4 Transition Overhead
To perform traffic distribution transitions, zUpdate needs to change the flow and group tables on switches. Besides guaranteeing a lossless transition, we would also like to minimize the number of table changes. Remember that under the optimization model M, there may exist many possible solutions. We could favor solutions with low transition overhead by picking a proper objective function. As just discussed, the ECMP-related entries (e.g., wildcard entries) will remain static in the flow and group tables. In contrast, the non-ECMP entries (e.g., specific entries) are more dynamic since they are directly influenced by the transition plan computed by zUpdate. Hence, a simple way to reduce the number of table changes is to "nudge" more flows towards ECMP. This prompts us to minimize the following objective function in M:

$\sum_{i=1}^{n} \sum_{f,v,u,w} |l^{f,i}_{v,u} - l^{f,i}_{v,w}|$, where $e_{v,u}, e_{v,w} \in E$   (12)

in which n is the number of transition steps. Clearly, the objective value is 0 when all the flows use ECMP. One nice property of Equation (12) is its linearity. In fact, because Equations (1)∼(12) are all linear, M becomes a linear programming (LP) problem which can be solved efficiently.

7.5 Failures and Traffic Matrix Variations
It is trivial for zUpdate to handle unplanned failures during transitions. In fact, failures can be treated in the same way as switch upgrades (see §6.1) by adding the failed switches/links to the update requirements, e.g., those failed switches/links should not carry any traffic in the future. zUpdate will then attempt to re-compute a transition plan from the current traffic matrix and traffic distribution to meet the new update requirements.

Handling traffic matrix variations is also quite simple. When estimating Tmax, we may multiply it by an error margin η (η > 1). Lemma 3 guarantees that the transitions are lossless so long as the real traffic matrix T ≤ ηTmax.

[Figure 7: zUpdate's prototype implementation. The operator submits an update scenario (switch upgrade/repair, VM migration, LB reconfiguration, switch on-boarding) to the update scenario translator, which produces constraints for the zUpdate engine; the resulting transition plan is converted into flow rules by the transition plan translator and pushed to the switches by the OpenFlow controller.]

8. IMPLEMENTATION
Figure 7 shows the key components and workflow of zUpdate.
When an operator wants to perform a DCN update, she will submit
a request containing the update requirements to the update scenario
translator. The latter converts the operator’s request into the formal update constraints (§6). The zUpdate engine takes the update
constraints together with the current network topology, traffic matrix, and flow rules and attempts to produce a lossless transition
plan (§5, 7.3 & 7.4). If such a plan cannot be found, it will notify the operator who may decide to revise or postpone the update.
Otherwise, the transition plan translator will convert the transition
plan into the corresponding flow rules (§7.1 & 7.2). Finally, the
OpenFlow controller will push the flow rules into the switches.
The zUpdate engine and the update scenario translator consist
of 3000+ lines of C# code with Mosek [2] as the linear program-
ming solver. The transition plan translator is written in 1500+ lines
of Python code. We use Floodlight 0.9 [1] as the OpenFlow controller and commodity switches which support OpenFlow 1.0 [3].
Given that WCMP is not available in OpenFlow 1.0, we emulate
WCMP as follows: given the weight set of a zUpdate flow f at
switch v, for each constituent 5-tuple flow ξ in f , we first compute
the next hop u of ξ according to WCMP hashing and then insert a
rule for ξ with u as the next hop into v.
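The emulation can be sketched as deterministically hashing each constituent 5-tuple into a next hop in proportion to the WCMP weights and installing an exact-match rule per 5-tuple; the code below is only illustrative (install_rule stands in for the controller's rule-push call, and the md5-based hashing is our own choice, not necessarily the prototype's).

# Sketch of WCMP emulation on OpenFlow 1.0: hash each constituent 5-tuple flow
# of a zUpdate flow into a next hop chosen in proportion to the WCMP weights,
# then install an exact-match rule for that 5-tuple. install_rule() is a
# hypothetical stand-in for the controller's rule-push API.
import hashlib

def pick_next_hop(five_tuple, weights):
    """weights: ordered list of (next_hop, weight); deterministic per 5-tuple."""
    digest = hashlib.md5(repr(five_tuple).encode()).hexdigest()
    point = (int(digest, 16) % 10**6) / 10**6 * sum(w for _, w in weights)
    acc = 0.0
    for hop, w in weights:
        acc += w
        if point < acc:
            return hop
    return weights[-1][0]

def emulate_wcmp(switch, five_tuples, weights, install_rule):
    for ft in five_tuples:
        install_rule(switch, match=ft, next_hop=pick_next_hop(ft, weights))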
9. EVALUATIONS
In this section, we show zUpdate can effectively perform congestion-free traffic migration using both testbed experiments and
large-scale simulations. Compared to alternative traffic migration
approaches, zUpdate can not only prevent loss but also reduce
the transition time and transition overhead.
9.1 Experimental Methodology
Testbed experiments: Our testbed experiments run on a FatTree
network with 4 CORE switches and 3 containers as illustrated in
Figure 3. (Note that there are 2 additional ToRs connected to AGG3,4
which are not shown in the figure because they do not send or
receive any traffic). All the switches support OpenFlow 1.0 with
10Gbps link speed. A commercial traffic generator is connected to
all the ToRs and CORE’s to inject 5-tuple flows at pre-configured
constant bit rate.
Large-scale simulations: Our simulations are based on a production DCN with hundreds of switches and tens of thousands of servers.
The flow and group table sizes are 750 and 1,000 entries respectively, matching the numbers of the commodity switches used in the
DCN. To obtain the traffic matrices, we log the socket events on all
the servers and aggregate the logs into ingress-to-egress flows over
10-minute intervals. A traffic matrix is comprised of all the ingress-to-egress flows in one interval. From the 144 traffic matrices in
a typical working day, we pick 3 traffic matrices that correspond
to the minimum, median and maximum network-wide traffic loads
respectively. The simulations run on a commodity server with 1
quad-core Intel Xeon 2.13GHz CPU and 48GB RAM.
Alternative approaches: We compare zUpdate with three alternative approaches: (1) zUpdate-One-Step (zUpdate-O): It
uses zUpdate to compute the final traffic distribution and then
jumps from the initial traffic distribution to the final one directly,
omitting all the intermediate steps. (2) ECMP-O (defined in §7.3).
(3) ECMP-Planned (ECMP-P): For traffic matrix update, ECMP-P
does not perform any traffic distribution transition (like ECMP-O).
For network topology update, ECMP-P has the same final traffic
distribution as ECMP-O. Their only difference is, when there are
k ingress-to-egress flows to be migrated from the initial traffic distribution to the final traffic distribution, ECMP-O migrates all the
k flows in one step while ECMP-P migrates only one flow in each
[Figure 8: The link utilization of the two busiest links (CORE1-AGG3 and CORE3-AGG4) over time in the switch upgrade example, under (a) ECMP-O, (b) zUpdate-O, and (c) zUpdate; annotations mark when ToR1 and ToR5 are changed.]

[Figure 9: The link utilization of the two busiest links (CORE1-AGG1 and CORE3-AGG6) over time in the LB reconfiguration example, under (a) ECMP and (b) zUpdate; annotations mark when S1's and S2's LBs are reconfigured and when ToR5 and ToR6 are changed.]
step, resulting in k! candidate migration sequences. In our simulations, ECMP-P will evaluate 1,000 randomly-chosen candidate
migration sequences and use the one with the minimum losses. In
essence, ECMP-P mimics how today’s operators sequentially migrate multiple flows in DCN.
Performance metrics: We use the following metrics to compare
different approaches. (1) Link utilization: the ratio between the link
load and the link capacity. For ease of presentation, we represent link congestion as a link utilization value higher than 100%. (2)
Post-transition loss (Post-TrLoss): the maximum link loss rate after
reaching the final traffic distribution. (3) Transition loss (TrLoss):
the maximum link loss rate under all the possible ingress-to-egress
flow migration sequences during traffic distribution transitions. (4)
Number of steps: the whole traffic migration process can be divided
into multiple steps. The flow migrations within the same step are
done in parallel while the flow migrations of the next step cannot
start until the flow migrations of the current step complete. This
metric reflects how long the traffic migration process will take. (5)
Switch touch times (STT): the total number of times the switches
are reconfigured during a traffic migration. This metric reflects
the transition overhead.
Calculating Tmax: Tmax includes two components: $T_b$ and $T^{max}_{app}$. $T_b$ is the background traffic, which is independent of the applications being updated. $T^{max}_{app}$ is the maximum traffic matrix comprised of only the ingress-to-egress flows ($f_{app}$'s) related to the applications being updated. We calculate $T^{max}_{app}$ as follows: for each $f_{app}$, the size of $f_{app}$ in $T^{max}_{app}$ is the largest size that $f_{app}$ can possibly reach during the entire traffic matrix update process.
9.2 Testbed Experiments
We now conduct testbed experiments to reproduce the two traffic
migration examples described in §4.
Switch upgrade: Figure 8 shows the real-time utilization of the
two busiest links, CORE1 → AGG3 and CORE3 → AGG4 , in the
switch upgrade example (Figure 2). Figure 8(a) shows the transition process from Figure 2a (0s ∼ 6s) to Figure 2b (6s ∼ 14s)
under ECMP-O. The two links initially carry the same amount of
traffic. At 6s, ToR1 → AGG1 is deactivated, triggering traffic loss
on CORE3 → AGG4 (Figure 2b). The congestion lasts until 14s
when ToR1 → AGG1 is restored. Note that we deliberately shorten the switch upgrade period for illustration purposes. In addition, because only one ingress-to-egress flow (ToR1 → ToR2) needs to be
migrated, ECMP-P is the same as ECMP-O.
Figure 8(b) shows the transition process from Figure 2a (0s ∼ 6s)
to Figure 2d (6s ∼ 8s) to Figure 2c (8s ∼ 16s) under zUpdate-O.
At 6s ∼ 8s, ToR5 and ToR1 are changed asynchronously, leading
to a transient congestion on CORE1 → AGG3 (Figure 2d). After
ToR1 and ToR5 are changed, the upgrading of AGG1 is congestion-free at
8s ∼ 16s (Figure 2c). Once the upgrading of AGG1 completes at
16s, the network is restored back to ECMP. Again because of the
asynchronous changes to ToR5 and ToR1 , another transient congestion happens on CORE3 → AGG4 at 16s ∼ 18s.
Figure 8(c) shows the transition process from Figure 2a (0s ∼ 6s)
to Figure 2e (8s ∼ 16s) to Figure 2c (18s ∼ 26s) under zUpdate.
Due to the introduction of an intermediate traffic distribution between 8s ∼ 16s (Figure 2e), the transition process is lossless despite asynchronous switch changes at 6s ∼ 8s and 16s ∼ 18s.
LB reconfiguration: Figure 9 shows the real-time utilization of
the two busiest links, CORE1 → AGG1 and CORE3 → AGG6, in
the LB reconfiguration example (Figure 3). Figure 9(a) shows the
migration process from Figure 3a (0s ∼ 6s) to Figure 3c (6s ∼ 14s)
to Figure 3b (after 14s) under ECMP. At 6s ∼ 14s, S2’s LB and S1’s LB are reconfigured asynchronously, causing congestion on CORE1 → AGG1 (Figure 3c). After both LBs are reconfigured at
14s, the network is congestion-free (Figure 3b).
Figure 9(b) shows the migration process from Figure 3a (0s ∼
6s) to Figure 3d (10s ∼ 18s) to Figure 3b (after 22s) under zUpdate.
By changing the traffic split ratio on ToR5 and ToR6 at 6s ∼ 8s,
zUpdate ensures the network is congestion-free even though S2’s LB and S1’s LB are reconfigured asynchronously at 10s ∼ 18s.
Once the LB reconfiguration completes at 18s, the traffic split ratio
on ToR5 and ToR6 is restored to ECMP at 20s ∼ 22s. Note that
zUpdate is the same as zUpdate-O in this experiment, because
there is no intermediate step in the traffic distribution transition.
9.3 Large-Scale Simulations
We run large-scale simulations to study how zUpdate enables
lossless switch onboarding and VM migration in a production DCN.
Switch onboarding: In this experiment, a new CORE switch is
initially connected to each container but carries no traffic. We then
randomly select 1% of the ingress-to-egress flows as test flows to traverse the new CORE switch for testing. Figure 10(a) compares different migration approaches under the median network-wide traffic load. The y-axis on the left is the traffic loss rate and
the y-axis on the right is the number of steps. zUpdate attains
zero loss by taking 2 transition steps.
Figure 10: Comparison of different migration approaches: (a) Switch Onboarding, (b) VM Migration. The left y-axis gives the traffic loss rate (%) and the right y-axis gives the number of steps; the bars show transition loss, post-transition loss, and #Step for zUpdate, zUpdate-O, ECMP-O, and ECMP-P (combined into ECMP in (b)).
Figure 11: Why congestion occurs in switch onboarding: (a) initial traffic distribution, (b) ECMP, (c) zUpdate. The panels show the test and non-test traffic (in Gbps) on the links toward ToR1 after the new CORE switch is added.
Figure 12: Comparison under different traffic loads (Min-Load, Med-Load, Max-Load). The left y-axis gives the loss rate (%) and the right y-axis gives the number of steps.
Although not shown in the figure, our flow consolidation heuristic (§7.2) successfully fits a
large number of ingress-to-egress flows into the limited switch flow
and group tables. zUpdate-O has no post-transition loss but 8%
transition loss because it takes just one transition step.
ECMP-O incurs 7% transition loss and 13.5% post-transition
loss. This is a bit counterintuitive because the overall network capacity actually increases with the new switch. We explain this phenomenon with a simple example in Figure 11. Suppose there are
7 ingress-to-egress flows to ToR1 , each of which is 2Gbps, and the
link capacity is 10Gbps. Figure 11(a) shows the initial traffic distribution under ECMP where each downward link to ToR1 carries
7Gbps traffic. In Figure 11(b), 4 out of the 7 flows are selected as
the test flows and are moved to the new CORE3 . Thus, CORE3 →
AGG2 has 8Gbps traffic (the 4 test flows) and CORE2 → AGG2
has 3Gbps traffic (half of the 3 non-test flows due to ECMP). This
in turn overloads AGG2 → ToR1 with 11Gbps traffic. Figure 11(c)
shows zUpdate avoids the congestion by moving 2 non-test flows
away from CORE2 → AGG2 to AGG1 → ToR1 . This leaves only
1Gbps traffic (half of the remaining 1 non-test flow) on CORE2 →
AGG2 and reduces the load on AGG2 → ToR1 to 9Gbps.
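To make the arithmetic explicit, the following sketch replays the numbers from this example (variable names are ours):

FLOW_GBPS, LINK_CAP_GBPS = 2, 10
TEST_FLOWS, NON_TEST_FLOWS = 4, 3

# Figure 11(b): ECMP after onboarding -- the 4 test flows ride the new CORE3,
# while the 3 non-test flows are still split evenly over the two old CORE paths.
core3_agg2 = TEST_FLOWS * FLOW_GBPS                  # 8 Gbps
core2_agg2 = NON_TEST_FLOWS * FLOW_GBPS / 2          # 3 Gbps
agg2_tor1 = core3_agg2 + core2_agg2                  # 11 Gbps > 10 Gbps: congestion

# Figure 11(c): zUpdate moves 2 non-test flows onto the path through AGG1.
core2_agg2_z = (NON_TEST_FLOWS - 2) * FLOW_GBPS / 2  # 1 Gbps
agg2_tor1_z = core3_agg2 + core2_agg2_z              # 9 Gbps <= 10 Gbps: no congestion
assert agg2_tor1 > LINK_CAP_GBPS and agg2_tor1_z <= LINK_CAP_GBPS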
ECMP-P has smaller transition loss (4%) than ECMP-O because
ECMP-P attempts to use a flow migration sequence that incurs the
minimum loss. They have the same post-transition loss because
their final traffic distribution is the same. Compared to zUpdate,
ECMP-P has significantly higher loss although it takes hundreds of
transition steps (which also implies a much longer transition period).
VM migration: In this experiment, we migrate a group of VMs
from one ToR to another ToR in two different containers. During
the live migration, the old and new VMs establish tunnels to synchronize data and running states [6, 13]. The total traffic rate of the
tunnels is 6Gbps.
Figure 10(b) compares different migration approaches under the median network-wide traffic load. zUpdate takes 2 steps to reach a traffic distribution that can accommodate the large volume of tunneling traffic and the varying traffic matrices during the live migration. Hence, it does not have any loss. In contrast, zUpdate-O has 4.5% transition loss because it skips the intermediate step taken by zUpdate. We combine ECMP-O and ECMP-P into ECMP because they are the same for traffic matrix updates (§9.1). ECMP’s post-transition loss is large (7.4%) because it cannot handle the large volume of tunneling traffic during the live migration.
Impact of traffic load: We re-run the switch onboarding experiment under the minimum, median, and maximum network-wide traffic loads. In Figure 12, we omit the loss of zUpdate and the post-transition loss of zUpdate-O, since all of them are 0.
We observe that only zUpdate can attain zero loss under different levels of traffic load. Surprisingly, the transition loss of
zUpdate-O and ECMP-O is actually higher under the minimum
load than under the median load. This is because the traffic loss
is determined to a large extent by a few bottleneck links. Hence,
without careful planning, it is risky to perform network-wide traffic migration even during off-peak hours. Figure 12 also shows
zUpdate takes more transition steps as the network-wide traffic
load grows. This is because when the traffic load is higher, it is
more difficult for zUpdate to find the spare bandwidth to accommodate the temporary link load increase during transitions.
Transition overhead: Table 3 shows the number of switch touch
times (STT) of different migration approaches in the switch onboarding experiment. Compared to the STT of zUpdate-O and
ECMP-O, the STT of zUpdate is doubled because it takes two
steps instead of one. However, this also indicates zUpdate touches
at most 68 switches, which represent a small fraction of the several hundred switches in the DCN. This can be attributed to the heuristics in §7.3 and §7.4, which restrict the number of flows to be migrated. ECMP-P has a much larger STT than the other approaches
because it takes a lot more transition steps.
Approach:   zUpdate   zUpdate-O   ECMP-O   ECMP-P
STT:        68        34          34       410
Table 3: Comparison of transition overhead.
The computation time of zUpdate is reasonably small for performing traffic migration in large DCNs. In fact, the running time
is below 1 minute for all the experiments except the maximum traffic load case in Figure 12, where it takes 2.5 minutes to compute
a 4-step transition plan. This is because of the heuristic in §7.3
which ties the computation complexity to the number of bottleneck
flows rather than the total number of flows, effectively reducing the
number of variables by at least two orders of magnitude.
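As a rough, hypothetical illustration of that idea (not the actual heuristic of §7.3, which is described earlier in the paper), restricting the optimization variables to flows that cross nearly saturated links could look like the sketch below; flows, f.links, and the 90% threshold are our assumptions.

def bottleneck_flows(flows, link_load, link_cap, threshold=0.9):
    # Keep only the flows that traverse at least one near-saturated link;
    # only these flows would become variables in the transition-plan computation.
    hot_links = {e for e, load in link_load.items()
                 if load > threshold * link_cap[e]}
    return [f for f in flows if any(e in hot_links for e in f.links)]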
10. RELATED WORK
Congestion during update: Several recent papers focus on preventing congestion during a specific type of update. Raza et al. [16]
study the problem of how to schedule link weight changes during
IGP migrations. Ghorbani et al. [9] attempt to find a congestion-free VM migration sequence. In contrast, our work provides one
primitive for a variety of update scenarios. Another key difference
is they do not consider the transient congestion caused by asynchronous traffic matrix or switch changes since they assume there
is only one link weight change or one VM being migrated at a time.
Routing consistency: There is a rich body of work on preventing
transient misbehaviors during routing protocol updates. Vanbever
et al. [18] and Francois et al. [8] seek to guarantee no forwarding
loop during IGP migrations and link metric reconfigurations. Consensus routing [10] is a policy routing protocol that aims to eliminate transient problems during BGP convergence. The work above emphasizes routing consistency rather than congestion.
Several tools have been created to statically check the correctness of network configurations. Feamster et al. [7] built a tool
to detect errors in BGP configurations. Header Space Analysis
(HSA) [12] and Anteater [15] can check a few useful network invariants, such as reachability and loop freedom, in the forwarding plane.
Building on the earlier work, VeriFlow [14] and real-time HSA [11]
have been developed to check network invariants on-the-fly.
11. CONCLUSION
We have introduced zUpdate for performing congestion-free
traffic migration in DCNs in the presence of asynchronous switch
and traffic matrix changes. The core of zUpdate is an optimization programming model that enables lossless transitions from an
initial traffic distribution to a final traffic distribution to meet the
predefined update requirements. We have built a zUpdate prototype on top of OpenFlow switches and Floodlight controller and
demonstrated its capability in handling a variety of representative
DCN update scenarios using both testbed experiments and large-scale simulations. zUpdate, in its current form, works only for hierarchical DCN topologies such as FatTree and Clos. We plan to
extend zUpdate to support a wider range of network topologies
in the future.
12. ACKNOWLEDGMENTS
We thank our shepherd, Nick Feamster, and anonymous reviewers for their valuable feedback on improving this paper.
13. REFERENCES
[1] Floodlight. http://floodlight.openflowhub.org/.
[2] MOSEK. http://mosek.com/.
[3] OpenFlow 1.0. http://www.openflow.org/documents/
openflow-spec-v1.0.0.pdf.
[4] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity
Data Center Network Architecture. In SIGCOMM’08.
[5] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel,
B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM’10.
[6] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach,
I. Pratt, and A. Warfield. Live Migration of Virtual Machines. In
NSDI’05.
[7] N. Feamster and H. Balakrishnan. Detecting BGP Configuration
Faults with Static Analysis. In NSDI’05.
[8] P. Francois, O. Bonaventure, B. Decraene, and P. A. Coste. Avoiding
Disruptions During Maintenance Operations on BGP Sessions. IEEE Transactions on Network and Service Management, 2007.
[9] S. Ghorbani and M. Caesar. Walk the Line: Consistent Network
Updates with Bandwidth Guarantees. In HotSDN’12.
[10] J. P. John, E. Katz-Bassett, A. Krishnamurthy, T. Anderson, and
A. Venkataramani. Consensus Routing: the Internet as a Distributed
System. In NSDI’08.
[11] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and
S. Whyte. Real Time Network Policy Checking Using Header Space
Analysis. In NSDI’13.
[12] P. Kazemian, G. Varghese, and N. McKeown. Header Space
Analysis: Static Checking for Networks. In NSDI’12.
[13] E. Keller, S. Ghorbani, M. Caesar, and J. Rexford. Live Migration of
an Entire Network (and its hosts). In HotNets’12.
A. Khurshid, W. Zhou, M. Caesar, and P. B. Godfrey. VeriFlow:
Verifying Network-Wide Invariants in Real Time. In HotSDN’12.
[15] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T.
King. Debugging the Data Plane with Anteater. In SIGCOMM’11.
[16] S. Raza, Y. Zhu, and C.-N. Chuah. Graceful Network State
Migrations. IEEE/ACM Transactions on Networking, 2011.
[17] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker.
Abstractions for Network Update. In SIGCOMM’12.
[18] L. Vanbever, S. Vissicchio, C. Pelsser, P. Francois, and
O. Bonaventure. Seamless Network-Wide IGP Migrations. In
SIGCOMM’11.
[19] X. Wu, D. Turner, C.-C. Chen, D. A. Maltz, X. Yang, L. Yuan, and
M. Zhang. NetPilot: Automating Datacenter Network Failure
Mitigation. In SIGCOMM’12.
APPENDIX
Proof of Lemma 1: Assume that at time $t = \tau_k$, flow $f$’s traffic distribution is:

• $\forall v \in L_i\ (i < k),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^2_{v,u}$.
• $\forall v \in L_i\ (i > k),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^1_{v,u}$.
• $\forall v \in L_k,\ e_{v,u} \in G_f$: all the $s_{v,u}$’s are changing from $l^1_{v,u}$ to $l^2_{v,u}$ simultaneously, but all the $r_{v,u}$’s are still $l^1_{v,u}$.

Consider $\forall u \in L_{k+1},\ e_{v,u}, e_{u,w} \in G_f$. According to Condition ii), during $\tau_k \le t < \tau_{k+1} = \tau_k + \delta_k$, all the $r_{v,u}$’s remain $l^1_{v,u}$. Therefore, $f$’s traffic distribution is:

• $\forall v \in L_i\ (i < k),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^2_{v,u}$, because nothing has changed on these links.
• $\forall v \in L_i\ (i > k),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^1_{v,u}$, because nothing has changed on these links.
• $\forall v \in L_k,\ e_{v,u} \in G_f$: $s_{v,u} = l^2_{v,u}$ and $r_{v,u} = l^1_{v,u}$, since the rate change on the sending end has not yet reached the receiving end due to the link delay.

At $t = \tau_{k+1}$, $\forall u \in L_{k+1}$, all the $r_{v,u}$’s change from $l^1_{v,u}$ to $l^2_{v,u}$ simultaneously. According to Condition iii), all the $s_{u,w}$’s also change from $l^1_{u,w}$ to $l^2_{u,w}$ at the same time. Thus, at $t = \tau_{k+1}$:

• $\forall v \in L_i\ (i < k+1),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^2_{v,u}$.
• $\forall v \in L_i\ (i > k+1),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^1_{v,u}$.
• $\forall v \in L_{k+1},\ e_{v,u} \in G_f$: all the $s_{v,u}$’s are changing from $l^1_{v,u}$ to $l^2_{v,u}$ simultaneously, but all the $r_{v,u}$’s are still $l^1_{v,u}$.

At the beginning of the transition, $t = \tau_0$, $s_f$, the only switch in $L_0$, starts to tag $f$’s packets with a new version number, causing all of its outgoing links to change from the old sending rates to the new sending rates simultaneously. Hence, we have:

• $\forall v \in L_i\ (i > 0),\ e_{v,u} \in G_f$: $s_{v,u} = r_{v,u} = l^1_{v,u}$.
• $\forall v \in L_0,\ e_{v,u} \in G_f$: all the $s_{v,u}$’s are changing from $l^1_{v,u}$ to $l^2_{v,u}$ simultaneously, but all the $r_{v,u}$’s are still $l^1_{v,u}$.

This matches the assumption above with $k = 0$. Since we have shown that if the assumption holds at $t = \tau_k$ it also holds at $t = \tau_{k+1}$, it holds throughout the transition. Because in each interval $[\tau_k, \tau_{k+1})$, $\forall e_{v,u} \in G_f$, $s_{v,u}$ and $r_{v,u}$ are either $l^1_{v,u}$ or $l^2_{v,u}$, the proof is complete.