Weighted Fair Queuing with Differential Dropping

Feng Lu, Geoffrey M. Voelker, and Alex C. Snoeren
UC San Diego

Abstract—Weighted fair queuing (WFQ) allows Internet operators to define traffic classes and then assign different bandwidth proportions to these classes. Unfortunately, the complexity of efficiently allocating buffer space to each traffic class turns out to be overwhelming, leading most operators to vastly overprovision buffering, resulting in a large resource footprint. A single buffer for all traffic classes would be preferable due to its simplicity and ease of management. Our work is inspired by the approximate differential dropping scheme but differs substantially in its flow identification and packet dropping strategies. Augmented with our novel differential dropping scheme, a shared-buffer WFQ performs as well as or better than the original WFQ implementation under varied traffic loads, with a vastly reduced resource footprint.

I. INTRODUCTION

Internet traffic demand is increasing at a staggering rate, forcing providers to carefully consider how they apportion capacity across competing users. This resource allocation challenge is complicated by the diversity of traffic classes and their vastly different importance and profitability. Indeed, many Internet Service Providers have taken steps to curb the bandwidth dedicated to traffic perceived as low value (e.g., peer-to-peer file sharing, over-the-top multimedia streaming) while attempting to improve performance for high-value traffic classes such as provider-supported streaming media, VoIP, and games. These demands are not unique to the consumer Internet, and similar requirements have begun appearing in multi-tenant datacenter and other enterprise environments. In many instances, network operators do not wish to set explicit bandwidth guarantees, but instead express relative importance among traffic classes.
Such proportional resource allocation is a well-studied problem, and a variety of techniques are available in the literature; many are even deployed in commercial routers and switches. Indeed, algorithms like Weighted Fair Queuing (WFQ) and Deficit Round Robin have been proposed, analyzed, and implemented in many vendors' standard access router and switch offerings. Unfortunately, sophisticated traffic management mechanisms like WFQ are not available in all switch and router types, largely because of their complexity and significant resource footprint at scale (both in terms of CPU and buffer space). In an effort to combat this complexity, researchers have proposed packet dropping schemes built around the idea that packets from different flows should be dropped differently to achieve the fairness goal. Packet dropping schemes are generally much cheaper to implement, and a single FIFO queue is often sufficient. However, they are only able to provide coarse-grained fairness in the long term, which may not be sufficient for many applications. Perhaps most importantly, even those proportional bandwidth allocation mechanisms that are available and appropriate are not always employed, due in part to their configuration complexity as well as the potential for router forwarding performance degradation. This is especially true when links and switches are operating near capacity: parameter tuning becomes increasingly critical when buffer space is tight, and poorly configured traffic management schemes may perform worse than a simple FIFO queue under heavy load. Hence, we argue that there is a need for a robust, easy-to-configure mechanism that scales to a large number of flows and traffic classes without prohibitively large buffering requirements.
In this paper, we propose an Enhanced Weighted Fair Queuing (EWFQ) scheme, which combines the accuracy of scheduling algorithms like WFQ with the decreased resource footprint of dropping-based active queue management schemes. Critically, our system is largely self-tuning: it requires neither the demand-specific buffer configuration that WFQ does, nor the parameter adjustment that traditional active queue management schemes require.

II. PRELIMINARIES AND RELATED WORK

Weighted Fair Queuing provides weighted fair sharing at the packet level, as does Deficit Round Robin (DRR). Both of these algorithms have been deployed in a wide range of commercially available hardware. However, they are often criticized for their intricate packet scheduling, which makes them difficult to implement at high line rates. Moreover, in their pure forms, these schemes require per-flow queues to ensure perfect isolation in the face of ill-behaved flows. Properly sizing these per-flow queues, which requires users to partition the total buffer space, is error prone and cumbersome. To reduce implementation complexity and the need to keep per-flow queues, a large body of work has explored so-called active queue management (AQM). By relaxing the packet-by-packet fairness requirement, these solutions utilize cheap flow-rate estimation methods and employ a single FIFO queue. As a result, they achieve approximate fairness in the long run, but provide no guarantees about bandwidth sharing among short-lived flows. For example, CHOKe builds on the observation that flow rate is proportional to the number of packets in the queue. Upon the arrival of a new packet, it compares this new packet with a randomly chosen packet from the queue, and both packets are dropped if they are from the same flow. SRED employs a packet list to keep a counter for each flow, and packets are dropped according to their respective flow counts.
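To make the CHOKe comparison concrete, the admission test it describes can be sketched as follows. This is our own minimal illustration, not the paper's or CHOKe's reference implementation; real CHOKe additionally layers RED-style queue thresholds on top of this match test, which we omit here.

```python
import random

def choke_enqueue(queue, pkt_flow, pkt):
    """CHOKe admission sketch: compare the arriving packet against one
    randomly chosen queued packet; if both belong to the same flow,
    drop both. Returns True if the arrival was enqueued."""
    if queue:
        victim_idx = random.randrange(len(queue))
        victim_flow, _ = queue[victim_idx]
        if victim_flow == pkt_flow:
            del queue[victim_idx]   # drop the matched queued packet...
            return False            # ...and discard the arrival as well
    queue.append((pkt_flow, pkt))
    return True
```

Because a high-rate flow occupies more queue slots, its arrivals are more likely to match a queued packet, so its drop probability grows with its queue share, which is the approximate-fairness mechanism the text describes.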
B. Identification of High-bandwidth Flows

However, the flow estimation methods employed by CHOKe and SRED turn out to be too coarse in many instances. In theory, flow rates could be estimated accurately if all incoming packets were recorded, but keeping state for all flows proves to be overly complex. Both AFD and AF-QCN use random packet sampling to reduce this overhead, in the hope that large flows can still be identified and accurately estimated. Unfortunately, it is not easy to find a sampling probability at which rate estimation is sufficiently accurate while sampling costs remain low. Therefore, Kim et al. propose sample-and-hold techniques to accurately estimate flow rates. However, their queue management scheme and packet drop methodology require accurate per-flow RTT estimates.

III. OVERALL SYSTEM DESIGN

In this work, we aim to incorporate the benefits of active queue management into weighted fair queuing while preserving the accuracy of WFQ. In particular, we propose a shared-buffer WFQ implementation, removing the principal limitation to supporting large numbers of traffic classes, as well as freeing the network operator from configuring per-flow queue sizes. In our design, the shared queue is managed by a probabilistic packet dropper in a fashion similar to AFD, but packet scheduling is still based on WFQ, as opposed to FIFO in AFD, maintaining packet-by-packet fairness for both short- and long-lived flows.

The first step in active queue management is determining the offered load. In particular, we must detect high-bandwidth flows and estimate their sending rates. In this work, we apply the sample-and-hold technique proposed by Estan and Varghese. The advantage of sample-and-hold is that it decouples flow identification from flow estimation. In basic sample-and-hold, the probability of flow detection is linearly proportional to the normalized flow size (which is equivalent to flow rate since the measurement interval is fixed).
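As a rough illustration of basic sample-and-hold, consider the sketch below. The function name and packet representation are our own, not the paper's implementation; with per-byte sampling probability p, a packet of s bytes triggers detection with probability roughly p·s, which is what makes detection linear in flow size.

```python
import random

def sample_and_hold(packets, p):
    """Estan-Varghese-style sample-and-hold (sketch).

    packets: iterable of (flow_id, size_in_bytes).
    p: per-byte sampling probability.
    Returns exact byte counts for each flow from the moment it was sampled.
    """
    held = {}  # flow_id -> bytes counted since the flow was first sampled
    for flow_id, size in packets:
        if flow_id in held:
            held[flow_id] += size                    # 'hold': count exactly
        elif random.random() < min(1.0, p * size):   # approx. byte sampling
            held[flow_id] = size                     # flow identified
    return held
```

Because detection probability grows with bytes sent, a high-bandwidth flow is identified early and its subsequent rate estimate is exact, while most small flows never enter the table.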
Hence, in order to reliably track high-bandwidth flows, many small flows are likely to be unnecessarily included. A number of variants have been proposed to reduce this identification probability for small flows. The basic idea underlying them all is to maintain a randomly sampled packet cache of the arriving packets. A flow is considered identified only when a packet from it is sampled and the packet matches a packet randomly drawn from the packet cache. Because the packet cache is randomly updated, this match-hit strategy effectively makes the flow detection probability proportional to the square of the flow size. We generalize the match-hit approach by extending it to multiple hit stages so that the detection probability for small flows can be arbitrarily scaled down. In our multi-stage match-hit scheme, k packet matches must happen before an entry for a flow is created.

A. Weighted Fair Dropping

The most basic issue with AFD that we must address is to provide support for weighted fair dropping, where each flow has a weight $w_i$ associated with it. We observe that this modification is trivial if one follows a particular weight assignment scheme. Suppose there are n flows in the system with weights $w_1, w_2, \cdots, w_n$; we desire that flow i will on average achieve a data rate of $C w_i / \sum_j w_j$ when all flows are backlogged, where C is the link rate. If we normalize the weights such that $\min(w_1, \cdots, w_n) = 1$, then a flow with weight $w_i$ can be treated as $w_i$ concurrent flows in the standard fair queuing system, where the effective data rate of a "transformed flow" from flow i is $r_i / w_i$. The per-flow fair share in WFQ can be expressed as $r_i^{fair} = \min(r_i, r_{fair} w_i)$, where $r_{fair}$ is the solution to $C = \sum_i \min(r_i, r_{fair} w_i)$. Once $r_{fair}$ and $r_i$ are known, the differential dropping probability applied to flow i can be expressed as $d_i = \left(1 - \frac{r_{fair} w_i}{r_i}\right)^+$,¹ where the expression $x^+$ denotes $\max(0, x)$.
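The fair rate $r_{fair}$ defined above can be computed numerically, since $\sum_i \min(r_i, r_{fair} w_i)$ is monotone in $r_{fair}$. The sketch below, under our own naming, solves for it by bisection and derives the per-flow drop probability; it is an illustration of the formulas, not the paper's implementation.

```python
def fair_share_rate(rates, weights, capacity, iters=60):
    """Solve C = sum_i min(r_i, r_fair * w_i) for r_fair by bisection."""
    # If the link is not overloaded, every flow keeps its rate (no drops).
    if sum(rates) <= capacity:
        return max(r / w for r, w in zip(rates, weights))
    served = lambda rf: sum(min(r, rf * w) for r, w in zip(rates, weights))
    lo, hi = 0.0, capacity / min(weights)  # served(hi) >= capacity here
    for _ in range(iters):
        mid = (lo + hi) / 2
        if served(mid) < capacity:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def drop_prob(r_i, w_i, r_fair):
    """d_i = (1 - r_fair * w_i / r_i)^+"""
    return max(0.0, 1.0 - r_fair * w_i / r_i)
```

For example, three equal-weight flows offering 5, 3, and 2 units on a link of capacity 6 yield $r_{fair} = 2$, so the 5-unit flow sees a drop probability of 0.6 while the 2-unit flow is untouched.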
We now revisit the original AFD scheme using this normalization. Instead of uniformly sampling each packet with probability $1/s$, packets belonging to flow i are sampled with probability $\frac{1}{s \cdot w_i}$. The estimated rate for each flow is then $r_i / w_i$, and the rest of the AFD system does not need to be altered: $m_{fair}$ is estimated in the same manner, except that it now has a different meaning, and the dropping probability is based on the normalized $m_i$ and $m_{fair}$ (see Table I for the definition of each parameter).

¹ This applies to non-responsive flows only.

C. Differentiated Drop Treatment

Once the flows have been characterized, it remains to drop packets that exceed a flow's fair share. Our solution is based on an observation made by Jacobson et al.: consider a bottleneck ingress router with a single long-lived TCP flow. When the size of the input queue is small, the average packet drop rate of this TCP flow can be approximated by $1/(\frac{3}{8} P^2)$, where P is the bandwidth (R) × delay (RTT) product in packets and the effective data rate is $0.75 R$. We find that when the packet drop rate is decreased to $1/(\frac{3}{2} P^2)$, the TCP flow achieves the full bandwidth R (proof elided due to space constraints). Happily, our result also matches the well-known relationship between window size and drop rate, $w = \sqrt{\frac{2}{3q}}$, where w is the window size and q is the average packet drop rate.

Based on our earlier description, $r_{fair} \cdot t_i$ is roughly the number of bytes transmitted by a flow receiving its fair bandwidth share over a measurement interval $t_i$. Given a specific value of RTT, the packet drop rate for a fair-bandwidth flow would then be $1/(\frac{3}{2} (\frac{r_{fair} \cdot RTT}{size_p})^2)$. Therefore, we propose the following packet drop rate heuristic:

$$d_i = 0.66 \cdot \frac{1}{\left(\frac{m_{fair}}{size_p} \cdot \frac{RTT}{t_i}\right)^2}, \quad \text{if } m_{fair} < m_i \qquad (1)$$

where $size_p$ is the average packet size of a TCP flow and $m_i$ is the observed actual byte count over a measurement interval.
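Equation (1) can be restated as a tiny function, which makes the structure clearer: $(m_{fair}/size_p) \cdot (RTT/t_i)$ is just the fair-share congestion window measured in packets. The naming below is ours, a sketch rather than the paper's implementation.

```python
def heuristic_drop_rate(m_i, m_fair, size_p, rtt, t_i):
    """Eq. (1) sketch: drop rate applied to a flow whose byte count m_i
    exceeds the fair-share byte count m_fair over interval t_i (seconds)."""
    if m_i <= m_fair:
        return 0.0
    # Fair-share window in packets: P = (m_fair / t_i) * rtt / size_p
    fair_window_pkts = (m_fair / size_p) * (rtt / t_i)
    return 0.66 / fair_window_pkts ** 2
```

For instance, a fair share of 125,000 bytes per 1 s interval with 1,000-byte packets and a 100 ms RTT gives a fair window of 12.5 packets and hence a drop rate of about 0.42%.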
Often, the value of RTT is unknown and may not be static over the lifetime of a TCP connection. We defer our discussion of how to handle an unknown RTT to the next section. Finally, for a flow of weight $w_i$, the corresponding bytes sent during the measurement interval $t_i$ will be $m_{fair} \cdot w_i$. Hence, if the observed bytes sent are substituted in, the actual drop rate becomes $d_i' = d_i / w_i^2$.

IV. DESIGN AND IMPLEMENTATION DETAILS

In this section, we discuss our prototype implementation and justify our design choices for various system parameters.

A. Aggregated Flows

Class-based WFQ extends basic WFQ by aggregating multiple responsive or non-responsive flows into a single user-defined class. It is not hard to see that classes comprised of non-responsive flows should be treated the same as a single non-responsive flow. Unfortunately, this does not hold for classes made up of responsive flows. In the single TCP flow case, when a packet is dropped, the flow immediately responds by halving its sending window. In the aggregated TCP flow case, however, a single packet drop only causes one of the flows to halve its sending window, and the aggregate window size is not reduced by half. Obviously, the aggregated case would acquire more bandwidth under the same packet drop rate. We find that with n aggregated TCP flows, the packet drop rate needs to scale quadratically so that both cases enjoy the same bandwidth share (proof omitted due to space constraints). For example, if there are n TCP flows in a single aggregated class, the corresponding drop rate would be $n^2 d_i$.

B. Unknown RTTs

As developed in Section III-C, the packet drop rate depends on the RTT of the flow. However, in practice the RTT of a flow is generally not known a priori and may vary over time depending on the path connecting the two end hosts. Hence, we seek to remove the need to include RTT in the dropping equation.
Suppose the actual RTT for a TCP flow i is $RTT_i$, but $RTT_o$ is used in the packet drop probability in Equation 1 instead. During the next measurement interval, rather than $m_{fair}$ bytes, $m_{fair} \cdot \frac{RTT_o}{RTT_i}$ bytes are transmitted. Therefore, if the actual number of bytes sent is known, the RTT ratio $\frac{RTT_o}{RTT_i}$ can be estimated and tracked. At first glance, it might appear that additional state variables are needed to estimate the RTT ratio. In fact, if a flow is sending near $m_{fair}$ or more bytes in an interval, this flow will most likely be identified by the sampling and holding hashmaps, which means the actual bytes sent are already kept in the system. To send more than $m_{fair}$ bytes and be identified, $RTT_o$ has to be larger than $RTT_i$. Fortunately, in practice a maximum RTT value can generally be safely assumed. Thus, for each identified flow, an additional variable $\gamma_i = \frac{RTT_o}{RTT_i}$ is kept. At the end of each interval, the value of $\gamma_i$ is updated as follows:

$$\gamma_i^{new} = \max\left(0.5\,\gamma_i^{old} + 0.5\,\gamma_i^{old} \cdot \frac{m_i}{\bar{m}_{fair}},\ 1\right)$$

where $m_i$ is the actual bytes sent and $\bar{m}_{fair}$ is the average of $m_{fair}$ during the measurement interval. If the flow is not found in the holding hashmap, we reset the value of $\gamma_i$ to 1. Another advantage of setting $RTT_o$ to be the maximum RTT is that $\gamma_i$ is always lower bounded by 1.

TABLE I: System parameters.

  Parameter       Meaning
  p               byte sampling probability
  w_i             assigned weight for flow i
  n               total number of flows
  m_i             bytes counted for flow i during an interval
  L               packet length
  d_i             packet dropping parameter
  m_fair          bytes counted assuming a fair flow
  T               flow identification threshold
  p_c             packet cache insertion probability
  alpha, beta     m_fair update parameters
  k               number of match-hit stages
  R               aggregated traffic load
  f_s             queue length sampling frequency
  RTT_o           estimated maximum RTT value

Fig. 1: Network topology. [Sources $s_1, \ldots, s_n$ connect through routers $R_1$ and $R_2$ to destinations $d_1, \ldots, d_n$.]
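The per-interval $\gamma_i$ update can be sketched directly from the rule above; the function and argument names are our own illustration, not the paper's code.

```python
def update_gamma(gamma_old, m_i, m_fair_avg, in_holding_table):
    """Per-interval update of gamma_i ~= RTT_o / RTT_i (sketch).

    m_i: bytes actually sent by the flow this interval.
    m_fair_avg: average fair-share byte count over the interval.
    """
    if not in_holding_table:   # flow no longer identified: reset
        return 1.0
    scaled = 0.5 * gamma_old + 0.5 * gamma_old * (m_i / m_fair_avg)
    return max(scaled, 1.0)    # RTT_o is the maximum RTT, so gamma_i >= 1
```

The exponential blend (0.5 old, 0.5 scaled) smooths out interval-to-interval noise in the byte counts, while the outer max enforces the lower bound discussed above.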
Finally, the drop probability di is written as: di = γi2 n2i · wi2 1 0.66 · mf air sizep · RT To ti 2 (2) where ni is the number of active TCP flows in a responsive flow class. Note that if ni is unknown and set to 1 in Equation 2, then γi could be used to directly estimate the To value of ni · RT RT Ti . V. E XPERIMENTAL R ESULTS We use the ns2 simulator to evaluate our proposed design. Table I summarizes all the system parameters used in our implementation.The following configuration parameters have worked well for our simulation experiments: α = 0.85, β = 0.9, fs = 100, pc = 0.1, k = 1, qtarget = 0.5 · qsize , and measurement interval = 1s. The parameters p and T are chosen in such a way that the probability of miss identification for a flow whose rate exceeds the fair share is less than 10%. We first demonstrate that no single optimal buffer allocation decision matches all offered traffic load under WFQ. In contrast EWFQ shares the same physical buffer among all flow classes, dynamically repartitioning the buffer to consistently achieve excellent bandwidth sharing under a variety of traffic loads without operator intervention. Further, even if the offered traffic load is known in advance, it is still challenging to determine the optimal buffer allocation decision under WFQ. Any sub-optimal allocation would inevitably result in degraded buffer utilization. The second part of this section then compares the relative performance of both WFQ and EWFQ under 2000 s1 Attained bandwidth (kbps) 1800 s2 s3 s 1 s2 s1 s3 s2 s3 1600 1400 1200 1000 800 600 400 200 0 EWFQ WFQ Allocation 1 WFQ Allocation 2 Fig. 2: Per flow bandwidth for the “Uniform” network. 2000 Attained bandwidth (kbps) 1800 s 1 s 2 1600 s3 s 3 s1 s 2 1400 s 3 s1 s 2 1200 1000 800 600 400 200 0 EWFQ WFQ Allocation 1 WFQ Allocation 2 Fig. 3: Per flow bandwidth for the “Varied” network. a range of buffer sizes, showing that EWFQ can achieve more with much less buffer memory. 
We use the network topology in Figure 1 with n = 3 source nodes. Under the first scenario, "Uniform", the links from s1 and s2 to R1 have a delay of 10 ms. In the second scenario, "Varied", these links have delays of 50 and 100 ms, respectively. Delays for all other links in both scenarios are 1 ms. In addition, the bottleneck connection between R1 and R2 is 5 Mbps while all other links have a bandwidth of 10 Mbps. Both s1 and s2 send TCP traffic, and s3 sends UDP traffic. All traffic flows have equal weight and the total queue size is 30 KB. For WFQ, we use two buffer allocations, (10, 10, 10) KB ("Uniform allocation") and (15, 5, 10) KB ("Varied allocation"). EWFQ dynamically partitions a single 30 KB buffer and does not require a specific allocation to be pre-configured.

Figure 2 summarizes the results for the "Uniform" network scenario. It shows the attained bandwidth for each flow under EWFQ and under the two different WFQ buffer allocations. EWFQ achieves a fair bandwidth allocation (fairness index 0.9999), as does the "Uniform allocation" for WFQ (fairness 0.9998), which matches the network scenario. The "Varied allocation" for WFQ also performs well (fairness 0.9963) even on this scenario.

Figure 3 shows similar results but for the "Varied" network scenario. EWFQ requires no operator configuration, yet still achieves excellent bandwidth fairness (0.9978). The WFQ buffer allocations, however, perform less well. The fairness among flows using the "Varied allocation" for WFQ, even though it is tuned for this scenario, suffers in comparison (0.9765), and the "Uniform allocation" for WFQ performs slightly worse yet (0.9733). With a shared buffer implementation, EWFQ performs consistently well under both scenarios: slightly better than WFQ in the "Uniform" scenario, and significantly better in the "Varied" scenario. The purpose here is not to show how much EWFQ could outperform WFQ.
Rather, we want to illustrate that no single choice of static WFQ buffer allocation matches all scenarios. Instead, dynamically adjusting allocations as EWFQ does can effectively adapt to changes in network conditions without reconfiguration by the operator.

We then evaluate the impact of buffer size on the relative behavior of EWFQ and WFQ and, at least in one traffic instance, the amount of buffer space required by WFQ to achieve performance similar to EWFQ. In this experiment the topology and workload remain fixed, while we vary the amount of buffer memory available to balance per-flow bandwidth, total bandwidth, and fairness among the flows. We configure the topology in Figure 1 with five source nodes, four sending TCP traffic and one sending UDP traffic. The bottleneck link between R1 and R2 has a bandwidth of 10 Mbps, and the links connecting each si to R1 are 2.5 Mbps. The bandwidth of all other links is 10 Mbps. The link delays are 2, 20, 100, 200, and 50 ms, respectively, for the links connecting s1, ..., s5 to R1. We create five equally-weighted flow classes, with each flow from si belonging to one class. Since all flows have the same weight, we allocate the buffer space equally for each flow under WFQ (one-fifth of the buffer size per flow). We then vary the total buffer memory from 5 KB to 150 KB in increments of 1 KB.

Figures 4(a) and 4(b) show the attained bandwidth per flow over the range of buffer sizes, and Figure 4(c) shows the Jain fairness index across the flows. Particularly for small to medium buffer sizes, WFQ struggles to distribute bandwidth equally across flows. Small queue sizes induce frequent packet drops, placing long-RTT TCP flows (s3, s4) at a disadvantage as they recover from loss. The attained bandwidth for these flows is well below their fair share (at very small queue sizes, small-RTT flows achieve 10× the bandwidth of the long-RTT flows).
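The fairness numbers quoted throughout this section are Jain's fairness index, which for the equal-weight classes used in these experiments can be computed as below (our own sketch; for unequal weights one would normalize each rate by its weight first):

```python
def jain_fairness(rates):
    """Jain's fairness index over per-flow rates: (sum x)^2 / (n * sum x^2).
    Equals 1.0 for perfectly equal shares and 1/n when one flow takes all."""
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(r * r for r in rates))
```

The index is scale-free (doubling every rate leaves it unchanged), which is why it is a convenient summary across the different buffer sizes in Figure 4(c).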
Since WFQ is a work-conserving scheduling algorithm, the small-RTT flows (s1, s2) are able to take more than their fair share. As more buffer space becomes available, flows gradually converge towards their fair share. However, the fairness index does not start to approach 1.0 until the buffer size exceeds 90 KB and, even with a buffer size of 150 KB, there is still considerable variation among the flows. In contrast, EWFQ compensates for differences in RTT among flows and drops packets from small-RTT flows more frequently, resulting in an extremely fair bandwidth share among competing flows with different RTTs. As buffer sizes increase, the attained bandwidth curves consistently overlap each other, showing that EWFQ is able to attain higher bandwidths in a fair manner. Even with a buffer size as low as 10 KB, EWFQ achieves a fairness index close to 1.0.

Figure 4(d) shows the total aggregate bandwidth across flows as a function of buffer size. Total bandwidth increases with buffer size because fewer packets are dropped from the queues. When the buffer space is very small (≤ 10 KB), the total attained bandwidth under EWFQ is smaller than under WFQ. There are two reasons for this difference. The first is that, when buffer space is small, the AQM mechanism becomes less stable and may lead to oscillations in which queue occupancy swings back and forth between 0% and 100%. When oscillation occurs, bandwidth performance degrades.
Fig. 4: Comparison between WFQ and EWFQ on fairness and attained bandwidth with respect to buffer space. [(a) Bandwidth per flow vs. buffer space in WFQ; (b) bandwidth per flow vs. buffer space in EWFQ; (c) fairness vs. buffer space; (d) total bandwidth vs. buffer space.]

The other reason is that EWFQ by design avoids the situation where the queue is completely full. Once the queue is full, packets are dropped unconditionally rather than shaped by the drop probability. An extremely shallow buffer results in a very small target queue length. Therefore, although EWFQ achieves over 0.98 fairness at very small buffer sizes, it does so at the penalty of lower aggregate bandwidth. As buffer space gradually increases, the AQM mechanism becomes more stable, and EWFQ matches and then exceeds WFQ in aggregate bandwidth attained at 30 KB. At this buffer size, EWFQ can apportion buffer space across flows according to their RTTs such that all flows obtain their desired bandwidth share. WFQ, in contrast, partitions the buffer into separate per-flow queues and cannot reallocate excess capacity for use by other flows. In summary, considering that WFQ approaches the same degree of fairness as EWFQ only at 150 KB, while both approach maximum total bandwidth at 30 KB, for this particular traffic instance WFQ needs 5× the buffer space of EWFQ to achieve the same aggregate bandwidth and fairness.

VI.
CONCLUSION

Weighted fair queueing, being a close approximation of generalized processor sharing, can provide near-optimal fair bandwidth sharing, bounded delay, and traffic isolation. Unfortunately, it is also very difficult to allocate the appropriate amount of buffer space for each flow, particularly when the offered traffic load is unknown a priori. Further, no single configuration works well for all types of traffic workloads and, even when the offered load is known, the optimal buffer allocation decision remains challenging. In this paper we have introduced EWFQ, an AQM mechanism that drops packets differentially based on the weight and type of the corresponding flow. For workloads consisting of a variety of traffic flows, simulation results indicate that EWFQ can attain near-perfect fairness among competing flows while requiring much less buffer space than a WFQ implementation that uses separate queues. In addition, EWFQ frees operators from having to specify buffer allocations for traffic classes by dynamically sharing the same buffer among all flows.

ACKNOWLEDGMENT

We are indebted to Barath Raghavan and George Varghese, who first exposed us to the buffer-management problem in WFQ implementations.

REFERENCES

[1] A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha, "Sharing the Data Center Network," in Proc. USENIX NSDI, 2011.
[2] J. C. R. Bennett and H. Zhang, "WF2Q: Worst-case fair weighted fair queueing," in Proc. IEEE INFOCOM, 1996, pp. 120–128.
[3] M. Shreedhar and G. Varghese, "Efficient fair queueing using deficit round robin," in Proc. ACM SIGCOMM, 1995, pp. 231–242.
[4] R. Pan, B. Prabhakar, and K. Psounis, "CHOKe - a stateless active queue management scheme for approximating fair bandwidth allocation," in Proc. IEEE INFOCOM, vol. 2, 2000, pp. 942–951.
[5] R. Pan, L. Breslau, B. Prabhakar, and S. Shenker, "Approximate fairness through differential dropping," SIGCOMM Comput. Commun. Rev., vol. 33, pp. 23–39, Apr. 2003.
[6] T. Ott, T. Lakshman, and L.
Wong, "SRED: Stabilized RED," in Proc. IEEE INFOCOM, vol. 3, Mar. 1999, pp. 1346–1355.
[7] A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar, "AF-QCN: Approximate fairness with quantized congestion notification for multi-tenanted data centers," in Proc. High Performance Interconnects (HOTI), Aug. 2010, pp. 58–65.
[8] J. Kim, H. Yoon, and I. Yeom, "Active queue management for flow fairness and stable queue length," IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 4, pp. 571–579, Apr. 2011.
[9] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," in Proc. ACM SIGCOMM, 2002, pp. 323–336.
[10] F. Hao, M. Kodialam, T. Lakshman, and H. Zhang, "Fast, memory-efficient traffic estimation by coincidence counting," in Proc. IEEE INFOCOM, vol. 3, Mar. 2005, pp. 2080–2090.
[11] M. Kodialam, T. V. Lakshman, and S. Mohanty, "Runs bAsed Traffic Estimator (RATE): A simple, memory efficient scheme for per-flow rate estimation," in Proc. IEEE INFOCOM, 2004.
[12] V. Jacobson, K. Nichols, and K. Poduri, "RED in a different light," Cisco, Tech. Rep., 1999.
[13] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose, "Modeling TCP throughput: A simple model and its empirical validation," in Proc. ACM SIGCOMM, 1998, pp. 303–314.
[14] V. Firoiu and M. Borden, "A study of active queue management for congestion control," in Proc. IEEE INFOCOM, vol. 3, Mar. 2000, pp. 1435–1444.