Dell EMC ScaleIO: Networking Best Practices and Design

DELL EMC SCALEIO
Networking Best Practices and Design Considerations
ABSTRACT
This document describes core concepts and best practices for designing, troubleshooting, and maintaining a Dell EMC™ ScaleIO® network of any size.
October 2016
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the
information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright © 2016 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its
subsidiaries. Other trademarks may be the property of their respective owners. Published in the USA 10/16, White Paper, H14708.3
Dell EMC believes the information in this document is accurate as of its publication date. The information is subject to change without
notice.
TABLE OF CONTENTS
EXECUTIVE SUMMARY ...........................................................................................................6
AUDIENCE AND USAGE ..........................................................................................................6
SCALEIO FUNCTIONAL OVERVIEW.......................................................................................6
SCALEIO SOFTWARE COMPONENTS ...................................................................................7
ScaleIO Data Servers (SDS) ............................................................................................................. 8
ScaleIO Data Clients (SDC) .............................................................................................................. 8
Meta Data Managers (MDM) ............................................................................................................. 9
TRAFFIC TYPES .......................................................................................................................9
ScaleIO Data Client (SDC) to ScaleIO Data Server (SDS) ............................................................... 9
ScaleIO Data Server (SDS) to ScaleIO Data Server (SDS) .............................................................. 9
Meta Data Manager (MDM) to Meta Data Manager (MDM) ............................................................ 10
Meta Data Manager (MDM) to ScaleIO Data Client (SDC) ............................................................. 10
Meta Data Manager (MDM) to ScaleIO Data Server (SDS) ............................................................ 10
Other Traffic .................................................................................................................................... 10
NETWORK INFRASTRUCTURE............................................................................................ 10
Flat Network Topologies .................................................................................................................. 11
IPv4 and IPv6 .................................................................................................................................. 12
NETWORK PERFORMANCE AND SIZING ........................................................................... 12
Network Latency.............................................................................................................................. 12
Network Throughput ........................................................................................................................ 12
Example: An SDS-only node with 10 HDDs ............................................................................................. 13
Example: An SDS-only node with 6 SSDs and 18 HDDs ......................................................................... 13
Write-heavy environments ....................................................................................................................... 13
Hyper-converged environments ............................................................................................................... 14
NETWORK HARDWARE ....................................................................................................... 14
Two NICs vs. Four NICs and Other Configurations ......................................................................... 14
Switch Redundancy......................................................................................................................... 15
Buffer Capacity ................................................................................................................................ 15
IP CONSIDERATIONS ........................................................................................................... 15
IP-level Redundancy ....................................................................................................................... 15
ETHERNET CONSIDERATIONS ........................................................................................... 16
Jumbo Frames ................................................................................................................................ 16
VLAN Tagging ................................................................................................................................. 16
LINK AGGREGATION GROUPS ........................................................................................... 17
LACP ............................................................................................................................................... 17
Load Balancing................................................................................................................................ 17
Multiple Chassis Link Aggregation Groups...................................................................................... 18
THE MDM NETWORK ............................................................................................................ 18
NETWORK SERVICES........................................................................................................... 18
DNS ................................................................................................................................................. 18
DHCP .............................................................................................................................................. 19
DYNAMIC ROUTING CONSIDERATIONS ............................................................................ 19
Bidirectional Forwarding Detection (BFD) ....................................................................................... 19
Physical Link Configuration ............................................................................................................. 20
ECMP .............................................................................................................................................. 20
OSPF .............................................................................................................................................. 20
Link State Advertisements (LSAs) ................................................................................................... 20
Shortest Path First (SPF) Calculations ............................................................................................ 21
BGP................................................................................................................................................. 21
Host to Leaf Connectivity ................................................................................................................ 22
Leaf and Spine Connectivity ............................................................................................................ 23
Leaf to Spine Bandwidth Requirements .......................................................................................... 23
VRRP Engine .................................................................................................................................. 24
AMS Considerations........................................................................................................................ 24
VMWARE CONSIDERATIONS .............................................................................................. 24
IP-level Redundancy ....................................................................................................................... 24
LAG and MLAG ............................................................................................................................... 24
SDC................................................................................................................................................. 25
SDS ................................................................................................................................................. 25
MDM ................................................................................................................................................ 25
VALIDATION METHODS ....................................................................................................... 25
ScaleIO Native Tools....................................................................................................................... 25
SDS Network Test ................................................................................................................................... 25
SDS Network Latency Meter Test ............................................................................................................ 26
Iperf, NetPerf, and Tracepath .......................................................................................................... 27
Network Monitoring ......................................................................................................................... 27
Network Troubleshooting Basics ..................................................................................................... 27
SUMMARY OF RECOMMENDATIONS ................................................................................. 28
Traffic Types.................................................................................................................................... 28
Network Infrastructure ..................................................................................................................... 28
Network Performance and Sizing .................................................................................................... 28
Network Hardware........................................................................................................................... 28
IP Considerations ............................................................................................................................ 28
Ethernet Considerations .................................................................................................................. 29
Link Aggregation Groups ................................................................................................................. 29
The MDM Network .......................................................................................................................... 29
Network Services ............................................................................................................................ 29
Dynamic Routing Considerations .................................................................................................... 29
VMware Considerations .................................................................................................................. 30
Validation Methods .......................................................................................................................... 30
CONCLUSION ........................................................................................................................ 30
REFERENCES ........................................................................................................................ 31
EXECUTIVE SUMMARY
Organizations use Dell EMC™ ScaleIO® to build software-defined storage systems on commodity hardware. A successful ScaleIO
deployment depends on the hardware it operates on, properly tuned operating system platforms, and a properly designed network
topology.
One of the key architectural advantages of ScaleIO is that it distributes load evenly across a large number of nodes. This eliminates
concerns associated with bottlenecks at storage protocol endpoints. It also frees administrators from micromanagement by moving the
granularity of operations away from individual components to management of infrastructure. Because networks are the central
component of data center infrastructure, understanding their relationship to ScaleIO is crucial for a successful deployment.
This guide provides details on network topology choices, network performance, hyper-converged considerations, Ethernet
considerations, dynamic IP routing considerations, ScaleIO implementations within a VMware® environment, validation methods, and
monitoring recommendations.
Audience and Usage
This document is meant to be accessible to readers who are not networking experts. However, an intermediate level understanding of
modern IP networking is assumed.
The “Dynamic Routing Considerations” section of this document contains information and concepts that are likely unfamiliar to storage
and virtualization administrators. However, this white paper was written with storage and virtualization administrators in mind, so those
concepts are described methodically and with care.
Readers familiar with ScaleIO may choose to skip the “ScaleIO Functional Overview” and “ScaleIO Software Components” sections.
Specific recommendations appearing in bold are re-visited in the “Summary of Recommendations” section near the end of this
document.
This guide provides a minimal set of network best practices. It does not cover every networking best practice for ScaleIO. A ScaleIO
technical expert may recommend more comprehensive best practices than covered in this guide. The focus of this document is
Ethernet deployments of ScaleIO. If you are considering an InfiniBand™ deployment, work with a Dell EMC ScaleIO representative.
Cisco® Nexus switches are used in the examples in this document, but the same principles apply to any network vendor. For
detailed guidance in the use of Dell network equipment, see the ScaleIO IP Fabric Best Practice document.
ScaleIO Functional Overview
ScaleIO is software that creates a server-based SAN from direct-attached server storage to deliver flexible and scalable performance
and capacity on demand. As an alternative to a traditional SAN infrastructure, ScaleIO combines HDDs, SSDs, and PCIe flash cards to
create a virtual pool of block storage with varying performance tiers. ScaleIO provides enterprise-grade data protection, multi-tenant
capabilities, and add-on enterprise features such as QoS, thin provisioning, and snapshots. ScaleIO is hardware-agnostic, supports
physical and virtual application servers, and has been proven to deliver significant TCO savings vs. traditional SAN.
Traditional storage vs. ScaleIO
Massive Scale - ScaleIO can scale from three to 1024 nodes. The scalability of performance is linear with regard to the growth of the
deployment. As devices or nodes are added, ScaleIO rebalances data evenly, resulting in a balanced and fully utilized pool of
distributed storage.
Extreme Performance - Every device in a ScaleIO storage pool is used to process I/O operations. This massive I/O parallelism
eliminates bottlenecks. Throughput and IOPS scale in direct proportion to the number of storage devices added to the storage pool.
Performance and data protection optimization is automatic. Component loss triggers a rebuild operation to preserve data protection.
Addition of a component triggers a rebalance to increase available performance and capacity. Both operations occur in the background
with no downtime to applications and users.
Compelling Economics - ScaleIO does not require a Fibre Channel fabric or dedicated components like HBAs. There are no forklift
upgrades for outdated hardware. Failed and outdated components are simply removed from the system. ScaleIO can reduce the cost
and complexity of the solution resulting in greater than 60 percent TCO savings vs. traditional SAN.
Unparalleled Flexibility - ScaleIO provides flexible deployment options. In a two-layer deployment, the applications and storage are
installed in separate servers. A two-layer deployment allows compute and storage teams to maintain operational autonomy. In a hyper-converged deployment, the applications and storage are installed on the same set of servers. This provides the lowest footprint and
cost profile. The deployment models can also be mixed to provide independent scaling of compute and storage resources. ScaleIO is
infrastructure agnostic. It can be used with mixed server brands, virtualized and bare metal operating systems, and storage media types
(HDDs, SSDs, and PCIe flash cards).
Supreme Elasticity - Storage and compute resources can be increased or decreased whenever the need arises. The system
automatically rebalances data on the fly. Additions and removals can be done in small or large increments. No capacity planning or
complex reconfiguration is required. Rebuild and rebalance operations happen automatically without operator intervention.
Essential Features for Enterprises and Service Providers - With ScaleIO, you can limit the amount of performance (IOPS or
bandwidth) that selected customers can consume. The limiter allows for resource usage to be imposed and regulated, preventing
application “hogging” scenarios. ScaleIO offers instantaneous, writeable snapshots for data backups and cloning.
DRAM caching enables you to improve read performance by using server RAM. Any group of servers hosting storage that may go
down together (such as SDSs residing on nodes in the same physical enclosure) can be grouped together in a fault set. A fault set can
be defined to ensure data mirroring occurs outside the group, improving business continuity. Volumes can be thin provisioned,
providing on-demand storage as well as faster setup and startup times.
ScaleIO also provides multi-tenant capabilities via protection domains and storage pools. Protection Domains allow you to isolate
specific servers and data sets. Storage Pools can be used for further data segregation, tiering, and performance management. For
example, data that is accessed very frequently can be stored in a flash-only storage pool for the lowest latency, while less frequently
accessed data can be stored in a low-cost, high-capacity pool of spinning disks.
ScaleIO Software Components
ScaleIO fundamentally consists of three types of software components: the ScaleIO Data Server (SDS), the ScaleIO Data Client (SDC),
and the Meta Data Manager (MDM).
A logical illustration of a two-layer ScaleIO deployment. Systems running the ScaleIO Data Clients (SDCs) reside on different physical
servers than those running the ScaleIO Data Servers (SDSs). Each volume available to an SDC is distributed across many systems
running the SDS. The Meta Data Managers (MDMs) reside outside the data path, and are only consulted by SDCs when an SDS fails
or when the data layout changes. Hyper-converged deployments, rebuild, and rebalance operations would be represented as a
complete graph, where all nodes are logically connected to all other nodes (not shown).
SCALEIO DATA SERVERS (SDS)
The ScaleIO Data Server (SDS) serves raw local storage in a server as part of a ScaleIO cluster. The SDS is the server-side software
component. A server that takes part in serving data to other nodes has an SDS installed on it. A collection of SDSs form the ScaleIO
persistence layer.
SDSs maintain redundant copies of the user data, protect each other from hardware loss, and reconstruct data protection when
hardware components fail. SDSs may leverage SSDs, PCIe based flash, spinning media, RAID controller write caches, available RAM,
or a combination of the above.
SDSs may run natively on Windows or Linux, or as a virtual appliance on ESX. A ScaleIO cluster may have up to 1024 nodes, each running
an SDS. Each SDS requires only 500 megabytes of RAM.
SDS components can communicate directly with each other. They are fully meshed and optimized for rebuild, rebalance, and I/O
parallelism. Data layout between SDS components is managed through storage pools, protection domains, and fault sets.
Client volumes used by the SDCs are placed inside a storage pool. Storage pools are used to logically aggregate types of storage
media at drive-level granularity. Storage pools provide varying levels of storage service priced by capacity and performance.
Protection from node, device, and network connectivity failure is managed at node-level granularity through protection domains.
Protection domains are groups of SDSs where replicas are maintained.
Fault sets allow large systems to tolerate multiple simultaneous failures by preventing redundant copies from residing in a single rack
or chassis.
SCALEIO DATA CLIENTS (SDC)
The ScaleIO Data Client (SDC) allows an operating system or hypervisor to access data served by ScaleIO clusters. The SDC is a
client-side software component that can run natively on Windows®, Linux, or ESX®. It is analogous to a software initiator, but is
optimized to use networks and endpoints in parallel.
The SDC provides the operating system or hypervisor running it access to logical block devices called “volumes”. A volume is
analogous to a LUN in a traditional SAN. Each logical block device provides raw storage for a database or a file system.
The SDC knows which ScaleIO Data Server (SDS) endpoint to contact based on block locations in a volume. The SDC consumes
distributed storage resources directly from other systems running ScaleIO. SDCs do not share a single protocol target or network end
point with other SDCs. SDCs distribute load evenly and autonomously.
The SDC is extremely lightweight. SDC to SDS communication is inherently multi-pathed across SDS storage servers, in contrast to
approaches like iSCSI, where multiple clients target a single protocol endpoint.
The SDC allows for shared volume access for uses such as clustering. The SDC does not require an iSCSI initiator, a fibre channel
initiator, or an FCoE initiator. Each SDC requires only 50 megabytes of RAM. The SDC is optimized for simplicity, speed, and
efficiency.
META DATA MANAGERS (MDM)
MDMs control the behavior of the ScaleIO system. They determine and provide the mapping between clients and their data, keep track
of the state of the system, and issue reconstruct directives to SDS components.
MDMs establish the notion of quorum in ScaleIO. They are the only tightly clustered component of ScaleIO. They are authoritative,
redundant, and highly available. They are not consulted during I/O operations or during SDS to SDS operations like rebuild and
rebalance. When a hardware component fails, the MDM cluster will begin an auto-healing operation within seconds.
Traffic Types
ScaleIO performance, scalability, and security can benefit when the network architecture reflects ScaleIO traffic patterns. This is
particularly true in large ScaleIO deployments. The software components that make up ScaleIO (the SDCs, SDSs, and MDMs)
converse with each other in a predictable way. Architects designing a ScaleIO deployment should be aware of these traffic
patterns in order to make informed choices about the network layout.
A simplified illustration of how the ScaleIO software components communicate. A ScaleIO system will have many SDCs, SDSs, and
MDMs. This illustration groups SDCs, SDSs, and MDMs. The arrows from the SDSs and MDMs pointing back to themselves represent
communication to other SDSs and MDMs. The traffic patterns are the same regardless of the physical location of an SDC, SDS, or
MDM.
ScaleIO Data Client (SDC) to ScaleIO Data Server (SDS)
Traffic between the SDCs and the SDSs forms the bulk of front end storage traffic. Front end storage traffic includes all read and write
traffic arriving at or originating from a client. This network has a high throughput requirement.
ScaleIO Data Server (SDS) to ScaleIO Data Server (SDS)
Traffic between SDSs forms the bulk of back end storage traffic. Back end storage traffic includes writes that are mirrored between
SDSs, rebalance traffic, and rebuild traffic. This network has a high throughput requirement.
Although not required, there may be situations where isolating front-end and back-end storage traffic is ideal. This
is required in two-layer deployments where the storage and server teams act independently.
Meta Data Manager (MDM) to Meta Data Manager (MDM)
MDMs are used to coordinate operations inside the cluster. They issue directives to ScaleIO to rebalance, rebuild, and redirect traffic.
MDMs are redundant, and must communicate with each other to maintain a shared understanding of data layout. MDMs also establish
the notion of quorum in ScaleIO.
MDMs do not carry or directly interfere with I/O traffic. MDMs do not require the same level of throughput required for SDS or SDC
traffic. MDM to MDM traffic requires a stable, reliable, low latency network. MDM to MDM traffic is considered back end storage
traffic.
ScaleIO supports the use of one or more networks dedicated to traffic between MDMs. At least two 10-gigabit links should be used for
each network connection.
Meta Data Manager (MDM) to ScaleIO Data Client (SDC)
The master MDM must communicate with SDCs in the event that data layout changes. This can occur because the SDSs that host
storage for the SDCs are added, removed, placed in maintenance mode, or go offline. Communication between the Master MDM and
the SDCs is asynchronous.
MDM to SDC traffic requires a reliable, low latency network. MDM to SDC traffic is considered front end storage traffic.
Meta Data Manager (MDM) to ScaleIO Data Server (SDS)
The master MDM must communicate with SDSs to issue rebalance and rebuild directives. MDM to SDS traffic requires a reliable, low
latency network. MDM to SDS traffic is considered back end storage traffic.
Other Traffic
Other traffic includes management, installation, and reporting. This includes traffic to the ScaleIO Gateway (REST Gateway, Installation
Manager, and SNMP trap sender), traffic to and from the Light Installation Agent (LIA), and reporting or management traffic to the
MDMs (such as syslog for reporting and LDAP for administrator authentication) or to an AMS (Automated Management Services)
installation. See the ScaleIO User Guide for more information.
SDCs do not communicate with other SDCs. This can be enforced using private VLANs and network firewalls.
Network Infrastructure
Leaf-spine (also called Clos) and flat network topologies are the most common in use today. Flat networks are used in very small
networks. In modern datacenters leaf-spine topologies are preferred over legacy hierarchical topologies. This section compares flat and
leaf-spine topologies as a transport medium for ScaleIO data traffic.
Dell EMC recommends the use of a non-blocking network design. Non-blocking network designs allow the use of all switch ports
concurrently, without blocking some of the network ports to prevent message loops. Therefore, Dell EMC strongly recommends against
the use of Spanning Tree Protocol (STP) on a network hosting ScaleIO.
In order to achieve maximum performance and predictable quality of service, the network should not be over-subscribed.
Leaf-Spine Network Topologies
A two tier leaf-spine topology provides a single switch hop between leaf switches and provides a large amount of bandwidth between
end points. A properly sized leaf-spine topology eliminates oversubscription of uplink ports. Very large datacenters may use a three tier
leaf-spine topology. For simplicity, this paper focuses on two tier leaf-spine deployments.
In a leaf-spine topology, each leaf switch is attached to all spine switches. Leaf switches do not need to be directly connected to other
leaf switches. Spine switches do not need to be directly connected to other spine switches.
In most instances, Dell EMC recommends using a leaf-spine network topology. This is because:
• ScaleIO can scale out to 1024 nodes.
• Leaf-spine architectures are future proof. They facilitate scale-out deployments without having to re-architect the network.
• A leaf-spine topology allows the use of all network links concurrently. Legacy hierarchical topologies must employ technologies like Spanning Tree Protocol (STP), which blocks some ports to prevent loops.
• Properly sized leaf-spine topologies provide more predictable latency due to the elimination of uplink oversubscription.
A two tier leaf-spine network topology. Each leaf switch has multiple paths to every other leaf switch. All links are active. This provides
increased throughput between devices on the network. Leaf switches may be connected to each other for use with MLAG (not shown).
Flat Network Topologies
A flat network topology may be easier to implement, and may be the preferred choice if an existing flat network is being extended or if
the network is not expected to scale. In a flat network, all the switches are used to connect hosts. There are no spine switches.
If you expand beyond a small number of switches, the additional cross-link ports required would likely make a flat network topology
cost-prohibitive.
The primary use-cases for a flat network topology are:
• Small datacenter deployments that will not grow.
• Remote office or back office.
• Small business.
A flat network. This network design reduces cost and complexity at the expense of redundancy and scalability. In this visualization,
each switch is a single point of failure. It is possible to build a flat network without a single point of failure using technology such as
MLAG (not shown).
IPv4 and IPv6
ScaleIO 2.0 provides IPv6 support. Earlier versions of ScaleIO support Internet Protocol version 4 (IPv4) addressing only.
Network Performance and Sizing
A properly sized network frees network and storage administrators from concerns over individual ports or links becoming performance
or operational bottlenecks. The management of networks instead of endpoint hot-spots is a key architectural advantage of ScaleIO.
Because ScaleIO distributes I/O across multiple points in a network, network performance must be sized appropriately.
Network Latency
Network latency is important to account for when designing your network. Minimizing network latency improves performance and
reliability. For best performance, latency for all SDS and MDM communication should not exceed 1
millisecond network-only round-trip time under normal operating conditions. Since wide-area networks’ (WANs) lowest response
times generally exceed this limit, you should not operate ScaleIO clusters across a WAN.
Latency should be tested in both directions between all components. This can be verified by pinging, and more extensively by the SDS
Network Latency Meter Test. The open source tool iPerf can be used to verify bandwidth. Please note that iPerf is not supported by Dell
EMC. iPerf and other tools used for validating a ScaleIO deployment are covered in detail in the “Validation Methods” section of this
document.
Network Throughput
Network throughput is also a critical component when designing your ScaleIO implementation. Throughput is important to reduce the
amount of time it takes for a failed node to rebuild; reduce the amount of time it takes to redistribute data in the event of uneven data
distribution; optimize the amount of I/O a node is capable of delivering; and meet performance expectations.
While ScaleIO can be deployed on a 1-gigabit network, storage performance will likely be bottlenecked by network capacity. At a
minimum, Dell EMC recommends leveraging 10-gigabit network technology.
Additionally, although the ScaleIO cluster itself may be heterogeneous, the SDS components that make up a protection domain
should reside on hardware with equivalent storage and network performance. This is because the total bandwidth of the
protection domain will be limited by the weakest link during reconstruct and I/O operations.
In addition to throughput considerations, it is recommended that each node have at least two separate network connections for
redundancy, regardless of throughput requirements. This remains important even as network technology improves. For instance,
replacing two 10-gigabit links with a single 25- or 40-gigabit link improves throughput but sacrifices link-level network redundancy.
In most cases, the amount of network throughput to a node should match or exceed the maximum throughput of the storage media
hosted on the node. Stated differently, a node’s network requirements are proportional to the total performance of its underlying storage
media.
When determining the amount of network throughput required, keep in mind that modern media performance is typically measured in
megabytes per second, but modern network links are typically measured in gigabits per second.
To translate megabytes per second to gigabits per second, first multiply megabytes by 8 to translate to megabits, and then divide the
megabits by 1,000 to find gigabits:
gigabits = (megabytes ∗ 8) / 1,000
Note that this is not precise, as it does not account for the base-2 definition of “kilo” as 1024, but it is adequate for this purpose.
Example: An SDS-only node with 10 HDDs
Assume that you have a node hosting only an SDS. This is not a hyper-converged environment, so only storage traffic must be taken
into account. The node contains 10 hard disk drives. Each of these drives is individually capable of delivering a raw throughput of 100
megabytes per second under the best conditions (sequential I/O, which ScaleIO is optimized for during reconstruct and rebalance
operations). The total throughput of the underlying storage media is therefore 1,000 megabytes per second.
10 ∗ 100 megabytes = 1,000 megabytes
Then convert 1,000 megabytes to gigabits using the equation described earlier: multiply 1,000 megabytes by 8, and then divide by 1,000.
(1,000 megabytes ∗ 8) / 1,000 = 8 gigabits
In this case, if all the drives on the node are serving read operations at the maximum speed possible, the total network throughput
required would be 8 gigabits per second. We are accounting for read operations only, which is typically enough to estimate the network
bandwidth requirement. This can be serviced by a single 10 gigabit link. However, since network redundancy is encouraged, this node
should have at least two 10 gigabit links.
Example: An SDS-only node with 6 SSDs and 18 HDDs
This is another two-layer example, where only storage traffic must be taken into account. In this case, the node hosts 6 SSDs that can
each deliver 450 megabytes per second, and 18 HDDs that can each deliver 100 megabytes per second.
(6 ∗ 450 megabytes) + (18 ∗ 100 megabytes) = 4,500 megabytes
The SDS has 4,500 megabytes of potential raw storage throughput. Convert the result into gigabits.
(4,500 megabytes ∗ 8) / 1,000 = 36 gigabits
Four 10 gigabit links can service the node’s potential read throughput. This estimation does not account for writes, but is sufficient for
most cases. Because there will be four links, the loss of a network link will not bring down this SDS, if it is configured properly.
Note that this level of throughput is high for a single node. Verify that the RAID controller on the node can also meet or exceed the
maximum throughput of the underlying storage media. If it cannot, size the network according to the maximum achievable
throughput of the RAID controller.
Write-heavy environments
Read and write operations produce different traffic patterns in a ScaleIO environment. When a host (SDC) makes a single 4k read
request, it must contact a single SDS to retrieve the data. The 4k block is transmitted once, out of a single SDS. If that host makes a
single 4k write request, the 4k block must be transmitted to the primary SDS, then out of the primary SDS, then into the secondary
SDS.
Write operations therefore require three times the SDS network bandwidth of read operations. However, a write operation involves two
SDSs, rather than the one required for a read operation. The per-SDS bandwidth requirement ratio of reads to writes is therefore 1:1.5.
Stated differently, per SDS, a write operation requires 1.5 times the network throughput of a read operation relative to the
throughput of the underlying storage.
Under ordinary circumstances, the storage bandwidth calculations described earlier are sufficient. However, if some of the SDSs in
the environment are expected to host a write-heavy workload, consider adding network capacity.
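As a rough worked illustration, reusing the 10-HDD node from the earlier example as an assumed workload rather than a sizing rule: a sustained 1,000 megabytes per second of client writes generates roughly 3,000 megabytes per second of SDS network transfer (into the primary SDS, out of the primary SDS, and into the secondary SDS). Because each SDS acts as the primary for some blocks and the secondary for others, this averages to about 1.5 times the read-only requirement per SDS:
(1,000 megabytes ∗ 1.5 ∗ 8) / 1,000 = 12 gigabits per SDS, versus 8 gigabits for an equivalent read workload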
Hyper-converged environments
When ScaleIO is in a hyper-converged deployment, each physical node is running an SDS, an SDC on the hypervisor, and one or more
VMs. Hyper-converged deployments optimize hardware investments, but they also introduce network sizing requirements.
The storage bandwidth calculations described earlier apply to hyper-converged environments, but front-end bandwidth to any
virtual machines, hypervisor or OS traffic, and traffic from the SDC, must also be taken into account. Though sizing for the
virtual machines is outside the scope of this technical report, it is a priority.
In hyper-converged environments, it is also a priority to logically separate storage traffic from other network traffic.
An example of a hyper-converged VMware environment using four 10 gigabit network connections. ScaleIO traffic on this host uses ports
Eth0 and Eth1. Redundancy is provided with native ScaleIO IP multipathing, rather than MLAG. Ports Eth2 and Eth3 use both MLAG
and VLAN tagging, and provide network access to the hypervisor and the other guests. Other configurations are possible, as
ScaleIO also supports VLAN tagging and link aggregation.
Network Hardware
Two NICs vs. Four NICs and Other Configurations
ScaleIO allows for the scaling of network resources through the addition of network interfaces. Although not required, there
may be situations where isolating front-end and back-end storage traffic is ideal. This is a requirement in
two-layer deployments where the storage and virtualization teams each manage their own networks. Another reason to segment
front-end and back-end network traffic is to guarantee the performance of storage- and application-related network traffic. In all cases
Dell EMC recommends multiple interfaces for redundancy, capacity, and speed.
PCI NIC redundancy is also a consideration. The use of two dual-port PCI NICs on each server is preferable to the use of a single
quad-port PCI NIC, as two dual-port PCI NICs can be configured to survive the failure of a single NIC.
Switch Redundancy
In most leaf-spine configurations, spine switches and top-of-rack (ToR) leaf switches are redundant. This provides continued access to
components inside the rack in the event a ToR switch fails. In cases where each rack contains a single ToR switch, ToR
switch failure will result in an inability to access the SDS components inside the rack. Therefore, if a single ToR switch is used per
rack, consider defining fault sets at the rack level.
Buffer Capacity
To maximize ScaleIO stability and performance, the leaf switches should have deep buffers. Deep buffers help protect
against packet loss during periods of congestion and while a network is recovering from link or device failure.
IP Considerations
IP-level Redundancy
MDMs, SDSs, and SDCs can have multiple IP addresses, and can therefore reside in more than one network. This provides options for
load balancing and redundancy.
ScaleIO natively provides redundancy and load balancing across physical network links when an MDM or SDS is configured to send
traffic across multiple links. In this configuration, each physical network port available to the MDM or SDS is assigned its own IP
address, each in a different subnet.
The use of multiple subnets provides redundancy at the network level. The use of multiple subnets also ensures that as traffic is sent
from one component to another, a different entry in the source component’s route table is chosen depending on the destination IP
address. This prevents a single physical network port at the source from being a bottleneck as the source contacts multiple IP
addresses (each corresponding to a physical network port) on a single destination.
Stated differently, a bottleneck at the source port may happen if multiple physical ports on the source and destination are in the same
subnet. For example, if two SDSs share a single subnet, each SDS has two physical ports, and each physical port has its own IP
address in that subnet, the IP stack will cause the source SDS to always choose the same physical source port. Splitting ports across
subnets allows for load balancing, because each port corresponds to a different subnet in the host’s routing table.
A comparison of operating system IP configurations. The improper IP configuration on the left uses the same subnet, 10.10.10.0/24, for
all traffic. When Server A initiates a connection to Server B, the network link providing a route to 10.10.10.0/24 will always be chosen
for the outgoing connection. The second network port on Server A will not be utilized for outgoing connections. The proper IP
configuration on the right uses two subnets, 10.10.10.0/24 and 192.168.1.0/24, allowing both ports on Server C to be utilized for
outgoing connections. Note: the subnets chosen in this example (10.10.10.0/24 and 192.168.1.0) are arbitrary: the mixed use of a class
“A” and a class “C” is meant for visual distinction only.
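As a minimal sketch of the “proper” configuration described above, assuming a Linux host and reusing the example subnets from the figure (the interface names eth0 and eth1 and the host addresses are arbitrary):
  # assign each physical port an IP address in a different subnet
  ip addr add 10.10.10.12/24 dev eth0
  ip addr add 192.168.1.12/24 dev eth1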
When each MDM or SDS has multiple IP addresses, ScaleIO will handle load balancing more effectively due to its awareness of the
traffic pattern. This can result in a small performance boost. Additionally, link aggregation maintains its own set of timers for link-level
failover. Native ScaleIO IP-level redundancy can therefore ease troubleshooting when a link goes down.
IP-level redundancy also protects against IP address conflicts. To protect against unwanted IP changes or conflicts, DHCP should not
be deployed on a network where ScaleIO MDMs or SDCs reside.
IP-level redundancy is preferred, but not strongly preferred, over link aggregation for links in use by SDC and SDS
components. IP-level redundancy is strongly preferred over MLAG for links in use for MDM to MDM communication.
In this two-layer deployment example, the nodes running the SDS and the SDC are using IP-level redundancy for ScaleIO traffic. The
virtual machines running on the SDC node are using physical ports bound in a Multiple Chassis Link Aggregation Group (MLAG) with
VLAN tagging. ScaleIO traffic is divided across multiple front-end and back-end storage networks. ScaleIO is using administrator-specified IP roles to control network traffic. As with the hyper-converged example, other configurations are possible because ScaleIO
supports VLAN tagging and link aggregation.
Ethernet Considerations
Jumbo Frames
While ScaleIO does support jumbo frames, enabling jumbo frames can be challenging depending on your network infrastructure.
Inconsistent implementation of jumbo frames by the various network components can lead to performance problems that are difficult to
troubleshoot. When jumbo frames are in use, they must be enabled on every network component used by ScaleIO infrastructure,
including the hosts and switches.
Enabling jumbo frames allows more data to be passed in a single Ethernet frame. This decreases the total number of Ethernet frames
and the number of interrupts that must be processed by each node. If jumbo frames are enabled on every component in your ScaleIO
infrastructure, there may be a performance benefit of approximately 10%, depending on your workload.
Because of the relatively small performance gains and potential for performance problems, Dell EMC recommends leaving jumbo
frames disabled initially. Enable jumbo frames only after you have a stable working setup and confirmed that your infrastructure
can support their use. Take care to ensure that jumbo frames are configured on all nodes along each path. Utilities like the Linux
tracepath command can be used to discover MTU sizes along a path.
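As a minimal usage sketch, assuming a Linux host and a hypothetical destination address, tracepath reports the discovered path MTU (pmtu) toward a destination, which can be compared against the expected jumbo frame MTU on every hop:
  # -n suppresses DNS lookups; substitute the address of an SDS or MDM on the path being verified
  tracepath -n 192.168.1.20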
VLAN Tagging
ScaleIO supports native VLANs and VLAN tagging on the connection between the server and the access or leaf switch. When
measured by ScaleIO engineering, both options provided the same level of performance.
Link Aggregation Groups
Link Aggregation Groups (LAGs) and Multi-Chassis Link Aggregation Groups (MLAGs) combine ports between end points. The end
points can be a switch and a host with LAG or two switches and a host with MLAG. Link aggregation terminology and implementation
varies by switch vendor. MLAG functionality on Cisco Nexus switches is called Virtual Port Channels (vPC).
LAGs use the Link Aggregation Control Protocol (LACP) for setup, tear down, and error handling. LACP is a standard, but there are
many proprietary variants.
Regardless of the switch vendor or the operating system hosting ScaleIO, LACP is recommended when link aggregation groups
are used. The use of static link aggregation is not recommended.
Link aggregation can be used as an alternative to IP-level redundancy, where each physical port has its own IP address. Link
aggregation can be simpler to configure, and useful in situations where IP address exhaustion is an issue. Link aggregation must be
configured on both the node running ScaleIO and the network equipment it is attached to.
IP-level redundancy is slightly preferred over link aggregation, but ScaleIO is resilient and high performance regardless of the choice of
IP-level redundancy or link aggregation. Performance of SDSs and SDCs when MLAG is in use is close to the performance of IP-level
redundancy. The choice of MLAG or IP-level redundancy for SDSs and SDCs should therefore be considered an operational
decision.
The exception is MDM to MDM traffic, where IP-level redundancy or LAG is strongly recommended over MLAG.
LACP
LACP sends a message across each physical network link in the aggregated group of network links on a periodic basis. This message
is part of the logic that determines if each physical link is still active. The frequency of these messages can be controlled by the network
administrator using LACP timers.
LACP timers can typically be configured to detect link failures at a fast rate (one message per second) or a normal rate (one message
every 30 seconds). When an LACP timer is configured to operate at a fast rate, corrective action is taken quickly. Additionally, the
relative overhead of sending a message every second is small with modern network technology.
LACP timers should be configured to operate at a fast rate when link aggregation is used between a ScaleIO SDS and a
switch.
To establish an LACP connection, one or both of the LACP peers must be configured to use active mode. It is therefore
recommended that the switch connected to the ScaleIO node be configured to use active mode across the link.
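A minimal configuration sketch, assuming Cisco NX-OS syntax; the interface range and channel-group number are arbitrary, and equivalent settings exist on other vendors' switches:
  ! member ports facing the ScaleIO node: LACP active mode with fast-rate timers
  interface Ethernet1/10-11
    channel-group 10 mode active
    lacp rate fast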
Load Balancing
When multiple network links are active in a link aggregation group, the endpoints must choose how to distribute traffic between the
links. Network administrators control this behavior by configuring a load balancing method on the end points. Load balancing methods
typically choose which network link to use based on some combination of the source or destination IP address, MAC address, or
TCP/UDP port.
This load-balancing method is referred to as a “hash mode”. Hash mode load balancing aims to keep traffic to and from a certain pair of
source and destination addresses or transport ports on the same physical link, provided that link remains active.
The recommended configuration of hash mode load balancing depends on the operating system in use.
If a node running an SDS has aggregated links to the switch and is running Windows, the hash mode should be configured to
use “Transport Ports”. This mechanism uses the source and destination TCP/UDP ports and IP addresses to load balance between
physical network links.
If a node running an SDS has aggregated links to the switch and is running VMware ESX®, the hash mode should be
configured to use “Source and destination IP address” or “Source and destination IP address and TCP/UDP port”.
If a node running an SDS has aggregated links to the switch and is running Linux, the hash mode on Linux should be
configured to use the "xmit_hash_policy=layer2+3" or "xmit_hash_policy=layer3+4" bonding option. The
"xmit_hash_policy=layer2+3" bonding option uses the source and destination MAC and IP addresses for load balancing. The
"xmit_hash_policy=layer3+4" bonding option uses the source and destination IP addresses and TCP/UDP ports for load balancing.
On Linux, the “miimon=100” bonding option should also be used. This option directs Linux to verify the status of each physical link
every 100 milliseconds.
Note that the name of each bonding option may vary depending on the Linux distribution, but the recommendations remain the same.
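A minimal sketch of the corresponding Linux bonding options; the file path is hypothetical, and the exact mechanism (module options, ifcfg files, or a network manager profile) varies by distribution:
  # /etc/modprobe.d/bonding.conf (hypothetical location)
  # mode=802.3ad selects LACP; miimon=100 checks link state every 100 milliseconds;
  # xmit_hash_policy=layer3+4 balances on IP addresses and TCP/UDP ports
  options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4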
Multiple Chassis Link Aggregation Groups
Like link aggregation groups (LAGs), MLAGs provide network link redundancy. Unlike LAGs, MLAGs allow a single end point (such as
a node running ScaleIO) to be connected to multiple switches. Switch vendors use different names when referring to MLAG, and MLAG
implementations are typically proprietary.
The use of MLAG is supported by ScaleIO, but is not recommended for MDM to MDM traffic. The options described in the “Load
Balancing” section also apply to the use of MLAG.
The MDM Network
Although MDMs do not reside in the data path between hosts (SDCs) and their distributed storage (SDSs), they are responsible for
maintaining relationships between themselves to keep track of the state of the cluster. MDM to MDM traffic is therefore sensitive to
network events that impact latency, such as the loss of a physical network link in an MLAG.
It is recommended that MDMs use IP-level redundancy on two or more network segments rather than MLAG. The MDMs may
share one or more dedicated MDM cluster networks.
MDMs are redundant. ScaleIO can therefore survive not just an increase in latency, but loss of MDMs. The use of MLAG to a node
hosting an MDM will work. However, if you require the use of MLAG on a network that carries MDM to MDM traffic, please work
with a Dell EMC ScaleIO representative to ensure you have chosen a robust design.
Two physical nodes, running the MDM and the SDS (left and right), connected to a pair of leaf switches (center). The Ethernet links
carrying SDS traffic (Eth3 and Eth4 on the physical nodes) are aggregated together using MLAG. The Ethernet links carrying MDM
traffic (Eth1 and Eth2 on the physical nodes) are not aggregated. The physical ports carrying MDM traffic reside in separate subnets,
and use IP-level redundancy for performance and reliability.
Network Services
DNS
The MDM cluster maintains the database of system components and their IP addresses. In order to eliminate the possibility of a DNS
outage impacting a ScaleIO deployment, the MDM cluster does not track system components by hostname or FQDN. If a hostname or
FQDN is used when registering a system component with the MDM cluster, it is resolved to an IP address and the component is
registered with its IP address.
Therefore, hostname and FQDN changes do not influence inter-component traffic in a ScaleIO deployment.
DHCP
In order to reduce the likelihood of unplanned IP address modifications or IP address conflicts, DHCP should not be used in network
segments that contain MDMs or SDSs. DHCP should not be used to allocate addresses to MDMs or SDSs. Change management
procedures for adding or removing IP addresses are encouraged.
Dynamic Routing Considerations
In large leaf-spine environments consisting of hundreds or thousands of nodes, the network infrastructure may be required to
dynamically route ScaleIO traffic.
A central objective when routing ScaleIO traffic is to reduce the convergence time of the routing protocol. When a component or link fails,
the router or switch must detect the failure; the routing protocol must propagate the changes to the other routers; then each router or
switch must re-calculate the route to each destination node. If the network is configured correctly, this process can happen in less than
300 milliseconds: fast enough to maintain MDM cluster stability.
If the convergence time approaches or exceeds 400 milliseconds, the MDM cluster may fail over to a secondary MDM. The system will
continue to operate and I/O will continue if the MDM fails over, but 300 milliseconds is a good target to maintain maximum system
stability. Timeout values for other system component communication mechanisms are much higher, so the system should be designed
for the most demanding timeout requirements: those of the MDMs.
For the fastest possible convergence time, standard best practices apply. This means conforming to all network vendor best practices
designed to achieve that end, including the absence of underpowered routers (weak links) that prevent rapid convergence.
The default OSPF or BGP configuration of every tested network vendor does not converge quickly enough. Every routing protocol
deployment, irrespective of network vendor, must include performance tweaks to minimize convergence time. These tweaks
include the use of Bidirectional Forwarding Detection (BFD) and the adjustment of failure-related timing mechanisms.
OSPF and BGP have both been tested with ScaleIO. ScaleIO is known to function without errors during link and device failures when
routing protocols and networking devices are configured properly. However, OSPF is recommended over BGP. This recommendation
is supported by test results that indicate OSPF converges faster than BGP when both are configured optimally for fast convergence.
Bidirectional Forwarding Detection (BFD)
Regardless of the choice of routing protocol (OSPF or BGP), the use of Bidirectional Forwarding Detection (BFD) is required. BFD
reduces the overhead associated with protocol-native hello timers, allowing link failures to be detected quickly. BFD provides faster
failure detection than native protocol hello timers for a number of reasons including reduction in router CPU and bandwidth utilization.
BFD is therefore strongly recommended over aggressive protocol hello timers.
ScaleIO is stable during network fail-overs when it is deployed with BFD and optimized OSPF and BGP routing. Sub-second failure
detection must be enabled with BFD.
For a network to converge, the event must be detected, propagated to other routers, processed by the routers, and the routing
information base (RIB) or Forwarding Information Base (FIB) must be updated. All these steps must be performed for the routing
protocol to converge, and they should all complete in less than 300 milliseconds.
In tests using Cisco 9000 and 3000 series switches, a BFD hold down timer of 150 milliseconds was sufficient. The configuration for
a 150 millisecond hold down timer consisted of 50 millisecond transmission intervals, with a 50 millisecond min_rx and a multiplier of 3.
The ScaleIO recommendation is to use a maximum hold down timer of 150 milliseconds. If your switch vendor supports BFD hold down
timers of less than 150 milliseconds, the shortest achievable hold down timer is preferred. BFD should be enabled in asynchronous
mode when possible.
In environments using Cisco vPC (MLAG), BFD should also be enabled on all routed interfaces and all host-facing interfaces
running Virtual Router Redundancy Protocol (VRRP).
An example of a BFD configuration on a Cisco Nexus switch. BFD is configured with a hold down timer of 150 milliseconds (the interval
is 50 milliseconds, the multiplier is 3). OSPF on interface port-channel51 and VRRP on interface Vlan30 are both configured as clients
of BFD.
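A minimal sketch of this kind of configuration, assuming Cisco NX-OS syntax; the OSPF process tag is arbitrary, and the VRRP client registration shown in the figure is omitted because its syntax varies by platform and release:
  feature bfd
  ! 50 ms transmit interval, 50 ms min_rx, and a multiplier of 3 yield a 150 ms hold down timer
  bfd interval 50 min_rx 50 multiplier 3
  router ospf 1
    ! register the OSPF process as a BFD client on its interfaces
    bfd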
Physical Link Configuration
Timers involved with link failures are candidates for tuning. Link down and interface down event detection and handling varies by
network vendor and product line. On Cisco Nexus switches, the “carrier-delay” timer should be set to 0 milliseconds on each
SVI interface, and the “link debounce” timer should be set to 0 milliseconds on each physical interface.
Carrier delay (carrier-delay) is a timer on the switch. It is applicable to an SVI interface. Carrier delay represents the amount of
time the switch should wait before it notifies the application when a link failure is detected. Carrier delay is used to prevent flapping
event notification in unstable networks. In modern leaf-spine environments, all links should be configured as point-to-point, providing a
stable network. The recommended value for an SVI interface carrying ScaleIO traffic is 0 milliseconds.
Debounce (link debounce) is a timer that delays link-down notification in firmware. It is applicable to a physical interface. Debounce
is similar to carrier delay, but it is applicable to physical interfaces, rather than logical interfaces, and is used for link down notifications
only. Traffic is stopped during the wait period. A nonzero link debounce setting can affect the convergence of routing protocols. The
recommended value for a link debounce timer is 0 milliseconds for a physical interface carrying ScaleIO traffic.
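A minimal sketch of these settings, assuming Cisco NX-OS syntax (the SVI Vlan30 and physical interface Ethernet1/1 are placeholders; command availability varies by product line, so verify against your switch documentation):
interface Vlan30
  carrier-delay msec 0

interface Ethernet1/1
  link debounce time 0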
ECMP
The use of Equal-Cost Multi-Path Routing (ECMP) is required. ECMP distributes traffic evenly between leaf and spine switches,
and provides high availability using redundant leaf to spine network links. ECMP is analogous to MLAG, but operates over layer 3 (IP),
rather than over Ethernet.
ECMP is on by default with OSPF on Cisco Nexus switches. It is not on by default with BGP on Cisco Nexus switches, so it must be
enabled manually. The ECMP hash algorithm used should be layer 3 (IP) or layer 3 and layer 4 (IP and TCP/UDP port).
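A minimal sketch of a layer 3 and layer 4 ECMP hash configuration, assuming Cisco NX-OS global syntax (verify the exact options against your switch documentation):
ip load-sharing address source-destination port source-destination

For BGP, enabling ECMP also requires the maximum-paths setting shown in the BGP configuration example later in this section.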
OSPF
OSPF is the preferred routing protocol because, when it is configured properly, it converges rapidly. When OSPF is used, the leaf and
spine switches all reside in a single OSPF area. To provide stable, sub-300 millisecond convergence time, it is necessary to tune
Link State Advertisement (LSA) and SPF timers. On all leaf and spine switches, the OSPF interfaces should be configured as
point-to-point and the OSPF process should be configured as a client of BFD.
Link State Advertisements (LSAs)
Link State Advertisements (LSAs) are used by link state routing protocols such as OSPF to notify neighboring devices that the network
topology has changed due to a device or link failure. LSA timers are configurable to prevent the network from being flooded with LSAs if a
network port is flapping.
The LSA configuration on leaf and spine switches should be tuned with a start interval of 10 milliseconds or less. This means
that if multiple LSAs are sourced from the same device, LSAs will not be sent more often than every 10 milliseconds.
The LSA hold interval on leaf and spine switches should be tuned to 100 milliseconds or less. This means that if a subsequent
LSA needs to be generated within that time period, it will be generated only after the hold interval has elapsed. Once this occurs on a
Cisco Nexus switch, the hold interval will then be doubled until it reaches the max interval. When the max interval is reached, the
topology must remain stable for twice the max interval before the hold interval is reset to the start interval.
ScaleIO testing was performed using a max interval of 5000 milliseconds (5 seconds). The max interval is less important than the start
and hold interval settings, provided it is large enough to prevent excessive LSA traffic.
LSA arrival timers allow the system to drop duplicate copies of the same link state advertisement that arrive within the specified interval.
The LSA arrival timer must be less than the hold interval. The recommended LSA arrival timer setting is 80 milliseconds.
Shortest Path First (SPF) Calculations
To prevent overutilization of router hardware during periods of network instability, shortest path first calculations can be delayed. This
prevents the router from continually recalculating path trees as a result of rapid and continual topology fluctuations.
On Cisco Nexus switches, the algorithm that controls SPF timers is similar to the algorithm that controls LSAs. SPF timers should be
throttled with a start time of 10 milliseconds or less and a hold time of 100 milliseconds or less. As with the LSA max interval, a
max hold time of 5000 milliseconds was used under test, which is a reasonable default, but can be adjusted if needed.
OSPF Leaf Configuration
OSPF Spine Configuration
OSPF configuration examples on a Cisco Nexus leaf switch and spine switch. Both reside in the same OSPF area (100). All
interfaces running OSPF are configured as point-to-point. BFD is configured on the OSPF router. SPF and LSA timers are configured
to minimize convergence time in the event of a link or switch failure.
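A minimal sketch of the leaf-side configuration, assuming Cisco NX-OS syntax (the process tag, router ID, interface name, and addressing are placeholders; a spine switch would carry an equivalent configuration on its leaf-facing interfaces):
feature ospf
feature bfd

router ospf 100
  router-id 1.1.1.1
  bfd
  timers throttle lsa 10 100 5000
  timers lsa-arrival 80
  timers throttle spf 10 100 5000

interface Ethernet1/49
  description uplink-to-spine-1
  ip address 10.1.1.1/31
  ip ospf network point-to-point
  ip router ospf 100 area 0.0.0.100

The timers throttle lsa and timers throttle spf lines correspond to the 10 millisecond start interval, 100 millisecond hold interval, and 5000 millisecond max interval discussed above, and the bfd command registers the OSPF process as a client of BFD.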
BGP
Though OSPF is preferred because it can converge faster, BGP can also be configured to converge within the required time frame.
BGP is not configured to use ECMP on Cisco Nexus switches by default; it must be configured manually. If BGP is required,
EBGP is recommended over IBGP. EBGP supports ECMP with minimal configuration, whereas IBGP requires a BGP route reflector and
the add-path feature to fully support ECMP.
EBGP can be configured so that each leaf and spine switch represents a different Autonomous System Number (ASN). In this
configuration, each leaf must peer with every spine. Alternatively, EBGP can be configured so that all spine switches share the
same ASN and each leaf switch represents a different ASN.
BGP leaf and spine switches should be configured for fast external failover (fast-external-failover on Cisco). This
command setting allows the switch to terminate BGP connections over a dead network link, without waiting for a hold down timer to
expire.
Leaf and spine switches should also enable ECMP by allowing the switch to load share across multiple BGP paths, regardless
of the ASN. On Cisco, this is done using the “bestpath as-path multipath-relax” configuration. EBGP may require additional
configuration parameters to enable ECMP. On Cisco, this includes setting the “maximum-paths” parameter to the number of available
paths to spine switches.
EBGP with ScaleIO requires that BFD be configured for each leaf and spine neighbor. When using BGP, the SDS and MDM
networks are advertised by the leaf switch.
Leaf Configuration
Spine Configuration
BGP configuration examples on a Cisco Nexus leaf switch and spine switch. They reside in different autonomous systems
(65123 and 65122). The “fast-external-failover” and “bestpath as-path multipath-relax” options are enabled on both. The
“maximum-paths” parameter is tuned on both to match the number of paths to be used for ECMP (in this example, both are 4, but that
may not always be the case). BFD is enabled for each leaf or spine neighbor. The leaf switch is configured to advertise the ScaleIO
MDM and SDS networks (20.20.20.0/24 and 30.30.30.0/24).
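A minimal sketch of the leaf-side configuration, assuming Cisco NX-OS syntax (the router ID and neighbor address are placeholders; the ASNs and advertised networks follow the description above; the spine side would mirror this with its own ASN and leaf-facing neighbors; verify exact syntax, including the fast external failover keyword, against your switch documentation):
feature bgp
feature bfd

router bgp 65123
  router-id 1.1.1.1
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    network 20.20.20.0/24
    network 30.30.30.0/24
    maximum-paths 4
  neighbor 10.1.1.0
    remote-as 65122
    bfd
    address-family ipv4 unicast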
Host to Leaf Connectivity
Leaf switch failure is protected against by using a multi-homed topology with either a dual subnet configuration or MLAG.
Just as in environments where traffic is not dynamically routed, if MLAG is in use for SDS and SDC traffic, a separate IP
network without MLAG is recommended for the MDM cluster. This increases stability by preventing MDM cluster failover when a
link fails.
As with environments that do not employ dynamic routing, the leaf switches should have a deep buffer size in order to protect
against packet loss during periods of congestion, especially during link or switch fail-over.
Leaf and Spine Connectivity
Configurations consisting of multiple uplinks between each leaf and spine switch are supported. In these configurations, each leaf
switch is connected using multiple IP subnets to the same spine switch. Leaf switches can also be connected to a spine switch using
bonded (aggregated) links. Aggregated links between leaf and spine switches use LAG, rather than MLAG. In a properly configured
system, failover times for LAG are sufficient for MDM traffic.
Leaf to Spine Bandwidth Requirements
Assuming storage media is not a performance bottleneck, calculating the amount of bandwidth required between leaf and spine
switches involves determining the amount of bandwidth available from each leaf switch to the attached hosts, discounting the amount of
I/O that is likely to be local to the leaf switch, then dividing the remote bandwidth requirement between each of the spine switches.
Consider a situation with two racks where each rack contains two leaf switches and 20 servers, each server has two 10 gigabit
interfaces, and each of these servers is dual-homed to the two leaf switches in the rack. In this case, the downstream bandwidth from
each of the leaf switches is calculated as:
20 servers ∗ 10 gigabits per server = 200 gigabits
The downstream bandwidth requirement for each leaf switch is 200 gigabits. However, some of the traffic will be local to the pair of leaf
switches, and therefore will not need to traverse the spine switches.
The amount traffic that is local to the leaf switches in the rack is determined by the number of racks in the configuration. If there are two
racks, 50% of the traffic will likely be local. If there are three racks, 33% of the traffic will likely be local. If there are four racks, 25% of
the traffic is likely to be local, and so on. Stated differently, the proportion of I/O that is likely to be remote will be:
percent_remote = (number_of_racks − 1) / number_of_racks
In this example, there are two racks, so 50% of the bandwidth is likely to be remote:
percent_remote = (2 racks − 1) / 2 racks = 50%
Given that there are two racks in this example, 50% of the bandwidth is likely to be remote. Multiply the amount of traffic expected to be
remote by the downstream bandwidth of each leaf switch to find the total remote bandwidth requirement from each leaf switch:
remote_bandwidth_requirement = 200 gigabits ∗ 50% remote = 100 gigabits
100 gigabits of bandwidth is required between the leaf switches. However, this bandwidth will be distributed between spine switches, so
an additional calculation is required. To find the upstream requirements to each spine switch from each leaf switch, divide the remote
bandwidth requirement by the number of spine switches, since remote load is balanced between the spine switches.
leaf_to_spine_bandwidth_requirement = remote_bandwidth_requirement / number_of_spine_switches
In this example, each leaf switch is expected to demand 100 gigabits of remote bandwidth through the mesh of spine switches. Since
this load will be distributed among the spine switches, the total bandwidth between each leaf and spine is calculated as:
leaf_to_spine_bandwidth_requirement = 100 gigabits / 2 spine switches = 50 gigabits per spine switch
Therefore, for a nonblocking topology, two 40 gigabit connections for a total of 80 gigabits is sufficient bandwidth between each leaf and
spine switch. Alternatively, five 10 gigabit connections from each leaf switch to each spine switch for a total of 50 gigabits is sufficient.
The equation to determine the amount of bandwidth needed from each leaf switch to each spine switch can be summarized as:
(downstream_bandwidth_requirement ∗ ((number_of_racks − 1) / number_of_racks)) / number_of_spine_switches
VRRP Engine
For routed access architectures with Cisco vPC and IP-level redundancy on the nodes, Dell EMC recommends using VRRP for the
node default gateway. This allows the default gateway to fail over to the other leaf switch in the event of leaf switch failure.
BFD should be enabled for each VRRP instance. As with routing protocols, BFD allows VRRP to fail over quickly. It is recommended
that VRRP be configured as primary on the active vPC peer and secondary on the backup vPC peer.
Leaf Switch 1
Leaf Switch 2
A VRRP configuration example on a pair of Cisco Nexus leaf switches. VRRP is a client of BFD on both switches. The active vPC peer
should act as the VRRP primary while the backup vPC peer should act as the VRRP secondary.
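A minimal sketch of the active vPC peer’s side, assuming Cisco NX-OS syntax (the VLAN, addresses, and priority are placeholders; the backup peer would use the same virtual address with a lower priority; the command that registers VRRP as a client of BFD varies by platform and release, so it is omitted here):
feature vrrp

interface Vlan30
  ip address 30.30.30.2/24
  vrrp 30
    address 30.30.30.1
    priority 120
    no shutdown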
AMS Considerations
When AMS is used to configure new Dell EMC ScaleIO Ready Nodes that reside in different subnets, an AMS discovery server is
required on each subnet to act as a proxy. If no AMS proxy server exists on the subnet, AMS discovery will fail, and you will be
unable to configure the new nodes through AMS.
VMware Considerations
Though network connections are virtualized in ESX, the same principles of physical network layout described in this document apply.
Specifically, this means that MLAG should be avoided on links carrying MDM traffic unless a Dell EMC ScaleIO representative has
been consulted.
It is helpful to think of the physical network from the perspective of the network stack on the virtual machine running the MDM or SDS, or
the network stack in use by the SDC in the VMkernel. Considering the needs of the guest- or host-level network stack, and then applying
them to the physical network, can inform decisions about the virtual switch layout.
IP-level Redundancy
When network link redundancy is provided using a dual subnet configuration, two separate virtual switches are needed. This
is required because each virtual switch has its own physical uplink port. When ScaleIO is run in hyper-converged mode, this
configuration has three interfaces: a VMkernel port for the SDC, a VM network for the SDS, and an uplink for physical network access.
ScaleIO natively supports installation in this mode.
LAG and MLAG
The use of the distributed virtual switch is required when LAG or MLAG is used. The standard virtual switch does not support
LACP, and is therefore not recommended. When LAG or MLAG is used, the bonding is done on physical uplink ports.
ScaleIO installation does not natively support LAG or MLAG installation. It must be configured manually or automated using other
means.
If a node running an SDS or SDC has aggregated links to a switch, the hash mode on the physical uplink ports should be configured to
use “Source and destination IP address” or “Source and destination IP address and TCP/UDP port”.
SDC
The SDC is a kernel driver for ESX that implements the ScaleIO client. Since it runs in ESX, it uses one or more VMkernel ports for
communication with the other ScaleIO components. If redundancy is desired, IP-level redundancy, LAG, or MLAG can be used.
SDS
The SDS is deployed as a virtual appliance on ESX. It can use IP-level redundancy, LAG, or MLAG.
MDM
The MDM is deployed as a virtual appliance on ESX. The use of IP-level redundancy is strongly recommended over the use of MLAG.
A single MDM should therefore use two or more separate virtual switches.
The use of two dual-port PCI NICs is preferable to one quad-port NIC because it provides redundancy in the event of a PCI failure. In this
VMware example, two network ports, Eth0 and Eth2, reside on different dual-port PCI NICs and are used as physical uplinks for two
separate virtual switches. Eth0 and Eth2 are connected to two separate physical switches for redundancy. Eth3 and Eth1 also reside on
different dual-port PCI NICs; they are bound together using MLAG and form the uplink for a single distributed vswitch.
Validation Methods
ScaleIO Native Tools
There are two main built-in tools that monitor network performance:
• SDS Network Test
• SDS Network Latency Meter Test
SDS Network Test
The SDS network test, “start_sds_network_test”, is covered in the ScaleIO User Manual. To fetch the results after it is run, use
the “query_sds_network_test_results” command.
It is important to note that the network_test_size_gb option should be set to at least 2x the maximum network bandwidth. For
example: 1x10 gigabit NIC = 1250 megabytes per second ∗ 2 = 2500 megabytes, rounded up to 3 gigabytes. In this case you should run
the command with “--network_test_size_gb 3”. This ensures that you are sending enough data on the network to produce a consistent
test result. The parallel_messages option should be set equal to the total number of cores in your system, with a maximum of 16.
Example Output:
scli --start_sds_network_test --sds_ip 10.248.0.23
--network_test_size_gb 8 --parallel_messages 8
Network testing successfully started.
scli --query_sds_network_test_results --sds_ip 10.248.0.23
SDS with IP 10.248.0.23 returned information on 7 SDSs
SDS 6bfc235100000000 10.248.0.24 bandwidth 2.4 GB (2474 MB) per-second
SDS 6bfc235200000001 10.248.0.25 bandwidth 3.5 GB (3592 MB) per-second
SDS 6bfc235400000003 10.248.0.26 bandwidth 2.5 GB (2592 MB) per-second
SDS 6bfc235500000004 10.248.0.28 bandwidth 3.0 GB (3045 MB) per-second
SDS 6bfc235600000005 10.248.0.30 bandwidth 3.2 GB (3316 MB) per-second
SDS 6bfc235700000006 10.248.0.27 bandwidth 3.0 GB (3056 MB) per-second
SDS 6bfc235800000007 10.248.0.29 bandwidth 2.6 GB (2617 MB) per-second
In the example above, you can see the network performance from the SDS you are testing to every other SDS in the network. Ensure
that the speed per second is close to the expected performance of your network configuration.
SDS Network Latency Meter Test
The "query_network_latency_meters" command can be used to show the average network latency between SDS components.
Low latency between SDS components is crucial for good write performance. When running this test, look for outliers and latency
higher than a few hundred microseconds when 10 gigabit or better network connectivity is used. Note that this should be run from each
SDS.
Example Output:
scli --query_network_latency_meters --sds_ip 10.248.0.23
SDS with IP 10.248.0.23 returned information on 7 SDSs
SDS 10.248.0.24
Average IO size: 8.0 KB (8192 Bytes)
Average latency (micro seconds): 231
SDS 10.248.0.25
Average IO size: 40.0 KB (40960 Bytes)
Average latency (micro seconds): 368
SDS 10.248.0.26
Average IO size: 38.0 KB (38912 Bytes)
Average latency (micro seconds): 315
SDS 10.248.0.28
Average IO size: 5.0 KB (5120 Bytes)
Average latency (micro seconds): 250
SDS 10.248.0.30
Average IO size: 1.0 KB (1024 Bytes)
Average latency (micro seconds): 211
SDS 10.248.0.27
Average IO size: 9.0 KB (9216 Bytes)
Average latency (micro seconds): 252
SDS 10.248.0.29
Average IO size: 66.0 KB (67584 Bytes)
Average latency (micro seconds): 418
Iperf, NetPerf, and Tracepath
NOTE: Iperf and NetPerf should be used to validate your network before configuring ScaleIO. If you identify issues with Iperf
or NetPerf, there may be network issues that need to be investigated. If you do not see issues with Iperf/NetPerf, use the
ScaleIO internal validation tools for additional and more accurate validation.
Iperf is a traffic generation tool, which can be used to measure the maximum possible bandwidth on IP networks. The Iperf feature set
allows for tuning of various parameters and reports on bandwidth, loss, and other measurements. When Iperf is used, it should be run
with multiple parallel client threads. Eight threads per IP socket is a good choice.
NetPerf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both
unidirectional throughput, and end-to-end latency.
The Linux “tracepath” command can be used to discover MTU sizes along a path.
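The following commands are a minimal sketch of how these tools might be invoked between two hosts (the address 10.248.0.24 is a placeholder; Iperf and NetPerf options vary between versions, so confirm against the installed tools):
# On the receiving host
iperf -s
# On the sending host, with eight parallel client threads
iperf -c 10.248.0.24 -P 8

# Unidirectional throughput and request/response latency with NetPerf
netperf -H 10.248.0.24 -t TCP_STREAM
netperf -H 10.248.0.24 -t TCP_RR

# Discover the path MTU toward a peer
tracepath 10.248.0.24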
Network Monitoring
It is important to monitor the health of your network to identify any issues that are preventing the network from operating at optimal
capacity, and to safeguard against network performance degradation. There are many network monitoring tools available on the
market, offering a variety of feature sets.
Dell EMC recommends monitoring the following areas:
• Input and output traffic
• Errors, discards, and overruns
• Physical port status
Network Troubleshooting Basics
• Verify connectivity end-to-end between SDSs and SDCs using ping
• Test connectivity between components in both directions
• SDS and MDM communication should not exceed 1 millisecond network-only round-trip time. Verify round-trip latency between components using ping
• Check for port errors, discards, and overruns on the switch side
• Verify ScaleIO nodes are up
• Verify ScaleIO processes are installed and running on all nodes
• Check MTU across all switches and servers
• Prefer 10 gigabit Ethernet over 1 gigabit Ethernet when possible
• Check for NIC errors, high NIC overrun rates (more than 2%), and dropped packets in the OS event logs
• Check for IP addresses without a valid NIC association
• Configure separate subnets for each NIC, to load balance across networks
• Verify the network ports needed by ScaleIO are not blocked by the network or the node
• Check for packet loss on the OS running ScaleIO using event logs or OS network commands
• Verify no other applications running on the node are attempting to use TCP ports required by ScaleIO
• Set all NICs to full duplex, with auto negotiation on, and the maximum speed supported by your network
• Check SIO test output
• Check for RAID controller misconfiguration (this is not a network issue, but it is a common performance problem)
• If you have a problem, collect the logs as soon as you can, before they are overwritten
• Additional troubleshooting, log collection information, and an FAQ are available in the ScaleIO User Guide
Summary of Recommendations
Traffic Types
• Familiarize yourself with ScaleIO traffic patterns to make informed choices about the network layout.
• To achieve high performance, low latency is important for SDS to SDS and SDC to SDS traffic.
• Pay special attention to the networks that provide MDM to MDM connectivity, prioritizing low latency and stability.
Network Infrastructure
• Use a non-blocking network design.
• Use a leaf-spine network architecture without uplink oversubscription if you plan to scale beyond a small deployment.
Network Performance and Sizing
• Ensure SDS and MDM components in the system have round-trip network response times of 1 millisecond or less between each other under normal operating conditions.
• Use 10 gigabit network technology at a minimum.
• Use redundant server network connections for availability.
• SDS components that make up a protection domain should reside on hardware with equivalent storage and network performance.
• Size SDS network throughput using media throughput as a guide.
• Convert megabytes (storage native throughput metric) to gigabits (network native throughput metric) as follows: gigabits = (megabytes ∗ 8) / 1,000
• Size network throughput based on the best achievable performance of the underlying storage media. If the RAID controller will bottleneck storage throughput, size network performance based on the RAID controller.
• If the workload is expected to be write-heavy, consider adding network bandwidth.
• If the deployment will be hyper-converged, front-end bandwidth to any virtual machines, hypervisor or OS traffic, and traffic from the SDC must also be taken into account.
Network Hardware
• In two-layer deployments, front-end (SDC) and back-end (SDS, MDM) traffic must reside on separate networks.
• In cases where each rack contains a single ToR switch, consider defining fault sets at the rack level.
• The use of two dual-port PCI NICs on each server is preferable to the use of a single quad-port PCI NIC, as two dual-port PCI NICs can be configured to survive the failure of a single NIC.
• Use leaf switches with a deep buffer size.
IP Considerations
• Splitting ports across subnets allows for load balancing, because each port corresponds to a different subnet in the host’s routing table.
• IP-level redundancy is preferred, but not strongly preferred, over MLAG for links in use by SDC and SDS components. The choice of IP-level redundancy or MLAG should be considered an operational decision.
• IP-level redundancy is strongly preferred over MLAG for links in use for MDM to MDM communication.
Ethernet Considerations
• If the use of jumbo frames is desired, enable them only after you have a stable working setup and you have confirmed that your infrastructure can support them.
Link Aggregation Groups
• The use of LACP is recommended. The use of static link aggregation is not recommended.
• When link aggregation is used between a ScaleIO SDS and a switch, use LACP fast mode.
• Configure the switch ports attached to the ScaleIO node to use active mode across the link.
• If a node running an SDS has aggregated links to the switch and is running Windows, the hash mode should be configured to use “Transport Ports”.
• If a node running an SDS has aggregated links to the switch and is running VMware ESX®, the hash mode should be configured to use “Source and destination IP address” or “Source and destination IP address and TCP/UDP port”.
• If a node running an SDS has aggregated links to the switch and is running Linux, the hash mode on Linux should be configured to use the "xmit_hash_policy=layer2+3" or "xmit_hash_policy=layer3+4" bonding option. On Linux, the “miimon=100” bonding option should also be used.
The MDM Network
• Prefer IP-level redundancy on two or more network segments rather than MLAG.
• Work with a Dell EMC ScaleIO representative if you wish to use MLAG on ports delivering MDM traffic.
Network Services
• Hostname and FQDN changes do not influence inter-component traffic in a ScaleIO deployment, because components are registered with the system using IP addresses.
• DHCP should not be used in network segments that contain MDMs or SDSs.
Dynamic Routing Considerations
• The routing protocol should converge within 300 milliseconds to maintain maximum MDM stability.
• Every routing protocol deployment must include performance tuning to minimize convergence time.
• OSPF is preferred because it can converge faster than BGP. Both can converge within the required time.
• Use BFD rather than protocol-native hello timers.
• Use short BFD hold down timers (150 milliseconds or less).
• BFD should be enabled on all routed interfaces and all host-facing interfaces running Virtual Router Redundancy Protocol (VRRP).
• Set carrier delay and link debounce timers to 0 on all routed interfaces.
• Use ECMP with layer 3 or layer 3 and layer 4 hash algorithms for load balancing.
• Configure all OSPF interfaces as point-to-point.
• With OSPF, configure LSA timers appropriately, with a start, hold, and max interval of 10, 100, and 5000 milliseconds, respectively.
• With OSPF, configure SPF timers appropriately, with a start, hold, and max interval of 10, 100, and 5000 milliseconds, respectively.
• Verify that ECMP is configured for EBGP.
• Prefer EBGP over IBGP, as IBGP requires additional configuration for ECMP.
• With BGP, verify that leaf and spine switches are configured for fast external failover.
• With BGP, verify that the SDS and MDM networks are advertised by the leaf switch.
• For a non-blocking topology, size the connections from leaf to spine switches appropriately. The equation to determine the amount of bandwidth needed from each leaf switch to each spine switch can be summarized as: (downstream_bandwidth_requirement ∗ ((number_of_racks − 1) / number_of_racks)) / number_of_spine_switches
• Use leaf switches with a deep buffer to prevent overruns during microbursts and periods of network convergence.
• When using AMS, a discovery server is required on each subnet to act as a proxy.
VMware Considerations
• When network link redundancy is provided using a dual subnet configuration, two separate virtual switches are needed.
• The use of the distributed virtual switch is required when LAG or MLAG is used. Link aggregation must be enabled on the distributed switch.
• When using IP-level redundancy, a single MDM should use two or more separate virtual switches. These switches may be shared with SDS components on a converged network.
Validation Methods
• ScaleIO provides the native tools “scli --start_sds_network_test” and “scli --query_network_latency_meters”.
• The Iperf and NetPerf tools can be used to measure network performance.
• Follow the recommendations listed in the document for troubleshooting network-related ScaleIO issues.
Conclusion
Dell EMC ScaleIO can scale to 1024 nodes hosting flash, spinning media, or both. It can be deployed in a hyper-converged mode
where compute and storage reside on the same set of nodes, in a two-layer mode where storage and compute resources are separate,
or a combination of both.
To achieve this level of performance, scalability, and flexibility, the network must be designed to account for ScaleIO’s requirements.
Following the principles and recommendations in this guide will result in a resilient, massively scalable, and high-performance block
storage infrastructure.
If you are new to ScaleIO, it is available as a free download and can be used for an unlimited time, without capacity restrictions. If you
are familiar with ScaleIO and you are planning a large-scale deployment, professional services are available through your Dell EMC
representative.
References
ScaleIO User Guide
ScaleIO IP Fabric Best Practice
ScaleIO Installation Guide
ScaleIO ECN community
VMware vSphere 5.5 Documentation Center
Dell EMC ScaleIO for VMware Environment
ScaleIO Download Page
ScaleIO Design Considerations and Best Practices
World Class, High Performance Cloud Scale Storage Solutions: Arista and EMC ScaleIO