
InfiniBand and High-Speed Ethernet for
Dummies
A Tutorial at SC ’13
by
Dhabaleswar K. (DK) Panda
Hari Subramoni
The Ohio State University
The Ohio State University
E-mail: panda@cse.ohio-state.edu
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~subramon
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
SC'13
2
Current and Next Generation Applications and
Computing Systems
• Growth of High Performance Computing
– Growth in processor performance
• Chip density doubles every 18 months
– Growth in commodity networking
• Increase in speed/features with reduced cost
• Clusters: popular choice for HPC
– Scalability, Modularity and Upgradeability
SC'13
3
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: number and percentage of commodity clusters in the Top500 over time; x-axis: Timeline, y-axes: Number of Clusters and Percentage of Clusters]
SC'13
4
Integrated High-End Computing Environments
[Diagram: a compute cluster (frontend and compute nodes on a LAN) connected over a LAN/WAN to a storage cluster (meta-data manager with meta-data, and I/O server nodes holding data), and an enterprise multi-tier datacenter for visualization and mining with Tier1 routers/servers, Tier2 application servers and Tier3 database servers linked by switches]
SC'13
5
Cloud Computing Environments
[Diagram: cloud computing environment; virtual machines hosted on physical machines connect over a LAN/WAN to a virtual network file system consisting of a physical meta-data manager (meta-data) and physical I/O server nodes (data)]
SC'13
6
Big Data Analytics with Hadoop
• Underlying Hadoop Distributed File System (HDFS)
• Fault-tolerance by replicating data blocks
• NameNode: stores information on data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• MapReduce (Distributed Computation)
• HBase (Database component)
• Model scales, but there is a high amount of communication during intermediate phases
SC '13
7
Networking and I/O Requirements
• Good System Area Networks with excellent performance (low latency, high bandwidth and low CPU utilization) for inter-processor communication (IPC) and I/O
• Good Storage Area Networks with high-performance I/O
• Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
• Quality of Service (QoS) for interactive applications
• RAS (Reliability, Availability, and Serviceability)
• With low cost
SC'13
8
Major Components in Computing Systems
• Hardware components
– Processing cores and memory subsystem
– I/O bus or links
– Network adapters/switches
• Software components
– Communication stack
• Bottlenecks can artificially limit the network performance the user perceives
[Diagram: dual-socket node (P0/P1, each with four cores, and memory) connected over the I/O bus to a network adapter and network switch; processing bottlenecks, I/O interface bottlenecks and network bottlenecks are highlighted]
SC'13
9
Processing Bottlenecks in Traditional Protocols
• Ex: TCP/IP, UDP/IP
• Generic architecture for all networks
• Host processor handles almost all aspects of communication
– Data buffering (copies on sender and receiver)
– Data integrity (checksum)
– Routing aspects (IP routing)
• Signaling between different layers
– Hardware interrupt on packet arrival or transmission
– Software signals between different layers to handle protocol processing in different priority levels
[Diagram: the same node architecture, with the processing bottleneck highlighted at the host cores]
SC'13
10
Bottlenecks in Traditional I/O Interfaces and Networks
• Traditionally relied on bus-based technologies (last mile bottleneck)
– E.g., PCI, PCI-X
– One bit per wire
– Performance increase through:
• Increasing clock speed
• Increasing bus width
– Not scalable:
• Cross talk between bits
• Skew between wires
• Signal integrity makes it difficult to increase bus width significantly, especially for high clock speeds
[Diagram: the same node architecture, with the I/O interface bottleneck highlighted at the I/O bus]
PCI (1990): 33 MHz/32 bit: 1.05 Gbps (shared bidirectional)
PCI-X (1998 v1.0, 2003 v2.0): 133 MHz/64 bit: 8.5 Gbps; 266-533 MHz/64 bit: 17 Gbps (shared bidirectional)
SC'13
11
Bottlenecks on Traditional Networks
• Network speeds saturated at around 1 Gbps
– Features provided were limited
– Commodity networks were not considered scalable enough for very large-scale systems
[Diagram: the same node architecture, with the network bottleneck highlighted at the adapter and switch]
Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
ATM (1995 - ): 155/622/1024 Mbit/sec
Myrinet (1993 - ): 1 Gbit/sec
Fibre Channel (1994 - ): 1 Gbit/sec
SC'13
12
Motivation for InfiniBand and High-speed Ethernet
• Industry Networking Standards
• InfiniBand and High-speed Ethernet were introduced into
the market to address these bottlenecks
• InfiniBand aimed at all three bottlenecks (protocol
processing, I/O bus, and network speed)
• Ethernet aimed at directly handling the network speed
bottleneck and relying on complementary technologies to
alleviate the protocol processing and I/O bus bottlenecks
SC'13
13
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
SC'13
14
IB Trade Association
• IB Trade Association was formed with seven industry leaders
(Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: To design a scalable and high performance communication
and I/O architecture by taking an integrated view of computing,
networking, and storage technologies
• Many other industry partners participated in the effort to define the IB architecture specification
• IB Architecture (Volume 1, Version 1.0) was released to public
on Oct 24, 2000
– Latest version 1.2.1 released January 2008
– Several annexes released after that (RDMA_CM - Sep’06, iSER –
Sep’06, XRC – Mar’09, RoCE – Apr’10)
• http://www.infinibandta.org
SC'13
15
High-speed Ethernet Consortium (10GE/40GE/100GE)
• 10GE Alliance formed by several industry leaders to take
the Ethernet family to the next speed step
• Goal: To achieve a scalable and high performance
communication architecture while maintaining backward
compatibility with Ethernet
• http://www.ethernetalliance.org
• 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones,
Switches, Routers): IEEE 802.3 WG
• Energy-efficient and power-conscious protocols
– On-the-fly link speed reduction for under-utilized links
SC'13
16
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
SC'13
17
Network Bottleneck Alleviation: InfiniBand (“Infinite
Bandwidth”) and High-speed Ethernet (10/40/100 GE)
• Bit serial differential signaling
– Independent pairs of wires to transmit independent
data (called a lane)
– Scalable to any number of lanes
– Easy to increase clock speed of lanes (since each lane
consists only of a pair of wires)
• Theoretically, no perceived limit on the
bandwidth
SC'13
18
Network Speed Acceleration with IB and HSE
Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
ATM (1995 - ): 155/622/1024 Mbit/sec
Myrinet (1993 - ): 1 Gbit/sec
Fibre Channel (1994 - ): 1 Gbit/sec
InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
InfiniBand (2011 - ): 54.6 Gbit/sec (4X FDR)
InfiniBand (2012 - ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
InfiniBand (2014?): 100 Gbit/sec (4X EDR)
50x increase in the last 12 years
SC'13
19
InfiniBand Link Speed Standardization Roadmap
NDR = Next Data Rate
HDR = High Data Rate
EDR = Enhanced Data Rate
FDR = Fourteen Data Rate
QDR = Quad Data Rate
DDR = Double Data Rate
SDR = Single Data Rate (not shown)
SC'13
20
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
SC'13
21
Capabilities of High-Performance Networks
• Intelligent Network Interface Cards
• Support entire protocol processing completely in hardware
(hardware protocol offload engines)
• Provide a rich communication interface to applications
– User-level communication capability
– Gets rid of intermediate data buffering requirements
• No software signaling between communication layers
– All layers are implemented on a dedicated hardware unit, and not
on a shared host CPU
SC'13
22
Previous High-Performance Network Stacks
• Fast Messages (FM)
– Developed by UIUC
• Myricom GM
– Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance
communication requirements
– Hardware offloaded protocol stack
– Support for fast and secure user-level access to the protocol stack
• Virtual Interface Architecture (VIA)
– Standardized by Intel, Compaq, Microsoft
– Precursor to IB
SC'13
23
IB Hardware Acceleration
• Some IB models have multiple hardware accelerators
– E.g., Mellanox IB adapters
• Protocol Offload Engines
– Completely implement ISO/OSI layers 2-4 (link layer, network layer
and transport layer) in hardware
• Additional hardware supported features also present
– RDMA, Multicast, QoS, Fault Tolerance, and many more
SC'13
24
Ethernet Hardware Acceleration
• Interrupt Coalescing
– Improves throughput, but degrades latency
• Jumbo Frames
– No latency impact; Incompatible with existing switches
• Hardware Checksum Engines
– Checksum performed in hardware -> significantly faster
– Shown to have minimal benefit independently
• Segmentation Offload Engines (a.k.a. Virtual MTU)
– Host processor “thinks” that the adapter supports large Jumbo frames, but the adapter splits it into regular sized (1500-byte) frames
– Supported by most HSE products because of its backward compatibility -> considered “regular” Ethernet
SC'13
25
TOE and iWARP Accelerators
• TCP Offload Engines (TOE)
– Hardware Acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Actually refers to the IC on the network adapter that implements
TCP/IP
– In practice, usually referred to as the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
– Standardized by IETF and the RDMA Consortium
– Support acceleration features (like IB) for Ethernet
• http://www.ietf.org & http://www.rdmaconsortium.org
SC'13
26
Converged (Enhanced) Ethernet (CEE or CE)
• Also known as “Datacenter Ethernet” or “Lossless Ethernet”
– Combines a number of optional Ethernet standards into one umbrella
as mandatory requirements
• Sample enhancements include:
– Priority-based flow-control: Link-level flow control for each Class of
Service (CoS)
– Enhanced Transmission Selection (ETS): Bandwidth assignment to
each CoS
– Datacenter Bridging Exchange Protocol (DCBX): Congestion notification, Priority classes
– End-to-end Congestion notification: Per flow congestion control to
supplement per link flow control
SC'13
27
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
SC'13
28
Interplay with I/O Technologies
• InfiniBand initially intended to replace I/O bus
technologies with networking-like technology
– That is, bit serial differential signaling
– With enhancements in I/O technologies that use a similar
architecture (HyperTransport, PCI Express), this has become
mostly irrelevant now
• Both IB and HSE today come as network adapters that plug
into existing I/O technologies
SC'13
29
Trends in I/O Interfaces with Servers
• Recent trends in I/O interfaces show that they are nearly
matching head-to-head with network speeds (though they
still lag a little bit)
PCI (1990): 33 MHz/32 bit: 1.05 Gbps (shared bidirectional)
PCI-X (1998 v1.0, 2003 v2.0): 133 MHz/64 bit: 8.5 Gbps; 266-533 MHz/64 bit: 17 Gbps (shared bidirectional)
AMD HyperTransport (HT) (2001 v1.0, 2004 v2.0, 2006 v3.0, 2008 v3.1): 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) (32 lanes)
PCI-Express (PCIe) by Intel (2003 Gen1, 2007 Gen2, 2009 Gen3 standard): Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
Intel QuickPath Interconnect (QPI) (2009): 153.6-204.8 Gbps (20 lanes)
SC'13
30
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
SC'13
31
IB, HSE and their Convergence
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed Ethernet Family
– Internet Wide Area RDMA Protocol (iWARP)
– Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect (VPI)
– (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
SC'13
32
Comparing InfiniBand with Traditional Networking Stack
Traditional Ethernet: Application Layer (HTTP, FTP, MPI, File Systems) over the Sockets Interface; Transport Layer (TCP, UDP); Network Layer (Routing); Link Layer (Flow-control and Error Detection); Physical Layer (Copper, Optical or Wireless); managed with DNS management tools
InfiniBand: Application Layer (MPI, PGAS, File Systems) over OpenFabrics Verbs; Transport Layer (RC - reliable, UD - unreliable); Network Layer (Routing); Link Layer (Flow-control, Error Detection); Physical Layer (Copper or Optical); managed with OpenSM (management tool)
SC'13
33
TCP/IP Stack and IPoIB
[Diagram: application/middleware over the Sockets interface; the kernel-space TCP/IP protocol runs either over an Ethernet driver and Ethernet adapter/switch (1/10/40 GigE) or over IPoIB on an InfiniBand adapter/switch]
SC'13
34
TCP/IP, IPoIB and Native IB Verbs
[Diagram: as above, plus the native path; applications can also use the Verbs interface directly, with RDMA in user space over the InfiniBand adapter and switch (IB Native), alongside TCP/IP over 1/10/40 GigE and over IPoIB]
SC'13
35
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
– Sockets Direct Protocol (SDP) stack
– RSockets Protocol Stack
SC'13
36
Components: Channel Adapters
• Used by processing and I/O units to connect to fabric
• Consume & generate IB packets
• Programmable DMA engines with protection features
• May have multiple ports
– Independent buffering channeled through Virtual Lanes
• Host Channel Adapters (HCAs)
[Diagram: a channel adapter containing QPs, a DMA engine, memory, MTP, transport and SMA blocks, and multiple ports, each port with its own set of Virtual Lanes (VLs)]
SC'13
37
Components: Switches and Routers
• Relay packets from a link to another
• Switches: intra-subnet
• Routers: inter-subnet
• May support multicast
[Diagram: a switch performs packet relay between its ports (each with Virtual Lanes); a router performs GRH-based packet relay between its ports]
SC'13
38
Components: Links & Repeaters
• Network Links
– Copper, Optical, Printed Circuit wiring on Back Plane
– Not directly addressable
• Traditional adapters built for copper cabling
– Restricted by cable length (signal integrity)
– For example, QDR copper cables are restricted to 7m
• Intel Connects: Optical cables with Copper-to-optical
conversion hubs (acquired by Emcore)
– Up to 100m length
– 550 picoseconds
copper-to-optical conversion latency
• Available from other vendors (Luxtera)
• Repeaters (Vol. 2 of InfiniBand specification)
SC'13
(Courtesy Intel)
39
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
– Sockets Direct Protocol (SDP) stack
– RSockets Protocol Stack
SC'13
40
IB Communication Model
[Figure: basic InfiniBand communication semantics]
SC'13
41
Two-sided Communication Model
[Diagram: two-sided (send/receive) exchange among processes P1, P2 and P3 through their HCAs; P2 posts receive buffers for P1 and P3, P1 and P3 post send buffers, the HCAs deliver the data to P2, and each process polls its HCA for completion]
SC'13
42
One-sided Communication Model
[Diagram: one-sided (RDMA write) exchange among P1, P2 and P3; after a global region creation step in which buffer information is exchanged, each process posts writes to its HCA and the HCA writes the data directly into the buffer at the target (P2 or P3) without involving the target process]
SC'13
43
Queue Pair Model
• Each QP has two queues
– Send Queue (SQ)
– Receive Queue (RQ)
– Work requests are queued to the QP (WQEs: “Wookies”)
• QP to be linked to a Completion Queue (CQ)
– Gives notification of operation completion from QPs
– Completed WQEs are placed in the CQ with additional information (CQEs: “Cookies”)
[Diagram: a QP (Send and Recv queues) and its CQ on the InfiniBand device; WQEs go into the QP, CQEs come out of the CQ]
(A minimal verbs sketch of creating a CQ and a QP follows below.)
SC'13
44
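The QP and CQ objects above map directly onto the OpenFabrics verbs API. The following is a minimal sketch (not from the tutorial) of creating a completion queue and a reliable-connection QP with libibverbs; it assumes a device context (ctx) and protection domain (pd) have already been opened, error handling is abbreviated, and queue depths are arbitrary. Compile with -libverbs.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                            struct ibv_cq **cq_out)
{
    /* One CQ used for both send and receive completions (CQEs) */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* depth */, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = {
            .max_send_wr  = 64,   /* max outstanding send WQEs */
            .max_recv_wr  = 64,   /* max outstanding recv WQEs */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,    /* Reliable Connection service */
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        ibv_destroy_cq(cq);
        return NULL;
    }
    *cq_out = cq;
    return qp;   /* the QP must still be transitioned INIT -> RTR -> RTS */
}
```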
Memory Registration
Before we do any communication: all memory used for communication must be registered
1. Registration request
• Send virtual address and length
2. Kernel handles virtual-to-physical mapping and pins region into physical memory
• Process cannot map memory that it does not own (security!)
3. HCA caches the virtual-to-physical mapping and issues a handle
• Includes an l_key and r_key
4. Handle is returned to application
[Diagram: process, kernel and HCA/RNIC interaction for registration steps 1-4]
(A minimal registration sketch follows below.)
SC'13
45
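A minimal sketch (not from the tutorial) of the registration step using libibverbs: ibv_reg_mr pins the pages and returns a memory region whose lkey/rkey correspond to the l_key/r_key described above. An existing protection domain (pd) is assumed. Compile with -libverbs.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* Registration request: virtual address + length + access flags.
     * The kernel pins the pages; the HCA caches the VA->PA mapping. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    /* mr->lkey is used in local work requests;
     * mr->rkey is handed to peers that will issue RDMA to this buffer. */
    return mr;
}
```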
Memory Protection
For security, keys are required for all operations that touch buffers
• To send or receive data the l_key must be provided to the HCA
– HCA verifies access to local memory
• For RDMA, initiator must have the r_key for the remote virtual address
– Possibly exchanged with a send/recv
– r_key is not encrypted in IB
[Diagram: the process supplies the l_key to its HCA/NIC; the r_key is needed for RDMA operations on the remote buffer]
SC'13
46
Communication in the Channel Semantics
(Send/Receive Model)
Processor is involved only to:
1. Post receive WQE
2. Post send WQE
3. Pull out completed CQEs from the CQ
Send WQE contains information about the send buffer (multiple non-contiguous segments)
Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data
[Diagram: sender and receiver memory segments, QPs (Send/Recv) and CQs on the two InfiniBand devices, with a hardware ACK between the devices]
(A minimal send/receive sketch follows below.)
SC'13
47
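A minimal sketch (not from the tutorial) of the channel semantics with libibverbs: the receiver pre-posts a receive WQE, the sender posts a send WQE, and each side polls its CQ for the CQE. A connected RC QP (qp), its CQ (cq) and a registered buffer (buf/mr) are assumed. Compile with -libverbs.

```c
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

int post_recv(struct ibv_qp *qp, void *buf, uint32_t len, struct ibv_mr *mr)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);     /* receive WQE */
}

int post_send(struct ibv_qp *qp, void *buf, uint32_t len, struct ibv_mr *mr)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);     /* send WQE */
}

/* Busy-poll the CQ until one completion (CQE) arrives. */
int wait_one_cqe(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```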
Communication in the Memory Semantics (RDMA Model)
Initiator processor is involved only to:
1. Post send WQE
2. Pull out completed CQE from the send CQ
No involvement from the target processor
Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
[Diagram: initiator and target memory segments, QPs and CQs on the two InfiniBand devices, with a hardware ACK between the devices]
(A minimal RDMA-write sketch follows below.)
SC'13
48
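A minimal sketch (not from the tutorial) of the memory semantics: the initiator posts one RDMA-write WQE that names both the local buffer (lkey) and the remote buffer (remote address + rkey); the target CPU is not involved. A connected RC QP and a remote address/rkey obtained out of band are assumed.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

int rdma_write(struct ibv_qp *qp, void *lbuf, uint32_t len, uint32_t lkey,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)lbuf, .length = len,
                           .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id      = 3,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,       /* one-sided write        */
        .send_flags = IBV_SEND_SIGNALED,       /* generate a local CQE   */
        .wr.rdma.remote_addr = remote_addr,    /* target virtual address */
        .wr.rdma.rkey        = rkey,           /* target r_key           */
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);       /* completion polled from send CQ */
}
```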
Communication in the Memory Semantics (Atomics)
Initiator processor is involved only to:
1. Post send WQE
2. Pull out completed CQE from the send CQ
No involvement from the target processor
Send WQE contains information about the send buffer (single 64-bit segment) and the receive buffer (single 64-bit segment)
IB supports compare-and-swap and fetch-and-add atomic operations
[Diagram: source and destination memory segments, QPs and CQs on the two InfiniBand devices; the operation (OP) is applied at the destination]
(A minimal fetch-and-add sketch follows below.)
SC'13
49
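A minimal sketch (not from the tutorial) of an IB atomic: a fetch-and-add on a 64-bit word at the target, with the original remote value returned into a local 8-byte buffer. A connected RC QP and a remote address/rkey that point at an 8-byte-aligned registered location are assumed.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

int fetch_and_add(struct ibv_qp *qp, uint64_t *local_result, uint32_t lkey,
                  uint64_t remote_addr, uint32_t rkey, uint64_t add_value)
{
    /* The 8-byte local segment receives the pre-add remote value. */
    struct ibv_sge sge = { .addr = (uintptr_t)local_result, .length = 8,
                           .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id      = 4,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.atomic.remote_addr = remote_addr,  /* 64-bit destination segment */
        .wr.atomic.rkey        = rkey,
        .wr.atomic.compare_add = add_value,    /* value to add */
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```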
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
– Sockets Direct Protocol (SDP) stack
– RSockets Protocol Stack
SC'13
50
Hardware Protocol Offload
[Figure: IB protocol layers; complete hardware implementations exist]
SC'13
51
Link/Network Layer Capabilities
• Buffering and Flow Control
• Virtual Lanes, Service Levels and QoS
• Switching and Multicast
• Network Fault Tolerance
• IB WAN Capability
SC'13
52
Buffering and Flow Control
• IB provides three levels of communication throttling/control mechanisms
– Link-level flow control (link layer feature)
– Message-level flow control (transport layer feature): discussed later
– Congestion control (part of the link layer features)
• IB provides an absolute credit-based flow-control
– Receiver guarantees that enough space is allotted for N blocks of data
– Occasional update of available credits by the receiver
• Has no relation to the number of messages, but only to the
total amount of data being sent
– One 1MB message is equivalent to 1024 1KB messages (except for
rounding off at message boundaries)
SC'13
53
Virtual Lanes
• Multiple virtual links within
same physical link
– Between 2 and 16
• Separate buffers and flow
control
– Avoids Head-of-Line
Blocking
• VL15: reserved for
management
• Each port supports one or
more data VL
SC'13
54
Service Levels and QoS
• Service Level (SL):
– Packets may operate at one of 16 different SLs
– Meaning not defined by IB
• SL to VL mapping:
– SL determines which VL on the next link is to be used
– Each port (switches, routers, end nodes) has a SL to VL mapping
table configured by the subnet management
• Partitions:
– Fabric administration (through Subnet Manager) may assign
specific SLs to different partitions to isolate traffic flows
SC'13
55
Traffic Segregation Benefits
• InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link
• Providing the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks
[Diagram: servers share one InfiniBand fabric, with separate Virtual Lanes carrying IPC/load balancing/web cache/ASP traffic, IP network traffic (routers, switches, VPNs, DSLAMs), and storage area network traffic (RAID, NAS, backup)]
(Courtesy: Mellanox Technologies)
SC'13
56
Switching (Layer-2 Routing) and Multicast
• Each port has one or more associated LIDs (Local Identifiers)
– Switches look up which port to forward a packet to based on its
destination LID (DLID)
– This information is maintained at the switch
• For multicast packets, the switch needs to maintain multiple
output ports to forward the packet to
– Packet is replicated to each appropriate output port
– Ensures at-most once delivery & loop-free forwarding
– There is an interface for a group management protocol
• Create, join/leave, prune, delete group
SC'13
57
Switch Complex
• Basic unit of switching is a crossbar
– Current InfiniBand products use either 24-port (DDR) or 36-port
(QDR and FDR) crossbars
• Switches available in the market are typically collections of
crossbars within a single cabinet
• Do not confuse “non-blocking switches” with “crossbars”
– Crossbars provide all-to-all connectivity to all connected nodes
• For any random node pair selection, all communication is non-blocking
– Non-blocking switches provide a fat-tree of many crossbars
• For any random node pair selection, there exists a switch
configuration such that communication is non-blocking
• If the communication pattern changes, the same switch configuration
might no longer provide fully non-blocking communication
SC'13
58
IB Switching/Routing: An Example
[Diagram: an example IB switch block diagram (Mellanox 144-port) with spine blocks and leaf blocks; a forwarding table at each switch maps DLIDs to output ports (e.g., DLID 2 -> port 1, DLID 4 -> port 4)]
• Switching: IB supports Virtual Cut Through (VCT)
• Routing: unspecified by the IB spec; Up*/Down* and Shift are popular routing engines supported by OFED
• Someone has to set up the forwarding tables and give every port an LID
– The “Subnet Manager” does this work
• Different routing algorithms give different paths
• Fat-Tree is a popular topology for IB clusters
– Different over-subscription ratios may be used
• Other topologies
– 3D Torus (Sandia Red Sky, SDSC Gordon) and SGI Altix (Hypercube)
– 10D Hypercube (NASA Pleiades)
SC'13
59
More on Multipathing
• Similar to basic switching, except…
– … sender can utilize multiple LIDs associated to the same
destination port
• Packets sent to one DLID take a fixed path
• Different packets can be sent using different DLIDs
• Each DLID can have a different path (switch can be configured
differently for each DLID)
• Can cause out-of-order arrival of packets
– IB uses a simplistic approach:
• If packets in one connection arrive out-of-order, they are dropped
– Easier to use different DLIDs for different connections
• This is what most high-level libraries using IB do!
SC'13
60
IB Multicast Example
SC'13
61
Network Level Fault Tolerance: Automatic Path Migration
• Automatically utilizes multipathing for network fault-tolerance (optional feature)
• Idea is that the high-level library (or application) using IB will
have one primary path, and one fall-back path
– Enables migrating connections to a different path
• Connection recovery in the case of failures
• Available for RC, UC, and RD
• Reliability guarantees for service type maintained during
migration
• Issue is that there is only one fall-back path (in hardware). If
there is more than one failure (or a failure that affects both
paths), the application will have to handle this in software
SC'13
62
IB WAN Capability
• Getting increased attention for:
– Remote Storage, Remote Visualization
– Cluster Aggregation (Cluster-of-clusters)
• IB-Optical switches by multiple vendors
– Mellanox Technologies: www.mellanox.com
– Obsidian Research Corporation: www.obsidianresearch.com & Bay
Microsystems: www.baymicrosystems.com
• Layer-1 changes from copper to optical; everything else stays the same
– Low-latency copper-optical-copper conversion
• Large link-level buffers for flow-control
– Data messages do not have to wait for round-trip hops
– Important in the wide-area network
SC'13
63
Hardware Protocol Offload
[Figure: IB protocol layers; complete hardware implementations exist]
SC'13
64
IB Transport Services
Service Type | Connection Oriented | Acknowledged | Transport
Reliable Connection | Yes | Yes | IBA
Unreliable Connection | Yes | No | IBA
Reliable Datagram | No | Yes | IBA
Unreliable Datagram | No | No | IBA
RAW Datagram | No | No | Raw
• Each transport service can have zero or more QPs
associated with it
– E.g., you can have four QPs based on RC and one QP based on UD
SC'13
65
Trade-offs in Different Transport Types
Trade-offs compared across Reliable Connection (RC), Reliable Datagram (RD), eXtended Reliable Connection (XRC), Unreliable Connection (UC), Unreliable Datagram (UD) and Raw Datagram: scalability, reliability (corrupt data detection, data delivery/order/loss guarantees, error recovery) and one source to multiple destinations
• Scalability (M processes, N nodes): RC and UC need M^2N QPs per HCA; XRC needs MN QPs per HCA; RD and UD need M QPs per HCA; Raw Datagram needs 1 QP per HCA
• Corrupt data detected: Yes for all types
• Data Delivery Guarantee: data delivered exactly once for the reliable types; no guarantees for UC, UD and Raw Datagram
• Data Order Guarantees: per connection for the connection-oriented types; unordered with duplicate data detected for UD; none for Raw Datagram
• Data Loss Detected: Yes for the reliable types; No for the others
• Error Recovery: errors (retransmissions, alternate path, etc.) handled by the transport layer for the reliable types, with the client only involved in handling fatal errors (links broken, protection violation, etc.); packets with errors and sequence errors are reported to the responder for UC; none for UD and Raw Datagram
SC'13
66
Transport Layer Capabilities
• Data Segmentation
• Transaction Ordering
• Message-level Flow Control
• Static Rate Control and Auto-negotiation
SC'13
67
Data Segmentation
• IB transport layer provides a message-level communication
granularity, not byte-level (unlike TCP)
• Application can hand over a large message
– Network adapter segments it to MTU sized packets
– Single notification when the entire message is transmitted or
received (not per packet)
• Reduced host overhead to send/receive messages
– Depends on the number of messages, not the number of bytes
SC'13
68
Transaction Ordering
• IB follows a strong transaction ordering for RC
• Sender network adapter transmits messages in the order
in which WQEs were posted
• Each QP utilizes a single LID
– All WQEs posted on same QP take the same path
– All packets are received by the receiver in the same order
– All receive WQEs are completed in the order in which they were
posted
SC'13
69
Message-level Flow-Control
• Also called End-to-end Flow-control
– Does not depend on the number of network hops
• Separate from Link-level Flow-Control
– Link-level flow-control only relies on the number of bytes being
transmitted, not the number of messages
– Message-level flow-control only relies on the number of messages
transferred, not the number of bytes
• If 5 receive WQEs are posted, the sender can send 5
messages (can post 5 send WQEs)
– If the sent messages are larger than the posted receive buffers, flow-control cannot handle it
SC'13
70
Static Rate Control and Auto-Negotiation
• IB allows link rates to be statically changed
– On a 4X link, we can set data to be sent at 1X
– For heterogeneous links, rate can be set to the lowest link rate
– Useful for low-priority traffic
• Auto-negotiation also available
– E.g., if you connect a 4X adapter to a 1X switch, data is
automatically sent at 1X rate
• Only fixed settings available
– Cannot set rate requirement to 3.16 Gbps, for example
SC'13
71
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
– Sockets Direct Protocol (SDP) Stack
– RSockets Protocol Stack
SC'13
72
Concepts in IB Management
• Agents
– Processes or hardware units running on each adapter, switch,
router (everything on the network)
– Provide capability to query and set parameters
• Managers
– Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
– Used for interactions between the manager and agents (or
between agents)
• Messages
SC'13
73
Subnet Manager
[Diagram: a subnet manager discovers and configures the fabric; it activates inactive links, performs multicast setup at the switches, and processes multicast join requests from compute nodes]
SC'13
74
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
– Sockets Direct Protocol (SDP) Stack
– RSockets Protocol Stack
SC'13
75
IPoIB vs. SDP Architectural Models
Traditional model: Sockets App -> Sockets API -> (kernel) TCP/IP Sockets Provider -> TCP/IP Transport Driver -> IPoIB Driver -> InfiniBand CA
Possible SDP model: Sockets Application -> Sockets API -> either the kernel TCP/IP path above, or the Sockets Direct Protocol with kernel bypass and RDMA semantics -> InfiniBand CA
(Source: InfiniBand Trade Association)
SC'13
76
RSockets Overview
• Implements various socket-like functions
– Functions take the same parameters as sockets
• Can switch between regular Sockets and RSockets using LD_PRELOAD
[Diagram: applications/middleware reach the RSockets library either directly or through the Sockets API via LD_PRELOAD; RSockets is built on RDMA_CM and Verbs]
(A minimal rsockets client sketch follows below.)
SC'13
77
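A minimal sketch (not from the tutorial) of a client calling the rsockets API from librdmacm directly, instead of intercepting regular sockets with LD_PRELOAD; the calls mirror the BSD socket API with an r prefix. The server address and port below are placeholders, and librdmacm built with rsocket support is assumed. Compile with -lrdmacm.

```c
#include <stdio.h>
#include <arpa/inet.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(7471) };   /* example port */
    inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr);          /* example address */

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);                 /* like socket()  */
    if (fd < 0 || rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("rsockets");
        return 1;
    }

    const char msg[] = "hello over RDMA";
    rsend(fd, msg, sizeof(msg), 0);                            /* like send() */

    char reply[64];
    int n = rrecv(fd, reply, sizeof(reply), 0);                /* like recv() */
    printf("received %d bytes\n", n);
    rclose(fd);
    return 0;
}
```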
TCP/IP, IPoIB, Native IB Verbs, SDP and RSockets
[Diagram: application/middleware over the Sockets or Verbs interface; protocol options are TCP/IP over Ethernet (1/10/40 GigE), TCP/IP over IPoIB, RSockets (user space), SDP (kernel space), and native IB verbs (RDMA, user space), each over the corresponding Ethernet or InfiniBand adapter and switch]
SC'13
78
IB, HSE and their Convergence
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed Ethernet Family
– Internet Wide Area RDMA Protocol (iWARP)
– Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect (VPI)
– RDMA over Converged Enhanced Ethernet (RoCE)
SC'13
79
HSE Overview
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
SC'13
80
IB and 10GE RDMA Models: Commonalities and
Differences
Features | IB | iWARP/HSE
Hardware Acceleration | Supported | Supported
RDMA | Supported | Supported
Atomic Operations | Supported | Not supported
Multicast | Supported | Supported
Congestion Control | Supported | Supported
Data Placement | Ordered | Out-of-order
Data Rate-control | Static and Coarse-grained | Dynamic and Fine-grained
QoS | Prioritization | Prioritization and Fixed Bandwidth QoS
Multipathing | Using DLIDs | Using VLANs
SC'13
81
iWARP Architecture and Components
[Diagram: iWARP offload engines; a user-level application or library sits on top of hardware implementing RDMAP, RDDP and MPA over SCTP or TCP over IP, with a device driver and a network adapter (e.g., 10GigE)]
• RDMA Protocol (RDMAP)
– Feature-rich interface
– Security Management
• Remote Direct Data Placement (RDDP)
– Data Placement and Delivery
– Multi Stream Semantics
– Connection Management
• Marker PDU Aligned (MPA)
– Middle Box Fragmentation
– Data Integrity (CRC)
(Courtesy iWARP Specification)
SC'13
82
HSE Overview
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
SC'13
83
Decoupled Data Placement and Data Delivery
• Place data as it arrives, whether in or out-of-order
• If data is out-of-order, place it at the appropriate offset
• Issues from the application’s perspective:
– The second half of the message having been placed does not mean that the first half of the message has arrived as well
– If one message has been placed, it does not mean that the previous messages have been placed
• Issues from protocol stack’s perspective
– The receiver network stack has to understand each frame of data
• If the frame is unchanged during transmission, this is easy!
– The MPA protocol layer adds appropriate information at regular
intervals to allow the receiver to identify fragmented frames
SC'13
84
Dynamic and Fine-grained Rate Control
• Part of the Ethernet standard, not iWARP
– Network vendors use a separate interface to support it
• Dynamic bandwidth allocation to flows based on interval
between two packets in a flow
– E.g., one stall for every packet sent on a 10 Gbps network refers to
a bandwidth allocation of 5 Gbps
– Complicated because of TCP windowing behavior
• Important for high-latency/high-bandwidth networks
– Large windows exposed on the receiver side
– Receiver overflow controlled through rate control
SC'13
85
Prioritization and Fixed Bandwidth QoS
• Can allow for simple prioritization:
– E.g., connection 1 performs better than connection 2
– 8 classes provided (a connection can be in any class)
• Similar to SLs in InfiniBand
– Two priority classes for high-priority traffic
• E.g., management traffic or your favorite application
• Or can allow for specific bandwidth requests:
– E.g., can request for 3.62 Gbps bandwidth
– Packet pacing and stalls used to achieve this
• Query functionality to find out “remaining bandwidth”
SC'13
86
HSE Overview
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
SC'13
87
Current Usage of Ethernet
[Diagram: regular Ethernet and TOE clusters in system area network / cluster environments and iWARP clusters in distributed cluster environments, interconnected over a wide area network running regular Ethernet]
SC'13
88
Different iWARP Implementations
[Diagram: four implementation stacks under the application;
• Regular Ethernet adapters: sockets over host TCP/IP and a plain device driver
• Software iWARP (user-level and kernel-level; OSU, OSC, IBM, ANL): high-performance sockets over user-level iWARP or kernel-level software iWARP, with TCP modified with MPA, over regular Ethernet adapters
• TCP Offload Engines: sockets / high-performance sockets over TCP and IP offloaded to the adapter
• iWARP-compliant adapters (Chelsio, NetEffect/Intel): offloaded iWARP, offloaded TCP and offloaded IP on the adapter]
SC'13
89
iWARP and TOE
[Diagram: protocol and adapter options including hardware-offloaded TCP/IP; application/middleware over Sockets or Verbs, with paths for TCP/IP over 1/10/40 GigE, TCP/IP over IPoIB, 10/40 GigE TOE (hardware offload on the Ethernet adapter), RSockets, SDP, iWARP (user-space RDMA over an iWARP adapter and Ethernet switch), and native IB verbs over InfiniBand adapters and switches]
SC'13
90
HSE Overview
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stack
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
• Emulex FastStack DBL (for OneConnect OCe12000-D 10GE adapters)
SC'13
91
Myrinet Express (MX)
• Proprietary communication layer developed by Myricom for
their Myrinet adapters
– Third generation communication layer (after FM and GM)
– Supports Myrinet-2000 and the newer Myri-10G adapters
• Low-level “MPI-like” messaging layer
– Almost one-to-one match with MPI semantics (including connectionless model, implicit memory registration and tag matching)
– Later versions added some more advanced communication methods such as RDMA to support other programming models such as ARMCI (low-level runtime for the Global Arrays PGAS library)
• Open-MX
– New open-source implementation of the MX interface for non-Myrinet adapters from INRIA, France
SC'13
92
Datagram Bypass Layer (DBL)
• Another proprietary communication layer developed by
Myricom
– Compatible with regular UDP sockets (embraces and extends)
– Idea is to bypass the kernel stack and give UDP applications direct
access to the network adapter
• High performance and low-jitter
• Primary motivation: Financial market applications (e.g.,
stock market)
– Applications prefer unreliable communication
– Timeliness is more important than reliability
• This stack is covered by NDA; more details can be
requested from Myricom
SC'13
93
Solarflare Communications: OpenOnload Stack
• HPC networking stacks provide many performance benefits, but have limitations for certain types of scenarios, especially where applications tend to fork(), exec() and need asynchronous advancement (per application)
• Solarflare approach:
– Network hardware provides a user-safe interface to route packets directly to apps based on flow information in headers
– Protocol processing can happen in both kernel and user space
– Protocol state shared between app and kernel using shared memory
[Figures: typical commodity networking stack, typical HPC networking stack, and the Solarflare approach to the networking stack]
Courtesy Solarflare Communications (www.openonload.org/openonload-google-talk.pdf)
SC'13
94
FastStack DBL
• Proprietary communication layer developed by Emulex
– Compatible with regular UDP and TCP sockets
– Idea is to bypass the kernel stack
• High performance, low-jitter and low latency
– Available in multiple modes
• Transparent Acceleration (TA)
– Accelerate existing sockets applications for UDP/TCP
• DBL API
– UDP-only, socket-like semantics but requires application changes
• Primary motivation: Financial market applications (e.g., stock market)
– Applications prefer unreliable communication
– Timeliness is more important than reliability
• This stack is covered by NDA; more details can be requested from
Emulex
SC'13
95
IB, HSE and their Convergence
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed Ethernet Family
– Internet Wide Area RDMA Protocol (iWARP)
– Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect (VPI)
– RDMA over Converged Enhanced Ethernet (RoCE)
SC'13
96
Virtual Protocol Interconnect (VPI)
• Single network firmware to support both IB and Ethernet
• Autosensing of layer-2 protocol
– Can be configured to automatically work with either IB or Ethernet networks
• Multi-port adapters can use one port on IB and another on Ethernet
• Multiple use modes:
– Datacenters with IB inside the cluster and Ethernet outside
– Clusters with IB network and Ethernet management
[Diagram: applications use IB Verbs over the IB transport, network and link layers to an IB port, or Sockets over TCP/IP (with TCP/IP support in hardware) over an Ethernet link layer to an Ethernet port]
SC'13
97
RDMA over Converged Enhanced Ethernet (RoCE)
• Takes advantage of IB and Ethernet
– Software written with IB Verbs
– Link layer is Converged (Enhanced) Ethernet (CE)
• Pros:
– Works natively in Ethernet environments (entire Ethernet management ecosystem is available)
– Has all the benefits of IB verbs
– CE is very similar to the link layer of native IB, so there are no missing features
• Cons:
– Network bandwidth might be limited to Ethernet switches: 10/40GE switches available; 56 Gbps IB is available
[Stack: Application -> IB Verbs -> IB Transport -> IB Network -> CE link layer, implemented in hardware]
(A minimal link-layer query sketch follows below.)
SC'13
98
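A minimal sketch (not from the tutorial) showing that the same verbs program can discover whether a given port runs native IB or RoCE-style Ethernet by querying its link layer; it assumes an already-opened device context (ctx) and a libibverbs version that reports the port link layer. Compile with -libverbs.

```c
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

void print_link_layer(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr attr;
    if (ibv_query_port(ctx, port, &attr))      /* fills per-port attributes */
        return;
    printf("port %u link layer: %s\n", port,
           attr.link_layer == IBV_LINK_LAYER_ETHERNET   ? "Ethernet (RoCE)" :
           attr.link_layer == IBV_LINK_LAYER_INFINIBAND ? "InfiniBand"      :
                                                          "unspecified");
}
```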
All interconnects and protocols including RoCE
[Diagram: the complete set of options; application/middleware over Sockets or Verbs, with paths for TCP/IP over 1/10/40 GigE Ethernet, TCP/IP over IPoIB, 10/40 GigE TOE (hardware offload), RSockets, SDP, iWARP (user-space RDMA over an iWARP adapter and Ethernet switch), RoCE (user-space RDMA over a RoCE adapter and Ethernet switch), and native IB verbs over InfiniBand adapters and switches]
SC'13
99
IB and HSE: Feature Comparison
Features | IB | iWARP/HSE | RoCE
Hardware Acceleration | Yes | Yes | Yes
RDMA | Yes | Yes | Yes
Congestion Control | Yes | Optional | Yes
Multipathing | Yes | Yes | Yes
Atomic Operations | Yes | No | Yes
Multicast | Optional | No | Optional
Data Placement | Ordered | Out-of-order | Ordered
Prioritization | Optional | Optional | Yes
Fixed BW QoS (ETS) | No | Optional | Yes
Ethernet Compatibility | No | Yes | Yes
TCP/IP Compatibility | Yes (using IPoIB) | Yes | Yes (using IPoIB)
SC'13
100
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
SC'13
101
IB Hardware Products
• Many IB vendors: Mellanox+Voltaire and Qlogic (acquired by Intel)
– Aligned with many server vendors: Intel, IBM, Oracle, Dell
– And many integrators: Appro, Advanced Clustering, Microway
• Broadly two kinds of adapters
– Offloading (Mellanox) and Onloading (Qlogic)
• Adapters with different interfaces:
– Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0, PCI 3.0 and HT
• MemFree Adapter
– No memory on HCA -> uses system memory (through PCIe)
– Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
– SDR (8 Gbps), DDR (16 Gbps), QDR (32 Gbps), FDR (56 Gbps), Dual-FDR (100Gbps)
• ConnectX-2, ConnectX-3 and ConnectIB adapters from Mellanox support offload for collectives (Barrier, Broadcast, etc.)
SC'13
102
Tyan Thunder S2935 Board
(Courtesy Tyan)
Similar boards from Supermicro with LOM features are also available
SC'13
103
IB Hardware Products (contd.)
•
Switches:
– 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
– 3456-port “Magnum” switch from SUN -> used at TACC
• 72-port “nano magnum”
– 36-port Mellanox InfiniScale IV QDR switch silicon in 2008
• Up to 648-port QDR switch by Mellanox and SUN
• Some internal ports are 96 Gbps (12X QDR)
– IB switch silicon from Qlogic (Intel)
• Up to 846-port QDR switch by Qlogic
– FDR (54.6 Gbps) switch silicon (Bridge-X) and associated switches (18-648
ports) are available
– Switch-X-2 silicon from Mellanox with VPI and SDN (Software Defined
Networking) support announced in Oct ‘12
•
Switch Routers with Gateways
– IB-to-FC; IB-to-IP
SC'13
104
10G, 40G and 100G Ethernet Products
• 10GE adapters: Intel, Intilop, Myricom, Emulex, Mellanox (ConnectX)
• 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)
• 40GE adapters: Mellanox ConnectX3-EN 40G, Chelsio (T5 2x40 GigE)
• 10GE switches
– Fulcrum Microsystems (acquired by Intel recently)
• Low latency switch based on 24-port silicon
• FM4000 switch with IP routing, and TCP/UDP support
– Arista, Brocade, Cisco, Extreme, Force10, Fujitsu, Juniper, Gnodal and
Myricom
• 40GE and 100GE switches
– Gnodal, Arista, Brocade and Mellanox 40GE (SX series)
– Broadcom has switch architectures for 10/40/100GE
– Nortel Networks
• 10GE downlinks with 40GE and 100GE uplinks
SC'13
105
Products Providing IB and HSE Convergence
• Mellanox ConnectX Adapter
• Supports IB and HSE convergence
• Ports can be configured to support IB or HSE
• Support for VPI and RoCE
– 8 Gbps (SDR), 16Gbps (DDR), 32Gbps (QDR) and 54.6 Gbps (FDR)
rates available for IB
– 10GE and 40GE rates available for RoCE
SC'13
106
Software Convergence with OpenFabrics
• Open source organization (formerly OpenIB)
– www.openfabrics.org
• Incorporates both IB and iWARP in a unified manner
– Support for Linux and Windows
• Users can download the entire stack and run
– Latest release is OFED 3.5
• New naming convention to get aligned with Linux Kernel
Development
SC'13
107
OpenFabrics Stack with Unified Verbs Interface
[Diagram: the user-level Verbs interface (libibverbs) sits on top of vendor user-space libraries (Mellanox libmthca, QLogic libipathverbs, IBM libehca, Chelsio libcxgb3), which talk to the corresponding kernel modules (ib_mthca, ib_ipath, ib_ehca, ib_cxgb3) and adapters]
(A minimal device-enumeration sketch follows below.)
SC'13
108
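A minimal sketch (not from the tutorial) illustrating the unified verbs interface: one program can enumerate and open any supported RDMA device (Mellanox, QLogic, Chelsio iWARP, ...) without vendor-specific code. Compile with -libverbs.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list)
        return 1;

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;
        struct ibv_device_attr attr;
        if (!ibv_query_device(ctx, &attr))     /* generic device attributes */
            printf("%s: %d port(s), max_qp=%d\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```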
OpenFabrics on Convergent IB/HSE
• For IBoE and RoCE, the upper-level stacks remain completely unchanged
• Within the hardware:
– Transport and network layers remain completely unchanged
– Both IB and Ethernet (or CEE) link layers are supported on the network adapter
• Note: The OpenFabrics stack is not valid for the Ethernet path in VPI
– That still uses sockets and TCP/IP
[Diagram: Verbs interface (libibverbs) over the ConnectX user library (libmlx4) and kernel module (ib_mlx4), driving ConnectX adapters with both HSE and IB ports]
SC'13
109
OpenFabrics Software Stack
[Diagram: OpenFabrics software stack; application-level access through diag tools, Open SM, IP-based app access, sockets-based access, block storage access, clustered DB access, access to file systems and various MPIs; user-space APIs (user-level MAD API, OpenFabrics user-level Verbs/API, SDP lib, UDAPL) with kernel-bypass paths; kernel mid-layer with connection manager abstraction (CMA), connection managers, SA client, MAD, SMA and upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems) over hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs]
Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC
SC'13
110
Trends of Networking Technologies in TOP500 Systems
[Charts: Interconnect Family – Systems Share and Interconnect Family – Performance Share in the Top500 over time]
SC'13
111
InfiniBand in the Top500 (June 2013)
[Pie charts: June 2013 Top500 interconnect share by performance and by number of systems; categories: InfiniBand, Gigabit Ethernet, Custom Interconnect, Proprietary Network, Cray Interconnect, Myrinet, Fat Tree; InfiniBand accounts for 41% of systems]
SC'13
112
Large-scale InfiniBand Installations
• 205 IB Clusters (41%) in the June 2013 Top500 list
(http://www.top500.org)
• Installations in the Top 40 (18 systems)
462,462 cores (Stampede) at TACC (6th)
147,456 cores (SuperMUC) in Germany (7th)
110,400 cores (Pangea) at France/Total (11th)
73,584 cores (Spirit) at USA/Air Force (14th)
77,184 cores (Curie thin nodes) at France/CEA (15th)
120,640 cores (Nebulae) at China/NSCS (16th)
72,288 cores (Yellowstone) at NCAR (17th)
125,980 cores (Pleiades) at NASA/Ames (19th)
70,560 cores (Helios) at Japan/IFERC (20th)
73,278 cores (Tsubame 2.0) at Japan/GSIC (21st)
138,368 cores (Tera-100) at France/CEA (25th)
53,504 cores (PRIMERGY) at Australia/NCI (27th)
77,520 cores (Conte) at Purdue University (28th)
48,896 cores (MareNostrum) at Spain/BSC (29th)
78,660 cores (Lomonosov) in Russia (31st)
137,200 cores (Sunway Blue Light) in China (33rd)
46,208 cores (Zin) at LLNL (34th)
38,016 cores at India/IITM (36th)
More are getting installed!
SC'13
113
HSE Scientific Computing Installations
• HSE compute systems with ranking in the Jun ’13 Top500 list
– 42,848-core installation in United States (#49) – new
– 32,256-core installation in United States (#59)
– 43,264-core installation in United States (#64) – new
– 25,568-core installation in United States (#73)
– 19,712-core installation in United States (#104) – new
– 25,856-core installation in United States (#112) – new
– 18,440-core installation in United States (#114) – new
– 16,872-core installation in United States (#120) – new
– 17,024-core installation at the Amazon EC2 Cluster (#127)
– 16,064-core installation in the United States (#136) – new
– 15,360-core installation in United States (#143)
– 13,872-core installation in United States (#151) – new
– 14,272-core installation in United States (#154)
– 13,568-core installation in United States (#157) – new
– 13,184-core installation in United States (#159) – new
– 13,168-core installation in United States (#160) – new
– 15,504-core installation in United States (#165) – new, and more …
• Integrated Systems
– BG/P uses 10GE for I/O (ranks 58, 66, 164, 359 in the Top 500)
SC'13
114
Other HSE Installations
• HSE has most of its popularity in enterprise computing and
other non-scientific markets including Wide-area
networking
• Example Enterprise Computing Domains
– Enterprise Datacenters (HP, Intel)
– Animation firms (e.g., Universal Studios (“The Hulk”), 20th Century
Fox (“Avatar”), and many new movies using 10GE)
– Amazon’s HPC cloud offering uses 10GE internally
– Heavily used in financial markets (users are typically undisclosed)
• Many Network-attached Storage devices come integrated
with 10GE network adapters
• ESnet is installing 100GE infrastructure for US DOE
SC'13
115
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
SC'13
116
Case Studies
• Low-level Performance
• Message Passing Interface (MPI)
SC'13
117
Low-level Latency Measurements
[Charts: verbs-level latency for small and large messages, IB-FDR (56 Gbps) vs. RoCE (40 Gbps); small-message latencies are 0.87 us and 0.97 us]
ConnectX-3 FDR (54 Gbps): 2.6 GHz Octa-core (SandyBridge) Intel with IB (FDR) switches
ConnectX-3 EN (40 GigE): 2.6 GHz Octa-core (SandyBridge) Intel with 40GE switches
SC'13
118
Low-level Uni-directional Bandwidth Measurements
[Chart: verbs-level uni-directional bandwidth; IB-FDR (56 Gbps) peaks at 6311 MBytes/sec and RoCE (40 Gbps) at 4536 MBytes/sec]
ConnectX-3 FDR (54 Gbps): 2.6 GHz Octa-core (SandyBridge) Intel with IB (FDR) switches
ConnectX-3 EN (40 GigE): 2.6 GHz Octa-core (SandyBridge) Intel with 40GE switches
SC'13
119
Low-level Latency Measurements
[Charts: latency for small and large messages over IB-FDR; at small message sizes, Sockets (over IPoIB): 15.50 us, RSockets: 1.03 us, IB verbs: 0.87 us]
ConnectX-3 FDR (54 Gbps): 2.6 GHz Octa-core (SandyBridge) Intel with IB (FDR) switches
SC'13
120
Low-level Uni-directional Bandwidth Measurements
[Chart: uni-directional bandwidth over IB-FDR; RSockets and IB verbs both reach about 6000 MBytes/sec (6029 and 6012), while Sockets (over IPoIB) reach 1312 MBytes/sec]
ConnectX-3 FDR (54 Gbps): 2.6 GHz Octa-core (SandyBridge) Intel with IB (FDR) switches
SC'13
121
Case Studies
• Low-level Performance
• Message Passing Interface (MPI)
SC'13
122
MVAPICH2/MVAPICH2-X Software
•
MPI(+X) continues to be the predominant programming model in HPC
•
High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, and RDMA
over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,070 organizations (HPC Centers, Industry and Universities) in 70
countries
– More than 182,000 downloads from OSU site directly
– Empowering many TOP500 clusters
•
6th ranked 462,462-core cluster (Stampede) at TACC
• 19th ranked 125,980-core cluster (Pleiades) at NASA
• 21st ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology and many others
– Available with software stacks of many IB, HSE, and server vendors including
Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
•
Partner in the U.S. NSF-TACC Stampede System
SC'13
123
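A minimal sketch (not from the tutorial) of an osu_latency-style ping-pong between two MPI ranks, the kind of microbenchmark behind the latency numbers on the following slides; it works with any MPI library (e.g., MVAPICH2) and uses an arbitrary iteration count and message size. Compile with mpicc and run with mpirun -np 2.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iters = 10000, size = 1;     /* 1-byte messages */
    char buf[1] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2 * iters));
    MPI_Finalize();
    return 0;
}
```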
One-way Latency: MPI over IB
[Charts: MPI small- and large-message latency over IB for MVAPICH/MVAPICH2 on Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR and Mellanox ConnectIB-DualFDR; small-message latencies range from 0.99 us to 1.82 us across the configurations]
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
SC'13
124
Bandwidth: MPI over IB
[Charts: MPI uni-directional and bi-directional bandwidth over IB for the same configurations; ConnectIB-DualFDR peaks at 12485 MBytes/sec (uni-directional) and 21025 MBytes/sec (bi-directional), with the other configurations ranging from about 1706 to 6343 MBytes/sec (uni-directional) and 3341 to 11643 MBytes/sec (bi-directional)]
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
SC'13
125
One-way Latency: MPI over iWARP
[Chart: MPI one-way latency over 10 GigE for Chelsio T4 and Intel-NetEffect NE20; small-message latencies in TCP/IP mode are 11.32 and 13.44 us, and in iWARP mode 4.64 and 5.59 us]
SC'13
2.6 GHz Dual Eight-core (SandyBridge) Intel
Chelsio T4 cards connected through Fujitsu xg2600 10GigE switch
Intel NetEffect cards connected through Fulcrum 10GigE switch
126
Bandwidth: MPI over iWARP
[Chart: MPI uni-directional bandwidth over 10 GigE; all four configurations (Chelsio T4 and Intel-NetEffect NE20, TCP/IP and iWARP) reach 1168-1181 MBytes/sec]
SC'13
2.6 GHz Dual Eight-core (SandyBridge) Intel
Chelsio T4 cards connected through Fujitsu xg2600 10GigE switch
Intel NetEffect cards connected through Fulcrum 10GigE switch
127
Convergent Technologies: MPI Latency
[Chart: MPI one-way latency; IB-FDR (56 Gbps) and RoCE (40 Gbps) achieve small-message latencies of 1.13 and 1.24 us]
ConnectX-3 FDR (54 Gbps): 2.6 GHz Octa-core (SandyBridge) Intel with IB (FDR) switches
ConnectX-3 EN (40 GigE): 2.6 GHz Octa-core (SandyBridge) Intel with 40GE switches
SC'13
128
Convergent Technologies:
MPI Uni- and Bi-directional Bandwidth
[Charts: MPI uni-directional bandwidth; IB-FDR (56 Gbps): 6324 MBytes/sec, RoCE (40 Gbps): 4532 MBytes/sec. Bi-directional bandwidth; IB-FDR: 10978 MBytes/sec, RoCE: 9041 MBytes/sec]
ConnectX-3 FDR (54 Gbps): 2.6 GHz Octa-core (SandyBridge) Intel with IB (FDR) switches
ConnectX-3 EN (40 GigE): 2.6 GHz Octa-core (SandyBridge) Intel with 40GE switches
SC'13
129
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
SC'13
130
Concluding Remarks
• Presented network architectures & trends in Clusters
• Presented background and details of IB and HSE
– Highlighted the main features of IB and HSE and their convergence
– Gave an overview of IB and HSE hardware/software products
– Discussed sample performance numbers in designing various high-end systems with IB and HSE
• IB and HSE are emerging as new architectures leading to a
new generation of networked computing systems, opening
many research issues needing novel solutions
SC'13
131
Funding Acknowledgments
Funding Support by
Equipment Support by
SC'13
132
Personnel Acknowledgments
Current Post-Docs: X. Lu, M. Luo, K. Hamidouche
Current Programmers: M. Arnold, D. Bureddy, J. Perkins
Current Students: M. Rahman (Ph.D.), R. Shir (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), M. Li (Ph.D.), S. Potluri (Ph.D.), R. Rajachandrasekhar (Ph.D.)
Current Senior Research Associate: H. Subramoni
Past Students: P. Balaji (Ph.D.), W. Huang (Ph.D.), M. Luo (Ph.D.), G. Santhanaraman (Ph.D.), D. Buntinas (Ph.D.), W. Jiang (M.S.), A. Mamidala (Ph.D.), A. Singh (Ph.D.), S. Bhagvat (M.S.), S. Kini (M.S.), G. Marsh (M.S.), J. Sridhar (M.S.), L. Chai (Ph.D.), M. Koop (Ph.D.), V. Meshram (M.S.), S. Sur (Ph.D.), B. Chandrasekharan (M.S.), R. Kumar (M.S.), S. Naravula (Ph.D.), H. Subramoni (Ph.D.), N. Dandapanthula (M.S.), S. Krishnamoorthy (M.S.), R. Noronha (Ph.D.), K. Vaidyanathan (Ph.D.), V. Dhanraj (M.S.), K. Kandalla (Ph.D.), X. Ouyang (Ph.D.), A. Vishnu (Ph.D.), T. Gangadharappa (M.S.), P. Lai (M.S.), S. Pai (M.S.), J. Wu (Ph.D.), K. Gopalakrishnan (M.S.), J. Liu (Ph.D.), W. Yu (Ph.D.)
Past Post-Docs: H. Wang, E. Mancini, X. Besseron, S. Marcarelli, H.-W. Jin, J. Vienne
Past Research Scientist: S. Sur
SC'13
133
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~subramon
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu
panda@cse.ohio-state.edu
subramon@cse.ohio-state.edu
SC'13
134