Nonvolatile Memory (NVM), Four Trends in the Modern Data Center, and the Implications for the Design of Next Generation Distributed Storage Platforms

David Cohen, System Architect, Intel
Brian Hausauer, Hardware Architect, Intel

2015 Storage Developer Conference. Copyright IBTA © 2015 - All Rights Reserved
Abstract
There are four trends unfolding simultaneously in the modern Data Center: (i) increasing performance of network bandwidth, (ii) storage media approaching the performance of DRAM, (iii) OSVs optimizing the code paths of their storage stacks, and (iv) single processor/core performance remaining roughly flat. A direct result of these trends is that applications/workloads and the storage resources they consume are increasingly distributed and virtualized. This, in turn, is making onload/offload and RDMA capabilities a required feature/function of distributed storage platforms. In this talk we discuss these trends and their implications for the design of distributed storage platforms.
Learning Objectives
• Highlight the four trends unfolding in the data center
• Elaborate on the implications of these trends for the design of modern distributed storage platforms
• Provide details on how onload/offload mechanisms and RDMA become feature/function requirements for these platforms in the near future
The Emergence of Cloud Storage
A new storage architecture has emerged in the modern data center that
is predicated on data-center-wide connectivity to support application
mobility.
Architectural Convergence on the Cloud Storage Platform
Application Mobility & Scale Requires
Data Center-Wide Connectivity
1. Disaggregation: Enables Mobility
   • Disaggregated (i.e., horizontally scaled) Compute and Storage has delivered massive increases in capacity
2. Job Scheduling: Increases Efficiency
   • Realizing the benefits of these increases in capacity has been predicated on increasingly sophisticated scheduling and job placement
3. System Balance: Optimization Function
   • The move to scale-out networking has been key to delivering sufficient end-to-end bandwidth to not stall computations (i.e., maintain system balance)

Disaggregation is about Application Mobility and Scale
Disaggregation to Enable Application
Deployment Anytime/Anywhere
1. A modern Data Center is designed so that an application can be deployed anywhere/anytime. This is achieved by deploying services:
   • Disaggregated (i.e., horizontally scaled) Compute and Storage has delivered massive increases in capacity
2. Job Scheduling: Increases Efficiency
   • Realizing the benefits of these increases in capacity has been predicated on increasingly sophisticated scheduling and job placement
3. System Balance: Optimization Function
   • The move to scale-out networking has been key to delivering sufficient end-to-end bandwidth to not stall computations (i.e., maintain system balance)
Application Deployment Anytime/Anywhere
Requires Scale-Out, Server-based Storage
1. Does Disaggregation (as defined above) mean that Compute and Storage services run on different hardware?
   No. IaaS-, PaaS-, and SaaS-based applications consume storage services indirectly via IP-based networking services. This means the application can be running over the same hardware/servers (aka "Hyperconverged") or on different servers.
2. Can Disaggregation be supported by traditional, external storage appliances (aka "scale-up" storage)?
   No. This is frequently referred to as "Converged Infrastructure" (as opposed to Hyperconverged). However, each appliance is a silo with limited capacity and performance. The scope of the deployment is constrained to some number of servers interconnected to the storage appliance via a shared network. If the appliance runs out of capacity (or performance), data (and/or applications) need to be migrated to a new deployment of servers/storage appliance. This significantly hampers the design goal of enabling "an application can be deployed anywhere/anytime."
3. How does Server-based Storage (SBS) address this constraint?
   SBS delivers a storage service via the IP network. While SBS supports a model identical to the storage appliance, the alternative scale-out model runs the storage service over many physical servers. Capacity and performance are scaled by adding servers over which to run the service. In the case of hyperconverged infrastructure, the application and the storage service operate over the same servers.
Who is driving this innovation?
“Cloud” based Storage Scales-Out to support Application Mobility
Server-based Storage
a simple taxonomy
Scaling out for Application Mobility shifts the burden to the Network
Four Trends in Modern Data Centers
1. Operating System Vendors (OSVs) optimizing the code path of their network and storage stacks
2. Increasing performance of network bandwidth
3. Storage media approaching the performance of DRAM
4. Single processor/core performance not increasing at the same rate as network and storage
The Shift from Appliance-based to Cloud-based Storage
The new storage architecture's reliance on data-center-wide connectivity is increasingly focused on latency.
Evolving Storage Architecture Landscape

[Diagram: 1. Traditional Storage Architecture — over-provisioned, redundant networks; replication internal to the storage appliance. 2. Cloud Storage Architecture — flat networks; resiliency in the Clos fabric; Direct Attached Storage (DAS); the rack is the failure domain; the data center is the appliance; no SPoF.]

In Traditional Storage architectures:
• Analysts predict little to no growth
• Availability is built into the appliance because network bandwidth is expensive
• Network is redundant and over-provisioned
• Limited in scale and dependent on LAN connectivity
• Custom HW & SW
• Chassis is the appliance

In Cloud Storage architectures:
• Analysts predict substantial growth
• Use of a high-bandwidth fabric enables consumption of large amounts of NVM
• Per-GB costs are an order of magnitude less by leveraging tiering
• Storage is an application on commodity HW
• Data Center (or "zone") is the appliance

The Traditional and Cloud Storage Architectures differ in how they deal with availability.
Traditional Storage Architecture Dataflow for a
Write Operation
[Diagram: a dual-controller storage appliance (Ctrl A / Ctrl B), each controller an Intel Xeon host with DDR3, NVM, and SPCH, cross-connected over PCIe NTB, with SAS IOCs and SAS expanders fanning out to SAS SSD (high-performance storage), SAS (enterprise performance storage), and SATA (enterprise bulk storage); servers reach the appliance through switches.]

1. Write Data from the Application arrives; a copy (log) is placed in a non-volatile memory region
2. Data is made durable by replicating it in the partner's non-volatile memory area (no SPoF)
3. The Application write is acknowledged and the program proceeds
4. Data is written at leisure to disk

Enables:
• No SPoF
• Minimal network use
• Minimum response time
• Tiering
Cloud Storage Architecture Dataflow for a
Write Operation
1. Write Data from the Application arrives; a copy (log) is placed in a non-volatile memory region
2. Data is made durable by replicating it in a partner's non-volatile memory area (no SPoF)
3. The Application write is acknowledged and the program proceeds
4. Data is written at leisure to disk

Enables:
• No SPoF
• Greater network use
• Lower-cost implementation
• Larger fan-out

[Diagram: servers with and without storage connect through a Clos network to multiple storage servers; each storage server is an Intel Xeon host with DDR3, PCIe, and SATA drives, and the replication in step 2 flows across the network between storage servers.]

The Cloud Storage Architecture's tiering model is also a key differentiator.
Storage Tiering
Cloud Storage uses a tiering model based on Hot, Warm, and Cold
data. A relatively small amount of higher performance storage is used
to service application I/O requests. As data is accessed less frequently
it is moved to less expensive media.
Storage Tiering in the Cloud Storage Architecture
Hot Tier (Performance):
• 5-10% of capacity
• "Active working set"
• Performance driven
• Local replication based
• Driven by application BW

Warm Tier (Capacity):
• 15-30% of capacity
• "Data < 1 year"
• Cost/performance driven
• Medium erasure code; higher node count based on erasure code
• Authoritative source of data; fast refill for the Hot Tier
• RD BW driven by Hot Tier cache miss rate; WRT BW driven by Hot Tier change rates

Cold Tier (Capacity):
• 60-75% of capacity
• Cost driven; lowest $/GB
• Maximum erasure code; high node count based on erasure code
• Future multi-site
• Nothing ever gets deleted

(The minimal node count increases from the Hot Tier to the Cold Tier.)
Storage Tiering enables the use of higher performance, more expensive storage media
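As a concrete (if simplified) illustration of the demotion side of such a policy, here is a minimal sketch; the thresholds, object names, and placement rule are illustrative assumptions of mine, not the presenters' implementation.

    /* Minimal tier-demotion sketch: objects move Hot -> Warm -> Cold as
     * they go unaccessed.  Thresholds are illustrative only. */
    #include <stdio.h>
    #include <time.h>

    enum tier { HOT, WARM, COLD };

    struct object {
        const char *name;
        time_t      last_access;
        enum tier   tier;
    };

    /* Pick a tier from the age of the last access. */
    static enum tier place(time_t now, time_t last_access)
    {
        double idle_days = difftime(now, last_access) / 86400.0;
        if (idle_days < 7)   return HOT;   /* active working set */
        if (idle_days < 365) return WARM;  /* "data < 1 year"    */
        return COLD;                       /* cost driven        */
    }

    int main(void)
    {
        time_t now = time(NULL);
        struct object objs[] = {
            { "db-journal",     now - 3600,            HOT  },
            { "monthly-report", now - 30 * 86400,      HOT  },
            { "2013-archive",   now - 2 * 365 * 86400, WARM },
        };
        const char *names[] = { "HOT", "WARM", "COLD" };

        for (int i = 0; i < 3; i++) {
            enum tier t = place(now, objs[i].last_access);
            if (t != objs[i].tier)
                printf("%s: demote %s -> %s\n", objs[i].name,
                       names[objs[i].tier], names[t]);
            objs[i].tier = t;
        }
        return 0;
    }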
End-to-End Storage I/O – Positioning the Media
Transition
[Diagram: end-to-end storage I/O across the Data Center Network (Fabric). Warm Tier storage servers (Intel Xeon, DDR3, SPCH, rotational SATA/SAS media) and Hot Tier storage servers (Intel Xeon, DDR3, NVM, 3D-NAND/NVMe) serve a Compute Tier of application servers; the Warm-Hot and Hot-App interactions both traverse the fabric.]

The transition from rotational media (SATA/SAS) to solid-state media (3D-NAND, NVMe) shifts focus to low-latency network I/O.
The Shift in Focus to Latency
Workloads in Cloud deployments are concerned with per-operation elapsed time (aka "latency") and the "tail" of the distribution as measured across many of these operations.
Flash Accelerates the Data Center,
Drives Innovation in the Network
• Traditional Network
  • ~10 ms across the data center
  • Highly buffered, no packet drop
  • Highly oversubscribed
  • 1 Gb/s to the host
  • SAS HDD: ~200 IOPS @ ~5 ms, ~100 MB/s streaming
• Flat Network (2nd Cloud Wave)
  • ~10 µs across the data center
  • No buffering
  • Much lower oversubscription
  • 10 Gb/s to the host
  • NVM Express™ SSD: ~400,000 IOPS @ ~100 µs, 2 GB/s
• Looking ahead (4th Cloud Wave)
  • 25/40/50 Gb/s to the host
  • 100 Gb/s to the host, low-latency messaging, RDMA
  • 3D XPoint: ~<???> IOPS @ ~<???> µs, <???> GB/s
The next wave of innovations will focus on addressing Latency
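To see why, a back-of-the-envelope Little's Law estimate (concurrency = throughput x latency) using the figures above shows how many I/Os must be kept in flight just to keep one device busy, which is what pushes the pressure onto per-operation latency and per-operation CPU cost:

    /* Back-of-the-envelope: outstanding I/Os needed to sustain the rated
     * IOPS at the rated latency (Little's Law: L = throughput x latency). */
    #include <stdio.h>

    int main(void)
    {
        /* figures from the slide above */
        double hdd_iops  = 200.0,    hdd_lat_s  = 5e-3;   /* SAS HDD  */
        double nvme_iops = 400000.0, nvme_lat_s = 100e-6; /* NVMe SSD */

        printf("SAS HDD : ~%.0f outstanding I/O\n",  hdd_iops  * hdd_lat_s);  /* ~1  */
        printf("NVMe SSD: ~%.0f outstanding I/Os\n", nvme_iops * nvme_lat_s); /* ~40 */
        return 0;
    }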
Network Latency Requirements
• Table shows typical small-read latency for several current and next-gen NV memory technologies
• The last column shows a "rule of thumb": 20% of additional network latency is acceptable
• For many block and file access protocols, the network latency includes separate command, data transfer, and status messages, requiring up to two network round-trip times (RTTs)

Conclusions
• It is difficult to achieve the 'Next gen NVM Express' network latency goal without RDMA
• It is very difficult to achieve the 'Persistent Memory DIMM' network latency goal with as-is block or file network protocols. A new or enhanced protocol needs:
  • A reduced number of network messages per IOP and at most a single RTT
  • No per-IOP CPU interaction on the target
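To make the 20% rule of thumb concrete, the sketch below works the arithmetic for a few illustrative media read latencies; the values are assumptions of mine standing in for the slide's table, not the table itself, and the per-hop line simply reflects the two-RTT case mentioned above.

    /* The "20% additional network latency" rule of thumb, worked for a few
     * ILLUSTRATIVE media latencies (not the values from the slide's table). */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *media; double read_lat_us; } m[] = {
            { "NAND SSD (illustrative)",               80.0 },
            { "Next-gen NVM Express (illustrative)",   10.0 },
            { "Persistent Memory DIMM (illustrative)",  1.0 },
        };

        for (int i = 0; i < 3; i++) {
            double budget = 0.20 * m[i].read_lat_us;  /* total network budget */
            printf("%-38s media %5.1f us, network budget %5.2f us, "
                   "per-hop budget at 2 RTTs %5.2f us\n",
                   m[i].media, m[i].read_lat_us, budget, budget / 4.0);
        }
        return 0;
    }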
Write I/O Operations, Availability, and Distributed Logs
Recall from the Cloud Storage Architecture write dataflow above: a Write I/O Operation must be appended to the Master's log along with the logs of at least two peers before an acknowledgement is returned to the Application. This is a synchronous operation.
[Diagram: the Application (on a compute server) submits a write I/O to the Master storage server; the Master appends the entry to its local log and ships the log entry to peer 1 and peer 2 over the Data Center Network (Fabric); once the entry is durable on the peers, the Master sends the acknowledgement back to the Application. Each storage server is an Intel Xeon host with DDR3, NVM, and PCIe-attached media.]
The response time of Write I/O Operations is a challenge due to synchronous replication
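The shape of that path is easy to see in a sketch. The helpers below are stubs standing in for whatever transport the platform actually uses (e.g., RDMA sends); this is an illustration of the control flow, not the presenters' implementation.

    /* Sketch of the synchronous write path: the master appends the log entry
     * locally and must have it durable on two peers before acknowledging the
     * application.  The "peers" here are stubs that just print what the real
     * transport would do. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct log_entry { const char *data; size_t len; };

    static bool append_local_log(const struct log_entry *e)
    {
        printf("master: append %zu bytes to local log\n", e->len);
        return true;
    }

    static bool replicate_to_peer(int peer, const struct log_entry *e)
    {
        /* Stand-in for shipping the entry and waiting for the peer's ack. */
        printf("master: log entry durable on peer%d (%zu bytes)\n", peer, e->len);
        return true;
    }

    static bool handle_write(const struct log_entry *e)
    {
        if (!append_local_log(e))
            return false;
        for (int peer = 1; peer <= 2; peer++)   /* synchronous: the app ack */
            if (!replicate_to_peer(peer, e))    /* waits on both peers      */
                return false;
        return true;                            /* now acknowledge the app  */
    }

    int main(void)
    {
        struct log_entry e = { "write payload", 13 };
        if (handle_write(&e))
            printf("master: acknowledge application write\n");
        return 0;
    }

The write's response time is therefore bounded below by the local append plus the slower of the two peer round trips, which is exactly where network latency and RDMA enter the picture.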
Satisfying the Requirements of Cloud
Storage
RDMA and High Performance Storage Networking
Server OSes Already Support the I/O Infrastructure
Needed to address Cloud Scale-Out Storage
Example: Linux OpenFabrics Software Stack for RDMA
• Supports a range of block, network filesystem, and distributed parallel filesystem protocols, with both initiator and target implementations
• Uses RDMA for low latency and high message rate
• Supports cloud scale-out storage when used with IP-based RDMA
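A quick way to see that this stack is already present on a stock Linux server is to enumerate the RDMA devices it exposes through the verbs API; this small check is my own illustration, not part of the presentation.

    /* List the RDMA devices the OpenFabrics/verbs stack exposes.
     * Build with: cc rdma_ls.c -libverbs */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int n = 0;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs) {
            perror("ibv_get_device_list");
            return 1;
        }
        printf("%d RDMA device(s)\n", n);
        for (int i = 0; i < n; i++)
            printf("  %s\n", ibv_get_device_name(devs[i]));
        ibv_free_device_list(devs);
        return 0;
    }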
Protocols are Evolving to Better Address Cloud
Storage Opportunities
Example: NVMe over Fabrics block storage
Evolving from a PCIe Fabric-connected solution…
• Very low latency and high message rate
• But with scale-out limitations due to the PCIe fabric
…to include an RDMA Fabric-connected solution
• Preserves the latency and message rate characteristics of the original
• Solves the scale-out limitations
How does NVMe over Fabrics preserve the latency
and message rate characteristics of NVMe?
By direct mapping of the NVMe programming model to RDMA Verbs
• Maintains the NVMe PCIe operational model and NVMe descriptors
• Simple mapping of NVMe IOQ to RDMA QP

[Diagram: the host-side stack (NVMe RDMA over RDMA Verbs over an InfiniBand™ or IP-based RDMA NIC) connects across the RDMA Fabric to the mirror-image stack on the target.]
How does NVMe over Fabrics preserve the latency
and message rate characteristics of NVMe?
By direct mapping of the NVMe programming model to RDMA Verbs
• Maintains the NVMe PCIe operational model and NVMe descriptors
• Simple mapping of NVMe IOQ to RDMA QP
• Simple translation of NVMe DMA operations to RDMA operations
• Simple mapping and translations enable low latency, high message rate, and very simple conversion from RDMA to NVMe semantics in the Controller

[Diagram: the host's logical NVMe Submission and Completion Queues map onto an RDMA SQ/RQ pair; across the RDMA Fabric, the controller's RDMA RQ/SQ pair feeds its NVMe Submission and Completion Queues.]
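To give a flavor of what "NVMe IOQ maps to RDMA QP" looks like at the verbs level, here is a rough sketch that posts a command capsule as an RDMA SEND on an already-connected queue pair. It uses the standard libibverbs calls, but the capsule layout and the helper's name are simplifications of my own, not the NVMe over Fabrics wire format; queue-pair setup (e.g., via rdma_cm) and memory registration are assumed to have happened elsewhere.

    /* Sketch: post an NVMe command capsule as an RDMA SEND on an established
     * queue pair.  qp and the memory region mr covering the capsule buffer
     * are assumed to be set up elsewhere; the capsule contents are a
     * stand-in, not the real NVMe over Fabrics capsule format. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_nvme_capsule(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *capsule, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)capsule,   /* registered capsule buffer */
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = (uintptr_t)capsule; /* echoed back in the completion  */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;        /* the command travels as a SEND; */
        wr.send_flags = IBV_SEND_SIGNALED;  /* bulk data can then move via    */
                                            /* RDMA READ/WRITE by the target  */
        return ibv_post_send(qp, &wr, &bad_wr);
    }

Completions for the send (and for the target's response capsule) are then reaped from the completion queues associated with the QP, which mirrors the submission/completion rhythm the NVMe queue pair already has.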
Cloud Storage Dataflow - Network Performance
Requirements
How many NVMe drives are required to saturate various network link speeds using the NVMe over Fabrics protocol at the Storage Server Home Node?
• Each NVMe drive is capable of 2 GB/s sustained writes
• Traffic pattern is 100% write, 3x replication, 4KB block I/O

Link Speed (Gb/s) | # NVMe drives to saturate link | KIOPS (4KB block, 100% write, 3x replication) | Pkt Rate (Mp/s)
 25               |  0.8                           |   346                                         |  4.15
 50               |  1.6                           |   691                                         |  8.30
100               |  3.1                           |  1383                                         | 16.59
200               |  6.3                           |  2765                                         | 33.19
400               | 12.5                           |  5531                                         | 66.37
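A rough model of where the drive counts come from, under my reading of the setup: each 4 KB host write makes the Home Node emit two replica copies, so its egress carries roughly 2 x 4 KB per IOP (the slide's KIOPS and packet-rate columns also include protocol overheads that this sketch deliberately ignores, so its IOPS come out slightly higher).

    /* Rough model of the link-saturation table.  Assumptions: per host 4KB
     * write the Home Node sends two replica copies (2 x 4KB egress), each
     * NVMe drive sustains 2 GB/s of writes, and protocol/packet overheads
     * are ignored. */
    #include <stdio.h>

    int main(void)
    {
        const double link_gbps[]  = { 25, 50, 100, 200, 400 };
        const double drive_Bps    = 2e9;        /* 2 GB/s per NVMe drive   */
        const double bytes_per_io = 2 * 4096.0; /* two 4KB replicas egress */

        for (int i = 0; i < 5; i++) {
            double link_Bps = link_gbps[i] * 1e9 / 8.0;
            /* each GB/s written locally produces 2 GB/s of replica egress */
            double drives   = (link_Bps / 2.0) / drive_Bps;
            double kiops    = link_Bps / bytes_per_io / 1e3;
            printf("%5.0f Gb/s: %4.1f drives, ~%4.0f KIOPS before overhead\n",
                   link_gbps[i], drives, kiops);
        }
        return 0;
    }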
In this Cloud Storage Architecture, a modest number of NVMe devices can:
• generate massive network bandwidth
• drive packet rates high enough to make RDMA/offload solutions very attractive
Call to Action
• We highlighted the four trends unfolding in the data center:
  • Increasing performance of network bandwidth
  • Storage media, such as Intel's 3D XPoint, approaching the performance of DRAM
  • Single processor/core performance not increasing at the same rate as network and storage, placing an emphasis on scaling workloads out over available cores and exploiting RDMA to offload cycles related to network processing
  • In anticipation of the first three trends, OSVs are optimizing the code paths of their storage stacks to take advantage of the increased network and storage performance
• We elaborated on the implications of these trends for the design of modern distributed storage platforms
• We provided details on how onload/offload mechanisms and RDMA become feature/function requirements for these platforms in the near future, with a focus on NVMe over Fabrics.
InfiniBand Trade Association
Global member organization dedicated to developing, maintaining, and furthering the InfiniBand specification
• Architecture definition
  • RDMA software architecture
  • InfiniBand, up to 100 Gb/s and 300 Gb/s per port
  • RDMA over Converged Ethernet (RoCE)
• Compliance and interoperability testing of commercial products
• Markets and promotes InfiniBand/RoCE
  • Online, marketing, and public relations engagements
  • IBTA-sponsored technical events and resources
• Steering committee members
For More Information
www.infinibandta.org
www.roceinitiative.org
© InfiniBand Trade Association
Speaker Bios
• David Cohen, System Architect, Intel
  Dave is a System Architect and Senior Principal Engineer in Intel's Data Center Group, where he focuses on the system implications of the intersection of networking and storage in the modern data center.
• Brian Hausauer, Hardware Architect, Intel
  Brian is a Hardware Architect and Principal Engineer in Intel's Data Center Group, with a focus on Ethernet RDMA engine architecture and the application of RDMA to emerging storage use cases.