Seagate ExaScale
HPC storage
Miro Lehocky
System Engineer
Seagate Systems Group, HPC
© 2015 Seagate, Inc. All Rights Reserved.
Reference Lustre file system deployments (slide callouts):
› 100+ PB Lustre file system
› 130+ GB/s Lustre file system
› 140+ GB/s Lustre file system
› 55 PB Lustre file system
› 1.6 TB/s Lustre file system
› 500+ GB/s Lustre file system
› 1 TB/s Lustre file system
Market leadership …

| Rank | Name | Computer | Site | Total Cores | Rmax (TFLOPS) | Rpeak (TFLOPS) | Power (kW) | File system | Size | Perf |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862.7 | 54,902.4 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s |
| 2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590.0 | 27,112.5 | 8,209 | Lustre | 10.5 PB | 240 GB/s |
| 3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173.2 | 20,132.7 | 7,890 | Lustre | 55 PB | 850 GB/s |
| 4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510.0 | 11,280.4 | 12,659 | Lustre | 40 PB | 965 GB/s |
| 5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | DOE/SC/Argonne National Lab. | 786,432 | 8,586.6 | 10,066.3 | 3,945 | GPFS | 28.8 PB | 240 GB/s |
| 6 | Trinity | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | DOE/NNSA/LANL/SNL | 301,056 | 8,100.9 | 11,078.9 | n/a | Lustre | 76 PB | 1,600 GB/s |
| 7 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.6GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271.0 | 7,788.9 | 2,325 | Lustre | 2.5 PB | 138 GB/s |
| 8 | Shaheen II | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | KAUST, Saudi Arabia | 196,608 | 5,537 | 7,235 | 2,834 | Lustre | 17 PB | 500 GB/s |
| 9 | Hazel Hen | Cray XC40, Xeon E5-2680v3 12C 2.5GHz, Aries interconnect | HLRS - Stuttgart | 185,088 | 5,640.2 | 7,403.5 | n/a | Lustre | 7 PB | ~100 GB/s |
| 10 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168.1 | 8,520.1 | 4,510 | Lustre | 14 PB | 150 GB/s |

N.B. NCSA Blue Waters: 24 PB, 1,100 GB/s (Lustre 2.1.3)
Still The Same Concept:
Fully integrated, fully balanced, no bottlenecks …
ClusterStor Scalable Storage Unit
› Intel Ivy Bridge or Haswell CPUs
› F/EDR, 100 GbE & 2x40GbE, all-SAS infrastructure
› SBB v3 form factor, PCIe Gen-3
› Embedded RAID & Lustre support
Stack components (from the block diagram):
› ClusterStor Manager
› Lustre File System (2.x)
› Data Protection Layer (PD-RAID/Grid-RAID)
› Linux OS
› Embedded server modules
› Unified System Management (GEM-USM)
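Why the "no bottlenecks" claim matters in practice: Lustre stripes each file across object storage targets (OSTs), so large I/O fans out over many OSS/SSU building blocks in parallel. The sketch below (illustrative Python, not Seagate's implementation) shows the standard round-robin, RAID-0-style mapping from a file offset to an OST, governed only by the stripe size and stripe count.

```python
# Minimal sketch of Lustre-style RAID-0 file striping: the file's byte range is
# split into stripe_size chunks and dealt round-robin across stripe_count OSTs,
# so large sequential I/O fans out over many OSS/SSU building blocks in parallel.

def ost_for_offset(offset, stripe_size=1 << 20, stripe_count=4):
    """Return (ost_index, object_offset) for a given file byte offset."""
    stripe_index = offset // stripe_size
    ost_index = stripe_index % stripe_count
    # Offset inside the backing object held by that OST.
    object_offset = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return ost_index, object_offset


if __name__ == "__main__":
    # A file striped 1 MiB wide over 4 OSTs: 8 MiB of sequential I/O touches
    # every OST exactly twice, i.e. the load spreads evenly across servers.
    for mib in range(8):
        ost, obj_off = ost_for_offset(mib << 20)
        print(f"file offset {mib} MiB -> OST {ost}, object offset {obj_off >> 20} MiB")
```

With this layout, adding SSUs adds OSTs, and the bandwidth seen by a widely striped file grows with them, which is what keeps the building-block design balanced.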
So what's new? Seagate next-gen Lustre appliance

CS9000
› ClusterStor Lustre software: Lustre v2.5; Linux v6.5
› Management switches: 1GbE switches (component communication)
› Top of Rack (ToR) switches: InfiniBand (IB) Mellanox FDR or 40GbE (high-availability connectivity in the rack)
› ClusterStor System Management Unit (SMU): 2U24 w/ embedded servers in HA configuration; file system management, boot, storage
› ClusterStor Metadata Management Unit (MMU): 2U24 w/ embedded servers in HA configuration; metadata (user data location map)
› ClusterStor Scalable Storage Unit (SSU): 5U84 enclosure; 6Gbit SAS; Object Storage Servers (OSS) network I/O; Mellanox FDR/40GbE; 7.2K RPM HDDs

CS L300
› ClusterStor Lustre software: Lustre v2.5 Phase 1, Lustre 2.7 Phase 2; Linux v6.5 Phase 1, Linux 7.2 Phase 2
› Management switches: 1GbE switches (component communication)
› Top of Rack (ToR) switches: InfiniBand (IB) Mellanox EDR or 100/40GbE
› ClusterStor Management Unit hardware (CMU): 4 servers in a 2U chassis, FDR/40GbE; servers 1 & 2 file system management, boot; servers 3 & 4 metadata (user data location map); 2U24 JBOD (HDD storage for management and metadata)
› ClusterStor Scalable Storage Unit (SSU): 5U84 enclosure; Object Storage Servers (OSS) network I/O; Mellanox EDR 100/40GbE; 10K and 7.2K RPM HDDs
› 2U24 enclosure with Seagate Koho Flash SSDs
ClusterStor GRIDRAID
| Feature | Benefit |
|---|---|
| De-clustered RAID 6: up to 400% faster to repair (rebuild of a 6TB drive: MD RAID ~33.3 hours, GridRAID ~9.5 hours) | Recover from a disk failure and return to full data protection faster |
| "Repeals" Amdahl's Law (the speed of a parallel system is gated by the performance of its slowest component) | Minimizes the impact of a rebuild on widely striped file performance seen by applications |
| Minimized file system fragmentation | Improved allocation and layout maximizes sequential data placement |
| 4-to-1 reduction in OSTs | Simplifies scalability challenges |
| ClusterStor integrated management | CLI and GUI configuration, monitoring and management reduce OpEx |

[Diagram: Traditional RAID with dedicated parity and rebuild disks per pool (Pools #1 through #4) versus GridRAID, where parity and rebuild capacity are distributed across all drives behind each OSS server.]
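The repair-time numbers above follow from where rebuild bandwidth comes from: a classic RAID 6 rebuild funnels everything onto a single spare drive, while a de-clustered layout spreads parity groups and spare space over the whole pool so every surviving drive contributes. A rough model is sketched below; the per-drive throughput budgets, pool size and 8+2 geometry are assumptions chosen for illustration, not Seagate's published methodology.

```python
# Back-of-the-envelope rebuild-time model: traditional RAID 6 vs. de-clustered
# (GridRAID-style) RAID 6.  The per-drive throughput budgets, pool size and
# 8+2 geometry below are illustrative assumptions, not measured Seagate figures.

TB = 1e12  # bytes


def traditional_rebuild_hours(drive_tb=6.0, spare_write_mb_s=50):
    """Classic parity RAID: the failed drive is rewritten onto one hot spare,
    so the spare drive's sustainable write rate is the bottleneck."""
    return (drive_tb * TB) / (spare_write_mb_s * 1e6) / 3600


def declustered_rebuild_hours(drive_tb=6.0, pool_drives=41, data_shards=8,
                              per_drive_budget_mb_s=35):
    """De-clustered RAID: parity groups and spare space are spread across the
    whole pool, so every surviving drive contributes reads (8 shards must be
    read per rebuilt shard in an 8+2 layout) and a share of the spare writes."""
    bytes_to_read = data_shards * drive_tb * TB    # read amplification
    bytes_to_write = drive_tb * TB                 # into distributed spare space
    aggregate_bw = (pool_drives - 1) * per_drive_budget_mb_s * 1e6
    return (bytes_to_read + bytes_to_write) / aggregate_bw / 3600


if __name__ == "__main__":
    print(f"traditional RAID 6 : ~{traditional_rebuild_hours():.1f} h")    # ~33 h
    print(f"de-clustered RAID 6: ~{declustered_rebuild_hours():.1f} h")    # ~11 h
```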
ClusterStor L300
HPC Disk Drive
HPC Optimized Performance 3.5" HDD
High level product description
ClusterStor L300 HPC 4TB SAS HDD
HPC Industry First; Best Mixed Application Workload Value
Performance Leader: world-beating performance over other 3.5in HDDs, speeding data ingest, extraction and access.
Capacity Strong: 4TB of storage for big data applications.
Reliable Workhorse: 2M hour MTBF and 750TB/year workload ratings for reliability under the toughest workloads your users throw at it.
Power Efficient: Seagate's PowerBalance feature provides significant power benefits for minimal performance tradeoffs.
[Bar chart: CS HPC HDD vs. NL 7.2K RPM HDD on random writes (4K IOPS, WCD), random reads (4K Q16 IOPS) and sequential data rate (MB/s); y-axis 0 to 600.]
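To put the 2M-hour MTBF rating into operational terms, the short calculation below converts it to an annualized failure rate and to expected drive replacements in a fully populated rack. The 588-drive rack population (7 x 84-slot enclosures) is an assumption used only for illustration.

```python
import math

# Rough reliability arithmetic for the 2M-hour MTBF rating.  The rack
# population (7 x 84-slot 5U84 enclosures = 588 drives) is an assumption
# for illustration; the MTBF figure is the one quoted above.

MTBF_HOURS = 2_000_000
HOURS_PER_YEAR = 8760

# Annualized failure rate under a constant-failure-rate (exponential) model.
afr = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)        # ~0.44 %

drives_per_rack = 7 * 84
expected_failures_per_year = afr * drives_per_rack      # ~2.6 drives

print(f"AFR per drive              : {afr:.2%}")
print(f"expected failures per rack : {expected_failures_per_year:.1f} per year")
```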
IBM Spectrum Scale
based solutions
Best in Class Features
Designed for HPC, Big Data and Cloud
Connectivity
› IB – FDR, QDR and 40 GbE
(EDR, 100GbE and Omnipath on roadmap)
› Exportable via CNFS, CIFS, Object storage,
HDFS connectors
› Linux and Windows Clients
Robust Feature Set
› Global Shared Access with Single Namespace
across cluster/file systems
› Building Block approach with Scalable
performance and capacity
› Distributed file system data caching and
coherency management
› RAID 6 (8+2) with De-clustered RAID
› Snapshot and Rollback
› Integrated Lifecycle Management
› Backup to Tape
› Non Disruptive Scaling, Restriping, Rebalancing
› Synchronously replicated data and metadata
Management and Support
› ClusterStor CLI-Based Single Point of
Management
› RAS/Phone home
› SNMP integration with Business Operation
Systems
› Low level Hardware Monitoring & Diagnostics
› Embedded monitoring
› Proactive alerts
Hardware Platform
› Industry’s Fastest Converged Scale-Out
Storage Platform
› Latest Intel Processors
› Embedded High Availability NSD Servers
Integrated into Data Storage Backplane
› Fastest Available I/O
› Extremely Dense Storage Enclosures with 84
drives in 5U
ClusterStor Spectrum Scale
Performance Density Rack Configuration
Key components:
› ClusterStor Manager node (2U enclosure)
  • 2 HA management servers
  • 10 drives
› 2 management switches
› 5U84 enclosures configured as NSDs + disk
  • 2 HA embedded NSD servers
  • 76 to 80 7.2K RPM HDDs
  • 4 to 8 SSDs
› 42U reinforced rack
  • Custom cable harness
  • Up to 7 enclosures in each rack (base + expansion)
Performance:
› Up to 56 GB/s per rack
[Rack diagram: base and expansion racks, each with redundant networks (ETN 1/2, High Speed Network 1/2), a CSM node and NSD enclosures.]
ClusterStor Spectrum Scale
Capacity Optimized Rack Configuration
Key components:
› ClusterStor Manager node (2U enclosure)
  • 2 HA management servers
  • 10 drives
› 2 management switches
› 5U84 enclosures configured as NSDs + disk
  • 2 HA embedded NSD servers
  • 76 to 80 7.2K RPM HDDs
  • 4 to 8 SSDs
› 5U84 enclosures configured as JBODs
  • 84 7.2K RPM HDDs
  • SAS-connected to the NSD servers, 1-to-1 ratio
› 42U reinforced rack
Performance:
› Up to 32 GB/s per rack
[Rack diagram: base and expansion racks, each with redundant networks (ETN 1/2, High Speed Network 1/2), a CSM node and alternating NSD and JBOD enclosures.]
ClusterStor Spectrum Scale – Standard Configuration
› 2U24 management server
› 5U84 disk enclosure with 2 embedded NSD (MD) servers
  • > 8 GB/s per 5U84 (clustered)
  • ~20K file creates per second
  • ~2 billion files
› Per NSD (MD) server (#1 and #2):
  • Metadata SSD pool: ~10K file creates/s, ~1 billion files, 2 x 800 GB SSD
  • User data pool: ~4 GB/s, 40 HDDs
› ClusterStor ToR & management switch, rack, cables, PDU
› Factory integration & test
› Up to (7) 5U84s in the base rack
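The per-rack figures quoted on the two rack-configuration slides are consistent with the per-building-block numbers above; a quick check is sketched below. The 4-of-7 split of NSD-equipped enclosures in the capacity-optimized rack is an assumption used only to reproduce the quoted 32 GB/s.

```python
# Consistency check on the quoted per-rack numbers, using the per-building-block
# figures above.  The 4-of-7 split of NSD-equipped enclosures in the
# capacity-optimized rack is an assumption used only to reproduce the 32 GB/s.

gb_s_per_5u84 = 8              # "> 8 GB/s per 5U84 (clustered)"
creates_per_md_pool = 10_000   # "~10K file creates/s" per metadata SSD pool
files_per_md_pool = 1e9        # "~1 billion files" per metadata SSD pool

# Performance-density rack: up to 7 NSD-equipped 5U84s per rack.
print("performance-density rack:", 7 * gb_s_per_5u84, "GB/s")      # ~56 GB/s

# Capacity-optimized rack: NSD enclosures feed SAS-attached JBODs 1:1,
# so only ~4 of the 7 enclosure slots contribute NSD bandwidth.
print("capacity-optimized rack :", 4 * gb_s_per_5u84, "GB/s")      # ~32 GB/s

# Two metadata SSD pools per building block (one per NSD server).
print("file creates            :", 2 * creates_per_md_pool, "per second")
print("file capacity           :", int(2 * files_per_md_pool), "files")
```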
Object Storage based archiving solutions
ClusterStor A200 Object Store Features
› Can achieve four-nines (99.99%) system availability
› High-density storage: up to 3.6 PB usable per rack
› "Infinite" number of objects (2^128)
› 8+2 network erasure coding for cost-effective protection
› Rapid drive rebuild (<1 hr for 8 TB in a large system)
› Global object namespace
› Object API & portfolio of network-based interfaces
› Integrated management & consensus-based HA
› Performance scales with capacity (up to 10 GB/s per rack)
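The "up to 3.6 PB usable per rack" figure lines up with the 8+2 erasure-coding efficiency and the drive population described on the resiliency slide that follows (82 x 8 TB SMR drives per 5U84). The check below assumes 7 storage enclosures per rack, which is not stated on the slides.

```python
# Sanity check of the A200 per-rack usable capacity under 8+2 erasure coding.
# The enclosure count per rack (7) is an assumption for illustration; the drive
# population (82 x 8 TB SMR per 5U84) comes from the storage-unit description.

data_shards, parity_shards = 8, 2
efficiency = data_shards / (data_shards + parity_shards)    # 0.8

drives_per_enclosure = 82
drive_tb = 8
enclosures_per_rack = 7

raw_tb = enclosures_per_rack * drives_per_enclosure * drive_tb
usable_pb = raw_tb * efficiency / 1000

print(f"raw capacity    : {raw_tb} TB")
print(f"usable capacity : ~{usable_pb:.1f} PB per rack")    # ~3.7 PB ("up to 3.6 PB")
```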
ClusterStor A200: Resiliency Built-In
Redundant ToR switches
› Combine data and management network traffic
› VLANs used to segregate network traffic
› 10GbE with 40GbE ToR uplinks
Management Unit
› 1x 2U24 enclosure
› 2x embedded controllers
Storage Units (SSUs)
› Titan v2 5U84 enclosures (6 is the minimum configuration)
› 82 x 8TB SMR SATA HDDs
› Single embedded storage controller
› Dual 10GbE network connections
› Resilient to 2 SSU failures (12 SSUs minimum)
42U rack with wiring loom & power cables
› Dual PDUs
› 2U spare space reserved for future configuration options
› Blanking plates as required
Economic Benefits of SMR drives
SMR Drives
Shingled technology increases platter capacity by 30-40%:
› Write tracks are overlapped by up to 50% of the write width
› The read head is much smaller and can reliably read the narrower tracks
SMR drives are optimal for object stores, as most data is static/WORM:
› Updates require special intelligence and may be expensive in terms of performance
› Wide tracks in each band are often reserved for updates
The CS A200 manages SMR drives directly to optimize workflow & caching.
[Diagram: read head vs. write head track widths; an in-place update destroys a portion of the next, overlapped track.]
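Why in-place updates are expensive on shingled media: because write tracks overlap, rewriting a sector means reading back and sequentially rewriting the rest of its band so the overlapped neighbours are not destroyed. The sketch below models that read-modify-write penalty against a plain sequential append; band size and drive throughput are illustrative assumptions.

```python
# Illustrative cost model for an in-place update on shingled (SMR) media.
# Band size and sequential throughput are assumptions, not drive specifications.

BAND_MB = 256       # shingled bands are typically on the order of 100s of MB
SEQ_MB_S = 180      # assumed sustained sequential throughput of the drive


def inplace_update_seconds(band_mb=BAND_MB, seq_mb_s=SEQ_MB_S):
    """Updating even a few KB inside a shingled band forces a read-modify-write
    of the overlapped tracks downstream of the update; worst case, the whole
    band is read back and rewritten sequentially."""
    return 2 * band_mb / seq_mb_s            # read the band, then write it again


def sequential_append_seconds(update_kb, seq_mb_s=SEQ_MB_S):
    """Appending the same data sequentially (the static/WORM pattern object
    stores rely on) only pays for the bytes actually written."""
    return (update_kb / 1024) / seq_mb_s


if __name__ == "__main__":
    print(f"4 KB in-place update : ~{inplace_update_seconds():.1f} s of drive time")
    print(f"4 KB sequential write: ~{sequential_append_seconds(4) * 1000:.3f} ms")
```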
Current ClusterStor Lustre solutions line-up
CS-1500
CS-9000
Base rack
Expansion rack
L300
Base rack
Expansion rack
Current ClusterStor Spectrum Scale/Active Archive line-up
G200
Base rack
Expansion rack
A200
Base rack
Expansion rack