High Performance Data Transfer
Les Cottrell (SLAC), Chin Fang (Zettar), Andy Hanushevsky (SLAC), Wilko Kreuger (SLAC), Wei Yang (SLAC)
Presented at CHEP16, San Francisco
Agenda
• Data transfer requirements and challenges at SLAC
• Proposed solution to the data transfer challenges
• Testing
• Conclusions
Requirements
• Today: 20 Gbps from SLAC to NERSC
  • Holds until LCLS-II starts taking data (2020), when the repetition rate rises from 120 Hz to 1 MHz
  • Also increasing as experiments improve their use of networking
• 2020: LCLS-II starts taking data at the increased rate
• Up to 2024:
  • Imaging detectors get faster
  • LHC luminosity increases 10× in 2020; SLAC ATLAS traffic grows from 35 Gbps to 350 Gbps
What is Zettar?
• A startup providing an HPC data transfer solution (software plus a transfer-system reference design):
  • State-of-the-art, efficient, scalable, high-speed data transfer
  • Running over carefully selected demonstration hardware
Design Goals
• Focus on LCLS-II needs
  • SLAC => NERSC (LBL) over the ESnet link
• High availability (peer-to-peer software; cluster with failover across multiple NICs and servers)
• Scale out (fine granularity of 1U for storage; multiple cores and NICs)
• Highly efficient (low component count, low complexity, managed by software)
• Low cost (inexpensive SSDs for read, more expensive for write; careful balancing of needs)
• Forward looking (uses 100G InfiniBand EDR HCAs and 25GbE NICs; modern design)
• Small form factor (6U)
• Energy efficient
• Storage-tiering friendly (supports tiering)
[Diagram: DTNs in front of the storage tiers: high-performance SSD storage, capacity storage, HSM]
NG demonstration
[Diagram: a cluster of Data Transfer Nodes (DTNs) and storage servers holding > 25 TBytes in 8 SSDs. The storage servers are reached over IP over InfiniBand (IPoIB) at 4 × 56 Gbps, the DTNs attach at 4 × 2 × 25GbE, and two 100 Gbps uplinks form a 2 × 100G LAG toward other clusters, high-speed storage servers or the Internet at n (2) × 100GbE. A capacity tally follows below.]
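As an aside, the sketch below simply tallies the nominal link capacities labelled in the diagram above, assuming the 4 × 2 × 25GbE label means eight 25 Gbps DTN interfaces in total; it is bookkeeping, not a measurement.

```python
# Tally of the nominal link capacities labelled in the demonstration diagram.
# Assumption: "4 x 2 x 25GbE" means eight 25 Gbps DTN interfaces in total.

def total_gbps(n_links, gbps_each):
    """Aggregate nominal capacity of n identical links, in Gbps."""
    return n_links * gbps_each

wan_lag       = total_gbps(2, 100)     # 2 x 100G LAG toward other clusters / Internet
dtn_ethernet  = total_gbps(4 * 2, 25)  # 4 x 2 x 25GbE DTN interfaces
storage_ipoib = total_gbps(4, 56)      # 4 x 56 Gbps IPoIB links to the storage servers

print(f"WAN LAG:        {wan_lag} Gbps")        # 200 Gbps
print(f"DTN Ethernet:   {dtn_ethernet} Gbps")   # 200 Gbps
print(f"Storage IPoIB:  {storage_ipoib} Gbps")  # 224 Gbps, headroom over the 200 Gbps path
```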
Memory-to-memory between clusters at 2 × 100 Gbps
• No storage involved: just DTN-to-DTN memory-to-memory transfers
• Extended locally to 200 Gbps
• Repeated 3 times here
• Note the uniformity of the 8 × 25 Gbps interfaces
• Plain TCP suffices; no need for exotic proprietary protocols
• The network is not the problem (a sketch of such a memory-to-memory test follows below)
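A minimal sketch of how such a memory-to-memory test can be driven over plain TCP, here with iperf3 and hypothetical receiver host names; it illustrates the approach rather than the exact tooling used in the demonstration.

```python
#!/usr/bin/env python3
"""Aggregate memory-to-memory TCP throughput across several DTN interfaces.

Assumes an iperf3 server is already listening on each receiving address;
the host names, stream counts and duration below are illustrative only.
"""
import subprocess

# Hypothetical receiving-side addresses, one per 25GbE interface.
RECEIVERS = [f"dtn-rx{i}.example.org" for i in range(1, 9)]

def start_client(host, streams=4, seconds=60):
    # -c: client mode, -P: parallel TCP streams, -t: duration, -J: JSON output
    cmd = ["iperf3", "-c", host, "-P", str(streams), "-t", str(seconds), "-J"]
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)

# One client per receiver so all 8 x 25 Gbps paths are exercised concurrently.
procs = [start_client(h) for h in RECEIVERS]
for proc in procs:
    out, _ = proc.communicate()
    print(out[:200])  # per-link rates would normally be parsed from the JSON
```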
Storage
• On the other hand, file-to-file transfers are at the mercy of back-end storage performance.
• Even with generous compute power and network bandwidth available, the best designed and implemented data transfer software cannot work magic with a slow storage back end.
XFS READ performance of 8 SSDs in a file server, measured with the fio utility
[Charts: SSD busy (up to ~800%), request queue size (~1500-3500) and read throughput (~15-20 GBps) versus time]
• Data size: 5 × 200 GiB files, similar to typical large LCLS file sizes
• Note that for reads the SSDs are kept busy and uniformly loaded, with plenty of requests in the queue, yielding close to the raw throughput available (an illustrative fio job is sketched below)
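For reference, a sketch of an fio job in the spirit of this sequential-read test; the mount point, block size, queue depth and job layout are illustrative assumptions, not the exact settings behind the charts.

```python
#!/usr/bin/env python3
"""Sequential-read benchmark sketch, loosely mirroring the test above."""
import subprocess

cmd = [
    "fio",
    "--name=seqread",
    "--directory=/mnt/xfs_ssd",   # hypothetical XFS mount over the 8 SSDs
    "--rw=read",                  # sequential reads
    "--bs=1M",                    # large blocks, as for LCLS-style big files
    "--size=200g",                # one 200 GiB file per job
    "--numjobs=5",                # 5 jobs, mirroring the 5 x 200 GiB data set
    "--iodepth=32",               # keep the device queues well fed
    "--ioengine=libaio",
    "--direct=1",                 # bypass the page cache
    "--group_reporting",          # one aggregate result for all jobs
]
subprocess.run(cmd, check=True)
```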
XFS + parallel file system WRITE performance for 16 SSDs in 2 file servers
[Charts: SSD busy, queue size of pending writes (~50) and SSD write throughput (~10 GBps) versus time]
• Writes are a factor of 2 slower than reads
• The file-system layers cannot keep the queue full (about a factor of 1000 fewer items queued than for reads)
Speed breakdown
SSD speeds (per drive)   | Read     | Write    | Write/Read | Write IOPS
Intel sequential (spec)  | 2.8 GBps | 1.9 GBps | 68%        | 1200
XFS                      | 2.4 GBps | 1.5 GBps | 63%        | 305
XFS / Intel              | 85%      | 79%      |            | 25%
• Raw Intel sequential write speed is 68% of the read speed
• Limited by IOPS
• XFS delivers 79% (write) to 85% (read) of the raw Intel sequential speed

File speeds (16 SSDs)    | Read    | Write   | Write/Read
XFS                      | 38 GBps | 24 GBps | 63%
BeeGFS + XFS             | 21 GBps | 12 GBps | 57%
(BeeGFS + XFS) / XFS     | 55%     | 50%     |
• The parallel file system further reduces speed to 50% (write) to 55% (read) of plain XFS (an arithmetic check of these ratios follows below)
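A quick arithmetic check of the ratios quoted above; the only inference beyond the slide is that the aggregate XFS figures correspond to 16 drives.

```python
# Consistency check of the ratios in the speed-breakdown tables.
# Figures are taken from the slide; "x 16 drives" is an inference.

intel_read, intel_write = 2.8, 1.9        # GBps per drive, Intel sequential spec
xfs_read, xfs_write = 2.4, 1.5            # GBps per drive under XFS
agg_xfs_read, agg_xfs_write = 38.0, 24.0  # GBps, 16 SSDs under XFS
bee_read, bee_write = 21.0, 12.0          # GBps, 16 SSDs under BeeGFS + XFS

print(f"XFS / Intel read:      {xfs_read / intel_read:.0%}")     # ~86%
print(f"XFS / Intel write:     {xfs_write / intel_write:.0%}")   # ~79%
print(f"16 x 2.4 GBps:         {16 * xfs_read:.1f} GBps")        # ~38 GBps aggregate
print(f"BeeGFS / XFS read:     {bee_read / agg_xfs_read:.0%}")   # ~55%
print(f"BeeGFS / XFS write:    {bee_write / agg_xfs_write:.0%}") # ~50%
print(f"IOPS ratio 305/1200:   {305 / 1200:.0%}")                # ~25%
```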
Results from an older demonstration: LOSF
[Charts: LOSF (lots of small files) and elephant (large file) file-to-file transfers with encryption, sustaining roughly 20-80 Gbps over a ~2-minute window]
Over a 5,000-mile ESnet OSCARS circuit with TLS encryption
[Charts: transmit and receive speeds, roughly 10-70 Gbps, from 22:24 to 22:27]
• Degradation of about 15% over the 120 ms RTT loop (see the bandwidth-delay-product sketch below)
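For context, the bandwidth-delay product on this loop indicates how much data must be kept in flight to sustain such rates at 120 ms RTT; the sketch below is nominal arithmetic, not a measurement from the test.

```python
# Bandwidth-delay product for the 120 ms RTT loop: roughly how much data
# must be in flight (across all TCP streams) to fill the pipe at ~70 Gbps.

rtt_s = 0.120        # round-trip time on the 5,000-mile loop
target_gbps = 70     # order of the transfer rates shown above

bdp_bytes = target_gbps * 1e9 / 8 * rtt_s
print(f"BDP at {target_gbps} Gbps, {rtt_s * 1000:.0f} ms RTT: "
      f"{bdp_bytes / 2**30:.2f} GiB in flight")
# ~0.98 GiB, in practice spread over many parallel TCP streams and socket buffers
```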
Conclusion
• The network is fine: we can drive 200 Gbps, with no need for proprietary protocols
• Insufficient IOPS for writes (< 50% of raw capability)
  • Today limits file-to-file transfers to 80-90 Gbps
• Work with local vendors
  • State-of-the-art components fail and need fast replacement
  • Worst case: we waited 2 months for parts
• Use the fastest SSDs
  • We used Intel DC P3700 NVMe 1.6 TB drives
  • The biggest are also the fastest, but also the most expensive
  • 1.6 TB at $1677 vs 2.0 TB at $2655 for a 20% improvement
• Need to coordinate with Hierarchical Storage Management (HSM), e.g. Lustre + Robinhood
• We are looking to achieve 80 Gbps, i.e. about 6 PB/week (arithmetic sketched below)
• The parallel file system is the bottleneck
  • It needs enhancing for modern hardware and operating systems
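The arithmetic behind the 80 Gbps to roughly 6 PB/week figure (decimal petabytes):

```python
# 80 Gbps sustained for one week, expressed in decimal petabytes.

gbps = 80
seconds_per_week = 7 * 24 * 3600                      # 604,800 s
bytes_per_week = gbps * 1e9 / 8 * seconds_per_week    # 6.048e15 bytes
print(f"{gbps} Gbps for a week = {bytes_per_week / 1e15:.1f} PB")   # ~6.0 PB
```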
More Information
• LCLS SLAC -> NERSC 2013 case study:
  http://es.net/science-engagement/case-studies/multi-facility-workflow-case-study/
• LCLS exascale requirements, Jan Thayer and Amedeo Perazzo:
  https://confluence.slac.stanford.edu/download/attachments/178521813/ExascaleRequirementsLCLSCaseStudy.docx
Questions
• Also by email: cottrell@slac.stanford.edu