
HPE Reference Architecture for
ProLiant DL580 Gen9 and Microsoft
SQL Server 2016 OLTP database
consolidation
Technical white paper
Contents
Executive summary ... 3
Introduction ... 3
Solution overview ... 4
HPE ProLiant DL580 Gen9 server ... 4
HPE NVMe PCIe SSDs ... 5
HPE NVMe PCIe Workload Accelerator ... 5
Solution components ... 7
Hardware ... 7
Software ... 7
Application software ... 7
Best practices and configuration guidance for the solution ... 7
HPE ProLiant DL580 Gen9 ... 7
NVMe PCIe Workload Accelerator configuration ... 8
SQL Server configuration guidance ... 8
Capacity and sizing ... 8
Workload description and test methodology ... 8
NVMe characterization ... 9
SQL memory characterization ... 13
HPE ProLiant DL580 Gen9 system and SQL settings characterization results ... 16
SQL backup tests ... 18
SQL index rebuild tests ... 20
Processor comparison (Broadwell versus Haswell) ... 20
Analysis and recommendations ... 21
Summary ... 22
Appendix A: Bill of materials ... 22
Appendix B: FIO tool command line configuration for disk I/O benchmarking ... 23
Appendix C: Understanding hardware and software NUMA in SQL Server 2016 and how to set affinity on Broadwell ... 24
Appendix D: Creating a custom soft-NUMA configuration ... 25
Resources and additional links ... 27
Executive summary
Demands for database implementations continue to escalate. Faster transaction processing speeds, scalable capacity, and increased flexibility are
required to meet the needs of today’s business. Many businesses are still running older versions of Microsoft® SQL Server on older infrastructure.
As hardware continues to age and older SQL Server versions reach end-of-life (EOL), SQL admins often run into performance and scalability
issues. To solve them, customers often refresh their platforms and consolidate databases to improve performance and resource utilization. When designed properly, consolidation also reduces operating costs and accommodates future data growth.
SQL Server database consolidation can be achieved in three ways: virtualization, instance consolidation, and database consolidation. The
HPE ProLiant DL580 Gen9 server is an ideal choice for demanding, data-intensive workloads, addressing key technology trends such as in-memory computing for accelerated data analytics, in-server flash storage and co-processors/GPUs for accelerated data processing, and high-density memory and I/O scalability for application consolidation. In addition, the DL580 Gen9 includes system
Reliability, Availability, and Serviceability (RAS) features, such as HPE Advanced Error Recovery, HPE Memory Quarantine, and HPE Advanced
Error Containment, which Microsoft SQL Server 2016 supports.
This Reference Architecture will demonstrate a highly optimized configuration for a mid-range multiple database consolidation with Microsoft
SQL Server 2016 and Windows Server® 2016 on an HPE ProLiant DL580 Gen9 server with a hybrid storage solution of write-intensive NVMe
PCIe SSDs and Workload Accelerators, achieving almost 62K batch requests per second with an OLTP workload, a 26% performance gain over a default, un-tuned SQL Server configuration.
This RA also addresses the customer challenge of slow SQL backups and index rebuilds performed during daily maintenance windows. We analyze database performance using specific SQL command switches that improve throughput, shortening the duration of these important tasks whether the business databases are offline or online.
Microsoft SQL Server 2016 realizes additional performance and scalability benefits by upgrading to the latest processors. The testing showed a
20% gain in performance for an OLTP workload on Intel® Xeon® v4 (Broadwell) over v3 (Haswell) processors in an HPE ProLiant DL580 Gen9
server.
Target audience: This Hewlett Packard Enterprise Reference Architecture white paper is designed for IT professionals who use, program,
manage, or administer large databases that require high availability and performance. Specifically, this information is intended for those who
evaluate, recommend, or design new IT high-performance architectures.
This white paper describes testing completed in September 2016.
Document purpose: The purpose of this document is to describe a Reference Architecture, highlighting benefits and key implementation details
to technical audiences.
Introduction
As the demands for higher processing and scale-up capabilities grow, older platforms have reached their limits in scalability and performance.
The purpose of this Reference Architecture is to provide an example platform for customers to use when designing a high-performance SQL
Server database server to support database (DB) consolidation initiatives or new business requests that require high performance, transactional
database support. We will discuss the results and analysis for several key decision points that SQL architects must consider in designing their
SQL Server database environment. These include the following:
• Characterize two HPE NVMe PCIe storage offerings to determine the best storage layout for the OLTP database files.
• Determine the least amount of memory required by SQL Server while maintaining reasonable performance, reducing initial server cost.
• Compare optimized performance results of a multi-database, single instance consolidation scenario over a default installation of SQL Server.
• Characterize workload performance during concurrent backup jobs.
• Characterize workload performance during concurrent online index rebuilds.
• Compare performance of an Intel E7 v4 processor (Broadwell) based Gen9 server against its Intel E7 v3 (Haswell) predecessor.
Solution overview
HPE ProLiant DL580 Gen9 server
The HPE ProLiant DL580 Gen9 server offers a great platform for high performance OLTP database workloads. It is the Hewlett Packard
Enterprise four-socket (4S) enterprise standard x86 server offering commanding performance, rock-solid reliability and availability, and
compelling consolidation and virtualization efficiencies. Key features of the HPE ProLiant DL580 Gen9 server include:
• Commanding performance key features and benefits:
– Processors – achieve performance with up to four Intel Xeon E7-4800/8800 v4/v3 processors with up to 96 cores per server.
– Memory – HPE SmartMemory prevents data loss and downtime with enhanced error handling. Achieve maximum memory configuration
using performance-qualified HPE SmartMemory DIMMs, populating 96 DDR4 DIMM slots, with up to 6 TB maximum memory.
– I/O expansion – adapt and grow to changing business needs with nine PCIe 3.0 slots and a choice of HPE FlexibleLOM or PCIe adapters for
1, 10, or 25 GbE, or InfiniBand adapters.
– HPE Smart Array controllers – faster access to your data with the redesigned HPE Flexible Smart Array and HPE Smart SAS HBA
controllers that allow you the flexibility to choose the optimal 12 Gb/s SAS controller most suited to your environment. Additional support
for HPE NVMe Mixed Use and Write Intensive PCIe Workload Accelerators
– Storage – standard with 5 SFF hot-plug HDD/SSD drive bays. Additional 5 HDD/SSD or NVMe drive bay support requires an optional
backplane kit.
• Compelling agility and efficiencies for scale-up environments:
– The HPE ProLiant DL580 Gen9 server supports improved ambient temperature ASHRAE A3 and A4 standards helping to reduce your
cooling expenses.
– High efficiency, redundant HPE Common Slot Power Supplies, up to 4x 1500W, provide up to 94% efficiency (Platinum Plus), infrastructure
power efficiencies with -48VDC input voltages and support for HPE Power Discovery Services.
– Customer-inspired and easily accessible features include: front access processor/memory drawer for ease of serviceability, hot pluggable
fans and drives, optional Systems Insight Display (SID) for health and monitoring of components and Quick Reference code for quick access
to product information.
• Agile infrastructure management for accelerating IT service delivery:
– With HPE ProLiant DL580 Gen9 server, HPE OneView (optional) provides infrastructure management for automation simplicity across
servers, storage and networking.
– Online personalized dashboard for converged infrastructure health monitoring and support management with HPE Insight Online.
– Configure in Unified Extensible Firmware Interface (UEFI) boot mode, provision local and remote with Intelligent Provisioning and Scripting
Toolkits.
– Embedded management to deploy, monitor and support your server remotely, out of band with HPE iLO and optimize firmware and driver
updates and reduce downtime with Smart Update, consisting of SUM (Smart Update Manager) and SPP (Service Pack for ProLiant).
Figure 1. HPE ProLiant DL580 Gen9 server, front view
HPE NVMe PCIe SSDs
The introduction of the Non-Volatile Memory Express (NVMe) interface architecture propelled disk drive technology into a new era of extremely high performing storage products. With significantly higher bandwidth/IOPS and very low latency, HPE NVMe PCIe SSD products are designed to take advantage of these properties to provide efficient access to your business data. The HPE SSD portfolio offers three broad
categories of SSDs based on target workloads: Read Intensive, Write Intensive, and Mixed Use. The Write Intensive NVMe SSDs selected for this
Reference Architecture are designed to have the highest write performance which is best suited for online transactional processing
environments. The table below shows the relative performance data between drives in the SSD portfolio. See the “Resources and additional links”
section for the full HPE SSD Data Sheet.
Table 1. Write performance comparison example between HPE Write Intensive (WI) 800GB SSDs
Random test              | HPE WI NVMe SSD | HPE WI 12G SAS SSD | HPE WI 6G SAS SSD
Sequential writes (MB/s) | 1,700           | 580                | 370
Random writes (IOPS)     | 99,000          | 68,000             | 46,500
HPE NVMe PCIe Workload Accelerator
The Workload Accelerator platform provides consistent microsecond latency access for mixed workloads, multiple gigabytes per second access
and hundreds of thousands of IOPS from a single product. The optimized Workload Accelerator architecture allows for nearly symmetrical read
and write performance with excellent low queue depth performance, making the Workload Accelerator platform ideal across a wide variety of real
world, high-performance enterprise environments.
Figure 2. HPE NVME PCIe Workload Accelerator
The logical diagram below illustrates the major solution components, including four Intel Xeon v4 processors, 6TB RAM, NVMe Write Intensive PCIe SSDs, and Write Intensive NVMe PCIe Workload Accelerators.
Figure 3. Solution logical diagram: 4-socket NUMA architecture with four 24-core Intel Xeon E7-8890 v4 processors (2.2GHz), 6TB total RAM (1.5TB per socket, 96 x 64GB DDR4-2400 LRDIMMs in DL580 Gen9 12-slot memory cartridges), two 1.6TB NVMe Write Intensive PCIe Workload Accelerators, four 800GB NVMe Write Intensive PCIe SFF SSDs, two 400GB 6G SATA SFF SSDs for the OS, Smart Array P830i 12Gb/s SAS controller, FlexibleLOM 10GbE networking, iLO 4 management, and redundant hot-plug HPE Common Slot power supplies
Solution components
Hardware
HPE ProLiant DL580 Gen9 server configuration
This HPE ProLiant DL580 Gen9 database server configuration is based on internal, direct-attached storage. We evaluated different storage
options and used a hybrid internal storage configuration to demonstrate the performance characteristics of the different disk options and to
establish an optimum hybrid configuration using different models working together. In addition, the testing evaluated two server processor family
options for the HPE ProLiant Gen9 server. Both the Intel Xeon E7 v3 and v4 processor families were tested. This testing demonstrated the
relative performance between the processor models and overall benefits of deploying newer processor architectures.
The HPE ProLiant DL580 Gen9 Broadwell-based server was configured with the following components:
• Four 24-core Intel Xeon E7-8890 v4 processors at 2.20GHz
• 6TB memory (96 x 64GB HPE DDR4 SmartMemory LRDIMMs)
• 2 x 400GB 6G SATA ME 2.5in SC EM SSD (OS)
• 2 HPE 1.6TB NVMe Write-Intensive PCIe Workload Accelerators in RAID1
• 4 x 800GB NVMe PCIe Write-intensive SFF SC2 SSD in RAID10
The HPE ProLiant DL580 Gen9 Haswell-based server was configured with the following components:
• Four 18-core Intel Xeon E7-8890 v3 processors at 2.50 GHz
• 6TB memory (96 x 64GB HPE DDR4 SmartMemory LRDIMMs)
• 2 x 400GB 6G SATA ME 2.5in SC EM SSD (OS)
• 2 HPE 1.6TB NVMe Write-Intensive PCIe Workload Accelerators in RAID1
• 4 x 800GB NVMe PCIe Write-intensive SFF SC2 SSD in RAID10
Software
• Microsoft Windows Server 2016 RTM
• Microsoft SQL Server 2016 RTM/CU1
Application software
The SQL Server version used for this testing was Microsoft SQL Server 2016 (RTM) – 13.0.1601.5 with Cumulative Update 1 - 13.00.2149.0.
This version was installed on the DL580 Gen9 using Microsoft Windows Server 2016 RTM.
Best practices and configuration guidance for the solution
HPE ProLiant DL580 Gen9
Initial setup for the DL580 Gen9 server consisted of various BIOS and SQL settings. The following BIOS settings were configured as derived from
the HPE best practices on DL580 Gen9 servers white paper.
• Hyper-Threading – Enabled
• Intel Turbo Boost – Enabled
• HPE Power Profile – Maximum Performance
• NUMA Group Size Optimization – Clustered (default)
• QPI Snoop configuration – Cluster-on-Die
NVMe PCIe Workload Accelerator configuration
• Accelerator cards in a RAID set should both be installed in slots that belong to the same socket to benefit from NUMA-based hardware
performance features.
• Figure 4 shows the location of the installed NVMe PCIe Workload Accelerator cards in a RAID1 set, highlighted in red. Because the cards are in a RAID set, we installed them in adjacent slots attached to a single socket to localize the mirroring I/O traffic.
Figure 4. HPE ProLiant DL580 Gen9 high-level block diagram. Workload Accelerator cards are installed on Slot 4 and Slot 5
SQL Server configuration guidance
• T834 flag / large-page allocation – Enabled
• Max Degree of Parallelism (MAXDOP) – set to 1
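As a reference, the following T-SQL is a minimal sketch of applying the MAXDOP setting above. Trace flag T834 cannot be enabled from T-SQL at runtime; it is supplied as the -T834 startup parameter (for example, through SQL Server Configuration Manager).

```sql
-- Minimal sketch of the SQL Server configuration guidance above.
-- Note: trace flag 834 (large-page buffer pool) must be supplied as the -T834
-- startup parameter; it cannot be enabled with DBCC TRACEON at runtime.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Limit parallelism for this OLTP workload, per the guidance above.
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```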
Capacity and sizing
Workload description and test methodology
The OLTP databases are part of a stock trading application emulator, in which clients connect to the databases and perform trade buy, sell and
market orders and reports. The workload used for characterizing the HPE ProLiant DL580 Gen9 server consisted of eight 100GB OLTP
databases. Each database is spread across eight 12.5GB database files.
For each test, the workload was run long enough to warm the SQL buffer pool; the pool was deemed warm when the average bytes per read Windows® Performance Monitor counter dropped from 64K to 8K bytes. Once the buffer pool was warm, a measurement period of approximately 15 minutes was used to record steady-state performance.
Metrics were collected using the Windows Performance Monitor tool and included counters such as CPU Utilization, Physical Disk counters, and
SQL Batch requests per second.
The figure below illustrates our test layout: the DL580 Gen9 server hosting 800GB of OLTP databases, connected to the workload engine VMs over a 10Gb network.
Figure 5. OLTP workload test layout: eight 100GB OLTP databases (800GB total) with separate data and log volumes on the DL580 Gen9; workload engine VMs connect over a 10Gb switch to simulate trade buy/sell orders, market orders, and reports, with market data and transactions captured in the OLTP databases
The following sections will describe the test results and analysis for several key decision points in the design of our SQL Server environment.
• NVMe characterization – Characterize SSDs and Workload Accelerator cards to determine the layout of SQL database files.
• SQL memory characterization – Analyze the least amount of memory required by SQL Server while maintaining reasonable performance.
• HPE ProLiant DL580 Gen9 system and SQL settings characterization results – Configure DL580 Gen9 BIOS settings to deliver optimal
performance.
• SQL backup tests – Analyze workload performance during concurrent backup jobs.
• SQL index rebuild tests – Analyze workload performance during concurrent online index rebuilds.
• Processor comparison (Broadwell versus Haswell) – Performance comparison between Broadwell and Haswell processors
NVMe characterization
As part of this Reference Architecture, we evaluated two different internal media options to determine their best use in SQL deployments. In the
environment, we have both a RAID10 array with four 800GB NVMe SSDs and a RAID1 array with two Workload Accelerators.
The Flexible IO Tester (FIO) tool was used to measure IOPS, throughput, and latency. Tests were run to measure random read, random write,
sequential write, and random read-write measurements, sweeping through different Queue Depths to find the optimal test drive point with
reasonable latencies that would mirror the requirements of a transaction DB server.
Read tests were performed at 8K byte reads, while sequential writes were set at 64K bytes. Duration for each data collection run was set to
5 minutes. A 12.5GB test file similar in size to our actual database data file was used. See Appendix B for the FIO tool configuration.
As shown in Figure 6 below, the SSDs delivered better read performance, with 16.3% more IOPS than the Workload Accelerators during random read tests.
Figure 6. NVMe characterization – Random Read results at optimal Queue Depth: SSDs (RAID10) 118,763 IOPS versus Workload Accelerators (RAID1) 102,157 IOPS, a 16.3% advantage for the SSDs
Figure 7 shows that SSDs had better read-write performance with 21.8% more IOPS than the Workload Accelerators during random read-write
tests.
Figure 7. NVMe characterization – Random Read-Write results at optimal Queue Depth: SSDs (RAID10) 87,540 IOPS versus Workload Accelerators (RAID1) 71,815 IOPS, a 21.8% advantage for the SSDs
Figure 8 shows that SSDs had better write performance with 9.7% more IOPS than the Workload Accelerators during random write tests.
Figure 8. NVMe characterization – Random Write results at optimal Queue Depth: SSDs (RAID10) 92,695 IOPS versus Workload Accelerators (RAID1) 84,521 IOPS, a 9.7% advantage for the SSDs
The bar graph below shows that the SSDs offer more than twice the sequential write throughput of the Workload Accelerators, at 3.5GB/sec. The SSDs were chosen over the Workload Accelerators for transaction log files due to this higher bandwidth.
Figure 9. NVMe characterization – Sequential Write results at optimal queue depth of 256: SSDs (RAID10) 3,484 MB/sec versus Workload Accelerators (RAID1) 1,634 MB/sec, more than 2x the throughput
One of the greatest benefits of NVMe PCIe-based SSDs is very low latency. The graph below shows latencies at queue depth of 32.
Figure 10. NVMe characterization – Latency results at queue depth of 32, in microseconds: Random read – 195 (RAID10) vs. 226 (RAID1); Random write – 203 vs. 225; Random read-write – 321 vs. 455; Sequential write – 615 vs. 1,258. The RAID10 SSDs offer up to a 51% reduction in latency
Based on our NVMe characterization results, we recommend the NVMe SSDs (RAID10) over the Workload Accelerators (RAID1) for the transaction logs to get the best transactional throughput. Measured IOPS on the two NVMe products differ by only about 16%, while sequential write throughput is over 100% better on the NVMe SSDs (RAID10), at about 3.5GB/sec.
With our analysis, we recommend the following:
• NVMe PCIe Workload Accelerators for data where reasonable read/write (60/40) performance is sustained.
• NVMe PCIe SSDs using four disks for logs where high sequential transactional write access is essential.
SQL memory characterization
By default, SQL Server dynamically adjusts its memory consumption based on available system resources. With 6TB of RAM in the DL580 Gen9 server, our SQL Server instance has ample memory available for large database workloads. The following tests show that, below a certain point, too little memory has a negative impact on performance; memory sizing is key to achieving high performance.
For our workload, we limited the memory that SQL Server can use by setting the maximum server memory. To determine the least amount appropriate for our workload, we performed a memory ramp-down test that started with 1.6TB of max memory and then decreased the amount 30-60 minutes after the database was warm. Measurements were taken at max memory settings of 1.6TB, 800GB, 600GB, 400GB, 200GB, 100GB, and 50GB.
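Each step of the ramp-down changes the instance-level max server memory setting; the following is a minimal sketch for the 800GB (1:1) step, assuming the value is supplied in megabytes.

```sql
-- Sketch: cap SQL Server memory at 800GB (819,200 MB) for one step of the
-- ramp-down test; the same command was repeated with smaller values per step.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 819200;
RECONFIGURE;
```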
Figure 11 shows the Windows Performance Monitor graph, which indicates optimal performance at a 1:1 ratio (800GB). As we step down the SQL Server maximum memory, disk reads (blue line) increase and transactional performance (green line) is slightly reduced. At 200GB and below, workload performance decreased significantly.
Figure 11. SQL Server performance under varied SQL maximum memory settings (the optimal setting of 800GB yields 51K batch requests per second)
Figure 12 below shows the charted results of our memory test. There are three key points to this test:
• Memory sizing for an 800GB database benefits from 400GB or more of RAM. Any smaller memory configurations resulted in drastic
performance reductions. Larger amounts provide for database growth and workload surges.
• Batch requests per second (BRPS) drops by 33% from 47K to 31.5K when SQL Server max memory is reduced to a 1:4 ratio (200GB).
• Disk reads are minimal at 206 reads/sec when SQL Server max memory is set to 800GB, compared to disk reads of 8.3K, 39K, and 65K at max memory settings of 600GB, 400GB, and the breaking point of 200GB, respectively.
We chose 800GB of max memory, which gives us over 51K batch requests per second and roughly 200 read IOPS for the simulated work.
Figure 12. SQL max memory requirement: memory ramp-down test results, plotting batch requests/sec and disk reads/sec at SQL max memory settings from 1.6TB down to 50GB (at 800GB the workload sustains 51,600 BRPS with 206 disk reads/sec)
SQL OLTP workload test
Once the SQL databases are laid out optimally on NVMe storage and the SQL max memory is set, our OLTP SQL environment is ready for SQL OLTP workload performance testing. The purpose of the OLTP test was to show a stable running OLTP workload within the compute capabilities of the HPE ProLiant DL580 Gen9 server. Several BIOS and SQL Server settings were modified to establish which settings provide the best performance for this particular workload. In doing so, this Reference Architecture serves as optimization guidance for customers deploying SQL Server on HPE ProLiant servers.
Microsoft Windows Server 2016 RTM and SQL Server 2016 RTM/CU1 were the foundation for these tests.
Eight 100GB databases were used to simulate a large-scale environment with high CPU and I/O utilization. Each 100GB database was spread
into eight 12.5GB data files, which were located on Drive D. Logs were kept on Drive F.
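The file layout for one such database might look like the sketch below; the database name, file names, folder paths, and log file size are illustrative, not the ones used in testing.

```sql
-- Sketch: one 100GB OLTP database with eight 12.5GB data files on the
-- Workload Accelerator volume (D:) and the log on the NVMe SSD volume (F:).
-- All names, paths, and the log size are assumed placeholders.
CREATE DATABASE OLTP01
ON PRIMARY
    (NAME = OLTP01_data1, FILENAME = 'D:\SQLData\OLTP01_data1.mdf', SIZE = 12800MB),
    (NAME = OLTP01_data2, FILENAME = 'D:\SQLData\OLTP01_data2.ndf', SIZE = 12800MB),
    (NAME = OLTP01_data3, FILENAME = 'D:\SQLData\OLTP01_data3.ndf', SIZE = 12800MB),
    (NAME = OLTP01_data4, FILENAME = 'D:\SQLData\OLTP01_data4.ndf', SIZE = 12800MB),
    (NAME = OLTP01_data5, FILENAME = 'D:\SQLData\OLTP01_data5.ndf', SIZE = 12800MB),
    (NAME = OLTP01_data6, FILENAME = 'D:\SQLData\OLTP01_data6.ndf', SIZE = 12800MB),
    (NAME = OLTP01_data7, FILENAME = 'D:\SQLData\OLTP01_data7.ndf', SIZE = 12800MB),
    (NAME = OLTP01_data8, FILENAME = 'D:\SQLData\OLTP01_data8.ndf', SIZE = 12800MB)
LOG ON
    (NAME = OLTP01_log, FILENAME = 'F:\SQLLogs\OLTP01_log.ldf', SIZE = 20480MB);
```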
The buffer pool was warmed up before each measurement was taken. The buffer pool is considered warm when initial warmup read ahead
activity ends. This occurs when the average bytes per read counter drops from 64K to 8K, along with total CPU utilization in a steady, leveled
state. Transactional throughput was measured with SQL Server Performance monitor counter “Batch requests per second”.
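Besides Performance Monitor, the same counter can be sampled from within SQL Server; the sketch below reads the cumulative Batch Requests/sec counter twice and averages over a 10-second window.

```sql
-- Sketch: approximate batch requests per second from SQL Server's own DMV.
-- The DMV exposes a cumulative count, so two samples and a delta are needed.
DECLARE @start BIGINT, @end BIGINT;

SELECT @start = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

WAITFOR DELAY '00:00:10';

SELECT @end = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

SELECT (@end - @start) / 10 AS batch_requests_per_sec;
```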
A baseline test with a workload drive point of 28 users per database was used to achieve an overall system CPU utilization of about 80%. The baseline consisted of the following BIOS and SQL Server settings:
• Hyper-Threading disabled
• Power Profile – Maximum Performance
• Power Regulator – High Static Performance mode
• NUMA Group Size Optimization – Clustered
• QPI Snoop configuration – Cluster-on-Die
• Hardware NUMA / CPU affinity testing
• SQL Server – Soft-NUMA disabled
• No Trace Flags
Each setting was tested to evaluate its impact on performance:
• Hyper-Threading enabled vs. Hyper-Threading disabled
• With T834 flag (large-page allocation) vs. without T834 flag
• SQL soft-NUMA enabled vs. soft-NUMA disabled
• NUMA Group Size Optimization and QPI Snoop configuration settings:
– Clustered / Home Snoop
– Flat / Home Snoop
– Clustered / Cluster-on-Die
– Flat / Cluster-on-Die
• CPU-affinity vs. No CPU-affinity testing
Details for the above settings are described in the next section.
Once our baseline test was complete, our first comparison was with T834 flag versus no T834 flag. We kept the setting with the better
performance for the rest of the tests.
In order to minimize reboots and shorten test cycles (taking advantage of a warm buffer pool), we grouped several tests together. In addition, to minimize test harness configuration changes, we kept all test configurations in three main groups: 4 NUMA nodes, 8 NUMA nodes, and no port-to-CPU affinity.
Note
This characterization shows the impact and gains measured for the OLTP workload used. Every workload is different and should be evaluated for
optimum settings separately.
HPE ProLiant DL580 Gen9 system and SQL settings characterization results
Hyper-Threading
When enabled, Hyper-Threading allows a physical processor core to run two threads of execution. The Windows operating system then sees double the logical CPUs per NUMA node, which often results in an increase in performance. However, two logical processors sharing the same physical resources can increase resource contention and processor overhead, so some workloads can experience decreased performance. HPE recommends testing with Hyper-Threading whenever possible to see if it is better for your workload.
CPU affinity was used to align different workload databases with specific CPU (or NUMA) nodes. With CPU affinity, each database workload ran on a specific NUMA node and benefited from local NUMA memory access; with affinity in place, enabling Hyper-Threading increased performance by 15%, from 53.9K BRPS to 61.8K BRPS. With no CPU affinity, enabling Hyper-Threading also increased performance by 18%, from 48.7K to 57.4K BRPS. In addition, having Hyper-Threading enabled allowed us to increase the user workload from 28 to 42 users per database (50% more users) while keeping CPU resources at the same 80% utilization.
Table 2. Batch requests on Broadwell-based DL580 Gen9. Key results for Hyper-Threading and CPU affinity options
Key test points with Hyper-Threading and CPU affinity | BRPS   | CPU_Total % | Latency / ms (Data) | Latency / ms (Logs)
HT enabled @ 28 users – with affinity                 | 54,569 | 66.43%      | 0.000               | 0.000
HT enabled @ 42 users – with affinity                 | 61,790 | 80.16%      | 0.001               | 0.000
HT enabled @ 42 users – no affinity                   | 57,420 | 86.36%      | 0.000               | 0.000
HT disabled @ 28 users – with affinity                | 53,854 | 80.70%      | 0.001               | 0.000
HT disabled @ 28 users – no affinity                  | 48,725 | 83.59%      | 0.001               | 0.000
NUMA Group Size Optimization and QPI Snoop configuration
The NUMA Group Size Optimization option on Gen9 servers is used to configure how the system ROM reports the number of logical processors
in a NUMA node. This option can be set to Flat or Clustered (default). When set to Clustered, the physical socket will serve as the boundary for
the NUMA process group. When set to Flat, Windows will adjust the processor group in an effort to minimize the number of groups and balance
their size. In addition, the above setting works in conjunction with QPI links. The QPI Snoop configuration option determines the Snoop mode
used by the QPI bus. When in Home Snoop mode, the number of hardware NUMA nodes reported in BIOS is the same as the number of NUMA
nodes reported in Windows. However, when Cluster-on-Die is selected, 2 NUMA nodes per socket are reported to the Windows operating
system. See Appendix D for a pictorial representation on how NUMA nodes are mapped in Windows and within SQL Server.
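A quick way to confirm how many NUMA nodes SQL Server actually sees after changing these BIOS settings is to query sys.dm_os_nodes, as in the sketch below; with Cluster-on-Die on this 4-socket server, eight nodes would be expected.

```sql
-- Sketch: list the NUMA nodes visible to SQL Server and their scheduler counts.
SELECT node_id, node_state_desc, memory_node_id, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';   -- exclude the dedicated admin connection node
```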
These two BIOS settings can greatly impact performance of SQL workloads. For our workload, we cycled through the NUMA/QPI combinations to
find the optimal setting to yield the best performance.
Table 3 shows performance results for the NUMA/QPI settings. For our workload environment, only a 1-4% difference separates the measured batch requests when toggling among the four combinations of Flat/Clustered and Home Snoop/Cluster-on-Die, across variations of CPU affinity and Hyper-Threading enabled/disabled.
With CPU affinity, we experienced only a 1% difference in batch requests per second, with the four possible combinations of NUMA/QPI settings.
Meanwhile, we saw a 4% difference in batch requests with no CPU affinity. In most cases, the highest batch requests were achieved when set to
Flat / Cluster-on-Die for our workload while testing with the variations of with/without CPU affinity and Hyper-Threading enabled/disabled.
While our testing with the NUMA/QPI settings in BIOS showed minimal performance differences for our workload, as shown in Table 3, we kept the Flat / Cluster-on-Die setting to get the highest batch requests per second. As your workload may experience a wider swing in performance, HPE recommends testing these BIOS settings to get the best performance possible for your workload.
Table 3. Batch requests on Broadwell-based DL580 Gen9. NUMA Group / QPI Link optimization Hyper-Threading enabled with CPU affinity
NUMA Group Size / QPI Snoop | BRPS   | CPU_Total % | Latency / ms (Data) | Latency / ms (Logs)
Clustered / Cluster-on-Die  | 61,267 | 76.81%      | 0.000               | 0.000
Clustered / Home Snoop      | 61,494 | 85.15%      | 0.000               | 0.000
Flat / Cluster-on-Die       | 61,790 | 80.16%      | 0.001               | 0.000
Flat / Home Snoop           | 61,329 | 78.48%      | 0.000               | 0.000
Trace flag T834 / Large-page allocation
Trace flag T834 can be enabled to have SQL Server use large-page allocations for the memory buffer pool. This flag can improve performance by increasing the efficiency of the translation lookaside buffer (TLB) in the CPU. T834 applies only to 64-bit versions of SQL Server, and the SQL Server service account must have the Lock pages in memory user right for the trace flag to take effect.
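After enabling the flag and the Lock pages in memory right, a quick sanity check is to look at the process memory DMV, as sketched below; non-zero large-page allocations indicate the large-page buffer pool is in use.

```sql
-- Sketch: verify large-page and locked-page allocations after starting with -T834.
SELECT large_page_allocations_kb,
       locked_page_allocations_kb,
       physical_memory_in_use_kb
FROM sys.dm_os_process_memory;
```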
Our test shows that enabling this flag gave a gain of 1.2% in batch requests per second, from 53.1K to 53.7K. The benefit of large pages in memory is workload dependent, and we recommend testing it to establish the value for each particular workload.
Table 4. Batch requests on Broadwell-based DL580 Gen9. T834/large-page allocation vs. no T834/large-page allocation result
Setting                                    | BRPS   | CPU_Total % | Latency / ms (Data) | Latency / ms (Logs)
No Trace flag T834 / Large-page allocation | 53,077 | 80.90%      | 0.000               | 0.000
Trace flag T834 / Large-page allocation    | 53,710 | 80.25%      | 0.002               | 0.000
SQL server – Automatic soft-NUMA
Soft-NUMA allows SQL Server to further group CPUs into nodes at the software level. Enabled by default, automatic soft-NUMA sub-divides hardware NUMA nodes into smaller groups.
Enabling automatic soft-NUMA will increase the number of soft-NUMA nodes reported within SQL. When using affinity, make sure to verify the
number of NUMA nodes within the SQL logs to correctly set up your affinitized workload.
Automatic soft-NUMA can be beneficial on servers with or without hardware NUMA. However, this further division of CPUs can cause contention and may actually degrade performance, depending on the workload, as the processor divisions are spread across multiple sockets. In Table 5, we see a 2-3% decrease in batch requests when automatic soft-NUMA is enabled. Alternatively, you can manually configure how SQL Server divides the logical processors into processor groups by entering a node configuration in the registry with a custom CPU mask and processor group.
Table 5. Batch requests on Broadwell-based DL580 Gen9. Automatic Soft-NUMA enabled vs. disabled result with Hyper-Threading disabled
Setting            | BRPS   | CPU_Total % | Latency / ms (Data) | Latency / ms (Logs)
Soft-NUMA enabled  | 52,357 | 79.73%      | 0.001               | 0.000
Soft-NUMA disabled | 53,854 | 80.77%      | 0.001               | 0.000
We created a custom soft-NUMA affinity with 8 nodes so that each database connects to its own node. Each node consists of half of the logical processors from a single socket. Table 6 shows the performance gain of custom soft-NUMA over automatic soft-NUMA; we saw a 2% increase in batch requests, from 51,790 to 52,830. Setting the CPU mask and processor group is done in the registry; Appendix D shows how to set up a manual soft-NUMA configuration.
HPE recommends testing with automatic or manual soft-NUMA to optimize your workload.
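In SQL Server 2016, automatic soft-NUMA can be toggled at the instance level with the statement sketched below (the registry approach in Appendix D is still needed for a fully custom node layout); the change takes effect after the instance is restarted.

```sql
-- Sketch: disable automatic soft-NUMA (use ON to re-enable); restart required.
ALTER SERVER CONFIGURATION SET SOFTNUMA OFF;
```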
Note
SQL Server databases can have affinity to either SQL soft-NUMA nodes, or to the underlying hardware NUMA nodes when soft-NUMA is disabled.
Table 6. Batch requests on Haswell-based DL580 Gen9. Results at key test points with best settings and CPU affinity.
Haswell server – 30 users                | BRPS   | CPU_Total % | Latency / ms (Data) | Latency / ms (Logs)
HT enabled – Soft-NUMA disabled          | 51,441 | 81.93%      | 0.000               | 0.000
HT enabled – Automatic soft-NUMA enabled | 51,790 | 78.79%      | 0.000               | 0.000
HT enabled – Custom soft-NUMA            | 52,830 | 77.79%      | 0.000               | 0.000
Hardware NUMA / CPU affinity
SQL Server is a NUMA-aware application that requires no special configuration to take advantage of NUMA-based hardware. On NUMA
hardware, processors are grouped together along with their own memory and I/O channels. Each group is called a NUMA node. A NUMA node
can access its local memory, and can also access memory from other NUMA nodes. However, access to local memory is faster than using the
remote memory from other NUMA nodes. Thus, the name Non-Uniform Memory Access architecture.
Comparing an affinitized workload to a non-affinitized one, we experienced a gain of 8%-11% in batch requests, depending on the Hyper-Threading setting: an 8% improvement with Hyper-Threading enabled and an 11% improvement with Hyper-Threading disabled, based on the data in Table 2.
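For reference, instance-level NUMA affinity can be set with the statement sketched below. Note that the per-database affinity used in this testing was achieved by mapping client connections to NUMA nodes (see Appendix C), not by this server-wide setting.

```sql
-- Sketch (instance level only): bind SQL Server schedulers to NUMA nodes 0-7.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 7;

-- Revert to the default, letting SQL Server manage affinity automatically.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```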
Overall performance gain with optimized BIOS and SQL settings
After identifying optimum BIOS and SQL settings and setting proper database affinity to hardware NUMA nodes, we measured a 26% cumulative
improvement in performance due to an 18% gain from enabling Hyper-Threading and a further 8% gain from setting database level affinity. It is
very important to review the system and SQL Server settings during deployment and proof-of-concept testing to take full advantage of the
performance capabilities of the HPE ProLiant DL580 server.
Figure 13. Cumulative performance gain with Hyper-Threading enabled and with database-to-CPU affinity (18% from Hyper-Threading plus 8% from CPU affinity, 26% total gain)
SQL backup tests
In addition to evaluating our primary database transactional workload and identifying optimum system settings for performance, the following
backup tests are intended to show the system’s ability to handle a secondary surge workload with no or minimal interference to the performance
of the primary workload.
Performing backups using the SQL Server Management Studio (SSMS) GUI is fairly easy and user-friendly; however, GUI-invoked backups do not
reveal certain backup command options that are only available using command line scripting. Running backups in Transact-SQL unlocks backup
switch options that can allow businesses to perform offline backups at faster speeds, or online backups at a slightly-reduced SQL performance
during the regular overnight maintenance periods.
Our backup test plan was based on a 100GB OLTP database; we ran backups in two sets with three different scenarios:
• Set #1 – No Load
– Default backup as found in SSMS, without Compression
– A backup with Compression as found in SSMS
– A compressed backup with switch options:
  - blocksize=65536
  - maxtransfersize=4194304
  - buffercount=300
• Set #2 – With OLTP workload running
– Default backup as found in SSMS, without Compression
– A backup with Compression as found in SSMS
– A compressed backup with switch options:
  - blocksize=65536
  - maxtransfersize=4194304
  - buffercount=300
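The custom-compression case corresponds to a Transact-SQL backup along the lines of the sketch below; the database name and backup target path are illustrative.

```sql
-- Sketch: compressed backup with the tuned switch options listed above.
BACKUP DATABASE OLTP01
TO DISK = N'G:\Backup\OLTP01_full.bak'
WITH COMPRESSION,
     BLOCKSIZE = 65536,
     MAXTRANSFERSIZE = 4194304,
     BUFFERCOUNT = 300,
     STATS = 10;        -- progress messages every 10%
```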
In Table 7 with no load, our test resulted in two key points:
• The rate of speed of the backup increased by 79% from 2430MB/sec to 4361MB/sec when using custom compression with additional options.
• The time spent on the backup decreased 44% from 46 seconds to just 26 seconds for a physical database size of 112.38GB.
To put this in a real-life scenario, consider a business with a 1TB SQL database: a full backup would take only 4 minutes and 33 seconds using the tuned compression command script.
Table 7. No Load – Compression / No compression backup test results
Backup performed with no active OLTP workload (database size = 112.38GB) | MB/sec | % Difference | Duration (sec) | % Difference
Backup with default settings   | 2,430 | --   | 46  | --
Backup with plain compression  | 362   | -85% | 311 | 572%
Backup with custom compression | 4,361 | 79%  | 26  | -44%
In Table 8 with a workload running, our test resulted in three key points:
• The rate of speed of the backup increased by 108% from 746MB/sec to 1549MB/sec when using custom compression with additional options.
• The time spent on the backup decreased 53% from 159 seconds to just 76 seconds for a physical database size of 112.38GB.
• Most important, performance on the running workload only decreased by 13.5% during the backup.
Table 8. With running workload – Compression / No compression backup test results
Backup performed during active OLTP workload (database size = 112.38GB) | BRPS before backup | BRPS during backup | % Diff | MB/sec   | % Diff | Duration (sec) | % Diff
Backup with default settings   | 48,852 | 48,362 | -1.0%  | 745.88   | --   | 159 | --
Backup with plain compression  | 48,762 | 48,096 | -1.4%  | 326.23   | -56% | 364 | 128%
Backup with custom compression | 48,953 | 42,322 | -13.5% | 1,548.63 | 108% | 76  | -53%
SQL index rebuild tests
The max degree of parallelism (MAXDOP) option determines the maximum number of processors to use during an index rebuild operation.
Using MAXDOP with the default value of zero (0), the server determines the number of CPUs that are used for the index operation, using only
the actual available number of processors or fewer based on the current system workload. However, you can manually configure the number of
processors used to run index operations by specifying the number of processors. By doing so, performance may be impacted positively or
negatively during online index rebuild. To find the optimal value for our workload, we measured workload performance during concurrent online
index rebuilds with varying MAXDOP values.
Our index rebuild test plan consists of running several degree-of-parallelism scenarios that measure the performance of the OLTP workload and the duration of the indexing during online index rebuilds. We used a single TPCE database with an un-partitioned index of 207,360,000 rows, 3,018,933 data pages, and 23.585GB in size. The first performance test was done with no parallelism (MAXDOP set to 1). The next tests were done with MAXDOP values of 96, 64, 32, and the default of 0.
The table below shows that leaving the MAXDOP value at zero (default) yields the best online performance, with only a slight decrease of 23% (average) in batch requests per second while the index rebuild took only 52 seconds. With the default value of zero, the server determines the number of CPUs used for the index operation based on the available processors and the current system workload. In contrast, manually over-subscribing CPUs can leave insufficient resources for other applications and database operations for the duration of the index operation, causing performance to decline.
For our workload, leaving MAXDOP index option at default (zero) is recommended during online index rebuild.
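An online rebuild with an explicit MAXDOP value looks like the sketch below; the table and index names are illustrative, and MAXDOP = 0 leaves the degree of parallelism to the server.

```sql
-- Sketch: online index rebuild, letting the server choose parallelism (MAXDOP = 0).
ALTER INDEX IX_TradeHistory ON dbo.TradeHistory
REBUILD WITH (ONLINE = ON, MAXDOP = 0);
```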
Table 9. Performance results during online index rebuild with varying MAXDOP values
Index rebuild on 20GB index table on 96-core Broadwell | BRPS before rebuild | BRPS (avg) during rebuild | % Diff | BRPS (min) during rebuild | % Diff | Duration | % Diff
MAXDOP=1  | 61,849 | 41,688 | -33% | 23,944 | -61% | 01:19:03 | 52%
MAXDOP=96 | 61,849 | 46,354 | -25% | 33,004 | -47% | 00:58:09 | 12%
MAXDOP=64 | 61,849 | 35,117 | -43% | 18,442 | -70% | 01:55:57 | 121%
MAXDOP=32 | 61,849 | 34,351 | -44% | 16,166 | -74% | 02:04:13 | 138%
MAXDOP=0  | 61,849 | 47,567 | -23% | 34,407 | -44% | 00:52:48 | --
Processor comparison (Broadwell versus Haswell)
THE HPE ProLiant DL580 Gen9 supports Intel Xeon 4800/8800 v3/v4 processors. Selecting the right processor for the SQL workload is key to
the design of the SQL server environment. The next step is to compare the newer Broadwell processor with the Haswell processor. With the best
BIOS and SQL settings, with CPU affinity on Broadwell (as shown on Table 4), the Flat/Cluster-on-Die combination yielded 61,790 batch
requests per seconds. We took the same settings and ran the same workload with the same NVMe PCIe storage on a Haswell-based DL580
Gen9 server and compared results. Keep in mind that the Haswell-based server also has four processors with only 18 cores each but running at a
higher frequency.
The table and graph below show a 20% performance increase when upgrading from Haswell to Broadwell processors.
Table 10. Performance comparison between Broadwell and Haswell processors with optimal BIOS and SQL settings
DL580 server – 30 users             | BRPS   | CPU_Total % | Latency / ms (Data) | Latency / ms (Logs)
DL580 Gen9 with Haswell processor   | 51,441 | 81.93%      | 0.000               | 0.000
DL580 Gen9 with Broadwell processor | 61,790 | 80.16%      | 0.001               | 0.000
Figure 14. Broadwell versus Haswell performance results (61,790 vs. 51,441 batch requests per second, a 20% performance increase with Broadwell CPUs)
Analysis and recommendations
• When using local storage, NVMe PCIe storage products bring benefits to a SQL Server solution that regular SSDs and HDDs cannot match. Whether in SFF drive form or as Workload Accelerator cards, these NVMe devices, when configured and used properly in the SQL Server environment, can dramatically improve disk access for your workload.
• Memory sizing is important. Memory allocations below a 1:2 memory-to-database ratio can result in performance degradation. Wherever possible, use a 1:1 ratio or better.
• Setting max server memory for each SQL Server instance and enabling locked, large-page memory for the buffer pool can improve workload performance (a minimal T-SQL sketch follows this list).
• Our testing set out to find the best configuration for our workload across several important server and SQL Server settings. Taking all tests into consideration, two best practices emerged: use CPU affinity, and prefer Cluster-on-Die over soft-NUMA.
• Using Hyper-Threading can positively impact your workload. Hyper-Threading enabled SQL Server to handle more load (an increase from 28 to 42 users in our workload) while improving batch request throughput by 18%.
• Using CPU affinity can greatly improve performance. Delivering an 11% improvement on our workload, CPU affinity assigns each database instance to its own NUMA nodes, giving it exclusive use of the logical processors, memory, and I/O of those nodes. When using affinity, verify the number of NUMA nodes reported by the operating system and by SQL Server to ensure affinitized workloads are working properly.
• While each workload is different, HPE recommends testing the NUMA Group Size Optimization and QPI Snoop configuration settings to get the best performance for your workload.
• For Gen9 servers with Broadwell processors, we recommend using hardware NUMA with Cluster-on-Die rather than SQL Server soft-NUMA when implementing a NUMA-affinitized workload.
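The following minimal T-SQL sketch, referenced in the memory recommendation above, illustrates how these settings can be applied per instance. The memory value and NUMA node range are illustrative assumptions rather than values from our tested configuration, and locked large pages additionally require the Lock pages in memory Windows privilege for the SQL Server service account.
-- Illustrative value: cap this instance's memory so consolidated instances do not compete.
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', 393216;
RECONFIGURE;

-- Illustrative NUMA node range: affinitize this instance to NUMA nodes 0 and 1.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 1;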
Summary
The HPE ProLiant DL580 Gen9 server is a powerful and versatile platform for Microsoft SQL Server consolidation deployments.
• The large number of storage options and PCIe expansion slots provides the flexibility needed to deploy high-performance, scalable SQL Server environments.
• Intel Broadwell processors improve upon prior generations, providing a 20% gain in our OLTP tests when compared to Haswell processors.
• When the DL580 Gen9 and Broadwell processors are configured optimally, we experienced a cumulative gain of 26% when compared to
default settings.
• The backup and index maintenance job testing showed minimal primary workload impact.
• The HPE ProLiant DL580 Gen9 with Broadwell-based Intel CPUs provided almost 62K batch requests per second, compared to approximately 51K using Haswell.
Our testing shows that this Reference Architecture provides exceptional performance, making it ideal for consolidation efforts during hardware refresh cycles. Highly optimized workloads maintained their performance during maintenance windows for backups and index rebuilds, giving confidence in the platform's behavior as a primary production database server.
Because of new hardware, new applications, and new policies and practices, consolidation is a never-ending effort. Our Reference Architecture provides an example platform whose performance and scalability make future consolidation easier and support the company's growth.
Appendix A: Bill of materials
Note
Part numbers are current as of the time of testing and are subject to change. The bill of materials does not include complete support options or other rack and power requirements. If you have questions regarding ordering, please consult your HPE Reseller or HPE Sales Representative for more details. hpe.com/us/en/services/consulting.html
Table 11. Bill of materials. Broadwell server
Qty | Part number | Description
HPE ProLiant DL580 Gen9 server
1  | 793161-B21 | HPE DL580 Gen9 CTO Server
1  | 816643-L21 | HPE DL580 Gen9 Intel Xeon E7-8890v4 (2.2GHz/24-core/60MB/165W) FIO Kit
3  | 816643-B21 | HPE DL580 Gen9 Intel Xeon E7-8890v4 (2.2GHz/24-core/60MB/165W) Kit
96 | 805358-B21 | HPE 64GB (1x64GB) Quad Rank x4 DDR4-2400 CAS-17-17-17 Load Reduced Memory
8  | 788360-B21 | HPE DL580 Gen9 12 DIMMs Memory Cartridge
1  | 788359-B21 | HPE DL580 Gen9 NVMe 5 SSD Express Bay Kit
2  | 691866-B21 | HPE 400GB 6G SATA ME 2.5in SC EM SSD
2  | 803197-B21 | HPE 1.6TB NVMe WI HH PCIe Accelerator
4  | 736939-B21 | HPE 800GB NVMe PCIe WI SFF SC2 SSD
1  | 732456-B21 | HPE Flex Fbr 10Gb 2P 556FLR-SFP+FIO Adptr
1  | 758836-B21 | HPE 2GB FIO Flash Backed Write Cache
4  | 656364-B21 | HPE 1200W CS Plat PL HtPlg Pwr Supply Kit
1  | BD505A     | HPE iLO Adv incl 3yr TSU 1-Svr Lic
Appendix B: FIO tool command line configuration for disk I/O benchmarking
Running the FIO tool with a job file minimizes manual run starts and stops, and multiple tests can be combined to simplify testing. A global section sets the defaults for the jobs (or tests) described in the file, which reduces options that would otherwise be repeated for each test, such as file name, test duration, or number of threads. The example below performs four disk I/O tests – Random Read, Random Write, Sequential Write, and Mixed Random Read-Write – each with one request outstanding. The three random I/O tests use an 8K block size, while the sequential write test uses a 64K block size. The global section configures every test to run with one thread for 5 minutes against the fio_testfile test file, using non-buffered I/O.
# Global defaults inherited by every job below: Windows async I/O engine,
# non-buffered I/O, 5-minute time-based runs against a single test file.
[global]
ioengine=windowsaio
size=12500MB
direct=1
time_based
runtime=300
directory=/fio
filename=fio_testfile
thread=1
new_group
# 8K random read, one outstanding request; stonewall serializes the jobs.
[rand-read-1]
iodepth=1
bs=8k
rw=randread
stonewall
# 64K sequential write
[seq-write-1]
iodepth=1
bs=64k
rw=write
stonewall
# 8K random write
[rand-write-1]
iodepth=1
bs=8k
rw=randwrite
stonewall
# 8K mixed random read/write (rwmixread=40: 40% reads, 60% writes)
[rand-40/60-1]
iodepth=1
bs=8k
rw=randrw
rwmixread=40
For more information about using this tool, visit: http://bluestop.org/fio/
Appendix C: Understanding hardware and software NUMA in SQL Server 2016 and how to set
affinity on Broadwell
Figure 15 shows how NUMA nodes are mapped in Windows and within SQL Server, and the resulting TCP/IP port-to-NUMA-node bitmask for each configuration.
[Figure 15 diagram: NUMA node layout for a 4-socket, 8-core-per-socket system under four combinations of QPI Snoop mode and SQL Server automatic soft-NUMA, showing the sockets, NUMA nodes, and cores for each case. With QPI Snoop set to Home Snoop, 4 hardware NUMA nodes are seen in Resource Monitor; with automatic soft-NUMA enabled (the default configuration), SQL Server ignores hardware NUMA and overlays its own nodes, while with it disabled the same 4 nodes are seen in SQL Server. With QPI Snoop set to Cluster on Die, 8 hardware NUMA nodes are seen in Resource Monitor and SQL Server; with automatic soft-NUMA enabled, SQL Server splits them into 16. The corresponding TCP/IP port-to-node bitmasks range from 4 bits (1111) through 8 bits (11111111) to 16 bits (1111111111111111).]
Figure 15. NUMA node configuration on 4-socket 8-core system
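As a quick way to tell which of these configurations a server is actually running, SQL Server 2016 exposes the soft-NUMA mode and processor layout through sys.dm_os_sys_info; the query below is a minimal sketch, and the softnuma_configuration_desc column assumes SQL Server 2016.
-- Show the processor layout and soft-NUMA mode as SQL Server sees them.
SELECT cpu_count,                    -- logical processors visible to SQL Server
       hyperthread_ratio,            -- logical processors exposed per physical processor package
       softnuma_configuration_desc   -- OFF (hardware default), ON (automatic), or MANUAL (registry)
FROM sys.dm_os_sys_info;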
Appendix D: Creating a custom soft-NUMA configuration
This example is for a 4-socket server with 18-core processors and Hyper-Threading enabled, running an eight-database workload. General instructions can be found at https://msdn.microsoft.com/en-us/library/ms345357.aspx
• First, we split each socket's 36 logical cores into two halves of 18 logical cores each.
• With a programmer's calculator, enter 18 adjacent ones to represent the first 18 adjacent logical cores and convert to hex or decimal.
11 1111 1111 1111 1111 = 3FFFF (hex) or 262143 (dec)
• Then enter 18 ones to represent the next 18 adjacent logical cores, followed by 18 zeros for the first 18 logical cores that were used previously.
1111 1111 1111 1111 1100 0000 0000 0000 0000 = FFFFC0000 (hex) or 68719214592 (dec)
• Table 12 shows the CPUMask and group values for each node.
Table 12. Custom CPUMask and group for a manual soft-NUMA configuration that will position each node within half a socket
Node | CPUMask (hex) | CPUMask (dec) | Group
Node 0 | 0x3FFFF | 262143 | 0
Node 1 | 0xFFFFC0000 | 68719214592 | 0
Node 2 | 0x3FFFF | 262143 | 1
Node 3 | 0xFFFFC0000 | 68719214592 | 1
Node 4 | 0x3FFFF | 262143 | 2
Node 5 | 0xFFFFC0000 | 68719214592 | 2
Node 6 | 0x3FFFF | 262143 | 3
Node 7 | 0xFFFFC0000 | 68719214592 | 3
• Using the registry, we create a NodeConfiguration key under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\130, and under it we create eight keys, one per node, holding the CPUMask and Group values shown in Table 12.
• Make sure automatic soft-NUMA is disabled, then restart SQL Server.
• Verify that the new soft-NUMA configuration took effect by checking the SQL Server error log.
• Verification can also be done by querying sys.dm_os_nodes (see the query sketch after this list).
• To affinitize each node to a database, we map a TCP/IP port to each node, using the node's bitmask in brackets. For more information, visit
https://msdn.microsoft.com/en-us/library/ms345346.aspx
TCP/IP Ports - 1500[1],1501[2],1502[4],1503[8],1504[16],1505[32],1506[64],1507[128]
• Each database then listens on a specific port to exclusively access its 18 logical cores within a physical socket.
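As a reference for the verification steps above, the following is a minimal T-SQL sketch, assuming SQL Server 2016; the DMV columns selected are a representative subset, and the expected counts reflect this eight-node example.
-- Disable automatic soft-NUMA so the manual registry configuration is used
-- (takes effect after the SQL Server restart mentioned above).
ALTER SERVER CONFIGURATION SET SOFTNUMA OFF;

-- After the restart, list the soft-NUMA nodes that SQL Server built.
-- For this example, expect eight nodes with 18 online schedulers each.
SELECT node_id,
       node_state_desc,
       memory_node_id,
       online_scheduler_count,
       cpu_affinity_mask
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';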
Resources and additional links
HPE ProLiant DL580 Gen9 server
hpe.com/servers/dl580
HPE Reference Architectures
hpe.com/info/ra
HPE Servers
hpe.com/servers
HPE Storage
hpe.com/storage
HPE Networking
hpe.com/networking
HPE Technology Consulting Services
hpe.com/us/en/services/consulting.html
HPE SSD Data Sheet
http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4AA4-7186ENW
Best practices configuring the HPE ProLiant DL560 and DL580 Gen9 Servers with Windows Server
http://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=4AA5-1110ENW
To help us improve our documents, please provide feedback at hpe.com/contact/feedback.
Sign up for updates
© Copyright 2016 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice.
The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying
such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall
not be liable for technical or editorial errors or omissions contained herein.
Microsoft, Windows Server, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States
and/or other countries. Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
4AA6-8301ENW, October 2016