
The Intel® processor roadmap for industry-standard servers
Technology brief, 10th Edition

Abstract ............................................................ 2
Introduction ........................................................ 2
Intel processor architecture and microarchitectures ................. 2
NetBurst® microarchitecture ......................................... 5
   Hyper-pipeline and clock frequency .............................. 5
   Hyper-Threading Technology ...................................... 7
   NetBurst microarchitecture on 90nm silicon process technology ... 9
      Extended hyper-pipeline ...................................... 10
      SSE3 instructions ............................................ 10
      64-bit extensions ― Intel 64 ................................. 10
      Two-core technology .......................................... 11
Intel Core™ microarchitecture ....................................... 12
   Processors ...................................................... 12
   Xeon two-core processors ........................................ 12
   Xeon four-core processors ....................................... 13
      Enhanced SpeedStep® Technology ............................... 14
      Intel® Virtualization Technology ............................. 15
Intel® Microarchitecture Nehalem .................................... 15
   Integrated memory controller .................................... 15
   Intel® QuickPath Technology ..................................... 16
   Three-level cache hierarchy ..................................... 17
   Intel® Hyper-Threading Technology ............................... 18
   Intel® Turbo Boost Technology ................................... 18
   Dynamic Power Management ........................................ 19
Performance comparisons ............................................. 20
   TPC-C performance ............................................... 20
   SPEC performance ................................................ 20
Conclusion .......................................................... 21
For more information ................................................ 22
Call to action ...................................................... 22

Abstract

Intel ® continues to introduce processor technologies that boost the performance of x86 processors in multi-threaded environments. This technology brief describes these processors and some of the more important innovations as they affect HP industry-standard enterprise servers.

Introduction

As standards-based computing has pushed into the enterprise server market, the demand for increased performance and greater variety in processor solutions has grown. To meet this demand, Intel continues to introduce processor innovations and new speeds. This technology brief summarizes the recent history and near-term plans for Intel processors as they relate to the industry-standard enterprise server market.

Intel processor architecture and microarchitectures

The Intel processor architecture refers to its x86 instruction set and registers that are exposed to programmers. The current x86 instruction set includes all instructions that the original 16-bit 8086 processor could execute and the enhanced instructions offered by the successor x86 processors.

Processor manufacturers such as Intel and AMD use a common processor architecture to maintain backward and forward compatibility of the instruction set among generations of their processors. Intel refers to the 32-bit version of the x86 processor architecture as Intel Architecture, 32-bit (IA-32), and to its 64-bit extensions as Intel 64. In comparison, the term "microarchitecture" refers to each processor's physical design that implements the instruction set. Processors with different microarchitectures, such as Intel and AMD x86 processors, can still execute a common instruction set.

Figure 1 shows the relationship between the x86 processor architecture and Intel’s evolving microarchitectures, as well as processors based on these microarchitectures.

Figure 1. Intel processor architecture and microarchitectures for industry-standard enterprise servers

Intel processor sequences are intended to help developers select the best processor for a particular platform design. Intel offers three processor number sequences for server applications (see Table 1).

Intel processor series numbers within a sequence (for example, 5100 series) help differentiate processor features such as number of cores, architecture, cache, power dissipation, and embedded Intel technologies.

Table 1. Intel processor sequences

Processor sequence                                                     | Platform
Two-Core Intel® Xeon® processor 3000 sequence                          | Uni-processor servers
Two-Core and Four-Core Intel® Xeon® processor 5000 sequence            | Two-processor high-volume servers and workstations
Two-Core, Four-Core, and Six-Core Intel® Xeon® processor 7000 sequence | Enterprise servers with 4 to 32 processors

Intel enhances the microarchitecture of a family of processors over time to improve performance and capability while maintaining compatibility with the processor architecture. One method to enhance the microarchitectures involves changing the silicon process technology. For example, Figure 2 shows that Intel enhanced NetBurst-based processors in 2004 by changing the manufacturing process from 130nm to 90nm silicon process technology.

In the second half of 2006, Intel launched the Core™ microarchitecture, which is the basis for the multi-core Xeon 5000 Sequence processors, including the first four-core Xeon processor (Clovertown).

Beginning with the Penryn family of processors, Intel enhanced the performance and energy efficiency of Intel Core microarchitecture-based processors by switching from 65nm to 45nm Hi-k¹ process technology with its hafnium-based high-k + metal gate transistor design. In 2009, Intel produced the first processors based on the "next generation" Nehalem microarchitecture.

Figure 2. Intel microarchitecture introductions and associated silicon process technologies for industry-standard servers

¹ Hi-k, or High-k, stands for high dielectric constant, a measure of how much charge a material can hold. For more information, refer to http://www.intel.com/technology/silicon/highk.htm?iid=tech_arch_45nm+body_hik

Table 2 includes more details about the release dates and features of previously released Intel x86 processors, as well as processors projected to be available through 2009.

Table 2. Release dates and features of Intel x86 processors

Code name   | Market name | Feature size | Description                  | Available/Projected | Cache                     | Max. rate* (MT/s)
Smithfield  | Pentium D   | 90nm         | Two-core uni-processor       | 1Q2005              | 2MB (1MB L2 per core)     | 800
Irwindale   | Xeon        | 90nm         | 2MB L2 version of Nocona     | 1Q2005              | 2MB L2                    | 800
Cranford    | Xeon MP     | 90nm         | Xeon MP version of Prescott  | 1Q2005              | 1MB L2                    | 667
Prescott 2M | Pentium 4   | 90nm         | 2MB L2 version of Prescott   | 1Q2005              | 2MB L2                    | 800
Potomac     | Xeon MP     | 90nm         | Xeon MP with large L3 cache  | 1Q2005              | 8MB L3                    | 667
Paxville    | Xeon        | 90nm         | Two-core processor           | 4Q2005              | 2x2MB L2                  | 800
Paxville MP | Xeon MP     | 90nm         | Two-core Xeon MP             | 1Q2006              | 2MB L2 per core           | >800
Presler     | Pentium D   | 65nm         | Two-core uni-processor       | 1H2006              | 2MB L2 per core           | 800
Dempsey     | Xeon 5000   | 65nm         | Two-core processor           | 1H2006              | 2MB L2 per core           | 1066
Woodcrest   | Xeon 5100   | 65nm         | Two-core                     | Mid-2006            | 4MB L2 shared             | 1333
Conroe      | Core 2 Duo  | 65nm         | Two-core, uni-processor      | 3Q2006              | 4MB L2 shared             | 1066
Tulsa       | Xeon MP     | 65nm         | Two-core                     | 4Q2006              | 16MB L3                   | 800
Clovertown  | Xeon 5300   | 65nm         | Four-core                    | 4Q2006              | 2x4MB L2                  | 1333
Tigerton    | Xeon 7300   | 65nm         | Four-core                    | 2H2007              | 8MB L2 (2x4MB)            | 1066
Harpertown  | Xeon 5400   | 45nm         | Four-core                    | 4Q2007              | 2x6MB L2                  | 1333/1600**
Wolfdale    | Xeon 5200   | 45nm         | Two-core                     | 1Q2008              | 1x6MB L2                  | 1333/1600**
Dunnington  | Xeon 7400   | 45nm         | Six-core                     | 3Q2008              | 16MB L3 shared            | 1066
Nehalem-EP  | Xeon 5500   | 45nm         | Four-core                    | 1Q2009              | 4x256KB L2, 8MB L3 shared | 6.4 GT/s QuickPath Interconnect

* MT/s is an abbreviation for mega-transfers per second. A bus operating at 200 MHz and transferring four data packets on each clock (referred to as quad-pumped) would have 800 MT/s.

** Selected chipsets only

NetBurst® microarchitecture

The NetBurst-based processor for low-cost, single-processor servers is the Pentium® 4 processor. The original 180nm version of the Pentium 4 was known as Willamette, and the subsequent 130nm version was known as Northwood. NetBurst-based processors intended for multi-processor environments are referred to as Intel® Xeon™ (for two-processor systems) and Xeon MP (for systems using more than two processors).

The NetBurst microarchitecture included the following enhancements:

• Higher bandwidth for instruction fetches

• 256-KB Level 2 (L2) cache with 64-byte cache lines

• NetBurst system bus: a 64-bit, 100-MHz bus capable of providing 3.2 GB/s of bandwidth by double pumping the address and quad pumping the data. The 100-MHz quad pumped data bus is also referred to as a 400-MHz data bus. To provide higher levels of performance, Intel added support for a 533-MHz front side bus to the Pentium 4 and Xeon processors and later added support for 800 MHz to the Pentium 4.

• Integer arithmetic logic unit (ALU) running at twice the clock speed (double data rate)

• Modified floating point unit (FPU)

• Streaming SIMD Extensions 2 (SSE2): 144 new SIMD instructions that improve floating-point, application, and multimedia performance.

• Advanced dynamic execution

• Deeper instruction window for out-of-order, speculative execution and improved branch prediction over the P6 dynamic execution core

• Execution trace cache (stores pre-decoded micro-operations)

• Enhanced floating point/multimedia engine

• Hyper-threading (HT) in Xeon processors and Pentium 4 processors (described below)
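The bus-bandwidth figures quoted above follow directly from bus width and transfer rate. As a quick sketch of the arithmetic (plain calculation, not any Intel tool):

```python
# Front-side-bus peak bandwidth = bus width (bytes) x transfer rate.
# A "quad-pumped" bus moves four data transfers per clock cycle.

def fsb_bandwidth_gbs(bus_width_bits: int, clock_mhz: int, transfers_per_clock: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes, as in this brief)."""
    transfers_per_s = clock_mhz * 1_000_000 * transfers_per_clock  # MT/s -> T/s
    return bus_width_bits / 8 * transfers_per_s / 1e9

# Original NetBurst bus: 64-bit, 100 MHz, quad-pumped -> 400 MT/s
print(fsb_bandwidth_gbs(64, 100, 4))   # 3.2 GB/s
# 90nm NetBurst bus: 64-bit, 200 MHz, quad-pumped -> 800 MT/s
print(fsb_bandwidth_gbs(64, 200, 4))   # 6.4 GB/s
```

The same arithmetic reproduces the 6.4-GB/s figure given later for the 90nm bus.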

Hyper-pipeline and clock frequency

One performance-enhancing feature of the NetBurst microarchitecture was its hyper-pipeline, a 20-stage branch-misprediction pipeline. Previous 32-bit processors had a 10-stage pipeline. The hyper-pipeline can contain more than 100 instructions at once and can handle up to 48 loads and stores concurrently. The pipeline in a processor is analogous to a factory assembly line where production is split into multiple stages to keep all factory workers busy and to complete multiple stages in parallel.

Likewise, the work to execute program code is split into stages to keep the processor busy and allow it to execute more code during each clock cycle. In this case, the processor must complete the operation for each stage within a single clock cycle. The processor can achieve this by splitting the task into smaller tasks and using more (shorter) stages to execute the instructions (Figure 3). Thus, each stage can be completed faster, allowing the processor to have a higher clock frequency.

However, it is important to understand that splitting each stage into smaller stages to achieve a higher clock frequency does not mean that more work is being done in the pipeline per clock cycle.


Figure 3.

By decreasing the amount of work done in each stage, the clock frequency can be increased.

A basic structure for a computer pipeline consists of the following four steps, which are performed repeatedly to execute a program.

1. Fetch the next instruction from the address stored in the program counter.
2. Store that instruction in the instruction register, decode it, and increment the address in the program counter.
3. Execute the instruction currently in the instruction register.
4. Write the results of that instruction from the execution unit back into the destination register.

Typical processor architectures split the pipeline into segments that perform those basic steps: the "front end" of the microprocessor, the execution engine, and the retire unit (Figure 4). The front end fetches the instruction and decodes it into smaller instructions (commonly referred to as micro-ops). These decoded instructions are sent to one of the three types of execution units (integer, load/store, or floating point) to be executed. Finally, the instruction is retired and the result is written back to its destination register.

Figure 4. Basic 4-stage pipeline schematic
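The throughput benefit of overlapping these stages can be sketched as a toy model; this is an illustrative simplification, not a model of any specific Intel pipeline:

```python
# Toy in-order pipeline: each instruction passes through 4 stages
# (fetch, decode, execute, writeback), one stage per clock cycle.
# With pipelining, a new instruction enters every cycle, so N
# instructions finish in (stages - 1) + N cycles instead of stages * N.

STAGES = 4

def cycles_unpipelined(n_instructions: int) -> int:
    """Each instruction runs all stages before the next starts."""
    return STAGES * n_instructions

def cycles_pipelined(n_instructions: int) -> int:
    """Stages overlap; after the pipe fills, one instruction retires per cycle."""
    return (STAGES - 1) + n_instructions

print(cycles_unpipelined(100))  # 400 cycles
print(cycles_pipelined(100))    # 103 cycles
```

Note that each individual instruction still takes four cycles; only the aggregate throughput improves, which matches the caution above that more stages do not mean more work per clock.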


Keeping the pipeline busy requires that the processor begin executing a second instruction before the first has traveled completely through the pipeline. However, suppose a program has an instruction that requires summing three numbers:

X = A + B + C

If the processor already has A and B stored in registers but needs to get C from memory, this causes a "bubble," or stall, in the pipeline: the processor cannot execute the instruction until it obtains the value for C from memory. This bubble must move all the way through the pipeline, forcing each stage that contains the bubble to sit idle, wasting execution resources during that clock cycle. Clearly, the longer the pipeline, the more significant this problem becomes.

Processor stalls often occur as a result of one instruction being dependent on another. If the program has a branch, such as an IF–THEN statement, the processor has two options: it either waits for the critical instruction to finish (stalling the pipeline) before deciding which program branch to take, or it predicts which branch the program will follow.

If the processor predicts the wrong code branch, it must flush the pipeline and start over again at the IF–THEN statement using the correct branch. The longer the pipeline, the higher the performance cost of a branch mispredict, because more speculative instructions must be discarded when a mispredict occurs. Specific to the NetBurst design was an improved branch-prediction algorithm aided by a large branch target array that stored branch predictions.
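The relationship between pipeline depth and mispredict cost can be sketched with a simple cost model; the branch frequency and mispredict rate below are assumed example values, not Intel figures:

```python
# Illustrative cost model: every mispredict flushes the pipeline,
# wasting roughly one cycle per stage, so the average cycles per
# instruction (CPI) grows with pipeline depth.

def effective_cpi(depth: int, branch_freq: float, mispredict_rate: float) -> float:
    """Base CPI of 1.0 plus the expected flush penalty per instruction."""
    return 1.0 + branch_freq * mispredict_rate * depth

# Same branch behavior (20% branches, 5% mispredicted), different depths:
print(round(effective_cpi(20, 0.2, 0.05), 2))  # 1.2  (20-stage hyper-pipeline)
print(round(effective_cpi(31, 0.2, 0.05), 2))  # 1.31 (31-stage extended pipeline)
```

The deeper pipeline pays a proportionally larger penalty for the same prediction accuracy, which is why the later 31-stage design needed an improved branch predictor.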

Hyper-Threading Technology

Intel Hyper-Threading (HT) Technology is a design enhancement for server environments. It takes advantage of the fact that, according to Intel estimates, the utilization rate for the execution units in a NetBurst processor is typically only about 35 percent. To improve the utilization rate, HT Technology adds Multi-Thread-Level Parallelism (MTLP) to the design. In essence, MTLP means that the core receives two instruction streams from the operating system (OS) to take advantage of idle cycles on the execution units of the processor. For one physical processor to appear as two distinct processors to the OS, the design replicates the pieces of the processor with which the OS interacts to create two logical processors in one package. These replicated components include the instruction pointer, the interrupt controller, and other general-purpose registers ― all of which are collectively referred to as the Architectural State, or AS (see Figure 5).

Figure 5. Hyper-Threading Technology: an IA-32 processor with HT Technology presents two Architectural States (AS1, AS2) on a single processor core as two logical processors, while a traditional dual-processor (DP) system provides a separate core and Architectural State per processor, each with its own system bus connection.


Since multi-processing operating systems such as Microsoft Windows and Linux are designed to divide their workload into threads that can be independently scheduled, these operating systems can send two distinct threads to work their way through execution in the same device. This provides a higher abstraction level of parallelism at the thread level rather than simply at the instruction level, as in the Pentium 4 design. Table 3 illustrates how instruction-level parallelism takes advantage of opportunities in the instruction stream to execute independent instructions at the same time. Thread-level parallelism, shown in Table 4, takes this a step further because two independent instruction streams are available for simultaneous execution opportunities.

It should be noted that the performance gain from adding HT Technology does not equal the expected gain from adding a second physical processor or processor core. The overhead to maintain the threads and the requirement to share processor resources limit HT Technology performance.

Nevertheless, HT Technology was a valuable and cost-effective addition to the Pentium 4 design.

Table 3. Example of instruction-level parallelism

Instruction number | Instruction thread | Instruction execution
1                  | Read register A    | Operations 1, 2, and 3 are independent and can execute simultaneously if resources permit.
2                  | Write register B   |
3                  | Read register C    |
4                  | Add A + B          | This operation must wait for instructions 1 and 2 to complete, but it can execute in parallel with operation 3.
5                  | Inc A              | This operation must wait for the completion of instruction 4 before executing.

Table 4. Example of thread-level parallelism

Instruction number | Instruction thread | Instruction number | Instruction thread
1a                 | Read register A    | 1b                 | Add D + E
2a                 | Write register B   | 2b                 | Inc E
3a                 | Read register C    | 3b                 | Read F
4a                 | Add A + B          | 4b                 | Add E + F
5a                 | Inc A              | 5b                 | Write E

Instruction execution: None of the instructions in Thread 2 depend on those in Thread 1; therefore, to the extent that execution units are available, any of them can execute in parallel with those in Thread 1. For example, instruction 2b must wait for instruction 1b, but does not need to wait for 1a. Similarly, if two arithmetic units are available, 4a and 4b can execute at the same time.
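A toy scheduler makes the slot-filling idea concrete. The stalls in thread 1 below are assumed for illustration; thread 2's instructions are those of Table 4 (this is a sketch of the concept, not of Intel's scheduling hardware):

```python
# Toy model of Hyper-Threading: one execution unit, two instruction
# streams. None marks a cycle where thread 1 is stalled (for example,
# waiting on memory); ready instructions from thread 2 fill those slots.

thread1 = ["read A", None, "write B", None, "add A+B"]
thread2 = ["add D+E", "inc E", "read F", "add E+F", "write E"]

def schedule(t1, t2):
    issued = []
    q2 = list(t2)
    for op in t1:
        if op is not None:
            issued.append(("T1", op))
        elif q2:                              # fill the stall with thread 2 work
            issued.append(("T2", q2.pop(0)))
        else:
            issued.append(("idle", None))
    issued += [("T2", op) for op in q2]       # drain the rest of thread 2
    return issued

result = schedule(thread1, thread2)
print(result)  # every cycle issues useful work; no idle slots remain
```

Run single-threaded, thread 1 alone would waste two of its five cycles; with the second stream, those cycles execute thread 2 instructions instead.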

According to Intel’s simulations, HT Technology achieves its objective of improving the microarchitecture utilization rate significantly. Improved performance is the real goal though, and Intel reports that the performance gain can be as high as 30 percent.

The performance gained by these design changes is limited by the fact that two threads now share and compete for processor resources, such as the execution pipeline and Level 1 (L1) and L2 caches.

There is some risk that data needed by one thread can be replaced in a cache by data that the other is using, resulting in a higher turnover of cache data (referred to as thrashing) and a reduced hit rate.


HT Technology also puts a heavier load on the OS to allocate threads and switch contexts on the device. Evaluating the threads for parallelism and context switching are OS tasks and increase the operating overhead.

HT Technology presents little in the way of software licensing issues. Intel asserts that the HT design is still only a single-processor unit, so customers should not have to purchase two software licenses for each processor. This is true for Microsoft SQL Server 2000 and Windows Server 2003, which only require one license for each physical processor, regardless of the number of logical processors it contains. However, Windows 2000 Server does not make this distinction between physical and logical processors and fills the licensing limit based on the number of processors the BIOS discovers at boot.

According to Intel, the system requirements for HT Technology are as follows:

• A processor that supports HT Technology²

• HT Technology-enabled chipset

• HT Technology-enabled system BIOS

• HT Technology-enabled/optimized operating system

For more information, refer to http://www.intel.com/products/ht/hyperthreading_more.htm

NetBurst microarchitecture on 90nm silicon process technology

In 2004, Intel introduced major improvements to the Pentium 4 and Xeon processor lines by changing the manufacturing process from 130nm to 90nm silicon process technology and adding numerous enhancements:

• Larger, more effective caches (1-MB or 2-MB L2 Advanced Transfer Cache, compared to 512 KB on the 0.13-micron Pentium 4 processor)

• Faster processor bus: a 64-bit, 200-MHz bus capable of providing 6.4 GB/s of bandwidth by double pumping the address and quad pumping the data. The 200-MHz quad-pumped data bus is also referred to as an 800-MT/s data bus.

• Extended hyper-pipeline (31 stages versus 20 stages) to enable high CPU core frequencies (described below)

• Enhanced execution units including the addition of a dedicated integer multiplier, and support for shift and rotate instruction execution on a fast ALU

• Improved branch prediction to help compensate for longer pipeline

• Streaming SIMD Extensions 3 (SSE3) instructions (described below)

• Larger execution schedulers and execution queues

• Improved hardware memory prefetcher

• Improved Hyper-Threading

• 64-bit extensions (described below)

• Two-core (for Smithfield, Dempsey, and Paxville)

² For more information, read the white paper "Introducing the 45nm next-generation Intel® Core™ microarchitecture at Intel® 45nm Hi-k silicon technology."

Extended hyper-pipeline

In keeping with its history of regularly increasing processor frequencies, Intel extended the hyper-pipeline from 20 stages (in the earlier Pentium 4 design) to 31 stages. The biggest drawback to this approach is that, as the pipe gets longer, interruptions (stalls) to the regular flow of instructions in the pipe become progressively more costly in terms of performance. To mitigate such stalls, Intel improved the branch-prediction algorithm sufficiently to prevent the deeper pipeline from degrading performance.

SSE3 instructions

The Prescott design added Streaming Single-Instruction-Multiple-Data (SIMD) Extensions 3 (SSE3), also known as the Prescott New Instructions. As in earlier processors, SIMD instructions provide the potential for improved performance because each instruction permits operation on multiple data items at the same time. Prescott processors added new arithmetic, graphics, and HT synchronization instructions.

The arithmetic group consists of one new instruction for converting x87 data into integer format, and five instructions that simplify the process of performing complex arithmetic. Complex numbers actually consist of two numbers: a real and an imaginary component. The additional instructions facilitate complex operations because they are designed to operate on both parts of these complex pairs of numbers at the same time. Using these instructions also simplifies coding complex arithmetic operations because fewer instructions are needed to accomplish the goal.
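The paired real/imaginary work can be illustrated in scalar form; SSE3's complex-arithmetic instructions operate on both components of such pairs at once. This sketch shows only the underlying arithmetic, not actual SSE3 intrinsics:

```python
# Complex multiply: (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
# SSE3 adds instructions that handle the subtract lane and add lane
# of such pairs together; here the same math in plain scalar code.

def complex_mul(a: float, b: float, c: float, d: float) -> tuple:
    real = a * c - b * d   # the "subtract" half of the pair
    imag = a * d + b * c   # the "add" half of the pair
    return (real, imag)

print(complex_mul(1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0), i.e. (1+2i)(3+4i) = -5+10i
```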

The graphics group contains one instruction for video encoding and four that are specific to graphics operations. Finally, two instructions facilitate HT operation, for example, by allowing one operational thread to be moved to a higher priority than another.

64-bit extensions —Intel 64

In response to market demands, Intel added 64-bit extensions to the x86 architecture of the Xeon, Xeon MP, and Pentium 4 processors. The key advantage of 64-bit processing is that the system can address a much larger flat memory space (up to 16 exabytes). Even though the 32-bit architecture can actually access up to 64 GB of memory, access above the standard 4-GB limit must go through a slow and cumbersome windowing facility such as Physical Address Extension (PAE). Due to the complexities of this process, most 32-bit applications have not made use of the higher address space.
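The address-space limits mentioned above are straightforward powers of two:

```python
# Address space reachable with n-bit addressing: 2**n bytes.
GiB = 2**30
EiB = 2**60

flat_32bit = 2**32          # 4 GiB: the standard 32-bit limit
pae_36bit  = 2**36          # 64 GiB: reachable only through PAE windowing
flat_64bit = 2**64          # 16 EiB: the 64-bit flat address space

print(flat_32bit // GiB)    # 4
print(pae_36bit // GiB)     # 64
print(flat_64bit // EiB)    # 16
```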

Today, few applications require more than 1 or 2 GB of memory; however, this will eventually change. By adding 64-bit extensions to its x86 processors, Intel has provided users with the same 64-bit addressing benefit at a much lower cost than if users were forced to replace both the hardware and software.

Even though the larger memory addressing capability is the primary advantage of 64-bit extensions, it is not the only one. The 64-bit extensions also provide a larger register set, with eight additional general-purpose registers (GPRs) and 64-bit versions of the existing registers. With a total of 16 GPRs, 64-bit extensions provide additional resources that compilers can use to increase performance.

64-bit Extensions

AMD was first to release 64-bit extensions ― called AMD64 ― with its Opteron processor in early 2003. Within a year, Intel responded with its own plans to deliver a similar solution called Extended Memory 64 Technology, or EM64T, which is broadly compatible with AMD64. In late 2006, Intel began using the name Intel 64 for its implementation. Intel 64 and AMD64 use the same register sets and definitions, and the 64-bit instructions are nearly identical. HP expects that any minor differences will be handled by the OS and compiler, so that the average application writer or customer should see no differences. New operating systems are required to make use of 64-bit extensions. Red Hat, SUSE, and Microsoft provide AMD64 support and Intel 64 support.


Two-core technology

Single-core processors that run multi-threaded applications become less cost effective with each increase in frequency. This is because the multiple threads compete for available compute resources, which limits the increase in performance at higher frequencies. Increasing the CPU core frequency not only delivers lower incremental performance gains, but also increases power requirements and heat generation. These factors create significant barriers for single-core architectures to keep pace with the growing needs of data centers.

To address the performance, power, and cooling complexities, Intel announced its first two-core processor architecture in 2005. A two-core processor is a single physical package that contains two, full processor cores per socket. The two cores have their own functional execution units and cache hierarchy (L1 and L2); however, the OS recognizes each execution core as an independent processor.

Figure 6 illustrates the difference between single-core and two-core processors with HT Technology. In the case of the single-core processor, HT Technology allows the OS to schedule two threads on the core by treating it as two separate "logical" processors with a shared 2-MB L2 cache. The two-core processor builds on HT Technology with two execution cores. Each core has its own 2-MB L2 cache and separate interface to an 800-MHz front side bus. The two-core architecture runs two threads on each execution core, allowing the processor to run up to four threads simultaneously. The second core’s additional capacity reduces competition for processor resources and increases processor utilization. Thus, the performance improvement of a two-core processor is in addition to the improvement due to HT Technology.

A two-core processor has better performance-per-watt than a single-core processor running at a higher frequency. This is analogous to the way a wide pipe, by virtue of its volume, can carry more water than a smaller pipe with a higher flow rate. Likewise, the two-core architecture is designed to make processors perform more efficiently at lower frequencies (and lower power consumption levels). The two-core processor allows a better balance between performance and power requirements. It was the first step in multi-core processor technology.

Figure 6. Implementation of Hyper-Threading Technology on a single processor core (left) and a two-core processor (right)


Intel Core™ microarchitecture

In 2006, Intel introduced the Core microarchitecture to extend the NetBurst microarchitecture features and to add the energy efficient features of Intel’s mobile microarchitecture. The Core microarchitecture uses less power and produces less heat than previous generation Intel processors. The Core microarchitecture features the following technologies that improve per-watt performance and energy efficiency:

• Intel® Wide Dynamic Execution enables delivery of more instructions per clock cycle to improve execution time and energy efficiency.

• Intel® Intelligent Power Capability reduces power consumption and design requirements.

• Intel® Smart Memory Access improves system performance by optimizing the use of the available data bandwidth from the memory subsystem.

• Intel® Advanced Smart Cache is optimized for multi-core and two-core processors to reduce latency to frequently used data, providing a higher-performance, more efficient cache subsystem.

• Intel® Advanced Digital Media Boost improves performance when executing SSE, SSE2, and SSE3 instructions. This technology accelerates a broad range of encryption, financial, engineering, and scientific applications.

• Streaming SIMD Extensions 4 (SSE4) instructions consist of 54 instructions that enhance video, graphics, and high-performance applications. These instructions are divided into two major categories: Vectorizing Compiler and Media Accelerators, and Efficient Accelerated String and Text Processing.

Processors

The two-core Intel Xeon 3000 and 5000 Sequence and the 7300 series processors are based on the Core microarchitecture.

Two-core processors that implement Hyper-Threading Technology (which excludes the Xeon 3000 Sequence processors) can simultaneously execute four software threads, thereby increasing processor utilization. To avoid saturation of the Front Side Bus (FSB), the Intel 5000 chipset widens the interface by providing dual independent buses. The Xeon 7300 series processors introduce an independent point-to-point interface between the chipset and each processor that allows full front-side-bus bandwidth.

Xeon two-core processors

The 64-bit Intel Xeon 3000 Sequence processors combine performance and power efficiency to enable smaller, quieter systems. Xeon 3000 Sequence processors run at a maximum frequency of 2.66 gigahertz (GHz), with 4 megabytes (MB) of shared L2 cache (Figure 7, left) and a maximum front-side bus speed of 1066 megahertz (MHz). These processors are compatible with IA-32 software and support single-processor operation. Xeon 3000 Sequence processors use the Intel 3000 or 3010 chipsets, which support Error Correction Code (ECC) memory for a high level of data integrity, reliability, and system uptime. ECC can detect multiple-bit memory errors and locate and correct single-bit errors to keep business applications running smoothly.
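The single-bit correction that ECC performs can be sketched with a minimal Hamming(7,4) code. This toy example illustrates the principle only; actual server DIMM ECC uses a wider SECDED code over 64 data bits:

```python
# Minimal single-error-correcting Hamming(7,4) sketch: 4 data bits are
# stored with 3 parity bits; recomputing the parities on read yields a
# "syndrome" that locates (and so corrects) any single flipped bit.

def encode(d):
    """4 data bits -> 7-bit codeword [p1, p2, d0, p3, d1, d2, d3]."""
    d0, d1, d2, d3 = d
    p1 = d0 ^ d1 ^ d3   # parity over codeword positions 1, 3, 5, 7
    p2 = d0 ^ d2 ^ d3   # parity over positions 2, 3, 6, 7
    p3 = d1 ^ d2 ^ d3   # parity over positions 4, 5, 6, 7
    return [p1, p2, d0, p3, d1, d2, d3]

def correct(cw):
    """Recompute parities; the syndrome is the 1-based error position."""
    c = list(cw)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the bad bit back
    return c

word = encode([1, 0, 1, 1])
flipped = list(word)
flipped[4] ^= 1                       # simulate a single-bit memory error
assert correct(flipped) == word       # the error is located and corrected
```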

The 64-bit Intel Xeon 5000 Sequence processors have two complete processor cores, including caches, buses, and execution states. The Xeon 5000 Sequence processors run at a maximum frequency of 3.73 GHz, with 2 MB of L2 cache per core. The processor supports maximum front-side bus speeds of 1066 megahertz (Figure 7 center).


The 64-bit Xeon 5100 series two-core processor runs at a maximum frequency of 3.0 GHz with 4 MB of shared L2 cache and a maximum front-side bus speed of 1333 megahertz (Figure 7 right).
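The front-side bus speeds quoted above translate directly into peak bandwidth figures. As a back-of-envelope sketch (assuming the standard 64-bit, 8-byte-wide FSB data path and 1 GB = 10^9 bytes), the arithmetic looks like this:

```python
# Back-of-envelope sketch: peak theoretical bandwidth of a 64-bit
# (8-byte-wide) front-side bus at a given transfer rate.
def fsb_bandwidth_gb_s(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak FSB bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9

# A 1066-MT/s bus peaks at about 8.5 GB/s; a 1333-MT/s bus at about 10.7 GB/s.
print(round(fsb_bandwidth_gb_s(1066), 1))  # 8.5
print(round(fsb_bandwidth_gb_s(1333), 1))  # 10.7
```

These peak numbers explain why a single shared bus saturates quickly with multiple cores, motivating the dual independent buses described earlier.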

The Xeon 5000 Sequence as well as the 5100 and 5200 series processors use the Intel 5000 series chipsets. These chipsets contain two main components: the Memory Controller Hub (MCH) and the I/O controller hub. The new Northbridge MCH supports DDR2 Fully-Buffered DIMMs (dual in-line memory modules).

Figure 7. Diagram representing the major components of two-core Intel Xeon 3000, 5000, 5100, and 5200 Sequence processors

Xeon four-core processors

The four-core Intel Xeon 5300 series processor (Clovertown) was the first four-core processor for dual-socket platforms (Figure 8). The Xeon 5300 series processor contains two dual-core dies. Each pair of cores shares an L2 cache; up to 4 MB of L2 cache can be allocated to one core. The processor runs at a maximum frequency of 3.0 GHz, with 2 MB of L2 cache per core. This configuration delivers a significant increase in processing capacity utilizing the Intel 5000 series chipsets. ProLiant 300 series servers use the Intel 5000P and 5000Z chipsets. These chipsets support 1066-MHz and 1333-MHz dual independent buses, DDR2 FB-DIMMs, and PCI Express I/O slots.

The four-core Xeon 5400 series processor (Harpertown) contains two dual-core dies, and each pair of cores shares a 6-MB L2 cache. The Xeon 5400 series processor runs at a maximum frequency of 3.0 GHz with a 1333-MHz or 1600-MHz FSB.

Figure 8. Intel Xeon 5300 and 5400 series processors contain two dual-core chips.


The four-core Intel Xeon 7300 series processor (Tigerton) consists of two dual-core silicon chips on a single ceramic module, similar to the Xeon 5300 series processors. Each pair of cores shares an L2 cache; up to 4 MB of L2 cache can be allocated to one core. Intel states the Xeon 7300 series processors offer more than twice the performance and more than three times the performance-per-watt of the previous generation 7100 series, which is based on the NetBurst microarchitecture. Xeon 7300 series processors are supported by the Intel® 7300 Chipset, which features Dedicated High-Speed Interconnects (DHSI). DHSI is an independent point-to-point interface between the chipset and each processor that provides full front-side bus bandwidth to each processor (Figure 9). The point-to-point interface significantly reduces data traffic and provides lower latencies and greater available bandwidth. The chipset also features a 64-MB snoop filter that manages data coherency across processors, eliminating unnecessary snoops and boosting available bandwidth.

Figure 9. Intel Xeon 7300 series processors with the Intel 7300 Chipset and Dedicated High-Speed Interconnects

Based on the 45nm Penryn core, the 6-core Dunnington processor is the successor to Tigerton and has a 16MB L3 “last level cache.”

Enhanced SpeedStep® Technology

Four-core Intel Xeon 5300 and 7300 series processors support Enhanced Intel SpeedStep Technology. These processors have power state hardware registers that are available (exposed) to allow IT organizations to control the processor's performance and power consumption. These capabilities are implemented through Intel's Enhanced Intel SpeedStep Technology and demand-based switching. With the appropriate ROM firmware or operating system interface, programmers can use the exposed hardware registers to switch a processor between different performance states, or P-states,3 at different power consumption levels. For example, HP developed a power management feature called HP Power Regulator that uses P-state registers to control processor power use and performance. These capabilities have become increasingly important for power and cooling management in high-density data centers. With the combination of this technology and data-center management tools such as Insight Power Manager, IT organizations have more control over the power consumption of all the servers in the data center.

3 The ACPI body defines P-states as processor performance states. For Intel and AMD processors, a P-state is defined by a fixed operating frequency and voltage.
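On a running Linux system, the P-states the firmware exposes can be inspected through the kernel's cpufreq sysfs interface. The sketch below is Linux-only and assumes the acpi-cpufreq driver, which publishes the `scaling_available_frequencies` file; other drivers may not expose it, in which case the function simply returns an empty list.

```python
# Hedged, Linux-only sketch: list the P-state frequencies that the cpufreq
# subsystem exposes for one CPU. The sysfs file below is provided by the
# acpi-cpufreq driver; other drivers (e.g. intel_pstate) may not create it.
from pathlib import Path

def available_pstates(cpu: int = 0) -> list[int]:
    """Return available scaling frequencies in kHz (highest first), or [] if not exposed."""
    f = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_available_frequencies")
    if not f.exists():
        return []
    return sorted((int(x) for x in f.read_text().split()), reverse=True)

# On a machine without the file this prints []; with acpi-cpufreq it prints
# the P-state frequency table, e.g. one entry per ACPI performance state.
print(available_pstates())
```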


Intel® Virtualization Technology

Virtualization techniques that are completely enabled in software perform many complex translations between the guest operating systems and the hardware. With software virtualization, the processor overhead increases (performance decreases) as each guest OS and application vies for the host machine’s physical resources such as memory space and I/O devices. Also, memory latency increases as the virtual machine monitor, or hypervisor, dynamically translates the memory addresses sent to and received from the memory controller. The hypervisor does this so that each guest operating system does not realize that it is being virtualized.

Four-core Intel Xeon 5300 and 7300 series processors support Intel Virtualization Technology (VT-x), which is a hardware enhancement designed to reduce this software overhead. Intel VT-x is a group of extensions to the x86 instruction set that affect the processor, memory, and local I/O address translations. The new instructions enable guest operating systems to run in the standard Ring-0 architectural layer.4
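Whether a given Linux host reports the VT-x hardware extension can be checked from the kernel's CPU feature flags, where it appears as `vmx`. This is a Linux-specific sketch (the flag is absent if VT-x is unsupported or disabled in firmware):

```python
# Hedged, Linux-only sketch: check whether the kernel reports the VMX
# (Intel VT-x) feature flag in /proc/cpuinfo.
def has_vmx(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """True if the first 'flags' line in cpuinfo lists the 'vmx' feature."""
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return "vmx" in line.split(":", 1)[1].split()
    except OSError:
        pass
    return False

# Prints True only on VT-x capable hardware with VT enabled in firmware.
print(has_vmx())
```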

The Xeon 7300 series processors also include APIC Task Priority Register virtualization, a new Intel® VT extension that improves interrupt handling to further optimize virtualization software efficiency.

Intel® Microarchitecture Nehalem

The Intel® Xeon® processor 5500 series, introduced in 2009, is based on the Intel® Microarchitecture, codenamed Nehalem (Figure 10). The Intel Microarchitecture Nehalem is built on hafnium-based 45-nanometer Hi-k metal gate silicon technology, a new material combination that reduces electrical leakage and enables smaller, more energy-efficient, higher performance processors. The Intel Microarchitecture Nehalem includes several performance and power management innovations:

• Dynamically managed cores, threads, cache, interfaces, and power

• Extensions to Intel® Streaming SIMD Extensions 4 (SSE4) for faster computation/manipulation of media (graphics, video encoding and processing, 3-D imaging, and gaming)

• Seven Application Targeted Accelerators that improve the performance of specific applications

• Extended Page Table that improves performance of software in virtualized environments

• An integrated memory controller

• Intel® QuickPath Technology

• A three-level cache hierarchy

• Intel® Hyper-Threading Technology

• Intel® Turbo Boost technology

• Dynamic Power Management

Integrated memory controller

One of the most notable improvements in the Intel Xeon 5500 series processor is the integrated memory controller (Figure 11). The memory controller uses three channels to access dedicated DDR3 memory sockets. This delivers a significant performance improvement over previous architectures that provide a single pool of system memory (the Blackford chipset has four FBD channels). The memory channels can operate at up to 1333 MT/s, but the actual speed depends on the number and type of DIMMs—registered or unbuffered—that populate the slots. For example, in a fully-populated system using registered DDR3-1333 DIMMs, the memory bus speed drops to 800 MT/s to maintain signal integrity. The three memory channels have a maximum total bandwidth of 32 GB/s. If a processor needs to access the memory of another processor, it can do so through the QuickPath Interconnect (QPI).
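The 32-GB/s figure follows directly from the channel arithmetic: three channels, each 8 bytes wide, at 1333 MT/s. A quick sketch (with 1 GB = 10^9 bytes) also shows the cost of the 800-MT/s fallback:

```python
# Sketch of the peak-bandwidth arithmetic for the three-channel DDR3
# memory controller: channels * transfer rate * 8 bytes per transfer.
def peak_memory_bandwidth_gb_s(channels: int, mt_per_s: float,
                               bytes_per_transfer: int = 8) -> float:
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return channels * mt_per_s * 1e6 * bytes_per_transfer / 1e9

print(round(peak_memory_bandwidth_gb_s(3, 1333)))  # 32 -> the figure quoted above
print(round(peak_memory_bandwidth_gb_s(3, 800)))   # 19 -> after the 800-MT/s fallback
```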

4 For more information, refer to the technology brief "Server virtualization technologies for x86-based HP BladeSystem and HP ProLiant servers" at http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01067846/c01067846.pdf


Figure 10. Intel Microarchitecture Nehalem

Intel® QuickPath Technology

Intel QuickPath Technology maximizes data transfer between the processors and other system components. It replaces the multi-drop front-side bus and memory controller hub found in previous generation architectures with high-speed, point-to-point interconnects that directly link the processors and I/O chipset (Figure 11). Each QPI consists of two unidirectional links that operate simultaneously in each direction using differential signaling. Unlike a typical serial bus, the QPI transmits data packets in parallel across multiple lanes, and the packets are broken into multiple parallel transfers.

Each link comprises twenty 1-bit lanes. A maximum of 16 bits (2 bytes) is used to transfer data, and the remaining 4 bits are used for the protocol and error correction. Initially, the QPI performs a maximum of 6.4 gigatransfers per second (GT/s) with 2 bytes per transfer, or 12.8 GB/s in each direction, for a total theoretical bandwidth of 25.6 GB/s.
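The QPI bandwidth numbers can be checked with the same style of arithmetic: 2 data bytes per transfer, counted once per direction and doubled for the full link:

```python
# Sketch of the QPI bandwidth arithmetic: GT/s * 2 data bytes per transfer
# gives the per-direction rate; the link carries that in both directions.
def qpi_bandwidth_gb_s(gt_per_s: float, data_bytes: int = 2) -> tuple[float, float]:
    """Return (per-direction, total) QPI bandwidth in GB/s."""
    per_direction = gt_per_s * data_bytes
    return per_direction, 2 * per_direction

print(qpi_bandwidth_gb_s(6.4))   # (12.8, 25.6) -> the figures quoted above
print(qpi_bandwidth_gb_s(5.86))  # the mid-bin 5500-series parts in Table 5
```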

Reliability, Availability, and Serviceability (RAS) features of the QPI include self-healing links and clock fail-over. Each link has twenty 1-bit lanes that are grouped into quadrants with 5 lanes each. If a persistent (hard) error occurs in one quadrant, the link automatically reduces its width to half (2 quadrants) or quarter-width (1 quadrant) using only the good lanes. This self-healing capability allows the interconnect to recover from multiple hard errors without data loss. Unrecoverable soft errors initiate a dynamic link width reduction cycle. If the clock fails, the link is reduced to half- or quarter-width and the clock is mapped to a pre-determined data lane. The bandwidth of the link in RAS mode is reduced; however, the link in the other direction can still operate normally.
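The width-reduction behavior described above can be illustrated with a toy model (this is an illustration of the degradation ladder, not Intel's actual link-training algorithm): a 20-lane link split into four 5-lane quadrants drops to the widest supported configuration that the surviving quadrants allow.

```python
# Toy model of QPI self-healing (illustrative only): given which 5-lane
# quadrants have failed, pick the widest supported link width — full
# (4 quadrants), half (2), or quarter (1) — that still fits.
def degraded_width(failed_quadrants: set[int], total_quadrants: int = 4) -> int:
    """Return operating width in quadrants: 4 (full), 2 (half), or 1 (quarter)."""
    good = total_quadrants - len(failed_quadrants)
    for width in (4, 2, 1):
        if good >= width:
            return width
    raise RuntimeError("link down: no good quadrants remain")

print(degraded_width(set()))      # 4 -> full width, no errors
print(degraded_width({0}))        # 2 -> half width after one hard error
print(degraded_width({0, 1, 3}))  # 1 -> quarter width, still no data loss
```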


Three-level cache hierarchy

Each Intel Xeon 5500 series processor has a three-level cache hierarchy (Figure 11):

• An on-die, 64-kilobyte, L1 cache that is split into two 32-kilobyte caches storing data and instructions

• An individual, 256-kilobyte, L2 cache for each core for lower latency

• A new inclusive, fully shared, Level 3 (L3) cache that can be up to 8 megabytes

The L3 cache is shared and inclusive, which means that it duplicates the data stored in the L1 and L2 caches of each core. This guarantees that data is stored outside the cores, thus minimizing latency by eliminating unnecessary core snoops to the L1 and L2 caches. The L3 cache has additional flags that track which core’s cache supplied the original data. Therefore, if one core modifies the data of another core in L3 cache, the L1 and L2 caches of the core that originated the data are updated. This eliminates excessive inter-core traffic and ensures multi-level cache coherence.
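The snoop-filtering benefit of an inclusive L3 can be shown with a tiny model: because every line held in any core's L1 or L2 is also present in L3, an L3 miss proves the line is in no core cache, so the snoop can be skipped. (The model below is a pedagogical sketch, not the processor's coherence protocol.)

```python
# Toy model of why an inclusive L3 acts as a snoop filter: inclusion
# guarantees L1/L2 contents are a subset of L3, so an L3 miss means no
# core cache can hold the line and per-core snoops are unnecessary.
def needs_core_snoop(addr: int, l3_lines: set[int]) -> bool:
    """With an inclusive L3, only an L3 hit can require snooping core caches."""
    return addr in l3_lines

l3_lines = {0x1000, 0x2000}                # lines currently tracked by the shared L3
print(needs_core_snoop(0x1000, l3_lines))  # True  -> may live in some core's L1/L2
print(needs_core_snoop(0x3000, l3_lines))  # False -> guaranteed absent; skip all snoops
```

The per-line "which core supplied this" flags mentioned above refine this further, directing a necessary snoop to a single core instead of all of them.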

Figure 11. Three-level cache hierarchy in the Intel Xeon 5500 series processors

Table 5. 60-Watt, 80-Watt, and 95-Watt Intel Xeon 5500 series processor specifications

60W
• L5530 (2.40 GHz), L5520 (2.26 GHz): 8 MB L3, 5.86 GT/s QPI, 800/1066 MHz DDR3, Hyper-Threading
• L5506 (2.13 GHz): 4 MB L3, 4.8 GT/s QPI, 800 MHz DDR3, no Hyper-Threading

80W
• E5540 (2.53 GHz), E5530 (2.40 GHz), E5520 (2.26 GHz): 8 MB L3, 5.86 GT/s QPI, 1066 MHz DDR3, Hyper-Threading
• E5506 (2.13 GHz), E5504 (2.00 GHz), E5502 (1.86 GHz): 4 MB L3, 4.8 GT/s QPI, 800 MHz DDR3, no Hyper-Threading

95W
• E5570 (2.93 GHz), E5560 (2.80 GHz), E5550 (2.66 GHz): 8 MB L3, 6.40 GT/s QPI, 800/1066/1333 MHz DDR3, Hyper-Threading


Intel® Hyper-Threading Technology

Intel Nehalem-based processors re-introduce support for HT Technology (simultaneous multithreading). HT Technology lets each core execute two computational threads at the same time, which allows each four-core processor to simultaneously execute up to eight threads. In addition, the high-bandwidth memory subsystem supplies data faster to the two computational processes, and the low-latency cache hierarchy allows simultaneous processing of more instructions. HT Technology improves performance-per-watt over previous generation Intel processor-based servers.

HT Technology achieves performance gains by reducing latency. The two threads are not executed in parallel; rather, they share the resources of a single execution core. If one thread needs to use an execution unit being used by the other thread, it must wait. As a result, processor throughput may increase by only up to 30 percent, and the gain varies with the application and hardware platform. In applications where software programmers minimize or effectively eliminate memory latencies through cache optimizations, HT Technology may not yield measurable performance gains.

In comparison, a two-core processor running a single thread on each core provides true parallel execution, delivering close to 100% performance improvement. In addition, a well-designed two-core processor is more energy efficient than a single core processor running multiple threads. For these reasons, processors based on the Intel Core microarchitecture did not support HT Technology.
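The contrast between the two scaling mechanisms can be made concrete with an illustrative throughput model (the 30 percent HT gain is the upper bound cited above, not a measurement):

```python
# Illustrative throughput model: a second HT thread on one core adds at
# most ~30% throughput, while a second physical core roughly doubles it.
def relative_throughput(physical_cores: int, ht_on: bool,
                        ht_gain: float = 0.30) -> float:
    """Throughput relative to one core running one thread."""
    per_core = 1.0 + (ht_gain if ht_on else 0.0)
    return physical_cores * per_core

print(relative_throughput(1, ht_on=True))   # 1.3 -> one core plus HT, best case
print(relative_throughput(2, ht_on=False))  # 2.0 -> true dual-core parallelism
```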

Intel® Turbo Boost Technology

Turbo Boost Technology complements HT Technology by increasing the performance of both multithreaded and single-threaded workloads. The processor increases the clock frequency of all active cores when it is operating below power and thermal design points set by the user. These design points include the number of active cores desired, the estimated current consumption, the estimated power consumption, or the processor temperature.

The three Turbo Boost control states are Off, Automatic, and Manual. When Turbo Boost is turned Off, the processor operates only at the rated frequency. When Turbo Boost is set to Automatic, the OS requests a higher performance state and the processor determines the optimum frequency. When Turbo Boost is set to Manual, the user can manually disable cores using the BIOS (reboot required) and increase the likelihood that Turbo Boost will be initiated (Figure 12).

Figure 12. Turbo Boost Technology: some cores turned off; remaining cores running at a higher frequency

There are at least three situations in which disabling processor cores can prove beneficial.

• Reducing power use. Disabling processor cores reduces processor power use. If a server is being used in an application environment that does not depend heavily on multi-threading, disabling cores can lower power consumption without materially affecting performance.

• Increasing overall performance. Some applications benefit from higher core frequency rather than from additional cores. When Turbo Mode is enabled for Intel Nehalem processors, the power and heat savings realized by disabling processor cores allow the remaining cores to run at a higher frequency than their rated speed. In specific application environments, this may actually increase overall system performance.

• Addressing licensing issues. Some software is licensed on a per-core basis. Disabling cores allows an administrator to match the number of active cores on a server with licensing requirements. However, some software that is licensed on a per-core basis may not recognize the disabling of cores unless the core is disabled through the BIOS during POST.

Dynamic Power Management

Dynamic Power Management works hand-in-hand with Turbo Boost to automatically optimize the performance and power use of the processor, chipset, and memory based on business requirements.

Dynamic Power Management provides the following key improvements:

• The ability to manage power for the processor, chipset, and memory

• More operating power states and lower idle processor power states

• Reduced overhead when transitioning states

These Dynamic Power Management advances allow a processor based on the Intel Microarchitecture Nehalem to provide greater performance while using the same amount of power as a processor based on the previous generation Intel Core microarchitecture (Figure 13). Conversely, a Nehalem processor can achieve performance equivalent to a previous generation processor while using less power.

Figure 13. Dynamic Power Management: lower power consumption with the same performance (left); higher performance using the same power (right)


Performance comparisons

TPC-C performance

The Transaction Processing Performance Council benchmark TPC-C results for Woodcrest, Clovertown, Tulsa, Nehalem, and Dunnington processors are compared in Figure 14. TPC-C performance is measured in transactions per minute (tpmC).

Figure 14. TPC-C performance for Intel processors showing percentage improvements compared to Woodcrest

SPEC performance

The Standard Performance Evaluation Corporation (SPEC) CPU2006 benchmark provides performance measurements that can be used to compare compute-intensive workloads on different computer systems. SPEC results for Woodcrest, Clovertown, Tulsa, Tigerton, and Nehalem processors are compared in Figure 15. SPEC CPU2006 contains two benchmark suites: CINT2006 for measuring and comparing compute-intensive integer performance, and CFP2006 for measuring and comparing compute-intensive floating point performance. The performance results show that the four-core processors—Clovertown, Tigerton, and Nehalem—performed better in the SPEC tests.


Figure 15. SPEC CPU2006 performance for Intel processors showing percentage improvements compared to Woodcrest

Conclusion

Intel processors continue to provide dramatic increases in the processing capability of HP industry-standard servers. In addition to improved system performance, multi-core Intel processors offer greater energy efficiency to help HP customers manage power costs.


For more information

For additional information, refer to the resources listed below.

ProLiant servers home page: www.hp.com/servers/proliant

Power Regulator for ProLiant Servers: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00300430/c00300430.pdf

ISS Technology Papers: www.hp.com/servers/technology

Call to action

Send comments about this paper to [email protected].

© 2002, 2005, 2006, 2007, 2009 Hewlett-Packard Development Company, L.P.

The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Intel, Intel Xeon, Pentium and Itanium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc.

Linux is a U.S. registered trademark of Linus Torvalds.

Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.

TC091006TB, October 2009
