
Chapter 6. The POWER Hypervisor

The technology behind the shared processors on eServer p5 systems is provided by a piece of firmware known as the POWER Hypervisor (referred to in this document simply as “hypervisor”). The enhanced layered code structure of the hypervisor resides in flash memory on the Service Processor. This firmware performs the initialization and configuration of the POWER5 processor, as well as the virtualization support required to run up to 140 partitions concurrently on the IBM eServer p5 servers.


6.1 Introduction

The hypervisor supports many advanced functions compared with the previous version of the hypervisor, including processor sharing, virtual I/O, high-speed communication between partitions using virtual LAN, and concurrent maintenance, and it allows multiple operating systems to run on a single system. AIX 5L, Linux, and i5/OS are supported.

With support for dynamic resource movement across multiple environments, customers can move processors, memory and I/O between partitions on the system as they move workloads between the three environments.

The hypervisor is the underlying control mechanism that resides below the operating systems. The hypervisor owns all system resources and creates partitions by allocating these resources and sharing them.

The layers above the hypervisor are different for each supported operating system. The eServer p5 systems use the new hypervisor to support the Micro-Partitioning technology model. The hypervisor on POWER4 processor-based systems worked on a demand basis, as the result of machine interrupts and callbacks to the operating system. The new hypervisor operates continuously in the background.

For the AIX 5L and Linux operating systems, the layers above the hypervisor are similar, but their contents are characterized by each operating system. The layers of code supporting AIX 5L and Linux consist of system firmware and Run-Time Abstraction Services (RTAS).

System firmware is composed of low-level firmware, which is code that performs server-unique I/O configurations, and Open Firmware, which contains the boot-time drivers, the boot manager, and the device drivers required to initialize the PCI adapters and attached devices. RTAS consists of code that supplies platform-dependent accesses and can be called from the operating system. These calls are passed to the hypervisor, which handles all I/O interrupts.

The role of RTAS versus Open Firmware is very important to understand. Open Firmware and RTAS are both platform-specific firmware, and both are tailored by the platform developer to manipulate the specific platform hardware. However, RTAS is intended to be present to access platform hardware features on behalf of the operating system, while Open Firmware need not be present when the operating system is running. This frees Open Firmware’s memory to be used by applications. RTAS is small enough to painlessly coexist with the operating system and applications.


Figure 6-1 POWER Hypervisor on AIX 5L and Linux

For i5/OS, the Technology Independent Machine Interface and the layers above the hypervisor are still in place. System Licensed Internal Code, however, has been changed and enabled for interfacing with the hypervisor. The hypervisor code is based on the iSeries Partition Licensed Internal Code, which has been enhanced for use with the IBM eServer i5 hardware and is now part of the hypervisor.

Attention: All eServer p5 based servers require the use of the hypervisor. A system administrator can configure the system as a single partition that includes all of the resources on the system, but the system cannot run in SMP mode without the hypervisor, as it could on POWER4 systems. An eServer p5 based server is always partition capable.

6.2 Hypervisor support

The POWER5 processor supports a special set of instructions that are used exclusively by the hypervisor. If an operating system instance in a partition requires access to hardware, it first invokes the hypervisor by using hypervisor calls. The hypervisor grants the operating system privileged access to dedicated hardware facilities and includes protection for those facilities in the processor and in memory.


The hypervisor interfaces have not fundamentally changed with the introduction of virtualization and Micro-Partitioning technology. New virtual processor objects and hypervisor calls have been added to support shared processor partitions. The existing physical processor objects have simply been refined so that they do not include physical characteristics of the processor, because there is no fixed relationship between a virtual processor and the physical processor that actualizes it. These new hypervisor calls are intended to support the scheduling heuristic of minimizing idle time.

The hypervisor is entered by way of three interrupts:

- System Reset Interrupt

The hypervisor code saves all processor state by saving the contents of the processor’s registers (multiplexing the use of this resource with the operating system). The processor’s stack and data are found by processing the Processor Identification Register (PIR). The PIR is a read-only register. During power-on reset, it is set to a unique value for each processor in a multiprocessor system.

- Machine Check Interrupt

The hypervisor code saves all processor state by saving the contents of the processor’s registers (multiplexing the use of this resource with the operating system). The processor’s stack and data are found by processing the PIR.

The hypervisor investigates the cause of the machine check. The cause can be either a recoverable event on the current processor or an event on one of the other processors in the logical partition. The hypervisor must also determine whether the machine check has corrupted its own internal state, by looking at the footprints, if any, that were left in the PIR processor data area of the errant processor.

- System (hypervisor) Call Interrupt

The hypervisor call (hcall) interrupt is a special variety of the sc (system call) instruction. The parameters to the hcall() are passed in registers using the PowerPC Application Binary Interface (ABI) definitions. This ABI specifies an interface for compiled application programs to system software. In contrast to the PowerPC ABI, parameters are not passed by reference to or from hcall(); this avoids the address translation problems that passing parameters by reference would cause, because address translation is disabled automatically when interrupts are invoked. Input parameters can be indexes. Output parameters can be passed in the registers and require special in-line assembler code on the part of the caller. The first parameter to hcall() is the function token. The assignment of function tokens is designed such that a single mask operation can be used to validate that the value is within the range of a reasonably sized branch table. Entries within the branch table can handle unimplemented code points.


Some of the hcall() functions indicate whether the system is partitioned and which functions are available. The corresponding Open Firmware property is provided in the /rtas node of the partition’s device tree. The property is present if the system is partitioned, and its value specifies which function sets are implemented by a given implementation. If a system implements any hcall() of a function set, it implements the entire function set. Additionally, certain values of the Open Firmware property indicate that the system supports a given architecture extension to a standard hcall(). A minimal invocation sketch follows this list.
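To make the calling convention concrete, the following minimal sketch shows how an hcall() with one input might be issued from a 64-bit PowerPC partition. It assumes that the function token is passed in GPR3, that inputs follow in GPR4 and up, that the return code comes back in GPR3, and that sc with LEV=1 enters the hypervisor; the wrapper name hcall1() and the clobber list are illustrative only and are not a programming reference.

/* Hedged sketch: issuing a hypervisor call from a 64-bit PowerPC
 * partition.  The function token goes in GPR3, the first input in GPR4,
 * and the return code comes back in GPR3; "sc 1" (sc with LEV=1) enters
 * the hypervisor.  The wrapper name and clobber list are illustrative
 * assumptions, not taken from any firmware or kernel source. */
static inline long hcall1(unsigned long token, unsigned long arg1)
{
    register unsigned long r3 asm("r3") = token;  /* function token      */
    register unsigned long r4 asm("r4") = arg1;   /* first input operand */

    asm volatile("sc 1"
                 : "+r"(r3), "+r"(r4)
                 :
                 : "r0", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
                   "ctr", "xer", "cr0", "memory");

    return (long)r3;  /* 0 (H_SUCCESS) or an error code */
}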

The hypervisor routines are optimized for execution speed. In some rare cases, locks have to be taken, and short wait loops are required because of specific hardware designs. However, if a needed resource is truly busy, or processing is required by an agent, the hypervisor returns to the caller, either to have the function retried or continued at a later time. The performance class establishes specific performance requirements for each hcall() function.
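Before depending on an optional hcall(), an operating system or administrator can check which function sets the platform advertises in the /rtas node mentioned above. On a Linux partition, the device tree is typically exposed under /proc/device-tree, and the property is commonly named ibm,hypertas-functions; treat both the path and the property name as assumptions here rather than as facts stated by this chapter. The sketch below simply prints the advertised function-set strings.

/* Hedged sketch: list the hcall() function sets advertised in the /rtas
 * node of the device tree on a Linux partition.  The path and property
 * name are assumptions based on common PAPR platforms. */
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/device-tree/rtas/ibm,hypertas-functions";
    FILE *f = fopen(path, "rb");
    int c;

    if (f == NULL) {
        perror(path);
        return 1;
    }

    /* The property is a list of NUL-terminated strings; print one per line. */
    while ((c = fgetc(f)) != EOF)
        putchar(c == '\0' ? '\n' : c);

    fclose(f);
    return 0;
}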

6.3 Hypervisor call functions

The hypervisor provides the following functions:

- Page Frame Table

Page Frame Table (PFT) access is called using 64-bit linkage conventions. The hypervisor PFT access functions carefully update a Page Table Entry (PTE) with at least 64-bit store operations, because an invalid update sequence could result in a machine check. The hypervisor protects against check-stop conditions by allocating certain PTE bits for PTE locks and reserves for the operating system, which assumes that the PTE is in use.

For logical addressing, an additional level of virtual address translation is managed by the hypervisor. The operating system is not allowed to use physical addresses for its memory; this includes main storage, memory-mapped I/O (MMIO) space, and NVRAM. The operating system sees main storage as regions of contiguous logical memory. Each logical region is mapped by the hypervisor into a corresponding block of contiguous physical memory on a specific node. All regions on a specific system are the same size, although different systems with different amounts of memory can have different region sizes, because regions are the quantum of memory allocation to partitions. That is, partitions are granted memory in region-size chunks, and if a partition’s operating system gives up memory, it does so in units of a full region.


- Translation Control Entry

The Translation Control Entry (TCE) access hcall() takes as a parameter the Logical I/O Bus Number, which is the logical bus number value derived from the properties associated with the particular I/O adapter. TCEs are responsible for the I/O address to memory address translation needed to perform direct memory access (DMA) transfers between memory and PCI adapters. The TCE tables are allocated in physical memory. A minimal mapping sketch follows this list.

- Processor Register Hypervisor Resource Access

Processor Register Hypervisor Resource Access provides controlled write access services.

- Debugger Support

Debugger support provides the capability for the real mode debugger to get to its asynchronous port and beyond the real mode limit register without turning on virtual address translation.

- Virtual Terminal Support

The hypervisor provides console access to every logical partition without a physical device assigned. The console emulates a vt320 terminal that can be used to access the partition using the Hardware Management Console (HMC). Some functions are limited, and the performance cannot be guaranteed because of the limited bandwidth of the connection between the HMC and the managed system. A partition’s device tree contains one or more nodes indicating that the partition has been assigned one or more virtual terminal (vterm) client adapters. The unit address of the node is used by the partition to map the virtual device or devices to the operating system’s corresponding logical representations and to notify the partition that the virtual adapter is a vterm client adapter. The node’s interrupts property specifies the interrupt source number that has been assigned to the client vterm I/O adapter for receive data.

- Dump Support

Dump support allows the operating system to dump hypervisor data areas in support of field problem diagnostics. The hcall-dump function set contains the H_HYPERVISOR_DATA hcall(). This hcall() is enabled or disabled (disabled by default) with the HMC.

- Memory Migration Support

The Memory Migration Support hcall() was provided to assist the operating system in the memory migration process. It is the responsibility of the operating system not to change the DMA mappings referenced by the translation buffer. Failure of the operating system to serialize relative to the logical bus numbers might result in DMA data corruption within the caller’s partition.


- Performance Monitor Support

The performance monitor registers are saved when a virtual processor yields or is preempted. They are restored when the state of the virtual processor is restored on the hardware. A bit in one of the performance monitor registers enables the partition to specify whether the performance monitor registers count when a hypervisor call (except yield) is made (MSR[HV]=1). When a virtual processor yields or is preempted, the performance monitor registers do count. This allows a partition to query the hypervisor for appropriate information regarding hypervisor code and data addresses.
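As mentioned in the Translation Control Entry item above, TCEs translate I/O bus addresses into memory addresses for DMA. The following minimal sketch shows how a single 4 KB page might be mapped with H_PUT_TCE; the hcall3() wrapper, the token value, and the permission-bit layout are hypothetical, in the style of the invocation sketch in 6.2, and are not a programming reference.

/* Hedged sketch: map one 4 KB page for DMA through H_PUT_TCE.  The
 * hcall3() wrapper, the token value, and the permission bit positions
 * are illustrative assumptions. */
#define H_PUT_TCE_TOKEN  0x20UL   /* placeholder token value             */
#define TCE_READ         0x1UL    /* placeholder: adapter may read page  */
#define TCE_WRITE        0x2UL    /* placeholder: adapter may write page */

extern long hcall3(unsigned long token, unsigned long a1,
                   unsigned long a2, unsigned long a3);

/* Map the 4 KB page at real address 'rpage' so that the adapter behind
 * logical I/O bus 'liobn' can reach it at I/O bus address 'ioba'. */
static long map_dma_page(unsigned long liobn, unsigned long ioba,
                         unsigned long rpage)
{
    unsigned long tce = (rpage & ~0xFFFUL) | TCE_READ | TCE_WRITE;

    return hcall3(H_PUT_TCE_TOKEN, liobn, ioba, tce);
}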

Table 6-1 provides a list of hypervisor calls.

Note: This table is not intended to be a programming reference. Therefore, these calls can change in future levels of firmware. However, the definitions can provide a better understanding of the mechanics within the hypervisor.

Table 6-1 Hypervisor calls

H_REGISTER_VPA: This hcall() registers a data area, provided by the operating system for each virtual processor, with the hypervisor. The Virtual Processor Area (VPA) is the control area that contains information used by the hypervisor and the operating system in cooperation with each other.

H_CEDE: This hcall() has a virtual processor that has no useful work to do enter a wait state, ceding its processor capacity to other virtual processors until useful work appears, signaled either through an interrupt or an H_PROD hcall(). (A sketch of an idle loop using H_CEDE follows this table.)

H_CONFER: This hcall() allows a virtual processor to give its cycles to one or all other virtual processors in its partition.

H_PROD: This hcall() makes the specified virtual processor runnable.

H_ENTER: This hcall() adds an entry into the page frame table. The PTE high- and low-order bytes of the page table contain the new entry.

H_PUT_TCE: This hcall() provides mapping of a single 4096-byte page into the specified TCE.


H_READ: This hcall() returns the contents of a specific PTE in GPR4 and GPR5.

H_REMOVE: This hcall() invalidates an entry in the page table.

H_BULK_REMOVE: This hcall() invalidates up to four entries in the page table.

H_GET_PPP: This hcall() returns the partition’s performance parameters.

H_SET_PPP: This hcall() allows the partition to modify its entitled processor capacity percentage and variable processor capacity weight within limits.

H_CLEAR_MOD: This hcall() clears the modified bit in the specified PTE. The second doubleword of the old PTE is returned in GPR4.

H_CLEAR_REF: This hcall() clears the reference bit in the specified PTE from the partition’s node PFT.

H_PROTECT: This hcall() sets the page protection bits in the specified PTE.

H_EOI: This hcall() incorporates the interrupt reset function when specifying an interrupt source number associated with an interpartition logical I/O adapter.

H_IPI: This hcall() generates an interprocessor interrupt.

H_CPPR: This hcall() sets the processor’s current interrupt priority.

H_MIGRATE_DMA: This hcall() is extended to serialize the sending of a logical LAN message to allow for migration of TCE-mapped DMA pages.

H_PUT_RTCE: This hcall() maps a number of contiguous TCEs in an RTCE table to the same number of contiguous I/O adapter TCEs.

H_PAGE_INIT: This hcall() initializes pages in real mode, either to zero or to the copied contents of another page.

H_GET_TCE: This standard hcall() is used to manage the interpartition logical LAN adapter’s I/O translations.


H_COPY_RDMA: This hcall() copies data from an RTCE table mapped buffer in one partition to an RTCE table mapped buffer in another partition, with the length of the transfer specified by the transfer length parameter in the hcall().

H_SEND_CRQ: This hcall() sends one 16-byte message to the partner partition’s registered Command/Response Queue (CRQ). The CRQ facility provides ordered delivery of messages between authorized partitions.

H_SEND_LOGICAL_LAN: This hcall() sends a logical LAN message.

H_ADD_LOGICAL_LAN_BUF: This hcall() adds receive buffers to the logical LAN receive buffer pool.

H_PIC: This hcall() returns the summation of the physical processor pool’s idle cycles.

H_XIRR: This hcall() is extended to report the virtual interrupt source number associated with virtual interrupts from an interpartition logical LAN I/O adapter.

H_POLL_PENDING: This hcall() provides the operating system with the ability to perform background administrative functions, and provides the implementation with an indication of pending work so that it can more intelligently manage the use of hardware resources.

H_PURR: This hcall() is a new resource provided for Micro-Partitioning and SMT. It provides an actual count of ticks that the shared resource has used, on a per virtual processor or per SMT thread basis. In the case of Micro-Partitioning, the virtual processor’s Processor Utilization Resource Register (PURR) begins incrementing when the virtual processor is dispatched onto a physical processor. Therefore, comparing elapsed PURR with elapsed Time Base provides an indication of how much of the physical processor a virtual processor is getting. The PURR also counts hypervisor calls made by the partition, with the exception of H_CEDE and H_CONFER. For improved accuracy, the existing hcall() time stamping should be converted to use the PURR instead of the Time Base.
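To make the cooperation between H_CEDE and H_PROD more concrete, the following minimal sketch shows the general shape of an idle loop in a shared-processor partition. The token value, the hcall0() wrapper, and the has_runnable_work() helper are hypothetical, following the invocation sketch in 6.2; a real operating system would also coordinate with its interrupt handling and VPA state.

/* Hedged sketch of an idle loop in a shared-processor partition: when
 * there is no runnable work, the virtual processor cedes its capacity
 * with H_CEDE and is later made runnable again by an interrupt or by
 * another virtual processor issuing H_PROD.  Token and helpers are
 * illustrative only. */
#define H_CEDE_TOKEN  0xE0UL   /* placeholder token value */

extern long hcall0(unsigned long token);   /* hypothetical no-argument wrapper */
extern int  has_runnable_work(void);       /* hypothetical scheduler query     */

static void idle_loop(void)
{
    for (;;) {
        if (has_runnable_work())
            return;            /* hand control back to the dispatcher */

        /* Give this virtual processor's remaining capacity to the pool
         * until an interrupt or an H_PROD from a sibling wakes it up.  */
        hcall0(H_CEDE_TOKEN);
    }
}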


The lparstat command in AIX 5L Version 5.3 with the -H flag displays the partition data with a detailed breakdown of hypervisor time by call type, as shown in Figure 6-2.

Figure 6-2 lparstat -H command output


6.4 Micro-Partitioning technology extensions

A new virtual processor is dispatched on a physical processor when one of the following conditions happens:

- The physical processor is idle and a virtual processor was made ready to run (interrupt or process).
- The old virtual processor exhausted its time slice (hypervisor decrementer (HDEC) interrupt).
- The old virtual processor ceded or conferred its cycles.

When one of the above conditions occurs, the hypervisor, by default, records all of the virtual processor’s architected state, including the Time Base and Decrementer values, and sets the hypervisor timer services to wake the virtual processor per the setting of the decrementer. The virtual processor’s Processor Utilization Resource Register (PURR) value for this dispatch is computed. The Virtual Processor Area (VPA) dispatch count is incremented (such that the result is odd). Then the hypervisor selects a new virtual processor to dispatch on the physical processor, using an implementation-dependent algorithm with the following characteristics, listed in priority order:

1. The virtual processor is ready to run (has not ceded or conferred its cycles or exhausted its time slice).

2. Ready-to-run virtual processors are dispatched prior to waiting in excess of their maximum specified latency.

3. Of the non-latency critical virtual processors ready to run, select the virtual processor that is most likely to have its working set in the physical processor’s cache or for other reasons will run most efficiently on the physical processor.

If no virtual processor is ready to run at this time, the hypervisor starts accumulating the Pool Idle Count of the total number of idle processor cycles in the physical processor pool.
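Because the PURR advances only while the virtual processor is dispatched on a physical processor, comparing elapsed PURR with elapsed Time Base over an interval approximates the fraction of a physical processor that the virtual processor received, as noted for H_PURR in Table 6-1. The following minimal sketch illustrates the calculation; the PURR SPR number (309) and the mftb mnemonic are assumptions based on common POWER5 conventions, not a programming reference.

/* Hedged sketch: estimate how much of a physical processor this virtual
 * processor received by comparing elapsed PURR with elapsed Time Base.
 * SPR 309 (PURR) and the mftb form of reading the Time Base are
 * assumptions. */
static inline unsigned long read_purr(void)
{
    unsigned long val;
    asm volatile("mfspr %0, 309" : "=r"(val));   /* PURR */
    return val;
}

static inline unsigned long read_timebase(void)
{
    unsigned long val;
    asm volatile("mftb %0" : "=r"(val));         /* Time Base */
    return val;
}

/* Returns the approximate physical-processor fraction (0.0 to 1.0)
 * consumed between two sampling points. */
static double physical_fraction(unsigned long purr0, unsigned long tb0,
                                unsigned long purr1, unsigned long tb1)
{
    return (double)(purr1 - purr0) / (double)(tb1 - tb0);
}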

6.5 Memory considerations

POWER5 processors use memory to temporarily hold information. Memory requirements for partitions depend on the partition configuration, the I/O resources assigned, and the applications used. Memory can be assigned in increments of 16 MB.

Depending on the overall memory in your system and the maximum memory values you choose for each partition, the server firmware must have enough memory to perform logical partition tasks. Each partition has a Hardware Page Table (HPT). The size of the HPT is based on an HPT ratio and is determined by the maximum memory values that you establish for each partition. The HPT ratio is 1/64.
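For example, with the 1/64 ratio, a partition whose maximum memory value is 16 GB causes roughly 16 GB / 64 = 256 MB to be set aside for its HPT. The following minimal sketch expresses that arithmetic only; it ignores any rounding or minimum sizes that real firmware might apply.

/* Hedged sketch: estimate the HPT reservation implied by the 1/64 HPT
 * ratio.  Real firmware may round the result or enforce minimum sizes
 * that are not described in this chapter. */
static unsigned long long hpt_size_bytes(unsigned long long max_mem_bytes)
{
    return max_mem_bytes / 64;
}

/* Example: hpt_size_bytes(16ULL << 30) returns 268435456 bytes (256 MB). */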

When selecting the maximum memory values for each partition, consider the following:

- Maximum values affect the HPT size for each partition.
- The logical memory map size of each partition.

When you create a logical partition on your managed system, the managed system reserves an amount of memory to manage the logical partition. Some of this physical memory is used for hypervisor page table translation support. The current memory available for partition usage, as displayed by the HMC, is the amount of memory that is currently available to the logical partitions on the managed system (see Figure 6-3). This is the amount of active memory on your managed system minus the estimated memory needed by the managed system to manage the logical partitions currently defined on your system. Therefore, the amount in this field decreases for each additional logical partition you create.

When you are assessing changing performance conditions across system reboots, it is important to know that memory allocations might change based on the availability of the underlying resources. Memory is allocated by the system across the entire system. Applications in partitions cannot determine where memory has been physically allocated.

Figure 6-3 Logical Partition Profile Properties - current memory settings


6.6 Performance considerations

The hypervisor does use a small percentage of the system processor and memory resources. This overhead is associated with virtual memory management and is used for the hypervisor dispatcher, the virtual processor data structures (including the save areas for virtual processors), and for queuing of interrupts. The overhead depends on the workload and on page-mapping activity. Partitioning can actually help performance in some cases for applications that do not scale well on large SMP systems, by enforcing strong separation between workloads running in the separate partitions.

The output of lparstat with the -h flag displays the percentage of time spent in the hypervisor (%hypv) and the number of hcalls. Notice that in the example output shown in Figure 6-4, the %hypv in relation to entitlement capacity is only around 1% of the system resources. This percentage shows that the hypervisor consumed a small amount of the processor during this sample.

Figure 6-4 lparstat -h 1 16 command output

To provide input to capacity planning and quality of service tools, the hypervisor reports certain statistics to the operating system. These include the number of virtual processors that are online, the minimum processor capacity that the operating system can expect (the operating system can cede any unused capacity back to the system), the maximum processor capacity that the partition will grant to the operating system, the portion of spare capacity (up to the maximum) that the operating system will be granted, the variable capacity weight, and the latency to a dispatch by an hcall(). The output of the lparstat command with the -i flag, shown in Figure 6-5, reports the logical partition related information.

Figure 6-5 lparstat -i command output

