IBM eServer p5 510 Technical Overview and Introduction

Chapter 3. Reliability, availability, and serviceability

This chapter provides detailed information about IBM eServer p5 510 reliability, availability, and serviceability (RAS) features. It describes several features that are available when using AIX 5L. Support of these features using Linux can vary.

© Copyright IBM Corp. 2005. All rights reserved.

3.1 Reliability, fault tolerance, and data integrity

The reliability of the p5-510 server starts with components, devices, and subsystems that are designed to be fault-tolerant. During the design and development process, subsystems go through rigorous verification and integration testing processes. During system manufacturing, systems go through a thorough testing process designed to help ensure the highest level of product quality.

Among the features that provide fault tolerance and ensure data integrity are the following:

• The p5-510 server L3 cache and system memory offer ECC (error checking and correcting) fault-tolerant features. ECC is designed to correct environmentally induced, single-bit, intermittent memory failures and single-bit hard failures. With ECC, the likelihood of memory failures is substantially reduced.

• ECC also provides double-bit memory error detection that helps protect data integrity in the event of a double-bit memory failure.

• System memory also provides 4-bit packet error detection that helps to protect data integrity in the event of a DRAM chip failure.

• The system bus, I/O bus, and PCI buses are designed with parity error detection.

• Disk mirroring and disk controller duplexing are also provided by the AIX 5L operating system. Linux supports disk mirroring (RAID 1) in software using the md driver. Some of the hardware RAID adapters supported under Linux also support mirroring.

• The Journaled File System maintains file system consistency and reduces the likelihood of data loss when the system is abnormally halted due to a power failure.
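
As a rough illustration of the parity error detection mentioned for the buses, the following Python sketch attaches an even-parity bit to a data word and checks it on receipt. The 8-bit word and the function names are assumptions chosen for the example; they are not details of the p5-510 bus design.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a word (1 if the count of 1-bits is odd)."""
    return bin(word).count("1") % 2


def send(word: int) -> tuple:
    """Pair a data word with its parity bit, as a sender would before a transfer."""
    return word, parity_bit(word)


def check(word: int, parity: int) -> bool:
    """Return True when the received word still matches its parity bit."""
    return parity_bit(word) == parity


data, p = send(0b10110010)               # four 1-bits, so the parity bit is 0
assert check(data, p)                    # clean transfer
assert not check(data ^ 0b00000100, p)   # a single flipped bit is detected
assert check(data ^ 0b00000110, p)       # two flipped bits go unnoticed
```

Parity can only detect an odd number of flipped bits and cannot say which bit failed, which is why memory and cache use ECC rather than parity alone.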

3.1.1 PCI extended error handling

In the past, PCI bus parity errors caused a global machine check interrupt, which eventually required a system reboot to continue. In the POWER5 systems, new I/O drawer hardware, system firmware, and AIX 5L interaction have been designed to allow transparent recovery of intermittent PCI bus parity errors and a graceful transition to the I/O device unavailable state in the case of a permanent parity error in the PCI bus. This mechanism is called PCI extended error handling (EEH).

EEH-enabled adapters respond to a special data packet generated from the affected PCI slot hardware by calling system firmware, which examines the affected bus, allows the device driver to reset it, and continues without a system reboot.

Note: This RAS function is not supported under Linux.
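
The recovery flow described in this section can be pictured as a small state machine. The following Python sketch is only an illustration under stated assumptions: the class, the method names, and the limit of three failed resets are hypothetical and do not reflect the actual system firmware or AIX 5L device driver interfaces.

```python
from enum import Enum


class SlotState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"            # slot isolated after a parity error
    UNAVAILABLE = "unavailable"  # permanent error: device taken out of service


class PciSlot:
    """Hypothetical model of one EEH-enabled slot; not a real firmware API."""

    MAX_RESETS = 3  # assumed number of failed resets before giving up

    def __init__(self):
        self.state = SlotState.NORMAL
        self.failed_resets = 0

    def parity_error(self):
        """Isolate the slot instead of raising a global machine check."""
        if self.state is SlotState.NORMAL:
            self.state = SlotState.FROZEN

    def recover(self, reset_succeeds: bool) -> SlotState:
        """Driver-initiated reset: resume on success, give up after MAX_RESETS failures."""
        if self.state is not SlotState.FROZEN:
            return self.state
        if reset_succeeds:
            self.state = SlotState.NORMAL
            self.failed_resets = 0
        else:
            self.failed_resets += 1
            if self.failed_resets >= self.MAX_RESETS:
                self.state = SlotState.UNAVAILABLE
        return self.state


slot = PciSlot()
slot.parity_error()                                   # intermittent error freezes the slot
assert slot.recover(reset_succeeds=True) is SlotState.NORMAL
```

The point of the sketch is that the error stays confined to one slot: the rest of the system keeps running whether the reset succeeds or the device is eventually marked unavailable.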

3.1.2 Memory error correction extensions

The p5-510 server uses Error Checking and Correcting (ECC) circuitry for memory reliability, fault tolerance, and integrity.

• Memory has single-error-correct and double-error-detect ECC circuitry designed to correct single-bit memory failures. The double-bit detection is designed to help maintain data integrity by detecting and reporting multiple errors beyond what the ECC circuitry can correct.

• The memory chips are organized such that the failure of any specific memory module only affects a single bit within an ECC word (bit-scattering), thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill™ recovery).

• The memory also utilizes memory scrubbing and thresholding to determine when spare memory modules, within each bank of memory, if available, should be used to replace ones that have exceeded their threshold value (dynamic bit-steering). Memory scrubbing is the process of reading the contents of the memory during idle time and checking and correcting any single-bit errors that have accumulated by passing the data through the ECC logic. This function is a hardware function on the memory controller chip and does not influence normal system memory performance.
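
To make single-error-correct, double-error-detect (SEC-DED) behavior concrete, the following Python sketch uses a textbook extended Hamming code on a 4-bit value. This is an assumption made for brevity: the ECC word width and bit layout used by the p5-510 memory controller are not described here, and real ECC words are much wider.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) code word."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                          # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d  # data bits sit at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list) -> tuple:
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2                 # 0 while the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                 # single-bit error at position `syndrome`
        status = "corrected"
    else:
        return None, "uncorrectable"        # double-bit error: detected only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                                # one bit flips in storage
assert decode(word) == (0b1011, "corrected")
word[6] ^= 1                                # a second bit flips before scrubbing
assert decode(word) == (None, "uncorrectable")
```

The second assertion shows why scrubbing matters: correcting accumulated single-bit errors during idle time reduces the chance that two errors ever coexist in the same ECC word. Bit-scattering serves the same goal from the other direction, because a failed chip then contributes at most one bad bit per word.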

3.1.3 Redundancy for array self-healing

Although the most likely failure event in a processor is a soft single-bit error in one of its caches, there are other events that can occur, and they need to be distinguished from one another.

• For the L1, L2, and L3 caches and their directories, hardware and firmware keep track of whether permanent errors are being corrected beyond a threshold. If this threshold is exceeded, a deferred repair error log is created. Additional run-time availability actions, such as CPU vary off (see footnote 1) or L3 cache line delete, are also initiated.

• L1 and L2 caches and L2 and L3 directories on the POWER5 chip are manufactured with spare bits in their arrays that can be accessed via programmable steering logic to replace faulty bits in the respective arrays. This is analogous to the redundant bit-steering employed in main storage as a mechanism that is designed to help avoid physical repair, and is also implemented in POWER5 systems. The steering logic is activated during processor initialization and is initiated by the built-in self-test (BIST) at power-on time.

• L3 cache redundancy is implemented at the cache line level. Exceeding correctable error thresholds while running causes a dynamic L3 cache line delete function to be invoked.
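
The thresholding behind the L3 cache line delete can be sketched as a per-line counter of corrected errors. The following Python example is a minimal illustration; the class name, the threshold value, and the log message format are assumptions, not the firmware's actual data structures.

```python
from collections import Counter


class CacheLineMonitor:
    """Counts corrected errors per cache line and deletes lines that exceed a threshold."""

    def __init__(self, threshold: int = 3):   # the threshold value is an assumption
        self.threshold = threshold
        self.corrected = Counter()
        self.deleted = set()
        self.repair_log = []

    def record_corrected_error(self, line: int) -> bool:
        """Record one corrected error; return True if the line has just been deleted."""
        if line in self.deleted:
            return False                      # line is already out of service
        self.corrected[line] += 1
        if self.corrected[line] > self.threshold:
            self.deleted.add(line)
            self.repair_log.append(f"deferred repair: line {line:#x}")
            return True
        return False


monitor = CacheLineMonitor(threshold=2)
results = [monitor.record_corrected_error(0x40) for _ in range(3)]
assert results == [False, False, True]        # the third error crosses the threshold
```

Occasional soft errors stay below the threshold and are simply corrected, while a line that keeps failing is removed from use and noted for deferred repair.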

3.1.4 Service processor

The service processor included in the p5-510 server is designed to provide an immediate means to diagnose, check status, and sense operational conditions of a remote system, even when the main processor is inoperable.

• The service processor enables firmware and operating system surveillance, several remote power controls, environmental monitoring (only critical errors are supported under Linux), reset, boot features, remote maintenance, and diagnostic activities, including console mirroring.

• The service processor can place calls to report surveillance failures, critical environmental faults, and critical processing faults.

For more detailed information on the service processor, refer to Section 2.10.5, “Service processor” on page 49.

3.1.5 Fault monitoring functions

Among the fault monitoring systems included with a p5-510 server are the following:

• Built-in self-test (BIST) and power-on self-test (POST) check the processor, L3 cache, memory, and associated hardware required for proper booting of the operating system every time the system is powered on. If a noncritical error is detected, or if the errors occur in the resources that can be removed from the system configuration, the booting process is designed to proceed to completion. The errors are logged in the system nonvolatile RAM (NVRAM).

Footnote 1 (CPU vary off, Section 3.1.3): This RAS function is only available for a Linux operating system running the 2.6 kernel.

• Disk drive fault tracking can alert the system administrator to an impending disk failure before it impacts client operation.

• The AIX 5L or Linux log (where hardware and software failures are recorded and analyzed by the Error Log Analysis (ELA) routine) warns the system administrator about the causes of system problems. This also enables IBM service representatives to bring along probable replacement hardware components when a service call is placed, thus minimizing system repair time.

3.1.6 Mutual surveillance

The service processor monitors the operation of the POWER Hypervisor firmware during the boot process and watches for loss of control during system operation. It also allows the POWER Hypervisor to monitor service processor activity.

The service processor can take appropriate action, including calling for service, when it detects the POWER Hypervisor firmware has lost control. Likewise, the POWER Hypervisor can request a service processor repair action if necessary.
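
Mutual surveillance of this kind is commonly built on heartbeats with a timeout. The following Python sketch shows the idea with two watchdogs, one in each direction. The timeout, the timestamps, and the action strings are assumptions made for the example; the actual protocol between the service processor and the POWER Hypervisor is not described in this chapter.

```python
class Watchdog:
    """Flags a peer as failed when its last heartbeat is older than a timeout."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_beat = 0.0

    def heartbeat(self, now: float) -> None:
        """Record that the peer reported in at time `now` (seconds)."""
        self.last_beat = now

    def peer_alive(self, now: float) -> bool:
        return now - self.last_beat <= self.timeout


# Each side watches the other.
sp_watches_hypervisor = Watchdog(timeout=5.0)
hypervisor_watches_sp = Watchdog(timeout=5.0)

sp_watches_hypervisor.heartbeat(now=10.0)   # last heartbeat seen from the hypervisor
hypervisor_watches_sp.heartbeat(now=10.0)
hypervisor_watches_sp.heartbeat(now=14.0)   # the service processor keeps reporting in

actions = []
if not sp_watches_hypervisor.peer_alive(now=16.0):
    actions.append("service processor: call for service")
if not hypervisor_watches_sp.peer_alive(now=16.0):
    actions.append("hypervisor: request service processor repair")

assert actions == ["service processor: call for service"]
```

Explicit timestamps are passed in rather than read from a clock so that the example is deterministic.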

3.1.7 First Failure Data Capture

Diagnosing problems in a computer is a critical requirement for autonomic computing. The first step to producing a computer that truly has the ability to self-heal is to create a highly accurate way to identify and isolate hardware errors. IBM has implemented a server design that builds in hardware error-check stations that capture and help to identify error conditions within the server. Each of these checkers is viewed as a diagnostic probe into the server, and, when coupled with extensive diagnostic firmware routines, allows quick and accurate assessment of hardware error conditions at run-time.

• First Failure Data Capture (FFDC) check stations are carefully positioned within the server logic and data paths to help ensure that potential errors can be quickly identified and accurately tracked to an individual field-replaceable unit (FRU).

• These checkers are collected in a series of Fault Isolation Registers (FIRs), where they can easily be accessed by the service processor (SP).

• All communication between the SP and the FIRs is accomplished out of band. That is, operation of the error-detection mechanism is transparent to an operating system. This entire structure is below the architecture and is not seen, nor accessed, by system-level activities.
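
The idea of capturing the first failure and tracing it to a FRU can be sketched as follows. This Python example is purely illustrative: the bit assignments, the FRU names, and the class interface are hypothetical and are not taken from the POWER5 hardware.

```python
# Hypothetical mapping of checker bit positions to field-replaceable units (FRUs).
CHECKER_TO_FRU = {
    0: "processor card",
    1: "L3 cache module",
    2: "memory DIMM",
    3: "I/O planar",
}


class FaultIsolationRegister:
    """Latches checker bits and remembers which checker fired first."""

    def __init__(self):
        self.bits = 0
        self.first_failure = None

    def checker_fired(self, bit: int) -> None:
        """Latch an error bit; only the first one is kept as the root-cause candidate."""
        if self.first_failure is None:
            self.first_failure = bit
        self.bits |= 1 << bit

    def isolate(self):
        """What a service processor might do: map the first failure to a FRU."""
        if self.first_failure is None:
            return None
        return CHECKER_TO_FRU[self.first_failure]


fir = FaultIsolationRegister()
fir.checker_fired(2)      # a memory checker trips first
fir.checker_fired(0)      # a later, secondary error follows
assert fir.isolate() == "memory DIMM"
```

Recording which checker fired first is what separates the original fault from the secondary errors it triggers, so the failure does not have to be reproduced to be diagnosed.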

3.1.8 Environmental monitoring functions

Among the environmental monitoring functions available for the p5-510 server are the following:

• Temperature monitoring increases the fan rotation speed when the ambient temperature is above the normal operating range.

• Temperature monitoring warns the system administrator of potential environmentally related problems (for example, air conditioning and air circulation around the system) so that appropriate corrective actions can be taken before a critical failure threshold is reached. It also performs an orderly system shutdown when the operating temperature exceeds the critical level.

• Fan speed monitoring provides a warning and an orderly system shutdown when the speed is out of the operational specification.

• Voltage monitoring provides a warning and an orderly system shutdown when the voltages are out of the operational specification.
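
The escalation described for temperature monitoring amounts to comparing a reading against a series of thresholds. The following Python sketch shows one way to express that. All threshold values and action names are illustrative assumptions, not p5-510 specifications.

```python
def temperature_action(temp_c: float, normal_max: float = 35.0,
                       warning: float = 45.0, critical: float = 55.0) -> str:
    """Map an ambient temperature reading (degrees Celsius) to an action.

    The default thresholds are made-up values for illustration only.
    """
    if temp_c >= critical:
        return "orderly shutdown"
    if temp_c >= warning:
        return "warn administrator"
    if temp_c > normal_max:
        return "increase fan speed"
    return "normal"


readings = [25.0, 40.0, 50.0, 60.0]
assert [temperature_action(t) for t in readings] == [
    "normal",
    "increase fan speed",
    "warn administrator",
    "orderly shutdown",
]
```

Checking the most severe condition first ensures that a reading above the critical level always results in a shutdown rather than a milder action.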

3.1.9 Error handling and reporting

In the unlikely event of system hardware or environmentally induced failure, the system run-time error capture capability systematically analyzes the hardware error signature to determine the cause of failure.

• The analysis is stored in the system NVRAM. When the system can be successfully rebooted, either manually or automatically, the error is reported to the AIX 5L or Linux operating system.

• Error Log Analysis (ELA) can be used to display the failure cause and the physical location of failing hardware.

• With the integrated service processor, the system can automatically send an alert via phone line to a pager, or call for service, in the event of a critical system failure. A hardware fault also turns on the two Attention Indicators (one on the front of the system unit and the other on the rear) to alert the user to an internal hardware problem. The operator can also turn on the indicators as a tool for system identification. For identification, the indicators flash, whereas they are on solid when an error condition occurs.

3.1.10 Availability enhancement functions

The auto-restart (reboot) option, when enabled, can reboot the system automatically following an unrecoverable software error, software hang, hardware failure, or environmentally induced (ac power) failure.

3.2 Serviceability

The p5-510 server is designed for client setup of the machine and for subsequent addition of most features (adapters/devices). For a fee, IBM Service can perform the installation.

• The p5-510 server allows clients to replace service parts (Customer Replaceable Units) if they want to. The p5-510 server incorporates LEDs that indicate the parts needing to be replaced.

• The p5-510 server allows support personnel to remotely log into a system to review error logs and perform remote maintenance. The p5-510 server service processor enables the analysis of a system that will not boot.

• The diagnostics consist of Stand-alone Diagnostics, which are loaded from the DVD-ROM drive, and Online Diagnostics.

• Online Diagnostics, when installed, are resident with AIX 5L on the disk or system. They can be booted in single-user mode (service mode), run in maintenance mode, or run concurrently (concurrent mode) with other applications. They have access to the AIX 5L Error Log and the AIX 5L Configuration Data.

– Service mode allows checking of system devices and features.

– Concurrent mode allows the normal system functions to continue while selected resources are being checked.

– Maintenance mode allows checking of most system resources.


• The System Management Services (SMS) error log is accessible from the SMS menu for tests performed through SMS programs. For results of service processor tests, access the error log from the service processor menu.

3.2.1 Service Agent

Service Agent is available at no additional charge. When installed on an IBM eServer system, the Service Agent can enhance IBM's ability to provide the system with maintenance service.

The Service Agent:

• Monitors and analyzes system errors, and if needed, can automatically place a service call to IBM without client intervention

• Can help reduce the effect of business disruptions due to unplanned system outages and failures

• Performs problem analysis on a subset of hardware-related problems and, with client authorization, can report the results to IBM Service automatically

Note: Because the DVD-ROM (FC 2640) and DVD-RAM (FC 5751) drives are optional on the 9110-510 system, alternate methods for maintaining and servicing the system need to be available if neither drive is ordered. In that case, an external Internet connection must be available to maintain or update system microcode to the latest required level.

3.2.2 Online customer support

Online customer support (OCS) for hardware problem reporting may be performed via remote login by IBM eServer specialists. The Electronic Service Agent™ software can also be used for this capability.

AIX 5L support offerings will be under AIXSERV and Electronic Service Agent.

Note: This RAS function is not supported under Linux.

3.3 IBM eServer Cluster 1600

Today's IT infrastructure requires that systems meet increasing demands, while offering the flexibility and manageability to rapidly develop and deploy new services. IBM clustering hardware and software provide the building blocks, with availability, scalability, security, and single-point-of-management control, to satisfy these needs. The advantages of clusters are:

• Large-capacity data and transaction volumes, including support of mixed workloads

• Scale-up (add processors) or scale-out (add servers) without downtime

• Single point-of-control for distributed and clustered server management

• Simplified use of IT resources

• Designed for 24x7 access to data applications

• Business continuity in the event of disaster

IBM eServer Cluster 1600 is a POWER processor-based AIX 5L and Linux cluster targeting scientific and technical computing, large-scale databases, and workload consolidation. IBM Cluster Systems Management (CSM) is designed to provide a robust, powerful, and centralized way to manage a large number of POWER5 processor-based systems, all from a single point-of-control. CSM can help lower the overall cost of IT ownership by helping to simplify the tasks of installing, operating, and maintaining clusters of servers. CSM can provide one consistent interface for managing both AIX 5L and Linux nodes (physical systems or logical partitions), with capabilities for remote parallel network install, remote hardware control, and distributed command execution.

Cluster Systems Management V1.4 for AIX 5L and Linux on POWER is supported on the p5-510 server. For hardware control, a Hardware Management Console (HMC) is required.

Additionally, the p5-510 server is added to the hardware models supported with the pSeries cluster 1600 running CSM.

Information regarding the IBM eServer Cluster 1600, HMC control, cluster building block servers, and cluster software available can be found at: http://www-1.ibm.com/servers/eserver/clusters/hardware/1600.ht

