System Event Log Troubleshooting Guide for Intel® S5500/S3420



System Event Log Troubleshooting Guide for Intel® Server Boards

Intel order number G74211-002

Revision 1.1

December 2013

Platform Collaboration and Systems Division – Marketing

Revision History

Date            Revision Number   Modifications
August 2012     1.0               Initial draft.
December 2013   1.1               • Corrected IPMI Watchdog and PEF Sensors Typical Characteristics tables.
                                  • Clarified Channel designators for DIMM memory errors.
                                  • Added ME sensor 17h.


Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature .


Table of Contents

1. Introduction
  1.1 Purpose
  1.2 Industry Standard
    1.2.1 Intelligent Platform Management Interface (IPMI)
    1.2.2 Baseboard Management Controller (BMC)
    1.2.3 Intel® Intelligent Power Node Manager Version 1.5
2. Basic Decoding of a SEL Record
  2.1 Default Values in the SEL Records
3. Sensor Cross Reference List
  3.1 BMC owned Sensors (GID = 0020h)
  3.2 BIOS POST owned Sensors (GID = 0001h)
  3.3 BIOS SMI owned Sensors (GID = 0033h)
  3.4 Hot Swap Controller Firmware owned Sensors (GID = 00C0h/00C2h)
  3.5 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)
  3.6 Microsoft* OS owned Events (GID = 0041)
  3.7 Linux* Kernel Panic Events (GID = 0021)
4. Power Subsystems
  4.1 Voltage Sensors
  4.2 Power Unit
    4.2.1 Power Unit Status Sensor
    4.2.2 Power Unit Redundancy Sensor
  4.3 Power Supply
    4.3.1 Power Supply Status Sensors
    4.3.2 Power Supply AC Power Input Sensors
    4.3.3 Power Supply Current Output % Sensors
    4.3.4 Power Supply Temperature Sensors
5. Cooling Subsystem
  5.1 Fan Sensors
    5.1.1 Fan Speed Sensors
    5.1.2 Fan Presence and Redundancy Sensors
  5.2 Temperature Sensors
    5.2.1 Regular Temperature Sensors
    5.2.2 Thermal Margin Sensors
    5.2.3 Processor Thermal Control % Sensors
    5.2.4 Discrete Thermal Sensors
6. Processor Subsystem
  6.1 Processor Status Sensor
  6.2 Catastrophic Error Sensor
    6.2.1 Catastrophic Error Sensor – Next Steps
  6.3 CPU Missing Sensor
    6.3.1 CPU Missing Sensor – Next Steps
  6.4 QuickPath Interconnect Error Sensors
    6.4.1 QPI Correctable Error Sensor
    6.4.2 QPI Non-Fatal Error Sensor
    6.4.3 QPI Fatal and Fatal #2
7. Memory Subsystem
  7.1 Memory RAS Mirroring and Sparing
    7.1.1 Mirroring Configuration Status
    7.1.2 Mirrored Redundancy State Sensor
    7.1.3 Sparing Configuration Status
    7.1.4 Sparing Redundancy State Sensor
  7.2 ECC and Address Parity
    7.2.1 Memory Correctable and Uncorrectable ECC Error
    7.2.2 Memory Address Parity Error
8. PCI Express* and Legacy PCI Subsystem
  8.1 PCI Express* Errors
    8.1.1 PCI Express* Correctable Errors
    8.1.2 PCI Express* Fatal Errors
    8.1.3 Legacy PCI Errors
9. System BIOS Events
  9.1 System Events
    9.1.1 System Boot
    9.1.2 Timestamp Clock Synchronization
  9.2 System Firmware Progress (Formerly Post Error)
    9.2.1 System Firmware Progress (Formerly Post Error) – Next Steps
10. Chassis Subsystem
  10.1 Physical Security
    10.1.1 Chassis Intrusion
    10.1.2 LAN Leash Lost
  10.2 FP (NMI) Interrupt
    10.2.1 FP (NMI) Interrupt – Next Steps
  10.3 Button Press Events
11. Miscellaneous Events
  11.1 IPMI Watchdog
  11.2 SMI Timeout
    11.2.1 SMI Timeout – Next Steps
  11.3 System Event Log Cleared
  11.4 System Event – PEF Action
    11.4.1 System Event – PEF Action – Next Steps
12. Hot Swap Controller Events
  12.1 HSC Backplane Temperature Sensor
  12.2 HSC Drive Slot Status Sensor
    12.2.1 HSC Drive Slot Status Sensor – Next Steps
  12.3 HSC Drive Presence Sensor
    12.3.1 HSC Drive Presence Sensor – Next Steps
13. Manageability Engine (ME) Events
  13.1 Node Manager Exception Event
    13.1.1 Node Manager Exception Event – Next Steps
  13.2 Node Manager Health Event
    13.2.1 Node Manager Health Event – Next Steps
  13.3 Node Manager Operational Capabilities Change
    13.3.1 Node Manager Operational Capabilities Change – Next Steps
  13.4 Node Manager Alert Threshold Exceeded
    13.4.1 Node Manager Alert Threshold Exceeded – Next Steps
  13.5 ME Firmware Health Event
    13.5.1 ME Firmware Health Event – Next Steps
14. Microsoft Windows* Records
  14.1 Boot-up Event Records
  14.2 Shutdown Event Records
  14.3 Bug Check / Blue Screen Event Records
15. Linux* Kernel Panic Records

List of Tables

Table 1: SEL Record Format
Table 2: Event Request Message Event Data Field Contents
Table 3: OEM SEL Record (Type C0h-DFh)
Table 4: OEM SEL Record (Type E0h-FFh)
Table 5: BMC owned Sensors
Table 6: BIOS POST owned Sensors
Table 7: BIOS SMI owned Sensors
Table 8: Hot Swap Controller Firmware owned Sensors
Table 9: Management Engine Firmware owned Sensors
Table 10: Microsoft* OS owned Events
Table 11: Linux* Kernel Panic Events
Table 12: Voltage Sensors Typical Characteristics
Table 13: Voltage Sensors Event Triggers – Description
Table 14: Voltage Sensors – Next Steps
Table 15: Power Unit Status Sensors Typical Characteristics
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Table 17: Power Unit Redundancy Sensors Typical Characteristics
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
Table 19: Power Supply Status Sensors Typical Characteristics
Table 20: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps
Table 21: Power Supply AC Power Input Sensors Typical Characteristics
Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
Table 23: Power Supply Current Output % Sensors Typical Characteristics
Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
Table 25: Power Supply Temperature Sensors Typical Characteristics
Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Table 27: Fan Speed Sensors Typical Characteristics
Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps
Table 29: Fan Presence Sensors Typical Characteristics
Table 30: Fan Presence Sensors – Event Trigger Offset – Next Steps
Table 31: Fan Redundancy Sensors Typical Characteristics
Table 32: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
Table 33: Temperature Sensors Typical Characteristics
Table 34: Temperature Sensors Event Triggers – Description
Table 35: Temperature Sensors – Next Steps
Table 36: Thermal Margin Sensors Typical Characteristics
Table 37: Thermal Margin Sensors Event Triggers – Description
Table 38: Thermal Margin Sensors – Next Steps
Table 39: Processor Thermal Control % Sensors Typical Characteristics
Table 40: Processor Thermal Control % Sensors Event Triggers – Description
Table 41: Processor Thermal Control % Sensors – Next Steps
Table 42: Discrete Thermal Sensors Typical Characteristics
Table 43: Discrete Thermal Sensors – Next Steps
Table 44: Process Status Sensors Typical Characteristics
Table 45: Processor Status Sensors – Next Steps
Table 46: Catastrophic Error Sensor Typical Characteristics
Table 47: CPU Missing Sensor Typical Characteristics
Table 48: QPI Correctable Error Sensor Typical Characteristics
Table 49: QPI Non-Fatal Error Sensor Typical Characteristics
Table 50: QPI Fatal Error Sensor Typical Characteristics
Table 51: QPI Fatal #2 Error Sensor Typical Characteristics
Table 52: Mirroring Configuration Status Sensor Typical Characteristics
Table 53: Mirroring Configuration Status Sensor Event Trigger Offset – Next Steps
Table 54: Mirrored Redundancy State Sensor Typical Characteristics
Table 55: Mirrored Redundancy State Sensor Event Trigger Offset – Next Steps
Table 56: Sparing Configuration Status Sensor Typical Characteristics
Table 57: Sparing Configuration Status Sensor Event Trigger Offset – Next Steps
Table 58: Sparing Redundancy State Sensor Typical Characteristics
Table 59: Sparing Redundancy State Sensor Event Trigger Offset – Next Steps
Table 60: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Table 62: Address Parity Error Sensor Typical Characteristics
Table 63: PCI Express* Correctable Error Sensor Typical Characteristics
Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps
Table 65: PCI Express* Fatal Error Sensor Typical Characteristics
Table 66: PCI Express* Fatal Error Sensor Event Trigger Offset – Next Steps
Table 67: Legacy PCI Error Sensor Typical Characteristics
Table 68: Legacy PCI Error Sensor Event Trigger Offset – Next Steps
Table 69: System Event Sensor Typical Characteristics
Table 70: POST Error Sensor Typical Characteristics
Table 71: POST Error Codes
Table 72: Physical Security Sensor Typical Characteristics
Table 73: Physical Security Sensor Event Trigger Offset – Next Steps
Table 74: FP (NMI) Interrupt Sensor Typical Characteristics
Table 75: Button Press Events Sensor Typical Characteristics
Table 76: IPMI Watchdog Sensor Typical Characteristics
Table 77: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
Table 78: SMI Timeout Sensor Typical Characteristics
Table 79: System Event Log Cleared Sensor Typical Characteristics
Table 80: System Event – PEF Action Sensor Typical Characteristics
Table 81: HSC Backplane Temperature Sensor Typical Characteristics
Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Table 83: HSC Drive Slot Status Sensor Typical Characteristics
Table 84: HSC Drive Presence Sensor Typical Characteristics
Table 85: Node Manager Exception Sensor Typical Characteristics
Table 86: Node Manager Health Event Sensor Typical Characteristics
Table 87: Node Manager Operational Capabilities Change Sensor Typical Characteristics
Table 88: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
Table 89: ME Firmware Health Event Sensor Typical Characteristics
Table 90: ME Firmware Health Event Sensor – Next Steps
Table 91: Boot-up Event Record Typical Characteristics
Table 92: Boot-up OEM Event Record Typical Characteristics
Table 93: Shutdown Reason Code Event Record Typical Characteristics
Table 94: Shutdown Reason OEM Event Record Typical Characteristics
Table 95: Shutdown Comment OEM Event Record Typical Characteristics
Table 96: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics
Table 97: Bug Check / Blue Screen Code OEM Event Record Typical Characteristics
Table 98: Linux* Kernel Panic Event Record Characteristics
Table 99: Linux* Kernel Panic String Extended Record Characteristics


1. Introduction

The server management hardware that is part of Intel® Server Boards and Intel® Server Platforms serves as a vital part of the overall server management strategy. It provides essential information to the system administrator and lets the administrator control the server remotely, even when the operating system is not running.

Intel® Server Boards and Intel® Server Platforms offer comprehensive hardware-based and software-based solutions. The server management features make the servers simple to manage and provide alerting on system events. From entry-level to enterprise systems, good server management is essential to reducing total cost of ownership.

This Troubleshooting Guide is intended to help users better understand the events that are logged in the Baseboard Management Controller (BMC) System Event Log (SEL) on these Intel® Server Boards.

A separate User’s Guide covers general server management and the server management software offered on Intel® Server Boards and Intel® Server Platforms.

Server boards currently supported by this document:

• Intel® S3200/X38ML Server Boards
• Intel® S5500/S3420 Series Server Boards

1.1 Purpose

The purpose of this document is to list all possible events generated by the Intel® platform. Other sources (not under our control) may also generate events; those events are not described in this document.

1.2 Industry Standard

1.2.1 Intelligent Platform Management Interface (IPMI)

The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the inventory, monitoring, logging, and recovery control functions are available independently of the main processors, BIOS, and operating system. Platform management functions can also be made available when the system is in a power-down state.

IPMI works by interfacing with the BMC, which extends management capabilities in the server system and operates independently of the main processor by monitoring the on-board instrumentation. Through the BMC, IPMI also allows administrators to control power to the server, and remotely access BIOS configuration and operating system console information.

IPMI defines a common platform instrumentation interface to enable interoperability between:


• The baseboard management controller and chassis
• The baseboard management controller and systems management software
• Servers

IPMI enables the following:

• Common access to platform management information, consisting of:
  - Local access from systems management software
  - Remote access from LAN
  - Inter-chassis access from Intelligent Chassis Management Bus
  - Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the processor is down
• IPMI interface isolates systems management software from hardware. Hardware advancements can be made without impacting the systems management software.
• IPMI facilitates cross-platform management software.

You can find more information on IPMI at the following URL: http://www.intel.com/design/servers/ipmi

1.2.2 Baseboard Management Controller (BMC)

A baseboard management controller (BMC) is a specialized microcontroller embedded on most Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the intelligence behind intelligent platform management, that is, the autonomous monitoring and recovery features implemented directly in platform management hardware and firmware.

Different types of sensors built into the computer system report to the BMC on parameters such as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC monitors the system for critical events by communicating with various sensors on the system board; it sends alerts and logs events when certain parameters exceed their preset thresholds, indicating a potential failure of the system. The administrator can also remotely communicate with the BMC to take some corrective actions such as resetting or power cycling the system to get a hung OS running again. These abilities save on the total cost of ownership of a system.
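The threshold behavior described above can be illustrated with a minimal Python sketch. It is not BMC firmware: the class name, the two threshold levels, and the event tuples are hypothetical, and real sensors typically also apply hysteresis, which is omitted here. The sketch only shows the idea that an event is produced when a reading crosses a preset threshold, and again when it returns below it.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdSensor:
    """Hypothetical sensor model with two upper thresholds (illustration only)."""
    name: str
    upper_non_critical: float
    upper_critical: float
    asserted: set = field(default_factory=set)

    def update(self, reading: float) -> list:
        """Return the events caused by a new reading.

        An event is emitted only when a threshold state changes, so a
        reading that stays above a threshold does not produce repeats.
        """
        events = []
        levels = (
            ("upper non-critical", self.upper_non_critical),
            ("upper critical", self.upper_critical),
        )
        for label, limit in levels:
            crossed = reading >= limit
            if crossed and label not in self.asserted:
                self.asserted.add(label)
                events.append((self.name, label, "asserted", reading))
            elif not crossed and label in self.asserted:
                self.asserted.discard(label)
                events.append((self.name, label, "deasserted", reading))
        return events


# Made-up sensor name and limits, for illustration only.
sensor = ThresholdSensor("Example Temp", upper_non_critical=70, upper_critical=80)
for value in (65, 72, 72, 85, 60):
    for event in sensor.update(value):
        print(event)
```

In this model the assertion and deassertion events correspond to the entries an administrator would later find in the event log; the actual sensors, thresholds, and offsets for these boards are listed in the sensor chapters of this guide.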

For Intel® Server Boards and Intel® Server Platforms, the BMC supports the industry-standard IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.

1.2.2.1 System Event Log (SEL)

The BMC provides a centralized, non-volatile repository for critical, warning, and informational system events called the System Event Log or SEL. Having the BMC manage the SEL and logging functions helps to ensure that “post-mortem” logging information is available if a failure occurs that disables the system's processor(s).


The BMC allows access to the SEL through in-band and out-of-band mechanisms. Various tools and utilities can be used to access the SEL, including the Intel® SELViewer and multiple open-source IPMI tools.

1.2.3 Intel® Intelligent Power Node Manager

Intel® Intelligent Power Node Manager version 1.5 (NM) is a platform-resident technology that enforces power and thermal policies for the platform. These policies are applied by exploiting subsystem knobs (such as processor P and T states) that can be used to control power consumption. Intel® Intelligent Power Node Manager enables data center power and thermal management by exposing an external interface to management software through which platform policies can be specified. It also enables specific data center power management usage models such as power limiting.

The configuration and control commands are used by the external management software or the BMC to configure and control the Intel® Intelligent Power Node Manager feature. Because the Platform Services firmware does not have any external interface, external commands are first received by the BMC over LAN and then relayed to the Platform Services firmware over the IPMB channel. The BMC acts as a relay and as the transport conversion device for these commands. For simplicity, the commands from the management console might be encapsulated in a generic CONFIG packet format (config data length, config data blob) sent to the BMC, so that the BMC does not have to parse the actual configuration data.

The BMC provides the access point for remote commands from external management software and generates alerts to that software. Intel® Intelligent Power Node Manager on the Intel® Manageability Engine (Intel® ME) is an IPMI satellite controller. A mechanism needs to exist to forward commands to the Intel® ME and send the response back to the originator. Similarly, events from the Intel® ME have to be sent as alerts outside of the BMC. It is the responsibility of the BMC to implement these mechanisms for communication with Intel® Intelligent Power Node Manager.

The full specification can be downloaded from the following link: http://www.intel.com/content/dam/doc/technical-specification/intelligent-power-node-manager-1-5-specification.pdf

2. Basic Decoding of a SEL Record

The System Event Log (SEL) record format is defined in the IPMI Specification. The following section provides a basic definition for each of the fields in a SEL record. For more details, see the IPMI Specification.

The definitions for the standard SEL can be found in Table 1.

The definitions for the OEM defined event logs can be found in Table 3 and Table 4.

2.1 Default Values in the SEL Records

Unless otherwise noted in the event record descriptions, the following are the default values in all SEL entries.

• Byte [3] = Record Type (RT) = 02h = System event record

• Byte [9:8] = Generator ID = 0020h = BMC Firmware

• Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0

Table 1: SEL Record Format

Bytes 1-2 – Record ID (RID)
    ID used for SEL Record access.

Byte 3 – Record Type (RT)
    [7:0] – Record Type
        02h = System event record
        C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
        E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)

Bytes 4-7 – Timestamp (TS)
    Time when event was logged. LS byte first.
    Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
    Note: There are various websites that will convert the raw number to a date/time.

Bytes 8-9 – Generator ID (GID)
    RqSA and LUN if event was generated from IPMB.
    Software ID if event was generated from system software.
    Byte 1
        [7:1] – 7-bit I2C Slave Address, or 7-bit system software ID
        [0] – 0b = ID is IPMB Slave Address
              1b = System software ID
        Software ID values:
        • 0001h – BIOS POST for POST errors, RAS Configuration/State, Timestamp Synch, OS Boot events
        • 0033h – BIOS SMI Handler
        • 0020h – BMC Firmware
        • 002Ch – ME Firmware
        • 0041h – Server Management Software
        • 00C0h – HSC Firmware – HSBP A
        • 00C2h – HSC Firmware – HSBP B
    Byte 2
        [7:4] – Channel number. Channel that event message was received over. 0h if the event message was received from the system interface, primary IPMB, or internally generated by the BMC.
        [3:2] – Reserved. Write as 00b.
        [1:0] – IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.

Byte 10 – EvM Rev (ER)
    Event Message format version. 04h = IPMI v2.0; 03h = IPMI v1.0

Byte 11 – Sensor Type (ST)
    Sensor Type Code for sensor that generated the event

Byte 12 – Sensor # (SN)
    Number of sensor that generated the event (From SDR)

Byte 13 – Event Dir | Event Type (EDIR)
    Event Dir
        [7] – 0b = Assertion event.
              1b = Deassertion event.
    Event Type
        Type of trigger for the event, for example, critical threshold going high, state asserted, and so on. Also indicates class of the event. For example, discrete, threshold, or OEM.
        The Event Type field is encoded using the Event/Reading Type Code.
        [6:0] – Event Type Codes
            01h = Threshold (States = 00h-0Bh)
            02h-0Ch = Discrete
            6Fh = Sensor-Specific
            70h-7Fh = OEM

Byte 14 – Event Data 1 (ED1)
Byte 15 – Event Data 2 (ED2)
Byte 16 – Event Data 3 (ED3)
    Per Table 2: Event Request Message Event Data Field Contents
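The field layout in Table 1 can be sketched as a small Python decoder. This is only an illustration, not a tool shipped with the platform: the function name `parse_sel_record` and the sample record are hypothetical, the sample reuses the timestamp bytes from the Table 1 example, and the Generator ID is assembled so that it reads the way this guide prints it (for example 0020h).

```python
import struct
from datetime import datetime, timezone


def parse_sel_record(raw: bytes) -> dict:
    """Split a 16-byte system event record (Record Type 02h) into Table 1 fields."""
    if len(raw) != 16:
        raise ValueError("expected a 16-byte SEL record")
    # "<" selects little-endian, matching "LS byte first" in Table 1.
    (record_id, record_type, ts, gid_byte1, gid_byte2,
     evm_rev, sensor_type, sensor_num, edir,
     ed1, ed2, ed3) = struct.unpack("<HBI9B", raw)
    return {
        "record_id": record_id,
        "record_type": record_type,
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc),
        # Byte 8 is the low-order byte, so BMC firmware reads as 0x0020.
        "generator_id": (gid_byte2 << 8) | gid_byte1,
        "evm_rev": evm_rev,
        "sensor_type": sensor_type,
        "sensor_number": sensor_num,
        "deassertion": bool(edir & 0x80),   # byte 13, bit 7
        "event_type": edir & 0x7F,          # byte 13, bits 6:0
        "event_data": (ed1, ed2, ed3),
    }


# Hypothetical record: voltage sensor 10h, threshold event, logged by the BMC.
SAMPLE = bytes.fromhex("01 00 02 29 76 68 4C 20 00 04 02 10 01 52 8A 8C")
record = parse_sel_record(SAMPLE)
print(record["timestamp"].isoformat())  # 2010-08-15T23:20:09+00:00
```

Only Record Type 02h follows this layout; OEM record types reuse some of these byte positions for OEM-defined data.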

Table 2: Event Request Message Event Data Field Contents

Sensor Class: Threshold
    Event Data 1
        [7:6] – 00b = Unspecified Event Data 2
                01b = Trigger reading in Event Data 2
                10b = OEM code in Event Data 2
                11b = Sensor-specific event extension code in Event Data 2
        [5:4] – 00b = Unspecified Event Data 3
                01b = Trigger threshold value in Event Data 3
                10b = OEM code in Event Data 3
                11b = Sensor-specific event extension code in Event Data 3
        [3:0] – Offset from Event/Reading Code for threshold event.
    Event Data 2 – Reading that triggered event, FFh or not present if unspecified.
    Event Data 3 – Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present.

Sensor Class: Discrete
    Event Data 1
        [7:6] – 00b = Unspecified Event Data 2
                01b = Previous state and/or severity in Event Data 2
                10b = OEM code in Event Data 2
                11b = Sensor-specific event extension code in Event Data 2
        [5:4] – 00b = Unspecified Event Data 3
                01b = Reserved
                10b = OEM code in Event Data 3
                11b = Sensor-specific event extension code in Event Data 3
        [3:0] – Offset from Event/Reading Code for discrete event state
    Event Data 2
        [7:4] – Optional offset from “Severity” Event/Reading Code (0Fh if unspecified).
        [3:0] – Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).
    Event Data 3 – Optional OEM code. FFh or not present if unspecified.

Sensor Class: OEM
    Event Data 1
        [7:6] – 00b = Unspecified Event Data 2
                01b = Previous state and/or severity in Event Data 2
                10b = OEM code in Event Data 2
                11b = Reserved
        [5:4] – 00b = Unspecified Event Data 3
                01b = Reserved
                10b = OEM code in Event Data 3
                11b = Reserved
        [3:0] – Offset from Event/Reading Type Code
    Event Data 2
        [7:4] – Optional OEM code bits or offset from “Severity” Event/Reading Type Code (0Fh if unspecified).
        [3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).
    Event Data 3 – Optional OEM code. FFh or not present if unspecified.
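As a rough illustration of how the Event Data 1 bit fields in Table 2 are read, the sketch below splits the byte into its three parts. The function name `decode_event_data1` and the class keys `"threshold"`, `"discrete"`, and `"oem"` are hypothetical labels chosen for this example; the code strings are paraphrased from Table 2.

```python
# Meaning of Event Data 1 bits [7:6] (contents of Event Data 2), per sensor class.
ED2_MEANING = {
    "threshold": ["unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"],
    "discrete": ["unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"],
    "oem": ["unspecified", "previous state and/or severity", "OEM code",
            "reserved"],
}

# Meaning of Event Data 1 bits [5:4] (contents of Event Data 3), per sensor class.
ED3_MEANING = {
    "threshold": ["unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"],
    "discrete": ["unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"],
    "oem": ["unspecified", "reserved", "OEM code", "reserved"],
}


def decode_event_data1(ed1: int, sensor_class: str) -> dict:
    """Return what Event Data 2 and 3 hold, plus the event offset in bits [3:0]."""
    return {
        "ed2": ED2_MEANING[sensor_class][(ed1 >> 6) & 0x03],
        "ed3": ED3_MEANING[sensor_class][(ed1 >> 4) & 0x03],
        "offset": ed1 & 0x0F,
    }


# 52h = 01 01 0010b: reading in ED2, threshold in ED3, offset 02h.
print(decode_event_data1(0x52, "threshold"))
```

The sensor class itself comes from the Event Type code in byte 13 (01h for threshold, 02h-0Ch for discrete, 70h-7Fh for OEM).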

Table 3: OEM SEL Record (Type C0h-DFh)

Bytes 1-2 – Record ID (RID)
    ID used for SEL Record access.

Byte 3 – Record Type (RT)
    [7:0] – Record Type
        C0h-DFh = OEM timestamped, bytes 8-16 OEM defined

Bytes 4-7 – Timestamp (TS)
    Time when event was logged. LS byte first.
    Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
    Note: There are various websites that will convert the raw number to a date/time.

Bytes 8-10 – Manufacturer ID
    LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA “Private Enterprise” ID.
    Most significant four bits = Reserved (0000b).
    000000h = Unspecified. 0FFFFFh = Reserved.
    This value is binary encoded.
    For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.

Bytes 11-16 – OEM Defined
    OEM Defined. This is defined according to the manufacturer identified by the Manufacturer ID field.
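The Manufacturer ID byte order described in Table 3 can be checked with a few lines of Python. This is a minimal sketch; the function name `decode_manufacturer_id` is a hypothetical helper, and the input is the IPMI forum example given above.

```python
def decode_manufacturer_id(byte8: int, byte9: int, byte10: int) -> int:
    """Combine bytes 8-10 (LS byte first) into the 20-bit Manufacturer ID."""
    value = byte8 | (byte9 << 8) | (byte10 << 16)
    # The most significant four bits are reserved, so keep only 20 bits.
    return value & 0x0FFFFF


# Bytes F2h, 1Bh, 00h from the Table 3 example.
print(decode_manufacturer_id(0xF2, 0x1B, 0x00))  # 7154
```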

Table 4: OEM SEL Record (Type E0h-FFh)

Bytes 1-2 – Record ID (RID)
    ID used for SEL Record access.

Byte 3 – Record Type (RT)
    [7:0] – Record Type
        E0h-FFh = OEM system event record

Bytes 4-16 – OEM
    OEM Defined. This is defined by the system integrator.
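Because the Record Type byte decides which of Tables 1, 3, and 4 applies, a decoder would normally check it first. The sketch below shows one way to do that; the function name `classify_record_type` and the returned labels are hypothetical, and values outside the three documented ranges are simply reported as unknown.

```python
def classify_record_type(record_type: int):
    """Return (label, first OEM-defined byte number or None) for byte 3 of a record."""
    if record_type == 0x02:
        return ("system event record", None)        # Table 1 layout
    if 0xC0 <= record_type <= 0xDF:
        return ("OEM timestamped", 8)               # Table 3: bytes 8-16 OEM defined
    if 0xE0 <= record_type <= 0xFF:
        return ("OEM non-timestamped", 4)           # Table 4: bytes 4-16 OEM defined
    return ("unknown", None)


for rt in (0x02, 0xDC, 0xF0):
    print(f"{rt:02X}h -> {classify_record_type(rt)}")
```

A non-timestamped record has no Timestamp field, so bytes 4-7 must not be converted to a date for Record Types E0h-FFh.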

3. Sensor Cross Reference List

This section contains a cross reference to help find details on any specific SEL entry.

3.1 BMC owned Sensors (GID = 0020h)

The following table can be used to find the details of sensors owned by the BMC.

Table 5: BMC owned Sensors

Sensor Number | Sensor Name | Details Section | Next Steps
01h | Power Unit Status (Pwr Unit Status) | Power Unit Status Sensor | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
02h | Power Unit Redundancy (Pwr Unit Redund) | Power Unit Redundancy Sensor | Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
03h | IPMI Watchdog (IPMI Watchdog) | IPMI Watchdog | Table 77: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
04h | Physical Security (Physical Scrty) | Physical Security | Table 73: Physical Security Sensor Event Trigger Offset – Next Steps
05h | FP Interrupt (FP NMI Diag Int) | FP (NMI) Interrupt | FP (NMI) Interrupt – Next Steps
06h | SMI Timeout (SMI Timeout) | SMI Timeout | SMI Timeout – Next Steps
07h | System Event Log (System Event Log) | System Event Log Cleared | Not applicable
08h | System Event (System Event) | System Event – PEF Action | System Event – PEF Action – Next Steps
09h | Button Press Event (Button Press) | Button Press Events | Not applicable
10h | BB +1.1V IOH (BB +1.1V IOH) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
11h | BB +1.1V P1 Vccp (BB +1.1V P1 Vccp) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
12h | BB +1.1V P2 Vccp (BB +1.1V P2 Vccp) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
13h | BB +1.5V P1 DDR3 (BB +1.5V P1 DDR3) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
14h | BB +1.5V P2 DDR3 (BB +1.5V P2 DDR3) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
15h | BB +1.8V AUX (BB +1.8V AUX) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
16h | BB +3.3V (BB +3.3V) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
17h | BB +3.3V STBY (BB +3.3V STBY) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
18h | BB +3.3V Vbat (BB +3.3V Vbat) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
19h | BB +5.0V (BB +5.0V) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
1Ah | BB +5.0V STBY (BB +5.0V STBY) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
1Bh | BB +12.0V (BB +12.0V) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
1Ch | BB -12.0V (BB -12.0V) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
1Dh | BB +1.35V P1 LV DDR3 (BB +1.35v P1 MEM) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
1Eh | BB +1.35V P2 LV DDR3 (BB +1.35v P2 MEM) | Voltage Sensors | Table 14: Voltage Sensors – Next Steps
20h | Baseboard Temperature (Baseboard Temp) | Regular Temperature Sensors | Table 35: Temperature Sensors – Next Steps
21h | Front Panel Temperature (Front Panel Temp) | Regular Temperature Sensors | Table 35: Temperature Sensors – Next Steps
22h | IOH Thermal Margin (IOH Therm Margin) | Thermal Margin Sensors | Table 38: Thermal Margin Sensors – Next Steps
23h | Processor 1 Memory Thermal Margin (Mem P1 Thrm Mrgn) | Thermal Margin Sensors | Table 38: Thermal Margin Sensors – Next Steps
24h | Processor 2 Memory Thermal Margin (Mem P2 Thrm Mrgn) | Thermal Margin Sensors | Table 38: Thermal Margin Sensors – Next Steps
30h-39h | Fan Tachometer Sensors (Chassis specific sensor names) | Fan Speed Sensors | Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps
40h-45h | Fan Present Sensors (Fan x Present) | Fan Presence and Redundancy Sensors | Table 30: Fan Presence Sensors – Event Trigger Offset – Next Steps
46h | Fan Redundancy (Fan Redundancy) | Fan Presence and Redundancy Sensors | Table 32: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
50h | Power Supply 1 Status (PS1 Status) | Power Supply Status Sensors | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
51h | Power Supply 2 Status (PS2 Status) | Power Supply Status Sensors | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
52h | Power Supply 1 AC Power Input (PS1 Power In) | Power Supply AC Power Input Sensors | Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
53h | Power Supply 2 AC Power Input (PS2 Power In) | Power Supply AC Power Input Sensors | Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
54h | Power Supply 1 +12V % of Maximum Current Output (PS1 Curr Out %) | Power Supply Current Output % Sensors | Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
55h | Power Supply 2 +12V % of Maximum Current Output (PS2 Curr Out %) | Power Supply Current Output % Sensors | Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
56h | Power Supply 1 Temperature (PS1 Temperature) | Power Supply Temperature Sensors | Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
57h | Power Supply 2 Temperature (PS2 Temperature) | Power Supply Temperature Sensors | Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
60h | Processor 1 Status (P1 Status) | Processor Status Sensor | Table 45: Processor Status Sensors – Next Steps
61h | Processor 2 Status (P2 Status) | Processor Status Sensor | Table 45: Processor Status Sensors – Next Steps
62h | Processor 1 Thermal Margin (P1 Therm Margin) | Thermal Margin Sensors | Table 38: Thermal Margin Sensors – Next Steps
63h | Processor 2 Thermal Margin (P2 Therm Margin) | Thermal Margin Sensors | Table 38: Thermal Margin Sensors – Next Steps
64h | Processor 1 Thermal Control % (P1 Therm Ctrl %) | Processor Thermal Control % Sensors | Table 41: Processor Thermal Control % Sensors – Next Steps
65h | Processor 2 Thermal Control % (P2 Therm Ctrl %) | Processor Thermal Control % Sensors | Table 41: Processor Thermal Control % Sensors – Next Steps
66h | Processor 1 VRD Temp (P1 VRD Hot) | Discrete Thermal Sensors | Table 43: Discrete Thermal Sensors
67h | Processor 2 VRD Temp (P2 VRD Hot) | Discrete Thermal Sensors | Table 43: Discrete Thermal Sensors
68h | Catastrophic Error (CATERR) | Catastrophic Error Sensor | Catastrophic Error Sensor – Next Steps
69h | CPU Missing (CPU Missing) | CPU Missing Sensor | CPU Missing Sensor – Next Steps
6Ah | IOH Thermal Trip (IOH Thermal Trip) | Discrete Thermal Sensors | Table 43: Discrete Thermal Sensors

3.2 BIOS POST owned Sensors (GID = 0001h)

The following table can be used to find the details of sensors owned by BIOS POST.

Table 6: BIOS POST owned Sensors

Sensor Number | Sensor Name | Details Section | Next Steps
01h | Mirroring Redundancy State | Mirrored Redundancy State Sensor | Table 55: Mirrored Redundancy State Sensor Event Trigger Offset – Next Steps
06h | POST Error | System Firmware Progress (Formerly Post Error) | System Firmware Progress (Formerly Post Error) – Next Steps
11h | Sparing Redundancy State | Sparing Redundancy State Sensor | Table 59: Sparing Redundancy State Sensor Event Trigger Offset – Next Steps
12h | Mirroring Configuration Status | Mirroring Configuration Status | Table 53: Mirroring Configuration Status Sensor Event Trigger Offset – Next Steps
13h | Sparing Configuration Status | Sparing Configuration Status | Table 57: Sparing Configuration Status Sensor Event Trigger Offset – Next Steps
83h | System Event | System Events | Not applicable

3.3 BIOS SMI owned Sensors (GID = 0033h)

The following table can be used to find the details of sensors owned by BIOS SMI.

Table 7: BIOS SMI owned Sensors

Sensor Number | Sensor Name | Details Section | Next Steps
02h | Memory ECC Error | Memory Correctable and Uncorrectable ECC Error | Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
03h | Legacy PCI Error | Legacy PCI Errors | Table 68: Legacy PCI Error Sensor Event Trigger Offset – Next Steps
04h | PCI Express Fatal Error | PCI Express Fatal Errors | Table 66: PCI Express* Fatal Error Sensor Event Trigger Offset – Next Steps
05h | PCI Express Correctable Error | PCI Express Correctable errors | Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps
06h | Intel® QuickPath Interface Correctable Error | QPI Correctable Error Sensor | QPI Correctable Error Sensor – Next Steps
07h | Intel® QuickPath Interface Nonfatal Error | QPI Non-Fatal Error Sensor | QPI Non-Fatal Error Sensor – Next Steps
14h | Memory Address Parity Error | Memory Address Parity Error | Memory Address Parity Error Sensor Next Steps
17h | Intel® QuickPath Interface Fatal Error | QPI Fatal and Fatal #2 | QPI Fatal and Fatal #2 – Next Steps
18h | Intel® QuickPath Interface Fatal2 Error | QPI Fatal and Fatal #2 | QPI Fatal and Fatal #2 – Next Steps
83h | System Event | System Events | Not applicable

3.4 Hot Swap Controller Firmware owned Sensors (GID = 00C0h/00C2h)

The following table can be used to find the details of sensors owned by the Hot Swap Controller (HSC) firmware. The HSC firmware resides on a Hot Swap Back Plane (HSBP). There can be up to two HSBPs in a system. Each HSBP will have its own GID.

• 00C0h = HSC Firmware – HSBP A

• 00C2h = HSC Firmware – HSBP B

Table 8: Hot Swap Controller Firmware owned Sensors

Sensor Number | Sensor Name | Details Section | Next Steps
01h | Backplane Temperature | HSC Backplane Temperature Sensor | Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
02h | Drive Slot 0 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
03h | Drive Slot 1 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
04h | Drive Slot 2 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
05h | Drive Slot 3 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
06h | Drive Slot 4 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
07h | Drive Slot 5 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps

6 Slot HSBP
08h | Drive Slot 0 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
09h | Drive Slot 1 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Ah | Drive Slot 2 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Bh | Drive Slot 3 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Ch | Drive Slot 4 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Dh | Drive Slot 5 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps

8 Slot HSBP
08h | Drive Slot 6 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
09h | Drive Slot 7 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
0Ah | Drive Slot 0 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Bh | Drive Slot 1 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Ch | Drive Slot 2 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Dh | Drive Slot 3 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Eh | Drive Slot 4 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Fh | Drive Slot 5 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
10h | Drive Slot 6 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
11h | Drive Slot 7 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps

3.5 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)

The following table can be used to find the details of sensors owned by the Node Manager / Management Engine (ME) firmware.

Table 9: Management Engine Firmware owned Sensors

Sensor Number | Sensor Name | Details Section | Next Steps
17h | ME Firmware Health Events | ME Firmware Health Event | ME Firmware Health Event – Next Steps
18h | Node Manager Exception Events | Node Manager Exception Event | Node Manager Exception Event – Next Steps
19h | Node Manager Health Events | Node Manager Health Event | Node Manager Health Event – Next Steps
1Ah | Node Manager Operational Capabilities Change Events | Node Manager Operational Capabilities Change | Node Manager Operational Capabilities Change – Next Steps
1Bh | Node Manager Alert Threshold Exceeded Events | Node Manager Alert Threshold Exceeded | Node Manager Alert Threshold Exceeded – Next Steps
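The two GID values given for the ME firmware (002Ch and 602Ch) can be compared by splitting the Generator ID into the bit fields defined in Table 1. The sketch below is illustrative only: `decode_generator_id` is a hypothetical helper, and it assumes the four-digit GID notation used in this guide places Generator ID byte 1 in the low-order byte, which is consistent with 0020h denoting the BMC.

```python
def decode_generator_id(gid: int) -> dict:
    """Split a 16-bit Generator ID into the Table 1 bit fields."""
    byte1 = gid & 0xFF          # slave address or software ID, plus type flag
    byte2 = (gid >> 8) & 0xFF   # channel number and LUN
    return {
        "is_software_id": bool(byte1 & 0x01),  # [0]: 1b = system software ID
        "id_7bit": byte1 >> 1,                 # [7:1]: 7-bit address or software ID
        "channel": byte2 >> 4,                 # [7:4]: channel number
        "lun": byte2 & 0x03,                   # [1:0]: IPMB device LUN
    }


for gid in (0x002C, 0x602C, 0x0033):
    print(f"{gid:04X}h -> {decode_generator_id(gid)}")
```

Under that assumption, 002Ch and 602Ch carry the same slave address and differ only in the channel number field (0h versus 6h), while 0033h has bit 0 set and is therefore a system software ID.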

3.6 Microsoft* OS owned Events (GID = 0041h)

The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).

Table 10: Microsoft* OS owned Events

Sensor Name | Record Type | Sensor Type | Details Section | Next Steps
Boot Event | 02h | 1Fh = OS Boot | Table 91: Boot-up Event Record Typical Characteristics | Not applicable
Boot Event | DCh | Not applicable | Table 92: Boot-up OEM Event Record Typical Characteristics | Not applicable
Shutdown Event | 02h | 20h = OS Stop/Shutdown | Table 93: Shutdown Reason Code Event Record Typical Characteristics | Not applicable
Shutdown Event | DDh | Not applicable | Table 94: Shutdown Reason OEM Event Record Typical Characteristics; Table 95: Shutdown Comment OEM Event Record Typical Characteristics | Not applicable
Bug Check / Blue Screen | 02h | 20h = OS Stop/Shutdown | Table 96: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics | Not applicable
Bug Check / Blue Screen | DEh | Not applicable | Table 97: Bug Check / Blue Screen Code OEM Event Record Typical Characteristics | Not applicable

3.7 Linux* Kernel Panic Events (GID = 0021h)

The following table can be used to find the details of records that can be generated when there is a Linux* Kernel panic.

Table 11: Linux* Kernel Panic Events

Sensor Name | Record Type | Sensor Type | Details Section | Next Steps
Linux* Kernel Panic | 02h | 20h = OS Stop/Shutdown | Table 98: Linux* Kernel Panic Event Record Characteristics | Not applicable
Linux* Kernel Panic | F0h | Not applicable | Table 99: Linux* Kernel Panic String Extended Record Characteristics | Not applicable

4. Power Subsystems

The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.

4.1 Voltage Sensors

The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI-compliant analog/threshold sensors.

Note: A voltage error could be caused by the device supplying the voltage or by the device using the voltage. For each sensor, Table 14 notes which component supplies the voltage and which component uses it.

Table 12: Voltage Sensors Typical Characteristics

Byte 11 – Sensor Type
    02h = Voltage

Byte 12 – Sensor Number
    See Table 14

Byte 13 – Event Direction and Event Type
    [7] Event direction
        0b = Assertion Event
        1b = Deassertion Event
    [6:0] Event Type = 01h (Threshold)

Byte 14 – Event Data 1
    [7:6] – 01b = Trigger reading in Event Data 2
    [5:4] – 01b = Trigger threshold in Event Data 3
    [3:0] – Event Triggers as described in Table 13

Byte 15 – Event Data 2
    Reading that triggered event

Byte 16 – Event Data 3
    Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 13: Voltage Sensors Event Triggers – Description

Hex | Event Trigger | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The voltage has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The voltage has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The voltage has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The voltage has gone over its upper critical threshold.
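To show how bytes 13 and 14 of a voltage event map onto the severities in Table 13, here is a minimal lookup sketch. The function name `describe_voltage_event` is hypothetical, and the severity pairs reflect one reading of Table 13 (assertion severity first, deassertion severity second).

```python
# Event trigger offset -> (trigger name, assertion severity, deassertion severity)
VOLTAGE_TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def describe_voltage_event(edir: int, ed1: int):
    """Return (trigger, direction, severity), or None for an offset not in Table 13."""
    entry = VOLTAGE_TRIGGERS.get(ed1 & 0x0F)   # offset is Event Data 1 bits [3:0]
    if entry is None:
        return None
    name, assert_severity, deassert_severity = entry
    if edir & 0x80:                            # byte 13 bit 7 set = deassertion
        return (name, "deassertion", deassert_severity)
    return (name, "assertion", assert_severity)


# Byte 13 = 01h (assertion, threshold), Event Data 1 = 52h (offset 02h).
print(describe_voltage_event(0x01, 0x52))
```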

Table 14: Voltage Sensors – Next Steps

10h – BB +1.1V IOH
    This 1.1V line is supplied by the main board.
    This 1.1V line is used by the I/O hub (IOH).
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the main board.

11h – BB +1.1V P1 Vccp
    This 1.1V line is supplied by the main board.
    This 1.1V line is used by processor 1.
    1. Ensure all cables are connected correctly.
    2. Cross test the processor if possible. If the issue remains with the socket, replace the main board, otherwise the processor.

12h – BB +1.1V P2 Vccp
    This 1.1V line is supplied by the main board.
    This 1.1V line is used by processor 2.
    1. Ensure all cables are connected correctly.
    2. Cross test the processor if possible. If the issue remains with the socket, replace the main board, otherwise the processor.

13h – BB +1.5V P1 DDR3
    This 1.5V line is supplied by the main board.
    This 1.5V line is used by the memory on processor 1.
    1. Ensure all cables are connected correctly.
    2. Check the DIMMs are seated properly.
    3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise replace the DIMM.

14h – BB +1.5V P2 DDR3
    This 1.5V line is supplied by the main board.
    This 1.5V line is used by the memory on processor 2.
    1. Ensure all cables are connected correctly.
    2. Check the DIMMs are seated properly.
    3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise replace the DIMM.

15h – BB +1.8V AUX
    +1.8V is supplied by the main board.
    +1.8V is used by the onboard NIC and I/O hub.
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the main board.

16h – BB +3.3V
    +3.3V is supplied by the power supplies.
    +3.3V is used by the PCIe and PCI-X slots.
    1. Ensure all cables are connected correctly.
    2. Reseat any PCI cards, and try them in other slots.
    3. If the issue follows the card, swap it, otherwise, replace the main board.
    4. If the issue remains, replace the power supplies.

17h – BB +3.3V STBY
    +3.3V Stby is supplied by the main board.
    +3.3V Stby is used by the BMC, Onboard NIC, IOH, and ICH.
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the board.
    3. If the issue remains, replace the power supplies.

18h – BB +3.3V Vbat
    +3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on.
    +3.3V Vbat is used by the CMOS and related circuits.
    1. Replace the CMOS battery. Any battery of type CR2032 can be used.
    2. If error remains (unlikely), replace the board.

19h – BB +5.0V
    +5.0V is supplied by the power supplies.
    +5.0V is used by the PCI slots.
    1. Ensure all cables are connected correctly.
    2. Reseat any PCI cards, and try them in other slots.
    3. If the issue follows the card, swap it, otherwise, replace the main board.
    4. If the issue remains, replace the power supplies.

1Ah – BB +5.0V STBY
    +5.0V STBY is supplied by the power supplies.
    +5.0V STBY is used to generate other standby voltages.
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the board.
    3. If the issue remains, replace the power supplies.

1Bh – BB +12.0V
    +12V is supplied by the power supplies.
    +12V is used by SATA drives, Fans, and PCI cards. In addition it is used to generate various processor voltages.
    1. Ensure all cables are connected correctly.
    2. Check connections on fans and HDDs.
    3. If the issue follows the component, swap it, otherwise, replace the board.
    4. If the issue remains, replace the power supplies.

1Ch – BB -12.0V
    -12V is supplied by the power supplies.
    -12V is used by the serial port and by PCI cards. In addition it is used to generate various processor voltages.
    1. Ensure all cables are connected correctly.
    2. Reseat any PCI cards, and try them in other slots.
    3. If the issue follows the card, swap it, otherwise, replace the main board.
    4. If the issue remains, replace the power supplies.

1Dh – BB +1.35 P1 Mem
    This 1.35V line is supplied by the main board.
    This 1.35V line is used by low voltage memory on processor 1.
    1. Ensure all cables are connected correctly.
    2. Check the DIMMs are seated properly.
    3. Cross test the DIMMs.
    4. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

1Eh – BB +1.35 P2 Mem
    This 1.35V line is supplied by the main board.
    This 1.35V line is used by low voltage memory on processor 2.
    1. Ensure all cables are connected correctly.
    2. Check the DIMMs are seated properly.
    3. Cross test the DIMMs.
    4. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

4.2 Power Unit

The power unit monitors the power state of the system and logs the state changes in the SEL.

4.2.1 Power Unit Status Sensor

The power unit status sensor monitors the power state of the system and logs state changes. Expected power-on events such as DC ON/OFF are logged and unexpected events are also logged, such as AC loss and power good loss.

Table 15: Power Unit Status Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 09h = Power Unit
12 | Sensor Number | 01h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] = Sensor Specific offset as described in Table 16
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
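The byte layout used in these Typical Characteristics tables can be decoded mechanically. The following Python sketch is illustrative only. It assumes a raw 16-byte SEL record is already available as a bytes object (how the record is retrieved is not covered here), and the names SelEvent and parse_sel_event are hypothetical. Byte numbers in the tables count from 1, so byte 13 is index 12 in the code.

```python
from dataclasses import dataclass


@dataclass
class SelEvent:
    sensor_type: int     # byte 11
    sensor_number: int   # byte 12
    deassertion: bool    # byte 13, bit 7
    event_type: int      # byte 13, bits 6:0
    offset: int          # byte 14, bits 3:0
    event_data2: int     # byte 15
    event_data3: int     # byte 16


def parse_sel_event(record: bytes) -> SelEvent:
    """Split a 16-byte SEL record into the fields used in the tables."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    dir_and_type = record[12]  # byte 13 (tables count from 1)
    return SelEvent(
        sensor_type=record[10],
        sensor_number=record[11],
        deassertion=bool(dir_and_type & 0x80),
        event_type=dir_and_type & 0x7F,
        offset=record[13] & 0x0F,
        event_data2=record[14],
        event_data3=record[15],
    )


# Hypothetical record: Power Unit (09h), sensor 01h, assertion,
# Event Type 6Fh, sensor specific offset 04h (AC Lost).
example = bytes([0] * 10 + [0x09, 0x01, 0x6F, 0x04, 0xFF, 0xFF])
event = parse_sel_event(example)
print(f"type={event.sensor_type:02X}h sensor={event.sensor_number:02X}h "
      f"offset={event.offset:02X}h deassertion={event.deassertion}")
```

The decoded offset can then be looked up in the Next Steps table for the sensor in question.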

Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps

Hex | Sensor Specific Offset | Description | Next Steps
00h | Power down | System is powered down. | Informational Event
04h | AC Lost | AC removed. | Informational Event
05h | Soft Power Control Failure | Generally means power good was lost in the system, causing a shutdown. | This could be caused by the power supply subsystem or system components. 1. Verify all power cables and adapters are connected properly (AC cables as well as the cables between the PSU and system components). 2. Cross test the PSU if possible. 3. Replace the power subsystem.
06h | Power Unit Failure | Power subsystem experienced a failure. | Indicates a power supply failed. 1. Remove and reapply AC power. 2. If the power supply still fails, replace it.

4.2.2 Power Unit Redundancy Sensor

This sensor is enabled on systems that support redundant power supplies. When AC is applied to the system, or when power supply redundancy is lost, a message is logged in the SEL.


Table 17: Power Unit Redundancy Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 09h = Power Unit
12 | Sensor Number | 02h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Bh (Generic Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 18
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Description | Next Steps
00h | Fully redundant | System is fully operational. | Informational Event
01h | Redundancy lost | System is not running in redundant power supply mode. | This event should be accompanied by specific power supply errors (AC lost, PSU failure, and so on). Troubleshoot these events accordingly.
02h | Redundancy degraded | Same as 01h | Same as 01h
03h | Non-redundant, sufficient from redundant | Same as 01h | Same as 01h
04h | Non-redundant, sufficient from insufficient | Same as 01h | Same as 01h
05h | Non-redundant, insufficient | Same as 01h | Same as 01h
06h | Non-redundant, degraded from fully redundant | Same as 01h | Same as 01h
07h | Redundant, degraded from non-redundant | Same as 01h | Same as 01h
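The offsets listed for Table 18 lend themselves to a simple lookup. The Python sketch below is a minimal illustration; the names REDUNDANCY_OFFSETS, describe_redundancy, and needs_follow_up are hypothetical, and treating every offset other than 00h as requiring follow-up is an assumption based on the Next Steps column.

```python
# Event trigger offsets for Event Type 0Bh, copied from Table 18.
REDUNDANCY_OFFSETS = {
    0x00: "Fully redundant",
    0x01: "Redundancy lost",
    0x02: "Redundancy degraded",
    0x03: "Non-redundant, sufficient from redundant",
    0x04: "Non-redundant, sufficient from insufficient",
    0x05: "Non-redundant, insufficient",
    0x06: "Non-redundant, degraded from fully redundant",
    0x07: "Redundant, degraded from non-redundant",
}


def describe_redundancy(event_data1: int) -> str:
    """Return the offset name held in bits [3:0] of Event Data 1."""
    offset = event_data1 & 0x0F
    return REDUNDANCY_OFFSETS.get(offset, f"Unknown offset {offset:02X}h")


def needs_follow_up(event_data1: int) -> bool:
    """Assumption: only offset 00h is purely informational."""
    return (event_data1 & 0x0F) != 0x00


for ed1 in (0x00, 0x01, 0x05):
    print(f"{ed1:02X}h: {describe_redundancy(ed1)} "
          f"(follow up: {needs_follow_up(ed1)})")
```

When follow-up is needed, look for the accompanying power supply events (AC lost, PSU failure) near the same time in the SEL.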


4.3 Power Supply

The BMC monitors the power supply subsystem.

4.3.1 Power Supply Status Sensors

These sensors report the status of the power supplies in the system. An event can be logged when AC is first applied to or removed from the system. An event can also be logged on a failure, a predictive failure, or a configuration error.

Table 19: Power Supply Status Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 08h = Power Supply
12 | Sensor Number | 50h = Power Supply 1 Status; 51h = Power Supply 2 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] = Sensor Specific offset as described in Table 20
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

Table 20: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps

Hex | Sensor Specific Offset | Description | Next Steps
00h | Presence | Power supply detected. | Informational Event.
01h | Failure | Power supply failed. | Indicates a power supply failed. 1) Remove and reapply AC. 2) If the power supply still fails, replace it.
02h | Predictive Failure | Typically means a fan inside the power supply is not cooling the power supply. It may indicate the fan is failing. | Replace the power supply.
03h | AC lost | AC removed. | Informational Event.
06h | Configuration error | Power supply configuration is not supported. | Indicates that at least one of the supplies is not correct for your system configuration. 1) Remove the power supply and verify compatibility. 2) If the power supply is compatible it may be faulty. Replace it.

4.3.2 Power Supply AC Power Input Sensors

These sensors log an event when a power supply in the system exceeds its AC input power threshold.

Table 21: Power Supply AC Power Input Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 0Bh = Other Units
12 | Sensor Number | 52h = Power Supply 1 AC Power Input; 53h = Power Supply 2 AC Power Input
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 22
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Assertion Severity | Deassert Severity
07h | Upper non-critical going high | Degraded | OK
09h | Upper critical going high | non-fatal | Degraded

Description:
PMBus* feature to monitor power supply power consumption.

Next Steps:
If you see this event, the system is pulling too much power on the input for the PSU rating.
1. Verify the power budget is within the specified range.
2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.
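Threshold events such as these carry flags in Event Data 1 that say whether Event Data 2 and 3 hold the trigger reading and threshold. The following Python sketch shows one way to decode them together with the severities listed in Table 22. The function and dictionary names are hypothetical, and the values returned are the raw bytes from the record; converting them to engineering units depends on the sensor's SDR and is not attempted here.

```python
# Severity per (offset, is_deassertion), as listed in Table 22.
SEVERITY = {
    (0x07, False): "Degraded",
    (0x07, True): "OK",
    (0x09, False): "non-fatal",
    (0x09, True): "Degraded",
}
OFFSET_NAMES = {
    0x07: "Upper non-critical going high",
    0x09: "Upper critical going high",
}


def decode_threshold_event(byte13: int, ed1: int, ed2: int, ed3: int) -> dict:
    """Decode bytes 13 to 16 of a threshold (Event Type 01h) SEL record."""
    if byte13 & 0x7F != 0x01:
        raise ValueError("not a threshold event (Event Type 01h)")
    deassertion = bool(byte13 & 0x80)
    offset = ed1 & 0x0F
    result = {
        "offset": OFFSET_NAMES.get(offset, f"offset {offset:02X}h"),
        "severity": SEVERITY.get((offset, deassertion), "unknown"),
    }
    # ED1[7:6] = 01b: Event Data 2 holds the raw trigger reading.
    if (ed1 >> 6) & 0x03 == 0x01:
        result["raw_reading"] = ed2
    # ED1[5:4] = 01b: Event Data 3 holds the raw trigger threshold.
    if (ed1 >> 4) & 0x03 == 0x01:
        result["raw_threshold"] = ed3
    return result


# Hypothetical assertion of offset 09h with both flags set (ED1 = 59h).
print(decode_threshold_event(0x01, 0x59, 0xC8, 0xC0))
```

The same decoding applies to the other threshold sensors in this guide, with the severity table swapped for the one belonging to that sensor.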

4.3.3 Power Supply Current Output % Sensors

PMBus*-compliant power supplies may monitor the current output of the main 12V rail and report the current usage as a percentage of the maximum power output for that rail.

Table 23: Power Supply Current Output % Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 03h = Current
12 | Sensor Number | 54h = Power Supply 1 Current Output %; 55h = Power Supply 2 Current Output %
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 24
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Assertion Severity | Deassert Severity
07h | Upper non-critical going high | Degraded | OK
09h | Upper critical going high | non-fatal | Degraded

Description:
PMBus* feature to monitor power supply power consumption.

Next Steps:
If you see this event, the system is using too much power on the output for the PSU rating.
1. Verify the power budget is within the specified range.
2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.

4.3.4 Power Supply Temperature Sensors

The BMC monitors one power supply temperature sensor for each installed PMBus*-compliant power supply.


Table 25: Power Supply Temperature Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | 56h = Power Supply 1 Temperature; 57h = Power Supply 2 Temperature
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 26
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Assertion Severity | Deassert Severity
07h | Upper non-critical going high | Degraded | OK
09h | Upper critical going high | non-fatal | Degraded

Description:
An upper non-critical or critical temperature threshold has been crossed.

Next Steps:
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).


5. Cooling Subsystem

5.1 Fan Sensors

There are three types of fan sensors that can be present on Intel® Server Systems: speed, presence, and redundancy. The last two are only present in systems with hot-swap redundant fans.

5.1.1 Fan Speed Sensors

Fan speed sensors monitor the rpm signal on the relevant fan headers on the platform. Fan speed sensors are threshold-based sensors.

Usually they only have lower (critical) thresholds set, so that a SEL entry is only generated if the fan spins too slowly.

Table 27: Fan Speed Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 04h = Fan
12 | Sensor Number | 30h-39h (Chassis specific)
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 28
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.


Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The fan speed has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The fan speed has dropped below its lower critical threshold.

Next Steps:
A fan speed error on a new system build is typically not caused by the fan spinning too slowly; instead, it is caused by the fan being connected to the wrong header (the BMC expects fans on certain headers for each chassis and will log this event if there is no fan on that header).
1. Refer to the Quick Start Guide or the Service Guide to identify the correct fan headers to use.
2. Ensure the latest FRUSDR update has been run and the correct chassis was detected or selected.
3. If you are sure this was done, the event may be a sign of impending fan failure (although this will only normally apply if the system has been in use for a while). Replace the fan.

5.1.2 Fan Presence and Redundancy Sensors

Fan presence sensors are only implemented for hot-swap fans, and require an additional pin on the fan header. Fan redundancy is an aggregate of the fan presence sensors and will warn when redundancy is lost. Typically the redundancy mode on Intel® servers is an n+1 redundancy (if one fan fails there are still sufficient fans to cool the system, but it is no longer redundant) although other modes are also possible.

Table 29: Fan Presence Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 04h = Fan
12 | Sensor Number | 40h-45h (Chassis specific)
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 08h (Generic “digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 30
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 30: Fan Presence Sensors – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Assertion Severity | Deassert Severity
01h | Device Present | OK | Degraded

Description:
Assertion – A fan was inserted. This event may also get logged when the BMC initializes when AC is applied.
Deassert – A fan was removed, or was not present at the expected location when the BMC initialized.

Next Steps:
Informational only.
These events only get generated in systems with hot-swappable fans, and normally only when a fan is physically inserted or removed. If fans were not physically removed:
1. Use the Quick Start Guide to check whether the right fan headers were used.
2. Swap the fans round to see whether the problem stays with the location, or follows the fan.
3. Replace the fan or fan wiring/housing depending on the outcome of step 2.
4. Ensure the latest FRUSDR update has been run and the correct chassis was detected or selected.

Table 31: Fan Redundancy Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 04h = Fan
12 | Sensor Number | 46h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Bh (Generic Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 32
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 32: Fan Redundancy Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Offset | Description
00h | Fully redundant |
01h | Redundancy lost |
02h | Redundancy degraded |
03h | Non-redundant, sufficient from redundant | System has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.
04h | Non-redundant, sufficient from insufficient |
05h | Non-redundant, insufficient | System has lost fans and may no longer be able to cool itself adequately. Overheating may occur if this situation remains for a longer period of time.
06h | Non-redundant, degraded from fully redundant | System has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.
07h | Redundant, degraded from non-redundant | System has lost one or more fans and is running in a degraded mode, but still is redundant. There are enough fans to keep the system properly cooled.

Next Steps:
Fan redundancy loss indicates failure of one or more fans.
Look for lower non-critical or lower critical fan errors, or fan removal errors, in the SEL to identify which fan is causing the problem, and follow the troubleshooting steps for these event types.


5.2 Temperature Sensors

There are a variety of temperature sensors that can be implemented on Intel® Server Systems. They are split into three types: regular temperature sensors, thermal margin sensors, and discrete temperature sensors. Each of them has its own types of events that can be logged.

5.2.1 Regular Temperature Sensors

Regular temperature sensors are sensors that report an actual temperature. These are linear, threshold-based sensors. In most Intel® Server Systems, there are at least two sensors defined: front panel temperature and baseboard temperature. Both these sensors typically have upper and lower thresholds set – upper to warn in case of an over-temperature situation, lower to warn against sensor failure (temperature sensors typically read out 0 if they stop working).

Table 33: Temperature Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 35
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 34
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event


Table 34: Temperature Sensors Event Triggers – Description

Hex | Event Trigger | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The temperature has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The temperature has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The temperature has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The temperature has gone over its upper critical threshold.

Table 35: Temperature Sensors – Next Steps

Sensor Name | Sensor Number
Baseboard Temp | 20h
Front Panel Temp | 21h

Next Steps:
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).

If the front panel temperature reads zero, check:
1. It is connected properly.
2. The FRUSDR has been programmed correctly for your chassis.

If the front panel temperature is too high:
• Check the cooling of your server room.


5.2.2 Thermal Margin Sensors

Margin sensors are also linear sensors but typically report a negative value. This is not an actual temperature, but in fact an offset to a critical temperature. Example sensors are Processor Thermal Margin, Memory Thermal Margin, and IOH Thermal margin. Values reported should be seen as number of degrees below a critical temperature for the particular component.
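Because a margin reading is relative, it can help to see the arithmetic written out. The short Python sketch below assumes the reading has already been converted to a signed number of degrees; the critical temperature passed to estimated_temperature is a hypothetical, component-specific value that is not given in this guide.

```python
def degrees_below_critical(margin_reading: int) -> int:
    """A margin of -20 means the component is 20 degrees below critical."""
    return -margin_reading


def estimated_temperature(margin_reading: int, critical_temp_c: float) -> float:
    """Estimate an absolute temperature from a margin reading.

    critical_temp_c is an assumed value for illustration only.
    """
    return critical_temp_c + margin_reading


margin = -20  # hypothetical thermal margin reading
print(degrees_below_critical(margin))
print(estimated_temperature(margin, critical_temp_c=95.0))
```

A reading that moves toward zero therefore means the component is getting closer to its critical temperature.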

Table 36: Thermal Margin Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 38
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Triggers as described in Table 37
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event

Table 37: Thermal Margin Sensors Event Triggers – Description

Hex | Event Trigger | Assertion Severity | Deassert Severity | Description
07h | Upper non-critical going high | Degraded | OK | The thermal margin has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The thermal margin has gone over its upper critical threshold.


Table 38: Thermal Margin Sensors – Next Steps

Sensor Number | Sensor Name
22h | IOH Therm Margin
23h | Mem P1 Therm Margin
24h | Mem P2 Therm Margin
62h | P1 Therm Margin
63h | P2 Therm Margin

Next Steps (22h, 23h, 24h):
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).

Next Steps (62h, 63h):
Not a logged SEL event. Sensor is used for thermal management of the processor.

5.2.3 Processor Thermal Control % Sensors

Processor Thermal Control % sensors report the percentage of the time that the processor is throttling its performance due to thermal issues. If this is not addressed the processor could overheat and shut down the system to protect itself from damage.

Table 39: Processor Thermal Control % Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 41
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Triggers as described in Table 40
15 | Event Data 2 | Reading that triggered event.
16 | Event Data 3 | Threshold value that triggered event.

Table 40: Processor Thermal Control % Sensors Event Triggers – Description

Hex | Event Trigger | Assertion Severity | Deassert Severity | Description
07h | Upper non-critical going high | Degraded | OK | The thermal margin has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The thermal margin has gone over its upper critical threshold.

Table 41: Processor Thermal Control % Sensors – Next Steps

Sensor Number | Sensor Name
64h | P1 Therm Ctl %
65h | P2 Therm Ctl %

Next Steps:
These events normally only happen due to failures of the thermal solution:
1. Verify the heatsink is properly attached and has thermal grease.
2. If the system has a heatsink fan, ensure the fan is spinning.
3. Check all system fans are operating properly.
4. Check that the air used to cool the system is within limits (typically 35°C).

5.2.4 Discrete Thermal Sensors

Discrete thermal sensors do not report a temperature at all – instead they report an overheating event of some kind. Examples are VRD Hot (voltage regulator is overheating) and processor Thermal Trip (the processor got so hot that its over-temperature protection was triggered and the system was shut down to prevent damage).


Table 42: Discrete Thermal Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 43
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = See Table 43
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 43
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

Table 43: Discrete Thermal Sensors – Next Steps

Sensor Number | Sensor Name | Event Type | Hex | Event Trigger Offset | Description
66h | P1 VRD Hot | 05h | 01h | Limit Exceeded | Processor1 voltage regulator overheated
67h | P2 VRD Hot | 05h | 01h | Limit Exceeded | Processor2 voltage regulator overheated
6Ah | IOH Thermal Trip | 03h | 01h | State Asserted | I/O Hub (IOH) overheated

Next Steps:
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
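Since the Event Type for these discrete sensors depends on the sensor number, a lookup keyed on the sensor number keeps the check in one place. The Python sketch below uses the values listed for Table 43; the names DISCRETE_THERMAL and describe_discrete_thermal are hypothetical.

```python
# sensor number: (sensor name, expected Event Type, meaning of offset 01h)
DISCRETE_THERMAL = {
    0x66: ("P1 VRD Hot", 0x05, "Processor1 voltage regulator overheated"),
    0x67: ("P2 VRD Hot", 0x05, "Processor2 voltage regulator overheated"),
    0x6A: ("IOH Thermal Trip", 0x03, "I/O Hub (IOH) overheated"),
}


def describe_discrete_thermal(sensor_number: int, byte13: int, event_data1: int):
    """Return a description, or None if the record does not match Table 43."""
    entry = DISCRETE_THERMAL.get(sensor_number)
    if entry is None:
        return None
    name, expected_type, meaning = entry
    # Byte 13 bits [6:0] hold the Event Type; ED1 bits [3:0] hold the offset.
    if (byte13 & 0x7F) != expected_type or (event_data1 & 0x0F) != 0x01:
        return None
    state = "deasserted" if byte13 & 0x80 else "asserted"
    return f"{name}: {meaning} ({state})"


print(describe_discrete_thermal(0x6A, 0x03, 0x01))
print(describe_discrete_thermal(0x66, 0x85, 0x01))
```

A result of None means the record should be checked against the other sensor tables in this guide.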


6. Processor Subsystem

Intel® servers report several processor-centric sensors in the SEL.

6.1 Processor Status Sensor

The status sensor reports processor presence or a thermal trip condition. Each processor has a status sensor.

Table 44: Processor Status Sensors Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 07h = Processor
12 | Sensor Number | See Table 45
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 45
15 | Event Data 2 | Not used.
16 | Event Data 3 | Not used.


Table 45: Processor Status Sensors – Next Steps

Sensor Number | Sensor Name | Hex | Event Trigger Offset | Description
60h | P1 Status | 01h | Thermal trip | The processor exceeded the maximum temperature.
60h | P1 Status | 07h | State Asserted | Indicates the processor is present.
61h | P2 Status | 01h | Thermal trip | The processor exceeded the maximum temperature.
61h | P2 Status | 07h | State Asserted | Indicates the processor is present.

Next Steps:
This event normally only happens due to failures of the thermal solution:
1. Verify the heatsink is properly attached and has thermal grease.
2. If the system has a heatsink fan, ensure the fan is spinning.
3. Check all system fans are operating properly.
4. Check that the air used to cool the system is within limits (typically 35°C).

6.2 Catastrophic Error Sensor

When the Catastrophic Error signal (CATERR#) stays asserted, it is a sign that something serious has gone wrong in the hardware. The BMC monitors this signal and reports when it stays asserted.

Table 46: Catastrophic Error Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 07h = Processor
12 | Sensor Number | 68h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 01h (State Asserted)
15 | Event Data 2 | Not used.
16 | Event Data 3 | Not used.

6.2.1 Catastrophic Error Sensor – Next Steps

This error is typically caused by other platform components.

1. Check for other errors near the time of the CATERR event.

2. Verify all peripherals are plugged in and operating correctly, particularly Hard Drives, Optical Drives, and I/O.

3. Update system firmware and drivers.

6.3 CPU Missing Sensor

The CPU Missing sensor is a discrete sensor reporting that the processor is not installed. The most common cause of this event is a processor installed in the incorrect socket.

Table 47: CPU Missing Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 07h = Processor
12 | Sensor Number | 69h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 01h (State Asserted)
15 | Event Data 2 | Not used.
16 | Event Data 3 | Not used.


6.3.1 CPU Missing Sensor – Next Steps

Verify the processor is installed in the correct socket.

6.4 QuickPath Interconnect Error Sensors

The Intel® QuickPath Interconnect (QPI) bus on Intel® S5500/S3420 series server boards is the interconnect between processors and between the processors and the chipset. The QPI Error sensors are all reported by the BIOS SMI Handler to the BMC, so the Generator ID will be 33h.

6.4.1 QPI Correctable Error Sensor

The system detected an error and corrected it. This is an informational event.

Table 48: QPI Correctable Error Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0033h = BIOS SMI Handler
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 06h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 72h (OEM Discrete)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = Reserved
15 | Event Data 2 | 0-3 = CPU1-4
16 | Event Data 3 | Not used


6.4.1.1 QPI Correctable Error Sensor – Next Steps

This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Check the processor is installed correctly.

2. Inspect the socket for bent pins.

3. Cross test the processor if possible.

6.4.2 QPI Non-Fatal Error Sensor

The system detected a QPI non-fatal error that is recoverable. This is an informational event.

Table 49: QPI Non-Fatal Error Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0033h = BIOS SMI Handler
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 07h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 73h (OEM Discrete)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = Reserved
15 | Event Data 2 | 0-3 = CPU1-4
16 | Event Data 3 | Not used


6.4.2.1 QPI Non-Fatal Error Sensor – Next Steps

This is an Informational event only. Non-Fatal errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Check the processor is installed correctly.

2. Inspect the socket for bent pins.

3. Cross test the processor if possible.

6.4.3 QPI Fatal and Fatal #2

The system detected a QPI fatal or non-recoverable error. This is a fatal error.

Table 50: QPI Fatal Error Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0033h = BIOS SMI Handler
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 17h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 74h (OEM Discrete)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = Reserved
15 | Event Data 2 | 0-3 = CPU1-4
16 | Event Data 3 | Not used

The QPI Fatal #2 Error is a continuation of QPI Fatal Error.

Table 51: QPI Fatal #2 Error Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0033h = BIOS SMI Handler
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 18h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 74h (OEM Discrete)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = Reserved
15 | Event Data 2 | 0-3 = CPU1-4
16 | Event Data 3 | Not used
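The four QPI error sensors differ only in sensor number and Event Type, so they can be told apart with one lookup. The Python sketch below is illustrative; it assumes the two Generator ID bytes have already been combined into the single value 0033h shown in the tables, and the function and constant names are hypothetical.

```python
BIOS_SMI_HANDLER = 0x0033    # Generator ID, bytes 8 and 9
CRITICAL_INTERRUPT = 0x13    # Sensor Type, byte 11

# sensor number: (sensor name, expected Event Type), from Tables 48 to 51
QPI_SENSORS = {
    0x06: ("QPI Correctable Error", 0x72),
    0x07: ("QPI Non-Fatal Error", 0x73),
    0x17: ("QPI Fatal Error", 0x74),
    0x18: ("QPI Fatal #2 Error", 0x74),
}


def describe_qpi_event(generator_id: int, sensor_type: int,
                       sensor_number: int, byte13: int, event_data2: int):
    """Return a description, or None if this is not a QPI error record."""
    if generator_id != BIOS_SMI_HANDLER or sensor_type != CRITICAL_INTERRUPT:
        return None
    entry = QPI_SENSORS.get(sensor_number)
    if entry is None or (byte13 & 0x7F) != entry[1]:
        return None
    name = entry[0]
    # Event Data 2 values 0-3 map to CPU1-4.
    if 0 <= event_data2 <= 3:
        return f"{name} on CPU{event_data2 + 1}"
    return f"{name} on unknown processor"


print(describe_qpi_event(0x0033, 0x13, 0x17, 0x74, 1))
```

Checking the Generator ID first avoids confusing these records with BMC-owned sensors that happen to share a sensor number.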

6.4.3.1 QPI Fatal and Fatal #2 – Next Steps

This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Check the processor is installed correctly.

2. Inspect the socket for bent pins.

3. Cross test the processor if possible.


7. Memory Subsystem

Intel® servers report memory errors, status, and configuration in the SEL.

7.1 Memory RAS Mirroring and Sparing

“Memory RAS Configuration Status” refers to the BIOS sending the current RAS mode and RAS operational state to the BMC to log into the SEL as a SEL record. This allows a remote software/application to query and retrieve the system memory state.

The memory configuration state sensors are “virtual” sensors. In other words, these sensors are owned and controlled completely by the BIOS, independently of the BMC.

The RAS configuration and state definitions are aligned with the definitions within the Intelligent Platform Management Interface Specification, Version 2.0. Accordingly, these sensors are read as “Status” and “Redundancy” sensors (Event/Reading Type 0x09 and 0x0B respectively).

• Sensor Number 12h (Event Type 0x09) – Mirroring Configuration Status

• Sensor Number 01h (Event Type 0x0B) – Mirroring Redundancy State

• Sensor Number 13h (Event Type 0x09) – Sparing Configuration Status

• Sensor Number 11h (Event Type 0x0B) – Sparing Redundancy State
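The four sensor number and event type pairs listed above can be kept in a small lookup table when scripting SEL analysis. This is only a sketch; the names `MEMORY_RAS_SENSORS` and `memory_ras_sensor_name` are hypothetical and assume the caller already has the sensor number (byte 12) and the event direction/type byte (byte 13) of a SEL record.

```python
from typing import Optional

# (sensor number, event/reading type) -> sensor name, as listed in this section.
MEMORY_RAS_SENSORS = {
    (0x12, 0x09): "Mirroring Configuration Status",
    (0x01, 0x0B): "Mirroring Redundancy State",
    (0x13, 0x09): "Sparing Configuration Status",
    (0x11, 0x0B): "Sparing Redundancy State",
}


def memory_ras_sensor_name(sensor_number: int, byte13: int) -> Optional[str]:
    """Return the memory RAS sensor name, or None if the pair is not one of the four."""
    event_type = byte13 & 0x7F  # drop the assertion/deassertion bit [7]
    return MEMORY_RAS_SENSORS.get((sensor_number, event_type))
```

Masking off bit 7 first means assertion and deassertion events for the same sensor resolve to the same name.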

7.1.1 Mirroring Configuration Status

This sensor provides the Mirroring mode RAS configuration status.

Table 52: Mirroring Configuration Status Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0001h = BIOS POST
11     Sensor Type                     0Ch = Memory
12     Sensor Number                   12h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 09h (digital Discrete)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 00b = Unspecified Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 53
15     Event Data 2                    Not used
16     Event Data 3                    Not used

Table 53: Mirroring Configuration Status Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description
01h   The system has been configured into Mirrored Channel RAS Mode.
      Description: User enabled mirrored channel mode in setup.
      Next Steps: Informational event only.
00h   The system has been configured out of Mirrored Channel RAS Mode.
      Description: Mirrored channel mode is disabled (either in setup or due to unavailability of memory at post, in which case post error 8500 is also logged).
      Next Steps:
      1. If this event is accompanied by a post error 8500, there was a problem applying the mirroring configuration to the memory. Check for other errors related to the memory and troubleshoot accordingly.
      2. If there is no post error then mirror mode was simply disabled in BIOS setup and this should be considered informational only.

7.1.2 Mirrored Redundancy State Sensor

This sensor provides the RAS Redundancy state for the Memory Mirrored Channel Mode.


Table 54: Mirrored Redundancy State Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0001h = BIOS POST
11     Sensor Type                     0Ch = Memory
12     Sensor Number                   01h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 0Bh (Generic Discrete)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 55
15     Event Data 2                    [7:4] – If Domain Instance Type (ED3) is set to Local, this field specifies the mirroring domain local sub-instances – which channels are included in this sub-instance:
                                         0000b – Reserved
                                         0001b – {Ch A, Ch B}
                                         0010b – {Ch A, Ch C}
                                         0011b – {Ch B, Ch C}
                                         0100b-1110b – Reserved
                                       If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the first participant processor in this mirroring domain global instance.
                                       A value of 1111b indicates that this field is unused and does not contain valid data.
                                       [3:0] – If Domain Instance Type (ED3) is set to Local, this field specifies the sparing domain local sub-instances – which channels are included in this sub-instance:
                                         0000b – Reserved
                                         0001b – {Ch A, Ch B, Ch C} (only configuration possible on Intel® S5500/S5520 Server Boards)
                                         0010b-1110b – Reserved
                                       If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the first participant processor in this sparing domain global instance.
                                       A value of 1111b indicates that this field is unused and does not contain valid data.
16     Event Data 3                    [7] – Domain Instance Type
                                         0b: Local memory sparing domain instance. This SEL pertains to a local memory mirroring domain that is restricted to memory mirroring pairs within a processor socket only.
                                         1b: Global memory sparing domain instance. This SEL pertains to a global memory mirroring domain that pertains to memory mirroring between processor sockets.
                                       [6:4] – Reserved
                                       [3:0] – 0-based Instance ID of this sparing domain

Table 55: Mirrored Redundancy State Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description
01h   Memory is configured in Mirrored Channel Mode, and the memory is operating in the fully redundant state.
      Description: System boots with mirrored channel mode active, one entry per processor.
      Next Steps: Informational event.
00h   Memory is configured in Mirrored Channel Mode, and the memory has lost redundancy and is operating in the degraded state.
      Description: One of the channels in the mirror pair is taken offline – loss of mirror – one entry only for affected processor.
      Next Steps: This event should be accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected DIMM).

7.1.3 Sparing Configuration Status

This sensor provides the Spare Channel mode RAS Configuration status.

Table 56: Sparing Configuration Status Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0001h = BIOS POST
11     Sensor Type                     0Ch = Memory
12     Sensor Number                   13h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 09h (digital Discrete)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 00b = Unspecified Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 57
15     Event Data 2                    Not used
16     Event Data 3                    Not used

Table 57: Sparing Configuration Status Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description
01h   The system has configured into Spare Channel RAS mode.
      Description: Sparing mode is enabled in setup.
      Next Steps: Informational event only.
00h   The system has configured out of Spare Channel RAS mode.
      Description: Sparing mode is disabled, either from setup or due to error in which case post error 8500 also occurs.
      Next Steps:
      1. If this event is accompanied by a post error 8500, there was a problem applying the sparing configuration to the memory. Check for other errors related to the memory and troubleshoot accordingly.
      2. If there is no post error then sparing mode was simply disabled in BIOS setup and this should be considered informational only.

7.1.4 Sparing Redundancy State Sensor

This sensor provides the RAS Redundancy state for the Spare Channel Mode.


Table 58: Sparing Redundancy State Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0001h = BIOS POST
11     Sensor Type                     0Ch = Memory
12     Sensor Number                   11h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 0Bh (Generic Discrete)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 59
15     Event Data 2                    [7:4] – If Domain Instance Type (ED3) is set to Local, this field specifies the 0-based Socket ID of the processor that contains the sparing domain local sub-instances.
                                       A value of 1110b indicates that the sparing configuration specified in Bits [3:0] applies globally to all sockets in the system.
                                       If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the second participant processor in this sparing domain global instance.
                                       A value of 1111b indicates that this field is unused and does not contain valid data.
                                       [3:0] – If Domain Instance Type (ED3) is set to Local, this field specifies the sparing domain local sub-instances – which channels are included in this sub-instance:
                                         0000b – Reserved
                                         0001b – {Ch A, Ch B, Ch C} (only configuration possible on Intel® S5500/S5520 Server Boards)
                                         0010b-1110b – Reserved
                                       If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the first participant processor in this sparing domain global instance.
                                       A value of 1111b indicates that this field is unused and does not contain valid data.
16     Event Data 3                    [7] – Domain Instance Type
                                         0b: Local memory sparing domain instance. This SEL pertains to a local memory sparing domain that is restricted to memory sparing pairs within a processor socket only.
                                         1b: Global memory sparing domain instance. This SEL pertains to a global memory sparing domain that pertains to memory sparing between processor sockets.
                                       [6:4] – Reserved
                                       [3:0] – 0-based Instance ID of this sparing domain

Table 59: Sparing Redundancy State Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description
01h   Memory is configured in Spare Channel Mode, and the memory is operating in the fully redundant state, with the spare channel inactive and available.
      Description: System boots with spare channel mode active, one entry per processor.
      Next Steps: Informational event.
00h   Memory is configured in Spare Channel Mode, and the memory has lost redundancy and is operating in the degraded state, with the spare channel active and used to replace a failed channel.
      Description: Spare channel replaces failing channel, one SEL entry for processor with failing memory to signify loss of redundancy.
      Next Steps: This event should be accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected DIMM).


7.2 ECC and Address Parity

1. Memory data errors are logged as correctable or uncorrectable.

2. Uncorrectable errors are fatal.

3. Memory addresses are protected with parity bits and a parity error is logged. This is a fatal error.

7.2.1 Memory Correctable and Uncorrectable ECC Error

ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors. A “Correctable ECC Error” event actually represents a threshold overflow: more correctable errors were detected at the memory controller level for a given DIMM within a given timeframe than the threshold allows. In both cases, the error can be narrowed down to particular DIMM(s). The BIOS SMI error handler uses this information to log the data to the BMC SEL and identify the failing DIMM module.

Table 60: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0033h = BIOS SMI Handler
11     Sensor Type                     0Ch = Memory
12     Sensor Number                   02h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 61
15     Event Data 2                    [7:2] – Reserved. Set to 0.
                                       [1:0] – The logical rank associated with the failed DDR3 DIMM
16     Event Data 3                    [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
                                         000b = Processor Socket 1
                                         001b = Processor Socket 2
                                         All other values are reserved.
                                       [4:3] – Indicates the processor Memory Channel to which the failing DDR3 DIMM is attached:
                                         00b = Channel A or D (For Processor Socket 1, Processor Socket 2)
                                         01b = Channel B or E
                                         10b = Channel C or F
                                         11b is reserved.
                                       [2:0] – Indicates the DIMM Socket on the channel to which the failing DDR3 DIMM is attached:
                                         000b = DIMM Socket 1
                                         001b = DIMM Socket 2
                                         All other values are reserved.
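Decoding the DIMM location from the hex version of a SEL entry follows directly from the Event Data bit fields above. The sketch below is one way to do it in Python. The function name, the returned keys, and the "DIMM_E2" style label are assumptions made for illustration; the label simply combines the channel letter and DIMM socket number in the same form used by the POST error messages.

```python
# Channel letters per processor socket, following the [4:3] encoding:
# 00b = A or D, 01b = B or E, 10b = C or F (socket 1 or socket 2).
CHANNEL_LETTERS = ("ABC", "DEF")

ECC_OFFSETS = {
    0x00: "Correctable ECC Error threshold reached",
    0x01: "Uncorrectable ECC Error",
}


def decode_ecc_event(ed1: int, ed2: int, ed3: int) -> dict:
    """Decode Event Data 1-3 of a memory ECC SEL entry (sensor number 02h)."""
    socket = (ed3 >> 5) & 0x7    # [7:5] processor socket, 0-based
    channel = (ed3 >> 3) & 0x3   # [4:3] channel index within the socket
    dimm = ed3 & 0x7             # [2:0] DIMM socket on the channel, 0-based
    if socket > 1 or channel > 2 or dimm > 1:
        raise ValueError("Event Data 3 uses a reserved encoding")
    letter = CHANNEL_LETTERS[socket][channel]
    return {
        "error": ECC_OFFSETS.get(ed1 & 0x0F, "unknown offset"),
        "rank": ed2 & 0x3,       # [1:0] logical rank of the failed DIMM
        "processor_socket": socket + 1,
        "channel": letter,
        "dimm_socket": dimm + 1,
        "label": f"DIMM_{letter}{dimm + 1}",  # hypothetical label format
    }


# Example bytes (made up for illustration): ED1 = A1h, ED2 = 01h, ED3 = 29h.
event = decode_ecc_event(0xA1, 0x01, 0x29)
```

For the example bytes, ED3 = 29h decodes to Processor Socket 2, Channel E, DIMM Socket 2.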

Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description
01h   Uncorrectable ECC Error
      Description: An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash (unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error) and an MCE (Machine Check Exception Error).
      Next Steps: While the error may be due to a failing DRAM chip on the DIMM, it could also be caused by incorrect seating or improper contact between the socket and DIMM, or by bent pins in the processor socket.
      1. If needed, decode DIMM location from hex version of SEL.
      2. Verify the DIMM is seated properly.
      3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
      4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
      5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
00h   Correctable ECC Error threshold reached
      Description: There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event in itself does not pose any direct problems as the ECC errors are still being corrected. Depending on the RAS configuration of the memory, the IMC may take the affected DIMM offline.
      Next Steps: Even though this event doesn't immediately lead to problems, it can indicate one of the DIMM modules is slowly failing. If this error occurs more than once:
      1. If needed, decode DIMM location from hex version of SEL.
      2. Verify the DIMM is seated properly.
      3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
      4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
      5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.

7.2.2 Memory Address Parity Error

Address Parity errors are errors detected in the memory addressing hardware. Because these affect the addressing of memory contents, they can potentially lead to the same sort of failures as ECC errors. They are logged as a distinct type of error because they affect memory addressing rather than memory contents, but otherwise they are treated exactly the same as Uncorrectable ECC Errors. Address Parity errors are logged to the BMC SEL, with Event Data to identify the failing address by channel and DIMM to the extent that it is possible to do so.

Table 62: Address Parity Error Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0033h = BIOS SMI Handler
11     Sensor Type                     0Ch = Memory
12     Sensor Number                   14h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset = 02h
15     Event Data 2                    [7:5] – Reserved. Set to 0.
                                       [4] – Channel Information Validity Check:
                                         0b = Channel Number in Event Data 3 Bits[4:3] is not valid
                                         1b = Channel Number in Event Data 3 Bits[4:3] is valid
                                       [3] – DIMM Information Validity Check:
                                         0b = DIMM Slot ID in Event Data 3 Bits[2:0] is not valid
                                         1b = DIMM Slot ID in Event Data 3 Bits[2:0] is valid
                                       [2:0] – Error Type:
                                         000b = Parity Error Type not known
                                         001b = Data Parity Error (not used)
                                         010b = Address Parity Error
                                         All other values are reserved.
16     Event Data 3                    [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
                                         000b = Processor Socket 1
                                         001b = Processor Socket 2
                                         All other values are reserved.
                                       [4:3] – Channel Number (if valid) on which the Parity Error occurred. This value will be indeterminate and should be ignored if ED2 Bit [4] is 0b.
                                         00b = Channel A or D (For Processor Socket 1, Processor Socket 2)
                                         01b = Channel B or E
                                         10b = Channel C or F
                                         11b = Reserved
                                       [2:0] – DIMM Slot ID (if valid) of the specific DIMM that was involved in the transaction that led to the parity error. This value will be indeterminate and should be ignored if ED2 Bit [3] is 0b.
                                         000b = DIMM Socket 1
                                         001b = DIMM Socket 2
                                         All other values are reserved.
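Because the channel and DIMM fields in Event Data 3 are only meaningful when the validity bits in Event Data 2 are set, a decoder has to check those bits first. The following Python sketch illustrates that ordering. The function name and result keys are hypothetical; fields that are invalid or reserved are returned as `None` rather than guessed.

```python
PARITY_ERROR_TYPES = {
    0b000: "Parity Error Type not known",
    0b001: "Data Parity Error (not used)",
    0b010: "Address Parity Error",
}


def decode_parity_event(ed2: int, ed3: int) -> dict:
    """Decode Event Data 2 and 3 of an Address Parity Error SEL entry (sensor 14h)."""
    channel_valid = bool(ed2 & 0x10)  # ED2 [4]: channel number in ED3 [4:3] is valid
    dimm_valid = bool(ed2 & 0x08)     # ED2 [3]: DIMM slot ID in ED3 [2:0] is valid
    socket = (ed3 >> 5) & 0x7         # ED3 [7:5], 0-based; values above 1 are reserved
    channel = (ed3 >> 3) & 0x3        # ED3 [4:3]; 11b is reserved
    dimm = ed3 & 0x7                  # ED3 [2:0]; values above 1 are reserved

    result = {
        "error_type": PARITY_ERROR_TYPES.get(ed2 & 0x7, "reserved"),
        "processor_socket": socket + 1 if socket <= 1 else None,
        "channel": None,
        "dimm_socket": None,
    }
    if channel_valid and socket <= 1 and channel <= 2:
        result["channel"] = ("ABC", "DEF")[socket][channel]
    if dimm_valid and dimm <= 1:
        result["dimm_socket"] = dimm + 1
    return result


# Example bytes (made up): channel valid, DIMM not valid, address parity error.
parity = decode_parity_event(0x12, 0x10)
```

In the example, ED2 = 12h marks only the channel as valid, so the result reports Channel C on Processor Socket 1 and leaves the DIMM socket undetermined.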


7.2.2.1 Memory Address Parity Error Sensor Next Steps

These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn. An Address Parity Error is logged as such in SEL but in all other ways is treated the same as an Uncorrectable ECC Error.

While the error may be due to a failing DRAM chip on the DIMM, it could also be caused by incorrect seating or improper contact between the socket and DIMM, or by bent pins in the processor socket.

1. If needed, decode DIMM location from hex version of SEL.

2. Verify the DIMM is seated properly.

3. Examine gold fingers on edge of the DIMM to verify contacts are clean.

4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.

5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.


8. PCI Express* and Legacy PCI Subsystem

The PCI Express* (PCIe) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The BIOS logs AER events into the SEL.

The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.

8.1 PCI Express* Errors

PCIe error events are either correctable (informational event) or fatal. In both cases, information is logged to help identify the source of the PCIe error, and the bus, device, and function are included in the extended data fields. The operating system maps PCIe devices by bus, device, and function, and this combination uniquely identifies each device. PCIe device information can be found in the operating system.

8.1.1 PCI Express* Correctable Errors

When a PCI Express* correctable error is reported to the BIOS SMI handler, it will record the error using the following format.

Table 63: PCI Express* Correctable Error Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0033h = BIOS SMI Handler
11     Sensor Type                     13h = Critical Interrupt
12     Sensor Number                   05h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 71h (OEM Specific)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 64
15     Event Data 2                    PCI Bus number
16     Event Data 3                    [7:3] – PCI Device number
                                       [2:0] – PCI Function number
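The bus, device, and function can be recovered from Event Data 2 and Event Data 3 with two shifts and masks. The Python sketch below shows the arithmetic. The function names are hypothetical, and the `bb:dd.f` output format is an assumption borrowed from the notation commonly used by operating system tools such as `lspci`; it is not defined by this guide.

```python
from typing import Tuple


def decode_pci_location(ed2: int, ed3: int) -> Tuple[int, int, int]:
    """Return (bus, device, function) from Event Data 2 and Event Data 3."""
    bus = ed2 & 0xFF             # ED2: PCI Bus number
    device = (ed3 >> 3) & 0x1F   # ED3 [7:3]: PCI Device number
    function = ed3 & 0x07        # ED3 [2:0]: PCI Function number
    return bus, device, function


def format_bdf(bus: int, device: int, function: int) -> str:
    """Format a location as bb:dd.f in hexadecimal (assumed display convention)."""
    return f"{bus:02x}:{device:02x}.{function}"


# Example bytes (made up): ED2 = 03h, ED3 = 08h -> bus 3, device 1, function 0.
location = format_bdf(*decode_pci_location(0x03, 0x08))
```

The resulting string can then be compared against the device list reported by the operating system to identify the card.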

Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description                     Description
00h   Receiver error                                       Correctable error occurred
01h   Bad DLLP error                                       Correctable bad DLLP occurred
02h   Bad TLP error                                        Correctable bad TLP occurred
03h   REPLAY_NUM Rollover Error                            Correctable Replay event occurred
04h   REPLAY Timer Timeout Error                           Correctable Replay timeout event occurred
05h   Advisory non-fatal Error (received ERR_COR message)  Correctable advisory event occurred, typically provided as notice to software driver
06h   Link bandwidth changed                               Link bandwidth changed

Next Steps (all offsets)

Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Decode bus, device, and function to identify the card.

2. If this is an add-in card:
   a. Verify the card is inserted properly.
   b. Install the card in another slot and check whether the error follows the card or stays with the slot.
   c. Update all firmware and drivers, including non-Intel components.

3. If this is an onboard device:
   a. Update all BIOS, firmware, and drivers.
   b. Replace the board.

8.1.2 PCI Express* Fatal Errors

When a PCI Express* fatal error is reported to the BIOS SMI handler, it will record the error using the following format.


Table 65: PCI Express* Fatal Error Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0033h = BIOS SMI Handler
11     Sensor Type                     13h = Critical Interrupt
12     Sensor Number                   04h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 70h (OEM Specific)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 66
15     Event Data 2                    PCI Bus number
16     Event Data 3                    [7:3] – PCI Device number
                                       [2:0] – PCI Function number

Table 66: PCI Express* Fatal Error Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description
      Description
00h   Data Link Layer Protocol Error
      Indicates a CRC error detected during a DLLP transaction. This means the transaction was corrupted.
01h   Surprise Link Down
      The link was lost and is no longer functional. Requires a reboot to bring the link back.
02h   Unexpected Completion
      Indicates the device received a completion notification for a transaction it does not recognize. This is a fatal error.
03h   Received Unsupported request condition on inbound address decode with the exception of SAD
      Typically indicates a failure due to an incorrect address sent to the target. This unknown address is a fatal error.
04h   Poisoned TLP Error
      Typically indicates a parity error in a TLP transaction. This means the data received is not correct.
05h   Flow Control Protocol Error
      Indicates an error during initialization with the device not providing enough flow control credits. This means the bus configuration is incorrect and it cannot continue.
06h   Completion Timeout Error
      Indicates a transaction did not complete in the specified amount of time.
07h   Completer Abort Error
      Indicates a transaction had unexpected content or format.
08h   Receiver Buffer Overflow Error
      Indicates a synchronization problem between PCI Express* devices. Extremely rare.
09h   ACS Violation Error
      Access Control Services, a transaction routing feature, failed.
0Ah   Malformed TLP Error
      Indicates a transaction was sent with data exceeding the maximum allowed number of bytes. This is not allowed and is a fatal error, usually a firmware or driver problem.
0Bh   Received ERR_FATAL message from downstream Error
      Indicates a fatal error occurred and is being reported.
0Ch   Unexpected Completion Error
      Indicates the device received a completion notification for a transaction it does not recognize.
0Dh   Received ERR_NONFATAL Message Error
      Indicates a non-fatal error is redefined as fatal, and is being reported.

Next Steps (all offsets)

1. Decode bus, device, and function to identify the card.

2. If this is an add-in card:
   a. Verify the card is inserted properly.
   b. Install the card in another slot and check whether the error follows the card or stays with the slot.
   c. Update all firmware and drivers, including non-Intel components.

3. If this is an onboard device:
   a. Update all BIOS, firmware, and drivers.
   b. Replace the board.

8.1.3 Legacy PCI Errors

Legacy PCI errors include PERR and SERR; both are fatal errors.


Table 67: Legacy PCI Error Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0033h = BIOS SMI Handler
11     Sensor Type                     13h = Critical Interrupt
12     Sensor Number                   03h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset as described in Table 68
15     Event Data 2                    PCI Bus number
16     Event Data 3                    [7:3] – PCI Device number
                                       [2:0] – PCI Function number

Table 68: Legacy PCI Error Sensor Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description   Description
04h   PERR#                              Parity Error, PERR, asserted. This is a fatal error.
05h   SERR#                              System Error, SERR, asserted. This is a fatal error.

Next Steps (all offsets)

1. Decode bus, device, and function to identify the card.

2. If this is an add-in card:
   a. Verify the card is inserted properly.
   b. Install the card in another slot and check whether the error follows the card or stays with the slot.
   c. Update all firmware and drivers, including non-Intel components.

3. If this is an onboard device:
   a. Update all BIOS, firmware, and drivers.
   b. Replace the board.


9. System BIOS Events

There are a number of events that are owned by the system BIOS. These events can occur during Power On Self Test (POST) or when coming out of a sleep state. Not all of these events signify errors. Some events are described in other chapters in this document (for example, memory events).

9.1 System Events

These events can occur during POST or when coming out of a sleep state. These are informational events only.

1. When logging events during POST BIOS uses generator ID 0001h.

2. When coming out of a sleep state BIOS uses generator ID 0033h.

9.1.1 System Boot

The BIOS logs a system boot event every time the system boots. The event gets logged early during POST when BIOS-BMC communication is first established. This event is not an error.

9.1.2 Timestamp Clock Synchronization

These events are logged when the BIOS synchronizes time with the BMC. Two events are logged. The BIOS sends the first time synch message to the BMC to perform the synchronization. The timestamp on that first event is unreliable, because the entry is stamped with the BMC's "before" time and could therefore be anything.

The BIOS then sends a second time synch message so that the log contains an entry with a correct timestamp. That entry is the "baseline", or starting time.

For example, say that the BMC time is March 1, 2011 21:00, and the BIOS time synch updates it to 21:20 on the same date (the BMC was running behind). Without the second time synch message, nothing in the log shows that the time jumped ahead, and the next log message makes it look as though there was an unexplained 20-minute delay during boot.

In other words, without the second time synch message the time span to the next logged message is indeterminate. With the second time synch as a baseline, the log timestamps that follow are always determinate.


Table 69: System Event Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0001h = BIOS POST
                                       0033h = BIOS SMI Handler
11     Sensor Type                     12h = System Event
12     Sensor Number                   83h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                       [5:4] – 00b = Unspecified Event Data 3
                                       [3:0] – Event Trigger Offset
                                         01h = System Boot
                                         05h = Timestamp Clock Synchronization
15     Event Data 2                    For Event Trigger Offset 05h only (Timestamp Clock Synchronization)
                                         00h = 1st in pair
                                         80h = 2nd in pair
16     Event Data 3                    Not used
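When reading a raw SEL dump, the System Event entries can be told apart using the Event Trigger Offset in Event Data 1 and, for offset 05h, the pair marker in Event Data 2. The Python sketch below illustrates this; the function name and the returned strings are hypothetical and chosen only for readability.

```python
def describe_system_event(ed1: int, ed2: int) -> str:
    """Describe a System Event SEL entry (sensor number 83h) from ED1 and ED2."""
    offset = ed1 & 0x0F  # ED1 [3:0]: Event Trigger Offset
    if offset == 0x01:
        return "System Boot"
    if offset == 0x05:
        # ED2 is only meaningful for Timestamp Clock Synchronization events.
        if ed2 == 0x00:
            return "Timestamp Clock Synchronization (1st in pair)"
        if ed2 == 0x80:
            return "Timestamp Clock Synchronization (2nd in pair)"
        return "Timestamp Clock Synchronization (unrecognized Event Data 2)"
    return "Unrecognized System Event offset"
```

Only the timestamp of the entry marked "2nd in pair" should be treated as the baseline for the entries that follow.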

9.2 System Firmware Progress (Formerly Post Error)

The BIOS logs any POST errors to the SEL. The 2-byte POST code gets logged in the ED2 and ED3 bytes in the SEL entry. This event will be logged every time a POST error is displayed. Even though this event indicates an error, it may not be a fatal error. If this is a serious error, there will typically also be a corresponding SEL entry logged for whatever was the cause of the error – this event may contain more information about what happened than the POST error event.


Table 70: POST Error Sensor Typical Characteristics

Byte   Field                           Description
8, 9   Generator ID                    0001h = BIOS POST
11     Sensor Type                     0Fh = System Firmware Progress (formerly POST Error)
12     Sensor Number                   06h
13     Event Direction and Event Type  [7] Event direction
                                         0b = Assertion Event
                                         1b = Deassertion Event
                                       [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                       [5:4] – 10b = OEM code in Event Data 3
                                       [3:0] – Event Trigger Offset = 0
15     Event Data 2                    Low Byte of POST Error Code
16     Event Data 3                    High Byte of POST Error Code
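Since the POST error code is split across Event Data 2 (low byte) and Event Data 3 (high byte), the two bytes have to be recombined before looking the code up. A minimal Python sketch follows; the function name and the small sample dictionary are hypothetical, and the three sample messages are taken from the POST error code table in this chapter.

```python
def post_error_code(ed2: int, ed3: int) -> str:
    """Combine ED2 (low byte) and ED3 (high byte) into a 4-digit hex POST code."""
    return f"{((ed3 & 0xFF) << 8) | (ed2 & 0xFF):04X}"


# A few entries copied from the POST error code table, for illustration only.
SAMPLE_POST_MESSAGES = {
    "0012": "CMOS date/time not set",
    "8500": "Memory component could not be configured in the selected RAS mode.",
    "84FF": "System event log full",
}

# Example bytes (made up): ED2 = 00h, ED3 = 85h.
code = post_error_code(0x00, 0x85)
message = SAMPLE_POST_MESSAGES.get(code, "see the POST error code table")
```

Formatting the result as four uppercase hex digits keeps it directly comparable with the codes as they are printed in the table.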

9.2.1 System Firmware Progress (Formerly Post Error) – Next Steps

See the following table for POST error Codes.

Table 71: POST Error Codes

Error Code   Error Message                                                  Response
0012         CMOS date/time not set                                         Major
0048         Password check failed                                          Major
0108         Keyboard component encountered a locked error.                 Minor
0109         Keyboard component encountered a stuck key error.              Minor
0113         Fixed Media The SAS RAID firmware cannot run properly. The user should attempt to reflash the firmware.   Major
0140         PCI component encountered a PERR error.                        Major
0141         PCI resource conflict                                          Major
0146         PCI out of resources error                                     Major
0192         Processor 0x cache size mismatch detected.                     Fatal
0193         Processor 0x stepping mismatch.                                Minor
0194         Processor 0x family mismatch detected.                         Fatal
0195         Processor 0x Intel® QPI speed mismatch.                        Fatal
0196         Processor 0x model mismatch.                                   Fatal
0197         Processor 0x speeds mismatched.                                Fatal
0198         Processor 0x family is not supported.                          Fatal
019F         Processor and chipset stepping configuration is unsupported.   Major
5220         CMOS/NVRAM Configuration Cleared                               Major
5221         Passwords cleared by jumper                                    Major
5224         Password clear Jumper is Set.                                  Major
8160         Processor 01 unable to apply microcode update                  Major
8161         Processor 02 unable to apply microcode update                  Major
8180         Processor 0x microcode update not found.                       Minor
8190         Watchdog timer failed on last boot                             Major
8198         OS boot watchdog timer failure.                                Major
8300         Baseboard management controller failed self-test               Major
84F2         Baseboard management controller failed to respond              Major
84F3         Baseboard management controller in update mode                 Major
84F4         Sensor data record empty                                       Major
84FF         System event log full                                          Minor

66 Intel order number G74211-002 Revision 1.1

System Event Log Troubleshooting Guide for Intel

®

Error Code | Error Message | Response
8500 | Memory component could not be configured in the selected RAS mode. | Major
8501 | DIMM Population Error. | Major
8502 | CLTT Configuration Failure Error. | Major
8520 | DIMM_A1 failed Self-Test (BIST). | Major
8521 | DIMM_A2 failed Self-Test (BIST). | Major
8522 | DIMM_B1 failed Self-Test (BIST). | Major
8523 | DIMM_B2 failed Self-Test (BIST). | Major
8524 | DIMM_C1 failed Self-Test (BIST). | Major
8525 | DIMM_C2 failed Self-Test (BIST). | Major
8526 | DIMM_D1 failed Self-Test (BIST). | Major
8527 | DIMM_D2 failed Self-Test (BIST). | Major
8528 | DIMM_E1 failed Self-Test (BIST). | Major
8529 | DIMM_E2 failed Self-Test (BIST). | Major
852A | DIMM_F1 failed Self-Test (BIST). | Major
852B | DIMM_F2 failed Self-Test (BIST). | Major
8540 | DIMM_A1 Disabled. | Major
8541 | DIMM_A2 Disabled. | Major
8542 | DIMM_B1 Disabled. | Major
8543 | DIMM_B2 Disabled. | Major
8544 | DIMM_C1 Disabled. | Major
8545 | DIMM_C2 Disabled. | Major
8546 | DIMM_D1 Disabled. | Major
8547 | DIMM_D2 Disabled. | Major
8548 | DIMM_E1 Disabled. | Major
8549 | DIMM_E2 Disabled. | Major

Error Code | Error Message | Response
854A | DIMM_F1 Disabled. | Major
854B | DIMM_F2 Disabled. | Major
8560 | DIMM_A1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8561 | DIMM_A2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8562 | DIMM_B1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8563 | DIMM_B2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8564 | DIMM_C1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8565 | DIMM_C2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8566 | DIMM_D1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8567 | DIMM_D2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8568 | DIMM_E1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8569 | DIMM_E2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
856A | DIMM_F1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
856B | DIMM_F2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
85A0 | DIMM_A1 Uncorrectable ECC error encountered. | Major
85A1 | DIMM_A2 Uncorrectable ECC error encountered. | Major
85A2 | DIMM_B1 Uncorrectable ECC error encountered. | Major
85A3 | DIMM_B2 Uncorrectable ECC error encountered. | Major
85A4 | DIMM_C1 Uncorrectable ECC error encountered. | Major
85A5 | DIMM_C2 Uncorrectable ECC error encountered. | Major
85A6 | DIMM_D1 Uncorrectable ECC error encountered. | Major
85A7 | DIMM_D2 Uncorrectable ECC error encountered. | Major
85A8 | DIMM_E1 Uncorrectable ECC error encountered. | Major
85A9 | DIMM_E2 Uncorrectable ECC error encountered. | Major
85AA | DIMM_F1 Uncorrectable ECC error encountered. | Major

Error Code | Error Message | Response
85AB | DIMM_F2 Uncorrectable ECC error encountered. | Major
8604 | Chipset Reclaim of non-critical variables complete. | Minor
9000 | Unspecified processor component has encountered a non-specific error. | Major
9223 | Keyboard component was not detected. | Minor
9226 | Keyboard component encountered a controller error. | Minor
9243 | Mouse component was not detected. | Minor
9246 | Mouse component encountered a controller error. | Minor
9266 | Local Console component encountered a controller error. | Minor
9268 | Local Console component encountered an output error. | Minor
9269 | Local Console component encountered a resource conflict error. | Minor
9286 | Remote Console component encountered a controller error. | Minor
9287 | Remote Console component encountered an input error. | Minor
9288 | Remote Console component encountered an output error. | Minor
92A3 | Serial port component was not detected | Major
92A9 | Serial port component encountered a resource conflict error | Major
92C6 | Serial Port controller error | Minor
92C7 | Serial Port component encountered an input error. | Minor
92C8 | Serial Port component encountered an output error. | Minor
94C6 | LPC component encountered a controller error. | Minor
94C9 | LPC component encountered a resource conflict error. | Major
9506 | ATA/ATPI component encountered a controller error. | Minor
95A6 | PCI component encountered a controller error. | Minor
95A7 | PCI component encountered a read error. | Minor
95A8 | PCI component encountered a write error. | Minor
9609 | Unspecified software component encountered a start error. | Minor

Error Code | Error Message | Response
9641 | PEI Core component encountered a load error. | Minor
9667 | PEI module component encountered an illegal software state error. | Fatal
9687 | DXE core component encountered an illegal software state error. | Fatal
96A7 | DXE boot services driver component encountered an illegal software state error. | Fatal
96AB | DXE boot services driver component encountered invalid configuration. | Minor
96E7 | SMM driver component encountered an illegal software state error. | Fatal
A000 | TPM device not detected. | Minor
A001 | TPM device missing or not responding. | Minor
A002 | TPM device failure. | Minor
A003 | TPM device failed self-test. | Minor
A022 | Processor component encountered a mismatch error. | Major
A027 | Processor component encountered a low voltage error. | Minor
A028 | Processor component encountered a high voltage error. | Minor
A100 | BIOS ACM Error | Major
A421 | PCI component encountered a SERR error. | Fatal
A500 | ATA/ATPI ATA bus SMART not supported. | Minor
A501 | ATA/ATPI ATA SMART is disabled. | Minor
A5A0 | PCI Express component encountered a PERR error. | Minor
A5A1 | PCI Express component encountered a SERR error. | Fatal
A5A4 | PCI Express IBIST error. | Major
A6A0 | DXE boot services driver Not enough memory available to shadow a legacy option ROM. | Minor
B6A3 | DXE boot services driver Unrecognized. | Major

10. Chassis Subsystem

The BMC monitors several aspects of the chassis. In addition to logging presses of the power and reset buttons, the BMC monitors chassis intrusion (if the chassis includes a chassis intrusion switch) and the network connections, logging an event whenever a physical network link is lost.

10.1 Physical Security

Two sensors are included in the physical security subsystem: chassis intrusion and LAN leash lost.

10.1.1 Chassis Intrusion

Chassis Intrusion is monitored on supported chassis, and the BMC logs corresponding events when the chassis lid is opened and closed.

10.1.2 LAN Leash Lost

The LAN Leash lost sensor monitors the physical connection on the onboard network ports. If a LAN Leash lost event is logged, this means the network port lost its physical connection.

Table 72: Physical Security Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 05h = Physical Security
12 | Sensor Number | 04h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 73
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
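Bytes 13 and 14 use the same bit layout in most of the sensor tables in this guide, so a single helper can split them into fields. The sketch below is illustrative only; the names `EventFields` and `decode_event_bytes` are hypothetical, and the caller is assumed to have already extracted bytes 13 and 14 from the SEL record.

```python
from typing import NamedTuple


class EventFields(NamedTuple):
    deassertion: bool   # byte 13 bit [7]: False = assertion, True = deassertion
    event_type: int     # byte 13 bits [6:0], for example 6Fh = sensor specific
    ed2_code: int       # Event Data 1 bits [7:6]: what Event Data 2 holds
    ed3_code: int       # Event Data 1 bits [5:4]: what Event Data 3 holds
    offset: int         # Event Data 1 bits [3:0]: event trigger offset


def decode_event_bytes(byte13: int, event_data1: int) -> EventFields:
    """Split the Event Direction/Event Type byte and Event Data 1 into fields."""
    return EventFields(
        deassertion=bool(byte13 & 0x80),
        event_type=byte13 & 0x7F,
        ed2_code=(event_data1 >> 6) & 0x3,
        ed3_code=(event_data1 >> 4) & 0x3,
        offset=event_data1 & 0x0F,
    )


if __name__ == "__main__":
    # Physical Security sensor, assertion, offset 04h (LAN leash lost)
    print(decode_event_bytes(0x6F, 0x04))
```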

Table 73: Physical Security Sensor Event Trigger Offset – Next Steps

Event Trigger Offset 00h: Chassis intrusion
Description: Somebody has opened the chassis (or the chassis intrusion sensor is not connected).
Next Steps:
1. Use the Quick Start Guide and the Service Guide to determine whether the chassis intrusion switch is connected properly.
2. If it is, make sure it makes proper contact when the chassis is closed.
3. If it does, someone has opened the chassis. Ensure that only authorized persons have access to the system.

Event Trigger Offset 04h: LAN leash lost
Description: Someone has unplugged a LAN cable that was present when the BMC initialized. This event is logged when the electrical connection on the NIC connector is lost.
Next Steps: This is most likely due to unplugging the cable, but could also happen if there is an issue with the cable or switch.
1. Check the LAN cable and connector for issues.
2. Investigate switch logs where possible.
3. Ensure that only authorized persons have access to the server.


10.2 FP (NMI) Interrupt

The front panel interrupt button (also referred to as NMI button) is a recessed button on the front panel that allows the user to force a critical interrupt which causes a crash error or kernel panic.

Table 74: FP (NMI) Interrupt Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 05h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 0
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

10.2.1 FP (NMI) Interrupt – Next Steps

The purpose of this button is to help diagnose software issues: when a critical interrupt is generated, the OS typically saves a memory dump.

This allows for exact analysis of what is going on in system memory, which can be useful for software developers, or for troubleshooting OS, software, and driver issues.

This event is normally logged only when a user presses the NMI button; although it causes the OS to crash, it is not an error.

If the button was not actually pressed, ensure there is no physical fault with the front panel.


10.3 Button Press Events

The BMC logs when the front panel power and reset buttons get pressed. This is purely for informational purposes and these events do not indicate errors.

Table 75: Button Press Events Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 14h = Button/Switch
12 | Sensor Number | 09h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Power Button, 2h = Reset Button
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used


11. Miscellaneous Events

The miscellaneous events section addresses sensors not easily grouped with other sensor types.

11.1 IPMI Watchdog

PCSD server systems support an IPMI watchdog timer, which can check to see whether the OS is still responsive. The timer is disabled by default, and has to be enabled manually. It then requires an IPMI-aware utility in the operating system that will reset the timer before it expires.

If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle, or generate a critical interrupt).

Table 76: IPMI Watchdog Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 23h = Watchdog 2
12 | Sensor Number | 03h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 77
15 | Event Data 2 | [7:4] – Interrupt type: 0h = None, 1h = SMI, 2h = NMI, 3h = Messaging Interrupt, Fh = Unspecified, all other = Reserved; [3:0] – Timer use at expiration: 0h = Reserved, 1h = BIOS FRB2, 2h = BIOS/POST, 3h = OS Load, 4h = SMS/OS, 5h = OEM, Fh = Unspecified, all other = Reserved
16 | Event Data 3 | Not used
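The Event Data 2 layout in Table 76 can be decoded with two lookups, one per nibble. This is a minimal sketch based only on the values listed in that table; the function name `decode_watchdog_event_data2` is hypothetical.

```python
# Values copied from the Event Data 2 row of Table 76.
INTERRUPT_TYPES = {
    0x0: "None",
    0x1: "SMI",
    0x2: "NMI",
    0x3: "Messaging Interrupt",
    0xF: "Unspecified",
}
TIMER_USES = {
    0x1: "BIOS FRB2",
    0x2: "BIOS/POST",
    0x3: "OS Load",
    0x4: "SMS/OS",
    0x5: "OEM",
    0xF: "Unspecified",
}


def decode_watchdog_event_data2(event_data2: int) -> tuple[str, str]:
    """Return (interrupt type, timer use at expiration) for Event Data 2."""
    interrupt = INTERRUPT_TYPES.get((event_data2 >> 4) & 0xF, "Reserved")
    timer_use = TIMER_USES.get(event_data2 & 0xF, "Reserved")
    return interrupt, timer_use


if __name__ == "__main__":
    # 04h: no interrupt, timer was in use by SMS/OS when it expired
    print(decode_watchdog_event_data2(0x04))
```

A timer use of SMS/OS indicates that the timer was being serviced by software running under the operating system when it expired.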

Table 77: IPMI Watchdog Sensor Event Trigger Offset – Next Steps

Hex | Event Trigger Description
00h | Timer expired, status only
01h | Hard reset
02h | Power down
03h | Power cycle
08h | Timer interrupt

Description: Our server systems support a BMC watchdog timer, which can check to see whether the OS is still responsive. The timer is disabled by default, and has to be enabled manually. It then requires an IPMI-aware utility in the operating system that will reset the timer before it expires. If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle, or generate a critical interrupt).

Next Steps: If this event is being logged, it is because the BMC has been configured to check the watchdog timer.
1. Make sure you have support for this in your OS (typically using a third-party IPMI-aware utility like ipmitool or ipmiutil along with the openipmi driver).
2. If this is the case, then it is likely your OS has hung, and you should investigate OS event logs to determine what may have caused this.


11.2 SMI Timeout

SMI stands for system management interrupt and is an interrupt that gets generated so the processor can service server management events (typically memory or PCI errors, or other forms of critical interrupts), in order to log them to the SEL. If this interrupt times out, the system is frozen.

Table 78: SMI Timeout Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | F3h = SMI Timeout
12 | Sensor Number | 06h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (“digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1 = State Asserted
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

11.2.1 SMI Timeout – Next Steps

This event normally only occurs after another more critical event.

1. Check the SEL for any critical interrupts, memory errors, bus errors, PCI errors, or any other serious errors.

2. If these are not present, the system locked up before it was able to log the original issue. In this case, low level debug is normally required.


11.3 System Event Log Cleared

The BMC logs an event when the SEL is cleared. This is only ever the first event in the SEL. The cause of this event is either a manual SEL clear using the Intel® SEL Viewer or some other IPMI-aware utility, or a clear performed in the factory as one of the last steps in the manufacturing process.

This is an informational event only.

Table 79: System Event Log Cleared Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 10h = Event Logging Disabled
12 | Sensor Number | 07h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 2 = Log area reset/cleared
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

11.4 System Event – PEF Action

The BMC is configurable to send alerts for events logged into the SEL. These alerts are called Platform Event Filters (PEF) and are disabled by default. The user must configure and enable this feature. PEF events are logged if the BMC takes action due to a PEF configuration. The BMC event triggering the PEF action will also be in the SEL.


This functionality is built into the BMC to allow it to send alerts (SNMP or other) for any event that gets logged to the SEL. PEF filters are turned off by default and have to be enabled manually using Intel® deployment assistant, Intel® syscfg utility, or an IPMI-aware utility.

Table 80: System Event – PEF Action Sensor Typical Characteristics

Byte | Field | Description
11 | Sensor Type | 12h = System Event
12 | Sensor Number | 08h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 4 = PEF Action
15 | Event Data 2 | [7:6] – Reserved; [5] – 1b = Diagnostic Interrupt (NMI); [4] – 1b = OEM action; [3] – 1b = Power cycle; [2] – 1b = Reset; [1] – 1b = Power off; [0] – 1b = Alert
16 | Event Data 3 | Not used

11.4.1 System Event – PEF Action – Next Steps

This event gets logged if the BMC takes an action due to PEF configuration. Actions can be sending an alert, or resetting, power cycling, or powering down the system. There will be another event that has led to the action so you should investigate the SEL and PEF settings to identify this event, and troubleshoot accordingly.
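Because Event Data 2 of a PEF Action event is a bit mask, more than one action can be reported in a single record. The following sketch lists the actions whose bits are set, using the bit assignments from Table 80; the name `decode_pef_actions` is hypothetical.

```python
# (bit position, action) pairs from the Event Data 2 row of Table 80.
PEF_ACTIONS = [
    (5, "Diagnostic Interrupt (NMI)"),
    (4, "OEM action"),
    (3, "Power cycle"),
    (2, "Reset"),
    (1, "Power off"),
    (0, "Alert"),
]


def decode_pef_actions(event_data2: int) -> list[str]:
    """Return the PEF actions whose bits are set, highest bit first."""
    return [name for bit, name in PEF_ACTIONS if event_data2 & (1 << bit)]


if __name__ == "__main__":
    # 05h has bits 2 and 0 set: the BMC reset the system and sent an alert.
    print(decode_pef_actions(0x05))
```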


12. Hot Swap Controller Events

The Hot Swap Controller (HSC) implements the same basic sensor model that is utilized by the other management controllers in the system.

Sensor model information is contained in the document Intelligent Platform Management Interface Specification. A common set of IPMI commands is used for configuring the sensors and returning threshold status.

12.1 HSC Backplane Temperature Sensor

There is a thermal sensor on the Hot Swap Backplane to measure the ambient temperature.

Table 81: HSC Backplane Temperature Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 00C0h = HSC Firmware – HSBP A; 00C2h = HSC Firmware – HSBP B
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | 01h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 82
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event


Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps

Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The temperature has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The temperature has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The temperature has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The temperature has gone over its upper critical threshold.

Next Steps:
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).

12.2 HSC Drive Slot Status Sensor

The HSC Drive Slot Status sensor provides the current status for drives in each of the slots.

Table 83: HSC Drive Slot Status Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 00C0h = HSC Firmware – HSBP A; 00C2h = HSC Firmware – HSBP B
11 | Sensor Type | 0Dh = Drive Slot (Bay)
12 | Sensor Number | 6 Slot HSBP: 02h = Drive Slot 0 Status, 03h = Drive Slot 1 Status, 04h = Drive Slot 2 Status, 05h = Drive Slot 3 Status, 06h = Drive Slot 4 Status, 07h = Drive Slot 5 Status. 8 Slot HSBP: 02h = Drive Slot 0 Status, 03h = Drive Slot 1 Status, 04h = Drive Slot 2 Status, 05h = Drive Slot 3 Status, 06h = Drive Slot 4 Status, 07h = Drive Slot 5 Status, 08h = Drive Slot 6 Status, 09h = Drive Slot 7 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | 40h = Failed Drive
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

12.2.1 HSC Drive Slot Status Sensor – Next Steps

If during normal operation a drive gets reported as failed, ensure that the drive was seated properly and the drive carrier was properly latched.

If that does not work, replace the drive.

12.3 HSC Drive Presence Sensor

The HSC Drive Slot Presence sensor provides the current presence state for the drive in each of the slots. After an AC power cycle there will be a SEL entry to report the presence of the drive in a slot and there will be another entry for any changes in the presence of drives after that.


Table 84: HSC Drive Presence Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 00C0h = HSC Firmware – HSBP A; 00C2h = HSC Firmware – HSBP B
11 | Sensor Type | 0Dh = Drive Slot (Bay)
12 | Sensor Number | 6 Slot HSBP: 08h = Drive Slot 0 Presence, 09h = Drive Slot 1 Presence, 0Ah = Drive Slot 2 Presence, 0Bh = Drive Slot 3 Presence, 0Ch = Drive Slot 4 Presence, 0Dh = Drive Slot 5 Presence. 8 Slot HSBP: 0Ah = Drive Slot 0 Presence, 0Bh = Drive Slot 1 Presence, 0Ch = Drive Slot 2 Presence, 0Dh = Drive Slot 3 Presence, 0Eh = Drive Slot 4 Presence, 0Fh = Drive Slot 5 Presence, 10h = Drive Slot 6 Presence, 11h = Drive Slot 7 Presence
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 08h (“digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Device Removed / Device Absent, 1h = Device Inserted / Device Present
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

12.3.1 HSC Drive Presence Sensor – Next Steps

On AC power-on the drive presence will be logged as an informational event.


If during normal operation a drive is removed or installed, it will also log an event.

If you get a drive removed or installed without operator intervention, ensure that the drive was seated properly and the drive carrier was properly latched.
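The same presence sensor number maps to a different drive slot depending on whether a 6-slot or an 8-slot backplane is installed, which is easy to misread in the SEL. The sketch below, which assumes only the sensor numbers and offsets listed in Table 84, converts a sensor number to a slot index; `presence_slot` and `presence_state` are hypothetical names.

```python
# First presence sensor number for each backplane size, from Table 84.
PRESENCE_BASE = {6: 0x08, 8: 0x0A}

PRESENCE_STATES = {
    0x0: "Device Removed / Device Absent",
    0x1: "Device Inserted / Device Present",
}


def presence_slot(sensor_number: int, backplane_slots: int) -> int:
    """Return the drive slot index for a presence sensor number."""
    if backplane_slots not in PRESENCE_BASE:
        raise ValueError("backplane_slots must be 6 or 8")
    slot = sensor_number - PRESENCE_BASE[backplane_slots]
    if not 0 <= slot < backplane_slots:
        raise ValueError("sensor number is not a presence sensor for this backplane")
    return slot


def presence_state(event_data1: int) -> str:
    """Decode the event trigger offset in Event Data 1 bits [3:0]."""
    return PRESENCE_STATES.get(event_data1 & 0x0F, "Unknown offset")


if __name__ == "__main__":
    # Sensor 0Dh is slot 5 on a 6-slot backplane but slot 3 on an 8-slot one.
    print(presence_slot(0x0D, 6), presence_slot(0x0D, 8))
```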


13. Manageability Engine (ME) Events

The Manageability Engine controls the PECI interface and also contains the Node Manager functionality.

13.1 Node Manager Exception Event

A Node Manager Exception Event is sent each time the power limit of a maintained policy is exceeded for longer than the Correction Time Limit.

Table 85: Node Manager Exception Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 002Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 18h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 72h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3] – Node Manager Policy event: 0 – Reserved, 1 – Policy Correction Time Exceeded – Policy did not meet the contract for the defined policy. The policy will continue to limit the power or shut down the platform based on the defined policy action; [2] – Reserved; [1:0] – 00b
15 | Event Data 2 | [7:4] – Reserved; [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16 | Event Data 3 | Policy Id


13.1.1 Node Manager Exception Event – Next Steps

This is an informational event. Next steps depend on the policy that was set. See the Node Manager Specification for more details.

13.2 Node Manager Health Event

A Node Manager Health Event message provides a runtime error indication about Intel® Intelligent Power Node Manager’s health. Types of service that can send an error are defined as follows:

• Misconfigured policy

• Error reading power data

• Error reading inlet temperature

Table 86: Node Manager Health Event Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 002Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 19h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 73h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Health Event Type = 02h (Sensor Node Manager)
15 | Event Data 2 | [7:4] – Error type: 0-9 – Reserved, 10 – Policy Misconfiguration, 11 – Power Sensor Reading Failure, 12 – Inlet Temperature Reading Failure, 13 – Host Communication error, 14 – Real-time clock synchronization failure, 15 – Platform shutdown initiated by NM policy due to execution of action defined by Policy Exception Action; [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16 | Event Data 3 | If Error type = 10 or 15: <Policy Id>; if Error type = 11: <Power Sensor Address>; if Error type = 12: <Inlet Sensor Address>; otherwise set to 0.

13.2.1 Node Manager Health Event – Next Steps

A misconfigured policy can occur if the max/min power consumption of the platform exceeds the values in the policy due to hardware reconfiguration.

The first occurrence of an unacknowledged event is retransmitted no faster than every 300 milliseconds.

A real-time clock synchronization failure alert is sent when NM is enabled and capable of limiting power, but the firmware cannot obtain valid calendar time from the host side within 10 minutes, so NM cannot handle suspend periods.

Next steps depend on the policy that was set. See the Node Manager Specification for more details.
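The meaning of Event Data 3 in a Node Manager Health Event depends on the error type in Event Data 2, so both bytes need to be read together. The following sketch follows the field layout listed in Table 86; the function name `decode_nm_health` and the dictionary keys are hypothetical.

```python
# Error types from Event Data 2 bits [7:4]; values 0-9 are reserved.
ERROR_TYPES = {
    10: "Policy Misconfiguration",
    11: "Power Sensor Reading Failure",
    12: "Inlet Temperature Reading Failure",
    13: "Host Communication error",
    14: "Real-time clock synchronization failure",
    15: "Platform shutdown initiated by NM policy",
}

# What Event Data 3 carries for each error type; otherwise it is set to 0.
ED3_FIELD = {
    10: "Policy Id",
    15: "Policy Id",
    11: "Power Sensor Address",
    12: "Inlet Sensor Address",
}


def decode_nm_health(event_data2: int, event_data3: int) -> dict:
    """Decode Event Data 2 and 3 of a Node Manager Health Event."""
    error_type = (event_data2 >> 4) & 0xF
    return {
        "error": ERROR_TYPES.get(error_type, "Reserved"),
        "domain_id": event_data2 & 0xF,
        "event_data3_field": ED3_FIELD.get(error_type),
        "event_data3": event_data3,
    }


if __name__ == "__main__":
    # B0h: error type 11 in domain 0; Event Data 3 is the power sensor address.
    print(decode_nm_health(0xB0, 0x20))
```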


13.3 Node Manager Operational Capabilities Change

This message provides a runtime error indication about Intel® Intelligent Power Node Manager’s operational capabilities. This applies to all domains.

Assertion and deassertion of these events are supported.

Table 87: Node Manager Operational Capabilities Change Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 002Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 1Ah
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 74h (OEM)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Current state of Operational Capabilities. Bit pattern: bit 0 – Policy interface capability (0 – Not Available, 1 – Available); bit 1 – Monitoring capability (0 – Not Available, 1 – Available); bit 2 – Power limiting capability (0 – Not Available, 1 – Available)
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

13.3.1 Node Manager Operational Capabilities Change – Next Steps

Policy Interface available indicates that Intel® Intelligent Power Node Manager is able to respond to the external interface about querying and setting Intel® Intelligent Power Node Manager policies. This is generally available as soon as the microcontroller is initialized.

Monitoring Interface available indicates that Intel® Intelligent Power Node Manager has the capability to monitor power and temperature. This is generally available when firmware is operational.

Power limiting interface available indicates that Intel® Intelligent Power Node Manager can do power limiting and is indicative of an ACPI-compliant OS loaded (unless the OEM has indicated support for non-ACPI compliant OS).

The current value of an unacknowledged capability sensor is retransmitted no faster than every 300 milliseconds.

Next steps depend on the policy that was set. See the Node Manager Specification for more details.
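The low bits of Event Data 1 in this event form a small bit field, one bit per capability. A minimal sketch of reading it follows, assuming the bit assignments given in Table 87 (bit 0 policy interface, bit 1 monitoring, bit 2 power limiting); the name `decode_nm_capabilities` is hypothetical.

```python
# Capability names in bit order (bit 0 first), from Table 87.
CAPABILITIES = ["Policy interface", "Monitoring", "Power limiting"]


def decode_nm_capabilities(event_data1: int) -> dict[str, bool]:
    """Map each capability to True (Available) or False (Not Available)."""
    return {
        name: bool(event_data1 & (1 << bit))
        for bit, name in enumerate(CAPABILITIES)
    }


if __name__ == "__main__":
    # 03h: policy interface and monitoring available, power limiting not yet.
    for name, available in decode_nm_capabilities(0x03).items():
        print(f"{name}: {'Available' if available else 'Not Available'}")
```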


13.4 Node Manager Alert Threshold Exceeded

A Policy Correction Time Exceeded event is sent each time the power limit of a maintained policy is exceeded for longer than the Correction Time Limit.

Table 88: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 002Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 1Bh
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 72h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3] – Node Manager Policy event: 0 – Threshold exceeded, 1 – Policy Correction Time Exceeded – Policy did not meet the contract for the defined policy. The policy will continue to limit the power or shut down the platform based on the defined policy action; [2] – Reserved; [1:0] – Threshold Number. Valid only if Byte 5 bit [3] is set to 0. 0 to 2 – Threshold index
15 | Event Data 2 | [7:4] – Reserved; [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16 | Event Data 3 | Policy ID
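In this event, bit [3] of Event Data 1 selects between a threshold event and a correction-time event, and the threshold number is meaningful only in the first case. The sketch below applies that rule using the layout in Table 88; the function name `decode_nm_alert` and the dictionary keys are hypothetical.

```python
from typing import Optional


def decode_nm_alert(event_data1: int, event_data2: int, event_data3: int) -> dict:
    """Decode a Node Manager Alert Threshold Exceeded event."""
    correction_time_exceeded = bool(event_data1 & 0x08)  # Event Data 1 bit [3]
    # The threshold number in bits [1:0] is valid only when bit [3] is 0.
    threshold: Optional[int] = None
    if not correction_time_exceeded:
        threshold = event_data1 & 0x03
    return {
        "event": (
            "Policy Correction Time Exceeded"
            if correction_time_exceeded
            else "Threshold exceeded"
        ),
        "threshold_index": threshold,
        "domain_id": event_data2 & 0x0F,
        "policy_id": event_data3,
    }


if __name__ == "__main__":
    # A1h: OEM codes in Event Data 2 and 3, bit [3] clear, threshold index 1.
    print(decode_nm_alert(0xA1, 0x00, 0x05))
```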


13.4.1 Node Manager Alert Threshold Exceeded – Next Steps

First occurrence of an unacknowledged event will be retransmitted no faster than every 300 milliseconds.

First occurrence of Threshold exceeded event assertion/deassertion will be retransmitted no faster than every 300 milliseconds.

Next steps depend on the policy that was set. See the Node Manager Specification for more details.

13.5 ME Firmware Health Event

This sensor is used in Platform Event messages to the BMC containing health information including but not limited to firmware upgrade and application errors.

Table 89: ME Firmware Health Event Sensor Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 002Ch or 602Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 17h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 75h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Health event type – 0h (Firmware Status)
15 | Event Data 2 | See Table 90
16 | Event Data 3 | See Table 90


13.5.1 ME Firmware Health Event – Next Steps

In the following table, Event Data 3 is noted only for specific errors.

If the issue persists, provide the content of Event Data 3 to the Intel support team for interpretation. Event Data 3 codes are generally not documented because their meaning varies, offers only partial clues, and usually needs to be interpreted case by case.

Table 90: ME Firmware Health Event Sensor – Next Steps

ED2 | ED3 | Description | Next Steps
00h | | Recovery GPIO forced. Recovery Image loaded due to recovery MGPIO pin asserted. Pin number is configurable in factory presets. Default recovery pin is MGPIO1. | Deassert MGPIO1 and reset the Intel® ME.
01h | | Image execution failed. Recovery Image or backup operational image loaded because operational image is corrupted. This may be either caused by flash device corruption or failed upgrade procedure. | Either the flash device must be replaced (if error is persistent) or the upgrade procedure must be started again.
02h | | Flash erase error. Error during flash erasure procedure. | The flash device must be replaced.
03h | | Flash corrupted. Error while checking Flash consistency. | The Flash device must be replaced (if error is persistent).
04h | | Internal error. Error during firmware execution – FW Watchdog Timeout. | Operational image needs to be updated to other version or hardware board repair is needed (if error is persistent).
05h | | BMC did not respond to cold reset request and Intel® ME rebooted the platform. | Verify the Intel® Node Manager configuration.
06h-FFh | | Reserved. |
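
When many SEL entries need to be triaged, Table 90 can be carried into a script as a simple lookup. The Python sketch below is one possible arrangement. The names ME_HEALTH_NEXT_STEPS and me_health_next_step are hypothetical, and the text is condensed from the table rather than quoted in full.

```python
# Event Data 2 value -> (short description, next step), condensed from Table 90.
ME_HEALTH_NEXT_STEPS = {
    0x00: ("Recovery GPIO forced",
           "Deassert MGPIO1 and reset the Intel ME."),
    0x01: ("Image execution failed",
           "Replace the flash device (if persistent) or restart the upgrade."),
    0x02: ("Flash erase error",
           "Replace the flash device."),
    0x03: ("Flash corrupted",
           "Replace the flash device (if persistent)."),
    0x04: ("Internal error (FW Watchdog Timeout)",
           "Update the operational image or repair the board (if persistent)."),
    0x05: ("BMC did not respond to cold reset request",
           "Verify the Intel Node Manager configuration."),
}

def me_health_next_step(ed2):
    """Look up an Event Data 2 code; 06h-FFh are reserved."""
    if not 0 <= ed2 <= 0xFF:
        raise ValueError("Event Data 2 must be a single byte")
    return ME_HEALTH_NEXT_STEPS.get(ed2, ("Reserved", ""))
```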


14. Microsoft Windows* Records

With Microsoft Windows Server 2003* R2 and later versions, an Intelligent Platform Management Interface (IPMI) driver was added. This added the capability of logging some OS events to the SEL. The driver can write multiple records to the SEL for the following events:

• Boot-up

• Shutdown

• Bug Check / Blue Screen

14.1 Boot-up Event Records

When the system boots into the Microsoft Windows* OS, there can be two events logged. The first is a boot-up record and the second is an OEM event. These are informational only records.

Table 91: Boot-up Event Record Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0041h – System Software with an ID = 20h
11 | Sensor Type | 1Fh = OS Boot
12 | Sensor Number | 00h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h = C: boot completed
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
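
The Generator ID value 0041h corresponds to "System Software with an ID = 20h" because, in the IPMI event message format, bit 0 of the first Generator ID byte flags a software ID and the remaining seven bits carry the ID itself. That layout comes from the IPMI specification rather than from this guide, so treat the Python sketch below as an assumption-based illustration. The function name decode_generator_id is hypothetical.

```python
def decode_generator_id(byte8, byte9):
    """Decode the two Generator ID bytes (record bytes 8 and 9, LS byte first)."""
    if byte8 & 0x01:
        # Bit 0 set: bits [7:1] hold a system software ID.
        source = {"kind": "system software", "software_id": byte8 >> 1}
    else:
        # Bit 0 clear: the byte is the 8-bit form of an IPMB slave address.
        source = {"kind": "IPMB device", "slave_address": byte8}
    source["channel"] = (byte9 >> 4) & 0x0F   # byte 9 bits [7:4]
    source["lun"] = byte9 & 0x03              # byte 9 bits [1:0]
    return source
```

With bytes 41h and 00h this yields software ID 20h, matching the Generator ID row of Table 91.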


Table 92: Boot-up OEM Event Record Typical Characteristics

Byte | Field | Description
1, 2 | Record ID | ID used for SEL Record access.
3 | Record Type | [7:0] – DCh = OEM timestamped, bytes 8-16 OEM defined
4 to 7 | Timestamp | Time when event was logged. LS byte first.
8 to 10 | IPMI Manufacturer ID | 0137h (311d) = IANA enterprise number for Microsoft
11 | Record ID | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12 to 15 | Boot Time | Timestamp of when system booted into the OS
16 | Reserved | 00h
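
The 16-byte layout in Table 92 maps directly onto a fixed binary structure. The Python sketch below shows one way to split such a record with the standard struct module. The function name parse_oem_timestamped_record and the dictionary keys are hypothetical. The sketch assumes that all multi-byte fields, including the 4-byte payload in bytes 12 to 15, are stored LS byte first, and it returns raw integers without interpreting the time base.

```python
import struct

# Bytes 1-2 record ID, 3 record type, 4-7 timestamp, 8-10 manufacturer ID,
# 11 sequence number, 12-15 payload, 16 last byte. "<" = little-endian, no padding.
_OEM_RECORD = struct.Struct("<HBI3sB4sB")

def parse_oem_timestamped_record(raw):
    """Split a 16-byte OEM timestamped SEL record into named fields."""
    if len(raw) != _OEM_RECORD.size:
        raise ValueError("expected a 16-byte SEL record")
    rec_id, rec_type, ts, mfr, seq, payload, last = _OEM_RECORD.unpack(raw)
    return {
        "record_id": rec_id,
        "record_type": rec_type,
        "timestamp": ts,
        "manufacturer_id": int.from_bytes(mfr, "little"),  # 311 for Microsoft
        "sequence": seq,
        "payload": int.from_bytes(payload, "little"),      # e.g. Boot Time
        "payload_bytes": payload,
        "last_byte": last,
    }
```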

14.2 Shutdown Event Records

When the system shuts down from the Microsoft Windows* OS, there can be multiple events logged. The first is an OS Stop/Shutdown Event Record; this can be followed by a shutdown reason code OEM record, and then zero or more shutdown comment OEM records. These are all informational only records.

Table 93: Shutdown Reason Code Event Record Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0041h – System Software with an ID = 20h
11 | Sensor Type | 20h = OS Stop/Shutdown
12 | Sensor Number | 00h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 3h = OS Graceful Shutdown
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

Table 94: Shutdown Reason OEM Event Record Typical Characteristics

Byte | Field | Description
1, 2 | Record ID | ID used for SEL Record access.
3 | Record Type | [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined
4 to 7 | Timestamp | Time when event was logged. LS byte first.
8 to 10 | IPMI Manufacturer ID | 0137h (311d) = IANA enterprise number for Microsoft
11 | Record ID | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12 to 15 | Shutdown Reason | Shutdown Reason code from the registry (LSB first.): HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/ReasonCode
16 | Reserved | 00h
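
Because the Shutdown Reason field is stored LSB first, its four bytes must be reversed before the value reads as a conventional hexadecimal code. The Python sketch below does that and then splits the value into flag, major, and minor parts. The split assumes the registry ReasonCode follows the Windows SHTDN_REASON_* convention (planned flag 80000000h, user-defined flag 40000000h, major reason in bits 23:16, minor reason in bits 15:0), which this guide does not confirm. The function name decode_shutdown_reason is hypothetical.

```python
def decode_shutdown_reason(data):
    """Decode record bytes 12-15 of a shutdown reason OEM record."""
    if len(data) != 4:
        raise ValueError("expected the 4 bytes from record bytes 12-15")
    code = int.from_bytes(data, "little")        # stored LSB first
    return {
        "code": code,
        # Assumed layout: Windows SHTDN_REASON_* flags and fields.
        "planned": bool(code & 0x80000000),
        "user_defined": bool(code & 0x40000000),
        "major": (code >> 16) & 0xFF,
        "minor": code & 0xFFFF,
    }
```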

Table 95: Shutdown Comment OEM Event Record Typical Characteristics

Byte | Field | Description
1, 2 | Record ID | ID used for SEL Record access.
3 | Record Type | [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined
4 to 7 | Timestamp | Time when event was logged. LS byte first.
8 to 10 | IPMI Manufacturer ID | 0137h (311d) = IANA enterprise number for Microsoft; 0157h (343) = IANA enterprise number for Intel. The value logged depends on the Intelligent Management Bus Driver (IMBDRV) that is loaded.
11 | Record ID | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12 to 15 | Shutdown Comment | Shutdown Comment from the registry (LSB first.): HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/Comment
16 | Reserved | 00h


14.3 Bug Check / Blue Screen Event Records

When the system experiences a bug check (blue screen), there will be multiple records written to the event log. The first is a Bug Check / Blue Screen OS Stop/Shutdown Event Record; this can be followed by multiple Bug Check / Blue Screen code OEM records that will contain the Bug Check / Blue Screen codes. This information can be used to determine what caused the failure.

Table 96: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0041h – System Software with an ID = 20h
11 | Sensor Type | 20h = OS Stop/Shutdown
12 | Sensor Number | 00h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (that is, “core dump”, “blue screen”)
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used

Table 97: Bug Check / Blue Screen Code OEM Event Record Typical Characteristics

Byte | Field | Description
1, 2 | Record ID | ID used for SEL Record access.
3 | Record Type | [7:0] – DEh = OEM timestamped, bytes 8-16 OEM defined
4 to 7 | Timestamp | Time when event was logged. LS byte first.
8 to 10 | IPMI Manufacturer ID | 0137h (311) = IANA enterprise number for Microsoft; 0157h (343) = IANA enterprise number for Intel. The value logged depends on the Intelligent Management Bus Driver (IMBDRV) that is loaded.
11 | Sequence Number | Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12 to 15 | Bug Check/Blue Screen Data | The first record of this type will contain the Bug Check / Blue Screen Stop code and will be followed by the four Bug Check / Blue Screen parameters. LSB first. Note that each of the Bug Check / Blue Screen parameters requires two records. Both of the two records for each parameter will have the same Record ID. There will be a total of 9 records.
16 | Operating system type | 00 = 32 bit OS; 01 = 64 bit OS
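
The nine data fields described in Table 97 can be stitched back into a stop code and four parameters. The Python sketch below is a minimal illustration. The function name assemble_bug_check is hypothetical, the input is the list of 4-byte values taken from bytes 12 to 15 of the nine records in log order, and the sketch assumes that the first of each pair of parameter records carries the low-order half.

```python
from typing import List, Tuple

def assemble_bug_check(data_fields: List[bytes]) -> Tuple[int, List[int]]:
    """Rebuild (stop_code, [p1, p2, p3, p4]) from nine 4-byte data fields."""
    if len(data_fields) != 9 or any(len(f) != 4 for f in data_fields):
        raise ValueError("expected nine 4-byte data fields")
    # Record 1 holds the stop code, LSB first.
    stop_code = int.from_bytes(data_fields[0], "little")
    params = []
    # Records 2-9 hold the four parameters, two records per parameter.
    for i in range(1, 9, 2):
        # Assumed ordering: low-order half first, then high-order half.
        params.append(int.from_bytes(data_fields[i] + data_fields[i + 1], "little"))
    return stop_code, params
```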


15. Linux* Kernel Panic Records

The OpenIPMI driver supports the ability to put semi-custom and custom events in the system event log if a panic occurs. If you enable the “Generate a panic event to all BMCs on a panic” option, you will get one event on a panic in a standard IPMI event format. If you enable the “Generate OEM events containing the panic string” option, you will also get a set of OEM events holding the panic string.

Table 98: Linux* Kernel Panic Event Record Characteristics

Byte | Field | Description
8, 9 | Generator ID | 0021h – Kernel
10 | EvM Rev | 03h = IPMI 1.0 format
11 | Sensor Type | 20h = OS Stop/Shutdown
12 | Sensor Number | The first byte of the panic string (0 if no panic string)
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (that is, “core dump”, “blue screen”)
15 | Event Data 2 | The second byte of panic string
16 | Event Data 3 | The third byte of panic string
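
Table 98 spreads the first three characters of the panic string across the Sensor Number, Event Data 2, and Event Data 3 bytes. The Python sketch below gathers them from a full 16-byte record. The function name panic_string_prefix is hypothetical, and treating a zero byte as the end of the string and decoding as ASCII are assumptions.

```python
def panic_string_prefix(record):
    """Return up to the first three panic-string characters of a panic event."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    # Record bytes 12, 15 and 16 (1-based) are indexes 11, 14 and 15.
    raw = bytes([record[11], record[14], record[15]])
    # Assumption: a zero byte marks the end of (or absence of) the string.
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```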


Table 99: Linux* Kernel Panic String Extended Record Characteristics

Byte | Field | Description
1, 2 | Record ID | ID used for SEL Record access.
3 | Record Type | [7:0] – F0h = OEM non-timestamped, bytes 4-16 OEM defined
4 | Slave Address | The slave address of the card saving the panic.
5 | Sequence Number | A sequence number (starting at zero).
6 to 16 | Kernel Panic Data | These hold the panic string. If the panic string is longer than 11 bytes, multiple messages will be sent with increasing sequence numbers.
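
The extended records in Table 99 can be joined back into the full panic string by ordering them on the sequence number and concatenating the 11 data bytes of each. The Python sketch below is illustrative. The function name assemble_panic_string is hypothetical, and it assumes that all records come from a single slave address, that unused trailing bytes are zero, and that the string is ASCII.

```python
def assemble_panic_string(records):
    """Join the panic string from 16-byte OEM records with record type F0h."""
    chunks = {}
    for rec in records:
        if len(rec) != 16:
            raise ValueError("expected 16-byte SEL records")
        if rec[2] != 0xF0:          # byte 3: record type
            continue                # skip unrelated records
        sequence = rec[4]           # byte 5: sequence number, starting at zero
        chunks[sequence] = bytes(rec[5:16])   # bytes 6-16: 11 data bytes
    raw = b"".join(chunks[seq] for seq in sorted(chunks))
    # Assumption: unused trailing bytes are zero padding.
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")
```

Sorting on the sequence number means the records can be passed in any order, which is convenient when entries are read from a log that also contains unrelated events.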
