System Event Log Troubleshooting Guide for Intel® Server Boards
Intel order number G74211-002
Revision 1.1
December 2013
Platform Collaboration and Systems Division – Marketing
Revision History
Date            Revision Number   Modifications
August 2012     1.0               Initial draft.
December 2013   1.1               Corrected IPMI Watchdog and PEF Sensors Typical Characteristics tables.
                                  Clarified Channel designators for DIMM memory errors.
                                  Added ME sensor 17h.
Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS
GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR
SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH,
HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS'
FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL
INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR
NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF
THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature .
Table of Contents
Intelligent Power Node Manager Version 1.5
BIOS POST owned Sensors (GID = 0001h)
BIOS SMI owned Sensors (GID = 0033h)
Hot Swap Controller Firmware owned Sensors (GID = 00C0h/00C2h)
Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)
Fan Presence and Redundancy Sensors
Memory Correctable and Uncorrectable ECC Error
System Firmware Progress (Formerly Post Error)
System Firmware Progress (Formerly Post Error) – Next Steps
Node Manager Exception Event – Next Steps
Node Manager Operational Capabilities Change
Node Manager Operational Capabilities Change – Next Steps
Node Manager Alert Threshold Exceeded – Next Steps
List of Tables
Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Table 60: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps
Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Table 87: Node Manager Operational Capabilities Change Sensor Typical Characteristics
Table 88: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
Table 96: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics
Table 97: Bug Check / Blue Screen Code OEM Event Record Typical Characteristics
1. Introduction
The server management hardware that is part of Intel® Server Boards and Intel® Server Platforms serves as a vital part of the overall server management strategy. The server management hardware provides essential information to the system administrator and provides the administrator the ability to remotely control the server, even when the operating system is not running.
The Intel® Server Boards and Intel® Server Platforms offer comprehensive hardware-based and software-based solutions. The server management features make the servers simple to manage and provide alerting on system events. From entry to enterprise systems, good server management is essential to reducing total cost of ownership.
This Troubleshooting Guide is intended to help users better understand the events that are logged in the Baseboard Management Controller (BMC) System Event Log (SEL) on these Intel® Server Boards.
There is a separate User’s Guide that covers the general server management and the server management software offered on Intel® Server Boards and Intel® Server Platforms.
Server boards currently supported by this document:
Intel® S3200/X38ML Server Boards
Intel® S5500/S3420 Series Server Boards
1.1 Purpose
The purpose of this document is to list all possible events generated by the Intel® platform. Other sources that are not under Intel's control may also generate events; those events are not described in this document.
1.2 Industry Standard
1.2.1 Intelligent Platform Management Interface (IPMI)
The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the inventory, monitoring, logging, and recovery control functions are available independently of the main processors, BIOS, and operating system. Platform management functions can also be made available when the system is in a power-down state.
IPMI works by interfacing with the BMC, which extends management capabilities in the server system and operates independently of the main processor by monitoring the on-board instrumentation. Through the BMC, IPMI also allows administrators to control power to the server, and remotely access BIOS configuration and operating system console information.
IPMI defines a common platform instrumentation interface to enable interoperability between:
The baseboard management controller and chassis
The baseboard management controller and systems management software
Between servers
IPMI enables the following:
Common access to platform management information, consisting of:
- Local access from systems management software
- Remote access from LAN
- Inter-chassis access from Intelligent Chassis Management Bus
- Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the processor is down
IPMI interface isolates systems management software from hardware.
Hardware advancements can be made without impacting the systems management software.
IPMI facilitates cross-platform management software.
You can find more information on IPMI at the following URL: http://www.intel.com/design/servers/ipmi
1.2.2 Baseboard Management Controller (BMC)
A baseboard management controller (BMC) is a specialized microcontroller embedded on most Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the intelligence behind intelligent platform management, that is, the autonomous monitoring and recovery features implemented directly in platform management hardware and firmware.
Different types of sensors built into the computer system report to the BMC on parameters such as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC monitors the system for critical events by communicating with various sensors on the system board; it sends alerts and logs events when certain parameters exceed their preset thresholds, indicating a potential failure of the system. The administrator can also remotely communicate with the BMC to take some corrective actions such as resetting or power cycling the system to get a hung OS running again. These abilities save on the total cost of ownership of a system.
For Intel® Server Boards and Intel® Server Platforms, the BMC supports the industry-standard IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.
1.2.2.1 System Event Log (SEL)
The BMC provides a centralized, non-volatile repository for critical, warning, and informational system events called the System Event Log or SEL. Because the BMC manages the SEL and the logging functions, “post-mortem” logging information remains available if a failure occurs that disables the system's processor(s).
The BMC allows access to the SEL through in-band and out-of-band mechanisms. Various tools and utilities can be used to access the SEL, including the Intel® SELViewer and multiple open-source IPMI tools.
1.2.3 Intel® Intelligent Power Node Manager Version 1.5
Intel® Intelligent Power Node Manager version 1.5 (NM) is a platform-resident technology that enforces power and thermal policies for the platform. These policies are applied by exploiting subsystem knobs (such as processor P and T states) that can be used to control power consumption. Intel® Intelligent Power Node Manager enables data center power and thermal management by exposing an external interface to management software through which platform policies can be specified. It also enables specific data center power management usage models such as power limiting.
The configuration and control commands are used by the external management software or BMC to configure and control the Intel® Intelligent Power Node Manager feature. Because Platform Services firmware does not have any external interface, external commands are first received by the BMC over LAN and then relayed to the Platform Services firmware over the IPMB channel. The BMC acts as a relay and the transport conversion device for these commands. For simplicity, the commands from the management console might be encapsulated in a generic CONFIG packet format (config data length, config data blob) to the BMC so that the BMC doesn’t even have to parse the actual configuration data.
The BMC provides the access point for remote commands from external management software and generates alerts to that software. Intel® Intelligent Power Node Manager on the Intel® Manageability Engine (Intel® ME) is an IPMI satellite controller. A mechanism needs to exist to forward commands to the Intel® ME and send the response back to the originator. Similarly, events from the Intel® ME have to be sent as alerts outside of the BMC. It is the responsibility of the BMC to implement these mechanisms for communication with Intel® Intelligent Power Node Manager.
The full specification can be downloaded from the following link: http://www.intel.com/content/dam/doc/technical-specification/intelligent-power-node-manager-1-5-specification.pdf
2. Basic Decoding of a SEL Record
The System Event Log (SEL) record format is defined in the IPMI Specification. The following section provides a basic definition for each of the fields in a SEL record. For more details, see the IPMI Specification.
The definitions for the standard SEL can be found in Table 1.
The definitions for the OEM defined event logs can be found in Table 3 and Table 4.
2.1 Default Values in the SEL Records
Unless otherwise noted in the event record descriptions the following are the default values in all SEL entries.
Byte [3] = Record Type (RT) = 02h = System event record
Byte [9:8] = Generator ID = 0020h = BMC Firmware
Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0
Table 1: SEL Record Format
Byte    Field                           Description
1-2     Record ID (RID)                 ID used for SEL Record access.
3       Record Type (RT)                [7:0] – Record Type
                                          02h = System event record
                                          C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
                                          E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)
4-7     Timestamp (TS)                  Time when event was logged. LS byte first.
                                        Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
                                        Note: There are various websites that will convert the raw number to a date/time.
8-9     Generator ID (GID)              RqSA and LUN if event was generated from IPMB.
                                        Software ID if event was generated from system software.
                                        Byte 1
                                          [7:1] – 7-bit I2C Slave Address, or 7-bit system software ID
                                          [0] 0b = ID is IPMB Slave Address
                                              1b = System software ID
                                          Software ID values:
                                            0001h – BIOS POST for POST errors, RAS Configuration/State, Timestamp Synch, OS Boot events
                                            0033h – BIOS SMI Handler
                                            0020h – BMC Firmware
                                            002Ch – ME Firmware
                                            0041h – Server Management Software
                                            00C0h – HSC Firmware – HSBP A
                                            00C2h – HSC Firmware – HSBP B
                                        Byte 2
                                          [7:4] – Channel number. Channel that event message was received over. 0h if the event message was received from the system interface, primary IPMB, or internally generated by the BMC.
                                          [3:2] – Reserved. Write as 00b.
                                          [1:0] – IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.
10      EvM Rev (ER)                    Event Message format version. 04h = IPMI v2.0; 03h = IPMI v1.0
11      Sensor Type (ST)                Sensor Type Code for sensor that generated the event
12      Sensor # (SN)                   Number of sensor that generated the event (From SDR)
13      Event Dir | Event Type (EDIR)   Event Dir
                                          [7] – 0b = Assertion event.
                                                1b = Deassertion event.
                                        Event Type
                                          Type of trigger for the event, for example, critical threshold going high, state asserted, and so on. Also indicates class of the event. For example, discrete, threshold, or OEM.
                                          The Event Type field is encoded using the Event/Reading Type Code.
                                          [6:0] – Event Type Codes
                                            01h = Threshold (States = 0x00-0x0b)
                                            02h-0ch = Discrete
                                            6Fh = Sensor-Specific
                                            70-7Fh = OEM
14      Event Data 1 (ED1)              Per Table 2: Event Request Message Event Data Field Contents
15      Event Data 2 (ED2)              Per Table 2: Event Request Message Event Data Field Contents
16      Event Data 3 (ED3)              Per Table 2: Event Request Message Event Data Field Contents
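The field layout in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not a tool shipped with the platform: the function name `decode_sel_record`, the dictionary keys, and the sample record are assumptions made for this example, and only system event records (Record Type 02h) are handled.

```python
import struct
from datetime import datetime, timezone


def decode_sel_record(raw: bytes) -> dict:
    """Split a 16-byte system event record (Record Type 02h) into the Table 1 fields."""
    if len(raw) != 16:
        raise ValueError("a SEL record is exactly 16 bytes")
    # "<" selects little-endian with no padding: multi-byte fields are LS byte first.
    (rid, rtype, ts, gid, evm_rev, sensor_type, sensor_num,
     edir, ed1, ed2, ed3) = struct.unpack("<HBIH7B", raw)
    if rtype != 0x02:
        raise ValueError("only system event records (02h) are handled in this sketch")
    return {
        "record_id": rid,                                         # bytes 1-2
        "record_type": rtype,                                     # byte 3
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc), # bytes 4-7
        "generator_id": gid,                                      # bytes 8-9
        "evm_rev": evm_rev,                                       # byte 10
        "sensor_type": sensor_type,                               # byte 11
        "sensor_number": sensor_num,                              # byte 12
        "deassertion": bool(edir & 0x80),                         # byte 13, bit [7]
        "event_type": edir & 0x7F,                                # byte 13, bits [6:0]
        "event_data": (ed1, ed2, ed3),                            # bytes 14-16
    }


# Hypothetical record: BMC (GID 0020h), voltage sensor 10h, threshold assertion.
sample = bytes.fromhex("01 00 02 29 76 68 4c 20 00 04 02 10 01 50 5a 5c")
fields = decode_sel_record(sample)
print(fields["timestamp"].isoformat())  # 2010-08-15T23:20:09+00:00, as in the Table 1 example
```

The sample reuses the timestamp bytes [29][76][68][4C] from Table 1, so the decoded time can be checked against the value given there.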
Table 2: Event Request Message Event Data Field Contents
Sensor Class    Event Data
Threshold       Event Data 1
                  [7:6] – 00b = Unspecified Event Data 2
                          01b = Trigger reading in Event Data 2
                          10b = OEM code in Event Data 2
                          11b = Sensor-specific event extension code in Event Data 2
                  [5:4] – 00b = Unspecified Event Data 3
                          01b = Trigger threshold value in Event Data 3
                          10b = OEM code in Event Data 3
                          11b = Sensor-specific event extension code in Event Data 3
                  [3:0] – Offset from Event/Reading Code for threshold event.
                Event Data 2 – Reading that triggered event, FFh or not present if unspecified.
                Event Data 3 – Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present.
Discrete        Event Data 1
                  [7:6] – 00b = Unspecified Event Data 2
                          01b = Previous state and/or severity in Event Data 2
                          10b = OEM code in Event Data 2
                          11b = Sensor-specific event extension code in Event Data 2
                  [5:4] – 00b = Unspecified Event Data 3
                          01b = Reserved
                          10b = OEM code in Event Data 3
                          11b = Sensor-specific event extension code in Event Data 3
                  [3:0] – Offset from Event/Reading Code for discrete event state
                Event Data 2
                  [7:4] – Optional offset from “Severity” Event/Reading Code (0Fh if unspecified).
                  [3:0] – Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).
                Event Data 3 – Optional OEM code. FFh or not present if unspecified.
OEM             Event Data 1
                  [7:6] – 00b = Unspecified in Event Data 2
                          01b = Previous state and/or severity in Event Data 2
                          10b = OEM code in Event Data 2
                          11b = Reserved
                  [5:4] – 00b = Unspecified Event Data 3
                          01b = Reserved
                          10b = OEM code in Event Data 3
                          11b = Reserved
                  [3:0] – Offset from Event/Reading Type Code
                Event Data 2
                  [7:4] – Optional OEM code bits or offset from “Severity” Event/Reading Type Code (0Fh if unspecified).
                  [3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).
                Event Data 3 – Optional OEM code. FFh or not present if unspecified.
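The bit fields of Event Data 1 in Table 2 lend themselves to a small lookup. The sketch below is illustrative only: the names `ED2_MEANING`, `ED3_MEANING`, and `decode_event_data1`, and the lowercase sensor class keys, are assumptions introduced for this example and are not part of any Intel utility.

```python
# Meaning of ED1 bits [7:6] (what Event Data 2 holds), indexed by the 2-bit value.
ED2_MEANING = {
    "threshold": ["unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"],
    "discrete": ["unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"],
    "oem": ["unspecified", "previous state and/or severity", "OEM code", "reserved"],
}

# Meaning of ED1 bits [5:4] (what Event Data 3 holds), indexed by the 2-bit value.
ED3_MEANING = {
    "threshold": ["unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"],
    "discrete": ["unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"],
    "oem": ["unspecified", "reserved", "OEM code", "reserved"],
}


def decode_event_data1(ed1: int, sensor_class: str) -> dict:
    """Split Event Data 1 into the three bit fields listed in Table 2."""
    return {
        "ed2_contains": ED2_MEANING[sensor_class][(ed1 >> 6) & 0x03],  # bits [7:6]
        "ed3_contains": ED3_MEANING[sensor_class][(ed1 >> 4) & 0x03],  # bits [5:4]
        "offset": ed1 & 0x0F,                                          # bits [3:0]
    }


# 59h = 01 01 1001b: reading in ED2, threshold in ED3, offset 09h.
decoded = decode_event_data1(0x59, "threshold")
print(decoded)
```

Keeping the three sensor classes in separate lists mirrors the three blocks of Table 2, which makes it easy to compare the code against the table row by row.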
Table 3: OEM SEL Record (Type C0h-DFh)
Byte    Field               Description
1-2     Record ID (RID)     ID used for SEL Record access.
3       Record Type (RT)    [7:0] – Record Type
                              C0h-DFh = OEM timestamped, bytes 8-16 OEM defined
4-7     Timestamp (TS)      Time when event was logged. LS byte first.
                            Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
                            Note: There are various websites that will convert the raw number to a date/time.
8-10    Manufacturer ID     LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA “Private Enterprise” ID.
                            Most significant four bits = Reserved (0000b).
                            000000h = Unspecified. 0FFFFFh = Reserved.
                            This value is binary encoded.
                            For example the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.
11-16   OEM Defined         OEM Defined. This is defined according to the manufacturer identified by the Manufacturer ID field.
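The Manufacturer ID byte order in Table 3 can be checked with a short Python sketch. The function name `decode_manufacturer_id` is an assumption made for this example; the input bytes are the IPMI forum example given in the table.

```python
def decode_manufacturer_id(id_bytes: bytes) -> int:
    """Decode bytes 8-10 of an OEM timestamped SEL record (LS byte first)."""
    if len(id_bytes) != 3:
        raise ValueError("the Manufacturer ID field is three bytes long")
    value = int.from_bytes(id_bytes, "little")
    # Only the low 20 bits carry the ID; the most significant four bits are reserved.
    return value & 0x0FFFFF


ipmi_forum = decode_manufacturer_id(bytes([0xF2, 0x1B, 0x00]))
print(ipmi_forum, hex(ipmi_forum))
```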
Table 4: OEM SEL Record (Type E0h-FFh)
Byte    Field               Description
1-2     Record ID (RID)     ID used for SEL Record access.
3       Record Type (RT)    [7:0] – Record Type
                              E0h-FFh = OEM system event record
4-16    OEM                 OEM Defined. This is defined by the system integrator.
3. Sensor Cross Reference List
This section contains a cross reference to help find details on any specific SEL entry.
3.1 BMC owned Sensors (GID = 0020h)
The following table can be used to find the details of sensors owned by the BMC.
Table 5: BMC owned Sensors
Sensor
Number
01h
02h
03h
04h
05h
06h
07h
08h
09h
Sensor Name
Power Unit Status
(Pwr Unit Status)
Power Unit Redundancy
(Pwr Unit Redund)
IPMI Watchdog
(IPMI Watchdog)
Physical Security
(Physical Scrty)
FP Interrupt
(FP NMI Diag Int)
SMI Timeout
(SMI Timeout)
System Event Log
(System Event Log)
System Event
(System Event)
Button Press Event
(Button Press)
Details Section
Next Steps
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
Table 77: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
Table 73: Physical Security Sensor Event Trigger Offset – Next Steps
FP (NMI) Interrupt – Next Steps
Not applicable
System Event – PEF Action – Next Steps
Not applicable
Sensor Number | Sensor Name | Details Section | Next Steps
10h | BB +1.1V IOH (BB +1.1V IOH) | | Table 14: Voltage Sensors – Next Steps
11h | BB +1.1V P1 Vccp (BB +1.1V P1 Vccp) | | Table 14: Voltage Sensors – Next Steps
12h | BB +1.1V P2 Vccp (BB +1.1V P2 Vccp) | | Table 14: Voltage Sensors – Next Steps
13h | BB +1.5V P1 DDR3 (BB +1.5V P1 DDR3) | | Table 14: Voltage Sensors – Next Steps
14h | BB +1.5V P2 DDR3 (BB +1.5V P2 DDR3) | | Table 14: Voltage Sensors – Next Steps
15h | BB +1.8V AUX (BB +1.8V AUX) | | Table 14: Voltage Sensors – Next Steps
16h | BB +3.3V (BB +3.3V) | | Table 14: Voltage Sensors – Next Steps
17h | BB +3.3V STBY (BB +3.3V STBY) | | Table 14: Voltage Sensors – Next Steps
18h | BB +3.3V Vbat (BB +3.3V Vbat) | | Table 14: Voltage Sensors – Next Steps
19h | BB +5.0V (BB +5.0V) | | Table 14: Voltage Sensors – Next Steps
1Ah | BB +5.0V STBY (BB +5.0V STBY) | | Table 14: Voltage Sensors – Next Steps
1Bh | BB +12.0V (BB +12.0V) | | Table 14: Voltage Sensors – Next Steps
1Ch | BB -12.0V (BB -12.0V) | | Table 14: Voltage Sensors – Next Steps
1Dh | BB +1.35V P1 LV DDR3 (BB +1.35v P1 MEM) | | Table 14: Voltage Sensors – Next Steps
1Eh | BB +1.35V P2 LV DDR3 (BB +1.35v P2 MEM) | | Table 14: Voltage Sensors – Next Steps
20h | Baseboard Temperature (Baseboard Temp) | | Table 35: Temperature Sensors – Next Steps
21h | Front Panel Temperature (Front Panel Temp) | | Table 35: Temperature Sensors – Next Steps
22h | IOH Thermal Margin (IOH Therm Margin) | | Table 38: Thermal Margin Sensors – Next Steps
23h | Processor 1 Memory Thermal Margin (Mem P1 Thrm Mrgn) | | Table 38: Thermal Margin Sensors – Next Steps
24h | Processor 2 Memory Thermal Margin (Mem P2 Thrm Mrgn) | | Table 38: Thermal Margin Sensors – Next Steps
30h-39h | Fan Tachometer Sensors (Chassis specific sensor names) | | Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps
40h-45h | Fan Present Sensors (Fan x Present) | | Table 30: Fan Presence Sensors – Event Trigger Offset – Next Steps
46h | Fan Redundancy (Fan Redundancy) | | Table 32: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
50h | Power Supply 1 Status (PS1 Status) | | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
51h | Power Supply 2 Status (PS2 Status) | | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
52h | Power Supply 1 AC Power Input (PS1 Power In) | | Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
53h | Power Supply 2 AC Power Input (PS2 Power In) | | Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
54h | Power Supply 1 +12V % of Maximum Current Output (PS1 Curr Out %) | | Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
55h | Power Supply 2 +12V % of Maximum Current Output (PS2 Curr Out %) | | Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
56h | Power Supply 1 Temperature (PS1 Temperature) | Power Supply Temperature Sensors | Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
57h | Power Supply 2 Temperature (PS2 Temperature) | Power Supply Temperature Sensors | Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
60h | Processor 1 Status (P1 Status) | | Table 45: Processor Status Sensors – Next Steps
61h | Processor 2 Status (P2 Status) | | Table 45: Processor Status Sensors – Next Steps
62h | Processor 1 Thermal Margin (P1 Therm Margin) | | Table 38: Thermal Margin Sensors – Next Steps
63h | Processor 2 Thermal Margin (P2 Therm Margin) | | Table 38: Thermal Margin Sensors – Next Steps
64h | Processor 1 Thermal Control % (P1 Therm Ctrl %) | | Table 41: Processor Thermal Control % Sensors – Next Steps
65h | Processor 2 Thermal Control % (P2 Therm Ctrl %) | | Table 41: Processor Thermal Control % Sensors – Next Steps
66h | Processor 1 VRD Temp (P1 VRD Hot) | | Table 43: Discrete Thermal Sensors
67h | Processor 2 VRD Temp (P2 VRD Hot) | | Table 43: Discrete Thermal Sensors
68h | Catastrophic Error (CATERR) | | Catastrophic Error Sensor – Next Steps
69h | CPU Missing (CPU Missing) | | CPU Missing Sensor – Next Steps
6Ah | IOH Thermal Trip (IOH Thermal Trip) | | Table 43: Discrete Thermal Sensors
3.2 BIOS POST owned Sensors (GID = 0001h)
The following table can be used to find the details of sensors owned by BIOS POST.
Table 6: BIOS POST owned Sensors
Sensor Number | Sensor Name | Details Section | Next Steps
01h | Mirroring Redundancy State | Mirrored Redundancy State Sensor | Table 55: Mirrored Redundancy State Sensor Event Trigger Offset – Next Steps
06h | POST Error | System Firmware Progress (Formerly Post Error) | System Firmware Progress (Formerly Post Error) – Next Steps
11h | Sparing Redundancy State | Sparing Redundancy State Sensor | Table 59: Sparing Redundancy State Sensor Event Trigger Offset – Next Steps
12h | Mirroring Configuration Status | Mirroring Configuration Status | Table 53: Mirroring Configuration Status Sensor Event Trigger Offset – Next Steps
13h | Sparing Configuration Status | | Table 57: Sparing Configuration Status Sensor Event Trigger Offset – Next Steps
83h | System Event | | Not applicable
3.3 BIOS SMI owned Sensors (GID = 0033h)
The following table can be used to find the details of sensors owned by BIOS SMI.
Table 7: BIOS SMI owned Sensors
Sensor Number | Sensor Name | Details Section | Next Steps
02h | Memory ECC Error | | Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
03h | Legacy PCI Error | | Table 68: Legacy PCI Error Sensor Event Trigger Offset – Next Steps
04h | PCI Express Fatal Error | | Table 66: PCI Express* Fatal Error Sensor Event Trigger Offset – Next Steps
05h | PCI Express Correctable Error | PCI Express Correctable errors | Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps
06h | Intel® QuickPath Interface Correctable Error | | QPI Correctable Error Sensor – Next Steps
07h | Intel® QuickPath Interface Nonfatal Error | | QPI Non-Fatal Error Sensor – Next Steps
14h | Memory Address Parity Error | | Memory Address Parity Error Sensor Next Steps
17h | Intel® QuickPath Interface Fatal Error | | QPI Fatal and Fatal #2 – Next Steps
18h | Intel® QuickPath Interface Fatal2 Error | | QPI Fatal and Fatal #2 – Next Steps
83h | System Event | | Not applicable
3.4 Hot Swap Controller Firmware owned Sensors (GID = 00C0h/00C2h)
The following table can be used to find the details of sensors owned by the Hot Swap Controller (HSC) firmware. The HSC firmware resides on a Hot Swap Back Plane (HSBP). There can be up to two HSBPs in a system. Each HSBP will have its own GID.
00C0h = HSC Firmware – HSBP A
00C2h = HSC Firmware – HSBP B
Table 8: Hot Swap Controller Firmware owned Sensors
Sensor
Number
01h
02h
03h
04h
Sensor Name
Backplane Temperature
Drive Slot 0 Status
Drive Slot 1 Status
Drive Slot 2 Status
05h
06h
Drive Slot 3 Status
Drive Slot 4 Status
07h
6 Slot HSBP
Drive Slot 5 Status
08h
09h
Drive Slot 0 Presence
Drive Slot 1 Presence
0Ah
0Bh
0Ch
Drive Slot 2 Presence
Drive Slot 3 Presence
Drive Slot 4 Presence
0Dh
8 Slot HSBP
Drive Slot 5 Presence
08h Drive Slot 6 Status
Details Section Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Slot Status Sensor – Next Steps
Sensor
Number
09h
0Ah
0Bh
0Ch
0Dh
0Eh
0Fh
10h
11h
Sensor Name
Drive Slot 7 Status
Drive Slot 0 Presence
Drive Slot 1 Presence
Drive Slot 2 Presence
Drive Slot 3 Presence
Drive Slot 4 Presence
Drive Slot 5 Presence
Drive Slot 6 Presence
Drive Slot 7 Presence
Details Section
Next Steps
HSC Drive Slot Status Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
HSC Drive Presence Sensor – Next Steps
3.5 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)
The following table can be used to find the details of sensors owned by the Node Manager / Management Engine (ME) firmware.
Table 9: Management Engine Firmware owned Sensors
Sensor Number | Sensor Name | Details Section | Next Steps
17h | ME Firmware Health Events | | ME Firmware Health Event – Next Steps
18h | Node Manager Exception Events | | Node Manager Exception Event – Next Steps
19h | Node Manager Health Events | | Node Manager Health Event – Next Steps
1Ah | Node Manager Operational Capabilities Change Events | Node Manager Operational Capabilities Change | Node Manager Operational Capabilities Change – Next Steps
1Bh | Node Manager Alert Threshold Exceeded Events | Node Manager Alert Threshold Exceeded | Node Manager Alert Threshold Exceeded – Next Steps
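The Generator ID values quoted in the section headings can be split into their parts using the byte layout from Table 1. The sketch below is one reading of that layout, not a statement of how Intel tooling decodes the field: the function name `decode_generator_id` and the returned keys are assumptions for illustration, and the value is assumed to be written with byte 2 as the high byte (so 602Ch is byte 1 = 2Ch, byte 2 = 60h).

```python
def decode_generator_id(gid: int) -> dict:
    """Split a 16-bit Generator ID into the sub-fields listed in Table 1."""
    byte1 = gid & 0xFF         # SEL byte 8: slave address or software ID
    byte2 = (gid >> 8) & 0xFF  # SEL byte 9: channel number and LUN
    is_software = bool(byte1 & 0x01)  # byte 1, bit [0]
    return {
        "source": "system software ID" if is_software else "IPMB slave address",
        "id_byte": byte1,
        "channel": byte2 >> 4,  # byte 2, bits [7:4]
        "lun": 0 if is_software else byte2 & 0x03,  # byte 2, bits [1:0]
    }


for gid in (0x0020, 0x0033, 0x002C, 0x602C):
    print(f"{gid:04X}h ->", decode_generator_id(gid))
```

Under this reading, 002Ch and 602Ch name the same slave address (2Ch) and differ only in the channel number field, 0h versus 6h.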
3.6 Microsoft* OS owned Events (GID = 0041h)
The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).
Table 10: Microsoft* OS owned Events
Sensor Name
Boot Event
Shutdown Event
Record
Type
02h
DCh
02h
DDh
Bug Check / Blue Screen 02h
DEh
Sensor Type Details Section Next Steps
1Fh = OS Boot
Not applicable
Table 91: Boot-up Event Record Typical Characteristics
Table 92: Boot-up OEM Event Record Typical Characteristics
20h = OS Stop/Shutdown
Table 93: Shutdown Reason Code Event Record Typical Characteristics
Not applicable
Table 94: Shutdown Reason OEM Event Record Typical Characteristics
Table 95: Shutdown Comment OEM Event Record Typical Characteristics
Not applicable
Not applicable
Not applicable
20h = OS Stop/Shutdown
Table 96: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics Not applicable
Not applicable
Table 97: Bug Check / Blue Screen Code OEM Event Record Typical
3.7 Linux* Kernel Panic Events (GID = 0021h)
The following table can be used to find the details of records that can be generated when there is a Linux* Kernel panic.
Table 11: Linux* Kernel Panic Events
Sensor Name | Record Type | Sensor Type | Details Section | Next Steps
Linux* Kernel Panic | 02h | 20h = OS Stop/Shutdown | Table 98: Linux* Kernel Panic Event Record Characteristics | Not applicable
Linux* Kernel Panic | F0h | Not applicable | Table 99: Linux* Kernel Panic String Extended Record Characteristics | Not applicable
4. Power Subsystems
The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.
4.1 Voltage Sensors
The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI-compliant analog/threshold sensors.
Note: A voltage error could be caused by the device supplying the voltage or by the device using the voltage. For each sensor, the Next Steps table notes which component supplies the voltage and which components use it.
Table 12: Voltage Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 02h = Voltage
12 | Sensor Number | See Table 14
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Triggers as described in Table 13
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 13: Voltage Sensors Event Triggers – Description
Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The voltage has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The voltage has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The voltage has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The voltage has gone over its upper critical threshold.
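The bit fields of bytes 13 and 14 for a threshold sensor can be unpacked with a few masks and shifts. The following is a minimal sketch, assuming the two bytes are already available as integers; the function name `decode_threshold_event` and the returned dictionary layout are hypothetical.

```python
# Minimal sketch: decode byte 13 (event direction and type) and byte 14
# (Event Data 1) of a threshold-sensor SEL record. Names are hypothetical.

THRESHOLD_TRIGGERS = {
    0x00: "Lower non-critical going low",
    0x02: "Lower critical going low",
    0x07: "Upper non-critical going high",
    0x09: "Upper critical going high",
}


def decode_threshold_event(byte13, byte14):
    """Split the two bytes into the fields described in the table above."""
    return {
        # Bit 7 of byte 13: 0b = assertion, 1b = deassertion.
        "direction": "Deassertion" if byte13 & 0x80 else "Assertion",
        # Bits 6:0 of byte 13: event type (01h = Threshold).
        "event_type": byte13 & 0x7F,
        # Bits 7:6 of byte 14: 01b means Event Data 2 holds the reading.
        "reading_in_event_data_2": (byte14 >> 6) & 0x03 == 0x01,
        # Bits 5:4 of byte 14: 01b means Event Data 3 holds the threshold.
        "threshold_in_event_data_3": (byte14 >> 4) & 0x03 == 0x01,
        # Bits 3:0 of byte 14: event trigger offset.
        "trigger": THRESHOLD_TRIGGERS.get(byte14 & 0x0F, "Unknown trigger"),
    }


# Example: assertion of a threshold event whose Event Data 1 is 52h.
print(decode_threshold_event(0x01, 0x52))
```

The same masks apply to the other threshold sensors in this guide (fan speed, temperature, power supply input), because they share event type 01h.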
Table 14: Voltage Sensors – Next Steps
Sensor Number | Sensor Name | Next Steps
10h | BB +1.1V IOH | This 1.1V line is supplied by the main board. This 1.1V line is used by the I/O hub (IOH). 1. Ensure all cables are connected correctly. 2. If the issue remains, replace the motherboard.
11h | BB +1.1V P1 Vccp | This 1.1V line is supplied by the main board. This 1.1V line is used by processor 1. 1. Ensure all cables are connected correctly. 2. Cross test the processor if possible. If the issue remains with the socket, replace the main board, otherwise the processor.
12h | BB +1.1V P2 Vccp | This 1.1V line is supplied by the main board. This 1.1V line is used by processor 2. 1. Ensure all cables are connected correctly. 2. Cross test the processor if possible. If the issue remains with the socket, replace the main board, otherwise the processor.
13h | BB +1.5V P1 DDR3 | This 1.5V line is supplied by the main board. This 1.5V line is used by the memory on processor 1. 1. Ensure all cables are connected correctly. 2. Check the DIMMs are seated properly. 3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise replace the DIMM.
14h | BB +1.5V P2 DDR3 | This 1.5V line is supplied by the main board. This 1.5V line is used by the memory on processor 2. 1. Ensure all cables are connected correctly. 2. Check the DIMMs are seated properly. 3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise replace the DIMM.
15h | BB +1.8V AUX | +1.8V is supplied by the main board. +1.8V is used by the onboard NIC and I/O hub. 1. Ensure all cables are connected correctly. 2. If the issue remains, replace the main board.
16h | BB +3.3V | +3.3V is supplied by the power supplies. +3.3V is used by the PCIe and PCI-X slots. 1. Ensure all cables are connected correctly. 2. Reseat any PCI cards, and try them in other slots. 3. If the issue follows the card, swap it, otherwise, replace the main board. 4. If the issue remains, replace the power supplies.
17h | BB +3.3V STBY | +3.3V Stby is supplied by the main board. +3.3V Stby is used by the BMC, Onboard NIC, IOH, and ICH. 1. Ensure all cables are connected correctly. 2. If the issue remains, replace the board. 3. If the issue remains, replace the power supplies.
18h | BB +3.3V Vbat | +3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on. +3.3V Vbat is used by the CMOS and related circuits. 1. Replace the CMOS battery. Any battery of type CR2032 can be used. 2. If error remains (unlikely), replace the board.
19h | BB +5.0V | +5.0V is supplied by the power supplies. +5.0V is used by the PCI slots. 1. Ensure all cables are connected correctly. 2. Reseat any PCI cards, and try them in other slots. 3. If the issue follows the card, swap it, otherwise, replace the main board. 4. If the issue remains, replace the power supplies.
1Ah | BB +5.0V STBY | +5.0V STBY is supplied by the power supplies. +5.0V STBY is used to generate other standby voltages. 1. Ensure all cables are connected correctly. 2. If the issue remains, replace the board. 3. If the issue remains, replace the power supplies.
1Bh | BB +12.0V | +12V is supplied by the power supplies. +12V is used by SATA drives, Fans, and PCI cards. In addition it is used to generate various processor voltages. 1. Ensure all cables are connected correctly. 2. Check connections on fans and HDDs. 3. If the issue follows the component, swap it, otherwise, replace the board. 4. If the issue remains, replace the power supplies.
1Ch | BB -12.0V | -12V is supplied by the power supplies. -12V is used by the serial port and by PCI cards. In addition it is used to generate various processor voltages. 1. Ensure all cables are connected correctly. 2. Reseat any PCI cards, and try them in other slots. 3. If the issue follows the card, swap it, otherwise, replace the main board. 4. If the issue remains, replace the power supplies.
1Dh | BB +1.35 P1 Mem | This 1.35V line is supplied by the main board. This 1.35V line is used by low voltage memory on processor 1. 1. Ensure all cables are connected correctly. 2. Check the DIMMs are seated properly. 3. Cross test the DIMMs. 4. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.
1Eh | BB +1.35 P2 Mem | This 1.35V line is supplied by the main board. This 1.35V line is used by low voltage memory on processor 2. 1. Ensure all cables are connected correctly. 2. Check the DIMMs are seated properly. 3. Cross test the DIMMs. 4. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.
4.2 Power Unit
The power unit monitors the power state of the system and logs the state changes in the SEL.
4.2.1 Power Unit Status Sensor
The power unit status sensor monitors the power state of the system and logs state changes. Both expected power events, such as DC on/off, and unexpected events, such as AC loss and power good loss, are logged.
Table 15: Power Unit Status Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 09h = Power Unit
12 | Sensor Number | 01h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] = Sensor Specific offset as described in Table 16
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Hex | Sensor Specific Offset Description | Description | Next Steps
00h | Power down | System is powered down. | Informational Event
04h | AC Lost | AC removed. | Informational Event
05h | Soft Power Control Failure | Generally means power good was lost in the system, causing a shutdown. | This could be caused by the power supply subsystem or system components. 1. Verify all power cables and adapters are connected properly (AC cables as well as the cables between the PSU and system components). 2. Cross test the PSU if possible. 3. Replace the power subsystem.
06h | Power Unit Failure | Power subsystem experienced a failure. | Indicates a power supply failed. 1. Remove and reapply AC power. 2. If the power supply still fails, replace it.
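Sensor-specific (6Fh) events such as the Power Unit Status events are decoded differently from threshold events: the low nibble of Event Data 1 is an offset into a per-sensor list rather than a threshold trigger. The sketch below assumes bytes 11 to 14 are available as integers; the function name `describe_power_unit_event` and the shortened next-step strings are hypothetical summaries of the Power Unit Status offsets, not text from any tool.

```python
# Sketch under stated assumptions: look up a Power Unit Status event
# (sensor type 09h, sensor number 01h, event type 6Fh) by its offset.

POWER_UNIT_STATUS_OFFSETS = {
    0x00: ("Power down", "Informational event"),
    0x04: ("AC Lost", "Informational event"),
    0x05: ("Soft Power Control Failure",
           "Check power cabling, cross test the PSU, replace the power subsystem"),
    0x06: ("Power Unit Failure",
           "Remove and reapply AC power; replace the supply if it still fails"),
}


def describe_power_unit_event(sensor_type, sensor_number, byte13, byte14):
    """Return (direction, offset name, next step) or None if not this sensor."""
    if sensor_type != 0x09 or sensor_number != 0x01:
        return None
    if byte13 & 0x7F != 0x6F:  # bits 6:0 must be 6Fh (Sensor Specific)
        return None
    direction = "Deassertion" if byte13 & 0x80 else "Assertion"
    name, next_step = POWER_UNIT_STATUS_OFFSETS.get(
        byte14 & 0x0F, ("Unknown offset", "No guidance listed")
    )
    return direction, name, next_step


print(describe_power_unit_event(0x09, 0x01, 0x6F, 0x05))
```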
4.2.2 Power Unit Redundancy Sensor
This sensor is enabled on systems that support redundant power supplies. When AC is applied to the system, or when the system loses power supply redundancy, an event is logged in the SEL.
Table 17: Power Unit Redundancy Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 09h = Power Unit
12 | Sensor Number | 02h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Bh (Generic Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 18
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description
00h | Fully redundant
01h | Redundancy lost
02h | Redundancy degraded
03h | Non-redundant, sufficient from redundant
04h | Non-redundant, sufficient from insufficient
05h | Non-redundant, insufficient
06h | Non-redundant, degraded from fully redundant
07h | Redundant, degraded from non-redundant
Description | Next Steps
System is fully operational. | Informational Event
System is not running in redundant power supply mode. | This event should be accompanied by specific power supply errors (AC lost, PSU failure, and so on). Troubleshoot these events accordingly.
4.3 Power Supply
The BMC monitors the power supply subsystem.
4.3.1 Power Supply Status Sensors
These sensors report the status of the power supplies in the system. An event can be logged when AC is first applied to or removed from the system, and also when there is a failure, a predictive failure, or a configuration error.
Table 19: Power Supply Status Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 08h = Power Supply
12 | Sensor Number | 50h = Power Supply 1 Status; 51h = Power Supply 2 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] = Sensor Specific offset as described in Table 20
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
Table 20: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps
Hex | Sensor Specific Offset Description | Description | Next Steps
00h | Presence | Power supply detected. | Informational Event.
01h | Failure | Power supply failed. | Indicates a power supply failed. 1) Remove and reapply AC. 2) If the power supply still fails, replace it.
02h | Predictive Failure | Typically means a fan inside the power supply is not cooling the power supply. It may indicate the fan is failing. | Replace the power supply.
03h | AC lost | AC removed. | Informational Event.
06h | Configuration error | Power supply configuration is not supported. | Indicates that at least one of the supplies is not correct for your system configuration. 1) Remove the power supply and verify compatibility. 2) If the power supply is compatible it may be faulty. Replace it.
4.3.2 Power Supply AC Power Input Sensors
These sensors log an event when a power supply in the system exceeds its AC input power threshold.
Table 21: Power Supply AC Power Input Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 0Bh = Other Units
12 | Sensor Number | 52h = Power Supply 1 AC Power Input; 53h = Power Supply 2 AC Power Input
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 22
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 22: Power Supply AC Power Input Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description | Assertion Severity | Deassert Severity
07h | Upper non-critical going high | Degraded | OK
09h | Upper critical going high | non-fatal | Degraded
Description (both event triggers): PMBus* feature to monitor power supply power consumption.
Next Steps (both event triggers): If you see this event, the system is pulling too much power on the input for the PSU rating.
1. Verify the power budget is within the specified range.
2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.
4.3.3 Power Supply Current Output % Sensors
PMBus*-compliant power supplies may monitor the current output of the main 12V voltage rail and report the current usage as a percentage of the maximum power output for that rail.
Table 23: Power Supply Current Output % Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 03h = Current
12 | Sensor Number | 54h = Power Supply 1 Current Output %; 55h = Power Supply 2 Current Output %
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 24
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description | Assertion Severity | Deassert Severity
07h | Upper non-critical going high | Degraded | OK
09h | Upper critical going high | non-fatal | Degraded
Description (both event triggers): PMBus* feature to monitor power supply power consumption.
Next Steps (both event triggers): If you see this event, the system is using too much power on the output for the PSU rating.
1. Verify the power budget is within the specified range.
2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.
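As a worked illustration of what a "percentage of maximum output" reading means, the arithmetic can be sketched as follows. The rail capacity and load figures are invented for the example and do not describe any particular power supply; the function name `current_output_percent` is hypothetical. The power supply and BMC perform this calculation themselves, so the sketch is only an aid to interpreting the value.

```python
# Illustration only: express a measured 12V rail current as a percentage of
# the rail's maximum rated current. All numbers below are hypothetical.

def current_output_percent(measured_amps, max_rated_amps):
    """Return the measured current as a percentage of the rated maximum."""
    if max_rated_amps <= 0:
        raise ValueError("max_rated_amps must be positive")
    return 100.0 * measured_amps / max_rated_amps


# Hypothetical supply rated for 62.5 A on the 12V rail, currently
# delivering 50 A to the system.
percent = current_output_percent(50.0, 62.5)
print(f"12V rail load: {percent:.1f}% of rated output")
```

A reading that climbs toward or past 100% suggests the configured system draws more than the supply is rated for, which is the situation the power budget check in the next steps is meant to catch.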
4.3.4 Power Supply Temperature Sensors
The BMC monitors one power supply temperature sensor for each installed PMBus*-compliant power supply.
Table 25: Power Supply Temperature Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | 56h = Power Supply 1 Temperature; 57h = Power Supply 2 Temperature
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 26
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 26: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description | Assertion Severity | Deassert Severity
07h | Upper non-critical going high | Degraded | OK
09h | Upper critical going high | non-fatal | Degraded
Description (both event triggers): An upper non-critical or critical temperature threshold has been crossed.
Next Steps (both event triggers):
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
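The power supply sensor numbers listed in this section (50h through 57h) can be collected into one lookup so that a sensor number from a SEL record maps directly to a supply and a measurement. This is a sketch for illustration; the function name `identify_psu_sensor` is hypothetical.

```python
# Sketch: map the power supply sensor numbers from this section to the
# supply they belong to and what they measure. Function name is hypothetical.

PSU_SENSORS = {
    0x50: (1, "Status"),
    0x51: (2, "Status"),
    0x52: (1, "AC Power Input"),
    0x53: (2, "AC Power Input"),
    0x54: (1, "Current Output %"),
    0x55: (2, "Current Output %"),
    0x56: (1, "Temperature"),
    0x57: (2, "Temperature"),
}


def identify_psu_sensor(sensor_number):
    """Return a label such as 'Power Supply 2 Temperature', or None."""
    entry = PSU_SENSORS.get(sensor_number)
    if entry is None:
        return None
    supply, measurement = entry
    return f"Power Supply {supply} {measurement}"


print(identify_psu_sensor(0x53))
```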
5. Cooling Subsystem
5.1 Fan Sensors
There are three types of fan sensors that can be present on Intel® Server Systems: speed, presence, and redundancy. The last two are only present in systems with hot-swap redundant fans.
5.1.1 Fan Speed Sensors
Fan speed sensors monitor the rpm signal on the relevant fan headers on the platform. Fan speed sensors are threshold-based sensors.
Usually they only have lower (critical) thresholds set, so that a SEL entry is only generated if the fan spins too slowly.
Table 27: Fan Speed Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 04h = Fan
12 | Sensor Number | 30h-39h (Chassis specific)
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 28
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The fan speed has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The fan speed has dropped below its lower critical threshold.
Next Steps (both event triggers): A fan speed error on a new system build is typically not caused by the fan spinning too slowly; instead, it is caused by the fan being connected to the wrong header (the BMC expects fans on certain headers for each chassis and will log this event if there is no fan on that header).
1. Refer to the Quick Start Guide or the Service Guide to identify the correct fan headers to use.
2. Ensure the latest FRUSDR update has been run and the correct chassis was detected or selected.
3. If you are sure this was done, the event may be a sign of impending fan failure (although this will only normally apply if the system has been in use for a while). Replace the fan.
5.1.2 Fan Presence and Redundancy Sensors
Fan presence sensors are only implemented for hot-swap fans, and require an additional pin on the fan header. Fan redundancy is an aggregate of the fan presence sensors and will warn when redundancy is lost. Typically the redundancy mode on Intel® servers is an n+1 redundancy (if one fan fails there are still sufficient fans to cool the system, but it is no longer redundant) although other modes are also possible.
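The n+1 idea can be sketched as a comparison between the number of fans present and the number needed to cool the system. This is a simplified model for illustration, not the BMC's actual algorithm: the function name `fan_redundancy_state`, the fan counts, and the reduction to three states are all assumptions.

```python
# Simplified model of n+1 fan redundancy. The real BMC logic and the full
# set of redundancy states are more detailed; names and counts here are
# hypothetical.

def fan_redundancy_state(fans_present, fans_required):
    """Classify redundancy from a count of present fans.

    fans_required is n, the number of fans needed to cool the system;
    an n+1 configuration therefore ships with fans_required + 1 fans.
    """
    if fans_present > fans_required:
        return "Fully redundant"
    if fans_present == fans_required:
        return "Non-redundant, sufficient"
    return "Non-redundant, insufficient"


# Hypothetical chassis that needs 5 fans and ships with 6 (n+1).
for present in (6, 5, 4):
    print(present, "fans present:", fan_redundancy_state(present, 5))
```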
Table 29: Fan Presence Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 04h = Fan
12 | Sensor Number | 40h-45h (Chassis specific)
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 08h (Generic “digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 30
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 30: Fan Presence Sensors – Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description | Assertion Severity | Deassert Severity
01h | Device Present | OK | Degraded
Description | Next Steps
Assertion – A fan was inserted. This event may also get logged when the BMC initializes when AC is applied. | Informational only
Deassert – A fan was removed, or was not present at the expected location when the BMC initialized. | These events only get generated in systems with hot-swappable fans, and normally only when a fan is physically inserted or removed. If fans were not physically removed: 1. Use the Quick Start Guide to check whether the right fan headers were used. 2. Swap the fans around to see whether the problem stays with the location, or follows the fan. 3. Replace the fan or fan wiring/housing depending on the outcome of step 2. 4. Ensure the latest FRUSDR update has been run and the correct chassis was detected or selected.
Table 31: Fan Redundancy Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 04h = Fan
12 | Sensor Number | 46h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Bh (Generic Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 32
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 32: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
Event Trigger Offset
Hex
00h Fully redundant
Description
01h Redundancy lost
02h Redundancy degraded
03h Non-redundant, sufficient from redundant
System has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.
Description
04h Non-redundant, sufficient from insufficient
05h Non-redundant, insufficient
06h Non-redundant, degraded from fully redundant
System has lost fans and may no longer be able to cool itself adequately. Overheating may occur if this situation remains for a longer period of time.
System has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.
07h Redundant, degraded from non-redundant System has lost one or more fans and is running in a degraded mode, but still is redundant. There are enough fans to keep the system properly cooled.
Next Steps
Fan redundancy loss indicates failure of one or more fans.
Look for lower (non) critical fan errors, or fan removal errors in the SEL, to indicate which fan is causing the problem, and follow the troubleshooting steps for these event types.
5.2 Temperature Sensors
There are a variety of temperature sensors that can be implemented on Intel® Server Systems. They are split into three types: regular temperature sensors, thermal margin sensors, and discrete temperature sensors. Each of them has its own types of events that can be logged.
5.2.1 Regular Temperature Sensors
Regular temperature sensors are sensors that report an actual temperature. These are linear, threshold-based sensors. In most Intel® Server Systems, there are at least two sensors defined: front panel temperature and baseboard temperature. Both these sensors typically have upper and lower thresholds set – upper to warn in case of an over-temperature situation, lower to warn against sensor failure (temperature sensors typically read out 0 if they stop working).
Table 33: Temperature Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 35
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 34
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
Table 34: Temperature Sensors Event Triggers – Description
Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The temperature has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The temperature has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The temperature has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The temperature has gone over its upper critical threshold.
Table 35: Temperature Sensors – Next Steps
Sensor Name | Sensor Number | Next Steps
Baseboard Temp | 20h | 1. Check for clear and unobstructed airflow into and out of the chassis. 2. Ensure the SDR is programmed and correct chassis has been selected. 3. Ensure there are no fan failures. 4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
Front Panel Temp | 21h | If the front panel temperature reads zero, check: 1. It is connected properly. 2. The FRUSDR has been programmed correctly for your chassis. If the front panel temperature is too high: Check the cooling of your server room.
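The roles of the lower and upper thresholds can be sketched as a simple comparison. The threshold values used below are invented for the example (real thresholds come from the SDR for the specific chassis), the use of inclusive comparisons is an assumption, and the function name `classify_temperature` is hypothetical.

```python
# Sketch: classify a temperature reading against lower and upper thresholds.
# Threshold values and the inclusive comparisons are assumptions.

def classify_temperature(reading_c, lower_critical_c,
                         upper_non_critical_c, upper_critical_c):
    """Return the event trigger a reading would correspond to, or 'OK'."""
    if reading_c <= lower_critical_c:
        # A failed sensor typically reads 0, which lands below this limit.
        return "02h Lower critical going low (check for sensor failure)"
    if reading_c >= upper_critical_c:
        return "09h Upper critical going high"
    if reading_c >= upper_non_critical_c:
        return "07h Upper non-critical going high"
    return "OK"


# Hypothetical thresholds: lower critical 5, upper non-critical 40,
# upper critical 45 (degrees C).
for reading in (0, 25, 42, 50):
    print(reading, "->", classify_temperature(reading, 5, 40, 45))
```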
5.2.2 Thermal Margin Sensors
Margin sensors are also linear sensors, but they typically report a negative value. This value is not an actual temperature; it is an offset from a critical temperature. Example sensors are Processor Thermal Margin, Memory Thermal Margin, and IOH Thermal Margin. Read the reported value as the number of degrees below the critical temperature for the particular component.
Table 36: Thermal Margin Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 38
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Triggers as described in Table 37
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
Table 37: Thermal Margin Sensors Event Triggers – Description
Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
07h | Upper non-critical going high | Degraded | OK | The thermal margin has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The thermal margin has gone over its upper critical threshold.
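A thermal margin value can be read as follows. The sketch assumes the margin has already been converted to a signed number of degrees; the critical temperature used in the example is invented, since the real limit depends on the component, and both function names are hypothetical.

```python
# Sketch: interpret a thermal margin reading. The critical temperature in
# the example is hypothetical; real limits are component specific.

def degrees_below_critical(margin_c):
    """A margin of -20 means the component is 20 degrees below its limit."""
    return -margin_c


def estimated_temperature(margin_c, critical_c):
    """Estimate the absolute temperature from the margin and the limit."""
    return critical_c + margin_c


margin = -20  # value reported by a margin sensor, in degrees
print("Headroom:", degrees_below_critical(margin), "degrees")
print("Estimated temperature with a hypothetical 95 degree limit:",
      estimated_temperature(margin, 95))
```

Because the margin rises toward zero as the component heats up, a margin sensor raises "going high" events even though it reports negative numbers.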
Table 38: Thermal Margin Sensors – Next Steps
Sensor Number | Sensor Name | Next Steps
22h | IOH Therm Margin | 1. Check for clear and unobstructed airflow into and out of the chassis. 2. Ensure the SDR is programmed and correct chassis has been selected. 3. Ensure there are no fan failures. 4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
23h | Mem P1 Therm Margin | Same as 22h
24h | Mem P2 Therm Margin | Same as 22h
62h | P1 Therm Margin | Not a logged SEL event. Sensor is used for thermal management of the processor.
63h | P2 Therm Margin | Same as 62h
5.2.3 Processor Thermal Control % Sensors
Processor Thermal Control % sensors report the percentage of the time that the processor is throttling its performance due to thermal issues. If this is not addressed the processor could overheat and shut down the system to protect itself from damage.
Table 39: Processor Thermal Control % Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 41
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Triggers as described in Table 40
15 | Event Data 2 | Reading that triggered event.
16 | Event Data 3 | Threshold value that triggered event.
Table 40: Processor Thermal Control % Sensors Event Triggers – Description
Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
07h | Upper non-critical going high | Degraded | OK | The thermal margin has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The thermal margin has gone over its upper critical threshold.
Table 41: Processor Thermal Control % Sensors – Next Steps
Sensor Number | Sensor Name | Next Steps
64h | P1 Therm Ctl % | These events normally only happen due to failures of the thermal solution: 1. Verify the heatsink is properly attached and has thermal grease. 2. If the system has a heatsink fan, ensure the fan is spinning. 3. Check all system fans are operating properly. 4. Check that the air used to cool the system is within limits (typically 35°C).
65h | P2 Therm Ctl % | Same as 64h
5.2.4 Discrete Thermal Sensors
Discrete thermal sensors do not report a temperature at all – instead they report an overheating event of some kind. Examples are VRD Hot (voltage regulator is overheating) and Processor Thermal Trip (the processor got so hot that its over-temperature protection was triggered and the system was shut down to prevent damage).
Table 42: Discrete Thermal Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | See Table 43
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = See Table 43
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 43
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
Table 43: Discrete Thermal Sensors – Next Steps
Sensor Number | Sensor Name | Event Type | Event Trigger Offset Hex | Event Trigger Offset Description | Description
66h | P1 VRD Hot | 05h | 01h | Limit Exceeded | Processor1 voltage regulator overheated
67h | P2 VRD Hot | 05h | 01h | Limit Exceeded | Processor2 voltage regulator overheated
6Ah | IOH Thermal Trip | 03h | 01h | State Asserted | I/O Hub (IOH) overheated
Next Steps (all sensors):
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
6. Processor Subsystem
Intel® servers report several processor-centric sensors in the SEL.
6.1 Processor Status Sensor
The status sensor reports processor presence or a thermal trip condition. Each processor has a status sensor.
Table 44: Processor Status Sensors Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 07h = Processor
12 | Sensor Number | See Table 45
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 45
15 | Event Data 2 | Not used.
16 | Event Data 3 | Not used.
Table 45: Processor Status Sensors – Next Steps
Sensor  Sensor Name  Event Trigger Offset  Description
Number               Hex  Description
60h     P1 Status    01h  Thermal trip      The processor exceeded the maximum temperature.
                     07h  State Asserted    Indicates the processor is present.
61h     P2 Status    01h  Thermal trip      The processor exceeded the maximum temperature.
                     07h  State Asserted    Indicates the processor is present.
Next Steps (offset 01h, Thermal trip): This event normally only happens due to failures of the thermal solution:
1. Verify the heatsink is properly attached and has thermal grease.
2. If the system has a heatsink fan, ensure the fan is spinning.
3. Check all system fans are operating properly.
4. Check that the air used to cool the system is within limits (typically below 35°C).
6.2 Catastrophic Error Sensor
When the Catastrophic Error signal (CATERR#) stays asserted, it is a sign that something serious has gone wrong in the hardware. The BMC monitors this signal and reports when it stays asserted.
Table 46: Catastrophic Error Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   68h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 01h (State Asserted)
15    Event Data 2                    Not used.
16    Event Data 3                    Not used.
6.2.1 Catastrophic Error Sensor – Next Steps
This error is typically caused by other platform components.
1. Check for other errors near the time of the CATERR event.
2. Verify all peripherals are plugged in and operating correctly, particularly Hard Drives, Optical Drives, and I/O.
3. Update system firmware and drivers.
6.3 CPU Missing Sensor
The CPU Missing sensor is a discrete sensor that reports that the processor is not installed. The most common cause of this event is a processor installed in the wrong socket.
Table 47: CPU Missing Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   69h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 01h (State Asserted)
15    Event Data 2                    Not used.
16    Event Data 3                    Not used.
6.3.1 CPU Missing Sensor – Next Steps
Verify the processor is installed in the correct socket.
6.4 QuickPath Interconnect Error Sensors
The Intel® QuickPath Interconnect (QPI) bus on Intel® S5500/S3420 series server boards connects the processors to each other and to the chipset. All QPI Error sensors are reported to the BMC by the BIOS SMI Handler, so the Generator ID is 0033h.
6.4.1 QPI Correctable Error Sensor
The system detected an error and corrected it. This is an informational event.
Table 48: QPI Correctable Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   06h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 72h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = Reserved
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
6.4.1.1 QPI Correctable Error Sensor – Next Steps
This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor if possible.
6.4.2 QPI Non-Fatal Error Sensor
The system detected a QPI non-fatal error that is recoverable. This is an informational event.
Table 49: QPI Non-Fatal Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   07h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 73h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = Reserved
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
6.4.2.1 QPI Non-Fatal Error Sensor – Next Steps
This is an Informational event only. Non-Fatal errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor if possible.
6.4.3 QPI Fatal and Fatal #2
The system detected a QPI fatal or non-recoverable error. This is a fatal error.
Table 50: QPI Fatal Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   17h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 74h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = Reserved
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
The QPI Fatal #2 Error is a continuation of QPI Fatal Error.
Table 51: QPI Fatal #2 Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   18h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 74h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = Reserved
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
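The four QPI sensors differ only in sensor number and event type, and all of them carry the processor index in Event Data 2. A minimal Python sketch of that lookup follows; the names QPI_SENSORS and describe_qpi_event are hypothetical, chosen for this example only.

```python
# Sensor numbers taken from the QPI tables in this section.
QPI_SENSORS = {
    0x06: "QPI Correctable Error",
    0x07: "QPI Non-Fatal Error",
    0x17: "QPI Fatal Error",
    0x18: "QPI Fatal #2 Error",
}


def describe_qpi_event(sensor_number: int, event_data2: int) -> str:
    """Return a readable description of a QPI SEL event."""
    name = QPI_SENSORS.get(sensor_number)
    if name is None:
        raise ValueError(f"sensor {sensor_number:02X}h is not a QPI error sensor")
    if not 0 <= event_data2 <= 3:
        raise ValueError("Event Data 2 must be 0-3 (CPU1-4)")
    # Event Data 2 is 0-based; CPU names in this guide are 1-based.
    return f"{name} on CPU{event_data2 + 1}"


print(describe_qpi_event(0x17, 1))
```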
6.4.3.1 QPI Fatal and Fatal #2 – Next Steps
This is a fatal error. To troubleshoot:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor if possible.
7. Memory Subsystem
Intel® servers report memory errors, status, and configuration in the SEL.
7.1 Memory RAS Mirroring and Sparing
“Memory RAS Configuration Status” refers to the BIOS sending the current RAS mode and RAS operational state to the BMC to log into the SEL as a SEL record. This allows a remote software/application to query and retrieve the system memory state.
The memory configuration state sensors are “virtual” sensors. In other words, these sensors are owned and controlled completely by the BIOS, independently of the BMC.
The RAS configuration and state definitions are aligned with the definitions within the Intelligent Platform Management Interface Specification, Version 2.0. Accordingly, these sensors are read as “Status” and “Redundancy” sensors (Event/Reading Type 0x09 and 0x0B respectively).
Sensor Number 12h (Event Type 0x09) – Mirroring Configuration Status
Sensor Number 01h (Event Type 0x0B) – Mirroring Redundancy State
Sensor Number 13h (Event Type 0x09) – Sparing Configuration Status
Sensor Number 11h (Event Type 0x0B) – Sparing Redundancy State
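As a small illustration, the four sensor number and event type pairs listed above can be used as a lookup key when scanning a SEL dump. The Python sketch below assumes the caller already has byte 12 (sensor number) and byte 13 (event direction and event type) of a record; the names RAS_SENSORS and ras_sensor_name are hypothetical.

```python
# (sensor number, event type) pairs from the list above.
RAS_SENSORS = {
    (0x12, 0x09): "Mirroring Configuration Status",
    (0x01, 0x0B): "Mirroring Redundancy State",
    (0x13, 0x09): "Sparing Configuration Status",
    (0x11, 0x0B): "Sparing Redundancy State",
}


def ras_sensor_name(sensor_number: int, event_dir_type: int) -> str:
    """Map byte 12 and byte 13 of a SEL record to a memory RAS sensor name."""
    event_type = event_dir_type & 0x7F  # drop the direction bit [7]
    return RAS_SENSORS.get((sensor_number, event_type), "not a memory RAS sensor")


print(ras_sensor_name(0x01, 0x8B))  # deassertion on the mirroring redundancy sensor
```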
7.1.1 Mirroring Configuration Status
This sensor provides the Mirroring mode RAS configuration status.
Table 52: Mirroring Configuration Status Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   12h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 09h (digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 53
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Table 53: Mirroring Configuration Status Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 01h: The system has been configured into Mirrored Channel RAS Mode.
    Description: User enabled mirrored channel mode in setup.
    Next Steps: Informational event only.
Event Trigger Offset 00h: The system has been configured out of Mirrored Channel RAS Mode.
    Description: Mirrored channel mode is disabled (either in setup or due to unavailability of memory at POST, in which case POST error 8500 is also logged).
    Next Steps:
    1. If this event is accompanied by a POST error 8500, there was a problem applying the mirroring configuration to the memory. Check for other errors related to the memory and troubleshoot accordingly.
    2. If there is no POST error, then mirror mode was simply disabled in BIOS setup and this should be considered informational only.
7.1.2 Mirrored Redundancy State Sensor
This sensor provides the RAS Redundancy state for the Memory Mirrored Channel Mode.
Table 54: Mirrored Redundancy State Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   01h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 55
15    Event Data 2                    [7:4] – If Domain Instance Type (ED3) is set to Local, this field specifies the mirroring domain local sub-instances – which channels are included in this sub-instance:
                                          0000b – Reserved
                                          0001b – {Ch A, Ch B}
                                          0010b – {Ch A, Ch C}
                                          0011b – {Ch B, Ch C}
                                          0100b-1110b – Reserved
                                      If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the first participant processor in this mirroring domain global instance.
                                      A value of 1111b indicates that this field is unused and does not contain valid data.
                                      [3:0] – If Domain Instance Type (ED3) is set to Local, this field specifies the sparing domain local sub-instances – which channels are included in this sub-instance:
                                          0000b – Reserved
                                          0001b – {Ch A, Ch B, Ch C} (only configuration possible on Intel® S5500/S5520 Server Boards)
                                          0010b-1110b – Reserved
                                      If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the first participant processor in this sparing domain global instance.
                                      A value of 1111b indicates that this field is unused and does not contain valid data.
16    Event Data 3                    [7] – Domain Instance Type
                                          0b: Local memory sparing domain instance. This SEL pertains to a local memory mirroring domain that is restricted to memory mirroring pairs within a processor socket only.
                                          1b: Global memory sparing domain instance. This SEL pertains to a global memory mirroring domain that pertains to memory mirroring between processor sockets.
                                      [6:4] – Reserved
                                      [3:0] – 0-based Instance ID of this sparing domain
Table 55: Mirrored Redundancy State Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 01h: Memory is configured in Mirrored Channel Mode, and the memory is operating in the fully redundant state.
    Description: System boots with mirrored channel mode active, one entry per processor.
    Next Steps: Informational event.
Event Trigger Offset 00h: Memory is configured in Mirrored Channel Mode, and the memory has lost redundancy and is operating in the degraded state.
    Description: One of the channels in the mirror pair is taken offline – loss of mirror – one entry only for affected processor.
    Next Steps: This event should be accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected DIMM).
7.1.3 Sparing Configuration Status
This sensor provides the Spare Channel mode RAS Configuration status.
Table 56: Sparing Configuration Status Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   13h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 09h (digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 57
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Table 57: Sparing Configuration Status Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 01h: The system has configured into Spare Channel RAS mode.
    Description: Sparing mode is enabled in setup.
    Next Steps: Informational event only.
Event Trigger Offset 00h: The system has configured out of Spare Channel RAS mode.
    Description: Sparing mode is disabled, either from setup or due to an error, in which case POST error 8500 also occurs.
    Next Steps:
    1. If this event is accompanied by a POST error 8500, there was a problem applying the sparing configuration to the memory. Check for other errors related to the memory and troubleshoot accordingly.
    2. If there is no POST error, then sparing mode was simply disabled in BIOS setup and this should be considered informational only.
7.1.4 Sparing Redundancy State Sensor
This sensor provides the RAS Redundancy state for the Spare Channel Mode.
Table 58: Sparing Redundancy State Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   11h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 59
15    Event Data 2                    [7:4] – If Domain Instance Type (ED3) is set to Local, this field specifies the 0-based Socket ID of the processor that contains the sparing domain local sub-instances.
                                      A value of 1110b indicates that the sparing configuration specified in Bits [3:0] applies globally to all sockets in the system.
                                      If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the second participant processor in this sparing domain global instance.
                                      A value of 1111b indicates that this field is unused and does not contain valid data.
                                      [3:0] – If Domain Instance Type (ED3) is set to Local, this field specifies the sparing domain local sub-instances – which channels are included in this sub-instance:
                                          0000b – Reserved
                                          0001b – {Ch A, Ch B, Ch C} (only configuration possible on Intel® S5500/S5520 Server Boards)
                                          0010b-1110b – Reserved
                                      If Domain Instance Type (ED3) is set to Global, this field specifies the 0-based Socket ID of the first participant processor in this sparing domain global instance.
                                      A value of 1111b indicates that this field is unused and does not contain valid data.
16    Event Data 3                    [7] – Domain Instance Type
                                          0b: Local memory sparing domain instance. This SEL pertains to a local memory sparing domain that is restricted to memory sparing pairs within a processor socket only.
                                          1b: Global memory sparing domain instance. This SEL pertains to a global memory sparing domain that pertains to memory sparing between processor sockets.
                                      [6:4] – Reserved
                                      [3:0] – 0-based Instance ID of this sparing domain
Table 59: Sparing Redundancy State Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 01h: Memory is configured in Spare Channel Mode, and the memory is operating in the fully redundant state, with the spare channel inactive and available.
    Description: System boots with spare channel mode active, one entry per processor.
    Next Steps: Informational event.
Event Trigger Offset 00h: Memory is configured in Spare Channel Mode, and the memory has lost redundancy and is operating in the degraded state, with the spare channel active and used to replace a failed channel.
    Description: Spare channel replaces failing channel, one SEL entry for processor with failing memory to signify loss of redundancy.
    Next Steps: This event should be accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected DIMM).
7.2 ECC and Address Parity
1. Memory data errors are logged as correctable or uncorrectable.
2. Uncorrectable errors are fatal.
3. Memory addresses are protected with parity bits. An address parity error is logged when one is detected, and it is a fatal error.
7.2.1 Memory Correctable and Uncorrectable ECC Error
ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors. A “Correctable ECC Error” event actually represents a threshold overflow: more correctable errors than the threshold allows were detected at the memory controller level for a given DIMM within a given timeframe. In both cases, the error can be narrowed down to particular DIMM(s). The BIOS SMI error handler uses this information to log the data to the BMC SEL and identify the failing DIMM.
Table 60: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 61
15    Event Data 2                    [7:2] – Reserved. Set to 0.
                                      [1:0] – The logical rank associated with the failed DDR3 DIMM
16    Event Data 3                    [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
                                          000b = Processor Socket 1
                                          001b = Processor Socket 2
                                          All other values are reserved.
                                      [4:3] – Indicates the processor Memory Channel to which the failing DDR3 DIMM is attached:
                                          00b = Channel A or D (For Processor Socket 1, Processor Socket 2)
                                          01b = Channel B or E
                                          10b = Channel C or F
                                          11b is reserved.
                                      [2:0] – Indicates the DIMM Socket on the channel to which the failing DDR3 DIMM is attached:
                                          000b = DIMM Socket 1
                                          001b = DIMM Socket 2
                                          All other values are reserved.
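To make the bit fields above concrete, here is a minimal Python sketch that turns Event Data 2 and Event Data 3 of an ECC event into a DIMM location. The function name decode_ecc_location and the returned keys are assumptions for this example. The channel letter is chosen on the reading that channels A to C belong to Processor Socket 1 and channels D to F to Processor Socket 2.

```python
def decode_ecc_location(event_data2: int, event_data3: int) -> dict:
    """Decode the DIMM location from an ECC error SEL event."""
    socket = (event_data3 >> 5) & 0x07   # ED3 bits [7:5]
    channel = (event_data3 >> 3) & 0x03  # ED3 bits [4:3]
    dimm = event_data3 & 0x07            # ED3 bits [2:0]
    if socket > 1 or channel > 2 or dimm > 1:
        raise ValueError("reserved value in Event Data 3")
    return {
        "rank": event_data2 & 0x03,                  # ED2 bits [1:0]
        "processor_socket": socket + 1,              # 1-based, as in the table
        "channel": "ABCDEF"[socket * 3 + channel],   # A-C on socket 1, D-F on socket 2
        "dimm_socket": dimm + 1,
    }


# Hypothetical event: ED2 = 02h, ED3 = 29h (001 01 001b).
print(decode_ecc_location(0x02, 0x29))
```

With these sample values the sketch reports rank 2 on Processor Socket 2, Channel E, DIMM Socket 2.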
Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 01h: Uncorrectable ECC Error
    Description: An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash (unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error) and an MCE (Machine Check Exception Error).
    Next Steps: While the error may be due to a failing DRAM chip on the DIMM, it could also be caused by incorrect seating or improper contact between the socket and DIMM, or by bent pins in the processor socket.
    1. If needed, decode DIMM location from hex version of SEL.
    2. Verify the DIMM is seated properly.
    3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
    4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
    5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
Event Trigger Offset 00h: Correctable ECC Error threshold reached
    Description: There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event in itself does not pose any direct problems as the ECC errors are still being corrected. Depending on the RAS configuration of the memory, the IMC may take the affected DIMM offline.
    Next Steps: Even though this event doesn't immediately lead to problems, it can indicate one of the DIMM modules is slowly failing. If this error occurs more than once:
    1. If needed, decode DIMM location from hex version of SEL.
    2. Verify the DIMM is seated properly.
    3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
    4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
    5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
7.2.2 Memory Address Parity Error
Address Parity errors are errors detected in the memory addressing hardware. Because these affect the addressing of memory contents, they can potentially lead to the same sort of failures as ECC errors. They are logged as a distinct type of error because they affect memory addressing rather than memory contents, but otherwise they are treated exactly the same as Uncorrectable ECC Errors. Address Parity errors are logged to the BMC SEL, with Event Data to identify the failing address by channel and DIMM to the extent that it is possible to do so.
Table 62: Address Parity Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   14h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 02h
15    Event Data 2                    [7:5] – Reserved. Set to 0.
                                      [4] – Channel Information Validity Check:
                                          0b = Channel Number in Event Data 3 Bits[4:3] is not valid
                                          1b = Channel Number in Event Data 3 Bits[4:3] is valid
                                      [3] – DIMM Information Validity Check:
                                          0b = DIMM Slot ID in Event Data 3 Bits[2:0] is not valid
                                          1b = DIMM Slot ID in Event Data 3 Bits[2:0] is valid
                                      [2:0] – Error Type:
                                          000b = Parity Error Type not known
                                          001b = Data Parity Error (not used)
                                          010b = Address Parity Error
                                          All other values are reserved.
16    Event Data 3                    [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the parity error is attached:
                                          000b = Processor Socket 1
                                          001b = Processor Socket 2
                                          All other values are reserved.
                                      [4:3] – Channel Number (if valid) on which the Parity Error occurred. This value will be indeterminate and should be ignored if ED2 Bit [4] is 0b.
                                          00b = Channel A or D (For Processor Socket 1, Processor Socket 2)
                                          01b = Channel B or E
                                          10b = Channel C or F
                                          11b = Reserved
                                      [2:0] – DIMM Slot ID (if valid) of the specific DIMM that was involved in the transaction that led to the parity error. This value will be indeterminate and should be ignored if ED2 Bit [3] is 0b.
                                          000b = DIMM Socket 1
                                          001b = DIMM Socket 2
                                          All other values are reserved.
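The validity bits in Event Data 2 decide whether the channel and DIMM fields in Event Data 3 can be trusted. The Python sketch below shows one way to honor them: fields flagged as not valid are returned as None instead of being decoded. The names ERROR_TYPES and decode_parity_event are hypothetical, and the channel letter again assumes channels A to C on Processor Socket 1 and D to F on Processor Socket 2.

```python
ERROR_TYPES = {
    0b000: "Parity Error Type not known",
    0b001: "Data Parity Error (not used)",
    0b010: "Address Parity Error",
}


def decode_parity_event(event_data2: int, event_data3: int) -> dict:
    """Decode an address parity SEL event, honoring the validity bits."""
    channel_valid = bool(event_data2 & 0x10)  # ED2 bit [4]
    dimm_valid = bool(event_data2 & 0x08)     # ED2 bit [3]
    socket = (event_data3 >> 5) & 0x07        # ED3 bits [7:5]
    channel = (event_data3 >> 3) & 0x03       # ED3 bits [4:3]
    dimm = event_data3 & 0x07                 # ED3 bits [2:0]
    if socket > 1:
        raise ValueError("reserved processor socket value")
    if channel_valid and channel > 2:
        raise ValueError("reserved channel value")
    if dimm_valid and dimm > 1:
        raise ValueError("reserved DIMM socket value")
    return {
        "error_type": ERROR_TYPES.get(event_data2 & 0x07, "reserved"),
        "processor_socket": socket + 1,
        # Indeterminate fields are reported as None rather than guessed.
        "channel": "ABCDEF"[socket * 3 + channel] if channel_valid else None,
        "dimm_socket": dimm + 1 if dimm_valid else None,
    }


# Hypothetical event: channel valid, DIMM not valid, address parity error.
print(decode_parity_event(0x12, 0x0A))
```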
7.2.2.1 Memory Address Parity Error Sensor Next Steps
These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn. An Address Parity Error is logged as such in SEL but in all other ways is treated the same as an Uncorrectable ECC Error.
While the error may be due to a failing DRAM chip on the DIMM, it could also be caused by incorrect seating or improper contact between the socket and DIMM, or by bent pins in the processor socket.
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
8. PCI Express* and Legacy PCI Subsystem
The PCI Express* (PCIe) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The BIOS logs AER events into the SEL.
The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.
8.1 PCI Express* Errors
PCIe error events are either correctable (informational) or fatal. In both cases the event identifies the source of the error: the PCI bus, device, and function numbers are included in the extended data fields. The operating system maps each PCIe device by the same bus, device, and function numbers, which uniquely identify it, so the device can be looked up in the operating system.
8.1.1 PCI Express* Correctable Errors
When a PCI Express* correctable error is reported to the BIOS SMI handler, it will record the error using the following format.
Table 63: PCI Express* Correctable Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   05h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 71h (OEM Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 64
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number
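The "decode bus, device, and function" step in the Next Steps for PCI events comes down to two shifts and a mask. A minimal Python sketch follows; the function name decode_pci_bdf is hypothetical, and the bus:device.function output format is simply the notation commonly printed by operating system tools such as lspci, not something this guide prescribes.

```python
def decode_pci_bdf(event_data2: int, event_data3: int) -> str:
    """Format the PCI bus, device, and function from Event Data 2 and 3."""
    bus = event_data2 & 0xFF              # ED2: PCI bus number
    device = (event_data3 >> 3) & 0x1F    # ED3 bits [7:3]
    function = event_data3 & 0x07         # ED3 bits [2:0]
    return f"{bus:02x}:{device:02x}.{function}"


# Hypothetical event: ED2 = 03h, ED3 = 19h (00011 001b).
print(decode_pci_bdf(0x03, 0x19))  # prints 03:03.1
```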
Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 00h (Receiver error): Correctable error occurred
Event Trigger Offset 01h (Bad DLLP error): Correctable bad DLLP occurred
Event Trigger Offset 02h (Bad TLP error): Correctable bad TLP occurred
Event Trigger Offset 03h (REPLAY_NUM Rollover Error): Correctable Replay event occurred
Event Trigger Offset 04h (REPLAY Timer Timeout Error): Correctable Replay timeout event occurred
Event Trigger Offset 05h (Advisory non-fatal Error, received ERR_COR message): Correctable advisory event occurred, typically provided as notice to software driver
Event Trigger Offset 06h (Link bandwidth changed): Link bandwidth changed
Next Steps (all offsets): Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Decode bus, device, and function to identify the card.
2. If this is an add-in card:
    a. Verify the card is inserted properly.
    b. Install the card in another slot and check whether the error follows the card or stays with the slot.
    c. Update all firmware and drivers, including non-Intel components.
3. If this is an onboard device:
    a. Update all BIOS, firmware, and drivers.
    b. Replace the board.
8.1.2 PCI Express* Fatal Errors
When a PCI Express* fatal error is reported to the BIOS SMI handler, it will record the error using the following format.
Table 65: PCI Express* Fatal Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   04h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 70h (OEM Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 66
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number
Table 66: PCI Express* Fatal Error Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 00h (Data Link Layer Protocol Error): Indicates a CRC error detected during a DLLP transaction. This means the transaction was corrupted.
Event Trigger Offset 01h (Surprise Link Down): The link was lost and is no longer functional. Requires a reboot to bring the link back.
Event Trigger Offset 02h (Unexpected Completion): Indicates the device received a completion notification for a transaction it does not recognize. This is a fatal error.
Event Trigger Offset 03h (Received Unsupported request condition on inbound address decode with the exception of SAD): Typically indicates a failure due to an incorrect address sent to the target. This unknown address is a fatal error.
Event Trigger Offset 04h (Poisoned TLP Error): Typically indicates a parity error in a TLP transaction. This means the data received is not correct.
Event Trigger Offset 05h (Flow Control Protocol Error): Indicates an error during initialization with the device not providing enough flow control credits. This means the bus configuration is incorrect and it cannot continue.
Event Trigger Offset 06h (Completion Timeout Error): Indicates a transaction did not complete in the specified amount of time.
Event Trigger Offset 07h (Completer Abort Error): Indicates a transaction had unexpected content or format.
Event Trigger Offset 08h (Receiver Buffer Overflow Error): Indicates a synchronization problem between PCI Express* devices. Extremely rare.
Event Trigger Offset 09h (ACS Violation Error): Access Control Services, a transaction routing feature, failed.
Event Trigger Offset 0Ah (Malformed TLP Error): Indicates a transaction was sent with data exceeding the maximum allowed number of bytes. This is not allowed and is a fatal error, usually a firmware or driver problem.
Event Trigger Offset 0Bh (Received ERR_FATAL message from downstream Error): Indicates a fatal error occurred and is being reported.
Event Trigger Offset 0Ch (Unexpected Completion Error): Indicates the device received a completion notification for a transaction it does not recognize.
Event Trigger Offset 0Dh (Received ERR_NONFATAL Message Error): Indicates a non-fatal error is redefined as fatal, and is being reported.
Next Steps (all offsets):
1. Decode bus, device, and function to identify the card.
2. If this is an add-in card:
    a. Verify the card is inserted properly.
    b. Install the card in another slot and check whether the error follows the card or stays with the slot.
    c. Update all firmware and drivers, including non-Intel components.
3. If this is an onboard device:
    a. Update all BIOS, firmware, and drivers.
    b. Replace the board.
8.1.3 Legacy PCI Errors
Legacy PCI errors include PERR and SERR; both are fatal errors.
Table 67: Legacy PCI Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   03h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 68
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number
Table 68: Legacy PCI Error Sensor Event Trigger Offset – Next Steps
Hex | Event Trigger Offset | Description
04h | PERR# | Parity Error, PERR, asserted. This is a fatal error.
05h | SERR# | System Error, SERR, asserted. This is a fatal error.
Next Steps
1. Decode bus, device, and function to identify the card.
2. If this is an add-in card: a. Verify the card is inserted properly. b. Install the card in another slot and check whether the error follows the card or stays with the slot. c. Update all firmware and drivers, including non-Intel components.
3. If this is an onboard device: a. Update all BIOS, firmware, and drivers. b. Replace the board.
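The first step, decoding bus, device, and function, can be sketched in Python as follows. This is only an illustration of the byte layout in Tables 67 and 68; the function name and the sample byte values are hypothetical and are not taken from any Intel utility.

```python
def decode_legacy_pci_event(ed1: int, ed2: int, ed3: int) -> dict:
    """Decode Event Data 1-3 of a Legacy PCI Error SEL entry (Tables 67 and 68)."""
    offsets = {0x04: "PERR#", 0x05: "SERR#"}
    offset = ed1 & 0x0F                      # ED1 [3:0] = Event Trigger Offset
    return {
        "error": offsets.get(offset, f"unknown offset {offset:02X}h"),
        "bus": ed2 & 0xFF,                   # ED2 = PCI bus number
        "device": (ed3 >> 3) & 0x1F,         # ED3 [7:3] = PCI device number
        "function": ed3 & 0x07,              # ED3 [2:0] = PCI function number
    }


# Made-up sample bytes: ED1 = A4h, ED2 = 02h, ED3 = 19h
event = decode_legacy_pci_event(0xA4, 0x02, 0x19)
print("{error} at bus {bus:02X}h, device {device:02X}h, function {function}".format(**event))
# prints: PERR# at bus 02h, device 03h, function 1
```

The resulting bus, device, and function values can then be compared with the PCI device listing of the operating system to find the card or onboard device.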
9. System BIOS Events
There are a number of events that are owned by the system BIOS. These events can occur during Power On Self Test (POST) or when coming out of a sleep state. Not all of these events signify errors. Some events are described in other chapters in this document (for example, memory events).
9.1 System Events
These events can occur during POST or when coming out of a sleep state. These are informational events only.
1. When logging events during POST, the BIOS uses Generator ID 0001h.
2. When coming out of a sleep state, the BIOS uses Generator ID 0033h.
9.1.1 System Boot
The BIOS logs a system boot event every time the system boots. The event gets logged early during POST when BIOS-BMC communication is first established. This event is not an error.
9.1.2 Timestamp Clock Synchronization
These events are logged when the time is synchronized between the BIOS and the BMC. Two events are logged. The BIOS sends the first time sync message to the BMC to perform the synchronization. The timestamp on that first entry is unknown, that is, it could be anything, because the entry is stamped with the "before" time of the BMC.
The BIOS therefore sends a second time sync message to get a correct "baseline" timestamp in the log. That is the "starting time".
For example, suppose the BMC time is March 1, 2011 21:00 and the BIOS time sync updates it to 21:20 on the same date (the BMC was running behind). Without the second time sync message, you would not know that the log time jumped ahead, and the next log entry would make it look as if there had been an unexplained 20-minute delay during boot.
Without the second time sync message, the time span to the next logged message is indeterminate. With the second time sync as a baseline, the timestamps of the following log entries are always determinate.
Table 69: System Event Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 0001h = BIOS POST; 0033h = BIOS SMI Handler
11 | Sensor Type | 12h = System Event
12 | Sensor Number | 83h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 01h = System Boot, 05h = Timestamp Clock Synchronization
15 | Event Data 2 | For Event Trigger Offset 05h only (Timestamp Clock Synchronization): 00h = 1st in pair, 80h = 2nd in pair
16 | Event Data 3 | Not used
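As an illustration of how the two Timestamp Clock Synchronization entries can be used, the Python sketch below pairs them and reports how far the BMC clock jumped. The tuple layout of the entries and the sample values are assumptions made for this example; a real SEL reader will present records differently.

```python
from datetime import datetime, timedelta


def clock_jumps(entries):
    """Yield (before, after, jump) for each Timestamp Clock Synchronization pair.

    entries: iterable of (timestamp, event_data_1, event_data_2) tuples for
    System Event (sensor type 12h) records in log order. This layout is a
    simplified, hypothetical one chosen for the example.
    """
    first = None
    for ts, ed1, ed2 in entries:
        if ed1 & 0x0F != 0x05:        # not a Timestamp Clock Synchronization event
            continue
        if ed2 == 0x00:               # 1st in pair: stamped with the old BMC time
            first = ts
        elif ed2 == 0x80 and first is not None:   # 2nd in pair: corrected time
            yield first, ts, ts - first
            first = None


# Sample data modeled on the 21:00 -> 21:20 example in the text
entries = [
    (datetime(2011, 3, 1, 21, 0, 0), 0x05, 0x00),
    (datetime(2011, 3, 1, 21, 20, 0), 0x05, 0x80),
]
for before, after, jump in clock_jumps(entries):
    print(f"BMC clock moved from {before} to {after} (jump of {jump})")
```

A large jump between the two entries of a pair indicates a clock correction, not a delay during boot.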
9.2 System Firmware Progress (Formerly Post Error)
The BIOS logs any POST errors to the SEL. The 2-byte POST error code is logged in the Event Data 2 (ED2) and Event Data 3 (ED3) bytes of the SEL entry. This event is logged every time a POST error is displayed. Even though this event indicates an error, it may not be a fatal error. If the error is serious, there will typically also be a corresponding SEL entry for whatever caused the error; that entry may contain more information about what happened than the POST error event.
Table 70: POST Error Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 0001h = BIOS POST
11 | Sensor Type | 0Fh = System Firmware Progress (formerly POST Error)
12 | Sensor Number | 06h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset = 0
15 | Event Data 2 | Low Byte of POST Error Code
16 | Event Data 3 | High Byte of POST Error Code
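Because the POST error code is split across two bytes, it has to be reassembled before it can be looked up in Table 71. The Python sketch below shows one way to do this; the function names are hypothetical, and the lookup dictionary holds only three sample rows from Table 71 for illustration.

```python
# Three sample rows from Table 71 (code: (message, response)); not the full table
POST_ERRORS = {
    0x0012: ("CMOS date/time not set", "Major"),
    0x8190: ("Watchdog timer failed on last boot", "Major"),
    0x84FF: ("System event log full", "Minor"),
}


def post_error_code(ed2: int, ed3: int) -> int:
    """Combine Event Data 2 (low byte) and Event Data 3 (high byte)."""
    return ((ed3 & 0xFF) << 8) | (ed2 & 0xFF)


def describe_post_error(ed2: int, ed3: int) -> str:
    code = post_error_code(ed2, ed3)
    message, response = POST_ERRORS.get(code, ("not in this sample lookup", "unknown"))
    return f"{code:04X}: {message} ({response})"


print(describe_post_error(0xFF, 0x84))   # prints: 84FF: System event log full (Minor)
```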
9.2.1 System Firmware Progress (Formerly Post Error) – Next Steps
See the following table for POST error codes.
Table 71: POST Error Codes
Error Code | Error Message | Response
0012 | CMOS date/time not set | Major
0048 | Password check failed | Major
0108 | Keyboard component encountered a locked error. | Minor
0109 | Keyboard component encountered a stuck key error. | Minor
0113 | Fixed Media The SAS RAID firmware cannot run properly. The user should attempt to reflash the firmware. | Major
0140 | PCI component encountered a PERR error. | Major
0141 | PCI resource conflict | Major
0146 | PCI out of resources error | Major
0192 | Processor 0x cache size mismatch detected. | Fatal
0193 | Processor 0x stepping mismatch. | Minor
0194 | Processor 0x family mismatch detected. | Fatal
0195 | Processor 0x Intel® QPI speed mismatch. | Fatal
0196 | Processor 0x model mismatch. | Fatal
0197 | Processor 0x speeds mismatched. | Fatal
0198 | Processor 0x family is not supported. | Fatal
019F | Processor and chipset stepping configuration is unsupported. | Major
5220 | CMOS/NVRAM Configuration Cleared | Major
5221 | Passwords cleared by jumper | Major
5224 | Password clear Jumper is Set. | Major
8160 | Processor 01 unable to apply microcode update | Major
8161 | Processor 02 unable to apply microcode update | Major
8180 | Processor 0x microcode update not found. | Minor
8190 | Watchdog timer failed on last boot | Major
8198 | OS boot watchdog timer failure. | Major
8300 | Baseboard management controller failed self-test | Major
84F2 | Baseboard management controller failed to respond | Major
84F3 | Baseboard management controller in update mode | Major
84F4 | Sensor data record empty | Major
84FF | System event log full | Minor
8500 | Memory component could not be configured in the selected RAS mode. | Major
8501 | DIMM Population Error. | Major
8502 | CLTT Configuration Failure Error. | Major
8520 | DIMM_A1 failed Self-Test (BIST). | Major
8521 | DIMM_A2 failed Self-Test (BIST). | Major
8522 | DIMM_B1 failed Self-Test (BIST). | Major
8523 | DIMM_B2 failed Self-Test (BIST). | Major
8524 | DIMM_C1 failed Self-Test (BIST). | Major
8525 | DIMM_C2 failed Self-Test (BIST). | Major
8526 | DIMM_D1 failed Self-Test (BIST). | Major
8527 | DIMM_D2 failed Self-Test (BIST). | Major
8528 | DIMM_E1 failed Self-Test (BIST). | Major
8529 | DIMM_E2 failed Self-Test (BIST). | Major
852A | DIMM_F1 failed Self-Test (BIST). | Major
852B | DIMM_F2 failed Self-Test (BIST). | Major
8540 | DIMM_A1 Disabled. | Major
8541 | DIMM_A2 Disabled. | Major
8542 | DIMM_B1 Disabled. | Major
8543 | DIMM_B2 Disabled. | Major
8544 | DIMM_C1 Disabled. | Major
8545 | DIMM_C2 Disabled. | Major
8546 | DIMM_D1 Disabled. | Major
8547 | DIMM_D2 Disabled. | Major
8548 | DIMM_E1 Disabled. | Major
8549 | DIMM_E2 Disabled. | Major
854A | DIMM_F1 Disabled. | Major
854B | DIMM_F2 Disabled. | Major
8560 | DIMM_A1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8561 | DIMM_A2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8562 | DIMM_B1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8563 | DIMM_B2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8564 | DIMM_C1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8565 | DIMM_C2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8566 | DIMM_D1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8567 | DIMM_D2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8568 | DIMM_E1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
8569 | DIMM_E2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
856A | DIMM_F1 Component encountered a Serial Presence Detection (SPD) fail error. | Major
856B | DIMM_F2 Component encountered a Serial Presence Detection (SPD) fail error. | Major
85A0 | DIMM_A1 Uncorrectable ECC error encountered. | Major
85A1 | DIMM_A2 Uncorrectable ECC error encountered. | Major
85A2 | DIMM_B1 Uncorrectable ECC error encountered. | Major
85A3 | DIMM_B2 Uncorrectable ECC error encountered. | Major
85A4 | DIMM_C1 Uncorrectable ECC error encountered. | Major
85A5 | DIMM_C2 Uncorrectable ECC error encountered. | Major
85A6 | DIMM_D1 Uncorrectable ECC error encountered. | Major
85A7 | DIMM_D2 Uncorrectable ECC error encountered. | Major
85A8 | DIMM_E1 Uncorrectable ECC error encountered. | Major
85A9 | DIMM_E2 Uncorrectable ECC error encountered. | Major
85AA | DIMM_F1 Uncorrectable ECC error encountered. | Major
85AB | DIMM_F2 Uncorrectable ECC error encountered. | Major
8604 | Chipset Reclaim of non-critical variables complete. | Minor
9000 | Unspecified processor component has encountered a non-specific error. | Major
9223 | Keyboard component was not detected. | Minor
9226 | Keyboard component encountered a controller error. | Minor
9243 | Mouse component was not detected. | Minor
9246 | Mouse component encountered a controller error. | Minor
9266 | Local Console component encountered a controller error. | Minor
9268 | Local Console component encountered an output error. | Minor
9269 | Local Console component encountered a resource conflict error. | Minor
9286 | Remote Console component encountered a controller error. | Minor
9287 | Remote Console component encountered an input error. | Minor
9288 | Remote Console component encountered an output error. | Minor
92A3 | Serial port component was not detected | Major
92A9 | Serial port component encountered a resource conflict error | Major
92C6 | Serial Port controller error | Minor
92C7 | Serial Port component encountered an input error. | Minor
92C8 | Serial Port component encountered an output error. | Minor
94C6 | LPC component encountered a controller error. | Minor
94C9 | LPC component encountered a resource conflict error. | Major
9506 | ATA/ATPI component encountered a controller error. | Minor
95A6 | PCI component encountered a controller error. | Minor
95A7 | PCI component encountered a read error. | Minor
95A8 | PCI component encountered a write error. | Minor
9609 | Unspecified software component encountered a start error. | Minor
9641 | PEI Core component encountered a load error. | Minor
9667 | PEI module component encountered an illegal software state error. | Fatal
9687 | DXE core component encountered an illegal software state error. | Fatal
96A7 | DXE boot services driver component encountered an illegal software state error. | Fatal
96AB | DXE boot services driver component encountered invalid configuration. | Minor
96E7 | SMM driver component encountered an illegal software state error. | Fatal
A000 | TPM device not detected. | Minor
A001 | TPM device missing or not responding. | Minor
A002 | TPM device failure. | Minor
A003 | TPM device failed self-test. | Minor
A022 | Processor component encountered a mismatch error. | Major
A027 | Processor component encountered a low voltage error. | Minor
A028 | Processor component encountered a high voltage error. | Minor
A100 | BIOS ACM Error | Major
A421 | PCI component encountered a SERR error. | Fatal
A500 | ATA/ATPI ATA bus SMART not supported. | Minor
A501 | ATA/ATPI ATA SMART is disabled. | Minor
A5A0 | PCI Express component encountered a PERR error. | Minor
A5A1 | PCI Express component encountered a SERR error. | Fatal
A5A4 | PCI Express IBIST error. | Major
A6A0 | DXE boot services driver Not enough memory available to shadow a legacy option ROM. | Minor
B6A3 | DXE boot services driver Unrecognized. | Major
10. Chassis Subsystem
The BMC monitors several aspects of the chassis. In addition to logging power and reset button presses, the BMC monitors chassis intrusion (if the chassis includes a chassis intrusion switch). It also watches the network connections and logs an event whenever the physical network link is lost.
10.1 Physical Security
Two sensors are included in the physical security subsystem: chassis intrusion and LAN leash lost.
10.1.1 Chassis Intrusion
Chassis Intrusion is monitored on supported chassis, and the BMC logs corresponding events when the chassis lid is opened and closed.
10.1.2 LAN Leash Lost
The LAN Leash lost sensor monitors the physical connection on the onboard network ports. If a LAN Leash lost event is logged, this means the network port lost its physical connection.
Table 72: Physical Security Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 05h = Physical Security
12 | Sensor Number | 04h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 73
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
Table 73: Physical Security Sensor Event Trigger Offset – Next Steps
Hex | Event Trigger Offset | Description | Next Steps
00h | Chassis intrusion | Somebody has opened the chassis (or the chassis intrusion sensor is not connected). | 1. Use the Quick Start Guide and the Service Guide to determine whether the chassis intrusion switch is connected properly. 2. If this is the case, make sure it makes proper contact when the chassis is closed. 3. If this is also the case, someone has opened the chassis. Ensure nobody has access to the system that shouldn't.
04h | LAN leash lost | Someone has unplugged a LAN cable that was present when the BMC initialized. This event gets logged when the electrical connection on the NIC connector gets lost. | This is most likely due to unplugging the cable but could also happen if there is an issue with the cable or switch. 1. Check the LAN cable and connector for issues. 2. Investigate switch logs where possible. 3. Ensure nobody has access to the server that shouldn't.
10.2 FP (NMI) Interrupt
The front panel interrupt button (also referred to as NMI button) is a recessed button on the front panel that allows the user to force a critical interrupt which causes a crash error or kernel panic.
Table 74: FP (NMI) Interrupt Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 05h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 0
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
10.2.1 FP (NMI) Interrupt – Next Steps
The purpose of this button is for diagnosing software issues – when a critical interrupt is generated the OS typically saves a memory dump.
This allows for exact analysis of what is going on in system memory, which can be useful for software developers, or for troubleshooting OS, software, and driver issues.
If this button was not actually pressed, you should ensure there is no physical fault with the front panel.
This event is only logged when a user presses the NMI button. Although it causes the OS to crash, it is not an error.
10.3 Button Press Events
The BMC logs when the front panel power and reset buttons get pressed. This is purely for informational purposes and these events do not indicate errors.
Table 75: Button Press Events Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 14h = Button/Switch
12 | Sensor Number | 09h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Power Button, 2h = Reset Button
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11. Miscellaneous Events
The miscellaneous events section addresses sensors not easily grouped with other sensor types.
11.1 IPMI Watchdog
PCSD server systems support an IPMI watchdog timer, which can check to see whether the OS is still responsive. The timer is disabled by default, and has to be enabled manually. It then requires an IPMI-aware utility in the operating system that will reset the timer before it expires.
If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle, or generate a critical interrupt).
Table 76: IPMI Watchdog Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 23h = Watchdog 2
12 | Sensor Number | 03h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 77
15 | Event Data 2 | [7:4] – Interrupt type: 0h = None, 1h = SMI, 2h = NMI, 3h = Messaging Interrupt, Fh = Unspecified, All other = Reserved; [3:0] – Timer use at expiration: 0h = Reserved, 1h = BIOS FRB2, 2h = BIOS/POST, 3h = OS Load, 4h = SMS/OS, 5h = OEM, Fh = Unspecified, All other = Reserved
16 | Event Data 3 | Not used
Table 77: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
Hex | Event Trigger Offset
00h | Timer expired, status only
01h | Hard reset
02h | Power down
03h | Power cycle
08h | Timer interrupt
Description
Our server systems support a BMC watchdog timer, which can check to see whether the OS is still responsive. The timer is disabled by default, and has to be enabled manually. It then requires an IPMI-aware utility in the operating system that will reset the timer before it expires. If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle, or generate a critical interrupt).
Next Steps
If this event is being logged, it is because the BMC has been configured to check the watchdog timer.
1. Make sure you have support for this in your OS (typically using a third-party IPMI-aware utility like ipmitool or ipmiutil along with the openipmi driver).
2. If this is the case, then it is likely your OS has hung, and you should investigate OS event logs to determine what may have caused this.
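When investigating a watchdog event, it helps to know which action the BMC took and which timer use was active when the timer expired. The Python sketch below decodes Event Data 1 and Event Data 2 according to Tables 76 and 77. The function and dictionary names are hypothetical, and the sample bytes are made up for illustration.

```python
ACTIONS = {          # Event Data 1 [3:0], Table 77
    0x0: "Timer expired, status only",
    0x1: "Hard reset",
    0x2: "Power down",
    0x3: "Power cycle",
    0x8: "Timer interrupt",
}
INTERRUPT_TYPES = {  # Event Data 2 [7:4], Table 76
    0x0: "None", 0x1: "SMI", 0x2: "NMI", 0x3: "Messaging Interrupt", 0xF: "Unspecified",
}
TIMER_USES = {       # Event Data 2 [3:0], Table 76
    0x1: "BIOS FRB2", 0x2: "BIOS/POST", 0x3: "OS Load", 0x4: "SMS/OS", 0x5: "OEM",
    0xF: "Unspecified",
}


def decode_watchdog(ed1: int, ed2: int) -> dict:
    """Decode an IPMI Watchdog SEL entry; values not listed map to 'Reserved'."""
    return {
        "action": ACTIONS.get(ed1 & 0x0F, "Not listed in Table 77"),
        "interrupt": INTERRUPT_TYPES.get((ed2 >> 4) & 0x0F, "Reserved"),
        "timer_use": TIMER_USES.get(ed2 & 0x0F, "Reserved"),
    }


# Made-up sample: ED1 = C1h (offset 01h), ED2 = 04h
print(decode_watchdog(0xC1, 0x04))
# prints: {'action': 'Hard reset', 'interrupt': 'None', 'timer_use': 'SMS/OS'}
```

A timer use of SMS/OS points at the running operating system, while BIOS FRB2, BIOS/POST, or OS Load point at earlier stages of the boot.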
11.2 SMI Timeout
SMI stands for system management interrupt and is an interrupt that gets generated so the processor can service server management events (typically memory or PCI errors, or other forms of critical interrupts), in order to log them to the SEL. If this interrupt times out, the system is frozen.
Table 78: SMI Timeout Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | F3h = SMI Timeout
12 | Sensor Number | 06h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (“digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1 = State Asserted
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.2.1 SMI Timeout – Next Steps
This event normally only occurs after another more critical event.
1. Check the SEL for any critical interrupts, memory errors, bus errors, PCI errors, or any other serious errors.
2. If these are not present, the system locked up before it was able to log the original issue. In this case, low level debug is normally required.
11.3 System Event Log Cleared
The BMC logs a SEL clear event. This is only ever the first event in the SEL. The event is caused either by a manual SEL clear using Intel® SEL Viewer or some other IPMI-aware utility, or by the SEL clear that is done in the factory as one of the last steps in the manufacturing process.
This is an informational event only.
Table 79: System Event Log Cleared Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 10h = Event Logging Disabled
12 | Sensor Number | 07h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 2 = Log area reset/cleared
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.4 System Event – PEF Action
The BMC is configurable to send alerts for events logged into the SEL. These alerts are called Platform Event Filters (PEF) and are disabled by default. The user must configure and enable this feature. PEF events are logged if the BMC takes action due to a PEF configuration. The BMC event triggering the PEF action will also be in the SEL.
This functionality is built into the BMC to allow it to send alerts (SNMP or other) for any event that gets logged to the SEL. PEF filters are turned off by default and have to be enabled manually using Intel® deployment assistant, Intel® syscfg utility, or an IPMI-aware utility.
Table 80: System Event – PEF Action Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 12h = System Event
12 | Sensor Number | 08h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 4 = PEF Action
15 | Event Data 2 | [7:6] – Reserved; [5] – 1b = Diagnostic Interrupt (NMI); [4] – 1b = OEM action; [3] – 1b = Power cycle; [2] – 1b = Reset; [1] – 1b = Power off; [0] – 1b = Alert
16 | Event Data 3 | Not used
11.4.1 System Event – PEF Action – Next Steps
This event gets logged if the BMC takes an action due to PEF configuration. Actions can be sending an alert, or resetting, power cycling, or powering down the system. There will be another event that has led to the action so you should investigate the SEL and PEF settings to identify this event, and troubleshoot accordingly.
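Event Data 2 of a PEF Action event is a bit mask, so more than one action can be reported in a single entry. The Python sketch below lists the actions that are set, following the bit assignments in Table 80. The function name is hypothetical and the sample value is made up.

```python
# (bit position, action) pairs from Event Data 2 in Table 80
PEF_ACTIONS = [
    (5, "Diagnostic Interrupt (NMI)"),
    (4, "OEM action"),
    (3, "Power cycle"),
    (2, "Reset"),
    (1, "Power off"),
    (0, "Alert"),
]


def decode_pef_actions(ed2: int) -> list:
    """Return the names of all PEF actions whose bit is set in Event Data 2."""
    return [name for bit, name in PEF_ACTIONS if ed2 & (1 << bit)]


print(decode_pef_actions(0x05))   # prints: ['Reset', 'Alert']
```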
12. Hot Swap Controller Events
The Hot Swap Controller (HSC) implements the same basic sensor model that is utilized by the other management controllers in the system.
Sensor model information is contained in the document Intelligent Platform Management Interface Specification. A common set of IPMI commands is used for configuring the sensors and returning threshold status.
12.1 HSC Backplane Temperature Sensor
There is a thermal sensor on the Hot Swap Backplane to measure the ambient temperature.
Table 81: HSC Backplane Temperature Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 00C0h = HSC Firmware – HSBP A; 00C2h = HSC Firmware – HSBP B
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | 01h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 82
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The temperature has dropped below its lower non-critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The temperature has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The temperature has gone over its upper non-critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The temperature has gone over its upper critical threshold.
Next Steps
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
12.2 HSC Drive Slot Status Sensor
The HSC Drive Slot Status sensor provides the current status for drives in each of the slots.
Table 83: HSC Drive Slot Status Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 00C0h = HSC Firmware – HSBP A; 00C2h = HSC Firmware – HSBP B
11 | Sensor Type | 0Dh = Drive Slot (Bay)
12 | Sensor Number | 6 Slot HSBP: 02h = Drive Slot 0 Status, 03h = Drive Slot 1 Status, 04h = Drive Slot 2 Status, 05h = Drive Slot 3 Status, 06h = Drive Slot 4 Status, 07h = Drive Slot 5 Status; 8 Slot HSBP: 02h = Drive Slot 0 Status, 03h = Drive Slot 1 Status, 04h = Drive Slot 2 Status, 05h = Drive Slot 3 Status, 06h = Drive Slot 4 Status, 07h = Drive Slot 5 Status, 08h = Drive Slot 6 Status, 09h = Drive Slot 7 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | 40h = Failed Drive
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
12.2.1 HSC Drive Slot Status Sensor – Next Steps
If during normal operation a drive gets reported as failed, ensure that the drive was seated properly and the drive carrier was properly latched.
If that does not work, replace the drive.
12.3 HSC Drive Presence Sensor
The HSC Drive Slot Presence sensor provides the current presence state for the drive in each of the slots. After an AC power cycle there will be a SEL entry to report the presence of the drive in a slot and there will be another entry for any changes in the presence of drives after that.
Table 84: HSC Drive Presence Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 00C0h = HSC Firmware – HSBP A; 00C2h = HSC Firmware – HSBP B
11 | Sensor Type | 0Dh = Drive Slot (Bay)
12 | Sensor Number | 6 Slot HSBP: 08h = Drive Slot 0 Presence, 09h = Drive Slot 1 Presence, 0Ah = Drive Slot 2 Presence, 0Bh = Drive Slot 3 Presence, 0Ch = Drive Slot 4 Presence, 0Dh = Drive Slot 5 Presence; 8 Slot HSBP: 0Ah = Drive Slot 0 Presence, 0Bh = Drive Slot 1 Presence, 0Ch = Drive Slot 2 Presence, 0Dh = Drive Slot 3 Presence, 0Eh = Drive Slot 4 Presence, 0Fh = Drive Slot 5 Presence, 10h = Drive Slot 6 Presence, 11h = Drive Slot 7 Presence
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 08h (“digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Device Removed / Device Absent, 1h = Device Inserted / Device Present
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
12.3.1 HSC Drive Presence Sensor – Next Steps
On AC power-on the drive presence will be logged as an informational event.
If during normal operation a drive is removed or installed, it will also log an event.
If you get a drive removed or installed without operator intervention, ensure that the drive was seated properly and the drive carrier was properly latched.
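The same sensor number can mean different things on a 6-slot and an 8-slot backplane (for example, 08h is Drive Slot 0 Presence on a 6-slot HSBP but Drive Slot 6 Status on an 8-slot HSBP). The Python sketch below maps a sensor number to a sensor kind and slot using the ranges in Tables 83 and 84. The function name and the need to pass the slot count explicitly are assumptions of this example.

```python
def hsc_drive_sensor(sensor_number: int, slots: int):
    """Map an HSC sensor number to ('status' | 'presence', slot index).

    slots must be 6 or 8 (the backplane type). Returns None if the sensor
    number is not a drive slot sensor for that backplane.
    """
    if slots not in (6, 8):
        raise ValueError("slots must be 6 or 8")
    status_base = 0x02                               # Table 83: status starts at 02h
    presence_base = 0x08 if slots == 6 else 0x0A     # Table 84: 08h or 0Ah
    if status_base <= sensor_number < status_base + slots:
        return ("status", sensor_number - status_base)
    if presence_base <= sensor_number < presence_base + slots:
        return ("presence", sensor_number - presence_base)
    return None


print(hsc_drive_sensor(0x08, 6))   # prints: ('presence', 0)
print(hsc_drive_sensor(0x08, 8))   # prints: ('status', 6)
```

The Generator ID (00C0h for HSBP A, 00C2h for HSBP B) identifies which backplane reported the event.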
13. Manageability Engine (ME) Events
The Manageability Engine controls the PECI interface and also contains the Node Manager functionality.
13.1 Node Manager Exception Event
A Node Manager Exception Event is sent each time the power limit of a maintained policy is exceeded for longer than the Correction Time Limit.
Table 85: Node Manager Exception Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 002Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 18h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 72h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3] – Node Manager Policy event: 0 – Reserved, 1 – Policy Correction Time Exceeded – Policy did not meet the contract for the defined policy. The policy will continue to limit the power or shut down the platform based on the defined policy action; [2] – Reserved; [1:0] – 00b
15 | Event Data 2 | [7:4] – Reserved; [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16 | Event Data 3 | Policy Id
13.1.1 Node Manager Exception Event – Next Steps
This is an informational event. Next steps depend on the policy that was set. See the Node Manager Specification for more details.
13.2 Node Manager Health Event
A Node Manager Health Event message provides a runtime error indication about Intel® Intelligent Power Node Manager’s health. Types of service that can send an error are defined as follows:
- Misconfigured policy
- Error reading power data
- Error reading inlet temperature
Table 86: Node Manager Health Event Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 002Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 19h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 73h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Health Event Type = 02h (Sensor Node Manager)
15 | Event Data 2 | [7:4] – Error type: 0-9 – Reserved, 10 – Policy Misconfiguration, 11 – Power Sensor Reading Failure, 12 – Inlet Temperature Reading Failure, 13 – Host Communication error, 14 – Real-time clock synchronization failure, 15 – Platform shutdown initiated by NM policy due to execution of action defined by Policy Exception Action; [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16 | Event Data 3 | If Error type = 10 or 15: <Policy Id>; If Error type = 11: <Power Sensor Address>; If Error type = 12: <Inlet Sensor Address>; Otherwise set to 0.
13.2.1 Node Manager Health Event – Next Steps
Misconfigured policy can happen if the max/min power consumption of the platform exceeds the values in policy due to hardware reconfiguration.
First occurrence of an unacknowledged event will be retransmitted no faster than every 300 milliseconds.
Real-time clock synchronization failure alert is sent when NM is enabled and capable of limiting power, but within 10 minutes the firmware cannot obtain valid calendar time from the host side, so NM cannot handle suspend periods.
Next steps depend on the policy that was set. See the Node Manager Specification for more details.
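As a rough illustration of the Event Data layout in Table 86, the following Python sketch splits Event Data 1 to 3 into their fields. The function name decode_nm_health_event and the dictionary keys are hypothetical and are not taken from the Node Manager Specification.

```python
# Error type names from Table 86, Event Data 2 bits [7:4]. Values 0-9 are reserved.
NM_HEALTH_ERROR_TYPES = {
    10: "Policy Misconfiguration",
    11: "Power Sensor Reading Failure",
    12: "Inlet Temperature Reading Failure",
    13: "Host Communication error",
    14: "Real-time clock synchronization failure",
    15: "Platform shutdown initiated by NM policy",
}


def decode_nm_health_event(ed1, ed2, ed3):
    """Decode Event Data 1-3 of a Node Manager Health Event (hypothetical helper)."""
    error_type = (ed2 >> 4) & 0x0F
    result = {
        "health_event_type": ed1 & 0x0F,  # Table 86 gives 02h here
        "error_type": error_type,
        "error_name": NM_HEALTH_ERROR_TYPES.get(error_type, "Reserved"),
        "domain_id": ed2 & 0x0F,
    }
    # The meaning of Event Data 3 depends on the error type.
    if error_type in (10, 15):
        result["policy_id"] = ed3
    elif error_type == 11:
        result["power_sensor_address"] = ed3
    elif error_type == 12:
        result["inlet_sensor_address"] = ed3
    return result
```

For example, decode_nm_health_event(0xA2, 0xA0, 0x05) reports a Policy Misconfiguration in Domain 0 for Policy Id 5.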
13.3 Node Manager Operational Capabilities Change
This message provides a runtime error indication about Intel® Intelligent Power Node Manager’s operational capabilities. This applies to all domains.
Assertion and deassertion of these events are supported.
Table 87: Node Manager Operational Capabilities Change Sensor Typical Characteristics
Byte   Field                            Description
8-9    Generator ID                     002Ch – ME Firmware
11     Sensor Type                      DCh = OEM
12     Sensor Number                    1Ah
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 74h (OEM)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Current state of Operational Capabilities. Bit pattern:
                                            0 – Policy interface capability
                                                0 – Not Available
                                                1 – Available
                                            1 – Monitoring capability
                                                0 – Not Available
                                                1 – Available
                                            2 – Power limiting capability
                                                0 – Not Available
                                                1 – Available
15     Event Data 2                     Not used
16     Event Data 3                     Not used
13.3.1 Node Manager Operational Capabilities Change – Next Steps
Policy Interface available indicates that Intel® Intelligent Power Node Manager is able to respond to the external interface about querying and setting Intel® Intelligent Power Node Manager policies. This is generally available as soon as the microcontroller is initialized.
Monitoring Interface available indicates that Intel® Intelligent Power Node Manager has the capability to monitor power and temperature. This is generally available when firmware is operational.
Power limiting interface available indicates that Intel® Intelligent Power Node Manager can do power limiting and is indicative of an ACPI-compliant OS loaded (unless the OEM has indicated support for a non-ACPI-compliant OS).
The current value of an unacknowledged capability sensor will be retransmitted no faster than every 300 milliseconds.
Next steps depend on the policy that was set. See the Node Manager Specification for more details.
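The bit pattern in Table 87 can be read with simple masks. The sketch below is illustrative only; decode_nm_capabilities and its result keys are hypothetical names, and the function expects byte 13 and Event Data 1 of the record as integers.

```python
def decode_nm_capabilities(event_dir_type, ed1):
    """Decode a Node Manager Operational Capabilities Change event (hypothetical helper)."""
    return {
        # Byte 13 bit [7]: 0b = assertion, 1b = deassertion.
        "asserted": (event_dir_type & 0x80) == 0,
        # Event Data 1 bits [2:0]: 1 = Available, 0 = Not Available.
        "policy_interface": bool(ed1 & 0x01),
        "monitoring": bool(ed1 & 0x02),
        "power_limiting": bool(ed1 & 0x04),
    }
```

A record with byte 13 = 74h and Event Data 1 = 03h would therefore indicate that the policy interface and monitoring are available while power limiting is not.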
13.4 Node Manager Alert Threshold Exceeded
A Policy Correction Time Exceeded event is sent each time the maintained policy power limit is exceeded for longer than the Correction Time Limit.
Table 88: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
Byte   Field                            Description
8-9    Generator ID                     002Ch – ME Firmware
11     Sensor Type                      DCh = OEM
12     Sensor Number                    1Bh
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 72h (OEM)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Data 2
                                        [5:4] – 10b = OEM code in Event Data 3
                                        [3] = Node Manager Policy event
                                            0 – Threshold exceeded
                                            1 – Policy Correction Time Exceeded – Policy did not meet the contract for the defined policy. The policy will continue to limit the power or shut down the platform based on the defined policy action.
                                        [2] – Reserved
                                        [1:0] – Threshold Number. Valid only if Byte 5 bit [3] is set to 0.
                                            0 to 2 – Threshold index
15     Event Data 2                     [7:4] – Reserved
                                        [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16     Event Data 3                     Policy ID
13.4.1 Node Manager Alert Threshold Exceeded – Next Steps
First occurrence of an unacknowledged event will be retransmitted no faster than every 300 milliseconds.
First occurrence of Threshold exceeded event assertion/deassertion will be retransmitted no faster than every 300 milliseconds.
Next steps depend on the policy that was set. See the Node Manager Specification for more details.
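To tell the two event variants in Table 88 apart, a script can test bit [3] of Event Data 1 before reading the threshold index. The following is a minimal sketch; decode_nm_alert and its result keys are hypothetical names.

```python
def decode_nm_alert(ed1, ed2, ed3):
    """Decode a Node Manager Alert Threshold Exceeded event (hypothetical helper)."""
    correction_time_exceeded = bool(ed1 & 0x08)  # Event Data 1 bit [3]
    result = {
        "event": ("Policy Correction Time Exceeded"
                  if correction_time_exceeded else "Threshold exceeded"),
        "domain_id": ed2 & 0x0F,
        "policy_id": ed3,
    }
    # The threshold number is only valid for the "Threshold exceeded" variant.
    if not correction_time_exceeded:
        result["threshold_index"] = ed1 & 0x03
    return result
```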
13.5 ME Firmware Health Event
This sensor is used in Platform Event messages to the BMC containing health information including but not limited to firmware upgrade and application errors.
Table 89: ME Firmware Health Event Sensor Typical Characteristics
Byte   Field                            Description
8-9    Generator ID                     002Ch or 602Ch – ME Firmware
11     Sensor Type                      DCh = OEM
12     Sensor Number                    17h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 75h (OEM)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Data 2
                                        [5:4] – 10b = OEM code in Event Data 3
                                        [3:0] – Health event type – 0h (Firmware Status)
15     Event Data 2                     See Table 90.
16     Event Data 3                     See Table 90.
13.5.1 ME Firmware Health Event – Next Steps
In the following table, Event Data 3 is noted only for specific errors.
If the issue persists, provide the content of Event Data 3 to the Intel support team for interpretation. Event Data 3 codes are generally not documented because their meaning varies, provides only clues, and usually needs to be interpreted case by case.
Table 90: ME Firmware Health Event Sensor – Next Steps
ED2 | ED3 | Description | Next Steps
00h | | Recovery GPIO forced. Recovery Image loaded due to recovery MGPIO pin asserted. Pin number is configurable in factory presets. Default recovery pin is MGPIO1. | Deassert MGPIO1 and reset the Intel® ME.
01h | | Image execution failed. Recovery Image or backup operational image loaded because operational image is corrupted. This may be either caused by flash device corruption or failed upgrade procedure. | Either the flash device must be replaced (if error is persistent) or the upgrade procedure must be started again.
02h | | Flash erase error. Error during flash erasure procedure. | The flash device must be replaced.
03h | | Flash corrupted. Error while checking Flash consistency. | The Flash device must be replaced (if error is persistent).
04h | | Internal error. Error during firmware execution – FW Watchdog Timeout. | Operational image needs to be updated to other version or hardware board repair is needed (if error is persistent).
05h | | BMC did not respond to cold reset request and Intel® ME rebooted the platform. | Verify the Intel® Node Manager configuration.
06h-FFh | | Reserved. |
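When many logs have to be triaged, Table 90 can be kept as a small lookup keyed by Event Data 2. The sketch below paraphrases the table rows; the names ME_FW_HEALTH_EVENTS and lookup_me_fw_health are hypothetical.

```python
# Event Data 2 code -> (description, next step), paraphrased from Table 90.
ME_FW_HEALTH_EVENTS = {
    0x00: ("Recovery GPIO forced",
           "Deassert MGPIO1 and reset the Intel ME."),
    0x01: ("Image execution failed",
           "Replace the flash device (if the error is persistent) "
           "or start the upgrade procedure again."),
    0x02: ("Flash erase error",
           "Replace the flash device."),
    0x03: ("Flash corrupted",
           "Replace the flash device (if the error is persistent)."),
    0x04: ("Internal error (FW Watchdog Timeout)",
           "Update the operational image to another version or repair "
           "the board (if the error is persistent)."),
    0x05: ("BMC did not respond to cold reset request; Intel ME rebooted the platform",
           "Verify the Intel Node Manager configuration."),
}


def lookup_me_fw_health(ed2):
    """Return (description, next_step) for an Event Data 2 value; 06h-FFh are reserved."""
    return ME_FW_HEALTH_EVENTS.get(ed2, ("Reserved", ""))
```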
14. Microsoft Windows* Records
With Microsoft Windows Server 2003* R2 and later versions, an Intelligent Platform Management Interface (IPMI) driver was added. This added the capability of logging some OS events to the SEL. The driver can write multiple records to the SEL for the following events:
Boot-up
Shutdown
Bug Check / Blue Screen
14.1 Boot-up Event Records
When the system boots into the Microsoft Windows* OS, there can be two events logged. The first is a boot-up record and the second is an OEM event. These are informational-only records.
Table 91: Boot-up Event Record Typical Characteristics
Byte   Field                            Description
8-9    Generator ID                     0041h – System Software with an ID = 20h
11     Sensor Type                      1Fh = OS Boot
12     Sensor Number                    00h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset = 1h = C: boot completed
15     Event Data 2                     Not used
16     Event Data 3                     Not used
Table 92: Boot-up OEM Event Record Typical Characteristics
Byte   Field                            Description
1-2    Record ID                        ID used for SEL Record access.
3      Record Type                      [7:0] – DCh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp                        Time when event was logged. LS byte first.
8-10   IPMI Manufacturer ID             0137h (311d) = IANA enterprise number for Microsoft
11     Record ID                        Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12-15  Boot Time                        Timestamp of when system booted into the OS
16     Reserved                         00h
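The byte layout in Table 92 maps directly onto a fixed 16-byte structure. The sketch below unpacks one raw record with Python's struct module. It assumes that the Manufacturer ID and Boot Time fields are stored LS byte first like the timestamp, which the table states only for the timestamp; parse_boot_oem_record and the result keys are hypothetical names.

```python
import struct


def parse_boot_oem_record(raw):
    """Split a 16-byte boot-up OEM SEL record into its Table 92 fields (hypothetical helper)."""
    if len(raw) != 16:
        raise ValueError("a SEL record is 16 bytes long")
    # "<" = little-endian, no padding: 2 + 1 + 4 + 3 + 1 + 4 + 1 = 16 bytes.
    (record_id, record_type, timestamp, mfr,
     sequence, boot_time, reserved) = struct.unpack("<HBI3sBIB", raw)
    return {
        "record_id": record_id,
        "record_type": record_type,                      # DCh for this record
        "timestamp": timestamp,                          # raw seconds value
        "manufacturer_id": int.from_bytes(mfr, "little"),  # assumed LS byte first
        "sequence": sequence,                            # byte 11
        "boot_time": boot_time,                          # assumed LS byte first
        "reserved": reserved,
    }
```

A manufacturer_id of 311 (0137h) identifies the record as written with the Microsoft enterprise number.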
14.2 Shutdown Event Records
When the system shuts down from the Microsoft Windows* OS, there can be multiple events logged. The first is an OS Stop/Shutdown Event Record; this can be followed by a shutdown reason code OEM record, and then zero or more shutdown comment OEM records. These are all informational-only records.
Table 93: Shutdown Reason Code Event Record Typical Characteristics
Byte   Field                            Description
8-9    Generator ID                     0041h – System Software with an ID = 20h
11     Sensor Type                      20h = OS Stop/Shutdown
12     Sensor Number                    00h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset = 3h = OS Graceful Shutdown
15     Event Data 2                     Not used
16     Event Data 3                     Not used
Table 94: Shutdown Reason OEM Event Record Typical Characteristics
Byte   Field                            Description
1-2    Record ID                        ID used for SEL Record access.
3      Record Type                      [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp                        Time when event was logged. LS byte first.
8-10   IPMI Manufacturer ID             0137h (311d) = IANA enterprise number for Microsoft
11     Record ID                        Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12-15  Shutdown Reason                  Shutdown Reason code from the registry (LSB first.):
                                        HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/ReasonCode
16     Reserved                         00h
Table 95: Shutdown Comment OEM Event Record Typical Characteristics
Byte   Field                            Description
1-2    Record ID                        ID used for SEL Record access.
3      Record Type                      [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp                        Time when event was logged. LS byte first.
8-10   IPMI Manufacturer ID             0137h (311d) = IANA enterprise number for Microsoft
                                        0157h (343) = IANA enterprise number for Intel
                                        The value logged depends on the Intelligent Management Bus Driver (IMBDRV) that is loaded.
11     Record ID                        Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12-15  Shutdown Comment                 Shutdown Comment from the registry (LSB first.):
                                        HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/Comment
16     Reserved                         00h
14.3 Bug Check / Blue Screen Event Records
When the system experiences a bug check (blue screen), there will be multiple records written to the event log. The first is a Bug Check / Blue Screen OS Stop/Shutdown Event Record; this can be followed by multiple Bug Check / Blue Screen code OEM records that will contain the Bug Check / Blue Screen codes. This information can be used to determine what caused the failure.
Table 96: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics
Byte   Field                            Description
8-9    Generator ID                     0041h – System Software with an ID = 20h
11     Sensor Type                      20h = OS Stop/Shutdown
12     Sensor Number                    00h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (that is, “core dump”, “blue screen”)
15     Event Data 2                     Not used
16     Event Data 3                     Not used
Table 97: Bug Check / Blue Screen Code OEM Event Record Typical Characteristics
Byte   Field                            Description
1-2    Record ID                        ID used for SEL Record access.
3      Record Type                      [7:0] – DEh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp                        Time when event was logged. LS byte first.
8-10   IPMI Manufacturer ID             0137h (311) = IANA enterprise number for Microsoft
                                        0157h (343) = IANA enterprise number for Intel
                                        The value logged depends on the Intelligent Management Bus Driver (IMBDRV) that is loaded.
11     Sequence Number                  Sequential number reflecting the order in which the records are read. The numbers start at 1 for the first entry in the SEL and continue sequentially to n, the number of entries in the SEL.
12-15  Bug Check/Blue Screen Data       The first record of this type will contain the Bug Check / Blue Screen Stop code and will be followed by the four Bug Check / Blue Screen parameters. LSB first.
                                        Note that each of the Bug Check / Blue Screen parameters requires two records each.
                                        Both of the two records for each parameter will have the same Record ID.
                                        There will be a total of 9 records.
16     Operating system type            00 = 32 bit OS
                                        01 = 64 bit OS
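The nine Bug Check / Blue Screen code records described in Table 97 can be recombined into one stop code and four parameters. The sketch below takes the 4-byte data fields (bytes 12-15) of the nine records in log order. It assumes that the first record of each parameter pair carries the low 32 bits, which the table does not state explicitly; assemble_bug_check is a hypothetical name.

```python
import struct


def assemble_bug_check(data_fields):
    """Combine nine 4-byte data fields into (stop_code, [p1, p2, p3, p4]).

    Assumption: within each parameter pair, the first record holds the
    low 32 bits and the second record holds the high 32 bits.
    """
    if len(data_fields) != 9 or any(len(d) != 4 for d in data_fields):
        raise ValueError("expected nine 4-byte data fields")
    # Each field is stored LSB first.
    words = [struct.unpack("<I", d)[0] for d in data_fields]
    stop_code = words[0]
    params = [words[i] | (words[i + 1] << 32) for i in range(1, 9, 2)]
    return stop_code, params
```

The resulting stop code and parameters correspond to the values shown on the blue screen and can be looked up in the Microsoft bug check documentation.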
15. Linux* Kernel Panic Records
The OpenIPMI driver supports the ability to put semi-custom and custom events in the system event log if a panic occurs. If you enable the “Generate a panic event to all BMCs on a panic” option, you will get one event on a panic in a standard IPMI event format. If you enable the “Generate OEM events containing the panic string” option, you will also get a set of OEM events holding the panic string.
Table 98: Linux* Kernel Panic Event Record Characteristics
Byte   Field                            Description
8-9    Generator ID                     0021h – Kernel
10     EvM Rev                          03h = IPMI 1.0 format
11     Sensor Type                      20h = OS Stop/Shutdown
12     Sensor Number                    The first byte of the panic string (0 if no panic string)
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Data 2
                                        [5:4] – 10b = OEM code in Event Data 3
                                        [3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (that is, “core dump”, “blue screen”)
15     Event Data 2                     The second byte of panic string
16     Event Data 3                     The third byte of panic string
Table 99: Linux* Kernel Panic String Extended Record Characteristics
Byte   Field                            Description
1-2    Record ID                        ID used for SEL Record access.
3      Record Type                      [7:0] – F0h = OEM non-timestamped, bytes 4-16 OEM defined
4      Slave Address                    The slave address of the card saving the panic.
5      Sequence Number                  A sequence number (starting at zero).
6-16   Kernel Panic Data                These hold the panic string. If the panic string is longer than 11 bytes, multiple messages will be sent with increasing sequence numbers.
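Because a long panic string is spread over several records, it has to be reassembled by sequence number. The sketch below does this for a list of raw 16-byte records. It assumes that unused trailing bytes of the last record are zero-filled and that every F0h record in the list belongs to the same panic; assemble_panic_string is a hypothetical name.

```python
def assemble_panic_string(records):
    """Rebuild the panic string from Table 99 style OEM records (hypothetical helper)."""
    chunks = {}
    for raw in records:
        # Skip anything that is not a 16-byte record of type F0h.
        if len(raw) != 16 or raw[2] != 0xF0:
            continue
        sequence = raw[4]           # byte 5: sequence number, starting at zero
        chunks[sequence] = raw[5:]  # bytes 6-16: 11 bytes of panic string
    data = b"".join(chunks[seq] for seq in sorted(chunks))
    # Assumption: padding after the end of the string is zero bytes.
    return data.rstrip(b"\x00").decode("ascii", errors="replace")
```

Sorting by sequence number keeps the result correct even if the records are passed in a different order than they were logged.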