Cooling Subsystem
Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps
Offset (Hex)  Offset Description            Assertion Severity  Deassert Severity  Description
00h           Lower non-critical going low  Degraded            OK                 The fan speed has dropped below its lower non-critical threshold.
02h           Lower critical going low      Non-fatal           Degraded           The fan speed has dropped below its lower critical threshold.
Next Steps
A fan speed error on a new system build is typically caused not by the fan spinning too slowly, but by the fan being connected to the wrong header. The BMC expects fans on specific headers for each chassis and logs this event if there is no fan on an expected header.
1. Refer to the Quick Start Guide or the Service Guide to identify the correct fan headers to use.
2. Ensure the latest FRUSDR update has been run and the correct chassis was detected or selected.
3. If you are sure this was done, the event may be a sign of impending fan failure (although this normally applies only if the system has been in use for a while). Replace the fan.
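The severity mapping in Table 28 can be written as a small lookup. The following is a minimal sketch, assuming the Event Trigger Offset has already been extracted from the SEL record; the names `FAN_SPEED_EVENTS` and `fan_speed_severity` are hypothetical and not part of any Intel tool.

```python
# Table 28 as a lookup: offset -> (description, assertion severity, deassert severity).
FAN_SPEED_EVENTS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "Non-fatal", "Degraded"),
}


def fan_speed_severity(offset, deassertion=False):
    """Return (description, severity) for a fan speed Event Trigger Offset.

    Hypothetical helper: raises ValueError for offsets not listed in Table 28.
    """
    try:
        description, on_assert, on_deassert = FAN_SPEED_EVENTS[offset]
    except KeyError:
        raise ValueError(f"offset {offset:02X}h is not listed in Table 28") from None
    return description, (on_deassert if deassertion else on_assert)
```

For example, `fan_speed_severity(0x02)` gives the "Lower critical going low" entry with its assertion severity, and passing `deassertion=True` selects the deassert severity instead.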
5.1.2 Fan Presence and Redundancy Sensors
Fan presence sensors are only implemented for hot-swap fans and require an additional pin on the fan header. Fan redundancy is an aggregate of the fan presence sensors and warns when redundancy is lost. Typically, the redundancy mode on Intel® servers is n+1 (if one fan fails, there are still sufficient fans to cool the system, but it is no longer redundant), although other modes are also possible.
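As a rough illustration of how a redundancy state could be aggregated from presence sensors, the sketch below assumes an n+1 scheme in which `required` fans are enough to cool the system. The function name, its parameters, and the returned state names are assumptions made for illustration; the BMC's actual algorithm is not described in this guide.

```python
def redundancy_state(present, required):
    """Classify fan redundancy from per-fan presence flags.

    present  -- iterable of booleans, one per fan presence sensor
    required -- hypothetical number of fans needed for adequate cooling
    """
    working = sum(1 for p in present if p)
    if working > required:
        return "redundant"
    if working == required:
        # One spare has been lost: cooling is still sufficient, but not redundant.
        return "non-redundant, sufficient"
    return "non-redundant, insufficient"
```

With six fans installed and five required, losing one fan moves the result from "redundant" to "non-redundant, sufficient", which matches the n+1 behavior described above.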
Table 29: Fan Presence Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   40h-45h (Chassis specific)
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 08h (Generic “digital” Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 30
15    Event Data 2                    Not used
16    Event Data 3                    Not used

Intel order number G74211-002 Revision 1.1
System Event Log Troubleshooting Guide for Intel®
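The bit fields of bytes 13 and 14 can be separated with simple masks and shifts. This is a minimal sketch, assuming the two bytes have already been read from the SEL record as integers; the function names and the dictionary keys are hypothetical.

```python
def decode_event_dir_type(byte13):
    """Split byte 13 into event direction (bit 7) and event type (bits 6:0)."""
    direction = "Deassertion" if byte13 & 0x80 else "Assertion"
    event_type = byte13 & 0x7F
    return direction, event_type


def decode_event_data1(byte14):
    """Split byte 14 (Event Data 1) into its three bit fields."""
    return {
        "event_data2_format": (byte14 >> 6) & 0x03,  # bits [7:6]
        "event_data3_format": (byte14 >> 4) & 0x03,  # bits [5:4]
        "trigger_offset": byte14 & 0x0F,             # bits [3:0]
    }
```

For instance, a byte 13 value of 88h decodes as a deassertion of event type 08h, and a byte 14 value of 01h yields trigger offset 01h with both Event Data 2 and Event Data 3 unspecified.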
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 30: Fan Presence Sensors – Event Trigger Offset – Next Steps
Offset (Hex)  Offset Description  Assertion Severity  Deassert Severity
01h           Device Present      OK                  Degraded

Description
Assertion: A fan was inserted. This event may also be logged when the BMC initializes after AC power is applied.
Deassert: A fan was removed, or was not present at the expected location when the BMC initialized.

Next Steps
Assertion: Informational only.
Deassert: These events are only generated in systems with hot-swappable fans, and normally only when a fan is physically inserted or removed. If fans were not physically removed:
1. Use the Quick Start Guide to check whether the right fan headers were used.
2. Swap the fans around to see whether the problem stays with the location or follows the fan.
3. Replace the fan or fan wiring/housing depending on the outcome of step 2.
4. Ensure the latest FRUSDR update has been run and the correct chassis was detected or selected.
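The swap test in steps 2 and 3 amounts to a simple comparison of where the fault is reported before and after the swap. The sketch below is illustrative only: the function name and slot labels are hypothetical, and the mapping of outcomes to actions (fault follows the fan means replace the fan; fault stays with the location means check wiring or housing) is an interpretation of step 3 rather than wording from the guide.

```python
def swap_test_outcome(faulty_slot, swapped_with, faulty_slot_after):
    """Interpret the result of swapping two fans.

    faulty_slot       -- slot that reported the fault before the swap
    swapped_with      -- slot whose fan was exchanged with the suspect fan
    faulty_slot_after -- slot that reports the fault after the swap
    """
    if faulty_slot_after == faulty_slot:
        # Problem stayed with the location.
        return "replace fan wiring/housing"
    if faulty_slot_after == swapped_with:
        # Problem followed the fan to its new slot.
        return "replace fan"
    return "inconclusive"
```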
Table 31: Fan Redundancy Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   46h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 32
15    Event Data 2                    Not used
16    Event Data 3                    Not used
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 32: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
Offset (Hex)  Offset Description                            Description
00h           Fully redundant
01h           Redundancy lost
02h           Redundancy degraded
03h           Non-redundant, sufficient from redundant      System has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.
04h           Non-redundant, sufficient from insufficient
05h           Non-redundant, insufficient                   System has lost fans and may no longer be able to cool itself adequately. Overheating may occur if this situation remains for a longer period of time.
06h           Non-redundant, degraded from fully redundant  System has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.
07h           Redundant, degraded from non-redundant        System has lost one or more fans and is running in a degraded mode, but still is redundant. There are enough fans to keep the system properly cooled.
Next Steps
Fan redundancy loss indicates failure of one or more fans.
Look for lower non-critical or lower critical fan speed events, or fan removal events, in the SEL to identify which fan is causing the problem, and follow the troubleshooting steps for those event types.
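When a redundancy event is found, the related fan speed and fan presence events can be picked out of the log by sensor type and sensor number. The following is a sketch under stated assumptions: the SEL records are taken to be already decoded into dictionaries with `sensor_type` and `sensor_number` keys, which is a hypothetical layout, and the function name is illustrative. The offset names come from Table 32 and the sensor number 46h from Table 31.

```python
FAN_SENSOR_TYPE = 0x04        # "04h = Fan" in Tables 29 and 31
FAN_REDUNDANCY_SENSOR = 0x46  # sensor number from Table 31

# Offset descriptions from Table 32.
REDUNDANCY_OFFSETS = {
    0x00: "Fully redundant",
    0x01: "Redundancy lost",
    0x02: "Redundancy degraded",
    0x03: "Non-redundant, sufficient from redundant",
    0x04: "Non-redundant, sufficient from insufficient",
    0x05: "Non-redundant, insufficient",
    0x06: "Non-redundant, degraded from fully redundant",
    0x07: "Redundant, degraded from non-redundant",
}


def related_fan_events(records):
    """Return fan events other than fan redundancy events.

    records -- iterable of dicts with 'sensor_type' and 'sensor_number'
               keys (hypothetical layout for decoded SEL entries)
    """
    return [
        r for r in records
        if r["sensor_type"] == FAN_SENSOR_TYPE
        and r["sensor_number"] != FAN_REDUNDANCY_SENSOR
    ]
```

The events returned are the candidates to examine for the fan that caused the redundancy change.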