IBM System Storage SAN Volume Controller
Troubleshooting Guide
IBM
GC27-2284-10
Note
Before using this information and the product it supports, read the information in “Notices” on page 359.
This edition applies to IBM SAN Volume Controller, Version 7.6, and to all subsequent releases and modifications
until otherwise indicated in new editions.
© Copyright IBM Corporation 2003, 2015.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents
Figures . . . . . . . . . . . . . . vii
Tables . . . . . . . . . . . . . . . ix
About this guide . . . . . . . . . . . xi
Who should use this guide . . . . . . . . . xi
Emphasis . . . . . . . . . . . . . . . xi
SAN Volume Controller library and related
publications . . . . . . . . . . . . . . xi
How to order IBM publications . . . . . . . xiv
Related websites . . . . . . . . . . . . xiv
Sending your comments . . . . . . . . . . xv
How to get information, help, and technical
assistance . . . . . . . . . . . . . . . xv
Summary of changes for GC27-2284-07 SAN
Volume Controller Troubleshooting Guide. . . . xvii
Summary of changes for GC27-2284-06 SAN
Volume Controller Troubleshooting Guide. . . . xvii
Chapter 1. SAN Volume Controller
overview . . . . . . . . . . . . . . 1
Systems . . . . 11
Configuration node . . . . 11
Configuration node addressing . . . . 12
Management IP failover . . . . 12
SAN fabric overview . . . . 14
Chapter 2. Introducing the SAN Volume
Controller hardware components . . . 15
SAN Volume Controller nodes . . . . 15
SAN Volume Controller controls and indicators . . . . 15
SAN Volume Controller operator-information panel . . . . 20
SAN Volume Controller rear-panel indicators and connectors . . . . 24
Fibre Channel port numbers and worldwide port names . . . . 35
Requirements for the SAN Volume Controller environment . . . . 35
Redundant AC-power switch . . . . 42
Redundant AC-power environment requirements . . . . 43
Cabling of redundant AC-power switch (example) . . . . 44
Uninterruptible power supply . . . . 47
2145 UPS-1U . . . . 48
Uninterruptible power-supply environment requirements . . . . 52
Defining the SAN Volume Controller FRUs . . . . 53
SAN Volume Controller FRUs . . . . 53
Redundant AC-power switch FRUs . . . . 58
Chapter 3. SAN Volume Controller user
interfaces for servicing your system . . 59
Management GUI interface . . . . 59
When to use the management GUI . . . . 60
Accessing the management GUI . . . . 60
Deleting a node from a clustered system using the management GUI . . . . 61
Adding a node to a system . . . . 63
Service assistant interface . . . . 66
When to use the service assistant . . . . 67
Accessing the service assistant . . . . 67
Command-line interface . . . . 68
When to use the CLI . . . . 68
Accessing the system CLI . . . . 68
Service command-line interface . . . . 68
When to use the service CLI . . . . 68
Accessing the service CLI . . . . 69
USB flash drive interface . . . . 69
Technician port for node access . . . . 75
Front panel interface . . . . 76
Chapter 4. Performing recovery actions
using the SAN Volume Controller CLI . 77
Validating and repairing mirrored volume copies
using the CLI. . . . . . . . . . . . . . 77
Repairing a thin-provisioned volume using the CLI 78
Recovering offline volumes using the CLI . . . . 79
Chapter 5. Viewing the vital product
data . . . . . . . . . . . . . . . . 81
Downloading the vital product data using the management GUI . . . . 81
Displaying the vital product data using the CLI . . . . 81
Displaying node properties by using the CLI . . . . 81
Displaying clustered system properties using the CLI . . . . 82
Fields for the node VPD . . . . 83
Fields for the system VPD . . . . 88
Chapter 6. Using the front panel of the
SAN Volume Controller. . . . . . . . 91
Boot progress indicator . . . . 91
Boot failed . . . . 91
Charging . . . . 92
Error codes . . . . 92
Hardware boot . . . . 92
Node rescue request . . . . 93
Power failure . . . . 93
Powering off . . . . 93
Recovering . . . . 94
Restarting . . . . 94
Shutting down . . . . 94
Validate WWNN? option . . . . 95
SAN Volume Controller menu options . . . . 96
Cluster (system) options . . . . 98
Node options . . . . 100
Version options . . . . 100
Ethernet options . . . . 101
Fibre Channel port options . . . . 101
Actions options . . . . 102
Language? option . . . . 116
Using the power control for the SAN Volume Controller node . . . . 117
Chapter 7. Diagnosing problems . . . 119
Starting statistics collection . . . . 119
Event reporting . . . . 130
Power-on self-test . . . . 130
Understanding events . . . . 131
Managing the event log . . . . 131
Viewing the event log . . . . 131
Describing the fields in the event log . . . . 132
Event notifications . . . . 133
Inventory information email . . . . 136
Understanding the error codes . . . . 137
Using the error code tables . . . . 137
Event IDs . . . . 138
SCSI event reporting . . . . 142
Object types . . . . 145
Error event IDs and error codes . . . . 146
Resolving a problem with the SAN Volume Controller 2145-DH8 boot drives . . . . 157
Determining a hardware boot failure . . . . 159
Boot code reference . . . . 160
Node error code overview . . . . 160
Clustered-system code overview . . . . 161
Error code range . . . . 162
Procedure: SAN problem determination . . . . 245
Resolving a problem with SSL/TLS clients . . . . 245
Procedure: Making drives support protection information . . . . 246
Resolving a problem with new expansion enclosures . . . . 247
Fibre Channel and 10G Ethernet link failures . . . . 247
Ethernet iSCSI host-link problems . . . . 248
Fibre Channel over Ethernet host-link problems . . . . 249
Servicing storage systems . . . . 250
Chapter 8. Disaster recovery . . . . . 251
Chapter 9. Recovery procedures . . . 253
Recover system procedure . . . . 253
When to run the recover system procedure . . . . 254
Fix hardware errors . . . . 254
Removing clustered system information for nodes with error code 550 or error code 578 using the front panel . . . . 256
Removing system information for nodes with error code 550 or error code 578 using the service assistant . . . . 256
Completing recovery procedure for clustered systems using the front panel . . . . 257
Running system recovery using the service assistant . . . . 259
Recovering from offline volumes using the CLI . . . . 260
What to check after running the system recovery . . . . 261
Backing up and restoring the system configuration . . . . 263
Backing up the system configuration using the CLI . . . . 264
Restoring the system configuration . . . . 266
Deleting backup configuration files using the CLI . . . . 271
Completing the node rescue when the node boots . . . . 271
Chapter 10. Understanding the
medium errors and bad blocks . . . . 273
Chapter 11. Using the maintenance
analysis procedures . . . . . . . . 275
MAP 5000: Start . . . . 275
MAP 5040: Power SAN Volume Controller 2145-DH8 . . . . 283
MAP 5050: Power 2145-CG8 and 2145-CF8 . . . . 288
MAP 5150: 2145 UPS-1U . . . . 292
MAP 5250: 2145 UPS-1U repair verification . . . . 298
MAP 5320: Redundant AC power . . . . 299
MAP 5340: Redundant ac power verification . . . . 300
MAP 5350: Powering off a node . . . . 302
Using the management GUI to power off a system . . . . 303
Using the SAN Volume Controller CLI to power off a node . . . . 305
Using the SAN Volume Controller power control button . . . . 306
MAP 5400: Front panel . . . . 307
MAP 5500: Ethernet . . . . 309
Defining an alternate configuration node . . . . 313
MAP 5550: 10G Ethernet and Fibre Channel over Ethernet personality enabled adapter port . . . . 313
MAP 5600: Fibre Channel . . . . 316
MAP 5700: Repair verification . . . . 321
MAP 5800: Light path . . . . 322
Light path for SAN Volume Controller 2145-DH8 . . . . 323
Light path for SAN Volume Controller 2145-CG8 . . . . 329
Light path for SAN Volume Controller 2145-CF8 . . . . 335
MAP 5900: Hardware boot . . . . 341
MAP 6000: Replace offline SSD . . . . 346
MAP 6001: Replace offline SSD in a RAID 0 array . . . . 347
MAP 6002: Replace offline SSD in RAID 1 array or RAID 10 array . . . . 349
Chapter 12. iSCSI performance
analysis and tuning. . . . . . . . . 351
Appendix A. Accessibility features for
IBM SAN Volume Controller . . . . . 355
Appendix B. Where to find the
Statement of Limited Warranty . . . . 357
Notices . . . . . . . . . . . . . . 359
Trademarks . . . . 361
Homologation statement . . . . 361
Electronic emission notices . . . . 361
Federal Communications Commission (FCC) statement . . . . 361
Industry Canada compliance statement . . . . 362
Australia and New Zealand Class A Statement . . . . 362
European Union Electromagnetic Compatibility Directive . . . . 362
Germany Electromagnetic Compatibility Directive . . . . 362
People's Republic of China Class A Statement . . . . 363
Taiwan Class A compliance statement . . . . 364
Taiwan Contact Information . . . . 364
Japan VCCI Council Class A statement . . . . 364
Japan Electronics and Information Technology Industries Association Statement . . . . 364
Korean Communications Commission Class A Statement . . . . 365
Russia Electromagnetic Interference Class A Statement . . . . 365
Index . . . . . . . . . . . . . . . 367
Figures
1. SAN Volume Controller system in a fabric . . . . 2
2. Data flow in a SAN Volume Controller system . . . . 3
3. Example of a basic volume . . . . 4
4. Example of mirrored volumes . . . . 4
5. Example of stretched volumes . . . . 5
6. Example of HyperSwap volumes . . . . 6
7. Example of a standard system topology . . . . 7
8. Example of a stretched system topology . . . . 7
9. Example of a HyperSwap system topology . . . . 8
10. SAN Volume Controller nodes with internal Flash drives . . . . 10
11. SAN Volume Controller configuration node . . . . 12
12. SAN Volume Controller 2145-DH8 front panel . . . . 16
13. SAN Volume Controller 2145-CG8 front panel . . . . 18
14. SAN Volume Controller 2145-CF8 front panel . . . . 18
15. SAN Volume Controller 2145-DH8 operator information panel . . . . 21
16. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel . . . . 21
17. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel . . . . 22
18. SAN Volume Controller 2145-DH8 rear-panel indicators . . . . 24
19. Connectors on the rear of the SAN Volume Controller 2145-DH8 . . . . 25
20. Power connector . . . . 25
21. SAN Volume Controller 2145-DH8 service ports . . . . 26
22. SAN Volume Controller 2145-DH8 unused Ethernet port . . . . 26
23. SAN Volume Controller 2145-CG8 rear-panel indicators . . . . 27
24. SAN Volume Controller 2145-CG8 rear-panel indicators for the 10 Gbps Ethernet feature . . . . 27
25. Connectors on the rear of the SAN Volume Controller 2145-CG8 . . . . 27
26. 10 Gbps Ethernet ports on the rear of the SAN Volume Controller 2145-CG8 . . . . 28
27. Power connector . . . . 28
28. Service ports of the SAN Volume Controller 2145-CG8 . . . . 29
29. SAN Volume Controller 2145-CG8 port not used . . . . 29
30. SAN Volume Controller 2145-CF8 rear-panel indicators . . . . 30
31. Connectors on the rear of the SAN Volume Controller 2145-CG8 or 2145-CF8 . . . . 30
32. Power connector . . . . 31
33. Service ports of the SAN Volume Controller 2145-CF8 . . . . 31
34. SAN Volume Controller 2145-CF8 port not used . . . . 31
35. SAN Volume Controller 2145-DH8 AC, DC, and power-error LEDs . . . . 34
36. SAN Volume Controller 2145-CG8 or 2145-CF8 AC, DC, and power-error LEDs . . . . 34
37. Photo of the redundant AC-power switch . . . . 43
38. A four-node SAN Volume Controller system with the redundant AC-power switch feature . . . . 45
39. Rack cabling example . . . . 47
40. 2145 UPS-1U front-panel assembly . . . . 49
41. 2145 UPS-1U connectors and switches . . . . 51
42. 2145 UPS-1U dip switches . . . . 52
43. Ports not used by the 2145 UPS-1U . . . . 52
44. Power connector . . . . 52
45. SAN Volume Controller front-panel assembly . . . . 91
46. Example of a boot progress display . . . . 91
47. Example of an error code for a clustered system . . . . 92
48. Example of a node error code . . . . 92
49. Node rescue display . . . . 93
50. Validate WWNN? navigation . . . . 95
51. SAN Volume Controller options on the front-panel display . . . . 97
52. Viewing the IPv6 address on the front-panel display . . . . 100
53. Upper options of the actions menu on the front panel . . . . 104
54. Middle options of the actions menu on the front panel . . . . 105
55. Lower options of the actions menu on the front panel . . . . 106
56. Language? navigation . . . . 116
57. Example of inventory information email . . . . 137
58. Example of a boot error code . . . . 159
59. Example of a boot progress display . . . . 160
60. Example of a displayed node error code . . . . 160
61. Example of a node-rescue error code . . . . 161
62. Example of a create error code for a clustered system . . . . 162
63. Example of a recovery error code . . . . 162
64. Example of an error code for a clustered system . . . . 162
65. Node rescue display . . . . 272
66. Error LED on the SAN Volume Controller models . . . . 277
67. SAN Volume Controller 2145-DH8 operator-information panel . . . . 278
68. Hardware boot display . . . . 278
69. SAN Volume Controller 2145-DH8 front panel . . . . 279
70. Power LED on the SAN Volume Controller 2145-DH8 . . . . 284
71. Power LED indicator on the rear panel of the SAN Volume Controller 2145-DH8 . . . . 285
72. AC, dc, and power-supply error LED indicators on the rear panel of the SAN Volume Controller 2145-DH8 . . . . 286
73. Power LED on the SAN Volume Controller models 2145-CG8 or 2145-CF8 operator-information panel . . . . 289
74. Power LED indicator on the rear panel of the SAN Volume Controller 2145-CG8 or 2145-CF8 . . . . 290
75. Power LED indicator and ac and dc indicators on the rear panel of the SAN Volume Controller 2145-CG8 or 2145-CF8 . . . . 291
76. 2145 UPS-1U front-panel assembly . . . . 293
77. Power control button on the SAN Volume Controller models . . . . 306
78. SAN Volume Controller service controller error light . . . . 308
79. Front-panel display when push buttons are pressed . . . . 309
80. Port 2 Ethernet link LED on the SAN Volume Controller rear panel . . . . 311
81. Ethernet ports on the rear of the SAN Volume Controller 2145-DH8 . . . . 311
82. SAN Volume Controller 2145-DH8 operator-information panel . . . . 323
83. Press the release latch . . . . 324
84. SAN Volume Controller 2145-DH8 light path diagnostics panel . . . . 324
85. SAN Volume Controller 2145-DH8 system board LEDs . . . . 325
86. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel . . . . 330
87. SAN Volume Controller 2145-CG8 or 2145-CF8 light path diagnostics panel . . . . 330
88. SAN Volume Controller 2145-CG8 system board LEDs diagnostics panel . . . . 332
89. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel . . . . 336
90. SAN Volume Controller 2145-CG8 or 2145-CF8 light path diagnostics panel . . . . 336
91. SAN Volume Controller 2145-CF8 system board LEDs diagnostics panel . . . . 337
92. Hardware boot display . . . . 342
93. Node rescue display . . . . 342
94. Keyboard and monitor ports on the SAN Volume Controller 2145-CF8 . . . . 343
95. Keyboard and monitor ports on the SAN Volume Controller 2145-CG8 . . . . 343
96. Keyboard and monitor ports on the SAN Volume Controller 2145-DH8, front . . . . 344
97. Keyboard and monitor ports on the SAN Volume Controller 2145-DH8, rear . . . . 344
Tables
1. IBM websites for help, services, and information . . . . xii
2. SAN Volume Controller library . . . . xii
3. Other IBM publications . . . . xiv
4. IBM documentation and related websites . . . . xiv
5. IBM websites for help, services, and information . . . . xv
6. System topology and volume summary . . . . 8
7. SAN Volume Controller communications types . . . . 9
8. Link state and activity for the bottom Fibre Channel LED . . . . 32
9. Link speed for the top Fibre Channel LED . . . . 32
10. Actual link speeds . . . . 32
11. Input-voltage requirements . . . . 35
12. Power consumption . . . . 35
13. Physical specifications . . . . 36
14. Dimensions and weight . . . . 36
15. Additional space requirements . . . . 36
16. Maximum heat output of each 2145-DH8 node . . . . 37
17. Input-voltage requirements . . . . 37
18. Maximum power consumption . . . . 37
19. Environment requirements without redundant AC power . . . . 38
20. Environment requirements with redundant AC power . . . . 38
21. Dimensions and weight . . . . 39
22. Additional space requirements . . . . 39
23. Maximum heat output of each SAN Volume Controller 2145-CG8 node . . . . 39
24. Maximum heat output of each 2145 UPS-1U . . . . 39
25. Input-voltage requirements . . . . 40
26. Power requirements for each node . . . . 40
27. Environment requirements without redundant AC power . . . . 41
28. Environment requirements with redundant AC power . . . . 41
29. Dimensions and weight . . . . 41
30. Additional space requirements . . . . 42
31. Heat output of each SAN Volume Controller 2145-CF8 node . . . . 42
32. Rack space required for redundant AC-power switch . . . . 43
33. Rack space required for redundant AC-power switch side mounting plates . . . . 44
34. Rack space required for the 2145 UPS-1U . . . . 53
35. Heat output of the 2145 UPS-1U . . . . 53
36. FRUs for SAN Volume Controller 2145-24F expansion enclosure . . . . 54
37. FRU part for the SAN Volume Controller 2145-24F SAS drive units . . . . 54
38. SAN Volume Controller 2145-CG8 FRU descriptions . . . . 54
39. SAN Volume Controller 2145-CF8 FRU descriptions . . . . 56
40. Ethernet feature FRU descriptions . . . . 57
41. Flash drive feature FRU descriptions . . . . 58
42. 2145 UPS-1U FRU descriptions . . . . 58
43. Fields for the system board . . . . 84
44. Fields for the batteries . . . . 84
45. Fields for the processors . . . . 85
46. Fields for the fans . . . . 85
47. Fields that are repeated for each installed memory module . . . . 85
48. Fields that are repeated for each adapter that is installed . . . . 85
49. Fields that are repeated for each SCSI, IDE, SATA, and SAS device that is installed . . . . 86
50. Fields that are specific to the node software . . . . 86
51. Fields that are provided for the front panel assembly . . . . 86
52. Fields that are provided for the Ethernet port . . . . 86
53. Fields that are provided for the power supplies in the node . . . . 87
54. Fields that are provided for the uninterruptible power supply assembly that is powering the node . . . . 87
55. Fields that are provided for the SAS host bus adapter (HBA) . . . . 87
56. Fields that are provided for the SAS flash drive . . . . 88
57. Fields that are provided for the small form factor pluggable (SFP) transceiver . . . . 88
58. Fields that are provided for the system properties . . . . 89
59. When options are available . . . . 102
60. Statistics collection for individual nodes . . . . 120
61. Statistic collection for volumes for individual nodes . . . . 121
62. Statistic collection for volumes that are used in Metro Mirror and Global Mirror relationships for individual nodes . . . . 121
63. Statistic collection for node ports . . . . 122
64. Statistic collection for nodes . . . . 123
65. Cache statistics collection for volumes and volume copies . . . . 124
66. Statistic collection for volume cache per individual nodes . . . . 127
67. XML statistics for an IP Partnership port . . . . 128
68. Description of data fields for the event log . . . . 132
69. Notification levels . . . . 133
70. SAN Volume Controller notification types and corresponding syslog level codes . . . . 135
71. SAN Volume Controller values of user-defined message origin identifiers and syslog facility codes . . . . 135
72. Informational events . . . . 138
73. SCSI status . . . . 143
74. SCSI sense keys, codes, and qualifiers . . . . 143
75. Reason codes . . . . 144
76. Object types . . . . 145
77. Error event IDs and error codes . . . . 146
78. Message classification number range . . . . 162
79. Files created by the backup process . . . . 265
80. Bad block errors . . . . 273
81. 2145 UPS-1U error indicators . . . . 293
82. Fibre Channel assemblies . . . . 318
83. SAN Volume Controller Fibre Channel adapter connection hardware . . . . 320
84. Diagnostics panel LED . . . . 326
85. Diagnostics panel LED prescribed actions . . . . 333
86. Diagnostics panel LED prescribed actions . . . . 339
About this guide
This guide describes how to troubleshoot the IBM® SAN Volume Controller.
The chapters that follow introduce you to the SAN Volume Controller, the expansion
enclosure, the redundant AC-power switch, and the uninterruptible power supply.
They describe how you can configure and check the status of one SAN Volume
Controller node or a clustered system of nodes through the front panel, with the
service assistant GUI, or with the management GUI.
The vital product data (VPD) chapter provides information about the VPD that
uniquely defines each hardware and microcode element that is in the SAN Volume
Controller. You can also learn how to diagnose problems using the SAN Volume
Controller.
The maintenance analysis procedures (MAPs) can help you analyze failures that
occur in a SAN Volume Controller. With the MAPs, you can isolate the
field-replaceable units (FRUs) of the SAN Volume Controller that fail. Begin all
problem determination and repair procedures from “MAP 5000: Start” on page 275.
Who should use this guide
This guide is intended for system administrators or systems services
representatives who use and diagnose problems with the SAN Volume Controller,
the redundant AC-power switch, and the uninterruptible power supply.
Emphasis
Different typefaces are used in this guide to show emphasis.
The following typefaces are used to show emphasis:
Boldface
    Text in boldface represents menu items.
Bold monospace
    Text in bold monospace represents command names.
Italics
    Text in italics is used to emphasize a word. In command syntax, it is used
    for variables for which you supply actual values, such as a default
    directory or the name of a system.
Monospace
    Text in monospace identifies the data or commands that you type, samples of
    command output, examples of program code or messages from the system, or
    names of command flags, parameters, arguments, and name-value pairs.
SAN Volume Controller library and related publications
Product manuals, other publications, and websites contain information that relates
to SAN Volume Controller.
IBM Knowledge Center for SAN Volume Controller
The information collection in the IBM Knowledge Center contains all of the
information that is required to install, configure, and manage the system. The
information collection in the IBM Knowledge Center is updated between product
releases to provide the most current documentation. The information collection is
available at the following website:
publib.boulder.ibm.com/infocenter/svc/ic/index.jsp
SAN Volume Controller library
Unless otherwise noted, the publications in the library are available in Adobe
portable document format (PDF) from a website.
www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
Click Search for publications to find the online publications you are interested in,
and then view or download the publication by clicking the appropriate item.
Table 1 lists websites where you can find help, services, and more information.
Table 1. IBM websites for help, services, and information
Website                                                         Address
Directory of worldwide contacts                                 http://www.ibm.com/planetwide
Support for SAN Volume Controller (2145)                        www.ibm.com/storage/support/2145
Support for IBM System Storage® and IBM TotalStorage products   www.ibm.com/storage/support/
Each of the PDF publications in the Table 2 library is also available in the IBM
Knowledge Center by clicking the number in the “Order number” column:
Table 2. SAN Volume Controller library
Title                         Description                     Order number

IBM SAN Volume Controller     The guide provides the
Model 2145-DH8 Hardware       instructions that the IBM
Installation Guide            service representative uses to
                              install the hardware for SAN
                              Volume Controller model
                              2145-DH8.

IBM System Storage SAN        The guide provides the
Volume Controller Model       instructions that the IBM
2145-CG8 Hardware             service representative uses to
Installation Guide            install the hardware for SAN
                              Volume Controller model
                              2145-CG8.

IBM System Storage SAN        The guide provides the
Volume Controller Hardware    instructions that the IBM
Maintenance Guide             service representative uses to
                              service the SAN Volume
                              Controller hardware,
                              including the removal and
                              replacement of parts.

IBM SAN Volume Controller     The guide describes the
Troubleshooting Guide         features of each SAN Volume
                              Controller model, explains
                              how to use the front panel or
                              service assistant GUI, and
                              provides maintenance
                              analysis procedures to help
                              you diagnose and solve
                              problems with the SAN
                              Volume Controller.

IBM SAN Volume Controller     The guide contains translated
Safety Notices                caution and danger
                              statements. Each caution and
                              danger statement in the SAN
                              Volume Controller
                              documentation has a
                              number. Use the number to
                              locate the corresponding
                              statement in your language
                              in the IBM SAN Volume
                              Controller Safety Notices
                              document.

IBM System Storage SAN        The document introduces the
Volume Controller Read First  major components of the
Flyer                         SAN Volume Controller
                              system and describes how to
                              install the hardware and
                              software.

IBM System Storage SAN        The guide describes the
Volume Controller and IBM     commands that you can use
Storwize V7000 Command-Line   from the SAN Volume
Interface User's Guide        Controller command-line
                              interface (CLI).

IBM Statement of Limited      This multilingual document
Warranty (2145 and 2076)      provides information about
                              the IBM warranty for
                              machine types 2145 and
                              2076.

IBM License Agreement for     This multilingual guide
Machine Code                  contains the License
                              Agreement for Machine
                              Code for the SAN Volume
                              Controller product.
Other IBM publications
Table 3 on page xiv lists an IBM publication that contains information that is
related to SAN Volume Controller.
Table 3. Other IBM publications
Title                                  Description                              Order number
IBM System Storage Multipath           The guide describes the IBM System       GC52-1309
Subsystem Device Driver User's Guide   Storage Multipath Subsystem Device
                                       Driver for IBM System Storage products
                                       and how to use it with the SAN Volume
                                       Controller.
IBM documentation and related websites
Table 4 lists websites that provide publications and other information about the
SAN Volume Controller or related products or technologies. The IBM Redbooks®
publications provide positioning and value guidance, installation and
implementation experiences, solution scenarios, and step-by-step procedures for
various products.
Table 4. IBM documentation and related websites
Website                     Address
IBM Publications Center     www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
IBM Redbooks publications   www.redbooks.ibm.com/
Related accessibility information
To view a PDF file, you need Adobe Reader, which can be downloaded from the
Adobe website:
www.adobe.com/support/downloads/main.html
How to order IBM publications
The IBM Publications Center is a worldwide central repository for IBM product
publications and marketing material.
The IBM Publications Center offers customized search functions to help you find
the publications that you need. Some publications are available for you to view or
download at no charge. You can also order publications. The publications center
displays prices in your local currency. You can access the IBM Publications Center
through the following website:
www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
Related websites
The following websites provide information about SAN Volume Controller or
related products or technologies:
Type of information                          Website
SAN Volume Controller support                www.ibm.com/storage/support/2145
Technical support for IBM storage products   www.ibm.com/storage/support/
IBM Electronic Support registration          www-01.ibm.com/support/electronicsupport/
Sending your comments
Your feedback is important in helping to provide the most accurate and highest
quality information.
To submit any comments about this book or any other SAN Volume Controller
documentation:
• Send your comments by email to starpubs@us.ibm.com. Include the following
information for this publication or use suitable replacements for the publication
title and form number for the publication on which you are commenting:
– Publication title: IBM SAN Volume Controller Troubleshooting Guide
– Publication form number: GC27-2284-10
– Page, table, or illustration numbers that you are commenting on
– A detailed description of any information that should be changed
How to get information, help, and technical assistance
If you need help, service, technical assistance, or just want more information about
IBM products, you will find a wide variety of sources available from IBM to assist
you.
Information
IBM maintains pages on the web where you can get information about IBM
products and fee services, product implementation and usage assistance, break and
fix service support, and the latest technical information. For more information,
refer to Table 5.
Table 5. IBM websites for help, services, and information
Website                                                        Address
Directory of worldwide contacts                                http://www.ibm.com/planetwide
Support for SAN Volume Controller (2145)                       www.ibm.com/storage/support/2145
Support for IBM System Storage and IBM TotalStorage products   www.ibm.com/storage/support/
Note: Available services, telephone numbers, and web links are subject to change
without notice.
Help and service
Before calling for support, be sure to have your IBM Customer Number available.
If you are in the US or Canada, you can call 1 (800) IBM SERV for help and
service. From other parts of the world, see http://www.ibm.com/planetwide for
the number that you can call.
When calling from the US or Canada, choose the storage option. The agent decides
where to route your call, to either storage software or storage hardware, depending
on the nature of your problem.
If you call from somewhere other than the US or Canada, you must choose the
software or hardware option when calling for assistance. Choose the software
option if you are uncertain if the problem involves the SAN Volume Controller
software or hardware. Choose the hardware option only if you are certain the
problem solely involves the SAN Volume Controller hardware. When calling IBM
for service regarding the product, follow these guidelines for the software and
hardware options:
Software option
Identify the SAN Volume Controller product as your product and supply
your customer number as proof of purchase. The customer number is a
7-digit number (0000000 to 9999999) assigned by IBM when the product is
purchased. Your customer number should be located on the customer
information worksheet or on the invoice from your storage purchase. If
asked for an operating system, use Storage.
Hardware option
Provide the serial number and appropriate 4-digit machine type. For SAN
Volume Controller, the machine type is 2145.
In the US and Canada, hardware service and support can be extended to 24x7 on
the same day. The base warranty is 9x5 on the next business day.
Getting help online
You can find information about products, solutions, partners, and support on the
IBM website.
To find up-to-date information about products, services, and partners, visit the IBM
website at www.ibm.com/storage/support/2145.
Before you call
Make sure that you have taken steps to try to solve the problem yourself before
you call.
Some suggestions for resolving the problem before calling IBM Support include:
v Check all cables to make sure that they are connected.
v Check all power switches to make sure that the system and optional devices are
turned on.
v Use the troubleshooting information in your system documentation. The
troubleshooting section of the information center contains procedures to help
you diagnose problems.
v Go to the IBM Support website at www.ibm.com/storage/support/2145 to check
for technical information, hints, tips, and new device drivers or to submit a
request for information.
Using the documentation
Information about your IBM storage system is available in the documentation that
comes with the product.
That documentation includes printed documents, online documents, readme files,
and help files in addition to the information center. See the troubleshooting
information for diagnostic instructions. The troubleshooting procedure might
require you to download updated device drivers or software. IBM maintains pages
on the web where you can get the latest technical information and download
device drivers and updates. To access these pages, go to
www.ibm.com/storage/support/2145 and follow the instructions. Also, some documents are available
through the IBM Publications Center.
Sign up for the Support Line Offering
If you have questions about how to use and configure the machine, sign up for the
IBM Support Line offering to get a professional answer.
The maintenance supplied with the system provides support when there is a
problem with a hardware component or a fault in the system machine code. At
times, you might need expert advice about using a function provided by the
system or about how to configure the system. Purchasing the IBM Support Line
offering gives you access to this professional advice while deploying your system,
and in the future.
Contact your local IBM sales representative or your support group for availability
and purchase information.
Summary of changes for GC27-2284-07 SAN Volume Controller
Troubleshooting Guide
The summary of changes provides a list of new and updated information since the
last version of the guide.
New information
The following information has been added to this guide since the previous edition,
GC27-2284-06.
v “USB flash drive interface” on page 69
v “Resolving a problem with SSL/TLS clients” on page 245
v “Procedure: Making drives support protection information” on page 246
Updated information
This version includes updates to:
v “Technician port for node access” on page 75
v “Error event IDs and error codes” on page 146
Summary of changes for GC27-2284-06 SAN Volume Controller
Troubleshooting Guide
The summary of changes provides a list of new and updated information since the
last version of the guide.
New information
The following information has been added to this guide since the previous edition,
GC27-2284-05.
v “SAN Volume Controller 2145-DH8 front panel controls and indicators” on page
15
v “SAN Volume Controller 2145-DH8 operator information panel” on page 20
v “SAN Volume Controller 2145-DH8 environment requirements” on page 35
v “MAP 5040: Power SAN Volume Controller 2145-DH8” on page 283
Chapter 1. SAN Volume Controller overview
The SAN Volume Controller combines software and hardware into a
comprehensive, modular appliance that uses symmetric virtualization.
Symmetric virtualization is achieved by creating a pool of managed disks (MDisks)
from the attached storage systems. Those storage systems are then mapped to a set
of volumes for use by attached host systems. System administrators can view and
access a common pool of storage on the storage area network (SAN). This
functionality helps administrators to use storage resources more efficiently and
provides a common base for advanced functions.
A SAN is a high-speed Fibre Channel network that connects host systems and
storage devices. In a SAN, a host system can be connected to a storage device
across the network. The connections are made through units such as routers and
switches. The area of the network that contains these units is known as the fabric of
the network.
IBM Real-time Compression™ software
IBM SAN Volume Controller is built with IBM Spectrum Virtualize™ software,
which is part of the IBM Spectrum Storage™ family.
The software provides these functions for the host systems that attach to SAN
Volume Controller.
v Creates a single pool of storage
v Provides logical unit virtualization
v Manages logical volumes
v Mirrors logical volumes
The system also provides these functions.
v Large scalable cache
v Copy Services
– IBM FlashCopy® (point-in-time copy) function, including thin-provisioned
FlashCopy to make multiple targets affordable
– IBM HyperSwap® (active-active copy) function
– Metro Mirror (synchronous copy)
– Global Mirror (asynchronous copy)
– Data migration
v Space management
– IBM Easy Tier® function to migrate the most frequently used data to
higher-performance storage
– Metering of service quality when combined with IBM Spectrum Control Base
Edition. For information, refer to the IBM Spectrum Control Base Edition
documentation.
– Thin-provisioned logical volumes
– Compressed volumes to consolidate storage
Figure 1 shows hosts, SAN Volume Controller nodes, and RAID storage systems
connected to a SAN fabric. The redundant SAN fabric comprises a fault-tolerant
arrangement of two or more counterpart SANs that provide alternative paths for
each SAN-attached device.
[Figure: hosts in a host zone and RAID storage systems in a storage system zone, connected to the nodes through a redundant SAN fabric.]
Figure 1. SAN Volume Controller system in a fabric
Volumes
A system of SAN Volume Controller nodes presents volumes to the hosts. Most of
the advanced functions that SAN Volume Controller provides are defined on
volumes. These volumes are created from managed disks (MDisks) that are
presented by the RAID storage systems. The volumes can also be created by arrays
that are provided by flash drives in an expansion enclosure. All data transfer
occurs through the SAN Volume Controller node, which is described as symmetric
virtualization.
Figure 2 on page 3 shows the data flow across the fabric.
[Figure: hosts send I/O to volumes through the redundant SAN fabric and the nodes; I/O is sent to managed disks on the RAID storage systems.]
Figure 2. Data flow in a SAN Volume Controller system
The nodes in a system are arranged into pairs that are known as I/O groups. A
single pair is responsible for serving I/O on a volume. Because a volume is served
by two nodes, no loss of availability occurs if one node fails or is taken offline.
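The availability rule above can be sketched as a small model. This is an illustration only: the `IOGroup` class, its field names, and the node names are hypothetical and are not part of any SAN Volume Controller interface.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IOGroup:
    """Hypothetical model of an I/O group: a pair of nodes serving a volume."""
    name: str
    nodes_online: Dict[str, bool] = field(default_factory=dict)  # node name -> online?

    def __post_init__(self) -> None:
        # The text describes an I/O group as a pair of nodes.
        if len(self.nodes_online) != 2:
            raise ValueError("an I/O group is modeled here as exactly two nodes")

    def can_serve_io(self) -> bool:
        # The volume stays available while at least one node of the pair is online.
        return any(self.nodes_online.values())


group = IOGroup("io_grp0", {"node1": True, "node2": True})
group.nodes_online["node1"] = False   # one node fails or is taken offline
print(group.can_serve_io())           # the partner node still serves I/O
```

The model deliberately ignores performance effects and only captures the statement that losing one node of the pair does not remove access to the volume.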
Volume types
You can create the following types of volumes on the system.
v Basic volumes, where a single copy of the volume is cached in one I/O group.
Basic volumes can be established in any system topology; however, Figure 3 on
page 4 shows a standard system topology.
[Figure: a standard system with one I/O group that contains a basic volume with a single volume copy.]
Figure 3. Example of a basic volume
v Mirrored volumes, where copies of the volume can either be in the same storage
pool or in different storage pools. The volume is cached in a single I/O group.
Typically, mirrored volumes are established in a standard system topology.
[Figure: a standard system with one I/O group that contains mirrored volumes, each with two volume copies.]
Figure 4. Example of mirrored volumes
v Stretched volumes, where copies of a single volume are in different storage pools
at different sites. The volume is cached in one I/O group. Stretched volumes are
only available in stretched topology systems.
[Figure: a stretched system with one I/O group; a stretched volume has one volume copy at Site 1 and one at Site 2.]
Figure 5. Example of stretched volumes
v HyperSwap volumes, where copies of a single volume are in different storage
pools that are on different sites. The volume is cached in two I/O groups that
are on different sites. These volumes can be created only when the system
topology is HyperSwap.
[Figure: a HyperSwap system with I/O Group 1 at Site 1 and I/O Group 2 at Site 2; a HyperSwap volume has volume copies and change volumes at each site, linked by an active-active relationship.]
Figure 6. Example of HyperSwap volumes
System topology
The topology property of a SAN Volume Controller system can be set to one of the
following states.
Note: You cannot mix I/O groups of different topologies in the same system.
v Standard topology, where all nodes in the system are at the same site.
[Figure: I/O Group 1 with Node 1 and Node 2, both at Site 1.]
Figure 7. Example of a standard system topology
v Stretched topology, where each node of an I/O group is at a different site. When
one site is not available, access to a volume can continue but with reduced
performance.
[Figure: I/O Group 1 with Node 1 at Site 1 and Node 2 at Site 2.]
Figure 8. Example of a stretched system topology
v HyperSwap topology, where the system consists of at least two I/O groups.
Each I/O group is at a different site. Both nodes of an I/O group are at the
same site. A volume can be active on two I/O groups so that it can immediately
be accessed by the other site when a site is not available.
[Figure: I/O Group 1 at Site 1 and I/O Group 2 at Site 2, with two nodes in each I/O group.]
Figure 9. Example of a HyperSwap system topology
Summary of system topology and volumes
Table 6 summarizes the types of volumes that can be associated with each system
topology.
Table 6. System topology and volume summary
              Volume type
Topology      Basic   Mirrored   Stretched   HyperSwap   Custom
Standard      X       X          -           -           X
Stretched     X       -          X           -           X
HyperSwap     X       -          -           X           X
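As a minimal sketch, the summary in Table 6 can be expressed as a lookup. The mapping below is one reading of the table layout (Basic and Custom in every topology, Mirrored in Standard, Stretched in Stretched, HyperSwap in HyperSwap); the dictionary and function names are hypothetical and do not correspond to any product command.

```python
# Illustrative lookup based on a reading of Table 6; all names are hypothetical.
ALLOWED_VOLUME_TYPES = {
    "standard": {"basic", "mirrored", "custom"},
    "stretched": {"basic", "stretched", "custom"},
    "hyperswap": {"basic", "hyperswap", "custom"},
}


def volume_type_allowed(topology: str, volume_type: str) -> bool:
    """Return True if the table marks the volume type for the given topology."""
    key = topology.strip().lower()
    if key not in ALLOWED_VOLUME_TYPES:
        raise ValueError(f"unknown topology: {topology!r}")
    return volume_type.strip().lower() in ALLOWED_VOLUME_TYPES[key]


print(volume_type_allowed("Stretched", "Stretched"))
print(volume_type_allowed("Standard", "HyperSwap"))
```

A check like this could sit in a planning script before a volume layout is proposed; the system itself remains the authority on what can be created.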
System management
The SAN Volume Controller nodes in a system operate as a single system and
present a single point of control for system management and service. System
management and error reporting are provided through an Ethernet interface to one
of the nodes in the system, which is called the configuration node. The configuration
node runs a web server and provides a command-line interface (CLI). Any node in
the system can be the configuration node. If the current configuration node fails, a
new configuration node is selected from the remaining nodes. Each node also
provides a command-line interface and web interface for initiating hardware
service actions.
Fabric types
I/O operations between hosts and SAN Volume Controller nodes and between
SAN Volume Controller nodes and arrays use the SCSI standard. The SAN Volume
Controller nodes communicate with each other through private SCSI commands.
Fibre Channel over Ethernet connectivity is supported on SAN Volume Controller
models 2145-DH8 and 2145-CG8.
Table 7 shows the fabric type that can be used for communicating between hosts,
nodes, and RAID storage systems. These fabric types can be used at the same time.
Table 7. SAN Volume Controller communications types
Communications type                                  Host to SAN Volume   SAN Volume Controller   SAN Volume Controller nodes to
                                                     Controller nodes     to storage system       SAN Volume Controller nodes
Fibre Channel SAN                                    Yes                  Yes                     Yes
iSCSI (1 Gbps Ethernet or 10 Gbps Ethernet)          Yes                  Yes                     No
Fibre Channel over Ethernet SAN (10 Gbps Ethernet)   Yes                  Yes                     Yes
Flash drives
Some SAN Volume Controller nodes contain flash drives or are attached to
expansion enclosures that contain flash drives. These flash drives can be used to
create RAID-managed disks (MDisks) that in turn can be used to create volumes.
In SAN Volume Controller 2145-DH8 nodes, the flash drives are in an expansion
enclosure that is connected to both sides of an I/O group.
Flash drives provide host servers with a pool of high-performance storage for
critical applications. Figure 10 on page 10 shows this configuration. MDisks on
flash drives can also be placed in a storage pool with MDisks from regular RAID
storage systems. IBM Easy Tier performs automatic data placement within that
storage pool by moving high-activity data onto better-performing storage.
[Figure: hosts send I/O through the redundant SAN fabric to volumes, which are mapped to internal solid-state drives in a node with SSDs.]
Figure 10. SAN Volume Controller nodes with internal Flash drives
SAN Volume Controller hardware
Each SAN Volume Controller node is an individual server in a SAN Volume
Controller clustered system on which the SAN Volume Controller software runs.
The nodes are always installed in pairs; a minimum of one pair and a maximum of
four pairs of nodes constitute a system. Each pair of nodes is known as an I/O
group.
I/O groups take the storage that is presented to the SAN by the storage systems as
MDisks and translate the storage into logical disks (volumes) that are used by
applications on the hosts. A node is in only one I/O group and provides access to
the volumes in that I/O group.
SAN Volume Controller 2145-DH8 node features
The SAN Volume Controller 2145-DH8 node has the following features:
v A 19-inch rack-mounted enclosure
v At least one Fibre Channel adapter or one 10 Gbps Ethernet adapter
v Optional second, third, and fourth Fibre Channel adapters
v 32 GB memory per processor
v One or two, eight-core processors
v Dual redundant power supplies
v Dual redundant batteries for better reliability, availability, and serviceability than
for a SAN Volume Controller 2145-CG8 with an uninterruptible power supply
v Up to two SAN Volume Controller 2145-24F expansion enclosures to house up to
24 flash drives each
v SAN Volume Controller 2145-12F expansion enclosures to house up to 12 LFF
HDD or flash drives
v iSCSI host attachment (1 Gbps Ethernet and optional 10 Gbps Ethernet)
v Supports optional IBM Real-time Compression
v A dedicated technician port for local access to the initialization tool or the
service assistant interface.
SAN Volume Controller 2145-CG8 node features
The SAN Volume Controller 2145-CG8 node has the following features:
v A 19-inch rack-mounted enclosure
v One 4-port 8 Gbps Fibre Channel adapter
v One optional 2-port 10 Gbps Fibre Channel over Ethernet converged network
adapter
v Optional second 4-port 8 Gbps Fibre Channel adapter
v 24 GB memory
v Fibre Channel over Ethernet host attachment (need to add only one)
v One quad-core processor
v Dual, redundant power supplies
v Supports up to four optional flash drives
v iSCSI host attachment (1 Gbps Ethernet and optional 10 Gbps Ethernet)
v Supports optional IBM Real-time Compression
Note: The optional flash drives and optional 10 Gbps Ethernet cannot be in the
same 2145-CG8 node.
Systems
A system is a collection of SAN Volume Controller nodes.
A system can consist of two to eight SAN Volume Controller nodes.
All configuration settings are replicated across all nodes in the system.
Management IP addresses are assigned to the system. Each interface accesses the
system remotely through the Ethernet system-management addresses, also known
as the primary and secondary system IP addresses.
Configuration node
A configuration node is a single node that manages configuration activity of the
system.
If the configuration node fails, the system chooses a new configuration node. This
action is called configuration node failover. The new configuration node takes over
the management IP addresses. Thus, you can access the system through the same
IP addresses although the original configuration node has failed. During the
failover, there is a short period when you cannot use the command-line tools or
management GUI.
Figure 11 on page 12 shows an example of a clustered system that contains four
nodes. Node 1 is the configuration node. User requests (▌1▐) are handled by node
1.
[Figure: a system with Node 1 through Node 4; Node 1 is the configuration node and provides the IP interface, with numbered callouts.]
Figure 11. SAN Volume Controller configuration node
Configuration node addressing
At any given time, only one node within a SAN Volume Controller clustered
system is assigned the management IP addresses.
An IP address for the clustered system must be assigned to Ethernet port 1. An IP
address can also be assigned to Ethernet port 2. These are the only ports that can
be assigned management IP addresses.
This node then acts as the focal point for all configuration and other requests that
are made from the management GUI application or the CLI. This node is known as
the configuration node.
If the configuration node is stopped or fails, the remaining nodes in the system
determine which node will take on the role of configuration node. The new
configuration node binds the management IP addresses to its Ethernet ports. It
broadcasts this new mapping so that connections to the system configuration
interface can be resumed.
The new configuration node broadcasts the new IP address mapping using the
Address Resolution Protocol (ARP). You must configure some switches to forward
the ARP packet on to other devices on the subnetwork. Ensure that all Ethernet
devices are configured to pass on unsolicited ARP packets. Otherwise, if the ARP
packet is not forwarded, a device loses its connection to the SAN Volume
Controller system.
If a device loses its connection to the SAN Volume Controller system, it can
regenerate the address quickly if the device is on the same subnetwork as the
system. However, if the device is not on the same subnetwork, it might take hours
for the address resolution cache of the gateway to refresh. In this case, you can
restore the connection by establishing a command line connection to the system
from a terminal that is on the same subnetwork, and then by starting a secure copy
to the device that has lost its connection.
Management IP failover
If the configuration node fails, the IP addresses for the clustered system are
transferred to a new node. The system services are used to manage the transfer of
the management IP addresses from the failed configuration node to the new
configuration node.
The following changes are performed by the system service:
v If software on the failed configuration node is still operational, the software
shuts down the management IP interfaces. If the software cannot shut down the
management IP interfaces, the hardware service forces the node to shut down.
v When the management IP interfaces shut down, all remaining nodes choose a
new node to host the configuration interfaces.
v The new configuration node initializes the configuration daemons, including SSHD
and HTTPD, and then binds the management IP interfaces to its Ethernet ports.
v The router is configured as the default gateway for the new configuration node.
v The routing tables are established on the new configuration node for the management
IP addresses. The new configuration node sends five unsolicited Address Resolution
Protocol (ARP) packets for each IP address to the local subnet broadcast address.
The ARP packets contain the management IP and the Media Access Control
(MAC) address for the new configuration node. All systems that receive ARP
packets are forced to update their ARP tables. After the ARP tables are updated,
these systems can connect to the new configuration node.
Note: Some Ethernet devices might not forward ARP packets. If the ARP
packets are not forwarded, connectivity to the new configuration node cannot be
established automatically. To avoid this problem, configure all Ethernet devices
to pass unsolicited ARP packets. You can restore lost connectivity by logging in
to the system and starting a secure copy to the affected system. Starting a secure
copy forces an update to the ARP cache for all systems connected to the same
switch as the affected system.
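To make the contents of an unsolicited ARP packet concrete, the following sketch builds one Ethernet frame that announces an IP-to-MAC mapping. It is an illustration of the general ARP frame layout, not the system's implementation: the function name, the example MAC and IP addresses, the use of the Ethernet broadcast address as the destination, and the choice of the request opcode are all assumptions. The frame is only constructed, never sent.

```python
import socket
import struct


def build_unsolicited_arp(mac: str, ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying an unsolicited ARP announcement."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)  # 4-byte IPv4 address

    # Ethernet header: broadcast destination, sender MAC, EtherType 0x0806 (ARP).
    ethernet_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # operation: request (assumption; some stacks use reply)
        mac_bytes,    # sender hardware address: the new owner's MAC
        ip_bytes,     # sender protocol address: the management IP
        b"\x00" * 6,  # target hardware address: unused in an announcement
        ip_bytes,     # target protocol address: same IP, which marks it unsolicited
    )
    return ethernet_header + arp_payload


frame = build_unsolicited_arp("02:00:00:00:00:01", "192.0.2.20")
print(len(frame), frame.hex())
```

Because the sender and target protocol addresses are the same, receivers that cache the mapping overwrite the old MAC address, which is why forwarding these packets matters for reconnecting after a failover.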
Ethernet link failures
If the Ethernet link to the system fails because of an event that is unrelated to SAN
Volume Controller, the system does not attempt to fail over the configuration node
to restore management IP access. For example, the Ethernet link can fail if a cable
is disconnected or an Ethernet router fails. To protect against this type of failure,
the system provides the option for two Ethernet ports that each have a
management IP address. If you cannot connect through one IP address, attempt to
access the system through the alternative IP address.
Note: IP addresses that are used by hosts to access the system over an Ethernet
connection are different from management IP addresses.
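The advice to try the alternative management IP address can be sketched as an ordered fallback. This is a minimal illustration, assuming the CLI is reached over SSH on TCP port 22; the function name, the example addresses, and the injectable `connect` parameter (used here so the logic can be exercised without a network) are hypothetical.

```python
import socket
from typing import Callable, Iterable, Optional


def first_reachable(addresses: Iterable[str], port: int = 22,
                    timeout: float = 5.0,
                    connect: Optional[Callable] = None) -> Optional[str]:
    """Return the first management address that accepts a TCP connection."""
    connect = connect or socket.create_connection
    for address in addresses:
        try:
            conn = connect((address, port), timeout)
        except OSError:
            continue  # this link is unavailable; try the next address
        conn.close()
        return address
    return None  # no management address answered


if __name__ == "__main__":
    # Example addresses are placeholders from the documentation range.
    print(first_reachable(["192.0.2.10", "198.51.100.10"], timeout=1.0))
```

Listing the address on Ethernet port 1 first mirrors the order in which the addresses are described; a `None` result means that neither link answered.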
Routing considerations for event notification and Network Time
Protocol
SAN Volume Controller supports the following protocols that make outbound
connections from the system:
v Email
v Simple Network Management Protocol (SNMP)
v Syslog
v Network Time Protocol (NTP)
These protocols operate only on a port configured with a management IP address.
When making outbound connections, the system uses the following routing
decisions:
v If the destination IP address is in the same subnet as one of the management IP
addresses, the system sends the packet immediately.
v If the destination IP address is not in the same subnet as either of the
management IP addresses, the system sends the packet to the default gateway
for Ethernet port 1.
v If the destination IP address is not in the same subnet as either of the
management IP addresses and Ethernet port 1 is not connected to the Ethernet
network, the system sends the packet to the default gateway for Ethernet port 2.
When you configure any of these protocols for event notifications, use these
routing decisions to ensure that error notification works correctly if the network
fails.
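The three routing decisions can be sketched as a single function, which may help when you check whether a notification server is reachable after a link failure. The `ManagementPort` class, its fields, and the example addresses are hypothetical; the behavior when Ethernet port 2 is not configured is an assumption, because the text does not cover that case.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManagementPort:
    """Hypothetical description of one Ethernet port with a management IP."""
    interface: str          # management IP with prefix, for example "192.0.2.10/24"
    gateway: str            # default gateway for this port
    connected: bool = True  # whether the port is connected to the Ethernet network


def next_hop(destination: str, port1: ManagementPort,
             port2: Optional[ManagementPort] = None) -> str:
    """Return where the packet is sent: the destination itself or a gateway."""
    dest = ipaddress.ip_address(destination)
    for port in (port1, port2):
        if port is not None and dest in ipaddress.ip_interface(port.interface).network:
            return destination  # same subnet as a management IP: sent directly
    if port1.connected or port2 is None:
        # Assumption: with no port 2 configured, port 1's gateway is the only choice.
        return port1.gateway
    return port2.gateway        # port 1 is not connected: use port 2's gateway


p1 = ManagementPort("192.0.2.10/24", "192.0.2.1")
p2 = ManagementPort("198.51.100.10/24", "198.51.100.1")
print(next_hop("203.0.113.5", p1, p2))
```

Running the function with `p1.connected = False` shows which gateway an off-subnet notification server would need to be reachable through when the first link is down.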
SAN fabric overview
The SAN fabric is an area of the network that contains routers and switches. A SAN
is configured into a number of zones. A device that uses the SAN can
communicate only with devices that are included in the same zones that it is in. A
system requires several distinct types of zones: a system zone, host zones, and disk
zones. The intersystem zone is optional.
In the host zone, the host systems can identify and address the nodes. You can
have more than one host zone and more than one disk zone. Unless you are using
a dual-core fabric design, the system zone contains all ports from all nodes in the
system. Create one zone for each host Fibre Channel port. In a disk zone, the
nodes identify the storage systems. Generally, create one zone for each external
storage system. If you are using the Metro Mirror and Global Mirror feature, create
a zone with at least one port from each node in each system; up to four systems
are supported.
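The zoning rule, that a device can communicate only with devices in a zone that it shares, can be sketched as a membership check. The zone names and port names below are invented for illustration, and the function is not part of any switch or SAN Volume Controller interface.

```python
from typing import Dict, Set


def can_communicate(port_a: str, port_b: str, zones: Dict[str, Set[str]]) -> bool:
    """Return True if both ports are members of at least one common zone."""
    return any(port_a in members and port_b in members
               for members in zones.values())


# Hypothetical zoning: one system zone, one host zone, one disk zone.
zones = {
    "system_zone": {"node1_p1", "node2_p1"},
    "host_zone_a": {"hostA_p1", "node1_p1", "node2_p1"},
    "disk_zone_1": {"storage1_p1", "node1_p1", "node2_p1"},
}

print(can_communicate("hostA_p1", "node1_p1", zones))     # share host_zone_a
print(can_communicate("hostA_p1", "storage1_p1", zones))  # no common zone
```

In this example layout the host reaches the nodes and the nodes reach the storage system, but the host has no zone in common with the storage system.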
Note: Some operating systems cannot tolerate other operating systems in the same
host zone, although you might have more than one host type in the SAN fabric.
For example, you can have a SAN that contains one host that runs on an IBM AIX®
operating system and another host that runs on a Microsoft Windows operating
system.
All communication between SAN Volume Controller nodes is performed through
the SAN. All SAN Volume Controller configuration and service commands are sent
to the system through an Ethernet network.
Chapter 2. Introducing the SAN Volume Controller hardware
components
A SAN Volume Controller system consists of SAN Volume Controller nodes and
related hardware components, such as uninterruptible power supply units and the
optional redundant AC-power switches. Note that nodes and uninterruptible
power supply units are installed in pairs.
SAN Volume Controller nodes
SAN Volume Controller supports several different node types.
The following nodes are supported:
v The SAN Volume Controller 2145-DH8 node is available for purchase, with the
following features:
– At least one Fibre Channel adapter or one 10 Gbps Ethernet adapter
– Optional second and third Fibre Channel adapters
– Up to two SAN Volume Controller 2145-24F expansion enclosures to house
optional flash drives
– iSCSI host attachment (1 Gbps Ethernet and optional 10 Gbps Ethernet)
v The SAN Volume Controller 2145-CG8 node is available for purchase, with the
following features:
– A high-speed SAS adapter with up to four flash drives
– A two-port 10 Gbps Ethernet adapter
– A second four-port Fibre Channel adapter
v The following nodes are no longer available for purchase but remain supported:
– SAN Volume Controller 2145-CF8
A label on the front of the node indicates the SAN Volume Controller node type,
hardware revision (if appropriate), and serial number.
SAN Volume Controller controls and indicators
The controls and indicators are used for power and navigation and to indicate
information such as system activity, service and configuration options, service
controller failures, and node identification.
SAN Volume Controller 2145-DH8 front panel controls and
indicators
The controls and indicators on the front panel are used for power and to indicate
information such as system activity, node failures, and node identification.
Figure 12 on page 16 shows the controls and indicators on the front panel of the
SAN Volume Controller 2145-DH8.
[Figure: front panel drawing with numbered callouts ▌1▐ to ▌12▐, identified in the list that follows.]
Figure 12. SAN Volume Controller 2145-DH8 front panel
▌1▐ Hard disk drive activity LED
▌2▐ Hard disk drive status LED
▌3▐ USB port
▌4▐ Video connector
▌5▐ Operator-information panel
▌6▐ Rack release latch
▌7▐ Node status LED
▌8▐ Node fault LED
▌9▐ Battery status LED
▌10▐ Battery fault LED
▌11▐ Batteries
▌12▐ Hard disk drives (boot drives)
Node status LED
The node status LED provides the following system activity indicators:
Off
The node is not operating as a member of a system.
On
The node is operating as a member of a system.
Slow blinking
The node is in candidate or service state.
Fast blinking
The node is dumping cache and state data to the local disk in anticipation
of a system reboot from a pending power-off action or other controlled
restart sequence.
Node fault LED
A node fault is indicated by the amber node-fault LED.
Off
The node does not have any errors that will prevent it from doing I/O or
the system software is not running on the node.
On
The node has a fatal node error and is not part of the system.
Battery status LED
The green battery status LED indicates one of the following battery conditions.
Off
The system software is not running on the node or the state of the system
cannot be saved if power to the node is lost.
Fast blinking
Battery charge level is too low for the state of the system to be saved if
power to the node is lost. Batteries are charging.
Slow blinking
Battery charge level is sufficient for the state of the system to be saved
once if power to the node is lost.
On
Battery charge level is sufficient for the state of the system to be saved
twice if power to the node is lost.
Battery fault LED
The amber battery fault LED indicates one of the following battery conditions.
Off
The system software is not running on the node or this battery does not
have a fault.
Blinking
This battery is being identified.
On
This battery has a fault. It cannot be used to save the system state if power
to the node is lost.
Hard disk drive activity LED
The green drive activity LED indicates one of the following conditions.
Off
The drive is not ready for use.
Flashing
The drive is in use.
On
The drive is ready for use, but is not in use.
Hard disk drive status LED
The amber drive status LED indicates one of the following conditions.
Off
The drive is in a good state or has no power.
Blinking
The drive is being identified.
On
The drive has failed.
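The LED states that are described above for the 2145-DH8 front panel can be captured in a simple lookup table. The following Python sketch is illustrative only: the dictionary layout and the describe_led function name are assumptions made for this example and are not part of any IBM tool. The meanings are condensed from the descriptions above.

```python
# Meanings of the 2145-DH8 front-panel LED states, condensed from the
# descriptions in this section. The table layout and the function name are
# hypothetical and exist only for this example.
DH8_LED_STATES = {
    "node status": {
        "off": "The node is not operating as a member of a system.",
        "on": "The node is operating as a member of a system.",
        "slow blinking": "The node is in candidate or service state.",
        "fast blinking": "The node is dumping cache and state data to the local disk.",
    },
    "node fault": {
        "off": "No errors prevent I/O, or the system software is not running.",
        "on": "The node has a fatal node error and is not part of the system.",
    },
    "battery status": {
        "off": "System software is not running, or the system state cannot be saved.",
        "fast blinking": "Charge is too low to save the system state; batteries are charging.",
        "slow blinking": "Charge is sufficient to save the system state once.",
        "on": "Charge is sufficient to save the system state twice.",
    },
    "battery fault": {
        "off": "System software is not running, or this battery has no fault.",
        "blinking": "This battery is being identified.",
        "on": "This battery has a fault and cannot be used to save the system state.",
    },
}


def describe_led(led: str, state: str) -> str:
    """Return the documented meaning of an LED state, or raise ValueError."""
    try:
        return DH8_LED_STATES[led.strip().lower()][state.strip().lower()]
    except KeyError:
        raise ValueError(f"no documented meaning for {led!r} in state {state!r}") from None
```

For example, describe_led("Battery status", "Slow blinking") returns the text for a battery that can save the system state once. A combination that this section does not describe, such as a blinking node fault LED, raises ValueError instead of guessing.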
SAN Volume Controller 2145-CG8 controls and indicators
The controls and indicators are used for power and navigation and to indicate
information such as system activity, service and configuration options, service
controller failures, and node identification.
Figure 13 shows the controls and indicators on the front panel of the
SAN Volume Controller 2145-CG8.
Figure 13. SAN Volume Controller 2145-CG8 front panel
▌1▐ Node-status LED
▌2▐ Front-panel display
▌3▐ Navigation buttons
▌4▐ Operator-information panel
▌5▐ Select button
▌6▐ Error LED
SAN Volume Controller 2145-CF8 controls and indicators
The controls and indicators are used for power and navigation and to indicate
information such as system activity, service and configuration options, service
controller failures, and node identification.
Figure 14 shows the controls and indicators on the front panel of the SAN Volume
Controller 2145-CF8.
Figure 14. SAN Volume Controller 2145-CF8 front panel
▌1▐ Node-status LED
▌2▐ Front-panel display
▌3▐ Navigation buttons
▌4▐ Operator-information panel
▌5▐ Select button
▌6▐ Error LED
Node status LED
System activity is indicated through the green node-status LED.
The node status LED provides the following system activity indicators:
Off
The node is not operating as a member of a system.
On
The node is operating as a member of a system.
Slow blinking
The node is in candidate or service state.
Fast blinking
The node is dumping cache and state data to the local disk in anticipation
of a system reboot from a pending power-off action or other controlled
restart sequence.
Front-panel display
The front-panel display shows service, configuration, and navigation information.
You can select the language that is displayed on the front panel. The display can
show both alphanumeric information and graphical information (progress bars).
The front-panel display shows configuration and service information about the
node and the system, including the following items:
• Boot progress indicator
• Boot failed
• Charging
• Hardware boot
• Node rescue request
• Power failure
• Powering off
• Recovering
• Restarting
• Shutting down
• Error codes
• Validate WWNN?
Navigation buttons
You can use the navigation buttons to move through menus.
There are four navigational buttons that you can use to move throughout a menu:
up, down, right, and left.
Each button corresponds to the direction that you can move in a menu. For
example, to move right in a menu, press the navigation button that is located on
the right side. If you want to move down in a menu, press the navigation button
that is located on the bottom.
Note: The select button is used in tandem with the navigation buttons.
Product serial number
The node contains a SAN Volume Controller product serial number that is written
to the system board hardware. The product serial number is also printed on the
serial number label which is located on the front panel.
This number is used for warranty and service entitlement checking and is included
in the data sent with error reports. It is essential that this number is not changed
during the life of the product. If the system board is replaced, you must follow the
system board replacement instructions carefully and rewrite the serial number on
the system board.
Select button
Use the select button to select an item from a menu.
The select button and navigation buttons help you to navigate and select menu
and boot options, and start a service panel test. The select button is located on the
front panel of the SAN Volume Controller, near the navigation buttons.
Node identification label
The node identification label on the front panel displays a six-digit node
identification number. Sometimes this number is called the panel name or front
panel ID.
The node identification label is the six-digit number that is input to the addnode
command. It is readable by system software and is used by configuration and
service software as a node identifier. The node identification number can also be
displayed on the front-panel display when node is selected from the menu.
If the service controller assembly front panel is replaced, the configuration and
service software displays the number that is printed on the front of the
replacement panel. Future error reports contain the new number. No system
reconfiguration is necessary when the front panel is replaced.
Error LED
Critical faults on the service controller are indicated through the amber error LED.
The error LED has the following two states:
OFF
The service controller is functioning correctly.
ON
A critical service-controller failure was detected and you must replace the
service controller.
The error LED can light temporarily when the node is powered on. If the
error LED is on, but the front panel display is completely blank, wait five
minutes to allow the LED time to turn off before performing any service
action.
SAN Volume Controller operator-information panel
The operator-information panel is located on the front panel of the SAN Volume
Controller.
SAN Volume Controller 2145-DH8 operator information panel
The operator-information panel indicates information such as system board errors,
Ethernet activity, and power status.
Figure 15 shows the operator-information panel for the SAN Volume
Controller 2145-DH8.
Figure 15. SAN Volume Controller 2145-DH8 operator information panel
▌1▐ Power-control button and power-on LED (green)
▌2▐ Ethernet icon
▌3▐ System-locator button and LED (blue)
▌4▐ Release latch for the light path diagnostics panel
▌5▐ Ethernet activity LEDs
▌6▐ Check log LED
▌7▐ System-error LED (yellow)
Note: If the node has more than four Ethernet ports, activity on ports five and
above is not reflected on the operator-information panel Ethernet activity LEDs.
SAN Volume Controller 2145-CG8 operator-information panel
The operator-information panel contains buttons and indicators such as the
power-control button, and LEDs that indicate information such as system-board
errors, hard-drive activity, and power status.
Figure 16 shows the operator-information panel for the SAN Volume Controller
2145-CG8.
Figure 16. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel
▌1▐ Power-button cover
▌2▐ Ethernet 1 activity LED. The operator-information panel LEDs refer to the
Ethernet ports that are mounted on the system board.
▌3▐ Ethernet 2 activity LED. The operator-information panel LEDs refer to the
Ethernet ports that are mounted on the system board.
▌4▐ System-information LED
▌5▐ System-error LED
▌6▐ Release latch
▌7▐ Locator button and LED
▌8▐ Power button and LED
Note: If you install the 10 Gbps Ethernet feature, the port activity is not reflected
on the activity LEDs.
SAN Volume Controller 2145-CF8 operator-information panel
The operator-information panel contains buttons and indicators such as the
power-control button, and LEDs that indicate information such as system-board
errors, hard-drive activity, and power status.
Figure 17 shows the operator-information panel for the SAN Volume Controller
2145-CF8.
Figure 17. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel
▌1▐ Power-button cover
▌2▐ Ethernet 2 activity LED
▌3▐ Ethernet 1 activity LED
▌4▐ System-information LED
▌5▐ System-error LED
▌6▐ Release latch
▌7▐ Locator button and LED
▌8▐ Not used
▌9▐ Not used
▌10▐ Power button and LED
System-error LED
When it is lit, the system-error LED indicates that a system-board error has
occurred.
This amber LED lights up if the hardware detects an unrecoverable error that
requires a new field-replaceable unit (FRU). To isolate the faulty FRU, see
MAP 5800: Light path.
A system-error LED is also at the rear of these SAN Volume Controller models:
• 2145-CG8
• 2145-CF8
Disk drive activity LED
When it is lit, the green disk drive activity LED indicates that the disk drive is in
use.
Reset button
If a reset button is available on your SAN Volume Controller node, do not use it.
Attention: If you use the reset button, the node restarts immediately without the
SAN Volume Controller control data being written to disk. Service actions are then
required to make the node operational again.
Power button
The power button turns main power on or off for the SAN Volume Controller.
To turn on the power, press and release the power button. You must have a
pointed device, such as a pen, to press the button.
To turn off the power, press and release the power button. For more information
about how to turn off the SAN Volume Controller node, see MAP 5350: Powering
off a SAN Volume Controller node.
Attention: When the node is operational and you press and immediately release
the power button, the SAN Volume Controller writes its control data to its internal
disk and then turns off. This can take up to five minutes. If you press the power
button but do not release it, the node turns off immediately without the SAN
Volume Controller control data being written to disk. Service actions are then
required to make the SAN Volume Controller operational again. Therefore, during
a power-off operation, do not press and hold the power button for more than two
seconds.
Power LED
The green power LED indicates the power status of the system.
The power LED has the following properties:
Off
One or more of the following are true:
• No power is present at the power supply input.
• The power supply has failed.
• The LED has failed.
On
The SAN Volume Controller node is turned on.
Flashing
The SAN Volume Controller node is turned off, but is still connected to a
power source.
Note: A power LED is also at the rear of these SAN Volume Controller nodes:
• 2145-CG8
• 2145-CF8
System-information LED
When the system-information LED is lit, a noncritical event has occurred.
Check the light path diagnostics panel and the event log. Light path diagnostics
are described in more detail in the light path maintenance analysis procedure
(MAP).
Locator LED
The SAN Volume Controller does not use the locator LED.
Ethernet-activity LED
An Ethernet-activity LED beside each Ethernet port indicates that the SAN Volume
Controller node is communicating on the Ethernet network that is connected to the
Ethernet port.
The operator-information panel LEDs refer to the Ethernet ports that are mounted
on the system board. If you install the 10 Gbps Ethernet card on a SAN Volume
Controller 2145-CG8, the port activity is not reflected on the activity LEDs.
SAN Volume Controller rear-panel indicators and connectors
The rear-panel indicators for the SAN Volume Controller are located on the
back-panel assembly. The external connectors are located on the SAN Volume
Controller node and the power supply assembly.
SAN Volume Controller 2145-DH8 rear-panel indicators
The rear-panel indicators consist of LEDs that indicate the status of the Fibre
Channel ports, Ethernet connection and activity, power, electrical current, and
system-board errors.
Figure 18 shows the rear-panel indicators on the SAN Volume Controller 2145-DH8
back-panel assembly.
Figure 18. SAN Volume Controller 2145-DH8 rear-panel indicators
▌1▐ Ethernet-link LED
▌2▐ Ethernet-activity LED
▌3▐ Power, location, and system-error LEDs
▌4▐ AC, DC, and power-supply error LEDs
SAN Volume Controller 2145-DH8 connectors
The SAN Volume Controller 2145-DH8 includes multiple external connectors for
data, video, and power.
Figure 19 shows the external connectors on the SAN Volume Controller
2145-DH8 back panel assembly.
Figure 19. Connectors on the rear of the SAN Volume Controller 2145-DH8
▌1▐ 1 Gbps Ethernet port 1
▌2▐ 1 Gbps Ethernet port 2
▌3▐ 1 Gbps Ethernet port 3
▌4▐ Technician port (Ethernet)
▌5▐ Power supply 2
▌6▐ Power supply 1
▌7▐ USB 6
▌8▐ USB 5
▌9▐ USB 4
▌10▐ USB 3
▌11▐ Serial
▌12▐ Video
▌13▐ Unused Ethernet port
Figure 20 shows the type of connector that is located on each power-supply
assembly.
Figure 20. Power connector
▌1▐ Neutral
▌2▐ Ground
▌3▐ Live
Note: Optional host interface adapters provide additional connectors for 10 Gbps
Ethernet, Fibre Channel, or SAS.
SAN Volume Controller 2145-DH8 ports used during service procedures:
The SAN Volume Controller 2145-DH8 contains a number of ports that are only
used during service procedures.
Figure 21 shows ports that are used only during service procedures.
Figure 21. SAN Volume Controller 2145-DH8 service ports
▌1▐ Technician port (Ethernet)
▌2▐ USB 3
▌3▐ USB 4
▌4▐ USB 5
▌5▐ USB 6
During normal operation, none of these ports are used. Connect a device to any of
these ports only when you are directed to do so by a service procedure or by an
IBM service representative.
SAN Volume Controller 2145-DH8 unused ports:
The SAN Volume Controller 2145-DH8 includes one port that is not used.
Figure 22 shows the one port that is not used during service procedures or normal
operation. This port is disabled in software to make the port inactive.
Figure 22. SAN Volume Controller 2145-DH8 unused Ethernet port
▌1▐ Unused Ethernet port
SAN Volume Controller 2145-CG8 rear-panel indicators
The rear-panel indicators consist of LEDs that indicate the status of the Fibre
Channel ports, Ethernet connection and activity, power, electrical current, and
system-board errors.
Figure 23 shows the rear-panel indicators on the SAN Volume
Controller 2145-CG8 back-panel assembly.
Figure 23. SAN Volume Controller 2145-CG8 rear-panel indicators
▌1▐ Fibre Channel LEDs
▌2▐ Ethernet-link LEDs
▌3▐ Ethernet-activity LEDs
▌4▐ AC, DC, and power-supply error LEDs
▌5▐ Power, location, and system-error LEDs
Figure 24 shows the rear-panel indicators on the SAN Volume Controller 2145-CG8
back-panel assembly that has the 10 Gbps Ethernet feature.
Figure 24. SAN Volume Controller 2145-CG8 rear-panel indicators for the 10 Gbps Ethernet
feature
▌1▐ 10 Gbps Ethernet-link LEDs. The amber link LED is on when this port is
connected to a 10 Gbps Ethernet switch and the link is online.
▌2▐ 10 Gbps Ethernet-activity LEDs. The green activity LED is on while data is
being sent over the link.
SAN Volume Controller 2145-CG8 connectors
External connectors that the SAN Volume Controller 2145-CG8 uses include four
Fibre Channel ports, a serial port, two Ethernet ports, and two power connectors.
The 2145-CG8 also has external connectors for the 10 Gbps Ethernet feature.
These figures show the external connectors on the SAN Volume Controller
2145-CG8 back panel assembly.
Figure 25. Connectors on the rear of the SAN Volume Controller 2145-CG8
▌1▐ Fibre Channel port 1
▌2▐ Fibre Channel port 2
▌3▐ Fibre Channel port 3
▌4▐ Fibre Channel port 4
▌5▐ Power cord connector for power supply 1
▌6▐ Power cord connector for power supply 2
▌7▐ Serial connection for UPS communication cable
▌8▐ Ethernet port 2
▌9▐ Ethernet port 1
Figure 26. 10 Gbps Ethernet ports on the rear of the SAN Volume Controller 2145-CG8
▌1▐ 10 Gbps Ethernet port 3
▌2▐ 10 Gbps Ethernet port 4
Fibre Channel port 5 (not shown)
Fibre Channel port 6 (not shown)
Fibre Channel port 7 (not shown)
Fibre Channel port 8 (not shown)
Figure 27 shows the type of connector that is located on each power-supply
assembly. Use these connectors to connect the SAN Volume Controller 2145-CG8 to
the two power cables from the uninterruptible power supply.
Neutral
Ground
Live
Figure 27. Power connector
SAN Volume Controller 2145-CG8 ports used during service procedures:
The SAN Volume Controller 2145-CG8 contains a number of ports that are only
used during service procedures.
Figure 28 shows ports that are used only during service procedures.
Figure 28. Service ports of the SAN Volume Controller 2145-CG8
▌1▐ System management port
▌2▐ Two monitor ports, one on the front and one on the rear
▌3▐ Four USB ports, two on the front and two on the rear
During normal operation, none of these ports are used. Connect a device to any of
these ports only when you are directed to do so by a service procedure or by an
IBM service representative.
SAN Volume Controller 2145-CG8 unused ports:
The SAN Volume Controller 2145-CG8 can contain one port that is not used.
Figure 29 shows the one port that is not used during service procedures or normal
use.
Figure 29. SAN Volume Controller 2145-CG8 port not used
▌1▐ Serial-attached SCSI (SAS) port
When present, this port is disabled in software to make the port inactive.
The SAS port is present when the optional high-speed SAS adapter is installed
with one or more flash drives.
SAN Volume Controller 2145-CF8 rear-panel indicators
The rear-panel indicators consist of LEDs that indicate the status of the Fibre
Channel ports, Ethernet connection and activity, power, electrical current, and
system-board errors.
Figure 30 shows the rear-panel indicators on the SAN Volume
Controller 2145-CF8 back-panel assembly.
Figure 30. SAN Volume Controller 2145-CF8 rear-panel indicators
▌1▐ Fibre Channel LEDs
▌2▐ AC, DC, and power-supply error LEDs
▌3▐ Power, location, and system-error LEDs
▌4▐ Ethernet-link LEDs
▌5▐ Ethernet-activity LEDs
SAN Volume Controller 2145-CF8 connectors
External connectors that the SAN Volume Controller 2145-CF8 uses include four
Fibre Channel ports, a serial port, two Ethernet ports, and two power connectors.
Figure 31 shows the external connectors on the SAN Volume Controller 2145-CF8
back panel assembly.
Figure 31. Connectors on the rear of the SAN Volume Controller 2145-CG8 or 2145-CF8
▌1▐ Fibre Channel port 1
▌2▐ Fibre Channel port 2
▌3▐ Fibre Channel port 3
▌4▐ Fibre Channel port 4
▌5▐ Power cord connector for power supply 1
▌6▐ Power cord connector for power supply 2
▌7▐ Serial connection for UPS communication cable
▌8▐ Ethernet port 2
▌9▐ Ethernet port 1
Figure 32 shows the type of connector that is located on each
power-supply assembly. Use these connectors to connect the SAN Volume
Controller 2145-CF8 to the two power cables from the uninterruptible power
supply.
Neutral
Ground
Live
Figure 32. Power connector
SAN Volume Controller 2145-CF8 ports used during service procedures:
The SAN Volume Controller 2145-CF8 contains a number of ports that are only
used during service procedures.
Figure 33 shows ports that are used only during service procedures.
Figure 33. Service ports of the SAN Volume Controller 2145-CF8
▌1▐ System management port
▌2▐ Two monitor ports, one on the front and one on the rear
▌3▐ Four USB ports, two on the front and two on the rear
During normal operation, none of these ports are used. Connect a device to any of
these ports only when you are directed to do so by a service procedure or by an
IBM service representative.
SAN Volume Controller 2145-CF8 unused ports:
The SAN Volume Controller 2145-CF8 can contain one port that is not used.
Figure 34 shows the one port that is not used during service procedures or normal
use.
Figure 34. SAN Volume Controller 2145-CF8 port not used
▌1▐ Serial-attached SCSI (SAS) port
When present, this port is disabled in software to make the port inactive.
The SAS port is present when the optional high-speed SAS adapter is installed
with one or more flash drives.
Fibre Channel LEDs
The Fibre Channel LEDs indicate the status of the Fibre Channel ports.
Two LEDs are used to indicate the state and speed of the operation of each Fibre
Channel port. The bottom LED indicates the link state and activity.
Table 8. Link state and activity for the bottom Fibre Channel LED
LED state   Link state and activity indicated
Off         Link inactive
On          Link active, no I/O
Flashing    Link active, I/O active
Each Fibre Channel port can operate at one of three speeds. The top LED indicates
the relative link speed. The link speed is defined only if the link state is active.
Table 9. Link speed for the top Fibre Channel LED
LED state   Link speed indicated
Off         SLOW
On          FAST
Blinking    MEDIUM
Table 10 shows the actual link speeds for the SAN Volume Controller 2145-CF8 and
for the SAN Volume Controller 2145-CG8.
Table 10. Actual link speeds
Link speed   Actual link speeds
Slow         2 Gbps
Fast         8 Gbps
Medium       4 Gbps
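Tables 8, 9, and 10 can be combined to interpret the pair of LEDs on one Fibre Channel port of a 2145-CF8 or 2145-CG8. The following Python sketch is illustrative only; the function name decode_fc_port and the return format are assumptions made for this example. Because the link speed is defined only when the link is active, the sketch reports no speed when the bottom LED is off.

```python
from typing import Optional, Tuple

# Bottom LED: link state and activity (Table 8).
LINK_STATE = {
    "off": "link inactive",
    "on": "link active, no I/O",
    "flashing": "link active, I/O active",
}

# Top LED: relative speed (Table 9) mapped to the actual speed in Gbps (Table 10).
LINK_SPEED_GBPS = {
    "off": 2,       # SLOW
    "blinking": 4,  # MEDIUM
    "on": 8,        # FAST
}


def decode_fc_port(bottom_led: str, top_led: str) -> Tuple[str, Optional[int]]:
    """Return (link state, speed in Gbps); speed is None when the link is inactive."""
    bottom = bottom_led.strip().lower()
    top = top_led.strip().lower()
    state = LINK_STATE[bottom]
    if bottom == "off":
        # The link speed is defined only if the link state is active.
        return state, None
    return state, LINK_SPEED_GBPS[top]
```

For example, a flashing bottom LED with a blinking top LED decodes to an active link with I/O at 4 Gbps.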
Ethernet activity LED
The Ethernet activity LED indicates that the node is communicating with the
Ethernet network that is connected to the Ethernet port.
There is a set of LEDs for each Ethernet connector. The top LED is the Ethernet
link LED. When it is lit, it indicates that there is an active connection on the
Ethernet port. The bottom LED is the Ethernet activity LED. When it flashes, it
indicates that data is being transmitted or received between the server and a
network device.
Ethernet link LED
The Ethernet link LED indicates that there is an active connection on the Ethernet
port.
There is a set of LEDs for each Ethernet connector. The top LED is the Ethernet
link LED. When it is lit, it indicates that there is an active connection on the
Ethernet port. The bottom LED is the Ethernet activity LED. When it flashes, it
indicates that data is being transmitted or received between the server and a
network device.
Power, location, and system-error LEDs
The power, location, and system-error LEDs are housed on the rear of the SAN
Volume Controller. These three LEDs are duplicates of the same LEDs that are
shown on the front of the node.
The following terms describe the power, location, and system-error LEDs:
Power LED
This is the top of the three LEDs and indicates the following states:
Off
One or more of the following are true:
• No power is present at the power supply input
• The power supply has failed
• The LED has failed
On
The SAN Volume Controller is powered on.
Blinking
The SAN Volume Controller is turned off but is still connected to a
power source.
Location LED
This is the middle of the three LEDs and is not used by the SAN Volume
Controller.
System-error LED
This is the bottom of the three LEDs that indicates that a system board
error has occurred. The light path diagnostics provide more information.
AC and DC LEDs
The AC and DC LEDs indicate whether the node is receiving electrical current.
AC LED
The upper LED indicates that AC current is present on the node.
DC LED
The lower LED indicates that DC current is present on the node.
AC, DC, and power-supply error LEDs:
The AC, DC, and power-supply error LEDs indicate whether the node is receiving
electrical current.
Figure 35 shows the location of the SAN Volume Controller 2145-DH8
AC, DC, and power-supply error LEDs.
Figure 35. SAN Volume Controller 2145-DH8 AC, DC, and power-error LEDs
Each of the two power supplies has its own set of LEDs.
▌1▐
Indicates that AC current is present on the node.
▌2▐
Indicates that DC current is present on the node.
▌3▐
Indicates a problem with the power supply.
AC, DC, and power-supply error LEDs on the SAN Volume Controller 2145-CF8
and SAN Volume Controller 2145-CG8:
The AC, DC, and power-supply error LEDs indicate whether the node is receiving
electrical current.
Figure 36 shows the location of the AC, DC, and power-supply error LEDs.
Figure 36. SAN Volume Controller 2145-CG8 or 2145-CF8 AC, DC, and power-error LEDs
Each of the two power supplies has its own set of LEDs.
AC LED
The upper LED (▌1▐), on the left side of the power supply, indicates that
AC current is present on the node.
DC LED
The middle LED (▌2▐), on the left side of the power supply, indicates that
DC current is present on the node.
Power-supply error LED
The lower LED (▌3▐), on the left side of the power supply, indicates a
problem with the power supply.
Fibre Channel port numbers and worldwide port names
Fibre Channel (FC) ports are identified by their physical port number and by a
worldwide port name (WWPN).
The physical port numbers identify Fibre Channel adapters and cable connections
when you run service tasks. Worldwide port names (WWPNs), which uniquely
identify the devices on the SAN, are used for tasks such as Fibre Channel switch
configuration. The WWPNs are derived from the worldwide node name (WWNN)
of the node in which the ports are installed.
Requirements for the SAN Volume Controller environment
Certain specifications for the physical site of the SAN Volume Controller must be
met before the IBM representative can set up your SAN Volume Controller
environment.
SAN Volume Controller 2145-DH8 environment requirements
Before the SAN Volume Controller 2145-DH8 is installed, the physical environment
must meet certain requirements. This includes verifying that adequate space is
available and that requirements for power and environmental conditions are met.
Input-voltage requirements
Ensure that your environment meets the voltage requirements that are shown in
Table 11.
Table 11. Input-voltage requirements
Voltage                   Frequency
100-127 / 200-240 V ac    50 Hz or 60 Hz
Maximum power requirements for each node
Ensure that your environment meets the power requirements as shown in Table 12.
The maximum power that is required depends on the node type and the optional
features that are installed.
Table 12. Power consumption
Components                        Power requirements
SAN Volume Controller 2145-DH8    200 W typical, 750 W maximum (200 - 240 V ac, 50/60 Hz)
Note: You cannot mix ac and dc power sources; the power sources must match.
Environment requirements without redundant AC power
If you are not using redundant AC power, ensure that your environment falls
within the ranges that are shown in Table 13.
Table 13. Physical specifications
Environment                      Temperature                       Altitude                                  Relative humidity  Maximum dew point
Operating in lower altitudes     5°C to 40°C (41°F to 104°F)       0 to 950 m (0 ft to 3,117 ft)
Operating in higher altitudes    5°C to 28°C (41°F to 82°F)        951 m to 3,050 m (3,118 ft to 10,000 ft)  8% to 85%          24°C (75°F)
Turned off (with standby power)  5°C to 45°C (41°F to 113°F)       0 m to 3,050 m (0 ft to 10,000 ft)        8% to 85%          27°C (80.6°F)
Storing                          1°C to 60°C (33.8°F to 140.0°F)   0 m to 3,050 m (0 ft to 10,000 ft)        5% to 80%          29°C (84.2°F)
Shipping                         -40°C to 60°C (-40°F to 140.0°F)  0 m to 10,700 m (0 ft to 34,991 ft)       5% to 100%         29°C (84.2°F)
Note: Decrease the maximum system temperature by 1°C for every 175 m increase
in altitude.
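The altitude note can be worked as a short calculation. The following Python sketch is illustrative only. It assumes that the derating starts at 950 m, the upper limit of the lower-altitude row in Table 13, and applies to the 40°C operating maximum; under that assumption the result at 3,050 m is 28°C, which matches the higher-altitude row. The function name is hypothetical.

```python
LOW_ALTITUDE_LIMIT_M = 950      # upper limit of the lower-altitude row in Table 13
MAX_ALTITUDE_M = 3050           # upper limit of the higher-altitude row in Table 13
MAX_TEMP_LOW_ALTITUDE_C = 40.0  # maximum operating temperature at lower altitudes
METERS_PER_DEGREE = 175         # decrease by 1 degree C for every 175 m increase


def max_operating_temp_c(altitude_m: float) -> float:
    """Return the derated maximum operating temperature for a 2145-DH8.

    Assumption for this example: derating begins above 950 m.
    """
    if not 0 <= altitude_m <= MAX_ALTITUDE_M:
        raise ValueError("altitude is outside the operating range of 0 m to 3,050 m")
    if altitude_m <= LOW_ALTITUDE_LIMIT_M:
        return MAX_TEMP_LOW_ALTITUDE_C
    return MAX_TEMP_LOW_ALTITUDE_C - (altitude_m - LOW_ALTITUDE_LIMIT_M) / METERS_PER_DEGREE
```

For example, at 1,825 m the site is 875 m above the 950 m limit, so the maximum drops by 5°C to 35°C.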
Preparing your environment
The following tables list the physical characteristics of the 2145-DH8 node.
Dimensions and weight
Use the parameters that are shown in Table 14 to ensure that space is available in a
rack capable of supporting the node.
Table 14. Dimensions and weight
Height             Width                Depth                Maximum weight
86 mm (3.4 in.)    445 mm (17.5 in.)    746 mm (29.4 in.)    25 kg (55 lb) to 30 kg (65 lb) depending on configuration
Additional space requirements
Ensure that space is available in the rack for the additional space requirements
around the node, as shown in Table 15.
Table 15. Additional space requirements
Location                    Additional space requirements    Reason
Left side and right side    Minimum: 50 mm (2 in.)           Cooling air flow
Back                        Minimum: 100 mm (4 in.)          Cable exit
Maximum heat output of each 2145-DH8 node
The node dissipates the maximum heat output that is given in Table 16.
Table 16. Maximum heat output of each 2145-DH8 node
Model       Heat output per node
2145-DH8    • Minimum configuration: 419.68 Btu per hour (AC 123 watts)
            • Maximum configuration: 3480.24 Btu per hour (AC 1020 watts)
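The heat-output figures in Table 16 follow from the power figures by the standard conversion of approximately 3.412 Btu per hour for each watt. The following Python sketch is illustrative only; the function name is hypothetical, and the conversion factor is rounded to three decimal places, which is enough to reproduce the values in the table.

```python
BTU_PER_HOUR_PER_WATT = 3.412  # standard conversion factor, rounded to three decimals


def watts_to_btu_per_hour(watts: float) -> float:
    """Convert electrical power in watts to heat output in Btu per hour."""
    return round(watts * BTU_PER_HOUR_PER_WATT, 2)


# Minimum and maximum configurations from Table 16.
minimum_btu = watts_to_btu_per_hour(123)
maximum_btu = watts_to_btu_per_hour(1020)
```

The same conversion can be used to size cooling for a rack by summing the wattage of every installed node before converting.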
SAN Volume Controller 2145-CG8 environment requirements
Before the SAN Volume Controller 2145-CG8 is installed, the physical environment
must meet certain requirements. This includes verifying that adequate space is
available and that requirements for power and environmental conditions are met.
Input-voltage requirements
Ensure that your environment meets the voltage requirements that are shown in
Table 17.
Table 17. Input-voltage requirements
Voltage                          Frequency
200 V - 240 V single phase ac    50 Hz or 60 Hz
Attention:
• If the uninterruptible power supply is cascaded from another uninterruptible
  power supply, the source uninterruptible power supply must have at least three
  times the capacity per phase and the total harmonic distortion must be less than
  5%.
• The uninterruptible power supply also must have input voltage capture that has
  a slew rate of no more than 3 Hz per second.
Maximum power requirements for each node
Ensure that your environment meets the power requirements as shown in Table 18.
The maximum power that is required depends on the node type and the optional
features that are installed.
Table 18. Maximum power consumption
Components                                        Power requirements
SAN Volume Controller 2145-CG8 and 2145 UPS-1U    200 W
For each redundant AC-power switch, add 20 W to the power requirements.
For the high-speed SAS adapter with from one to four solid-state drives, add 50 W
to the power requirements.
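The additions above can be combined into a simple power budget for one 2145-CG8 node. The following Python sketch is illustrative only; the function and parameter names are hypothetical, and the figures are the ones given in Table 18 and in the two statements that follow it.

```python
BASE_NODE_AND_UPS_W = 200        # 2145-CG8 and 2145 UPS-1U (Table 18)
REDUNDANT_AC_SWITCH_W = 20       # added for each redundant AC-power switch
SAS_ADAPTER_WITH_DRIVES_W = 50   # high-speed SAS adapter with one to four solid-state drives


def cg8_power_requirement_w(redundant_ac_switches: int = 0,
                            has_sas_adapter: bool = False) -> int:
    """Return the maximum power requirement, in watts, for one 2145-CG8 node."""
    if redundant_ac_switches < 0:
        raise ValueError("the number of redundant AC-power switches cannot be negative")
    total = BASE_NODE_AND_UPS_W + redundant_ac_switches * REDUNDANT_AC_SWITCH_W
    if has_sas_adapter:
        total += SAS_ADAPTER_WITH_DRIVES_W
    return total
```

For example, a node with one redundant AC-power switch and the high-speed SAS adapter requires 200 W + 20 W + 50 W = 270 W.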
Circuit breaker requirements
The 2145 UPS-1U has an integrated circuit breaker and does not require additional
protection.
Environment requirements without redundant AC power
If you are not using redundant ac power, ensure that your environment falls
within the ranges that are shown in Table 19.
Table 19. Environment requirements without redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 10°C - 35°C (50°F - 95°F) | 0 m - 914 m (0 ft - 3000 ft) | 8% - 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 10°C - 32°C (50°F - 90°F) | 914 m - 2133 m (3000 ft - 7000 ft) | 8% - 80% noncondensing | 23°C (73°F)
Turned off | 10°C - 43°C (50°F - 109°F) | 0 m - 2133 m (0 ft - 7000 ft) | 8% - 80% noncondensing | 27°C (81°F)
Storing | 1°C - 60°C (34°F - 140°F) | 0 m - 2133 m (0 ft - 7000 ft) | 5% - 80% noncondensing | 29°C (84°F)
Shipping | -20°C - 60°C (-4°F - 140°F) | 0 m - 10668 m (0 ft - 34991 ft) | 5% - 100% condensing, but no precipitation | 29°C (84°F)
Environment requirements with redundant AC power
If you are using redundant ac power, ensure that your environment falls within the
ranges that are shown in Table 20.
Table 20. Environment requirements with redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 15°C - 32°C (59°F - 90°F) | 0 m - 914 m (0 ft - 3000 ft) | 20% - 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 15°C - 32°C (59°F - 90°F) | 914 m - 2133 m (3000 ft - 7000 ft) | 20% - 80% noncondensing | 23°C (73°F)
Turned off | 10°C - 43°C (50°F - 109°F) | 0 m - 2133 m (0 ft - 7000 ft) | 20% - 80% noncondensing | 27°C (81°F)
Storing | 1°C - 60°C (34°F - 140°F) | 0 m - 2133 m (0 ft - 7000 ft) | 5% - 80% noncondensing | 29°C (84°F)
Shipping | -20°C - 60°C (-4°F - 140°F) | 0 m - 10668 m (0 ft - 34991 ft) | 5% - 100% condensing, but no precipitation | 29°C (84°F)
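The operating rows of the two environment tables for the 2145-CG8 can be expressed as a simple range check. The following is a sketch only: the class and function names are hypothetical, only the two "Operating" rows are modeled, and an altitude of exactly 914 m is assumed to satisfy either row.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OperatingRange:
    temp_c: Tuple[float, float]        # inclusive temperature range in °C
    altitude_m: Tuple[float, float]    # inclusive altitude range in meters
    humidity_pct: Tuple[float, float]  # relative humidity, noncondensing
    max_wet_bulb_c: float              # maximum wet bulb temperature in °C


# Operating rows for the 2145-CG8, keyed by "redundant AC power in use".
OPERATING_RANGES = {
    False: (OperatingRange((10, 35), (0, 914), (8, 80), 23),
            OperatingRange((10, 32), (914, 2133), (8, 80), 23)),
    True: (OperatingRange((15, 32), (0, 914), (20, 80), 23),
           OperatingRange((15, 32), (914, 2133), (20, 80), 23)),
}


def _within(value: float, bounds: Tuple[float, float]) -> bool:
    return bounds[0] <= value <= bounds[1]


def operating_environment_ok(temp_c: float, altitude_m: float,
                             humidity_pct: float, wet_bulb_c: float,
                             redundant_ac: bool) -> bool:
    """Return True if the site matches at least one operating row."""
    return any(
        _within(temp_c, r.temp_c)
        and _within(altitude_m, r.altitude_m)
        and _within(humidity_pct, r.humidity_pct)
        and wet_bulb_c <= r.max_wet_bulb_c
        for r in OPERATING_RANGES[redundant_ac]
    )


if __name__ == "__main__":
    # 12°C is acceptable without redundant AC power but too cold with it.
    print(operating_environment_ok(12, 500, 50, 20, redundant_ac=False))
    print(operating_environment_ok(12, 500, 50, 20, redundant_ac=True))
```

The example highlights the main practical difference between the two tables: with redundant AC power, the lower temperature and humidity limits for operation are tighter.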
Preparing your environment
The following tables list the physical characteristics of the SAN Volume Controller
2145-CG8 node.
Dimensions and weight
Use the parameters that are shown in Table 21 to ensure that space is available in a
rack capable of supporting the node.
Table 21. Dimensions and weight
Height | Width | Depth | Maximum weight
4.3 cm (1.7 in.) | 44 cm (17.3 in.) | 73.7 cm (29 in.) | 15 kg (33 lb)
Additional space requirements
Ensure that space is available in the rack for the additional space requirements
around the node, as shown in Table 22.
Table 22. Additional space requirements
Location | Additional space requirements | Reason
Left side and right side | Minimum: 50 mm (2 in.) | Cooling air flow
Back | Minimum: 100 mm (4 in.) | Cable exit
Maximum heat output of each SAN Volume Controller 2145-CG8 node
The node dissipates the maximum heat output that is given in Table 23.
Table 23. Maximum heat output of each SAN Volume Controller 2145-CG8 node
Model | Heat output per node
SAN Volume Controller 2145-CG8 | 160 W (546 Btu per hour)
SAN Volume Controller 2145-CG8 plus flash drive | 210 W (717 Btu per hour)
Maximum heat output of each 2145 UPS-1U
The 2145 UPS-1U dissipates the maximum heat output that is given in Table 24.
Table 24. Maximum heat output of each 2145 UPS-1U
Model | Heat output per node
Maximum heat output of 2145 UPS-1U during normal operation | 10 W (34 Btu per hour)
Maximum heat output of 2145 UPS-1U during battery operation | 100 W (341 Btu per hour)
SAN Volume Controller 2145-CF8 environment requirements
Before you install a SAN Volume Controller 2145-CF8 node, your physical
environment must meet certain requirements. This includes verifying that adequate
space is available and that requirements for power and environmental conditions
are met.
Input-voltage requirements
Ensure that your environment meets the voltage requirements that are shown in
Table 25.
Table 25. Input-voltage requirements
Voltage | Frequency
200 - 240 V single phase ac | 50 or 60 Hz
Attention:
- If the uninterruptible power supply is cascaded from another uninterruptible
  power supply, the source uninterruptible power supply must have at least three
  times the capacity per phase and the total harmonic distortion must be less than
  5%.
- The uninterruptible power supply also must have input voltage capture that has
  a slew rate of no more than 3 Hz per second.
Power requirements for each node
Ensure that your environment meets the power requirements as shown in Table 26.
Table 26. Power requirements for each node
Components | Power requirements
SAN Volume Controller 2145-CF8 node and 2145 UPS-1U power supply | 200 W
Notes:
- SAN Volume Controller 2145-CF8 nodes cannot connect to all revisions of the
  2145 UPS-1U power supply unit. The SAN Volume Controller 2145-CF8 nodes
  require the 2145 UPS-1U power supply unit part number 31P1318. This unit has
  two power outlets that are accessible. Earlier revisions of the 2145 UPS-1U
  power supply unit have only one power outlet that is accessible and are not
  suitable.
- For each redundant AC-power switch, add 20 W to the power requirements.
- For each high-speed SAS adapter with one to four flash drives, add 50 W to the
  power requirements.
Circuit breaker requirements
The 2145 UPS-1U has an integrated circuit breaker and does not require additional
protection.
Environment requirements without redundant AC power
If you are not using redundant ac power, ensure that your environment falls
within the ranges that are shown in Table 27.
Table 27. Environment requirements without redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 10°C to 35°C (50°F to 95°F) | 0 - 914 m (0 - 2998 ft) | 8% to 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 10°C to 32°C (50°F to 90°F) | 914 - 2133 m (2998 - 6988 ft) | 8% to 80% noncondensing | 23°C (73°F)
Turned off | 10°C to 43°C (50°F to 110°F) | 0 - 2133 m (0 - 6988 ft) | 8% to 80% noncondensing | 27°C (81°F)
Storing | 1°C to 60°C (34°F to 140°F) | 0 - 2133 m (0 - 6988 ft) | 5% to 80% noncondensing | 29°C (84°F)
Shipping | -20°C to 60°C (-4°F to 140°F) | 0 - 10668 m (0 - 34991 ft) | 5% to 100% condensing, but no precipitation | 29°C (84°F)
Environment requirements with redundant AC power
If you are using redundant ac power, ensure that your environment falls within the
ranges that are shown in Table 28.
Table 28. Environment requirements with redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 15°C to 32°C (59°F to 90°F) | 0 - 914 m (0 - 2998 ft) | 20% to 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 15°C to 32°C (59°F to 90°F) | 914 - 2133 m (2998 - 6988 ft) | 20% to 80% noncondensing | 23°C (73°F)
Turned off | 10°C to 43°C (50°F to 110°F) | 0 - 2133 m (0 - 6988 ft) | 20% to 80% noncondensing | 27°C (81°F)
Storing | 1°C to 60°C (34°F to 140°F) | 0 - 2133 m (0 - 6988 ft) | 5% to 80% noncondensing | 29°C (84°F)
Shipping | -20°C to 60°C (-4°F to 140°F) | 0 - 10668 m (0 - 34991 ft) | 5% to 100% condensing, but no precipitation | 29°C (84°F)
Preparing your environment
The following tables list the physical characteristics of the SAN Volume Controller
2145-CF8 node.
Dimensions and weight
Use the parameters that are shown in Table 29 to ensure that space is available in a
rack capable of supporting the node.
Table 29. Dimensions and weight
Height | Width | Depth | Maximum weight
43 mm (1.69 in.) | 440 mm (17.32 in.) | 686 mm (27 in.) | 12.7 kg (28 lb)
Additional space requirements
Ensure that space is available in the rack for the additional space requirements
around the node, as shown in Table 30.
Table 30. Additional space requirements
Location | Additional space requirements | Reason
Left and right sides | 50 mm (2 in.) | Cooling air flow
Back | Minimum: 100 mm (4 in.) | Cable exit
Heat output of each SAN Volume Controller 2145-CF8 node
The node dissipates the maximum heat output that is given in Table 31.
Table 31. Heat output of each SAN Volume Controller 2145-CF8 node
Model | Heat output per node
SAN Volume Controller 2145-CF8 | 160 W (546 Btu per hour)
SAN Volume Controller 2145-CF8 and up to four optional flash drives | 210 W (717 Btu per hour)
Maximum heat output of 2145 UPS-1U during typical operation | 10 W (34 Btu per hour)
Maximum heat output of 2145 UPS-1U during battery operation | 100 W (341 Btu per hour)
Redundant AC-power switch
The redundant AC-power switch is an optional feature that makes the SAN
Volume Controller nodes resilient to the failure of a single power circuit. The
redundant AC-power switch is not a replacement for an uninterruptible power
supply. You must still use an uninterruptible power supply for each node.
Restriction: The redundant AC-power switch does not apply to SAN Volume
Controller 2145-DH8.
You must connect the redundant AC-power switch to two independent power
circuits. One power circuit connects to the main power input port and the other
power circuit connects to the backup power-input port. If the main power to the
SAN Volume Controller node fails for any reason, the redundant AC-power switch
automatically uses the backup power source. When power is restored, the
redundant AC-power switch automatically changes back to using the main power
source.
Place the redundant AC-power switch in the same rack as the SAN Volume
Controller node. The redundant AC-power switch logically sits between the rack
power distribution unit and the 2145 UPS-1U.
You can use a single redundant AC-power switch to power one or two SAN
Volume Controller nodes. If you use the redundant AC-power switch to power two
nodes, the nodes must be in different I/O groups. If the redundant AC-power
switch fails or requires maintenance, both nodes turn off. Because the nodes are in
two different I/O groups, the hosts do not lose access to the back-end disk data.
For maximum resilience to failure, use one redundant AC-power switch to power
each SAN Volume Controller node.
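The sharing rule for the redundant AC-power switch can be checked mechanically when a power layout is planned. The following sketch assumes a hypothetical planning structure that maps each switch name to the nodes it powers; the names and the data layout are illustrative and are not part of any SAN Volume Controller interface.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, int]  # (node name, I/O group number)


def check_switch_assignments(assignments: Dict[str, List[Node]]) -> List[str]:
    """Return a list of rule violations; an empty list means the plan is valid.

    Rules modeled from the text:
    - one switch powers at most two nodes
    - two nodes on the same switch must be in different I/O groups
    """
    problems = []
    for switch, nodes in assignments.items():
        if len(nodes) > 2:
            problems.append(f"{switch}: powers {len(nodes)} nodes; the maximum is two")
        elif len(nodes) == 2 and nodes[0][1] == nodes[1][1]:
            problems.append(f"{switch}: both nodes are in I/O group {nodes[0][1]}")
    return problems


if __name__ == "__main__":
    plan = {
        "switch1": [("node1", 0), ("node3", 1)],  # valid: different I/O groups
        "switch2": [("node2", 0), ("node4", 0)],  # invalid: same I/O group
    }
    for problem in check_switch_assignments(plan):
        print(problem)
```

A plan that gives every node its own switch always passes this check, which matches the recommendation for maximum resilience.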
Figure 37 shows a redundant AC-power switch.
Figure 37. Photo of the redundant AC-power switch
You must properly cable the redundant AC-power switch units in your
environment. See “Cabling of redundant AC-power switch (example)”
for cabling information.
Redundant AC-power environment requirements
Ensure that your physical site meets the installation requirements for the
redundant AC-power switch.
The redundant AC-power switch requires two independent power sources that are
provided through two rack-mounted power distribution units (PDUs). The PDUs
must have IEC 320-C13 outlets.
The redundant AC-power switch comes with two IEC 320-C19 to C14 power cables
to connect to rack PDUs. There are no country-specific cables for the redundant
AC-power switch.
The power cable between the redundant AC-power switch and the 2145 UPS-1U is
rated at 10 A.
Redundant AC-power switch specifications
The following tables list the physical characteristics of the redundant AC-power
switch.
Dimensions and weight
Ensure that space is available in a rack that is capable of supporting the redundant
AC-power switch.
Table 32. Rack space required for redundant AC-power switch
Height | Width | Depth | Maximum weight
43 mm (1.69 in.) | 192 mm (7.56 in.) | 240 mm | 2.6 kg (5.72 lb)
Additional space requirements
Ensure that space is also available in the rack for the side mounting plates on
either side of the redundant AC-power switch.
Table 33. Rack space required for redundant AC-power switch side mounting plates
Location | Width | Reason
Left side | 124 mm (4.89 in.) | Side mounting plate
Right side | 124 mm (4.89 in.) | Side mounting plate
Heat output (maximum)
The maximum heat output that is dissipated inside the redundant AC-power
switch is approximately 20 watts (70 Btu per hour).
Cabling of redundant AC-power switch (example)
You must properly cable the redundant AC-power switch units in your
environment.
Figure 38 shows an example of the main wiring connections for a SAN
Volume Controller clustered system with the redundant AC-power switch feature.
This example is designed to clearly show the cable connections; the components
are not positioned as they would be in a rack. Figure 39 shows a
typical rack installation. The four-node clustered system consists of two I/O
groups:
v I/O group 0 contains nodes A and B
v I/O group 1 contains nodes C and D
Figure 38. A four-node SAN Volume Controller system with the redundant AC-power switch
feature
▌1▐ I/O group 0
▌2▐ SAN Volume Controller node A
▌3▐ 2145 UPS-1U A
▌4▐ SAN Volume Controller node B
▌5▐ 2145 UPS-1U B
▌6▐ I/O group 1
▌7▐ SAN Volume Controller node C
▌8▐ 2145 UPS-1U C
▌9▐ SAN Volume Controller node D
▌10▐ 2145 UPS-1U D
▌11▐ Redundant AC-power switch 1
▌12▐ Redundant AC-power switch 2
▌13▐ Site PDU X (C13 outlets)
▌14▐ Site PDU Y (C13 outlets)
The site PDUs X and Y (▌13▐ and ▌14▐) are powered from two independent power
sources.
In this example, only two redundant AC-power switch units are used, and each
power switch powers one node in each I/O group. However, for maximum
redundancy, use one redundant AC-power switch to power each node in the
system.
Some SAN Volume Controller node types have two power supply units. Both
power supplies must be connected to the same 2145 UPS-1U, as shown by node A
and node B. The SAN Volume Controller 2145-CG8 is an example of a node that
has two power supplies.
Figure 39 shows an eight-node clustered system that is installed in a rack
according to best location practices, with one redundant AC-power switch per
node. The cables between the components are also shown.
[Figure: rack elevation. Labels in the figure: 8U reserved for data wiring patch panel or filler panels; SVC #8 IOGroup 3 Node B, SVC #7 IOGroup 3 Node A, SVC #6 IOGroup 2 Node B, SVC #5 IOGroup 2 Node A, SVC #4 IOGroup 1 Node B, SVC #3 IOGroup 1 Node A, SVC #2 IOGroup 0 Node B, and SVC #1 IOGroup 0 Node A, each with AC and DC connections and separated by 1U filler panels; 1U filler panel or optional 1U monitor; 1U filler panel or optional SSPC server; eight outlet labels that read "ATTENTION: CONNECT ONLY IBM SAN VOLUME CONTROLLERS TO THESE OUTLETS. SEE SAN VOLUME CONTROLLER INSTALLATION GUIDE."; Branch A and Branch B 20 A circuit breakers (12 A maximum); Left Side ePDU, Right Side ePDU, Backup Input, Main Input.]
Figure 39. Rack cabling example.
Uninterruptible power supply
The uninterruptible power supply protects a SAN Volume Controller node against
blackouts, brownouts, and power surges. The uninterruptible power supply
contains a power sensor to monitor the supply and a battery to provide power
until an orderly shutdown of the system can be initiated.
SAN Volume Controller models use the 2145 UPS-1U.
2145 UPS-1U
A 2145 UPS-1U is used exclusively to maintain data that is held in the SAN
Volume Controller dynamic random access memory (DRAM) in the event of an
unexpected loss of external power. This use differs from the traditional
uninterruptible power supply that enables continued operation of the device that it
supplies when power is lost.
With a 2145 UPS-1U, data is saved to the internal disk of the SAN Volume
Controller node. The uninterruptible power supply units are required to power the
SAN Volume Controller nodes even when the input power source is considered
uninterruptible.
Note: The uninterruptible power supply maintains continuous SAN Volume
Controller-specific communications with its attached SAN Volume Controller
nodes. A SAN Volume Controller node cannot operate without the uninterruptible
power supply. The uninterruptible power supply must be used in accordance with
documented guidelines and procedures and must not power any equipment other
than a SAN Volume Controller node.
2145 UPS-1U operation
Each SAN Volume Controller node monitors the operational state of the
uninterruptible power supply to which it is attached.
If the 2145 UPS-1U reports a loss of input power, the SAN Volume Controller node
stops all I/O operations and dumps the contents of its dynamic random access
memory (DRAM) to the internal disk drive. When input power to the 2145 UPS-1U
is restored, the SAN Volume Controller node restarts and restores the original
contents of the DRAM from the data saved on the disk drive.
A SAN Volume Controller node is not fully operational until the 2145 UPS-1U
battery state indicates that it has sufficient charge to power the SAN Volume
Controller node long enough to save all of its memory to the disk drive. In the
event of a power loss, the 2145 UPS-1U has sufficient capacity for the SAN Volume
Controller to save all its memory to disk at least twice. For a fully charged 2145
UPS-1U, even after battery charge has been used to power the SAN Volume
Controller node while it saves dynamic random access memory (DRAM) data,
sufficient battery charge remains so that the SAN Volume Controller node can
become fully operational as soon as input power is restored.
Important: Do not shut down a 2145 UPS-1U without first shutting down the SAN
Volume Controller node that it supports. Data integrity can be compromised by
pushing the 2145 UPS-1U on/off button when the node is still operating. However,
in the case of an emergency, you can manually shut down the 2145 UPS-1U by
pushing the 2145 UPS-1U on/off button when the node is still operating. Service
actions must then be performed before the node can resume normal operations. If
multiple uninterruptible power supply units are shut down before the nodes they
support, data can be corrupted.
Connecting the 2145 UPS-1U to the SAN Volume Controller
To provide redundancy and concurrent maintenance, you must install the SAN
Volume Controller nodes in pairs.
For connection to the 2145 UPS-1U, each SAN Volume Controller of a pair must be
connected to only one 2145 UPS-1U.
Note: A clustered system can contain no more than eight SAN Volume Controller
nodes. The 2145 UPS-1U must be attached to a source that is both single phase and
200-240 V. The 2145 UPS-1U has an integrated circuit breaker and does not need
external protection.
SAN Volume Controller provides a cable bundle for connecting the uninterruptible
power supply to a node. This cable is used to connect both power supplies of a
node to the same uninterruptible power supply.
Dual-power cable plus serial cable:
v SAN Volume Controller 2145-CF8
v SAN Volume Controller 2145-CG8
The SAN Volume Controller software determines whether the input voltage to the
uninterruptible power supply is within range and sets an appropriate voltage
alarm range on the uninterruptible power supply. The software continues to
recheck the input voltage every few minutes. If it changes substantially but
remains within the permitted range, the alarm limits are readjusted.
Note: The 2145 UPS-1U is equipped with a cable retention bracket that keeps the
power cable from disengaging from the rear panel. See the related documentation
for more information.
2145 UPS-1U controls and indicators
All controls and indicators for the 2145 UPS-1U are located on the front-panel
assembly.
Figure 40. 2145 UPS-1U front-panel assembly
▌1▐ Load segment 2 indicator
▌2▐ Load segment 1 indicator
▌3▐ Alarm or service indicator
▌4▐ On-battery indicator
▌5▐ Overload indicator
▌6▐ Power-on indicator
▌7▐ On/off button
▌8▐ Test and alarm reset button
Load segment 2 indicator:
The load segment 2 indicator on the 2145 UPS-1U is lit (green) when power is
available to load segment 2.
When the load segment 2 indicator is green, the 2145 UPS-1U is running normally
and power is available to this segment.
Load segment 1 indicator:
The load segment 1 indicator on the 2145 UPS-1U is not currently used by the
SAN Volume Controller.
Note: When the 2145 UPS-1U is configured by the SAN Volume Controller, this
load segment is disabled. During normal operation, the load segment 1 indicator is
off. A “Do not use” label covers the receptacles.
Alarm indicator:
If the alarm on the 2145 UPS-1U is flashing red, maintenance is required.
If the alarm is on, go to the 2145 UPS-1U MAP to resolve the problem.
On-battery indicator:
The amber on-battery indicator is on when the 2145 UPS-1U is powered by the
battery. This indicates that the main power source has failed.
If the on-battery indicator is on, go to the 2145 UPS-1U MAP to resolve the
problem.
Overload indicator:
The overload indicator lights up when the capacity of the 2145 UPS-1U is
exceeded.
If the overload indicator is on, go to MAP 5250: 2145 UPS-1U repair verification to
resolve the problem.
Power-on indicator:
The power-on indicator is lit when the 2145 UPS-1U is functioning.
When the power-on indicator is a steady green, the 2145 UPS-1U is active.
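The indicator descriptions above amount to a small lookup from a lit fault indicator to the maintenance analysis procedure (MAP) to follow. The following sketch captures that lookup. The indicator keys and the function name are hypothetical labels chosen for illustration; the actions are the ones named in the indicator descriptions.

```python
# Map each fault indicator to the action named in its description.
# The dictionary keys are illustrative labels, not product identifiers.
INDICATOR_ACTIONS = {
    "alarm": "Go to the 2145 UPS-1U MAP",
    "on-battery": "Go to the 2145 UPS-1U MAP",
    "overload": "Go to MAP 5250: 2145 UPS-1U repair verification",
}


def recommended_actions(lit_indicators):
    """Return the distinct actions for the fault indicators that are lit.

    Indicators that signal normal operation (for example, power-on or
    load segment 2) have no entry and therefore produce no action.
    """
    actions = []
    for name in ("alarm", "on-battery", "overload"):
        if name in lit_indicators:
            action = INDICATOR_ACTIONS[name]
            if action not in actions:
                actions.append(action)
    return actions


if __name__ == "__main__":
    print(recommended_actions({"power-on", "on-battery", "overload"}))
```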
On or off button:
The on or off button turns the power on or off for the 2145 UPS-1U.
Turning on the 2145 UPS-1U
After you connect the 2145 UPS-1U to the outlet, it remains in standby mode until
you turn it on. Press and hold the on or off button until the power-on indicator is
illuminated (approximately five seconds). On some versions of the 2145 UPS-1U,
you might need a pointed device, such as a screwdriver, to press the on or off
button. A self-test is initiated that takes approximately 10 seconds, during which
time the indicators are turned on and off several times. The 2145 UPS-1U then
enters normal mode.
Turning off the 2145 UPS-1U
Press and hold the on or off button until the power-on light is extinguished
(approximately five seconds). On some versions of the 2145 UPS-1U, you might
need a pointed device, such as a screwdriver, to press the on or off button. This
places the 2145 UPS-1U in standby mode. You must then unplug the 2145 UPS-1U
to turn off the unit.
Attention: Do not turn off the uninterruptible power supply before you shut
down the SAN Volume Controller node that it is connected to. Always follow the
instructions that are provided in MAP 5350 to perform an orderly shutdown of a
SAN Volume Controller node.
Test and alarm reset button:
Use the test and alarm reset button to start the self-test.
To start the self-test, press and hold the test and alarm reset button for three
seconds. This button also resets the alarm.
2145 UPS-1U connectors and switches
The 2145 UPS-1U has external connectors and dip switches.
Locations for the 2145 UPS-1U connectors and switches
Figure 41 shows the location of the connectors and switches on the 2145 UPS-1U.
Figure 41. 2145 UPS-1U connectors and switches
▌1▐ Main power connector
▌2▐ Communication port
▌3▐ Dip switches
▌4▐ Load segment 1 receptacles
▌5▐ Load segment 2 receptacles
2145 UPS-1U dip switches
Figure 42 shows the dip switches, which can be used to configure the
input and output voltage ranges. Because this function is performed by the SAN
Volume Controller software, both switches must be left in the OFF position.
Figure 42. 2145 UPS-1U dip switches
2145 UPS-1U ports not used
The 2145 UPS-1U is equipped with ports that are not used by the SAN Volume
Controller and have not been tested. Use of these ports, in conjunction with the
SAN Volume Controller or any other application that might be used with the SAN
Volume Controller, is not supported. Figure 43 shows the 2145 UPS-1U ports that
are not used.
Figure 43. Ports not used by the 2145 UPS-1U
▌1▐ USB interface port
▌2▐ Network ports
▌3▐ Load segment receptacles
2145 UPS-1U power connector
Figure 44 shows the power connector for the 2145 UPS-1U.
[Figure: connector labels Neutral, Ground, and Live.]
Figure 44. Power connector
Uninterruptible power-supply environment requirements
An uninterruptible power-supply environment requires that certain specifications
for the physical site of the SAN Volume Controller must be met.
2145 UPS-1U environment
All SAN Volume Controller models, except for the 2145-DH8, are supported by the
2145 UPS-1U. The 2145-DH8 has two batteries that prevent it from losing the
system state if power is lost.
2145 UPS-1U specifications
The following tables describe the physical characteristics of the 2145 UPS-1U.
2145 UPS-1U dimensions and weight
Ensure that space is available in a rack that is capable of supporting the 2145
UPS-1U.
Table 34. Rack space required for the 2145 UPS-1U
Height | Width | Depth | Maximum weight
44 mm (1.73 in.) | 439 mm (17.3 in.) | 579 mm (22.8 in.) | 16 kg (35.3 lb)
Note: The 2145 UPS-1U package, which includes support rails, weighs 18.8 kg (41.4 lb).
Heat output
The 2145 UPS-1U unit produces the following approximate heat output.
Table 35. Heat output of the 2145 UPS-1U
Model | Heat output during normal operation | Heat output during battery operation
2145 UPS-1U | 10 W (34 Btu per hour) | 150 W (512 Btu per hour)
Defining the SAN Volume Controller FRUs
The SAN Volume Controller node, redundant AC-power switch, and
uninterruptible power supply each consist of one or more field-replaceable units
(FRUs).
SAN Volume Controller FRUs
The SAN Volume Controller nodes each consist of several field-replaceable units
(FRUs), such as the Fibre Channel adapter, service controller, disk drive,
microprocessor, memory module, CMOS battery, power supply assembly, fan
assembly, and the operator-information panel.
SAN Volume Controller 2145-DH8 FRUs
Refer to the SAN Volume Controller 2145-DH8 parts topic for the list of SAN
Volume Controller 2145-DH8 FRUs.
SAN Volume Controller 2145-24F FRUs
The following tables provide information about the SAN Volume Controller 2145-24F
expansion enclosure parts and SAS drives.
Table 36 lists the FRUs for the SAN Volume Controller 2145-24F
expansion enclosure.
Table 36. FRUs for SAN Volume Controller 2145-24F expansion enclosure
Description | FRU part number
Expansion enclosure midplane assembly | 64P8445
Expansion canister | 64P8448
Expansion enclosure power supply unit (PSU), AC | 98Y2218
Small-form factor SSD, 200 GB | 31P1818
Small-form factor SSD, 400 GB | 31P1819
Small-form factor SSD, 800 GB | 31P1820
Small-form factor drive, blank expansion | 45W8680
Rack rail kit, expansion enclosure | 64P8449
Left enclosure bezel, expansion enclosure | 00Y2450
Right enclosure bezel, SFF expansion enclosure | 00Y2512
Table 37 lists the FRUs for the SAN Volume Controller 2145-24F SAS drive units.
Table 37. FRU part for the SAN Volume Controller 2145-24F SAS drive units
Description | FRU part number
Mini SAS HD to mini SAS HD, 1.5 m, 12 Gb. | 00AR311
Mini SAS HD to mini SAS HD, 3.0 m, 12 Gb. | 00AR317
SAN Volume Controller 2145-CG8 FRUs
Table 38 provides a brief description of each SAN Volume Controller 2145-CG8
FRU.
Table 38. SAN Volume Controller 2145-CG8 FRU descriptions
FRU | Description
System board | The system board for the SAN Volume Controller 2145-CG8 node.
Short-wave small form-factor pluggable (SFP) transceiver | A compact optical transceiver that provides the optical interface to a Fibre Channel cable. The transceiver is capable of autonegotiating 2, 4, or 8 gigabits-per-second (Gbps) short-wave optical connection on the four-port Fibre Channel adapter. Note: It is possible that small form-factor pluggable (SFP) transceivers, other than those transceivers shipped with the product, are in use on the Fibre Channel host bus adapter. It is a customer responsibility to obtain replacement parts for such SFP transceivers. The FRU part number is shown as "Non standard - supplied by customer" in the vital product data.
Long-wave small form-factor pluggable (SFP) transceiver | A compact optical transceiver that provides the optical interface to a Fibre Channel cable. The transceiver is capable of autonegotiating 2, 4, or 8 Gbps long-wave optical connection on the 4-port Fibre Channel adapter. Note: It is possible that SFP transceivers other than those shipped with the product are in use on the Fibre Channel host bus adapter. It is a customer responsibility to obtain replacement parts for such SFP transceivers. The FRU part number is shown as "Non standard - supplied by customer" in the vital product data.
Four-port Fibre Channel host bus adapter (HBA) | The SAN Volume Controller 2145-CG8 is connected to the Fibre Channel fabric through the Fibre Channel HBA, which is in PCI slot 1. The adapter assembly includes the Fibre Channel PCI Express adapter, four short-wave SFP transceivers, the riser card, and bracket.
Service controller | The unit that provides the service functions and the front panel display and buttons.
Service controller cable | The USB cable that is used to connect the service controller to the system board.
Disk drive | The serial-attached SCSI (SAS) 2.5" disk drive.
Disk signal cable | A 200 mm SAS disk-signal cable.
Disk power cable | The power cable for the 2.5" SAS system disk.
Disk controller | A SAS controller card for the SAS 2.5" disk drive.
USB riser card for the disk controller | The riser card that connects the disk controller to the system board and provides the USB port to which the service controller cable connects.
Disk backplane | The hot-swap SAS 2.5" disk drive backplane.
Memory module | An 8 GB DDR3-1333 2RX4 LP RDIMM memory module.
Microprocessor | The microprocessor on the system board: a 2.53 GHz quad-core microprocessor.
Power supply unit | An assembly that provides dc power to the node.
CMOS battery | A 3.0 V battery on the system board that maintains power to back up the system BIOS settings.
Operator-information panel | The information panel that includes the power-control button and LEDs that indicate system-board errors, hard disk drive activity, and power status.
Operator-information panel cable | A cable that connects the operator-information panel to the system board.
Fan assembly | A fan assembly that is used in all the fan positions.
Power cable assembly | The cable assembly that connects the SAN Volume Controller and the 2145 UPS-1U. The assembly consists of two power cables and a serial cable that are bundled together.
Blank drive bay filler assembly | A blank drive bay filler assembly.
Alcohol wipe | A cleaning wipe.
Thermal grease | Grease that is used to provide a thermal seal between a processor and a heat sink.
SAN Volume Controller 2145-CF8 FRUs
Table 39 provides a brief description of each SAN Volume Controller 2145-CF8
FRU.
Table 39. SAN Volume Controller 2145-CF8 FRU descriptions
FRU
Description
System board
The system board for the SAN Volume
Controller 2145-CF8 node.
Fibre Channel small form-factor pluggable
(SFP) transceiver
A compact optical transceiver that provides
the optical interface to a Fibre Channel
cable. The transceiver is capable of
autonegotiating 2, 4, or 8 Gbps short-wave
optical connection on the 4-port Fibre
Channel adapter.
Note: It is possible that SFPs other than
those shipped with the product are in use
on the Fibre Channel host bus adapter. It is
a customer responsibility to obtain
replacement parts for such SFP transceivers.
The FRU part number is shown as "Non
standard - supplied by customer" in the vital
product data.
Four-port Fibre Channel host bus adapter
(HBA)
The SAN Volume Controller 2145-CF8 is
connected to the Fibre Channel fabric
through the Fibre Channel HBA, which is in
PCI slot 1. The adapter assembly includes
the Fibre Channel PCI Express adapter, four
short-wave SFP transceivers, the riser card,
and bracket.
Service controller
The unit that provides the service functions
and the front panel display and buttons.
Service controller cable
The USB cable that is used to connect the
service controller to the system board.
Disk drive
The serial-attached SCSI (SAS) 2.5" disk
drive.
Disk signal cable
A 200 mm SAS disk-signal cable.
Disk power cable
A SAS disk-power cable.
Disk controller
A SAS controller card for the SAS 2.5" disk
drive.
Disk controller / USB riser card
The riser card that connects the disk
controller to the system board and provides
the USB port to which the service controller
cable connects.
Disk backplane
The hot-swap SAS 2.5" disk drive backplane.
Memory module
A 4 GB DDR3-1333 2RX4 LP RDIMM
memory module.
Microprocessor
The microprocessor on the system board: a
2.40 GHz quad-core microprocessor.
Power supply unit
An assembly that provides dc power to the
SAN Volume Controller 2145-CF8 node.
CMOS battery
A 3.0 V battery on the system board that
maintains power to back up the system
BIOS settings.
Operator-information panel
The information panel that includes the
power-control button and LEDs that indicate
system-board errors, hard disk drive activity,
and power status.
Operator-information panel cable
A cable that connects the
operator-information panel to the system
board.
Fan assembly
A fan assembly that is used in all the fan
positions.
Power cable assembly
The cable assembly that connects the SAN
Volume Controller and the 2145 UPS-1U.
The assembly consists of two power cables
and a serial cable that are bundled together.
Alcohol wipe
A cleaning wipe.
Thermal grease
Grease that is used to provide a thermal seal
between a processor and a heat sink.
Ethernet feature FRUs
Table 40 provides a brief description of each Ethernet feature FRU.
Table 40. Ethernet feature FRU descriptions
FRU
Description
10 Gbps Ethernet adapter
A 10 Gbps Ethernet adapter.
10 Gbps Ethernet fiber SFP
A 10 Gbps Ethernet fiber SFP.
Flash drive feature FRUs
Table 41 provides a brief description of each flash drive feature FRU.
Table 41. Flash drive feature FRU descriptions
FRU
Description
High-speed SAS adapter
An assembly that includes a high-speed SAS
adapter that provides connectivity for up to
four flash drives. The assembly also contains
a riser card, a blanking plate, and screws.
High-speed SAS cable
The cable that is used to connect the
high-speed SAS adapter to the disk
backplane.
Flash drive
A flash drive, in carrier assembly; different
drive sizes are available.
2145 UPS-1U FRUs
Table 42 provides a brief description of each 2145 UPS-1U FRU.
Table 42. 2145 UPS-1U FRU descriptions
FRU
Description
2145 UPS-1U assembly
An uninterruptible power-supply assembly
for use with the SAN Volume Controller.
Battery pack assembly
The battery that provides backup power to
the SAN Volume Controller if a power
failure occurs.
Power cable, PDU to 2145 UPS-1U
Input power cable for connecting the 2145
UPS-1U to a rack power distribution unit.
Power cable, mains to UPS-1U (US)
Input power cable for connecting the 2145
UPS-1U to mains power (United States
only).
Redundant AC-power switch FRUs
The redundant AC-power switch consists of a single field replaceable unit (FRU).
FRU
Description
Redundant AC-power switch
assembly
The redundant AC-power switch and its input power
cables.
Chapter 3. SAN Volume Controller user interfaces for
servicing your system
The SAN Volume Controller provides a number of user interfaces to troubleshoot,
recover, or maintain your system. The interfaces provide various sets of facilities to
help resolve situations that you might encounter.
v Use the management GUI to monitor and maintain the configuration of storage
that is associated with your clustered systems.
v Complete service procedures from the service assistant.
v Use the command-line interface (CLI) to manage your system.
v The front panel on the node provides an alternative service interface.
Note: The front panel display is replaced by a technician port on some models.
Management GUI interface
The management GUI is a browser-based GUI for configuring and managing all
aspects of your system. It provides extensive facilities to help troubleshoot and
correct problems.
About this task
You use the management GUI to manage and service your system. The Monitoring
> Events panel provides access to problems that must be fixed and maintenance
procedures that step you through the process of correcting the problem.
The information on the Events panel can be filtered three ways:
Recommended action (default)
Shows only the alerts that require attention and have an associated fix
procedure. Alerts are listed in priority order and should be fixed
sequentially by using the available fix procedures. For each problem that is
selected, you can:
v Run a fix procedure.
v View the properties.
Unfixed messages and alerts
Displays only the alerts and messages that are not fixed. For each entry
that is selected, you can:
v Run a fix procedure on any alert with an error code.
v Mark an event as fixed.
v Filter the entries to show them by specific minutes, hours, or dates.
v Reset the date filter.
v View the properties.
Show all
Displays all event types whether they are fixed or unfixed. For each entry
that is selected, you can:
v Run a fix procedure on any alert with an error code.
v Mark an event as fixed.
v Filter the entries to show them by specific minutes, hours, or dates.
v Reset the date filter.
v View the properties.
Some events require a certain number of occurrences in 25 hours before they are
displayed as unfixed. If they do not reach this threshold in 25 hours, they are
flagged as expired. Monitoring events are below the coalesce threshold and are
usually transient.
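The threshold behavior can be pictured with the following minimal Python sketch. It is a simplified model only: the function name, the per-event threshold value, and the choice to start the 25-hour window at the first occurrence are assumptions made for illustration, and the system's internal coalescing logic might differ.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=25)  # the 25-hour window described above


def classify_event(occurrences, threshold, now):
    """Classify one event type as 'unfixed', 'expired', or 'monitoring'.

    occurrences: datetimes at which the event was logged
    threshold:   hypothetical occurrence count required within the window
    now:         the time at which the classification is made
    """
    if not occurrences:
        return "monitoring"
    times = sorted(occurrences)
    first = times[0]
    # Count only the occurrences inside the window that starts at the
    # first occurrence (an assumption of this sketch).
    in_window = [t for t in times if t - first <= WINDOW]
    if len(in_window) >= threshold:
        return "unfixed"
    if now - first > WINDOW:
        return "expired"
    return "monitoring"


start = datetime(2015, 1, 1)
print(classify_event([start, start + timedelta(hours=1)], 2,
                     start + timedelta(hours=2)))
print(classify_event([start], 2, start + timedelta(hours=26)))
```

In this model, an event that has not yet reached its threshold and is still inside the window is treated as a monitoring event.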
You can also sort events by time or error code. When you sort by error code, the
most serious events, those with the lowest numbers, are displayed first. You can
select any event that is listed and select Actions > Properties to view details about
the event.
v Recommended Actions. For each problem that is selected, you can:
– Run a fix procedure.
– View the properties.
v Event log. For each entry that is selected, you can:
– Run a fix procedure.
– Mark an event as fixed.
– Filter the entries to show them by specific minutes, hours, or dates.
– Reset the date filter.
– View the properties.
When to use the management GUI
The management GUI is the primary tool that is used to service your system.
Regularly monitor the status of the system using the management GUI. If you
suspect a problem, use the management GUI first to diagnose and resolve the
problem.
Use the views that are available in the management GUI to verify the status of the
system, the hardware devices, the physical storage, and the available volumes. The
Monitoring > Events panel provides access to all problems that exist on the
system. Use the Recommended Actions filter to display the most important events
that need to be resolved.
If there is a service error code for the alert, you can run a fix procedure that assists
you in resolving the problem. These fix procedures analyze the system and provide
more information about the problem. They suggest actions to take and step you
through the actions that automatically manage the system where necessary. Finally,
they check that the problem is resolved.
If there is an error that is reported, always use the fix procedures within the
management GUI to resolve the problem. Always use the fix procedures for both
system configuration problems and hardware failures. The fix procedures analyze
the system to ensure that the required changes do not cause volumes to be
inaccessible to the hosts. The fix procedures automatically perform configuration
changes that are required to return the system to its optimum state.
Accessing the management GUI
To view events, you must access the management GUI.
About this task
You must use a supported web browser. For a list of supported browsers, refer to
the “Web browser requirements to access the management GUI” topic.
You can use the management GUI to manage your system as soon as you have
created a clustered system.
Procedure
1. Start a supported web browser and point the browser to the management IP
address of your system.
The management IP address is set when the clustered system is created. Up to
four addresses can be configured for your use. There are two addresses for
IPv4 access and two addresses for IPv6 access. When the connection is
successful, you will see a login panel.
2. Log on by using your user name and password.
3. When you have logged on, select Monitoring > Events.
4. Ensure that the events log is filtered using Recommended actions.
5. Select the recommended action and run the fix procedure.
6. Continue to work through the alerts in the order suggested, if possible.
Results
After all the alerts are fixed, check the status of your system to ensure that it is
operating as intended.
Deleting a node from a clustered system using the
management GUI
Remove a node from a system if the node has failed and is being replaced with a
new node or if the repair that has been performed has caused that node to be
unrecognizable by the system.
Before you begin
The cache on the selected node is flushed before the node is taken offline. In some
circumstances, such as when the system is already degraded (for example, when
both nodes in the I/O group are online and the volumes within the I/O group are
degraded), the system ensures that data loss does not occur as a result of deleting
the only node with the cache data. If a failure occurs on the other node in the I/O
group, the cache is flushed before the node is removed to prevent data loss.
Before deleting a node from the system, record the node serial number, worldwide
node name (WWNN), all worldwide port names (WWPNs), and the I/O group
that the node is currently part of. If the node is re-added to the system at a later
time, recording this node information can avoid data corruption.
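One way to keep this information is in a small structured record, as in the following Python sketch. The class name, the field names, and the WWPN values are hypothetical; the serial number and WWNN are borrowed from the sample command output later in this chapter, and the I/O group name is only an example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NodeRecord:
    """Details to record before a node is deleted from the system."""
    serial_number: str
    wwnn: str
    io_group: str
    wwpns: list = field(default_factory=list)


def to_json(record):
    # Serialize the record so that it can be stored with other site notes.
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text):
    # Rebuild the record when the node is re-added at a later time.
    return NodeRecord(**json.loads(text))


record = NodeRecord(
    serial_number="KD0N8AM",
    wwnn="500507680C007B00",
    io_group="io_grp0",
    wwpns=["500507680C117B00", "500507680C127B00"],  # hypothetical values
)
saved = to_json(record)
restored = from_json(saved)
print(restored.io_group)
```

Keeping the I/O group with the WWNN makes it easier to add the node back to the same I/O group that it was previously a member of.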
Attention:
v If you are removing a single node and the remaining node in the I/O group is
online, the data on the remaining node goes into write-through mode. This data
can be exposed to a single point of failure if the remaining node fails.
v If the volumes are already degraded before you remove a node, redundancy to
the volumes is degraded. Removing a node might result in a loss of access to
data and data loss.
v Removing the last node in the system destroys the system. Before you remove
the last node in the system, ensure that you want to destroy the system.
v When you remove a node, you remove all redundancy from the I/O group. As a
result, new or existing failures can cause I/O errors on the hosts. The following
failures can occur:
– Host configuration errors
– Zoning errors
– Multipathing-software configuration errors
v If you are deleting the last node in an I/O group and there are volumes that are
assigned to the I/O group, you cannot remove the node from the system if the
node is online. You must back up or migrate all data that you want to save
before you remove the node. If the node is offline, you can remove the node.
v When you remove the configuration node, the configuration function moves to a
different node within the system. This process can take a short time, typically
less than a minute. The management GUI reattaches to the new configuration
node transparently.
v If you turn the power on to the node that has been removed and it is still
connected to the same fabric or zone, it attempts to rejoin the system. The
system tells the node to remove itself from the system and the node becomes a
candidate for addition to this system or another system.
v If you are adding this node into the system, ensure that you add it to the same
I/O group that it was previously a member of. Failure to do so can result in
data corruption.
This task assumes that you have already accessed the management GUI.
About this task
Complete the following steps to remove a node from a system:
Procedure
1. Select Monitoring > System.
2. Right-click the node that you want to remove and select Remove.
If the node that you want to remove is shown as Offline, then the node is not
participating in the system.
If the node that you want to remove is shown as Online, deleting the node can
cause the dependent volumes to go offline. Verify whether the node has
any dependent volumes.
3. To check for dependent volumes before attempting to remove the node,
right-click the node and select Show Dependent Volumes.
If any volumes are listed, determine why and if access to the volumes is
required while the node is removed from the system. If the volumes are
assigned from storage pools that contain flash drives that are located in the
node, check why the volume mirror, if it is configured, is not synchronized.
There can also be dependent volumes because the partner node in the I/O
group is offline. Fabric issues can also prevent the volume from communicating
with the storage systems. Resolve these problems before continuing with the
node removal.
4. Click Remove.
5. Click Yes to remove the node. Before a node is removed, the system checks to
determine if there are any volumes that depend on that node. If the node that
you selected contains volumes within the following situations, the volumes go
offline and become unavailable if the node is removed:
v The node contains flash drives and also contains the only synchronized copy
of a mirrored volume
v The other node in the I/O group is offline
If you select a node to remove that has these dependencies, another panel
is displayed to confirm the removal.
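The two dependency conditions in step 5 can be expressed as a short check. The following Python sketch is illustrative only; the function and parameter names are hypothetical and do not correspond to any system command.

```python
def offline_reasons(node_has_flash_drives, holds_only_synced_copy, partner_online):
    """List the reasons why removing a node would take volumes offline."""
    reasons = []
    # Condition 1: flash drives in the node hold the only synchronized
    # copy of a mirrored volume.
    if node_has_flash_drives and holds_only_synced_copy:
        reasons.append("node holds the only synchronized copy of a mirrored volume")
    # Condition 2: the other node in the I/O group is offline.
    if not partner_online:
        reasons.append("other node in the I/O group is offline")
    return reasons


print(offline_reasons(True, True, True))
print(offline_reasons(False, False, True))
```

An empty list corresponds to a removal that does not trigger the additional confirmation panel.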
Adding a node to a system
You can add a node to the system using the CLI or management GUI. A node can
be added to the system if the node previously failed and is being replaced with a
new node or if a repair action has caused the node to be unrecognizable by the
system. When adding nodes, ensure that they are added in pairs to create a full
I/O group. Adding a node to the system increases the capacity of the entire
system.
Some models might require that you use the front panel to verify that the
new node has been added correctly.
Before you add a node to a system, you must make sure that the switch zoning is
configured such that the node being added is in the same zone as all other nodes
in the system. If you are replacing a node and the switch is zoned by worldwide
port name (WWPN) rather than by switch port, make sure that the switch is
configured such that the node being added is in the same VSAN or zone.
Considerations when adding a node to a system
If you are adding a node that has been used previously, either within a different
I/O group within this system or within a different system, take into account that if
you add a node without changing its worldwide node name (WWNN), hosts
might detect the node and use it as if it were in its old location. This action might
cause the hosts to access the wrong volumes.
v You must ensure that the model type of the new node is supported by the
software level that is currently installed on the system. If the model type is not
supported by the software level, update the system to a software level that
supports the model type of the new node.
v Each node in an I/O group must be connected to a different uninterruptible
power supply.
v If you are re-adding a node back to the same I/O group after a service action
required the node to be deleted from the system and the physical node has not
changed, no special procedures are required and the node can be added back to
the system.
v If you are replacing a node in a system either because of a node failure or an
update, you must change the WWNN of the new node to match that of the
original node before you connect the node to the Fibre Channel network and
add the node to the system.
v If you are adding a node to the SAN again, ensure that you are adding the node
to the same I/O group from which it was removed. Failure to do this action can
result in data corruption. You must use the information that was recorded when
the node was originally added to the system. If you do not have access to this
information, contact the support center to add the node back into the system
without corrupting the data.
v For each external storage system, the LUNs that are presented to the ports on
the new node must be the same as the LUNs that are presented to the nodes
that currently exist in the system. You must ensure that the LUNs are the same
before you add the new node to the system.
v If you are creating an I/O group in the system and are adding a new node,
there are no special procedures because this node was never added to a system
and the WWNN for the node did not exist.
v If you are creating an I/O group in the system and are adding a new node that
has been added to a system before, the host system might still be configured to
the node WWPNs and the node might still be zoned in the fabric. Because you
cannot change the WWNN for the node, you must ensure that other components
in your fabric are configured correctly. Verify that any host that was previously
configured to use the node has been correctly updated.
v If the node that you are adding was previously replaced, either for a node repair
or update, you might have used the WWNN of that node for the replacement
node. Ensure that the WWNN of this node was updated so that you do not have
two nodes with the same WWNN attached to your fabric. Also ensure that the
WWNN of the node that you are adding is not 00000. If it is 00000, contact your
support representative.
v If the new node supports encryption, you must ensure that the following
requirements are true before the node can be added:
– The node is licensed to use encryption. When the node is added in the
management GUI, you are asked to activate the license for each node that is
detected that supports encryption. An authorization code is sent with license
documentation and must be used to activate the license. Retain all license
documentation for your records.
– The node is running a software level that supports encryption.
v If you are adding the new node to a system with either a HyperSwap or
stretched system topology, you must assign the node to a specific site.
Considerations when using multipathing device drivers
v Applications on the host systems direct I/O operations to file systems or logical
volumes that are mapped by the operating system to virtual paths (vpaths),
which are pseudo disk objects that are supported by the multipathing device
drivers. Multipathing device drivers maintain an association between a vpath
and a volume. This association uses an identifier (UID) which is unique to the
volume and is never reused. The UID allows multipathing device drivers to
directly associate vpaths with volumes.
v Multipathing device drivers operate within a protocol stack that contains disk
and Fibre Channel device drivers that are used to communicate with the system
using the SCSI protocol over Fibre Channel as defined by the ANSI FCS
standard. The addressing scheme that is provided by these SCSI and Fibre
Channel device drivers uses a combination of a SCSI logical unit number (LUN)
and the worldwide node name (WWNN) for the Fibre Channel node and ports.
v If an error occurs, the error recovery procedures (ERPs) operate at various tiers
in the protocol stack. Some of these ERPs cause I/O to be redriven using the
same WWNN and LUN numbers that were previously used.
v Multipathing device drivers do not check the association of the volume with the
vpath on every I/O operation that they perform.
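The following Python sketch illustrates why these considerations matter. It models the fabric as a lookup from a (WWNN, LUN) address to a volume UID, which is a deliberate simplification; the UID values and all names in the sketch are hypothetical.

```python
def io_reaches_expected_volume(fabric, address, expected_uid):
    """Return True if the (WWNN, LUN) address still leads to the expected volume."""
    return fabric.get(address) == expected_uid


# The host addresses I/O by WWNN and LUN. The vpath was associated with
# the volume UID when the path was configured.
host_address = ("500507680100E85F", 0)
vpath_uid = "UID-A"  # hypothetical volume UID

# Before the node is moved, the address leads to the expected volume.
fabric_before = {host_address: "UID-A"}

# After the node is added elsewhere without a WWNN change, the same
# address can lead to a different volume.
fabric_after = {host_address: "UID-B"}

print(io_reaches_expected_volume(fabric_before, host_address, vpath_uid))
print(io_reaches_expected_volume(fabric_after, host_address, vpath_uid))
```

Because redriven I/O reuses the same WWNN and LUN and the association is not checked on every operation, the mismatch in the second case would not necessarily be detected by the host.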
After the new node is zoned and cabled correctly to the existing system, you can
use either the addnode command or the Add Node wizard in the management
GUI. To access the Add Node wizard, select Monitoring > System. On the image,
click the new node to launch the wizard. Complete the wizard and verify the new
node. If the new node is not displayed in the image, it indicates a potential cabling
issue. Check the installation information to ensure that your node was cabled
correctly.
To add a SAN Volume Controller 2145-DH8 node to a system using the
command-line interface, complete these steps:
1. Enter this command to verify that the node is detected on the fabric:
svcinfo lsnodecandidate
This example shows the output for this command:
# svcinfo lsnodecandidate
id               panel_name UPS_serial_number UPS_unique_id    hardware serial_number product_mtm machine_signature
500507680C007B00 KD0N8AM                      500507680C007B00 DH8      KD0N8AM       2145-DH8    0123-4567-89AB-CDEF
The id parameter displays the WWNN for the node. If the node is not detected,
verify cabling to the node.
2. Enter this command to determine the I/O group where the node should be
added:
lsiogrp
3. Record the name or ID of the first I/O group that has a node count of zero (0).
You will need the name or ID for the next step. Note: You only need to do this
step for the first node that is added. The second node of the pair uses the same
I/O group number.
4. Enter this command to add the node to the system:
addnode -wwnodename WWNN -iogrp iogrp_name -name new_name_arg -site site_name
Where WWNN is the WWNN of the node, iogrp_name is the name or ID of the I/O
group that you want to add the node to, and new_name_arg is the name that you
want to assign to the node. If you do not specify a new node name, a default
name is assigned; however, it is recommended that you specify a meaningful
name. The site_name specifies the name of the site location of the new node.
This parameter is only required if the topology is a HyperSwap or stretched
system.
Note: Adding the node might take a considerable amount of time.
5. Record this information for future reference:
v Serial number.
v Worldwide node name.
v All of the worldwide port names.
v The name or ID of the I/O group.
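Steps 2 through 4 can be sketched in Python as follows. The sample lsiogrp text is an assumed, illustrative layout and not captured output, and the function names and the node name node2 are hypothetical. The sketch only builds the addnode command string with the parameters shown in step 4; it does not run anything on the system.

```python
# Illustrative lsiogrp-style listing (assumed column layout).
SAMPLE_LSIOGRP = """\
id name            node_count vdisk_count host_count
0  io_grp0         2          0           0
1  io_grp1         0          0           0
2  io_grp2         0          0           0
"""


def parse_table(text):
    """Split whitespace-delimited output into one dict per row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def first_empty_io_group(rows):
    """Return the name of the first I/O group with a node count of zero."""
    for row in rows:
        if int(row["node_count"]) == 0:
            return row["name"]
    return None


def build_addnode(wwnn, iogrp_name, node_name=None, site_name=None):
    """Assemble the addnode command; -name and -site are optional."""
    parts = ["addnode", "-wwnodename", wwnn, "-iogrp", iogrp_name]
    if node_name:
        parts += ["-name", node_name]
    if site_name:
        parts += ["-site", site_name]
    return " ".join(parts)


rows = parse_table(SAMPLE_LSIOGRP)
target = first_empty_io_group(rows)
print(build_addnode("500507680C007B00", target, node_name="node2"))
```

For the second node of a pair, reuse the I/O group that was selected for the first node instead of searching again.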
To add either a SAN Volume Controller 2145-CG8 or SAN Volume Controller
2145-CF8 node to a system using the command-line interface, complete these steps:
1. Using the front panel of the node, record the WWNN. The front panel shows
only the last five digits of the WWNN.
2. Enter this command to verify that the node is detected on the fabric:
svcinfo lsnodecandidate
This example shows the output for this command:
# svcinfo lsnodecandidate
id               panel_name UPS_serial_number UPS_unique_id    hardware serial_number product_mtm machine_signature
500507680100E85F 168167     UPS_Fake_SN       100000000000E85F CG8      78G0123       2145-DH8    0123-4567-89AB-CDEF
The id parameter displays the WWNN for the node. Ensure that the last 5
digits that are displayed match the WWNN on the front panel. If the node is
not detected, verify cabling to the node.
3. Enter this command to determine the I/O group where the node should be
added:
lsiogrp
4. Record the name or ID of the first I/O group that has a node count of zero (0).
You will need the name or ID for the next step. Note: You only need to do this step for
the first node that is added. The second node of the pair uses the same I/O
group number.
5. Enter this command to add the node to the system:
addnode -wwnodename WWNN -iogrp iogrp_name -name newnodename -site newsitename
Where WWNN is the WWNN of the node, iogrp_name is the name or ID of the
I/O group that you want to add the node to and newnodename is the name that
you want to assign to the node. If you do not specify a new node name, a
default name is assigned; however, it is recommended that you specify a
meaningful name. The newsitename specifies the name of the site location of the
new node. This parameter is only required if the topology is a HyperSwap or
stretched system.
Note: Adding the node might take a considerable amount of time.
6. Record this information for future reference:
v Serial number.
v Worldwide node name.
v All of the worldwide port names.
v The name or ID of the I/O group.
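The comparison in step 2 between the front panel and the id column can be written as a small check. This Python sketch uses a hypothetical function name and takes its sample value from the example output above.

```python
def wwnn_matches_front_panel(candidate_id, panel_digits):
    """Compare the last five digits of a candidate WWNN with the front panel."""
    panel = panel_digits.strip().upper()
    if len(panel) != 5:
        # The front panel shows exactly five digits of the WWNN.
        return False
    return candidate_id.strip().upper().endswith(panel)


print(wwnn_matches_front_panel("500507680100E85F", "0E85F"))
print(wwnn_matches_front_panel("500507680100E85F", "0E85A"))
```

A mismatch suggests that the candidate in the listing is not the node that you are standing in front of, so check the cabling and the listing again before you run addnode.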
If a node shows node error 578 or node error 690, the node is in service state.
Complete the following steps from the front panel to exit service state:
1. Press and release the Up or Down button until the Actions? option displays.
2. Press the Select button.
3. Press and release the Up or Down button until the Exit Service? option
displays.
4. Press the Select button.
5. Press and release the Left or Right button until the Confirm Exit? option
displays.
6. Press the Select button.
Service assistant interface
The service assistant interface is a browser-based GUI that is used to service your
nodes.
When to use the service assistant
The primary use of the service assistant is when a node is in service state. The
node cannot be active as part of a system while it is in service state.
Attention: Complete service actions on nodes only when directed to do so by the
fix procedures. If used inappropriately, the service actions that are available
through the service assistant can cause loss of access to data or even data loss.
The node might be in a service state because it has a hardware issue, has corrupted
data, or has lost its configuration data.
Use the service assistant in the following situations:
v When you cannot access the system from the management GUI and you cannot
access the SAN Volume Controller to run the recommended actions
v When the recommended action directs you to use the service assistant.
The management GUI operates only when there is an online clustered system. Use
the service assistant if you are unable to create a clustered system.
The service assistant provides detailed status and error summaries, and the ability
to modify the worldwide node name (WWNN) for each node.
You can also complete the following service-related actions:
v Collect logs to create and download a package of files to send to support
personnel.
v Remove the data for the system from a node.
v Recover a system if it fails.
v Install a software package from the support site or rescue the software from
another node.
v Update software on nodes manually instead of completing a standard update
procedure.
v Change the service IP address that is assigned to Ethernet port 1 for the current
node.
v Install a temporary SSH key if a key is not installed and CLI access is required.
v Restart the services used by the system.
Accessing the service assistant
The service assistant is a web application that helps troubleshoot and resolve
problems on a node. The service assistant can be accessed through a service IP
address. On SAN Volume Controller 2145-DH8, you can connect to the service
assistant by using the technician port.
About this task
You must use a supported web browser. For a list of supported browsers, refer to
the topic Web browser requirements to access the management GUI.
Procedure
To start the application, complete the following steps.
1. Start a supported web browser and point your web browser to
serviceaddress/service for the node that you want to work on.
2. Log on to the service assistant using the superuser password.
If you do not know the current superuser password and cannot obtain it,
reset the password.
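As a small illustration of the address form in step 1, the following Python sketch builds the service assistant address from a service IP address. The https scheme, the bracketing of IPv6 literals, and the function name are assumptions of this sketch; the sample addresses come from documentation-reserved ranges.

```python
def service_assistant_url(service_address, scheme="https"):
    """Build the serviceaddress/service address for a node."""
    host = service_address.strip()
    # An IPv6 literal must be enclosed in brackets inside a URL.
    if ":" in host and not host.startswith("["):
        host = "[" + host + "]"
    return scheme + "://" + host + "/service"


print(service_assistant_url("192.0.2.10"))
print(service_assistant_url("2001:db8::1"))
```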
Results
Complete the service assistant actions on the correct node.
Command-line interface
Use the command-line interface (CLI) to manage a system with task commands
and information commands.
For a full description of the commands and how to start an SSH command-line
session, see the “Command-line interface” section of the SAN Volume Controller
Knowledge Center.
When to use the CLI
The system command-line interface is intended for use by advanced users who are
confident at using a CLI.
Nearly all of the flexibility that is offered by the CLI is available through the
management GUI. However, the CLI does not provide the fix procedures that are
available in the management GUI. Therefore, use the fix procedures in the
management GUI to resolve the problems. Use the CLI when you require a
configuration setting that is unavailable in the management GUI.
You might also find it useful to create command scripts that use CLI commands to
monitor certain conditions or to automate configuration changes that you make
regularly.
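A monitoring script of this kind might look like the following Python sketch, which flags I/O groups that contain only one node. The host address, the superuser login in the ssh arguments, the sample lsiogrp layout, and the function names are assumptions made for illustration. The sketch builds the ssh argument list but does not connect to a system.

```python
import shlex

# Illustrative lsiogrp-style listing (assumed column layout).
SAMPLE_OUTPUT = """\
id name            node_count vdisk_count host_count
0  io_grp0         1          4           2
1  io_grp1         2          3           2
2  io_grp2         0          0           0
"""


def build_ssh_argv(host, cli_command, user="superuser"):
    """Argument list that runs one CLI command over SSH."""
    return ["ssh", user + "@" + host] + shlex.split(cli_command)


def single_node_io_groups(text):
    """Names of I/O groups that currently contain exactly one node."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    return [row["name"] for row in rows if int(row["node_count"]) == 1]


argv = build_ssh_argv("192.0.2.20", "lsiogrp")
# On a workstation with SSH access, the command could be run with
# subprocess.run(argv, capture_output=True, text=True, check=True)
# and its stdout passed to single_node_io_groups().
print(argv)
print(single_node_io_groups(SAMPLE_OUTPUT))
```

A script like this only reports a condition. Use the fix procedures in the management GUI to resolve any problem that it finds.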
Accessing the system CLI
Follow the steps that are described in the Command-line interface section of the
SAN Volume Controller Knowledge Center to initialize and use a CLI session.
Service command-line interface
Use the service command-line interface (CLI) to manage a node using the task
commands and information commands.
Note: The service command-line interface can also be accessed by using the
technician port.
For a full description of the commands and how to start an SSH command-line
session, see the “Command-line interface” topic in the “Reference” section of this
product information.
When to use the service CLI
The service CLI is intended for use by advanced users who are confident at using
a command-line interface.
To access a node directly, it is normally easier to use the service assistant with its
graphical interface and extensive help facilities.
Accessing the service CLI
To initialize and use a CLI session, review the information in the topics referenced
in the Related information section below.
USB flash drive interface
Use a USB flash drive to help service a SAN Volume Controller node.
When a USB flash drive is inserted into one of the USB ports on a SAN Volume
Controller node, the software searches for a control file on the USB flash drive and
runs the command that is specified in the file. When the command completes, the
command results and node status information are written to the USB flash drive.
When to use the USB flash drive
The USB flash drive can be used for service functions.
Using the USB flash drive is required in the following situations:
• When you cannot connect to a node canister in a control enclosure using the
service assistant and you want to see the status of the node.
• When you do not know, or cannot use, the service IP address for the node
canister in the control enclosure and must set the address.
• When you have forgotten the superuser password and must reset the password.
Using a USB flash drive
Use any USB flash drive that is formatted with a FAT32 file system on its first
partition.
About this task
When a USB flash drive is plugged into a node canister, the node canister code
searches for a text file named satask.txt in the root directory. If the code finds the
file, it attempts to run a command that is specified in the file. When the command
completes, a file called satask_result.html is written to the root directory of the
USB flash drive. If this file does not exist, it is created. If it exists, the data is
inserted at the start of the file. The file contains the details and results of the
command that was run and the status and the configuration information from the
node canister. The status and configuration information matches the detail that is
shown on the service assistant home page panels.
The fault light-emitting diode (LED) on the node canister flashes when the USB
service action is being performed. When the fault LED stops flashing, it is safe to
remove the USB flash drive.
Results
The USB flash drive can then be plugged into a workstation and the
satask_result.html file viewed in a web browser.
To protect from accidentally running the same command again, the satask.txt file
is deleted after it has been read.
If no satask.txt file is found on the USB flash drive, the result file is still created,
if necessary, and the status and configuration data is written to it.
satask.txt commands
If you are creating the satask.txt command file by using a text editor, the file
must contain a single command on a single line in the file.
The commands that you use are the same as the service CLI commands except
where noted. Not all service CLI commands can be run from the USB flash drive.
The satask.txt commands always run on the node that the USB flash drive is
plugged into.
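The single-command rule can be checked before the file is copied to the flash drive. The following Python sketch is illustrative only: the function name write_satask_file and the prefix check are assumptions, not part of the product, and the target directory stands in for the root of the mounted USB flash drive.

```python
from pathlib import Path

# Assumption: only the two command prefixes that appear in this section
# (satask and sainfo) are accepted by this helper.
ALLOWED_PREFIXES = ("satask ", "sainfo ")


def write_satask_file(usb_root, command):
    """Write a satask.txt that holds exactly one command on one line."""
    command = command.strip()
    if "\n" in command or "\r" in command:
        raise ValueError("satask.txt must contain a single command on a single line")
    if not command.startswith(ALLOWED_PREFIXES):
        raise ValueError("expected a command that starts with satask or sainfo")
    # The node looks for the file in the root directory of the flash drive.
    target = Path(usb_root) / "satask.txt"
    target.write_text(command + "\n", encoding="ascii")
    return target
```

For example, write_satask_file("/media/usb", "sainfo getstatus") would create the command file, where /media/usb is a hypothetical mount point on the workstation.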
Reset service IP address and superuser password command:
Use this command to obtain service assistant access to a node canister even if the
current state of the node canister is unknown. Physical access to the node
canister is required and is used to authenticate the action.
Syntax
satask chserviceip -serviceip ipv4 [-gw ipv4] [-mask ipv4] [-resetpassword]
satask chserviceip -serviceip_6 ipv6 [-gw_6 ipv6] [-prefix_6 int] [-resetpassword]
satask chserviceip -default [-resetpassword]
Parameters
-serviceip ipv4
The IPv4 address for the service assistant.
-gw ipv4
The IPv4 gateway for the service assistant.
-mask ipv4
The IPv4 subnet for the service assistant.
-serviceip_6 ipv6
The IPv6 address for the service assistant.
-gw_6 ipv6
The IPv6 gateway for the service assistant.
-prefix_6 int
The IPv6 prefix for the service assistant.
-resetpassword
Sets the service assistant password to the default value.
Description
This command resets the service assistant IP address to the default value.
Depending on the node on which the command is run, the default value is
192.168.70.121 or 192.168.70.122, in both cases with subnet mask 255.255.255.0.
If the node canister is active in a system, the superuser password for
the system is reset; otherwise, the superuser password is reset on the node canister.
If the node canister becomes active in a system, the superuser password is reset to
that of the system. You can configure the system to disable resetting the superuser
password. If you disable that function, this action fails.
This action calls the satask chserviceip command and the satask resetpassword
command.
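When the satask.txt file is generated by a script, the IPv4 and default forms of the command can be assembled from their parameters. The following Python sketch is a hedged illustration: the function name build_chserviceip is hypothetical, the IPv6 form is omitted, and the rule that -default is not combined with address parameters is an assumption drawn from the separate syntax forms shown above.

```python
import ipaddress


def build_chserviceip(serviceip=None, gw=None, mask=None,
                      default=False, reset_password=False):
    """Return an IPv4 or default 'satask chserviceip' command line."""
    parts = ["satask", "chserviceip"]
    if default:
        # Assumption: -default is a separate form without address parameters.
        if serviceip or gw or mask:
            raise ValueError("-default cannot be combined with address parameters")
        parts.append("-default")
    else:
        if serviceip is None:
            raise ValueError("either serviceip or default=True is required")
        for flag, value in (("-serviceip", serviceip), ("-gw", gw), ("-mask", mask)):
            if value is not None:
                # Raises ValueError if the value is not a dotted IPv4 address.
                ipaddress.IPv4Address(value)
                parts += [flag, value]
    if reset_password:
        parts.append("-resetpassword")
    return " ".join(parts)
```

The returned string is a single line, so it can be written to satask.txt unchanged.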
Reset service assistant password command:
Use this command when you are unable to log on to the system because you have
forgotten the superuser password and you want to reset it.
Syntax
►►
satask
resetpassword
►◄
Parameters
None.
Description
This command resets the service assistant password to the default value passw0rd.
If the node canister is active in a system, the superuser password for the system is
reset; otherwise, the superuser password is reset on the node canister.
If the node canister becomes active in a system, the superuser password is reset to
that of the system. You can configure the system to disable resetting the superuser
password. If you disable that function, this action fails.
This command calls the satask resetpassword command.
Snap command:
Use the snap command to collect diagnostic information from the node canister
and to write the output to a USB flash drive.
Syntax
satask snap [-dump] [-noimm] [panel_name]
Parameters
-dump
(Optional) Includes the most recent dump file in the output.
-noimm
(Optional) Indicates the /dumps/imm.ffdc file should not be included in the
output.
panel_name
(Optional) Indicates the node on which to execute the snap command.
Description
This command moves a snap file to a USB flash drive.
This command calls the satask snap command.
If collected, the IMM FFDC file is present in the snap archive in
/dumps/imm.ffdc.<node.dumpname>.<date>.<time>.tgz. The system waits for up to
five minutes for the IMM to generate its FFDC. The status of the IMM FFDC is
located in the snap archive in /dumps/imm.ffdc.log. These two files are not left on
the node.
An invocation example
satask snap -dump 111584
The resulting output:
No feedback
Install software command:
Use this command to install a specific update package on the node canister.
Syntax
satask installsoftware -file filename [-ignore | -pacedccu]
Parameters
-file filename
(Required) The filename designates the name of the update package.
-ignore | -pacedccu
(Optional) Overrides prerequisite checking and forces installation of the update
package.
Description
This command copies the file from the USB flash drive to the update directory on
the node canister and then installs the update package.
This command calls the satask installsoftware command.
Create system command:
Use this command to create a storage system.
Syntax
satask mkcluster -clusterip ipv4 [-gw ipv4] [-mask ipv4] [-name cluster_name]
satask mkcluster -clusterip_6 ipv6 [-gw_6 ipv6] [-prefix_6 int] [-name cluster_name]
Parameters
-clusterip ipv4
(Optional) The IPv4 address for Ethernet port 1 on the system.
-gw ipv4
(Optional) The IPv4 gateway for Ethernet port 1 on the system.
-mask ipv4
(Optional) The IPv4 subnet for Ethernet port 1 on the system.
-clusterip_6 ipv6
(Optional) The IPv6 address for Ethernet port 1 on the system.
-gw_6 ipv6
(Optional) The IPv6 gateway for Ethernet port 1 on the system.
-prefix_6 int
(Optional) The IPv6 prefix for Ethernet port 1 on the system.
-name cluster_name
(Optional) The name of the new system.
Description
This command creates a storage system.
This command calls the satask mkcluster command.
Change system IP address:
Use this command to change the system IP address of the storage system.
It is best to use the initialization tool to create this command in satask.txt together
with the associated clitask.txt file that changes the management IP addresses of
the file modules.
Syntax
satask setsystemip -systemip ipv4 -gw ipv4 -mask ipv4 -consoleip ipv4
Parameters
-systemip
The IPv4 address for Ethernet port 1 on the system.
-gw
The IPv4 gateway for Ethernet port 1 on the system.
-mask
The IPv4 subnet for Ethernet port 1 on the system.
-consoleip
The management IPv4 address of SAN Volume Controller system.
Description
This command is only supported in the satask.txt file on a USB flash drive.
It calls the svctask chsystemip command if the USB flash drive is inserted in the
configuration node canister. Otherwise, it blinks the amber identify LED of the
node canister that is the configuration node.
If the amber identify LED for a different node canister starts to blink, move
the USB flash drive to that node canister because it is the configuration node.
When the amber LED turns off, you can move the USB flash drive to one of the file
modules so that it uses the clitask.txt file to change the file module
management IP addresses.
Leave the USB flash drive in the file module for at least two minutes before you
remove it. Use a workstation to check the clitask_results.txt and satask_result.html
results files on the USB flash drive.
If the IP address change was successful, you must run the startmgtsrv -r
command to restart the management service so that it does not continue to send SSH
commands to the old system IP address of the volume storage system.
For example, on a Linux workstation with network access to the new management
IP address:
satask setsystemip -systemip 123.123.123.20 -gw 123.123.123.1 -mask 255.255.255.0
-consoleip 123.123.123.10
You can now access the management GUI, which you can use to change any other
IP address that needs to be changed.
Here is an example of what could be in the clitask.txt file:
chnwmgt --serviceip1 123.123.123.11 --serviceip2 123.123.123.12
--mgtip 123.123.123.10 --gateway 123.123.123.1 --netmask 255.255.255.0 --force
chstoragesystem --ip1 123.123.123.20
Here is an example of what could be in the satask.txt file:
satask setsystemip -systemip 123.123.123.20 -gw 123.123.123.1 -mask 255.255.255.0
-consoleip 123.123.123.10
Query status command:
Use this command to determine the current service state of the node canister.
Syntax
sainfo getstatus
Parameters
None.
Description
This command writes the output from each node canister to the USB flash drive.
This command calls the sainfo lsservicenodes command, the sainfo
lsservicestatus command, and the sainfo lsservicerecommendation command.
Technician port for node access
A technician port is available on the rear of the SAN Volume Controller 2145-DH8
that provides a direct connection for installation and servicing.
See “SAN Volume Controller 2145-DH8 ports used during service procedures” on
page 25.
The technician port replaces the front panel display and navigation buttons on
previous models of SAN Volume Controller nodes. The technician port provides
direct access to the service assistant GUI and command-line interface (CLI).
The technician port can be used by directly connecting a computer that has web
browsing software and is configured for Dynamic Host Configuration Protocol
(DHCP) through a standard 1 Gbps Ethernet cable.
If the node has candidate status when you open the web browser, the initialization
tool is displayed. Otherwise, the service assistant interface is displayed.
Important: Do not use the initialization tool on a node if any other node in the
system is already active (for example, if the node status LED is solid on any node
of the system).
To access the service assistant through the technician port when the node status is
candidate, change the web page address from:
https://service/service/
to the following address:
https://service/service/node/home.action
Alternatively, reload the web page after you change the node status to service, for
example, by using the following service CLI command:
satask startservice
See the following information for how to access the CLI through the technician
port. For other ways to access the CLI, see Chapter 3, “SAN Volume Controller
user interfaces for servicing your system,” on page 59.
To use the technician port, you must be next to the node. The technician port does
not work if it is connected to an Ethernet switch.
If you have Secure Shell (SSH) software on the computer that is directly connected
to the technician port, then you can use it to access the CLI as superuser at
192.168.0.1. The default superuser password is passw0rd.
Note: When your personal computer is configured with DHCP, the technician port
uses DHCP to reconfigure network services on your personal computer. Software
on your personal computer that was using these services might experience
network problems while it is connected to the technician port. For example,
selecting a link in a web page that was loaded before you connect to the technician
port might result in an error message.
Front panel interface
The front panel on each node has a small display, and five control buttons. This
panel provides access to system and node status information, and a means to run
certain system configuration and recovery actions. For a detailed description of
using the front panel, see Chapter 6, “Using the front panel of the SAN Volume
Controller,” on page 91.
When to use the front panel
Note: The SAN Volume Controller 2145-DH8 does not have a front panel display or
navigation and select buttons. On this system, use the technician port to access the
service assistant interface.
Use the front panel when you are physically next to the system and are unable to
access one of the system interfaces.
Chapter 4. Performing recovery actions using the SAN
Volume Controller CLI
The SAN Volume Controller command-line interface (CLI) is a collection of
commands that you can use to manage SAN Volume Controller clusters. See the
Command-line interface documentation for the specific details about the
commands provided here.
Validating and repairing mirrored volume copies using the CLI
You can use the repairvdiskcopy command from the command-line interface (CLI)
to validate and repair mirrored volume copies.
Attention: Run the repairvdiskcopy command only if all volume copies are
synchronized.
When you issue the repairvdiskcopy command, you must use only one of the
-validate, -medium, or -resync parameters. You must also specify the name or ID
of the volume to be validated and repaired as the last entry on the command line.
After you issue the command, no output is displayed.
-validate
Use this parameter if you only want to verify that the mirrored volume copies
are identical. If any difference is found, the command stops and logs an error
that includes the logical block address (LBA) and the length of the first
difference. You can use this parameter, starting at a different LBA each time, to
count the number of differences on a volume.
-medium
Use this parameter to convert sectors on all volume copies that contain
different contents into virtual medium errors. Upon completion, the command
logs an event, which indicates the number of differences that were found, the
number that were converted into medium errors, and the number that were
not converted. Use this option if you are unsure what the correct data is, and
you do not want an incorrect version of the data to be used.
-resync
Use this parameter to overwrite contents from the specified primary volume
copy to the other volume copy. The command corrects any differing sectors by
copying the sectors from the primary copy to the copies being compared. Upon
completion, the command process logs an event, which indicates the number
of differences that were corrected. Use this action if you are sure that either the
primary volume copy data is correct or that your host applications can handle
incorrect data.
-startlba lba
Optionally, use this parameter to specify the starting Logical Block Address
(LBA) from which to start the validation and repair. If you previously used the
-validate parameter, an error was logged with the LBA where the first
difference, if any, was found. Reissue repairvdiskcopy with that LBA to avoid
reprocessing the initial sectors that compared identically. Continue to reissue
repairvdiskcopy using this parameter to list all the differences.
Issue the following command to validate and, if necessary, automatically repair
mirrored copies of the specified volume:
repairvdiskcopy -resync -startlba 20 vdisk8
Notes:
1. Only one repairvdiskcopy command can run on a volume at a time.
2. Once you start the repairvdiskcopy command, you cannot use the command to
stop processing.
3. The primary copy of a mirrored volume cannot be changed while the
repairvdiskcopy -resync command is running.
4. If there is only one mirrored copy, the command returns immediately with an
error.
5. If a copy being compared goes offline, the command is halted with an error.
The command is not automatically resumed when the copy is brought back
online.
6. In the case where one copy is readable but the other copy has a medium error,
the command process automatically attempts to fix the medium error by
writing the read data from the other copy.
7. If no differing sectors are found during repairvdiskcopy processing, an
informational error is logged at the end of the process.
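The repeated use of -validate with -startlba to count differences can be sketched as a loop. In the following Python sketch, validate is a hypothetical callable supplied by the caller: it is assumed to run repairvdiskcopy -validate -startlba for the volume, wait for the task to finish, and return the LBA and length of the first logged difference, or None if no difference was logged. Resuming immediately after the reported difference is also an assumption.

```python
def count_differences(validate, start_lba=0):
    """Count differing regions by re-running validation from successive LBAs.

    `validate(lba)` is a hypothetical callable that returns (lba, length)
    for the first difference found at or after `lba`, or None if none.
    """
    count = 0
    lba = start_lba
    while True:
        found = validate(lba)
        if found is None:
            return count
        diff_lba, length = found
        count += 1
        # Assumption: resume just past the reported difference so that the
        # same region is not reported again.
        lba = diff_lba + max(length, 1)
```

Because only one repairvdiskcopy command can run on a volume at a time, the callable must wait for each validation to complete before it returns.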
Checking the progress of validation and repair of volume copies
using the CLI
Use the lsrepairvdiskcopyprogress command to display the progress of mirrored
volume validation and repairs. You can specify a volume copy by using the -copy id
parameter. To display the volumes that have two or more copies with an active
task, issue the command with no parameters; it is not possible to have only one
volume copy with an active task.
To check the progress of validation and repair of mirrored volumes, issue the
following command:
lsrepairvdiskcopyprogress -delim :
The following example shows how the command output is displayed:
vdisk_id:vdisk_name:copy id:task:progress:estimated_completion_time
0:vdisk0:0:medium:50:070301120000
0:vdisk0:1:medium:50:070301120000
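Delimited output of this kind is straightforward to process in a monitoring script. The following Python sketch, with the hypothetical function name parse_concise, splits the header row and the data rows on the delimiter and returns one dictionary per row. It uses the example output above as its input and assumes that no field value contains the delimiter.

```python
def parse_concise(output, delim=":"):
    """Turn header-plus-rows CLI output into a list of dictionaries."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split(delim)
    # Pair each value with the column name from the header row.
    return [dict(zip(header, line.split(delim))) for line in lines[1:]]


sample = """vdisk_id:vdisk_name:copy id:task:progress:estimated_completion_time
0:vdisk0:0:medium:50:070301120000
0:vdisk0:1:medium:50:070301120000"""

for row in parse_concise(sample):
    print(row["vdisk_name"], row["copy id"], row["task"], row["progress"])
```

All values are returned as strings; a script that compares progress values would convert them with int() first.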
Repairing a thin-provisioned volume using the CLI
You can use the repairsevdiskcopy command from the command-line interface to
repair the metadata on a thin-provisioned volume.
The repairsevdiskcopy command automatically detects and repairs corrupted
metadata. The command holds the volume offline during the repair, but does not
prevent the disk from being moved between I/O groups.
If a repair operation completes successfully and the volume was previously offline
because of corrupted metadata, the command brings the volume back online. The
only limit on the number of concurrent repair operations is the number of volume
copies in the configuration.
When you issue the repairsevdiskcopy command, you must specify the name or
ID of the volume to be repaired as the last entry on the command line. Once
started, a repair operation cannot be paused or cancelled; the repair can only be
terminated by deleting the copy.
Attention: Use this command only to repair a thin-provisioned volume that has
reported corrupt metadata.
Issue the following command to repair the metadata on a thin-provisioned volume:
repairsevdiskcopy vdisk8
After you issue the command, no output is displayed.
Notes:
1. Because the volume is offline to the host, any I/O that is submitted to the
volume while it is being repaired fails.
2. When the repair operation completes successfully, the corrupted metadata error
is marked as fixed.
3. If the repair operation fails, the volume is held offline and an error is logged.
Checking the progress of the repair of a thin-provisioned volume
using the CLI
Issue the lsrepairsevdiskcopyprogress command to list the repair progress for
thin-provisioned volume copies of the specified volume. If you do not specify a
volume, the command lists the repair progress for all thin-provisioned copies in
the system.
Note: Only run this command after you run the repairsevdiskcopy command,
which you must only run as required by the fix procedures recommended by your
support team.
Recovering offline volumes using the CLI
If a node or an I/O group fails, you can use the command-line interface (CLI) to
recover offline volumes.
About this task
If you have lost both nodes in an I/O group and have, therefore, lost access to all
the volumes that are associated with the I/O group, you must complete one of the
following procedures to regain access to your volumes. Depending on the failure
type, you might have lost data that was cached for these volumes and the volumes
are now offline.
Data loss scenario 1
One node in an I/O group has failed and failover has started on the second node.
During the failover process, the second node in the I/O group fails before the data
in the write cache is flushed to the backend. The first node is successfully repaired
but its hardened data is not the most recent version that is committed to the data
store; therefore, it cannot be used. The second node is repaired or replaced and has
lost its hardened data; therefore, the node has no way of recognizing that it is part
of the system.
Complete the following steps to recover an offline volume when one node has
down-level hardened data and the other node has lost hardened data:
Procedure
1. Recover the node and add it back into the system.
2. Delete all IBM FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the offline volumes.
3. Run the recovervdisk, recovervdiskbyiogrp or recovervdiskbysystem
command.
4. Re-create all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the volumes.
Data loss scenario 2
Both nodes in the I/O group have failed and have been repaired. The nodes have
lost their hardened data; therefore, the nodes have no way of recognizing that they
are part of the system.
Complete the following steps to recover an offline volume when both nodes have
lost their hardened data and cannot be recognized by the system:
1. Delete all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the offline volumes.
2. Run the recovervdisk, recovervdiskbyiogrp or recovervdiskbysystem
command.
3. Re-create all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the volumes.
Chapter 5. Viewing the vital product data
Vital product data (VPD) is information that uniquely records each element in the
SAN Volume Controller. The data is updated automatically by the system when
the configuration is changed.
The VPD lists the following types of information:
• System-related values such as the software version, space in storage pools, and
space allocated to volumes.
• Node-related values that include the specific hardware that is installed in each
node. Examples include the FRU part number for the system board and the level
of BIOS firmware that is installed. The node VPD is held by the system, which
makes it possible to get most of the VPD for the nodes that are powered off.
Using different sets of commands, you can view the system VPD and the node
VPD. You can also view the VPD through the management GUI.
Downloading the vital product data using the management GUI
You can download the vital product data for a node from the management GUI.
Procedure
1. In the management GUI, select Monitoring > System.
2. From the dynamic graphic of the system, select the node and click the icon to
the right of the Actions menu to download VPD information.
Displaying the vital product data using the CLI
You can use the command-line interface (CLI) to display the SAN Volume
Controller system or node vital product data (VPD).
Issue the following CLI commands to display the VPD:
sainfo lsservicestatus
lsnodehw
lsnodevpd nodename
lssystem system_name
lssystemip
lsdrive
Displaying node properties by using the CLI
You can use the command-line interface (CLI) to display node properties.
About this task
To display the node properties:
Procedure
1. Use the lsnode CLI command to display a concise list of nodes in the clustered
system.
Issue this CLI command to list the system nodes:
lsnode -delim :
2. Issue the lsnode CLI command and specify the node ID or name of the node
that you want to receive detailed output.
The following example is a CLI command that you can use to list detailed
output for a node in the system:
lsnode -delim : group1node1
Where group1node1 is the name of the node for which you want to view
detailed output.
Displaying clustered system properties using the CLI
You can use the command-line interface (CLI) to display the properties for a
clustered system (system).
About this task
These actions help you display your system property information.
Procedure
Issue the lssystem command to display the properties for a system.
The following is an example of the command you can issue:
lssystem -delim : build1
where build1 is the name of the system.
Results
id:000002007A00A0FE
name:build1
location:local
partnership:
bandwidth:
total_mdisk_capacity:90.7GB
space_in_mdisk_grps:90.7GB
space_allocated_to_vdisks:14.99GB
total_free_space:75.7GB
statistics_status:on
statistics_frequency:15
required_memory:0
cluster_locale:en_US
time_zone:522 UTC
code_level:6.1.0.0 (build 47.3.1009031000)
FC_port_speed:2Gb
console_IP:9.71.46.186:443
id_alias:000002007A00A0FE
gm_link_tolerance:300
gm_inter_cluster_delay_simulation:0
gm_intra_cluster_delay_simulation:0
email_reply:
email_contact:
email_contact_primary:
email_contact_alternate:
email_contact_location:
email_state:stopped
inventory_mail_interval:0
total_vdiskcopy_capacity:15.71GB
total_used_capacity:13.78GB
total_overallocation:17
total_vdisk_capacity:11.72GB
cluster_ntp_IP_address:
cluster_isns_IP_address:
iscsi_auth_method:none
iscsi_chap_secret:
auth_service_configured:no
auth_service_enabled:no
auth_service_url:
auth_service_user_name:
auth_service_pwd_set:no
auth_service_cert_set:no
relationship_bandwidth_limit:25
gm_max_host_delay:5
tier:generic_ssd
tier_capacity:0.00MB
tier_free_capacity:0.00MB
tier:generic_hdd
tier_capacity:90.67GB
tier_free_capacity:75.34GB
email_contact2:
email_contact2_primary:
email_contact2_alternate:
total_allocated_extent_capacity:16.12GB
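The detailed view differs from the concise view: each line is a field name followed by its value, a value can itself contain the delimiter (console_IP in the output above), and some names repeat (tier, tier_capacity). The following Python sketch, with the hypothetical function name parse_detailed, handles both cases by splitting on the first delimiter only and by collecting repeated names into lists. The sample is a subset of the output above.

```python
def parse_detailed(output, delim=":"):
    """Parse 'name<delim>value' lines; repeated names collect into lists."""
    fields = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the first delimiter only: values can contain the delimiter.
        name, _, value = line.partition(delim)
        if name in fields:
            existing = fields[name]
            if isinstance(existing, list):
                existing.append(value)
            else:
                fields[name] = [existing, value]
        else:
            fields[name] = value
    return fields


sample = """name:build1
console_IP:9.71.46.186:443
tier:generic_ssd
tier_capacity:0.00MB
tier:generic_hdd
tier_capacity:90.67GB"""

info = parse_detailed(sample)
print(info["console_IP"], info["tier"])
```

Keeping repeated names in order of appearance lets a script pair each tier entry with the tier_capacity entry at the same list position.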
Fields for the node VPD
The node vital product data (VPD) provides information for items such as the
system board, batteries, processor, fans, memory module, adapter, devices,
software, front panel assembly, the uninterruptible power supply, SAS flash drive
and SAS host bus adapter (HBA).
Table 43 on page 84 shows the fields that you see for the system board.
Table 43. Fields for the system board
Item
Field name
System board
Part number
System serial number
Number of processors
Number of memory slots
Number of fans
Number of Fibre Channel adapters
Number of SCSI, IDE, SATA, or SAS devices (Note: The service controller is a device.)
Number of compression accelerator adapters
Number of power supplies
Number of high-speed SAS adapters
BIOS manufacturer
BIOS version
BIOS release date
System manufacturer
System product
Planar manufacturer
Power supply part number
CMOS battery part number
Power cable assembly part number
Service processor firmware
SAS controller part number
Table 44 shows the fields that you see for the batteries.
Table 44. Fields for the batteries
Item
Field name
Batteries
Battery_FRU_part
Battery_part_identity
Battery_fault_led
Battery_charging_status
Battery_cycle_count
Battery_power_on_hours
Battery_last_recondition
Battery_midplane_FRU_part
Battery_midplane_part_identity
Battery_midplane_FW_version
Battery_power_cable_FRU_part
Battery_power_sense_cable_FRU_part
Battery_comms_cable_FRU_part
Battery_EPOW_cable_FRU_part
Table 45 shows the fields you see for each processor that is installed.
Table 45. Fields for the processors
Item
Field name
Processor
Part number
Processor location
Manufacturer
Version
Speed
Status
Processor serial number
Table 46 shows the fields that you see for each fan that is installed.
Table 46. Fields for the fans
Item
Field name
Fan
Part number
Location
Table 47 shows the fields that are repeated for each installed memory module.
Table 47. Fields that are repeated for each installed memory module
Item
Field name
Memory module
Part number
Device location
Bank location
Size (MB)
Manufacturer (if available)
Serial number (if available)
Table 48 shows the fields that are repeated for each installed adapter.
Table 48. Fields that are repeated for each adapter that is installed
Item
Field name
Adapter
Adapter type
Part number
Port numbers
Location
Device serial number
Manufacturer
Device
Adapter revision
Chip revision
Table 49 shows the fields that are repeated for each device that is installed.
Table 49. Fields that are repeated for each SCSI, IDE, SATA, and SAS device that is
installed
Item
Field name
Device
Part number
Bus
Device
Model
Revision
Serial number
Approximate capacity
Hardware revision
Manufacturer
Table 50 shows the fields that are specific to the node software.
Table 50. Fields that are specific to the node software
Item
Field name
Software
Code level
Node name
Worldwide node name
ID
Unique string that is used in dump file names for this node
Table 51 shows the fields that are provided for the front panel assembly.
Table 51. Fields that are provided for the front panel assembly
Item
Field name
Front panel
Part number
Front panel ID
Front panel locale
Table 52 shows the fields that are provided for the Ethernet port.
Table 52. Fields that are provided for the Ethernet port
Item
Field name
Ethernet port
Port number
Ethernet port status
MAC address
Supported speeds
Table 53 on page 87 shows the fields that are provided for the power supplies in
the node.
Table 53. Fields that are provided for the power supplies in the node
Item
Field name
Power supplies
Part number
Location
Table 54 shows the fields that are provided for the uninterruptible power supply
assembly that is powering the node.
Table 54. Fields that are provided for the uninterruptible power supply assembly that is
powering the node
Item
Field name
Uninterruptible power supply
Electronics assembly part number
Battery part number
Frame assembly part number
Input power cable part number
UPS serial number
UPS type
UPS internal part number
UPS unique ID
UPS main firmware
UPS communications firmware
Table 55 shows the fields that are provided for the SAS host bus adapter (HBA).
Table 55. Fields that are provided for the SAS host bus adapter (HBA)
Item
Field name
SAS HBA
Part number
Port numbers
Device serial number
Manufacturer
Device
Adapter revision
Chip revision
Table 56 on page 88 shows the fields that are provided for the SAS flash drive.
Table 56. Fields that are provided for the SAS flash drive
Item
Field name
SAS SSD
Part number
Manufacturer
Device serial number
Model
Type
UID
Firmware
Slot
FPGA firmware
Speed
Capacity
Expansion tray
Connection type
Table 57 shows the fields that are provided for the small form factor pluggable
(SFP) transceiver.
Table 57. Fields that are provided for the small form factor pluggable (SFP) transceiver
Item
Field name
Small form factor pluggable (SFP) transceiver
Part number
Manufacturer
Device
Serial number
Supported speeds
Connector type
Transmitter type
Wavelength
Maximum distance by cable type
Hardware revision
Port number
Worldwide port name
Fields for the system VPD
The system vital product data (VPD) provides various information about the
system, including its ID, name, location, IP address, email contact, code level, and
total free space.
Table 58 on page 89 shows the fields that are provided for the system properties as
shown by the management GUI.
Table 58. Fields that are provided for the system properties
Item                      Field name
General                   ID (Note: This is the unique identifier for the system.)
                          Name
                          Location
                          Time Zone
                          Required Memory
                          Licensed Code Version
                          Channel Port Speed
IP Addresses (1)          Ethernet Port 1 (attributes for both IPv4 and IPv6)
                          • IP Address
                          • Service IP Address
                          • Subnet Mask
                          • Prefix
                          • Default Gateway
                          Ethernet Port 2 (attributes for both IPv4 and IPv6)
                          • IP Address
                          • Service IP Address
                          • Subnet Mask
                          • Prefix
                          • Default Gateway
Remote Authentication     Remote Authentication
                          Web Address
                          User Name
                          Password
                          SSL Certificate
Space                     Total MDisk Capacity
                          Space in Storage Pools
                          Space Allocated to Volumes
                          Total Free Space
                          Total Used Capacity
                          Total Allocation
                          Total Volume Copy Capacity
                          Total Volume Capacity
Statistics                Statistics Status
                          Statistics Frequency
Metro and Global Mirror   Link Tolerance
                          Intersystem Delay Simulation
                          Intrasystem Delay Simulation
                          Partnership
                          Bandwidth
Email                     SMTP Email Server
                          Email Server Port
                          Reply Email Address
                          Contact Person Name
                          Primary Contact Phone Number
                          Alternate Contact Phone Number
                          Physical Location of the System Reporting Error
                          Email Status
                          Inventory Email Interval
iSCSI                     iSNS Server Address
                          Supported Authentication Methods
                          CHAP Secret
(1) You can also use the lssystemip CLI command to view this data.
Chapter 6. Using the front panel of the SAN Volume Controller
The front panel of the SAN Volume Controller has a display, various LEDs,
navigation buttons, and a select button that are used when you service your SAN
Volume Controller node.
Figure 45 shows where the front-panel display ▌1▐ is on the SAN Volume
Controller node.
Figure 45. SAN Volume Controller front-panel assembly
The SAN Volume Controller 2145-DH8 does not have a front panel display, but
does include node status, node fault, and battery status LEDs. Instead of a
display, a technician port is available on the rear of the 2145-DH8 node; use it
to make a direct connection for installation and service tasks. The technician
port replaces the front panel display and provides direct access to the service
assistant GUI and command-line interface (CLI).
Boot progress indicator
Boot progress is displayed on the front panel of the SAN Volume Controller.
The Boot progress display on the front panel shows that the node is starting.
Booting
130
Figure 46. Example of a boot progress display
During the boot operation, boot progress codes are displayed and the progress bar
moves to the right while the boot operation proceeds.
Boot failed
If the boot operation fails, boot code 120 is displayed.
Failed
120
See the "Error code reference" topic where you can find a description of the failure
and the appropriate steps that you must complete to correct the failure.
Charging
The front panel indicates that the uninterruptible power supply battery is charging.
Charging
A node does not start and join a system if there is insufficient power in the
uninterruptible power supply battery to cope with a power failure. Charging is
displayed until it is safe to start the node. This might take up to two hours.
Error codes
Error codes are displayed on the front panel display.
Figure 47 and Figure 48 show how error codes are displayed on the front panel.
Figure 47. Example of an error code for a clustered system
Figure 48. Example of a node error code
For a full description of each error code that is displayed on the front panel,
including the failure and the actions that you must perform to correct it, see
the error code topics.
Hardware boot
The hardware boot display shows system data when power is first applied to the
node as the node searches for a disk drive to boot.
If this display remains active for longer than 3 minutes, there might be a problem.
The cause might be a hardware failure or the software on the hard disk drive
might be missing or damaged.
Node rescue request
If software is lost, you can use the node rescue process to copy all software from
another node.
The node-rescue-request display, which is shown in Figure 49, indicates that a
request has been made to replace the software on this node. The SAN Volume
Controller software is preinstalled on all SAN Volume Controller nodes. This
software includes the operating system, the application software, and the SAN
Volume Controller publications. It is normally not necessary to replace the software
on a node, but if the software is lost for some reason (for example, the hard disk
drive in the node fails), it is possible to copy all the software from another node
that is connected to the same Fibre Channel fabric. This process is known as node
rescue.
Figure 49. Node rescue display
Power failure
The SAN Volume Controller node uses battery power from the uninterruptible
power supply to shut itself down.
The Power failure display on the front panel shows that the SAN Volume
Controller is running on battery power because main power has been lost. All I/O
operations have stopped. The node is saving system metadata and node cache data
to the internal disk drive. When the progress bar reaches zero, the node powers
off.
Note: When input power is restored to the uninterruptible power supply, the SAN
Volume Controller turns on.
Powering off
The progress bar on the display shows the progress of the power-off operation.
Powering Off is displayed after the power button has been pressed and while the
node is powering off. Powering off might take several minutes.
The progress bar moves to the left when the power is removed.
Recovering
The front panel indicates that the uninterruptible power supply battery is not fully
charged.
When a node is active in a system but the uninterruptible power supply battery is
not fully charged, Recovering is displayed. If the power fails while this message is
displayed, the node does not restart until the uninterruptible power supply has
charged to a level where it can sustain a second power failure.
Restarting
The front panel indicates when the software on a node is restarting.
Restarting
The software is restarting for one of the following reasons:
• An internal error was detected.
• The power button was pressed again while the node was powering off.
If you press the power button while powering off, the panel display changes to
indicate that the button press was detected; however, the power off continues until
the node finishes saving its data. After the data is saved, the node powers off and
then automatically restarts. The progress bar moves to the right while the node is
restarting.
Shutting down
The front-panel indicator tracks shutdown operations.
The Shutting Down display is shown when you issue a shutdown command to a
SAN Volume Controller clustered system or a SAN Volume Controller node. The
progress bar continues to move to the left until the node turns off.
When the shutdown operation is complete, the node turns off. When you power
off a node that is connected to a 2145 UPS-1U, only the node shuts down; the 2145
UPS-1U does not shut down.
Shutting Down
Validate WWNN? option
The front panel prompts you to validate the WWNN when the worldwide node
name (WWNN) that is stored in the service controller (the panel WWNN) does not
match the WWNN that is backed up on the SAN Volume Controller disk (the disk
WWNN).
Typically, this panel is displayed when the service controller has been replaced.
The SAN Volume Controller uses the WWNN that is stored on the service
controller. Usually, when the service controller is replaced, you modify the WWNN
that is stored on it to match the WWNN on the service controller that it replaced.
By doing this, the node maintains its WWNN address, and you do not need to
modify the SAN zoning or host configurations. The WWNN that is stored on disk
is the same as the one that was stored on the old service controller.
After the node is in this mode, the front panel display does not revert to its
normal displays, such as node or cluster (system) options or operational status,
until the WWNN is validated. Navigate the Validate WWNN? option (shown in
Figure 50) to choose which WWNN you want to use.
[Navigation diagram. Panels: Validate WWNN?, Disk WWNN:, Panel WWNN:, Use Disk WWNN?, Use Panel WWNN?, Node WWNN:]
Figure 50. Validate WWNN? navigation
Choose which stored WWNN you want this node to use:
1. From the Validate WWNN? panel, press and release the select button. The Disk
WWNN: panel is displayed and shows the last five digits of the WWNN that is
stored on the disk.
2. To view the WWNN that is stored on the service controller, press and release
the right button. The Panel WWNN: panel is displayed and shows the last five
digits of the WWNN that is stored on the service controller.
3. Determine which WWNN you want to use.
a. To use the WWNN that is stored on the disk:
1) From the Disk WWNN: panel, press and release the down button. The
Use Disk WWNN? panel is displayed.
2) Press and release the select button.
b. To use the WWNN that is stored on the service controller:
1) From the Panel WWNN: panel, press and release the down button. The
Use Panel WWNN? panel is displayed.
2) Press and release the select button.
The node is now using the selected WWNN. The Node WWNN: panel is displayed
and shows the last five digits of the WWNN that you selected.
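The panel navigation in the preceding procedure can be sketched as a small transition table. This is only an illustration: the panel names and button presses come from the steps above, while the dictionary layout, the `press` function, and the rule that an unlisted button leaves the panel unchanged are assumptions.

```python
# Panel transitions taken from the procedure above; the data layout and
# function name are illustrative and not part of the product.
TRANSITIONS = {
    ("Validate WWNN?", "select"): "Disk WWNN:",
    ("Disk WWNN:", "right"): "Panel WWNN:",
    ("Disk WWNN:", "down"): "Use Disk WWNN?",
    ("Panel WWNN:", "down"): "Use Panel WWNN?",
    ("Use Disk WWNN?", "select"): "Node WWNN:",
    ("Use Panel WWNN?", "select"): "Node WWNN:",
}


def press(buttons, start="Validate WWNN?"):
    """Return the panel reached after pressing the buttons in order."""
    panel = start
    for button in buttons:
        # Assumption: a button with no listed transition leaves the panel unchanged.
        panel = TRANSITIONS.get((panel, button), panel)
    return panel


# Keep the WWNN that is stored on the disk.
print(press(["select", "down", "select"]))
# Keep the WWNN that is stored on the service controller.
print(press(["select", "right", "down", "select"]))
```

Both sequences end on the Node WWNN: panel, which matches the final step of the procedure.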
If neither the WWNN that is stored on the service controller nor the WWNN that
is stored on the disk is suitable, you must wait until the node restarts before
you can change it. After the node restarts, select Change WWNN to change the
WWNN to the value that you want.
SAN Volume Controller menu options
During normal operations, menu options are available on the front panel display of
the SAN Volume Controller node.
Menu options enable you to review the operational status of the clustered system,
node, and external interfaces. They also provide access to the tools and operations
that you use to service the node.
Figure 51 shows the sequence of the menu options. Only one option at
a time is displayed on the front panel display. For some options, additional data is
displayed on line 2. The first option that is displayed is the Cluster: option.
[Menu diagram. Main options (up or down): Cluster:, Node, Version:, Ethernet, FC Port-1 Status, Actions, Language?. Secondary options (left or right): for Cluster: Status:, Port-1 Address:, Port-2 Address:, IPv4 Address-1/-2, IPv4 Subnet-1/-2, IPv4 Gateway-1/-2, IPv6 Address-1/-2, IPv6 Prefix-1/-2, IPv6 Gateway-1/-2; for Node: Status:, Node WWNN:, Service Address (with IPv4 Address, IPv4 Subnet, IPv4 Gateway, IPv6 Address, IPv6 Prefix, IPv6 Gateway); for Version: Build:, Cluster Build:; for Ethernet: Ethernet Port-1 to Port-4, Speed-1 to Speed-4, MAC Address-1 to MAC Address-4; for FC Port-1 Status: FC Port-2 to FC Port-4 Status, FC Port-1 to FC Port-4 Speed; for Language?: English?, Japanese?. Legend: on panels marked s, select shows IPv4 and IPv6 addresses if available; on Actions, select takes you to the Actions menu; on a language panel, select activates the language.]
Figure 51. SAN Volume Controller options on the front-panel display
Use the left and right buttons to navigate through the secondary fields that are
associated with some of the main fields.
Note: Messages might not display fully on the screen. You might see a right angle
bracket (>) on the right side of the display screen. If you see a right angle bracket,
press the right button to scroll through the display. When there is no more text to
display, you can move to the next item in the menu by pressing the right button.
Similarly, you might see a left angle bracket (<) on the left side of the display
screen. If you see a left angle bracket, press the left button to scroll through the
display. When there is no more text to display, you can move to the previous item
in the menu by pressing the left button.
The following main options are available:
• Cluster
• Node
• Version
• Ethernet
• FC Port 1 Status
• Actions
• Language
Cluster (system) options
The main cluster (system) option from the menu can display the cluster name or
the field can be blank.
The main cluster (system) option displays the system name that the user has
assigned. If a clustered system is in the process of being created on the node, and
no system name has been assigned, a temporary name that is based on the IP
address of the system is displayed. If this node is not assigned to a system, the
field is blank.
Status option
Status is indicated on the front panel.
This field is blank if the node is not a member of a clustered system. If this node is
a member of a clustered system, the field indicates the operational status of the
system, as follows:
Active
Indicates that this node is an active member of the system.
Inactive
Indicates that the node is a member of a system, but is not currently
operational. It is not operational because the other nodes that are in the
system cannot be accessed or because this node was excluded from the system.
Degraded
Indicates that the system is operational, but one or more of the member nodes
are missing or have failed.
IPv4 Address option
A clustered system must have either an IPv4 address or an IPv6 address, or both,
assigned to Ethernet port 1. You can also assign an IPv4 address or an IPv6
address, or both, to Ethernet port 2. You can use any of the addresses to access the
system from the command-line tools or the management GUI.
These fields contain the IPv4 addresses of the system. If this node is not a member
of a system or if the IPv4 address has not been assigned, these fields are blank.
IPv4 Subnet options:
The IPv4 subnet mask addresses are set when the IPv4 addresses are assigned to
the system.
The IPv4 subnet options display the subnet mask addresses when the system has
IPv4 addresses. If the node is not a member of a system or if the IPv4 addresses
have not been assigned, this field is blank.
IPv4 Gateway options:
The IPv4 gateway addresses are set when the system is created.
The IPv4 gateway options display the gateway addresses for the system. If the
node is not a member of a system, or if the IPv4 addresses have not been assigned,
this field is blank.
IPv6 Address options
A clustered system must have either an IPv4 address or an IPv6 address, or both,
assigned to Ethernet port 1. You can also assign an IPv4 address or an IPv6
address, or both, to Ethernet port 2. You can use any of the addresses to access the
system from the command-line tools or the management GUI.
These fields contain the IPv6 addresses of the system. If the node is not a member
of a system, or if the IPv6 address has not been assigned, these fields are blank.
IPv6 Prefix option:
The IPv6 prefix is set when a system is created.
The IPv6 prefix option displays the network prefix of the system and the service
IPv6 addresses. The prefix has a value of 0 - 127. If the node is not a member of a
system, or if the IPv6 addresses have not been assigned, a blank line displays.
IPv6 Gateway option:
The IPv6 gateway addresses are set when the system is created.
This option displays the IPv6 gateway addresses for the system. If the node is not
a member of a system, or if the IPv6 addresses have not been assigned, a blank
line displays.
Displaying an IPv6 address
After you have set the IPv6 address, you can display the IPv6 addresses and the
IPv6 gateway addresses.
The IPv6 addresses and the IPv6 gateway addresses consist of eight 4-digit
hexadecimal values that are shown across four panels, as shown in Figure 52.
Each panel displays two 4-digit values that are separated by a colon, the
address field position (such as 2/4) within the total address, and scroll indicators.
Move between the address panels by using the left button or right button.
Figure 52. Viewing the IPv6 address on the front-panel display
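The four-panel layout described above can be reproduced off the node to check an address before you compare it with the display. The following is a minimal sketch that uses the Python standard `ipaddress` module; the function name `ipv6_panels`, the exact text layout of each panel string, and the sample address are assumptions for illustration.

```python
import ipaddress


def ipv6_panels(address):
    """Split an IPv6 address into four panel strings of two groups each."""
    # .exploded always yields eight 4-digit hexadecimal groups.
    groups = ipaddress.IPv6Address(address).exploded.split(":")
    panels = []
    for index in range(0, 8, 2):
        position = index // 2 + 1
        panels.append(f"{groups[index]}:{groups[index + 1]} {position}/4")
    return panels


for panel in ipv6_panels("2001:db8::1"):
    print(panel)
```

For the sample address, the sketch prints `2001:0db8 1/4`, `0000:0000 2/4`, `0000:0000 3/4`, and `0000:0001 4/4`. The scroll indicators that the real panel shows are not modeled.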
Node options
The main node option displays the identification number or the name of the node
if the user has assigned a name.
Status option
The node status is indicated on the front panel. The status can be one of the
following states:
Active
The node is operational, assigned to a system, and ready to complete I/O
operations.
Service
There is an error that is preventing the node from operating as part of a
system. It is safe to shut down the node in this state.
Candidate
The node is not assigned to a system and is not in service. It is safe to shut
down the node in this state.
Starting
The node is part of a system and is attempting to join the system. It cannot
complete I/O operations.
Node WWNN option
The Node WWNN (worldwide node name) option displays the last five
hexadecimal digits of the WWNN that is being used by the node. Only the last five
digits of a WWNN vary on a node. The first 11 digits are always 50050768010.
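Because the first 11 digits are fixed, the full 16-digit WWNN can be rebuilt from the five digits on the panel, for example when you check SAN zoning. The following is a minimal sketch; the function name `full_wwnn` and the sample panel value are hypothetical.

```python
WWNN_PREFIX = "50050768010"  # fixed first 11 digits, as stated above


def full_wwnn(last_five):
    """Combine the fixed prefix with the five digits shown on the panel."""
    digits = last_five.strip().upper()
    if len(digits) != 5 or any(c not in "0123456789ABCDEF" for c in digits):
        raise ValueError("expected exactly five hexadecimal digits")
    return WWNN_PREFIX + digits


print(full_wwnn("0a1b2"))  # hypothetical panel value
```

For the hypothetical panel value, the sketch prints `500507680100A1B2`.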
Service Address option
Pressing select on the Service Address panel displays the IP address that is
configured for access to the service assistant and the service CLI.
Version options
The version option displays the version of the SAN Volume Controller software
that is active on the node. The version consists of four fields that are separated by
full stops. The fields are the version, release, modification, and fix level; for
example, 6.1.0.0.
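When you record or compare code levels, it can help to split the string into its four fields. The following is a minimal sketch; the `CodeLevel` type and `parse_code_level` function are illustrative names, not part of the product.

```python
from typing import NamedTuple


class CodeLevel(NamedTuple):
    version: int
    release: int
    modification: int
    fix: int


def parse_code_level(text):
    """Parse a string such as 6.1.0.0 into its four numeric fields."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError("expected four fields separated by full stops")
    return CodeLevel(*(int(part) for part in parts))


level = parse_code_level("6.1.0.0")
print(level.version, level.release, level.modification, level.fix)
# Named tuples compare field by field, so levels can be ordered directly.
print(parse_code_level("7.6.0.0") > level)
```

The comparison on the last line works because tuples are ordered by their first differing field.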
Build option
The Build: panel displays the level of the SAN Volume Controller software that is
currently active on this node.
Cluster Build option
The Cluster Build: panel displays the level of the software that is currently active
on the system that this node is operating in.
Ethernet options
The Ethernet options display the operational state of the Ethernet ports, the speed
and duplex information, and their media access control (MAC) addresses.
The Ethernet panel shows one of the following states:
Config - Yes
This node is the configuration node.
Config - No
This node is not the configuration node.
No Cluster
This node is not a member of a system.
Press the right button to view the details of the individual Ethernet ports.
Ethernet Port options
The Ethernet port options Port-1 through Port-4 display the state of the links and
indicate whether there is an active link with an Ethernet network.
Link Online
An Ethernet cable is attached to this port.
Link Offline
No Ethernet cable is attached to this port or the link has failed.
Speed options
The speed options Speed-1 through Speed-4 display the speed and duplex
information for the Ethernet port. The speed information can be one of the
following values:
10
The speed is 10 Mbps.
100
The speed is 100 Mbps.
1
The speed is 1 Gbps.
10
The speed is 10 Gbps.
The duplex information can be one of the following values:
Full
Data can be sent and received at the same time.
Half
Data can be sent and received in one direction at a time.
MAC Address options
The MAC address options MAC Address-1 through MAC Address-4 display the
media access control (MAC) address of the Ethernet port.
Fibre Channel port options
The Fibre Channel port options display the operational status of the Fibre Channel
ports.
Active
The port is operational and can access the Fibre Channel fabric.
Inactive
The port is operational but cannot access the Fibre Channel fabric. One of
the following conditions caused this result:
• The Fibre Channel cable has failed.
• The Fibre Channel cable is not installed.
• The device that is at the other end of the cable has failed.
Failed
The port is not operational because of a hardware failure.
Not installed
This port is not installed.
Actions options
During normal operations, action menu options are available on the front panel
display of the node. Only use the front panel actions when directed to do so by a
service procedure. Inappropriate use can lead to loss of access to data or loss of
data.
Figure 53, Figure 54, and Figure 55 show the
sequence of the actions options. In the figures, bold lines indicate that the select
button was pressed. The lighter lines indicate the navigational path (up or down
and left or right). The circled X indicates that if the select button is pressed, an
action occurs using the data entered.
Only one action menu option at a time is displayed on the front-panel display.
Note: Options only display in the menu if they are valid for the current state of
the node. See Table 59 for a list of when the options are valid.
The following options are available from the Actions menu:
Table 59. When options are available
Front panel option | Option name | When option is available for the current state of the node
Cluster IPv4 | Create a clustered system with an IPv4 management address | Candidate state
Cluster IPv6 | Create a clustered system with an IPv6 management address | Candidate state
Service IPv4 | Set the IPv4 service address of the node | All states
Service IPv6 | Set the IPv6 service address of the node | All states
Service DHCPv4 | Set a DHCP IPv4 service address | All states
Service DHCPv6 | Set a DHCP IPv6 service address | All states
Change WWNN | Change the WWNN of the node | Candidate or service state
Enter Service | Enter service state | Whenever error 690 is not showing.
Exit Service | Leave service state if possible | Whenever error 690 is showing.
Recover Cluster | Recover system configuration | Candidate or service state
Remove Cluster | Remove system state | Whenever the node has a clustered system state.
Paced Upgrade | Perform user-paced CCU | Node in service without clustered system state
Set FC Speed | Set Fibre Channel speed |
Reset Password | Reset password | Not active or if the resetpassword command is enabled
Rescue Node | Rescue node software | All states
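The availability rules in Table 59 can be read as a set of conditions on the node. The following sketch encodes them for illustration only: the `NodeView` class, its field names, and the lowercase state strings are hypothetical, and the way each table cell is turned into a condition is an interpretation rather than the product's logic.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Hypothetical snapshot of the conditions that Table 59 refers to."""
    state: str                       # "active", "service", "candidate", or "starting"
    error_690_showing: bool = False
    has_cluster_state: bool = False
    resetpassword_enabled: bool = False


def available_actions(node):
    """Return the Actions menu options that Table 59 lists for this node."""
    # Options that the table lists for all states.
    actions = ["Service IPv4", "Service IPv6", "Service DHCPv4",
               "Service DHCPv6", "Rescue Node"]
    if node.state == "candidate":
        actions += ["Cluster IPv4", "Cluster IPv6"]
    if node.state in ("candidate", "service"):
        actions += ["Change WWNN", "Recover Cluster"]
    # The table ties these two options to whether error 690 is showing.
    actions.append("Exit Service" if node.error_690_showing else "Enter Service")
    if node.has_cluster_state:
        actions.append("Remove Cluster")
    if node.state == "service" and not node.has_cluster_state:
        actions.append("Paced Upgrade")
    if node.state != "active" or node.resetpassword_enabled:
        actions.append("Reset Password")
    # Set FC Speed is omitted: the table gives no availability condition for it.
    return actions


print(available_actions(NodeView(state="candidate")))
```

A candidate node with no clustered system state is offered the Cluster IPv4 and Cluster IPv6 options, while Remove Cluster is left out.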
[Menu diagram. Cluster IPv4?: IPv4 Address:, IPv4 Subnet:, IPv4 Gateway:, Confirm Create?, Cancel?. Cluster IPv6?: IPv6 Address:, IPv6 Prefix:, IPv6 Gateway:, Confirm Create?, Cancel?. Service IPv4?: IPv4 Address:, IPv4 Subnet:, IPv4 Gateway:, Confirm Address?, Cancel?. Service IPv6?: IPv6 Address:, IPv6 Prefix:, IPv6 Gateway:, Confirm Address?, Cancel?. Service DHCPv4?: Confirm DHCPv4?, Cancel?. Service DHCPv6?: Confirm DHCPv6?, Cancel?. Change WWNN?: Edit WWNN?, Confirm WWNN?, Cancel?.]
Figure 53. Upper options of the actions menu on the front panel
[Menu diagram. Enter Service?: Confirm Enter?, Cancel?. Exit Service?: Confirm Exit?, Cancel?. Recover Cluster?: Confirm Recover?, Cancel?. Remove Cluster?: Confirm Remove?, Cancel?. Paced Upgrade?: Confirm Upgrade?, Cancel?.]
Figure 54. Middle options of the actions menu on the front panel
[Menu diagram. Set FC Speed?: Edit Speed?, Confirm Speed?, Cancel?. Reset Password?: Confirm Reset?, Cancel?. Rescue Node?: Confirm Rescue?, Cancel?. Exit Actions?]
Figure 55. Lower options of the actions menu on the front panel
To perform an action, navigate to the Actions option and press the select button.
The action is initiated. Available parameters for the action are displayed. Use the
left or right buttons to move between the parameters. The current setting is
displayed on the second display line.
To set or change a parameter value, press the select button when the parameter is
displayed. The value changes to edit mode. Use the left or right buttons to move
between subfields, and use the up or down buttons to change the value of a
subfield. When the value is correct, press select to leave edit mode.
Each action also has a Confirm? and a Cancel? panel. Pressing select on the
Confirm? panel initiates the action using the current parameter value setting.
Pressing select on the Cancel? panel returns to the Actions option panel without
changing the node.
Note: Messages might not display fully on the screen. You might see a right angle
bracket (>) on the right side of the display screen. If you see a right angle bracket,
press the right button to scroll through the display. When there is no more text to
display, you can move to the next item in the menu by pressing the right button.
Similarly, you might see a left angle bracket (<) on the left side of the display
screen. If you see a left angle bracket, press the left button to scroll through the
display. When there is no more text to display, you can move to the previous item
in the menu by pressing the left button.
Cluster IPv4 or Cluster IPv6 options
You can create a clustered system from the Cluster IPv4 or Cluster IPv6 action
options.
The Cluster IPv4 or Cluster IPv6 option allows you to create a clustered system.
From the front panel, when you create a clustered system, you can set either the
IPv4 or the IPv6 address for Ethernet port 1. If required, you can add more
management IP addresses by using the management GUI or the CLI.
Press the left and right buttons to navigate through the parameters that are
associated with the Cluster option. When you have navigated to the desired
parameter, press the select button.
The parameters that are available include:
• IPv4 Address
• IPv4 Subnet
• IPv4 Gateway
• IPv4 Confirm Create?
• IPv6 Address
• IPv6 Prefix
• IPv6 Gateway
• IPv6 Confirm Create?
If you are creating the clustered system with an IPv4 address, complete the
following steps:
1. Press and release the up or down button until Actions? is displayed. Press and
release the select button.
2. Press and release the left or right button until Cluster IPv4? is displayed.
Press and release the select button.
3. Edit the IPv4 address, the IPv4 subnet, and the IPv4 gateway.
4. Press and release the left or right button until IPv4 Confirm Create? is
displayed.
5. Press and release the select button to confirm.
If you are creating the clustered system with an IPv6 address, complete the
following steps:
1. Press and release the up or down button until Actions? is displayed. Press and
release the select button.
2. Press and release the left or right button until Cluster IPv6? is displayed. Press
and release the select button.
3. Edit the IPv6 address, the IPv6 prefix, and the IPv6 gateway.
4. Press and release the left or right button until IPv6 Confirm Create? is
displayed.
5. Press and release the select button to confirm.
IPv4 Address option
Using the IPv4 Address option, you can set the IP address for Ethernet port 1 of the
clustered system that you are going to create. The system can have either an IPv4
or an IPv6 address, or both at the same time. You can set either the IPv4 or IPv6
management address for Ethernet port 1 from the front panel when you are
creating the system. If required, you can add more management IP addresses from
the CLI.
Attention: When you set the IPv4 address, ensure that you type the correct
address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Perform the following steps to set the IPv4 address:
1. Navigate to the IPv4 Address panel.
2. Press the select button. The first IP address number is highlighted.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value. If you want to quickly
increase the highlighted value, hold the up button. If you want to quickly
decrease the highlighted value, hold the down button.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The disabling of the fast increase or decrease function lasts until
the creation is completed or until the feature is again enabled. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To again enable the fast increase
or decrease function, press and hold the up button, press and release the select
button, and then release the up button.
4. Press the right button or left button to move to the number field that you want
to set.
5. Repeat steps 3 and 4 for each number field that you want to set.
6. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
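The editing steps for the IPv4 address can be rehearsed with a small simulation of the button presses. This is a sketch under stated assumptions: the function name `edit_ipv4` is hypothetical, and the behavior at the limits (values stop at 0 and 255, and the highlight stops at the first and last number field) is assumed, because this guide does not describe it. The fast increase and decrease function is not modeled.

```python
def edit_ipv4(start, presses):
    """Apply a sequence of button presses to a dotted-decimal IPv4 value."""
    fields = [int(part) for part in start.split(".")]
    current = 0  # the first number field is highlighted after pressing select
    for button in presses:
        if button == "up":
            # Assumption: values stop at 255 rather than wrapping around.
            fields[current] = min(fields[current] + 1, 255)
        elif button == "down":
            # Assumption: values stop at 0 rather than wrapping around.
            fields[current] = max(fields[current] - 1, 0)
        elif button == "right":
            # Assumption: the highlight stops at the last number field.
            current = min(current + 1, 3)
        elif button == "left":
            # Assumption: the highlight stops at the first number field.
            current = max(current - 1, 0)
    return ".".join(str(value) for value in fields)


print(edit_ipv4("192.168.0.1", ["right", "right", "up", "up", "right", "down"]))
```

Starting from 192.168.0.1, the sample sequence moves to the third field, raises it twice, moves to the fourth field, and lowers it once, which gives 192.168.2.0.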
IPv4 Subnet option
Using this option, you can set the IPv4 subnet mask for Ethernet port 1.
Attention: When you set the IPv4 subnet mask address, ensure that you type the
correct address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Perform the following steps to set the subnet mask:
1. Navigate to the IPv4 Subnet panel.
2. Press the select button. The first subnet mask number is highlighted.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value. If you want to quickly
increase the highlighted value, hold the up button. If you want to quickly
decrease the highlighted value, hold the down button.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The disabling of the fast increase or decrease function lasts until
the creation is completed or until the feature is again enabled. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To again enable the fast increase
or decrease function, press and hold the up button, press and release the select
button, and then release the up button.
4. Press the right button or left button to move to the number field that you want
to set.
5. Repeat steps 3 and 4 for each number field that you want to set.
6. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
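Because a mistyped subnet mask can leave the system unreachable, it can be worth checking the value before you enter it on the front panel. The following sketch tests whether a dotted-decimal string is a well-formed mask, that is, a run of 1 bits followed only by 0 bits. The function name is hypothetical and the check is a general networking rule, not a procedure from this guide.

```python
import ipaddress


def is_valid_subnet_mask(mask):
    """Return True if mask is dotted-decimal with contiguous leading 1 bits."""
    try:
        value = int(ipaddress.IPv4Address(mask))
    except ipaddress.AddressValueError:
        return False
    inverted = ~value & 0xFFFFFFFF
    # A valid mask inverts to 2**k - 1, so adding 1 clears every set bit.
    return (inverted & (inverted + 1)) == 0


print(is_valid_subnet_mask("255.255.255.0"))
print(is_valid_subnet_mask("255.0.255.0"))
```

The first call reports a valid mask; the second reports an invalid one because the 1 bits are not contiguous.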
IPv4 Gateway option
Using this option, you can set the IPv4 gateway address for Ethernet port 1.
Attention: When you set the IPv4 gateway address, ensure that you type the
correct address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Perform the following steps to set the IPv4 gateway address:
1. Navigate to the IPv4 Gateway panel.
2. Press the select button. The first gateway address number field is highlighted.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value. If you want to quickly
increase the highlighted value, hold the up button. If you want to quickly
decrease the highlighted value, hold the down button.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The disabling of the fast increase or decrease function lasts until
the creation is completed or until the feature is again enabled. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To again enable the fast increase
or decrease function, press and hold the up button, press and release the select
button, and then release the up button.
4. Press the right button or left button to move to the number field that you want
to set.
5. Repeat steps 3 and 4 for each number field that you want to set.
6. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
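Because a mistyped address can make the system unreachable, it can help to check the planned values before you enter them on the front panel. The following Python sketch is illustrative only and is not part of the SAN Volume Controller software. It assumes that you also know the subnet mask for Ethernet port 1. The function name check_ipv4_settings and the sample addresses are hypothetical.

```python
import ipaddress


def check_ipv4_settings(address, mask, gateway):
    """Return a list of detected problems; an empty list means none were found."""
    try:
        # IPv4Interface accepts "address/netmask" as well as "address/prefix".
        port = ipaddress.IPv4Interface(f"{address}/{mask}")
    except ValueError as exc:
        return [f"address or mask is not valid: {exc}"]
    try:
        gw = ipaddress.IPv4Address(gateway)
    except ValueError as exc:
        return [f"gateway is not valid: {exc}"]

    problems = []
    if gw == port.ip:
        problems.append("gateway is the same as the port address")
    elif gw not in port.network:
        problems.append(f"gateway {gw} is outside subnet {port.network}")
    return problems


if __name__ == "__main__":
    # Hypothetical planning values, checked before they are typed on the panel.
    print(check_ipv4_settings("192.168.70.121", "255.255.255.0", "192.168.70.1"))
    print(check_ipv4_settings("192.168.70.121", "255.255.255.0", "192.168.71.1"))
```

An empty list means only that the values are consistent with each other. It does not confirm that the gateway exists on your network.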
IPv4 Confirm Create? option
Using this option, you can start an operation to create a system with an IPv4
address.
1. Press and release the left or right button until IPv4 Confirm Create? is
displayed.
2. Press the select button to start the operation.
If the create operation is successful, Password is displayed on line 1. The
password that you can use to access the system is displayed on line 2. Be sure
to immediately record the password; it is required on the first attempt to
manage the system from the management GUI.
Attention: The password displays for only 60 seconds, or until a front panel
button is pressed. The system is created only after the password display is
cleared.
If the create operation fails, Create Failed: is displayed on line 1 of the
front-panel display screen. Line 2 displays one of two possible error codes that
you can use to isolate the cause of the failure.
IPv6 Address option
Using this option, you can set the IPv6 address for Ethernet port 1 of the system
that you are going to create. The system can have either an IPv4 or an IPv6
address, or both at the same time. You can set either the IPv4 or IPv6 management
address for Ethernet port 1 from the front panel when you are creating the system.
If required, you can add more management IP addresses from the CLI.
Attention: When you set the IPv6 address, ensure that you type the correct
address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
To set the IPv6 address:
1. From the Create Cluster? option, press the select button, and then press the
down button. The IPv6 Address option is displayed.
2. Press the select button again. The first IPv6 address number is highlighted.
3. Move between the address panels by using the left button or right button. The
IPv6 addresses and the IPv6 gateway addresses consist of eight (4-digit)
hexadecimal values that are shown across four panels.
4. You can change each number in the address independently. Press the up button
if you want to increase the value that is highlighted; press the down button if
you want to decrease that value.
5. Press the right button or left button to move to the number field that you want
to set.
6. Repeat steps 4 and 5 for each number field that you want to set.
7. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
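To prepare an address for entry, you can write it out in the same form that the panels use. The following Python sketch is illustrative only. It expands an IPv6 address to eight 4-digit hexadecimal values and groups them for four panels. It assumes two values on each panel, which follows from eight values on four panels but is not stated explicitly in this guide. The function name ipv6_panels is hypothetical.

```python
import ipaddress


def ipv6_panels(address, values_per_panel=2):
    """Split an IPv6 address into the groups that would appear on each panel."""
    # .exploded always yields eight zero-padded 4-digit groups.
    values = ipaddress.IPv6Address(address).exploded.split(":")
    return [
        ":".join(values[i:i + values_per_panel])
        for i in range(0, len(values), values_per_panel)
    ]


if __name__ == "__main__":
    for number, panel in enumerate(ipv6_panels("2001:db8::1"), start=1):
        print(f"panel {number}: {panel}")
```

Writing the address in full form first avoids mistakes with the "::" shorthand, which the panels do not use because each digit is edited individually.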
IPv6 Prefix option
Using this option, you can set the IPv6 prefix for Ethernet port 1.
Attention: When you set the IPv6 prefix, ensure that you type the correct
network prefix. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Perform the following steps to set the IPv6 prefix:
Note: The prefix is restricted to a value in the range 0 - 127.
1. Press and release the left or right button until IPv6 Prefix is displayed.
2. Press the select button. The first prefix number field is highlighted.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value. If you want to quickly
increase the highlighted value, hold the up button. If you want to quickly
decrease the highlighted value, hold the down button.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The disabling of the fast increase or decrease function lasts until
the creation is completed or until the feature is again enabled. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To again enable the fast increase
or decrease function, press and hold the up button, press and release the select
button, and then release the up button.
4. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
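The effect of the prefix can be checked before you enter it. The following Python sketch is illustrative only. It applies the 0 - 127 limit from the note above and shows which network a planned address and prefix belong to. The function name ipv6_network and the sample address are hypothetical.

```python
import ipaddress


def ipv6_network(address, prefix):
    """Return the IPv6 network for an address and a front-panel prefix value."""
    if not isinstance(prefix, int) or not 0 <= prefix <= 127:
        raise ValueError("prefix must be an integer in the range 0 - 127")
    return ipaddress.IPv6Interface(f"{address}/{prefix}").network


if __name__ == "__main__":
    print(ipv6_network("2001:db8::10", 64))
```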
IPv6 Gateway option
Using this option, you can set the IPv6 gateway for Ethernet port 1.
Attention: When you set the IPv6 gateway address, ensure that you type the
correct address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Perform the following steps to set the IPv6 gateway address:
1. Press and release the left or right button until IPv6 Gateway is displayed.
2. Press the select button. The first gateway address number is highlighted. The
IPv6 addresses and the IPv6 gateway addresses consist of eight (4-digit)
hexadecimal values that are shown across four panels.
3. You can change each number in the address independently. Press the up button
if you want to increase the value that is highlighted; press the down button if
you want to decrease that value.
4. Press the right button or left button to move to the number field that you want
to set.
5. Repeat steps 3 and 4 for each number field that you want to set.
6. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
IPv6 Confirm Create? option
Using this option, you can start an operation to create a system with an IPv6
address.
1. Press and release the left or right button until IPv6 Confirm Create? is
displayed.
2. Press the select button to start the operation.
If the create operation is successful, Password is displayed on line 1. The
password that you can use to access the system is displayed on line 2. Be sure
to immediately record the password; it is required on the first attempt to
manage the system from the management GUI.
Attention: The password displays for only 60 seconds, or until a front panel
button is pressed. The system is created only after the password display is
cleared.
If the create operation fails, Create Failed: is displayed on line 1 of the
front-panel display screen. Line 2 displays one of two possible error codes that
you can use to isolate the cause of the failure.
Service IPv4 or Service IPv6 options
You can use the front panel to change a service IPv4 address or a service IPv6
address.
IPv4 Address option
The IPv4 Address panels show one of the following items for the selected Ethernet
port:
- The active service address if the system has an IPv4 address. This address can be
either a configured or fixed address, or it can be an address obtained through
DHCP.
- DHCP Failed if the IPv4 service address is configured for DHCP but the node
was unable to obtain an IP address.
- DHCP Configuring if the IPv4 service address is configured for DHCP and the
node is still attempting to obtain an IP address. The display changes to the IPv4
address automatically if a DHCP address is allocated and activated.
- A blank line if the system does not have an IPv4 address.
If the service IPv4 address was not set correctly or a DHCP address was not
allocated, you have the option of correcting the IPv4 address from this panel. The
service IP address must be in the same subnet as the management IP address.
To set a fixed service IPv4 address from the IPv4 Address: panel, perform the
following steps:
1. Press and release the select button to put the panel in edit mode.
2. Press the right button or left button to move to the number field that you want
to set.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value. If you want to quickly
increase the highlighted value, hold the up button. If you want to quickly
decrease the highlighted value, hold the down button.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The disabling of the fast increase or decrease function lasts until
the creation is completed or until the feature is again enabled. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To again enable the fast increase
or decrease function, press and hold the up button, press and release the select
button, and then release the up button.
4. When all the fields are set as required, press and release the select button to
activate the new IPv4 address.
The IPv4 Address: panel is displayed. The new service IPv4 address is not
displayed until it has become active. If the new address has not been displayed
after 2 minutes, check that the selected address is valid on the subnetwork and
that the Ethernet switch is working correctly.
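The requirement that the service IP address is in the same subnet as the management IP address can be checked ahead of time. The following Python sketch is illustrative only and works for IPv4 and IPv6 values. The function name in_management_subnet and the sample addresses are hypothetical.

```python
import ipaddress


def in_management_subnet(service_ip, management_ip, prefix_or_mask):
    """Return True if service_ip lies in the subnet of the management address."""
    management = ipaddress.ip_interface(f"{management_ip}/{prefix_or_mask}")
    service = ipaddress.ip_address(service_ip)
    # An IPv4 address can never be in an IPv6 subnet, and vice versa.
    return service.version == management.version and service in management.network


if __name__ == "__main__":
    print(in_management_subnet("10.0.5.20", "10.0.5.10", "255.255.255.0"))
    print(in_management_subnet("10.0.6.20", "10.0.5.10", "255.255.255.0"))
```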
IPv6 Address option
The IPv6 Address panels show one of the following conditions for the selected
Ethernet port:
- The active service address if the system has an IPv6 address. This address can be
either a configured or fixed address, or it can be an address obtained through
DHCP.
- DHCP Failed if the IPv6 service address is configured for DHCP but the node
was unable to obtain an IP address.
- DHCP Configuring if the IPv6 service address is configured for DHCP and the
node is still attempting to obtain an IP address. The display changes to the IPv6
address automatically if a DHCP address is allocated and activated.
- A blank line if the system does not have an IPv6 address.
If the service IPv6 address was not set correctly or a DHCP address was not
allocated, you have the option of correcting the IPv6 address from this panel. The
service IP address must be in the same subnet as the management IP address.
To set a fixed service IPv6 address from the IPv6 Address: panel, perform the
following steps:
1. Press and release the select button to put the panel in edit mode. When the
panel is in edit mode, the full address is still shown across four panels as eight
(four-digit) hexadecimal values. You edit each digit of the hexadecimal values
independently. The current digit is highlighted.
2. Press the right button or left button to move to the number field that you want
to set.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value.
4. When all the fields are set as required, press and release the select button to
activate the new IPv6 address.
The IPv6 Address: panel is displayed. The new service IPv6 address is not
displayed until it has become active. If the new address has not been displayed
after 2 minutes, check that the selected address is valid on the subnetwork and
that the Ethernet switch is working correctly.
Service DHCPv4 or DHCPv6 options
The active service address for a system can be either a configured or fixed address,
or it can be an address obtained through DHCP.
If a service IP address does not exist, you must assign a service IP address or use
DHCP with this action.
To set the service IPv4 address to use DHCP, perform the following steps:
1. Press and release the up or down button until Service DHCPv4? is displayed.
2. Press and release the down button. Confirm DHCPv4? is displayed.
3. Press and release the select button to activate DHCP, or you can press and
release the up button to keep the existing address.
4. If you activate DHCP, DHCP Configuring is displayed while the node attempts
to obtain a DHCP address. It changes automatically to show the allocated
address if a DHCP address is allocated and activated, or it changes to DHCP
Failed if a DHCP address is not allocated.
To set the service IPv6 address to use DHCP, perform the following steps:
1. Press and release the up or down button until Service DHCPv6? is displayed.
2. Press and release the down button. Confirm DHCPv6? is displayed.
3. Press and release the select button to activate DHCP, or you can press and
release the up button to keep the existing address.
4. If you activate DHCP, DHCP Configuring is displayed while the node attempts
to obtain a DHCP address. It changes automatically to show the allocated
address if a DHCP address is allocated and activated, or it changes to DHCP
Failed if a DHCP address is not allocated.
Note: If an IPv6 router is present on the local network, SAN Volume Controller
does not differentiate between an autoconfigured address and a DHCP address.
Therefore, SAN Volume Controller uses the first address that is detected.
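The displays that are described for the service address can be summarized as a small state model. The following Python sketch is illustrative only and does not represent the node software. The state names and the functions panel_text and after_dhcp_attempt are hypothetical; only the displayed strings come from this guide.

```python
from enum import Enum


class ServiceAddressState(Enum):
    NO_ADDRESS = 1        # no service address exists
    ACTIVE = 2            # a fixed or DHCP-allocated address is active
    DHCP_CONFIGURING = 3  # DHCP is enabled and the node is requesting an address
    DHCP_FAILED = 4       # DHCP is enabled but no address was allocated


def panel_text(state, address=""):
    """Return the text that the address panel would show for a given state."""
    if state is ServiceAddressState.DHCP_CONFIGURING:
        return "DHCP Configuring"
    if state is ServiceAddressState.DHCP_FAILED:
        return "DHCP Failed"
    if state is ServiceAddressState.ACTIVE:
        return address
    return ""  # blank line when no address exists


def after_dhcp_attempt(allocated_address):
    """Return the state that follows DHCP Configuring."""
    if allocated_address:
        return ServiceAddressState.ACTIVE
    return ServiceAddressState.DHCP_FAILED
```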
Change WWNN? option
The Change WWNN? option displays the last five hexadecimal digits of the
WWNN that is being used by the node. Only the last five digits of a WWNN vary
on a node. The first 11 digits are always 50050768010.
To edit the WWNN, complete the following steps:
Important: Only change the WWNN when you are instructed to do so by a service
procedure. Nodes must always have a unique WWNN. If you change the WWNN,
you might have to reconfigure hosts and the SAN zoning.
1. Press and release the up or down button until Actions is displayed.
2. Press and release the select button.
3. Press and release the up or down button until Change WWNN? is displayed on
line 1. Line 2 of the display shows the last five numbers of the WWNN that is
currently set. The first number is highlighted.
4. Edit the highlighted number to match the number that is required. Use the up
and down buttons to increase or decrease the numbers. The numbers wrap F to
0 or 0 to F. Use the left and right buttons to move between the numbers.
5. When the highlighted value matches the required number, press and release the
select button to activate the change. The Node WWNN: panel displays and the
second line shows the last five characters of the changed WWNN.
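The relationship between the five editable digits and the full WWNN can be shown with a short Python sketch. It is illustrative only. It uses the fixed first 11 digits that are stated above and models the wrap from F to 0 and from 0 to F. The function names full_wwnn and step_digit are hypothetical.

```python
import string

WWNN_PREFIX = "50050768010"  # fixed first 11 digits, as stated in this section


def full_wwnn(last_five):
    """Build the 16-digit WWNN from the five digits shown on the front panel."""
    if len(last_five) != 5 or any(ch not in string.hexdigits for ch in last_five):
        raise ValueError("expected exactly five hexadecimal digits")
    return WWNN_PREFIX + last_five.upper()


def step_digit(digit, direction):
    """Model one press of the up (+1) or down (-1) button on a single digit."""
    return format((int(digit, 16) + direction) % 16, "X")


if __name__ == "__main__":
    print(full_wwnn("0a1b2"))
    print(step_digit("F", +1), step_digit("0", -1))
```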
Enter Service? option
You can enter service state from the Enter Service? option. Service state can be
used to remove a node from a candidate list or to prevent it from being readded to
a clustered system.
If the node is active, entering service state can cause disruption to hosts if other
faults exist in the system. While in service state, the node cannot join or run as part
of a clustered system.
To exit service state, ensure that all errors are resolved. You can exit service state
by using the Exit Service? option or by restarting the node.
Exit Service? option
You can exit service state from the Exit Service? option. This action releases the
node from the service state.
If there are no noncritical errors, the node enters candidate state. If possible, the
node then becomes active in a clustered system.
To exit service state, ensure that all errors are resolved. You can exit service state
by using this option or by restarting the node.
Recover Cluster? option
By using the Recover Cluster? option, you can recover an entire clustered system
if the data has been lost from all nodes.
Perform service actions on nodes only when directed by the service procedures. If
used inappropriately, service actions can cause loss of access to data or data loss.
For information about the recover system procedure, see “Recover system
procedure” on page 253.
Remove Cluster? option
The Remove Cluster? option deletes the system state data from the node.
Use this option as the final step in decommissioning a system after the other nodes
have been removed from the system using the command-line interface (CLI) or the
management GUI.
Attention: Use the front panel to remove state data from a single node system. To
remove a node from a multi-node system, always use the CLI or the remove node
options from the management GUI.
To delete the state data from the node using the Remove Cluster? panel:
1. Press and hold the up button.
2. Press and release the select button.
3. Release the up button.
After the option is run, the node shows Cluster: with no system name. If this
option is run on a node that is still a member of a system, the system shows error
1195, Node missing, and the node is displayed in the list of nodes in the system.
Remove the node by using the management GUI or CLI.
Paced Upgrade? option
Use this option to control the time when individual nodes are upgraded within a
system update.
Note: This action can be used only when the following conditions exist for the
node:
- The node is in service state.
- The node has no errors.
- The node has been removed from the clustered system.
For additional information, see the “Upgrading the software manually” topic in the
information center.
Set FC Speed? option
You can change the speed of the Fibre Channel ports on a SAN Volume Controller
by using the Set FC Speed? option.
Reset Password? option
Use the Reset Password? option if the system superuser password has been lost or
forgotten, or if you are otherwise unable to access the system. The option resets
the system superuser password only if your password security policy permits it.
If your password security policy permits password recovery, and if the node is
currently a member of a clustered system, the system superuser password is reset
and a new password is displayed for 60 seconds. If your password security policy
does not permit password recovery or the node is not a member of a system,
completing these steps has no effect.
If the node is in active state when the password is reset, the reset applies to all
nodes in the system. If the node is in candidate or service state when the password
is reset, the reset applies only to the single node.
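The conditions that determine what the Reset Password? option does can be read as a decision table. The following Python sketch is illustrative only and encodes the rules exactly as they are described above. The function name, parameter names, and returned strings are hypothetical.

```python
def reset_password_effect(policy_permits_recovery, member_of_system, node_state):
    """Describe the outcome of Reset Password? for the given conditions."""
    if not policy_permits_recovery or not member_of_system:
        return "no effect"
    if node_state == "active":
        return "reset applies to all nodes in the system"
    if node_state in ("candidate", "service"):
        return "reset applies only to the single node"
    raise ValueError(f"node state not covered by this sketch: {node_state}")


if __name__ == "__main__":
    print(reset_password_effect(True, True, "active"))
    print(reset_password_effect(False, True, "active"))
```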
Rescue Node? option
You can start the automatic software recovery for this node by using the Rescue
Node? option.
Note: Another way to rescue a node is to force a node rescue when the node
boots. It is the preferred method. Forcing a node rescue when a node boots works
by booting the operating system from the service controller and running a program
that copies all the SAN Volume Controller software from any other node that can
be found on the Fibre Channel fabric. See “Completing the node rescue when the
node boots” on page 271.
Exit Actions? option
Return to the main menu by selecting the Exit Actions? option.
Language? option
You can change the language that displays on the front panel.
Before you begin
The Language? option allows you to change the language that is displayed on the
menu. Figure 56 shows the Language? option sequence.
Figure 56. Language? navigation (the Select button leads from Language? to the
English? and Japanese? options)
The following languages are available:
- English
- Japanese
About this task
To select the language that you want to be used on the front panel, perform the
following steps:
Procedure
1. Press and release the up or down button until Language? is displayed.
2. Press and release the select button.
3. Use the left and right buttons to move to the language that you want. The
translated language names are displayed in their own character set. If you do
not understand the language that is displayed, wait for at least 60 seconds for
the menu to reset to the default option.
4. Press and release the select button to select the language that is displayed.
Results
If the selected language uses the Latin alphabet, the front panel display shows two
lines. The panel text is displayed on the first line and additional data is displayed
on the second line.
If the selected language does not use the Latin alphabet, the display shows only
one line at a time to clearly display the character font. For those languages, you
can switch between the panel text and the additional data by pressing and
releasing the select button.
Additional data is unavailable when the front panel displays a menu option, which
ends with a question mark (?). In this case, press and release the select button to
choose the menu option.
Note: You cannot select another language when the node is displaying a boot
error.
Using the power control for the SAN Volume Controller node
Some SAN Volume Controller nodes are powered by an uninterruptible power
supply that is in the same rack as the nodes. Other nodes have internal batteries
instead, such as the SAN Volume Controller 2145-DH8.
The power state of the SAN Volume Controller is displayed by a power indicator
on the front panel. If the uninterruptible power supply battery is not sufficiently
charged to enable the SAN Volume Controller to become fully operational, its
charge state is displayed on the front panel display of the node.
The power to a SAN Volume Controller is controlled through the management
GUI, or by the power button on the front panel of the node.
Note: Never turn off the node by removing the power cable. You might lose data.
For more information about how to power off the node, see “MAP 5350: Powering
off a node” on page 302.
If the SAN Volume Controller software is running and you request it to power off
from the management GUI, CLI, or power button, the node starts its power off
processing. During this time, the node indicates the progress of the power-off
operation on the front panel display. After the power-off processing is complete,
the front panel becomes blank and the front panel power light flashes. It is then
safe to remove the power cable from the rear of the node. If the power button on
the front panel is pressed during power-off processing, the front panel display
changes to indicate that the node is being restarted, but the power-off process
completes before the restart.
If the SAN Volume Controller software is not running when the front panel power
button is pressed, the node immediately powers off.
Note: The 2145 UPS-1U does not power off when the node is shut down from the
power button.
If you turn off a node using the power button or by a command, the node is put
into a power-off state. The SAN Volume Controller remains in this state until the
power cable is connected to the rear of the node and the power button is pressed.
During the startup sequence, the SAN Volume Controller tries to detect the status
of the uninterruptible power supply through the uninterruptible power supply
signal cable. If an uninterruptible power supply is not detected, the node pauses
and an error is shown on the front panel display. If the uninterruptible power
supply is detected, the software monitors the operational state of the
uninterruptible power supply. If no uninterruptible power supply errors are
reported and the uninterruptible power supply battery is sufficiently charged, the
SAN Volume Controller becomes operational. If the uninterruptible power supply
battery is not sufficiently charged, the charge state is indicated by a progress bar
on the front panel display. When an uninterruptible power supply is first turned
on, it might take up to 2 hours before the battery is sufficiently charged for the
SAN Volume Controller node to become operational.
If input power to the uninterruptible power supply is lost, the node immediately
stops all I/O operations and saves the contents of its dynamic random access
memory (DRAM) to the internal disk drive. While data is being saved to the disk
drive, a Power Failure message is shown on the front panel and is accompanied
by a descending progress bar that indicates the quantity of data that remains to be
saved. After all the data is saved, the node is turned off and the power light on the
front panel turns off.
Note: The node is now in standby state. If the input power to the uninterruptible
power supply unit is restored, the node restarts. If the uninterruptible power
supply battery was fully discharged, Charging is displayed and the boot process
waits for the battery to charge. When the battery is sufficiently charged, Booting is
displayed, the node is tested, and the software is loaded. When the boot process is
complete, Recovering is displayed while the uninterruptible power supply finalizes
its charge. While Recovering is displayed, the system can function normally.
However, when the power is restored after a second power failure, there is a delay
(with Charging displayed) before the node can complete its boot process.
Chapter 7. Diagnosing problems
You can diagnose problems with the controls and indicators, the command-line
interface (CLI), the management GUI, or the Service Assistant GUI. The diagnostic
LEDs on the SAN Volume Controller nodes and uninterruptible power supply
units also help you diagnose hardware problems.
Event logs
By understanding the event log, you can do the following tasks:
- Manage the event log
- View the event log
- Describe the fields in the event log
Error codes
The following topics provide information to help you understand and process the
error codes:
- Event reporting
- Understanding the events
- Understanding the error codes
- Determining a hardware boot failure
If the node is showing a boot message, failure message, or node error message,
and you determined that the problem was caused by a software or firmware
failure, you can restart the node to see whether that might resolve the problem.
Perform the following steps to properly shut down and restart the node:
1. Follow the instructions in “MAP 5350: Powering off a node” on page 302.
2. Restart only one node at a time.
3. Do not shut down the second node in an I/O group for at least 30 minutes
after you shut down and restart the first node.
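If you plan restarts for both nodes in an I/O group, the 30-minute rule in step 3 can be expressed as a simple time check. The following Python sketch is illustrative only. It assumes that the 30 minutes are counted from the time that the first node finished restarting, which is one reading of step 3. The function names are hypothetical.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=30)  # minimum wait from step 3


def earliest_second_shutdown(first_node_restarted_at):
    """Return the earliest time at which the second node can be shut down."""
    return first_node_restarted_at + MIN_GAP


def may_shut_down_second_node(first_node_restarted_at, now):
    """Return True if at least 30 minutes have passed since the first restart."""
    return now - first_node_restarted_at >= MIN_GAP


if __name__ == "__main__":
    restarted = datetime(2015, 6, 1, 9, 0)
    print(earliest_second_shutdown(restarted))
```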
Starting statistics collection
You can start the collection of cluster statistics from the Starting the Collection of
Statistics panel in the management GUI.
Introduction
For each collection interval, the management GUI creates four statistics files: one
for managed disks (MDisks), named Nm_stat; one for volumes and volume copies,
named Nv_stat; one for nodes, named Nn_stat; and one for SAS drives, named
Nd_stat. The files are written to the /dumps/iostats directory on the node. To
retrieve the statistics files from the non-configuration nodes onto the configuration
node, use the svctask cpdumps command.
A maximum of 16 files of each type can be created for the node. When the 17th file
is created, the oldest file for the node is overwritten.
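The way that the 17th file replaces the oldest file can be modeled with a fixed-length queue for each file type. The following Python sketch is illustrative only and does not read or write any files on a node. The class name StatsHistory and the numbered file-name suffixes are hypothetical; only the four name prefixes and the limit of 16 come from this section.

```python
from collections import deque

MAX_FILES_PER_TYPE = 16

# File-name prefixes and the objects that they describe, as listed above.
STAT_TYPES = {
    "Nm_stat": "MDisks",
    "Nv_stat": "volumes and volume copies",
    "Nn_stat": "nodes",
    "Nd_stat": "SAS drives",
}


class StatsHistory:
    """Track the statistics files that one node would currently hold."""

    def __init__(self):
        # deque(maxlen=16) silently drops the oldest entry on the 17th append.
        self.files = {prefix: deque(maxlen=MAX_FILES_PER_TYPE) for prefix in STAT_TYPES}

    def add(self, file_name):
        for prefix in STAT_TYPES:
            if file_name.startswith(prefix):
                self.files[prefix].append(file_name)
                return prefix
        raise ValueError(f"not a statistics file name: {file_name}")


if __name__ == "__main__":
    history = StatsHistory()
    for sample in range(17):
        history.add(f"Nm_stat_{sample:02d}")  # hypothetical suffix
    print(len(history.files["Nm_stat"]), history.files["Nm_stat"][0])
```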
Fields
The following fields are available for user definition:
Interval
Specify the interval in minutes between the collection of statistics. You can
specify 1 - 60 minutes in increments of 1 minute.
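The interval also determines how much history is kept on a node. Because at most 16 files of each type are retained and one file of each type is created for each interval, the retained history is roughly 16 times the interval. The following Python sketch is illustrative only; the retained-history figure is derived from those two statements and is not stated in this guide. The function name retention_minutes is hypothetical.

```python
MAX_FILES_PER_TYPE = 16


def retention_minutes(interval_minutes):
    """Validate the interval and return the approximate history that is kept."""
    if not isinstance(interval_minutes, int) or not 1 <= interval_minutes <= 60:
        raise ValueError("interval must be a whole number of minutes, 1 - 60")
    return MAX_FILES_PER_TYPE * interval_minutes


if __name__ == "__main__":
    for interval in (1, 15, 60):
        print(interval, "minute interval keeps about", retention_minutes(interval), "minutes")
```

If you need a longer history than this, copy the files off the node before they are overwritten.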
Tables
The following tables describe the information that is reported for individual nodes
and volumes.
Table 60 describes the statistics collection for MDisks, for individual nodes.
Table 60. Statistics collection for individual nodes

Statistic name  Description
--------------  -------------------------------------------------------------
id              Indicates the name of the MDisk for which the statistics
                apply.
idx             Indicates the identifier of the MDisk for which the
                statistics apply.
rb              Indicates the cumulative number of blocks of data that is
                read (since the node has been running).
re              Indicates the cumulative read external response time in
                milliseconds for each MDisk. The cumulative response time
                for disk reads is calculated by starting a timer when a SCSI
                read command is issued and stopping it when the command
                completes successfully. The elapsed time is added to the
                cumulative counter.
ro              Indicates the cumulative number of MDisk read operations
                that are processed (since the node has been running).
rq              Indicates the cumulative read queued response time in
                milliseconds for each MDisk. This response time is measured
                from above the queue of commands to be sent to an MDisk
                because the queue depth is already full. This calculation
                includes the elapsed time that is taken for read commands
                to complete from the time they join the queue.
wb              Indicates the cumulative number of blocks of data written
                (since the node has been running).
we              Indicates the cumulative write external response time in
                milliseconds for each MDisk. The cumulative response time
                for disk writes is calculated by starting a timer when a
                SCSI write command is issued and stopping it when the
                command completes successfully. The elapsed time is added
                to the cumulative counter.
wo              Indicates the cumulative number of MDisk write operations
                processed (since the node has been running).
wq              Indicates the cumulative write queued response time in
                milliseconds for each MDisk. This response time is measured
                from above the queue of commands to be sent to an MDisk
                because the queue depth is already full. This calculation
                includes the elapsed time taken for write commands to
                complete from the time they join the queue.
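Because the MDisk counters are cumulative, a value for one collection interval is obtained by subtracting the previous sample from the current sample. The following Python sketch is illustrative only. It assumes that two samples for the same MDisk are already available as dictionaries that are keyed by statistic name; parsing the statistics files is not shown. The function name and the sample values are hypothetical.

```python
def average_response_ms(previous, current, time_key, ops_key):
    """Average response time per operation between two cumulative samples.

    Returns None if no operations completed or if the counters went backward,
    which happens when the node restarted between the two samples.
    """
    operations = current[ops_key] - previous[ops_key]
    elapsed_ms = current[time_key] - previous[time_key]
    if operations <= 0 or elapsed_ms < 0:
        return None
    return elapsed_ms / operations


if __name__ == "__main__":
    earlier = {"ro": 1000, "re": 5000, "rq": 6500}  # hypothetical sample
    later = {"ro": 1500, "re": 8000, "rq": 10500}   # hypothetical sample
    print("external read ms per op:", average_response_ms(earlier, later, "re", "ro"))
    print("queued read ms per op:", average_response_ms(earlier, later, "rq", "ro"))
```

The same function can be used for writes by passing the we or wq counter with the wo counter.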
Table 61 on page 121 describes the VDisk (volume) information that is reported for
individual nodes.
Note: MDisk statistics files for nodes are written to the /dumps/iostats directory
on the individual node.
Table 61. Statistic collection for volumes for individual nodes

Statistic name  Description
--------------  -------------------------------------------------------------
id              Indicates the volume name for which the statistics apply.
idx             Indicates the volume for which the statistics apply.
rb              Indicates the cumulative number of blocks of data read
                (since the node has been running).
rl              Indicates the cumulative read response time in milliseconds
                for each volume. The cumulative response time for volume
                reads is calculated by starting a timer when a SCSI read
                command is received and stopping it when the command
                completes successfully. The elapsed time is added to the
                cumulative counter.
rlw             Indicates the worst read response time in microseconds for
                each volume since the last time statistics were collected.
                This value is reset to zero after each statistics
                collection sample.
ro              Indicates the cumulative number of volume read operations
                processed (since the node has been running).
wb              Indicates the cumulative number of blocks of data written
                (since the node has been running).
wl              Indicates the cumulative write response time in
                milliseconds for each volume. The cumulative response time
                for volume writes is calculated by starting a timer when a
                SCSI write command is received and stopping it when the
                command completes successfully. The elapsed time is added
                to the cumulative counter.
wlw             Indicates the worst write response time in microseconds for
                each volume since the last time statistics were collected.
                This value is reset to zero after each statistics
                collection sample.
wo              Indicates the cumulative number of volume write operations
                processed (since the node has been running).
wou             Indicates the cumulative number of volume write operations
                that are not aligned on a 4K boundary.
xl              Indicates the cumulative read and write data transfer
                response time in milliseconds for each volume since the
                last time the node was reset. When this statistic is viewed
                for multiple volumes and with other statistics, it can
                indicate if the latency is caused by the host, fabric, or
                the SAN Volume Controller.
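When you compare the volume statistics, note that the cumulative response times (rl, wl) are in milliseconds but the worst-case values (rlw, wlw) are in microseconds and are reset after each sample. The following Python sketch is illustrative only. It assumes that two samples for the same volume are available as dictionaries that are keyed by statistic name. The function name volume_read_summary and the sample values are hypothetical.

```python
def volume_read_summary(previous, current):
    """Summarize one collection interval for a volume from two samples."""
    reads = current["ro"] - previous["ro"]
    writes = current["wo"] - previous["wo"]
    return {
        # rl is cumulative and in milliseconds, so use the difference.
        "average_read_ms": (current["rl"] - previous["rl"]) / reads if reads > 0 else None,
        # rlw is per sample and in microseconds, so convert it directly.
        "worst_read_ms": current["rlw"] / 1000.0,
        # Share of writes in the interval that were not aligned on a 4K boundary.
        "unaligned_write_fraction": (
            (current["wou"] - previous["wou"]) / writes if writes > 0 else None
        ),
    }


if __name__ == "__main__":
    earlier = {"ro": 100, "rl": 200, "rlw": 0, "wo": 50, "wou": 5}          # hypothetical
    later = {"ro": 300, "rl": 1000, "rlw": 12500, "wo": 150, "wou": 30}     # hypothetical
    print(volume_read_summary(earlier, later))
```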
Table 62 describes the VDisk information related to Metro Mirror or Global Mirror
relationships that is reported for individual nodes.
Table 62. Statistic collection for volumes that are used in Metro Mirror and Global Mirror
relationships for individual nodes
Statistic
name
Description
gwl
Indicates cumulative secondary write latency in milliseconds. This statistic
accumulates the cumulative secondary write latency for each volume. You
can calculate the amount of time to recover from a failure based on this
statistic and the gws statistic.
gwo
Indicates the total number of overlapping volume writes. An overlapping
write occurs when the logical block address (LBA) range of a write request
collides with another outstanding request to the same LBA range and the
write request is still outstanding to the secondary site.
gwot
Indicates the total number of fixed or unfixed overlapping writes. When all
nodes in all clusters are running SAN Volume Controller version 4.3.1, this
statistic records the total number of write I/O requests received by the
Global Mirror feature on the primary that overlapped. When any nodes in
either cluster are running SAN Volume Controller versions earlier than 4.3.1,
this value does not increment.
gws
Indicates the total number of write requests issued to the secondary site.
Table 63 describes the port information that is reported for individual nodes.
Table 63. Statistic collection for node ports
Statistic
name
Description
bbcz
Indicates the total time in microseconds for which the port had data to send
but was prevented from doing so by a lack of buffer credit from the switch.
cbr
Indicates the bytes received from controllers.
cbt
Indicates the bytes transmitted to disk controllers.
cer
Indicates the commands received from disk controllers.
cet
Indicates the commands initiated to disk controllers.
hbr
Indicates the bytes received from hosts.
hbt
Indicates the bytes transmitted to hosts.
her
Indicates the commands received from hosts.
het
Indicates the commands initiated to hosts.
icrc
Indicates the number of CRCs that are not valid.
id
Indicates the port identifier for the node.
itw
Indicates the number of transmission word counts that are not valid.
lf
Indicates a link failure count.
lnbr
Indicates the bytes received from other nodes in the same cluster.
lnbt
Indicates the bytes transmitted to other nodes in the same cluster.
lner
Indicates the commands received from other nodes in the same cluster.
lnet
Indicates the commands initiated to other nodes in the same cluster.
lsi
Indicates the loss-of-signal count.
lsy
Indicates the loss-of-synchronization count.
pspe
Indicates the primitive sequence-protocol error count.
rmbr
Indicates the bytes received from other nodes in the other clusters.
rmbt
Indicates the bytes transmitted to other nodes in the other clusters.
rmer
Indicates the commands received from other nodes in the other clusters.
rmet
Indicates the commands initiated to other nodes in the other clusters.
wwpn
Indicates the worldwide port name for the node.
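The port counters can be turned into rates by comparing two samples. The
following Python sketch assumes that hbr, hbt, and bbcz accumulate between
samples and that the length of the collection interval is known. The function
name, the dictionary layout, and the sample values are hypothetical.

```python
def port_rates(prev, curr, interval_seconds):
    """Derive per-interval rates from two samples of port counters.

    Assumes the counters are cumulative; bbcz is in microseconds as
    described in Table 63.
    """
    interval_us = interval_seconds * 1_000_000
    return {
        # Bytes per second received from and transmitted to hosts.
        "host_rx_bytes_per_s": (curr["hbr"] - prev["hbr"]) / interval_seconds,
        "host_tx_bytes_per_s": (curr["hbt"] - prev["hbt"]) / interval_seconds,
        # Share of the interval in which the port had data to send but
        # had no buffer credit from the switch.
        "credit_starved_pct": 100.0 * (curr["bbcz"] - prev["bbcz"]) / interval_us,
    }


# Invented sample values, for illustration only.
prev_sample = {"hbr": 0, "hbt": 0, "bbcz": 0}
curr_sample = {"hbr": 600_000_000, "hbt": 300_000_000, "bbcz": 3_000_000}
rates = port_rates(prev_sample, curr_sample, interval_seconds=300)
```

A steadily rising credit-starved percentage on one port, while other ports stay
flat, is the kind of pattern that this calculation is meant to surface.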
Table 64 describes the node information that is reported for each node.
Table 64. Statistic collection for nodes
Statistic
name
Description
cluster_id
Indicates the unique identifier of the cluster.
cluster
Indicates the name of the cluster.
cpu
busy - Indicates the total CPU average core busy milliseconds since the node
was reset. This statistic reports the amount of time the processor has
spent polling while waiting for work versus actually doing work. This
statistic accumulates from zero.
comp - Indicates the total CPU average core busy milliseconds for
compression process cores since the node was reset.
system - Indicates the total CPU average core busy milliseconds since the
node was reset. This statistic reports the amount of time the processor
has spent polling while waiting for work versus actually doing work. This
statistic accumulates from zero. This is the same information as the
information provided with the cpu busy statistic and will eventually replace
the cpu busy statistic.
cpu_core
id - Indicates the CPU core ID.
comp - Indicates the per-core CPU average core busy milliseconds for
compression process cores since the node was reset.
system - Indicates the per-core CPU average core busy milliseconds for
system process cores since the node was reset.
id
Indicates the name of the node.
node_id
Indicates the unique identifier for the node.
rb
Indicates the number of bytes received.
re
Indicates the accumulated receive latency, excluding inbound queue time.
This statistic is the latency that is experienced by the node communication
layer from the time that an I/O is queued to cache until the time that the
cache gives completion for it.
ro
Indicates the number of messages or bulk data received.
rq
Indicates the accumulated receive latency, including inbound queue time.
This statistic is the latency from the time that a command arrives at the node
communication layer to the time that the cache completes the command.
wb
Indicates the bytes sent.
we
Indicates the accumulated send latency, excluding outbound queue time. This
statistic is the time from when the node communication layer issues a
message out onto the Fibre Channel until the node communication layer
receives notification that the message arrived.
wo
Indicates the number of messages or bulk data sent.
wq
Indicates the accumulated send latency, including outbound queue time. This
statistic includes the entire time that data is sent. This time includes the time
from when the node communication layer receives a message and waits for
resources, the time to send the message to the remote node, and the time
taken for the remote node to respond.
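The cpu busy and cpu system statistics are cumulative millisecond counters, so a
utilization figure for one interval can be estimated by dividing the change in
the counter by the length of the interval. The following Python sketch shows
that estimate. The function name and the sample values are hypothetical, and the
interval length is assumed to be known from the collection settings.

```python
def cpu_utilization_pct(prev_busy_ms, curr_busy_ms, interval_seconds):
    """Estimate average core utilization (%) over one collection interval."""
    interval_ms = interval_seconds * 1000
    busy_ms = curr_busy_ms - prev_busy_ms
    if busy_ms < 0:
        # The counter went backwards, which suggests that the node was reset.
        return None
    return 100.0 * busy_ms / interval_ms


# Invented sample values: 45,000 busy ms during a 300-second interval.
utilization = cpu_utilization_pct(120_000, 165_000, interval_seconds=300)
```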
Table 65 on page 124 describes the statistics collection for volumes.
Table 65. Cache statistics collection for volumes and volume copies
Statistic
Acronym
Statistics for volume cache
Statistics for volume copy cache
Statistics for volume cache partition
Statistics for volume copy cache partition
Statistics for the Node Overall Cache
Cache statistics for mdisks
Units and state
read ios
ri
Yes
Yes
ios, cumulative
write ios
wi
Yes
Yes
ios, cumulative
read misses
r
Yes
Yes
sectors, cumulative
read hits
rh
Yes
Yes
sectors, cumulative
flush_through writes
ft
Yes
Yes
sectors, cumulative
fast_write writes
fw
Yes
Yes
sectors, cumulative
write_through writes
wt
Yes
Yes
sectors, cumulative
write hits
wh
Yes
Yes
sectors, cumulative
prefetches
p
Yes
sectors, cumulative
prefetch hits (prefetch data that is read)
ph
Yes
sectors, cumulative
prefetch misses (prefetch pages that are discarded without any sectors read)
pm
Yes
pages, cumulative
modified data
m
Yes
Yes
sectors,
snapshot,
noncumulative
read and write
cache data
v
Yes
Yes
sectors
snapshot,
noncumulative
destages
d
Yes
Yes
sectors,
cumulative
fullness
Average
fav
Yes
Yes
%,
noncumulative
fullness Max
fmx
Yes
Yes
%,
noncumulative
fullness Min
fmn
Yes
Yes
%,
noncumulative
Destage Target
Average
dtav
Yes
Destage Target
Max
dtmx
Yes
Yes
IOs capped
9999,
noncumulative
IOs,
noncumulative
Destage Target
Min
dtmn
Yes
Destage In
Flight
Average
dfav
Yes
Destage In
Flight Max
dfmx
Yes
IOs,
noncumulative
Destage In
Flight Min
dfmn
Yes
IOs,
noncumulative
destage latency
average
dav
Yes
Yes
IOs,
noncumulative
Yes
Yes
IOs capped
9999,
noncumulative
Yes
Yes
Yes
µs capped
9999999,
noncumulative
destage latency
max
dmx
Yes
Yes
Yes
µs capped
9999999,
noncumulative
destage latency
min
dmn
Yes
Yes
Yes
µs capped
9999999,
noncumulative
Yes
Yes
Yes
ios,
noncumulative
Yes
µs capped
9999999,
noncumulative
destage count
dcn
Yes
Yes
stage latency
average
sav
Yes
Yes
stage latency
max
smx
Yes
µs capped
9999999,
noncumulative
stage latency
min
smn
Yes
µs capped
9999999,
noncumulative
stage count
scn
Yes
Yes
ios,
noncumulative
prestage
latency
average
pav
Yes
Yes
µs capped
9999999,
noncumulative
prestage
latency max
pmx
Yes
µs capped
9999999,
noncumulative
prestage
latency min
pmn
Yes
µs capped
9999999,
noncumulative
Yes
prestage
count
pcn
Yes
ios,
noncumulative
Write Cache
Fullness
Average
wfav
Yes
%,
noncumulative
Write Cache
Fullness Max
wfmx
Yes
%,
noncumulative
Write Cache
Fullness Min
wfmn
Yes
%,
noncumulative
Read Cache
Fullness
Average
rfav
Yes
%,
noncumulative
Read Cache
Fullness Max
rfmx
Yes
%,
noncumulative
Read Cache
Fullness Min
rfmn
Yes
%,
noncumulative
Pinned
Percent
pp
Yes
Yes
Yes
% of total
cache
snapshot,
noncumulative
data transfer
latency
average
tav
Yes
Yes
µs capped
9999999,
noncumulative
Track Lock
Latency
(Exclusive)
Average
teav
Yes
Yes
µs capped
9999999,
noncumulative
Track Lock
Latency
(Shared)
Average
tsav
Yes
Yes
µs capped
9999999,
noncumulative
Cache I/O
Control Block
Queue Time
hpt
Yes
Average µs,
noncumulative
Cache Track
Control Block
Queue Time
ppt
Yes
Average µs,
noncumulative
Owner Remote
Credit Queue
Time
opt
Yes
Average µs,
noncumulative
Non-Owner
Remote Credit
Queue Time
npt
Yes
Average µs,
noncumulative
Admin Remote
Credit Queue
Time
apt
Yes
Average µs,
noncumulative
Yes
Yes
Yes
Cdcb Queue
Time
cpt
Yes
Average µs,
noncumulative
Buffer Queue
Time
bpt
Yes
Average µs,
noncumulative
Hardening
Rights Queue
Time
hrpt
Yes
Average µs,
noncumulative
Note: Any statistic whose name ends in av, mx, mn, or cn is not cumulative. These
statistics reset every statistics interval. For example, if the name of a statistic
does not end in av, mx, mn, or cn, and the statistic is an IOs or count value, the
field contains a total number.
v The term pages means in units of 4096 bytes per page.
v The term sectors means in units of 512 bytes per sector.
v The term µs means microseconds.
v Non-cumulative means totals since the previous statistics collection interval.
v Snapshot means the value at the end of the statistics interval (rather than an
average across the interval or a peak within the interval).
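The unit definitions above translate directly into byte counts. The following
Python sketch applies them. The helper names are hypothetical, and the suffix
check reflects one reading of the note about av, mx, mn, and cn (that these
strings appear at the end of the acronym).

```python
SECTOR_BYTES = 512   # one sector, as defined in the note above
PAGE_BYTES = 4096    # one page, as defined in the note above

INTERVAL_SUFFIXES = ("av", "mx", "mn", "cn")


def sectors_to_bytes(sectors):
    return sectors * SECTOR_BYTES


def pages_to_bytes(pages):
    return pages * PAGE_BYTES


def resets_every_interval(acronym):
    """True if the acronym ends in av, mx, mn, or cn (assumed naming rule)."""
    return acronym.endswith(INTERVAL_SUFFIXES)


eight_sectors = sectors_to_bytes(8)   # the same number of bytes as one page
two_pages = pages_to_bytes(2)
```

Statistics such as m, v, and pp are snapshot values according to the table, so
the suffix check alone does not identify every noncumulative statistic.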
Table 66 describes the statistic collection for volume cache per individual nodes.
Table 66. Statistic collection for volume cache per individual nodes. This table describes the
volume cache information that is reported for individual nodes.
Statistic
name
Description
cm
Indicates the number of sectors of modified or dirty data that are held in the
cache.
ctd
Indicates the total number of cache-initiated destage writes that were
submitted to other components as a result of a volume cache flush or destage
operation.
ctds
Indicates the total number of sectors that are written for cache-initiated track
writes.
ctp
Indicates the number of track stages that are initiated by the cache that are
prestage reads.
ctps
Indicates the total number of staged sectors that are initiated by the cache.
ctrh
Indicates the number of total track read-cache hits on prestage or
non-prestage data. For example, a single read that spans two tracks where
only one of the tracks obtained a total cache hit, is counted as one track
read-cache hit.
ctrhp
Indicates the number of track reads received from other components, treated
as cache hits on any prestaged data. For example, if a single read spans two
tracks where only one of the tracks obtained a total cache hit on prestaged
data, it is counted as one track read for the prestaged data. A cache hit that
obtains a partial hit on prestage and non-prestage data still contributes to this
value.
ctrhps
Indicates the total number of sectors that are read for reads received from
other components that obtained cache hits on any prestaged data.
ctrhs
Indicates the total number of sectors that are read for reads received from
other components that obtained total cache hits on prestage or non-prestage
data.
ctr
Indicates the total number of track reads received. For example, if a single
read spans two tracks, it is counted as two total track reads.
ctrs
Indicates the total number of sectors that are read for reads received.
ctwft
Indicates the number of track writes received from other components and
processed in flush through write mode.
ctwfts
Indicates the total number of sectors that are written for writes that are
received from other components and processed in flush through write mode.
ctwfw
Indicates the number of track writes received from other components and
processed in fast-write mode.
ctwfwsh
Indicates the number of track writes in fast-write mode that were written in
write-through mode because of the lack of memory.
ctwfwshs
Indicates the total number of sectors for track writes in fast-write mode that
were written in write-through mode because of the lack of memory.
ctwfws
Indicates the total number of sectors that are written for writes that are
received from other components and processed in fast-write mode.
ctwh
Indicates the number of track writes received from other components where
every sector in the track obtained a write hit on already dirty data in the
cache. For a write to count as a total cache hit, the entire track write data
must already be marked in the write cache as dirty.
ctwhs
Indicates the total number of sectors that are received from other components
where every sector in the track obtained a write hit on already dirty data in
the cache.
ctw
Indicates the total number of track writes received. For example, if a single
write spans two tracks, it is counted as two total track writes.
ctws
Indicates the total number of sectors that are written for writes that are
received from components.
ctwwt
Indicates the number of track writes received from other components and
processed in write through write mode.
ctwwts
Indicates the total number of sectors that are written for writes that are
received from other components and processed in write through write mode.
cv
Indicates the number of sectors of read and write cache data that is held in
the cache.
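The track counters in Table 66 lend themselves to simple ratios, such as the
share of track reads that were total cache hits (ctrh divided by ctr). The
following Python sketch shows the calculation with invented values. The helper
name is hypothetical, and for a per-interval figure the inputs would be the
differences between two samples rather than raw counter values.

```python
def hit_ratio(hits, total):
    """Fraction of track operations that were total cache hits."""
    return hits / total if total else 0.0


# Invented counter differences for one interval, for illustration only.
sample = {"ctr": 200, "ctrh": 50, "ctw": 400, "ctwh": 40}

read_hit_ratio = hit_ratio(sample["ctrh"], sample["ctr"])
write_hit_ratio = hit_ratio(sample["ctwh"], sample["ctw"])
```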
Table 67 describes the XML statistics specific to an IP Partnership port.
Table 67. XML statistics for an IP Partnership port
Statistic
name
Description
ipbz
Indicates the average size (in bytes) of data that is being submitted to the IP
partnership driver since the last statistics collection period.
ipre
Indicates the bytes retransmitted to other nodes in other clusters by the IP
partnership driver.
iprt
Indicates the average round-trip time in microseconds for the IP partnership
link since the last statistics collection period.
iprx
Indicates the bytes received from other nodes in other clusters by the IP
partnership driver.
ipsz
Indicates the average size (in bytes) of data that is being transmitted by the IP
partnership driver since the last statistics collection period.
iptx
Indicates the bytes transmitted to other nodes in other clusters by the IP
partnership driver.
Actions
The following actions are available to the user:
OK
Click this button to change statistic collection.
Cancel
Click this button to exit the panel without changing statistic collection.
XML formatting information
The XML has several nested sections, as seen in this raw XML from the volume
(Nv_statistics) statistics. Notice that the names are similar, but because they
are in different sections of the XML, they refer to different parts of the VDisk.
<vdsk idx="0"
ctrs="213694394" ctps="0" ctrhs="2416029" ctrhps="0"
ctds="152474234" ctwfts="9635" ctwwts="0" ctwfws="152468611"
ctwhs="9117" ctws="152478246" ctr="1628296" ctw="3241448"
ctp="0" ctrh="123056" ctrhp="0" ctd="1172772"
ctwft="200" ctwwt="0" ctwfw="3241248" ctwfwsh="0"
ctwfwshs="0" ctwh="538" cm="13768758912876544" cv="13874234719731712"
gwot="0" gwo="0" gws="0" gwl="0"
id="Master_iogrp0_1"
ro="0" wo="0" rb="0" wb="0"
rl="0" wl="0" rlw="0" wlw="0" xl="0">
<!-- Vdisk/Volume statistics -->
<ca r="0" rh="0" d="0" ft="0"
wt="0" fw="0" wh="0" ri="0"
wi="0" dav="0" dcn="0" pav="0" pcn="0" teav="0"
tsav="0" tav="0" pp="0"/>
<cpy idx="0">
<!-- volume copy statistics -->
<ca r="0" p="0" rh="0" ph="0"
d="0" ft="0" wt="0" fw="0"
wh="0" pm="0" ri="0" wi="0"
dav="0" dcn="0" sav="0" scn="0"
pav="0" pcn="0" teav="0" tsav="0"
tav="0" pp="0"/>
</cpy>
</vdsk>
The <cpy idx="0"> element means that the enclosed statistics are in the volume
copy section of the VDisk, whereas the statistics shown under Vdisk/Volume
statistics are outside of the cpy idx section and therefore refer to a
VDisk/volume.
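The distinction between the two ca sections can be made explicit when the XML is
parsed. The following Python sketch uses the standard xml.etree.ElementTree
module on an abbreviated, hypothetical fragment. The attribute values are
invented so that the two sections can be told apart, and the fragment is assumed
to carry no XML namespace, which might not hold for a real statistics file.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical fragment modeled on the example above.
SAMPLE = """
<vdsk idx="0" id="Master_iogrp0_1" ro="0" wo="0">
  <ca r="0" rh="0" ri="5" wi="0"/>
  <cpy idx="0">
    <ca r="0" p="0" rh="0" ph="0" ri="7" wi="0"/>
  </cpy>
</vdsk>
"""


def cache_stats(vdsk):
    """Return (volume cache stats, {copy idx: volume copy cache stats})."""
    # find("ca") matches only direct children, so the ca element that is
    # nested inside cpy is not picked up here.
    volume = dict(vdsk.find("ca").attrib)
    copies = {
        cpy.get("idx"): dict(cpy.find("ca").attrib)
        for cpy in vdsk.findall("cpy")
    }
    return volume, copies


vdsk = ET.fromstring(SAMPLE)
volume_cache, copy_cache = cache_stats(vdsk)
```

Keeping the two dictionaries separate avoids mixing a volume-level ri value with
the ri value of a volume copy, which share the same attribute name.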
Similarly for the volume cache statistics for node and partitions:
<uca><ca dav="18726" dcn="1502531" dmx="749846" dmn="89"
sav="20868" scn="2833391" smx="980941" smn="3"
pav="0" pcn="0" pmx="0" pmn="0"
wfav="0" wfmx="2" wfmn="0"
rfav="0" rfmx="1" rfmn="0"
pp="0"
hpt="0" ppt="0" opt="0" npt="0"
apt="0" cpt="0" bpt="0" hrpt="0"
/><partition id="0"><ca dav="18726" dcn="1502531" dmx="749846" dmn="89"
fav="0" fmx="2" fmn="0"
dfav="0" dfmx="0" dfmn="0"
dtav="0" dtmx="0" dtmn="0"
pp="0"/></partition>
This output describes the volume cache node statistics. The statistics inside
<partition id="0"> are for partition 0.
Replacing <uca> with <lca> means that the statistics are for volume copy cache
partition 0.
Event reporting
Events that are detected are saved in an event log. As soon as an entry is made in
this event log, the condition is analyzed. If any service activity is required, a
notification is sent, if you have set up notifications.
Event reporting process
The following methods are used to notify you and the IBM Support Center of a
new event:
v The most serious system error code is displayed on the front panel of each node
in the system.
v If you enabled Simple Network Management Protocol (SNMP), an SNMP trap is
sent to an SNMP manager that is configured by the customer.
The SNMP manager might be IBM Systems Director, if it is installed, or another
SNMP manager.
v If enabled, log messages can be forwarded on an IP network by using the syslog
protocol.
v If enabled, event notifications can be forwarded by email by using Simple Mail
Transfer Protocol (SMTP).
v Call Home can be enabled so that critical faults generate a problem management
record (PMR) that is then sent directly to the appropriate IBM Support Center by
using email.
Power-on self-test
When you turn on the SAN Volume Controller, the system board performs
self-tests. During the initial tests, the hardware boot symbol is displayed.
All models perform a series of tests to check the operation of components and
some of the options that have been installed when the units are first turned on.
This series of tests is called the power-on self-test (POST).
Note: Some SAN Volume Controller nodes do not have a front panel display. The
node status LED is off until booting finishes and the SAN Volume Controller
software is loaded.
If a critical failure is detected during the POST, the software is not loaded and the
system error LED on the operator information panel is illuminated. If this failure
occurs, use “MAP 5000: Start” on page 275 to help isolate the cause of the failure.
When the software is loaded, additional testing takes place, which ensures that all
of the required hardware and software components are installed and functioning
correctly. During the additional testing, the word Booting is displayed on the front
panel along with a boot progress code and a progress bar. If a test failure occurs,
the word Failed is displayed on the front panel.
The service controller performs internal checks and is vital to the operation of the
SAN Volume Controller. If the error (check) LED is illuminated on the service
controller front panel, the front-panel display might not be functioning correctly
and you can ignore any message displayed.
The uninterruptible power supply also performs internal tests. If the
uninterruptible power supply reports a failure condition, the SAN Volume
Controller displays critical failure information on the front-panel display or
sends noncritical failure information to the event log. If the SAN Volume
Controller cannot communicate with the uninterruptible power supply, it displays
a boot failure error message on the front-panel display. Further problem
determination information might also be displayed on the front panel of the
uninterruptible power supply.
Understanding events
When a significant change in status is detected, an event is logged in the event log.
Error data
Events are classified as either alerts or messages:
v An alert is logged when the event requires some action. Some alerts have an
associated error code that defines the service action that is required. The service
actions are automated through the fix procedures. If the alert does not have an
error code, the alert represents an unexpected change in state. This situation
must be investigated to see if it is expected or represents a failure. Investigate an
alert and resolve it as soon as it is reported.
v A message is logged when a change that is expected is reported, for instance, an
IBM FlashCopy operation completes.
Managing the event log
The event log has a limited size. After it is full, newer entries replace entries that
are no longer required.
To avoid having a repeated event fill the event log, some records in the event
log refer to multiple occurrences of the same event. When event log entries are
coalesced in this way, the time stamps of the first occurrence and the last
occurrence of the problem are saved in the log entry. A count of the number of
times that the error condition has occurred is also saved in the log entry.
Other data refers to the last occurrence of the event.
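The coalescing behavior can be pictured with a small model. The following Python
sketch is an illustration only and is not the product's implementation. The
class, the function, and the choice of event ID plus object ID as the matching
key are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    event_id: str
    object_id: str
    first_time: float   # time stamp of the first occurrence
    last_time: float    # time stamp of the last occurrence
    count: int = 1      # number of coalesced occurrences
    data: str = ""      # other data, taken from the last occurrence


def record_event(log, event_id, object_id, timestamp, data=""):
    """Add an event to the log, coalescing repeats of the same event."""
    key = (event_id, object_id)   # assumed matching rule
    entry = log.get(key)
    if entry is None:
        entry = LogEntry(event_id, object_id, timestamp, timestamp, 1, data)
        log[key] = entry
    else:
        entry.last_time = timestamp
        entry.count += 1
        entry.data = data
    return entry


event_log = {}
record_event(event_log, "E1", "mdisk3", 100.0, "first")
repeated = record_event(event_log, "E1", "mdisk3", 160.0, "second")
```

After the two calls, the log holds a single record whose count is 2, whose first
and last time stamps differ, and whose other data comes from the second call.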
Viewing the event log
You can view the event log by using the management GUI or the command-line
interface (CLI).
About this task
You can view the event log by using the Monitoring > Events options in the
management GUI. The event log contains many entries. You can, however, select
only the type of information that you need.
You can also view the event log by using the command-line interface (lseventlog).
See the “Command-line interface” topic for the command details.
Describing the fields in the event log
The event log includes fields with information that you can use to diagnose
problems.
Table 68 describes some of the fields that are available to assist you in diagnosing
problems.
Table 68. Description of data fields for the event log
Data field
Description
Event ID
This number precisely identifies why the event was logged.
Description
A short description of the event.
Status
Indicates whether the event requires some attention.
Alert: if a red icon with a cross is shown, follow the fix procedure or
service action to resolve the event and turn the status green.
Monitoring: the event is not yet of concern.
Expired: the event no longer represents a concern.
Message: the event provides useful information about system activity.
Error code
Indicates that the event represents an error in the system that can be
fixed by following the fix procedure or service action that is identified
by the error code. Not all events have an error code. Different events
have the same error code if the same service action is required for
each.
Sequence number
Identifies the event within the system.
Event count
The number of events that are coalesced into this event log record.
Object type
The object type to which the event relates.
Object ID
Uniquely identifies the object within the system to which the event
relates.
Object name
The name of the object in the system to which the event relates.
Copy ID
If the object is a volume and the event refers to a specific copy of the
volume, this field is the number of the copy to which the event relates.
Reporting node ID
Typically identifies the node responsible for the object to which the
event relates. For events that relate to nodes, it identifies the node that
logged the event, which can be different from the node that is
identified by the object ID.
Reporting node
name
Typically identifies the node that contains the object to which the
event relates. For events that relate to nodes, it identifies the node that
logged the event, which can be different from the node that is
identified by the object name.
Fixed
Where an alert is shown for an error or warning condition, it indicates
that the user marked the event as fixed, completed the fix procedure,
or that the condition was resolved automatically. For a message event,
this field can be used to acknowledge the message.
First time stamp
The time when this error event was reported. If events of a similar
type are being coalesced together, so that one event log record
represents more than one event, this field is the time the first error
event was logged.
Last time stamp
The time when the last instance of this error event was recorded into
this event log record.
Root sequence
number
If set, it is the sequence number of an event that represents an error
that probably caused this event to be reported. Resolve the root event
first.
Sense data
Additional data that gives the details of the condition that caused the
event to be logged.
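The guidance to resolve the root event first can be applied when a list of events
is ordered for attention. The following Python sketch assumes that each event is
available as a dictionary with hypothetical keys sequence_number and
root_sequence_number. It is one possible ordering, not a procedure that the
product defines.

```python
def resolution_order(events):
    """Order events so that events named as a root by others come first."""
    roots = {
        e["root_sequence_number"]
        for e in events
        if e.get("root_sequence_number") is not None
    }
    # False sorts before True, so root events lead; ties keep sequence order.
    return sorted(
        events,
        key=lambda e: (e["sequence_number"] not in roots, e["sequence_number"]),
    )


# Invented events: event 10 names event 7 as its probable cause.
events = [
    {"sequence_number": 10, "root_sequence_number": 7},
    {"sequence_number": 12, "root_sequence_number": None},
    {"sequence_number": 7, "root_sequence_number": None},
]
ordered = [e["sequence_number"] for e in resolution_order(events)]
```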
Event notifications
The system can use Simple Network Management Protocol (SNMP) traps, syslog
messages, and Call Home emails to notify you and the support center when
significant events are detected. Any combination of these notification methods can
be used simultaneously. Notifications are normally sent immediately after an event
is raised. However, some events might occur because of active service actions.
If a recommended service action is active, notifications for these events are
sent only if the events are still unfixed when the service action completes.
Each event that the system detects is assigned a notification type of Error, Warning,
or Information. When you configure notifications, you specify where the
notifications should be sent and which notification types are sent to that recipient.
Table 69 describes the levels of event notifications.
Table 69. Notification levels
Notification level
Description
Error
Error notification is sent to indicate a problem that must be
corrected as soon as possible.
This notification indicates a serious problem with the system. For
example, the event that is being reported could indicate a loss of
redundancy in the system, and it is possible that another failure
could result in loss of access to data. The most typical reason that
this type of notification is sent is because of a hardware failure, but
some configuration errors or fabric errors also are included in this
notification level. Error notifications can be configured to be sent as
a call home message to your support center.
Warning
A warning notification is sent to indicate a problem or unexpected
condition with the system. Always immediately investigate this
type of notification to determine the effect that it might have on
your operation, and make any necessary corrections.
A warning notification does not require any replacement parts and
therefore should not require involvement from your support center.
The allocation of notification type Warning does not imply that the
event is less serious than one that has notification level Error.
Information
An informational notification is sent to indicate that an expected
event has occurred: for example, a FlashCopy operation has
completed. No remedial action is required when these notifications
are sent.
Events with notification type Error or Warning are shown as alerts in the event log.
Events with notification type Information are shown as messages.
SNMP traps
Simple Network Management Protocol (SNMP) is a standard protocol for
managing networks and exchanging messages. The system can send SNMP
messages that notify personnel about an event. You can use an SNMP manager to
view the SNMP messages that the system sends. You can use the management GUI
or the command-line interface to configure and modify your SNMP settings. You
can specify up to six SNMP servers.
You can use the Management Information Base (MIB) file for SNMP to configure a
network management program to receive SNMP messages that are sent by the
system. This file can be used with SNMP messages from all versions of the
software. More information about the MIB file for SNMP is available at this
website:
www.ibm.com/storage/support/2145
Search for , then search for MIB. Go to the downloads results to find Management
Information Base (MIB) file for SNMP. Click this link to find download options.
Syslog messages
The syslog protocol is a standard protocol for forwarding log messages from a
sender to a receiver on an IP network. The IP network can be either IPv4 or IPv6.
The system can send syslog messages that notify personnel about an event. The
system can transmit syslog messages in either expanded or concise format. You can
use a syslog manager to view the syslog messages that the system sends. The
system uses the User Datagram Protocol (UDP) to transmit the syslog message.
You can specify up to six syslog servers. You can use the
management GUI or the command-line interface to configure and modify your
syslog settings.
Table 70 on page 135 shows how SAN Volume Controller notification codes map to
syslog severity-level codes.
Table 70. SAN Volume Controller notification types and corresponding syslog level codes
SAN Volume Controller notification type | Syslog level code | Description
ERROR | LOG_ALERT | Fault that might require hardware replacement that needs immediate attention.
WARNING | LOG_ERROR | Fault that needs immediate attention. Hardware replacement is not expected.
INFORMATIONAL | LOG_INFO | Information message used, for example, when a configuration change takes place or an operation completes.
TEST | LOG_DEBUG | Test message
Table 71 shows how SAN Volume Controller values of user-defined message origin
identifiers map to syslog facility codes.
Table 71. SAN Volume Controller values of user-defined message origin identifiers and
syslog facility codes
SAN Volume Controller value | Syslog value | Syslog facility code | Message format
0 | 16 | LOG_LOCAL0 | Full
1 | 17 | LOG_LOCAL1 | Full
2 | 18 | LOG_LOCAL2 | Full
3 | 19 | LOG_LOCAL3 | Full
4 | 20 | LOG_LOCAL4 | Concise
5 | 21 | LOG_LOCAL5 | Concise
6 | 22 | LOG_LOCAL6 | Concise
7 | 23 | LOG_LOCAL7 | Concise
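The two mappings in Table 70 and Table 71 can be combined to predict the
priority value that a syslog receiver would see. The following Python sketch
uses the standard syslog convention that the priority is the facility number
multiplied by 8 plus the severity number. The numeric severities (alert 1,
error 3, informational 6, debug 7) come from that convention rather than from
this guide, and the function names are hypothetical.

```python
# Standard syslog severity numbers for the level codes in Table 70 (assumed).
SEVERITY = {"ERROR": 1, "WARNING": 3, "INFORMATIONAL": 6, "TEST": 7}


def syslog_priority(origin_value, notification_type):
    """Priority value for a message origin identifier (0-7) and a type."""
    if not 0 <= origin_value <= 7:
        raise ValueError("message origin identifier must be 0 through 7")
    facility = 16 + origin_value          # LOG_LOCAL0 through LOG_LOCAL7
    return facility * 8 + SEVERITY[notification_type]


def message_format(origin_value):
    """Message format that Table 71 associates with the origin identifier."""
    return "Full" if origin_value <= 3 else "Concise"


error_on_local0 = syslog_priority(0, "ERROR")
warning_on_local4 = syslog_priority(4, "WARNING")
```

A receiver that filters on facility can use these values to route full-format
and concise-format messages to different destinations.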
Call Home email
The Call Home feature transmits operational and event-related data to you and
service personnel through a Simple Mail Transfer Protocol (SMTP) server
connection in the form of an event notification email. When configured, this
function alerts service personnel about hardware failures and potentially serious
configuration or environmental issues.
To send email, you must configure at least one SMTP server. You can specify as
many as five additional SMTP servers for backup purposes. The SMTP server must
accept the relaying of email from the management IP address. You can then use the
management GUI or the command-line interface to configure the email settings,
including contact information and email recipients. Set the reply address to a valid
email address. Send a test email to check that all connections and infrastructure are
set up correctly. You can disable the Call Home function at any time using the
management GUI or the command-line interface.
Data that is sent with notifications
Notifications can be sent using email, SNMP, or syslog. The data sent for each type
of notification is the same. It includes:
v Record type
v Machine type
v Machine serial number
v Error ID
v Error code
v Software version
v FRU part number
v Cluster (system) name
v Node ID
v Error sequence number
v Time stamp
v Object type
v Object ID
v Problem data
Emails contain the following additional information that allows the Support Center
to contact you:
v Contact names for first and second contacts
v Contact phone numbers for first and second contacts
v Alternate contact numbers for first and second contacts
v Offshift phone number
v Contact email address
v Machine location
To send data and notifications to service personnel, use one of the following email
addresses:
v For systems that are located in North America, Latin America, South America or
the Caribbean Islands, use callhome1@de.ibm.com
v For systems that are located anywhere else in the world, use
callhome0@de.ibm.com
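The choice between the two addresses can be expressed as a simple lookup. The following Python sketch is a hypothetical helper for a site script, not a product function. The region strings are assumptions that mirror the wording of the list above.

```python
# Regions that the guide assigns to callhome1@de.ibm.com (spelling is assumed).
AMERICAS = {"north america", "latin america", "south america", "caribbean islands"}


def call_home_address(region):
    """Return the Call Home recipient address for the region of the system."""
    if region.strip().lower() in AMERICAS:
        return "callhome1@de.ibm.com"
    # Systems located anywhere else in the world.
    return "callhome0@de.ibm.com"
```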
Inventory information email
An inventory information email summarizes the hardware components and
configuration of a system. Service personnel can use this information to contact
you when relevant software updates are available or when an issue that can affect
your configuration is discovered. It is a good practice to enable inventory
reporting.
Because inventory information is sent using the Call Home email function, you
must meet the Call Home function requirements and enable the Call Home email
function before you can attempt to send inventory information email. You can
adjust the contact information, adjust the frequency of inventory email, or
manually send an inventory email using the management GUI or the
command-line interface.
The inventory that is sent to IBM includes the following information about the
clustered system on which the Call Home function is enabled. Sensitive
information such as IP addresses is not included.
v Licensing information
v Details about the following objects and functions:
  - Drives
  - External storage systems
  - Hosts
  - MDisks
  - Volumes
  - Array types and RAID levels
  - Easy Tier
  - FlashCopy
  - Metro Mirror and Global Mirror
  - HyperSwap
Example email
For detailed information about what is included in the Call Home inventory
information, configure the system to send an inventory email to yourself. Figure 57
shows an example of an email.
# Timestamp = Thu May 21 12:01:06 2015
# Timezone = +0100, BST
# Organization = IBM UK
# Machine Address = Maybrook Hall
# Machine City = Manchester
# Machine State = XX
# Machine Zip = M3 2EG
# Machine Country = GB
# Contact Name = Mike Programmer
# Alternate Contact Name = N/A
.
.(800 lines of more information)
.
Figure 57. Example of inventory information email
Understanding the error codes
Error codes are generated by the event-log analysis and system configuration code.
Error codes help you to identify the cause of a problem, a failing component, and
the service actions that might be needed to solve the problem.
Note: If more than one error occurs during an operation, the highest priority error
code displays on the front panel. The lower the number for the error code, the
higher the priority. For example, error code 1020 has a higher priority than error
code 1370.
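The priority rule in the note can be sketched in a few lines of Python. This is an illustration of the rule only. The function name is hypothetical, and the sketch does not reproduce how the node software selects the code that it displays.

```python
def highest_priority(error_codes):
    """Return the highest priority code, which is the numerically lowest one."""
    if not error_codes:
        raise ValueError("at least one error code is required")
    return min(error_codes)
```

With the codes from the note, highest_priority([1370, 1020]) returns 1020.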
Using the error code tables
The error code tables list the various error codes and describe the actions that you
can take.
About this task
Complete the following steps to use the error code tables:
Procedure
1. Locate the error code in one of the tables. If you cannot find a particular code
in any table, call IBM Support Center for assistance.
2. Read about the action you must complete to correct the problem. Do not
exchange field replaceable units (FRUs) unless you are instructed to do so.
3. Normally, exchange only one FRU at a time, starting from the top of the FRU
list for that error code.
Event IDs
The SAN Volume Controller software generates events, such as informational
events and error events. An event ID or number is associated with the event and
indicates the reason for the event.
Informational events provide information about the status of an operation.
Informational events are recorded in the event log, and, depending on the
configuration, informational event notifications can be sent through email, SNMP,
or syslog.
Error events are generated when a service action is required. An error event maps
to an alert with an associated error code. Depending on the configuration, error
event notifications can be sent through email, SNMP, or syslog.
Informational events
The informational events provide information about the status of an operation.
Informational events are recorded in the event log and, based on notification type,
notifications can be sent through email, SNMP, or syslog.
Informational events can be either notification type I (information) or notification
type W (warning). An informational event report of type (W) might require user
attention. Table 72 provides a list of informational events, the notification type, and
the reason for the event.
Table 72. Informational events
Event ID
Notification
type
Description
070570
I
Battery protection unavailable.
070571
I
Battery protection temporarily unavailable; one
battery is expected to be available soon.
070572
I
Battery protection temporarily unavailable; both
batteries are expected to be available soon.
070785
I
Battery capacity is reduced because of cell imbalance.
980221
I
The error log is cleared.
980230
I
The SSH key was discarded for the service login user.
980231
I
User name has changed.
980301
I
Degraded or offline managed disk is now online.
980310
I
A degraded or offline storage pool is now online.
980320
I
Offline volume is now online.
980321
W
Volume is offline because of degraded or offline
storage pool.
980330
I
All nodes can see the port.
980349
I
A node has been successfully added to the cluster
(system).
980350
I
The node is now a functional member of the cluster
(system).
980351
I
A noncritical hardware error occurred.
980352
I
Attempt to automatically recover offline node
starting.
980370
I
Both nodes in the I/O group are available.
980371
I
One node in the I/O group is unavailable.
980372
W
Both nodes in the I/O group are unavailable.
980392
I
Cluster (system) recovery completed.
980435
W
Failed to obtain directory listing from remote node.
980440
W
Failed to transfer file from remote node.
980445
I
The migration is complete.
980446
I
The secure delete is complete.
980501
W
The virtualization amount is close to the limit that is
licensed.
980502
W
The FlashCopy feature is close to the limit that is
licensed.
980503
W
The Metro Mirror or Global Mirror feature is close to
the limit that is licensed.
981002
I
Fibre Channel discovery occurred; configuration
changes are pending.
981003
I
Fibre Channel discovery occurred; configuration
changes are complete.
981004
I
Fibre Channel discovery occurred; no configuration
changes were detected.
981007
W
The managed disk is not on the preferred path.
981009
W
The initialization for the managed disk failed.
981014
W
The LUN discovery has failed. The cluster (system)
has a connection to a device through this node but
this node cannot discover the unmanaged or
managed disk that is associated with this LUN.
981015
W
The LUN capacity equals or exceeds the maximum.
Only part of the disk can be accessed.
981020
W
The managed disk error count warning threshold has
been met.
981022
I
Managed disk offline imminent, offline prevention
started
981025
I
Drive firmware download completed successfully
981026
I
Drive FPGA download completed successfully
981027
I
Drive firmware download started
981028
I
Drive FPGA download started
981029
I
Drive firmware download cancelled by user
981101
I
SAS discovery occurred; no configuration changes
were detected.
981102
I
SAS discovery occurred; configuration changes are
pending.
981103
I
SAS discovery occurred; configuration changes are
complete.
981104
W
The LUN capacity equals or exceeds the maximum
capacity. Only the first 1 PB of disk will be accessed.
981105
I
The drive format has started.
981106
I
The drive recovery was started.
982003
W
Insufficient virtual extents.
982004
W
The migration suspended because of insufficient
virtual extents or too many media errors on the
source managed disk.
982007
W
Migration has stopped.
982009
I
Migration is complete.
982010
W
Copied disk I/O medium error.
983001
I
The FlashCopy operation is prepared.
983002
I
The FlashCopy operation is complete.
983003
W
The FlashCopy operation has stopped.
984001
W
First customer data being pinned in a volume
working set.
984002
I
All customer data in a volume working set is now
unpinned.
984003
W
The volume working set cache mode is in the process
of changing to synchronous destage because the
volume working set has too much pinned data.
984004
I
Volume working set cache mode updated to allow
asynchronous destage because enough customer data
has been unpinned for the volume working set.
984506
I
The debug from an IERR was extracted to disk.
984507
I
An attempt was made to power on the slots.
984508
I
All the expanders on the strand were reset.
984509
I
The component firmware update paused to allow the
battery charging to finish.
984511
I
The update for the component firmware paused
because the system was put into maintenance mode.
984512
I
A component firmware update is needed but is
prevented from running.
985001
I
The Metro Mirror or Global Mirror background copy
is complete.
985002
I
The Metro Mirror or Global Mirror is ready to restart.
985003
W
Unable to find path to disk in the remote cluster
(system) within the timeout period.
986001
W
The thin-provisioned volume copy data in a node is
pinned.
986002
I
All thin-provisioned volume copy data in a node is
unpinned.
986010
I
The thin-provisioned volume copy import has failed
and the new volume is offline; either update the SAN
Volume Controller software to the required version or
delete the volume.
986011
I
The thin-provisioned volume copy import is
successful.
986020
W
A thin-provisioned volume copy space warning has
occurred.
986030
I
A thin-provisioned volume copy repair has started.
986031
I
A thin-provisioned volume copy repair is successful.
986032
I
A thin-provisioned volume copy validation is started.
986033
I
A thin-provisioned volume copy validation is
successful.
986034
I
The import of the compressed-virtual volume copy
was successful.
986035
W
A compressed-virtual volume copy space warning
has occurred.
986036
I
A compressed-virtual volume copy repair has started.
986037
I
A compressed-virtual volume copy repair is
successful.
986038
I
A compressed-virtual volume copy has too many bad
blocks.
986201
I
A medium error has been repaired for the mirrored
copy.
986203
W
A mirror copy repair, using the validate option,
cannot complete.
986204
I
A mirror disk repair is complete and no differences
are found.
986205
I
A mirror disk repair is complete and the differences
are resolved.
986206
W
A mirror disk repair is complete and the differences
are marked as medium errors.
986207
I
The mirror disk repair has been started.
986208
W
A mirror copy repair, using the set medium error
option, cannot complete.
986209
W
A mirror copy repair, using the resync option, cannot
complete.
987102
W
Node coldstarted.
987103
W
A node power-off has been requested from the power
switch.
987104
I
Additional Fibre Channel ports were connected.
987301
W
The connection to a configured remote cluster
(system) has been lost.
987400
W
The node unexpectedly lost power but has now been
restored to the cluster (system).
988022
I
The rebuild for an array MDisk was started.
Performance might be affected; wait for the rebuild to
complete.
988023
I
The rebuild for an array MDisk has finished.
988028
I
Array validation started.
988029
I
Array validation complete.
988100
W
An overnight maintenance procedure has failed to
complete. Resolve any hardware and configuration
problems that you are experiencing on the cluster
(system). If the problem persists, contact your IBM
service representative for assistance.
988300
W
An array MDisk is offline because it has too many
missing members.
988304
I
A RAID array has started exchanging an array
member.
988305
I
A RAID array has completed exchanging an array
member.
988306
I
A RAID array needs resynchronization.
988307
I
A failed drive has been re-seated or replaced. The
system has automatically configured the device.
989001
W
A storage pool space warning has occurred.
SCSI event reporting
Nodes can notify their hosts of events for SCSI commands that are issued.
SCSI status
Some events are part of the SCSI architecture and are handled by the host
application or device drivers without reporting an event. Some events, such as
read and write I/O events and events that are associated with the loss of nodes or
loss of access to backend devices, cause application I/O to fail. To help
troubleshoot these events, SCSI commands are returned with the Check Condition
status and a 32-bit event identifier is included with the sense information. The
identifier relates to a specific event in the event log.
If the host application or device driver captures and stores this information, you
can relate the application failure to the event log.
Table 73 describes the SCSI status and codes that are returned by the nodes.
Table 73. SCSI status

Status                        Code  Description
Good                          00h   The command was successful.
Check condition               02h   The command failed and sense data is available.
Condition met                 04h   N/A
Busy                          08h   An Auto-Contingent Allegiance condition exists
                                    and the command specified NACA=0.
Intermediate                  10h   N/A
Intermediate - condition met  14h   N/A
Reservation conflict          18h   Returned as specified in SPC2 and SAM-2 where
                                    a reserve or persistent reserve condition exists.
Task set full                 28h   The initiator has at least one task queued for
                                    that LUN on this port.
ACA active                    30h   This code is reported as specified in SAM-2.
Task aborted                  40h   This code is returned if TAS is set in the
                                    control mode page 0Ch. The node has a default
                                    setting of TAS=0, which cannot be changed;
                                    therefore, the node does not report this status.
SCSI Sense
Nodes notify the hosts of events on SCSI commands. Table 74 defines the SCSI
sense keys, codes and qualifiers that are returned by the nodes.
Table 74. SCSI sense keys, codes, and qualifiers

Key  Code  Qualifier  Definition and description
2h   04h   01h        Definition: Not Ready. The logical unit is in the process of
                      becoming ready.
                      Description: The node lost sight of the system and cannot
                      perform I/O operations. The additional sense does not have
                      additional information.
2h   04h   0Ch        Definition: Not Ready. The target port is in the state of
                      unavailable.
                      Description: The following conditions are possible:
                      v The node lost sight of the system and cannot perform I/O
                        operations. The additional sense does not have additional
                        information.
                      v The node is in contact with the system but cannot perform
                        I/O operations to the specified logical unit because of
                        either a loss of connectivity to the backend controller or
                        some algorithmic problem. This sense is returned for
                        offline volumes.
3h   00h   00h        Definition: Medium event
                      Description: This is only returned for read or write I/Os.
                      The I/O suffered an event at a specific LBA within its scope.
                      The location of the event is reported within the sense data.
                      The additional sense also includes a reason code that relates
                      the event to the corresponding event log entry. For example,
                      a RAID controller event or a migrated medium event.
4h   08h   00h        Definition: Hardware event. A command to logical unit
                      communication failure has occurred.
                      Description: The I/O suffered an event that is associated
                      with an I/O event that is returned by a RAID controller. The
                      additional sense includes a reason code that points to the
                      sense data that is returned by the controller. This is only
                      returned for I/O type commands. This event is also returned
                      from FlashCopy target volumes in the prepared and preparing
                      state.
5h   25h   00h        Definition: Illegal request. The logical unit is not
                      supported.
                      Description: The logical unit does not exist or is not mapped
                      to the sender of the command.
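If a host captures the raw sense data, the key, code, and qualifier values in Table 74 can be extracted and looked up. The following Python sketch is illustrative only. It assumes that the host stored fixed-format sense data, in which the sense key is the low four bits of byte 2, the additional sense code is byte 12, and the qualifier is byte 13 (zero-based offsets). That layout comes from the SCSI standards, not from this guide, and the function name is hypothetical.

```python
# (key, code, qualifier) -> definition, copied from Table 74.
SENSE_TABLE = {
    (0x2, 0x04, 0x01): "Not Ready. The logical unit is in the process of becoming ready.",
    (0x2, 0x04, 0x0C): "Not Ready. The target port is in the state of unavailable.",
    (0x3, 0x00, 0x00): "Medium event",
    (0x4, 0x08, 0x00): "Hardware event. A command to logical unit "
                       "communication failure has occurred.",
    (0x5, 0x25, 0x00): "Illegal request. The logical unit is not supported.",
}


def decode_sense(sense):
    """Return (key, code, qualifier, definition) from fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("need at least 14 bytes of fixed-format sense data")
    key = sense[2] & 0x0F          # low nibble of byte 2 (assumed layout)
    code, qualifier = sense[12], sense[13]
    definition = SENSE_TABLE.get((key, code, qualifier), "Not listed in Table 74")
    return key, code, qualifier, definition
```

Combinations that the table does not list are reported as such and are not guessed.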
Reason codes
The reason code appears in bytes 20-23 of the sense data. The reason code provides
the node with a specific log entry. The field is a 32-bit unsigned number that is
presented with the most significant byte first. Table 75 lists the reason codes and
their definitions.
If the reason code is not listed in Table 75, the code refers to a specific event in the
event log that corresponds to the sequence number of the relevant event log entry.
Table 75. Reason codes

Reason code
(decimal)    Description
40           The resource is part of a stopped FlashCopy mapping.
50           The resource is part of a Metro Mirror or Global Mirror relationship
             and the secondary LUN is offline.
51           The resource is part of a Metro Mirror or Global Mirror relationship
             and the secondary LUN is read only.
60           The node is offline.
71           The resource is not bound to any domain.
72           The resource is bound to a domain that has been recreated.
73           Running on a node that has been contracted out for some reason
             that is not attributable to any path going offline.
80           Wait for the repair to complete, or delete the volume.
81           Wait for the validation to complete, or delete the volume.
82           An offline thin-provisioned volume has caused data to be pinned in
             the directory cache. Adequate performance cannot be achieved for
             other thin-provisioned volumes, so they have been taken offline.
85           The volume has been taken offline because checkpointing to the
             quorum disk failed.
86           The repairvdiskcopy -medium command has created a virtual
             medium error where the copies differed.
93           An offline RAID-5 or RAID-6 array has caused in-flight-write data
             to be pinned. Good performance cannot be achieved for other
             arrays and so they have been taken offline.
94           An array MDisk that is part of the volume has been taken offline
             because checkpointing to the quorum disk failed.
95           This reason code is used in MDisk bad block dump files to indicate
             the data loss was caused by having to resync parity with rebuilding
             strips or some other RAID algorithm reason due to multiple
             failures.
96           A RAID-6 array MDisk that is part of the volume has been taken
             offline because an internal metadata table is full.
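The reason code can be read from captured sense data as a big-endian 32-bit unsigned integer. The following Python sketch is illustrative only. It assumes that "bytes 20-23" are zero-based offsets into the sense buffer, it copies only a few entries from Table 75, and its function names are hypothetical.

```python
import struct

# A subset of Table 75, for illustration.
REASON_CODES = {
    40: "The resource is part of a stopped FlashCopy mapping.",
    60: "The node is offline.",
    80: "Wait for the repair to complete, or delete the volume.",
}


def reason_code(sense):
    """Read bytes 20-23 as an unsigned integer, most significant byte first."""
    if len(sense) < 24:
        raise ValueError("sense data is too short to contain a reason code")
    (code,) = struct.unpack(">I", sense[20:24])
    return code


def describe_reason(code):
    """Return the table text, or point to the event log for unlisted values."""
    if code in REASON_CODES:
        return REASON_CODES[code]
    return "Not in table: check event log sequence number %d" % code
```

A value that is not in the table is treated as an event log sequence number, as described before Table 75.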
Object types
You can use the object code to determine the type of the object the event is logged
against.
Table 76 lists the object codes and corresponding object types.
Table 76. Object types

Object code  Object type
1            mdisk
2            mdiskgrp
3            volume
4            node
5            host
7            iogroup
8            fcgrp
9            rcgrp
10           fcmap
11           rcmap
12           wwpn
13           cluster (system)
16           device
17           SCSI lun
18           quorum
34           Fibre Channel adapter
38           volume copy
39           Syslog server
40           SNMP server
41           Email server
42           User group
44           Cluster (management) IP
46           SAS adapter
Error event IDs and error codes
Error codes describe a service procedure that must be followed. Each event ID that
requires service has an associated error code.
Error codes can be either notification type E (error) or notification type W
(warning). Table 77 lists the event IDs and corresponding error codes, the
notification type, and the condition of the event.
Table 77. Error event IDs and error codes
Event
ID
Notification
type
Condition
Error
code
009020
E
A system recovery has run. All configuration commands
are blocked.
1001
009040
E
The error event log is full.
1002
009052
W
The following causes are possible:
v The node is missing.
v The node is no longer a functional member of the
system.
1196
009053
E
A node has been missing for 30 minutes.
1195
009054
E
Node has been shut down.
1707
009100
W
The software install process has failed.
2010
009101
W
Software install package cannot be delivered to all nodes.
2010
009150
W
Unable to connect to the SMTP (email) server.
2600
009151
W
Unable to send mail through the SMTP (email) server.
2601
009170
W
Remote Copy feature capacity is not set.
3030
009171
W
The FlashCopy feature capacity is not set.
3031
009172
W
The Virtualization feature has exceeded the amount that
is licensed.
3032
009173
W
The FlashCopy feature has exceeded the amount that is
licensed.
3032
009174
W
Remote Copy feature license limit exceeded.
3032
009175
W
Thin-provisioned volume usage not licensed.
3033
009176
W
The value set for the virtualization feature capacity is not
valid.
3029
009177
E
A physical disk FlashCopy feature license is required.
3035
009178
E
A physical disk Metro Mirror and Global Mirror feature
license is required.
3036
009179
E
A virtualization feature license is required.
3025
009180
E
Automatic recovery of offline node failed.
1194
009181
W
Unable to send email to any of the configured email
servers.
3081
009182
W
The external virtualization feature license limit was
exceeded.
3032
009183
W
Unable to connect to LDAP server.
2251
009184
W
The LDAP configuration is not valid.
2250
009185
E
The limit for the compression feature license was
exceeded.
3032
009186
E
The limit for the compression feature license was
exceeded.
3032
009187
E
Unable to connect to LDAP server that has been
automatically configured.
2256
009188
E
Invalid LDAP configuration for automatically configured
server.
2255
009189
W
A licensable feature's trial-timer has reached 0. The
feature has now been deactivated.
3082
009190
W
A trial of a licensable feature will expire in 5 days.
3083
009191
W
A trial of a licensable feature will expire in 10 days.
3084
009192
W
A trial of a licensable feature will expire in 15 days.
3085
009193
W
A trial of a licensable feature will expire in 45 days.
3086
009194
W
Easy Tier feature license limit exceeded.
3032
009195
W
FlashCopy feature license limit exceeded.
3032
009196
W
External virtualization feature license limit exceeded.
3032
009197
W
Remote copy feature license limit exceeded.
3032
009202
W
System SSL certificate will expire within the next 30
days.
3130
009203
W
System SSL certificate has expired.
2258
010002
E
The node ran out of base event sources. As a result, the
node has stopped and exited the system.
2030
010003
W
The number of device logins has reduced.
1630
010006
E
Access beyond end of disk, or Managed Disk missing.
2030
010008
E
The block size is invalid, the capacity or LUN identity
has changed during the managed disk initialization.
1660
010010
E
The managed disk is excluded because of excessive
errors.
1310
010011
E
The remote port is excluded for a managed disk and
node.
1220
010012
E
The local port is excluded.
1210
010013
E
The login is excluded.
1230
010017
E
A timeout has occurred as a result of excessive
processing time.
1340
010018
E
An error recovery procedure has occurred.
1370
010019
E
A managed disk is reporting excessive errors.
1310
010020
E
The managed disk error count threshold has been exceeded.
1310
010021
W
There are too many devices presented to the system.
1200
010022
W
There are too many managed disks presented to the
system.
1200
010023
W
There are too many LUNs presented to a node.
1200
010024
W
There are too many drives presented to a system.
1200
010025
W
A disk I/O medium error has occurred.
1320
010026
W
A suitable MDisk or drive for use as a quorum disk was
not found.
1330
010027
W
The quorum disk is not available.
1335
010028
W
A controller configuration is not supported.
1625
010029
E
A login transport fault has occurred.
1360
010030
E
A managed disk error recovery procedure (ERP) has
occurred. The node or controller reported the following:
v Sense
v Key
v Code
v Qualifier
1370
010031
E
One or more MDisks on a controller are degraded.
1623
010032
W
The controller configuration limits failover.
1625
010033
E
The controller configuration uses the RDAC mode; this is
not supported.
1624
010034
W
Persistent unsupported controller configuration.
1695
010040
E
The controller system device is only connected to the
node through a single initiator port.
1627
010041
E
The controller system device is only connected to the
node through a single target port.
1627
010042
E
The controller system device is only connected to the
nodes through a single target port.
1627
010043
E
The controller system device is only connected to the
nodes through half of the expected target ports.
1627
010044
E
The controller system device has disconnected all target
ports to the nodes.
1627
010055
W
An unrecognized SAS device.
1665
010056
E
SAS error counts exceeded the warning thresholds.
1216
010057
E
SAS errors exceeded critical thresholds.
1216
010066
W
Controller indicates that it does not support descriptor
sense for LUNs that are greater than 2 TBs.
1625
010067
W
Too many enclosures were presented to a system.
1200
010070
W
Too many controller target ports were presented to the
system.
1200
010071
W
Too many target ports were presented to the system from
a single controller.
1200
010098
W
There are too many drives presented to a system.
1200
010100
W
Incorrect connection detected to a port.
1669
010101
E
Too many long IOs to drive.
1215
010102
E
A drive is reported as continuously slow with
contributory factors.
1215
010103
E
Too many long IOs to drive (Mercury drives).
1215
010104
E
A drive is reported as continuously slow with
contributory factors (Mercury drives).
1215
010106
E
Drive reporting too many t10dif errors.
1680
010110
W
Drive firmware download canceled because of system
changes.
3090
010111
W
Drive firmware download canceled because of a drive
download problem.
3090
010118
W
Too many drives attached to the system.
1179
010119
W
Drive data integrity error.
1322
010120
W
A member drive has been forced to turn off protection
information support.
2035
010121
E
Drive exchange required.
1693
010123
W
Performance of external MDisk has changed.
2115
010124
W
iSCSI session excluded.
1230
020001
E
There are too many medium errors on the MDisk.
1610
020002
E
A storage pool is offline.
1620
020003
W
There are insufficient virtual extents.
2030
020007
E
MDisk not found in easy_tier_perf.xml file.
2110
020008
E
Storage optimization services disabled.
3023
029001
E
The MDisk has bad blocks.
1840
029002
W
The system failed to create a bad block because MDisk
already has the maximum number of allowed bad
blocks.
1226
029003
W
The system failed to create a bad block because the
system already has the maximum number of allowed
bad blocks.
1225
030000
W
FlashCopy prepare failed due to cache flush failure.
1900
030010
W
FlashCopy has been stopped due to the error indicated
in the data.
1910
030020
W
Unrecovered FlashCopy mappings.
1895
045103
E
An attempt to automatically configure a reseated or
replaced drive has failed.
1686
050001
W
The Metro Mirror or Global Mirror relationship cannot
be recovered.
1700
050002
W
A Metro Mirror or Global Mirror relationship or
consistency group exists within a system, but its
partnership has been deleted.
3080
050010
W
A Global Mirror relationship has stopped because of a
persistent I/O error.
1920
050011
W
A remote copy has stopped because of a persistent I/O
error.
1915
050020
W
Remote Copy relationship or consistency groups lost
synchronization.
1720
050030
W
There are too many system partnerships. The number of
partnerships has been reduced.
1710
050031
W
There are too many system partnerships. The system has
been excluded.
1710
050040
W
Background copy process for the remote copy was
blocked.
1960
050050
E
The Global Mirror secondary volume is offline. The
relationship has pinned hardened write data for this
volume.
1925
050060
E
The Global Mirror secondary volume is offline due to
missing I/O group partner node. The relationship has
pinned hardened write data for this volume but the node
containing the required data is currently offline.
1730
050070
E
Global Mirror performance is likely to be impacted. A
large amount of pinned data for the offline volumes has
reduced the resource available to the global mirror
secondary disks.
1925
050080
W
HyperSwap volume has lost synchronization between
sites.
1940
050081
W
HyperSwap consistency group has lost synchronization
between sites.
1940
060001
W
A thin-provisioned volume copy is offline because of
insufficient space.
1865
060002
E
A thin-provisioned volume copy is offline because of
corrupt metadata.
1862
060003
E
A thin-provisioned volume copy is offline because of a
failed repair.
1860
060004
W
A compressed volume copy is offline because of
insufficient space.
1865
060005
E
A compressed volume copy is offline because of corrupt
metadata.
1862
060006
E
A compressed volume copy is offline because of a failed
repair.
1860
060007
E
A compressed volume copy has bad blocks.
1850
062001
W
System is unable to mirror medium error.
1950
062002
E
Mirrored volume is offline because it cannot synchronize
data.
1870
062003
W
Repair of a mirrored volume stopped because of
difference.
1600
070000
E
Unrecognized node error.
1083
070510
E
Detected memory size does not match the expected
memory size.
1022
070511
E
DIMMs are incorrectly installed.
1009
070517
E
The WWNN that is stored on the service controller and
the WWNN that is stored on the drive do not match.
1192
070521
E
Unable to detect any Fibre Channel adapter.
1016
070522
E
The system board processor has failed.
1020
070523
E
The internal disk file system of the node is damaged.
1187
070524
E
Unable to update BIOS settings.
1027
070525
E
Unable to update the service processor firmware for the
system board.
1020
070528
E
The ambient temperature is too high while the system is
starting.
1182
070536
E
A system board device breached critical temperature
threshold.
1084
070538
E
A PCI Riser breached critical temperature threshold.
1085
070542
E
A processor has failed.
1024
070543
E
No usable persistent data could be found on the boot
drives.
1035
070544
E
The boot drives do not belong in this node.
1035
070545
E
Boot drive and system board mismatch.
1035
070550
E
Cannot form system due to lack of resources.
1192
070556
E
Duplicate WWNN detected on the SAN.
1192
070558
E
A node is unable to communicate with other nodes.
1192
070560
E
Battery cabling fault.
1108
070561
E
Battery backplane or cabling fault.
1109
070562
E
The node hardware does not meet minimum
requirements.
1183
070564
E
Too many software failures.
1188
070565
E
The internal drive of the node is failing.
1030
070569
E
CPU temperature breached critical threshold.
1093
070574
E
The node software is damaged.
1187
070576
E
The system data cannot be read.
1030
070578
E
The system data was not saved when power was lost.
1194
070579
E
Battery subsystem has insufficient charge to save system
data.
1107
070580
E
Unable to read the service controller ID.
1044
070581
E
2145 UPS-1U has a serial link error.
1181
070582
E
2145 UPS-1U has a battery error.
1181
070583
E
2145 UPS-1U has an electronics error.
1171
070584
E
2145 UPS-1U is overloaded.
1166
070585
E
2145 UPS-1U has failed.
1171
070586
E
Power supply to 2145 UPS-1U does not meet
requirements.
1141
070587
E
Incorrect type of uninterruptible power supply detected.
1152
070588
E
2145 UPS-1U is not cabled correctly.
1151
070589
E
The ambient temperature limit for the 2145 UPS-1U was
exceeded.
1136
070590
E
Repeated node restarts because of 2145 UPS-1U errors.
1186
070670
W
Insufficient uninterruptible power supply charge to allow
node to start.
1193
070690
W
Node held in service state.
1189
070710
E
High-speed SAS adapter is missing. This error applies to
only the SAN Volume Controller 2145-CG8 model.
1120
070720
E
Ethernet adapter is missing. This error applies to only
the SAN Volume Controller 2145-CG8 model.
1072
070736
E
A system board device breached warning temperature
threshold.
1084
070737
E
A power supply breached temperature threshold.
1212
070738
E
A PCI Riser breached warning temperature threshold.
1085
070743
E
Boot drive missing or out of sync or failed.
1213
070744
W
A boot drive is in the wrong location.
1214
070745
W
Boot drive in unsupported slot.
1472
070746
W
Technician port connection is not valid.
746
070747
W
Technician connected.
747
070766
E
CMOS battery has failed.
1670
070775
E
Power supply has a problem.
1097
070776
W
Power supply mains cable is unplugged.
1097
070777
E
Power supply is missing.
1097
070779
E
Battery is missing.
1129
070780
E
Battery has failed.
1130
070781
E
Battery is below the minimum operating temperature.
1476
070782
E
Battery is above the maximum operating temperature.
1475
070783
E
Battery has a communications error.
1109
070784
E
Battery is nearing end of life.
1474
070786
E
Battery VPD has a checksum error.
1130
070787
E
Battery is at a hardware revision level not supported by
the current code level.
1473
070572
E
Battery protection temporarily unavailable; both batteries
are expected to be available soon.
1473
070579
E
Battery subsystem has insufficient charge to save system
data.
1473
070840
W
Detected hardware is not a valid configuration.
1198
070841
W
Detected hardware needs activation.
1199
071747
W
Technician connected.
747
072005
E
CMOS battery has a failure.
1670
072007
E
CMOS battery has a failure.
1670
072008
E
CMOS battery has a failure.
1032
072101
System board has more or fewer processors detected than expected.
1025
072102
System board has more or fewer processors detected than expected.
1025
072103
System board has more or fewer processors detected than expected.
1032
073003
W
The Fibre Channel ports are not operational.
1060
073004
E
Fibre Channel adapter detected PCI bus error.
1012
073005
E
System path has a failure.
1550
073006
W
The SAN is not correctly zoned. As a result, more than
512 ports on the SAN have logged into one SAN Volume
Controller port.
1800
073251
E
More or fewer Fibre Channel adapters than expected are detected.
1011
073252
E
Fibre Channel adapter is faulty.
1055
073258
E
Fibre Channel adapter has detected PCI bus error.
1013
073261
E
More or fewer Fibre Channel adapters than expected are detected.
1011
073262
E
Fibre Channel adapter is faulty.
1055
073268
E
Fibre Channel adapter has detected PCI bus error.
1013
073271
E
More or fewer Fibre Channel adapters than expected are detected.
1011
073272
E
Fibre Channel adapter is faulty.
1055
073278
E
Fibre Channel adapter has detected PCI bus error.
1013
073305
W
Fibre Channel speed has changed.
1065
073310
E
Duplicate Fibre Channel frame is detected.
1203
073402
E
The Fibre Channel adapter has a failure.
1032
073404
E
Fibre Channel adapter has detected PCI bus error.
1032
073512
E
Enclosure VPD is inconsistent.
1008
073522
E
The system board service processor has failed.
1034
073528
E
Ambient temperature is too high during system startup.
1098
074001
W
System is unable to determine VPD for a FRU.
2040
074002
E
The node warm started after a software error.
2030
074003
W
A connection to a configured remote system has been
lost because of a connectivity problem.
1715
074004
W
A connection to a configured remote system has been
lost because of too many minor errors.
1716
075011
E
Flash boot device has a failure.
1040
075012
E
Flash boot device has recovered.
1040
075015
E
Service controller has a read failure.
1044
075021
E
Flash boot device has a failure.
1040
075022
E
Flash boot device has recovered.
1040
075025
E
Service controller has a read failure.
1044
075031
E
Flash boot device has a failure.
1040
075032
E
Flash boot device has recovered.
1040
075035
E
A service controller read failure has occurred. This error
applies to only the SAN Volume Controller 2145-CF8 and
the SAN Volume Controller 2145-CG8 models.
1044
076001
E
The internal disk for a node has failed.
1030
076002
E
The hard disk is full and cannot capture any more
output.
2030
076401
E
One of the two power supply units in the node has
failed.
1096
076402
E
One of the two power supply units in the node cannot
be detected.
1096
076403
E
One of the two power supply units in the node is
without power.
1097
076501
E
A high-speed SAS adapter is missing. This error applies
to only the SAN Volume Controller 2145-CF8 model.
1120
076502
E
The PCIe lanes on a high-speed SAS adapter are
degraded.
1121
076503
E
A PCI bus error occurred on a high-speed SAS adapter.
1121
076504
E
A high-speed SAS adapter requires a PCI bus reset.
1122
076505
E
The SAS adapter has an internal fault.
1121
077105
E
The node service processor indicated a fan failure.
1089
077106
E
The node service processor indicated a fan failure.
1089
077107
E
The node service processor indicated a fan failure.
1089
077161
E
Node ambient temperature threshold has been exceeded.
1094
077162
E
The node processor indicated a temperature warning.
1093
077163
E
The node service processor or ambient critical
temperature threshold has been exceeded.
1092
077165
E
Node ambient temperature threshold has been exceeded.
1094
077166
E
Node processor temperature has a warning.
1093
077167
E
Node processor or ambient critical temperature threshold
has been exceeded.
1092
077171
E
System board voltage is high.
1101
077172
E
System board voltage is high.
1101
077173
E
System board voltage is high.
1101
077174
E
System board voltage is low.
1106
077175
E
System board voltage is low.
1106
077176
E
System board voltage is low.
1106
077178
E
Power management board has a voltage fault.
1110
077185
E
Node ambient temperature threshold has been exceeded.
1094
077186
E
The node processor warning temperature threshold has
been exceeded. This error applies to the SAN Volume
Controller 2145-CF8 and the SAN Volume Controller
2145-CG8 models.
1093
077187
E
The node processor or ambient critical threshold has
been exceeded. This error applies to the SAN Volume
Controller 2145-CF8 and the SAN Volume Controller
2145-CG8 models.
1092
077188
E
Power management board voltage has a fault.
1110
078001
E
Power domain error. Both nodes in the I/O group are
powered by the same UPS.
1155
079500
W
The limit on the number of system secure shell (SSH)
sessions has been reached.
2500
079501
W
Unable to access the Network Time Protocol (NTP)
network time server.
2700
079503
W
Unable to connect to NTP server that has been
automatically configured.
2702
079504
W
Hardware configurations of nodes differ in an I/O
group.
1470
079506
I
Technician port connection is not active.
3024
079507
I
Technician port connection is active.
3024
081001
E
Ethernet interface has a failure.
1400
082001
E
A server error has occurred.
2100
082002
W
Service failure has occurred.
2100
083001
E
System failed to communicate with UPS.
1145
083002
E
UPS output loading was unexpectedly high.
1165
083003
E
Battery has reached end of life.
1190
083004
E
UPS battery has a fault.
1180
083005
E
UPS electronics has a fault.
1170
083006
E
UPS frame has a fault.
1175
083007
E
UPS is overcurrent.
1160
083008
E
UPS has a fault but no specific FRU is identified.
1185
083009
E
The UPS has detected an input power fault.
1140
083010
E
UPS has a cabling error.
1150
083011
E
UPS ambient temperature threshold has been exceeded.
1135
083012
E
UPS ambient temperature is high.
3000
083013
E
UPS crossed-cable test is bypassed because of an internal
UPS software error.
3010
083101
E
System failed to communicate with UPS.
1146
083102
E
UPS output loading was unexpectedly high.
1166
083103
E
Battery has reached end of life.
1191
083104
E
UPS has a battery fault.
1181
083105
E
UPS has an electronics fault.
1171
083107
E
UPS is overcurrent.
1161
083108
E
UPS has a fault but no specific FRU is identified.
1186
083109
E
The UPS has detected an input power fault.
1141
083110
E
UPS has a cabling error.
1151
083111
E
UPS ambient temperature threshold has been exceeded.
1136
083112
E
UPS ambient temperature is high.
3001
083113
E
UPS crossed-cable test is bypassed because of an internal
UPS software error.
3011
084000
W
An array MDisk has deconfigured members and has lost
redundancy.
1689
084100
E
An array MDisk is corrupt because of lost metadata.
1240
084200
W
Array MDisk has taken a spare member that does not
match array goals.
1692
084201
W
An array has members that are located in a different I/O
group.
1688
084300
W
An array MDisk is no longer protected by an appropriate
number of suitable spares.
1690
084301
W
No spare protection exists for one or more array MDisks.
1690
084302
W
Distributed array MDisk has fewer rebuild areas
available than threshold.
1690
084500
E
An array MDisk is offline. The metadata for the inflight
writes is on a missing node.
1243
084400
W
A background scrub process has found an inconsistency
between data and parity on the array.
1691
084420
W
Array MDisk has been forced to disable hardware data
integrity checking on member drives.
2035
084600
E
An array MDisk is offline. Metadata on the missing node
contains needed state information.
1243
084700
W
Array response time too high.
1750
084701
W
Distributed array MDisk member slow write count
threshold exceeded.
1750
084800
E
Distributed array MDisk offline due to I/O timeout.
1340
981110
I
iSCSI discovery occurred, configuration changes
pending.
981111
I
iSCSI discovery occurred, configuration changes
complete.
981112
I
iSCSI discovery occurred, no configuration changes were
detected.
988308
I
Distributed array MDisk rebuild started.
988309
I
Distributed array MDisk rebuild completed.
988310
I
Distributed array MDisk copyback started.
988311
I
Distributed array MDisk copyback completed.
988312
I
Distributed array MDisk initialization started.
988313
I
Distributed array MDisk initialization completed.
988314
I
Distributed array MDisk needs resynchronization.
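A table of this kind lends itself to a simple keyed lookup. The following Python sketch is illustrative only: it copies four rows from Table 77 into a dictionary, and the names EventEntry, EVENT_TABLE, and lookup_event are hypothetical and not part of any SAN Volume Controller interface. The expansion of the notification types in the comment (E, W, I) is an assumption based on how the types are used in the table.

```python
from typing import NamedTuple, Optional


class EventEntry(NamedTuple):
    notification_type: str      # "E", "W", or "I" (assumed: error, warning, information)
    condition: str
    error_code: Optional[int]   # informational events list no error code


# Four rows copied from Table 77. The dictionary layout is an assumption
# made for this sketch; the product does not expose the table this way.
EVENT_TABLE = {
    "070543": EventEntry("E", "No usable persistent data could be found on the boot drives.", 1035),
    "070744": EventEntry("W", "A boot drive is in the wrong location.", 1214),
    "084800": EventEntry("E", "Distributed array MDisk offline due to I/O timeout.", 1340),
    "988308": EventEntry("I", "Distributed array MDisk rebuild started.", None),
}


def lookup_event(event_id: str) -> Optional[EventEntry]:
    """Return the row for an event ID, or None if the ID is not in this excerpt."""
    # Event IDs are six digits; pad IDs that lost their leading zero.
    return EVENT_TABLE.get(event_id.zfill(6))
```

For example, lookup_event("70744") pads the ID to 070744 and returns the row whose error code is 1214.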
Resolving a problem with the SAN Volume Controller
2145-DH8 boot drives
Follow these resolution steps to resolve most problems with boot drives.
Before you begin
The node serial number (also known as the product or machine serial number) is
on the MT-M SN label (Machine Type - Model and Serial Number label) on the
front (left side) of the 2145-DH8 node. The node serial number is written to the
system board and to each of the two boot drives during the manufacturing
process.
When the SAN Volume Controller software starts, it reads the node serial number
from the system board (using the node serial number for the panel name) and
compares it with the node serial numbers stored on the two boot drives.
Specific node errors are produced under the following conditions:
v Fatal node error 543: When none of the node serial numbers that are stored in
the three locations match. The node serial number from the system board must
match with at least one of the two boot drives for the SAN Volume Controller
software to assume that node serial number is good.
v Fatal node error 545: If the node serial numbers on each boot drive match each
other but are not the same as the node serial number from the system board. In
this case, the node serial number on the system board might be wrong or the
node serial number on the boot drives might be wrong. For example, the system
board changed or the boot drives come from another 2145-DH8 node.
v Node error 743: If the node serial number cannot be read from one of the two
boot drives because that drive failed, is missing, or is out of sync with the other
boot drive.
v Node error 744: If the node serial number from one of the boot drives identifies
the drive as belonging to a different 2145-DH8 node. If boot drives were swapped
between drive slots 1 and 2, node error 744 is produced.
v Node error 745: If a boot drive is found in an unsupported slot. Drive slots 3-8
are not supported by version 7.3 SAN Volume Controller software.
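The serial-number conditions above can be summarized as a small decision function. This is a minimal sketch, not the logic of the SAN Volume Controller software: the function name is hypothetical, an unreadable drive is modeled as None, and the slot-based conditions (swapped slots for 744, unsupported slots for 745) are left out because they do not depend on serial numbers alone. Treating two unreadable drives as node error 543 is an assumption.

```python
from typing import Optional


def classify_boot_drive_serials(board: str,
                                drive1: Optional[str],
                                drive2: Optional[str]) -> Optional[int]:
    """Map the three stored node serial numbers to a node error, or None if all agree.

    A drive serial of None models a drive that failed, is missing, or is out of sync.
    """
    readable = [s for s in (drive1, drive2) if s is not None]
    if len(readable) == 1:
        return 743   # serial number cannot be read from one of the two boot drives
    if not readable:
        return 543   # assumption: with no readable drive, nothing matches the board
    if drive1 == drive2:
        # Drives agree with each other; compare them with the system board.
        return None if drive1 == board else 545
    if board in (drive1, drive2):
        return 744   # one drive identifies as belonging to a different node
    return 543       # none of the three locations match


# Example: both drives carry the serial number of another node.
print(classify_boot_drive_serials("78ABC01", "78XYZ99", "78XYZ99"))
```

The serial numbers in the example are made up for illustration.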
About this task
There is an event in the Monitoring > Events panel of the management GUI if the
problem produces node error 743, 744, or 745. Run the fix procedure for that event.
Otherwise, connect to the technician port to see the boot drive slot information,
and use the MT-M SN label on the node to determine the problem.
Attention: If a drive slot has Yes in the Active column, the operating system
depends on that drive. Do not remove that drive without first shutting down the
node.
v Do not swap boot drives between slots.
v Each boot drive has a copy of the VPD on the system board.
v Software upgrades are applied to one boot drive at a time to prevent failures
during a concurrent code upgrade (CCU).
Procedure
To resolve a problem with a boot drive, complete the following steps in order:
1. Remove any drive that is in an unsupported slot. Move the drive to the correct
slot if you can.
2. If possible, replace any drive that is shown as missing from a slot. Otherwise,
reseat the drive or replace it with a drive from FRU stock.
3. Move any drive that is in the wrong node back to the correct node.
Note: If the node serial number does not match the node serial number on the
system board, a drive slot has a status of wrong_node. You can ignore this status
if the serial number on the MT-M SN label matches the node serial number on
the drive.
4. Move any drive that is in the wrong slot back to the correct slot.
5. Reseat the drive in any slot that has a status of failed. If the status remains
failed, replace the drive with one from FRU stock.
6. If the drive slot has status out of sync and Yes in the can_sync column, then:
v Use the service assistant GUI to synchronize boot drives, or
v Use the command-line interface (CLI) command satask chbootdrive -sync.
v If No is displayed in the can_sync column, you must resolve another boot
drive problem first.
Replacing the 2145-DH8 system board:
7. Replace the SAN Volume Controller 2145-DH8 system board.
When neither of the boot drives has usable SAN Volume Controller software:
For example, if you replace both of the boot drives from FRU stock at the same
time, neither boot drive has usable SAN Volume Controller software. If the SAN
Volume Controller software is not running, the node status, node fault, battery
status, and battery fault LEDs remain off.
8. If you cannot replace at least one of the original boot drives with a drive that
contains usable SAN Volume Controller software and has a node serial number
that matches the MT-M SN label on the front of the node, contact IBM Remote
Technical support. IBM Remote Technical support can help you install the SAN
Volume Controller software with a bootable USB flash drive.
v Field-based USB installation also repairs the node serial number and WWNN
stored on each boot drive by finding values that are stored on the system
board during manufacturing.
v If the WWNN of this node changed in the past, you must change the
WWNN again after completing the SAN Volume Controller software
installation. For example, if the node replaced a legacy SAN Volume
Controller node, you would have changed the WWNN to that of the legacy
node. You can repeat the change to the WWNN after the SAN Volume
Controller software installation with the service assistant GUI or by
command.
When every copy of the node serial number is lost:
For example, if you replace the system board and both of the boot drives with FRU
stock at the same time, every copy of the node serial number is lost.
9. If you cannot replace one of the original boot drives or the original system
board so that at least one copy of the original node serial number is present,
you cannot repair the node in the field. You must return the node to IBM for
repair.
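Steps 1 through 6 of the procedure are applied in a fixed order, which can be sketched as a priority list. The following Python fragment is an illustration under stated assumptions: the status tokens (unsupported_slot, missing, wrong_slot, out_of_sync) and the function next_action are hypothetical names chosen for this sketch; only wrong_node, failed, and the can_sync column are named in this guide, and the product may report statuses differently.

```python
from typing import Iterable, Optional

# Steps 1-6 of the procedure, in the order in which they must be attempted.
# The status tokens are hypothetical labels for this sketch.
STEP_ORDER = [
    ("unsupported_slot", "Remove the drive from the unsupported slot; move it to the correct slot if you can."),
    ("missing", "Replace the missing drive if possible; otherwise reseat it or use a drive from FRU stock."),
    ("wrong_node", "Move the drive back to the correct node."),
    ("wrong_slot", "Move the drive back to the correct slot."),
    ("failed", "Reseat the drive; if the status remains failed, replace it from FRU stock."),
    ("out_of_sync", "Synchronize the boot drives (service assistant GUI or satask chbootdrive -sync)."),
]


def next_action(statuses: Iterable[str], can_sync: bool = True) -> Optional[str]:
    """Return the first applicable step for the reported drive slot statuses."""
    reported = set(statuses)
    for status, action in STEP_ORDER:
        if status in reported:
            if status == "out_of_sync" and not can_sync:
                # Step 6: No in the can_sync column means another problem comes first.
                return "Resolve another boot drive problem first."
            return action
    return None  # nothing in steps 1-6 applies
```

Because the list is ordered, a node that reports both a failed drive and a drive in an unsupported slot is directed to the unsupported slot first, as in the procedure.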
Results
The status of a drive slot is uninitialized only if the SAN Volume Controller
software does not automatically initialize the FRU drive. This status can occur if
the node serial number on the other boot drive does not match the node serial
number on the system board. If the node serial number on the other boot drive
matches the MT-M SN label on the front (left side) of the node, you can safely
rescue the uninitialized boot drive from the other boot drive. Use the service
assistant GUI or the satask rescuenode command to rescue the drive.
Determining a hardware boot failure
During the hardware boot, you see progress messages, if the model has a front
panel display. If the model does not have a front panel display, light path LEDs
indicate a hardware boot failure.
Before you begin
Line 1 of the front panel displays the message Booting that is followed by the boot
code. Line 2 of the display shows a boot progress indicator. If the boot code detects
an error that makes it impossible to continue, Failed is displayed. You can use the
code to isolate the fault.
The following figure shows an example of a hardware boot display.
Failed
120
Figure 58. Example of a boot error code
About this task
If the boot detects a situation where it cannot continue, it fails. The cause might be
that the software on the hard disk drive is missing or damaged. If possible, the
boot sequence loads and starts the SAN Volume Controller software. Any faults
that are detected are reported as a node error.
Perform the following steps to determine a boot failure:
Procedure
1. Attempt to restore the software by using the node rescue procedure.
2. If node rescue fails, perform the actions that are described for any failing
node-rescue code or procedure.
Boot code reference
Boot codes are displayed on the screen when a node is booting.
The codes indicate the progress of the boot operation. Line 1 of the front panel
displays the message Booting that is followed by the boot code. Line 2 of the
display shows a boot progress indicator. Figure 59 provides a view of the boot
progress display.
Booting
130
Figure 59. Example of a boot progress display
Node error code overview
Node error codes describe failures that relate to a specific node. Node rescue codes
are displayed on the front panel display during node rescue. A 2145-DH8 node
does not have a front panel display and does not have node rescue codes.
Use the service assistant GUI through the technician port to view node errors on a
node that does not have a front panel display, such as a 2145-DH8 node.
Because node errors are specific to a node (for example, memory failures), the
errors might be reported only on that node. However, if the node can communicate
with the configuration node, the error is also reported in the system event log.
When the node error code indicates that a critical error was detected that prevents
the node from becoming a member of a clustered system, the Node fault LED is on
for the 2145-DH8 node, or Line 1 of the front panel display contains the message
Node Error.
Line 2 contains either the error code or the error code and additional data. In
errors that involve a node with more than one power supply, the error code is
followed by two numbers. The first number indicates the power supply that has a
problem (either a 1 or a 2). The second number indicates the problem that is
detected.
Figure 60 provides an example of a node error code. This data might exceed the
maximum width of the menu screen. You can press the Right navigation button to
scroll the display.
Figure 60. Example of a displayed node error code
The additional data is unique for any error code. It provides the necessary
information to isolate the problem in an offline environment. Examples of
additional data are disk serial numbers and field replaceable unit (FRU) location
codes. When these codes are displayed, you can do additional fault isolation by
browsing the default menu to determine the node and Fibre Channel port status.
There are two types of node errors: critical node errors and noncritical node errors.
Critical errors
A critical error means that the node is not able to participate in a clustered system
until the issue that is preventing it from joining a clustered system is resolved. This
error occurs because part of the hardware failed or the system detects that the
software is corrupted. If a node has a critical node error, it is in service state, and
the fault LED on the node is on. The exception is when the node cannot connect to
enough resources to form a clustered system. It shows a critical node error but is
in the starting state. Resolve the errors in priority order. The range of error codes
that is reserved for critical errors is 500 - 699.
Noncritical errors
A noncritical error code is logged when a hardware or code failure occurs that is
related to just one specific node. These errors do not stop the node from entering
active state and joining a clustered system. If the node is part of a clustered
system, an alert describes the error condition. The range of error codes that is
reserved for noncritical errors is 800 - 899.
Node rescue codes
To start node rescue, press and hold the left and right buttons on the front panel
during a power-on cycle. The menu screen displays the Node rescue request. See
the node rescue request topic. The hard disk is formatted and, if the format
completes without error, the software image is downloaded from any available
node. During node recovery, Line 1 of the menu screen displays the message
Booting followed by one of the node rescue codes. Line 2 of the menu screen
displays a boot progress indicator. Figure 61 shows an example of a displayed
node rescue code.
Booting
300
Figure 61. Example of a node-rescue error code
The three-digit code that is shown in Figure 61 represents a node rescue code.
Note: The 2145 UPS-1U does not power off following a node rescue failure.
Clustered-system code overview
The error codes for creating a clustered system are displayed on the menu screen
when you are using the front panel to create a new system, but the create
operation fails. Recovery codes for clustered systems indicate that a critical
software error has occurred that might corrupt your system. Error codes for
clustered systems describe errors other than creation and recovery errors. Each
error-code topic includes an error code number, a description, action, and possible
field-replaceable units (FRUs).
Error codes for creating a clustered system
Figure 62 provides an example of a create error code.
Figure 62. Example of a create error code for a clustered system
Line 1 of the menu screen contains the message Create Failed. Line 2 shows the
error code and, where necessary, additional data.
Error codes for recovering a clustered system
You must perform software problem analysis before you can perform further
operations to avoid the possibility of corrupting your configuration.
Figure 63 provides an example of a recovery error code.
Figure 63. Example of a recovery error code
Error codes for clustered systems
Error codes for clustered systems describe errors other than recovery errors.
Figure 64 provides an example of a clustered-system error code.
Figure 64. Example of an error code for a clustered system
Error code range
This topic shows the number range for each message classification.
Table 78 lists the number range for each message classification.
Table 78. Message classification number range
Message classification                            Range
Booting codes                                     100-299
Node errors
  Node rescue errors                              300-399
  Log-only node errors                            400-499
  Critical node errors                            500-699
  Noncritical node errors                         800-899
Error codes when creating a clustered system      700, 710
Error codes when recovering a clustered system    920, 990
Error codes for a clustered system                1001-3081
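The ranges in Table 78 can be expressed as a short lookup. This Python sketch is illustrative only; the names RANGES, SINGLE_CODES, and classify_code are hypothetical. Codes that fall outside every listed range return None rather than a guess.

```python
from typing import Optional

# Inclusive ranges taken from Table 78.
RANGES = [
    (100, 299, "Booting codes"),
    (300, 399, "Node rescue errors"),
    (400, 499, "Log-only node errors"),
    (500, 699, "Critical node errors"),
    (800, 899, "Noncritical node errors"),
    (1001, 3081, "Error codes for a clustered system"),
]

# Table 78 lists these classifications as individual codes, not as ranges.
SINGLE_CODES = {
    700: "Error codes when creating a clustered system",
    710: "Error codes when creating a clustered system",
    920: "Error codes when recovering a clustered system",
    990: "Error codes when recovering a clustered system",
}


def classify_code(code: int) -> Optional[str]:
    """Return the Table 78 classification for a code, or None if it is not listed."""
    if code in SINGLE_CODES:
        return SINGLE_CODES[code]
    for low, high, name in RANGES:
        if low <= code <= high:
            return name
    return None
```

For example, classify_code(543) returns "Critical node errors" and classify_code(1035) returns "Error codes for a clustered system".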
100
Boot is running
Explanation: The node has started. It is running
diagnostics and loading the runtime code.
User response: Go to the hardware boot MAP to
resolve the problem.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Service controller (47%)
v Service controller cable (47%)
v System board assembly (6%)

120
Disk drive hardware error
Explanation: The internal disk drive of the node has
reported an error. The node is unable to start.
User response: Ensure that the boot disk drive and all
related cabling is properly connected, then exchange
the FRU for a new FRU. (See “Possible Cause-FRUs or
other.”)
Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
v Disk drive (50%)
v Disk controller (30%)
v Disk backplane (10%)
v Disk signal cable (8%)
v Disk power cable (1%)
v System board (1%)

130
Checking the internal disk file system
Explanation: The file system on the internal disk drive
of the node is being checked for inconsistencies.
User response: If the progress bar has been stopped
for at least five minutes, power off the node and then
power on the node. If the boot process stops again at
this point, run the node rescue procedure.
Possible Cause-FRUs or other:
v None.

132
Updating BIOS settings of the node
Explanation: The system has found that changes are
required to the BIOS settings of the node. These
changes are being made. The node will restart once the
changes are complete.
User response: If the progress bar has stopped for
more than 10 minutes, or if the display has shown
codes 100 and 132 three times or more, go to “MAP
5900: Hardware boot” on page 341 to resolve the
problem.

135
Verifying the software
Explanation: The software packages of the node are
being checked for integrity.
User response: Allow the verification process to
complete.

137
Updating system board service processor
firmware
Explanation: The service processor firmware of the
node is being updated to a new level. This process can
take 90 minutes. Do not restart the node while this is in
progress.
User response: Allow the updating process to
complete.

150
Loading cluster code
Explanation: The SAN Volume Controller code is
being loaded.
User response: If the progress bar has been stopped
for at least 90 seconds, power off the node and then
power on the node. If the boot process stops again at
this point, run the node rescue procedure.
Possible Cause-FRUs or other:
v None.
155
Loading cluster data
Explanation: The saved cluster state and cache data is
being loaded.
User response: If the progress bar has been stopped
for at least 5 minutes, power off the node and then
power on the node. If the boot process stops again at
this point, run the node rescue procedure.
Possible Cause-FRUs or other:
v None.

160
Updating the service controller
Explanation: The firmware on the service controller is
being updated. This can take 30 minutes.
User response: When a node rescue is occurring, if the
progress bar has been stopped for at least 30 minutes,
exchange the FRU for a new FRU. When a node rescue
is not occurring, if the progress bar has been stopped
for at least 15 minutes, exchange the FRU for a new
FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Service controller (95%)
v Service controller cable (5%)
All previous 2145 models
v Service Controller (100%)

168
The command cannot be initiated
because authentication credentials for
the current SSH session have expired.
Explanation: Authentication credentials for the current
SSH session have expired, and all authorization for the
current session has been revoked. A system
administrator may have cleared the authentication
cache.
User response: Begin a new SSH session and re-issue
the command.

170
A flash module hardware error has
occurred.
Explanation: A flash module hardware error has
occurred.
User response: Exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Service controller (95%)
v Service controller cable (5%)
All previous 2145 models
v Service controller (100%)

182
Checking uninterruptible power supply
Explanation: The node is checking whether the
uninterruptible power supply is operating correctly.
User response: Allow the checking process to
complete.

232
Checking uninterruptible power supply
connections
Explanation: The node is checking whether the power
and signal cable connections to the uninterruptible
power supply are correct.
User response: Allow the checking process to
complete.

300
The 2145 is running node rescue.
Explanation: The 2145 is running node rescue.
User response: If the progress bar has been stopped
for at least two minutes, exchange the FRU for a new
FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Service controller (95%)
v Service controller cable (5%)
v Service controller (100%)

310
The 2145 is running a format operation.
Explanation: The 2145 is running a format operation.
User response: If the progress bar has been stopped
for two minutes, exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Disk drive (50%)
v Disk controller (30%)
v Disk backplane (10%)
v Disk signal cable (8%)
v Disk power cable (1%)
v System board (1%)
v Disk drive assembly (90%)
v Disk cable assembly (10%)
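Several of the user responses for these codes depend on how long the progress bar has been stopped. The following Python sketch collects the thresholds that are stated for codes 132, 150, 155, 300, and 310 and checks an observed stall time against them. It is illustrative only: the names STALL_THRESHOLDS_S and is_stalled are hypothetical, and the sketch treats every threshold as "at least", although the response for code 132 is worded as "more than 10 minutes".

```python
from typing import Optional

# Seconds without progress after which the user response calls for action.
# Values are taken from the user responses for the listed codes.
STALL_THRESHOLDS_S = {
    132: 10 * 60,  # Updating BIOS settings of the node
    150: 90,       # Loading cluster code
    155: 5 * 60,   # Loading cluster data
    300: 2 * 60,   # The 2145 is running node rescue
    310: 2 * 60,   # The 2145 is running a format operation
}


def is_stalled(code: int, seconds_without_progress: float) -> Optional[bool]:
    """Return True if the stall threshold for the code is reached.

    Returns None for codes that are not in this excerpt, for example codes
    whose user response is simply to allow the process to complete.
    """
    threshold = STALL_THRESHOLDS_S.get(code)
    if threshold is None:
        return None
    return seconds_without_progress >= threshold
```

A code such as 137, where the firmware update can take 90 minutes and must not be interrupted, is deliberately absent from the table, so is_stalled returns None for it.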
320
A 2145 format operation has failed.
Explanation: A 2145 format operation has failed.
User response: Exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Disk drive (50%)
v Disk controller (30%)
v Disk backplane (10%)
v Disk signal cable (8%)
v Disk power cable (1%)
v System board (1%)
v Disk drive assembly (90%)
v Disk cable assembly (10%)
330
The 2145 is partitioning its disk drive.
Explanation: The 2145 is partitioning its disk drive.
User response: If the progress bar has been stopped
for two minutes, exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Disk drive (50%)
v Disk controller (30%)
v Disk backplane (10%)
v Disk signal cable (8%)
v Disk power cable (1%)
v System board (1%)
v Disk drive assembly (90%)
v Disk cable assembly (10%)
Other:
v Configuration problem
v Software error
340
The 2145 is searching for donor node.
Explanation: The 2145 is searching for donor node.
User response: If the progress bar has been stopped
for more than two minutes, exchange the FRU for a
new FRU.
Possible Cause-FRUs or other:
v Fibre Channel adapter (100%)
345
The 2145 is searching for a donor node
from which to copy the software.
Explanation: The node is searching at 1 Gb/s for a
donor node.
User response: If the progress bar has stopped for
more than two minutes, exchange the FRU for a new
FRU.
Possible Cause-FRUs or other:
v Fibre Channel adapter (100%)
350
The 2145 cannot find a donor node.
Explanation: The 2145 cannot find a donor node.
User response: If the progress bar has stopped for
more than two minutes, perform the following steps:
1. Ensure that all of the Fibre Channel cables are
connected correctly and securely to the cluster.
2. Ensure that at least one other node is operational, is
connected to the same Fibre Channel network, and
is a donor node candidate. A node is a donor node
candidate if the version of software that is installed
on that node supports the model type of the node
that is being rescued.
3. Ensure that the Fibre Channel zoning allows a
connection between the node that is being rescued
and the donor node candidate.
4. Perform the problem determination procedures for
the network.
Possible Cause-FRUs or other:
v None
Other:
v Fibre Channel network problem
360
The 2145 is loading software from the
donor.
Explanation: The 2145 is loading software from the
donor.
User response: If the progress bar has been stopped
for at least two minutes, restart the node rescue
procedure.
Possible Cause-FRUs or other:
v None
365
Cannot load SW from donor
Explanation: None.
User response: None.
370
Installing software
Explanation: The 2145 is installing software.
User response:
1. If this code is displayed and the progress bar has
been stopped for at least ten minutes, the software
install process has failed with an unexpected
software error.
2. Power off the 2145 and wait for 60 seconds.
3. Power on the 2145. The software update operation
continues.
4. Report this problem immediately to your Software
Support Center.
Possible Cause-FRUs or other:
v None
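The stall thresholds quoted in the user responses for boot codes 310 through 370 can be collected into one lookup. The following Python sketch is illustrative only: the names STALL_MINUTES and is_stalled are hypothetical, and the guide's slightly different wordings ("for two minutes", "more than two minutes", "at least two minutes") are all treated as "greater than or equal to", which is a simplifying assumption.

```python
# Stall thresholds in minutes, transcribed from the user responses for
# boot codes 310-370. Names are hypothetical, not part of any IBM tool.
STALL_MINUTES = {
    310: 2,   # running a format operation
    330: 2,   # partitioning the disk drive
    340: 2,   # searching for donor node
    345: 2,   # searching for a donor node at 1 Gb/s
    350: 2,   # cannot find a donor node
    360: 2,   # loading software from the donor
    370: 10,  # installing software
}


def is_stalled(code: int, minutes_without_progress: float) -> bool:
    """Return True when the progress bar has been stopped long enough
    for the user response of the given boot code to apply."""
    threshold = STALL_MINUTES.get(code)
    if threshold is None:
        raise ValueError(f"no stall threshold recorded for code {code}")
    # Assumption: ">=" stands in for all of the guide's wordings.
    return minutes_without_progress >= threshold
```

For example, is_stalled(370, 5) is False because code 370 allows ten minutes before the install is considered failed.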
500
Incorrect enclosure
Explanation: The node canister has saved cluster
information, which indicates that the canister is now
located in a different enclosure from where it was
previously used. Using the node canister in this state
might corrupt the data held on the enclosure drives.
User response: Follow troubleshooting procedures to
move the nodes to the correct location.
1. Follow the “Procedure: Getting node canister and
system information using the service assistant” task
to review the node canister saved location
information and the status of the other node
canister in the enclosure (the partner canister).
Determine if the enclosure is part of an active
system with volumes that contain required data.
2. If you have unintentionally moved the canister into
this enclosure, move the canister back to its original
location, and put the original canister back in this
enclosure. Follow the “Replacing a node canister”
procedure.
3. If you have intentionally moved the node canister
into this enclosure, check whether it is safe to
continue or whether you will lose data on the
enclosure that you removed it from. Do not continue if
the system that the node canister was removed from is
offline; instead, return the node canister to that
system.
4. If you have determined that you can continue,
follow the “Procedure: Removing system data from
a node canister” task to remove cluster data from
the node canister.
5. If the partner node in this enclosure is not online, or
is not present, you must perform a system
recovery. Do not create a new system, because doing
so loses all the volume data.
Possible Cause-FRUs or other cause:
v None
501
Incorrect slot
Explanation: The node canister has saved cluster
information, which indicates that the canister is not
located in the expected enclosure, but in a different slot
from where it was previously used. Using the node
canister in this state might mean that hosts are not able
to connect correctly.
User response: Follow troubleshooting procedures to
relocate the node canister to the correct location.
1. Follow the “Procedure: Getting node canister and
system information using the service assistant” task
to review the node canister saved location
information and the status of the other node
canister in the enclosure (the partner canister). If the
node canister has been inadvertently swapped, the
other node canister will have the same error.
2. If the canisters have been swapped, use the
“Replacing a node canister” procedure to swap the
canisters. The system should start.
3. If the partner canister is in candidate state, use the
hardware remove and replace canister procedure to
swap the canisters. The system should start.
4. If the partner canister is in active state, it is running
the cluster on this enclosure and has replaced the
original use of this canister. Follow the “Procedure:
Removing system data from a node canister” task to
remove cluster data from this node canister. The
node canister will then become active in the cluster
in its current slot.
5. If the partner canister is in service state, review its
node error to determine the correct action.
Generally, you will fix the errors reported on the
partner node in priority order, and review the
situation again after each change. If you have to
replace the partner canister with a new one, you
should move this canister back to the correct
location at the same time.
Possible Cause-FRUs or other:
v None
502
No enclosure identity exists and a status
from the partner node could not be
obtained.
Explanation: The enclosure has been replaced and
communication with the other node canister (partner
node) in the enclosure is not possible. The partner node
could be missing, powered off, unable to boot, or an
internode communication failure may exist.
User response: Follow troubleshooting procedures to
configure the enclosure:
1. Follow the procedures to resolve a problem to get
the partner node started. An error will still exist
because the enclosure has no identity. If the error
has changed, follow the service procedure for that
error.
2. If the partner has started and is showing a location
error (probably this one), then the PCI link is
probably broken. Since the enclosure midplane was
recently replaced, this is likely the problem. Obtain
a replacement enclosure midplane, and replace it.
3. If this action does not resolve the issue, contact IBM
Support Center. They will work with you to ensure
that the system state data is not lost while resolving
the problem.
Possible Cause-FRUs or other:
v Enclosure midplane (100%)
503
Incorrect enclosure type
Explanation: The node canister has been moved to an
expansion enclosure. A node canister will not operate
in this environment. This can also be reported when a
replacement node canister is installed for the first time.
User response: Follow troubleshooting procedures to
relocate the nodes to the correct location.
1. Follow the procedure Getting node canister and
system information using a USB flash drive and
review the saved location information of the node
canister to determine which control enclosure the
node canister belongs in.
2. Follow the procedure to move the node canister to
the correct location, then follow the procedure to
move the expansion canister that is probably in that
location to the correct location. If there is a node
canister that is in active state where this node
canister must be, do not replace that node canister
with this one.
504
No enclosure identity and partner node
matches.
Explanation: The enclosure vital product data
indicates that the enclosure midplane has been
replaced. This node canister and the other node canister
in the enclosure were previously operating in the same
enclosure midplane.
User response: Follow troubleshooting procedures to
configure the enclosure.
1. This is an expected situation during the hardware
remove and replace procedure for a control
enclosure midplane. Continue following the remove
and replace procedure and configure the new
enclosure.
Possible Cause-FRUs or other:
v None
505
No enclosure identity and partner has
system data that does not match.
Explanation: The enclosure vital product data
indicates that the enclosure midplane has been
replaced. This node canister and the other node canister
in the enclosure do not come from the same original
enclosure.
User response: Follow troubleshooting procedures to
relocate nodes to the correct location.
1. Follow the “Procedure: Getting node canister and
system information using the service assistant” task
to review the node canister saved location
information and the status of the other node
canister in the enclosure (the partner canister).
Determine if the enclosure is part of an active
system with volumes that contain required data.
2. Decide what to do with the node canister that did
not come from the enclosure that is being replaced.
a. If the other node canister from the enclosure
being replaced is available, use the hardware
remove and replace canister procedures to
remove the incorrect canister and replace it with
the second node canister from the enclosure
being replaced. Restart both canisters. The two
node canisters should show node error 504, and
the actions for that error should be followed.
b. If the other node canister from the enclosure
being replaced is not available, check the
enclosure of the node canister that did not come
from the replaced enclosure. Do not use this
canister in this enclosure if you require the
volume data on the system from which the node
canister was removed, and that system is not
running with two online nodes. You should
return the canister to its original enclosure and
use a different canister in this enclosure.
c. When you have checked that it is not required
elsewhere, follow the “Procedure: Removing
system data from a node canister” task to
remove cluster data from the node canister that
did not come from the enclosure that is being
replaced.
d. Restart both nodes. Expect node error 506 to be
reported now, then follow the service procedures
for that error.
Possible Cause-FRUs or other:
v None
506
No enclosure identity and no node state
on partner
Explanation: The enclosure vital product data
indicates that the enclosure midplane has been
replaced. There is no cluster state information on the
other node canister in the enclosure (the partner
canister), so both node canisters from the original
enclosure have not been moved to this one.
User response: Follow troubleshooting procedures to
relocate nodes to the correct location:
1. Follow the procedure: Getting node canister and
system information and review the saved location
information of the node canister and determine why
the second node canister from the original enclosure
was not moved into this enclosure.
2. If you are sure that this node canister came from
the enclosure that is being replaced, and the original
partner canister is available, use the “Replacing a
node canister” procedure to install the second node
canister in this enclosure. Restart the node canister.
The two node canisters should show node error 504,
and the actions for that error should be followed.
3. If you are sure this node canister came from the
enclosure that is being replaced, and that the
original partner canister has failed, continue
following the remove and replace procedure for an
enclosure midplane and configure the new
enclosure.
Possible Cause-FRUs or other:
v None
507
No enclosure identity and no node state
Explanation: The node canister has been placed in a
replacement enclosure midplane. The node canister is
also a replacement or has had all cluster state removed
from it.
User response: Follow troubleshooting procedures to
relocate the nodes to the correct location.
1. Check the status of the other node in the enclosure.
Unless it also shows error 507, check the errors on
the other node and follow the corresponding
procedures to resolve the errors. It typically shows
node error 506.
2. If the other node in the enclosure is also reporting
507, the enclosure and both node canisters have no
state information. Contact IBM support. They will
assist you in setting the enclosure vital product data
and running cluster recovery.
Possible Cause-FRUs or other:
v None
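Node errors 502 and 504 through 507 all describe a node canister in an enclosure that has no identity, and they differ in what is known about cluster state on this node and on its partner. The following Python sketch summarizes that relationship as read from the error titles and explanations above. The function name, its parameters, and the order in which the conditions are tested are assumptions made for illustration; the product's actual decision logic is not documented here.

```python
from typing import Optional


def enclosure_identity_error(node_has_state: bool,
                             partner_reachable: bool,
                             partner_has_state: Optional[bool] = None,
                             state_matches: Optional[bool] = None) -> int:
    """Pick the node error that matches a situation in which the
    enclosure has no identity (for example, a replaced midplane).

    Assumption: the precedence of the checks below is illustrative.
    """
    if not node_has_state:
        # 507: no enclosure identity and no node state
        return 507
    if not partner_reachable:
        # 502: status from the partner node could not be obtained
        return 502
    if not partner_has_state:
        # 506: no node state on partner
        return 506
    # 504: partner matches; 505: partner system data does not match
    return 504 if state_matches else 505
```

For example, a node that holds cluster state and finds a reachable partner with matching state maps to 504, which the guide describes as the expected situation during a midplane replacement.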
508
Cluster identifier is different between
enclosure and node
Explanation: The node canister location information
shows that it is in the correct enclosure; however, the
enclosure has had a new clustered system created on it
since the node was last shut down. Therefore, the
clustered system state data stored on the node is not
valid.
User response: Follow troubleshooting procedures to
correctly relocate the nodes.
1. Check whether a new clustered system has been
created on this enclosure while this canister was not
operating or whether the node canister was recently
installed in the enclosure.
2. Follow the “Procedure: Getting node canister and
system information using the service assistant” task,
and check the partner node canister to see if it is
also reporting node error 508. If it is, check that the
saved system information on this and the partner
node match.
If the system information on both nodes matches,
follow the “Replacing a control enclosure midplane”
procedure to change the enclosure midplane.
3. If this node canister is the one to be used in this
enclosure, follow the “Procedure: Removing system
data from a node canister” task to remove clustered
system data from the node canister. It will then join
the clustered system.
4. If this is not the node canister that you intended to
use, follow the “Replacing a node canister”
procedure to replace the node canister with the one
intended for use.
Possible Cause—FRUs or other:
v Service procedure error (90%)
v Enclosure midplane (10%)
509
The enclosure identity cannot be read.
Explanation: The canister was unable to read vital
product data (VPD) from the enclosure. The canister
requires this data to be able to initialize correctly.
User response: Follow troubleshooting procedures to
fix the hardware:
1. Check errors reported on the other node canister in
this enclosure (the partner canister).
2. If it is reporting the same error, follow the hardware
remove and replace procedure to replace the
enclosure midplane.
3. If the partner canister is not reporting this error,
follow the hardware remove and replace procedure
to replace this canister.
Note: If a newly installed system has this error on both
node canisters, the data that needs to be written to the
enclosure will not be available on the canisters; contact
IBM support for the WWNNs to use.
Remember: Review the lsservicenodes output for
what the node is reporting.
Possible Cause—FRUs or other:
v Node canister (50%)
v Enclosure midplane (50%)
510
The detected memory size does not
match the expected memory size.
Explanation: The amount of memory detected in the
node is less than the amount required for the node to
operate as an active member of a system. The error
code data shows the detected memory, in MB, followed
by the minimum required memory, in MB. A series of
values then indicates the amount of memory, in GB,
detected in each memory slot.
Data:
v Detected memory in MB
v Minimum required memory in MB
v Memory in slot 1 in GB
v Memory in slot 2 in GB
v ... etc.
User response: Check the memory size of another
2145 that is in the same cluster. For the 2145-CF8 and
2145-CG8, if you have just replaced a memory module,
check that the module that you have installed is the
correct size, then go to the light path MAP to isolate
any possible failed memory modules.
Possible Cause-FRUs or other:
v Memory module (100%)
511
Memory bank 1 of the 2145 is failing.
For the 2145-DH8 only, the DIMMS are
incorrectly installed.
Explanation: Memory bank 1 of the 2145 is failing.
For the 2145-DH8 only, the DIMMS are incorrectly
installed. This will degrade performance.
User response: For the 2145-DH8 only, shut down the
node and adjust the DIMM placement as per the install
directions.
Possible Cause-FRUs or other:
v Memory module (100%)
512
Enclosure VPD is inconsistent
Explanation: The enclosure midplane VPD is not
consistent. The machine part number is not compatible
with the machine type and model. This indicates that
the enclosure VPD is corrupted.
User response:
1. Check the support site for a code update.
2. Use the remove and replace procedures to replace
the enclosure midplane.
Possible Cause-FRUs or other:
v Enclosure midplane (100%)
517
The WWNNs of the service controller
and the disk do not match.
Explanation: The node is unable to determine the
WWNN that it must use. This is because the service
controller or the node's internal drive was replaced.
User response: Follow troubleshooting procedures to
configure the WWNN of the node.
1. Continue to follow the hardware remove and
replace procedure for the service controller or disk.
2. If you have not followed the hardware remove and
replace procedures, determine the correct WWNN.
If you do not have this information recorded,
examine your Fibre Channel switch configuration to
see whether it is listed there. Follow the procedures
to change the WWNN of a node.
Possible Cause-FRUs or other:
v None
521
Unable to detect a Fibre Channel
adapter
Explanation: The 2145 cannot detect any Fibre
Channel adapters.
User response: Ensure that a Fibre Channel adapter
has been installed. Ensure that the Fibre Channel
adapter is seated correctly in the riser card. Ensure that
the riser card is seated correctly on the system board. If
the problem persists, exchange FRUs for new FRUs in
the order shown.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v 4-port Fibre Channel host bus adapter assembly
(95%)
v System board assembly (5%)
522
The system board service processor has
failed.
Explanation: The service processor on the system
board has failed.
User response: For the 2145-DH8 only:
1. Shut down the node.
2. Remove the mains power cable.
3. Wait for the lights to stop blinking.
4. Plug in power, and then wait for the node to boot.
5. If that fails, replace the system board.
Exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CF8, 2145-CG8, or 2145-DH8
v System board assembly (100%)
523
The internal disk file system is
damaged.
Explanation: The node startup procedures have found
problems with the file system on the internal disk of
the node.
User response: Follow troubleshooting procedures to
reload the software.
1. Follow Procedure: Rescuing node canister machine
code from another node (node rescue).
2. If the node rescue does not succeed, use the
hardware remove and replace procedures.
Possible Cause—FRUs or other:
v Node canister (80%)
v Other (20%)
524
Unable to update BIOS settings.
Explanation: Unable to update BIOS settings.
User response: Power off the node, wait 30 seconds, and
then power it on again. If the error code is still reported,
replace the system board.
Possible Cause-FRUs or other:
v System board (100%)
525
Unable to update system board service
processor firmware.
Explanation: The node startup procedures have been
unable to update the firmware configuration of the
node canister. The update may take 90 minutes.
User response:
1. If the progress bar has been stopped for more than
90 minutes, power off and reboot the node. If the
boot progress bar stops again on this code, replace
the FRU shown.
2. If powering off or restarting the node does not fix
the problem, remove the power cords and then
restart the node.
Possible Cause—FRUs or other:
2145-CF8, or 2145-CG8
v System board (100%)
528
Ambient temperature is too high during
system startup.
Explanation: The ambient temperature read during
the node startup procedures is too high for the node to
continue. The startup procedure will continue when the
temperature is within range.
User response: Reduce the temperature around the
system.
1. Resolve the issue with the ambient temperature, by
checking and correcting:
a. Room temperature and air conditioning
b. Ventilation around the rack
c. Airflow within the rack
Possible Cause—FRUs or other:
v Environment issue (100%)
530
A problem with one of the node's power
supplies has been detected.
Explanation: The 530 error code is followed by two
numbers. The first number is either 1 or 2 to indicate
which power supply has the problem.
The second number is either 1, 2 or 3 to indicate the
reason. 1 indicates that the power supply is not
detected. 2 indicates that the power supply has failed. 3
indicates that there is no input power to the power
supply.
If the node is a member of a cluster, the cluster will
report error code 1096 or 1097, depending on the error
reason.
The error will automatically clear when the problem is
fixed.
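The two numbers that follow the 530 error code can be decoded with a small lookup. The following Python sketch is a minimal illustration of the meanings listed above; the names SUPPLY_REASONS and decode_530 are hypothetical and do not correspond to any product command.

```python
# Reason numbers for node error 530, as listed in the explanation.
SUPPLY_REASONS = {
    1: "power supply is not detected",
    2: "power supply has failed",
    3: "no input power to the power supply",
}


def decode_530(supply: int, reason: int) -> str:
    """Describe the two numbers that follow the 530 error code.

    supply: 1 or 2, identifying which power supply has the problem.
    reason: 1, 2, or 3, identifying why.
    """
    if supply not in (1, 2):
        raise ValueError(f"unexpected power supply number: {supply}")
    if reason not in SUPPLY_REASONS:
        raise ValueError(f"unexpected reason number: {reason}")
    return f"Power supply {supply}: {SUPPLY_REASONS[reason]}"
```

For example, decode_530(2, 3) describes power supply 2 as having no input power.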
User response:
1. Ensure that the power supply is seated correctly
and that the power cable is attached correctly to
both the node and to the 2145 UPS-1U.
2. If the error has not been automatically marked fixed
after two minutes, note the status of the three LEDs
on the back of the power supply. For the 2145-CG8
or 2145-CF8, the AC LED is the top green LED, the
DC LED is the middle green LED and the error
LED is the bottom amber LED.
3. If the power supply error LED is off and the AC
and DC power LEDs are both on, this is the normal
condition. If the error has not been automatically
fixed after two minutes, replace the system board.
4. Follow the action specified for the LED states noted
in the table below.
5. If the error has not been automatically fixed after
two minutes, contact support.
Error LED  AC LED     DC LED     Action
ON         ON or OFF  ON or OFF  The power supply has a fault. Replace the
                                 power supply.
OFF        OFF        OFF        There is no power detected. Ensure that
                                 the power cable is connected at the node
                                 and 2145 UPS-1U. If the AC LED does not
                                 light, check whether the 2145 UPS-1U is
                                 showing any errors. Follow MAP 5150 2145
                                 UPS-1U if the UPS-1U is showing an error;
                                 otherwise, replace the power cable. If the
                                 AC LED still does not light, replace the
                                 power supply.
OFF        OFF        ON         The power supply has a fault. Replace the
                                 power supply.
OFF        ON         OFF        Ensure that the power supply is installed
                                 correctly. If the DC LED does not light,
                                 replace the power supply.
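The LED states in the table above, together with the normal state described in step 3, cover every combination of the three LEDs. The following Python sketch encodes them as a decision function. The function name and the shortened action strings are hypothetical summaries; refer to the table for the full action text.

```python
def power_supply_action(error_on: bool, ac_on: bool, dc_on: bool) -> str:
    """Return a short summary of the action for a set of power supply
    LED states (error, AC, DC), following the table for node error 530."""
    if error_on:
        # Error LED on: a fault regardless of the AC and DC LEDs.
        return "replace the power supply"
    if not ac_on and not dc_on:
        return "check the power cable and the 2145 UPS-1U"
    if not ac_on and dc_on:
        return "replace the power supply"
    if ac_on and not dc_on:
        return "check that the power supply is installed correctly"
    # Error off, AC on, DC on: the normal condition (step 3).
    return "normal LED state"
```

Testing the error LED first mirrors the first table row, where the AC and DC states do not affect the action.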
Possible Cause-FRUs or other:
Reason 1: A power supply is not detected.
v Power supply (19%)
v System board (1%)
v Other: Power supply is not installed correctly (80%)
Reason 2: The power supply has failed.
v Power supply (90%)
v Power cable assembly (5%)
v System board (5%)
Reason 3: There is no input power to the power supply.
v Power cable assembly (25%)
v UPS-1U assembly (4%)
v System board (1%)
v Other: Power supply is not installed correctly (70%)
534
System board fault
Explanation: There is an unrecoverable error condition
in a device on the system board.
User response: For a storage enclosure, replace the
canister and reuse the interface adapters and fans.
For a control enclosure, refer to the additional details
supplied with the error to determine the proper parts
replacement sequence.
v Pwr rail A: Replace CPU 1.
Replace the power supply if the OVER SPEC LED on
the light path diagnostics panel is still lit.
v Pwr rail B: Replace CPU 2.
Replace the power supply if the OVER SPEC LED on
the light path diagnostics panel is still lit.
v Pwr rail C: Replace the following components until
"Pwr rail C" is no longer reported:
– DIMMs 1 - 6
– PCI riser-card assembly 1
– Fan 1
– Optional adapters that are installed in PCI
riser-card assembly 1
– Replace the power supply if the OVER SPEC LED
on the light path diagnostics panel is still lit.
v Pwr rail D: Replace the following components until
"Pwr rail D" is no longer reported:
– DIMMs 7 - 12
– Fan 2
– Optional PCI adapter power cable
– Replace the power supply if the OVER SPEC LED
on the light path diagnostics panel is still lit.
v Pwr rail E: Replace the following components until
"Pwr rail E" is no longer reported:
– DIMMs 13 - 18
– Hard disk drives
– Replace the power supply if the OVER SPEC LED
on the light path diagnostics panel is still lit.
v Pwr rail F: Replace the following components until
"Pwr rail F" is no longer reported:
– DIMMs 19 - 24
– Fan 4
– Optional adapters that are installed in PCI
riser-card assembly 2
– PCI riser-card assembly 2
– Replace the power supply if the OVER SPEC LED
on the light path diagnostics panel is still lit.
v Pwr rail G: Replace the following components until
"Pwr rail G" is no longer reported:
– Hard disk drive backplane assembly
– Hard disk drives
– Fan 3
– Optional PCI adapter power cable
v Pwr rail H: Replace the following components until
"Pwr rail H" is no longer reported:
– Optional adapters that are installed in PCI
riser-card assembly 2
– Optional PCI adapter power cable
Possible Cause-FRUs or other:
v Hardware (100%)
535
Canister internal PCIe switch failed
Explanation: The PCI Express switch has failed or
cannot be detected. In this situation, the only
connectivity to the node canister is through the
Ethernet ports.
User response: Follow troubleshooting procedures to
fix the hardware:
536
The temperature of a device on the
system board is greater than or equal to
the critical threshold.
Explanation: The temperature of a device on the
system board is greater than or equal to the critical
threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Remove the top of the machine case and check for
missing baffles, damaged heat sinks, or internal
blockages.
2. If the error persists, replace the system board.
Possible Cause-FRUs or other:
v None
538
The temperature of a PCI riser card is
greater than or equal to the critical
threshold.
Explanation: The temperature of a PCI riser card is
greater than or equal to the critical threshold.
User response: Improve cooling.
1. If the problem persists, replace the PCI riser
Possible Cause-FRUs or other:
v None
541
Multiple, undetermined, hardware
errors
Explanation: Multiple hardware failures have been
reported on the data paths within the node canister,
and the threshold of the number of acceptable errors
within a given time frame has been reached. It has not
been possible to isolate the errors to a single
component.
After this node error has been raised, all ports on the
node will be deactivated. The reason for this is that the
node canister is considered unstable, and has the
potential to corrupt data.
User response:
1. Follow the procedure for collecting information for
support, and contact your support organization.
2. A software [code] update may resolve the issue.
3. Replace the node canister.
542
An installed CPU has failed or been
removed.
Explanation: An installed CPU has failed or been
removed.
User response: Replace the CPU.
Possible Cause-FRUs or other:
v CPU (100%)
543
None of the node serial numbers that
are stored in the three locations match.
Explanation: When the SAN Volume Controller
software starts, it reads the node serial number from
the system board and compares this serial number to
the node serial numbers stored on the two boot drives.
There must be at least two matching node serial
numbers for the SAN Volume Controller software to
assume that node serial number is good.
User response: Look at a boot drive view for the node
to work out what to do.
1. Replace missing or failed drives.
2. Put any drive that belongs to a different node back
where it belongs.
3. If you intend to use a drive from a different node in
this node from now on, the node error changes to a
different node error when the other drive is
replaced.
4. If you replaced the system board, then the panel
name is now 0000000, and if you replaced one of
the drives, then the slot status of that drive is
uninitialized. If the node serial number of the other
boot drive matches the MT-M S/N label on the
front of the node, then run satask rescuenode to
initialize the uninitialized drive. Initializing the
drive should lead to the 545 node error.
Possible Cause-FRUs or other:
v None
544
Boot drives are from other nodes.
Explanation: Boot drives are from other nodes.
User response: Look at a boot drive view for the node
to determine what to do.
1. Put any drive that belongs to a different node back
where it belongs.
2. If you intend to use a drive from a different node in
this node from now on, the node error changes to a
different node error when the other drive is
replaced.
3. See error code 1035 for additional information
regarding boot drive problems.
Possible Cause-FRUs or other:
v None
545
The node serial number on the boot
drives match each other, but they do not
match the product serial number on the
system board.
Explanation: The node serial number on the boot
drives match each other, but they do not match the
product serial number on the system board.
User response: Check the S/N value on the MT-M
S/N label on the front of the node. Look at a boot
drive view to see the node serial number of the system
board and the node serial number of each drive.
1. Replace the boot drives with the correct boot drives
if needed.
2. Set the system board serial number using the
following command:
satask chvpd -type <value> -serial <S/N value from
the MT-M S/N label>
Possible Cause-FRUs or other:
v None
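Node errors 543 and 545 both follow from the startup comparison of three stored node serial numbers: one on the system board and one on each of the two boot drives. The following Python sketch illustrates that comparison under the rule that at least two values must match. The function name and return convention are hypothetical, and combinations that the 543 and 545 descriptions do not cover (for example, the board matching only one drive) are reported here simply as "neither error".

```python
from collections import Counter
from typing import Optional, Tuple


def check_node_serials(board: str, drive1: str,
                       drive2: str) -> Tuple[Optional[int], Optional[str]]:
    """Compare the three stored node serial numbers.

    Returns (node_error, agreed_serial). node_error is 543, 545, or
    None when neither of those two errors applies.
    """
    counts = Counter([board, drive1, drive2])
    serial, matches = counts.most_common(1)[0]
    if matches == 1:
        # No two locations agree: node error 543.
        return 543, None
    if drive1 == drive2 and drive1 != board:
        # The drives agree with each other but not with the board: 545.
        return 545, drive1
    # At least two locations agree and the board is one of them.
    return None, serial
```

The serial values passed to this function in any test are placeholders and do not represent real node serial numbers.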
550
A clustered system cannot be formed
because of a lack of clustered system
resources.
Explanation: The node cannot become active in a
clustered system because it is unable to connect to
enough clustered system resources. The clustered
system resources are the nodes in the system and the
active quorum disk. The node needs to be able to
connect to a majority of the resources before that group
forms an online clustered system. This prevents the
clustered system splitting into two or more active parts,
with both parts independently performing I/O.
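The majority rule described above can be illustrated with a short check. The following Python sketch is a simplification: the function name is hypothetical, every node and the active quorum disk are counted as one resource each, and the node is assumed to count itself among the resources it can reach.

```python
def can_form_system(reachable_resources: int, total_resources: int) -> bool:
    """Return True when a group holds a strict majority of the
    clustered system resources (nodes plus the active quorum disk)."""
    if not 0 <= reachable_resources <= total_resources:
        raise ValueError("reachable_resources must be between 0 and total")
    # A strict majority means more than half of all resources.
    return 2 * reachable_resources > total_resources
```

With two nodes and one quorum disk there are three resources. A node that reaches only itself holds one of three and cannot form a system, while a node that also reaches the quorum disk holds two of three and can. Two disjoint groups can never both hold more than half, which is why the rule prevents the system from splitting into independently active parts.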
Supplemental data that is displayed with this error
code lists the missing IDs for the 2145s and the quorum
disk controller. Each missing node is listed by its node
ID. A missing quorum disk is listed as
WWWWWWWWWWWWWWWW/LL, where
WWWWWWWWWWWWWWWW is a worldwide port
name (WWPN) on the disk controller that contains the
missing quorum disk and LL is the Logical Unit
Number (LUN) of the missing quorum disk on that
controller.
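A missing quorum disk identifier in the WWWWWWWWWWWWWWWW/LL form can be split into its two parts as follows. This Python sketch is illustrative: the function name is hypothetical, the WWPN is taken to be 16 hexadecimal digits, and the LUN is assumed to be written in hexadecimal digits and is returned as a string without conversion.

```python
import re
from typing import Tuple

# 16 hexadecimal digits (WWPN), a slash, then the LUN digits.
_QUORUM_ID = re.compile(r"^([0-9A-Fa-f]{16})/([0-9A-Fa-f]+)$")


def parse_missing_quorum(text: str) -> Tuple[str, str]:
    """Split a WWPN/LUN quorum disk identifier into (wwpn, lun)."""
    match = _QUORUM_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a WWPN/LUN identifier: {text!r}")
    return match.group(1).upper(), match.group(2)
```

The WWPN can then be compared with the ports that are listed for the disk controller when working through the connectivity checks.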
If the system topology is stretched and the number of
operational nodes is less than half, then node error
550 is displayed. In this case, the Site Disaster Recovery
feature cannot be used because the number of operational
nodes is less than the quorum required to create the
clustered system that uses the Site Disaster Recovery
feature.
User response: Follow troubleshooting procedures to
correct connectivity issues between the cluster nodes
and the quorum devices.
1. Check for any node errors that indicate issues with
Fibre Channel connectivity. Resolve any issues.
2. Ensure that the other nodes in the cluster are
powered on and operational.
3. Check the Fibre Channel port status. If any port is
not active, perform the Fibre Channel port problem
determination procedures.
4. Ensure that Fibre Channel network zoning changes
have not restricted communication between nodes
or between the nodes and the quorum disk.
5. Perform the problem determination procedures for
the network.
6. The quorum disk failed or cannot be accessed.
Perform the problem determination procedures for
the disk controller.
551
A cluster cannot be formed because of a
lack of cluster resources.
Explanation: The node does not have sufficient
connectivity to other nodes or the quorum device to
form a cluster.
Attempt to repair the fabric or quorum device to
establish connectivity. If a disaster occurred and the
nodes at the other site cannot be recovered, then it is
possible to allow the nodes at the surviving site to form
a system by using local storage.
User response: Follow troubleshooting procedures to
correct connectivity issues between the cluster nodes
and the quorum devices.
1. Check for any node errors that indicate issues with
Fibre Channel connectivity. Resolve any issues.
2. Ensure that the other nodes in the cluster are
powered on and operational.
3. Using the SAT GUI or CLI (sainfo lsservicestatus),
display the Fibre Channel port status. If any port is
not active, perform the Fibre Channel port problem
determination procedures.
4. Ensure that Fibre Channel network zoning changes
have not restricted communication between nodes
or between the nodes and the quorum disk.
5. Perform the problem determination procedures for
the network.
6. The quorum disk failed or cannot be accessed.
Perform the problem determination procedures for
the disk controller.
7. As a last resort, when the nodes at the other site
cannot be recovered, it is possible to allow the
nodes at the surviving site to form a system by
using local site storage:
To avoid data corruption, ensure that all host servers
that were previously accessing the system have had
all volumes unmounted or have been rebooted.
Ensure that the nodes at the other site are not
operational and are unable to form a system in the
future.
After starting this command, a full
resynchronization of all mirrored volumes is
completed when the other site is recovered. This is
likely to take many hours or days to complete.
Contact IBM support personnel if you are unsure.
Note: Before continuing, confirm that you have
taken the following actions. Failure to perform
these actions can lead to data corruption that is
undetected by the system but affects host
applications.
a. All host servers that were previously accessing
the system have had all volumes unmounted or
have been rebooted.
b. Ensure that the nodes at the other site are not
operating as a system and actions have been
taken to prevent them from forming a system in
the future.
After these actions have been taken, the satask
overridequorum command can be used to allow the
nodes at the surviving site to form a system that uses
local storage.
555
Power Domain error
Explanation: Both 2145s in an I/O group are
being powered by the same uninterruptible power
supply. The ID of the other 2145 is displayed with the
node error code on the front panel.
User response: Ensure that the configuration is correct
and that each 2145 in an I/O group is connected
to a separate uninterruptible power supply.
556
A duplicate WWNN has been detected.
Explanation: The node has detected another device
that has the same World Wide Node Name (WWNN)
on the Fibre Channel network. A WWNN is 16
hexadecimal digits long. For a cluster, the first 11 digits
are always 50050768010. The last 5 digits of the
WWNN are given in the additional data of the error
and appear on the front panel displays. The Fibre
Channel ports of the node are disabled to prevent
disruption of the Fibre Channel network. One or both
nodes with the same WWNN can show the error.
Because of the way WWNNs are allocated, a device
with a duplicate WWNN is normally another cluster
node.
User response: Follow troubleshooting procedures to
configure the WWNN of the node:
1. Find the cluster node with the same WWNN as the
node reporting the error. The WWNN for a cluster
node can be found from the node Vital Product
Data (VPD) or from the Node menu on the front
panel. The node with the duplicate WWNN need
not be part of the same cluster as the node
reporting the error; it could be remote from the
node reporting the error on a part of the fabric
connected through an inter-switch link. The WWNN
of the node is stored within the service controller, so
the duplication is most likely caused by the
replacement of a service controller.
2. If a cluster node with a duplicate WWNN is found,
determine whether it, or the node reporting the
error, has the incorrect WWNN. Generally, it is the
node that has had its service controller recently
replaced or had its WWNN changed incorrectly.
Also consider how the SAN is zoned when making
your decision.
3. Determine the correct WWNN for the node with the
incorrect WWNN. If the service controller has been
replaced as part of a service action, the WWNN for
the node should have been written down. If the
correct WWNN cannot be determined, contact your
support center for assistance.
4. Use the front panel menus to modify the incorrect
WWNN. If it is the node showing the error that
should be modified, this can safely be done
immediately. If it is an active node that should be
modified, use caution because the node will restart
when the WWNN is changed. If this node is the
only operational node in an enclosure, access to the
volumes that it is managing will be lost. You should
ensure that the host systems are in the correct state
before you change the WWNN.
5. If the node showing the error had the correct
WWNN, it can be restarted, using the front panel
power control button, after the node with the
duplicate WWNN is updated.
6. If you are unable to find a cluster node with the
same WWNN as the node showing the error, use
the SAN monitoring tools to determine whether
there is another device on the SAN with the same
WWNN. This device should not be using a WWNN
assigned to a cluster, so you should follow the
service procedures for the device to change its
WWNN. Once the duplicate has been removed,
restart the node canister.
558
The node is unable to communicate
with other nodes.
Explanation: The 2145 cannot see the Fibre Channel
fabric or the Fibre Channel adapter port speed might
be set to a different speed than the Fibre Channel
fabric.
User response: Ensure that:
1. The Fibre Channel network fabric switch is
powered on.
2. At least one Fibre Channel cable connects the 2145
to the Fibre Channel network fabric.
3. The Fibre Channel adapter port speed is equal to
that of the Fibre Channel fabric.
4. At least one Fibre Channel adapter is installed in
the 2145.
5. Go to the Fibre Channel MAP.
Possible Cause-FRUs or other:
• None
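The WWNN layout that node error 556 describes (16 hexadecimal digits, with 50050768010 as the first 11 digits for a cluster) can be checked with a short sketch when comparing values collected from the VPD or the front panel. The function names and the sample node names and WWNNs are hypothetical; this is an illustration, not an IBM tool.

```python
import re
from typing import Dict, List

CLUSTER_WWNN_PREFIX = "50050768010"  # first 11 digits, as stated for error 556


def cluster_wwnn_suffix(wwnn: str) -> str:
    """Validate a cluster WWNN and return its last 5 digits."""
    digits = wwnn.strip().upper()
    if not re.fullmatch(r"[0-9A-F]{16}", digits):
        raise ValueError("a WWNN is 16 hexadecimal digits long")
    if not digits.startswith(CLUSTER_WWNN_PREFIX):
        raise ValueError("not a cluster-assigned WWNN")
    return digits[len(CLUSTER_WWNN_PREFIX):]


def find_duplicate_wwnns(nodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each WWNN that appears more than once to the nodes that use it."""
    seen: Dict[str, List[str]] = {}
    for name, wwnn in nodes.items():
        seen.setdefault(wwnn.strip().upper(), []).append(name)
    return {wwnn: names for wwnn, names in seen.items() if len(names) > 1}


# Hypothetical inventory gathered by hand from each node.
inventory = {
    "node1": "500507680100A1B2",
    "node2": "500507680100a1b2",
    "node3": "500507680100FFFF",
}
print(find_duplicate_wwnns(inventory))
```

The comparison is case-insensitive because hexadecimal digits might be recorded in either case.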
560
Battery cabling fault
Explanation: A fault exists in one of the cables
connecting the battery backplane to the rest of the
system.
User response: Follow troubleshooting procedures to
fix the hardware:
1. Reseat the cable.
2. If reseating the cable does not fix the problem,
replace the cable.
3. If replacing the cable does not fix the problem,
replace the battery backplane.
561
Battery backplane or cabling fault
Explanation: Either the battery backplane has failed,
or the power or LPC cables connecting the battery
backplane to the rest of the system are not connected
properly.
User response: Follow troubleshooting procedures to
fix the hardware:
1. Check the cables connecting the battery backplane.
2. Reseat the power and LPC cables.
3. If reseating the cables does not fix the problem,
replace the cables.
4. If the cables are well connected but the problem
persists, replace the battery backplane.
5. Conduct the corrective service procedure described
in “1108” on page 206.
562
The node hardware configuration does
not meet the minimum requirements
Explanation: The node hardware is not at the
minimum specification for the node to become active in
a cluster. This may be because of hardware failure, but
is also possible after a service action has used an
incorrect replacement part.
User response: Follow troubleshooting procedures to
fix the hardware:
1. View node VPD information, to see whether
anything looks inconsistent. Compare the failing
node VPD with the VPD of a working node of the
same type. Pay particular attention to the number
and type of CPUs and memory.
2. Replace any incorrect parts.
564
Too many machine code crashes have
occurred.
Explanation: The node has been determined to be
unstable because of multiple resets. The cause of the
resets can be that the system encountered an
unexpected state or has executed instructions that were
not valid. The node has entered the service state so that
diagnostic data can be recovered.
The node error does not persist across restarts of the
machine code on the node.
User response: Follow troubleshooting procedures to
reload the machine code:
1. Get a support package (snap), including dumps,
from the node, using the management GUI or the
service assistant.
2. If more than one node is reporting this error,
contact IBM technical support for assistance. The
support package from each node will be required.
3. Check the support site to see whether the issue is
known and whether a machine code update exists
to resolve the issue. Update the cluster machine
code if a resolution is available. Use the manual
update process on the node that reported the error
first.
4. If the problem remains unresolved, contact IBM
technical support and send them the support
package.
Possible Cause-FRUs or other:
• None
565
The internal drive of the node is failing.
Explanation: The internal drive within the node is
reporting too many errors. It is no longer safe to rely
on the integrity of the drive. Replacement is
recommended.
User response: Follow troubleshooting procedures to
fix the hardware:
1. View hardware information.
2. Replace parts (canister or disk).
Possible Cause-FRUs or other:
• 2145-CG8 or 2145-CF8
– Disk drive (50%)
– Disk controller (30%)
– Disk backplane (10%)
– Disk power cable (1%)
– Disk signal cable (8%)
– System board (1%)
569
At boot time: the CPU reached a
temperature that is greater than or equal
to the warning threshold. During
normal running: the CPU reached a
temperature that is greater than or equal
to the critical threshold.
Explanation: At boot time: the CPU reached a
temperature that is greater than or equal to the
warning threshold. During normal running: the CPU
reached a temperature that is greater than or equal to
the critical threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Remove the top of the machine case and check for
missing baffles, damaged heat sinks, or internal
blockages.
2. If the problem persists, replace the CPU/heat sink.
Possible Cause-FRUs or other:
• CPU
• Heat sink
570
Battery protection unavailable
Explanation: The node cannot start because battery
protection is not available. Both batteries require user
intervention before they can become available.
User response: Follow troubleshooting procedures to
fix hardware.
The appropriate service action will be indicated by an
accompanying non-fatal node error. Examine the event
log to determine the accompanying node error.
571
Battery protection temporarily
unavailable; one battery is expected to
be available soon
Explanation: The node cannot start because battery
protection is not available. One battery is expected to
become available shortly with no user intervention
required, but the other battery will not become
available.
User response: Follow troubleshooting procedures to
fix hardware.
The appropriate service action will be indicated by an
accompanying non-fatal node error. Examine the event
log to determine the accompanying node error.
572
Battery protection temporarily
unavailable; both batteries are expected
to be available soon
Explanation: The node cannot start because battery
protection is not available. Both batteries are expected
to become available shortly with no user intervention
required.
User response: Wait for sufficient battery charge for
the enclosure to start.
573
The node machine code is inconsistent.
Explanation: Parts of the node machine code package
are receiving unexpected results; there may be an
inconsistent set of subpackages installed, or one
subpackage may be damaged.
User response: Follow troubleshooting procedures to
reload the machine code.
1. Follow the procedure to run a node rescue.
2. If the error occurs again, contact IBM technical
support.
Possible Cause-FRUs or other:
• None
574
The node machine code is damaged.
Explanation: A checksum failure has indicated that
the node machine code is damaged and needs to be
reinstalled.
User response:
1. If the other nodes are operational, run node rescue;
otherwise, install new machine code using the
service assistant. Node rescue failures, as well as the
repeated return of this node error after
reinstallation, are symptomatic of a hardware fault
with the node.
Possible Cause-FRUs or other:
• None
576
The cluster state and configuration data
cannot be read.
Explanation: The node has been unable to read the
saved cluster state and configuration data from its
internal drive because of a read or medium error.
User response: In the sequence shown, exchange the
FRUs for new FRUs.
Possible Cause-FRUs or other:
• 2145-CG8 or 2145-CF8
– Disk drive (50%)
– Disk controller (30%)
– Disk backplane (10%)
– Disk signal cable (8%)
– Disk power cable (1%)
– System board (1%)
578
The state data was not saved following
a power loss.
Explanation: On startup, the node was unable to read
its state data. When this happens, it expects to be
automatically added back into a clustered system.
However, if it is not joined to a clustered system within
60 seconds, it raises this node error. This error is a critical node
error, and user action is required before the node can
become a candidate to join a clustered system.
User response: Follow troubleshooting procedures to
correct connectivity issues between the clustered system
nodes and the quorum devices.
1. Manual intervention is required once the node
reports this error.
2. Attempt to reestablish the clustered system by using
other nodes. This step might involve fixing
hardware issues on other nodes or fixing
connectivity issues between nodes.
3. If you are able to reestablish the clustered system,
remove the system data from the node that shows
error 578 so it goes to a candidate state. It is then
automatically added back to the clustered system.
a. To remove the system data from the node, go to
the service assistant, select the radio button for
the node with a 578, click Manage System, and
then choose Remove System Data.
b. Or use the CLI command satask leavecluster
-force.
If the node does not automatically add back to the
clustered system, note the name and I/O group of
the node, and then delete the node from the
clustered system configuration (if this has not
already happened). Add the node back to the
clustered system using the same name and I/O
group.
4. If all nodes have either node error 578 or 550,
follow the recommended user response for node
error 550.
5. Attempt to determine what caused the nodes to
shut down.
Possible Cause—FRUs or other:
• None
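The branching in the user response for node error 578 can be summarized in a small sketch. The function name, the returned labels, and the sample node names are hypothetical; the sketch only restates the decision described above and does not contact a system.

```python
from typing import Dict, Optional


def plan_578_recovery(node_errors: Dict[str, Optional[int]]) -> str:
    """Pick the recovery path for a set of nodes.

    node_errors maps a node name to its node error code, or None when
    the node reports no error.
    """
    if not node_errors:
        raise ValueError("at least one node is required")
    codes = list(node_errors.values())
    # Step 4: every node shows 578 or 550, so the 550 response applies.
    if all(code in (550, 578) for code in codes):
        return "follow the user response for node error 550"
    # Steps 2 and 3: other nodes can re-form the system first.
    if any(code == 578 for code in codes):
        return "reestablish the system, then remove system data from the 578 nodes"
    return "no node shows error 578"


print(plan_578_recovery({"node1": 578, "node2": None}))
print(plan_578_recovery({"node1": 578, "node2": 550}))
```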
579
Battery subsystem has insufficient
charge to save system data
Explanation: Not enough capacity is available from
the battery subsystem to save system data in response
to a series of battery and boot-drive faults.
User response: Follow troubleshooting procedures to
fix hardware.
The appropriate service actions are indicated by the
series of battery and boot-drive faults. Examine the
event log to determine the accompanying faults. Service
the other faults.
580
The service controller ID could not be
read.
Explanation: The 2145 cannot read the unique ID from
the service controller, so the Fibre Channel adapters
cannot be started.
User response: In the sequence shown, exchange the
following FRUs for new FRUs.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
• Service controller (70%)
• Service controller cable (30%)
Service controller (100%)
Other:
• None
581
A serial link error in the 2145 UPS-1U
has occurred.
Explanation: There is a fault in the communications
cable, the serial interface in the uninterruptible power
supply 2145 UPS-1U, or 2145.
User response: Check that the communications cable
is correctly plugged into the 2145 and the 2145 UPS-1U.
If the cable is plugged in correctly, replace the FRUs in
the order shown.
Possible Cause-FRUs or other:
2145-CF8, or 2145-CG8
• 2145 power cable assembly (40%)
• 2145 UPS-1U assembly (30%)
• 2145 system board (30%)
582
A battery error in the 2145 UPS-1U has
occurred.
Explanation: A problem has occurred with the
uninterruptible power supply 2145 UPS-1U battery.
User response: Exchange the FRU for a new FRU.
After replacing the battery assembly, if the 2145
UPS-1U service indicator is on, press and hold the 2145
UPS-1U Test button for three seconds to start the
self-test and verify the repair. During the self-test, the
rightmost four LEDs on the 2145 UPS-1U front-panel
assembly flash in sequence.
Possible Cause-FRUs or other:
• UPS-1U battery assembly (50%)
• UPS-1U assembly (50%)
583
An electronics error in the 2145 UPS-1U
has occurred.
Explanation: A problem has occurred with the 2145
UPS-1U electronics.
User response: Exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
• 2145 UPS-1U assembly
584
The 2145 UPS-1U is overloaded.
Explanation: A problem with output overload has
been reported by the uninterruptible power supply
2145 UPS-1U. The Overload Indicator on the 2145
UPS-1U front panel is illuminated red.
User response:
1. Ensure that only one 2145 is receiving power from
the 2145 UPS-1U. Also ensure that no other devices
are connected to the 2145 UPS-1U.
2. Disconnect the 2145 from the 2145 UPS-1U. If the
Overload Indicator is still illuminated with the 2145
disconnected, replace the 2145 UPS-1U.
3. If the Overload Indicator is now off, and the node is
a 2145-CG8 or 2145-CF8, on the disconnected 2145,
with all outputs disconnected, determine whether it
is one of the two power supplies or the power cable
assembly that must be replaced. Plug just one
power cable into the left-hand power supply and
start the node and see whether the error is reported.
Then shut down the node and connect the other
power cable into the left-hand power supply and
start the node and see whether the error is repeated.
Then repeat the two tests for the right-hand power
supply. If the error is repeated for both cables on
one power supply but not the other, replace the
power supply that showed the error; otherwise,
replace the power cable assembly.
Possible Cause-FRUs or other:
• Power cable assembly (45%)
• Power supply assembly (45%)
• UPS-1U assembly (10%)
586
The power supply to the 2145 UPS-1U
does not meet requirements.
Explanation: None.
User response: Follow troubleshooting procedures to
fix the hardware.
587
An incorrect type of uninterruptible
power supply has been detected.
Explanation: An incorrect type of 2145 UPS-1U was
installed.
User response: Exchange the 2145 UPS-1U for one of
the correct type.
Possible Cause-FRUs or other:
• 2145 UPS-1U (100%)
588
The 2145 UPS-1U is not cabled correctly.
Explanation: The signal cable or the 2145 power
cables are probably not connected correctly. The power
cable and signal cable might be connected to different
2145 UPS-1U assemblies.
User response:
1. Connect the cables correctly.
2. Restart the node.
Possible Cause-FRUs or other:
• None.
Other:
• Cabling error (100%)
589
The 2145 UPS-1U ambient temperature
limit has been exceeded.
Explanation: The ambient temperature threshold for
the 2145 UPS-1U has been exceeded.
User response: Reduce the temperature around the
system:
1. Turn off the 2145 UPS-1U and unplug it from the
power source.
2. Clear the vents and remove any heat sources.
3. Ensure that the air flow around the 2145 UPS-1U is
not restricted.
4. Wait at least five minutes, and then restart the 2145
UPS-1U. If the problem remains, exchange the 2145
UPS-1U assembly.
590
Repetitive node transitions into standby
mode from normal mode because of
power subsystem-related node errors.
Explanation: Multiple node restarts occurred because
of 2145 UPS-1U errors, which can be reported on any
node type.
This error means that the node made the transition into
standby from normal mode because of power
subsystem-related node errors too many times within a
short period. "Too many times" is defined as three, and
"a short period" is defined as 1 hour. This error alerts the
user that something might be wrong with the power
subsystem, because it is not normal for the node to
repeatedly go in and out of standby.
If the actions of the tester or engineer are expected to
cause many frequent transitions from normal to
standby and back, then this error does not imply that
there is any actual fault with the system.
User response: Follow troubleshooting procedures to
fix the hardware:
1. Verify that the room temperature is within specified
limits and that the input power is stable.
2. If a 2145 UPS-1U is connected, verify that the 2145
UPS-1U signal cable is fastened securely at both
ends.
3. Look in the system event log for the node error
that is repeating.
Note: The condition is reset by powering off the node
from the node front panel.
650
The canister battery is not supported
Explanation: The canister battery shows product data
that indicates it cannot be used with the code version
of the canister.
User response: Resolve this error either by obtaining a
battery that is supported by the system's code level,
or by updating the canister's code level to a level that
supports the battery.
1. Remove the canister and its lid and check that the
FRU part number of the new battery matches that of
the replaced battery. Obtain the correct FRU part if it
does not.
2. If the canister has just been replaced, check the code
level of the partner node canister and use the
service assistant to update this canister's code level
to the same level.
Possible cause-FRUs or other cause
• canister battery
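The trigger that node error 590 describes (three transitions into standby within 1 hour) can be sketched as a sliding-window check over timestamps taken from the event log. The function name is hypothetical, and treating a span of exactly 1 hour as inside the window is an assumption; the guide does not state how the boundary is handled.

```python
from datetime import datetime, timedelta
from typing import Iterable

WINDOW = timedelta(hours=1)  # "a short period" in the explanation of 590
THRESHOLD = 3                # "too many times" in the explanation of 590


def too_many_standby_transitions(times: Iterable[datetime]) -> bool:
    """Return True if THRESHOLD transitions fall within one WINDOW."""
    ordered = sorted(times)
    for i in range(len(ordered) - THRESHOLD + 1):
        # Compare the first and last of each run of THRESHOLD events.
        if ordered[i + THRESHOLD - 1] - ordered[i] <= WINDOW:
            return True
    return False


start = datetime(2015, 1, 1, 12, 0)
burst = [start, start + timedelta(minutes=20), start + timedelta(minutes=50)]
spread = [start, start + timedelta(minutes=40), start + timedelta(minutes=90)]
print(too_many_standby_transitions(burst))
print(too_many_standby_transitions(spread))
```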
651
The canister battery is missing
Explanation: The canister battery cannot be detected.
User response:
1. Use the remove and replace procedures to remove
the node canister and its lid.
2. Use the remove and replace procedures to install a
battery.
3. If a battery is present, ensure that it is fully
inserted. Replace the canister.
4. If this error persists, use the remove and replace
procedures to replace the battery.
Possible cause-FRUs or other cause
• Canister battery
652
The canister battery has failed
Explanation: The canister battery has failed. The
battery may be showing an error state, it may have
reached the end of life, or it may have failed to charge.
Data
Number indicators with failure reasons
• 1 - battery reports a failure
• 2 - end of life
• 3 - failure to charge
User response:
1. Use the remove and replace procedures to replace
the battery.
Possible cause-FRUs or other cause
• canister battery
653
The canister battery’s temperature is too
low
Explanation: The canister battery’s temperature is
below its minimum operating temperature.
User response:
• Wait for the battery to warm up; the error will clear
when its minimum working temperature is reached.
• If the error persists for more than an hour when the
ambient temperature is normal, use the remove and
replace procedures to replace the battery.
Possible cause-FRUs or other cause
• canister battery
654
The canister battery’s temperature is too
high
Explanation: The canister battery’s temperature is
above its safe operating temperature.
User response:
• If necessary, reduce the ambient temperature.
• Wait for the battery to cool down; the error will clear
when normal working temperature is reached. Keep
checking the reported error as the system may
determine the battery has failed.
• If the node error persists for more than two hours
after the ambient temperature returns to the normal
operating range, use the remove and replace
procedures to replace the battery.
Possible cause-FRUs or other cause
• canister battery
655
Canister battery communications fault.
Explanation: The canister cannot communicate with
the battery.
User response:
• Use the remove and replace procedures to replace
the battery.
• If the node error persists, use the remove and replace
procedures to replace the node canister.
Possible Cause-FRUs or other cause:
• Canister battery
• Node canister
656
The canister battery has insufficient
charge
Explanation: The canister battery has insufficient
charge to save the canister’s state and cache data to the
internal drive if power were to fail.
User response:
• Wait for the battery to charge; the battery does not
need to be fully charged for the error to
automatically clear.
Possible cause-FRUs or other cause
• none
657
Not enough battery charge to support
graceful shutdown of the storage
enclosure.
Explanation: Insufficient power available for the
enclosure.
User response: If a battery is missing, failed or having
a communication error, replace the battery.
If a battery is failed, replace the battery.
If a battery is charging, this error should go away when
the battery is charged.
If a battery is too hot, the system can be started after it
has cooled.
If running on a single power supply with low input
power (110 V AC), "low voltage" will be seen in the
extra data. If this is the case, the failed or missing
power supply should be replaced. This will only
happen if a single power supply is running with input
power that is too low.
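The numeric failure indicators listed in the Data section of node error 652 lend themselves to a simple lookup when reading the error data. This is a minimal sketch; the names are hypothetical, and only the three values that the guide lists are covered.

```python
# Failure reasons as listed in the Data section of node error 652.
BATTERY_FAILURE_REASONS = {
    1: "battery reports a failure",
    2: "end of life",
    3: "failure to charge",
}


def describe_battery_failure(indicator: int) -> str:
    """Translate a 652 data indicator into its failure reason."""
    try:
        return BATTERY_FAILURE_REASONS[indicator]
    except KeyError:
        # Values other than 1, 2, and 3 are not described in the guide.
        raise ValueError("unknown failure indicator: %r" % indicator)


print(describe_battery_failure(2))
```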
668
The remote setting is not available for
users for the current system.
Explanation: On the current systems, users cannot be
set to remote.
User response: Any user defined on the system must
be a local user. To create a remote user the user must
not be defined on the local system.
670
The UPS battery charge is not enough to
allow the node to start.
Explanation: The uninterruptible power supply
connected to the node does not have sufficient battery
charge for the node to safely become active in a cluster.
The node will not start until a sufficient charge exists to
store the state and configuration data held in the node
memory if power were to fail. The front panel of the
node will show "charging".
User response: Wait for sufficient battery charge for
enclosure to start:
1. Wait for the node to automatically fix the error
when there is sufficient charge.
2. Ensure that no error conditions are indicated on the
uninterruptible power supply.
671
The available battery charge is not
enough to allow the node canister to
start. Two batteries are charging.
Explanation: The battery charge within the enclosure
is not sufficient for the node to safely become active in
a cluster. The node will not start until sufficient charge
exists to store the state and configuration data held in
the node canister memory if power were to fail. Two
batteries are within the enclosure, one in each of the
power supplies. Neither of the batteries indicates an
error; both are charging.
The node will start automatically when sufficient
charge is available. The batteries do not have to be
fully charged before the nodes can become active.
Both nodes within the enclosure share the battery
charge, so both node canisters report this error. The
service assistant shows the estimated start time in the
node canister hardware details.
User response: Wait for the node to automatically fix
the error when sufficient charge becomes available.
Possible Cause-FRUs or other:
• None
672
The available battery charge is not
enough to allow the node canister to
start. One battery is charging.
Explanation: The battery charge within the enclosure
is not sufficient for the node to safely become active in
a cluster. The node will not start until sufficient charge
exists to store the state and configuration data held in
the node canister memory if power were to fail. Two
batteries are within the enclosure, one in each of the
power supplies. Only one of the batteries is charging,
so the time to reach sufficient charge will be extended.
The node will start automatically when sufficient
charge is available. The batteries do not have to be
fully charged before the nodes can become active.
Both nodes within the enclosure share the battery
charge, so both node canisters report this error.
The service assistant shows the estimated start time,
and the battery status, in the node canister hardware
details.
User response:
1. Wait for the node to automatically fix the error
when sufficient charge becomes available.
2. If possible, determine why one battery is not
charging. Use the battery status shown in the node
canister hardware details and the indicator LEDs on
the PSUs in the enclosure to diagnose the problem.
If the issue cannot be resolved, wait until the cluster
is operational and use the troubleshooting options
in the management GUI to assist in resolving the
issue.
Possible Cause-FRUs or other:
• Battery (33%)
• Control power supply (33%)
• Power cord (33%)
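Node errors 671, 672, and 673 share the same condition (not enough battery charge for the node canister to start) and differ only in how many of the two enclosure batteries are charging. The sketch below restates that relationship; the function name is hypothetical, and it assumes that the caller already knows the charge is insufficient.

```python
BATTERIES_PER_ENCLOSURE = 2  # one in each power supply, per errors 671 and 672


def insufficient_charge_error(charging_batteries: int) -> int:
    """Return the node error that matches the number of charging batteries."""
    if not 0 <= charging_batteries <= BATTERIES_PER_ENCLOSURE:
        raise ValueError("an enclosure holds two batteries")
    # 2 charging -> 671, 1 charging -> 672, none charging -> 673
    return {2: 671, 1: 672, 0: 673}[charging_batteries]


for count in (2, 1, 0):
    print(count, insufficient_charge_error(count))
```

Only 673 requires a repair before the system can start; for 671 and 672 the node starts automatically when sufficient charge is available.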
673
The available battery charge is not
enough to allow the node canister to
start. No batteries are charging.
Explanation: A node cannot be in active state if it
does not have sufficient battery power to store
configuration and cache data from memory to internal
disk after a power failure. The system has determined
that both batteries have failed or are missing. The
problem with the batteries must be resolved to allow
the system to start.
User response: Follow troubleshooting procedures to
fix hardware:
1. Resolve problems in both batteries by following the
procedure to determine status using the LEDs.
2. If the LEDs do not show a fault on the power
supplies or batteries, power off both power supplies
in the enclosure and remove the power cords. Wait
20 seconds, then reconnect the power cords and
restore power to both power supplies. If both node
canisters continue to report this error, replace the
enclosure chassis.
Possible Cause-FRUs or other:
• Battery (33%)
• Power supply (33%)
• Power cord (33%)
• Enclosure chassis (1%)
674
The cycling mode of a Metro Mirror
object cannot be changed.
Explanation: The cycling mode may only be set for
Global Mirror objects. Metro Mirror objects cannot have
a cycling mode defined.
User response: The object's type must be set to 'global'
before or when setting the cycling mode.
690
The node is held in the service state.
Explanation: The node is in service state and has been
instructed to remain in service state. While in service
state, the node will not run as part of a cluster. A node
must not be in service state for longer than necessary
while the cluster is online because a loss of redundancy
will result. A node can be set to remain in service state
either because of a service assistant user action or
because the node was deleted from the cluster.
User response: When it is no longer necessary to hold
the node in the service state, exit the service state to
allow the node to run:
1. Use the service assistant action to release the service
state.
Possible Cause-FRUs or other:
• none
700
The Fibre Channel adapter that was
previously present has not been
detected.
Explanation: A Fibre Channel adapter that was
previously present has not been detected. The adapter
might not be correctly installed, or it might have failed.
This node error does not, in itself, stop the node
canister from becoming active in the system; however,
the Fibre Channel network might be used to
communicate between the node canisters in a clustered
system. It is possible that this node error indicates why
the critical node error 550 "A cluster cannot be formed
because of a lack of cluster resources" is reported
on the node canister.
Data:
• Location - A number indicating the adapter location.
The location indicates an adapter slot. See the node
canister description for the definition of the adapter
slot locations.
User response:
1. If possible, this noncritical node error should be
serviced using the management GUI and running
the recommended actions for the service error code.
2. There are a number of possibilities.
a. If you have deliberately removed the adapter
(possibly replacing it with a different adapter
type), you will need to follow the management
GUI recommended actions to mark the
hardware change as intentional.
b. If the previous steps have not isolated the
problem, use the remove and replace procedures
to replace the adapter. If this does not fix the
problem, replace the system board.
Possible Cause-FRUs or other cause:
• Fibre Channel adapter
• System board
701
A Fibre Channel adapter has failed.
Explanation: A Fibre Channel adapter has failed.
This node error does not, in itself, stop the node
becoming active in the system. However, the Fibre
Channel network might be being used to communicate
between the nodes in a clustered system. Therefore, it
is possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Use the remove and replace procedures to replace
the adapter. If this does not fix the problem, replace
the system board.
Possible Cause-FRUs or other cause:
v Fibre Channel adapter
v System board
702
A Fibre Channel adapter has a PCI
error.
Explanation: A Fibre Channel adapter has a PCI error.
This node error does not, in itself, stop the node from
becoming active in the system. However, the Fibre
Channel network might be being used to communicate
between the nodes in a clustered system. Therefore, it
is possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Use the remove and replace procedures to replace
the adapter. If this does not fix the problem, replace
the system board.
Possible Cause-FRUs or other cause:
v Fibre Channel adapter
v System board
703
A Fibre Channel adapter is degraded.
Explanation: A Fibre Channel adapter is degraded.
This node error does not, in itself, stop the node
becoming active in the system. However, the Fibre
Channel network might be being used to communicate
between the nodes in a clustered system. Therefore, it
is possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Use the remove and replace procedures to replace
the adapter. If this does not fix the problem, replace
the system board.
Possible Cause-FRUs or other cause:
v Fibre Channel adapter
v System board
704
Fewer Fibre Channel ports operational.
Explanation: A Fibre Channel port that was
previously operational is no longer operational. The
physical link is down.
This node error does not, in itself, stop the node
becoming active in the system. However, the Fibre
Channel network might be being used to communicate
between the nodes in a clustered system. Therefore, it
is possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node.
Data:
Three numeric values are listed:
v The ID of the first unexpected inactive port. This ID
is a decimal number.
v The ports that are expected to be active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is expected to be active.
v The ports that are actually active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is active.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Possibilities:
v If the port has been intentionally disconnected,
use the management GUI recommended action
for the service error code and acknowledge the
intended change.
v Check that the Fibre Channel cable is connected
at both ends and is not damaged. If necessary,
replace the cable.
v Check that the switch port or other device that
the cable is connected to is powered and enabled
in a compatible mode. Rectify any issue. The
device service interface might indicate the issue.
v Use the remove and replace procedures to replace
the SFP transceiver in the 2145 node and the SFP
transceiver in the connected switch or device.
v Use the remove and replace procedures to replace
the adapter.
Possible Cause-FRUs or other cause:
v Fibre Channel cable
v SFP transceiver
v Fibre Channel adapter
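The hexadecimal port values in the Data section of node error 704 can be decoded by hand, but a short script makes the bit positions easier to read. The following is a minimal sketch, not part of the product: the function names and the sample values are hypothetical, and only the encoding (least significant bit represents port 1, a bit of 1 means expected or active) comes from the description above.

```python
def decode_port_mask(mask_hex: str) -> list[int]:
    """Return the 1-based port numbers whose bits are set in a hexadecimal mask."""
    mask = int(mask_hex, 16)
    ports = []
    port = 1  # the least significant bit represents port 1
    while mask:
        if mask & 1:
            ports.append(port)
        mask >>= 1
        port += 1
    return ports


def unexpected_inactive_ports(expected_hex: str, actual_hex: str) -> list[int]:
    """Return ports that are expected to be active but are not active."""
    expected = int(expected_hex, 16)
    actual = int(actual_hex, 16)
    missing = expected & ~actual  # bits set in expected and clear in actual
    return decode_port_mask(format(missing, "x"))


if __name__ == "__main__":
    # Hypothetical sample values: ports 1-4 expected (F), ports 1, 2, and 4 active (B).
    print(decode_port_mask("F"))
    print(unexpected_inactive_ports("F", "B"))
```

The lowest port number returned by unexpected_inactive_ports should correspond to the first value in the Data section, the ID of the first unexpected inactive port.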
705
Fewer Fibre Channel I/O ports
operational.
Explanation: One or more Fibre Channel I/O ports
that have previously been active are now inactive. This
situation has continued for one minute.
A Fibre Channel I/O port might be established on
either a Fibre Channel platform port or an Ethernet
platform port using FCoE. This error is expected if the
associated Fibre Channel or Ethernet port is not
operational.
Data:
Three numeric values are listed:
v The ID of the first unexpected inactive port. This ID
is a decimal number.
v The ports that are expected to be active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is expected to be active.
v The ports that are actually active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is active.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Follow the procedure for mapping I/O ports to
platform ports to determine which platform port is
providing this I/O port.
3. Check for any 704 (Fibre Channel platform port
not operational) or 724 (Ethernet platform port
not operational) node errors reported for the
platform port.
4. Possibilities:
v If the port has been intentionally disconnected,
use the management GUI recommended action
for the service error code and acknowledge the
intended change.
v Resolve the 704 or 724 error.
v If this is an FCoE connection, use the information
the view gives about the Fibre Channel forwarder
(FCF) to troubleshoot the connection between the
port and the FCF.
Possible Cause-FRUs or other cause:
v None
706
Fibre Channel clustered system path
failure.
Explanation: One or more Fibre Channel (FC)
input/output (I/O) ports that have previously been
able to see all required online nodes can no longer see
them. This situation has continued for 5 minutes. This
error is not reported unless a node is active in a
clustered system.
A Fibre Channel I/O port might be established on
either a FC platform port or an Ethernet platform port
using Fibre Channel over Ethernet (FCoE).
Data:
Three numeric values are listed:
v The ID of the first FC I/O port that does not have
connectivity. This is a decimal number.
v The ports that are expected to have connections. This
is a hexadecimal number. Each bit position
represents a port, with the least significant bit
representing port 1. The bit is 1 if the port is
expected to have a connection to all online nodes.
v The ports that actually have connections. This is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port has a connection to all online
nodes.
User response:
1. If possible, this noncritical node error should be
serviced using the management GUI and running
the recommended actions for the service error code.
2. Follow the procedure: Mapping I/O ports to
platform ports to determine which platform port
does not have connectivity.
3. There are a number of possibilities.
v If the port’s connectivity has been intentionally
reconfigured, use the management GUI
recommended action for the service error code
and acknowledge the intended change. You must
have at least two I/O ports with connections to
all other nodes.
v Resolve other node errors relating to this
platform port or I/O port.
v Check that the SAN zoning is correct.
Possible Cause-FRUs or other cause:
v None
710
The SAS adapter that was previously
present has not been detected.
Explanation: A SAS adapter that was previously
present has not been detected. The adapter might not
be correctly installed or it might have failed.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Possibilities:
v If the adapter has been intentionally removed,
use the management GUI recommended actions
for the service error code, to acknowledge the
change.
v Use the remove and replace procedures to
remove and open the node and check that the
adapter is fully installed.
v If the previous steps have not isolated the
problem, use the remove and replace procedures
to replace the adapter. If this does not fix the
problem, replace the system board.
Possible Cause-FRUs or other cause:
v SAS adapter
v System board
711
A SAS adapter has failed.
Explanation: A SAS adapter has failed.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Use the remove and replace procedures to replace
the adapter. If this does not fix the problem, replace
the system board.
Possible Cause-FRUs or other cause:
v High-speed SAS adapter
v System board
712
A SAS adapter has a PCI error.
Explanation: A SAS adapter has a PCI error.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Replace the adapter using the remove and replace
procedures. If this does not fix the problem, replace
the system board.
Possible Cause-FRUs or other cause:
v High-speed SAS adapter
v System board
713
A SAS adapter is degraded.
Explanation: A SAS adapter is degraded.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Use the remove and replace procedures to replace
the adapter. If this does not fix the problem, replace
the system board.
Possible Cause-FRUs or other cause:
v High-speed SAS adapter
v System board
715
Fewer SAS host ports operational
Explanation: A SAS port that was previously
operational is no longer operational. The physical link
is down.
Data:
Three numeric values are listed:
v The ID of the first unexpected inactive port. This ID
is a decimal number.
v The ports that are expected to be active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is expected to be active.
v The ports that are actually active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is active.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Possibilities:
v If the port has been intentionally disconnected,
use the management GUI recommended action
for the service error code and acknowledge the
intended change.
v Check that the SAS cable is connected at both
ends and is not damaged. If necessary, replace the
cable.
v Check that the switch port or other device that
the cable is connected to is powered and enabled
in a compatible mode. Rectify any issue. The
device service interface might indicate the issue.
v Use the remove and replace procedures to replace
the adapter.
Possible Cause-FRUs or other cause:
v SAS cable
v SAS adapter
720
Ethernet adapter that was previously
present has not been detected.
Explanation: An Ethernet adapter that was previously
present has not been detected. The adapter might not
be correctly installed or it might have failed.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations. If the location is 0, the adapter integrated
into the system board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. If the adapter location is 0, use the remove and
replace procedures to replace the system board.
3. If the location is not 0, there are a number of
possibilities:
a. Use the remove and replace procedures to
remove and open the node and check that the
adapter is fully installed.
b. If the previous steps have not located and
isolated the problem, use the remove and
replace procedures to replace the adapter. If this
does not fix the problem, replace the system
board.
Possible Cause-FRUs or other cause:
v Ethernet adapter
v System board
721
An Ethernet adapter has failed.
Explanation: An Ethernet adapter has failed.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations. If the location is 0, the adapter integrated
into the system board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. If the adapter location is 0, use the remove and
replace procedures to replace the system board.
3. If the adapter location is not 0, use the remove and
replace procedures to replace the adapter. If this
does not fix the problem, replace the system board.
Possible Cause-FRUs or other cause:
v Ethernet adapter
v System board
722
An Ethernet adapter has a PCI error.
Explanation: An Ethernet adapter has a PCI error.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations. If the location is 0, the adapter integrated
into the system board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. If the adapter location is 0, use the remove and
replace procedures to replace the system board.
3. If the adapter location is not 0, use the remove and
replace procedures to replace the adapter. If this
does not fix the problem, replace the system board.
Possible Cause-FRUs or other cause:
v Ethernet adapter
v System board
723
An Ethernet adapter is degraded.
Explanation: An Ethernet adapter is degraded.
Data:
v A number indicating the adapter location. The
location indicates an adapter slot. See the node
description for the definition of the adapter slot
locations. If the location is 0, the adapter integrated
into the system board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. If the adapter location is 0, use the remove and
replace procedures to replace the system board.
3. If the adapter location is not 0, use the remove and
replace procedures to replace the adapter. If this
does not fix the problem, replace the system board.
Possible Cause-FRUs or other cause:
v Ethernet adapter
v System board
724
Fewer Ethernet ports active.
Explanation: An Ethernet port that was previously
operational is no longer operational. The physical link
is down.
Data:
Three numeric values are listed:
v The ID of the first unexpected inactive port. This is a
decimal number.
v The ports that are expected to be active. This is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is expected to be active.
v The ports that are actually active. This is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is active.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Possibilities:
a. If the port has been intentionally disconnected,
use the management GUI recommended action
for the service error code and acknowledge the
intended change.
b. Make sure the Ethernet cable is connected at
both ends and is undamaged. If necessary,
replace the cable.
c. Check that the switch port, or other device the
cable is connected to, is powered and enabled in
a compatible mode. Rectify any issue. The
device service interface might indicate the issue.
d. If this is a 1 Gbps port, use the remove and
replace procedures to replace the SFP transceiver
in the SAN Volume Controller and the SFP
transceiver in the connected switch or device.
e. Replace the adapter or the system board
(depending on the port location) by using the
remove and replace procedures.
Possible Cause-FRUs or other cause:
v Ethernet cable
v Ethernet SFP transceiver
v Ethernet adapter
v System board
730
The bus adapter has not been detected.
Explanation: The bus adapter that connects the
canister to the enclosure midplane has not been
detected.
This node error does not, in itself, stop the node
canister becoming active in the system. However, the
bus might be being used to communicate between the
node canisters in a clustered system. Therefore, it is
possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node canister.
Data:
v A number indicating the adapter location. Location 0
indicates that the adapter integrated into the system
board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. As the adapter is located on the system board,
replace the node canister using the remove and
replace procedures.
Possible Cause-FRUs or other cause:
v Node canister
731
The bus adapter has failed.
Explanation: The bus adapter that connects the
canister to the enclosure midplane has failed.
This node error does not, in itself, stop the node
canister becoming active in the system. However, the
bus might be being used to communicate between the
node canisters in a clustered system. Therefore, it is
possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node canister.
Data:
v A number indicating the adapter location. Location 0
indicates that the adapter integrated into the system
board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. As the adapter is located on the system board,
replace the node canister using the remove and
replace procedures.
Possible Cause-FRUs or other cause:
v Node canister
732
The bus adapter has a PCI error.
Explanation: The bus adapter that connects the
canister to the enclosure midplane has a PCI error.
This node error does not, in itself, stop the node
canister becoming active in the system. However, the
bus might be being used to communicate between the
node canisters in a clustered system; therefore it is
possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node canister.
Data:
v A number indicating the adapter location. Location 0
indicates that the adapter integrated into the system
board is being reported.
User response:
1. If possible, this noncritical node error should be
serviced using the management GUI and running
the recommended actions for the service error code.
2. As the adapter is located on the system board,
replace the node canister using the remove and
replace procedures.
Possible Cause-FRUs or other cause:
v Node canister
733
The bus adapter is degraded.
Explanation: The bus adapter that connects the
canister to the enclosure midplane is degraded.
This node error does not, in itself, stop the node
canister from becoming active in the system. However,
the bus might be being used to communicate between
the node canisters in a clustered system. Therefore, it is
possible that this node error indicates the reason why
the critical node error 550 A cluster cannot be formed
because of a lack of cluster resources is reported
on the node canister.
Data:
v A number indicating the adapter location. Location 0
indicates that the adapter integrated into the system
board is being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. As the adapter is located on the system board,
replace the node canister using the remove and
replace procedures.
Possible Cause-FRUs or other cause:
v Node canister
734
Fewer bus ports.
Explanation: One or more PCI bus ports that have
previously been active are now inactive. This condition
has existed for over one minute. That is, the internode
link has been down at the protocol level.
This could be a link issue but is more likely caused by
the partner node unexpectedly failing to respond.
Data:
Three numeric values are listed:
v The ID of the first unexpected inactive port. This is a
decimal number.
v The ports that are expected to be active. This is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is expected to be active.
v The ports that are actually active. This is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is active.
User response:
1. If possible, this noncritical node error should be
serviced using the management GUI and running
the recommended actions for the service error code.
2. Follow the procedure for getting node canister and
clustered-system information and determine the
state of the partner node canister in the enclosure.
Fix any errors reported on the partner node canister.
3. Use the remove and replace procedures to replace
the enclosure.
Possible Cause-FRUs or other cause:
v Node canister
v Enclosure midplane
736
The temperature of a device on the
system board is greater than or equal to
the warning threshold.
Explanation: The temperature of a device on the
system board is greater than or equal to the warning
threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Remove the top of the machine case and check for
missing baffles, damaged heat sinks, or internal
blockages.
2. If the problem persists, replace the system board.
Possible Cause-FRUs or other:
v System board
737
The temperature of a power supply is
greater than or equal to the warning or
critical threshold.
Explanation: The temperature of a power supply is
greater than or equal to the warning or critical
threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Remove the top of the machine case and check for
missing baffles, damaged heat sinks, or internal
blockages.
2. If the problem persists, replace the power supply.
Possible Cause-FRUs or other:
v Power supply
738
The temperature of a PCI riser card is
greater than or equal to the warning
threshold.
Explanation: The temperature of a PCI riser card is
greater than or equal to the warning threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Remove the top of the machine case and check for
missing PCI riser card 2, missing baffles, or internal
blockages.
2. Check all of the PCI cards plugged into the riser
that is identified by the extra data to find if any are
faulty, and replace as necessary.
3. If the problem persists, replace the PCI riser.
Possible Cause-FRUs or other:
v PCI riser
740
The command failed because of a
wiring error described in the event log.
Explanation: Excluding a SAS port while the topology
is invalid is dangerous, so the command is not
permitted, to avoid any potential loss of data
access.
User response: Correct the topology, then retry the
command.
741
CPU missing
Explanation: A CPU that was previously present has
not been detected. The CPU might not be correctly
installed or it might have failed.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Select one of the following actions:
v If removing the CPU was deliberate, follow the
management GUI recommended actions to mark
the hardware change as intentional.
v If it is not possible to isolate the problem, use the
remove and replace procedures to replace the
CPU.
v Replace the system board.
743
A boot drive is offline, missing, out of
sync, or the persistent data is not usable.
Explanation: A boot drive is offline, missing, out of
sync, or the persistent data is not usable.
User response: Look at a boot drive view to
determine the problem.
1. If slot status is out of sync, then re-sync the boot
drives by running the command satask
chbootdrive.
2. If slot status is missing, then put the original drive
back in this slot or install a FRU drive.
3. If slot status is failed, then replace the drive.
Possible Cause-FRUs or other:
v Boot drive
744
A boot drive is in the wrong location.
Explanation: A boot drive is in the wrong slot or
comes from another SAN Volume Controller 2145-DH8
node.
User response: Look at a boot drive view to
determine the problem.
1. Replace the boot drive with the correct drive and
put this drive back in the node from which it came.
2. Sync the boot drive if you choose to use it in this
node.
Possible Cause-FRUs or other:
v None
745
A boot drive is in an unsupported slot.
Explanation: A boot drive is in an unsupported slot.
For SAN Volume Controller 2145-DH8, this means that
at least one of the first two drives is online and at
least one invalid slot (3-8) is occupied.
User response: Look at a boot drive view to
determine which invalid slot(s) are occupied and
remove the drive(s).
Possible Cause-FRUs or other:
v None
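The condition described for node error 745 on the 2145-DH8 can be expressed as a small check. This is an illustrative sketch only: the function name and the way slot occupancy is passed in are hypothetical, and treating "the first two drives" as slots 1 and 2 is an assumption. The boot drive view remains the place to read the real slot status.

```python
VALID_BOOT_SLOTS = {1, 2}               # assumption: the "first two drives" are slots 1 and 2
INVALID_BOOT_SLOTS = set(range(3, 9))   # slots 3-8, as stated in the explanation


def unsupported_boot_slots(occupied_slots, online_slots):
    """Return the occupied invalid slots if the 745 condition holds, else an empty list.

    occupied_slots: slot numbers that contain a drive (hypothetical input)
    online_slots:   slot numbers whose drive is online (hypothetical input)
    """
    # The condition needs at least one of the first two drives to be online...
    has_online_valid_drive = any(slot in VALID_BOOT_SLOTS for slot in online_slots)
    # ...and at least one invalid slot to be occupied.
    invalid = sorted(slot for slot in occupied_slots if slot in INVALID_BOOT_SLOTS)
    return invalid if has_online_valid_drive else []


if __name__ == "__main__":
    # Hypothetical example: drives in slots 1, 2, and 5, with slots 1 and 2 online.
    print(unsupported_boot_slots({1, 2, 5}, {1, 2}))
```

The returned slot numbers are the drives that the user response asks you to remove.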
746
Technician port connection invalid.
Explanation: The code has detected more than one
MAC address through the connection, or the DHCP has
given out more than one address. The code thus
believes there is a switch attached.
User response:
1. Plug a cable from the technician port to a switch,
and plug 2 or more machines into that switch. They
must have IP addresses in the range 192.168.0.1 -
192.168.0.30.
2. Request a DHCP lease to trigger the detection.
747
The Technician port is being used.
Explanation: The Technician port is active and being
used.
User response: No service action is required. Use the
workstation to configure the node.
748
The technician port is enabled.
Explanation: The technician port is enabled initially
for easy configuration, and then disabled, so that the
port can be used for iSCSI connection. When all
connectivity to the node fails, the technician port can be
reenabled for emergency use but must not remain
enabled. This event is to remind you to disable the
technician port. While the technician port is enabled, do
not connect it to the LAN/SAN.
User response: Complete the following step to resolve
this problem.
1. Turn off the technician port by using the following
CLI command:
satask chserviceip -techport disable
Possible Cause-FRUs or other:
v N/A
766
CMOS battery failure.
Explanation: CMOS battery failure.
User response: Replace the CMOS battery.
Possible Cause-FRUs or other:
v CMOS battery
768
Ambient temperature warning.
Explanation: The ambient temperature of the node is
close to the point where it stops performing I/O and
enters a service state. The node is currently continuing
to operate.
Data:
v A text string identifying the thermal sensor reporting
the warning level and the current temperature in
degrees (Celsius).
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Check the temperature of the room and correct any
air conditioning or ventilation problems.
3. Check the airflow around the system to make sure
no vents are blocked.
Possible Cause-FRUs or other cause:
v None
769
CPU temperature warning.
Explanation: The temperature of the CPU within the
node is close to the point where the node stops
performing I/O and enters service state. The node is
currently continuing to operate. This is most likely an
ambient temperature problem, but it might be a
hardware problem.
Data:
v A text string identifying the thermal sensor reporting
the warning level and the current temperature in
degrees (Celsius).
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Check the temperature of the room and correct any
air conditioning or ventilation problems.
3. Check the airflow around the system. Ensure no
vents are blocked.
4. Make sure the node fans are operational.
5. If the error is still reported, replace the node’s CPU.
Possible Cause-FRUs or other cause:
v CPU
770
Shutdown temperature reached
Explanation: The node temperature has reached the
point at which it must shut down to protect
electronics and data. This is most likely an ambient
temperature problem, but it could be a hardware issue.
Data:
v A text string identifying the thermal sensor reporting
the warning level and the current temperature in
degrees (Celsius).
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Check the temperature of the room and correct any
air conditioning or ventilation problems.
3. Check the airflow around the system and make sure
no vents are blocked.
Possible Cause-FRUs or other cause:
v CPU
775
Power supply problem.
Explanation: A power supply has a fault condition.
User response: Replace the power supply.
Possible Cause-FRUs or other:
v Power supply
776
Power supply mains cable unplugged.
Explanation: A power supply mains cable is not
plugged in.
User response: Plug in the power supply mains cable.
Possible Cause-FRUs or other:
v None
777
Power supply missing.
Explanation: A power supply is missing.
User response: Install the power supply.
Possible Cause-FRUs or other:
v Power supply
779
Battery is missing
Explanation: The battery is not installed in the system.
You can power up the system without the battery
installed.
User response: Install the battery.
Possible Cause-FRUs or other:
v Battery (100%)
780
Battery has failed
Explanation:
1. The battery has failed.
2. The battery is past the end of its useful life.
3. The battery failed to provide power on a previous
occasion and is therefore regarded as unfit for its
purpose.
User response: Replace the battery.
Possible Cause-FRUs or other:
v Battery (100%)
781
Battery is below the minimum operating
temperature
Explanation: The battery cannot perform the required
function because it is below the minimum operating
temperature.
This error is reported only if the battery subsystem
cannot provide full protection.
An inability to charge is not reported if the combined
charge available from all installed batteries can provide
full protection at the current charge levels.
User response: No service action required, use the
console to manage the node.
Wait for the battery to warm up.
782
Battery is above the maximum operating
temperature
Explanation: The battery cannot perform the required
function because it is above the maximum operating
temperature.
This error is reported only if the battery subsystem
cannot provide full protection.
An inability to charge is not reported if the combined
charge available from all installed batteries can provide
full protection at the current charge levels.
User response: No service action required, use the
console to manage the node.
Wait for the battery to cool down.
783
Battery communications error
Explanation: A battery is installed, but
communications via I2C are not functioning.
This might be either a fault in the battery unit or a
fault in the battery backplane.
User response: No service action required, use the
console to manage the node.
Replace the battery. If the problem persists, conduct the
corrective service procedure described in “1109” on
page 206.
784
Battery is nearing end of life
Explanation: The battery is near the end of its useful
life. You should replace it at the earliest convenient
opportunity.
This might be either a fault in the battery unit or a
fault in the battery backplane.
User response: No service action required, use the
console to manage the node.
Replace the battery.
785
Battery capacity is reduced because of
cell imbalance
818
Unable to recover the service controller
flash disk.
Explanation: A nonrecoverable error occurred when
accessing the service controller persistent memory.
Explanation: The charge levels of the cells within the
battery pack are out of balance.
User response:
Some cells become fully charged before others, which
causes charging to terminate early, before the entire
battery pack is fully charged.
2. Replace the field replaceable units (FRUs) in the
order listed.
1. Restart the node and see if it recovers.
Ending recharging prematurely effectively reduces the
available capacity of the pack.
Possible Cause-FRUs or other cause:
Circuitry within the battery pack corrects such errors
normally, but can take tens of hours to complete.
v Service controller cable
If this error is not fixed after 24 hours, or if the error
reoccurs after it fixes itself, the error is likely indicative
of a problem in the battery cells. In such a case, replace
the battery pack.
820
User response: No service action required, use the
console to manage the node.
v Service controller
Canister type is incompatible with
enclosure model
Explanation: The node canister has detected that it
has a hardware type that is not compatible with the
control enclosure MTM, such as node canister type 300
in an enclosure with MTM 2076-112.
Wait for the cells to balance.
This is an expected condition when a control enclosure
is being updated to a different type of node canister.
786
User response:
Battery VPD checksum error
Explanation: The checksum on the vital product data
(VPD) stored in the battery EEPROM is incorrect.
User response: No service action required, use the
console to manage the node.
Replace the battery.
1. Check that all the update instructions have been
followed completely.
2. Use the management GUI to run the recommended
actions for the associated service error code.
Possible Cause-FRUs or other cause:
v None
787
Battery is at a hardware revision level
not supported by the current code level
Explanation: The battery currently installed is at a
hardware revision level that is not supported by the
current code level.
User response: No service action required, use the
console to manage the node.
Either update the code level to one that supports the
currently installed battery or replace the battery with
one that is supported by the current code level.
830
Explanation: It is necessary to provide an encryption
key before the system can become fully operational.
This node error occurs when a system with encryption
enabled is restarted without an encryption key
available.
User response: Insert a USB flash drive containing a
valid key into one of the node canisters.
831
803
Fibre Channel adapter not working
Encryption key required.
Encryption key is not valid.
Explanation: A problem has been detected on the
node’s Fibre Channel (FC) adapter. This node error is
reported only on SAN Volume Controller 2145-CG8 or
older nodes.
Explanation: It is necessary to provide an encryption
key before the system can become fully operational.
This node error occurs when the encryption key
identified is invalid. A file with the correct name was
found but the key in the file is corrupt.
User response: Follow troubleshooting procedures to
fix the hardware.
This node error will clear once the USB flash drive
containing the invalid key is removed.
User response: Remove the USB flash drive from the
port.
832
Encryption key file not found.
841
Supported hardware change detected.
Explanation: A USB flash drive containing an
encryption key is present but the expected file cannot be
located. This can occur if a key for a different system or
an old key for this system has been provided.
Explanation: A change has been detected in the node
hardware configuration. The new configuration is
supported by the node software. The new configuration
does not become active until it is activated.
Additionally, other user-created files that match the key
file name format can cause this error if the USB flash
drive does not contain the expected key.
A node configuration is remembered only while it is
active in a system. This node error is therefore resolved
using the management GUI.
This node error will clear when the USB flash drive
identified has been removed.
User response:
User response: Remove the USB flash drive from the
port.
833
Unsupported USB device.
Explanation: An unsupported device has been
connected to a USB port.
Only USB flash drives are supported and this node
error will be raised if another type of device is
connected to a USB port.
User response: Remove the unsupported device.
840
Unsupported hardware change detected.
Explanation: A change has been detected to this
node’s hardware configuration. The new configuration
is not supported by the node canister software. User
action is required to repair the hardware or update the
software.
1. Use the management GUI to run the recommended
actions for the associated service error code. Use the
directed maintenance to accept or reject the new
configuration.
850
The canister battery is reaching the end
of its useful life.
Explanation: The canister battery is reaching the end
of its useful life. It should be replaced within a week of
the node error first being reported.
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Replace the node canister battery by using the
remove and replace procedures.
Possible Cause-FRUs or other cause:
v Canister battery
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Follow the procedure for getting node and
clustered-system information. A change to the
hardware configuration is expected.
3. If the hardware configuration is unexpectedly
reduced, make sure the component has not been
unseated. Hardware replacement might be
necessary.
4. If a new hardware component is shown as
unsupported, check the software version required to
support the hardware component. Update the
software to a version that supports the hardware.
If the hardware detected does not match the expected
configuration, replace the hardware component that is
reported incorrectly.
Possible Cause-FRUs or other cause:
v One of the optional hardware components might
require replacement
860
Fibre Channel network fabric is too big.
Explanation: The number of Fibre Channel (FC) logins
made to the node exceeds the allowed limit. The node
continues to operate, but only communicates with the
logins made before the limit was reached. The order in
which other devices log into the node cannot be
determined, so the node’s FC connectivity might vary
after each restart. The connection might be with host
systems, other storage systems, or with other nodes.
This error might be the reason the node is unable to
participate in a system.
The number of allowed logins per node is 1024.
Data:
v None
User response: This error indicates a problem with the
Fibre Channel fabric configuration. It is resolved by
reconfiguring the FC switch:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Rezone the FC network so only the ports the node
needs to connect to are visible to it.
Possible Cause-FRUs or other cause:
v None
870
Too many cluster creations made on
node
Explanation: Too many SAN Volume Controller
clustered systems have been created on this node. The
count of the number of clustered systems created on
the node is stored within the node service controller.
Data:
v Where applicable, remove and replace the hardware
that is preventing the candidate from joining the
clustered system.
Possible Cause-FRUs or other cause:
For information on feature codes available, see the SAN
Volume Controller and Storwize family Characteristic
Interoperability Matrix on the support website:
www.ibm.com/storage/support/2145.
878
v None
User response:
1. Try to create the clustered system on a different
node.
2. Replace the service controller using the remove and
replace procedures.
Possible cause-FRUs or other cause:
Attempting recovery after loss of state
data.
Explanation: During startup, the node cannot read its
state data. It reports this error while waiting to be
added back into a clustered system. If the node is not
added back into a clustered system within a set time,
node error 578 is reported.
User response:
v Service controller
1. Allow time for recovery. No further action is
required.
871
2. Keep monitoring in case the error changes to error
code 578.
Failed to increment cluster ID
Explanation: The clustered system create option failed
because the clustered system ID, which is stored in the
service controller, could not be updated.
Data:
v None
User response:
1. Try to create the clustered system on a different
node.
2. Replace the service controller using the remove and
replace procedures.
Possible cause-FRUs or other cause:
v Service controller
875
Request to cluster rejected.
Explanation: A candidate node could not be added to
the clustered system. The node contains hardware or
firmware that is not supported in the clustered system.
Data:
This node error and extra data is viewable through
sainfo lsservicestatus on the candidate node only.
The extra data lists a full set of feature codes that are
required by the node to run in the clustered system.
User response:
v Choose a different candidate that is compatible with
the clustered system.
v Update the clustered system to code that is
supported by all components.
v Do not add a candidate to the clustered system.
888
Too many Fibre Channel logins between
nodes.
Explanation: The system has determined that the user
has zoned the fabric such that this node has received
more than 16 unmasked logins originating from
another node or node canister. This can be any node or
canister that is not in service mode, in the local cluster or in
a remote cluster with a partnership. An unmasked
login is from a port whose corresponding bit in the FC
port mask is '1'. If the error is raised against a node in
the local cluster, then it is the local FC port mask that is
applied. If the error is raised against a node in a remote
cluster, then it is the partner FC port masks from both
clusters that apply.
More than 16 logins is not a supported configuration as
it increases internode communication and can affect
bandwidth and performance. For example, if node A
has 8 ports and node B has 8 ports where the nodes are
in different clusters, if node A has a partner FC port
mask of 00000011 and node B has a partner FC port
mask of 11000000 there are 4 unmasked logins possible
(1,7 1,8 2,7 2,8). Fabric zoning may be used to reduce
this number further. For example, if node B port 8 is removed
from the zone, there are only 2 (1,7 and 2,7). The
combination of masks and zoning must leave 16 or
fewer possible logins.
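The mask arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not a system tool: the function names are hypothetical, and it assumes the rightmost bit of a mask corresponds to port 1, which is the reading that matches the example (00000011 selects ports 1 and 2).

```python
# Illustrative sketch only. Function names are hypothetical and the
# bit-to-port mapping (rightmost bit = port 1) is an assumption taken
# from the worked example in the text.

MAX_INTERNODE_LOGINS = 16  # limit stated in the explanation


def mask_to_ports(mask):
    """Return the 1-based port numbers whose mask bit is '1'."""
    return [i + 1 for i, bit in enumerate(reversed(mask)) if bit == "1"]


def possible_logins(mask_a, mask_b, zoned_out_b=()):
    """List (port on A, port on B) pairs left after masking and zoning.

    zoned_out_b holds node B ports that were removed from the zone.
    """
    ports_a = mask_to_ports(mask_a)
    ports_b = [p for p in mask_to_ports(mask_b) if p not in zoned_out_b]
    return [(a, b) for a in ports_a for b in ports_b]


pairs = possible_logins("00000011", "11000000")
print(pairs)                               # [(1, 7), (1, 8), (2, 7), (2, 8)]
print(len(pairs) <= MAX_INTERNODE_LOGINS)  # True

# Removing node B port 8 from the zone leaves two possible logins.
print(possible_logins("00000011", "11000000", zoned_out_b={8}))  # [(1, 7), (2, 7)]
```

The sketch counts possible logins only. The actual number of logins on a running system should be confirmed on the system itself.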
Note: This count includes both FC and Fibre Channel
over Ethernet (FCoE) logins. The log-in count will not
include masked ports.
When this event is logged, the cluster ID and node ID of
the first node whose logins exceed this limit on the
local node will be reported, as well as the WWNN of
said node. If logins change, the error is automatically
fixed and another error is logged if appropriate (this
may or may not choose the same node to report in the
sense data if the same node is still over the maximum
allowed).
Data
Text string showing
v WWNN of the other node
v Cluster ID of other node
v Arbitrary node ID of one other node that is logged
into this node. (node ID as it appears in lsnode)
User response: The error is resolved by either
re-configuring the system to change which type of
connection is allowed on a port, or by changing the
SAN fabric configuration so ports are not in the same
zone. A combination of both options may be used.
The system reconfiguration is to change the Fibre
Channel ports mask to reduce which ports can be used
for internode communication.
The local Fibre Channel port mask should be modified
if the cluster id reported matches the cluster id of the
node logging the error.
The partner Fibre Channel port mask should be
modified if the cluster id reported does not match the
cluster id of the node logging the error. The partner
Fibre Channel port mask may need to be changed for
one or both clusters.
SAN fabric configuration is set using the switch
configuration utilities.
Use the lsfabric command to view the current number
of logins between nodes.
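As a rough aid for reading that output, the following Python sketch counts logins per node pair from delimited text. It is an assumption-laden illustration: it presumes the output was captured with a colon delimiter and has columns named node_name, name, and type, which must be checked against the actual output of your code level. It counts every listed login of type node and does not apply the FC port masks, so it can overstate the unmasked count.

```python
# Illustrative sketch only. The column names (node_name, name, type),
# the colon delimiter, and the function names are assumptions;
# verify them against real output before relying on the result.
import csv
import io
from collections import Counter

MAX_INTERNODE_LOGINS = 16  # limit stated for this error


def internode_login_counts(fabric_text):
    """Count logins per (local node, remote node) pair."""
    reader = csv.DictReader(io.StringIO(fabric_text), delimiter=":")
    counts = Counter()
    for row in reader:
        if row["type"] == "node":
            counts[(row["node_name"], row["name"])] += 1
    return counts


def over_limit(counts, limit=MAX_INTERNODE_LOGINS):
    """Return only the node pairs whose login count exceeds the limit."""
    return {pair: n for pair, n in counts.items() if n > limit}


# Made-up sample data for demonstration, not real system output.
sample_rows = ["node_name:local_port:name:type"]
sample_rows += ["node1:%d:node2:node" % (i % 8 + 1) for i in range(17)]
sample_rows += ["node1:1:host1:host"]
sample = "\n".join(sample_rows)

print(over_limit(internode_login_counts(sample)))  # {('node1', 'node2'): 17}
```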
Possible Cause-FRUs or other cause:
v None
Service error code
921
Unable to perform cluster recovery
because of a lack of cluster resources.
Explanation: The node does not have sufficient
connectivity to other nodes or quorum device to form a
cluster. If a disaster has occurred and the nodes at the
other site cannot be recovered, then it is possible to
allow the nodes at the surviving site to form a system
using local storage.
User response: Repair the fabric or quorum device to
establish connectivity. As a last resort, when the nodes
at the other site cannot be recovered, it is possible
to allow the nodes at the surviving site to form a
system using local site storage as described below:
To avoid data corruption ensure that all host servers
that were previously accessing the system have had all
volumes un-mounted or have been rebooted. Ensure
that the nodes at the other site are not operational and
are unable to form a system in the future.
After invoking this command a full re-synchronization
of all mirrored volumes will be performed when the
other site is recovered. This is likely to take many
hours or days to complete.
Contact IBM support personnel if you are unsure.
Note: Before continuing, confirm that you have taken
the following actions. Failure to perform these actions
can lead to data corruption that is undetected by
the system but affects host applications.
1. All host servers that were previously accessing the
system have had all volumes un-mounted or have
been rebooted.
2. Ensure that the nodes at the other site are not
operating as a system and actions have been taken
to prevent them from forming a system in the
future.
After these actions have been taken, the satask
overridequorum command can be used to allow the nodes at the
surviving site to form a system using local storage.
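The two preconditions can be treated as a gate in any runbook automation. The Python sketch below is a hypothetical illustration of that idea: the function name and arguments are invented for this example, and the function only returns the command text from this section. It does not run anything on the system.

```python
# Illustrative sketch only. The function name and arguments are
# hypothetical; nothing here is executed against a system.

def override_quorum_command(hosts_unmounted_or_rebooted, other_site_disabled):
    """Return the command text only if both preconditions are confirmed.

    hosts_unmounted_or_rebooted: all host servers that accessed the system
        have unmounted all volumes or have been rebooted.
    other_site_disabled: nodes at the other site are not operating as a
        system and cannot form one in the future.
    """
    if not (hosts_unmounted_or_rebooted and other_site_disabled):
        raise RuntimeError(
            "Preconditions not confirmed; proceeding risks undetected "
            "data corruption. Contact IBM support if unsure."
        )
    return "satask overridequorum"


print(override_quorum_command(True, True))  # satask overridequorum
```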
1801
950
889
Failed to create remote IP connection.
Explanation: Despite a request to create a remote IP
partnership port connection, the action has failed or
timed out.
Special update mode.
Explanation: Special update mode.
User response: None.
990
Cluster recovery has failed.
User response: Fix the remote IP link so that traffic
can flow correctly. Once the connection is made, the
error will auto-correct.
Explanation: Cluster recovery has failed.
920
1001
Unable to perform cluster recovery
because of a lack of cluster resources.
Explanation: The node is looking for a quorum of
resources which also require cluster recovery.
User response: Contact IBM technical support.
User response: Contact IBM technical support.
Automatic cluster recovery has run.
Explanation: All cluster configuration commands are
blocked.
User response: Call your software support center.
Caution: You can unblock the configuration commands
through the cluster GUI, but you must first consult
with your software support to avoid corrupting your
cluster configuration.
2. Ensure that memory DIMMs are spread evenly
across all memory channels.
3. Restart the node.
4. If the error persists, replace the system board.
Possible Cause-FRUs or other:
Possible Cause-FRUs or other:
v None
v None
1002
Event log full.
Explanation: Event log full.
User response: To fix the errors in the event log, go to
the start MAP.
Possible Cause-FRUs or other:
v Unfixed errors in the log.
1007
Canister to canister communication
error.
Explanation: A canister to canister communication
error can appear when one canister cannot
communicate with the other.
User response: Reseat the passive canister, and then
try reseating the active canister. If neither resolves the
alert, try replacing the passive canister, and then the
other canister.
1011
Fibre Channel adapter (4 port) in slot 1
is missing.
Explanation: Fibre Channel adapter (4 port) in slot 1
is missing.
User response:
1. In the sequence shown, exchange the FRUs for new
FRUs.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
3. Go to repair verification MAP.
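The node status check in step 2 recurs in many of the user responses in this chapter. A minimal Python sketch of that decision follows. The function name and the status strings are assumptions for illustration; the outcomes simply restate the text of the step.

```python
# Illustrative sketch only. The function name and status strings are
# hypothetical; the returned actions paraphrase step 2 above.

def next_action(node_statuses, returned_to_this_step=False):
    """Decide what to do after exchanging FRUs, based on node status."""
    if not node_statuses:
        raise ValueError("at least one node status is required")
    if all(status == "online" for status in node_statuses):
        return "mark the error as fixed"
    if returned_to_this_step:
        return "contact your support center"
    return "go to start MAP"


print(next_action(["online", "online"]))         # mark the error as fixed
print(next_action(["online", "offline"]))        # go to start MAP
print(next_action(["online", "offline"], True))  # contact your support center
```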
Possible Cause-FRUs or other:
A canister can be safely reseated or replaced while the
system is in production. Make sure that the other
canister is the active node before removing this canister.
It is preferable that this canister shuts down completely
before removing it, but it is not required.
2145-CG8 or 2145-CF8
1. Reseat the passive canister (a failover is not
required).
1013
2. Reseat the second canister (a failover is required).
Explanation: Fibre Channel adapter (4-port) in slot 1
PCI fault.
3. If necessary, replace the passive canister (a failover
is not required).
4. If necessary, replace the active canister (a failover is
required).
If a second new canister is not available, the
previously removed canister can be used, as it
apparently is not at fault.
5. An enclosure replacement might be necessary.
Contact IBM support.
v 4-port Fibre Channel host bus adapter (98%)
v System board (2%)
Fibre Channel adapter (4-port) in slot 1
PCI fault.
User response:
1. In the sequence shown, exchange the FRUs for new
FRUs.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
Possible Cause-FRUs or other:
3. Go to repair verification MAP.
Canister (95%)
Possible Cause-FRUs or other:
Enclosure (5%)
2145-CG8 or 2145-CF8
v 4-port Fibre Channel host bus adapter (98%)
1009
DIMMs are incorrectly installed.
v System board (2%)
Explanation: DIMMs are incorrectly installed.
User response: Ensure that memory DIMMs are
spread evenly across all memory channels.
1. Shut down the node.
1014
Fibre Channel adapter in slot 1 is
missing.
Explanation: The Fibre Channel adapter in slot 1 is
missing.
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
User response:
v Fibre Channel host bus adapter (90%)
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
v PCI riser card (5%)
v Other (5%)
2. Check node status:
v If all nodes show a status of online, mark the
error as fixed.
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
1017
Fibre Channel adapter in slot 1 PCI bus
error.
Explanation: The Fibre Channel adapter in PCI slot 1
is failing with a PCI bus error.
User response:
3. Go to the repair verification MAP.
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
Possible Cause, FRUs, or other:
2. Check node status:
v N/A
1015
v If all nodes show a status of online, mark the
error as fixed.
Fibre Channel adapter in slot 2 is
missing.
Explanation: Fibre Channel adapter in slot 2 is
missing.
User response:
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
Possible Cause, FRUs, or other:
2. Check the node status:
v PCI riser card (10%)
v Fibre Channel host bus adapter (80%)
v If all nodes show a status of online, mark the
error as fixed.
v Other (10%)
v If any node does not show a status of online, go
to the start MAP.
1018
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
Fibre Channel adapter in slot 2 PCI
fault.
Explanation: The Fibre Channel adapter in slot 2 is
failing with a PCI fault.
User response:
Possible Cause, FRUs, or other:
v N/A
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
2. Check node status:
1016
Fibre Channel adapter (4 port) in slot 2
is missing.
v If all nodes show a status of online, mark the
error as fixed.
Explanation: The four-port Fibre Channel adapter in
PCI slot 2 is missing.
v If any nodes do not show a status of online, go
to the start MAP.
User response:
v If you return to this step, contact your support
center to resolve the problem with the node.
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
3. Go to the repair verification MAP.
2. Check node status:
Possible Cause, FRUs, or other:
v If all nodes show a status of online, mark the
error as fixed.
v Dual port Fibre Channel host bus adapter - full
height (80%)
v If any nodes do not show a status of online, go
to the start MAP.
v PCI riser card (10%)
v Other (10%)
1019
Fibre Channel adapter (four-port) in slot
2 PCI fault.
Explanation: The four-port Fibre Channel adapter in
slot 2 is failing with a PCI fault.
User response:
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
Note: Intentional removal is not permitted on a
clustered node. To use the node with only one
processor, you must remove the node with rmnode and then
add it again. Otherwise, shut down the node and replace
the processor that was removed.
Possible Cause-FRUs or other:
v CPU (80%)
v System board (20%)
2. Check node status:
v If all nodes show a status of online, mark the
error as fixed.
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
v Four-port Fibre Channel host bus adapter (80%)
v PCI Express riser card (10%)
1025
The 2145 system assembly is failing.
Explanation: The 2145 system assembly is failing.
User response:
1. Go to the light path diagnostic MAP and complete
the light path diagnostic procedures.
2. If the light path diagnostic procedure isolates the
FRU, mark this error as fixed. Then, go to the repair
verification MAP.
3. If you replace a FRU, but it does not correct the
problem, ensure that the FRU is installed correctly.
Then, go to the next step.
v Other (10%)
4. Replace the system board as indicated in the
Possible Cause list.
1020
5. Check the node status:
The system board service processor has
failed.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 522. See
the details of node error 522 for more information.
User response: See node error 522.
1021
Incorrect enclosure
v If all nodes show a status of online, mark the
error as fixed.
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
6. Go to the repair verification MAP.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 500. See
the details of node error 500 for more information.
Possible Cause, FRUs, or other:
User response: See node error 500.
v The FRUs that are indicated by the Light path
diagnostics (98%)
1022
The detected memory size does not
match the expected memory size.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 510. See
the details of node error 510 for more information.
2145-CF8 or 2145-CG8
v System board (2%)
1026
System board device problem.
Explanation: System board device problem.
User response: See node error 510.
User response: The action depends on the extra data
that is provided with the node error and the light path
diagnostics.
1024
Possible Cause-FRUs or other:
CPU is broken or missing.
Explanation: CPU is broken or missing.
v Variable
User response: Review the node hardware using the
svcinfo lsnodehw command on the node indicated by
this event.
1027
Unable to update BIOS settings.
1. Shut down the node. Replace the CPU that is broken
as indicated by the light path and event data.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 524. See
the details of node error 524 for more information.
2. If the error persists, replace the system board.
User response: See node error 524.
1028
System board service processor failed.
Explanation: System board service processor failed.
User response: Complete the following steps:
1. Shut down the node.
2. Remove the main power cable.
1. List all enclosure canisters for all control enclosures.
Look for an online canister that does not have a
node ID associated with it. This canister is the one
with the problem.
2. Unplug the SAS cable from port 2 of the canister
that is identified in step 1.
5. Wait for node to boot.
3. Run the command lsenclosurecanister, and see
whether there is a node ID present. If step 2 fixes
the error (a node ID is present), then something
failed in one of the attached devices.
6. If the node still reports the error, replace the system
board.
4. Reconnect the expansion enclosures and see
whether the system is able to isolate the fault.
Possible Cause-FRUs or other:
5. Reseat all the canisters on that strand and replace
the canister that is identified in step 1 if step 4 does
not fix the error.
3. Wait for the lights to stop flashing.
4. Plug in the power cable.
v System board
Possible Cause-FRUs or other:
1029
Enclosure VPD is unavailable or
invalid.
v Nothing (80%)
v Canister (20%)
Explanation: Enclosure VPD is unavailable or invalid.
User response: Overwrite the enclosure VPD or
replace the power interposer board.
Possible Cause-FRUs or other:
PIB card (10%)
Other:
No FRU (90%)
1030
The internal disk of a node has failed.
Explanation: An error has occurred while attempting
to read or write data to the internal disk of one of the
nodes in the cluster. The disk has failed.
User response: Determine which node's internal disk
has failed using the node information in the error.
Replace the FRUs in the order shown. Mark the error
as fixed.
Possible Cause-FRUs or other:
2072 - Node Canister (100%)
1032
Fibre Channel adapter not working
Explanation: A problem has been detected on the
node’s Fibre Channel (FC) adapter. This node error is
reported only on SAN Volume Controller 2145-CG8 or
older nodes.
User response: Follow troubleshooting procedures to
fix the hardware.
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
Possible Cause-FRUs or other cause:
v None
1034
Canister fault type 2
Explanation: There is a canister internal error.
User response: Reseat the canister, and then replace
the canister if the error continues.
v disk drive (50%)
Possible Cause-FRUs or other:
v Disk controller (30%)
Canister (80%)
v Disk backplane (10%)
Other:
v Disk signal cable (8%)
v Disk power cable (1%)
v System board (1%)
1031
Node canister location unknown.
Explanation: Node canister location unknown.
User response: Complete the following steps to
resolve this problem.
No FRU (20%)
1035
Boot drive problems
Explanation: Boot drive problems
User response: Complete the following steps:
1. Look at a boot drive view to determine the
problems.
2. Run the commands lsnodebootdrive / lsbootdrive
to display a status for each slot for users and DMPs
to diagnose and repair problems.
3. If you plan to move any drives, shut down the
node if booted yes is shown for that drive in the
boot drive view (lsbootdrive). After you move the
drives, a different node error will probably be
displayed for you to work on.
4. If you plan to set the serial number of the system
board, see satask chvpd.
2145-CG8 or 2145-CF8
v Service controller (50%)
v Service controller cable (50%)
1044
A service controller read failure
occurred.
5. If there is still no usable persistent data on the boot
drives, then contact IBM Remote Technical Support.
Explanation: A service controller read failure occurred.
Possible Cause-FRUs or other:
1. Replace the FRUs below in the order listed.
v System drive
2. Check node status. If all nodes show a status of
Online, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
Online, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
1036
The enclosure identity cannot be read.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 509. See
the details of node error 509 for more information.
User response: See node error 509.
1039
Canister failure, canister replacement
required.
User response:
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Service controller (50%)
Explanation: Canister failure, canister replacement
required.
v Service controller cable (50%)
User response: Replace the canister.
1048
A canister can be safely replaced while the system is in
production. Make sure that the other canister is the
active node before removing this canister. It is
preferable that this canister shuts down completely
before removing it, but it is not required.
Explanation: Unexpected enclosure fault.
Possible Cause-FRUs or other:
Unexpected enclosure fault.
User response: Use the bottom snap option in the
management GUI. This performs the following
functions:
v Generates new enclosure dumps for all enclosures.
v Generates livedump from all nodes in the cluster.
Interface adapter (50%)
v Runs an svc_snap dumpall.
SFP (20%)
1. Contact IBM support for further analysis.
Canister (20%)
Possible Cause-FRUs or other:
Internal interface adapter cable (10%)
v None
1040
1052
A flash module error has occurred after
a successful start of a 2145.
Explanation: Note: The node containing the flash
module has not been rejected by the cluster.
Incorrect type of uninterruptible power
supply detected
User response:
Explanation: The cluster is reporting that a node is
not operational because of critical node error 587. For
more information, see the details of node error 587.
1. Replace the FRUs below in the order listed
User response: See node error 587.
2. Check node status. If all nodes show a status of
Online, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
Online, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
1054
Fibre Channel adapter in slot 1 adapter
present but failed.
Explanation: The Fibre Channel adapter in PCI slot 1
is present but is failing.
3. Go to repair verification MAP.
User response:
Possible Cause-FRUs or other:
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
Chapter 7. Diagnosing problems
2. Check node status:
v If all nodes show a status of online, mark the
error as fixed.
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
1057
Fibre Channel adapter (four-port) in slot
2 adapter is present but failing.
Explanation: The four-port Fibre Channel adapter in
slot 2 is present but failing.
User response:
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
2. Check node status:
Possible Cause, FRUs, or other:
v Fibre Channel host bus adapter (100%)
v If all nodes show a status of online, mark the
error as fixed.
1055
v If any nodes do not show a status of online, go
to the start MAP.
Fibre Channel adapter (4 port) in slot 1
adapter present but failed.
v If you return to this step, contact your support
center to resolve the problem with the node.
Explanation: Fibre Channel adapter (4 port) in slot 1
adapter present but failed.
3. Go to the repair verification MAP.
User response:
Possible Cause, FRUs, or other:
1. Exchange the FRU for new FRU.
v N/A
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
1060
One or more Fibre Channel ports on the
2072 are not operational.
Explanation: One or more Fibre Channel ports on the
2072 are not operational.
3. Go to repair verification MAP.
User response:
Possible Cause-FRUs or other:
1. Go to MAP 5600: Fibre Channel to isolate and
repair the problem.
2145-CF8, or 2145-CG8
2. Go to the repair verification MAP.
v 4-port Fibre Channel host bus adapter (100%)
1056
The Fibre Channel adapter in slot 2 is
present but is failing.
Explanation: The Fibre Channel adapter in slot 2 is
present but is failing.
Possible Cause-FRUs or other:
v Fibre Channel cable (80%)
v Small Form-factor Pluggable (SFP) connector (5%)
v 4-port Fibre Channel host bus adapter (5%)
Other:
User response:
v Fibre Channel network fabric (10%)
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
1061
2. Check node status:
v If all nodes show a status of online, mark the
error as fixed.
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
v N/A
Explanation: Fibre Channel ports are not operational.
User response: An offline port can have many causes,
so check each of them. Start with the easiest and least
intrusive possibility, such as resetting the Fibre
Channel or FCoE port with a CLI command.
Possible Cause-FRUs or other:
External (cable, HBA/CNA, switch, and so on) (75%)
SFP (10%)
Interface (10%)
Canister (5%)
Fibre Channel ports are not operational.
1065
One or more Fibre Channel ports are
running at lower than the previously
saved speed.
Explanation: The Fibre Channel ports will normally
operate at the highest speed permitted by the Fibre
Channel switch, but this speed might be reduced if the
signal quality on the Fibre Channel connection is poor.
The Fibre Channel switch could have been set to
operate at a lower speed by the user, or the quality of
the Fibre Channel signal has deteriorated.
User response:
v Go to MAP 5600: Fibre Channel to resolve the
problem.
1084
Explanation: System board device exceeded
temperature threshold.
User response: Complete the following steps:
1. Check for external air flow blockages.
2. Remove the top of the machine case and check for
missing baffles, damaged heat sinks, or internal
blockages.
3. If the problem persists, follow the service instructions
for replacing the system board FRU in question.
Possible Cause-FRUs or other:
Possible Cause-FRUs or other:
v Variable
2072 - Node Canister (100%)
1085
v Fibre Channel cable (50%)
v Small Form-factor Pluggable (SFP) connector (20%)
v 4-port Fibre Channel host bus adapter (5%)
PCI Riser card exceeded temperature
threshold.
Explanation: PCI Riser card exceeded temperature
threshold.
User response: Complete the following steps:
Other:
v Fibre Channel switch, SFP connector, or GBIC (25%)
1067
System board device exceeded
temperature threshold.
Fan fault type 1
Explanation: The fan has failed.
User response: Replace the fan.
Possible Cause-FRUs or other:
1. Check airflow.
2. Remove the top of the machine case and check for
missing baffles or internal blockages.
3. Check for faulty PCI cards and replace as necessary.
4. If the problem persists, replace the PCI Riser FRU.
Possible Cause-FRUs or other:
v None
Fan (100%)
1087
1068
Fan fault type 2
Explanation: The fan is missing.
User response: Reseat the fan, and then replace the
fan if reseating the fan does not correct the error.
Note: If replacing the fan does not correct the error,
then the canister will need to be replaced.
Possible Cause-FRUs or other:
Shutdown temperature threshold
exceeded
Explanation: Shutdown temperature threshold
exceeded.
User response: Inspect the enclosure and the
enclosure environment.
1. Check environmental temperature.
2. Ensure that all of the components are installed or
that there are fillers in each bay.
Fan (80%)
3. Check that all of the fans are installed and
operating properly.
Other:
4. Check for any obstructions to airflow, proper
clearance for fresh inlet air, and exhaust air.
No FRU (20%)
5. Handle any specific obstructed airflow errors that
are related to the drive, the battery, and the power
supply unit.
1083
Unrecognized node error
Explanation: The cluster is reporting that a node is
not operational because of critical node error 562. See
the details of node error 562 for more information.
User response: See node error 562.
6. Bring the system back online. If the system
performed a hard shutdown, the power must be
removed and reapplied.
Possible Cause-FRUs or other:
Canister (2%)
Battery (1%)
Possible Cause, FRUs, or other:
Power supply unit (1%)
v N/A
1091
Drive (1%)
One or more fans (40x40x56) are failing.
Explanation: One or more fans (40x40x56) are failing.
Other:
User response:
Environment (95%)
1089
One or more fans are failing.
Explanation: One or more fans are failing.
For the 2145-DH8, a fan has a fault condition.
User response:
1. Determine the failing fan(s) from the fan indicator
on the system board or from the text of the error
data in the log. The reported fan for the 2145-CF8,
or 2145-CG8 matches the fan module position. Each
fan module contains two fans.
2. For the 2145-DH8, mechanically stop fan or remove
fan. If a fan is not installed, shut down the node,
open it, and install the fan. If a fan is installed,
replace fan FRU indicated by the FAN identifier
that is supplied in the Extra data.
3. Exchange the FRU for a new FRU.
4. Go to repair verification MAP.
v Fan number: Fan module position
v 1 or 2 :1
1. Determine the failing fans from the fan indicator on
the system board or from the text of the error data
in the log.
2. Verify that the cable between the fan backplane and
the system board is connected:
v If all fans on the fan backplane are failing
v If no fan fault lights are illuminated
3. Exchange the FRU for a new FRU.
4. Go to repair verification MAP.
Possible Cause, FRUs, or other:
v N/A
1092
The temperature soft or hard shutdown
threshold of the 2072 has been exceeded.
The 2072 has automatically powered off.
Explanation: The temperature soft or hard shutdown
threshold of the 2072 has been exceeded. The 2072 has
automatically powered off.
User response:
v 3 or 4 :2
1. Ensure that the operating environment meets
specifications.
v 5 or 6 :3
2. Ensure that the airflow is not obstructed.
v 7 or 8 :4
3. Ensure that the fans are operational.
v 9 or 10:5
4. Go to the light path diagnostic MAP and perform
the light path diagnostic procedures.
v 11 or 12:6
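The fan-number-to-module mapping that is listed for the 2145-CF8 and 2145-CG8 follows from each fan module containing two fans. As a minimal sketch only (the function name is hypothetical and is not part of any IBM tool), the fan module position can be computed from the reported fan number like this:

```python
def fan_module_position(fan_number: int) -> int:
    """Return the fan module position (1-6) for a reported fan number (1-12).

    Assumes the mapping in the list above: fans 1 and 2 are in module 1,
    fans 3 and 4 are in module 2, and so on up to fans 11 and 12 in module 6.
    """
    if not 1 <= fan_number <= 12:
        raise ValueError("fan number must be in the range 1-12")
    # Two fans per module, so round the fan number up to the next pair.
    return (fan_number + 1) // 2


if __name__ == "__main__":
    for fan in range(1, 13):
        print(f"Fan {fan}: fan module {fan_module_position(fan)}")
```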
Possible Cause-FRUs or other:
2145-CF8, 2145-DH8 or 2145-CG8
v Fan module (100%)
1090
One or more fans (40x40x28) are failing.
Explanation: One or more fans (40x40x28) are failing.
User response:
1. Determine the failing fans from the fan indicator on
the system board or from the text of the error data
in the log.
2. Verify that the cable between the fan backplane and
the system board is connected:
v If all fans on the fan backplane are failing
v If no fan fault lights are illuminated
3. Exchange the FRU for a new FRU.
4. Go to repair verification MAP.
5. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to the start MAP. If you return to this
step, contact your support center to resolve the
problem with the 2145.
6. Go to the repair verification MAP.
Possible Cause-FRUs or other:
2072 - Node Canister (100%)
v The FRU that is indicated by the Light path
diagnostics (25%)
v System board (5%)
Other:
System environment or airflow blockage (70%)
1093
The internal temperature sensor of the
2145 has reported that the temperature
warning threshold has been exceeded.
For the 2145-DH8 and 2145-CF8 only, the
CPU temperature is too high.
Explanation: The internal temperature sensor of the
2145 has reported that the temperature warning
threshold has been exceeded.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
None
Other:
System environment (100%)
User response:
1. Ensure that the internal airflow of the node has not
been obstructed.
1095
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to the start MAP. If you return to this
step, contact your support center to resolve the
problem with the 2145.
Explanation: Enclosure temperature has passed critical
threshold.
3. Go to repair verification MAP.
2. Check for any impedance to airflow.
For the 2145-DH8 and 2145-CF8 only:
3. If the enclosure has shut down, then turn off both
power switches on the enclosure and power both
back on.
1. Check for external air flow blockages.
2. Remove the top of the machine case and check for
missing baffles, damaged heatsinks, or internal
blockages.
3. If the problem persists after taking these measures,
replace the CPU assembly FRU if 2145-DH8 or
2145-CF8.
Enclosure temperature has passed
critical threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Check environmental temperature.
Possible Cause-FRUs or other:
v None
1096
A Power Supply Unit is missing or has
failed.
Possible Cause-FRUs or other:
Explanation: One of the two power supply units in
the node is either missing or has failed.
2145-CF8 or 2145-CG8
NOTE: This error is reported when a hot-swap power
supply is removed from an active node, so it might be
reported when a faulty power supply is removed for
replacement. Both the missing and faulty conditions
report this error code.
v Fan module (20%)
v System board (5%)
v Canister (5%)
Possible Cause-FRUs or other:
2145-DH8 or 2145-CF8
v CPU assembly (30%)
Other:
Airflow blockage (70%)
1094
The ambient temperature threshold has
been exceeded.
Explanation: The ambient temperature threshold has
been exceeded.
User response:
1. Check that the room temperature is within the
limits allowed.
2. Check for obstructions in the air flow.
3. Mark the errors as fixed.
User response: Error code 1096 is reported when the
power supply either cannot be detected or reports an
error.
1. Ensure that the power supply is seated correctly
and that the power cable is attached correctly to
both the node and to the 2145 UPS-1U.
2. If the error has not been automatically marked fixed
after two minutes, note the status of the three LEDs
on the back of the power supply. For the 2145-CG8
or 2145-CF8, the AC LED is the top green LED, the
DC LED is the middle green LED and the error
LED is the bottom amber LED.
3. If the power supply error LED is off and the AC
and DC power LEDs are both on, this is the normal
condition. If the error has not been automatically
fixed after two minutes, replace the system board.
4. Follow the action specified for the LED states noted
in the table below.
5. If the error has not been automatically fixed after
two minutes, contact support.
6. Go to repair verification MAP.
normal condition. If the error is not automatically
fixed after 2 minutes, replace the system board.
Error LED, AC LED, DC LED: Action
5. Follow the action that is specified for the LED states
noted in the following list.
ON, ON or OFF, ON or OFF: The power supply has a
fault. Replace the power supply.
6. If the error is not automatically fixed after 2
minutes, contact support.
OFF, OFF, OFF: There is no power detected. Ensure that
the power cable is connected at the node and 2145
UPS-1U. If the AC LED does not light, check the status
of the 2145 UPS-1U to which the power supply is
connected. Follow MAP 5150 2145 UPS-1U if the
UPS-1U is showing no power or an error; otherwise,
replace the power cable. If the AC LED still does not
light, replace the power supply.
OFF, OFF, ON: The power supply has a fault. Replace the
power supply.
OFF, ON, OFF: Ensure that the power supply is installed
correctly. If the DC LED does not light, replace the
power supply.
Possible Cause-FRUs or other:
Failed PSU:
v Power supply (90%)
7. Go to repair verification MAP.
Error LED, AC LED, DC LED: Action
ON, ON or OFF, ON or OFF: The power supply has a
fault. Replace the power supply.
OFF, OFF, OFF: There is no power detected. Ensure that
the power cable is connected at the node and 2145
UPS-1U. If the AC LED does not light, check whether
the 2145 UPS-1U is showing any errors. Follow MAP
5150 2145 UPS-1U if the UPS-1U is showing an error;
otherwise, replace the power cable. If the AC LED still
does not light, replace the power supply.
OFF, OFF, ON: The power supply has a fault. Replace the
power supply.
OFF, ON, OFF: Ensure that the power supply is installed
correctly. If the DC LED does not light, replace the
power supply.
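The LED states in the preceding list can be read as a small decision table. The following sketch is illustrative only: the function name and the returned strings are assumptions made for this example, and the final case reflects the step that describes the normal LED state (error LED off, AC and DC LEDs on).

```python
def psu_led_action(error_on: bool, ac_on: bool, dc_on: bool) -> str:
    """Map the error, AC, and DC LED states of a power supply to an action."""
    if error_on:
        # Error LED on: the AC and DC LED states do not matter.
        return "Replace the power supply."
    if not ac_on and not dc_on:
        return ("No power detected: check the power cable at the node and the "
                "2145 UPS-1U, then the UPS-1U status.")
    if not ac_on and dc_on:
        return "Replace the power supply."
    if ac_on and not dc_on:
        return ("Ensure that the power supply is installed correctly; replace "
                "it if the DC LED does not light.")
    # Error LED off, AC and DC LEDs on.
    return ("Normal LED state: if the error is not automatically fixed after "
            "2 minutes, replace the system board.")


if __name__ == "__main__":
    print(psu_led_action(error_on=False, ac_on=True, dc_on=False))
```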
v Power cable assembly (5%)
v System board (5%)
Possible Cause-FRUs or other:
v Power cable assembly (85%)
Missing PSU:
v UPS-1U assembly (10%)
v Power supply (19%)
v System board (5%)
v System board (1%)
v For the 2145-DH8: power supply (100%)
v Other: Power supply not correctly installed (80%)
1098
1097
A power supply unit reports no A/C
power. For the 2145-DH8, a power
supply has a fault condition.
Explanation: One of the two power supply units in
the node is reporting that no main power is detected.
For the 2145-DH8, a power supply has a fault
condition.
Enclosure temperature has passed
warning threshold.
Explanation: Enclosure temperature has passed
warning threshold.
User response: Check for external and internal air
flow blockages or damage.
1. Check environmental temperature.
2. Check for any impedance to airflow.
User response:
1. For the 2145-DH8, replace the power supply FRU.
For all other models, complete the following steps.
2. Ensure that the power supply is attached correctly
to both the node and to the 2145 UPS-1U.
3. If the error is not automatically marked fixed after 2
minutes, note the status of the three LEDs on the
back of the power supply. For the 2145-CG8 or
2145-CF8, the AC LED is the top green LED. The
DC LED is the middle green LED, and the error
LED is the bottom amber LED.
4. If the power supply error LED is off and the AC
and DC power LEDs are both on, this state is the
Possible Cause-FRUs or other:
v None
1099
Temperature exceeded warning
threshold
Explanation: Temperature exceeded warning
threshold.
User response: Inspect the enclosure and the
enclosure environment.
1. Check environmental temperature.
2. Ensure that all of the components are installed or
that there are fillers in each bay.
2145-CF8, or 2145-CG8
3. Check that all of the fans are installed and
operating properly.
v System board (2%)
4. Check for any obstructions to airflow, proper
clearance for fresh inlet air, and exhaust air.
1105
v Light path diagnostic MAP FRUs (98%)
5. Wait for the component to cool.
One of the voltages that is monitored on
the system board is under the set
threshold.
Possible Cause-FRUs or other:
Explanation: One of the voltages that is monitored on
the system board is under the set threshold.
Hardware component (5%)
User response:
1. Check the cable connections.
Other:
2. See the light path diagnostic MAP.
Environment (95%)
3. If the light path diagnostic MAP does not resolve
the issue, exchange the frame assembly.
Explanation: One of the voltages that is monitored on
the system board is over the set threshold.
4. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
User response:
5. Go to repair verification MAP.
1100
One of the voltages that is monitored on
the system board is over the set
threshold.
1. See the light path diagnostic MAP.
2. If the light path diagnostic MAP does not resolve
the issue, exchange the frame assembly.
3. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
1106
One of the voltages that is monitored on
the system board is under the set
threshold.
Explanation: One of the voltages that is monitored on
the system board is under the set threshold.
User response:
1. Check the cable connections.
4. Go to repair verification MAP.
2. See the light path diagnostic MAP.
Possible Cause-FRUs or other:
3. If the light path diagnostic MAP does not resolve
the issue, exchange the system board assembly.
1101
One of the voltages that is monitored on
the system board is over the set
threshold.
Explanation: One of the voltages that is monitored on
the system board is over the set threshold.
4. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
5. Go to repair verification MAP.
User response:
1. See the light path diagnostic MAP.
Possible Cause-FRUs or other:
2. If the light path diagnostic MAP does not resolve
the issue, exchange the system board assembly.
2145-CF8, or 2145-CG8
3. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
as “fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
v Light path diagnostic MAP FRUs (98%)
v System board (2%)
1107
The battery subsystem has insufficient
capacity to save system data due to
multiple faults.
Explanation: This message is an indication of other
problems to solve before the system can successfully
recharge the batteries.
User response: No service action is required for this
error, but other errors must be fixed. Look at other
indications to see if the batteries can recharge without
being put into use.
To verify that the battery backplane works after
replacing it, check that the node error is fixed.
Possible Cause-FRUs or other:
v Battery backplane (50%)
1108
Battery backplane cabling faulty or
possible battery backplane requires
replacing.
Explanation: Faulty cabling or a faulty backplane is
preventing the system from full communication with
and control of the batteries.
User response: Check the cabling to the battery
backplane, making sure that all the connectors are
properly mated.
Four signal cables (EPOW, LPC, PWR_SENSE & LED)
and one power cable (which uses 12 red and 12 black
heavy gauge wires) are involved:
v The EPOW cable runs to a 20-pin connector at the
front of the system planar, which is the edge nearest
the drive bays, near the left side.
To check that this connector is mated properly, it is
necessary to remove the plastic airflow baffle, which
lifts up.
A number of wires run from the same connector to
the disk backplane located to the left of the battery
backplane.
v The LPC cable runs to a small adapter that is
plugged into the back of the system planar between
two PCI Express adapter cages. It is helpful to
remove the left adapter cage when checking that
these connectors are mated properly.
v The PWR_SENSE cable runs to a 24-pin connector at
the back of the system planar between the PSUs and
the left adapter cage. Check the connections of both a
female connector (to the system planar) and a male
connector (to the connector from the top PSU).
Again, it can be helpful to remove the left adapter
cage to check the proper mating of the connectors.
v The power cable runs to the system planar between
the PSUs and the left adapter cage. It is located just
in front of the PWR_SENSE connector. This cable has
both a female connector that connects to the system
planar, and a male connector that mates with the
connector from the top PSU. Due to the bulk of this
cable, care must be taken to not disturb PWR_SENSE
connections when dressing it away in the space
between the PSUs and the left adapter cage.
v The LED cable runs to a small PCB on the front
bezel. The only consequence of this cable not being
mated correctly is that the LEDs do not work.
If no problems exist, replace the battery backplane as
described in the service action for “1109.”
You do not replace either battery at this time.
1109
Battery or possibly battery backplane
requires replacing.
Explanation: Battery or possibly battery backplane
requires replacing.
User response: Complete the following steps:
1. Replace the drive bay battery.
2. Check to see whether the node error is fixed. If not,
replace the battery backplane.
3. To verify that the new battery backplane is working
correctly, check that the node error is fixed.
Possible Cause-FRUs or other:
v Drive bay battery (95%)
v Battery backplane (5%)
1110
The power management board detected
a voltage that is outside of the set
thresholds.
Explanation: The power management board detected
a voltage that is outside of the set thresholds.
User response:
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
2. Check node status:
v If all nodes show a status of online, mark the
error as fixed.
v If any nodes do not show a status of online, go
to the start MAP.
v If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to repair verification MAP.
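The node status check in step 2 recurs in many of these procedures and can be expressed as a small decision function. This is an illustrative sketch only: the function name, its arguments, and the returned strings are assumptions, and the status values are expected to be collected by the operator from the node status view. How to treat a return visit when all nodes are online is an interpretation made here, not something that the procedure states.

```python
from typing import Iterable


def next_step_after_fru_replacement(node_statuses: Iterable[str],
                                    returned_to_this_step: bool = False) -> str:
    """Decide the next action after the failing FRUs have been replaced."""
    statuses = [status.strip().lower() for status in node_statuses]
    if not statuses:
        raise ValueError("at least one node status is required")
    if all(status == "online" for status in statuses):
        return "Mark the error as fixed, then go to the repair verification MAP."
    if returned_to_this_step:
        # The start MAP was already followed and a node is still not online.
        return "Contact your support center."
    return "Go to the start MAP."


if __name__ == "__main__":
    print(next_step_after_fru_replacement(["online", "offline"]))
```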
Possible Cause, FRUs, or other:
2145-CG8 or 2145-CF8
v Power supply unit (50%)
v System board (50%)
1111
Batteries have insufficient charge.
Explanation: The insufficient charge message can
appear for various reasons: the battery is charging,
the battery is missing or has failed, there is a
communication error, or there has been an
over-temperature event.
User response: This node error can be corrected by
correcting each of the underlying battery problems.
1. If a battery is missing, replace the battery.
2. If a battery is failed, replace the battery.
3. If a battery is charging, this error should go away
when the battery is charged.
1114
Enclosure battery fault type 1
Explanation: Enclosure battery fault type 1.
User response: Replace the battery.
4. If a battery has a communication error (comm
error), try to reseat the battery as described in the
replacement procedure. If reseating the battery does
not correct the problem, replace the battery.
Possible Cause-FRUs or other:
5. If a battery is too hot, the system can be started
after it has cooled.
1115
Inspect the battery for damage after an
over-temperature event.
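The five battery conditions in the steps for the insufficient charge error each have their own corrective action. As a minimal sketch, they can be kept in a lookup table. The condition labels and the function name are hypothetical and chosen only for this example. They are not values that the system reports.

```python
# Hypothetical condition labels mapped to the actions described in the steps.
BATTERY_ACTIONS = {
    "missing": "Replace the battery.",
    "failed": "Replace the battery.",
    "charging": "Wait; the error should go away when the battery is charged.",
    "comm error": "Reseat the battery; replace it if reseating does not help.",
    "too hot": "Let the battery cool, then inspect it for damage.",
}


def battery_actions(conditions):
    """Return the corrective action for each battery condition, in order."""
    return [BATTERY_ACTIONS[condition] for condition in conditions]


if __name__ == "__main__":
    for action in battery_actions(["charging", "comm error"]):
        print(action)
```

Before reseating a battery, confirm that the other battery has enough charge, as the note for this error warns.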
Battery (100%)
Enclosure Battery fault type 4
Explanation: Enclosure Battery fault type 4.
User response: Reseat the battery. Replace the battery
if the error continues.
Possible Cause-FRUs or other:
If both batteries have errors, battery charging might be
underway. (No FRU)
Note: Do not reseat a battery unless the other battery
has enough charge, or data loss might occur.
Possible Cause-FRUs or other:
If both batteries have errors that do not resolve after a
sufficient time to charge, battery charging might be
impaired, such as by a faulty battery backplane FRU.
Battery (95%)
Communication errors are often correctable by
reseating the battery or by allowing the temperature of
the battery to cool without the need to replace the
battery. (No FRU)
Bad connection (5%)
Other:
1120
A high speed SAS adapter is missing.
If a battery is missing or failed, the solution is to
replace the battery FRU.
Explanation: This node has detected that a high speed
SAS adapter that was previously installed is no longer
present.
Battery (50%)
User response: If the high speed SAS adapter was
deliberately removed, mark the error “fixed.”
Other:
No FRU (50%)
Otherwise, the high speed SAS adapter has failed and
must be replaced. In the sequence shown, exchange the
FRUs for new FRUs.
Go to the repair verification MAP.
1112
Enclosure battery is missing.
Possible Cause-FRUs or other:
Explanation: Enclosure battery is missing.
1. High speed SAS adapter (90%)
User response: Install a battery in the missing slot. If
the battery is present in the slot, reseat the battery.
2. System board (10%)
1121
Attention: Do not reseat a battery unless the other
battery has enough charge, or data loss might occur.
Possible Cause-FRUs or other:
Battery (95%)
Other:
No FRU (5%)
A high speed SAS adapter has failed.
Explanation: A fault has been detected on a high
speed SAS adapter.
User response: In the sequence shown, exchange the
FRUs for new FRUs.
Go to the repair verification MAP.
Possible Cause-FRUs or other:
1. High speed SAS adapter (90%)
2. System board (10%)
1122
A high speed SAS adapter error has
occurred.
Explanation: The high speed SAS adapter has
detected a PCI bus error and requires service before it
can be restarted. The high speed SAS adapter failure
has caused all of the flash drives that were being
accessed through this adapter to go Offline.
User response: If this is the first time that this error
has occurred on this node, complete the following
steps:
1. Power off the node.
User response: Replace the PSU with a supported
version.
Attention: To avoid losing state and data from the
node, use the satask startservice command to put
the node into service state so that it no longer processes
I/O. Then you can remove and replace the top power
supply unit (PSU 2). This precaution is due to a
limitation in the power-supply configuration. Once the
service action is complete, run the satask stopservice
command to let the node rejoin the system.
Possible Cause-FRUs or other:
2. Reseat the high speed SAS adapter.
3. Power on the node.
PSU (100%)
4. Submit the lsmdisk task and ensure that all of the
flash drive managed disks that are located in this
node have a status of Online.
1126
If the sequence of actions above has not resolved the
problem or the error occurs again on the same node,
complete the following steps:
User response:
1. In the sequence shown, exchange the FRUs for new
FRUs.
2. Submit the lsmdisk task and ensure that all of the
flash drive managed disks that are located in this
node have a status of Online.
3. Go to the repair verification MAP.
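The check that all managed disks are Online after the lsmdisk task can be scripted against captured command output. The following sketch assumes that the output was saved as colon-delimited text with a header row that contains name and status columns. The sample text is invented for illustration, and the sketch checks every row rather than only the flash drive managed disks in one node.

```python
from typing import List


def mdisks_not_online(lsmdisk_output: str, delim: str = ":") -> List[str]:
    """Return the names of managed disks whose status is not online."""
    lines = [line for line in lsmdisk_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split(delim)
    # Locate the columns by name instead of assuming fixed positions.
    name_index = header.index("name")
    status_index = header.index("status")
    not_online = []
    for line in lines[1:]:
        fields = line.split(delim)
        if fields[status_index].strip().lower() != "online":
            not_online.append(fields[name_index])
    return not_online


if __name__ == "__main__":
    # Hypothetical sample; real output has more columns.
    sample = "id:name:status\n0:mdisk0:online\n1:mdisk1:offline\n"
    print(mdisks_not_online(sample))
```

An empty result means that every listed managed disk reports Online, so you can continue to the repair verification MAP.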
Possible Cause-FRUs or other:
1. High speed SAS adapter (90%)
2. System board (10%)
Power Supply Unit fault type 2
Explanation: A fault exists on the power supply unit
(PSU).
1. Reseat the PSU in the enclosure.
Attention: To avoid losing state and data from the
node, use the satask startservice command to put
the node into service state so that it no longer
processes I/O. Then you can remove and replace
the top power supply unit (PSU 2). This precaution
is due to a limitation in the power-supply
configuration. Once the service action is complete,
run the satask stopservice command to let the
node rejoin the system.
2. If the fault is not resolved, replace the PSU.
Possible Cause-FRUs or other:
1124
Power Supply Unit fault type 1
1. No Part (30%)
Explanation: A fault has been detected on a power
supply unit (PSU).
2. PSU (70%)
User response: Replace the PSU.
1128
Attention: To avoid losing state and data from the
node, use the satask startservice command to put
the node into service state so that it no longer processes
I/O. Then you can remove and replace the top power
supply unit (PSU 2). This precaution is due to a
limitation in the power-supply configuration. Once the
service action is complete, run the satask stopservice
command to let the node rejoin the system.
Possible Cause-FRUs or other:
PSU (100%)
1125
Power Supply Unit fault type 1
Explanation: The power supply unit (PSU) is not
supported.
Power Supply Unit missing
Explanation: The power supply unit (PSU) is not
seated in the enclosure, or no PSU is installed.
User response:
1. If no PSU is installed, install a PSU.
2. If a PSU is installed, reseat the PSU in the
enclosure.
Attention: To avoid losing state and data from the
node, use the satask startservice command to put
the node into service state so that it no longer processes
I/O. Then you can remove and replace the top power
supply unit (PSU 2). This precaution is due to a
limitation in the power-supply configuration. Once the
service action is complete, run the satask stopservice
command to let the node rejoin the system.
Possible Cause-FRUs or other:
1. No Part (5%)
2. PSU (95%)
Reseat the power supply unit in the enclosure.
Possible Cause-FRUs or other:
Power supply (100%)
1129
The node battery is missing.
Explanation: Install new batteries to enable the node
to join a clustered system.
User response: Install a battery in battery slot 1 (on
the left from the front) and in battery slot 2 (on the
right). Leave the node running as you add the batteries.
Align each battery so that the guide rails in the
enclosure engage the guide rail slots on the battery.
Push the battery firmly into the battery bay until it
stops. The cam on the front of the battery remains
closed during this installation.
To verify that the new battery works correctly, check
that the node error is fixed. After the node joins a
clustered system, use the lsnodebattery command to
view information about the battery.
Possible Cause-FRUs or other:
• Battery (100%)
1130
The node battery requires replacing.
Explanation: When a battery must be replaced, you
get this message. The proper response is to install new
batteries.
User response: Battery 1 is on the left (from the front),
and battery 2 is on the right. Remove the old battery by
disengaging and pulling down the cam handle to lever
out the battery enough to pull the battery from the
enclosure.
This service procedure is intended for a failed or offline
battery. To prevent losing data from a battery that is
online, run the svctask chnodebattery -remove
-battery battery_ID node_ID command. The command
verifies when it is safe to remove the battery.
Install new batteries in battery slot 1 and in battery slot
2. Leave the node running as you add the batteries.
Align each battery so that the guide rails in the
enclosure engage the guide rail slots on the battery.
Push the battery firmly into the battery bay until it
stops. The cam on the front of the battery remains
closed during this installation.
To verify that the new battery works correctly, check
that the node error is fixed. After the node joins a
clustered system, use the lsnodebattery command to
view information about the battery.
1131
Battery conditioning is required but not
possible.
Explanation: Battery conditioning is required but not
possible.
One possible cause of this error is that battery
reconditioning cannot occur in a clustered node if the
partner node is not online.
User response: This error can clear without intervention.
For example, if the partner node comes online, the
reconditioning begins.
Possible Cause-FRUs or other:
Other:
Wait, or address other errors.
1133
A duplicate WWNN has been detected.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 556. See
the details of node error 556 for more information.
User response: See node error 556.
1135
The 2145 UPS has reported an ambient
over temperature.
Explanation: The 2145 UPS has reported an ambient
over temperature. The uninterruptible power supply
switches to Bypass mode to allow the 2145 UPS to cool.
User response:
1. Power off the nodes attached to the 2145 UPS.
2. Turn off the 2145 UPS, and then unplug the 2145
UPS from the main power source.
3. Ensure that the air vents of the 2145 UPS are not
obstructed.
4. Ensure that the air flow around the 2145 UPS is not
restricted.
5. Wait for at least five minutes, and then restart the
2145 UPS. If the problem remains, check the
ambient temperature. Correct the problem.
Otherwise, exchange the FRU for a new FRU.
6. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
7. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS electronics unit (50%)
Other:
The system ambient temperature is outside the
specification (50%)
1136
The 2145 UPS-1U has reported an
ambient over temperature.
Explanation: The 2145 UPS-1U has reported an
ambient over temperature.
User response:
1. Power off the node attached to the 2145 UPS-1U.
2. Turn off the 2145 UPS-1U, and then unplug the 2145
UPS-1U from the main power source.
3. Ensure that the air vents of the 2145 UPS-1U are not
obstructed.
4. Ensure that the air flow around the 2145 UPS-1U is
not restricted.
5. Wait for at least five minutes, and then restart the
2145 UPS-1U. If the problem remains, check the
ambient temperature. Correct the problem.
Otherwise, exchange the FRU for a new FRU.
6. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
7. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS-1U assembly (50%)
Other:
The system ambient temperature is outside the
specification (50%)
1138
Power supply unit input power failed.
Explanation: Power Supply Unit input power failed.
User response: Check the power cord.
1. Check that the power cord is plugged in.
2. Check that the wall power is good.
3. Replace the power cable.
4. Replace the power supply unit.
Possible Cause-FRUs or other:
Power cord (20%)
PSU (5%)
Other:
No FRU (75%)
1140
The 2145 UPS has reported that it has a
problem with the input AC power.
Explanation: The 2145 UPS has reported that it has a
problem with the input AC power.
User response:
1. Check the input AC power, whether it is missing or
out of specification. Correct if necessary. Otherwise,
exchange the FRU for a new FRU.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• 2145 UPS input power cable (10%)
• Electronics assembly (10%)
Other:
• The input AC power is missing (40%)
• The input AC power is not in specification (40%)
1141
The 2145 UPS-1U has reported that it
has a problem with the input AC power.
Explanation: The 2145 UPS-1U has reported that it has
a problem with the input AC power.
User response:
1. Check the input AC power, whether it is missing or
out of specification. Correct if necessary. Otherwise,
exchange the FRU for a new FRU.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• 2145 UPS-1U input power cable (10%)
• 2145 UPS-1U assembly (10%)
Other:
• The input AC power is missing (40%)
• The input AC power is not in specification (40%)
1145
The signal connection between a 2145
and its 2145 UPS is failing.
Explanation: The signal connection between a 2145
and its 2145 UPS is failing.
User response:
1. If other nodes that are using this uninterruptible
power supply are reporting this error, exchange the
2145 UPS electronics unit for a new one.
2. If only this node is reporting the problem, check the
signal cable, exchange the FRUs for new FRUs in
the sequence shown.
3. Check the node status:
• If all nodes show a status of online, mark the
error as fixed.
• If any nodes do not show a status of online, go
to the start MAP.
• If you return to this step, contact your support
center to resolve the problem with the
uninterruptible power supply.
4. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
2145-CF8 or 2145-CG8
N/A
1146
The signal connection between a 2145
and its 2145 UPS-1U is failing.
Explanation: The signal connection between a node
and its UPS is failing.
User response:
1. In the sequence that is shown in the log, replace
any failing FRUs with new FRUs.
2. Check node status:
• If all nodes show a status of online, mark the
error as fixed.
• If any nodes do not show a status of online, go
to the start MAP.
• If you return to this step, contact your support
center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
2145-CF8 or 2145-CG8
• Power cable assembly (40%)
• 2145 UPS-1U assembly (30%)
• System board (30%)
1150
Data that the 2145 has received from the
2145 UPS suggests the 2145 UPS power
cable, the signal cable, or both, are not
connected correctly.
Explanation: Data that the 2145 has received from the
2145 UPS suggests the 2145 UPS power cable, the
signal cable, or both, are not connected correctly.
User response:
1. Connect the cables correctly. See your product's
installation guide.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• None
Other:
• Configuration error
1151
Data that the 2145 has received from the
2145 UPS-1U suggests the 2145 UPS-1U
power cable, the signal cable, or both,
are not connected correctly.
Explanation: Data that the 2145 has received from the
2145 UPS-1U suggests the 2145 UPS-1U power cable,
the signal cable, or both, are not connected correctly.
User response:
1. Connect the cables correctly. See your product's
installation guide.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• None
Other:
• Configuration error
1152
Incorrect type of uninterruptible power
supply detected.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 587. See
the details of node error 587 for more information.
User response: See node error 587.
1155
A power domain error has occurred.
Explanation: Both 2145s of a pair are powered by the
same uninterruptible power supply.
User response:
1. List the 2145s of the cluster and check that 2145s in
the same I/O group are connected to a different
uninterruptible power supply.
2. Connect one of the 2145s as identified in step 1 to a
different uninterruptible power supply.
3. Mark the error that you have just repaired, “fixed”.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
• None
Other:
• Configuration error
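The check in step 1 for error 1155 can be sketched in a few lines of Python. This is only an illustration: the inventory records, the field names (name, io_group, ups_serial), and the function find_power_domain_conflicts are hypothetical and are not produced by any system command. Populate the records by hand from your own node and cabling records.

```python
from collections import defaultdict


def find_power_domain_conflicts(nodes):
    """Return {io_group: {ups_serial: [node names]}} for every I/O group
    in which two or more nodes draw power from the same UPS."""
    by_group = defaultdict(lambda: defaultdict(list))
    for node in nodes:
        by_group[node["io_group"]][node["ups_serial"]].append(node["name"])

    conflicts = {}
    for group, by_ups in by_group.items():
        # A UPS that feeds more than one node of the same I/O group
        # is a power domain error.
        shared = {ups: names for ups, names in by_ups.items() if len(names) > 1}
        if shared:
            conflicts[group] = shared
    return conflicts


# Hypothetical inventory: io_grp0 is miscabled, io_grp1 is correct.
inventory = [
    {"name": "node1", "io_group": "io_grp0", "ups_serial": "UPS_A"},
    {"name": "node2", "io_group": "io_grp0", "ups_serial": "UPS_A"},
    {"name": "node3", "io_group": "io_grp1", "ups_serial": "UPS_B"},
    {"name": "node4", "io_group": "io_grp1", "ups_serial": "UPS_C"},
]

for group, shared in find_power_domain_conflicts(inventory).items():
    for ups, names in shared.items():
        print(f"{group}: {', '.join(names)} share {ups}")
```

An empty result means that no I/O group in the inventory has both of its nodes on one uninterruptible power supply.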
1160
The output load on the 2145 UPS
exceeds the specification.
Explanation: The 2145 UPS is reporting that too much
power is being drawn from it. The power overload
warning LED, which is above the load level indicators,
on the 2145 UPS will be on.
User response:
1. Determine the 2145 UPS that is reporting the error
from the error event data. Perform the following
steps on just this uninterruptible power supply.
2. Check that the 2145 UPS is still reporting the error.
If the power overload warning LED is no longer on,
go to step 6.
3. Ensure that only 2145s are receiving power from the
uninterruptible power supply. Ensure that there are
no switches or disk controllers that are connected to
the 2145 UPS.
4. Remove each connected 2145 input power in turn,
until the output overload is removed.
5. Exchange the FRUs for new FRUs in the sequence
shown, on the overcurrent 2145.
6. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145 UPS.
7. Go to repair verification MAP.
Possible Cause-FRUs or other:
• Power cable assembly (50%)
• Power supply assembly (40%)
• 2145 UPS electronics assembly (10%)
1161
The output load on the 2145 UPS-1U
exceeds the specifications (reported by
2145 UPS-1U alarm bits).
Explanation: The output load on the 2145 UPS-1U
exceeds the specifications (reported by 2145 UPS-1U
alarm bits).
User response:
1. Ensure that only 2145s are receiving power from the
uninterruptible power supply. Also, ensure that no
other devices are connected to the 2145 UPS-1U.
2. Exchange, in the sequence shown, the FRUs for new
FRUs. If the Overload Indicator is still illuminated
with all outputs disconnected, replace the 2145
UPS-1U.
3. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145 UPS-1U.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
• Power cable assembly (50%)
• Power supply assembly (40%)
• 2145 UPS-1U assembly (10%)
1165
The 2145 UPS output load is
unexpectedly high. The 2145 UPS output
is possibly connected to an extra
non-2145 load.
Explanation: The 2145 UPS output load is
unexpectedly high. The 2145 UPS output is possibly
connected to an extra non-2145 load.
User response:
1. Ensure that only 2145s are receiving power from the
uninterruptible power supply. Ensure that there are
no switches or disk controllers that are connected to
the 2145 UPS.
2. Check node status. If all nodes show a status of
“online”, the problem no longer exists. Mark the
error that you have just repaired “fixed” and go to
the repair verification MAP.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
None
Other:
• Configuration error
1166
The 2145 UPS-1U output load is
unexpectedly high.
Explanation: The uninterruptible power supply output
is possibly connected to an extra non-2145 load.
User response:
1. Ensure that there are no other devices that are
connected to the 2145 UPS-1U.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145 UPS-1U.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• 2145 UPS-1U assembly (5%)
Other:
• Configuration error (95%)
1170
2145 UPS electronics fault (reported by
the 2145 UPS alarm bits).
Explanation: 2145 UPS electronics fault (reported by
the 2145 UPS alarm bits).
User response:
1. Replace the uninterruptible power supply
electronics assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the UPS.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS electronics assembly (100%)
1171
2145 UPS-1U electronics fault (reported
by the 2145 UPS-1U alarm bits).
Explanation: 2145 UPS-1U electronics fault (reported
by the 2145 UPS-1U alarm bits).
User response:
1. Replace the uninterruptible power supply assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145 UPS-1U.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS-1U assembly (100%)
1175
A problem has occurred with the
uninterruptible power supply frame
fault (reported by uninterruptible power
supply alarm bits).
Explanation: A problem has occurred with the
uninterruptible power supply frame fault (reported by
the uninterruptible power supply alarm bits).
User response:
1. Replace the uninterruptible power supply assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
Uninterruptible power supply assembly (100%)
1179
Too many drives attached to the system.
Explanation: The cluster only supports a fixed number
of drives. A drive has been added that makes the
number of drives larger than the total number of
supported drives per cluster.
User response:
1. Disconnect any excessive unmanaged enclosures
from the system.
2. Un-manage any offline drives that are not present
in the system.
3. Identify unused drives and remove them from the
enclosures.
4. Identify arrays of drives that are no longer required.
5. Remove the arrays and remove the drives from the
enclosures if they are present.
6. Once there are fewer than 1056 drives in the
system, consider re-engineering system capacity by
migrating data from small arrays onto large arrays,
then removing the small arrays and the drives that
formed them. Consider the need for an additional
Storwize system in your SAN solution.
1180
2145 UPS battery fault (reported by 2145
UPS alarm bits).
Explanation: 2145 UPS battery fault (reported by 2145
UPS alarm bits).
User response:
1. Replace the 2145 UPS battery assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS battery assembly (100%)
1181
2145 UPS-1U battery fault (reported by
2145 UPS-1U alarm bits).
Explanation: 2145 UPS-1U battery fault (reported by
2145 UPS-1U alarm bits).
User response:
1. Replace the 2145 UPS-1U battery assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS-1U battery assembly (100%)
1182
Ambient temperature is too high during
system startup.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 528. See
the details of node error 528 for more information.
User response: See node error 528.
1183
The nodes hardware configuration does
not meet the minimum requirements.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 562. See
the details of node error 562 for more information.
User response: See node error 562.
1185
2145 UPS fault, with no specific FRU
identified (reported by uninterruptible
power supply alarm bits).
Explanation: 2145 UPS fault, with no specific FRU
identified (reported by 2145 UPS alarm bits).
User response:
1. In the sequence shown, exchange the FRU for a
new FRU.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the 2145 UPS.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• 2145 UPS electronics assembly (60%)
• 2145 UPS battery assembly (20%)
• 2145 UPS assembly (20%)
1186
A problem has occurred in the 2145
UPS-1U, with no specific FRU identified
(reported by 2145 UPS-1U alarm bits).
Explanation: A problem has occurred in the 2145
UPS-1U, with no specific FRU identified (reported by
2145 UPS-1U alarm bits).
User response:
1. In the sequence shown, exchange the FRU for a
new FRU.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS-1U assembly (100%)
1187
Node software is inconsistent or
damaged
Explanation: The cluster is reporting that a node is
not operational because of critical node errors 523, 573,
574. See the details of node errors 523, 573, 574 for
more information.
User response: See node errors 523, 573, 574.
1188
Too many software crashes have
occurred.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 564. See
the details of node error 564 for more information.
User response: See node error 564.
1189
The node is held in the service state.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 690. See
the details of node error 690 for more information.
User response: See node error 690.
1190
The 2145 UPS battery has reached its
end of life.
Explanation: The 2145 UPS battery has reached its end
of life.
User response:
1. Replace the 2145 UPS battery assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS battery assembly (100%)
1191
The 2145 UPS-1U battery has reached its
end of life.
Explanation: The 2145 UPS-1U battery has reached its
end of life.
User response:
1. Replace the 2145 UPS-1U battery assembly.
2. Check node status. If all nodes show a status of
“online”, mark the error that you have just repaired
“fixed”. If any nodes do not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the uninterruptible power supply.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145 UPS-1U battery assembly (100%)
1192
Unexpected node error
Explanation: A node is missing from the cluster. The
error that it is reporting is not recognized by the
system.
User response: Find the node that is in service state
and use the service assistant to determine why it is not
active.
1193
The UPS battery charge is not enough to
allow the node to start.
Explanation: The cluster is reporting that a node is
not operational because of critical node error 587.
User response: See the details of node error 587 for
more information.
1194
Automatic recovery of offline node has
failed.
Explanation: The cluster has an offline node and has
determined that one of the candidate nodes matches
the characteristics of the offline node. The cluster has
attempted but failed to add the node back into the
cluster. The cluster has stopped attempting to
automatically add the node back into the cluster.
If a node has incomplete state data, it remains offline
after it starts. This occurs if the node has had a loss of
power or a hardware failure that prevented it from
completing the writing of all of the state data to disk.
The node reports a node error 578 when it is in this
state.
If three attempts to automatically add a matching
candidate node to a cluster have been made, but the
node has not returned online for 24 hours, the cluster
stops automatic attempts to add the node and logs
error code 1194 “Automatic recovery of offline node
failed”.
Two possible scenarios when this error event is logged
are:
1. The node has failed without saving all of its state
data. The node has restarted, possibly after a repair,
and shows node error 578 and is a candidate node
for joining the cluster. The cluster attempts to add
the node into the cluster but does not succeed. After
15 minutes, the cluster makes a second attempt to
add the node into the cluster and again does not
succeed. After another 15 minutes, the cluster
makes a third attempt to add the node into the
cluster and again does not succeed. After another 15
minutes, the cluster logs error code 1194. The node
never came online during the attempts to add it to
the cluster.
2. The node has failed without saving all of its state
data. The node has restarted, possibly after a repair,
and shows node error 578 and is a candidate node
for joining the cluster. The cluster attempts to add
the node into the cluster and succeeds and the node
becomes online. Within 24 hours the node fails
again without saving its state data. The node
restarts and shows node error 578 and is a
candidate node for joining the cluster. The cluster
again attempts to add the node into the cluster,
succeeds, and the node becomes online; however,
the node again fails within 24 hours. The cluster
attempts a third time to add the node into the
cluster, succeeds, and the node becomes online;
however, the node again fails within 24 hours. After
another 15 minutes, the cluster logs error code 1194.
A combination of these scenarios is also possible.
Note: If the node is manually removed from the cluster,
the count of automatic recovery attempts is reset to
zero.
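The attempt counting that leads to error 1194 can be modeled with a short Python sketch. It is a simplified illustration of the behavior described above, not the cluster's implementation. The class AutoRecoveryTracker and its method names are hypothetical. The sketch also assumes that the attempt count starts over once a node has stayed online for 24 hours, which the text implies but does not state outright, and it ignores the 15-minute waits between attempts.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3          # automatic add attempts before the cluster gives up
STABLE_ONLINE_HOURS = 24  # a node must stay online this long to count as recovered


@dataclass
class AutoRecoveryTracker:
    """Hypothetical model of the automatic recovery count for one node."""
    failed_attempts: int = 0

    @property
    def stopped(self) -> bool:
        # True once the cluster no longer tries to add the node automatically.
        return self.failed_attempts >= MAX_ATTEMPTS

    def record_attempt(self, online_hours: float) -> bool:
        """Record one automatic add attempt.

        online_hours is how long the node then stayed online
        (0 if it never came online). Returns True if this attempt
        is the one that causes error 1194 to be logged.
        """
        if self.stopped:
            return False
        if online_hours >= STABLE_ONLINE_HOURS:
            # Assumption: a stable node starts again from a clean count.
            self.failed_attempts = 0
            return False
        self.failed_attempts += 1
        return self.stopped

    def manual_remove(self) -> None:
        # Removing the node manually resets the count to zero.
        self.failed_attempts = 0


# Scenario 1: the node never comes online in three attempts.
scenario_1 = AutoRecoveryTracker()
print([scenario_1.record_attempt(0) for _ in range(3)])

# Scenario 2: the node joins each time but fails again within 24 hours.
scenario_2 = AutoRecoveryTracker()
print([scenario_2.record_attempt(hours) for hours in (5, 10, 2)])
```

In both scenarios the third attempt is the one that returns True, which corresponds to the point where the cluster logs error code 1194.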
User response:
1. If the node has been continuously online in the
cluster for more than 24 hours, mark the error as
fixed and go to the Repair Verification MAP.
2. Determine the history of events for this node by
locating events for this node name in the event log.
Note that the node ID will change, so match on the
WWNN and node name. Also, check the service
records. Specifically, note entries indicating one of
three events: 1) the node is missing from the cluster
(cluster error 1195 event 009052), 2) an attempt to
automatically recover the offline node is starting
(event 980352), 3) the node has been added to the
cluster (event 980349).
3. If the node has not been added to the cluster since
the recovery process started, there is probably a
hardware problem. The node's internal disk might
be failing in such a way that the node cannot modify
its software level to match the software level of the
cluster. If you have not yet determined the root
cause of the problem, you can attempt to manually
remove the node from the cluster and add the node
back into the cluster. Continuously monitor the
status of the nodes in the cluster while the cluster is
attempting to add the node. Note: If the node type
is not supported by the software version of the
cluster, the node will not appear as a candidate
node. Therefore, incompatible hardware is not a
potential root cause of this error.
4. If the node was added to the cluster but failed
again before it has been online for 24 hours,
investigate the root cause of the failure. If no events
in the event log indicate the reason for the node
failure, collect dumps and contact IBM technical
support for assistance.
5. When you have fixed the problem with the node,
you must use either the cluster console or the
command line interface to manually remove the
node from the cluster and add the node into the
cluster.
6. Mark the error as fixed and go to the verification
MAP.
Possible Cause-FRUs or other:
None, although investigation might indicate a
hardware failure.
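Step 2 of this procedure, building the node's event history, can be sketched as a small filter in Python. The three event IDs come from the step itself. Everything else is an assumption made for illustration: the record layout (event_id, node_name, wwnn, node_id, timestamp), the sample values, and the function node_history are hypothetical and do not reflect the real event log format. Transcribe or export the log entries into whatever structure suits you.

```python
# Event IDs named in step 2 and what each one indicates.
RECOVERY_EVENTS = {
    "009052": "node is missing from the cluster (cluster error 1195)",
    "980352": "attempt to automatically recover the offline node is starting",
    "980349": "node has been added to the cluster",
}


def node_history(events, node_name, wwnn):
    """Return the recovery-related events for one node, oldest first.

    Matching uses the node name and WWNN, not the node ID, because
    the node ID changes when the node is added back to the cluster.
    """
    selected = [
        event for event in events
        if event["event_id"] in RECOVERY_EVENTS
        and event["node_name"] == node_name
        and event["wwnn"] == wwnn
    ]
    return sorted(selected, key=lambda event: event["timestamp"])


# Hypothetical sample records; note the node ID changing from 2 to 5.
sample_events = [
    {"timestamp": "2015-11-03T10:45:00", "event_id": "980349",
     "node_name": "node2", "wwnn": "500507680100ABCD", "node_id": 5},
    {"timestamp": "2015-11-03T10:15:00", "event_id": "009052",
     "node_name": "node2", "wwnn": "500507680100ABCD", "node_id": 2},
    {"timestamp": "2015-11-03T10:30:00", "event_id": "980352",
     "node_name": "node2", "wwnn": "500507680100ABCD", "node_id": 2},
    {"timestamp": "2015-11-03T10:20:00", "event_id": "009052",
     "node_name": "node1", "wwnn": "500507680100FFFF", "node_id": 1},
]

for event in node_history(sample_events, "node2", "500507680100ABCD"):
    print(event["timestamp"], event["event_id"], RECOVERY_EVENTS[event["event_id"]])
```

Reading the filtered history in time order shows whether the node ever rejoined the cluster (step 3) or rejoined and then failed again (step 4).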
1195
Node missing.
Explanation: You can resolve this problem by
repairing the failure on the missing node.
User response:
1. If it is not obvious which node in the cluster has
failed, check the status of the nodes and find the
node with a status of offline.
2. Go to the Start MAP and perform the repair on the
failing node.
3. When the repair has been completed, this error is
automatically marked as fixed.
4. Check node status. If all nodes show a status of
“online”, but the error in the log has not been
marked as fixed, manually mark the error that you
have just repaired “fixed”. If any nodes do not
show a status of “online”, go to start MAP. If you
return to this step, contact your support center to
resolve the problem with the node.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
• None
1198
Detected hardware is not a valid
configuration.
Explanation: A hardware change was made to this
node that is not supported by its software. Either a
hardware component failed, or the node was
incorrectly upgraded.
User response: Complete the following steps:
1. If required, power the node off for servicing.
2. If new hardware is correctly installed, but it is listed
as an invalid configuration, then update the
software to a level that supports the new hardware.
Use the management GUI to install this level if
necessary.
3. If you upgraded the software to make the hardware
work, there is a new event after the upgrade
requesting that you enable the new hardware.
Possible Cause-FRUs or other:
• None
1200
The configuration is not valid. Too
many devices, MDisks, or targets have
been presented to the system.
Explanation: The configuration is not valid. Too many
devices, MDisks, or targets have been presented to the
system.
User response:
1. Remove unwanted devices from the Fibre Channel
network fabric.
2. Start a cluster discovery operation to find
devices/disks by rescanning the Fibre Channel
network.
3. List all connected managed disks. Check with the
customer that the configuration is as expected.
Mark the error that you have just repaired fixed.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
• None
Other:
Fibre Channel network fabric fault (100%)
1201
A flash drive requires a recovery.
Explanation: The flash drive that is identified by this
error needs to be recovered.
User response: To recover this flash drive, submit the
following command: chdrive -task recover drive_id
where drive_id is the identity of the drive that needs to
be recovered.
1202
A flash drive is missing from the
configuration.
Explanation: The offline flash drive identified by this
error must be repaired.
User response: In the management GUI, click
Troubleshooting > Recommended Actions to run the
recommended action for this error. Otherwise, use MAP
6000 to replace the drive.
1203
A duplicate Fibre Channel frame has
been received.
Explanation: A duplicate Fibre Channel frame should
never be detected. Receiving a duplicate Fibre Channel
frame indicates that there is a problem with the Fibre
Channel fabric. Other errors related to the Fibre
Channel fabric might be generated.
User response:
1. Use the transmitting and receiving WWPNs
indicated in the error data to determine the section
of the Fibre Channel fabric that has generated the
duplicate frame. Search for the cause of the problem
by using fabric monitoring tools. The duplicate
frame might be caused by a design error in the
topology of the fabric, by a configuration error, or
by a software or hardware fault in one of the
components of the Fibre Channel fabric, including
inter-switch links.
2. When you are satisfied that the problem has been
corrected, mark the error that you have just
repaired “fixed”.
3. Go to MAP 5700: Repair verification.
Possible Cause-FRUs or other:
• Fibre Channel cable assembly (1%)
• Fibre Channel adapter (1%)
Other:
• Fibre Channel network fabric fault (98%)
1210
A local Fibre Channel port has been
excluded.
Explanation: A local Fibre Channel port has been
excluded.
User response:
1. Repair faults in the order shown.
2. Check the status of the disk controllers. If all disk
controllers show a “good” status, mark the error
that you just repaired as “fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
• Fibre Channel cable assembly (75%)
• Small Form-factor Pluggable (SFP) connector (10%)
• Fibre Channel adapter (5%)
Other:
• Fibre Channel network fabric fault (10%)
1212
Power supply exceeded temperature
threshold.
Explanation: Power supply exceeded temperature
threshold.
User response: Complete the following steps:
1. Check airflow. Remove the top of the machine case
and check for missing baffles or internal blockages.
2. If problem persists, replace the power supply.
Possible Cause-FRUs or other:
• Power supply
1213
Boot drive missing, out of sync, or
failed.
Explanation: Boot drive missing, out of sync, or failed.
User response: Complete the following steps:
1. Look at a boot drive view to determine the missing,
failed or out of sync drive.
2. Insert a missing drive.
3. Replace a failed drive.
4. Synchronize an out of sync drive by running the
commands svctask chnodebootdrive -sync and/or
satask chbootdrive -sync.
Possible Cause-FRUs or other:
• System drive
1214
Boot drive is in the wrong slot.
Explanation: Boot drive is in the wrong slot.
User response: Complete the following steps:
1. Look at a boot drive view to determine which drive
is in the wrong slot, which node and slot it belongs
in, and which drive must be in this slot.
2. Swap the drive for the correct one but shut down
the node first if booted yes is shown for that drive
in boot drive view.
3. If you want to use the drive in this node,
synchronize the boot drives by running the
commands svctask chnodebootdrive -sync and/or
satask chbootdrive -sync.
4. The node error clears, or a new node error is
displayed for you to work on.
Possible Cause-FRUs or other:
v None
1215
A flash drive is failing.
Explanation: The flash drive has detected faults that
indicate that the drive is likely to fail soon. The drive
should be replaced. The cluster event log will identify a
drive ID for the flash drive that caused the error.
User response: In the management GUI, click
Troubleshooting > Recommended Actions to run the
recommended action for this error. If this does not
resolve the issue, contact your next level of support.
1216
SAS errors have exceeded thresholds.
Explanation: The cluster has experienced a large
number of SAS communication errors, which indicates
a faulty SAS component that must be replaced.
User response: In the sequence shown, exchange the
FRUs for new FRUs.
Go to the repair verification MAP.
Possible Cause-FRUs or other:
1. SAS Cable (70%)
2. High speed SAS adapter (20%)
3. SAS drive backplane (5%)
4. flash drive (5%)
1217
A flash drive has exceeded the
temperature warning threshold.
Explanation: The flash drive identified by this error
has reported that its temperature is higher than the
warning threshold.
User response: Take steps to reduce the temperature
of the drive.
1. Determine the temperature of the room, and reduce
the room temperature if this action is appropriate.
2. Replace any failed fans.
3. Ensure that there are no obstructions to air flow for
the node.
4. Mark the error as fixed. If the error recurs, contact
hardware support for further investigation.
Possible Cause-FRUs or other:
v Flash drive (10%)
Other:
v System environment or airflow blockage (90%)
1220
A remote Fibre Channel port has been
excluded.
Explanation: A remote Fibre Channel port has been
excluded.
User response:
1. View the event log. Note the MDisk ID associated
with the error code.
2. From the MDisk, determine the failing disk
controller ID.
3. Refer to the service documentation for the disk
controller and the Fibre Channel network to resolve
the reported problem.
4. After the disk drive is repaired, start a cluster
discovery operation to recover the excluded Fibre
Channel port by rescanning the Fibre Channel
network.
5. To restore MDisk online status, include the
managed disk that you noted in step 1.
6. Check the status of the disk controller. If all disk
controllers show a “good” status, mark the error
that you have just repaired as “fixed”.
7. If all disk controllers do not show a good status,
contact your support center to resolve the problem
with the disk controller.
8. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Enclosure/controller fault (50%)
v Fibre Channel network fabric (50%)
1230
A login has been excluded.
Explanation: A port to port fabric connection, or login,
between the cluster node and either a controller or
another cluster has had excessive errors. The login has
therefore been excluded, and will not be used for I/O
operations.
User response: Determine the remote system, which
might be either a controller or a cluster. Check the
event log for other 1230 errors. Ensure that all higher
priority errors are fixed.
This error event is usually caused by a fabric problem.
If possible, use the fabric switch or other fabric
diagnostic tools to determine which link or port is
reporting the errors. If there are error events for links
from this node to a number of different controllers or
clusters, then it is probably the node to switch link that
is causing the errors. Unless there are other contrary
indications, first replace the cable between the switch
and the remote system.
1. From the fabric analysis, determine the FRU that is
most likely causing the error. If this FRU has
recently been replaced while resolving a 1230 error,
choose the next most likely FRU that has not been
replaced recently. Exchange the FRU for a new FRU.
2. Mark the error as fixed. If the FRU replacement has
not fixed the problem, the error will be logged
again; however, depending on the severity of the
problem, the error might not be logged again
immediately.
3. Start a cluster discovery operation to recover the
login by re-scanning the Fibre Channel network.
4. Check the status of the disk controller or remote
cluster. If the status is not “good”, go to the Start
MAP.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
v Fibre Channel cable, switch to remote port, (30%)
v Switch or remote device SFP connector or adapter,
(30%)
v Fibre Channel cable, local port to switch, (30%)
v Cluster SFP connector, (9%)
v Cluster Fibre Channel adapter, (1%)
Note: The first two FRUs are not cluster FRUs.
1260
SAS cable fault type 2.
Explanation: SAS cable fault type 2.
User response: Complete the following steps:
Note: After each action, check to see whether the
canister ports at both ends of the cable are excluded. If
the ports are excluded, then enable them by issuing the
following command:
chenclosurecanister -excludesasport no -port X
1. Reset this canister and the upstream canister.
The upstream canister is identified in sense data as
enclosureid2, faultobjectlocation2...
2. Reseat cable between the two ports that are
identified in the sense data.
3. Replace cable between the two ports that are
identified in the sense data.
4. Replace this canister.
5. Replace other canister (enclosureid2).
Possible Cause-FRUs or other:
v SAS cable
v Canister
1298
A node has encountered an error
updating.
Explanation: One or more nodes has failed the
update.
User response: Check lsupdate for the node that
failed and continue troubleshooting with the error code
it provides.
1310
A managed disk is reporting excessive
errors.
Explanation: A managed disk is reporting excessive
errors.
User response:
1. Repair the enclosure/controller fault.
2. Check the managed disk status. If all managed
disks show a status of “online”, mark the error that
you have just repaired as “fixed”. If any managed
disks show a status of “excluded”, include the
excluded managed disks and then mark the error as
“fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Enclosure/controller fault (100%)
1311
A flash drive is offline due to excessive
errors.
Explanation: The drive that is reporting excessive
errors has been taken offline.
User response: In the management GUI, click
Troubleshooting > Recommended Actions to run the
recommended action for this error. If this does not
resolve the issue, contact your next level of support.
1320
A disk I/O medium error has occurred.
Explanation: A disk I/O medium error has occurred.
User response:
1. Check whether the volume the error is reported
against is mirrored. If it is, check if there is a “1870
Mirrored volume offline because a hardware read
error has occurred” error relating to this volume in
the event log. Also check if one of the mirror copies
is synchronizing. If all these tests are true then you
must delete the volume copy that is not
synchronized from the volume. Check that the
volume is online before continuing with the
following actions. Wait until the medium error is
corrected before trying to re-create the volume
mirror.
2. If the medium error was detected by a read from a
host, ask the customer to rewrite the incorrect data
to the logical block address (LBA) that is
reported in the host system's SCSI sense data. If an
individual block cannot be recovered it will be
necessary to restore the volume from backup. (If
this error has occurred during a migration, the host
system does not notice the error until the target
device is accessed.)
3. If the medium error was detected during a mirrored
volume synchronization, the block might not be
being used for host data. The medium error must
still be corrected before the mirror can be
established. It may be possible to fix the block that
is in error using the disk controller or host tools.
Otherwise, it will be necessary to use the host tools
to copy the volume content that is being used to a
new volume. Depending on the circumstances, this
new volume can be kept and mirrored, or the
original volume can be repaired and the data copied
back again.
4. Check managed disk status. If all managed disks
show a status of “online”, mark the error that you
have just repaired as “fixed”. If any managed disks
do not show a status of “online”, go to start MAP. If
you return to this step, contact your support center
to resolve the problem with the disk controller.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Enclosure/controller fault (100%)
1322
Data protection information mismatch.
Explanation: This error occurs when something has
broken the protection information in read or write
commands.
User response:
1. Determine if there is a single or multiple drives
logging the error. Because the SAS transport layer
can cause multiple drive errors, it is necessary to fix
other hardware errors first.
2. Check related higher priority hardware errors. Fix
higher priority errors before continuing.
3. Use lseventlog to determine if more than one drive
with this error has been logged in the last 24 hours.
If so, contact IBM support.
4. If only a single drive with this error has been
logged, the system is monitoring the drive for
health and will fail it if RAID is used to correct too
many errors of this kind.
1328
Encryption key required.
Explanation: It is necessary to provide an encryption
key before the system can become fully operational.
This error occurs when a system with encryption
enabled is restarted without an encryption key
available.
User response: Insert a USB flash drive containing a
valid key into one of the node canisters.
1330
A suitable managed disk (MDisk) or
drive for use as a quorum disk was not
found.
Explanation: A quorum disk is needed to enable a
tie-break when some cluster members are missing.
Three quorum disks are usually defined. By default, the
cluster automatically allocates quorum disks when
managed disks are created; however, the option exists
to manually assign quorum disks. This error is reported
when there are managed disks or image mode disks
but no quorum disks.
To become a quorum disk:
v The MDisk must be accessible by all nodes in the
cluster.
v The MDisk must be managed; that is, it must be a
member of a storage pool.
v The MDisk must have free extents.
v The MDisk must be associated with a controller that
is enabled for quorum support. If the controller has
multiple WWNNs, all of the controller components
must be enabled for quorum support.
A quorum disk might not be available because of a
Fibre Channel network failure or because of a Fibre
Channel switch zoning problem.
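As a rough illustration, the quorum disk conditions listed for error 1330 can be expressed as a simple check. This is a minimal Python sketch only: the MDisk record and all of its field names are hypothetical and do not correspond to any CLI output format.

```python
from dataclasses import dataclass


@dataclass
class MDisk:
    # All field names are hypothetical; they mirror the conditions in the text.
    name: str
    accessible_by_all_nodes: bool
    mode: str                        # for example "managed", "unmanaged", "image"
    free_extents: int
    controller_quorum_enabled: bool  # True only if every controller component allows quorum


def quorum_blockers(mdisk: MDisk) -> list[str]:
    """Return the reasons an MDisk cannot be a quorum disk (empty list if eligible)."""
    reasons = []
    if not mdisk.accessible_by_all_nodes:
        reasons.append("not accessible by all nodes")
    if mdisk.mode != "managed":
        reasons.append("not a member of a storage pool")
    if mdisk.free_extents <= 0:
        reasons.append("no free extents")
    if not mdisk.controller_quorum_enabled:
        reasons.append("controller not enabled for quorum support")
    return reasons


# Hypothetical example: an image mode MDisk on a controller without quorum support.
candidate = MDisk("mdisk7", True, "image", 120, False)
for reason in quorum_blockers(candidate):
    print(f"{candidate.name}: {reason}")
```

Listing every failed condition, rather than stopping at the first, matches the way the user response asks for several settings to be confirmed together.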
User response:
1. Resolve any known Fibre Channel network
problems.
2. Ask the customer to confirm that MDisks have been
added to storage pools and that those MDisks have
free extents and are on a controller that is enabled
for use as a provider of quorum disks. Ensure that
any controller with multiple WWNNs has all of its
components enabled to provide quorum disks.
Either create a suitable MDisk or if possible enable
quorum support on controllers with which existing
MDisks are associated. If at least one managed disk
shows a mode of managed and has a non-zero
quorum index, mark the error that you have just
repaired as “fixed”.
3. If the customer is unable to make the appropriate
changes, ask your software support center for
assistance.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Configuration error (100%)
1335
Quorum disk not available.
Explanation: Quorum disk not available.
User response:
1. View the event log entry to identify the managed
disk (MDisk) being used as a quorum disk, that is
no longer available.
2. Perform the disk controller problem determination
and repair procedures for the MDisk identified in
step 1.
3. Include the MDisks into the cluster.
4. Check the managed disk status. If the managed disk
identified in step 1 shows a status of “online”, mark
the error that you have just repaired as “fixed”. If
the managed disk does not show a status of
“online”, go to start MAP. If you return to this step,
contact your support center to resolve the problem
with the disk controller.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Enclosure/controller fault (100%)
1340
A managed disk has timed out.
Explanation: This error was reported because a large
number of disk timeout conditions have been detected.
The problem is probably caused by a failure of some
other component on the SAN.
User response:
1. Repair problems on all enclosures/controllers and
switches on the same SAN as this 2145 cluster.
2. If problems are found, mark this error as “fixed”.
3. If no switch or disk controller failures can be found,
take an event log dump and call your hardware
support center.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Enclosure/controller fault
v Fibre Channel switch
1350
IB ports are not operational.
Explanation: IB ports are not operational.
User response: An offline port can have many causes
and so it is necessary to check them all. Start with the
easiest and least intrusive possibility.
1. Reset the IB port with CLI command.
2. If the IB port is connected to a switch, double-check
the switch configuration for issues.
3. Reseat the IB cable on both the IB side and the
HBA/switch side.
4. Run a temporary second IB cable to replace the
current one to check for a cable fault.
5. If the system is in production, schedule a
maintenance downtime before continuing to the
next step. Other ports will be affected.
6. Reset the IB interface adapter; reset the canister;
reboot the system.
Possible Cause-FRUs or other:
External (cable, HCA, switch, and so on) (85%)
Interface (10%)
Canister (5%)
1360
A SAN transport error occurred.
Explanation: This error has been reported because the
2145 performed error recovery procedures in response
to SAN component associated transport errors. The
problem is probably caused by a failure of a component
of the SAN.
User response:
1. View the event log entry to determine the node that
logged the problem. Determine the 2145 node or
controller that the problem was logged against.
2. Perform Fibre Channel switch problem
determination and repair procedures for the
switches connected to the 2145 node or controller.
3. Perform Fibre Channel cabling problem
determination and repair procedures for the cables
connected to the 2145 node or controller.
4. If any problems are found and resolved in step 2
and 3, mark this error as “fixed”.
5. If no switch or cable failures were found in steps 2
and 3, take an event log dump. Call your hardware
support center.
6. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Fibre Channel switch
v Fibre Channel cabling
1370
A managed disk error recovery
procedure (ERP) has occurred.
Explanation: This error was reported because a large
number of disk error recovery procedures have been
performed by the disk controller. The problem is
probably caused by a failure of some other component
on the SAN.
User response:
1. View the event log entry and determine the
managed disk that was being accessed when the
problem was detected.
2. Perform the disk controller problem determination
and repair procedures for the MDisk determined in
step 1.
3. Perform problem determination and repair
procedures for the fibre channel switches connected
to the 2145 and any other Fibre Channel network
components.
4. If any problems are found and resolved in steps 2
and 3, mark this error as “fixed”.
5. If no switch or disk controller failures were found
in steps 2 and 3, take an event log dump. Call your
hardware support center.
6. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Enclosure/controller fault
v Fibre Channel switch
1400
The 2145 cannot detect an Ethernet
connection.
Explanation: The 2145 cannot detect an Ethernet
connection.
User response:
1. Go to the Ethernet MAP.
2. Go to the repair verification MAP.
Possible Cause-FRUs or other:
2145-CF8, or 2145-CG8
v Ethernet cable (25%)
v System board (25%)
Other:
v Ethernet cable is disconnected or damaged (25%)
v Ethernet hub fault (25%)
1403
External port not operational.
Explanation: If this error occurs when a port was
initially online and subsequently went offline, it
indicates:
v the server, HBA, CNA or switch has been turned off.
v there is a physical issue.
If this error occurs during an initial setup or during a
setup change, it is most likely a configuration issue
rather than a physical issue.
User response:
1. Reset the port via the CLI command Maintenance. If
the port is now online, the DMP is complete.
2. If the port is connected to a switch, check the
switch to make sure the port is not disabled. Check
the switch vendor troubleshooting documentation
for other possibilities. If the port is now online, the
DMP is complete.
3. Reseat the cable. This includes plugging in the cable
and SFP if not already done. If the port is now
online, the DMP is complete.
4. Reseat the hot swap SFPs (optics modules). If the
port is now online, the DMP is complete.
5. Try using a new cable.
6. Try using a new SFP.
7. Try using a new port on the switch.
Note: Continuing from here will affect other ports
connected on the adapter.
8. Reset the adapter.
9. Reset the canister.
1450
Fewer Fibre Channel I/O ports
operational.
Explanation: One or more Fibre Channel I/O ports
that have previously been active are now inactive. This
situation has continued for one minute.
A Fibre Channel I/O port might be established on
either a Fibre Channel platform port or an Ethernet
platform port using FCoE. This error is expected if the
associated Fibre Channel or Ethernet port is not
operational.
Data:
Three numeric values are listed:
v The ID of the first unexpected inactive port. This ID
is a decimal number.
v The ports that are expected to be active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is expected to be active.
v The ports that are actually active, which is a
hexadecimal number. Each bit position represents a
port, with the least significant bit representing port 1.
The bit is 1 if the port is active.
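The two hexadecimal values in the error 1450 data can be decoded by testing each bit position, with the least significant bit standing for port 1. The following Python sketch is an illustration only: the function names and the sample values are hypothetical and are not taken from a real event log.

```python
def decode_port_mask(mask_hex: str) -> set[int]:
    """Return the 1-based port numbers whose bit is set in a hexadecimal mask."""
    mask = int(mask_hex, 16)
    ports = set()
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            ports.add(bit + 1)  # the least significant bit represents port 1
        bit += 1
    return ports


def unexpected_inactive_ports(expected_hex: str, active_hex: str) -> list[int]:
    """Ports that are expected to be active but are not actually active."""
    return sorted(decode_port_mask(expected_hex) - decode_port_mask(active_hex))


# Hypothetical values: ports 1-4 expected (0x0f), ports 1, 2 and 4 active (0x0b).
print(unexpected_inactive_ports("0f", "0b"))  # prints [3]
```

The lowest-numbered entry in the returned list would be expected to match the first value in the error data, the ID of the first unexpected inactive port.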
User response:
1. If possible, use the management GUI to run the
recommended actions for the associated service
error code.
2. Follow the procedure for mapping I/O ports to
platform ports to determine which platform port is
providing this I/O port.
3. Check for any 704 (Fibre channel platform port
not operational) or 724 (Ethernet platform port
not operational) node errors reported for the
platform port.
4. Possibilities:
v If the port has been intentionally disconnected,
use the management GUI recommended action
for the service error code and acknowledge the
intended change.
v Resolve the 704 or 724 error.
v If this is an FCoE connection, use the information
the view gives about the Fibre Channel forwarder
(FCF) to troubleshoot the connection between the
port and the FCF.
Possible Cause-FRUs or other cause:
v None
1471
Interface card is unsupported.
Explanation: Interface adapter is unsupported.
User response: Replace the wrong interface adapter
with the correct type.
Possible Cause-FRUs or other:
Interface adapter (100%)
1472
Boot drive is in an unsupported slot.
Explanation: Boot drive is in an unsupported slot.
User response: Complete the following steps:
1. Look at a boot drive view to determine which drive
is in an unsupported slot.
2. Move the drive back to its correct node and slot,
but shut down the node first if booted yes is shown
for that drive in boot drive view.
3. The node error clears, or a new node error is
displayed for you to work on.
Possible Cause-FRUs or other:
v None
1473
The installed battery is at a hardware
revision level that is not supported by
the current code level.
Explanation: The installed battery is at a hardware
revision level that is not supported by the current code
level.
User response: To replace the battery with one that is
supported by the current code level, follow the service
action for “1130” on page 209. To update the code level
to one that supports the currently installed battery,
perform a service mode code update. Always install the
latest level of the system software to avoid problems
with upgrades and component compatibility.
Possible Cause-FRUs or other:
v Battery (50%)
1474
Battery is nearing end of life.
Explanation: When a battery nears the end of its life,
you must replace it if you intend to preserve the
capacity to failover power to batteries.
User response: Replace the battery by following this
procedure as soon as you can.
If the node is in a clustered system, ensure that the
battery is not being relied upon to provide data
protection before you remove it. Issue the
chnodebattery -remove -battery battery_ID node_ID
command to establish the lack of reliance on the
battery.
If the command returns with a “The command has
failed because the specified battery is
offline” (BATTERY_OFFLINE) error, replace the battery
immediately.
If the command returns with a “The command has
failed because the specified battery is not redundant”
(BATTERY_NOT_REDUNDANT) error, do not remove
the relied-on battery. Removing the battery
compromises data protection.
In this case, without other battery-related errors, use
the chnodebattery -remove -battery battery_ID
node_ID command periodically to force the system to
remove reliance on the battery. The system often
removes reliance within one hour (TBC).
Alternatively, remove the node from the clustered
system. Once the node is independent, you can replace
its battery immediately. If the node is not part of a
cluster, or the battery is offline, or the chnodebattery
command returns without error, conduct the service
action for “1130” on page 209.
Possible Cause-FRUs or other:
v Battery (100%)
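The decision logic in the user response for error 1474 can be summarized as a small lookup. This Python sketch is illustrative only: the function name, the "ok" label for a command that returns without error, and the returned strings are hypothetical, and the sketch does not run any CLI command.

```python
def next_battery_action(in_clustered_system: bool, result: str) -> str:
    """Map the outcome of the chnodebattery -remove attempt to the action in the text.

    result is "ok" (hypothetical label for no error), "BATTERY_OFFLINE",
    or "BATTERY_NOT_REDUNDANT".
    """
    if not in_clustered_system:
        # A node that is not part of a cluster can have its battery replaced directly.
        return "replace the battery (service action 1130)"
    if result in ("ok", "BATTERY_OFFLINE"):
        return "replace the battery (service action 1130)"
    if result == "BATTERY_NOT_REDUNDANT":
        return ("do not remove the battery; retry the command periodically "
                "or remove the node from the clustered system")
    raise ValueError(f"unrecognized result: {result}")


print(next_battery_action(True, "BATTERY_NOT_REDUNDANT"))
```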
1475
Battery is too hot.
Explanation: Battery is too hot.
User response: The battery might be slow to cool if
the ambient temperature is high. You must wait for the
battery to cool down before it can resume its normal
operation.
If node error 768 is reported, service that as well.
1476
Battery is too cold.
Explanation: You must wait for the battery to warm
before it can resume its normal operation.
User response: The battery might be slow to warm if
the ambient temperature is low. If node error 768 is
reported, service that as well.
Otherwise, wait for the battery to warm.
1550
A cluster path has failed.
Explanation: One of the V3700 Fibre Channel ports is
unable to communicate with all the other V3700s in the
cluster.
User response:
1. Check for incorrect switch zoning.
2. Repair the fault in the Fibre Channel network
fabric.
3. Check the status of the node ports that are not
excluded via the system's local port mask. If the
status of the node ports shows as active, mark the
error that you have repaired as “fixed”. If any node
ports do not show a status of active, go to start
MAP. If you return to this step contact your support
center to resolve the problem with the V3700.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Fibre Channel network fabric fault (100%)
1570
Quorum disk configured on controller
that has quorum disabled
Explanation: This error can occur with a storage
controller that can be accessed through multiple
WWNNs and have a default setting of not allowing
quorum disks. When these controllers are detected by a
cluster, although multiple component controller
definitions are created, the cluster recognizes that all of
the component controllers belong to the same storage
system. To enable the creation of a quorum disk on this
storage system, all of the controller components must
be configured to allow quorum.
A configuration change to the SAN, or to a storage
system with multiple WWNNs, might result in the
cluster discovering new component controllers for the
storage system. These components will take the default
setting for allowing quorum. This error is reported if
there is a quorum disk associated with the controller
and the default setting is not to allow quorum.
User response:
v Determine if there should be a quorum disk on this
storage system. Ensure that the controller supports
quorum before you allow quorum disks on any disk
controller. You can check the support website
www.ibm.com/storage/support/2145 for more
information.
v If a quorum disk is required on this storage system,
allow quorum on the controller component that is
reported in the error. If the quorum disk should not
be on this storage system, move it elsewhere.
v Mark the error as “fixed”.
Possible Cause-FRUs or other:
v None
Other:
Fibre Channel network fabric fault (100%)
1600
Mirrored disk repair halted because of
difference.
Explanation: During the repair of a mirrored volume
two copy disks were found to contain different data for
the same logical block address (LBA). The validate
option was used, so the repair process has halted.
Read operations to the LBAs that differ might return
the data of either volume copy. Therefore it is
important not to use the volume unless you are sure
that the host applications will not read the LBAs that
differ or can manage the different data that potentially
can be returned.
User response: Perform one of the following actions:
v Continue the repair starting with the next LBA after
the difference to see how many differences there are
for the whole mirrored volume. This can help you
decide which of the following actions to take.
v Choose a primary disk and run repair
resynchronizing differences.
v Run a repair and create medium errors for
differences.
v Restore all or part of the volume from a backup.
v Decide which disk has correct data, then delete the
copy that is different and re-create it allowing it to be
synchronized.
Then mark the error as “fixed”.
Possible Cause-FRUs or other:
v None
1610
There are too many copied media errors
on a managed disk.
Explanation: The cluster maintains a virtual medium
error table for each MDisk. This table is a list of logical
block addresses on the managed disk that contain data
that is not valid and cannot be read. The virtual
medium error table has a fixed length. This error event
indicates that the system has attempted to add an entry
to the table, but the attempt has failed because the table
is already full.
There are two circumstances that will cause an entry to
be added to the virtual medium error table:
1. FlashCopy, data migration and mirrored volume
synchronization operations copy data from one
managed disk extent to another. If the source extent
contains either a virtual medium error or the RAID
controller reports a real medium error, the system
creates a matching virtual medium error on the
target extent.
2. The mirrored volume validate and repair process
has the option to create virtual medium errors on
sectors that do not match on all volume copies.
Normally zero, or very few, differences are
expected; however, if the copies have been marked
as synchronized inappropriately, then a large
number of virtual medium errors could be created.
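The behavior described for error 1610 can be pictured with a small model of a fixed-length table. This Python sketch is a simplified illustration only: the class, the function, and the capacity value are hypothetical, the real table length is not stated in this guide, and the model assumes that a bad block keeps the same address on the target extent.

```python
class VirtualMediumErrorTable:
    """Fixed-length record of LBAs on one MDisk that hold data that is not valid."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # hypothetical size; the real length is not documented here
        self.lbas: set[int] = set()

    def add(self, lba: int) -> bool:
        """Record an LBA; return False when the table is already full."""
        if lba in self.lbas:
            return True
        if len(self.lbas) >= self.capacity:
            return False  # the condition that error 1610 reports
        self.lbas.add(lba)
        return True


def copy_extent(source: VirtualMediumErrorTable,
                target: VirtualMediumErrorTable,
                real_medium_errors: set[int]) -> list[int]:
    """Create matching virtual medium errors on the target; return LBAs that did not fit."""
    overflow = []
    for lba in sorted(source.lbas | real_medium_errors):
        if not target.add(lba):
            overflow.append(lba)
    return overflow


# Hypothetical example: the target table has room for only two entries.
src = VirtualMediumErrorTable(capacity=10)
src.add(100)
src.add(200)
dst = VirtualMediumErrorTable(capacity=2)
print(copy_extent(src, dst, real_medium_errors={300}))  # prints [300]
```

The model shows why repeated copies of an extent that already carries medium errors can fill the target table.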
User response: Ensure that all higher priority errors
are fixed before you attempt to resolve this error.
Determine whether the excessive number of virtual
medium errors occurred because of a mirrored disk
validate and repair operation that created errors for
differences, or whether the errors were created because
of a copy operation. Follow the corresponding option
shown below.
1. If the virtual medium errors occurred because of a
mirrored disk validate and repair operation that
created medium errors for differences, then also
ensure that the volume copies had been fully
synchronized prior to starting the operation. If the
copies had been synchronized, there should be only
a few virtual medium errors created by the validate
and repair operation. In this case, it might be
possible to rewrite only the data that was not
consistent on the copies using the local data
recovery process. If the copies had not been
synchronized, it is likely that there are now a large
number of medium errors on all of the volume
copies. Even if the virtual medium errors are
expected to be only for blocks that have never been
written, it is important to clear the virtual medium
errors to avoid inhibition of other operations. To
recover the data for all of these virtual medium
errors it is likely that the volume will have to be
recovered from a backup using a process that
rewrites all sectors of the volume.
2. If the virtual medium errors have been created by a
copy operation, it is best practice to correct any
medium errors on the source volume and to not
propagate the medium errors to copies of the
volume. Fixing higher priority errors in the event
log would have corrected the medium error on the
source volume. Once the medium errors have been
fixed, you must run the copy operation again to
clear the virtual medium errors from the target
volume. It might be necessary to repeat a sequence
of copy operations if copies have been made of
already copied medium errors.
An alternative that does not address the root cause is to
delete volumes on the target managed disk that have
the virtual medium errors. This volume deletion
reduces the number of virtual medium error entries in
the MDisk table. Migrating the volume to a different
managed disk will also delete entries in the MDisk
table, but will create more entries on the MDisk table of
the MDisk to which the volume is migrated.
Possible Cause-FRUs or other:
v None
1620
A storage pool is offline.
Explanation: A storage pool is offline.
User response:
1. Repair the faults in the order shown.
2. Start a cluster discovery operation by rescanning the
Fibre Channel network.
3. Check managed disk (MDisk) status. If all MDisks
show a status of “online”, mark the error that you
have just repaired as “fixed”. If any MDisks do not
show a status of “online”, go to start MAP. If you
return to this step, contact your support center to
resolve the problem with the disk controller.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Fibre Channel network fabric fault (50%)
v Enclosure/controller fault (50%)
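The status check in step 3 of the 1620 user response can be sketched as a small script. This is a minimal illustration, assuming the MDisk listing has been captured as colon-delimited text with a header row that includes name and status columns (for example, from lsmdisk -delim :). The function name and the sample listing are hypothetical; real output has more columns.

```python
import csv
import io


def mdisks_not_online(listing: str, delim: str = ":") -> list[str]:
    """Return the names of MDisks whose status is anything other than 'online'."""
    rows = csv.DictReader(io.StringIO(listing), delimiter=delim)
    return [row["name"] for row in rows if row["status"] != "online"]


# Hypothetical sample listing, reduced to three columns for illustration.
sample = (
    "id:name:status\n"
    "0:mdisk0:online\n"
    "1:mdisk1:degraded_paths\n"
    "2:mdisk2:online\n"
)

not_online = mdisks_not_online(sample)
if not_online:
    print("Go to start MAP; MDisks not online:", not_online)
else:
    print("All MDisks online; mark the error as fixed")
```

Any status other than online is treated the same way here, because step 3 only distinguishes between "all online" and "anything else".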
1623
One or more MDisks on a controller are
degraded.
Explanation: At least one MDisk on a controller is
degraded because the MDisk is not available through
one or more nodes. The MDisk is available through at
least one node. Access to data might be lost if another
failure occurs.
In a correctly configured system, each node accesses all
of the MDisks on a controller through all of the
controller's ports.
This error is only logged once per controller. There
might be more than one MDisk on this controller that
has been configured incorrectly, but the error is only
logged for one MDisk.
To prevent this error from being logged because of
short-term fabric maintenance activities, this error
condition must have existed for one hour before the
error is logged.
User response:
1. Determine which MDisks are degraded. Look for
MDisks with a path count lower than the number of
nodes. Do not use only the MDisk status, since
other errors can also cause degraded MDisks.
2. Ensure that the controller is zoned correctly with all
of the nodes.
3. Ensure that the logical unit is mapped to all of the
nodes.
4. Ensure that the logical unit is mapped to all of the
nodes using the same LUN.
5. Run the console or CLI command to discover
MDisks and ensure that the command completes.
6. Mark the error that you have just repaired as
“fixed”. When you mark the error as “fixed”, the
controller's MDisk availability is tested and the
error will be logged again immediately if the error
persists for any MDisks. It is possible that the new
error will report a different MDisk.
7. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Fibre Channel network fabric fault (50%)
v Enclosure/controller fault (50%)
1624
Controller configuration has
unsupported RDAC mode.
Explanation: The cluster has detected that an IBM DS
series disk controller's configuration is not supported
by the cluster. The disk controller is operating in RDAC
mode. The disk controller might appear to be operating
with the cluster; however, the configuration is
unsupported because it is known to not work with the
cluster.
User response:
1. Using the IBM DS series console, ensure that the
host type is set to 'IBM TS SAN VCE' and that the
AVT option is enabled. (The AVT and RDAC
options are mutually exclusive).
2. Mark the error that you have just repaired as
“fixed”. If the problem has not been fixed it will be
logged again; this could take a few minutes.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Enclosure/controller fault
1625
Incorrect disk controller configuration.
Explanation: While running an MDisk discovery, the
cluster has detected that a disk controller's
configuration is not supported by the cluster. The disk
controller might appear to be operating with the
cluster; however, the configuration detected can
potentially cause issues and should not be used. The
unsupported configuration is shown in the event data.
User response:
1. Use the event data to determine changes required
on the disk controller and reconfigure the disk
controller to use a supported configuration.
2. Mark the error that you have just repaired as
“fixed”. If the problem has not been fixed it will be
logged again by the managed disk discovery that
automatically runs at this time; this could take a
few minutes.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Enclosure/controller fault
1627
The cluster has insufficient redundancy
in its controller connectivity.
Explanation: The cluster has detected that it does not
have sufficient redundancy in its connections to the
disk controllers. This means that another failure in the
SAN could result in loss of access to the application
data. The cluster SAN environment should have
redundant connections to every disk controller. This
redundancy allows for continued operation when there
is a failure in one of the SAN components.
To provide recommended redundancy, a cluster should
be configured so that:
v each node can access each disk controller through
two or more different initiator ports on the node.
v each node can access each disk controller through
two or more different controller target ports. Note:
Some disk controllers only provide a single target
port.
v each node can access each disk controller target port
through at least one initiator port on the node.
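The redundancy criteria listed above can be expressed as a simple check over the set of logins between nodes and disk controllers. The following is a minimal sketch, not the cluster's own validation logic: the login tuples, the function name, and the port names are all hypothetical, and the data would have to be gathered separately (for example, from the fabric view).

```python
from collections import defaultdict


def redundancy_gaps(logins, nodes, controller_ports):
    """List (node, controller, reason) tuples where redundancy is below the recommendation.

    logins: iterable of (node, initiator_port, controller, target_port).
    controller_ports: dict mapping each controller to the set of its target ports.
    """
    initiators = defaultdict(set)
    targets = defaultdict(set)
    for node, initiator, controller, target in logins:
        initiators[(node, controller)].add(initiator)
        targets[(node, controller)].add(target)

    gaps = []
    for node in nodes:
        for controller, ports in controller_ports.items():
            key = (node, controller)
            if len(initiators[key]) < 2:
                gaps.append((node, controller, "fewer than two initiator ports"))
            # Some disk controllers only provide a single target port.
            if len(ports) > 1 and len(targets[key]) < 2:
                gaps.append((node, controller, "fewer than two target ports"))
            missing = ports - targets[key]
            if missing:
                gaps.append((node, controller, f"unreachable target ports: {sorted(missing)}"))
    return gaps


# Hypothetical example: node n1 is fully redundant, node n2 has a single path.
nodes = ["n1", "n2"]
controller_ports = {"c1": {"t1", "t2"}}
logins = [
    ("n1", "p1", "c1", "t1"),
    ("n1", "p2", "c1", "t2"),
    ("n2", "p1", "c1", "t1"),
]
gaps = redundancy_gaps(logins, nodes, controller_ports)
for gap in gaps:
    print(gap)
```

An empty result means that all three recommendations hold for every node and controller pair in the supplied data.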
If there are no higher-priority errors being reported,
this error usually indicates a problem with the SAN
design, a problem with the SAN zoning or a problem
with the disk controller.
If there are unfixed higher-priority errors that relate to
the SAN or to disk controllers, those errors should be
fixed before resolving this error because they might
indicate the reason for the lack of redundancy. Error
codes that must be fixed first are:
v 1210 Local FC port excluded
v 1230 Login has been excluded
Note: This error can be reported if the required action,
to rescan the Fibre Channel network for new MDisks,
has not been performed after a deliberate
reconfiguration of a disk controller or after SAN
rezoning.
The 1627 error code is reported for a number of
different error IDs. The error ID indicates the area
where there is a lack of redundancy. The data reported
in an event log entry indicates where the condition was
found.
The meaning of the error IDs is shown below. For each
error ID the most likely reason for the condition is
given. If the problem is not found in the suggested
areas, check the configuration and state of all of the
SAN components (switches, controllers, disks, cables
and cluster) to determine where there is a single point
of failure.
010040 A disk controller is only accessible from a single
node port.
v A node has detected that it only has a connection to
the disk controller through exactly one initiator port,
and more than one initiator port is operational.
v The error data indicates the device WWNN and the
WWPN of the connected port.
v A zoning issue or a Fibre Channel connection
hardware fault might cause this condition.
010041 A disk controller is only accessible from a single
port on the controller.
v A node has detected that it is only connected to
exactly one target port on a disk controller, and more
than one target port connection is expected.
v The error data indicates the WWPN of the disk
controller port that is connected.
v A zoning issue or a Fibre Channel connection
hardware fault might cause this condition.
010042 Only a single port on a disk controller is
accessible from every node in the cluster.
v Only a single port on a disk controller is accessible to
every node when there are multiple ports on the
controller that could be connected.
v The error data indicates the WWPN of the disk
controller port that is connected.
v A zoning issue or a Fibre Channel connection
hardware fault might cause this condition.
010043 A disk controller is accessible through only half,
or less, of the previously configured controller ports.
v Although there might still be multiple ports that are
accessible on the disk controller, a hardware
component of the controller might have failed or one
of the SAN fabrics has failed such that the
operational system configuration has been reduced to
a single point of failure.
v The error data indicates a port on the disk controller
that is still connected, and also lists controller ports
that are expected but that are not connected.
v A disk controller issue, switch hardware issue,
zoning issue or cable fault might cause this
condition.
010044 A disk controller is not accessible from a node.
v A node has detected that it has no access to a disk
controller. The controller is still accessible from the
partner node in the I/O group, so its data is still
accessible to the host applications.
v The error data indicates the WWPN of the missing
disk controller.
v A zoning issue or a cabling error might cause this
condition.
User response:
1. Check the error ID and data for a more detailed
description of the error.
2. Determine if there has been an intentional change
to the SAN zoning or to a disk controller
configuration that reduces the cluster's access to
the indicated disk controller. If either action has
occurred, continue with step 8.
3. Use the GUI or the CLI command lsfabric to
ensure that all disk controller WWPNs are
reported as expected.
4. Ensure that all disk controller WWPNs are zoned
appropriately for use by the cluster.
5. Check for any unfixed errors on the disk
controllers.
6. Ensure that all of the Fibre Channel cables are
connected to the correct ports at each end.
7. Check for failures in the Fibre Channel cables and
connectors.
8. When you have resolved the issues, use the GUI
or the CLI command detectmdisk to rescan the
Fibre Channel network for changes to the MDisks.
Note: Do not attempt to detect MDisks unless you
are sure that all problems have been fixed.
Detecting MDisks prematurely might mask an
issue.
9. Mark the error that you have just repaired as
fixed. The cluster will revalidate the redundancy
and will report another error if there is still not
sufficient redundancy.
10. Go to MAP 5700: Repair verification.
Possible Cause-FRUs or other:
v None
1630
The number of device logins was
reduced.
Explanation: The number of port to port fabric
connections, or logins, between the node and a storage
controller has decreased. This might be caused by a
problem on the SAN or by a deliberate reconfiguration
of the SAN.
User response:
1. Check the error in the cluster event log to identify
the object ID associated with the error.
2. Check the availability of the failing device using the
following command line: lscontroller object_ID.
If the command fails with the message
“CMMVC6014E The command failed because the
requested object is either unavailable or does not
exist,” ask the customer if this device was removed
from the system.
v If “yes”, mark the error as fixed in the cluster
event log and continue with the repair
verification MAP.
v If “no” or if the command lists details of the
failing controller, continue with the next step.
3. Check whether the device has regained connectivity.
If it has not, check the cable connection to the
remote-device port.
4. If all attempts to log in to a remote-device port have
failed and you cannot solve the problem by
changing cables, check the condition of the
remote-device port and the condition of the remote
device.
5. Start a cluster discovery operation by rescanning the
Fibre Channel network.
6. Check the status of the disk controller. If all disk
controllers show a “good” status, mark the error
that you have just repaired as “fixed”. If any disk
controllers do not show “good” status, go to the
start MAP. If you return to this step, contact the
support center to resolve the problem with the disk
controller.
7. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Fibre Channel network fabric fault (50%)
v Enclosure/controller fault (50%)
1660
The initialization of the managed disk
has failed.
Explanation: The initialization of the managed disk
has failed.
User response:
1. View the event log entry to identify the managed
disk (MDisk) that was being accessed when the
problem was detected.
2. Perform the disk controller problem determination
and repair procedures for the MDisk identified in
step 1.
3. Include the MDisk into the cluster.
4. Check the managed disk status. If all managed
disks show a status of “online”, mark the error that
you have just repaired as “fixed”. If any managed
disks do not show a status of “online”, go to the
start MAP. If you return to this step, contact your
support center to resolve the problem with the disk
controller.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Enclosure/controller fault (100%)
1670
The CMOS battery on the system board
failed.
Explanation: The CMOS battery on the system board
failed.
User response: Replace the node until the FRU is
available.
Possible Cause-FRUs or other:
CMOS battery (100%)
1680
Drive fault type 1
Explanation: Drive fault type 1
User response: Replace the drive.
Possible Cause-FRUs or other:
Drive (95%)
Canister (3%)
Midplane (2%)
1684
Drive is missing.
Explanation: Drive is missing.
User response: Install the missing drive. The drive is
typically a data drive that was previously part of the
array.
Possible Cause-FRUs or other:
Drive (100%)
1686
Drive fault type 3.
Explanation: Drive fault type 3.
User response: Complete the following steps to
resolve this problem.
1. Reseat the drive.
2. Replace the drive.
3. Replace the canister as identified in the sense data.
4. Replace the enclosure.
Note: The removal of the exclusion on the drive slot
will happen automatically, but only after this error has
been marked as fixed.
Possible Cause-FRUs or other:
v Drive (46%)
v Canister (46%)
v Enclosure (8%)
1689
Array MDisk has lost redundancy.
Explanation: Array MDisk has lost redundancy. The
RAID 5 system is missing a data drive.
User response: Replace the missing or failed drive.
Possible Cause-FRUs or other:
Drives removed or failed (100%)
1690
No spare protection exists for one or
more array MDisks.
Explanation: The system spare pool cannot
immediately provide a spare of any suitability to one or
more arrays.
User response:
1. Configure an array but no spares.
2. Configure many arrays and a single spare. Cause
that spare to be consumed or change its use.
For a distributed array, unused or candidate drives are
converted into array members.
1. Decode/explain the number of rebuild areas
available and the threshold set.
2. Check for unfixed higher priority errors.
3. Check for unused and candidate drives that are
suitable for the distributed array. Run the
lsarraymembergoals command to determine drive
suitability by using tech_type, capacity, and rpm
information.
v Offer to add the drives into the array. Allow up
to the number of missing array members to be
added.
v Recheck after array members are added.
4. If no drives are available, explain that drives need
to be added to restore the wanted number of
rebuild areas.
v If the threshold is greater than the number of
rebuild areas available, and the threshold is
greater than 1, offer to reduce the threshold to the
number of drives that are available.
1691
A background scrub process has found
an inconsistency between data and
parity on the array.
Explanation: The array has at least one stride where
the data and parity do not match. RAID has found an
inconsistency between the data stored on the drives
and the parity information. This could either mean that
the data has been corrupted, or that the parity
information has been corrupted.
User response: Follow the directed maintenance
procedure for inconsistent arrays.
1692
Array MDisk has taken a spare member
that does not match array goals.
Explanation:
1. A member of the array MDisk either has technology
or capability that does not match exactly with the
established goals of the array.
2. The array is configured to want location matches,
and the drive location does not match all the
location goals.
User response: The error will fix itself automatically
as soon as the rebuild or exchange is queued up. It
does not wait until the array is showing balanced =
exact (which indicates that all populated members
have exact capability match and exact location match).
1693
Drive exchange required.
Explanation: Drive exchange required.
User response: Complete the following steps to
resolve this problem.
1. Exchange the failed drive.
Possible Cause-FRUs or other:
v Drive (100%)
1695
Persistent unsupported disk controller
configuration.
Explanation: A disk controller configuration that
might prevent failover for the cluster has persisted for
more than four hours. The problem was originally
logged through a 010032 event, service error code 1625.
User response:
1. Fix any higher priority error. In particular, follow
the service actions to fix the 1625 error indicated by
this error's root event. This error will be marked as
“fixed” when the root event is marked as “fixed”.
2. If the root event cannot be found, or is marked as
“fixed”, perform an MDisk discovery and mark this
error as “fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Enclosure/controller fault
1700
Unrecovered remote copy relationship
Explanation: This error might be reported after the
recovery action for a clustered system failure or a
complete I/O group failure. The error is reported
because some remote copy relationships, whose control
data is stored by the I/O group, could not be
recovered.
User response: To fix this error it is necessary to
delete all of the relationships that might not be
recovered, and then re-create the relationships.
1. Note the I/O group index against which the error is
logged.
2. List all of the relationships that have either a master
or an auxiliary volume in this I/O group. Use the
volume view to determine which volumes in the
I/O group you noted have a relationship that is
defined.
3. Note the details of the relationships that are listed
so that they can be re-created.
If the affected I/O group has active-active
relationships that are in a consistency group, run
the command chrcrelationship -noconsistgrp
rc_rel_name for each active-active relationship that
was not recovered. Then, use the command
lsrcrelationship in case volume labels are changed
and to see the value of the primary attributes.
4. Delete all of the relationships that are listed in step
2, except any active-active relationship that has host
applications that use the auxiliary volume via the
master volume unique ID (that is, the primary
attribute value is auxiliary in the output from
lsrcrelationship).
For the active-active relationships that have the
primary attribute value of auxiliary, use the
rmvolumecopy CLI command (which also deletes the
relationship). For example, rmvolumecopy
master_volume_id/name.
Note: The error is automatically marked as “fixed”
once the last relationship on the I/O group is
deleted. New relationships must not be created until
the error is fixed.
5. Re-create all the relationships that you deleted by
using the details noted in step 3.
Note: For Metro Mirror and Global Mirror
relationships, you are able to delete a relationship
from either the master or auxiliary system; however,
you must re-create the relationship on the master
system. Therefore, it might be necessary to go to
another system to complete this service action.
Possible Cause-FRUs or other:
v None
1710
There are too many cluster partnerships.
The number of cluster partnerships has
been reduced.
Explanation: A cluster can have a Metro Mirror and
Global Mirror cluster partnership with one or more
other clusters. Partnership sets consist of clusters that
are either in direct partnership with each other or are
in indirect partnership by having a partnership with
the same intermediate cluster. The topology of the
partnership set is not fixed; the topology might be a
star, a loop, a chain or a mesh. The maximum
supported number of clusters in a partnership set is
four. A cluster is a member of a partnership set if it has
a partnership with another cluster in the set, regardless
of whether that partnership has any defined
consistency groups or relationships.
These are examples of valid partnership sets for five
unique clusters labelled A, B, C, D, and E where a
partnership is indicated by a dash between two cluster
names:
v A-B, A-C, A-D. E has no partnerships defined and
therefore is not a member of the set.
v A-B, A-D, B-C, C-D. E has no partnerships defined
and therefore is not a member of the set.
v A-B, B-C, C-D. E has no partnerships defined and
therefore is not a member of the set.
v A-B, A-C, A-D, B-C, B-D, C-D. E has no partnerships
defined and therefore is not a member of the set.
v A-B, A-C, B-C, D-E. There are two partnership sets.
One contains clusters A, B, and C. The other contains
clusters D and E.
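The partnership-set rule described above can be checked mechanically by treating each partnership as a link between two clusters and grouping linked clusters together. The following is a minimal sketch for planning purposes only; the function names are hypothetical and the code does not query a live system.

```python
MAX_CLUSTERS_PER_SET = 4  # supported maximum stated in the explanation


def parse(text):
    """Turn 'A-B, B-C' into [('A', 'B'), ('B', 'C')]."""
    return [tuple(pair.strip().split("-")) for pair in text.split(",")]


def partnership_sets(partnerships):
    """Group clusters into partnership sets (direct or indirect partnerships)."""
    neighbours = {}
    for a, b in partnerships:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen, sets = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            cluster = stack.pop()
            if cluster in component:
                continue
            component.add(cluster)
            stack.extend(neighbours[cluster] - component)
        seen |= component
        sets.append(component)
    return sets


def is_supported(text):
    """True if no partnership set contains more than the supported maximum."""
    return all(len(s) <= MAX_CLUSTERS_PER_SET for s in partnership_sets(parse(text)))


print(partnership_sets(parse("A-B, A-C, B-C, D-E")))  # two separate sets
print(is_supported("A-B, A-C, A-D"))                  # star of four clusters
print(is_supported("A-B, B-C, C-D, D-E"))             # chain of five clusters
```

A cluster with no partnerships never appears in the input, so it is not a member of any set, which matches the treatment of cluster E in the examples.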
These are examples of unsupported configurations
because the number of clusters in the set is five, which
exceeds the supported maximum of four clusters:
v A-B, A-C, A-D, A-E.
v A-B, A-D, B-C, C-D, C-E.
v A-B, B-C, C-D, D-E.
The cluster prevents you from creating a new Metro
Mirror and Global Mirror cluster partnership if a
resulting partnership set would exceed the maximum
of four clusters. However, if you restore a broken link
between two clusters that have a partnership, the
number of clusters in the set might exceed four. If this
occurs, Metro Mirror and Global Mirror cluster
partnerships are excluded from the set until only four
clusters remain in the set. A cluster partnership that is
excluded from a set has all of its Metro Mirror and
Global Mirror cluster partnerships excluded.
Event ID 0x050030 is reported if the cluster is retained
in the partnership set. Event ID 0x050031 is reported if
the cluster is excluded from the partnership set. All
clusters that were in the partnership set report error
1710.
All inter-cluster Metro Mirror or Global Mirror
relationships that involve an excluded cluster will lose
connectivity. If any of these relationships are in the
consistent_synchronized state and they receive a write
I/O, they will stop with error code 1720.
User response: To fix this error it is necessary to
delete all of the relationships that could not be
recovered and then re-create the relationships.
1. Determine which clusters are still connected and
members of the partnership set, and which clusters
have been excluded.
2. Determine the Metro Mirror and Global Mirror
relationships that exist on those clusters.
3. Determine which of the Metro Mirror and Global
Mirror relationships you want to maintain, which
determines which cluster partnerships you want to
maintain. Ensure that the partnership set or sets
that would result from configuring the cluster
partnerships that you want contain no more than
four clusters in each set. NOTE: The reduced
partnership set created by the cluster might not
contain the clusters that you want in the set.
4. Remove all of the Metro Mirror and Global Mirror
relationships that you do not want to retain.
5. Remove all of the Metro Mirror and Global Mirror
cluster partnerships that you do not want to retain.
6. Restart all relationships and consistency groups that
were stopped.
7. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
1720
Metro Mirror (remote copy) Relationship has stopped and lost
synchronization, for reason other than a
persistent I/O error (LSYNC)
Explanation: A remote copy relationship or
consistency group needs to be restarted. In a Metro
Mirror (remote copy) or Global Mirror operation, the
relationship has stopped and lost synchronization, for a
reason other than a persistent I/O error.
User response: The administrator must examine the
state of the system to validate that everything is online
to allow a restart to work. Examining the state of the
system also requires checking the partner Fibre
Channel (FC) port masks on both clusters.
1. If the partner FC port mask was changed recently,
check that the correct mask was selected.
2. Perform whatever steps are needed to maintain a
consistent secondary volume, if desired.
3. The administrator must issue a start command.
Possible Cause-FRUs or other:
v None
1740
Recovery encryption key not available.
Explanation: Recovery encryption key is not available.
User response: Make the recovery encryption key
available.
1. If the key is not available:
v Install a USB drive with the encryption key.
v Ensure correct file is on the USB drive.
2. If the key is not valid:
v Get a USB drive with a valid key for this MTMS.
The key does not have a valid CRC.
Possible Cause-FRUs or other:
No FRU
1741
Flash module is predicted to fail.
Explanation: The Flash module is predicted to fail due
to low health (event ID 085023) or due to an encryption
issue (event ID 085158). In either case, the drive should
be replaced.
User response: A replacement drive of the same size
is needed to correct this error.
If any higher array events exist, correct those first.
If no other array events exist, replace the drive. If the
array is RAID5, replace and format the drive.
If the array is RAID0, correcting this issue will result in
a loss of all data. If the data is needed, do the
following:
1. Backup all array data.
2. Replace the drives using the recoverarray format.
3. Restore array data.
If the array data is not needed, replace the drive(s)
using the recoverarray format.
1750
Array response time too high.
Explanation: A number of causes can lead to
higher-than-usual array response time.
User response:
1. Fix higher priority errors first.
2. Fix any other known errors.
3. Change the array into redundancy mode by using
the charray interface.
Possible Cause-FRUs or other:
Environment or configuration issues:
Volume config 30%
Slow drive 30%
Enclosure 20%
SAS port 20%
1780
Encryption key changes are not
committed.
Explanation: Changes were made to the encryption
key, but the pending changes were not committed. A
directed maintenance procedure (DMP) was launched
to cancel the changes.
User response: Press Next to cancel the pending key
changes. Launch the GUI to restart the operation.
1800
The SAN has been zoned incorrectly.
Explanation: This has resulted in more than 512 other
ports on the SAN logging into one port of a 2145 node.
User response:
1. Ask the user to reconfigure the SAN.
2. Mark the error as “fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
v Fibre Channel switch configuration error
v Fibre Channel switch
1801
A node has received too many Fibre
Channel logins from another node.
Explanation: This event was logged because the node
has received more than sixteen Fibre Channel logins
originating from another node. This indicates that the
Fibre Channel storage area network that connects the
two nodes is not correctly configured.
Data:
v None
User response: Change the zoning and/or Fibre
Channel port masking so that no more than 16 logins
are possible between a pair of nodes.
See Non-critical node error “888” on page 193 for
details.
Use the lsfabric command to view the current number
of logins between nodes.
Possible Cause-FRUs or other cause:
v None
1802
Fibre Channel network settings
Explanation: Fibre Channel network settings
User response: Follow these troubleshooting steps to
reduce the number of hosts that are logged in to the
port:
1. Increase the granularity of the switch zoning to
reduce unnecessary host port logins.
2. Change switch zoning to spread out host ports
across other available ports.
3. Use interfaces with more ports, if not already at the
maximum.
4. Scale out by using another FlashSystem enclosure.
Possible Cause-FRUs or other:
No FRU
1804
IB network settings
Explanation: IB network settings
User response: Follow these troubleshooting steps to
reduce the number of hosts that are logged in to the
port:
1. Increase the granularity of the switch zoning to
reduce unnecessary host port logins.
2. Change switch zoning to spread out host ports
across other available ports.
3. Use interfaces with more ports, if not already at the
maximum.
4. Scale out by using another FlashSystem enclosure.
Possible Cause-FRUs or other:
No FRU
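For error 1801, the limit of 16 logins received by a node from any one other node can be checked by counting logins per node pair. The following is a minimal sketch, assuming the logins between nodes have already been collected (for example, from the lsfabric view) and reduced to one (receiving node, originating node) tuple per login. The function name, the tuple layout, and the node names are hypothetical.

```python
from collections import Counter

MAX_NODE_LOGINS = 16  # limit stated in the 1801 explanation


def excessive_node_logins(logins, limit=MAX_NODE_LOGINS):
    """Return {(receiving_node, originating_node): count} for pairs over the limit.

    logins: iterable with one (receiving_node, originating_node) tuple per login.
    """
    counts = Counter(logins)
    return {pair: n for pair, n in counts.items() if n > limit}


# Hypothetical example: node1 sees 17 logins from node2, node2 sees 8 from node1.
logins = [("node1", "node2")] * 17 + [("node2", "node1")] * 8
over_limit = excessive_node_logins(logins)
for (receiver, origin), count in over_limit.items():
    print(f"{receiver} has {count} logins from {origin}; review zoning or port masking")
```

The count is kept per direction because the event is logged by the node that receives the logins.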
1840
The managed disk has bad blocks.
Explanation: These are "virtual" medium errors which
are created when copying a volume where the source
has medium errors. During data moves or duplication,
such as during a flash copy, an attempt is made to
move medium errors; to achieve this, virtual medium
errors called “bad blocks” are created. Once a bad
block has been created, no attempt will be made to
read the underlying data, as there is no guarantee that
the old data still exists once the “bad block” is created.
Therefore, it is possible to have “bad blocks”, and thus
medium errors, reported on a target volume, without
medium errors actually existing on the underlying
storage. The “bad block” records are removed when the
data is overwritten by a host.
Note: On an external controller, this error can only
result from a copied medium error.
User response:
1. The support center will direct the user to restore the
data on the affected volumes.
2. When the volume data has been restored, or the
user has chosen not to restore the data, mark the
error as “fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
1850
Compressed volume copy has bad
blocks
Explanation: A system recovery operation was
performed, but data on one or more volumes was not
recovered; this is normally caused by a combination of
hardware faults. If data containing a medium error is
copied or migrated to another volume, bad blocks will
be recorded. If a host attempts to read the data in any
of the bad block regions, the read will fail with a
medium error.
User response:
1. The support center will direct the user to restore the
data on the affected volumes.
2. When the volume data has been restored, or the
user has chosen not to restore the data, mark the
error as “fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
1860
Thin-provisioned volume copy offline
because of failed repair.
Explanation: The attempt to repair the metadata of a
thin-provisioned volume that describes the disk
contents has failed because of problems with the
automatically maintained backup copy of this data. The
error event data describes the problem.
User response: Delete the thin-provisioned volume
and reconstruct a new one from a backup or mirror
copy. Mark the error as “fixed”. Also mark the original
1862 error as “fixed”.
Possible Cause-FRUs or other:
v None
1862
Thin-provisioned volume copy offline
because of corrupt metadata.
Explanation: A thin-provisioned volume has been
taken offline because there is an inconsistency in the
cluster metadata that describes the disk contents. This
might occur because of corruption of data on the
physical disk (e.g., medium error or data miscompare),
the loss of cached metadata (because of a cluster
recovery) or because of a software error. The event data
gives information on the reason.
The cluster maintains backup copies of the metadata
and it might be possible to repair the thin-provisioned
volume using this data.
User response: The cluster is able to repair the
inconsistency in some circumstances. Run the repair
volume option to start the repair process. This repair
process, however, can take some time. In some
situations it might be more appropriate to delete the
thin-provisioned volume and reconstruct a new one
from a backup or mirror copy.
If you run the repair procedure and it completes, this
error is automatically marked as “fixed”; otherwise,
another error event (error code 1860) is logged to
indicate that the repair action has failed.
Possible Cause-FRUs or other:
v None
1864
Compressed volume size limitation
breached, diagnosis required
Explanation: The system indicates that the virtual or
real capacity of at least one compressed volume
exceeds the system limits.
User response: For information about how to deal
with this issue, see
www.ibm.com/support/docview.wss?uid=ssg1S1005731.
1865
Thin-provisioned volume copy offline
because of insufficient space.
Explanation: A thin-provisioned volume has been
taken offline because there is insufficient allocated real
capacity available on the volume for the used space to
increase further. If the thin-provisioned volume is
auto-expand enabled, then the storage pool it is in also
has no free space.
User response: The service action differs depending
on whether the thin-provisioned volume copy is
auto-expand enabled or not. Whether the disk is
auto-expand enabled or not is indicated in the error
event data.
If the volume copy is auto-expand enabled, perform
one or more of the following actions. When you have
performed all of the actions that you intend to perform,
mark the error as “fixed”; the volume copy will then
return online.
v Determine why the storage pool free space has been
depleted. Any of the thin-provisioned volume copies,
with auto-expand enabled, in this storage pool might
have expanded at an unexpected rate; this could
indicate an application error. New volume copies
might have been created in, or migrated to, the
storage pool.
v Increase the capacity of the storage pool that is
associated with the thin-provisioned volume copy by
adding more MDisks to the storage pool.
v Provide some free capacity in the storage pool by
reducing the used space. Volume copies that are no
longer required can be deleted, the size of volume
copies can be reduced or volume copies can be
migrated to a different storage pool.
v Migrate the thin-provisioned volume copy to a
storage pool that has sufficient unused capacity.
v Consider reducing the value of the storage pool
warning threshold to give more time to allocate extra
space.
If the volume copy is not auto-expand enabled,
perform one or more of the following actions. In this
case the error will automatically be marked as “fixed”,
and the volume copy will return online when space is
available.
v Increase the real capacity of the volume copy.
v Enable auto-expand for the thin-provisioned volume
copy.
v Determine why the thin-provisioned volume copy
used space has grown at the rate that it has. There
might be an application error.
v Consider reducing the value of the thin-provisioned
volume copy warning threshold to give more time to
allocate more real space.
Possible Cause-FRUs or other:
v None
1870
Mirrored volume offline because a
hardware read error has occurred.
Explanation: While attempting to maintain the volume
mirror, a hardware read error occurred on all of the
synchronized volume copies.
The volume copies might be inconsistent, so the
volume is now offline.
User response:
v Fix all higher priority errors. In particular, fix any
read errors that are listed in the sense data. This
error event will automatically be fixed when the root
event is marked as “fixed”.
v If you cannot fix the root error, but the read errors
on some of the volume copies have been fixed, mark
this error as “fixed” to run without the mirror. You
can then delete the volume copy that cannot read
data and re-create it on different MDisks.
Possible Cause-FRUs or other:
v None
1895
Unrecovered FlashCopy mappings
Explanation: This error might be reported after the
recovery action for a cluster failure or a complete I/O
group failure. The error is reported because some
FlashCopies, whose control data is stored by the I/O
group, were active at the time of the failure and the
current state of the mapping could not be recovered.
User response: To fix this error it is necessary to
delete all of the FlashCopy mappings on the I/O group
that failed.
1. Note the I/O group index against which the error is
logged.
2. List all of the FlashCopy mappings that are using
this I/O group for their bitmaps. You should get the
detailed view of every possible FlashCopy ID. Note
the IDs of the mappings whose IO_group_id
matches the ID of the I/O group against which this
error is logged.
3. Note the details of the FlashCopy mappings that are
listed so that they can be re-created.
4. Delete all of the FlashCopy mappings that are
listed. Note: The error will automatically be marked
as “fixed” once the last mapping on the I/O group
is deleted. New mappings cannot be created until
the error is fixed.
5. Using the details noted in step 3, re-create all of the
FlashCopy mappings that you just deleted.
Possible Cause-FRUs or other:
v None
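Steps 2 and 3 of the 1895 procedure amount to filtering the detailed
views of the FlashCopy mappings by IO_group_id. The following Python
sketch illustrates that filter only. It assumes that the detailed view
of each mapping has already been collected into a dictionary; the id
and name keys and the sample records are hypothetical, and only the
IO_group_id field name is taken from the procedure above.

```python
def mappings_on_io_group(mappings, io_group_id):
    """Return the IDs of the mappings whose IO_group_id matches io_group_id.

    mappings is a list of dictionaries, one per FlashCopy mapping, built
    from the detailed view of each mapping (hypothetical structure).
    """
    wanted = str(io_group_id)
    return [m["id"] for m in mappings if str(m.get("IO_group_id", "")) == wanted]


# Hypothetical records for illustration; real values come from your system.
example_mappings = [
    {"id": "0", "name": "fcmap0", "IO_group_id": "0"},
    {"id": "1", "name": "fcmap1", "IO_group_id": "1"},
    {"id": "2", "name": "fcmap2", "IO_group_id": "0"},
]

# IDs of the mappings to note, delete, and re-create for I/O group 0.
to_recreate = mappings_on_io_group(example_mappings, 0)
```

Keep the full records of the selected mappings, not only their IDs, so
that step 5 can re-create them.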
1900 • 1920
1900
A FlashCopy, Trigger Prepare command
has failed because a cache flush has
failed.
Explanation: A FlashCopy, Trigger Prepare command
has failed because a cache flush has failed.
User response:
1. Correct higher priority errors, and then try the
Trigger Prepare command again.
2. Mark the error that you have just repaired as
“fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
Cache flush error (100%)
1910
A FlashCopy mapping task was stopped
because of the error that is indicated in
the sense data.
Explanation: A stopped FlashCopy might affect the
status of other volumes in the same I/O group.
Preparing the stopped FlashCopy operations as soon as
possible is advised.
User response:
1. Correct higher priority errors, and then prepare and
start the FlashCopy task again.
2. Mark the error that you have just repaired as
“fixed”.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
1920
Global and Metro Mirror persistent
error.
Explanation: This error might be caused by a problem
on the primary cluster, a problem on the secondary
cluster, or a problem on the inter-cluster link. The
problem might be a failure of a component, a
component becoming unavailable or having reduced
performance because of a service action or it might be
that the performance of a component has dropped to a
level where the Metro Mirror or Global Mirror
relationship cannot be maintained. Alternatively the
error might be caused by a change in the performance
requirements of the applications using Metro Mirror or
Global Mirror.
This error is reported on the primary cluster when the
copy relationship has not progressed sufficiently over a
period of time. Therefore, if the relationship is restarted
before all of the problems are fixed, the error might be
reported again when the time period next expires (the
default period is five minutes).
This error might also be reported because the primary
cluster has encountered read errors.
You might need to refer to the Copy Services features
information in the software installation and
configuration documentation while diagnosing this
error.
User response:
1. If the 1920 error has occurred previously on Metro
Mirror or Global Mirror between the same clusters
and all the following actions have been attempted,
contact your product support center to resolve the
problem.
2. On both clusters, check the partner fc port mask to
ensure that there is sufficient connectivity. If the
partner fc port mask was changed recently, ensure
the mask is correct.
3. On the primary cluster reporting the error, correct
any higher priority errors.
4. On the secondary cluster, review the maintenance
logs to determine if the cluster was operating with
reduced capability at the time the error was
reported. The reduced capability might be because
of a software upgrade, hardware maintenance to a
2145 node, maintenance to a backend disk system
or maintenance to the SAN.
5. On the secondary 2145 cluster, correct any errors
that are not fixed.
6. On the intercluster link, review the logs of each
link component for any incidents that would cause
reduced capability at the time of the error. Ensure
the problems are fixed.
7. If a reason for the error has been found and
corrected, go to Action 11.
8. On the primary cluster reporting the error,
examine the 2145 statistics using a SAN
productivity monitoring tool and confirm that all
the Metro Mirror and Global Mirror requirements
described in the planning documentation are met.
Ensure that any changes to the applications using
Metro Mirror or Global Mirror have been taken
into account. Resolve any issues.
9. On the secondary cluster, examine the 2145
statistics using a SAN productivity monitoring tool
and confirm that all the Metro Mirror and Global
Mirror requirements described in the software
installation and configuration documentation are
met. Resolve any issues.
10. On the intercluster link, examine the performance
of each component using an appropriate SAN
productivity monitoring tool to ensure that they
are operating as expected. Resolve any issues.
11. Mark the error as “fixed” and restart the Metro
Mirror or Global Mirror relationship.
When you restart the Metro Mirror or Global Mirror
relationship there will be an initial period during which
Metro Mirror or Global Mirror performs a background
copy to resynchronize the volume data on the primary
and secondary clusters. During this period the data on
the Metro Mirror or Global Mirror auxiliary volumes
on the secondary cluster is inconsistent and the
volumes cannot be used as backup disks by your
applications.
Note: To ensure the system has the capacity to handle
the background copy load you may want to delay
restarting the Metro Mirror or Global Mirror
relationship until there is a quiet period when the
secondary cluster and the SAN fabric (including the
intercluster link) have the required capacity. If the
required capacity is not available you might experience
another 1920 error and the Metro Mirror or Global
Mirror relationship will stop in an inconsistent state.
Note: If the Metro Mirror or Global Mirror relationship
has stopped in a consistent state (“consistent-stopped”)
it is possible to use the data on the Metro Mirror or
Global Mirror auxiliary volumes on the secondary
cluster as backup disks by your applications. You might
therefore want to start a FlashCopy of your Metro
Mirror or Global Mirror auxiliary disks on the
secondary system before restarting the Metro Mirror or
Global Mirror relationship. This means you maintain
the current, consistent, image until the time when the
Metro Mirror or Global Mirror relationship is again
synchronized and in a consistent state.
Possible Cause-FRUs or other:
v None
Other:
v Primary 2145 cluster or SAN fabric problem (10%)
v Primary 2145 cluster or SAN fabric configuration
(10%)
v Secondary 2145 cluster or SAN fabric problem (15%)
v Secondary 2145 cluster or SAN fabric configuration
(25%)
v Intercluster link problem (15%)
v Intercluster link configuration (25%)
1925
Cached data cannot be destaged.
Explanation: Problem diagnosis is required.
User response:
1. Run the directed maintenance procedure to fix all
errors of a higher priority. This will allow the
cached data to be destaged and the originating
event to be marked fixed.
Possible Cause-FRUs or other:
v None
1930
Migration suspended.
Explanation: Migration suspended.
User response:
1. Ensure that all error codes of a higher priority have
already been fixed.
2. Ask the customer to ensure that all storage pools
that are the destination of suspended migrate
operations have available free extents.
3. Mark this error as “fixed”. This causes the migrate
operation to be restarted. If the restart fails, a new
error is logged.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
1940
HyperSwap volume or consistency
group has lost synchronization between
sites.
Explanation: HyperSwap volume or consistency group
has lost synchronization between sites.
User response: Complete the following steps to
resolve this problem.
1. Check the event log for any higher priority unfixed
errors.
2. HyperSwap volumes will automatically
resynchronize when the underlying problem has
been resolved.
Possible Cause-FRUs or other:
v N/A
1950
Unable to mirror medium error.
Explanation: During the synchronization of a mirrored
volume copy it was necessary to duplicate the record of
a medium error onto the volume copy, creating a
virtual medium error. Each managed disk has a table of
virtual medium errors. The virtual medium error could
not be created because the table is full. The volume
copy is in an inconsistent state and has been taken
offline.
User response: Three different approaches can be
taken to resolving this problem: 1) the source volume
copy can be fixed so that it does not contain medium
errors, 2) the number of virtual medium errors on the
target managed disk can be reduced or 3) the target
volume copy can be moved to a managed disk with
more free virtual medium error entries.
The managed disk with a full medium error table can
be determined from the data of the root event.
Approach 1) - This is the preferred procedure because
it restores the source volume copy to a state where all
of the data can be read. Use the normal service
procedures for fixing a medium error (rewrite block or
volume from backup or regenerate the data using local
procedures).
Approach 2) - This method can be used if the majority
of the virtual medium errors on the target managed
disk do not relate to the volume copy. Determine where
the virtual medium errors are using the event log
events and re-write the block or volume from backup.
Approach 3) - Delete the offline volume copy and
create a new one either forcing the use of different
MDisks in the storage pool or using a completely
different storage pool.
Follow your selection option(s) and then mark the error
as “fixed”.
Possible Cause-FRUs or other:
v None
2008
A software downgrade has failed.
Explanation: Cluster configuration changes are
restricted until the downgrade is completed. The cluster
downgrade process waits for user intervention when
this error is logged.
User response: The action required to recover from a
stalled downgrade depends on the current state of the
cluster being downgraded. Call IBM Support for an
action plan to resolve this problem.
Possible Cause-FRUs or other:
v None
Other:
2145 software (100%)
2010
A software update has failed.
Explanation: Cluster configuration changes are
restricted until the update is completed or rolled back.
The cluster update process waits for user intervention
when this error is logged.
User response: The action required to recover from a
stalled update depends on the current state of the
cluster being updated. Call IBM technical support for
an action plan to resolve this problem.
Possible Cause-FRUs or other:
v None
Other:
2145 software (100%)
2020
IP Remote Copy link unavailable.
Explanation: IP Remote Copy link is unavailable.
User response: Fix the remote IP link so that traffic
can flow correctly. Once the connection is made, the
error will auto-correct.
2021
Partner cluster IP address unreachable.
Explanation: Partner cluster IP address unreachable.
User response:
1. Verify the system IP address of the remote system
forming the partnership.
2. Check whether the remote cluster IP address is
reachable from the local cluster. The following can
be done to verify accessibility:
a. Use svctask to ping the remote cluster IP
address. If the ping works, there may be a block
on the specific port traffic that needs to be
opened in the network. If the ping does not
work, there may be no route between the
systems. Check the IP gateway configuration on
the SAN Volume Controller nodes and the IP
network configuration.
b. Check the configuration of the routers and
firewall to ensure that TCP/IP port 3620 used
for IP partnership is not blocked.
c. Use the ssh command from another system to
attempt to establish a session with the
problematic remote cluster IP address to confirm
that the remote cluster is operational.
2022
Cannot authenticate with partner cluster.
Explanation: Cannot authenticate with partner cluster.
User response: Verify that the CHAP secret that was
set for the partnership with the mkippartnership or
chpartnership command matches the CHAP secret that
was set on the remote system with the chsystem
command. If they do not match, use the appropriate
commands to set the correct CHAP secrets.
2023
Unexpected cluster ID for partner
cluster.
Explanation: Unexpected cluster ID for partner cluster.
User response: After deleting all relationships and
consistency groups, remove the partnership.
This is an unrecoverable error when one of the sites has
undergone a T3 recovery and lost all partnership
information. Contact IBM support.
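For event 2021, the port check in step 2b can be approximated from a
workstation with a plain TCP connection attempt. The following Python
sketch is an illustration under stated assumptions, not an IBM tool:
the function name tcp_port_open is hypothetical, and a test that is run
from a workstation might follow a different network path than traffic
between the two clusters, so treat the result as a hint only.

```python
import socket


def tcp_port_open(host, port=3620, timeout=5.0):
    """Return True if a TCP connection to host:port can be established.

    Port 3620 is the IP partnership port named in step 2b. A False result
    means the connection was refused, timed out, or could not be routed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False
```

A call such as tcp_port_open("remote-cluster-ip"), with the placeholder
replaced by the partner cluster IP address, returns False when a
firewall silently drops the traffic, after waiting for the timeout.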
2030
Software error.
Explanation: The software has restarted because of a
problem in the cluster, on a disk system or on the Fibre
Channel fabric.
User response:
1. Collect the software dump file(s) generated at the
time the error was logged on the cluster.
2. Contact your product support center to investigate
and resolve the problem.
3. Ensure that the software is at the latest level on the
cluster and on the disk systems.
4. Use the available SAN monitoring tools to check for
any problems on the fabric.
5. Mark the error that you have just repaired as
“fixed”.
6. Go to repair verification MAP.
Possible Cause-FRUs or other:
v Your support center might indicate a FRU based on
their problem analysis (2%)
Other:
v Software (48%)
v Enclosure/controller software (25%)
v Fibre Channel switch or switch configuration (25%)
2035
Drive has disabled protection
information support.
Explanation: An array has been interrupted in the
process of establishing data integrity protection
information on one or more of its members by initial
writes or rebuild writes.
In order to ensure the array is usable, the system has
turned off hardware data protection for the member
drive.
User response: If many or all the member drives in an
array have logged this error, and sufficient storage
exists in the pool to migrate the allocated extents, then
the simplest strategy is to delete the array and recreate
it once the drive service action has been accomplished.
If a small number of drives are affected then it is
simplest to remove these drives from the array and
service them individually. This option is not possible if
the array is currently syncing post recovery.
2040
A software update is required.
Explanation: The software cannot determine the VPD
for a FRU. Probably, a new FRU has been installed and
the software does not recognize that FRU.
User response:
1. If a FRU has been replaced, ensure that the correct
replacement part was used. The node VPD indicates
which part is not recognized.
2. Ensure that the cluster software is at the latest level.
3. Save dump data with configuration dump and
logged data dump.
4. Contact your product support center to resolve the
problem.
5. Mark the error that you have just repaired as
“fixed”.
6. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
2145 software (100%)
2055
System reboot required.
Explanation: System reboot required.
User response: The software update is not complete.
Reboot the system.
The system will not be available for IO or systems
management during the system reset.
2060
Manual discharge of batteries required.
Explanation: Manual discharge of batteries required.
User response: Use chenclosureslot -battery -slot
1 -recondition on to cause battery calibration.
2070
A drive has been detected in an
enclosure that does not support that
drive.
Explanation: A drive has been detected in an
enclosure that does not support that drive.
User response: Remove the drive. If the result is an
invalid number of drives, replace the drive with a valid
drive.
Possible Cause-FRUs or other:
Drive (100%)
2100
A software error has occurred.
Explanation: One of the V3700 server software
components (sshd, crond, or httpd) has failed and
reported an error.
User response:
1. Ensure that the software is at the latest level on the
cluster.
2. Save dump data with configuration dump and
logged data dump.
3. Contact your product support center to resolve the
problem.
4. Mark the error that you have just repaired as
“fixed”.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
v None
Other:
V3700 software (100%)
2115
Performance of external MDisk has
changed
Explanation: The system identified a change in the
performance category of an external MDisk. A storage
device in the external system might have been replaced
with a device that has different performance
characteristics to the original. The ID of the MDisk is
logged in the event (Bytes 5-8 of the sense data). It
might be necessary to re-configure the tier of the
MDisk so that Easy Tier makes best use of the storage.
User response: Run the fix procedure for this event,
which assists you with the following tasks:
1. Run the Detect MDisks task, so that the system
determines the current performance category of
each MDisk. When the detection task is complete, if
performance has reverted, the event is automatically
marked as fixed.
2. If the event is not automatically fixed, you can
change the tier of the MDisk to the recommended
tier shown in the event properties. The
recommended tier is logged in the event (Bytes 9-13
of the sense data. A value of 10 hex indicates flash
tier, a value of 20 hex indicates enterprise tier).
3. If you choose not to change the tier configuration,
mark the event as fixed.
2258
System SSL certificate has expired.
Explanation: System SSL certificate has expired.
Connections to the GUI, service assistant, and CIMOM
are likely to generate security exceptions.
User response: Complete the following steps to
resolve this problem.
1. Access the CLI by using ssh.
2. Check that the system time and date is correct. If it
is incorrect, it can cause the certificate to be
incorrectly marked as expired.
3. Create a new self-signed system certificate, or create
a certificate request. Get it signed by your certificate
authority and install the signed request.
Note: If it takes some time to get a certificate
signed, you can also create a self-signed certificate
to use while you wait for your request to be signed.
Possible Cause-FRUs or other:
v N/A
2259
Storwize V7000 Gen1 compatibility
mode can now be disabled on this
system.
Explanation: No more Storwize® V7000 Gen1 canisters
are attached to the system.
User response: Complete one of the following actions:
v If you want to disable Storwize V7000 Gen1
compatibility mode, enter the following command:
chsystem -gen1compatibilitymode no
v If you want to maintain Storwize V7000 Gen1
compatibility mode, you can reattach Storwize V7000
Gen1 canisters to the cluster.
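For event 2258, step 2 notes that an incorrect system clock can make a
valid certificate look expired. The following Python sketch illustrates
that relationship only. It assumes that you have already read the
certificate "Not After" date, for example from a browser certificate
viewer, as a string in the format that the Python ssl module accepts;
the function names are hypothetical.

```python
import ssl
import time


def seconds_until_expiry(not_after, now=None):
    """Return the seconds that remain before the certificate expires.

    not_after is a date string such as "Jan  5 09:34:43 2018 GMT".
    now is the clock reading to compare against (seconds since the
    epoch); it defaults to the local clock. A negative result means
    that the certificate is expired according to that clock.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return expires - now


def is_expired(not_after, now=None):
    """Return True if the certificate is expired according to the clock."""
    return seconds_until_expiry(not_after, now) < 0
```

Passing a deliberately wrong value for now shows the effect of clock
skew: a clock that runs one day fast reports a certificate as expired
one day early.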
2500
A secure shell (SSH) session limit for
the cluster has been reached.
Explanation: Secure Shell (SSH) sessions are used by
applications that manage the cluster. An example of
such an application is the command-line interface
(CLI). An application must initially log in to the cluster
to create an SSH session. The cluster imposes a limit on
the number of SSH sessions that can be open at one
time. This error indicates that the limit on the number
of SSH sessions has been reached and that no more
logins can be accepted until a current session logs out.
The limit on the number of SSH sessions is usually
reached because multiple users have opened an SSH
session but have forgotten to close the SSH session
when they are no longer using the application.
User response:
v Because this error indicates a problem with the
number of sessions that are attempting external
access to the cluster, determine the reason that so
many SSH sessions have been opened.
v Run the Fix Procedure for this error on the panel at
Management GUI Troubleshooting >
Recommended Actions to view and manage the
open SSH sessions.
2550
Encryption key on USB flash drive
removed
Explanation: The USB flash drive in a particular node
or port has been removed. This USB flash drive
contained a valid encryption key for the system.
Unauthorized removal can compromise data security.
User response: If your data has been compromised,
perform a rekey operation immediately.
2555
Encryption key error on USB flash
drive.
Explanation: It is necessary to provide an encryption
key before the system can become fully operational.
This error occurs when the encryption key identified is
invalid. A file with the correct name was found but the
key in the file is corrupt.
User response: Remove the USB flash drive from the
port.
2600
The cluster was unable to send an
email.
Explanation: The cluster has attempted to send an
email in response to an event, but there was no
acknowledgement that it was successfully received by
the SMTP mail server. It might have failed because the
cluster was unable to connect to the configured SMTP
server, the email might have been rejected by the
server, or a timeout might have occurred. The SMTP
server might not be running or might not be correctly
configured, or the cluster might not be correctly
configured. This error is not logged by the test email
function because it responds immediately with a result
code.
User response:
v Ensure that the SMTP email server is active.
v Ensure that the SMTP server TCP/IP address and
port are correctly configured in the cluster email
configuration.
v Send a test email and validate that the change has
corrected the issue.
v Mark the error that you have just repaired as fixed.
v Go to MAP 5700: Repair verification.
Possible Cause-FRUs or other:
v None
2601
Error detected while sending an email.
Explanation: An error has occurred while the cluster
was attempting to send an email in response to an
event. The cluster is unable to determine if the email
has been sent and will attempt to resend it. The
problem might be with the SMTP server or with the
cluster email configuration. The problem might also be
caused by a failover of the configuration node. This
error is not logged by the test email function because it
responds immediately with a result code.
User response:
v If there are higher-priority unfixed errors in the log,
fix those errors first.
v Ensure that the SMTP email server is active.
v Ensure that the SMTP server TCP/IP address and
port are correctly configured in the cluster email
configuration.
v Send a test email and validate that the change has
corrected the issue.
v Mark the error that you have just repaired as fixed.
v Go to MAP 5700: Repair verification.
Possible Cause-FRUs or other:
v None
2700
Unable to access NTP network time
server
Explanation: Cluster time cannot be synchronized
with the NTP network time server that is configured.
User response: There are three main causes to
examine:
v The cluster NTP network time server configuration is
incorrect. Ensure that the configured IP address
matches that of the NTP network time server.
v The NTP network time server is not operational.
Check the status of the NTP network time server.
v The TCP/IP network is not configured correctly.
Check the configuration of the routers, gateways and
firewalls. Ensure that the cluster can access the NTP
network time server and that the NTP protocol is
permitted.
The error will automatically fix when the cluster is able
to synchronize its time with the NTP network time
server.
Possible Cause-FRUs or other:
v None
2702
Check configuration settings of the NTP
server on the CMM
Explanation: The node is configured to automatically
set the time using an NTP server within the CMM. It is
not possible to connect to the NTP server during
authentication. The NTP server configuration cannot be
changed within S-ITE. Within the CMM, there are
changeable NTP settings. However, these settings
configure how the CMM gets the time and date - the
internal CMM NTP server that is used by the S-ITE
cannot be changed or configured. This event is only
raised when an attempt is made to use the server
(once every half hour).
Note: The NTP configuration settings are re-read from
the CMM before each connection.
The reason for a connection error can be due to the
following:
v all suitable Ethernet ports are offline
v the CMM hardware is not operational
v the CMM is active but the CMM NTP server is
offline.
The reason for an authentication issue can be due to
the following:
v the authentication values provided were invalid
v the NTP server rejected the authentication key
provided to the node by the CMM.
If the NTP port is an unsupported value, a port error
can display. Currently, only port 123 is supported. Only
the current configuration node attempts to resync with
the server.
User response:
1. Make sure that CMM is operational by logging in
and confirming its time.
2. Check that the IP address in the event log can be
pinged from the node.
3. If there is an error, try rebooting the CMM.
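For event 2700, one way to confirm that an NTP network time server is
operational and reachable is to send it a basic, unauthenticated SNTP
request from a workstation. The following Python sketch is an
illustration under stated assumptions: the function names are
hypothetical, it uses IPv4 and UDP port 123 only, and it does not
exercise the authentication that event 2702 describes. A reply at the
workstation does not prove that the cluster itself can reach the
server.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request():
    """Return a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(response):
    """Return the server transmit time as seconds since the Unix epoch."""
    if len(response) < 48:
        raise ValueError("NTP response is shorter than 48 bytes")
    # The transmit timestamp occupies bytes 40-47: seconds, then fraction.
    seconds, fraction = struct.unpack("!II", response[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_ntp(host, port=123, timeout=5.0):
    """Send one SNTP request and return the server time (Unix seconds).

    Raises OSError (including a timeout) if no reply is received.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, port))
        response, _ = sock.recvfrom(512)
    return parse_transmit_time(response)
```

A timeout from query_ntp suggests that the server is down or that a
router, gateway, or firewall on the path does not permit the NTP
protocol.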
3000
The 2145 UPS temperature is close to its
upper limit. If the temperature
continues to rise the 2145 UPS will
power off.
Explanation: The temperature sensor in the 2145 UPS
is reporting a temperature that is close to the
operational limit of the unit. If the temperature
continues to rise the 2145 UPS will power off for safety
reasons. The sensor is probably reporting an excessively
high temperature because the environment in which
the 2145 UPS is operating is too hot.
User response:
1. Ensure that the room ambient temperature is within
the permitted limits.
2. Ensure that the air vents at the front and back of
the 2145 UPS are not obstructed.
3. Ensure that other devices in the same rack are not
overheating.
4. When you are satisfied that the cause of the
overheating has been resolved, mark the error
“fixed”.
3001
The 2145 UPS-1U temperature is close to
its upper limit. If the temperature
continues to rise the 2145 UPS-1U will
power off.
Explanation: The temperature sensor in the 2145
UPS-1U is reporting a temperature that is close to the
operational limit of the unit. If the temperature
continues to rise the 2145 UPS-1U will power off for
safety reasons. The sensor is probably reporting an
excessively high temperature because the environment
in which the 2145 UPS-1U is operating is too hot.
User response:
1. Ensure that the room ambient temperature is within
the permitted limits.
2. Ensure that the air vents at the front and back of
the 2145 UPS-1U are not obstructed.
3. Ensure that other devices in the same rack are not
overheating.
4. When you are satisfied that the cause of the
overheating has been resolved, mark the error
“fixed”.
3010
Internal uninterruptible power supply
software error detected.
Explanation: Some of the tests that are performed
during node startup did not complete because some of
the data reported by the uninterruptible power supply
during node startup is inconsistent because of a
software error in the uninterruptible power supply. The
node has determined that the uninterruptible power
supply is functioning sufficiently for the node to
continue operations. The operation of the cluster is not
affected by this error. This error is usually resolved by
power cycling the uninterruptible power supply.
User response:
1. Power cycle the uninterruptible power supply at a
convenient time. The one or two nodes attached to
the uninterruptible power supply should be
powered off before powering off the uninterruptible
power supply. Once the nodes have powered off,
wait 5 minutes for the uninterruptible power supply
to go into standby mode (flashing green AC LED).
If this does not happen automatically then check the
cabling to confirm that all nodes powered by this
uninterruptible power supply have been powered
off. Remove the power input cable from the
uninterruptible power supply and wait at least 2
minutes for the uninterruptible power supply to
clear its internal state. Reconnect the uninterruptible
power supply power input cable. Press the
uninterruptible power supply ON button. Power on
the nodes connected to this uninterruptible power
supply.
2. If the error is reported again after the nodes are
restarted replace the 2145 UPS electronics assembly.
Possible Cause-FRUs or other:
v 2145 UPS electronics assembly (5%)
Other:
v Transient 2145 UPS error (95%)
3024
Technician port connection invalid
Explanation: The code has detected more than one
MAC address through the connection, or the DHCP has
given out more than one address. The code thus
believes there is a switch attached.
User response:
1. Remove the cable from the technician port.
2. (Optional) Disable additional network adapters on
the laptop to which it is to be connected.
3. Ensure DHCP is enabled on the network adapter.
4. If this was not possible, manually set the IP to
192.168.0.2
5. Connect a standard Ethernet cable between the
network adapter and the technician port.
6. If this still does not work, reboot the node and
repeat the above steps.
7. This event will auto-fix once either no connection or
a valid connection has been detected.
3025
A virtualization feature license is
required.
Explanation: The cluster has no virtualization feature
license registered. You should have either an Entry
Edition Physical Disk virtualization feature license or a
Capacity virtualization feature license that covers the
cluster.
The cluster will continue to operate, but it might be
violating the license conditions.
User response:
v If you do not have a virtualization feature license
that is valid and sufficient for this cluster, contact
your IBM sales representative, arrange a license and
change the license settings for the cluster to register
the license.
v The error will automatically fix when the situation is
resolved.
Possible Cause-FRUs or other:
v None
3029
Virtualization feature capacity is not
valid.
Explanation: The setting for the amount of space that
can be virtualized is not valid. The value must be an
integer number of terabytes.
This error event is created when a cluster is upgraded
from a version prior to 4.3.0 to version 4.3.0 or later.
Prior to version 4.3.0 the virtualization feature capacity
value was in gigabytes and therefore could be set to a
fraction of a terabyte. With version 4.3.0 and later the
licensed capacity for the virtualization feature must be
an integer number of terabytes.
User response:
v Review the license conditions for the virtualization
feature. If you have one cluster, change the license
settings for the cluster to match the capacity that is
licensed. If your license covers more than one cluster,
apportion an integer number of terabytes to each
cluster. You might have to change the virtualization
capacity that is set on the other clusters to ensure
that the sum of the capacities for all of the clusters
does not exceed the licensed capacity.
v You can view the event data or the feature log to
ensure that the licensed capacity is sufficient for the
space that is actually being used. Contact your IBM
sales representative if you want to change the
capacity of the license.
v This error will automatically be fixed when a valid
configuration is entered.
Possible Cause-FRUs or other:
v None
3030
Global and Metro Mirror feature
capacity not set.
Explanation: The Global and Metro Mirror feature is
set to On for the cluster, but the capacity has not been
set.
This error event is created when a cluster is upgraded
from a version prior to 4.3.0 to version 4.3.0 or later.
Prior to version 4.3.0 the feature can only be set to On
or Off; with version 4.3.0 and later the licensed capacity
for the feature must also be set.
User response: Perform one of the following actions:
v Change the Global and Metro Mirror license settings
for the cluster either to the licensed Global and
Metro Mirror capacity, or if the license applies to
more than one cluster, to the portion of the license
allocated to this cluster. Set the licensed Global and
Metro Mirror capacity to zero if it is no longer being
used.
v View the event data or the feature log to ensure that
the licensed Global and Metro Mirror capacity is
sufficient for the space actually being used. Contact
your IBM sales representative if you want to change
the licensed Global and Metro Mirror capacity.
v The error will automatically be fixed when a valid
configuration is entered.
Possible Cause-FRUs or other:
v None
3031
FlashCopy feature capacity not set.
Explanation: The FlashCopy feature is set to On for
the cluster, but the capacity has not been set.
This error event is created when a cluster is upgraded
from a version prior to 4.3.0 to version 4.3.0 or later.
Prior to version 4.3.0 the feature can only be set to On
or Off; with version 4.3.0 and later the licensed capacity
for the feature must also be set.
User response: Perform one of the following actions:
v Change the FlashCopy license settings for the cluster
either to the licensed FlashCopy capacity, or if the
license applies to more than one cluster, to the
portion of the license allocated to this cluster. Set the
licensed FlashCopy capacity to zero if it is no longer
being used.
v View the event data or the feature log to ensure that
the licensed FlashCopy capacity is sufficient for the
space actually being used. Contact your IBM sales
representative if you want to change the licensed
FlashCopy capacity.
v The error will automatically be fixed when a valid
configuration is entered.
Possible Cause-FRUs or other:
v None
3032
Feature license limit exceeded.
Explanation: The amount of space that is licensed for
a cluster feature is being exceeded.
The feature that is being exceeded might be:
v Virtualization feature - event identifier 009172
v FlashCopy feature - event identifier 009173
v Global and Metro Mirror feature - event identifier
009174
The cluster will continue to operate, but it might be
violating the license conditions.
User response:
v Determine which feature license limit has been
exceeded. This might be:
– Virtualization feature - event identifier 009172
– FlashCopy feature - event identifier 009173
– Global and Metro Mirror feature - event identifier
009174
v Ensure that the feature capacity that is reported by
the cluster has been set to match either the licensed
size, or if the license applies to more than one
cluster, to the portion of the license that is allocated
to this cluster.
v Decide whether to increase the feature capacity or to
reduce the space that is being used by this feature.
v To increase the feature capacity, contact your IBM
sales representative and arrange an increased license
capacity. Change the license settings for the cluster to
set the new licensed capacity. Alternatively, if the
license applies to more than one cluster modify how
the licensed capacity is apportioned between the
clusters. Update every cluster so that the sum of the
license capacity for all of the clusters does not exceed
the licensed capacity for the location.
v To reduce the amount of disk space that is
virtualized, delete some of the managed disks or
image mode volumes. The used virtualization size is
the sum of the capacities of all of the managed disks
and image mode disks.
v To reduce the FlashCopy capacity delete some
FlashCopy mappings. The used FlashCopy size is the
sum of all of the volumes that are the source volume
of a FlashCopy mapping.
v To reduce Global and Metro Mirror capacity delete
some Global Mirror or Metro Mirror relationships.
The used Global and Metro Mirror size is the sum of
the capacities of all of the volumes that are in a
Metro Mirror or Global Mirror relationship; both
master and auxiliary volumes are counted.
v The error will automatically be fixed when the
licensed capacity is greater than the capacity that is
being used.
Possible Cause-FRUs or other:
v None
3035
Physical Disk FlashCopy feature license
required
Explanation: The Entry Edition cluster has some
FlashCopy mappings defined. There is, however, no
Physical Disk FlashCopy license registered on the
cluster. The cluster will continue to operate, but it
might be violating the license conditions.
User response:
v Check whether you have an Entry Edition Physical
Disk FlashCopy license for this cluster that you have
not registered on the cluster. Update the cluster
license configuration if you have a license.
v Decide whether you want to continue to use the
FlashCopy feature or not.
v If you want to use the FlashCopy feature contact
your IBM sales representative, arrange a license and
change the license settings for the cluster to register
the license.
v If you do not want to use the FlashCopy feature, you
must delete all of the FlashCopy mappings.
v The error will automatically fix when the situation is
resolved.
Possible Cause-FRUs or other:
v None
3036
Physical Disk Global and Metro Mirror
feature license required
Explanation: The Entry Edition cluster has some
Global Mirror or Metro Mirror relationships defined.
There is, however, no Physical Disk Global and Metro
Mirror license registered on the cluster. The cluster will
continue to operate, but it might be violating the
license conditions.
User response:
v Check if you have an Entry Edition Physical Disk
Global and Metro Mirror license for this cluster that
you have not registered on the cluster. Update the
cluster license configuration if you have a license.
v Decide whether you want to continue to use the
Global Mirror or Metro Mirror features or not.
v If you want to use either the Global Mirror or Metro
Mirror feature contact your IBM sales representative,
arrange a license and change the license settings for
the cluster to register the license.
v If you do not want to use both the Global Mirror and
Metro Mirror features, you must delete all of the
Global Mirror and Metro Mirror relationships.
v The error will automatically fix when the situation is
resolved.
Possible Cause-FRUs or other:
v None
3080
Global or Metro Mirror relationship or
consistency group with deleted
partnership
Explanation: A Global Mirror or Metro Mirror
relationship or consistency group exists with a cluster
whose partnership is deleted.
Beginning with SAN Volume Controller software
version 4.3.1 this configuration is not supported and
should be resolved. This condition can occur as a result
of an update to SAN Volume Controller software
version 4.3.1 or later.
User response: The issue can be resolved either by
deleting all of the Global Mirror or Metro Mirror
relationships or consistency groups that exist with a
cluster whose partnership is deleted, or by recreating
all of the partnerships that they were using.
The error will automatically fix when the situation is
resolved.
v List all of the Global Mirror and Metro Mirror
relationships and note those where the master cluster
name or the auxiliary cluster name is blank. For each
of these relationships, also note the cluster ID of the
remote cluster.
v List all of the Global Mirror and Metro Mirror
consistency groups and note those where the master
cluster name or the auxiliary cluster name is blank.
For each of these consistency groups, also note the
cluster ID of the remote cluster.
v Determine how many unique remote cluster IDs
there are among all of the Global Mirror and Metro
Mirror relationships and consistency groups that you
have identified in the first two steps. For each of
these remote clusters, decide if you want to
re-establish the partnership with that cluster. Ensure
that the total number of partnerships that you want
to have with remote clusters does not exceed the
cluster limit. In version 4.3.1 this limit is 1. If you
re-establish a partnership, you will not have to delete
the Global Mirror and Metro Mirror relationships
and consistency groups that use the partnership.
v Re-establish any selected partnerships.
v Delete all of the Global Mirror and Metro Mirror
relationships and consistency groups that you listed
in either of the first two steps whose remote cluster
partnership has not been re-established.
v Check that the error has been marked as fixed by the
system. If it has not, return to the first step and
determine which Global Mirror or Metro Mirror
relationships or consistency groups are still causing
the issue.
Possible Cause-FRUs or other:
v None
3081
Unable to send email to any of the
configured email servers.
Explanation: Either the system was not able to
connect to any of the SMTP email servers, or the email
transmission has failed. A maximum of six email
servers can be configured. Error event 2600 or 2601 is
raised when an individual email server is found to be
not working. This error indicates that all of the email
servers were found to be not working.
User response:
v Perform the check email function to test that an
email server is operating properly.
v Check the event log for all unresolved 2600 and 2601
errors and fix those problems.
v If this error has not already been automatically
marked fixed, mark this error as fixed.
Possible Cause-FRUs or other:
v None
3090
Drive firmware download is cancelled
by user or system, problem diagnosis
required.
Explanation: The drive firmware download has been
cancelled by the user or the system and problem
diagnosis is required.
User response: If you cancelled the download using
applydrivesoftware -cancel then this error is to be
expected.
If you changed the state of any drive while the
download was ongoing, this error is to be expected,
however you will have to rerun the
applydrivesoftware command to ensure all your drive
firmware has been updated.
Otherwise:
1. Check the drive states using lsdrive, in particular
look at drives which are status=degraded, offline or
use=failed.
2. Check node states using lsnode or lsnodecanister,
and confirm all nodes are online.
3. Use lsdependentvdisks -drive <drive_id> to check
for vdisks that are dependent on specific drives.
4. If the drive is a member of a RAID0 array, consider
whether to introduce additional redundancy to
protect the data on that drive.
5. If the drive is not a member of a RAID0 array, fix
any errors in the event log that relate to the array.
6. Consider using the -force option. With any drive
software upgrade there is a risk that the drive
might become unusable. Only use the -force option
if you accept this risk.
7. Reissue the applydrivesoftware command.
Note: The lsdriveupgradeprogress command can be
used to check the progress of the applydrivesoftware
command as it updates each drive.
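Step 1 of the 3090 response (checking drive states) can be scripted once the lsdrive output has been captured. The following Python sketch is illustrative only. It assumes that the output was collected in colon-delimited form (the -delim parameter of the ls commands is assumed here) and that it contains at least the id, status, and use columns. The sample text is invented for the example; the real column set can differ between code levels.

```python
import csv
import io


def drives_needing_attention(lsdrive_output: str, delim: str = ":") -> list:
    """Return the rows whose status is degraded or offline, or whose use is failed."""
    reader = csv.DictReader(io.StringIO(lsdrive_output.strip()), delimiter=delim)
    flagged = []
    for row in reader:
        if row.get("status") in ("degraded", "offline") or row.get("use") == "failed":
            flagged.append(row)
    return flagged


# Hypothetical captured output (not taken from a real system).
SAMPLE = """\
id:status:use
0:online:member
1:degraded:member
2:online:failed
3:offline:spare
"""

if __name__ == "__main__":
    for drive in drives_needing_attention(SAMPLE):
        print(drive["id"], drive["status"], drive["use"])
```

Drives that the sketch flags are the ones to investigate before you reissue the applydrivesoftware command.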
3130
System SSL certificate expires within
the next 30 days.
Explanation: System SSL certificate expires within the
next 30 days.
The system SSL certificate that is used to authenticate
connections to the GUI, service assistant, and the
CIMOM is about to expire.
User response: Complete the following steps to
resolve this problem.
1. If you are using a self-signed certificate, then
generate a new self-signed certificate.
2. If you are using a certificate that is signed by a
certificate authority, generate a new certificate
request and get this certificate signed by your
certificate authority. The existing certificate can
continue to be used until the expiry date to provide
time to get the new certificate request signed and
installed.
Possible Cause-FRUs or other:
v N/A
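To judge how much time remains before the certificate named by event 3130 expires, the expiry date can be compared with the current time. The following Python sketch is a minimal illustration, not part of the product. It assumes that you have copied the certificate's "not after" date (for example, from the browser's certificate viewer) in the textual form that the standard ssl.cert_time_to_seconds function accepts. The function names and the sample dates are hypothetical.

```python
import ssl

WARNING_WINDOW_DAYS = 30  # the window named by event 3130


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between now_epoch and a certificate 'not after' time string."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400


def expires_soon(not_after: str, now_epoch: float) -> bool:
    """True if the certificate expires within the warning window."""
    return days_until_expiry(not_after, now_epoch) <= WARNING_WINDOW_DAYS


if __name__ == "__main__":
    import time

    # Hypothetical expiry date; replace with the value from your certificate.
    print(days_until_expiry("Jan 30 00:00:00 2030 GMT", time.time()))
```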
Procedure: SAN problem determination
You can solve problems on the SAN Volume Controller system and its connection
to the storage area network (SAN).
About this task
SAN failures might cause SAN Volume Controller volumes to be inaccessible to
host systems. Failures can be caused by SAN configuration changes or by
hardware failures in SAN components.
The following list identifies some of the hardware that might cause failures:
v Power, fan, or cooling
v Application-specific integrated circuits
v Installed small form-factor pluggable (SFP) transceiver
v Fiber-optic cables
If either the maintenance analysis procedures or the error codes sent you here,
complete the following steps:
Procedure
1. If the customer changed the SAN configuration by changing the Fibre Channel
cable connections or switch zoning, verify that the changes were correct and, if
necessary, reverse those changes.
2. Verify that the power is turned on to all switches and storage controllers that
the SAN Volume Controller system uses, and that they are not reporting any
hardware failures. If problems are found, resolve those problems before you
proceed further.
3. Verify that the Fibre Channel cables that connect the systems to the switches
are securely connected.
4. If the customer is running a SAN management tool, you can use that tool to
view the SAN topology and isolate the failing component.
Resolving a problem with SSL/TLS clients
Changing the security level of the system might cause the web interface, CIM
clients, and other SSL/TLS clients to stop working. If any clients stop working,
complete the following procedure.
Procedure
1. Wait 5 minutes and try again. The clients might still need to wait for the
services to restart.
2. Confirm that the SSL/TLS implementation of the client (for example, the web
browser or CIM management tool) is up to date and supports the level of
security that is being enforced.
3. If necessary, revert to a weaker SSL/TLS security level in SAN Volume
Controller and see whether this action resolves the issue.
4. If the problem is a browser problem, check the exact error message reported by
the browser.
If the error message is cipher error, SSL error, TLS error, or handshake error,
then the error implies that there is a problem with the secure connection. In
this case, confirm that the browser is up to date. All of the supported browsers
(Internet Explorer, Firefox, Firefox ESR, and Chrome) support TLS 1.2 at the
latest version.
If there is only a blank screen, it is likely that either the web service needs to
restart, or there is a problem unrelated to the security level.
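To separate a client-side TLS limitation from a problem on the system, it can help to attempt a handshake from a workstation with a client that is known to require TLS 1.2 or later. The following Python sketch uses only the standard ssl and socket modules. The host name that you pass to probe, the default port 443, and the function names are assumptions for illustration. Certificate verification is deliberately disabled because the system might present a self-signed certificate, so use the sketch for diagnosis only.

```python
import socket
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.2."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Diagnostic probe only: skip verification so that a self-signed
    # system certificate does not mask the protocol-level result.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the negotiated protocol version, or a short error description."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with make_tls12_context().wrap_socket(raw, server_hostname=host) as tls:
                return tls.version() or "unknown"
    except ssl.SSLError as exc:
        # Handshake-level failure: the secure connection could not be set up.
        return f"TLS handshake failed: {exc.reason}"
    except OSError as exc:
        # Network-level failure: unrelated to the security level.
        return f"connection failed: {exc}"
```

A handshake failure points at the secure connection, which matches the cipher, SSL, TLS, or handshake errors described in step 4. A connection failure suggests that the web service is still restarting or that the problem is unrelated to the security level.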
Procedure: Making drives support protection information
You can use this procedure to migrate drives and arrays to pick up support for
protection information.
About this task
Drives cannot start using protection information for I/O requests on demand. They
must be validated as having a correct format and general support for the function
within the code. SAN Volume Controller is capable of validating the format and
general support when the drive object is first discovered by the system. The
requirement for system validation means that no drive that exists can use
protection information on an update from version 730 regardless of use in the
configuration. The system can reject a request to make a drive a candidate if the
media is not formatted correctly for use with protection information. The process
to begin using protection information on an existing drive is to use the system
interface (GUI/CLI) and involves unmanaging and rediscovering the drive to
allow the software to reacquire the drive characteristics.
The lsdrive view contains the protection_enabled field that shows whether a
drive is using protection information. Drives and arrays that exist on an update to
version 740 do not automatically pick up support for protection information. All
newly discovered drives at this code level support protection information. If the
system has spare capacity, then migration can proceed an MDisk at a time.
Otherwise, the migration to using protection information on drives must proceed
drive by drive.
To migrate an MDisk by using spare storage capacity, complete the following
procedure.
Procedure
1. Migrate data off the MDisk. The data migration can be accomplished by MDisk
migration as part of MDisk delete (rmmdisk, lsmigrate) within a storage pool.
You can also use volume mirroring to create an in-sync mirrored copy of each
volume in another pool (addvdiskcopy). When it is copied
(lsvdisksyncprogress), delete the original volume copies (rmvdiskcopy), and
then delete the MDisk (rmmdisk) that has no data.
2. When the MDisk is deleted (see lsmigrate), follow the instructions in step 5 for
all the drives that are now candidates.
3. When all old members adopt protection information, re-create the array by
using the system interface.
4. To adopt protection information on an individual drive if the drive is a
member, complete the following steps:
a. Run the charraymember command to eject the drive from the array (either
immediately with redundancy loss or after an exchange).
b. When the drive is no longer a member, follow the instructions in step 5 for
candidates or spares.
c. Repeat for the next member.
5. If the drive is a spare or candidate, complete the following steps:
a. Use the management GUI to take the drive offline.
b. When the drive is offline, use the system interfaces to change the drive's use
to unused.
c. The system reacquires the drive and brings it back online, possibly changing
the drive ID.
d. Use the system interface to change the drive's use to candidate, and then if
required, to spare.
e. Enter lsdrive driveID and check that the protection_enabled field is yes.
This drive can now be used in an array.
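The check in step 5e can be automated when many drives are converted. The following Python sketch is a minimal illustration. It assumes that the detailed lsdrive view has been captured as text with one field name and value per line, separated by white space. The sample text and the function names are hypothetical, and the real output contains many more fields.

```python
def parse_detailed_view(output: str) -> dict:
    """Parse 'name value' lines from a captured detailed view into a dict."""
    fields = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def protection_enabled(output: str) -> bool:
    """True only when the protection_enabled field is exactly 'yes'."""
    return parse_detailed_view(output).get("protection_enabled") == "yes"


# Hypothetical captured output (not taken from a real system).
SAMPLE = """\
id 7
status online
use candidate
protection_enabled yes
"""

if __name__ == "__main__":
    print(protection_enabled(SAMPLE))
```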
Resolving a problem with new expansion enclosures
Determine why a newly installed expansion enclosure was not detected by the
system.
When you install a new expansion enclosure, follow the management GUI Add
Enclosure wizard. Select Monitoring > System. From the Actions menu, select
Add Enclosures.
If the expansion enclosure is not detected, complete the following verifications:
v Verify the status of the LEDs at the back of the expansion enclosure. At least one
power supply unit must be on with no faults shown. At least one canister must
be active, with no fault LED on. The SAN Volume Controller 2145-24F has two
LEDs per Serial-attached SCSI (SAS) port: one green link-status LED and one
amber fault LED. The link status LED of the ports that are in use is on while the
fault LED is off. For details about LED status, see SAN Volume Controller
2145-24F expansion canister SAS ports and indicators.
v Verify that the SAS cabling to the expansion enclosure is correctly installed. To
review the requirements, see Connecting the optional expansion enclosures.
Fibre Channel and 10G Ethernet link failures
You might need to replace the small form-factor pluggable (SFP) transceiver when
a failure occurs on a single Fibre Channel or 10G Ethernet link (applicable to Fibre
Channel over Ethernet personality enabled 10G Ethernet link).
Before you begin
The following items can indicate that a single Fibre Channel or 10 G Ethernet link
failed:
v The Fibre Channel port status on the front panel of the node
v The Fibre Channel status light-emitting diodes (LEDs) at the rear of the node
v An error that indicates that a single port failed (703, 723).
About this task
Use only IBM supported 10 Gb SFP transceivers with the SAN Volume Controller
2145-DH8. Using any other SFP transceivers can lead to unexpected system
behavior. Copper DAC is not supported by these 10 Gb ports. The SFP transceiver
replacement in a 10 Gbps Ethernet adapter port is governed by the following rules:
v An existing 10 Gb SFP transceiver is replaced with a new 10 Gb SFP transceiver:
– The 10 Gbps Ethernet adapter port detects the new SFP transceiver and becomes
operational immediately.
v The port has contained an incorrect SFP transceiver since the last reboot and it is
replaced with the correct 10 Gb SFP transceiver. This situation can occur with an
incompatible SFP transceiver (8 Gb SFP or 4 Gb SFP) that is inserted in the
10 Gbps Ethernet adapter port:
– The node requires a reboot to detect the new SFP transceiver. The new
SFP transceiver is operational only after the reboot (no DMP is produced).
v The 10 Gbps Ethernet adapter port has contained no SFP transceiver since the last
reboot and the correct 10 Gb SFP transceiver is installed:
– A system reboot is required to detect the new SFP transceiver.
Procedure
Attempt each of these actions, in the following order, until the failure is fixed.
1. Replace the SFP transceiver for the failing port on the node.
Note: SAN Volume Controller nodes support both longwave SFP transceivers and
shortwave SFP transceivers. You must replace an SFP transceiver with the same
type of SFP transceiver. If the SFP transceiver to replace is a longwave SFP
transceiver, for example, you must replace it with a longwave SFP transceiver.
Removing the wrong SFP transceiver might result in loss of data access.
2. Replace the Fibre Channel adapter or Fibre Channel over Ethernet adapter on
the node.
Ethernet iSCSI host-link problems
If you are having problems attaching to the Ethernet hosts, your problem might be
related to the network, the SAN Volume Controller system, or the host.
Note: The SAN Volume Controller and host IP addresses should be on the same
VLAN. The host and the SAN Volume Controller nodes should not use the same
subnet on different VLANs.
For network problems, you can attempt any of the following actions:
v Test your connectivity between the host and SAN Volume Controller ports.
v Try to ping the SAN Volume Controller system from the host.
v Ask the Ethernet network administrator to check the firewall and router settings.
v Check that the subnet mask and gateway are correct for the SAN Volume
Controller host configuration.
Using the management GUI for SAN Volume Controller problems, you can attempt
any of the following actions:
v View the configured node port IP addresses.
v View the list of volumes that are mapped to a host to ensure that the volume
host mappings are correct.
v Verify that the volume is online.
For host problems, you can attempt any of the following actions:
v Verify that the host iSCSI qualified name (IQN) is correctly configured.
v Use operating system utilities (such as Windows device manager) to verify that
the device driver is installed, loaded, and operating correctly.
v If you configured a VLAN, check that its settings are correct. Ensure that the host
Ethernet port, the SAN Volume Controller Ethernet port IP addresses, and the switch
port are on the same VLAN ID. Ensure that a different subnet is used on each
VLAN. Configuring the same subnet on different VLAN IDs can cause network
connectivity problems.
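The VLAN and subnet rule above can be checked against a list of port settings before you change the network. The following Python sketch uses the standard ipaddress module. The port names, VLAN IDs, and addresses are invented for the example, and the function name is hypothetical; the sketch only reports subnets that appear on more than one VLAN ID.

```python
import ipaddress
from collections import defaultdict


def vlan_subnet_problems(ports):
    """ports: iterable of (name, vlan_id, 'address/prefix') tuples.

    Returns one message for every subnet that is configured on more than
    one VLAN ID.
    """
    vlans_by_subnet = defaultdict(set)
    for _name, vlan_id, cidr in ports:
        network = ipaddress.ip_interface(cidr).network
        vlans_by_subnet[network].add(vlan_id)
    return [
        f"subnet {network} is configured on VLANs {sorted(vlans)}"
        for network, vlans in vlans_by_subnet.items()
        if len(vlans) > 1
    ]


if __name__ == "__main__":
    # Hypothetical settings: the host and the node port share a subnet
    # but sit on different VLAN IDs, which the rule above warns against.
    example = [
        ("host-eth0", 100, "192.168.10.5/24"),
        ("node1-port1", 200, "192.168.10.20/24"),
    ]
    for message in vlan_subnet_problems(example):
        print(message)
```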
Fibre Channel over Ethernet host-link problems
Problems attaching to the Fibre Channel over Ethernet hosts might be related to
the network, the SAN Volume Controller system, or the host.
Before you begin
If error code 705 is displayed on the node, the FC I/O port is inactive.
Fibre Channel over Ethernet uses Fibre Channel as a protocol and Ethernet as an
interconnect.
Note: For a Fibre Channel over Ethernet enabled port, an inactive port means that
either the Fibre Channel forwarder (FCF) is not seen, or the Fibre Channel over
Ethernet feature is not configured on the switch.
v Verify that the Fibre Channel over Ethernet feature is enabled on the FCF.
v Verify the remote port (switch port) properties on the FCF.
If connecting the host through a Converged Enhanced Ethernet (CEE) Switch:
v Test your connectivity between the host and CEE Switch.
v Ask the Ethernet network administrator to check the firewall and router settings.
Run lsfabric, and verify the host is seen as a remote port in the output. If the
host is not seen, in order:
v Verify that SAN Volume Controller and host get a Fibre Channel ID (FCID) on
the FCF. If unable to verify, check the VLAN configuration.
v Verify that SAN Volume Controller and host port are part of a zone and that
zone is currently in force.
v Verify the volumes are mapped to host and are online. For more information, see
lshostvdiskmap and lsvdisk in the description in the SAN Volume Controller
Information Center.
What to do next
If the problem is not resolved, verify the state of the host adapter.
v Unload and load the device driver
v Use the operating system utilities (for example, Windows Device Manager) to
verify the device driver is installed, loaded, and operating correctly.
Servicing storage systems
Storage systems that are supported for attachment to the SAN Volume Controller
system are designed with redundant components and access paths to enable
concurrent maintenance. Hosts have continuous access to their data during
component failure and replacement.
The following categories represent the types of service actions for storage systems:
v Controller code update
v Field replaceable unit (FRU) replacement
Controller code update
Ensure that you are familiar with the following guidelines for updating controller
code:
v Check to see if the SAN Volume Controller supports concurrent maintenance for
your storage system.
v Allow the storage system to coordinate the entire update process.
v If it is not possible to allow the storage system to coordinate the entire update
process, perform the following steps:
1. Reduce the storage system workload by 50%.
2. Use the configuration tools for the storage system to manually failover all
logical units (LUs) from the controller that you want to update.
3. Update the controller code.
4. Restart the controller.
5. Manually failback the LUs to their original controller.
6. Repeat for all controllers.
FRU replacement
Ensure that you are familiar with the following guidelines for replacing FRUs:
v If the component that you want to replace is directly in the host-side data path
(for example, cable, Fibre Channel port, or controller), disable the external data
paths to prepare for update. To disable external data paths, disconnect or disable
the appropriate ports on the fabric switch. The SAN Volume Controller error
recovery procedures (ERPs) reroute access over the alternate path.
v If the component that you want to replace is in the internal data path (for
example, cache, or drive) and did not completely fail, ensure that the data is
backed up before you attempt to replace the component.
v If the component that you want to replace is not in the data path, for example,
uninterruptible power supply units, fans, or batteries, the component is
generally dual-redundant and can be replaced without additional steps.
Chapter 8. Disaster recovery
Use these disaster recovery solutions for HyperSwap, Metro Mirror, Global Mirror,
and Stretched System, where access to storage is still possible after the failure of a
site.
HyperSwap
Active-active volume access is always provided while there is an up-to-date
consistent copy. If there is an out-of-date consistent copy, there is not an automatic
failover to it, nor is read-only access given to it. Use the stoprcrelationship -access
or stoprcconsistgrp -access command to make it accessible. The
relationship is then in the Idling state. After you enable access with the
stoprcrelationship -access or stoprcconsistgrp -access command, use the
startrcrelationship -primary <master/aux> or startrcconsistgrp -primary
<master/aux> command to make the relationship leave the Idling state and resume
HyperSwap replication. If you previously ran overridequorum, the
startrcrelationship or startrcconsistgrp command fails.
When you resume HyperSwap replication, consider whether you want to continue
using the out-of-date consistent copy or revert to the up-to-date copy. To identify
whether the master or auxiliary volume has access, look at the primary field that is
shown by the lsrcrelationship or lsrcconsistgrp command. To continue using
the out-of-date copy, provide that value as the argument to the -primary parameter
of the startrcrelationship or startrcconsistgrp command. To revert to the
up-to-date copy, specify the opposite value as the argument to the -primary
parameter. For example, if master is shown in the primary field of lsrcconsistgrp
for an active-active consistency group in the Idling state, to revert to the up-to-date
copy, use startrcconsistgrp -primary aux.
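The rule for choosing the -primary argument can be written as a small function. The following Python sketch is illustrative only. The function name and its parameters are hypothetical; the input is the value of the primary field that lsrcrelationship or lsrcconsistgrp shows while the relationship is in the Idling state.

```python
def primary_argument(current_primary: str, keep_out_of_date_copy: bool) -> str:
    """Return the value to pass to -primary when resuming HyperSwap replication.

    current_primary: 'master' or 'aux', as shown in the primary field.
    keep_out_of_date_copy: True to continue using the copy that currently
    has access, False to revert to the up-to-date copy.
    """
    if current_primary not in ("master", "aux"):
        raise ValueError(f"unexpected primary value: {current_primary!r}")
    if keep_out_of_date_copy:
        return current_primary
    return "aux" if current_primary == "master" else "master"


if __name__ == "__main__":
    # Matches the example in the text: primary shows master, revert to the
    # up-to-date copy, so the argument is aux.
    print(primary_argument("master", keep_out_of_date_copy=False))
```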
Metro Mirror and Global Mirror
Note: Inappropriate use of these procedures can allow host systems to make
independent modifications to both the primary and secondary copies of data. You
are responsible for ensuring that no host systems continue to use the
primary copy of the data before you enable access to the secondary copy.
In a Metro Mirror or Global Mirror configuration, a system is configured at each
site. Relationships are configured between the systems to mirror data from storage
at the primary site to storage at the secondary site. If an outage occurs at the
secondary site, the primary site continues operation without any intervention. If an
outage occurs at the primary site, it is necessary to enable access to storage at
the secondary site.
Use the stoprcrelationship -access or stoprcconsistgrp -access command to
enable access to the storage at the secondary site.
Stretched System
In a stretched system (formerly split-site) configuration, a system is configured
with half the nodes at each site and a quorum device at a third location. If an
outage occurs at either site, then the nodes at the other site access the
quorum device and continue operation without any intervention. If connectivity
between the two sites is lost, then whichever nodes access the quorum device first
continue operation. For disaster recovery purposes, a user might want to enable
access to the storage at the site that lost the race to access the quorum device.
Use the satask overridequorum command to enable access to the storage at the
secondary site. This feature is only available if the system was configured by
assigning sites to nodes and storage controllers, and changing the system topology
to stretched.
Important: If you run disaster recovery on one site and then power up the
remaining, failed site (which contained the configuration node at the time of the
disaster), the cluster on the failed site asserts itself as designed. This
starts a second, identical cluster in parallel, which can cause data corruption. To
avoid this outcome, follow these steps:
Example
1. Remove the connectivity of the nodes from the site that is experiencing the
outage.
2. Power up or recover those nodes.
3. Run the satask leavecluster -force or svctask rmnode command for all the nodes
in the cluster.
4. Bring the nodes into candidate state.
5. Connect them to the site on which the site disaster recovery feature was run.
Other configurations
To recover access to the storage in other configurations, use “Recover system
procedure” on page 253.
Chapter 9. Recovery procedures
This topic describes these recovery procedures: recover a system and back up and
restore a system configuration. This topic also contains information about
performing the node rescue.
Recover system procedure
The recover system procedure recovers the entire system if the system state is lost
from all nodes. The procedure re-creates the system by using saved configuration
data. The saved configuration data is in the active quorum disk and the latest XML
configuration backup file. The recovery might not be able to restore all volume
data. This procedure is also known as Tier 3 (T3) recovery.
CAUTION:
If the system encounters a state where no nodes are active, do not attempt to
initiate a node rescue (which the user can initiate by using the SAN Volume
Controller front panel, the service assistant GUI, or the satask rescuenode
service CLI command). STOP and contact IBM® Remote Technical Support. Initiating
this T3 recover system procedure while in this specific state can result in loss
of the XML configuration backup files.
Attention:
v Run service actions only when directed by the fix procedures. If used
inappropriately, service actions can cause loss of access to data or even data loss.
Before you attempt to recover a system, investigate the cause of the failure and
attempt to resolve those issues by using other fix procedures. Read and
understand all of the instructions before you complete any action.
v The recovery procedure can take several hours if the system uses large-capacity
devices as quorum devices.
Do not attempt the recover system procedure unless the following conditions are
met:
v All of the conditions have been met in “When to run the recover system
procedure” on page 254.
v All hardware errors are fixed. See “Fix hardware errors” on page 254.
v All nodes have candidate status. Otherwise, see step 1.
v All nodes must be at the same level of code that the system had before the
failure. If any nodes were modified or replaced, use the service assistant to
verify the levels of code, and where necessary, to reinstall the level of code so
that it matches the level that is running on the other nodes in the system. For
more information, see “Removing system information for nodes with error code
550 or error code 578 using the service assistant” on page 256.
The system recovery procedure is one of several tasks that must be completed. The
following list is an overview of the tasks and the order in which they must be
completed:
1. Preparing for system recovery
a. Review the information regarding when to run the recover system
procedure.
b. Fix your hardware errors and make sure that all nodes in the system are
shown in service assistant or in the output from sainfo lsservicenodes.
c. Remove the system information for nodes with error code 550 or error code
578 by using the service assistant, but only if the recommended user
response for these node errors has already been followed.
d. For Virtual Volumes (VVols), shut down the services for any instances of
Spectrum Control Base that are connecting to the system. Use the Spectrum
Control Base command service ibm_spectrum_control stop.
2. Running the system recovery. After you prepared the system for recovery and
met all the pre-conditions, run the system recovery.
Note: Run the procedure on one system in a fabric at a time. Do not run the
procedure on different nodes in the same system. This restriction also applies to
remote systems.
3. Completing actions to get your environment operational.
v Recovering from offline volumes by using the CLI.
v Checking your system, for example, to ensure that all mapped volumes can
be accessed by the hosts.
When to run the recover system procedure
Attempt a recover procedure only after a complete and thorough investigation of
the cause of the system failure. Attempt to resolve those issues by using other
service procedures.
Attention: If you experience failures at any time while running the recover
system procedure, call IBM remote technical support. Do not attempt
further recovery actions, because these actions might prevent support from
restoring the system to an operational status.
Certain conditions must be met before you run the recovery procedure. Use the
following items to help you determine when to run the recovery procedure:
1. All enclosures and external storage systems are powered up and can
communicate with each other.
2. Check that all nodes in the system are shown in the service assistant tool or
in the output of the service command sainfo lsservicenodes. Investigate any
missing nodes.
3. Check that no node in the system is active and that the management IP is not
accessible. If any node has active status, it is not necessary to recover the
system.
4. Resolve all hardware errors in nodes so that only node errors 578 or 550 are
present. If this is not the case, go to “Fix hardware errors.”
5. Ensure all backend storage that is administered by the system is present before
you run the recover system procedure.
6. If any nodes have been replaced, ensure that the WWNN of the replacement
node matches that of the replaced node, and that no prior system data remains
on this node.
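Conditions 2, 3, and 4 in the list above can be expressed as a simple check. The following is a minimal sketch, not a product tool: the NodeReport record and its field names are hypothetical, and the node data is assumed to have been collected by hand from the service assistant or from sainfo lsservicenodes. Conditions 1, 5, and 6 (power, backend storage, and WWNN) are not covered.

```python
from dataclasses import dataclass, field

ALLOWED_ERRORS = {550, 578}  # only these node errors may remain (condition 4)


@dataclass
class NodeReport:
    panel_name: str   # hypothetical label that identifies the node
    status: str       # node status, for example "active" or "service"
    errors: list = field(default_factory=list)  # node error codes shown


def recovery_blockers(expected_nodes, reports):
    """Return the reasons why the recover system procedure should not run yet."""
    blockers = []
    seen = {report.panel_name for report in reports}
    for name in sorted(set(expected_nodes) - seen):
        blockers.append(f"node {name} is missing; investigate it")
    for report in reports:
        if report.status.lower() == "active":
            blockers.append(
                f"node {report.panel_name} is active; recovery is not necessary"
            )
        for code in report.errors:
            if code not in ALLOWED_ERRORS:
                blockers.append(
                    f"node {report.panel_name} reports error {code}; "
                    "fix hardware errors first"
                )
    return blockers


reports = [
    NodeReport("node_a", "service", [578]),
    NodeReport("node_b", "service", [550, 578]),
]
print(recovery_blockers(["node_a", "node_b"], reports))
```

An empty list means only that these three conditions are satisfied; the remaining conditions still need to be verified manually.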
Fix hardware errors
Before running a system recovery procedure, it is important to identify and fix the
root cause of the hardware issues.
Identifying and fixing the root cause can help recover a system, if these are the
faults that are causing the system to fail. The following are common issues which
can be easily resolved:
v The node has been powered off or the power cords were unplugged.
v A 2145 UPS-1U might have failed and shut down one or more nodes because of
the failure. In general, this cause is unlikely because of the redundancy
provided by the second 2145 UPS-1U.
v Check the node status of every node that is a member of the system. Resolve all
errors.
– All nodes must either report node error 578 or show no cluster name on the
Cluster: display. These error codes indicate that the system has
lost its configuration data. If any nodes report anything other than these error
codes, do not perform a recovery. You can encounter situations where
non-configuration nodes report other node errors, such as a node error 550.
The 550 error can also indicate that a node is not able to join a system.
Note: If any of the buttons on the front panel have been pressed after these
two error codes are reported, the report for the node returns to the 578 node
error. The change in the report happens after approximately 60 seconds. Also,
if the node was rebooted or if hardware service actions were taken, the node
might show no cluster name on the Cluster: display.
– If any nodes show Node Error: 550, record the data from the second line of
the display. If the last character on the second line of the display is >, use the
right button to scroll the display to the right.
- In addition to the Node Error: 550, the second line of the display can show
a list of node front panel IDs (seven digits) that are separated by spaces.
The list can also show the WWPN/LUN ID (16 hexadecimal digits
followed by a forward slash and a decimal number).
- If the error data contains any front panel IDs, ensure that the node referred
to by that front panel ID is showing Node Error: 578. If it is not reporting
node error 578, ensure that the two nodes can communicate with each
other. Verify the SAN connectivity and restart one of the two nodes by
pressing the front panel power button twice.
- If the error data contains a WWPN/LUN ID, verify the SAN connectivity
between this node and that WWPN. Check the storage system to ensure
that the LUN referred to is online. After verifying, restart the node by
pressing the front panel power button twice.
Note: If, after resolving all these scenarios, half or more of the
nodes are reporting Node Error: 578, it is appropriate to run the recovery
procedure.
– For any nodes that are reporting a node error 550, ensure that all the missing
hardware that is identified by these errors is powered on and connected
without faults.
– If you have not been able to restart the system, and if any node other than
the current node is reporting node error 550 or 578, you must remove system
data from those nodes. This action acknowledges the data loss and puts the
nodes into the required candidate state.
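The data that is recorded from the second line of a Node Error: 550 display can be sorted into front panel IDs and WWPN/LUN IDs by using the formats described in the list above (seven digits, or 16 hexadecimal digits followed by a forward slash and a decimal number). This is a minimal sketch; the function name parse_550_data and the sample values are hypothetical.

```python
import re

PANEL_ID = re.compile(r"[0-9]{7}")                    # seven-digit front panel ID
WWPN_LUN = re.compile(r"([0-9A-Fa-f]{16})/([0-9]+)")  # 16 hex digits, "/", decimal LUN


def parse_550_data(line):
    """Split the recorded display line into panel IDs, WWPN/LUN pairs, and the rest."""
    panel_ids, wwpn_luns, unknown = [], [], []
    for token in line.split():
        lun_match = WWPN_LUN.fullmatch(token)
        if PANEL_ID.fullmatch(token):
            panel_ids.append(token)
        elif lun_match:
            wwpn_luns.append((lun_match.group(1).upper(), int(lun_match.group(2))))
        else:
            # Anything else, including a trailing ">" that means more data
            # is available by scrolling right, is returned for manual review.
            unknown.append(token)
    return panel_ids, wwpn_luns, unknown


# Hypothetical sample values, not taken from a real system.
print(parse_550_data("0123456 500507680110ABCD/3"))
```

Each front panel ID that is returned identifies a node to check for node error 578, and each WWPN/LUN pair identifies a path and LUN to verify, as described in the list.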
Removing clustered system information for nodes with error
code 550 or error code 578 using the front panel
The recovery procedure for clustered systems works only when all nodes are in
candidate status. If any nodes display error code 550 or error code 578, you must
remove their system data.
About this task
To remove clustered system information from a node with an error 550 or 578,
follow this front panel procedure:
Procedure
1. Press and release the up or down button until the Actions menu option is
displayed.
2. Press and release the select button.
3. Press and release the up or down button until the Remove Cluster? option is
displayed.
4. Press and release the select button.
5. The node displays Confirm Remove?.
6. Press and release the select button.
7. The node displays Cluster:.
Results
When all nodes show Cluster: on the top line and blank on the second line, the
nodes are in candidate status. The 550 or 578 error is removed. You can now run
the recovery procedure.
Removing system information for nodes with error code 550
or error code 578 using the service assistant
The system recovery procedure works only when all nodes are in candidate status.
If there are any nodes that display error code 550 or error code 578, you must
remove their data.
About this task
Before performing this task, ensure that you have read the introductory
information in the overall recover system procedure.
To remove system information from a node with an error 550 or 578, follow this
procedure using the service assistant:
Procedure
1. Point your browser to the service IP address of one of the nodes, for example,
https://node_service_ip_address/service/.
If you do not know the IP address or if it has not been configured, configure
the service address in one of the following ways:
v On SAN Volume Controller models 2145-CG8 and 2145-CF8 nodes, use the
front panel menu to configure a service address on the node.
v On SAN Volume Controller 2145-DH8 nodes, use the technician port to
connect to the service assistant and configure a service address on the node.
2. Log on to the service assistant.
3. Select Manage System.
4. Click Remove System Data.
5. Confirm that you want to remove the system data when prompted.
6. Remove the system data for the other nodes that display a 550 or a 578 error.
All nodes previously in this system must have a node status of Candidate and
have no errors listed against them.
7. Resolve any hardware errors until the error condition for all nodes in the
system is None.
8. Ensure that all nodes in the system display a status of candidate.
Results
When all nodes display a status of candidate and all error conditions are None,
you can run the recovery procedure.
Completing recovery procedure for clustered systems using
the front panel
Start recovery when all nodes that were members of the system are online and are
in candidate status. If there are any nodes that display error code 550 or error code
578, remove their system data to place them into candidate status. Do not run the
recovery procedure on different nodes in the same system; this restriction includes
remote clustered systems.
About this task
Attention: This service action has serious implications if not completed properly.
If at any time an error is encountered not covered by this procedure, stop and call
IBM Support.
Any one of the following categories of messages may be displayed:
v T3 successful
The volumes are online. Use the final checks to make the environment
operational; see “What to check after running the system recovery” on page 261.
v T3 incomplete
One or more of the volumes is offline because there was fast write data in the
cache. Further actions are required to bring the volumes online; see “Recovering
from offline volumes using the CLI” on page 260 for details.
v T3 failed
Call IBM Support. Do not attempt any further action.
Start the recovery procedure from any node in the system; the node must not have
participated in any other system. To receive optimal results in maintaining the I/O
group ordering, run the recovery from a node that was in I/O group 0.
Note: Each individual stage of the recovery procedure might take significant time
to complete, depending on the specific configuration.
Procedure
1. Press the up or down button until the Actions menu option is displayed; then
press Select.
2. Press the up or down button until the Recover Cluster? option is displayed,
and then press Select; the node displays Confirm Recover?.
3. Press Select; the node displays Retrieving.
After a short delay, the second line displays a sequence of progress messages
indicating the actions that are taking place; for example, Finding qdisks. The
backup files are scanned to find the most recent configuration backup data.
After the file and quorum data retrieval is complete, the node displays T3
data: on the top line.
4. Verify the date and time on the second line of the display. The time stamp
shown is the date and time of the last quorum update and must be less than 30
minutes before the failure. The time stamp format is YYYYMMDD hh:mm,
where YYYY is the year, MM is the month, DD is the day, hh is the hour, and
mm is the minute.
Attention: If the time stamp is not less than 30 minutes before the failure, call
IBM support.
5. After verifying that the time stamp is correct, press and hold the UP ARROW and
press Select.
The node displays Backup file on the top line.
6. Verify the date and time on the second line of the display. The time stamp
shown is the date and time of the last configuration backup and must be less
than 24 hours before the failure. The time stamp format is YYYYMMDD hh:mm,
where YYYY is the year, MM is the month, DD is the day, hh is the hour, and
mm is the minute.
Attention: If the time stamp is not less than 24 hours before the failure, call
IBM support.
Note: Changes made after the time of this configuration backup might not be
restored.
7. After verifying that the time stamp is correct, press and hold the UP ARROW and
press Select.
The node displays Restoring. After a short delay, the second line displays a
sequence of progress messages indicating the actions taking place; then the
software on the node restarts.
The node displays Cluster on the top line and a management IP address on the
second line. After a few moments, the node displays T3 Completing.
Note: Any system errors logged at this time might temporarily overwrite the
display; ignore the message: Cluster Error: 3025. After a short delay, the
second line displays a sequence of progress messages indicating the actions
taking place.
When each node is added to the system, the display shows Cluster: on the top
line, and the cluster (system) name on the second line.
Attention: After the last node is added to the system, there is a short delay to
allow the system to stabilize. Do not attempt to use the system. The recovery is
still in progress. Once recovery is complete, the node displays T3 Succeeded on
the top line.
8. Press Select to return the node to normal display.
Results
Recovery is complete when the node displays T3 Succeeded. Verify the
environment is operational by completing the checks provided in “What to check
after running the system recovery” on page 261.
Running system recovery using the service assistant
Start recovery when all nodes that were members of the system are online and are
in candidate status. If any nodes display error code 550 or 578, remove system
information to place them into candidate status. Do not run the recovery procedure
on different nodes in the same system; this restriction includes remote systems.
Before you begin
Note: Ensure that the web browser is not blocking pop-up windows. If it is,
progress windows cannot open.
Before you begin this procedure, read the recover system procedure introductory
information; see “Recover system procedure” on page 253.
About this task
Attention: This service action has serious implications if not completed properly.
If at any time an error is encountered not covered by this procedure, stop and call
the support center.
Run the recovery from any node in the system; the node must not have
participated in any other system.
Note: Each individual stage of the recovery procedure can take significant time to
complete, depending on the specific configuration.
Procedure
1. Point your browser to the service IP address of one of the nodes.
If you do not know the IP address or if it has not been configured, configure
the service address in one of the following ways:
v On SAN Volume Controller models 2145-CG8 and 2145-CF8 nodes, use the
front panel menu to configure a service address on the node.
v On SAN Volume Controller 2145-DH8 nodes, use the technician port to
connect to the service assistant and configure a service address on the node.
2. Log on to the service assistant.
3. Select Recover System from the navigation.
4. Follow the online instructions to complete the recovery procedure.
a. Verify the date and time of the last quorum time. The time stamp must be
less than 30 minutes before the failure. The time stamp format is
YYYYMMDD hh:mm, where YYYY is the year, MM is the month, DD is the
day, hh is the hour, and mm is the minute.
Attention: If the time stamp is not less than 30 minutes before the failure,
call the support center.
b. Verify the date and time of the last backup date. The time stamp must be
less than 24 hours before the failure. The time stamp format is YYYYMMDD
hh:mm, where YYYY is the year, MM is the month, DD is the day, hh is the
hour, and mm is the minute.
Attention: If the time stamp is not less than 24 hours before the failure,
call the support center.
Changes that are made after the time of this backup date might not be
restored.
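The two time stamp checks in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the failure time is a value that you supply, and treating a time stamp later than the failure as a problem is a choice made in this sketch, not something the procedure states.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y%m%d %H:%M"  # YYYYMMDD hh:mm, as described in the procedure


def stamp_is_recent(stamp, failure, limit):
    """True if the stamp is not after the failure and is less than limit before it."""
    age = failure - datetime.strptime(stamp, STAMP_FORMAT)
    return timedelta(0) <= age < limit


def check_recovery_stamps(quorum_stamp, backup_stamp, failure):
    """Return the reasons to call the support center; an empty list means none."""
    problems = []
    if not stamp_is_recent(quorum_stamp, failure, timedelta(minutes=30)):
        problems.append("quorum time stamp is not less than 30 minutes before the failure")
    if not stamp_is_recent(backup_stamp, failure, timedelta(hours=24)):
        problems.append("backup time stamp is not less than 24 hours before the failure")
    return problems


# Hypothetical values: the failure is assumed to have occurred at 14:00.
failure_time = datetime(2015, 11, 20, 14, 0)
print(check_recovery_stamps("20151120 13:45", "20151120 01:00", failure_time))
```

A time stamp that is exactly 30 minutes (or exactly 24 hours) before the failure is reported, because the procedure requires the time stamp to be less than the limit.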
Results
Any one of the following categories of messages might be displayed:
v T3 successful
The volumes are back online. Use the final checks to get your environment
operational again.
v T3 recovery completed with errors
One or more of the volumes are offline
because there was fast write data in the cache. To bring the volumes online, see
“Recovering from offline volumes using the CLI” for details.
v T3 failed
Call the support center. Do not attempt any further action.
Verify that the environment is operational by completing the checks that are
provided in “What to check after running the system recovery” on page 261.
If any errors are logged in the error log after the system recovery procedure
completes, use the fix procedures to resolve these errors, especially the errors that
are related to offline arrays.
If the recovery completes with offline volumes, go to “Recovering from offline
volumes using the CLI.”
Recovering from offline volumes using the CLI
If a Tier 3 recovery procedure completes with offline volumes, then it is likely that
the data that was in the write-cache of the node canisters was lost during the
failure that caused all of the node canisters to lose the block storage system cluster
state. You can use the command-line interface (CLI) to acknowledge that
data was lost from the write-cache, and bring the volume back online so that you
can attempt to deal with the data loss.
About this task
If you have run the recovery procedure but there are offline volumes, you can
complete the following steps to bring the volumes back online. Any volumes that
are offline and are not thin-provisioned (or compressed) volumes are offline
because of the loss of write-cache data during the event that led all node canisters
to lose their cluster state. Any data lost from the write-cache cannot be recovered.
These volumes might need additional recovery steps after the volume is brought
back online.
Note: If you encounter errors in the error log after running the recovery procedure
that are related to offline arrays, use the fix procedures to resolve the offline array
errors before fixing the offline volume errors.
Example
Complete the following steps to recover an offline volume after the recovery
procedure has completed:
1. Delete all IBM FlashCopy function mappings and Metro Mirror or Global
Mirror relationships that use the offline volumes.
2. Run the recovervdisk, recovervdiskbyiogrp or recovervdiskbysystem
command. (This will only bring the volume back online so that you can
attempt to deal with the data loss.)
3. Refer to “What to check after running the system recovery” for what to do
with volumes that have been corrupted by the loss of data from the
write-cache.
4. Recreate all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the volumes.
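Step 2 names three commands. The following sketch shows one way to choose among them by scope. It rests on assumptions that you should verify against the CLI reference for your code level: that recovervdisk takes a volume name or ID, that recovervdiskbyiogrp takes an I/O group name or ID, and that recovervdiskbysystem takes no object argument. The function name and the scope labels are hypothetical.

```python
def recover_command(scope, target=None):
    """Build the command line for step 2 (assumed argument forms, see lead-in).

    scope is "volume", "iogrp", or "system"; target is the volume or
    I/O group name or ID and is ignored for the "system" scope.
    """
    if scope == "system":
        return "recovervdiskbysystem"
    if scope in ("volume", "iogrp"):
        if not target:
            raise ValueError(f"scope {scope!r} needs a volume or I/O group name or ID")
        command = "recovervdisk" if scope == "volume" else "recovervdiskbyiogrp"
        return f"{command} {target}"
    raise ValueError(f"unknown scope: {scope!r}")


# Hypothetical volume name, for illustration only.
print(recover_command("volume", "vdisk7"))
```

The sketch only builds the command text. Steps 1, 3, and 4 still apply in the order shown, and the command only brings the volume back online; it does not recover the lost write-cache data.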
What to check after running the system recovery
Several tasks must be completed before you use the system.
The recovery procedure recreates the old system from the quorum data. However,
some things cannot be restored, such as cached data or system data managing
in-flight I/O. This latter loss of state affects RAID arrays managing internal
storage. The detailed map about where data is out of synchronization has been
lost, meaning that all parity information must be restored, and mirrored pairs must
be brought back into synchronization. Normally this results in either old or stale
data being used, so only writes in flight are affected. However, if the array had lost
redundancy (such as syncing, or degraded or critical RAID status) prior to the
error requiring system recovery, then the situation is more severe. Under this
situation you need to check the internal storage:
v Parity arrays will likely be syncing to restore parity; they do not have
redundancy when this operation proceeds.
v Because there is no redundancy in this process, bad blocks might have been
created where data is not accessible.
v Parity arrays could be marked as corrupt. This indicates that the extent of lost
data is wider than in-flight I/O, and in order to bring the array online, the data
loss must be acknowledged.
v RAID-6 arrays that were actually degraded prior to the system recovery might
require a full restore from backup. For this reason, it is important to have at
least a capacity match spare available.
Be aware of these differences regarding the recovered configuration:
v FlashCopy mappings are restored as “idle_or_copied” with 0% progress. Both
volumes must have been restored to their original I/O groups.
v The management ID is different. Any scripts or associated programs that refer to
the system-management ID of the clustered system (system) must be changed.
v Any FlashCopy mappings that were not in the “idle_or_copied” state with 100%
progress at the point of disaster have inconsistent data on their target disks.
These mappings must be restarted.
v Intersystem remote copy partnerships and relationships are not restored and
must be re-created manually.
v Consistency groups are not restored and must be re-created manually.
v Intrasystem remote copy relationships are restored if all dependencies were
successfully restored to their original I/O groups.
v If hardware was replaced before the recovery, the SSL certificate might not be
restored. If it is not restored, then a new self-signed certificate is generated with
a validity of 30 days. Follow the associated Directed Maintenance Procedures
(DMP) for a permanent resolution.
v The system time zone might not have been restored.
v Any Global Mirror secondary volumes on the recovered system might have
inconsistent data if there was replication I/O from the primary volume cached
on the secondary system at the point of the disaster. A full synchronization is
required when recreating and restarting these remote copy relationships.
v Immediately after the T3 recovery process runs, compressed disks do not know
the correct value of their used capacity. The disks initially set the capacity as the
entire real capacity. When I/O resumes, the capacity is shrunk down to the
correct value.
Similar behavior occurs when you use the -autoexpand option on vdisks. The
real capacity of a disk might increase slightly, caused by the same kind of
behavior that affects compressed vdisks. Again, the capacity shrinks down as
I/O to the disk is resumed.
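The FlashCopy rule in the list above (mappings that were not in the idle_or_copied state with 100% progress at the point of disaster have inconsistent target data and must be restarted) can be sketched as a filter. The record layout is hypothetical: each mapping is assumed to be a dictionary with name, status, and progress keys that describe its state at the point of disaster.

```python
def mappings_to_restart(mappings):
    """Return the names of mappings whose target data must be treated as inconsistent."""
    return [
        m["name"]
        for m in mappings
        if not (m["status"] == "idle_or_copied" and m["progress"] == 100)
    ]


# Hypothetical records, for illustration only.
before_disaster = [
    {"name": "fcmap0", "status": "idle_or_copied", "progress": 100},
    {"name": "fcmap1", "status": "copying", "progress": 42},
    {"name": "fcmap2", "status": "idle_or_copied", "progress": 0},
]
print(mappings_to_restart(before_disaster))
```

The state before the disaster has to come from your own records, because the recovered mappings are all restored as idle_or_copied with 0% progress.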
Before using the volumes, complete the following tasks:
v Start the host systems.
v Manual actions might be necessary on the hosts to trigger them to rescan for
devices. You can complete this task by disconnecting and reconnecting the Fibre
Channel cables to each host bus adapter (HBA) port.
v Verify that all mapped volumes can be accessed by the hosts.
v Run file system consistency checks.
Note: Any data that was in the SAN Volume Controller write cache at the time
of the failure is lost.
v Run the application consistency checks.
For Virtual Volumes (VVols), complete the following tasks.
v After you confirm that the T3 completed successfully, restart Spectrum Control
Base (SCB) services. Use the Spectrum Control Base command service
ibm_spectrum_control start.
v Refresh the storage system information on the SCB GUI to ensure that the
systems are in sync after the recovery.
– To complete this task, log in to the SCB GUI.
– Hover over the affected storage system, select the menu launcher, and then
select Refresh. This step repopulates the system.
– Repeat this step for all Spectrum Control Base instances.
v Rescan the storage providers from within the vSphere Web Client.
– Select vCSA > Manage > Storage Providers > select Active VP > Re-scan
icon.
For Virtual Volumes (VVols), also be aware of the following information.
FlashCopy mappings are not restored for VVols. The implications are as follows.
v The mappings that describe the VM's snapshot relationships are lost. However,
the Virtual Volumes that are associated with these snapshots still exist, and the
snapshots might still appear on the vSphere Web Client. This outcome might
have implications on your VMware back-up solution.
– Do not attempt to revert to snapshots.
– Use the vSphere Web Client to delete any snapshots for VMs on a VVol data
store to free up disk space that is being used unnecessarily.
v The targets of any outstanding 'clone' FlashCopy relationships might not
function as expected (even if the vSphere Web Client recently reported clone
operations as complete). For any VMs that are targets of recent clone
operations, complete the following tasks.
– Perform data integrity checks as is recommended for conventional volumes.
– If clones do not function as expected or show signs of corrupted data, take a
fresh clone of the source VM to ensure that data integrity is maintained.
Backing up and restoring the system configuration
You can back up and restore the configuration data for the system after
preliminary tasks are completed.
Configuration data for the system provides information about your system and the
objects that are defined in it. The backup and restore functions of the svcconfig
command can back up and restore only your configuration data for the SAN
Volume Controller system. You must regularly back up your application data by
using the appropriate backup methods.
You can maintain your configuration data for the system by completing the
following tasks:
v Backing up the configuration data
v Restoring the configuration data
v Deleting unwanted backup configuration data files
Before you back up your configuration data, the following prerequisites must be
met:
v No independent operations that change the configuration for the system can be
running while the backup command is running.
v No object name can begin with an underscore character (_).
Note:
v The default object names for controllers, I/O groups, and managed disks
(MDisks) do not restore correctly if the ID of the object is different from what is
recorded in the current configuration data file.
v All other objects with default names are renamed during the restore process. The
new names appear in the format name_r where name is the name of the object in
your system.
v Connections to iSCSI MDisks for migration purposes are not restored.
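The naming prerequisite for a backup and the renaming rule in the note can be sketched as two small helpers. Both function names are hypothetical, and the list of object names is assumed to have been gathered separately.

```python
def names_blocking_backup(object_names):
    """Return the object names that begin with an underscore character (_)."""
    return [name for name in object_names if name.startswith("_")]


def restored_default_name(name):
    """Return the name_r form that the note describes for renamed objects."""
    return f"{name}_r"


# Hypothetical object names, for illustration only.
print(names_blocking_backup(["vdisk0", "_scratch", "host_1"]))
print(restored_default_name("vdisk0"))
```

Only a leading underscore matters for the prerequisite; an underscore elsewhere in a name, as in host_1, does not violate it.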
Before you restore your configuration data, the following prerequisites must be
met:
v You have the Security Administrator role associated with your user name and
password.
v You have a copy of your backup configuration files on a server that is accessible
to the system.
v You have a backup copy of your application data that is ready to load on your
system after the restore configuration operation is complete.
v You know the current license settings for your system.
v You did not remove any hardware since the last backup of your system
configuration. If you had to replace a faulty node, the new node must use the
same worldwide node name (WWNN) as the faulty node that it replaced.
Note: You can add new hardware, but you must not remove any hardware
because the removal can cause the restore process to fail.
v No zoning changes were made on the Fibre Channel fabric that would prevent
communication between the SAN Volume Controller and any storage controllers
that are present in the configuration.
v You have at least 3 USB flash drives if encryption was enabled on the system
when its configuration was backed up. The USB flash drives are used for
generation of new keys as part of the restore process or for manually restoring
encryption if the system has fewer than 3 USB ports.
Use the following steps to determine how to achieve an ideal T4 recovery:
v Open the appropriate svc.config.backup.xml (or svc.config.cron.xml) file with a
suitable text editor or browser and navigate to the node section of the file.
v For each node entry, make a note of the value of the following properties:
IO_group_id and panel_name.
v Use the CLI sainfo lsservicenodes command and the values that you noted to
determine which nodes previously belonged to each I/O group.
Restore the system configuration from one of the nodes that was previously in
I/O group 0 (for example, a node entry that contains property name="IO_group_id"
value="0"). Add the remaining nodes, as required, in the appropriate
order based on their previous IO_group_id values.
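The note-taking in the preceding steps can be sketched as a short Python script. This is only an illustration: the element layout used here (object elements with type="node" that contain property elements with name and value attributes) is an assumption that is inferred from the property name="IO_group_id" value="0" example, and the sample node names and panel names are hypothetical. Check the layout of your own backup file before you rely on it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ASSUMPTION: hypothetical sample that mimics the "node section" of a backup
# file. The real svc.config.backup.xml layout might differ.
SAMPLE_XML = """<configuration>
  <object type="node">
    <property name="name" value="node1" />
    <property name="IO_group_id" value="0" />
    <property name="panel_name" value="150001" />
  </object>
  <object type="node">
    <property name="name" value="node2" />
    <property name="IO_group_id" value="0" />
    <property name="panel_name" value="150002" />
  </object>
  <object type="node">
    <property name="name" value="node3" />
    <property name="IO_group_id" value="1" />
    <property name="panel_name" value="150003" />
  </object>
</configuration>"""


def nodes_by_io_group(xml_text):
    """Return {IO_group_id: [panel_name, ...]} for every node entry."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for obj in root.iter("object"):
        if obj.get("type") != "node":
            continue
        # Collect the name/value pairs of this node entry.
        props = {p.get("name"): p.get("value") for p in obj.findall("property")}
        if "IO_group_id" in props and "panel_name" in props:
            groups[int(props["IO_group_id"])].append(props["panel_name"])
    # Sort by I/O group so that the nodes of I/O group 0 are listed first.
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for io_group, panels in nodes_by_io_group(SAMPLE_XML).items():
        print("I/O group", io_group, "panel names:", ", ".join(panels))
```

The panel names that are listed for I/O group 0 identify the nodes from which the restore can be started. Compare them with the sainfo lsservicenodes output.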
The SAN Volume Controller analyzes the backup configuration data file and the
system to verify that the required disk controller system nodes are available.
Before you begin, hardware recovery must be complete. The following hardware
must be operational: hosts, SAN Volume Controller nodes, internal flash drives and
expansion enclosures (if applicable), the Ethernet network, the SAN fabric, and any
external storage systems (if applicable).
Backing up the system configuration using the CLI
You can back up your configuration data using the command-line interface (CLI).
Before you begin
Before you back up your configuration data, the following prerequisites must be
met:
v No independent operations that change the configuration can be running while
the backup command is running.
v No object name can begin with an underscore character (_).
About this task
The backup feature of the svcconfig CLI command is designed to back up
information about your system configuration, such as volumes, local Metro Mirror
information, local Global Mirror information, storage pools, and nodes. All other
data that you wrote to the volumes is not backed up. Any application that uses the
volumes on the system as storage, must use the appropriate backup methods to
back up its application data.
You must regularly back up your configuration data and your application data to
avoid data loss. In particular, back up the configuration data after any
significant changes to the system configuration.
Note: The system automatically creates a backup of the configuration data each
day at 1 AM. This backup is known as a cron backup and is written to
/dumps/svc.config.cron.xml_serial# on the configuration node.
Use these instructions to generate a manual backup at any time. If a severe
failure occurs, both the configuration of the system and application data might be
lost. The backup of the configuration data can be used to restore the system
configuration to the exact state it was in before the failure. In some cases, it might
be possible to automatically recover the application data. This recovery can be
attempted with the Recover System Procedure, also known as a Tier 3 (T3)
procedure. To restore the system configuration without attempting to recover the
application data, use the Restoring the System Configuration procedure, also
known as a Tier 4 (T4) recovery. Both of these procedures require a recent backup
of the configuration data.
Complete the following steps to back up your configuration data:
Procedure
1. Use your preferred backup method to back up all of the application data that
you stored on your volumes.
2. Issue the following CLI command to back up your configuration:
svcconfig backup
The following output is an example of the messages that might be displayed
during the backup process:
CMMVC6112W io_grp io_grp1 has a default name
CMMVC6112W io_grp io_grp2 has a default name
CMMVC6112W mdisk mdisk14 ...
CMMVC6112W node node1 ...
CMMVC6112W node node2 ...
....................................................
The svcconfig backup CLI command creates three files that provide
information about the backup process and the configuration. These files are
created in the /dumps directory of the configuration node canister.
Table 79 describes the three files that are created by the backup process:
Table 79. Files created by the backup process
File name                          Description
svc.config.backup.xml_<serial#>    Contains your configuration data.
svc.config.backup.sh_<serial#>     Contains the names of the commands that
                                   were issued to create the backup of the
                                   system.
svc.config.backup.log_<serial#>    Contains details about the backup, including
                                   any reported errors or warnings.
3. Check that the svcconfig backup command completes successfully, and
examine the command output for any warnings or errors. The following output
is an example of the message that is displayed when the backup process is
successful:
CMMVC6155I SVCCONFIG processing completed successfully
If the process fails, resolve the errors, and run the command again.
4. Keep backup copies of the files outside the system to protect them against a
system hardware failure. Copy the backup files off the system to a secure
location; use either the management GUI or a secure copy (scp) command-line
utility. For example:
pscp -unsafe superuser@cluster_ip:/dumps/svc.config.backup.* /offclusterstorage/
The cluster_ip is the IP address or DNS name of the system and
offclusterstorage is the location where you want to store the backup files.
Tip: To maintain controlled access to your configuration data, copy the backup
files to a location that is password-protected.
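The check in step 3 can be sketched as a small Python helper that scans captured svcconfig backup output. It relies on the message IDs that are shown in this procedure (CMMVC6112W for a warning and CMMVC6155I for success). Treating an ID that ends in E as an error is an assumption that extends the same pattern, and the function name is hypothetical.

```python
import re

# Message IDs look like CMMVC6112W. ASSUMPTION: the trailing letter gives the
# severity (E = error, W = warning, I = information).
MESSAGE_RE = re.compile(r"^(CMMVC\d{4}([EWI]))\b")
SUCCESS_ID = "CMMVC6155I"  # "SVCCONFIG processing completed successfully"


def summarize_backup_output(output):
    """Group message lines by severity and report whether the run succeeded."""
    summary = {"errors": [], "warnings": [], "success": False}
    for raw_line in output.splitlines():
        line = raw_line.strip()
        match = MESSAGE_RE.match(line)
        if not match:
            continue  # progress dots and blank lines carry no message ID
        message_id, severity = match.group(1), match.group(2)
        if severity == "E":
            summary["errors"].append(line)
        elif severity == "W":
            summary["warnings"].append(line)
        elif message_id == SUCCESS_ID:
            summary["success"] = True
    return summary


if __name__ == "__main__":
    captured = "\n".join([
        "CMMVC6112W io_grp io_grp1 has a default name",
        "CMMVC6112W io_grp io_grp2 has a default name",
        "....................................................",
        "CMMVC6155I SVCCONFIG processing completed successfully",
    ])
    print(summarize_backup_output(captured))
```

A backup is worth copying off the system only when success is true and the errors list is empty. Review any warnings before you rely on the backup for a restore.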
Restoring the system configuration
Use this procedure only if the recover procedure failed or if the data that is
stored on the volumes is not required. This procedure is also
known as Tier 4 (T4) recovery. For directions on the recover procedure, see
“Recover system procedure” on page 253.
Before you begin
This configuration restore procedure is designed to restore information about your
configuration, such as volumes, local Metro Mirror information, local Global Mirror
information, storage pools, and nodes. The data that you wrote to the volumes is
not restored. To restore the data on the volumes, you must separately restore the
application data for each application that uses the volumes on the clustered
system as storage. Therefore, you must have a backup of this data before you
follow the configuration recovery process.
If encryption was enabled on the system when its configuration was backed up,
then at least 3 USB flash drives need to be present in the node USB ports for the
configuration restore to work. The USB flash drives do not need to contain any
keys. They are for generation of new keys as part of the restore process.
About this task
You must regularly back up your configuration data and your application data to
avoid data loss. If a system is lost after a severe failure occurs, both the
configuration for the system and the application data are lost. You must restore the system to the
exact state it was in before the failure, and then recover the application data.
During the restore process, the nodes and the storage enclosure will be restored to
the system, and then the MDisks and the array will be re-created and configured.
If there are multiple storage enclosures involved, the arrays and MDisks will be
restored on the proper enclosures based on the enclosure IDs.
Important:
v There are two phases during the restore process: prepare and execute. You must
not change the fabric or system between these two phases.
v For a SAN Volume Controller system that contains nodes with more than four
Fibre Channel ports, the system localfcportmask and partnerfcportmask
settings must be manually reapplied before you restore your data. See step 8 on
page 268.
v For a SAN Volume Controller with internal flash drives (including nodes that
are connected to expansion enclosures), all nodes must be added into the system
before you restore your data. See step 9 on page 268.
v For SAN Volume Controller systems that contain nodes that are attached to
external controllers virtualized by iSCSI, all nodes must be added into the
system before you restore your data. Additionally, the system cfgportip settings
and iSCSI storage ports must be manually reapplied before you restore your
data. See step 10 on page 268.
v For VMware vSphere Virtual Volumes (sometimes referred to as VVols)
environments, after a T4 restoration, some of the Virtual Volumes configuration
steps are already completed: metadatavdisk created, usergroup and user created,
adminlun hosts created. However, the user must then complete the last two
configuration steps manually (creating a storage container on IBM Spectrum
Control Base Edition and creating virtual machines on VMware vCenter). See
Configuring Virtual Volumes.
If you do not understand the instructions to run the CLI commands, see the
command-line interface reference information.
To restore your configuration data, follow these steps:
Procedure
1. Verify that all nodes are available as candidate nodes before you run this
recovery procedure. You must remove errors 550 or 578 to put the node in
candidate state.
2. Create a system. If possible, use the node that was originally in I/O group 0.
v For SAN Volume Controller 2145-DH8 systems, use the technician port.
v For all other earlier models, use the front panel.
3. In a supported browser, enter the IP address that you used to initialize the
system and the default superuser password (passw0rd).
4. Issue the following CLI command to ensure that only the configuration node
is online:
lsnode
The following output is an example of what is displayed:
id name  status IO_group_id IO_group_name config_node
1  node1 online 0           io_grp0       yes
5. Using the command-line interface, issue the following command to log on to
the system:
plink -i ssh_private_key_file superuser@cluster_ip
Where ssh_private_key_file is the name of the SSH private key file for the
superuser and cluster_ip is the IP address or DNS name of the system for
which you want to restore the configuration.
Note: Because the RSA host key changed, a warning message might display
when you connect to the system using SSH.
6. Identify the configuration backup file from which you want to restore.
The file can be either a local copy of the configuration backup XML file that
you saved when you backed-up the configuration or an up-to-date file on one
of the nodes.
Configuration data is automatically backed up daily at 01:00 system time on
the configuration node.
Download and check the configuration backup files on all nodes that were
previously in the system to identify the one containing the most recent
complete backup:
a. From the management GUI, click Settings > Support.
b. Click Show full log listing.
c. For each node (canister) in the system, complete the following steps:
1) Select the node to operate on from the selection box at the top of the
table.
2) Find all the files with names that match the pattern svc.config.*.xml*.
3) Double-click the files to download them to your computer.
d. If a recent configuration file is not present on this node, configure service
IP addresses for other nodes and connect to the service assistant to look
for configuration files on other nodes. For more information, see
“Service IPv4 or Service IPv6 options” on page 112.
The XML files contain a date and time that can be used to identify the most
recent backup. After you identify the backup XML file that is to be used when
you restore the system, rename the file to svc.config.backup.xml.
7. Copy onto the system the XML backup file from which you want to restore.
pscp full_path_to_identified_svc.config.file superuser@cluster_ip:/tmp/svc.config.backup.xml
8. If the system contains any nodes with a 10 GB interface adapter or a second
Fibre Channel interface adapter that is installed and non-default
localfcportmask and partnerfcportmask settings were previously configured,
then manually reconfigure these settings before you restore your data.
9. If the system uses a stretched or HyperSwap topology with nodes located at
two sites, or if the system contains any nodes with internal flash drives
(including nodes that are connected to expansion enclosures), these nodes
must be added to the system now. To add these nodes, determine the panel
name, node name, and I/O groups of any such nodes from the configuration
backup file. To add the nodes to the system, run the following command:
svctask addnode -panelname panel_name -iogrp iogrp_name_or_id -name node_name
Where panel_name is the name that is displayed on the panel, iogrp_name_or_id
is the name or ID of the I/O group to which you want to add this node, and
node_name is the name of the node.
10. If the system contains any iSCSI storage controllers, these controllers must be
detected manually now. The nodes that are connected to these controllers, the
iSCSI port IP addresses, and the iSCSI storage ports must be added to the
system before you restore your data.
a. To add these nodes, determine the panel name, node name, and I/O
groups of any such nodes from the configuration backup file. To add the
nodes to the system, run the following command:
svctask addnode -panelname panel_name -iogrp iogrp_name_or_id -name node_name
Where panel_name is the name that is displayed on the panel,
iogrp_name_or_id is the name or ID of the I/O group to which you want to
add this node, and node_name is the name of the node.
b. To restore iSCSI port IP addresses, use the cfgportip command.
1) To restore an IPv4 address, determine id (port_id), node_id, node_name,
IP_address, mask, gateway, host (0/1 stands for no/yes), remote_copy
(0/1 stands for no/yes), and storage (0/1 stands for no/yes) from the
configuration backup file, and then run the following command:
svctask cfgportip -node node_name_or_id -ip ipv4_address -gw ipv4_gw -host yes | no -r
Where node_name_or_id is the name or ID of the node, ipv4_address is
the IPv4 address of the port, and ipv4_gw is the IPv4
gateway address for the port.
2) To restore an IPv6 address, determine id (port_id), node_id, node_name,
IP_address_6, mask, gateway_6, prefix_6, host_6 (0/1 stands for
no/yes), remote_copy_6 (0/1 stands for no/yes), and storage_6 (0/1
stands for no/yes) from the configuration backup file, and then run the
following command:
svctask cfgportip -node node_name_or_id -ip_6 ipv6_address -gw_6 ipv6_gw -prefix_6 pre
Where node_name_or_id is the name or ID of the node, ipv6_address is
the IPv6 address of the port, ipv6_gw is the IPv6
gateway address for the port, and prefix is the IPv6 prefix.
Complete steps b.i and b.ii for all (previously configured) IP ports in the
node_ethernet_portip_ip sections from the backup configuration file.
c. Next, detect and add the iSCSI storage port candidates by using the
detectiscsistorageportcandidate and addiscsistorageport commands.
Make sure that you detect the iSCSI storage ports and add these ports in
exactly the same order as you see them in the configuration backup file. If
you do not follow the correct order, it might result in a T4 failure. Step c.i
must be followed by steps c.ii and c.iii. You must repeat these steps for all
the iSCSI sessions listed in the backup configuration file exactly in the
same order.
1) To detect iSCSI storage ports, determine src_port_id, IO_group_id
(optional, not required if the value is 255), target_ipv4/target_ipv6 (the
target IP that is not blank is required), iscsi_user_name (not required if
blank), iscsi_chap_secret (not required if blank), and site (not required
if blank) from the configuration backup file, and then run the following
command:
svctask detectiscsistorageportcandidate -srcportid src_port_id -iogrp IO_group_id -tar
-site site_id_or_name
Where src_port_id is the source Ethernet port ID of the configured port,
IO_group_id is the I/O group ID or name being detected,
target_ipv4/target_ipv6 is the IPv4 or IPv6 address of the target iSCSI
controller, iscsi_user_name is the target controller user name being
detected, iscsi_chap_secret is the target controller CHAP secret being
detected, and site_id_or_name is the specified ID or name of the site
being detected.
2) Match the discovered target_iscsiname with the target_iscsiname for this
particular session in the backup configuration file by running the
lsiscsistorageportcandidate command, and use the matching index
to add iSCSI storage ports in step c.iii.
Run the svcinfo lsiscsistorageportcandidate command and
determine the id field of the row whose target_iscsiname matches
the target_iscsiname from the configuration backup file. This is your
candidate_id to be used in step c.iii.
3) To add the iSCSI storage port, determine IO_group_id (optional, not
required if the value is 255), site (not required if blank),
iscsi_user_name (not required if blank in backup file), and
iscsi_chap_secret (not required if blank) from the configuration backup
file, provide the target_iscsiname_index matched in step c.ii, and then
run the following command:
addiscsistorageport -iogrp iogrp_id -username iscsi_user_name -chapsecret iscsi_chap_secr
Where iogrp_id is the I/O group ID or name that is added,
iscsi_user_name is the target controller user name being added,
iscsi_chap_secret is the target controller CHAP secret being added, and
site_id_or_name specifies the ID or name of the site being added.
11. Issue the following CLI command to compare the current configuration with
the backup configuration data file:
svcconfig restore -prepare
This CLI command creates a log file in the /tmp directory of the configuration
node. The name of the log file is svc.config.restore.prepare.log.
Note: It can take up to a minute for each 256-MDisk batch to be discovered. If
you receive error message CMMVC6200W for an MDisk after you enter this
command, all the managed disks (MDisks) might not be discovered yet. Allow
a suitable time to elapse and try the svcconfig restore -prepare command
again.
12. Issue the following command to copy the log file to another server that is
accessible to the system:
pscp superuser@cluster_ip:/tmp/svc.config.restore.prepare.log full_path_for_where_to_copy_log_files
13. Open the log file from the server where the copy is now stored.
14. Check the log file for errors.
v If you find errors, correct the condition that caused the errors and reissue
the command. You must correct all errors before you can proceed to step 15.
v If you need assistance, contact the IBM Support Center.
15. Issue the following CLI command to restore the configuration:
svcconfig restore -execute
This CLI command creates a log file in the /tmp directory of the configuration
node. The name of the log file is svc.config.restore.execute.log.
16. Issue the following command to copy the log file to another server that is
accessible to the system:
pscp superuser@cluster_ip:/tmp/svc.config.restore.execute.log full_path_for_where_to_copy_log_files
17. Open the log file from the server where the copy is now stored.
18. Check the log file to ensure that no errors or warnings occurred.
Note: You might receive a warning that states that a licensed feature is not
enabled. This message means that after the recovery process, the current
license settings do not match the previous license settings. The recovery
process continues normally and you can enter the correct license settings in
the management GUI later.
When you log in to the CLI again over SSH, you see this output:
IBM_2145:your_cluster_name:superuser>
What to do next
You can remove any unwanted configuration backup and restore files from the
/tmp directory on your configuration node by issuing the following CLI command:
svcconfig clear -all
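The addnode commands that are used in steps 9 and 10 of the restore procedure can be generated from the node details that you noted from the configuration backup file. The following Python sketch uses the command syntax that is shown in those steps. The function name, the dictionary keys, and the sample values are hypothetical. The sketch assumes that the node on which the system was created (the configuration node) is already in the system and is therefore skipped, and that the remaining nodes are added in order of their previous I/O group ID.

```python
# Command syntax as shown in steps 9 and 10 of the restore procedure.
ADDNODE_TEMPLATE = (
    "svctask addnode -panelname {panel_name} -iogrp {io_group_id} -name {name}"
)


def build_addnode_commands(nodes, config_node_panel_name):
    """Return one addnode command per remaining node, ordered by I/O group.

    nodes is a list of dictionaries with the hypothetical keys "name",
    "panel_name", and "io_group_id", taken from the configuration backup file.
    """
    # The node that was used to create the system is already a member.
    remaining = [n for n in nodes if n["panel_name"] != config_node_panel_name]
    # Add nodes in the order of their previous I/O group ID.
    remaining.sort(key=lambda n: n["io_group_id"])
    return [ADDNODE_TEMPLATE.format(**n) for n in remaining]


if __name__ == "__main__":
    noted_nodes = [
        {"name": "node3", "panel_name": "150003", "io_group_id": 1},
        {"name": "node1", "panel_name": "150001", "io_group_id": 0},
        {"name": "node2", "panel_name": "150002", "io_group_id": 0},
    ]
    for command in build_addnode_commands(noted_nodes, "150001"):
        print(command)
```

Review the generated commands against the backup file before you run them. The sketch only formats text and does not contact the system.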
Deleting backup configuration files using the CLI
You can use the command-line interface (CLI) to delete backup configuration files.
About this task
Complete the following steps to delete backup configuration files:
Procedure
1. Issue the following command to log on to the system:
plink -i ssh_private_key_file superuser@cluster_ip
where ssh_private_key_file is the name of the SSH private key file for the
superuser and cluster_ip is the IP address or DNS name of the clustered system
from which you want to delete the backup configuration files.
2. Issue the following CLI command to erase all of the files that are stored in the
/tmp directory:
svcconfig clear -all
Completing the node rescue when the node boots
If it is necessary to replace the hard disk drive or if the software on the hard disk
drive is corrupted, you can use the node rescue procedure to reinstall the SAN
Volume Controller software.
Before you begin
Similarly, if you have replaced the service controller, use the node rescue procedure
to ensure that the service controller has the correct software.
About this task
Attention: If you recently replaced both the service controller and the disk drive
as part of the same repair operation, node rescue fails.
Node rescue works by booting the operating system from the service controller
and running a program that copies all the SAN Volume Controller software from
any other node that can be found on the Fibre Channel fabric.
Attention: When running node rescue operations, run only one node rescue
operation on the same SAN, at any one time. Wait for one node rescue operation
to complete before starting another.
Perform the following steps to complete the node rescue:
Procedure
1. Ensure that the Fibre Channel cables are connected.
2. Ensure that at least one other node is connected to the Fibre Channel fabric.
3. Ensure that the SAN zoning allows a connection between at least one port of
this node and one port of another node. It is better if multiple ports can
connect. This is particularly important if the zoning is by worldwide port name
(WWPN) and you are using a new service controller. In this case, you might
need to use SAN monitoring tools to determine the WWPNs of the node. If you
need to change the zoning, remember to set it back when the service procedure
is complete.
4. Turn off the node.
5. Press and hold the left and right buttons on the front panel.
6. Press the power button.
7. Continue to hold the left and right buttons until the node-rescue-request
symbol is displayed on the front panel (Figure 65).
Figure 65. Node rescue display
Results
The node rescue request symbol displays on the front panel display until the node
starts to boot from the service controller. If the node rescue request symbol
displays for more than two minutes, go to the hardware boot MAP to resolve the
problem. When the node rescue starts, the service display shows the progress or
failure of the node rescue operation.
Note: If the recovered node was part of a clustered system, the node is now
offline. Delete the offline node from the system and then add the node back into
the system. If node recovery was used to recover a node that failed during a
software update process, it is not possible to add the node back into the system
until the code update process has completed. This can take up to four hours for an
eight-node clustered system.
Chapter 10. Understanding the medium errors and bad blocks
A storage system returns a medium error response to a host when it is unable to
successfully read a block. The SAN Volume Controller response to a host read
follows this behavior.
The volume virtualization that is provided extends the time when a medium error
is returned to a host. Because of this difference from non-virtualized systems, the
SAN Volume Controller uses the term bad blocks rather than medium errors.
The SAN Volume Controller allocates volumes from the extents that are on the
managed disks (MDisks). The MDisk can be a volume on an external storage
controller or a RAID array that is created from internal drives. In either case,
depending on the RAID level used, there is normally protection against a read
error on a single drive. However, it is still possible to get a medium error on a
read request if multiple drives have errors or if the drives are rebuilding or are
offline due to other issues.
The SAN Volume Controller provides migration facilities to move a volume from
one underlying set of physical storage to another or to replicate a volume by using
FlashCopy, Metro Mirror, or Global Mirror. In all these cases, the migrated
volume or the replicated volume returns a medium error to the host when the
logical block address that was unreadable on the original volume is read. The
system maintains tables of bad blocks to record the logical block addresses that
cannot be read. These tables are associated with the MDisks that are providing
storage for the volumes.
The dumpmdiskbadblocks command and the dumpallmdiskbadblocks command are
available to query the location of bad blocks.
Important: The dumpmdiskbadblocks command outputs only the virtual medium errors
that have been created, and not a list of the actual medium errors on MDisks or drives.
It is possible that the tables that are used to record bad block locations can fill up.
The table can fill either on an MDisk or on the system as a whole. If a table does
fill up, the migration or replication that was creating the bad block fails because it
was not possible to create an exact image of the source volume.
The system creates alerts in the event log for the following situations:
v When it detects medium errors and creates a bad block
v When the bad block tables fill up
Table 80 lists the bad block error codes.
Table 80. Bad block errors
Error code    Description
1840          The managed disk has bad blocks. On an
              external controller, this can only be a copied
              medium error.
1226          The system has failed to create a bad block
              because the MDisk already has the
              maximum number of allowed bad blocks.
1225          The system has failed to create a bad block
              because the system already has the
              maximum number of allowed bad blocks.
The recommended actions for these alerts guide you in correcting the situation.
Clear bad blocks by deallocating the volume disk extent, by deleting the volume, or
by issuing write I/O to the block. It is good practice to correct bad blocks as soon
as they are detected. This action prevents the bad block from being propagated
when the volume is replicated or migrated. It is possible, however, for the bad
block to be on part of the volume that is not used by the application. For example,
it can be in part of a database that has not been initialized. These bad blocks are
corrected when the application writes data to these areas. Before the correction
happens, the bad block records continue to use up the available bad block space.
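The relationship between the two bad block tables and the error codes in Table 80 can be illustrated with a toy Python model. This is not the system's implementation: the class name, the table limits, and the order in which the two limits are checked are all assumptions that are made for illustration. Only the three error codes and their meanings come from Table 80.

```python
from collections import Counter

# Error codes from Table 80.
ERROR_MDISK_HAS_BAD_BLOCKS = 1840  # a bad block was recorded on the MDisk
ERROR_MDISK_TABLE_FULL = 1226      # the MDisk already has the maximum number
ERROR_SYSTEM_TABLE_FULL = 1225     # the system already has the maximum number


class BadBlockTables:
    """Toy model of the per-MDisk and system-wide bad block limits.

    ASSUMPTION: the limits are hypothetical parameters, and the per-MDisk
    limit is checked before the system-wide limit.
    """

    def __init__(self, per_mdisk_limit, system_limit):
        self.per_mdisk_limit = per_mdisk_limit
        self.system_limit = system_limit
        self.per_mdisk = Counter()

    def record(self, mdisk):
        """Try to record one bad block and return the resulting error code."""
        if self.per_mdisk[mdisk] >= self.per_mdisk_limit:
            return ERROR_MDISK_TABLE_FULL
        if sum(self.per_mdisk.values()) >= self.system_limit:
            return ERROR_SYSTEM_TABLE_FULL
        self.per_mdisk[mdisk] += 1
        return ERROR_MDISK_HAS_BAD_BLOCKS


if __name__ == "__main__":
    tables = BadBlockTables(per_mdisk_limit=2, system_limit=3)
    for mdisk in ["mdisk0", "mdisk0", "mdisk0", "mdisk1", "mdisk1"]:
        print(mdisk, tables.record(mdisk))
```

In this model, a record that returns 1226 or 1225 corresponds to the case where the migration or replication that was creating the bad block fails because the table is full.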
Chapter 11. Using the maintenance analysis procedures
The maintenance analysis procedures (MAPs) inform you how to analyze a failure
that occurs with a SAN Volume Controller node.
About this task
SAN Volume Controller nodes must be configured in pairs so you can perform
concurrent maintenance.
When you service one node, the other node keeps the storage area network (SAN)
operational. With concurrent maintenance, you can remove, replace, and test all
field replaceable units (FRUs) on one node while the SAN and host systems are
powered on and doing productive work.
Note: Do not remove the power from both nodes unless you have a particular
reason or are instructed to do so. When you need to remove power, see “MAP
5350: Powering off a node” on page 302.
Procedure
v To isolate the FRUs in the failing node, complete the actions and answer the
questions given in these maintenance analysis procedures (MAPs).
v When instructed to exchange two or more FRUs in sequence:
1. Exchange the first FRU in the list for a new one.
2. Verify that the problem is solved.
3. If the problem remains:
a. Reinstall the original FRU.
b. Exchange the next FRU in the list for a new one.
4. Repeat steps 2 and 3 until either the problem is solved, or all the related
FRUs have been exchanged.
5. Complete the next action indicated by the MAP.
6. If you are using one or more MAPs because of a system error code, mark the
error as fixed in the event log after the repair, but before you verify the
repair.
Note: Start all problem determination procedures and repair procedures with
“MAP 5000: Start.”
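The FRU exchange sequence in the preceding procedure follows a simple loop, which can be sketched in Python. The function and the three callbacks are hypothetical stand-ins for the manual actions (exchanging a part, reinstalling the original part, and verifying the repair). The sketch shows the control flow only.

```python
def exchange_frus_in_sequence(frus, exchange, reinstall_original, problem_solved):
    """Exchange FRUs one at a time until the problem is solved.

    frus is the ordered list of FRUs that the MAP names. The callbacks are
    hypothetical: exchange(fru) fits a new part, reinstall_original(fru)
    puts the original part back, and problem_solved() verifies the result.
    Returns the FRU whose exchange solved the problem, or None if every
    listed FRU was exchanged without solving it.
    """
    for fru in frus:
        exchange(fru)            # exchange this FRU for a new one
        if problem_solved():     # verify that the problem is solved
            return fru
        reinstall_original(fru)  # problem remains: restore the original part
    return None


if __name__ == "__main__":
    actions = []
    state = {"new_part": None}

    def exchange(fru):
        actions.append(("exchange", fru))
        state["new_part"] = fru

    def reinstall_original(fru):
        actions.append(("reinstall", fru))

    def problem_solved():
        # In this example, only replacing "FRU B" fixes the fault.
        return state["new_part"] == "FRU B"

    fixed_by = exchange_frus_in_sequence(
        ["FRU A", "FRU B", "FRU C"], exchange, reinstall_original, problem_solved
    )
    print("Solved by exchanging:", fixed_by)
    print(actions)
```

Because each original part is reinstalled before the next FRU is exchanged, only one new part is in the node at any time, which keeps the failing FRU identifiable.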
MAP 5000: Start
MAP 5000: Start is an entry point to the maintenance analysis procedures (MAPs)
for the SAN Volume Controller.
Before you begin
Note: The service assistant interface should be used if there is no front panel
display, for example on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures.”
This MAP applies to all SAN Volume Controller models. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working with, look for the label that identifies the model type on
the front of the node.
You might be sent here for one of the following reasons:
v The fix procedures sent you here
v A problem occurred during the installation of a SAN Volume Controller
v Another MAP sent you here
v A user observed a problem that was not detected by the system
SAN Volume Controller nodes are configured in pairs. While you service one node,
you can access all the storage managed by the pair from the other node. With
concurrent maintenance, you can remove, replace, and test all FRUs on one SAN
Volume Controller while the SAN and host systems are powered on and doing
productive work.
Notes:
v Do not remove the power from both nodes unless you have a particular reason
or are instructed to do so.
v If an action in these procedures involves removing or replacing a part, use the
applicable procedure.
v If the problem persists after you complete the actions in this procedure, return to
step 1 of the MAP to try again to fix the problem.
Procedure
1. Were you sent here from a fix procedure?
NO
Go to step 2
YES
Go to step 6 on page 277
2. (from step 1)
Access the management GUI. See “Accessing the management GUI” on page 60.
3. (from step 2)
Does the management GUI start?
NO
Go to step 6 on page 277.
YES
Go to step 4.
4. (from step 3)
Is the Welcome window displayed?
NO
Go to step 6 on page 277.
YES
Go to step 5.
5. (from step 4)
Log in to the management GUI. Use the user ID and password that is
provided by the user.
Go to the Events page.
Start the fix procedure for the recommended action.
Did the fix procedures find an error that is to be fixed?
NO
Go to step 6 on page 277.
YES
Follow the fix procedures.
6. (from steps 1 on page 276, 3 on page 276, 4 on page 276, and 5 on page 276)
Is the power indicator on the front panel off? Check to see whether the
power LED on the operator-information panel is off.
NO
Go to step 7.
YES
Try to turn on the nodes. See “Using the power control for the SAN
Volume Controller node” on page 117.
Note: The uninterruptible power supply unit that supplies power to
the node might also be turned off. The uninterruptible power supply
must be turned on before the node is turned on.
SAN Volume Controller 2145-DH8 does not have an external
uninterruptible power supply unit. This system has battery modules
in its front panel instead.
If the nodes are turned on, go to step 7; otherwise, go to the
appropriate Power MAP: “MAP 5050: Power 2145-CG8 and 2145-CF8”
on page 288 or “MAP 5040: Power SAN Volume Controller 2145-DH8”
on page 283.
7. (from step 6)
Does the front panel of the node show a hardware error?
NO
Go to step 8.
YES
The service controller for the SAN Volume Controller failed. (The SAN
Volume Controller 2145-DH8 does not have a service controller.)
a. Check that the service controller that is indicating an error is
correctly installed. If it is, replace the service controller.
b. Go to “MAP 5700: Repair verification” on page 321.
8. (from step 7)
Is the operator-information panel error LED (▌1▐ in Figure 66 or ▌7▐ in
Figure 67 on page 278) illuminated or flashing? Or, is the check log LED (▌6▐
in Figure 67 on page 278) illuminated or flashing?
Figure 66. Error LED on the SAN Volume Controller models (2145-CF8 and 2145-CG8)
Figure 67 on page 278 shows the operator-information panel for the SAN
Volume Controller 2145-DH8.
▌1▐ Power-control button and power-on LED
▌2▐ Ethernet icon
▌3▐ System-locator button and LED
▌4▐ Release latch for the light path diagnostics panel
▌5▐ Ethernet activity LEDs
▌6▐ Check log LED
▌7▐ System-error LED
Note: If the node has more than 4 Ethernet ports, activity for ports 5 and above is not
indicated by the Ethernet activity LEDs on the operator-information panel.
Figure 67. SAN Volume Controller 2145-DH8 operator-information panel
NO
Go to step 9.
YES
Go to “MAP 5800: Light path” on page 322.
9. (from step 8 on page 277)
Is the hardware boot display that you see in Figure 68 displayed on the
node? (SAN Volume Controller 2145-DH8 does not have a front panel
display. For this 2145-DH8 model, are the node status LED, node fault LED,
and battery status LED that you see in Figure 69 on page 279 all off?)
Figure 68. Hardware boot display
NO
Go to step 11.
YES
Go to step 10.
10. (from step 9)
Has the hardware boot display that you see in Figure 68 displayed for more
than 3 minutes? For 2145-DH8, have the node status LED, node fault LED,
and battery status LED that you see in Figure 69 on page 279 all been off
for more than 3 minutes?
NO
Go to step 11.
YES
For 2145-DH8, go to step 23 on page 282. Otherwise:
a. Go to “MAP 5900: Hardware boot” on page 341.
b. Go to “MAP 5700: Repair verification” on page 321.
11. (from step 9)
Is Failed displayed on the top line of the front-panel display of the node?
Or, is the node fault LED (▌8▐ in Figure 69) on the front panel of a SAN
Volume Controller 2145-DH8 on?
Figure 69 shows the node fault LED.
▌7▐ Node status LED
▌8▐ Node fault LED
▌9▐ Battery status LED
Figure 69. SAN Volume Controller 2145-DH8 front panel
NO
Go to step 12.
YES
Complete these steps:
a. If the node has a front panel display, note the failure code and go
to “Boot code reference” on page 160 to follow the repair actions.
b. Access the service assistant interface via the “Technician port for
node access” on page 75 and follow the service recommendation
presented.
c. Go to “MAP 5700: Repair verification” on page 321.
12. (from step 11 on page 278)
Is Booting displayed on the top line of the front-panel display of the node?
NO
Go to step 14.
YES
Go to step 13.
13. (from step 12)
A progress bar and a boot code are displayed. If the progress bar does not
advance for more than 3 minutes, the progress is stalled.
Has the progress bar stalled?
NO
Go to step 14.
YES
a. Note the failure code and go to “Boot code reference” on page 160
to complete the repair actions.
b. Go to “MAP 5700: Repair verification” on page 321.
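The 3-minute stall rule in step 13 can be written as a small check. The following Python sketch is illustrative only: the function name and the idea of recording when the progress bar was last seen to move are assumptions made for this example, not part of the SAN Volume Controller software.

```python
from datetime import datetime, timedelta

# Threshold taken from step 13: no progress for more than 3 minutes means stalled.
STALL_LIMIT = timedelta(minutes=3)


def progress_stalled(last_advance: datetime, now: datetime) -> bool:
    """Return True if the progress bar has not advanced for more than 3 minutes.

    last_advance is the time at which the observer last saw the bar move
    (a hypothetical manual observation, not a value read from the node).
    """
    return now - last_advance > STALL_LIMIT
```

If the function returns True, note the failure code and continue with the YES branch of step 13.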
14. (from step 12 and step 13)
If you pressed any of the navigation buttons on the front panel, wait for 60
seconds to ensure that the display switched to its default display.
Is Node Error displayed on the top line of the front-panel display of the
node?
Or, is the node fault LED, which is the middle of the 3 status LEDs on the
front panel of a SAN Volume Controller 2145-DH8, on? Figure 69 on page
279 shows the node fault LED.
NO
Go to step 15.
YES
Complete these steps:
a. Note the failure code and go to “Node error code overview” on
page 160 to complete the repair actions.
b. If the node does not have a front panel display, access the service
assistant interface via the “Technician port for node access” on
page 75 and follow the service recommendation presented.
c. Go to “MAP 5700: Repair verification” on page 321.
15. (from step 14 on page 279)
Is Cluster Error displayed on the top line of the front-panel display of the
node?
NO
Go to step 16.
YES
A cluster error was detected. This error code is displayed on all the
operational nodes in the system. The fix procedures normally repair
this type of error. Follow these steps:
a. Go to “Clustered-system code overview” on page 161 to complete
the repair actions.
b. Go to “MAP 5700: Repair verification” on page 321.
16. (from step 15)
Is Powering Off, Restarting, Shutting Down, or Power Failure displayed in
the top line of the front-panel display?
NO
Go to step 18 on page 281.
YES
The progress bar moves every few seconds. Wait for the operation to
complete and then return to step 1 on page 276 in this MAP. If the
progress bar does not move for 3 minutes, press the power button and go
to step 17.
17. (from step 16)
Did the node power off?
NO
Complete the following steps:
a. Remove the power cord from the rear of the box.
b. Wait 60 seconds.
c. Replace the power cord.
d. If the node does not power on, press the power button to power on the
node and then return to step 1 on page 276 in this MAP.
YES
Complete the following steps:
a. Wait 60 seconds.
b. Press the power button to turn on the node and then return to step 1
on page 276 in this MAP.
Note: The 2145 UPS-1U turns off only when its power button is
pressed, input power has been lost for more than five minutes, or the
SAN Volume Controller node has shut it down following a reported
loss of input power.
18. (from step 16 on page 280)
Is Charging or Recovering displayed in the top line of the front-panel
display of the node?
NO
Go to step 19.
YES
v If Charging is displayed, the uninterruptible power supply battery is
not yet charged sufficiently to support the node. If Charging is
displayed for more than 2 hours, go to “MAP 5150: 2145 UPS-1U”
on page 292.
v If Recovering is displayed, the uninterruptible power supply battery
is not yet charged sufficiently to be able to support the node
immediately following a power supply failure. However, if
Recovering is displayed, the node can be used normally.
v If Recovering is displayed for more than 2 hours, go to “MAP 5150:
2145 UPS-1U” on page 292.
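The branches of step 18 can be summarized as a lookup on the front-panel text and the time that the text has been displayed. This Python sketch is a reading aid only; the function name, parameter names, and returned strings are assumptions made for this example.

```python
def ups_charge_action(display_text: str, hours_displayed: float) -> str:
    """Map the front-panel text from step 18 to the action that the step describes."""
    if display_text not in ("Charging", "Recovering"):
        # Step 18, NO branch
        return "Go to step 19"
    if hours_displayed > 2:
        # Either message displayed for more than 2 hours
        return "Go to MAP 5150: 2145 UPS-1U"
    if display_text == "Charging":
        return "Wait: battery is not yet charged sufficiently to support the node"
    # Recovering: the node can be used normally while the battery charges
    return "Node can be used normally while the battery continues to charge"
```

For example, `ups_charge_action("Recovering", 0.5)` reports that the node can be used normally, while the same text displayed for 3 hours leads to MAP 5150.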
19. (from step 18)
Is Validate WWNN? displayed on the front-panel display of the node?
NO
Go to step 20 on page 282.
YES
The node is indicating that its WWNN might need changing. The
node enters this mode when the node service controller or disk is
changed but the service procedures were not followed.
Note: Do not validate the WWNN until you read the following
information to ensure that you choose the correct value. If you choose
an incorrect value, you might find that the SAN zoning for the node is
also not correct and more than one node is using the same WWNN.
Therefore, it is important to establish the correct WWNN before you
continue.
a. Determine which WWNN you want to use.
v If the service controller was replaced, the correct value is
probably the WWNN that is stored on disk (the disk WWNN).
v If the disk was replaced, perhaps as part of a frame replacement
procedure, but was not reinitialized, the correct value is
probably the WWNN that is stored on the service controller (the
panel WWNN).
b. Select the stored WWNN that you want this node to use:
v To use the WWNN that is stored on the disk:
1) From the Validate WWNN? panel, press and release the
select button. The Disk WWNN: panel is displayed and
shows the last five digits of the WWNN that is stored on the
disk.
2) From the Disk WWNN: panel, press and release the down
button. The Use Disk WWNN? panel is displayed.
3) Press and release the select button.
v To use the WWNN that is stored on the service controller:
1) From the Validate WWNN? panel, press and release the
select button. The Disk WWNN: panel is displayed.
2) From the Disk WWNN: panel, press and release the right
button. The Panel WWNN: panel is displayed and shows the
last five numbers of the WWNN that is stored on the service
controller.
3) From the Panel WWNN: panel, press and release the down
button. The Use Panel WWNN? panel is displayed.
4) Press and release the select button.
c. After you set the WWNN, check the front-panel display:
v If the Node WWNN: panel is displayed on the front panel, the
node is now using the selected WWNN. The Node WWNN:
panel shows the last five numbers of the WWNN that you
selected.
v If the front panel shows Cluster: but does not show a system
name, you must use the recover procedure for a clustered
system to delete the node from the system and add the node
back into the system.
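The guidance in step 19a for choosing between the disk WWNN and the panel WWNN can be sketched as follows. This Python example is a hedged illustration: the function names and the part labels are assumptions, and the WWNN value in the usage note is made up. The result is only the value that step 19 calls "probably" correct, so confirm it before you validate the WWNN.

```python
def suggested_wwnn_source(replaced_part: str) -> str:
    """Return which stored WWNN is probably correct after a part replacement.

    replaced_part is "service controller" or "disk" (hypothetical labels).
    The "disk" case assumes that the disk was replaced but not reinitialized.
    """
    if replaced_part == "service controller":
        return "disk"    # use the WWNN that is stored on disk
    if replaced_part == "disk":
        return "panel"   # use the WWNN that is stored on the service controller
    raise ValueError(f"no guidance in step 19 for: {replaced_part!r}")


def last_five(wwnn: str) -> str:
    """Return the last five characters of a WWNN, as the front panel shows them."""
    return wwnn[-5:]
```

For a made-up WWNN such as `500507680100A1B2`, `last_five` returns the characters to compare against the Disk WWNN: or Panel WWNN: panel.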
20. (from step 19 on page 281)
Is there a node that is not a member of a clustered system? For SAN Volume
Controller 2145-DH8, a node is not a member of a system if its node status
LED is off or blinking. For other models, check the front-panel menu: if
Cluster: is displayed but no system name is shown, the node is not a member
of a system. (If the current language font allows a two-line display, the
name is on the second line of the front-panel display. Otherwise, you can
press the select button to display the name.)
NO
Go to step 21.
YES
The node is not a member of a system. The node might have been
deleted during a maintenance procedure and was not added back into
the system. Make sure that each I/O group in the system contains two
nodes. If an I/O group has only one node, add the node back into
that system. Then, ensure that the node is restored to the same I/O
group from which it was deleted.
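The check that each I/O group contains two nodes can be sketched as a short function. This Python example assumes that you have already written down which nodes belong to which I/O group; the dictionary layout, the function name, and the group and node names in the usage note are hypothetical.

```python
from typing import Dict, List


def io_groups_missing_a_node(io_groups: Dict[str, List[str]]) -> List[str]:
    """Return the names of I/O groups that contain exactly one node.

    io_groups maps an I/O group name to the list of node names in that group.
    """
    return sorted(name for name, nodes in io_groups.items() if len(nodes) == 1)
```

For example, with `{"io_grp0": ["node1", "node2"], "io_grp1": ["node3"]}` the function returns `["io_grp1"]`, which is the I/O group to which the deleted node must be added back.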
21. (from step 20)
Is the front-panel display unreadable?
NO
Go to step 22.
YES
Complete the following steps:
a. Check the language. The display might be set to another language.
b. If the language is set correctly, go to “MAP 5400: Front panel” on
page 307.
22. (from step 21)
No errors were detected by the SAN Volume Controller. If you suspect that
the problem that is reported by the customer is a hardware problem, follow
these tasks:
a. Complete Problem Determination procedures on your host systems, disk
controllers, and Fibre Channel switches.
b. Ask IBM remote technical support for assistance.
23. (from step 10 on page 278)
Can you access the service assistant interface through the 2145-DH8
technician port or service IP address, or use a USB flash drive to get
satask_results.html?
NO
The SAN Volume Controller software might not be running. Attach a
USB keyboard and VGA monitor to the 2145-DH8 to see if the node is
stuck booting.
YES
Go to step 24.
24. (from step 23 on page 282)
Can node error 561 be seen?
NO
Follow the recommended action for any node error that can be seen.
YES
The SAN Volume Controller software might not be able to
communicate with the battery backplane.
Check the connections between the system board and the battery
backplane. Then follow the recommended action for node error 561.
Results
If you suspect that the problem is a software problem, see “Updating the system”
documentation for details about how to update your entire SAN Volume Controller
environment.
If the problem is still not fixed, collect diagnostic information and contact IBM
Remote Technical Support.
MAP 5040: Power SAN Volume Controller 2145-DH8
It might become necessary to solve problems that are associated with power on the
SAN Volume Controller 2145-DH8.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
Power problems might be associated with any of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller node
v The power switch failed to turn on the node
v The power switch failed to turn off the node
v Another MAP sent you here
Procedure
1. Are you here because the node is not powered on?
NO
Go to step 10 on page 287.
YES
Go to step 2.
2. (from step 1)
Is the power LED on the operator-information panel continuously
illuminated? Figure 70 on page 284 shows the location of the power LED ▌1▐
on the operator-information panel.
▌1▐ Power button and power LED (green)
Figure 70. Power LED on the SAN Volume Controller 2145-DH8
NO
Go to step 3.
YES
The node is powered on correctly. Reassess the symptoms and return
to MAP 5000: Start or go to MAP 5700: Repair verification to verify
the correct operation.
3. (from step 2 on page 283)
Is the power LED on the operator-information panel flashing approximately
four times per second?
NO
Go to step 4.
YES
The node is turned off and is not ready to be turned on. Wait until the
power LED flashes at a rate of once per second, then go to step 5.
If this behavior persists for more than 3 minutes, complete the
following procedure:
a. Remove all input power from the SAN Volume Controller node by
removing the power supply from the back of the node. See
“Removing a SAN Volume Controller 2145-DH8 power supply”
when you are removing the power cords from the node.
b. Wait 1 minute and then verify that all power LEDs on the node are
extinguished.
c. Reinsert the power supply.
d. Wait for the flashing rate of the power LED to slow down to one
flash per second. Go to step 5.
e. If the power LED keeps flashing at a rate of four flashes per
second for a second time, replace the parts in the following
sequence:
v System board
Verify the repair by continuing with MAP 5700: Repair verification.
4. (from step 3)
Is the Power LED on the operator-information panel flashing once per
second?
YES
The node is in standby mode. Input power is present. Go to step 5.
NO
Go to step 6 on page 285.
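Steps 2 through 4 classify the node by the behavior of the power LED on the operator-information panel. The following Python sketch restates that classification as a lookup table; the enumeration, the function name, and the returned strings are assumptions made for this example.

```python
from enum import Enum


class PowerLed(Enum):
    SOLID = "continuously illuminated"
    FAST_FLASH = "flashing about four times per second"
    SLOW_FLASH = "flashing once per second"
    OTHER = "none of the above"


def power_led_next_step(led: PowerLed) -> str:
    """Return the outcome that steps 2 to 4 of MAP 5040 give for a power LED state."""
    outcomes = {
        PowerLed.SOLID: "Node is powered on; return to MAP 5000 or go to MAP 5700",
        PowerLed.FAST_FLASH: "Node is off and not ready; wait for one flash per second, then step 5",
        PowerLed.SLOW_FLASH: "Node is in standby; go to step 5",
        PowerLed.OTHER: "Go to step 6",
    }
    return outcomes[led]
```

The table form makes it easy to see that only the last case, where the LED shows none of the three patterns, continues to the rear-panel checks in step 6.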
5. (from step 3 and step 4)
Press Power on the operator-information panel of the node.
Is the Power LED on the operator-information panel illuminated a solid
green?
NO
Verify that the operator-information panel cable is correctly seated at
both ends.
If the node still fails to power on, replace parts in the following
sequence:
a. Operator-information panel assembly
b. System board
Verify the repair by continuing with MAP 5700: Repair verification.
YES
The power LED on the operator-information panel shows that the
node successfully powered on. Verify the correct operation by
continuing with MAP 5700: Repair verification.
6. (from step 4 on page 284)
Is the rear panel power LED on or flashing? Figure 71 shows the location of
the power LED ▌1▐ on the SAN Volume Controller 2145-DH8.
Figure 71. Power LED indicator on the rear panel of the SAN Volume Controller 2145-DH8
NO
Go to step 7.
YES
The operator-information panel is failing.
Verify that the operator-information panel cable is seated on the
system board.
If the node still fails to power on, replace parts in the following
sequence:
a. Operator-information panel assembly
b. System board
7. (from step 6)
Are the ac LED indicators on the rear of the power supply assemblies
illuminated? Figure 72 on page 286 shows the location of the ac LED ▌1▐, the
dc LED ▌2▐, and the power-supply error LED ▌3▐ on the rear of the power
supply assembly that is on the rear panel of the SAN Volume Controller
2145-DH8.
▌1▐ AC LED (green)
▌2▐ DC LED (green)
▌3▐ Power-supply error LED (yellow)
Figure 72. AC, dc, and power-supply error LED indicators on the rear panel of the SAN Volume Controller 2145-DH8
NO
Verify that the input power cable or cables are securely connected at
both ends and show no sign of damage; replace damaged cables. If
the node still fails to power on, replace the specified parts, based
on the SAN Volume Controller model type.
Replace the SAN Volume Controller 2145-DH8 parts in the following
sequence:
a. Power supply 750 W
YES
Go to step 8.
8. (from step 7 on page 285)
Is the power-supply error LED on the rear of the SAN Volume Controller
2145-DH8 power supply illuminated? Figure 72 shows the location of the
power-supply error LED ▌3▐.
YES
Replace the power supply unit.
NO
Go to step 9.
9. (from step 8)
Are the dc LED indicators on the rear of the power supply assemblies
illuminated?
NO
Replace the SAN Volume Controller 2145-DH8 parts in the following
sequence:
a. Power supply 750 W
b. System board
YES
Verify that the operator-information panel cable is correctly seated at
both ends. If the node still fails to power on, replace parts in the
following sequence:
a. Operator-information panel
b. Cable, signal
c. System board
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
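Steps 7, 8, and 9 check the ac LED, the power-supply error LED, and the dc LED in that order. The following Python sketch walks the same order and returns the first action that applies; the function name, parameter names, and returned strings are assumptions made for this example, and the part names are taken from the steps above.

```python
def power_supply_action(ac_led_on: bool, error_led_on: bool, dc_led_on: bool) -> str:
    """Follow steps 7, 8, and 9 of MAP 5040 in order and return the first action."""
    if not ac_led_on:
        # Step 7, NO branch
        return ("Check input power cables; if the node still fails to power on, "
                "replace the 750 W power supply")
    if error_led_on:
        # Step 8, YES branch
        return "Replace the power supply unit"
    if not dc_led_on:
        # Step 9, NO branch
        return "Replace in sequence: 750 W power supply, system board"
    # Step 9, YES branch
    return ("Reseat the operator-information panel cable; then replace in sequence: "
            "operator-information panel, signal cable, system board")
```

Whatever action is returned, the repair is verified afterward with MAP 5700: Repair verification.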
10. (from step 1 on page 283)
The node does not power off immediately when the power button is pressed.
When the node is fully booted, the node powers off under the control of the
SAN Volume Controller software. The power-off operation can take up to 5
minutes to complete.
Is the power LED on the operator-information panel flashing approximately
four times per second?
NO
Go to step 11.
YES
Wait for the node to power off. If the node fails to power off after 5
minutes, go to step 11.
11. (from step 10)
Attention: Turning off the node by any means other than using the
management GUI might cause a loss of data in the node cache. If you are
performing concurrent maintenance, this node must be deleted from the
system before you proceed. Ask the customer to delete the node from the
system now. If they are unable to delete the node, call your support center for
assistance before you proceed.
The node cannot be turned off either because of a software fault or a
hardware failure. Press and hold the power button. The node should turn off
within 5 seconds.
Did the node turn off?
NO
Determine whether you are using an Advanced Configuration and
Power Interface (ACPI) or a non-ACPI operating system.
If you are using a non-ACPI operating system, complete the
following steps:
a. Press Ctrl+Alt+Delete.
b. Turn off the server by pressing and holding Power for 5
seconds.
c. Restart the server.
d. If the server fails POST and pressing Power does not work,
disconnect the power cord for 20 seconds.
e. Reconnect the power cord and restart the server.
If the problem remains or if you are using an ACPI-aware
operating system, suspect the system board.
Go to step 12 on page 288.
YES
Go to step 12.
12. (from step 11 on page 287)
Press the power button to turn on the node.
Did the node turn on and boot correctly?
NO
Go to “MAP 5000: Start” on page 275 to resolve the problem.
YES
Go to step 13.
13. (from step 12)
The node probably suffered a software failure. Memory dump data might be
captured that helps resolve the problem. Call your support center for
assistance.
MAP 5050: Power 2145-CG8 and 2145-CF8
This topic helps you to solve power problems that have occurred on SAN Volume
Controller models 2145-CG8 and 2145-CF8.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You might have been sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller node
v The power switch failed to turn the node on
v The power switch failed to turn the node off
v Another MAP sent you here
About this task
Perform the following steps:
Procedure
1. Are you here because the node is not powered on?
NO
Go to step 11 on page 292.
YES
Go to step 2.
2. (from step 1)
Is the power LED on the operator-information panel continuously
illuminated? Figure 73 on page 289 shows the location of the power LED ▌1▐
on the operator-information panel.
Figure 73. Power LED on the SAN Volume Controller models 2145-CG8 or 2145-CF8
operator-information panel
NO
Go to step 3.
YES
The node is powered on correctly. Reassess the symptoms and return
to “MAP 5000: Start” on page 275 or go to “MAP 5700: Repair
verification” on page 321 to verify the correct operation.
3. (from step 2 on page 288)
Is the power LED on the operator-information panel flashing approximately
four times per second?
NO
Go to step 4.
YES
The node is turned off and is not ready to be turned on. Wait until the
power LED flashes at a rate of approximately once per second, then
go to step 5.
If this behavior persists for more than three minutes, perform the
following procedure:
a. Remove all input power from the SAN Volume Controller node by
removing the power retention brackets and the power cords from
the back of the node. See “Removing the cable-retention brackets”
to see how to remove the cable-retention brackets when removing
the power cords from the node.
b. Wait one minute and then verify that all power LEDs on the node
are extinguished.
c. Reinsert the power cords and power retention brackets.
d. Wait for the flashing rate of the power LED to slow down to one
flash per second. Go to step 5.
e. If the power LED keeps flashing at a rate of four flashes per
second for a second time, replace the parts in the following
sequence:
v System board
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
4. (from step 3)
Is the Power LED on the operator-information panel flashing approximately
once per second?
YES
The node is in standby mode. Input power is present. Go to step 5.
NO
Go to step 6 on page 290.
5. (from step 3 and step 4)
Press the power-on button on the operator-information panel of the node.
Is the Power LED on the operator-information panel illuminated a solid
green?
NO
Verify that the operator-information panel cable is correctly seated at
both ends.
If you are working on a SAN Volume Controller 2145-CG8 or
2145-CF8, and the node still fails to power on, replace parts in the
following sequence:
a. Operator-information panel assembly
b. System board
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
The power-on indicator on the operator-information panel shows that
the node has successfully powered on. Continue with “MAP 5700:
Repair verification” on page 321 to verify the correct operation.
6. (from step 4 on page 289)
Is the rear panel power LED on or flashing? Figure 74 shows the location of
the power LED ▌1▐ on the 2145-CF8 or the 2145-CG8.
Figure 74. Power LED indicator on the rear panel of the SAN Volume Controller 2145-CG8 or
2145-CF8
NO
Go to step 7.
YES
The operator-information panel is failing.
Verify that the operator-information panel cable is seated on the
system board.
If you are working on a SAN Volume Controller 2145-CG8 or
2145-CF8, and the node still fails to power on, replace parts in the
following sequence:
a. Operator-information panel assembly
b. System board
7. (from step 6)
Locate the 2145 UPS-1U that is connected to this node.
Is the 2145 UPS-1U that is powering this node turned on, and is its load
segment 2 indicator a solid green?
NO
Go to “MAP 5150: 2145 UPS-1U” on page 292.
YES
Go to step 8.
8. (from step 7 on page 290)
Are the ac LED indicators on the rear of the power supply assemblies
illuminated? Figure 75 shows the location of the ac LED ▌1▐ and the dc LED
▌2▐ on the rear of the power supply assembly that is on the rear panel of the
2145-CF8 or the 2145-CG8.
Figure 75. Power LED indicator and ac and dc indicators on the rear panel of the SAN
Volume Controller 2145-CG8 or 2145-CF8
NO
Verify that the input power cable or cables are securely connected at
both ends and show no sign of damage; otherwise, if the cable or
cables are faulty or damaged, replace them. If the node still fails to
power on, replace the specified parts based on the SAN Volume
Controller model type.
Replace the SAN Volume Controller 2145-CG8 or 2145-CF8 parts in
the following sequence:
a. Power supply 675W
YES
Go to step 9 for SAN Volume Controller 2145-CG8 or 2145-CF8
models.
Go to step 10 for all other models.
9. (from step 8)
Is the power supply error LED on the rear of the SAN Volume Controller
2145-CG8 or 2145-CF8 power supply assemblies illuminated? Figure 74 on
page 290 shows the location of the power LED ▌1▐ on the 2145-CF8 or the
2145-CG8.
YES
Replace the power supply unit.
NO
Go to step 10.
10. (from step 8 or step 9)
Are the dc LED indicators on the rear of the power supply assemblies
illuminated?
NO
Replace the SAN Volume Controller 2145-CG8 or 2145-CF8 parts in
the following sequence:
a. Power supply 675W
b. System board
YES
Verify that the operator-information panel cable is correctly seated at
both ends. If the node still fails to power on, replace parts in the
following sequence:
a. Operator-information panel
b. Cable, signal, front panel
c. System board (if the node is a SAN Volume Controller 2145-CG8 or
SAN Volume Controller 2145-CF8)
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
11. (from step 1 on page 288)
The node will not power off immediately when the power button is pressed.
When the node is fully booted, power-off is performed under the control of
the SAN Volume Controller software. The power-off operation can take up to
five minutes to complete.
Is Powering Off displayed on the front panel?
NO
Go to step 12.
YES
Wait for the node to power off. If the node fails to power off after 5
minutes, go to step 12.
12. (from step 11)
Attention: Turning off the node by any means other than using the
management GUI might cause a loss of data in the node cache. If you are
performing concurrent maintenance, this node must be deleted from the
system before you proceed. Ask the customer to delete the node from the
system now. If they are unable to delete the node, call your support center for
assistance before you proceed.
The node cannot be turned off either because of a software fault or a
hardware failure. Press and hold the power button. The node should turn off
within five seconds.
Did the node turn off?
NO
Turn off the 2145 UPS-1U that is connected to this node.
Attention: Be sure that you are turning off the correct 2145 UPS-1U.
If necessary, trace the cables back to the 2145 UPS-1U assembly.
Turning off the wrong 2145 UPS-1U might cause customer data loss.
Go to step 13.
YES
Go to step 13.
13. (from step 12)
If necessary, turn on the 2145 UPS-1U that is connected to this node and then
press the power button to turn the node on.
Did the node turn on and boot correctly?
NO
Go to “MAP 5000: Start” on page 275 to resolve the problem.
YES
Go to step 14.
14. (from step 13)
The node has probably suffered a software failure. Dump data might have
been captured that will help resolve the problem. Call your support center for
assistance.
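The power-off escalation in steps 11 and 12 of this MAP can be summarized as follows. This Python sketch is a reading aid under stated assumptions: it applies only while the node is still powered on, and the function name, parameter names, and returned strings are invented for this example.

```python
def power_off_next_action(powering_off_displayed: bool,
                          minutes_waited: float,
                          held_power_button: bool) -> str:
    """Return the next action while a 2145-CG8 or 2145-CF8 node is still powered on.

    Before forcing power off during concurrent maintenance, the node must be
    deleted from the system (see the Attention notice in step 12).
    """
    if powering_off_displayed and minutes_waited <= 5:
        # Step 11, YES branch: a normal power-off can take up to five minutes
        return "Wait for the node to power off"
    if not held_power_button:
        # Step 12: the node should turn off within five seconds
        return "Press and hold the power button (step 12)"
    # Step 12, NO branch: verify that it is the correct 2145 UPS-1U first
    return "Turn off the 2145 UPS-1U that is connected to this node"
```

The ordering reflects the intent of the procedure: each action carries more risk to cached data than the one before it, so the sketch moves to the next action only after the previous one has failed.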
MAP 5150: 2145 UPS-1U
MAP 5150: 2145 UPS-1U helps you solve problems that have occurred in the 2145
UPS-1U systems that are used on a SAN Volume Controller.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You may have been sent here for one of the following reasons:
v The system problem determination procedures sent you here
v A problem occurred during the installation of a SAN Volume Controller
v Another MAP sent you here
v A customer observed a problem that was not detected by the system problem
determination procedures
Tip: If the 2145 UPS-1U does not seem to work, ensure that the power cable is
connected properly or reseat the power cable.
About this task
Figure 76 shows the front-panel assembly of the 2145 UPS-1U.
Figure 76. 2145 UPS-1U front-panel assembly
▌1▐ Load segment 2 indicator
▌2▐ Load segment 1 indicator
▌3▐ Alarm
▌4▐ On-battery indicator
▌5▐ Overload indicator
▌6▐ Power-on indicator
▌7▐ On or off button
▌8▐ Test and alarm reset button
Table 81 identifies which status and error LEDs that display on the 2145 UPS-1U
front-panel assembly relate to the specified error conditions. It also lists the
uninterruptible power supply alert-buzzer behavior.
Table 81. 2145 UPS-1U error indicators
| [1] Load2 | [2] Load1 | [3] Alarm | [4] Battery | [5] Overload | [6] Power-on | Buzzer | Error condition |
|---|---|---|---|---|---|---|---|
| Green (see Note 1) | | | | | Green | (see Note 3) | No errors; the 2145 UPS-1U was configured by the SAN Volume Controller |
| Green | Amber (see Note 2) | | | | Green | | No errors; the 2145 UPS-1U is not yet configured by the SAN Volume Controller |
| Green | Either on or off | | Amber | | Green | Beeps for two seconds and then stops | The ac power is over or under limit. The uninterruptible power supply has switched to battery mode. |
| | | Flashing red | Flashing amber | Flashing red | Flashing green | Three beeps every ten seconds | Battery undervoltage |
| Green | Either on or off | Flashing red | | | Flashing green | Solid on | Battery overvoltage |
| | | Flashing red | Flashing amber | | Flashing green | Solid on | Output wave is abnormal when the charger is open, on battery mode |
| | | Flashing red | Flashing amber | | | Solid on | The ac-power output wave is under low limit or above high limit on battery mode |
| Green | Either on or off | | Amber | | | Beeps for four seconds and then stops | On battery (no ac power) |
| Green | Either on or off | | Flashing amber | | | Beeps for two seconds and then stops | Low battery (no ac power) |
| Green | Either on or off | | | Red | Green | Beeps for one second and then stops | Overload while on line |
| | | | Amber | Red | | Beeps for one second and then stops | Overload while on battery |
| Either on or off | Either on or off | Flashing red | | | Green | Solid on | Fan failure |
| Either on or off | Either on or off | Flashing red | Amber | | | Solid on | Battery test fail |
| | | Flashing red | | Red | | Solid on | Overload timeout |
| | | Flashing red | Amber | | Green | Solid on | Over temperature |
| | | Flashing red | Amber | Red | Green | | Output short circuit |
Notes:
1. The green Load2 LED ([1]) indicates that power is being supplied to the right pair of ac-power outlets as seen
from the rear of the 2145 UPS-1U.
2. The amber Load1 LED ([2]) indicates that power is being supplied to the left pair of ac-power outlets as seen
from the rear of the 2145 UPS-1U. These outlets are not used by the SAN Volume Controller.
This LED might be illuminated during power-on sequences, but it is typically extinguished by the SAN Volume
Controller node that is attached to the 2145 UPS-1U.
3. A blank cell indicates that the light or buzzer is off.
Procedure
1. Is the power-on indicator for the 2145 UPS-1U that is connected to the failing
SAN Volume Controller off?
NO
Go to step 3.
YES
Go to step 2.
2. (from step 1)
Are other 2145 UPS-1U units showing the power-on indicator as off?
NO
The 2145 UPS-1U might be in standby mode. This can be because the
on or off button on this 2145 UPS-1U was pressed, input power has
been missing for more than five minutes, or because the SAN Volume
Controller shut it down following a reported loss of input power. Press
and hold the on or off button until the 2145 UPS-1U power-on indicator
is illuminated (approximately five seconds). On some versions of the
2145 UPS-1U, you need a pointed device, such as a screwdriver, to
press the on or off button.
Go to step 3.
YES
Either main power is missing from the installation or a redundant
AC-power switch has failed. If the 2145 UPS-1U units are connected to
a redundant AC-power switch, go to “MAP 5320: Redundant AC
power” on page 299. Otherwise, complete these steps:
a. Restore main power to the installation.
b. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
3. (from step 1 and step 2)
Are the power-on and load segment 2 indicators for the 2145 UPS-1U
illuminated solid green, with service, on-battery, and overload indicators off?
NO
Go to step 4.
YES
The 2145 UPS-1U is no longer showing a fault. Verify the repair by
continuing with “MAP 5250: 2145 UPS-1U repair verification” on page
298.
4. (from step 3)
Is the 2145 UPS-1U on-battery indicator illuminated yellow (solid or
flashing), with service and overload indicators off?
NO
Go to step 5 on page 296.
YES
The input power supply to this 2145 UPS-1U is not working or is not
correctly connected, or the 2145 UPS-1U is receiving input power that
might be unstable or outside the specified voltage or frequency range.
(The voltage should be between 200V and 240V and the frequency
should be either 50 Hz or 60 Hz.) The SAN Volume Controller
automatically adjusts the 2145 UPS-1U voltage range. If the input
voltage has recently changed, the alarm condition might be present
until the SAN Volume Controller has adjusted the alarm setting. Power
on the SAN Volume Controller that is connected to the 2145 UPS-1U. If
the SAN Volume Controller starts, the on-battery indicator should go off
within five minutes. If the SAN Volume Controller powers off again or
if the condition persists for at least five minutes, do the following:
a. Check the input circuit protector on the 2145 UPS-1U rear panel,
and press it, if it is open.
b. If redundant ac power is used for the 2145 UPS-1U, check the
voltage and frequency at the redundant AC-power switch output
receptacle connected to this 2145 UPS-1U. If there is no power, go
to “MAP 5340: Redundant ac power verification” on page 300. If the
power is not within specification, ask the customer to resolve the
issue. If redundant ac power is not used for this uninterruptible
power supply, check the site power outlet for the 2145 UPS-1U
providing power to this SAN Volume Controller. Check the
connection, voltage, and frequency. If the power is not within
specification, ask the customer to resolve the issue.
c. If the input power is within specification and the input circuit
protector is stable, replace the field-replaceable units (FRUs) in the
following sequence:
1) 2145 UPS-1U power cord
2) 2145 UPS-1U
d. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
5. (from step 4 on page 295)
Is the 2145 UPS-1U overload indicator illuminated solid red?
NO
Go to step 6 on page 297.
YES
The 2145 UPS-1U output power requirement has exceeded the 2145
UPS-1U capacity.
a. Check that only one SAN Volume Controller node is connected to
the 2145 UPS-1U.
b. Check that no other loads are connected to the 2145 UPS-1U.
c. After ensuring that the output loading is correct, turn off the 2145
UPS-1U by pressing the on or off button until the power-on
indicator goes off. Then unplug the input power from the 2145
UPS-1U. Wait at least five seconds until all LEDs are off and restart
the 2145 UPS-1U by reconnecting it to input power and pressing the
on or off button until the 2145 UPS-1U power-on indicator is
illuminated (approximately five seconds). On some versions of the
2145 UPS-1U, you need a pointed device, such as a screwdriver, to
press the on or off button.
d. If the condition persists, replace the 2145 UPS-1U.
Note: If the condition recurs, replace the power supply or power
supplies in the node.
e. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
6. (from step 5 on page 296)
Is the 2145 UPS-1U service indicator illuminated flashing red and the
on-battery indicator illuminated solid yellow, with the power-on and
overload indicators off?
NO
Go to step 7.
YES
The 2145 UPS-1U battery might be fully discharged or faulty.
a. Check that the 2145 UPS-1U has been connected to a power outlet
for at least two hours to charge the battery. After charging the
battery, press and hold the test or alarm reset button for three
seconds; and then check the service indicator.
b. If the service indicator is still flashing, replace the 2145 UPS-1U.
c. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
7. (from step 6)
Is the 2145 UPS-1U service indicator illuminated flashing red, the on-battery
indicator illuminated solid yellow, and the power-on illuminated solid green,
with the overload indicator off?
NO
Go to step 8.
YES
The 2145 UPS-1U internal temperature is too high.
a. Turn off the 2145 UPS-1U by pressing the on or off button until the
power-on indicator goes off. Then unplug the 2145 UPS-1U. Clear
vents at the front and rear of the 2145 UPS-1U. Remove any heat
sources. Ensure the airflow around the 2145 UPS-1U is not
restricted.
b. Wait at least five minutes and restart the 2145 UPS-1U by
reconnecting to input power and pressing the on or off button until
the 2145 UPS-1U power-on indicator is illuminated (approximately
five seconds).
c. If the condition persists, replace the 2145 UPS-1U.
d. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
8. (from step 7)
Are the 2145 UPS-1U service, on-battery, overload, and power-on indicators
illuminated and flashing?
NO
The 2145 UPS-1U has an internal fault.
a. Replace the 2145 UPS-1U.
b. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
YES
The 2145 UPS-1U battery might be fully discharged or faulty.
a. Check that the 2145 UPS-1U has been connected to a power outlet
for at least two hours to charge the battery. After charging the
battery, press and hold the test or alarm reset button for three
seconds and then check the service indicator.
b. If the service indicator is still flashing, replace the 2145 UPS-1U.
c. Verify the repair by continuing with “MAP 5250: 2145 UPS-1U
repair verification” on page 298.
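The indicator checks in steps 3 through 8 can be summarized as a small decision function. The following Python sketch is illustrative only and is not part of the SAN Volume Controller software: the names UpsIndicators and diagnose, the state strings, and the returned summaries are assumptions made for this example. It starts at step 3 and returns only the likely cause; the corrective actions remain those listed in the steps above.

```python
from dataclasses import dataclass

# Hypothetical indicator states used only in this sketch.
OFF, SOLID, FLASHING = "off", "solid", "flashing"


@dataclass(frozen=True)
class UpsIndicators:
    power_on: str
    load2: str
    service: str      # the alarm (service) indicator
    on_battery: str
    overload: str


def diagnose(ind: UpsIndicators) -> str:
    """Return the likely cause by walking steps 3 to 8 of MAP 5150 in order."""
    # Step 3: power-on and load segment 2 solid, all fault indicators off.
    if (ind.power_on == SOLID and ind.load2 == SOLID and ind.service == OFF
            and ind.on_battery == OFF and ind.overload == OFF):
        return "no fault shown"
    # Step 4: on-battery lit (solid or flashing), service and overload off.
    if (ind.on_battery in (SOLID, FLASHING) and ind.service == OFF
            and ind.overload == OFF):
        return "input power problem"
    # Step 5: overload indicator solid.
    if ind.overload == SOLID:
        return "output overload"
    # Step 6: service flashing, on-battery solid, power-on and overload off.
    if (ind.service == FLASHING and ind.on_battery == SOLID
            and ind.power_on == OFF and ind.overload == OFF):
        return "battery discharged or faulty"
    # Step 7: as step 6, but with power-on solid.
    if (ind.service == FLASHING and ind.on_battery == SOLID
            and ind.power_on == SOLID and ind.overload == OFF):
        return "internal temperature too high"
    # Step 8: all four indicators flashing; anything else is an internal fault.
    if all(state == FLASHING for state in
           (ind.service, ind.on_battery, ind.overload, ind.power_on)):
        return "battery discharged or faulty"
    return "internal fault"
```

Because the checks run in the same order as the steps, the first matching condition decides the result, just as it does when the MAP is followed by hand.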
MAP 5250: 2145 UPS-1U repair verification
MAP 5250: 2145 UPS-1U repair verification helps you to verify that field
replaceable units (FRUs) that you have exchanged for new FRUs, or repair actions
that were done, have solved all the problems on the SAN Volume Controller 2145
UPS-1U.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You may have been sent here because you have performed a repair and want to
confirm that no other problems exist on the machine.
About this task
Perform the following steps:
Procedure
1. Are the power-on and load segment 2 indicators for the repaired 2145
UPS-1U illuminated solid green, with service, on-battery, and overload
indicators off?
NO
Continue with “MAP 5000: Start” on page 275.
YES
Go to step 2.
2. (from step 1)
Is the SAN Volume Controller node powered by this 2145 UPS-1U powered
on?
NO
Press power-on on the SAN Volume Controller node that is connected
to this 2145 UPS-1U and is powered off. Go to step 3.
YES
Go to step 3.
3. (from step 2)
Is the node that is connected to this 2145 UPS-1U still not powered on or
showing error codes in the front panel display?
NO
Go to step 4.
YES
Continue with “MAP 5000: Start” on page 275.
4. (from step 3)
Does the SAN Volume Controller node that is connected to this 2145 UPS-1U
show “Charging” on the front panel display?
NO
Go to step 5.
YES
Wait for the “Charging” display to finish (this might take up to two
hours). Go to step 5.
5. (from step 4)
Press and hold the test/alarm reset button on the repaired 2145 UPS-1U for
three seconds to initiate a self-test. During the test, individual indicators
illuminate as various parts of the 2145 UPS-1U are checked.
Does the 2145 UPS-1U service, on-battery, or overload indicator stay on?
NO
2145 UPS-1U repair verification has completed successfully. Continue
with “MAP 5700: Repair verification” on page 321.
YES
Continue with “MAP 5000: Start” on page 275.
MAP 5320: Redundant AC power
MAP 5320: Redundant AC power helps you solve problems that have occurred in
the redundant AC-power switches used on a SAN Volume Controller. Use this
MAP when a 2145 UPS-1U that is connected to a redundant AC-power switch does
not appear to have input power.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You might have been sent here for one of the following reasons:
• A problem occurred during the installation of a SAN Volume Controller.
• “MAP 5150: 2145 UPS-1U” on page 292 sent you here.
About this task
Perform the following steps to solve problems that have occurred in any
redundant AC-power switch:
Procedure
1. One or two 2145 UPS-1Us might be connected to the redundant AC-power
switch. Is the power-on indicator on any connected 2145 UPS-1U on?
NO
Go to step 3.
YES
The redundant AC-power switch is powered. Go to step 2.
2. (from step 1)
Measure the voltage at the redundant AC-power switch output socket
connected to the 2145 UPS-1U that is not showing power-on.
CAUTION:
Ensure that you do not remove the power cable of any powered
uninterruptible power supply units.
Is there power at the output socket?
NO
One redundant AC-power switch output is working while the other is
not. Replace the redundant AC-power switch.
CAUTION:
You might need to power-off an operational node to replace the
redundant AC-power switch assembly. If this is the case, consult with
the customer to determine a suitable time to perform the
replacement. See “MAP 5350: Powering off a node” on page 302.
After you replace the redundant AC-power switch, continue with
“MAP 5340: Redundant ac power verification” on page 300.
YES
The redundant AC-power switch is working. There is a problem with
the 2145 UPS-1U power cord or the 2145 UPS-1U. Return to the
procedure that called this MAP and continue from where you were
within that procedure. It will help you analyze the problem with the
2145 UPS-1U power cord or the 2145 UPS-1U.
3. (from step 1)
None of the used redundant AC-power switch outputs appears to have power.
Are the two input power cables for the redundant AC-power switches
correctly connected to the redundant AC-power switch and to different mains
circuits?
NO
Correctly connect the cables. Go to “MAP 5340: Redundant ac power
verification.”
YES
Verify that there is main power at both the site's power distribution
units that are providing power to this redundant AC-power switch. Go
to step 4.
4. (from step 3 on page 299)
Is power available at one or more of the site's power distribution units that are
providing power to this redundant AC-power switch?
NO
Have the customer fix the mains circuits. Return to the procedure that
called this MAP and continue from where you were within that
procedure.
YES
The redundant AC-power switch should operate in this situation.
Replace the redundant AC-power switch assembly. After you replace
the redundant AC-power switch, continue with “MAP 5340: Redundant
ac power verification.”
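As a compact summary of MAP 5320, the four questions can be expressed as one function. This Python sketch is illustrative only: the function name map_5320_action, its parameters, and the returned strings are hypothetical and simply restate the outcomes of steps 1 through 4.

```python
def map_5320_action(any_ups_powered: bool,
                    output_socket_has_power: bool = False,
                    input_cables_correct: bool = False,
                    site_power_available: bool = False) -> str:
    """Return the outcome of MAP 5320 for the given observations."""
    if any_ups_powered:
        # Steps 1 and 2: the switch is powered; test the unpowered output.
        if output_socket_has_power:
            return "check UPS power cord or UPS in the calling MAP"
        return "replace switch, then MAP 5340"
    # Step 3: no used output has power; check the two input cables.
    if not input_cables_correct:
        return "connect input cables correctly, then MAP 5340"
    # Step 4: check the site power distribution units.
    if site_power_available:
        return "replace switch, then MAP 5340"
    return "customer fixes mains circuits"
```

Parameters that a given path never consults default to False, so a caller supplies only the observations that were actually made.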
MAP 5340: Redundant ac power verification
MAP 5340: Redundant ac power verification helps you verify that a redundant
AC-power switch is functioning correctly.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You might have been sent here because you have replaced a redundant AC-power
switch or corrected the cabling of a redundant AC-power switch. You can also use
this MAP if you think a redundant AC-power switch might not be working
correctly, because it is connected to nodes that have lost power when only one ac
power circuit lost power.
In this MAP, you will be asked to confirm that power is available at the redundant
AC-power switch output sockets 1 and 2. If the redundant AC-power switch is
connected to nodes that are not powered on, use a voltage meter to confirm that
power is available.
If the redundant AC-power switch is powering nodes that are powered on (so the
nodes are operational), take some precautions before continuing with these tests.
Although you do not have to power off the nodes to conduct the test, the nodes
will power off if the redundant AC-power switch is not functioning correctly.
About this task
For each of the powered-on nodes connected to this redundant AC-power switch,
perform the following steps:
1. Use the management GUI or the command-line interface (CLI) to confirm that
the other node in the same I/O group as this node is online.
2. Use the management GUI or the CLI to confirm that all virtual disks connected
to this I/O group are online.
3. Check the redundant AC-power switch output cables to confirm that the
redundant AC-power switch is not connected to two nodes in the same I/O
group.
If any of these tests fail, correct any failures before continuing with this MAP. If
you are performing the verification using powered-on nodes, understand that
power is no longer available if the following is true:
• The on-battery indicator on the 2145 UPS-1U that connects the redundant
AC-power switch to the node lights for more than five seconds.
• The SAN Volume Controller node display shows Power Failure.
When the instructions say “remove power,” you can switch the power off if the
site power distribution unit has outputs that are individually switched; otherwise,
remove the specified redundant AC-power switch power cable from the site power
distribution unit's outlet.
Perform the following steps:
Procedure
1. Are the two site power distribution units providing power to this redundant
AC-power switch connected to different power circuits?
NO
Correct the problem and then return to this MAP.
YES
Go to step 2.
2. (from step 1)
Are both of the site power distribution units providing power to this redundant
AC-power switch powered?
NO
Correct the problem and then return to the start of this MAP.
YES
Go to step 3.
3. (from step 2)
Are the two cables that are connecting the site power distribution units to the
redundant AC-power switch connected?
NO
Correct the problem and then return to the start of this MAP.
YES
Go to step 4.
4. (from step 3)
Is there power at the redundant AC-power switch output socket 2?
NO
Go to step 8 on page 302.
YES
Go to step 5.
5. (from step 4)
Is there power at the redundant AC-power switch output socket 1?
NO
Go to step 8 on page 302.
YES
Go to step 6.
6. (from step 5)
Remove power from the Main power cable to the redundant AC-power switch.
Is there power at the redundant AC-power switch output socket 1?
NO
Go to step 8 on page 302.
YES
Go to step 7 on page 302.
7. (from step 6 on page 301)
Reconnect the Main power cable. Remove power from the Backup power cable
to the redundant AC-power switch. Is there power at the redundant AC-power
switch output socket 1?
NO
Go to step 8.
YES
Reconnect the Backup power cable. The redundant ac power
verification has been successfully completed. Continue with “MAP
5700: Repair verification” on page 321.
8. (from steps 4 on page 301, 5 on page 301, 6 on page 301, and 7)
The redundant AC-power switch has not functioned as expected. Replace the
redundant AC-power switch assembly. Return to the start of this MAP.
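The measurement sequence in steps 4 through 7 amounts to four power checks that must all succeed. The Python sketch below is a hedged illustration, not an IBM tool: the function name verify_switch and its parameters are hypothetical, and it assumes that the preconditions in steps 1 through 3 have already been met. In practice the measurements are taken one at a time, and the procedure stops at the first failure.

```python
def verify_switch(socket2_ok: bool,
                  socket1_ok: bool,
                  socket1_ok_without_main: bool,
                  socket1_ok_without_backup: bool) -> str:
    """Evaluate the four output-power measurements of MAP 5340 in order."""
    checks = [
        (4, socket2_ok),                 # output socket 2, both inputs powered
        (5, socket1_ok),                 # output socket 1, both inputs powered
        (6, socket1_ok_without_main),    # output socket 1, Main input removed
        (7, socket1_ok_without_backup),  # output socket 1, Backup input removed
    ]
    for step, has_power in checks:
        if not has_power:
            # Step 8: any failed measurement means the switch is replaced.
            return (f"step {step}: no power; replace the switch assembly "
                    "and restart MAP 5340")
    return "verified; continue with MAP 5700"
```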
MAP 5350: Powering off a node
MAP 5350: Powering off a node helps you power off a single node to perform a
service action without disrupting host access to volumes.
Before you begin
If the solution is set up correctly, powering off a single node does not normally
disrupt the operation of a SAN Volume Controller system. Normal operation
within a system has nodes in pairs called I/O groups. An I/O group continues to
handle I/O to the disks it manages with only a single node powered on. However,
performance degrades and resilience to error is reduced.
Be careful when powering off a SAN Volume Controller node to impact the system
no more than necessary. If you do not follow the procedures outlined here, your
application hosts might lose access to their data or they might lose data in the
worst case.
You can use the following preferred methods to power off a node that is a member
of a system and not offline:
1. Use the Power off option in the management GUI or in the service assistant
interface.
2. Use the CLI command stopsystem -node name.
It is preferable to use either the management GUI or the command-line interface
(CLI) to power off a node, as these methods provide a controlled handover to the
partner node and provide better resilience to other faults in the system.
Only if a node is offline or not a member of a system must you power it off using
the power button.
About this task
To provide the least disruption when powering off a node, all of the following
conditions should apply:
• The other node in the I/O group is powered on and active in the system.
• The other node in the I/O group has SAN Fibre Channel connections to all hosts
and disk controllers managed by the I/O group.
• All volumes handled by this I/O group are online.
• Host multipathing is online to the other node in the I/O group.
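The four conditions above can be treated as a checklist that reports whatever is not yet satisfied. The following Python sketch is an illustration under stated assumptions: the function unmet_conditions and its parameter names are hypothetical, and the values would have to be gathered by the person performing the service action, for example from the management GUI and the host multipathing software.

```python
from typing import List


def unmet_conditions(partner_active: bool,
                     partner_has_fc_connections: bool,
                     all_volumes_online: bool,
                     host_multipathing_online: bool) -> List[str]:
    """Return the least-disruption conditions that are not currently met."""
    checklist = [
        (partner_active,
         "the other node in the I/O group is powered on and active"),
        (partner_has_fc_connections,
         "the other node has Fibre Channel connections to all hosts "
         "and disk controllers"),
        (all_volumes_online,
         "all volumes handled by this I/O group are online"),
        (host_multipathing_online,
         "host multipathing is online to the other node"),
    ]
    return [text for ok, text in checklist if not ok]
```

An empty list means that all four conditions hold; a non-empty list names the conditions to review with the system administrator before proceeding.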
In some circumstances, the reason you power off the node might make these
conditions impossible. For instance, if you replace a failed Fibre Channel adapter,
volumes do not show an online status. Use your judgment to decide that it is safe
to proceed when a condition is not met. Always check with the system
administrator before proceeding with a power off that you know disrupts I/O
access, as the system administrator might prefer to wait for a more suitable time or
suspend host applications.
To ensure a smooth restart, a node must save data structures that it cannot recreate
to its local, internal disk drive. The amount of data the node saves to local disk can
be high, so this operation might take several minutes. Do not attempt to interrupt
the controlled power off.
Attention: The following actions do not allow the node to save data to its local
disk. Therefore, do not power off a node using the following methods:
• Removing the power cable between the node and the uninterruptible power
supply.
Normally the uninterruptible power supply provides sufficient power to allow
the write to local disk in the event of a power failure, but obviously it is unable
to provide power in this case.
• Holding down the power button on the node.
When you press and release the power button, the node indicates this to the
software so the node can write its data to local disk before the node powers off.
When you hold down the power button, the hardware interprets this as an
emergency power off indication and shuts down immediately. The hardware
does not save the data to a local disk before powering down. The emergency
power off occurs approximately four seconds after you press and hold down the
power button.
• Pressing the reset button on the light path diagnostics panel.
Important: If you power off the node and might not power it back on the same
day, follow these steps to prevent the batteries from being discharged too much
while the node is connected to power but not powered on:
1. Pull both batteries out of the node. Keep them out until you're ready to power
on the node.
2. Push the batteries in just before you press the power button to power on the
node.
If you disconnect the power from the node and might not reconnect power to it
again within the next 24 hours, follow these steps to prevent the batteries from
being discharged too much while the node is not connected to power:
1. After both power cords are disconnected from the node, pull both batteries out
of the node. This step completely turns off the battery backplane.
2. Push the batteries back in again.
Using the management GUI to power off a system
Use the management GUI to power off a system.
Procedure
To use the management GUI to power off a system, complete the following steps:
1. Launch the management GUI for the system that you are servicing.
Optionally, you can sign on to the IBM System Storage Productivity Center as
an administrator to launch the management GUI for the system that you are
servicing.
2. Select Monitoring > System.
If the nodes to power off are shown as Offline, the nodes are not participating
in the system. In such circumstances, use the power button on the offline nodes
to power off the nodes.
If the nodes to power off are shown as Online, powering off the nodes can
result in their dependent volumes also going offline:
a. Select the node and click Show Dependent Volumes.
b. Make sure the status of each volume in the I/O group is Online. You might
need to view more than one page.
If any volumes are Degraded, only one node in the I/O group is processing I/O
requests for that volume. If that node is powered off, it impacts all the hosts
that are submitting I/O requests to the degraded volume.
If any volumes are degraded and you believe that this might be because the
partner node in the I/O group has been powered off recently, wait until a
refresh of the screen shows all volumes online. All the volumes should be
online within 30 minutes of the partner node being powered off.
Note: After waiting 30 minutes, if you have a degraded volume and all of
the associated nodes and MDisks are online, contact support for assistance.
Ensure that all volumes that are used by hosts are online before continuing.
c. If possible, check that all hosts that access volumes managed by this I/O
group are able to fail over to use paths that are provided by the other node
in the group.
Perform this check using the multipathing device driver software of the host
system. Commands to use differ, depending on the multipathing device
driver being used.
If you use the System Storage Multipath Subsystem Device Driver (SDD),
the command to query paths is datapath query device.
It can take some time for the multipathing device drivers to rediscover
paths after a node is powered on. If you are unable to check on the host
that all paths to both nodes in the I/O group are available, do not power off
a node within 30 minutes of the partner node being powered on or you
might lose access to the volume.
d. If you decide that it is okay to continue with powering off the nodes, select
the node to power off and click Shut Down System.
e. Click OK. If the node that you select is the last remaining node that
provides access to a volume, for example a node that contains flash drives
with unmirrored volumes, the Shutting Down a Node-Force panel is
displayed with a list of volumes that will go offline if the node is shut
down.
f. Check that no host applications access the volumes that are going offline.
Continue with the shut down only if the loss of access to these volumes is
acceptable. To continue with shutting down the node, click Force Shutdown.
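The 30-minute guidance in step c can be written as a simple time check. This Python sketch is illustrative only: the function may_power_off, its parameters, and the constant name are hypothetical, and the timestamp of the partner node power-on is assumed to be known to the person servicing the system.

```python
from datetime import datetime, timedelta

# Time allowed for host multipathing drivers to rediscover paths (step c).
PATH_REDISCOVERY_WINDOW = timedelta(minutes=30)


def may_power_off(partner_powered_on_at: datetime,
                  now: datetime,
                  host_paths_verified: bool) -> bool:
    """Apply the step c rule for powering off a node after its partner restarts."""
    if host_paths_verified:
        # Paths to both nodes were confirmed on the hosts, so no wait is needed.
        return True
    # Otherwise wait until the partner node has been powered on for 30 minutes.
    return now - partner_powered_on_at >= PATH_REDISCOVERY_WINDOW
```

The check covers only the path-rediscovery rule; the volume status checks in steps a and b still apply.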
What to do next
During the shut down, the node saves its data structures to its local disk and
destages all write data held in cache to the SAN disks. Such processing can take
several minutes.
At the end of this processing, the system powers off.
Using the SAN Volume Controller CLI to power off a node
Use the command-line interface (CLI) to power off a node.
Procedure
1. Issue the lsnode CLI command to display a list of nodes in the system and
their properties. Find the node to shut down and write down the name of its
I/O group. Confirm that the other node in the I/O group is online.
lsnode -delim :
id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id
1:group1node1:10L3ASH:500507680100002C:online:0:io_grp0:yes:202378101C0D18D8
2:group1node2:10L3ANF:5005076801000009:online:0:io_grp0:no:202378101C0D1796
3:group2node1:10L3ASH:5005076801000001:online:1:io_grp1:no:202378101C0D18D8
4:group2node2:10L3ANF:50050768010000F4:online:1:io_grp1:no:202378101C0D1796
If the node to power off is shown as Offline, the node is not participating in
the system and is not processing I/O requests. In such circumstances, use the
power button on the node to power off the node.
If the node to power off is shown as Online, but the other node in the I/O
group is not online, powering off the node impacts all hosts that are submitting
I/O requests to the volumes that are managed by the I/O group. Ensure that
the other node in the I/O group is online before you continue.
2. Issue the lsdependentvdisks CLI command to list the volumes that are
dependent on the status of a specified node.
lsdependentvdisks group1node1
vdisk_id vdisk_name
0 vdisk0
1 vdisk1
If the node goes offline or is removed from the system, the dependent volumes
also go offline. Before taking a node offline or removing it from the system, you
can use the command to ensure that you do not lose access to any volumes.
3. If you decide that it is okay to continue powering off the node, issue the
stopsystem -node <name> CLI command to power off the node. Use the -node
parameter to avoid powering off the whole system:
stopsystem -node group1node1
Are you sure that you want to continue with the shut down? yes
Note: To shut down the node even though there are dependent volumes, add
the -force parameter to the stopsystem command. The force parameter forces
continuation of the command even though any node-dependent volumes will
be taken offline. Use the force parameter with caution; access to data on
node-dependent volumes will be lost.
During the shut down, the node saves its data structures to its local disk and
destages all write data held in the cache to the SAN disks, which can take
several minutes.
At the end of this process, the node powers off.
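The partner-node check in step 1 can be automated once the colon-delimited lsnode output has been captured. The Python sketch below is a minimal illustration, not an IBM-supplied tool: the helper names parse_lsnode and partner_is_online are hypothetical, and the sample text repeats the example output shown in step 1. How the output is collected from the system (for example, over SSH) is outside the scope of the sketch.

```python
import csv
import io

# Example output of "lsnode -delim :" as shown in step 1.
SAMPLE = """id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id
1:group1node1:10L3ASH:500507680100002C:online:0:io_grp0:yes:202378101C0D18D8
2:group1node2:10L3ANF:5005076801000009:online:0:io_grp0:no:202378101C0D1796
3:group2node1:10L3ASH:5005076801000001:online:1:io_grp1:no:202378101C0D18D8
4:group2node2:10L3ANF:50050768010000F4:online:1:io_grp1:no:202378101C0D1796
"""


def parse_lsnode(text: str) -> list:
    """Parse colon-delimited lsnode output into one dict per node."""
    return list(csv.DictReader(io.StringIO(text), delimiter=":"))


def partner_is_online(nodes: list, node_name: str) -> bool:
    """Return True if another node in the same I/O group is online."""
    target = next((n for n in nodes if n["name"] == node_name), None)
    if target is None:
        raise ValueError(f"node {node_name!r} not found in lsnode output")
    partners = [n for n in nodes
                if n["IO_group_id"] == target["IO_group_id"]
                and n["name"] != node_name]
    return any(p["status"] == "online" for p in partners)
```

For example, partner_is_online(parse_lsnode(SAMPLE), "group1node1") evaluates the status of group1node2, the other node in io_grp0.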
Using the SAN Volume Controller power control button
Do not use the power control button to power off a node unless an emergency
exists or another procedure directs you to do so.
Before you begin
With this method, you cannot check the system status from the front panel, so you
cannot tell if the power off is liable to cause excessive disruption to the system.
Instead, use the management GUI or the CLI commands, described in the previous
topics, to power off an active node.
About this task
If you must use this method, notice in Figure 77 that each model type has a power
control button ▌1▐ on the front.
Figure 77. Power control button on the SAN Volume Controller models
When you determine it is safe to do so, press and immediately release the power
button. On models other than the 2145-DH8, the front panel display changes to
display Powering Off and displays a progress bar.
Note: The 2145-DH8 is a new design that does not use a front panel display.
The 2145-CG8 or the 2145-CF8 requires that you remove a power button cover
before you can press the power button.
If you press the power button for too long, the node immediately powers down
and cannot write all data to its local disk. An extended service procedure is
required to restart the node, which involves deleting the node from the system
before adding it back.
The following graphic shows how Powering Off is displayed on the front panel of
all nodes but the 2145-DH8:
Results
The node saves its data structures to disk while powering off. The power off
process can take up to five minutes.
When a node is powered off by using the power button (or because of a power
failure), the partner node in its I/O group immediately stops using its cache for
new write data and destages any write data already in its cache to the SAN
attached disks.
The time taken by this destage depends on the speed and utilization of the disk
controllers. The time to complete is usually less than 15 minutes, but it might be
longer. The destaging cannot complete if there is data waiting to be written to a
disk that is offline.
A node that powers off and restarts while its partner node continues to process
I/O might not be able to become an active member of the I/O group immediately.
The node must wait until the partner node completes its destage of the cache.
If the partner node powers off during this period, access to the SAN storage that is
managed by this I/O group is lost. If one of the nodes in the I/O group is unable
to service any I/O, for example because the partner node in the I/O group is still
flushing its write cache, volumes that are managed by that I/O group have a
status of Degraded.
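The relationship between node availability and volume status that is described above can be summarized in a short sketch. The function name and the returned strings are illustrative assumptions and are not SAN Volume Controller output; only the Degraded status and the loss of access when both nodes are unavailable come from the text.

```python
def io_group_volume_state(nodes_servicing_io: int) -> str:
    """Summarize the state of the volumes in a two-node I/O group.

    nodes_servicing_io is the number of nodes (0, 1, or 2) that can
    currently service I/O. A restarted node that is still waiting for its
    partner to finish destaging the write cache counts as not servicing I/O.
    """
    if nodes_servicing_io not in (0, 1, 2):
        raise ValueError("an I/O group contains exactly two nodes")
    if nodes_servicing_io == 2:
        return "online"       # assumed label for the normal case
    if nodes_servicing_io == 1:
        return "degraded"     # status named in the text
    return "access lost"      # partner powered off during the destage


if __name__ == "__main__":
    for count in (2, 1, 0):
        print(count, io_group_volume_state(count))
```

The sketch makes the practical point explicit: after one node is powered off, the I/O group has no redundancy until that node rejoins, so avoid powering off the partner node during that period.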
MAP 5400: Front panel
MAP 5400: Front panel helps you to solve problems that have occurred on the
front panel.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to all SAN Volume Controller models. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working with, look for the label that identifies the model type on
the front of the node.
You might have been sent here because:
v A problem occurred during the installation of a SAN Volume Controller system,
the front-panel display test failed, or the correct node number failed to be
displayed
v Another MAP sent you here
About this task
Perform the following steps:
Procedure
1. Is the power LED on the operator-information panel illuminated and
showing a solid green?
NO
Continue with the power MAP. See “MAP 5050: Power 2145-CG8 and
2145-CF8” on page 288.
YES
Go to step 2.
2. (from step 1 on page 307)
Is the service controller error light ▌1▐ that you see in Figure 78 illuminated
and showing a solid amber?
Figure 78. SAN Volume Controller service controller error light
NO
Start the front panel tests by pressing and holding the select button for
five seconds. Go to step 3.
Attention: Do not start this test until the node is powered on for at
least two minutes. You might receive unexpected results.
YES
The SAN Volume Controller service controller has failed.
v Replace the service controller.
v Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
3. (from step 2)
The front-panel check light illuminates and the display test of all display bits
turns on for 3 seconds and then turns off for 3 seconds, then a vertical line
travels from left to right, followed by a horizontal line travelling from top to
bottom. The test completes with the switch test display of a single rectangle in
the center of the display.
Did the front-panel lights and display operate as described?
NO
SAN Volume Controller front panel has failed its display test.
v Replace the service controller.
v Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
Go to step 4.
4. (from step 3)
Figure 79 on page 309 provides four examples of what the front-panel display
shows before you press any button and then when you press the up button, the
left and right buttons, and the select button. To perform the front panel switch
test, press any button in any sequence or any combination. The display
indicates which buttons you pressed.
Figure 79. Front-panel display when push buttons are pressed
Check each switch in turn. Did the service panel switches and display operate
as described in Figure 79?
NO
The SAN Volume Controller front panel has failed its switch test.
v Replace the service controller.
v Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
Press and hold the select button for five seconds to exit the test. Go to
step 5.
5. Is the front-panel display now showing Cluster:?
NO
Continue with “MAP 5000: Start” on page 275.
YES
Keep pressing and releasing the down button until Node is displayed in
line 1 of the menu screen. Go to step 6.
6. (from step 5)
Is this MAP being used as part of the installation of a new node?
NO
Front-panel tests have completed with no fault found. Verify the repair
by continuing with “MAP 5700: Repair verification” on page 321.
YES
Go to step 7.
7. (from step 6)
Is the node number that is displayed in line 2 of the menu screen the same
as the node number that is printed on the front panel of the node?
NO
Node number stored in front-panel electronics is not the same as that
printed on the front panel.
v Replace the service controller.
v Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
Front-panel tests have completed with no fault found. Verify the repair
by continuing with “MAP 5700: Repair verification” on page 321.
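The decision flow of MAP 5400 can be sketched as a single function. This is only an illustration for following the branches; the class, field names, and result strings are hypothetical, and the answers must still be gathered by observing the node as described in the steps.

```python
from dataclasses import dataclass


@dataclass
class FrontPanelChecks:
    power_led_solid_green: bool      # step 1
    error_light_solid_amber: bool    # step 2
    display_test_passed: bool        # step 3
    switch_test_passed: bool         # step 4
    shows_cluster: bool              # step 5
    new_node_install: bool           # step 6
    node_number_matches: bool        # step 7


REPLACE = "Replace the service controller, then continue with MAP 5700"
NO_FAULT = "No fault found; continue with MAP 5700"


def map_5400(c: FrontPanelChecks) -> str:
    """Return the action that MAP 5400 reaches for a set of observations."""
    if not c.power_led_solid_green:
        return "Continue with the power MAP (MAP 5050)"
    if c.error_light_solid_amber:
        return REPLACE
    if not c.display_test_passed or not c.switch_test_passed:
        return REPLACE
    if not c.shows_cluster:
        return "Continue with MAP 5000: Start"
    if not c.new_node_install:
        return NO_FAULT
    if not c.node_number_matches:
        return REPLACE
    return NO_FAULT


if __name__ == "__main__":
    healthy = FrontPanelChecks(True, False, True, True, True, False, True)
    print(map_5400(healthy))
```

Every hardware failure that this MAP detects leads to the same part, the service controller, which is why the sketch needs only one replacement outcome.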
MAP 5500: Ethernet
MAP 5500: Ethernet helps you solve problems that have occurred on the SAN
Volume Controller Ethernet connections.
Before you begin
Note: The service assistant GUI should be used if there is no front panel display,
for example on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
If you encounter problems with the 10 Gbps Ethernet feature on the SAN Volume
Controller 2145-CG8 or SAN Volume Controller 2145-DH8, see “MAP 5550: 10G
Ethernet and Fibre Channel over Ethernet personality enabled adapter port” on
page 313.
You might have been sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller system
and the Ethernet checks failed
v Another MAP sent you here
v The customer needs immediate access to the system by using an alternate
configuration node. See “Defining an alternate configuration node” on page 313
About this task
Perform the following steps:
Procedure
1. Is the front panel of any node in the system displaying Node Error with
error code 805?
YES
Go to step 6 on page 311.
NO
Go to step 2.
2. Is the system reporting error 1400 either on the front panel or in the event
log?
YES
Go to step 4.
NO
Go to step 3.
3. Are you experiencing Ethernet performance issues?
YES
Go to step 9 on page 312.
NO
Go to step 10 on page 312.
4. (from step 2) On all nodes perform the following actions:
a. Press the down button until the top line of the display shows Ethernet.
b. Press right until the top line displays Ethernet port 1.
c. If the second line of the display shows link offline, record this port as
one that requires fixing.
d. If the system is configured with two Ethernet cables per node, press the
right button until the top line of the display shows Ethernet port 2 and
repeat the previous step.
e. Go to step 5.
5. (from step 4) Are any Ethernet ports that have cables attached to them
reporting link offline?
YES
Go to step 6 on page 311.
NO
Go to step 10 on page 312.
6. (from step 5 on page 310) Do the SAN Volume Controller nodes have one or
two cables connected?
One
Go to step 7.
Two
Go to step 8.
7. (from step 6) Perform the following actions:
a. Plug the Ethernet cable from that node into the Ethernet port 2 from a
different node, as shown in Figure 80 and Figure 81.
b. If the Ethernet link light is illuminated when the cable is plugged into
Ethernet port 2 of the other node, replace the system board of the original
node.
Figure 80. Port 2 Ethernet link LED on the SAN Volume Controller rear panel
▌1▐SAN Volume Controller 2145-CG8 port 2 (upper right) Ethernet link
LED
▌2▐SAN Volume Controller 2145-CF8 port 2 (upper right) Ethernet link
LED
Figure 81. Ethernet ports on the rear of the SAN Volume Controller 2145-DH8
▌2▐ 1 Gbps Ethernet port 2
c. If the Ethernet link light does not illuminate, check the Ethernet switch or
hub port and cable to resolve the problem.
d. Verify the repair by continuing with “MAP 5700: Repair verification” on
page 321.
8. (from step 5 on page 310 or step 6) Perform the following actions:
a. Plug the Ethernet cable from that node into another device, for example,
the SSPC.
b. If the Ethernet link light is illuminated when the cable is plugged into the
other Ethernet device, replace the system board of the original node.
c. If the Ethernet link light does not illuminate, check the Ethernet
switch/hub port and cable to resolve the problem.
d. Verify the repair by continuing with “MAP 5700: Repair verification” on
page 321.
9. (from step 3 on page 310) Perform the following actions:
a. Check all Speed port 1 and Speed port 2 panels for the speed and duplex
settings. The format is: <Speed>/<Duplex>.
1) Press the down button until the top line of the display shows Ethernet.
2) Press the right button until the top line displays Speed 1.
3) If the second line of the display shows link offline, record this port
as one that requires fixing.
4) If the system is configured with two Ethernet cables per node, press
the right button until the top line of the display shows Speed 2 and
repeat the previous step.
b. Check that the SAN Volume Controller port has negotiated at the highest
speed available on the switch. All nodes have gigabit Ethernet network
ports.
c. If the Duplex setting is half, perform the following steps:
1) There is a known problem with gigabit Ethernet when one side of the
link is set to a fixed speed and duplex and the other side is set to
autonegotiate. The problem can cause the fixed side of the link to run
at full duplex and the negotiated side of the link to run at half duplex.
The duplex mismatch can cause significant Ethernet performance
degradation.
2) If the switch is set to full duplex, set the switch to autonegotiate to
prevent the problem described previously.
3) If the switch is set to half duplex, set it to autonegotiate to allow the
link to run at the higher bandwidth available on the full duplex link.
d. If none of the above are true, call your support center for assistance.
10. (from steps 3 and 5 on page 310)
A previously reported fault with the Ethernet interface is no longer present. A
problem with the Ethernet might have been fixed, or there might be an
intermittent problem. Check with the customer to determine that the Ethernet
interface has not been intentionally disconnected. Also check that there is no
recent history of fixed Ethernet problems with other components of the
Ethernet network.
Is the Ethernet failure explained by the previous checks?
NO
There might be an intermittent Ethernet error. Perform these steps in
the following sequence until the problem is resolved:
a. Use the Ethernet hub problem determination procedure to check
for and resolve an Ethernet network connection problem. If you
resolve a problem, continue with “MAP 5700: Repair verification”
on page 321.
b. Determine if similar Ethernet connection problems have occurred
recently on this node. If they have, replace the system board.
c. Verify the repair by continuing with “MAP 5700: Repair
verification” on page 321.
YES
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
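The duplex check in step 9 can be sketched as follows. The only detail taken from this MAP is that the Speed panels use the format <Speed>/<Duplex>; the function names and the sample strings such as 1000/half are hypothetical, because the exact text that the panels show is not given here.

```python
def parse_speed_panel(text: str) -> tuple:
    """Split a '<Speed>/<Duplex>' string on its last '/'.

    Returns (speed, duplex) with the duplex value in lowercase.
    """
    speed, sep, duplex = text.strip().rpartition("/")
    if not sep or not speed or not duplex:
        raise ValueError(f"expected '<Speed>/<Duplex>', got {text!r}")
    return speed.strip(), duplex.strip().lower()


def needs_autonegotiate(panel_text: str) -> bool:
    """True when the port reports half duplex.

    Step 9c then applies: set the switch port to autonegotiate, whether
    the switch is currently fixed at full duplex or at half duplex.
    """
    _, duplex = parse_speed_panel(panel_text)
    return duplex == "half"


if __name__ == "__main__":
    for sample in ("1000/half", "1000/full"):   # hypothetical panel values
        print(sample, needs_autonegotiate(sample))
```

Both branches of step 9c lead to the same action, which is why the sketch returns a single flag rather than separate results for a switch that is fixed at full or at half duplex.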
Defining an alternate configuration node
A situation can arise where the customer needs immediate access to the system by
using an alternate configuration node.
About this task
If all Ethernet connections to the configuration node have failed, the system is
unable to report failure conditions, and the management GUI is unable to access
the system to perform administrative or service tasks. If this is the case and the
customer needs immediate access to the system, you can make the system use an
alternate configuration node by using the service assistant GUI. The service
assistant is accessed via the technician port.
Note: If the system has no front panel display such as on SAN Volume Controller
2145-DH8, use the service assistant GUI. The service assistant is accessed via the
technician port.
If only one node is displaying Node Error 805 on the front panel, perform the
following steps:
Procedure
1. Press and release the power button on the node that is displaying Node Error
805.
2. When Powering off is displayed on the front panel display, press the power
button again.
3. Restarting is displayed.
Results
The system will select a new configuration node. The management GUI is able to
access the system again.
MAP 5550: 10G Ethernet and Fibre Channel over Ethernet personality
enabled adapter port
MAP 5550: 10G Ethernet helps you solve problems that occur on a SAN Volume
Controller 2145-CG8 or SAN Volume Controller 2145-DH8 with 10G Ethernet
capability, and Fibre Channel over Ethernet personality enabled.
Before you begin
Note: The service assistant GUI might be used if there is no front panel display,
for example on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to the SAN Volume Controller 2145-CG8 and SAN Volume
Controller 2145-DH8 models with the 10G Ethernet feature installed. Be sure that
you know which model you are using before you start this procedure. To
determine which model you are working with, look for the label that identifies the
model type on the front of the node. Check that the 10G Ethernet adapter is
installed and that an optical cable is attached to each port. Figure 26 on page 28
shows the rear panel of the 2145-CG8 with the 10G Ethernet ports.
If you experience a problem with error code 805, go to “MAP 5500: Ethernet” on
page 309.
If you experience a problem with error code 703 or 723, go to “Fibre Channel and
10G Ethernet link failures” on page 247.
You might be sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller system
and the Ethernet checks failed
v Another MAP sent you here
About this task
Perform the following steps:
Procedure
1. Is node error 720 or 721 displayed on the front panel of the affected node or
is service error code 1072 shown in the event log?
YES
Go to step 11 on page 315.
NO
Go to step 2.
2. (from step 1) Perform the following actions from the front panel of the
affected node:
a. Press and release the up or down button until Ethernet is shown.
b. Press and release the left or right button until Ethernet port 3 is shown.
Was Ethernet port 3 found?
NO
Go to step 11 on page 315.
YES
Go to step 3.
3. (from step 2) Perform the following actions from the front panel of the
affected node:
a. Press and release the up or down button until Ethernet is shown.
b. Press and release the left or right button until Ethernet port 3 is shown.
c. Record whether the second line of the display shows Link offline, Link online,
or Not configured.
d. Press and release the left or right button until Ethernet port 4 is shown.
e. Record if the second line of the display shows Link offline, Link online,
or Not configured.
f. Go to step 4.
4. (from step 3) What was the state of the 10G Ethernet ports that were seen in
step 3?
Both ports show Link online
The 10G link is working now. Verify the repair by continuing with
“MAP 5700: Repair verification” on page 321.
One or more ports show Link offline
Go to step 5 on page 315.
One or more ports show Not configured
For information about the port configuration, see the CLI command
cfgportip description in the SAN Volume Controller Information
Center for iSCSI.
For Fibre Channel over Ethernet information, see the CLI command
lsportfc description in the SAN Volume Controller Information
Center. This command provides connection properties and status to
help determine whether the Fibre Channel over Ethernet is a part of a
correctly configured VLAN.
5. (from step 4 on page 314) Is the amber 10G Ethernet link LED off for the
offline port?
YES
Go to step 6.
NO
The physical link is operational. The problem might be with the
system configuration. See the configuration topic “iSCSI configuration
details” and “Fibre Channel over Ethernet configuration details” in the
SAN Volume Controller Information Center.
6. (from step 5) Perform the following actions:
a. Check that the 10G Ethernet ports are connected to a 10G Ethernet fabric.
b. Check that the 10G Ethernet fabric is configured.
c. Pull out the small form-factor pluggable (SFP) transceiver and plug it back
in.
d. Pull out the optical cable and plug it back in.
e. Clean contacts with a small blast of air, if available.
f. Go to step 7.
7. (from step 6) Did the amber link LED light?
YES
The physical link is operational. Verify the repair by continuing with
“MAP 5700: Repair verification” on page 321.
NO
Go to step 8.
8. (from step 7) Swap the 10G SFPs in port 3 and port 4, but keep the optical
cables connected to the same port.
Is the amber link LED on the other port off now?
YES
Go to step 10.
NO
Go to step 9.
9. (from step 8) Swap the 10G Ethernet optical cables in port 3 and port 4.
Observe how the amber link LED changes. Swap the cables back.
Did the amber link LED on the other port go off?
YES
Check the 10G Ethernet optical link and fabric that is connected to the
port that now has the amber LED off. The problem is associated with
the cable. The problem is either in the optical cable or the Ethernet
switch. Check that the Ethernet switch shows that the port is
operational. If it does not show that the port is operational, replace the
optical cable. Verify the repair by continuing with “MAP 5700: Repair
verification” on page 321.
NO
Go to step 11.
10. (from step 8) Perform the following actions:
a. Replace the SFP that now has the amber link LED off.
b. Verify the repair by continuing with “MAP 5700: Repair verification” on
page 321.
11. (from steps 1 on page 314, 2 on page 314, and 9) Have you already removed
and replaced the 10G Ethernet adapter?
YES
Go to step 12 on page 316.
NO
Perform the following actions:
a. Remove and replace the 10G Ethernet adapter.
b. Verify the repair by continuing with “MAP 5700: Repair
verification” on page 321.
12. (from step 11 on page 315) Replace the 10G Ethernet adapter with a new
one.
a. Replace the 10G Ethernet adapter.
b. Verify the repair by continuing with “MAP 5700: Repair verification” on
page 321.
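The fault isolation in steps 8 through 12 works by swapping one component at a time and observing whether the unlit amber link LED moves to the other port. A minimal sketch of that logic follows; the function and parameter names are hypothetical, and the returned strings paraphrase the actions in the steps.

```python
def isolate_10g_link_fault(fault_follows_sfp: bool,
                           fault_follows_cable: bool,
                           adapter_already_removed_and_replaced: bool) -> str:
    """Return the next action for a 10G port whose amber link LED stays off.

    fault_follows_sfp:   after the SFPs in port 3 and port 4 are swapped
                         (cables unchanged), the LED on the other port is off.
    fault_follows_cable: after the optical cables are swapped, the LED on
                         the other port goes off.
    """
    if fault_follows_sfp:
        # Step 10: the SFP carried the fault to the other port.
        return "Replace the SFP that now has the amber link LED off"
    if fault_follows_cable:
        # Step 9, YES: the fault is in the optical cable or Ethernet switch.
        return "Check the Ethernet switch port; replace the optical cable if needed"
    if adapter_already_removed_and_replaced:
        # Step 12.
        return "Replace the 10G Ethernet adapter with a new one"
    # Step 11, NO.
    return "Remove and replace the 10G Ethernet adapter"


if __name__ == "__main__":
    print(isolate_10g_link_fault(False, True, False))
```

The order of the checks matters: the SFP swap is evaluated first, so the cable swap is considered only when the fault did not follow the SFP.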
MAP 5600: Fibre Channel
MAP 5600: Fibre Channel helps you to solve problems that occur on the SAN
Volume Controller Fibre Channel ports.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to all SAN Volume Controller models. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working with, look for the label that identifies the model type on
the front of the node.
Note: Use the service assistant GUI if there is no front panel display, for example
on the SAN Volume Controller 2145-DH8 where the service assistant GUI can be
accessed via the Technician port.
You might be sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller system
and the Fibre Channel checks failed
v Another MAP sent you here
About this task
Complete the following steps to solve problems that are caused by the Fibre
Channel ports:
Procedure
1. Are you trying to resolve a Fibre Channel port speed problem?
NO
Go to step 2.
YES
Go to step 11 on page 320.
2. (from step 1) Display the Fibre Channel port 1 status on the front panel
display or the service assistant GUI. For more information, see Chapter 6,
“Using the front panel of the SAN Volume Controller,” on page 91.
Is the front panel display or the service assistant GUI on the SAN Volume
Controller showing Fibre Channel port-1 active?
NO
A Fibre Channel port is not working correctly. Check the port status
on the second line of the front panel display or the service assistant
GUI.
v Inactive: The port is operational but cannot access the Fibre
Channel fabric. Possible causes: the Fibre Channel adapter is not
configured correctly; the Fibre Channel small form-factor
pluggable (SFP) transceiver failed; the Fibre Channel cable either
failed or is not installed; or the device at the other end of the
cable failed.
Make a note of port-1. Go to step 7 on page 319.
v Failed: The port is not operational because of a hardware failure.
Make a note of port-1. Go to step 9 on page 319.
v Not installed: This port is not installed. Make a note of port-1. Go
to step 10 on page 320.
YES
Press and release the right button to display Fibre Channel port-2. Go
to step 3.
3. (from step 2 on page 316)
Is the front panel display or the service assistant GUI on the SAN Volume
Controller showing Fibre Channel port-2 active?
NO
A Fibre Channel port is not working correctly. Check the port status.
v Inactive: The port is operational but cannot access the Fibre
Channel fabric. Possible causes: the Fibre Channel adapter is not
configured correctly; the Fibre Channel small form-factor
pluggable (SFP) transceiver failed; the Fibre Channel cable either
failed or is not installed; or the device at the other end of the
cable failed.
Make a note of port-2. Go to step 7 on page 319.
v Failed: The port is not operational because of a hardware failure.
Make a note of port-2. Go to step 9 on page 319.
v Not installed: This port is not installed. Make a note of port-2. Go
to step 10 on page 320.
YES
Press and release the right button to display Fibre Channel port-3. Go
to step 4.
4. (from step 3)
Is the front panel display or the service assistant GUI on the SAN Volume
Controller showing Fibre Channel port-3 active?
NO
A Fibre Channel port is not working correctly. Check the port status.
v Inactive: The port is operational but cannot access the Fibre
Channel fabric. Possible causes: the Fibre Channel adapter is not
configured correctly; the Fibre Channel small form-factor
pluggable (SFP) transceiver failed; the Fibre Channel cable either
failed or is not installed; or the device at the other end of the
cable failed.
Make a note of port-3. Go to step 7 on page 319.
v Failed: The port is not operational because of a hardware failure.
Make a note of port-3. Go to step 9 on page 319.
v Not installed: This port is not installed. Make a note of port-3. Go
to step 10 on page 320.
YES
Press and release the right button to display Fibre Channel port-4. Go
to step 5.
5. (from step 4)
Is the front panel display or the service assistant GUI on the SAN Volume
Controller showing Fibre Channel port-4 active?
NO
A Fibre Channel port is not working correctly. Check the port status.
v Inactive: The port is operational but cannot access the Fibre
Channel fabric. Possible causes: the Fibre Channel adapter is not
configured correctly; the Fibre Channel small form-factor
pluggable (SFP) transceiver failed; the Fibre Channel cable either
failed or is not installed; or the device at the other end of the
cable failed. Make a note of port-4. Go to step 7 on page 319.
v Failed: The port is not operational because of a hardware failure.
Make a note of port-4. Go to step 9 on page 319.
v Not installed: This port is not installed. Make a note of port-4. Go
to step 10 on page 320.
YES
If there are more than four Fibre Channel ports on the node, use the
service assistant to repeat step 5 on page 317 for each additional
Fibre Channel port.
Go to step 6.
6. (from step 5 on page 317)
A previously reported fault with a Fibre Channel port is no longer being
shown. A problem with the SAN Fibre Channel fabric might be fixed or there
might be an intermittent problem.
Check with the customer to see whether any Fibre Channel ports are
disconnected or if any component of the SAN Fibre Channel fabric failed and
was recently fixed.
Is the Fibre Channel port failure explained by the previous checks?
NO
There might be an intermittent Fibre Channel error.
a. Use the SAN problem determination procedure to check for and
resolve any Fibre Channel fabric connection problems. If you
resolve a problem, continue with “MAP 5700: Repair verification”
on page 321.
b. Check whether similar Fibre Channel errors occurred recently on
the same port on this SAN Volume Controller node. If they have,
replace the Fibre Channel cable, unless it was already replaced.
c. Replace the Fibre Channel SFP transceiver, unless it was already
replaced.
Note: SAN Volume Controller nodes are supported by both
longwave SFP transceivers and shortwave SFP transceivers. You
must replace an SFP transceiver with the same type of SFP
transceiver. If the SFP transceiver to replace is a longwave SFP
transceiver, for example, you must provide a suitable replacement.
Removing the wrong SFP transceiver might result in loss of data
access. See the “Removing and replacing the Fibre Channel SFP
transceiver on a SAN Volume Controller node” documentation to
find out how to replace an SFP transceiver.
d. Replace the Fibre Channel adapter assembly that is shown in
Table 82.
Table 82. Fibre Channel assemblies
Node | Adapter assembly
SAN Volume Controller 2145-CG8 port 1, 2, 3, or 4 (slot 1) | Four-port Fibre Channel HBA
SAN Volume Controller 2145-CG8 port 5, 6, 7, or 8 (slot 2) | Four-port Fibre Channel HBA
SAN Volume Controller 2145-CF8 port 1, 2, 3, or 4 | Four-port Fibre Channel HBA
SAN Volume Controller 2145-DH8 port 1, 2, 3, or 4 (slot 1 mandatory; the first FC adapter) | Four-port Fibre Channel adapter
SAN Volume Controller 2145-DH8 port 5, 6, 7, or 8 (slot 2 optional; the second FC adapter) | Four-port Fibre Channel adapter
SAN Volume Controller 2145-DH8 port 9, 10, 11, or 12 (slot 5 optional; the third FC adapter) | Four-port Fibre Channel adapter
e. Verify the repair by continuing with “MAP 5700: Repair
verification” on page 321.
YES
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
7. (from steps 2 on page 316, 3 on page 317, 4 on page 317, and 5 on page 317)
The port noted on the SAN Volume Controller is showing a status of inactive.
For certain models, this inactive status might occur when the Fibre Channel
speed is not set correctly.
8. (from step 7)
The noted port on the SAN Volume Controller displays a status of inactive. If
the noted port still displays a status of inactive, replace the parts that are
associated with the noted port until the problem is fixed in the following
order:
a. Fibre Channel cables from the SAN Volume Controller to Fibre Channel
network.
b. Faulty Fibre Channel fabric connections, particularly the SFP transceiver at
the Fibre Channel switch. Use the SAN problem determination procedure
to resolve any Fibre Channel fabric connection problem.
c. SAN Volume Controller Fibre Channel SFP transceiver.
Note: SAN Volume Controller nodes are supported by both longwave
SFPs and shortwave SFPs. You must replace an SFP with the same type of
SFP transceiver that you are replacing. If the SFP transceiver to replace is a
longwave SFP transceiver, for example, you must provide a suitable
replacement. Removing the wrong SFP transceiver might result in loss of
data access. See the “Removing and replacing the Fibre Channel SFP
transceiver on a SAN Volume Controller node” documentation to find out
how to replace an SFP transceiver.
d. Replace the Fibre Channel adapter assembly as shown in Table 82 on page
318.
e. Verify the repair by continuing with “MAP 5700: Repair verification” on
page 321.
9. (from steps 2 on page 316, 3 on page 317, 4 on page 317, and 5 on page 317)
The noted port on the SAN Volume Controller displays a status of failed.
Verify that the Fibre Channel cables that connect the SAN Volume Controller
nodes to the switches are securely connected. Replace the parts that are
associated with the noted port until the problem is fixed in the following
order:
a. Fibre Channel SFP transceiver.
Note: SAN Volume Controller nodes are supported by both longwave SFP
transceivers and shortwave SFP transceivers. You must replace an SFP
transceiver with the same type of SFP transceiver. If the SFP transceiver to
replace is a longwave SFP transceiver, for example, you must provide a
suitable replacement. Removing the wrong SFP transceiver might result in
loss of data access. See the “Removing and replacing the Fibre Channel
SFP transceiver on a SAN Volume Controller node” documentation to find
out how to replace an SFP transceiver.
b. Replace the Fibre Channel adapter assembly as shown in Table 82 on page
318.
c. Verify the repair by continuing with “MAP 5700: Repair verification” on
page 321.
10. (from steps 2 on page 316, 3 on page 317, 4 on page 317, and 5 on page 317)
The noted port on the SAN Volume Controller displays a status of not
installed. If you replaced the Fibre Channel adapter, make sure that it is
installed correctly. If you replaced any other system board components, make
sure that the Fibre Channel adapter was not disturbed.
Is the Fibre Channel adapter failure explained by the previous checks?
NO
a. Replace the Fibre Channel adapter assembly as shown in Table 82
on page 318.
b. If the problem is not fixed, replace the Fibre Channel connection
hardware in the order that is shown in Table 83.
Table 83. SAN Volume Controller Fibre Channel adapter connection hardware
Node | Adapter connection hardware
SAN Volume Controller 2145-CG8 port 1, 2, 3, or 4 (slot 1) | 1. PCI Express® FC with Riser card assembly 1; 2. System board
SAN Volume Controller 2145-CG8 port 5, 6, 7, or 8 (slot 2) | 1. PCI Express® FC with Riser card assembly 2; 2. System board
SAN Volume Controller 2145-DH8 port 1-8 | 1. PCI Express® Riser card assembly 1; 2. System board
SAN Volume Controller 2145-DH8 port 9-12 | 1. PCI Express® Riser card assembly 2; 2. System board
c. Verify the repair by continuing with “MAP 5700: Repair
verification” on page 321.
YES
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
11. (from step 1 on page 316)
If the operating speed is lower than the operating speed that is supported by
the switch, a high number of link errors are being detected.
To display the current speed of the link, see
http://www-01.ibm.com/support/knowledgecenter/STPVGU_7.6.0/com.ibm.storage.svc.console.760.doc/svc_svcdetfibrenetspeed_23eeaf.html
Is the port operating at lower than the expected speed?
NO
Repeat the check with the other Fibre Channel ports until the failing
port is located. If no failing port is located, the problem no longer
exists. Verify the repair by continuing with “MAP 5700: Repair
verification.”
YES
Perform the following steps:
a. Check the routing of the Fibre Channel cable to ensure that no
damage exists and that the cable route contains no tight bends (no
less than a 3-inch radius). Either reroute or replace the Fibre
Channel cable.
b. Remove the Fibre Channel cable for 2 seconds and then reinsert it
to force the Fibre Channel adapter to renegotiate its operating
speed.
c. Recheck the speed of the Fibre Channel port. If it is now correct,
the problem is resolved. Otherwise, the problem might be caused
by one of the following components:
v Four-port Fibre Channel HBA
v SAN Volume Controller SFP transceiver
v Fibre Channel switch gigabit interface converter (GBIC) or SFP
transceiver
v Fibre Channel switch
Recheck the speed after you change any component until the
problem is resolved and then verify the repair by continuing with
“MAP 5700: Repair verification.”
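The routing in steps 2 through 5 and the 2145-DH8 rows of Table 82 can be sketched as two small lookups. The function and dictionary names are hypothetical; the status values, step numbers, and port-to-slot ranges are the ones given in this MAP.

```python
# Step that handles each non-active port status in MAP 5600.
STATUS_TO_STEP = {
    "inactive": 7,
    "failed": 9,
    "not installed": 10,
}


def next_step(status: str):
    """Return the MAP 5600 step for a port status, or None if the port is active."""
    key = status.strip().lower()
    if key == "active":
        return None          # move on to the next port
    return STATUS_TO_STEP[key]


def dh8_adapter_slot(port: int) -> int:
    """Map a 2145-DH8 Fibre Channel port number to its adapter slot (Table 82)."""
    if 1 <= port <= 4:
        return 1             # first FC adapter (mandatory)
    if 5 <= port <= 8:
        return 2             # second FC adapter (optional)
    if 9 <= port <= 12:
        return 5             # third FC adapter (optional)
    raise ValueError(f"no Table 82 entry for port {port}")


if __name__ == "__main__":
    print(next_step("Failed"), dh8_adapter_slot(10))
```

Note that the third adapter on the 2145-DH8 is in slot 5, not slot 3, so the slot cannot be derived from the port number by simple division.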
MAP 5700: Repair verification
MAP 5700: Repair verification helps you to verify that the field-replaceable units
(FRUs) that you exchanged for new FRUs, or the repair actions that you completed,
solved all the problems on the SAN Volume Controller.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You might have been sent here because you performed a repair and want to
confirm that no other problems exist on the machine.
Procedure
1. Are the Power LEDs on all the nodes on? For more information about this
LED, see “Power LED” on page 23.
NO
Go to “MAP 5000: Start” on page 275.
YES
Go to step 2.
2. (from step 1)
Are all the nodes displaying Cluster: or is the node status LED on?
NO
Go to “MAP 5000: Start” on page 275.
YES
Go to step 3.
3. (from step 2 on page 321)
Using the SAN Volume Controller application for the system you have just
repaired, check the status of all configured managed disks (MDisks).
Do all MDisks have a status of online?
NO
If any MDisks have a status of offline, repair the MDisks. Use the
problem determination procedure for the disk controller to repair the
MDisk faults before returning to this MAP.
If any MDisks have a status of degraded paths or degraded ports,
repair any storage area network (SAN) and MDisk faults before
returning to this MAP.
If any MDisks show a status of excluded, include the MDisks before
returning to this MAP.
Go to “MAP 5000: Start” on page 275.
YES
Go to step 4.
4. (from step 3)
Using the SAN Volume Controller application for the system you have just
repaired, check the status of all configured volumes. Do all volumes have a
status of online?
NO
Go to step 5.
YES
Go to step 6.
5. (from step 4)
Following a repair of the SAN Volume Controller, a number of volumes are
showing a status of offline. Volumes will be held offline if SAN Volume
Controller cannot confirm the integrity of the data. The volumes might be the
target of a copy that did not complete, or cache write data that was not written
back to disk might have been lost. Determine why the volume is offline. If the
volume was the target of a copy that did not complete, you can start the copy
again. Otherwise, write data might not have been written to the disk, so its
state cannot be verified. Your site procedures will determine how data is
restored to a known state.
To bring the volume online, you must move all the offline disks to the recovery
I/O group and then move them back to an active I/O group.
Go to “MAP 5000: Start” on page 275.
6. (from step 4)
You have successfully repaired the SAN Volume Controller.
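The decision flow of MAP 5700 can be summarized as a small function. This is a minimal sketch, not an IBM tool: the verify_repair name, its parameters, and the underscore spelling of the MDisk status strings are assumptions made for illustration, and the inputs are assumed to have been gathered by hand from the LEDs and from the management application.

```python
from typing import Iterable, List

GO_TO_START = "Go to MAP 5000: Start."

# Step 3: action for each MDisk status other than online.
# The underscore spellings are an assumption about how the status is reported.
MDISK_ACTIONS = {
    "offline": "Repair the MDisk by using the disk controller procedure.",
    "degraded_paths": "Repair any SAN and MDisk faults.",
    "degraded_ports": "Repair any SAN and MDisk faults.",
    "excluded": "Include the MDisk.",
}


def verify_repair(power_leds_on: bool,
                  nodes_ready: bool,
                  mdisk_statuses: Iterable[str],
                  volume_statuses: Iterable[str]) -> List[str]:
    """Return the actions that MAP 5700 calls for, in order."""
    # Steps 1 and 2: every Power LED is on, and every node displays
    # Cluster: or has its node status LED on.
    if not (power_leds_on and nodes_ready):
        return [GO_TO_START]

    # Step 3: every MDisk must be online.
    faulty = sorted({s for s in mdisk_statuses if s != "online"})
    if faulty:
        actions = [MDISK_ACTIONS.get(s, "Investigate MDisk status: " + s)
                   for s in faulty]
        # dict.fromkeys drops repeated actions while keeping their order.
        return list(dict.fromkeys(actions)) + [GO_TO_START]

    # Steps 4 and 5: every volume must be online.
    if any(s != "online" for s in volume_statuses):
        return ["Determine why each offline volume is offline and recover it.",
                GO_TO_START]

    # Step 6: nothing is left to fix.
    return ["Repair verified."]
```

For example, verify_repair(True, True, ["online", "excluded"], ["online"]) returns the include action followed by the return to MAP 5000, which matches the NO branch of step 3.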
MAP 5800: Light path
MAP 5800: Light path helps you to solve hardware problems on a SAN Volume
Controller that are preventing the node from booting.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
You might be sent here because of the following situations:
• The Error LED on the operator-information panel is on or flashing
• Another MAP sent you here:
– “Light path for SAN Volume Controller 2145-DH8”
– “Light path for SAN Volume Controller 2145-CG8” on page 329
– “Light path for SAN Volume Controller 2145-CF8” on page 335
Light path for SAN Volume Controller 2145-DH8
Light path diagnostics is a system of LEDs on top of the operator-information
panel of the SAN Volume Controller 2145-DH8 node, which leads you to the failed
component.
About this task
When an error occurs, LEDs are lit along the front of the operator-information
panel, the light path diagnostics panel, then on the failed component. By viewing
the LEDs in a particular order, you can often identify the source of the error.
LEDs that are lit to indicate an error remain lit when the server is turned off if the
node is connected to an operating power supply.
Ensure that the node is turned on, and then resolve any hardware errors that are
indicated by the Error LED and light path LEDs:
Procedure
1. Is the System error LED ▌7▐, shown in Figure 82, on the SAN Volume
Controller 2145-DH8 operator-information panel on or flashing?
Figure 82. SAN Volume Controller 2145-DH8 operator-information panel
▌1▐ Power control button and LED.
▌2▐ Ethernet LED.
▌3▐ Locator button and LED.
▌4▐ Release latch.
▌5▐ Ethernet activity LEDs.
▌6▐ Check log LED.
▌7▐ System error LED.
NO
Reassess your symptoms and return to “MAP 5000: Start” on page 275.
YES
Go to step 2.
2. (from step 1)
Press the release latch, as shown in Figure 83 on page 324, and open the light
path diagnostics panel, which is shown in Figure 84 on page 324.
Figure 83. Press the release latch
Callouts: Operator information panel, Light path diagnostics LEDs, Release latch.
Are one or more LEDs on the light path diagnostics panel on or flashing?
Figure 84. SAN Volume Controller 2145-DH8 light path diagnostics panel
Panel labels: Light Path Diagnostics, Checkpoint Code, Remind, Reset.
NO
Verify that the operator-information panel cable is correctly seated at
both ends. If the error LED is still illuminated but no LEDs are
illuminated on the light path diagnostics panel, replace parts in the
following sequence:
a. Operator-information panel
b. System board
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
See Table 84 on page 326 and complete the action that is specified for
the specific light-path-diagnostics LEDs. Then, go to step 3 on page 329.
Some actions require that you observe the state of LEDs on the system
board. Figure 85 shows the location of the system board LEDs. The fan
LEDs are located next to each FAN. To view the LEDs, complete the
following actions:
a. Turn off the node while ensuring that its data is mirrored and
synchronized.
b. Identify and label all the cables that are attached to the node so that
they can be replaced in the same port. Remove the node from the
rack and place it on a flat, static-protective surface. For more
information, see “Removing the node from a rack”.
c. Remove the top cover.
d. See Table 84 on page 326 and complete the action that is specified
for the specific light-path-diagnostics LEDs. Then, go to step 3 on
page 329.
Figure 85. SAN Volume Controller 2145-DH8 system board LEDs.
Labeled LEDs: System error LED, Locator LED, Power LED, Enclosure management
heartbeat LED, IMM2 heartbeat LED, Standby power LED, 10G Ethernet card error
LED, Battery error LED, DIMM 1-6 error LED (under the latches), DIMM 7-18 error
LED (under the latches), DIMM 19-24 error LED (under the latches),
Microprocessor 1 error LED, Microprocessor 2 error LED, Fan 1 error LED, Fan 2
error LED, Fan 3 error LED, Fan 4 error LED, System board error LED.
Table 84. Diagnostics panel LED.
LED: The Error log or Check log LED (operator-information panel)
Description: An error occurred and cannot be isolated without completing
certain procedures.
Action:
1. Plug in the VGA screen and the USB keyboard.
2. Check the IMM2 system event log and the system-error log for information
about the error.
3. Save the log if necessary and clear the log afterward.
LED: System-error LED (operator-information panel)
Description: An error occurred.
Action:
1. Check the light-path-diagnostics LEDs and follow the instructions.
2. Check the IMM2 system event log and the system-error log for information
about the error.
3. Save the log if necessary and clear the log afterward.
LED: PS
Description: When only the PS LED is lit, a power supply failed.
Action: The system might detect a power supply error. Complete the following
steps to correct the problem:
1. Check the power-supply with a lit yellow LED.
2. Make sure that the power supplies are seated correctly and plugged in a
good AC outlet.
3. Remove a power supply to isolate the failed power supply.
4. Make sure that both power supplies installed in the server are of the same
AC input voltage.
5. Replace the failed power supply.
LED: PS + CONFIG
Description: When both the PS and CONFIG LEDs are lit, the power supply
configuration is not valid.
Action: If the PS LED and the CONFIG LED are lit, the system logs an invalid
power configuration error. Make sure that both power supplies installed in
the node are of the same rating or wattage.
LED: OVER SPEC
Description: The system consumption reaches the power supply over-current
protection point or the power supplies are damaged.
Action:
1. If the Pwr Rail (A, B, C, D, E, F, G, and H) error was not detected,
complete the following steps:
a. Use the IBM Power Configurator utility to determine current system power
consumption. For more information and to download the utility, go to
http://www.ibm.com/systems/bladecenter/resources/powerconfig.html.
b. Replace the failed power supply.
2. If the Pwr Rail (A, B, C, D, E, F, G, and H) error was also detected,
follow actions that are listed in MAP 5040: Power.
LED: PCI
Description: An error occurred on a PCI bus or on the system board. Another
LED is lit next to a failing PCI slot.
Action:
1. Check the riser-card LEDs, the ServeRAID error LED, and the dual-port
network adapter error LED to identify the component that caused the error.
2. Check the system-error log for information about the error.
3. If you cannot isolate the failing component by using the LEDs and the
information in the system-error log, remove one component at a time. Then,
restart the server after each component is removed.
4. Replace the following components, in the order that is shown, restarting
the server each time:
• PCI riser cards
• ServeRAID adapter
• Network adapter
• (Trained technician only) System board.
5. If the failure remains, contact your IBM service representative.
LED: NMI
Description: A nonmaskable interrupt occurred, or the NMI button was pressed.
Action:
1. Check the system-error log for information about the error.
2. Restart the server.
LED: CONFIG + PS
Description: An invalid power configuration error occurred.
Action: If the CONFIG LED and the PS LED are lit, the system logs an invalid
power configuration error. Make sure that both power supplies installed in
the server are of the same rating or wattage.
LED: CONFIG + CPU
Description: A hardware configuration error occurred.
Action: If the CONFIG LED and the CPU LED are lit, complete the following
steps to correct the problem:
1. Check the microprocessors that were installed to make sure that they are
compatible with each other.
2. (Trained technician only) Replace the incompatible microprocessor.
3. Check the system-error logs for information about the error. Replace any
component that is identified in the error log.
LED: CONFIG + MEM
Description: A hardware configuration error occurred.
Action: If the CONFIG LED and the MEM LED are lit, check the system-event log
in the Setup utility or IMM2 error messages.
LED: CONFIG + PCI
Description: A hardware configuration error occurred.
Action: If the CONFIG LED and the PCI LED are lit, check the system-error
logs for information about the error. Replace any component that is
identified in the error log.
LED: CONFIG + HDD
Description: A disk drive error occurred.
Action: If the CONFIG LED and the HDD LED are lit, check the system-error
logs for information about the error. Replace any component that is
identified in the error log.
LED: LINK
Description: Reserved.
LED: CPU
Description: When only the CPU LED is lit, a microprocessor failed. When both
the CPU and CONFIG LEDs are lit, the microprocessor configuration is invalid.
Action:
1. If the CONFIG LED is not lit, a microprocessor failure occurred. Complete
the following steps:
a. (Trained technician only) Make sure that the failing microprocessor and
its heat sink, which are indicated by a lit LED on the system board, are
installed correctly.
b. (Trained technician only) Replace the failing microprocessor.
c. For more information, contact your IBM service representative.
2. If the CONFIG LED and the CPU LED are lit, the system logs an invalid
microprocessor configuration error. Complete the following steps to correct
the problem:
a. Check recently installed microprocessors to ensure that they are
compatible with each other.
b. (Trained technician only) Replace any incompatible microprocessor.
c. Check the system-error logs for information about the error. Replace any
component that is identified in the error log.
LED: MEM
Description: When only the MEM LED is lit, a memory error occurred.
Action:
Note: Each time that you install or remove a DIMM, you must disconnect the
node from the power source; then, wait 10 seconds before you restart the
server.
If the CONFIG LED is not lit, the system might detect a memory error.
Complete the following steps to correct the problem:
1. Update the node firmware.
2. Reseat or swap the DIMMs with lit LED.
3. Check the system-event log in the Setup utility or IMM error messages.
4. Replace the failing DIMM.
LED: MEM + CONFIG
Description: When both the MEM and CONFIG LEDs are lit, the memory
configuration is not valid.
Action: If the MEM LED and the CONFIG LED are lit, check the system-event log
in the Setup utility or IMM2 error messages.
LED: TEMP
Description: The system or the system component temperature exceeded a
threshold level. A failing fan can cause the TEMP LED to be lit.
Action:
1. Make sure that the heat sink is seated correctly.
2. Determine whether a fan failed and replace the fan if necessary.
3. Make sure that the room temperature is not too high. See the environment
requirements for the server temperature information.
4. Make sure that the air vents are not blocked.
5. Make sure that the heat sink or the fan on the adapter, or any other
network adapter is seated correctly. If the fan failed, replace it.
6. For more information, contact your IBM service representative.
LED: FAN
Description: A fan failed, is operating too slowly, or is removed. The TEMP
LED might also be lit.
Action:
1. Check whether your node is installed with the dual-port network adapter.
If yes, make sure that your node complies with the configuration with four
fans installed.
2. Reseat the failing fan, which is indicated by a lit LED near the fan
connector on the system board.
3. Replace the failing fan.
LED: BOARD
Description: An error occurred on the system board or the system battery.
Action:
1. Check the LEDs on the system board to identify the component that caused
the error. The BOARD LED can be lit due to any of the following reasons:
• Battery
• (Trained technician only) System board
2. Check the system-error log for information about the error.
3. Replace the failing component.
LED: HDD
Description: A hard disk drive failed or is missing.
Action:
1. Check the LEDs on the hard disk drives for the drive with a lit status LED
and reseat the hard disk drive.
2. Reseat the hard disk drive backplane.
3. If the error remains, replace the following components one at a time, in
the order that is listed, restarting the server after each:
a. Replace the hard disk drive.
b. Replace the hard disk drive backplane.
4. If the problem remains, contact your IBM service representative.
3. Continue with “MAP 5700: Repair verification” on page 321 to verify the
correct operation.
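In Table 84, the CONFIG LED changes the meaning of several other LEDs when both are lit. The lookup below is a minimal sketch of that rule. The interpret function and the shortened diagnosis strings are hypothetical and paraphrase the table; the sketch does not replace the listed actions.

```python
from typing import Iterable, List

# Meaning of an LED when it is lit together with CONFIG (from Table 84).
CONFIG_COMBINATIONS = {
    "PS": "Invalid power configuration",
    "CPU": "Invalid microprocessor configuration",
    "MEM": "Invalid memory configuration",
    "PCI": "Hardware configuration error",
    "HDD": "Disk drive error",
}

# Meaning of an LED when CONFIG is not lit (from Table 84).
SINGLE_LEDS = {
    "PS": "Power supply failed",
    "OVER SPEC": "Power consumption reached the over-current protection point",
    "PCI": "Error on a PCI bus or on the system board",
    "NMI": "Nonmaskable interrupt occurred or NMI button pressed",
    "CPU": "Microprocessor failed",
    "MEM": "Memory error",
    "TEMP": "Temperature threshold exceeded",
    "FAN": "Fan failed, is too slow, or is removed",
    "BOARD": "System board or system battery error",
    "HDD": "Hard disk drive failed or is missing",
}


def interpret(lit_leds: Iterable[str]) -> List[str]:
    """Map the set of lit light path LEDs to short diagnoses."""
    lit = {name.upper() for name in lit_leds}
    config_lit = "CONFIG" in lit
    findings = []
    for led in sorted(lit - {"CONFIG"}):
        if config_lit and led in CONFIG_COMBINATIONS:
            findings.append("CONFIG + " + led + ": " + CONFIG_COMBINATIONS[led])
        elif led in SINGLE_LEDS:
            findings.append(led + ": " + SINGLE_LEDS[led])
        else:
            findings.append(led + ": not listed in Table 84")
    return findings
```

For example, interpret(["PS"]) reports a failed power supply, while interpret(["PS", "CONFIG"]) reports an invalid power configuration, which is the distinction that the PS and PS + CONFIG rows draw.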
Light path for SAN Volume Controller 2145-CG8
Use the diagnostics LEDs that are on the system board to solve hardware problems
with the SAN Volume Controller 2145-CG8 node.
About this task
Ensure that the node is turned on, and then complete the following steps to
resolve any hardware errors that are indicated by the Error LED and light path
LEDs:
Procedure
1. Is the Error LED, shown in Figure 86 on page 330, on the SAN Volume
Controller 2145-CG8 operator-information panel on or flashing?
Figure 86. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel
▌1▐ System error LED.
▌2▐ Release latch.
NO
Reassess your symptoms and return to “MAP 5000: Start” on page 275.
YES
Go to step 2.
2. (from step 1 on page 329)
Press the release latch and open the light path diagnostics panel, which is
shown in Figure 87.
Are one or more LEDs on the light path diagnostics panel on or flashing?
Figure 87. SAN Volume Controller 2145-CG8 or 2145-CF8 light path diagnostics panel
Panel labels: Light Path Diagnostics, REMIND, RESET, OVER SPEC, CNFG, LOG,
LINK, PS, PCI, FAN, TEMP, MEM, NMI, CPU, VRM, DASD, RAID, SP, BRD.
NO
Verify that the operator-information panel cable is correctly seated at
both ends. If the error LED is still illuminated but no LEDs are
illuminated on the light path diagnostics panel, replace parts in the
following sequence:
a. Operator-information panel
b. System board
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
See Table 85 on page 333 and complete the action that is specified for
the specific light-path-diagnostics LEDs. Then, go to step 3 on page 335.
Some actions require that you observe the state of LEDs on the system
board. Figure 88 on page 332 shows the location of the system board
LEDs. The fan LEDs are located next to each FAN. To view the LEDs,
complete the following actions:
a. Turn off the node while ensuring that its data is mirrored and
synchronized. For more information, see “MAP 5350: Powering off a
node” on page 302.
b. Identify and label all the cables that are attached to the node so that
they can be replaced in the same port. Remove the node from the
rack and place it on a flat, static-protective surface. For more
information, see “Removing the node from a rack”.
c. Remove the top cover.
d. See Table 85 on page 333 and complete the action that is specified
for the specific light-path-diagnostics LEDs. Then, go to step 3 on
page 335.
Figure 88. SAN Volume Controller 2145-CG8 system board LEDs diagnostics panel
▌1▐ Battery LED.
▌2▐ IMM heartbeat LED.
▌3▐ Enclosure management heartbeat LED.
▌4▐ DIMM 10-18 error LEDs.
▌5▐ Microprocessor 1 error LED.
▌6▐ DIMM 1-9 error LEDs.
▌7▐ Fan one error LED.
▌8▐ Fan two error LED.
▌9▐ Fan three error LED.
▌10▐ Fan four error LED.
▌11▐ Fan five error LED.
▌12▐ Fan six error LED.
▌13▐ SAS RAID riser-card missing LED.
▌14▐ 240 VA error LED.
▌15▐ Power channel A error LED.
▌16▐ Power channel B error LED.
▌17▐ Power channel C error LED.
▌18▐ Power channel D error LED.
▌19▐ Power channel E error LED.
▌20▐ AUX power channel error LED.
▌21▐ System board error LED.
▌22▐ Microprocessor 2 error LED.
▌23▐ Riser 2 missing LED.
Table 85. Diagnostics panel LED prescribed actions
Diagnostics panel LED
Action
OVER SPEC
The power supplies are using more power than their maximum rating. If the OVER SPEC LED
is lit, one or more of the six 12 V channel error LEDs is also lit. The 12 V channel error LEDs
(A, B, C, D, E, or AUX) are on the system board. Complete the following actions to resolve the
problem:
1. Turn off the node, pull the node forward in the rack, and remove the cover. Do not
disconnect power from the node.
2. Check the 12 V channel error LED that is lit on the system board, and remove components
that are listed for that LED:
• LED A: fans, disk drive, any flash drives, or disk backplane
• LED B: Fibre Channel adapter and riser, all memory
• LED C: disk controller, all memory
• LED D: microprocessor
• LED E: High-speed SAS adapter and riser, if installed
• LED AUX: Fibre Channel adapter and high-speed SAS adapter, if installed
3. Restart the node to see whether the problem remains.
4. Reinstall, one at a time, each device that you removed for the lit LED. Start the node
after each device is reinstalled to isolate the failing device.
5. Replace any failing device.
6. If no device was isolated, and if LED C or LED D is lit, turn off the node and remove the
microprocessor. You need alcohol wipes and thermal grease to replace the microprocessor.
Power on the server by toggling switch block 3 (SW3) bit 6. Restart the server. If the
problem is resolved, replace the microprocessor; otherwise, reinstall the microprocessor. In
either case, toggle switch block 3 (SW3) bit 6 back to its original position.
7. If no device was isolated, and if LED AUX is lit, turn off the node and remove the
operator-information panel. Power on the server by toggling switch block 3 (SW3) bit 6.
Restart the server. If the problem was resolved, replace the
operator-information panel; otherwise, reinstall the operator-information panel. In either
case, toggle switch block 3 (SW3) bit 6 back to its original position.
8. If no failing device is isolated, replace the system board.
LOG
An error occurred. Connect a keyboard and a monitor. Check the IMM system event log and
the system event log for information about the error. Replace any components that are
identified in event logs.
LINK
This LED is not used on the SAN Volume Controller 2145-CG8. Replace the system board.
PS
Power supply 1 or power supply 2 failed. Complete the following actions to resolve the
problem:
1. Check the power supply with a lit amber LED.
2. Make sure that the power supplies are seated correctly.
3. Remove 1 power supply to isolate the failed power supply.
4. Replace the failed power supply.
PCI
An error occurred on a PCI bus or on the system board. Another LED is lit next to a failing
PCI slot. Complete the following actions to resolve the problem:
1. Identify the failing adapter by checking the LEDs on the PCI slots.
2. If the PCI in slot 1 is showing an error, replace the four-port Fibre Channel adapter
assembly.
3. If the PCI in slot 2 is showing an error, replace the high-speed SAS adapter assembly.
4. If the error is not resolved, replace the system board.
SP
A service processor error was detected. Complete the following actions to resolve the problem:
1. Remove power from the node. Reconnect the server to the power, and restart the node.
2. If the problem remains, replace the system board.
FAN
A fan failed, is operating too slowly, or is removed. A failing fan can also cause the
TEMP LED to be lit. Complete the following actions to resolve the problem:
1. Reseat the failing fan, which is indicated by a lit LED near the fan connector on the system
board.
2. If the problem remains, replace the failing fan.
TEMP
The system temperature exceeded a threshold level. A failing fan can cause the TEMP LED to
be lit. Complete the following actions to resolve the problem:
1. Make sure that the heat sink is seated correctly.
2. Determine whether a fan failed, and replace it.
3. Verify that the ambient temperature is within normal operating specifications.
4. Make sure that airflow in and around the SAN Volume Controller 2145-CG8 is not
obstructed.
MEM
A memory configuration that is not valid or a memory error occurred. Both the MEM LED and
CNFG LED might be lit. Complete the following actions to resolve the problem:
1. Check that all the memory DIMMs are correctly installed.
2. If any memory error LEDs are lit, replace the indicated memory module.
3. If the MEM LED and the CNFG LED are lit, adjust the memory so that DIMM slots 2, 3, 5,
6, 7, and 8 are the only ones used.
NMI
A non-maskable interrupt occurred or the NMI button was pressed. This situation does not
occur normally. If the NMI button on the light path diagnostic panel was pressed by mistake,
restart the node. Otherwise, call your support center.
CNFG
A hardware configuration error occurred. If the MEM LED is also lit, complete the actions that
are shown for MEM LED. If the CPU LED is lit, check to see whether a microprocessor is
installed in CPU 2. If a CPU is installed, remove it. The configuration is not supported. If no
other light path LEDs are lit, replace the FRUs in the order that is shown until the problem is
resolved:
1. Operator-information panel
2. Operator-information panel cable
3. System board
CPU
A microprocessor failed or a microprocessor configuration is not valid. Both the CPU LED
and the CNFG LED might be lit. Complete the following actions:
1. Check the system board error LEDs.
2. If CPU 1 error LED is lit, check that the microprocessor is correctly installed.
3. If the error persists, replace the microprocessor.
4. If the error persists, replace the system board.
VRM
This LED is not used on the SAN Volume Controller 2145-CG8.
DASD
A disk drive failed or is missing. A SAN Volume Controller 2145-CG8 must have its
system hard disk drive installed in drive slot 4. Up to 4 flash drives can be installed in
drive slots 0 - 3.
If a flash drive is deliberately removed from a slot, the system error LED and the DASD
diagnostics panel LED light. The error is maintained even if the flash drive is replaced in a
different slot. If a flash drive is removed or moved, clear the error by completing this
procedure:
1. Power off the node by using MAP 5350.
2. Remove both power cables.
3. Reconnect both power cables.
4. Restart the node.
Resolve any node or system errors that relate to flash drives or the system disk drive.
If an error is still shown, power off the node and reseat all the drives.
If the error remains, replace the following components in the order listed:
1. The system disk drive
2. The disk backplane
RAID
This LED is not used on the SAN Volume Controller 2145-CG8.
BRD
An error occurred on the system board. Complete the following actions to resolve the problem:
1. Check the LEDs on the system board to identify the component that caused the error. The
BRD LED can be lit because of any of the following reasons:
• Battery.
• Missing PCI riser-card assembly. There must be a riser card in PCI slot 2 even if another
adapter is not present.
• Failed voltage regulator.
2. Replace any failed or missing replacement components, such as the battery or PCI
riser-card assembly.
3. If a voltage regulator fails, replace the system board.
3. Continue with “MAP 5700: Repair verification” on page 321 to verify the
correct operation.
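The OVER SPEC row of Table 85 pairs each 12 V channel error LED with the components to remove and names fallback steps for when no device is isolated. The sketch below restates that mapping as data. The function names are hypothetical, and the component strings are paraphrased from the table rather than taken from any IBM utility.

```python
from typing import Iterable, List

# Components listed for each 12 V channel error LED in Table 85.
CHANNEL_COMPONENTS = {
    "A": ["fans", "disk drive", "flash drives", "disk backplane"],
    "B": ["Fibre Channel adapter and riser", "all memory"],
    "C": ["disk controller", "all memory"],
    "D": ["microprocessor"],
    "E": ["high-speed SAS adapter and riser, if installed"],
    "AUX": ["Fibre Channel adapter", "high-speed SAS adapter, if installed"],
}


def components_to_remove(lit_channels: Iterable[str]) -> List[str]:
    """List the components to remove for the lit channel LEDs, no repeats."""
    result: List[str] = []
    for channel in lit_channels:
        for component in CHANNEL_COMPONENTS[channel.upper()]:
            if component not in result:
                result.append(component)
    return result


def fallback_steps(lit_channels: Iterable[str]) -> List[str]:
    """List the steps that apply when no failing device was isolated."""
    lit = {channel.upper() for channel in lit_channels}
    steps = []
    if lit & {"C", "D"}:
        steps.append("Remove the microprocessor and retest.")
    if "AUX" in lit:
        steps.append("Remove the operator-information panel and retest.")
    steps.append("Replace the system board.")
    return steps
```

For example, components_to_remove(["B", "C"]) lists the memory only once even though both channels name it, and fallback_steps(["A"]) goes straight to replacing the system board.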
Light path for SAN Volume Controller 2145-CF8
Use the diagnostics LEDs that are on the system board to solve hardware problems
with the SAN Volume Controller 2145-CF8 node.
About this task
Ensure that the node is turned on, and then complete the following steps to
resolve any hardware errors that are indicated by the Error LED and light path
LEDs:
Procedure
1. Is the Error LED, shown in Figure 89, on the SAN Volume Controller
2145-CF8 operator-information panel on or flashing?
Figure 89. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel
▌5▐ System error LED.
▌6▐ Release latch.
NO
Reassess your symptoms and return to “MAP 5000: Start” on page 275.
YES
Go to step 2.
2. (from step 1)
Press the release latch and open the light path diagnostics panel, which is
shown in Figure 90.
Are one or more LEDs on the light path diagnostics panel on or flashing?
Figure 90. SAN Volume Controller 2145-CG8 or 2145-CF8 light path diagnostics panel
Panel labels: Light Path Diagnostics, REMIND, RESET, OVER SPEC, CNFG, LOG,
LINK, PS, PCI, FAN, TEMP, MEM, NMI, CPU, VRM, DASD, RAID, SP, BRD.
NO
Verify that the operator-information panel cable is correctly seated at
both ends. If the error LED is still illuminated but no LEDs are
illuminated on the light path diagnostics panel, replace parts in the
following sequence:
a. Operator-information panel
b. System board
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
See Table 86 on page 339 and complete the action that is specified for
the specific light-path-diagnostics LEDs. Then, go to step 3 on page 341.
Some actions require that you observe the state of LEDs on the system
board. Figure 91 shows the location of the system board LEDs. The fan
LEDs are located next to each FAN. To view the LEDs, complete the
following actions:
a. Turn off the node while ensuring that its data is mirrored and
synchronized. For more information, see “MAP 5350: Powering off a
node” on page 302.
b. Identify and label all the cables that are attached to the node so that
they can be replaced in the same port. Remove the node from the
rack and place it on a flat, static-protective surface. For more
information, see “Removing the node from a rack”.
c. Remove the top cover.
d. See Table 86 on page 339 and complete the action that is specified
for the specific light-path-diagnostics LEDs. Then, go to step 3 on
page 341.
Figure 91. SAN Volume Controller 2145-CF8 system board LEDs diagnostics panel
▌1▐ Slot 2 missing PCI riser card LED.
▌2▐ Enclosure manager heartbeat LED.
▌3▐ Battery LED.
▌4▐ IMM heartbeat LED.
▌5▐ Slot 1 missing PCI riser card LED.
▌6▐ System error LED.
▌7▐ Microprocessor 1 error LED.
▌8▐ DIMM 1-8 error LEDs.
▌9▐ Fan one error LED.
▌10▐ Fan two error LED.
▌11▐ Fan three error LED.
▌12▐ Fan four error LED.
▌13▐ Fan five error LED.
▌14▐ Fan six error LED.
▌15▐ 240 VA error LED.
▌16▐ Power channel A error LED.
▌17▐ Power channel B error LED.
▌18▐ Power channel C error LED.
▌19▐ Power channel D error LED.
▌20▐ Power channel E error LED.
▌21▐ AUX power channel error LED.
▌22▐ SAS/SATA RAID error LED.
▌23▐ Microprocessor 2 error LED.
▌24▐ DIMM 9-16 error LEDs.
Table 86. Diagnostics panel LED prescribed actions
Diagnostics panel LED
Action
OVER SPEC
The power supplies are using more power than their maximum rating. If the OVER SPEC LED
is lit, one or more of the six 12 V channel error LEDs is also lit. The 12 V channel error LEDs
(A, B, C, D, E, or AUX) are on the system board. Complete the following actions to resolve the
problem:
1. Turn off the node, pull the node forward in the rack, and remove the cover. Do not
disconnect power from the node.
2. Check the 12 V channel error LED that is lit on the system board. Remove components that
are listed for that LED:
• LED A: fans, disk drive, any flash drives, or disk backplane
• LED B: Fibre Channel adapter and riser, all memory
• LED C: disk controller, all memory
• LED D: microprocessor
• LED E: High-speed SAS adapter and riser, if installed
• LED AUX: Fibre Channel adapter and high-speed SAS adapter, if installed
3. Restart the node to see whether the problem remains.
4. Reinstall, one at a time, each device that you removed for the lit LED. Start the node
after each device is reinstalled to isolate the failing device.
5. Replace any failing device.
6. If no device was isolated, and if LED C or LED D is lit, turn off the node and remove the
microprocessor. You need alcohol wipes and thermal grease to replace the microprocessor.
Power on the server by toggling switch block 3 (SW3) bit 6. Restart the server.
If the problem is resolved, replace the microprocessor; otherwise, reinstall the
microprocessor. In either case, toggle switch block 3 (SW3) bit 6 back to its original
position.
7. If no device was isolated, and if LED AUX is lit, turn off the node and remove the
operator-information panel. Power on the server by toggling switch block 3 (SW3) bit 6.
Restart the server. If the problem was resolved, replace the
operator-information panel; otherwise, reinstall the operator-information panel. In either
case, toggle switch block 3 (SW3) bit 6 back to its original position.
8. If no failing device is isolated, replace the system board.
LOG
An error occurred. Connect a keyboard and a monitor. Check the IMM system event log and
the system event log for information about the error. Replace any components that are
identified in event logs.
LINK
This LED is not used on the SAN Volume Controller 2145-CF8. Replace the system board.
PS
Power supply 1 or power supply 2 failed. Complete the following actions to resolve the
problem:
1. Check the power supply that has a lit amber LED.
2. Make sure that the power supplies are seated correctly.
3. Remove a power supply to isolate the failed power supply.
4. Replace the failed power supply.
PCI
An error occurred on a PCI bus or on the system board. Another LED is lit next to a failing
PCI slot. Complete the following actions to resolve the problem:
1. Identify the failing adapter by checking the LEDs on the PCI slots.
2. If the PCI slot 1 LED is lit, replace the four-port Fibre Channel adapter assembly.
3. If the PCI slot 2 LED is lit, replace the high-speed SAS adapter assembly.
4. If the error is not resolved, replace the system board.
SP
A service processor error was detected. Complete the following actions to resolve the problem:
1. Remove power from the node. Reconnect the server to the power, and restart the node.
2. If the problem remains, replace the system board.
FAN
A fan is failing, is operating too slowly, or is removed. A failing fan can also cause the TEMP
LED to be lit. Complete the following actions to resolve the problem:
1. Reseat the failing fan, which is indicated by a lit LED near the fan connector on the system
board.
2. If the problem remains, replace the failing fan.
TEMP
The system temperature exceeded a threshold level. A failing fan can cause the TEMP LED to
be lit. Complete the following actions to resolve the problem:
1. Make sure that the heat sink is seated correctly.
2. Determine whether a fan failed. If it has, replace it.
3. Verify that the ambient temperature is within normal operating specifications.
4. Make sure that airflow in and around the SAN Volume Controller 2145-CF8 is not
obstructed.
MEM
A memory configuration that is not valid or a memory error occurred. Both the MEM LED and
CNFG LED might be lit. Complete the following actions to resolve the problem:
1. Check that all the memory DIMMs are correctly installed.
2. If any memory error LEDs are lit, replace the indicated memory module.
3. If the MEM LED and the CNFG LED are lit, adjust the memory so that DIMM slots 2, 3, 5,
6, 7, and 8 are the only ones used.
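The slot rule in step 3 amounts to a set comparison. The following Python sketch assumes that you already know which DIMM slots are populated; the function name unsupported_dimm_slots is hypothetical.

```python
# DIMM slots that may be populated, according to step 3 of the MEM actions.
ALLOWED_DIMM_SLOTS = {2, 3, 5, 6, 7, 8}


def unsupported_dimm_slots(populated_slots):
    """Return the populated slots that are outside the allowed set."""
    return sorted(set(populated_slots) - ALLOWED_DIMM_SLOTS)


print(unsupported_dimm_slots([1, 2, 3, 5]))  # [1]
```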
NMI
A non-maskable interrupt occurred or the NMI button was pressed. This situation does not
occur normally. If the NMI button on the light path diagnostic panel was pressed by mistake,
restart the node. Otherwise, call your support center.
CNFG
A hardware configuration error occurred. If the MEM LED is also lit, follow the actions that
are shown for MEM LED. If the CPU LED is lit, check to see whether a microprocessor is
installed in CPU 2. If a CPU is installed, remove it. The configuration is not supported. If no
other light path LEDs are lit, replace the FRUs in the order that is shown until the problem is
resolved:
1. Operator-information panel
2. Operator-information panel cable
3. System board
CPU
A microprocessor failed, or a microprocessor configuration is not valid. Both the CPU
LED and the CNFG LED might be lit. Complete the following actions:
1. Check the system board error LEDs.
2. If CPU 1 error LED is lit, check that the microprocessor is correctly installed.
3. If the error persists, replace the microprocessor.
4. If the error persists, replace the system board.
VRM
This LED is not used on the SAN Volume Controller 2145-CF8.
DASD
A disk drive failed or is missing. A SAN Volume Controller 2145-CF8 must have its
system hard disk drive installed in drive slot 4. Up to four flash drives can be installed in
drive slots 0 - 3.
If a flash drive is deliberately removed from a slot, the system error LED and the DASD
diagnostics panel LED light. The error is maintained even if the flash drive is replaced in a
different slot. If a flash drive is removed or moved, clear the error by completing this
procedure:
1. Power off the node by using MAP 5350.
2. Remove both power cables.
3. Replace both power cables.
4. Restart the node.
Resolve any node or system errors that relate to Flash drives or the system disk drive.
If an error is still shown, power off the node and reseat all the drives.
If the error remains, replace the following components in the order listed:
1. The system disk drive
2. The disk backplane
RAID
This LED is not used on the SAN Volume Controller 2145-CF8.
BRD
An error occurred on the system board. Complete the following actions to resolve the problem:
1. Check the LEDs on the system board to identify the component that caused the error. The
BRD LED can be lit because of any of the following reasons:
v Battery
v Missing PCI riser-card assembly. There must be a riser card in PCI slot 2 even if another
adapter is not present.
v Failed voltage regulator
2. Replace any failed or missing replacement components, such as the battery or PCI
riser-card assembly.
3. If a voltage regulator fails, replace the system board.
3. Continue with “MAP 5700: Repair verification” on page 321 to verify the
correct operation.
MAP 5900: Hardware boot
MAP 5900: Hardware boot helps you solve problems that are preventing the node
from starting its boot sequence.
Before you begin
Note: Use the service assistant GUI if there is no front panel display, for example
on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to all SAN Volume Controller models. However, some models
do not have a front panel display; use the service assistant GUI if the node does
not have a front panel display. Be sure that you know which model you are using
before you start this procedure. To determine which model you are working with,
look for the label that identifies the model type on the front of the node.
You might have been sent here for one of the following reasons:
v The hardware boot display, shown in Figure 92, is displayed continuously.
Figure 92. Hardware boot display
v The node rescue display, shown in Figure 93, is displayed continuously.
Figure 93. Node rescue display
v The boot progress is hung and an error is displayed on the front panel
v Another MAP sent you here
v The node status LED, node fault LED and battery status LED have remained off
About this task
Perform the following steps to allow the node to start its boot sequence:
Procedure
1. Is the Error LED on the operator-information panel illuminated or flashing?
NO
Go to step 2.
YES
Go to “MAP 5800: Light path” on page 322 to resolve the problem.
2. (From step 1)
If you have just installed the SAN Volume Controller node or have just
replaced a field replaceable unit (FRU) inside the node, perform the
following steps:
a. Identify and label all the cables that are attached to the node so that they
can be replaced in the same port. Remove the node from the rack and place
it on a flat, static-protective surface. See the Removing the node from a rack
information to find out how to perform the procedure.
b. Remove the top cover. See the “Removing the top cover” information to
find out how to perform the procedure.
c. If you just replaced a FRU, ensure that the FRU is correctly placed and that
all connections to the FRU are secure.
d. Ensure that all memory modules are correctly installed and that the latches
are fully closed. See the Replacing the memory modules (DIMM) information
to find out how to perform the procedure.
e. Ensure that the Fibre Channel adapters are correctly installed. See the
Replacing the Fibre Channel adapter assembly information to find out how to
perform the procedure.
f. Ensure that the disk drive and its connectors are correctly installed. See the
Replacing the disk drive information to find out how to perform the
procedure.
g. Ensure that the service controller is correctly installed. See the Replacing the
service controller information to find out how to perform the procedure.
h. Replace the top cover. See the Replacing the top cover information to find out
how to perform the procedure.
i. Place the node in the rack. See the Replacing the node in a rack information to
find out how to perform the procedure.
j. Turn on the node.
Does the boot operation still hang?
NO
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
Go to step 3.
3. (from step 2 on page 342)
Check if the system BIOS is reporting any errors. You need to attach a VGA
display and USB keyboard to see the BIOS output. The customer should be
able to supply a suitable display and keyboard.
a. Turn off the node while ensuring that its data is mirrored and synchronized.
See “MAP 5350: Powering off a node” on page 302.
b. Connect the keyboard ▌1▐ and the display ▌2▐. Figure 94 shows the
location of the keyboard and monitor ports on the 2145-CF8. Figure 95
shows the location of the keyboard and monitor ports on the 2145-CG8.
Connect to the VGA port and a USB port on the front or the back of a
2145-DH8.
Figure 94. Keyboard and monitor ports on the SAN Volume Controller 2145-CF8
Figure 95. Keyboard and monitor ports on the SAN Volume Controller 2145-CG8
▌3▐ USB port 1 and 2
▌4▐ Video port
Figure 96. Keyboard and monitor ports on the SAN Volume Controller 2145-DH8, front
▌7▐ USB port 6
▌8▐ USB port 5
▌9▐ USB port 4
▌10▐ USB port 3
▌11▐ Serial port
▌12▐ Video port
Figure 97. Keyboard and monitor ports on the SAN Volume Controller 2145-DH8, rear
c. Turn on the node.
d. Watch the VGA display.
v If the POST sequence indicates an error, or if the BIOS
Configuration/Setup Utility program indicates an error during startup,
you need to resolve the error.
v If it indicates an error with a specific hardware item, power off the node
and remove it from the rack. Ensure the item specified is correctly
installed, replace the node in the rack, and then restart the node. If the
error is still reported, replace the specified item.
v If a configuration error is reported, run the Configuration/Setup Utility
program option to reset the BIOS to its default (factory) settings.
v If no boot image can be found on a 2145-DH8 node, contact IBM Remote
Technical Support for help. You might be asked to install the software
from a USB flash drive.
e. Turn off the node and remove the keyboard and display.
f. Turn on the node.
Does the boot operation still hang?
NO
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
Go to step 4.
4. (from step 3 on page 343)
a. Turn off the node while ensuring that its data is mirrored and synchronized.
See “MAP 5350: Powering off a node” on page 302.
b. Identify and label all the cables that are attached to the node so that they
can be replaced in the same port. Remove the node from the rack and place
it on a flat, static-protective surface. See the Removing the node from a rack
information to find out how to perform the procedure.
c. Remove the top cover. See the Removing the top cover information to find out
how to perform the procedure.
d. Remove some of the memory modules:
v In the SAN Volume Controller 2145-DH8, remove the memory modules
in slots 1, 4, 9, and 12, in addition to slots 13, 16, 21 and 24 if the second
microprocessor is fitted.
e. Remove all installed adapters.
f. Remove the disk drive, unless this is a 2145-DH8, which can only boot from
its boot drives.
g. Replace the top cover. See the Replacing the top cover information to find out
how to perform the procedure.
h. Place the node in the rack. See the Replacing the node in a rack information to
find out how to perform the procedure.
i. Turn on the node.
5. Does the boot operation still hang with the booting display (perform the NO
action) or has the boot operation progressed (perform the YES action)?
Note: With the FRUs removed, the boot will hang with a different boot failure
code.
NO
Go to step 6 to replace the FRUs, one-at-a-time, until the failing FRU is
isolated.
YES
Go to step 7 on page 346
6. (From step 5)
Remove all hardware except the hardware that is necessary to power up.
Continue to add in the FRUs one at a time and power on each time until the
original failure is introduced.
Does the boot operation still hang?
NO
Verify the repair by continuing with “MAP 5700: Repair verification”
on page 321.
YES
Go to step 7.
7. (from steps 4 on page 345 and 6 on page 345)
a. Turn off the node. See “MAP 5350: Powering off a node” on page 302 for
more information.
b. Identify and label all the cables that are attached to the node so that they
can be replaced in the same port. Remove the node from the rack and place
it on a flat, static-protective surface. See the Removing the node from a rack
information to find out how to perform the procedure.
c. Remove the top cover. See the Removing the top cover information to find out
how to perform the procedure.
d. Replace the adapter and the disk drive.
e. Replace the memory modules:
v In the SAN Volume Controller 2145-DH8, replace the memory modules
in slots 1, 4, 9, and 12, in addition to slots 13, 16, 21, and 24 if the second
microprocessor is fitted.
f. Replace the top cover. See the Replacing the top cover information to find out
how to perform the procedure.
g. Place the node in the rack. See the Replacing the node in a rack information to
find out how to perform the procedure.
h. Turn on the node.
Does the boot operation still hang with the booting display (perform the NO
action) or does the display progress beyond the initial booting panel
(perform the YES action)?
NO
Exchange the failing memory modules for new FRUs and verify the
repair by continuing with “MAP 5700: Repair verification” on page 321.
YES
Replace the parts in the following sequence:
v For the SAN Volume Controller 2145-CG8 or 2145-CF8:
a. Service controller
b. System board
Verify the repair by continuing with “MAP 5700: Repair verification” on page
321.
MAP 6000: Replace offline SSD
MAP 6000: This procedure replaces a flash drive that has failed while it is still a
member of a storage pool.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to models with internal flash drives. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working on, look for the label that identifies the model type on the
front of the node.
About this task
Use this MAP to determine which detailed MAP to use for replacing an offline
SSD.
Attention: If the drive use property is member and the drive must be replaced,
contact IBM support before taking any actions.
Procedure
Are you using an SSD in a RAID 0 array and using volume mirroring to provide
redundancy?
Yes
Go to “MAP 6001: Replace offline SSD in a RAID 0 array.”
No
Go to “MAP 6002: Replace offline SSD in RAID 1 array or RAID 10 array”
on page 349.
MAP 6001: Replace offline SSD in a RAID 0 array
MAP 6001: This procedure replaces a flash drive that has failed while it is still a
member of a storage pool.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to models with internal flash drives. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working on, look for the label that identifies the model type on the
front of the node.
Attention:
1. Back up your SAN Volume Controller configuration before you begin these
steps.
2. If the drive use property is member and the drive must be replaced, contact IBM
support before taking any actions.
About this task
Perform the following steps only if a drive in a RAID 0 (striped) array has failed:
Procedure
1. Record the properties of all volume copies, MDisks, and storage pools that are
dependent on the failed drive.
a. Use the lsdrive CLI command to identify the drive ID and the error sequence
number of each drive whose status is offline and whose use is failed.
b. Review the offline reason using the lsevent <seq_no> CLI command.
c. Obtain detailed information about the offline drive or drives using the
lsdrive <drive_id> CLI command.
d. Record the mdisk_id, mdisk_name, node_id, node_name, and slot_id for each
offline drive.
e. Obtain the storage pools of the failed drives using the lsmdisk <mdisk_id>
CLI command for each MDisk that was identified in substep 1c.
Continue with the following steps by replacing all the failed drives in one
of the storage pools. Make note of the node, slot, and ID of the selected
drives.
f. Determine all the MDisks in the storage pool using the lsmdisk
-filtervalue mdisk_grp_id=<grp id> CLI command.
g. Identify which MDisks are internal (ctrl_type equals 4) and which
MDisks contain SSDs (ctrl_type equals 6).
h. Find the volumes with extents in the storage pool using the lsmdiskmember
<mdisk_id> CLI command for each MDisk found in substep 1f.
It is likely that the same volumes will be returned for each MDisk.
i. Record all the properties on each volume listed in step 1h by using the
lsvdisk <vdisk_id> CLI command. For each volume check if it has online
volume copies which indicate it is mirrored. Use this information in step 9
on page 349.
j. Obtain a list of all the drives in each internal MDisk in the storage pool
using the lsdrive -filtervalue mdisk_id=<mdisk_id> CLI command. Use
this information in step 8.
k. Record all the properties of all the MDisks in the storage pool using the
lsmdisk <mdisk_id> CLI command. Use this information in step 8.
l. Record all the properties of the storage pool using the lsmdiskgrp
<mdiskgrp_id> CLI command. Use this information in step 7.
Note: If a listed volume has a mirrored, online, and in-sync copy, you can
recover the copied volume data from the copy. All the data on the unmirrored
volumes will be lost and will need to be restored from backup.
2. Delete the storage pool using the rmmdiskgrp -force <mdiskgrp id> CLI
command.
All MDisks and volume copies in the storage pool are also deleted. If any of
the volume copies were the last in-sync copy of a volume, all the copies that
are not in sync are also deleted, even if they are not in the storage pool.
3. Using the drive ID that you recorded in substep 1e, set the use property of the
drive to unused using the chdrive command.
chdrive -use unused <id of offline drive>
The drive is removed from the drive listing.
4. Follow the physical instructions to replace or remove a drive. See the
“Replacing a SAN Volume Controller 2145-CG8 flash drive” documentation or
the “Removing a SAN Volume Controller 2145-CG8 flash drive”
documentation to find out how to perform the procedures.
5. A new drive object is created with the use attribute set to unused. This action
might take several minutes.
Obtain the ID of the new drive using the lsdrive CLI command.
6. Change the use property for the new drive to candidate.
chdrive -use candidate <drive id of new drive>
7. Create a new storage pool with the same properties as the deleted storage
pool. Use the properties that you recorded in substep 1l.
mkmdiskgrp -name <mdiskgrp name as before> -ext <extent size as before>
8. Create again all MDisks that were previously in the storage pool using the
information from steps 1j and 1k.
v For internal RAID 0 MDisks, use this command:
mkarray -level raid0 -drive <list of drive IDs> -name <mdisk_name> <mdiskgrp id or name>
where -name <mdisk_name> is optional, but you can use the parameter to
make the new array have the same MDisk name as the old array.
v For external MDisks, use the addmdisk CLI command.
v For non-RAID 0 MDisks, use the mkarray CLI command.
9. For all the volumes that had online, in sync, mirrored volume copies before
the MDisk group was deleted, add a new volume copy in the new storage
pool to restore redundancy using the following command:
addvdiskcopy -mdiskgrp <mdiskgrp id> -vtype striped -easytier <on or off as before> <vdisk_id>
10. For any volumes that did not have an online, in sync, mirrored copy, create
the volume again and restore the data from a backup or use other methods.
11. Mark the drive error as fixed using the error sequence number from step 1b.
cherrstate -sequencenumber <error_sequence_number>
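The command lines in steps 2 through 11 can be rendered ahead of time and reviewed before anything is entered on the system. The following Python sketch only builds strings; it does not connect to a system. The function name map_6001_plan and its parameters are hypothetical, the colon-separated form of the drive list is an assumption about the -drive argument, and new_pool_id is known only after the storage pool has been created again in step 7.

```python
def map_6001_plan(old_pool_id, pool_name, extent_size, offline_drive_id,
                  new_drive_id, raid0_drive_ids, new_pool_id,
                  mirrored_volumes, error_seq_no):
    """Return the CLI lines of MAP 6001 steps 2-11, in order, for review."""
    # Assumption: the drive list is written as colon-separated drive IDs.
    drive_list = ":".join(str(d) for d in raid0_drive_ids)
    plan = [
        f"rmmdiskgrp -force {old_pool_id}",              # step 2
        f"chdrive -use unused {offline_drive_id}",       # step 3
        # Steps 4-5: replace the physical drive, then find the new ID
        # with lsdrive before continuing.
        f"chdrive -use candidate {new_drive_id}",        # step 6
        f"mkmdiskgrp -name {pool_name} -ext {extent_size}",            # step 7
        f"mkarray -level raid0 -drive {drive_list} {new_pool_id}",     # step 8
    ]
    # Step 9: one addvdiskcopy per volume that had an in-sync mirror.
    for vdisk_id, easytier in mirrored_volumes:
        plan.append(
            f"addvdiskcopy -mdiskgrp {new_pool_id} -vtype striped "
            f"-easytier {easytier} {vdisk_id}"
        )
    plan.append(f"cherrstate -sequencenumber {error_seq_no}")  # step 11
    return plan


for line in map_6001_plan(2, "ssdpool", 256, 5, 9, [4, 9], 2, [(11, "on")], 130):
    print(line)
```

Volumes without an in-sync mirrored copy (step 10) are deliberately left out, because they must be created again and restored from backup by other means.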
MAP 6002: Replace offline SSD in RAID 1 array or RAID 10 array
MAP 6002: This procedure replaces a flash drive that has failed while it is still a
member of a storage pool.
Before you begin
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, “Using the maintenance analysis procedures,” on page 275.
This MAP applies to models with internal flash drives. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working on, look for the label that identifies the model type on the
front of the node.
Attention:
1. Back up your SAN Volume Controller configuration before you begin these
steps.
2. If the drive use property is member and the drive must be replaced, contact IBM
support before taking any actions.
About this task
Perform the following steps if a drive fails in a RAID 1 or RAID 10 array:
Procedure
1. Make sure the drive property use is not member.
Use the lsdrive CLI command to determine the use.
2. Record the drive property values of the node ID and the slot ID for use in step
4. These values identify which physical drive to remove.
3. Record the error sequence number for use in step 11 on page 350.
4. Use the drive ID that you recorded in step 2 to set the use property
of the drive to failed and then to unused with the chdrive command.
chdrive -use failed <id of offline drive>
chdrive -use unused <id of offline drive>
The drive is removed from the drive listing.
5. Follow the physical instructions to replace or remove a drive. See the
“Replacing a SAN Volume Controller 2145-CG8 flash drive” documentation or
the “Removing a SAN Volume Controller 2145-CG8 flash drive”
documentation to find out how to perform the procedures.
6. A new drive object is created with the use property set to unused.
7. Change the use property for the drive to candidate.
chdrive -use candidate <id of new drive>
8. Change the use property for the drive to spare.
chdrive -use spare <id of new drive>
v If you are using spare drives, perform a member exchange. Move data from
the spare to the newly inserted device.
v If you do not have a spare, when you mark the drive object as spare, the
array starts to build on the newly inserted device.
9. If the spare is not a perfect match for the replaced drive, then the array is
considered unbalanced, and error code 1692 is recorded in the error log.
10. Follow the fix procedure to complete the procedure.
11. Mark the drive error as fixed using the error sequence number from step 3 on
page 349.
cherrstate -sequencenumber <error_sequence_number>
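The use property of the drives moves through a fixed sequence in this procedure: failed, then unused for the offline drive, and candidate, then spare for the new drive. The following Python sketch renders the corresponding command lines for review. It only builds strings; the function name map_6002_commands and the example IDs are hypothetical.

```python
def map_6002_commands(offline_drive_id, new_drive_id, error_seq_no):
    """Return the chdrive and cherrstate lines of MAP 6002, in order."""
    return [
        # Step 4: retire the offline drive.
        f"chdrive -use failed {offline_drive_id}",
        f"chdrive -use unused {offline_drive_id}",
        # Steps 5-6 happen here: replace the physical drive and wait for
        # the new drive object (use = unused) to be created.
        # Steps 7-8: promote the new drive.
        f"chdrive -use candidate {new_drive_id}",
        f"chdrive -use spare {new_drive_id}",
        # Step 11: mark the drive error as fixed.
        f"cherrstate -sequencenumber {error_seq_no}",
    ]


for line in map_6002_commands(3, 7, 120):
    print(line)
```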
Chapter 12. iSCSI performance analysis and tuning
This procedure provides a solution for Internet Small Computer Systems Interface
(iSCSI) host performance problems while connected to a SAN Volume Controller
system and its connectivity to the network switch.
About this task
The following attributes and host parameters might affect iSCSI performance:
v Transmission Control Protocol (TCP) Delayed ACK
v Ethernet jumbo frame
v Network bottleneck or oversubscription
v iSCSI session login balance
v Priority flow control (PFC) setting and bandwidth allocation for iSCSI on the
network
Procedure
1. Disable the TCP delayed acknowledgment feature.
To disable this feature, refer to OS/platform documentation.
v VMware: http://kb.vmware.com/selfservice/microsites/microsite.do
v Windows: http://support.microsoft.com/kb/823764
The primary signature of this issue: read performance is significantly lower
than write performance. Transmission Control Protocol (TCP) delayed
acknowledgment is a technique that is used by some implementations of the
TCP in an effort to improve network performance. However, in this scenario
where the number of outstanding I/O is 1, the technique can significantly
reduce I/O performance.
In essence, several ACK responses can be combined together into a single
response, reducing protocol overhead. As described in RFC 1122, a host can
delay sending an ACK response by up to 500 ms. Additionally, with a stream of
full-sized incoming segments, ACK responses must be sent for every second
segment.
Important: The host must be rebooted for these settings to take effect. A few
platforms (for example, standard Linux distributions) do not provide a way to
disable this feature. However, the issue was resolved with the version 7.1
release, and no host configuration changes are required to manage
TcpDelayedAck behavior.
2. Enable jumbo frame for iSCSI.
Jumbo frames are Ethernet frames that carry more than 1500 bytes of payload. The
maximum transmission unit (MTU) parameter specifies the size of
jumbo frames.
The SAN Volume Controller supports a 9000-byte MTU. Use the CLI
command cfgportip to enable jumbo frames. This command is disruptive: the
link flips and I/O operations through that port pause.
The network must support jumbo frames end-to-end for this to be effective;
verify this by sending a ping packet to be delivered without fragmentation. For
example:
v Windows:
ping -t <iscsi target ip> -S <iscsi initiator ip> -f -l <new mtu size - packet overhead (usually 36, might differ)>
The following command is an example of a command that is used to check
whether a 9000-bytes MTU is set correctly on a Windows 7 system:
ping -t -S 192.168.1.117 192.168.1.217 -f -l 8964
The following output is an example of a successful reply:
192.168.1.217: bytes=8964 time=1ms TTL=52
v Linux:
ping -I <source iscsi initiator ip> -s <new mtu size - 28> -M do <iscsi target ip>
v ESXi:
ping <iscsi target ip> -I <source iscsi initiator ip> -s <new mtu size - 28> -d
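The payload size passed to ping is the MTU minus the header overhead. For IPv4 without options, the IP header is 20 bytes and the ICMP echo header is 8 bytes, which gives the 28 bytes that the ESXi example subtracts; the Windows example above uses 36. The following Python sketch shows the arithmetic; the function name ping_payload is hypothetical.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def ping_payload(mtu, overhead=IPV4_HEADER + ICMP_HEADER):
    """Return the largest ping payload that fits in one unfragmented frame."""
    if mtu <= overhead:
        raise ValueError("MTU must be larger than the header overhead")
    return mtu - overhead


print(ping_payload(9000))      # 8972
print(ping_payload(9000, 36))  # 8964, the value in the Windows example
```

If a ping with the computed payload and the do-not-fragment option fails while a smaller payload succeeds, some device on the path is probably not configured for jumbo frames.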
3. Check the port statistics of the switch ports to which the initiator and target
ports are connected, and make sure that the packet-drop count is not high.
Review network architecture to avoid any bottlenecks and oversubscription.
The network needs to be balanced to avoid any packet drop; packet drop
significantly reduces storage performance. Involve networking support to fix
any such issues.
4. Optimize and utilize all iSCSI ports.
To optimize SAN Volume Controller resource utilization, all iSCSI ports must
be used.
v Each port is assigned to one CPU, and by balancing the login, one can
maximize CPU utilization and achieve better performance. Ideally, configure
subnets equal to the number of iSCSI ports on the SAN Volume Controller
node. Configure each port of a node with an IP on a different subnet and
keep it the same for other nodes. The following example displays an ideal
configuration:
Node 1
Port 1: 192.168.1.11
Port 2: 192.168.2.21
Port 3: 192.168.3.31
Node 2:
Port 1: 192.168.1.12
Port 2: 192.168.2.22
Port 3: 192.168.3.33
v Avoid situations where 50 hosts are logged in to port 1 and only five hosts
are logged in to port 2.
v Use proper subnetting to achieve a balance between the number of sessions
and redundancy.
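The addressing rule in this step (each port of a node on a different subnet, and the same subnet for the same port number on every node) can be checked mechanically. The following Python sketch uses the standard ipaddress module and assumes /24 subnets, which the text does not state; the function names and the plan layout are hypothetical.

```python
import ipaddress


def subnet_of(ip, prefix=24):
    """Return the network that contains ip (prefix length is an assumption)."""
    return ipaddress.ip_interface(f"{ip}/{prefix}").network


def check_port_plan(plan, prefix=24):
    """plan maps node name -> {port number: IP}. Return a list of problems."""
    problems = []
    reference = None  # per-port subnets of the first node in the plan
    for node, ports in plan.items():
        subnets = {port: subnet_of(ip, prefix) for port, ip in ports.items()}
        if len(set(subnets.values())) != len(subnets):
            problems.append(f"{node}: two ports share a subnet")
        if reference is None:
            reference = subnets
        elif subnets != reference:
            problems.append(f"{node}: port subnets differ from the first node")
    return problems


ideal = {
    "node1": {1: "192.168.1.11", 2: "192.168.2.21", 3: "192.168.3.31"},
    "node2": {1: "192.168.1.12", 2: "192.168.2.22", 3: "192.168.3.33"},
}
print(check_port_plan(ideal))  # []
```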
5. Troubleshoot problems with PFC settings.
You do not need to enable PFC on the SAN Volume Controller system. SAN
Volume Controller reads the data center bridging exchange (DCBx) packet and
enables PFC for iSCSI automatically if it is enabled on the switch. In the
lsportip command output, the fields lossless_iscsi and lossless_iscsi6
show on or off, depending on whether PFC is enabled for iSCSI on the
system.
If the fields lossless_iscsi and lossless_iscsi6 are showing off, it might be
due to one of the following reasons:
a. VLAN is not set for that IP. Verify the following checks:
v For IP address type IPv4, check the vlan field in the lsportip output. It
should not be blank.
v For IP address type IPv6, check the vlan_6 field in the lsportip output. It
should not be blank.
v If the vlan and vlan_6 fields are blank, set the VLAN for the IP type
as described in Configuring VLAN for iSCSI.
b. Host flag is not set for that IP. Check the following:
v For IP address type IPv4, check the host field in the lsportip output. It
should be yes.
v For IP address type IPv6, check the host_6 field in the lsportip output. It
should be yes.
v If the host and host_6 fields are not yes, set the host flag for the IP type
using the cfgportip CLI command.
c. PFC is not properly set on the switch.
If the VLAN is properly set, and the host flag is also set, but the
lossless_iscsi or lossless_iscsi6 field is still showing off, some switch
settings might be missing or incorrect.
Verify the following settings in the switch:
v Priority tag is set for iSCSI traffic.
v PFC is enabled for the priority tag that is assigned to iSCSI CoS.
v DCBx is enabled on the switch.
Also check the appropriate documentation:
v Consult the documentation for enabling PFC on your specific switch.
v Consult the documentation for enabling PFC on Red Hat Enterprise
Linux (RHEL) and Windows hosts specific to your configuration.
6. Ensure that proper bandwidth is given to iSCSI on the network.
You can divide the bandwidth among the various types of traffic. It is
important to assign proper bandwidth for good performance. To assign
bandwidth for iSCSI traffic, you must first enable priority flow control (PFC)
for iSCSI.
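The PFC checks in step 5 follow a fixed order: VLAN, then host flag, then switch settings. The following is a minimal sketch of that decision order, assuming the lsportip fields for one port IP have already been parsed into a dictionary. The parsing step, the function name diagnose_pfc, and the sample values are hypothetical; only the field names (vlan, vlan_6, host, host_6, lossless_iscsi, lossless_iscsi6) come from this guide.

```python
def diagnose_pfc(row, ipv6=False):
    """Return the likely reasons that lossless iSCSI is off for one port IP.

    row is a dict of lsportip fields for one port, already parsed by the
    caller (the parsing step is not shown and is an assumption).
    """
    suffix = "_6" if ipv6 else ""
    lossless = row.get("lossless_iscsi6" if ipv6 else "lossless_iscsi", "")
    if lossless == "on":
        return []  # PFC is already active for iSCSI on this IP

    reasons = []
    # Check a: the vlan (IPv4) or vlan_6 (IPv6) field must not be blank.
    if not row.get("vlan" + suffix, "").strip():
        reasons.append("VLAN is not set for this IP")
    # Check b: the host (IPv4) or host_6 (IPv6) field must be yes.
    if row.get("host" + suffix, "") != "yes":
        reasons.append("host flag is not set for this IP")
    # Check c: if both are set, the remaining suspect is the switch.
    if not reasons:
        reasons.append("check switch: priority tag, PFC for that priority, DCBx")
    return reasons


if __name__ == "__main__":
    sample = {"vlan": "", "host": "yes", "lossless_iscsi": "off"}
    for reason in diagnose_pfc(sample):
        print(reason)
```

The function reports the VLAN and host flag problems together, because both can be fixed on the system, and points to the switch only when the system-side settings are already correct.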
Appendix A. Accessibility features for IBM SAN Volume
Controller
Accessibility features help users who have a disability, such as restricted mobility
or limited vision, to use information technology products successfully.
Accessibility features
These are the major accessibility features for the SAN Volume Controller:
v You can use screen-reader software and a digital speech synthesizer to hear what
is displayed on the screen. HTML documents have been tested using JAWS
version 15.0.
v This product uses standard Windows navigation keys.
v Interfaces are commonly used by screen readers.
v Keys are discernible by touch, but do not activate just by touching them.
v Industry-standard devices, ports, and connectors.
v You can attach alternative input and output devices.
The SAN Volume Controller online documentation and its related publications are
accessibility-enabled. The accessibility features of the online documentation are
described in Viewing information in the information center.
Keyboard navigation
You can use keys or key combinations to perform operations and initiate menu
actions that can also be done through mouse actions. You can navigate the SAN
Volume Controller online documentation from the keyboard by using the shortcut
keys for your browser or screen-reader software. See your browser or screen-reader
software Help for a list of shortcut keys that it supports.
IBM and accessibility
See the IBM Human Ability and Accessibility Center for more information about
the commitment that IBM has to accessibility.
Appendix B. Where to find the Statement of Limited Warranty
The Statement of Limited Warranty is available in both hardcopy format and in the
SAN Volume Controller information center.
The Statement of Limited Warranty is shipped (in hardcopy form) with your
product. It can also be ordered from IBM (see Table 2 on page xii for the part
number).
Notices
This information was developed for products and services offered in the US. This
material might be available from IBM in other languages. However, you may be
required to own a copy of the product or product version in that language in order
to access it.
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
For license inquiries regarding double-byte character set (DBCS) information,
contact the IBM Intellectual Property Department in your country or send
inquiries, in writing, to:
Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan, Ltd.
19-21, Nihonbashi-Hakozakicho, Chuo-ku
Tokyo 103-8510, Japan
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may
not apply to you.
This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements
and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM websites are provided for
convenience only and do not in any manner serve as an endorsement of those
websites. The materials at those websites are not part of the materials for this IBM
product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it
believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:
IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
US
Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.
The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.
The performance data discussed herein is presented as derived under specific
operating conditions. Actual results may vary.
Information concerning non-IBM products was obtained from the suppliers of
those products, their published announcements or other publicly available sources.
IBM has not tested those products and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.
Statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject
to change without notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to
change before the products described become available.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating
platform for which the sample programs are written. These examples have not
been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
imply reliability, serviceability, or function of these programs. The sample
programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.
If you are viewing this information softcopy, the photographs and color
illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com® are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the web at Copyright and
trademark information at www.ibm.com/legal/copytrade.shtml.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
trademarks or trademarks of Adobe Systems Incorporated in the United States,
and/or other countries.
Linux and the Linux logo is a registered trademark of Linus Torvalds in the United
States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
Other product and service names might be trademarks of IBM or other companies.
Homologation statement
This product may not be certified in your country for connection by any means
whatsoever to interfaces of public telecommunications networks. Further
certification may be required by law prior to making any such connection. Contact
an IBM representative or reseller for any questions.
Electronic emission notices
This section contains the electronic emission notices or statements for the United
States and other countries.
Federal Communications Commission (FCC) statement
This explains the Federal Communications Commission's (FCC’s) statement.
This equipment has been tested and found to comply with the limits for a Class A
digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to
provide reasonable protection against harmful interference when the equipment is
operated in a commercial environment. This equipment generates, uses, and can
radiate radio frequency energy and, if not installed and used in accordance with
the instruction manual, might cause harmful interference to radio communications.
Operation of this equipment in a residential area is likely to cause harmful
interference, in which case the user will be required to correct the interference at
his own expense.
Properly shielded and grounded cables and connectors must be used in order to
meet FCC emission limits. IBM is not responsible for any radio or television
interference caused by using other than recommended cables and connectors, or by
unauthorized changes or modifications to this equipment. Unauthorized changes
or modifications could void the user's authority to operate the equipment.
This device complies with Part 15 of the FCC Rules. Operation is subject to the
following two conditions: (1) this device may not cause harmful interference, and
(2) this device must accept any interference received, including interference that
might cause undesired operation.
Industry Canada compliance statement
This Class A digital apparatus complies with Canadian ICES-003.
Cet appareil numérique de la classe A est conforme à la norme NMB-003 du
Canada.
Australia and New Zealand Class A Statement
Attention: This is a Class A product. In a domestic environment this product
might cause radio interference in which case the user might be required to take
adequate measures.
European Union Electromagnetic Compatibility Directive
This product is in conformity with the protection requirements of European Union
(EU) Council Directive 2004/108/EC on the approximation of the laws of the
Member States relating to electromagnetic compatibility. IBM cannot accept
responsibility for any failure to satisfy the protection requirements resulting from a
non-recommended modification of the product, including the fitting of non-IBM
option cards.
Attention: This is an EN 55022 Class A product. In a domestic environment this
product might cause radio interference in which case the user might be required to
take adequate measures.
Responsible Manufacturer:
International Business Machines Corp.
New Orchard Road
Armonk, New York 10504
914-499-1900
European community contact:
IBM Deutschland GmbH
Technical Regulations, Department M372
IBM-Allee 1, 71139 Ehningen, Germany
Tele: +49 (0) 800 225 5423 or +49 (0) 180 331 3233
Email: halloibm@de.ibm.com
Germany Electromagnetic Compatibility Directive
Deutschsprachiger EU Hinweis: Hinweis für Geräte der Klasse A EU-Richtlinie
zur Elektromagnetischen Verträglichkeit
Dieses Produkt entspricht den Schutzanforderungen der EU-Richtlinie
2004/108/EG zur Angleichung der Rechtsvorschriften über die elektromagnetische
Verträglichkeit in den EU-Mitgliedsstaaten und hält die Grenzwerte der EN 55022
Klasse A ein.
Um dieses sicherzustellen, sind die Geräte wie in den Handbüchern beschrieben zu
installieren und zu betreiben. Des Weiteren dürfen auch nur von der IBM
empfohlene Kabel angeschlossen werden. IBM übernimmt keine Verantwortung für
die Einhaltung der Schutzanforderungen, wenn das Produkt ohne Zustimmung der
IBM verändert bzw. wenn Erweiterungskomponenten von Fremdherstellern ohne
Empfehlung der IBM gesteckt/eingebaut werden.
EN 55022 Klasse A Geräte müssen mit folgendem Warnhinweis versehen werden:
“Warnung: Dieses ist eine Einrichtung der Klasse A. Diese Einrichtung kann im
Wohnbereich Funk-Störungen verursachen; in diesem Fall kann vom Betreiber
verlangt werden, angemessene Maßnahmen zu ergreifen und dafür
aufzukommen.”
Deutschland: Einhaltung des Gesetzes über die elektromagnetische
Verträglichkeit von Geräten
Dieses Produkt entspricht dem “Gesetz über die elektromagnetische Verträglichkeit
von Geräten (EMVG).” Dies ist die Umsetzung der EU-Richtlinie 2004/108/EG in
der Bundesrepublik Deutschland.
Zulassungsbescheinigung laut dem Deutschen Gesetz über die
elektromagnetische Verträglichkeit von Geräten (EMVG) (bzw. der EMC EG
Richtlinie 2004/108/EG) für Geräte der Klasse A
Dieses Gerät ist berechtigt, in Übereinstimmung mit dem Deutschen EMVG das
EG-Konformitätszeichen - CE - zu führen.
Verantwortlich für die Einhaltung der EMV Vorschriften ist der Hersteller:
International Business Machines Corp.
New Orchard Road
Armonk, New York 10504
Tel: 914-499-1900
Der verantwortliche Ansprechpartner des Herstellers in der EU ist:
IBM Deutschland GmbH
Technical Regulations, Abteilung M372
IBM-Allee 1, 71139 Ehningen, Germany
Tele: +49 (0) 800 225 5423 or +49 (0) 180 331 3233
Email: halloibm@de.ibm.com
Generelle Informationen:
Das Gerät erfüllt die Schutzanforderungen nach EN 55024 und EN 55022 Klasse
A.
People's Republic of China Class A Statement
Taiwan Class A compliance statement
Taiwan Contact Information
This topic contains the product service contact information for Taiwan.
IBM Taiwan Product Service Contact Information:
IBM Taiwan Corporation
3F, No 7, Song Ren Rd., Taipei Taiwan
Tel: 0800-016-888
Japan VCCI Council Class A statement
This explains the Japan Voluntary Control Council for Interference (VCCI)
statement.
Japan Electronics and Information Technology Industries
Association Statement
This statement explains the Japan JIS C 61000-3-2 product wattage compliance.
This statement explains the Japan Electronics and Information Technology
Industries Association (JEITA) statement for products less than or equal to 20 A per
phase.
This statement explains the JEITA statement for products greater than 20 A, single
phase.
This statement explains the JEITA statement for products greater than 20 A per
phase, three-phase.
Korean Communications Commission Class A Statement
This explains the Korean Communications Commission (KCC) statement.
Russia Electromagnetic Interference Class A Statement
This statement explains the Russia Electromagnetic Interference (EMI) statement.
Index
Numerics
10 Gbps Ethernet
link failures 313
MAP 5550 313
10 Gbps Ethernet card
activity LED 24
10G Ethernet 248, 313
2145 UPS-1U
alarm 50
circuit breakers 51
connecting 48
connectors 51
controls and indicators on the front
panel 49
description of parts 51
dip switches 51
environment 53
heat output of node 39
Load segment 1 indicator 50
Load segment 2 indicator 50
MAP
5150: 2145 UPS-1U 292
5250: repair verification 298
nodes
heat output 39
on or off button 50
on-battery indicator 50
operation 48
overload indicator 50
ports not used 51
power-on indicator 50
service indicator 50
test and alarm-reset button 51
unused ports 51
2145-DH8
additional space requirements 36
air temperature without redundant ac
power 36
dimensions and weight 36
heat output of node 37
humidity without redundant ac
power 36
input-voltage requirements 35
nodes
heat output 37
power requirements for each
node 35
product characteristics 35
requirements 35
specifications 35
weight and dimensions 36
A
about this document
sending comments xv
ac and dc LEDs 33, 34
AC and DC LEDs 33
ac power switch, cabling 44
accessibility 355
accessibility (continued)
repeat rate
up and down buttons 355
repeat rate of up and down
buttons 115
accessing
cluster (system) CLI 68
management GUI 61
publications 355
service assistant 67
service CLI 69
action menu options
front panel display 102
sequence 102
action options
node
create cluster 107
actions
reset service IP address 70
reset superuser password 70
active status 98
adding
nodes 63
address
MAC 101
Address Resolution Protocol (ARP) 12
addressing
configuration node 12
B
back-panel assembly
SAN Volume Controller 2145-CF8
connectors 30
indicators 29
SAN Volume Controller 2145-CG8
connectors 27
indicators 26
SAN Volume Controller 2145-DH8
connectors 24
indicators 24
backing up
system configuration files 264
backup configuration files
deleting
using the CLI 271
restoring 266
bad blocks 273
battery
Charging, front panel display 92
power 93
battery fault LED 17
battery status LED 17
boot
codes, understanding 160
failed 91
progress indicator 91
boot drive
SAN Volume Controller
2145-DH8 157
buttons, navigation 19
C
Call Home 133, 136
Canadian electronic emission notice 362
charging 92
circuit breakers
2145 UPS-1U 51
requirements
SAN Volume Controller
2145-CF8 40
SAN Volume Controller
2145-CG8 38
CLI
service commands 68
system commands 68
when to use 68
CLI commands
lssystem
displaying clustered system
properties 82
cluster (system) CLI
accessing 68
clustered system
restore 261
T3 recovery 261
clustered systems
Call Home email 133, 136
deleting nodes 61
error codes 162
IP address
configuration node 12
IP failover 12
IPv4 address 98
IPv6 address 99
metadata, saving 93
options 98
overview 11
properties 82
recovery codes 162
removing nodes 61
restore 254
T3 recovery 254
codes
node error
critical 160
noncritical 160
node rescue 160
commands
create cluster 72
install software 72
query status 74
reset service assistant password 71
satask.txt 70
snap 71
svcconfig backup 264
svcconfig restore 266
comments, sending xv
configuration
node failover 12
configuration node 12
connecting
2145 UPS-1U 48
connectors
2145 UPS-1U 51
SAN Volume Controller 2145-CF8 30
SAN Volume Controller 2145-CG8 27
SAN Volume Controller
2145-DH8 24
contact information
Taiwan 364
controls and indicators on the front panel
2145 UPS-1U
alarm 50
illustration 49
Load segment 1 indicator 50
Load segment 2 indicator 50
on or off button 50
on-battery indicator 50
overload indicator 50
power-on indicator 50
test and alarm-reset button 51
error LED 20
front-panel display 19
SAN Volume Controller
navigation buttons 19
node status LED 18
select button 20
SAN Volume Controller 2145-CF8
illustration 18
operator-information panel 22
SAN Volume Controller 2145-CG8
illustration 17
operator-information panel 21
SAN Volume Controller 2145-DH8
illustration 15
operator-information panel 20
status indicators
action menu options 102
boot failed 91
boot progress 91
charging 92
error codes 92
hardware boot 92
menu options 96
node rescue request 93
power failure 93
powering off 93
recovering 94
restarting 94
shutting down 94
create cluster
action option 107
create cluster command 72
create clustered system
error codes 162
critical
node errors 160
D
defining FRUs
for the redundant AC-power
switch 58
for the SAN Volume Controller 53
degraded status 98
deleting
backup configuration files
using the CLI 271
nodes 61
detection error
expansion location 247
determining
hardware boot failure 159
SAN problem 245
Deutschsprachiger EU Hinweis 362
diagnosing problems
through error codes 119
through event logs 119
with SAN Volume Controller 119
Disaster recovery
Global Mirror 251
Metro Mirror 251
Stretched Cluster 251
Stretched System 251
disk drive activity LED 23
display on front panel
Change WWNN option 114
Enter Service? option 114
Exit Actions option 116
Exit Service option 114
IPv6 address 99
Node WWNN 100
overview 19
Paced Upgrade option 115
Recover Cluster 115
Rescue Node option 116
Service Address 100
Service DHCPv4 113
Service DHCPv6 113
Set FC Speed option 115
status indicators
action menu options 102
boot failed 91
boot progress 91
charging 92
error codes 92
hardware boot 92
menu options 96
node rescue request 93
power failure 93
powering off 93
recovering 94
restarting 94
shutting down 94
validate WWNN? 95
version 100
displaying
IPv6 address 99
displaying vital product data 81
documentation
improvement xv
drives 246
E
electronic emission notices
Deutschsprachiger EU Hinweis 362
European Union (EU) 362
Federal Communications Commission
(FCC) 361
Germany 362
Industry Canada 362
Japanese Voluntary Control Council
for Interference (VCCI) 364
Korean 365
New Zealand 362
People's Republic of China 363
Taiwan 364
emails
Call Home
event notifications 135
inventory information 136
inventory information 136
EMC statement, People's Republic of
China 363
Enter Service?
option 114
error
expansion enclosure 247
not detected 247
error codes 146
front panel display 92
understanding 137
error event IDs 146
error events 131
error LED 20
errors 246
logs
describing the fields 132
error events 131
managing 131
understanding 131
viewing 132
node 160
Ethernet 351
activity LED 24, 32
link failures 12, 310
link LED 33
MAP 5500 310
port 101
European Union (EU), EMC Directive
conformance statement 362
event IDs 138
event notifications
inventory information email 136
overview 133
events
reporting 130
examples
clusters in SAN fabric 14
redundant ac power switch
cabling 44
Exit Actions
option 116
Exit Service
option 114
expansion enclosure
detection error 247
F
fabric
SAN overview 14
failover, configuration node 12
FCC (Federal Communications
Commission) electronic emission
notice 361
Federal Communications Commission
(FCC) electronic emission notice 361
Fibre Channel
LEDs 32
link failures 248
Fibre Channel (continued)
MAP 316
port menu option 101
port numbers 35
SFP transceiver 248
field replaceable units
redundant AC-power switch
describing 58
SAN Volume Controller
describing 53
disk drive assembly 53
disk drive cables 53
Ethernet cable 53
fan assembly 53
Fibre Channel cable 53
Fibre Channel SFP transceiver 53
frame assembly 53
front panel 53
operator-information panel 53
power cable assembly 53
service controller 53
system board assembly 53
fields
description for the node vital product
data 83
description for the system vital
product data 88
device 83
event log 132
Fibre Channel adapter 83
front panel 83
memory module 83
processor 83
processor cache 83
software 83
system 88
system board 83
uninterruptible power supply 83
fix
errors 255
front panel
2145 UPS-1U 49
action menu options 102
booting 117
buttons and indicators 91
charging 117
display 19
ID 20
interface 76
menu options 96
Ethernet 101
Fibre Channel port-1 through
port-4 101
IPv4 address 98
IPv6 address 99
Language? 116
node 100
version 100
power failure 117
powering off the SAN Volume
Controller 117
recovering 117
SAN Volume Controller 91
front panel display
node rescue request 271
G
gateway
menu option 99
node option 109, 111
Germany electronic emission compliance
statement 362
H
hardware
boot 92, 341
boot failure 159
components 15
failure 92
node 15
help xv
homologation statement 361
I
I/O operations, stopped 93
identification
label, node 20
name 100
number 100
inactive status 98
indicators and controls on the front panel
2145 UPS-1U
alarm 50
illustration 49
Load segment 1 indicator 50
Load segment 2 indicator 50
on or off button 50
on-battery indicator 50
overload indicator 50
power-on indicator 50
test and alarm-reset button 51
error LED 20
SAN Volume Controller
navigation buttons 19
node status LED 18
select button 20
SAN Volume Controller 2145-CF8
illustration 18
operator-information panel 22
SAN Volume Controller 2145-CG8
illustration 17
operator-information panel 21
SAN Volume Controller 2145-DH8
illustration 15
operator-information panel 20
status indicators
action menu options 102
boot failed 91
boot progress 91
charging 92
error codes 92
hardware boot 92
menu options 96
node rescue request 93
power failure 93
powering off 93
recovering 94
restarting 94
shutting down 94
indicators on the rear panel 32
indicators on the rear panel (continued)
10 Gbps Ethernet card 24
ac and dc LEDs 34
AC and DC LEDs 33
Ethernet
activity LED 24, 32
link LED 33
Fibre Channel LEDs 32
power-supply error LED 33, 34
power, location, and system-error
LEDs 33
SAN Volume Controller 2145-CG8
Ethernet activity LED 24
information help xv
information, system
LED 23
informational events 138
install software command 72
interface
front panel 76
inventory information
emails 136
event notifications 133
IP address
cluster 99
cluster (system) 98
IPv6 99
service 112
system 99
IPv4 address 98
IPv6
address 99
gateway menu option 99
prefix mask menu option 99
iSCSI 351
link problems 248, 249
J
Japanese electronic emission notice 364
jumbo frame 351
K
keyboards
accessibility features 355
Knowledge Center xii
Korean electronic emission
statement 365
L
language menu selection options 116
LEDs
ac and dc 33, 34
AC and DC 33
diagnostics 322
disk drive activity 23
Ethernet
activity 24, 32
link 33
Fibre Channel 32
location 24, 33
power 23, 33
power-supply error 33, 34
rear-panel indicators 24, 26, 29
LEDs (continued)
SAN Volume Controller 2145-CF8 29
SAN Volume Controller 2145-CG8 26
SAN Volume Controller
2145-DH8 24
system information 23
system-error 22, 33
light path MAP 322
link failures
Fibre Channel 248
link problems
iSCSI 248, 249
Load segment 1 indicator 50
Load segment 2 indicator 50
locator LED 24
log files
viewing 132
M
MAC address 101
maintenance analysis procedures (MAPs)
10 Gbps Ethernet 313
2145 UPS-1U 292
Ethernet 310
Fibre Channel 316
front panel 307
hardware boot 341
light path 322
overview 275
power
SAN Volume Controller
2145-CG8 288
SAN Volume Controller
2145-DH8 283
repair verification 321
SSD failure 346, 347, 349
start 275
management GUI
accessing 61
shut down node 302
management GUI interface
when to use 60
managing
event log 131
MAP
5000: Start 275
5040: Power SAN Volume Controller
2145-DH8 283
5050: Power SAN Volume Controller
2145-CG8 and 2145-CF8 288
5150: 2145 UPS-1U 292
5250: 2145 UPS-1U repair
verification 298
5320: Redundant AC power 299
5340: Redundant ac power
verification 300
5400: Front panel 307
5500: Ethernet 310
5550: 10 Gbps Ethernet 313
5600: Fibre Channel 316
5700: Repair verification 321
5800: Light path 322
5900: Hardware boot 341
6000: Replace offline SSD 346
6001 Replace offline SSD in a RAID 0
array 347
MAP (continued)
6002: Replace offline SSD in a RAID 1
array or RAID 10 array 349
power off node 302
MAPs (maintenance analysis procedures)
10 Gbps Ethernet 313
2145 UPS-1U 292
2145 UPS-1U repair verification 298
Ethernet 310
Fibre Channel 316
front panel 307
hardware boot 341
light path 322
power
SAN Volume Controller
2145-CF8 288
SAN Volume Controller
2145-CG8 288
SAN Volume Controller
2145-DH8 283
power off 302
redundant ac power 300
redundant AC power 299
repair verification 321
SSD failure 346, 347, 349
start 275
using 275
media access control (MAC) address 101
medium errors 273
menu options
clustered system
IPv4 address 98
IPv4 gateway 99
IPv4 subnet 99
clustered systems
IPv6 address 99
clusters
IPv6 address 99
options 98
reset password 115
status 98
Ethernet
MAC address 101
port 101
speed 101
Fibre Channel port-1 through
port-4 101
front panel display 96
IPv4 gateway 99
IPv6 gateway 99
IPv6 prefix 99
Language? 116
node
options 100
status 100
SAN Volume Controller
active 98
degraded 98
inactive 98
sequence 96
system
gateway 99
IPv6 prefix 99
status 100
message classification 162
migrate 246
migrate drives 246
N
navigation
accessibility 355
buttons 19
create cluster 107
Language? 116
recover cluster 115
New Zealand electronic emission
statement 362
node
create cluster 107
options
create cluster? 107
gateway 111
IPv4 address 107
IPv4 confirm create? 109
IPv4 gateway 109
IPv4 subnet mask 108
IPv6 address 110
IPv6 Confirm Create? 111
IPv6 prefix 110
Remove Cluster? 115
status 100
subnet mask 108
rescue request 93
software failure 283, 288
node canisters
configuration 11
node fault LED 16
node rescue
codes 160
node status LED 16, 18
nodes
adding 63
cache data, saving 93
configuration 11
addressing 12
failover 12
deleting 61
downloading
vital product data 81
failover 12
hard disk drive failure 93
identification label 20
options
main 100
removing 61
rescue
completing 271
viewing
general details 81
noncritical
node errors 160
not used
2145 UPS-1U ports 51
location LED 33
notifications
Call Home information 136
inventory information 136
sending 133
number range 162
O
object classes and instances 145
object codes 145
object types 145
on or off button 50
operator information panel
locator LED 24
system-information LED 23
operator-information panel
disk drive activity LED 23
power button 23
power LED 23
reset button 23
SAN Volume Controller 2145-CF8 22
SAN Volume Controller 2145-CG8 21
SAN Volume Controller
2145-DH8 20
system-error LED 22
overload indicator 50
overview
product 1
redundant AC-power switch 42
SAN fabric 14
vital product data 81
P
Paced Upgrade
option 115
panel
front 19
name 20
operator information
SAN Volume Controller
2145-CF8 22
SAN Volume Controller
2145-CG8 21
SAN Volume Controller
2145-DH8 20
rear
SAN Volume Controller
2145-CF8 29
SAN Volume Controller
2145-CG8 26
SAN Volume Controller
2145-DH8 24
passwords
resetting 115
People's Republic of China, electronic
emission statement 363
physical characteristics
2145 UPS-1U 53
redundant AC-power switch 43
SAN Volume Controller 2145-CF8
connectors 30
service ports 31
unused ports 31
SAN Volume Controller 2145-CG8
connectors 27
service ports 28
unused ports 29
SAN Volume Controller 2145-DH8
connectors 24
service ports 26
unused ports 26
port speed
Fibre Channel 101
ports
Ethernet 24, 32
not used
2145 UPS-1U 51
port names, worldwide 35
port numbers, Fibre Channel 35
SAN Volume Controller 2145-CF8 30
SAN Volume Controller 2145-CG8 27
SAN Volume Controller 2145-DH8 24
POST (power-on self-test) 130
power
button 23
controls 117
failure 93
off
operation 93
requirements
2145-DH8 35
SAN Volume Controller 2145-CF8 40
SAN Volume Controller 2145-CG8 37
restored 93
switch, failure 283, 288
uninterruptible power supply 117
power LED 23
Power MAP 2145-CG8 and 2145-CF8 288
Power MAP SAN Volume Controller 2145-DH8 283
power off 302
power-supply error LED 33, 34
preparing
SAN Volume Controller environment 35
uninterruptible power supply environment 53
protection information 246
publications
accessing 355
Q
query status command 74
R
reader feedback, sending xv
rear-panel indicators
SAN Volume Controller 2145-CF8 29
SAN Volume Controller 2145-CG8 26
SAN Volume Controller 2145-DH8 24
Recover Cluster
option 115
recovering
front panel display 94
offline volumes
using CLI 79, 260
recovery
system
when to run 254
systems
starting 259
redundant ac power switch
cabling 44
examples 44
redundant AC-power switch
environment preparation 43
field replaceable units 58
MAP 299, 300
overview 42
problems 299
specifications 43
verifying 300
related information xii
removing
550 errors 256
578 errors 256
node from a cluster 115
nodes 61
Repair verification MAP 321
repairing
thin-provisioned volume 78
reporting
events 130
requirements
2145 UPS-1U 48
2145-DH8 35
ac voltage 35, 37, 38, 40, 41
circuit breakers 38, 40
electrical 35, 37, 40
power 35, 37, 40
SAN Volume Controller 2145-CF8 40
SAN Volume Controller 2145-CG8 37
Rescue Node
option 116
rescue nodes
completing 271
reset button 23
reset password menu option 115
navigation 115
resetting the password 115
reset service assistant password 71
reset service IP address 70
reset superuser password 70
resetting passwords 115
restore
system 253, 261
S
SAN (storage area network)
fabric overview 14
problem determination 245
SAN Volume Controller
2145 UPS-1U 48
action options
create cluster 107
field replaceable units
4-port Fibre Channel adapter 53
40x40x28 fan 53
40x40x56 fan 53
alcohol wipe 53
CMOS battery 53
disk backplane 53
disk controller 53
disk drive assembly 53
disk drive cables 53
disk power cable 53
disk signal cable 53
Ethernet cable 53
fan assembly 53
fan power cable 53
Fibre Channel adapter assembly 53
Fibre Channel cable 53
Fibre Channel HBA 53
frame assembly 53
front panel 53
memory module 53
microprocessor 53
operator-information panel 53
power backplane 53
power cable assembly 53
power supply assembly 53
riser card, PCI 53
riser card, PCI Express 53
service controller 53
service controller cable 53
system board 53
thermal grease 53
voltage regulator module 53
front-panel display 91
hardware 1
hardware components 15
menu options
Language? 116
node 100
node 15
overview 1
power control 117
power-on self-test 130
preparing environment 35
properties 81
software
overview 1
SAN Volume Controller 2145-CF8
additional space requirements 42
air temperature without redundant ac power 40
circuit breaker requirements 40
connectors 30
controls and indicators on the front panel 18
dimensions and weight 41
heat output of node 42
humidity with redundant ac power 41
humidity without redundant ac power 40
indicators and controls on the front panel 18
input-voltage requirements 40
light path MAP 336
MAP 5800: Light path 336
nodes
heat output 42
operator-information panel 22
ports 30
power requirements for each node 40
product characteristics 40
rear-panel indicators 29
requirements 40
service ports 31
specifications 40
temperature with redundant ac power 41
unused ports 31
weight and dimensions 41
SAN Volume Controller 2145-CG8
additional space requirements 39
air temperature without redundant ac power 38
circuit breaker requirements 38
connectors 27
controls and indicators on the front panel 17
dimensions and weight 39
heat output of node 39
humidity with redundant ac power 38
humidity without redundant ac power 38
indicators and controls on the front panel 17
input-voltage requirements 37
light path MAP 329
MAP 5800: Light path 329
nodes
heat output 39
operator-information panel 21
ports 27
power requirements for each node 37
product characteristics 37
rear-panel indicators 26
requirements 37
service ports 28
specifications 37
temperature with redundant ac power 38
unused ports 29
weight and dimensions 39
SAN Volume Controller 2145-CG8 node
features 11
SAN Volume Controller 2145-DH8
boot drive 157
connectors 24
controls and indicators on the front panel 15
indicators and controls on the front panel 15
light path MAP 323
MAP 5800: Light path 323
operator-information panel 20
ports 24
rear-panel indicators 24
service ports 26
unused ports 26
SAN Volume Controller library
related publications xii
satask.txt
commands 70
Security level 246
self-test, power-on 130
sending
comments xv
serial number 19
service
actions, uninterruptible power supply 48
service address
navigation 112
options 112
Service Address
option 100
service assistant
accessing 67
interface 67
when to use 67
service CLI
accessing 69
when to use 68
service commands
CLI 68
create cluster 72
install software 72
reset service assistant password 71
reset service IP address 70
reset superuser password 70
snap 71
service controller
replacing
validate WWNN 95
Service DHCPv4
option 113
Service DHCPv6
option 113
service ports
SAN Volume Controller 2145-CF8 31
SAN Volume Controller 2145-CG8 28
SAN Volume Controller 2145-DH8 26
Set FC Speed
option 115
shortcut keys
keyboard 355
shutting down
front panel display 94
snap command 71
SNMP traps 133
software
failure, MAP 5050 283, 288
overview 1
version
display 100
space requirements
2145-DH8 36
SAN Volume Controller 2145-CF8 42
SAN Volume Controller 2145-CG8 39
specifications
redundant AC-power switch 43
speed
Fibre Channel port 101
Start MAP 275
starting
clustered system recovery 257
system recovery 259
T3 recovery 257
Statement of Limited Warranty, Where to find the 357
status
active 98
degraded 98
inactive 98
operational 98, 100
storage area network (SAN)
fabric overview 14
problem determination 245
storage systems
restore 253
servicing 250
subnet
menu option 99
subnet mask
node option 108
summary of changes xvii
switches
2145 UPS-1U 51
redundant ac power 42
syslog messages 133
system
backing up configuration file using the CLI 264
diagnose failures 100
IPv6 address 99
restoring backup configuration files 266
system commands
CLI 68
system-error LED 22
systems
adding nodes 63
T
T3 recovery
removing
550 errors 256
578 errors 256
restore
clustered system 253
starting 257
what to check 261
when to run 254
Taiwan
contact information 364
electronic emission notice 364
TCP 351
technical assistance xv
technician port
using 75
test and alarm-reset button 51
trademarks 361
troubleshooting
event notification email 133, 136
SAN failures 245
using error logs 92
using the front panel 91
U
understanding
clustered-system recovery codes 162
error codes 137, 162
event log 131
fields for the node vital product data 83
fields for the system vital product data 88
node rescue codes 160
uninterruptible power supply
2145 UPS-1U
controls and indicators 49
environment 53
operation 48
overview 48
front panel MAP 307
operation 48
overview 47
preparing environment 53
unused ports
2145 UPS-1U 51
SAN Volume Controller 2145-CF8 31
SAN Volume Controller 2145-CG8 29
SAN Volume Controller 2145-DH8 26
USB key
using 69
when to use 69
using 69
CLI 77
error code tables 138
GUI interfaces 59
management GUI 59
service assistant 67
technician port 75
USB key 69
V
validating
volume copies 77
viewing
event log 132
vital product data (VPD)
displaying 81
overview 81
understanding the fields for the node 83
understanding the fields for the system 88
viewing
nodes 81
volume copies
validating 77
volumes
recovering from offline
using CLI 79, 260
VPD (vital product data)
displaying 81
overview 81
understanding the fields for the node 83
understanding the fields for the system 88
W
websites xiv
when to use
CLI 68
management GUI interface 60
service assistant 67
service CLI 68
USB key 69
worldwide node names
change 114
choose 95
display 100
node, front panel display 100, 114
validate, front panel display 95
worldwide port names (WWPNs)
description 35
IBM®
Printed in USA
GC27-2284-10