Sonexion 2000 Replacement Procedures 2.0

Contents
About Sonexion 2000 Replacement Procedures 2.0
Replace a 5U84 EBOD
Replace a 5U84 Fan Module
Replace a 5U84 Backlit Panel Bezel
Replace a 5U84 Side Card
    Check for Errors if LEDs are On
Replace a 5U84 OSS Controller
    Output Example to Check USM Firmware
Replace a 5U84 Chassis
Replace a 2U24 EBOD
Replace a 2U24 Chassis
Replace a Quad Server Disk
    Clean Up Failed or Pulled Drive in Node
Replace a Quad Server MGMT Node
Replace a Quad Server MGS or MDS Node
Replace a Quad Server Chassis
Replace a Cabinet Network Switch
Replace a Cabinet Network Switch PSU
Replace a Cabinet Management Switch (Brocade)
    Configure a Brocade Management Switch
Replace a Cabinet Management Switch PSU (Brocade)
Replace a Cabinet PDU
Replace a Cabinet Power Distribution Strip
Replace a CNG Server Module
    Configure BIOS Settings for CNG Node
    BMC IP Address Table for CNG Nodes
Replace a CNG Chassis
About Sonexion 2000 Replacement Procedures 2.0
This publication describes procedures to remove and replace hardware components in a Sonexion 2000 system
running release 2.0. These are termed field-replaceable units (FRUs). The procedures are intended for Cray
service technicians.
Procedures That Have Been Replaced by Video Versions
Some replacement procedures are being remade as video procedures viewed from a PC connected to the
Sonexion system. In these cases no text-based procedure is included. Instead, field personnel should log in to the
Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps
below to access the service console:
1. Cable a laptop to any available port on any LMN switch (located at the top of the rack).
2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in,
navigate to the service console (http://service:8080). If that URL is not active, then log in to port 8080
of the IP address of the currently active MGMT node (MGMT0):
http://IP_address:8080
where IP_address is the IP address of the currently active (primary) MGMT node.
3. Enter the standard service credentials.
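For example, if the http://service:8080 URL is not active and the active MGMT node has a (hypothetical) IP
address of 172.16.2.1, the service console would be reached at:
http://172.16.2.1:8080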
Typographic Conventions
Monospace
    A monospace font indicates program code, reserved words or library functions, screen output, file names,
    path names, and other software constructs.
Monospaced Bold
    A bold monospace font indicates commands that must be entered on a command line.
Oblique or Italics
    An oblique or italics font indicates user-supplied values for options in the syntax definitions.
Proportional Bold
    A proportional bold font indicates a user interface control, window name, or graphical user interface button
    or control.
Alt-Ctrl-f
    Monospaced hyphenated text typically indicates a keyboard combination.
Record of Revision

Publication Number   Date            Release   Comment
HR5-6131-0           November 2014   1.5
HR5-6131-A           April 2015      2.0
HR5-6131-B           August 2015     2.0       New note for 5U84 procedures
HR5-6131-C           October 2015    2.0       Procedures shared with release 1.5. References to RAS (video)
                                               procedures, replacing some text-based procedures.
Replace a 5U84 EBOD
Prerequisites
Part number
    100843600: Controller Assy, Sonexion EBOD Expansion Module
Time
    1.5 hours
Interrupt level
    Failover (can be applied to a live system with no service interruption, but requires failover/failback)
Tools
    ● Labels (attach to SAS cables)
    ● ESD strap
    ● Serial cable (9-pin to 3.5mm phone plug)
Requirements
    ● Hostnames must be assigned to all OSS nodes in the cluster containing the failed EBOD I/O module
      (available from the customer)
    ● To complete this procedure, the replacement EBOD I/O module must have the correct USM firmware for
      the system. See Sonexion 2000 USM Firmware Update Guide.
About this task
This procedure includes steps to replace the failed EBOD I/O module in the ESU component (SSU+1 or +n
component), verify the operation of the new EBOD I/O module, and return the system to normal operation.
Subtasks:
● Replace EBOD
● Reactivate the Node
A 5U84 ESU enclosure comprises one 5U chassis with two EBOD I/O modules, 82 drives, two power supply units
(PSUs), and five power cooling modules (PCMs).
The EBOD I/O modules (left and right) are located on top of the fan modules, and are accessible from the back of
the rack.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully
redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component
should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin
  down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not
  attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening
  drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the
  equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Use the steps in this procedure to fail over and shut down the node, remove the failed EBOD, and install a new
one.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or
SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion
operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully
redundant state.
Procedure
1. If the location of the failed EBOD I/O module has not been provided, look for an amber Fault LED on the
failed EBOD I/O module and the Module Fault LED on the OCP (Operator Control Panel) on the 5U84
enclosure (front panel).
2. Log in to the primary MGMT node:
$ ssh -l admin primary_MGMT_node
3. Determine whether a failover occurred:
$ sudo cscli fs_info
If a failover occurred, go to step 5. Otherwise, continue to the next step.
4. Fail over the affected OSS resources to its HA partner:
$ cscli failover -n affected_oss_nodename
Wait for the OSS node's resources to fail over to the OSS node's HA partner. To confirm that the failover
operation is complete, run:
$ cscli fs_info
5. Shut down the OSS node:
$ cscli power_manage -n oss_nodename --power-off
Wait for the affected OSS node to completely power off. To confirm that the power-off operation is complete,
run:
$ pm -q
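Example output (hostnames hypothetical) confirming that the affected OSS node is off:
[admin@snx11000n000 ~]$ pm -q
on:
snx11000n[000-003,005]
off:
snx11000n004
unknown: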
Replace EBOD
Perform the following steps from the back of the rack, with required ESD strap.
6. Disconnect both SAS cables from port A and port B of the failed EBOD I/O module. Label the cables so they
can be reconnected to the correct ports on the new EBOD I/O module.
7. Release the module latch by grasping it between the thumb and forefinger and gently squeezing it.
Figure 1. EBOD I/O Latch Operation
8. Using the latch as a handle, carefully remove the failed module from the enclosure.
9. Inspect the new EBOD I/O module for damage, especially to the interface connector.
If the module is damaged, do not install it but obtain another EBOD I/O module.
10. With the latch in the released (open) position, slide the new EBOD I/O module into the enclosure until it
completely seats and engages the latch.
11. Secure the module by closing the latch.
There is an audible click as the latch engages.
The I/O module may take up to 1 minute to re-initialize after the cables are reconnected.
Reactivate the Node
12. Plug the SAS cables into their original ports on the EBOD module, using the labels applied in step 6.
13. Power on the affected OSS node. On the primary MGMT node, run:
$ cscli power_manage -n affected_oss_nodename --power-on
14. Wait for the OSS node to come online. This may take a few minutes. Confirm that the OSS node is online:
$ pdsh -a uname -r | dshbak -c
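Example output (hostnames and kernel version hypothetical) showing all nodes responding:
[admin@snx11000n000 ~]$ pdsh -a uname -r | dshbak -c
----------------
snx11000n[000-005]
----------------
2.6.32-431.el6.x86_64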
15. Verify the following LED status:
● On the new EBOD, the Fault LED is off and the Health LED is lit green.
● On the new EBOD, connected SAS Lane LEDs are ON and ready, with no traffic showing.
● On the OCP located at the front of the 5U84 enclosure, the Module Fault LED is off.
16. When the affected OSS node is back online, fail back its HA resources:
$ cscli failback -n oss_nodename
Confirm that HA Resources are failing back:
$ cscli fs_info
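For example (based on the fs_info samples later in this publication), a completed failback shows each OSS node
again serving its full set of targets, such as 4 / 4 rather than 0 / 4 under Targets.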
17. Verify the status of the SAS Lane LEDs on the new EBOD I/O module:
● ON and FLASHING: the EBOD is active with I/O traffic.
● ON: the EBOD is ready, with no traffic.
18. Connect a serial cable to the new EBOD I/O module and open a terminal session with these settings:
Bits per second: 115200
Data bits: 8
Parity: none
Stop bits: 1
Flow control: none
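One hypothetical way to open such a session from a Linux service laptop (the serial device name depends on the
adapter; screen defaults to 8 data bits, no parity, 1 stop bit):
$ screen /dev/ttyUSB0 115200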
19. Press the Enter key until a GEM> prompt appears. Type ver and press Enter.
20. The USM and GEM firmware versions must agree between the EBOD I/O modules. If they do not match,
consult Cray Hardware Product Support to obtain the correct files, and use the procedures in the Cray
publication Sonexion USM Firmware Update Guide. When the firmware versions match, proceed to the next
step.
21. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the
new controller.
The 5U84 enclosure EBOD I/O module FRU procedure is complete.
Replace a 5U84 Fan Module
Prerequisites
Part number
100842900: Fan Tray Assy, Sonexion SSU
Time
30 minutes
Interrupt level
Live (can be applied to a live system with no service interruption)
Tools
    ● ESD strap (recommended)
    ● Host names assigned to the two OSS nodes in the SSU that contains the failed fan module (available
      from the customer)
About this task
The SSU contains a 5U84 enclosure, two controllers, two PSUs, and five fan modules. Each SSU controller hosts
one OSS node; there are two OSS nodes per SSU. Within an SSU, the OSS nodes are organized in an HA pair.
Each fan module contains two fans that are numbered sequentially: the first fan module contains fan 0 and fan 1,
the second fan module contains fan 2 and fan 3, and so on, as shown in the table below. The fan modules
themselves are not individually numbered, but are ordered left to right (as viewed from the rear). Fan module 1 is
the leftmost canister, fan module 5 is the rightmost canister, and fan modules 2, 3, and 4 are in between.
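This numbering gives the following module-to-fan mapping:

Fan module (left to right)   Fans
1                            0, 1
2                            2, 3
3                            4, 5
4                            6, 7
5                            8, 9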
Subtasks:
● Locate the Failed Fan Module
● Replace the Fan Module
This procedure includes steps to verify the operation of the new fan module.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully
redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component
should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin
  down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not
  attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening
  drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the
  equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or
SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion
operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully
redundant state.
Procedure
1. If the location of the failed fan module is known, skip to Replace the Fan Module.
2. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP
address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in
the following table.
Table 1. Settings for MGMT Connection

Parameter         Setting
Bits per second   115200
Data bits         8
Parity            None
Stop bits         1
Flow control      None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown
in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 2. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows:
login as: admin
admin@172.30.72.42’s password: password
Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
[admin@snx11000n000 ~]$
4. Change to root user:
$ sudo su -
Locate the Failed Fan Module
5. Access GEM on an OSS node in the SSU that contains the failed fan module and enter:
# conman OSS_host_name -gem
where OSS_host_name is the host name of one of the OSS nodes in the SSU that contains the failed fan
module.
Obtain the OSS node names from the customer.
6. At the GEM prompt, enter:
GEM> fan_set
7. If the following error is not issued from the preceding command, proceed to Replace the Fan Module.
1+03:58:30.358 S1 GEM> fan_set
The following steps can be run only from the master.
8. Exit from the OSS node.
9. Access GEM on the other OSS node (HA partner) and enter:
# conman nodename -gem
where nodename is the host name of the other OSS node in the SSU that contains the failed fan
11
Replace a 5U84 Fan Module
10. At the GEM prompt, enter:
GEM> fan_set
The following fan_set output sample shows status or RPM speed for the ten fans (two per module):
GEM> fan_set
Fan 0, speed = 13520.
Fan 1, speed = 13920.
Fan 2, status = STATUS_DEVICE_REMOVED.
Fan 3, status = STATUS_DEVICE_REMOVED.
Fan 4, speed = 13728.
Fan 5, speed = 14016.
Fan 6, speed = 13584.
Fan 7, speed = 13872.
Fan 8, speed = 13536.
Fan 9, speed = 13888.
In the above command output:
● Fans 0 and 1 refer to fan module 1, fans 2 and 3 refer to fan module 2, and so on.
● Speeds in the 13K-14K range indicate a potential problem. The 8K range is typical.
● Depending on the cause of the fan module failure, a status other than STATUS_DEVICE_REMOVED may
  appear.
● The STATUS_DEVICE_REMOVED status indicates a fan failure.
Replace the Fan Module
11. Release the fan module by grasping the handle and pushing down on the orange latch.
12. Using the handle, carefully remove the fan module's canister from the enclosure bay.
13. Insert the new fan module into the enclosure bay and slide it in until the module completely seats and resets
the latch. The new fan module powers up.
14. When the fan reaches full speed, enter the fan_set command on the OSS node associated with the new fan
module:
GEM> fan_set
The following example fan_set output shows the current RPM speed of each fan:
Fan 0, speed = 8208.
Fan 1, speed = 8144.
Fan 2, speed = 8448.
Fan 3, speed = 8416.
Fan 4, speed = 8224.
Fan 5, speed = 8208.
Fan 6, speed = 8416.
Fan 7, speed = 8448.
Fan 8, speed = 8496.
Fan 9, speed = 8448.
15. Exit GEM:
GEM> &.
16. Verify that the fan_set output is normal and shows fan speeds for all fan modules without errors or
warnings.
Replace a 5U84 Backlit Panel Bezel
Prerequisites
Part number

Sonexion Model   Part Number   Description
Sonexion 2000    101205500     Panel Assy, Back-lit Sonexion 2000 SSU Fascia
Sonexion 1600    100901300     Panel Assy, Back-lit Sonexion 1600 SSU Fascia
Sonexion 1600    101337700     Panel Assy Fry, Back-lit Sonexion 1600 SSU Fascia
Time
1 hour
Interrupt level
Interrupt (requires disconnecting the Lustre clients from the filesystem)
Tools
    ● Serial cable
    ● 2mm recessed Allen hex socket screwdriver or torque driver
    ● Special plastic tool
    ● T-20 Torx screwdriver
    ● ESD strap
About this task
The modular SSU incorporates a chassis and two controllers. The chassis is a 5U84 enclosure that contains 84
disks.
This procedure includes steps to take the OSS nodes offline, remove and replace the failed bezel, bring the OSS
nodes online, and return the system to normal operation.
Subtasks:
● Remove the Defective Bezel
● Install the New Bezel
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully
redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component
should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin
  down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not
  attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening
  drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the
  equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
The following procedure requires disconnecting the Lustre clients from the filesystem.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or
SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion
operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully
redundant state.
Procedure
1. If the logical locations (hostnames) of the two OSS nodes associated with the affected SSU are known, skip
to Remove the Defective Bezel.
2. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP
address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in
the following table.
Table 2. Settings for MGMT Connection

Parameter         Setting
Bits per second   115200
Data bits         8
Parity            None
Stop bits         1
Flow control      None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown
in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 3. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows:
login as: admin
admin@172.30.72.42’s password: password
Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
[admin@snx11000n000 ~]$
4. Change to root user:
$ sudo su -
5. Connect a serial cable to one of the controllers in the affected SSU (the serial port is on the rear panel).
6. Log in to one of the OSS nodes hosted on the controller (username admin and the customer's password).
7. Determine the hostnames of the OSS nodes hosted on the SSU containing the failed backlit panel bezel:
# sudo crm_mon
Leave the serial cable connected. This connection is used in the following steps.
Remove the Defective Bezel
8. Power off the system as described in Power Off Sonexion 2000.
9. Attach an ESD wrist strap and use it continuously.
10. Power off both Power Supply Units (PSUs) in the SSU (located below the controllers) and remove the power
cords from the PSUs.
11. Unlock the SSU drawer containing the failed bezel. Using a T-20 Torx screwdriver, rotate the handle lock
screw to the unlocked position (opposite from the lock icon).
12. Locate four 2mm Allen screws (one above and one below the handle lock screw, on both sides).
13. Using an Allen screwdriver, detach the bezel from the drawer by loosening and removing each 2mm screw, as
shown in the following figure. Place screws in a safe location.
Figure 4. Remove Allen Screws from Bezel
14. Unlatch the SSU drawer and open it slightly for better access to the bezel, as shown in the following figure.
Figure 5. Unlatch the SSU Drawer
15. Use the following steps to disengage the five tabs along the top of the molded bezel. The molded bezel
has five plastic tabs along each of the top and bottom edges. To remove the bezel from the SSU drawer, it is
necessary to slightly compress these tabs to disengage them. This step releases only the tabs on the top edge
of the bezel. The tabs on the bottom edge are disengaged in step 16.a.
a. Using the special plastic tool, start at the upper right-hand edge and gently pry the first tab with a slight
twisting motion, as shown in the following figure. The first tab will not release completely until the tool
reaches the second tab, and so forth.
Figure 6. Remove Tab to Release Bezel
b. Using the plastic tool, continue to disengage each tab across the top of the bezel.
16. Remove the failed bezel from the SSU drawer:
a. Rotate the bezel downward from the top edge and lift away from the drawer enough to allow access to
remove the interface cable. This disengages tabs along the bottom edge.
Caution: Use care when lifting the bezel away from the drawer; an interface cable attaches the bezel to
the drawer.
b. Remove the screw that secures the cable retaining plate, as shown in the following figure.
Figure 7. Remove Cable Retaining Plate Securing Screw
c. Remove the cable retaining plate from the cable by separating the metal plate from the rubber grommet,
as shown in the following figure.
Figure 8. Remove Cable Retaining Plate
d. Set the retaining plate and screw aside for re-installation.
e. Disconnect the interface cable from the socket, as shown in the following figure.
Figure 9. Disconnect the Interface Cable
Do not touch the printed circuit board component.
Install the New Bezel
17. Reconnect the interface cable to the PCB plug with the gold pins facing up, as shown in the following figure.
Figure 10. Reconnect Interface Cable
18. Insert the grommet of the cable assembly into the retaining plate that was set aside earlier, as shown in the
following figure.
Figure 11. Inserting the Cable Assembly Grommet
19. Install the screw that secures the cable retaining plate. Torque to 1.1 Nm (Newton meters) or 9.74 in-lbf
(pound-force inches).
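For reference, 1 N·m ≈ 8.851 in·lbf, so 1.1 N·m ≈ 9.74 in·lbf; likewise, the 0.5 N·m used for the Allen screws in
step 22 converts to approximately 4.43 in·lbf.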
20. At the front of the drawer, position the bezel so that the five lower tabs engage in the bottom edge and rotate
the bezel into position at the top edge.
21. Press in on the bezel near each tab on the top until all tabs are fully engaged.
22. As shown in the following figure, attach the bezel to the drawer by re-installing the four 2mm Allen screws and
torque them to 0.5 Nm (4.43 in-lbf).
Figure 12. Attaching the Bezel to the SSU Drawer
If a torque driver is unavailable, snug the screws.
23. Close the SSU drawer.
24. Reconnect the power cords to the two PSUs in the affected SSU (rear panel).
25. Power on each PSU by moving its ON/OFF switch to the ON position. The PSUs power on and the OSS
nodes hosted on the SSU start to boot.
Each OSS node can take 5-30 minutes to fully boot.
26. Verify that the new backlit panel bezel LEDs are illuminated and functioning correctly.
27. Power on the system as described in Power On Sonexion 2000.
28. Disconnect the serial cable from the controller in the affected SSU and the console (or PC).
29. Using the package from the replacement backlit panel bezel, re-package the failed part and return it to
Logistics using the authorized procedures.
Replace a 5U84 Side Card
Prerequisites
Part number
    ● 100934200: PCB Assy, Sonexion SSU Side-card Left
    ● 100934300: PCB Assy, Sonexion SSU Side-card Right
Time
    1 hour
Interrupt level
    Interrupt (requires disconnecting Lustre clients from the filesystem)
Requirement
    Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits,
    no parity, and 1 stop bit)
Tools
    ● Serial cable
    ● 2mm recessed Allen hex socket screwdriver or torque driver
    ● T-20 Torx screwdriver
    ● ESD strap, boots, garment, or other approved protection
    ● SSU reset tool (P/N 101150300)
About this task
A chassis and two controllers are bundled in the modular SSU (5U84 enclosure). The chassis contains 84 disk
drives, divided into two drawers with 42 drives each. A drawer contains two side cards, known as the left side
card (LH) and right side card (RH).
This procedure includes steps to take the SSU’s OSS nodes offline, remove and replace the side card, bring the
OSS nodes online, and return the Sonexion system to normal operation.
Subtasks:
● Remove Side Card
● Install Side Card
● Power On the Affected SSU
● Check for Errors if LEDs are On
The following procedure requires disconnecting the Lustre clients from the filesystem.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully
redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component
should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin
  down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not
  attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening
  drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the
  equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Figure 13. OSS Operator’s Panel
Table 3. OSS Operator's Panel Controls and Indicators

Unit Identification Display
    Dual 7-segment display used to provide feedback to the user. Its primary usage is to display an enclosure
    unit identification number to assist in setting up and maintaining multiple enclosure systems. The Unit
    Identification Display is configured via a VPD option. By default, the display is OFF. If the VPD selects use
    of the display, the 7-segment display is ON and displays the number stored in VPD.
Mute/Input Switch
    Used to set the Unit Identification Display and transition alarm states.
Power On/Standby LED (Green/Amber)
    Lights green when system power is available. Lights amber when only standby power is available.
Module Fault LED (PSU/Cooling Fan/SBB Status) (Amber)
    Lights when a system hardware fault occurs. It may be associated with a fault LED on a PSU or an I/O
    module; helps the user to identify the faulty component.
Logical Fault LED (Amber)
    Indicates a state change or a fault from something other than the enclosure management system. The
    condition may be caused by an internal or external source and communicated to the enclosure (normally
    via SES). The fault is usually associated with a disk drive. LEDs at each disk drive position help the user
    to identify the affected drive.
Drawer 1 Fault (Amber)
    Drive cable or side card fault.
Drawer 2 Fault (Amber)
    Drive cable or side card fault.
Procedure
1. If the failed side card is not known, determine the SSU containing the card using the front panel and drawer
indicators shown in the preceding figure and table. At the front of the rack, look for an amber Drawer 1 Fault
or Drawer 2 Fault on the operator’s panel, indicating the drawer containing the defective side card.
2. In the drawer containing the lit Drawer Fault LED, check the Sideplane Fault LED (on the Drawer Status LEDs
panel). See the following figure and table. If this LED is lit, the drawer contains the failed side card.
Figure 14. Drawer Status LEDs

Table 4. Drawer Status LED Key

Status                        Power LED  Fault LED  Cable Fault  Drawer Fault  Activity Bar
                              (Green)    (Amber)    LED (Amber)  LED (Amber)   Graph (Green)
Sideplane Card OK/Power Good  On         Off        Off          Off           -
Sideplane Card Fault          Off        On         -            -             Off
Drive Fault                   Off        -          -            On            Off
Cable Fault                   Off        -          On           -             Off
Drive Activity                On         Off        Off          Off           On
3. Log in to the primary MGMT node:
[client]$ ssh -l admin primary_MGMT_node
4. Log in to the SSU containing the failed side card, using one of the following methods:
● If logging into the SSU via the primary MGMT node (preferred method), enter:
  [admin@n000]$ ssh OSS_nodename
● Connect to the Management node using either approach described in Connect to the MGMT Node.
5. Open a terminal session with these settings:
● Bits per second: 115200
● Data bits: 8
● Parity: none
● Stop bits: 1
● Flow control: none
Set terminal emulation to VT100+.
6. Connect a serial cable to one of the OSS controllers in the affected SSU (serial port is on the rear panel).
7. Log in to the OSS node hosted on the OSS controller in the affected SSU (username admin and the
customer's password).
8. Stop the Lustre file system:
[admin@n000]$ cscli unmount -f fsname
Example:
[admin@snx11000n000 tmp]$ sudo /opt/xyratex/bin/cscli unmount -f snx11000
unmount: stopping snx11000 on snx11000n[002-003]...
unmount: stopping snx11000 on snx11000n[004-005]...
unmount: snx11000 is stopped on snx11000n[002-003]!
unmount: snx11000 is stopped on snx11000n[004-005]!
unmount: File system ssetest is unmounted
In the following output, a zero in the first column under Targets shows the file system is not mounted.
[admin@snx11000n000 tmp]$ sudo /opt/xyratex/bin/cscli fs_info
Information about "snx11000" file system:
Node          Node type   Targets   Failover partner   Devices
snx11000n005  oss         0 / 4     snx11000n004       /dev/md1, /dev/md3, /dev/md5, /dev/md7
snx11000n004  oss         0 / 4     snx11000n005       /dev/md0, /dev/md2, /dev/md4, /dev/md6
snx11000n003  mds         0 / 1     snx11000n002       /dev/md66
snx11000n002  mgs         0 / 0     snx11000n003
9. Verify that Lustre has stopped on the OSS nodes in the affected SSU:
[n0xy]$ sudo pdsh -g oss "mount -t lustre | wc -l" | dshbak -c
Example:
[admin@snx11000n004 ~]$ sudo pdsh -g oss "mount -t lustre | wc -l" | dshbak -c
----------------
snx11000n[004-105]
----------------
0
A result of 0 shows that Lustre is not mounted on any OSS node.
10. Log in to one of the OSS nodes in the affected SSU:
[admin@n0xx]$ ssh oss_node
where oss_node is one of the OSS nodes in the affected SSU.
11. To determine which side card has failed, first run the following on each OSS controller in the affected SSU:
[OSS controller]$ sudo sg_map -i -x | grep XYRATEX
12. Check the command output for missing expanders. If the even-numbered OSS node is missing an expander,
the right side-card has failed and needs to be replaced. If the odd-numbered OSS node is missing an
expander, then the left side-card has failed and needs to be replaced.
The command output shows what appears to be double the expected amount of output; that is, two of each
system device and two of each expander. This effect is caused by the SAS cable link on the OSS controller.
Example output for the even-numbered OSS node (no missing expanders):
[admin@snx11000n004 ~]$ sudo sg_map -i -x | grep XYRATEX
/dev/sg4 7 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg6 8 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg33 7 0 15 0 3 XYRATEX DEFAULT-SD-R24 3519
/dev/sg36 8 0 15 0 3 XYRATEX DEFAULT-SD-R24 3519
/dev/sg91 7 0 44 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg94 8 0 44 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg149 7 0 73 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg152 8 0 73 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg179 7 0 88 0 3 XYRATEX DEFAULT-SD-R24 3519
/dev/sg181 8 0 88 0 3 XYRATEX DEFAULT-SD-R24 3519
Example output for the odd-numbered OSS node (no missing expanders):
[admin@snx11000n005 ~]$ sudo sg_map -i -x | grep XYRATEX
/dev/sg1 7 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg30 7 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg45 7 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg60 7 0 59 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg89 7 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg90 8 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg119 8 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg134 8 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg149 8 0 59 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg178 8 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
Example output for an OSS node missing an expander: an empty row indicates a missing expander (in this
case /dev/sg60; /dev/sg149 is the duplicate of /dev/sg60 on the second SAS path, not an additional missing
expander):
[admin@snx11000n005 ~]$ sudo sg_map -i -x | grep XYRATEX
/dev/sg1 7 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg30 7 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg45 7 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg60
/dev/sg89 7 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg90 8 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg119 8 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg134 8 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg149
/dev/sg178 8 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
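As a quick sanity check (a sketch, not part of the official procedure), the expander entries can also be counted on
each node; a healthy odd-numbered node lists eight left-hand expanders, so a lower count points to a missing
expander:
[admin@snx11000n005 ~]$ sudo sg_map -i -x | grep -c "DEFAULT-SD-L"
8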
Remove Side Card
13. Exit the OSS node and return to the primary MGMT node:
[n0xy]$ exit
14. Power off the OSS node pair in the affected SSU:
[admin@n000]$ cscli power_manage -n nodexxx-xxy --power-off
where nodexxx-xxy indicates the OSS nodes in the HA pair.
Example:
[admin@snx11000n000 ~]$ pm -n snx11000n[004-005]
Command completed successfully
[admin@snx11000n000 ~]$ pm -q
on:
snx11000n[000-003]
off:
snx11000n[004-005]
unknown:
15. Once the OSS node pair is powered off, shut down the affected SSU by turning off the power switches (at the
back of the rack) on the PSUs (located below the OSS controllers).
16. Remove the power cords from the PSUs.
17. Attach an ESD wrist strap and use it at all times.
18. Unlock the SSU drawer containing the failed side card. Use the T-20 Torx screwdriver to rotate the handle
lock screw to the unlocked position (opposite from the lock icon).
19. Push and hold the drawer latches inward while pulling the drawer all the way out until it locks.
Figure 15. Open the Drawer
Once the drawer locks in the open position, the side cards are accessible, as shown in the following figure.
Figure 16. Side Card Exploded View
20. Using the 2mm hex driver, loosen the captive fasteners on the safety cover, shown in the above figure. Place
the cover in a safe place, as it will be reused.
21. Unplug the three power cables attached to the side card and release them from their retaining clips, by
squeezing the sides of the cable connectors and firmly pulling. The connectors are shown in the preceding
figure.
22. Unplug the two SAS cables. If necessary, use a spring hook to lift the female connector casing to release the
hooks located on the male connector. Do not bend the metal receiver too much or the connector will not grab
when reinserted.
Figure 17. SAS Cables Male/Female Connector
23. Allow the power and SAS cables to hang down freely out of the way.
24. Pull the drawer out to the limit (past where it settles in the locked position). While the drawer is held out,
grasp the power connectors on the side card and gently ease the side card off the drawer, getting the card's
right-angle bracket past the end of the drawer frame.
It may be necessary to apply alternating pressure (an up-and-down, rocking motion) to remove the side card,
but be careful not to bend it.
Figure 18. Ease Side Card from drawer
Install Side Card
25. Check for any bent pins. Again holding the drawer at the limit of travel, position the new side card for
mounting by getting the right-angle bracket past the drawer frame, then align its three square connectors
with the receptacles on the drawer.
26. Press the new side card from the rear of the three connectors onto the drawer chassis. Be careful to press
only above the connector area to avoid flexing the card.
Figure 19. Press the new Side Card onto Drawer Chassis
27. Install the white cable retaining clips included with the new side card. There are two different sizes of clips
(three long clips and one short clip), and they are installed in specific locations; see the following figure.
Figure 20. Clip and Connector Locations
28. Connect the two SAS cables to the SAS connectors at the back of the new side card.
29. Connect the three power cables to the power cable connectors on the new side card.
30. Secure the power cables by clipping them to the white retaining clips.
31. Align the safety cover over the new side card, ensuring that the power cables are not trapped beneath the
cover.
32. Using the 2mm hex driver, tighten the captive fasteners.
● Tighten the silver-headed fasteners to 0.5 Nm.
● Tighten the black-headed fasteners to 0.9 Nm.
33. Perform a visual inspection to ensure no cables are outside of the safety cover, and the cover is installed flat
against the drawer chassis.
34. Close the drawer by pulling and holding both white latches on the sides and pushing in the drawer slightly.
35. Release the white latches and ensure that they have returned to their original position.
36. Push the drawer completely into the 5U84 enclosure.
37. If required, use the Torx driver to lock the drawer.
38. Reconnect the power cords to the two affected PSUs in the SSU (rear panel).
Power On the Affected SSU
39. At the back of the rack, place the PSU switches to the ON position.
Wait 2 minutes to allow the OSS controllers to come online.
40. Power on using one of the following methods.
● Using a pen or appropriate tool, press the Power On/Off button on the rear panel of the OSSs in the SSU.
● Issue a command from the primary MGMT node to power on the OSS nodes:
  1. Log in to the primary MGMT node:
     [client]$ ssh -l admin primary_MGMT_node
  2. Power on the OSS nodes:
     [admin@n000]$ cscli power_manage -n nodexxx-xxy --power-on
     Example:
     [admin@snx11000n000 ~]$ cscli power_manage -n snx11000n[004-005] --power-on
     Command completed successfully
     [admin@snx11000n000 ~]$ pm -q
     on:
     snx11000n[000-005]
     off:
     unknown:
Figure 21. OSS Rear Panel on Sonexion 2000
41. Verify that the OSS nodes have booted:
[admin@n000]$ pdsh -g oss date
Example:
[admin@snx11000n000 ~]$ pdsh -g oss date
snx11000n004: Mon Apr 1 11:30:50 PDT 2013
snx11000n005: Mon Apr 1 11:30:50 PDT 2013
42. Use the crm_mon utility to verify the status of the OSS nodes and that the STONITH high-availability (HA)
resource has started.
a. Log in to an even-numbered OSS node via SSH:
[admin@n000]$ ssh OSS_node_hostname
b. Display the node status:
[admin@n000]$ sudo crm_mon -1
43. Continue to run the crm_mon utility until the output verifies that the OSS nodes are running and that
STONITH has started.
Following is a partial sample of output showing the OSS nodes are online.
[admin@snx11000n000 ~]$ sudo pdsh -w snx11000n004 crm_mon -1r
snx11000n004: ============
snx11000n004: Last updated: Mon Apr 1 11:46:31 2013
snx11000n004: Last change: Mon Apr 1 10:47:56 2013 via cibadmin on snx11000n005
snx11000n004: Stack: Heartbeat
snx11000n004: Current DC: snx11000n005 (ee6d3c80-0d49-4fa2-a013-3b4991ed6a2f) partition with quorum
snx11000n004: Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
snx11000n004: 2 Nodes configured, unknown expected votes
snx11000n004: 55 Resources configured.
snx11000n004: ============
snx11000n004:
snx11000n004: Online: [ snx11000n004 snx11000n005 ]
snx11000n004:
snx11000n004: Full list of resources:
snx11000n004:
snx11000n004: snx11000n004-stonith (stonith:external/gem_stonith): Started snx11000n004
snx11000n004: snx11000n005-stonith (stonith:external/gem_stonith): Started snx11000n005
snx11000n004: snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004
snx11000n004: snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005
snx11000n004: baton (ocf::heartbeat:baton): Started snx11000n004
snx11000n004: snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n004: snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
44. Log in to the primary MGMT node via SSH:
[admin@n000]$ ssh -l admin primary_MGMT_node_hostname
45. Mount the Lustre file system:
[admin@n000]$ cscli mount -f fsname
Example:
[admin@snx11000n000 ~]$ cscli mount -f snx11000
mount: MGS is starting...
mount: MGS is started!
mount: starting ssetest on snx11000n[002-003]...
mount: starting ssetest on snx11000n[004-005]...
mount: snx11000 is started on snx11000n[002-003]!
mount: snx11000 is started on snx11000n[004-005]!
mount: File system snx11000 is mounted.
[root@snx11000n000 ~]# /opt/xyratex/bin/cscli fs_info
Information about "snx11000" file system:
Node          Node type   Targets   Failover partner   Devices
snx11000n005  oss         4 / 4     snx11000n004       /dev/md1, /dev/md3, /dev/md5, /dev/md7
snx11000n004  oss         4 / 4     snx11000n005       /dev/md0, /dev/md2, /dev/md4, /dev/md6
snx11000n003  mds         1 / 1     snx11000n002       /dev/md66
snx11000n002  mgs         0 / 0     snx11000n003
46. Check the USM firmware version running on the new side card. As viewed from the front of the SSU, if the
right side card was changed, run the following command on the left (or primary) side card (even-numbered
ESM controller). If the left side card was changed, run the following command on the right (or secondary) side
card (odd-numbered ESM controller).
[admin@n000]$ sudo conman node_name -gem
a. When the GEM command prompt is visible, type: ver
To list all commands type: help all
Output example:
19+05:16:40.965 M0 GEM> ver
Canister firmware : 3.5.0.25
Canister firmware date : May 1 2014 11:41:17
Canister bootloader : 5.03
Canister config CRC : 0xD1B030A4
Canister VPD structure : 0x06
Canister VPD CRC : 0xEE3504B4
Canister CPLD : 0x17
Canister chip : 0x80050002
Canister SDK : 3.06.01-B028
Midplane VPD structure : 0x0F
Midplane VPD CRC : 0x0E74E375
Midplane CPLD : 0x05
PCM 1 firmware : 2.24|2.17|2.00
PCM 2 firmware : 2.20|2.12|2.00
PCM 1 VPD structure : 0x03
PCM 2 VPD structure : 0x03
PCM 1 VPD CRC : 0x486003DF
PCM 2 VPD CRC : 0x486003DF
Fan Controller 0 config : 0960837-04_0
Fan Controller 0 deviceFW : UCD90124A|2.3.6.0000|110809
Fan Controller 1 config : 0960837-04_0
Fan Controller 1 deviceFW : UCD90124A|2.3.6.0000|110809
Fan Controller 2 config : 0960837-04_0
Fan Controller 2 deviceFW : UCD90124A|2.3.6.0000|110809
Fan Controller 3 config : 0960837-04_0
Fan Controller 3 deviceFW : UCD90124A|2.3.6.0000|110809
Fan Controller 4 config : 0960837-04_0
Fan Controller 4 deviceFW : UCD90124A|2.3.6.0000|110809
Battery 1 firmware : Not present
Battery 2 firmware : Not present
Sled 1 Element 0 Firmware : 3.5.0.25|BL=6.10|FC=0x11EEFEAF|VR=0x06|
VC=0xF027D510|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028
Sled 1 Element 1 Firmware : 3.5.0.25|BL=6.10|FC=0x687FE3F1|VR=0x06|
VC=0x2737A206|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028
Sled 1 Element 2 Firmware : 3.5.0.25|BL=6.10|FC=0xFA337B92|VR=0x06|
VC=0x152E4AD9|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028
Sled 1 Element 3 Firmware : 3.5.0.25|BL=6.10|FC=0x6591A901|VR=0x06|
VC=0x38B21CF7|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028
Sled 2 Element 0 Firmware : 3.5.0.25|BL=6.10|FC=0x11EEFEAF|VR=0x06|
VC=0xF027D510|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028
Sled 2 Element 1 Firmware : 3.5.0.25|BL=6.10|FC=0x687FE3F1|VR=0x06|
VC=0x2737A206|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028
Sled 2 Element 2 Firmware : 3.5.0.25|BL=6.10|FC=0xFA337B92|VR=0x06|
VC=0x152E4AD9|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028
Sled 2 Element 3 Firmware : 3.5.0.25|BL=6.10|FC=0x6591A901|VR=0x06|
VC=0x38B21CF7|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028
19+05:17:56.232 M0 GEM>
In the above output, the side-card firmware versions are listed last in the Sled 1 Element and Sled 2
Element lines.
b. To exit GEM, type: &.
47. Any discrepancies in the Sled 1 and Sled 2 element lines require a USM firmware upgrade to bring the new
side card to the latest firmware version. Refer to Sonexion USM Firmware Update Guide for the model of your
system.
48. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the
new controller.
If LEDs are illuminated on the front panel after powering up the system, check for errors using the following
procedure, Check for Errors if LEDs are On.
Check for Errors if LEDs are On
About this task
During installation of a 5U84 side card, if any LEDs are lit on the affected SSU's front panel after powering up the
Sonexion system, use this procedure to check for errors in the enclosure. Allow several minutes for the OSS
nodes in the affected SSU to come back online and start operations. Typically, the InfiniBand connection is ready
as soon as it comes online.
Procedure
1. Check if the InfiniBand connection is ready after it comes online. Log in to one of the OSS nodes in the
SSU and run:
[admin@n000]$ sudo crm_mon -1
2. Once the OSS nodes are back online, visually inspect the front of the SSU and verify the LED status.
3. If all LEDs are green, proceed to step 9. If one or more LEDs are not green, proceed to the following step.
4. Power off the SSU again, remove both side-card covers, and verify all cable connections and the seating of
the side card.
5. Power on the SSU. If all LEDs are green, skip to step 9, as the amber LED could indicate a different issue. If
one or more LEDs are not green, proceed to the next step.
6. Log in to GEM on one of the OSS nodes in the affected SSU:
conman nodenameXXX-gem
7. Create a log file on the MGS node and place it in /var/log/conman:
GEM> ddump
This log dump runs for an extended period. The log file can be located by hostname and date/time stamp.
This file contains information to help determine what is causing the problem and can be sent to Seagate
Support for examination and root cause analysis.
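One hypothetical way to locate the most recent GEM log from the MGMT node (file names combine the hostname
and a -gem suffix, as in /var/log/conman/snx11000n002-gem.log):
[admin@n000]$ ls -lt /var/log/conman/ | head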
8. As a final step, if any extra side cards are available, install a different one in the SSU to determine if the LED
behavior is different and results in the OSS nodes coming back online with all LEDs lit green on the front of
the affected SSU.
9. From the primary MGMT node, connect to the even-numbered OSS node in the affected SSU:
[admin@n000]$ conman nodenameXXX-gem
This OSS node is hosted on the left-hand side card (as viewed from the rear of the chassis).
10. When connected, at the GEM prompt run:
GEM> gncli 3,all,all phydump
Verify that all PHYs labeled ‘Drive’ show the speed as 6Gbps as indicated below on the 36-port expanders.
PHY| Type  | Index | Flags | State | Speed  | Type | WWN
------------------------------------------------------------------
 0 |Port   |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 1 |Port   |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 2 |Port   |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 3 |Port   |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 4 |Port   |   1   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 5 |Port   |   1   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 6 |Port   |   1   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 7 |Port   |   1   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 8 |Drive  |  20   |       | E L S |6.0Gbps| SAS  |5000c50025e067c1
 9 |Drive  |  23   |       | E L S |6.0Gbps| SAS  |5000c50025e06775
10 |Drive  |  26   |       | E L S |6.0Gbps| SAS  |5000c50025dff939
11 |Drive  |  27   |       | E L S |6.0Gbps| SAS  |5000c50025e05e71
12 |Drive  |  24   |       | E L S |6.0Gbps| SAS  |5000c50025dce9c9
13 |Drive  |  25   |       | E L S |6.0Gbps| SAS  |5000c50025dce0a9
14 |Drive  |  21   |       | E L S |6.0Gbps| SAS  |5000c50025dccba1
15 |Drive  |  22   |       | E L S |6.0Gbps| SAS  |5000c50025e03dfd
16 |Drive  |  18   |       | E L S |6.0Gbps| SAS  |5000c50025e07b49
17 |Drive  |  19   |       | E L S |6.0Gbps| SAS  |5000c50025e066cd
18 |Drive  |  15   |       | E L S |6.0Gbps| SAS  |5000c50025e08a95
19 |Drive  |  16   |       | E L S |6.0Gbps| SAS  |5000c50025e0513d
20 |Drive  |  14   |       | E L S |6.0Gbps| SAS  |5000c50025e08789
21 |Drive  |  17   |       | E L S |6.0Gbps| SAS  |5000c50025e06995
22 |Drive  |   6   |       | E L S |6.0Gbps| SAS  |5000c50025dfbb7d
23 |Drive  |   9   |       | E L S |6.0Gbps| SAS  |5000c50025e041d9
24 |Drive  |  12   |       | E L S |6.0Gbps| SAS  |5000c50025e05d1d
25 |Drive  |  13   |       | E L S |6.0Gbps| SAS  |5000c50025dfbbd5
26 |Drive  |  10   |       | E L S |6.0Gbps| SAS  |5000c50025e06099
27 |Drive  |  11   |       | E L S |6.0Gbps| SAS  |5000c50025e03a59
28 |Drive  |   7   |       | E L S |6.0Gbps| SAS  |5000c50025e04531
29 |Drive  |   8   |       | E L S |6.0Gbps| SAS  |5000c50025e04155
30 |Drive  |   4   |       | E L S |6.0Gbps| SAS  |5000c50025dfb115
31 |Drive  |   5   |       | E L S |6.0Gbps| SAS  |5000c50025e04349
32 |Drive  |   1   |       | E L S |6.0Gbps| SAS  |5000c50025e03c8d
33 |Drive  |   2   |       | E L S |6.0Gbps| SAS  |5000c50025dc83ed
34 |Drive  |   0   |       | E L S |6.0Gbps| SAS  |5000c50025e03fa1
35 |Drive  |   3   |       | E L S |6.0Gbps| SAS  |5000c50025e067a9
And as shown here on the 24 port expanders:
PHY| Type   | Index | Flags | State | Speed  | Type | WWN
------------------------------------------------------------------
 0 |Port    |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 1 |Port    |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 2 |Port    |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 3 |Port    |   0   |       | E L   |6.0Gbps| SAS  |50050cc10ab11cbf
 4 |Port    |   1   |       | E     |       |      |
 5 |Port    |   2   |       | E     |       |      |
 6 |Port    |   3   |       | E     |       |      |
 7 |Port    |   4   |       | E     |       |      |
 8 |Port    |   5   |       | E     |       |      |
 9 |Port    |   6   |       | E     |       |      |
10 |Drive   |   6   |       | E L S |6.0Gbps| SAS  |5000c50025e03efd
11 |Drive   |   9   |       | E L S |6.0Gbps| SAS  |5000c50025e00505
12 |Drive   |  12   |       | E L S |6.0Gbps| SAS  |5000c50025e06561
13 |Drive   |  13   |       | E L S |6.0Gbps| SAS  |5000c50025e08a41
14 |Drive   |  10   |       | E L S |6.0Gbps| SAS  |5000c50025e03c81
15 |Drive   |  11   |       | E L S |6.0Gbps| SAS  |5000c50025e0849d
16 |Drive   |   7   |       | E L S |6.0Gbps| SAS  |5000c50025e03de1
17 |Drive   |   8   |       | E L S |6.0Gbps| SAS  |5000c50025e05b19
18 |Drive   |   4   |       | E L S |6.0Gbps| SAS  |5000c50025e03b65
19 |Drive   |   5   |       | E L S |6.0Gbps| SAS  |5000c50025e08471
20 |Drive   |   1   |       | E L S |6.0Gbps| SAS  |5000c50025e07ed9
21 |Drive   |   2   |       | E L S |6.0Gbps| SAS  |5000c50025e05da9
22 |Drive   |   0   |       | E L S |6.0Gbps| SAS  |5000c50025e0672d
23 |Drive   |   3   |       | E L S |6.0Gbps| SAS  |5000c50025e06361
24 |Virtual |   0   |       | E L   |       | SAS  |50050cc10d01093e
11. Verify that no errors are seen on the link:
GEM> gncli 3,all,all ddump_phycounters
12. Wait for 10 minutes.
13. Verify again that no PHY errors are seen using the following command:
GEM> gncli 3,all,all ddump_phycounters
14. Run report_faults on the OSS module and verify that no errors are reported.
Example:
GEM> report_faults
Drive Manager faults
No faults
Environmental Control faults
No faults
General Service faults
10 component(s) registered with fault tracker
Local faults:
No faults
Remote faults:
No faults
Human Interface Device faults
Ops Panel status:
  Logic Fault LED  : OFF
  Module Fault LED : OFF
Warning:
  Local: Remote:
  ****No HID warnings to report****
No alarms to report
RemoteSync Client faults
No faults
Processor service faults
Board 0
Status: No faults
Board 1
Status: No faults
Power Manager faults
Running in minimal redundant mode
Sled Manager faults
No faults
15. Verify that all PHYs are at 6Gbps and there are no faults.
If the data for the PHY error count is scrolling beyond the available buffer, review the GEM command results
and output in the GEM log file for the currently connected node. The log file is located in /var/log/conman
(for example, /var/log/conman/snx11000n002-gem.log).
16. Exit GEM:
GEM> &.
Replace a 5U84 OSS Controller
Prerequisites
Part number
101171000: Controller Assy, Sonexion 2000 Turbo 32GB 2648L V2 FRU
Time
1.5 hours
Interrupt levels
● To remove and replace an SSU controller: Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)
● When a USM firmware update is needed: Interrupt (requires taking the Lustre file system offline; perform a USM upgrade only if the firmware version is out of date)
Tools
● Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)
● Serial cable
● Reset tool (P/N 101150300)
● ESD strap
Requirements
● The replacement OSS controller must have the correct USM firmware for your system. Consult Cray Hardware Product Support to obtain the correct files, and see Sonexion 2000 USM Firmware Update Guide.
● When replacing ESM controllers in an SSU, the operator must only use controllers that have been received from Cray. Operators must not reuse controllers from one system to another, nor move controllers from one slot within a system to another.
● Before performing this operation, obtain a separate, stand-alone 5U84 (or similar) SSU that is not part of the existing cluster. Contact Cray Support to make arrangements.
About this task
This procedure includes steps to replace the failed controller and verify the operation of the new controller, and a
procedure to check firmware in the new controller's Object Storage Server (OSS) and update it, if necessary.
Subtasks:
● Wipe HA Data from SSD on the OSS Controller
● Fail Over Controller and Verify State
● Replace the Controller
● Verify New Controller Function
● Output Example to Check USM Firmware on page 54
A chassis and two controllers are bundled in the modular SSU. Each controller hosts one OSS node; there are
two OSS nodes per SSU. Within an SSU, OSS nodes are organized in an HA pair with sequential numbers (for
example, n004 / n005). If an OSS node goes down because its controller fails, its resources migrate to the HA
partner/OSS node in the other controller.
A downed OSS node cannot be reached directly by the Sonexion system. Several steps in this procedure involve
logging into the HA partner (on the other controller) to determine the downed node's status and whether its
resources have successfully failed over to the HA partner.
● If the resources have failed over but the downed node is still online, it is placed into standby mode.
● If the resources have not failed over and the downed node is still online, it is placed into standby mode, which causes its resources to fail over.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully
redundant state to a non-redundant state, replace the faulty component as soon as possible to return the SSU to
a fully redundant state and maintain uninterrupted ClusterStor operations.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or
SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion
operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully
redundant state.
Procedure
Wipe HA Data from SSD on the OSS Controller
If the canister state is known, and it has been verified that canisters have not been used for lab or testing
purposes, proceed to Fail Over Controller and Verify State. Perform the following steps on canisters that
cannot be verified as new.
1. Obtain a copy of a Live USB image for the release of Sonexion software in use on the system. Contact Cray
Support for the location of the image.
2. Burn the Live USB image onto a USB drive with a capacity of at least 8 GB. The Linux dd command is used
to burn the Live USB image (cs15_live_usb.img) onto the USB drive (/dev/sdb). For example:
dd if=cs15_live_usb.img of=/dev/sdb bs=512k
To run dd and find the device file for the USB drive, refer to the instructions for the operating system.
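On most Linux workstations, the device file can be identified by listing block devices before and after inserting the USB drive; this is a generic sketch (the device name /dev/sdb is an example, and writing to the wrong device is destructive):
$ lsblk -o NAME,SIZE,MODEL
$ sudo dd if=cs15_live_usb.img of=/dev/sdb bs=512k conv=fsync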
3. Insert the controller to be downgraded into a powered off stand-alone SSU enclosure.
4. Insert the USB drive into one of the USB ports on the new controller.
5. Connect a serial cable from the console or PC to the new controller (serial port is on the rear panel).
6. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP
address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in
the following table.
Table 5. Settings for MGMT Connection
Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown
in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 22. Monitor and Keyboard Connections on the MGMT Node
7. Log on to the MGMT node (n000) as admin using the related password, as follows:
login as: admin
admin@172.30.72.42’s password: password
Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
[admin@snx11000n000 ~]$
8. Change to root user:
$ sudo su -
9. Power on the enclosure. If the controller does not power on automatically, use the OSS reset tool to press the
Power switch.
Figure 23. Controller Power and Reset Switches
10. Once the controller starts booting, interrupt the boot procedure by pressing the Esc key. The boot options
menu appears.
If necessary, press the Esc key several times until the boot options menu is shown.
11. From the menu, select an option that allows booting from a USB drive.
12. Wait for the operating system to boot completely to the shell prompt.
13. Locate the SSD’s device file:
[node]$ sudo sg_map -i -x | grep SanDisk
For example:
[node]$ sudo sg_map -i -x | grep SanDisk
/dev/sg0  3 0 0 0  0  /dev/sda  ATA  SanDisk SSD U100  10.5
If this command does not produce any output, confirm that /dev/sda is the internal SSD:
[node]$ sudo sg_map -i -x
Or contact Cray support if there are any doubts.
14. Using the device file, run the following dd commands to wipe the HA data. First run:
[node]$ sudo dd if=/dev/zero of=/dev/sdXX bs=1M count=8192 oflag=direct
For example:
[node]$ sudo dd if=/dev/zero of=/dev/sda bs=1M count=8192 oflag=direct
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 73.0721 s, 118 MB/s
Then run:
[node]$ sudo dd if=/dev/zero of=/dev/sdXX bs=1M seek=22341 oflag=direct
The expected output for this command is: No space left on device.
For example:
[node]$ sudo dd if=/dev/zero of=/dev/sda bs=1M seek=22341 oflag=direct
dd: writing `/dev/sda': No space left on device
8193+0 records in
8192+0 records out
8590811136 bytes (8.6 GB) copied, 72.362 s, 119 MB/s
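Optionally, read back the start of the device to confirm the wipe. This is a minimal sketch assuming /dev/sda is the SSD; the expected result is a single line of zero bytes:
[node]$ sudo dd if=/dev/sda bs=1M count=1 2>/dev/null | od -An -v -tx1 | sort -u
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00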
15. Remove the serial cable from the new controller that connects to the console or PC.
16. Remove the USB drive.
17. Remove the controller from the powered off stand-alone SSU enclosure.
Fail Over Controller and Verify State
If the canister state is unknown, or if canisters might have been used for lab or testing purposes, first execute
the steps in Wipe HA Data from SSD on the OSS Controller.
When an OSS node is down, it cannot be reached directly by the Sonexion system. Use the following steps
to log in to the HA partner on the other controller to determine the downed node's status (placing it into
standby mode if necessary), and make sure its resources have failed over before replacing the failed
controller.
18. Determine the physical and logical location (hostname) of the failed controller in the SSU.
19. Log in to the primary MGMT node via SSH.
[Client]$ ssh -l admin primary_MGMT_node
20. Determine which node is acting as the primary MGMT node:
[admin@n000]$ cscli show_nodes
For example:
[admin@n000]$ cscli show_nodes
------------------------------------------------------------------------------------------
                                 Power  Service                               HA
Hostname            Role         State  state    Targets  HA Partner          Resources
------------------------------------------------------------------------------------------
snx11000n000        MGMT         On     -----    0 / 0    snx11000n001        -----
snx11000n001        (MGMT)       On     -----    0 / 0    snx11000n000        -----
snx11000n002        (MDS),(MGS)  On     N/a      0 / 0    snx11000n003        None
snx11000n003        (MDS),(MGS)  On     Stopped  0 / 1    snx11000n002        Local
snx11000n004        (OSS)        Off    Stopped  0 / 1    snx11000n005        Local
snx11000n005        (OSS)        On     Stopped  0 / 1    snx11000n004        Local
snx11000n006        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n007        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n008        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n009        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n010        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n011        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n012        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
snx11000n013        (CIFS),(NFS) On     Stopped  0 / 0    snx11000n[006-013]  -----
------------------------------------------------------------------------------------------
In the example above, the secondary MGMT node is identifiable by the parentheses around it (in row
snx11000n001), while the primary MGMT node has no parentheses in row snx11000n000.
If the incorrect MGMT node was used to run cscli show_nodes, the following error message is issued:
[MGMT01]$ cscli show_nodes
cscli: Please, run cscli on active management node
If this occurs, log in to the other MGMT node via SSH.
21. From the primary MGMT node, fail over the resources from the affected OSS node to its HA partner:
[admin@n000]$ cscli failover -n nodes
Where nodes are the names of the node(s) that require failover.
For example, if there is a failure on OSS node n004 and its resources need to fail over to node n005 on
cluster TEST, the command is as follows:
[admin@n000]$ cscli failover -n n004
NOTE: Use the following steps to verify the state of the failed controller. When an OSS node is
down, it cannot be reached directly by the Sonexion system. You must log in to the HA partner on
the other controller to determine the downed node's status (placing it into standby mode if
necessary), and make sure its resources have failed over before replacing the failed controller.
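If the downed node must be placed into standby manually, the Pacemaker shell on the HA partner can be used. This is a hedged sketch (crmsh syntax may differ between releases; the hostname is an example):
[nodeXY]$ sudo crm node standby snx11000n005
[nodeXY]$ sudo crm_mon -1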
22. Log in via SSH to the HA partner of the OSS node that is hosted on the failed controller.
[admin@n000]$ ssh HA_partner_node_hostname
23. Use the crm_mon utility to display the status of both OSS nodes:
[nodeXY]$ sudo crm_mon -1
When both OSS nodes are online with their resources assigned to them, the crm_mon -1 output looks as
follows:
[admin@snx11000n004 ~]$ sudo crm_mon -1
============
Last updated: Wed Jul 30 13:12:06 2014
Last change: Wed Jul 30 13:10:02 2014 via crm_resource on snx11000n005
Stack: Heartbeat
Current DC: snx11000n004 (191d0bb0-80da-4715-b2c9-618af928458a) - partition
with quorum
Version: 1.1.6.1-6.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
35 Resources configured.
============
Online: [ snx11000n004 snx11000n005 ]
snx11000n005-1-ipmi-stonith (stonith:external/ipmi): Started snx11000n005
snx11000n005-2-ipmi-stonith (stonith:external/ipmi): Started snx11000n005
snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
Clone Set: cln-kdump-stonith [kdump-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-ssh-10-stonith [ssh-10-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-gem-stonith [gem-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-ssh-stonith [ssh-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-phydump-stonith [phydump-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-last-stonith [last-stonith]
Started: [ snx11000n004 snx11000n005 ]
snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n004
snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n005
baton (ocf::heartbeat:baton): Started snx11000n005
Clone Set: cln-diskmonitor [diskmonitor]
Started: [ snx11000n004 snx11000n005 ]
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
Resource Group: snx11000n004_md0-group
snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md1-group
snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005
snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
When the OSS node on the failed controller is in standby mode and its resources have failed over to its HA
partner, the crm_mon -1 output looks like this:
[admin@snx11000n004 ~]$ sudo crm_mon -1
2 Nodes configured, unknown expected votes
35 Resources configured.
============
Online: [ snx11000n004 ]
OFFLINE: [ snx11000n005 ]
snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
Clone Set: cln-kdump-stonith [kdump-stonith]
Started: [ snx11000n004 ]
Stopped: [ kdump-stonith:1 ]
Clone Set: cln-ssh-10-stonith [ssh-10-stonith]
Started: [ snx11000n004 ]
Stopped: [ ssh-10-stonith:1 ]
Clone Set: cln-gem-stonith [gem-stonith]
Started: [ snx11000n004 ]
Stopped: [ gem-stonith:1 ]
Clone Set: cln-ssh-stonith [ssh-stonith]
Started: [ snx11000n004 ]
Stopped: [ ssh-stonith:1 ]
Clone Set: cln-phydump-stonith [phydump-stonith]
Started: [ snx11000n004 ]
Stopped: [ phydump-stonith:1 ]
Clone Set: cln-last-stonith [last-stonith]
Started: [ snx11000n004 ]
Stopped: [ last-stonith:1 ]
snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n004
baton (ocf::heartbeat:baton): Started snx11000n004
Clone Set: cln-diskmonitor [diskmonitor]
Started: [ snx11000n004 ]
Stopped: [ diskmonitor:1 ]
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
Resource Group: snx11000n004_md0-group
snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md1-group
snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
In the above example, the OSS node on the failed controller is snx11000n005 and its HA partner is
snx11000n004.
Replace the Controller
Use the following steps to collect the ddump and BMC logs (if possible), power off the failed controller, and
remove it.
24. From the primary MGMT node run:
[admin@n000]$ conman nodeXX-gem
Where nodeXX is the failed node.
The GEM command prompt then appears. For example:
[admin@snx11000n000]$ conman snx11000n005-gem
<ConMan> Connection to console [snx11000n005gem] opened.
11+01:54:53.622 M0 GEM>
25. At this GEM prompt, run:
GEM> ddump
This takes more than 7 minutes to complete.
The output from this command is placed in /var/log/conman.
For example:
------ 1 root root 2730380 Jul 25 04:10 snx11000n005-gem.log
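Once ddump completes, the log can be checked and copied off for Cray Support. This is an optional sketch following the nodeXX-gem.log naming convention shown above (the hostname is an example):
[admin@n000]$ ls -l /var/log/conman/snx11000n005-gem.log
[admin@n000]$ cp /var/log/conman/snx11000n005-gem.log ~/ddump_snx11000n005_`date +%s`.log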
26. Exit GEM by typing the conman escape sequence:
GEM> &.
27. If the failed controller does not respond to IPMI commands, retrieve the BMC sel list as follows, if it
needs to be collected separately.
[admin@n000]$ ssh nodeXX ipmitool sel list
For example:
[admin@snx11000n000 /]$ ssh snx11000n005 ipmitool sel list
1 | 07/24/2014 | 13:25:13 | Event Logging Disabled #0x10 | Log area reset/cleared | Asserted
2 | 07/24/2014 | 13:25:14 | System Event #0x20 | Timestamp Clock Sync | Asserted
3 | 07/24/2014 | 13:25:16 | System Event #0x20 | Timestamp Clock Sync | Asserted
4 | 07/24/2014 | 13:25:20 | System Event #0x20 | Timestamp Clock Sync | Asserted
5 | 07/24/2014 | 13:25:21 | System Event #0x20 | Timestamp Clock Sync | Asserted
28. If the failed controller responds to IPMI commands, collect BMC logs from the primary MGMT node as
follows, if it needs to be collected separately:
[admin@n000]$ ipmitool -U admin -P admin -H nodeXY-ipmi sel list | tee -a ~/bmc_logs_nodeXY_`date +%s`
The on-screen output and the file contents are the same. The file is stored in the admin home
directory (/home/admin).
For example:
[admin@snx11000n000 /]$ ipmitool -U admin -P admin -H snx11000n005-ipmi sel list | tee -a ~/bmc_logs_snx11000n005_`date +%s`
1 | 07/24/2014 | 13:25:13 | Event Logging Disabled #0x10 | Log area reset/cleared | Asserted
2 | 07/24/2014 | 13:25:14 | System Event #0x20 | Timestamp Clock Sync | Asserted
3 | 07/24/2014 | 13:25:16 | System Event #0x20 | Timestamp Clock Sync | Asserted
4 | 07/24/2014 | 13:25:20 | System Event #0x20 | Timestamp Clock Sync | Asserted
5 | 07/24/2014 | 13:25:21 | System Event #0x20 | Timestamp Clock Sync | Asserted
Contact Cray Support if any problems are encountered at any stage when retrieving these logs.
29. Power off the failed controller by running the following from the primary MGMT node:
[admin@n000]$ cscli power_manage -n nodeXX --power-off
30. Unplug the two RJ-45 network cables.
IMPORTANT: Make certain to mark which cable is connected to each port. The ports are numerically
labeled. Make certain the cables are connected to the same ports when the new OSS controller is
installed.
31. Unplug the InfiniBand (or 40GbE) cable.
32. Unplug the SAS cables if using SSU + n configuration.
33. Unplug the LSI HBA SAS Loopback cable from the two SAS ports.
34. Remove the failed controller from the SSU (using the locking lever to slide out the controller from the back of
the rack).
35. Check the new controller for a dust cover cap and remove the one that corresponds to the InfiniBand cable
removed earlier (Port II).
36. Insert the new controller halfway into the SSU, but do not seat it in the enclosure.
37. Connect the cables to the new controller.
a. Plug in the two RJ-45 network cables to their original ports.
b. Plug in the InfiniBand (or 40GbE) cable to the original port it came from (Port II).
c. Plug in the SAS cables if using SSU + n configuration.
d. Plug in the LSI HBA SAS Loopback cable between the two SAS ports.
38. Connect a serial cable from the console or PC to the new controller (serial port is on the rear panel).
39. Open a terminal session using either method described in Connect to the MGMT Node. This serial connection
allows monitoring the boot and discovery process but is not needed to complete the procedure.
40. Completely insert the new controller into the SSU (until the locking lever engages and the unit is properly
seated in the chassis) and power on the controller. Use the Reset Tool (P/N 101150300) to press the Power
On/Off button (hold for 3 seconds, and then release) on the rear panel of the OSS.
As shown in the following figure, the controller has two buttons, which are not labeled. The button on the left
is the power button, while the button on the right is the reset button.
Figure 24. Controller Power and Reset Switches
When a serial connection is being monitored, the following steps normally occur during replacement:
a. The new controller reboots into discovery mode, initiating discovery.
b. The controller automatically reboots with the correct hostname and restores the HA configuration.
c. The controller reboots again and becomes completely operational.
41. Wait for the discovery procedure to complete (this will take about 10 minutes).
IMPORTANT: If the replacement controller has been removed or swapped from another system,
follow the procedure described in Wipe HA Data from SSD on the OSS Controller. Otherwise the
auto-discovery process can fail due to leftover HA data or a duplicate BMC IP address set on the
FRU.
42. Log in to the OSS node hosted on the controller (username admin and the customer's password).
Verify New Controller Function
43. After the controller reboots, verify that the new controller is online:
[snx11000nxxx]$ sudo crm_mon -1
When both OSS nodes are online with their resources assigned to them, the node status line changes from
the following (as seen in step 23 on page 45):
Online: [ snx11000n004 ]
OFFLINE: [ snx11000n005 ]
To:
Online: [ snx11000n004 snx11000n005]
For example:
[admin@snx11000n004 ~]$ sudo crm_mon -1
2 Nodes configured, unknown expected votes
35 Resources configured.
============
Online: [ snx11000n004 snx11000n005]
snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
Clone Set: cln-kdump-stonith [kdump-stonith]
Started: [ snx11000n004 ]
Stopped: [ kdump-stonith:1 ]
Clone Set: cln-ssh-10-stonith [ssh-10-stonith]
Started: [ snx11000n004 ]
Stopped: [ ssh-10-stonith:1 ]
Clone Set: cln-gem-stonith [gem-stonith]
Started: [ snx11000n004 ]
Stopped: [ gem-stonith:1 ]
Clone Set: cln-ssh-stonith [ssh-stonith]
Started: [ snx11000n004 ]
Stopped: [ ssh-stonith:1 ]
Clone Set: cln-phydump-stonith [phydump-stonith]
Started: [ snx11000n004 ]
Stopped: [ phydump-stonith:1 ]
Clone Set: cln-last-stonith [last-stonith]
Started: [ snx11000n004 ]
Stopped: [ last-stonith:1 ]
snx11000n004_mdadm_conf_regenerate
(ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004
baton (ocf::heartbeat:baton): Started snx11000n004
Clone Set: cln-diskmonitor [diskmonitor]
Started: [ snx11000n004 ]
Stopped: [ diskmonitor:1 ]
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
Resource Group: snx11000n004_md0-group
snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md1-group
snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
44. If the crm_mon -1 node status still shows the new node as offline after 15 minutes, verify the state of the
xybridge driver and xyvnic0 port:
[snx11000nxxx]$ ifconfig xyvnic0
Where snx11000nxxx is the node hosted on the new controller. Compare the output against the following example.
For example:
[admin@snx11000n004 ~]$ ifconfig xyvnic0
xyvnic0 Link encap:Ethernet HWaddr 80:00:E0:5B:7B:C2
inet addr:203.0.113.1 Bcast:203.0.113.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:900 Metric:1
RX packets:41151 errors:0 dropped:0 overruns:0 frame:0
TX packets:40752 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9632555 (9.1 MiB) TX bytes:9711629 (9.2 MiB)
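If the ifconfig output differs, the interface error counters can also be inspected with standard iproute2 tooling; this is a generic check, not specific to the xyvnic driver:
[snx11000nxxx]$ ip -s link show xyvnic0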
45. If the output differs significantly from the example, update the xybridge firmware and restart the xyvnic driver as follows.
a. Determine if a xybridge firmware update is available:
[snx11000nxxx]$ sudo xrtx_bridgefw -c
The output will be similar to the following if an update is available:
[admin@snx11000n004 ~]$ sudo xrtx_bridgefw -c
Current:c3822a46 Update:98229a3e
b. Update the xybridge firmware, if an update is available:
[snx11000nxxx]$ sudo xrtx_bridgefw -u
c. Reboot the affected node:
[snx11000nxxx]$ sudo reboot
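The check/update/reboot sequence in step 45 can be expressed as one guarded sketch, assuming (as the example above suggests) that xrtx_bridgefw -c prints an Update: field only when a newer image is available:
[snx11000nxxx]$ if sudo xrtx_bridgefw -c | grep -q Update: ; then
> sudo xrtx_bridgefw -u && sudo reboot
> fi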
46. If the output from step 44 on page 51 matches the example but no HA is present, attempt to ping the
neighbor controller over the xyvnic0 interface. For even-numbered nodes the IP address is always 203.0.113.1,
and for odd-numbered nodes the address is always 203.0.113.2:
[snx11000nxxx]$ ping 203.0.113.2
For example:
[admin@snx11000n004 ~]$ ping 203.0.113.2
PING 203.0.113.2 (203.0.113.2) 56(84) bytes of data.
64 bytes from 203.0.113.2: icmp_seq=1 ttl=64 time=0.786 ms
64 bytes from 203.0.113.2: icmp_seq=2 ttl=64 time=0.947 ms
^C
— 203.0.113.2 ping statistics —
2 packets transmitted, 2 received, 0% packet loss, time 1742ms
rtt min/avg/max/mdev = 0.786/0.866/0.947/0.085 ms
a. If the above response appears, attempt to reboot the affected node. If there is no response from the
neighbor controller, check the firmware of xybridge and possibly update it. Follow all substeps in step 45
on page 51.
b. If all of the above steps failed, power-cycle the SSU whose controller was replaced. At this point it may be
necessary to shut down the filesystem first or stop the IO (if any).
47. Log in to the node with the new controller via SSH:
[admin@n000]$ ssh -l admin node_with_new_controller
48. Check the USM firmware version running on the new controller:
[n004]$ sudo /lib/firmware/gem_usm/xyr_usm_sbb-onestor_r3.20_rc1_rel-470/fwtool.sh -c
Partial sample output:
root: XYRATEX:AMI_FW: Root Current Ver: 1.44.0000
root: XYRATEX:AMI_FW: Root New Ver: 1.44.0000
root: XYRATEX:AMI_FW: Root Backup Ver: 1.44.0000
root: XYRATEX:AMI_FW: Boot Current Ver: 1.40.0005
root: XYRATEX:AMI_FW: Boot New Ver: 1.40.0005
root: XYRATEX:AMI_FW: BIOS Current Ver: 0.39.0006
root: XYRATEX:AMI_FW: BIOS New Ver: 0.39.0006
root: XYRATEX:AMI_FW: BIOS Backup Ver: 0.37.0000
root: XYRATEX:AMI_FW: CPLD Current Ver: 0.16.0001
root: XYRATEX:AMI_FW: CPLD New Ver: 0.16.0001
root: XYRATEX:FWTOOL: Version checking done
To see the complete output from the previous fwtool.sh -c command, see Output Example to Check
USM Firmware on page 54.
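One way to compare firmware levels between the two controllers is to capture the fwtool.sh -c report from each and diff the results. This is a sketch (hostnames are examples, and serial numbers and WWNs will legitimately differ between controllers):
[admin@n000]$ FWTOOL=/lib/firmware/gem_usm/xyr_usm_sbb-onestor_r3.20_rc1_rel-470/fwtool.sh
[admin@n000]$ for n in snx11000n004 snx11000n005; do ssh $n sudo $FWTOOL -c > /tmp/fw_$n.txt; done
[admin@n000]$ diff /tmp/fw_snx11000n004.txt /tmp/fw_snx11000n005.txt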
If the firmware version on the new controller is the same as the other controller, go to step 50 on page 53.
49. Update the firmware version on the new controller if it is running a different version. Refer to Sonexion 2000
USM Firmware Update Guide.
50. From the primary MGMT node, fail back the resources to balance the load between the affected nodes:
[admin@n000]$ cscli failback -n nodes
Where nodes are the names of the node(s) that previously failed over.
For example:
[admin@n000]$ cscli failback -n n004
51. After failback completes, verify that both OSS nodes are online with their resources balanced:
[snx11000nxxx]$ sudo crm_mon -1
The command should return output similar to the following:
============
Last updated: Wed Jul 30 13:12:06 2014
Last change: Wed Jul 30 13:10:02 2014 via crm_resource on snx11000n005
Stack: Heartbeat
Current DC: snx11000n004 (191d0bb0-80da-4715-b2c9-618af928458a) - partition
with quorum
Version: 1.1.6.1-6.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
35 Resources configured.
============
Online: [ snx11000n004 snx11000n005 ]
snx11000n005-1-ipmi-stonith (stonith:external/ipmi): Started snx11000n005
snx11000n005-2-ipmi-stonith (stonith:external/ipmi): Started snx11000n005
snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004
Clone Set: cln-kdump-stonith [kdump-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-ssh-10-stonith [ssh-10-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-gem-stonith [gem-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-ssh-stonith [ssh-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-phydump-stonith [phydump-stonith]
Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-last-stonith [last-stonith]
Started: [ snx11000n004 snx11000n005 ]
snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n004
snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n005
baton (ocf::heartbeat:baton): Started snx11000n005
Clone Set: cln-diskmonitor [diskmonitor]
Started: [ snx11000n004 snx11000n005 ]
snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
Resource Group: snx11000n004_md0-group
snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004
snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004
snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004
Resource Group: snx11000n004_md1-group
snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005
snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005
snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005
snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005
snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005
snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005
52. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the
new controller.
Output Example to Check USM Firmware
The following is the complete output for step 48 on page 52 in Replace a 5U84 OSS Controller on page 40.
[admin@snx11000n005 ~]$ sudo /lib/firmware/gem_usm/xyr_usm_sbb-onestor_r3.20_rc1_rel-470/fwtool.sh -c
root: XYRATEX:FWTOOL: Searching for Canisters...
root: XYRATEX:FWTOOL: Found 1 Canisters:
root: XYRATEX:FWTOOL: PID: UD-8435-CS-9000. WWN: 50050CC10E05547E
Xyratex FWDownloader v3.51
Scanning SES devices...Please Wait...2 SES devices found.
----------SES Device 0 Addr='/dev/sg1'
Found SES Page 1 for Device 0 (/dev/sg1) size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
--------------------SES Device 1 Addr='/dev/sg90'
Found SES Page 1 for Device 1 (/dev/sg90) size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
----------Performing Check /dev/sg1 UD-8435-CS-9000
Checking canister in Slot 1
Check Canister /dev/sg1 UD-8435-CS-9000
Xyratex FWDownloader v3.51
3.51 Apply rule file rf_cs6000_check_slot1.gem to download file to Enclosure /
dev/sg1 using SES page 0x0E.
----------SES Device Addr='/dev/sg1'
Found SES Page 1 for Device /dev/sg1 size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb)
InterpreterVersion : 2.33
PackageVersion     : 2.0
DESCRIPTION:
Show Sati 2 GEM Version and Titan (xyr_usm_sbb-onestor_r3.20_rc1_rel-470)
VERSION:
1.0.1 Copyright 2012 Xyratex Inc.
Enclosure Services Controller Electronics BootLoader
SHOW elec_2; Boot Loader revision : 0503 MATCH Upgrade version : 0503
Enclosure Services Controller Electronics local Firmware
SHOW elec_2; Firmware revision : 03050019 MATCH Upgrade version : 03050019
Enclosure Services Controller Electronics CPLD
SHOW elec_2; CPLD revision : 17 MATCH Upgrade version : 17
Enclosure Services Controller Electronics Flash Config
SHOW elec_2; Flash Config data CRC : d1b030a4 MATCH Upgrade version : d1b030a4
Enclosure Services Controller Electronics VPD CRC
SHOW elec_2; VPD CRC : ee3504b4 MATCH Upgrade version : ee3504b4
Performing Check
Check Midplane /dev/sg1 UD-8435-CS-9000
Check Fans /dev/sg1 UD-8435-CS-9000
Check Sideplane /dev/sg1 UD-8435-CS-9000
Check Sideplane /dev/sg1 UD-8435-CS-9000
Xyratex FWDownloader v3.51
3.51 Apply rule file rf_5u84_mid_check.gem to download file to Enclosure /dev/
sg1 using SES page 0x0E.
----------SES Device Addr='/dev/sg1'
Found SES Page 1 for Device /dev/sg1 size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb)
InterpreterVersion : 2.33
PackageVersion     : 2.0
DESCRIPTION:
Show GEM Version on 5u84 and NEO 3000
VERSION:
1.0.1 Copyright 2012 Xyratex Inc.
Enclosure cpld
SHOW elec_1; CPLD revision : 05 MATCH Upgrade version : 05
Enclosure vpdcrc
SHOW elec_1; VPD CRC : 0e74e375 MATCH Upgrade version : 0e74e375
Xyratex FWDownloader v3.51
3.51 Apply rule file rf_5u84_fan_check.gem to download file to Enclosure /dev/
sg1 using SES page 0x0E.
----------SES Device Addr='/dev/sg1'
Found SES Page 1 for Device /dev/sg1 size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb)
InterpreterVersion : 2.33
PackageVersion     : 2.0
DESCRIPTION:
Show GEM Version on 5u84 and NEO 3000
VERSION:
1.0.1 Copyright 2012 Xyratex Inc.
Fan Slot1 Controller Config
SHOW elec_1; Controller Config : 0960837-04_0 MATCH Upgrade version :
0960837-04_0
Fan Slot1 Controller Firmware
SHOW elec_1; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade
version : ucd90124a|2.3.6.0000|110809
Fan Slot2 Controller Config
SHOW elec_2; Controller Config : 0960837-04_0 MATCH Upgrade version :
0960837-04_0
Fan Slot2 Controller Firmware
SHOW elec_2; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade
version : ucd90124a|2.3.6.0000|110809
Fan Slot3 Controller Config
SHOW elec_3; Controller Config : 0960837-04_0 MATCH Upgrade version :
0960837-04_0
Fan Slot3 Controller Firmware
SHOW elec_3; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade
version : ucd90124a|2.3.6.0000|110809
Fan Slot4 Controller Config
SHOW elec_4; Controller Config : 0960837-04_0 MATCH Upgrade version :
0960837-04_0
Fan Slot4 Controller Firmware
SHOW elec_4; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade
version : ucd90124a|2.3.6.0000|110809
Fan Slot5 Controller Config
SHOW elec_5; Controller Config : 0960837-04_0 MATCH Upgrade version :
0960837-04_0
Fan Slot5 Controller Firmware
SHOW elec_5; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade
version : ucd90124a|2.3.6.0000|110809
Xyratex FWDownloader v3.51
3.51 Apply rule file rf_sideplane_s0_check.gem to download file to Enclosure /
dev/sg1 using SES page 0x0E.
----------SES Device Addr='/dev/sg1'
Found SES Page 1 for Device /dev/sg1 size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb)
InterpreterVersion : 2.33
PackageVersion     : 2.0
DESCRIPTION:
Show Sati 2 GEM Version and Titan (xyr_usm_sbb-onestor_r3.20_rc1_rel-470)
VERSION:
1.0.1 Copyright 2012 Xyratex Inc.
SAS Expander BootLoader
SHOW elec_3; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_3; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_3; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_3; Flash Config data CRC : 6591a901 MATCH Upgrade version : 6591a901
SAS Expander VPD
SHOW elec_3; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_3; VPD CRC : 38b21cf7 MATCH Upgrade version : 38b21cf7
SAS Expander BootLoader
SHOW elec_4; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_4; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_4; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_4; Flash Config data CRC : fa337b92 MATCH Upgrade version : fa337b92
SAS Expander VPD
SHOW elec_4; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_4; VPD CRC : 152e4ad9 MATCH Upgrade version : 152e4ad9
SAS Expander BootLoader
SHOW elec_7; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_7; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_7; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_7; Flash Config data CRC : 6591a901 MATCH Upgrade version : 6591a901
SAS Expander VPD
SHOW elec_7; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_7; VPD CRC : 38b21cf7 MATCH Upgrade version : 38b21cf7
SAS Expander BootLoader
SHOW elec_8; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_8; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_8; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_8; Flash Config data CRC : fa337b92 MATCH Upgrade version : fa337b92
SAS Expander VPD
SHOW elec_8; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_8; VPD CRC : 152e4ad9 MATCH Upgrade version : 152e4ad9
Xyratex FWDownloader v3.51
3.51 Apply rule file rf_sideplane_s1_check.gem to download file to Enclosure /
dev/sg1 using SES page 0x0E.
----------SES Device Addr='/dev/sg1'
Found SES Page 1 for Device /dev/sg1 size 284 (0x11c)
WWN: 50050cc10c4000f5 Vendor ID: XYRATEX
Product ID: UD-8435-CS-9000
Product Revision: 3519 S/N: SHX0965000G02FX
----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb)
InterpreterVersion : 2.33
PackageVersion     : 2.0
DESCRIPTION:
Show Sati 2 GEM Version and Titan (xyr_usm_sbb-onestor_r3.20_rc1_rel-470)
VERSION:
1.0.1 Copyright 2012 Xyratex Inc.
SAS Expander BootLoader
SHOW elec_1; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_1; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_1; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_1; Flash Config data CRC : 687fe3f1 MATCH Upgrade version : 687fe3f1
SAS Expander VPD
SHOW elec_1; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_1; VPD CRC : 2737a206 MATCH Upgrade version : 2737a206
SAS Expander BootLoader
SHOW elec_2; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_2; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_2; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_2; Flash Config data CRC : 11eefeaf MATCH Upgrade version : 11eefeaf
SAS Expander VPD
SHOW elec_2; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_2; VPD CRC : f027d510 MATCH Upgrade version : f027d510
SAS Expander BootLoader
SHOW elec_5; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_5; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_5; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_5; Flash Config data CRC : 687fe3f1 MATCH Upgrade version : 687fe3f1
SAS Expander VPD
SHOW elec_5; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_5; VPD CRC : 2737a206 MATCH Upgrade version : 2737a206
SAS Expander BootLoader
SHOW elec_6; Boot Loader revision : 0610 MATCH Upgrade version : 0610
SAS Expander Firmware
SHOW elec_6; Firmware revision : 03050019 MATCH Upgrade version : 03050019
SAS Expander CPLD
SHOW elec_6; CPLD revision : 10 MATCH Upgrade version : 10
SAS Expander Flash Config
SHOW elec_6; Flash Config data CRC : 11eefeaf MATCH Upgrade version : 11eefeaf
SAS Expander VPD
SHOW elec_6; VPD revision : 06 MATCH Upgrade version : 06
SAS Expander VPD CRC
SHOW elec_6; VPD CRC : f027d510 MATCH Upgrade version : f027d510
root: XYRATEX:AMI_FW: Root Current Ver: 2.0.000A
root: XYRATEX:AMI_FW: Root New Ver: 2.0.000A
root: XYRATEX:AMI_FW: Root Backup Ver: 2.0.000A
root: XYRATEX:AMI_FW: Boot Current Ver: 2.0.0001
root: XYRATEX:AMI_FW: Boot New Ver: 2.0.0001
root: XYRATEX:AMI_FW: BIOS Current Ver: 0.46.0001
root: XYRATEX:AMI_FW: BIOS New Ver: 0.46.0001
root: XYRATEX:AMI_FW: BIOS Backup Ver: 0.46.0001
root: XYRATEX:AMI_FW: CPLD Current Ver: 17.0.0004
root: XYRATEX:AMI_FW: CPLD New Ver: 17.0.0004
root: XYRATEX:FWTOOL: Version checking done
Replace a 5U84 Chassis
Prerequisites
Part number
101170900: Chassis Assy, Sonexion 2000 5U SSU FRU, with power supplies and fans
Time
2.5 hours
Interrupt level
Interrupt (requires disconnecting the Lustre clients from the filesystem)
Tools
● ESD strap
● Console with monitor and keyboard, or a PC with a serial port configured for 115.2 Kbps
● Serial cable
About this task
The SSU comprises a chassis and two controllers. Each controller hosts one Object Storage Server (OSS)
node, so each SSU has two OSS nodes.
In this procedure, only the defective chassis is replaced; all other components are re-used in the new SSU
chassis.
The following procedure requires taking the Lustre filesystem offline.
Figure 25. SSU Chassis, Rear View
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully
redundant state to a non-redundant state, replace the faulty component as soon as possible to return the SSU to
a fully redundant state and maintain uninterrupted ClusterStor operations.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or
SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion
operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully
redundant state.
Procedure
1. Locate the SSU containing the defective chassis. If the logical locations (hostnames) of the two OSS nodes
associated with the affected SSU are known, skip to Replace SSU Chassis and Verify.
2. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP
address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in
the following table.
Table 6. Settings for MGMT Connection
Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown
in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 26. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows:
login as: admin
admin@172.30.72.42’s password: password
Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
[admin@snx11000n000 ~]$
4. Change to root user:
$ sudo su -
5. Log in to one of the OSS nodes hosted on the controller, with username admin and the associated
password.
password.
6. Determine the hostnames of the OSS nodes in an HA pair:
# crm_mon
Leave the serial cable connected, to use in the following steps.
7. Power off the affected SSU:
[admin@n000]$ cscli power_manage -n nodeXX --power-off
Repeat the command for both nodes inside the affected SSU.
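For example, for the HA pair n004/n005 (hostnames are examples), the two power-off commands can be issued in one loop:
[admin@n000]$ for n in snx11000n004 snx11000n005; do cscli power_manage -n $n --power-off; done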
Replace SSU Chassis and Verify
Perform the following steps at the back of the rack.
8. Power off both PSUs in the SSU (located below the controllers).
9. Remove the power cords from the PSUs.
10. Disconnect the RJ-45 and Infiniband cables from both SSU controllers. If serial cables are attached to the
controllers as part of the system, disconnect them.
11. Remove all drives from the chassis, keeping the same order per shelf/slot.
Although the drive roaming functionality works, the best practice is to put each drive back in its original slot.
For peak performance, SSDs must be re-inserted in their original slot.
12. Remove both PSUs from the chassis.
13. Remove all fan modules from the chassis.
14. Remove the SSU controller on the left side of the chassis and mark it as “L”.
15. Remove the SSU controller on the right side of the chassis and mark it as “R”.
16. Remove the four screws in the front mounting brackets.
17. Disconnect the rear sliding brackets from the chassis.
18. With a second person, remove the SSU chassis from the rack.
19. With a second person, move the new SSU chassis into the rack.
20. Connect the chassis to the rack.
a. Secure four screws to the front mounting brackets.
b. Connect the rear sliding brackets to the chassis.
21. Insert the SSU controller marked “L” into the left side of the chassis (at the rear of the rack).
22. Insert the SSU controller marked “R” into the right side of the chassis (at the rear of the rack).
23. Connect the RJ-45 and InfiniBand cables to both OSS controllers. If serial cables were previously attached to
the controllers as part of the system, reconnect them.
24. Insert all fan modules into the chassis.
25. Insert both PSUs into the chassis.
26. Insert all drives into the chassis (keeping the original order per shelf/slot).
27. Connect the power cords to the PSUs and power on both modules.
28. Wait for the SSU containing the new chassis to fully power on and for the OSS nodes to come online.
It can take up to 15 minutes for the OSS nodes to boot.
29. Verify that the SSU with the new chassis works correctly.
a. Connect a serial cable to one of the controllers in the new SSU (serial port is on the rear panel) and to the
console with monitor and keyboard or the PC.
b. Log in to one of the OSS nodes hosted on the controller (username admin and the customer's
password).
If the login fails, the OSS node is not yet online.
c. Verify that all drives in the new chassis are detected:
[admin@n000]$ sudo sg_map -i -x | egrep -e 'HIT|SEA|WD' | wc -l
The command should return a count of 84, or 168 on SSU+1 systems.
Example:
$ sudo sg_map -i -x | egrep -e 'HIT|SEA|WD' | wc -l
84
IMPORTANT: If the drive counts are lower than expected or not all drives are detected, do not continue.
Contact Cray Support to troubleshoot the problem.
d. Check the USM firmware version running on the new controller. See Sonexion 2000 USM Firmware
Update Guide.
e. Disconnect the serial cable from the controller in the affected SSU and the console (or PC).
Replace a 2U24 EBOD
Prerequisites
Part number
100843600: Controller Assy, Sonexion EBOD Expansion Module
Time
1.5 hours
Interrupt level
Failover (can be applied to a live system with no service interruption, but requires failover/
failback)
Tools
● Labels (attach to SAS cables)
● ESD strap
● Serial cable (9-pin to 3.5 mm phone plug)
Requirements
● IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
● Hostnames must be assigned to all OSS nodes in the cluster containing the failed EBOD I/O module (available from the customer).
● To complete this procedure, the replacement EBOD I/O module must have the correct USM firmware for the system. See Sonexion 2000 USM Firmware Update Guide.
About this task
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U24
storage enclosure contains 24 drives, 2 EBOD I/O modules, and 2 Power Cooling Modules (PCMs).
The EBOD I/O modules (upper and lower) are located between the PCMs and are accessible from the back of the
cabinet. Each EBOD I/O module has three ports, but the system uses only two ports (A and B).
The system architecture requires the lower EBOD I/O module to be installed upside-down in the enclosure. Thus
the module's markings and the port sequence are reversed (C / B / A). When an EBOD I/O module is replaced,
the SAS cables must be reconnected to the same ports as on the failed module. Apply a label to each SAS cable
before it is unplugged, to indicate which port to attach it to on the new EBOD I/O module. When the cables are
reconnected, verify that the module and port marked on the label match the module and port on the enclosure.
Use this procedure to halt all client I/O and file systems, replace the failed EBOD I/O module, verify the operation
of the new EBOD I/O module, and return the system to normal operation.
Although either MGMT node may be in the failed state, this document presents command prompts as [MGMT0],
representing the active MGMT node, which could be either MGMT0 or MGMT1.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
This procedure has steps to fail over and shut down the node, remove the failed EBOD, and install a new one.
Procedure
1. Determine the location of the failed EBOD I/O module.
a. Check whether the customer included this information in an SFDC case. If the ticket contains this
information, proceed to Step 2.
b. If the information has not been provided, look for an amber Fault LED on the failed EBOD I/O module and
the Module Fault LED on the Operator Control Panel (OCP) on the 2U24 enclosure (front panel).
2. Connect a KVM or console (or PC) to the primary MGMT server.
3. Log in to the primary MGMT node.
[Client]$ ssh -l admin primary_MGMT_node
4. Stop Lustre on all nodes.
a. Stop the file system. If multiple file systems are configured in the cluster, be sure to run this command on
each file system:
[admin@n000]$ cscli unmount -f fsname
b. Verify that the file system is in a “stopped" state on all nodes in the cluster:
[admin@n000]$ cscli fs_info
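For example, if two file systems named fs1 and fs2 are configured (names are examples), the unmount can be looped before the final check:
[admin@n000]$ for fs in fs1 fs2; do cscli unmount -f $fs; done
[admin@n000]$ cscli fs_info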
5. Power off the Sonexion system.
Remove the Failed EBOD I/O Module
Perform these steps at the back of the rack:
6. Turn off the power switches on both PCMs in the 2U24 enclosure.
7. Disconnect each SAS cable from the failed EBOD I/O module. Note which port each cable connects to, so
the cables can be reconnected to the same port locations on the new EBOD I/O module. Use Upper or Lower
to indicate the module and A or B to indicate the port. Following is a sample label:
Upper A
EBOD I/O modules are installed opposite of each other in the enclosure. Depending on the cable clearance, it
may be necessary to disconnect the SAS cables on both EBOD I/O modules to access the failed unit. If the
SAS cables are disconnected from the functioning EBOD I/O module, be sure to attach an identifying label to
each one (described above) so the cables will be properly reconnected after the new module is installed.
8. Release the module latch by grasping it between your thumb and forefinger and gently squeezing it (see the
following figure).
Figure 27. EBOD I/O Latch Operation
9. Using the latch as a handle, carefully remove the failed module from the enclosure (see the following figure).
Figure 28. Remove an EBOD I/O Module
10. Inspect the new EBOD I/O module for damage, especially to the interface connector.
If the module is damaged, do not install it. Obtain another EBOD I/O module.
Install the New EBOD I/O Module
11. With the latch in the released (open) position, slide the new EBOD I/O module into the enclosure until it
completely seats and engages the latch (see the following figure).
Figure 29. Install an EBOD I/O Module
12. Secure the module by closing the latch.
An audible click occurs as the latch engages.
13. Plug the SAS cables into their original ports on the EBOD I/O module, as marked on the labels applied earlier.
14. Turn on the power switches on both PCMs in the 2U24 storage enclosure.
15. Verify the following LED status:
● On the new EBOD I/O module, the Fault LED is extinguished and the Health LED is illuminated green.
● On the Operator Control Panel at the front of the 2U24 storage enclosure, the Module Fault LED is off.
16. Power on the Sonexion system.
Powering on the nodes may take some time. To determine when the power on sequence is completed,
monitor the console output or watch for the IB link LEDs.
17. Check the USM firmware version running on the new EBOD I/O module, using the latest revision of
Sonexion 2000 USM Firmware Update Guide for your version of Sonexion.
18. Start the Lustre file system:
[admin@n000]$ sudo cscli mount -f fsname
19. Check that the Lustre file system is started on all nodes:
[admin@n000]$ sudo cscli fs_info
20. Close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial
cable from the primary MGMT server.
The procedure to replace a failed EBOD I/O module on the MMU’s 2U24 storage enclosure is complete.
Replace a 2U24 Chassis
Prerequisites
Part number
100853300: Chassis Assy, 2U24 without power supplies or controller
Time
2 hours
Interrupt level
Interrupt (requires disconnecting Lustre clients from the filesystem)
Tools
● ESD strap
● #2 Phillips screwdriver
Required files
The following RPMs may be required to complete this procedure. See Sonexion 2000 USM
Firmware Update Guide. RPM versions depend on the Sonexion release. Refer to the
Sonexion Update Bundles for the current firmware levels. USM firmware must be installed
on the primary MGMT node before the USM firmware update procedure is performed.
● lsscsi RPM (lsscsi-*.rpm)
● fwdownloader RPM (fwdownloader-*.rpm)
● GEM/HPM RPM (XYR_USM_SBB-*.rpm)
About this task
Use this procedure to remove and replace a defective chassis (2U24 enclosure) in the MMU component.
Subtasks:
● Remove 2U24 Components and Chassis
● Replace Chassis and Reinstall Components
● Power on 2U24 Components and Reactivate
The MMU comprises a 2U24 enclosure and one Intel 2U chassis with four servers. The 2U24 enclosure contains 24 disk drives in carriers (DDICs, referred to simply as disks), two EBOD I/O modules, and two power/cooling modules (PCMs). There can be as many as four MMU chassis in a system.
The EBOD I/O modules (upper and lower) are located between the PCMs and are accessible from the back of the
rack. Each EBOD I/O module has three ports but only two ports (A and B) are used in the system. The
architecture requires that the lower EBOD I/O module be installed upside down in the enclosure. This causes the
module's markings to be upside down and the port sequence to be reversed (C / B / A).
When an EBOD I/O module is replaced, reconnect the SAS cables to the same ports as on the failed module.
Apply a label to each SAS cable before it is unplugged to remember which port to attach it to on the new EBOD I/O module. When the cables are reconnected, verify the module and port marked on the label against the module and port in the enclosure.
In this procedure, only the defective chassis is replaced; all other components in the defective MMU enclosure are
re-used in the new chassis. This procedure includes steps to stop all client I/O and file systems, replace the failed
2U24 chassis, verify the operation of the new 2U24 chassis, and return the system to normal operation.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
CAUTION: The size and weight of the 2U24 chassis requires two individuals to move the unit safely. Do
not perform this procedure unless two individuals are onsite and available to move each 2U24 chassis.
Procedure
1. If the location of the defective chassis is not known, a fault LED (amber) on the front panel of the failed 2U24
chassis indicates the failure. (A system can have as many as four MMU 2U24 chassis.)
2. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP
address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in
the following table.
Table 7. Settings for MGMT Connection
Parameter         Setting
Bits per second   115200
Data bits         8
Parity            None
Stop bits         1
Flow control      None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown
in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 30. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows:
login as: admin
admin@172.30.72.42’s password: password
Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
[admin@snx11000n000 ~]$
4. Change to root user:
$ sudo su -
5. Log in to the primary MGMT node:
[admin@n000]$ ssh -l admin primary_MGMT_node
6. Unmount the Lustre file system:
$ cscli unmount -f fsname
7. Verify that the Lustre file system is stopped on all nodes:
$ cscli fs_info
8. Power off the Sonexion system as described in Power Off Sonexion 2000.
Remove 2U24 Components and Chassis
In the following steps, remove the EBOD I/O modules, power cooling modules (PCMs), and disks from the
chassis.
Perform these steps from the rear of the rack.
9. Remove the EBOD I/O modules:
a. Turn off the power switches on both PCMs in the chassis.
b. On one EBOD I/O module, disconnect each SAS cable and attach a label that indicates the module and
port for reconnecting the cable to the module. Use “upper” or “lower” to indicate the module, and A or B to
indicate the port. Sample label: “Upper A”.
c. Gently squeeze the module latch between the thumb and forefinger.
d. Using the latch as a handle, carefully withdraw the module from the enclosure.
e. Repeat the previous three steps for the second EBOD I/O module.
10. Remove the PCMs:
a. At one of the PCMs, disconnect the power cord by moving the bale toward the center of the PCM and
removing the cord.
b. Release the module latch by gently squeezing it between the thumb and forefinger.
c. Using the latch as a handle, carefully withdraw the PCM from the enclosure.
DANGER: To avoid electrical shock, do not remove the cover from the PCM. Failure to comply
will result in death or serious injury.
d. Repeat the preceding substeps to remove the second PCM.
11. Remove the disks:
a. Record the exact location of the drives, as they must be installed in the same order in the new 2U24
chassis.
b. On one disk, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red
indicator is no longer visible in the opening above the key.
c. Remove the lock key.
d. Release the disk by pressing the latch and rotating it downward.
e. Gently remove the disk from the drive slot.
f. Mark the drive with its current drive slot number in the chassis so that it can be reinstalled in the same slot in the new chassis. From the front of the rack, the drive slots are numbered 0 to 23 (left to right).
g. Repeat preceding substeps for the remaining disks.
12. Remove the 2U24 chassis from the front of the rack:
a. Remove the left and right front flange caps by pulling the caps free.
b. Disconnect the chassis from the rack by removing the screw from the left and right flanges (now exposed
after removing the flange caps).
c. With the assistance of a second person, remove the chassis from the rack.
d. With the chassis on a bench, remove the left and right front flange caps by pulling the caps free. (The
caps simply snap onto the flanges.)
Replace Chassis and Reinstall Components
13. Install the new 2U24 chassis in the rack:
a. With the assistance of a second person, move the 2U24 chassis into the rack. Carefully align the guide
on each side of the chassis with the groove on the rail assembly and gently push the chassis completely
into the rack.
b. Connect the chassis to the rack by installing a screw into the left and right flanges.
c. Install the flange caps by pressing them into position. They snap into place on the flanges.
14. From the front of the rack, install disks in the new chassis as follows, placing each disk where it was located
in the old 2U24 chassis, oriented with the drive handle opening downward.
a. On one disk, verify that the disk handle is released and in the open position.
b. Insert each disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops.
c. Seat the disk by pressing the handle latch and rotating it to the closed position. There will be an audible click as the handle latch engages.
d. Verify that the new disk is in the same orientation as the other disks in the enclosure.
e. Carefully insert the lock key into the lock socket and rotate it clockwise until the red indicator is visible in
the opening above the key.
f. Remove the lock key.
g. Repeat the disk drive installation steps for the remaining disks.
15. Install the PCMs:
a. Carefully inspect the PCM for damage, especially to the rear connector. Avoid damaging the connector pins. If the PCM is damaged, do not install it. Obtain another PCM.
b. Verify that the power switch on the PCM is in the OFF position.
c. Slide the PCM into the chassis. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. This action engages the camming mechanism on the side of the module and secures the PCM.
d. Move the bale toward the center of the PCM, connect the power cord to the PCM, and place the bale over and onto the power cord.
e. Repeat the preceding substeps for the second PCM.
16. Install the EBOD I/O modules:
a. Inspect the EBOD I/O module for damage, especially to the interface connector. If the module is damaged, do not install it. Obtain another EBOD I/O module.
b. With the latch in the released (open) position, slide the EBOD I/O module into the enclosure until it completely seats and engages the latch.
c. Secure the module by closing the latch. There will be an audible click as the latch engages.
d. Repeat the module installation steps for the second EBOD I/O module.
Power on 2U24 Components and Reactivate
17. Plug in the four SAS cables to their original ports on the EBOD I/O modules.
Figure 31. 2U24 SAS Cabling
Use the cable labels prepared in step 9.b to ensure that the cables are connected to the proper ports.
18. Turn on the power switches on both PCMs.
19. Verify that the indicator LEDs on the PCMs, EBOD I/O modules, and the front panel of the 2U24 chassis are
normal and illuminated green.
Figure 32. 2U24 Operators Panel Indicators
Table 8. 2U24 Operator's Panel LED Description
LEDs            State           Indicates
System Power    Steady green    AC power is applied to the enclosure.
Module Fault    Steady amber    PCM fault, OSS fault, or over/under temperature fault. Refer to individual module fault LEDs.
Logical Fault   Steady amber    Failure of a disk drive.
20. Power on the Sonexion as described in Power On Sonexion 2000.
21. The USM and GEM firmware versions must agree between the EBOD I/O modules.
Consult Cray Hardware Product Support to obtain the correct files, and use the procedures described in
Sonexion 2000 USM Firmware Update Guide. When the firmware versions match, proceed to the following
step.
22. Start the Lustre file system:
$ cscli mount -f fsname
23. Check that the Lustre file system is started on all nodes:
$ cscli fs_info
24. Close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial
cable from the primary MGMT server.
Replace a Quad Server Disk
Prerequisites
Part number:
100900800: Disk Drive Assy, 450GB Sonexion 2U Quad Server MMU FRU
Time:
3 hours
Interrupt level:
Live (procedure can be applied to a live system with no service interruption)
Tools:
ESD strap, boots, garment or other approved methods
Console with monitor and keyboard or PC with a serial COM port configured for 115.2Kbs
Serial cable
Phillips screwdriver
About this task
The following procedure can be applied to a live system with no service interruption, but requires failover/failback
operations. It includes steps to replace the failed disk, verify the disk recovery/rebuild on a spare disk, and mark
the newly installed disk as a hot spare.
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U quad
server contains four server nodes, two power supply units (PSUs), fans, and disk drives. This FRU procedure
applies only to hard drives hosting the primary and secondary MGMT nodes (nodes 00 and 01 respectively).
This procedure does not apply to disks hosting the MGS and MDS internal nodes (nodes 02 and 03 respectively),
as they are diskless nodes that do not contain a RAID array using internal drives.
Subtasks:
● Additional Information on MGMT Node RAID Status
● Replace Quad Server Disk
● Clean Up Failed or Pulled Drive in Node
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed disk is known, go to Replace Quad Server Disk.
2. If the failed disk is not known, log in to the primary MGMT node:
$ ssh -l admin primary_MGMT_node
3. Run the dm_report command on the primary or secondary MGMT node suspected to have the failed disk.
$ sudo dm_report
Both the primary MGMT node and the secondary MGMT node have MDRAID arrays in the MMU. It may be
necessary to run this step on each node to determine which one is associated with the failed disk.
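To check both MGMT nodes in one pass, the check can be scripted. The following is a minimal sketch; the hostnames snx11000n000 and snx11000n001 are examples and must be replaced with the site's actual MGMT node names:
[admin@n000]$ for node in snx11000n000 snx11000n001; do echo "=== $node ==="; ssh $node sudo dm_report | grep -i degraded; done
Any array reported as Degraded points at the node with the failed disk.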
4. Connect the console with monitor and keyboard or PC to the primary MGMT node.
5. Log in to the primary MGMT node.
[admin@n000]$ ssh primary_MGMT_node
6. Issue the dm_report command:
[admin@n000]$ sudo dm_report
If the primary MGMT node is in an unhealthy state, the dm_report command produces output similar to the following:
[admin@n000]$ sudo dm_report
Diskmonitor Inventory Report: Version: 1.0.x.1.5-25.x.2455 Host: snx11000n00 Time: Thu Jul 31 03:29:36 2014
encl: 0, wwn: 50050cc10c2002dd, dev: /dev/sg24, slots: 24, vendor: XYRATEX , product_id: EB-2425P-E6EBD
slot: 0, wwn: 5000c500320a6d63, cap: 450098159616, dev: sda, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 1, wwn: 5000c500320a81fb, cap: 450098159616, dev: sdb, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 2, wwn: 5000c500320a7a6f, cap: 450098159616, dev: sdl, parts: 0, status: Ok, t10: 11110111000
slot: 3, wwn: 5000c500320afb6f, cap: 450098159616, dev: sdj, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 4, wwn: 5000c500320d4073, cap: 450098159616, dev: sdi, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 5, wwn: 5000c500320a7bef, cap: 450098159616, dev: sdh, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 6, wwn: 5000c500320a7987, cap: 450098159616, dev: sdg, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 7, wwn: 5000c500320ad06b, cap: 450098159616, dev: sde, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 8, wwn: 5000c500320a7a07, cap: 450098159616, dev: sdd, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 9, wwn: 5000c500320a8383, cap: 450098159616, dev: sdc, parts: 0, status: Hot Spare, t10: 11110111000
slot: 10, wwn: 5000c500320a6f53, cap: 450098159616, dev: sdf, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 11, wwn: 5000c500320a7f6b, cap: 450098159616, dev: sdk, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 12, wwn: 5000c500320a7903, cap: 450098159616, dev: sdq, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 13, wwn: 5000c500320a852b, cap: 450098159616, dev: sdp, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 14, wwn: 5000c500320a8243, cap: 450098159616, dev: sdo, parts: 0, status: Ok, t10: 11110111000
slot: 15, wwn: 5000c500320b76db, cap: 450098159616, dev: sdr, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 16, wwn: 5000c500320a72df, cap: 450098159616, dev: sdu, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 17, wwn: 5000c500320b0b77, cap: 450098159616, dev: sdt, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 18, wwn: 5000c500320b968b, cap: 450098159616, dev: sds, parts: 0, status: Ok, t10: 11110111000
slot: 19, wwn: 5000c500320afe87, cap: 450098159616, dev: sdn, parts: 0, status: Foreign Arrays, t10: 11110111000
slot: 20, wwn: 5000c500320c8f33, cap: 450098159616, dev: sdm, parts: 0, status: Hot Spare, t10: 11110111000
slot: 21, wwn: 5000c500320b0067, cap: 450098159616, dev: sdv, parts: 0, status: Ok, t10: 11110111000
slot: 22, wwn: 5000cca01305e080, cap: 100030242816, dev: sdw, parts: 0, status: Foreign Arrays, t10: 11100111000
slot: 23, wwn: 5000cca01305e9fc, cap: 100030242816, dev: sdx, parts: 0, status: Foreign Arrays, t10: 11100111000
Array: md64, UUID: 435e6c6a-f76d9702-e41ab3e3-e9ce9e5c, status: Ok, t10: disabled
disk_wwn: 5000c500320a7a6f, disk_sd: sdl, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 2
disk_wwn: 5000c500320a8243, disk_sd: sdo, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 14
disk_wwn: 5000c500320b968b, disk_sd: sds, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 18
disk_wwn: 5000c500320b0067, disk_sd: sdv, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 21
Array: md127, UUID: 03cf20b5-c06203d5-2300dd64-1b2cd976, status: Degraded, t10: disabled
Array is unmanaged -- found no disks in a managed enclosure
T10_key_begin:
GRD_CHK(1), APP_CHK(1), REF_CHK(1), ATO(1), RWWP(1), SPT(1), P_TYPE(1), PROT_EN(1), DPICZ(1), FMT(1), READ_CHK(1)
T10_key_end
End_of_report
In this example, the command output shows that array md127 is degraded.
If the RAID status is OK, the primary MGMT node is operating correctly. Run the following commands to
obtain additional RAID status information for the primary MGMT node.
Additional Information on MGMT Node RAID Status
The following steps provide more information on the RAID status of the primary MGMT node.
7. Check the sg_map output to retrieve the device (/dev) numbers:
[admin@n000]$ sudo sg_map -i -x
The sg_map command should produce the following output:
[root@node00 ~]# sg_map -i -x
/dev/sg0  0 0 0  0 0  /dev/sda  SEAGATE ST9450404SS XQB6
/dev/sg1  0 0 1  0 0  /dev/sdb  SEAGATE ST9450404SS XQB6
/dev/sg2  0 0 2  0 0  /dev/sdc  SEAGATE ST9450404SS XQB6
/dev/sg3  0 0 3  0 0  /dev/sdd  SEAGATE ST9450404SS XQB6
/dev/sg4  0 0 4  0 0  /dev/sde  SEAGATE ST9450404SS XQB6
/dev/sg5  0 0 5  0 0  /dev/sdf  SEAGATE ST9450404SS XQB6
/dev/sg6  0 0 6  0 0  /dev/sdg  SEAGATE ST9450404SS XQB6
/dev/sg7  0 0 7  0 0  /dev/sdh  SEAGATE ST9450404SS XQB6
/dev/sg8  0 0 8  0 0  /dev/sdi  SEAGATE ST9450404SS XQB6
/dev/sg9  0 0 9  0 0  /dev/sdj  SEAGATE ST9450404SS XQB6
/dev/sg10 0 0 10 0 0  /dev/sdk  SEAGATE ST9450404SS XQB6
/dev/sg11 0 0 11 0 0  /dev/sdl  SEAGATE ST9450404SS XQB6
/dev/sg12 0 0 12 0 0  /dev/sdm  SEAGATE ST9450404SS XQB6
/dev/sg13 0 0 13 0 0  /dev/sdn  SEAGATE ST9450404SS XQB6
/dev/sg14 0 0 14 0 0  /dev/sdo  SEAGATE ST9450404SS XQB6
/dev/sg15 0 0 15 0 0  /dev/sdp  SEAGATE ST9450404SS XQB6
/dev/sg16 0 0 16 0 0  /dev/sdq  SEAGATE ST9450404SS XQB6
/dev/sg17 0 0 17 0 0  /dev/sdr  SEAGATE ST9450404SS XQB6
/dev/sg18 0 0 18 0 0  /dev/sds  SEAGATE ST9450404SS XQB6
/dev/sg19 0 0 19 0 0  /dev/sdt  SEAGATE ST9450404SS XQB6
/dev/sg20 0 0 20 0 0  /dev/sdu  SEAGATE ST9450404SS XQB6
/dev/sg21 0 0 21 0 0  /dev/sdv  SEAGATE ST9450404SS XQB6
/dev/sg22 0 0 22 0 0  /dev/sdw  HITACHI HUSSL4010ASS600 A182
/dev/sg23 0 0 23 0 0  /dev/sdx  HITACHI HUSSL4010ASS600 A182
/dev/sg24 0 0 24 0 13           XYRATEX EB-2425-E6EBD 3022
/dev/sg25 1 0 0  0 0  /dev/sdy  SEAGATE ST9450405SS 0002  internal drive 0, left one
/dev/sg26 1 0 2  0 0  /dev/sdaa SEAGATE ST9450405SS 0002  internal drive, right one
In this example, the internal drive device numbers are sg25 and sg26, as shown in the last two lines of the output.
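Because the two internal drives sit on a different SCSI host (the first numeric column) than the 24 enclosure slots, they can also be isolated mechanically. A minimal sketch, assuming the internal drives report host number 1 as in the example above:
[admin@n000]$ sudo sg_map -i -x | awk '$2 == 1'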
8. Obtain additional RAID status information:
[admin@n000 ~]$ sudo mdadm --detail /dev/md127
If the drive is in a healthy state, the command produces output similar to the following:
/dev/md127:
Version : 1.0
Creation Time : Tue Mar 12 09:16:15 2013
Raid Level : raid1
Array Size : 439548848 (419.19 GiB 450.10 GB)
Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Wed Mar 20 09:44:36 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : snx11000n000:md512 (local to host snx11000n000)
UUID : 475cfc47:37213002:a89b42ce:b4ad65e0
Events : 90
Number  Major  Minor  RaidDevice  State
   0      65    128       0       active sync  /dev/sdy
   1      65    144       1       active sync  /dev/sdz
If the drive has failed, the command produces the following output:
[admin@node00 ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
Version : 1.0
Creation Time : Tue Mar 12 09:16:15 2013
Raid Level : raid1
Array Size : 439548848 (419.19 GiB 450.10 GB)
Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Wed Mar 20 09:44:36 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : snx11000n000:md512 (local to host snx11000n000)
UUID : 475cfc47:37213002:a89b42ce:b4ad65e0
Events : 90
Number  Major  Minor  RaidDevice  State
   0      65    128       0       active sync  /dev/sdy
   1      65    144       1       removed      /dev/sdz
In the above example, RAID device 1 has been removed.
9. Check the status of the removed RAID device:
Node00$ cat /proc/mdstat
If the drive has failed, the command produces the following output:
[admin@n100 ~]# cat /proc/mdstat
Personalities : [raid1] [raid10]
md64 : active raid10 sdj[0] sdn[3] sdr[2] sdd[1]
839843744 blocks super 1.2 4K chunks 2 near-copies [4/4] [UUUU]
md127 : active raid1 sdz[1](F) sdy[0]
439548848 blocks super 1.0 [2/1] [U_]
unused devices: <none>
The (F) notation after sdz indicates that drive sdz in array md127 has failed.
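To pull only the failure indicators out of /proc/mdstat, a minimal sketch; it keys on the (F) marker and on the underscore in the [U_] member status shown above:
[admin@n000]$ grep -E '\(F\)|_\]' /proc/mdstat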
Replace Quad Server Disk
10. After locating the failed disk, inspect the front of the 2U quad server and verify that the slot containing the
failed disk has a lit Fault LED (amber) on the carrier.
11. Remove the failed disk by pressing the green button and opening the lever, then sliding the disk out.
Figure 33. Remove a disk
12. Place the disk on a stable work surface and remove the four screws holding the disk drive in the carrier, as
shown in the following figure.
This step may be unnecessary if the new disk drive includes the carrier.
Figure 34. Remove a Disk Drive From the Carrier
13. Place the new disk drive in the carrier, align the holes, and secure it with the four screws removed earlier, as
shown in the following figure.
Figure 35. Install the Drive in the Carrier
14. Install the new drive in the 2U quad server. Verify that the handle latch is released and in the open position,
then slide the drive carrier into the enclosure until it stops. Seat the disk by pushing up the handle latch and
rotating it to the closed position, as shown in the following figure.
There will be an audible click as the handle latch engages.
Figure 36. Installing the DDIC (disk)
15. Confirm that the new drive is listed in the sg_map command output, in a good state, and operational:
[MGMT0]$ sudo sg_map -x -i
The new drive appears at the bottom of the drive list as /dev/sg25 or /dev/sg26 (depending on which
is the newly inserted drive).
For example:
[admin@snx11000n000]$ sudo sg_map -x -i
/dev/sg0 6 0 0 0 0 /dev/sda SEAGATE ST9450404SS XRFA
/dev/sg1 6 0 1 0 0 /dev/sdb SEAGATE ST9450404SS XRFA
/dev/sg2 6 0 2 0 0 /dev/sdc SEAGATE ST9450404SS XRFA
/dev/sg3 6 0 3 0 0 /dev/sdd SEAGATE ST9450404SS XRFA
/dev/sg4 6 0 4 0 0 /dev/sde SEAGATE ST9450404SS XRFA
/dev/sg5 6 0 5 0 0 /dev/sdf SEAGATE ST9450404SS XRFA
/dev/sg6 6 0 6 0 0 /dev/sdg SEAGATE ST9450404SS XRFA
/dev/sg7 6 0 7 0 0 /dev/sdh SEAGATE ST9450404SS XRFA
/dev/sg8 6 0 8 0 0 /dev/sdi SEAGATE ST9450404SS XRFA
/dev/sg9 6 0 9 0 0 /dev/sdj SEAGATE ST9450404SS XRFA
/dev/sg10 6 0 10 0 0 /dev/sdk SEAGATE ST9450404SS XRFA
/dev/sg11 6 0 11 0 0 /dev/sdl SEAGATE ST9450404SS XRFA
/dev/sg12 6 0 12 0 0 /dev/sdm SEAGATE ST9450404SS XRFA
/dev/sg13 6 0 13 0 0 /dev/sdn SEAGATE ST9450404SS XRFA
/dev/sg14 6 0 14 0 0 /dev/sdo SEAGATE ST9450404SS XRFA
/dev/sg15 6 0 15 0 0 /dev/sdp SEAGATE ST9450404SS XRFA
/dev/sg16 6 0 16 0 0 /dev/sdq SEAGATE ST9450404SS XRFA
/dev/sg17 6 0 17 0 0 /dev/sdr SEAGATE ST9450404SS XRFA
/dev/sg18 6 0 18 0 0 /dev/sds SEAGATE ST9450404SS XRFA
/dev/sg19 6 0 19 0 0 /dev/sdt SEAGATE ST9450404SS XRFA
/dev/sg20 6 0 20 0 0 /dev/sdu SEAGATE ST9450404SS XRFA
/dev/sg21 6 0 21 0 0 /dev/sdv SEAGATE ST9450404SS XRFA
/dev/sg22 6 0 22 0 0 /dev/sdw HITACHI HUSSL4010ASS600 A202
/dev/sg23 6 0 23 0 0 /dev/sdx HITACHI HUSSL4010ASS600 A202
/dev/sg24 6 0 24 0 13 XYRATEX EB-2425P-E6EBD 3519
/dev/sg25 7 0 0 0 0 /dev/sdy SEAGATE ST9450405SS XRC0
/dev/sg26 7 0 2 0 0 /dev/sdaa SEAGATE ST9450405SS XRC0
16. Rebuild the disk drive in the md127 array, using the /dev/sd name of the newly inserted device:
[admin@n000]$ sudo mdadm /dev/md127 --add /dev/sdxx
Where sdxx is the new device.
17. Verify that the new disk drive has started to rebuild in the array, by running one of the following commands:
● Use the cat /proc/mdstat command:
[admin@n000]$ cat /proc/mdstat
For example:
[root@snx11000n000 ~]$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid10]
md64 : active raid10 sds[0] sdl[3] sdo[2] sdv[1]
859374976 blocks super 1.2 64K chunks 2 near-copies [4/4] [UUUU]
bitmap: 3/7 pages [12KB], 65536KB chunk
md127 : active raid1 sdaa[2] sdy[0]
439548848 blocks super 1.0 [2/1] [U_]
[==================>..] recovery = 93.6% (411549504/439548848)
finish=7.3min speed=63767K/sec
unused devices: <none>
● Use the mdadm --detail command:
[admin@n000]$ sudo mdadm --detail /dev/md127
For example:
/dev/md127:
Version : 1.0
Creation Time : Mon Jul 7 02:00:06 2014
Raid Level : raid1
Array Size : 439548848 (419.19 GiB 450.10 GB)
Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Jul 31 04:53:27 2014
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
PD-Repaired : 0
Rebuild Status : 93% complete
Name : snx11000n000:md512 (local to host snx11000n000)
UUID : 03cf20b5:c06203d5:2300dd64:1b2cd976
Events : 51756
Number  Major  Minor  RaidDevice  State
   0      65    128       0       active sync       /dev/sdy
   2      65    160       1       spare rebuilding  /dev/sdaa
If the build finishes without any errors, the procedure is complete.
18. If the mdadm command produces the following error:
$ mdadm /dev/md127 --add /dev/sdz
mdadm: Cannot open /dev/sdz: Device or resource busy
run the clean drive procedure for 5 minutes:
[admin@n000]$ dd if=/dev/zero of=/dev/sdz bs=1M
After 5 minutes, press Ctrl-C to quit the procedure.
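As an alternative to timing the run by hand, the dd can be bounded with the coreutils timeout command. A minimal sketch, where the 300-second limit mirrors the 5 minutes above:
[admin@n000]$ sudo timeout 300 dd if=/dev/zero of=/dev/sdz bs=1M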
19. Repeat steps 16 and 17 to add the drive to the array and confirm that it starts to rebuild.
If the drive fails to rebuild, contact Cray Support for further information.
Clean Up Failed or Pulled Drive in Node
About this task
Perform this procedure if the drive status shows "removed" or "faulty", and if inserting a new drive does not update the drive status to "spare rebuilding". This procedure uses node 1 as an example.
To clean up a failed or pulled drive in a node:
Procedure
1. If the disk comes up as anything other than hot spare, clear the superblock information:
$ sudo mdadm --zero-superblock --force /dev/sdXX
2. Log in to the failed or pulled node:
$ ssh failed_MGMT_node
The preceding command should produce the following output:
[admin@n000 ~]$ ssh snx11000n001
Last login: Wed Mar 20 10:44:14 2013 from 172.16.2.2
[root@snx11000n001 ~]# cat /proc/mdstat
Personalities : [raid1] [raid10]
md127 : active raid1 sdz[1](F) sdy[0]
439548848 blocks super 1.0 [2/1] [U_]
unused devices: <none>
Drive sdz is showing as failed.
3. Retrieve md127 array details, which is built from the two internal drives:
$ sudo mdadm --detail /dev/md127
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
Version : 1.0
Creation Time : Tue Mar 12 09:16:14 2013
Raid Level : raid1
Array Size : 439548848 (419.19 GiB 450.10 GB)
Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Wed Mar 20 10:57:50 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : snx11000n001:md512 (local to host snx11000n001)
UUID : 188d023a:0a1c06de:71c7b8fc:eb7f560f
Events : 1903
Number  Major  Minor  RaidDevice  State
   0      65    128       0       active sync   /dev/sdy
   1       0      0       1       removed
   1      65    144       -       faulty spare
The faulty spare must then be cleared.
4. Clear the faulty spare:
$ sudo mdadm --manage /dev/md127 --remove faulty
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --manage /dev/md127 --remove faulty
mdadm: hot removed 65:144 from /dev/md127
5. Check the md127 array details to make certain the faulty spare has been removed:
$ sudo mdadm --detail /dev/md127
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
Version : 1.0
Creation Time : Tue Mar 12 09:16:14 2013
Raid Level : raid1
Array Size : 439548848 (419.19 GiB 450.10 GB)
Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Wed Mar 20 10:58:25 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : snx11000n001:md512 (local to host snx11000n001)
UUID : 188d023a:0a1c06de:71c7b8fc:eb7f560f
Events : 1924
Number Major Minor RaidDevice State
0 65 128 0 active sync /dev/sdy
1 0 0 1 removed
The failed or pulled drive will require recovering.
6. Check the device name in the sg_map output before recovering a faulty drive, as the /dev name may change.
7. Run:
$ sudo sg_map -i -x
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo sg_map -i -x
/dev/sg0 0 0 125 0 0 /dev/sda HITACHI HUSSL4010ASS600 A182
/dev/sg1 0 0 126 0 0 /dev/sdb HITACHI HUSSL4010ASS600 A182
/dev/sg2 0 0 127 0 0 /dev/sdc SEAGATE ST9450404SS XQB6
/dev/sg3 0 0 128 0 0 /dev/sdd SEAGATE ST9450404SS XQB6
/dev/sg4 0 0 129 0 0 /dev/sdf SEAGATE ST9450404SS XQB6
/dev/sg5 0 0 130 0 0 /dev/sdg SEAGATE ST9450404SS XQB6
/dev/sg6 0 0 131 0 0 /dev/sdh SEAGATE ST9450404SS XQB6
/dev/sg7 0 0 132 0 0 /dev/sdj SEAGATE ST9450404SS XQB6
/dev/sg8 0 0 133 0 0 /dev/sdk SEAGATE ST9450404SS XQB6
/dev/sg9 0 0 134 0 0 /dev/sdl SEAGATE ST9450404SS XQB6
/dev/sg10 0 0 135 0 0 /dev/sdm SEAGATE ST9450404SS XQB6
/dev/sg11 0 0 136 0 0 /dev/sdn SEAGATE ST9450404SS XQB6
/dev/sg12 0 0 137 0 0 /dev/sdo SEAGATE ST9450404SS XQB6
/dev/sg13 0 0 138 0 0 /dev/sdq SEAGATE ST9450404SS XQB6
/dev/sg14 0 0 139 0 0 /dev/sdr SEAGATE ST9450404SS XQB6
/dev/sg15 0 0 140 0 0 /dev/sds SEAGATE ST9450404SS XQB6
/dev/sg16 0 0 141 0 0 /dev/sdt SEAGATE ST9450404SS XQB6
/dev/sg17 0 0 142 0 0 /dev/sdv SEAGATE ST9450404SS XQB6
/dev/sg18 0 0 143 0 0 /dev/sdw SEAGATE ST9450404SS XQB6
/dev/sg19 0 0 144 0 0 /dev/sdx SEAGATE ST9450404SS XQB6
/dev/sg20 0 0 145 0 0 /dev/sdaa SEAGATE ST9450404SS XQB6
/dev/sg21 0 0 146 0 0 /dev/sdab SEAGATE ST9450404SS XQB6
/dev/sg22 0 0 147 0 0 /dev/sdac SEAGATE ST9450404SS XQB6
/dev/sg23 0 0 148 0 0 /dev/sdad SEAGATE ST9450404SS XQB6
/dev/sg24 0 0 149 0 13 XYRATEX EB-2425-E6EBD 3022
/dev/sg25 1 0 0 0 0 /dev/sdy SEAGATE ST9450405SS 0002
/dev/sg26 1 0 2 0 0 /dev/sde SEAGATE ST9450405SS 0002
In the above example, the new dev number is “sde”.
8. Recover the failed or pulled drive:
$ sudo mdadm --manage /dev/md127 --re-add /dev/sde
The preceding command should produce the following output:
[root@snx11000n001 ~]# mdadm --manage /dev/md127 --re-add /dev/sde
mdadm: re-added /dev/sde
9. Check the md127 array details to make certain the faulty spare has been added:
$ sudo mdadm --detail /dev/md127
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
Version : 1.0
Creation Time : Tue Mar 12 09:16:14 2013
Raid Level : raid1
Array Size : 439548848 (419.19 GiB 450.10 GB)
Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Wed Mar 20 10:59:24 2013
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Rebuild Status : 0% complete
Name : snx11000n001:md512 (local to host snx11000n001)
UUID : 188d023a:0a1c06de:71c7b8fc:eb7f560f
Events : 1950
Number  Major  Minor  RaidDevice  State
   0      65    128       0       active sync       /dev/sdy
   1       8     64       1       spare rebuilding  /dev/sde
The drive is now rebuilding.
Replace a Quad Server MGMT Node
Prerequisites
Part number
100900501: Server, Sonexion Single E5-2680 32GB Management Server Node FDR
Time
1 hour
Interrupt level
Interrupt, if the cords connected to the power distribution strip block the MGMT server nodes and require reconfiguring (requires taking the Lustre file system offline)
Failover, if the MGMT server nodes can be easily accessed (can be applied to a live system with no service interruption, but requires failover/failback)
Tools
ESD strap, shoes, and garment
Console with monitor and keyboard (or PC with a serial COM port configured for 115.2Kbs,
8 data bits, no parity, and 1 stop bit)
About this task
Use this procedure to remove and replace a failed server node hosting the primary and secondary MGMT nodes
in the MMU component in the field. This procedure includes steps to replace the failed server node and return the
Sonexion system to normal operations.
Subtasks:
● Shut Down the Failed MGMT Node and Replace
● Set MGMT Node IPMI Address and Record MAC Address
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U quad
server contains four server nodes, two PSUs, and disk drives.
The MMU’s server nodes host the primary and secondary MGMT nodes, MGS, and MDS nodes. The system's
High Availability architecture provides that if one of the server nodes goes down, its resources migrate to its HA
partner node so Sonexion operations continue without interruption. This document details the replacement of a
failed server node hosting the primary and secondary MGMT nodes.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Since either the primary or secondary MGMT node may be in the failed state, this document will present
command prompts as [MGMT] which represents the active MGMT node, which could be either MGMT0 or
MGMT1.
Procedure
1. If the location of the failed server node is known, go to Step 5 on page 90.
2. Attempt to log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
3. If the primary MGMT node cannot be logged into, then attempt to log in to the secondary MGMT node via
SSH:
[Client]$ ssh -l admin secondary_MGMT_node
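Both attempts can be combined into a single loop that stops at the first node that answers. A minimal sketch, where primary_MGMT_node and secondary_MGMT_node stand for the actual hostnames:
[Client]$ for node in primary_MGMT_node secondary_MGMT_node; do ssh -l admin $node hostname && break; done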
4. Do one of the following:
● Access CSSM and use the Health tab to identify the malfunctioning server node.
● At the front of the rack, verify which server node has failed by looking for a System Status LED (amber) or a dark LED on the left and right control panels of the 2U quad server, as shown in the following figure and table.
The following figure and table show the mapping of server node names.
Figure 37. Quad Server Control Panels
Table 9. System Status LED Descriptions
LED Color  Condition  Description
Green      On         System Ready/No Alarm.
Green      Flashing   System Ready, but degraded: redundancy lost, such as a power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber      On         Critical Alarm: critical power module failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber      Flashing   Non-Critical Alarm: redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-          Off        Power off: system unplugged. Power on: system power off and in standby, no prior degraded/non-critical/critical state.
From the rear of the rack, physical and logical node definitions are shown in the following figure.
Figure 38. Quad Server: Rear Component Identification
The server node names are mapped as defined in this table.
Table 10. Quad Server Node Designations
Logical   Physical  Function
Node 000  Node 4    MGMT Primary
Node 001  Node 3    MGMT Secondary
Node 002  Node 2    MGS
Node 003  Node 1    MDS
5. Record the BMC IP, Mask and Gateway IP addresses:
[MGMT]$ grep affected_nodename-ipmi /etc/hosts
This command produces the following output:
[admin@snx11000n000 ~]$ grep snx11000n001-ipmi /etc/hosts
172.16.0.102    snx11000n001-ipmi
The Mask and Gateway addresses are set the same for all nodes:
Mask = 255.255.0.0
Gateway = 172.16.2.1
6. In some Sonexion configurations, power cords block access to the 2U quad server, while in other
configurations, the 2U quad server can be freely accessed. If the cords block access to the 2U quad server,
proceed to the following substeps. If they do not block access to the 2U quad server, go to Step 7 on page
92.
Figure 39. Power Distribution Strip With Cables in Top Two Sockets
a. Power down the Sonexion system as described in Power Off Sonexion 2000.
b. Remove the power cords from the power distribution strip.
c. Re-install the power cords in the power distribution strip so that no sockets in use block removal of the failed server node in the quad server.
Figure 40. Power Distribution Strip With Top Two Sockets Open
d. Power on the Sonexion system as described in Power On Sonexion 2000.
7. If the failed node is still operational, fail over its resources to its HA partner (active MGMT) node:
[MGMT]$ cscli failover -n affected_nodename
8. Verify that the failover operation was successful:
[MGMT]$ sudo crm_mon -1
The following is an example of a successful failover:
============
[root@snx11000n000 ~]# crm_mon -1r
============
Last updated: Wed Aug 6 03:07:01 2014
Last change: Wed Aug 6 02:59:44 2014 via cibadmin on snx11000n001
Stack: Heartbeat
Current DC: snx11000n001 (8d542227-dce8-49d7-bcc6-d1e651d7d0ec) - partition
with quorum
Version: 1.1.6.1-5.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
40 Resources configured.
============
Online: [ snx11000n000 snx11000n001 ]
Full list of resources:
baton (ocf::heartbeat:baton): Started snx11000n000
snx11000n000_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n000
snx11000n001_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate):
Started snx11000n001
snx11000n000-stonith (stonith:external/libvirt): Started snx11000n000
snx11000n001-stonith (stonith:external/libvirt): Started snx11000n001
prm-httpd (lsb:httpd): Started snx11000n001
prm-mysql (lsb:mysqld): Started snx11000n001
prm-nfslock (lsb:nfslock): Started snx11000n001
prm-bebundd (lsb:bebundd): Started snx11000n001
Clone Set: cln-cerebrod [prm-cerebrod]
Stopped: [ prm-cerebrod:0 prm-cerebrod:1 ]
prm-conman (lsb:conman): Stopped
prm-dhcpd (lsb:dhcpd): Started snx11000n001
prm-xinetd (lsb:xinetd): Started snx11000n001
Clone Set: cln-syslogng [prm-syslogng]
Started: [ snx11000n000 snx11000n001 ]
prm-nodes-monitor (lsb:nodes-monitor): Started snx11000n001
Clone Set: cln-ses_mon [prm-ses_monitor]
Started: [ snx11000n000 snx11000n001 ]
Clone Set: cln-nsca_passive_checks [prm-nsca_passive_checks]
Started: [ snx11000n000 snx11000n001 ]
Resource Group: grp-icinga
prm-icinga (lsb:icinga): Stopped
prm-nsca (lsb:nsca): Stopped
prm-npcd (lsb:npcd): Stopped
Resource Group: grp-plex
prm-rabbitmq (lsb:rabbitmq-server): Stopped
prm-plex (lsb:plex): Stopped
prm-repo-local (ocf::heartbeat:Filesystem): Started snx11000n001
prm-repo-remote (ocf::heartbeat:Filesystem): Started snx11000n000
prm-db2puppet (ocf::heartbeat:oneshot): Started snx11000n001
Clone Set: cln-puppet [prm-puppet]
Started: [ snx11000n001 ]
Stopped: [ prm-puppet:0 ]
prm-nfsd (ocf::heartbeat:nfsserver): Started snx11000n001
prm-vip-eth0-mgmt (ocf::heartbeat:IPaddr2): Started snx11000n001
prm-vip-eth0-nfs (ocf::heartbeat:IPaddr2): Started snx11000n001
Resource Group: snx11000n000_md64-group
snx11000n000_md64-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n000_md64-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n000_md64-stop (ocf::heartbeat:XYSTOP): Started snx11000n001
Resource Group: snx11000n000_md67-group
snx11000n000_md67-raid (ocf::heartbeat:XYRAID): Started snx11000n001
snx11000n000_md67-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001
snx11000n000_md67-stop (ocf::heartbeat:XYSTOP): Started snx11000n001
The above output shows that the resources are running on node 001 (snx11000n001).
Shut Down the Failed MGMT Node and Replace
9. To begin shutting down the affected node, run:
[MGMT]$ cscli power_manage -n failed_server_nodename --power-off
If the failed server node does not power down after the power-off command, there are two options to shut down the node:
10. From the active MGMT node, run:
[MGMT]$ pm -0 failed_server_nodename
11. If the node is still powered on and the above command has failed, press and hold the power button on the front panel of the failed server node for at least six seconds.
12. Verify that the failed server node is powered off:
[MGMT]$ pm -q
For example:
on:       snx11000n0[01-05]
off:      snx11000n000
unknown:
13. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods.
14. Disconnect the cables attached to the failed server node and note where each cable attaches to the
enclosure to ensure that the cables can be connected properly during re-installation.
IMPORTANT: Be sure to note the port that each cable is attached to, so the same cable connections
can be made after the new server node is installed. Refer to the cable reference guide attached to the
rack’s left hand rear door.
15. Remove the failed server node and install the replacement server node.
a. Push the green latch to release the failed server node, while using the handle to pull it from the chassis.
Figure 41. Removing the Server Node
b. Install the new server node by inserting it into the empty bay at the rear of the enclosure. It may be
necessary to firmly push the node to fully seat it.
There will be an audible click as the node seats.
Figure 42. Installing the Server Node
16. Connect the cables to the new node, except for the cables that go to the first two NIC ports of the add-in
Quad-port NIC card and the Infiniband adapter, based on the notes made in Step 14 on page 94, and the
cable reference guide attached to the rack’s left hand rear door.
17. Connect the console keyboard and monitor, and then press the power button on the front of the 2U quad
server for the server node that was replaced.
Set MGMT Node IPMI Address and Record MAC Address
18. Retrieve the BMC IP Address, Subnet Mask and Gateway IP address obtained in Step 5 on page 90, from the
output of the grep affected_nodename-ipmi /etc/hosts command, for the server node being
replaced.
a. Press F2.
b. Select BIOS.
c. Select the Server Management tab.
d. Enter the data in the Baseboard LAN configuration section.
Figure 43. Server Management Tab Screen
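Once the node is booted, the BMC LAN settings entered in the BIOS can be cross-checked from the operating system. A minimal sketch, assuming ipmitool is available on the node and that the BMC uses LAN channel 1 (the channel number is an assumption):
[new MGMT]$ sudo ipmitool lan print 1 | grep -E 'IP Address|Subnet Mask|Default Gateway'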
19. On the new server node, record the MAC address from BIOS.
a. Press F2.
b. Select the Advanced tab.
c. Select PCI Configuration.
d. Select NIC Configuration.
e. Record the MAC address for “IOM1 Port1 MAC Address” and “IOM1 Port2 MAC Address”.
f. Press F10 to save changes and exit BIOS.
Figure 44. Advanced Screen: MAC Address
20. With the first two NIC ports of the add-in Quad-port NIC card and InfiniBand cables still disconnected, allow
the replacement node to boot to a login screen.
Figure 45. Advanced Screen – NIC Configuration
21. Log in as user ‘admin’, using the cluster administrator password.
22. Delete the udev rules file. The new updated udev rules rebuild during reboot:
[new MGMT]$ sudo rm /etc/udev/rules.d/70-persistent-net.rules
23. Shut down the new server node:
[new MGMT]$ sudo sync; sudo ipmitool chassis power off
24. Connect the first two NIC port (eth0 and eth1) cables to the replacement node. Do not reconnect the
Infiniband/40GbE cable yet.
25. If not already done, log in to the active MGMT node:
[MGMT]$ ssh -l admin active_MGMT_node
26. Save the t0db databases:
[MGMT]$ mkdir -p "/home/admin/$(date +%F)_t0db"
[MGMT]$ sudo mysqldump t0db --ignore-table t0db.logs --ignore-table t0db.be_command_log > "/home/admin/$(date +%F)_t0db/t0db_bkup.sql" && sudo mysqldump -d t0db logs be_command_log >> "/home/admin/$(date +%F)_t0db/t0db_bkup.sql"
27. Save the mysql databases:
[MGMT]$ mkdir -p "/home/admin/$(date +%F)_mysql"
[MGMT]$ sudo mysqldump mysql > "/home/admin/$(date +%F)_mysql/mysql.sql"
28. Save the mylogin.cnf password file:
[MGMT]$ sudo cp /root/.mylogin.cnf "/home/admin/$(date +%F)_t0db"
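Should the saved databases ever need to be restored, a dump written by mysqldump can be loaded back with the mysql client. A minimal sketch, assuming the backup path created above (substitute the date the backup was taken for $(date +%F)):
[MGMT]$ sudo mysql t0db < "/home/admin/$(date +%F)_t0db/t0db_bkup.sql"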
29. Check the current database:
[MGMT]$ sudo mysql t0db -e "select * from netdev where hostname='nodename'"
For example:
[snx11000n001]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n000'"
+----+------------------------------+-------------------+------------+------------+---------+------------+
| id | node_id                      | mac_address       | ip_address | network_id | if_name | hostname   |
+----+------------------------------+-------------------+------------+------------+---------+------------+
|  7 | snx11000n0:Node-rack1-38U1-A | 00:1E:67:57:25:28 | 172.16.2.2 |          3 | eth0    | snx11000n0 |
+----+------------------------------+-------------------+------------+------------+---------+------------+
30. Update the database for eth0 of the new server node using the MAC address obtained in the BIOS for IOM1
Port1 MAC Address. In the following example, new_MAC is the MAC recorded in step 19.e on page 96:
[MGMT]$ sudo mysql t0db -e "update netdev set mac_address='new_MAC' where
if_name='eth0' and hostname='nodename'"
For example:
[MGMT]$ sudo mysql t0db -e "update netdev set mac_address='00:1E:67:6B:5D:70'
where if_name='eth0' and hostname='snx11000n000'"
31. Update the database for eth1 of the new server node using the MAC address obtained in the BIOS for the
IOM1 Port2 MAC Address. In the following example, new_MAC is the MAC recorded in step 19.e on page 96
[MGMT]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where
hostname='nodename') as b set a.mac_address='new_MAC' where a.if_name='eth1'
and a.node_id=b.node_id"
For example:
[snx11000n001]$ sudo mysql t0db -e "update netdev as a, (select * from netdev
where hostname='snx11000n000') as b set a.mac_address='00:1E:67:39:D6:91'
where a.if_name='eth1' and a.node_id=b.node_id"
32. Verify the database change for eth0:
[MGMT]$ sudo mysql t0db -e "select * from netdev where hostname=' nodename'"
For example:
[snx11000n001]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n000'"
+----+------------------------------+-------------------+------------+------------+---------+--------------+
| id | node_id                      | mac_address       | ip_address | network_id | if_name | hostname     |
+----+------------------------------+-------------------+------------+------------+---------+--------------+
|  7 | snx11000n0:Node-rack1-38U1-A | 00:1E:67:6B:5D:70 | 172.16.2.2 |          3 | eth0    | snx11000n000 |
+----+------------------------------+-------------------+------------+------------+---------+--------------+
Note that the mac_address column now shows the new MAC address, unlike the example in Step 29.
33. Verify the database changes for eth1:
[admin@n000]$ sudo mysql t0db -e "select * from netdev as a, (select * from
netdev where hostname='<nodename>') as b where a.if_name='eth1' and
a.node_id=b.node_id"
For example:
[snx11000n001]$ sudo mysql t0db -e "select * from netdev as a, (select * from netdev where
hostname='snx11000n000') as b where a.if_name='eth1' and a.node_id=b.node_id"
+----+------------------------------+-------------------+-------------+------------+---------+-------------------+
| id | node_id                      | mac_address       | ip_address  | network_id | if_name | hostname          |
+----+------------------------------+-------------------+-------------+------------+---------+-------------------+
|  8 | snx11000n0:Node-rack1-38U1-A | 00:1E:67:6B:5D:71 | 169.254.0.1 |          1 | eth1    | snx11000n000-eth1 |
+----+------------------------------+-------------------+-------------+------------+---------+-------------------+
34. Update Puppet on the active MGMT node to use the new MAC address:
[MGMT]$ sudo /opt/xyratex/bin/beUpdatePuppet -s -g mgmt
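To spot-check that the new MAC address has propagated, it should now appear in the DHCP server configuration managed on the MGMT node. A minimal sketch, assuming the conventional /etc/dhcp/dhcpd.conf location (the path is an assumption and may differ by release):
[MGMT]$ sudo grep -i '00:1E:67:6B:5D:70' /etc/dhcp/dhcpd.conf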
35. Change the STONITH settings to prevent nodes from shutting one another down:
[MGMT]$ sudo crm configure property stonith-enabled=false
36. Power on the new server node:
[MGMT]$ cscli power_manage -n replacement_nodename --power-on
37. Monitor the new server node to confirm that HA is now fully configured. Both the new server node and HA
partner node should be online:
[MGMT]$ sudo ssh new_server_nodename crm_mon -1 | grep Online
38. Reconnect the Infiniband/40GbE cable to the new server node. There should no longer be any cables
disconnected from the new server node.
39. Re-enable STONITH:
[MGMT]$ sudo crm configure property stonith-enabled=true
40. If the system was not fully power cycled off and on, go to Step 42 on page 101.
41. Restart Lustre:
[MGMT]$ cscli mount -f filesystem_name
For example:
mount: MGS is starting...
mount: MGS is started!
mount: No resources found on nodes snx11000n007 for "snx11000n" file system
mount: starting ssetest on snx11000n[102-103]...
mount: starting ssetest on snx11000n[104-105]...
mount: No resources found on nodes snx11000n[100-101] for "snx11000n" file system
mount: ssetest is started on snx11000n[102-103]!
mount: ssetest is started on snx11000n[104-105]!
mount: File system ssetest is mounted.
42. Fail back the resources for the new server node:
[MGMT]$ cscli failback -n new_server_nodename
43. Confirm that the failback operation completes and that the resource(s) are moved back to the new server
node:
[MGMT]$ sudo ssh new_server_nodename crm_mon -1
Replace a Quad Server MGS or MDS Node
Prerequisites
Part number:
100900601: Server, Sonexion Dual E5-2680 64GB MDS/MGS Node FDR
Time:
1 hour
Interrupt level:
Interrupt (requires taking the Lustre file system offline)
Tools:
ESD strap, shoes, and garment
Console with monitor and keyboard (or PC with a serial COM port configured for 115.2Kbs,
8 data bits, no parity, and one stop bit)
About this task
One 2U quad server and either one 2U24 enclosure or one 5U84 enclosure are bundled in the MMU. Each server
node hosts an MMU node: a primary MGMT node, secondary MGMT node, MGS node and MDS node. The
following procedure describes the replacement of the MGS and MDS nodes, including steps to replace the failed
node and return the system to normal operations.
Subtasks:
● Shut Down MGS or MDS Node and Replace
● Set MGS/MDS Node IPMI Address and Record MAC Address
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed server node is known, go to Step 4 on page 104.
2. Log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
3. Do one of the following:
● Access CSSM and use the Health tab to identify the malfunctioning server node.
● At the front of the rack, look for a System Status LED (amber) or a dark LED on the left and right control panels of the 2U quad server, as shown in the following figure and table.
The following figure and table show the mapping of server node names.
Figure 46. Quad Server Control Panels
Table 11. System Status LED Descriptions
LED Color  Condition  Description
Green      On         System Ready/No Alarm.
Green      Flashing   System Ready, but degraded: redundancy lost, such as a power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber      On         Critical Alarm: critical power module failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber      Flashing   Non-Critical Alarm: redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-          Off        Power off: system unplugged. Power on: system power off and in standby, no prior degraded/non-critical/critical state.
Figure 47. Quad Server: Rear Component Identification
Table 12. Quad Server Node Designations
Logical   Physical  Function
Node 000  Node 4    MGMT Primary
Node 001  Node 3    MGMT Secondary
Node 002  Node 2    MGS
Node 003  Node 1    MDS
4. Record the BMC IP, Mask, and Gateway IP addresses of the failed server node:
[admin@n000]$ grep node-ipmi /etc/hosts
This command produces the following output:
[admin@snx11000n000 ~]$ grep snx11000n002-ipmi /etc/hosts
172.16.0.103    snx11000n002-ipmi
The Mask and Gateway address settings are the same for all nodes:
Mask = 255.255.0.0
Gateway = 172.16.2.1
5. Check the power cords plugged into the power distribution strip. In some Sonexion configurations, these
cords will block access to the 2U quad server, while the 2U quad server can be freely accessed in other
configurations.
If the cords block access to the 2U quad server, go to Step 5.a on page 105 to reposition them as necessary.
If they do not block access to the 2U quad server, go to Shut Down MGS or MDS Node and Replace.
Figure 48. Power Distribution Strip With Cables in Top Two Sockets
a. Power down the Sonexion system as described in Power Off Sonexion 2000 or Power Off the Sonexion
2000 via the CSSM GUI.
b. Remove the power cords from the power distribution strip.
c. Re-install the power cords so that no sockets block removal of the failed server node in the 2U quad server.
Figure 49. Power Distribution Strip with Top Two Sockets Open
d. Power on the Sonexion system as described in Power On Sonexion 2000.
Shut Down MGS or MDS Node and Replace
6. Fail over the failed node's resources to its HA partner node:
[admin@n000]$ cscli failover -n affected_node
Verify that the failover operation was successful:
[admin@n000]$ ssh partner_node sudo crm_mon -1
The following is an example of a successful failover:
[snx11000n000 ~]# ssh snx11000n003 sudo crm_mon -1
[sudo] password for admin:
============
Last updated: Tue Aug 5 11:07:45 2014
Last change: Tue Aug 5 11:01:45 2014 via crm_resource on snx11000n002
Stack: Heartbeat
Current DC: snx11000n003 (0b53f7df-3132-4cb2-b0a4-b6fc753cfcdd) - partition with quorum
Version: 1.1.6.1-6.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
28 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n002-1-ipmi-stonith (stonith:external/ipmi): Started snx11000n002
snx11000n002-2-ipmi-stonith (stonith:external/ipmi): Started snx11000n002
snx11000n003-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n003
snx11000n003-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n003
Clone Set: cln-kdump-stonith [kdump-stonith]
Started: [ snx11000n002 snx11000n003 ]
Clone Set: cln-ssh-10-stonith [ssh-10-stonith]
Started: [ snx11000n002 snx11000n003 ]
Clone Set: cln-ssh-stonith [ssh-stonith]
Started: [ snx11000n002 snx11000n003 ]
Clone Set: cln-phydump-stonith [phydump-stonith]
Started: [ snx11000n002 snx11000n003 ]
Clone Set: cln-last-stonith [last-stonith]
Started: [ snx11000n002 snx11000n003 ]
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003
snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002
baton (ocf::heartbeat:baton): Started snx11000n002
Clone Set: cln-diskmonitor [diskmonitor]
Started: [ snx11000n003 snx11000n002 ]
snx11000n003_ibstat (ocf::heartbeat:ibstat): Started snx11000n003
snx11000n002_ibstat (ocf::heartbeat:ibstat): Started snx11000n002
Resource Group: snx11000n003_md66-group
snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003
snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003
snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003
snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003
Resource Group: snx11000n003_md65-group
snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n003
snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003
snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n003
Connection to snx11000n003 closed.
The above output shows that the resources have started on the HA partner (node n003). Proceed to the next step to power off the failed server node.
7. Shut down the failed server node:
[admin@n000]$ cscli power_manage -n failed_server_nodename --power-off
If the failed server does not power down after the power-off command, there are two options to shut the node down:
a. From the primary MGMT node run:
[admin@n000]$ pm -0 failed_server_nodename
b. If the failed server is still powered on and the above command has failed, press and hold the power button, for at least 6 seconds, on the front panel of the failed server.
c. Verify that the failed server is powered off:
[admin@n000]$ pm -q
For example:
on:      snx11000n[101-105]
off:     snx11000n000
unknown:
8. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods.
9. Disconnect the cables attached to the failed server node and note where each cable attaches to the
enclosure to ensure the cables can be connected properly during re-installation.
IMPORTANT: Note the port that each cable is attached to, so the same cable connections can be
made after the new server is installed. Refer to Sonexion 2000 Field Installation Guide in the section
"Internal Cabling".
10. Remove the failed server node and install the new server node.
a. Push the green latch to release the failed server node, while using the handle to pull it from the chassis.
Figure 50. Removing the Server Node
b. Install the new server node by inserting the server into the empty bay at the rear of the enclosure. It may be necessary to firmly push the node to fully seat it.
There will be an audible click as the server node seats.
Figure 51. Installing the Server Node
11. Connect the cables to the new server node, based on the notes made in step 4 on page 104 and information in Sonexion 2000 Field Installation Guide in the section "Internal Cabling".
12. Connect the console keyboard and monitor, and then press the power button on the front of the 2U quad
server for the node that was replaced.
Set MGS/MDS Node IPMI Address and Record MAC Address
13. Retrieve the BMC IP Address, Subnet Mask and Gateway IP Address obtained in step 4 on page 104, from
the output of the grep node-ipmi /etc/hosts command, for the server node being replaced.
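For example, if the node being replaced is snx11000n002 (the hostname and address shown here are illustrative; use the values recorded for your system):
[admin@n000]$ grep snx11000n002-ipmi /etc/hosts
172.16.1.5    snx11000n002-ipmi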
a. Press F2.
b. Select BIOS.
c. Select the Server Management tab.
d. Enter the data in the Baseboard LAN configuration section.
Figure 52. Server Management Screen
14. On the new server node, record the MAC address from BIOS.
a. Press F2.
b. Select the Advanced tab.
c. Select PCI Configuration.
d. Select NIC Configuration.
e. Record the MAC address for Onboard NIC1 Port1 MAC Address and Onboard NIC1 Port2 MAC
Address.
f. Press F10 to exit BIOS.
Figure 53. Advanced Screen: NIC Configuration
15. Press and hold the power button on the new server node until it powers off.
IMPORTANT: The new server node must not be left powered on at this point.
16. If not already done, log in to the primary MGMT node:
[Client]$ ssh -l admin primary_MGMT_node
17. Check the current database:
[admin@n000]$ sudo mysql t0db -e "select * from netdev where hostname='nodename'"
For example:
[snx11000n000]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n002'"
+----+-----------------+-------------------+------------+------------+---------+--------------+
| id | node_id         | mac_address       | ip_address | network_id | if_name | hostname     |
+----+-----------------+-------------------+------------+------------+---------+--------------+
| 32 | Node-R1C1-42U-C | 00:1E:67:55:B2:9A | 172.16.3.5 | 2          | eth0    | snx11000n002 |
+----+-----------------+-------------------+------------+------------+---------+--------------+
18. Update the database for eth0 of the new server node using the MAC address obtained in the BIOS for Onboard NIC1 Port1 MAC Address. In the following example, new_MAC is the address recorded in step 14.e on page 109:
[admin@n000]$ sudo mysql t0db -e "update netdev set mac_address='new_MAC' where if_name='eth0' and hostname='nodename'"
For example:
[admin@n000]$ sudo mysql t0db -e "update netdev set mac_address='00:1E:
67:39:D6:90' where if_name='eth0' and hostname='snx11000n002'"
19. Update the database for eth1 of the new server node using the MAC address obtained in the BIOS for Onboard NIC1 Port2 MAC Address. The eth1 row carries no hostname entry, so it is located through the node_id it shares with the eth0 row. In the following example, new_MAC is the address recorded in step 14.e on page 109:
[admin@n000]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where hostname='nodename') as b set a.mac_address='new_MAC' where a.if_name='eth1' and a.node_id=b.node_id"
For example:
[admin@n000]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where hostname='snx11000n002') as b set a.mac_address='00:1E:67:39:D6:91' where a.if_name='eth1' and a.node_id=b.node_id"
20. Verify the database change for eth0:
[snx11000n000]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n002'"
For example:
+----+-----------------+-------------------+------------+------------+---------+--------------+
| id | node_id         | mac_address       | ip_address | network_id | if_name | hostname     |
+----+-----------------+-------------------+------------+------------+---------+--------------+
| 32 | Node-R1C1-42U-C | 00:1E:67:39:D6:90 | 172.16.3.5 | 2          | eth0    | snx11000n002 |
+----+-----------------+-------------------+------------+------------+---------+--------------+
Note that the MAC address now differs from the one shown in the example in step 17 on page 110.
21. Verify the database change for eth1:
[admin@n000]$ sudo mysql t0db -e "select * from netdev as a, (select * from
netdev where hostname='nodename') as b where a.if_name='eth1' and
a.node_id=b.node_id"\
For example:
[snx11000n000]$ sudo mysql t0db -e "select * from netdev as a, (select * from netdev where hostname='snx11000n002') as b where a.if_name='eth1' and a.node_id=b.node_id"
+----+-----------------+-------------------+------------+------------+---------+----------+
| id | node_id         | mac_address       | ip_address | network_id | if_name | hostname |
+----+-----------------+-------------------+------------+------------+---------+----------+
| 33 | Node-R1C1-42U-C | 00:1E:67:39:D6:91 | NULL       | NULL       | eth1    | NULL     |
+----+-----------------+-------------------+------------+------------+---------+----------+
Note that the MAC address now differs from the one shown in the example in step 17 on page 110.
22. Delete the tftpboot files for the new server node:
[admin@n000]$ sudo ssh nfsserv "rm -rf /tftpboot/nodes/affected_nodename"
For example:
[snx11000n000 ~]$ sudo ssh nfsserv "rm -rf /tftpboot/nodes/snx11000n002"
23. Update Puppet on the MGMT nodes to use the new MAC address:
[admin@n000]$ sudo /opt/xyratex/bin/beUpdatePuppet -s -g mgmt
24. Log in to the active MGS/MDS node:
[admin@n000]$ ssh active_MDS/MGS_node
25. Change the STONITH settings to prevent nodes from shutting one another down:
[MDS/MGS]$ sudo crm configure property stonith-enabled=false
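To confirm that the change took effect, one option is to display the live cluster configuration and filter for the property (a quick sketch using the same crm shell as above):
[MDS/MGS]$ sudo crm configure show | grep stonith-enabled
The output should include stonith-enabled=false.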
26. Log out from the active MDS/MGS node:
[MDS/MGS]$ exit
27. Power on the new server node:
[admin@n000]$ cscli power_manage -n new_server_nodename --power-on
28. Monitor the new server node to confirm that HA is now fully configured. Both the active MGS/MDS node and
its active partner node should be Online:
[admin@n000]$ sudo ssh replacement_nodename crm_mon -1 | grep Online
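For example, output similar to the following (node names are illustrative, matching the earlier examples) indicates that both HA partners are online:
[admin@n000]$ sudo ssh snx11000n002 crm_mon -1 | grep Online
Online: [ snx11000n002 snx11000n003 ]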
29. Log in to the active MGS/MDS node:
[admin@n000]$ ssh active_MDS/MGS_node
30. Re-enable STONITH:
[MDS/MGS]$ sudo crm configure property stonith-enabled=true
31. Log out from the active MDS/MGS node:
[MDS/MGS]$ exit
32. If the system was not fully power-cycled off and on, go to step 34 on page 113.
33. Restart Lustre:
[admin@n000]$ cscli mount -f filesystemname
For example:
mount: MGS is starting...
mount: MGS is started!
mount: No resources found on nodes snx11000n07 for "snx11000n" file system
mount: starting snx11000n on snx11000n[102-103]...
mount: starting snx11000n on snx11000n[104-105]...
mount: No resources found on nodes snx11000n[100-101] for "snx11000n" file system
mount: ssetest is started on snx11000n[102-103]!
mount: ssetest is started on snx11000n[104-105]!
mount: File system snx11000n is mounted.
34. Fail back the resources for the new server node:
[admin@n000]$ cscli failback -n new_server_nodename
35. Confirm that the operations are complete and that the resources have moved back to the new server node:
[admin@n000]$ sudo ssh new_server_nodename crm_mon -1
Replace a Quad Server Chassis
Prerequisites
Part number
100886701: Server, Sonexion 2U Quad MDS + MGS FDR, FRU- fully populated
The above part number applies to a complete quad server including internal components. If
only the chassis is defective, this complete assembly is ordered and the four server modules
and disk drives are swapped between the two chassis, so that the original servers and disks
remain with the system.
Time
2 hours
Interrupt level
Interrupt (requires taking the Lustre filesystem offline.)
Tools
● ESD strap, boots, garment, or other approved methods
● #1 and #2 Phillips screwdriver
● Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity and 1 stop bit)
About this task
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U quad server contains 4 server nodes, 2 PSUs, and disk drives.
The MMU’s server nodes host the primary and secondary MGMT nodes, MGS and MDS nodes. The system's
High Availability architecture provides that if one of the server nodes goes down, its resources migrate to its HA
partner node so ClusterStor operations continue without interruption.
In this procedure, only the defective chassis is replaced; all other components are re-used in the new MMU
chassis.
The Lustre file system will be unavailable during this procedure. Disconnect all clients before continuing.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Internal components need to be removed from the replacement chassis because only the chassis is to be
replaced.
Procedure
1. If the location of the faulty quad server is not known, do one of the following:
● Access the CSSM GUI and use the Health tab to identify the faulty server, or
● At the front of the rack(s), look for the 2U quad server with its System Status LED on (amber) or dark LEDs on the left and right control panels (see following figure). Server node LED descriptions are given in the table after the figure. The physical and logical layout of server nodes in a 2U quad server, as viewed at the back of the rack(s), is shown in the figure and table following the LED descriptions.
Figure 54. Quad Server Control Panel
Table 13. System Status LED Descriptions

LED Color | Condition | Description
Green     | On        | System Ready/No Alarm
Green     | Flashing  | System Ready, but degraded: redundancy lost, such as a power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber     | On        | Critical Alarm: critical power module failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber     | Flashing  | Non-Critical Alarm: redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-         | Off       | Power off: system unplugged. Power on: system powered off and in standby, no prior degraded/noncritical/critical state.
Figure 55. Quad Server Rear Component Identification
Table 14. Quad Server Node Designations

Logical  | Physical | Function
Node 000 | Node 4   | MGMT Primary
Node 001 | Node 3   | MGMT Secondary
Node 002 | Node 2   | MGS
Node 003 | Node 1   | MDS
2. Disconnect all clients.
3. Log in to the primary MGMT server node as admin:
[Client]$ ssh -l admin primary_MGMT_node
4. Unmount the Lustre file system:
[admin@n000]$ cscli unmount -f fsname
Where fsname is the file system name.
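For example, if the file system is named snx11000n, as in the other examples in this document:
[admin@n000]$ cscli unmount -f snx11000n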
5. Power off the system as described in Power Off Sonexion 2000.
6. Once all of the nodes are shut down (power button LEDs on the control panels at the front of the 2U server
are not lit), power off the 2U24 (or 5U84) MMU storage enclosure by turning off the power switches on the
PSUs/PCMs (at the back of the rack).
● In a 2U24 enclosure, the power switches are located on the PCMs.
● In a 5U84 EBOD enclosure, the power switches are located on the PSUs.
7. Disconnect the power cord from each PSU.
8. Disconnect the cables attached to each of the four server nodes. Make a note of each cable location and tag
each cable. Cable diagrams are provided in the Sonexion 2000 Quick Start Guide included with the system.
9. Remove two PSUs from the 2U quad server:
a. Push the green latch to release the PSU, as shown in the following figure.
Figure 56. Remove PSU from Quad Server
b. Using the handle, remove the power supply module by carefully pulling it out of the power supply cage.
Store the PSU in a safe location.
c. Repeat for the second PSU.
10. Remove the four server nodes from the quad server:
a. Note the location of each server node when it is removed from the chassis; they must be re-installed in
exactly the same locations.
b. For each server node, push the green latch to release the node, while using the handle to pull it from the
chassis.
c. Place each node in a safe location, and repeat until all four server nodes have been removed.
Figure 57. Remove the Server Node
11. Remove disk drives from the quad server:
a. Note the location (slot) of each disk drive when it is removed from the chassis; they must be re-installed in exactly the same position.
b. Remove each disk drive by pressing the green button and opening the lever. Pull to remove the disk.
Store the drives in a safe location.
Figure 58. Remove a Disk
Replace Chassis and Install Components
12. Loosen the two fasteners securing the server chassis to the front of the rack.
13. With a second person, slide the 2U quad server out of the rack and depress the two safety locks on the rails
to completely remove the chassis.
14. Place the server on a sturdy bench.
15. With an assistant, slide the new server chassis into the rack and depress the two safety locks on the rails to completely seat the chassis.
16. Secure the server chassis to the front of the rack cabinet with the two fasteners.
17. Refer to notes from step 11 on page 117 to install the disk drives in the correct slots. Use the following steps
for each drive (shown in the following figure):
a. For each drive, verify that the disk drive handle latch is released and in the open position, then slide the drive carrier into the enclosure until it stops.
b. Seat the disk drive by pushing up the handle latch and rotating it to the closed position. There will be an audible click as the handle latch engages.
Figure 59. Install the Disk Drive in the Quad Server
18. Refer to notes from step 10 on page 117 to ensure each server node is installed in the correct location. Install
each server node by inserting the server into the empty bay at the rear of the enclosure, as shown in the
following figure. A firm push on the module may be needed to fully seat it into the bay. There will be an
audible click as the server seats.
Figure 60. Install the Server Node
19. Insert the PSU into the empty power supply cage, as shown in the following figure. A firm push may be
needed on the module to fully insert it into the cage. There will be an audible click as the PSU seats.
Figure 61. Insert the Power Supply Unit
20. Connect the power cords to the PSUs.
21. Connect all the cables to the server nodes.
Be sure to reconnect the cables to their original ports. Refer to notes from step 8 on page 116 and to the
Sonexion 2000 Quick Start Guide for cabling information.
22. Power on the 2U24 or 5U84 MMU.
23. Power on the Sonexion system, as described in the procedure for your system and software release.
24. Log in to the primary MGMT server node:
[Client]$ ssh -l admin primary_MGMT_node
25. Start Lustre resources using CSCLI or the GUI:
● From the primary MGMT node, run this command (see the example after this list):
$ cscli mount -c cluster_name -f fsname
● If working from CSSM, perform these steps:
1. Click the Node Control tab.
2. Select all nodes in the file system.
3. Click Selected Nodes.
4. Click Start Lustre in the drop-down menu.
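As referenced above, the CSCLI form of this step for a cluster and file system both named snx11000n (illustrative names; substitute the values for your system) would be:
[admin@n000]$ cscli mount -c snx11000n -f snx11000n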
When all resources are running, the procedure is complete.
Replace a Cabinet Network Switch
Prerequisites
Part number
100900900: Switch Assy, Mellanox IB 36-port (Mellanox 6025, unmanaged)
Time
1.5 hours
Interrupt level
Failover (can be applied to a live system with no service interruption, but requires failover/
failback)
Tools
● ESD strap (recommended)
● Console with monitor and USB keyboard or a PC with a serial port configured for 115.2 Kbps
● Serial cable
● #2 Phillips screwdriver
About this task
Use this procedure to remove and replace a failed data network switch (InfiniBand). This includes steps to remove
and replace the failed network switch, configure the new switch (if it is not already configured), update firmware
on the new network switch, as required, and return the Sonexion system to normal operation.
IMPORTANT: Check with Cray Hardware Product Support to determine if the switch requires a specialized configuration file.
Subtasks:
● Replace Switch
● Check Switch Installation
The Sonexion system contains two data network switches, known as Network Switch 0 (lower switch) and
Network Switch 1 (upper switch), stacked one on top of the other in the rack.
The dual data network switches manage I/O traffic and provide network redundancy throughout the Sonexion
system.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
CAUTION: Before replacing the upper network switch, verify that the dual management switches
(positioned above it in the rack) are securely attached to the rack with no damaged hardware, to avoid
problems when the upper network switch is removed.
Procedure
1. If the location of the failed network switch in the rack is not known, check the status of the connected (cabled)
ports on both network switches. Look for the switch with inactive ports (LEDs are dark). In an operational
network switch, all connected ports have valid links with green LEDs.
2. Identify which nodes are affected by the switch failure by tracing the cabled port connections from the failed
network switch to components.
Refer to the Sonexion 2000 Quick Start Guide, which is included with the system.
● Cables attaching to the 2U quad server affect the primary and secondary MGMT, MGS and MDS nodes.
● Cables attaching to an SSU controller (OSS) affect OSS nodes. Each OSS controller hosts an OSS node.
● Cables attached to the management switches, network switches, and to the Manufacturing equivalent of an Enterprise Client Network (ECN) affect the CNG nodes.
3. If the status of the affected nodes’ resources is already known, skip to Replace Switch. Use the following
steps to verify that the affected nodes’ resources have failed over, either via CSSM or CLI.
To check node status using the CSSM GUI:
a. On the Node Control tab, check whether the affected nodes’ resources have failed over to their HA
partner nodes.
b. If the node resources have failed over, go to Replace Switch.
c. If the node resources have not failed over, select the affected nodes and manually fail over their resources to the HA partner nodes. When the Node Control tab indicates that all node resources have successfully failed over, go to Replace Switch.
4. The remaining steps in this topic show the use of the CLI. To check node status:
a. Log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
b. If the affected node is an MGS, MDS, or OSS node, SSH into the node:
[admin@n000]$ ssh MGS/MDS/OSS_nodename
c. Determine if the node's resources failed over to its HA partner node:
[admin@mgs_mds_oss]$ sudo crm_mon -1
5. If the node's resources have failed over, log in to the remaining affected nodes via SSH and use the
crm_mon -1 command to check if their resources have failed over.
When the resources of all affected nodes have successfully failed over, go to Replace Switch.
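One way to spot-check several affected nodes at once is to combine this check with the pdsh/dshbak tools used later in this procedure (a sketch; the node list is illustrative):
[admin@n000]$ pdsh -w snx11000n[002-003] "sudo crm_mon -1 | grep Online" | dshbak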
6. If the node's resources have not failed over:
a. Return to the primary MGMT node, if necessary.
b. Fail over the node's resources to its HA partner node:
[admin@n000]$ cscli failover -n affected_node
Where affected_node is the MGMT0, MGS, MDS, or OSS node whose resources are to fail over.
c. On the HA partner, verify that the node's resources failed over to the HA partner node:
[admin@HA_partner_node]$ sudo crm_mon -1
7. Manually fail over the remaining nodes' resources to their HA partner nodes. When the resources of all affected nodes have successfully failed over, go to Replace Switch.
Following is sample output showing a node (snx11000n003) with its resources failed over to its HA partner
node (snx11000n002).
Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003
Stack: Heartbeat
Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
12 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003
snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003
snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002
baton (ocf::heartbeat:baton): Started snx11000n002
Resource Group: snx11000n003_md66-group
    snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
    snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
Resource Group: snx11000n003_md65-group
    snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
    snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
Replace Switch
The network switches are installed facing the back of the rack, one on top of the other. The front panel of
each network switch (the connector side) faces the back of the rack and the back panel of each network
switch (the power side) faces the front of the rack.
8. Disconnect all QSFP/Infiniband cables from the switch.
9. Disconnect the power cord from the switch.
10. Remove the four screws securing the switch to the rack.
11. Carefully slide the failed switch out of the rack while holding the front of the switch to keep it steady.
12. If the new switch has not yet been unpacked:
a. Place the shipping cartons on a flat surface.
b. Cut all straps securing the cartons.
c. Unpack the switch and accessories from the cartons.
13. Mount the rail kit hardware to the new switch, as shown in the following figure. Using the flat-head Phillips screws (provided), attach the switch slides to the switch, using seven flat-head screws for short switches and seven screws for standard-depth switches.
The hardware is attached so that the switch can be installed facing the back of the rack.
Figure 62. Securing the Rail
14. Carefully install the new switch into the rack, securing it into place with the 4 screws used to secure the failed
switch.
15. At the back of the rack, connect all QSFP/Infiniband cables to the switch.
The following figures show a schematic view of network switch locations (highlighted) and port locations on
those switches.
Figure 63. TOR Network Switches
Figure 64. Network Switches Ports
16. At the front of the rack, connect the power cord to the switch and connect the other end to the PDU.
IMPORTANT: The switch does not have an ON/OFF control. The switch powers on when the power
cord is plugged in and power is applied.
Wait for the switch to power on and complete its boot cycle (approximately 5 minutes).
17. Check the status LEDs and confirm that they show status lights consistent with normal operation.
Figure 65. Status LEDs after Five Minutes
18. Check the status of the connected (cabled) ports in the new network switch.
Check Switch Installation
19. Confirm that all connected ports show a valid link (green LED).
20. Confirm that all ports are active:
[admin@n000]$ pdsh -a ibstatus | dshbak
21. Confirm that the Port 1 Status shows an ACTIVE state and a rate of 56 Gb/sec (FDR).
OSS nodes have two ports; ignore the port 2 status.
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5654
base lid: 0xa
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
22. If port 1’s status shows a not-active state or a rate less than 56 Gb/sec, investigate the ibstatus command
output.
Sample output:
[admin@n000]$ pdsh -a ibstatus | dshbak
---------------
snx11000n000
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:001e:6703:003e:2b28
base lid: 0x1
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
---------------
snx11000n001
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:001e:6703:003e:25b0
base lid: 0x3
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
---------------
snx11000n002
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:001e:6703:003e:2158
base lid: 0x4
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
---------------
snx11000n003
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:001e:6703:003e:0d38
base lid: 0x2
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
---------------
snx11000n004
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5f54
base lid: 0x5
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5f55
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
---------------
snx11000n005
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5f06
base lid: 0xb
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5f07
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
---------------
snx11000n006
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5654
base lid: 0xa
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5655
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
---------------
snx11000n007
---------------
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:55a6
base lid: 0x7
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:55a7
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 10 Gb/sec (4X)
link_layer: InfiniBand
23. Upgrade firmware on the new switch to the version specified below (latest qualified firmware level as of
November 2014).
Table 15. Network Switch Firmware Levels

Vendor/Model          | Firmware/Software | PSID          | .bin File Name
Mellanox SX6025 (FDR) | 9.2.8000          | MT_1010210021 | fw-SwitchX-rel-9_2_8000-MSX6025F_B1-B4.bin
Mellanox SX6025 (FDR) | 9.2.8000          | MT_1010110021 | fw-SwitchX-rel-9_2_8000-MSX6025F_A1.bin
24. Fail back resources on the nodes affected by the network switch replacement, as follows.
a. Trace the cabled port connections from the failed switch to the components (either a 2U quad server or
an SSU controller [OSS]).
b. Fail back the resources for the MGMT nodes, the MGS/MDS nodes, and then the OSS nodes. Verify that
the failbacks were successful.
[admin@n000]$ cscli failback -n node
It can take from 30 seconds to a few minutes for the node's resources to fail back completely.
Following is sample output showing an HA node pair (snx11000n002 and snx11000n003) in online mode
with their local resources assigned to them.
============
Last updated: Mon Jan 14 04:54:52 2013
Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003
Stack: Heartbeat
Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
12 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003
snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003
snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002
baton (ocf::heartbeat:baton): Started snx11000n002
Resource Group: snx11000n003_md66-group
    snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003
    snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003
    snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003
    snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003
Resource Group: snx11000n003_md65-group
    snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
    snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
c. Repeat the preceding two substeps for all nodes affected by the switch replacement.
Once all affected nodes are online with their local resources reassigned to them, the procedure to replace
a failed InfiniBand Mellanox SX6025 network switch is complete.
Replace a Cabinet Network Switch PSU
Prerequisites
● Part number: 100901000: Power Supply, Sonexion for 36-port FDR IB Mellanox Switch
● Time: 30 minutes
● Interrupt level: Failover (can be applied to a live system with no service interruption, but requires failover/failback)
● Tools:
○ ESD strap
○ #2 Phillips screwdriver
Access requirements: This procedure has been written specifically for use by an admin user and does not require root (super user) access. We recommend that this procedure be followed as written and that the technician not log in as root or perform the procedure as a root user.
About this task
The system contains two high-speed network switches, known as Network Switch 0 (lower switch) and Network
Switch 1 (upper switch), stacked one on top of the other in the rack, as shown in the following figure.
Figure 66. Power and Connector Side Panels
The dual InfiniBand network switches manage I/O traffic and provide network redundancy throughout the system.
CAUTION: PSUs have directional airflows similar to the fan module. The fan module airflow must
coincide with the airflow of all of the PSUs (see the following figure). The switch's internal temperature is
affected if the PSU airflow direction is different from the fan module airflow direction.
Figure 67. Airflow direction
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed PSU is not known, from the front of the rack, look for a Mellanox switch with an illuminated amber status LED. In an operational 36-port switch PSU, the status LED is illuminated green.
CAUTION: Make certain that the PSU that is not being replaced is showing all green, for both the
PSU and status indicators.
Figure 68. Status LED locations
Figure 69. Status LED bar
Table 16. LED Color Status

LED Color   | Status
Solid Green | OK: The power supply is delivering the correct voltage (12 VDC)
Solid Red   | Error: The PS unit is not operational
Off         | No power to the system (neither PS unit is receiving power). If one PS unit is showing green and the second PS unit is unplugged, it will show a red indication.
2. Disconnect the power cord attached to the failed PSU.
3. Grasping the handle with the right hand, push the latch release with the thumb while pulling the handle
outward, as shown in the following figure.
Figure 70. Power Supply Unit Removal Latch
As the PSU unseats, the PSU status indicators turn off.
4. Using the handle, remove the power supply module by carefully pulling it out of the power supply bay.
Figure 71. Removing a Power Supply Unit
5. Make certain the mating connector of the new unit is free of any dirt or obstacles.
6. Insert the PSU by sliding it into the opening until a slight resistance is felt.
7. Continue pressing the PSU until it seats completely.
The latch snaps into place, confirming the proper installation.
8. Insert the power cord into the supply connector.
9. Verify that the PSU indicator is illuminated green. This indicates that the replacement PSU is operational.
If the indicator is not green, repeat the whole procedure to extract and re-insert the PSU.
Replace a Cabinet Management Switch (Brocade)
Prerequisites
Part number

Sonexion Model | Part Number | Description
Sonexion 2000  | 101171100   | Switch Assy, Brocade 24-Port ICX6610 1GBE RJ45 Mgmt FRU
Sonexion 2000  | 101161901   | Switch Assy, Brocade 48-Port ICX6610 1GBE RJ45 Mgmt FRU
Sonexion 900   | 101018600   | Switch, GbE Managed 24-Port Airflow=PS to Port (Brocade)
Sonexion 900   | 101018700   | Switch, GbE Managed 48-Port Airflow=PS to Port (Brocade)

Time
1.5 hours
Interruption level
Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)
Tools
● Phillips screwdriver (#2)
● Console with monitor and keyboard (or PC with a serial port configured for 9600 bps, 8 data bits, no parity and 1 stop bit)
● PC/Laptop with a DB9 serial COM port, mouse, keyboard
● Rollover/Null modem serial cable (DB9 to RJ45 Ethernet)
● Serial terminal emulator program, such as SecureCRT, PuTTY, Tera Term, Terminal Window (Mac), etc.
● ESD strap, boots, garment or other approved methods
Requirements
The size and weight of the Brocade switch requires two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move each switch.
About this task
Use this procedure to remove and replace a failed management switch (24/48-port Brocade), configure the new
switch (if it is not already configured), bring the nodes online, and return the ClusterStor system to normal
operation.
The switch configuration must be done properly to ensure a resilient, redundant, secure, and easily accessible management environment. IP addresses are obtained by the switches via the DHCP server, which allows access for configuration changes and firmware upgrades. The instructions in this procedure can be used for both 24- and 48-port Brocade switches.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Take note of the following:
● The port on the failed switch into which the ISL is plugged
● Whether this is a multi-rack system with port(s) that have inter-rack connections. These ports will not be configured as admin_edge_ports. Leave these blank (unconfigured).
● Which management switch is being replaced: the upper or lower switch. This information will be needed for configuring the new switch.
Procedure
1. Identify the port into which the inter-switch link (ISL) cable is plugged.
IMPORTANT: In a multi-rack Sonexion system, several ports have inter-rack connections. These
ports are not configured as admin_edge_ports; leave them blank (unconfigured).
2. If the location of the failed management switch is not known, check the status of the connected (cabled) ports
or look for an error indicator. The LEDs on inactive ports are off.
In a switch that is operating normally, all connected ports have valid links with green LEDs. No warning or
error LEDs should be lit.
3. If the identity of the nodes affected by the switch failure is already known, go to Verify Failover. Otherwise,
identify the affected nodes by tracing the cabled port connections from the failed network switch to Sonexion
components, and associating cables with nodes as follows:
● Cables attached to the 2U quad server affect the MGMT, MGS, and MDS nodes.
● Cables attached to an OSS controller in an SSU affect OSS nodes. Each OSS controller hosts an OSS node.
Verify Failover
Once the affected nodes are identified, use the following steps to verify that the affected nodes' resources have failed over to their HA partner nodes. If the resources are confirmed to have failed over, go to Remove Failed Management Switch. Determine node status using the GUI or CSCLI:
Check node status on CSSM GUI
1. If CSSM is running, go to the Node Control tab and check if the affected nodes' resources have failed over to their HA partner nodes.
2. If the node resources have failed over, go to Remove Failed Management Switch.
3. If the node resources have not failed over, select the affected nodes and manually fail over their resources to the HA partner nodes.
4. When the Node Control tab indicates that all node resources have successfully failed over, go to Remove Failed Management Switch.
Check node status via CSCLI
The following steps show the use of CSCLI.
4. Log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
5. If the affected node is an MGS, MDS, or OSS node, SSH into the node:
[admin@n000]$ ssh MGS/MDS/OSS_nodename
6. Determine if the node's resources failed over to its HA partner node:
[admin@n000]$ sudo crm_mon -1
7. If the node's resources have failed over, log in to the remaining affected nodes via SSH and use the
crm_mon -1 command to check if their resources have failed over. When the resources of all affected
nodes have successfully failed over, go to Remove Failed Management Switch.
8. If the node's resources have not failed over:
a. Return to the primary MGMT node, if necessary.
b. Fail over the node's resources to its HA partner node:
[admin@n000]$ cscli failover -n node_name
Where node_name is the name of the affected node whose resources fail over to its HA partner.
c. On the HA partner, verify that the node's resources failed over to the HA partner node:
[HA Partner]$ sudo crm_mon -1
9. Manually fail over the remaining resources to the respective HA partner nodes. When the resources of all
affected nodes have successfully failed over, go to Remove Failed Management Switch.
This is sample output showing a node (snx11000n003) with its resources failed over to its HA partner node
(snx11000n002).
============
Last updated: Mon Jan 14 04:54:52 2013
Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003
Stack: Heartbeat
Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
12 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003
snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003
snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002
baton (ocf::heartbeat:baton): Started snx11000n002
Resource Group: snx11000n003_md66-group
    snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
    snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
Resource Group: snx11000n003_md65-group
    snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
    snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
Remove Failed Management Switch
10. Identify the port into which the inter-switch link (ISL) cable is plugged. In a multi-rack Sonexion system,
several ports have inter-rack connections. These ports will NOT be configured as admin_edge_ports; leave
them BLANK (unconfigured).
11. Determine whether the upper or lower management switch is being replaced. You will need this information
when you configure the new switch.
12. At the back of the rack, disconnect all network cables from the failed switch.
For the port connection layout, refer to the Sonexion Field Installation Guide for the applicable model of Sonexion.
13. At the front of the rack, disconnect the power cords from the failed switch.
14. At the back of the rack, remove the four retaining pan-head screws from the front of the failed switch.
15. With a second person, carefully slide the failed switch out of the rack.
On the lower switch, the mounting tabs might catch on the PDU as you remove it from the rack; it will be a
tight fit, but the switch will slide out.
Install New Management Switch
16. If the new switch has not yet been unpacked from the shipping carton(s), follow these steps:
a. Place the shipping carton(s) on a flat surface.
b. Cut all straps securing the carton(s).
c. Unpack the switch and accessories from the carton(s).
17. Using the Phillips head screws (provided), attach the mounting brackets (2) to the sides of the new switch.
One bracket attaches to each side of the switch (in the front). Each mounting bracket requires 4 screws.
18. With a second person, slide the switch into the rack.
19. Align the mounting brackets with the rack holes. Using two pan-head screws with nylon washers, attach each
bracket to the rack.
20. Connect the power cord to the power receptacle on the switch.
The switch does not have an ON/OFF control. The switch powers on when the power cord is plugged in and power is applied. Wait for the switch to power on and complete its boot cycle (approximately 5 minutes). Do not connect the network cables to the new switch; that step will be performed when the switch is configured.
21. Configure the new switch. Refer to Configure a Brocade Management Switch on page 141.
22. Fail back resources on the nodes affected by the switch replacement (MGMT nodes, then MGS/MDS nodes
and finally the OSS nodes).
a. Trace the port connections previously cabled to the failed switch.
b. For each affected node, fail back its resources. Verify that the failback operation was successful.
[admin@n000]$ cscli failback -n affected_node
It may take 30 seconds to a few minutes for the nodes' resources to fail back completely. Following is an
example output showing an HA node pair (snx11000n002 and snx11000n003) in online mode with their
local resources assigned to them.
============
Last updated: Mon Jan 14 04:54:52 2013
Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003
Stack: Heartbeat
Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
12 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003
snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003
snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002
baton (ocf::heartbeat:baton): Started snx11000n002
Resource Group: snx11000n003_md66-group
    snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003
    snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003
    snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003
    snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003
Resource Group: snx11000n003_md65-group
    snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002
    snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
    snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
c. Repeat the preceding two substeps for all nodes affected by the switch replacement.
The procedure to replace a failed management switch (24/48-port Brocade) in the field is complete.
Configure a Brocade Management Switch
About this task
Use this procedure to configure the new Brocade management switch to use DHCP. The following schematic shows the management switches (highlighted) and network switches in the rack (rear view). Perform this procedure only on the new switch installed in Replace a Cabinet Management Switch (Brocade) on page 136.
Figure 72. TOR Management Switches
In this procedure, each rack has two management switches (referred to as management switch 0 and management switch 1).
Perform this procedure only on the switch that is replacing the problematic switch.
Commands used in this procedure are not case-sensitive.
Procedure
1. Connect to the new switch by connecting the rollover cable to the leftmost console port (RJ45), shown in the
following figure, on the new switch and to the serial COM port (DB9) on the PC or Laptop.
Figure 73. Console Port on Brocade ICX 6610-24
On Brocade switches, the console port is coupled with the out-of-band management port (which is not used
by the Sonexion system). Use the following table to determine the console port's location.
Table 17. Console Port Location

Brocade Switch Type   | Console Port Location
ICX 6610-24 (24-port) | Left side of the pair of ports
ICX 6610-48 (48-port) | Right side of the pair of ports
2. Open a terminal session using a PuTTY/SecureCRT emulator program and specify these serial port settings:
Baud rate (bits per second): 9600
Data bits: 8
Stop bits: 1
Parity: None
Flow control: Off
3. When connecting from a Linux host rather than a PuTTY/SecureCRT emulator program, the same serial port settings can be applied with the screen command:
screen /dev/ttyUSB0 cs8,9600
After connecting to the new switch, the terminal screen should resemble the following figure.
Figure 74. Terminal Screen after Connecting to the Switch
After connecting to the new switch, go to the next step.
4. Configure the new switch's hostname as follows, using naming conventions shown below:
enable
configure terminal
hostname switchname
write memory
For switchname, use the following format:
cluster_name-sw_type num-rrack_num
Where:
● cluster_name is the value used in the YAML file, as provided by the customer. Use snx11000n for generic systems where the cluster name is not specified.
● sw_type has these values:
○ ibsw for InfiniBand switches or gesw for 40GbE or 10/40GbE switches
○ mgmtsw for management switches
● num is the switch number; 0 is the lower switch, 1 is the upper switch.
● rack_num is the rack number in which the switch is installed. 0 is the base rack, 1 is expansion rack 1, 2 is expansion rack 2, etc.
Sample hostnames for management switches, where the cluster name is snx11000n:

Sample Hostname      | Description
snx11000n-mgmtsw1-r0 | Upper management switch in the base rack
snx11000n-mgmtsw0-r7 | Lower management switch in expansion rack #7
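Putting the commands and naming convention together, a configuration session for the upper management switch in the base rack of a generic system (substitute your system's cluster name) would be:
enable
configure terminal
hostname snx11000n-mgmtsw1-r0
write memory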
5. Configure the Spanning-Tree as follows.
The primary management switch is the lower switch (0), and the secondary management switch is the upper
switch (1), as shown in the following figure.
Figure 75. TOR Management Switches
Enter the following:
Spanning-tree 802-1w
Spanning-tree 802-1w priority ['0' for primary, '4096' for secondary]
With the exception of the ISL port and any inter-rack link port, the remaining ports are configured this way:
interface ethernet 1/1/1
spanning-tree 802-1w admin-edge-port
interface ethernet 1/1/2
spanning-tree 802-1w admin-edge-port
interface ethernet 1/1/3
spanning-tree 802-1w admin-edge-port
…
This sequence is repeated until all ports are configured, followed by:
write memory
This saves the configuration.
IMPORTANT: Ports used for the ISL link between switches and any inter-rack connection should not have admin-edge-port enabled; leave them blank (unconfigured).
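For example, on the lower (primary) management switch, the beginning of this sequence would look like the following sketch (skipping any port that carries the ISL or an inter-rack link):
spanning-tree 802-1w
spanning-tree 802-1w priority 0
interface ethernet 1/1/1
spanning-tree 802-1w admin-edge-port
interface ethernet 1/1/2
spanning-tree 802-1w admin-edge-port
…
write memory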
6. Set the management IP address of the new switch.
Setting the management IP address enables remote login, should it be necessary to manually update the switch configuration or apply firmware updates. The management IP address is assigned by the DHCP server (172.16.2.2). The following steps ensure the management IP address of the new switch is properly set. Type:
enable
Configure terminal
Ip dhcp-client enable
Ip dhcp-client auto-update enable
Write mem
7. Configure the username and password:
Username admin priv 0 password Sonexion
Enable super-user-password Sonexion
Write mem
8. Enable logging, SSH, SNMP, and NTP; and disable Telnet:
Logging 172.16.2.2
snmp-server community ClusterRead ro
snmp-server community ClusterWrite rw
crypto key generate rsa modulus 1024
ip access-list standard 10
permit 172.16.0.0/16
ssh access-group 10
enable aaa console
aaa authentication login default local
no telnet server
write memory
9. Obtain the IP address of the switch:
SSH@snx11000n-primary# sh ip address
This is sample output:
IP Address      Type      Lease Time
172.16.255.88   Dynamic   308
SSH@snx11000n-primary#
10. Verify that the new switch can be accessed via SSH.
a. Log in to the primary management node via SSH.
b. Access the new switch via SSH:
ssh admin@mgmt_switch_ip_address
c. Enter the switch's password.
d. If a switch prompt appears, the new switch can be accessed via SSH.
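For example, using the address reported in step 9 and the password configured in step 7 (both illustrative):
[admin@n000]$ ssh admin@172.16.255.88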
11. The following error may occur when SSH is used to access the switch:
"aes128-ctr,3des-ctr,aes256-ctr,aes128-cbc,3des-cbc,aes256-cbc,twofish256-cbc,twofish-cbc,twofish128-cbc"
The above text shows the ciphers that the switch supports, while the client supports only the following:
"arcfour256,arcfour,blowfish-cbc"
If the above case occurs, do the following:
a. Add one of the server ciphers to the SSH client configuration on the primary MGMT node (MGMT0), in one of the following files:
~/.ssh/config
OR
/etc/ssh/ssh_config
Following is an example that adds one of the server ciphers to the client configuration:
cd /etc/ssh
[root@localhost-mgmt ssh]# vi ssh_config
# Protocol 2,1
# Cipher 3des
# Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc
# MACs hmac-md5,hmac-sha1,umac-64@openssh.com,hmac-ripemd160
# EscapeChar ~
# Tunnel no
# TunnelDevice any:any
# PermitLocalCommand no
"ssh_config" 64L, 2164C written
b. Remove the "#" comment character from the Ciphers line so that it matches the following entry:
Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc
c. Press the Esc key.
d. Enter wq! and press Enter to exit.
SSH should now work on the Sonexion system.
12. Once the new switch has booted, reconnect the network cables. Refer to the Sonexion 2000 Quick Start
Guide, which is included with the system.
13. Check the status of the connected (cabled) ports.
IMPORTANT: Wait for links to be established on all connected ports (green LEDs).
Replace a Cabinet Management Switch PSU (Brocade)
Prerequisites
Part number
101233400: Power Supply, Sonexion for 24-port ICX6610 1GBE Brocade Switch
Time
30 minutes
Interrupt level
Live (can be applied to a live system with no service interruption)
Tools
● ESD strap
● #2 Phillips screwdriver
About this task
The Sonexion 2000 system contains two Brocade ICX 6610-24-I or ICX 6610-48-I management switches used for
configuration management and health monitoring of all system components. These are the only management
switches used in Sonexion 2000 systems. These switches have dual redundant power supplies, eliminating
management switches as a single point of failure. The management network is private and not used for data I/O
in the cluster.
The Brocade switches have two PSU receptacles at the rear of the switch (see the following figure). Each switch
ships from the manufacturer with one PSU installed, but a second PSU can be installed to provide backup power
in case of a PSU failure and for load-balancing when both PSUs are operational. Each Brocade switch shipped
with a Sonexion 2000 system has two PSUs installed as the standard configuration. PSUs are hot-swappable and
each has one standard power receptacle for the AC power cable.
Figure 76. Rear Panel 24- and 48-port Brocade Switch
Figure 77. Brocade Switch PSU
Precautions
Be sure to have the correct type of PSU before beginning the procedure to replace the PSU.
IMPORTANT: Check with Cray Hardware Product Support to determine if the switch requires a specialized configuration file.
WARNING: When inserting a power module into the switches, do not use unnecessary force. Doing so
can damage the connectors on the rear of the supply and on the midplane.
CAUTION: Check to see that the PSU that is not being replaced is showing all green, for both the PSU
and status indicators.
CAUTION: Make sure that the proper air flow direction will be available when replacing a PSU (see the
following figures) .
Figure 78. Airflow Direction, Front to Back, E-labeled PSU
Figure 79. Airflow Direction, Back to Front, I-labeled PSU
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed PSU is not known, from the back of the rack, locate the switch with the failed PSU
by checking the status of the power status LED.
Figure 80. Management Switch Power Status LEDs
2. Disconnect the power cord attached to the failed PSU.
3. Loosen the two captive screws on the PSU.
4. Using the extraction handle, remove the PSU by carefully pulling it out of the power supply bay.
Figure 81. Remove Management Switch PSU
5. Before opening the package that contains the power supply, touch the bag to the switch casing to discharge
any potential static electricity. Cray recommends using an ESD wrist strap during installation.
6. Remove the PSU from the anti-static shielded bag.
7. Make certain the mating connector of the new unit is free of any dirt or obstacles.
8. Holding the PSU level, guide it into the carrier rails on each side and gently push it all the way into the slot,
ensuring that it firmly engages with the connector.
9. Align the two captive screws with the screw holes in the switch's back panel.
10. Using a screwdriver, gently tighten the captive screws.
11. Insert the power cord into the supply connector.
12. Confirm that the new PSU is powered on and displays a green power LED.
If the PSU power LED is not green, repeat the procedure to extract and re-insert the PSU. If the power LED
is still not green after re-insertion, contact Cray Support. This procedure has an interrupt level of Live, and
powering off this switch could affect the running cluster.
Replace a Cabinet PDU
Prerequisites
Parts

Part Number    Description
100840300      PDU Assy, Sonexion 30A Triple Input US
100840400      PDU Assy, Sonexion 32A Dual Input EU
100927300      PDU Assy, Sonexion 50A Dual Input US
Time
1 hour per PDU. For example, three racks (that is, six PDUs) would take six hours.
Interrupt level
Live (can be applied to a live system with no service interruption)
Tools
Phillips screwdriver (No. 2)
Torx T30
Console with monitor and keyboard, or a PC/laptop with a DB9 serial COM port
configured for 115.2 Kbps
Serial cable, Null Modem (DB9 to DB9), female at both ends
ESD strap
About this task
Use this procedure to remove and replace a failed Power Distribution Unit (PDU) in a Sonexion rack. Each rack
contains two PDUs that are mounted on the left and right rear-facing sides of the cabinet. The specific PDU model
(whether 30A, 32A, 50A or 60A) installed in the rack is determined by the geographical location of the system; all
PDUs are factory-installed with the appropriate line-in cables and plugs.
This procedure has been written specifically for use by an admin user and does not require root (super user)
access. We recommend that this procedure be followed as written and that the technician not log in as root or
perform the procedure as a root user.
Subtask:
● Configure PDUs
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed PDU and its rack is not known, locate the rack containing the failed PDU by
checking the status of any nodes and switches connected to the PDU, looking for error indications. The PDU
has rotating status LED displays and alarms that may show signs of failure.
2. Attach an ESD wrist strap and use it at all times.
3. Prepare to remove the failed PDU from the rack.
a. Release 2.0 only: Remove AC power at the wall or floor socket by switching power off and removing the
plugs from the AC outlet. Perform this for each of the PDU line cords. This is because PDUs in Sonexion
2.0 systems have no on/off switches.
Raritan PDUs have from one to three AC line cords. Each line cord must be individually powered on or
off.
b. Remove all component power cords attached to the PDU.
To ensure that the power cords can be properly reconnected later, make a note of all cable connections to the
PDU. Refer also to the Sonexion 2000 Quick Start Guide, a copy of which may be located in the outer sleeve
of the rack packaging.
c. Unplug the Ethernet cable attached to the failed PDU.
4. Use a Phillips screwdriver to remove the ground strap wire from the failed PDU to the ground point at the
bottom of the rack. Refer to the following figure.
Figure 82. Remove Ground Strap Wire
5. Remove the failed PDU by lifting it up off the keyhole openings (three total) and removing it from the frame:
a. Loosen the three Torx retaining bracket screws on the vertical cable tray (refer to the following figure),
and ensure that the brackets slide down and disengage.
Figure 83. Torx Retaining Bracket Screws
b. Unhook the vertical cable tray and rotate it 90 degrees so the tray is behind the SSUs.
c. Unscrew and remove the shipping L-Bracket from the top of the PDU, using the Phillips screwdriver.
Refer to the following figure.
Figure 84. Shipping L-Bracket
d. Lift the failed PDU from the mounting tabs and remove it from the rack.
6. Before the new PDU is put into the rack, remove the ground strap wire from the old PDU and transfer it to the
new PDU. Ensure that the ground strap wire is tight and secure.
7. Install three mounting screws in the back of the PDU before proceeding to the next step.
8. Install the new PDU.
a. Position the new PDU in the rack and feed the mains power cables into the opening at the top or bottom
of rack and to the mains power connector.
b. Hook the PDU to the mounting tabs by lifting and sliding the PDU onto the tabs.
c. Screw the shipping L-Bracket to the top of the PDU, using the Phillips screwdriver.
d. Rotate the cable tray back into position.
e. Secure the retaining brackets with three Torx screws.
9. Reconnect the ground strap wire and connect power to the PDU.
a. Re-connect the ground strap wire from the new PDU to the ground point at the bottom of the rack
(removed from the failed PDU in step 4).
b. Re-connect the Ethernet cable to the PDU.
c. Connect the PDU's line cords to the AC power outlets and, if required, turn them to the ON position.
d. Wait a few seconds for the PDU to begin its power-on process.
e. Verify that the LEDs on the PDU indicate normal operation. There is a rotating status display.
Configure PDUs
Verify that the following equipment is available before configuring the PDUs.
● PC/Laptop with a DB9 serial COM port, mouse, keyboard
● DB9 console cable (DB9 to DB9, female on both ends)
● Serial terminal emulator program, such as SecureCRT, PuTTY, Tera Term, Terminal Window (Mac), etc.
The managed power distribution units (PDUs) that are included in Sonexion systems must be configured
before the system is provisioned. There are two PDUs per rack (referred to as PDU 0 and PDU 1 in this
procedure).
IMPORTANT: This procedure applies only to systems running release 2.0.0 or later.
10. Connect the DB9 console cable to the CONSOLE/MODEM port (DB9) on the new PDU and the serial COM
port (DB9) on the PC/Laptop.
Figure 85. Console/Modem Port on Power Distribution Unit
WARNING: Do not use a rollover cable (DB9 to RJ45). Plugging an RS-232 RJ45 connector into the Ethernet
port of the PDU can cause permanent damage to the Ethernet hardware.
11. Open a terminal session with these settings:
Table 18. Settings for MGMT Connection

Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+.
When using the Terminal application on a Mac, enter the following (substitute the actual serial device for
/dev/ttyUSB0):
screen /dev/ttyUSB0 115200,cs8
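The same Table 18 settings can also be applied, or checked, with stty on a Linux host before launching a terminal emulator. This is a minimal sketch, not output from any Sonexion tool, and it assumes the serial device enumerates as /dev/ttyUSB0:

# 115200 baud, 8 data bits, no parity, 1 stop bit, no flow control
stty -F /dev/ttyUSB0 115200 cs8 -parenb -cstopb -crtscts -ixon -ixoff
# Print the line settings to confirm they took effect
stty -F /dev/ttyUSB0 -a

Here -parenb disables parity, -cstopb selects one stop bit, and -crtscts/-ixon/-ixoff disable hardware and software flow control, matching Table 18.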
12. Once connected, log in to the PDU with the default username and password.
13. Enter config mode by typing config. The following prompt appears:
Config:#
14. When prompted, change the password and the password aging parameters (so that the password does not
age and possibly require changing again). Enter the current password and the new password, then re-enter
the new password:
Config:# password
Current password: old_password
Enter new password: new_password
Re-type new password: new_password

Then enter the following command to disable password aging:

Config:# security loginLimits passwordAging disable
15. Configure the PDU to use a wired connection:
Config:# network mode wired
16. Configure the PDU to use a static IP address:
Config:# network ipv4 ipConfigurationMode static
Config:# network ipv4 ipAddress ip_address
Config:# network ipv4 subnetMask 255.255.0.0
To show the commands available, type:

config:# network ?
network command [arguments...]
Available commands:
  interface    Configure LAN interface
  ip           Configure IP access settings
  ipv4         Configure IPv4 settings
  ipv6         Configure IPv6 settings
  mode         Switch between wired and wireless networking
  services     Configure network service settings
  wireless     Configure wireless network
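Putting this step together with the address calculated in step 17 below, a complete static-address configuration for the left hand PDU in expansion rack 3 might look like the following. The address value is from the worked example and is illustrative only:

Config:# network ipv4 ipConfigurationMode static
Config:# network ipv4 ipAddress 172.16.253.17
Config:# network ipv4 subnetMask 255.255.0.0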
17. Calculate the IP address to use, using the following table.
Table 19. Device IP Addresses

Device Type                 Starting IP Address    Ending IP Address    Usable Addresses
Power Distribution Units    172.16.253.10          172.16.254.254       498
Table 20. Base Rack Starting IP Addresses

Device Type                Which Device in Base Rack            IP Address
Power Distribution Unit    Right hand PDU (viewed from rear)    172.16.253.10
Power Distribution Unit    Left hand PDU (viewed from rear)     172.16.253.11
a. From the above table, find the PDU starting IP address of the PDU being configured.
Example: 172.16.253.10 for the right-hand PDU and 172.16.253.11 for the left-hand PDU.
b. Determine the rack_ID where the PDU being configured is installed. The rack_ID of the base rack is 0.
The rack_ID of the first expansion rack is 1, second expansion rack is 2, and so on.
Continuing example: The PDU being configured is in the third expansion rack. The rack_ID is 3.
c. Multiply the rack_ID by 2.
Continuing example: rack_ID x 2 = 3 x 2 = 6
d. Add the result from step 17.c to the fourth octet of the IP address from step 17.a.
Continuing example for the left hand PDU: the fourth octet of 172.16.253.11 is 11, and 11 + 6 = 17.
e. Replace the fourth octet of the IP address from step 17.a with the result from step 17.d.
Continuing example: 172.16.253.11 becomes 172.16.253.17, the device IP address for the left hand PDU
installed in expansion rack 3.
IMPORTANT: When configuring the switch or PDU, the netmask is always 255.255.0.0.
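The address arithmetic in this step can be checked with plain shell arithmetic on any workstation. The following is a minimal sketch using the values from the continuing example (rack_ID 3, left hand PDU); the variable names are illustrative and not part of any Sonexion tool:

rack_id=3        # third expansion rack
base_octet=11    # fourth octet of the starting IP: 10 = right hand PDU, 11 = left hand PDU
echo "172.16.253.$(( base_octet + rack_id * 2 ))"    # prints 172.16.253.17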
18. Set a hostname for the PDU.
To set the PDU host name, from the config command prompt, type:
config:# network ipv4 preferredHostName cluster_name-pdunum-rrack_num
Where:
● cluster_name is the value used in the YAML file, as provided by the customer.
● num is the PDU number: 0 is the PDU on the right side of the rack (rear view); 1 is the PDU on the left side of the rack (rear view).
● rack_num is the number of the rack in which the PDU is installed: 0 is the base rack; 1 is expansion rack 1; 2 is expansion rack 2, and so on.
Example:
config:# network ipv4 preferredHostName nsit-test-pdu0-r0
Example host names for PDUs:
Table 21. Example PDU Host Names

nsit-test-pdu0-r0    Right side PDU in the base rack
nsit-test-pdu1-r0    Left side PDU in the base rack
nsit-test-pdu1-r7    Left side PDU in expansion rack #7
nsit-test-pdu0-r2    Right side PDU in expansion rack #2
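For reference, the complete set of PDU host names for a hypothetical cluster named nsit-test with a base rack and two expansion racks could be listed with a small shell loop (illustration only; these names are not read from the system):

cluster=nsit-test
for rack in 0 1 2; do
  for pdu in 0 1; do
    echo "${cluster}-pdu${pdu}-r${rack}"
  done
done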
19. Save the new configuration and leave configuration mode. To save the changes, type:
apply
To leave configuration mode without saving changes, type:
cancel
20. Verify the settings:
show network
To enable the detailed view, type:
show network details
The show network command displays network information, as in this example:
show network
Networking mode: Wired
IP Configuration mode: Static
IP address: 10.22.160.113
Net mask: 255.255.240.0
Gateway: 10.22.160.1
To show the PDU name, type:
show pdu
This shows the PDU name, as in this example:
PDU 'dvtrack_pdu0_r0'
Model: PX2-5104X2-V2
Firmware Version: 2.5.30.5-41228
#
21. Log out of the PDU:
exit
22. After configuring the new PDU, power OFF the PDU via the AC outlets. Verify that all line cords are OFF.
23. Using the packaging from the new PDU, re-package the failed PDU and return it per the authorized
procedures.
Reconnect Cords from Components
In the following steps, reconnect to the PDU the line cords from all components that were disconnected earlier.
IMPORTANT: This procedure applies only to systems running release 2.0.0 or later.
24. Confirm that the PDU is powered OFF at the AC outlets and that all line cords are OFF (as done in step 22).
25. At the back of the rack, confirm that the SSU PSUs related to this PDU change are in the OFF position.
26. At the back of the rack, confirm that the PCMs/PSUs of the MMU storage (2U24 or 5U84, if fitted) related to
this PDU change are in the OFF position.
27. The PDU's line cords should still be connected to the AC power outlets; turn them to the ON position
(they were switched OFF in step 22).
28. Wait a few seconds for the PDU to begin its power-on process.
29. Verify that the LEDs on the PDU indicate normal operation. There is a rotating status display.
30. At the back of the rack, power on the SSUs, the MMU storage (if fitted), and any other component related to
this PDU (see step 3).
31. Confirm that all components’ PSUs or PCMs related to this PDU now have a good power indication showing.
Replace a Cabinet Power Distribution Strip
Prerequisites
Part number
100894000: Power Distribution Strip, C20-C13
Figure 86. Power Distribution Strip
Time:
1.5 hours
Interrupt level:
Interrupt (requires taking the Lustre filesystem offline)
Tools:
Phillips screwdriver (#2)
Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8
data bits, no parity and 1 stop bit)
Serial cable
Cat5e cables (2)
RS-232 to Ethernet serial cable
ESD strap, boots, garment or other approved methods
Access requirements:
This procedure has been written specifically for use by an admin user and does not require
root (super user) access. Cray recommends that this procedure be followed as written and
that the technician does not log in as root or perform the procedure as a root user.
About this task
Each rack contains two power distribution strips that are mounted on the left and right rear-facing sides of the
cabinet. There are two types of power distribution strip in the rack: a 7-position model and a 12-position model.
The replacement procedure is the same for both, differing only in the plugging order.
To replace a defective power distribution strip, power off the entire rack.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed power distribution strip is not known, determine which power distribution strip failed
by checking the status of any failed power receptacles.
In most cases, a failed receptacle can be re-cabled to work around the issue until appropriate downtime can be
scheduled.
2. Power off the system as described in Power Off Sonexion 2000.
3. At the back of the rack, disconnect all the power cords from the failed power distribution strip.
Figure 87. Power Distribution Strip with Power Cords
4. At the back of the rack, using the Phillips (#2) screwdriver, remove the two retaining pan-head screws from the
front of the failed power distribution strip.
5. Using the Phillips head screws (provided), attach the mounting brackets (two total) to the sides of the new
power distribution strip.
6. Align the mounting brackets and the rack holes (42U position). Using two pan-head screws with nylon
washers, attach each bracket to the rack.
7. Connect all power cords to the power receptacle on the power distribution strip. Refer to the cabling diagrams
in the following figures for cable configuration.
If the power cords block access to components, install the power cords so that the first two sockets are empty,
rather than the last two.
Figure 88. Power Distribution Strip With Top Sockets Empty
8. Power on the system as described in Power On Sonexion 2000.
Power Diagrams

Figure 89. Base Rack (4U MMU) Power Distribution Strip for PX2-5104X2-V2
Figure 90. Base Rack with PDU: PX2-5965X3-V2
Figure 91. Base Rack (4U MMU) Power Distribution for PX2-5100X2-V2
Replace a CNG Server Module
Prerequisites
Part number
Time
2 hours
Interrupt level
● If the site has configured Round-Robin DNS for the CNG nodes, as recommended: Failover (can be applied to a live system with no service interruption, but requires failover/failback)
● If the site uses static client assignments: Enterprise Interrupt (requires disconnecting enterprise clients from the filesystem)
Tools
● ESD strap, boots, garment or other approved methods
● Console with monitor (DB15) and keyboard (USB)
● Video cable
About this task
The CIFS/NFS Gateway (CNG) is an optional component that can be added to Sonexion systems. The CNG
component shares the Lustre file system with enterprise clients (Windows, Linux, Macintosh, Solaris, etc.) using
enterprise NAS protocols (CIFS or NFS).
A CNG unit consists of a 2U chassis that contains two or four server modules, each with built-in cooling fans, and
two PSUs, which also contain built-in fans. The two PSUs are located at the rear of the server in the center two
bays.
Four-Node and Two-Node CNG Servers
The physical node numbering on the four-node CNG is shown in the following figure.
Figure 92. CNG Four-Node Version Rear View
For the two-node variety, nodes 1 and 2 are connected identically to nodes 1 and 2 in the four-node configuration
and nodes 3 and 4 are replaced with blanking plates.
Distinguishing the CNG from the MMU
The MMU includes a 2U chassis that also contains four server modules. However, the MMU server modules are
not interchangeable with those in the CNG, and care should be taken not to confuse them.
To locate CNG components and distinguish from the MMU, examine the cabling attached to the 2U server
components at the back of the rack. If any components have SAS cables connected, the 2U enclosure is the
MMU. If there are no SAS cables, the 2U enclosure is the CNG.
This procedure was written for a CNG configured for four servers, but is similar to the procedure for a two-node
chassis.
CAUTION: Do not leave any enclosure bay empty.
Procedure
1. If the location of the failed server node is not known, do one of the following:
● Access CSSM and use the Health tab to identify the faulty node by its hostname, and to determine which chassis contains the corresponding server node and which other nodes share the chassis, or
● Locate the CNG components, and look for the 2U enclosure with its System Status LED on (amber) or dark LEDs on the left and right control panels (see the following figure). To locate the CNG components, examine the cabling attached to the 2U server components at the back of the rack. If any of the components have SAS cables connected, the 2U enclosure is the MMU. If there are no SAS cables, the 2U enclosure is a CNG.
Figure 93. Quad Server Control Panels
Table 22. Server Node System Status LED Descriptions

LED Color    Condition    Description
Green        On           System Ready / No Alarm
Green        Flashing     System Ready, but degraded: redundancy lost, such as a power supply or fan failure; non-critical temperature/voltage threshold; battery failure; or predictive power supply failure.
Amber        On           Critical Alarm: critical power module failure, critical fan failure, critical voltage (power supply), critical temperature and voltage.
Amber        Flashing     Non-Critical Alarm: redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-            Off          Power off: system unplugged.
2. Log in to the active MGMT server node:
[Client]$ ssh -l admin active_MGMT_node
3. Record the hostname of the affected node (hosted on the failed server module):
[admin@n000]$ nodeattr -n exporter
For example:
[admin@snx11000n000 ~]$ nodeattr -n exporter
snx11000n006
snx11000n007
snx11000n008
snx11000n009
The hostnames are numbered in the same order as the physical nodes. In this example, the hostname
snx11000n006 corresponds to physical node 1 in the following figure, and the hostname snx11000n009
corresponds to physical node 4.
Figure 94. CNG Four-Node Version Rear View
4. Record the primary and secondary BMC IP, Mask and Gateway IP addresses of the affected node:
[admin@n000]$ grep hostname-ipmi /etc/hosts
Where hostname is replaced by the hostname of the affected node. For example:
[admin@snx11000n000 ~]$ grep snx11000n008-ipmi /etc/hosts
172.16.0.52    snx11000n008-ipmi
10.10.0.9      snx11000n008-ipmi-sec

The Mask and Gateway addresses are set the same for all nodes:

Subnet Mask:           255.255.0.0
Default Gateway IP:    0.0.0.0
5. Define the gateway IP address and prefix length of the secondary IPMI network:
[admin@n000]$ pdsh -N -g mgmt ip a l | grep :ipmi
For example:
[admin@snx11000n000 ~]$ pdsh -N -g mgmt ip a l | grep :ipmi
inet 10.10.0.3/16 brd 10.10.255.255 scope global secondary eth0:ipmi
Where 10.10.0.3 is the gateway IP address and 16 (from /16, equivalent to netmask 255.255.0.0) is the prefix
length; brd 10.10.255.255 is the broadcast address. CTDB performs the ECN IP takeover for failed nodes, so it
is not necessary to disconnect active clients. However, any active I/O will lose its connection and be interrupted
during the failover process.
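If preferred, the gateway address and prefix length can be extracted from that output in one step. This sketch assumes the same eth0:ipmi label and inet line format shown in the example above:

[admin@n000]$ pdsh -N -g mgmt ip a l | awk '/inet / && /:ipmi/ {split($2, a, "/"); print "gateway " a[1] ", prefix /" a[2]}'
gateway 10.10.0.3, prefix /16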
6. Stop the CIFS/NFS services:
[admin@n000]$ cscli cng node disable -n hostname
[admin@n000]$ cscli cng apply
Do you confirm apply? (y/n) y
This command interrupts any active connections from enterprise clients.
7. Shut down the affected node:
[admin@n000]$ cscli power_manage -n node --power-off
Where node is the hostname for the affected node, for example:
[admin@n000]$ cscli power_manage -n snx11000n016 --power-off
If the affected node does not shut down after the --power-off command is issued, there are two options
to shut the node down:
a. From the primary MGMT node run:
[admin@n000]$ pm -0 affected_node
b. If the node is still powered on and the pm -0 command fails, press and hold the power button, for at
least six seconds, on the front panel of the affected server node.
8. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods.
9. Disconnect the cables attached to the failed server node, and make a note as to where each cable attaches
to ensure the cables are connected properly during re-installation.
IMPORTANT: Note the port that each cable is attached to, so that the same cable connections can
be made after the new server node is installed. Refer to the cable reference guide attached to the
rack’s left hand rear door or provided in the Sonexion 2000 Quick Start Guide included with the
system.
10. Push the green latch to release the server node, while using the handle to pull it from the chassis.
Figure 95. Remove the Server Node
11. Install the server node by inserting the server into the empty bay at the rear of the enclosure. It may be
necessary to push the module firmly to seat it fully into the bay. There will be an audible click as the
server seats.
Figure 96. Install the Server Node
To configure the new server node, go to Configure BIOS Settings for CNG Node.
Configure BIOS Settings for CNG Node
About this task
Before specifying the BIOS settings, verify that the following equipment is available:
●
Console with monitor and keyboard
●
Video cable
The physical node numbering on the CNG is as follows:
Figure 97. CNG Four-Node Version Rear View
Procedure
1. Connect the console’s video cable to the node (in the CNG chassis) for which BIOS settings are being
configured.
2. Power on the new server module using the power button.
3. Press F2 during POST to enter the BIOS setup utility. If F2 is not pressed in time and the BIOS setup utility
does not start, power-cycle the server module (press and hold the ON/OFF switch for 6 seconds) and try again.
4. Specify the server module’s BIOS settings as follows.
Depending on the specific configuration utility and BIOS version available, screen layouts, option names, and
navigation may vary slightly from the descriptions in this procedure. Use the arrow keys to navigate through
the BIOS and press Enter to confirm a setting.
Set the Baseboard LAN Configuration and Boot Options Entries
5. Navigate to the Server Management tab > BMC LAN Configuration menu > Baseboard LAN
Configuration.
6. Set the Baseboard LAN Configuration entries, using the information recorded in step 4 of
Replace a CNG Server Module, or use the factory defaults shown below.

IP Source    IP Address      Subnet Mask    Gateway IP
Static       172.16.0.xxx    255.255.0.0    0.0.0.0
Figure 98. CNG Node Positions Within Cabinet
7. Navigate to the Advanced tab > PCI Configuration > NIC Configuration menu > Onboard NIC1 Port1
MAC Address, and record the MAC address.
8. Navigate to the Boot Options tab > Network Devices Order.
9. Disable all network boot options except for one (IBA GE Slot 0400 v1372):

Boot Option #1    IBA GE Slot 0100 v1372
Boot Option #2    ST9450405SS
These entries set the boot order to the first network adapter, and then the first drive in the server. The Boot
Option entries may differ slightly from the entries shown above.
IMPORTANT: If any other boot devices appear on this screen, disable them.
10. Press F10 to save and exit. This automatically reboots the node.
11. Disconnect the video cable from the server.
12. Connect the remaining cables to the new server node based on the notes made in step 9. For more
information about cable port assignments, refer to the "Internal Rack Cabling" section of the Sonexion 2000
Field Installation Guide.
13. Log in to the primary MGMT node:
[Client]$ ssh -l admin primary_MGMT
14. Configure the new MAC address:
a. Update the MGMT database to use the MAC address of the new server node, recorded in step 7. Use
the Onboard NIC1 Port1 MAC Address:
[admin@n000]$ sudo echo "update netdev set mac_address='new_node_mac_addr'
where hostname='node_hostname'" | mysql t0db
Where new_node_mac_addr is a new node mac address, and node_hostname is the target node
hostname. For example:
[admin@n000]$ sudo echo "update netdev set mac_address='34:56:78:90:ab:cd'
where hostname='snx11000n016'" | mysql t0db
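To confirm that the update took effect, the same table can be queried back. This is a sketch that assumes only the t0db netdev table and the columns used in the update command above:

[admin@n000]$ echo "select hostname, mac_address from netdev where
hostname='snx11000n016'" | mysql t0db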
b. Update the configuration to use the new database entry:
[admin@n000]$ sudo /opt/xyratex/bin/beUpdatePuppet -s -g mgmt
15. Power cycle the new CNG node:
[admin@n000]$ cscli power_manage -n node --cycle
Where node is the hostname for the server node that was replaced. For example:
[admin@n000]$ cscli power_manage -n snx11000n016 --cycle
16. Start the CIFS/NFS services on the new node:
[admin@n000]$ cscli cng node enable -n node
[admin@n000]$ cscli cng apply
Do you confirm apply? (y/n) y
17. Check the firmware level on the new CNG node:
[root@n000]# pdsh -w new_cng_node "/lib/firmware/mellanox/release_1/
xrtx_mlxfw -c | grep 'Current'"
Example:
[root@n000]# pdsh -w snx11000n016 "/lib/firmware/mellanox/release_1/
xrtx_mlxfw -c | grep 'Current'"
snx11000n016: Name: 01:00.0 Current: 2.30.8000 Update: 2.30.8000
snx11000n016: Name: 02:00.0 Current: 2.30.8000 Update: 2.30.8000
Firmware levels should be as follows:
Expansion Bus                        Release 1.5.0    Release 2.0
CNG (Config A and B) onboard CX-3    2.30.8000        2.32.5100
CNG (Config A) PCI-e CX-3            2.30.8000        2.32.5100
CNG (Config B) PCI-e CX-2            2.9.1000         2.9.1000
18. If Round-Robin DNS was not available, reconnect the enterprise clients whose connections were interrupted
in step 6 of Replace a CNG Server Module, using the network utilities on the clients.
If the client connection works, the server replacement procedure is complete.
BMC IP Address Table for CNG Nodes
Use the following table to look up designated IP addresses and ranges, and assign them to nodes in the CNG
chassis.
Table 23. BMC IP Addresses (Enclosures)

Rack    Nodes        BMC IP Address / Range
Base    CNG nodes    172.16.0.50 to 172.16.0.57
● The minimum number of nodes per CNG chassis is two, and the maximum is four.
● Only one CNG chassis is supported in releases 1.5 and 2.0. It must be mounted in the base rack when a 2U24 EBOD is in use for MMU storage.
● The CNG chassis may also be mounted in a customer rack, and is the only option when a 5U84 EBOD is installed in the base rack for MMU storage.
● Within groupings in the CNG chassis, it is recommended that IP addresses be assigned in the following order when viewing the rack from the rear. The number referred to below is the 4th octet of the IP address:
○ Assign even numbers (4th octet of the IP address) to right-side nodes.
○ Assign odd numbers (4th octet of the IP address) to left-side nodes.
○ Assign lower numbers (4th octet of the IP address) to the bottom enclosure in the rack and work your way up the rack.
The IP addresses used in the base rack diagram below follow the conventions described above. When viewed
from the rear, the IP addresses, from left to right, top to bottom, are:
● 172.16.0.53 (top left)
● 172.16.0.52 (top right)
● 172.16.0.51 (bottom left)
● 172.16.0.50 (bottom right)
Figure 99. IP Addresses Used With CNG Nodes
Replace a CNG Chassis
Prerequisites
● Part number:
● Time: 2 hours
● Interrupt level: Enterprise Interrupt (requires disconnecting enterprise clients from the filesystem)
● Tools:
○ ESD strap
○ #1 and #2 Phillips screwdriver
○ Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity and 1 stop bit)
About this task
The CIFS/NFS Gateway (CNG) is an optional component that can be added to Sonexion systems. The CNG
component shares the Lustre file system with enterprise clients (Windows, Linux, Macintosh, Solaris, etc.) using
enterprise NAS protocols (CIFS or NFS).
Subtasks:
● Disconnect CNG Clients and Shut Down Nodes
● Remove Components from the CNG Chassis
● Install CNG Chassis and Components
A CNG unit consists of a 2U chassis that contains two or four server modules, each with built-in cooling fans, and
two PSUs, which also contain built-in fans. The two PSUs are located at the rear of the server in the center two
bays.
Four-Node and Two-Node CNG Servers
The physical node numbering on the four-node CNG is shown in the following figure.
Figure 100. CNG Four-Node Version Rear View
For the two-node variety, nodes 1 and 2 are connected identically to nodes 1 and 2 in the four-node configuration
and nodes 3 and 4 are replaced with blanking plates.
Distinguishing the CNG from the MMU
The MMU includes a 2U chassis that also contains four server modules. However, the MMU server modules are
not interchangeable with those in the CNG, and care should be taken not to confuse them.
To locate CNG components and distinguish from the MMU, examine the cabling attached to the 2U server
components at the back of the rack. If any components have SAS cables connected, the 2U enclosure is the
MMU. If there are no SAS cables, the 2U enclosure is the CNG.
This procedure was written for a CNG configured for four servers, but is similar to the procedure for a two-node
chassis.
CAUTION: Do not leave any enclosure bay empty.
Disconnect CNG Clients and Shut Down Nodes
Procedure
1. If the location of the failed chassis is not known, do the following:
● Access CSSM and use the Health tab to determine that the CNG unit is faulty.
● Locate the CNG in the rack by looking for the 2U two-node or four-node server with its System Status LED on (amber) or dark LEDs on the left and right control panels. See the following figure. To locate CNG components (as opposed to those in the MMU), examine the cabling attached to the 2U server components at the back of the rack. If any of the components have SAS cables connected, the 2U enclosure is the MMU. If there are no SAS cables, the 2U enclosure is the CNG.
Figure 101. Quad Server Control Panels
Table 24. System Status LED Descriptions

LED Color    Condition    Description
Green        On           System Ready / No Alarm
Green        Flashing     System Ready, but degraded: redundancy lost, such as a power supply or fan failure; non-critical temperature/voltage threshold; battery failure; or predictive power supply failure.
Amber        On           Critical Alarm: critical power module failure, critical fan failure, critical voltage (power supply), critical temperature and voltage.
Amber        Flashing     Non-Critical Alarm: redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-            Off          Power off (system unplugged), or system powered off and in standby with no prior degraded/non-critical/critical state.
Figure 102. CNG Four-Node Chassis Rear View
2. Disconnect the CIFS/NFS clients, using the network utilities on your Windows/SMB/CIFS or NFS client. The
gateway access is unavailable during this procedure.
3. Log in to the active MGMT server node as admin:
[Client]$ ssh -l admin active_MGMT_node
4. Stop the CIFS/NFS services:
[admin@n000]$ cscli cng disable -y
[admin@n000]$ cscli cng apply -y
This command interrupts any active client connections.
5. Shut down the gateway nodes:
[admin@n000]$ cscli power_manage -n nodes --power-off
Where nodes is the range of node names for the four server nodes. For example:
[admin@n000]$ cscli power_manage -n snx11000n0[6-9] --power-off
Remove Components from the CNG Chassis
Once all of the nodes are shut down (the power button LEDs on the control panels at the front of the 2U
server are not illuminated), remove the internal components from the CNG chassis:
IMPORTANT: It is recommended to tag the cables or make a note of their connections. Refer to the
latest revision of the Internal Rack Cabling guide for your Sonexion release for cabling information.
6. Verify that the four server modules are off. If any power button LEDs on the control panels are illuminated,
press and hold the power button for more than 4 seconds until the LED is extinguished.
The network LEDs may flash, but the node is on only if the power button is illuminated.
7. Disconnect the power cord from each of the two power supplies.
8. Disconnect the cables attached to each of the four server modules.
9. Remove the two power supply units from the faulty chassis.
a. Push the green latch to release the PSU.
Figure 103. Remove PSU from Quad Server
b. Using the handle, remove the power supply module by carefully pulling it out of the power supply cage.
Store the PSUs in a safe location.
c. Repeat for the second PSU.
10. Remove the four server modules:
a. Push the green latch to release the server module, while using the handle to pull it from the chassis.
Figure 104. Removing the Server Node
b. Place the server module in a safe location, and repeat until all four server modules are removed.
11. Remove the CNG chassis from the rack:
a. Loosen the two fasteners securing the server chassis to the front of the rack.
b. With a second person, slide the chassis out of the rack and depress the two safety locks on the rails to
completely remove the chassis.
c. Place the chassis on a sturdy bench.
Install CNG Chassis and Components
12. Install the replacement chassis in the rack:
a. With an assistant, slide the chassis into the rack and depress the two safety locks on the rails to
completely seat the chassis.
b. Secure the chassis to the front of the rack cabinet with the two fasteners.
13. Install the four server modules:
IMPORTANT: Cray recommends placing each server module into the same bay from which it was
removed. This helps the CSSM software monitor each module more accurately over its lifetime.
a. Install the server module by inserting it into the empty bay at the rear of the chassis. It may be necessary
to push the module firmly to fully seat it into the bay.
You'll hear a click as the module seats.
Figure 105. Installing the Server Module
b. Connect the network cables to the server node.
IMPORTANT: Be sure to reconnect the cables to their original ports. Refer to your notes or the
latest revision of Internal Rack Cabling for your Sonexion release.
c. Repeat steps a and b for the other (one or three) server modules.
14. Install the PSUs:
a. Insert the power supply unit into the empty power supply cage. It may be necessary to firmly push the unit
to fully insert it into the cage.
You'll hear a click as the PSU seats.
Figure 106. Installing the Power Supply Unit
b. Repeat for the second PSU.
c. Connect the power cords to the PSUs.
IMPORTANT: Be sure to reconnect the power cables to their original ports. Refer to your notes or
the latest revision of the Internal Rack Cabling guide for your Sonexion release.
15. Log in to the active MGMT node:
[Client]$ ssh -l admin active_MGMT_node
16. Power on the server nodes:
[admin@n000]$ cscli power_manage -n nodes --power-on
17. Wait for the server nodes to fully boot (approximately 10 minutes).
18. Make certain the Lustre filesystem has started:
[admin@n000]$ cscli show nodes
-----------------------------------------------------------------------
Hostname      Role      Power  Lustre   Targets  Partner       HA
                        State  state             Resources
-----------------------------------------------------------------------
snx11000n000  Mgmt      on     ----     0 / 0    snx11000n001  -----
snx11000n001  *Mgmt     on     ----     0 / 0    snx11000n000  -----
snx11000n002  MGS,*MDS  on     N/A      0 / 0    snx11000n003  None
snx11000n003  MDS,*MGS  on     Started  1 / 1    snx11000n002  All
snx11000n004  OSS       on     Started  4 / 4    snx11000n005  All
snx11000n005  OSS       on     Started  4 / 4    snx11000n004  All
-----------------------------------------------------------------------
19. Start the CIFS/NFS services:
[admin@n000]$ cscli cng enable -y
[admin@n000]$ cscli cng apply -y
20. Connect the CIFS/NFS clients using the network utilities on your Windows/SMB/CIFS or NFS client.
If the client connection works, the CNG chassis FRU procedure is complete.