IBM RS/6000 SP Software User manual
Below you will find brief information for RS/6000 SP Software. This manual covers installation, migration, update and backup of system software. It will help you install, tailor and configure your RS/6000 SP system.
advertisement
Assistant Bot
Need help? Our chatbot has already read the manual and is ready to assist you. Feel free to ask any questions about the device, but providing details will make the conversation more productive.
RS/6000 SP Software Maintenance
Hajo Kitzhöfer, Bärbel Altmann, Janakiraman Balasayee, Atul Sharma, Jorge Vergara
International Technical Support Organization
http://www.redbooks.ibm.com
SG24-5160-00
International Technical Support Organization
SG24-5160-00
RS/6000 SP Software Maintenance
July 1999
Take Note!
Before using this information and the product it supports, be sure to read the general information in
Appendix D, “Special Notices” on page 257.
First Edition (July 1999)
This edition applies to PSSP Version 2, Release 4 and Version 3, Release 1 of IBM Parallel System
Support Programs for use with AIX 4.3.2
Comments may be addressed to:
IBM Corporation, International Technical Support Organization
Dept. JN9B Mail Station P099
522 South Road
Poughkeepsie, New York 12601-5400
When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.
© Copyright International Business Machines Corporation 1999.
All rights reserved.
Note to U.S Government Users – Documentation related to restricted rights – Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
Contents
Chapter 1. Software Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Licensed Program Products (LPPs) . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.4 Fileset Updates (PTFs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.6 Product Offerings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 AIX Maintenance Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Monthly Maintenance Packages . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Maintenance Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Maintenance-Level Bundle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Media - IBM Software Manufacturing Company . . . . . . . . . . . . . . . . . . 6
1.3.1 CD-ROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2. The Installation and Customization Process . . . . . . . . . . . . 9
2.1.1 SP Installation Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 NIM Master. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.3 NIM Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.4 NIM Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.5 NIM Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.6 NIM Operations on Standalone Clients . . . . . . . . . . . . . . . . . . . . 26
2.1.7 NIM Administration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 NIM Configuration Using PSSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.1 mknimmast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.3 mknimclient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
© Copyright IBM Corp. 1999
iii
2.3 Web-Based System Manager for NIM (WSM) . . . . . . . . . . . . . . . . . . . 41
2.4.1 Setting Up the Control Workstation . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.2 Disk Space Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6 Additional Required AIX LPPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7 Minimum Required PSSP Filesets . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.8 Install PSSP on the Control Workstation. . . . . . . . . . . . . . . . . . . . . . . 55
2.9.1 setup_authent Files and Daemons . . . . . . . . . . . . . . . . . . . . . . . 57
2.10 The install_cw Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.11 The setup_server Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.14 Set bootp_response to Install . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 3. Verifying Your SP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Dates and Timezones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Kerberos and kadmind Daemons . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.2 Kerberos Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 General Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5 System Data Repository (SDR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chapter 4. Alternate and Mirrored rootvg . . . . . . . . . . . . . . . . . . . . . . . 89
4.1 Alternate Boot System Image Function. . . . . . . . . . . . . . . . . . . . . . . . 89
4.1.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.2 Defining an Alternate rootvg . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.3 Installing an Alternate rootvg . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.4 Switching between Alternate System Images . . . . . . . . . . . . . . . 95
4.1.5 Alternate rootvg Highlights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2.1 Mirroring Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2.2 Unmirroring a Root Volume Group . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.1 Disk Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3.2 Install Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3.3 Migrate Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
iv
RS/6000 SP Software Maintenance
4.3.4 Customize Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.5 Diag Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.6 Maintenance Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 5. Updating Your System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Applying AIX PTFs in the CWS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Applying AIX PTFs on the Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.1 Method 1: Applying AIX PTFs Using mksysb Install Method . . . 106
5.3.2 Method 2: Applying AIX PTFs Using SMIT or dsh . . . . . . . . . . . 111
5.3.3 Method 3: Applying AIX PTFs Using DSMIT . . . . . . . . . . . . . . . 113
5.4 Applying PSSP PTFs in the CWS . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 Applying PSSP PTFs to the Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.1 Method 1: Applying PSSP PTFs Using mksysb Install Method . 119
5.5.2 Method 2: Applying PSSP PTFs Using SMIT or dsh . . . . . . . . . 119
5.5.3 Method 3: Applying PSSP PTFs Using DSMIT . . . . . . . . . . . . . 120
5.6 Applying PTFs in the Nodes Using the Switch . . . . . . . . . . . . . . . . . 120
Chapter 6. Migrating PSSP and AIX to Later Versions . . . . . . . . . . . . 123
6.2 Migrating the CWS from PSSP 2.4 to PSSP 3.1 . . . . . . . . . . . . . . . . 127
6.3 Migrating the Nodes from PSSP 2.4 to PSSP 3.1 . . . . . . . . . . . . . . . 134
Chapter 7. Backup and Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.1 Backing Up Your SP System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.1.1 Mksysb and Savevg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.1.2 Backing Up the SDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.3 Kerberos Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.1.4 Backing Up the NIM Database . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2 Restoring Your SP System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 Restoring an mksysb Image or a Volume Group Backup . . . . . 147
7.2.2 Restoring the SDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.2.3 Restoring the Kerberos Database . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.4 Restoring the NIM Database. . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 Remote Backups and Restores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 8. Taming Kerberos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.1 The Default SP Kerberos Realm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.2 Kerberos Setup during the Installation of an RS/6000 SP System . . 159
8.2.1 The Kerberos Master Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.2.2 The Kerberos Configuration File . . . . . . . . . . . . . . . . . . . . . . . . 160
8.2.3 The Kerberos Realm Configuration . . . . . . . . . . . . . . . . . . . . . . 161
8.3.1 The /.klogin File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
v
8.4 Kerberos Daemons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.5 setup_authent, setup_server and the Kerberos Database . . . . . . . . 164
8.7.1 Ticket-Granting-Ticket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.7.2 Contents and Purpose of the Ticket-Granting-Ticket (tgt) . . . . . 169
8.7.3 Asking for a Remote Service . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.7.4 The Ticket Cache File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.7.5 Authentication and Authorization . . . . . . . . . . . . . . . . . . . . . . . 175
8.7.6 Never-Expiring Ticket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.7.7 Hardmon Access Control List - hmacls File. . . . . . . . . . . . . . . . 177
8.7.8 Kerberos Database Access Control Lists . . . . . . . . . . . . . . . . . 179
8.8 Kerberos Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
8.8.1 List a Kerberos Principal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.8.2 Make a Kerberos Principal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.8.3 Change a Kerberos Principal . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.8.4 Remove a Kerberos Principal . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.9 Reinitializing Kerberos without Rebooting the Nodes . . . . . . . . . . . . 185
8.9.1 Removing or Reinitializing the Kerberos Subsystem . . . . . . . . . 185
8.10 Recreating the krb-srvtab File for a Node . . . . . . . . . . . . . . . . . . . . 187
8.10.1 The create_krb_files Wrapper. . . . . . . . . . . . . . . . . . . . . . . . . 187
8.10.2 The ext_srvtab Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.11 Merging Two (or more) SP Systems to One Kerberos Realm . . . . . 189
8.11.1 Deleting an Authentication Server Setup . . . . . . . . . . . . . . . . 190
8.11.2 Provide Name Resolution and Time Synchronization . . . . . . . 191
8.11.3 Adjust the /etc/krb.realms File . . . . . . . . . . . . . . . . . . . . . . . . 191
8.11.4 Extend the /.klogin File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.11.5 Adding Principals for Remote Nodes . . . . . . . . . . . . . . . . . . . 193
8.11.6 Creating the srvtab Files for the Remote Nodes . . . . . . . . . . . 195
8.11.7 Distributing the Kerberos Files to the Remote Nodes . . . . . . . 197
8.12 Setting Up a Secondary Authentication Server . . . . . . . . . . . . . . . . 197
8.12.1 Install Kerberos-Related Filesets on a Secondary Server . . . . 199
8.12.2 /etc/krb.conf File Modification and Distribution . . . . . . . . . . . . 199
8.12.3 Run setup_authent on the Secondary Authentication Server . 200
8.13 Working with the WCOLL Variable . . . . . . . . . . . . . . . . . . . . . . . . . 202
Chapter 9. Integration of External Workstations . . . . . . . . . . . . . . . . 205
9.1 Integration of External Workstations to the SP Kerberos Realm . . . . 205
9.1.1 Time Synchronization, Name Resolution and Kerberos Code . . 206
9.1.2 Edit /etc/krb.realms on the Control Workstation . . . . . . . . . . . . 207
9.1.3 Extend the /.klogin File on the Control Workstation. . . . . . . . . . 207
9.1.4 Add rcmd Principals to the Kerberos Database . . . . . . . . . . . . . 208
9.1.5 Create a krb-srvtab File for the External Workstation . . . . . . . . 209
vi
RS/6000 SP Software Maintenance
9.1.6 Transfer the Required File to the External Workstation . . . . . . . 209
9.1.7 Set the Authentication Method on the External Workstation . . . 209
9.2 Using the CWS NIM Master Setup to Install External Workstations . 211
9.2.1 Definition of a Second Network Served by the NIM Master . . . . 211
9.2.2 External NIM Client Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.2.3 Creating the Resources for External NIM Clients . . . . . . . . . . . 217
9.2.4 Setting Up the Installation of an External Workstation. . . . . . . . 218
9.2.5 Initializing the Client Installation . . . . . . . . . . . . . . . . . . . . . . . . 221
Chapter 10. The Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.1.1 Checking the Primary and Primary Backup Node . . . . . . . . . . 223
10.1.2 Checking the Switch Clock Source Settings . . . . . . . . . . . . . . 226
10.2.1 Estart on the Control Workstation . . . . . . . . . . . . . . . . . . . . . . 228
10.2.2 The Worm Daemon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.2.3 Estart_sw Running on the Oncoming Primary Node . . . . . . . . 231
10.2.4 What Happens on the Non-Primary Nodes . . . . . . . . . . . . . . . 232
10.2.5 Fenced and Unfenced Nodes . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.2.6 Fenced with and without the autojoin Flag . . . . . . . . . . . . . . . 233
10.2.7 When Do You Have to Fence a Node? . . . . . . . . . . . . . . . . . . 235
10.2.8 Modifying the Switch Values in the SDR . . . . . . . . . . . . . . . . . 235
10.3 Monitoring the Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.3.1 How to Activate the Emonitor Function . . . . . . . . . . . . . . . . . . 239
10.3.2 How Emonitor Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
10.3.3 The Emonitor.log File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.3.4 The Time Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.3.5 Monitoring the Switch with PSSP 3.1 . . . . . . . . . . . . . . . . . . . 242
Appendix A. Downloading PTFs for RS/6000 and SP . . . . . . . . . . . . . . 243
A.1.1 Downloading the FixDist Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
A.1.3 Starting and Configuring the FixDist Tool . . . . . . . . . . . . . . . . . . . . 245
A.3 Downloading Fixes from the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Appendix B. Integrating an S70 System into an SP . . . . . . . . . . . . . . . 247
Secondary Authentication Server on an SP Node . . . . 255
vii
C.3 On the Node That Will Become the Secondary Authentication Server. . 255
C.5 On the Future Secondary Authentication Server . . . . . . . . . . . . . . . . . . 256
Appendix D. Special Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Appendix E. Related Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
E.1 International Technical Support Organization Publications . . . . . . . . . . 261
How to Get ITSO Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
ITSO Redbook Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
viii
RS/6000 SP Software Maintenance
Figures
Architecture of LPPs, Optional Products and Fixes . . . . . . . . . . . . . . . . . . . 3
NIM Configuration in an SP Environment . . . . . . . . . . . . . . . . . . . . . . . . . 13
Web-Based System Manager Main Screen. . . . . . . . . . . . . . . . . . . . . . . . 42
Network Installation Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
General Properties of the Resource spot_aix432 . . . . . . . . . . . . . . . . . . . 46
Boot Image Information of the Resource spot_aix432 . . . . . . . . . . . . . . . . 47
13. SPOT Platform-Dependent Debug Addresses . . . . . . . . . . . . . . . . . . . . . 61
14. Output of the splstdata -a Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
15. Output of the spmon -d -G Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
19. DSMIT Configuration in an SP Environment . . . . . . . . . . . . . . . . . . . . . . 114
20. Basic Flowchart for PSSP Migration - Part 1 . . . . . . . . . . . . . . . . . . . . . . 124
21. Basic Flowchart for PSSP Migration - Part 2 . . . . . . . . . . . . . . . . . . . . . . 125
23. Typical Format of a Kerberos Backup File. . . . . . . . . . . . . . . . . . . . . . . . 145
24. Restore of the SDR Using SDRRestore . . . . . . . . . . . . . . . . . . . . . . . . . 150
25. Restore of the SDR Using sprestore_config . . . . . . . . . . . . . . . . . . . . . . 151
36. Dump the Kerberos Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
37. Kerberos Database Dump File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
38. Two SP Systems in a Single Kerberos Realm . . . . . . . . . . . . . . . . . . . . . 190
39. Secondary Authentication Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
40. The /etc/krb.conf File with Secondary Authentication Server . . . . . . . . . 200
© Copyright IBM Corp. 1999
ix
41. External Workstation as Part of the SP Kerberos Realm. . . . . . . . . . . . . 206
42. The CWS as NIM Master for a Second Network . . . . . . . . . . . . . . . . . . . 212
44. SP Node - Switch Chip Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
47. Switch Subsystem Component Architecture . . . . . . . . . . . . . . . . . . . . . . 230
51. Lab Scenario of Integrating S70 to an SP . . . . . . . . . . . . . . . . . . . . . . . . 248
52. Frame/Node Information in Perspectives. . . . . . . . . . . . . . . . . . . . . . . . . 251
53. LED/LCD Display for the New S70 Node. . . . . . . . . . . . . . . . . . . . . . . . . 254
x
RS/6000 SP Software Maintenance
Tables
Example of Packaging Term Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
NIM Objects Used by PSSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Supported AIX and PSSP Versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
PTF Requisites for Coexistence/Migration of PSSP and AIX Levels. . . . 126
© Copyright IBM Corp. 1999
xi
xii
RS/6000 SP Software Maintenance
Preface
This redbook is intended for system administrators and system operators who need to manage an SP system. It will help you install, tailor and configure an
RS/6000 SP system. It will also illustrate solutions for updating or migrating your SP to a higher level of AIX or PSSP software.
It is based on Version 2, Release 4 and Version 3, Release 1 of the
POWERparallel System Support Program. The migration and update section covers procedures for going from PSSP 2.4 to PSSP 3.1. Nevertheless, most of the migration and update tasks described are general procedures that are more or less AIX and PSSP version-independent.
This redbook is also a good starting point for anyone wanting to get more background information about network install management (NIM), the switch commands and the use of Kerberos.
The Team That Wrote This Redbook
This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization Poughkeepsie
Center.
Dr. Hajo Kitzhöfer is an Advisory International Technical Support
Organization (ITSO) Specialist for RS/6000 SP at the Poughkeepsie Center.
He holds a Ph.D. degree in Electrical Engineering from the Ruhr University of
Bochum (RUB). Before joining ITSO, he worked as an SP Specialist at the
RS/6000 and AIX Competence Center, IBM Germany. He has worked at IBM for eight years. His areas of expertise include RS/6000 SP, SMP, and
Benchmarks. He now specializes in SP System Management, SP
Performance Tuning and SP hardware.
Bärbel Altmann holds degrees in German Literature and Computer Science from Ludwig Maximilian University of Munich (LMU). Before she started working with AIX in 1990, she worked in the editorial section of a literary magazine published by LMU. While with IBM, she has worked on several major SP implementations at customer sites doing administration and problem determination, as well as consulting and second level support. Since
1997, she has worked as an SP instructor at the German Education and
Training Center.
Janakiraman Balasayee is an IT Specialist in the PSS department of IBM
Global Services, India. He holds a Diploma in Electrical Engineering. He has
© Copyright IBM Corp. 1999
xiii
14 years of experience in the IT industry. Before joining IBM, he worked on
UNIX and Norsk Data Systems. He has worked at IBM India for three years.
He provides country support for RS/6000, AIX and SP. His areas of expertise include RS/6000 and SP hardware, AIX and SP System Management.
Atul Sharma is an AIX Specialist in IBM U.K. He holds a diploma in
Computer Studies from Thames Valley Polytechnic. He has five years of experience in AIX. Previously, he worked with retail systems and UNIX. Since joining IBM two years ago, he has worked in presales technical support. His areas of expertise include SP support and Capacity Planning.
Jorge Vergara is an AIX Specialist in IBM Colombia. He holds a degree in
Systems Engineering from the Escuela Colombiana de Ingenieria. He has five years of experience in AIX. He works for the PSS division giving support mainly to SP systems and HACMP installations.
Thanks to the following people for their invaluable contributions to this project:
Rich Ferri
IBM PPS Lab Poughkeepsie
Linda Mellor
IBM PPS Lab Poughkeepsie
xiv
RS/6000 SP Software Maintenance
Comments Welcome
Your comments are important to us!
We want our redbooks to be as helpful as possible. Please send us your comments about this or other redbooks in one of the following ways:
• Fax the evaluation form found in “ITSO Redbook Evaluation” on page 271
to the fax number shown on the form.
• Use the online evaluation form found at http://www.redbooks.ibm.com/
• Send your comments in an internet note to [email protected]
xv
xvi
RS/6000 SP Software Maintenance
Chapter 1. Software Maintenance
This redbook concentrates on maintenance topics like installation, migration, update and backup of system software. It will help you install, tailor and configure an RS/6000 SP system. It also illustrates solutions for updating or migrating your SP to a higher level of AIX or PSSP software.
This publication is also a good starting point for anyone wanting to get more background information about network install management (NIM), the switch commands and the use of Kerberos.
First we clarify some terms so that we have a common understanding when we are using these terms in the following chapters.
1.1 Packaging Terms
AIX V4 is packaged into many different distinct groups of software. These groupings can be thought of as installable image groupings. Each grouping may have one or more images that may need to be installed in unique ways.
The basic packaging concept is to install a set of packages that make up a base environment and then allow the users the option of installing additional packages as they wish. Optional software may be organized into bundles, which are groups of products that are suited for a particular use. For example, a client bundle provides client AIX functionality with minimum disk utilization.
This granularity allows customers to install exactly what they need to create their required environment, allowing a smaller minimum installation size. The packaging terms are explained in the following sections.
1.1.1 Licensed Program Products (LPPs)
These are complete software products, including all packages and filesets needed to provide a specific function. For example, the Base Operating
System (BOS) is a Licensed Program Product (LPP).
1.1.2 Package
A package is a group of filesets with a common function collected into a single installable image. The image is in backup file format (BFF).
© Copyright IBM Corp. 1999
1
1.1.3 Filesets
AIX 4 is divided into filesets, each of which contains a group of logically related customer deliverable files. Each fileset can be individually installed and updated.
Revisions to filesets are tracked using the version, release, modification, and fix (VRMF) levels. Each time a fileset update is applied, the fix level is adjusted. Each time a maintenance level is applied, the modification level is adjusted, and the fix level is reset to zero.
Filesets provide a specific function. For example, bos.not.tcp.client is a fileset in the bos.net package (Base Network Support and Applications) providing
TCP/IP client support.
A fileset is the smallest installable and updatable package in AIX. Customers can choose which functions to install by selecting filesets. Fixes are delivered to customers by providing an updated fileset, which contains only those files that have changed since that version/release of AIX was introduced.
1.1.3.1 VRMF
Revisions to filesets are tracked using a four-part
Version.Release.Modification.Fix level. Along with each part of the VRMF versioning scheme come some implied rules for its use.
Version
A new version of AIX or a Licensed Program Product (LPP) implies a completely new design. In a new version, it is likely that every fileset will be changed. Although we strive for binary compatibility, some changes may be necessary that could break compatibility with older versions.
Release
A new release of AIX and LPP implies pervasive changes in the product or packaging. Most filesets are likely to be changed. Binary compatibility is usually preserved, although not guaranteed. Each release of an AIX version is supported independently, having its own service stream.
Modification
A modification level, also known as maintenance level, contains an accumulation of all fixes that were ever made to that release of the product, as well as fixes that were deferred. Maintenance levels are likely to contain support for new hardware, and may also contain minor functional
Software Maintenance
2
enhancements. Binary compatibility with previous maintenance levels of the same release is guaranteed. Pervasive changes are discouraged, since they would likely have a profound impact on the size, instabilities, and risk associated with future individual fix packages.
Fix
A fix level is an update to a particular fileset. Each successive fix level contains all previous fixes to that fileset. The main goal for fix levels is to provide the smallest, most localized, lowest-risk fix possible to the customer.
For this reason, any type of high-risk or pervasive change is discouraged.
1.1.4 Fileset Updates (PTFs)
Fileset updates, or Program Temporary Fixes (PTFs), are updates to particular filesets. A fileset update will either enhance, or correct a defect in, a previously installed fileset. Each successive fileset update contains all previous fixes to that fileset. Fileset updates are intended to provide the smallest, most localized, least-risk fix possible for a specific problem.
Figure 1 shows the relation between LPPs, optional program products and
PTFs.
AIX System
AIX Base Operating System
LPP LPP bos.rte
bos.rte.X11
Options
...
U423884 U461230
...
Figure 1. Architecture of LPPs, Optional Products and Fixes
Software Maintenance
3
1.1.5 Bundle
Bundles are collections of installable operating system software components and Licensed Program Product (LPP) components that are grouped together and can be installed with one selection. AIX V4.1 supports both system-defined and user-defined bundles.
The system-defined bundles are defined as:
• Client: A collection of software products for single-user systems running in a standalone or network client environment.
• Server: A collection of software products for multiuser systems running in a standalone or network environment.
• Personal Productivity: A collection of software products for graphical desktop systems running AIX and PC applications.
• Application Development: A collection of software products for developing applications, for example, compilers/linkers, libraries and debuggers.
1.1.6 Product Offerings
Product offerings are selected sets of packages which are sold together on physical media. Product offerings should not be confused with bundles (see
Examples of product offerings from IBM are:
• AIX 4.3 for Clients: This product offering is packaged with functionality aimed at satisfying the users of client systems. It is designed for customers who do not need to provide network server support for
LAN/WAN-attached devices. These systems can be used as client systems, personal productivity workstations, print servers, name servers, and gateways.
• AIX 4.3 for Servers: This product offering is packaged and priced for full server-system functionality. These systems can be used as network file servers, data servers, print servers, iFOR/LS servers, compute servers, application servers, and so on.
Software Maintenance
4
Table 1 demonstrates the differences between LPP, package, and fileset.
Table 1. Example of Packaging Term Usage
LPP Package Fileset bos bos bos.INed
bos.diag
bos.INed
bos.diag.rte
adacmp adacmp adacmp
1.2 AIX Maintenance Strategy
Fileset updates are available to customers on the World Wide Web using a
Web browser, or can be downloaded using FixDist. FixDist is an AIX Motif client that handles downloading the requested fixes with all necessary requisites.
Customers that purchase Support Line can also request updates on media, either through the World Wide Web, or through the AIX Support Center.
1.2.1 Monthly Maintenance Packages
Customers sometimes want to order the latest available fixes. In practice, it would be a difficult task to find the latest fixes, and then order or download all of the fileset updates. About once per month, we make the latest available fileset updates for AIX easily orderable under a single APAR number.
1.2.2 Maintenance Levels
Preventive maintenance in AIX 4 is delivered in a maintenance level (ML). An
ML consists of one fileset update for each fileset that has changed since the base level of the release. Each of these fileset updates is cumulative, containing all fixes for that fileset since the release was introduced, and supersedes all previous updates for the same fileset.
A new concept for preventive maintenance in AIX 4 is the recommended maintenance level (RML). A recommended maintenance level is a set of
APARs since the most recent maintenance level.
Recommended maintenance levels are named using the four-digit VRMF level of the maintenance level with a two-digit number appended, starting with
01 and incrementing by 1 each time a new level is released.
Following are the maintenance levels and recommended maintenance levels produced for AIX 4.2:
Software Maintenance
5
ML 4210, also called AIX 4.2.1
RML 4210-01, also called AIX 4210-01
Following are the maintenance levels and recommended maintenance levels produced for AIX 4.3:
ML 4310, also called AIX 4.3.1
ML 4320, also called AIX 4.3.2
Maintenance levels for AIX 4 are released approximately twice a year while the release is active. Maintenance levels are likely to contain support for new hardware, and may also contain minor functional enhancements.
1.2.3 Maintenance-Level Bundle
A maintenance-level bundle is a collection of fixes and enhancements that update the operating system to the latest level. Software is selected for installation if it is in the bundle you choose and on the installation media.
1.3 Media - IBM Software Manufacturing Company
If a customer orders the AIX media option, media is delivered through IBM
Software Manufacturing Company. Current media types are CD-ROM, 4mm,
8mm, and 1/4 inch tape.
1.3.1 CD-ROM
Due to increased reliability and speed, CD-ROM is by far the most popular media. The typical AIX order ships the following CD-ROMs:
1.3.1.1 AIX CDs
AIX 4.1 and 4.2 span two CD-ROMs. Both releases are currently available in client and server packages.
1.3.1.2 Bonus/Value Pack CD
AIX 4.1 includes a Value Pack CD, while AIX 4.2 includes a Bonus Pack CD.
These CD-ROMs contain additional software such as Ultimedia Services and
Netscape Navigator.
1.3.1.3 Update CD
The Update CD is a recent addition to the AIX CD-ROM set. This CD includes fixes for critical problems, optionally installable preventive maintenance, and
Software Maintenance
6
may also contain new device support. Documentation for the update CD is included in both ASCII and HTML formats on the CD, with instructions for browsing the documentation on the CD jacket. The documentation includes descriptions of critical fixes, preventive maintenance, and device support, as well as installation instructions. This CD is updated approximately once per quarter.
1.3.2 Tape
The Stacked Product Option (SPO) tape not only contains AIX, but can also contain Licensed Program Products (LPPs) that the customer ordered at the same time as AIX.
In addition to AIX and LPPs, the SPO tape may also contain critical fixes.
These fixes will be automatically installed along with the filesets they update.
The optional preventive maintenance contained on the Update CD cannot be included on the tape media, since it too would automatically be installed, and would therefore no longer be optional.
Software Maintenance
7
Software Maintenance
8
Chapter 2. The Installation and Customization Process
One of the more complex tasks within an SP system is the installation of the nodes. In contradiction to standalone workstations which are usually equipped with tape- and/or CD-devices, all the SP nodes are missing these devices. Therefore the ordinary installation methods via tape or CD do not work. The solution for this dilemma is installation over a network. The administrative Ethernet, the connection between the CWS and all the nodes, is used for this installation method. That is one of the reasons why the
Ethernet is mandatory.
Since version 3.2, AIX has had a network installation utility with very limited capabilities. The AIX 4 network installation management (NIM) is a totally new approach. NIM provides the ability to install machines with software from a centrally managed repository in the network.
Looking at the NIM installation methods, we notice that the PSSP software is only using a subset of the NIM capabilities (for example, only the installation of standalone workstations is used).
All of the NIM commands and options are being hidden by PSSP scripts and commands. Because of this, and as long as the installation is successful, some of the complexity appears to be transparent.
In order to diagnose installation problems, we need to understand what is actually happening under the covers. In this chapter we cover the various stages of the installation. We discuss how NIM is used in a usual AIX network environment and how PSSP utilizes these features during the installation phase.
2.1 NIM Overview
The NIM environment is comprised of client and server machines. A server provides resources (for example, files and programs required for installation) to another machine, the client. Any machine that receives resources is a client, although the same machine can also be a server in the overall network environment. Most installation tasks in the NIM environment are performed from one server, called the master.
The machines in an NIM environment, their resources, and the networks through which the machines communicate are all represented as objects within a central database that resides on the master. Each of these objects has attributes that give it a unique identity, such as the network address of a
© Copyright IBM Corp. 1999
9
machine or the location of a file or directory. See Table 2 for an overview of
NIM objects.
Table 2. Overview of NIM Objects
Object Class Object Type Object Object Attributes
Machines machines
Resources
Networks standalone dataless diskless spot lppsource mksysb scripts bosinst_data other resources
Ethernet token ring
FDDI
ATM file directory subnet
TCP/IP hostname hardware address name of network type location subnet mask
Not all the offered functionality of NIM is used in the SP. Only the minimum to
install a standalone system is utilized. Table 3 shows the required information
used during installation of an SP node.
Table 3. NIM Objects Used by PSSP
Object Class Object Type Object Object Attributes
Machines standalone machines
Resources file directory
TCP/IP hostname hardware address name of network type location
Networks spot lppsource mksysb scripts bosinst_data
Ethernet subnet subnet mask
The CWS is configured as a NIM master. The master and clients make up a
NIM environment. The master provides resources to the clients. Referring to
AIX Version 4.3, Network Installation Management Guide and Reference,
SC23-4113, each NIM environment can have only one NIM master. As long as the CWS is the only boot/install server, this rule is not broken. As soon as
you are using nodes as boot/install server as described in 2.1.1, “SP
10
RS/6000 SP Software Maintenance
Installation Hierarchy” on page 11, the rule is violated. Despite this fact, the
hierarchical installation in the SP works.
As Figure 2 illustrates, NIM supports two ways of installing machines: pull or
push.
Master
PUSH
Install
Resource
Client
Or,
Master
Install
Resource
Client
PULL
Figure 2. NIM Installation Methods
1. The push method is used when NIM installs dataless or diskless machines. This method is not used by the PSSP software.
2. The pull method is used when NIM installs standalone machines.
Generally, pull means that the client invokes the installation methods. This is the mode used by the PSSP software.
2.1.1 SP Installation Hierarchy
As NIM installations utilize the network, the number of machines you can install simultaneously depends on the throughput of your network (namely,
Administrative Ethernet). Other factors that can restrict the number of installations at a time are the disk access throughput of the installation servers, and the processor types of your servers.
To overcome this bottleneck you can configure nodes in your SP as boot/install server so that you will get a hierarchy of installation servers;
Figure 3 on page 12 shows an example of a possible SP configuration.
The Installation and Customization Process
11
CWS
1
1
. .
4
15 16
17
19
. .
20
31 32
33
35
. .
36
47 48
Figure 3. SP Installation Hierarchy
If your physical network structure supports such a hierarchy, you can initiate an installation process in each of these trees. In this case more nodes can be installed simultaneously than by using one flat network structure.
A NIM master makes use of the Network File System (NFS) utility to share resources with clients. As such, all resources required by clients must be local file systems on the master.
Administrators might be tempted to NFS-mount file systems from another machine to the CWS as /spdata/sys1/install/images to store node images, and as /spdata/sys1/install/<name>/lppsource to store AIX filesets due to disk space constraints. This is strictly not allowed, as NIM will not be able to
NFS-export these file systems.
Figure 4 on page 13 shows what NIM looks like in an SP environment, and we
describe how to configure this NIM environment.
12
RS/6000 SP Software Maintenance
SP2
Nodes
NIM C onfiguration in an SP E nvironm ent
N IM
Sta ndalone Clients sp2n01
N IM
Standalone Clients sp2n..
NIM Netw ork spn et_en0
N IM
Standalone Clients sp2n15
NIM Operations
bos_inst diag cust m aint reset update_all
NIM Resources
lppsource spot m ksysb bosinst_data scripts
S P2
Control W orkstation
NIM M aster
Figure 4. NIM Configuration in an SP Environment
The steps for creating a NIM master in an AIX environment to perform the various options are:
1. Create one of the standalone systems with AIX and NIM filesets installed as a NIM master, giving a unique network name for the primary network interface.
2. Define NIM standalone and diskless clients.
3. Create the basic installation resources like lpp_source and spot. If a customized mksysb image needs to be installed, then create mksysb resource. Create the bosinst_data resource if you would like to have a customized BOS install program.
4. Allocate the required resources for the clients based on the operations to be performed.
5. Invoke the operations you want to perform on the client machines.
In the following section we describe how to configure a general NIM environment on AIX. However, this section offers only a basic overview of
NIM in AIX. For more information on NIM, see AIX Version 4.3, Network
Installation Management Guide and Reference, SC23-4113.
The Installation and Customization Process
13
Attention
In an SP environment, we never configure NIM directly (using SMIT, bottom line or WebSM). Instead, it is done using scripts called wrappers. Through this overview of NIM, it will be easier to know what is happening when you execute the wrappers.
2.1.2 NIM Master
The NIM master is fundamental to all operations in the NIM environment. It must be a machine running AIX with the NIM master fileset installed. The NIM master provides the central point of administration for the NIM environment and is the primary resource server. All other machines are clients to the master, including machines that may also serve resources. There is only one
NIM master for each NIM environment.
The NIM filesets required to configure the NIM master are:
# lslpp -l |grep bos.sysmgt.nim
bos.sysmgt.nim.client
bos.sysmgt.nim.master
4.3.2.1
4.3.2.1
COMMITTED
COMMITTED
Network Install Manager -
Network Install Manager bos.sysmgt.nim.master_gui
4.3.2.0
COMMITTED Network Install Manager - GUI bos.sysmgt.nim.spot
4.3.2.1
COMMITTED Network Install Manager - SPOT bos.sysmgt.nim.client
4.3.2.0
COMMITTED Network Install Manager -
The rsh command is used to remotely execute commands on clients. To use the rsh command, the $HOME/.rhosts file on the client is automatically configured by NIM when the client is initialized so that the master has the permissions required to run commands on the client as root. For more information on rsh and the .rhosts file, see AIX Version 4.3, Network
Installation Management Guide and Reference, SC23-4113.
When you configure the NIM master, you specify a unique identifier to name the object that NIM creates to represent the network. This is the master's primary interface that connects to the clients. Once this object has been created, the name you specify identifies the network in all subsequent NIM operations.
In the SP CWS, the network type is created by default as spnet_en0 for the
Ethernet network during installation. The SMIT fastpath to initialize the NIM master is smitty nimconfig.
To initialize the NIM master from SMIT, use the command:
14
RS/6000 SP Software Maintenance
# smitty nim
Select Configure the NIM Environment.
Select Advanced Configuration.
Select Initialize the NIM Master Only.
Enter the input fields as shown in the following screen:
Configure Network Installation Management Master Fileset
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Network Name
* Primary Network Install Interface
[Entry Fields]
[spnet_en0]
[en0]
Allow Machines to Register Themselves as Clients?
[yes]
Alternate Port Numbers for Network Communications
(reserved values will be used if left blank)
Client Registration
Client Communications
[]
[]
+
+
#
#
2.1.3 NIM Networks
In order to perform certain NIM operations, the NIM master must be able to supply information necessary to configure client network interfaces. The NIM master must also verify that the clients are able to access all the resources of the master. When the NIM master is being configured, the network objects associated with the master are automatically defined in the NIM database.
You will have to define additional NIM networks if clients reside on other local area networks or subnets. In this case it is required to add a default or static
NIM route between networks you specify. NIM routes are added to network definitions so NIM can determine the connectivity between NIM machines.
When defining a default or static NIM route, you must also specify the gateways used by machines on the specified network.
The network types supported by NIM are tok, ent, fddi, generic and ATM. In the SP environment, the Ethernet (ent) network is always used as the primary network for NIM operations.
The SMIT fastpath to define the network is smitty nim_mknet.
The Installation and Customization Process
15
2.1.4 NIM Machines
There are currently three types of machines that can be managed in the NIM environment. These are standalone, diskless and dataless clients. The NIM environment is composed of two basic machine roles: master and client.
The NIM master manages the installation of the rest of the machines in the
NIM environment. The master is the only machine that can remotely run NIM commands on the clients. All the other machines in the NIM environment are clients to the master.
Standalone Clients
The standalone NIM clients are clients with the capability of booting and running from local resources. The standalone clients mount all the file systems from local disks and have a local boot image. They are not dependent upon network servers for their operation. They are managed in
NIM networks primarily to install and update software.
Diskless and Dataless Clients
Diskless and dataless clients are machines that are not capable of booting and running without the assistance of servers on a network. The diskless clients have no hard disks and the dataless clients have hard disks, but they will not be able to hold all the data that is required for operating.
2.1.4.1 NIM Clients in the SP Environment
In an SP environment only standalone clients are supported. They have their own local disks to hold all the data that is required. In this section we get into more detail about standalone clients.
The fastpath to define a NIM standalone or diskless client on the NIM master is smitty nim_mkmac.
Use the following to add the NIM standalone client sp4n10 using SMIT:
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Machines.
Select Define a Machine.
Host Name of Machine: sp4n10
Type of Network Attached to the Primary Network: ent
16
RS/6000 SP Software Maintenance
Define a Machine
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* NIM Machine Name
* Machine Type
* Hardware Platform Type
Kernel to use for Network Boot
Primary Network Install Interface
* Cable Type
* NIM Network
* Host Name
Network Adapter Hardware Address
Network Adapter Logical Device Name
IPL ROM Emulation Device
CPU Id
[Entry Fields]
[sp4n10]
[standalone]
[rs6k]
[up] bnc spnet_en0 sp4n10
[10005AFA159D]
[ent]
[]
[]
+
+
+
+
+/
2.1.5 NIM Resources
Most operations on clients in the NIM environment require one or more resources. NIM resource objects represent files and directories that are used to manage the installation of standalone machines and the operation of diskless and dataless machines.
The NIM resources that are normally used for standalone clients in SP are:
• lppsource
• Shared Product Object Tree (SPOT)
• mksysb
• scripts
• bosinst_data
Let us look at what these resources are and how to create them using SMIT.
The Installation and Customization Process
17
lpp_source
This is a directory containing all the installation images in bff format that can be installed on clients using the installp command. This directory should contain the minimum BOS filesets and all the device driver filesets so that the devices can be configured at the time of installation. Whenever you add or remove software from this directory, you must run the NIM
check operation on the lpp_source to update the installation table-of-contents file for the resource.
In addition to updating the table-of-contents file for the lpp_source, the check operation also updates the simages attribute for the lpp_source, which indicates whether or not the lpp_source contains the images necessary to install the Base Operating images on the machine.
Check the simages attribute (simages=yes) after running the nim -o check
<lpp_source> command.
To create the resource lppsource_aix432 in directory
/spdata/sys1/install/aix432/lppsource using SMIT, use the following:
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Resources.
Select Define a Resource.
Resource Type: lpp_source
Define a Resource
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Resource Name
* Resource Type
* Server of Resource
* Location of Resource
Source of Install Images
Names of Option Packages
Comments
[Entry Fields]
[lppsource_aix432] lpp_source
[master] +
[/spdata/sys1/install/a> /
[] +/
[]
[]
SPOT
A Shared Product Object Tree (SPOT) provides a /usr file system for diskless and dataless clients, as well as the network boot support for all clients. This is the fundamental resource in an NIM environment. It is required to install or initialize all machine configuration types.
18
RS/6000 SP Software Maintenance
Everything that a machine requires in a /usr file system, such as the AIX kernel, executable commands, libraries, and applications are included in the SPOT. Machine-unique information or user data is usually stored in the other file systems. A SPOT can be located on any standalone machine within the NIM environment, including the master. The SPOT is created, controlled, and maintained from the master, even though the SPOT can be located on another system.
There are two ways to create a SPOT. You can convert the /usr file system
(/usr SPOT), or you can locate the SPOT elsewhere within the file system
(non-/usr SPOT) on the server.
The PSSP software uses the non-/usr SPOT method for the NIM setup. All
SPOTs are located under /spdata/sys1/install/<aix version>. Here is an example where the POT for AIX 4.3.2 will be created:
/spdata/sys1/install/aix432/spot
The /usr SPOT inherits all the optional software that is already installed on the server. All the clients using the /usr SPOT have access to the optional software installed on the server. The non-/usr SPOT can be used to manage a different group of optional software than that installed and licensed for the server.
Creating a SPOT by converting the /usr file system has the advantage of being fast and using much less disk space. However, this method does not give you the flexibility to choose which software packages will be included in the SPOT, because all the packages and filesets installed in the /usr file system of the machine serving the SPOT will be included in the SPOT.
This was the way PSSP 2.2 was using the SPOT, which led to some problems.
Starting with PSSP 2.2, the second method, creating a non-/usr SPOT, is used. This method uses a lot more disk space, but it is more flexible.
Initially, only the minimum set of software packages required to support
NIM clients is installed in the SPOT, but additional packages and filesets can be installed. Also, it is possible to have multiple SPOTs, all with different additional packages and filesets installed, serving different clients.
That is the way the PSSP software supports different AIX versions on SP nodes. For each group of nodes with the same AIX version, a common
SPOT is maintained.
A SPOT varies in size from 100 MB up to, and sometimes in excess of,
300 MB, depending on the software that is installed. Since all device support is installed in the SPOT and the number of device filesets typically increases, the size is not easily predictable from release to release.
The Installation and Customization Process
19
SPOTs are used to support all NIM operations that require a machine to boot over the network. These operations are as follows:
• bos_inst
• maint_boot
• diag
• dkls_init
• dtls_init
The last two options are not used in an SP environment.
When a SPOT is created, network boot images are constructed in the
/tftpboot directory of the SPOT server, using code from the newly created
SPOT. When a client performs a network boot, it uses tftp to obtain a boot image from the server. After the boot image is loaded into memory at the client, the SPOT is mounted in the client's RAM file system to provide all additional software support required to complete the operation.
Each boot image created is up to 4 MB in size. Before creating a SPOT, ensure there is sufficient space in the root (/) file system, or create a separate file system for /tftpboot to manage the space required for the network boot images.
Recommendation
To avoid running out of filesystem space in the root filesystem we strongly recommend creating a separate file system, /tftpboot. The size of this filesystem should be at least 60 MB.
The Micro Channel-based systems support booting from the network using Token-Ring, Ethernet, or FDDI. The PowerPC PCI bus-based systems support booting from the network using Token-Ring or Ethernet.
The uniprocessor MCA and PCI bus-based systems can be used in a diskless or dataless configuration. Again, the diskless and dataless options are not used in an SP environment.
A single network boot image can be accessed by multiple clients; therefore, the network boot image cannot contain any client-specific configuration information. The platform type is specified when the machine object is defined, while the network type is determined from the primary interface definition. Two files are created in the /tftpboot directory on the
SPOT server for each client to be network-booted: ClientHostName and
ClientHostName.info. The ClientHostName file is a link to the correct
20
RS/6000 SP Software Maintenance
network boot image, while the ClientHostName.info file contains the client configuration information.
When the SPOT is defined (and created), the following occurs:
The BOS image is retrieved from archive or, for /usr conversion, just the root directory is retrieved from archive (/usr/lpp/bos/inst_root). The device support required to support NIM operations is installed. Network boot images are created in the /tftpboot directory.
Each network boot image supports a single network, platform, and kernel type. The network boot image files are named
SPOTName.Platform.Kernel.Network. The network types are Token-Ring,
Ethernet, and FDDI. The platform types are:
rs6k
used for POWER/POWER2/P2SC/PowerPC MCA bus-based machines
rspc
used for PowerPC Reference Platform (PREP) Architecture-based machines
chrp
used for PowerPC Common Hardware Reference Platform (CHRP)
Architecture-based machines
The rs6ksmp platform for 4.2 (and later) SPOTs are represented by the boot image with a platform type of rs6k and a kernel type of mp.
The kernel types are:
up mp
used for single processor machines used for multiple processor machines
Both up and mp boot images are created for each platform and network type. The network boot images located in /tftpboot look similar to the following: spot_aix432.rs6k.mp.ent
spot_aix432.rs6k.mp.fddi
spot_aix432.rs6k.mp.tok
spot_aix432.rs6k.up.ent
spot_aix432.rs6k.up.fddi
The Installation and Customization Process
21
spot_aix432.rs6k.up.tok
spot_aix432.rspc.mp.ent
spot_aix432.rspc.mp.tok
spot_aix432.rspc.up.ent
spot_aix432.rspc.up.tok
The amount of space used in the /tftpboot directory for boot images may become very large. A SPOT that supports network boot for all possible combinations of platforms, kernel types, and network adapters may require as much as 60 MB in /tftpboot. If the same server serves multiple
SPOTs, the space required in /tftpboot will be even more since each SPOT creates its own set of boot images.
In AIX 4.3, NIM creates by default only the boot images required to support the machines and network types that are defined in the environment. This should significantly reduce the amount of disk space used and the time required to create boot images from SPOT resources.
A SPOT will be created by using the following SMIT shortcut:
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Resources.
Select Define a Resource.
Resource Type: spot
Define a Resource
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Resource Name
* Resource Type
* Server of Resource
* Source of Install Images
* Location of Resource
EXPAND file systems if space needed?
Comments installp Flags
COMMIT software updates?
SAVE replaced files?
AUTOMATICALLY install requisite software?
OVERWRITE same or newer versions?
VERIFY install and check file sizes?
[Entry Fields]
[spot_aix432] spot
[master]
[lppsource_aix432]
[/spdata/sys1/install/a> / yes +
[]
+
+ no yes yes no no
+
+
+
+
+
22
RS/6000 SP Software Maintenance
mksysb
An mksysb resource represents a file that is a system backup image created using the mksysb command. The type of resource can be used as the source for the installation of a client. The file must reside on the hard disk of the NIM master in order to be defined as a resource. It cannot be located on a tape or other external media (like CDs).
First create the mksysb image on a file from a configured system; never try to use the mksysb image of a system with NIM master initialized.
In this example, we copied the mksysb image file as bos.obj.ssp.432 to directory /spdata/sys1/install/images before going to SMIT to configure the mksysb resource named mksysb_1.
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Resources.
Select Define a Resource.
Resource Type: mksysb
Define a Resource
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]
* Resource Name
* Resource Type
* Server of Resource
* Location of Resource
Comments
System Backup Image Creation Options:
CREATE system backup image?
NIM CLIENT to backup
PREVIEW only?
IGNORE space requirements?
EXPAND /tmp if needed?
Create MAP files?
Number of BLOCKS to write in a single output
(leave blank to use system default)
Use local EXCLUDE file?
(specify no to include all files in backup)
[]
-OR-
EXCLUDE_FILES resource
(leave blank to include all files in backup) no
[] no no no no no
[]
[Entry Fields]
[bos.obj.ssp.432] mksysb
[master] +
[/spdata/sys1/install/i> /
[]
#
+
+
+
+
+
+
+
+
The Installation and Customization Process
23
bosinst_data
This resource customizes the BOS install flow (it automates menu choices). It is used to customize the installation flow and to avoid the need for prompting at the console. To create this resource, first copy the file
/usr/lpp/bosinst.template as bosinst_data to the /spdata/sys1/install/pssp directory and edit the file to customize your environment.
The contents of the bosinst_data file for a no-prompt installation are shown in the following screen: control_flow:
BOSINST_DEBUG = yes
CONSOLE = /dev/tty0
INSTALL_METHOD = overwrite
PROMPT = no
EXISTING_SYSTEM_OVERWRITE = yes
INSTALL_X_IF_ADAPTER = no
RUN_STARTUP = no
RM_INST_ROOTS = no
ERROR_EXIT =
CUSTOMIZATION_FILE =
TCB = no
INSTALL_TYPE = full
BUNDLES = target_disk_data:
LOCATION =
SIZE_MB =
HDISKNAME = hdisk0 locale:
BOSINST_LANG = en_US
CULTURAL_CONVENTION = en_US
MESSAGES = en_US
KEYBOARD = en_US
After editing this file, we can create the no-prompt resource using SMIT as follows:
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Resources.
Select Define a Resource.
Resource Type: bosinst_data
24
RS/6000 SP Software Maintenance
Define a Resource
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Resource Name
* Resource Type
* Server of Resource
* Location of Resource
Comments
[Entry Fields]
[noprompt] bosinst_data
[master] +
[/spdata/sys1/install/p> /
[]
script
This is a customization program executed after installation. It is only a shell script which you create with any filename and define as a script resource for getting executed after the installation of the mksysb image or the SPOT image. This script can be used for customizing your environment in the client system, such as increasing the paging space and the file system size, configuring the NIS client, and so on. The steps for configuring this resource are the same as for creating the bosinst_data resource.
Note: You do not have a full AIX environment during the NIM customization process. Also, default routes and additional adapters are not configured at this stage of the installation.
The Installation and Customization Process
25
2.1.6 NIM Operations on Standalone Clients
Although an installed client is capable of booting from the local disk, it may be necessary to perform a network boot of a client for certain NIM operations.
Operation to Perform
Move cursor to desired item and press Enter.
[TOP] diag cust bos_inst maint reset fix_query check reboot maint_boot showlog
[MORE...3]
F1=Help
F8=Image
/=Find
= enable a machine to boot a diagnostic image
= perform software customization
= perform a BOS installation
= perform software maintenance
= reset an object's NIM state
= perform queries on installed fixes
= check the status of a NIM object
= reboot specified machines
= enable a machine to boot in maintenance mode
= display a log in the NIM environment
F2=Refresh
F10=Exit n=Find Next
F3=Cancel
Enter=Do
The operations that can be performed on the standalone clients are as follows:
bos_inst
This operation is used to install the AIX Base Operating System on standalone clients. The standalone clients are the boot/install servers and the nodes.
diag
This is used to boot the clients into diagnostics mode to perform hardware maintenance.
cust
This is used to install software filesets and updates on standalone clients and SPOT resources.
fix_query
This is used to display whether specified fixes are installed on a client machine or SPOT resources.
26
RS/6000 SP Software Maintenance
lppchk
This is used to verify that software was installed successfully by running the lppchk command on a NIM client or SPOT resource.
maint
This is used to deinstall software filesets and commit and reject updates on standalone clients and SPOT resources.
maint_boot
This is used to prepare resources for a client to be network-booted into maintenance mode to perform software maintenance.
reset
This is used to change the state of a NIM client or resource, so NIM operations can be performed with it. A reset may be required on a machine or resource if an operation was stopped before it completed successfully.
check
This is used to verify the usability of a machine or resource in the NIM environment. The check operation can be performed on NIM clients, or a group of NIM clients, a SPOT resource, or an lpp_source resource.
showlog
This is used to list software installed on a NIM client or SPOT resource.
reboot
This is used to reboot a NIM client machine. The target of a reboot operation can be any standalone NIM client or groups of clients.
The fastpath to perform a NIM operation on a standalone or diskless/dataless client is smitty nim_mac_op.
The following shows the steps to perform a bos_inst operation on node sp4n10 using the mksysb_1 image:
1. First allocate the resources for the client sp4n10 to get access to the files in the master. In addition to the mksysb_1 resource, allocate the lppsource and the SPOT resource for the node to do a network boot. To allocate the resources to the client sp4n10 using SMIT, use the following:
The Installation and Customization Process
27
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Machines.
Select Manage Network Install Resource Allocation.
Select Allocate Network Install Resources.
Target name: sp4n10
Available Network Install Resources: (select the following) noprompt lppsource_aix432 spot_aix432 mksyb_1 psspscript
(the customization script)
2. The next step is to initiate the bosinst operation on the client sp4n10. To perform this using SMIT, use the following:
# smitty nim
Select Perform NIM Administration Tasks.
Select Manage Machines.
Select Perform Operations on Machines.
Target Name: sp4n10
Operation to perform: bos_inst
Perform a Network Install
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Target Name
Source for BOS Runtime Files installp Flags
Fileset Names
Remain NIM client after install?
Initiate Boot Operation on Client?
Set Boot List if Boot not Initiated on Client?
Force Unattended Installation Enablement?
[Entry Fields] sp4n10 mksysb
[-agX]
[] yes yes no yes
+
+
+
+
+
28
RS/6000 SP Software Maintenance
2.1.7 NIM Administration
Generally the problems encountered during SP installation are related to
NIM. To resolve NIM-related problems, always use the following SMIT menu.
From this SMIT screen you can administer the clients and resources, and also check the state of the various NIM objects.
The fast path to reach this menu is smitty nim_admin.
Perform NIM Administration Tasks
Move cursor to desired item and press Enter.
Manage Networks
Manage Machines
Manage Resources
Manage Groups
Backup/Restore the NIM Database
Configure NIM Environment Options
Rebuild the niminfo File on the Master
Unconfigure NIM
F1=Help
Esc+9=Shell
F2=Refresh
Esc+0=Exit
F3=Cancel
Enter=Do
Esc+8=Image
Note: In many cases, NIM prevents operations on a target object when the object is not in a ready state. In such situations you can either try to reset the object to a ready state or set the force option for an operation to "on" when the action is initiated. This will ignore the state of the target object and the operation can be performed on it. In NIM the force option is available for the following operations:
Manage Networks
This displays a menu of operations that enable you to manage NIM networks.
Operations include creating a network, changing or showing the characteristics of a network, removing a network, and managing routing information for a network.
Note: You must have root user authority to perform any of the network operations.
The Installation and Customization Process
29
Manage Machines
This displays a menu of operations that can be performed on NIM machines.
The machine operations include adding a machine, changing or viewing the characteristics of a machine, and removing a machine. Additional machine operations are managing network interfaces, managing resource allocation, and performing machine control actions.
Manage Resources
This displays a menu of operations that enable you to manage NIM resources. The operations include listing the existing resources, adding a resource, changing or showing the characteristics of a resource, removing a resource, and performing maintenance on a Shared Product Object Tree
(SPOT).
In an existing customized NIM environment, when there is a problem related to one of the resources, you can check the status of that resource from this menu. If you find a problem, it is better to reset that resource with the force option set to yes
. This will reset the resource object and bring it to a ready state to perform a NIM operation.
Manage Groups
This displays a menu of operations that can be performed on NIM machine groups. Machine groups provide a means of grouping related NIM machines so that the machines may be operated upon as a logical unit. This will be useful when you want to perform similar operations in multiple clients.
Backup/Restore of the NIM Database
We recommend that you take a backup of the customized NIM database, so that you can restore it in case of a problem. The steps for taking a NIM
database backup are documented in 7.1.4, “Backing Up the NIM Database”
on page 146. The steps for restoring are documented in 7.2.4, “Restoring the
30
RS/6000 SP Software Maintenance
2.2 NIM Configuration Using PSSP
In an SP environment, we never configure NIM manually. Instead, the PSSP software package has scripts called wrappers for configuring the NIM environment. Wrappers are functions written in Perl and each one is an independent script.
These wrappers are called by the setup_server command for configuring the
NIM environment in the CWS and Boot/Install servers (BIS). These scripts can also be called from the command line if you know the sequence in which they have to be executed. These scripts use the information from the SDR to initiate the appropriate NIM commands with the required options. In this section we discuss how NIM is configured in an SP using these wrappers.
The NIM wrappers are part of the ssp.basic filesets. They are:
• mknimmast
• mknimint
• mknimclient
• mkconfig
• mkinstall
• mknimres
• delnimmast
• delnimclient
• allnimres
• unallnimres
Here we discuss the basic overview of these wrappers. For more information on the logic flow within each script, refer to Section 2.2.2 of RS/6000 SP:
PSSP 2.2 Survival Guide, SG24-4928.
2.2.1 mknimmast
This wrapper initializes the NIM master. This is the first step for configuring a
NIM master. In order to configure the NIM master, the bos.sysmgt.nim.master
and bos.sysmgt.nim.spot filesets need to be installed. This wrapper is executed only on the CWS and BIS as part of the setup_server script using information from the SDR to configure the NIM master.
The Installation and Customization Process
31
Command syntax:
# mknimmast -l <node_number_list>
-l<node_number_list> is the list of node numbers to define as NIM masters.
This is a required option. Node number 0 signifies the CWS.
Example:
# mknimmast -l 0
This command initializes the CWS as the NIM master.
2.2.2 mknimint
This wrapper creates network objects on the NIM master to serve the NIM clients. If there is more than one NIM master configured, mknimint creates network objects to find the CWS. In cases where more than one NIM master is defined, dedicated nodes become Boot/Install servers (BISs). The CWS remains as a resource center for other NIM masters. Therefore, it is important to configure network paths for reaching the CWS. This is the reason why the
BIS, while executing mknimint, searches for network interfaces of the CWS using the netstat command. To make sure that the BIS can reach every network interface of the CWS, the route definitions must be in place.
Command syntax:
# mknimint -l <node_number_list>
-l <node_number_list> is the list of node numbers to define as NIM masters.
This is a required option. Node number 0 signifies the CWS.
Example:
# mknimint -l 0
This command configures the NIM network objects for all the interfaces in the
CWS.
2.2.3 mknimclient
This wrapper takes input from the SDR to create the clients on the CWS and the BIS that are configured as NIM masters. The SP node’s reliable hostname configured in the SDR is used as the NIM machine object name.
32
RS/6000 SP Software Maintenance
Command syntax:
# mknimclient -l <node_number_list>
-l <node_number_list> is the list of node numbers to define as NIM clients.
This is a required option. Node number 0 (the CWS) is not allowed.
Example:
# mknimclient -l 5
This command creates the NIM client definitions for node 5 on the NIM master.
2.2.4 mkconfig
This wrapper creates the /tftpboot/<reliable_hostname>.config_info file for every node. The input values for this script are retrieved from the SDR for all the nodes. This command has no options, and whenever it is executed it creates the config files for all nodes that have a bootp_response value not set to disk. This file is used during network installation of the nodes.
Command syntax:
# mkconfig
2.2.4.1 The config_info File
The config_info file is used by the pssp_script. It exports all the required environment variables during the installation phase. You can see the contents of a sample config_info file in the following screen.
The Installation and Customization Process
33
# cat /tftpboot/sp4n09.install_info
#!/bin/ksh export control_workstation="9.12.0.4 192.168.4.140" export cw_hostaddr="192.168.4.140" export cw_hostname="sp4en0" export server_addr="192.168.4.140" export server_hostname="sp4en0" export rel_addr="192.168.4.9" export rel_hostname="sp4n09" export initial_hostname="sp4n09" export auth_ifs="sp4en0/192.168.4.140" export authent_server="ssp" export netinst_boot_disk="hdisk0" export netinst_bosobj="bos.obj.ssp.432" export remove_image="false" export sysman="true" export code_version="PSSP-3.1" export proctype="UP" export LPPsource_name="aix432" export cwsk4="" export LPPsource_hostname="sp4en0" export LPPsource_addr="192.168.4.140" export platform="rs6k" export ssp_jm="yes"
2.2.5 mkinstall
This wrapper creates the /tftpboot/<reliable_hostname>.install_info file for every node in the SDR whose bootp_response is not set to disk. This file is used during the network installation of the nodes.
Command syntax:
# mkinstall
2.2.5.1 The install_info File
The install_info file is used by the pssp_script. The information from this file is taken for the base configuration of the node. See an example file in the following screen.
cat /tftpboot/sp4n09.config_info
9 sp4n09.msc.itso.ibm.com 192.168.4.9 192.168.4.140 8 1 4 3 1 yes rootvg true hd isk0 1 false en0 192.168.4.9 255.255.255.0 NA 10 "" 192.168.4.0
css0 192.168.14.9 255.255.255.0 NA NA "" 192.168.14.0
The fields of the config_info file contain the following:
34
RS/6000 SP Software Maintenance
• Node number
• Initial hostname
• Node IP address
• Switch node number
• Switch number (switch board)
• Switch chip port
• Slot that this node use
• If arp is enabled for the switch
This additional information is only for PSSP >= 3.1 clients:
• Selected volume group
• Quorum
• Physical volume list
• Number of copies (mirroring)
• Volume group mapping
An entry for each adapter is defined in the SDR, listing:
• Adapter name
• Adapter IP address
• netmask
• ring_speed (for Token Ring)
• bnc_select (dix/bnc/tp for Ethernet)
This additional information is only for PSSP >= 3.1 clients:
• Ethernet rate
• Duplex type
2.2.6 mknimres
This wrapper creates all the NIM resources for installation, diagnostics, migration and customization. The resources created will be used by the allnimres wrapper for allocation to the clients, depending on the bootp_response field.
Command syntax:
# mknimres -l <node_number_list>
The Installation and Customization Process
35
-l <node_number_list
> is the list of node numbers to define as NIM masters.
This is a required option. Node number 0 signifies the CWS.
Example:
# mknimres -l 1
This command creates all the resources in the CWS.
The resources that are created by this wrapper are shown in the following screen: psspscript prompt noprompt migrate lppsource_aix432 mksysb_1 spot_aix432 resources resources resources resources resources resources resources script bosinst_data bosinst_data bosinst_data lpp_source mksysb spot
The purpose and location of these resources are as follows:
psspscript:
The pssp_script file is copied from the /usr/lpp/ssp/install/bin/pssp_script to the /spdata/sys1/install/pssp directory when the mknimres command is executed. This pssp_script is called by NIM after installation of a node and before NIM reboots the node.
It is run under a single user environment with the RAM file system in place. It installs the required LPPs and does the post-installation setup.
This script takes the input from the /tftpboot/<node>.config_info and
/tftpboot/<node>.install_info files.
prompt:
This resource is allocated when you want the node to prompt for the input from the console. It is used to perform a maintenance or diagnostic mode of operation.
noprompt:
This resource is allocated when installing or migrating the node using full overwrite install and you do not want the installation to prompt for any input on the console.
36
RS/6000 SP Software Maintenance
migrate:
This resource is allocated when you want to perform a migration on the nodes to a newer version of AIX.
lppsource_aix432:
This resource contains the BOS minimum filesets and PTFs required for installing the nodes in the installp format. Whenever you copy any new filesets or PTFs to this directory, you must initialize the table of contents file (.toc file) and also update the SPOT resource to reflect these PTFs.
mksysb_1:
This resource contains the image to be installed on the nodes. Along with the SP system, you get a spimg tape that contains the minimum BOS filesets and PTFs in image file format. If you are not using this minimal image and want to use the image you have created, then make sure that all the minimum PTFs are also installed. This image is copied to the directory /spdata/sys1/install/images as part of the installation steps.
spot_aix432:
This resource is defined as a directory structure that contains the runtime files common to all machines. It is created under the directory
/spdata/sys1/install/<aix version>/spot.
2.2.7 delnimmast
This wrapper is used to delete the NIM master definitions and all the NIM objects on the CWS or boot/install server. The NIM filesets will also be deinstalled by this script.
Command syntax:
# delnimmast -l <node_number_list>
-l <node_number_list> is the list of node numbers to undefine as NIM masters.
This is a required option. Node number 0 signifies the CWS.
Example:
# delnimmast -l 0
This command can be used to unconfigure the NIM master definitions in the
CWS.
The Installation and Customization Process
37
2.2.8 delnimclient
This wrapper is used to deallocate the resources and to delete the NIM client definitions from the NIM master.
Command syntax:
# delnimclient -l <node_number_list> | -s <server_node_list
>
-l <node_number_list> is the list of node numbers which will be deleted as NIM clients. This is a required option. Node number 0 (the CWS) is not allowed.
-s <server_node_list> is the list of server (NIM master) nodes on which to delete all clients that are no longer defined as boot/install clients in the SDR.
Example:
# delnimclient -l 7,8
This command deletes the NIM client definitions for nodes 7 and 8 from the
NIM master.
2.2.9 allnimres
This wrapper is used to allocate the NIM resources to the clients depending on the value of the bootp_response attribute defined in the SDR. If the bootp_response value is set to install, migration, maintenance or diagnostic, the appropriate resources will be allocated. In case of a disk or customize value, all the resources will be deallocated.
Command syntax:
# allnimres -l <node_number_list>
-l <node_number_list> is the list of node numbers representing NIM clients to which to allocate resources. This is a required option.
Example:
# allnimres -l 5
This command allocates resources to node 5, depending on the value of the bootp_response attribute for node 5.
2.2.9.1 Resources Allocated for Various bootp_response Values
The resources that are allocated for performing various operations on a client are shown in the following screens. If you encounter problems in trying to
38
RS/6000 SP Software Maintenance
perform any operation on the nodes from the NIM master, check if all these resources are allocated.
The command to check the resources that are allocated for performing a install operation on clients is as follows:
# lsnim -c resources sp4n09 boot nim_script resources resources psspscript noprompt lppsource_aix432 mksysb_1 spot_aix432 resources resources resources resources resources boot nim_script script bosinst_data lpp_source mksysb spot
The resources that are allocated for performing a maintenance operation on clients are as follows:
# lsnim -c resources sp4n09 boot nim_script resources resources prompt lppsource_aix432 spot_aix432 resources resources resources boot nim_script bosinst_data lpp_source spot
The resources that are allocated for performing a diag operation on clients are as follows:
# lsnim -c resources sp4n09 boot resources prompt spot_aix432 resources resources boot bosinst_data spot
The resources that are allocated for a migrate operation on clients are as follows:
# lsnim -c resources sp4n09 boot nim_script resources resources psspscript lppsource_aix432 spot_aix432 migrate resources resources resources resources boot nim_script script lpp_source spot bosinst_data
The Installation and Customization Process
39
If the bootp_response value is set to disk or customize no resources are allocated to the clients.
2.2.10 unallnimres
This wrapper is used to unallocate all the NIM resources for the clients. The command syntax is:
# unallnimres -l <node_number_list>
-l <node_number_list> is the list of node numbers representing NIM clients from which to deallocate resources. This is a required option.
Example:
# unallnimres -l 5
This command deallocates all resources for node 5.
Steps for Manually Configuring an Install Operation on a Node Using
Wrappers
When you run setup_server
, it looks for all the nodes and customizes them according to their bootp_response values each time you execute this command. This is really time-consuming and unnecessary. Instead of running setup_server
, you can execute the following wrappers to set up the node for installation. The steps to set up node sp4n06 for installation are as follows:
# spbootins -s no -r install -l 6
This sets node 6 for the install operation and adds an entry in /etc/bootptab for node 6.
# create_krb_files
This creates and updates the /etc/tftpaccess.ctl and
/tftpboot/sp4n06-new-srvtab files. It also sets the file permissions to read-only, and sets the owner of the file to user nobody.
# mkconfig
This creates the file /tftpboot/sp4n06.config_info.
# mkinstall
This command creates the file /tftpboot/sp4n06.install_info.
40
RS/6000 SP Software Maintenance
# export_clients
This command exports the directories for NFS mounts.
# allnimres -l 6
This command allocates all the resources that are required for the install operation for the node sp4n06.
Now from the perspectives, if you invoke a netboot on node 6, the node will reboot and start the installation of the node.
2.3 Web-Based System Manager for NIM (WSM)
Web-Based System Manager (WSM) enables a system administrator to manage an AIX machine either locally from a graphics terminal or remotely by using a browser like Netscape or Internet Explorer. It has been implemented in the Java programming language.The Web browser in the client machine should be Java 1.1-enabled.
WSM includes the components that were available for doing system administration using SMIT. It can be launched from the Java-enabled browser and also from the application icon in the CDE application manager. To launch the WSM from a browser on a PC or some other client, you should have the
Internet server daemon httpd running.
In this section we discuss how to do NIM administration using WSM. To open
WSM from a Web browser, type the following in your browser’s URL or location field: http://<your server name>/wsm.html
Figure 5 on page 42 shows the main screen when you open WSM from a
browser.
The Installation and Customization Process
41
Figure 5. Web-Based System Manager Main Screen
When you double click on the NIM icon, it takes you to the NIM Installation
Management menu, as shown in Figure 6 on page 43. This is the main screen
for configuring and doing NIM administration.
42
RS/6000 SP Software Maintenance
Figure 6. Network Installation Management
There are three types of objects shown on this screen: the Task Guide, the
Container, and the NIM machine objects (standalone and master). The screen also shows the state of each object, and for some objects it also gives additional information. For example, you can see that node sp4n01 is enabled for BOS installation, and that node sp4n06 is enabled for diagnostic boot.
Now let us describe these objects in detail.
The Installation and Customization Process
43
TaskGuides:
These objects are dialogs designed to assist the user in performing tasks in
an ordered series of steps. The first two lines in Figure 6 show two objects of
type TaskGuide. They are:
Configure NIM
This TaskGuide helps you to configure the basic NIM environment. It initializes the host as a NIM master. You then have the option to create the
NIM resources and the NIM clients in this NIM environment.
Add New Machine
This is used for adding new standalone or diskless and dataless clients to the NIM environment. Here it is possible to specify multiple systems at the same time to define in the NIM database.
Container Objects:
The container objects include other container objects and simple objects representing elements of system resources to be configured and managed. In
Figure 6 on page 43, you can see there are two types of container objects.
They are:
Resources
This container object contains the resource objects that are configured in
the NIM environment. Figure 7 on page 45 shows the NIM Resources
screen.
44
RS/6000 SP Software Maintenance
Figure 7. NIM Resources
The Add New Resource object is for creating new resources in the NIM environment. When you double-click on the other resource objects, it gives all the information and status of that resource.
For example, when you double-click on the spot_aix432 object, you will
see the screen shown in Figure 8 on page 46. This screen shows the
general information of this object, such as resource name, location of the resource, server of the resource and resource type. You also see that there are other menu options, like boot image information and the state information for this object.
The Installation and Customization Process
45
Figure 8. General Properties of the Resource spot_aix432
The boot image information of spot_aix432 is shown in Figure 9 on page
47. Here the display shows that the network boot image for a uniprocessor
and multiprocessor system is configured.
46
RS/6000 SP Software Maintenance
Figure 9. Boot Image Information of the Resource spot_aix432
Networks
This network container object contains the Add New Network TaskGuide and the network object spnet_en0 that is configured in this NIM
environment. Figure 10 on page 48 shows the NIM Networks screen.
The Installation and Customization Process
47
Figure 10. NIM Networks
The Add New Network TaskGuide object is for creating new networks.
Double-clicking the spnet_en0 object, you can see the general NIM routes and status information menus for this object.
Steps to perform a BOS Installation on sp4n10 Using WEBSM - NIM:
1. From the Network Installation Management screen, click the node sp4n10 to see that it is highlighted.
2. Next click Selected menu at the top of the screen and you will see the various operations that can be performed on the clients.
3. Select the Install Base Operating System option, which opens another
window, as shown in Figure 11 on page 49. This window is for allocating
the resources and starting the bos_inst operation.
48
RS/6000 SP Software Maintenance
Figure 11. BOS Installation for sp4n10
4. To perform the bos_inst operation using the mksysb_1 image, allocate the
resources as shown in Figure 11 and click OK to start the installation.
2.4 The Installation Process
This section discusses the main steps involved in the installation process of the CWS and the nodes. Some steps are covered in more detail than others, simply because they are already detailed in other redbooks or installation guides. Where this is true, references to these other redbooks are made. This section can be used to diagnose and solve installation problems, or to simply verify that the installation process is running correctly.
The node installation process is complex, so we detail the process from start to finish. To help you find out what is happening during node installation, we also look at the debug options that are available.
At the start of this section, we look at the steps that are performed on the
CWS before PSSP can be installed, and describe points that should be considered when preparing the CWS for an SP installation.
The Installation and Customization Process
49
2.4.1 Setting Up the Control Workstation
After AIX has been successfully installed, you should read all the README documentation that arrived with your AIX and PSSP software and take the necessary actions before continuing with the installation.
2.4.2 Disk Space Considerations
We suggest that you install the /spdata filesystem in a separate volume group from rootvg, for the following reasons:
• Availability - if a disk fails, only one volume group is lost.
• You may want to install HACWS in the future. For this you would need to install spdata on an external disk subsystem.
• To save backup and restore time. The spdata filesystem is around 2 GB, and it can grow considerably with future releases of AIX and PSSP.
• You may already have free disk space limitations.
Following is an approximation of free space required in various filesystems and volume groups.
• The recommendation for rootvg is 2 GB, providing that /spdata is installed in a separate volume group.
• /var requires at least 20 MB of free space. Most of the PSSP logs are stored here. The actual disk space used by these logs depends entirely on the types of problems that occur on your system and their frequency. The
/var filesystem should be monitored frequently to ensure that there is always sufficient free disk space for new logs.
• /tmp requires at least 16 MB of free space.
• We suggest that you create a separate file system for /tftpboot. Each lppsource level requires a minimum of 25 MB. You should also consider future AIX installations. We suggest 25 MB x (number of AIX versions+1).
For example, if you are installing AIX 4.2, 4.2.1 and 4.3, then 25 MB x
(3+1) = 100 MB.
• Downloading all of the AIX filesets to the lppsource directory requires approximately 1.5 GB of disk space. Downloading only the minimal filesets requires approximately 500 MB. Each mksysb image may vary between
100 MB and 700 MB. A simple example for AIX 4.3.2 is: lppsource + mksysb_images + pssp_lpp_images + SPOT = total disk space lppssource = 500 MB
50
RS/6000 SP Software Maintenance
mksysb_images = 300 MB pssp_lpp_images = 350 MB
SPOT = 200 MB
500 MB + 300 MB + 350 MB + 200 MB = 1.35 GB
You should have at least 1.35 GB of free disk space for the /spdata file system. To be on the safe side, you should reserve 2 GB of disk space for
/spdata. For multiple AIX and PSSP version coexistence, this figure will increase considerably, due to the disk space taken up by different versions of mksysb images, SPOT, AIX and PSSP filesets.
2.5 Directory Structure
Make sure that AIX and PSSP images are installed in the correct directory
structure. See Figure 12 on page 52 for an example. Use the correct naming
convention to store these images, otherwise the installation of the nodes will fail.
The Installation and Customization Process
51
spdata sys1 install aix432 lppsource images pssp pssplpp
PSSP-2.4
PSSP-3.1
aix421 lppsource
Figure 12. Multiple AIX and PSSP Levels
In an SP environment, for each AIX version an mksysb image, SPOT and
lppsource must exist. For multiple AIX and PSSP installations, refer to Table 4
for supported versions.
Table 4. Supported AIX and PSSP Versions
PSSP AIX
PSSP 2.1
2.2
2.3
2.4
3.1
Y
Y
Y
-
2.1
2.2
2.3
2.4
3.1
4.1.4
4.1.5
4.2
Y Y n
Y Y Y Y
Y
Y
Y
-
Y
Y
-
-
Y
-
-
n n n n n n n n n
Y
Y n
4.2.1
4.3
n n
Y n
Y n n
Y
Y n
4.3.1
4.3.2
n n n n
Y
Y
Y
52
RS/6000 SP Software Maintenance
All AIX mksysb images must be installed in /spdata/sys1/install/images. The normal naming convention for mksysb images is bos.obj.ssp.<aix version>. In our case we installed AIX 4.3.2, which is bos.obj.ssp.432. However, you can choose your own name for the mksysb image. For multiple images, make sure the image name is version-specific.
All AIX filesets must be installed in lppsource using the directory structure shown in the following screen. The corresponding SPOT is also shown here.
A SPOT resource contains basically all the files that are normally found in the
/usr filesystem. When you install new filesets in lppsource, you should then
update the corresponding SPOT; this is covered in 5.2, “Applying AIX PTFs in the CWS” on page 104.
/spdata/sys1/install/aix415/LPPsource
/spdata/sys1/install/aix415/spot
/spdata/sys1/install/aix421/LPPsource
/spdata/sys1/install/aix421/spot
/spdata/sys1/install/aix432/LPPsource
/spdata/sys1/install/aix432/spot
All PSSP filesets must be installed in /spdata/sys1/install/pssplpp/<code version>, as shown in the following screen.
/spdata/sys1/install/pssplpp/PSSP-2.2
/spdata/sys1/install/pssplpp/PSSP-2.3
/spdata/sys1/install/pssplpp/PSSP-2.4
/spdata/sys1/install/pssplpp/PSSP-3.1
Complete installation steps 1 to 17 described in IBM Parallel System Support
Programs for AIX: Installation and Migration Guide, GC23-3898.
You should now have completed the following:
1. AIX Installation
2. Required PTFs installed
3. Defined hostname, IP addresses, IP labels and netmasks
4. Verified network connections
5. Configured RS232 line between the CWS and frame(s)
6. Tuned CWS network adapters
7. Started the necessary daemons
8. Changed the CWS root Maximum Default Process value
The Installation and Customization Process
53
9. Added network tunable values in /etc/rc.net
10.Created the /tftpboot filesystem
11.Created the /spdata filesystem (preferably in a separate volume group)
12.Created LPPsource and copied AIX filesets
13.Copied the PSSP image
14.Copied a basic AIX mksysb image
2.6 Additional Required AIX LPPs
Depending upon the level of AIX and PSSP you want to install, you have to order and install additional AIX LPPs. The following section gives an overview of which perfagent level is required and answers the question why a
C-compiler is required.
The perfagent.server fileset is a prerequisite for PSSP. This fileset is part of the Performance Aide for AIX (PAIDE) feature of the Performance Toolbox for
AIX (PTX), a separate product. The perfagent.tools fileset is part of AIX 4.3.2.
Table 5 gives an overview of the required level of perfagent.
Table 5. Perfagent Filesets
AIX Level PSSP Level Required Perfagent File Sets
4.1.5
4.2.1
4.2.1
4.2.1
4.3.1
4.3.1
4.3.2
4.3.2
4.3.2
2.2
2.2
2.3
2.4
2.3
2.4
2.3
2.4
3.1
perfagent.server 2.1.5.x
perfagent.server 2.2.1.x or greater, where x is greater than or equal to 2 perfagent.server 2.2.1.x or greater, where x is greater than or equal to 2 perfagent.server 2.2.1.x or greater, where x is greater than or equal to 2 perfagent.server 2.2.31.x
perfagent.server 2.2.31.x
perfagent.tools and perfagent.server 2.2.32.x
perfagent.tools and perfagent.server 2.2.32.x
perfagent.tools 2.2.32.x
Another prerequisite for PSSP is IBM C for AIX. This LPP is necessary for service of the PSSP. Without the compiler’s preprocessor, dumb diagnostic
54
RS/6000 SP Software Maintenance
tools such as crash will not function fully. At least a one-user license is required.
For an installation of AIX 4.3.2 and PSSP 3.1 the following filesets need to be copied to /spdata/sys1/install/aix432/lppsource:
• xlC.rte 3.6.4.0
• perfagent.tools 2.2.32.x
2.7 Minimum Required PSSP Filesets
These are the minimum required PSSP filesets:
• rsct.* (all rsct filesets)
• ssp.basic
• ssp.authent (if CWS is a Kerberos authentication server)
• ssp.clients
• ssp.css (if SP Switch is installed)
• ssp.top (if SP Switch is installed)
• ssp.ha_topsvcs.compat
• ssp.perlpkg
• ssp.sysctl
• ssp.sysman
Note
If you are adding dependent nodes, you must also install the ssp.spmgr
fileset.
2.8 Install PSSP on the Control Workstation
Install PSSP on the CWS, either from media or the PSSP images that were earlier copied onto the CWS.
Install the PSSP software from SMIT panels, or use the installp command.
To install PSSP from the command line, enter the following command:
# installp -a -d /spdata/sys1/install/pssplpp/<PSSP-VER> -X ssp.*
Alternatively, use the SMIT panels to perform the PSSP installation, as follows:
The Installation and Customization Process
55
# cd /spdata/sys1/install/pssplpp/PSSP-3.1
# smitty install_latest
Select the current directory as your input device.
Install and Update from LATEST Available Software
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* INPUT device / directory for software
* SOFTWARE to install
PREVIEW only? (install operation will NOT occur)
COMMIT software updates?
SAVE replaced files?
AUTOMATICALLY install requisite software?
EXTEND file systems if space needed?
OVERWRITE same or newer versions?
VERIFY install and check file sizes?
Include corresponding LANGUAGE filesets?
DETAILED output?
Process multiple volumes?
[Entry Fields]
.
[ssp no yes no yes yes no no yes no yes
> +
+
+
+
+
+
+
+
+
+
Once PSSP has been installed on the CWS, you can perform the rest of the steps required to configure our CWS as an authentication server, NIM master and boot/install server. These steps need to be performed before you can install the SP nodes.
2.9 setup_authent
To proceed with the rest of the installation, you need to create a primary authentication server, which is normally the CWS. The minimum configuration required at this stage is to initialize the Kerberos database and define and register a kerberos administrative user. The administrative UNIX user is root.
A Kerberos principal must be created (it should be named root.admin) before you can perform subsequent steps in the installation. Enter the following command to configure the Kerberos database and to set up the root.admin
principal:
# setup_authent
The Creating the Kerberos Database menu will appear. Answer the prompts to initialize the primary authentication server and get a Kerberos ticket for the
Kerberos administration principal root.admin.
56
RS/6000 SP Software Maintenance
If you intend to install a secondary authentication server, follow the steps in
8.12, “Setting Up a Secondary Authentication Server” on page 197, once the
primary server has been initialized. Note that you need to ensure clock synchronization between primary and secondary authentication servers.
2.9.1 setup_authent Files and Daemons
The setup_authent script creates the following files and daemons, which are
discussed in detail in 8.4, “Kerberos Daemons” on page 163.
Files
/.k
/.klogin
master key cache file lists the names of principals authorized to invoke remote commands
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/tmp/tkt0
contains the local realm and a list of authentication servers for the realm maps host names to an authentication realm this server key file contains names and private keys this ticket cache file contains the ticket granting ticket
/var/kerberos/database/principle.dir
part of the Kerberos database
/var/kerberos/database/principle.pag part of the Kerberos database
Daemons
The setup_authent script starts the kerberos and kadmind daemons. These daemons are added to the /etc/inittab file and are started from here thereafter. The following screen shows an entry for each of the kerberos daemons added to the /etc/inittab file.
# cat /etc/inittab kerb:2:once:/usr/bin/startsrc -s kerberos kadm:2:once:/usr/bin/startsrc -s kadmind
The kerberos daemon is responsible for providing ticket granting tickets to clients. The kadmind daemon is responsible for the administration of the
The Installation and Customization Process
57
Kerberos principals, typically for adding principals and changing principal passwords.
If you were installing at least one secondary authentication server, there would be an entry in the inittab file for the kpropd daemon on the secondary authentication server.
The kpropd daemon only exists on the secondary authentication server and is responsible for updating the secondary authentication server’s database from the primary authentication server.
2.10 The install_cw Command
Enter the klist command to ensure that you have a valid ticket granting ticket
(TGT) for the root.admin principal. The following screen shows the command output.
# /usr/kerberos/bin/klist
Ticket file: /tmp/tkt0
Principal: root.admin@SP4EN0
Issued Expires Principal
Apr 28 18:08:23 May 28 18:08:23 krbtgt.SP4EN0@SP4EN0
If your ticket is expired, you have to run kinit root.admin
from the command line and enter your password. Once you have a ticket granting ticket, enter the following command:
# install_cw
This script performs the following steps:
1. It creates the /spdata/sys1/spmon/hmacl file.
The hmacl file contains the permissions specifications for Kerberos principals. The following screen shows the default entries for root.admin
and hardmon. Both principals have vsm permissions. What this actually means is that both principals can view, save, and modify.
# cat /spdata/sy1/spmon/hmacl
1 root.admin vsm
1 hardmon.sp4en0 vsm
58
RS/6000 SP Software Maintenance
2. It creates the /etc/SDR_dest_info file. This file contains the IP address and name of the SDR server, which is normally the CWS. The following screen shows the output of our SDR_dest_info file.
# cat /etc/SDR_dest_info default:192.168.4.140
primary:192.168.4.140
nameofdefault:sp4en0 nameofprimary:sp4en0
3. It creates the sdrd and hardmon subsystems.
4. It creates an entry for the sdrd in /etc/inittab.
5. It creates an entry for the hardmon daemon in /etc/inittab.
6. It creates the following port map entries in the /etc/services: hardmon 8435/tcp sdr 5712/tcp heartbeat 4893/udp
2.11 The setup_server Command
The setup_server command calls a number of scripts and wrappers to perform
the following task. These wrappers are discussed in detail in 2.2, “NIM
Configuration Using PSSP” on page 31.
The main tasks performed by the setup_server command are setting up PSSP services such as NTP, AMD and File Collections. This command ensures that the required Kerberos files exist and then creates a list of known rcmd principals. It creates the CWS as a NIM master and gets information from the
SDR to create the NIM environment including network objects, NIM clients,
SPOT and tftp boot images for the different platforms, and stores them in the
/tftpboot directory. These files are discussed in detail in 2.14.1, “/tftpboot
After creating the NIM environment, the NIM resources are allocated to NIM clients inside the SP. When setup_server is run for the first time on a CWS, all of these steps need to be performed. This can take up to one hour or more depending on the CWS configuration and hardware environment. Creating the SPOT takes up the majority of this time.
The Installation and Customization Process
59
NIM logs the actions performed during SPOT creation in the /tmp directory.
There you should look for the file name spot.out.<pid>; where pid is the process ID of the SPOT creation process. This process is probably not running any more, so in case you find several spot.out.xxx files, take the newest one. View the contents of this file and see if you find any hints why the build of the SPOT was unsuccessful.
2.12 Verification Tests
Upon the completion of setup_server you should reboot the CWS and then perform configuration verification and verification tests to ensure that the
CWS is set up correctly. Verify that all the necessary daemons are running.
Run the following tests:
# SDR_test
The SDR_test creates an SDR class and adds attributes and values, then deletes the SDR class to ensure the SDR is working correctly.
# spmon_itest
The spmon_itest verifies the following:
• The SP_ports class in the SDR contains hardmon information.
• The hardmon daemon is running.
• The sdrd daemon is running.
• It validates the hmacls file.
After these tests are complete, log files are created in /var/adm/SPlogs:
SDR_test.log and spmon_itest.log.
2.13 Debugging NIM
If the SP node was booted from the network boot image, but failures are still occurring during a BOS installation, it may be necessary to collect debug information from the BOS install program. This is achieved by building debug
boot images using the SPOT. After doing this, commands and output produced by the BOS install program will automatically be displayed on the tty session connecting to the node. Typical symptoms of BOS installation problems are system hangs.
Viewing the debug output can be extremely useful, because you will be able to see the commands that failed. The problem may be a misconfiguration of
60
RS/6000 SP Software Maintenance
the boot network adapter, incorrect NIM configuration information, or errors in resource definitions. Examining the debug output, you can reduce the scope of investigation, isolate the cause of the problem, and fix it.
Perform the following steps to produce debug output from a network boot image:
Step 1: Create the debug boot images
To create debug versions of the network boot images use the nim command on the NIM master:
# nim -Fo check -a debug=yes SPOTName where SPOTName is the name of your SPOT (for example, spot_aix432).
Step 2: Get the address used for the debug session
On the master, use the following command:
# lsnim -a enter_dbg SPOTName where SPOTName is the name of your SPOT (for example, spot_aix432). The displayed output will be similar to the one shown in the following sample: spot_aix432: enter_dbg = "chrp.mp.ent 0x001f3d1c" enter_dbg = "rs6k.mp.ent 0x001f3d1c" enter_dbg = "rs6k.up.ent 0x001bd844"
Figure 13. SPOT Platform-Dependent Debug Addresses
Write down the enter_dbg address for the type of node you are going to boot.
For example, if your client is an rs6k multiprocessor machine, you would write down the address 1f3d1c.
You can query the platform and processor type information (for example: node #1) by using the following command:
# SDRGetObjects -x Node node_number==1 platform processor_type
The CWS responds with: rs6k MP
Step 3: Open a terminal session to the node
You have to open a read/write terminal session to the node, because the system prompts you during the installation. You also want to keep track of the debug information using a file named /tmp/BOS.trace.n01. To access node #1 enter:
The Installation and Customization Process
61
# s1term -w 1 1 | tee /tmp/BOS.trace.n01
Press Enter twice.
Step 4: Netboot the node
We assume that the node is set to customize and the setup_server script was executed.
The boot process has to be initiated using the spmon command. To boot node
1, enter:
# spmon -power on node1
During the boot process a menu will appear and the system will prompt you for the selection of the network boot device. Make sure to select the SP
Ethernet adapter. This is “normally” automated by the nodecond command during node installation. Now it has to be done “manually”.
Step 5: Enter debug information
After the client gets the boot image from the NIM master, the debug screen appears on the terminal. You see some cryptic information and the ">" prompt will be shown. Enter: st Enter_dbg_Value 2 where
Enter_dbg_Value is the number you wrote down in step 2 as your machine type's address (in our example, 1f3d1c). Specifying the
2 in the end of your command makes sure output is printed to your tty.
After getting the prompt again, type g for go and press Enter to start the boot process.
Use Ctrl-s to temporarily stop the process as you watch the output on the tty.
Use Ctrl-q to resume the process.
The system will begin installation and provide you with the commands and outputs during the installation process. If the system hangs, you can spot the command causing this and what the problem is.
Step 6: Build a normal SPOT
After a successful debug session rebuild your SPOT in non-debug mode. Use the following command on the NIM master:
# nim - Fo check SPOTName where SPOTName is the name of your SPOT (for example, spot_aix432).
62
RS/6000 SP Software Maintenance
If the boot image is left in debug mode, then every time a client is booted from these boot images, the machine will stop and wait for a command at the debugger ">" prompt. If you attempt to use these debug-enabled boot images and there is not a tty attached to the client, the machine will appear to be hanging for no reason.
2.14 Set bootp_response to Install
In this example we are going to install node sp4n09 with AIX 4.3.2 and PSSP
3.1. We use hdisk0 for rootvg. From either SMIT or the command line, set the node to install. From the command line enter the following command:
# spbootins -c rootvg -r install -l 9 -s yes
Enter the following command to verify that the node boot response is set to install. Check that the version levels for AIX and PSSP software are correct.
Check the hdisk number for the boot disk.
# splstdata -b -l 9
[sp4en0:/tmp]# splstdata -b -l 9
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
9 sp4n09.msc.itso.
0004AC4947E9 0 default Fri_Apr__9_18:26:04 disk default hdisk0 aix432
PSSP-3.1
rootvg
2.14.1 /tftpboot Files
When a node is set to install and setup_server is run on the CWS, the setup_server script updates the /etc/tftpaccess.ctl file on the server. In the
/tftpboot directory the following files are created for sp4n09: sp4n09-new-srvtab sp4n09.config_info
sp4n09.install_info
Each file name starts with the reliable hostname. The srvtab file is the
Kerberos service key file; for more information, see 8.6, “The krb-srvtab File” on page 166.
The Installation and Customization Process
63
The config_info file is described in 2.2.4.1, “The config_info File” on page 33
and the install_info is described in 2.2.5.1, “The install_info File” on page 34.
They are created by the mkinstall and mkconfig wrappers. For more
information on these wrappers, see 2.2.4, “mkconfig” on page 33.
An entry for node sp4n09 is added to the bootpd configuration file
/etc/bootptab on the CWS. Each bootpd configuration parameter in this single line (in our screen output this line was broken down into two lines) is two characters long and is separated from the other parameters. The following screen shows an example entry for one node: cat /etc/bootptab sp4n09:bf=/tftpboot/sp4n09:ip=192.168.4.9:ht=ethernet:ha=10005AFA158A:sa=
192.168.4.140:sm=255.255.255.0:
Here is an explanation of each parameter:
• The first field specifies the node name, which in this example is sp4n09, followed by a character tag bf.
• bf specifies the bootfile /tftpboot/sp4n09.
• ip specifies the ip address of the node.
• ht specifies the network type, which is always Ethernet for SP nodes.
• ha specifies the hardware address of the Ethernet adapter.
• sa specifies the ip address of the TFTP server.
• sm specifies the host subnet mask.
64
RS/6000 SP Software Maintenance
All the NIM resources required for an install bootp response are created and allocated to the node (for example the bos image, SPOT and lppsource to use). NIM resources for PSSP script and customization scripts are also allocated. Enter the following command to verify the NIM resources allocated to the node:
# lsnim -l sp4n09
You should get a response similar to the following: class type platform
= machines
= standalone
= rs6k netboot_kernel = up if1 = spnet_en0 sp4n09 10005AFA158A ent cable_type1
Cstate
= bnc
= BOS installation has been enabled prev_state
Mstate
= ready for a NIM operation
= currently running boot = boot bosinst_data = noprompt lpp_source mksysb
NIM_script script
= lppsource_aix432
= mksysb_1
= NIM_script
= psspscript spot cpuid control
= spot_aix432
= 000201335700
= master
The Installation and Customization Process
65
66
RS/6000 SP Software Maintenance
Chapter 3. Verifying Your SP
An SP system is a complex interrelation among the many components that make this machine powerful and interesting. To ensure that your system is working properly, and to avoid future problems, you should verify from the very beginning that all components of your SP system are working as expected and will be available when needed.
You should not underestimate the importance of performing verification procedures at the time of installation. Just as you would not wait until you were going 70 mph before testing the brakes on a new car, or wait until dark before testing its headlights, neither should you wait until your system is in production (and perhaps running into problems) before performing verification.
SP2
OK
LPPs
DATE
TIMEZONE
KERBEROS
SDR
This chapter provides guidelines that you can use to verify that your SP system is well installed and customized, and that all subsystems integrated in the SP are running as needed according to your specific environment. Some scripts are also provided through the SMIT panels to verify your system. You can type smitty sp_verify to go to the panel of verifications and execute the scripts.
It is also helpful if you perform these verifications whenever you finish executing any procedure that changes your environment (for example, initial installation, migration, installation of a new node, installation of PTFs, partitioning, addition of a new switch and so on).
Following the verification procedures in the order in which they are presented in this chapter is a good way to verify your whole system. However, you can also use them when only verifying some of your subsystems.
© Copyright IBM Corp. 1999
67
While the purpose of this chapter is to verify your SP system and isolate problems, refer to RS/6000 SP: Problem Determination Guide, SG24-4778 for problem resolution procedures.
3.1 LPP Level
The level of the LPPs should be the latest available at the time of your installation. Try to obtain that level at the start because some PTFs require you to do specific procedures to be completely installed, such as taking down your nodes, or some of the subsystems, or the Control Workstation (CWS).
Activities like rebooting a node may interrupt the production system and are therefore difficult to do on a working system. For example a PTF for the switch could, in some cases, cause the switch to stop.
Never go to production having just the software included in the original media, because many problems are found after the software is installed and used massively by the customers.
There are also some specific PTFs required for AIX that should be installed in the CWS and the nodes so they will function properly. These PTFs are mentioned in the README FIRST files, so be sure to read that documentation to certify that all the PTFs are installed in your system. You can verify the level of your SP software using the following command:
# lslpp -L | grep -E "ssp|rsct" rsct.basic.hacmp
rsct.basic.rte
rsct.basic.sp
rsct.clients.hacmp
rsct.clients.rte
rsct.clients.sp
ssp.authent
ssp.basic
ssp.clients
ssp.css
ssp.ha_topsvcs.compat
ssp.docs
ssp.gui
ssp.jm
ssp.perlpkg
ssp.pman
ssp.ptpegui
ssp.public
C
C
C
C
A
C
C
A
C
C
C
C
A
A
A
C
A
C
3.1.0.2
3.1.0.4
1.1.0.3
1.1.0.0
1.1.0.2
1.1.0.0
3.1.0.1
3.1.0.4
3.1.0.4
3.1.0.4
3.1.0.0
3.1.0.0
3.1.0.4
3.1.0.1
3.1.0.0
3.1.0.1
3.1.0.1
3.1.0.0
RS/6000 Cluster Technology basic
RS/6000 Cluster Technology basic
RS/6000 Cluster Technology basic
(ECIP) RS/6000 Cluster
RS/6000 Cluster Technology
(ECIP) RS/6000 Cluster
SP Authentication Server
SP System Support Package
SP Authenticated Client Commands
SP Communication Subsystem
Compatability for ssp.ha and ssp.topsvcs clients
SP man pages and PDF files and
SP System Monitor Graphical User
SP Job Manager Package
SP PERL Distribution Package
SP Problem Management
SP Performance Monitor Graphical
Public Code Compressed Tarfiles
68
RS/6000 SP Software Maintenance
ssp.resctr.rte
ssp.spmgr
ssp.st
ssp.sysctl
ssp.sysman
ssp.tecad
ssp.top
ssp.top.gui
ssp.ucode
ssp.vsdgui
3.1.0.0
3.1.0.2
3.1.0.1
3.1.0.1
3.1.0.2
3.1.0.0
3.1.0.1
3.1.0.1
3.1.0.1
3.1.0.1
C
A
C
A
C
C
A
A
A
A
SP Resource Center
SP Extension Node SNMP Manager
Job Switch Resource Table
SP Sysctl Package
Optional System Management
SP HA TEC Event Adapter Package
SP Communication Subsystem
SP System Partitioning Aid
SP Supervisor Microcode Package
VSD Graphical User Interface
There are some LPPs that are absolutely necessary for your CWS; the following list specifies the minimum PSSP LPPs that you should have installed: rsct.basic.hacmp
rsct.basic.rte
rsct.basic.sp
rsct.clients.hacmp
rsct.clients.rte
rsct.clients.sp
ssp.authent (if CWS is Kerberos authentication server) ssp.basic
ssp.clients
ssp.css (if switch is installed) ssp.ha_topsvcs.compat
ssp.perlpkg
ssp.sysct
ssp.sysman
ssp.top (if switch is installed) ssp.ucode
If you are using Virtual Shared Disks (VSDs), you should see: vsd.cmi
vsd.hsd
vsd.rvsd.hc
Verifying Your SP
69
vsd.rvsd.rvsdd
vsd.rvsd.scripts
vsd.sysctl
vsd.vsdd
This list is valid for PSSP 3.1. It may vary for other PSSP versions.
Other products that are included with the PSSP media, but are not absolutely necessary unless you have that specific requirement, are: ssp.docs
ssp.gui
sss.hacws
ssp.pman
ssp.ptpegui
ssp.public
ssp.resctr.rte
ssp.st
ssp.tecad
ssp.top.gui
ssp.vsdgui
After you verify that all the required LPPs are installed, check for the required
PTFs. Use the instfix command to determine weather a specific APAR is installed. For example, to display if APAR IX71835 is installed in your system, type:
# instfix -ik IX71835
You will have one of two responses:
All filesets for IX71835 were found.
Or:
There was no data for IX71832 in the fix database.
If you want a complete display of this fileset, type:
#insfix -ivk IX71835
The response will be:
70
RS/6000 SP Software Maintenance
IX71835 Abstract: nfs v3 sequential write performance
Fileset bos.mp is not applied on the system.
Fileset bos.net.nfs.cachefs:4.3.0.1 is applied on the system.
Fileset bos.net.tcp.client:4.3.0.1 is applied on the system.
Fileset bos.up:4.3.0.1 is applied on the system.
Fileset bos.adt.include:4.3.0.1 is applied on the system.
All filesets for IX71835 were found.
After you have received this response, you should verify that the software is consistent throughout your system; that is, that your nodes and CWS have the same levels of software.
There are different methods of performing this verification:
• Locally: If you have only a few nodes, you can do this verification by logging in and executing the same command that you executed before and comparing the results with the ones from the CWS.
-a
• Remotely: You can do this verification by executing the command lppdiff with flags such as the following:
Specifies that the System Data Repository (SDR) initial_hostname field for all nodes in the current system partition be added to the working collective. If -G is also specified, all nodes in the SP system are included.
-c
-G
-n
-v
-w
Displays information as a list separated by colons.
Expands the scope of the -a flag to include all nodes in the SP system. The -G flag is meaningful only if used in conjunction with the -a flag.
Displays the count of the number of nodes with a fileset.
Verifies hosts before adding to the working collective. If this flag is set, each host to be added to the working collective is checked before being added.
Specifies a list of host names, separated by commas, to include in the working collective. Both this flag and the -a flag can be included on the same command line. Duplicate host names are included only once in the working collective.
For example, to query LPP information for ssp.basic on all nodes in the system, enter:
# lppdiff -Ga ssp.basic
Verifying Your SP
71
You should receive output similar to the following:
-------------------------------------------------------------------------------
Name Path Level PTF State Type Num
-------------------------------------------------------------------------------
LPP: ssp.basic
/etc/objrepos 3.1.0.0
COMMITTED I
From: sp4n01 sp4n05 sp4n06 sp4n07 sp4n08 sp4n09 sp4n10 sp4n11 sp4n13 sp4n15
10
-------------------------------------------------------------------------------
LPP: ssp.basic
/etc/objrepos 3.1.0.4
APPLIED F 10
From: sp4n01 sp4n05 sp4n06 sp4n07 sp4n08 sp4n09 sp4n10 sp4n11 sp4n13 sp4n15
-------------------------------------------------------------------------------
LPP: ssp.basic
/usr/lib/objrepos 3.1.0.0
COMMITTED I 10
From: sp4n01 sp4n05 sp4n06 sp4n07 sp4n08 sp4n09 sp4n10 sp4n11 sp4n13 sp4n15
-------------------------------------------------------------------------------
LPP: ssp.basic
/usr/lib/objrepos 3.1.0.4
APPLIED F 10
From: sp4n01 sp4n05 sp4n06 sp4n07 sp4n08 sp4n09 sp4n10 sp4n11 sp4n13 sp4n15
If you find any inconsistency with the software, stall the required LPPs or
PTFs.
3.2 Dates and Timezones
In a “normal” machine, having the correct date is important, but in the SP it is vital. However, it is easy to check the correct date and thereby avoid problems in your environment. Simply make sure that the CWS has the correct date and also check the specified timezone, as shown:
# date
Fri Apr 16 17:19:27 EDT 1999
Do the same with the nodes. You can use the dsh command for this check:
# dsh -a date
If you see a difference of more than five minutes or the timezone is incorrect, you will have to change it because there are daemons and subsystems that use time stamps to accomplish their work. For example, kerberos uses a time stamp when decoding tickets for authentication, and cannot handle differences of more than five minutes in the clocks of the authenticator and the clients.
If you find such differences in your system, refer to IBM Parallel System
Support Programs for AIX: Administration Guide, GC23-3897 to understand the procedure to change hostnames and IP addresses.
72
RS/6000 SP Software Maintenance
3.3 Checking Kerberos
Kerberos is the authentication subsystem used in the SP. It is based on
Version 4 of the Kerberos authentication service developed at MIT. Kerberos validates the identity of either a user of a service, or the service itself.
For a fast check of kerberos, type the following command:
# dsh -a date
You should receive responses from all nodes that are powered on, giving you their system date, as in the following example:
# dsh -a date sp4n01: Fri Apr 16 10:55:02 EDT 1999 sp4n05: Fri Apr 16 10:55:02 EDT 1999 sp4n06: Fri Apr 16 10:55:02 EDT 1999
If you have an improper configuration, you will receive messages such as: sp4n10.msc.itso.ibm.com: Fri Apr 16 10:36:44 EDT 1999 sp4n10.msc.itso.ibm.com: rshd: Kerberos Authentication Failed: Access denied because of improper credentials.
sp4n10.msc.itso.ibm.com: spk4rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure.
sp4n11.msc.itso.ibm.com: Fri Apr 16 10:36:44 EDT 1999 sp4n11.msc.itso.ibm.com: rshd: Kerberos Authentication Failed: Access denied because of improper credentials.
sp4n11.msc.itso.ibm.com: spk4rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure.
If you receive an error in that command, verify that your tickets are available and have not expired by typing klist
.
Ticket file:
Principal:
/tmp/tkt0 root.admin@SP4EN0
Issued Expires Principal
Apr 15 11:31:24 May 14 11:31:24 krbtgt.SP4EN0@SP4EN0
Apr 15 11:31:24 May 14 11:31:24 rcmd.sp4n06@SP4EN0
Apr 15 11:31:24 May 14 11:31:24 rcmd.sp4n01@SP4EN0
Apr 15 11:31:24 May 14 11:31:24 rcmd.sp4n08@SP4EN0
Apr 15 11:31:24 May 14 11:31:24 rcmd.sp4n07@SP4EN0
Apr 15 11:31:24 May 14 11:31:24 rcmd.sp4n05@SP4EN0
Verifying Your SP
73
Check the column labeled Expires. If you need to update your tickets, execute the command kinit followed by the principal, as in this example:
# kinit root.admin
Within the Kerberos subsystem, you should verify that the following daemons are running and that the following files exist, with the correct data and permissions.
3.3.1 Kerberos and kadmind Daemons
These daemons should be running in the primary authentication server
(usually the CWS). The kerberos daemon provides ticket-granting tickets to clients so that they can establish communication with the server.The kadmind daemon manages the administrative tools of the kerberos database.
To verify that these daemons are running in the CWS, execute:
# ps -fea | grep kerb root 12688 4906 0 Apr 02 root 13724 4906 0 Apr 02
0:00 /usr/lpp/ssp/kerberos/etc/kadmind -n
0:00 /usr/lpp/ssp/kerberos/etc/kerberos
You can also check that these daemons are in the /etc/inittab file so they will be started automatically in each reboot.
3.3.2 Kerberos Files
To do a more in-depth verification, you can check the eight files related to kerberos that must exist in your CWS and nodes (except for the.k file, which will be only in your CWS).
/.k
This file contains the master key of the kerberos database. The kadmind and the utility commands read the key from this file instead of prompting for the master password.
The permissions are: -rw------ , owner root and group system.
$HOME/.klogin
This file specifies a list of the remote principals that are authorized to invoke commands on the local user account. For example, the .klogin file of the user root file contains the principals who are authorized to invoke processes as the
74
RS/6000 SP Software Maintenance
root user with the kerberos remote commands (rsh and rcp).The .klogin file of root is distributed to the nodes during installation or customization.
The permissions are: -rw-r--r-- owner root and group system.
/tmp/tkt<uid>
This file contains the tickets owned by a client of the authentication database.
This file is continuously created and destroyed using the kinit and kdestroy commands.
To verify that you have correct tickets, you can reinitialize the file by executing the kinit command. You should be prompted for the password of the principal and if it is correct, you should return to the AIX prompt, as follows:
# kinit root.admin
Kerberos Initialization for "root.admin"
Password:
The permissions are: -rw------ , owner <uid> and group <group of uid>.
/etc/krb-srvtab
This file contains the names and private keys of the local instances for all the services protected by Kerberos. It is important to verify that the key versions of the nodes match the ones specified in the CWS for the same services. To check this, execute:
# klist -srvtab
Verifying Your SP
75
In the CWS, you will have a response similar to:
Server key file: /etc/krb-srvtab
Service Instance Realm Key Version
-----------------------------------------------------rcmd sp4en0 SP4EN0 1 hardmon sp4en0 SP4EN0 1
In the nodes, you will have a response similar to:
Server key file: /etc/krb-srvtab
Service Instance Realm Key Version
-----------------------------------------------------rcmd sp4css10 SP4EN0 1 rcmd sp4n10 SP4EN0 1
Look for the column labeled Key Version. If you find that the versions are different between the CWS and the nodes for the same service, for example rcmd, the
/usr/lpp/ssp/kerberos/etc/ext_srvtab command can be used to create new server key files for each node.
The permissions are: -r-------- owner root and group system.
/etc/krb.conf
The SP authentication configuration file defines the local realm and the location of authentication servers for known realms.
# cat /etc/krb.conf
SP4EN0
SP4EN0 sp4en0 admin server
The permissions are: -rw-r--r-- owner root and group system.
/etc/krb.realms
This file contains an association of hostnames and authentication realms to specify the services provided by the host.
76
RS/6000 SP Software Maintenance
# cat /etc/krb.realms
sp4cw0.msc.itso.ibm.com SP4EN0 sp4n01.msc.itso.ibm.com SP4EN0 sp4sw01.msc.itso.ibm.com SP4EN0 sp4n05.msc.itso.ibm.com SP4EN0 sp4sw05.msc.itso.ibm.com SP4EN0 sp4n06.msc.itso.ibm.com SP4EN0 sp4sw06.msc.itso.ibm.com SP4EN0 sp4n07.msc.itso.ibm.com SP4EN0 sp4sw07.msc.itso.ibm.com SP4EN0 sp4n08.msc.itso.ibm.com SP4EN0
The permissions are: -rw-r--r-- owner root and group system.
/var/adm/SPlogs/kerberos
You must have the following two files in this directory:
• kerberos.log
The permissions are: -rw-r--r-- owner root and group system.
• admin_server.syslog
The permissions are: -rw-r--r-- owner root and group system.
These files contain messages about the behavior of the kerberos and kadmind daemons, respectively. You should check them to see if you have any error messages.
That is all you should verify with Kerberos.
3.4 General Environment
There are variables and files that are set when you install your SP, which are used to manage the system. Those variables include the characteristics of your nodes and CWS such as IP addresses, hostname, versions of AIX and
PSSP installed, and others. You can quickly verify that these variables fit your expectations and that you have a consistent environment.
One command that is very helpful to do such verification is splstdata
, with which you can display all the information of your environment that is taken from the SDR.
The main flags you can use for verification are:
-a
Displays the following SDR LAN data only for nodes in the current system partition:
Verifying Your SP
77
node_number adapter_type netaddr netmask host_name type rate
As shown in Figure 14, you will have all the IP addresses defined in the SDR
for your SP, so do not worry if you do not see addresses for all the adapters that you have in your nodes. The basic ones that you should see will be the adapters for the service LAN and the switch adapters (if you have a switch, of course).
Check that the IP addresses and the netmasks are effectively correct.
List LAN Database Information node# adapt netaddr netmask hostname type t/r enet_ duplex
-----------------------------------------------------------------------------------
1 css0
5 css0
192.168.14.1
192.168.14.5
255.255.255.0
255.255.255.0
sp4sw01 sp4sw05
NA NA NA
NA NA NA
NA
NA
""
""
6 css0
7 css0
8 css0
9 css0
192.168.14.6
192.168.14.7
192.168.14.8
192.168.14.9
255.255.255.0
255.255.255.0
255.255.255.0
255.255.255.0
sp4sw06 sp4sw07 sp4sw08 sp4sw09
NA NA NA
NA NA NA
NA NA NA
NA NA NA
NA
NA
NA
NA
""
""
""
""
1
5
6
7
10 css0
11 css0
13 css0
15 css0 en0 en0 en0 en0
192.168.14.10
192.168.14.11
192.168.14.13
192.168.14.15
192.168.4.1
192.168.4.5
192.168.4.6
192.168.4.7
255.255.255.0
255.255.255.0
255.255.255.0
255.255.255.0
255.255.255.0
255.255.255.0
255.255.255.0
255.255.255.0
sp4sw10 sp4sw11 sp4sw13 sp4sw15 sp4n01 sp4n05 sp4n06 sp4n07
NA NA NA
NA NA NA
NA NA NA
NA NA NA bnc NA 10 bnc NA 10 bnc NA 10 bnc NA 10
NA
NA
NA
NA half half half half
""
""
""
""
""
""
""
""
8
9
10
11
13
15 en0 en0 en0 en0 en0 en0
192.168.4.8
192.168.4.9
192.168.4.10
192.168.4.11
192.168.4.13
192.168.4.15
255.255.255.0
sp4n08 bnc NA 10 half
255.255.255.0
sp4n09 bnc NA 10 half
255.255.255.0
sp4n10 bnc NA 10 half
255.255.255.0
sp4n11 bnc NA 10 half
255.255.255.0
sp4n13 bnc NA 10 half
255.255.255.0
sp4n15 bnc NA 10 half
""
""
""
""
""
""
Figure 14. Output of the splstdata -a Command
-b
Displays the following SDR boot/install data: node_number host_name
78
RS/6000 SP Software Maintenance
hdw.enet.addr
boot_server bootp_response install_disk last_inst_image last_inst_time next_inst_image lppsource_name pssp_version
This is one of the most helpful commands, as you can see in the following screen output. It gives you almost all the information you need to know to do maintenance tasks such as installation, configuration and upgrading of your nodes. The boot_server under the column labeled srvr is the installation server for that node: specifically, in this example, the installation server for all the nodes is node number zero (that means the
CWS). If you have other installation servers, check that they are correct in the output of this command. You can also verify that the lppsource_name corresponds to the directory that contains the LPPS of the version of AIX that you want installed in that node, and that the version of PSSP listed is right. This verification is absolutely necessary before and after you do an installation or migration of any node.
Verifying Your SP
79
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
1 sp4n01 02608C2E86CA 0 default Fri_Apr__9_18:30:38 install default hdisk0 aix432
5 sp4n05
PSSP-3.1
10005AFA0518 rootvg
0 install bos.obj.ssp.432
hdisk0,hdisk1 default
6 sp4n06
7 sp4n07 default
PSSP-3.1
PSSP-3.1
initial alt_rootvg
10005AFA17E3 0 default Fri_Apr__9_18:45:17
10005AFA1721 rootvg
0 diag disk default default hdisk0 aix432 hdisk0 aix432
8 sp4n08
9 sp4n09 default Fri_Apr__9_18:51:19
PSSP-3.1
rootvg
10005AFA07DF 0 default Fri_Apr__9_18:53:33
PSSP-3.1
0004AC4947E9 rootvg
0 disk disk default default hdisk0 aix432 hdisk0 aix432
10 sp4n10
11 sp4n11 default Fri_Apr__9_18:26:04
PSSP-3.1
rootvg
0004AC494B40 0 default Fri_Apr__9_18:31:45
PSSP-3.1
02608C2E7643 rootvg
0 disk disk default default hdisk0 aix432 hdisk0 aix432
13 sp4n13
15 sp4n15 default Fri_Apr__9_18:43:10
PSSP-3.1
rootvg
02608C2E7C1E 0 default Fri_Apr__9_18:46:48
PSSP-3.1
02608C2E78C9 rootvg
0 default Sat_Apr_10_11:10:57
PSSP-3.1
rootvg disk disk default default hdisk0 aix432 hdisk0 aix432
A column that is specially important is the one labeled as response; it determines the answer that the CWS will have when you do a netboot of a node. If you are in your normal production environment, this field should be in status disk. If you find something different here, you may have a problem.
-e
Displays SP object attributes and their values from the SDR.
There are some special fields you should pay attention to, such as the name and address of the CWS, the install_image, the code_version and the cw_lppsource_name.
The following screen output shows a typical response to this command.
80
RS/6000 SP Software Maintenance
# splstdata -e
List Site Environment Database Information attribute value
----------------------------------------------------------------------------control_workstation sp4en0 cw_ipaddrs install_image remove_image primary_node
9.12.0.4:192.168.4.140: bos.obj.ssp.432
false
1 ntp_config ntp_server ntp_version amd_config print_config print_id usermgmt_config passwd_file consensus
""
3 false false
"" true
/etc/passwd passwd_file_loc homedir_server homedir_path filecoll_config supman_uid supfilesrv_port spacct_enable spacct_actnode_thresh spacct_excluse_enable acct_master cw_has_usr_clients code_version layout_dir authent_server backup_cw ipaddrs_bucw active_cw sec_master cds_server cell_name cw_lppsource_name cw_dcehostname sp4en0 sp4en0
/home/sp4en0 true
102
8431 true
80 false
0 false
PSSP-3.1
"" ssp
""
""
""
""
""
"" aix432
""
The install_image value is the default image that gets install. The code_version variable is default PSSP version which will be installed and the cw_lppsource_name is the level of NIM to install on the CWS.
There are other variables that you can also check, such as amd_config, which tells you if you have amd configured or not, and the
usermgmt_config variable, which enables the user management option.
-n
Displays the following SDR node data: node_number frame_number
Verifying Your SP
81
slot_number slots_used initial_hostname reliable_hostname default_route processor_type processors_installed description
List Node Configuration Information node# frame# slot# slots initial_hostname reliable_hostname dcehostname default_route processor_type processors_installed description
------------------------------------------------------------------------------
1 1 1 4 sp4n01 sp4n01.msc.itso.
""
5
192.168.4.140
1 5 1 sp4n05
MP 6 112_MHz_SMP_High sp4n05.msc.itso.
""
UP
6
7
192.168.4.140
1 6
192.168.4.140
1 7
1
1 sp4n06 sp4n07
UP
1 66_Mhz_PWR2_Thin sp4n06.msc.itso.
""
1 66_Mhz_PWR2_Thin sp4n07.msc.itso.
""
UP
8
9
10
11
192.168.4.140
1 8
192.168.4.140
1 9
192.168.4.140
1 10
192.168.4.140
1 11
1
1
1
2 sp4n08 sp4n09 sp4n10 sp4n11
UP
MP
MP
1 66_Mhz_PWR2_Thin sp4n08.msc.itso.
""
1 66_Mhz_PWR2_Thin sp4n09.msc.itso.
""
4 332_MHz_SMP_Thin sp4n10.msc.itso.
""
4 332_MHz_SMP_Thin sp4n11.msc.itso.
""
13
15
192.168.4.140
1 13
192.168.4.140
1 15
192.168.4.140
2
2 sp4n13 sp4n15
UP
UP
UP
1 66_MHz_PWR2_Wide sp4n13.msc.itso.
""
1 66_MHz_PWR2_Wide sp4n15.msc.itso.
""
1 66_MHz_PWR2_Wide
The screen output shows information about all your nodes. Check that the hostnames are well-defined and that the default route is the one that you want to be (in a normal installation, you will see the IP address of the CWS as the default route).
There is another command that will help you to determine the status of your nodes in a specific moment in time, that is the spmon command, and the most complete report can be obtained using the
-d and
-G flags. In the output from this command you can verify that you have connection with the frame (or
frames) and if you do, you can see the information listed in Figure 15 on page
82
RS/6000 SP Software Maintenance
The fields that you should check are the following:
Verify that you can see all the frames and all the nodes in your SP. Check the column labeled Power to determine whether a particular node is listed as power off or on (remember that this power refers only to the logical power on, not the physical). If you have a node with this field set to no, you do not know if the node is logically powered off or physically powered off.
Verify that your Host Responds column is displaying yes for all nodes. If it is not, you may have a problem with your Ethernet TCP/IP connection; try to ping the node to see if it responds. If it does respond, problems within the
High Available Infrastructure layer (specially hats) could be the cause of the missing host response. In this case refer to RS/6000 SP: Problem
Determination Guide, SG24-4778.
Do the same check with the Switch Responds column. If you find a no there, you probably need to initialize your switch by executing the
Estart command;
Refer to 10.2, “Initialize the Switch” on page 228 for details.
The key switch works the same as on stand-alone RS/6000 workstations; it has three positions: normal, secure, and service. Verify that the key is in the position that you require. If you want to change the key to a different mode, you can execute the command spmon -k followed by the mode to which you want to change. For example, to change the key of node number 5 to service mode you execute:
# spmon -key service node05
The column Env Fail shows the status of your environment; this should
always remain in value no
. If this flag shows value yes
, then you have a problem that must be resolved urgently because it indicates a failure in the system’s physical environment. For example, there could be a excessive temperature in the node because a fan is not working, or because the air conditioner is not working well.
The column of Front Panel LCD/LEDs will accomplish the same functions as in a stand-alone machine.
PSSP 2.4 introduced a new column called LCD/LED is Flashing that indicates if the displayed values are flashing (most times meaning hardware problems) or that the values are continuously changing.
Verifying Your SP
83
# spmon -d -G
1.
Checking server process
Process 13680 has accumulated 5 minutes and 45 seconds.
Check ok
2.
Opening connection to server
Connection opened
Check ok
3.
Querying frame(s)
1 frame(s)
Check ok
4.
Checking frames
Frame
Controller Slot 17 Switch
Responds Switch Power
Switch
Clocking
Power supplies
A B C D
----------------------------------------------------------------
1 yes yes on 0 on on on on
5.
Checking nodes
--------------------------------- Frame 1 -------------------------------------
Frame Node Node Host/Switch Key Env Front Panel LCD/LED is
Slot Number Type Power Responds Switch Fail LCD/LED Flashing
-------------------------------------------------------------------------------
1
5
1
5 high thin on on yes yes yes yes normal normal no no
LCDs are blank
LEDs are blank no no
6
7
8
6
7
8 thin thin thin on on on yes yes yes yes yes yes normal normal normal no no no
LEDs are blank
LEDs are blank
LEDs are blank no no no
Figure 15. Output of the spmon -d -G Command
You should also verify that you have enough free space in your file systems, like / - and /var-filesystems. In particular, the /var filesystem should be checked because most of the system logfiles are written to this filesystem.
The services of high availability can also be verified; look into
/var/ha/run/hats.<partition name> to search for a file called core or core.####.
If you find one, that means that you had a problem with the topology service daemon and you can verify if the daemons are running by executing the command:
# lssrc -g hats
Alternatively, if you want detailed output you can list the information of the subsystem hats.<partition name
> with the -l flag and it will look similar to the following:
84
RS/6000 SP Software Maintenance
# lssrc -l -s hats.sp4en0
Subsystem Group PID Status hats.sp4en0
hats 20902 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
SPether
SPether
[ 0] 11 11 S 192.168.4.140
192.168.4.140
[ 0] 0x470e8685 0x470f7084
HB Interval = 1 secs. Sensitivity = 4 missed beats
2 locally connected Clients with PIDs: haemd( 15484) hagsd( 15234)
Configuration Instance = 923696593
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
CWS = 192.168.4.140
This command produces helpful information; for example, the State of the subsystem should be S (Stable). You can also verify that the number of nodes defined in the group of topology servers is equal to the number of Members active at a given moment; this number is the sum of the total nodes, plus your
CWS, minus the nodes that are installed with PSSP version 1.2 or 2.1 that do not work with hats.
For more information about hats, refer to RS/6000 SP High Availability
Infrastructure, SG24-4838.
3.5 System Data Repository (SDR)
As with the ODM in AIX, the SDR contains all the configuration information for the SP. It is located in the CWS and the nodes access it through the service network. The consistency of this database is needed to have your system working properly.
While there are commands that allow you to look at, modify, and delete your
SDR, the recommendation is to not modify the SDR directly. Instead use the commands provided to administer the SP. However, if you find that you do
need to modify the SDR, refer to 7.1.2, “Backing Up the SDR” on page 143 to
learn how to back up the SDR before you do any modification. The SDR is located in the /spdata/sys1 directory and its subdirectory structure is shown
Verifying Your SP
85
S D R a r c h i v e s d e f s p a r t i t i o n s s y s t e m
V S D _ T a b l e
N o d e
S w i t c h
F r a m e
.. .
c la s e s f ile s lo c k s
Figure 16. SDR Subdirectory Structure
The archives subdirectory could either be empty (if you have not taken any backup of the SDR), or it could contain the files of the backups that you have taken. In the defs subdirectory, you should have the header files for all the object classes. The partitions subdirectory should have at least one directory named as the IP address of the CWS, as well as other additional subdirectories for each additional partition that you have defined. The system subdirectory contains the classes and files which are global to the system.
To do a fast check of the SDR, issue the following command:
# SDR_test
SDR_test: Start SDR commandline verification test
SDR_test: Verification succeeded
If you have a different response, you have a problem with your SDR. In that case, do the following:
Check that the SDR daemons are running by typing:
# ps -ef | grep sdr
The response should be similar to: root 2974 12024 1 16:46:02 pts/5 0:00 grep sdr root 10850 7228 1 20:43:05 0:31 /usr/lpp/ssp/bin/sdrd 192.168.4.140
You will see one SDR daemon for each partition that you have in your system.
You can also check this by executing the following command:
86
RS/6000 SP Software Maintenance
#lssrc -g sdr
The output of the command in this case is:
Subsystem sdr.sp4en0
Group sdr
PID Status
10850 active
Verify that the name of the subsystem is associated with the name of the partition.
You can also check if there is an entry in the inittab as follows: sdrd:2:once:/usr/bin/startsrc -g sdr
However, the normal (and easiest) procedure if you have a problem with the
SDR is to stop the subsystem and restart it; you can do this using the stopsrc and startsrc commands.
Verifying Your SP
87
88
RS/6000 SP Software Maintenance
Chapter 4. Alternate and Mirrored rootvg
In this chapter we explain the new concepts of alternate and mirrored root volume groups which was introduced with PSSP 3.1. We will also look at the different boot modes which are required for doing software maintenance.
4.1 Alternate Boot System Image Function
There is a new feature in PSSP 3.1 that allows you to define, install and manage different, absolutely independent rootvg volume groups for an SP node or the CWS. However, only one rootvg volume group can be active at any time, and no changes are propagated from one rootvg to the other. This function is specially useful for fast migrations and testing of software before taking it into production.
AIX 4.2.1
PSSP 2.4
hdisk0
AIX 4.3.2
PSSP 3.1
hdisk1
You can have different versions of software installed in different physical volumes of the same machine, whether it is the CWS or a node, but only one can be active at a time.
CWS or Node
Figure 17. Alternate rootvg
In order to support this new functionality a new class was added to the SDR:
Volume_Group class.
The attributes of this new class are the following:
• node_number: The node number for Volume_Group.
• vg_name: The customer-supplied volume group name.
© Copyright IBM Corp. 1999
89
Default: rootvg.
• pv_list: A list of physical volumes.
Default: hdisk0.
• rvg: Specifies whether the volume group is a root volume group.
Valid values: true or false.
Default: true.
• quorum: Specifies whether the quorum is set to on for this volume group.
Valid values: true or false.
Default: true.
• copies: Specifies the number of copies for the volume group.
Valid values: 1or 2 or 3.
Default: 1.
• mapping: Specifies whether mapping is set to on.
Valid values: true or false.
Default: false.
• install_image: The name of the mksysb install image to use for next install.
Default: default.
• code_version: PSSP code version to use for next install.
Default: the one defined in the environment.
• lppsource_name: Name of the lppsource resource to use for next install.
Default: default.
• boot_server: The node_number of the boot/install server.
Default: Default depends upon the node location.
• last_install_image: A string specifying the name of the last image installed on the node.
Default: Initial.
• last_bootdisk: A string specifying the logical device name of the last volume from which the node booted.
Default: Initial.
To manage this new class, five new commands were added to the SP system.
These commands are:
90
RS/6000 SP Software Maintenance
• spmkvgobj
• spchvgobj
• sprmvgobj
• spmirrorvg
• spunmirrorvg
4.1.1 Prerequisites
To enable an alternate rootvg, you must have the following hardware:
• At least two hard disks (one for each rootvg volume group)
• A virtual battery on the SP nodes
The SP virtual battery is a power source that keeps NVRAM powered and the time of day clock running while the SP frame is plugged into a power outlet with power applied, or until a node is removed from the SP frame. Power is supplied by the node supervisor card in each node. This card provides line power even when the node is powered off.
On wide SP nodes, a virtual battery is installed from the factory. On thin nodes, a no charge Engineering Charge Action (ECA) can be ordered by your
IBM representative; order ECA number ECA007. On the CWS, a battery is installed and no virtual battery ECA is required.
4.1.2 Defining an Alternate rootvg
In the following example, we have a thin node in slot 11 with two internal disks. We assume that AIX4.2.1 and PSSP 2.4 are already installed on the first physical disk (hdisk0). We now want to create an alternate rootvg on the second internal disk (hdisk1). AIX 4.3.2 and PSSP 3.1 will be installed on this disk.
We already copied the AIX 4.3.2 lppsources in the directory
/spdata/sys1/install/aix432/lppsource, and the PSSP code in directory
/spdata/sys1/install/pssplpp/PSSP-3.1.
In order to make a clear relationship between the installation device and the physical device, we recommend that you use the physical location code
(00-00-0S-1,0) in the Physical Volume List field of the next input mask instead of using logical device names, like hdisk0. AIX assigns location device names dynamically to physical device names. Therefore it is not always obvious which physical device belongs to the logical device hdisk0.
Alternate and Mirrored rootvg
91
To get the location code use the command:
# lsdev -Cc disk
The next screen shows a command output from one of our nodes.
# lsdev -Cc disk hdisk0 Available 00-00-0S-0,0 2.0 GB SCSI Disk Drive hdisk1 Available 00-00-0S-1,0 2.0 GB SCSI Disk Drive
As you can see, we have two internal scsi disk devices.
We want to define a alternate rootvg, rootvg_432. We can use the fast path
SMIT command:
# smitty creatvg_dialog
Create Volume Group Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Start Frame
Start Slot
Node Count
OR
Node List
Volume Group Name
Physical Volume List
Number of Copies of Volume Group
Boot/Install Server Node
Network Install Image Name
LPP Source Name
PSSP Code Version
Set Quorum on the Node
[Entry Fields]
[1]
[11]
[1]
[]
[rootvg_432]
[00-00-0S-1,0]
1
[]
[]
[aix432]
PSSP-3.1
+
#
+
+
#
#
#
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
The appropriate command is:
# spmkvgobj -r root_432 -h ’00-00-0S-1,0’ -i bos.obj.ssp.432 \
-v aix432 -p PSSP-3.1 -l 11
92
RS/6000 SP Software Maintenance
The following screen output shows that the alternate rootvg root_432 did not exist and was therefore created.
COMMAND STATUS
Command: OK stdout: yes stderr: no
Before command completion, additional instructions may appear below.
spmkvgobj: Volume_Group object for node 11, name root_432 not found, adding n ew Volume_Group object.
spmkvgobj: The total number of Volume_Group objects successfully added is 1.
spmkvgobj: The total number of rejected Volume_Group additions is 0.
F1=Help
Esc+8=Image n=Find Next
F2=Refresh
Esc+9=Shell
F3=Cancel
Esc+0=Exit
Esc+6=Command
/=Find
4.1.3 Installing an Alternate rootvg
If we want to install this alternate rootvg we need to set the bootp_response attribute for the node to install. Do this with the following steps:
1. Use the command: smitty server_dialog
Fill all the input fields as shown in the screen. Also set the Run Setup server option to no, for the time being, so that you can check the SDR.
Alternate and Mirrored rootvg
93
Boot/Install Server Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Start Frame
Start Slot
Node Count
OR
Node List
Response from Server to bootp Request
Volume Group Name
Run setup_server?
[Entry Fields]
[1]
[11]
[1]
[] install
[rootvg_432] no
+
+
#
#
#
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
The appropriate command is:
# spbootins -l 11 -r install -c rootvg_432
2. Check the SDR by using the command splstdata -b
; this should reflect the new image.
# splstdata -b -l 11
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
11 sp4n11 default
PSSP-3.1
02608C2E7643 0 initial rootvg_432 install default
00-00-0S-1,0 aix432
3. Now you can either run setup_server to initiate an install operation, or you can do it directly by using the wrapper-commands. Execute the following commands to prepare for the installation of node sp4n11:
# setup_server or
# Efence -G -autojoin sp4n11
94
RS/6000 SP Software Maintenance
# create_krb_files
# mkconfig
# mkinstall
# export_clients
# allnimres -l 11
After the node comes up, the lspv
command in Figure 18 on page 95 shows
that the rootvg is located on hdisk1 and that hdisk0 is unassigned; even if our first root volume group is installed on this disk.
# lspv hdisk0 hdisk1 hdisk2 hdisk3
0000097330a4837d
00000973967bdfa2
000003411ee207da
00003550d567d4e3
None rootvg
None
None
Figure 18. lspv Command Output
4.1.4 Switching between Alternate System Images
Now that we have at least two root volume groups, how can we check this and how can we switch between these two system images?
To verify that we have more than one root volume group definition for node 11 we run the splstdata command:
# splstdata -v -l 11
List Volume Group Information node# name boot_server quorum copies code_version lppsource_name last_install_image last_install_time last_bootdisk pv_list
-------------------------------------------------------------------------------
11 rootvg 0 true 1 PSSP-2.4 aix421
Fri_Apr__9_18:43:10_EDT_1999 hdisk0 default hdisk0
11 rootvg_432 default
0 true initial
1 PSSP-3.1 aix432 hdisk0
As you can see, the new
-v option for the splstdata command shows all defined volume groups for a node.
The next question is, how do we know which root volume group will be used for the next boot? We use splstdata -b to display this information:
Alternate and Mirrored rootvg
95
# splstdata -b -l 11
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
11 sp4n11 02608C2E7643 0 disk hdisk0 default Fri_Apr__9_18:43:10
PSSP-2.4
rootvg default aix421
As we can see, the current active system is the AIX 4.2.1 version on hdisk0.
If we want to switch to an alternate root volume group we will use the spbootins command:
# spbootins -c rootvg_432 -l 11 -s no
The rootvg_432 becomes the current root volume group for subsequent installation and customization. Let us check, using the splstdata -b command, that our changes are carried out.
# splstdata -b -l 11
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
11 sp4n11 default
PSSP-3.1
02608C2E7643 0 initial rootvg_432 disk default hdisk1 aix432
There is one final step missing. We have to modify the bootlist of node 11 so that the next boot will be done using hdisk1 (rootvg_432) instead of hdisk0
(rootvg). We do this by using the new PSSP command spbootlist
.
Like the normal AIX bootlist command, the spbootlist command will look at the vg_name attribute in the Volume_Group object, determine which physical volume(s) are in the volume group, and set the bootlist to them. Let us look at an example:
# spbootlist -l 11
96
RS/6000 SP Software Maintenance
This command takes the information stored in the SDR and issues a remote command on the selected node. Now the next boot will bring up the alternate root volume group rootvg_432 on node 11.
4.1.5 Alternate rootvg Highlights
No matter what name you give to the volume group object in the SDR on the
CWS, the name of the active volume group of the node will be rootvg.
Only one rootvg can be active at any time.
Important
When you boot your node from one rootvg on one specific physical volume, you will see that the other physical volumes assigned to an alternate rootvg
appear as unassigned (see Figure 18 on page 95). Do not define anything
on those physical volumes or you will destroy your alternate rootvg.
The normal bootlist within your nodes will be modified by using the spbootlist command. You should also think about modifying your service bootlist. Why?
Imagine that you are trying to boot your system in service mode to do some diagnostic work and your system is not coming up. If you have defined the second physical disk (with the alternate rootvg installed on it) as an alternate boot device, your system will try to boot from this device. Your node probably will boot from this device. In this case your node is still accessible.
4.2 Rootvg Mirroring
PSSP 3.1 supports mirroring of the root volume group. Rootvg Mirroring helps you achieve AIX High Availability. You can have multiple copies of your operating system spread over multiple physical volumes. If one physical volume fails, you still have other valid copies and your system remains operational.
You can configure two or three copies of each logical volume of the operating system (the original plus one or two copies). The only logical volume that cannot be mirrored is the dump logical device.
When we talk about mirroring we need to think about disk quorum. When quorum is enabled, a voting scheme will be used to determine if the number of physical volumes that are up is enough to maintain a quorum. If quorum is lost, the entire volume group will be taken offline to preserve data integrity. If quorum is disabled, the volume group will remain on line as long as there is at
Alternate and Mirrored rootvg
97
least one valid physical volume. We chose disk quorum to be enabled, which is generally safer.
4.2.1 Mirroring Example
We will install node 10 with two copies of the operating system.
Enter the information about the node so that your volume group is defined with two copies. You can use the spchvgobj command to modify the information about your node. You can execute:
# spchvgobj -r rootvg -h hdisk0,hdisk1 -c 2 -l 10
If you use SMIT, type:
# smitty changevg_dialog
Change Volume Group Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Start Frame
Start Slot
Node Count
OR
Node List
Volume Group Name
Physical Volume List
Number of Copies of Volume Group
Set Quorum on the Node
Boot/Install Server Node
Network Install Image Name
LPP Source Name
PSSP Code Version
[Entry Fields]
[1]
[10]
[1]
[]
[rootvg]
[hdisk0,hdisk1]
2 true
[]
[]
[]
PSSP-3.1
+
+
#
+
#
#
#
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
With these instructions, we specify that the rootvg volume group will be installed on hdisk0 and hdisk1. AIX takes care that each copy goes to a separate physical volume. Recall that the prerequisite to mirroring is to have at least two physical volumes in the volume group of interest. You can define mirroring at installation time or dynamically on a running node.
98
RS/6000 SP Software Maintenance
Now we can proceed with our normal installation. Afterwards, you can verify that the logical volumes are mirrored into the rootvg by executing the following command:
# lsvg -l rootvg
Check that the number of physical partitions dedicated to a logical volume is twice the number of logical partitions for a mirrored rootvg with two instances.
The only exception is the sysdump logical volume, which cannot be mirrored.
You can initiate the mirroring of the root volume group on a running system by issuing the following command:
# spmirrorvg -l 10 -f
This assumes that you defined the mirroring of the rootvg already, as described before. The command uses information found in the Volume_Group object to initiate the mirroring.
4.2.2 Unmirroring a Root Volume Group
If you decide that you do not want the rootvg mirrored anymore, you can modify the definition in the SDR and customize the node. The following commands are required to unmirror node 10.
1. Modify the information for the volume group:
# spchvgobj -r rootvg -c 1 -l 10
2. Change the bootp response for that node and run setup_server in the
CWS:
# spbootins -r customize -l 10
3. Reboot the node:
# cshutdown -r -N 10
You could also remove the mirroring by using the normal AIX commands on the node or through SMIT, and modify the information in the SDR afterwards.
Otherwise, the next time you customize that node, the system will try to recreate the mirroring.
Alternate and Mirrored rootvg
99
4.3 Boot Modes
The SP nodes can be booted up to several system modes. You can choose among five different modes:
• disk
• install
• migrate
• customize
• diag
You can select the mode that you want through the SMIT panels or by using the spbootins command. This mode will determine the actions taken when the setup_server script is executed. It will influence the next boot procedure.
If you want to check the current boot mode settings for a node, use the splstdata -b command and look for the column labeled “response”. The following is an example of the output of this command:
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
1 sp4n01 02608C2E86CA 0 default Fri_Apr__9_18:30:38 disk default hdisk0 aix432
5 sp4n05
PSSP-3.1
default
10005AFA0518 rootvg
0 initial install bos.obj.ssp.432
hdisk0,hdisk1 default
To modify the bootp_response attribute of a node and save this in the SDR, type the following command:
# spbootins -r <mode> -l <node list>
When you run this command, the default option is to also run the setup_server script.
The following sections explain briefly the procedures which are executed when you change the boot mode.
100
RS/6000 SP Software Maintenance
4.3.1 Disk Mode
We can say that this is the "normal" mode for your SP system, meaning that you should have your nodes in this mode while they are in the production environment.
In this mode, when you run setup_server
, all the resource allocations are removed for this NIM client. The file /tftpboot/<hostname>.info is also removed, as well as the entry in the /etc/bootptab for that client. All this means that the bootp request will be ignored and that your node will boot from the local disk.
You can reboot the node as if it were a normal standalone machine. It will boot from the local disk and start the operating system (if the key is in normal) or the diagnostics menu (if the key is in service).
4.3.2 Install Mode
You use this mode when you want to completely install a node. When setup_server is run, the following procedures are executed:
• Allocate SPOT
• Allocate lpp_source
• Allocate bosinst_data
• Allocate psspscript
• Allocate mksysb
• Update /etc/inetd.conf
• Create <reliable_hostname>.install_info
• Create <reliable_hostname>.config_info
• Create the Kerberos file <reliable_hostname>-new_srvtab
Then, in order to install your node, you must boot from the network.
4.3.3 Migrate Mode
This mode indicates that you want the server to perform a migration on the specified nodes. The following procedures take place when running the server in this mode:
• Allocate SPOT
• Allocate lpp_source
• Allocate bosinst_data
• Allocate psspscript
• Update /etc/inetd.conf
• Create <reliable_hostname>.install_info
• Create <reliable_hostname>.config_info
Alternate and Mirrored rootvg
101
• Create the Kerberos file <reliable_hostname>-new_srvtab
4.3.4 Customize Mode
In this mode the setup_server script deallocates all resources of a given NIM client, removes the NIM client, and recreates it. The files:
• <reliable_hostname>.install_info
• <reliable_hostname>.config_info
• <reliable_hostname>-new-srvtab are created under the /tftpboot directory.
4.3.5 Diag Mode
This mode allows you to bring up your node in diagnostic mode. You have to reboot it.
The setup_server script does the following:
• Allocate SPOT
• Allocate bosinst_data
• Update /etc/inetd.conf
• Create <reliable_hostname>.install_info
• Create <reliable_hostname>.config_info
The difference between booting a node locally with the key in service position and booting from the network in diag mode is that in the first case, you are still booting from your local disk and loading the diagnostic software from that disk, while in the second case, the diagnostic software will be mounted from the CWS to the node.
4.3.6 Maintenance Mode
This mode is useful when you encounter boot problems. You will get a maintenance menu, from which you can initiate several actions. This mode also requires a netboot.
The setup_server script does the following:
• Allocate SPOT
• Allocate lpp_source
• Allocate bosinst_data
• Update /etc/inetd.conf
• Create <reliable_hostname>.install_info
• Create <reliable_hostname>.config_info
102
RS/6000 SP Software Maintenance
Chapter 5. Updating Your System
This chapter is to be used for applying Program Temporary Fixes (PTFs) for
AIX, Parallel System Support Program (PSSP) and other Licensed Program
Products (LPPs) in SP. If you are planning for migration of the AIX version or
PSSP software, refer to Chapter 6, “Migrating PSSP and AIX to Later
Software corrections or enhancements are released in the form of PTFs.
These patches can be ordered through the IBM Software Support Center, or they can be downloaded either from the Web or by using standalone FixDist
servers. For more information on downloading PTFs, refer to Appendix A,
“Downloading PTFs for RS/6000 and SP” on page 243.
When installing PTFs on a production system, you must first take certain precautionary measures. In this chapter we show the steps for successfully installing PTFs on the CWS and the nodes.
Before applying any PTFs, you should review the memos, the so-called
README files. These README files are most times related to filesets and include the latest information, which is sometimes not included in the manuals. They tell you what problems are being fixed and whether any specific prerequisite filesets are required. They also tell you whether a shutdown of the node is required for the fixes to take effect. Generally, it is suggested you install the fixes in apply mode so that if there is any problem, you can reject the filesets that have been applied. In some cases we suggest you commit the fixes, as they may be mandatory. For these reasons, it is imperative that you read the memos before proceeding.
The README files for all PSSP versions and their LPPs are available at: http://www.rs6000.ibm.com/support/sp/sp_secure/readme/
As an example, when you apply the PTF ssp.css.2.4.0.2, you get the following output on the console:
NOTE: IX78007 and IX78629
PTF ssp.css 2.4.0.2 contains a fix to fault_service and fault_service_SP. In order for this fix to take effect, you must reboot the node. If you are not encountering this switch problem, you can wait until your next scheduled reboot before applying this PTF.
© Copyright IBM Corp. 1999
103
5.1 Applying PTFs
Whenever you apply PTFs for PSSP software, you must first install them on the CWS and verify that your SP-related environments are working OK. Refer
to Chapter 3, “Verifying Your SP” on page 67 for instructions on checking the
SP setup. After the CWS is found to be OK, you must next plan to choose one node for applying the PTF and test it for proper operation.
For installing on the nodes, two approaches are documented in the
Installation and Migration Guide. The first approach is to apply the updates on a per node basis using installp
, and the second approach is to reinstall the node using a customized mksysb image with the PTFs installed.
Generally in a production environment, a lot of customization will have been done, and shell scripts will have been written for performing various day-to-day administrative tasks. So the customer would most likely choose to install the PTFs in each node rather than reinstalling the nodes with a new image.
Now let us look at the steps to be followed for installing the PTFs on the CWS and the nodes.
5.2 Applying AIX PTFs in the CWS
The steps for applying the PTFs are as follows:
1. Create a mksysb backup image of the CWS. We recommend that you always take two backups, on two different tapes, and never trust just one tape. Check that the tape is OK by listing the contents of the tape using the command smitty lsmksysb.
For more information on taking a mksysb
backup, refer to 7.1.1, “Mksysb and Savevg” on page 139.
2. Copy the PTFs to the lppsource directory
/spdata/sys1/install/<aix_version>/lppsource.
3. Create a new .toc file by executing the command:
# nim -o check <lpp_source> or
# inutoc /spdata/sys1/install/aix432/lppsource
4. Update the new PTFs to the CWS using SMIT as follows:
# smitty update_all
Input device for directory/software:
/spdata/sys1/install/aix432/lppsource
104
RS/6000 SP Software Maintenance
Update Installed Software to Latest Level (Update All)
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
INPUT device / directory for software
SOFTWARE to update
PREVIEW only? (update operation will NOT occur)
COMMIT software updates?
SAVE replaced files?
AUTOMATICALLY install requisite software?
EXTEND file systems if space needed?
VERIFY install and check file sizes?
DETAILED output?
Process multiple volumes?
[Entry Fields]
/spdata/sys1/install/aix432>
_update_all yes yes no yes
+
+
+
+ yes no no yes
+
+
+
+
First run with the PREVIEW only option set to yes, and check that the prerequisites are available. If it is OK, then continue installing the PTFs by changing the PREVIEW only option to no
.
During the installation, check for any errors being reported. It is also possible to check this by viewing the smit.log file any time after the installation is over.
5. Update the SPOT with the PTFs in the lppsource directory using the following command, as shown in the next screen:
# smitty nim_res_op
Resource name: spot_aix432
.
Operation to perform: update_all
Updating Your System
105
Customize a SPOT
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Resource Name
Fixes (Keywords)
* Source of Install Images
EXPAND file systems if space needed?
Force installp Flags
PREVIEW only? (install operation will NOT occur)
COMMIT software updates?
SAVE replaced files?
AUTOMATICALLY install requisite software?
OVERWRITE same or newer versions?
VERIFY install and check file sizes?
[Entry Fields] spot_aix432 update_all
[lppsource_aix432] yes no no yes no yes no no
+
+
+
6. If the status of the install is OK, then you are through with the update of the AIX PTFs in the CWS. If the status of the install is FAILED, then you should review the output for the cause of the failure and resolve the problem.
+
+
+
+
+
+
5.3 Applying AIX PTFs on the Nodes
Applying PTFs on the nodes can be done by three different methods. Before applying the PTFs, take a mksysb backup of the node. For more information
on taking a mksysb backup, refer to 7.1.1, “Mksysb and Savevg” on page 139.
The three different methods for installing AIX PTFs are as follows:
1. Creating a new image which contains all the PTFs in one node and propagating the image to the nodes.
2. Mounting the lppsource directory from the CWS to the nodes, and installing the PTFs manually using smitty update_all for all nodes.
3. If you have the DSMIT package installed, then you can install using dsmitty update_all to all the nodes in one effort.
The following sections details these methods.
5.3.1 Method 1: Applying AIX PTFs Using mksysb Install Method
To install PTFs using a new image, double check to be sure that you have not done any customization specific to the nodes. If you have done so, you must not use this method of installation.
106
RS/6000 SP Software Maintenance
This method of updating PTFs is suggested only when all the nodes are identical as far as the configuration and LPPs are concerned.
The steps for updating PTFs using this method are:
1. Choose one node for installing the PTFs and testing the image. In the lab, we chose the node sp2n10 for this purpose.
2. Login as root on the node and mount the lppsource directory of the CWS on /mnt using the command:
# mount sp2en0:/spdata/sys1/install/aix432/lppsource /mnt
If you get any errors, check whether the directory is NFS-exported by looking into the file /etc/exports.
3. Update the PTFs using the command:
# smitty update_all
Input device for directory/software:
/mnt.
First run with the PREVIEW only option set to yes, and check that all the prerequisites are available. If it is OK, continue installing the PTFs by changing the PREVIEW only option to no
.
Check for any errors being reported during the installation. You can also check this by viewing the smit.log file any time after the installation is over.
4. Create the mksysb image from this node. It is always advisable to have the image kept in the /spdata/sys1/install/images/<new_image_file>. To create the image of node sp2n10 on the CWS, first export the directory
/spdata/sys1/install/images to node sp2n10. For example, to export
/spdata/sys1/install/images from sp2en0(CWS) to sp2n10 issue the following command from the CWS:
# mknfsexp -d '/spdata/sys1/install/images' -t 'rw' -r 'sp2n10' '-B'
5. Mount the directory locally on node sp2n10 using the command:
# mount sp2en0:/spdata/sys1/install/images /mnt
6. Create the mksysb image using smitty mksysb.
This is shown in the following screen.
Updating Your System
107
Back Up the System
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[MORE...2] previously stored on the selected output medium. This command backs up only rootvg volume group.
* Backup DEVICE or FILE
Create MAP files?
EXCLUDE files?
List files as they are backed up?
Generate new /image.data file?
EXPAND /tmp if needed?
Disable software packing of backup?
Number of BLOCKS to write in a single output
(Leave blank to use a system default)
[Entry Fields]
[/mnt/sp2n10.img] no no no yes no no
[]
7. Unmount the /mnt filesystem on the node and remove the NFS export of the directory /spdata/sys1/install/images from the CWS. If you have the
Switch installed, fence the node using the command
Efence sp2n10
.
Reboot the node for the PTFs to take effect.
8. After the node has been rebooted, unfence the Switch using the command:
# Eunfence sp2n10
Now the image is ready on the CWS for installation on the other nodes.
However, you should first check that all the operations on this node are working without any problems. It is generally advisable to keep this node under observation for a few days before installing the image on the other nodes.
Propagating the image to the node:
To test the image, we will install our new image on node sp2n11. The following steps install node sp2n11 using the new image sp2n10.img which
we created in step 6 of 5.3.1, “Method 1: Applying AIX PTFs Using mksysb
1. For our test installation we define an alternate rootvg as described in
4.1.2, “Defining an Alternate rootvg” on page 91. Execute the command:
# smitty createvg_dialog
Fill in the input fields as shown in the following screen. The only difference here, when compared to the normal setup, is that the Network Install
Image Name will be set to sp2n10.img.
+/
+
+
+
+
+
+
#
108
RS/6000 SP Software Maintenance
Create Volume Group Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Start Frame
Start Slot
Node Count
OR
Node List
Volume Group Name
Physical Volume List
Number of Copies of Volume Group
Boot/Install Server Node
Network Install Image Name
LPP Source Name
PSSP Code Version
Set Quorum on the Node
[Entry Fields]
[1]
[11]
[1]
[]
[test_rootvg]
[00-00-0S-0,0]
1
[]
[sp2n10.img]
[aix432]
PSSP-3.1
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
The following screen output shows that the alternate rootvg test_rootvg did not exist and was therefore created.
COMMAND STATUS
Command: OK stdout: yes stderr: no
Before command completion, additional instructions may appear below.
spmkvgobj: Volume_Group object for node 11, name test_rootvg not found, adding n ew Volume_Group object.
spmkvgobj: The total number of Volume_Group objects successfully added is 1.
spmkvgobj: The total number of rejected Volume_Group additions is 0.
+
#
+
+
#
#
#
F1=Help
Esc+8=Image n=Find Next
F2=Refresh
Esc+9=Shell
F3=Cancel
Esc+0=Exit
Esc+6=Command
/=Find
2. The next step is to set the bootp_response attribute for the node to install.
Use the command:
Updating Your System
109
# smitty server_dialog
3. Fill all the input fields as shown in the screen. Also set the Run Setup server option to no, for the time being, so that you can check the SDR.
Boot/Install Server Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Start Frame
Start Slot
Node Count
OR
Node List
Response from Server to bootp Request
Volume Group Name
Run setup_server?
[1]
[Entry Fields]
[11]
[1]
[] install
[test_rootvg] yes
+
+
#
#
#
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
4. Check the SDR by using the command splstdata -b
; this should reflect the new image.
# splstdata -b -l 11
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
11 sp4n11 02608C2E7643 0 install 00-00-0S-0,0 default
PSSP-3.1
initial test_rootvg sp2n10.img
aix432
5. Now you can either run setup_server to initiate an install operation, or you can do it manually using wrappers only for this node. For doing the manual steps to install the node, execute the following commands to prepare for the installation of node sp2n10:
# Efence sp2n10
# create_krb_files
110
RS/6000 SP Software Maintenance
# mkconfig
# mkinstall
# mknimres -l 0
# export_clients
# allnimres -l 10
When you execute these commands, the screen output will look like the following:
# Efence sp2n10
All nodes successfully fenced.
# create_krb_files create_krb_files: tftpaccess.ctl file and client srvtab files created/updated on server node 0.
# mkconfig
# mkinstall
# mknimres -l 0 mknimres: Copying /usr/lpp/ssp/install/bin/pssp_script to /spdata/sys1/install/p ssp/pssp_script.
mknimres: Copying /usr/lpp/ssp/install/config/bosinst_data_prompt.template to /s pdata/sys1/install/pssp/bosinst_data_prompt.
mknimres: Copying /usr/lpp/ssp/install/config/bosinst_data_migrate.template to / spdata/sys1/install/pssp/bosinst_data_migrate.
mknimres: Successfully created the mksysb resource named mksysb_2 from dir
/spdata/sys1/install/images/sp2n10.img on sp2en0.
# export_clients export_clients: File systems exported to clients from server node 0.
# allnimres -l 10 exportfs: /spdata/sys1/install/images/sp2n10.img: parent-directory (/spdata/sys1
/install/images) already exported exportfs: /spdata/sys1/install/images/sp2n10.img: parent-directory (/spdata/sys1
/install/images) already exported allnimres: Node 10 (sp2n10) prepared for operation: install.
6. Start the netboot and see that the node is getting installed using this new image.
5.3.2 Method 2: Applying AIX PTFs Using SMIT or dsh
This method is used for installing the PTFs on the nodes by using SMIT or dsh commands. For installing on the nodes using SMIT, you login to each node and apply the PTFs. It is also possible to install the PTFs on the nodes from the CWS by using the dsh command.
Updating Your System
111
As previously mentioned, for any of these options, it is better to install the
PTFs in one node and do the testing there before applying them to all the nodes. In the lab, we selected sp2n10 for installing the PTFs and testing the node. Here we discuss both options for installing the PTFs.
Option 1: Using SMIT
1. Login to sp2n10 as root.
2. Mount the lppsource directory of the CWS in sp2n10 using the command:
# mount sp2en0:/spdata/sys1/install/aix432/lppsource /mnt
3. Apply the PTFs using the command:
# smitty update_all
Input device for directory/software:
/spdata/sys1/install/aix432/lppsource
First run with the PREVIEW only option set to yes
, and check that all prerequisites are met. If it is OK, install the PTFs with the PREVIEW only option set to no
.
While the installation is running, check for any errors being reported. You can also check this by viewing the smit.log file any time after the installation is complete.
4. Unmount the directory which you had mounted in step 2 by using the command:
# umount /mnt
While installing the PTFs, if you get any output saying that a reboot is required for the PTFs to take effect, you can do a reboot of the node.
5. Before initiating a reboot, if you have a switch, you will have to fence it using the command:
# Efence sp2n10
6. After the system has been rebooted, unfence the node using the command:
# Eunfence sp2n10
Now it is time to test the node for proper operation. Once all tests are OK, you can install the other nodes. To install the other nodes individually using SMIT, follow the same steps.
Option 2: Using dsh
This method is used for installing the PTFs on the nodes by using the dsh command from the CWS.
112
RS/6000 SP Software Maintenance
1. Select sp2n10 as the working collective member for executing the commands by using dsh:
# dsh -w sp2n10
2. Mount the lppsource directory of the CWS to node sp2n10 by using the command:
# dsh> mount sp2en0:/spdata/sys1/install/aix432/lppsource /mnt
3. Install the PTFs by using the command:
# dsh>/usr/lib/instl/sm_inst installp_cmd -a -d '/mnt' -f '_update_all'
'-pcNgX’
This command with the p option is for preview, and with the c option is for commit. For installing, give the command without the p option. If you also do not want to commit, remove the c option.
4. Unmount the directory that you mounted in step 2 by using the command:
# umount /mnt
While installing the PTFs, if you get any output saying that a reboot is required for the PTFs to take effect, it is better to reboot the node.
5. Before initiating a reboot, if you have a Switch, you will have to fence it by using the command:
# Efence sp2n10
6. After the system has been rebooted, unfence the node by using the command:
# Eunfence sp2n10
For installing to all nodes, use the dsh -a option to have the nodes in the working collection. If you want only selected nodes, use the command dsh -w sp2n01,sp2n05,sp2n06,sp2n07 to install only in these nodes.
5.3.3 Method 3: Applying AIX PTFs Using DSMIT
Distributed System Management Interface Tool (DSMIT) adds functionality to
SMIT by allowing the SMIT interface to build commands for system management and distribute them to other clients on a network. DSMIT does not use the .rhosts file for executing the command on the clients, and therefore it does not create any security exposure. DSMIT security is based on well-established crypto routines and DSMIT-specific (modeled after MIT's
Kerberos) communication protocols. It provides an ongoing secure DSMIT operation, and supports secure modification of the security configuration and updates of passwords and keys.
Updating Your System
113
DSMIT can also be used in an SP environment for doing system administration of the nodes from the CWS. In this section we discuss how to update the PTFs on the nodes from the CWS using DSMIT. DSMIT is a priced
LPP.
DSMIT Clients
sp2n01
DSMIT Clients
sp2n..
DSMIT Clients
sp2n15
SP2 Nodes
Thin
Wide
High
Silver
DSMIT
Environment
SP2
Control Workstation
DSMIT server
Figure 19. DSMIT Configuration in an SP Environment
DSMIT runs in both concurrent and sequential modes. Concurrent mode means that the DSMIT server builds a command and routes it to the clients simultaneously. Sequential mode means that the DSMIT server builds a command and routes it to the clients one machine at a time. After you build a command on the server and press the Enter key, a menu appears asking in which mode you wish to run DSMIT.
Using DSMIT, it is easier to do system administration than by using dsh from the CWS. The main reason for this is because, when using dsh, you have to know the options that are available for executing the commands, which you need not know when you are using DSMIT.
In the lab, we configured the CWS as the DSMIT server and all the nodes as
DSMIT clients.
114
RS/6000 SP Software Maintenance
To start DSMIT with a working collection of two clients (sp2n10 and sp2n11), type:
# dsmitty -w sp2n10,sp2n11
If your credentials have expired or if you are logging on for the first time, it will prompt you for the administrator password.
Distributed System Management
Move cursor to desired item and press Enter.
System Management
Domain Management
F1=Help
Esc+9=Shell
F2=Refresh
Esc+0=Exit
F3=Cancel
Enter=Do
Esc+8=Image
Esc+c=Collective
Select the option System Management to get into the SMIT screen, or
Domain Management to get into DSMIT administration. Once you get into the system management, whatever operation you perform will be executed on clients sp2n10 and sp2n11. If you had selected more than one client, it will ask you whether you want to execute the command in sequence on all the clients, or if you want them to be executed concurrently.
Now let us discuss how to update the AIX PTFs to nodes sp2n10 and sp2n11 using DSMIT.
1. We have to mount the lppsource of the CWS to clients sp2n10 and sp2n11:
# dsmitty -w sp2n10,sp2n11 mknfsmnt
You will get a screen similar to the following:
Updating Your System
115
Common Dialogue for Add a File System for Mounting
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]
* PATHNAME of mount point
* PATHNAME of remote directory
[Entry Fields]
[/mnt]
[/spdata/sys1/install/a>
* HOST where remote directory resides
Mount type NAME
* /etc/filesystems entry will mount the directory on system RESTART.
[sp2en0]
[]
* Use SECURE mount option?
no
* MOUNT now, add entry to /etc/filesystems or both?
now no
+
+
+
/
* MODE for this NFS file system
* ATTEMPT mount in background or foreground
NUMBER of times to attempt mount
Buffer SIZE for read
Buffer SIZE for writes
NFS TIMEOUT. In tenths of a second
NFS version for this NFS file system
Transport protocol to use
Internet port NUMBER for server
* Allow execution of SUID and sgid programs in this file system?
* Allow DEVICE access via this mount?
* Server supports long DEVICE NUMBERS?
* Mount file system soft or hard
Minimum TIME, in seconds, for holding attribute cache after file modification read-only background
[]
[]
[]
[] any any
[] yes yes yes hard
[3]
+
+
#
#
+
+
#
+
#
+
+
+
#
Allow keyboard INTERRUPTS on hard mounts?
Maximum TIME, in seconds, for holding attribute cache after file modification
Minimum TIME, in seconds, for holding attribute cache after directory modification
Maximum TIME, in seconds, for holding attribute cache after directory modification
Minimum & Maximum TIME, in seconds, for holding attribute cache after any modification
The Maximum NUMBER of biod daemons allowed to work on this file system yes
[60]
[30]
[60]
[]
[6]
+
#
#
#
#
#
When you press Enter, it will prompt for sequential or concurrent mode.
Make your choice and press Enter. This mounts the sp2en0:/spdata/sys1/install/aix432/lppsource to /mnt in both clients sp2n10 and sp2n11. You can verify that the mount has happened by entering:
# dsh -w sp2n10,sp2n11 df /mnt
2. Install the PTFs in nodes sp2n10 and sp2n11:
# dsmitty -w sp2n10,sp2n11 update_all
116
RS/6000 SP Software Maintenance
Update Installed Software to Latest Level
(Update All) : INPUT device / directory for software
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
[yes] * Use first value entered as default for the rest of the fields
* sp2n10
* sp2n11
[/mnt]
[]
+
+
+
Input the Entry fields as shown in the preceding screen and press Enter to take you to the next screen.
Common Dialogue for Update Installed Software to Latest Level (Update All)
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* INPUT device / directory for software
* SOFTWARE to update
PREVIEW only? (update operation will NOT occur)
COMMIT software updates?
SAVE replaced files?
AUTOMATICALLY install requisite software?
EXTEND file systems if space needed?
VERIFY install and check file sizes?
DETAILED output?
Process multiple volumes?
[Entry Fields]
/mnt
_update_all no yes no yes yes no no yes
+
+
+
+
+
+
+
+
F1=Help
Esc+5=Reset
Esc+9=Shell
Esc+f=By Field
F2=Refresh
Esc+6=Command
Esc+0=Exit
Esc+c=Collective
F3=Cancel
Esc+7=Edit
Enter=Do
Esc+s=Scheduler
F4=List
Esc+8=Image
Esc+m=By Machine
When you press Enter, it will prompt for sequential or concurrent mode.
Make your choice and press Enter. This command is passed to clients sp2n10 and sp2n11 and the PTFs are installed in both systems. In this way, you can select all systems in one effort in order to install the PTFs in the nodes.
Updating Your System
117
5.4 Applying PSSP PTFs in the CWS
The steps for applying the PSSP PTFs are as follows:
1. Create a mksysb backup image of the CWS. We recommend that you take two backups, on two different tapes, and never trust just one tape. Check that the tape is OK by listing the contents of the tape using the command smitty lsmksysb.
For more information on taking a mksysb backup, refer to
7.1.1, “Mksysb and Savevg” on page 139.
2. Copy the PTFs to an appropriate directory. For example, in the lab we copied the files to the directory /spdata/sys1/install/pssplpp/PSSP-3.1.
3. Create a new .toc file by running the command:
# inutoc /spdata/sys1/install/pssplpp/PSSP-3.1
4. Check the READ THIS FIRST that comes with any updates to the PSSP and .info files for the prerequisites, corequisites, and any precautions that need to be taken for installing these PTFs. Check the filesets in the directory you copied to verify that all the required filesets are available. If
5. Update the new PTFs to the CWS by using:
# smitty update_all
Input device for directory/software:
/spdata/sys1/install/pssplpp/PSSP-3.1
Update Installed Software to Latest Level (Update All)
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
INPUT device / directory for software
SOFTWARE to update
[Entry Fields]
/spdata/sys1/install/pssplpp/PSSP>
_update_all
PREVIEW only? (update operation will NOT occur) yes
COMMIT software updates?
yes
+
+
SAVE replaced files?
AUTOMATICALLY install requisite software?
EXTEND file systems if space needed?
VERIFY install and check file sizes?
DETAILED output?
Process multiple volumes?
no yes yes no no yes
+
+
+
+
+
+
When the update is finished, go through the output messages for any specific problem. Based on these messages, you must take appropriate actions, depending on your environment.
118
RS/6000 SP Software Maintenance
6. Verify the correct operation of all SP and AIX control functions. For more
information, refer to Chapter 3, “Verifying Your SP” on page 67.
5.5 Applying PSSP PTFs to the Nodes
Applying PSSP PTFs to the nodes can be done by using the same three different methods that we used in applying AIX PTFs. Before applying the
PTFs, take a mksysb backup of the node. For more information on taking a
mksysb backup, refer to 7.1.1, “Mksysb and Savevg” on page 139. The three
different methods are:
1. Creating a new image that contains all the PSSP PTFs in one node and propagating the image to the nodes.
2. Mounting the lppsource directory from the CWS to the nodes and installing it manually by using smitty update_all or by using dsh commands from the
CWS for all nodes.
3. If you have a dsmit package installed, then you can install using dsmitty update_all to all nodes in one effort.
5.5.1 Method 1: Applying PSSP PTFs Using mksysb Install Method
The procedure to install PSSP PTFs on the nodes using this method is the
same as that used for installing AIX PTFs in 5.3.1, “Method 1: Applying AIX
PTFs Using mksysb Install Method” on page 106.
For installing the PSSP PTFs, follow the same procedure except for Step 2: you will have to mount the PSSP PTF directory instead of the lppsource directory. The command is:
#mount sp2en0:/spdata/sys1/install/pssplpp/PSSP-3.1 /mnt
5.5.2 Method 2: Applying PSSP PTFs Using SMIT or dsh
This method is to be used for installing PSSP PTFs on the nodes using SMIT or dsh commands. The procedure to install the PSSP PTFs is the same as
that for applying AIX PTFs in the nodes. Refer to 5.3.2, “Method 2: Applying
AIX PTFs Using SMIT or dsh” on page 111.
Follow the same procedure except for Step 2: in both Option 1 and Option 2, you have to mount the PSSP PTF directory instead of the lppsource directory.
The command is:
#mount sp2en0:/spdata/sys1/install/pssplpp/PSSP-3.1 /mnt
Updating Your System
119
When updating the ssp.css fileset of PSSP, you must reboot the nodes in order for the kernel extensions to take effect.
5.5.3 Method 3: Applying PSSP PTFs Using DSMIT
This method is to be used for installing the PSSP PTFs on to the nodes using
DSMIT, assuming DSMIT is installed in the CWS and the nodes. The procedure for installing is the same as that for installing AIX PTFs to the
nodes. Refer to 5.3.3, “Method 3: Applying AIX PTFs Using DSMIT” on page
Follow the same procedure except for Step 1: instead of mounting the lppsource directory, you will have to give the path of the PSSP ptfs directory.
In the SMIT dialogue panel for Add a File System for Mounting, enter the pathname of the remote directory as /spdata/sys1/install/pssplpp/PSSP-3.1.
When updating the ssp.css fileset of PSSP, you must reboot the nodes in order for the kernel extensions to take effect.
5.6 Applying PTFs in the Nodes Using the Switch
In a larger SP configuration, installing PTFs using the Ethernet is very slow and time-consuming. In such cases, it is possible to install the PTFs by using the switch network, which is much faster compared to the Ethernet network.
This method can be used for installing AIX PTFs. It is also possible to install the PSSP PTFs by using the Switch, since the installation process of the
PTFs does not affect the drivers already loaded in the switch adapter. This method was used at some customer sites but nevertheless it is an unsupported method.
The steps for installing the PTFs by using the Switch are as follows:
1. Create a filesystem or directory in one of the nodes. Check that you have sufficient space to copy the files to this filesystem. In the lab, we created file system /ptf in node sp2n07.
2. NFS-export this filesystem to the CWS with read-write permission by using the command:
# dsh -w sp2n07 /usr/sbin/mknfsexp -d '/ptf' -t 'rw' -r 'sp2en0' '-B'
3. Mount the exported filesystem of the node in the CWS on /mnt by using the command:
# mount sp2n07:/ptf /mnt
4. Copy the PTFs from the tape to /mnt.
120
RS/6000 SP Software Maintenance
5. Initialize the table of contents .toc file by using the command:
# inutoc /mnt
6. Umount the filesystem from the CWS by using the command:
# umount /mnt
7. NFS-export the /ptf filesystem to all the SP nodes through the Switch network by using the command:
# dsh -w sp2n07 /usr/sbin/mknfsexp -d '/ptf' -t 'ro' -r
'sp2css01,sp2css05,sp2css06,sp2css07,sp2css08,sp2css09,sp2css10,sp2css1
1,sp2css12,sp2css13,sp2css14,sp2css15 ' '-B'
8. NFS-mount the /ptf file system of sp2n07 through the Switch on the node where the PTF is to be installed. For example, to install the PTFs on node sp2n08, execute the following command:
# dsh -w sp2n08 dsh> mount sp2css07:/ptf /mnt
9. Install the PTFs in node sp2n08 by using the command: dsh>/usr/lib/instl/sm_inst installp_cmd -a -d '/mnt' -f '_update_all'
'-cNgX’
10.Unmount the filesystem that you mounted in Step 8 by using the command: dsh> umount /mnt
Updating Your System
121
122
RS/6000 SP Software Maintenance
Chapter 6. Migrating PSSP and AIX to Later Versions
This chapter covers the migration of PSSP and AIX to later versions in an SP environment. The objective is to help you to prepare and perform the migration of PSSP and AIX. Before migrating a production system, there are several things to be checked in order to perform a successful migration. In this chapter, we describe in detail the steps from the planning stage to the implementation of migration of PSSP software.
There are several things you need to do, or check, before starting the migration. The following list gives most of these items, but it is not exhaustive.
1. Get the relevant documentation, like READ THIS FIRST document for
PSSP, and read it fully before performing the migration. The README files for all PSSP versions and their LPPs are available at the following URL: http://www.rs6000.ibm.com/support/sp/sp_secure/readme/
2. Check and document the AIX and PSSP levels of your CWS and all nodes.
3. Check your installed LPPs in the CWS and the nodes and document them.
4. Look for the required minimum AIX and PSSP PTF levels for the various
AIX and PSSP versions.
5. Check the PTF levels for AIX and PSSP that are installed in the CWS and the nodes. Find out the PTFs that are needed to be installed for AIX,
PSSP and other LPPs.
6. Understand the issues related to coexistence of various PSSP and AIX levels.
7. Choose the appropriate method (update/migrate/install) for upgrading your
CWS and the nodes.
8. Check your disk space, and if necessary, get some extra disk space.
9. Document your system and the migration plan.
10.Estimate the migration time.
11.Prepare for recovery procedures, in case of unsuccessful migration.
12.Back up the rootvg of the CWS and the nodes. Also back up the files in the
/spdata directory (directory backup, file system backup or savevg backup) of the CWS depending upon your configuration.
© Copyright IBM Corp. 1999
123
Note: For more detailed migration information, refer to IBM Parallel System
Support Programs for AIX: Installation and Migration Guide, GC23-3898.
The basic flow of steps for doing a PSSP migration is shown in Figure 20 on
page 124 and Figure 21 on page 125.
Back up all the data in the CWS and the Nodes
Are PSSP
PTFs at the latest level in the CWS?
Yes
Are PSSP
PTFs at the latest level in the
Nodes?
Yes
Is AIX at the required level in the
CWS for the new PSSP version?
Yes
No
No
No
Update PSSP PTFs in the CWS
Update PSSP PTFs in the Nodes
Migrate/Update to the required AIX version with PTF's
Figure 20. Basic Flowchart for PSSP Migration - Part 1
124
RS/6000 SP Software Maintenance
Back up all the data in the CWS and the Nodes
Are PSSP
PTFs at the latest level in the CWS?
Yes
Are PSSP
PTFs at the latest level in the
Nodes?
Yes
Is AIX at the required level in the
CWS for the new PSSP version?
Yes
No
No
No
Figure 21. Basic Flowchart for PSSP Migration - Part 2
Update PSSP PTFs in the CWS
Update PSSP PTFs in the Nodes
Migrate/Update to the required AIX version with PTF's
Migrating PSSP and AIX to Later Versions
125
6.1 Coexistence
When you are planning to have different versions of PSSP software on the nodes, you must keep in mind that there are minimum PTF requisites for
PSSP and AIX levels on the CWS and the nodes for it to work. This information is available in the READ THIS FIRST document that is provided with the PSSP software. For the README files for all the PSSP versions and their LPPs, refer to the following URL: http://www.rs6000.ibm.com/support/sp/sp_secure/readme/
The minimum PTF requisites for coexistence/migration for various PSSP
versions at the time of writing are shown in the Table 6. To get the latest
update on the PTF levels for PSSP, refer to the following URL: http://www.rs6000.ibm.com/support/sp/sp_secure/status/servstat.html
Table 6. PTF Requisites for Coexistence/Migration of PSSP and AIX Levels
Coexistence / Migration
PSSP 2.4
on
CWS
PSSP 2.3
on
CWS
PSSP 2.2
on
CWS
PSSP 1.2 Nodes
PSSP 2.1 Nodes
PSSP 2.2 Nodes
PSSP 2.3 Nodes
PSSP 1.2 Nodes
PSSP 2.1 Nodes
PSSP 2.2 Nodes
PSSP 1.2 Nodes
PSSP 2.1 Nodes
Not Supported
PTF Set 28.
PTF Set 15.
PTF Set 6.
PTF Set 23.
PTF Set 23
If 4.1.4, then add IX58039 + IX60723 +
IX60742.
PTF Set 11 + IX69336 + IX69491.
If 4.1.4, then add IX58039 + IX60723 +
IX60742.
For RVSD 1.2 Nodes, IX69650.
PTF Set 23 + IX60961 or
PTF Set 24 (which includes IX60961).
PTF Set 18.
If LoadLeveler 1.2.0, PTF Set 12.
If LoadLeveler 1.2.1, PTF Set 5.
126
RS/6000 SP Software Maintenance
6.2 Migrating the CWS from PSSP 2.4 to PSSP 3.1
In the SP environment, the CWS should be at the same or at a higher level of
AIX and PSSP than nodes. If there is already a higher level of AIX and/or
PSSP available you should consider updating/migrating your CWS to this level. This will make future updates and/or migrations easier. So when you consider migrating your SP to a later version, the first step is to migrate the
CWS to the required level of AIX and PSSP Versions. IBM suggests installing
PTF service when upgrading AIX modification levels or when migrating to just a new PSSP level.
To migrate from PSSP 2.4 to PSSP 3.1 with the AIX version at 4.3.2, we suggest doing it as an upgrade service.
Note: Use this section in conjunction with IBM Parallel System Support
Programs for AIX: Installation and Migration Guide, GC23-3898 when migrating the CWS.
The steps for migrating the CWS from PSSP 2.4 to PSSP 3.1 with AIX version at 4.3.2 are as follows:
1. Create a mksysb backup image of the CWS. We recommend that you always take two backups on two different tapes, and never trust just one tape. Check the tape is OK by listing the contents of the tape to be readable by using the command smitty lsmksysb.
For more information on
mksysb backup, refer to 7.1.1, “Mksysb and Savevg” on page 139.
2. Check that there is sufficient space in the /tftpboot and /spdata file
systems. See 2.4.2, “Disk Space Considerations” on page 50 for more
details.
3. Create directory /spdata/sys1/install/pssplpp/PSSP-3.1 by using:
# mkdir /spdata/sys1/install/pssplpp/PSSP-3.1
4. Copy the PSSP 3.1 images into the /spdata/sys1/install/pssplpp/PSSP-3.1
directory. Rename the pssp package to pssp.installp and create the .toc
file by using the inutoc command. The files related to PSSP-3.1 are:
• pssp.installp
• rsct.basic
• rsct.clients
• ssp.resctr
5. Stop the daemons in the CWS. Execute the following commands in the same sequence to stop the daemons in the CWS:
Migrating PSSP and AIX to Later Versions
127
# syspar_ctrl -G -k
# stopsrc -s sysctld
# stopsrc -s splogd
# stopsrc -s hardmon
# stopsrc -g sdr
6. Check that all these daemons are stopped by using the command lssrc
-a.
They should be in inoperative state.
7. Update PSSP version 3.1 on the CWS by using the commands:
# cd /spdata/sys1/install/pssplpp/PSSP-3.1
# /usr/lib/instl/sm_inst installp_cmd -a -Q -d '.' -f '_all_latest'
'-cNgXG’
The list of ssp filesets installed are as follows:
# lslpp -L | grep -E "ssp|rsct" rsct.basic.hacmp
1.1.0.2
rsct.basic.rte
rsct.basic.sp
1.1.0.4
1.1.0.3
rsct.clients.hacmp
rsct.clients.rte
rsct.clients.sp
ssp.authent
1.1.0.0
1.1.0.2
1.1.0.0
3.1.0.1
ssp.basic
ssp.clients
ssp.css
ssp.docs
ssp.gui
ssp.ha_topsvcs.compat
3.1.0.4
3.1.0.4
3.1.0.4
3.1.0.0
3.1.0.4
3.1.0.0
ssp.jm
ssp.perlpkg
ssp.pman
ssp.ptpegui
ssp.public
ssp.resctr.rte
ssp.spmgr
ssp.st
ssp.sysctl
ssp.sysman
ssp.tecad
ssp.top
ssp.top.gui
ssp.ucode
ssp.vsdgui
3.1.0.1
3.1.0.0
3.1.0.1
3.1.0.1
3.1.0.0
3.1.0.0
3.1.0.2
3.1.0.1
3.1.0.1
3.1.0.2
3.1.0.0
3.1.0.1
3.1.0.1
3.1.0.1
3.1.0.1
A
C
A
C
A
A
A
C
C
C
C
A
C
C
A
C
C
C
C
C
C
C
A
C
A
A
A
A
RS/6000 Cluster Technology basic
RS/6000 Cluster Technology basic
RS/6000 Cluster Technology basic
(ECIP) RS/6000 Cluster
RS/6000 Cluster Technology
(ECIP) RS/6000 Cluster
SP Authentication Server
SP System Support Package
SP Authenticated Client Commands
SP Communication Subsystem
SP man pages and PDF files and
SP System Monitor Graphical User
Compatability for ssp.ha and ssp.topsvcs clients
SP Job Manager Package
SP PERL Distribution Package
SP Problem Management
SP Performance Monitor Graphical
Public Code Compressed Tarfiles
SP Resource Center
SP Extension Node SNMP Manager
Job Switch Resource Table
SP Sysctl Package
Optional System Management
SP HA TEC Event Adapter Package
SP Communication Subsystem
SP System Partitioning Aid
SP Supervisor Microcode Package
VSD Graphical User Interface
8. Authenticate yourself as the administrative user to the authentication database by using the command:
128
RS/6000 SP Software Maintenance
# kinit root.admin
9. Complete the PSSP installation on the CWS by using the command as shown in the following screen:
# install_cw
0 objects deleted
0513-044 The stop of the hardmon Subsystem was completed successfully.
0513-083 Subsystem has been Deleted.
0513-071 The hardmon Subsystem has been added.
An entry for /usr/lpp/ssp/bin/sdrd already exists in inittab.
An entry for hardmon already exists in inittab.
0513-004 The Subsystem or Group, splogd, is currently inoperative.
0513-083 Subsystem has been Deleted.
0513-071 The splogd Subsystem has been added.
0513-059 The splogd Subsystem has been started. Subsystem PID is 18156.
Stopping the spmgr subsystem is currently running
An entry for the spmgr subsystem already exists in inittab.
Starting the spmgr subsystem
Starting the swtlog subsystem
Starting the swtadmd subsystem
0513-059 The sysctld Subsystem has been started. Subsystem PID is 26764.
install_cw: If you want to bring up the Perspectives Launch Pad to help
# you complete your installation, enter "perspectives &" now.
10.Verify the SDR and system monitor for correct installation by using the following commands:
# SDR_test
SDR_test: Start SDR commandline verification test
SDR_test: Verification succeeded
# spmon_itest spmon_itest: Start spmon installation verification test spmon_itest: Verification Succeeded
#
11.Set up the site environment in the CWS for the AIX level by using the command:
# spsitenv cw_lppsource_name=aix432
12.Start the system management environments on the CWS by using command services_config; the output should look looks as follows:
Migrating PSSP and AIX to Later Versions
129
# /usr/lpp/ssp/install/bin/services_config rc.ntp: NTP already running - not starting ntp
0513-029 The supfilesrv Subsystem is already active.
Multiple instances are not supported.
/etc/auto/startauto: The automount daemon is already running on this system.
#
13.Check whether the supervisor microcode needs to be updated by using the following command:
# spsvrmgr -G -r status all spsvrmgr: Frame Slot Supervisor Media
State Versions Version Action
_____ ____ __________ ____________ ____________ ____________
1 0 Active u_10.1c.0709
u_10.1c.070c
Installed u_10.1c.070c
Required
None
____ __________ ____________ ____________ ____________
1 Active u_10.3a.0614
u_10.3a.0614
Upgrade u_10.3a.0615
____ __________ ____________ ____________ ____________
17 Active u_80.19.060b
u_80.19.060b
None
#
14.The output of the previous command indicates that microcode for high node in frame-1, slot-1 needs to be updated. Update the microcode by using the following command:
0)sp2en0:/ 89$ spsvrmgr -G -u 1:1 spsvrmgr: Dispatched "microcode" process [28506] for frame 1 slot 1.
Process will take approximately 5 minutes to complete.
spsvrmgr: Process [28506] for frame 1 slot 1 completed successfully.
15.In PSSP-3.1, the authenticated r-commands in the AIX 4.3.2 operating system are used. Older PSSP versions installed their own r-command (in the subdirectory /usr/lpp/ssp/rcmd/bin). Therefore, you have to select an authentication method and activate it afterwards.
1. Select Authentication Method.
If you are using SMIT, issue the following command:
# smitty spauth_config
Select
Select Authorization Methods.
Select
System Partition name.
130
RS/6000 SP Software Maintenance
Move the cursor to the desired partition.
Select
Authorization Methods.
Select one or more methods by using PF7.
See the following screen for an example.
Select Authorization Methods for Root access to Remote Commands
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* System Partition names
* Authorization Methods
[Entry Fields] sp4en0 k4 std
+
+
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
You can also select the appropriate authentication methods by issuing the following command:
# spsetauth -d -p sp4en0 k4 std
2. Enable Authentication Method
The next step is to enable the authentication method. If you are using
SMIT, issue the following command:
# smitty spauth_config
Select
Enable Authorization Methods.
Select
Enable on the Control Workstation only.
Select
Force change on nodes.
Select the option
System Partition name.
Move the cursor to the desired partition.
Select
Authentication Methods.
Migrating PSSP and AIX to Later Versions
131
Select one or more methods by using PF7.
See the following screen for an example.
Enable Authentication Methods
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Enable on Control Workstation Only
Force change on nodes
* System Partition names
* Authentication Methods
[Entry Fields] yes yes sp4en0 k4 std
+
+
+
+
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
You can also enable the appropriate authentication methods by issuing the following command:
# chauthpar [-c] [-f] -p sp4en0 k4 std
16.There are system partition-sensitive subsystems that need to be added to the system. Even if you are not partitioning your system, you need to do this since you still have one default partition. Do the following:
1. Remove the old subsystems by using the following command:
# syspar_ctrl -c
2. Add new subsystems by issuing the following command:
# syspar_ctrl -A -G
The following screen shows an example command output:
132
RS/6000 SP Software Maintenance
# syspar_ctrl -c hr.sp2en0 susbsystem has been deleted
(0)sp2en0:/ 92$ syspar_ctrl -A -G
0513-071 The hats.sp2en0 Subsystem has been added.
0513-071 The hags.sp2en0 Subsystem has been added.
0513-071 The hagsglsm.sp2en0 Subsystem has been added.
0513-071 The haem.sp2en0 Subsystem has been added.
0513-071 The haemaixos.sp2en0 Subsystem has been added.
Deleted 370 objects from class EM_Resource_Variable
Added 370 objects to class EM_Resource_Variable
Added 45 objects to class EM_Resource_ID
Added 54 objects to class EM_Resource_Variable
Added 8 objects to class EM_Structured_Byte_String
Added 3 objects to class EM_Resource_ID
Added 2 objects to class EM_Resource_Class
Added 2 objects to class EM_Resource_Monitor haemcfg: Reading Event Management data for partition: sp2en0.
haemcfg: Created EMCDB file: /spdata/sys1/ha/cfg/em.sp2en0.cdb Version: 90070676
3,274389248,0.
making SRC object "hr.sp2en0"
0513-071 The hr.sp2en0 Subsystem has been added.
0513-071 The pman.sp2en0 Subsystem has been added.
0513-071 The pmanrm.sp2en0 Subsystem has been added.
0513-071 The Emonitor.sp2en0 Subsystem has been added.
0513-059 The hats.sp2en0 Subsystem has been started. Subsystem PID is 15164.
0513-059 The hags.sp2en0 Subsystem has been started. Subsystem PID is 20478.
0513-059 The hagsglsm.sp2en0 Subsystem has been started. Subsystem PID is 19664.
0513-059 The haem.sp2en0 Subsystem has been started. Subsystem PID is 33212.
0513-059 The haemaixos.sp2en0 Subsystem has been started. Subsystem PID is 23214
.
0513-059 The hr.sp2en0 Subsystem has been started. Subsystem PID is 28522.
0513-059 The pman.sp2en0 Subsystem has been started. Subsystem PID is 17828.
0513-059 The pmanrm.sp2en0 Subsystem has been started. Subsystem PID is 17292.
0513-059 The sp_configd Subsystem has been started. Subsystem PID is 14680.
#
17.Verify that all the system partition subsystems have been properly started by using the following command:
# lssrc -a | grep sp2en0 sdr.sp2en0
hats.sp2en0
sdr hats hags.sp2en0
hags hagsglsm.sp2en0
hags haem.sp2en0
haem haemaixos.sp2en0 haem
# hr.sp2en0
pman.sp2en0
hr pman pmanrm.sp2en0
pman
Emonitor.sp2en0
emon
11354 active
20386 active
21160 active
21680 active
16004 active
17804 active
15488 active
20136 active
19880 active inoperative
Migrating PSSP and AIX to Later Versions
133
18.The pmand daemons need to be refreshed in order to recognize the changes that were made to the SDR. To refresh, use the command:
# dsh -avG refresh -s pman
The output of the refresh command should look as follows for all the nodes defined in the SDR: sp2n01: pmand in partition sp2en0 refreshed, waiting events at Fri Jul
17 16:22:02 1998 sp2n01: 0513-095 The request for subsystem refresh was completed successfully.
19.Start the switch by giving the command
Estart and verify whether all the nodes have joined the Switch.
20.Run all the verification tests that are appropriate to your SP environment for correct operation.
6.3 Migrating the Nodes from PSSP 2.4 to PSSP 3.1
In this example, we had an SP configuration with AIX version at 4.3.2 on the
CWS and on the nodes, and a PSSP version at 2.4. The steps for migrating node sp2n06 from version PSSP 2.4 to PSSP 3.1 are as follows:
1. Create a mksysb backup image of the node. For more information on
taking a mksysb backup, refer to 7.1.1, “Mksysb and Savevg” on page 139.
2. Enter the node configuration data to set the SDR for the new PSSP level to be installed by using the following two commands:
# spchvgobj -r rootvg -p PSSP-3.1 -l 6 spchvgobj: Successfully changed Node and Volume_Group objects for node number 6, volume group rootvg.
spchvgobj: The total number of changes successfully completed is 1.
spchvgobj: The total number of changes which were not successfully completed is 0.
# spbootins -s no -r customize -l 6
3. Verify the values in the SDR for the correct pssp_ver by using the following command:
134
RS/6000 SP Software Maintenance
splstdata -b -l 6
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
6 sp2n06 10005AFA1B12 0
customize
hdisk0 bos.obj.ssp.432 Tue_Apr_13_17:29:51
PSSP-3.1
rootvg bos.obj.ssp.432
aix432
4. Now you can either run setup_server to initiate an install operation, or you can do it directly by using the wrapper-commands. Execute the following commands to prepare for the installation of node sp2n06.
# setup_server or
# Efence -G -autojoin sp2n06
# create_krb_files
# mkconfig
# mkinstall
# export_clients
# allnimres -l 6
After you execute these commands, the screen output should look as follows:
# Efence -G -autojoin sp2n06
All nodes successfully fenced.
# create_krb_files create_krb_files: tftpaccess.ctl file and client srvtab files created/updated on server node 0.
# mkconfig
# mkinstall
# export_clients export_clients: File systems exported to clients from server node 0.
# llnimres -l 6 allnimres: Node 6 (sp2n06) prepared for operation: customize.
5. Refresh the system partition-sensitive subsystems by using the following command:
Migrating PSSP and AIX to Later Versions
135
# syspar_ctrl -r -G
0513-095 The request for subsystem refresh was completed successfully.
0513-095 The request for subsystem refresh was completed successfully.
6. Copy the pssp script from the CWS to the /tmp directory of sp2n06 by using the command:
# pcp -w sp2n06 /spdata/sys1/install/pssp/pssp_script /tmp/pssp_script
7. Execute the pssp script on node sp2n06 from the CWS by using the command:
# dsh -w sp2n06 /tmp/pssp_script
The output of this command should look as follows:
# dsh -w sp2n06 /tmp/pssp_script sp2n06: sp2n06: ===================================================== sp2n06: pssp_script: = Making PSSP log directories...
= sp2n06: sp2n06:
===================================================== sp2n06: ===================================================== sp2n06: pssp_script: = Switching output to log file...
= sp2n06: ===================================================== sp2n06: + /usr/bin/mkdir -p /var/adm/SPlogs/sysman sp2n06: + 1> /dev/null 2>& 1 sp2n06: + exec sp2n06: + 3> /var/adm/SPlogs/sysman/NODE.config.log.7792
8. Check the log file /var/adm/SPlogs/sysman/sp2n06.config.log.7792 for any errors. Check that no lpp is in a broken state by using the command lppchk
-v
.
9. Check the bootp_response for this node and ensure that it is set to disk by using the following command:
1)sp2en0:/ 105$ splstdata -b -l 6
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
6 sp2n06 10005AFA1B12 0 bos.obj.ssp.432 Tue_Apr_13_15:04:18
PSSP-3.1
rootvg
disk
bos.obj.ssp.432
hdisk0 aix432
136
RS/6000 SP Software Maintenance
10.Reboot the node to enable the kernel changes related to the new PSSP version to take effect.
11.Run all the verification tests that are appropriate to your SP environment for their correct operation.
Migrating PSSP and AIX to Later Versions
137
138
RS/6000 SP Software Maintenance
Chapter 7. Backup and Recovery
As we know, there is no such thing as a “perfect” system. Any component of your system, independent of the quality of the product or of the products and mechanisms implemented to increase availability, is susceptible to failure.
The data stored in your system could get lost due to human error or component failure. Hence, should such failures occur, you must have procedures and medias in place to reconstruct your environment as it was before those problems arose.
The whole business of a company or success of a project depends on successful backups, meaning the ability to recover all the data that has been stored in a machine, in a minimum time and as near as possible to the exact data that was there before the failure event.
Although the most important data to protect is the information of the business.
it is also important to back up the configuration of your system so that you could continue working or restart your production environment in case of component failure.
This chapter helps you to understand how to back up and restore your whole
SP2 system, as well as its subsystems: the SDR, the Kerberos database, normal directories and logical volumes.
7.1 Backing Up Your SP System
This section covers the procedures to back up the SP system and its various
subsystems. To restore the SP system and its subsystems, refer to 7.2,
“Restoring Your SP System” on page 147.
7.1.1 Mksysb and Savevg
The mksysb and savevg commands should be used to protect the whole system; through these tools you can save all the system data and user data.
The file-system image is in backup-file format. The tape format includes a boot image, a bosinstall image, and an empty table of contents, followed by the system backup (root volume group) image. The root volume group image is in backup-file format, starting with the data files and then any optional map files.
Keep the following things in mind when taking a backup with mksysb:
© Copyright IBM Corp. 1999
139
• The mksysb tool cannot back up raw devices, unmounted file systems, nfs filesystems, or volume groups differents to the rootvg. Make sure that you use a different tool to back up that data.
• Some rspc systems do not support booting from tape. When you make a bootable mksysb image on an rspc system that does not support booting from tape, the mksysb command issues a warning indicating that the tape will not be bootable.
• You can install a mksysb image from a system that does not support booting from tape by booting from a CD and entering maintenance mode; while in maintenance mode, you will be able to install the system backup from tape.
For a complete explanation of the mksysb command, refer to IBM Parallel
System Support Programs for AIX: Command and Technical Reference,
Volume 1 and Volume 2, SA22-7351.
7.1.1.1 Mksysb of the CWS
The CWS will normally have a tape device attached directly, so there is no difference in the way that you take a backup from a normal workstation.
You can generate the mksysb either through SMIT or by using the command line.
If you use SMIT, type the following command:
# smitty mksysb
If you use the command from command line and you want to generate the
./image.data file before taking the mksysb, type the following command:
# mksysb -i <device name>
If you do not want to generate the ./image.data file because you made a change to the existing one, you can omit the -i flag.
It is always important to have a recent backup of the CWS. You can decide to take one image periodically, for example each week or month, depending upon your environment, and take another copy each time you change the configuration.
Remember that mksysb will back up only the data contained in the rootvg volume group and that you will have to use savevg to back up the other volume groups.
140
RS/6000 SP Software Maintenance
7.1.1.2 Mksysb of a Node
Normally you do not have a tape drive attached to all your nodes, so therefore you are not able to take backups directly to tape. It will be necessary to use the network to transfer your backups to another machine with a connected tape drive attached.
In order to take an image of your node, follow the same steps mentioned in
7.1.1.1, “Mksysb of the CWS” on page 140. However, your image is usually
created on your local node, but for later use you need to have it on your CWS; the preferred subdirectory is /spdata/sys1/install/images. As always, there is more than one solution to this problem.
• Create the image on the local node and transfer it afterwards to the CWS.
You can use the ftp or rcp command.
• Export /spdata/sys1/install/images and mount it on your node. Create the image on this mounted filesystem.
• Create a named pipe on your system and use this pipe as the output device for the mksysb command. Start a remote copy; use the named pipe as the source and the file name of the image (in the subdirectory
/spdata/sys1/install/images on the CWS) as the destination.
The image file is not bootable, so you will have to restore this as explained in
7.2.1.2, “Restoring a mksysb of a Node” on page 148.
In common system environments, you have some nodes with the same configuration working the same application. You can take a backup of one node of each group so that you can restore that image into any node of the same kind and purpose. For example, if you have 10 nodes working with AIX
4.3.2 and configured to work with Oracle, you could take a mksysb of one node. Then, if you need to restore the information of any of those nodes in the future, you can use the same mksysb to restore both AIX and also Oracle, and restore the Oracle configuration files of a specific node.
7.1.1.3 How to Verify a Mksysb
The only way to verify that there is absolutely no problem with a mksysb is to restore it in another machine. But if you cannot do that, you can verify the following: that the tape media can be read, that the bootable part of the tape is OK, and that the mksysb stored the files that you wanted. The following section describes how to do these verifications.
Data Verification
You can verify that you can read the data in the tape and store a copy of its contents, if desired, in the following way:
Backup and Recovery
141
If you use SMIT, type:
# smitty lsmksysb
Using the command line:
# tctl -f /dev/rmt0 rewind
# restore -s4 -Tvqf /dev/rmt0.1 > /tmp/mksysb.lst
Boot Verification
The only way to verify that a mksysb tape will successfully boot is to bring the machine down and boot from the tape. No data needs to be restored.
Attention
Having the PROMPT field in the bosint.data file set to no causes the system to begin the mksysb restore automatically, using preset values with no user intervention.
If the state of PROMPT is unknown, this can be set during the boot process. After answering the prompt to select a console during the boot up, a rotating character is seen in the lower left of the screen. As soon as this character appears, type
000 and press Enter. This will set the prompt variable to yes.
You can also check the state of this variable while in normal mode by typing the following commands:
# chdev -l rmt0 -a block_size=512
# tctl -f /dev/rmt0 rewind
# cd /tmp
# restore -s2 -xvqf /dev/rmt0.1 ./bosinst.data
Then you can edit the file bosinst.data and check the PROMPT variable to find out its value.
7.1.1.4 Savevg
The savevg command finds and backs up all files belonging to a specified volume group. The volume group must be varied on, and the file systems must be mounted.
It is important to have a backup of the spdata file system. If you did not create an independent volume group to store this file system, and it is mounted in the rootvg file system, the spdata file system will be included in your CWS image created with the mksysb command. However, if you have the file system
142
RS/6000 SP Software Maintenance
spdata in a volume group different from rootvg, you will have to use the savevg command to back up the volume group, or use some other tool to back up file systems independently.
There are two ways of using savevg
, one through SMIT and the other through the command line.
To back up the splstdatavg using SMIT, type:
# smitty savevg
Back Up a Volume Group
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]
WARNING: Execution of the savevg command will result in the loss of all material previously stored on the selected output medium.
* Backup DEVICE or FILE
* VOLUME GROUP to back up
List files as they are backed up?
Generate new vg.data file?
Create MAP files?
EXCLUDE files?
EXPAND /tmp if needed?
Disable software packing of backup?
[MORE...2]
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
[Entry Fields]
[/dev/rmt0]
[spdatavg] no yes no no no no
F4=List
Esc+8=Image
+
+
+
+
+
+/
+
+
Figure 22. Display of SMIT savevg Settings
If you are taking the backup in a node and you do not have a tape drive installed locally, you can choose one of the three methods described at the beginning of this section.
7.1.2 Backing Up the SDR
The SDR is required in order for your SP system to function, so you should have at least one backup of this database. The following task and subsystems depend on the SDR:
• The Job Manager
Backup and Recovery
143
• VSD configuration tasks
• SP system configuration, customization and installation tasks
• The hardware monitor clients
• The SP nodes, when they are booted
• The SP Switch, when the switch is started
• The High Availability subsystem
You can take a backup of the SDR through SMIT or by using the command line, as follows.
If you use SMIT, type:
# smitty syspar
Select
Archive System Data Repository.
Type the extension that you want for the filename. (This is optional). Press
Enter.
Using the command line:
# SDRArchive <append_string>
One file will be generated in the directory /spdata/sys1/sdr/archives with the following format: backup.JULIANdate.HHMM.append_string
JULIANdate.HHMM is a number or string uniquely identifying the date and time of the archive, and append_string is the argument entered in the command invocation, if specified.
The
SDRArchive command is a script contained in the directory /usr/lpp/ssp/bin that executes a tar similar to the following:
# tar -cf /spdata/sys1/sdr/archives/backup.* /spdata/sys1/sdr/defs \
/spdata/sys1/sdr/sys
If you want, you can take a look to the file using the command:
# tar -tvf <file name>
To restore a savevg backup, use the restvg
command as explained in 7.2.1.3,
“The Restore Volume Group Command” on page 148.
144
RS/6000 SP Software Maintenance
7.1.3 Kerberos Database
The kerberos database is needed in order for the SP to work properly. If something goes wrong with kerberos, you can experience problems with the
Switch, spmon, parallel commands and so forth. Therefore, you must have an up-to-date backup of your kerberos database.
Kerberos provides some commands that help you to take backups and do recovery of the kerberos database. The command used to backup is kdb_util dump
, followed by the file name where you want to store the backup.
For example, if you want to back up the kerberos database to a file called kerberos.940422, you should execute:
# kdb_util dump kerberos.990422
The command will create a ASCII file called 990422 that has a format similar
to the one shown in Figure 23 on 145. Although you can put this file
anywhere, we recommend that you put it in the kerberos database directory called /var/kerberos/database.
hardmon sp4en0 255 1 1 0 fa50af36 5f82b6b6 203801010459 199902210245 root admin rcmd sp4sw06 255 1 1 0 380cf1de f60552b8 203801010459 199902221651 root admin
K M 255 1 1 0 d0f57e3f 6d27fc15 203801010459 199902210242 db_creation * rcmd sp4sw25 255 1 1 0 97ed8fe3 5bb4c135 203801010459 199904091434 root admin rcmd sp4sw27 255 1 1 0 34b2c9ce d2be40c2 203801010459 199904091434 root admin rcmd sp4n11 255 1 1 0 c71bc192 ce5c42e8 203801010459 199902221651 root admin rcmd sp4n13 255 1 1 0 d6cc1011 853697a3 203801010459 199902221651 root admin rcmd sp4n01 255 1 1 0 7c8addfe 643ec26c 203801010459 199902221651 root admin rcmd sp4sw13 255 1 1 0 3db9c61e 2c3d995 203801010459 199902221651 root admin rcmd sp4en1 255 1 1 0 dc07675e a010a874 203801010459 199902261624 root admin krbtgt MSC.ITSO.IBM.COM 255 1 1 0 248e516a 53e2ad30 203801010459 199902210242 db
_creation * rcmd sp4en0 255 1 1 0 3fd5361 69e766e9 203801010459 199902210245 root admin q q 255 1 1 0 5910b883 9c531f43 203801010459 199903291519 * * default * 255 1 1 0 0 0 203801010459 199902210242 db_creation * rcmd sp4n31 255 1 1 0 2624cf4e 1237938b 203801010459 199904091434 root admin rcmd sp4sw31 255 1 1 0 aef5ca0d fbd0d81d 203801010459 199904091434 root admin rcmd sp4sw07 255 1 1 0 2d624a68 d4ec615 203801010459 199902221651 root admin rcmd sp4sw21 255 1 1 0 9e81aa86 69b0acd1 203801010459 199904091434 root admin rcmd sp4sw26 255 1 1 0 655e9862 9da6ec98 203801010459 199904091434 root admin hardmon sp4cw0 255 1 1 0 53609c0b 71598274 203801010459 199902210245 root admin
...
Figure 23. Typical Format of a Kerberos Backup File
When you have secondary authentication servers defined and you follow the steps outlined for setting them up, you will automatically create a backup of the database every time the cron file entry runs that propagates the database
Backup and Recovery
145
to the secondary servers. You can verify if you have that entry in your crontab file by typing:
# crontab -l
The entry that propagates the database to the secondary servers and takes a backup of the database looks like the following:
0 * * * * /usr/kerberos/etc/push-kprop
If you have this entry, the push-kprop script will create a backup file called slavesave in the database directory /var/kerberos/database.
If you do not have this entry, but you want to automatically and periodically take backups of your kerberos database, you can include the kdb_util command to your crontab file.
7.1.4 Backing Up the NIM Database
The NIM database can be backed up from SMIT or by using the command line, as follows:
If you use SMIT, type:
# smitty nim_backup_db
Then enter the name of a device or a file to which the NIM database and the
/etc/niminfo file will be backed up.
Using the command line:
Save the following NIM files using any method, for example, tar:
/etc/niminfo
/etc/objrepos/nim_attr
/etc/objrepos/nim_attr.vc
/etc/objrepos/nim_object
/etc/objrepos/nim_object.vc
When you use SMIT, you can see that a tar is executed to back up these files.
Do not forget that a backup of a NIM database should only be restored to a system with a NIM master fileset that is at the same level or a higher level than the level from which the backup was created.
146
RS/6000 SP Software Maintenance
7.2 Restoring Your SP System
This section covers the procedures to restore the data in the SP system and its different subsystems. The procedures to backup the SP system and any of
its subsystems are found in 7.1, “Backing Up Your SP System” on page 139.
7.2.1 Restoring an mksysb Image or a Volume Group Backup
In the following sections we descibe how to restore an mksysb image or a volume group backup.
7.2.1.1 Restoring an mksysb of the CWS
If you have a problem with the CWS and you have a recent backup, you can restore it as follows:
1. Execute the normal procedure used to restore any RS/6000 workstation, as described in Chapter 5. Installing BOS from a system backup is described in IBM Parallel System Support Programs for AIX: Installation
and Migration Guide, GC23-3898.
2. Run install_cw
. This script creates the proper node_number entry for the
CWS in the ODM. It also executes other functions, as described in 2.10,
“The install_cw Command” on page 58.
3. Verify your CWS.
Sometimes you need to recover only one part of a mksysb. For example, if you install a new SP with a PCI CWS and you cannot recover the mksysb that is sent with the SP, because that is taken in a microchannel machine, you will want to recover at least the /spdata directory. To do that, follow these next instructions:
1. Determine the blocksize the tape was set to when the mksys was taken:
# cd /tmp
# tctl -f /dev/rmt0 rewind
# chdev -l rmt0 -a block_size=512
# restore -s2 -xqdvf /dev/rmt0.1 ./tapeblksz
# cat ./tapeblksz
Then
2. Set the blocksize of the tape drive accordingly by running the following command:
# chdev -l rmt0 -a block_size=[number in the ./tapeblksz file]
3. Restore the files or directories that you want by executing the following commands:
Backup and Recovery
147
# cd /
# tctl -f /dev/rmt0 rewind
# restore -s4 -xqdvf /dev/rmt0.1 ./< path >
7.2.1.2 Restoring a mksysb of a Node
The procedure that is used to restore a mksysb image to a node is the same
as that used for installing a node; for a detailed description see 5.3.1,
“Method 1: Applying AIX PTFs Using mksysb Install Method” on page 106.
To define your install image in an PSSP 3.1 environment you need to run the following commands:
# spchvgobj -r rootvg -i <image name> -l <node_number>
# spbootins -r install -l <node_number>
Example:
To restore node 10 with an image called sp4n10.img:
# spchvgobj -r rootvg -i image.sp4n10.img -l 10
# spbootins -r install -l 10
You can verify the environment as shown in the following command:
# splstdata -b -l 10
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
10 sp4n10 10005AFA159D 0 bos.obj.ssp.432 Mon_Apr_12_15:44:09
PSSP-3.1
rootvg install sp4n10.img
hdisk0 aix432
Check the fields response and next_install_image.
Now you should net boot the node to restore the correct image.
You can restore this image in a node different from the original node without worrying about the number of the node and its specific configuration. That configuration is restored in the customization face of the installation process.
7.2.1.3 The Restore Volume Group Command
The restvg command restores the user volume group and all its containers and files, as specified in the /tmp/vgdata/vgname/vgname.data file (where vgname is the name of the volume group) contained within the backup image
148
RS/6000 SP Software Maintenance
created by the savevg
command. Refer to 7.1.1.4, “Savevg” on page 142 to
learn how to use this command.
To restore a specific volume group using SMIT, you can do the following command:
# smitty restvg
Enter the name of the file or device that contains a backup taken with the savevg command.
Enter the physical volume names for that vg (optional)
Press Enter.
Using the command line, type the following:
# /usr/bin/restvg -q -f '/temporal/jorgevg.backup'
7.2.2 Restoring the SDR
The restore of the SDR is accomplished by using the
SDRRestore command.
This command is actually a script in the /usr/lpp/ssp/bin directory that deletes the information of the existing SDR, stops and removes the sdr daemons and subsystems, restores the backup taken with the
SDRArchive command which is in tar format, and recreates and restarts the subsystems and daemons needed.
You can restore the SDR backup either through SMIT or by using the command line, as follows.
If you use SMIT, type:
# smitty syspar
Select
Restore System Partition Configuration.
Select the file you want to restore using F4.
Press Enter.
This way you restore the SDR and also the corresponding partition-sensitive subsystems.
Backup and Recovery
149
Using the command line:
You can use one of two commands:
# SDRRestore <backup file name> or
# sprestore_config <backup file name>
The
SDRRestore command only stops and restarts the sdr subsystem and
restores the tar contained in <backup file name>. As an example, Figure 24
shows the output of the execution of the
SDRRestore command of a file called backup.99201.1811.
# SDRRestore backup.99201.1811.jvm
0513-044 The stop of the sdr.sp4en0 Subsystem was completed successfully.
0513-083 Subsystem has been Deleted.
0513-071 The sdr.sp4en0 Subsystem has been added.
0513-059 The sdr.sp4en0 Subsystem has been started. Subsystem PID is 37628
Figure 24. Restore of the SDR Using SDRRestore
Alternatively, sprestore_config not only restores the SDR, but also restores all partition-sensitive subsystems. This is actually the command executed when
you restore using SMIT (see Figure 25).
150
RS/6000 SP Software Maintenance
# sprestore_config backup.99201.1811.jvm
0513-044 The stop of the sdr.sp4en0 Subsystem was completed successfully.
0513-083 Subsystem has been Deleted.
0513-071 The sdr.sp4en0 Subsystem has been added.
0513-059 The sdr.sp4en0 Subsystem has been started. Subsystem PID is 17934.
stopping "hr.sp4en0"
0513-044 The stop of the hr.sp4en0 Subsystem was completed successfully.
removing "hr.sp4en0"
0513-083 Subsystem has been Deleted.
0513-071 The hats.sp4en0 Subsystem has been added.
0513-071 The hags.sp4en0 Subsystem has been added.
0513-071 The hagsglsm.sp4en0 Subsystem has been added.
0513-071 The haem.sp4en0 Subsystem has been added.
0513-071 The haemaixos.sp4en0 Subsystem has been added.
Added 0 objects to class EM_Resource_Variable
Added 0 objects to class EM_Structured_Byte_String
Added 0 objects to class EM_Resource_ID
Added 0 objects to class EM_Resource_Class
Added 0 objects to class EM_Resource_Monitor making SRC object "hr.sp4en0"
0513-071 The hr.sp4en0 Subsystem has been added.
0513-071 The pman.sp4en0 Subsystem has been added.
0513-071 The pmanrm.sp4en0 Subsystem has been added.
0513-071 The Emonitor.sp4en0 Subsystem has been added.
0513-059 The hats.sp4en0 Subsystem has been started. Subsystem PID is 38712.
0513-059 The hags.sp4en0 Subsystem has been started. Subsystem PID is 22996.
0513-059 The hagsglsm.sp4en0 Subsystem has been started. Subsystem PID is 25742.
0513-059 The haem.sp4en0 Subsystem has been started. Subsystem PID is 20558.
0513-059 The haemaixos.sp4en0 Subsystem has been started. Subsystem PID is 24642.
0513-059 The hr.sp4en0 Subsystem has been started. Subsystem PID is 35508.
0513-059 The pman.sp4en0 Subsystem has been started. Subsystem PID is 24880.
0513-059 The pmanrm.sp4en0 Subsystem has been started. Subsystem PID is 40450.
0513-059 The sp_configd Subsystem has been started. Subsystem PID is 18412.
Figure 25. Restore of the SDR Using sprestore_config
7.2.3 Restoring the Kerberos Database
If the kerberos database becomes corrupted, it may be necessary to restore it. If you do not have a backup taken with the command kdb_util dump
, you might have to recreate the database using the setup_authent command. If you do have that backup, then you can use the following procedure to restore your database:
# kdestroy
# chitab "kerb:2:off:/usr/lpp/ssp/kerberos/etc/kerberos"
# chitab "kadm:2:off:/usr/lpp/ssp/kerberos/etc/kadmind -n"
# telinit 2
# stopsrc -s hardmon
# kdb_destroy
# kdb_init
Backup and Recovery
151
The realm name will be the same as the domain name, but converted to uppercase.
# kstash
# kdb_util load <filename of backup>
# chitab "kerb:2:once:/usr/lpp/ssp/kerberos/etc/kerberos"
# chitab "kadm:2:once:/usr/lpp/ssp/kerberos/etc/kadmind -n"
# telinit 2
# startsrc -s hardmon
# kinit root.admin
If you do not have a good backup of the kerberos database, and you need to rebuild it, refer to RS/6000 SP: Problem Determination Guide, SG24-4778 for a detailed procedure to reconstruct the database.
7.2.4 Restoring the NIM Database
To restore the NIM database, you can use SMIT or the command line.
If you use SMIT, type:
# smitty nim_restore_db
Then select the device or file to restore.
When we tried to restore using SMIT, we received error messages because the system asked us to unconfigure the NIM database, but in some cases the system cannot do that and returns errors. It is easier to restore the backup manually. If you use the SMIT menu to backup the database, go to / and execute:
# tar -xvf /etc/objrepos/nimdb.backup
This command will restore the files of the NIM database immediately.
7.3 Remote Backups and Restores
When you want to back up files or directories in nodes which do not have tape drives directly attached, you have three options:
1. Do a local backup using any command ( tar, dd, cpio, backup
, and so forth) and send the file using rcp,pcp,ftp or any other command.
2. Use a specific tool (such as ADSM) to send your data to a backup server.
3. Create a backup on a remote system which has a tape connected; you can use systems like the CWS, another node, or any workstation.
152
RS/6000 SP Software Maintenance
The first option can be good if you have enough space in your local node and have a slow network; by using this option, you can take the backup fast, compress the file, and send it to the other machine.
If you want to take a backup directly to a file or tape in other machine, you can use the tar and dd commands in the following way:
# tar -cvf - <path> | rsh <remote host> dd of= <remote tape or file> \ bs=<block size>
Remember to use the dash symbol (-) before the path, and also remember that the block size depends of current tape device settings. For example, take a backup of the /tmp filesystem to the tape /dev/rmt0 connected to a machine called sp4en0 and defined with a block size of 1024 using the dsh command, you must execute:
# tar -cvf - /tmp | dsh -w sp4en0 dd of=/dev/rmt0 bs=1024
However, running this command, you can experience some problems. For example, if you have a block size that does not match the one in the remote tape, you will receive a response similar to the following:
# tar -cvf - ./usr/lib/u* |rsh sp4en0 dd of=/dev/rmt0 bs=512 a ./usr/lib/unixtomh 19 blocks.
a ./usr/lib/uucp a ./usr/lib/uucp/Devices symbolic link to /etc/uucp/Devices.
a ./usr/lib/uucp/Dialers symbolic link to /etc/uucp/Dialers.
dd: 0511-053 The write failed.
: A system call received a parameter that is not valid.
3+0 records in.
0+0 records out.
Also, make sure that you have permission to execute the remote command. If you are using kerberos, ensure that you have the proper tickets. If you are using normal remote commands, ensure that you have permission in the
/.rhosts file.
To restore a backup taken like this, you must execute the following command:
# rsh <remote host> dd if=<remote tape or file> ibs=<block size> |tar \
-xvf - <path>
Remember to use the same block size that you used to take the backup.
To create a savevg tape remotely, you must use a pipe; run the following commands:
Backup and Recovery
153
# mknod /tmp/pipe p
# savevg -i -f /tmp/pipe <vgname> &
# dd if=/tmp/pipe |dsh -w <remote host> dd of=<remote tape or file> \ bs=<block size>
Then, to restore a savevg, type the following:
# mknod /tmp/pipe p
# dsh -w <remote host> dd if=<remote tape or file> bs=<block size> \
</dev/null >/tmp/pipe &
# restvg /tmp/pipe
One common requirement when working with the SP2 is the need to install remotely, any third-party applications that require a “local” tape drive to install their product. To accomplish this, assuming that the product is installed with the syntax installxx <device>
, run the following procedure:
# mkfifo /tmp/pipe
# dsh -w sp4en0 <remote host> dd if=<remote tape> bs=<Block size> \
>/tmp/pipe 2>/dev/null
# installxx /tmp/pipe
154
RS/6000 SP Software Maintenance
Chapter 8. Taming Kerberos
Ever since Kerberos was introduced into PSSP software there have been complaints about it. The question is often asked: "Isn’t it possible to get rid of this Kerberos stuff on the SP?" The reasons for these complaints are always the same: scripts do not run because of expired Kerberos tickets, the switch is not startable, spmon does not work after changing the IP configuration, distributed shell commands return errors, and so on. As a consequence, many customers stored a /.rhosts file on the nodes just to be sure that remote shell commands run even if Kerberos does not work properly. But one of the reasons for using Kerberos in the first place was to eliminate the need for an
/.rhosts file.
"With the SP comes Kerberos and with Kerberos comes trouble." These are the very first words of a service bulletin about Kerberos in 1995. In this chapter we show that this statement is not true and that it can be easy to handle Kerberos as soon as the processes of Kerberos authentication are made clear.
In Greek mythology, Kerberos is the three-headed dog who guards the entrance to Hades. In the following sections, we provide the information, hints and tips you need in order to “tame” Kerberos.
© Copyright IBM Corp. 1999
155
Figure 26. Kerberos as a Domestic Dog
We start with a brief overview of what a default SP Kerberos setup looks like, and explain the necessary Kerberos-related terms. We then approach
Kerberos in a chronological way, corresponding to the installation process of an SP. We explain: what does setup_authent do? What happens during setup_server
? Which files are needed where, and when are they distributed?
156
RS/6000 SP Software Maintenance
8.1 The Default SP Kerberos Realm
When you install an RS/6000 SP system with the default settings, your
Kerberos setup will look like Figure 27.
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
n15 n13 n09 n05 n01
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/.k
cws:>
/.k
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/tmp/tkt0
/var/ker beros/\ e/*
Figure 27. SP Kerberos Realm
The default SP Kerberos environment consists of the CWS as the Kerberos authentication server. All hosts (that means all SP nodes) that are under control of the authentication server belong to its realm. Realm in this case is a synonym for kingdom, so the authentication server as the "king" reigns over his kingdom and keeps an eye (or better, an ear) on the communication of all the inhabitants of the kingdom.
Taming Kerberos
157
On the authentication server, the Kerberos database is stored under the
/var/kerberos/database. This database holds the information of all Kerberos users and services and their passwords within this realm. Only the authentication server (that means the CWS) has access to this database because it is encrypted with the so-called Kerberos master key. This master key is stored in the /.k file on the server.
Since there is one central database for one Kerberos realm, all Kerberos user and service names must be unique within this realm. All clients of this
Kerberos realm, in our example all SP nodes, have Kerberos configuration files, therefore they know who their authentication server is and which nodes
belong to the same realm. As you can see in Figure 27 on page 157, the
client nodes have almost all the Kerberos-related files as their authentication
server. But some files are unique on every host. Figure 28 shows that the
/etc/krb.conf, /etc/krb.realms, and /.klogin files on the clients are just copies of the server files, while the /etc/krb-srvtab file and the /tmp/tkt0 file (if it exists) are unique on each host, although the names are the same.
Authentication
Server
Client
Node /.k
/.klogin
/etc/krb.realms
/etc/krb.conf
copies of server files
/.klogin
/etc/krb.realms
/etc/krb.conf
copies of server files
/etc/krb-srvtab
/tmp/tkt0 uniqu each e to clien t
/etc/krb-srvtab
/tmp/tkt0 un iqu e t o ea ch clie nt
Client
Node
/.klogin
/etc/krb.realms
/etc/krb.conf
/etc/krb-srvtab
/tmp/tkt0
/var/kerberos/\ database/*
Figure 28. Kerberos Configuration Files
158
RS/6000 SP Software Maintenance
8.2 Kerberos Setup during the Installation of an RS/6000 SP System
Assume we are going to install an RS/6000 SP system. Only the CWS is installed with AIX; the nodes are not yet installed. Note: it is not mandatory that the CWS plays the role of the authentication server. The SP environment can also be included in an existing AFS or CDE cell, for example. However, in most cases the CWS is the authentication server, so this will also be our initial setup.
After the installation of the PSSP code on the CWS, you have to issue setup_authent.
This script defines the CWS as authentication server, creates the Kerberos database, and asks you for the Kerberos master key, the very first Kerberos principal (root.admin) and some default values. Let us see which files are already created at this point of the installation.
setup_authent on cws
[cws] >
# setup_authent creates on the cws
/.k
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/tmp/tkt0
/var/kerberos/database/*
Figure 29. setup_authent on the CWS
8.2.1 The Kerberos Master Key
The /.k file stores the Kerberos master key that has been generated by encrypting the Kerberos master password you have been prompted for during setup_authent
. This key encrypts not only the whole Kerberos database, but
regenerate it with the kstash command.
Taming Kerberos
159
8.2.2 The Kerberos Configuration File
The /etc/krb.conf file contains information indicating to which Kerberos realm this host belongs and also which host is the authentication server of this realm. Unlike the Kerberos master key, the /etc/krb.conf file is a regular ASCII file.
SP2EN0
SP2EN0 sp2en0 admin server
For this realm...
this host is ...
Which realm does this host belong to?
Figure 30. The /etc/krb.file File
As shown in Figure 30, the /etc/krb.conf file has at least two entries. The first
line indicates that this host belongs to, for example, the SP2EN0 realm, and the second line gives the information about the authentication server of this realm.
You may ask, who chose the realm name SP2EN0? The setup_authent script does the following. First it checks whether a /etc/krb.conf file already exists. If not, setup_authent takes the hostname of the CWS and changes it into
uppercase to define the realm name. As you can see in Figure 30, the
hostname of the CWS is sp2en0, so the realm name became SP2EN0. When you use a domain name server, setup_authent uses the domain entry of the
/etc/resolv.conf file as the realm name and changes it into uppercase.
Note, however, that you are not forced to use any of these realm names. You can choose the realm name and create your own /etc/krb.conf file in the same syntax. As soon as setup_authent detects an existing /etc/krb.conf file, the first entry will then become your realm name.
160
RS/6000 SP Software Maintenance
8.2.3 The Kerberos Realm Configuration
The /etc/krb.realms file is a kind of map file that indicates which network interface (therefore which adapter names) belongs to which realm. Kerberos authentication works with adapter names, therefore you will see an entry for every adapter that is part of the realm. At this stage of the setup_authent process, the krb.realms only includes information about the authentication server itself (the CWS), because no node information exists yet. Here is the content of the file after setup_authent was executed.
sp2en0:/# cat /etc/krb.realms sp2cw0 SP2EN0
In our setup the CWS has a Token Ring adapter with the adapter name sp2cw0 and an Ethernet adapter with the sp2en0 adapter name. The hostname of our CWS is based on the Ethernet adapter (look at the command prompt in the screen output above). Since the primary authentication server is using the hostname, the related adapter name (the ethernet adapter name in our configuration) is not included in the
/etc/krb.realms file. Any additional configured adapter, like our Token Ring adapter, will be included in the krb.realms file.
This file will get enlarged with all the adapter names of supported adapters
(like Ethernet, Token Ring, Switch adapter, FDDI etc.) from all the nodes during the installation steps following the setup_authent command.
8.3 Kerberos Principals
In a Kerberos environment, users as well as services are called principals.
You can authenticate yourself to Kerberos either as a user principal or as a service principal.
The complete instance of a principal is composed of these elements:
<principal_name>.<instance>@REALM
Examples: root.admin@SP2EN0, anna.kerb@SP2EN0
The very first Kerberos user principal that you define (usually root.admin) must have the instance admin. This instance is needed to do administrative tasks on the Kerberos server. This Kerberos principal must be logged in as
UNIX user root. For further user principals, the instance name is freely choosable, but it must not be created before you define the principal.
Taming Kerberos
161
In an SP environment two service principals are available:
• rcmd.<instance>@REALM (on CWS and on each node)
• hardmon.<instance>@REALM (only on the CWS)
The instance name of a service principal is the adapter name on which this service is provided. The rcmd service-principal provides the Kerberos remote commands. Since you want to be allowed to set up Kerberized remote commands over each network in your SP, you have as many rcmd principals as the number of adapters in your SP configuration.
When you are allowed to use the hardmon service on the CWS, you can talk to the hardmon daemon that monitors the serial link to the SP frame. Without hardmon service, you are not able to run any command that uses the serial line: these commands are spmon
, s1term
, hmcmds
, and hmmon
. That is why you cannot run spmon commands without a working Kerberos environment. By default only root.admin is allowed to use the hardmon services, but you can
change this configuration; see 8.7.7, “Hardmon Access Control List - hmacls
File” on page 177, for more information.
8.3.1 The /.klogin File
If you want to know which principal is allowed to use the Kerberos remote commands and services of a node, look in the /.klogin file. This file contains a list of all principals that may use a service. Even if you are authenticated as a
Kerberos principal and you try to contact a remote host using Kerberized commands, this /.klogin file will be checked after a lot of other authentication actions. Usually you are authenticated as a principal. If your principal name is not included in the .klogin file and you are trying to run a remote command, the execution will be refused. The following screen shows what this file looks like.
162
RS/6000 SP Software Maintenance
sp2en0:/# cat /.klogin root.admin@SP2EN0 anna.sap@SP2EN0 rcmd.sp2en0@SP2EN0 rcmd.sp2n01@SP2EN0 rcmd.sp2n05@SP2EN0 rcmd.sp2n06@SP2EN0 rcmd.sp2n07@SP2EN0 rcmd.sp2n08@SP2EN0 rcmd.sp2n09@SP2EN0 rcmd.sp2n10@SP2EN0 rcmd.sp2n11@SP2EN0 rcmd.sp2n12@SP2EN0 rcmd.sp2n13@SP2EN0 rcmd.sp2n14@SP2EN0 rcmd.sp2n15@SP2EN0
8.4 Kerberos Daemons
There are two daemons running on the authentication server which belong the Kerberos system, the kadmin daemon and the kerberos daemon. This is shown in the following screen.
sp2en0:/# ps -ef | grep kerb root 21614 22670 0 14:40:38 pts/6 0:00 grep kerb root 27664 7744 0 10:38:29 root 33850 7744 0 10:38:25
-
-
0:00 /usr/lpp/ssp/kerberos/etc/kadmind -n
0:00 /usr/lpp/ssp/kerberos/etc/kerberos
The kerberos daemon is used for the authentication of Kerberos users. The kadmind is only used during administrative tasks on the Kerberos database
(for example adding principals, changing passwords, and so on). It happens that using remote commands like dsh work fine, but issuing a kadmin command hangs; no input prompt returns. In such cases, make sure both daemons are running. You will not notice the death of the kadmind daemon as long as you are only asking for tickets and services.
These daemons are started from the /etc/inittab. Also, starting with PSSP 2.3, these daemons are managed by the system resource controller. You can start and stop them as follows:
# stopsrc -s kadmin
# startsrc -s kadmin
Taming Kerberos
163
A typical screen output of such a command sequence in shown in the following screen.
sp2en0:/# stopsrc -s kadmind
0513-044 The stop of the kadmind Subsystem was completed successfully.
sp2en0:/# startsrc -s kerberos
0513-059 The kerberos Subsystem has been started. Subsystem PID is 35738.
sp2en0:/# lssrc -a |grep kerb rpc.lockd
nfs kerberos kadmind
11096 active
35738 active inoperative
Before PSSP 2.3 version, in order to stop kerberos and kadmind, you had to change the respawn value in the /etc/inittab file of these daemons to off and then kill the corresponding processes.
8.5 setup_authent, setup_server and the Kerberos Database
All Kerberos principals, both users and services, are stored in the Kerberos database. But did you ever define a Kerberos principal except root.admin
during an SP installation? All service principals (hardmon and rcmds) are defined automatically either by setup_authent or by setup_server
.
In our example, setup_authent did run, but not setup_server, so what is the content of the Kerberos database at this stage? Let us look into the database by issuing the lskp command.
sp2en0:/# lskp
K.M
changepw.kerberos
default hardmon.sp2cw0
hardmon.sp2en0
krbtgt.SP2EN0
rcmd.sp2cw0
rcmd.sp2en0
root.admin
tkt-life: 30d tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 key-vers: 1 expires: 2037-12-31 23:59 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59
As shown in the preceding screen, you can see the root.admin user principal, hardmon service and rcmd principals for the CWS.
After filling up the SDR with node information, you execute setup_server for the very first time. One wrapper of setup_server
, called setup_CWS, scans the
SDR for the adapter names for all nodes. For every adapter name it then adds one rcmd principal, and sets a password for it. After running setup_CWS, the
164
RS/6000 SP Software Maintenance
Kerberos database contains many more entries, as shown in the following screen.
sp2en0:/# lskp | pg
K.M
tkt-life: 30d changepw.kerberos
default tkt-life: 30d tkt-life: 30d hardmon.sp2en0
krbtgt.SP2EN0
rcmd.sp2cw0
rcmd.sp2en0
key-vers: 1 expires: 2037-12-31 23:59 key-vers: 1 expires: 2037-12-31 23:59 key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 rcmd.sp2css01
rcmd.sp2css05
rcmd.sp2css06
rcmd.sp2css07
rcmd.sp2css08
rcmd.sp2css09
rcmd.sp2css10
rcmd.sp2css11
tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 rcmd.sp2n01
rcmd.sp2n05
rcmd.sp2n06
rcmd.sp2n07
rcmd.sp2n08
rcmd.sp2n09
rcmd.sp2n10
rcmd.sp2n11
root.admin
tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59
Now there is one rcmd principal defined for each adapter in the SP system.
As already mentioned, this service principal provides the Kerberized remote commands.
The /etc/krb.realms file has also been filled up with all these new adapters.
This file indicates which adapter belongs to which realm. The example shows only the first part of it.
sp2en0:/# cat /etc/krb.realms | pg sp2cw0 SP2EN0 sp2n01 SP2EN0 sp2tr01 SP2EN0 sp2css01 SP2EN0 sp2n05 SP2EN0 sp2css05 SP2EN0 sp2n06 SP2EN0 sp2css06 SP2EN0 sp2n07 SP2EN0 sp2css07 SP2EN0 sp2n08 SP2EN0 sp2css08 SP2EN0
^C
Taming Kerberos
165
8.6 The krb-srvtab File
The rcmd principals have been defined and a random password has been set for them, as you set it for the root.admin principal during the installation of the
CWS. These rcmd passwords are all stored in the Kerberos database and also in the /etc/krb-srvtab files.
While create_krb_files
(a wrapper of setup_server) is running, these srvtab files are created for each node that has a bootp_response value of install, customize or migrate in the SDR. These srvtabs are node-dependent and contain the encrypted passwords for the rcmd principals of each node. The different srvtab files are extracted from the Kerberos database and at first
CWS for the SP nodes:
166
RS/6000 SP Software Maintenance
which nodes' bootpresponse is on customize, install or migrate?
extract the srvtab for the corresponding rcmd principals
rcmd sp2en0 255 1 1 0 10cb558e 72d4f65c 203801010459 199904212101 root admin
K M 255 1 1 0 f21f04fe a5f20edc 203801010459 199904212101 db_creation * hardmon sp2cw0 255 1 1 0 beab30fc 1a6c5384 203801010459 199904212101 root admin rcmd sp2css07 255 1 1 0 a1bfc471 48c942fd 203801010459 199904212105 root admin anna sap 3 1 1 0 e5270e06 4b6d3066 200004270459 199904222132 root admin
rcmd sp2n06 255 1 1 0 b4b6542a 27abcdd8 203801010459 199904212105 root admin rcmd sp2n13 255 1 1 0 cf88f27f 556f40ed 203801010459 199904212105 root admin rcmd sp2tr01 255 1 1 0 45bda4ec 37a3c2ad 203801010459 199904212105 root admin
rcmd sp2css06 255 1 1 0 3170111 84a51456 203801010459 199904212105 root admin
node_number
6 loop through all addresses of this node
Kerber
Databa os se
/tftpboot/sp2n06-new-srvtab
Figure 31. The create_krb_files Wrapper
At the end of a node installation, the customize script pssp_script is running on the node. This script fetches the Kerberos-related files and the appropriate new srvtab and moves it to /etc/krb-srvtab. Though the srvtab, this file actually is not readable, as shown in the srvtab file of node07: p2n07:> cat /etc/krb-srvtab rcmdsp2css07SP2EN0½Ïuî¯LŠrcmdsp2n07SP2EN0´bøÄ#
Taming Kerberos
167
However, it is possible to identify the principal strings: rcmd followed by the realm string. Our example output in the previous screen shows the passwords for the principals rcmd.sp2css07@SP2EN0 and rcmd.sp2n07@SP2EN0.
After the installation of the whole SP, each SP node has its own
/etc/krb-srvtab file.
8.7 Tickets and Keys
There are two kinds of tickets used in an Kerberos environment:
• Ticket-granting-ticket
• Service-ticket
To point out the difference between these two kinds of tickets, let us imagine a trip to Disney World. At the entrance you have to buy a ticket just to enter, and this ticket allows you to stay inside for a defined period of time. This initial ticket can be compared to the Kerberos ticket-granting-ticket which has a specified lifetime. It only gives you permission to use Kerberos services in this realm. But you do not enter Disney World (or a Kerberos realm) just to get inside and then stay in only one place; instead, you are going to have some fun and visit several shows.
If Disney World was organized like a Kerberos realm, there would be a central ticket counter. In a Kerberos realm, this ticket counter (called the
Ticket-Granting-Server) is located in the authentication server and it already issued you the ticket-granting-ticket. To get access to one of the Disney shows or even better -- to one of the Kerberos services, you have to show your ticket-granting-ticket at the central ticket counter. It will be checked and if it is accepted, you will obtain a ticket (in Kerberos terminology, a service
ticket) to visit the show -- or use Kerberos remote services. Without a ticket-granting-ticket, you will not get any access.
Now keys, as we know, are used both to lock something and to open it again.
If two people share the same key, one can lock and the other can reopen. In the Kerberos environment, we use two different keys:
• Session Key
• Private Key
While a Session Key is generated randomly and used for communication between two parties, the private key is generated based on the principals’ password plus a timestamp.
168
RS/6000 SP Software Maintenance
How are these tickets and keys distributed and used without sending any unencrypted passwords over the network? If you are interested in the whole theory of Kerberos encryption, refer to Inside the RS/6000 SP, SG24-5145.
8.7.1 Ticket-Granting-Ticket
The last part of the setup_authent command invokes the kinit command for the first principal you have defined, usually root.admin. This kinit root.admin
prompts you again for the password of this principal, then provides you with a ticket-granting-ticket. Since this ticket-granting-ticket has by default a lifetime of 30 days, you have to get a new one after expiration. To obtain a new ticket-granting-ticket for root.admin, just issue kinit root.admin
at the command prompt. If you are interested in the values of your current ticket, klist gives you the information needed, as shown.
sp2en0:/# kinit root.admin
Kerberos Initialization for "root.admin"
Password: sp2en0:/# klist
Ticket file:
Principal:
/tmp/tkt0 root.admin@SP2EN0
Issued Expires Principal
Apr 16 10:58:08 May 15 10:58:08 krbtgt.SP2EN0@SP2EN0
To say kinit <principal.instance> has been compared to logging on to
Kerberos, while the Kerberos identity you get is dependent on your UNIX identity. In our example, the UNIX root user is authenticated as the Kerberos root.admin principal. This combination of two identities is kept in your ticket-granting-ticket.
The first line of the klist output always points to your ticket cache file that
stores your ticket; therefore, it will survive a reboot of the machine (see 8.7.4,
“The Ticket Cache File” on page 174). The second line indicates the owner of
this ticket, and the first line of the next block represents the ticket-granting-ticket itself.
8.7.2 Contents and Purpose of the Ticket-Granting-Ticket (tgt)
It does not matter whether you get a root.admin ticket-granting-ticket (tgt) on the host that plays the role of the authentication server, or one of your client nodes. The rule is that a password must not pass the network unencrypted.
In the authentication server itself exists an instance, the so-called
Ticket-Granting-Server (TGS), that generates the tickets. Assume you are
Taming Kerberos
169
going to get a tgt for root.admin on n07. Due to the fact that all nodes belonging to a realm have a copy of the /etc/krb.conf file that points to the authentication server, the kinit command detects which server has to be
contacted. Figure 32 shows the process of getting a tgt and the contents of
that tgt:
Client n07
# kinit root.admin
1
Kerberos Initialization for "root.admin"
Password:
3
As soon as you type the root.admin password locally a key is generated that opens the outer brackets of this packet. Then the
Session Key for n07-TGS can be obtained and the tgt is stored under /tmp/tkt0.
tgt
Session n07-TGS
/tmp/tkt0
/.klogin
/etc/krb.realms
/etc/krb.conf
/etc/krb-srvtab
/.k
kinit root.admin
2 root.
admin
tgt
time stamp ticket lifetime sp2n07 session key n07-TGS
Session Key n07-TGS
/.k
Private
Key for root.admin
principal
Authentication
Server
Who ? Time ?
Part of Realm ?
Okay
/.k
/etc/krb.realms
/etc/krb.conf
/etc/krb-srvtab
/tmp/tkt0
/var/kerberos/\ database/*
/.k
Ticket-Granting
Server - TGS
Figure 32. The kinit Procedure
Step 1
: The client n07 contacts the authentication server asking for a ticket-granting-ticket for the root.admin principal.
Step 2
: The authentication server verifies whether this client is part of the realm and creates a packet including a tgt encrypted with the master key (/.k), so that only the authentication server itself is able to open it. This tgt contains the client’s name, the name of the Ticket-Granting-Server, a timestamp, the ticket lifetime, the client’s IP address, and the Session Key for n07-TGS.
Furthermore, a copy of this Session Key n07-TGS is appended to the tgt. The
170
RS/6000 SP Software Maintenance
whole packet is encrypted with the root.admin’s private key, which is a key that is generated based on the root.admin password plus a timestamp.
Since the root.admin password is stored in the Kerberos database, the authentication server (or TGS) can generate the key locally. Now the packet is sent back to the client.
Step 3
: When this packet arrives at the client node, the password prompt appears. The root.admin password is issued locally and the Kerberos code on the client generates a key as well, based on that password. If the password was correct, the generated key is able to open the packet. Only the session key contained in the outer brackets is usable by the client (node n07) for further communication. The ticket-granting-ticket itself is always a kind of
"black box" for the client. Only the authentication server is able to open it with its master key (/.k). This tgt is now stored under /tmp/tkt0 on client n07.
8.7.3 Asking for a Remote Service
Now n07 holds a tgt for the root.admin principal. But to use the remote services of another node he needs an additional service ticket for that node.
He has to contact the authentication server again, showing his tgt and asking for a service ticket for the remote node.
Let us assume n07 asks for a remote service of n08; n07 wants to set up a remote shell command on n08. We divide this procedure into two steps, first the communication with the authentication server and then the direct communication between n07 and n08.
Taming Kerberos
171
Client n07
dsh -w n08 (part I)
Authentication
Server
# dsh -w n08 ls
4 n08,
tgt
Packet is opened with
Session Key n07-TGS.
Get Session Key n07-n08 and keep
Service Ticket for n08 n07 n07
Authenticator
IP address timestamp
Packet is opened with
Session Key 7+TGS
Session n07-TGS
tgt
=
Authenticator
?
Does the info in the tgt correspond to the info in the Authenticator ?
5
Service
Ticket for n08
Session n07-n08
Private
Key of n08
,
Session Key n07-n08
Session n07-TGS
Okay
/.k
Service
Ticket for n08
Session n07-n08
Private
Key of n08
Session n07-n08
Session n07-TGS
tgt
Session n07-TGS
Ticket-Granting
Server - TGS
Figure 33. Requesting a Service Ticket
Step 4
: Client n07 wants to set up an ls command on n08 via remote shell.
Since n07 has no service ticket for n08 yet, the authentication server has to be contacted first. n07 wraps a packet with the following contents: n08, the host that wants to be contacted, the ticket-granting-ticket, and an authenticator storing the name and IP-address of n07, plus a timestamp. This authenticator is encrypted with the Session Key n07-TGS. This packet is sent to the authentication server.
Unlike the tickets, an authenticator can only be used once.
The authentication server opens this packet with its Session Key n07-TGS and opens the tgt with its master key. The content of the tgt is compared to
172
RS/6000 SP Software Maintenance
the content of the authenticator to insure that n07 is the valid owner of the tgt.
If the check returns okay, the next step (step 5) will be executed.
Step 5: The authentication server returns a packet to n07. The first part contains the Service Ticket for n08 including the Session Key for n07 and n08
(n07-n08). This part is encrypted with the private key of n08 so that only n08 is able to open it. Then the same Session Key n07-n08 is appended and the whole packet encrypted again with the Session Key n07-TGS. 07 opens the packet with its Session Key n07-TGS and gets the Session Key n07-n08 and an encrypted packet containing the Service Ticket for n08.
In Figure 34 you can follow part II of the
dsh command:
Client n07
# dsh -w n08 ls
6 n08: .profile
n08: .klogin
n08: audit
: n08: bin n08: dev
Service
Ticket for n08
Session n07-n08
Private
Key of n08
Session n07-n08
Session n07-TGS
tgt
Figure 34. Asking for Remote Services
Client n08
dsh -w n08 (part II)
Authenticator
Session n07-n08
,
Service
Ticket for n08
Session n07-n08
Return of the ls command
Private
Key of n08
7
The packet with the service ticket is decrypted with the password stored under
/etc/krb-srvtab. The obtained Session Key n07-n08 then opens the
Authenticator packet.
The contents of the
Authenticator and the
Service Ticket are compared.
/.klogin check
Okay
The ls command is executed and the result returned to client n07.
Taming Kerberos
173
Step 6: Client n07 now has the Service Ticket and the integrated Session
Key. Since all this is encrypted with the private key of n08, it can only be opened by n08 with the appropriate password stored in /etc/krb-srvtab. Now the Service Ticket and the Session Key n07-n08 are obtained. With the justextracted Session Key n07-n08, the Authenticator can be decrypted. The contents of the authenticator are compared with the contents of the Service
Ticket.
Step 7: If the comparison was successful and the principal root.admin@REALM (who is asking for the remote service) is included in the
/.klogin, the ls command will be executed. The screen output of the command will be returned to n07. If one of these two conditions was not met the remote service for n07 will be refused.
8.7.4 The Ticket Cache File
The ticket cache file stores all tickets that a Kerberos principal has obtained, tgts as well as service tickets. The service tickets live as long as the ticket-granting-ticket. As soon as service tickets for all clients are appended to that file, the authentication server is not longer contacted when a remote service is asked for since all the tickets are available and the authenticator is created locally.
In general the Kerberos ticket cache files are stored in the /tmp directory. The file name tkt0 by default is composed of tkt followed by the UID of the UNIX user that asked for this ticket-granting-ticket. Since the root user has the UID
0, the ticket cache file’s name is /tmp/tkt0. However, you can name this file whatever you want since the correlation of UNIX UID and Kerberos principal is integrated in this file. Having the UID as the ending of the file just facilitates the identification for you.
You may want to store the ticket files in another location. This can be done with the KRBTKT environment variable in the ~/.profile; for example:
# export KRBTKFILE=~/tkt$LOGIN
This command will name the tgt for UNIX user anna to be
/home/anna/tktanna. Wherever you decide to store the tickets, insure that each user who asks for a ticket is allowed to write to this directory; that is the reason why it is stored in /tmp by default.
The fact that the ticket-granting-ticket is a combination of the UNIX and
Kerberos identity makes it possible that several UNIX users can log on or authenticate as the same Kerberos principal. The principal name is totally
174
RS/6000 SP Software Maintenance
independent from the UNIX user name. It may be the same but it does not need be. Let us look at an example.
Assume we have two UNIX users on the system and a Kerberos user
principal called anna.sap (refer to 8.8.2, “Make a Kerberos Principal” on page
181 for information about how a new Kerberos principal is created).
p2en0:/# lsuser anna,heiner | awk ’{print $1" "$2}’ anna id=203 heiner id=209
Both anna and heiner can authenticate themselves as Kerberos principal anna.sap (if both know the password) without any collisions, as shown in the following screens: sp2en0:/home/anna $ id uid=203(anna) gid=1(staff) sp2en0:/home/anna $ kinit anna.sap
Kerberos Initialization for "anna.sap"
Password: sp2en0:/home/anna $ klist
Ticket file:
Principal:
/tmp/tkt203 anna.sap@SP2EN0
Issued Expires Principal
Apr 16 16:13:36 May 15 16:13:36 krbtgt.SP2EN0@SP2EN0 sp2en0:/home/heiner $ id uid=209(heiner) gid=1(staff) sp2en0:/home/heiner $ kinit anna.sap
Kerberos Initialization for "anna.sap"
Password: sp2en0:/home/heiner $ klist
Ticket file: /tmp/tkt209
Principal: anna.sap@SP2EN0
Issued Expires Principal
Apr 16 16:17:02 May 15 16:17:02 krbtgt.SP2EN0@SP2EN0
Since anna and heiner have different ticket cache files, they can share one
Kerberos principal. You may ask: "For what purpose?”
8.7.5 Authentication and Authorization
Say we have an SP system with 12 nodes. All the nodes belong to the same
Kerberos realm. You provide 8 nodes to login for your UNIX users. Each UNIX user is known on each node and you want to allow remote shell commands
Taming Kerberos
175
between the nodes. You have the choice of creating a /etc/hosts.equiv file or
~/.rhosts files, or else the users are prompted for their passwords when using rexec
. All these possibilities are more or less unsecure, especially the .rhosts
file. As shown in the last section (8.7.4, “The Ticket Cache File” on page 174),
several UNIX users can share one Kerberos principal because the combination of UNIX and Kerberos identity makes the Kerberos ticket unique.
Kerberos does the authentication and permits or denies (by providing you a ticket or not) sending remote shell commands to the hosts belonging to the same realm. The authorization (meaning the permission to read, write or execute a file) is still done by the remote UNIX system by checking the UNIX permissions. For instance, whether or not anna is allowed to copy a file to a remote machine is under the control of the remote UNIX system. If anna is not known on the remote host, the rcp command will be refused even if the
Kerberos authentication was okay.
Keep in mind that for Kerberized remote shell commands, the users UID must be the same on all the nodes. As long as your are using the SP User
Management this facility takes care of it.
8.7.6 Never-Expiring Ticket
You may wonder why is it possible to get a never-expiring ticket-granting-ticket for the rcmd.<node> principal by executing rcmdtgt without issuing any password?
sp2n01:/# /usr/lpp/ssp/rcmd/bin/rcmdtgt sp2n01:/# klist
Ticket file: /tmp/tkt0
Principal: rcmd.sp2n01@SP2EN0
Issued Expires
Apr 6 17:25:22 Never
Principal krbtgt.SP2EN0@SP2EN0
The authentication is done undercover, but instead of issuing a password the contents of the /etc/krb-srvtab is used. The same processes as described in
8.7.2, “Contents and Purpose of the Ticket-Granting-Ticket (tgt)” on page 169
are done undercover. The authentication server sends a packet back to the client that is encrypted with the client’s private key. The password stored under /etc/krb-srvtab can decrypt it, consequently the ticket-granting-ticket can be obtained.
176
RS/6000 SP Software Maintenance
8.7.7 Hardmon Access Control List - hmacls File
As mentioned in 8.3, “Kerberos Principals” on page 161, hardmon is a
Kerberos-protected service that is only available on the CWS. Once you have permission to use this service, it offers the contact to the hardmon daemon that monitors and controls the serial link to the SP frame. By default, only the root.admin principal (or other first user principal) is allowed to obtain a service ticket for the hardmon service because an access control list file exists for the hardmon service and only root.admin is allowed to use this
service. This file is called /spdata/sys1/spmon/hmacls. Figure 35 shows the
hmacls file.
# c a t /s p d a ta /s y s1 /s p m o n /h m a c ls s p 2 e n 0 ro o t.a d m in a s p 2 e n 0 h a rd m o n .s p 2 e n 0 a
1 ro o t.a d m in vs m
1 h a rd m o n .s p 2 e n 0 vs m p e rm is s io n s vs m fo r fra m e n u m b e r 1 th is p rin c ip a l h a s th e
...
Figure 35. The hmacls File
There are four different sets of permissions, indicated by a single lowercase character:
• m (monitor) - monitor hardware status
• v (virtual front operator pane) - control/change hardware status
• s (serial access) - access to node’s console via serial port (s1term)
• a (administrative) - use hardmon administrative commands
With this permission scheme it is possible to permit a Kerberos principal just the use of starting Perspectives for monitoring, without permitting the issuing of commands over the serial line. For issuing commands, you have to define a
Kerberos principal (see 8.8.2, “Make a Kerberos Principal” on page 181), and
then add it to the hmacls file. For example, assume anna.sap is a defined
Taming Kerberos
177
Kerberos principal and we add an entry for her to the
/spdata/sys1/spmon/hmacls file; see the following example. Be careful with this file because having even one blank line means hardmon will not serve anything, even if the hardmon subsystem is active.
sp2en0:/# vi /spdata/sys1/spmon/hmacls sp2en0 root.admin a sp2en0 hardmon.sp2en0 a
1 root.admin vsm
1 hardmon.sp2en0 vsm
1 anna.sap m
Then get a ticket-granting-ticket for anna.sap, as follows:
# kinit anna.sap
Trying spmon -d now would return an error message because the hardmon subsystem has to be stopped and restarted so that the hmacls file is read again:
# stopsrc -s hardmon
# startsrc -s hardmon
As Kerberos principal anna.sap, let us now try to use spmon -d
:
11
12
13
14 sp2en0:/# spmon -d | grep thin
5
6
5
6 thin thin on on yes yes yes yes
7
8
9
10
7
8
9
10 thin thin thin thin on on on on yes yes yes yes yes yes yes yes
11
12
13
14 thin thin thin thin on yes yes on yes yes on yes yes on yes yes normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no
That works, but when we try to open a write terminal on a node, the following message is returned: sp2en0:/# s1term -w 1 7 s1term: 0026-645 The S1 port in frame 1 slot 7 cannot be accessed.
It either does not exist or you do not have S1 permission.
It does not matter which UNIX user has a ticket for the principal anna.sap.
Neither root nor any normal UNIX user who is authenticated as anna.sap is
178
RS/6000 SP Software Maintenance
allowed with these permissions in the hmacls file to issue commands to hardmon, other than simple monitoring commands.
8.7.8 Kerberos Database Access Control Lists
The Kerberos database located under the /var/kerberos/database on the authentication server consists of two encrypted files, the principal.pag and principal.dir file. Besides these files, there also exist ACL files that are standard ASCII files: sp2en0:/# ls -l /var/kerberos/database total 78
-rw-r----1 root
-rw-r----1 root system system
11 Apr 6 16:40 admin_acl.add
11 Apr 6 16:40 admin_acl.get
-rw-r----1 root
-rw------1 root
-rw------1 root
-rw------1 root system system system system
11 Apr 6 16:40 admin_acl.mod
4096 Apr 6 16:41 principal.dir
0 Apr 6 16:38 principal.ok
82944 Apr 6 16:41 principal.pag
The contents of these files indicate which Kerberos principal is allowed to do the following:
• Add (admin_acl.add) information to the database; for example add a principal.
• Get (admin_acl.get) information from the database.
• Modify (admin_acl.mod) information in the database.
By default, the very first admin principal, usually root.admin, is included in each file. The principals that can get permissions for the database must have the instance admin. To delegate some of the administrative tasks, you can create another admin principal and allow SP monitoring tools by adding him
appropriate admin_acl file(s).
8.8 Kerberos Commands
With PSSP 2.3 other convenient commands have been added. They are located under /usr/lpp/ssp/kerberos/bin:
• lskp
- list Kerberos principal
• mkkp
- make Kerberos principal
• chkp
- change Kerberos principal
Taming Kerberos
179
• rmkp
- remove Kerberos principal
In the following sections we discuss every command and demonstrate the differences between these new commands and the old Kerberos commands, such as kdb_util
, kadmin
, and kdb_edit
.
8.8.1 List a Kerberos Principal
Before the lskp command was provided, two steps were required to list
Kerberos principals. Since the Kerberos database is not readable, you first had to dump it into an ASCII file and then look into this file.
Kerb eros
Data base
# kdb_util dump /tmp/kerbdb
# cat /tmp/kerbdb | pg
Kerb eros
Data base rcmd sp2en0 255 1 1 0 10cb558e 72d4f65c 203801010459 199904212101 root admin
K M 255 1 1 0 f21f04fe a5f20edc 203801010459 199904212101 db_creation * rcmd sp2css10 255 1 1 0 80388e29 b7db38b8 203801010459 199904212105 root admin rcmd sp2css14 255 1 1 0 800b752 81ccd848 203801010459 199904212105 root admin hardmon sp2cw0 255 1 1 0 beab30fc 1a6c5384 203801010459 199904212101 root admin rcmd sp2css07 255 1 1 0 a1bfc471 48c942fd 203801010459 199904212105 root admin rcmd sp2n06 255 1 1 0 b4b6542a 27abcdd8 203801010459 199904212105 root admin rcmd sp2n13 255 1 1 0 cf88f27f 556f40ed 203801010459 199904212105 root admin rcmd sp2n15 255 1 1 0 c94098e1 e631d4af 203801010459 199904212105 root admin
Figure 36. Dump the Kerberos Database
This command is useful to find out which principals are currently defined and what the current settings are; for example, the expiration date of a Kerberos account, the ticket lifetime and the key version of this principal, as shown in the following screen: sp2en0:/# lskp
K.M
changepw.kerberos
default hardmon.sp2cw0
hardmon.sp2en0
krbtgt.SP2EN0
rcmd.sp2cw0
rcmd.sp2en0
root.admin
tkt-life: 30d tkt-life: 30d tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 key-vers: 1 expires: 2037-12-31 23:59 key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59
180
RS/6000 SP Software Maintenance
8.8.2 Make a Kerberos Principal
When you are going to define a new Kerberos user principal, it has to be included in the Kerberos database and you have to set a password. Let us create the Kerberos principal anna.sap (that no longer exists in the database) with the new mkkp command:
# mkkp anna.sap
This creates a new principal anna with the instance sap in the Kerberos database. But we have not yet set a password for anna.sap. Since these new
PSSP Kerberos commands are not interactive, you are not able to set the initial password with the mkkp command. But without a password, the principal is blocked. The consequence is that you have to set the password later by getting the kadmin prompt and setting the first password with the change_principal_password
( cpw
) option. There is always one additional step required, so we recommend that you use the kadmin command at once to define a new principal. You will be prompted for the root.admin password first and then set the password for the new principal as in the following example.
Note: Regarding the kadmin command: in case you named your very first
Kerberos principal not root.admin but instead decided to chose a different name with instance admin, you have to specify this admin user as a parameter for the kadmin command. For instance, if your admin principal is sp.admin, you will get the kadmin prompt by typing:
# kadmin -u sp.admin
The kadmin command, without any options, expects the default root.admin
principal. Now let us define anna.sap with the old kadmin command:
Taming Kerberos
181
sp2en0:/# kadmin
Welcome to the Kerberos Administration Program, version 2
Type "help" if you need it.
admin:
?
Available admin requests: change_password, cpw Change a user’s password change_admin_password, cap add_new_key, ank
Change your admin password
Add new user to kerberos database get_entry, get destroy_tickets, dest help list_requests, lr, ?
Get entry from kerberos database
Destroy admin tickets
Request help with this program
List available requests.
Exit program.
quit, exit, q admin:
ank anna.sap
Admin password:
Password for anna.sap:
Verifying, please re-enter Password for anna.sap: anna.sap added to database.
admin:
quit
Cleaning up and exiting.
8.8.3 Change a Kerberos Principal
To change the expiration and ticket lifetime of an existing Kerberos principal, you can use the following command:
# chkp [-e <expiration>] [-l <lifetime>] <principal>[.<instance>]
...
How can you find out when a Kerberos principal account will expire and how long the ticket lifetime for this principal currently is? The best way is to use both commands, lskp and kdb_util dump
, especially when you want to change something. The lskp command returns the expiration date and the ticket lifetime in hours and minutes.
However, when you want to change the ticket lifetime of a principal, you cannot specify hours and minutes; instead you need to specify values from 1 to 255. Values 1 to 128 represent multiples of 5 minutes, while lifetime values from 129 to 255 represent intervals from 11 hours and 40 minutes to 30 days
(see IBM Parallel System Support Programs for AIX: Administration Guide,
GC23-3897).
The current value is readable in the dump file of the database. Let us first have a look to the current settings of the principal anna.sap, and then use the chkp command to change them.
182
RS/6000 SP Software Maintenance
sp2en0:/# lskp |grep anna.sap anna.sap
tkt-life: 10:40 key-vers: 1 expires: 2037-12-31 23:59 sp2en0:/# chkp -e 2000-04-26 anna.sap sp2en0:/# lskp |grep anna.sap anna.sap
tkt-life: 10:40 key-vers: 1 expires: 2000-04-26 23:59
As you can see, the expiration date has been changed and the ticket lifetime is 10 hours and 40 minutes. You can look up this value by using the kdb_util dump command.
Kerb eros
Data base
# kdb_util dump /tmp/kerbdb
# cat /tmp/kerbdb |pg rcmd sp2en0 255 1 1 0 10cb558e 72d4f65c 203801010459 199904212101 root admin
K M 255 1 1 0 f21f04fe a5f20edc 203801010459 199904212101 db_creation * rcmd sp2css10 255 1 1 0 80388e29 b7db38b8 203801010459 199904212105 root admin rcmd sp2css14 255 1 1 0 800b752 81ccd848 203801010459 199904212105 root admin hardmon sp2cw0 255 1 1 0 beab30fc 1a6c5384 203801010459 199904212101 root admin rcmd sp2css07 255 1 1 0 a1bfc471 48c942fd 203801010459 199904212105 root admin
anna sap 128 1 1 0 e5270e06 4b6d3066 203801010459 1999074221922 root admin
rcmd sp2n06 255 1 1 0 b4b6542a 27abcdd8 203801010459 199904212105 root admin rcmd sp2n13 255 1 1 0 cf88f27f 556f40ed 203801010459 199904212105 root admin rcmd sp2n15 255 1 1 0 c94098e1 e631d4af 203801010459 199904212105 root admin
Kerb eros
Data base
Figure 37. Kerberos Database Dump File
The current ticket lifetime of anna.kerb is 128, or 10 hours and 40 minutes. So let us change this value to 15 minutes.
sp2en0:/# chkp -l 3 anna.sap sp2en0:/# lskp | pg
K.M
anna.sap
changepw.kerberos
default hardmon.sp2cw0
hardmon.sp2en0
tkt-life: 30d tkt-life: tkt-life: 30d tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59
00:15 key-vers: 1 expires: 2000-04-26 23:59 key-vers: 1 key-vers: 1 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 krbtgt.SP2EN0
rcmd.sp2css01
rcmd.sp2css05
^C tkt-life: 30d key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59
Taming Kerberos
183
With the database dump file, you get the ticket lifetime value 3: sp2en0:/# kdb_util dump /temp/kerbdb sp2en0:/# cat /tmp/kerbdb | grep anna anna sap 3 1 1 0 e5270e06 4b6d3066 200004270459 199904162132 root admin
As mentioned before, these new PSSP Kerberos commands make handling
Kerberos more convenient, but you can also use the old commands. For changing Kerberos principals, the corresponding command is kdb_edit
: p2en0:/# kdb_edit
Opening database...
Enter Kerberos master key:
Previous or default values are in [brackets] , enter return to leave the same, or new value.
Principal name: anna
Instance: sap
Principal: anna, Instance: sap, kdc_key_ver: 1
Change password [n] ? n
Expiration date (enter yyyy-mm-dd) [ 2000-04-27 ] ? 1999-04-16
Max ticket lifetime [ 3 ] ? 10
Attributes [ 0 ] ?
Edit O.K.
Principal name: <just press Enter to exit> # comment by author
#
8.8.4 Remove a Kerberos Principal
Until PSSP 2.2, deleting a Kerberos principal consisted of three steps: dump the Kerberos database to a file as shown in the examples; delete the line of the Kerberos principal you want to remove; load the database again with the kdb_util load <file> command. With the rmkp command, however, deleting a principal became very easy. To delete anna.sap, just type:
# rmkp anna.sap
Then confirm the removal.
For all new Kerberos commands introduced with PSSP 2.3, only root authority is required; no ticket-granting-ticket is needed.
184
RS/6000 SP Software Maintenance
8.9 Reinitializing Kerberos without Rebooting the Nodes
In a situation where you are forced to fix Kerberos problems but you cannot afford to reboot the nodes in customize mode, or effect the running switch by executing the pssp_script manually, all the Kerberos-related files can be generated and distributed without any reboot. Here is a short example of how you can set up your Kerberos environment from scratch by using parts of these procedures.
8.9.1 Removing or Reinitializing the Kerberos Subsystem
Since PSSP 2.3, reinitializing the Kerberos subsystem is quite easy. You just issue setup_authent and confirm the replacement of the Kerberos setup. The database is replaced, the Kerberos subsystems are stopped and restarted, and if you already have an installed SP, all nodes will be set automatically to
customize,
and setup_server will be started to rebuild the new krb-srvtab files for the nodes. What is left to do is the distribution of the appropriate files to the nodes.
If you want to keep some of the Kerberos configuration files, then follow the steps as in older PSSP versions before a rerun of setup_authent takes place.
Until PSSP 2.3 you had to delete the Kerberos files manually. Since it is possible to keep some of the files, especially if you edited them, we will not only describe how to delete all Kerberos files, but also tell you which files can be kept.
# rm /.k /.klogin
The master key /.k must be removed, but /.klogin can be kept.
# rm /etc/krb*
This command deletes /etc/krb.conf, /etc/krb.realms and /etc/krb-srvtab.
While krb-srvtab has to be removed, krb.conf and krb.realms files can stay.
Setup_authent will use the contents for the new setup.
# kdb_destroy
This command deletes the database files, meaning the principal.pag and the principal.dir files under /var/kerberos/database. The ACL files for the
Kerberos database remain. In each ACL file there is an entry for the root.admin principal; consequently, if you set up your Kerberos with a different first admin principal than root.admin, you will run into trouble.
Taming Kerberos
185
Therefore, to delete all these files use the command:
# rm /var/kerberos/database/*
As soon as you have deleted the Kerberos database and the master key /.k, the setup_authent command will not run into the replacement loop but instead set up Kerberos again based on the existing files. Now you can rerun:
# setup_authent
Up to PSSP 2.3, the krb-srvtab files are recreated automatically.
Since we plan neither a reboot of the nodes nor starting the pssp_script manually, we can set all the nodes back to response=disk and then distribute the files by file transfer. For the bootp_response value customize, no NIM resource allocation has taken place; therefore, it is sufficient to change this response value in the SDR without running setup_server
.
# spbootins -r disk -l 1,5,6,7 -s no
All new-srvtab files for the nodes are stored under the /tftpboot directory on the CWS. At a minimum, these files have to be copied to the nodes. If nothing changed in your configuration, meaning you are using the same realm name and the same adapter names, the /etc/krb.conf and /.klogin files on the nodes can still be used. Regardless of whether the Kerberos client could use the old files or really needs the new ones, we will demonstrate the transfer of all files that are required on a Kerberos client for sp2n01. Due to the long outputs of the ftp command, we cut the messages out and show only the essential part: sp2en0:/# ftp sp2n01 ftp> bin ftp> put /tftpboot/sp2n01-new-srvtab /etc/krb-srvtab ftp> put /etc/krb.conf ftp> put /etc/krb.realms ftp> put /.klogin ftp > quit
This procedure has to be repeated for each node. After the distribution of the new-srvtab files to the nodes, delete them under /tftpboot on the CWS, since they are no longer needed in this place and this directory permits access for tftp.
# rm /trftpboot/*new*
186
RS/6000 SP Software Maintenance
8.10 Recreating the krb-srvtab File for a Node
The krb-srvtab files were always created on the primary authentication server
(usually the CWS), and then distributed to the appropriate nodes. You have two possibilities to build a new krb-srvtab file for a node.
• By using the create_krb_files
(or setup_server
) command
• By using the ext_srvtab command
8.10.1 The create_krb_files Wrapper
Since create_krb_files only creates the krb-srvtab file for nodes that have a bootp_response value of customize, install or migrate in the SDR, you have to set the node to customize before running the following command:
# spbootins -r customize -l 6 -s no
This command sets node 6 to customize and setup_server will not run due to the -s option. We only need the wrapper create_krb_files
(refer to Figure 31 on page 167). This command creates the new-srvtab file for node 6 under
/tftpboot.
When create_krb_files has finished and you have obtained all needed
<node>-new-srvtab files, do not forget to set the appropriate node(s) back to disk:
# spbootins -r disk -l 6 -s no
In this case, it is sufficient just to set the bootp_response value in the SDR back to disk. For “customize”, no resource has been allocated, therefore it is not necessary to run the unallnimres command (or setup_server
).
Now you can transfer the /tftpboot/<node>-new-srvtab to the concerned node(s) under /etc/krb-srvtab.
8.10.2 The ext_srvtab Wrapper
The ext_srvtab command works independently of the SDR; therefore, it is not necessary to change any value in the SDR before creating new-srvtabs. This command extracts the srvtab directly from the Kerberos database. This works only per adapter (in other words, per rcmd principal). Following is the syntax of this command:
# ext_srvtab [-n] [-r realm] [instance ...]
Taming Kerberos
187
If you use the -n option you will not be prompted for the Kerberos master key as long as a /.k file exists. Usually the -r option is not needed either, since the realm name of the rcmd principal is the same as your local realm in which you are authenticated as root.admin. The instance of a service principal is always the adaptername (short form, without domain) over which the service is offered. Now let us extract the srvtab for node 7 that has one Ethernet and one switch adapter. Due to the equality of adapter names and instance names of the rcmd principals, the command must look as follows:
# ext_srvtab -n sp2n07 sp2css07
The result is two new-srvtab files in the current directory. What is still left to do is the concatenation of all the srvtab files belonging to a node into one srvtab file (whose name is free-choosable), and then the transfer to the corresponding node as /etc/krb-srvtab file.
# ext_srvtab -n sp2n07 sp2css07
Generating ’sp2n07-new-srvtab’....
Generating ’sp2css07-new-srvtab’....
# ls -la | grep 07
-rw------1 root system 30 Mar 24 14:23 sp2css07-new-srvtab
-rw------1 root system 28 Mar 24 14:23 sp2n07-new-srvtab
# cat sp2n07-new-srvtab sp2css07-new-srvtab > node07-new-srvtab
# ls -la | grep srvtab
-rw-r--r-1 root system
-rw------1 root
-rw------1 root system system
58 Mar 24 14:25 node07-new-srvtab
30 Mar 24 14:23 sp2css07-new-srvtab
28 Mar 24 14:23 sp2n07-new-srvtab
# ftp sp2n07
Connected to sp2n07.
220 sp2n07 FTP server (Version 4.1 Tue Mar 17 14:00:13 CST 1998) ready.
Name (sp2n07:root):
331 Password required for root.
Password:
230 User root logged in.
ftp> bin
200 Type set to I.
ftp> put node07-new-srvtab /etc/krb-srvtab
200 PORT command successful.
150 Opening data connection for /etc/krb-srvtab.
226 Transfer complete.
58 bytes sent in 0.02141 seconds (2.645 Kbytes/s) local: node07-new-srvtab remote: /etc/krb-srvtab ftp> quit
221 Goodbye.
# dsh -w sp2n07 date sp2n07: Fri Apr 16 15:08:59 EDT 1999
Do not forget to delete the <node>new-srvtab file under /tftpboot as soon as you have transferred them to the appropriate node. This is always done when the customize script ( pssp_script
) is running on a node. After transferring the
Kerberos-related files, the /tftpboot/<node>-new-srvtab is deleted. It is
188
RS/6000 SP Software Maintenance
recommended that you do not keep these srvtab files hanging around under a directory that permits tftp.
8.11 Merging Two (or more) SP Systems to One Kerberos Realm
One advantage of having an RS/6000 SP is the centralized administration of the whole system: you can distribute, install, to check up something in parallel. One command works for all your nodes. Though some customers have several SP systems in different locations due to safety prerequisites or other reasons, these systems often are administered by the same people. In such configurations it is desirable to work in a common secure environment including all the SPs. Otherwise, each time you change from one SP to another for administrating the system, you have to change authenticate to a different Kerberos realm.
It is preferable to have all SPs belong to a single realm. Here is the solution for having one Kerberos realm spread over two (or even more) SP systems.
Assume we have two SP systems, sp4en0 and sp5en0, and sp5en0 will be integrated to the realm of sp4en0. That means the CWS sp4en0 will become the authentication server for both SP systems. Later, we will set up sp5en0 as
a secondary authentication server for this realm. Figure 38 on page 190
illustrates this configuration.
The common realm name will be SP4EN0, but as you know, you are free to choose the realm name and create your own /etc/krb.conf file before issuing setup_authent
(see 8.2.2, “The Kerberos Configuration File” on page 160 for
further information).
Taming Kerberos
189
A Single Kerberos Realm
System A
sp4n15 sp4n13 sp4n14 sp4n11 sp4n12 sp4n09 sp4n10 sp4n07 sp4n08 sp4n05 sp4n06 sp4n01 sp5n13 sp5n09 sp5n05 sp5n01
System B
sp4en0:# sp5en0:#
Figure 38. Two SP Systems in a Single Kerberos Realm
8.11.1 Deleting an Authentication Server Setup
When you are planning to spread one Kerberos realm over two SP systems, the starting point usually is that you already have two installed systems with an authentication server for each system (meaning for each realm). The integration of SP system B into the realm of SP system A makes the deletion of the authentication server on sp5en0 (the CWS of SP system B) necessary.
In the first step, sp5en0 will become a normal Kerberos client just like all the other nodes under the control of the authentication server on sp4en0.
You may wonder, which files does the Kerberos master use, and which subsystems belong to the Kerberos master? Besides the Kerberos configuration files that are stored on the server and the clients, the server is owner of the Kerberos database. So let us delete the related files and remove the subsystems on sp5en0:
190
RS/6000 SP Software Maintenance
sp5en0:/# rm /var/kerberos/database/* sp5en0:/# rm /.k sp5en0:/# rm /etc/krb-srvtab sp5en0:/# rm /etc/krb.conf sp5en0:/# stopsrc -s kadmind
0513-044 The stop of the kadmind Subsystem was completed successfully.
sp5en0:/# stopsrc -s kerberos
0513-004 The Subsystem or Group, kerberos, is currently inoperative.
sp5en0:/# rmssys -s kadmind
0513-083 Subsystem has been Deleted.
sp5en0:/# rmssys -s kerberos
0513-083 Subsystem has been Deleted.
After these commands are run, sp5en0 is no longer an authentication server.
We intentionally did not delete the /etc/krb.realms and /.klogin files since we can use these files later to create the global /etc/krb.realms and /.klogin files on the new authentication server. Since the /etc/krb-srvtab and /etc/krb.conf
files will be overwritten, it is not necessary to delete them, but the subsystems and the database will have to be deleted.
8.11.2 Provide Name Resolution and Time Synchronization
Many of the problems with Kerberos are based on incorrect name resolution, so you should insure that the name resolution for the nodes in SP system A, as well for the nodes and CWS of SP system B, is correct. It does not matter whether you are using the /etc/hosts or domain name service. For the single realm including two SP systems, we recommend that you to add all the hosts of the SP system B to the /etc/hosts on the CWS of SP system A (in our example, sp4en0).
For more information about Kerberos and IP configuration, refer to IBM
Parallel System Support Programs for AIX: Administration Guide,
GC23-3897.
Since Kerberos packets have a lifetime of 5 minutes, the time difference between the hosts must not exceed this limit. Therefore, when you plan to spread one Kerberos realm across SP systems, you have to insure that the time difference between these systems is limited. Otherwise, your Kerberos authentication and communication will not work. One solution for solving this problem is to use an external time server to synchronize the time services within the SPs.
8.11.3 Adjust the /etc/krb.realms File
The CWS of SP system A (sp4en0) will become the authentication server for the whole realm; that means the Kerberos database will be located on this
Taming Kerberos
191
host. In addition, all Kerberos clients will get the Kerberos-related files from this server, therefore we will adjust all the files on the future authentication server (sp4en0).
The /etc/krb.realms file maps adapter names to an authentication realm. An
/etc/krb.realms file with all the adapter names for SP system B already exists, but for our purpose, it is pointing to the wrong realm. Consequently, you must copy the krb.realms file of sp5en0 to another file on sp4en0, update the realm entry for every adapter to the new realm name, and then concatenate this file with the krb.realms file of sp4en0 (in our case, SP4EN0). Finally, this file should to look like this: sp4en0:/# cat /etc/krb.realms sp4cw0 SP4EN0 sp4cw0 SP4EN0 sp4n01 SP4EN0 sp4css01 SP4EN0 sp5n01 SP4EN0 sp5n01x SP4EN0 sp4n05 SP4EN0 sp4css05 SP4EN0 sp4n06 SP4EN0 sp4css06 SP4EN0 sp4n07 SP4EN0 sp4css07 SP4EN0 sp4n08 SP4EN0 sp4css08 SP4EN0 sp4n09 SP4EN0 sp4css09 SP4EN0 sp5n09 SP4EN0 sp5n09x SP4EN0
^C
8.11.4 Extend the /.klogin File
Imagine are you working on sp5n09 on SP system B, you own a ticket-granting-ticket for root.admin, and you want to set up a dsh command to a node belonging to SP system A, as follows: sp5n09:/# dsh -w sp4n07 date
After the authentication processes (tgt, service ticket) complete, it will be verified whether node sp5n09 is allowed to login to sp4n07 to issue the command. This will be checked in the /.klogin file; therefore, all principals of both SPs have to have an entry in this file. We recommend that you include not only the rcmd.<en0_name> principals based on the internal Ethernet, but also the rcmd.<css0_name> principals so that you are allowed to set up
Kerberized remote shell commands over the switch.
192
RS/6000 SP Software Maintenance
You can create the global /.klogin file in the same way as you created the
/etc/krb.realms file. However, you have to delete the root.admin principal of the old realm; it must no longer exist. We have one root.admin for our global realm, that is all. The /.klogin will look like this, and the order does not matter: p4en0:/# cat /.klogin | pg root.admin@SP4EN0 rcmd.sp4en0@SP4EN0 rcmd.sp4cw0@SP4EN0 rcmd.sp5en0@SP4EN0 rcmd.sp5cw0@SP4EN0 rcmd.sp4n01@SP4EN0 rcmd.sp4css01@SP4EN0 rcmd.sp5n01@SP4EN0 rcmd.sp5n01x@SP4EN0 rcmd.sp4n05@SP4EN0 rcmd.sp4css05@SP4EN0 rcmd.sp5n05@SP4EN0 rcmd.sp5n05x@SP4EN0
^C
8.11.5 Adding Principals for Remote Nodes
There are two ways to add the rcmd service principal for each adapter of SP system B to the Kerberos database on sp4en0:
• Create rcmd principals with the kadmin command.
• Use a file as input for the add_principal command.
When you use the kadmin prompt, you have to define one rcmd principal after the other and set a password for each of them. It is required to have a valid password for every principal; otherwise, you will not be allowed to log on or authenticate as this principal. On the other hand, you are never prompted for the rcmd principals’ password because the login routine for the rcmd principal is done under cover. The rcmdtgt command asks for a ticket-granting-ticket
(tgt). Then the authentication server sends a packet back containing the tgt
principal. Finally, with this srvtab file, the packet can be decrypted locally so that the tgt can be obtained.
Because of this mechanism, you do not need to keep these passwords in mind. The second way to define all these principals to the Kerberos database and provide them with a password can be realized by creating a file that contains all principals and passwords and by issuing the add_principal command that uses this file as input.
Taming Kerberos
193
8.11.5.1 Creating rcmd Principals with the kadmin Command
The best way to do this is to have a list with all the adapter names that should be integrated. As example, we will define the adapters of sp5n01 (the first node of SP system B) with the kadmin command: sp4en0:/# kadmin
Welcome to the Kerberos Administration Program, version 2
Type "help" if you need it.
admin:
ank rcmd.sp5n01
Admin password:
Password for rcmd.sp5n01:
Verifying, please re-enter Password for rcmd.sp5n01: rcmd.sp5n01 added to database.
admin:
ank rcmd.sp5n01x
Admin password:
Password for rcmd.sp5n01x:
Verifying, please re-enter Password for rcmd.sp5n01x: rcmd.sp5n01x added to database.
admin:
quit
Cleaning up and exiting.
You could define every rcmd principal within one kadmin session. You have to set a password and verify it, but you can forget it afterwards. Now let us check if these principals are defined: sp4en0:/# lskp | grep sp5 rcmd.sp5n01
tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 rcmd.sp5n01x
tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59
8.11.5.2 Use the add_principal Command
The advantage of creating a file with all principals that you are going to define is that you run the add_principal command just once. The disadvantage of this method is that you have to also provide the passwords in this file.
First of all we create a file that has read and write permission only for the owner, say /tmp/addnew, in order to insure that no other person can look into this file during the procedure of creating principals.
#
(umask 066 ; vi /tmp/addnew)
This command creates a file /tmp/addnew with permission -rw-------. The brackets are needed to avoid setting your default umask to this value. It is only valid for the following command. You can also touch the file with your default umask and then change the permissions with chmod 600 /tmp/addnew
.
194
RS/6000 SP Software Maintenance
Now an entry for each principal that should be defined is needed, followed by a blank and then the password of your choice. Let us define the rcmd principals for node 5 and the CWS of SP system B. The file structure has must look like this: sp4en0:/# cat /tmp/addnew rcmd.sp5en0 siw96ed62h rcmd.sp5cw0 926fkduen3 rcmd.sp5n05 93jd7dj49f rcmd.sp5n05x uejf7390id
The add_principal command, located under /usr/lpp/ssp/kerberos/bin, takes the first field as the name of the principal that should be defined and the second field is used as password. We issue the add_principal command, then have a look into the database to see which principals are already defined: sp4en0:/# add_principal -v /tmp/addnew add_principal: rcmd.sp5en0 added to database.
add_principal: rcmd.sp5cw0 added to database.
add_principal: rcmd.sp5n05 added to database.
add_principal: rcmd.sp5n05x added to database.
sp2en0:/# lskp | grep sp5 rcmd.sp5cw0
rcmd.sp5en0
tkt-life: Unlimited key-vers: 1 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 expires: 2037-12-31 23:59 rcmd.sp5n01
rcmd.sp5n01x
rcmd.sp5n05
rcmd.sp5n05x
tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59 tkt-life: Unlimited key-vers: 1 expires: 2037-12-31 23:59
We do not delete the /tmp/addnew file immediately because we can also use it to create the new-srvtab files for the new principals; it saves time to use this file as input.
8.11.6 Creating the srvtab Files for the Remote Nodes
As soon as the principals for all adapters names are defined in the Kerberos database, the krb-srvtab files for the appropriate nodes can be extracted.
Since the newly created principals (in other words, the appropriate adapter name of these principals) are not part of the SDR on our CWS sp4en0, we cannot use the create_krb_files wrapper to create the krb-srvtab files (see
Figure 31 on page 167). We are forced to use the
ext_srvtab command, which is SDR-independent and works directly with the Kerberos database entries.
Because all newly created principals are listed in the /tmp/addnew file (except the principals for node 1), we can use this file as input for the ext_srvtab command. The syntax is:
Taming Kerberos
195
# ext_srvtab [-n] [-r realm] instance [instance ...]
The instance names are equal to the adapter names. The entries in
/tmp/addnew look like rcmd.sp5n05 <password>, so we will cut the instance names out and use them as input for ext_srvtab
. The new-srvtabs are created in the current directory. To make work easier, it is recommended to create new directory /tftpboot/srv and change to it. Now all new-srvtab files will be generated in this subdirectory.
sp4en0:/# cd /tftpboot/srv sp4en0:/tftpboot/srv# ext_srvtab -n `cat /tmp/addnew | cut -d’.’ \
>-f2 | awk '{print $1}'`
Generating 'sp5en0-new-srvtab'....
Generating 'sp5cw0-new-srvtab'....
Generating 'sp5n05-new-srvtab'....
Generating 'sp5n05x-new-srvtab'....
Do not forget to remove the /tmp/addnew file with the passwords for the rcmd principals if you have used the add_principal command.
# rm /tmp/addnew
Since the rcmd principals for node 1 have been defined with the kadmin command, they were not included in our /tmp/addnew list. Let us now create these srvtabs by naming the principals as parameters and then have a look at the directory: sp4en0:/tftpboot/srv# ext_srvtab -n sp5n01 sp5n01x
Generating 'sp5n01-new-srvtab'....
Generating 'sp5n01x-new-srvtab'....
sp2en0:/tftpboot/srv# ls -l total 48
-rw------1 root
-rw------1 root
-rw------1 root system system system
28 Mar 31 19:14 sp5cw0-new-srvtab
28 Mar 31 19:14 sp5en0-new-srvtab
28 Mar 31 19:30 sp5n01-new-srvtab
-rw------1 root
-rw------1 root
-rw------1 root system system system
29 Mar 31 19:30 sp5n01x-new-srvtab
28 Mar 31 19:14 sp5n05-new-srvtab
29 Mar 31 19:14 sp5n05x-new-srvtab
The result is one new-srvtab file for each adapter. Therefore, we have to concatenate the srvtab files belonging to one node before we distribute them.
# cat sp5n01-new-srvtab sp5n01x-new-srvtab > sp5n01-new-srv
Follow this procedure for every node.
196
RS/6000 SP Software Maintenance
8.11.7 Distributing the Kerberos Files to the Remote Nodes
The last step consists of distributing new-srvtab files to the nodes as well to
The files that a Kerberos client needs are, at minimum:
• /etc/krb.conf
• /etc/krb.realms
• /etc/krb-srvtab
• /.klogin
All files except the /etc/krb-srvtab files are copies of the Kerberos server files.
sp4en0:/tmp# ftp sp5n01
Connected to sp5n01.
220 sp5n01 FTP server (Version 4.1 Tue Mar 17 14:00:13 CST 1998) ready.
Name (sp5n01:root):
331 Password required for root.
Password:
230 User root logged in.
ftp> bin
200 Type set to I.
ftp> put /tftpboot/srv/sp5n01-new-srv /etc/krb-srvtab
200 PORT command successful.
150 Opening data connection for /etc/krb-srvtab.
226 Transfer complete.
29 bytes sent in 0.02214 seconds (1.279 Kbytes/s) local: /tftpboot/srv/sp5n01-new-srv remote: /etc/krb-srvtab ftp> put /etc/krb.conf
200 PORT command successful.
150 Opening data connection for /etc/krb.conf.
226 Transfer complete.
36 bytes sent in 0.000413 seconds (85.12 Kbytes/s) local: /etc/krb.conf remote: /etc/krb.conf
ftp> #### and also /etc/krb.realms and /.klogin ####
Once the distribution of the Kerberos-related files is finished, your Kerberos environment with two SP systems in one Kerberos realm is ready to use.
8.12 Setting Up a Secondary Authentication Server
Due to the dependency of the authentication server for getting tickets, especially if you are serving several SP systems from one authentication server, we set up a secondary authentication server. Every workstation, even if it is not part of the realm, can become a secondary authentication server. In our example it makes sense to define the second CWS, that became a
Taming Kerberos
197
normal Kerberos client during the fusion to one singe Kerberos realm, as a secondary server.
The setup_authent script that will run to define the secondary server does not allow you to define an SP node as a secondary server. As soon as it realizes that you are working on an SP node, the script exits. This works as designed because usually the nodes are the login hosts for your user and it is not recommended to place a copy of the database on a “user” machine.
Nevertheless, a workaround for that is described in Appendix C, “Secondary
Authentication Server on an SP Node” on page 255.
Our starting point consists of a Kerberos realm spread across two SP systems. The Kerberos configuration after the setup of a secondary
authentication server will look like Figure 39.
sp4en0 as primary authentication server sp5en0 as secondary authentication server
/.k
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/tmp/tkt0 kerb eros
/var/ data base
/*
/\
/.k
sp4en0:#
replicate
sp5en0:#
/.k
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/tmp/tkt0
/var/ke databa rberos se/*
/\
/.k
Figure 39. Secondary Authentication Server
Due to an additional entry in the /etc/krb.conf file, sp5en0 will be identified as a secondary authentication server. The secondary server gets a read-only replica of the Kerberos database from the primary authentication server. That means the secondary server is able to do Kerberos authentication. The secondary server can generate tickets, but any configuration changes should
198
RS/6000 SP Software Maintenance
only be applied to the primary authentication server. Every node within this realm has the same /etc/krb.conf file pointing to sp5en0 as the secondary server. If the communication with the primary server fails, the clients ask the secondary server for authentication.
Assuming that the future secondary server is already a client of the Kerberos realm, the following steps are necessary to set up the secondary server:
• Install the Kerberos filesets on the future secondary server (if necessary).
• Edit the /etc/krb.conf file and distribute it to all hosts belonging to the realm.
• Run setup_authent on the secondary authentication server.
• Add a /usr/kerberos/etc/push-krop entry to the crontab on the primary server.
8.12.1 Install Kerberos-Related Filesets on a Secondary Server
The authentication server and client part of the PSSP software is needed on the secondary server, as follows:
• ssp.authent
• ssp.clients
In our example, the sp5en0 had been a Kerberos authentication server before, consequently the filesets are already installed.
Insure that the name resolution for all hosts belonging to the Kerberos realm works properly.
8.12.2 /etc/krb.conf File Modification and Distribution
Since we have a working Kerberos environment, it makes sense to edit the
/etc/krb.conf file on the primary authentication server and then distribute it with the parallel copy command to all nodes.
The /etc/krb.conf file contents are explained in 8.2.2, “The Kerberos
Configuration File” on page 160. The second line points to the admin server,
meaning to the primary authentication server. To define a secondary authentication server an additional entry is required:
Taming Kerberos
199
SP4EN0
SP4EN0 sp4en0 admin server
SP4EN0 sp4en0
For this realm...
this host is ...
a secondary authentication server
Which realm does this host belong to?
Figure 40. The /etc/krb.conf File with Secondary Authentication Server
Only one admin server entry (primary authentication server) should be defined for a realm, but you can define several secondary servers. The definitions for the secondary authentication server look like the entry for
sp5en0 in Figure 39 on page 198.
To distribute this file to all the nodes of our realm we use the parallel copy command pcp
. Details on the parallel copy command and its possible options
are covered in 8.13, “Working with the WCOLL Variable” on page 202.
Assume the file /tmp/allhosts contains all nodes of both SP systems, plus the second CWS. First we will set the WCOLL variable to that file, and then copy the /etc/krb.conf to all nodes in parallel: sp2en0:# export WCOLL=/tmp/allhosts sp2en0:# pcp ’’ /etc/krb.conf
8.12.3 Run setup_authent on the Secondary Authentication Server
As mentioned in 8.5, “setup_authent, setup_server and the Kerberos
setup_authent script first checks if a /etc/krb.conf
file already exists. In our case, when we execute setup_authent on our future secondary server (sp5en0), the script detects the entry for sp5en0 in the
200
RS/6000 SP Software Maintenance
krb.conf file and starts the appropriate part of setup_authent to make this host a secondary server. See the output of our example: sp5en0:/# setup_authent
********************************************************************
ATTENTION
You are attempting to redo the authentication setup on your Control
Workstation.
Since you have not yet executed install_cw, there are no special considerations or risks involved.
Do you want to replace your existing setup?
********************************************************************
Enter y or n: y
***********************************************************************
Logging into Kerberos as an admin user
[...]
For more information, see the kinit man page.
*********************************************************************** setup_authent: Enter name of admin user: root.admin
Kerberos Initialization for "root.admin"
Password: add_principal: 2502-037 rcmd.sp5cw0 already exists in database.
add_principal: 2502-037 hardmon.sp5cw0 already exists in database.
add_principal: 2502-037 rcmd.sp5en0 already exists in database.
add_principal: 2502-037 hardmon.sp5en0 already exists in database.
0513-085 The kpropd Subsystem is not on file.
0513-084 There were no records that matched your request.
sp5cw0: success.
sp5cw0: Succeeded
0513-071 The kpropd Subsystem has been added.
0513-059 The kpropd Subsystem has been started. Subsystem PID is 21896.
0513-085 The kerberos Subsystem is not on file.
0513-084 There were no records that matched your request.
0513-071 The kerberos Subsystem has been added.
0513-059 The kerberos Subsystem has been started. Subsystem PID is 20772.
The kpropd and Kerberos subsystems have been added and we already received a read-only replica of the Kerberos database. Due to the read-only functionality on a secondary server, the kadmind subsystem is not installed.
All changes to the database must be, and can only be, done on the primary authentication server.
The kpropd is the daemon that receives the copy of the Kerberos database from the primary authentication server. Now let us have look to the contents of the /var/kerberos/database directory on the secondary server, because there is a difference from the primary server:
Taming Kerberos
201
sp5en0:/# cd /var/kerberos/database sp5en0:/var/kerberos/database# ls -l total 83
-rw-r----1 root system 0 Jul 20 17:51 admin_acl.add
-rw-r----1 root
-rw-r----1 root
-rw------1 root
-rw------1 root
-rw------1 root system system system system system
11 Jul 20 17:51 admin_acl.get
0 Jul 20 17:51 admin_acl.mod
4096 Jul 20 17:51 principal.dir
0 Jul 20 17:51 principal.ok
82944 Jul 20 17:51 principal.pag
Since the secondary authentication server gets a read-only replica of the
Kerberos database, the admin_acl.add and admin_acl.mod files under
/var/kerberos/database have no entries. The only entry for root.admin is in the admin_acl.get file, which indicated which Kerberos principal is allowed to get information from the Kerberos database.
8.13 Working with the WCOLL Variable
When your realm consists of several SP systems or includes external
workstations (see 9.1, “Integration of External Workstations to the SP
Kerberos Realm” on page 205), it is particularly necessary to work with the
WCOLL variable since the usual hostlist arguments as parameters for
Kerberized remote shell commands get their information from the SDR. The best way to do this is to create a file containing all hosts of the realm, for example /tmp/allhosts. Each host entry must begin on a new line: p2en0:/# cat /tmp/allhosts sp2n01 sp2n05 sp2n06
.
.
.
sp5en0 sp5n01 sp5n05
^C
Next we set the WCOLL variable to this file and export it:
# export WCOLL=/tmp/allhosts
It is well known that the dsh command without any option takes the contents of the file to which the WCOLL variable points:
# dsh date
202
RS/6000 SP Software Maintenance
But how can you use this WCOLL variable with other parallel commands, such as pcp
, lppdiff, pdf
? The dsh command without any options is actually a short form of the command:
# dsh ’’ date
Let us try with two nodes in the working collective: sp2en0:/# cat /tmp/nodes sp2n01 sp2n05 sp2en0:/# export WCOLL=/tmp/nodes sp2en0:/# dsh ’’ date sp2n01: Sun Apr 11 21:12:34 EDT 1999 sp2n05: Sun Aug 11 21:12:34 EDT 1999 sp2en0:/# pcp ’’ /tmp/blubb sp2en0:/# dsh ’’ ls -la /tmp/blubb sp2n01: -rw-r--r-1 root system sp2n05: -rw-r--r-1 root sp2en0:/# lppdiff ’’ ssp.css system
6 Apr 02 21:13 /tmp/blubb
6 Apr 02 21:13 /tmp/blubb
-------------------------------------------------------------------------------
Name Path Level PTF State Type Num
-------------------------------------------------------------------------------
LPP: ssp.css
/etc/objrepos 3.1.0.0
COMMITTED I 2
From: sp2n01 sp2n05
-------------------------------------------------------------------------------
LPP: ssp.css
From: sp2n01 sp2n05
/usr/lib/objrepos 3.1.0.0
COMMITTED I 2 sp2en0:/# pdf ’’ /var
Filesystem Size-KB
-----------------------
HOST: sp2n01
------------
/var 32768
Used-KB
-------
9148
Free-KB %Free
------- -----
23620 73% iUsed
-----
368 iFree %iFree
----- ------
7824 96%
HOST: sp2n05
------------
/var 4096 4068 28 1% 338 686 67%
This is how all the other mentioned commands can work with the WCOLL variable.
Taming Kerberos
203
204
RS/6000 SP Software Maintenance
Chapter 9. Integration of External Workstations
With PSSP 3.1, a new strategy has arisen. The High Availability Infrastructure that was restricted to the SP Environment before now becomes a Cluster
Technology, spread over a cluster consisting of both SP systems and external machines. Further, HACMP Enhanced Scalability Version 4.3 moves beyond partition boundaries and can be used over several partitions and different SP systems, as well as external workstations.
Because of these major changes, it makes sense to integrate external workstations with some services offered by the CWS. This chapter covers the integration of external workstations into the SP Kerberos realm, as well as the definition of external workstations to the NIM master configuration on the
CWS.
9.1 Integration of External Workstations to the SP Kerberos Realm
For security reasons it may be required to integrate RS/6000 systems into a
SP Kerberos realm. In this section we show how to integrate an external workstation into a existing Kerberos realm.
Kerberos never sends a password unencrypted over the network. That is fine, but as an SP administrator, you usually do not work directly in the lab but instead telnet from another machine to your CWS by typing the root password. If you authenticate to the realm, you will be asked for the root.admin password. During both authentication procedures, the password between your workstation and the CWS is transferred across the network unencrypted. So, to maintain this security function, it makes sense to integrate at least your workstation to your SP Kerberos realm.
The procedure to do this is nearly the same as merging two SPs into one
Kerberos realm. In the following section we describe the necessary steps required for realm integration. For background information concerning the
Kerberos-related files, the rcmd principal, and the srvtab file on the nodes,
refer to Chapter 8, “Taming Kerberos” on page 155.
In our example we will show the integration of the machine named risc77 that
is connected to the CWS over Token Ring. Figure 41 on page 206 gives an
overview of our scenario.
© Copyright IBM Corp. 1999
205
sp2n15 sp2n13 sp 2n 14 sp2n11 sp 2n 12 sp2n09 sp 2n 10 sp 2n 07 sp 2n 08 sp2n05 sp 2n 06 sp2n01
SP Ethernet
Serial Link sp2en0:#
/.k
/etc/krb.conf
/etc/krb.realms
/etc/krb-srvtab
/.klogin
/tmp/tkt0
/var/k databa erbero se/* s/\
/.k
Token Ring risc77:#
/etc/krb.conf
/etc/krb.realm s
/etc/krb-srvtab
/.klogin
root.admin
rcmd.sp2en0
rcmd.sp2n01
rcmd.sp2n05
rcmd.sp2n06
:
: rcmd.risc77
Figure 41. External Workstation as Part of the SP Kerberos Realm
9.1.1 Time Synchronization, Name Resolution and Kerberos Code
Kerberos packets have a maximum lifetime of five minutes. If the time gap between two machines exceeds this limit, Kerberos authentication and communication will not work. The SP uses the Network Time Protocol to achieve time synchronization between all participants. For your external machine, you have to insure that there is no time difference between the SP system and your external workstation. This can be done by using a time server that serves both the CWS and your external workstations.
Make sure that the name resolution for your external workstation works properly. We recommend that you include the workstation in the /etc/hosts file on your CWS and also include the CWS and the node in the /etc/hosts file on your workstation.
To obtain the necessary Kerberos environment and commands on your external workstation, you have to install the Kerberos client code, meaning the ssp.clients fileset of the PSSP software. You can mount the spdata/sys1/install/pssplpp directory from the CWS and install the ssp.fileset
from the appropriate PSSP-<code.version> subdirectory. After the installation, the lslpp command should return the following:
206
RS/6000 SP Software Maintenance
risc77:/# lslpp -L | grep ssp ssp.clients
3.1.0.0
C (ECIP) SP Authenticated Client
9.1.2 Edit /etc/krb.realms on the Control Workstation
The /etc/krb.realms file indicates which host, or better, which adapter name, belongs to the realm. Therefore, you have to add the adapter name of your external workstation to the /etc/krb.realms file on the CWS. In our example we add the risc77 adapter name to that file.
sp2en0:/# tail /etc/krb.realms sp2css11 SP2EN0 sp2n12 SP2EN0 sp2css12 SP2EN0 sp2n13 SP2EN0 sp2css13 SP2EN0 sp2n14 SP2EN0 sp2css14 SP2EN0 sp2n15 SP2EN0 sp2css15 SP2EN0 risc77 SP2EN0
9.1.3 Extend the /.klogin File on the Control Workstation
When a Kerberos principal tries to set up a remote shell command on a remote host, the last step of authentication is the check of the /.klogin file on the remote host. If the principal that asks for the service is not found in the
/.klogin file, the service is refused. Currently the root.admin principal is included to the /.klogin file. Later this file will be distributed to the external workstation. You can include the rcmd principal of the external workstation in the /.klogin file, but it is not mandatory. What would be the difference?
As long as you are authenticated as the Kerberos principal root.admin either on host sp2en0 or on risc77, the /.klogin check will return okay. But as soon as you use the rcmdtgt command on risc77, which gives you a ticket for the rcmd.risc77 principal, you will not pass the /.klogin check because the rcmd.risc77 principal is not part of the /.klogin.
In any case the integration of the rcmd.principal avoids authentication errors.
We will include the rcmd principal for risc77 in the /.klogin file on the CWS, as follows:
Integration of External Workstations
207
# cat /.klogin | pg root.admin@SP2EN0 rcmd.sp2en0@SP2EN0 rcmd.risc77@SP2EN0 rcmd.sp2n01@SP2EN0 rcmd.sp2n05@SP2EN0 rcmd.sp2n06@SP2EN0 rcmd.sp2n07@SP2EN0 rcmd.sp2n08@SP2EN0 rcmd.sp2n09@SP2EN0
^C
9.1.4 Add rcmd Principals to the Kerberos Database
As described in 8.11.5, “Adding Principals for Remote Nodes” on page 193,
you have two possibilities for adding a principal to the Kerberos database.
You can use the kadmin command or you can use the add_principal command.
The add_principal command needs a file as input in which the principal and a password is provided. For just one principal, it is not worth creating a file, so we will define the rcmd principal by using kadmin command. If you are
interested in the other method, see 8.11.5, “Adding Principals for Remote
p2en0:/# kadmin
Welcome to the Kerberos Administration Program, version 2
Type "help" if you need it.
admin:
ank rcmd.risc77
Admin password:
Password for rcmd.risc77:
Verifying, please re-enter Password for rcmd.risc77: rcmd.risc77 added to database.
admin: quit
Cleaning up and exiting.
Repeat this action for each adapter configured in the external workstation.
Just add a principal rcmd.<adapter name> to the database. First you are prompted for the admin password that is the root.admin password, then you are prompted for the rcmd.principal’s password. Provide a password. Choose an arbitrary one and bear in mind that you will never need it again. Why?
Because you get authenticated as an rcmd principal by issuing rcmdtgt and
this command does not ask you for any password. For details refer to 8.7.6,
“Never-Expiring Ticket” on page 176.
208
RS/6000 SP Software Maintenance
9.1.5 Create a krb-srvtab File for the External Workstation
As soon as the rcmd principal is defined, a srvtab file can be extracted from the Kerberos database containing the password for this principal. Later this file will be stored under /etc/krb-srvtab on the external workstation. Since we just include one adapter name of the external workstation, there is only one srvtab to extract. If you extract several srvtabs due to having several rcmd principals, these srvtab files have to be concatenated to one file (refer to
8.10, “Recreating the krb-srvtab File for a Node” on page 187).
The syntax of the command looks like the following:
# ext_srvtab [-n] [-r realm] instance [instance...] sp2en0:/tftpboot# ext_srvtab -n risc77
Generating 'risc77-new-srvtab'....
sp2en0:/tftpboot# ls -la | grep risc77
-rw------1 root system 28 Aug 5 13:49 risc77-new-srvtab
9.1.6 Transfer the Required File to the External Workstation
Next, the required Kerberos files have to be transferred to the external workstation; these are /etc/krb.conf, /etc/krb.realms, and /.klogin. And last but not least, the just-created srvtab has to be copied to the workstation as a
/etc/krb-srvtab file. We use ftp and just show the commands (the messages have been truncated in this example).
sp2en0:/tftpboot# ftp risc77 ftp> bin ftp> put /etc/krb.conf ftp> put /etc/krb.realms ftp> put /.klogin ftp> put /tftpboot/risc77-new-srvtab /etc/krb-srvtab ftp> quit
9.1.7 Set the Authentication Method on the External Workstation
With AIX 4.3.1, an authentication option is provided. There is a choice as to which kind of authentication will be used. We use Kerberos version 4 on our external workstation. This has to be defined; otherwise, the authentication services will not work even if all files are provided and correct. Set the authentication method by using SMIT.
1. Type smitty auth_set.
2. Choose Kerberos 4.
Integration of External Workstations
209
Set Authentication Methods
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Kerberos 5
* Kerberos 4
* Standard Aix
[Entry Fields]
[no]
[yes]
[yes]
+
+
+
F1=Help
F5=Reset
F9=Shell
F2=Refresh
F6=Command
F10=Exit
F3=Cancel
F7=Edit
Enter=Do
F4=List
F8=Image
If you set Kerberos 4 to yes but set Standard AIX and Kerberos 5 to no, standard remote login (telnet, tfp) is not longer provided. Kerberos 5 requires
DCE version 2.2 or higher.
After enabling Kerberos 4 as an authentication method, the kerberized remote shell commands issued from the CWS to our external workstation should work.
# dsh -w risc77 date risc77: Wed Apr 14 15:59:38 EDT 1999
Including this external workstation to the file that is referenced by the WCOLL variable (in our example, /tmp/nodes) makes this workstation accessible in combination with your SP nodes.
sp2en0:/# export WCOLLL=/tmp/nodes sp2en0:/# dsh date sp2n01: Wed Apr 14 16:03:37 EDT 1999 sp2n05: Wed Apr 14 16:03:37 EDT 1999 sp2n06: Wed Apr 14 16:03:37 EDT 1999 sp2n07: Wed Apr 14 16:03:37 EDT 1999 risc77: Wed Apr 14 16:03:33 EDT 1999
210
RS/6000 SP Software Maintenance
9.2 Using the CWS NIM Master Setup to Install External Workstations
Since you have a NIM Master Setup on your CWS anyway, why not use this
NIM configuration for external workstations that are not part of the SP environment? Almost all needed resources are already created, such as lppsource, spot, boot and so on. Only the definition of the external NIM clients and the network over which they will get installed are missing. In this section, we will show what steps remain in order to include external workstations to your NIM configuration on the CWS. For detailed information about NIM and
NIM on the SP, refer to 2.1, “NIM Overview” on page 9.
9.2.1 Definition of a Second Network Served by the NIM Master
A NIM configuration always consists of a NIM master in combination with at least one network over which the installations take place. Moreover, there are
NIM client definitions that also include information on which client uses which network for installation. Finally, there are the definitions of resources that are necessary during installation, migration and maintenance. All the previously mentioned components are already available on your CWS, with the internal
SP Ethernet as the defined network for NIM procedures.
Since you cannot use the internal SP Ethernet for the installation of external clients, you have to define a second network served by the NIM master.
In regard to NIM network definitions on a CWS, although it is physically possible to connect external workstations to the internal SP Ethernet, it is not supported to extend the SP Ethernet and connect other clients. For non-SP clients, you have to define a second network served by the master. However, one problem still remains: as soon as setup_server runs the next time, the definitions of non-SP clients will be deleted. The setup_server command maintains the whole NIM environment and uses the SDR as input.
Consequently, it detects that this client is “no longer” part of the SDR, so the definition will be removed.
Until you migrate to this level, however, there is a workaround. Since only the client definition is deleted (no network or resource is deleted), it is very easy to write a script that redefines the non-SP clients after setup_server runs.
Figure 42 on page 212 shows an example configuration and we provide you
with the workaround script.
Integration of External Workstations
211
risc79:# risc78:# sp2n15 sp2n13 sp2n14 sp2n11 sp2n12 sp2n09 sp2n10 sp2n07 sp2n08 sp2n05 sp2n06 sp2n01 risc77:# sp2en0:#
SP Ethernet
Serial Link
To ke n R ing
NIM Master
serves: spnet_en0
clients
: sp2n01 sp2n05
...
serves: net_tok0
clients
: risc77 risc78 risc79
Resources
mksysb, spot, lppsource, boot, noprompt prompt, nim_script ...
Figure 42. The CWS as NIM Master for a Second Network
Since the external hosts are not part of the SP, we cannot use the helpful wrappers of setup_server
. Instead, we are forced to use the standard NIM commands. In our example we are going to install the risc77 from the NIM master sp2en0 over the Token Ring. First we define the second network. The
SMIT fastpath for NIM administrative tasks is nim_mknet.
1. Type smitty nim_mknet.
2. Select the appropriate network interface.
You will see the following menu:
212
RS/6000 SP Software Maintenance
Define a Network
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Network Name
* Network Type
* Network IP Address
* Subnetmask
Other Network Type
Comments
[Entry Fields]
[sp_tok] tok
[9.12.1.0]
[255.255.255.0]
[]
+
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
The fields that are required are shown in bold characters. The Network Name is free-choosable, and it is the equivalent value to spnet_en0.
Up to now only the network itself is defined, so this new network now has to be defined as an interface served by the NIM master, as follows:
1. Type smitty nim_mkmac_if.
2. Select master.
3. Enter the appropriate interface.
You will see the following menu:
Integration of External Workstations
213
Define a Network Install Interface
Type or select a value for the entry field.
Press Enter AFTER making all desired changes.
* Host Name of Network Install Interface
[Entry Fields]
[sp2cw0]
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
Select the appropriate network interface (Token Ring, in our example). Then you will see the following screen.
Define a Network
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Network Name
* Network Type
* Network IP Address
* Subnetmask
Other Network Type
Comments
[Entry Fields]
[sp_tok] tok
[9.12.1.0]
[255.255.255.0]
[]
+
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
The Hardware Address of the interface is needed, and you can get it with the command:
# lscfg -v -l tok0
214
RS/6000 SP Software Maintenance
Now we check to see if the master serves both interfaces:
# lsnim -l master| grep if[0-9] if1 = spnet_en0 sp2en0 02608C2D4A7F if2 = sp_tok sp2cw0 10005AB1919C tok
9.2.2 External NIM Client Definition
Which kind of client will the external workstation be? Since we are working with the standard NIM functions independent of setup_server and the SDR, we have the entire NIM functionality. Nevertheless, we will define the external workstation as a standalone client like the SP nodes.
Type smitty nim_mkmac.
You will see the following menu:
Define a Machine
Type or select a value for the entry field.
Press Enter AFTER making all desired changes.
* Host Name of Machine
(Primary Network Install Interface)
[Entry Fields]
[risc77]
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
After entering the hostname of the external NIM client, you will see the following screen.
Integration of External Workstations
215
Define a Machine
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* NIM Machine Name
* Machine Type
* Hardware Platform Type
Kernel to use for Network Boot
Primary Network Install Interface
* Ring Speed
* NIM Network
* Host Name
Network Adapter Hardware Address
Network Adapter Logical Device Name
IPL ROM Emulation Device
CPU Id
Machine Group
Comments
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
[Entry Fields]
[risc77]
[standalone]
[rs6k]
[up]
[16] sp_tok risc77
[10005AA8E7A3]
[tok]
[]
[]
[]
[]
F4=List
Esc+8=Image
+
+
+
+
+/
+
In this menu you are not required to fill in the CPU ID of the client. After the first installation, the NIM master fetches the CPU ID of the client itself. The following screen shows the client definition.
sp2en0:/# lsnim -l risc77 risc77: id = 902176487 class type
= machines
= standalone platform = rs6k netboot_kernel = up if1 ring_speed1
Cstate prev_state
Mstate
= sp_tok risc77 10005AA8E7A3 tok
= 16
= ready for a NIM operation
= ready for a NIM operation
= currently running
As already mentioned, this external client definition will be removed during the next setup_server run. As a workaround until this is fixed, you can write a shell script that defines the client. The manual definition of our NIM client risc77 would look like this:
216
RS/6000 SP Software Maintenance
# nim -o define -t standalone -a if2="sp_tok risc77 \
10005AA8E7A3 tok" -a ring_speed=16 -a platform=rs6k \
-a netboot_kernel=up risc77
9.2.3 Creating the Resources for External NIM Clients
All NIM resources available on a CWS can be used for external clients except for the script resource that is used for SP nodes, because neither an ssp fileset installation nor any SP-related customization has to run at the end of an external client installation. Nevertheless, it is possible to write another customization script for other clients and define it as a NIM resource. Once the client’s installation has started and there is a script resource allocated for it, this script runs at the end of the installation on the client. For details on the
script resource, see 2.1.5, “NIM Resources” on page 17. In our example, we
do not use a script resource.
Finally, an mksyb resource has to be defined pointing to the mksysb of our workstation risc77. We placed one under /spdata/sys1/install/images.
However, you can put it wherever you want, but you should insure that this file can be exported. Once the mksysb file is placed somewhere, you can create the mksysb resource by using SMIT:
1. Type smitty nim_mkres.
2. Choose mksysb.
You should see the following menu:
Integration of External Workstations
217
Define a Resource
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]
* Resource Name
* Resource Type
* Server of Resource
* Location of Resource
Comments
System Backup Image Creation Options:
CREATE system backup image?
NIM CLIENT to backup
PREVIEW only?
IGNORE space requirements?
EXPAND /tmp if needed?
Create MAP files?
[MORE...7]
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do no
[] no no no no
[Entry Fields]
[mksysb_risc77] mksysb
[master]
<images/risc77.img] /
+
[]
F4=List
Esc+8=Image
+
+
+
+
+
+
After that definition, lsnim returns the following screen.
sp2en0:/# lsnim -l mksysb_risc77 mksysb_risc77: class = resources type
Rstate
= mksysb
= ready for use prev_state = unavailable for use location = /spdata/sys1/install/images/risc77.img
version release
= 4
= 3 mod = 1 alloc_count = 0 server = master
Now the additional definitions for our external workstations on the NIM master are finished completed.
9.2.4 Setting Up the Installation of an External Workstation
Unfortunately setup_server cannot be used for the allocation of the necessary resources for an external client installation. We have to set it up manually.
However, it is helpful to use the allnimres script ( setup_server wrapper) to look up the appropriate commands for the resource allocation and
218
RS/6000 SP Software Maintenance
preparation and write your own setup script for the external clients. In our example we use SMIT.
1. Type smitty nim_mac_res.
2. Select Allocate Network Resources.
3. Choose the external NIM client.
You should see the following menu:
Manage Network Install Resource Allocation
------------------------------------------------------------------------------
Available Network Install Resources
Move cursor to desired item and press Esc+7.
ONE OR MORE items can be selected.
Press Enter AFTER making all selections.
psspscript prompt
> noprompt migrate
> lppsource_aix432 mksysb_1
> spot_aix432 script bosinst_data bosinst_data bosinst_data lpp_source mksysb spot
> mksysb_risc77 mksysb
--------------------------------------------------------------------------------
Select the needed resources, depending upon to your setup. In our example they are noprompt, lppsource_aix432, spot_aix432 and mksysb_risc77.
Finally, the boot resource has to be allocated, the required directories have to be exported, and an entry to the /etc/bootptab has to be made for our external
NIM client. This is done by the command:
# nim -o bosinst <nim_client>
If you go through the SMIT panels, this is done by the command:
1. Type smitty nim_mac.
2. Select Perform Operations on Machines.
3. Choose the extern NIM client.
You should see the following menu:
Integration of External Workstations
219
Manage Machines
-----------------------------------------------------------------------------
Operation to Perform
Move cursor to desired item and press Enter.
[TOP] diag cust bos_inst maint reset fix_query check reboot maint_boot showlog
[MORE...3]
= enable a machine to boot a diagnostic image
= perform software customization
= perform a BOS installation
= perform software maintenance
= reset an object’s NIM state
= perform queries on installed fixes
= check the status of a NIM object
= reboot specified machines
= enable a machine to boot in maintenance mode
= display a log in the NIM environment
F1=Help
Esc+8=Image
/=Find
F2=Refresh
Esc+0=Exit n=Find Next
F3=Cancel
Enter=Do
------------------------------------------------------------------------------
Choose bos_inst and fill out the next screen:
Perform a Network Install
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
Target Name
Source for BOS Runtime Files installp Flags
Fileset Names
Remain NIM client after install?
Initiate Boot Operation on Client?
Set Boot List if Boot not Initiated on Client?
Force Unattended Installation Enablement?
[Entry Fields] risc77
mksysb
[-agX]
[] yes
no
no no
F1=Help
Esc+5=Reset
Esc+9=Shell
F2=Refresh
Esc+6=Command
Esc+0=Exit
F3=Cancel
Esc+7=Edit
Enter=Do
F4=List
Esc+8=Image
+
+
+
+
+
With these choices, you have to start the client installation manually and the mksyb image that is allocated will be used. Before starting the installation, the allocated NIM resources, the /tftpboot/<client_name> file and the
220
RS/6000 SP Software Maintenance
/etc/bootptab file can be verified. We will just have a look into these files, as
shown by the following screen, but for detailed information, see 2.1.2, “NIM
sp2en0:/# lsnim -c resources risc77 noprompt lppsource_aix431 bosinst_data lpp_source risc77_mksysb spot_aix431 boot nim_script mksysb spot boot nim_script represents the network boot resource directory containing customization scrip sp2en0:/# tail /etc/bootptab
# T175 -- (xstation only) -- primary / secondary boot host indicator
#
#
#
#
T176 -- (xstation only) -- enable tablet
T177 -- (xstation only) -- xstation 130 hard file usage
T178 -- (xstation only) -- enable XDMCP
T179 -- (xstation only) -- XDMCP host
# T180 -- (xstation only) -- enable virtual screen risc77:bf=/tftpboot/risc77:ip=9.12.1.77:ht=token-ring:ha=10005AA8E7A3: sa=9.12.1.37:sm=255.255.255.0: sp2en0:/# ls -la /tftpboot/risc77 lrwxrwxrwx 1 root system oot/spot_aix431.rs6k.up.tok
33 Aug 5 17:14 /tftpboot/risc77 -> /tftpb
All entries are correct. The necessary resources are allocated: there is one entry in the /etc/bootptab file for the client risc77 and the /tftpboot/risc77 file points to the appropriate uniprocessor boot kernel, indicated as bootfile (bf) in the bootptab entry.
9.2.5 Initializing the Client Installation
When the client resides in the same network as the NIM server, you can change the bootlist of the client to the appropriate interface. As soon as you reboot the client machine, it will try to boot over the network by sending out bootp requests. The NIM master will react with a bootp response because there is an entry in the /etc/bootptab file for this client. The command for our external NIM client is:
# bootlist -m normal tok0
A second way to initialize the client installation is to use the IPL ROM of the client machine. It will offer you, in a menu, the interfaces that are available and you can choose over which interface it should boot. If you have already done the manual node conditioning on an SP node, you will be familiar with this procedure:
1. Set key to secure.
2. Power on and wait for LED 200.
Integration of External Workstations
221
3. Set key to service and reset.
At this point, the IPL ROM menu will be offered. You can then make the appropriate choices and start the system boot over the network.
222
RS/6000 SP Software Maintenance
Chapter 10. The Switch
After the installation of the nodes, you might start the Switch interface.
Usually this is done from the CWS with the
Estart command, and you hope that every node will join the switch. However, several things can occur between issuing
Estart and getting the green switch responds: It can fail at once; or only some of the nodes join the switch; or the primary node is fenced and when you try to unfence it, you get a message that it cannot be unfenced because it is the primary node, and so on.
To avoid these situations, there are several verification steps that you can do before using
Estart
. This chapter describes these verification steps.
10.1 Definition of the Switch
Between running setup_server and starting the installation of the nodes over the net, you define the Switch by annotating a topology file, defining the primary node and primary backup node, and setting the switch clock source, which means you simply add definitions to the System Data Repository
(SDR). Verification of these definitions can be done by several commands.
10.1.1 Checking the Primary and Primary Backup Node
There are three types of nodes in a Switch environment:
1. Primary node
This node initializes the switch, distributes topology, fences and unfences nodes and updates the SDR.
2. Primary backup node
This node monitors the primary nodes, and if the primary fails, it becomes the primary node and defines the next primary backup.
3. Secondary node
This refers to every node that is neither the primary nor primary backup node.
An SP Switch Board contains eight SP Switch chips that provide connections for each of the nodes to the SP Switch network. Each SP Switch chip has 8
Switch ports which are used for data transmission. Figure 43 on page 224
gives an overview of the connections within an SP Switch board.
© Copyright IBM Corp. 1999
223
Connections to other switch boards
J11
J12
J13
J14
J19
J20
J21
J22
J3
J4
J5
J6
J27
J28
J29
J30
3
2
1
0
SW3
6
7
4
5
1
0
3
2
SW2
6
7
4
5
1
0
3
2
SW1
6
7
4
5
1
0
3
2
SW0
6
7
4
5
7
6
5
4
SW4
0
1
2
3
7
6
5
4
SW5
0
1
2
3
5
4
7
6
SW6
2
3
0
1
Connections from node to switch board
To
Node
Number
Switch
Node
Number
J34 N14 SNN13
J33 N13 SNN12
J32 N10 SNN9
J31 N9 SNN8
J10 N6
J9 N5
J8
J7
N2
N1
SNN5
SNN4
SNN1
SNN0
J26 N3
J25 N4
J24 N7
J23 N8
SNN2
SNN3
SNN6
SNN7
5
4
7
6
SW7
0
1
2
3
J18 N11 SNN10
J17 N12 SNN11
J16 N15 SNN14
J15 N16 SNN15
Figure 43. SP Node Switch Board
The rule for setting primary and primary backup node is as follows: if you have more than one switch board, spread them over different switch boards.
But in any case, do not define the primary and primary backup on nodes that share the same switch chip because if a switch chip fails, the primary backup
node will not be able to become the primary node (see 10.2.2, “The Worm
Figure 44 on page 225 shows the same topic, but focuses on the nodes in the
frame.
224
RS/6000 SP Software Maintenance
Switch Chip 4
all 4 ports in use
Switch Chip 5
only 2 ports in use
NN 15
SNN 14
Thin Node
Chip 7
NN 13
SNN 12
Thin Node
Chip 4
Chip 6
NN 5
SNN 4
Chip 5
NN 16
SNN 15
Thin Node
Chip 7
NN 14
SNN 13
Thin Node
Chip 4
NN 11
SNN 10
Thin Node
Chip 7
NN 9
SNN 8
Chip 4
NN 7
SNN 6
Thin Node
NN 12
SNN 11
Thin Node
Chip 7
NN 10
Thin Node
SNN 9
Chip 4
Wide Node
Wide Node
NN 1
SNN 0
Chip 5
High Node
Switch Chip 7
all 4 ports in use
Switch Chip 6
only 1 port in use
Figure 44. SP Node - Switch Chip Connections
In our configuration, five of the 16 available switch ports are left because only nine nodes are connected to the switch board. They could be used, for example, to connect the SP Switch Router or S70 nodes.
The
Eprimary command without options returns the current settings for the primary and primary backup node. With the old High Performance Switch
(HiPS), the output of the
Eprimary command is only one line telling you which node is currently the primary node. For more information about the differences between HiPS and SPS, refer to RS/6000 SP: Problem
Determination Guide,
SG24-4778, Chapter 4.
The oncoming primary and oncoming primary backup node will become the primary and primary backup node, respectively, after the next
Estart
. In our example, the Switch is currently not up, therefore primary and primary backup are marked as none.
The Switch
225
Eprimary none - primary
1 none
7
- oncoming primary
- primary backup
- oncoming primary backup
If the Switch were up and running, the output would be the following:
1
1
Estart
7
8
- primary
- oncoming primary
- primary backup
- oncoming primary backup
When you set the primary or the primary backup node (or both) to another node, this only changes the oncoming values stored in the SDR. These values will be activated during the next
Estart
.
10.1.2 Checking the Switch Clock Source Settings
With the splstdata -s command, you get information about switch node numbers, which node is connected to which switch chip, over which port it communicates and so on. In the second part of the output, the definition of the switch clock source setting in the SDR is returned.
In a one-frame configuration, there is only one possibility for the switch clock source setting. The clock input has to be 0, meaning the internal clock of this switch board is used.
In a multiframe configuration, there has to be one dedicated board that serves the clock signal to the others. In our example we have a two-frame SP and you can see that the clock input for switch board 1 is 0 (internal clock), but for switch board 2 the clock input is 1. This board is getting the clock signal from switch board 1.
226
RS/6000 SP Software Maintenance
# splstdata -s
List Node Switch Information switch switch switch switch switch node# initial_hostname node# protocol number chip chip_port
----------------------------------------------------------------
1 sp4n01
5 sp4n05
6 sp4n06
7 sp4n07
0
4
5
6
IP
IP
IP
IP
1
1
1
1
5
5
5
6
3
1
0
2
8 sp4n08
9 sp4n09
13 sp4n13
14 sp4n14
15 sp4n15
17 sp4n17
19 sp4n19
7
8
12
13
14
0
2
IP
IP
IP
IP
IP
IP
IP
1
1
1
1
1
2
2
6
4
4
4
7
5
6
2
3
0
3
3
1
0 switch frame slot switch_partition switch clock switch number number number number type input level
----------------------------------------------------------------
1 1 17 1 129 0
2 2 17 1 129
1
switch_part number topology filename primary name arp switch_node enabled nos._used
-------------------------------------------------------------------------
1 expected.top.an
sp4n01 yes no
Since this is just the definition of the clock source setting in the SDR, it may happen that the switch clock signal is missing on the board or on a switch adapter. In this case, you are not able to start the Switch and you will find an entry in the error report. To set the clock source again, look for the appropriate
Eclock file fitting to your configuration and issue the command shown in
# Eclock -f /etc/SP/Eclock.top.1nsb.0isb.0
1 node switch board:
N ode switch boards have direct connections to the nodes.
as opposed to
0 interm ediate sw itch board:
Interm ediate switch boards are used to connect more than five node switch boards.
Figure 45. Eclock
This command should only be used when necessary, because it stops the entire Switch.
The Switch
227
10.2 Initialize the Switch
The start of the switch interface can be divided into three parts:
1. The
Estart command on the CWS
2. The
Estart_sw script running on the oncoming primary node
3. Computing of routes and downloading to the switch adapter on each node
In Figure 46 we can see what is going on during switch initialization.
On the Control
Workstation
$ Estart
Which is the active system partition?
Which node is primary node?
Which kind of tbx adapter is in use
$ dsh -w <primary> /usr/lpp/ssp/css/Estart_sw
What has to work?
SDRGetObjects and Kerberos
On the oncoming primary node
Estart_sw
Get the topology info from SDR and distribute it to the other nodes
Request Worm daemon to check the switch
(cabling, unfenced nodes, and so on)
Compute routes for primary node
Download route table to the adapter
Distribute topology changes
Update switch_responds values in SDR
What has to run?
Worm daemon
Nodes unfenced
On each unfenced node
Update topology
Compute routes for this node
Download routes to adapter
Figure 46. Switch Initialization
These three parts have different prerequisites and they also run on different nodes. Verifying that the
Estart command will bring the Switch up means knowing which script is running where and what is needed for successful execution. In the following section we discuss this in detail.
10.2.1 Estart on the Control Workstation
Usually you start the switch with the
Estart command from the CWS. It asks the SDR what the active system partition is, which nodes are oncoming primary and oncoming primary backup, and then redirects the
Estart command to the oncoming primary node. This is done with the distributed shell command:
# dsh -w <oncoming_primary_name> /usr/lpp/ssp/css/Estart_sw
228
RS/6000 SP Software Maintenance
If your Kerberos setup is not correct,
Estart fails. You will get a command response like the one in the following screen:
# Estart
Estart: Oncoming primary != primary, Estart directed to oncoming primary rshd: Kerberos Authentication Failed: Access denied because of improper credenti als.
/usr/lpp/ssp/rcmd/bin/rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure.
trying normal rsh (/usr/bin/rsh) rshd: 0826-813 Permission is denied.
Estart: 0028-028 Fault service worm not up on oncoming primary node, cannot Est art : sp4n01.
This error message is misleading because the Worm daemon on the oncoming primary node
is
up and running. The only problem is the Kerberos authentication. The CWS was not able to start
Estart_sw on the oncoming primary node via dsh. Estart is being issued to the primary node: sp4n01.
When a kerberized remote shell command fails, the standard rsh command will be tried. In other words, if you have a /.rhosts file on the oncoming primary node allowing the root user from the CWS to set up remote shell commands,
Estart will be successful even if Kerberos does not work. Within the Kerberos error messages the
Estart reports how many nodes have joined the Switch.
# Estart
Estart: Oncoming primary != primary, Estart directed to oncoming primary rshd: Kerberos Authentication Failed: Access denied because of improper credenti als.
/usr/lpp/ssp/rcmd/bin/rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure.
trying normal rsh (/usr/bin/rsh)
Estart:0028-06 Estart is being issued to the primary node: sp4n01
Switch initialization started on sp4n01.
Initialized 11 node(s).
Switch initialization completed.
rshd: Kerberos Authentication Failed: Access denied because of improper credenti als.
/usr/lpp/ssp/rcmd/bin/rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure.
trying normal rsh (/usr/bin/rsh)
Estart finds out which node is the oncoming primary node simply by asking the SDR on the CWS. Then Estart executes a dsh command running
Estart_sw on the primary node. It is possible to issue
Estart on every node that belongs to same partition as long as you have a valid Kerberos ticket for this node.
The Switch
229
Even if you type
Estart_sw directly on the oncoming primary node, the switch initialization will be successful, but keep in mind that in any case the SDR has to be available.
10.2.2 The Worm Daemon
The switch network function is supported by service software known as the
Worm
. This service software is coupled with the
Route Table Generator
, which examines the state of the Switch as determined by the Worm execution, and generates valid routes from every node to any other node.
These two components are implemented in the fault-service daemon
(fault_service_Worm_RTG_SP) which runs on every node in an SP and is central to the functioning of the switch. A diagram illustrating the relationships between the switch communication subsystem components is shown in
send/rec eiv e
Library
Sw itch
C omm ands
(Estart, E fenc e, e tc .)
fault_service
daem on g et fs_req ue st
U ser sp ace
K ern el sp ace
faul t_service wor k queue
CS S Device
D river
T Bx Adapter
Figure 47. Switch Subsystem Component Architecture
The Worm daemon plays a key role in the coordination of the switch network.
It is a non-concurrent server, and therefore can only service one switch event to completion before servicing the next. Examples of switch events (or faults) include switch initialization, Switch Chip error detection/recovery and node
230
RS/6000 SP Software Maintenance
(Switch Port) failure. These events are queued by an internal Worm process on the work queue. The events are serviced by the Worm when a get_fs_request system call is made to satisfy a single work queue element.
The Worm daemon is started from the /etc/inittab by the rc.switch
script; it runs on all nodes. While the Worm daemon on every node is the same, any node is able to be or become primary or primary backup node.
Use the following command to check that Kerberos is working and the Worm daemons are running:
# dsh -a ps -e |grep Worm sp4n01: 13588 sp4n05: 13530
-
-
0:00 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP sp4n06: 16480 sp4n07: 12326 sp4n08: 11954 sp4n09: 13824
-
-
-
-
0:00 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP sp4n13: 13934 sp4n14: 14664 sp4n15: 6526 sp4n17: 11948 sp4n19: 12574
0:00 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP
0:05 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP
0:00 fault_service_Worm_RTG_SP
If the Worm daemon is not running on a node, you can restart it by issuing the rc.switch
script directly either on the node itself or from the CWS with a dsh
:
# dsh -w <node_name> /usr/lpp/ssp/css/rc.switch
Sometimes it may happen that even after a restart of a node’s Worm daemon, this node is not able to join the Switch, particularly if you had to set the switch clock source with the
Eclock command. In this case you have to use a more powerful command to unload the device driver from the switch adapter, then load it again and start the Worm daemon (so rc.switch
is not necessary), as follows:
$ dsh -w <node_name> /usr/lpp/ssp/css/css_restart_node
10.2.3 Estart_sw Running on the Oncoming Primary Node
The oncoming primary node consults the SDR to get information about the current system partition name and retrieves the appropriate topology, the oncoming primary backup node and the unfenced nodes belonging to its partition. It then distributes this topology file to these nodes.
The Switch
231
Then the Worm is used to send packets from the primary node to each switch chip to check for cabling and to find out the routes to the primary node. Each unfenced node is sent service packets to initialize, to identify itself and to verify the switch address and readiness. The primary node updates its device database (containing the global information about the topology) in accordance with the findings and feedbacks. After the computation of the switch route table, it downloads it to the adapter and distributes the deltas of the device database to the other nodes.
10.2.4 What Happens on the Non-Primary Nodes
The unfenced non-primary nodes receive the topology file from the primary node. Then they get the deltas between the “official” SDR topology and the topology produced by the primary node (excluding, for example, the fenced nodes or the nodes that could not acknowledge to the services packet from the primary, and so on).
Based on this information, each node builds its own device database with the information on which nodes are currently ready to join the switch. Then, on each node, the Routing_Table_Generator-part (RTG) of the Worm computes the switch route table, including all the paths to the other nodes, and downloads it to the switch adapter.
Finally, the primary node updates the switch_responds class in the SDR and sets the primary and primary backup values from
none
to the appropriate nodes.
10.2.5 Fenced and Unfenced Nodes
Since Estart_sw on the oncoming primary node only cares about the unfenced nodes, fenced nodes will not be initialized during the Estart (except
information about fenced nodes from the SDR by the following command:
232
RS/6000 SP Software Maintenance
# SDRGetObjects switch_responds node_number switch_responds autojoin
1
5
1
1
0
0
6
7
8
9
1
0
1
0
0
0
0
0
13
14
15
17
19
1
1
0
1
1
0
0
0
0
0 isolated adapter_config_status
0 css_ready
0 css_ready
0 css_ready
1 css_ready
0 css_ready
1 css_ready
0 css_ready
0 css_ready
1 css_ready
0 css_ready
0 css_ready
Currently the switch is up and running but nodes 7, 9 and 15 are isolated
(perhaps they are fenced) and logically have no switch responds. For example, when you fence a node by typing
Efence 8 on the CWS, the primary node excludes this node from the switch interface, then updates its own routing table and requests all the other nodes to update their routing tables by deleting the paths to node 8. Finally, the primary node sets the isolated value in the SDR for node eight to 1. The same principle (vice versa) happens when you integrate a node by the
Eunfence command.
Note
: It is not possible to fence or unfence the primary or primary backup node.
10.2.6 Fenced with and without the autojoin Flag
Usually autojoin means the node will join the switch automatically. The autojoin value depends on the isolated value, because it is only possible to have the autojoin flag set when the node is fenced (say the isolated value equals 1.
So when will this value be evaluated? Assume the Switch is up but node 10 and 11 are already fenced, and now we want to fence two more nodes with the autojoin flag.
# Efence 8 9 -autojoin
This command isolates nodes 8 and 9 from the Switch and sets the isolated and autojoin values in the SDR for these nodes to 1. The switch responds of nodes 8 and 9 are off and the values in the SDR look like the following:
The Switch
233
# SDRGetObjects switch_responds node_number switch_responds autojoin
1
5
1
1
0
0
6
7
8
9
1
1
0
0
0
0
1
1
10
11
12
13
14
15
0
0
1
1
1
1
0
0
0
0
0
0 isolated adapter_config_status
0 css_ready
0 css_ready
0 css_ready
0 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
0 css_ready
0 css_ready
0 css_ready
0 css_ready
And what is the related spmon -d output?
9
10
11
12
5
6
7
8 spmon -d
[...]
--------------------------------- Frame 1 -------------------------------------
Frame Node Node Host/Switch Key
Slot Number Type Power Responds Switch
Env
Fail
Front Panel
LCD/LED
LCD/LED is
Flashing
-------------------------------------------------------------------------------
1 1 high on yes yes normal no LCDs are blank no
13
14
15
5
6
7
8
9
10
11
12
13
14
15 thin thin thin thin thin thin thin thin thin thin wide on yes yes on yes yes on yes yes on yes off on yes off on yes yes on on on yes yes yes yes yes yes normal no LEDs are blank no normal no LEDs are blank no normal no LEDs are blank no on yes fence normal no LEDs are blank no on yes fence normal no LEDs are blank no normal normal normal normal normal normal no no no no no no
LEDs are blank
LEDs are blank
LEDs are blank
LEDs are blank
LEDs are blank
LEDs are blank no no no no no no
Up to PSSP 2.4, when nodes are fenced/isolated and on autojoin, spmon -d returns the switch responds as
fence
.
Who cares about this value?
•
Estart
• rc.switch
As described in 10.2.3, “Estart_sw Running on the Oncoming Primary Node” on page 231,
Estart does not integrate nodes that are currently fenced.
Exception: the nodes that are fenced and set to autojoin will join the Switch when
Estart is executed.
234
RS/6000 SP Software Maintenance
When a node is rebooted and rc.switch
is executed, first the Worm daemon is started and tries to talk to the Worm daemon of the primary node. If the node is only fenced, nothing will happen. However, if in addition, the autojoin value is set to 1, an
Eunfence command for this node is executed by the primary and this will join the Switch. As soon as the switch_responds value is set to 1, the fenced and the autojoin values are set back to 0.
10.2.7 When Do You Have to Fence a Node?
While it was necessary on the old High Performance Switch to fence a node before shutting it down, on the new SP Switch this is unnecessary. The SP
Switch introduces the following changes: a backup node for the primary and no global failures on the switch interface. With the SP Switch, there is no problem shutting down a node that is currently on the Switch without explicitly fencing it because the Worm daemon of the primary node does this for you.
As soon as you shutdown or power off a node, this node get fenced.
This is also true for the primary node: the primary backup node immediately becomes the primary node, and after finishing the recovery procedure, the old primary node is set to
isolated
in the SDR.
10.2.8 Modifying the Switch Values in the SDR
There are two situations where you are forced to modify the SDR manually.
Assume the worst case: all nodes are on isolated in the SDR, or a few nodes are isolated and you want to have one of these nodes as primary. In this case you will not be able to start the Switch. But as long as the Switch is not up, the isolated entry in the SDR is just a value that can be changed manually because these entries will not be evaluated before
Estart
.
By the way
...what criteria do you use to say the Switch is up and running?
It is up as soon as the primary and primary backup nodes have initialized the switch interface. After that, all the other nodes can be integrated by using the
Eunfence command without any
Estart
.
Before changing the SDR manually, make a backup by using the following command:
# SDRArchive
This command will create a tar-file of the SDR in the subdirectory
SDRArchive command.
The Switch
235
[sp4en0:/]# SDRArchive
SDRArchive: SDR archive file name is /spdata/sys1/sdr/archives/backup.99109.1320
[sp4en0:/]#
Check the Switch Responds by using the
SDRGetObjects switch_responds command. The next screen shows example output from the command:
# SDRGetObjects switch_responds node_number switch_responds autojoin
1
5
0
0
0
0
6
7
8
9
0
0
0
0
0
0
0
0
10
11
12
13
14
15
0
0
0
0
0
0
0
0
0
0
0
0 isolated adapter_config_status
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
To get the Switch up, at least the isolated value of primary and primary backup must be 0 in the SDR. The primary and primary backup are nodes 1
and 6. Figure 48 on page 237 shows the two commands for changing the
isolated values of these nodes.
236
RS/6000 SP Software Maintenance
# SDRChangeAttrValues switch_responds node_number==1 isolated=0 this is the selector this is the new setting
# SDRChangeAttrValues switch_responds node_number==6 isolated=0
Figure 48. SDRChangeAttrValues
After setting these new values,
SDRGetObjects switch_responds looks like the following:
# SDRGetObjects switch_responds node_number switch_responds autojoin
1
5
0
0
0
0
6
7
8
9
0
0
0
0
0
0
0
0
10
11
12
13
14
15
0
0
0
0
0
0
0
0
0
0
0
0 isolated adapter_config_status
0
css_ready
1 css_ready
0
css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
1 css_ready
If Kerberos is working and the Worm daemon is running on nodes 1 and 6,
Estart will succeed for these two nodes.
# SDRChangeAttrValues switch_responds isolated=0
The preceding command will change every isolated attribute value found in the SDR class
switch_responds
, as follows:
The Switch
237
# SDRChangeAttrValues switch_responds isolated=0 node_number switch_responds autojoin isolated
1
5
0
0
0
0 adapter_config_status
0 css_ready
0 css_ready
6
7
8
9
0
0
0
0
0
0
0
0
0 css_ready
0 css_ready
0 css_ready
0 css_ready
10
11
12
13
14
15
0
0
0
0
0
0
0
0
0
0
0
0
0 css_ready
0 css_ready
0 css_ready
0 css_ready
0 css_ready
0 css_ready
The reason for using the
SDRArchive before using this
SDRChangeAttrValues command is obvious: if you look at the command syntax for the
SDRChangeAttrValues command, you will notice that the node gets selected by using a node_number==6
. Using the double equal sign (==) is very important; if you use a single equal sign (=) instead,
all
the isolated values of all nodes will be set to the new value. If this happens, use your SDR archive and restore it. Here is an example how to restore the SDR:
# SDRRestore <backup filename>
For more information on this subject, refer to 7.2.2, “Restoring the SDR” on page 149.
10.3 Monitoring the Switch
With PSSP 2.3, the Emonitor daemon has been introduced. This daemon, which is managed by the system resource controller, only runs on the CWS. It is not started automatically; therefore, you have to edit the appropriate configuration file and then start the Switch with the
Estart -m command.
Currently the switch is up and running on our system and started without the monitor option. Let us check the SP-related subsystems, as follows:
238
RS/6000 SP Software Maintenance
# lssrc -a | grep sp4en0 hags.sp4en0
hags hagsglsm.sp4en0
hags haem.sp4en0
haem haemaixos.sp4en0 haem hats.sp4en0
hats hr.sp4en0
pman.sp4en0
hr pman pmanrm.sp4en0
sdr.sp4en0
pman sdr
Emonitor.sp4en0
emon
18834 active
19608 active
21156 active
20902 active
20134 active
21418 active
21674 active
20648 active
11360 active inoperative
As we see, the Emonitor.sp4en0 subsystem is inoperative.
10.3.1 How to Activate the Emonitor Function
First you have to edit the /etc/SP/Emonitor.cfg file to indicate which nodes should be monitored. You specify these nodes at the end of the file. In the following screen, we are looking at the end of this file:
7
8
9
10
# vi /etc/SP/Emonitor.cfg
<Shift G>
# example entries (remove # to use
#5 # This will set node 5 to be monitored
#1 # This will set node 1 to be monitored
1
5
6
We added nodes 1, 5, 6, 7, 8, 9 and 10 to this file. Next we have to start the
Switch with the monitor option:
# Estart -m
As we see, the Emonitor subsystem is now activated.
# lssrc -a | grep emon
Emonitor.sp4en0
emon 23556 active
10.3.2 How Emonitor Works
The
Estart -m command starts the Switch in monitoring mode. Contrary to what you might expect, Emonitor checks the nodes’ host_responds values in
The Switch
239
the SDR, not the switch_reponds. As soon as the host_responds of a node turns to 0 (red), the Emonitor daemon detects it and when the host_responds for this node returns, it integrates this node by the
Eunfence command (see
host responds
n15 n13 n09 n05 n01
switch responds
n15 n13 n09 n05 n01
host responds
n15 n13 n09 n05 n01
switch responds
n15 n13 n09 n05 n01
node 09 fails host responds
n15 n13
n09
n05 n01
switch responds
n15 n13 n09 n05 n01
primary=1 primary backup=13
# Estart -m node 09 reboots switch_responds is up for 5 nodes.
host responds
n15 n13 n09 n05 n01
switch responds
n15 n13 n09 n05
180 seconds
n01
host responds
n15 n13 n09 n05 n01
switch responds
n15 n13 n09 n05 n01
Emonitor detects host_responds for node09 is down.
Emonitor detects host_responds for node09 is up again.
After 180 seconds, Emonitor will unfence node 09.
Node 09 has joined the Switch again.
Figure 49. Emonitor
That means Emonitor will not recover from a “death” of a node’s Worm daemon because there is no impact on the host_responds value of this node.
Furthermore, Emonitor is not able to restart the Worm daemon.
If a node is under control of Emonitor and this node crashes, when it comes up again, first the rc.switch
script starts the Worm daemon. As soon as the host_responds for this node turns to 1 (green), Emonitor detects it, waits for a time interval of 180 seconds, then checks in the SDR if the autojoin flag is set
240
RS/6000 SP Software Maintenance
for this node. If yes, Emonitor does nothing because autojoin will be covered from the Worm daemon. If no, Emonitor executes
Eunfence for this node and
reintegrates it to the Switch. Figure 50 shows the cases and the recovery
actions.
Event / Action
Node fails and comes up again
Worm daemon of a node dies
Efence a node manually
Efence a node manually and reboot it
Efence a node manually with the autojoin flag
Efence a node with the autojoin flag and reboot it
Reboot the Control
Workstation host_responds
Turns from 1 to 0 and after the reboot to 1
Stays on 1
Stays on 1
Turns from 1 to 0 and after the reboot to 1
Stays on 1
Turns from 1 to 0 and after the reboot to 1
Stays on 1 switch_responds
Turns from 1 to 0 and after the reboot stays on 0
Turns from 1 to 0
Turns from 1 to 0
Turns from 1 to 0
Turns from 1 to 0
Turns from 1 to 0 and after the Worm started to 1 due to autojoin
Stays on 1
Emonitor
Emonitor will unfence the node
Emonitor will not recover
Emonitor will not recover
Emonitor will not recover since there was a dedicated fence command for the node
Emonitor will not recover
Emonitor will not react because of the autojoin flag
Emonitor will become inactive
Figure 50. Emonitor Cases
10.3.3 The Emonitor.log File
When you start the Switch with the monitor flag, the Emonitor.log file is created under /var/adm/SPlogs/css. This file provides you with the information about the events that have been detected by the Emonitor daemon and also the decisions about the action. In the following example, we fenced node 7 manually and rebooted it. Due to our
Efence command,
Emonitor decided not to unfence it when the host_responds was back.
The Switch
241
# cat /var/adm/SPlogs/css/Emonitor.log
Emonitor: Monitoring host_responds Wed Apr 14 09:44:00 EDT 1999 partition sp4en0
.
Emonitor: frame1 node10 came up. Wed Apr 14 09:49:36 EDT 1999 partition sp4en0
Emonitor: Timer popped .... Wed Apr 14 09:50:16 EDT 1999 partition sp4en0
Emonitor: Looking For Nodes: partition sp4en0
1[ OK ]
5[ OK ]
6[ OK ]
7[Autounfence in effect .. unfence NOT required]
8[ OK ]
9[ OK ]
10[ OK ]
11[ OK ]
12[ OK ]
13[ OK ]
14[ OK ]
15[ OK ]
Emonitor: Timer action completed partition sp4en0.
10.3.4 The Time Interval
Since
Estart -m does not permit any other options, you cannot manipulate the time interval between the detection of returning host_responds of a node and the
Eunfence command issued by Emonitor. The time interval is 180 seconds.
While we do not recommend it, if you think this time period is too long for your system, you can change this value in the /usr/lpp/ssp/bin/Emonitor file by editing the entry $EstartSTALL=180 to $EstartSTALL=<seconds>. Bear in mind that this could be changed again after applying PSSP PTFs.
10.3.5 Monitoring the Switch with PSSP 3.1
With PSSP 3.1, the
cssadm
daemon has been introduced for monitoring the
Switch. This daemon is managed by the system resource controller and gets its information from Event Management. For further information about using this daemon, refer to
Understanding and Using the SP Switch,
SG24-5161
.
242
RS/6000 SP Software Maintenance
Appendix A. Downloading PTFs for RS/6000 and SP
IBM provides a number of mirrored sites on the Internet where you may freely download AIX-related fixes. While not every AIX-related fix is available, we are constantly adding to these anonymous FTP servers. Though we do not guarantee all fixes will be immediately made available, we usually update the servers within 24 hours of tape distribution.
The current anonymous FTP servers are:
Country
United States
United Kingdom
Canada
Germany
Japan
Hostname
service.boulder.ibm.com
ftp.europe.ibm.com
rwww.aix.can.ibm.com
www.ibm.de
fixdist.yamato.ibm.co.jp
IP Address
198.17.57.66
193.129.186.2
204.138.188.126
192.109.81.2
203.141.89.41
To download the fixes on these servers you can use either the ftp command, or a Web browser, or the AIX-exclusive tool called FixDist.
This appendix discusses downloading fixes using the FixDist tool and from the Web.
A.1 FixDist Tool
The FixDist tool is a Web application that provides discrete downloads and delivers all required images with just one click. It can also keep track of fixes you have already downloaded, so you can download smaller fix packages the next time you need them.
Although the location of the Web pages varies from country to country, the most common on is: http://service.boulder.ibm.com/support/rs6000
The FixDist tool and the user's guide are located in the anonymous FTP directory /aix/tools/fixdist, or they can be viewed online with a Web browser.
© Copyright IBM Corp. 1999
243
A.1.1 Downloading the FixDist Tool
To download through the Web:
1. Go to the AIX Technical Support home page.
http://service.boulder.ibm.com/support/rs6000
2. Click
Downloads
.
3. Click
Tools
.
4. Download the tool and guide into your /tmp directory by clicking the appropriate links.
To download using the ftp command:
1. Change to your /tmp directory.
2. Get the tool and the PostScript user's guide. The text version of the user's guide comes with the tool and once installed is found as
/usr/lpp/fixdist/fixdist.txt.
You can use any hostname from the list in the previous Electronic Fix
Distribution section:
# ftp service.software.ibm.com
> login: anonymous
> password: "email" (example: johndoe@)
> bin
> cd /aix/tools/fixdist
> get fd.tar.Z
(FixDist tool in compressed tar format)
> get fixdist.ps.Z
> quit
(User guide in compressed PostScript)
A.1.2 Installing the FixDist Tool
The following procedure must be performed as the root user.
Install the tool into the /usr file system. You
must
install it from the / (root) directory to access the online help and preserve your .netrc file.
# cd / (change directory to the root)
# zcat /tmp/fd.tar.Z |tar -xpvf (uncompress and untar)
244
RS/6000 SP Software Maintenance
A.1.3 Starting and Configuring the FixDist Tool
The database and temporary files that FixDist manipulates on your RS/6000 are very sensitive to file permissions. Because of this, we recommend that you consistently run FixDist using the same user ID you start with.
To start FixDist, enter:
# fixdist
The file /usr/bin/fixdist is a script that calls /usr/lpp/fifdist/fixdistm for CDE and
AIXwindows users. If you have a dumb terminal, FixDist will call
/usr/lpp/fixdist/fixdistc.
Read the user's guide for detailed configuration information. The basic configuration tasks are: specifying an IBM server; specifying a location on your RS/6000 where you want to put the fixes; downloading the fix database from the IBM server to your RS/6000.
A.2 TapeGen Tool
TapeGen is a companion service tool provided with FixDist that enables you to create a stacked tape containing SMIT-installable fixes. You create a stack file that lists all the images you want stacked onto a tape and TapeGen does the rest.
A.3 Downloading Fixes from the Web
This section contains the steps for downloading fixes from the Web. It is applicable to AIX Versions 3 and 4.
The steps for downloading are as follows:
1. Open the following URL: http://service.software.ibm.com/aix.us/
2. Select the database to use (AIX Version 3/AIX Version 4/CATIA for AIX).
3. Select the search options (APAR NUMBER/FILESET NAME/PTF
NUMBER/APAR ABSTRACT).
4. Type in the fix number desired (APAR NUMBER/PTF NUMBER) and click
submit
.
5. Use your left mouse button to click on the needed updates.
6. Select your Maintenance level.
Downloading PTFs for RS/6000 and SP
245
7. Click [
Get Fix Package
].
8. Use your right mouse button to click the image needed and select the option to
Save Link as ... .
9. Select the directory to save the file and use your left mouse button to click
[
Save
].
246
RS/6000 SP Software Maintenance
Appendix B. Integrating an S70 System into an SP
PSSP version 3.1 provides support for integrating the S70 system as a node to SP. In this appendix we discuss how to integrate S70 to SP.
B.1 Overview
The S70 node in the SP has the following characteristics:
• The node is not physically located in the SP frame. It will be considered as a non-SP frame node. This frame will be assigned a frame number containing a single node. Therefore, a S70 frame will occupy 16 node numbers, of which only one will ever be used.
• The node is connected to the SP administrative network (also referred to as the SP LAN).
• The node is attached to the switch with a TB3PCI adapter.
• There is no frame or node supervisor card. The S70 node will be connected to the SP Control Workstation using two serial ports. One tty port is used for the hardware_protocol, and the second tty port is for establishing the serial connection using the s1term command.
• The S70 node will contain all the PSSP code, and be managed and used in all the same ways as standard SP nodes currently are.
• The S70 frame cannot be the first frame; an SP frame must always be the lowest numbered frame.
B.2 Installation
Installation of the S70 node in SP is the same as any other node, after you set up the physical connections between the CWS and the S70 system. As far as the installation steps are concerned, the only difference you will find is when configuring the frame information. For incorporating the S70 node (which, as previously mentioned, is considered as a non-SP frame), the Enter Database
Information SMIT panel has a new option; the Non-SP Frame Information menu. This menu has to be used for configuring the S70 frames.
Lab Scenario:
We have a single frame SP system with four High nodes and a Control
Workstation. In this lab, we will integrate the S70 system to our existing SP
environment. Figure 51 on page 248 shows the configuration of the SP and
the S70 node.
© Copyright IBM Corp. 1999
247
SP
Frame 1 k48n13
Slot 13 k48n09
Slot 9 k48n05
Slot 5 k48n01
SP switch
Slot 1
S70 (Raven)
Frame 2
Slot 1
Switch cable
Control Workstation k48raven
Ethernet
/dev/tty0
/dev/tty1
/dev/tty2
Ethernet
Figure 51. Lab Scenario of Integrating S70 to an SP
The steps to integrate the S70 system to the existing SP are as follows:
Step 1: Connect the S70 system to the Control Workstation
• Connect two RS232 cables from the tty ports of the S70 system to the SP
Control Workstation on ports tty1 and tty2.
• Connect the Ethernet port to the SP LAN.
• If you have the switch adapter TB3PCI present on the S70 system, connect it to the SP switch.
Step 2: Power on the S70 system
If the S70 is not already powered on, turn it on.
Step 3: Configure the RS-232 Control line
Configure the tty ports /dev/tty1 and /dev/tty2, which are connected to the
S70 system from the CWS by using the mkdev command as follows:
This example configures the second tty port in the CWS.
# mkdev -c tty -t tty -s rs232 -p sa1 -w s2
248
RS/6000 SP Software Maintenance
This example configures port 0 of the 16-Port Asynchronous Adapter
EIA-232:
# mkdev -c tty -t tty -s rs232 -p sa2 -w 0
Step 4: Enter frame information and initialize the SDR
In our lab scenario, the S70 system will be the second frame. Port tty1 will be used for the s1term, and port tty2 will be used for the hardware protocol. The hardware_protocol for S70 nodes is Service and
Manufacturing Interface (SAMI).
• Add the frame definition of the S70 node in the SP as follows:
# smitty enter_data
Select
Non-SP Frame Information.
Non-SP Frame Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Start Frame
* Frame Count
* Starting Frame tty port
* Starting Switch Port Number s1 tty port
Frame Hardware Protocol
Re-initialize the System Data Repository
F1=Help
Esc+5=Reset
F9=Shell
F2=Refresh
F6=Command
F10=Exit
F3=Cancel
F7=Edit
Enter=Do
[2]
[Entry Fields]
[1]
[/dev/tty2]
[1]
[/dev/tty1]
[SAMI] yes
F4=List
F8=Image
Step 5: Verify frame information
Check the frame configuration by using the command splstdata -f to verify that the S70 system is added as frame 2. The output of the command will look as follows:
+
#
#
#
[k48s][/]> splstdata -f
List Frame Database Information frame# tty s1_tty frame_type hardware_protocol
---------------------------------------------------------------------------
1
2
/dev/tty0
/dev/tty2
""
/dev/tty1 switch
""
SP
SAMI
Integrating an S70 System into an SP
249
Step 6: Node object population
Since there is no node or frame supervisor card on the S70 node, there will be no supervisor-to-supervisor communication to tell hardmon to trigger a logging event. For S70 nodes, population of the node object will be triggered by the spframe command executed in the CWS. The S70 node does not need any microcode download as it does not have any frame or node supervisor cards.
When we update the frame definition in the SDR, the node k48raven.ppd.pok is configured in frame 2, slot 1 and the node number is
17. Check this using the command splstdata -n
. The screen output will look as follows: k48s][/]> splstdata -n
List Node Configuration Information node# frame# slot# slots initial_hostname reliable_hostname dcehostname default_route processor_type processors_installed description
------------------------------------------------------------------------------
1 1 1 4 k48n01.ppd.pok.i
k48n01.ppd.pok.i
""
5
9
9.114.88.93
1 5
9.114.88.93
1 9
4
4
MP k48n05.ppd.pok.i
MP k48n09.ppd.pok.i
1 "" k48n05.ppd.pok.i
1 "" k48n09.ppd.pok.i
""
""
13
9.114.88.93
1 13
9.114.88.93
2 1 17
[k48s][/]>
9.114.88.93
MP 1 ""
4 k48n013.ppd.pok.
k48n013.ppd.pok.
""
MP 1 ""
1 k48raven.ppd.pok
k48raven.ppd.pok
""
MP 1 ""
The frame and node information in Perspectives will look as shown in
250
RS/6000 SP Software Maintenance
Figure 52. Frame/Node Information in Perspectives
Step 7: Enter the required node information:
This step adds IP address-related information to the node object in the
SDR. This information is used for node customization and configuration.
Configure the S70 node Ethernet information and the default route as follows:
Integrating an S70 System into an SP
251
SP Ethernet Information
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]
Start Frame
Start Slot
Node Count
OR
Node Group
OR
Node List
* Starting Node's en0 Hostname or IP Address
* Netmask
* Default Route Hostname or IP Address
Ethernet Adapter Type
Duplex
Ethernet Speed
Skip IP Addresses for Unused Slots?
[BOTTOM]
F1=Help
Esc+5=Reset
F9=Shell
F2=Refresh
F6=Command
F10=Exit
F3=Cancel
F7=Edit
Enter=Do
[2]
[1]
[1]
[Entry Fields]
[]
[]
[9.114.88.77]
[255.255.255.0]
[9.114.88.93] bnc half
10 yes
F4=List
F8=Image
+
+
+
+
+
#
#
#
Step 8: Acquire the Hardware Ethernet address
The next step is to acquire the hardware Ethernet address of the Ethernet adapter in the S70 system using the command sphrdwrad
. If you already know the hardware MAC address of the Ethernet adapter in the S70 system, put the address in the file /etc/bootptab.info to speed up the process. Keep in mind that to get the hardware Ethernet address, the node will be powered off and on. This example gets the hardware Ethernet address for the S70 node in frame 2, slot 1 and node count 1.
k48s][/]> sphrdwrad 2 1 1
Acquiring hardware ethernet address for node 17 from /etc/bootptab.info
Step 9: Verify that the Ethernet addresses were acquired
Check that the Ethernet hardware addresses are placed in the SDR node object by using the following command:
252
RS/6000 SP Software Maintenance
[k48s][/]> splstdata -b -l 17
List Node Boot/Install Information node# hostname hdw_enet_addr srvr response install_disk last_install_image last_install_time next_install_image lppsource_name pssp_ver selected_vg
-------------------------------------------------------------------------------
17 k48raven.ppd.pok
020701232D9C
0 install hdisk0 initial
PSSP-3.1
initial rootvg bos.obj.ssp.432
aix432
[k48s][/]>
Step 10: Configure additional adapters in the S70 node
Depending on the system configuration you have in the S70 system, configure all the network and switch adapters now by using the command spadaptrs or by using SMIT/Perspectives.
Step 11: Configure the initial hostname for the S70 node
Configure the initial hostname for the S70 node using the command sphostnam
. This command indicates that the hostname is the fully qualified form of the hostname for the en0 adapter, for the frame 2, start slot 1 and node count of 1.
# sphostnam -a en0 -f long 2 1 1
Step 12: Refresh system partition-sensitive subsystems
To refresh the subsystems, issue the following command on the Control
Workstation:
# syspar_ctrl -r -G
This command refreshes all the subsystems in each partition.
Step 13: Install the system image in the S70 node
Select the image and PSSP version to be installed by using the spbootins and spchvgobj commands, or by using SMIT/Perspectives.
For example, to install the image bos.obj.ssp.432 and PSSP-3.1 on the
S70 node, use the following commands:
Integrating an S70 System into an SP
253
[k48s][/]> spchvgobj -r rootvg -i bos.obj.ssp.432 -p PSSP-3.1 -v aix432 2 1 1 spchvgobj: Successfully changed the Node and Volume_Group objects for node number volume group rootvg.
spchvgobj: The total number of changes successfully completed is 1.
spchvgobj: The total number of changes which were not successfully completed is 0
[k48s][/]>
[k48s][/]> spbootins -s no -r install 2 1 1
[k48s][/]>
[k48s][/]> setup_server
Step 14: Network-boot the S70 node
First open the LED/LCD display from Perspectives to monitor the installation codes. The LCD display screen has been changed to show the
Raven node. This is shown in Figure 53.
Figure 53. LED/LCD Display for the New S70 Node
To monitor the messages displayed during the time of installation, start the tty console for the S70 node by using the command:
# s1term 2 1
Now initiate the network boot on the S70 node and monitor the tty and
LCD display to verify that the AIX and PSSP software is getting installed in the node.
254
RS/6000 SP Software Maintenance
Appendix C. Secondary Authentication Server on an SP Node
authentication server for the SP kerberos realm. But in case you want to have a secondary server and you are forced to install it on one of the SP nodes, this appendix provides the necessary workaround.
C.1 On the Appropriate Node
You have to first run setup_authent on the appropriate node. This command checks the node_number value in the CuAt ODM. If this value is 0 (for a
Control Workstation), or if it does not exist at all, the setup of the secondary server will be started. Otherwise, the setup_authent script exits. The steps described in the following sections then have to be done.
C.2 On the Control Workstation
• Edit the /etc/krb.conf file on the primary authentication server and distribute it.
C.3 On the Node That Will Become the Secondary Authentication Server
• Make a copy of CuAt ODM.
• Write the node_number stanza of the ODM to a file.
• Delete the node_number stanza in ODM.
• Run setup_authent.
• Add the node_number stanza back to the ODM.
C.4 On the Control Workstation
Add an entry to the /etc/krb.conf file for the secondary server as described in
8.12.2, “/etc/krb.conf File Modification and Distribution” on page 199. Then
distribute this modified /etc/krb.conf to all nodes belonging to the realm:
# pcp ’-a’ /etc/krb.conf
© Copyright IBM Corp. 1999
255
C.5 On the Future Secondary Authentication Server
Just to have a backup, make a copy of the CuAt ODM:
# cp /etc/objrepos/CuAt /etc/objrepos/CuAt.bak
# odmget -q name=sp CuAt > /tmp/node_num_stanza
# odmdelete -o CuAt name=sp
Now set up the secondary server with the setup_authent command (see also
8.12.3, “Run setup_authent on the Secondary Authentication Server” on page
# /usr/lpp/ssp/bin/setup_authent
Reintegrate the node_number stanza to the ODM:
# odmadd /tmp/node_num_stanza
The secondary server setup is now complete.
256
RS/6000 SP Software Maintenance
Appendix D. Special Notices
This publication is intended to help RS/6000 SP system administrators, system operators and field personnel to maintain the RS/6000 SP. It provides techniques for installing, migrating and maintaining an RS/6000 SP environment. The information in this publication is not intended as the specification of any programming interfaces that are provided by RS/6000 SP.
See the PUBLICATIONS section of the IBM Programming Announcement for
RS/6000 SP for more information about what publications are considered to be product documentation.
References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.
Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM
Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY
10504-1785.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact IBM
Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA.
Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.
The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The information about non-IBM
("vendor") products in this manual has been supplied by the vendor and IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate
© Copyright IBM Corp. 1999
257
them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere.
Customers attempting to adapt these techniques to their own environments do so at their own risk.
Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites.
Any performance data contained in this document was determined in a controlled environment, and therefore, the results that may be obtained in other operating environments may vary significantly. Users of this document should verify the applicable data for their specific environment.
Reference to PTF numbers that have not been released through the normal distribution process does not imply general availability. The purpose of including these reference numbers is to alert IBM customers to specific information relative to the implementation of the PTF when it becomes available to each customer according to the normal IBM PTF distribution process.
The following terms are trademarks of the International Business Machines
Corporation in the United States and/or other countries:
IBM

RS/6000
AT
SP
The following terms are trademarks of other companies:
C-bus is a trademark of Corollary, Inc. in the United States and/or other countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States and/or other countries.
PC Direct is a trademark of Ziff Communications Company in the United
States and/or other countries and is used by IBM Corporation under license.
ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel
Corporation in the United States and/or other countries. (For a complete list
258
RS/6000 SP Software Maintenance
of Intel trademarks see www.intel.com/tradmarx.htm)
UNIX is a registered trademark in the United States and/or other countries licensed exclusively through X/Open Company Limited.
SET and the SET logo are trademarks owned by SET Secure Electronic
Transaction LLC.
Other company, product, and service names may be trademarks or service marks of others.
Special Notices
259
260
RS/6000 SP Software Maintenance
Appendix E. Related Publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.
E.1 International Technical Support Organization Publications
For information on ordering these ITSO publications see “How to Get ITSO
•
PSSP 2.4 Technical Presentation,
SG24-5173
•
PSSP 3.1 Announcement,
SG24-5332
•
RS/6000 SP: Problem Determination Guide,
SG24-4778
•
RS/6000 SP: PSSP 2.2 Survival Guide,
SG24-4928
•
RS/6000 SP High Availability Infrastructure,
SG24-4838
•
Understanding and Using the SP Switch,
SG24-5161
•
Inside the RS/6000 SP
, SG24-5145
E.2 Redbooks on CD-ROMs
Redbooks are also available on the following CD-ROMs. Click the CD-ROMs button at http://www.redbooks.ibm.com/ for information about all the CD-ROMs offered, updates and formats.
CD-ROM Title
System/390 Redbooks Collection
Networking and Systems Management Redbooks Collection
Transaction Processing and Data Management Redbooks Collection
Lotus Redbooks Collection
Tivoli Redbooks Collection
AS/400 Redbooks Collection
Netfinity Hardware and Software Redbooks Collection
RS/6000 Redbooks Collection (BkMgr Format)
RS/6000 Redbooks Collection (PDF Format)
Application Development Redbooks Collection
Collection Kit
Number
SK2T-2177
SK2T-6022
SK2T-8038
SK2T-8039
SK2T-8044
SK2T-2849
SK2T-8046
SK2T-8040
SK2T-8043
SK2T-8037
E.3 Other Publications
These publications are also relevant as further information sources:
•
IBM Parallel System Support Programs for AIX: Installation and Migration
Guide,
GC23-3898
•
IBM Parallel System Support Programs for AIX: Administration Guide,
GC23-3897
•
IBM Parallel System Support Programs for AIX: Command and Technical
Reference, Volume 1 and Volume 2,
SA22-7351
•
AIX Version 4.3, Network Installation Management Guide and Reference
,
SC23-4113
© Copyright IBM Corp. 1999
261
262
RS/6000 SP Software Maintenance
How to Get ITSO Redbooks
This section explains how both customers and IBM employees can find out about ITSO redbooks, redpieces, and
CD-ROMs. A form for ordering books and CD-ROMs by fax or e-mail is also provided.
•
Redbooks Web Site
http://www.redbooks.ibm.com/
Search for, view, download, or order hardcopy/CD-ROM redbooks from the redbooks Web site. Also read redpieces and download additional materials (code samples or diskette/CD-ROM images) from this redbooks site.
Redpieces are redbooks in progress; not all redbooks become redpieces and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.
•
E-mail Orders
Send orders by e-mail including information from the redbooks fax order form to:
In United States
Outside North America
e-mail address
Contact information is in the “How to Order” section at this site: http://www.elink.ibmlink.ibm.com/pbl/pbl/
•
Telephone Orders
United States (toll free)
Canada (toll free)
Outside North America
1-800-879-2755
1-800-IBM-4YOU
Country coordinator phone number is in the “How to Order” section at this site: http://www.elink.ibmlink.ibm.com/pbl/pbl/
•
Fax Orders
United States (toll free)
Canada
Outside North America
1-800-445-9269
1-403-267-4455
Fax phone number is in the “How to Order” section at this site: http://www.elink.ibmlink.ibm.com/pbl/pbl/
This information was current at the time of publication, but is continually subject to change. The latest information may be found at the redbooks Web site.
IBM Intranet for Employees
IBM employees may register for information on workshops, residencies, and redbooks by accessing the IBM
Intranet Web site at http://w3.itso.ibm.com
/ and clicking the ITSO Mailing List button. Look in the Materials repository for workshops, presentations, papers, and Web pages developed and written by the ITSO technical professionals; click the Additional Materials button. Employees may access
MyNews at http://w3.ibm.com
/ for redbook, residency, and workshop announcements.
© Copyright IBM Corp. 1999
263
IBM Redbook Fax Order Form
Please send me the following:
Title Order Number Quantity
First name
Company
Address
City
Telephone number
Invoice to customer number
Credit card number
Last name
Postal code
Telefax number
Country
VAT number
Credit card expiration date Card issued to Signature
We accept American Express, Diners, Eurocard, Master Card, and Visa. Payment by credit card not available in all countries. Signature mandatory for credit card payment.
264
RS/6000 SP Software Maintenance
List of Abbreviations
AIX
AMG
ANS
APAR
API
BIS
BSD
BUMP
CP
CPU
CSS
CWS
DB
EM
EMAPI
EMCDB
EMD
EPROM
FIFO
FS
GB
GL
Advanced Interactive
Executive
Adapter Membership
Group
Abstract Notation
Syntax authorized program analysis report
Application
Programming Interface boot/install server
Berkeley Software
Distribution
Bring-Up
Microprocessor
Crown Prince central processing unit communication subsystem control workstation database
Event Management
Event Management
Application
Programming Interface
Event Management
Configuration Database
Event Manager
Daemon
Erasable
Programmable
Read-Only Memory first-in first-out file system gigabytes
Group Leader
hb
HiPS hrd
HSD
IBM
IP
ISB
ISC
ITSO
JFS
LAN
LCD
LED
LP
LPP
LRU
LSC
LV
GPFS
GS
GSAPI
GVG
HACMP
HACMP/ES
© Copyright IBM Corp. 1999
General Parallel File
System
Group Services
Group Services
Application
Programming Interface
Global Volume Group
High Availability Cluster
Multiprocessing
High Availability Cluster
Multiprocessing
Enhanced Scalability heart beat
High Performance
Switch host respond daemon
Hashed Shared Disk
International Business
Machines Corporation
Internet Protocol
Intermediate Switch
Board
Intermediate Switch
Chip
International Technical
Support Organization
Journaled File System
Local Area Network liquid crystal display light emitter diode logical partition
Licensed Program
Product last recently used
Link Switch Chip logical volume
265
PE
PID
POE
PP
PSSP
NFS
NIM
NSB
NSC
OID
ODM
PAIDE
PTC
PTF
PTPE
PTX
PV
RAM
LVM
MB
MIB
ML
MPI
MPL
MPP
Logical Volume
Manager megabytes
Management
Information Base
Maintenance Level
Message Passing
Interface
Message Passing
Library
Massive Parallel
Processors
Network File System
Network Installation
Management
Node Switch Board
Node Switch Chip object ID
Object Data Manager
Performance Aide for
AIX
Parallel Environment process ID
Parallel Operating
Environment physical partition
Parallel System
Support Programs prepare to commit
Program Temporary Fix
Performance Toolbox
Parallel Extensions
Performance Toolbox for AIX physical volume random access memory
RCP
RML
RM
RMAPI
RVSD
SAMI
SBS
SCSI
SDR
SMIT
SSA
RPQ
RSCT
RSI
R/VSD
VG
VRMF
VSD
WEBSM
266
RS/6000 SP Software Maintenance
Remote Copy Protocol
Recommended
Maintenance Level
Resource Monitor
Resource Monitor
Application
Programming Interface
Request For Product
Quotation
RS/6000 Cluster
Technology
Remote Statistics
Interface
Recoverable/Virtual
Shared Disk
Recoverable Virtual
Shared Disk
Service and
Manufacturing Interface structured byte string
Small Computer
System Interface
System Data
Repository
System Management
Interface Tool
Serial Storage
Architecture volume group version, release, modification, and fix
Virtual Shared Disk
Web-based System
Manager
Index
Symbols
A
acronyms
AIX
,
,
,
,
,
,
B
backup file format
Base Operating System
BIS
,
boot image
,
bootp_response
,
,
,
,
,
,
bundle
,
C
Common Hardware Reference Platform
CWS
,
,
,
,
,
D
daemon
DSMIT
F
,
,
,
,
,
H
High Availability
I
,
,
,
K
,
,
/.k
,
admin_acl.add 179 admin_acl.get 179 admin_acl.mod 179
authorization
create_krb_files
,
daemon
hardmon 162 hmcmds 162 hmmon 162
instance
,
kdb_util load
,
,
krb-srvtab
,
,
,
kstash
© Copyright IBM Corp. 1999
267
,
mkkp
,
pcp
principal.pag
rcmd
,
rmkp
,
secondary authentication server 197
,
,
ticket granting ticket
tkt0
L
Licensed Program Product
,
,
,
,
,
,
lppsource
M
maintenance
,
example
,
,
,
,
,
,
,
268
RS/6000 SP Software Maintenance
N
Network File System
network installation management
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
bosinst_data
,
,
,
,
,
,
,
diagnostics mode
lppchk
lppsource_aix432
,
manage groups
,
,
,
,
,
,
,
,
,
,
,
,
mknimres
,
,
,
,
primary network
push method
,
,
,
,
,
,
,
showlog
unallnimres
,
NIS
node
P
,
PAIDE
Performance Aide for AIX
Performance Toolbox for AIX
Program Temporary Fix
,
,
,
,
,
,
Q
R
,
recommended maintenance level
resources
,
,
,
route
S
,
,
,
,
,
,
,
Shared Product Object Tree
SMIT
,
,
,
,
,
,
,
,
,
,
,
SP Switch
,
,
,
spmon
,
,
,
,
,
,
,
,
,
,
spunmirrorvg
Stacked Product Option
/etc/SP/Emonitor.cfg
css_restart_node
Eclock
,
Emonitor.log
Eunfence
fenced
269
isolated
oncoming primary 225 oncoming primary backup 225
primary node
switch clock source
switch route table
System Data Repository
system resource controller 238
T
U
unallnimres
V
Virtual Shared Disks
W
WEBSM
resources
task guides
,
270
RS/6000 SP Software Maintenance
ITSO Redbook Evaluation
RS/6000 SP Software Maintenance
SG24-5160-00
Your feedback is very important to help us maintain the quality of ITSO redbooks.
Please complete this questionnaire and return it using one of the following methods:
• Use the online evaluation form found at http://www.redbooks.ibm.com/
• Fax this form to: USA International Access Code + 1 914 432 8264
• Send your comments in an Internet note to [email protected]
Which of the following best describes you?
_ Customer _ Business Partner
_ None of the above
_ Solution Developer _ IBM employee
Please rate your overall satisfaction with this book using the scale:
(1 = very good, 2 = good, 3 = average, 4 = poor, 5 = very poor)
__________ Overall Satisfaction
Please answer the following questions:
Was this redbook published in time for your needs?
If no, please explain:
Yes___ No___
What other redbooks would you like to see published?
Comments/Suggestions: (THANK YOU FOR YOUR FEEDBACK!)
© Copyright IBM Corp. 1999
271
SG24-5160-00
Printed in the U.S.A.
advertisement
Key Features
- Installation guide
- Migration instructions
- Update procedures
- Backup and restore processes
- NIM (Network Install Management)
- Switch commands
- Kerberos authentication
- System administration tips
- Troubleshooting solutions
Frequently Answers and Questions
What does this manual cover?
How do I update my SP system to a higher level of AIX or PSSP software?
What are the key components of the RS/6000 SP system?
What is NIM (Network Install Management) and how is it used?
Related manuals
advertisement
Table of contents
- 5 Contents
- 11 Figures
- 13 Tables
- 15 Preface
- 15 The Team That Wrote This Redbook
- 17 Comments Welcome
- 19 Chapter 1. Software Maintenance
- 19 1.1 Packaging Terms
- 19 1.1.1 Licensed Program Products (LPPs)
- 19 1.1.2 Package
- 20 1.1.3 Filesets
- 21 1.1.4 Fileset Updates (PTFs)
- 22 1.1.5 Bundle
- 22 1.1.6 Product Offerings
- 23 1.2 AIX Maintenance Strategy
- 23 1.2.1 Monthly Maintenance Packages
- 23 1.2.2 Maintenance Levels
- 24 1.2.3 Maintenance-Level Bundle
- 24 1.3 Media - IBM Software Manufacturing Company
- 24 1.3.1 CD-ROM
- 25 1.3.2 Tape
- 27 Chapter 2. The Installation and Customization Process
- 27 2.1 NIM Overview
- 29 2.1.1 SP Installation Hierarchy
- 32 2.1.2 NIM Master
- 33 2.1.3 NIM Networks
- 34 2.1.4 NIM Machines
- 35 2.1.5 NIM Resources
- 44 2.1.6 NIM Operations on Standalone Clients
- 47 2.1.7 NIM Administration
- 49 2.2 NIM Configuration Using PSSP
- 49 2.2.1 mknimmast
- 50 2.2.2 mknimint
- 50 2.2.3 mknimclient
- 51 2.2.4 mkconfig
- 52 2.2.5 mkinstall
- 53 2.2.6 mknimres
- 55 2.2.7 delnimmast
- 56 2.2.8 delnimclient
- 56 2.2.9 allnimres
- 58 2.2.10 unallnimres
- 59 2.3 Web-Based System Manager for NIM (WSM)
- 67 2.4 The Installation Process
- 68 2.4.1 Setting Up the Control Workstation
- 68 2.4.2 Disk Space Considerations
- 69 2.5 Directory Structure
- 72 2.6 Additional Required AIX LPPs
- 73 2.7 Minimum Required PSSP Filesets
- 73 2.8 Install PSSP on the Control Workstation
- 74 2.9 setup_authent
- 75 2.9.1 setup_authent Files and Daemons
- 76 2.10 The install_cw Command
- 77 2.11 The setup_server Command
- 78 2.12 Verification Tests
- 78 2.13 Debugging NIM
- 81 2.14 Set bootp_response to Install
- 81 2.14.1 /tftpboot Files
- 85 Chapter 3. Verifying Your SP
- 86 3.1 LPP Level
- 90 3.2 Dates and Timezones
- 91 3.3 Checking Kerberos
- 92 3.3.1 Kerberos and kadmind Daemons
- 92 3.3.2 Kerberos Files
- 95 3.4 General Environment
- 103 3.5 System Data Repository (SDR)
- 107 Chapter 4. Alternate and Mirrored rootvg
- 107 4.1 Alternate Boot System Image Function
- 109 4.1.1 Prerequisites
- 109 4.1.2 Defining an Alternate rootvg
- 111 4.1.3 Installing an Alternate rootvg
- 113 4.1.4 Switching between Alternate System Images
- 115 4.1.5 Alternate rootvg Highlights
- 115 4.2 Rootvg Mirroring
- 116 4.2.1 Mirroring Example
- 117 4.2.2 Unmirroring a Root Volume Group
- 118 4.3 Boot Modes
- 119 4.3.1 Disk Mode
- 119 4.3.2 Install Mode
- 119 4.3.3 Migrate Mode
- 120 4.3.4 Customize Mode
- 120 4.3.5 Diag Mode
- 120 4.3.6 Maintenance Mode
- 121 Chapter 5. Updating Your System
- 122 5.1 Applying PTFs
- 122 5.2 Applying AIX PTFs in the CWS
- 124 5.3 Applying AIX PTFs on the Nodes
- 124 5.3.1 Method 1: Applying AIX PTFs Using mksysb Install Method
- 129 5.3.2 Method 2: Applying AIX PTFs Using SMIT or dsh
- 131 5.3.3 Method 3: Applying AIX PTFs Using DSMIT
- 136 5.4 Applying PSSP PTFs in the CWS
- 137 5.5 Applying PSSP PTFs to the Nodes
- 137 5.5.1 Method 1: Applying PSSP PTFs Using mksysb Install Method
- 137 5.5.2 Method 2: Applying PSSP PTFs Using SMIT or dsh
- 138 5.5.3 Method 3: Applying PSSP PTFs Using DSMIT
- 138 5.6 Applying PTFs in the Nodes Using the Switch
- 141 Chapter 6. Migrating PSSP and AIX to Later Versions
- 144 6.1 Coexistence
- 145 6.2 Migrating the CWS from PSSP 2.4 to PSSP 3.1
- 152 6.3 Migrating the Nodes from PSSP 2.4 to PSSP 3.1
- 157 Chapter 7. Backup and Recovery
- 157 7.1 Backing Up Your SP System
- 157 7.1.1 Mksysb and Savevg
- 161 7.1.2 Backing Up the SDR
- 163 7.1.3 Kerberos Database
- 164 7.1.4 Backing Up the NIM Database
- 165 7.2 Restoring Your SP System
- 165 7.2.1 Restoring an mksysb Image or a Volume Group Backup
- 167 7.2.2 Restoring the SDR
- 169 7.2.3 Restoring the Kerberos Database
- 170 7.2.4 Restoring the NIM Database
- 170 7.3 Remote Backups and Restores
- 173 Chapter 8. Taming Kerberos
- 175 8.1 The Default SP Kerberos Realm
- 177 8.2 Kerberos Setup during the Installation of an RS/6000 SP System
- 177 8.2.1 The Kerberos Master Key
- 178 8.2.2 The Kerberos Configuration File
- 179 8.2.3 The Kerberos Realm Configuration
- 179 8.3 Kerberos Principals
- 180 8.3.1 The /.klogin File
- 181 8.4 Kerberos Daemons
- 182 8.5 setup_authent, setup_server and the Kerberos Database
- 184 8.6 The krb-srvtab File
- 186 8.7 Tickets and Keys
- 187 8.7.1 Ticket-Granting-Ticket
- 187 8.7.2 Contents and Purpose of the Ticket-Granting-Ticket (tgt)
- 189 8.7.3 Asking for a Remote Service
- 192 8.7.4 The Ticket Cache File
- 193 8.7.5 Authentication and Authorization
- 194 8.7.6 Never-Expiring Ticket
- 195 8.7.7 Hardmon Access Control List - hmacls File
- 197 8.7.8 Kerberos Database Access Control Lists
- 197 8.8 Kerberos Commands
- 198 8.8.1 List a Kerberos Principal
- 199 8.8.2 Make a Kerberos Principal
- 200 8.8.3 Change a Kerberos Principal
- 202 8.8.4 Remove a Kerberos Principal
- 203 8.9 Reinitializing Kerberos without Rebooting the Nodes
- 203 8.9.1 Removing or Reinitializing the Kerberos Subsystem
- 205 8.10 Recreating the krb-srvtab File for a Node
- 205 8.10.1 The create_krb_files Wrapper
- 205 8.10.2 The ext_srvtab Wrapper
- 207 8.11 Merging Two (or more) SP Systems to One Kerberos Realm
- 208 8.11.1 Deleting an Authentication Server Setup
- 209 8.11.2 Provide Name Resolution and Time Synchronization
- 209 8.11.3 Adjust the /etc/krb.realms File
- 210 8.11.4 Extend the /.klogin File
- 211 8.11.5 Adding Principals for Remote Nodes
- 213 8.11.6 Creating the srvtab Files for the Remote Nodes
- 215 8.11.7 Distributing the Kerberos Files to the Remote Nodes
- 215 8.12 Setting Up a Secondary Authentication Server
- 217 8.12.1 Install Kerberos-Related Filesets on a Secondary Server
- 217 8.12.2 /etc/krb.conf File Modification and Distribution
- 218 8.12.3 Run setup_authent on the Secondary Authentication Server
- 220 8.13 Working with the WCOLL Variable
- 223 Chapter 9. Integration of External Workstations
- 223 9.1 Integration of External Workstations to the SP Kerberos Realm
- 224 9.1.1 Time Synchronization, Name Resolution and Kerberos Code
- 225 9.1.2 Edit /etc/krb.realms on the Control Workstation
- 225 9.1.3 Extend the /.klogin File on the Control Workstation
- 226 9.1.4 Add rcmd Principals to the Kerberos Database
- 227 9.1.5 Create a krb-srvtab File for the External Workstation
- 227 9.1.6 Transfer the Required File to the External Workstation
- 227 9.1.7 Set the Authentication Method on the External Workstation
- 229 9.2 Using the CWS NIM Master Setup to Install External Workstations
- 229 9.2.1 Definition of a Second Network Served by the NIM Master
- 233 9.2.2 External NIM Client Definition
- 235 9.2.3 Creating the Resources for External NIM Clients
- 236 9.2.4 Setting Up the Installation of an External Workstation
- 239 9.2.5 Initializing the Client Installation
- 241 Chapter 10. The Switch
- 241 10.1 Definition of the Switch
- 241 10.1.1 Checking the Primary and Primary Backup Node
- 244 10.1.2 Checking the Switch Clock Source Settings
- 246 10.2 Initialize the Switch
- 246 10.2.1 Estart on the Control Workstation
- 248 10.2.2 The Worm Daemon
- 249 10.2.3 Estart_sw Running on the Oncoming Primary Node
- 250 10.2.4 What Happens on the Non-Primary Nodes
- 250 10.2.5 Fenced and Unfenced Nodes
- 251 10.2.6 Fenced with and without the autojoin Flag
- 253 10.2.7 When Do You Have to Fence a Node?
- 253 10.2.8 Modifying the Switch Values in the SDR
- 256 10.3 Monitoring the Switch
- 257 10.3.1 How to Activate the Emonitor Function
- 257 10.3.2 How Emonitor Works
- 259 10.3.3 The Emonitor.log File
- 260 10.3.4 The Time Interval
- 260 10.3.5 Monitoring the Switch with PSSP 3.1
- 261 Appendix A. Downloading PTFs for RS/6000 and SP
- 261 A.1 FixDist Tool
- 262 A.1.1 Downloading the FixDist Tool
- 262 A.1.2 Installing the FixDist Tool
- 263 A.1.3 Starting and Configuring the FixDist Tool
- 263 A.2 TapeGen Tool
- 263 A.3 Downloading Fixes from the Web
- 265 Appendix B. Integrating an S70 System into an SP
- 265 B.1 Overview
- 265 B.2 Installation
- 273 Appendix C. Secondary Authentication Server on an SP Node
- 273 C.1 On the Appropriate Node
- 273 C.2 On the Control Workstation
- 273 C.3 On the Node That Will Become the Secondary Authentication Server
- 273 C.4 On the Control Workstation
- 274 C.5 On the Future Secondary Authentication Server
- 275 Appendix D. Special Notices
- 279 Appendix E. Related Publications
- 279 E.1 International Technical Support Organization Publications
- 279 E.2 Redbooks on CD-ROMs
- 279 E.3 Other Publications
- 281 How to Get ITSO Redbooks
- 282 IBM Redbook Fax Order Form
- 283 List of Abbreviations
- 285 Index
- 289 ITSO Redbook Evaluation