
IBM Front cover

IBM Eserver pSeries Cluster Systems Handbook

Overview of Cluster 1600 software components
Cluster 1600 machine types, models, and feature codes
Solution hints and tips

Dino Quintero
Graeme Cassie
Mukesh Kumar Gupta
Edward EuiJoo Lee
See Poo Soh

ibm.com/redbooks

International Technical Support Organization

IBM Eserver pSeries Cluster Systems Handbook

October 2003

SG24-6965-00

Note: Before using this information and the product it supports, read the information in “Notices” on page ix.

First Edition (October 2003)

This edition applies to Version 3, Release 5, of Parallel System Support Programs, and Version 1, Release 3, Modification 2, of Cluster Systems Management for use with the AIX Operating System Version 5, Release 2, Modification 1.

© Copyright International Business Machines Corporation 2003. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices  ix
Trademarks  x
Preface  xi
Become a published author  xiii
Comments welcome  xiv

Chapter 1. Introduction  1
  1.1 Overview of Cluster 1600  2
  1.2 Choosing PSSP or CSM  3
    1.2.1 Cluster management with PSSP  3
    1.2.2 Cluster management with CSM  5

Chapter 2. Cluster 1600 hardware  11
  2.1 Overview  12
  2.2 Cluster 1600 hardware components  13
    2.2.1 Nodes  15
    2.2.2 Frames  15
    2.2.3 Switches  16
    2.2.4 PSSP control workstations  16
    2.2.5 CSM for AIX management server  16
    2.2.6 Hardware Management Console (HMC)  17
  2.3 CSM and PSSP hardware support  17
    2.3.1 CSM-managed node requirements  18
  2.4 PSSP control workstation  20
    2.4.1 Control workstation requirements  20
    2.4.2 Supported control workstations  21
    2.4.3 High Availability Control Workstation  22
    2.4.4 HACWS limitations  25
  2.5 CSM management server  26
    2.5.1 Memory and disk space  26
    2.5.2 Network requirements  26
    2.5.3 Asynchronous card requirements  27
    2.5.4 Using a Logical Partition (LPAR) as a CSM management server  27
  2.6 Cluster 1600 server concepts  28
    2.6.1 pSeries architecture  32
    2.6.2 Cluster 1600 and the HMC  38
    2.6.3 Firmware  44
    2.6.4 Electronic Service Agent  48
    2.6.5 Planning for Cluster 1600 servers  50
  2.7 Hardware supported and currently marketed  52
  2.8 pSeries servers  52
    2.8.1 pSeries 615 server (7029-6C3 and 6E3 deskside)  52
    2.8.2 pSeries 630 server (7028-6C4 and 6E4 deskside)  57
    2.8.3 pSeries 650 server (7038-6M2)  63
    2.8.4 pSeries 655 server (7039-651)  73
    2.8.5 670 Server (7040-671)  82
    2.8.6 690 server (7040-681)  91
    2.8.7 xSeries servers  100
  2.9 Switches  100
    2.9.1 9076 model 555  101
    2.9.2 9076 model 556  101
    2.9.3 9076 model 557  102
    2.9.4 9076 model 558  103
    2.9.5 7045-SW4 pSeries HPS (High Performance Switch)  103
  2.10 Switch adapters  108
    2.10.1 Switch adapter placement restrictions  109
    2.10.2 pSeries HPS switch network interface cards (SNI)  110
  2.11 Legacy hardware supported but no longer marketed  112
    2.11.1 pSeries 660 Model 6M1 (7026-6M1)  112

Chapter 3. Network configuration  121
  3.1 SP LAN Ethernet  122
    3.1.1 Supported Ethernet adapters and their placement  122
    3.1.2 Ethernet network topology  124
    3.1.3 IP label convention  124
  3.2 Switch network  125
    3.2.1 Benefits of a Switch network  125
    3.2.2 SP Switch2  126
    3.2.3 Switch IP network and addressing  128
  3.3 Other networks  128
  3.4 Network considerations  129
    3.4.1 The RS-232 connection  129
    3.4.2 System topology considerations  130
    3.4.3 Boot/install server requirements  130
    3.4.4 The SP Ethernet administrative LAN  134
    3.4.5 Additional LANs - considerations  137
    3.4.6 IP over the switch - considerations  137
    3.4.7 Subnetting - considerations  138
    3.4.8 HMC trusted network - considerations  139
    3.4.9 Network router node considerations  144
    3.4.10 Clustered server configuration considerations  145
    3.4.11 SP-attached server considerations  146
  3.5 Sample scenarios of Cluster 1600 managed by PSSP  147
    3.5.1 CWS with two HMCs and four pSeries  147
    3.5.2 One CWS, one HMC and one pSeries  148
    3.5.3 CWS, two HMCs, two 9076s, with one pSeries and SP Switch2  149
    3.5.4 CWS, HMC, 9076 frame and pSeries with SP Switch  150
  3.6 Networking for Cluster Systems Management (CSM)  151
    3.6.1 CSM hardware control  151
    3.6.2 Hardware and network requirements  151
    3.6.3 Virtual LANs (VLANs)  153
    3.6.4 Conceptual diagram for pSeries cluster  154
    3.6.5 pSeries HPS switch network overview  156
    3.6.6 Switch Network Manager (SNM)  158
    3.6.7 Considerations for Cluster 1600 managed by CSM network  160
    3.6.8 Examples  161
    3.6.9 Redundant HMC layout for pSeries HPS in Cluster 1600  163
    3.6.10 Redundant layout for pSeries in the Cluster 1600  164
    3.6.11 Conceptual Cluster 1600 without a pSeries HPS  165
    3.6.12 Management Server with two HMCs and four pSeries  166

Chapter 4. Software support  169
  4.1 Software components of the Cluster 1600  170
  4.2 Parallel System Support Programs (PSSP)  172
    4.2.1 Administration and operation  173
    4.2.2 Reliable Scalable Cluster Technology (RSCT)  174
    4.2.3 IBM Virtual Shared Disk (VSD)  175
    4.2.4 Security  175
    4.2.5 Communication subsystem  176
    4.2.6 Network Time Protocol (NTP)  176
    4.2.7 System availability  176
    4.2.8 Other PSSP services  177
    4.2.9 New in PSSP 3.5  177
    4.2.10 Software requirements  179
    4.2.11 Software compatibility matrix  179
    4.2.12 Documentation references - PSSP  179
  4.3 Cluster Systems Management (CSM)  180
    4.3.1 Administration and operation  181
    4.3.2 Reliable Scalable Cluster Technology (RSCT)  188
    4.3.3 New in CSM 1.3.2 for AIX  189
    4.3.4 Supported platform  190
    4.3.5 PSSP-to-CSM transition  191
    4.3.6 Documentation references - CSM  191
  4.4 General Parallel File System (GPFS)  192
    4.4.1 Architecture  192
    4.4.2 Administration and operation  196
    4.4.3 Higher performance/scalability  196
    4.4.4 Recoverability  197
    4.4.5 Migration  198
    4.4.6 New in GPFS 2.1 for AIX 5L  198
    4.4.7 Software requirements  198
    4.4.8 Documentation references  199
  4.5 LoadLeveler  200
    4.5.1 Administration and operations  201
    4.5.2 Capabilities  203
    4.5.3 New in LoadLeveler 3.2  205
    4.5.4 New in LoadLeveler 3.1  206
    4.5.5 Software requirement  207
    4.5.6 LoadLeveler configuration suggestions  208
    4.5.7 Documentation references - LoadLeveler  208
  4.6 Scientific subroutine libraries  208
    4.6.1 Engineering and Scientific Subroutines Library (ESSL) family of products  208
    4.6.2 Operations  209
    4.6.3 New in ESSL 4.1  210
    4.6.4 New in Parallel ESSL 3.1  210
    4.6.5 Software requirements  211
    4.6.6 Documentation references - ESSL and PESSL  213
    4.6.7 Mathematical Acceleration Subsystem (MASS)  214
  4.7 Parallel Environment (PE)  216
    4.7.1 Parallel Programming support  216
    4.7.2 Operation  217
    4.7.3 New in PE 4.1  218
    4.7.4 Software requirements  219
    4.7.5 Documentation references - Parallel Environment (PE)  220
  4.8 IBM High Availability Cluster Multi-Processing for AIX (HACMP)  220
    4.8.1 HACMP operations  221
    4.8.2 Administration and operation  223
    4.8.3 New in HACMP 5.1  223
    4.8.4 Software requirements  224
    4.8.5 Documentation references - HACMP  224
  4.9 Performance Toolbox (PTX) and Performance AIDE (PAIDE)  225
    4.9.1 Administration and operation  225
    4.9.2 Platform requirements  226
  4.10 Software ordering and configuration  226

Chapter 5. Solutions and offerings “best practices”  229
  5.1 High Performance Computing (HPC) environment  230
    5.1.1 A hypothetical solution  230
    5.1.2 Solution architecture  230
    5.1.3 Solution discussion  232
    5.1.4 Cluster management considerations  236
  5.2 Transition from SP nodes to LPARs  237
    5.2.1 System migration by utilizing alternate disk migration  238
  5.3 Virtual Serial port implications with LPARS  240
    5.3.1 Console device  240
    5.3.2 Serial port implication in an LPAR/SP environment  242
    5.3.3 Virtual terminal window  242
  5.4 HMC considerations  244
    5.4.1 Redundant HMC  244
  5.5 Web-based System Manager client solutions  245
    5.5.1 Web-based System Manager functionality through the firewall  246

Appendix A. Performance  251
  A.1 Switch performance  252
  A.2 Adapter performance  253
  A.3 Node I/O slot performance  254
  A.4 Application performance  255
  A.5 MPI/user space  257
  A.6 TCP/IP  260

Related publications  265
  IBM Redbooks  265
  Other publications  265
  Online resources  266
  How to get IBM Redbooks  267
  Help from IBM  267

Abbreviations and acronyms  269
Index  273

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.

© Copyright IBM Corp. 2003. All rights reserved.
Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

AIX®
Enterprise Storage Server®
Eserver™
IBM®
IntelliStation®
LoadLeveler®
Lotus®
Micro Channel®
NetView®
PowerPC®
POWER2™
POWER3™
POWER4™
POWER4+™
PTX®
pSeries®
Redbooks™
Redbooks (logo)™
RS/6000®
SAA®
SP1®
Tivoli®
VisualAge®

The following terms are trademarks of other companies:

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC.

Other company, product, and service names may be trademarks or service marks of others.

Preface

The IBM Eserver™ Cluster 1600 server, which was introduced to meet the rigorous demands of mission-critical enterprise applications, continues to offer outstanding performance, scalability, reliability, availability, serviceability, and management capabilities.

In this IBM® Redbook, we highlight the benefits of using a Cluster 1600, and describe which hardware components can be managed by either Parallel System Support Programs (PSSP) Version 3, Release 5, or Cluster Systems Management (CSM) Version 1, Release 3, Modification 2.

This publication contains the following information on the Cluster 1600:
- Cluster 1600 hardware components
- Networking components and considerations
- Cluster 1600 software components
- Scalability of the Cluster 1600
- Solutions and offerings scenarios

The Cluster 1600 helps to reduce the complexities and costs of system management, thus lowering the total cost of ownership and allowing simplification of application service level management. It also provides the infrastructure that supports availability, data sharing, and response time.

This redbook will be useful for IT professionals seeking to implement Cluster 1600 mission-critical solutions to address business intelligence applications, server consolidation, and collaborative computing.

The team that wrote this redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, Austin Center.

Dino Quintero is a Project Leader and IBM Certified Senior IT Specialist at the International Technical Support Organization, Poughkeepsie Center. His responsibilities include assessing, designing, and implementing pSeries®/AIX technical solutions for various customer sets including those in clustered environments, and writing Redbooks™ and teaching ITSO workshops.

Graeme Cassie is a pre-sales Technical Consultant in Scotland, currently working for Morse Group Ltd., one of Europe’s largest technology integrators. Previously, Graeme spent almost 15 years with IBM as a Customer Engineer specializing in mid-range systems. He has 20 years of experience in the IT field. He holds an HNC in Electronic Engineering, and holds IBM Certifications for RS/6000® SP, HACMP, p690, mid-range storage, Enterprise Disks, Enterprise Tape and e-business solutions design.
He is also a Certified AIX Technical Expert (CATE). His areas of expertise include solution design and implementation around IBM pSeries and IBM storage systems.

Mukesh Kumar Gupta is a Senior IT Specialist (Integrated Technology Services) with IBM, India. He has eight years of experience in the Support and Services field, and has worked at IBM for five years. His areas of expertise include AIX®, PSSP, RS/6000, RS/6000 SP, HACMP, Cluster 1600 and Storage (administration, support and implementation), and he works closely with pre-sales in solution design.

Edward EuiJoo Lee works as a Senior Technology Architect for General Electric (GE-ITS) in New York, USA, and has written many internal technical papers for GE. Edward has 12 years of experience in Open Systems technology and solution development. He participates in speaking engagements on technical subjects, and served as a speaker for IBM’s SP World in 1999. He is an IBM Certified Advanced Technical Expert (CATE), and is also a Certified Professional in Information Technology Services Management (ITSM). He holds a B.S. degree in Information Technology Engineering from S.U.N.Y., and is currently working towards a Master’s Degree in Management.

See Poo Soh is an Advisory Technical Specialist in Singapore with the IBM Systems Group, and has nine years of experience in the IT field. He holds a Bachelor of Engineering degree from the National University of Singapore, and a Graduate Diploma in Business Administration from the Singapore Institute of Management. His areas of expertise include High Performance Computing, where he spent 3 1/2 years working with HPC systems in a customer environment, and 3 1/2 years as a Technical Specialist in IBM, architecting HPC solutions and supporting HPC engagements. He is also skilled in SAP and Lotus® Domino sizing on the pSeries platform, and has worked on infrastructure solutions in the Data Center.

Team members (left to right): Graeme Cassie, Mukesh Kumar Gupta, Edward EuiJoo Lee, See Poo Soh, Dino Quintero (project leader)

Thanks to the following people for their contributions to this project:

Margarita Hunt, Keigo Matsubara
International Technical Support Organization, Austin Center

Ramesh Goel, Umang Mathur
IBM, India

Dave Delia, Duane Witherspoon, Paul Swiatocha Jr., Janet Ellsworth, John Simpson, Skip Russel, Robin Hanrahan, Paula Trimble, Gary Mincher, Brian Herr, Robert Curran, Joan McComb, Waiman Chan, Patrick Caffrey, Gordon Mcpheeters, Elaine Krakower, Laura Murphy
IBM Poughkeepsie

Bernard King-Smith
IBM Poughkeepsie Development Lab

Patricia Curry
IBM Fishkill

Simon Robertson
IBM UK

Bruno Blanchard
IBM France

Alvin Hua Juay Teng
IBM Singapore

Joseph Skovira
IBM Center, Parallel Computing Cornell Theory Center

EunYoung Chung, YoungSang Kim, SeckJun Song
HJS Inc., Paramus, New Jersey

Become a published author

Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners and/or customers.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability.
Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us! We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:
- Use the online Contact us review redbook form found at: ibm.com/redbooks
- Send your comments in an Internet note to: [email protected]
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. JN9B Building 003 Internal Zip 2834, 11400 Burnet Road, Austin, Texas 78758-3493

Chapter 1. Introduction

The IBM eServer Cluster 1600 high performance system uses the power of parallel processing to expand your applications. Designed and extended as a clustered IBM eServer pSeries system for performance, scalability, reliability, availability, serviceability, and management, this system makes feasible the processing of applications characterized by large-scale data handling and compute-intensive workloads.

The Cluster 1600 system is the IBM pSeries family of parallel computing solutions. It provides a state-of-the-art parallel computing system and industry-leading application enablers.

This redbook details the hardware and software components that make up the Cluster 1600. It also details the network connectivity required to join the components together, and provides useful scenarios showing how you can utilize a Cluster 1600 system.

We cover the following topics in this chapter:
- 1.1, “Overview of Cluster 1600” on page 2
- 1.2, “Choosing PSSP or CSM” on page 3

1.1 Overview of Cluster 1600

Cluster 1600 builds on the success of RS/6000 SP technology, extends the benefits to more hardware building blocks, and provides flexibility for creating a new cluster configuration. Yet it reduces the complexity and cost of developing and managing servers in enterprise data centers. Cluster 1600 also scales, incrementally and non-disruptively, to meet the demands of your enterprise.

Enterprise applications and databases often exceed the capacity of even the largest single systems, and some enterprises are outgrowing systems faster than they can capitalize them. Cluster 1600 addresses this rapid growth and change, while controlling costs. Organizations need to move away from the model of buying dedicated servers for particular applications. Enterprises must find a non-disruptive solution to enable clustering and allocation of resources as demand rises and falls in different application environments.

The Cluster 1600 addresses the business needs of enterprises with a technical solution for both current customers and new customers. It further expands modularity by being able to support both AIX and Linux offerings.

The Cluster 1600 is a collection of interconnected computers used as a unified computing resource with a single administrative domain. The Cluster 1600 basic configuration starts with a minimum of two supported servers interconnected by a network and interprocessor communications for I/O.

The Cluster 1600 has evolved from the RS/6000 SP through the addition of standalone “SP-attached servers” to where it is today. The Cluster 1600 looks very different from the traditional RS/6000 SP, but it essentially performs the same task with more flexibility than ever before.
Previously, we saw a change in the hardware associated with Cluster 1600, and we are now seeing a substantial change in the software that manages the cluster. The Cluster 1600 is in a transition period at the moment, as the management software is changing from Parallel System Support Programs (PSSP) to Cluster Systems Management (CSM). During this transition period, both PSSP and CSM are available to manage Cluster 1600 systems. PSSP will continue to be supported for a number of years in order to enable customers to make the move to CSM when they are ready.

The change to CSM has been driven by the need for a more flexible approach to clustering than PSSP is able to offer. PSSP is based on the “SP frame” concept, which is now less relevant with the new POWER4™ servers. CSM, on the other hand, is based on a node concept, rather than a frame concept.

CSM and PSSP cannot coexist in the same Cluster 1600 system, so you need to make careful choices when planning a new Cluster 1600 system, upgrading an existing Cluster 1600, or deciding to move to CSM. Sometimes the choices are easy, as some hardware is only supported on CSM and some hardware is only supported on PSSP. Likewise, some versions of application software are only supported on CSM and others are only supported on PSSP. In other cases, where the hardware and application software are supported on both PSSP and CSM, the choice may not be so straightforward.

In the following chapters, we help you to decide on the correct management software for your situation. For further information on choosing the most appropriate management software, contact your IBM sales representative or your IBM Business Partner.

In this redbook, we concentrate on the new Cluster 1600 enhancements. For information on existing RS/6000 SP and CES systems, refer to RS/6000 SP and Clustered IBM Eserver pSeries Systems Handbook, SG24-5596.

1.2 Choosing PSSP or CSM

In the following sections, we describe the characteristics of both PSSP and CSM to help you to decide which software management product is most appropriate for your environment.

1.2.1 Cluster management with PSSP

PSSP enables a single point of control of all cluster nodes and resources. It simplifies system management and reduces the administrative and operational costs of managing distributed servers. PSSP has always been the main system management software in SP implementations. Its function and usability have evolved over the years. Now, PSSP is enhanced to support a cluster of standalone pSeries servers, with or without standard SP frames and nodes.

PSSP management highlights include single point of control, system monitoring and management, system administration, and cluster security. In the following sections, we describe each item in more detail.

Single point of control

PSSP single point of control provides:
- Installation and configuration
  - Installation of cluster nodes
  - Simultaneous installation of nodes with tiered processes using “boot/install servers”
  - Ability to upgrade one node at a time
- Hardware control
  - Parallel tools, such as dsh, sysctl, and node groups
  - Hardware and switch control, such as power on/off, reset, and so on
  - Visual monitoring
  - Control available through the command line, SMIT, and the Perspectives GUI
- Base clustering technology
  - Use of base clustering services such as Topology Services and Group Services
  - Event Management and event monitoring, provided by Reliable Scalable Cluster Technology (RSCT), the framework for clustering availability, scalability, and manageability

System monitoring and management

With PSSP system monitoring and management, you can do the following:
- Easily monitor system resources
- Monitor user-defined events
- Monitor and act on particular events (such as /tmp or /var filling up, daemon status, switch events, node status, and other events)
- Allow user-defined actions when events occur (such as SNMP traps, AIX errors, and so on)
- Use commands to control event management; resource monitors can be scripts
- Use SP and Tivoli® integration for users utilizing Tivoli
- Use a GUI and API interface for associating actions with a monitored event

System administration

PSSP provides tools to help you to simplify system management:
- Node Groups
- File collection technology: keep specified files on each cluster node identical by using the Software Update Protocol (SUP)
- SP log management: log tracing keeps the trace of logs from each cluster node on the Control Workstation by using the splm tool
- Time synchronization: use the Network Time Protocol (NTP) to synchronize the time across the whole SP system
- Automounter: the automounter is supported by AIX, and PSSP utilizes it, so that users on each node can use their own home directory from the server on request
- Network tunables: node-specific network tunables are set using custom files and scripts, during the initial installation and also during regular operations

Cluster security

Security administration is an important management task in a cluster environment. PSSP provides these tools for managing security in a cluster environment:
- AIX authentication
- Kerberos authentication
- Distributed Computing Environment (DCE) support
- Secure remote command support
- Firewall configuration inside a cluster

1.2.2 Cluster management with CSM

IBM Cluster Systems Management (CSM) provides a robust, powerful, and centralized way to manage large numbers of pSeries and xSeries servers in a Cluster 1600. CSM has a modular architecture: it integrates both IBM software and open source software into a complete systems management solution, while still allowing administrators to decide which parts of CSM to use. CSM leverages the rich heritage and proven technology of the IBM RS/6000 SP by utilizing and deriving various software from the PSSP systems management software product (such as dsh and RSCT).

There are command line interfaces available for all CSM functions. A high priority was placed on making the CSM command line interface consistent and streamlined. The CSM command line interface allows a user with the appropriate permissions to access all the resources in the system, their attributes, and state values. Query and control of nodes, file systems, CPU and memory statistics, global cluster parameters, and so on can all be accessed via the command line, and scripts can be written to take advantage of this.
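As a brief, hedged illustration of that command line interface (this sketch is not taken from the handbook; the command names exist in CSM for AIX of this vintage, but flags and output should be verified against the CSM documentation for your level, and node names such as node01 are invented), a few typical queries and actions run on the management server look like this:

  # List the nodes defined in the cluster, with their attributes
  lsnode -l

  # Summarize node status (power state and network interface reachability)
  csmstat

  # Query hardware power state for all nodes, then reboot a single node
  rpower -a query
  rpower -n node01 reboot

  # Run a command in parallel on every node and collect the output
  dsh -a date

The same operations are also reachable through the graphical interfaces described next.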
On both Linux and AIX, a Distributed Command Execution Manager (DCEM) GUI is available to invoke remote commands across a CSM cluster. On AIX, there are SMIT panels available for install, setup, and hardware control functions in CSM. Also on AIX, an intuitive graphical interface available from the Web-based System Manager GUI provides a powerful way for an administrator to manage and visually monitor a CSM cluster.

Some of the CSM management highlights include single point of control, CSM event management, system administration, Configuration File Management (CFM), and cluster security. In the following sections, we discuss these items in more detail.

Single point of control

With CSM single point of control, you can do the following:
- Install and update machines remotely
- Remotely power on, off, and reboot nodes in the cluster
- Continuously monitor across all the machines in the cluster
- Automate responses that can be run any time a problem occurs, and which can provide notification or take corrective action
- Run probes across nodes in the cluster to diagnose problems
- Change files in one spot and distribute them to all the machines or a set of machines in the cluster
- Manage systems across multiple hardware domains
- Manage multiple high availability clusters inside a given CSM domain
- Handle large scaling, and handle distributed operations in parallel (such as remote command execution, monitoring and automated responses, installing and updating nodes)

In addition, CSM is independent of any switch topology or other types of domains, such as the HATS/HAGS domains used in PSSP. Also, the CSM architecture is designed to have very few dependencies on the hardware platforms that it runs on, so it can be brought easily to other platforms.

CSM event management

CSM event management provides the following:
- Monitoring for various conditions across nodes or node groups in the cluster, and running actions in response to events that occur in the cluster
- Monitoring of network health, power status, state of applications and daemons, and CPU, memory, and file system utilization
- Execution of conditional commands that can be run on the management server or on any node of the cluster, or notification actions such as logging, e-mailing, or paging
- Generation of SNMP traps in response to events in the cluster
- Predefined “Conditions” for many types of information
- Predefined “Responses” for e-mail notification, SNMP traps, logging, and displaying a message to a console
- The ability for a user to take a condition, quickly associate it with a response, and start monitoring; the administrator can easily customize these conditions and responses
- Administrator-defined recovery actions, including cleaning up file systems that are filling up, taking actions to help restart a critical application that went down, and so on

System administration

Following are CSM system administration highlights:
- The administrator installs the management server and defines the nodes to be in the cluster, and then CSM can remotely do a parallel network AIX Operating System (OS) install of the nodes.
- CSM updates software on the nodes for new CSM versions and for new open source updates. If a node is down during an update, CSM automatically performs the update when the node comes back up.
- CSM automatically sets up the security configuration for the underlying cluster infrastructure. It can also do the necessary setup for rsh or ssh (exchange of ssh keys).
- An administrator can run commands in parallel across nodes or node groups in the cluster and gather the output using the dsh (distributed shell) command. The dsh command can use rsh (the basic UNIX® remote shell) or ssh (secure shell); the administrator decides which one to use. (An example of this workflow appears at the end of this section.)
- The dshbak command can format the output returned from dsh if needed (for example, collapsing identical output from more than one node so that it is displayed only once).
- The administrator can run diagnostic probes provided by CSM to automatically perform “health checks” of particular software functions if a problem is suspected. These probes can also be run periodically or automatically, as a response to a condition occurring in the system. Current probes shipped with CSM include probes to diagnose network connectivity, NFS health, and the status of daemons that CSM runs. Administrators can also write their own probes to add into the existing CSM probe infrastructure.

Configuration File Management (CFM)

Configuration File Manager (CFM) is provided to synchronize and maintain the consistency of files across nodes in the cluster. This prevents the administrator from having to copy files manually across the nodes in the cluster. The particular files can be changed once on the management server, and then distributed to all the nodes or node groups in the cluster.

CFM can make use of meta variables for IP address or hostname substitution in files being transferred. Scripts can also be run for processing before and after a file is copied (for example, to stop or start daemons, and so on). CFM makes use of rdist for file transfer; rdist can use rsh (the basic UNIX remote shell) or ssh (secure shell). The administrator decides which one to use.

Cluster security

Distributed command execution (dsh) and CFM make use of either rsh or OpenSSH. CFM uses a push mechanism to distribute updates to the nodes.

The underlying cluster infrastructure uses an “out-of-the-box”, host-based authentication mechanism. The security is designed to be pluggable in order to more easily allow other security mechanisms to be used in the future.

Hardware control uses an encrypted ID and password to communicate with the hardware control points. We recommend that you put the hardware control communication to the hardware control points and terminal servers on a VLAN that is separate from the VLAN used to install and manage the nodes.

The cluster security infrastructure provides an “out-of-the-box” security authentication mechanism based on hostnames and a public key/private key setup. The security infrastructure allows for pluggable security mechanisms; Kerberos V5 can be supported in the future. CSM automatically sets up and exchanges public keys between the management server and the managed nodes during a full install. Access to particular resources in the cluster is determined by ACL files that CSM assists with setting up. The managed nodes are not given access to any resources on the management server, thus preventing the management server from having to trust the managed nodes.
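To make the dsh, dshbak, and CFM facilities described in this section more concrete, here is a small illustrative sketch (not an excerpt from the handbook): the node group name WebServers and the file being distributed are invented for the example; /cfmroot and the command names are part of CSM for AIX, but options should be confirmed against the documentation for your installed level.

  # Stage a common /etc/motd under the CFM root on the management server,
  # then push it out to every managed node
  mkdir -p /cfmroot/etc
  cp /etc/motd /cfmroot/etc/motd
  cfmupdatenode -a

  # Verify the result on a node group, collapsing identical output
  dsh -N WebServers cat /etc/motd | dshbak -c

Because dsh and CFM ride on rsh or ssh, whichever remote shell the administrator has configured is what these commands use underneath.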
Chapter 2. Cluster 1600 hardware

In this chapter, we describe the hardware building blocks that fit together to build a Cluster 1600 system. Because Parallel System Support Programs (PSSP) and Cluster Systems Management (CSM) support different hardware, we state which hardware is supported by CSM within a Cluster 1600, and which hardware is supported by PSSP within a Cluster 1600. We first look at the generic building blocks, and then look at the individual machines that make up a Cluster 1600 system.

Important: All cluster scalability limits shown in this chapter are subject to change without notice. For the latest scalability limits, refer to the IBM Systems Sales Web site: www.ibm.com/servers/eserver/pseries

2.1 Overview

Cluster 1600 is a collection of interconnected computers used as a unified computing resource with a single administrative domain. The Cluster 1600 basic configuration starts with a minimum of:
- Two supported servers
- A cluster management console, and management software
- The cluster interconnected by network and interprocessor communications for input/output (I/O)

Traditional clustered server environments were most widely found in academic, research, and large computational environments. Lately, however, server clustering is being utilized in commercial environments to achieve and maximize parallel application environments (for example, parallel databases, server/workload consolidation, and redeployment of servers).

Cluster 1600 unifies the existing offerings of RS/6000 SP and Cluster Enterprise Servers (CES), and is managed by IBM cluster management software, the Parallel System Support Programs (PSSP) or Cluster Systems Management (CSM). A collection from the following systems can be part of this Cluster 1600; the intermix possibilities depend on the management software:
- Legacy RS/6000 SP
- Standalone servers such as the RS/6000 S70/S7A and pSeries p680
- Rack-mounted servers such as the pSeries 660 model 6H1/6M1/6H0
- All POWER4 servers
- All POWER4+ servers

Important: Refer to Table 2-2 on page 17 for server types supported by PSSP and CSM.

With the ability to “mix and match” cluster building blocks, ranging from low-cost pSeries servers running AIX to high-end pSeries 690 servers running AIX, the Cluster 1600 offers maximum flexibility in matching applications and services to computing resources. Configuring a Cluster 1600 offers the following benefits and more:
- Single administrative domain
- Centralized topology
- Broader range of building blocks
- Functional enhancements
- Increased scalability
- Denser packaging
- Hybrid clusters (intermix of Linux and/or AIX environments)
- Grid-enabled
- Autonomic computing
- Reliability, Availability, Serviceability (RAS), performance, and flexibility
- SMP scalability extended outside of the SMP server, 64-bit exploitation, a broader range of cluster building blocks, and a mature software environment

The hardware components that make up a Cluster 1600 are described in 2.2, “Cluster 1600 hardware components” on page 13. We detail hardware that is supported on both CSM 1.3.2 managed clusters and PSSP 3.5 managed clusters.

Important: Not all hardware components are supported on both CSM and PSSP. Some components are only supported on CSM, and some components are only supported on PSSP. Because PSSP and CSM cannot coexist in the same Cluster 1600, it is important to check the support for all components. Table 2-2 on page 17 references the components with support available from CSM and PSSP. Note that this support may change with later versions.
2.2 Cluster 1600 hardware components

A Cluster 1600 can be built from a collection of pSeries, RS/6000, and xSeries servers (xSeries in CSM-managed clusters only), which form the building blocks of the cluster. The servers within the cluster are all networked to, and managed by, the control workstation (PSSP) or the management server (CSM). The servers are often connected together by a high bandwidth, low latency network. This network is referred to as “the switch network”.

The Cluster 1600 is very flexible, and individual clusters can differ greatly from one another. To protect your investment, the Cluster 1600 can incorporate older servers, and in particular RS/6000 SP clusters. These older servers, which are not currently marketed, are referred to as legacy hardware.

Although the building blocks retain their own model type and serial number, when they are part of a Cluster 1600, they collectively take on another model type and serial number. Individually, each server has a feature code related to the Cluster 1600 that it is a member of. The Cluster 1600 model type is 9078-160. This model type and the features attached to it are reflected in IBM’s administration records.

The feature codes for Cluster 1600 servers are shown in Table 2-1.

Table 2-1  Cluster 1600 feature codes

  Description                                      M/T    Model  Feature
  7017 server type                                 9078   160    0001
  7026 server type                                 9078   160    0002
  9076-555/557 type (SP Switch)                    9078   160    0003
  9076-556/558 type (SP Switch2)                   9078   160    0004
  9076-SP                                          9078   160    0005
  9076 SP expansion frame                          9078   160    0006
  Control Workstation                              9078   160    0007
  7040 server type                                 9078   160    0008
  LPAR 7040 (switched)                             9078   160    0009
  7039 server type                                 9078   160    0010
  LPAR 7039 (switched)                             9078   160    0011
  7028 server type                                 9078   160    0012
  LPAR 7028 (switched)                             9078   160    0013
  7038 server type                                 9078   160    0014
  LPAR 7038 (switched)                             9078   160    0015
  7029 server type                                 9078   160    0016
  7045-SW4 pSeries HPS (High Performance Switch)   9078   160    0017

Note: CSM can manage the following servers within a single cluster:
- pSeries servers running AIX 5L
- Selected xSeries servers
- Selected IntelliStations and Blade Center cards
Cluster 1600 only supports the pSeries nodes shown in Table 2-1.

2.2.1 Nodes

Servers, or “nodes”, as they are known when clustered, are pSeries servers, RS/6000 servers, SP nodes, xSeries servers, IntelliStation®, or Blade Center nodes. These provide the computing power for the cluster. A Cluster 1600 can contain between 2 and 128 nodes. Larger configurations are available with the IBM special order process.

There are five types of Cluster 1600 nodes:
- High nodes - RS/6000 SP - legacy
- Wide nodes - RS/6000 SP - legacy
- Thin nodes - RS/6000 SP - legacy
- SP-attached servers and Clustered Enterprise Servers - legacy
- Selected pSeries servers - currently available

Attention: Not all nodes are supported on both CSM and PSSP. Table 2-2 on page 17 lists the supported nodes.

2.2.2 Frames

The IBM RS/6000 SP system frames contain and provide power for processor nodes, switches, hard disk drives, and other hardware. A frame feature code provides an empty frame with its integral power subsystem and AC power cable. The nodes and the other components have different specific feature codes.
The frames are offered in the following options:
- Tall (1.93 m) model 550 frames - legacy
- Tall expansion frames (F/C 550) - legacy
- Short (1.25 m) model 500 frames - legacy
- Short expansion frames (F/C 1500) - legacy
- SP switch frames (F/C 2031 for SP Switch and F/C 2032 for SP Switch2) - currently available
- 7040-W42 p655 racks - currently available

Frames have locations known as drawers into which the processor nodes are mounted. Tall frames have eight drawers and short frames have four. Each drawer location is further divided into two slots. A slot has the capacity of one thin node or SP Expansion I/O unit. A wide node occupies one full drawer, while a high node occupies two full drawers. The maximum number of SP frames supported in an SP system is 128.

2.2.3 Switches

The SP switches provide a message-passing network that connects all processor nodes with a minimum of four paths between any pair of nodes. The switch provides a high bandwidth, low latency network which can use either the TCP/IP or the MPI protocol to pass messages. With high performance parallel applications, normally the MPI protocol is used. Other typical uses are for performing high speed network backups. The SP series of switches can also be used to connect the SP system with optional external devices when a switch router is used.

A switch feature code provides you with a switch assembly and the cables to support node connections. The number of cables you receive depends on the type of switch you order.

The SP switches available with PSSP 3.5 support are the SP Switch and the SP Switch2:
- SP Switch (F/C 4011), 16-port switch (available until December 31, 2003)
- SP Switch2 (F/C 4012), 16-port switch

The switch available with CSM 1.3.2 support is the pSeries HPS (High Performance Switch):
- pSeries High Performance Switch (HPS), 7045-SW4, 32-link switch

Important: PSSP does not manage the High Performance Switch, and CSM does not manage SP switches.

2.2.4 PSSP control workstations

You can view the control workstation (CWS) as the control point of the Cluster 1600 system managed by PSSP. The subsystems running on the CWS are for controlling and monitoring the Cluster 1600. The CWS subsystems provide configuration data, security, hardware monitoring, diagnostics, a single point of control service, and optionally, job scheduling data and a time source. The CWS must be a pSeries or an RS/6000 system running AIX.

2.2.5 CSM for AIX management server

You can think of the management server as the control point of the Cluster 1600 system managed by CSM. The management server is very similar to the control workstation in a PSSP-managed Cluster 1600. The management server provides configuration data, hardware monitoring, diagnostics, a single point of control service, and node installation. The CSM management server also provides the interface for the remote hardware control capability. Remote hardware control allows you to power nodes on and off, reboot them, bring up a console through a tty connection, and query nodes from the management server. The management server for a Cluster 1600 can be a pSeries or an RS/6000 system running AIX.
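As a hedged illustration of this remote hardware control (a sketch only, not text from the handbook; the node name p650node1 is invented, and the exact flags should be verified against the CSM for AIX command reference for your level), these operations are issued from the management server:

  # Query the power state of every node, then of a single node
  rpower -a query
  rpower -n p650node1 query

  # Power a node on, or reboot it, through its hardware control point
  rpower -n p650node1 on
  rpower -n p650node1 reboot

  # Open a remote console session to the node over its tty/console path
  rconsole -n p650node1

Equivalent actions are also available from the Web-based System Manager GUI and SMIT on AIX, as noted in Chapter 1.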
First introduced with the p690, the HMC can now provide hardware control and console facilities for the full POWER4 and later range of pSeries servers. The HMC is used to configure and manage LPARs, dynamic LPARs, and Capacity Upgrade on Demand (CUoD), as well as the basic console and hardware control functions. The HMC is the interface between the CWS or management server and the POWER4 or later nodes, and is mandatory with Cluster 1600 systems.

Note: The HMC cannot be used as either the CWS or a CSM management server.

2.3 CSM and PSSP hardware support

Hardware support varies between CSM and PSSP. Hardware that is supported by PSSP may not be supported on CSM, and vice versa. The general direction is that newly released hardware will be supported on CSM but not on PSSP. Table 2-2 shows the hardware supported by PSSP 3.5 and CSM 1.3.2 within a Cluster 1600.

Table 2-2   CSM 1.3.2 and PSSP 3.5 Cluster 1600 supported hardware
Hardware description                                               PSSP 3.5   CSM 1.3.2
Nodes
9076 SP nodes PWR_1, PWR_2, P2SC, thin and wide                    Yes        No
9076 SP nodes 601, 604, 604e high nodes                            Yes        No
9076 SP nodes 332 MHz SMP thin and wide (F/C 2050, F/C 2051)       Yes        Yes
9076 SP nodes POWER3™ thin and wide (F/C 2052, F/C 2053)           Yes        Yes
9076 SP nodes 375 MHz/450 MHz thin and wide (F/C 2056, F/C 2057)   Yes        Yes
9076 SP nodes POWER3 and 375 MHz POWER3 high (F/C 2054, F/C 2058)  Yes        Yes
7017 S70, S7A, S80, S85                                            Yes        No
7026 H80, M80, 6H0, 6H1, 6M1                                       Yes        Yes
7028 6C4 (p630)                                                    Yes        Yes
7029 6C3 (p615)                                                    No         Yes
7038 6M2 (p650)                                                    Yes        Yes
7039 651 (p655)                                                    Yes        Yes
7040 671, 681 (p670, p690)                                         Yes        Yes
Switches
SP Switch                                                          Yes        No
SP Switch2                                                         Yes        No
7045-SW4 pSeries HPS (High Performance Switch)                     No         Yes

2.3.1 CSM-managed node requirements

The following section describes the hardware requirements for supported CSM-managed nodes. There is a distinction to be made between a Cluster 1600 and a CSM-managed cluster:
- A Cluster 1600 is restricted to the nodes identified in Table 2-2 on page 17.
- A CSM-managed Cluster 1600 supports all the nodes that are supported in the Cluster 1600 and, in addition, the nodes listed in Table 2-3 on page 19.

Both AIX and Linux machines are supported as managed nodes, up to 128, in the CSM for AIX cluster environment. This means that you can have a maximum of 128 nodes, counting both AIX and Linux nodes.

Table 2-3   CSM supported nodes
Hardware description                                               PSSP 3.5   CSM 1.3.2
xSeries 330, 342, 335, 345, 360, 440, running Linux                No         Yes
Blade servers running Linux (8677 and 8678)                        No         Yes
IntelliStation 6221 (hardware control through the APC
  Master Switch)                                                   No         Yes
eServer 325 running Linux (AMD system)                             No         Yes

Tip: Each node type is required to have at least 128 MB of RAM.

AIX-managed nodes

AIX-managed nodes must be one of the following servers, running AIX 5L Version 5.1 with Recommended Maintenance Package 5100-03, or later, plus the latest APARs:
- pSeries server
- RS/6000
- A partition (LPAR) of a pSeries server
- A CSM-supported SP node (see Table 2-2 on page 17)

Tip: You may configure the management server as a managed node.

Linux-managed nodes

The following servers are supported as Linux-managed nodes:
- CSM-supported xSeries servers. For a list of supported servers, refer to Table 2-3 on page 19. Only selected xSeries nodes, capable of hardware control, are supported by CSM.

  Important: Supported servers must run Red Hat 7.2, 7.3, 8.0, AS 2.1, or SUSE 8.0, 8.1, SLES 7 or SLES 8 Linux. Contact your IBM sales representative or IBM Business Partner for the latest list of supported Linux distributions.

- CSM-supported pSeries servers.

  Note: These servers must run Red Hat AS 2.1, or SUSE SLES 7 or SLES 8 Linux.
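Before defining Linux nodes to the cluster, it can be useful to confirm the installed distribution and release on each candidate machine. The following is only a minimal sketch: it assumes the nodes are already reachable with ssh (or with dsh once the CSM client is installed), and the node names are examples.

# Report the distribution release on each candidate Linux node
# (linuxnode1 and linuxnode2 are example host names)
for node in linuxnode1 linuxnode2
do
    echo "--- $node ---"
    ssh $node "cat /etc/redhat-release /etc/SuSE-release 2>/dev/null"
done

Once CSM remote command support is configured, the same check can be issued to a list of nodes in one call with dsh -n.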
2.4 PSSP control workstation

A Cluster 1600 system requires a customer-supplied pSeries or RS/6000 system known as a control workstation. The control workstation serves as a point of control for managing, monitoring, and maintaining the Cluster 1600 system frames and individual processor nodes. A system administrator can perform these control tasks by logging into the control workstation from any other workstation on the network.

The control workstation can also act as a boot/install server for other servers in the Cluster 1600 system. In addition, the control workstation can be set up as an authentication server using Kerberos. The control workstation can be the Kerberos primary server, with the master database and administration service, as well as the ticket-granting service. As an alternative, the control workstation can be set up as a Kerberos secondary server, with a backup database, to perform the ticket-granting service.

Kerberos is no longer the only security method. The Distributed Computing Environment (DCE) can be used together with Kerberos V4 or on its own, and you can also choose a secure remote command method (such as ssh) or no additional security (standard AIX security).

A high availability solution, named the High Availability Control Workstation (HACWS), may be implemented for the control workstation.

Note: A CWS is required on all PSSP-managed Cluster 1600 systems, even when POWER4 nodes are used with an HMC.

2.4.1 Control workstation requirements

The control workstation has the following requirements:
- pSeries or RS/6000 server (from the supported list in Table 2-4 on page 22)
- RS-232 port available for each frame/server

Note: An RS-232 connection is not required between the CWS and the HMC. However, it is required between the HMC and the POWER4 server.

Restriction: The native RS-232 ports on the system planar cannot be used as tty ports for the hardware controller interface. Either an 8-port or a 128-port async adapter must be ordered if RS-232 ports are required.

- Two RS-422 ports available for each p655 frame (8-port async card)
- Ample disk space for mksysb images and NIM requirements
- Suitable Ethernet network adapters
- Graphics adapter and graphical screen
- AIX, PSSP, and C or C++ for AIX

Tip: The CWS must have at least 128 MB of main memory. An extra 64 MB of memory should be added for each additional system partition. For Cluster 1600 systems with more than 80 nodes, 256 MB is required and 512 MB of memory is recommended.

The CWS must have at least 9 GB of disk storage. If the SP is going to use an HACWS configuration, you can configure 9 GB of disk storage in the rootvg volume group and 9 GB for the spdata in an external volume group.

Because the control workstation is used as a Network Installation Manager (NIM) server, the number of unique filesets required for all the nodes in the SP system might be larger than for a normal single system. You should plan to reserve 6 GB of disk storage for the filesets and 2 GB for the operating system. This allows adequate space for future maintenance, system mksysb images, and LPP growth.
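As a rough aid to this disk planning, the following sketch shows the kind of space check an administrator might run on an existing CWS. It assumes the usual PSSP locations under /spdata; adjust the paths to match your installation.

# Free space remaining in rootvg (free physical partitions and PP size)
lsvg rootvg

# Usage of the /spdata file system, if it is a separate file system
df -k /spdata

# Approximate size of each lppsource and of the node mksysb images
# (these are the usual PSSP directories; paths may differ on your system)
du -sk /spdata/sys1/install/*/lppsource 2>/dev/null
du -sk /spdata/sys1/install/images 2>/dev/null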
Keep in mind that if you have nodes at different levels of PSSP or AIX, each level requires its own lppsource, which takes up extra space. A good rule of thumb for disk planning on a production system is 4 GB for rootvg, to accommodate additional logging and /tmp space, plus 4 GB for each AIX release and modification level for lppsource files. Additional disk space should be added for the mksysb images of the nodes. If you plan on using rootvg mirroring, then for one mirror, double the number of physical disks you have estimated so far; for two mirrors, triple the estimate.

2.4.2 Supported control workstations

Table 2-4 on page 22 lists the currently available pSeries control workstations supported for PSSP 3.5.

Table 2-4   Supported control workstations on PSSP 3.5
Machine type   Models
7028           p630 - 6C4, 6E4
7038           p650 - 6M2
7044           170

Note: For supported legacy control workstations, refer to RS/6000 SP and Clustered IBM eServer pSeries System Handbook, SG24-5596.

A typical control workstation configuration for a large Cluster 1600 is shown in Example 2-1. Smaller clusters may require less disk space and less CPU, although the network requirements remain the same.

Example 2-1   Typical control workstation configuration
Product    Description                        Qty
________   ________________________________   _____
7028-6E4   Deskside Server 1:pSeries 630      1
2633       IDE CD-ROM Dr.(Black bezel)        1
2848       POWER GXT135P GRAPHICS ACC         1
2943       8-Port Ad.EIA-232/RS-422           1
3158       36.4 GB 10K RPM U3 SCSI DDA        2
3159       73.4 GB 10K RPM U3 SCSI DDA        2
3628       Color Monitor,St.Black and C.      1
4254       SCSI Connector Cable               1
4451       1024MB DIMMs, 208-pin, 8NS DDR     1
4961       Univ 4-Port 10/100 Enet Ad.        2
5005       Software Preinstall                1
5134       2way 1.2GHz POWER4+ Proc Card      2
6273       Power Supply, 645 Watt AC, HS      1
6568       U3 SCSI Backplane for HS Disks     1
8700       Quiet Touch Keyb.,US Engl.         1
8741       3-Button Mouse-Stealth Black       1
9300       Lang.Group Spec.-US English        1
9556       6 Slot PCI Riser Init ord only     1
9825       Power Cord - U.K.                  1

2.4.3 High Availability Control Workstation

The High Availability Control Workstation (HACWS) is a component that is used to reduce the possibility of a single point of failure in the SP system. Although there are already redundant power supplies and replaceable nodes, there are also many elements of hardware and software that could fail on a control workstation. With HACWS, your SP system has the added security of a backup control workstation. HACWS also allows your control workstation to be powered down for maintenance or updating without affecting the entire SP system.

The design of HACWS is modeled on the High Availability Cluster Multi-Processing for AIX (HACMP) Licensed Program Product. HACWS utilizes HACMP running on two RS/6000 control workstations in a two-node rotating configuration. HACWS uses an external disk that is accessed non-concurrently by the two control workstations for storage of SP-related data. There is also a dual RS-232 frame supervisor card with a connection from each control workstation to each SP frame in your configuration. This HACWS configuration provides automated detection, notification, and recovery of control workstation failures. Figure 2-1 on page 24 shows the HACWS overview.

Figure 2-1   HACWS overview

HACWS provides a high availability solution for the control workstation, which serves as a single point of control for managing and maintaining the SP nodes using Parallel System Support Programs (PSSP).
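Because HACWS keeps the SP-related data on shared external disks, a basic health check on each control workstation in the pair is to confirm that the shared volume group is visible and varied on only where expected. This is a generic AIX sketch; the volume group name spdatavg is only an example.

# List physical volumes and the volume groups they belong to;
# the shared external disks should appear on both control workstations
lspv

# Show which volume groups are currently varied on; the shared
# volume group should normally be active on only one CWS at a time
lsvg -o

# Display details of the shared volume group (name is an example)
lsvg spdatavg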
Depending on your system environment and the type of applications that are running on your CWS, a failure of the CWS may not affect the operation of the cluster environment. However, if your needs are such that you choose to run critical applications on the CWS, then you should consider protecting it with the HACWS program.

Users who do not run any critical applications on the CWS, or who are unconcerned about spending time restoring a system backup and rebuilding the control workstation after a failure, might not require the high availability configuration. Even for such users, however, HACWS still provides minimal downtime for maintenance of the CWS.

The loss or failure of the CWS has the following effects on the management of a Cluster 1600 system, but should have little or no impact on the active jobs on the cluster nodes:
- It is not possible to control the cluster hardware.
- The System Data Repository (SDR) is unavailable.
- Existing jobs continue to completion, but new parallel jobs cannot be started.
- No configuration changes can be made.
- Software installations cannot be done from the CWS.
- Should a switch fault occur, reset processing cannot be completed.
- Error logging of alerts raised by SP/Cluster nodes is lost (although the information is still logged on the individual nodes).
- While the CWS is unavailable, some administrative tasks that use Parallel System Support Programs (PSSP) cannot proceed.

The HACWS configuration requires external disks that provide a non-concurrent access feature for both the primary and backup control workstations. In a typical configuration, the cluster management data, as well as the AIX system images, PSSP and related software install filesets, NIM configuration data files, and any other installp filesets, should reside on the external shared disks. Any disks supported by HACMP and by the RS/6000 models used for the primary and backup control workstations can be used.

Each CWS requires the same number of connections to the SP Ethernet on the same LAN segments. Each SP Ethernet LAN segment must be cabled to the same Ethernet (enx) adapter. Standby network adapters are optional in the HACWS configuration. However, the presence of a standby adapter may avoid the need to fail over (switch to the backup) to the inactive CWS in the case of a single LAN adapter failure.

2.4.4 HACWS limitations

Be aware of the following limitations before implementing HACWS:
- There is no hardware control of SP-attached nodes or Clustered Enterprise Servers (CES) when the CWS fails over. This limitation is due to the serial connection from the CWS to the attached servers: the SP-attached servers do not have a secondary serial port to allow an RS-232 connection to the secondary CWS.

These limitations do not apply to HMC-attached servers.

Note: The HACWS configuration works with either HACMP “classic” or HACMP Enhanced Scalability.

2.5 CSM management server

The management server is very similar to the control workstation on a PSSP Cluster 1600. The management server provides configuration data, hardware monitoring, diagnostics, a single point of control service, and node installation. The management server can be any RS/6000 or pSeries server, or a pSeries LPAR, that is capable of running AIX 5.2. It can even be one of the managed nodes.

Restriction: A p655 cannot be a management server, as there are no media drawers.
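To illustrate the remote hardware control role of the management server, the sketch below shows the kind of CSM commands typically issued from it. It assumes the nodes have already been defined to CSM, and the node name is an example.

# List the nodes defined to this management server
lsnode

# Query the power state of all defined nodes
rpower -a query

# Power on a single node, then open a remote console to it
# (p630node01 is an example node name)
rpower -n p630node01 on
rconsole -n p630node01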
The following are the hardware requirements for the CSM management server:
- The machine you use for your management server must have a CD-ROM drive.
- The management server must be a workstation capable of running AIX 5L Version 5.2.

2.5.1 Memory and disk space

On the management server, a minimum of 1024 MB of memory and 120 MB of disk space is required for installing CSM. An additional 2.0 GB of disk space is required for installing the AIX operating system and CSM. In addition, 20 MB is required if you are managing Linux nodes, at least 1 GB for NIM, and at least 1 GB if storing backups from the managed nodes.

2.5.2 Network requirements

Ethernet adapters are required. When configuring a CSM cluster, give particular attention to secure hardware control functions. We suggest that you define and isolate the management server on a private Ethernet network when connecting to a management VLAN (a virtual LAN that connects the management server to the cluster hardware control points). One Ethernet connection is also required for each cluster VLAN (a virtual LAN that connects the nodes to each other and to the management server). You may also want to have a separate network connection to a public LAN.

2.5.3 Asynchronous card requirements

An asynchronous card is only required if you are controlling non-HMC-attached nodes, such as p660s or legacy SP nodes. Both 8-port and 128-port asynchronous adapter cards are available.

2.5.4 Using a Logical Partition (LPAR) as a CSM management server

CSM 1.3 for AIX supports a logical partition (LPAR) as a CSM management server, with the limitations and considerations listed below. Depending on your cluster configuration and needs, you may choose to use an LPAR as the CSM management server. However, it is important to understand the limitations and considerations in order to decide whether an LPAR CSM management server is appropriate for your cluster.

Considerations for an LPAR management server
1. The CSM management server can be brought down inadvertently by someone on the HMC deactivating that LPAR. Even someone without access to the CSM management server itself can power it off if they have access to the HMC. The management server can also be affected by someone moving resources such as CPU or I/O away from that LPAR.
2. If the firmware needs to be upgraded, the LPAR management server may go down along with the rest of the CEC in order to upgrade the firmware. However, upon bringing the CEC back up, the system returns to normal.
3. There is no direct manual hardware control of the CSM management server. The administrator must go through the HMC for power control of the management server.
4. In many cases, a physically separate CSM management server can be safer: if the CEC to which an LPAR management server belongs goes down because of a hardware or power failure, you lose access to your management server.
5. An LPAR management server may not have an attached display. This may affect the performance of your CSM GUIs.
6. An LPAR management server may not contain media devices such as CD, tape, or diskette drives. This can affect your back-up strategy. In machines such as the p690, you may be able to assign a CD-ROM drive to the management server LPAR.
7. An LPAR management server should not also be defined as a managed node.

Note: A cluster that is installed and configured can still function even if the management server goes down. For example, cluster applications can continue to run, and nodes in the cluster can be rebooted, even if the management server is down. However, tasks such as monitoring, automated responses for detecting problems in the cluster, and scheduled file and software updates will not occur while the management server is down. Depending on your cluster configuration and use, this may or may not be a concern.
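The monitoring and automated responses mentioned in the note are driven by the RSCT event response subsystem on the management server. The sketch below only illustrates how those pieces are listed and wired together; the condition and response names must be taken from the output of lscondition and lsresponse on your own system, so the pairing shown is a placeholder.

# List the predefined event conditions and responses shipped with CSM/RSCT
lscondition
lsresponse

# Show which condition/response pairs are currently active
lscondresp

# Activate a condition/response pair (placeholder names; use names
# reported by lscondition and lsresponse on your management server)
startcondresp "NodeReachability" "BroadcastEventsAnyTime"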
2.6 Cluster 1600 server concepts

Figure 2-2   Cluster 1600 servers

Servers currently available for the Cluster 1600 are from the pSeries range. The servers shown in Figure 2-2 are the cluster-enabled pSeries servers. The currently marketed pSeries servers that can be attached to a Cluster 1600 all contain POWER4 or later technology.

As a result of the LPAR technology introduced with POWER4, a node can be either a physical server or an LPAR within a server. Logical partitions (LPARs) run separate instances of AIX or Linux within the same server. LPARs are completely isolated from each other and use separate resources allocated from the server. CPU, memory and I/O can be allocated to any LPAR with a granularity of a single processor, 256 MB of memory, and any single PCI/PCI-X adapter. Figure 2-3 displays the concepts of LPAR. With AIX Version 5.2, these LPARs can be dynamic, which means that you can change resources without having to reboot.

Figure 2-3   LPAR concept diagram

For more information on LPARs, refer to The Complete Partitioning Guide for IBM pSeries Servers, SG24-7039.

Because of LPARs, we now talk about operating system images as opposed to servers. All current pSeries servers can only be attached to a Cluster 1600 when they are controlled by an HMC. A Cluster 1600 can consist of 2 to 128 AIX operating system images. An operating system image, or logical node, can be one of the following:
- A 7040 server (p670, p690) running as a full system partition
- An LPAR of a 7040 server (p670, p690)
- A 7039 server (p655) running as a full system partition
- An LPAR of a 7039 server (p655)
- A 7038 server (p650)
- An LPAR of a 7038 server (p650)
- A 7028 server (p630)
- An LPAR of a 7028 server (p630)
- A 7026 server (H80, M80, p660, 6H0/6H1, 6M1)
- A 7017 server (S70, S7A, S80, p680)
- A 9076 SP node

Logical nodes are limited in a cluster running PSSP as described in Table 2-5. Any cluster system that exceeds any of the supported logical node limits requires a special bid or request for price quotation (RPQ).

Table 2-5   Cluster limits for a Cluster 1600 managed by PSSP
Nodes                          SP Switch   SP Switch2   SP Switch2 (total switched   Industry std interconnect
                                                        plus non-switched)           (non-switched)
p690/p670 servers per cluster  32          32           32                           32
p655 servers per cluster       -           64           64                           64
p650 servers per cluster       -           64           64                           64
p630 servers per cluster       -           64           64                           64
LPARs per p690 server          8           32           32                           32
LPARs per p670 server          4           8            16                           16
LPARs per p655 server          -           2            4                            4
LPARs per p650 server          -           2            8                            8
LPARs per p630 server          -           2            4                            4
LPARs per cluster              128         128          128                          128

Note: With the SP Switch, switched and non-switched logical nodes cannot be mixed.

A Cluster 1600 with PSSP must meet all of the limits listed in Table 2-6; otherwise, a special bid order is required.
Table 2-6   A Cluster 1600 with PSSP must meet all of the following limits
- No more than 128 logical nodes from the set (7040, 7039, 7038, 7028, 7026, 7017, 9076)
- No more than 32 servers from the set (7040)
- No more than 64 servers from the set (7039)
- No more than 32 servers from the set (7038)
- No more than 64 servers from the set (7040, 7039, 7038, 7028, 7026, 7017)
- No more than 16 servers from the set (7017)
- No more than 128 9076 SP nodes

For example:
- 32 p690s with 4 LPARs each, or 16 p690s with 8 LPARs each
- 32 p660s, 12 p690s with 4 LPARs each, 4 p690s with 8 LPARs each, and 16 p630s
- 16 p690s with 4 LPARs each, 32 p660s, and 32 SP nodes
- 12 p690s with 4 LPARs each, 16 p680s, 16 p660s, and 48 SP nodes

Logical nodes are limited in a cluster running CSM, as described in Table 2-7. Any cluster system that exceeds any of the supported logical node limits requires a special bid or RPQ.

Table 2-7   Cluster limits for CSM
Node maximums                  IBM eServer pSeries High    HPS (total switched     Industry std interconnect
                               Performance Switch (HPS)    plus non-switched)      (non-switched)
p690 servers per cluster       16                          32                      32
p670 servers per cluster       N/A                         N/A                     32
p655 servers per cluster       16                          64                      64
p650 servers per cluster       N/A                         N/A                     64
p630 servers per cluster       N/A                         N/A                     64
LPARs per p690 server          16                          16                      16
LPARs per p670 server          N/A                         N/A                     16
LPARs per p655 server          4                           4                       4
LPARs per p650 server          N/A                         N/A                     8
LPARs per p630 server          N/A                         N/A                     4
LPARs per cluster              128                         128                     128

A Cluster 1600 with CSM must meet all of the limits in Table 2-8; otherwise, a special bid order is required.

Table 2-8   A Cluster 1600 with CSM must meet all of the following limits
- No more than 128 logical nodes from the set (7040, 7039, 7038, 7028, 7026, 9076)
- No more than 32 servers from the set (7040)
- No more than 64 servers from the set (7039)
- No more than 64 servers from the set (7038)
- No more than 64 servers from the set (7040, 7039, 7038, 7028, 7026)

For example:
- 32 p690s with 4 LPARs each, or 16 p690s with 8 LPARs each
- 32 p660s, 12 p690s with 4 LPARs each, 4 p690s with 8 LPARs each, and 16 p630s
- 16 p690s with 4 LPARs each, 32 p660s, and 32 SP nodes
- 12 p690s with 4 LPARs each, 16 p680s, 16 p660s, and 48 SP nodes

Attention: The cluster scalability for Linux on pSeries is the same as for AIX on pSeries, which is 128 nodes.

2.6.1 pSeries architecture

The pSeries servers that are eligible for cluster attachment all use POWER4 or later architecture. The larger servers (p655, p670, and p690) all use Multi-Chip Modules (MCM). The mid-range servers (p630 and p650) use Single Chip Modules (SCM). The POWER4 architecture and both the SCM and MCM packaging are described in the following sections.

POWER4 chip

The components of the POWER4 chip are shown in Figure 2-4. The chip has two processors on board. Included in what we are referring to as “the processor” are the various execution units and the split first-level instruction and data caches. The two processors share a unified second-level (L2) cache, also on board the chip, through a Core Interface Unit (CIU), as shown in the figure. The CIU is a crossbar switch between the L2 cache (implemented as three separate, autonomous cache controllers) and the two processors. Each L2 cache controller can operate concurrently and feed 32 bytes of data per cycle.
The CIU connects each of the three L2 controllers to either the data cache or the instruction cache in either of the two processors. Additionally, the CIU accepts stores from the processors across 8-byte wide buses and sequences them to the L2 controllers. Each processor has associated with it a non-cacheable unit (the NC Unit in Figure 2-4) that is responsible for handling instruction-serializing functions and performing any non-cacheable operations in the storage hierarchy. Logically, this is part of the L2.

Figure 2-4   POWER4 logical description (block diagram showing the two processor cores, the CIU switch, the L2 and L3 controllers, the Fabric Controller, the GX controller, and the pervasive functions)

The directory for a third-level cache, L3, and logically its controller, are also located on the POWER4 chip. The actual L3 is on a separate chip. A separate functional unit, referred to as the Fabric Controller, is responsible for controlling data flow between the L2 and L3 controller for the chip and for POWER4-to-POWER4 communication. The GX controller is responsible for controlling the flow of information in and out of the system. Typically, this is the interface to an I/O drawer attached to the system. However, with the POWER4 architecture, this is also where an interface to a switch for clustering multiple POWER4 nodes together can be natively attached.

Also included on the chip are functions we logically call pervasive functions. These include trace and debug facilities used for First Failure Data Capture (FFDC), Built-In Self Test (BIST) facilities, the Performance Monitoring Unit, an interface to the Service Processor (SP) used to control the overall system, Power-On Reset (POR) sequencing logic, and Error Detection and Logging circuitry.

Multi-Chip Modules (MCM)

The POWER4 chips are packaged on a module called a Multi-Chip Module (MCM). Each MCM houses four chips (eight CPU cores) that are connected through chip-to-chip ports. The chips are mounted on the MCM such that they are all rotated 90 degrees from one another, as shown in Figure 2-5. This arrangement minimizes the interconnect distances, which improves the speed of the inter-chip communication. There are separate communication buses between processors in the same MCM and processors in different MCMs, as shown in Figure 2-6 on page 36.

Figure 2-5   POWER4 Multi-Chip Module (MCM): an 8-processor module, 4.5 inches square, of glass-ceramic construction, with the chips rotated to allow faster interconnection

An internal representation of the MCM is shown in Figure 2-6 on page 36, with four interconnected POWER4 chips. Each installed MCM comes with 128 MB of L3 cache, which provides 32 MB of L3 cache per POWER4 chip. The system bus (L3 cache, GX Bus, memory nest) operates at a 3:1 ratio with the processor frequency. Therefore, the L3 cache-to-MCM connections operate at:
- 375 MHz for 1.1 GHz processors
- 433 MHz for 1.3 GHz processors
- 500 MHz for 1.5 GHz processors
- 567 MHz for 1.7 GHz processors
Figure 2-6   Multi-Chip Module with L2, L3, and memory: four POWER4 chips (eight processors) on an MCM with four GX links for external connections; the L3 cache is shared across all processors if the amount of memory behind each L3 is the same size

The MCM is a proven technology that IBM has been using for many years in mainframe systems (now IBM eServer zSeries). It offers several benefits in mechanical design, manufacturing, and component reliability. IBM has also used MCM technology in RS/6000 servers in the past: the IBM RS/6000 Model 580 was based on the POWER2™ processor, which had all of its processing units and chip-to-chip wiring packaged in an MCM.

Single core POWER4 processor feature

Some technical applications benefit from very large bandwidth between processors and memory. The POWER4 processor delivers exceptional bandwidth to the cores inside. For those applications that require extremely high bandwidth, the High Performance Computing (HPC) feature is an attractive alternative. Instead of 8-way MCMs, you have 4-way MCMs with the same amount of L2 and L3 cache and the same bus interconnection. This configuration provides twice the amount of L2 and L3 cache per processor and additional memory bandwidth, compared to a pSeries 690 configured with 8-way processor MCMs. The additional cache and memory bandwidth available to each processor in this configuration may provide significantly higher performance per processor for certain engineering and technical applications.

Note: The HPC feature is only available on the pSeries 690 with a POWER4 1.3 GHz processor. It is not offered with the POWER4+ 1.5 or 1.7 GHz processors.

For more information on MCMs and the POWER4 architecture, refer to the white paper “POWER4 System Microarchitecture”, which can be found at:
http://www.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.html

You can also refer to the IBM Redbook IBM Eserver pSeries 690 and pSeries 670 System Handbook, SG24-7040.

Single Chip Module (SCM)

Single Chip Modules are used in the p630 and p650 mid-range servers and contain a single POWER4+ chip with either one or two processor cores (CPUs). The SCM is permanently mounted on a processor card. One key difference is that the chip-to-chip fabric bus (which is used between chips on the same MCM) is no longer relevant in the SCM; it is replaced by a module-to-module fabric. Each SCM is a Ceramic Column Grid Array (CCGA) package in which the chip carrier is raised slightly from its board mounting by small metal solder columns that provide the required connections and improved thermal resilience characteristics.

In addition to the SCM, each processor card also contains the L3 cache and the memory DIMMs, as shown in Figure 2-7 on page 38. The processor card is mounted in a rugged metal enclosure (book) that protects and secures the card (both in and out of the server), and helps manage the airflow used for cooling.

The POWER4+ storage subsystem consists of three levels of cache and the memory subsystem. The first two levels of cache are on board the POWER4+ chip. The first level is 64 KB of instruction (I) and 32 KB of data (D) cache per processor core. The second level is 1.5 MB of L2 cache on the POWER4+ chip and 1.44 MB on the POWER4 chip. All caches have either full ECC or parity protection on the data arrays, and the L1 cache has the ability to re-fetch data from the L2 cache in the event of soft errors detected by parity checking.

A 2-way configuration using two 1-way processor cards offers better performance than a configuration using a single 2-way processor card, because the maximum memory capacity is doubled from 16 GB to 32 GB and the Level 2 (L2) and Level 3 (L3) cache on each card is dedicated to a single processor. A 2-way configuration using a 2-way processor card shares the L2 and L3 caches.

Figure 2-7   POWER4+ processor card layout
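To relate this packaging description to a running system, the following generic AIX commands show how the processors in a node or LPAR are reported; the exact output varies by model and AIX level, so treat this only as a sketch.

# Processors visible to AIX in this node or LPAR
lsdev -Cc processor

# Type and clock frequency of the first processor
lsattr -El proc0

# Vital product data for processor-related components
lscfg -vp | grep -p -i proc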
2.6.2 Cluster 1600 and the HMC

A prerequisite to attaching a POWER4 or POWER4+ pSeries server to a Cluster 1600 is the Hardware Management Console (HMC). The current model of the HMC is the IBM 7315-C02. The IBM 7315-C01 with feature codes F/C 7315 and F/C 7316 can also be used for Cluster 1600 attachment.

The IBM 7315-C02 HMC is a dedicated desktop workstation that gives you a GUI for configuring and operating multiple pSeries servers in SMP, LPAR, or clustered environments. It comes with a hardware management application for configuring and partitioning the server. The IBM 7315-C02 HMC delivers the functions necessary to manage LPAR configurations:
- Creating and storing LPAR profiles that define the processor, memory, and I/O resources allocated to an individual partition
- Starting, stopping, and resetting a system partition
- Booting a partition or system by selecting a profile
- Displaying system and partition status
- Displaying a virtual operator panel of the contents for each partition or controlled system

The HMC offers a service focal point for the systems it controls. It is connected to the service processor of the system via a dedicated serial link. The HMC provides tools for problem determination and service support, such as call home and error log notification through an analog phone line.

Note: The IBM 7315-C02 HMC is a dedicated function device. It is utilized only for the control and service functions of the pSeries servers it serves. It is not available for use as a general purpose computing resource.

The HMC can control servers of multiple machine types. When combining multiple machine types on an HMC, substitutions of other server types are based upon the relative weighting factor of each server. For example, a p690 has twice the weighting factor of a p630, so a supported HMC configuration could control four p690 servers and eight p630 servers. For the number of servers supported per HMC, refer to Table 2-9.

Table 2-9   Number of servers per HMC
Server                    SP Switch   SP Switch2   SP Switch2 (total switched plus    Industry std interconnect
                                                   non-switched, subject to note 2    (non-switched)
                                                   limits)
p690/p670                 8           8            8                                  8
p655                      -           16           16                                 16
p650                      -           16           16                                 16
p630                      -           16           16                                 16
Number of LPARs per HMC   32          32           32                                 32

Note: With the SP Switch, switched and non-switched logical nodes cannot be mixed.

Redundant HMC

Each system that supports a Hardware Management Console (HMC) has two HMC port RS-232 connections, so that you may optionally attach a second HMC to the same system. The benefits of using two HMCs are:
- Access to the HMC management function capabilities is not interrupted.
- Access is ensured if the network is down.

In configurations with two HMCs, both HMCs are fully active and accessible at all times, enabling you to perform management tasks from either HMC at any time.
There is no primary or backup designation. To avoid basic conflicts, mechanisms in the communication interface between HMCs and the managed systems allow an HMC to temporarily take exclusive control of the interface, effectively locking out the other HMC. Usually this locking is done only for the duration of time it takes to complete an operation, after which the interface is available for further commands. HMCs are also automatically notified of any changes that occur in the managed systems, so the results of commands issued by one HMC are visible in the other. For example, if you select to activate a partition from one HMC, you will observe the partition going to the Starting and Running states on both HMCs. However, the locking between HMCs does not prevent users from running commands that might seem to be in conflict with each other. For example, if the user on one HMC selects to activate a partition, and a short time later a user on the other HMC selects to power the system off, the system will power off. Effectively, any sequence of commands that you can do from a single HMC is also permitted when it comes from redundant HMCs. For this reason, it is important to carefully consider how you want to use this redundant capability to avoid such conflicts. You might choose to use them in a primary and backup role, even though the HMCs are not restricted in that way. Because authorized users can be defined independently for each HMC, determine whether the users of one HMC should be authorized on the other. If so, the user authorization must be set up separately on each HMC. Important: Because both the HMCs provide Service Focal Point and Service Agent functions, connect a modem and phone line to only one of the HMCs, and enable its Service Agent. To prevent redundant service calls, do not enable Service Agent on both HMCs. 40 IBM Eserver pSeries Cluster Systems Handbook Perform HMC software maintenance separately on each HMC, at separate times, so that there is no interruption in accessing HMC function. This allows one HMC to run at the new fix level, while the other HMC can continue to run at the previous fix level. However, the best practice is to move both HMCs to the same fix level as soon as possible. HMC considerations When attaching HMCs to a Cluster 1600, we recommend that you configure a separate Ethernet adapter for the “administrative LAN”. This LAN is limited to 10/100Mb/s. If multiple servers are to be controlled from a single HMC, either an 8-port asynchronous adapter (F/C 2943)or a 128-port asynchronous adapter (F/C 2944) is required. The following list provides a summary of HMC requirements for a Cluster 1600: Machine type (M/T) 7040, 7039, and 7028 require a model 7315-C01 or later as the HMC. The HMC requires an RS-232 connection to each server. The RS-232 cable connects to the first HMC port on both the Hardware Management Console and the p655 or p630 servers. The second HMC port is used for redundant HMC configurations. Machine type 7039 also requires RS-422 connections to each Bulk Power Controller installed in the server’s M/T 7040-W42 frame. Note: Machine type 7040, 7039, and 7028 servers do not require any RS-232 connections to the CWS. Tip for the M/T 7039: When used in the HMC for a M/T 7039 cluster, the 8-port asynchronous adapter (F/C 2943) can be configured for simultaneous RS-232 and RS-422 communication. For smaller systems, this allows one adapter to communicate with the two BPC units in the frame (RS-422) and with six p655 servers (RS-232). 
For larger systems, a 128-port adapter may be required for server communications and the 8-port adapter would be dedicated to RS-422 frame communications. For additional information, refer to the server-specific hardware documentation. Chapter 2. Cluster 1600 hardware 41 Tip for LPARs: For systems configured with an SP Switch2, the LAN adapters must be placed in I/O subsystem slot 8 using the same respective LPAR as the switch adapter. Place a second LAN adapter in slot 9. For systems configured with an SP Switch, the LAN adapter must be placed in the same respective LPAR as the switch adapter, but does not need to be in the same I/O subsystem. Tip for Dynamic LPAR: An Ethernet connection between the HMC and each active partition on the pSeries server is recommended for all installations—and is required for dynamic LPAR configurations. This connection is used to provide: Additional systems management such as using Web-based System Manager for AIX in the individual partitions. Collection and passing of hardware service events to the HMC for automatic notification of error conditions to IBM. Total system inventory collection. For full connectivity options, see Chapter 3, “Network configuration” on page 121. HMC standard hardware The IBM 7315-C02 HMC is a dedicated desktop workstation use for system and partition control of pSeries servers. It also provides a service focal point function for these servers. The HMC has the following fixed hardware attributes: Intel® Pentium®-based desktop workstation 1 GB of system memory 40 GB minimum hard disk DVD-RAM for backup Two integrated serial ports One graphics port One integrated Ethernet port Six USB ports Three PCI slots Alternative rack mount HMC hardware The 7315-CR2 is a rack-mounted HMC that mounts into a 19-inch rack. 42 IBM Eserver pSeries Cluster Systems Handbook The HMC has the following fixed hardware attributes: Intel XEON-based rack-mounted system unit 1U tall 1 GB of system memory 40 GB minimum hard disk DVD-RAM for backup One integrated serial port One graphics port Two integrated 10/100 Mbps Ethernet ports Three USB ports Two PCI slots HMC hardware options Table 2-10 details the optional feature codes available for order with, or for, an HMC. Table 2-10 HMC hardware options F/C 2943 8-port asynchronous adapter PCI BUS EIA-232/RS-422 F/C 2944 128-port asynchronous controller PCI bus F/C 8120 Attachment cable HMC to host 6 meters (RS-232) F/C 8121 Attachment cable HMC to host 15 meters (RS-232) F/C 8122 Attachment cable HMC to W42 rack 6 meters (RS-422) F/C 8123 Attachment cable HMC to W42 rack 15 meters (RS-422) F/C 8137 2.4 MB/sec enhanced remote asynchronous node (RAN) 16-port EIA-232 F/C 8131 128-port asynchronous controller cable 4.5 m (1.2 MB/sec transfers) F/C 8132 128-port asynchronous controller cable 23 cm (1.2 MB/sec transfers) F/C 8133 RJ-45 to DB-25 converter cable F/C 8136 1.2 MB/sec rack-mountable remote asynchronous node (RAN) 16-port EIA-232 F/C 2934 Asynchronous terminal/printer cable EIA-232 (2.4 MB/sec transfers) F/C 3124 Serial port to serial port cable for drawer-to-drawer connections (2.4 MB/sec transfers) F/C 3125 Serial port to serial port cable for rack-to-rack connections (2.4 MB/sec transfers) F/C 4962 10/100 Mb/s Ethernet PCI adapter II Chapter 2. Cluster 1600 hardware 43 F/C 3628 IBM P260/275 Color monitor, stealth black and cable F/C 3636 L200P Flat panel monitor HMC standard software The HMC is preloaded with dedicated Linux operating software. 
Tip: HMC recovery software for pSeries (5639-N47) must be ordered in conjunction with every 7315-C02 system. For more information on HMC and configuration, refer to Effective System Management Using the IBM Hardware Management Console for pSeries, SG24-7038. 2.6.3 Firmware Firmware is a critical component in all modern systems; it is the system- or device-specific microcode that interfaces the hardware to the software or other pieces of hardware. It is critical that the firmware is kept up to date. Occasionally, new versions of system firmware and microcode for devices are introduced for the following reasons: New features, connectivity, security, and resource support are included to improve system operation. Serviceability is also an issue, and new levels may include improved problem determination and fault isolation, resulting in accurate error codes. Performance enhancements in the system: firmware may improve or resolve problems in response times and throughput. Finally, new levels of system firmware may improve the user interface with easier-to-understand messages. For these reasons, it is important to know your current level and be able to check it against the latest available version. In this section, we explain how to check what your installed version of system firmware is, what the latest available version is, and how to install it. It also explains how to check your microcode for devices on some of your devices. Note that firmware and microcode for devices, as well as their installation procedures, are different from machine to machine and device to device. The installation instructions must be followed exactly. It is now the customer’s responsibility to keep the firmware at a current level. In order to help customers to easily install, update, and manage their own microcode (also called machine code or firmware) pSeries and RS/6000 systems 44 IBM Eserver pSeries Cluster Systems Handbook and associated I/O adapters and devices are using new software tools and existing services. The following tools and existing services reiterate and support the Customer-Managed Microcode concept, which emphasizes customer responsibility for managing their own microcode updates: Microcode update application for AIX 5L V5.1 and V5.2 Microcode discovery service Web site: http://techsupport.services.ibm.com/server/aix.invscoutMDS pSeries and RS/6000 microcode updates Web site http://techsupport.services.ibm.com/server/mdownload Microcode update files and discovery tool V1.1 on CD-ROM Note: All product microcode installation is also available with IBM service (installation may be fee-based). The microcode update application, files, tools, and services make microcode available at your fingertips. Simple process steps help you analyze the system, acquire microcode, and install it without assistance from IBM. These tools and services also support “secure” accounts that do not have immediate access to the Internet and customer systems without a diskette drive. They enable efficient microcode-level surveying for new code. Two new tools and two existing services are available to assist you in managing and installing your microcode. 
Microcode update application for AIX 5L V5.1 and V5.2 Microcode discovery service Web site at: http://techsupport.services.ibm.com/server/aix.invscoutMDS pSeries and RS/6000 microcode updates Web site at: http://techsupport.services.ibm.com/server/mdownload Microcode update files and discovery tool V1.1 on CD-ROM Microcode update application The microcode update application is the primary microcode management tool for customers using AIX 5L V5.1 and V5.2. This application exists on standalone systems (those with no HMC attached) and HMC-controlled systems. The microcode update application executes a survey process that lets you and IBM determine whether there is a microcode update for your system. This Chapter 2. Cluster 1600 hardware 45 process can now be done without IBM's intervention; there is no need to contact an IBM service support representative. Microcode update application distribution The distribution methods for microcode are now more flexible. The Microcode Update Application tool for AIX 5L V5.1 and V5.2, or later, allows you to get the latest microcode from three different repositories: If you have access to the Internet, the tool accesses the IBM microcode Web site: http://techsupport.services.ibm.com/server/mdownload You can order the microcode update files and discovery tool V1.1 on CD, which lets you survey, report, and retrieve microcode without having to connect to the Internet. You can FTP to and from another machine or site to which you may have downloaded the latest level of microcode for temporary storage or test purposes. Microcode discovery service Web site You can determine if your pSeries or RS/6000 system is at the latest microcode level. Microcode discovery service gives you two ways to generate a real-time comparison report showing subsystems that may need to be updated. Before using this service, you will need to install a tool known as Inventory Scout onto the server you want to survey for microcode information. Using a secure Internet connection and a signed, trusted Java™ applet running on your Internet-connected workstation, IBM can connect to Inventory Scout, running on your server, and capture hardware and microcode data. The workstation or PC running the browser must be able to connect to your server. If your server is disconnected from the Internet, you can still create a data file on that system and then upload the data file to IBM from an Internet-connected system. pSeries and RS/6000 microcode updates Web site The pSeries and RS/6000 microcode updates Web site provides all available product microcode. Use this page to keep your microcode current with the latest available microcode updates. You can view “what's new” to see a list of microcode releases for systems, adapters, and devices sorted by release date, with the most recent releases on top. You can select the microcode updates and download the file to upgrade your system. You can stay informed of future microcode updates by subscribing to the microcode update mailing list to receive the Hardware Microcode Bulletins at: http://techsupport.services.ibm.com/server/mdownload/download.html 46 IBM Eserver pSeries Cluster Systems Handbook Microcode update files and discovery tool V1.1 The microcode update files and discovery tool V1.1 complements the microcode update application for AIX 5L V5.1 and V5.2 and the pSeries and RS/6000 microcode updates Web site. 
The microcode update files and discovery tool V1.1 expands the availability of microcode onto a CD-ROM and provides simple process steps to analyze the system, acquire microcode, and install it without the need for an IBM Service Support Representative (SSR) or the Internet. It supports “secure” accounts that do not have access to the Internet and systems without a diskette drive. It enables efficient microcode-level surveying for new code. With this CD-ROM, you can perform microcode updates when it is most convenient to you. This comprehensive source of the latest code eliminates having to access the Internet in multiple sessions in order to create a diskette of the latest code. Microcode formats The latest microcode levels are available in several formats for supported IBM systems and devices as of the date the CD was created (reference the date on the CD label). Use the microcode discovery tool to review installed microcode and download the latest microcode to your local disk. You can choose from one of these formats: A Java applet that runs in a Web browser GUI environment A script that runs in a command line AIX window text environment Alternative microcode check To quickly check the microcode level on your Cluster 1600 system, from AIX 4.3.3 ML08 onward, use the lsmcode command. This command, when run with no flags, displays the platform system firmware microcode level and the service. Use the command lsmcode -A to display all system and device microcode on the system. Example 2-2 shows typical output from this command, run on a 7040-681 LPAR. Example 2-2 lsmcode -A output {node1:root}/-> lsmcode -A sys0!system:RH021114 |System Firmware:RG021019_GA3_H|SPCN Firmware:0000RH 06182 fcs0!df1000f9.382101 ent0!1410ff01.SCU002 fcs1!df1000f9.382101 ent1!1410ff01.SCU002 ent2!1410ff01.SCU002 Chapter 2. Cluster 1600 hardware 47 ent3!1410ff01.SCU002 hdisk0!ST33660.41543033.43353039 ses0!0011 hdisk1!ST33660.41543033.43353039 ses1!0011 hdisk2!ST33660.41543033.43353039 ses2!0011 hdisk3!ST33660.41543033.43353039 ses3!0011sys0!service:CL000628 2.6.4 Electronic Service Agent Electronic Service Agent and Service Director are IBM software applications supplied with attached servers. Electronic Service Agent is a replacement for the previously-supplied Service Director. Both applications perform the same function and are referred to as service agent in this redbook. Service Agent runs on AIX and also on the HMC. Service Agent monitors the “health” of your pSeries, RS/6000 and Cluster 1600 systems. When a system fault is detected, the severity of the fault is analyzed and, if required, Service Agent notifies the IBM support center. In addition, you can also configure Service Agent to send an automated e-mail message containing the fault information to your system administrator. This requires mail to be active on each node. Upon receiving the fault notification, IBM automatically dispatches a service representative, along with parts if needed, to correct the fault. Service Focal Point Service Focal Point is an application, specific to HMC-controlled systems, that is installed by the customer engineer at installation, or shortly after, when the LPAR configuration has been established. The Service Focal Point works hand-in-hand with the Service Agent. The application runs on the HMC of POWER4 and later nodes to accommodate error reporting, analysis and repair in the LPAR environment. 
Service Focal Point (SFP) leverages the design capabilities of the HMC to provide equivalent virtual function to the current capabilities presented by physical op panels, service processor TTY menu interfaces and system firmware interfaces, as well as the capability for configuring/reconfiguring building block hardware into partitions. SFP is a system infrastructure which manages serviceable event information for the system building blocks. It consists of resource managers which monitor and 48 IBM Eserver pSeries Cluster Systems Handbook record information about different objects in the system. It filters and correlates events from the resource managers and initiates calls home, when appropriate. It also provides a user interface that allows a user to view the events and perform problem determination. When a problem is corrected, the user can record actions that have been taken to resolve the hardware problem. These features of SFP support the overall problem management strategy in a complex system. The SFP application receives Service Action Events from the service processor for critical system down situations, and from the Service Agent client application programs running on the individual logical partitions for system recoverable or predictive events, as well as operating system- or device driver-detected events. Service Action Event (SAE) log The SFP collects the serviceable events from different building blocks and logs them in a Service Action Event (SAE) log. The log entries are generated by analysis routines that run on an error that has occurred in a building block. The resource manager for the building block forwards information about the event to the Service Focal Point and the information is placed in the SAE log. The particular content of the error data depends upon the type of the error and on the system configuration itself. The SAE log on the SFP also contains pointers to extended information that may have been recorded at the time of a serviceable event by the building block. Extended error collection includes not only the collection of first failure data capture, but also vital product data, partition information, operating system error logs, service processor error logs, error register data, and so on. When the SFP receives a new log entry, filtering is performed to determine if this was a unique event. The filtering is done because sometimes an event notification can come from more than one resource manager for the same event, or a resource manager may forward a notification for an event which previously occurred but has not been corrected yet. Service Agent component When a Service Action Event is logged in the SFP, the system needs to communicate the failure back to IBM. During this call home function, particular error data and system configuration information needs to be sent to IBM to drive the service delivery infrastructure. The SFP utilizes the Service Agent Focal Point application residing on the HMC, along with the HMC modem, to initiate the call home and transfer the pertinent error information to IBM Service. When a call home is required, Service Agent manages the connection to IBM which is used to open a problem record. The problem record is used by the service delivery team to determine whether or not Chapter 2. Cluster 1600 hardware 49 to dispatch a customer engineer (CE) with the appropriate service parts to the system to perform a repair. 
When service personnel perform a repair on the system, the SFP is used to identify the source of the problem and record information relating to the repair. When the repair has been completed, the service representative can update the SAE log with Field Replacement Unit (FRU) replacement information and any comments that the service representative has. The information stored by the SFP represents the system's service history and can be used to ensure proper maintenance over the life of the system. Service Agent Gateway Server (SAS) Where a customer has multiple pSeries servers or Cluster 1600s, a Service Agent Gateway Server can be set up. This will be the server in the organization which has a modem attached to it. Other servers that have Service Agent running on them will contact the Service Agent Gateway Server when they need to log a service call. The Service Agent Gateway Server will then dial IBM and log the service call for the client server. The client Service Agent Servers must be registered with the Service Agent Gateway Server before calls can be logged. The client Service Agent Servers contact the Service Agent Gateway Server through the LAN, therefore only the Service Agent Gateway Server requires a modem and analog phone line. 2.6.5 Planning for Cluster 1600 servers Planning for Cluster 1600 servers should be done with consultation with the following documentation: IBM eServer Cluster 1600 Planning Volume 1, Hardware and Physical Environment, GA22-7280 IBM eServer Cluster 1600 Planning Volume 2, Control Workstation and Software Environment, GA22-7281 pSeries Systems Handbook 2003 Edition, SG24-5120 In the following sections, we summarize the key points for planning for Cluster 1600 servers. Network communication Network communication is a very complex aspect of planning large clustered systems. Cluster 1600 systems have several communication requirements, including the following: All SP systems require an SP Ethernet LAN for system administration. 50 IBM Eserver pSeries Cluster Systems Handbook Switch-configured systems require a frame-to-frame switch cable network. SP systems connected to external networks (or with networks between SP system partitions) require additional communication adapters. The required SP Ethernet LAN that connects all nodes to the control workstation is needed for system administration—and it is to be used for that purpose exclusively. If you attempt to route non-administrative traffic over the SP Ethernet and it interferes with administrative traffic, you have to reroute the non-administrative traffic. Further network connectivity is supplied by various adapters (some of which are optional) that provide connection to I/O devices, networks of workstations, and mainframe networks. Ethernet, FDDI, Token-Ring, HIPPI, SCSI, FCS, and ATM are examples of adapter types that can be used as part of an RS/6000 SP system. Important: The establishment of a trusted network between the control workstation (CWS) and the Hardware Management Console (HMC) is recommended. For more information, refer to the PSSP V3.4 Read This First document, or to the PSSP V3.5 documentation at: http://www.rs6000.ibm.com/resource/aix_resource/sp_books/ For more details on network configurations, see Chapter 3, “Network configuration” on page 121. Security Security should be planned very carefully because there are many options available to secure your system. Physical, network, login, and file access should be policy-driven on a “need to have” access basis. 
Kerberos should be considered for authentication purposes and for PSSP daemon use. Additionally, ssh should be considered if encryption is also required. Restricted root access should also be considered for individual nodes administrators. For more information on security, refer to the following documentation: Exploiting RS/6000 SP Security: Keeping It Safe, SG24-5521 IBM Eserver Certification Study Guide: Cluster 1600 Managed by PSSP, SG24-7013 Chapter 2, PSSP 3.5 Administration Guide, SA22-7348 Chapter 6, IBM eServer Cluster 1600 Planning Volume 2, Control Workstation and Software Environment, GA22-7281 Chapter 2. Cluster 1600 hardware 51 Environmental considerations When planning to install the servers, consideration should be given to the following items: Electrical requirements (some systems are 3-phase) Floor space (including engineering access space) Weight distribution (some systems are very heavy) Heat output (check air conditioning requirements) Delivery access (check if either height or weight reduction is required) Server-specific documentation includes environmental requirements. You should also contact your IBM Customer Engineer or IBM Business Partner to help with installation planning. 2.7 Hardware supported and currently marketed This section describes the Cluster 1600 hardware components, including servers, switches and control workstations, that are currently supported in the Cluster 1600 and available from IBM. 2.8 pSeries servers In this section, we present the pSeries servers that are building blocks for a Cluster 1600 system; we describe each server and available features and options. 2.8.1 pSeries 615 server (7029-6C3 and 6E3 deskside) The 7029 pSeries 615 Model 6E3 deskside server, shown in Figure 2-8, gives you the tools for managing e-business, greater application flexibility, and innovative technology, all designed to help you capitalize on the e-business revolution. Each model is available in 1-way or 2-way configurations. 52 IBM Eserver pSeries Cluster Systems Handbook Figure 2-8 The p615 6E3 (left) and 6C3 (right) The symmetric multiprocessor (SMP) uses 1-way or 2-way, state of the art, 64-bit, copper-based, POWER4+ microprocessors running at 1.2 GHz. Each processor includes 8 MB of Level 3 (L3) cache. The base 1 GB of main memory can be expanded to 16 GB for faster performance and exploitation of 64-bit addressing, as used in large database applications. The Model 6E3 contains up to 11 bays. There are four front-accessible, hot-swap-capable DASD bays in a minimum configuration with an additional four hot-swap-capable DASD bays optional. The eight DASD bays can accommodate up to 1174.4 GB of disk storage. Two of the remaining three bays are used for a diskette drive and a DVD-ROM, and the third bay can contain a DVD-RAM or tape drive. Chapter 2. Cluster 1600 hardware 53 Figure 2-9 7029-6C3 internal structure Note: Support for logical partitioning (LPAR) is not available on the Models 6E3 and 6C3. Summary of standard and optional features Table 2-11 lists the standard and optional features available for the pSeries 615 model 6C3 and model 6E3. 
Table 2-11 The p615 standard and optional features

Processors: 1-2 POWER4+ 1.2 GHz with 8 MB L3 cache (F/C 5133)
Memory: 1 - 16 GB DDR SDRAM
Internal disk bays: 8 hot-swap bays, 18.2 GB - 146.8 GB per bay
Media bays: 3 (2 slimline and 1 full media bay)
PCI slots: 4 64-bit hot-swap PCI-X (long); 2 32-bit hot-swap PCI-X (short)
Standard ports: 1 keyboard, 1 mouse, 3 serial, 1 parallel, 2 HMC ports
Integrated SCSI adapters: 2 Ultra3 SCSI controllers (internal)
Integrated LAN adapter ports: 2 (1 x 10/100 Ethernet port and 1 x 10/100/1000 Ethernet port)
I/O drawers: N/A
Redundant power supply: Optional redundant AC power (F/C 6266)
Operating system version: AIX Version 5.1 with 5100-04 ML or later; Linux SLES 8 for pSeries
Physical specifications (rack mount): Width 437 mm (17.2 in); Depth 508 mm (20.0 in); Height 178 mm (7.0 in); Weight 35.5 kg (78.0 lb) minimum configuration, 43.1 kg (94.8 lb) maximum configuration
Rack 7014-T00: maximum is nine per rack
Rack 7014-T42: maximum is ten per rack; racks may be shared with peripherals
Power requirements: Operating voltage 100 to 127 or 200 to 240 V ac (auto-ranging); Operating frequency 47/63 Hz; Power requirements 300 watts (typical), 450 watts (maximum); Power source loading 0.30 kVA (typical configuration), 0.50 kVA (maximum configuration)

Restriction: PCI-X slot 1 only allows short cards. 64-bit cards are not allowed in the 32-bit slots (slots 2 and 3).

Cluster considerations
The Cluster 1600 is enhanced to include the p615 server within the available cluster building blocks. This server is now available as both an initial option and a field addition to the Cluster 1600.

Restriction: The p615 is only supported in a CSM-managed Cluster 1600.

When ordering a p615 for attachment to a Cluster 1600, the features listed in Table 2-12 should be considered.

Table 2-12 Cluster features for the p615
7315-C02: Hardware Management Console (HMC)
7315 F/C 8120: Attachment cable, HMC to host, 6 meters
7315 F/C 8121: Attachment cable, HMC to host, 15 meters
9078-160 F/C 0012: Cluster 1600 7029-type server

Tip: An HMC must be available, or ordered with the p615. If an HMC is already available, then the console attachment cable must be ordered. An asynchronous card may also be required. See Table 2-10 on page 43 for the HMC options.

CSM/PSSP support statement
The following support statement details the CSM and PSSP support for the p615:
The p615 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, AIX 5.2 ML01 or later.

Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as managed nodes and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3.

For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required:
Added Service for RSCT on AIX 5.1: IY42782
Added Service for RSCT on AIX 5.2: IY42783
CSM for AIX 5L added service: IY42353
CSM Support for p670 and p690 POWER4+: IY42356
CSM Support for 7026 servers, 9076 SP nodes and p650: IY42379
CSM 1.3.1 support for 9076 SP Node Features 2054/2058: IY42847
CSM Support of p655 POWER4+: IY42377

The p615 is not supported on PSSP 3.5.
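Before adding a p615 (or any AIX node) to a CSM cluster, it can be useful to confirm that the management server and node meet the maintenance levels and APARs quoted above. The following is a hedged sketch using standard AIX commands; the csm.* fileset names are shown for illustration and may differ on your installation:

# Confirm the AIX maintenance level (for example, 5200-01 or 5100-04)
oslevel -r
# Verify that the required CSM and RSCT APARs are installed
for apar in IY42369 IY42353 IY42782 IY42783
do
    instfix -ik $apar
done
# Review the installed CSM fileset levels
lslpp -L "csm.*"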
For more information on the pSeries 615, refer to the specific server documentation or to pSeries Systems Handbook 2003 Edition, SG24-5120.

2.8.2 pSeries 630 server (7028-6C4 and 6E4 deskside)
The 7028 pSeries 630 Model 6C4, as shown in Figure 2-10 on page 58, is a rack-mounted server, and the 7028 IBM eServer pSeries 630 Model 6E4 is the deskside model shown in the same figure. The Model 6C4 provides the power, capacity, and expandability required for e-business computing. It offers 64-bit scalability via the 64-bit POWER4 or POWER4+ processor packaged as 1-way and 2-way cards.

Figure 2-10 The p630 6C4 (left) and 6E4 (right)

With its two processor card positions, the Model 6C4 can be configured into 1-way, 2-way, or 4-way configurations. The processor cards operate at 1.0 GHz with 32 MB of L3 cache per processor card, or 1.2 and 1.45 GHz with 8 MB of L3 cache per processor. The pSeries 630 Model 6C4 memory DIMMs are mounted on the CPU card and can contain up to 32 GB of memory. The p630's processors are packaged on Single Chip Modules (SCM). There is one POWER4 or POWER4+ chip on the module with either one or two CPU cores within the chip. The Model 6C4 contains six bays. The four front-accessible, hot-swap-capable bays can accommodate up to 587.2 GB of disk storage.

Figure 2-11 7028-6C4 internal structure

LPAR considerations
Support for logical partitioning (LPAR) is available on the Models 6E4 and 6C4. LPAR enables you to divide a Model 6E4 system into two LPARs and the Model 6C4 into as many as four LPARs. Each LPAR functions as a “computer within a computer” with its own instance of the operating system. For more information on LPARs on pSeries, refer to the IBM Redbook The Complete Partitioning Guide for IBM eServer pSeries Servers, SG24-7039.

p630 I/O drawers
The IBM eServer p630-6C4 can attach two 7311-D20 I/O drawers to increase the number of I/O adapters that can be inserted into the system. I/O drawers are only supported on the rack-mounted model of the p630.

7311-D20 I/O drawer
The IBM 7311 Model D20 is a rack-mounted, high-density expansion drawer that attaches to the pSeries 630 Model 6C4 to provide remote I/O. There are seven hot-swap PCI-X I/O slots in the 7311-D20 and twelve optional hot-swap disk drive bays. The 7311-D20 occupies 4U (7.00 inches) of space in a rack and mounts in a 19-inch standard rack drawer. The Model D20 is 24 inches deep. The fans, power supplies, and PCI adapters are top-accessible, while the disk drives are front-accessible for easy service and maintenance. Figure 2-12 shows the internal structure of a 7311-D20.

Figure 2-12 7311-D20 I/O drawer (p630)

Tip: If the disk drives in the Model D20 are being used to support two LPARs, then two SCSI adapters must be installed in the Model D20; one adapter is needed to support each 6-pack of DASD.

Summary of standard and optional features
Table 2-13 shows the standard and optional features available for the pSeries 630 model 6C4 and 6E4.
60 IBM Eserver pSeries Cluster Systems Handbook Table 2-13 Standard and optional features Description Quantity Type Processors 1-4 1-4 1-4 POWER4 1.0GHz, 32MB L3 Cache (F/C 5131) POWER4+ 1.2GHz 8MB L3 Cache (F/C 5133) POWER4+ 1.45GHz, 8MB Cache (F/C 5126) Memory 1 - 32GB DDR SDRAM Internal disk bays 4 Hot swap 18.2GB - 146.8GB per bay Media bays 2 Bay 1 for CD or DVD Bay 2 for CD, DVD, diskette or tape (optional) PCI slots 4 6 64 bit hot swap PCI-X with 1.0GHz proc 64 bit hot swap PCI-X with 1.2 and 1.4GHz Standard ports 1 1 3 1 2 2 Keyboard Mouse Serial Parallel HMC ports Model 6C4 only RIO ports, Rio-2 ports with F/C 9581 Integrated SCSI adapters 2 Ultra 3 SCSI controllers (1 int & 1 ext VHDC I AMP half pitch 68-pin) Integrated LAN adapter ports 2 10/100 Ethernet controller with 2 ports I/O Drawers Optional 2 x 7311-D20 can be attached to model 6C4 Redundant power supply Optional Redundant AC power (F/C 6273) Operating System Version AIX Version 5.1 with 5100-02 ML or later Physical specifications Rack 7014-T00 Rack 7014-T42 Width: 444.4mm (17.5in) Depth: 609.6mm (24.0in) Height:172.8mm (06.8in)(4 EIA units) weight: 32.0 kg (70.4 lb) - Minimum Configuration 47.3 kg (104 lb) - Maximum Configuration with rails 9 10 Maximum is nine per rack Maximum is ten per rack Racks may be shared with peripherals Chapter 2. Cluster 1600 hardware 61 Description Quantity Power requirements Type Operating voltage: (auto-ranging) 100 to 127 or 200 to 240 V ac Operating frequency: 50/60 Hz Power requirements: 2-way (typical configuration): 330 watts 2-way (maximum configuration): 500 watts 4-way (typical configuration): 500 watts 4-way (maximum configuration): 750 watts Power source loading: 2-way (typical configuration): 0.348 kVA 2-way (maximum configuration): 0.522 kVA 4-way (typical configuration): 0.522 kVA 4-way (maximum configuration): 0.783 kVA Maximum altitude: 2,135 m (7,000 ft) Cluster considerations The Cluster 1600 is enhanced to include the p630 server within the available cluster building blocks. This server is now available as both an initial option and a field addition to the Cluster 1600. When ordering a p630 for attachment to a Cluster 1600, the features in Table 2-14 should be considered. Table 2-14 Cluster features for p630 Feature code Description 7315-C02 Hardware Management Console (HMC) 7315 F/C 8120 Attachment cable HMC to host 6 meters 7315 F/C 8121 Attachment cable HMC to host 15 meters 7028 F/C 8398 SP Switch2 PCI-X attachment adapter 7028 F/C 8397 SP Switch2 PCI attachment adapter 9078-160 F/C 0012 Cluster 1600 7028 type server 9078-160 F/C 0013 Cluster 1600 7018 LPAR (switched) Tip: An HMC must be available, or ordered with the p630. If an HMC is already available, then the console attachment cable must be ordered. An asynchronous card may also be required. See Table 2-10 on page 43 for the HMC options. 62 IBM Eserver pSeries Cluster Systems Handbook CSM/PSSP support statement The following support statement details the CSM and PSSP support for the p630: The IBM Eserver p630 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, 5.2 ML01 or later. Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as managed nodes and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3. 
For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required: Added Service for RSCT on AIX 5.1 IY42782 Added Service for RSCT on AIX 5.2 IY42783 CSM for AIX 5L added service IY42353 CSM Support for p670 & p690 POWER4+ IY42356 CSM Support for 7026 servers, 9076 SP nodes and p650 IY42379 CSM 1.3.1 support for 9076 SP Node Features 2054/2058 IY42847 CSM Support of p655 POWER4+ IY42377 The p630-6C4 is supported on PSSP 3.5 with AIX 5.1 ML03 or later. For more information on the pSeries 630 and features, refer to the specific server documentation or to pSeries Systems Handbook 2003 Edition, SG24-5120. 2.8.3 pSeries 650 server (7038-6M2) The pSeries 650 model 6M2, as shown in Figure 2-13 on page 64, consists of a single rack-mounted drawer containing the processors, memory, disk drives, and hot-plug I/O slots. The p650 server and optional attached 7311-D10 or D20 I/O drawers implement redundant power and redundant cooling. Chapter 2. Cluster 1600 hardware 63 Figure 2-13 The p650 6M2 The p650 system can accommodate one, two, three, or four 2-way processor cards to create configurations of 2, 4, 6, or 8 processors as required, offering outstanding scalability as your business needs dictate. The p650 rack-mounted drawer configuration offers flexibility regarding the number of server and I/O drawers that can be mounted in the rack, providing more compute and I/O power per square foot of valuable floor space. The p650 requires 8U (EIA Units) of rack space and up to eight 7311-D10 I/O drawers can be attached, requiring 4U of rack space for each pair of I/O drawers. A minimum p650 configuration requires only 8U of rack space, while a system configured with eight 7311-D10 drawers fits into a 24U space. (The maximum configuration, which consists of one p650 drawer and eight 7311-D20 I/O drawers, requires 40U of rack space and provides 63 available hot-plug, PCI-X slots with up to eight processors and 64 GB of memory.) Depending on the number of I/O drawers attached, up to four p650 systems can be installed in a 7014-T00 rack, with room remaining to install external data storage drawers. Figure 2-14 on page 65 shows the internal structure of pSeries 650. 64 IBM Eserver pSeries Cluster Systems Handbook Figure 2-14 7038-6M2 internal structure Table 2-15 on page 67 shows the standard and optional features available for the pSeries 650 model 6M2. p650 I/O drawers The 650 has a choice of I/O drawers available to it: the 7311-D10, and the 7311-D20. The primary difference is that the D20 contains internal SCSI disks which can be used to boot LPARs from, and the D10 only contains adapter slots. In the following sections, we discuss each drawer in more detail. 7311-D10 I/O drawer The IBM 7311 model D10 I/O drawer is a rack-mountable expansion cabinet that can be attached to a pSeries 650 server. Each model D10 drawer gives you six full-length adapter slots. Up to eight model D10 drawers are supported by the pSeries 650, with connections provided by remote I/O adapters and cables. The model D10 requires 4U of vertical space in a 19-inch rack, such as the IBM 7014-T00 or 7014-T42. Two D10 drawers can fit side by side within the enclosure that provides rack mounting hardware. The model D10 offers a modular growth path for pSeries 650 systems with increasing I/O requirements. When a pSeries 650 is fully configured with eight Chapter 2. Cluster 1600 hardware 65 attached model D10 drawers, the combined system supports up to 55 PCI adapters. Figure 2-15 shows the structure of a 7311-D10 I/O drawer. 
Figure 2-15 7311-D10 internal structure

7311-D20 I/O drawer
The IBM 7311 Model D20 is a rack-mounted, high-density expansion drawer that attaches to the pSeries 630 Model 6C4 to provide remote I/O. There are seven hot-swap PCI-X I/O slots in the 7311-D20 and twelve optional hot-swap disk drive bays. The 7311-D20 occupies 4U (7.00 inches) of space in a rack and mounts in a 19-inch standard rack drawer. The Model D20 is 24 inches deep. The fans, power supplies, and PCI adapters are top-accessible, and the disk drives are front-accessible for easy service and maintenance. Figure 2-16 on page 67 shows the internal structure of a 7311-D20.

Figure 2-16 7311-D20 internal structure (p650)

Tip: If the disk drives in the Model D20 are being used to support two LPARs, then two SCSI adapters must be installed in the Model D20 (one to support each 6-pack of DASD).

Summary of standard and optional features

Table 2-15 p650 standard and optional features

Processors: 2, 4, 6, or 8 POWER4+ 1.2 GHz with 8 MB L3 cache (F/C 5122); 2, 4, 6, or 8 POWER4+ 1.45 GHz with 32 MB L3 cache (F/C 5208)
Memory: 1 - 64 GB DDR SDRAM
Internal disk bays: 4 hot-swap bays, 18.2 GB - 146.8 GB per bay
Media bays: 2 auto-docking bays for CD, DVD, diskette or tape (optional)
PCI slots: 6 64-bit hot-swap 133 MHz PCI-X (full length); 1 32-bit hot-swap 66 MHz PCI-X (half length)
Standard ports: 1 keyboard, 1 mouse, 4 serial, 1 parallel, 2 HMC ports, 1 status beacon port, 2-8 RIO or RIO-2 ports (the number of ports depends on the processor cards and GX RIO cards)
Integrated SCSI adapters: 2 Ultra3 SCSI controllers (1 internal and 1 external 68-pin)
Integrated LAN adapter ports: 1 10/100 Ethernet controller
I/O drawers: Optional; 8 x 7311-D10 or D20 drawers can be attached
Redundant power supply: Standard; AC power supplies (F/C 9172)
Operating system version: AIX Version 5.1 with 5100-02 ML or later
Physical specifications: Width 444.4 mm (17.5 in); Depth 770 mm (29.9 in); Height 351 mm (13.8 in) (8 EIA units); Weight 93.0 kg (205.4 lb) maximum configuration
Rack 7014-T00: maximum is four per rack (see following notes)
Rack 7014-T42: maximum is five per rack, 5 PDUs required; racks may be shared with peripherals (see following notes)
Power requirements: Operating voltage 200 to 240 V ac, 50/60 Hz; Electrical output 1,070 watts (typical), 1,600 watts (maximum); Power source loading 1.126 kVA (typical configuration), 1.684 kVA (maximum configuration); Thermal output 1,070 joules/sec (3,652 Btu/hr, typical configuration), 1,600 joules/sec (5,461 Btu/hr, maximum configuration)

Note: An IBM 7014-T00 or T42 rack must have at least one Power Distribution Unit (PDU) per p650 system installed. It is recommended that each power supply in a p650 system be connected to a different PDU. No more than two p650 power supplies may be connected to the same PDU. Each PDU has nine C13 power connectors, in groups of three. When a p650 power supply is connected to one PDU power outlet, the other two connectors in that group of three must not be used.

Important: The 7014 PDUs must be one of the following feature codes: F/C 7176, F/C 7177, F/C 7178, F/C 9176, F/C 9177, or F/C 9178. If you are using an existing rack, then an upgrade to the rack may be required.

Capacity on Demand (CuOD)
Capacity on Demand (CuOD) is a way of planning for future growth within your server.
It means that you can have extra resources available which can be utilized at short notice and with no disruption to service. Payment for the extra resources is only made when the resources are used. Capacity on Demand is available for both CPU resource and memory resource. It also has the added advantage that if one of the active CPUs fails, it can be substituted automatically and without disruption for one of the inactive CPUs. This action is non-chargeable. The pSeries 650 systems can be shipped with non-activated resources (processors and/or memory), which may be purchased and activated at a certain point in time without affecting normal machine operation. The following methods are available: Capacity Upgrade on Demand Memory Capacity Upgrade on Demand On/off Capacity on Demand Trial Capacity Upgrade on Demand For further information about Capacity on Demand, refer to specific server documentation or to pSeries 650 Model 6M2 Technical Overview and Introduction, REDP0194. LPAR considerations Support of logical partitioning (LPAR) is available on the pSeries 650. LPAR enables you to split a Model 6M2 system into eight LPARs. Each LPAR functions as a “computer within a computer”, with its own instance of the operating system. Note the following points: Static LPAR is supported with AIX V5.1 or V5.2. Dynamic LPAR requires AIX V5.2. A 7315-C01 Hardware Management Console is required for all LPAR configurations. Chapter 2. Cluster 1600 hardware 69 A fully configured pSeries 650 can support up to eight partitions. A partition must have at least a bootable SCSI or SSA or Fibre Channel adapter, and an Ethernet adapter/controller. Since all the PCI slots in p650 can be independently assigned to partitions, this means that four partitions with minimal I/O can be configured without requiring a separate 7311-D10 I/O drawer. A model D10 I/O drawer can provide up to three additional minimal I/O partitions. To configure the maximum of eight partitions would require at least two model D10 I/O drawers. The actual number of D10 I/O drawers required depends on the customer's I/O requirements for each partition. Note: Both the integrated internal SCSI bus and the integrated external SCSI bus must be assigned to the same partition. When the system is configured with the (F/C 6578) Ultra3 SCSI Backplane, all of the internal disk drives and media share the same SCSI bus and must be assigned to the same partition. That means that all but the first partition must be NIM-installed and use network diagnostics, or can only be CD-ROM-installed or use network diagnostics when the first partition is shut down. The user should not use the internal disk drives during install because drives could have data for the first partition and may be reassigned back to the first partition. We recommend that you use NIM and network diagnostics for all partitions except the first partition—or that internal disk drives not be used for any partition. Two independent pairs of hot-swappable drives are optionally available by using the (F/C 6579) Split SCSI Backplane. This configuration is available for systems that require two logical partitions without using any I/O drawers or external boot devices. When configured using a single SCSI card and the external SCSI port, the internal media bays will be grouped with the two drives controlled by the external SCSI port. Tip: To free the media devices for assignment to LPARs, it is recommended to use two SCSI adapters to support the split SCSI backplane. 
In this configuration, the media devices will be attached to the internal SCSI controller and be free to be assigned to any partition, and not affect a configuration using the internal disks. When the system is configured with the (F/C 6578) Ultra3 SCSI Backplane, external disk drives must be provided for all but the first partition. When the system is configured with the (F/C 6579) Split SCSI Backplane, external disk drives must be provided for all but the first two partitions. AIX does not support diskless LPAR operation. For more information on LPARs on pSeries, refer to The Complete Partitioning Guide for IBM eServer pSeries Servers, SG24-7039.

Note: When running LPAR with AIX V5.1, fully populated processor cards cannot mix memory DIMMs with different capacities on the same card.

Cluster considerations
The Model 6M2 is supported in either a non-switched IBM Eserver Cluster 1600 or a switched Cluster 1600 system using the SP Switch2 adapter (F/C 8398). Up to 32 p650s running in non-LPAR (full system partition) mode are supported in a cluster. A Cluster 1600 can scale up to 128 LPARs. PSSP Version 3.5 is required to support the SP Switch2 adapter (F/C 8398) with AIX 5L Version 5.2 or Version 5.1.

If a Model 6M2 configured in LPAR mode is part of the cluster, all LPARs must be part of the cluster. It is not possible to use selected LPARs as part of the cluster and use others for non-cluster use.

The HMC uses a dedicated connection to the Model 6M2 to provide the functions needed to control the server, such as powering the system on and off. The HMC must have an Ethernet connection to the CWS. Each LPAR in the Model 6M2 must have an Ethernet adapter to connect to the CWS trusted LAN.

When ordering a p650 for attachment to a Cluster 1600, the features in Table 2-16 should be considered.

Table 2-16 Cluster features for p650
7315-C02: Hardware Management Console (HMC)
7315 F/C 8120: Attachment cable, HMC-to-host, 6 meters
7315 F/C 8121: Attachment cable, HMC-to-host, 15 meters
7038 F/C 8398: SP Switch2 PCI-X attachment adapter
9078-160 F/C 0014: Cluster 1600 7038-type server
9078-160 F/C 0015: Cluster 1600 7038 LPAR (switched)

PSSP
The 7038-6M2 with F/C 8398 requires PSSP V3.5 with the following APAR:
– IY42352 PSSP V3.5 support for p650 with SP Switch2 PCI-X in RIO mode
The 7038-6M2 with F/C 8398 requires PSSP V3.4 or PSSP V3.5 with the following APARs:
– IY42359 PSSP V3.4 support for p650 with SP Switch2 PCI-X in RIO-2 mode
– IY42358 PSSP V3.5 support for p650 with SP Switch2 PCI-X in RIO-2 mode

Tip: An HMC must be available, or ordered with the p650. If an HMC is already available, then the console attachment cable must be ordered. An asynchronous card may also be required. See Table 2-10 on page 43 for the HMC options.

CSM/PSSP support statement
The following support statement details the CSM and PSSP support for the p650:
The IBM Eserver p650 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, 5.2 ML01 or later.

Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as managed nodes and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3.
For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required: Added Service for RSCT on AIX 5.1 IY42782 Added Service for RSCT on AIX 5.2 IY42783 CSM for AIX 5L added service IY42353 CSM Support for p670 and p690 POWER4+ IY42356 CSM Support for 7026 servers, 9076 SP nodes and p650 IY42379 CSM 1.3.1 support for 9076 SP Node Features 2054/2058 IY42847 CSM Support of p655 POWER4+ IY42377 The IBM Eserver p650-6M2 is supported on PSSP 3.5 with AIX 5.1 ML03 or later. For more information on the pSeries 650 and its features, refer to the specific server documentation manual or to pSeries Systems Handbook 2003 Edition, SG24-5120. 2.8.4 pSeries 655 server (7039-651) The pSeries 655 (7039-651), as shown in Figure 2-17 on page 74, is the latest in a family of products within IBM's UNIX, 64-bit, symmetric multiprocessing (SMP) servers. The pSeries 655 offers a super-dense package (up to 128 processors in a single frame), ideal for many high-performance computing and business intelligence applications. Based on IBM's leadership POWER4 Technology, the p655 is designed for medium to large customers whose workloads are best managed with a clustered server solution. Chapter 2. Cluster 1600 hardware 73 Figure 2-17 The p655 Packaged in scalable, single multi-chip module (MCM)-based thin nodes, within a half-drawer, 4U form factor, up to sixteen p655s can be installed within a 7040-W42 frame. The p655 server can be interconnected to the SP Switch2 fabric using the SP Switch2 PCI-X adapter as a standalone cluster, or as an additional building block of the Cluster 1600. The p655 can be partitioned into four logical partitions (LPARs), either switched or non-switched, for greater flexibility. Note: The p655 must be mounted in a 7040-W42 24-inch rack. Within the 7040-W42 frame for the p655, the 7040-61D I/O drawer offers expansion of up to 20 blind-swap PCI bus slots, and up to 16 additional Ultra3 SCSI disks per server that can be hot-plugged. 74 IBM Eserver pSeries Cluster Systems Handbook Figure 2-18 7039-651 internal structure The ultra-powerful pSeries 655 sets the standard for supercomputing class UNIX servers with its mainframe-heritage LPAR and self-managing features. The p655 is the latest system to offer the POWER4+ microprocessor. The POWER4+ chip is a dual processor SMP on a single piece of silicon. It owes its advanced performance to the IBM SOI and copper fabrication technology. The pSeries 655 can be configured with a 7040-61D I/O Drawer. The I/O drawer can contain up to 16 hot-swap-capable disks structured into four 4-packs of 18.2 GB, 36.4 GB, 73.4 GB, or 146.8 GB disks. The drawer contains two planars with 10 slots/planar, 20 slots/drawer. The slots are hot-plug-capable to permit PCI adapters to be added or replaced without extending the I/O drawer, all while the system remains available to the customer. PCI-X planars in the 7040-61D I/O Drawer, with RIO-2 connectivity, are supported on the POWER4+ version of the 7039-651. 7040-61D I/O drawer The I/O drawers provide internal storage and I/O connectivity to the system. Figure 2-19 on page 76 shows the rear view of an I/O drawer, with the PCI slots and riser cards that connect to the RIO ports in the I/O books. Chapter 2. 
Cluster 1600 hardware 75 2 SCSI Backplanes 4 SCSI Drives each 2 SCSI Backplanes 4 SCSI Drives each 24 inch 10 PCI slots RIO Riser Card Rear View 10 PCI slots 4U RIO Riser Card Figure 2-19 I/O drawer rear view Each drawer is composed of two physically symmetrical I/O planar boards that contain 10 Hot-Plug PCI slots each, and PCI adapters can be inserted in the rear of the I/O drawer. The planar boards also contain two integrated Ultra3 SCSI adapters and SCSI Enclosure Services (SES), connected to a SCSI 4-pack Backplane. Tip: Both planars within the 7040-61D I/O drawer can be connected to the same p655 CEC, or each individual planar can be attached to a separate p655 CEC. The I/O drawers exist in two technologies: RIO and RIO-2. I/O drawers of both technologies offer the same number of PCI slots, the same number of disks, and are packaged in the same chassis. The difference is the type of I/O planar that is installed inside the drawer. Externally, the only visible difference between the two technologies is the shape of the cable connectors on the RIO Riser card (see Figure 2-20 on page 77). RIO connectors have a thumbscrew retention physical connector. RIO-2 connectors have a bayonet retention physical connector. 76 IBM Eserver pSeries Cluster Systems Handbook RIO Connector RIO-2 Connector Figure 2-20 Difference between RIO and RIO-2 connectors The I/O Drawer for pSeries 670, pSeries 655 and pSeries 690 has its own product number, 7040-61D, which refers to the two technologies. For ordering purposes, the difference between them is the Feature Code (F/C) of the configured I/O planar: RIO drawer uses F/C 6563, I/O Drawer PCI Planar, 10 slots, 2 Integrated Ultra3 SCSI Ports. RIO-2 drawer is configured with F/C 6571, I/O Drawer PCI-X Planar, 10 slots, 2 Integrated Ultra3 SCSI Ports. Summary of standard and optional features Table 2-17 shows the standard and optional features available for the pSeries 655 model 651. Table 2-17 p655 standard and optional features Description Quantity Type Processors 8 4 8 4 POWER4 1.1GHz 128 MB L3 Cache (F/C 5511) POWER4 1.3GHz 128 MB L3 Cache (F/C 5513) POWER4+ 1.5GHz, 128 MB Cache (F/C 5515) POWER4+ 1.7GHz, 128 MB Cache (F/C 5518) Memory 4 - 32GB DDR SDRAM (64 GB with RPQ) Internal disk bays 2 18.2GB - 146.8 GB per bay Chapter 2. Cluster 1600 hardware 77 Description Quantity Type Media bays 0 N/A PCI slots 3 64-bit hot swap PCI-X Standard ports 2 2 HMC ports RIO ports, RIO-2 ports feature-dependent Integrated SCSI adapters 1 Internal for internal disks Integrated LAN adapter ports 2 10/100 Ethernet controller with 2 ports I/O drawers Optional 1 x 7040-61D I/O drawer can be attached Redundant power supply Standard Redundant AC power within the rack Operating System Version AIX Version 5.1 with 5100-02 ML or later Physical specifications: Rack Rack 7040-W42 16 Power requirements 78 IBM Eserver pSeries Cluster Systems Handbook Width:78.5 cm (30.9 in) (with acoustic doors) 78.5 cm (30.9 in) (with slim-line doors) Depth: 179.9 cm (70.8 in) (with acoustic doors) 144.3 cm (56.8 in) (with slim-line doors) Height: 202.5 cm (79.7 in) (with acoustic doors) 202.5 cm (79.7 in) (with slim-line doors) Weight: 1652 kgs (3,635 lb maximum) (with acoustic doors)1,642 kgs (3,613 lb maximum) (with slim-line doors) Maximum is sixteen per rack Operating voltage @ 50/60 Hz: (3-phase): 200 to 240 V ac 380 to 415 V ac 480 V ac Electrical output: 31,818 watts (maximum, non-redundant line cords). Power source loading: 32.1 kVA (maximum non-redundant line cord configuration). 
Thermal output: 30,012 joules/sec (102,000 Btu/hr) (maximum configuration, non-redundant line cords) LPAR considerations The pSeries 655 can be partitioned into a maximum of four LPARs, switched or non-switched. Each logical partition operates under the control of its own copy of the operating system. PCI-X/PCI cards may be assigned to an LPAR on a slot-by-slot basis, if required. The integrated Ethernet ports can be individually assigned to an LPAR. Internal disk can be assigned to only one LPAR. To achieve the minimum resources per LPAR, additional I/O expansion for disk is required. Information on resources required to enable LPAR can be found in IBM Hardware Management Console for pSeries Installation and Operations Guide, SA38-0590. Racking considerations The IBM 7040-W42 system rack provides space for mounting system drawer components for pSeries 655 servers and 7040-61D I/O drawers. The Model W42 is a 24-inch rack that provides 42U of rack space. The 7040-W42 utilizes a 350 V DC bulk power subsystem to provide power for the components within the rack. The bulk power subsystem incorporates redundant bulk power assemblies mounted in the front and rear sections of the top 8U of the rack. Tip: The rack must have RS-422 connections from the HMC to each of the bulk power supply controllers. These cables require an asynchronous card to be present in the HMC. If the 128 port card (F/C 2944) is used, then RS-232 RS-422 connectors are also required. The racking considerations for the pSeries 655 can be quite complex, especially when planning for future expansion. When a p655 is configured using the IBM econfig tool, the rack position is specified with a feature code that designates the EIA position within the rack. The power cable is also specified, which designates either the right or left position within the EIA position. Figure 2-21 on page 80 shows the physical location code, EIA units, feature codes and filling order of the 7040-W42 rack. The eConfig tool tries to optimize your rack placements and will place them by default. The p655 should be placed by power cable feature (38xx), and the 7040-61D I/O drawers should be placed by EIA feature (46xx). The 7040-61D I/O drawer can go in any position, but should be placed nearest to the server which it is attached to. RIO and RIO2 cables should be ordered with the I/O drawer and not the p655. Cables ordered against the CEC are not validated. For more information on racking rules, see Appendix D of pSeries 655 Installation Guide, SA38-0616. Chapter 2. Cluster 1600 hardware 79 Note: On multiple rack configurations, rack-to-rack cables are not supported between the p655 CEC and the 7040-61D I/O drawer. Figure 2-21 p655 racking considerations Cluster considerations Following are considerations to take into account when ordering a p655 for attachment to a Cluster 1600. One or more of the following APARs will need to be installed on your Cluster 1600 system in order to support a p655 as a PSSP-managed server: APAR IY37884 needs to be installed on all p655 servers that will run PSSP 3.4, and on all control workstations running PSSP 3.4 that manage clusters containing p655 servers. APAR IY37885 needs to be installed on all p655 servers that will run PSSP 3.5, and on all control workstations running PSSP 3.5 that manage clusters containing p655 servers. APAR IY37884 provides support for the switchless attachment of p655 servers running PSSP 3.4 to Cluster 1600 systems managed by PSSP 3.4 or 80 IBM Eserver pSeries Cluster Systems Handbook PSSP 3.5. 
With APAR IY37884, p655 servers can be included in clusters containing an SP Switch2 switch, but may not be attached directly to the SP Switch2.

APAR IY37885 provides support for the switchless or SP Switch2 attachment of p655 servers running PSSP 3.5 to Cluster 1600 systems managed by PSSP 3.5. SP Switch2 attachment is supported via the SP Switch2 PCI-X Attachment Adapter. APAR IY37885 allows for current hardware installation, configuration, bring-up, and application enablement. Software updates are planned to be made available in the future for production support. In the interim, e-fixes are planned to be provided as they become available.

Important: Problems may be encountered with the s1term whenever Kerberos 4 keyfiles or supper file collection passwords are being passed to a p655 server in SMP (full system partition) mode which does not yet have APAR IY36001 (PSSP 3.4) or APAR IY36002 (PSSP 3.5) installed on all control workstations and/or p655 servers.

PSSP service can be installed automatically during an installation or migration using function introduced in IY42352 for PSSP 3.5 and IY41696 for PSSP 3.4. After installing APAR IY42352 or IY41696 on your control workstation, place the PSSP service in directory /spdata/sys1/install/pssplpp/PSSP-3.5 or /spdata/sys1/install/pssplpp/PSSP-3.4 and it will be installed automatically during an installation or migration. For details, refer to PSSP Installation and Migration Guide, GA22-7347.

When ordering a p655 for attachment to a Cluster 1600, the features in Table 2-18 should be considered.

Table 2-18 Cluster features for p655
7315-C02: Hardware Management Console (HMC)
7315 F/C 8120: Attachment cable, HMC-to-host, 6 meters (RS-232)
7315 F/C 8121: Attachment cable, HMC-to-host, 15 meters (RS-232)
7315 F/C 8122: Attachment cable, HMC-to-W42 rack, 6 meters (RS-422)
7315 F/C 8123: Attachment cable, HMC-to-W42 rack, 15 meters (RS-422)
7039 F/C 8398: SP Switch2 PCI-X attachment adapter
7039 F/C 6420: pSeries HPS SNI GX bus adapter card
9078-160 F/C 0010: Cluster 1600 7039-type server
9078-160 F/C 0011: Cluster 1600 7039 LPAR (switched)

Restrictions: p655 servers cannot be attached to clusters containing an SP Switch. When a pSeries HPS SNI GX bus adapter is installed, there are only two PCI-X slots available in the p655 CEC.

CSM/PSSP support statement
The following support statement details the CSM and PSSP support for the p655:
The p655 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, 5.2 ML01 or later.

Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as “managed nodes” and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3.

For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required:
Added Service for RSCT on AIX 5.1: IY42782
Added Service for RSCT on AIX 5.2: IY42783
CSM for AIX 5L added service: IY42353
CSM Support for p670 and p690 POWER4+: IY42356
CSM Support for 7026 servers, 9076 SP nodes and p650: IY42379
CSM 1.3.1 support for 9076 SP Node Features 2054/2058: IY42847
CSM Support of p655 POWER4+: IY42377

The p655-651 is supported on PSSP 3.5 with AIX 5.1 ML03 or later.
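The automatic installation of PSSP service described earlier in this section assumes that the service images have already been staged into the pssplpp directory on the control workstation. The following is a brief sketch, assuming the service is being copied from CD-ROM (/dev/cd0) into the PSSP 3.5 directory named in the text; adjust the device and directory for your source media and PSSP level:

# Copy the PSSP service images from CD into the PSSP 3.5 service directory
bffcreate -qvd /dev/cd0 -t /spdata/sys1/install/pssplpp/PSSP-3.5 all
# Rebuild the .toc file so installp can see the new images during install or migration
inutoc /spdata/sys1/install/pssplpp/PSSP-3.5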
For more information on the pSeries 655 and features, refer to the specific server documentation or to pSeries Systems Handbook 2003 Edition, SG24-5120. 2.8.5 670 Server (7040-671) The IBM 7040 pSeries 670 Model 671, as shown in Figure 2-22 on page 83, brings advanced POWER4 and POWER4+ technology to the midrange server environment. The p670 yields increased flexibility through logical partitioning, 82 IBM Eserver pSeries Cluster Systems Handbook Linux support, and a wide variety of application availability. The p670 meets the requirements of customers who need lower life cycle costs with flexibility for multiple workloads, but who do not require scalability to very large systems. The attributes of the pSeries 670 set the stage to grow as your needs change: high availability, scalability, reliability, and flexibility. Figure 2-22 p670 picture The p670 can be configured as a 4-way, 8-way, or 16-way computer system, with the CPUs packaged on two Multi-Chip Modules (four or eight POWER4 processors per MCM). The CPUs are either 1.1 GHz POWER4 processors or 1.5 GHz POWER4+ processors with 128 MB of L3 ECC cache per MCM. The memory within the system is DDR-expandable from 4 GB to 256 GB. You can have up to three 7040-61D I/O drawers per server; each I/O drawer supports 20 blind-swap PCI or PCI-X bus slots, and up to 16 Ultra3 SCSI disks that can be hot-plugged.Hardware can be split into as many as sixteen LPARs, each functioning as a “computer within a computer” with its instance of the operating system. Figure 2-23 shows the internal structure of the p670. Chapter 2. Cluster 1600 hardware 83 Figure 2-23 7040-671internal structure 7040-61D I/O drawer The I/O drawers provide internal storage and I/O connectivity to the system. Figure 2-24 on page 85 shows the rear view of an I/O drawer, with the PCI slots and riser cards that connect to the RIO ports in the I/O books. 84 IBM Eserver pSeries Cluster Systems Handbook 2 SCSI Backplanes 4 SCSI Drives each 2 SCSI Backplanes 4 SCSI Drives each 24 inch 10 PCI slots RIO Riser Card Rear View 10 PCI slots 4U RIO Riser Card Figure 2-24 p670 I/O drawer rear view Each drawer is composed of two physically symmetrical I/O planar boards that contain 10 Hot-Plug PCI slots each, and PCI adapters can be inserted in the rear of the I/O drawer. The planar boards also contain two integrated Ultra3 SCSI adapters and SCSI Enclosure Services (SES), connected to a SCSI 4-pack backplane. Tip: The pSeries 670 can be initially configured with only a half I/O drawer with 10 PCI slots and up to eight disk drives installed. The I/O drawers exist in two technologies: RIO and RIO-2. I/O drawers of both technologies offer the same number of PCI slots, the same number of disks, and are packaged in the same chassis. The difference is the type of I/O planar that is installed inside the drawer. Externally, the only visible difference between the two technologies is the shape of the cable connectors on the RIO Riser card (see Figure 2-25 on page 86): RIO connectors have a thumbscrew retention physical connector. RIO-2 connectors have a bayonet retention physical connector. Chapter 2. Cluster 1600 hardware 85 RIO Connector RIO-2 Connector Figure 2-25 Difference between RIO and RIO-2 connectors The I/O Drawer for pSeries 670, pSeries655 and pSeries 690 has its own product number: 7040-61D, which refers to the two technologies. 
For ordering purposes, the difference between them is the feature code of the configured I/O planar: RIO drawer uses the F/C 6563, I/O Drawer PCI Planar, 10 slots, 2 Integrated Ultra3 SCSI Ports. RIO-2 drawer is configured with the F/C 6571, I/O Drawer PCI-X Planar, 10 slots, 2 Integrated Ultra3 SCSI Ports. Summary of standard and optional features Table 2-19 shows the standard and optional features available for the pSeries 670 model 671. Table 2-19 Standard and optional p670 features 86 Description Quantity Type Processors 4, 8, 16 4, 8, 16 POWER4 1.1GHz,128 MB L3 Cache POWER4+ 1.5GHz 128 MB L3 Cache Memory 4 - 256 GB DDR SDRAM Internal disk bays 16 - 48 16 per 7040-61D I/O drawer Media bays 4 Located in the media drawer (SCSI adapter required) PCI slots 20 - 60 64-bit hot swap PCI or PCI-X within 7040-61D I/O drawer IBM Eserver pSeries Cluster Systems Handbook Description Quantity Type Standard ports 2 2 2-4 Serial HMC ports 2 - 4 RIO or RIO2 loops supported Integrated SCSI adapters 4 - 12 Ultra3 SCSI controllers built into the 7040-61D I/O drawer planar Integrated LAN adapter ports 0 N/A I/O drawers 1-3 7040-61D I/O drawers Redundant power supply Standard Redundant AC bulk power supplies Operating System Version AIX AIX 5L V5.1 with the 5100-02 Recommended Maintenance package (APAR IY28102) or later, or AIX 5L V5.2 or later Physical specifications Width: 79 cm (30.3 in) Depth: 149 cm (58.8 in) (Acoustic) Depth: 134 cm (52.8 in) (slimline) Height:202 cm (79.7 in) Weight: 1,184kgs (2606lbs)max cfg (acoustic) Weight: 1,170kgs (2574lbs)max cfg (slimline) Power requirements Operating voltage @ 50/60 Hz: (3-phase) 200 to 240 V AC 380 to 415 V AC 480 V AC Operating voltage @ 50/60Hz (Single-phase): 200 to 415V AC Electrical output: 15,400 watts (maximum) Power source loading: 6.7 kVA (maximum configuration) Thermal Output: 6,670 joules/sec (22,800 Btu/hr) (maximum configuration, 16-way CEC) Note: The pSeries 670 can be field upgraded with a CEC change to a pSeries 690. Chapter 2. Cluster 1600 hardware 87 Capacity on Demand (CuOD) Capacity on Demand (CuOD) is a way of planning for future growth within your server. It means that you can have extra resources available which can be utilized at short notice and with no disruption to service. Payment for the extra resources is only made when the resources are used. Capacity on Demand is available for both CPU resource and memory resource. It also has the added advantage that if one of the active CPUs fail it can be substituted automatically and without disruption for one of the inactive CPUs. This action is non-chargeable. The pSeries 670 systems can be shipped with non-activated resources (processors and/or memory), which may be purchased and activated at a certain point in time without affecting normal machine operation. The following are the methods available: Capacity Upgrade on Demand Memory Capacity Upgrade on Demand On/off Capacity on Demand Trial Capacity Upgrade on Demand For further information on capacity on demand, refer to the specific server documentation or to pSeries Systems Handbook 2003 Edition, SG24-5120 or to IBM Eserver pSeries 690 and pSeries 670 System Handbook, SG24-7040. LPAR considerations The pSeries 670 can be divided into as many as 16 logical partitions. System resources can be dedicated to each LPAR. p670 LPARs have following considerations: LPAR allocation, monitoring, and control is provided by the Hardware Management Console. Each LPAR functions under its own instance of the operating system. 
A minimum of one processor is required per LPAR. A minimum of 1 GB of system memory is required per LPAR. While it is not required, 4 GB of system memory is recommended per LPAR to ensure adequate memory is available. I/O adapters in PCI slots may be allocated to an LPAR on an individual slot basis. Integrated Ultra3 SCSI controllers located in the 7040-61D I/O drawers may be individually allocated to an LPAR. These integrated SCSI adapters each support one 4-pack Disk Backplane. 88 IBM Eserver pSeries Cluster Systems Handbook Tip: While it is not mandatory, consideration should be given to allocating one half of a 7040-61D I/O drawer for each LPAR. This would provide one 10-slot PCI or PCI-X planar, two integrated SCSI controllers, and two 4-Pack SCSI Disk Backplanes to the LPAR. This will help to ensure balanced I/O bandwidth for the LPAR configurations. Cluster considerations In this section, we discuss considerations you need to take into account when ordering a p670 for attachment to a Cluster 1600. Each pSeries 670 clustered server interfaces to the control workstation via Ethernet twisted pair connections. An Ethernet connection to the control workstations is required for each clustered server (or LPAR operating as a clustered server), as well as on the HMC controlling the p670 system. The 10/100 Mbps Ethernet PCI Adapter II (F/C 4962) should be utilized for this connection for the p670 server. Important: The establishment of a trusted network between the control workstation (CWS) and the Hardware Management Console (HMC) is recommended. Refer to the PSSP V3.4 Read This First document or to the PSSP V3.5 documentation located at: http://www.rs6000.ibm.com/resource/aix_resource/sp_books/ Restriction: The Ethernet adapters supporting the LPARs and clustered servers are allowed in slots 8, 9, 18, and 19 of the 7040-61D I/O drawer when utilizing the SP Switch2 Attachment Adapter (F/C 8397) or Switch2 PCI-X Attachment Adapter (F/C 8398). It can be installed in slots 1 through 7 or in slots 11 through 17 with the SP Switch Attachment Adapter (F/C 8396). In its basic configuration, the HMC incorporates an Ethernet connection, which is used for this connection. Four SP Switch Attachment Adapters (F/C 8396) are allowed per 7040-671 server. These adapters are located in the 7040-61D I/O drawers. A maximum of one adapter is allowed per F/C 6563 I/O planar, and two adapters per 7040-61D I/O drawer. Eight SP Switch2 attachment adapters (F/C 8397) or Switch2 PCI-X attachment adapters (F/C 8398) are allowed per 7040-671 server. These adapters are located in the 7040-61D I/O drawers. A maximum of two F/C 8397 adapters are allowed per F/C 6563 PCI I/O planar. A maximum of two F/C 8398 adapters are allowed per F/C 6571 PCI-X I/O planar. Chapter 2. Cluster 1600 hardware 89 Restriction: F/C 8397 adapters are not allowed with F/C 6571 PCI-X planars. F/C 8398 adapters are not allowed with F/C 6563 PCI planars. When ordering a p670 for attachment to a Cluster 1600, you should consider the features listed in Table 2-20. 
Table 2-20 Cluster features for p670 Feature code Description 7315-C02 Hardware Management Console (HMC) 7315 F/C 8120 Attachment cable HMC-to-host 6 meters 7315 F/C 8121 Attachment cable HMC-to-host 15 meters 7040 F/C 8396 SP Switch PCI Attachment Adapter (withdrawn 12/31/03) 7028 F/C 8398 SP Switch2 PCI-X Attachment Adapter 7018 F/C 8397 SP Switch2 PCI Attachment Adapter 9078-160 F/C 0008 Cluster model 1600 7028-type server 9078-160 F/C 0009 Cluster model 1600 7018 LPAR (switched) Note: The SP Switch (F/C 8396), and Switch2 (F/C 8397) adapters are “double wide” PCI cards and take up the space of two PCI slots. They must be mounted in the optional PCI Blind-Swap Cassette Kit for Double Wide Adapters (F/C 4598). CSM/PSSP support statement The following support statement details the CSM and PSSP support for the p670: The p670 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, 5.2 ML01, and later. 90 IBM Eserver pSeries Cluster Systems Handbook Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as “managed nodes” and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3. For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required: Added Service for RSCT on AIX 5.1 IY42782 Added Service for RSCT on AIX 5.2 IY42783 CSM for AIX 5L added service IY42353 CSM Support for p670 and p690 POWER4+ IY42356 CSM Support for 7026 servers, 9076 SP nodes and p650 IY42379 CSM 1.3.1 support for 9076 SP Node Features 2054/2058 IY42847 CSM Support of p655 POWER4+ IY42377 The p670-671 is supported on PSSP 3.5 with AIX 5.1 ML03 or later. For more information on the pSeries 670 and features, refer to the specific server documentation or to pSeries Systems Handbook 2003 Edition, SG24-5120. 2.8.6 690 server (7040-681) The p690 was the first system to offer the POWER4 microprocessor. The p690 is shown in Figure 2-26 on page 92. The POWER4 chip was the first ever dual processor SMP on a single piece of silicon. The balanced performance of the POWER4 and POWER4+ chips are equally at home with mission-critical commercial or compute-intensive applications. Chapter 2. Cluster 1600 hardware 91 Figure 2-26 The IBM Eserver690 The p690 can be configured as a 4-way, 8-way, 16-way, 24-way or 32-way computer system, with the CPUs packaged on two Multi-Chip Modules (four or eight POWER4 processors per MCM). The CPUs are either 1.1GHz, 1.3GHz POWER4 processors or 1.5GHz, 1.7GHz POWER4+ processors with 128 MB of L3 ECC cache per MCM. The memory within the system is DDR-expandable from 8 GB to 512 GB. You can have up to eight 7040-61D I/O drawers per server; each I/O drawer supports 20 blind-swap PCI or PCI-X bus slots and up to 16 hot-swap-capable disks structured into four 4-packs of 18.2 GB, 36.4 GB, 73.4 GB or 146.8 GB disks.This delivers up to 2.34 terabytes maximum per I/O Drawer, with up to 8 drawers and 18.79 terabytes of internal disk for the system. Hardware can be split into as many as thirty-two LPARs, each functioning as a “computer within a computer” with its instance of the operating system Figure 2-27 on page 93 shows the internal structure of the p690. 92 IBM Eserver pSeries Cluster Systems Handbook Figure 2-27 7040-681 pSeries 690 internal structure Complementing the power of the p690 are extraordinary self-managing capabilities. 
The system can detect faulty memory, processors, cache, and PCI buses and dynamically take them offline; it even makes the service call. It performs all of these tasks without requiring administrative action or interrupting operations. Growth is easy with the hot-swappable disk drive and hot-plug PCI and PCI-X. For the ultimate in availability, the systems can be clustered together with High Availability Cluster Multiprocessing software, the leader in UNIX availability solutions. 7040-61D I/O drawer The I/O drawers provide internal storage and I/O connectivity to the system. Figure 2-28 on page 94 shows the rear view of an I/O drawer, with the PCI slots and riser cards that connect to the RIO ports in the I/O books. Up to eight drawers can be connected to a pSeries 690. Note: When utilizing RIO attachment, the following number of I/O drawers are allowed per MCM: One MCM allows attachment of up to two I/O drawers. Two MCMs allow attachment of up to four I/O drawers. Three MCMs allow attachment of up to six I/O drawers. Four MCMs allow attachment of up to eight I/O drawers. Chapter 2. Cluster 1600 hardware 93 2 SCSI Backplanes 4 SCSI Drives each 2 SCSI Backplanes 4 SCSI Drives each 24 inch 10 PCI slots RIO Riser Card Rear View 10 PCI slots 4U RIO Riser Card Figure 2-28 p690 I/O drawer rear view Each drawer is composed of two physically symmetrical I/O planar boards that contain 10 Hot-Plug PCI slots each, and PCI adapters can be inserted in the rear of the I/O drawer. The planar boards also contain two integrated Ultra3 SCSI adapters and SCSI Enclosure Services (SES), connected to a SCSI 4-pack Backplane. The I/O drawers exist in two technologies: RIO and RIO-2. I/O drawers of both technologies offer the same number of PCI slots, the same number of disks, and are packaged in the same chassis. The difference is the type of I/O planar that is installed inside the drawer. Externally, the only visible difference between the two technologies is the shape of the cable connectors on the RIO Riser card (see Figure 2-29 on page 95): RIO connectors have a thumbscrew retention physical connector. RIO-2 connectors have a bayonet retention physical connector. 94 IBM Eserver pSeries Cluster Systems Handbook RIO Connector RIO-2 Connector Figure 2-29 Difference between RIO and RIO-2 connectors The I/O drawer for pSeries 670, pSeries655, and pSeries 690 has its own product number, 7040-61D, which refers to the two technologies. For ordering purposes, the difference between them is the feature code of the configured I/O planar: RIO drawer uses F/C 6563, I/O Drawer PCI Planar, 10 slots, 2 Integrated Ultra3 SCSI Ports. RIO-2 drawer is configured with F/C 6571, I/O Drawer PCI-X Planar, 10 slots, 2 Integrated Ultra3 SCSI Ports. Summary of standard and optional features Table 2-21 shows the standard and optional features available for the pSeries 690 model 7040-681. Table 2-21 Standard and optional p690 features Description Quantity Type Processors 8, 16, 24, 32 8, 16, 24, 32 8, 16, 24, 32 8, 16, 24, 32 POWER4 1.1GHz,128 MB L3 Cache POWER4 1.3GHz 128 MB L3 Cache POWER4+ 1.5GHz 128 MB L3 Cache POWER4+ 1.7GHz 128MB L3 Cache Memory 8 - 512GB DDR SDRAM Internal disk bays 16 - 128 16 per 7040-61D I/O drawer Media bays 4 Located in the media drawer (SCSI adapter required) Chapter 2. 
PCI slots - 20 - 160 - 64-bit hot-swap PCI or PCI-X, within the 7040-61D I/O drawers
Standard ports - 2 serial; 2 HMC; 2 - 14 loops - Serial ports, HMC ports, RIO or RIO-2 loops supported
Integrated SCSI adapters - 4 - 32 - Ultra3 SCSI controllers built into the 7040-61D I/O drawer planars
Integrated LAN adapter ports - 0 - None
I/O drawers - 1 - 8 - 7040-61D I/O drawers
Redundant power supply - Standard - Redundant AC bulk power supplies
Operating system version - AIX - AIX 5L V5.1 with the 5100-01 Recommended Maintenance package (APAR IY21957) or later, or AIX 5L V5.2 or later
Physical specifications - - Width: 154 cm (60.6 in); Depth: 149 cm (58.8 in) (acoustic) or 134 cm (52.8 in) (slimline); Height: 202 cm (79.7 in); Weight: 1,850 kg (2,606 lb) maximum configuration (acoustic) or 1,823 kg (2,574 lb) maximum configuration (slimline)
Power requirements - - Operating voltage @ 50/60 Hz (3-phase): 200 to 240 V AC, 380 to 415 V AC, or 480 V AC; Electrical output: 15,400 watts (maximum); Power source loading: 15.0 kVA (maximum configuration)

Capacity on Demand (CUoD)
The pSeries 690 systems can be shipped with non-activated resources (processors and/or memory), which may be purchased and activated at a later point in time without affecting normal machine operation. The following methods are available:
- Capacity Upgrade on Demand
- Memory Capacity Upgrade on Demand
- On/Off Capacity on Demand
- Trial Capacity Upgrade on Demand

For further information on Capacity on Demand, refer to the specific server documentation, to pSeries Systems Handbook 2003 Edition, SG24-5120, or to IBM Eserver pSeries 690 and pSeries 670 System Handbook, SG24-7040.

LPAR considerations
The pSeries 690 can be divided into as many as 32 logical partitions, and system resources can be dedicated to each LPAR. p690 LPARs have the following considerations:
- LPAR allocation, monitoring, and control is provided by the Hardware Management Console.
- Each LPAR functions under its own instance of the operating system.
- A minimum of one processor is required per LPAR.
- A minimum of 1 GB of system memory is required per LPAR. While it is not required, 4 GB of system memory per LPAR is recommended to ensure adequate memory is available.
- I/O adapters in PCI slots may be allocated to an LPAR on an individual slot basis.
- Integrated Ultra3 SCSI controllers located in the 7040-61D I/O drawers may be individually allocated to an LPAR. These integrated SCSI adapters each support one 4-pack disk backplane.

Note: The following restrictions apply:
- Servers configured with the support processor with remote I/O loop attachment (F/C 6404) can support a maximum of 16 LPARs.
- Servers configured with the support processor with remote I/O-2 (RIO-2) loop attachment (F/C 6418) can support a maximum of 32 LPARs.

Tip: While not mandatory, consideration should be given to allocating one half of a 7040-61D I/O drawer to each LPAR. This provides one 10-slot PCI or PCI-X planar, two integrated SCSI controllers, and two 4-pack SCSI disk backplanes to the LPAR, and it helps to ensure balanced I/O bandwidth across the LPAR configurations.

Cluster considerations
In this section, we discuss considerations you should take into account when ordering a p690 for attachment to a Cluster 1600. Each p690 clustered server interfaces to the control workstation via Ethernet twisted-pair connections.
An Ethernet connection to the control workstation is required for each clustered server (or LPAR operating as a clustered server), as well as on the HMC controlling the p690 system. The 10/100 Mbps Ethernet PCI Adapter II (F/C 4962) should be used for this connection on the p690 server.

Note: The establishment of a trusted network between the control workstation (CWS) and the Hardware Management Console (HMC) is recommended. Refer to the PSSP V3.4 Read This First document or to the PSSP V3.5 documentation at the following site:
http://www.rs6000.ibm.com/resource/aix_resource/sp_books/

Restriction: The Ethernet adapters supporting the LPARs and clustered servers are allowed in slots 8, 9, 18, and 19 of the 7040-61D I/O drawer when utilizing the SP Switch2 Attachment Adapter (F/C 8397) or the SP Switch2 PCI-X Attachment Adapter (F/C 8398). They can be installed in slots 1 through 7, or in slots 11 through 17, with the SP Switch Attachment Adapter (F/C 8396).

Eight SP Switch Attachment Adapters (F/C 8396) are allowed per 7040-681 server. These adapters are located in the 7040-61D I/O drawers. A maximum of one adapter is allowed per (F/C 6563) I/O planar, and two adapters per 7040-61D I/O drawer.

Thirty-two SP Switch2 Attachment Adapters (F/C 8397) or SP Switch2 PCI-X Attachment Adapters (F/C 8398) are allowed per 7040-681 server. These adapters are located in the 7040-61D I/O drawers. A maximum of two F/C 8397 adapters is allowed per (F/C 6563) PCI I/O planar, and a maximum of two F/C 8398 adapters is allowed per (F/C 6571) PCI-X I/O planar.

Restriction: Feature (F/C 8397) adapters are not allowed with (F/C 6571) PCI-X planars. Feature (F/C 8398) adapters are not allowed with (F/C 6563) PCI planars.

When ordering a p690 for attachment to a Cluster 1600, the features listed in Table 2-22 should be considered.

Table 2-22 Cluster features for p690

Feature code - Description
7315-C02 - Hardware Management Console (HMC)
7315 F/C 8120 - Attachment cable, HMC-to-host, 6 meters
7315 F/C 8121 - Attachment cable, HMC-to-host, 15 meters
7040 F/C 8396 - SP Switch PCI Attachment Adapter (withdrawn 12/31/03)
7028 F/C 8398 - SP Switch2 PCI-X Attachment Adapter
7018 F/C 8397 - SP Switch2 PCI Attachment Adapter
7040 F/C 6432 - pSeries HPS SNI 2-link attachment book
7040 F/C 6434 - pSeries HPS SNI 4-link attachment book
9078-160 F/C 0008 - Cluster 1600 7028-type server
9078-160 F/C 0009 - Cluster 1600 7018 LPAR (switched)

Note: The SP Switch (F/C 8396) and SP Switch2 (F/C 8397) adapters are "double-wide" PCI cards and take up the space of two PCI slots. They must be mounted in the optional PCI Blind-Swap Cassette Kit for Double Wide Adapters (F/C 4598).

CSM/PSSP support statement
The following support statement details the CSM and PSSP support for the p690:

The p690 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, 5.2 ML01, and later.

Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as "managed nodes" and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3.
For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required:
- Added service for RSCT on AIX 5.1: IY42782
- Added service for RSCT on AIX 5.2: IY42783
- CSM for AIX 5L added service: IY42353
- CSM support for p670 and p690 POWER4+: IY42356
- CSM support for 7026 servers, 9076 SP nodes, and p650: IY42379
- CSM 1.3.1 support for 9076 SP Node features 2054/2058: IY42847
- CSM support for p655 POWER4+: IY42377

The p690 (7040-681) is supported on PSSP 3.5 with AIX 5.1 ML03 and later.

For more information on the pSeries 690 and its features, refer to the specific server documentation, to pSeries Systems Handbook 2003 Edition, SG24-5120, or to Configuring a p690 in an IBM Cluster 1600, REDP0187.

2.8.7 xSeries servers
With CSM for AIX-managed clusters, selected xSeries nodes running Linux can be included in the cluster. These servers are not available as part of the Cluster 1600, although they can be attached to a CSM for AIX-managed cluster. For more information on xSeries servers, see:
http://www.ibm.com/servers/eserver/education/xseries/xref.html

2.9 Switches
The SP switches provide a message-passing network that connects all processor nodes with a minimum of four paths between any pair of nodes. The switches provide a high-bandwidth, low-latency network that can use either the TCP/IP or the MPI protocol to pass messages.

Currently, there are three types of switches available for the Cluster 1600: two are supported in clusters managed by PSSP, and one is supported in clusters managed by CSM.

The two switch types for PSSP-managed clusters are the SP Switch and the SP Switch2. There are two models available for each switch type: a 23-inch rack version (SP frame) and a 19-inch rack version (7014 pSeries rack).

Attention: The SP Switch and all associated features will be withdrawn from marketing on December 31, 2003.

For more information on the PSSP switches, refer to RS/6000 SP and Clustered IBM eServer pSeries System Handbook, SG24-5596.

The switch that is supported in CSM-managed clusters is the 7045-SW4 pSeries HPS. There is one model of this switch available, and it is currently supported on p690 and p655 nodes. It is packaged in a 23-inch form and can be mounted in either a 7040-61R or a 7040-W42 rack.

The frame model types and descriptions for the switches are described in the following sections.

2.9.1 9076 model 555
The SP Switch provides high-bandwidth, low-latency communication between nodes, supplying a minimum of four paths between any pair of nodes. The SP Switch can be used in conjunction with the SP Switch Router to dramatically increase the speed of TCP/IP, file transfers, remote procedure calls, and relational database functions.

The required SP Switch Adapter (F/C 4020), SP Switch MX Adapter (F/C 4022), or SP Switch MX2 Adapter (F/C 4023) connects each SP node to the SP Switch subsystem. One adapter of the required type must be ordered for each node in a switch-configured SP system. If you are using switch expansion frames, the SP Switch subsystem allows you to scale your SP system up to 128 nodes.

When you order F/C 4011, you receive one 16-port SP Switch and all of the switch-to-node cables you need to connect the switch ports to up to sixteen nodes, both within the switch-equipped frame and in any non-switched expansion frames. You must specify the length of all switch-to-switch cables that make the connections between switch-equipped frames and to an SP Switch frame.
The IBM 9076 Tall 24-inch Frame Model 555 requires an SP Switch and a minimum of two clustered RS/6000 or pSeries servers, along with a control workstation, to create a functional system. An additional SP Switch is supported in the Model 555 expansion frame feature 1555.

The 9076 Model 555 has a three-phase power system that includes N+1 power capability. This frame accepts power input of 200-240 V 50/60 Hz 3-phase at 50 A, or 380-415 V 50/60 Hz 3-phase at 30 A. Frame-redundant power input features are available.

SP nodes are supported on the SP Switch ports within the Model 555. SP Switches can connect to other SP Switches in other 9076 models (for example, switch-to-switch connections between the 9076 Model 555 and the Model 550 or Model 557 are supported); special attention to cluster system planning for model-to-model cabling is required.

Important: The SP Switch and all associated features will be withdrawn from marketing on December 31, 2003.

2.9.2 9076 model 556
The SP Switch2 is the next step in the evolution of the SP interconnection fabric and offers enhanced performance and RAS (Reliability, Availability, and Serviceability) over the SP Switch. The performance gains are significant increases in bandwidth and reductions in latency.

The SP Switch2 provides a one-way bandwidth of up to 500 MB per second between nodes (1 GB per second bidirectional) for interconnecting POWER3 SMP high nodes. Because the SP Switch2 design improvements are evolutionary steps on the SP and High Performance switches, the SP Switch2 is fully compatible with applications written for the older switches. The SP Switch2 software support continues to emphasize usage of the Message Passing Interface (MPI), the Low-level Application Programming Interface (LAPI), and Internet Protocol (IP), but the new switch subsystem enhances message handling so that the switch can keep pace with the increased performance of the newer nodes.

The IBM 9076 Tall Frame Model 556 requires an SP Switch2 and a minimum of two clustered RS/6000 or pSeries servers, along with a control workstation, to create a functional system. Additional SP Switch2s and features are supported for each Model 556, to scale the frame up to a maximum of eight SP Switch2s. Both single-plane and two-plane SP Switch2 environments are supported within the server scaling limits.

The 9076 Model 556 has a three-phase power system that includes N+1 power capability. This frame accepts power input of 200-240 V 50/60 Hz 3-phase at 50 A, or 380-415 V 50/60 Hz 3-phase at 30 A. Frame-redundant power input features are available.

SP nodes are supported on the SP Switch2 ports within the Model 556. SP Switch2s can connect to other SP Switch2s in other 9076 models (for example, switch-to-switch connections between the 9076 Model 556 and the Model 550 or Model 558 are supported); special attention to cluster system planning for model-to-model cabling is required.

2.9.3 9076 model 557
The IBM 9076 Model 557 requires an SP Switch and a minimum of two clustered RS/6000 or pSeries servers, along with a control workstation, to create a functional system. The Model 557 is used for mounting the SP Switch and its power subsystem in IBM 19-inch racks, enhancing the available switch building blocks of the Cluster 1600. This model is field-installed into a new or existing 7014-T00 or 7014-T42 rack, requiring 16 EIA positions. The 9076-557 can have up to one SP Switch. The Model 557 requires two of the 7014 PDUs (power distribution units) for dual power input.
The two power cords for the Model 557 plug into one receptacle on each of the PDUs. SP nodes are supported on the SP Switch ports within the Model 557. The SP Switch within the Model 557 can connect to other SP Switches in other 9076 models (for example, switch-to-switch connections between the 9076 Model 557 and the Model 550 or Model 555 are supported); special attention to cluster system planning for model-to-model cabling is required.

Attention: The SP Switch and all associated features will be withdrawn from marketing on December 31, 2003.

2.9.4 9076 model 558
The IBM 9076 Model 558 requires an SP Switch2 and a minimum of two clustered RS/6000 or pSeries servers, along with a control workstation, to create a functional system. The Model 558 is used for mounting the SP Switch2 and its power subsystem in IBM 19-inch racks, enhancing the available switch building blocks of the Cluster 1600. This model is field-installed into a new or existing 7014-T00 or 7014-T42 rack, requiring 16 EIA positions. The 9076-558 can have up to two SP Switch2s. The Model 558 requires two of the 7014 PDUs (power distribution units) for dual power input. The two power cords for the Model 558 plug into one receptacle on each of the PDUs.

Both single-plane and two-plane SP Switch2 environments are supported within the server scaling limits. The SP Switch2s within the Model 558 can connect to other SP Switch2s in other 9076 models (for example, switch-to-switch connections between the 9076 Model 558 and the Model 550 or Model 556 are supported); special attention to cluster system planning for model-to-model cabling is required.

2.9.5 7045-SW4 pSeries HPS (High Performance Switch)
The pSeries High Performance Switch (HPS) communication subsystem is an evolutionary step in network data transfers, using technology based on the proven architecture of the SP Switch and SP Switch2. The technology driving the pSeries HPS is designed to augment the latest offerings of pSeries 690 and 655 clustered servers by increasing the communication bandwidth between servers and partitions within the cluster. The benefits of implementing this new communication subsystem include many enhancements over previous SP switch offerings, including:
- Parallel, interconnected communication channels that form a unified switch network
- Significantly improved communication bandwidth and reductions in latency
- The option to use either fiber optic or copper cables for switch-to-switch network connections
- Improved reliability, availability, and serviceability (RAS)

Restriction: The pSeries HPS is the first offering in a new generation of switch technology. As such, it does not support migration from, nor coexistence with, previous versions of SP switch networks. It is only supported on CSM-managed clusters and is not supported with PSSP.

Figure 2-30 HPS switch internal structure

The pSeries HPS (M/T 7045-SW4) is packaged in a 4U-high, 23-inch chassis with optional slot-in switch port connector books; Figure 2-30 shows the internal structure of the HPS switch. These books can be copper, optical, or blanks. Server connections are copper, and inter-switch connections can be either copper or fibre (optical). Each book has two links. The server attachment books can be located in slots C3, C4, C7, C8, C11, C12, C15, and C16, as shown in Figure 2-31 on page 105; inter-switch link books can be located in the remaining eight slots. The switch supports 16 server connections and 16 inter-switch links.
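To make the book layout easier to picture, the following minimal Python sketch (an illustration for this discussion, not an IBM tool) tallies the link capacity implied by the slot assignments just described. The assumption that the sixteen book positions are numbered C1 through C16, with the eight non-server slots taking the inter-switch books, is ours.

```python
# Illustrative sketch only: tallies pSeries HPS (7045-SW4) link capacity
# from the book slot layout described above. Slot names follow the text;
# treating the book positions as C1-C16 is an assumption for illustration.

SERVER_BOOK_SLOTS = ["C3", "C4", "C7", "C8", "C11", "C12", "C15", "C16"]
ALL_BOOK_SLOTS = [f"C{n}" for n in range(1, 17)]          # assumed numbering
ISB_BOOK_SLOTS = [s for s in ALL_BOOK_SLOTS if s not in SERVER_BOOK_SLOTS]

LINKS_PER_BOOK = 2  # each switch port connector book carries two links

server_links = len(SERVER_BOOK_SLOTS) * LINKS_PER_BOOK      # 16 server links
inter_switch_links = len(ISB_BOOK_SLOTS) * LINKS_PER_BOOK   # 16 inter-switch links

print(f"Server attachment slots : {SERVER_BOOK_SLOTS}")
print(f"Inter-switch link slots : {ISB_BOOK_SLOTS}")
print(f"Maximum server links    : {server_links}")
print(f"Maximum inter-switch links: {inter_switch_links}")
```

Running the sketch simply confirms the figures quoted above: 16 server connections and 16 inter-switch links per 7045-SW4.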
The switch can be mounted in either a 7040-61R or a 7040-W42 rack. A p690 or p655 rack can only support one 7045-SW4 switch; additional switches must be mounted in a separate 7040-W42 rack.

Figure 2-31 pSeries HPS book locations

The pSeries HPS switch is supported on 7040-681 and 7039 servers only. It is connected to the servers using either 3-meter or 10-meter copper cables. The server interface is either a 2-link or a 4-link switch network interface (SNI) card, which plugs directly onto the GX bus of the server. The switch network interface card can be shared by multiple LPARs.

Both links within the SNI card allow message passing across direct internal connections. This means that both links can be on the same unified single switch network, which increases the resilience over the previous SP Switch2 technology: if one link fails, traffic can be directed over the other link seamlessly. Each switch link can be shared across 16 LPARs, and a single LPAR can also be connected to up to 8 SNI cards for increased bandwidth.

Tip: pSeries HPS switched nodes can be intermixed with non-switched nodes.

As with the SP switch technology, switches can be server switch boards (SSBs) or inter-switch boards (ISBs). Server switch boards are connected directly to servers, and inter-switch boards are only connected to other switches. You can have up to three SSBs in a switch network before you require ISBs.

The switch is not required to be controlled by a specific node; instead, it is controlled by a program running on the HMC called the Switch Network Manager (SNM). This program is part of the HMC software build, and it controls and monitors the switch network. The human interface to this program is integrated into the HMC Web-based System Management GUI, which allows you to configure, power on and off, run diagnostics on, and check the status of the HPS switch.

As prerequisites, the Cluster 1600 must have the following software installed:
- AIX 5.2 at the latest Recommended Maintenance Level
- CSS file sets for AIX
- Switch Network Manager (SNM) on the HMC
- Service Focal Point (SFP) on the HMC
- Service Agent (the service gateway between IBM and the customer)
- Inventory Scout
- Cluster Systems Management (CSM)

The hardware connection to the HMC consists of two RS-422 connections between the HMC and each of the 7040 racks that contain servers attached to the pSeries HPS. The RS-422 cable plugs into the bulk power controller (BPC) at the top of the 7040 rack. The switch cable is the only direct connection between the node and the HPS switch.

Note: Four RS-422 cables are required per rack if a redundant HMC is being used.

For more information on HPS switch cabling, refer to Chapter 3, “Network configuration” on page 121.

Summary of standard and optional features
Table 2-23 lists the standard and optional features available for the 7045-SW4 pSeries HPS.
Table 2-23 Standard and optional 7045-SW4 HPS switch features

Description - Feature code
pSeries HPS ISB indicator - F/C 9049
pSeries HPS SSB indicator - F/C 9047
RS-422 cable, 15 m (HMC-to-frame) - F/C 8123
RS-422 cable, 6 m (HMC-to-frame) - F/C 8122
Dual SNI link-to-single SNI link convertor - F/C 6437
Switch port connection card, optical - F/C 6436
Switch port connection card, blank - F/C 6435
Switch network interface, 4-link (book) (p690) - F/C 7040-6434
Switch port connection card, copper - F/C 6433
Switch network interface, 2-link (book) (p690) - F/C 7040-6432
Switch network interface, 2-link (GX bus-mounted card) (p655) - F/C 7039-6420
Frame extender - F/C 6234
Network diagnostic tool kit - F/C 3756
Switch cable, 40 m (fiber optic) - F/C 3257
Switch cable, 20 m (fiber optic) - F/C 3256
Switch cable, 10 m (copper) - F/C 3167
Switch cable, 2.2 m (copper) - F/C 3166

Racking considerations
The 7040-W42 or 7040-61R system racks provide space for mounting the pSeries HPS. The Model W42 is a 24-inch rack that provides 42U of rack space. The 7040-W42 utilizes a 350 V DC bulk power subsystem to provide power for the components within the rack. The bulk power subsystem incorporates redundant bulk power assemblies mounted in the front and rear sections of the top 8U of the rack. Note the following:
- The 7040-61R rack can contain one pSeries HPS with a p690 CEC (681 only).
- The 7040-W42 rack can contain one pSeries HPS with p655 CECs.
- The 7040-W42 rack with no p655 CECs can contain up to 8 pSeries HPSs, and up to 16 pSeries HPSs with the expansion frame (F/C 8691).

Tip: The rack must have RS-422 connections from the HMC to each of the bulk power supply controllers. These cables require an async card to be present in the HMC. If the 128-port card (F/C 2944) is used, RS-232 to RS-422 converters are also required.

CSM/PSSP support statement
The following support statement details the CSM and PSSP support for the 7045-SW4 pSeries HPS:

The 7045-SW4 pSeries HPS is supported on CSM V1.3.2 with AIX 5.2 ML02 and later.

Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-02. The other machines within the cluster are referred to as "managed nodes" and can be running AIX 5L V5.2 with 5200-02. The following service and components are also required:
- Added service for RSCT on AIX 5.2: IY42783
- CSM 1.3.2 for AIX 5L
- Service Focal Point
- Service Agent
- Inventory Scout

The 7045-SW4 HPS switch is not supported on PSSP 3.5.

2.10 Switch adapters
There are currently four types of switch adapter available: MX, two PCI types, and PCI-X. The MX adapters are only available for SP nodes that have an MX slot; the PCI and PCI-X adapters are available for all attached servers. The available switch adapters are:
- PCI SP Switch adapter (F/C 8396) (withdrawn from marketing 12/31/2003)
- PCI SP Switch2 adapter (F/C 8397)
- PCI-X SP Switch2 adapter (F/C 8398)
- HPS switch network interface card (F/C 6434, 6432, 6420)

Restrictions:
- The PCI SP Switch Attachment Adapter (F/C 8396) uses 5-volt switching and is only supported in 5-volt PCI slots. No PCI-X planars support the PCI SP Switch Attachment Adapter.
- The SP Switch Attachment Adapter (F/C 8396) and the PCI-X SP Switch2 Attachment Adapter (F/C 8398) cannot be hot-plugged.
- The PCI SP Switch2 Attachment Adapter (F/C 8397) is a double-wide adapter.

For each adapter, you must also order one of the following cables (depending on your floor plan layout) to connect the adapter to a valid switch port:
- 2.6 m (9 ft.) switch cable (F/C 9302)
- 5 m (16 ft.) switch cable (F/C 9305)
- 10 m (33 ft.) switch cable (F/C 9310)
- 15 m (49 ft.) switch cable (F/C 9315)
- 20 m (66 ft.) switch cable (F/C 9320)

2.10.1 Switch adapter placement restrictions
The following are placement limitations for the SP Switch and SP Switch2 Attachment Adapters within the current server range.

Note: Each instance of an SP Switch (F/C 8396), SP Switch2 (F/C 8397), or SP Switch2 PCI-X (F/C 8398) adapter should be counted as two adapters when calculating the maximum adapter limitation.

M/T 7040 models
Placement considerations for the SP Switch and SP Switch2 Attachment Adapters for machine type 7040:
- SP Switch Attachment Adapter (F/C 8396): install in I/O subsystem slot 8 only (one adapter per LPAR). This adapter is withdrawn from marketing on December 31, 2003.
- SP Switch2 Attachment Adapter (F/C 8397):
  - Single-plane: install in I/O subsystem slot 3 or 5, or both if on separate LPARs (one adapter per LPAR).
  - Two-plane: install in I/O subsystem slot 3 for css0 and slot 5 for css1 (one adapter per LPAR).

Note: The SP Switch Attachment Adapter (F/C 8396) and the SP Switch2 Attachment Adapters (F/C 8397 and F/C 8398) are mutually exclusive on a pSeries 690 system.

M/T 7039 models
Placement considerations for the SP Switch and SP Switch2 Attachment Adapters for machine type 7039:
- SP Switch2 PCI-X Attachment Adapter (F/C 8398):
  - Single-plane: install in server slot 1 or 3, or both if on separate LPARs (one adapter per LPAR, two maximum per server).
  - Two-plane: install in server slot 1 for css0 and server slot 3 for css1.

Note: M/T 7039 LPAR and multi-plane switch restrictions:
- If the servers are going to be configured with a two-plane switch fabric, they cannot be configured with LPARs. If the servers are configured with LPARs, the system is restricted to single-plane switch configurations.
- Servers configured with two LPARs require one adapter for each LPAR connected to the switch. However, the 7039 can be configured with one LPAR attached to the switch and the other LPAR off the switch.
- Additional switch adapters (F/C 8396, F/C 8397, or F/C 8398) are not permitted in RIO drawers attached to these servers.

M/T 7028 models
Placement considerations for the SP Switch and SP Switch2 Attachment Adapters for machine type 7028:
- The adapter must be installed in slot 1 of the server.
- Server slot 2 must be left empty.

2.10.2 pSeries HPS switch network interface cards (SNI)
The SNI card is either a 2-link or a 4-link card that plugs directly onto the GX bus of the server. The switch network interface card can be shared by multiple LPARs. Both links within the SNI card allow message passing across direct internal connections. This means that both links can be on the same unified single switch network, which increases the resilience over the previous SP Switch2 technology: if one link fails, traffic can be directed over the other link seamlessly. Each switch link can be shared across 16 LPARs, and a single LPAR can also be connected to up to 8 links for increased bandwidth.

Type 7040 servers
Type 7040 servers can support up to two 2-link or 4-link SNI books (F/C 6434, F/C 6432).

Restriction: The number of links supported depends on the number of MCMs installed in the p690:
- 1 MCM supports 2 links.
- 2 MCMs support 4 links.
- 3 MCMs support 6 links.
- 4 MCMs support 8 links.
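As a quick sanity check of the restriction above, the following Python sketch (illustrative only, not an IBM configurator) compares a planned p690 SNI book configuration against the 2-links-per-MCM limit. The function name and the example values are invented for the illustration.

```python
# Illustrative sketch only: checks a planned p690 (M/T 7040) pSeries HPS SNI
# book configuration against the limits described above:
#   - at most two SNI books, each either 2-link (F/C 6432) or 4-link (F/C 6434)
#   - the number of usable links is limited to 2 per installed MCM

def check_p690_sni_plan(mcm_count, book_links):
    if not 1 <= mcm_count <= 4:
        raise ValueError("a p690 has 1 to 4 MCMs")
    if len(book_links) > 2 or any(links not in (2, 4) for links in book_links):
        raise ValueError("up to two SNI books, each with 2 or 4 links")

    planned_links = sum(book_links)
    supported_links = 2 * mcm_count  # restriction quoted above
    if planned_links > supported_links:
        print(f"{planned_links} links planned, but only {supported_links} "
              f"supported with {mcm_count} MCM(s): add MCMs or reduce books")
    else:
        print(f"OK: {planned_links} links within the {supported_links}-link limit")

# Example: a 2-MCM p690 with one 4-link book (F/C 6434)
check_p690_sni_plan(mcm_count=2, book_links=[4])
```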
The SNI cards (F/C 6434, F/C 6432) plug into the GX bus in the p690 CEC. There are two slots available for the SNI books. Figure 2-32 on page 111 shows the SNI GX bus slots on the p690 CEC.

Figure 2-32 p690 SNI GX bus slots location

Note: The two links on each SNI card are paired. A single-link convertor wrap is required to use them as a single link.

Type 7039 servers
Type 7039 servers can support one 2-link card (F/C 6420). The GX card plugs into slot 1 in the p655 CEC; the SNI card plugs in in place of the first PCI-X card.

Restriction: When an SNI card is installed in the p655, only two PCI-X slots remain available for PCI-X cards within the CEC.

Figure 2-33 on page 112 shows the SNI location on a p655 server.

Figure 2-33 p655 SNI GX bus slot location

Note: The two links on each SNI card are paired. A single-link convertor wrap is required to use them as a single link.

2.11 Legacy hardware supported but no longer marketed
This section only details changes since the IBM Redbook RS/6000 SP and Clustered IBM Eserver pSeries System Handbook was published. For further details about legacy hardware, refer to RS/6000 SP and Clustered IBM Eserver pSeries System Handbook, SG24-5596.

2.11.1 pSeries 660 Model 6M1 (7026-6M1)
The pSeries 660 Model 6M1 is a mid-range member of the 64-bit family of symmetric multiprocessing (SMP) enterprise servers from IBM. Positioned between the Model 6H1 and the powerful Model S85, the Model 6M1 provides the power, capacity, and expandability required for e-business and mission-critical computing. The Model 6M1 can assist you in managing the evolution of your business to incorporate the power of the Web and 64-bit computing into your environment, while still supporting existing 32-bit applications.

Figure 2-34 shows the 7026-6M1 and its primary I/O drawer. The p660 Model 6M1 is the follow-on server to the RS/6000 7026-M80.

Figure 2-34 The 7026-6M1 and I/O drawer

The Model 6M1 delivers 64-bit scalability via the 64-bit RS64 IV processor, packaged as 2-way and 4-way cards. With its two processor card positions, the Model 6M1 can be configured into 2-, 4-, 6-, or 8-way SMP configurations. The Model 6M1 also incorporates an I/O subsystem supporting 32-bit and 64-bit standard PCI adapters. The Model 6M1 has 2-way and 4-way processor cards that operate at 750 MHz and incorporate 8 MB of L2 cache per processor.

Figure 2-35 p660-6M1 internal structure

The Model 6M1 is packaged as a rack-mounted Central Electronics Complex (CEC) drawer, cable-attached to rack-mounted Remote I/O (RIO) drawers. The CEC and I/O drawers offer redundant power and redundant cooling. The CEC drawer incorporates the system processors, memory, and supporting system logic. The primary I/O drawer contains the PCI cards, boot disks, and media bays. The internal structure is shown in Figure 2-36 on page 115. Up to three secondary I/O drawers can be added, for a total of up to 56 PCI slots and eight media bays. System storage is added via remotely attached SCSI, SSA, or Fibre Channel storage subsystems.

Figure 2-36 p660 I/O drawer internal structure

Summary of standard and optional features
Table 2-24 lists the standard and optional features available for the pSeries 660 Model 6M1.
Table 2-24 p660-6M1 standard and optional features

Description - Quantity - Type
Processors - 2, 4, 6, or 8, in 2-way or 4-way cards - 750 MHz RS64 IV with 8 MB L2 cache per processor, or 500 MHz RS64 III with 4 MB L2 cache per processor
Memory - 2 - 64 GB - DIMMs
Internal disk bays - 2 - Hot-swap, 18.2 GB - 146.8 GB per bay
Media bays - 3 - In I/O drawer
PCI slots - 10; 4 - 64-bit hot-swap PCI; 32-bit hot-swap PCI
Standard ports - 1 keyboard; 1 mouse; 4 serial; 1 parallel; 4 RIO ports
Integrated SCSI adapters - 2 - Ultra3 SCSI controllers (1 internal and 1 external VHDCI AMP half-pitch 68-pin)
Integrated LAN adapter ports - 1 - 10/100 Ethernet controller
I/O drawers - 1 - 4
Redundant power supply - Standard - Redundant AC power
Operating system version - AIX - Version 4.3.3
Physical specifications - - Central Electronics Complex: Width 445 mm (17.5 in); Depth 826 mm (32.5 in); Height 356 mm (14.0 in); Weight 59.7 kg (132 lb) (minimum configuration). I/O drawer: Width 445 mm (17.5 in); Depth 820 mm (32.3 in); Height 218 mm (8.6 in); Weight 41 kg (90 lb) (minimum configuration) to 52 kg (115 lb) (maximum configuration)
Rack 7014-T00 - 2 - Maximum is two per rack; racks may be shared with peripherals
Rack 7014-T42 - 3 - Maximum is three per rack; racks may be shared with peripherals
Power requirements - - Central Electronics Complex: Operating voltage 200 to 240 V AC, 50/60 Hz; Electrical output 370 watts (typical), 550 watts (maximum); Power source loading 0.39 kVA (typical configuration), 0.60 kVA (maximum configuration); Thermal output 370 joules/sec (1,265 Btu/hr, typical configuration), 550 joules/sec (1,877 Btu/hr, maximum configuration). I/O drawer: Operating voltage 200 to 240 V AC, 50/60 Hz; Electrical output 220 watts (typical), 515 watts (maximum); Power source loading 0.23 kVA (typical configuration), 0.54 kVA (maximum configuration); Thermal output 220 joules/sec (750 Btu/hr, typical configuration), 511 joules/sec (1,750 Btu/hr, maximum configuration)

Cluster considerations
The Cluster 1600 is enhanced to include the 7026-6M1 server within the available cluster building blocks:
- SP Switch2-attached servers utilize one or two SP Switch2 Attachment Adapters (F/C 8397) for high-speed interconnection to the SP Switch2 mounted in a 9076 frame. This environment does not require a 9076 SP node in the configuration.
- SP Switch-attached servers utilize the SP Switch Attachment Adapter (F/C 8396) for high-speed interconnection to the SP Switch mounted in a 9076 frame.
- A non-switch-attached environment utilizes an Ethernet connection to pass data between the clustered servers. This environment does not require a switch, a 9076 frame, or a 9076 SP node in the configuration.

The SP Switch2 supports two configurations, allowing communications over either one or two switch planes. Two-switch-plane configurations must incorporate two SP Switch2 Attachment Adapters (F/C 8397) in each attached server. Implementations utilizing a one-switch-plane configuration are limited to a maximum of one SP Switch2 Attachment Adapter (F/C 8397) per attached server. Each clustered server interfaces to the control workstation and the SP switch via a combination of cables, depending on the environment.

Restrictions: The cable and PCI card for SP control workstation attachment (F/C 3154) is required in either the non-switch-attached or the switch-attached environment.
Feature F/C 3154 provides an internal connection from a connector on the planar of the primary I/O drawer to a PCI slot location on the rear bulkhead of the I/O drawer. The PCI card associated with feature F/C 3154 must be located in the primary I/O drawer in the following slots:
- Slot 7 for a non-switched clustered server environment.
- Slot 7 for an SP Switch Attachment Adapter (F/C 8396) environment.
- Slot 6 for an SP Switch2 Attachment Adapter (F/C 8397) environment.
Feature F/C 3154 must be ordered with each 7026 server in the cluster.

The non-switch-attached cluster environment requires each server to be connected to the control workstation via a Clustered Server Control Panel-to-Control Workstation cable (F/C 3151). This cable may be ordered with either the clustered server (recommended) or the control workstation, as desired. This cable is not required for switch-attached servers.

SP switch-attached servers interface to the control workstation and SP frame via nomenclature and cables ordered from the 9076 SP. The appropriate features for this interface are provided by the SP-attached Server Frame Identification feature (F/C 9123) and the SP-attached Server Cable for 7026 feature (F/C 9125). An appropriate-length switch cable is required to attach each switch-attached server; this cable is ordered as a feature of the 9076 SP.

A minimum of one Ethernet adapter is required for each clustered server. This adapter must be recognized by the clustered server as en0 and must reside in slot 1 of the primary I/O drawer. The supported adapters are:
- 10/100 Ethernet 10BaseTX adapter (F/C 2968)
- 10/100 Mbps Ethernet PCI Adapter II (F/C 4962)

The SP Switch Attachment Adapter (F/C 8396) is a 5-volt adapter and must be located in slot 6 of the primary I/O drawer. Only one F/C 8396 adapter is allowed per server. The SP Switch2 Attachment Adapter (F/C 8397) is a universal PCI adapter; two are allowed per 7026 server. The first F/C 8397 adapter is always located in slot 5 of the primary I/O drawer. If two F/C 8397 adapters are utilized, the second adapter must be placed in slot 3, and slot 4 must remain empty.

Graphics adapters and natively attached displays are not supported in the clustered server environment. An ASCII terminal may be attached to servers in these environments, but it is not required; system control functions are provided by the SP control workstation. Additional information regarding server clustering and control workstations is available at the following IBM PSSP Web site:
http://www.ibm.com/servers/eserver/pseries/library/sp_books/planning.html

CSM/PSSP support statement
The following support statement details the CSM and PSSP support for the p660-6M1:

The IBM Eserver 7026-6M1 is supported on CSM V1.3.1 with APAR IY42369 with AIX 5.1 ML04, 5.2 ML01, and later.

Note: The CSM management server must be running AIX 5L V5.2 with Recommended Maintenance package 5200-01 (APAR IY39795). The other machines within the cluster are referred to as "managed nodes" and can be running AIX 5L V5.2 with 5200-01, or V5.1 with the 5100-04 Recommended Maintenance package (APAR IY39794). They can also be xSeries machines running CSM for Linux V1.3.
For all AIX servers, the following CSM for AIX 5L 1.3.1 service is required:
- Added service for RSCT on AIX 5.1: IY42782
- Added service for RSCT on AIX 5.2: IY42783
- CSM for AIX 5L added service: IY42353
- CSM support for p670 and p690 POWER4+: IY42356
- CSM support for 7026 servers, 9076 SP nodes, and p650: IY42379
- CSM 1.3.1 support for 9076 SP Node features 2054/2058: IY42847
- CSM support for p655 POWER4+: IY42377

The IBM Eserver p660-6M1 is supported on PSSP 3.5 with AIX 5.1 ML03 or later.

Chapter 3. Network configuration

In this chapter, we look at the different networks in a cluster configuration, and explain the tasks relating to network devices and network configurations. We cover the following topics:
- Networking for Parallel System Support Programs (PSSP):
  - SP LAN Ethernet
  - Switch network
  - Other networks
  - Network considerations
  - Sample scenarios for Cluster 1600 managed by PSSP
- Networking for Cluster Systems Management (CSM):
  - CSM hardware control
  - Hardware and network requirements
  - Virtual LANs (VLANs)
  - pSeries HPS switch network overview
  - Examples

3.1 SP LAN Ethernet
The SP LAN, or internal Ethernet, is a fundamental resource for the cluster system because it is involved in most system management operations. It is needed when nodes are being installed and customized and when their software components are being controlled. Without it, you cannot manage your nodes. Due to its importance, it is imperative that it remain functional and available.

The network topology for the SP/CES Ethernet mainly depends on the size of the system, and it should be planned on an individual basis. Since the SP/CES private Ethernet network is used for installation and administration purposes, we strongly recommend that additional network connectivity, such as the SP Switch, additional Ethernet, Token Ring, FDDI, or ATM networks, be used for applications on the SP/CES that perform significant communication among nodes. This also avoids overloading the SP/CES Ethernet administration network with application traffic.

When the cluster is built, each node provides an Ethernet (en0) interface dedicated to PSSP system management operation. This interface is connected to the control workstation (CWS) and is used by PSSP for internal cluster communication. This is normally a private network, with the nodes on this network having private addresses.

The PSSP software requires that the SP system use Internet Protocol Version 4 (IPv4) for all interfaces. IPv6 is a version that extends addressing capability. PSSP components can tolerate IPv6 alias addresses coexisting with the required IPv4 network address for the Ethernet and Token Ring interfaces only. Understand and keep in mind that none of the PSSP software components actually uses IPv6 addresses.

Note: Do not add IPv6 aliases if your system uses the SP switch, or the DCE, HACWS, or HACMP licensed programs. Those products do not tolerate IPv6. It is important that you read and clearly understand the information concerning IPv6 in the publications for each IBM licensed program.
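Because PSSP insists on IPv4 for every interface, it can be worth screening a planned address list before configuration. The short Python sketch below (illustrative only; the interface labels and addresses are invented) uses the standard ipaddress module to flag anything that is not an IPv4 address, or that falls outside a private range.

```python
# Illustrative sketch only: screens planned SP LAN (en0) addresses for the
# IPv4 requirement described above. Labels and addresses are invented examples.
import ipaddress

planned_addresses = {
    "node01_en0": "192.168.4.1",   # hypothetical node on a private network
    "node02_en0": "192.168.4.2",
    "node03_en0": "fd00::2",       # IPv6 - would only be tolerated as an alias
}

for label, addr in planned_addresses.items():
    ip = ipaddress.ip_address(addr)
    if ip.version != 4:
        print(f"{label}: {addr} is IPv6 - PSSP needs an IPv4 address here")
    elif not ip.is_private:
        print(f"{label}: {addr} is not in a private range (usually expected)")
    else:
        print(f"{label}: {addr} looks fine for the SP admin LAN")
```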
3.1.1 Supported Ethernet adapters and their placement
The Ethernet adapters supported for cluster Ethernet communication are:
- Twisted-pair cable connection:
  - 10/100 Ethernet PCI Adapter II (F/C 4962)
  - 10 MB AUI/RJ-45 Ethernet adapter (F/C 2987)
- BNC cable connection:
  - 10 MB BNC/RJ-45 Ethernet adapter (F/C 2985)
- New I/O adapters:
  - IBM 2-Port 10/100/1000 Base-TX Ethernet PCI-X adapter
  - IBM 2-Port Gigabit Ethernet-SX PCI-X adapter

Note: The IBM 2-Port Gigabit Ethernet-SX PCI-X adapter provides two full-duplex Ethernet connections for Unshielded Twisted Pair (UTP) and short-wave multimode optical cable, respectively. Both adapters provide functions to support high-performance communications, including jumbo frames at 1000 Mbps to help increase throughput with less processor utilization; Large Send, which offloads TCP segmentation from system software to the adapter; and checksum offload, which allows the adapter, rather than system software, to calculate TCP/UDP checksums.

Attention: The H80/M80 requires a 10/100 Ethernet PCI Adapter II (F/C 4962).

The feature codes for the PCI Ethernet adapters are:
- F/C 4962: IBM 10/100 Mbps Ethernet PCI adapter
- F/C 2985: PCI Ethernet BNC/RJ-45 adapter
- F/C 2987: PCI Ethernet AUI/RJ-45 adapter
- F/C 2969: Gigabit Ethernet-SX PCI adapter
- F/C 2975: 10/100/1000 Base-T Ethernet PCI adapter

Note: The 10/100 Ethernet adapter (F/C 2968) is withdrawn. Twisted-pair cable connections require one F/C 9223 for each SP-attached server, and BNC cable connections require one F/C 9222 for each SP-attached server; these features have to be ordered with the SP system. The 2-Port Gigabit Ethernet adapters are supported on AIX 5L V5.1 and V5.2.

Restriction: For the pSeries servers, the integrated Ethernet is not supported for the cluster LAN.

Placement
These adapters must be placed in the en0 position of the cluster nodes, that is, the lowest-numbered Ethernet bus slot in the first I/O tower, so that the adapter will always be en0. The following slots apply to the 7017 and 7026:
- Slot 5 on a 7017, in the primary I/O drawer
- Slot 1 on a 7026, in the primary I/O drawer

If you plan to attach an RS/6000 server to your cluster system, you must place a cluster Ethernet adapter in the en0 position inside the RS/6000 server. Because the Ethernet adapter in this slot must be configured for PSSP communications, any non-supported Ethernet adapter in the en0 slot must be removed. For HMC-controlled pSeries servers, refer to 3.4.8, “HMC trusted network considerations” on page 139.

3.1.2 Ethernet network topology
As mentioned in 3.1, “SP LAN Ethernet” on page 122, the network topology for the SP/CES Ethernet mainly depends on the size of the system; for that reason, it should be individually planned. The SP/CES private Ethernet network is used for installation and administration purposes, so additional network connectivity should be used for applications on the SP/CES that perform significant communication among nodes. In this way, you also avoid overloading the SP/CES Ethernet administration network with application traffic. For further information about this topology, refer to IBM eServer Cluster 1600 Planning, Installation and Service Guide, GA22-7863.

3.1.3 IP label convention
Each network interface within a cluster system requires a unique IP label. The label should include information such as the frame number, node number, and interface type, in order to reduce management complexity.
The host name for each node within the cluster system should be set to the IP label of the cluster Ethernet LAN interface, en0. The naming convention can be as follows:

sAFFNNnet

where:
- sA = the RS/6000 system type and number
- FF = the frame number
- NN = the node number
- net = the network interface type (for example, en0 = Ethernet)

The cluster LAN is assigned addresses from a private range of Class A or Class C network addresses reserved for use in private networks. For more information about IP addressing, refer to IBM eServer Cluster 1600 Planning Volume 2, Control Workstation and Software Environment, GA22-7281.

3.2 Switch network
In a cluster configuration, you can have another type of network, called the switch network. This network requires a switch device that is mounted in an SP frame. Switches are used to connect processor nodes, and they provide the message-passing network through which the nodes communicate, with a minimum of four disjoint paths between any pair of nodes. In any cluster system, you can use only one type of switch: either the SP Switch2 or the SP Switch. Cluster nodes can be connected to the switch, but they require a tall SP frame for housing the switch, and the cluster nodes require switch adapters in order to connect to it.

In the following sections, we discuss the benefits of a switch network, relating to the SP Switch2 and the SP Switch.

3.2.1 Benefits of a Switch network
A switch provides low-latency, high-bandwidth communication between nodes, supplying a minimum of four paths between any pair of nodes. Switches provide enhanced, scalable, high-performance communication for parallel job execution, and they dramatically speed up TCP/IP, file transfers, remote procedure calls, and relational database functions. Using a switch offers the following improved capabilities:
- Interframe connectivity and communication
- Scalability up to 256 node connections, including intermediate switch frames
- Constant bandwidth and latency between node pairs
- Support for Internet Protocol (IP) communication between nodes
- IP Address Resolution Protocol (ARP) support
- Support for dedicated or multi-user environments
- Error detection and retry
- High availability
- Fault isolation
- Concurrent maintenance for nodes
- Improved switch chip bandwidth

3.2.2 SP Switch2
The SP Switch2 (F/C 4012) can be used to interconnect the cluster nodes to form a switch network. This switch has 16 ports for node connections and 16 ports for switch-to-switch connections. The SP Switch2, which has evolved from the traditional SP Switch, provides the following added features:
- Next generation of the SP Switch and adapter
- Increased switch throughput (over a 3x increase in bandwidth):
  - 150 to 500 MB/sec one way
  - 300 to 1000 MB/sec bidirectional
- 32 ports:
  - 16 internal ports for connections to nodes within the switch-equipped frame, or to nodes in non-switched expansion frames
  - 16 external ports for switch-to-switch connections
- N+1 redundancy on:
  - Hot-swappable power supplies
  - Hot-swappable cooling fans
- Hot-swappable switch supervisor card
- Hot-swappable switch interposers

Note: Any unused switch ports must have a blank interposer card installed, to prevent contamination of the connector and to ensure proper cooling air flow.

Figure 3-1 on page 127 shows the node and switch ports.

Figure 3-1 Switch2 allocation of node and switch ports (port side shown)

The functions of a switch are made available by the PSSP software.
The functions of the SP Switch2 were enhanced with the release of PSSP 3.4. The following list describes these enhancements:
- The SP Switch2 supports two switch configurations, so that nodes can communicate over one or two switch planes. By having two switch planes operational, you can improve communication performance and also achieve higher availability, since one switch plane can continue operating even when you take a switch down for maintenance.
- Optional switch connectivity. A system using the SP Switch2 can have some nodes that are not connected to the switch. You can use the SP Switch2 and still keep older nodes in your system that are not connected to the switch.
- You can have up to eight SP Switch2 node switch boards in a single SP frame (a frame configured only with SP Switch2 switches). This is advantageous for installations that require a large number of switch connections for SP-attached servers or clustered enterprise server configurations.
- Relaxed node placement rules and optional connectivity. Nodes can be placed anywhere allowed by the physical restrictions. The node switch port numbers are not predefined, so you can connect a node to any available switch port, and the numbers are generated when the switch is started. As a node is assigned to a CSS adapter, it is given the lowest available switch node number from zero (0) through 511. There is no correlation between the switch port number and any hardware connection in a cluster of two-plane Switch2 networks.

3.2.3 Switch IP network and addressing
If you are using IP for communication over the switch, each node needs to have an IP address and name assigned for the switch interface (the css0 adapter). If you plan to use the SP Switch2 with two switch planes, you also need an IP address and name assigned for the css1 adapter, and you have the option to use the ml0 aggregate IP interface. If hosts outside the switch network need to communicate over the switch using IP with nodes in the cluster system, those hosts must have a route to the switch network through one of the cluster nodes.

If you use the SP Switch2 and all nodes are running PSSP 3.4 or PSSP 3.5, you have optional connectivity, so some nodes can be left off the switch.

Switch port numbering is used to determine the IP addresses of the nodes on the switch. If your system is not ARP-enabled on the css0 adapter (and the css1 adapter in a two-plane Switch2 system), choose the IP address of the first node on the first frame; the switch port number is then used as an offset added to that address to calculate all the other switch IP addresses. If ARP is enabled for the css0 and css1 adapters, the IP addresses can be assigned like those of any other adapter: they can begin and end at any node, and they do not have to be contiguous addresses for all the css0 and css1 adapters in the system. Switch port numbers are automatically generated by the SP Switch2, and so are not used for determining the IP addresses of the SP Switch2 adapters.

With the introduction of the SP Switch2 and PSSP 3.4, you are able to build a switched cluster in which some nodes do not belong to the switch network. Configuring such a cluster is no different from constructing a standard switched or unswitched cluster.

Tip: Configure the unswitched nodes as if you were configuring nodes without the switch. When adding the additional adapter information about the unswitched nodes, use the “Node list” option.
Configure the switched nodes as if you were configuring nodes with the switch. When adding the additional adapter information about the switched nodes, use the “Node list” option.

3.3 Other networks
Once you have configured the cluster LAN and the optional switch network, you can configure other networks that allow users to use the nodes. For example, you could have high-speed Ethernet adapters in the nodes; such a network connection allows users to access the applications running on the nodes.

The configuration of these additional adapters is quite straightforward; however, you need to ensure it is done on the CWS and propagated to the nodes, in order to keep the SDR consistent. If you configure these adapters within the node, PSSP's SDR residing on the CWS will not know about it, and if you reconfigure or customize the node from the CWS in the future, you could end up deleting or overwriting the network configuration on the nodes. To avoid this situation, keep the following points in mind when choosing to configure additional adapters in your cluster nodes:
- All configuration should be carried out through PSSP on the CWS.
- Set the right initial host name on the node so that applications can find the host.
- Use the tuning.cust script in PSSP to set the right routes for this additional adapter network.
- If you are installing an application on the node, do so after installing the external adapter and setting the initial host name.

3.4 Network considerations
In the following sections, we highlight considerations and other information that will help you when building the network for the Cluster 1600. These sections provide specific information on the Cluster 1600 managed by PSSP and its subsystems.

3.4.1 The RS-232 connection
In classic SP environments, the CWS has a direct connection to each SP frame and to each SP-attached server, to perform the following tasks:
- Controlling the hardware (for example, power on/power off)
- Transferring the Kerberos tickets and password files
- Communicating with the firmware (for example, node conditioning)

Note: In a pSeries-attached configuration, there is no direct RS-232 connection between the CWS and the pSeries server. The CWS is connected to the HMC via Ethernet, and the HMC is connected to the pSeries server using an RS-232/RS-422 serial line.

3.4.2 System topology considerations
When configuring larger systems, you need to consider several areas when setting up your network: the Cluster 1600 Ethernet, the outside network connections, the routers, the gateways, and the switch traffic.

The Cluster 1600 Ethernet is the network that connects a control workstation to each of the nodes in the cluster that are to be operated and managed by that control workstation using PSSP. When configuring the cluster Ethernet, the most important consideration is the number of subnets you configure. Because of the limitation on the number of simultaneous network installs, the routing through the cluster Ethernet can be complicated. Usually the amount of traffic on this network is low, but if you connect the cluster Ethernet to your external network, you must make sure that the user traffic does not overload the SP Ethernet network.

If your outside network is a high-speed network like FDDI or HIPPI, routing the traffic to the cluster management Ethernet can overload it. For gateways to FDDI and other high-speed networks, route traffic over the switch network.
Configure routers or gateways to distribute the network traffic so that one network or subnet is not a bottleneck. If the cluster management Ethernet is overloaded by user traffic, move the user traffic to another network. If you expect a lot of traffic, configure several gateways.

3.4.3 Boot/install server requirements
When planning your management LAN topology, consider your network install server requirements. The network install process uses the admin/management Ethernet for transferring the install image from the install server to the cluster nodes. Running numerous concurrent network installs can exceed the capacity of the admin LAN.

The following are suggested guidelines for designing an Ethernet topology for efficient network installs. Many of the configuration options require additional network hardware, beyond the minimal node and control workstation requirements, and there are also network addressing and security issues to consider.

HMC-controlled servers are network-connected and do not require you to use en0; for all other nodes, you must use the en0 adapter to connect each node to the SP Ethernet admin LAN. The following requirements pertaining to the cluster Ethernet admin LAN exist for all configurations:
- The NIM clients that are served by boot/install servers must be on the same subnet as the boot/install server's Ethernet adapter.
- NIM clients must have a route to the control workstation over the SP Ethernet.
- The control workstation must have a route to the NIM clients over the SP Ethernet.

Note: Keep the following points in mind regarding boot/install servers:
- Certain security options have limitations with multiple boot/install servers. See “Limitations” in IBM eServer Cluster 1600 Planning Volume 2, Control Workstation and Software Environment, GA22-7281.
- Do not install cascading levels of boot/install servers; every boot/install server node must have the control workstation as its boot/install server.

Single frame systems
For small systems, you can use the control workstation as the network install server. This means that the SP Ethernet admin LAN is a single network connecting all nodes to the control workstation. When installing the nodes, limit yourself to installing eight nodes at a time, because that is the limit of acceptable throughput on the Ethernet.

An alternate way to configure your system is to install a second Ethernet adapter in your control workstation, if you have an available I/O slot, and use two Ethernet segments to the SP nodes. Connect each network to half of the SP nodes. When network-installing the frame, you can then install all 16 nodes at the same time.

Note: Set up your SP Ethernet admin LAN routing so that nodes on one Ethernet can communicate with nodes on the other network. Set up your network mask so that each SP Ethernet is its own subnet within a larger network address.

Multiple frame systems
For multiple frame systems, you might want to spread the network traffic over multiple Ethernets and keep the maximum number of simultaneous installs per network to eight. You can use the control workstation to network-install specific SP nodes to be the network install servers for the rest of the nodes. There are three ways to accomplish this:

1. Use a control workstation with one Ethernet adapter for each frame of the system, and one associated SP Ethernet per frame.
For example, if you have a system with four frames (as in Figure 3-2 on page 132), the control workstation must have enough I/O slots for four Ethernet adapters. Each adapter connects one of the four SP frame Ethernet segments to the control workstation. Using this method, you install the first eight nodes on a frame at a time—or up to 32 nodes, if you use all four Ethernet segments simultaneously. Running two installs will install up to 64 nodes. Note: Set up your SP Ethernet routing so nodes on one Ethernet can communicate with nodes on another. Set up your network mask so that each SP Ethernet is its own subnet within a larger network address.This method is applicable up to the number of slots your control workstation has available. Figure 3-2 CWS with one Ethernet adapter for each frame A second approach designates the first node in each frame as a network install server, and then the remaining nodes of that frame are set to be installed by that node. This means that, from the control workstation, you will have an SP Ethernet segment connected to one node on each frame. Then the network install 132 IBM Eserver pSeries Cluster Systems Handbook node in each frame has a second Ethernet card installed, which is connected to an Ethernet card in the rest of the nodes in the frame. When using this method, as shown in Figure 3-3, installing the nodes requires that you first install the network install node in each frame. The second set of installs will install up to eight additional nodes on the frame. The last install, if needed, installs the rest of the nodes in each frame. Be forewarned that this configuration usually brings performance problems due to two phenomena: – All SP Ethernet admin LAN traffic (for example, installs, and SDR activity) is routed through the control workstation. The single control workstation Ethernet adapter becomes a bottleneck, eventually. Figure 3-3 CWS with one Ethernet card; one node has two adapters – An application running on a node which produces a high volume of SP Ethernet traffic (for example, LoadLeveler®) causes all subnet routing to go through the one control workstation Ethernet adapter. Moving the subject application to the control workstation can cut that traffic in half, but the control workstation must have the capacity to accommodate that application. You can improve the performance here by adding an external router, similar to that described in method 3. A third method adds an external router to the topology of the previous approach. This router is made part of each of the frame Ethernets, so that traffic to the outside does not need to go through the control workstation. You can do this Chapter 3. Network configuration 133 only if the control workstation can also be attached externally, providing another route between nodes and the control workstation. Figure 3-4 shows this Ethernet topology for such a multi-frame system. Figure 3-4 Ethernet topology for multi-frame system 3.4.4 The SP Ethernet administrative LAN The SP Ethernet is the administrative LAN that connects all nodes in one system running PSSP to the control workstation. The SP LAN is essential for the Cluster 1600 and for the following PSSP software management tasks: RSCT communications (for example, hosts responds) NIM operations Systems management communications Note: The I/O drawer of the pSeries server does not support the BNC-type network adapters. 
If you want to integrate the pSeries server into an existing Cluster 1600 with a BNC SP LAN, you will need an HUB in order for the SP LAN to provide BNC and Twisted pair (TP) cabling in one network. For HMC-controlled servers, the SP LAN can also be used to connect the HMC to the control workstation for hardware control and monitoring; refer to 3.4.8, “HMC trusted network - considerations” on page 139, for more information. 134 IBM Eserver pSeries Cluster Systems Handbook For each node, ensure that the SDR reliable_hostname attribute is identical to the default host name returned by the host command for its SP Ethernet IP addresses. With the exception of HMC-controlled server nodes, the PSSP components expect that en0 is the connection from the node to the SP Ethernet admin LAN for installs and other PSSP functions. For an HMC-controlled server node, you can use any Ethernet adapter that is supported for connecting to the cluster Ethernet admin LAN, and identify it by name or by physical location. For all other nodes, you must connect the cluster Ethernet admin LAN to the Ethernet adapter in the node's lowest hardware slot of all the Ethernet adapters on that node. When a node is network-booted, it selects the lowest Ethernet adapter from which to perform the install. This Ethernet adapter must be on the same subnet of an Ethernet adapter on the node's boot/install server. In nodes that have an integrated Ethernet adapter, it is always the lowest Ethernet adapter. Be sure to maintain this relationship when adding Ethernet adapters to a node. You can attach the cluster admin Ethernet to other site networks and use it for other site-specific functions. You assign all addresses and names used for the cluster Ethernet admin LAN. You can make the connections from the control workstation to the nodes in one of three ways. The method you choose should be one that optimizes network performance for the functions required of the cluster admin Ethernet LAN by your site. The three connection methods are: Single-subnet, single-stage cluster admin Ethernet, in which one interface on the control workstation connects to all cluster nodes. Multi-subnet, single-stage cluster admin Ethernet. In this method, there is more than one interface on the control workstation, and each connects to a subset of the cluster admin nodes. Multi-subnet, multi-stage cluster admin Ethernet. In this method, a set of nodes, acting as routers to the remaining nodes on separate subnets, connects directly to the control workstation. The cluster’s boot/install servers must be on the same subnet as their clients. In the case of a multi-stage, multi-subnet cluster Ethernet admin LAN, the control workstation is the boot/install server for the first node in each frame, and those nodes are the boot/install servers for the other nodes in the frames. Also, when booting from the network, nodes broadcast their host request over their Ethernet admin LAN interface. Therefore, that interface must be the Ethernet adapter on the node that is connected to the boot/install network. Chapter 3. Network configuration 135 The IBM eServer pSeries introduces a new way to configure the SP LAN adapter. It is now recommended to configure the SP LAN adapter through its unique hardware location code. For further details on this subject, refer to IBM eServer Cluster 1600: Planning, Installation and Service Guide, GA22-7863. Note: For HMC-controlled servers, you are no longer required to have the first Ethernet adapter (en0) configured as the SP LAN adapter. 
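In practice, the hardware location code of a candidate adapter can be read on the node itself, and the adapter definitions held in the SDR can be reviewed from the CWS. A brief, hedged sketch (adapter names and output fields are illustrative; check the command references for your PSSP level):

   # On the node: display the vital product data, including the
   # physical location code, of an Ethernet adapter
   lscfg -vl ent0
   # On the CWS: list the adapter information stored in the SDR
   splstdata -a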
PSSP uses the hardware physical location codes to uniquely define the Ethernet adapter. Tip: To avoid problems later on, we recommend that you use unique domain names not only for each HMC, but also in the Cluster 1600 environment. Node Class Since PSSP represents each LPAR as a node, the LPAR name is stored in the Node Class of the SDR. The Node Class is described in Table 3-1. Table 3-1 List with new and enhanced attributes in the Node Class Attribute name Description p690 values LPAR_name Logical partition identifier Retrieved by hardmon. It is the same name as shown on the HMC in the Partition Management menu. Adapter Class The new feature to define an SP LAN adapter through its hardware location code requires the new physical_location attribute in the SDR Adapter Class. Since the SP LAN adapter can be different than en0, it needs to define explicitly which adapter belongs to the SP LAN. The Adapter Class is described in Table 3-2. Table 3-2 List with new and enhanced attributes in the Adapter Class 136 Attribute name Description p690 values physical_location Physical location code for the adapter. The value as shown by the spadaptr_loc or lscfg -vl command. SP LAN Indicates whether or not this adapter is connected to the SP admin network (SP LAN). A value of 1 indicates that this is an SP LAN adapter. A value of 0 means that this adapter belongs to a different network. Only one adapter per node can have this value set to 1. IBM Eserver pSeries Cluster Systems Handbook Important: There is no specific plug-in rule for the SP LAN Ethernet adapter placement. To see which Ethernet adapters are supported as SP LAN adapters, refer to RS/6000 SP: Planning Volume 1, Hardware and Physical Environment, GA22-7280. For further information about adapter placements, refer to RS/6000 and eServer pSeries: PCI Adapter Placement Reference, SA38-0538. 3.4.5 Additional LANs - considerations The cluster admin LAN can provide a means to connect all nodes and the control workstation to your site networks. However, it is likely that you will want to connect your SP nodes to site networks through other network interfaces. If the cluster admin LAN is used for other networking purposes, the amount of external traffic must be limited. If too much traffic is generated on the SP Ethernet admin LAN, the administration of the SP nodes might be severely impacted. For example: Problems might occur with network installs, diagnostic function, and maintenance mode access. In an extreme case, if too much external traffic occurs, the nodes will hang when broadcasting for the network. Additional Ethernet, Fiber Distributed Data Interface (FDDI), and Token-Ring networks can also be configured by the PSSP software. Other network adapters must be configured manually. These connections can result in increased network bandwidth. Performance in user file serving and other network-related functions may be reduced. You need to assign all the addresses and names associated with these additional networks. Note: You can plan to run PSSP on a system with a firewall. For more information, see Implementing a Firewalled RS/6000 SP System, GA22-7874. 3.4.6 IP over the switch - considerations If your SP has a switch and you want to use IP for communications over the switch, each node needs to have an IP address and name assigned for the switch interface, the css0 adapter. 
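Like all other adapter information, the switch addresses are defined in the SDR from the CWS. The following is only a hedged sketch; the frame, slot, and node counts, the addresses, and the flags shown are illustrative assumptions, and the exact spadaptrs syntax for your PSSP level should be taken from the command reference:

   # Define css0 addresses for 16 nodes starting at frame 1, slot 1,
   # with ARP enabled on the switch network (flags are illustrative)
   spadaptrs -a yes 1 1 16 css0 192.168.14.1 255.255.255.0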
If you plan to use the SP Switch2 with two switch planes, you also need to have an IP address and name assigned for the css1 adapter, and you have the option to use the ml0 aggregate IP address. If hosts outside the SP switch network need to communicate over the switch using Chapter 3. Network configuration 137 IP with nodes in the SP system, those hosts must have a route to the switch network through one of the SP nodes. If you are not enabling ARP on the switch, specify the switch network subnet mask and the IP address. Unlike all other network interfaces, which can have sets of nodes divided into several different subnets, the SP switch IP network must be one contiguous subnet that includes all nodes in the system. (If you use the SP Switch2 and all nodes are running PSSP 3.4 or greater, you have optional connectivity so some nodes can be left off the switch.) If you want to assign your switch IP addresses as you do your other adapters, you must enable ARP for the css0 adapter and, if you are using two switch planes, for the css1 adapter. If you enable ARP for those interfaces, you can use whatever IP addresses you wish, and those IP addresses do not have to be in the same subnet for the whole system. They must all be resolvable by the host command on the control workstation. For more information about this subject, refer to IBM eServer IBM RS/6000 SP: Planning Volume 1, Hardware and Physical Environment, GA22-7280. 3.4.7 Subnetting - considerations All but the simplest SP system configurations likely include several subnets. Thoughtful use of netmasks in planning your networks can economize on the use of network addresses. For more information about Internet addresses and subnets, refer to AIX 5L Version 5.1 System Users Guide: Communications and Networks, which can be found at: http://publib16.boulder.ibm.com/pseries/en_US/aixuser/usrcomm/usrcommtfrm.htm As an example, consider an SP Ethernet where none of the six subnets making up the SP Ethernet have more than 16 nodes on them. A netmask of 255.255.255.224 provides 30 discrete addresses per subnet, which is the smallest range that is usable in the wiring as shown. Using 255.255.255.224 as a netmask, we can then allocate the address ranges as follows: 129.34.130.1-31 to the control workstation to node 1 subnet 129.34.130.33-63 to the frame 1 subnet 129.34.130.65-96 to frame 2 In the same example, if we used 255.255.255.0 as our netmask, then we would have to use six separate Class C network addresses to satisfy the same wiring configuration (that is, 129.34.130.x, 129.34.131.x, 129.34.132.x, and so on). 138 IBM Eserver pSeries Cluster Systems Handbook 3.4.8 HMC trusted network - considerations Cluster 1600s that are managed by PSSP configurations which contain HMC-controlled servers require additional security planning and configuration to protect password data transferred from the control workstation to an HMC-controlled server. Cluster 1600 configurations managed by PSSP that do not contain HMC-controlled servers do not require additional security planning and configuration. RMC LAN The Resource Monitoring and Control (RMC) LAN is an Ethernet connection between the HMC and the pSeries server in an LPAR or SMP mode. Over this connection the HMC gathers data from active LPARs or SMP servers about the health of the system, and acts as a service focal point for service representatives to determine an appropriate service strategy. For more information about the RMC LAN, refer to HMC Operations Guide, SA38-0590. 
Important: Use of the RMC LAN is not mandatory. It is possible to use the function of the RMC LAN over the SP LAN. The trusted LAN Since there is no direct RS-232/RS-422 connectivity between the HMC and the CWS, it is mandatory to have a trusted network connection between the HMC and the CWS to transfer Kerberos tickets and password files. A separate trusted LAN is recommended, but not necessary. If your SP LAN is secure, you do not need to set up an additional trusted LAN connection. You can send your secure sensitive data over the existing SP LAN. A trusted network In a trusted network, all hosts on the same network (LAN) are regarded as trusted, according to site security policies and procedures governing the hosts. Data on a trusted network can be seen by all trusted hosts and users on the trusted hosts, but the implied trust among and between the hosts assumes that the data will not be intercepted or modified. By way of implied mutual trust, traffic flowing across the trusted network is regarded as safe from unwanted or unintended interception or tampering. However, it does not imply that the data on the trusted network is itself private or encrypted. You need a trusted network that best suits your environment: If you consider the SP Ethernet admin LAN to be a trusted network, no additional setup or configuration is needed to satisfy the requirement for security. Chapter 3. Network configuration 139 If you do not consider the SP Ethernet admin LAN to be a trusted network, then you must establish another trusted network between the control workstation and the HMC, in order to satisfy your security requirement. The remainder of this discussion will help you to plan a trusted network between the control workstation and the HMC. A trusted network between the control workstation and an HMC requires that both hosts are connected via a network other than the SP Ethernet admin LAN. This other network is referred to as the HMC trusted network. We strongly suggest that you connect only the SP control workstation and HMC-controlled server systems to the HMC trusted network. The HMC trusted network requires a set of IP addresses all in the same subnet. We suggest that you reserve the subnet for use only by hosts that will be connected to the HMC trusted network. If you plan on connecting multiple control workstations and multiple HMC systems to the HMC trusted network, then ensure that the subnet can accommodate the total number of trusted hosts (actual or expected). In order for a control workstation to be connected to both the SP Ethernet admin LAN and the HMC trusted network, each control workstation requires the installation and configuration of an additional network adapter to be configured for the HMC trusted network. Each HMC must have a network adapter configured for, and connected to, the HMC trusted network. Connecting an HMC to the HMC trusted network will require the reconfiguration of an existing HMC network adapter. How the existing HMC network adapter is reconfigured depends on whether your HMC-controlled server is a new SP-attached server install, or an existing SP-attached server. New SP-attached server install When your HMC-controlled server is already installed and configured, but not yet connected to your SP system, it is a new SP-attached server install. Therefore, you must reconfigure the existing network adapter in the HMC for the HMC trusted network. 
Existing SP-attached server When your HMC-controlled server is already an SP-attached server, the SP Ethernet admin LAN is used to connect the HMC to the control workstation. Specifically, the network adapter in the HMC is configured for the SP Ethernet admin LAN. 140 IBM Eserver pSeries Cluster Systems Handbook In this case, you must un-configure the network adapter in the HMC, disconnect it from the SP Ethernet admin LAN, connect it to the HMC trusted network, and then configure it for the HMC trusted network. Configuration examples In this section, we show different possible network connections. Figure 3-5 shows the RMC LAN with SP LAN and trusted network. Trusted LAN pSeries LPAR Mode RS232/422 HMC TTY CWS SP LAN Ethernet Figure 3-5 The RMC LAN uses the SP LAN and the trusted network Figure 3-6 on page 142 shows a cluster with one physical network for the RMC, the SP LAN, and the trusted LAN. Chapter 3. Network configuration 141 pSeries LPAR Mode RS232/422 HMC TTY CWS SPLAN Ethernet / RMC LAN / trusted LAN Figure 3-6 Cluster with only one physical network Figure 3-7 shows two physical networks: one for the SP LAN, and one for the trusted network. The RMC LAN uses the SP LAN. pSeries Trusted LAN LPAR Mode RS232/422 TTY HMC CWS SPLAN Ethernet / RMC LAN Figure 3-7 Cluster with two physical networks Figure 3-8 on page 143 shows three separate physical networks for the SP LAN, the RMC LAN, and the trusted network. 142 IBM Eserver pSeries Cluster Systems Handbook RMC LAN pSeries Trusted LAN LPAR Mode RS232/422 TTY CWS HMC SPLAN Ethernet Figure 3-8 Each function has its own physical network In Figure 3-9, you can attach the pSeries to the IBM eServer Cluster 1600 either in full system partition mode (also known as SMP mode), or in LPAR mode. Therefore, the configuration is different for each mode. RS232 TTY CWS Trusted Ethernet SP Frame pSeries pSeries SMP Mode LPAR Mode HMC Frame Supervisor RS232 TTY RS232 TTY SPLAN Ethernet Figure 3-9 pSeries attached in Cluster 1600 Chapter 3. Network configuration 143 3.4.9 Network router node considerations If you plan to use an SP Switch Router and the SP Switch Router Adapter for routing purposes in your environment, the next few paragraphs on using standard nodes as a network router might not be applicable to your SP configuration. However, if you are not using the SP Switch Router, you might be interested in some considerations for using your nodes as network routers. When planning router nodes on your system, several factors can help determine the number of routers needed and their placement in the SP configuration. The number of routers you need can vary, depending on your network type. (In some environments, router nodes might also be called “gateway nodes”.) For nodes that use low bandwidth networks (such as 10 Mb Ethernet or token ring) as the routed network, a customer network running at full bandwidth results in a lightly loaded CPU on the router node. For nodes that use high bandwidth networks (such as GB Ethernet or FDDI) as the customer routed network, a customer network running at or near maximum bandwidth results in high CPU utilization on the router node. For this reason, you should not assign any additional role in the computing environment (such as a node in a parallel job) to a router using a high bandwidth network as the customer network. You also should not connect more than one high bandwidth network card to a router node. Applications, such as POE, should run on nodes other than high bandwidth routers. 
However, low bandwidth gateways can run with these applications. For systems that use low bandwidth routers, traffic can be routed through the SP Ethernet, but careful monitoring of the SP Ethernet will be needed to prevent traffic coming through the router from impacting other users of the SP Ethernet. For high bandwidth networks, traffic should be routed across the switch to the destination nodes. The amount of traffic coming in through the high bandwidth network can be up to 10 times the bandwidth the SP Ethernet can handle. For more information about configuring network adapters and the various network tunables on the nodes, refer to PSSP 3.5 Administration Guide, SA22-7348. SP Switch Router considerations The SP Switch Router is something you can use in a system with the SP Switch. It is not supported with the SP Switch2. It is by type an extension node, more specifically a dependent node. The SP Switch Router gives you high speed access to other systems. Without the SP Switch Router, you would need to dedicate a standard node to performing external network router functions. Also, 144 IBM Eserver pSeries Cluster Systems Handbook because the SP Switch Router is external to the frame, it does not take up valuable processor space. The SP Switch Router has two optional sizes. The smaller unit has four internal slots, and the larger unit has sixteen. One slot must be occupied by an SP Switch Router Adapter card, which provides the SP connection. The other slots can be filled with any combination of network connection cards, including the following types: Ethernet FDDI ATM SONET HIPPI HSSI Additional SP Switch Router Adapters Additional SP Switch Router Adapters are needed for communicating between system partitions and other SP systems. These cards provide switching rates of from 4 GB to 16 GB per second between the router and the external network. To attach an extension node to an SP Switch, configuration information must be specified on the control workstation. Communication of switch configuration information between the control workstation and the SP Switch Router takes place over the SP system's administrative Ethernet and requires use of the UDP port number 162 on the control workstation. If this port is in use, a new communication port will have to be configured into both the control workstation and the SNMP agent supporting the extension node. You can improve throughput of data coming into and going out of the SP system by using the SP Switch Router. The SP Switch Router can be connected with the SP Switch Router Adapter to an SP Switch, 8-port or 16-port. Each SP Switch Router Adapter in the SP Switch Router requires a valid, unused switch port in the SP system. 3.4.10 Clustered server configuration considerations A “clustered” server is any of the AIX servers discussed in “Cluster 1600 hardware” on page 11. It is not mounted in an SP frame and has no SP frame or node supervisor, though some do have comparable function-enabling hardware control and monitoring. A clustered server is directly attached to the SP Ethernet Chapter 3. Network configuration 145 admin LAN and to the control workstation. The means of connection differ with the server hardware. In a clustered server system configuration, you can assign frame numbers in any order. However, if you add SP frames with SP nodes or with SP switches, your system will then be subject to all the rules of an SP system, and these clustered servers become SP-attached servers. 
(Remember that those terms reflect only the system configuration in which the servers participate.) If you might use the SP Switch2 or the SP Switch, you need to plan the respective switch network. If you might use the SP Switch, plan your system with suitable frame numbers and switch port numbers in advance so you can expand to an SP system without having to totally reconfigure existing servers. For information on switch port numbering for a switchless system, refer to IBM eServer Cluster 1600 Planning Volume 1, Hardware and Physical Environment, GA22-7280. 3.4.11 SP-attached server considerations An “SP-attached server” is any of the servers discussed in “Cluster 1600 hardware” on page 11. If the SP system has the SP Switch, an SP-attached server requires an available node slot within an SP frame to reserve a valid unused switch port on the switch in the same SP frame. An SP-attached server is not supported with an SP Switch-8. You must connect the server to the SP Switch network with an adapter. That adapter connects to the valid unused switch port in the SP frame. For an HMC-controlled server node, each LPAR attaches to the switch. For more information on how to choose a valid port on the SP Switch, refer to SP Switch Router Adapter Guide, GA22-7310. SP system partitioning is supported by default in switchless systems with at least one SP node frame. In that case, the number of SP-attached servers you can have is limited by the number of available switch ports. But what if you have no need for multiple SP system partitions, but do need more SP-attached servers than are accommodated by the available switch ports in your system? Then you can force a switchless SP system to be non-partitionable. In this way, you can have more SP-attached servers because you can ignore the switch port numbering rules and assign sequential numbers. For more information about this topic, refer to PSSP 3.5 Administration Guide, SA22-7348. If the system has the SP Switch2, an SP-attached server and each HMC-controlled server LPAR node can connect with an adapter to any available switch port in the SP frame. If all nodes are running PSSP 3.4 or greater, you can leave some nodes not connected to the switch. 146 IBM Eserver pSeries Cluster Systems Handbook You must assign a frame number to an SP-attached server. Be sure to read and understand the information regarding SP-attached servers in IBM eServer Cluster 1600 Planning Volume 1, Hardware and Physical Environment, GA22-7280. 3.5 Sample scenarios of Cluster 1600 managed by PSSP In this section, we show a few sample Cluster 1600 scenarios using pSeries, HMC, and the CWS. 3.5.1 CWS with two HMCs and four pSeries Figure 3-10 shows four pSeries servers in LPAR mode, with all nodes connected to the SP LAN. To achieve high availability, we use two HMCs. Each HMC is connected to all pSeries via RS-232 connections. The connection between the CWS and the HMC is via a trusted Ethernet connection. CWS SPLAN Ethernet HMC Trusted Ethernet HMC RS232 TTY p690/p670 LPAR Mode p690/p670 p690/p670 p690/p670 LPAR Mode LPAR Mode LPAR Mode Figure 3-10 Four pSeries with two redundant HMCs and one CWS To avoid having the HMC become, potentially, a single point of failure, we recommend that you connect each pSeries server to two different HMCs. That Chapter 3. Network configuration 147 way, in the event of an HMC failure, you are still capable of reaching the pSeries through the second HMC connection. 
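Once both HMCs are cabled and the frames are defined, connectivity can be checked from the CWS. A brief, hedged sketch (output is not shown and flags may differ slightly between PSSP levels):

   # Display the status of frames and nodes as seen by the hardware monitor
   spmon -d -G
   # List the frame information stored in the SDR, including the
   # hardware protocol and the control point (HMC) addresses
   splstdata -f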
Depending on how many pSeries servers you have, additional 8-port asynchronous adapters may be required. Tip: For better performance when installing the nodes, make sure that when defining the pSeries frames (using the spframe command), you change the order of the HMC IP addresses. 3.5.2 One CWS, one HMC and one pSeries Figure 3-11 shows the smallest possible Cluster 1600 based on one pSeries. CWS p690/p670 LPAR Mode HMC RS232 TTY SPLAN Ethernet / RMC LAN / trusted LAN Figure 3-11 Cluster 1600 with one HMC, one pSeries and CWS A useful upgrade for this scenario could be a trusted Ethernet between the HMC and the CWS. 148 IBM Eserver pSeries Cluster Systems Handbook Note: Keep the following in mind: Depending upon the pSeries type and model, the serial connectivity will be RS-232 or RS-422. In an SP Switch or a switchless environment, you must define a switch node number for each LPAR in the pSeries server. You can do that by manually editing the /etc/switch.info file.This file must include one line entry for each attached LPAR. You can skip this configuration step if you are using SP Switch2 only, or you have a switchless Clustered Enterprise Server system (CES). 3.5.3 CWS, two HMCs, two 9076s, with one pSeries and SP Switch2 Figure 3-12 shows a Cluster 1600 after integrating a pSeries server in LPAR mode. Not all LPARs have a connection to the SP Switch2. This scenario shows a redundant HMC with a separate trusted LAN between the HMC and the CWS. CWS HMC Trusted Ethernet RS232 TTY SP Frame Frame Supervisor SPLAN Ethernet/RMC LAN HMC RS232 TTY SP Frame p690/p670 LPAR Mode Frame Supervisor SP Switch2 Figure 3-12 SP Frames with one pSeries, two LPARs on an SP Switch2 Chapter 3. Network configuration 149 Note: Depending upon the pSeries type and model, the serial connectivity will be RS-232 or RS-422. 3.5.4 CWS, HMC, 9076 frame and pSeries with SP Switch Figure 3-13 shows a pSeries in LPAR mode where all LPARs have an SP Switch connection. The RMC LAN traffic goes over the SP LAN. CWS HMC Trusted Ethernet RS232 TTY SPLAN Ethernet/RMC LAN SP Frame RS232 TTY p690/p670 LPAR Mode Frame Supervisor SP switch Figure 3-13 SP frame and pSeries with SP-Switched LPARs Note: Ethernet and serial cables have limited length; keep those lengths in mind during configuration planning, because they limit the location of the servers. 150 IBM Eserver pSeries Cluster Systems Handbook 3.6 Networking for Cluster Systems Management (CSM) In the following sections, we highlight considerations and other information that will help you when building the network for the Cluster 1600. These sections provide specific information on the Cluster 1600 managed by Cluster Systems Management (CSM) and its subsystems, and on the pSeries High Performance Switch (HPS) used in Cluster 1600 managed by CSM. 3.6.1 CSM hardware control IBM Cluster Systems Management (CSM) for AIX 5L hardware control software provides remote hardware control functions for CSM cluster nodes from a single point of control. CSM allows a system administrator to control cluster nodes remotely through access to the cluster management server. An administrator can run cluster management commands from the management server using the command line, Web-based System Manager graphical user interface (GUI), or System Management Interface Tool (SMIT) panels. CSM hardware control functions depend on the specific hardware, software, network, and configuration requirements. 
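For example, the remote power and remote console functions are driven from the management server command line. The following is a hedged sketch; the node names are illustrative, and the complete syntax is in the rpower and rconsole man pages:

   # Query the power state of all nodes defined to CSM
   rpower -a query
   # Power on two specific LPAR nodes
   rpower -n lpar01,lpar02 on
   # Open a remote console to a node
   rconsole -n lpar01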
The requirements for remote power are separate and distinct from the requirements for remote console. AIX 5L clusters without the hardware, software, network, or configuration required to use CSM hardware control can still have CSM installed on some or all cluster nodes. However, in such clusters the hardware control commands may be inoperable or provide only limited function. CSM for AIX 5L supports cluster hardware control for both pSeries and xSeries nodes from an AIX management server. The hardware control commands can be run on the cluster management server to simultaneously control both node types in a mixed cluster. For a complete description of CSM mixed clusters, refer to IBM CSM for AIX 5L: Administration Guide, SA22-7918. 3.6.2 Hardware and network requirements The management server for CSM for AIX 5L can be connected to cluster nodes and external networks using various configurations of IBM and non-IBM hardware and software that meet the CSM architecture requirements. Note: To perform hardware control on HMC-attached pSeries machines, Hardware Management Console (HMC) for pSeries release 3 version 1.0 or later is required. The HMC can be used to partition physical pSeries servers into multiple logical partitions (LPARs), or nodes, each containing its own operating system image. Chapter 3. Network configuration 151 When configuring a CSM for AIX 5L cluster, give particular attention to secure hardware control functions. For both performance and security reasons, it is important to understand and control the data that is transferred around your network. If the data you transmit over a network is sensitive, and is not encrypted, consider isolating that data on a dedicated network or LAN. We recommend that you isolate the management server, Hardware Management Consoles (HMCs), console servers, and RSAs on a dedicated network. In a CSM environment, this network is known as a management virtual LAN (VLAN); see “Management VLAN” on page 153 for further details. A VLAN hides the HMCs, console servers, and RSAs from the normal users and keeps the traffic between them from being visible on the public network. CSM management servers use Telnet, which transmits passwords in clear text, to communicate with console servers. Many systems administrators choose to create a network for cluster administration data, such as node installations, backups, and monitoring. In a CSM environment, this network is known as a cluster VLAN; see “Cluster VLAN” on page 154. One Ethernet connection is also required for each cluster VLAN. Many systems administrators also create a network for application-related traffic. In a CSM environment, this network is known as a public VLAN; see “Public VLAN” on page 154. Considerations It is possible to use private IP addresses (192.168.*.*, 172.16.*.*, or 10.*.*.*) for both the management VLAN and the cluster VLAN. Private addresses can provide a higher level of security than public addresses. However, if your administrators’ workstations are not on the management VLAN, they will have no direct access to HMCs, RSAs, or console servers. They will have to access them by logging on to the management server. If this is not acceptable, the only alternative is to use routing. However, routing can become complex if your administrators’ workstations are dispersed across multiple networks, or if they use DHCP for their workstations. Incorrectly implemented routing can impact security, and thus defeat the purpose of creating multiple networks. 
For example, incorrect routing can cause secure data to be transmitted over the public VLAN. 152 IBM Eserver pSeries Cluster Systems Handbook Figure 3-14 shows a possible network configuration for a fairly simple environment. It contains three networks, a management VLAN, a cluster VLAN and a public VLAN. management VLAN public VLAN management server HMC(s) console console server(s) server(s) cluster VLAN administrators nodes users RSA(s) Figure 3-14 VLAN configuration 3.6.3 Virtual LANs (VLANs) VLANs discussed in this redbook refer to VLANs as defined by IEEE standards. VLANs can be configured to control cluster security access. Figure 3-14 on page 153 shows a network partitioned into three virtual LANs, the management, cluster, and public VLANs. We explain these in the following sections. Management VLAN Hardware control commands such as rpower and rconsole are run on the management server and communicate to nodes through the management VLAN. The management VLAN connects the management server to the cluster hardware through an Ethernet connection. For optimal security, the management VLAN must be restricted to hardware control points, remote console servers, the management server, and root users. Chapter 3. Network configuration 153 Routing between the management VLAN and cluster or public VLANs could compromise security on the management VLAN. Note: The management VLAN is subject to the Remote Server Access (RSA) restriction of 10/100Mb/s. Cluster VLAN A cluster VLAN connects nodes to each other and to the management server through an Ethernet connection. Installation and CSM administration tasks such as running dsh and ssh are done on the cluster VLAN. Host names and attribute values for nodes on the cluster VLAN are stored in the CSM database. Public VLAN A public VLAN connects the cluster nodes and management server to the site network. Applications are accessed and run on cluster nodes over the public VLAN. The public VLAN can be connected to nodes through a second Ethernet adapter in each node, or by routing to each node through the Ethernet switch. Note the following: The management VLAN contains only the CSM management server, HMCs, console servers, and RSAs. The cluster VLAN contains the management server and the nodes. The public network contains the nodes, the users’ workstations, and the administrators’ workstations. It may also contain other servers that are not members of the CSM cluster. The HMCs, console servers, and RSAs can be accessed only through the management server. 3.6.4 Conceptual diagram for pSeries cluster Figure 3-15 on page 155 shows the hardware and networking configuration required for using CSM hardware control with IBM HMC-attached pSeries nodes. The p690 machines in the diagram are single pieces of hardware that have been partitioned by HMCs into 16 LPARs (nodes). The management server connects to the management and cluster VLANs through Ethernet adapters. The nodes must be connected to the cluster VLAN through their first Ethernet adapters (eth0), and directly or indirectly to an HMC. Configuration for a public VLAN is flexible and can be defined by the system administrator. 154 IBM Eserver pSeries Cluster Systems Handbook Figure 3-15 Conceptual diagram for series cluster Redundant HMCs are supported providing automatic failover. The HMC can control multiple server machine types. When combining multiple machine types, substitutions of other server types are based upon the relative weighting factor of the server. 
For example, a p690 is twice the weighted factor of a p630 (eight p690s per HMC compared to sixteen p630s per HMC). Therefore, a supported HMC configuration could control four p690 servers and eight p630 servers. Chapter 3. Network configuration 155 Note: Ethernet and serial cables have limited length; keep those lengths in mind during configuration planning, because they limit the location of the servers in relation to the other system equipment. 3.6.5 pSeries HPS switch network overview SP switch technology used several types of switch adapters to connect system components to the switch network. With the pSeries HPS, all network connections pass through a revolutionary new interface card packaged as either a 2-Link Switch Network Interface or a 4-Link Switch Network Interface (SNI). Each link on the SNI allows message passing between the server bus and the switch. Therefore, each of the multiple links provides a function similar to a previous SP-type switch adapter. These architectural improvements in the SNI and the pSeries HPS create a distinct difference in switch networks when compared to previous adapters and switch networks. For example, using the SP Switch2, a Cluster 1600 switch network could be configured with two SP Switch2 Adapters at each communication point. With the SP Switch2, the two switch adapters formed two independent switch networks. If one of the adapters developed a fault, the network experienced reduced performance. In a similar manner, the pSeries HPS can also use two Switch Network Interfaces at each communication point. However, with the pSeries HPS, both the 2-Link Switch Network Interface and the 4-Link Switch Network Interface allow message passing across direct internal connections on each of the SNIs; refer to Figure 3-16 on page 157. Using these direct internal connections, the interfaces create a single, unified switch network across the SNI pair. As a result, if one of the interfaces develops a fault, its load is automatically directed to the other member of the pair. Although there may be reduced performance at the failed interface component, the overall performance of the pSeries HPS network remains unchanged. 156 IBM Eserver pSeries Cluster Systems Handbook Figure 3-16 Conceptual view of a 2-Link SNI card Topology overview Each switch link may be shared by a maximum of sixteen independent logical partitions within a server (configured by the HMC). It is important to note that the ability to share links is a major enhancement to system capability. Since each link can be shared by up to sixteen LPARs, one SNI can now take the place of sixteen SP type switch adapters. This allows systems using the pSeries HPS to be configured for the maximum number of LPARs while using a minimum number of SNIs. Another way to look at link sharing is that a single LPAR can also be connected with up to eight SNIs for maximum bandwidth. A pSeries HPS may be used as either a server switch board (SSB) or as an intermediate switch board (ISB). For small networks, the pSeries High Performance Switches are configured as SSBs and the switches are connected to each other. In this configuration, a single SSB can be connected to a maximum of sixteen links located on various configurations of Switch Network Interfaces. This allows the SSB to be connected with up to sixteen other switches. In contrast, large network configurations use pSeries High Performance Switches configured as ISBs to connect the SSBs. If a network has three or fewer SSBs, an ISB is not needed. 
However, if a network has more than three SSBs, Intermediate Switch Boards are required to insure bandwidth availability. For more information, refer to pSeries High Performance Switch Planning, Installation, and Service, GA22-7951. The pSeries HPS network consists of two interconnected SNIs at each communication point. In addition, each SNI has either one, two, or four external links (in a multi-link network, links are always added in pairs). The number and configuration of switch boards required to support these links depends on the total number of links in the network. Chapter 3. Network configuration 157 Note: Single-link SNIs require a converter wrap installed in multi-link SNIs. Adding single-link SNIs to a network places constraints on how the network is cabled. With two interconnected SNIs at each communication point, the pSeries HPS network has parallel paths from the SNIs through the switch. While the SP Switch2 network could also be configured with one or two SP Switch2 Adapters per node, the two adapters created two independent networks between servers. Because those two networks were not interconnected, a message on one network could not be passed to the parallel network. If one of the networks (planes) went down, the system did not have redundancy available. In comparison, the pSeries HPS supports up to eight communication paths between the switch and the server. However, the Switch Network Interfaces for the pSeries HPS are also interconnected within either the 2-Link or the 4-Link SNI package. With this design, should a fault occur on one link, the parallel adapter link will absorb the message load of the faulted adapter. With this architecture, each link on the pSeries HPS network provides full interconnection between all SNIs. Because the SNIs are completely interconnected, the parallel paths can be thought of as a unified network. 3.6.6 Switch Network Manager (SNM) Switch Network Manager (SNM) is a new approach to switch management. SNM software resides on the Hardware Management Console (HMC), and manages networks consisting of pSeries High Performance Switches and 2-Link or 4-Link Switch Network Interfaces. SNM is intended to be more LAN-like in its operation, meaning that the same level of control given in prior versions of switch management is no longer required. With SNM software, pSeries HPS networks do not require specific commands (like the SP “Ecommands”) to start or stop switch operations, bring servers partitions on and off the switch, or perform other functions previously driven by commands. Instead, the SNM software dynamically determines the network topology and initializes the active switch components. When new components are added or removed, SNM again manages this with little or no user intervention. With the pSeries HPS, all servers or server logical partitions (LPARs) act as peers on the switch. In contrast to previous SP switch types, the cluster does not have a switch primary node. Instead of a primary node, the pSeries HPS network uses the interfaces provided by the HMC service network and SNM software to configure, initialize, control, and monitor the switch network. 158 IBM Eserver pSeries Cluster Systems Handbook SNM configuration functions SNM dynamically sets up the internals of the network and defines its physical endpoints. This software also provides miswire detection, helps balance network bandwidth, offers limited capabilities to help plan network upgrades, and supplies service actions for non-operational components. 
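To put this in perspective, on SP Switch and SP Switch2 systems managed by PSSP the administrator drives the switch explicitly with the Ecommands, as in the following hedged illustration (node numbers are examples only):

   # Classic SP switch management, not needed with SNM and the pSeries HPS
   Estart            # initialize and start the switch fabric
   Efence 5          # take node 5 off the switch
   Eunfence 5        # bring node 5 back onto the switch

With SNM, the equivalent operations are handled automatically by the software running on the HMC.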
SNM initialization functions SNM initialization functions allow you to set the operational parameters of the network. Initialization runs automatically once the network configuration is known. During initialization, SNM communicates with each active network component (switch and SNI), sets up their operational parameters, delivers route information to the endpoints, and signals to each server OS the availability of message passing over the network. SNM control functions SNM provides control and monitoring capabilities through the IBM Web-based System Manager Graphical User Interface (GUI) for functions such as: Power controls System status Diagnostics Network topology services Switch hardware control SNM monitoring functions SNM’s monitoring functions allow the software to detect, recover from, and report error conditions in the network. When errors are detected, SNM determines the severity of the error, takes the appropriate action to recover or disable the failing component, and creates serviceable events when appropriate. In most network configurations, these errors will not result in network outages. Service Focal Point (SFP) In a partitioned environment, each partition runs independently and is not aware of other partitions on the system. If there is a problem with a shared resource such as a managed system power supply, all active partitions will report the same error. Service Focal Point recognizes that these errors repeat and filters them into one serviceable event for the service representative to review. Note: Service Focal Point must be installed on the HMC. From the Service Focal Point interface, you can execute maintenance procedures such as examining the error log history, checking for components requiring replacement, and performing a Field Replaceable Unit (FRU) replacement. With Service Agent installed on the HMC, the serviceable events captured by SFP are automatically sent to IBM (call-home support) and automatically generate a maintenance request. Note: The pSeries HPS can be used with the pSeries 690 and pSeries 655 only. 3.6.7 Considerations for Cluster 1600 managed by CSM network This section contains the following network hardware planning information: Administrative LAN Ethernet adapters and cabling RS-232 and RS-422 serial cabling requirements Switch interfaces and cabling Administrative LAN Ethernet adapters and cabling This includes the management server and all attached servers using customer-supplied twisted pair Ethernet cable. The LAN connection may be attached to the native Ethernet port on the HMC, or through additional PCI adapters placed in the HMC. pSeries High Performance Switches are not directly connected to the administrative LAN. All LAN adapters and cables are server-mounted and server-connected. RS-232 and RS-422 serial cabling requirements This requires a serial network to monitor and control system variables such as heartbeat and power. pSeries High Performance Switches do not require RS-232 or RS-422 connections to the HMC. However, all switch-only frames housing the pSeries HPS, and any server frame connected to the pSeries HPS, require a pair of RS-422 connections to the frame BPC. Note: Be aware of the following: RS-232 to RS-422 converters are included with the RS-422 cables shipped with F/C 8122 and F/C 8123.
In contrast to other p690 installations that do not require RS-422 connections, all p690 (M/T 7040) servers connected to the pSeries HPS require RS-422 connections between the M/T 7040-61R frame BPCs and the HMC. Switch interfaces and cabling There are two types of switch connections (interfaces): server-to-switch, and switch-to-switch. Server-to-switch The server-to-switch interface requires three component types to connect each server or LPAR to the switch: Switch Network Interfaces on the server side of the connection Copper cable switch port connectors on the switch side of the interface Up to four copper cables connecting the server side to the switch side of the interface Switch-to-switch The switch-to-switch interface requires two component types to interconnect the switches on the network: Paired switch port connectors for each switch interconnection using either: – Copper cable option – Fiber optic cable option Cables to complete the switch-to-switch interface using either: – Two copper cables – Four fiber optic cables 3.6.8 Examples In this section, we provide sample scenarios for the pSeries High Performance Switch (HPS) in a Cluster 1600, as shown in Figure 3-17 on page 163. In a Cluster 1600 with a pSeries HPS, one of the LPARs is the management server. The pSeries HPS is connected to the pSeries 655 and the pSeries 690 through switch interface cables. Chapter 3. Network configuration 161 The HMC is connected to the following: 1. The administrative LAN network – This includes the management server and all attached servers using customer-supplied twisted pair Ethernet cables. – The LAN connection may be attached to the native Ethernet port on the HMC, or through additional PCI adapters placed in the HMC. 2. The environmental control network – This requires a serial network used to monitor and control system variables such as heartbeat and power. – This requires RS-232 connections to the HMC 1 serial port on all servers (redundant HMC configurations also use the HMC 2 server port). – This requires RS-422 connections to both BPC ports on all cluster frames (including M/T 7040-61R frames and M/T 7040-W42 switch-only frames). Note: Redundant HMCs are not required, but this optional configuration is available. When configured, redundant HMCs require separate cables from both HMCs to all required connection points. For additional information on redundant HMCs, refer to IBM Hardware Management Console for pSeries Installation and Operations Guide, SA38-0590. For details about the limitation and restrictions on the HPS network, refer to “Cluster 1600 hardware” on page 11. 162 IBM Eserver pSeries Cluster Systems Handbook M/T 7039 (p655) server RS-232 Ethernet HMC RS-422 Switch Ethernet Server LPAR configured as management server ---- M/T 7040 (p690) server (M/T 7040-61R frame) Switch HMC RS-232 Ethernet RS-422 pSeries High Performance Switches (M/T 7045 SW4) mounted in M/T 7040-W42 frame Figure 3-17 Conceptual pSeries HPS Cluster 1600 3.6.9 Redundant HMC Layout for pseries HPS in Cluster 1600 We describe a sample scenario as shown in Figure 3-18 on page 164 for the pSeries HPS in a Cluster 1600 with redundant HMCs. In a Cluster 1600 with a pSeries HPS with redundant HMCs, one of the pSeries server M/T 7028 is the management server. The pSeries HPS is connected to the pSeries 655 and the pSeries 690 through the switch interface cables, and the HMC is connected to the pSeries HPS via an RS/422 cable. The pSeries HPS requires a Model 7315-C01 as the HMC. 
When used as part of a Cluster 1600, the pSeries HPS requires an HMC. Figure 3-18 Redundant HMC layout for pSeries HPS in Cluster 1600 3.6.10 Redundant layout for pSeries in the Cluster 1600 Figure 3-19 on page 165 shows a cluster of pSeries servers, with all nodes connected to the Ethernet LAN. To achieve high availability, two HMCs are used. Each HMC is connected to all pSeries servers via RS-232/RS-422 connections. The connection between the management server and the HMC is via an Ethernet connection. Figure 3-19 Redundant layout for pSeries in a Cluster 1600 3.6.11 Conceptual Cluster 1600 without a pSeries HPS Figure 3-20 on page 166 shows classic SP nodes with frame and pSeries servers; all pSeries and the SP nodes are connected to the Ethernet LAN. One HMC is used to connect to all the pSeries servers over RS-232/RS-422. To achieve high availability, two HMCs can be used. Each HMC can be connected to all pSeries servers via RS-232/RS-422 connections. The management server is connected to the SP frame with RS-232 and Ethernet connections. The connection between the management server and the HMC is via an Ethernet connection. Figure 3-20 Conceptual Cluster 1600 without HPS 3.6.12 Management Server with two HMCs and four pSeries Figure 3-21 on page 167 shows four pSeries servers in LPAR mode, with all nodes connected to the Ethernet LAN. To achieve high availability, two HMCs are used. Each HMC is connected to all pSeries servers via RS-232 connections. The connection between the management server and the HMC is via a trusted Ethernet connection. Figure 3-21 Four pSeries with two redundant HMCs and one management server To avoid having the HMC become, potentially, a single point of failure, we recommend that you connect each pSeries to two different HMCs. That way, in the event of an HMC failure, you are still capable of reaching the pSeries through the second HMC connection. Depending on how many pSeries servers you have, additional 8-port asynchronous adapters may be required. Chapter 4.
Software support In this chapter, we describe the software available for Cluster 1600 systems: Parallel System Support Programs (PSSP) Cluster Systems Management (CSM) General Parallel File System (GPFS) LoadLeveler (LL) Scientific Subroutine Library (ESSL, PESSL, MASS) Parallel Environment (PE) Performance Toolbox (PTX®) and Performance AIDE (PAIDE) Note: In this redbook, the terms “managed nodes” and “nodes” are used interchangeably to mean “managed servers”. © Copyright IBM Corp. 2003. All rights reserved. 169 4.1 Software components of the Cluster 1600 The Cluster 1600 leverages and extends the software capabilities of the very successful RS/6000 SP. The full suite of software offered in Cluster 1600 is listed in Table 4-1. Table 4-1 Software components on the Cluster 1600 Software component Products Cluster Management Software Parallel System Support Programs (PSSP) Cluster Systems Management (CSM) Cluster File System General Parallel File System (GPFS) High Performance Computing Software Suite LoadLeveler (LL) Parallel Environment (PE) Scientific Subroutines Libraries (SSL): – Engineering and Scientific Subroutine Libraries (ESSL) – Parallel ESSL High Availability Cluster Software High Availability Cluster Multiprocessing (HACMP) In this chapter, we discuss the software components in the different cluster management environments, as shown in Figure 4-1 on page 171 and Figure 4-2 on page 172. Note: All software versions indicated in this chapter indicate the updated release only. Check with your IBM representative to determine the recommended maintenance package (APAR) required for the installation. Figure 4-1 shows a functional diagram of the Cluster 1600 software that is tightly integrated with PSSP. This is implemented in typical RS/6000 SP installations, as well as in Cluster 1600 installations that use PSSP for cluster management. 170 IBM Eserver pSeries Cluster Systems Handbook GPFS LL HACMP ESSL Communication Subsystem RSCT1 PSSP Security Services RSCT2 PE Hardware Control AIX pSeries Hardware 1 Available as a PSSP component before PSSP Version 3.5 2 Integrated into AIX starting with AIX Version 5.1 Figure 4-1 Functional diagram of Cluster 1600 software on PSSP Figure 4-2 on page 172 shows a functional diagram of the Cluster 1600 software that operates within a CSM environment. Chapter 4. Software support 171 GPFS RMC Group Services HACMP LL RM CtSec PE SMS Topology Services CFM C/M RSCT ESSL Hardware Control Probes CSM AIX Hardware Figure 4-2 Functional diagram of Cluster 1600 software on CSM 4.2 Parallel System Support Programs (PSSP) The PSSP software provides a comprehensive suite of applications for the installation, operation, management, and administration of the RS/6000 SP and Cluster 1600 system from a single point of control. The software is offered as program number 5765-D51 and consists mostly of base components and a set of optional components. PSSP offers the following advantages: A full suite of system management applications with the unique functions required to manage the SP and Cluster 1600 system. Simplified installation, operation, and maintenance of all nodes in a Cluster 1600 system. You can operate from a single control workstation. Parallel system management tools for allocating Cluster 1600 resources across the enterprise. 172 IBM Eserver pSeries Cluster Systems Handbook Advanced performance monitoring for consolidated analysis and reporting. Error detection and recovery features that reduce the impact and occurrence of unplanned outages. 
- Coexistence of several software releases within an SP partition, allowing for easier software migration.

4.2.1 Administration and operation

PSSP allows the system administrator or operator to perform all local and remote administrative functions from the control workstation (CWS). This makes the CWS a single point of control, as illustrated in Figure 4-3 on page 174. The PSSP system administration component packages make operating and administering an SP system easy and efficient. This is accomplished through the tools and capabilities that PSSP offers:

Installation facility
PSSP supports the installation of AIX and PSSP on nodes using the Network Installation Management (NIM) environment.

Configuration management
The System Data Repository (SDR) stores node configuration information that can be retrieved from the control workstation, file servers, and nodes.

File management
Files may be grouped together so that any changes to these file collections can be effectively propagated to the appropriate nodes.

User management
Administrators can easily add, change, and delete users and passwords.

Consolidated accounting
This allows you to centralize records at the node level (for tracking use by wall clock time rather than processor time), and to gather statistics on parallel jobs.

System monitoring and control
PSSP provides system management tools that enable system administrators to manage the Cluster 1600. PSSP allows authorized users to monitor and manipulate cluster hardware variables at node or frame level. PSSP also provides for consolidation of error and status logs, to expedite problem determination.

Other administrative tools
- Parallel system management functions across multiple nodes.
- SP Perspectives, a set of graphical user interfaces (GUIs) that help to simplify administrative work.
- Centralized Management Interface, available as a menu-driven interface for system management commands, as well as command line equivalents.

Figure 4-3 Single point of control from CWS in PSSP-managed clusters (diagram: the CWS, running PSSP and holding the SDR, controls SP-attached servers, an RS/6000 SP, and HMC-managed servers)

4.2.2 Reliable Scalable Cluster Technology (RSCT)

RSCT is a set of software components that provides a comprehensive clustering technology on AIX and PSSP. It provides an infrastructure to achieve improved system availability, scalability, manageability, and ease of use for several IBM products.

Note: RSCT is an integrated component of PSSP up to PSSP 3.4. It is no longer shipped with the PSSP product set as of PSSP 3.5, because it is integrated into AIX 5L Version 5.1; hence PSSP 3.5 only runs on AIX 5L Version 5.1, which installs RSCT by default.

The RSCT design differences between PSSP and AIX are beyond the scope of this redbook. You can find these details in Chapter 3 of IBM Eserver Cluster 1600 Managed by PSSP 3.5: What's New, SG24-6617. For more information about RSCT, also refer to IBM RSCT for AIX: Guide and Reference, SA22-7889.

4.2.3 IBM Virtual Shared Disk (VSD)

The IBM Virtual Shared Disk (IBM VSD) allows multiple nodes to access a disk as if the disk were attached locally to each node. This feature can enhance the performance of applications that provide concurrent control for data integrity, such as Oracle databases or the GPFS file system.
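As a minimal sketch of checking the state of virtual shared disks on a node, assuming the VSD optional component is installed (the commands belong to the VSD package, but the exact flags can vary by PSSP level, so verify them against Parallel System Support Programs for AIX, Managing Shared Disks, SA22-7349, before relying on them):

   # List the virtual shared disks configured on this node and their current states
   lsvsd -l

   # Display VSD device driver statistics, such as I/O counts and retries
   statvsd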
IBM Concurrent Virtual Shared Disk (CVSD)
The IBM Concurrent Virtual Shared Disk (CVSD) takes advantage of the concurrent disk access environment supplied by AIX. CVSD uses the services of the Concurrent LVM (CLVM), which provides the synchronization of LVM and the management of concurrency for system administration services.

IBM Recoverable Virtual Shared Disk (RVSD)
The IBM Recoverable Virtual Shared Disk (IBM RVSD) provides recovery from failure of virtual shared disk server nodes and is designed to ensure continuous data and system access from anywhere in the cluster.

Hashed Shared Disk
This component works with the IBM Virtual Shared Disk component to offer data striping for your virtual shared disks.

Note: VSD, CVSD, and RVSD are optional components of the PSSP installation. For details on VSD and RVSD, see Parallel System Support Programs for AIX, Managing Shared Disks, SA22-7349.

4.2.4 Security

PSSP 3.5 offers the following security services:
- Standard AIX authentication
- Kerberos Version 4 authentication, using either the PSSP or the Andrew File System (AFS) components
- Kerberos Version 5 authentication, provided by the Distributed Computing Environment (DCE) software

You can use them individually, in combination, or choose not to use them at all.

4.2.5 Communication subsystem

The Communication subsystem software supports the SP Switch and SP Switch2, including the device drivers. It allows for switch configuration and initialization, and provides adapter diagnostics, fault handling, and parallel communications application programming interfaces (APIs).

4.2.6 Network Time Protocol (NTP)

NTP can be used to synchronize the time-of-day clocks on your control workstation and processor nodes in one of the following ways:
- Using an established NTP server on your network
- Using an NTP server from the Internet
- Running NTP on one of the nodes to generate a consensus time

4.2.7 System availability

To significantly improve system availability, PSSP also contains functions, and interfaces to other products, that can help reduce unplanned outages and minimize the impact of outages that do occur.

System partitioning
This allows the system to be organized into non-overlapping groups of nodes for various purposes, such as testing new software or creating different production environments. However, system partitioning is only supported in switchless or SP Switch-attached SP systems. It is not supported in systems with the SP Switch2 or in switchless clustered server systems.

Coexistence can reduce scheduled maintenance time
PSSP can coexist with several releases within the Cluster 1600, thus allowing easier migration to new software levels.

Node isolation
This removes a node or server in the cluster from active duty and enables it to be reintegrated without causing an SP Switch fault or disrupting switch traffic. This isolation is useful for correcting an error condition or installing new hardware and software without impacting production.

Error recovery on an SP Switch
This is managed by the active primary node. The primary node handles switch operations, processes errors reported from the switch, and initializes the switch fabric. A separate node can be designated to take over this function in order to eliminate the primary node as a single point of failure.
High Availability Control Workstation (HACWS)
This connects two control workstations (with HACMP installed) to an SP system to provide a backup control workstation in the event the primary one becomes unavailable (only one is active at any time). HACWS does not automatically support SP-attached or clustered servers. If you want to use an SP-attached or clustered server in an HACWS configuration, see the chapter entitled "Planning for a high availability control workstation" in IBM RS/6000 SP: Planning Volume 2, Control Workstation and Software Environment, GA22-7281, for limits and restrictions.

4.2.8 Other PSSP services

Other services offered in PSSP are as follows:
- The IBM eServer Cluster Information Center provides one simple interface for all softcopy SP documentation and information resources. It consists of HTML, Java, and JavaScript files, and it works with a Web browser.
- The Resource Center provides access to a variety of information, including publications, READMEs, Redbooks, white papers, product information, and up-to-date service information.
- PSSP supports a multi-threaded, standards-compliant Message Passing Interface (MPI) through IBM Parallel Environment for AIX (PE), while maintaining its single-threaded MPI support. In addition, PSSP includes a Low-level Application Programming Interface (LAPI) with a flexible, active-message style communications programming model on the SP Switch.
- PSSP offers the perl programming language for developing system-wide shell scripts.
- PSSP comes with the Tool command language (Tcl) for controlling and extending applications.

4.2.9 New in PSSP 3.5

PSSP 3.5 was announced on October 8, 2002. It offers functional enhancements over the previous release, including the following:

Use of the 64-bit kernel
PSSP 3.5 provides support for the 64-bit kernel version of AIX 5L V5.1 or later. This includes the software stack supported by PSSP 3.5 (GPFS, MPI, Parallel ESSL, LAPI, KLAPI, and IBM VSD). This also extends the capability to address physical memory beyond the 96 GB limit of the 32-bit kernel.

Important: The 64-bit kernel support only exists on the CWS and the managed nodes that meet the following requirements:
- Running PSSP 3.5 or later
- Running AIX 5L Version 5.1 Maintenance Level 3 (IY32749) or later
- Running the AIX 64-bit kernel on supported 64-bit hardware

Hardware support
PSSP 3.5 supports logical partitions (LPARs) of pSeries servers as PSSP nodes in a Cluster 1600 or SP system. These nodes require AIX 5L V5.1. The pSeries servers are supported in an SP-attached configuration with the SP Switch2, the SP Switch, or no switch, or in a clustered server configuration.

Switch support
Switch support now allows the administrator to specify which set of nodes to exclude from serving as the switch primary or primary backup, and which nodes are available for that purpose.

Note: IBM does not intend to support the next generation IBM eServer High Performance Switch on PSSP-managed clusters.

IBM Virtual Shared Disk
Virtual Shared Disk (VSD) has been enhanced for performance when using IP-based mode, and adds improved diagnostics.

For details about the enhancements in PSSP 3.5, see Chapter 4 of IBM Eserver Cluster 1600 Managed by PSSP 3.5: What's New, SG24-6617.

4.2.10 Software requirements

Table 4-2 lists the minimum software requirements for PSSP 3.5.
Table 4-2 PSSP 3.5 software requirements
AIX version                                  Other software
AIX 5L V5.1 with AIX service at 5100-03      C for AIX, V6.0 or later
                                             VisualAge C++ Professional for AIX, V6.0 or later

Important: PSSP 3.5 is the last release of PSSP. For more information about PSSP and CSM plans, contact your IBM technical representative or see:
http://www.ibm.com/servers/eserver/clusters/software

4.2.11 Software compatibility matrix

Table 4-3 shows the software compatibility matrix for PSSP 3.5.

Table 4-3 PSSP 3.5 software compatibility matrix
Software stack products                     Supported version
General Parallel File System (5765-F64)     Version 2 Release 1
LoadLeveler (5765-E69)                      Version 3 Release 2
Parallel Environment (5765-D93)             Version 4 Release 1
ESSL for AIX (5765-C42)                     No dependencies on PSSP. ESSL Version 4 Release 1 and Version 3 Release 3 are supported on the Cluster 1600 with the appropriate level of the AIX operating system.
Parallel ESSL for AIX (5765-C41)            Version 3 Release 1
HACMP (5765-E54)                            Version 5 Release 1

4.2.12 Documentation references - PSSP

You may wish to refer to the following documentation for more information about PSSP:
- IBM Eserver Cluster 1600 Managed by PSSP 3.5: What's New, SG24-6617
- Parallel System Support Programs for AIX, Administration Guide, SA22-7348
- Parallel System Support Programs for AIX, Managing Shared Disks, SA22-7349
- IBM RS/6000 SP: Planning Volume 2, Control Workstation and Software Environment, GA22-7281

4.3 Cluster Systems Management (CSM)

Cluster Systems Management (CSM) for AIX and Linux is designed for simple, low-cost management of distributed and clustered IBM pSeries and xSeries servers in technical and commercial computing environments. CSM, included with the IBM Cluster 1600 and Cluster 1350, dramatically simplifies administration of a cluster by providing management from a single point of control. CSM is available for managing homogeneous clusters of xSeries servers running Linux or pSeries servers running AIX, or heterogeneous clusters that include both. CSM is offered as program number 5765-F67.

CSM offers the following advantages:
- It provides the basis for consistent cluster management for both AIX and Linux, which will enable a tighter integration of these solutions over time.
- It allows server management of a hybrid collection of AIX or Linux servers (or nodes) from a single point of control.
- It has a modular architecture, in which both IBM software and certain open source software can be integrated into a complete system management solution.
- It leverages the rich heritage and proven technology of the SP by utilizing and deriving software from PSSP.
- It is built on Reliable Scalable Cluster Technology (RSCT), an AIX-based clustering technology.
- It can provide tighter integration with AIX, because CSM is now part of the AIX 5L distribution.
- It is designed for large scaling and has the ability to handle distributed operations in parallel.
- It is designed to be independent of any switch topology or other types of domains, such as those used in the SP.
- It can manage multiple high availability clusters inside a given CSM domain.
- It exploits advanced hardware management features of the servers in the Cluster 1600 and Cluster 1350.

The discussion of CSM in this section mainly focuses on managing clusters of pSeries servers running the AIX operating system.
4.3.1 Administration and operation

CSM has been designed to facilitate easy and efficient systems management of a cluster of servers with an integrated collection of tools, utilities, and functions. Conceptually, CSM-managed clusters are similar to those managed by PSSP. From a high-level perspective, the entire cluster consists of a management server and one or more managed servers, as shown in Figure 4-4. The CSM software is packaged into a "server package" that runs on the management server and a "client package" that runs on each managed server. The size of the CSM client package is small, thus putting a minimum of overhead on the nodes in the cluster.

Figure 4-4 An overview of a CSM-managed cluster (diagram: the management server, a single point of control to operate, maintain, and monitor the entire cluster, and the managed servers defined to be in the cluster; the management server can also be another server inside the cluster)

Tip: While a CSM-managed cluster can consist of a set of AIX-based pSeries, Linux-based pSeries, Linux-based xSeries servers, or a combination of the three, a CSM-managed Cluster 1600 system is uniquely identified by a pSeries management server running the AIX 5L operating system.

Cluster management from a single point of control
CSM allows the system administrator or operator to perform all system administrative functions from the management server. This makes the management server a single point of control. The cluster management capability allows an administrator to:
- Add nodes to the cluster (installation of the operating system and/or CSM can be done at that time), and remove nodes from the cluster.
- View and/or change information for one or more nodes in the cluster.

Concept of node groups
Node groups can be formed inside the CSM cluster and managed and monitored as distinct entities, as shown in Figure 4-5.

Figure 4-5 Forming a node group within a cluster (diagram: a node group drawn around a subset of the managed nodes controlled by the management server)

Node groups can be defined as a static list of nodes, or they can be defined to be dynamic, containing the nodes that match one or more characteristics. This is illustrated in the following examples:
1. We predefine a dynamic node group for Linux nodes and a dynamic node group for AIX nodes. As nodes are added to the cluster, they automatically become part of the corresponding node group.
2. An administrator can create a node group for a particular type of hardware and monitor that hardware. If new machines of that hardware type are added to the cluster, monitoring automatically starts for them.

Installation and configuration
Software installation and configuration can be performed centrally on all cluster nodes using the Network Installation Manager (NIM), which facilitates the remote installation of AIX on the cluster nodes. Figure 4-6 on page 183 illustrates the following flow:
- The administrator installs the management server and defines the nodes to be in the cluster; CSM can then remotely perform a parallel network operating system install of the nodes.
- CSM updates the software on the nodes, such as new CSM versions and new open source updates. If a node is down during an update, CSM automatically performs the update when the node comes back up.
- CSM automatically sets up the security configuration for the underlying cluster infrastructure.
- It can also do the necessary setup for rsh or ssh (exchange of ssh keys).

Figure 4-6 CSM installation and setup (diagram: the management server pushing the operating system, CSM, and security configuration to each managed node)

For details of AIX installation on cluster nodes, see Chapter 6 of IBM Cluster Systems Management for AIX 5L, Planning and Installation Guide, SA22-7919, or Chapter 5 of An Introduction to CSM 1.3 for AIX 5L, SG24-6859.

Tip: CSM also allows you to tailor and run user customization scripts during installation and update.

Remote and distributed commands
The UNIX basic remote shell (rsh), secure shell (ssh), and distributed shell (dsh) are available to system administrators to help manage the nodes efficiently. Administrators can run commands in parallel across nodes or node groups in the cluster, and gather the output, using the dsh command. Administrators can choose to use either the rsh or the ssh command within dsh. The dshbak command can format the output returned from dsh if desired; for example, identical output from more than one node can be collapsed so that it is displayed only once.

Configuration File Management (CFM)
CFM is the file distribution technology used by CSM. It allows common configuration files and group-specific configuration files to be maintained once in a CSM server directory (/cfmroot), and then be automatically distributed to all the nodes or node groups in the cluster, so administrators do not have to copy files manually across the nodes in the cluster. This file maintenance and propagation is illustrated in Figure 4-7.

Figure 4-7 Configuration file maintenance and propagation using CFM (diagram: the /cfmroot server repository on the CSM server distributing files to AIX 5L and Red Hat client nodes)

CFM makes use of the rdist command for file transfer. The rdist command can use rsh or ssh, so administrators can decide which one to use.

Tip: CFM can make use of variables for IP address or hostname substitution in files being transferred. Preprocessing and postprocessing scripts can be run before and after a file is copied, for example to stop and start daemons.

For details of CFM, refer to Chapter 1 of IBM Cluster Systems Management for AIX 5L, Administration Guide, SA22-7918.

Management interfaces
CSM functions can be administered using the following management interfaces:
- Command line interface
- Distributed Command Execution Manager GUI
- AIX Systems Management Interface Tool (SMIT)
- Web-based System Manager

CSM uses a set of Web-based System Manager plug-ins to provide a common look and feel in a Web-based System Manager environment. This provides an interface for monitoring and managing a CSM cluster. For an overview of the plug-ins, see Chapter 6 of An Introduction to CSM 1.3 for AIX 5L, SG24-6859; that redbook also shows screenshots of the Web-based System Manager interfaces.

Hardware control capability
CSM provides hardware control capability to power nodes on and off, reboot them, bring up a remote hardware console through a tty, and query nodes in the cluster. This can be done remotely, without being physically present at the server or hardware console. This capability is currently available for xSeries machines, BladeCenter, and HMC-managed pSeries servers, such as those shown in Figure 4-8.
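As a brief illustration of the distributed command and hardware control functions described above, the following commands could be run on the CSM management server. The node and node group names are hypothetical, and the exact options can differ slightly between CSM levels, so treat this as a sketch rather than a definitive procedure:

   # Run a command on every node in the cluster and collapse identical output
   dsh -a date | dshbak -c

   # Run a command only on the members of a node group
   dsh -N AIXNodes oslevel

   # Query and change node power state through the hardware control layer
   rpower -a query
   rpower -n node01 reboot

   # Open a remote hardware console to a node
   rconsole -n node01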
Figure 4-8 Hardware control capability in CSM (diagram: the management server controlling HMC-controlled pSeries servers through their HMC, and xSeries nodes through their hardware-control service processors)

The hardware control capability is designed with a layer of code (which performs the hardware functions) that references a library based on the hardware type of the node; this allows you to "plug in" a library for the specific hardware to be supported. This capability was specifically designed so that additional hardware support can easily be added.

Cluster management across geographies
CSM remote hardware support on HMC-managed servers extends the AIX cluster management domain beyond the data center. This is illustrated in the example in Figure 4-9 on page 186, which shows an organization with data center operations in four different geographies. All the HMC-managed servers can be managed remotely from a single CSM management server. While the hardware function is managed through the HMC located at the respective data center, software management is performed through network connectivity from the management server to the clusters.

Figure 4-9 Extending cluster management across geographies (diagram: a management server in Poughkeepsie managing HMC-attached servers in Edinburgh, Singapore, Long Island, and New Delhi)

System event monitoring
An administrator can set up monitoring for various conditions across nodes or node groups in the cluster, and have actions run in response to events that occur in the cluster. This is illustrated in Figure 4-10.

Figure 4-10 Event monitoring and automated response in a CSM cluster (diagram: the administrator sets up monitoring for particular conditions on the management server; events occur on the managed nodes, and actions are run in response to those events)

Conditions that can be monitored
Conditions that can be monitored include network accessibility, power status, application or daemon status, and the utilization level of CPU, memory, and file systems, among others. Predefined "Conditions" are shipped with CSM for every type of information that can be monitored, so that monitoring can be started "out of the box".

Actions in response to occurring conditions
You can run the following actions in response to events that occur in the cluster:
- Commands can be run on the management server or on any node of the cluster.
- Notification actions such as logging, e-mailing, or paging can be run.
- SNMP traps can be generated in response to events in the cluster.

Predefined "Responses" for e-mail notification, SNMP traps, logging, and displaying a message to a console are provided. Users can quickly associate a condition with a response and start monitoring, and administrators can easily customize such conditions and responses to fit their own needs. In addition to notification actions, administrator-defined recovery actions can also be run; such recovery actions can include cleaning up file systems that are filling up, or restarting a critical application that went down.

Resources and events in a CSM-managed cluster are monitored through the Resource Monitoring and Control (RMC) infrastructure. RMC is described in 4.3.2, "Reliable Scalable Cluster Technology (RSCT)" on page 188. You can also find useful information in A Practical Guide for Resource Monitoring and Control (RMC), SG24-6615.
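The monitoring and response behavior just described is driven through the RMC commands delivered with RSCT. A minimal sketch follows; the condition and response names used here are examples of the predefined ones, so list the actual names on your system with lscondition and lsresponse before using them:

   # Show the predefined conditions and responses available on the management server
   lscondition
   lsresponse

   # Associate a condition with a response, then activate the pairing
   mkcondresp "/var space used" "Log event anytime"
   startcondresp "/var space used" "Log event anytime"

   # Verify which condition/response pairs are currently being monitored
   lscondresp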
Diagnostic probes
Administrators can run diagnostic probes provided by CSM to automatically perform "health checks" of a particular software function if a problem is suspected. These probes are small standalone programs that can be run against a specific part of a system or subsystem to automate problem determination and isolation. They can also be run periodically, or automatically as a response to a condition occurring in the system. Current probes shipped with CSM include probes to diagnose network connectivity, NFS health, and the status of the daemons that CSM runs. Administrators can also write their own probes and add them to the existing CSM probe infrastructure. For a complete list of probes in CSM 1.3, refer to IBM Cluster Systems Management for AIX 5L, Administration Guide, SA22-7918.

4.3.2 Reliable Scalable Cluster Technology (RSCT)

RSCT, a set of software components, provides a comprehensive clustering technology on AIX and Linux. It brings an infrastructure that helps achieve improved system availability, scalability, and ease of use for several IBM products. CSM utilizes RSCT as its clustering infrastructure. A diagrammatic representation of the RSCT components on AIX is shown in Figure 4-2 on page 172, and the main components are discussed in the next sections.

Resource Monitoring and Control (RMC)
RMC is a distributed system monitoring application provided by RSCT. It allows you to define conditions on the system to be monitored, and can automatically respond to the system events that occur when those predefined conditions are met. For example, you can configure RMC to monitor the usage of file systems and automatically expand them if a predefined threshold is exceeded. You can also set different thresholds for file systems in different node groups. AIX 5L Version 5.2 includes more than 80 predefined conditions that can be used to configure automatic actions in response to a system event. You can find details of RMC in Chapter 3 of IBM Cluster Systems Management for AIX 5L, Administration Guide, SA22-7918, and in A Practical Guide for Resource Monitoring and Control (RMC), SG24-6615.

Resource managers (RMs)
Resource managers are daemon processes that provide the interface between RMC and the actual physical or logical resources. RSCT provides a core set of resource managers for managing base resources on single systems and across clusters. Additional resource managers are provided by cluster licensed program products (such as CSM, which contains the Domain Management resource manager). The RSCT core resource managers are:
- Audit Log resource manager (AuditLogRM)
- Event Response resource manager (ERRM)
- File System resource manager (FileSystemRM)
- Host resource manager (HostRM)
- Sensor resource manager (SensorRM)

For details on RMC and the core resource managers, see Chapter 3 of IBM Reliable Scalable Cluster Technology for AIX 5L, Administration Guide, SA22-7889.

Cluster security services (CtSec)
CtSec is used by RMC to identify and authenticate a node within the cluster.

Note: This is not to be confused with authorization (granting or denying access to resources), which is handled by RMC.

CtSec uses credential-based authentication that enables:
- A client process to present information to a server in a way that cannot be imitated
- A server process to clearly identify the client and the validity of the information

Credential-based authentication uses a third party that both the client and the server trust.
In this release of CSM, only UNIX host-based authentication is supported.

Group Services and Topology Services
Although included in RSCT, these are not used in the management domain structure of CSM. Group Services and Topology Services are used in peer domains by applications such as HACMP/ES and GPFS; they provide node/process coordination and node/network failure detection. These services are often referred to as hags (high availability group services) and hats (high availability topology services); their corresponding daemons are known as hagsd and hatsd.

For more information about RSCT, refer to IBM RSCT for AIX: Guide and Reference, SA22-7889.

4.3.3 New in CSM 1.3.2 for AIX

CSM 1.3.2, announced in September 2003, includes the following enhancements:
- Backup and restore scripts for CSM.
- Install customization scripts.
- Install and hardware control probes for enhanced problem diagnosis.
- CSM installation and configuration usability enhancements.
- A NIM enhancement to support secondary adapter configuration. This allows additional adapters, such as Ethernet or IBM High Performance Switch adapters, to be configured at install time.
- Support for a new command, cmstat, which provides a snapshot of a cluster similar to the spmon -d command in PSSP.

Other enhancements include:
- Linux on pSeries support:
  - SuSE Linux Enterprise Server (SLES) 8 on pSeries p630, p630+, p650, p655, and p615
  - Support for mixed AIX and Linux partitions
- 1024-way scaling (Linux on xSeries)
- Red Hat 9 and SuSE 8.2 (Linux on xSeries)
- Replacement of the System Installation Suite (SIS) with the native SuSE installer, AutoYaST, for SuSE and SLES installs
- AMD Opteron support

4.3.4 Supported platforms

CSM Version 1.3.2 is supported on three platforms:
- AIX 5L on pSeries
- Linux on pSeries
- Linux on xSeries

Table 4-4 lists the platform requirements for CSM 1.3.2.

Table 4-4 CSM 1.3.2 software and hardware requirements
Operating systems                    Hardware requirement
AIX 5.1 or 5.2                       p615, p630, p650, p655, p670, p690
SLES 8
Red Hat 7.2, 7.3, or 8.0             x330, x335, x342, x345, x360, x440
Red Hat AS 2.1                       BladeCenter
SuSE 8.0 or 8.1                      IntelliStation Model 6221
SLES 7 or 8

Note: AIX 5L Version 5.2 or higher is required for the management server.

Switch support
IBM intends to support the IBM eServer High Performance Switch with CSM.

Note: Since the current SP Switch technology is dependent on PSSP, it is not supported by CSM on AIX.

4.3.5 PSSP-to-CSM transition

CSM is designed to be the next generation of cluster systems management software. It encompasses and exploits the best-of-breed advantages of the highly successful, cluster-proven PSSP software and AIX operating systems to meet the demands of present and future cluster computing needs. Figure 4-11 shows a diagrammatic representation of the transition of the PSSP software, the AIX operating system, and CSM. The RSCT and VSD components, which used to be in PSSP, are now developed and packaged in AIX.
Figure 4-11 PSSP, AIX, and CSM software component transition (diagram: PSSP 3.4, with RSCT 1.x and VSD 3.4, and PSSP 3.5, with VSD 3.5, alongside AIX 5.1 and AIX 5.2 with RSCT 2.x, and AIX 5.2 ML2 with RSCT 2.x and VSD 2.3.1, supporting CSM 1.3.0 and CSM 1.3.2)

4.3.6 Documentation references - CSM

The following are suggested documents for further reading and reference on CSM:
- An Introduction to CSM 1.3 for AIX 5L, SG24-6859
- A Practical Guide for Resource Monitoring and Control (RMC), SG24-6615
- IBM Cluster Systems Management for AIX 5L, Administration Guide, SA22-7918
- IBM Cluster Systems Management for AIX 5L, Planning and Installation Guide, SA22-7919
- IBM Reliable Scalable Cluster Technology for AIX 5L, Administration Guide, SA22-7889

4.4 General Parallel File System (GPFS)

General Parallel File System (GPFS), the IBM high performance file system, provides a single, global file system for multiple nodes within a cluster. It allows parallel and serial applications shared access to files with high performance, availability, and scalability characteristics that surpass those of traditional distributed file systems.

GPFS is used extensively on the IBM RS/6000 SP system to scale file system I/O in order to meet the demanding requirements of applications in areas such as scientific and technical computing, business intelligence, and shared data access in server consolidation. GPFS is offered on the IBM Cluster 1600 and on clusters of IBM pSeries nodes as GPFS for AIX, and on the IBM Cluster 1350 and clusters of selected IBM xSeries nodes as GPFS for Linux. In this section, the discussion focuses on GPFS for AIX. GPFS for AIX 5L Version 2 is offered as program number 5765-F64. You can also find information on GPFS for Linux in the white paper "An Introduction to GPFS v1.3 for Linux, June 2003", available at:
http://www.ibm.com/servers/eserver/clusters/whitepapers/gpfs_linux_intro.html

GPFS offers the following advantages:
- GPFS is designed to provide high performance by striping I/O across multiple disks or storage subsystems (on multiple servers). File system performance can scale as additional server nodes and disk storage are added to the cluster.
- It can be used as a general purpose file system, as it is suitable for many kinds of workloads, from technical to commercial computing.
- GPFS provides global (uniform) access to files.
- GPFS offers many standard UNIX file system interfaces, which cater to application portability.
- GPFS offers data consistency through a token-management system.
- GPFS is designed for high availability through file system logging, replication, and server and disk failover.

4.4.1 Architecture

A simple conceptual diagram of the GPFS environment is shown in Figure 4-12 on page 193. It shows storage devices with data access paths to a set of application nodes within a GPFS nodeset. These nodes are capable of sharing the file system, where the files are striped across the disks.

Figure 4-12 GPFS environment (diagram: GPFS application nodes on a control network, connected through a storage or system network to the storage devices)

There are two types of data access mechanism and connectivity, and they are shown in the examples in Figure 4-13 on page 194 and Figure 4-14 on page 195.

Direct storage attachment to nodes in the GPFS nodeset
Figure 4-13 on page 194 shows the direct attachment of the disk storage subsystem to each node of the GPFS nodeset. The connectivity may be Serial Storage Architecture (SSA) or switched Fibre Channel (FC).
Important: The use of FC loops without a switch was not fully tested at the time of writing. Physical connections other than SSA and FC are also theoretically possible, but untested. We recommend that you contact your IBM representative before trying GPFS in these environments.

Figure 4-13 GPFS in a direct-attached SAN environment (diagram: GPFS compute nodes, each running the application, GPFS, and an SSA/FC driver, connected over an IP network and attached through an FC switch or SSA loops to the disk collections)

Virtual Shared Disk (VSD)
Figure 4-14 on page 195 shows a mechanism involving the use of the Virtual Shared Disk (VSD). In this example, only a subset of the nodes, called the "I/O nodes" or "VSD server nodes", have direct physical connectivity to the disk subsystem. VSD is a software simulation of a SAN that runs across a high-speed interconnect (such as the IBM SP Switch2) which interconnects the nodes in the GPFS nodeset. The VSD servers act as gateways, routing disk operations between the application nodes and the disk subsystem.

Note: The use of the SP Switch requires the implementation of PSSP and its software prerequisites. You can choose to use CSM when using the IBM eServer High Performance Switch to interconnect the nodes.

Figure 4-14 GPFS using VSD over a high performance interconnect (diagram: GPFS compute nodes running the application, GPFS, and VSD, connected through the SP Switch to the I/O nodes, which run VSD and the disk device interface and attach to the disks via SSA, FC, SCSI, and so on)

The IBM Recoverable Virtual Shared Disk (RVSD) component allows for failure recovery of the VSD. It automatically manages the VSDs by detecting error conditions, such as node failures, adapter failures, and disk failures, and then switches access to the disk from the primary node to the secondary node so that your application can continue to operate normally.

Tip: We recommend that GPFS installations twin-tail their storage to multiple RVSD servers if using RVSD. A simplified diagram of this setup is shown in Figure 4-15 on page 196.

For details of the GPFS architecture and operations, refer to the white paper An Introduction to GPFS v2.1 for AIX, available at:
http://www.ibm.com/servers/eserver/pseries/software/whitepapers/gpfs_intro.html
You can also refer to the white paper GPFS Primer for AIX Clusters, available at:
http://www.ibm.com/servers/eserver/pseries/software/whitepapers/gpfs_primer.html
For more useful information, see General Parallel File System for AIX 5L in an RSCT Peer Domain, Concepts, Planning, and Installation Guide, GA22-7940.

Figure 4-15 Simplified view of a twin-tailed disk (diagram: a storage device with twin-tailed disk connections to a primary server node and a secondary server node, which serve a client node)

4.4.2 Administration and operation

GPFS provides functions that simplify multinode administration. It allows the system administrator to issue commands from any node or server in the cluster. These functions are based on, and are in addition to, the AIX administrative commands. A single GPFS multinode command can perform a file system function across the entire GPFS cluster. In addition, most existing UNIX utilities also run unchanged. All of these capabilities allow GPFS to be used where parallel optimization is desired.

GPFS, like all file systems, provides a number of data management services, such as the ability to duplicate, remove, and rename files.
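As a minimal sketch of the multinode administration described above (the device name gpfs0 and the disk descriptor file are hypothetical, and the exact options vary by GPFS release, so check the GPFS command reference for your level):

   # Display the attributes of an existing GPFS file system
   mmlsfs gpfs0

   # Show available space on the disks that make up the file system
   mmdf gpfs0

   # Add the disks described in a descriptor file while the file system stays mounted
   mmadddisk gpfs0 -F /tmp/newdisk.desc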
GPFS supports the file system standards of X/Open 4.0, with minor exceptions. As a result, most AIX and UNIX applications can use GPFS data without modification, and most UNIX utilities run unchanged.

4.4.3 Higher performance and scalability

GPFS is designed to provide superior performance; it has the following features:
- It allows parallel applications simultaneous access to the same file, or to different files, from any node in the GPFS nodeset.
- It supports very large file systems, up to 100 TB per file system.
- It allows data striping across multiple storage devices or storage subsystems, to achieve increased aggregate bandwidth of your file system.
- It provides load balancing across all disks, to maximize their combined throughput.
- It allows concurrent reads and writes from multiple nodes.
- It provides enhanced parallel programming capabilities in conjunction with the implementation of the MPI-IO industry standard.

GPFS scales beyond single-server (node) performance limits by delivering file performance across multiple nodes and disks. In a parallel environment, GPFS can easily outperform other file systems such as the Network File System (NFS), the Journaled File System (JFS), and the Distributed File System (DFS). Unlike NFS and JFS, GPFS file performance scales as additional file server nodes and disks are added to the cluster. Incremental improvements may be made to the file system by adding additional hardware of the same, or even lesser, capability. You can even add or delete disks from a mounted GPFS file system, and you can mount a GPFS file system from every node or server in the cluster.

4.4.4 Recoverability

GPFS can survive many system and I/O failures. It is designed to transparently fail over its lock servers and other GPFS central services, and it can be configured to automatically recover from node, disk connection, disk adapter, and communication network failures.

In a PSSP cluster environment, availability is achieved through the clustering technology of PSSP and the PSSP Recoverable Virtual Shared Disk (RVSD) functions, or through disk-specific recovery capabilities. In an AIX cluster environment, availability is achieved through the cluster technology of either an RSCT peer domain or an HACMP cluster, in combination with the Logical Volume Manager (LVM) component or disk-specific recovery capabilities.

GPFS supports data and metadata replication, to further reduce the chances of losing data if storage media fail. GPFS is a logging file system that allows the recreation of consistent structures for quicker recovery after node failures. GPFS also provides the capability to mount multiple file systems, each of which can have its own recovery scope in the event of component failures. You can eliminate single points of failure in a GPFS solution by organizing the hardware into a number of failure groups. Establishing a redundant access path to the data allows GPFS to find an available path to the data at all times.

4.4.5 Migration

An upgrade to a new release of GPFS can be tested on a system currently running GPFS. This eases migration by allowing a new level of code to be tested without inhibiting the production GPFS application.

4.4.6 New in GPFS 2.1 for AIX 5L

GPFS 2.1 for AIX 5L includes the following enhancements:
- It supports the 64-bit kernel in AIX 5L.
- High Availability Cluster Multi-Processing (HACMP) is no longer required to run GPFS in a non-SP or non-Parallel System Support Programs (non-PSSP) environment.
- It is supported on AIX 5L only.

4.4.7 Software requirements

Table 4-5 lists the software requirements for GPFS in a PSSP-managed cluster.

Table 4-5 Software requirements for GPFS in a PSSP environment
GPFS version   AIX version                      PSSP version   Other software
2.1            5.1                              3.5            IBM VSD and RVSD of PSSP
1.5            5.1 (32-bit kernel) or 4.3.3     3.4            IBM VSD and RVSD of PSSP

Important: Note the following:
- The SP Switch or the SP Switch2 is required for the node interconnect in a PSSP environment.
- VSD servers are required for disk connectivity.
- IBM intends to support AIX 5L Version 5.2 with PSSP 3.5 in December 2003, with a new release of GPFS.

Table 4-6 lists the software requirements for GPFS in an RSCT peer domain (rpd)-managed cluster.

Table 4-6 Software requirements for GPFS in an rpd environment
GPFS version   AIX version   CSM version   Other software
2.1            5.2 or 5.1    1.3.2         CSM is not required for GPFS in an rpd cluster. The component that is required to create the peer domain is RSCT, which is delivered with AIX 5L.

Table 4-7 lists the software requirements for GPFS in an HACMP environment.

Table 4-7 Software requirements for GPFS in an HACMP environment
GPFS version   AIX version   HACMP version   CSM version   Other software
2.1            5.2           4.5             1.3.2         CSM is not required for GPFS in an HACMP cluster. The component that is required to create the peer domain is RSCT, which is delivered with HACMP 4.5.
2.1            5.1           4.5             1.3.2         CSM is not required for GPFS in an HACMP cluster. The component that is required to create the peer domain is RSCT, which is delivered with HACMP 4.5.

4.4.8 Documentation references

The following are suggested documents for further reading and reference on GPFS:
- An Introduction to GPFS v2.1 for AIX, white paper, June 2003
  http://www.ibm.com/servers/eserver/pseries/software/whitepapers/gpfs_intro.html
- An Introduction to GPFS v1.3 for Linux, white paper, June 2003
  http://www.ibm.com/servers/eserver/clusters/whitepapers/gpfs_linux_intro.html
- GPFS Primer for AIX Clusters, white paper, March 24, 2003
  http://www.ibm.com/servers/eserver/pseries/software/whitepapers/gpfs_primer.html
- GPFS Primer for Linux Clusters, white paper, March 24, 2003
  http://www.ibm.com/servers/eserver/clusters/whitepapers/gpfs_linux_primer.html
- General Parallel File System Questions and Answers (a useful resource for answers to frequently asked questions about GPFS)
  http://www.ibm.com/servers/eserver/pseries/software/sp/gpfs_faq.pdf
- General Parallel File System for AIX 5L, PSSP Clusters Concepts, Planning, and Installation Guide, GA22-7899
- General Parallel File System for AIX 5L, AIX Clusters Concepts, Planning, and Installation Guide, GA22-7895
- General Parallel File System for AIX 5L in an RSCT Peer Domain, Concepts, Planning, and Installation Guide, GA22-7940

4.5 LoadLeveler

LoadLeveler for AIX 5L is an integral part of the total systems management solution for the Cluster 1600, RS/6000 SP nodes, or clusters of pSeries or RS/6000 servers. It is a network job management and scheduling system that allows users to run more jobs in less time by matching the jobs' processing needs with the available resources. Figure 4-16 shows an example of a LoadLeveler configuration.
LoadLeveler (LL) supports load balancing for interactive Parallel Operating Environment sessions, and focuses on parallel job scheduling and scalability. It can also schedule jobs written for the Network Queuing System (NQS) to run on machines outside the LoadLeveler cluster. LoadLeveler for AIX 5L, Version 3 is available as program number 5765-E69.

Figure 4-16 Example of a LoadLeveler configuration (diagram: a LoadLeveler cluster spanning pSeries servers, an SP, and an RS/6000 cluster of workstations, with submit-only machines and access to NQS, NFS, AFS, and DCE)

LoadLeveler offers the following advantages:
- Its efficient scheduling algorithm helps improve overall system utilization and responsiveness.
- It offers a single point of control for administration, job management, and workload scheduling.
- LoadLeveler provides users full scalability across processors and jobs, and supports jobs scaling across hundreds of cluster nodes.
- It provides checkpoint/restart enhancements to facilitate job recovery.
- It exploits many features of AIX and integrates with the AIX 5L Workload Manager (WLM) for resource enforcement and monitoring. For details, refer to Workload Management with LoadLeveler, SG24-6038.

4.5.1 Administration and operations

LoadLeveler schedules jobs, and provides functions for building, submitting, and processing jobs quickly and efficiently on one or more cluster nodes under its control. Batch jobs can be submitted, either serial or parallel, to run in the background without input from the user. LoadLeveler accepts a job and reviews its requirements; it then determines, from the availability of consumable resources, on which cluster nodes the job will run. (Consumable resources include items such as the number of CPUs and the amount of memory.)

LoadLeveler interfaces
LoadLeveler uses three types of interfaces. They allow users to build, create, submit, and delete jobs, and allow system administrators to configure LoadLeveler, control running jobs, and do accounting.

Command line interface
This interface provides basic and administrative functions.

Application programming interface (API)
This interface allows application programs written by users to interact with the LoadLeveler environment.

Graphical user interface (GUI)
This interface is similar to the command line interface and is designed for ease of use. (However, experienced users and administrators may find the command line interface more efficient than the GUI.)

LoadLeveler cluster
LoadLeveler can submit jobs to a machine if that machine is configured as a member of a LoadLeveler cluster. The machines in a LoadLeveler cluster can have one or more roles in job scheduling:

Scheduling node
This is used to manage jobs from submission to completion.

Central manager
This is the central resource manager and workload balancer. One or more alternate central managers can be set up to prevent a single point of failure.

Execute node
Jobs dispatched by the central manager are run here.

Submit node
This is used to submit jobs to LoadLeveler from outside the cluster, and runs no daemons. (For an explanation of daemons, refer to "LoadLeveler daemons" on page 203.)

An example of job scheduling in LoadLeveler is shown in Figure 4-17 on page 203.
Figure 4-17 An example of job scheduling in a LoadLeveler cluster (diagram: the central manager node runs LoadL_master and LoadL_negotiator; the scheduler node runs LoadL_master and LoadL_schedd and keeps the local job database, accounting information, and the original copy of the executable; the execute node runs LoadL_master, LoadL_startd, LoadL_starters, and LoadL_kbdd and holds the runnable copy of the executable; the submit-only node runs no daemons; each daemon logs state information to disk)

LoadLeveler daemons
LoadLeveler operates through a set of daemons. These daemons control the processes that move jobs through the LoadLeveler cluster, and the process control daemons are themselves managed by a master daemon. You can find detailed descriptions of the LoadLeveler daemons in Chapter 1 of IBM LoadLeveler for AIX 5L: Using and Administering, SA22-7881; other chapters contain complete information on configuring, administering, and using LoadLeveler.

Job definition
Each step of a job is defined in a job command file. LoadLeveler uses the information specified in the job command file to identify the job and to execute the job steps. A detailed description of submitting and managing jobs can be found in Chapter 5 of IBM LoadLeveler for AIX 5L: Using and Administering, SA22-7881.

4.5.2 Capabilities

LoadLeveler provides the following capabilities:

Parallel processing
LoadLeveler interfaces with the Parallel Environment (PE) software to support batch jobs in a parallel operating environment (POE). Parallel Environment is described in "Parallel Environment (PE)" on page 216.

Individual control
Owners of machines in the LoadLeveler cluster can restrict the use of their resources. They can specify the availability of their resources and how they are to be used. For example, some users might allow their workstations to accept any job during the night, but only certain jobs during the day, when they need their resources most. Others might allow LoadLeveler to monitor their keyboard activity and make their workstations available whenever they have been idle for a period of time.

Central control
LoadLeveler gives administrators a complete view of the status of all jobs on the system and the resources available to those jobs. It allows administrators to change the availability of these resources to best meet their computing needs.

Scalability
Nodes can be added to the LoadLeveler cluster to increase the consumable resources available to LoadLeveler.

Automated job tracking
Process tracking allows LoadLeveler to remove any processes (throughout the entire cluster) that are left behind when a job terminates. This ensures that resources held by orphaned processes are freed for future use. You can also automate the removal of old checkpoint files that are no longer needed on the system.

Multiple user space tasks
LoadLeveler supports multiple user space tasks on both the SP Switch and the SP Switch2. Parallel applications that use the LAPI user space API or Parallel Environment MPI with LoadLeveler can use up to 4096 user space tasks.

Additional scheduling algorithms
LoadLeveler supports the Gang scheduler and the Backfill scheduler, in addition to standard reservation scheduling, for more efficient use of system resources.
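To make the job command file and submission flow described above concrete, here is a minimal sketch of a serial job command file. The executable path, class name, and file names are hypothetical; the keywords used are the common ones documented in IBM LoadLeveler for AIX 5L: Using and Administering, SA22-7881:

   # Sample LoadLeveler job command file, saved for example as myjob.cmd
   # @ job_name   = sample_serial
   # @ executable = /u/loadl/bin/myprog
   # @ output     = myjob.$(Cluster).out
   # @ error      = myjob.$(Cluster).err
   # @ class      = short
   # @ queue

The job is submitted with llsubmit myjob.cmd, monitored with llq and llstatus, and removed from the queue, if necessary, with llcancel.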
These identities are then used to determine authorization, thus preventing unauthorized users and programs from misusing resources or disrupting services. Cluster Security Services is a security mechanism similar to DCE, but tailored for a clustered environment. Multi-adapter connection support LoadLeveler now supports multi-adapter connection support. This means you can now use multiple adapters for job processing, and allocate separate memory support for each. You can also request multiple windows per task per protocol, and have the adapters automatically share the workload for the number of requested windows. To facilitate this, LoadLeveler now refers to an adapter by its connection to the network, not by its name. High performance switch support LoadLeveler now supports the IBM eServer pSeries High Performance Switch (HPS). The HPS provides significantly greater communication bandwidth, along with reduced latency and improved fault tolerance. It is able to support multiple adapters per network per node and multiple windows on the same adapter or across multiple adapters – Enhancements for running in an RSCT peer domain A new command, llextRPD, is added to help you set up your administration files. It extracts machine and adapter data from the RSCT peer domain and formats it into stanzas for use in building administration files. A new keyword is added to specify whether GSmonitor uses RMC API access or PSSP SDR access routines: • The keyword is CSMONITOR_DOMAIN=PEER for RMC API access. • The keyword is CSMONITOR_DOMAIN=PSSP For PSSP SDR access routines. – Dynamic adapter configuration support LoadLeveler now offers the option to dynamically detect and handle adapters and adapter changes for machines in the cluster. This option is available only on RSCT peer domain clusters. Chapter 4. Software support 205 Automatic resume for drained startd Improved switch management and recovery now allows LoadLeveler to automatically resume a startd that was drained due to switch table unload errors. Support for POE Wait option for interactive jobs Interactive jobs can now be queued to wait for resources to become available. Enhancements to the llmodify command Attributes such as job class, account number for idle job steps, and job step wall clock limits can be modified with the llmodify command. Enhancements to Data Access API Job class information can be queried through the LoadLeveler Data Access API. 4.5.4 New in LoadLeveler 3.1 LoadLeveler 3.1, announced on November 13, 2001, includes the following enhancements: Striped Communication Striping increases performance, because it allows the job to use multiple communication paths to the node. AIX Workload Manager Integration LoadLeveler can be configured to use WLM to enforce usage of consumable CPU and memory. Gang Scheduling This can be used to improve the overall system utilization and responsiveness to parallel workloads. Checkpoint/Restart This function is enhanced so that it allows you to periodically save the state of jobs automatically. 64-bit support Support for 64-bit applications for interactive and batch jobs that run on POWER nodes and AIX 5L is introduced. Multiple User Space Tasks per adapter This is supported on the SP Switch and SP Switch2. 206 IBM Eserver pSeries Cluster Systems Handbook 4.5.5 Software requirement Table 4-8 lists the software requirements for LoadLeveler on PSSP. 
Table 4-8 Software requirements for LoadLeveler 3.1 and 3.2 on PSSP
LL version   AIX version     Domain                     Other software
3.2          5.2 or higher   RSCT 2.3.1 and PSSP 3.5    PE for AIX Version 4.1, if you plan to run POE jobs in user space mode.
                                                        Java Runtime Environment filesets (Java131.rte.bin, Java131.rte.lib 1.3.1), if you plan to use the graphical user interface or the configuration tasks wizard.
                                                        TaskGuide Runtime Environment fileset (sysmgt.sguide.rte 5.2), if you plan to use the configuration tasks wizard.
                                                        RSCT 2.3.1.0, if you plan to run in an RSCT peer domain or to configure LoadLeveler to support RSCT security services.
3.1          5.1 or higher   PSSP 3.4 or 3.5            PE for AIX Version 3.2, if you plan to run POE jobs in user space mode.
                                                        Fileset ssp.clients 3.2, if you plan to configure LoadLeveler to support DCE security services.
                                                        Java Runtime Environment filesets (Java130.rte.bin, Java130.rte.lib 1.3.0), if you plan to use the graphical user interface or the configuration tasks wizard.
                                                        TaskGuide Runtime Environment fileset (sysmgt.sguide.rte 5.1), if you plan to use the configuration tasks wizard.

4.5.6 LoadLeveler configuration suggestions

The following are some suggestions for LoadLeveler configurations:
- Place the LoadLeveler executables in a shared file system.
- Place the logs, execute, and spool directories in a local file system (/var).
- Place the configuration files in a shared file system.
- Place the local configuration files in a subdirectory.
- Soft-link the worker local configuration files.

4.5.7 Documentation references - LoadLeveler

The following are suggested documents for further reading and reference on LoadLeveler:
- Workload Management with LoadLeveler, SG24-6038
- IBM LoadLeveler for AIX 5L: Using and Administering, SA22-7881

4.6 Scientific subroutine libraries

In this section, we describe the scientific subroutine libraries and their features.

4.6.1 Engineering and Scientific Subroutine Library (ESSL) family of products

The Engineering and Scientific Subroutine Library (ESSL) family of products consists of:
- Parallel Engineering and Scientific Subroutine Library (Parallel ESSL) for Advanced Interactive Executive (AIX), program number 5765-F84
- Engineering and Scientific Subroutine Library (ESSL) for AIX, program number 5765-F82

These products are state-of-the-art collections of mathematical subroutines that provide a wide range of functions for many different scientific and engineering applications. You can use these subroutine libraries to develop and enable many different types of scientific and engineering applications. New applications can be designed and developed to take full advantage of all the capabilities of the ESSL product family. Existing applications can be enabled by replacing comparable subroutines and in-line code with calls to ESSL subroutines. Some of the types of applications that can take advantage of the ESSL capabilities are:
- Structural analysis
- Time series analysis
- Computational chemistry
- Computational techniques
- Fluid dynamics analysis
- Mathematical analysis
- Seismic analysis
- Dynamic systems simulation
- Reservoir modeling
- Nuclear engineering
- Quantitative analysis
- Electronic circuit design

Advantages of ESSL and Parallel ESSL
- These subroutines can be called from application programs written in Fortran, C, and C++.
4.6.2 Operations
ESSL and Parallel ESSL offer mathematical subroutines in several computational areas, and these are summarized in Table 4-9.

Table 4-9 Wide range of mathematical functions for different applications

ESSL computational areas:
• Linear Algebra Subprograms
• Matrix Operations
• Linear Algebraic Equations
• Eigensystem Analysis
• Fourier Transforms, Convolutions and Correlations, and Related Computations
• Sorting and Searching
• Interpolation
• Numerical Quadrature
• Random Number Generation
• Utilities

Parallel ESSL computational areas:
• Level 2 PBLAS
• Level 3 PBLAS
• Linear Algebraic Equations
• Eigensystem Analysis and Singular Value Analysis
• Fourier Transforms
• Random Number Generation
• Utilities

Supported platforms:
• ESSL: POWER, POWER2, POWER3, POWER3-II, POWER4, POWER4+, PowerPC®, and Symmetric Multi-Processing (SMP) PowerPC processors; IBM RS/6000 SP
• Parallel ESSL: IBM RS/6000 SP; clusters of pSeries and RS/6000 workstations

For details of the computational areas, refer to IBM ESSL Products General Information, GA22-7903.

4.6.3 New in ESSL 4.1
The ESSL libraries are tuned for the POWER4 and POWER4+ processors. ESSL now supports the AIX 5L V5.2 32-bit and 64-bit kernels. The Dense Linear Algebraic Equation Subroutines have been enhanced to include new LAPACK subroutines as follows:
• General Matrix Factorization and Multiple Right-Hand Side Solve
• Positive Definite Real Symmetric Matrix Factorization and Multiple Right-Hand Side Solve
• Positive Definite Real Symmetric or Complex Hermitian Matrix Factorization and Multiple Right-Hand Side Solve
• Positive Definite Real Symmetric Matrix Factorization
• Positive Definite Real Symmetric Matrix Multiple Right-Hand Side Solve
• General Matrix Inverse
• Positive Definite Complex Hermitian Matrix Inverse
• Triangular Matrix Inverse
SMP support has been added to the SCFT and FDCFT subroutines when computing a single, large transform.

4.6.4 New in Parallel ESSL 3.1
The Parallel ESSL libraries are tuned for the POWER4 and POWER4+ processors.
The Dense Linear Algebraic Equations Subroutines now include:
• Positive Definite Real Symmetric and Complex Hermitian Matrix Inverse
• Estimation of the Reciprocal of the Condition Number of a Positive Definite Real Symmetric or Complex Hermitian Matrix Inverse

The Eigensystems Analysis Subroutines now include:
• Reduce a General Matrix to Bidiagonal Form
• Singular Value Decomposition of a General Matrix

The Utilities now include:
• Initialize a Type-1 Array Descriptor
• Initialize a Type-1 Array Descriptor with Error Checking
• Compute the Ceiling of the Division of Two Integers
• Compute the Least Common Multiple of Two Positive Integers
• Compute the Local Row or Column Index of a Global Element of a Block-Cyclically Distributed Matrix
• Compute the Process Row or Column Index of a Global Element of a Block-Cyclically Distributed Matrix
• Compute the Global Row or Column Index of a Local Element of a Block-Cyclically Distributed Matrix
• Compute the Starting Local Row or Column Index and Process Row or Column Index of a Global Element of a Block-Cyclically Distributed Matrix
• Compute the Starting Local Row and Column Indices and the Process Row and Column Indices of a Global Element of a Block-Cyclically Distributed Matrix
• Real Symmetric, Complex Symmetric or Complex Hermitian Matrix Norm

4.6.5 Software requirements
The software requirements for ESSL Versions 4.1 and 3.3 and Parallel ESSL Versions 3.1 and 2.3 are tabulated in this section. Table 4-10 lists the software requirements for ESSL.

Table 4-10 Requirements for ESSL 4.1 and 3.3

ESSL 4.1
AIX version: 5.2 with 5200-01
Other software:
For compiling (one of these):
• XL Fortran for AIX, Version 8.1 or later (5765-F70)
• IBM VisualAge C++ Professional for AIX Version 6.0 (5765-F56)
• C for AIX, Version 6.0 (5765-F57)
For linking, loading, or running:
• XL Fortran Run-Time Environment for AIX, Version 8.1 or later (5765-F71)
• C libraries (included in the AIX 5L V5 Application Development Toolkit)

ESSL 3.3
AIX version: 5.1 or higher
Other software:
For compiling (one of these):
• XL Fortran for AIX, Version 7.1.1 (5765-E02)
• IBM VisualAge C++ Professional for AIX Version 5.0.2 (5765-E26)
• C for AIX, Version 5.0.2 (5765-E32)
For linking, loading, or running:
• XL Fortran Run-Time Environment for AIX, Version 7.1.1 (5765-E03)
• C libraries (included in the AIX 5L Version 5 Application Development Toolkit)

The software requirements for Parallel ESSL are listed in Table 4-11.
Table 4-11 Requirements for Parallel ESSL 3.1 and 2.3

Parallel ESSL 3.1
AIX version: 5.2
PSSP version: 3.5 or higher
Other software:
For compiling (one of these):
• XL Fortran for AIX, Version
• IBM VisualAge C++ Professional
• C for AIX, Version
For linking, loading, or running:
• XL Fortran Run-Time Environment for AIX, Version
• ESSL V4.1
• Parallel Environment for AIX Version 4.1
• C libraries (included in the AIX 5L Version 5 Application Development Toolkit)

Parallel ESSL 2.3
AIX version: AIX 5L Version 5.1 or higher
PSSP version: 3.4 or higher
Other software:
For compiling (one of these):
• XL Fortran for AIX, Version 7.1.1 (5765-E02)
• IBM VisualAge C++ Professional
• C for AIX, Version 5.0.2 (5765-E32)
For linking, loading, or running:
• XL Fortran Run-Time Environment for AIX, Version 7.1.1 (5765-E03)
• ESSL V3.3 (5765-C42)
• Parallel Environment for AIX Version 3.2 (5765-D93)
• C libraries (included in the AIX 5L Version 5 Application Development Toolkit)

4.6.6 Documentation references - ESSL and PESSL
The following are suggested documents for further reading and reference on ESSL and PESSL:
• ESSL Products General Information, GA22-7903
• ESSL Guide and Reference, SA22-7904
• ESSL Installation Guide, GA22-7886
• Parallel ESSL Guide and Reference, SA22-7906

4.6.7 Mathematical Acceleration Subsystem (MASS)
Another scientific subroutine library that you can use with the pSeries server or Cluster 1600 is the Mathematical Acceleration Subsystem (MASS). MASS is not an orderable product from IBM; it is provided for use as a download from the MASS support Web site:
http://techsupport.services.ibm.com/server/mass
The download is available at no additional charge to users on the World Wide Web, subject to a set of terms and conditions.
Note: MASS is provided as is. IBM makes no warranties, expressed or implied, including the implied warranties of merchantability and fitness for a particular purpose. IBM has no obligation to defend or indemnify against any claim of infringement, including, but not limited to, patents, copyright, trade secret, or intellectual property rights of any kind.
MASS consists of libraries of tuned mathematical intrinsic functions. Each new MASS version includes the material from previous versions that has not been changed, except that Version 3.0 no longer includes the library that runs only on POWER2 processors. The POWER2 library remains available in Version 2.7 of MASS. For details of the versions of MASS, refer to the MASS support Web site.

Scalar library
The MASS scalar library, libmass.a, contains an accelerated set of frequently used math intrinsic functions from the AIX system library libm.a (now called libxlf90.a in the IBM XL Fortran publication):
sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y (FORTRAN), pow (C)
The libmass.a library can be used with either FORTRAN or C applications and will run under AIX on all of the IBM pSeries and RS/6000 processors. Because MASS does not check its environment, it must be called with the IEEE rounding mode set to round-to-nearest and with exceptions masked off (the default XLF environment). MASS may not work properly with other settings. In some cases MASS is not as accurate as the system library, and it may handle edge cases differently from libm.a (sqrt(Inf), for example). The trigonometric functions (sin, cos, tan) return NaN (Not-a-Number) for large arguments (abs(x)>2**50*pi). These functions accept double-precision arguments and return a double-precision result.
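Because libmass.a exports the same names as the corresponding libm.a routines, taking advantage of the scalar library is usually just a matter of link order. The following compile lines are a sketch only (the source file names are placeholders); check the MASS support Web site for the options that apply to your compiler level.

   # Fortran: libmass.a is searched before the default math library
   xlf -O3 -o mymodel mymodel.f -lmass

   # C: place -lmass ahead of -lm so the MASS versions are resolved first
   xlc -O3 -o mymodel mymodel.c -lmass -lm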
You can refer to the MASS support Web site for details of FORTRAN and C declarations for the functions. 214 IBM Eserver pSeries Cluster Systems Handbook Vector libraries The general vector library, libmassv.a, contains vector functions that will run on all computers in the IBM pSeries and RS/6000 families. The library libmassvp3.a contains some functions that have been tuned for the POWER3 architecture, while the remaining functions are identical to those in libmass.a. Similarly, the library libmassvp4.a contains some functions that have been tuned for the POWER4 architecture, while the remaining functions are identical to those in libmass.a. The vector libraries libmassv.a, libmassvp3.a, and libmassvp4.a can be used with either FORTRAN or C applications. When calling the library functions from C, only call by reference is supported, even for scalar arguments. As with the scalar functions, the vector functions must be called with the IEEE rounding mode set to round-to-nearest and with exceptions masked off. The accuracy of the vector functions is comparable to that of the corresponding scalar functions in libmass.a, though results may not be bit-wise identical. Double-precision functions: vrec, vdiv, vsqrt, vrsqrt, vexp, vlog, vsin, vcos, vtan, vasin, vacos, vatan2, vsincos, vcosisin, vdint, vdnint Single-precision functions: vsrec, vsdiv, vssqrt, vsrsqrt, vsexp, vslog, vssin, vscos, vstan, vsasin, vsacos, vsatan2, vssincos, vscosisin, vsdint, vsdnint These functions accept double-precision (or single-precision) vector input and output arguments, and an integer vector-length parameter. You can refer to the MASS support Web site for details of FORTRAN and C declarations for the functions. The MASS vector FORTRAN source library enables application developers to write portable vector codes. The source library, libmassv.f, includes FORTRAN versions of all the vector functions in the MASS vector libraries. Further information on MASS Information about the installation, use, performance, and accuracy of the MASS libraries is beyond the scope of the redbook and is not discussed here. You can obtain this information from the MASS support Web site: http://techsupport.services.ibm.com/server/mass Chapter 4. Software support 215 For questions, you may write to: MASS Support IBM Toronto Laboratory Mail Stop D2-515 8200 Warden Avenue Markham, Ontario, L6G 1C7 Canada Or send e-mail to: [email protected] 4.7 Parallel Environment (PE) IBM Parallel Environment (PE) for AIX is a high-function environment for the development and execution of parallel applications on the Cluster 1600 system. It allows organizations to develop, debug, analyze, tune, and execute parallel C/C++ and FORTRAN applications that use the industry-standard message passing interface. PE is available as program number 5765-D93. The advantages of Parallel Environment include the following: It facilitates parallel application development and portability to clusters of pSeries or RS/6000 systems. It exploits MPI message passing on all nodes, including symmetric multiprocessors (SMPs). It supports Low-level Application Programming Interface (LAPI) programs. It provides enhanced tools for program debugging and application performance analysis. LoadLeveler can be used for POE batch jobs. It allows you to checkpoint and restart POE jobs. 
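To give a flavor of how a parallel application is built and launched in this environment, the following command sequence compiles a simple MPI source file with a PE compile script and runs it under POE. It is a sketch only: the source file and host list names are placeholders, and the POE options shown (-procs, -hostfile, -euilib) should be verified against the PE operation and use documentation for your release.

   # Compile an MPI program with the threaded PE compile script
   mpcc_r -o hello hello.c

   # Run four tasks under POE, using a host list file and the User Space protocol
   poe ./hello -procs 4 -hostfile ./host.list -euilib us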
4.7.1 Parallel Programming support
PE supports two basic parallel programming models:
• Single Program Multiple Data (SPMD): In this model, the programs running the parallel tasks are identical, but the tasks work on different sets of data.
• Multiple Program Multiple Data (MPMD): In this model, different programs may be running on each node.
PE is a distributed memory message-passing system. Two types of subroutine calls are available to the developer to parallelize the application:
• Message Passing Interface (MPI)
• Low-level Application Programming Interface (LAPI)
The processor nodes that handle the parallel tasks communicate using the following protocols:
• Internet Protocol (IP) Communication Subsystem
• User Space (US) Communication Subsystem
Note: Although LAPI is used for data communication in conjunction with PE, it is available as a component of PSSP or shipped as part of RSCT in AIX 5.2.

4.7.2 Operation
PE consists of the following:
• Message passing and collective communication API subroutine libraries. Developers can use these subroutines for code parallelization. For details of the message passing subroutine calls, refer to IBM Parallel Environment for AIX: MPI Subroutine Reference, SA22-7423, and IBM Parallel Environment for AIX: Hitchhiker's Guide, SA22-7424.
• Parallel Operating Environment (POE). POE is an execution environment that facilitates a smooth transition from serial to parallel processing. It allows you to invoke your parallel program from a home node and have the partition manager start the parallel tasks on a number of remote nodes that you specify in a host list file. A number of parallel compiler scripts and POE environment variables are available to enhance the operation of POE.
• Debugging and profiling tools. The following tools are available in PE to facilitate debugging and profiling:
– Parallel Debugging Facility: the pdbx facility is a line-oriented parallel debugger based on the dbx debugger.
– Parallel Profiling Capability: once the parallel program is debugged, you can profile the program using the AIX Xprofiler graphical performance tool and the AIX commands prof and gprof.
• Performance analysis tools. The PE Benchmarker tools consist of a suite of applications and utilities:
– Performance collection tools: this set of tools is built using the Dynamic Probe Class Library (DPCL), where probes are placed in running executables to collect the required information.
– Unified Trace Environment (UTE) utilities: this is a set of enhanced trace utilities that allows you to obtain an MPI trace of all or part of a parallel application.
– Performance visualization tools: this set of tools allows you to view system profiles from data collected by the performance collection tools.
Note: DPCL is no longer part of the PE licensed program, but is shipped with PE for convenience. It is available as an open source offering that supports PE.
For details of the operations and use of PE, refer to IBM Parallel Environment for AIX: Operation and Use, Volume 1, SA22-7948, and Volume 2, SA22-7949.

4.7.3 New in PE 4.1
PE Version 4.1 includes the following functional enhancements:
• MPI is enhanced to use LAPI as the common transport protocol, and provides collective communications performance enhancements.
• PE now supports the new generation of pSeries switches, in addition to the pSeries (RS/6000 SP) switches.
• Only the threaded library is provided, with binary compatibility for signal (non-threaded) library applications.
• Electronic Licensing is provided.
• Cluster-based security support is provided.
• Program Marker Array is removed.
• MPL is no longer supported.
• Xprofiler is now part of AIX.
• The parallel utility subroutine MP_QUERYINTRDELAY, mpc_queryintrdelay is no longer supported. When invoked, it returns a value of zero.
• There is a new library to be used when converting MPI trace files to the slog2 file format used by the latest version of Jumpshot (an MPI trace viewer program available from Argonne National Laboratory).
• User-written SIGIO handlers are no longer invoked when a packet arrives.

4.7.4 Software requirements
Table 4-12 lists the software requirements of PE.

Table 4-12 Software requirements for PE 4.1 and 3.2

PE 4.1
AIX version: 5.2 or higher
PSSP version: 3.5
Other software:
• XL Fortran V or later, if FORTRAN programs are to be compiled and run.
• FORTRAN Run-Time Environment for AIX V or later, if FORTRAN programs are to be run.
• If User Space batch or interactive jobs are to be submitted under PE, then LoadLeveler V3.1 (5765-E69) is required.
• At least one concurrent use license of the C for AIX compiler or C++ compiler installed on the SP/server complex that includes the control workstation.
• PE Benchmarker requires Java Runtime Environment V1.3.

PE 3.2
AIX version: 5.1
PSSP version: 3.4 or higher. PSSP is not required when Parallel Environment is used on a standalone or independently managed cluster of RS/6000 servers.
Other software:
• XL Fortran, V7.1.0.2 or later, if FORTRAN programs are to be compiled and run.
• XL Fortran Run-time Environment for AIX Version 7.1.1 or later, if FORTRAN programs are to be run.
• If User Space batch or interactive jobs are to be submitted under PE, then LoadLeveler V3.1 (5765-E69) is required.
• At least one concurrent use license of the C for AIX compiler or C++ compiler installed on the SP/server complex that includes the control workstation.
• PE Benchmarker requires Java Runtime Environment V1.3.

4.7.5 Documentation references - Parallel Environment (PE)
The following resources are suggested for further reading and reference:
• PE for AIX 5L Installation, GA22-7418
• PE for AIX 5L Hitchhiker's Guide, SA22-7424
• PE for AIX 5L Operation and Use, Volume 1, SA22-7948
• PE for AIX 5L Operation and Use, Volume 2, SA22-7949
• PE for AIX 5L MPI Programming Guide, SA22-7422
• PE for AIX 5L MPI Subroutine Reference, SA22-7423

4.8 IBM High Availability Cluster Multi-Processing for AIX (HACMP)
HACMP for AIX is a proven high availability solution for business-critical environments. For over 10 years, HACMP has been providing reliable high-availability services, monitoring capabilities, and dependable detection of application failures. HACMP manages the failover of business application environments to backup servers. With the introduction of the new optional package, HACMP/XD (Extended Distance), HACMP will also manage failover to backup servers at remote sites. HACMP/XD provides long distance remote failover for ESS/PPRC peers, and unlimited distance failover for IP-connected peers using proven IBM High Availability Geographic Cluster (HAGEO) technology. Now there is a single, world-class source of protection for mission-critical applications. HACMP is offered as program number 5765-F62.
It is important to understand that a solution with HACMP is not a fault-tolerant solution; however, when properly designed and configured, it helps to minimize unplanned outages.
The term “high availability” refers to a computing configuration that has the ability to detect and recover from failures and provide a better level of protection against system downtime than the standard hardware and software alone. An HACMP solution provides these means to detect and recover from any unplanned server hardware and application failures. It also gives you the means to take down an individual server (node) for planned maintenance and upgrades, without having to take down the entire cluster. 220 IBM Eserver pSeries Cluster Systems Handbook HACMP offers the following advantages: It minimizes expensive downtime for both planned and unplanned outages by quickly restoring essential services, as follows: – HACMP makes use of redundant hardware configured in a cluster to keep an application running, restarting it on a backup server if necessary. – HACMP can also detect software problems that are not severe enough to interrupt proper operation of the system, such as process failure or exhaustion of system resources. – HACMP can monitor, detect and react to failure events, allowing the system to stay available during the occurrence of random and unexpected problems. – HACMP can be configured to react to hundreds of system events. It provides the flexibility to accommodate changing business needs. It provides horizontal growth with rock-solid reliability, where up to 32 servers can participate in an HACMP cluster. HACMP exploits the systems and network management capabilities of AIX 5L and RSCT. 4.8.1 HACMP operations The primary goal of high availability clustering software is to minimize or ideally, eliminate, the need to take your resources out of service during maintenance and reconfiguration activities. HACMP is a software product that executes on each node in a loosely coupled cluster. It provides application availability by detecting and reacting to failures of systems, processors, adapters, networks, disks, or applications. When these failures occur, HACMP makes use of redundant hardware in the cluster to keep the application running. In the event of a complete node failure, HACMP restarts the application on a backup node. HACMP responds to failure and recovery events based on policies specified when the cluster was defined. It also allows the systems administrator to customize extra operations or accommodate additional resource types. Components of an HACMP cluster A HACMP cluster consists broadly of the following components: Physical resources These are the hardware nodes, network interfaces and volume groups. Chapter 4. Software support 221 Logical resources These are the applications, service IP addresses and mounted file systems, that can be instantiated on any one of an equivalent set of physical resources. For example: – An application can run on any of a set of nodes. – A service IP address can be made active on any of a set of network interfaces. – A volume group can be varied on any of a set of nodes. Resource groups Resource groups allow you to combine related resources into a single logical entity for easier management. They are moved by HACMP from one node to another when the conditions in the cluster change. Policies Policies determine which physical resource will hold a logical resource, when there are multiple choices available. HACMP will then, in accordance with the specified policies, move logical resources around so as to keep applications running despite hardware and software failures. 
The takeover relationships among cluster nodes determine which cluster nodes control a resource group and which cluster nodes take over control of the resource group when the original node relinquishes control. Takeover relationships are defined by assigning one of the following HACMP takeover relationships:
• Cascading: The resource is always owned by the active node having the highest priority for that resource.
• Rotating: The resource rotates among all the nodes defined in the chain. The node with the highest priority in the resource chain for the resource takes over for a failed node, but does not immediately return the resources when the failed node returns.
• Concurrent: The resource is owned and accessed simultaneously by all owning nodes.
• Custom: The administrator has the choice and control over the takeover policy.

4.8.2 Administration and operation
HACMP management and administration facilities include the following:
• The Cluster Manager in HACMP is responsible for monitoring the status of cluster resources, reacting to failures, and responding to administrative requests. It uses the RSCT services in AIX.
• HAView allows you to monitor HACMP clusters through the NetView® network management platform.
• HATivoli allows you to monitor the state of an HACMP cluster and its components through your Tivoli Framework enterprise management system.
• Version Compatibility allows nodes running earlier versions of HACMP to interoperate with those running HACMP V5.1. A customer can upgrade an existing cluster running HACMP V4.5 or V4.4 without taking the entire cluster offline.
• The Cluster Snapshot captures a cluster configuration, creating text files that contain all the information necessary to configure a similar cluster.
• Cluster Single Point of Control (C-SPOC) enables the user to perform certain common administrative operations across the cluster from a single SMIT session.
• Dynamic Reconfiguration allows the user to change the configuration of a running cluster. The changes take effect immediately, without having to stop and restart the HACMP daemons, and without having to disrupt the applications running on the cluster.
• Resource Group Management (clRGmove) enables customers to use the SMIT interface to move resource groups between nodes and to bring them online or offline. Other options allow groups to be "stuck" to a particular node or to temporarily suspend application monitoring.
• The Application Availability Analysis Tool measures application availability. A log file is maintained to capture application and node startups and outages. The analysis tool reads the log and generates a report that includes availability metrics.
For details of HACMP systems administration, see High Availability Cluster Multi-Processing for AIX, Administration and Troubleshooting Guide, SC23-4862.

4.8.3 New in HACMP 5.1
HACMP 5.1, announced in July 2003, includes the following enhancements:
• Consolidation of all previous forms of HACMP (HAS, CRM, ES, ESCRM) into a single HACMP offering.
• Reduced failover time using fast disk takeover (which happens within 10 seconds).
• A streamlined configuration interface that requires only six user inputs to build a simple HA cluster.
• Non-IP heartbeating protection over disks, where no additional hardware is required.
• Enhanced security mechanism, removing the need for /.rhosts.
• Increased administration productivity through faster cluster verification and synchronization.
• Greater control over resources owning application startup and failover behavior.
• More cluster status information readily available in the cluster monitor.
• The addition of multiple disaster recovery technologies to keep the system accessible if disaster strikes.

4.8.4 Software requirements
Table 4-13 lists the software requirements for HACMP 5.1 on PSSP- and CSM-managed clusters.

Table 4-13 Software requirements for HACMP 5.1
AIX version: 5.1 with 5100-03, or 5.2 with 5200-01
PSSP version: 3.5
CSM version: 1.3.0 or later
Other software: Neither PSSP nor CSM is required to configure HACMP 5.1.

4.8.5 Documentation references - HACMP
The following are suggested documents for further reading and reference on HACMP:
• HACMP for AIX: Concepts and Facilities, SC23-4864
• HACMP for AIX: Planning and Installation Guide, SC23-4861
• HACMP for AIX: Administration and Troubleshooting Guide, SC23-4862
• HACMP for AIX: Programming Client Applications, SC23-4865
• HACMP for AIX: Programming Locking Applications, SC23-4866
• HACMP for AIX: Glossary, SC23-4867
• HACMP Remote Copy: ESS PPRC Guide, SC23-4863

4.9 Performance Toolbox (PTX) and Performance AIDE (PAIDE)
The AIX Performance Toolbox (PTX) and Performance AIDE (PAIDE) are licensed programs designed to support pSeries and RS/6000 systems. Together they provide a comprehensive tool for monitoring and tuning system performance in distributed environments. PTX is offered as program number 5765-E74, and PAIDE is offered as program number 5765-E68.
PTX and PAIDE offer the following advantages:
• They provide quick and simple solutions for obtaining and analyzing detailed system information.
• They allow customization of views for monitoring.
• They can generate reports on system activity over hours, days, or weeks.
• They support distributed performance monitoring.
• They provide access to thousands of system metrics.

4.9.1 Administration and operation
PTX
PTX provides an easy method of monitoring a single server or managing a distributed environment in the following ways:
• It can provide a high-level view of system performance.
• It generates reports over specified time periods, or "drills down" on a specific system, to collect data or analyze recorded data. These reports can be customized to show hourly, weekly, or monthly trends.
• It provides access to thousands of performance metrics.
• It provides customizable graphical interfaces to view system status and tune parameters.
• It provides a customizable menu interface.
• Built-in analysis tools combine with 2-dimensional and 3-dimensional graphing capabilities to pinpoint immediate or long-term bottlenecks.
• It provides additional utilities to convert recorded data into formats for import by third-party spreadsheets.

PAIDE
PAIDE can be used in conjunction with PTX to allow the user to visualize live performance on multiple systems concurrently. PAIDE offers the following functions:
• It provides agents for creating 24/7 recordings of large sets of performance metrics.
• It provides data filtering based on user-customized criteria. The filtering criteria can be used for event generation to a monitoring console, or to execute administration scripts.
• A new Java-based interface, Jtopas, focuses on pre-filtering the set of performance metrics available on large-scale systems to provide simplified snapshot views of overall system activity.

4.9.2 Platform requirements
Table 4-14 lists the platform requirements for PTX and PAIDE for POWER Version 3.
Table 4-14 Platform requirements for PTX and PAIDE for POWER Version 3
Hardware:
• RS/6000 or pSeries system with a minimum of 256 MB memory
• Graphics adapter for Performance Toolbox user interface operation
• Network adapter for Performance Toolbox remote monitoring
Software:
• AIX V5.1 with 5100-01

4.10 Software ordering and configuration
You can order the Cluster 1600 software from your IBM sales representative or IBM Business Partners. A technical review of your requirements with an IBM technical representative may be helpful.
Note for IBM sales representatives and IBM Business Partners: The following resources can assist you in configuring and ordering Cluster 1600 software:
• IBM Universal Sales Manual
• IBM eServer pSeries & RS/6000 Cluster Software Ordering Guide, available from the IBM Systems Sales Web site:
http://www.ibm.com/servers/eserver/pseries

Chapter 5. Solutions and offerings "best practices"
This chapter presents scenarios and solutions that came to our attention while writing this redbook. To the degree possible, the team has verified each of the solutions and best practices. Also, where possible, we have solicited feedback from subject matter experts. We cover the following scenarios and solutions:
• High Performance Computing (HPC) - view of a hypothetical HPC environment
• Replacing SP nodes with LPAR technology
• Virtual serial port implications with LPAR technology
• Hardware Management Console and Web-based System Manager considerations
• HMC and Web-based System Manager coexisting with firewall

5.1 High Performance Computing (HPC) environment
In this section, we discuss the Cluster 1600 in a hypothetical high performance computing environment. The discussion includes applications on different hardware that can be clustered and incorporated as part of the Cluster 1600.

5.1.1 A hypothetical solution
The example in this section is based on common HPC requirements in the field.
Note: The example shown here does not refer to or imply any particular installation, nor has this example been specifically configured and tested. It is used here solely to apply the concepts and details described in earlier chapters.

Background requirement
This hypothetical solution assumes that the HPC customer's technical requirement includes workloads ranging from 32-bit to 64-bit applications, running both serial and parallel applications. Some applications also require large shared memory. Some applications use the Message Passing Interface (MPI) and require high bandwidth, low latency, node-to-node communications, and the compute platform must be able to scale as the workload requirement increases. For performance reasons, the ability to share files globally in the same application domain is strongly preferred. Some applications (such as databases) are mission critical, and require high availability. Job scheduling is required to ensure optimized job throughput and resource usage.

5.1.2 Solution architecture
Figure 5-1 on page 231 shows a hypothetical Cluster 1600 solution that consists of pSeries servers running AIX and Linux, as well as xSeries servers running Linux, to meet the requirements stated in "Background requirement".
Figure 5-1 A hypothetical HPC environment (the diagram shows a Gigabit Ethernet backbone connecting p690 servers, a rack of xSeries nodes, a rack of p655 compute nodes, and an HMC; GPFS on Linux and GPFS on AIX with LoadLeveler submit nodes; a Myrinet network on the xSeries side and a High Performance Switch on the pSeries side; a rack of p655 servers as VSD servers and xSeries nodes as NSD servers; a p650 HACMP cluster for databases; a CSM management server; and a SAN storage subsystem)

This environment contains several components, which are described in the following sections.

xSeries servers
The rack of xSeries nodes runs the Linux operating system. Some of these nodes are used as Network Shared Disk (NSD) servers to manage GPFS I/O, while the rest are used fully for computational purposes. The NSD nodes have data connectivity to the Storage Area Network (SAN). All the nodes are interconnected via a high speed interconnect, which can be Gigabit Ethernet or Myrinet. Detailed discussion of the Cluster 1350 is beyond the scope of this redbook; for more information on that topic, refer to "Documentation and suggested reading" on page 237.

pSeries servers
The pSeries servers in the cluster run the AIX operating system. They consist of the following:
• p655 servers, used for computation of jobs that require no more than 4-way scaling.
• p690 servers with up to 32 processors and up to 512 GB of memory, used for applications that require large shared memory and high memory bandwidth.
• A set of servers, which can be p655 or LPAR p690, used as Virtual Shared Disk (VSD) servers with data connectivity to the SAN.
• A pair of p650s or p670s configured for high availability for mission-critical databases.
• The compute p655 and p690 servers, interconnected via a high bandwidth, low latency High Performance Switch (HPS).
• A pair of p615s or p630s that are used for submitting jobs to the pSeries clusters.
• A p630 that is used as the CSM management server.
You can find the details of the pSeries servers and the HPS in Chapter 2, "Cluster 1600 hardware" on page 11.

Storage Area Network (SAN)
The SAN consists of storage subsystems such as the Enterprise Storage Server® (ESS) and FAStT Storage Server (FAStT), tape storage, and SAN switches. Details of the SAN are beyond the scope of this redbook; for further information on that subject, refer to Introduction to Storage Area Networks, SG24-5470.

Network
The communication network consists of the following:
• A private compute network, such as the Myrinet network for xSeries and the HPS network for pSeries (p655 and p690).
• A common Ethernet backbone, such as Gigabit Ethernet (1000 Mbps) or Fast Ethernet (100 Mbps), which can be used for CSM installs and client access.

5.1.3 Solution discussion
In this section, we provide a discussion of the various components and considerations for designing clusters.

pSeries compute cluster
The pSeries compute cluster is discussed in terms of the p655 and p690 platforms and the software covered in Chapter 2, "Cluster 1600 hardware" on page 11 and Chapter 4, "Software support" on page 169.

p690 for shared memory applications
The p690 is an ideal platform for workloads requiring large shared memory. It can also be clustered to create compute clusters that run large parallel workloads with bigger compute requirements per node than the p655. The p690 is also an ideal platform to run multiple logical partitions (LPARs), allowing consolidation of application servers on the same hardware. The same is true when LPARs are created separately for a development environment.
The LPARs can also be dynamically reconfigured to allow flexibility and adaptability to changing workload requirements. This is accomplished with the dynamic logical partitioning (DLPAR) feature, which allows processor, memory, and I/O slot resources to be added to or deleted from running partitions, or moved between running partitions, without requiring any AIX instance to be rebooted. You can find details of using LPAR in pSeries Systems Handbook 2003 Edition, SG24-5120 and The Complete Partitioning Guide for IBM eServer pSeries Servers, SG24-7039.
Important: DLPAR is supported on AIX 5L 5.2 or later.
You can consider creating an on-demand environment using the IBM DLPAR tool set for pSeries. This application consists of a set of tools that enhance the usability of the DLPAR feature in AIX 5.2. In this tool set, both time-based and load-based scenarios for moving processor resources and memory resources among partitions are explored. It includes sample scripts for monitoring LPARs and automating DLPAR operations, which you can use or customize to fit your requirements.
Tip: The IBM DLPAR tool set for pSeries is currently available at the alphaWorks Web site. For details and download, go to:
http://www.alphaworks.ibm.com/tech/dlpar
IBM intends to include the features of the DLPAR Tool Set for pSeries in the next release of AIX 5L, 5.3.

p655 compute cluster
The p655 compute clusters provide a good compute platform for both 32-bit and 64-bit applications. Besides running serial applications on the individual nodes, the p655 is ideally suited for running parallel applications across the cluster. The p655 uses the same MCM technology as the p690 and p670. It provides an advantage to applications requiring high memory bandwidth.

Parallel computing
Parallel jobs using MPI on the cluster of p655 and p690 servers can involve message passing via the HPS. They can also be executed in the multiprocessor environment of the p690. The Parallel Environment (PE) can be set up for running parallel jobs using the Parallel Operating Environment (POE). The scientific subroutines, such as Parallel ESSL, can be used for parallel computing, while ESSL or MASS can be used for serial or shared-memory applications. For discussions of PE, ESSL, PESSL and MASS, see 4.7, "Parallel Environment (PE)" on page 216 and 4.6, "Scientific subroutine libraries" on page 208.

Job scheduling on the pSeries
The compute jobs on these nodes can be managed with job scheduling. LoadLeveler can be configured in this cluster to schedule the jobs. Two submit nodes allow users to submit their jobs for queuing and execution on the cluster, as illustrated by the sample job command file that follows. These nodes can also function as the central manager and the scheduler node; refer to 4.5, "LoadLeveler" on page 200 for further information.
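To illustrate how work reaches the cluster, a user on one of the submit nodes typically describes a job in a LoadLeveler job command file and submits it with llsubmit. The stanza below is a sketch only; the class name, node counts, and network values are examples and must match what the LoadLeveler administrator has actually defined.

   # @ job_type       = parallel
   # @ class          = parallel_prod
   # @ node           = 4
   # @ tasks_per_node = 4
   # @ network.MPI    = sn_all,not_shared,US
   # @ output         = $(jobid).out
   # @ error          = $(jobid).err
   # @ queue

The job is then submitted with llsubmit and monitored with llq, while llstatus shows the state of the machines in the LoadLeveler cluster.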
Parallel file system
Files can be shared and accessed in parallel using General Parallel File System (GPFS). This is set up using Virtual Shared Disk (VSD) servers. The mechanism involves software simulation of a SAN, which runs across the HPS interconnecting the nodes in the GPFS node set. The node set in this case refers to the compute nodes participating in the sharing of the GPFS. GPFS gives you the flexibility to group and manage the cluster nodes into one or more nodesets. The VSD servers act as a "gateway", routing disk operations from the disk subsystem to the compute cluster nodes; refer to 4.4, "General Parallel File System (GPFS)" on page 192 for more information on GPFS.

High availability solution
A pair of p650s, as shown in Figure 5-2, is used to serve as the database server. The database server is clustered using IBM HACMP software. Both servers have direct Fibre Channel connectivity to the SAN and use "heartbeat" over IP and "heartbeat" over shared disks. In this scenario, we are assuming that there is only a single database instance running on the active server. In a failover situation, the database instance is started on the standby server.

Figure 5-2 A basic two-server HACMP cluster configuration (an active and a standby p650 database server exchange IP heartbeats over the Ethernet network and heartbeats over shared disks in the SAN)

There is flexibility in designing high availability clusters using HACMP and LPAR. If there is more than one instance of the databases, you can keep them isolated in different LPARs and run them on the same server, or on different servers. As an example, if we have two instances of databases, the design of the high availability cluster can look like the logical diagram shown in Figure 5-3 on page 235. Distributing the active database servers across the physical hardware ensures that system resources are better utilized and that a failure of the hardware only affects the databases running on that hardware. The individual LPAR can be managed as a single server within the CSM cluster.

Figure 5-3 HACMP configuration in an LPAR environment (two LPARed p650 servers connected to the Ethernet network and the SAN; database "A" is active in LPAR 1 on the first server with its standby in LPAR 3 on the second, while database "B" is active in LPAR 4 on the second server with its standby in LPAR 2 on the first; server fallover uses IP heartbeats and heartbeat over shared disks)

xSeries compute cluster
The xSeries compute cluster provides similar parallel computing functions to the p655 clusters. The xSeries is a 32-bit platform and may provide better price/performance than the 64-bit p655 platform for certain applications. It is ideally suited for certain 32-bit Linux applications. The NSD works similarly to the traditional VSD on the pSeries in terms of functionality, and GPFS allows high performance file sharing on the Cluster 1350. Jobs on the xSeries cluster servers can be scheduled via third-party applications, such as Load Sharing Facility offered by Platform Computing, or via open source applications such as the Portable Batch Scheduler (PBS). More information can be found at the following sites:
http://platform.com
http://www.openpbs.org

5.1.4 Cluster management considerations
In this section, we discuss cluster management of a heterogeneous solution consisting of AIX on pSeries and Linux on xSeries.

Cluster Systems Management (CSM)
This heterogeneous platform can be centrally managed via a CSM management server, making the management server the single point of control. CSM provides low cost, consistent management of distributed and clustered IBM pSeries and xSeries servers. Refer to 4.3, "Cluster Systems Management (CSM)" on page 180 for more information on CSM.
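As an indication of what a single point of control means in day-to-day operation, most routine administration from the CSM management server is command driven. The commands below are a sketch; the node name is a placeholder, and the exact flags should be confirmed in the CSM for AIX command reference.

   lsnode -l                     # list the managed nodes and their attributes
   dsh -a "oslevel -r"           # run a command on every node in the cluster
   dsh -n p655n01 "lppchk -v"    # run a command on a single, named node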
Tip: In addition to AIX on pSeries and Linux on xSeries servers, selected pSeries servers running the Linux operating system can also be managed via CSM in the same cluster. This provides the benefit of a broader range of Linux source code that can be compiled on Linux on pSeries. You can find details of Linux on pSeries at the following Web site:
http://www.ibm.com/servers/eserver/pseries/linux
Compilers such as IBM XL Fortran for Linux on pSeries and IBM VisualAge C++ for Linux on pSeries are also available. These compilers are optimized for the pSeries platform. Evaluation copies are currently available at:
http://www.ibm.com/developerworks/offers/linux-speed-start/download-p.html

Managing clusters across geographies
In "Cluster management across geographies" on page 185, we discuss cluster management across geographies using CSM. This concept may be applied in scenarios where the clusters are housed in separate data centers. For example, the xSeries clusters may be owned by Department A and located physically in Building A, while the pSeries cluster may be owned by Department B and housed physically in Building B. The system administrator operates out of the operations center in Building C. The management server can be located at the operations center in Building C, and yet be able to provide a single point of management for the clusters at the different locations.

Documentation and suggested reading
The following documents are suggestions for further reading and reference:
• Linux Cluster with CSM and GPFS, SG24-6601
• IBM eServer xSeries Clusters Planning Guide, SG24-5845
• Linux Handbook: A Guide to IBM Linux Solutions and Resources, SG24-7000
• Introduction to Storage Area Networks, SG24-5470
• pSeries Systems Handbook 2003 Edition, SG24-5120
• The Complete Partitioning Guide for IBM eServer pSeries Servers, SG24-7039

5.2 Transition from SP nodes to LPARs
Many IBM pSeries install bases are already utilizing POWER3-based servers, and plan to refresh or replace their production environments with newer pSeries servers. In this section we present one of the scenarios that will be most common in the realm of the Cluster 1600: it addresses install bases trying to migrate off a "node" in an SP frame into partitions in LPAR-enabled systems such as the p690, p670, p655, p650, or p630.
Note: In your case, you can use different methods to perform this migration, or devise creative solutions or scripts on your own. However, in this redbook we only present a solution that has been implemented in our own sandbox environment and that has been shown to be effective in this type of migration.
Moving to POWER4 hardware ensures that you can benefit from the enhanced features of the improved hardware architecture. For more details on POWER4, refer to Chapter 2, "Cluster 1600 hardware" on page 11.
As you know, SP nodes are POWER3 architecture and all of the new LPAR-enabled servers are POWER4 or POWER4+ architectures. Therefore, for simplicity, we refer to the SP nodes and LPAR servers as POWER3 and POWER4+, respectively.

5.2.1 System migration by utilizing alternate disk migration
Alternate disk migration is one of the methods available to migrate a server from an older version of AIX (for example, AIX 4.3.x) to a newer version of AIX (for example, AIX 5L) without disrupting the production environment during the course of the server migration from POWER3 to POWER4+. Using this method, you need a spare disk that is equal to or greater in size than the rootvg disk to use for the migration.
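The smitty alt_clone dialog described in "Steps for alt_diskinstall migration" below drives the alt_disk_install command. For reference, an equivalent command-line sequence might look like the following sketch, assuming hdisk2 is the spare disk; the flags vary slightly between AIX levels, so verify them against the alt_disk_install documentation before use.

   lspv                            # confirm that hdisk2 is free and large enough
   alt_disk_install -C -B hdisk2   # clone the running rootvg to hdisk2 (-B leaves the bootlist unchanged)
   bootlist -m normal hdisk2       # point the normal-mode bootlist at the clone
   shutdown -Fr                    # reboot from the cloned rootvg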
Figure 5-4 alt_diskinstall process (an SP node running AIX 4.3.x clones its rootvg from hdisk1 to hdisk2, the cloned rootvg is migrated to AIX 5L, and a mksysb of the system is restored from tape, DVD, or CD to an LPAR-enabled pSeries partition; 32-bit systems require APAR IY32749, and 64-bit systems require /usr/sbin/partition_ready and APAR IY22854)

The advantage of using this method is that you are able to clone your rootvg and then migrate it from AIX 4.3.x to AIX 5L while your server's production rootvg is kept intact in case a rollback is needed. By utilizing this method, you are in fact creating two bootable images, as shown in Figure 5-4:
• Your original rootvg with the AIX 4.x bootable image intact (hdisk1)
• The cloned rootvg to which the migration from AIX 4.3.x to AIX 5L can be applied (hdisk2)
Once the process of migration has been completed and verified, the downtime associated with normal BOS updates and migrations is minimized, because you are able to migrate the rootvg in the background. All you need is a reboot, with the primary boot device pointing to the migrated AIX 5L rootvg. This methodology offers a useful rollback or contingency plan, if there is a need for one.

Steps for alt_diskinstall migration
To migrate from POWER3 to POWER4+ by utilizing the alt_diskinstall method, follow these steps:
1. Enter: smitty alt_clone
   a. Select Software Installation and Maintenance -> Alternate Disk Installation.
   b. Clone rootvg to an alternate disk.
2. Modify the bootlist to boot from the cloned physical device (hdisk-x).
3. Reboot.
Once booted with the cloned rootvg, verify that APAR IY32749 has been properly applied.
1. If your hardware is 64-bit architecture, verify that these filesets are installed:
   – devices.chrp_lpar.base.ras
   – devices.chrp_lpar.base.rte
   – devices.chrp.base.rte
   – devices.chrp.base.ServiceRM
   – bos.64bit
   – bos.mp64
   – APAR IY22854
2. Execute: /usr/sbin/partition_ready
3. Create a mksysb of the system onto either a DVD or bootable tape media.
If your rootvg spans one physical disk (hdisk0 - rootvg), the results of alt_diskinstall create a second rootvg on a physical disk that you specify (that is, hdisk2 altinst_rootvg). If you issue the command lspv, you will see that hdisk0 is still "active". At this point, you need to modify your bootlist to boot from hdisk2. Upon reboot, issue an lspv command once again to verify that hdisk0's label has changed to "old_rootvg", and that hdisk2's label has indeed changed to "rootvg" and is set to "active". This will ensure that the 5L migration you are about to execute is applied to the cloned alternate disk.
Note: More detailed guidance about this migration can be found at:
http://publib16.boulder.ibm.com/pseries/en_US/infocenter/howto/HT_insgdrf_altdiskinstall_clone.htm#insgdrf_altdiskinstall_clone
Important: AIX 5L Version 5.1, recommended maintenance level 5100-03, is the minimum AIX install CD set that supports 32-bit-to-64-bit install migration. This version ensures that the operating system image is bootable on a partitioned POWER4 system. Using an earlier package (for example, AIX 5L Version 5.1, recommended maintenance level 5100-01) causes the booting process to halt on LED 0c43 when you boot after restoring the backup to an LPAR.

Tasks to be performed on the POWER4+ partition
On the POWER4+ partition, perform these tasks:
1. Allocate a bootable tape drive or DVD (in a media format that is compatible with the POWER3 system).
2. Load the media.
3. Boot the LPAR to an SMS state, and select the tape or DVD device as the boot device.
4. Issue the boot command.
5. Configure the booted image as needed.
Your BOS migration from POWER3 to POWER4+ should now be completed and ready for non-system data to be migrated to the LPAR system.

5.3 Virtual serial port implications with LPARs
In various discussion forums for IBM pSeries, the following question is often asked.
Question: Can the integrated serial port on an LPAR-enabled pSeries be used for an HACMP heartbeat?
Answer: Yes, but only in SMP (non-LPAR) mode. If the pSeries is configured in LPAR mode, an 8-port async adapter is required; otherwise, you will need to find other means. Also note that the p655 Model 651 has no built-in native serial port. Refer to "Moving the media drawer between active LPARs" on page 243 for more detail.

5.3.1 Console device
The HMC provides one virtual tty console access for each partition (LPAR), which removes most of the need for partition access to native serial ports. The two native serial ports on the p670/p690 are under the control of the service processor and are not available for functions such as HACMP heartbeat cabling or UPS control, since they may be dynamically allocated to a single LPAR at any time. For HACMP heartbeat applications on LPAR-enabled servers, you may choose to utilize "heartbeat on disk", which requires no other hardware. In AIX 5L, RSCT utilizes its Topology Services layer to provide a heartbeat mechanism to ensure application takeover can occur at any time.
Tip: The HMC provides two integrated serial ports: the first native serial port is required for server connectivity, and the second native serial port is reserved for a modem if "Service Agent call home" is implemented. Therefore, if there is more than one managed partition and the call home function is implemented, you have to assign an 8-port (F/C 2943) or 128-port (F/C 2944) serial adapter. Note that 128-port asynchronous adapters are not supported on the HMC (F/C 7316).
Each system that supports a Hardware Management Console (HMC) has two serial port connections, so that you may optionally attach a second HMC to the same system. The benefits of using two HMCs are as follows:
• It ensures that access to the HMC management function capabilities is not interrupted.
• It ensures access if the network is down.
There is no physical console on the partition unless you assign it explicitly (by assigning the serial port in the media drawer). The two built-in native serial (RS-232) ports on the p690 can only be assigned to one partition (LPAR) at a time. If an explicit system console is needed on more than one partition at a time, your only option is to implement a "graphical console" environment for the partitions. A graphical console is available on a partition by configuring a graphical adapter (F/C 2943) with a graphical display attached, and a USB keyboard and mouse adapter, per partition needing an explicit system console. As with standalone servers, only one graphical console is configurable per partition; the number of graphical consoles can equal the number of LPARs.
Important: For installation or service processor supported functions, you must use the Virtual Terminal Window (VTW) function of the HMC.

5.3.2 Serial port implication in an LPAR/SP environment
There is no serial connection to any LPAR or p690 server.
Connectivity from the control workstation (CWS) to the p690, p670, p655, p650, and p630 (the LPAR-enabled systems) is achieved through the Hardware Management Console (HMC) via a network connection to the SP LAN. The LPAR-enabled systems are supported as SP-attached servers, or in a Cluster 1600 configuration with the SP Switch2, the SP Switch, or no switch.
The first step in installing the SP system is to prepare the CWS. This involves connecting your SP frames and attached servers to the serial ports of your CWS, configuring the Ethernet connections on your CWS, and verifying name resolution. You must then install the PSSP code on the CWS and apply the necessary PTFs. The control workstation uses one serial port per SP frame for hardware monitoring and control. For attached servers, the number of serial ports you must allocate depends on the servers:
• SAMI protocol servers use two tty ports for communicating with the CWS.
• CSP protocol servers use one tty port.
• For pSeries 690 servers, you do not need to reserve a tty port, but the Hardware Management Console (HMC) must be connected to the SP Ethernet administrative LAN.

5.3.3 Virtual terminal window
In an LPAR environment there are no physical consoles, unless you assign one explicitly (with the limitations previously mentioned). To facilitate console output and to provide the means for diagnostics, firmware functions, and a virtual tty device, a Virtual Terminal Window (VTW) is offered. AIX sees this VTW as a standard tty device. The output of the VTW is streamed to the HMC, and acts as the console for the LPAR.
Note: VTW is designed for limited purposes. It is used during AIX installation and diagnostics services. VTW does not support:
• Printing to a virtual terminal
• Transparent print service
• Modem connections
• Real time application usage
VTW emulates a VT320. If a VTW is used in full partition mode, the I/O of the serial-1 port is redirected to the VTW window, and communication to and from the serial-1 port is interrupted. Normal operation of the serial-1 port is resumed as soon as the VTW window is closed.
Tip: For AIX configuration and systems management, we recommend that you use the network or the serial ports (using the serial adapter assigned to the partition).
In a "No Power" state of the system, you can still access the service processor via the VTW. If you are booting an AIX partition that did not have the native serial adapters assigned as one of its resources at the time of BOS installation, the device driver for the serial adapter is not available in that partition's BOS. So in order to install the device drivers, you may have to dynamically move the media drawer to each of the partitions temporarily to define, configure, and load the device drivers, as explained in the following section.

Moving the media drawer between active LPARs
You can use the Hardware Management Console (HMC) to move the managed system's media drawer from one active logical partition to another active logical partition without rebooting the partition's operating system. To use the HMC to reassign the managed system's media drawer between active partitions, follow these steps:
1. Log in to the HMC using either the System Administrator or Advanced Operator roles. Then:
   – Click the console's icon to expand the tree.
   – Click the Server and Partition folder.
   – Click the Server Management icon.
   – Click the managed system's icon to expand the tree.
   – Select the partition from which you want to move the CD-ROM.
Select Dynamic Logical Partitioning. Select Adapters. 2. The Dynamic Reconfiguration window opens. Click Move resource to a partition. 3. Select the CD-ROM. The CD-ROM is listed as group U1.18-P1-I10 in the HMC interface. 4. Select the name of the partition to which you want to move the CD-ROM. Chapter 5. Solutions and offerings “best practices” 243 5. In the Task time-out field, select the number of minutes you want the system to wait before it stops the task. 6. In the Details field, select the level of feedback you would like to see while the HMC performs the task. Details shown include the operating system's standard output and standard error information. When you are finished selecting the correct information, click OK. 5.4 HMC considerations When implementing a Cluster 1600 environment, one of the key components is the HMC. The HMC provides a standard user interface for configuring and managing partitions or SMP systems. HMC also facilitates features to monitor system and hardware problems. 5.4.1 Redundant HMC PSSP now supports automatic failover of hardware control and monitoring (hardmon) communications to a redundant HMC. The pSeries Hardware Management Console allows you to configure two HMCs to control a single HMC-controlled server. The PSSP spframe command allows you to enter a list of IP addresses to be used for the hardware control point for the logical frame being defined for the HMC-controlled server. PSSP initially tries to use the first IP address when attempting to establish a communication session with an HMC for the HMC-controlled server. If communication cannot be established to the first HMC in the list of IP addresses, the hardmon subsystem attempts to establish a communication session with the other IP address defined for that logical frame. In redundant HMC configurations, both HMCs are fully active and accessible at all times, enabling you to perform management tasks from either HMC at any time. There is no primary or backup designation. To avoid conflicts, mechanisms in the communication interface between HMCs and the managed systems allow an HMC to temporarily take exclusive control of the interface, effectively locking out the other HMC. Usually this locking is done only for the duration of time it takes to complete an operation, after which the lock is released for further operations. HMCs are also automatically refreshed of any changes that occur in the managed systems, so the results of commands issued by one HMC are visible to the other. For example, if you select to activate a partition from one HMC, you will observe the partition going to the Starting and Running states on both HMCs. 244 IBM Eserver pSeries Cluster Systems Handbook The locking between HMCs does not prevent users from running commands that might seem to be in conflict with each other. For example, if the user on one HMC selects to activate a partition, and a short time later, a user on the other HMC selects to power off the system, the system will power off. Effectively, any sequence of commands that you can do from a single HMC is also permitted when your environment contains redundant HMCs. For this reason, it is important to carefully consider how you want to use this redundant capability to avoid such conflicts. You might choose to use them in a primary and backup role, even though the HMCs are not restricted in that way. The interface locking between two HMCs is automatic and is usually of short duration. 
Most console operations wait for the lock to release without requiring user intervention. However, if one HMC experiences a problem while in the middle of an operation, it may be necessary to manually release the lock. Because authorized users can be defined independently for each HMC, you should determine whether the users of one HMC should be authorized on the other. If so, the user authorization must be set up separately on each HMC. Tip: Because both HMCs provide Service Focal Point and Service Agent functions, connect a modem and phone line to only one of the HMCs, and enable its Service Agent. To prevent redundant service calls, enable Service Agent on only one HMC. Perform HMC software maintenance separately on each HMC, at separate times, so that there is no interruption in accessing HMC function. This situation allows one HMC to run at the new fix level, while the other HMC can continue to run at the previous fix level. Make sure that both HMCs are moved to the same fix level as soon as possible. 5.5 Web-based System Manager client solutions Web-based System Manager is a tool that provides a complete set of Web-based system management interfaces for the entire pSeries domain. It enables administration of single/multiple AIX machines/partitions, AIX clusters, and SP nodes from any client platform, including the browser on the PC at the administrator's home. HMC can be managed through Web-based System Manager interfaces. However, in an enterprise computing environment, you have to work through the firewall to enable the functionality of the Web-based System Manager and the Chapter 5. Solutions and offerings “best practices” 245 HMC. In this section, we present a solution to facilitate communication between the HMC and a Web-based System Manager client through a firewall. 5.5.1 Web-based System Manager functionality through the firewall In a corporate environment, firewalls are common, if not mandatory. It is also common that systems administration tasks are often performed outside of the physical boundaries of a data center; often systems administrators are across town or across the country from the data center. With the HMC, you are able to manage multiples of servers from various corporate locations, enabling the systems administrators to be freed from the distance limitations of RS-232 connections, and to still be able to manage hardware and configuration issues on their servers. However, systems administrators are still limited by the security measures (firewall) put in place between the intranet and Internet. If you want to put the HMC’s management capabilities outside the compounds of trusted networks, you are confronted with the limitation of not knowing how the HMC and the Web-based System Manager client communicate through the firewall. Capturing the port numbers utilized by the HMC and Web-based System Manager in a single session and opening those specific ports is rendered useless on subsequent communications between the HMC and the Web-based System Manager. For that reason, we offer the following solution to address the firewall limitation imposed on HMC and Web-based System Manager communications. Note: This solution is not documented and may not be supported by IBM at this time. 
We verified this solution on the following configuration:
– IBM p690, partitioned (LPAR) and running AIX 5.1 and AIX 5.2
– Checkpoint Firewall 4.1 running on a Sun Ultra60 (Solaris 7)
– HMC 7315-C02, with software version 1.3
– Web-based System Manager HMC client on a Windows® XP (SP1) workstation
– DSL (static IP) Internet connectivity

Figure 5-5 Intranet communication between HMC/Web-based System Manager (the initial Web-based System Manager request uses port 9090; the follow-on Web-based System Manager communications use a random port)

Figure 5-5 depicts the sequence of initial requests between the HMC and Web-based System Manager and their subsequent communication. The HMC and Web-based System Manager initialize communication on port 9090 (this is the factory default, which may be changed as described in “Changing the default client communication port” on page 247). Once the initial request is received by the HMC, it returns a handshake to the Web-based System Manager, using any one of the random port numbers between 1024 and 65535. (The HMC is unlikely to use a commonly used port number such as 8080.) At this point, communication is established between the HMC and the Web-based System Manager HMC client.

Changing the default client communication port

If your environment requires that you change the Web-based System Manager’s initialization port from 9090 to some other port, issue the following command:

/usr/websm/bin/wsmserver -enable -listenport <desired port number>

You must enable or open these ports on your firewall to facilitate the communication.

Implications of a firewall

After we set up the LPAR-enabled pSeries, along with the HMC, within the enterprise data center, we tested all HMC functionality and verified that it worked within the intranet. However, as soon as the Web-based System Manager client was removed from the bounds of the trusted network, as shown in Figure 5-6, communication between the HMC and Web-based System Manager stopped.

Figure 5-6 Implications of the firewall (the initial and subsequent Web-based System Manager requests use port 9090, while the HMC handshake and follow-on HMC traffic use random ports between 1024 and 65535 and are stopped at the firewall)

The initial Web-based System Manager communication coming through port 9090 (assuming that this port is enabled in the firewall) reached the HMC; however, the handshake and all subsequent requests and communications were held at the firewall, because the HMC’s follow-on communication uses a random port number. An easy way to enable communications between the HMC and the Web-based System Manager client is to enable all ports on your firewall. However, enabling all ports between 1024 and 65535 defeats the purpose of having a firewall in the first place.

Important: We noticed that the handshake coming from the HMC to the Web-based System Manager client utilizes a different, random port number each session. Therefore, capturing the initial handshake port number and opening that specific port on your firewall will not resolve this issue.
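You can observe this behavior from the client side by listing the established connections to the HMC after starting a new Web-based System Manager session. The following is a minimal, hedged illustration rather than a documented procedure; it assumes an AIX client, and the HMC address shown is hypothetical:

   # 9.12.6.17 is a hypothetical HMC address; substitute your own HMC's address
   netstat -an | grep 9.12.6.17

Repeating the command after reconnecting shows the follow-on connection terminating at a different high-numbered port on the HMC each time, which is why a single static firewall rule for one captured port is not sufficient.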
Solution

Solving this problem is rather simple if you control the range of ports which the HMC uses to communicate with the Web-based System Manager client. The number of ports to be enabled on the firewall then becomes manageable, as shown in “Changing the default client communication port” on page 247. Enabling a range of ports on your firewall to facilitate the Web-based System Manager communications to the HMC presents a level of security risk. However, you can minimize the risk by implementing additional security schemes (for example, time-based connections).

Defining HMC-to-Web-based System Manager communication ports

To shrink the range of ports that the HMC uses for its follow-on communication, issue the following command:

/usr/websm/bin/wsmserver -enable -portstart <start port> -portend <end port>

For example, to restrict the HMC to port numbers 10010 through 10020, specify a start port of 10010 and an end port of 10020.

Figure 5-7 Limiting the HMC’s communication ports (the initial Web-based System Manager request uses the configured listen port, and the handshake and HMC traffic are confined to the range 10010 to 10020, which is opened on the firewall)

In summary, to ensure reliable communications, reduce the number of ports that the HMC utilizes for Web-based System Manager client communications from 64511 possibilities to about 5 to 10 possibilities (as shown in Figure 5-7), and open those ports on the firewall.

Appendix A. Performance

The communication performance seen by applications running in AIX® cluster nodes and communicating through the SP Switch or SP Switch2 comprises a number of elements. The SP Switch and SP Switch2 provide the base communications performance capability.

Important: All performance data is subject to change. For the latest performance information, refer to the following Web site:
http://www.ibm.com/servers/eserver/pseries/hardware/whitepapers/sp_switch_perf.html

The performance characteristics of the SP Switch and SP Switch2 adapters in the node, and the time the node takes to process the communication protocol stack, determine how much of the SP Switch and SP Switch2 performance capability can be sustained during application-to-application communication. The time for processing the communication protocol stack is, in turn, determined by the software path length and the performance characteristics of the processor.

In this appendix, we discuss the performance capabilities of both the SP Switch and SP Switch2 fabrics and their adapters. We also present actual application performance measurements. A discussion of the software path length in processing communication protocol stacks is beyond the scope of this redbook.

We use the term “switch” in this appendix to refer to any switch type, and we use the term “adapter” to refer to any adapter type, unless a specific switch or adapter is named.

A.1 Switch performance

The SP Switch2 was first introduced in 2000 for the interconnection of SP nodes via internal MX slots. As the next step in the evolution of the SP interconnection fabric, it offered significant improvements in bandwidth, latency, and Reliability, Availability, and Serviceability (RAS) over the previous generation SP Switch. The bandwidth gains resulted from design improvements in the adapters and switch hardware.
In 2001, the SP Switch2 was enhanced to include the SP Switch2 PCI Attachment Adapter, allowing interconnection of any PCI-based server supported in an IBM Cluster 1600. It was also enhanced to include dual plane support, which is the ability to support up to two adapters per node (or logical partition), thereby increasing bandwidth and allowing for redundant adapters and higher degrees of RAS. These enhancements enabled switch communication performance to keep pace with the increased performance of the SP nodes and Cluster 1600 systems.

With the announcement of the pSeries™ 655 in December 2002, a new SP Switch2 adapter, the SP Switch2 PCI-X adapter, was introduced. Initially, this adapter is intended for attaching pSeries 655 servers to the SP Switch2 by way of one of the available PCI-X slots.

Because the SP Switch2 design improvements are evolutionary steps based on the SP Switch and previous high performance switches, the SP Switch2 is designed to be fully compatible with applications written for the older SP switches. As with these prior switches, the SP Switch2 supports the industry-standard Message Passing Interface (MPI) and Internet Protocol (IP), as well as the IBM Low-level Application Programming Interface (LAPI).

The raw peak performance of the SP Switch and the new SP Switch2 is listed in Table A-1 on page 253.

Table A-1 Switch peak performance

Switch type   Number of cluster nodes   Bandwidth (MB/sec) uni-directional   Bandwidth (MB/sec) bi-directional   Latency (µsec)
SP Switch2    Up to 16                  500                                  1000                                1.0
SP Switch2    17 to 80                  500                                  1000                                1.5
SP Switch2    81 to 512                 500                                  1000                                2.5
SP Switch     Up to 16                  150                                  300                                 1.3
SP Switch     17 to 80                  150                                  300                                 1.9
SP Switch     81 to 512                 150                                  300                                 3.3

This peak (that is, not-to-exceed) performance cannot be achieved by an application.

A.2 Adapter performance

The raw peak performance of the SP Switch and SP Switch2 adapters is listed in Table A-2.

Table A-2 Switch adapter peak performance

Adapter                             Bandwidth (MB/sec) uni-directional   Bandwidth (MB/sec) bi-directional
SP Switch2 adapter                  500                                  1000
SP Switch2 PCI or PCI-X adapters    500                                  1000
SP Switch MX2 adapter               150                                  300
SP System Attachment adapter        150                                  150

Greater detail about these adapters is given in the following sections.

SP Switch2 adapters

The SP Switch2 adapter (SP Switch2 adapter, SP F/C 4025) is available for the 222 MHz and 375 MHz POWER3™ high nodes. The SP Switch2 MX2 adapter (SP Switch2 MX2 adapter, SP F/C 4026) is available for the 375 MHz and 450 MHz POWER3 thin and wide nodes. The SP Switch2 PCI-X adapter (SP Switch2 PCI-X Attachment Adapter, F/C 8398) is available for pSeries 655 nodes. The SP Switch2 PCI adapter (SP Switch2 PCI Attachment Adapter, F/C 8397) is available to attach selected RS/6000® and pSeries servers to the SP Switch2, using PCI slots in the server or I/O drawer (as in the case of the pSeries 670 or 690 servers).

SP Switch adapters

The SP Switch MX2 adapter (SP Switch MX2 adapter, SP F/C 4023) is available for all MX slot-enabled SP nodes. The SP System Attachment Adapter (SP System Attachment Adapter, server F/C 8396) is available to attach various servers to the SP Switch using a PCI slot in the node or server.

These peak performance numbers represent the maximum rate at which data can be given to or taken from the switch by the node and, effectively, become the base communications performance capability of the switch subsystems. Similar to the switch peak performance, the switch adapter peak (that is, not-to-exceed) performance cannot be achieved by an application.
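Before comparing measured results with these peak figures, it can be useful to confirm which switch adapter type is actually installed in a given node. The following is a hedged illustration using standard AIX device commands, not a documented procedure; it assumes the first switch adapter is configured as css0, which is the usual device name on PSSP-managed nodes:

   # list any switch (css) devices configured on the node
   lsdev -C | grep css
   # display the vital product data, including the adapter type, for css0
   lscfg -vl css0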
When the communication occurs between nodes with different adapters supported on the same switch type, the effective peak performance of the switch subsystem is that of the slower adapter. A.3 Node I/O slot performance Since the performance of the SP Switch and SP Switch2 fabrics is faster than some I/O slots available in various pSeries servers, there are several configurations where the same adapter, in different nodes or servers, will perform at different rates. This is due to differences in the I/O subsystems of different RS/6000 or pSeries servers. In the case of the SP Switch2, with a bandwidth of 500 MB/s full duplex, this is much faster than the individual PCI or PCI-X slots that it plugs into. Different I/O drawers or internal slots on various models of the pSeries servers have different maximum bandwidths per PCI slot. This shows up as slightly different performance when identical switch adapters are used in different server models. 254 IBM Eserver pSeries Cluster Systems Handbook Therefore, depending on the I/O subsystem used, the same SP Switch2 PCI or PCI-X adapters will perform differently in various server models. When the SP Switch2 adapter for the 6xx direct memory slot is used in the SP POWER3 SMP High Nodes, the much faster memory slot allows full utilization of the bandwidth in the SP Switch2. Over time, as advances are made in the I/O slots used in pSeries servers, the performance of the I/O buses should catch up to the performance of the SP Switch adapters. This is why initially adapters for the SP Switch and SP Switch2 used memory bus attachment and later used standard I/O slots. Both adapters are now available in both memory buses and PCI bus versions. Initially, the next generation fabric adapters will plug into the direct memory bus. After the standard buses catch up in performance, the adapters will plug into a standard bus. A.4 Application performance High performance internode communication is a key component of the overall performance of many user applications. The most basic measurements to characterize the performance of the communication subsystem are latency and bandwidth. Latency is the overhead associated with sending data between two processors, and is usually quantified in microseconds (µsec). Bandwidth is the rate at which data can be transmitted between two processors, and is typically measured in megabytes per second (MB/s). In this document, for bandwidth, when using MPI, a megabyte is defined as 10 6 bytes. When using IP, a megabyte is defined as 2 20 bytes. Historically, these have been the definitions used when measuring these two protocols. We will characterize internode communication performance over the switch for two communication protocols: the so-called “user space” protocol (used to support the industry standard MPI), and the industry standard IP family of communication protocols (which include TCP/IP and UDP/IP). The user space protocol is sometimes referred to as a “lightweight” protocol because it requires fewer processor cycles to transmit a given amount of data compared to heavier protocols like TCP/IP. User space is most commonly used for scientific and technical computing applications via a message-passing interface. MPI was used to measure user space performance. TCP/IP is utilized in socket interface communication for Appendix A. Performance 255 many commercial applications, and is the basis for popular network protocols such as Network File System (NFS) and File Transfer Protocol (FTP). 
Performance measurements for these two protocols are presented in the next two sections. Several factors contribute to the communication performance that is obtained by a user application. Internode communication performance depends on the processor, the memory subsystem, the switch adapter, and the switch fabric. Therefore, when considering communication performance measurements, it is extremely important to understand the exact configuration of the system to which the data applies. In addition to hardware considerations, the system software contributes to the overhead involved in sending data between processors. Note the following points: The latency and bandwidth measurements represent performance as seen by an application. MPI measured bandwidth increases asymptotically as message size grows very large. Each bandwidth measurement presented in these tables represents the asymptotic values for very large messages. The measurements for a given node were made using the latest release of software generally available at the time of the announcement of that node. The latencies and bandwidths for older nodes are included in these tables for reference. Latencies and bandwidths are measured on two nodes connected to the same switch chip. On the 375 MHz POWER3 High Node, MPI tasks were bound to CPUs in order to reduce performance fluctuations The performance data shown in the following tables was measured under ideal conditions. In all cases except the pSeries 680 server, we configured the maximum number of CPUs and memory per node or server. For the pSeries 680 measurements, we only configured 12 CPUs and 16 Gigabytes of memory. However, this less than full configuration did not impact the performance of the adapters. For each measurement, we configured up to two adapters using customer configurable guidelines. In most cases, both adapters were configured in the same I/O drawer as would be expected in customer configurations. The best possible tuning was done on the systems for these measurements, as well as the latest level of AIX and Parallel System Support Programs for AIX (PSSP) available at the time of measurement. 256 IBM Eserver pSeries Cluster Systems Handbook A.5 MPI/user space On a distributed-memory system, parallel applications perform interprocessor communication via some form of message passing. IBM fully supports MPI as an industry standard. This standardized interface for message passing greatly improves the portability of parallel application codes among different parallel systems. We will discuss the performance of the SP Switch for parallel applications in terms of what can be measured using MPI. Table A-3 through Table A-6 on page 259 show interprocessor communication performance measurements from a FORTRAN program with MPI calls, using the user space protocol. Latency is a measure of the time it takes to send a zero byte message between two processors using mpi_send and mpi_recv from the MPI library. It is calculated as half the time needed for a round trip between processors for that zero byte message. Latency represents the time used to set up a single message for transfer at the level of an application, and may be seen as the initialization overhead for transferring information between applications. Table A-3 through Table A-6 also contain data for bandwidth measurements using MPI over the user space interface. The uni-directional bandwidth, sometimes called “point-to-point” bandwidth, was measured for messages of several megabytes in size. 
The bi-directional bandwidth, sometimes called “exchange” bandwidth, implies simultaneous sending and receiving of messages between processors, thereby achieving a slightly higher data rate. The bi-directional data rate is the sum of the simultaneous data rates in both directions. In Table A-3 through Table A-6, the SP Switch adapter has the lowest latency of all adapters. Table A-3 shows SP Switch2 MPI user space performance with a single MPI per node connected to the same switch. Table A-3 SP Switch2 MPI user space performance Switch adapter SP Switch2 latency (µsec) SP Switch2 bandwidth (MB/sec) Node type Uni-directional Bi-directional 375 MHz POWER3 high node SP Switch2 17.5 350 350 450 MHz POWER3 SMP node SP Switch2 MX2 19.4 168 197 pSeries 680 SP Switch2 25 111 122 Appendix A. Performance 257 pSeries 660-6h1 SP Switch2 PCI 21.1 161 167 pSeries 660-6M1 SP Switch2 PCI 20.6 162 162 pSeries 690 SP Switch2 PCI 18.5 179 225 pSeries 655 SP Switch2 PCI-X 18 239 245 All but one of the results in Table A-4 were obtained using the threaded MPI library with MP_SINGLE_THREAD=yes. (The pSeries 680/SP Switch Attachment adapter results were obtained using the non-threaded MPI library.) Table A-4 shows the SP Switch MPI user performance with a single MPI task per node connected to the same switch chip. Table A-4 SP Switch MPI user space performance Bandwidth (MB/sec) Node type Switch adapter Latency (µsec) Uni-directional Bi-directional 375 MHz POWER3 thin/wide node SPSMX2 19.7 140 192 pSeries 680 SPSAA 36.6 71 87 Measurements for the SP Switch2 Attachment adapter for the pSeries 660 (Models 6H0, 6H1 and 6M1) and the pSeries 680 are included for completeness, even though applications deployed on attached servers will generally use IP rather than MPI communication. These servers will generally be used for commercial computing applications, while MPI is generally used by scientific and technical computing applications. Table A-5 shows the MPI user space performance on nodes with multiple MPI tasks per node for the SP Switch2. Table A-5 MPI user space performance with multiple MPI tasks 258 Node type Switch adapter pSeries 655 SP Switch2 PCI-X Number of MPI tasks per node Bandwidth (MB/sec) Uni-directional Bi-directional 1 or more 235 254 IBM Eserver pSeries Cluster Systems Handbook pSeries690 SP Switch2 PCI 1 or more 179 225 pSeries 6M1 SP Switch2 PCI 1 2 or more 162 163 162 213 pSeries 660-6h1 SP Switch2 PCI 1 2 or more 161 161 167 212 450 MHz POWER3 SMP node SP Switch2 MX2 1 or more 168 197 pSeries 680 SP Switch2 1 or more 111 122 375 MHz POWER3 high node SP Switch2 PCI-X 1 2 4 or more 350 445 445 350 680 720 Table A-6 shows MPI user space performance on nodes with multiple MPI tasks per node for the Dual fabric SP Switch2. Table A-6 MPI user space performance Node type Switch adapter Number of MPI tasks per node Bandwidth (MB/sec) Uni-directional Bi-directional pSeries 655 SP Switch2 PCI-X 1 2 or more 412 460 440 460 pSeries690 SP Switch2 PCI 1 2 or more 337 337 375 385 pSeries 6M1 SP Switch2 PCI 1 2 4 or more 188 309 310 188 347 366 pSeries 660-6h1 SP Switch2 PCI 1 2 4 or more 168 245 295 168 261 324 pSeries 680 SP Switch2 1 2 or more 149 164 151 193 375 MHz POWER3 high node SP Switch2 1 2 4 8 or more 350 664 880 880 350 675 1150 1270 Appendix A. Performance 259 A.6 TCP/IP TCP/IP is a more common industry standard communication protocol used to transfer information between two systems running the IP family of protocols. 
It is a robust protocol that supports multiple users and reliable transport of data. However, since it supports networking function not currently used by MPI (such as multiplexing and guaranteed delivery), it requires higher processor overhead compared to the user space protocol using MPI. This increases the CPU overhead for the same bandwidth when compared to MPI. The performance of the TCP/IP socket protocol on various nodes was measured using a modified version of the Netperf public-domain benchmark, and the results are listed in Table A-7 through Table A-12 on page 263. All Netperf measurements were memory-to-memory to eliminate slower devices (such as disks) from impacting the performance. As with the user space measurements, the same hardware configuration was used for the TCP/IP measurements as for the MPI measurement listed in A.5, “MPI/user space” on page 257. Table A-7 TCP/IP SP Switch performance Bandwidth (MB/sec), Uni-/bi-directional Number of TCP/IP sessions 1 or more Switch adapter 375 MHz POWER3 thin/wide node pSeries 680 node Uni- Bi- Uni- Bi- 135.4 174.6 74.2 90.7 SPSMX2 SPSAA On the SP Switch, the 375 MHz POWER3 SMP Node with the SPSMX2 adapter was the fastest performing adapter and node combination; see Table A-7. In both the p680 and 375 MHz SMP Node, the adapter could be saturated by a single TCP/IP session. 260 IBM Eserver pSeries Cluster Systems Handbook Table A-8 SP Switch2 TCP/IP SP Switch performance Bandwidth (MB/sec), Uni-/Bi-directional single fabric Number of TCP/IP sessions 375 MHz POWER3 high node pSeries 690 450 MHz POWER3 SMP node Uni- Bi- Uni- Bi- Uni- Bi- 1 132 253 168 230 136 210 2 252 472 167 230 156 210 4 or more 440 650 167 230 157 210 Switch adapter SP Switch 2 SP Switch2 PCI SP Switch2 MX2 As shown in Table A-8, the 375 MHz POWER3 SMP high node delivers the highest bandwidth of all node and adapter combinations for the SP Switch or SP Switch2. This performance can primarily be attributed to the use of the memory bus to attach the adapters. All other nodes or servers are limited by the throughput of the bus into which the adapter is plugged. When using the SP Switch2 adapter and switch, you get higher bandwidth when using more than one CPU on the 375 MHz POWER3 high node; see Table A-9. Table A-9 SP Switch2 TCP/IP SP Switch performance Bandwidth (MB/sec), Uni-/Bi-directional single fabric Number of TCP/IP sessions pSeries 660-6M1 pSeries 660-6H1 pSeries 680 Uni- Bi- Uni- Bi- Uni- Bi- 1 or more 176 223 178 132 124 130 Switch adapter SP Switch 2 PCI SP Switch2 PCI SP Switch2 PCI When using dual switch fabrics and using the global interface (ml0), the performance of TCP/IP increases significantly; see Table A-10 on page 262. We do not get perfect scaling to two adapters because of the limitations on aggregate bandwidth to a single I/O drawer. In addition, the global interface introduces some overhead managing both interfaces. Appendix A. Performance 261 Table A-10 SP Switch2 TCP/IP SP Switch performance Bandwidth (MB/sec), Uni-/bi-directional Number of TCP/IP sessions 1 or more pSeries 630 pSeries 655 node Uni- Bi- Uni- Bi- 169 231 238 259 Switch adapter SP Switch 2 PCI SP Switch 2 PCI-X The bandwidth measured is the maximum obtainable both uni-directional and bi-directional over TCP between two identical applications running on two identical nodes; see Table A-11. All Netperf measurements were memory-to-memory to eliminate slower devices such as disks from impacting the performance. 
Table A-11 SP Switch2 dual fabric TCP/IP SP Switch performance Number of TCP/IP sessions Bandwidth (MB/sec), Uni-/bi-directional dual fabric (ml0 interface) pSeries 690 pSeries 655 Uni- Bi- Uni- Bi- 1 317 385 449 495 2 or more 308 397 435 491 Switch adapter SP Switch 2 PCI SP Switch 2 PCI-X The nodes can take advantage of multiple processors if there are multiple IP connections running at the same time. If only one TCP/IP socket is used, the maximum throughput will be similar to the single-processor throughput no matter how many processors are configured in the node. A single TCP/IP socket currently cannot take advantage of multiple processors, due to the single-threaded nature of memory-to-memory copies and the TCP/IP stack; see Table A-12 on page 263. 262 IBM Eserver pSeries Cluster Systems Handbook Table A-12 SP Switch2 dual fabric TCP/IP SP Switch performance Bandwidth (MB/sec), Uni-/Bi-directional single fabric Number of TCP/IP sessions pSeries 660-6M1 pSeries 660-6H1 pSeries 680 Uni- Bi- Uni- Bi- Uni- Bi- 1 197 353 199 323 175 200 2 or more 287 366 288 324 174 199 Switch adapter SP Switch 2 PCI SP Switch2 PCI SP Switch2 PCI Appendix A. Performance 263 264 IBM Eserver pSeries Cluster Systems Handbook Related publications The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook. IBM Redbooks For information on ordering these publications, see “How to get IBM Redbooks” on page 267. Note: Note that some of the documents referenced here may be available in softcopy only. An Introduction to CSM 1.3 for AIX 5L, SG24-6859 Linux Clustering with CSM and GPFS, SG24-6601 RS/6000 SP and Clustered IBM Eserver pSeries System Handbook, SG24-5596 IBM Eserver pSeries 690 and pSeries 670 System Handbook, SG24-7040 Exploiting RS/6000 SP Security: Keeping It Safe, SG24-5521 pSeries Systems Handbook 2003 Edition, SG24-5120 IBM Eserver Certification Study Guide: Cluster 1600 Managed by PSSP, SG24-7013 Effective System Management Using the IBM Hardware Management Console for pSeries, SG24-7038 The Complete Partitioning Guide for IBM pSeries Servers, SG24-7039 pSeries 650 Model 6M2 Technical Overview and Introduction, REDP0194 Configuring a p690 in an IBM Cluster 1600, REDP0187 Other publications These publications are also relevant as further information sources: IBM Eserver Cluster 1600 Planning, Installation and Service Guide, GA22-7863 © Copyright IBM Corp. 2003. All rights reserved. 
265 IBM Eserver Cluster 1600 Planning Volume 2, Control Workstation and Software Environment, GA22-7281 PSSP 3.5 Administration Guide, SA22-7348 IBM Eserver Cluster 1600 Planning Volume 1, Hardware and Physical Environment, GA22-7280 RS/6000 and eServer pSeries: PCI Adapter Placement Reference, SA38-0538 Implementing a Firewalled RS/6000 SP System, GA22-7874 HMC Operations Guide, SA38-0590 PSSP Administration Guide, SA22-7348 PSSP Installation and Migration Guide, SA22-7347 IBM CSM for AIX 5L: Administration Guide, SA22-7918 SP Switch Router Adapter Guide, GA22-7310 pSeries High Performance Switch Planning, Installation, and Service Manual, GA22-7951 Online resources These Web sites and URLs are also relevant as further information sources: White paper on POWER4 System Micro architecture http://www.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.html Microcode Discovery Service Web site http://techsupport.services.ibm.com/server/aix.invscoutMDS pSeries and RS/6000 Microcode Updates Web site http://techsupport.services.ibm.com/server/mdownload PSSP V3.5 documentation at the Web site http://www.rs6000.ibm.com/resource/aix_resource/sp_books/ More information on IBM eServer xSeries servers http://www.ibm.com/servers/eserver/education/xseries/xref.html 266 IBM Eserver pSeries Cluster Systems Handbook Detailed guide on alt_diskinstall http://publib16.boulder.ibm.com/pseries/en_US/infocenter/howto/HT_insgdrf_a ltdiskinstall_clone.htm#insgdrf_altdiskinstall_clone Detailed Base Operating System Migration Guide http://publib16.boulder.ibm.com/pseries/en_US/infocenter/howto/HT_insgdrf_m igrationinstall.htm#insgdrf_migrationinstall How to get IBM Redbooks You can search for, view, or download Redbooks, Redpapers, Hints and Tips, draft publications and Additional materials, as well as order hardcopy Redbooks or CD-ROMs, at this Web site: ibm.com/redbooks Help from IBM IBM Support and downloads ibm.com/support IBM Global Services ibm.com/services Related publications 267 268 IBM Eserver pSeries Cluster Systems Handbook Abbreviations and acronyms AC Alternating Current CWS Control Work Station AIX Advanced Interactive Executive C-SPOC Cluster Single Point of Control DASD Direct Access Storage Device AFS Andrew File System DC Direct Current APAR Authorized Program Analysis Report DCE Distributed Computing Environment API Application Programming Interface DDR Double Data Rate ARP Address Resolution Protocol DFS Distributed File System ATM Asynchronous Transfer Mode DHCP Dynamic Host Control Protocol ATM Asynchronous Transfer Mode DIMM Dual Inline Memory Module BIST Built In Self Tests DVD-RAM BLAS Basic Linear Algebra Subprograms Digital Versatile Disk Random Access Memory DVD-ROM BNC Barrel Nut Connector Digital Versatile Disk - Read Only Memory BPC Bulk Power Controller EIA Electronic Industries Alliance BTU British Thermal Unit ESSL CCGA Ceramic Column Grid Array Engineering and Scientific Subroutine Library CD Compact Disk F/C Feature Code CD-ROM Compact Disk Read Only Memory FCS Fibre Channel Standard FDDI CE Customer Engineer Fiber Distributed Data Interface CEC Central Electronics Complex FFDC First Failure Data Capture CES Clustered Enterprise Server FRU Field Replaceable Unit CFM Configuration File Management FTP File Transfer Protocol Gb Giga bits CIU Core Interface Unit GB Gigabyte (109 byte) CPU Central Processing Unit Gb/s Giga bits per second CSM Cluster Systems Management GB/s Giga Bytes per second GHz Gigahertz (109 hertz) CSS Communication Subsystem GPFS General Parallel File 
System CtSec Cluster Security Services GUI Graphical User Interface CUoD Capacity Upgrade on Demand HACMP IBM High Availability Cluster Multi-Processing for AIX © Copyright IBM Corp. 2003. All rights reserved. 269 HACMP/XD HACMP Extended Distance MASS HACWS High Availability Control Work Station Mathematical Acceleration Subsystem Mb Mega bits HIPPI High Performance Parallel Interface MB Mega Bytes Mb/s Mega bits per second HMC Hardware Management Console MCA Micro Channel Architecture HPC High Performance Computing MCM Multi Chip Module HPS High Performance Switch MES Miscellaneous Equipment Specification Hz Hertz (Cycles per Second) MHz megahertz (106 hertz) I/O Input Output ML Maintenance Level IBM International Business Machines Corporation mm Millimeter IP Internet Protocol MPI Message Passing Interface IPv4 Internet Protocol Version 4 MS Management Server IPv6 Internet Protocol Version 6 NC Non-Cachable ISB Intermediate Switch Board NFS Network File System ITSO International Technical Support Organization NIM Network Installation and Maintenance JFS Journaled File System NQS Network Queuing System KB Kilo Bytes NTP Network Time Protocol kg Kilo-Gram OS Operating System KLAPI Kernel Low level Application Programming Interface PAIDE Performance AIDE PC Personal Computer kVA KiloVolt-Ampere PCI L1 Level 1 Peripheral Component Interconnect L2 Level 2 PCI-X Peripheral Component Interconnect Extended L3 Level 3 PE Parallel Environment LAN Local Area Network PDU Power Distribution Unit LAPI Low Level Application Programming Interface POE Parallel Operating Environment lb pounds (weight) POR Power-on Reset LL LoadLeveler PSSP LPAR Logical Partition Parallel System Support Programs LPP Licensed Programme Product PTX Performance Toolbox LVM Logical Volume Manager RAN Remote Access Node M/T Machine Type RAS Reliability, Availability, Scalability 270 IBM Eserver pSeries Cluster Systems Handbook RIO Remote Input/Output SVC Switch Virtual Circuit RM Resource Managers TCP Transmission Control Protocol RMC Resource Monitoring and Control Subsystem TP Twisted Pair TTY RPQ Request for Price Quotation Terminal Type (UNIX terminal interface) RSA Remote Server Access UDP User Datagram Protocol RSCT Reliable Scalable Cluster Technology US User Space USB Universal Serial Bus SAE Service Action Event UTP Unshielded Twisted Pair SAS Service Agent gateway Server V Volts ScaLAPACK Scalable Linear Algebra Package VLAN Virtual Area Network SCM Single Chip Module VSD Virtual Shared Disks SCSI Small Computer System Interface WAN Wide Area Network WLM Workload Manager SDR System Data Repository SDRAM Synchronous Dynamic Random Access Memory SES SCSI Enclosure Services SFP Service Focal Point SIS System Installation Suite SLES SuSE Linux Enterprise Server SMIT System Management Interface Tool SMP Symmetric Multiprocessors SNI System Network Interface SNM Switch Network Manager SNMP System Network Management Protocol SOI Silicon On Insulator SP Scalable Parallel SP Service Processor SSA Serial Storage Architecture SSB Server Switch Board SSH Secure SHell SSL Secure Sockets Layer std standard Abbreviations and acronyms 271 272 IBM Eserver pSeries Cluster Systems Handbook Index A accounting 173 AIX Performance Toolbox (PTX) 225 Alternate disk migration 238 APAR IY21957 96 IY22854 239 IY28102 87 IY32749 239 IY36001 81 IY36002 81 IY37884 80–81 IY37885 80–81 IY39794 57, 63, 73, 82, 91, 99, 119 IY39795 57, 63, 73, 82, 91, 99, 119 IY41696 81 IY42352 72, 81 IY42353 57, 63, 73, 82, 91, 99, 119 IY42356 57, 63, 73, 82, 91, 99, 
119 IY42359 72 IY42369 57, 63, 72, 82, 90, 99, 119 IY42377 57, 63, 73, 82, 91, 99, 119 IY42379 57, 63, 73, 82, 91, 99, 119 IY42782 57, 63, 73, 82, 91, 99, 119 IY42783 57, 63, 73, 82, 91, 99, 108, 119 IY42847 57, 63, 73, 82, 91, 99, 119 Y42358 72 asynchronous card 27 Audit Log resource manager (AuditLogRM) 188 automounter 5 B Built-in Self Test (BIST) 34 bulk power controller (BPC) 106 C Capacity on Demand (CuOD) 69 Capacity Upgrade on Demand (CUoD) 17 Central Electronics Complex (CEC) 114 central manager 234 Ceramic Column Grid Array (CCGA) 37 Cluster 1600 2, 13, 230, 237 © Copyright IBM Corp. 2003. All rights reserved. components 13 hardware components 52 Cluster Enterprise Servers (CES) 12 cluster management 237 Cluster Systems Management (CSM) xi, 11, 151, 170, 282 advantages 180 modular architecture 5 nodes concept 2 cluster VLAN 154 command cmstat 189 dsh 4–5, 7, 154, 183 dshbak 8, 183 llextRPD 205 llmodify 206 lsmcode 47 lsmcode -A 47 lspv 239 rconsole 153 rpower 153 rsh 8, 183 spframe 148, 244 spmon -d 189 ssh 154, 183 command line 4 Communication Subsystems Support 176 Configuration File Manager (CFM) 8 control workstation (CWS) 16, 242 Core Interface Unit (CIU) 33 CSP 242 D daemon hagsd 189 hatsd 189 directory /cfmroot 183 /spdata/sys1/install/pssplpp/PSSP-3.4 81 /spdata/sys1/install/pssplpp/PSSP-3.5 81 /tmp 4 /var 4 Distributed Command Execution Manager (DCEM) 273 6 Distributed Computing Environment (DCE) 175 dynamic logical partitions (DLPAR) 233 Dynamic Probe Class Library (DPCL) 218 E electronic service agent 48 Engineering and Scientific Subroutine Libraries (ESSL) 170 Enterprise Storage Server (ESS) 232 ESSL (Scientific Subroutine Library) 169 Event Response resource manager (ERRM) 188 F F 70, 107 FAStT Storage Server (FAStT) 232 feature code F/C 0008 90, 99 F/C 0009 90, 99 F/C 0010 82 F/C 0011 82 F/C 0012 56, 62 F/C 0013 62 F/C 0014 72 F/C 0015 72 F/C 1500 15 F/C 2031 15 F/C 2032 15 F/C 2050 17 F/C 2051 17 F/C 2052 17 F/C 2053 17 F/C 2054 18 F/C 2056 18 F/C 2057 18 F/C 2058 18 F/C 2934 43 F/C 2943 41, 43, 241 F/C 2944 41, 43, 79, 107, 241 F/C 2968 118, 123 F/C 2969 123 F/C 2975 123 F/C 2985 123 F/C 2987 122–123 F/C 3124 43 F/C 3125 43 F/C 3151 118 F/C 3154 118 274 IBM Eserver pSeries Cluster Systems Handbook F/C 3166 107 F/C 3167 107 F/C 3256 107 F/C 3257 107 F/C 3628 44 F/C 3636 44 F/C 3756 106 F/C 4011 16, 101 F/C 4012 16 F/C 4020 101 F/C 4022 101 F/C 4023 101 F/C 4598 90, 99 F/C 4962 43, 89, 98, 118, 122–123 F/C 5122 67 F/C 5126 61 F/C 5131 61 F/C 5133 54, 61 F/C 5208 67 F/C 550 15 F/C 5511 77 F/C 5513 77 F/C 5515 77 F/C 5518 77 F/C 6234 106 F/C 6266 55 F/C 6273 61 F/C 6404 97 F/C 6418 97 F/C 6420 81, 111 F/C 6432 99, 110–111 F/C 6433 106 F/C 6434 99, 108, 110–111 F/C 6435 106 F/C 6436 106 F/C 6437 106 F/C 6563 77, 86, 89–90, 95, 98 F/C 6571 77, 86, 89–90, 95, 98 F/C 6578 70 F/C 6579 70 F/C 7039-6420 106 F/C 7040-6432 106 F/C 7040-6434 106 F/C 7176 69 F/C 7177 69 F/C 7178 69 F/C 7315 38 F/C 7316 38, 241 F/C 8120 43, 56, 62, 72, 81, 90, 98 F/C 8121 43, 56, 62, 72, 81, 90, 98 F/C 8122 43, 81, 106, 161 F/C 8123 43, 81, 106, 161 F/C 8131 43 F/C 8132 43 F/C 8133 43 F/C 8136 43 F/C 8137 43 F/C 8396 89–90, 98–99, 108–110, 117–118 F/C 8397 62, 89–90, 98–99, 108–110, 117–118 F/C 8398 62, 71–72, 81, 89–90, 98–99, 108–110 F/C 8691 107 F/C 9047 106 F/C 9049 106 F/C 9123 118 F/C 9125 118 F/C 9172 68 F/C 9176 69 F/C 9177 69 F/C 9178 69 F/C 9222 123 F/C 9223 123 F/C 9302 108 F/C 9305 108 F/C 9310 108 F/C 9315 108 F/C 9320 108 F/C 9581 61 Fiber Distributed Data Interface (FDDI) 137 
Field Replaceable Unit (FRU) 160 Field Replacement Unit (FRU) 50 file /etc/switch.info 149 File System resource manager (FileSystemRM) 188 File Transfer Protocol (FTP) 256 First Failure Data Capture (FFDC) 34 frames 15 G General Parallel File System (GPFS) 170, 192, 234 Gigabit Ethernet 231 GPFS (General Parallel File System) 169 Group Services 189 H HACMP (High Availability Cluster Multi-Processing ) 169 HACMP/XD (Extended Distance) 220 hags 189 Hardware Management Console (HMC) 17, 38 redundant 40 hats 189 heartbeat 234 High Availability Cluster Multiprocessing (HACMP) 170 High Availability Control Workstation (HACWS) 20, 22 High Availability Geographic Cluster (HAGEO) 220 Host resource manager (HostRM) 188 I IBM Recoverable Virtual Shared Disk 175 IBM Virtual Shared Disk 175 inter switch boards (ISB) 105 intermediate switch board (ISB) 157 Internal Ethernet 122 Inventory Scout 46 K Kerberos 51 L LAPI 177 legacy hardware 13 LoadLeveler 234 LoadLeveler (LL) 170, 200 logical partition (LPAR) 27, 233 Logical Volume Manager (LVM) 197 Low-level Application Programming Interface 177 LPAR concept diagram 29 M machine type 7039 41 M/T 14 M/T 7028 41, 163 M/T 7039 41 M/T 7040 41, 161 M/T 7040-61R 161–162 M/T 7040-W42 162 M/T 7045-SW4 104 managed nodes 169 managed servers 169 Index 275 management server 16, 26, 166 management VLAN 153 Message Passing Interface 177 microcode discovery service 46 microcode level 47 microcode update application 45 microcode update files and discovery tool 47 MPI 177 Myrinet 231 R Network File System (NFS) 256 Network Shared Disk (NSD) 231 Network Time Protocol (NTP) 5 node 169, 237 Recoverable Virtual Shared Disk (RVSD) 197 redbook xi Redbooks Web site 267 Contact us xiv redundant HMC 244 Reliability, Availability, Serviceability (RAS) 13 Reliable Scalable Cluster Technology (RSCT) 4, 180 Remote I/O (RIO) 114 request for price quotation (RPQ) 30 Resource Monitoring and Control (RMC) 139 restricted root access 51 roll-back 239 RS/6000 SP 2 P S p630 32 P630 Internal Structure 54, 59, 65 p650 32 p670 32 p690 32 Parallel Environment (PE) 170, 234 Parallel ESSL 170, 234 Parallel Operating Environment (POE) 234 Parallel System Support Program (PSSP) xi advantages 172 frame concept 2 Parallel System Support Programs (PSSP) 11, 25, 170, 282 partitions 237 PE (Parallel Environment) 169 Performance AIDE (PAIDE) 169, 225 Performance Toolbox (PTX) 169 perl 177 Placement 124 Power Distribution Unit (PDU) 68 POWER3 237 POWER4 237 pSeries and RS/6000 microcode updates 46 pSeries High Performance Switch (HPS) 151 pSeries servers 32 p655 32 PSSP (Parallel System Support Programs) 169 public VLAN 154 T N SAMI 242 scheduler node 234 Scientific Subroutines Libraries (SSL) 170 SCSI Enclosure Services (SES) 76 SDR 173 Sensor resource manager (SensorRM) 188 server switch board (SSB) 105, 157 Service Action Event (SAE) 49 Service Agent 245 service director 48 Service Focal Point (SFP) 48, 245 Service Processor (SP) 34 single administrative domain 2 Single Chip Modules (SCM) 58 Software Update Protocol (SUP) 4 SP frame 237 SP Switch 16, 100, 125 SP Switch2 16, 100, 125–126 SP switches 16 Storage Area Network (SAN) 231 subnets 138 switch network interface (SNI) 105 Switch Network Manager (SNM) 105, 158 symmetric multiprocessor (SMP) 53 System Data Repository (SDR) 25, 173 System Management Interface Tool (SMIT) 151 Tcl 177 time synchronization 176 Tool command language 177 276 IBM Eserver pSeries Cluster Systems Handbook Topology Services 189 trusted Ethernet connection 166 trusted network 
139 U Unified Trace Environment (UTE) 218 V Virtual Shared Disk (VSD) 194, 232 Virtual Terminal Window (VTW) 242 VLAN 26 W Workload Manager (WLM) 201 Index 277 278 IBM Eserver pSeries Cluster Systems Handbook IBM Eserver pSeries Cluster Systems Handbook (0.5” spine) 0.475”<->0.875” 250 <-> 459 pages Back cover ® IBM Eserver pSeries Cluster Systems Handbook Overview of Cluster 1600 software components Cluster 1600 machine types, models, and feature codes Solution hints and tips The IBM Eserver Cluster 1600 server, which was introduced to meet the rigorous demands of mission-critical enterprise applications, continues to offer outstanding performance, scalability, reliability, availability, serviceability, and management capabilities. In this IBM Redbook, we highlight the benefits of using a Cluster 1600, and describe which hardware components can be managed by either Parallel System Support Programs (PSSP) Version 3, Release 5, or Cluster Systems Management (CSM) Version 1, Release 3, Modification 2. This publication contains the following information on the Cluster 1600: - Cluster 1600 hardware components - Networking components and considerations - Cluster 1600 software components - Scalability of the Cluster 1600 - Solutions and offerings scenarios The Cluster 1600 helps to reduce the complexities and costs of system management, thus lowering the total cost of ownership and allowing simplification of application service level management. It also provides the infrastructure that supports availability, data sharing, and response time. This redbook will be useful for IT professionals seeking to implement Cluster 1600 mission-critical solutions to address business intelligence applications, server consolidation, and collaborative computing. INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment. For more information: ibm.com/redbooks SG24-6965-00 ISBN 0738499137