Beowulf Cluster Lab

Project Goals
● Establish a resource for researchers on campus with large computing needs.
● Help researchers convert their programs to run on the cluster.
● Research performance bottlenecks.
● Develop tools to improve the usability of clusters.

The Beowulf Cluster Lab is funded by the National Science Foundation Major Research Infrastructure Award No. 0321233.

Cluster Specifications
The main Beowulf Cluster (beowulf.boisestate.edu):
● 61 nodes
● 122 2.4 GHz Intel Xeon processors
● 64GB RAM
● 2.4 TB disk space
● private Gigabit network
● Gigabit connection to the campus backbone
Other clusters:
● 6-processor developmental cluster (tux.boisestate.edu)
● 32-processor teaching cluster (onyx.boisestate.edu)

Cluster Hardware
Compute nodes (about $1400/node for 64 nodes = $90,000):
● Tyan i7505 S2665ANF dual-533 MHz FSB
● dual 2.4 GHz Intel Xeon CPUs with 512K Cache 533MHz FSB
● 2 x 512MB Micron Technology Memory Module 184-pin DIMM PC2100 DDR 266 MHz, unbuffered, non-parity
● Samsung SP4002H disk drives, 80GB 7200RPM ATA100
● HP Broadcom NetXtreme 5782 Gigabit card
● Antec 1080 Plus AMG case with Antec True Power 550W Supply
● Master node: same, except with 4GB RAM and SATA drives with RAID
Networking (about $12,000):
● 3 x Cisco 3750G 24-port Stacking Cluster Gigabit Switches with redundant power supply
Facilities ($29,000):
● Liebert A/C, power setup to handle up to 300 Amps

Cluster Software
● Red Hat Linux 9.0 with custom 2.4.24-bigmem SMP kernel (Fedora Core 1 Linux with stock kernel on the cluster used for teaching)
● Portable Batch System (PBS) for job scheduling
● Parallel Programming Libraries and Tools
  – PVM, MPICH/MPI, LAM/MPI
  – XPVM and XMPI
● Portland Group Cluster Development Toolkit
  – HPF, Fortran 90, Fortran 77, C, C++
  – Parallel graphical debugger
  – Parallel graphical profiler
● GNU C, C++ and Fortran 77 compilers and related tool set like ddd (Data Display Debugger)
● Full suite of other tools available under Linux.
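The figures quoted under Cluster Specifications and Cluster Hardware can be cross-checked with a little arithmetic. The sketch below is illustrative only: the split into 60 compute nodes plus one master node is an assumption (the poster states only 61 nodes), and all variable names are made up for this example.

```python
# Rough cross-check of the poster's hardware figures.
# ASSUMPTION: the 61 nodes are 60 compute nodes plus 1 master node.
COMPUTE_NODES = 60          # assumed split, not stated on the poster
MASTER_NODES = 1            # assumed
CPUS_PER_NODE = 2           # dual Xeon per node
COMPUTE_RAM_GB = 2 * 0.5    # 2 x 512MB modules per compute node
MASTER_RAM_GB = 4           # master node has 4GB RAM

total_nodes = COMPUTE_NODES + MASTER_NODES
total_cpus = total_nodes * CPUS_PER_NODE
total_ram_gb = COMPUTE_NODES * COMPUTE_RAM_GB + MASTER_NODES * MASTER_RAM_GB

# Budget lines as quoted; the node price is "about $1400", so totals are approximate.
node_cost = 64 * 1400
budget = {
    "compute nodes": node_cost,
    "networking": 12_000,
    "facilities": 29_000,
}

print(f"nodes={total_nodes}, cpus={total_cpus}, ram={total_ram_gb:.0f}GB")
print(f"compute nodes ~ ${node_cost:,}, total ~ ${sum(budget.values()):,}")
```

Under the assumed node split, the totals agree with the 122 processors and 64GB RAM listed above, and the compute-node line comes to $89,600, which the poster rounds to $90,000.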
Cluster Setup Experiences
● YACI (Yet Another Cluster Installer) was used for automated installation. The 61-node cluster went from bare disks to fully operational in 12 minutes! YACI is available from Lawrence Livermore National Lab.
● Design choice to go with boxes instead of blades, since cooling boxes is easier and real estate was a relatively minor issue.
● Evaluated AMD Athlon, AMD Opteron and Intel Xeon on the Performance/Power/Price (PPP) factor and chose Intel Xeon.
● Chose to go with a regular PC assembler rather than a “cluster” company to keep costs down and have more control over what goes in each node.

People
● Faculty: Amit Jain (Computer Science) and Paul Michaels (Geophysics)
● Graduate Students: Kevin Nuss, Hongyi Hu and Mason Vail
● Undergraduate Students: Joey Mazzarelli, Brady Catherman, Luke Hindman, Charles Paulson, Jason Main and Oralee Nudson.
The project uses a model of teaming up computer scientists with researchers from other fields to create a synergistic environment.

Projects
Some applications running on the cluster:
● Air Quality Modeling. Paul Dawson (Mechanical Engineering), Kevin Nuss and Charles Paulson.
● Modeling of Ocean Currents. Jodi Mead (Mathematics) and Hongyi Hu.
● Waveform Relaxation. Barbara Zubik-Kowal (Mathematics) and Hongyi Hu.
● Hydraulic Tomography. Tom Clemo (Geophysics) and Kevin Nuss.
● Bioinformatics: Bayesian Analysis of Phylogeny. James Smith (Biology) and Amit Jain.
● Basic Seismic Utilities package. Paul Michaels (Geophysics) and Amit Jain.
● Biologically Inspired Computing. Crowley Davis Research (private company)
● Design Patterns

Projects (contd.)
● Clusmon. Comprehensive web-based cluster monitoring software. (Joey Mazzarelli, Computer Science senior)
● Remote Power Control. A cluster of smart power strips to enable remote hard power on/off, cascaded power on/off, etc. (Brady Catherman, Computer Science junior)
● Parallel Shell. A more capable parallel shell for system administration.
(Mason Vail, Computer Science graduate student)

Figure: Clusmon: Cluster Monitor
Figure: Remote Power Control

Cluster Statistics
● 1179 jobs since July, adding up to about 88,000 CPU-hours.
● Average CPU temperatures: 77F at low load and 100F at full load. The A/C is set to 65F with a tolerance of 4F.
● Hardware failures: extremely low.
  – One disk drive failed right after installation.
  – The memory for one node failed.
● Only one unscheduled “downtime” in the last three months. The A/C compressor was cycling more often than the factory-set limit and, as a result, shut itself off. Because the air flow was maintained, the CPU temperatures still remained below 115F after several hours! The cluster was shut down as a precaution. The solution was simply to set a higher tolerance (4 degrees instead of 2 degrees).
● The experiences gained in this project were used to help Geophysics set up a 10-processor cluster and Mathematics a 20-processor cluster.

Further Work
● Integrate Beowulf clusters with Condor grids.
● Develop a complete catalogue of programs illustrating each design pattern in PVM and MPI.
● Continue to team with researchers to help get their code up and running on clusters.
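For readers who want the Cluster Statistics figures per job or in Celsius, the conversion is simple arithmetic. The sketch below uses only the numbers quoted in that section. The helper name and dictionary labels are illustrative, not part of any cluster tool.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


JOBS = 1179
CPU_HOURS = 88_000  # quoted as "about 88000", so the average is approximate

avg_cpu_hours_per_job = CPU_HOURS / JOBS

# Temperatures quoted in the statistics, in degrees Fahrenheit.
readings_f = {
    "A/C set point": 65,
    "CPU at low load": 77,
    "CPU at full load": 100,
    "CPU ceiling during A/C outage": 115,
}
readings_c = {
    name: round(fahrenheit_to_celsius(temp_f), 1)
    for name, temp_f in readings_f.items()
}

print(f"{avg_cpu_hours_per_job:.1f} CPU-hours per job on average")
for name, temp_c in readings_c.items():
    print(f"{name}: {temp_c} C")
```

The average works out to roughly 75 CPU-hours per job. The individual job sizes are not given, so this says nothing about how the load was distributed.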