3-30-2011 Tutorial
ISTeC Cray
High-Performance Computing System
Richard Casey, PhD
RMRCE
CSU Center for Bioinformatics
Accounts
• To Get an ISTeC Cray Account
– Get ISTeC Cray account request form at:
http://istec.colostate.edu/istec_cray/
– Or send email request to [email protected]
– Submit account request form to Richard Casey
– Accounts are available for faculty, graduate students, postdocs, classes, and others
– Accounts are typically set up within two business days
Access
• To Access the ISTeC Cray
– SSH (secure shell); SFTP (secure FTP)
• PuTTY, FileZilla, others
– Windows
• Secure Shell Client, Secure File Transfer Client
– Mac, Linux
• Terminal window sessions
– Remotely
• VPN (Virtual Private Network)
(http://www.acns.colostate.edu/Connect/VPN-Download)
– Check ACNS website for client software (http://www.acns.colostate.edu/)
Access
• Cray DNS name: cray2.colostate.edu
• Cray IP address: 129.82.103.183
• SSH login:
– ssh -l accountname cray2.colostate.edu
• SFTP file transfers:
– sftp accountname@cray2.colostate.edu
• PuTTY is available at:
– http://www.chiark.greenend.org.uk/~sgtatham/putty/
• FileZilla is available at:
– http://filezilla-project.org/
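• Example: a typical login and file-transfer session from a Mac/Linux terminal might look like the following (the account name "jdoe" and file names are placeholders):

  # log in to the Cray login node
  ssh -l jdoe cray2.colostate.edu

  # from your workstation: copy a local file to the Cray, then quit sftp
  sftp jdoe@cray2.colostate.edu
  sftp> put myprogram.c
  sftp> quit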
System Architecture
• [Cabinet photos, front and back views: compute blades (which contain the compute nodes), the SeaStar 2+ interconnect, and the login, boot, and Lustre file system nodes]
Cray System Architecture
XT6m Compute Node Architecture
• [Block diagram of one compute node: four 6-core NUMA dies ("Greyhound" cores), each with 6 MB of shared L3 cache, DDR3 memory channels, and HyperTransport (HT3) links to the SeaStar 2+ interconnect]
• Each compute node contains 2 processors (2 sockets)
• 64-bit AMD Opteron "Magny-Cours" 1.9 GHz processors
• 1 NUMA processor = 6 cores
• 4 NUMA processors per compute node
• 24 cores per compute node
• 4 NUMA processors per compute blade
• 32 GB RAM (shared) per compute node = 1.664 TB total RAM (ECC DDR3 SDRAM)
• 1.33 GB RAM per core
SeaStar 2+ Interconnect
• 2D torus topology
• 6.8 GB/sec total bandwidth
• ~6.5 us latency
• [Block diagram of the SeaStar 2+ chip: 6-port router, DMA engine, HyperTransport interface, memory, blade control processor interface, and embedded PowerPC 440 processor]
SeaStar 2+ Interconnect
• [Network diagram: compute nodes, the login node, the boot node, and the Lustre file system node are connected via the SeaStar 2+ interconnect; the Lustre filesystem and the SMW (Software Maintenance Workstation, sys admin only) attach to the service nodes]
Modules
• Modules Environment Management Package
• Automatically configures the shell environment via modulefiles
• Each modulefile contains all the information needed to configure the shell for a particular application
• Modules set PATH, MANPATH, shell environment variables, libraries, etc. so you don't have to
• Use modules to easily manage your shell environment and applications
Modules
• After logging in enter:
– module list
rcasey@cray2:~> module list
Currently Loaded Modulefiles:
  1) modules
  2) portals/2.2.0-1.0301.24560.5.2.ss
  3) nodestat/2.2-1.0301.24557.5.1.ss
  4) sdb/1.0-1.0301.24568.5.4.ss
  5) MySQL/5.0.64-1.0301.2899.20.1.ss
  6) lustre-cray_ss_s/1.8.2_2.6.27.48_0.12.1_1.0301.5636.4.1-1.0301.24584.3.6
  7) Base-opts/1.0.2-1.0301.24518.5.1.ss
  8) xtpe-network-seastar
  9) cce/7.2.8
 10) acml/4.4.0
 11) xt-libsci/10.4.9
 12) xt-mpt/5.1.2
 13) pmi/1.0-1.0000
 14) xt-asyncpe/4.5
 15) PrgEnv-cray/3.1.49
Modules
• To see the contents of a module enter:
– module show modulefile
rcasey@cray2:~> module show cce
-------------------------------------------------------------------
/opt/modulefiles/cce/7.2.8:

setenv       CRAYLMD_LICENSE_FILE /opt/cray/cce/cce.lic
setenv       CRAY_BINUTILS_ROOT /opt/cray/cce/7.2.8/cray-binutils
setenv       CRAY_BINUTILS_VERSION /opt/cray/cce/7.2.8
setenv       CRAY_BINUTILS_BIN /opt/cray/cce/7.2.8/cray-binutils/x86_64-unknown-linux-gnu/bin
setenv       LINKER_X86_64 /opt/cray/cce/7.2.8/cray-binutils/x86_64-unknown-linux-gnu/bin/ld
setenv       ASSEMBLER_X86_64 /opt/cray/cce/7.2.8/cray-binutils/x86_64-unknown-linux-gnu/bin/as
setenv       GCC_X86_64 /opt/gcc/4.1.2/snos
setenv       CRAYLIBS_X86_64 /opt/cray/cce/7.2.8/craylibs/x86-64
prepend-path FORTRAN_SYSTEM_MODULE_NAMES ftn_lib_definitions
prepend-path MANPATH /opt/cray/cce/7.2.8/man:/opt/cray/cce/7.2.8/craylibs/man:/opt/cray/cce/7.2.8/CC/
prepend-path NLSPATH /opt/cray/cce/7.2.8/CC/x86-64/nls/En/%N.cat:/opt/cray/cce/7.2.8/craylibs/x86-6
prepend-path INCLUDE_PATH_X86_64 /opt/cray/cce/7.2.8/craylibs/x86-64/include
prepend-path PATH /opt/cray/cce/7.2.8/cray-binutils/x86_64-unknown-linux-gnu/bin:/opt/cray/cce/7.2.
append-path  MANPATH /usr/share/man
-------------------------------------------------------------------
Modules
• To see a description of a module enter:
– module help modulefile
rcasey@cray2:~> module help cce
----------- Module Specific Help for 'cce/7.2.8' -----------------
The modulefile, cce, defines the system paths and environment
variables needed to run the Cray Compile Environment.

Cray Compiling Environment 7.2.8 (CCE 7.2.8)
============================================
Release Date: October 21, 2010

Purpose:
--------
The CCE 7.2.8 update provides Bug Fixes to the CCE 7.2.7 release for
Cray XT and Cray XE systems.
Bugs fixed in CCE 7.2.8 remain private at the request of the reporter.
ETC.
Modules
• To see all available modulefiles enter:
– module avail
• To load a module enter:
– module load modulefile
• To unload a module enter:
– module unload modulefile
• To swap one module for another enter:
– module swap modulefile1 modulefile2
– e.g. module swap PrgEnv-cray PrgEnv-pgi
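• Putting it together, a typical module session might look like this (the exact modules loaded are illustrative):

  module list                          # see what is currently loaded
  module avail                         # see everything that is available
  module swap PrgEnv-cray PrgEnv-gnu   # switch programming environments
  module load perftools                # add a tool for this session
  module unload perftools              # remove it again when finished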
Cray Operating System
• Login node
– Cray Linux Environment (CLE) on login node
– Based on SUSE Linux v.11
• Compute nodes
– Compute Node Linux (CNL) on compute nodes
– Lightweight kernel (stripped-down version of SUSE)
– Maximize performance; maximize stability; minimize OS overhead;
minimize OS jitter
• Somewhat different from a typical cluster environment
– Read-only shared root file system
– Shared root file system mounted from the boot node via DVS (data virtualization service) and the SeaStar interconnect
– All compute nodes have the same directory structure
Compilers
• Cray (Cray Compiler Environment)
– C, C++, Fortran
• PGI (Portland Group)
– C, C++, Fortran
• GNU
– gcc, g++, gfortran
• PathScale
– C, C++, Fortran
• Use PrgEnv modules to select the compiler
– "module load PrgEnv-cray"
– "module load PrgEnv-pgi"
– "module load PrgEnv-pathscale"
– "module load PrgEnv-gnu"
• Drivers "cc", "CC" and "ftn" are used for all compilers
– Drivers automatically include the appropriate libraries for the selected compiler
– i.e. -lmpich, -lsci, -lacml, -lpapi, etc.
Cray Compilers
• C compiler
– *.c
• C++ compiler
– *.CC, *.C++, *.c++
• Fortran compiler
– *.f90, *.F90, *.ftn, *.FTN, *.f, *.F
• Compilation with module drivers
– C: "cc [options] exe.c"
– C++: "CC [options] exe.CC"
– Fortran: "ftn [options] exe.f90"
• Where's Python?
– /usr/bin/python (serial version)
– No module driver yet
– We're checking into parallel Python
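• For example, building a small program looks the same regardless of which PrgEnv module is loaded (file and executable names are placeholders):

  module load PrgEnv-cray        # or PrgEnv-pgi, PrgEnv-gnu, PrgEnv-pathscale
  cc -O2 -o hello hello.c        # "cc" invokes the C compiler of the loaded PrgEnv
  ftn -O2 -o solver solver.f90   # "ftn" invokes the matching Fortran compiler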
A Note on Compilers
• There are a limited number of compiler licenses, i.e. a limited number of simultaneous users compiling:
– 5 licenses for the Cray environment
– 2 licenses for the PathScale environment
– 2 licenses for the Portland Group environment
• Keep trying if you cannot get a license, or use the 'unrestricted' open source compilers, i.e. GNU
Lustre Parallel File System
• Scalable, parallel file system
• Separate metadata servers and file servers
– MDS: metadata server; manages file names and directories
– MDT: metadata target; stores file names, directories, permissions, and layout
– OSS: object storage server; manages file I/O services and network requests
– OST: object storage target; stores data
• lfs command
– lfs df -h
rcasey@cray2:~> lfs df -h
UUID                     bytes     Used  Available  Use%  Mounted on
lustrefs-MDT0000_UUID     1.4T   848.0M       1.3T    0%  /mnt/lustre_server[MDT:0]
lustrefs-OST0000_UUID     7.2T    33.1G       6.8T    0%  /mnt/lustre_server[OST:0]
lustrefs-OST0001_UUID     7.2T    40.0G       6.8T    0%  /mnt/lustre_server[OST:1]
lustrefs-OST0002_UUID     7.2T    45.3G       6.8T    0%  /mnt/lustre_server[OST:2]
lustrefs-OST0003_UUID     7.2T    38.5G       6.8T    0%  /mnt/lustre_server[OST:3]
filesystem summary:      28.6T   156.9G      27.0T    0%  /mnt/lustre_server
Lustre Parallel File System
• Log in to the "login" node
• cd to the "lustrefs" directory
• Run parallel jobs within the "lustrefs" directory
• Data is stored to the OSTs via the SeaStar interconnect
• [Diagram: compute nodes write through the SeaStar interconnect to one MDT and multiple OSTs]
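• A minimal sketch of that workflow (directory and file names are placeholders):

  ssh -l jdoe cray2.colostate.edu
  cd ~/lustrefs/myproject        # work inside the Lustre scratch area
  cp ~/vectoradd.c .             # copy source from /home if needed
  cc -o vectoradd vectoradd.c
  aprun -n 24 ./vectoradd        # parallel jobs must run from lustrefs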
MPI
• MPI
– message passing model
– coarse-grained parallelism
• Module
– "module load xt-mpich2"
• Include file
– #include <mpi.h>
• Library routines
– MPI_Init(&argc,&argv);                    /* initialize the MPI environment */
– MPI_Comm_rank(MPI_COMM_WORLD,&rank);      /* determine rank of calling process */
– MPI_Comm_size(MPI_COMM_WORLD,&numprocs);  /* number of cores */
– MPI_Finalize();                           /* terminate the MPI environment */
• Environment variables
– Large set of env vars to control all aspects of the MPI execution environment
– i.e. MPICH_ENV_DISPLAY (show MPICH2 env vars and their current settings)
• "export MPICH_ENV_DISPLAY=true"
• "aprun -n 1 executable"
• If "module load PrgEnv-cray" then no special compiler options are needed
– i.e. "cc -o vectoradd vectoradd.c"
MPI Stub
/* parallel mpi vector sum program stub */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    float v[50000], part[50000], sum, ptSums[1024];
    int numProcs, rank, sndcount;

    /* initialize MPI environment */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    sndcount = 50000 / numProcs;   /* partition size per process */

    /* initialize vector on the root process */
    if (rank == 0) { for (int i = 0; i < 50000; i++) v[i] = 1; }

    /* scatter vector partitions among processes */
    MPI_Scatter(v, sndcount, MPI_FLOAT, part, sndcount, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* compute the partial sum of this process's partition */
    sum = 0;
    for (int i = 0; i < sndcount; i++) sum += part[i];

    /* gather partial sums from all processes on the root */
    MPI_Gather(&sum, 1, MPI_FLOAT, ptSums, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* sum the partial sums on the root */
    if (rank == 0) {
        sum = 0;
        for (int i = 0; i < numProcs; i++) sum += ptSums[i];
        printf("Vector sum = %f\n", sum);
    }

    /* terminate MPI environment */
    MPI_Finalize();
    return 0;
}
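• A sketch of compiling and running the stub above from the lustrefs directory (file names are placeholders):

  module load PrgEnv-cray        # loaded by default
  cd ~/lustrefs/mpi_c
  cc -o vsum vsum.c              # no extra MPI flags needed with the Cray drivers
  aprun -n 24 ./vsum             # launch 24 MPI processing elements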
OpenMP
• OpenMP
– shared memory model
– fine-grained parallelism
• Module
– "module load xt-mpt" (Cray message passing toolkit)
• Include file
– #include <omp.h>
• Compiler directives
– i.e. #pragma omp parallel num_threads(10)   (forms thread pool, starts parallel execution)
• Runtime library routines
– i.e. int nthreads = omp_get_num_threads()   (gets number of threads in current thread pool)
• Environment variables
– Large set of env vars to control all aspects of the OpenMP execution environment
– i.e. export OMP_NUM_THREADS=2   (set # of threads per rank/processing element)
– i.e. "aprun -n 1 -d 4 executable"   (assign 4 threads on 1 core)
• If "module load PrgEnv-cray" then no special compiler options are needed
– i.e. "cc -o vectoradd vectoradd.c"
OpenMP Stub
/* parallel openmp vector sum program stub */
#include <omp.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    float sum = 0.0, local, a[10];

    /* initialize the vector */
    for (int i = 0; i < 10; i++) a[i] = i * 3.5;

    #pragma omp parallel shared(a, sum) private(local) num_threads(24)
    {
        local = 0.0;

        /* each thread sums a share of the loop iterations */
        #pragma omp for
        for (int i = 0; i < 10; i++)
            local += a[i];

        /* one thread at a time adds its partial sum into the global sum */
        #pragma omp critical
        { sum += local; }
    }
    printf("%f\n", sum);
    return 0;
}
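• A sketch of building and running the OpenMP stub (file names are placeholders):

  cc -o omp_sum omp_sum.c        # with PrgEnv-cray no special OpenMP options are needed
  export OMP_NUM_THREADS=24
  aprun -n 1 -d 24 ./omp_sum     # one PE with 24 threads on one compute node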
Hybrid MPI/OpenMP
• MPI facilitates communication between nodes
• OpenMP manages workload within nodes
• Workflow
– A single MPI process is launched per node
– Each process spawns N threads per node
– At some global sync point, master threads on each node communicate with one another
– Threads belonging to each process continue until another sync point or completion
-- Courtesy LSU
Hybrid MPI/OpenMP Stub
/* hybrid mpi openmp stub */
#include <omp.h>
#include <mpi.h>
#define NUM_THREADS 24

/* Each MPI process spawns a distinct OpenMP master thread;
   limit the number of MPI processes to one per node */
int main(int argc, char *argv[]) {
    int rank, numProcs, c = 0;

    /* set number of OpenMP threads to spawn per MPI process */
    omp_set_num_threads(NUM_THREADS);

    /* initialize MPI environment */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    /* OpenMP threads executed by each MPI process;
       c accumulates one thread-count contribution per thread */
    #pragma omp parallel reduction(+:c)
    {
        c = omp_get_num_threads();
    }

    /* terminate MPI environment */
    MPI_Finalize();
    return 0;
}
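• A sketch of launching the hybrid stub across two nodes (the executable name is a placeholder):

  export OMP_NUM_THREADS=24
  aprun -n 2 -N 1 -d 24 ./hybrid     # 2 MPI PEs, 1 PE per node, 24 threads per PE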
Compute Node Status
• Check whether interactive and batch compute nodes are up or down:
– xtprocadmin
NID  (HEX)  NODENAME    TYPE     STATUS  MODE
 12  0xc    c0-0c0s3n0  compute  up      interactive
 13  0xd    c0-0c0s3n1  compute  up      interactive
 14  0xe    c0-0c0s3n2  compute  up      interactive
 15  0xf    c0-0c0s3n3  compute  up      interactive
 16  0x10   c0-0c0s4n0  compute  up      interactive
 17  0x11   c0-0c0s4n1  compute  up      interactive
 18  0x12   c0-0c0s4n2  compute  up      interactive
 42  0x2a   c0-0c1s2n2  compute  up      batch
 43  0x2b   c0-0c1s2n3  compute  up      batch
 44  0x2c   c0-0c1s3n0  compute  up      batch
 45  0x2d   c0-0c1s3n1  compute  up      batch
 61  0x3d   c0-0c1s7n1  compute  up      batch
 62  0x3e   c0-0c1s7n2  compute  up      batch
 63  0x3f   c0-0c1s7n3  compute  up      batch
• Currently
– 960 batch compute cores
– 288 interactive compute cores
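• For instance, a quick way to count how many compute nodes currently report "up" (a sketch; assumes the column layout shown above):

  xtprocadmin | grep compute | grep -c up     # count compute nodes whose STATUS is "up"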
Compute Node Status
• Check the state of interactive and batch compute nodes and whether they are already allocated to other users' jobs:
– xtnodestat
Current Allocation Status at Mon Feb 21 12:36:45 2011

       C0-0
    n3 -----dfj
    n2 -----c-i
    n1 -----b-h
  c1n0 -----aeg
    n3 SSS;;;;;
    n2    ;;;;;
    n1    ;;;;;
  c0n0 SSS;;;;;
      s01234567

Legend:
   nonexistent node                  S  service node
;  free interactive compute node     -  free batch compute node
A  allocated, but idle compute node  ?  suspect compute node
X  down compute node                 Y  down or admindown service node
Z  admindown compute node

Available compute nodes:  20 interactive,  22 batch

• Reading the display: "C0-0" is the cabinet ID, the row labels (c0n0-n3, c1n0-n3) give the cage and node, and "s01234567" numbers the slots. The upper rows (cage 1) are the batch compute nodes (letters mark allocated nodes, '-' marks free nodes); the lower rows (cage 0) are the interactive compute nodes (';' = free) and the service nodes ('S').
ALPS (Application Level Placement Scheduler)
• For interactive jobs, i.e. development & debugging
• ALPS
– Specifies application resource requirements
– Specifies application placement on compute nodes
– Initiates application launch
– Must be used within the "lustrefs" filesystem
• aprun
– launch parallel applications
• apstat
– show running applications
• apkill PID
– kill running applications
– PID = process ID
• [Diagram: Processing Element (PE) 1 on core 1 through PE N on core N, each with its own thread pool (thread 1, thread 2, ... thread N)]
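• A sketch of the interactive workflow (the executable name and the ID are placeholders):

  cd ~/lustrefs/myproject
  aprun -n 24 ./myapp &      # launch 24 PEs from within the lustrefs filesystem
  apstat                     # show running applications and their IDs
  apkill 12345               # kill a running application by its ID if needed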
Interactive Jobs
• Key parameters
– -n pes:            specifies the number of processing elements (PEs; executables) to launch on compute nodes
– -N pes_per_node:   specifies the number of PEs placed per node
– -m mem_size:       specifies the memory size per PE
• Examples (make sure to set/export the correct OMP_NUM_THREADS for each case):
– "export OMP_NUM_THREADS=1"
• "aprun -n 24 exe":        run 24 PEs on 24 cores; entire node
• "aprun -n 24 -N 4 exe":   run 24 PEs with 4 PEs per node; more memory per PE
• "aprun -n 24 -m 1G exe":  run 24 PEs with 1 GB per PE
– "export OMP_NUM_THREADS=2"
• "aprun -n 24 exe":        run 24 PEs on 24 cores with 2 threads per PE
Batch Jobs
• Torque/PBS Batch Queue Management System
– For submission and management of jobs in batch queues
– Use for jobs with large resource requirements (long-running, # of cores,
memory, etc.)
• List all available queues (brief):
– qstat -Q
rcasey@cray2:~> qstat -Q
Queue     Max  Tot  Ena  Str  Que  Run  Hld  Wat  Trn  Ext  T
--------  ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -
batch       0    0  yes  yes    0    0    0    0    0    0   E
• Show the status of jobs in all queues:
– qstat
– (Note: if there are no jobs running in any of the batch queues, this command shows nothing and just returns the Linux prompt.)
rcasey@cray2:~/lustrefs/mpi_c> qstat
Job id           Name        User      Time Use  S  Queue
---------------  ----------  --------  --------  -  -----
1753.sdb         mpic.job    rcasey           0  R  batch
Batch Jobs
• Submit a job to the default batch queue:
– qsub filename
– “filename” is the name of a file that contains batch queue commands
• Delete a job from the batch queues:
– qdel jobid
– “jobid” is the job ID number as displayed by the “qstat” command.
You must be the owner of the job in order to delete it.
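• A sketch of the full batch workflow (the script name is a placeholder; see the sample script on the next slide):

  cd ~/lustrefs/mpi_c
  qsub mpic.job            # submit; Torque prints a job ID such as 1753.sdb
  qstat                    # watch the job's state (Q = queued, R = running)
  qdel 1753.sdb            # delete the job if necessary (must be the owner)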
Sample Batch Job Script
#!/bin/bash
#PBS -N jobname
#PBS -j oe
#PBS -l mppwidth=24
#PBS -l walltime=1:00:00
#PBS -q batch
cd $PBS_O_WORKDIR
date
aprun -n 24 executable
• PBS directives:
– -N:            name of the job
– -j oe:         combine standard output and standard error in a single file
– -l mppwidth:   specifies the number of cores to allocate to the job
– -l walltime:   specifies the maximum amount of wall clock time for the job to run (hh:mm:ss)
– -q:            specifies which queue to submit the job to
Sample Batch Job Script
• PBS_O_WORKDIR is an environment variable generated by Torque/PBS. It contains the absolute path to the directory from which you submitted your job, and it is required for Torque/PBS to find your executable files.
• Linux commands can be included in the batch job script
• The value set in the aprun "-n" parameter should match the value set in the PBS "mppwidth" directive
– i.e. #PBS -l mppwidth=24
– i.e. aprun -n 24 exe
Performance Analysis
• Perftools, CrayPat
– Trace function calls, loops, call graphs, execution profiles
– Adds overhead to executable & increases runtime
• Load Cray, perftools, & craypat modules before compiling
– “module load PrgEnv-cray”
– “module load perftools”
– “module load xt-craypat”
• Compile code
– Use the Cray compiler wrappers (cc, CC, ftn)
– Make sure object files (*.o) are retained
– If you use Makefiles, modify them to retain object files
– i.e. "cc -c exe.c", then "cc -o exe exe.o"
Performance Analysis
• Generate instrumented executable
– “pat_build [options] exe”
– Creates an instrumented executable “exe+pat”
• Execute instrumented code
– "aprun -n 1 exe+pat"
– Creates file "exe+pat+PID.xf" (PID = process ID)
• Generate reports
– “pat_report [options] exe+pat+PID.xf”
– Outputs performance reports
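• A sketch of the whole CrayPat sequence for a small C program (file names are placeholders):

  module load PrgEnv-cray
  module load perftools
  module load xt-craypat
  cc -c exe.c                    # keep the object file
  cc -o exe exe.o
  pat_build exe                  # produces the instrumented executable exe+pat
  aprun -n 1 ./exe+pat           # run it; writes exe+pat+PID.xf
  pat_report exe+pat+*.xf        # generate the performance report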
Performance Analysis
• [Figure not reproduced -- Courtesy Sandia National Labs]
Debugging Tools
• lgdb
– Based on GNU debugger “gdb”
• Steps:
– "module load xt-lgdb"
– Compile the code with the "-g" option: "cc -g -o exe exe.c"
– Launch the code: lgdb --pes=0 --command="aprun -n 1 exe"
– Get the "target" info from lgdb's output
– Open a second window/session
– Run:
• /opt/xt-tools/lgdb/1.2/xt/x86_64-unknown-linux-gnu/bin/gdb /home/rcasey/lustrefs/debug/exe
– At the gdb prompt enter the "target" info
– Run gdb commands to debug the code
• We're working on installing more debugging tools
Debugging Tools
bounds.c
#include <mpi.h>

void array_bounds(int []);

int main(int argc, char *argv[]) {
    int rank, numprocs, array[10];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    array_bounds(array);
    MPI_Finalize();
    return 0;
}

/* deliberately writes past the end of the 10-element array,
   so there is a bug for the debugger to find */
void array_bounds(int ary[]) {
    for (int i = 0; i < 20; i++) ary[i] = i;
}
First session window:
> cc -g -o bounds bounds.c
> lgdb --pes=0 --command="aprun -n 1 bounds"
-----cryptic output here-----
target remote nid00012:10000

Second session window:
> /opt/xt-tools/lgdb/1.2/xt/x86_64-unknown-linux-gnu/bin/gdb /home/rcasey/lustrefs/debug/bounds
(gdb) target remote nid00012:10000
(gdb) run gdb commands for breakpoints, stack traces, call trees, etc.
Scientific Libraries
• Module
– “module load xt-libsci” (loaded by default)
– Optimized for Cray architecture
• BLAS (Basic Linear Algebra Subroutines)
• BLACS (Basic Linear Algebra Communication Subprograms)
• LAPACK (Linear Algebra Routines)
• ScaLAPACK (Scalable LAPACK)
• IRT (Iterative Refinement Toolkit)
• CRAFFT (Cray Adaptive Fast Fourier Transform Routines)
• FFT (Fast Fourier Transform Routines)
• FFTW2 (the Fastest Fourier Transforms in the West, release 2)
• FFTW3 (the Fastest Fourier Transforms in the West, release 3)
Misc Items
• Backups
– /home directory -> nightly incremental backups
– "lustrefs" directory -> NOT backed up; be sure to copy key files to the /home directory or sftp files off the Cray
• /home directory is very small
– Do not store large files here
– Store large files in “lustrefs” directory
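• For example, to protect results at the end of a run (paths and names are placeholders):

  # on the Cray: copy key files from lustrefs to /home (backed up nightly)
  cp ~/lustrefs/myproject/results.dat ~/results/

  # or, from your own workstation: pull the files off the Cray with sftp
  sftp jdoe@cray2.colostate.edu
  sftp> get lustrefs/myproject/results.dat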
Contact Info
Richard Casey, PhD
ISTeC Cray User Support
Phone: 970-492-4127
Cell: 970-980-5975
Email: [email protected]