
Moab Workload Manager
Administrator Guide 9.0.1
Released: March 2016; Revised: May 16, 2016
© 2016 Adaptive Computing Enterprises, Inc. All rights reserved.
Distribution of this document for commercial purposes in either hard or soft copy form is strictly prohibited without prior
written consent from Adaptive Computing Enterprises, Inc.
Adaptive Computing, Cluster Resources, Moab, Moab Workload Manager, Moab Viewpoint, Moab Cluster Manager, Moab
Cluster Suite, Moab Grid Scheduler, Moab Grid Suite, Moab Access Portal, and other Adaptive Computing products are either
registered trademarks or trademarks of Adaptive Computing Enterprises, Inc. The Adaptive Computing logo and the Cluster
Resources logo are trademarks of Adaptive Computing Enterprises, Inc. All other company and product names may be
trademarks of their respective companies.
Adaptive Computing Enterprises, Inc.
1712 S. East Bay Blvd., Suite 300
Provo, UT 84606
+1 (801) 717-3700
www.adaptivecomputing.com
Welcome
Moab Workload Manager Overview
Chapter 1 Philosophy
Value Of A Batch System
Philosophy And Goals
Workload
Chapter 2 Scheduler Basics
Initial Moab Configuration
Layout Of Scheduler Components
Scheduling Environment
Scheduling Dictionary
Scheduling Iterations And Job Flow
Configuring The Scheduler
Credential Overview
Job Attributes/Flags Overview
Chapter 3 Scheduler Commands
Status Commands
Job Management Commands
Reservation Management Commands
Policy/Configuration Management Commands
End-user Commands
Commands
Checkjob
Checknode
Mcredctl
Mdiag
Mdiag -a
Mdiag -b
Mdiag -c
Mdiag -f
Mdiag -g
Mdiag -j
Mdiag -n
Mdiag -t
Mdiag -p
Mdiag -q
Mdiag -r
Mdiag -R
Mdiag -S
Mdiag -s
Mdiag -T
Mdiag -u
Mjobctl
Mnodectl
Moab
Mrmctl
Mrsvctl
Mschedctl
Mshow
Mshow -a
Mshow -a
Msub
Applying The Msub Submit Filter
Submitting Jobs Via Msub In XML
Mvcctl (Moab Virtual Container Control)
Mvmctl
Showbf
Showq
Showhist.moab.pl
Showres
Showstart
Showstate
Showstats
Showstats -f
TIMESPEC
Deprecated Commands
Canceljob
Changeparam
Diagnose
Releasehold
Releaseres
Resetstats
Runjob
Sethold
Setqos
Setres
Setspri
Showconfig
Chapter 4 Prioritizing Jobs And Allocating Resources
Job Prioritization
Priority Overview
Job Priority Factors
Fairshare Job Priority Example
Common Priority Usage
Prioritization Strategies
Manual Job Priority Adjustment
Node Allocation Policies
Node Access Policies
Node Availability Policies
Task Distribution Policies
Chapter 5 Managing Fairness - Throttling Policies, Fairshare, And Allocation Management
Fairness Overview
Usage Limits/Throttling Policies
Fairshare
Sample FairShare Data File
Accounting, Charging, And Allocation Management
AMCFG Parameters And Flags
Chapter 6 Controlling Resource Access - Reservations, Partitions, And QoS Facilities
Advance Reservations
Reservation Overview
Administrative Reservations
Standing Reservations
Reservation Policies
Configuring And Managing Reservations
Personal Reservations
Partitions
Quality Of Service (QoS) Facilities
Chapter 7 Optimizing Scheduling Behavior – Backfill And Node Sets
Optimization Overview
Backfill
Node Set Overview
Chapter 8 Evaluating System Performance - Statistics, Profiling And Testing
Moab Performance Evaluation Overview
Accounting: Job And System Statistics
Testing New Versions And Configurations
Chapter 9 General Job Administration
Job Holds
Job Priority Management
Suspend/Resume Handling
Checkpoint/Restart Facilities
Job Dependencies
Job Defaults And Per Job Limits
General Job Policies
Using A Local Queue
Job Deadlines
Job Arrays
Chapter 10 General Node Administration
Node Location
Node Attributes
Node Specific Policies
Managing Shared Cluster Resources (Floating Resources)
Managing Node State
Managing Consumable Generic Resources
Enabling Generic Metrics
Enabling Generic Events
Chapter 11 Resource Managers And Interfaces
Resource Manager Overview
Resource Manager Configuration
Resource Manager Extensions
PBS Resource Manager Extensions
Adding New Resource Manager Interfaces
Managing Resources Directly With The Native Interface
Utilizing Multiple Resource Managers
License Management
Resource Provisioning
Managing Networks
Intelligent Platform Management Interface
Resource Manager Translation
Chapter 12 Troubleshooting And System Maintenance
Internal Diagnostics/Diagnosing System Behavior And Problems
Logging Overview
Object Messages
Notifying Administrators Of Failures
Issues With Client Commands
Tracking System Failures
Problems With Individual Jobs
Diagnostic Scripts
Chapter 13 Improving User Effectiveness
User Feedback Loops
User Level Statistics
Enhancing Wallclock Limit Estimates
Job Start Time Estimates
Providing Resource Availability Information
Collecting Performance Information On Individual Jobs
Chapter 14 Cluster Analysis And Testing
Testing New Releases And Policies
Testing New Middleware
Chapter 15 Green Computing Overview
Deploying Adaptive Computing IPMI Scripts
Choosing Which Nodes Moab Powers On Or Off
Enabling Green Computing
Adjusting Green Pool Size
Handling Power-Related Events
Maximizing Scheduling Efficiency
Putting Idle Nodes In Power-Saving States
Troubleshooting Green Computing
Chapter 16 Elastic Computing Overview
Dynamic Nodes
Elastic Trigger
Viewing Node And Trigger Information
Configuring Elastic Computing
Integration With A Private OpenStack Cloud
Chapter 17 About Object Triggers
How-to
Creating A Trigger
Creating VM Triggers
Using A Trigger To Send Email
Using A Trigger To Execute A Script
Using A Trigger To Perform Internal Moab Actions
Requiring An Object Threshold For Trigger Execution
Enabling Job Triggers
Modifying A Trigger
Viewing A Trigger
Checkpointing A Trigger
References
Job Triggers
Node Triggers
Reservation Triggers
Resource Manager Triggers
Scheduler Triggers
Threshold Triggers
Trigger Components
Trigger Exit Codes
Node Maintenance Example
Environment Creation Example
Trigger Variables
About Trigger Variables
How-to
Setting And Receiving Trigger Variables
Externally Injecting Variables Into Job Triggers
Exporting Variables To Parent Objects
Requiring Variables From Generations Of Parent Objects
Requesting Name Space Variables
References
Dependency Trigger Components
Trigger Variable Comparison Types
Internal Variables
Chapter 18 Miscellaneous
User Feedback Overview
Enabling High Availability Features
Malleable Jobs
Identity Managers
Generic System Jobs
Chapter 19 Database Configuration
SQLite3
Connecting To A MySQL Database With An ODBC Driver
Connecting To A PostgreSQL Database With An ODBC Driver
Connecting To An Oracle Database With An ODBC Driver
Installing The Oracle Instant Client
Migrating Your Database To Newer Versions Of Moab
Importing Statistics From Stats/DAY.* To The Moab Database
Chapter 20 Accelerators
Scheduling GPUs
Using GPUs With NUMA
NVIDIA GPUs
GPU Metrics
Intel® Xeon Phi™ Coprocessor Configuration
Intel® Xeon Phi™ Co-processor Metrics
Chapter 21 About Preemption
How-to
Canceling Jobs With Preemption
Checkpointing Jobs With Preemption
Requeueing Jobs With Preemption
Suspending Jobs With Preemption
Using Owner Preemption
Using QoS Preemption
References
Manual Preemption Commands
Preemption Flags
PREEMPTPOLICY Types
Simple Example Of Preemption
Testing And Troubleshooting Preemption
Chapter 22 About Job Templates
How-to
Creating Job Templates
Viewing Job Templates
Applying Templates Based On Job Attributes
Requesting Job Templates Directly
Creating Workflows With Job Templates
References
Job Template Extension Attributes
Job Template Matching Attributes
Job Template Examples
Job Template Workflow Examples
Chapter 23 Moab Workload Manager For Grids
Grid Basics
Grid Configuration Basics
Centralized Grid Management (Master/Slave)
Hierarchical Grid Management
Localized Grid Management
Resource Control And Access
Workload Submission And Control
Reservations In The Grid
Grid Usage Policies
Grid Scheduling Policies
Grid Credential Management
Grid Data Management
Accounting And Allocation Management
Grid Security
Grid Diagnostics And Validation
Chapter 24 About Data Staging
How-to
Configuring The SSH Keys For The Data Staging Transfer Script
Configuring Data Staging
Staging Data To Or From A Shared File System
Staging Data To Or From A Shared File System In A Grid
Configuring The $CLUSTERHOST Variable
Staging Data To Or From A Compute Node
Configuring Data Staging With Advanced Options
References
Sample User Job Script
Chapter 25 Using NUMA With Moab
Using NUMA-Aware With Moab
Using NUMA-Support With Moab
Moab Appendices
Moab Parameters
Multi-OS Provisioning
Event Dictionary
Adjusting Default Limits
Security
Initial Moab Testing
Integrating Other Resources With Moab
Compute Resource Managers
Moab-Torque Integration Guide
Torque/PBS Integration Guide - RM Access Control
Torque/PBS Config - Default Queue Settings
Moab-SLURM Integration Guide
Installation Notes For Moab And Torque For Cray
Provisioning Resource Managers
Validating An XCAT Installation For Use With Moab
Hardware Integration
Moab-NUMA-Support Integration Guide
Interfacing With Moab (APIs)
Considerations For Large Clusters
Configuring Moab As A Service
Migrating From Maui 3.2
Cray-Specific Power Management And Energy-Consumption-by-Job Accounting
Cray Power Management Overview
Enable Moab/Cray Power Management
Moab Energy-Consumption-by-Job Accounting
Cray RUR Configuration
Moab Job Energy Consumption Accounting Configuration
Tracing Energy Usage From The Cray XC System To MAM
Node Allocation Plug-in Developer Kit
Scalable Systems Software Specification
Scalable Systems Software Job Object Specification
Scalable Systems Software Resource Management And Accounting Protocol (SSSRMAP) Message Format
Scalable Systems Software Node Object Specification
Scalable Systems Software Resource Management And Accounting Protocol (SSSRMAP) Wire Protocol
Moab Resource Manager Language Interface Overview
Moab Resource Manager Language Data Format
Managing Resources With SLURM
Moab RM Language Socket Protocol Description
SCHEDCFG Flags
Module-Based Features
Accounting Manager License
Advanced Resource Management License
Elastic Computing License
Grid Manager License
Group Sharing License
Power Management License
Workflow Management License
Welcome
Welcome to the Moab Workload Manager 9.0.1 Administrator Guide.
This guide is intended for Moab Workload Manager system administrators.
The following sections will help you quickly get started:
* Moab Workload Manager Overview: Gives an overview of Moab Workload Manager basics.
* Philosophy: Explains the value of using Moab Workload Manager and the philosophy behind what Moab Workload Manager is designed to do.
Moab Workload Manager Overview
Moab Workload Manager is a scheduling and management system designed for
clusters, grids, and on-demand/utility computing systems. Moab:
* applies site policies and extensive optimizations to orchestrate jobs, services, and other workload across the ideal combination of network, compute, and storage resources.
* enables adaptive computing, allowing compute resources to be customized to changing needs and failed systems to be automatically fixed or replaced.
* increases system resource availability, offers extensive cluster diagnostics, delivers powerful quality of service (QoS) and service level agreement (SLA) features, and provides rich visualization of cluster performance through advanced statistics, reports, and charts. In addition, the Elastic Computing feature allows Moab to temporarily use systems that can provide additional resources to handle increased workload demand (caused by high job backlog) in a more timely manner.
Moab also works with major resource management and resource monitoring
tools. From hardware monitoring systems like IPMI to provisioning systems and
storage managers, Moab takes advantage of domain expertise to allow these
systems to do what they do best, importing their state information and
providing them with the information necessary to do their job better. Moab
uses its global information to coordinate the activities of both resources and
services, which optimizes overall performance in line with high-level mission objectives.
Chapter 1 Philosophy
The scheduler's purpose is to optimally use resources in a convenient and
manageable way. System users want to specify resources, obtain quick
turnaround on their jobs, and have reliable resource allocation. On the other
hand, administrators want to understand both the workload and the resources
available. This includes current state, problems, and statistics—information
about what is happening that is transparent to the end user. Administrators
need an extensive set of options to enable management-enforced policies and tune the system to obtain desired statistics.
There are other systems that provide batch management; however, Moab is
unique in many respects. Moab matches jobs to nodes, dynamically
reprovisions nodes to satisfy workload, and dynamically modifies workload to
better take advantage of available nodes. Moab allows sites to fully visualize
cluster and user behavior. It can integrate and orchestrate resource monitors,
databases, identity managers, license managers, networks, and storage
systems, thus providing a cohesive view of the cluster—a cluster that fully acts
and responds according to site mission objectives.
Moab can dynamically adjust security to meet specific job needs. Moab can
create real and virtual clusters on demand and from scratch that are custom-tailored to a specific request. Moab can integrate visualization services, web
farms, and application servers; it can also create powerful grids of disparate
clusters. Moab maintains complete accounting and auditing records, exporting
this data to information services on command, even providing professional
billing statements to cover all used resources and services.
Moab provides user- and application-centric web portals and powerful graphical
tools for monitoring and controlling every conceivable aspect of a cluster's
objectives, performance, workload, and usage. Moab is unique in its ability to
deliver a powerful user-centric cluster with little effort. Its design is focused on
ROI, better use of resources, increased user effectiveness, and reduced
staffing requirements.
This chapter contains these sections:
* Value of a Batch System
* Philosophy and Goals
* Workload
Value of a Batch System
Batch systems provide centralized access to distributed resources through
mechanisms for submitting, launching, and tracking jobs on a shared resource.
This greatly simplifies use of the cluster's distributed resources, giving users a single system image for managing jobs and the aggregate compute resources available. Batch systems should do much more than just provide a
global view of the cluster, though. Using compute resources in a fair and
effective manner is complex, so a scheduler is necessary to determine when,
where, and how to run jobs to optimize the cluster. Scheduling decisions can be
categorized as follows:
* Traffic Control
* Mission Policies
* Optimizations
Traffic Control
A scheduler must prevent jobs from interfering with one another. If jobs contend for resources,
cluster performance decreases, job execution is delayed, and jobs may fail.
Thus, the scheduler tracks resources and dedicates requested resources to a
particular job, which prevents use of such resources by other jobs.
Mission Policies
Clusters and other HPC platforms typically have specific purposes; to fulfill
these purposes, or mission goals, there are usually rules about system use
pertaining to who or what is allowed to use the system. To be effective, a
scheduler must provide a suite of policies allowing a site to map site mission
policies into scheduling behavior.
Optimizations
The compute power of a cluster is a limited resource; over time, demand
inevitably exceeds supply. Intelligent scheduling decisions facilitate higher job
volume and faster job completion. Though subject to the constraints of the
traffic control and mission policies, the scheduler must use whatever freedom
is available to maximize cluster performance.
Philosophy and Goals
Managers want high system utilization and the ability to deliver various qualities
of service to various users and groups. They need to understand how available
resources are delivered to users over time. They also need administrators to
tune cycle delivery to satisfy the current site mission objectives.
Determining a scheduler's success is contingent upon establishing metrics and
a means to measure them. The value of statistics is best understood if optimal
statistical values are known for a given environment, including workload,
resources, and policies. That is, if an administrator could determine that a site's
typical workload obtained an average queue time of 3.0 hours on a particular
system, that would be a useful statistic; however, if an administrator knew that
through proper tuning the system could deliver an average queue time of 1.2
hours with minimal negative side effects, that would be valuable knowledge.
Moab development relies on extensive feedback from users, administrators,
and managers. At its core, it is a tool designed to manage resources and
provide meaningful information about what is actually happening on the
system.
Management Goals
A manager must ensure that a cluster fulfills the purpose for which it was
purchased, so a manager must deliver cycles to those projects that are most
critical to the success of the funding organizations. Management tasks to fulfill
this role may include the following:
* Define cluster mission objectives and performance criteria
* Evaluate current and historical cluster performance
* Instantly graph delivered service
Administration Goals
An administrator must ensure that a cluster is effectively functioning within the
bounds of the established mission goals. Administrators translate goals into
cluster policies, identify and correct cluster failures, and train users in best
practices. Given these objectives, an administrator may be tasked with each of
the following:
* Maximize utilization and cluster responsiveness
* Tune fairness policies and workload distribution
* Automate time-consuming tasks
* Troubleshoot job and resource failures
* Instruct users in available policies and their use on the cluster
* Integrate new hardware and cluster services into the batch system
End-user Goals
End users are responsible for learning about the resources available, the
requirements of their workload, and the policies to which they are subject.
Using this understanding and the available tools, they find ways to obtain the
best possible responsiveness for their own jobs. A typical end user may have
the following tasks:
* Manage current workload
* Identify available resources
* Minimize workload response time
* Track historical usage
* Identify effectiveness of prior submissions
Workload
Moab can manage a broad spectrum of compute workload types, and it can
optimize all workload types within the same cluster simultaneously, delivering
on the objectives most important to each workload type, which include the
following:
* Batch Workload
* Interactive Workload
* Calendar Workload
* Service Workload
Batch Workload
Batch workload is characterized by a job command file that typically describes
all critical aspects of the needed compute resources and execution
environment. With a batch job, the job is submitted to a job queue and runs
somewhere on the cluster as resources become available. In most cases, the
submitter submits multiple batch jobs with no execution time constraints and
processes job results as they become available.
Moab can enforce rich policies defining how, when, and where batch jobs run to
deliver compute resources to the most important workload and provide
general SLA guarantees while maximizing system utilization and minimizing
average response time.
Interactive Workload
Interactive workload differs from batch in that requestors are interested in
immediate response and are generally waiting for the interactive request to be
executed before going on to other activities. In many cases, interactive
submitters will continue to be attached to the interactive job, routing
keystrokes and other input into the job and seeing both output and error
information in real-time. While interactive workload may be submitted within a
job file, commonly, it is routed into the cluster via a web or other graphical
terminal and the end user may never even be aware of the underlying use of
the batch system.
For managing interactive jobs, the focus is usually on setting aside resources to
guarantee immediate execution or at least a minimal wait time for interactive
jobs. Targeted service levels require management when mixing batch and
interactive jobs. Interactive and other job types can be dynamically steered in
terms of what they are executing as well as in terms of the quantity of
resources required by the application.
Calendar Workload
Calendar workload must be executed at a particular time and possibly in a
regular periodic manner. For such jobs, time constraints range from flexible to
rigid. For example, some calendar jobs may need to complete by a certain
time, while others must run exactly at a given time each day or each week.
Moab can schedule the future and can thus guarantee resource availability at
needed times to allow calendar jobs to run as required. Furthermore, Moab
provisioning features can locate or temporarily create the needed compute
environment to properly execute the target applications.
Service Workload
Moab can schedule and manage both individual applications and long-running
or persistent services. Service workload processes externally-generated
transaction requests while Moab provides the distributed service with needed
resources to meet target backlog or response targets to the service. Examples
of service workload include parallel databases, web farms, and visualization
services. Moab can apply cluster, grid, or dynamically-generated on-demand
resources to the service.
When handling service workload, Moab observes the application in a highly
abstract manner. Using the JOBCFG parameter, aspects of the service jobs can
be discovered or configured with attributes describing them as resource
consumers possessing response time, backlog, state metrics, and associated
QoS targets. In addition, each application can specify the type of compute
resource required (OS, arch, memory, disk, network adapter, data store, and
so forth) as well as the support environment (network, storage, external
services, and so forth).
If the QoS response time/backlog targets of the application are not being
satisfied by the current resource allocation, Moab evaluates the needs of this
application against all other site mission objectives and workload needs and
determines what it must do to locate or create (that is, provision, customize,
secure) the needed resources. With the application resource requirement
specification, a site may also indicate proximity/locality constraints, partition
policies, ramp-up/ramp-down rules, and so forth.
Once Moab identifies and creates appropriate resources, it hands these
resources to the application via a site customized URL. This URL can be
responsible for whatever application-specific handshaking must be done to
launch and initialize the needed components of the distributed application upon
the new resources. Moab engages in the hand-off by providing needed context
and resource information and by launching the URL at the appropriate time.
Related Topics
Malleable Jobs
QOS/SLA Enforcement
Chapter 2 Scheduler Basics
* Initial Moab Configuration
* Layout of Scheduler Components
* Scheduling Environment
  * Scheduling Dictionary
* Scheduling Iterations and Job Flow
* Configuring the Scheduler
* Credential Overview
  * Job Attributes/Flags Overview
Initial Moab Configuration
Configuring an RPM-based install of Moab
When Moab is installed via an RPM source, the moab.cfg file contains only one directive: an #INCLUDE line that reads in all the configuration files in /opt/moab/etc. The usual configuration settings that are normally contained in moab.cfg have been moved to moab-server.cfg. Moab still reads the moab.cfg file and, due to the #INCLUDE directive, reads in all the other configuration files as well.
To configure Moab in the case of an RPM install, you can modify the moab.cfg
file, the moab-server.cfg file, or any of the configuration files that are read in
by moab.cfg such as the accounting manager configuration file (am.cfg) or
the resource manager configuration file (rm.cfg).
The RPMs allow for a client install of Moab, instead of a server install. In this
instance, the moab-server.cfg file is replaced with a moab-client.cfg file.
The server and client RPMs cannot be installed on the same machine.
Basic configuration of Moab
After Moab is installed, there may be minor configuration remaining within the
primary configuration file, moab.cfg. While the configure script automatically
sets these parameters, sites may choose to specify additional parameters. If
the values selected in configure are satisfactory, then this section may be
safely ignored.
The parameters needed for proper initial startup include the following:
SCHEDCFG

The SCHEDCFG parameter specifies how the Moab server will execute and communicate with client requests. The SERVER attribute allows Moab client commands to locate the Moab server and is specified as a URL or in <HOST>[:<PORT>] format. For example:

SCHEDCFG[orion] SERVER=cw.psu.edu

Specifying the server in the Moab configuration file is optional. If nothing is specified, gethostname() is called. You can restart Moab and run mdiag -S to confirm that the correct host name is specified.

The SERVER attribute can also be set using the environment variable $MOABSERVER. Using this variable allows you to quickly change the Moab server that client commands will connect to:

> export MOABSERVER=cluster2:12221

ADMINCFG

Moab provides role-based security enabled via multiple levels of admin access. Users who are to be granted full control of all Moab functions should be indicated by setting the ADMINCFG[1] parameter. The first user in this USERS attribute list is considered the primary administrator; it is the ID under which Moab will execute. For example, the following may be used to enable users greg and thomas as level 1 admins:

ADMINCFG[1] USERS=greg,thomas

Moab may only be launched by the primary administrator user ID. The primary administrator should be configured as a manager/operator/administrator in every resource manager with which Moab will interface. If the msub command will be used, then "root" must be the primary administrator. Moab's home directory and contents should be owned by the primary administrator.

RMCFG

For Moab to properly interact with a resource manager, the interface to this resource manager must be defined as described in the Resource Manager Configuration Overview. Further, it is important that the primary Moab administrator also be a resource manager administrator within each of those systems. For example, to interface to a Torque resource manager, the following may be used:

RMCFG[torque1] TYPE=pbs
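Putting the three parameters together, a minimal moab.cfg for an initial startup might look like the following sketch; the scheduler name, host, port, and user are illustrative placeholders, not values prescribed by this guide:

SCHEDCFG[moab] SERVER=headnode.example.com:42559
ADMINCFG[1]    USERS=root
RMCFG[torque1] TYPE=pbs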
Related Topics
Parameter Overview
mdiag -C command (for diagnosing current Moab configuration)
Layout of Scheduler Components
Moab is initially unpacked into a simple one-deep directory structure. What
follows demonstrates the default layout of scheduler components; some of the
files (such as log and statistics files) are created while Moab runs.
* $(MOABHOMEDIR) (default is /opt/moab; can be modified via the --with-homedir parameter during ./configure) contains the following files:
.moab.ck - Checkpoint file
.moab.pid - Lock file
moab.lic - License file
contrib/ - Directory containing contributed code and plug-ins
.counters - File containing the last 3 counters for InsightIDs, jobs, and reservations respectively. Created during installation and required for Moab operation.
docs/ - Directory for documentation
etc/ - Directory for configuration files
    moab.cfg - General configuration file
    moab.dat - Configuration file generated by Moab Cluster Manager
    moab-private.cfg - Secure configuration file containing private information
lib/ - Directory for library files (primarily for tools/)
log/ - Directory for log files
    moab.log - Log file
    moab.log.1 - Previous log file
stats/ - Directory for statistics files:
    events.<date> - event files
    {DAY|WEEK|MONTH|YEAR}.<date> - usage profiling data
    FS.<PARTITION>.<epochtime> - fairshare usage data
samples/ - Directory for sample configuration files, simulation trace files, etc.
* $(MOABINSTDIR) (default is /opt/moab; can be modified via the --prefix parameter during ./configure) contains the following files:
bin/ - Directory for client commands (for example, showq, setres, etc.)
sbin/ - Directory for server daemons
    moab - Moab binary
tools/ - Directory for resource manager interfaces and local scripts
* /etc/moab.cfg - If the Moab home directory cannot be found at startup, this file is checked to see if it declares the Moab home directory. If a declaration exists, the system checks the declared directory to find Moab. The syntax is: MOABHOMEDIR=<DIRECTORY>.
If you want to run Moab from a different directory other than /opt/moab but
did not use the --with-homedir parameter during ./configure, you can
set the $MOABHOMEDIR environment variable, declare the home directory in
the /etc/moab.cfg file, or use the -C command line option when using the
Moab server or client commands to specify the configuration file location.
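For example, any of the following approaches would point Moab at a non-default home directory; the path /var/spool/moab is illustrative:

# Option 1: set the environment variable
> export MOABHOMEDIR=/var/spool/moab

# Option 2: declare the home directory in /etc/moab.cfg
MOABHOMEDIR=/var/spool/moab

# Option 3: point an individual command at a specific configuration file
> moab -C /var/spool/moab/moab.cfg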
When Moab runs, it creates a log file, moab.log, in the log/ directory and
creates a statistics file in the stats/ directory with the naming convention
events.WWW_MMM_DD_YYYY (for example, events.Sat_Oct_10_2009).
Additionally, a checkpoint file, .moab.ck, and lock file, .moab.pid, are
maintained in the Moab home directory.
Layout of Scheduler Components with Integrated Database Enabled

If USEDATABASE INTERNAL is configured, the layout of scheduler components varies slightly. The .moab.ck file and usage profiling data (stats/{DAY|WEEK|MONTH|YEAR}.<date>) are stored in the moab.db database. In addition, the event information is stored in both the event files (stats/events.<date>) and moab.db.
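Enabling the integrated database requires only the corresponding entry in moab.cfg:

USEDATABASE INTERNAL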
Related Topics
Commands Overview
(Optional) Install Moab Client
Scheduling Environment
Moab functions by manipulating a number of elementary objects, including
jobs, nodes, reservations, QoS structures, resource managers, and policies.
Multiple minor elementary objects and composite objects are also used; these
objects are defined in the scheduling dictionary.
* Jobs
  * Job States
  * Task Group (or Req)
* Nodes
* Advance Reservations
* Policies
* Resources
* Task
* PE
* Class (or Queue)
* Resource Manager (RM)
Jobs
Job information is provided to the Moab scheduler from a resource manager
such as LoadLeveler, PBS, Wiki, or LSF. Job attributes include ownership of the
job, job state, amount and type of resources required by the job, and a
wallclock limit indicating how long the resources are required. A job consists of
one or more task groups, each of which requests a number of resources of a
given type; for example, a job may consist of two task groups, the first asking
for a single master task consisting of 1 IBM SP node with at least 512 MB of RAM
and the second asking for a set of slave tasks such as 24 IBM SP nodes with at
least 128 MB of RAM. Each task group consists of one or more tasks where a
task is defined as the minimal independent unit of resources. By default, each
task is equivalent to one processor. In SMP environments, however, users may
wish to tie one or more processors together with a certain amount of memory
and other resources.
Job States
The job's state indicates its current status and eligibility for execution and can
be any of the values listed in the following tables:
Table 2-1: Pre-execution states
Deferred - Job that has been held by Moab due to an inability to schedule the job under current conditions. Deferred jobs are held for DEFERTIME before being placed in the idle queue. This process is repeated DEFERCOUNT times before the job is placed in batch hold.
Hold - Job is idle and is not eligible to run due to a user, (system) administrator, or batch system hold (also, batchhold, systemhold, userhold).
Idle - Job is currently queued and eligible to run but is not executing (also, notqueued).
NotQueued - The job has not been queued.
Unknown - Moab cannot determine the state of the job.
Table 2-2: Execution states
Starting - The batch system has attempted to start the job and the job is currently performing pre-start tasks that may include provisioning resources, staging data, or executing system pre-launch scripts.
Running - Job is currently executing the user application.
Suspended - Job was running but has been suspended by the scheduler or an administrator; the user application is still in place on the allocated compute resources, but it is not executing.
Table 2-3: Post-execution states
Completed - Job has completed running without failure.
Removed - Job has run to its requested walltime successfully but has been canceled by the scheduler or resource manager due to exceeding its walltime or violating another policy; includes jobs canceled by users or administrators either before or after a job has started.
Vacated - Job canceled after partial execution due to a system failure.
Task Group (or Req)
A job task group (or req) consists of a request for a single type of resources. Each task group consists of the following components:

Task Definition - A specification of the elementary resources that compose an individual task.
Resource Constraints - A specification of conditions that must be met for resource matching to occur. Only resources from nodes that meet all resource constraints may be allocated to the job task group.
Task Count - The number of task instances required by the task group.
Task List - The list of nodes on which the task instances are located.
Task Group Statistics - Statistics tracking resource utilization.
Nodes
Moab recognizes a node as a collection of resources with a particular set of
associated attributes. This definition is similar to the traditional notion of a node
found in a Linux cluster or supercomputer wherein a node is defined as one or
more CPUs, associated memory, and possibly other compute resources such as
local disk, swap, network adapters, and software licenses. Additionally, this
node is described by various attributes such as an architecture type or
operating system. Nodes range in size from small uniprocessor PCs to large
symmetric multiprocessing (SMP) systems where a single node may consist of
hundreds of CPUs and massive amounts of memory.
In many cluster environments, the primary source of information about the
configuration and status of a compute node is the resource manager. This
information can be augmented by additional information sources including
node monitors and information services. Further, extensive node policy and
node configuration information can be specified within Moab via the graphical
tools or the configuration file. Moab aggregates this information and presents a
comprehensive view of the node configuration, usages, and state.
While a node in Moab in most cases represents a standard compute host, nodes
may also be used to represent more generalized resources. The GLOBAL node
possesses floating resources that are available cluster wide, and created virtual
nodes (such as network, software, and data nodes) track and allocate resource
usage for other resource types.
For additional node information, see General Node Administration.
Advance Reservations
An advance reservation dedicates a block of specific resources for a particular
use. Each reservation consists of a list of resources, an access control list, and a
time range for enforcing the access control list. The reservation ensures the
matching nodes are used according to the access controls and policy
constraints within the time frame specified. For example, a reservation could
reserve 20 processors and 10 GB of memory for users Bob and John from
Friday 6:00 a.m. to Saturday 10:00 p.m. Moab uses advance reservations
extensively to manage backfill, guarantee resource availability for active jobs,
allow service guarantees, support deadlines, and enable metascheduling.
Moab also supports both regularly recurring reservations and the creation of
dynamic one-time reservations for special needs. Advance reservations are
described in detail in the Advance Reservations overview.
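As an illustrative sketch only, the processor portion of a reservation like the one described above might be created with the mrsvctl command; the task count, ACL, and date expressions below are hypothetical, not values from this guide:

> mrsvctl -c -t 20 -a USER==bob,USER==john -s 6:00_06/09 -e 22:00_06/10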
Policies

Policies are generally specified via a configuration file and control how and when jobs start. They include job prioritization, fairness policies, fairshare configuration policies, and scheduling policies.
Resources
Jobs, nodes, and reservations all deal with the abstract concept of a resource.
A resource in the Moab world is one of the following:
processors - Specify with a simple count value
memory - Specify real memory or RAM in megabytes (MB)
swap - Specify virtual memory or swap in megabytes (MB)
disk - Specify local disk in megabytes (MB)
In addition to these elementary resource types, there are two higher-level resource concepts used within Moab: the task and the processor equivalent (PE).
Task
A task is a collection of elementary resources that must be allocated together
within a single node. For example, a task may consist of one processor, 512 MB
of RAM, and 2 GB of local disk. A key aspect of a task is that the resources
associated with the task must be allocated as an atomic unit, without spanning
node boundaries. A task requesting 2 processors cannot be satisfied by
allocating 2 uniprocessor nodes, nor can a task requesting 1 processor and 1
GB of memory be satisfied by allocating 1 processor on 1 node and memory on
another.
In Moab, when jobs or reservations request resources, they do so in terms of
tasks typically using a task count and a task definition. By default, a task maps
directly to a single processor within a job and maps to a full node within
reservations. In all cases, this default definition can be overridden by specifying
a new task definition.
Within both jobs and reservations, depending on task definition, it is possible to
have multiple tasks from the same job mapped to the same node. For
example, a job requesting 4 tasks, using the default task definition of 1 processor per task, can be satisfied by 2 dual-processor nodes.
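As a sketch, with a PBS-style resource manager such a 4-task request might be expressed at submission time (the script name is a placeholder); under the default task definition, Moab may then place these tasks on 2 dual-processor nodes:

> msub -l nodes=4 myjob.sh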
PE
The concept of the processor equivalent, or PE, arose out of the need to
translate multi-resource consumption requests into a scalar value. It is not an
elementary resource but rather a derived resource metric. It is a measure of
the actual impact of a set of requested resources by a job on the total
resources available system wide. It is calculated as follows:
PE = MAX(ProcsRequestedByJob  / TotalConfiguredProcs,
         MemoryRequestedByJob / TotalConfiguredMemory,
         DiskRequestedByJob   / TotalConfiguredDisk,
         SwapRequestedByJob   / TotalConfiguredSwap) * TotalConfiguredProcs
For example, if a job requested 20% of the total processors and 50% of the
total memory of a 128-processor MPP system, only two such jobs could be
supported by this system. The job is essentially using 50% of all available
resources since the system can only be scheduled to its most constrained
resource - memory in this case. The processor equivalents for this job should
be 50% of the processors, or PE = 64.
Another example: Assume a homogeneous 100-node system with 4 processors
and 1 GB of memory per node. A job is submitted requesting 2 processors and
768 MB of memory. The PE for this job would be calculated as follows:
PE = MAX(2/(100*4), 768/(100*1024)) * (100*4) = 3.
This result makes sense since the job would be consuming 3/4 of the memory
on a 4-processor node.
The calculation works equally well on homogeneous or heterogeneous
systems, uniprocessor or large SMP systems.
Class (or Queue)
A class (or queue) is a logical container object that implicitly or explicitly applies
policies to jobs. In most cases, a class is defined and configured within the
resource manager and associated with one or more of the following attributes
or constraints:
Default Job Attributes - A queue may be associated with a default job duration, default size, or default resource requirements.
Host Constraints - A queue may constrain job execution to a particular set of hosts.
Job Constraints - A queue may constrain the attributes of jobs that may be submitted, including setting limits such as max wallclock time and minimum number of processors.
Access List - A queue may constrain who may submit jobs into it based on such things as user lists and group lists.
Special Access - A queue may associate special privileges with jobs, including adjusted job priority.
As stated previously, most resource managers allow full class configuration
within the resource manager. Where additional class configuration is required,
the CLASSCFG parameter may be used.
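As a hedged sketch, supplemental class configuration in moab.cfg might look like the following; the queue name and limits are illustrative:

CLASSCFG[batch] DEFAULT.WCLIMIT=1:00:00 MAX.WCLIMIT=24:00:00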
Moab tracks class usage as a consumable resource allowing sites to limit the
number of jobs using a particular class. This is done by monitoring class
initiators that may be considered to be a ticket to run in a particular class. Any
compute node may simultaneously support several types of classes and any
number of initiators of each type. By default, nodes will have a one-to-one
mapping between class initiators and configured processors. For every job task
run on the node, one class initiator of the appropriate type is consumed. For
example, a 3-processor job submitted to the class "batch" consumes three
batch class initiators on the nodes where it runs.
Using queues as consumable resources allows sites to specify various policies
by adjusting the class initiator to node mapping. For example, a site running
serial jobs may want to allow a particular 8-processor node to run any
combination of batch and special jobs subject to the following constraints:
* Only 8 jobs of any type allowed simultaneously.
* No more than 4 special jobs allowed simultaneously.
To enable this policy, the site may set the node's MAXJOB policy to 8 and
configure the node with 4 special class initiators and 8 batch class initiators.
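The Moab half of this example is a single line in moab.cfg (the node name is illustrative; the 4-special/8-batch class initiator mapping itself is typically defined in the resource manager's queue and node configuration):

NODECFG[node013] MAXJOB=8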
In virtually all cases, jobs have a one-to-one correspondence between
processors requested and class initiators required. However, this is not a
requirement, and with special configuration, sites may choose to associate job
tasks with arbitrary combinations of class initiator requirements.
In displaying class initiator status, Moab signifies the type and number of class
initiators available using the format [<CLASSNAME>:<CLASSCOUNT>]. This is
most commonly seen in the output of node status commands indicating the
number of configured and available class initiators, or in job status commands
when displaying class initiator requirements.
Resource Manager (RM)
While other systems may have more strict interpretations of a resource
manager and its responsibilities, Moab's multi-resource manager support
allows a much more liberal interpretation. In essence, any object that can
provide environmental information and environmental control can be used as a
resource manager, including sources of resource, workload, credential, or
policy information such as scripts, peer services, databases, web services,
hardware monitors, or even flat files. Likewise, Moab considers any tool that provides control over the cluster environment to be a resource manager, whether that is a license manager, queue manager, checkpoint facility, provisioning manager, network manager, or storage manager.
Moab aggregates information from multiple unrelated sources into a larger
more complete world view of the cluster that includes all the information and
control found within a standard resource manager such as Torque, including
node, job, and queue management services. For more information, see the
Resource Managers and Interfaces overview.
Arbitrary Resource
Nodes can also be configured to support various arbitrary resources. Use the
NODECFG parameter to specify information about such resources. For example, you could configure a node to have 256 MB RAM, 4 processors, 1 GB of swap, and 2 tape drives.
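As a sketch, the arbitrary-resource portion of such a node might be declared in moab.cfg with a generic resource entry; the node and resource names are illustrative:

NODECFG[node001] GRES=tape:2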
Scheduling Dictionary
Account
Definition
A credential also known as "project ID." Multiple users may be associated with a single account ID and each user may have access to multiple accounts. (See credential definition and the ACCOUNTCFG parameter.)
Example
ACCOUNT=hgc13
ACL (Access Control List)
Definition
In the context of scheduling, an access control list is used and applied much as it is elsewhere. An
ACL defines what credentials are required to access or use particular objects. The principal objects to
which ACLs are applied are reservations and QoSs. ACLs may contain both allow and deny statements, include wildcards, and contain rules based on multiple object types.
Example
Reservation META1 contains 4 access statements:
* Allow jobs owned by user "john" or "bob"
* Allow jobs with QoS "premium"
* Deny jobs in class "debug"
* Allow jobs with a duration of less than 1 hour
Allocation
Definition
A logical, scalar unit assigned to users on a credential basis, providing access to a particular quantity of compute resources. Allocations are consumed by jobs associated with those credentials.
Example
ALLOCATION=30000
Class
Definition
(See Queue.) A class is a logical container object that holds jobs, allowing a site to associate various constraints and defaults with these jobs. Class access can also be tied to individual nodes, defining whether a particular node will accept a job associated with a given class. Class-based access to a node is denied unless explicitly allowed via resource manager configuration. Within Moab, classes are tied to jobs as a credential.
Example
job "cw.073" is submitted to class batch
node "cl02" accepts jobs in class batch
reservation weekend allows access to jobs in class batch
CPU
Definition
A single processing unit. A CPU is a consumable resource. Nodes typically consist of one or more
CPUs. (same as processor )
Credential
Definition
An attribute associated with jobs and other objects that determines object identity. In the case of
schedulers and resource managers, credential based policies and limits are often established. At
submit time, jobs are associated with a number of credentials such as user, group, account, QoS, and class. These job credentials subject the job to various policies and grant it various types of access.
In most cases, credentials set both the privileges of the job and the ID of the actual job executable.
Example
Job "cw.24001" possesses the following credentials:
USER=john;GROUP=staff;ACCOUNT=[NONE];
QOS=[DEFAULT];CLASS=batch
Disk
Definition
A quantity of local disk available for use by batch jobs. Disk is a consumable resource.
Execution Environment
Definition
A description of the environment in which the executable is launched. This environment may
include attributes such as the following:
* an executable
* command line arguments
* input file
* output file
* local user ID
* local group ID
* process resource limits
Example
Job "cw.24001" possesses the following execution environment:
EXEC=/bin/sleep;ARGS="60";
INPUT=[NONE];OUTPUT=[NONE];
USER=loadl;GROUP=staff;
Fairshare
Definition
A mechanism that allows historical resource utilization information to be incorporated into job priority decisions.
Fairness
Definition
The access to shared compute resources that each user is granted. Access can be equal or based on
factors such as historical resource usage, political issues, and job value.
Group
Definition
A credential typically directly mapping to a user's UNIX group ID.
Job
Definition
The fundamental object of resource consumption. A job contains the following components:
* A list of required consumable resources
* A list of resource constraints controlling which resources may be allocated to the job
* A list of job constraints controlling where, when, and how the job should run
* A list of credentials
* An execution environment
Job Constraints
Definition
A set of conditions that must be fulfilled for the job to start. These conditions are far reaching and
may include one or more of the following:
* When the job may run. (After time X, within Y minutes.)
* Which resources may be allocated. (For example, the node must possess at least 512 MB of RAM, run only in partition C, or run on HostA and HostB.)
* Starting the job relative to a particular event. (Start after job X successfully completes.)
Example
RELEASETIME>='Tue Feb 12, 11:00AM'
DEPEND=AFTERANY:cw.2004
NODEMEMORY==256MB
Memory
Definition
A quantity of physical memory (RAM). Memory is provided by compute nodes. It is required as a constraint or consumed as a consumable resource by jobs. Within Moab, memory is tracked and reported in megabytes (MB).
Example
Node "node001" provides the following resources:
PROCS=1,MEMORY=512,SWAP=1024
"Job cw.24004" consumes the following resources per task:
PROCS=1,MEMORY=256
Node
Definition
A node is the fundamental object associated with compute resources. Each node contains the
following components:
* A list of consumable resources
* A list of node attributes
Node Attribute
Definition
A node attribute is a non-quantitative aspect of a node. Attributes typically describe the node itself or possibly aspects of various node resources such as processors or memory. While it is probably not optimal to aggregate node and resource attributes together in this manner, it is common practice. Common node attributes include processor architecture, operating system, and processor speed. Jobs often specify that resources be allocated from nodes possessing certain node attributes.
Example
ARCH=AMD,OS=LINUX24,PROCSPEED=950
Node Feature
Definition
A node feature is a node attribute that is typically specified locally via a configuration file. Node features are opaque strings associated with the node by the resource manager that generally have meaning only to the end-user, or possibly to the scheduler. A node feature is commonly associated with a subset of nodes, allowing end-users to request use of this subset by requiring that resources be allocated from nodes with this feature present. In many cases, node features are used to extend the information provided by the resource manager.
Example
FEATURE=s950,pIII,geology
This may be used to indicate that the node possesses a 950 MHz Pentium III processor and that the node is owned by the Geology department.
Processor
Definition
A processing unit. A processor is a consumable resource. Nodes typically consist of one or more processors. (same as CPU)
Quality of Service (QoS)
Definition
An object that provides special services, resources, and so forth.
Queue
Definition
(see Class )
Reservation
Definition
An object that reserves a specific collection of resources for a specific timeframe for use by jobs that meet specific conditions.
Example
Reserve 24 processors and 8 GB of memory from time T1 to time T2 for use by user X or jobs in the
class batch.
Resource
Definition
Hardware, generic resources such as software, and features available on a node, including memory,
disk, swap, and processors.
Resource, Available
Definition
A compute node's configured resources minus the maximum of the sum of the resources utilized by all job tasks running on the node and the resources dedicated; that is:
R.Available = R.Configured - MAX(R.Dedicated, R.Utilized)
In most cases, resources utilized will be associated with compute jobs that the batch system has started on the compute nodes, although resource consumption may also come from the operating system or rogue processes outside of the batch system's knowledge or control. Further, in a well-managed system, utilized resources are less than or equal to dedicated resources, and when exceptions are detected, one or more usage-based limits are activated to preempt the jobs violating their requested resource usage.
Example
Node "cl003" has 4 processors and 512 MB of memory. It is executing 2 tasks of job "clserver.0041" that are using 1 processor and 60 MB of memory each. One processor and 250 MB of memory are reserved for user "jsmith" but are not currently in use.
Resources available to user jsmith on node "cl003":
* 2 processors
* 392 MB memory
Resources available to a user other than jsmith on node "cl003":
* 1 processor
* 142 MB memory
Resource, Configured
Definition
The total amount of consumable resources that are available on a compute node for use by job tasks.
Example
Node "cl003" has 4 processors and 512 MB of memory. It is executing 2 tasks of job "clserver.0041"
that are using 1 processor and 60 MB of memory each. One processor and 250 MB of memory are
reserved for user "jsmith" but are not currently in use.
Configured resources for node "cl003":
* 4 processors
* 512 MB memory
Resource, Consumable
Definition
Any object that can be used (that is, consumed and thus made unavailable to another job) by, or
dedicated to a job is considered to be a resource. Common examples of resources are a node's
physical memory or local disk. As these resources may be given to one job and thus become
unavailable to another, they are considered to be consumable. Other aspects of a node, such as its
operating system, are not considered to be consumable since its use by one job does not preclude its
use by another. Note that some node objects, such as a network adapter, may be dedicated under
some operating systems and resource managers and not under others. On systems where the
network adapter cannot be dedicated and the network usage per job cannot be specified or tracked,
network adapters are not considered to be resources, but rather attributes.
Nodes possess a specific quantity of consumable resources such as real memory, local disk, or
processors. In a resource management system, the node manager may choose to report only those
configured resources available to batch jobs. For example, a node may possess an 80-GB hard drive
but may have only 20 GB dedicated to batch jobs. Consequently, the resource manager may report
that the node has 20 GB of local disk available when idle. Jobs may explicitly request a certain
quantity of consumable resources.
Resource, Constraint
Definition
A resource constraint imposes a rule on which resources can be used to match a resource request.
Resource constraints either specify a required quantity and type of resource or a required node
attribute. All resource constraints must be met by any given node to establish a match.
Resource, Dedicated
Definition
A job may request that a block of resources be dedicated while the job is executing. At other times, a
certain number of resources may be reserved for use by a particular user or group. In these cases,
the scheduler is responsible for guaranteeing that these resources, utilized or not, are set aside and
made unavailable to other jobs.
Example
Node " cl003" has 4 processors and 512 MB of memory. It is executing 2 tasks of job "clserver.0041"
that are using 1 processor and 60 MB of memory each. One processor and 250 MB of memory are
reserved for user "jsmith" but are not currently in use.
Dedicated resources for node "cl003":
- 1 processor
- 250 MB memory
Resource, Utilized
Definition
All consumable resources actually used by all job tasks running on the compute node.
Example
Node "cl003" has 4 processors and 512 MB of memory. It is executing 2 tasks of job "clserver.0041"
that are using 1 processor and 60 MB of memory each. One processor and 250 MB of memory are
reserved for user "jsmith" but are not currently in use.
Utilized resources for node "cl003":
- 2 processors
- 120 MB memory
Swap
Definition
A quantity of virtual memory available for use by batch jobs. Swap is a consumable resource
provided by nodes and consumed by jobs.
Task
Definition
An atomic collection of consumable resources.
User, Global
Definition
The user credential used to provide access to functions and resources. In local scheduling, global
user IDs map directly to local user IDs.
User, Local
Definition
The user credential under which the job executable will be launched.
Workload
Definition
A generalized term referring to the jobs and other work requests managed by the scheduler.
Scheduling Iterations and Job Flow
- Scheduling Iterations
  - Update State Information
  - Handle User Requests
  - Perform Next Scheduling Cycle
- Detailed Job Flow
  - Determine Basic Job Feasibility
  - Prioritize Jobs
  - Enforce Configured Throttling Policies
  - Determine Resource Availability
  - Allocate Resources to Job
  - Launch Job
Scheduling Iterations
In any given scheduling iteration, many activities take place, examples of which
are listed below:
- Refresh reservations
- Schedule reserved jobs
- Schedule priority jobs
- Backfill jobs
- Update statistics
- Update State Information
- Handle User Requests
- Perform Next Scheduling Cycle
Update State Information
Each iteration, the scheduler contacts the resource manager(s) and requests
up-to-date information on compute resources, workload, and policy
configuration. On most systems, these calls are to a centralized resource
manager daemon that possesses all information. Jobs may be reported as
being in any of the following states listed in the job state table.
Handle User Requests
User requests include any call requesting state information, configuration
changes, or job or resource manipulation commands. These requests may
come in the form of user client calls, peer daemon calls, or process signals.
Perform Next Scheduling Cycle
Moab operates on a polling/event driven basis. When all scheduling activities
complete, Moab processes user requests until a new resource manager event
is received or an internal event is generated. Resource manager events include
activities such as a new job submission or completion of an active job, addition
of new node resources, or changes in resource manager policies. Internal
events include administrator schedule requests, reservation
activation/deactivation, or the expiration of the RMPOLLINTERVAL timer.
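For example, the polling cadence can be adjusted in moab.cfg; the 30-second value below is purely illustrative, not a recommendation:
RMPOLLINTERVAL 00:00:30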
Detailed Job Flow
Determine Basic Job Feasibility
The first step in scheduling is determining which jobs are feasible. This step
eliminates jobs that have job holds in place, invalid job states (such as
Completed, Not Queued, Deferred), or unsatisfied preconditions. Preconditions
may include stage-in files or completion of preliminary job steps.
Prioritize Jobs
With a list of feasible jobs created, the next step involves determining the
relative priority of all jobs within that list. A priority for each job is calculated
based on job attributes such as job owner, job size, and length of time the job
has been queued.
Enforce Configured Throttling Policies
Any configured throttling policies are then applied constraining how many jobs,
nodes, processors, and so forth are allowed on a per credential basis. Jobs that
violate these policies are not considered for scheduling.
Determine Resource Availability
For each job, Moab attempts to locate the required compute resources needed
by the job. For a match to be made, the node must possess all node attributes
specified by the job and possess adequate available resources to meet the
"TasksPerNode" job constraint. (Default "TasksPerNode" is 1.) Normally, Moab
determines that a node has adequate resources if the resources are neither
utilized by nor dedicated to another job, using the calculation
R.Available = R.Configured - MAX(R.Dedicated, R.Utilized).
The NODEAVAILABILITYPOLICY parameter can be modified to adjust this
behavior.
Allocate Resources to Job
If adequate resources can be found for a job, the node allocation policy is then
applied to select the best set of resources. These allocation policies allow
selection criteria such as speed of node, type of reservations, or excess node
resources to be figured into the allocation decision to improve the performance
of the job and maximize the freedom of the scheduler in making future
scheduling decisions.
Launch Job
With the resources selected and task distribution mapped, the scheduler then
contacts the resource manager and informs it where and how to launch the
job. The resource manager then initiates the actual job executable.
Configuring the Scheduler
- Adjusting Server Behavior
  - Logging
  - Checkpointing
  - Client Interface
  - Scheduler Mode
  - Configuring a job ID offset
Scheduler configuration is maintained using the flat text configuration file
moab.cfg. All configuration file entries consist of simple <PARAMETER> <VALUE>
pairs that are whitespace delimited. Parameter names are not case sensitive
but <VALUE> settings are. Some parameters are array values and should be
specified as <PARAMETER>[<INDEX>] (Example: QOSCFG[hiprio] PRIORITY=1000);
the <VALUE> settings may be integers, floats, strings, or arrays of these. Some
parameters can be specified as arrays wherein index values can be numeric or
alphanumeric strings. If no array index is specified for an array parameter, an
index of zero (0) is assumed. The example below includes both array based
and non-array based parameters:
SCHEDCFG[cluster2] SERVER=head.c2.org MODE=NORMAL
LOGLEVEL 6
LOGDIR   /var/tmp/moablog
See the parameters documentation for information on specific parameters.
The moab.cfg file is read when Moab is started up or recycled. Also, the
mschedctl -m command can be used to reconfigure the scheduler at any time,
updating some or all of the configurable parameters dynamically. This
command can be used to modify parameters either permanently or
temporarily. For example, the command mschedctl -m LOGLEVEL 3 will temporarily
adjust the scheduler log level. When the scheduler restarts, the log level
restores to the value stored in the Moab configuration files. To adjust a
parameter permanently, the option --flags=persistent should be set.
At any time, the current server parameter settings may be viewed using the
mschedctl -l command.
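The following sketch illustrates one plausible sequence of these commands; the log level value is arbitrary:
> mschedctl -m LOGLEVEL 3                      (temporary; reverts on restart)
> mschedctl -m --flags=persistent LOGLEVEL 3   (permanent)
> mschedctl -l                                 (list current parameter settings)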
Adjusting Server Behavior
Most aspects of Moab behavior are configurable. This includes both scheduling
policy behavior and daemon behavior. In terms of configuring server behavior,
the following realms are most commonly modified.
Logging
Moab provides extensive and highly configurable logging facilities controlled by
parameters.
Parameter         Description
LOGDIR            Indicates directory for log files.
LOGFACILITY       Indicates scheduling facilities to track.
LOGFILE           Indicates path name of log file.
LOGFILEMAXSIZE    Indicates maximum size of log file before rolling.
LOGFILEROLLDEPTH  Indicates maximum number of log files to maintain.
LOGLEVEL          Indicates verbosity of logging.
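A representative logging stanza might look like the following; the path, size, and depth values are illustrative, not defaults:
LOGDIR           /var/log/moab
LOGFILE          moab.log
LOGFILEMAXSIZE   10000000
LOGFILEROLLDEPTH 5
LOGLEVEL         3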
Checkpointing
Moab checkpoints its internal state. The checkpoint file records statistics and
attributes for jobs, nodes, reservations, users, groups, classes, and almost
every other scheduling object.
Parameter                 Description
CHECKPOINTEXPIRATIONTIME  Indicates how long unmodified data should be kept after the associated object has disappeared (for example, job priority for a job no longer detected).
CHECKPOINTFILE            Indicates path name of checkpoint file.
CHECKPOINTINTERVAL        Indicates interval between subsequent checkpoints.
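A checkpointing stanza might look like the following sketch; the file path and time values are illustrative:
CHECKPOINTFILE           /opt/moab/.moab.ck
CHECKPOINTINTERVAL       00:05:00
CHECKPOINTEXPIRATIONTIME 3:00:00:00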
Client Interface
The Client interface is configured using the SCHEDCFG parameter. Most
commonly, the attributes SERVER and PORT must be set to point client
commands to the appropriate Moab server. Other parameters such as
CLIENTTIMEOUT may also be set.
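For example, a minimal client interface configuration might be the following; the hostname, port, and timeout values are illustrative:
SCHEDCFG[moab] SERVER=head.example.com PORT=42559
CLIENTTIMEOUT  00:01:00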
Scheduler Mode
The scheduler mode of operation is controlled by setting the MODE attribute of
the SCHEDCFG parameter. The following modes are allowed:
Mode         Description
INTERACTIVE  Moab interactively confirms each scheduling action before taking any steps. (See interactive mode overview for more information.)
MONITOR      Moab observes cluster and workload performance, collects statistics, interacts with allocation management services, and evaluates failures, but it does not actively alter the cluster, including job migration, workload scheduling, and resource provisioning. (See monitor mode overview for more information.)
NORMAL       Moab actively schedules workload according to mission objectives and policies; it creates reservations; starts, cancels, preempts, and modifies jobs; and takes other scheduling actions.
SIMULATION   Moab obtains workload and resource information from specified simulation trace files and schedules the defined virtual environment.
SINGLESTEP   Moab behaves as in NORMAL mode but schedules only a single iteration and then exits.
SLAVE        Moab behaves as in NORMAL mode but starts a job only when explicitly requested by a trusted grid peer service or administrator.
TEST         Moab behaves as in NORMAL mode and makes reservations and scheduling decisions, but then only logs the scheduling actions it would have taken in NORMAL mode. In most cases, TEST mode is identical to MONITOR mode. (See test mode overview for more information.)
Configuring a job ID offset
Moab assigns job IDs as integers in numeric order as jobs are submitted,
starting with 1. In some situations, you might want to offset the integer at
which Moab starts to assign job IDs in your system.
This example describes how you would offset the job IDs in a compound
system consisting of Site A, Site B, and Site C, each of which runs its own
instance of Moab. Users belonging to any of the sites can submit jobs to their
own site and to the other two. To simplify aggregation of usage records from
the three sites, offset the job IDs for Site B to a starting value higher than the
expected total lifetime value for the system; in this example, to 20000000.
Likewise, set Site C to 20,000,000 more, or 40000000. To do so, set the
MINJOBID attribute of SCHEDCFG in each system's moab.cfg to the offset value.
To ensure that Moab will never use the same job ID for two different sites, also
set MAXJOBID. If the Moab job naming process ever reaches the MAXJOBID, it will
start over again with the MINJOBID.
SCHEDCFG[moab] SERVER=moab_siteA:4244 MAXJOBID=19999999
SCHEDCFG[moab] SERVER=moab_siteB:4344 MINJOBID=20000000 MAXJOBID=39999999
SCHEDCFG[moab] SERVER=moab_siteC:4444 MINJOBID=40000000 MAXJOBID=59999999
When users submit jobs to Moab using msub, Moab selects the job ID in
numeric order, starting with 1 in Site A, 20000000 in Site B, and 40000000 in
Site C.
If the compound system in this example uses Torque as its resource manager
and users submit jobs directly to Torque using qsub, Torque assigns the job ID
instead of Moab. In this case, you should also offset the Torque job IDs by
setting the next_job_number server parameter of Site B and Site C to
20000000 and 40000000, respectively.
> qmgr -c "set server next_job_number = 20000000"    (Site B)
> qmgr -c "set server next_job_number = 40000000"    (Site C)
Torque job ID limits will allow you to use the 20,000,000 offset scheme for
up to 4 sites.
Related Topics
Initial Configuration
Adding #INCLUDE files to moab.cfg
Credential Overview
Moab supports the concept of credentials, which provide a means of attributing
policy and resource access to entities such as users and groups. These
credentials allow specification of job ownership, tracking of resource usage,
enforcement of policies, and many other features. There are five types of
credentials: user, group, account, class, and QoS. While the credentials have
many similarities, each plays a slightly different role.
- General Credential Attributes
- User Credential
- Group Credential
- Account (or Project) Credential
- Class (or Queue) Credential
- QoS Credential
General Credential Attributes
Internally, credentials are maintained as objects. Credentials can be created,
destroyed, queried, and modified. They are associated with jobs and requests
providing access and privileges. Each credential type has the following
attributes:
- Priority Settings
- Usage Limits
- Service Targets
- Credential and Partition Access
- Statistics
- Credential Defaults, State and Configuration Information
All credentials represent a form of identity, and when applied to a job, express
ownership. Consequently, jobs are subject to policies and limits associated with
their owners.
Credential Priority Settings
Each credential may be assigned a priority using the PRIORITY attribute. This
priority affects a job's total credential priority factor as described in the Priority
Factors section. In addition, each credential may also specify priority weight
offsets, which adjust priority weights that apply to associated jobs. These
priority weight offsets include FSWEIGHT (See Priority-Based Fairshare for
more information.), QTWEIGHT, and XFWEIGHT.
For example:
# set priority weights
CREDWEIGHT      1
USERWEIGHT      1
CLASSWEIGHT     1
SERVICEWEIGHT   1
XFACTORWEIGHT   10
QUEUETIMEWEIGHT 1000
# set credential priorities
USERCFG[john] PRIORITY=200
CLASSCFG[batch] PRIORITY=15
CLASSCFG[debug] PRIORITY=100
QOSCFG[bottomfeeder] QTWEIGHT=-50 XFWEIGHT=100
ACCOUNTCFG[topfeeder] PRIORITY=100
Credential Usage Limits
Usage limits constrain which jobs may run, which jobs may be considered for
scheduling, and what quantity of resources each individual job may consume.
With usage limits, policies such as MAXJOB, MAXNODE, and MAXMEM may be
enforced against both idle and active jobs. Limits may be applied in any
combination as shown in the example below where usage limits include 32
active processors per group and 12 active jobs for user john. For a job to run, it
must satisfy the most limiting policies of all associated credentials. The
Throttling Policy section documents credential usage limits in detail.
GROUPCFG[DEFAULT] MAXPROC=32 MAXNODE=100
GROUPCFG[staff]   MAXNODE=200
USERCFG[john]     MAXJOB=12
Service Targets
Credential service targets allow jobs to obtain special treatment to meet usage
or response time based metrics. Additional information about service targets
can be found in the Fairshare section.
Credential and Partition Access
Access to partitions and to other credentials may be specified on a per
credential basis with credential access lists, default credentials, and credential
membership lists.
Credential Access Lists
You can use the ALIST, PLIST, and QLIST attributes (shown in the following table)
to specify the list of credentials or partitions that a given credential may access.
Credential  Attribute
Account     ALIST (allows credential to access specified list of accounts)
Partition   PLIST (allows credential to access specified list of partitions)
QoS         QLIST (allows credential to access specified list of QoSes)
Example 2-1:
USERCFG[bob]   ALIST=jupiter,quantum
USERCFG[steve] ALIST=quantum
Account-based access lists are only enforced if using an accounting
manager or if the ENFORCEACCOUNTACCESS parameter is set to "TRUE."
Assigning Default Credentials
Use the *DEF attribute (shown in the following table) to specify the default
credential or partition for a particular credential.
Credential  Attribute
Account     ADEF (specifies default account)
Class       CDEF (specifies default class)
QoS         QDEF (specifies default QoS)
Example 2-2:
# user bob can access accounts a2, a3, and a6. If no account is explicitly requested,
# his job will be assigned to account a3
USERCFG[bob]   ALIST=a2,a3,a6 ADEF=a3
# user steve can access accounts a14, a7, a2, a6, and a1. If no account is explicitly
# requested, his job will be assigned to account a2
USERCFG[steve] ALIST=a14,a7,a2,a6,a1 ADEF=a2
Specifying Credential Membership Lists
As an alternate to specifying access lists, administrators may also specify
membership lists. This allows a credential to specify who can access it rather
than allowing each credential to specify which credentials it can access.
Membership lists are controlled using the MEMBERULIST, EXCLUDEUSERLIST and
REQUIREDUSERLIST attributes, shown in the following table:
Credential           Attribute
User                 ---
Account, Group, QoS  MEMBERULIST
Class                EXCLUDEUSERLIST and REQUIREDUSERLIST
Example 2-3:
# account omega3 can only be accessed by users johnh, stevek, jenp
ACCOUNTCFG[omega3] MEMBERULIST=johnh,stevek,jenp
Example 2-4: Controlling Partition Access on a Per User Basis
A site may specify that the user john may access partitions atlas, pluto, and zeus
and will default to partition pluto. To do this, include the following line in the
configuration file:
USERCFG[john] PLIST=atlas,pluto,zeus
Example 2-5: Controlling QoS Access on a Per Group Basis
A site may also choose to allow everyone in the group staff to access QoS
standard and special with a default QoS of standard. To do this, include the
following line in the configuration file:
GROUPCFG[staff] QLIST=standard,special QDEF=standard
Example 2-6: Controlling Resource Access on a Per Account Basis
An organization wants to allow everyone in the account omega3 to access nodes
20 through 24. To do this, include the following in the configuration file:
ACCOUNTCFG[omega3] MEMBERULIST=johnh,stevek,jenp
SRCFG[omega3]      HOSTLIST=r:20-24 ACCOUNTLIST=omega3
Credential Statistics
Full statistics are maintained for each credential instance. These statistics
record current and historical resource usage, level of service delivered,
accuracy of requests, and many other aspects of workload. Note, though, that
you must explicitly enable credential statistics as they are not tracked by
default. You can enable credential statistics by including the following in the
configuration file:
USERCFG[DEFAULT]    ENABLEPROFILING=TRUE
GROUPCFG[DEFAULT]   ENABLEPROFILING=TRUE
ACCOUNTCFG[DEFAULT] ENABLEPROFILING=TRUE
CLASSCFG[DEFAULT]   ENABLEPROFILING=TRUE
QOSCFG[DEFAULT]     ENABLEPROFILING=TRUE
Job Defaults, Credential State, and General Configuration
Credentials may apply defaults and force job configuration settings via the
following parameters:
COMMENT
Description
Associates a comment string with the target credential.
Example
USERCFG[steve] COMMENT='works for boss, provides good service'
CLASSCFG[i3]   COMMENT='queue for I/O intensive workload'
HOLD
Description
Specifies a hold should be placed on all jobs associated with the target credential.
The order in which this HOLD attribute is evaluated depends on the following credential
precedence: USERCFG, GROUPCFG, ACCOUNTCFG, CLASSCFG, QOSCFG, USERCFG
[DEFAULT], GROUPCFG[DEFAULT], ACCOUNTCFG[DEFAULT], CLASSCFG[DEFAULT],
QOSCFG[DEFAULT].
Example
GROUPCFG[bert] HOLD=yes
JOBFLAGS
Description
Assigns the specified job flag to all jobs with the associated credential.
Example
CLASSCFG[batch] JOBFLAGS=suspendable
QOSCFG[special] JOBFLAGS=restartable
NOSUBMIT
Description
Specifies whether jobs belonging to this credential can submit jobs using msub.
Example
ACCOUNTCFG[general] NOSUBMIT=TRUE
CLASSCFG[special]   NOSUBMIT=TRUE
OVERRUN
Description
Specifies the amount of time a job may exceed its wallclock limit before being terminated. (Only applies to user and class credentials.)
Example
CLASSCFG[bigmem] OVERRUN=00:15:00
VARIABLE
Description
Specifies attribute-value pairs associated with the specified credential. These variables may be used in triggers and other interfaces to modify system behavior.
Example
GROUPCFG[staff] VARIABLE='nocharge=true'
Credentials may carry additional configuration information. They may specify
that detailed statistical profiling should occur, that submitted jobs should be
held, or that corresponding jobs should be marked as preemptible.
User Credential
The user credential is the fundamental credential within a workload manager;
each job requires an association with exactly one user. In fact, the user
credential is the only required credential in Moab; all others are optional. In
most cases, the job's user credential is configured within or managed by the
operating system itself, although Moab may be configured to obtain this
information from an independent security and identity management service.
As the fundamental credential, the user credential has a number of unique
attributes.
- Role
- Privileges
- Email Address
- Disable Moab User Email
Role
Moab supports role-based authorization, mapping particular roles to collections
of specific users. See the Security section for more information.
Privileges
Moab supports the ability to configure which "mdiag" commands a user can
run. The format is USERCFG[<USERID>] PRIVILEGES=<SCHED|RM|NODE>:diagnose, and there is no default. For example, to allow "jill" to run the "mdiag -R" command to diagnose resource managers, use:
USERCFG[jill] PRIVILEGES=RM:diagnose
Use a semi-colon (;) to separate multiple privileges. For example:
USERCFG[jill] PRIVILEGES=NODE:diagnose;RM:diagnose
Email Address
Facilities exist to allow user notification in the event of job or system failures or
under other general conditions. This attribute allows these notifications to be
mailed directly to the target user.
USERCFG[sally] EMAILADDRESS=sally@acme.com
Disable Moab User Email
You can disable Moab email notifications for a specific user.
USERCFG[john] NOEMAIL=TRUE
Group Credential
The group credential represents an aggregation of users. User-to-group
mappings are often specified by the operating system or resource manager
and typically map to a user's UNIX group ID. However, user-to-group
mappings may also be provided by a security and identity management
service, or you can specify such directly within Moab.
With many resource managers such as Torque, PBSPro, and LSF, the group
associated with a job is either the user's active primary group as specified
within the operating system or a group that is explicitly requested at job
submission time. When a secondary group is requested, the user's default
group and associated policies are not taken into account. Also note that a job
may only run under one group. If more constraining policies are required for
these systems, an alternate aggregation scheme such as the use of Account or
QOS credentials is recommended.
To submit a job as a secondary group, refer to your local resource manager's
job submission options. For Torque users, see the group_list=g_list option
of the qsub -W command.
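For example, a Torque submission under a secondary group might look like the following sketch; the group name and job script are illustrative:
> qsub -W group_list=staff2 analysis.sh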
Account Credential
The account credential is also referred to as the project. This credential is
generally associated with a group of users along the lines of a particular project
for accounting and billing purposes. User-to-account mapping may be
obtained from a resource manager or accounting manager, or you can
configure it directly within Moab. Access to an account can be controlled via the
ALIST and ADEF credential attributes specified via the Identity Manager or the
moab.cfg file.
The MANAGERS attribute (applicable only to the account and class credentials)
allows an administrator to assign a user the ability to manage jobs inside the
credential, as if the user is the job owner.
Example 2-7: MANAGERS Attribute
ACCOUNTCFG[general] MANAGERS=ops
ACCOUNTCFG[special] MANAGERS=stevep
If a user is able to access more than one account, the desired account can be
specified at job submission time using the resource-manager specific attribute.
For example, with Torque this is accomplished using the -A argument to the
qsub command.
Example 2-8: Enforcing Account Usage
Job-to-account mapping can be enforced using the ALIST attribute and the
ENFORCEACCOUNTACCESS parameter.
USERCFG[john]    ALIST=proj1,proj3
USERCFG[steve]   ALIST=proj2,proj3,proj4
USERCFG[brad]    ALIST=proj1
USERCFG[DEFAULT] ALIST=proj2
ENFORCEACCOUNTACCESS TRUE
...
Class Credential
- Class Job Defaults
- Per Job Min/Max Limits
- Resource Access
- Class Membership Constraints
- Attributes Enabling Class Access to Other Credentials
- Special Class Attributes (such as Managers and Job Prologs)
- Setting Default Classes
- Creating a Remap Class
- Class Attribute Overview
- Enabling Queue Complex Functionality
The concept of the class credential is derived from the resource manager class
or queue object. Classes differ from other credentials in that they more directly
impact job attributes. In standard HPC usage, a user submits a job to a class
and this class imposes a number of factors on the job. The attributes of a class
may be specified within the resource manager or directly within Moab. Class
attributes include the following:
- Job Defaults
- Per Job Min/Max Limits
- Resource Access Constraints
- Class Membership Constraints
- Attributes Enabling Class Access to Other Credentials
- Special Class Attributes
When using SLURM, Moab classes have a one-to-one relationship with
SLURM partitions of the same name.
For all classes configured in Moab, a resource manager queue with the
same name should be created.
When Torque reports a new queue to Moab, a class of the same name is
automatically applied to all nodes.
Class Job Defaults
Classes can be assigned to a default job template that can apply values to job
attributes not explicitly specified by the submitter. Additionally, you can specify
shortcut attributes from the table that follows:
Attribute         Description
DEFAULT.ATTR      Job Attribute
DEFAULT.DISK      Required Disk (in MB)
DEFAULT.EXT       Job RM Extension
DEFAULT.FEATURES  Required Node Features/Properties
DEFAULT.GRES      Required Consumable Generic Resources
DEFAULT.MEM       Required Memory/RAM (in MB)
DEFAULT.NODESET   Node Set Specification
DEFAULT.PROC      Required Processor Count
DEFAULT.TPN       Tasks Per Node
DEFAULT.WCLIMIT   Wallclock Limit
Defaults set in a class/queue of the resource manager will override the
default values of the corresponding class/queue specified in Moab.
RESOURCELIMITPOLICY must be configured in order for the CLASSCFG
limits to take effect.
Example 2-9:
CLASSCFG[batch] DEFAULT.DISK=200MB DEFAULT.FEATURES=prod DEFAULT.WCLIMIT=1:00:00
CLASSCFG[debug] DEFAULT.FEATURES=debug DEFAULT.WCLIMIT=00:05:00
Per Job Min/Max Limits
Classes can be assigned a minimum and a maximum job template that
constrains resource requests. Jobs submitted to a particular queue must meet
the resource request constraints of these templates. If a job submission
exceeds these limits, the entire job submission fails.
Limit             Description
MAX.ARRAYSUBJOBS  Max Allowed Jobs in an Array
MAX.CPUTIME       Max Allowed Utilized CPU Time
MAX.NODE          Max Allowed Node Count
MAX.PROC          Max Allowed Processor Count
MAX.PS            Max Requested Processor-Seconds
MIN.NODE          Min Allowed Node Count
MIN.PROC          Min Allowed Processor Count
MIN.PS            Min Requested Processor-Seconds
MIN.TPN           Min Tasks Per Node
MIN.WCLIMIT       Min Requested Wallclock Limit
MAX.WCLIMIT       Max Requested Wallclock Limit
The parameters listed in the preceding table are for classes and PARCFG
only, not users, accounts, groups or QoSes, and they function on a per-job
basis. The MAX.* and MIN.* parameters are different from the MAXJOB,
MAXNODE, and MAXMEM parameters described earlier in Credential Usage
Limits.
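For example, a hypothetical class restricted to small, short jobs could be bounded as follows; the class name and values are illustrative:
CLASSCFG[small] MIN.PROC=1 MAX.PROC=8 MAX.WCLIMIT=1:00:00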
Resource Access
Classes may be associated with a particular set of compute resources.
Consequently, jobs submitted to a given class may only use listed resources.
This may be handled at the resource manager level or via the CLASSCFG
HOSTLIST attribute.
Class Membership Constraints
Classes may be configured at either the resource manager or scheduler level to
only allow select users and groups to access them. Jobs that do not meet these
criteria are rejected. If specifying class membership/access at the resource
manager level, see the respective resource manager documentation. Moab
automatically detects and enforces these constraints. If specifying class
membership/access at the scheduler level, use the REQUIREDUSERLIST or
EXCLUDEUSERLIST attributes of the CLASSCFG parameter.
Under most resource managers, jobs must always be a member of one
and only one class.
Attributes Enabling Class Access to Other Credentials
Classes may be configured to allow jobs to access other credentials such as
QoSs and Accounts. This is accomplished using the QDEF, QLIST, ADEF, and
ALIST attributes.
Special Class Attributes
The class object also possesses a few unique attributes, including the MANAGERS, JOBPROLOG, JOBEPILOG, RESFAILPOLICY, and DISABLEAM attributes described in what follows:
MANAGERS
Users listed via the MANAGERS parameter are granted full control over all jobs
submitted to or running within the specified class.
# allow john and steve to cancel and modify all jobs submitted to the class/queue special
CLASSCFG[special] MANAGERS=john,steve
In particular, a class manager can perform the following actions on jobs within
a class/queue:
- view/diagnose job (checkjob)
- cancel, requeue, suspend, resume, and checkpoint job (mjobctl)
- modify job (mjobctl)
JOBPROLOG
The JOBPROLOG class performs a function similar to the resource manager level
job prolog feature; however, there are some key differences:
- Moab prologs execute on the head node; resource manager prologs execute on the nodes allocated to the job.
- Moab prologs execute as the primary Moab administrator; resource manager prologs execute as root.
- Moab prologs can incorporate cluster environment information into their decisions and actions. (See Valid Variables.)
- Unique Moab prologs can be specified on a per-class basis.
- Job start requests are not sent to the resource manager until the Moab job prolog completes successfully.
- Error messages generated by a Moab prolog are attached to jobs and associated objects; stderr from the prolog script is attached to the job.
- Moab prologs have access to Moab internal and peer services.
Valid epilog and prolog variables are:
Variable     Description
$TIME        Time that the trigger launches
$HOME        Moab home directory
$USER        User name the job is running under
$JOBID       Unique job identifier
$HOSTLIST    Entire host list for job
$MASTERHOST  Master host for job
The JOBPROLOG class attribute allows a site to specify a unique per-class action
to take before a job is allowed to start. This can be used for environmental
provisioning, pre-execution resource checking, security management, and
other functions. Sample uses may include enabling a VLAN, mounting a global
file system, installing a new application or virtual node image, creating dynamic
storage partitions, or activating job specific software services.
A prolog is considered to have failed if it returns a negative number. If a
prolog fails, the associated job will not start.
If a prolog executes successfully, the associated epilog is guaranteed to
start, even if the job fails for any reason. This allows the epilog to undo
any changes made to the system by the prolog.
Job Prolog Examples
# explicitly specify prolog arguments for the special class prolog
CLASSCFG[special] JOBPROLOG='$TOOLSDIR/specialprolog.pl $JOBID $HOSTLIST'
# use default prolog arguments for batch prolog
CLASSCFG[batch]   JOBPROLOG=$TOOLSDIR/batchprolog.pl
JOBEPILOG
The Moab epilog is nearly identical to the prolog in functionality except that it
runs after the job completes within the resource manager but before the
scheduler releases the allocated resources for use by subsequent jobs. It is
commonly used for job clean-up, file transfers, signaling peer services, and
undoing other forms of resource customization.
An epilog is considered to have failed if it returns a negative number. If an
epilog fails, the associated job will be annotated and a message will be
sent to administrators.
RESFAILPOLICY
This policy allows specification of the action to take on a per-class basis when a
failure occurs on a node allocated to an actively running job. See the Node
Availability Overview for more information.
DISABLEAM
You can disable allocation management for jobs in specific classes by setting
the DISABLEAM class attribute to TRUE. For all jobs outside of the specified
classes, allocation enforcement will continue to be enforced.
# do not enforce allocations on low priority and debug jobs
CLASSCFG[lowprio] DISABLEAM=TRUE
CLASSCFG[debug]   DISABLEAM=TRUE
Setting Default Classes
In many cases, end-users do not want to be concerned with specifying a job
class/queue. This is often handled by defining a default class. Whenever a user
does not explicitly submit a job to a particular class, a default class, if specified,
is used. In resource managers such as Torque, this can be done at the resource
manager level and its impact is transparent to the scheduler. The default class
can also be enabled within the scheduler on a per resource manager or per
user basis. To set a resource manager default class within Moab, use the
DEFAULTCLASS attribute of the RMCFG parameter. For per user defaults, use the
CDEF attribute of the USERCFG parameter.
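For example, the following sketch combines both approaches; the resource manager and user names are illustrative:
RMCFG[torque]  DEFAULTCLASS=batch
USERCFG[carol] CDEF=debug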
Creating a Remap Class
If a single default class is not adequate, Moab provides more flexible options
with the REMAPCLASS parameter. If this parameter is set and a job is
submitted to the remap class, Moab attempts to determine the final class to
which a job belongs based on the resources requested. If a remap class is
specified, Moab compares the job's requested nodes, processors, memory,
and node features with the class's corresponding minimum and maximum
resource limits. Classes are searched in the order in which they are defined;
when the first match is found, Moab assigns the job to that class.
Because Moab remaps at job submission, updates you make to job
requirements after submission will not cause any class changes. Moab does not
restart the process.
In order to use REMAPCLASS, you must specify a DEFAULTCLASS. For
example:
RMCFG[internal] DEFAULTCLASS=batch
In the example that follows, a job requesting 4 processors and the node feature fast is assigned to the class quick.
# You must specify a default class in order to use remap classes
RMCFG[internal]   DEFAULTCLASS=batch
# Jobs submitted to "batch" should be remapped
REMAPCLASS batch
# stevens only queue
CLASSCFG[stevens] REQ.FEATURES=stevens REQUIREDUSERLIST=stevens,stevens2
# Special queue for I/O nodes
CLASSCFG[io]      MAX.PROC=8 REQ.FEATURES=io
# General access queues
CLASSCFG[quick]   MIN.PROC=2 MAX.PROC=8 REQ.FEATURES=fast|short
CLASSCFG[medium]  MIN.PROC=2 MAX.PROC=8
CLASSCFG[DEFAULT] MAX.PROC=64
...
The following parameters can be used to remap jobs to different classes:
- MIN.PROC
- MAX.PROC
- MIN.TPN
- MAX.TPN
- MIN.WCLIMIT
- MAX.WCLIMIT
- REQ.FEATURES
- REQ.FLAGS=INTERACTIVE
- REQUIREDUSERLIST
If the parameter REMAPCLASSLIST is set, then only the listed classes are
searched and they are searched in the order specified by this parameter. If
none of the listed classes are valid for a particular job, that job retains its
original class.
The remap class only works with resource managers that allow dynamic
modification of a job's assigned class/queue.
If default credentials are specified on a remap class, a job submitted to
that class will inherit those credentials. If the destination class has
different default credentials, the new defaults override the original
settings. If the destination class does not have default credentials, the job
maintains the defaults inherited from the remap class.
Class Attribute Overview
The following table enumerates the different parameters for CLASSCFG.
Setting DEFAULT.* on a class does not assign resources or features to that
class. Rather, it specifies resources that jobs will inherit when they are
submitted to the class without their own resource requests. To configure
features, use NODECFG.
DEFAULT.ATTR
Format
<ATTRIBUTE>[,<ATTRIBUTE>]...
Description
One or more comma-delimited generic job attributes.
Example
---
DEFAULT.DISK
Format
<INTEGER>
Description
Default amount of requested disk space.
Example
---
DEFAULT.EXT
Format
<STRING>
Description
Default job RM extension.
Example
---
DEFAULT.FEATURES
Format
Comma-delimited list of features.
Description
Default list of requested node features (a.k.a. node properties). This only applies to compute resource requirements.
Example
---
DEFAULT.GRES
Format
<STRING>[<COUNT>][,<STRING>[<COUNT>]]...
Description
Default list of per task required consumable generic resources.
Example
CLASSCFG[viz] DEFAULT.GRES=viz:2
DEFAULT.MEM
Format
<INTEGER> (in MB)
Description
Default amount of requested memory.
Example
---
DEFAULT.NODE
Format
<INTEGER>
Description
Default required node count.
Example
CLASSCFG[viz] DEFAULT.NODE=5
When a user submits a job to the viz class without a specified node count, the job is assigned 5
nodes.
DEFAULT.NODESET
Format
<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...]
Description
Default node set.
Example
CLASSCFG[amd]
DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON
DEFAULT.PROC
Format
<INTEGER>
Description
Default number of requested processors.
Example
---
DEFAULT.TPN
Format
<INTEGER>
Description
Default number of tasks per node.
Example
---
DEFAULT.WCLIMIT
Format
<INTEGER>
Description
Default wallclock limit.
Example
---
EXCL.FEATURES
Format
Comma- or pipe-delimited list of node features.
Description
Set of excluded (disallowed) features. If delimited by commas, reject job if all features are requested; if delimited by the pipe symbol (|), reject job if at least one feature is requested.
Example
CLASSCFG[intel] EXCL.FEATURES=ATHLON,AMD
EXCL.FLAGS
Format
Comma-delimited list of job flags.
Description
Set of excluded (disallowed) job flags. Reject job if any listed flags are set.
Example
CLASSCFG[batch] EXCL.FLAGS=INTERACTIVE
EXCLUDEUSERLIST
Format
Comma-delimited list of users.
Description
List of users not permitted access to class.
Example
---
FORCENODEACCESSPOLICY
Format
one of SINGLETASK, SINGLEJOB, SINGLEUSER, or SHARED
Description
Node access policy associated with queue. If set, this value overrides any per job settings specified by the user at the job level. (See Node Access Policy overview for more information.)
Example
CLASSCFG[batch] FORCENODEACCESSPOLICY=SINGLEJOB
FSCAP
Format
<DOUBLE>[%]
Description
See fairshare policies specification.
Example
---
FSTARGET
Format
<DOUBLE>[%]
Description
See fairshare policies specification.
Example
---
HOSTLIST
Format
Host expression, or comma-delimited list of hosts or host ranges.
Description
List of hosts associated with a class. If specified, Moab constrains the availability of a class to only
nodes listed in the class host list.
Example
CLASSCFG[batch] HOSTLIST=r:abs[45-113]
IGNHOSTLIST
Format
<BOOLEAN>
Default
FALSE
Description
If set to TRUE, any job submitted to the class will have its requested hostlist ignored by the scheduler.
Example
CLASSCFG[batch] IGNHOSTLIST=TRUE
JOBEPILOG
Format
<STRING>
Description
Scheduler level job epilog to be run after job is completed by resource manager. (See special class
attributes.)
Example
---
JOBFLAGS
Format
Comma-delimited list of job flags.
Description
See the flag overview for a description of legal flag values.
Example
CLASSCFG[batch] JOBFLAGS=restartable
JOBPROLOG
Format
<STRING>
Description
Scheduler level job prolog to be run before job is started by resource manager. (See special class
attributes.)
Example
---
MANAGERS
Format
<USER>[,<USER>]...
Description
Users allowed to control, cancel, preempt, and modify jobs within class/queue. (See special class
attributes.)
Example
CLASSCFG[fast] MANAGERS=root,kerry,e43
MAXJOB
Format
<INTEGER>
Description
Maximum number of active (starting or running) jobs allowed in the class.
Example
---
MAXPROCPERNODE
Format
<INTEGER>
Description
Maximum number of processors requested per node. May optionally include node names to articulate
which nodes have a specific limit.
Example
CLASSCFG[cpu] MAXPROCPERNODE=20                              # limit 20 for all nodes
CLASSCFG[cpu] MAXPROCPERNODE[n1,n2]=20 MAXPROCPERNODE[n3]=10 # limit 20 for n1 & n2; limit 10 for n3
CLASSCFG[cpu] MAXPROCPERNODE[n1,n2]=20 MAXPROCPERNODE=10     # limit 20 for n1 & n2; limit 10 for all other nodes
MAX.CPUTIME
Format
<INTEGER>
Description
Maximum allowed utilized CPU time.
Example
---
MAX.NODE
Format
<INTEGER>
Description
Maximum number of requested nodes per job. (Also used when REMAPCLASS is set to correctly
route the job.)
Example
CLASSCFG[batch] MAX.NODE=64
Deny jobs requesting over 64 nodes access to the class batch.
MAX.PROC
Format
<INTEGER>
Description
Maximum number of requested processors per job. (Also used when REMAPCLASS is set to correctly route the job.)
This enforces the requested processors, not the actual processors dedicated to a job. When
enforcing limits for NODEACCESSPOLICY SINGLEJOB, use MAX.NODE instead.
Example
CLASSCFG[small] MAX.PROC[USER]=3,6
MAX.PS
Format
<INTEGER>
Description
Maximum requested processor-seconds.
Example
---
MAX.TPN
Format
<INTEGER>
Description
Maximum required tasks per node per job. (Also used when REMAPCLASS is set to correctly route
the job.)
Example
---
MAX.WCLIMIT
Format
[[[DD:]HH:]MM:]SS
Description
Maximum allowed wallclock limit per job. (Also used when REMAPCLASS is set to correctly route
the job.)
Example
CLASSCFG[long] MAX.WCLIMIT=96:00:00
MIN.NODE
Format
<INTEGER>
Description
Minimum number of requested nodes per job. (Also used when REMAPCLASS is set to correctly
route the job.)
Example
CLASSCFG[dev] MIN.NODE=16
Jobs must request at least 16 nodes to be allowed to access the class.
MIN.PROC
Format
<INTEGER>
Description
Minimum number of requested processors per job. (Also used when REMAPCLASS is set to correctly route the job.)
Example
CLASSCFG[dev] MIN.PROC=32
Jobs must request at least 32 processors to be allowed to access the class.
MIN.PS
Format
<INTEGER>
Description
Minimum requested processor-seconds.
Example
---
MIN.TPN
Format
<INTEGER>
Description
Minimum required tasks per node per job. (Also used
when REMAPCLASS is set to correctly route the job.)
Example
---
MIN.WCLIMIT
Format
[[[DD:]HH:]MM:]SS
Description
Minimum required wallclock limit per job. (Also used when REMAPCLASS is set to correctly route
the job.)
Example
---
NODEACCESSPOLICY
Format
one of SINGLETASK, SINGLEJOB, SINGLEUSER, or SHARED
Description
Default node access policy associated with queue. This value will be overridden by any per job
settings specified by the user at the job level. See Node Access Policy overview.
Example
CLASSCFG[batch] NODEACCESSPOLICY=SINGLEJOB
PARTITION
Format
<STRING>
Description
Partition name where jobs associated with this class must run.
Example
CLASSCFG[batch] PARTITION=p12
PRIORITY
Format
<INTEGER>
Description
Priority associated with the class. (See Priority overview.)
Example
CLASSCFG[batch] PRIORITY=1000
QDEF
Format
<QOSID>
Description
Default QoS for jobs submitted to this class. You may specify a maximum of four QDEF entries per
credential. Any QoSes specified after the fourth will not be accepted.
In addition to classes, you may also specify QDEF for accounts, groups, and users.
Example
CLASSCFG[batch] QDEF=base
Jobs submitted to class batch that do not explicitly request a QoS will have the QoS base assigned.
QLIST
Format
<QOSID>[,<QOSID>]...
Description
List of accessible QoSs for jobs submitted to this class.
Example
CLASSCFG[batch] QDEF=base QLIST=base,fast,special,bigio
REQ.FEATURES
Format
Comma- or pipe-delimited list of node features.
Description
Set of required features. If delimited by commas, all features are required; if delimited by the pipe
symbol (|), at least one feature is required.
Example
CLASSCFG[amd] REQ.FEATURES=ATHLON,AMD
REQ.FLAGS
Format
REQ.FLAGS can be used with only the INTERACTIVE flag.
Description
Sets the INTERACTIVE flag on jobs in this class.
Example
CLASSCFG[orion] REQ.FLAGS=INTERACTIVE
REQUIREDACCOUNTLIST
Format
Comma-delimited list of accounts.
Description
List of accounts allowed to access and use a class (analogous to *LIST for other credentials).
Example
CLASSCFG[jasper] REQUIREDACCOUNTLIST=testers,development
REQUIREDUSERLIST
Format
Comma-delimited list of users.
Description
List of users allowed to access and use a class (analogous to *LIST for other credentials).
Example
CLASSCFG[jasper] REQUIREDUSERLIST=john,u13,steve,guest
REQUIREDQOSLIST
Format
Comma-delimited list of QoSs
Description
List of QoSs allowed to access and use a class (analogous to *LIST for other credentials).
The number of unique QoSs is limited by the Moab Maximum ACL limit, which
defaults to 32.
Example
CLASSCFG[jasper] REQUIREDQOSLIST=hi,lo
SYSPRIO
Format
<INTEGER>
Description
Value of system priority applied to every job submitted to this class.
Once a system priority has been added to a job, either manually or through configuration,
it can only be removed manually.
Example
CLASSCFG[special] SYSPRIO=100
WCOVERRUN
Format
[[[DD:]HH:]MM:]SS
Description
Tolerated amount of time beyond the specified wallclock limit.
Example
---
Enabling Queue Complex Functionality
Queue complexes allow an organization to build a hierarchy of queues and
apply certain limits and rules to collections of these queues. Moab supports this
functionality in two ways. The first way, queue mapping, is very simple but
limited in functionality. The second method provides very rich functionality but
requires more extensive configuration using the Moab hierarchical fairshare
facility.
Queue Mapping
Queue mapping allows collections of queues to be mapped to a parent
credential object against which various limits and policies can be applied, as in
the following example.
QOSCFG[general]   MAXIJOB[USER]=14 PRIORITY=20
QOSCFG[prio]      MAXIJOB[USER]=8  PRIORITY=2000
# group short, med, and long jobs into 'general' QOS
CLASSCFG[short]   QDEF=general FSTARGET=30
CLASSCFG[med]     QDEF=general FSTARGET=40
CLASSCFG[long]    QDEF=general FSTARGET=30 MAXPROC=200
# group interactive and debug jobs into 'prio' QOS
CLASSCFG[inter]   QDEF=prio
CLASSCFG[debug]   QDEF=prio
CLASSCFG[premier] PRIORITY=10000
QoS Credential
The concept of a quality of service (QoS) credential is unique to Moab and is not
derived from any underlying concept or peer service. In most cases, the QoS
credential is used to allow a site to set up a selection of service levels for end-users to choose from on a long-term or job-by-job basis. QoSs differ from
other credentials in that they are centered around special access where this
access may allow use of additional services, additional resources, or improved
responsiveness. Unique to this credential, organizations may also choose to
apply different charge rates to the varying levels of service available within
each QoS. As QoS is an internal credential, all QoS configuration occurs within
Moab.
QoS access and QoS defaults can be mapped to users, groups, accounts, and
classes, allowing limited service offerings for key users. As mentioned, these services center on increased access to special scheduling capabilities and additional resources, as well as improved job responsiveness. At a high level, unique QoS attributes can be broken down into the following:
- Usage Limit Overrides
- Service Targets
- Privilege Flags
- Charge Rate
- Access Controls
QoS Usage Limit Overrides
All credentials allow specification of job limits. In such cases, jobs are
constrained by the most limiting of all applicable policies. With QoS override
limits, however, jobs are limited by the override, regardless of other limits
specified.
QoS Service Targets
Service targets cause the scheduler to take certain job-related actions as
various responsiveness targets are met. Targets can be set for either job
queue time or job expansion factor and cause priority adjustments,
reservation enforcement, or preemption activation. In strict service centric
organizations, Moab can be configured to trigger various events and
notifications in the case of failure by the cluster to meet responsiveness
targets.
QoS Privilege Flags
QoSs can provide access to special capabilities. These capabilities include
preemption, job deadline support, backfill, next to run priority, guaranteed
resource reservation, resource provisioning, dedicated resource access, and
many others. See the complete list in the QoS Facility Overview section.
QoS Charge Rate
Associated with a QoS's many privileges is the ability to assign end-users costs
for the use of these services. See Charging and Allocation Management for
more information on general single cluster and multi-cluster charging
capabilities.
QoS Access Controls
QoS access control can be enabled on a per QoS basis using the MEMBERULIST
attribute or specified on a per-requestor basis using the QDEF and QLIST
attributes of the USERCFG, GROUPCFG, ACCOUNTCFG, and CLASSCFG
parameters. See Managing QoS Access for more detail.
Related Topics
Identity Manager Interface
Usage Limits
Job Attributes/Flags Overview
In this topic:
- Job Attributes
- Job Flags
Job Attributes
FLAGS
Format:
<FLAG>[:<FLAG>]...
Default:
---
Description:
Specifies job specific flags.
Example:
FLAGS=ADVRES:RESTARTABLE
The job can restart and should only utilize
reserved resources.
PLIST*
Format:
<PARTITION_NAME>[^|&][:<PARTITION_NAME>[^|&]]...
Default:
[ALL]
Description:
Specifies the list of partitions the object can access. If no partition list is specified, the object is
granted default access to all partitions.
Example:
PLIST=OldSP:Cluster1:O3K
The object can access resources located in the OldSP, Cluster1, and/or O3K partitions.
QDEF
Format:
<QOS_NAME>
Default:
[DEFAULT]
Description:
Specifies the default QOS associated with the object.
Example:
QDEF=premium
The object is assigned the default QOS
premium.
QLIST*
Format:
<QOS_NAME>[^|&][:<QOS_NAME>[^|&]]...
Default:
<QDEF>
Description:
Specifies the list of QoSs the object can access. If no QOS list is specified, the object is granted
access only to its default QoS.
Example:
QLIST=premium:express:bottomfeeder
The object can access any of the 3 QoSs listed.
By default, jobs may access QoSs based on the logical OR of the access lists associated with all job credentials. For example, a job associated with user "John," group "staff," and class "batch" may utilize QoSs accessible by any of the individual credentials; the job's QoS access list, or QLIST, equals the OR of the user, group, and class QLISTs (that is, JOBQLIST = USERQLIST | GROUPQLIST | CLASSQLIST). If the ampersand symbol ('&') is associated with any list, that list is logically ANDed with the other lists. If the caret symbol ('^') is associated with any object QLIST, that list is exclusively set, regardless of other object access lists, using the following order of precedence: user, group, account, QOS, and class. These special symbols affect the behavior of both QOS and partition access lists.
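The following sketch, with illustrative credential and QoS names, shows how these operators combine:
USERCFG[john]   QLIST=premium:express
GROUPCFG[staff] QLIST=express:bottomfeeder
# default OR behavior: john's jobs may request premium, express, or bottomfeeder
CLASSCFG[batch] QLIST=express&
# '&' ANDs this list with the others: jobs in batch may only request express
USERCFG[jill]   QLIST=premium^
# '^' exclusively sets the list: jill's jobs may only request premium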
Job Flags
ADVRES
Format:
ADVRES[:<RESID>]
Default:
Use available resources wherever found, whether inside a reservation or not.
Description:
Specifies the job may only utilize accessible, reserved resources. If <RESID> is specified, only
resources in the specified reservation may be utilized.
Example:
FLAGS=ADVRES:META.1
The job may only utilize resources located in the META.1 reservation.
ALLPROCS
Format:
---
Default:
---
Description:
Each task should occupy all the processors on the node.
Incompatible with ppn and non-Torque systems.
ALLPROCS is scheduled to be deprecated in a future Moab version in which it will be
replaced with the new NUMA job submission syntax (place=nodes in this particular case).
Example:
msub -l nodes=6 -l flags=allprocs
Each of the 6 tasks will occupy all the processors on the node and the job will launch
enough processes to occupy each of those processors.
ARRAYJOBPARLOCK
Format:
---
Default:
---
Description:
Specifies that the job array being submitted should not span across multiple partitions. This locks
all sub jobs of the array to a single partition. If you want to lock all job arrays to a single partition,
specify the ARRAYJOBPARLOCK parameter in moab.cfg to force this behavior on a global scale.
Example:
> msub -t moab.[1-5]%3 -l walltime=30,flags=arrayjobparlock
ARRAYJOBPARSPAN
Format:
---
Default:
---
Description:
Specifies that the job array being submitted should span across multiple partitions. This is the
default behavior in Moab, unless the ARRAYJOBPARLOCK parameter is specified in moab.cfg.
This job flag overrides the ARRAYJOBPARLOCK parameter so that job arrays can be allowed to
span multiple partitions at submit time.
Example:
> msub -t moab.[1-5]%3 -l walltime=30,flags=arrayjobparspan
GRESONLY
Format:
GRESONLY
Default:
False
Description:
Uses no compute resources such as processors, memory, and so forth; uses only generic resources.
Example:
> msub -l gres=matlab,walltime=300
IGNIDLEJOBRSV
Format:
IGNIDLEJOBRSV
Default:
N/A
Description:
Only applies to QOS. IGNIDLEJOBRSV allows jobs to start without a guaranteed walltime. Instead, it
overlaps the idle reservations of real jobs and is preempted 2 minutes before the real job starts.
Example:
QOSCFG[standby] JOBFLAGS=IGNIDLEJOBRSV
NOQUEUE
Format:
NOQUEUE
Default:
Jobs remain queued until they are able to run
Description:
Specifies that the job should be removed if it is unable to allocate resources and start execution
immediately.
Example:
FLAGS=NOQUEUE
The job should be removed unless it can start running at submit time.
This functionality is identical to the resource manager extension QUEUEJOB:FALSE.
NORMSTART
Format:
NORMSTART
Default:
Moab passes jobs to a resource manager to schedule.
Description:
Specifies that the job is an internal system job and will not be started via an RM.
Example:
FLAGS=NORMSTART
The job begins running in Moab without a corresponding RM job.
NOVMMIGRATE
Format
NOVMMIGRATE
Default
Moab can migrate the VM associated with the job.
Description
Specifies that Moab may not migrate the VM that the job sets up.
Example
msub -l
walltime=INFINITY,template=VMTracking,os=linux,nodes=h3,jobflags=novmmigrate
Moab will not migrate the new VM.
PREEMPTEE
Format:
PREEMPTEE
Default:
Jobs may not be preempted by other jobs
Description:
Specifies that the job may be preempted by other jobs which have the PREEMPTOR flag set.
Example:
FLAGS=PREEMPTEE
The job may be preempted by other jobs which have the PREEMPTOR flag set.
PREEMPTOR
Format:
PREEMPTOR
Default:
Jobs may not preempt other jobs
Description:
Specifies that the job may preempt other jobs which have the PREEMPTEE flag set.
Example:
FLAGS=PREEMPTOR
The job may preempt other jobs which have the PREEMPTEE flag set.
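These two flags are commonly attached in pairs through credentials rather than per job; a sketch using hypothetical QoS names:

QOSCFG[low]  JOBFLAGS=PREEMPTEE
QOSCFG[high] JOBFLAGS=PREEMPTOR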
PURGEONSUCCESSONLY
Format
PURGEONSUCCESSONLY
Default
Completed jobs are sent to a queue for a short period of time before Moab purges them from
the system.
Description
Specifies that Moab should only purge the job from the completed queue if it completed successfully. If the job failed, Moab will keep it in the queue indefinitely to allow you to restart it at
any time. This flag is particularly useful for setup and take down jobs in job workflows. See
Creating Workflows with Job Templates for more information.
Example
FLAGS=PURGEONSUCCESSONLY
If the job fails, Moab will not purge it from the completed job queue.
RESTARTABLE
Format:
RESTARTABLE
Default:
Jobs may not be restarted if preempted.
Description:
Specifies jobs can be requeued and later restarted if preempted.
Example:
FLAGS=RESTARTABLE
The associated job can be preempted and restarted
at a later date.
SUSPENDABLE
Format:
SUSPENDABLE
Default:
Jobs may not be suspended if preempted.
Description:
Specifies jobs can be suspended and later resumed if preempted.
Example:
FLAGS=SUSPENDABLE
The associated job can be suspended and resumed at
a later date.
SYSTEMJOB
Format:
SYSTEMJOB
Default:
N/A
Description:
Creates an internal system job that does not require resources.
Example:
FLAGS=SYSTEMJOB
USEMOABJOBID
Format
<BOOLEAN>
Default
FALSE
Description
Specifies whether to return the Moab job ID when running "msub", or the resource manager's job
ID if it is available.
Setting USEMOABJOBID here overrides the global setting for USEMOABJOBID in moab.cfg.
See USEMOABJOBID for more information.
Example
FLAGS=USEMOABJOBID SELECT=TRUE
WIDERSVSEARCHALGO
Format:
<BOOLEAN>
Default:
---
Description:
When Moab is determining when and where a job can run, it either searches for the most
resources or the longest range of resources. In almost all cases searching for the longest range is
ideal and returns the soonest starttime. In some rare cases, however, a particular job may need
to search for the most resources. In those cases this flag can be used to have the job find the soonest starttime. The flag can be specified at submit time, or you can use mjobctl -m to modify the
job after it has been submitted. See the RSVSEARCHALGO parameter.
Example:
> msub -l flags=widersvsearchalgo
> mjobctl -m flags+=widersvsearchalgo job.1
Related Topics
Setting Per-Credential Job Flags
Chapter 3 Scheduler Commands
Moab Commands

Command           Description
checkjob          Provide detailed status report for specified job
checknode         Provide detailed status report for specified node
mcredctl          Controls various aspects about the credential objects within Moab
mdiag             Provide diagnostic reports for resources, workload, and scheduling
mjobctl           Control and modify job
mnodectl          Control and modify nodes
moab              Control the Moab daemon
mrmctl            Query and control resource managers
mrsvctl           Create, control and modify reservations
mschedctl         Modify scheduler state and behavior
mshow             Displays various diagnostic messages about the system and job queues
mshow -a          Query and show available system resources
msub              Scheduler job submission
mvcctl            Create, modify, and delete VCs
mvmctl            Create, control and modify VMs
showbf            Show current resource availability
showhist.moab.pl  Show past job information
showq             Show queued jobs
showres           Show existing reservations
showstart         Show estimates of when job can/will start
showstate         Show current state of resources
showstats         Show usage statistics
showstats -f      Show various tables of scheduling/system performance
Moab command options
For many Moab commands, you can use the following options to specify that
Moab will run the command in a different way or different location from the
configured default. These options do not change your settings in the
configuration file; they override the settings for this single instance of the
command.
Option                     Description
--about                    Displays build and version information and the status of your Moab license
--help                     Displays usage information about the command
--host=<serverHostName>    Causes Moab to run the client command on the specified host
--loglevel=<logLevel>      Causes Moab to write log information to STDERR as the client command is running. For more information, see Logging Overview.
--msg=<message>            Causes Moab to annotate the action in the event log
--port=<serverPort>        Causes Moab to run the command using the port specified
--timeout=<seconds>        Sets the maximum time that the client command will wait for a response from the Moab server
--version                  Displays version information
--xml                      Causes Moab to return the command output in XML format
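For instance, a single query can be directed at a different Moab server (the host name and port here are illustrative):

> showq --host=moab-master --port=42559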
Commands Providing Maui Compatibility
The following commands are deprecated. Click the link for each
deprecated command to see its updated replacement command.
Command       Description
canceljob     Cancel job
changeparam   Change in-memory parameter settings
diagnose      Provide diagnostic report for various aspects of resources, workload, and scheduling
releasehold   Release job defers and holds
releaseres    Release reservations
runjob        Force a job to run immediately
sethold       Set job holds
setqos        Modify job QOS settings
setres        Set an admin/user reservation
setspri       Adjust job/system priority of job
showconfig    Show current scheduler configuration
Status Commands
The status commands organize and present information about the current
state and historical statistics of the scheduler, jobs, resources, users, and
accounts. The following table presents the primary status commands and flags.
Command       Description
checkjob      Displays detailed job information such as job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization.
checknode     Displays detailed node information such as node state, resources, attributes, reservations, history, and statistics.
mdiag -f      Displays summarized fairshare information and any unexpected fairshare configuration.
mdiag -j      Displays summarized job information and any unexpected job state.
mdiag -n      Displays summarized node information and any unexpected node state.
mdiag -p      Displays summarized job priority information.
mschedctl -f  Resets internal statistics.
showstats -f  Displays various aspects of scheduling performance across a job duration/job size matrix.
showq [-r|i]  Displays various views of currently queued active, idle, and non-eligible jobs.
showstats -g  Displays current and historical usage on a per group basis.
showstats -u  Displays current and historical usage on a per user basis.
showstats -v  Displays high level current and historical scheduling statistics.
Job Management Commands
Moab shares job management tasks with the resource manager. Typically, the
scheduler only modifies scheduling relevant aspects of the job such as partition
access, job priority, charge account, and hold state. The following table covers
the available job management commands. The Commands Overview lists all
available commands.
Command         Description
canceljob       Cancels existing job.
checkjob        Displays job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization.
mdiag -j        Displays summarized job information and any unexpected job state.
releasehold -a  Removes job holds or deferrals.
runjob          Starts job immediately, if possible.
sethold         Sets hold on job.
setqos          Sets/modifies QoS of existing job.
setspri         Adjusts job/system priority of job.
Related Topics
Job State Definitions
Reservation Management Commands
Moab exclusively controls and manages all advance reservation features
including both standing and administrative reservations. The following table
covers the available reservation management commands.
Command     Description
mdiag -r    Displays summarized reservation information and any unexpected state.
mrsvctl     Reservation control.
mrsvctl -r  Removes reservations.
mrsvctl -c  Creates an administrative reservation.
showres     Displays information regarding location and state of reservations.
Policy/Configuration Management Commands
Moab allows dynamic modification of most scheduling parameters allowing new
scheduling policies, algorithms, constraints, and permissions to be set at any
time. Changes made via Moab client commands are temporary and are
overridden by values specified in Moab configuration files the next time Moab is
shut down and restarted. The following table covers the available configuration
management commands.
Command       Description
mschedctl -l  Displays triggers, messages, and settings of all configuration parameters.
mschedctl     Controls the scheduler (behavior, parameters, triggers, messages).
mschedctl -m  Modifies system values.
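For example, a parameter can be adjusted in memory until the next restart (the parameter and value here are illustrative; see the mschedctl documentation for the exact syntax):

> mschedctl -m "LOGLEVEL 6"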
End-user Commands
While the majority of Moab commands are tailored for use by system
administrators, a number of commands are designed to extend the knowledge
and capabilities of end-users. The following table covers the commands
available to end-users.
When using Active Directory as a central authentication mechanism, all
nodes must be reported with a different name when booted in both Linux
and Windows (for instance, node01-l for Linux and node01 for Windows).
If a machine account with the same name is created for each OS, the most
recent OS will remove the previously-joined machine account. The nodes
must report to Moab with the same hostname. This can be done by using
aliases (adding all node names to the /etc/hosts file on the system
where Moab is running) and ensuring that the Linux resource manager
reports the node with its global name rather than the Linux-specific one
(node01 rather than node01-l).
Command     Description
canceljob   Cancels existing job.
checkjob    Displays job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization.
msub        Submit a new job.
releaseres  Releases a user reservation.
setres      Create a user reservation.
showbf      Shows resource availability for jobs with specific resource requirements.
showq       Displays detailed prioritized list of active and idle jobs.
showstart   Shows estimated start time of idle jobs.
showstats   Shows detailed usage statistics for users, groups, and accounts, to which the end-user has access.
Related Topics
Commands Overview
Commands
checkjob
Synopsis
checkjob [exact:jobid] [-l policylevel] [-n nodeid] [-q qosid] [-r reservationid] [-v] [--flags=future | complete] [--blocking] jobid
Overview
checkjob displays detailed job state information and diagnostic output for a
specified job. Detailed information is available for queued, blocked, active, and
recently completed jobs. The checkjob command shows the master job of an
array as well as a summary of array sub-jobs, but does not display all sub-jobs.
Use checkjob -v to display all job-array sub-jobs.
Access
This command can be run by level 1-3 Moab administrators for any job. Also,
end users can use checkjob to view the status of their own jobs.
Arguments
--blocking
Format
--blocking
Description
Do not use cache information in the output. The --blocking flag retrieves results exclusively
from the scheduler.
Example
> checkjob -v --blocking 1234
Display real time data about job 1234.
--flags
Format
--flags=future | complete
Description
- future – Evaluates future eligibility of job (ignore current resource state and usage limitations).
- complete – Queries details for jobs that have already terminated.
Example
> checkjob -v --flags=future 6235
Display reasons why idle job is blocked ignoring node state and current node utilization constraints.
exact
Format
exact:<JOBID>
Description
Searches for and returns the exact job ID
Example
> checkjob exact:1.job_dependency1
-l (Policy level)
Format
<POLICYLEVEL>
HARD, SOFT, or OFF
Description
Example
Reports job start eligibility subject to specified throttling policy level.
> checkjob -l SOFT 6235
> checkjob -l HARD 6235
-n (NodeID)
Format
<NODEID>
Description
Checks job access to specified node and preemption status with regards to jobs located on that
node.
Example
> checkjob -n node113 6235
-q (QoS)
Format
<QOSID>
Description
Checks job access to specified QoS <QOSID>.
Example
> checkjob -q special 6235
-r (Reservation)
Format
<RSVID>
Description
Checks job access to specified reservation <RSVID>.
Example:
> checkjob -r orion.1 6235
-v (Verbose)
Description
Sets verbose mode. If the job is part of an array, the -v option shows pertinent array information
before the job-specific information (see Example 2 and Example 3 for differences between
standard output and -v output).
Specifying the double verbose (-v -v) displays additional information about the job. See
the Output table for details.
Example
> checkjob -v 6235
Details
This command allows any Moab administrator to check the detailed status and
resource requirements of an active, queued, or recently completed job.
Additionally, this command performs numerous diagnostic checks and
determines if and where the job could potentially run. Diagnostic checks include
policy violations, reservation constraints, preemption status, and job to
resource mapping. If a job cannot run, a text reason is provided along with a
summary of how many nodes are and are not available. If the -v flag is
specified, a node by node summary of resource availability will be displayed for
idle jobs.
Job Eligibility
If a job cannot run, a text reason is provided along with a summary of how
many nodes are and are not available. If the -v flag is specified, a node by
node summary of resource availability will be displayed for idle jobs. For job
level eligibility issues, one of the following reasons will be given:
Reason                               Description
job has hold in place                one or more job holds are currently in place
insufficient idle procs              there are currently not adequate processor resources available to start the job
idle procs do not meet requirements  adequate idle processors are available but these do not meet job requirements
start date not reached               job has specified a minimum start date which is still in the future
expected state is not idle           job is in an unexpected state
state is not idle                    job is not in the idle state
dependency is not met                job depends on another job reaching a certain state
rejected by policy                   job start is prevented by a throttling policy
If a job cannot run on a particular node, one of the following 'per node' reasons
will be given:
Reason    Description
Class     Node does not allow required job class/queue
CPU       Node does not possess required processors
Disk      Node does not possess required local disk
Features  Node does not possess required node features
Memory    Node does not possess required real memory
Network   Node does not possess required network interface
State     Node is not Idle or Running
Reservation Access
The -r flag can be used to provide detailed information about job access to a
specific reservation.
Preemption Status
If a job is marked as a preemptor and the -v and -n flags are specified, checkjob
will perform a job by job analysis for all jobs on the specified node to determine
if they can be preempted.
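For instance, a hypothetical analysis of preemptible jobs on one node might be requested as:

> checkjob -v -n node113 6235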
Output
The checkjob command displays the following job attributes:
Attribute | Value | Description
Account | <STRING> | Name of account associated with job
Actual Run Time | [[[DD:]HH:]MM:]SS | Length of time job actually ran. (This info is only displayed in simulation mode.)
Allocated Nodes | Square bracket delimited list of node and processor ids | List of nodes and processors allocated to job
Applied Nodeset** | <STRING> | Nodeset used for job's node allocation
Arch | <STRING> | Node architecture required by job
Attr | Square bracket delimited list of job attributes | Job Attributes (i.e. [BACKFILL][PREEMPTEE])
Available Memory** | <INTEGER> | The available memory requested by job. Moab displays the relative or exact value by returning a comparison symbol (>, <, >=, <=, or ==) with the value (i.e. Available Memory <= 2048).
Available Swap** | <INTEGER> | The available swap requested by job. Moab displays the relative or exact value by returning a comparison symbol (>, <, >=, <=, or ==) with the value (i.e. Available Swap >= 1024).
Average Utilized Procs* | <FLOAT> | Average load balance for a job
Avg Util Resources Per Task* | <FLOAT> | ---
BecameEligible | <TIMESTAMP> | The date and time when the job moved from Blocked to Eligible.
Bypass | <INTEGER> | Number of times a lower priority job with a later submit time ran before the job
CheckpointStartTime** | [[[DD:]HH:]MM:]SS | The time the job was first checkpointed
Class | [<CLASS NAME> <CLASS COUNT>] | Name of class/queue required by job and number of class initiators required per task.
Dedicated Resources Per Task* | Space-delimited list of <STRING>:<INTEGER> | Resources dedicated to a job on a per-task basis
Disk | <INTEGER> | Amount of local disk required by job (in MB)
Estimated Walltime | [[[DD:]HH:]MM:]SS | The scheduler's estimated walltime. (In simulation mode, it is the actual walltime.)
EnvVariables** | Comma-delimited list of <STRING> | List of environment variables assigned to job
Exec Size* | <INTEGER> | Size of job executable (in MB)
Executable | <STRING> | Name of command to run
Features | Square bracket delimited list of <STRING>s | Node features required by job
Group | <STRING> | Name of UNIX group associated with job
Holds | Zero or more of User, System, and Batch | Types of job holds currently applied to job
Image Size | <INTEGER> | Size of job data (in MB)
IWD (Initial Working Directory) | <DIR> | Directory to run the executable in
Job Messages** | <STRING> | Messages attached to a job
Job Submission** | <STRING> | Job script submitted to RM
Memory | <INTEGER> | Amount of real memory required per node (in MB)
Max Util Resources Per Task* | <FLOAT> | ---
Flags | --- | ---
NodeAccess* | --- | ---
Nodecount | <INTEGER> | Number of nodes required by job
Opsys | <STRING> | Node operating system required by job
Partition Mask | ALL or colon delimited list of partitions | List of partitions the job has access to
PE | <FLOAT> | Number of processor-equivalents requested by job
Per Partition Priority** | Tabular | Table showing job template priority for each partition
Priority Analysis** | Tabular | Table showing how job's priority was calculated: Job PRIORITY* Cred(User:Group:Class) Serv(QTime)
QOS | <STRING> | Quality of Service associated with job
Reservation | <RSVID> (<TIME1> -> <TIME2> Duration: <TIME3>) | RESID specifies the reservation id, TIME1 is the relative start time, TIME2 the relative end time, TIME3 the duration of the reservation
Req | [<INTEGER>] TaskCount: <INTEGER> Partition: <partition> | A job requirement for a single type of resource followed by the number of task instances required and the appropriate partition
StageIn | <SOURCE>%<DESTINATION> | The <SOURCE> is the username, hostname, directory and file name of origin for the file(s) that Moab will stage in for this job. The <DESTINATION> is the username, hostname, directory and file name where Moab will place the file during this job. See About Data Staging for more information.
StageInSize | <INTEGER><UNIT> | The size of the file Moab will stage in for this job. <UNIT> can be KB, MB, GB, or TB. See About Data Staging for more information.
StageOut | <SOURCE>%<DESTINATION> | The <SOURCE> is the username, hostname, directory and file name of origin for the file(s) that Moab will stage out for this job. The <DESTINATION> is the username, hostname, directory and file name where Moab will place the file during this job. See About Data Staging for more information.
StageOutSize | <INTEGER><UNIT> | The size of the file Moab will stage out for this job. <UNIT> can be KB, MB, GB, or TB. See About Data Staging for more information.
StartCount | <INTEGER> | Number of times job has been started by Moab
StartPriority | <INTEGER> | Start priority of job
StartTime | <TIME> | Time job was started by the resource management system
State | One of Idle, Starting, Running, etc. (See Job States for all possible values.) | Current Job State
SubmitTime | <TIME> | Time job was submitted to resource management system
Swap | <INTEGER> | Amount of swap disk required by job (in MB)
Task Distribution* | Square bracket delimited list of nodes | ---
Time Queued | --- | ---
Total Requested Nodes** | <INTEGER> | Number of nodes the job requested
Total Requested Tasks | <INTEGER> | Number of tasks requested by job
User | <STRING> | Name of user submitting job
Utilized Resources Per Task* | <FLOAT> | ---
WallTime | [[[DD:]HH:]MM:]SS of [[[DD:]HH:]MM:]SS | Length of time job has been running out of the specified limit

In the above table, fields marked with an asterisk (*) are only displayed when
set or when the -v flag is specified. Fields marked with two asterisks (**) are
only displayed when set or when the -v -v flag is specified.
Example 3-1: checkjob 717
> checkjob 717
job 717

State: Idle
Creds:  user:jacksond  group:jacksond  class:batch
WallTime: 00:00:00 of 00:01:40
SubmitTime: Mon Aug 15 20:49:41
  (Time Queued  Total: 3:12:23:13  Eligible: 3:12:23:11)

TerminationDate: INFINITY  Sat Oct 24 06:26:40
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: ---  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: ---  Arch: ---  Features: ---

IWD:            /home/jacksond/moab/moab-4.2.3
Executable:     STDIN
Flags:          RESTARTABLE,NORMSTART
StartPriority:  5063
Reservation '717' (INFINITY -> INFINITY  Duration: 00:01:40)
Note:  job cannot run in partition base (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 4  feasible procs: 0
Rejection Reasons: [State : 3][ReserveTime : 1]
cannot select job 717 for partition GM (partition GM does not support requested class batch)
The example job cannot be started for two different reasons.
- It is temporarily blocked from partition base because of node state and node reservation conflicts.
- It is permanently blocked from partition GM because the requested class batch is not supported in that partition.
Example 3-2: Using checkjob (no -v) on a job array master job:
checkjob array.1
job array.1

AName: array
Job Array Info:
  Name: array.1
  Sub-jobs:   10
  Active:      6 ( 60.0%)
  Eligible:    2 ( 20.0%)
  Blocked:     2 ( 20.0%)
  Complete:    0 (  0.0%)
Example 3-3: Using checkjob -v on a job array master job:
$ checkjob -v array.1
job array.1

AName: array
Job Array Info:
  Name: array.1
  1  : array.1.1  : Running
  2  : array.1.2  : Running
  3  : array.1.3  : Running
  4  : array.1.4  : Running
  5  : array.1.5  : Running
  6  : array.1.6  : Running
  7  : array.1.7  : Idle
  8  : array.1.8  : Idle
  9  : array.1.9  : Blocked
  10 : array.1.10 : Blocked
  Sub-jobs:   10
  Active:      6 ( 60.0%)
  Eligible:    2 ( 20.0%)
  Blocked:     2 ( 20.0%)
  Complete:    0 (  0.0%)
Example 3-4: Using checkjob -v on a data staging job
$ checkjob -v moab.14.dsin
job moab.14.dsin

AName: moab.14.dsin
State: Running
Creds:  user:fred  group:company
WallTime:   00:00:00 of 00:01:01
SubmitTime: Wed Apr 16 10:07:19
  (Time Queued  Total: 00:00:00  Eligible: 00:00:00)

StartTime: Wed Apr 16 10:07:19
TemplateSets:  dsin
Triggers: 78$start+0@0.000000:exec@/opt/moab/tools/datastaging/ds_move_rsync -stagein:FALSE
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: SHARED
Dedicated Resources Per Task: bandwidth: 1
NodeAccess: SHARED

Allocated Nodes: [GLOBAL:1]

Job Group:   moab.14
SystemID:    moab
SystemJID:   moab.14.dsin
Task Distribution: GLOBAL
IWD:         $HOME/test/datastaging
SubmitDir:   $HOME/test/datastaging
StartCount:  1
Parent VCs:  vc11
User Specified Partition List: local
Partition List: local
SrcRM:       internal
Flags:       NORMSTART,GRESONLY,TEMPLATESAPPLIED
Attr:        dsin
StageInSize:  386MB
StageOutSize: 100MB
StageIn:  fred@remotelab:/home/fred/input1/%fred@scratch:/home/fred/input1/
StageIn:  fred@remotelab:/home/fred/input2/%fred@scratch:/home/fred/input2/
StageIn:  fred@remotelab:/home/fred/input3/%fred@scratch:/home/fred/input3/
StageOut: fred@scratch:/home/fred/output/%fred@remotelab:/home/fred/output/
StartPriority: 1
SJob Type:         datastaging
Completion Policy: datastaging
PE: 0.00
Reservation 'moab.14.dsin' (-00:00:06 -> 00:00:55  Duration: 00:01:01)
Related Topics
showhist.moab.pl - explains how to query for past job information
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mdiag -j command - display additional detailed information regarding jobs
showq command - show high-level job summaries
JOBCPURGETIME parameter - specify how long information regarding completed jobs is maintained
diagnosing job preemption
checknode
Synopsis
checknode [options] <nodeID> | ALL
Overview
This command shows detailed state information and statistics for nodes that
run jobs.
The following information is returned by this command:
Name                 Description
ACL                  Node Access Control List (displayed only if set)
ActiveTime           Total time node has been busy (allocated to active jobs) since statistics initialization, expressed in HH:MM:SS notation (percent of time busy: BusyTime/TotalTime)
Adapters             Network adapters available
Arch                 Architecture
Classes              Classes available
Disk                 Disk space available
Downtime             Displayed only if downtime is scheduled
EffNodeAccessPolicy  Configured effective node access policy
Features             Features available
Load                 CPU load (Berkeley one-minute load average)
Memory               Memory available
Opsys                Operating system
RequestID            Dynamic Node RequestID set by the RM (displayed only if set)
State                Node state
StateTime            Time node has been in current state in HH:MM:SS notation
Swap                 Swap space available
TotalTime            Total time node has been detected since statistics initialization, expressed in HH:MM:SS notation
TTL                  Dynamic Node Time To Live set by the RM (expiration date, displayed only if set)
UpTime               Total time node has been in an available (non-Down) state since statistics initialization, expressed in HH:MM:SS notation (percent of time up: UpTime/TotalTime)
After displaying this information, some analysis is performed and any unusual
conditions are reported.
Access
By default, this command can be run by any Moab Administrator (see
ADMINCFG).
Parameters
Name  Description
NODE  Node name you want to check. Moab uses regular expressions to return any node that contains the provided argument. For example, if you ran checknode node1, Moab would return information about node1, node10, node100, etc. If you want to limit the results to node1 only, you would run checknode "^node1$".
Flags
Name   Description
ALL    Returns checknode output on all nodes in the cluster.
-h     Help for this command.
-v     Returns verbose output.
--xml  Output in XML format. Same as mdiag -n --xml.
Example 3-5: checknode
> checknode P690-032
node P690-032

State:      Busy  (in current state for 11:31:10)
Configured Resources: PROCS: 1  MEM: 16G  SWAP: 2000M  DISK: 500G
Utilized   Resources: PROCS: 1
Dedicated  Resources: PROCS: 1
Opsys:      AIX      Arch:    P690
Speed:      1.00     CPULoad: 1.000
Network:    InfiniBand,Myrinet
Features:   Myrinet
Attributes: [Batch]
Classes:    [batch]

Total Time: 5:23:28:36  Up: 5:23:28:36 (100.00%)  Active: 5:19:44:22 (97.40%)

Reservations:
  Job '13678'(x1)  10:16:12:22 -> 12:16:12:22 (2:00:00:00)
  Job '13186'(x1)  -11:31:10 -> 1:12:28:50 (2:00:00:00)
Jobs: 13186
Example 3-6: checknode ALL
> checknode ALL
node ahe

State:      Idle  (in current state for 00:00:30)
Configured Resources: PROCS: 12  MEM: 8004M  SWAP: 26G  DISK: 1M
Utilized   Resources: PROCS: 1  SWAP: 4106M
Dedicated  Resources: ---
MTBF(longterm): INFINITY  MTBF(24h): INFINITY
Opsys:      linux     Arch:    ---
Speed:      1.00      CPULoad: 1.400
Flags:      rmdetected
Classes:    [batch]
RM[ahe]*:   TYPE=PBS
EffNodeAccessPolicy: SHARED

Total Time: 00:01:44  Up: 00:01:44 (100.00%)  Active: 00:00:00 (0.00%)

Reservations: ---

node ahe-ubuntu32

State:      Running  (in current state for 00:00:05)
Configured Resources: PROCS: 12  MEM: 2013M  SWAP: 3405M  DISK: 1M
Utilized   Resources: PROCS: 6  SWAP: 55M
Dedicated  Resources: PROCS: 6
MTBF(longterm): INFINITY  MTBF(24h): INFINITY
Opsys:      linux     Arch:    ---
Speed:      1.00      CPULoad: 2.000
Flags:      rmdetected
Classes:    [batch]
RM[ahe]*:   TYPE=PBS
EffNodeAccessPolicy: SHARED

Total Time: 00:01:44  Up: 00:01:44 (100.00%)  Active: 00:00:02 (1.92%)

Reservations:
  6x2  Job:Running  -00:00:07 -> 00:01:53 (00:02:00)
  7x2  Job:Running  -00:00:06 -> 00:01:54 (00:02:00)
  8x2  Job:Running  -00:00:05 -> 00:01:55 (00:02:00)
Jobs: 6,7,8

node ahe-ubuntu64

State:      Busy  (in current state for 00:00:06)
Configured Resources: PROCS: 12  MEM: 2008M  SWAP: 3317M  DISK: 1M
Utilized   Resources: PROCS: 12  SWAP: 359M
Dedicated  Resources: PROCS: 12
MTBF(longterm): INFINITY  MTBF(24h): INFINITY
Opsys:      linux     Arch:    ---
Speed:      1.00      CPULoad: 0.000
Flags:      rmdetected
Classes:    [batch]
RM[ahe]*:   TYPE=PBS
EffNodeAccessPolicy: SHARED

Total Time: 00:01:44  Up: 00:01:44 (100.00%)  Active: 00:00:55 (52.88%)

Reservations:
  0x2  Job:Running  -00:01:10 -> 00:00:50 (00:02:00)
  1x2  Job:Running  -00:00:20 -> 00:01:40 (00:02:00)
  2x2  Job:Running  -00:00:20 -> 00:01:40 (00:02:00)
  3x2  Job:Running  -00:00:17 -> 00:01:43 (00:02:00)
  4x2  Job:Running  -00:00:13 -> 00:01:47 (00:02:00)
  5x2  Job:Running  -00:00:07 -> 00:01:53 (00:02:00)
Jobs: 0,1,2,3,4,5
ALERT:  node is in state Busy but load is low (0.000)
Example 3-7: checknode node001 (Dynamic Node)
> checknode node001
node node001

State:      Idle  (in current state for 00:13:50)
Configured Resources: PROCS: 2  MEM: 4096M
Utilized   Resources: PROCS: 2
Dedicated  Resources: ---
ACL:        USER==FRED+:==BOB+  GROUP==DEV+
MTBF(longterm): INFINITY  MTBF(24h): INFINITY
Opsys:      ---      Arch:    ---
Speed:      1.00     CPULoad: 2.000
Partition:  local  Rack/Slot: ---  NodeIndex: 1
RM[local]*: TYPE=NATIVE:AGFULL
EffNodeAccessPolicy: SHARED
RequestID:  1234
TTL:        Tue Nov 10 00:00:00 2015

Total Time: 2:21:19:05  Up: 2:21:19:05 (100.00%)  Active: 00:00:00 (0.00%)

Reservations:
  node001-TTL-1234x1  User  441days -> INFINITY (INFINITY)
    Blocked Resources@441days  Procs: 2/2 (100.00%)  Mem: 4096/4096 (100.00%)
    Swap: 1/1 (100.00%)  Disk: 1/1 (100.00%)
ALERT: node is in state Idle but load is high (2.000)
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mdiag -n
showstate
mcredctl
Synopsis
mcredctl [-d credtype[:credid]] [-h credtype:credid] [-l credtype] [-q
{role|limit|profile|accessfrom|accessto|policies} credtype[:credid]]
[--format=xml] [-r {stats|fairshare|uid} <type>[:<ID>]] [-t <STARTTIME>[,<ENDTIME>]]
Overview
The mcredctl command controls various aspects about the credential objects
within Moab. It can be used to display configuration, limits, roles, and
relationships for various Moab credential objects.
If using Insight, you must restart Moab to view credential modifications.
Arguments
In all cases <CREDTYPE> is one of acct, group, user, class, or qos.
In most cases it is necessary to use the --format=xml flag in order to
print the output (see examples below for specific syntax requirements).
-d - DESTROY
Format
<TYPE>:<VAL>
Description
Purge a credential from moab.cfg (does not delete credential from memory).
Example
> mcredctl -d user:john
All references to USERCFG[john] will be commented out of moab.cfg.
-h - HOLD
Format
<TYPE>:<VAL>
Description
Toggles whether a given credential's jobs should be placed on hold or not.
Example
> mcredctl -h user:john
User [john] will be put on hold.
-l - LIST
Format
<TYPE>
Description
List the various sub-objects of the specified credential.
Example
> mcredctl -l user --format=xml
List all users within Moab in XML.
> mcredctl -l group --format=xml
List all groups within Moab in XML.
-q - QUERY
Format
{role | accessfrom | accessto | limit| profile | policies}
limit <TYPE>
policies <TYPE>
role <USER>:<USERID>
profile <TYPE>[:<VAL>]
accessfrom <TYPE>[:<VAL>]
accessto <TYPE>[:<VAL>]
Description:
Display various aspects of a credential (formatted in XML)
Example:
> mcredctl -q role user:bob --format=xml
View user bob's administrative role within Moab in XML
> mcredctl -q limit acct --format=xml
Display limits for all accounts in XML
> mcredctl -q policies user:bob
View limits organized by credential for user bob on each partition and resource manager
> mcredctl -q profile group --format=xml --timeout=00:10:00 -o time:1388590200,1431529200,types:TPSD
Generates a report of processor hours used by groups per month. TPSD represents total proc-seconds dedicated by this credential in the profiling interval.
-r - RESET
Format
{stats|fairshare|uid} <TYPE> [:<ID>]
Description
Reset the stats, fairshare, or uid/gid of a given credential.
When resetting uid, only a type of user is
supported.
Example
> mcredctl -r uid user:john
Resets the UID/GID for the user named john.
-t - TIMEFRAME
Format
<STARTTIME>[,<ENDTIME>]
Description
Can be used in conjunction with the -q profile option to display profiling information for the specified timeframe.
Example
> mcredctl -q profile user -t 14:30_06/20
Credential Statistics XML Output
Credential statistics can be requested as XML (via the --format=xml
argument) and will be written to STDOUT in the following format:
> mcredctl -q profile user --format=xml -o time:1182927600,1183013999
<Data>
<user ...>
<Profile ...>
</Profile>
</user>
</Data>
Example 3-8: Deleting a group
> mcredctl -d group:john
GROUPCFG[john] Successfully purged from config files
Example 3-9: List users in XML format
> mcredctl -l user --format=xml
<Data><user ID="john"></user><user ID="root"></user><user ID="dev"></user></Data>
Example 3-10: Display information about a user
> mcredctl -q role user:john --format=xml
<Data><user ID="test" role="admin5"></user></Data>
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mdiag
Synopsis
mdiag -a [accountid]
mdiag -b [-l policylevel] [-t partition]
mdiag -c [classid]
mdiag -C [configfile] // diagnose config file syntax
mdiag -e [-w <starttime>|<endtime>|<eventtypes>|<oidlist>|<eidlist>|<objectlist>] --xml
mdiag -f [-o user|group|acct|qos|class] [-v]
mdiag -g [groupid]
mdiag -G [Green]
mdiag -j [jobid] [-t <partition>] [-v] [--blocking]
mdiag -l
mdiag -L [-v] // diagnose usage limits
mdiag -n [-A <creds>] [-t partition] [nodeid] [-v]
mdiag -p [-t partition] [-v] // diagnose job priority
mdiag -q [qosid]
mdiag -r [reservationid] [-v] [-w type=<type>] [--blocking]
mdiag -R [resourcemanagername] [-v]
mdiag -s [standingreservationid] [--blocking]
mdiag -S [-v] // diagnose scheduler
mdiag -t [-v] // diagnose partitions
mdiag -T [triggerid] [-v] [--blocking]
mdiag -u [userid]
mdiag [--format=xml]
Overview
The mdiag command is used to display information about various aspects of the
cluster and the results of internal diagnostic tests. In summary, it provides the
following:
- current object health and state information
- current object configuration (resources, policies, attributes, etc.)
- current and historical performance/utilization information
- reports on recent failures
- object messages
Some mdiag options gather information from the Moab cache which prevents
them from interrupting the scheduler, but the --blocking option can be used
to bypass the cache and interrupt the scheduler.
Arguments
Argument            Description
-a [accountid]      Display account information.
-b                  Display information on jobs blocked by policies, holds, or other factors.
                    If blocked job diagnostics are specified, the -t option is also available to constrain the report to analysis of a particular partition. Also, with blocked job diagnosis, the -l option can be used to specify the analysis policy level.
-c [classid]        Display class information.
-C [file]           With the vast array of options in the configuration file, the -C option does not validate function, but it does analyze the configuration file for syntax errors including use of invalid parameters, deprecated parameters, and some illegal values. If you start Moab with the -e flag, Moab evaluates the configuration file at startup and quits if an error exists.
                    mdiag -C does not print out any #INCLUDE lines listed in moab.cfg (and moab.dat), but it does evaluate and print out the lines found in those included files.
-e                  Moab will do a query for all events whose eventtime starts at <starttime> and matches the search criteria. This works only when Moab is configured with ODBC MySQL. The syntax is:
                    mdiag -e [-w <starttime>|<eventtypes>|<oidlist>|<eidlist>|<objectlist>] --xml
                    - starttime default is -
                    - eventtypes is comma delimited; the default is all event types (possible values can be found in the EventType table in the Moab database)
                    - oidlist is a comma-delimited list of object ids; the default is all object ids
                    - eidlist is a comma-delimited list of specific event ids; the default is all event ids
                    - objectlist is a comma-delimited list of object types; the default is all object types (possible values can be found in the ObjectType table in the Moab database)
-f                  Display fairshare information.
-g [groupid]        Display group information.
-G [Green]          Display power management information.
-j [jobid]          Display job information.
-l                  Diagnose license information contained in the moab.lic file.
-L                  Display limits.
-n [nodeid]         Display nodes. If node diagnostics are specified, the -t option is also available to constrain the report to a particular partition.
-p                  Display job priority. If priority diagnostics are specified, the -t option is also available to constrain the report to a particular partition.
-q [qosid]          Display qos information.
-r [reservationid]  Display reservation information.
-R [rmid]           Display resource manager information.
-s [srsv]           Display standing reservation information.
-S                  Display general scheduler information.
-t                  Display configuration, usage, health, and diagnostic information about partitions maintained by Moab.
-T [triggerid]      Display trigger information.
-u [userid]         Display user information.
--format=xml        Display output in XML format.
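For example, the configuration file can be checked for syntax problems before restarting Moab (the path shown is illustrative):

> mdiag -C /opt/moab/etc/moab.cfg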
XML Output
Information for most of the options can be reported as XML as well. This is
done with the command mdiag -<option> <CLASS_ID> --format=xml. For
example, XML-based class information will be written to STDOUT in the
following format:
<Data>
  <class <ATTR>="<VAL>" ... >
    <stats <ATTR>="<VAL>" ... >
      <Profile <ATTR>="<VAL>" ... >
      </Profile>
    </stats>
  </class>
  ...
</Data>
Of the mdiag options, only -G and -L cannot be reported as XML.
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
checkjob
checknode
mdiag -a
Synopsis
mdiag -a [accountid]
Overview
The mdiag -a command provides detailed information about the accounts (aka
projects) Moab is currently tracking. This command also allows an
administrator to verify correct throttling policies and access provided to and
from other credentials.
Example 3-11: Generating information about accounts
> mdiag -a
evaluating acct information
Name         Priority  QDef     QOSList*                 PartitionList  Target  Limits
engineering       100  high     high,urgent,low          [A][B]          30.00  MAXJOB=50,75 MAXPROC=400,500
marketing           1  low      low                      [A]              5.00  MAXJOB=100,110 MAXPS=54000,54500
it                 10  DEFAULT  DEFAULT,high,urgent,low  [A]            100.00  MAXPROC=100,1250 MAXPS=12000,12500 FSWEIGHT=1000
development       100  high     high,urgent,low          [A][B]          30.00  MAXJOB=50,75 MAXNODE=100,120
research          100  high     DEFAULT,high,low         [A][B]          30.00  MAXNODE=400,500 MAXPS=900000,1000000
DEFAULT             0  -        -                        -                0.00  -
Related Topics
Account credential
mdiag -b
Synopsis
mdiag -b [-l policylevel] [-t partition]
Overview
The mdiag -b command returns information about blocked jobs.
mdiag -c
Synopsis
mdiag -c [-v] [classid]
Overview
The mdiag -c command provides detailed information about the classes Moab is
currently tracking. This command also allows an administrator to verify correct
throttling policies and access provided to and from other credentials.
The term class is used interchangeably with the term queue and generally
refers to a resource manager queue.
XML Attributes
Name              Description
ADEF              Accounts a class has access to.
CAPACITY          Number of procs available to the class.
DEFAULT.ATTR      Default attributes attached to a job.
DEFAULT.DISK      Default required disk attached to a job.
DEFAULT.FEATURES  Default required node features attached to a job.
DEFAULT.GRES      Default generic resources attached to a job.
DEFAULT.MEM       Default required memory attached to a job.
DEFAULT.NODESET   Default specified nodeset attached to a job.
DEFAULT.WCLIMIT   Default wallclock limit attached to a job.
EXCL.FEATURES     List of excluded (disallowed) node features.
EXCL.FLAGS        List of excluded (disallowed) job flags.
FSTARGET          The class' fairshare target.
HOLD              If TRUE this credential has a hold on it, FALSE otherwise.
HOSTLIST          The list of hosts in this class.
JOBEPILOG         Scheduler level job epilog to be run after job is completed by resource manager (script path).
JOBFLAGS          Default flags attached to jobs in the class.
JOBPROLOG         Scheduler level job prolog to be run before job is started by resource manager (script path).
ID                The unique ID of this class.
LOGLEVEL          The log level attached to jobs in the class.
MAX.PROC          The max processors per job in the class.
MAX.PS            The max processor-seconds per job in the class.
MAX.WCLIMIT       The max wallclock limit per job in the class.
MAXIJOB           The max idle jobs in the class.
MAXIPROC          The max idle processors in the class.
MAXJOBPERUSER     The max jobs per user.
MAXNODEPERJOB     The max nodes per job.
MAXNODEPERUSER    The max nodes per user.
MAXPROCPERJOB     The max processors per job.
MAXPROCPERNODE    The max processors per node.
MAXPROCPERUSER    The max processors per user.
MIN.NODE          The minimum nodes per job in the class.
MIN.PROC          The minimum processors per job in the class.
MIN.WCLIMIT       The minimum wallclock limit per job in the class.
NODEACCESSPOLICY  The node access policy associated with jobs in the class.
OCDPROCFACTOR     Dedicated processor factor.
OCNODE            Overcommit node.
PRIORITY          The class' associated priority.
PRIORITYF         Priority calculation function.
REQ.FEATURES      Required features for a job to be considered in the class.
REQ.FLAGS         Required flags for a job to be considered in the class.
REQ.IMAGE         Required image for a job to be considered in the class.
REQUIREDUSERLIST  The list of users who have access to the class.
RM                The resource manager reporting the class.
STATE             The class' state.
WCOVERRUN         Tolerated amount of time beyond the specified wallclock limit.
Example 3-12: Generating information about classes
> mdiag -c
Class/Queue Status

ClassID   Priority  Flags  QDef      QOSList*  PartitionList  Target  Limits
DEFAULT          0  ---    ---       ---       ---              0.00  ---
batch            1  ---    ---       ---       [A][B]          70.00  MAXJOB=33:200,250 MAX.WCLIMIT=10:00:00 MAXPROCPERJOB=128
long             1  ---    low       low       [A]             10.00  MAXJOB=3:100,200 MAX.WCLIMIT=1:00:00:00 MAXPROCPERJOB=128
fast           100  ---    high      high      [B]             10.00  MAXJOB=8:100,150 MAX.WCLIMIT=00:30:00 MAXPROCPERJOB=128
bigmem           1  ---    low,high  low       ---             10.00  MAXJOB=1:100,200 MAXPROCPERJOB=128

In the example above, class fast has MAXJOB soft and hard limits of 100 and 150 respectively and is
currently running 8 jobs.

The Limits column will display limits in the following format:
<USAGE>:<HARDLIMIT>[,<SOFTLIMIT>]
Related Topics
showstats command - display general statistics
mdiag -f
Synopsis
mdiag -f [-o user|group|acct|qos|class] [--flags=relative] [-w
par=<PARTITIONID>]
Overview
The mdiag -f command is used to display at a glance information about the
fairshare configuration and historic resource utilization. The fairshare usage
may impact job prioritization, job eligibility, or both based on the credential
FSTARGET and FSCAP attributes and by the fairshare priority weights as described
in the Job Prioritization Overview. The information presented by this command
includes fairshare configuration and credential fairshare usage over time.
The command hides information about credentials which have no fairshare
target and no fairshare cap.
If an object type (<OTYPE>) is specified, then only information for that
credential type (user, group, acct, class, or qos) will be displayed. If the
relative flag is set, then per user fairshare usage will be displayed relative to
each non-user credential (see the second example below).
Relative output is only displayed for credentials which have user
mappings. For example, if there is no association between classes and
users, no relative per user fairshare usage class breakdown will be
provided.
Example 3-13: Standard Fairshare Output
> mdiag -f
FairShare Information

Depth: 6 intervals   Interval Length: 00:20:00   Decay Rate: 0.50

FS Policy: DEDICATEDPES
System FS Settings:  Target Usage: 0.00

FSInterval        %     Target       0       1       2       3       4       5
FSWeight        ------- -------  1.0000  0.5000  0.2500  0.1250  0.0625  0.0312
TotalUsage       100.00 -------    85.3   476.1   478.9   478.5   475.5   482.8

USER
-------------
mattp              2.51 -------    2.20    2.69    2.21    2.65    2.65    3.01
jsmith            12.82 -------   12.66   15.36   10.96    8.74    8.15   13.85
kyliem             3.44 -------    3.93    2.78    4.36    3.11    3.94    4.25
tgh                4.94 -------    4.44    5.12    5.52    3.95    4.66    4.76
walex              1.51 -------    3.14    1.15    1.05    1.61    1.22    1.60
jimf               4.73 -------    4.67    4.31    5.67    4.49    4.93    4.92
poy                4.64 -------    4.43    4.61    4.58    4.76    5.36    4.90
mjackson           0.66 -------    0.35    0.78    0.67    0.77    0.55    0.43
tfw               17.44 -------   16.45   15.59   19.93   19.72   21.38   15.68
gjohn              2.81 -------    1.66    3.00    3.16    3.06    2.41    3.33
ljill             10.85 -------   18.09    7.23   13.28    9.24   14.76    6.67
kbill             11.10 -------    7.31   14.94    4.70   15.49    5.42   16.61
stevei             1.58 -------    1.41    1.34    2.09    0.75    3.30    2.15
gms                1.54 -------    1.15    1.74    1.63    1.40    1.38    0.90
patw               5.11 -------    5.22    5.11    4.85    5.20    5.28    5.78
wer                6.65 -------    5.04    7.03    7.52    6.80    6.43    2.83
anna               1.97 -------    2.29    1.68    2.27    1.80    2.37    2.17
susieb             5.69 -------    5.58    5.55    5.57    6.48    5.83    6.16
GROUP
-------------
dallas            13.25   15.00   14.61   12.41   13.19   13.29   15.37   15.09
sanjose*           8.86   15.00    6.54    9.55    9.81    8.97    8.35    4.16
seattle           10.05   15.00    9.66   10.23   10.37    9.15    9.94   10.54
austin*           30.26   15.00   29.10   30.95   30.89   28.45   29.53   29.54
boston*            3.44   15.00    3.93    2.78    4.36    3.11    3.94    4.25
orlando*          26.59   15.00   29.83   26.77   22.56   29.49   25.53   28.18
newyork*           7.54   15.00    6.33    7.31    8.83    7.54    7.34    8.24
ACCT
-------------
engineering       31.76   30.00   32.25   32.10   31.94   30.07   30.74   31.14
marketing          8.86    5.00    6.54    9.55    9.81    8.97    8.35    4.16
it                 9.12    5.00    7.74    8.65   10.92    8.29   10.64   10.40
development*      24.86   30.00   24.15   24.76   25.00   24.84   26.15   26.78
research          25.40   30.00   29.32   24.94   22.33   27.84   24.11   27.53
QOS
-------------
DEFAULT*           0.00   50.00 ------- ------- ------- ------- ------- -------
high*             83.69   90.00   86.76   83.20   81.71   84.35   83.19   88.02
urgent             0.00    5.00 ------- ------- ------- ------- ------- -------
low*              12.00    5.00    7.34   12.70   14.02   12.51   12.86    7.48
CLASS
-------------
batch*            51.69   70.00   53.87   52.01   50.80   50.38   48.67   52.65
long*             18.75   10.00   16.54   18.36   20.89   18.36   21.53   16.28
fast*             15.29   10.00   18.41   14.98   12.58   16.80   15.15   18.21
bigmem            14.27   10.00   11.17   14.65   15.73   14.46   14.65   12.87

An asterisk (*) next to a credential name indicates that that credential has
exceeded its fairshare target.
Example 3-14: Grouping User Output by Account
> mdiag -f -o acct --flags=relative
FairShare Information

Depth: 6 intervals   Interval Length: 00:20:00   Decay Rate: 0.50

FS Policy: DEDICATEDPES
System FS Settings:  Target Usage: 0.00

FSInterval        %     Target       0       1       2       3       4       5
FSWeight        ------- -------  1.0000  0.5000  0.2500  0.1250  0.0625  0.0312
TotalUsage       100.00 -------    23.8   476.1   478.9   478.5   475.5   482.8

ACCOUNT
-------------
dallas            13.12   15.00   15.42   12.41   13.19   13.29   15.37   15.09
  mattp           19.47 -------   15.00   21.66   16.75   19.93   17.26   19.95
  walex            9.93 -------   20.91    9.28    7.97   12.14    7.91   10.59
  stevei          12.19 -------    9.09   10.78   15.85    5.64   21.46   14.28
  anna            14.77 -------   16.36   13.54   17.18   13.55   15.44   14.37
  susieb          43.64 -------   38.64   44.74   42.25   48.74   37.92   40.81
sanjose*           9.26   15.00    8.69    9.55    9.81    8.97    8.35    4.16
  mjackson         7.71 -------    6.45    8.14    6.81    8.62    6.54   10.29
  gms             17.61 -------   21.77   18.25   16.57   15.58   16.51   21.74
  wer             74.68 -------   71.77   73.61   76.62   75.80   76.95   67.97
seattle           10.12   15.00   10.16   10.23   10.37    9.15    9.94   10.54
  tgh             49.56 -------   46.21   50.05   53.26   43.14   46.91   45.13
  patw            50.44 -------   53.79   49.95   46.74   56.86   53.09   54.87
austin*           30.23   15.00   25.58   30.95   30.89   28.45   29.53   29.54
  jsmith          42.44 -------   48.77   49.62   35.47   30.70   27.59   46.90
  tfw             57.56 -------   51.23   50.38   64.53   69.30   72.41   53.10
boston*            3.38   15.00    3.78    2.78    4.36    3.11    3.94    4.25
  kyliem         100.00 -------  100.00  100.00  100.00  100.00  100.00  100.00
orlando*          26.20   15.00   30.13   26.77   22.56   29.49   25.53   28.18
  poy             17.90 -------   16.28   17.22   20.30   16.15   20.98   17.39
  ljill           37.85 -------   58.60   26.99   58.87   31.33   57.79   23.67
  kbill           44.25 -------   25.12   55.79   20.83   52.52   21.23   58.94
newyork*           7.69   15.00    6.24    7.31    8.83    7.54    7.34    8.24
  jimf            61.42 -------   69.66   58.94   64.20   59.46   67.21   59.64
  gjohn           38.58 -------   30.34   41.06   35.80   40.54   32.79   40.36
Related Topics
Fairshare Overview
mdiag -g
Synopsis
mdiag -g [groupid]
Overview
The mdiag -g command is used to present information about groups.
mdiag -j
Synopsis
mdiag -j [jobid] [-t <partition>] [-v] [-w] [--flags=policy] [--xml] [--blocking]
Overview
The mdiag -j command provides detailed information about the state of jobs
Moab is currently tracking. This command also performs a large number of
sanity and state checks. The job configuration and status information, as well
as the results of the various checks, are presented by this command. The
command gathers information from the Moab cache which prevents it from
interrupting the scheduler, but the --blocking option can be used to bypass
the cache and interrupt the scheduler. If the -v (verbose) flag is specified,
additional information about less common job attributes is displayed. If
--flags=policy is specified, information about job templates is displayed.
If used with the -t <partition> option on a running job, the only thing mdiag -j
shows is if the job is running on the specified partition. If used on a job that is
not running, it shows if the job is able to run on the specified partition.
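For instance (the job and partition names here are illustrative):

> mdiag -j moab.1 -t partition1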
The -w flag enables you to specify specific job states (such as Running,
Completed, Idle, or ALL; see Job States for all valid options) or jobs
associated with a given credential (user, acct, class, group, qos). For example:
mdiag -j -w user=david           # Displays only David's jobs
mdiag -j -w state=Idle,Running   # Displays only idle or running jobs
The mdiag -j command does not show all subjobs of an array unless you use
mdiag -j --xml. In the XML, the master job element contains a child
element called ArraySubJobs that contains the subjobs in the array.
Using mdiag -j -v --xml shows the completed sub-jobs as well.
XML Output
If XML output is requested (via the --format=xml argument), XML based node
information will be written to STDOUT in the following format:
<Data>
<job ATTR="VALUE" ... > </job>
...
</Data>
For information about legal attributes, refer to the XML Attributes table.
To show jobs in XML, use mdiag -j --xml -w
[completed=true|system=true|ALL=true] to limit or filter jobs. This is
for XML use only.
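For example, to retrieve completed jobs in XML:

> mdiag -j --xml -w completed=true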
Related Topics
checkjob
mdiag
mdiag -n
Synopsis
mdiag -n [-t partitionid] [-A creds] [-w <CONSTRAINT>] [-v] [--format=xml]
[nodeid]
Overview
The mdiag -n command provides detailed information about the state of nodes
Moab is currently tracking. This command also performs a large number of
sanity and state checks. The node configuration and status information as well
as the results of the various checks are presented by this command.
Arguments
Flag  Argument                                        Description
[-A]  {user|group|account|qos|class|job}:<OBJECTID>   report if each node is accessible by requested job or credential
[-t]  <partitionid>                                   report only nodes from specified partition
[-v]  ---                                             show verbose output (do not truncate columns and add columns for additional node attributes)
[-w]  nodestate=drained                               display only nodes associated with the specified constraint: nodestate (See DISPLAYFLAGS for more information.)
Output
This command presents detailed node information in whitespace-delineated
fields.
The output of this command can be extensive and the values for a number of
fields may be truncated. If truncated, the -v flag can be used to display full
field content.
Column    Format
Name      <NODE NAME>
State     <NODE STATE>
Procs     <AVAILABLE PROCS>:<CONFIGURED PROCS>
Memory    <AVAILABLE MEMORY>:<CONFIGURED MEMORY>
Disk      <AVAILABLE DISK>:<CONFIGURED DISK>
Swap      <AVAILABLE SWAP>:<CONFIGURED SWAP>
Speed     <RELATIVE MACHINE SPEED>
Opsys     <NODE OPERATING SYSTEM>
Arch      <NODE HARDWARE ARCHITECTURE>
Par       <PARTITION NODE IS ASSIGNED TO>
Load      <CURRENT 1 MINUTE BSD LOAD>
Rsv       <NUMBER OF RESERVATIONS ON NODE>
Classes   <CLASS NAME>
Network   <NETWORK NAME>...
Features  <NODE FEATURE>...
Examples
Example 3-15:
> mdiag -n
compute node summary
Name         State   Procs   Memory         Opsys
opt-001      Busy    0:2     2048:2048      SUSE
opt-002      Busy    0:2     2048:2048      SUSE
opt-003      Busy    0:2     2048:2048      SUSE
opt-004      Busy    0:2     2048:2048      SUSE
opt-005      Busy    0:2     2048:2048      SUSE
opt-006      Busy    0:2     2048:2048      SUSE
WARNING:  swap is low on node opt-006
opt-007      Busy    0:2     2048:2048      SUSE
opt-008      Busy    0:2     2048:2048      SUSE
opt-009      Busy    0:2     2048:2048      SUSE
opt-010      Busy    0:2     2048:2048      SUSE
opt-011      Busy    0:2     2048:2048      SUSE
opt-012      Busy    0:2     2048:2048      SUSE
opt-013      Busy    0:2     2048:2048      SUSE
opt-014      Busy    0:2     2048:2048      SUSE
opt-015      Busy    0:2     2048:2048      SUSE
opt-016      Busy    0:2     2048:2048      SUSE
x86-001      Busy    0:1     512:512        Redhat
x86-002      Busy    0:1     512:512        Redhat
x86-003      Busy    0:1     512:512        Redhat
x86-004      Busy    0:1     512:512        Redhat
x86-005      Idle    1:1     512:512        Redhat
x86-006      Idle    1:1     512:512        Redhat
x86-007      Idle    1:1     512:512        Redhat
x86-008      Busy    0:1     512:512        Redhat
x86-009      Down    1:1     512:512        Redhat
x86-010      Busy    0:1     512:512        Redhat
x86-011      Busy    0:1     512:512        Redhat
x86-012      Busy    0:1     512:512        Redhat
x86-013      Busy    0:1     512:512        Redhat
x86-014      Busy    0:1     512:512        Redhat
x86-015      Busy    0:1     512:512        Redhat
x86-016      Busy    0:1     512:512        Redhat
P690-001     Busy    0:1     16384:16384    AIX
P690-002     Busy    0:1     16384:16384    AIX
P690-003     Busy    0:1     16384:16384    AIX
P690-004     Busy    0:1     16384:16384    AIX
P690-005     Busy    0:1     16384:16384    AIX
P690-006     Busy    0:1     16384:16384    AIX
P690-007     Idle    1:1     16384:16384    AIX
P690-008     Idle    1:1     16384:16384    AIX
WARNING:  node P690-008 is missing ethernet adapter
P690-009     Busy    0:1     16384:16384    AIX
P690-010     Busy    0:1     16384:16384    AIX
P690-011     Busy    0:1     16384:16384    AIX
P690-012     Busy    0:1     16384:16384    AIX
P690-013     Busy    0:1     16384:16384    AIX
P690-014     Busy    0:1     16384:16384    AIX
P690-015     Busy    0:1     16384:16384    AIX
P690-016     Busy    0:1     16384:16384    AIX
-----        ---     6:64    745472:745472  -----

Total Nodes: 36  (Active: 30  Idle: 5  Down: 1)
Warning messages are interspersed with the node configuration
information with all warnings preceded by the keyword WARNING.
XML Output
If XML output is requested (via the --format=xml argument), XML-based node information is written to STDOUT in the following format:
mdiag -n --format=xml
<Data>
<node> <ATTR>="<VAL>" ... </node>
...
</Data>
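As an illustrative sketch (attribute names are taken from the table below; the exact set reported depends on the node and its resource manager), a single idle node might appear as:

> mdiag -n --format=xml node01
<Data>
<node NODEID="node01" NODESTATE="Idle" ARCH="x86_64" OS="linux" RCPROC="4"></node>
</Data>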
XML Attributes
Name             Description
ACL              Node access control list
AGRES            Available generic resources
ALLOCRES         Special allocated resources (like VLANs)
ARCH             The node's processor architecture
AVLCLASS         Classes available on the node
AVLETIME         Time when the node will no longer be available (used in utility centers)
AVLSTIME         Time when the node will be available (used in utility centers)
CFGCLASS         Classes configured on the node
ENABLEPROFILING  If true, the node's state and usage is tracked over time
FEATURES         A list of comma-separated custom features describing the node
GEVENT           A user-defined event that allows Moab to perform some action
GMETRIC          A list of comma-separated consumable resources associated with the node
GRES             Generic resources on the node
HOPCOUNT         How many hops the node took to reach this Moab (used in hierarchical grids)
ISDELETED        Node has been deleted
ISDYNAMIC        Node is dynamic (used in utility centers)
JOBLIST          The list of jobs currently running on the node
LOAD             Current load as reported by the resource manager
LOADWEIGHT       Load weight used when calculating node priority
MAXJOB           See Node Policies for details
MAXJOBPERUSER    See Node Policies for details
MAXLOAD          See Node Policies for details
MAXPROC          See Node Policies for details
MAXPROCPERUSER   See Node Policies for details
NETWORK          The ability to specify which networks are available to a given node is limited to only a few resource managers. Using the NETWORK attribute, administrators can establish this node-to-network connection directly through the scheduler. The NODECFG parameter allows this list to be specified in a comma-delimited list.
NODEID           The unique identifier for the node
NODESTATE        The state of the node
OS               The node's operating system
OSLIST           Operating systems the node can run
OSMODACTION      URL for changing the operating system
OWNER            Credential type and name of owner
PARTITION        The partition the node belongs to (see Node Location for details)
POWER            The state of the node's power: either ON or OFF
PRIORITY         The fixed node priority relative to other nodes
PROCSPEED        The node's processor speed, specified in MHz
RACK             The rack associated with the node's physical location
RADISK           The total available disk on the node
RAMEM            The total available memory on the node
RAPROC           The total number of processors available on the node
RASWAP           The total available swap on the node
RCMEM            The total configured memory on the node
RCPROC           The total configured processors on the node
RCSWAP           The total configured swap on the node
RequestID        Dynamic node request ID set by the RM
RESCOUNT         Number of reservations on the node
RESOURCES        Deprecated (use GRES)
RSVLIST          List of reservations on the node
RMACCESSLIST     A comma-separated list of resource managers that have access to the node
SIZE             The number of slots or size units consumed by the node
SLOT             The first slot in the rack associated with the node's physical location
SPEED            The node's relative speed
SPEEDWEIGHT      Speed weight used to calculate node priority
STATACTIVETIME   Time the node was active
STATMODIFYTIME   Time the node's state was modified
STATTOTALTIME    Time the node has been monitored
STATUPTIME       Time the node has been up
TASKCOUNT        The number of tasks on the node
TTL              Dynamic node time-to-live set by the RM (expiration date in epoch format)
Related Topics
checknode
mdiag -t
Synopsis
mdiag -t [-v] [-v] [partitionid]
Overview
The mdiag -t command is used to present configuration, usage, health, and
diagnostic information about partitions maintained by Moab. The information
presented includes partition name, limits, configured and available resources,
allocation weights and policies.
Examples
Example 3-16: Standard partition diagnostics
> mdiag -t
Partition Status
...
mdiag -p
Synopsis
mdiag -p [-t partition] [-v]
Overview
The mdiag -p command is used to display, at a glance, information about the job priority configuration and its effects on the currently eligible jobs. The information presented by this command includes priority weights, priority components, and the percentage contribution of each component to the total job priority.
The command hides information about priority components which have been
deactivated (i.e. by setting the corresponding component priority weight to 0).
For each displayed priority component, this command gives a small amount of
context sensitive information. The following table documents this information.
In all cases, the output is of the form <PERCENT>(<CONTEXT INFO>) where
<PERCENT> is the percentage contribution of the associated priority
component to the job's total priority.
By default, this command only shows information for jobs which are
eligible for immediate execution. Jobs which violate soft or hard policies,
or have holds, job dependencies, or other job constraints in place will not
be displayed. If priority information is needed for any of these jobs, use
the -v flag or the checkjob command.
Format
Flag: -v (VERBOSE)
Format: ---
Default: ---
Description: Display verbose priority information. If specified, display priority breakdown information for blocked, eligible, and active jobs. By default, only information for eligible jobs is displayed. To view blocked jobs in addition to eligible, run mdiag -p -v -v.
Example: > mdiag -p -v (displays priority summary information for eligible and active jobs)
Output
Priority Component  Format                                    Description
Target              <PERCENT>()
QOS                 <PERCENT>(<QOS>:<QOSPRI>)                 QOS — QOS associated with job
                                                              QOSPRI — priority assigned to the QOS
FairShare           <PERCENT>(<USR>:<GRP>:<ACC>:<QOS>:<CLS>)  USR — user fs usage - user fs target
                                                              GRP — group fs usage - group fs target
                                                              ACC — account fs usage - account fs target
                                                              QOS — QOS fs usage - QOS fs target
                                                              CLS — class fs usage - class fs target
Service             <PERCENT>(<QT>:<XF>:<Byp>)                QT — job queue time applicable toward priority (in minutes)
                                                              XF — current theoretical minimum XFactor if the job were to start immediately
                                                              Byp — number of times the job was bypassed by lower-priority jobs via backfill
Resource            <PERCENT>(<NDE>:<PE>:<PRC>:<MEM>)         NDE — nodes requested by job
                                                              PE — processor equivalents calculated from all resources requested by job
                                                              PRC — processors requested by job
                                                              MEM — real memory requested by job
Examples
Example 3-17: mdiag -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred( QOS)  FS(Accnt)  Serv(QTime)
             Weights   --------      1(    1)    1(    1)     1(    1)

13678                      1321*    7.6(100.0)  0.2(  2.7)  92.2(1218.)
13698                       235*   42.6(100.0)  1.1(  2.7)  56.3(132.3)
13019                      8699     0.6( 50.0)  0.3( 25.4)  99.1(8674.)
13030                      8699     0.6( 50.0)  0.3( 25.4)  99.1(8674.)
13099                      8537     0.6( 50.0)  0.3( 25.4)  99.1(8512.)
13141                      8438     0.6( 50.0)  0.2( 17.6)  99.2(8370.)
13146                      8428     0.6( 50.0)  0.2( 17.6)  99.2(8360.)
13153                      8360     0.0(  1.0)  0.1( 11.6)  99.8(8347.)
13177                      8216     0.0(  1.0)  0.1( 11.6)  99.8(8203.)
13203                      8127     0.6( 50.0)  0.3( 25.4)  99.1(8102.)
13211                      8098     0.0(  1.0)  0.1( 11.6)  99.8(8085.)
...
13703                       137    36.6( 50.0) 12.8( 17.6)  50.6( 69.2)
13702                        79     1.3(  1.0)  5.7(  4.5)  93.0( 73.4)

Percent Contribution   --------    0.9(  0.9)  0.4(  0.4)  98.7( 98.7)

* indicates system prio set on job
The mdiag -p command only displays information for priority components actually utilized. In the above example, the QOS, Account Fairshare, and QueueTime components are utilized in determining a job's priority. Other components, such as Service Targets and Bypass, are not used and thus are not displayed. (See the Priority Overview for more information.) The output consists of a header, a job-by-job analysis, and a summary section.
The header provides column labels and the configured priority component and subcomponent weights. In the above example, QOSWEIGHT is set to 1000 and FSWEIGHT is set to 100. When configuring
fairshare, a site also has the option of weighting the individual components of a job's overall fairshare,
including its user, group, and account fairshare components. In this output, the QoS and account
fairshare weights are set to 1.
The job by job analysis displays a job's total priority and the percentage contribution to that priority of
each of the priority components. In this example, job 13019 has a total priority of 8699. Both QOS and
Fairshare contribute to the job's total priority although these factors are quite small, contributing 0.6%
and 0.3% respectively with the fairshare factor being contributed by an account fairshare target. For this
job, the dominant factor is the service subcomponent qtime which is contributing 99.1% of the total
priority since the job has been in the queue for approximately 8600 minutes.
At the end of the job by job description, a Totals line is displayed which documents the average
percentage contributions of each priority component to the current idle jobs. In this example, the QOS,
Fairshare, and Service components contributed an average of 0.9%, 0.4%, and 98.7% to the jobs' total
priorities.
Related Topics
Job Priority Overview
mdiag -q
Synopsis
mdiag -q [qosid]
Overview
The mdiag -q command is used to present information about each QOS
maintained by Moab. The information presented includes QOS name,
membership, scheduling priority, weights and flags.
Examples
Example 3-18: Standard QOS Diagnostics
> mdiag -q
QOS Status

System QOS Settings:  QList: DEFAULT (Def: DEFAULT)  Flags: 0

Name     * Priority QTWeight QTTarget XFWeight XFTarget     QFlags  JobFlags  Limits
DEFAULT         1        1        3        1     5.00   PREEMPTEE    [NONE]   [NONE]
  Accounts: it research
  Classes:  batch
[ALL]           0        0        0        0     0.00      [NONE]    [NONE]   [NONE]
high         1000        1        2        1    10.00   PREEMPTOR    [NONE]   [NONE]
  Accounts: engineering it development research
  Classes:  fast
urgent      10000        1        1        1     7.00   PREEMPTOR    [NONE]   [NONE]
  Accounts: engineering it development
low           100        1        5        1     1.00   PREEMPTEE    [NONE]   [NONE]
  Accounts: engineering marketing it development research
  Classes:  long bigmem
mdiag -r
Synopsis
mdiag -r [reservationid] [-v] [-w type=<type>]
Overview
The mdiag -r command allows administrators to look at detailed reservation information. It provides the name, type, partition, start time and end time, proc and node counts, and actual utilization figures. It also details which resources are being used, and how many nodes and how much memory, swap, and processor capacity are associated with each task. Administrators can also view the access control list for each reservation, as well as any flags that may be active on it. The command gathers information from the Moab cache, which prevents it from waiting for the scheduler, but the --blocking option can be used to bypass the cache and wait for the scheduler instead.
The -w flag filters the output according to the type of reservation. The allowable reservation types are Job and User; an example follows.
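For example, to list only user reservations (an illustrative invocation based on the synopsis above):

> mdiag -r -w type=User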
Examples
Example 3-19:
> mdiag -r
Diagnosing Reservations
RsvID               Type  Par    StartTime      EndTime     Duration  Node  Task  Proc
--------            ----  ---    ---------      -------     --------  ----  ----  ----
engineer.0.1        User    A     -6:29:00     INFINITY     INFINITY     0     0     7
    Flags: STANDINGRSV IGNSTATE OWNERPREEMPT
    ACL:   CLASS==batch+:==long+:==fast+:==bigmem+ QOS==low-:==high+ JATTR==PREEMPTEE+
    CL:    RSV==engineer.0.1
    Task Resources: PROCS: [ALL]
    Attributes (HostExp='fr10n01 fr10n03 fr10n05 fr10n07 fr10n09 fr10n11 fr10n13 fr10n15')
    Active PH: 43.77/45.44 (96.31%)
    SRAttributes (TaskCount: 0  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
research.0.2        User    A     -6:29:00     INFINITY     INFINITY     0     0     8
    Flags: STANDINGRSV IGNSTATE OWNERPREEMPT
    ACL:   CLASS==batch+:==long+:==fast+:==bigmem+ QOS==high+:==low- JATTR==PREEMPTEE+
    CL:    RSV==research.0.2
    Task Resources: PROCS: [ALL]
    Attributes (HostExp='fr3n01 fr3n03 fr3n05 fr3n07 fr3n07 fr3n09 fr3n11 fr3n13 fr3n15')
    Active PH: 51.60/51.93 (99.36%)
    SRAttributes (TaskCount: 0  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
fast.0.3            User    A     00:14:05      5:14:05      5:00:00     0     0    16
    Flags: STANDINGRSV IGNSTATE OWNERPREEMPT
    ACL:   CLASS==fast+ QOS==high+:==low+:==urgent+:==DEFAULT+ JATTR==PREEMPTEE+
    CL:    RSV==fast.0.3
    Task Resources: PROCS: [ALL]
    Attributes (HostExp='fr12n01 fr12n02 fr12n03 fr12n04 fr12n05 fr12n06 fr12n07 fr12n08 fr12n09 fr12n10 fr12n11 fr12n12 fr12n13 fr12n14 fr12n15 fr12n16')
    SRAttributes (TaskCount: 0  StartTime: 00:00:00  EndTime: 5:00:00  Days: Mon,Tue,Wed,Thu,Fri)
fast.1.4            User    A   1:00:14:05   1:05:14:05      5:00:00     0     0    16
    Flags: STANDINGRSV IGNSTATE OWNERPREEMPT
    ACL:   CLASS==fast+ QOS==high+:==low+:==urgent+:==DEFAULT+ JATTR==PREEMPTEE+
    CL:    RSV==fast.1.4
    Task Resources: PROCS: [ALL]
    Attributes (HostExp='fr12n01 fr12n02 fr12n03 fr12n04 fr12n05 fr12n06 fr12n07 fr12n08 fr12n09 fr12n10 fr12n11 fr12n12 fr12n13 fr12n14 fr12n15 fr12n16')
    SRAttributes (TaskCount: 0  StartTime: 00:00:00  EndTime: 5:00:00  Days: Mon,Tue,Wed,Thu,Fri)
job2411             Job     A    -00:01:00     00:06:30     00:07:30     0     0     6
    ACL:   JOB==job2411=
    CL:    JOB==job2411 USER==jimf GROUP==newyork ACCT==it CLASS==bigmem QOS==low JATTR==PREEMPTEE DURATION==00:07:30 PROC==6 PS==2700
job1292             Job     A     00:00:00     00:07:30     00:07:30     0     0     4
    ACL:   JOB==job1292=
    CL:    JOB==job1292 USER==jimf GROUP==newyork ACCT==it CLASS==batch QOS==DEFAULT JATTR==PREEMPTEE DURATION==00:07:30 PROC==4 PS==1800
Example 3-20:
With the -v option, a nodes line is included for each reservation and shows how
many nodes are in the reservation as well as how many tasks are on each
node.
> mdiag -r -v
Diagnosing Reservations
RsvID               Type  Par    StartTime      EndTime     Duration  Node  Task  Proc
--------            ----  ---    ---------      -------     --------  ----  ----  ----
Moab.6              Job     B    -00:01:05     00:00:35     00:01:40     1     1     1
    Flags: ISACTIVE
    ACL:   JOB==Moab.6=
    CL:    JOB==Moab.6 USER==tuser1 GROUP==tgroup1 CLASS==fast QOS==starter JPRIORITY<=0 DURATION==00:01:40 PROC==1 PS==100
    SubType: JobReservation
    Nodes='node002:1'
    Rsv-Group: Moab.6
Moab.4              Job     B    -00:01:05     00:00:35     00:01:40     1     1     1
    Flags: ISACTIVE
    ACL:   JOB==Moab.4=
    CL:    JOB==Moab.4 USER==tuser1 GROUP==tgroup1 CLASS==batch QOS==starter JPRIORITY<=0 DURATION==00:01:40 PROC==1 PS==100
    SubType: JobReservation
    Nodes='node002:1'
    Rsv-Group: Moab.4
Moab.5              Job     A    -00:01:05     00:00:35     00:01:40     3     6     6
    Flags: ISACTIVE
    ACL:   JOB==Moab.5=
    CL:    JOB==Moab.5 USER==tuser1 GROUP==tgroup1 ACCT==marketing CLASS==long QOS==low JPRIORITY<=0 DURATION==00:01:40 PROC==6 PS==600
    Task Resources: PROCS: [ALL]
    SubType: JobReservation
    Nodes='node008:1,node007:1,node006:1'
    Rsv-Group: Moab.5
Moab.7              Job     A    -00:01:04     00:00:36     00:01:40     1     1     1
    Flags: ISACTIVE
    ACL:   JOB==Moab.7=
    CL:    JOB==Moab.7 USER==tuser1 GROUP==tgroup1 CLASS==bigmem QOS==starter JPRIORITY<=0 DURATION==00:01:40 PROC==1 PS==100
    SubType: JobReservation
    Nodes='node005:1'
    Rsv-Group: Moab.7
Moab.2              Job     A    -00:01:07      3:58:53      4:00:00     1     2     2
    Flags: ISACTIVE
    ACL:   JOB==Moab.2=
    CL:    JOB==Moab.2 USER==tuser1 GROUP==tgroup1 QOS==starter JPRIORITY<=0 DURATION==4:00:00 PROC==2 PS==28800
    SubType: JobReservation
    Nodes='node009:1'
    Rsv-Group: Moab.2
Moab.8              Job     A     3:58:53      7:58:53      4:00:00     8    16    16
    Flags: PREEMPTEE
    ACL:   JOB==Moab.8=
    CL:    JOB==Moab.8 USER==tuser1 GROUP==tgroup1 ACCT==development CLASS==bigmem QOS==starter JPRIORITY<=0 DURATION==4:00:00 PROC==16 PS==230400
    SubType: JobReservation
    Nodes='node009:1,node008:1,node007:1,node006:1,node005:1,node004:1,node003:1,node001:1'
    Attributes (Priority=148)
    Rsv-Group: idle
system.3            User  bas    -00:01:08     INFINITY     INFINITY     1     1     2
    Flags: ISCLOSED,ISACTIVE
    ACL:   RSV==system.3=
    CL:    RSV==system.3
    Accounting Creds:  User:root
    Task Resources: PROCS: [ALL]
    SubType: Other
    Nodes='node254:1'
    Attributes (HostExp='node254')
    Active PH: 0.00/0.01 (0.00%)
    History:  1322773208:PROCS=2
system.2            User  bas    -00:01:08     INFINITY     INFINITY     1     1     2
    Flags: ISCLOSED,ISACTIVE
    ACL:   RSV==system.2=
    CL:    RSV==system.2
    Accounting Creds:  User:root
    Task Resources: PROCS: [ALL]
    SubType: Other
    Nodes='node255:1'
    Attributes (HostExp='node255')
    Active PH: 0.00/0.01 (0.00%)
    History:  1322773208:PROCS=2
system.1            User  bas    -00:01:08     INFINITY     INFINITY     1     1     2
    Flags: ISCLOSED,ISACTIVE
    ACL:   RSV==system.1=
    CL:    RSV==system.1
    Accounting Creds:  User:root
    Task Resources: PROCS: [ALL]
    SubType: Other
    Nodes='node256:1'
    Attributes (HostExp='node256')
    Active PH: 0.00/0.01 (0.00%)
    History:  1322773208:PROCS=2
mdiag -R
Synopsis
mdiag -R [-v] [resourcemanagerid]
Overview
The mdiag -R command is used to present information about configured
resource managers. The information presented includes name, host, port,
state, type, performance statistics and failure notifications.
Examples
Example 3-21:
> mdiag -R -v
diagnosing resource managers

RM[internal]  State: ---     Type: SSS  ResourceType: COMPUTE
  Max Fail/Iteration:   0
  JobCounter:           6
  Partition:            SHARED
  RM Performance:       AvgTime=0.00s  MaxTime=0.00s  (55353 samples)
  RM Languages:         -
  RM Sub-Languages:     -

RM[torque]    State: Active  Type: PBS  ResourceType: COMPUTE
  Timeout:              30000.00 ms
  Version:              '4.2.4'
  Job Submit URL:       exec:///opt/torque-4.2/bin/qsub
  Objects Reported:     Nodes=1 (12 procs)  Jobs=1
  Nodes Reported:       1 (N/A)
  Flags:                executionServer
  Partition:            torque
  Event Management:     EPORT=15004  (last event: 00:03:07)
  NOTE:  SSS protocol enabled
  Submit Command:       /opt/torque-4.2/bin/qsub
  DefaultClass:         batch
  Total Jobs Started:   1
  RM Performance:       AvgTime=0.00s  MaxTime=35.00s  (220097 samples)
  RM Languages:         PBS
  RM Sub-Languages:     PBS

RM[torque] Failures:
  clusterquery  (683 of 55349 failed)
   -12days  'cannot connect to PBS server '' (pbs_errno=15033, 'Batch protocol error')'
  NOTE:  use 'mrmctl -f messages <RMID>' to clear stats/failures

RM[FLEXlm]    State: Active  Type: NATIVE  ResourceType: LICENSE
  Timeout:              30000.00 ms
  Cluster Query URL:    exec://$TOOLSDIR/flexlm/license.mon.flexLM.pl
  Licenses Reported:    6 types (250 of 282 available)
  Partition:            SHARED
  License Stats:        Avg License Avail: 239.01  (978 iterations)
  Iteration Summary:    Idle: 396.42  Active: 150.92  Busy: -447.34
  License biocol        50 of 50 available    (Idle: 100.00%  Active: 0.00%)
  License cloudform     100 of 100 available  (Idle: 100.00%  Active: 0.00%)
  License mathworks     8 of 25 available     (Idle: 52.00%   Active: 48.00%)
  License verity        25 of 25 available    (Idle: 100.00%  Active: 0.00%)
  Event Management:     (event interface disabled)
  RM Performance:       AvgTime=0.00s  MaxTime=0.61s  (1307618 samples)
    clusterquery:       AvgTime=0.02s  MaxTime=0.61s  (9465 samples)
    queuequery:         AvgTime=0.00s  MaxTime=0.00s  (1 samples)
    rminitialize:       AvgTime=0.00s  MaxTime=0.00s  (1 samples)
    getdata:            AvgTime=0.17s  MaxTime=0.60s  (978 samples)
  RM Languages:         NATIVE
  RM Sub-Languages:     NATIVE

AM[mam]  Type: MAM  State: 'Active'
  Host:                     localhost
  Port:                     7112
  Timeout:                  15
  Thread Pool Size:         2
  Charge Policy:            DEBITALLWC
  Validate Job Submission:  TRUE
  Create Failure Action:    CANCEL,HOLD
  Start Failure Action:     CANCEL,HOLD

AM[mam] Failures:
  Fri Jun 21 14:32:45  Create  'Failure registering job Create (1) with accounting manager -- server rejected request with status code 740 - Insufficient funds: There are no valid allocations to satisfy the quote'
mdiag -S
Synopsis
mdiag -S [-v] [-v]
Overview
The mdiag -S command is used to present information about the status of the
scheduler and grid interface.
This command will report on the following aspects of scheduling:
- General Scheduler Configuration
  - Reports short- and long-term scheduler load
  - Reports detected overflows of node, job, reservation, partition, and other scheduler object tables
- High Availability
  - Configuration
  - Reports health of HA primary
  - Reports health of HA backup
- Scheduling Status
  - Reports if scheduling is paused
  - Reports if scheduling is stopped
- System Reservation Status
  - Reports if global system reservation is active
- Message Profiling/Statistics Status
- Moab scheduling activities (only with mdiag -S -v -v)
  - Activity[JobStart]: Time Moab spends telling the RM to start a job and waiting for a response.
  - Activity[RMResourceLoad]: Time Moab spends querying license managers and nodes.
  - Activity[RMWorkloadLoad]: Time Moab spends querying resource managers about jobs (as opposed to nodes).
  - Activity[Schedule]: Time Moab spends prioritizing jobs and scheduling them onto nodes.
  - Activity[UIProcess]: Time Moab spends handling client commands.
Examples
Example 3-22:
> mdiag -S
Moab Server running on orion-1:43225  (Mode: NORMAL)
Load(5m)  Sched: 12.27%  RMAction: 1.16%  RMQuery: 75.30%  User: 0.29%  Idle: 10.98%
Load(24h) Sched: 10.14%  RMAction: 0.93%  RMQuery: 74.02%  User: 0.11%  Idle: 13.80%
HA Fallback Server:  orion-2:43225  (Fallback is Ready)
Note:  system reservation blocking all nodes
Message:  profiling enabled (531 of 600 samples/5:00 interval)
mdiag -s
Synopsis
mdiag -s [reservationid] [-v]
Overview
The mdiag -s command allows administrators to look at detailed standing
reservation information. It provides the name, type, partition, starttime and
endtime, period, task count, host list, and a list of child instances.
Examples
Example 3-23:
> mdiag -s
standing reservation overview
RsvID       Type  Par   StartTime  EndTime   Duration  Period
--------    ----  ---   ---------  -------   --------  ------
TestSR      User  ALL    00:00:00  00:00:00  00:00:00     DAY
  Days:      ---
  Depth:     2
  RsvList:   testSR.1,testSR.2,testSR.3
  HostExp:   'node1,node2,node4,node8'
test2       User  ALL    00:00:00  00:00:00  00:00:00     DAY
  Days:      ---
  TaskCount: 4
  Depth:     1
  RsvList:   test2.4,test2.5
mdiag -T
Synopsis
mdiag -T [triggerid] [-v] [--blocking]
Overview
The mdiag -T command is used to present information about each trigger. The information presented includes TrigID, Object ID, Event (EType), TType, AType, ActionDate, and State. The command gathers information from the Moab cache, which prevents it from waiting for the scheduler, but the --blocking option can be used to bypass the cache and wait for the scheduler instead.
Examples
Example 3-24:
> mdiag -T
TrigID                Object ID            Event    TType    AType  ActionDate      State
--------------------- -------------------- -------- -------- -----  --------------  ----------
sched_trig.0          sched:Moab           end      generic  exec                   Blocked
3                     node:node010         threshol generic  exec                   Blocked
5                     job:Moab.7           preempt  generic  exec                   Blocked
6                     job:Moab.8           preempt  generic  exec                   Blocked
7                     qos:HIGH             threshol elastic  exec                   Blocked
4*                    job:Moab.5           start    generic  exec   0:00:36         Failure
* indicates trigger has completed
Example 3-25:
> mdiag -T -v
TrigID                Object ID            Event    TType    AType  ActionDate           State
--------------------- -------------------- -------- -------- -----  -------------------  ----------
sched_trig.0          sched:Moab           end      generic  exec                        Blocked
  Name:          sched_trig
  Flags:         globaltrig
  BlockUntil:    INFINITY  ActiveTime: ---
  Action Data:   date
  NOTE:  trigger can launch

3                     node:node010         threshol generic  exec                        Blocked
  Flags:         globaltrig
  BlockUntil:    INFINITY  ActiveTime: ---
  Threshold:     CPULoad > 3.00 (current value: 0.00)
  Action Data:   date
  NOTE:  trigger cannot launch - threshold not satisfied - threshold type not supported

5                     job:Moab.7           preempt  generic  exec                        Blocked
  Flags:         user,globaltrig
  BlockUntil:    INFINITY  ActiveTime: ---
  Action Data:   $HOME/tools/preemptnotify.pl $OID $OWNER $HOSTNAME

6                     job:Moab.8           preempt  generic  exec                        Blocked
  Flags:         user,globaltrig
  BlockUntil:    INFINITY  ActiveTime: ---
  Action Data:   $HOME/tools/preemptnotify.pl $OID $OWNER $HOSTNAME
  NOTE:  trigger cannot launch - parent job Moab.8 is in state Idle

7                     qos:HIGH             threshol elastic  exec                        Blocked
  Flags:         multifire,globaltrig
  BlockUntil:    INFINITY  ActiveTime: ---
  Timeout:       00:05:00
  Threshold:     BacklogCompletionTime > 500.00 (current value: 0.00)
  Trigger Type:  elastic
  RearmTime:     00:00:10
  Action Data:   $HOME/geometry.pl $REQUESTGEOMETRY
  NOTE:  trigger cannot launch - threshold not satisfied - requires usage 0.000000 > 500.000000

4*                    job:Moab.5           start    generic  exec   Mon Jan 16 12:33:00  Failure
  Launch Time:   -00:02:17
  Flags:         globaltrig
  Last Execution State: Failure (ExitCode: 0)
  BlockUntil:    00:00:00  ActiveTime: 00:00:00
  Action Data:   $HOME/tools/preemptnotify.pl $OID $OWNER $HOSTNAME
  ALERT:  trigger failure detected
  Message:  'exec '/usr/test/moab/tools/preemptnotify.pl' cannot be located or is not executable'

* indicates trigger has completed
mdiag -u
Synopsis
mdiag -u [userid]
Overview
The mdiag -u command is used to present information about user records maintained by Moab. The information presented includes user name, UID, scheduling priority, default job flags, default QOS level, list of accessible QOS levels, and list of accessible partitions.
Examples
Example 3-26:
> mdiag -u
evaluating user information
Name      Priority  Flags   QDef    QOSList*  PartitionList  Target  Limits
jvella           0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
  ALIST=Engineering
  Message:  profiling enabled (597 of 3000 samples/00:15:00 interval)
[NONE]           0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
reynolds         0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
  ALIST=Administration
  Message:  profiling enabled (597 of 3000 samples/00:15:00 interval)
mshaw            0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
  ALIST=Test
  Message:  profiling enabled (584 of 3000 samples/00:15:00 interval)
kforbes          0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
  ALIST=Shared
  Message:  profiling enabled (597 of 3000 samples/00:15:00 interval)
gastor           0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
  ALIST=Engineering
  Message:  profiling enabled (597 of 3000 samples/00:15:00 interval)
Note that only users who have jobs that are currently queued, or that have been queued since Moab was most recently started, are listed.
Related Topics
showstats command (display user statistics)
mjobctl
Synopsis
mjobctl -c jobexp
mjobctl -c -w attr=val
mjobctl -C jobexp
mjobctl -e jobid
mjobctl -F jobexp
mjobctl -h [User|System|Batch|Defer|All] jobexp
mjobctl -m attr{+=|=|-=}val jobexp
mjobctl -N [<SIGNO>] jobexp
mjobctl -n <JOBNAME>
mjobctl -p <PRIORITY> jobexp
mjobctl -q {diag|starttime|hostlist} jobexp
mjobctl -r jobexp
mjobctl -R jobexp
mjobctl -s jobexp
mjobctl -u jobexp
mjobctl -w attr{+=|=|-=}val jobexp
mjobctl -x [-w flags=val] jobexp
Overview
The mjobctl command controls various aspects of jobs. It is used to submit,
cancel, execute, and checkpoint jobs. It can also display diagnostic information
about each job. The mjobctl command enables the Moab administrator to
control almost all aspects of job behavior. See General Job Administration for
more details on jobs and their attributes.
Format
-c - Cancel
Format
JOBEXP
Description
Cancel a job.
Use -w (following a -c flag) to specify job cancellation according to given credentials or job
attributes. See -c -w for more information.
You can use mjobctl -c flags=follow-dependency <job_id> to cancel all jobs that the
<job_id> depends on.
If you wish to cancel all jobs that depend on this <job_id>, add FLAGS=CANCELFAILEDDEPENDENCYJOBS to your SCHEDCFG entry in the moab.cfg file. See CANCELFAILEDDEPENDENCYJOBS on page 1493 for more information.
Example:
> mjobctl -c job1045
Cancel job job1045.
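Relatedly, enabling cancellation of jobs that depend on a failed job is done in the scheduler configuration rather than on the command line; a minimal sketch of the moab.cfg entry might look like the following (the scheduler name Moab is a placeholder):

SCHEDCFG[Moab] FLAGS=CANCELFAILEDDEPENDENCYJOBS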
-c -w - Cancel Where
Format
<ATTR>=<VALUE>
where <ATTR> is one of [ user | account | qos | class | reqreservation (RsvName) | state (JobState) | jobname (JobName, not job ID) | partition ]
Description
Cancel a job based on a given credential or job attribute.
Use -w following a -c flag to specify job cancellation according to credentials or job attributes. (See
examples.)
See Job States for a list of all valid job states.
Also, you can cancel jobs from given partitions using -w partition=<PAR1>[,<PAR2>...]; however, you must also either use another -w flag to specify a job or use the standard job expression.
Example
> mjobctl -c -w state=USERHOLD
Cancels all jobs that currently have a USERHOLD on them.
> mjobctl -c -w user=user1 -w acct=acct1
Cancels all jobs assigned to user1 or acct1.
-C - Checkpoint
Format
JOBEXP
Description
Checkpoint a job. See Checkpoint/Restart Facilities for more information.
Example
> mjobctl -C job1045
Checkpoint job job1045.
-e - Rerun
Format
JOBID
Description
Rerun the completed Torque job. This works only for jobs that are completed and show up in
Torque as completed. This flag does not work with other resource managers.
Example
> mjobctl -e job1045
Rerun job job1045.
-F - Force Cancel
Format
JOBEXP
Description
Forces a job to cancel and ignores previous cancellation attempts.
Example
> mjobctl -F job1045
Force cancel job job1045.
-h - Hold
Format
<HOLDTYPE> <JOBEXP>
<HOLDTYPE> = { user | batch | system | defer | ALL }
Default
user
Description
Set or release a job hold
See Job Holds for more information
Example
> mjobctl -h user job1045
Set a user hold on job job1045.
> mjobctl -u all job1045
Unset all holds on job job1045.
-m - Modify
Format
<ATTR>{+=|=|-=}<VAL>
When using mjobctl -m with the hostlist attribute, only "=" is supported.
If using Torque and mjobctl -m with the partition attribute, only "=" is supported. "+=", "=", and "-=" are supported with other resource managers (SLURM or Native).
<ATTR>={ account | advres | arraylimit | awduration | class | cpuclock | deadline | depend | eeduration | env | features | feature | flags | gres | group | hold | hostlist | jobdisk | jobmem | jobname | jobswap | loglevel | maxmem | messages | minstarttime | nodeaccess | nodecount | notificationaddress | partition | priority | queue | qos | reqreservation | rmxstring | reqattr | reqawduration | sysprio | tpn | trig | trigvar | user | userprio | var | wclimit }
Description
Modify a specific job attribute.
If an mjobctl -m attribute can affect how a job starts, then it generally cannot affect a job
that is already running. For example, it is not feasible to change the hostlist of a job that is
already running.
The userprio attribute allows you to specify user priority. For job priority, use the '-p' flag.
Modification of the job dependency is also communicated to the resource manager in the case of
SLURM and PBS/Torque.
Adding --flags=warnifcompleted causes a warning message to print when a job completes.
To define values for awduration, eeduration, minstarttime (Note that the minstarttime
attribute performs the same function as msub -a.), reqawduration, and wclimit, use the time
spec format.
A non-active job's partition list can be modified. If using Torque, only "=" (set) is supported. If
using SLURM or a Native resource manager you can add or subtract partitions, even multiple
partitions. When adding or subtracting multiple partitions, each partition must have its own -m
partition{+= | = | -=}name on the command line. An example for adding multiple
partitions is provided in the list of examples.
To modify a job's generic resources, use the following format: gres{ += | = | -= }
<gresName>[:<count>]. <gresName> is a single resource, not a list. <count> is an integer that, if
not specified, is assumed to be 1. Modifying a job's generic resources causes Moab to append the
new gres (+=), subtract the specified gres (-=), or clear out all existing generic resources attached
to the job and override them with the newly-specified one (=). If <gresName> is an empty string,
all generic resources will be removed from the job.
To modify the node access policy for a queued job, use nodeaccess=[<policy>]. See Node Access Policies on page 341 for a list of supported node access policies.
Example
> mjobctl -m reqawduration+=600 1664
Add 10 minutes to the job walltime.
> mjobctl -m eeduration=-1 1664
Reset job's effective queue time, to when the job was submitted.
> mjobctl -m var=Flag1=TRUE 1664
Set the job variable Flag1 to TRUE.
> mjobctl -m notificationaddress="name@server.com"
Sets the notification e-mail address associated with a job to name@server.com.
> mjobctl -m partition+=p3 -m partition+=p4 Moab.5
Adds multiple partitions (p3 and p4) to job Moab.5.
Torque only supports "=". "+=", "-=", and "=" are supported with other resource managers (SLURM or Native).
> mjobctl -m arraylimit=10 sim.25
Changes the concurrently running sub-job limit to 10 for array sim.25.
> mjobctl -m gres=matlab:1 job0201
Overrides all generic resources applied to job job0201 and replaces them with 1
matlab.
> mjobctl -m user=user.job
Modifies the user of a job that was submitted directly to moab (msub) and has not yet
been migrated.
> mjobctl -m userprio-=100 Moab.4
Reduces the user priority of Moab.4 by 100.
> mjobctl -m tpn=2 Moab.128
Changes the requested "tasks per node" for job Moab.128 to 2.
> mjobctl -m maxmem=80mb 157
Modifies the total job memory of job 157. See MAXMEM on page 634 for more
information.
-N - Notify
Format
[signal=]<SIGID> JOBEXP
Description
Send a signal to all jobs matching the job expression.
Example
> mjobctl -N INT 1664
Send an interrupt signal to job 1664.
> mjobctl -N 47 1664
Send signal 47 to job 1664.
-n - Name
Format
Description
Select jobs by job name.
Example
-p - Priority
Format
[+|+=|-=]<VAL> <JOBID> [--flags=relative]
Description
Modify a job's system priority.
Example
Priority is the job priority plus the system priority. Each format affects the job and system priorities differently. Using the format <VAL> <JOBID> or +<VAL> <JOBID> sets the system priority to the maximum system priority plus the specified value. Using +=<VAL> <JOBID> or <VAL> <JOBID> --flags=relative relatively increases the job's priority and sets the system priority. Using the format -=<VAL> <JOBID> sets the system priority to 0 and does not change priority based on <VAL> (it will not decrease priority by that number).
For the following example, job1045 has a priority of 10, which is composed of a job priority of 10
and a system priority of 0.
> mjobctl -p +1000 job1045
The system priority changes to the max system priority plus 1000 points, ensuring that
this job will be higher priority than all normal jobs. In this case, the job priority of 10 is
not added, so the priority of job1045 is now 1000001000.
> mjobctl -p -=1 job1045
The system priority of job1045 resets to 0. The job priority is still 10, so the overall
priority becomes 10.
> mjobctl -p 3 job1045 --flags=relative
Adds 3 points to the relative system priority. The priority for job1045 changes from 10
to 13.
-q - Query
Format
[ diag (ALL) | hostlist | starttime | template ] <JOBEXP>
Description
Query a job.
Example
> mjobctl -q diag job1045
Query job job1045.
> mjobctl -q diag ALL --format=xml
Query all jobs and return the output in machine-readable XML.
> mjobctl -q starttime job1045
Query starttime of job job1045.
> mjobctl -q template <job>
Query job templates. If the <job> is set to ALL or empty, it will return information for all
job templates.
> mjobctl -q wiki <jobName>
Query a job with the output displayed in a WIKI string. The job's name may be replaced
with ALL.
--flags=completed will only work with the diag option.
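For instance, to include completed jobs in a diagnostic query (an illustrative combination of the options documented above):

> mjobctl -q diag ALL --flags=completed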
-r - Resume
Format
JOBEXP
Description
Resume a job.
Example
> mjobctl -r job1045
Resume job job1045.
-R - Requeue
Format
JOBEXP
Description
Requeue a job. Adding --flags=unmigrate causes Moab to pull a grid job back to the central scheduler for further evaluation on all valid partitions.
Example
> mjobctl -R job1045
Requeue job job1045.
-s - Suspend
Format
JOBEXP
Description
Suspend a job. For more information, see Suspend/Resume Handling.
Example
> mjobctl -s job1045
Suspend job job1045.
-u - Unhold
Format
[<TYPE>[,<TYPE>]] JOBEXP
<TYPE> = [ user | system | batch | defer | ALL ]
Default
ALL
Description
Release a hold on a job
See Job Holds for more information.
Example
> mjobctl -u user,system scrib.1045
Release user and system holds on job
scrib.1045.
-w - Where
Format
[CompletionTime | StartTime][<= | = | >=]<EPOCH_TIME>
Description
Add a where constraint clause to the current command. As it pertains to CompletionTime | StartTime, the where constraint only works for completed jobs. CompletionTime filters according to the completed jobs' completion times; StartTime filters according to the completed jobs' start times.
Example
> mjobctl -q diag ALL --flags=COMPLETED --format=xml -w CompletionTime>=1246428000 -w CompletionTime<=1254376800
Prints all completed jobs still in memory that completed between July 1, 2009 and October 1, 2009.
-x - Execute
Format
JOBEXP
Description
Execute a job. The -w option allows flags to be set for the job. Allowable flags are ignorepolicies, ignorenodestate, and ignorersv.
Example
> mjobctl -x job1045
Execute job job1045.
> mjobctl -x -w flags=ignorepolicies job1046
Execute job job1046 and ignore policies, such as MaxJobPerUser.
Parameters
JOB EXPRESSION
Format
<STRING>
Description
The name of a job or a regular expression for several jobs. The flags that support job expressions
can use node expression syntax as described in Node Selection. Using x: indicates the following
string is to be interpreted as a regular expression, and using r: indicates the following string is to
be interpreted as a range. Job expressions do not work for array sub-jobs.
Moab uses regular expressions conforming to the POSIX 1003.2 standard. This standard is
somewhat different than the regular expressions commonly used for filename matching in
Unix environments (see man 7 regex). To interpret a job expression as a regular expression,
use x:.
In most cases, it is necessary to quote the job expression (for example, job13[5-9]) to
prevent the shell from intercepting and interpreting the special characters.
The mjobctl command accepts a comma delimited list of job expressions. Example usage
might be mjobctl -r job[1-2],job4 or mjobctl -c job1,job2,job4.
Example:
> mjobctl -c "x:80.*"
job '802' cancelled
job '803' cancelled
job '804' cancelled
job '805' cancelled
job '806' cancelled
job '807' cancelled
job '808' cancelled
job '809' cancelled

Cancel all jobs starting with 80.
> mjobctl -m priority+=200 "x:74[3-5]"
job '743' system priority modified
job '744' system priority modified
job '745' system priority modified
> mjobctl -h x:17.*
# This puts a hold on any job that has a 17 followed by any number of characters; it includes jobs 1701, 17mjk10, and 17DjN_JW-07.

> mjobctl -h r:1-17
# This puts a hold on jobs 1 through 17.
XML Output
mjobctl information can be reported as XML as well. This is done with the
command mjobctl -q diag <JOB_ID>.
XML Attributes
Name              Description
Account           The account assigned to the job
AllocNodeList     The nodes allocated to the job
Args              The job's executable arguments
AWDuration        The active wall time consumed
BlockReason       The block message index for the reason the job is not eligible
Bypass            Number of times the job has been bypassed by other jobs
Calendar          The job's timeframe constraint calendar
Class             The class assigned to the job
CmdFile           The command file path
CompletionCode    The return code of the job as extracted from the RM
CompletionTime    The time of the job's completion
Cost              The cost of executing the job relative to an accounting manager
CPULimit          The CPU limit for the job
Depend            Any dependencies on the status of other jobs
DRM               The master destination RM
DRMJID            The master destination RM job ID
EEDuration        The duration of time the job has been eligible for scheduling
EFile             The stderr file
Env               The job's environment variables set for execution
EnvOverride       The job's overriding environment variables set for execution
EState            The expected state of the job
EstHistStartTime  The estimated historical start time
EstPrioStartTime  The estimated priority start time
EstRsvStartTime   The estimated reservation start time
ExcHList          The excluded host list
Flags             Comma-delimited list of Moab flags on the job
GAttr             The requested generic attributes
GJID              The global job ID
Group             The group assigned to the job
Hold              The hold list
Holdtime          The time the job was put on hold
HopCount          The hop count between the job's peers
HostList          The requested host list
IFlags            The internal flags for the job
IsInteractive     If set, the job is interactive
IsRestartable     If set, the job is restartable
IsSuspendable     If set, the job is suspendable
IWD               The directory where the job is executed
JobID             The job's batch ID
JobName           The user-specified name for the job
JobGroup          The job ID relative to its group
LogLevel          The individual log level for the job
MasterHost        The specified host to run primary tasks on
Messages          Any messages reported by Moab regarding the job
MinPreemptTime    The minimum amount of time the job must run before being eligible for preemption
Notification      Any events generated to notify the job's user
OFile             The stdout file
OldMessages       Any messages reported by Moab in the old message style regarding the job
OWCLimit          The original wallclock limit
PAL               The partition access list relative to the job
QueueStatus       The job's queue status as generated this iteration
QOS               The QoS assigned to the job
QOSReq            The requested QoS for the job
ReqAWDuration     The requested active walltime duration
ReqCMaxTime       The requested latest allowed completion time
ReqMem            The total memory requested/dedicated to the job
ReqNodes          The number of requested nodes for the job
ReqProcs          The number of requested procs for the job
ReqReservation    The required reservation for the job
ReqRMType         The required RM type
ReqSMinTime       The requested earliest start time
RM                The master source resource manager
RMXString         The resource manager extension string
RsvAccess         The list of reservations accessible by the job
RsvStartTime      The reservation start time
RunPriority       The effective job priority
Shell             The execution shell's output
SID               The job's system ID (parent cluster)
Size              The job's computational size
STotCPU           The average CPU load tracked across all nodes
SMaxCPU           The max CPU load tracked across all nodes
STotMem           The average memory usage tracked across all nodes
SMaxMem           The max memory usage tracked across all nodes
SRMJID            The source RM's ID for the job
StartCount        The number of times the job has tried to start
StartPriority     The effective job priority
StartTime         The most recent time the job started executing
State             The state of the job as reported by Moab
StatMSUtl         The total number of memory seconds utilized
StatPSDed         The total number of processor seconds dedicated to the job
StatPSUtl         The total number of processor seconds utilized by the job
StdErr            The path to the stderr file
StdIn             The path to the stdin file
StdOut            The path to the stdout file
StepID            Step ID of the job (used with LoadLeveler systems)
SubmitHost        The host where the job was submitted
SubmitLanguage    The RM language in which the submission request was performed
SubmitString      The string containing the entire submission request
SubmissionTime    The time the job was submitted
SuspendDuration   The amount of time the job has been suspended
SysPrio           The admin-specified job priority
SysSMinTime       The system-specified minimum start time
TaskMap           The allocation taskmap for the job
TermTime          The time the job was terminated
User              The user assigned to the job
UserPrio          The user-specified job priority
UtlMem            The utilized memory of the job
UtlProcs          The number of processors utilized by the job
Variable
VWCTime           The virtual wallclock limit
Examples
Example 3-27:
> mjobctl -q diag ALL --format=xml
<Data><job AWDuration="346" Class="batch" CmdFile="jobsleep.sh" EEDuration="0"
EState="Running" Flags="RESTARTABLE" Group="test" IWD="/home/test" JobID="11578"
QOS="high"
RMJID="11578.lolo.icluster.org" ReqAWDuration="00:10:00" ReqNodes="1" ReqProcs="1"
StartCount="1"
StartPriority="1" StartTime="1083861225" StatMSUtl="903.570" StatPSDed="364.610"
StatPSUtl="364.610"
State="Running" SubmissionTime="1083861225" SuspendDuration="0" SysPrio="0"
SysSMinTime="00:00:00"
User="test"><req AllocNodeList="hana" AllocPartition="access" ReqNodeFeature="[NONE]"
ReqPartition="access"></req></job><job AWDuration="346" Class="batch"
CmdFile="jobsleep.sh"
EEDuration="0" EState="Running" Flags="RESTARTABLE" Group="test" IWD="/home/test"
JobID="11579"
QOS="high" RMJID="11579.lolo.icluster.org" ReqAWDuration="00:10:00" ReqNodes="1"
ReqProcs="1"
StartCount="1" StartPriority="1" StartTime="1083861225" StatMSUtl="602.380"
StatPSDed="364.610"
StatPSUtl="364.610" State="Running" SubmissionTime="1083861225" SuspendDuration="0"
SysPrio="0"
SysSMinTime="00:00:00" User="test"><req AllocNodeList="lolo" AllocPartition="access"
ReqNodeFeature="[NONE]" ReqPartition="access"></req></job></Data>
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
setspri
canceljob
runjob
mnodectl
Synopsis
mnodectl -m attr{=|+=|-=}val nodeexp
mnodectl -q [cat|diag|profile|wiki] nodeexp
Overview
Change specified attributes for a given node expression.
Access
By default, this command can be run by any Moab Administrator.
Format
-m - Modify
Format
<ATTR>{=|-=|+=}<VAL>
Where <ATTR> is one of the following: FEATURES, GEVENT, GMETRIC, MESSAGE, OS, POWER, STATE, or VARIABLE.
The = operator specifies a new value to replace the old one(s), if any; -= clears the attribute, except when used for features, where it removes the specified features from the node. The += operator, which is only available for features, appends additional features to the current list rather than replacing the current list entirely.
Changing OS and POWER requires a Moab Adaptive Computing Suite license and a provisioning resource manager.
Description
Modify the state or attribute of specified node(s).
Example
> mnodectl -m features+=fastio,highmem node1
> mnodectl -m gevent=cpufail:'cpu02 has failed w/ec:0317' node1
> mnodectl -m gmetric=temp:131.2 node1
> mnodectl -m message='cpufailure:cpu02 has failed w/ec:0317' node1
> mnodectl -m OS=RHAS30 node1
> mnodectl -m power=off node1
> mnodectl -m state=idle node1
> mnodectl -m variable=IP=10.10.10.100,Location=R1S2 node1
-q - Query
Format
{cat | diag | profile | wiki}
Description
Query node categories or node profile information (see ENABLEPROFILING for nodes).
The diag and profile options must use --xml.
Example
> mnodectl -q cat ALL
node categorization stats from Mon Jul 10 00:00:00 to Mon Jul 10 15:30:00
Node: moab
Categories:
busy: 96.88%
idle: 3.12%
Node: maka
Categories:
busy: 96.88%
idle: 3.12%
Node: pau
Categories:
busy: 96.88%
idle: 3.12%
Node: maowu
Categories:
busy: 96.88%
down-hw: 3.12%
Cluster Summary:
busy: 96.88%
down-hw: 0.78%
idle: 2.34%
> mnodectl -v -q profile
...
> mnodectl -q wiki ALL
GLOBAL STATE=Idle PARTITION=SHARED
n0 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n1 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n2 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n3 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n4 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n5 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n6 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n7 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n8 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED
n9 STATE=Idle PARTITION=base APROC=4 CPROC=4 RM=base NODEACCESSPOLICY=SHARED

Query a node with the output displayed in a WIKI string.
Parameters
FEATURES
Format
<STRING>
One of the following:
- a comma-delimited list of features
- [NONE] (to clear features on the node)
Description
Sets the features on a node.
These node features will be overwritten when an RM reports
features.
Example
mnodectl -m features=fastio,highmem node1
mnodectl -m features=[NONE] node1
GEVENT
Format
<EVENT>:<MESSAGE>
Description
Creates a generic event on the node to which Moab may respond (see Enabling Generic Events).
Example
mnodectl -m gevent=powerfail:'power has failed' node1
GMETRIC
Format
<ATTR>:<VALUE>
Description
Sets the value for a generic metric on the node (see Enabling Generic Metrics).
When a gmetric set in Moab conflicts with what the resource manager reports, Moab uses
the set gmetric until the next time the resource manager reports a different number.
Example
mnodectl -m gmetric=temp:120 node1
MESSAGE
Format
'<MESSAGE>'
Description
Sets a message to be displayed on the node.
Example
mnodectl -m message='powerfailure: power has failed' node1
NODEEXP
Format
<STRING>
Where <NODEEXP> is a node name, regex or ALL
Node regex has the potential to unintentionally match many nodes (for example,
specifying n1 will match n10, n11, n12, n100, etc). To ensure correct matching, explicitly
use the "x:<node_regex>" when modifying multiple nodes in one command. Currently this
is supported for features.
Description
Identifies one or more nodes.
Example
node1 — applies only to node1
fr10n* - all nodes starting with fr10n
ALL - all known nodes
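For example, to append a feature to every node matching a regular expression, using the explicit x: syntax recommended in the warning above (node names are hypothetical):

> mnodectl -m features+=highmem "x:fr10n.*"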
OS
Format
<STRING>
Description
Operating System (see Resource Provisioning).
Example
mnodectl node1 -m OS=RHELAS30
POWER
Format
{off|on}
Description
Set the power state of a node. Action will NOT be taken if the node is already in the specified state.
If you power off a node, a green policy will try to turn it back on. If you want the node to
remain powered off, you must associate a reservation with it.
If you request to power off a node that has active work on it, Moab returns a status
indicating that the node is busy (with a job or VM) and will not be powered off. You will
see one of these messages:
- Ignoring node <name>: power ON in process (indicates node is currently powering on)
- Ignoring node <name>: power OFF in process (indicates node is currently powering off)
- Ignoring node <name>: has active VMs running (indicates the node is currently running active VMs)
- Ignoring node <name>: has active jobs running (indicates the node is currently running active jobs)
Once you resolve the activity on the node (by preempting or migrating the jobs or VMs, for
example), you can attempt to power the node off again.
You can use the --flags=force option to cause a force override. However, doing this will
power off the node regardless of whether or not its jobs get migrated or preempted (i.e.,
you run the risk of losing the VMs/jobs entirely). For example:
> mnodectl node1 -m power=off --flags=force
Example
> mnodectl node1 -m power=off
STATE
Format
{drained|idle}
Description
Remove (drained) or add (idle) a node from scheduling.
Example
mnodectl node1 -m state=drained
Moab ignores node1 when scheduling.
VARIABLE
Format
<name>[=<value>],<name>[=<value>]...
Description
Set a list of variables for a node.
Example
> mnodectl node1 -m variable=IP=10.10.10.100,Location=R1S2
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mdiag -n
showres -n
checknode
showstats -n — report current and historical node statistics
moab
Synopsis
moab --about --help --loglevel=<LOGLEVEL> --version [-c <CONFIG_FILE>] [-C] [-d] [-e] [-h] [-P [<PAUSEDURATION>]] [-R <RECYCLEDURATION>] [-s] [-S [<STOPITERATION>]] [-v]
Parameters
Parameter
Description
--about
Displays build environment and version information.
--loglevel
Sets the server loglevel to the specified value.
--version
Displays version information.
-c
Configuration file the server should use.
-C
Clears checkpoint files (.moab.ck, .moab.ck.1).
-d
Debug mode (does not background itself).
-e
Forces Moab to exit if there are any errors in the configuration file, if it can't connect to the configured database, or if it can't find these directories:
- statdir
- logdir
- spooldir
- toolsdir
-P
Starts Moab in a paused state for the duration specified.
-R
Causes Moab to automatically recycle every time the specified duration transpires.
-s
Starts Moab in the state that was most recently checkpointed.
-S
Suspends/stops scheduling at specified iteration (or at startup if no iteration is specified).
-v
Same as --version.
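For example, the following hypothetical invocation starts Moab with a non-default configuration file, clears old checkpoint files, and pauses scheduling for five minutes (timespec 5:00); the file path is a placeholder:

> moab -c /opt/moab/etc/moab.test.cfg -C -P 5:00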
mrmctl
Synopsis
mrmctl -f [fobject] {rmName|am:[amid]}
mrmctl -l [rmid|am:[amid]]
mrmctl -m <attr>=<value> [rmid]
mrmctl -p {rmid|am:[amid]}
mrmctl -R {AM|ID}[:RMID]
Overview
mrmctl allows an admin to query, list, modify, and ping the resource managers
and accounting managers in Moab. mrmctl also allows for a queue (often
referred to as a class) to be created for a resource manager.
Access
By default, this command can be run by level 1 and level 2 Moab administrators
(see ADMINCFG).
Format
-f - Flush Statistics
Format
[<fobject>] where fobject is optional and one of messages[ID[:id]] or stats.
Default
If no fobject is specified, then reported failures and performance data will be flushed. If no
resource manager id is specified, the first resource manager will be flushed.
Description
Clears resource manager statistics. If messages is specified, then reported failures, performance
data, and messages will be flushed.
Example
> mrmctl -f base
Moab will clear the statistics for RM base.
-l - List
Format
N/A
Default
All RMs and AMs (when no RM/AM is specified)
Description
List Resource and Accounting Manager(s)
Example
> mrmctl -l
Moab will list all resource and
accounting managers.
-m - Modify
Format
N/A
Default
All RMs and AMs (when no RM/AM is specified).
Description
Modify Resource and Accounting Manager(s).
Example
> mrmctl -m state=disabled peer13
-p - Ping
Format
N/A
Default
First RM configured.
Description
Ping Resource Manager.
Example
> mrmctl -p base
Moab will ping RM base.
-R - Reload
Format
{AM|ID}[:RMID]
Description
Dynamically reloads server information for the identity manager service if ID is specified; if AM is
specified, reloads the accounting manager service.
Example
> mrmctl -R ID
Reloads the identity manager on demand.
Resource manager interfaces can be enabled/disabled using the modify
operation to change the resource manager state as in the following
example:
# disable active resource manager interface
> mrmctl -m state=disabled torque
# restore disabled resource manager interface
> mrmctl -m state=enabled torque
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mdiag -R
mdiag -c
mrsvctl
Synopsis
mrsvctl -B SRVSID
mrsvctl -c [-a ACL] [-b SUBTYPE] [-d DURATION] [-D DESCRIPTION] [-e ENDTIME] [-E EXCLUSIVE] [-f FEATURES] [-F FLAGS] [-g RSVGROUP] [-h HOSTLIST] [-n NAME] [-o OWNER] [-p PARTITION] [-P PROFILE] [-R RESOURCES] [-s STARTTIME] [-S SET ATTRIBUTE] [-t TASKS] [-T TRIGGER] [-V VARIABLE] [-x JOBLIST]
mrsvctl -C [-g SRSVID] {RESERVATION ID}
mrsvctl -l [{RESERVATION ID | -i INDEX}]
mrsvctl -m <duration|endtime|reqtaskcount|starttime>{=|+=|-=}<VAL> <hostexp>{+=|-=}<VAL> <variable>{+=KEY=VAL|-=KEY_TO_REMOVE} {RESERVATION ID | -i INDEX}
mrsvctl -q {RESERVATION ID | -i INDEX} [--blocking]
mrsvctl -r {RESERVATION ID | -i INDEX}
Overview
mrsvctl controls the creation, modification, querying, and releasing of
reservations.
The timeframe covered by the reservation can be specified on either an
absolute or relative basis. Only jobs with credentials listed in the reservation's
access control list can utilize the reserved resources. However, these jobs still
have the freedom to utilize resources outside of the reservation. The
reservation will be assigned a name derived from the ACL specified. If no
reservation ACL is specified, the reservation is created as a system reservation
and no jobs will be allowed access to the resources during the specified
timeframe (valuable for system maintenance, etc.). See the Reservation
Overview for more information.
Reservations can be viewed using the -q flag and can be released using the -r
flag.
By default, reservations are not exclusive and may overlap with other
reservations and jobs. Use the '-E' flag to adjust this behavior.
Access
By default, this command can be run by level 1 and level 2 Moab administrators
(see ADMINCFG).
Format
-a
Name
ACL
Format
<TYPE>==<VAL>[,<TYPE>==<VAL>]...
Where <TYPE> is one of the following:
ACCT,
CLASS,
DURATION,
GROUP,
JATTR,
PROC,
QOS, or
USER
Description
Example
List of limitations for access to the reserved resources (See also: ACL Modifiers).
> mrsvctl -c -h node01 -a USER==john+,CLASS==batch-
Moab will make a reservation on node01 allowing access to user john and restricting
access from class batch when other resources are available to class batch
> mrsvctl -m -a USER-=john system.1
Moab will remove user john from the system.1 reservation
Notes
* When you specify multiple credentials, a user must only match one of them in order to access the reservation. To require one or more of the listed limitations for reservation access, each required specification must end with an asterisk (*). If a user meets the required limitation(s), he or she has access to the reservation (without meeting any that are not marked required).
* There are three different assignment operators that can be used for modifying most credentials in the ACL. The operator == will reassign the list for that particular credential type. The += operator will append to the list for that credential type, and -= will remove from the list. Two other operators are used to specify DURATION and PROC: >= (greater than) and <= (less than).
* To add multiple credentials of the same type with one command, use a colon to separate them. To separate lists of different credential types, use commas. For example, to reassign the user list to consist of users Joe and Bob, and to append the group MyGroup to the groups list on the system.1 reservation, you could use the command mrsvctl -m -a USER==Joe:Bob,GROUP+=MyGroup system.1.
* Any of the ACL modifiers may be used. When using them, it is often useful to put single quotes on either side of the assignment command. For example, mrsvctl -m -a 'USER==&Joe' system.1.
* Some modifiers are mutually exclusive. For example, the ! modifier means that the credential is blocked from the reservation and the & modifier means that the credential must run on that reservation. Moab will take the most recently parsed modifier. Modifiers may be placed on either the left or the right of the argument, so USER==&JOE and USER==JOE& are equivalent. Moab parses each argument starting from right to left on the right side of the argument, then from left to right on the left side. So, if the command was USER==!Joe&, Moab would keep the equivalent of USER==!Joe because the ! would be the last one parsed.
* You can set a reservation to have a time limit for submitted jobs using DURATION and the * modifier. For example, mrsvctl -m -a 'DURATION<=*1:00:00' system.1 would cause the system.1 reservation to not accept any jobs with a walltime greater than one hour. Similarly, you can set a reservation to have a processor limit using PROC and the * modifier. mrsvctl -m -a 'PROC>=2*' system.2 would cause the system.2 reservation to only allow jobs requesting more than 2 procs to run on it.
* You can verify the ACL of a reservation using the mdiag -r command.
> mrsvctl -m -a 'USER==Joe:Bob,GROUP-=BadGroup,ACCT+=GoodAccount,DURATION<=*1:00:00' system.1
Moab will reassign the USER list to be Joe and Bob, will remove BadGroup from the GROUP list, append GoodAccount to the ACCT list, and only allow jobs that have a submitted walltime of an hour or less on the system.1 reservation.
mrsvctl -m -a 'USER==Joe,USER==Bob' system.1
Moab will assign the USER list to Joe, and then reassign it again to Bob. The final result
will be that the USER list will just be Bob. To add Joe and Bob, use mrsvctl -m -a
USER==Joe:Bob system.1 or mrsvctl -m -a USER==Joe,USER+=Bob
system.1.
-b
Name
SUBTYPE
Format
One of the node category values or node category shortcuts.
Description
Add subtype to reservation.
Example
> mrsvctl -c -b SoftwareMaintenance -t ALL
Moab will associate the reserved nodes with the node category
SoftwareMaintenance.
-B
Name
REBUILD
Format
<SRSVID>
Description
Rebuilds standing reservations while Moab is running.
Example
> mrsvctl -B <SRSVID>
-c
Name
CREATE
Format
<ARGUMENTS>
Description
Creates a reservation.
If a created reservation has a given duration but a start time in the past, one of the following actions occurs, depending on whether the present time falls within the reservation's given duration:
* If the present time is still within the reservation's duration time frame, the start time does not change and the reservation shows however long is left in the reservation (the end time minus the present time).
* If the present time is outside of the reservation's duration time frame, the reservation start time automatically sets to the present time and the reservation continues for its full given duration.
The -x flag, when used with -F ignjobrsv, lets users create reservations but exclude
certain nodes from being part of the reservation because they are running specific jobs.
The -F flag instructs mrsvctl to still consider nodes with currently running jobs.
Examples
> mrsvctl -c -t ALL
Moab will create a reservation across all system resources.
> mrsvctl -c -t 5 -F ignjobrsv -x moab.5,moab.6
Moab will create the reservation while assigning the nodes. Nodes running jobs moab.5 and moab.6 will not be assigned to the reservation.
> mrsvctl -c -t 1 -d INFINITY
Moab will create an infinite reservation.
-C
Name
CLEAR
Format
<RSVID> | -g <SRSVID>
Description
Clears any disabled time slots from standing reservations and allows the recreation of disabled
reservations
Example
> mrsvctl -C -g testing
Moab will clear any disabled timeslots from the standing reservation testing.
-d
Name
DURATION
Format
[[[DD:]HH:]MM:]SS
Default
INFINITY
Description
Duration of the reservation (not needed if ENDTIME is specified)
Example
> mrsvctl -c -h node01 -d 5:00:00
Moab will create a reservation on node01 lasting 5 hours.
> mrsvctl -c -h node01 -d INFINITY
Moab will create a reservation with a duration of
INFINITY (no endtime).
-D
Name
DESCRIPTION
Format
<STRING>
Description
Human-readable description of reservation or purpose
Example
> mrsvctl -c -h node01 -d 5:00:00 -D 'system maintenance to test network'
Moab will create a reservation on node01 lasting 5 hours.
-e
Name
ENDTIME
Format
[HH[:MM[:SS]]][_MO[/DD[/YY]]]
or
+[[[DD:]HH:]MM:]SS
Default
INFINITY
Description
Absolute or relative time reservation will end (not required if Duration specified). ENDTIME also
supports an epoch timestamp.
Example
> mrsvctl -c -h node01 -e +3:00:00
Moab will create a reservation on node01 ending in 3 hours.
-E
Name
EXCLUSIVE
Description
When specified, Moab will only create a reservation if there are no other reservations (exclusive or
otherwise) which would conflict with the time and space constraints of this reservation. If exceptions are desired, the rsvaccesslist attribute can be set or the ignrsv flag can be used.
Example
> mrsvctl -c -h node01 -E
Moab will only create a reservation on node01 if no conflicting reservations are found.
This flag is only used at the time of reservation creation. Once the reservation is created,
Moab allows jobs into the reservation based on the ACL. Also, once the exclusive
reservation is created, it is possible that Moab will overlap it with jobs that match the ACL.
-f
Name
FEATURES
Format
<STRING>[:<STRING>]...
Description
List of node features which must be possessed by the reserved resources. You can use a backslash
and pipe to delimit features to indicate that Moab can use one or the other.
Example
> mrsvctl -c -h node[0-9] -f fast\|slow
Moab will create a reservation on nodes matching the expression and which also have
either the feature fast or the feature slow.
-F
Name
FLAGS
Format
<flag>[[,<flag>]...]
Description
Comma-delimited list of flags to set for the reservation (see Managing Reservations for flags).
Example
> mrsvctl -c -h node01 -F ignstate
Moab will create a reservation on node01 ignoring any conflicting node states.
-g
Name
RSVGROUP
Format
<STRING>
Description
For a create operation, create a reservation in this reservation group. For list and modify operations, take actions on all reservations in the specified reservation group. The -g option can also be
used in conjunction with the -r option to release a reservation associated with a specified group.
See Reservation Group for more information.
Example
> mrsvctl -c -g staff -h 'node0[1-9]'
Moab will create a reservation on nodes matching the node expression given and assign it
to the reservation group staff.
-h
Name
HOSTLIST
Format
class:<classname>[,<classname>]...
or
<STRING>
or
'r:<nodeNameStart>[<beginRange>-<endRange>]'
or
ALL
Description
Host expression or a class mapping indicating the nodes which the reservation will allocate.
When you specify a <STRING>, the HOSTLIST attribute is always treated as a regular expression. foo10 will map to foo10, foo101, foo1006, etc. To request an exact host match, the expression can be bounded by the caret and dollar regular expression markers, as in ^foo10$.
Example
> mrsvctl -c -h 'r:node0[1-9]'
Moab will create a reservation on nodes node01, node02, node03, node04, node05,
node06, node07, node08, and node09.
> mrsvctl -c -h class:batch
Moab will create a reservation on all nodes which support class/queue batch.
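A hedged illustration of the exact-match note above (node01 is an assumed node name):
> mrsvctl -c -h '^node01$'
Moab will create a reservation on node01 only, without also matching node010, node011, and so forth.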
-i
Name
INDEX
Format
<STRING>
Description
Use the reservation index instead of full reservation ID.
Example
> mrsvctl -m -i 1 starttime=+5:00
Moab will modify the reservation with index 1 to start in five minutes.
-l
Name
LIST
Format
<RSV_ID> or ALL
RSV_ID can be the name of a reservation or a regular expression.
Default
ALL
Description
List reservation(s).
Example
> mrsvctl -l system*
Moab will list all of the reservations whose names start
with system.
-m
Name
MODIFY
Format
<ATTR>=<VAL> [-m <ATTR2>=<VAL2>]...
Where <ATTR> is one of the following:
* flags
* duration{+=|-=|=}<RELTIME>
* endtime{+=|-=}<RELTIME> or endtime=<ABSTIME>
* hostexp[+=|-=]<node>[,<node>]
* variable[+=key1=val1|-=key_to_remove]
* reqtaskcount{+=|-=|=}<TASKCOUNT>
* starttime{+=|-=}<RELTIME> or starttime=<ABSTIME>
Description
Modify aspects of a reservation.
Moab is constantly scheduling and updating reservations. Before modifying a reservation, it is recommended that you first stop the scheduler (mschedctl -s) so that the scheduler and reservation are in a stable and steady state. Once the reservation has been modified, resume the scheduler with mschedctl -r.
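A minimal sketch of that stop-modify-resume sequence (the reservation ID and new duration are illustrative):
> mschedctl -s
> mrsvctl -m duration=2:00:00 system.1
> mschedctl -r
Scheduling is suspended, the reservation is modified in a steady state, and scheduling then resumes.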
Example
> mrsvctl -m duration=2:00:00 system.1
Moab sets the duration of reservation system.1 to be exactly two hours, thus modifying
the endtime of the reservation.
> mrsvctl -m starttime+=5:00:00 system.1
Moab advances the starttime of system.1 five hours from its current starttime (without
modifying the duration of the reservation).
> mrsvctl -m endtime-=5:00:00 system.1
Moab moves the endtime of reservation system.1 ahead five hours from its current
endtime (without modifying the starttime; thus, this action is equivalent to modifying the
duration of the reservation).
> mrsvctl -m starttime=15:00:00_7/6/08 system.1
Moab sets the starttime of reservation system.1 to 3:00 p.m. on July 6, 2008.
> mrsvctl -m starttime-5:00:00 system.1
Moab moves the starttime of reservation system.1 to five hours before the current time.
> mrsvctl -m starttime+5:00:00 system.1
Moab moves the starttime of reservation system.1 to five hours from the current time.
> mrsvctl -m duration+=5:00:00 system.1
Moab extends the duration of system.1 by five hours.
> mrsvctl -m flags+=ADVRES system.1
Moab adds the flag ADVRES to reservation system.1.
> mrsvctl -m variable+=key1=val1 system.1
Moab adds the variable key1 with the value val1 to system.1.
> mrsvctl -m variable+=key1=val1 variable+=key2=val2 system.1
Moab adds the variable key1 with the value val1, and variable key2 with val2 to
system.1. (Note that each variable flag requires a distinct -m entry.)
> mrsvctl -m variable-=key1 system.1
Moab deletes the variable key1 from system.1.
> mrsvctl -m variable-=key1 -m variable-=key2 system.1
Moab deletes the variables key1 and key2 from system.1.
Notes:
* Modifying the starttime does not change the duration of the reservation, so the endtime changes as well. The starttime can be changed to be before the current time, but if the change causes the endtime to be before the current time, the change is not allowed.
* Modifying the endtime changes the duration of the reservation as well (and vice versa). An endtime cannot be placed before the starttime or before the current time.
* Duration cannot be negative.
* The += and -= operators operate on the time of the reservation (starttime+=5 adds five seconds to the current reservation starttime), while + and - operate on the current time (starttime+5 sets the starttime to five seconds from now).
* If the starttime or endtime specified is before the current time without a date specified, it is set to the next time that fits the command. To force the date, add the date as well. For the following examples, assume that the current time is 9:00 a.m. on March 1, 2007.
> mrsvctl -m starttime=8:00:00_3/1/07 system.1
Moab moves system.1's starttime to 8:00 a.m., March 1.
> mrsvctl -m starttime=8:00:00 system.1
Moab moves system.1's starttime to 8:00 a.m., March 2.
> mrsvctl -m endtime=7:00:00 system.1
Moab moves system.1's endtime to 7:00 a.m., March 3. This happens because the
endtime must also be after the starttime, so Moab continues searching until it has found a
valid time that is in the future and after the starttime.
> mrsvctl -m endtime=7:00:00_3/2/07 system.1
Moab will return an error because the endtime cannot be before the starttime.
-n
Name
NAME
Format
<STRING>
Description
Name for new reservation.
If no name is specified, the reservation name is set to the first name listed in the ACL, or to SYSTEM if no ACL is specified.
Reservation names may not contain whitespace.
Example
mrsvctl -c -h node01 -n John
Moab will create a reservation on node01 with the name John.
-o
Name
OWNER
Format
<CREDTYPE>:<CREDID>
Description
Specifies the owner of a reservation. See Reservation Ownership for more information.
Example
mrsvctl -c -h node01 -o USER:user1
Moab creates a reservation on node01 owned by user1.
-p
Name
PARTITION
Format
<STRING>
Description
Only allocate resources from the specified partition
Example
mrsvctl -c -p switchB -t 14
Moab will allocate 14 tasks from the
switchB partition.
-P
Name
PROFILE
Format
<STRING>
Description
Indicates the reservation profile to load when creating this reservation
Example
mrsvctl -c -P testing2 -t 14
Moab will allocate 14 tasks to a reservation defined by the testing2
reservation profile.
-q
Name
QUERY
Format
<RSV_ID> — The -q option accepts x: node regular expressions and r: node range expressions (asterisks (*) are supported wildcards as well).
Description
Get diagnostic information or list all completed reservations. The command gathers information
from the Moab cache which prevents it from interrupting the scheduler, but the --blocking
option can be used to bypass the cache and interrupt the scheduler.
Example
mrsvctl -q ALL
Moab will query reservations.
mrsvctl -q system.1
Moab will query the reservation system.1.
-r
Name
RELEASE
Format
<RSV_ID> — The -r option accepts x: node regular expressions and r: node range expressions
(asterisks (*) are supported wildcards as well).
Description
Releases the specified reservation.
When you release an instance of a standing reservation, Moab will remember that and
prevent a reservation from being created for that same period (even after a restart of
Moab). When Moab reaches the end of the period, it will still create new reservations in the
future to meet the reservation depth requirement.
Example
> mrsvctl -r system.1
Moab will release reservation system.1.
> mrsvctl -r -g idle
Moab will release all idle job reservations.
-R
Name
RESOURCES
Format
<tid> or
<RES>=<VAL>[{,|+|;}<RES>=<VAL>]...
Where <RES> is one of the following:
PROCS,
MEM,
DISK,
SWAP,
GRES
Default
PROCS=-1
Description
Specifies the resources to be reserved per task. (-1 indicates all resources on node)
For GRES resources, <VAL> is specified in the format <GRESNAME>
[:<COUNT>]
Example
> mrsvctl -c -R MEM=100;PROCS=2 -t 2
Moab will create a reservation for two tasks with the specified resources.
-s
Name
STARTTIME
Format
[HH[:MM[:SS]]][_MO[/DD[/YY]]]
or
+[[[DD:]HH:]MM:]SS
Default
[NOW]
Description
Absolute or relative time reservation will start. STARTTIME also supports an epoch timestamp.
Example
> mrsvctl -c -t ALL -s 3:00:00_4/4/04
Moab will create a reservation on all system resources at 3:00 am on April 4, 2004
> mrsvctl -c -h node01 -s +5:00
Moab will create a reservation in 5 minutes on node01
> mrsvctl -m -s -=5:00 system.1
This will decrement the start time by 5 minutes.
-S
Name
SET ATTRIBUTE
Format
<ATTR>=<VALUE> where <ATTR> is one of:
* aaccount — accountable account
* agroup — accountable group
* aqos — accountable QoS
* auser — accountable user
* MAXJOB — maximum number of jobs allowed within the reservation across all nodes (cannot be used with partial node reservations)
* reqarch — required architecture
* reqmemory — required node memory, in MB
* reqos — required operating system
* rsvaccesslist — comma-delimited list of reservations or reservation groups which can be accessed by this reservation request. Because each reservation can access all other reservations by default, you should make any reservation with a specified rsvaccesslist exclusive by setting the -E flag. This setting gives the otherwise exclusive reservation access to the reservations specified in the list.
Description
Specifies a reservation attribute that will be used to create this reservation.
Example
> mrsvctl -c -h node01 -S aqos=high
Moab will create a reservation on node01 and will use the QoS high as the accountable credential.
> mrsvctl -c -S MAXJOB=X
Moab will create a reservation in which no more than X jobs can run at once.
-t
Name
TASKS
Format
<INTEGER>[-<INTEGER>]
Description
Specifies the number of tasks to reserve. ALL indicates all resources available should be reserved.
If the task value is set to ALL, Moab applies the reservation regardless of existing
reservations and exclusive issues. If an integer is used, Moab only allocates accessible
resources. If a range is specified Moab attempts to reserve the maximum number of tasks,
or at least the minimum.
Example
> mrsvctl -c -t ALL
Moab will create a reservation on all resources.
> mrsvctl -c -t 3
Moab will create a reservation for three tasks.
> mrsvctl -c -t 3-10 -E
Moab will attempt to reserve 10 tasks but will fail if it cannot get at least three.
-T
Name
TRIGGER
Format
<STRING>
Description
Comma-delimited reservation trigger list following format described in the trigger format section
of the reservation configuration overview. See Creating a Trigger for more information.
To cancel a standing reservation with a trigger, the SRCFG parameter's attribute DEPTH
must be set to 0.
Example
> mrsvctl -c -h node01 -T offset=200,etype=start,atype=exec,action=/tmp/email.sh
Moab will create a reservation on node01 and fire the script /tmp/email.sh 200
seconds after it starts
-V
Name
VARIABLE
Format
<name>[=<value>][[;<name>[=<value>]]...]
Description
Semicolon-delimited list of variables that will be set when the reservation is created (see About Trigger Variables for more information). Names with no values will simply be set to TRUE.
Example
> mrsvctl -c -h node01 -V $T1=mac;var2=18.19
Moab will create a reservation on node01 and set $T1 to mac and var2 to 18.19.
For information on modifying a variable on a reservation, see MODIFY.
-x
Name
JOBLIST
Format
-x <jobs to be excluded>
Description
The -x flag, when used with -F ignjobrsv, lets users create reservations but exclude certain nodes that are running the listed jobs. The -F flag instructs mrsvctl to still consider nodes with currently running jobs. The nodes are not listed directly.
Example
> mrsvctl -c -t 5 -F ignjobrsv -x moab.5,moab.6
Moab will create the reservation while assigning the nodes. Nodes running jobs moab.5 and moab.6 will not be assigned to the reservation.
Parameters
RESERVATION ID
Format
<STRING>
Description
The name of a reservation or a regular expression for several reservations.
Example
system*
Specifies all reservations starting with system.
Resource Allocation Details
When allocating resources, the following rules apply:
* When specifying tasks, each task defaults to one full compute node unless otherwise specified using the -R specification.
* When specifying tasks, the reservation will not be created unless all requested resources can be allocated. (This behavior can be changed by specifying -F besteffort.)
* When specifying tasks or hosts, only nodes in an idle or running state will be considered. (This behavior can be changed by specifying -F ignstate.)
Reservation Timeframe Modification
Moab supports dynamically modifying the timeframe of existing reservations.
This can be accomplished using the mrsvctl -m flag. By default, Moab will perform advanced boundary and resource access checks to verify that the modification does not result in an invalid scheduler state. However, in certain circumstances administrators may wish to force the modification in spite of any access violations. This can be done with mrsvctl -m --flags=force, which forces Moab to bypass any access verification and push the change through.
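A hedged sketch of a forced modification (the reservation ID and offset are illustrative):
> mrsvctl -m starttime-=1:00:00 --flags=force system.1
Moab moves the starttime of system.1 back one hour even if the change would normally fail access verification.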
Extending a reservation by modifying the endtime
The following increases the endtime of a reservation using the += tag:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:35:57  1:11:35:57  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m endtime+=24:00:00 system.1
endtime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:35:22  2:11:35:22  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
The following increases the endtime of a reservation by setting the endtime to
an absolute time:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:33:18  1:11:33:18  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m endtime=0_11/20 system.1
endtime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:33:05  2:11:33:05  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
Extending a reservation by modifying the duration
The following increases the duration of a reservation using the += tag:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:28:46  1:11:28:46  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m duration+=24:00:00 system.1
duration for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:28:42  2:11:28:42  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
The following increases the duration of a reservation by setting the duration to
an absolute time:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:26:41  1:11:26:41  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m duration=48:00:00 system.1
duration for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:26:33  2:11:26:33  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
Shortening a reservation by modifying the endtime
The following modifies the endtime of a reservation using the -= tag:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:15:51  2:11:15:51  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m endtime-=24:00:00 system.1
endtime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:15:48  1:11:15:48  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
The following modifies the endtime of a reservation by setting the endtime to
an absolute time:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:14:00  2:11:14:00  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m endtime=0_11/19 system.1
endtime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:13:48  1:11:13:48  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
Shortening a reservation by modifying the duration
The following modifies the duration of a reservation using the -= tag:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:12:20  2:11:12:20  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m duration-=24:00:00 system.1
duration for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:12:07  1:11:12:07  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
The following modifies the duration of a reservation by setting the duration to
an absolute time:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:10:57  2:11:10:57  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m duration=24:00:00 system.1
duration for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:10:50  1:11:10:50  1:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
Modifying the starttime of a reservation
The following increases the starttime of a reservation using the += tag:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:08:30  2:11:08:30  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m starttime+=24:00:00 system.1
starttime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -  1:11:08:22  3:11:08:22  2:00:00:00    1/2  Sun Nov 19 00:00:00
1 reservation located
The following decreases the starttime of a reservation using the -= tag:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:07:04  2:11:07:04  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m starttime-=24:00:00 system.1
starttime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -   -12:53:04  1:11:06:56  2:00:00:00    1/2  Fri Nov 17 00:00:00
1 reservation located
The following modifies the starttime of a reservation using an absolute time:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:05:31  2:11:05:31  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m starttime=0_11/19 system.1
starttime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -  1:11:05:18  3:11:05:18  2:00:00:00    1/2  Sun Nov 19 00:00:00
1 reservation located
The following moves the starttime of a reservation to an earlier date using an absolute time:
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -    11:04:04  2:11:04:04  2:00:00:00    1/2  Sat Nov 18 00:00:00
1 reservation located
$> mrsvctl -m starttime=0_11/17 system.1
starttime for rsv 'system.1' changed
$> showres
ReservationID       Type S       Start         End    Duration    N/P  StartTime
system.1            User -   -12:56:02  1:11:03:58  2:00:00:00    1/2  Fri Nov 17 00:00:00
1 reservation located
Examples
* Basic Reservation
* System Maintenance Reservation
* Explicit Task Description
* Dynamic Reservation Modification
* Reservation Modification
* Allocating Reserved Resources
* Modifying an Existing Reservation
Example 3-28: Basic Reservation
Reserve two nodes for use by users john and mary for a period of 8 hours
starting in 24 hours
> mrsvctl -c -a USER=john,USER=mary -starttime +24:00:00 -duration 8:00:00 -t 2
reservation 'system.1' created
Example 3-29: System Maintenance Reservation
Schedule a system wide reservation to allow a system maintenance on Jun 20,
8:00 AM until Jun 22, 5:00 PM.
% mrsvctl -c -s 8:00:00_06/20 -e 17:00:00_06/22 -h ALL
reservation 'system.1' created
Example 3-30: Explicit Task Description
Reserve one processor and 512 MB of memory on nodes node003 through node006 for members of the group staff and jobs in the interactive class
> mrsvctl -c -R PROCS=1,MEM=512 -a GROUP=staff,CLASS=interactive -h 'node00[3-6]'
reservation 'system.1' created
Example 3-31: Dynamic Reservation Modification
Modify reservation john.1 to start in 2 hours, run for 2 hours, and include
node02 in the hostlist.
> mrsvctl -m starttime=+2:00:00,duration=2:00:00,HostExp+=node02
Note: hosts added to rsv system.3
Example 3-32: Reservation Modification
Remove user John's access to reservation system.1
> mrsvctl -m -a USER=John system.1 --flags=unset
successfully changed ACL for rsv system.1
Example 3-33: Allocating Reserved Resources
Allocate resources for group dev which are exclusive except for resources
found within reservations myrinet.3 or john.6
> mrsvctl -c -E -a group=dev,rsv=myrinet.3,rsv=john.6 -h 'node00[3-6]'
reservation 'dev.14' created
Create exclusive network reservation on racks 3 and 4
> mrsvctl -c -E -a group=ops -g network -f rack3 -h ALL
reservation 'ops.1' created
> mrsvctl -c -E -a group=ops -g network -f rack4 -h ALL
reservation 'ops.2' created
Allocate 64 nodes for 2 hours to new reservation and grant access to
reservation system.3 and all reservations in the reservation group network
> mrsvctl -c -E -d 2:00:00 -a group=dev -t 64 -S rsvaccesslist=system.3,network
reservation 'system.23' created
Allocate 4 nodes for 1 hour to new reservation and grant access to idle job
reservations
> mrsvctl -c -E -d 1:00:00 -t 4 -S rsvaccesslist=idle
reservation 'system.24' created
Example 3-34: Modifying an Existing Reservation
Remove user john from reservation ACL
> mrsvctl -m -a USER=john system.1 --flags=unset
successfully changed ACL for rsv system.1
Change reservation group
> mrsvctl -m RSVGROUP=network ops.4
successfully changed RSVGROUP for rsv ops.4
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
Admin Reservation Overview
showres
mdiag -r
mshow -a command to identify available resources
job to rsv binding
mschedctl
Synopsis
mschedctl -A '<MESSAGE>'
mschedctl -c message messagestring [-o type:val]
mschedctl -c trigger triggerid -o type:val
mschedctl -d trigger:triggerid
mschedctl -d message:index
mschedctl -f {all|fairshare|usage}
mschedctl -k
mschedctl -l {config|feature|gmetric|gres|message|opsys|trigger|trans} [-v]
[--xml]
mschedctl -L [<LOGLEVEL>[:<LOG_FILE>]]
mschedctl -m config string [-e]
mschedctl -m trigger triggerid attr=val[,attr=val...]
mschedctl -q pactions --xml
mschedctl -p
mschedctl -r [resumetime]
mschedctl -R
mschedctl -s [STOPITERATION]
mschedctl -S [STEPITERATION]
mschedctl -W
Overview
The mschedctl command controls various aspects of scheduling behavior. It is used to manage scheduling activity, shut down the scheduler, and create resource trace files. It can also evaluate, modify, and create parameters, triggers, and messages.
With many flags, the --msg=<MSG> option can be specified to annotate the
action in the event log.
Format
-A - ANNOTATE
Format
<STRING>
Description
Report the specified parameter modification to the event log and annotate it
with the specified message. The RECORDEVENTLIST parameter must be set in
order for this to work.
Example
mschedctl -A 'increase logging' -m 'LOGLEVEL 6'
Adjust the LOGLEVEL parameter and record an associated message.
-c - CREATE
Format
One of:
* message <STRING> [-o <TYPE>:<VAL>]
* trigger <TRIGSPEC> -o <OBJECTTYPE>:<OBJECTID>
* gevent -n <NAME> [-m <message>]
where <ATTR> is one of account, duration, ID, messages, profile, reqresources, resources, rsvprofile, starttime, user, or variables
Description
Create a message, trigger, or gevent and attach it to the specified object. To create a trigger on a
default object, use the Moab configuration file (moab.cfg) rather than the mschedctl command.
Example
mschedctl -c message tell the admin to be nice
Create a message on the system table.
mschedctl -c trigger EType=start,AType=exec,Action="/tmp/email $OWNER $TIME" -o
rsv:system.1
Create a trigger linked to system.1.
Creating triggers on default objects via mschedctl -c trigger does not propagate the
triggers to individual objects. To propagate triggers to all objects, the triggers must be
created within the moab.cfg file; for example: NODECFG[DEFAULT]TRIGGER.
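A hedged moab.cfg sketch of that propagation (the script path is illustrative; the trigger specification follows the format used in the trigger example above):
NODECFG[DEFAULT] TRIGGER=EType=start,AType=exec,Action="/tmp/node_start.sh"
Every node then inherits the trigger, which a plain mschedctl -c trigger call would not accomplish.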
mschedctl -c gevent -n diskfailure -m "node=n4"
Create a gevent indicating a disk failure on the node labeled n4.
-d - DESTROY
Format
One of:
* trigger:<TRIGID>
* message:<INDEX>
Description
Delete a trigger or message.
Example
mschedctl -d trigger:3
Delete trigger 3.
mschedctl -d message:5
Delete message with index 5.
-f - FLUSH
Format
{all|fairshare|usage}
Description
Reset all internally-stored Moab Scheduler statistics to the initial start-up state as of the time the
command was executed.
Flushing should only be used if you experience corrupt statistics. The best practice is to
pause the Moab scheduler with mschedctl -p before running the flush command. After
running the flush command, unpause the Moab scheduler with mschedctl -r and the jobs
will start flowing again. For all external observers this will be a transparent flush unless
they are watching the stats.
Example
mschedctl -f usage
Flush usage statistics.
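Following the best-practice note above, a minimal sketch of a transparent flush:
mschedctl -p
mschedctl -f all
mschedctl -r
Pause scheduling, flush all statistics, and resume scheduling.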
-k - KILL
Description
Stop scheduling and exit the scheduler.
Example
mschedctl -k
Kill the scheduler.
-l - LIST
Format
{config|feature|gmetric|gres|message|opsys|trans|trigger} [-v] [--xml]
Using the --xml argument with the trans option returns XML that states if the queried
TID is valid or not.
Default
config
Description
List the generic metrics, generic resources, scheduler configuration, system messages, operating
systems, triggers, transactions, or node features recognized by Moab.
This command does not show credential parameters (such as user, group, class, QoS, account).
Example
mschedctl -l config
List system parameters.
The config command without the -v flag does not show the settings of all scheduling
parameters. To show the settings of all scheduling parameters, use the -v flag. This will
provide an extended output. This output is often best used in conjunction with the grep
command as the output can be voluminous.
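A hedged illustration of the grep suggestion above (LOGLEVEL is just one parameter of interest):
mschedctl -l config -v | grep LOGLEVEL
Show only the LOGLEVEL-related lines of the extended parameter listing.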
mschedctl -l feature
List all node features recognized by Moab.
mschedctl -l gmetric
List all configured generic metrics.
mschedctl -l gres
List all configured generic resources.
mschedctl -l message
List all system messages.
mschedctl -l opsys
List all recognized operating systems
mschedctl -l trans 1
List transaction id 1.
mschedctl -l trigger
List triggers.
-L - LOG
Format
[<LOGLEVEL>[: <LOG_FILE>]]
Default
7 $MOABHOMEDIR/log/moab.log
Description
Create a temporary log file with the specified loglevel. If no log file is given, Moab continues logging to Moab's default log file.
Example
mschedctl -L7:/tmp/moab.log
-m - MODIFY
Format
One of:
config [<STRING>]
[-e]
<STRING> is any string which would be acceptable in moab.cfg
l
o
If no string is specified, <STRING> is read from STDIN.
o
If -e is specified, the configuration string will be evaluated for correctness but no
configuration changes will take place. Any issues with the provided string will be
reported to STDERR.
mschedctl --flags=persistent -m <config> has been deprecated; use the following
method instead:
1. Run mschedctl -m <config> to put the change into effect dynamically.
2. Manually add the setting to the moab.cfg file, so that it always goes into effect after
any future Moab restarts/recycles.
trigger:<TRIGID> <ATTR>=<VAL>
where <ATTR> is one of action, atype, etype, iscomplete, oid, otype, offset, or threshold.
l
Description
Modify a system parameter or trigger.
Moab only loads the following list of parameters when first starting up. Therefore, to change any of these, you must edit the setting in moab.cfg and then restart/recycle with mschedctl -R:
* JOBMAXNODECOUNT
* MAXGMETRIC
* MAXGRES
* MAXJOB
* MAXNODE
* MAXRSVPERNODE
* STATPROC*
* STATTIME*
Example
mschedctl -m config LOGLEVEL 9
Change the system loglevel to 9.
mschedctl -m trigger:2 AType=exec,Offset=200,OID=system.1
Change aspects of trigger 2.
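A minimal sketch of the deprecation note above, making a change both dynamic and persistent (LOGLEVEL 6 is illustrative):
mschedctl -m config LOGLEVEL 6
Then manually add LOGLEVEL 6 to moab.cfg so the setting survives any future restart or recycle.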
-p - PAUSE
Description
Disable scheduling but allow the scheduler to update its cluster and workload state information.
Example
mschedctl -p
-q QUERY PENDING ACTIONS
Default
mschedctl -q pactions --xml
Description
Displays pending actions. Only an XML request is valid. Pending actions can be VMs or system jobs.
Example
mschedctl -q pactions --xml
-R - RECYCLE
Description
Recycle scheduler immediately (shut it down and restart it using the original execution environment and command line arguments).
If Moab has been started under systemd, use systemctl restart moab.service
instead of using this option.
Example
mschedctl -R
Recycle scheduler immediately.
To restart Moab with its last known scheduler state, use:
mschedctl -R savestate
-r - RESUME
Format
mschedctl -r [[HH:[MM:]]SS]
Default
0
Description
Resume scheduling in the specified amount of time (or immediately if none is specified).
Example
mschedctl -r
Resume scheduling immediately.
-s - STOP
Format
<INTEGER>
Default
0
Description
Suspend/stop scheduling at specified iteration (or at the end of the current iteration if none is specified). If the letter I follows <ITERATION>, Moab will not process client requests until this iteration
is reached.
Example
mschedctl -s 100I
Stop scheduling at iteration 100 and ignore all client requests until then.
-S - STEP
Format
<INTEGER>
Default
0
Description
Step the specified number of iterations (or to the next iteration if none is specified) and suspend scheduling. If the letter I follows <ITERATION>, Moab will not process client requests until this iteration is reached.
Example
mschedctl -S
Step to the next iteration and stop scheduling.
-W
Description
Perform a manual checkpoint file write.
Example
mschedctl -W
Examples
Example 3-35: Shutting down the Scheduler
mschedctl -k
scheduler will be shutdown immediately
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mshow
Synopsis
mshow [-a] [-q jobqueue=active]
Overview
The mshow command displays various diagnostic messages about the system
and job queues.
Arguments
Flag                 Description
-a                   AVAILABLE RESOURCES
-q [<QUEUENAME>]     Displays the job queues.
Format
AVAILABLE RESOURCES
Format
Can be combined with --flags=[tid|verbose|future] --format=xml and/or -w
Description
Display available resources.
Example
> mshow -a -w user=john --flags=tid --format=xml
Show resources available to john in XML format with a transaction id. See mshow
-a for details.
JOB QUEUE
Format
<QUEUENAME>, where the queue name is one of: active, eligible, or blocked. Job queue names can
be delimited by a comma to display multiple queues. If no job queue name is specified, mshow displays all job queues.
Description
Displays the job queues. If a job queue name is specified, mshow shows only that job queue.
Example
> mshow -q active,blocked
[Displays all jobs in the active and blocked queues]
...
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mshow -a command to show available resources
mshow -a
Synopsis
mshow -a [-i] [-o] [-T] [-w where] [-x] [--xml]
Overview
The mshow -a command allows for querying of available system resources.
Arguments
[-i]  INTERSECTION
[-o]  NO AGGREGATE
[-T]  TIMELOCK
[-w]  WHERE
[-x]  EXCLUSIVE
Table 3-1: Argument Format
--flags
Name
Flags
Format
--flags=[ future | policy | tid | summary | verbose ]
Description
future will return resources available immediately and available in the future.
policy (Deprecated. May be removed in a future release.) will apply charging policies to
determine the total cost of each reported solution (only enabled for XML responses).
summary will assign all jointly allocated transactions as dependencies of the first transaction
reported.
tid will associate a transaction id with the reported results.
verbose will return diagnostic information.
Example
> mshow -a -w user=john --flags=tid --xml
Show resources available to john in XML format with a transaction ID.
--xml
Name
XML
Format
--xml
Description
Report results in XML format.
Example
> mshow -a -w user=john --flags=tid --xml
Show resources available to john in XML format with a
transaction ID.
-i
Name
INTERSECTION
Description
Specifies that an intersection should be performed during an mshow -a command with multiple
requirements.
-o
Name
NO AGGREGATE
Description
Specifies that the results of the command mshow -a with multiple requirements should not be
aggregated together.
-T
Name
TIMELOCK
Description
Specifies that the multiple requirements of an mshow -a command should be timelocked.
Example
> mshow -a -w minprocs=1,os=linux,duration=1:00:00 \
-w minprocs=1,os=aix,duration=10:00 \
--flags=tid,future -x -T
-w
Name
WHERE
Format
Comma delimited list of <ATTR>=<VAL> pairs:
<ATTR>=<VAL> [,<ATTR>=<VAL>]...
If any of the <ATTR>=<VAL> pairs contains a sub-list that is also comma delimited, the
entire -w string must be wrapped in single quotations with the sub-list expression
wrapped in double quotations. See the example below.
Attributes are listed below in table 2.
Description
Add a Where clause to the current command (currently supports up to six co-allocation clauses).
Example
> mshow -a -w minprocs=2,duration=1:00:00 -w nodemem=512,duration=1:00:00
Moab returns a list of all nodes with at least 2 processors and one hour duration or with a
memory of 512 and a duration of one hour.
> mshow -a -w nodefeature=\!vmware:gpfs --flags=future
Moab returns a list of all nodes that do not contain the vmware feature but that do
contain the gpfs feature.
> mshow -a -w 'duration=INFINITY,"excludehostlist=n01,n12,n23"'
Moab returns a list of all nodes with a duration of INFINITY, except for nodes named n01,
n12, and n23.
Note the use of single quotations containing the entire -w string and the use of double
quotations containing the excludehostlist attribute.
-x
Name
EXCLUSIVE
Description
Specifies that the multiple requirements of an mshow -a command should be exclusive (i.e. each
node may only be allocated to a single requirement)
Example
> mshow -a -w minprocs=1,os=linux -w minprocs=1,os=aix --flags=tid -x
Table 3-2: Request Attributes

account
    The account credential of the requestor.
acl
    ACL to attach to the reservation. This ACL must be enclosed in quotation marks. For example:
    $ mshow -a ... -w acl=\"user=john\" ...
arch
    Select only nodes with the specified architecture.
cal
    Select resources subject to the constraints of the specified global calendar.
class
    The class credential of the requestor.
coalloc
    The co-allocation group of the specific Where request (can be any string but must match the co-allocation group of at least one other Where request). The number of tasks requested in each Where request must be equal whether this taskcount is specified via minprocs, mintasks, or gres.
count
    The number of profiles to apply to the resource request.
displaymode
    Possible value is future (example: displaymode=future). Constrains how results are presented; setting future evaluates which resources are available now and which resources will be available in the future that match the requested attributes.
duration
    The duration for which the resources will be required, in the format [[[DD:]HH:]MM:]SS.
excludehostlist
    Do not select any nodes from the given list. The list must be comma delimited.
    > mshow -a -w 'duration=INFINITY,"excludehostlist=n01,n12,n23"'
    Moab returns a list of all nodes with a duration of INFINITY, except for nodes named n01, n12, and n23. Note the use of single quotations to contain the entire -w string, and the use of double quotations containing the excludehostlist attribute.
gres
    Select only nodes which possess the specified generic resource.
group
    The group credential of the requestor.
hostlist
    Select only the specified resources. The list must be comma delimited.
    > mshow -a -w 'duration=INFINITY,"hostlist=n01,n12,n23"'
    Moab returns a list of nodes from the selected hostlist that have a duration of INFINITY. Note the use of single quotations to contain the entire -w string, and the use of double quotations containing the hostlist attribute.
job
    Use the resource, duration, and credential information for the specified job as a resource request template.
jobfeature
    Select only resources which would allow access to jobs with the specified job features.
jobflags
    Select only resources which would allow access to jobs with the specified job flags. The jobflags attribute accepts a colon delimited list of multiple flags.
label
    Associate the specified label with all results matching this request.
minnodes
    Return only results with at least the number of nodes specified. If used with TIDs, return only solutions with exactly minnodes nodes available.
minprocs
    Return only results with at least the number of processors specified. If used with TIDs, return only solutions with exactly minprocs processors available.
mintasks
    FORMAT: <TASKCOUNT>[@<RESTYPE>:<COUNT>[+<RESTYPE>:<COUNT>]...] where <RESTYPE> is one of procs, mem, disk, or swap. Return only results with at least the number of tasks specified. If used with TIDs, return only solutions with exactly mintasks available.
nodedisk
    Select only nodes with at least nodedisk MB of local disk configured.
nodefeature
    Select only nodes with all specified features present, and nodes without all \! specified features, using the format [\!]<feature>[:[\!]<feature>]... You must set the future flag when specifying node features.
nodemem
    Select only nodes with at least nodemem MB of memory configured.
offset
    Select only resources which can be co-allocated with the specified time offset, where offset is specified in the format [[[DD:]HH:]MM:]SS.
os
    Select only nodes which have, or can be provisioned to have, the specified operating system.
partition
    The partition in which the resources must be located.
policylevel
    Enable policy enforcement at the specified policy constraint level.
qos
    The qos credential of the requestor.
rsvprofile
    Use the specified profile if committing a resulting transaction id directly to a reservation.
starttime
    Constrain the timeframe for the returned results by specifying one or more ranges using the format <STIME>[-<ENDTIME>][;<STIME>[-<ENDTIME>]] where each time is specified in absolute, relative, or epoch time format ([HH[:MM[:SS]]][_MO[/DD[/YY]]] or +[[[DD:]HH:]MM:]SS or <EPOCHTIME>). The starttime specified is not the exact time at which the returned range must start, but is rather the earliest possible time the range may start.
taskmem
    Require taskmem MB of memory per task located.
tpn
    Require exactly tpn tasks per node on all discovered resources.
user
    The user credential of the requestor.
var
    Use associated variables in generating per transaction charging quotes.
variables
    Takes a string of the format variables='var[=attr]'[;'var[=attr]'] and passes the variables onto the reservation when used in conjunction with --flags=tid and mrsvctl -c -R <tid>.
vmusage
    Possible value is vmcreate. Moab will find resources for the job assuming it is a vmcreate job, and if os is also specified, Moab will look for a hypervisor capable of running a VM with the requested OS.
Usage Notes
The mshow -a command allows for querying of available system resources.
When combined with the --flags=tid option these available resources can
then be placed into a packaged reservation (using mrsvctl -c -R). This allows
system administrators to grab and reserve available resources for whatever
reason, without conflicting with jobs or reservations that may be holding
certain resources.
There are a few restrictions on which <ATTR> values from the -w command can be placed in the same request: minprocs, minnodes, and gres are all mutually exclusive; only one may be used per -w request.
The allocation of available nodes will follow the global
NODEALLOCATIONPOLICY.
When the '-o' flag is not used, multi-request results will be aggregated. This
aggregation will negate the use of offsets and request-specific starttimes.
The config parameter RESOURCEQUERYDEPTH controls the maximum number
of options that will be returned in response to a resource query.
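A hedged moab.cfg sketch (the depth value is illustrative):
RESOURCEQUERYDEPTH 3
Limit resource availability queries to at most three reported options.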
Examples
Example 3-36: Basic Compute Node Query and Reservation
> mshow -a -w duration=10:00:00,minprocs=1,os=AIX53,jobfeature=shared --flags=tid,future
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
------------  -----  -----  ------------  ------------  --------------
ALL               1      1      10:00:00      00:00:00  13:28:09_04/27  TID=4  ReqID=0
ALL               1      1      10:00:00      10:00:00  17:14:48_04/28  TID=5  ReqID=0
ALL               1      1      10:00:00      20:00:00  21:01:27_04/29  TID=6  ReqID=0

> mrsvctl -c -R 4
Note:  reservation system.2 created
Example 3-37: Mixed Processor and License Query
Select one node with 4 processors and 1 matlab license where the matlab
license is only available for the last hour of the reservation. Also, select 16
additional processors which are available during the same timeframe but which
can be located anywhere in the cluster. Group the resulting transactions
together using transaction dependencies so only the first transaction needs to
be committed to reserve all associated resources.
> mshow -a -i -o -x -w mintasks=1@PROCS:4,duration=10:00:00,coalloc=a \
-w gres=matlab,offset=9:00:00,duration=1:00:00,coalloc=a \
-w minprocs=16,duration=10:00:00 --flags=tid,future,summary
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
------------  -----  -----  ------------  ------------  --------------
ALL               1      1      10:00:00      00:00:00  13:28:09_04/27  TID=4  ReqID=0
ALL               1      1      10:00:00      10:00:00  17:14:48_04/28  TID=5  ReqID=0
ALL               1      1      10:00:00      20:00:00  21:01:27_04/29  TID=6  ReqID=0

> mrsvctl -c -R 4
Note:  reservation system.2 created
Note:  reservation system.3 created
Note:  reservation system.4 created
Example 3-38: Request for Generic Resources
Query for a generic resource on a specific host (no processors, only a generic
resource).
> mshow -a -i -x -o -w gres=dvd,duration=10:00,hostlist=node03 --flags=tid,future
Partition     Tasks  Nodes   StartOffset      Duration       StartDate
------------  -----  -----  ------------  ------------  --------------
ALL               1      1      00:00:00      00:10:00  11:33:25_07/27  TID=16  ReqID=0
ALL               1      1      00:10:00      00:10:00  11:43:25_07/27  TID=17  ReqID=0
ALL               1      1      00:20:00      00:10:00  11:53:25_07/27  TID=18  ReqID=0

> mrsvctl -c -R 16
Note:  reservation system.6 created

> mdiag -r system.6
Diagnosing Reservations
RsvID                      Type Par    StartTime      EndTime     Duration  Node  Task  Proc
-----                      ---- ---    ---------      -------     --------  ----  ----  ----
system.6                   User loc    -00:01:02     00:08:35     00:09:37     1     1     0
    Flags: ISCLOSED
    ACL:   RSV==system.6=
    CL:    RSV==system.6
    Accounting Creds:  User:test
    Task Resources: dvd: 1
    Attributes (HostExp='^node03$')
    Rsv-Group: system.6
Example 3-39: Allocation of Shared Resources
This example walks through a relatively complicated case in which a set of resources can be reserved and then allocated for shared requests. In the example below, the first mshow query looks for resources within an existing shared reservation; this query fails because no such reservation exists yet. The second mshow request asks for resources outside of a shared reservation and finds the desired resources. These resources are then reserved as a shared pool. The third mshow request again asks for resources inside of a shared reservation and this time finds the desired resources.
> mshow -a -w duration=10:00:00,minprocs=1,os=AIX53,jobflags=ADVRES,jobfeature=shared --flags=tid
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
------------  -----  -----  ------------  ------------  --------------

> mshow -a -w duration=100:00:00,minprocs=1,os=AIX53,jobfeature=shared --flags=tid
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
------------  -----  -----  ------------  ------------  --------------
ALL               1      1     100:00:00      00:00:00  13:20:23_04/27  TID=1  ReqID=0

> mrsvctl -c -R 1
Note:  reservation system.1 created

> mshow -a -w duration=10:00:00,minprocs=1,os=AIX53,jobflags=ADVRES,jobfeature=shared --flags=tid
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
------------  -----  -----  ------------  ------------  --------------
ALL               1      1      10:00:00      00:00:00  13:20:36_04/27  TID=2  ReqID=0

> mrsvctl -c -R 2
Note:  reservation system.2 created
Example 3-40: Full Resource Query in XML Format
The following command will report information on all available resources which
meet at least the minimum specified processor and walltime constraints and
which are available to the specified user. The results will be reported in XML to
allow for easy system processing.
> mshow -a -w class=grid,minprocs=8,duration=20:00 --format=xml --flags=future,verbose
<Data>
  <Object>cluster</Object>
  <job User="john" time="1162407604"></job>
  <par Name="template">
    <range duration="Duration" nodecount="Nodes" proccount="Procs" starttime="StartTime"></range>
  </par>
  <par Name="ALL" feasibleNodeCount="131" feasibleTaskCount="163">
    <range duration="1200" hostlist="opt-001:1,opt-024:1,opt-025:1,opt-027:2,opt-041:1,opt-042:1,x86-001:1,P690-001:1,P690-021:1,P690-022:1" index="0" nodecount="10" proccount="8" reqid="0" starttime="1162407604"></range>
    <range duration="1200" hostlist="opt-001:1,opt-024:1,opt-025:1,opt-027:2,opt-039:1,opt-041:1,opt-042:1,x86-001:1,P690-001:1,P690-021:1,P690-022:1" index="0" nodecount="11" proccount="8" reqid="0" starttime="1162411204"></range>
    <range duration="1200" hostlist="opt-001:1,opt-024:1,opt-025:1,opt-027:2,opt-039:1,opt-041:1,opt-042:1,x86-001:1,x86-002:1,x86-004:1,x86-006:1,x86-013:1,x86-014:1,x86-015:1,x86-016:1,x86-037:1,P690-001:1,P690-021:1,P690-022:1" index="0" nodecount="19" proccount="8" reqid="0" starttime="1162425519"></range>
  </par>
  <par Name="SharedMem">
    <range duration="1200" hostlist="P690-001:1,P690-002:1,P690-003:1,P690-004:1,P690-005:1,P690-006:1,P690-007:1,P690-008:1,P690-009:1,P690-010:1,P690-011:1,P690-012:1,P690-013:1,P690-014:1,P690-015:1,P690-016:1,P690-017:1,P690-018:1,P690-019:1,P690-020:1,P690-021:1,P690-022:1,P690-023:1,P690-024:1,P690-025:1,P690-026:1,P690-027:1,P690-028:1,P690-029:1,P690-030:1,P690-031:1,P690-032:1" index="0" nodecount="32" proccount="8" reqid="0" starttime="1163122507"></range>
  </par>
  <par Name="64Bit">
    <range duration="1200" hostlist="opt-001:1,opt-024:1,opt-025:1,opt-027:2,opt-039:1,opt-041:1,opt-042:1" index="0" nodecount="7" proccount="8" reqid="0" starttime="1162411204"></range>
    <range duration="1200" hostlist="opt-001:1,opt-024:1,opt-025:1,opt-027:2,opt-039:1,opt-041:1,opt-042:1,opt-043:1,opt-044:1,opt-045:1,opt-046:1,opt-047:1,opt-048:1,opt-049:1,opt-050:1" index="0" nodecount="15" proccount="8" reqid="0" starttime="1162428996"></range>
    <range duration="1200" hostlist="opt-001:1,opt-006:1,opt-007:2,opt-008:2,opt-009:2,opt-010:2,opt-011:2,opt-012:2,opt-013:2,opt-014:2,opt-015:2,opt-016:2,opt-017:2,opt-018:2,opt-019:2,opt-020:2,opt-021:2,opt-022:2,opt-023:2,opt-024:2,opt-025:1,opt-027:2,opt-039:1,opt-041:1,opt-042:1,opt-043:1,opt-044:1,opt-045:1,opt-046:1,opt-047:1,opt-048:1,opt-049:1,opt-050:1" index="0" nodecount="33" proccount="8" reqid="0" starttime="1162876617"></range>
  </par>
  <par Name="32Bit">
    <range duration="1200" hostlist="x86-001:1,x86-002:1,x86-004:1,x86-006:1,x86-013:1,x86-014:1,x86-015:1,x86-016:1,x86-037:1" index="0" nodecount="9" proccount="8" reqid="0" starttime="1162425519"></range>
    <range duration="1200" hostlist="x86-001:1,x86-002:1,x86-004:1,x86-006:1,x86-013:1,x86-014:1,x86-015:1,x86-016:1,x86-037:1,x86-042:1,x86-043:1" index="0" nodecount="11" proccount="8" reqid="0" starttime="1162956803"></range>
    <range duration="1200" hostlist="x86-001:1,x86-002:1,x86-004:1,x86-006:1,x86-013:1,x86-014:1,x86-015:1,x86-016:1,x86-027:1,x86-028:1,x86-029:1,x86-030:1,x86-037:1,x86-041:1,x86-042:1,x86-043:1,x86-046:1,x86-047:1,x86-048:1,x86-049:1" index="0" nodecount="20" proccount="8" reqid="0" starttime="1163053393"></range>
  </par>
</Data>
This command reports the original query and the timeframe, resource
size, and hostlist associated with each possible time slot.
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mshow in a hosting environment
mshow -a
Basic Current and Future Requests
The mshow command can report information on many aspects of the scheduling
environment. To request information on available resources, the -a flag
should be used. By default, the mshow command resource availability query
only reports resources that are immediately available. To request information
on specific resources, the type of resources required can be specified using the
-w flag as in the following example:
> mshow -a -w taskmem=1500,duration=600
...
To view current and future resource availability, the future flag should be set
as in the following example:
> mshow -a -w taskmem=1500,duration=600 --flags=future
...
Co-allocation Resources Queries
In many cases, a particular request will need simultaneous access to resources
of different types. The mshow command supports a co-allocation request
specified by using multiple -w arguments. For example, to request 16 nodes
with feature fastcpu and 2 nodes with feature fastio, the following request
might be used:
> mshow -a -w minprocs=16,duration=1:00:00,nodefeature=fastcpu -w minprocs=2,nodefeature=fastio,duration=1:00:00 --flags=future
Partition     Procs  Nodes  StartOffset  Duration  StartDate
------------  -----  -----  -----------  --------  --------------
ALL           16     8      00:00:00     1:00:00   13:00:18_08/25  ReqID=0
ALL           2      1      00:00:00     1:00:00   13:00:18_08/25  ReqID=1
The mshow -a documentation contains a list of the different resources that may
be queried as well as examples on using mshow.
Using Transaction IDs
By default, the mshow command reports simply when and where the requested
resources are available. However, when the tid flag is specified, the mshow
command returns both resource availability information and a handle to these
resources called a Transaction ID as in the following example:
> mshow -a -w minprocs=16,nodefeature=fastcpu,duration=2:00:00 --flags=future,tid
Partition     Procs  Nodes  StartOffset  Duration  StartDate
------------  -----  -----  -----------  --------  --------------
ALL           16     16     00:00:00     2:00:00   13:00:18_08/25  TID=26 ReqID=0
In the preceding example, the returned transaction id (TID) may then be used
to reserve the available resources using the mrsvctl -c -R command:
> mrsvctl -c -R 26
reservation system.1 successfully created
Any TID can be printed out using the mschedctl -l trans command:
> mschedctl -l trans 26 TID[26] A1='node01' A2='600' A3='1093465728' A4='ADVRES' A5='fastio'
Where A1 is the hostlist, A2 the duration, A3 the starttime, A4 any flags,
and A5 any features.
Using Reservation Profiles
Reservation profiles (RSVPROFILE) stand as templates against which
reservations can be created. They can contain a hostlist, starttime, endtime,
duration, access-control list, flags, triggers, variables, and most other
attributes of an Administrative Reservation. The following example illustrates
how to create a reservation with the exact same trigger-set.
-----
# moab.cfg
-----
RSVPROFILE[test1] TRIGGER=Sets=$Var1.$Var2.$Var3.!Net,EType=start,AType=exec,Action=/tmp/host/triggers/Net.sh,Timeout=1:00:00
RSVPROFILE[test1] TRIGGER=Requires=$Var1.$Var2.$Var3,Sets=$Var4.$Var5,EType=start,AType=exec,Action=/tmp/host/triggers/FS.sh+$Var1:$Var2:$Var3,Timeout=20:00
RSVPROFILE[test1] TRIGGER=Requires=$Var1.$Var2.$Var3.$Var4.$Var5,Sets=!NOOSinit.OSinit,EType=start,AType=exec,Action=/tmp/host/triggers/OS.sh+$Var1:$Var2:$Var3:$Var4:$Var5
RSVPROFILE[test1] TRIGGER=Requires=NOOSinit,AType=cancel,EType=start
RSVPROFILE[test1] TRIGGER=EType=start,Requires=OSinit,AType=exec,Action=/tmp/host/triggers/success.sh
...
-----
To create a reservation with this profile the mrsvctl -c -P command is used:
> mrsvctl -c -P test1
reservation system.1 successfully created
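In practice, a profile is typically combined with the standard mrsvctl options that select the resources and timeframe. The following is a sketch rather than captured output; the host expression and duration shown are illustrative:

> mrsvctl -c -P test1 -h node01 -d 1:00:00
reservation system.2 successfully created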
Using Reservation Groups
Reservation groups are a way for Moab to tie reservations together. When a
reservation is created using multiple Transaction IDs, these transactions and
their resulting reservations are tied together into one group.
> mrsvctl -c -R 34,35,36
reservation system.99 successfully created
reservation system.100 successfully created
reservation system.101 successfully created
In the preceding example, these three reservations would be tied together into
a single group. The mdiag -r command can be used to see which group a
reservation belongs to. The mrsvctl -q diag -g command can also be used to
print out a specific group of reservations. The mrsvctl -c -g command can also
be used to release a group of reservations.
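For example (a sketch; the group identifier shown is hypothetical):

> mrsvctl -q diag -g system.99
> mrsvctl -r -g system.99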
Related Topics
mshow
msub
Synopsis
msub [-a datetime][-A account][-c interval][-C directive_prefix][-d path]
[-e path][-E][-F][-h][-I][-j join][-k keep][-K][-l resourcelist][-m mailoptions]
[-M user_list][-N name][-o path][-p priority][-P <user>[:<group>]][-q destination][-r]
[-S pathlist][-t jobarrays][-u userlist][-v variablelist][-V]
[-W additionalattributes][-x][-z][--stagein][--stageout][--stageinfile][--stageoutfile]
[--stageinsize][--stageoutsize][--workflowjobids][script]
Overview
msub allows users to submit jobs directly to Moab. When a job is submitted
directly to a resource manager (such as Torque), it is constrained to run on only
those nodes that the resource manager is directly monitoring. In many
instances, a site may be controlling multiple resource managers. When a job is
submitted to Moab rather than to a specific resource manager, it is not
constrained as to what nodes it is executed on. msub can accept command line
arguments (with the same syntax as qsub), job scripts (in either PBS or
LoadLeveler syntax), or the SSS Job XML specification.
Moab must run as the root user in order for msub submissions to work.
Workload submitted via msub while Moab is running as a non-root user fails
immediately.
Submitted jobs can then be viewed and controlled via the mjobctl command.
Flags specified in the following table are not necessarily supported by all
resource managers.
Access
When Moab is configured to run as root, any user may submit jobs via msub.
Flags
-a
Name
Eligible Date
Format
[[[[CC]YY]MM]DD]hhmm[.SS]
-a
Description
Declares the time after which the job is eligible for execution.
Example
> msub -a 12041300 cmd.pbs
Moab will not schedule the job until 1:00 p.m. on December 4 of
the current year.
-A
Name
Account
Format
<ACCOUNT NAME>
Description
Defines the account associated with the job.
Example
> msub -A research cmd.pbs
Moab will associate this job with
account research.
-c
Name
Checkpoint Interval
Format
[n|s|c|c=<minutes>]
Description
Checkpointing of the job will occur at the specified interval.
n — No Checkpoint is to be performed.
s — Checkpointing is to be performed only when the server executing the job is shut down.
c — Checkpoint is to be performed at the default minimum time for the server executing the job.
c=<minutes> — Checkpoint is to be performed at an interval of minutes.
Example
> msub -c c=12 cmd.pbs
The job will be checkpointed every 12 minutes.
-C
Name
Directive Prefix
Format
'<PREFIX NAME>'
Default
First known prefix (#PBS, #@, #BSUB, #!, #MOAB, #MSUB)
Description
Specifies which directive prefix should be used from a job script.
- It is best to submit with single quotes: '#PBS'
- An empty prefix will cause Moab to not search for any prefix: -C ''
- Command line arguments have precedence over script arguments.
- Custom prefixes can be used with the -C flag: -C '#MYPREFIX'
- Custom directive prefixes must use PBS syntax.
- If the -C flag is not given, Moab will take the first default prefix found. Once a directive is
found, others are ignored.
Example
> msub -C '#MYPREFIX' cmd.pbs
#MYPREFIX -l walltime=5:00:00 (in cmd.pbs)
Moab will use the #MYPREFIX directive specified in cmd.pbs, setting the wallclock limit to
five hours.
-d
Name
Initial Working Directory
Format
<path>
Default
Depends on the RM being used. If using Torque, the default is $HOME. If using SLURM, the default
is the submission directory.
Description
Specifies which directory the job should execute in.
Example
> msub -d /home/test/job12 cmd.pbs
The job will begin execution in the /home/test/job12 directory.
-e
Name
Error Path
Format
[<hostname>:]<path>
Default
$SUBMISSIONDIR/$JOBNAME.e$JOBID
Description
Defines the path to be used for the standard error stream of the batch job.
Example
> msub -e test12/stderr.txt
The STDERR stream of the job will be placed in the relative (to execution)
directory specified.
-E
Name
Environment Variables
Description
Moab adds the following variables, if populated, to the job's environment:
- MOAB_ACCOUNT — Account name.
- MOAB_BATCH — Set if a batch job (non-interactive).
- MOAB_CLASS — Class name.
- MOAB_DEPEND — Job dependency string.
- MOAB_GROUP — Group name.
- MOAB_JOBARRAYINDEX — For a job in an array, the index of the job.
- MOAB_JOBARRAYRANGE — For a system with job arrays, the range of all job arrays.
- MOAB_JOBID — Job ID. If submitted from the grid, grid jobid.
- MOAB_JOBNAME — Job name.
- MOAB_MACHINE — Name of the machine (i.e. destination RM) that the job is running on.
- MOAB_NODECOUNT — Number of nodes allocated to the job.
- MOAB_NODELIST — Comma-separated list of nodes (listed singly with no ppn info).
- MOAB_PARTITION — Partition name the job is running in. If a grid job, the cluster scheduler's name.
- MOAB_PROCCOUNT — Number of processors allocated to the job.
- MOAB_QOS — QOS name.
- MOAB_SUBMITDIR — Directory from which the job was submitted.
- MOAB_TASKMAP — Node list with procs per node listed: <nodename>.<procs>
- MOAB_USER — User name.
In SLURM environments, not all variables will be populated since the variables are added at
submission (such as NODELIST). With Torque/PBS, the variables are added just before the job is
started.
This feature only works with SLURM and Torque/PBS.
Example:
> msub -E mySim.cmd
The job mySim will be submitted with extra environment variables.
-F
Name
Script Flags
Format
"<STRING>"
Description
Specifies the flags Torque will pass to the job script at execution time.
The -F flag is only compatible with Torque resource managers.
Example
> msub -F "arg1 arg2" -l nodes=1,walltime=60 files/job.sh
Torque will pass parameters arg1 and arg2 to the job.sh script when
the job executes.
-h
Name
Hold
Description
Specifies that a user hold be applied to the job at submission time.
Example
> msub -h cmd.ll
The job will be submitted with a user hold on it.
-I
Name
Interactive
Description
Declares the job is to be run interactively.
qsub must exist on the same host as msub if the interactive job is destined for a
Torque cluster, because the interactive msub request will be converted to a qsub -I request.
Example
> msub -I job117.sh
The job will be submitted in interactive mode.
-j
Name
Join
Format
[eo|oe|n]
Default
n (not merged)
Description
If eo is specified, the error and output streams are merged into the error stream. If oe is specified,
the error and output streams will be merged into the output stream.
If using either the -e or the -o option and the -j eo|oe option, the -j option takes
precedence and all standard error and output messages go to the chosen output file.
Example
> msub -j oe cmd.sh
STDOUT and STDERR will be merged into one file.
-k
Name
Keep
Format
[e|o|eo|oe|n]
Default
n (not retained)
Description
Defines which (if either) of output and error streams will be retained on the execution host (overrides path for stream).
Example
> msub -k oe myjob.sh
STDOUT and STDERR for the job will be retained on the execution host.
-K
Name
Continue Running
Format
N/A
Description
Tells the client to continue running until the submitted job is completed. The client will query the
status of the job every 5 seconds. The time interval between queries can be specified or disabled
via MSUBQUERYINTERVAL.
Use the -K option sparingly (if at all) as it slows down the Moab scheduler with frequent
queries. For example, running ten jobs with the -K option at the default five-second interval
adds 120 queries per minute for the scheduler.
Example
> msub -K newjob.sh
3
Job 3 completed*
*Only shows up after job completion.
-l
Name
Resource List
Format
<STRING>
-l [BANDWIDTH|DDISK|DEADLINE|DEPEND|DMEM|EXCLUDENODES|FEATURE|...]
Additional options can be referenced on the resource manager extensions page.
Description
Defines the resources that are required by the job and establishes a limit to the amount of
resource that can be consumed. Resources native to the resource manager, scheduler resource
manager extensions, or job flags may be specified. Note that resource lists are dependent on the
resource manager in use.
For information on specifying multiple types of resources for allocation, see Multi-Req Support.
Example
> msub -l nodes=32:ppn=2,pmem=1800mb,walltime=3600,VAR=testvar:myvalue cmd.sh
The job requires 32 nodes with 2 processors each, 1800 MB per task, a walltime of 3600
seconds, and a variable named testvar with a value of myvalue.
If JOBNODEMATCHPOLICY is not set, Moab does not reserve the requested number of
processors on the requested number of nodes. It reserves the total number of requested
processors (nodes x ppn) on any number of nodes. Rather than setting
nodes=<value>:ppn=<value>, set procs=<value>, replacing <value> with the total
number of processors the job requires. Note that JOBNODEMATCHPOLICY is not set by
default.
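For example, the following sketch requests a total processor count instead of a node/ppn pair:
> msub -l procs=64,walltime=3600 cmd.sh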
> msub -l nodes=32:ppn=2 -l advres=!<resvid>
This entry would tell Moab to only consider resources outside of the specified <reservation
id>.
-m
Name
Mail Options
Format
<STRING> (either n or one or more of the characters a, b, and e)
Description
Defines the set of conditions (abort, begin, end) under which the server will send a mail message
about the job to the user.
Example
> msub -m be cmd.sh
Mail notifications will be sent when the job begins and ends.
-M
Name
Mail List
Format
<user>[@<host>][,<user>[@<host>],...]
Default
$JOBOWNER
Description
Specifies the list of users to whom mail is sent by the execution server. Overrides the
EMAILADDRESS specified on the USERCFG credential.
Example
> msub -M jon@node01,bill@node01,jill@node02 cmd.sh
Mail will be sent to the specified users if the job is aborted.
-N
Name
Name
Format
<STRING>
Default
STDIN or name of job script
Description
Specifies the user-specified job name attribute.
Example
> msub -N chemjob3 cmd.sh
Job will be associated with the
name chemjob3.
-o
Name
Output Path
Format
[<hostname>:]<path> - %J and %I are acceptable variables. %J is the master array name and %I is
the array member index in the array.
Default
$SUBMISSIONDIR/$JOBNAME.o$JOBID
Description
Defines the path to be used for the standard output stream of the batch job.
More variables are allowed when they are used in the job script instead of msub -o. In the job
script, specify a #PBS -o line and input your desired variables. The allowable variables are:
- OID
- OTYPE
- USER
- OWNER
- JOBID
- JOBNAME
Submitting a job script that has the line #PBS -o $(USER)_$(JOBID)_$(JOBNAME).txt
results in a file called <username>_<jobID>_<jobName>.txt.
Do not use msub -o when submitting a job script that has a #PBS -o line defined.
Example
> msub -o test12/stdout.txt
The STDOUT stream of the job will be placed in the relative (to execution) directory
specified.
> msub -t 1-2 -o /home/jsmith/simulations/%J-%I.out ~/sim5.sh
A job array is submitted and the name of the output files includes the master array index
and the array member index.
-p
Name
Priority
Format
<INTEGER> (between -1024 and 0)
Default
0
Description
Defines the priority of the job.
To enable priority range from -1024 to +1023, see ENABLEPOSUSERPRIORITY.
Example
> msub -p 25 cmd.sh
The job will have a user priority of 25.
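A sketch of enabling positive user priorities in moab.cfg and then raising a job's priority (see ENABLEPOSUSERPRIORITY for the authoritative syntax):

ENABLEPOSUSERPRIORITY TRUE

> msub -p 100 cmd.sh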
-P
Name
Proxy User
Format
<user>[:<group>]
Description
Allows a root user or manager to submit a job as another user. Moab treats proxy jobs as though
the jobs were submitted by the supplied username.
This option can only be used by users in the ADMINCFG[1] security level.
Example
msub -P user1 cmd.pbs
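The submitting user must be granted that security level in moab.cfg; a minimal sketch (the user names shown are illustrative):

ADMINCFG[1] USERS=root,mgr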
-q
Name
Destination Queue (Class)
Format
[<queue>][@<server>]
Default
[<DEFAULT>]
Description
Defines the destination of the job.
Example
> msub -q priority cmd.sh
The job will be submitted to the
priority queue.
-r
Name
Rerunable
Format
[y|n]
Default
n
Description
Declares whether the job is rerunable.
Example
> msub -r n cmd.sh
The job cannot be rerun.
The default for qsub -r is 'y' (yes), which is
the opposite of msub -r. For better clarity,
use the following instead:
msub -l [flags|jobflags]=restartable
-S
Name
Shell
Format
<path>[@<host>][,<path>[@<host>],...]
Default
$SHELL
Description
Declares the shell that interprets the job script.
Example
> msub -S /bin/bash
The job script will be interpreted by the
/bin/bash shell.
-t
Name
Job Arrays
Format
<name>[<indexlist>]%<limit>
Description
Starts a job array with the jobs in the index list. The limit variable specifies how many jobs may
run at a time. For more information, see Submitting Job Arrays.
Moab enforces an internal limit of 100,000 sub-jobs that a single array job submission can
specify.
Example
> msub -t myarray[1-1000]%4
-u
Name
User List
Format
<user>[@<host>][,<user>[@<host>],...]
Default
UID of msub command
Description
Defines the user name under which the job is to run on the execution system.
Example
> msub -u bill@node01 cmd.sh
On node01 the job will run under Bill's UID, if permitted.
-v
Name
Variable List
Format
<string>[,<string>,...]
Description
Expands the list of environment variables that are exported to the job (taken from the msub command environment).
Example
> msub -v DEBUG cmd.sh
The DEBUG environment variable will be defined for the job.
-V
Name
All Variables
Description
Declares that all environment variables in the msub environment are exported to the batch job.
Example
> msub -V cmd.sh
All environment variables will be exported to the job.
-W
Name
Additional Attributes
Format
<string>
Description
Allows for specification of additional job attributes (See Resource Manager Extension)
Example
> msub -W x=GRES:matlab:1 cmd.sh
The job requires one resource of matlab.
This flag can be used to set a filter for what namespaces will be passed from a job to a trigger
using a comma-delimited list. This limits the trigger's action to objects contained in certain
workflows. For more information, see Requesting Name Space Variables.
> msub -W x="trigns=vc1,vc2"
The job passes namespaces vc1 and vc2 to triggers.
-x
Format
<script> or <command>
Description
When running an interactive job, the -x flag prevents the corresponding script from being
parsed for PBS directives; it is instead treated as a command that is launched once the
interactive job has started. The job terminates at the completion of this command. This option
works only when using Torque.
The -x option for msub differs from qsub in that qsub does not require the script name to
come directly after the flag. The msub command requires a script or command
immediately after the -x declaration.
Example
> msub -I -x ./script.pl
> msub -I -x /tmp/command
-z
Name
Silent Mode
Description
The job's identifier will not be printed to stdout upon submission.
Example
> msub -z cmd.sh
No job identifier will be printed to stdout upon
successful submission.
Staging data
Data staging, or the ability to copy data required for a job from one location to
another or to copy resulting data to a new location (See About Data Staging for
more information), must be specified at job submission. To stage data in, you
would use the msub --stagein and/or --stageinfile option, optionally with
--stageinsize. You would use similar options the same way for staging out:
--stageout, --stageoutfile, and --stageoutsize. --stagein and --stageout,
which you can use multiple times in the same msub command, allow
you to specify a single file or directory to stage in or out. --stageinfile and
--stageoutfile allow you to specify a text file that lists the files to stage in or
out. The --stageinsize and --stageoutsize options allow you to estimate
the total size of the files and directories that you want to stage in or out, which
can help Moab make an intelligent guess about how long it will take to stage the
data in or out, thus ensuring that the job can start as soon as possible after the
staging has occurred.
Staging a file or directory
The --stagein and --stageout options use the same format.
--<stagein|stageout><=| ><source>%<destination>
Where <source> and <destination> take on the following format:
[<user>@]<host>:/<path>[/<fileName>]
Specifying a user and a file name is optional. If you do not specify a file name,
Moab will assume a directory.
> msub ... --stagein=student@biology:/stats/file001%admin@moab:/tmp/staging
<jobScript>
This msub command tells Moab that the job requires file001 from student's stats directory on the
biology server to be staged to admin's staging directory on the moab server prior to the job starting.
You can specify the option multiple times for the same msub command;
however, staging a large number of files is easier with --stageinfile or --stageoutfile.
You can also use #MSUB or #PBS within a job script to specify data staging
options. For example:
#MSUB --stageinsize=1gb
#MSUB --stagein=...
See Sample User Job Script for more information. Note that the data staging
options are not compatible with qsub.
Staging multiple files or directories
The --stageinfile and --stageoutfile options use the same format. You
must include the path to a text file that lists each file to stage in or out on its
own line. Each file specification follows the same format as a --stagein or --stageout specification as described above. The format of the command
options looks like this:
--<stageinfile|stageoutfile><=| ><path>/<fileName>
The file contains multiple lines with the following format:
[<user>@]<host>:/<path>[/<fileName>]%[<user>@]<host>:/<path>
[/<fileName>]
...
Moab ignores blank lines in the file. You can comment out lines by preceding
them with a pound sign (#). The following examples demonstrate what the --stageinfile option looks like on the command line and what the file it
specifies might look like.
> msub ... --stageinfile=/tmp/myStagingFile <jobScript>
/tmp/myStagingFile:
student@biology:/stats/file001%moab:/tmp/staging
student@biology:/stats/file002%moab:/tmp/staging
student@biology:/stats/file003%moab:/tmp/staging
#student@biology:/stats/file004%moab:/tmp/staging
student@biology:/stats/file005%moab:/tmp/staging
student@biology:/stats/file006%moab:/tmp/staging
student@biology:/stats/file007%moab:/tmp/staging
student@biology:/stats/file008%moab:/tmp/staging
student@biology:/stats/file009%moab:/tmp/staging
student@biology:/stats/file010%moab:/tmp/staging
Moab stages in each file listed in myStagingFile to the /tmp/staging directory. Each file resides on the
biology host as the student user. Moab ignores the blank line and the line specifying file004.
Stage in or out file size
The optional --stageinsize and --stageoutsize options give you the
opportunity to estimate the size of the file(s) or directory(-ies) being staged to
aid Moab in choosing an appropriate start time. Both options use the same
format:
--<stageinsize|stageoutsize>=<integer>[unit]
The integer indicates the size of the file(s) and directory(-ies) in megabytes
unless you specify a different unit. Moab accepts the following case-insensitive
suffixes: KB, MB, GB, or TB.
> msub --stageinfile=/stats/file003 --stageinsize=100 <jobScript>
Moab copies the file(s) listed in /stats/file003, totaling approximately 100 megabytes, to the
host where the job will run prior to job start.
> msub --stageinfile=/stats/file002 --stageinsize=1gb <jobScript>
Moab copies all files specified in /stats/file002, which add up to approximately 1 gigabyte, to
the host where the job will run prior to job start.
Return all the job IDs in the workflow at submission time
By default, msub will print the job ID to stdout at the time of submission. If you
want msub to print all of the jobs that are created as part of the workflow
template, you can use the msub --workflowjobids option to show all the job
IDs at submission time:
$ echo sleep 60 | msub -l walltime=15 --workflowjobids
MoabA.3.dsin MoabA.3 MoabA.3.dsout
Job Script
The msub command supports job scripts written in any one of the following
languages:
Language                            Notes
PBS/Torque Job Submission Language  ---
SSS XML Job Object Specification    ---
Low Latency
msub can be configured to return a job ID very quickly by eliminating the
processing of some job attributes, filters, remap classes, job arrays,
templates, workflows, limits and other information when a job is submitted.
This can be done globally by configuring DISPLAYFLAGS USENOBLOCKMSUB or
on the individual job submission by appending "--noblock" to the command
line.
It is recommended that, when using a non-blocking msub,
JOBIDFORMAT be configured (and PROXYJOBSUBMISSION if desired).
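For example (a sketch; the first line enables non-blocking submission globally in moab.cfg, while the second applies it to a single submission):

DISPLAYFLAGS USENOBLOCKMSUB

> echo sleep 60 | msub -l walltime=15 --noblock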
/etc/msubrc
Sites that wish to automatically add parameters to every job submission can
populate the file /etc/msubrc with global parameters that every job
submission will inherit.
For example, if a site wished every job to request a particular generic resource
they could use the following /etc/msubrc:
-W x=GRES:matlab:2
Usage Notes
msub is designed to be as flexible as possible, allowing users accustomed to
PBS, LSF, or LoadLeveler syntax to continue submitting jobs as they normally
would. It is not recommended that different styles be mixed together in the
same msub command.
When only one resource manager is configured inside of Moab, all jobs are
immediately staged to the only resource manager available. However, when
multiple resource managers are configured Moab will determine which
resource manager can run the job soonest. Once this has been determined,
Moab will stage the job to the resource manager.
It is possible to have Moab take a best effort approach at submission time using
the forward flag. When this flag is specified, Moab will do a quick check and
make an intelligent guess as to which resource manager can run the job
soonest and then immediately stage the job.
Moab can be configured to instantly stage a job to the underlying resource
manager (like Torque/LoadLeveler) through the parameter INSTANTSTAGE.
When set inside moab.cfg, Moab will migrate the job instantly to an
appropriate resource manager. Once migrated, Moab will destroy all
knowledge of the job and refresh itself based on the information given to it
from the underlying resource manager.
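A sketch of the corresponding moab.cfg entry, assuming the usual boolean parameter form:

INSTANTSTAGE TRUE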
In most instances Moab can determine what syntax style the job belongs to
(PBS or LoadLeveler); if Moab is unable to make a guess, it will default the style
to whatever resource manager was configured at compile time. If LoadLeveler
and PBS were both compiled then LoadLeveler takes precedence.
Moab can translate a subset of job attributes from one syntax to another. It is
therefore possible to submit a PBS style job to a LoadLeveler resource
manager, and vice versa, though not all job attributes will be translated.
Examples
Example 3-41:
> msub -l nodes=3:ppn=2,walltime=1:00:00,pmem=100kb script2.pbs.cmd
4364.orion
Example 3-42:
This example is the XML-formatted version of the above example. See
Submitting Jobs via msub in XML for more information.
<job>
<InitialWorkingDirectory>/home/user/test/perlAPI
</InitialWorkingDirectory>
<Executable>/home/user/test/perlAPI/script2.pbs.cmd
</Executable>
<SubmitLanguage>PBS</SubmitLanguage>
<Requested>
<Feature>ppn2</Feature>
<Processors>3</Processors>
<WallclockDuration>3600</WallclockDuration>
</Requested>
</job>
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mjobctl command to view, modify, and cancel jobs
checkjob command to view detailed information about the job
mshow command to view all jobs in the queue
MSUBQUERYINTERVAL parameter
SUBMITFILTER parameter
Applying the msub Submit Filter for job script sample
Applying the msub submit filter
When you use msub to submit a job, msub processes the input, converts it to
XML, and sends the job specification XML to the Moab scheduler. You can
create a submission filter to modify the job XML based on the criteria you set
before Moab receives and processes it.
Image 3-1: Job submission process
The filter gives you the ability to customize the submission process, which is
helpful if jobs should have certain defaults assigned to them, if you want to
keep detailed submission statistics, or if you want to change job requests based
on custom needs.
The submit filter is a simple executable or script that receives XML via its
standard input and returns the modified XML in its standard output. It modifies
the attributes of the job specification XML based on policies you specify. It can
perform various other actions at your request, too; for instance, logging. Once
the submit filter has modified the job XML based on your criteria, it writes the
XML representing the actual job submission to stdout. The new XML could
potentially match the original XML, depending on whether the job met the
criteria for modification set in the job submit filter script. Job submissions you
want to proceed will leave the filter with an exit code of 0 and continue to Moab
for scheduling. If the job meets the filter's specified criteria for rejection, it
exits with a non-zero value, aborting the job submission process. You can
configure the filter script to write a descriptive rejection message to stderr.
Job submit filters follow these rejection rules: 1) msub will reject the job if
the filter exits with anything other than zero, 2) the msub command displays the filter's
error output on the command line, 3) msub will reject the job if the filter outputs
invalid job XML, and 4) msub will reject the job if it violates any policies in your
general Moab configuration; you cannot use a submit filter to bypass other
policies.
To see the schema for job submission XML, please refer to Submitting Jobs via
msub in XML.
Submit filter types
You can implement submit filters on either the client or server side of a job
submission. The primary differences between the two submit filter types
are the location from which the filter runs, the powers and privileges of the user
running the filter, and whether a user can bypass the filter. Client-based submit
filters run from the msub client as the user who submits the job and can be
bypassed, and server-based submit filters run from the Moab server as the
user as which the server is running and cannot be bypassed.
Client-based submit filter
Client-based filters run from the msub client as the user who is submitting the
job. Because they do not have elevated privileges, the risk of client-based
submit filters' being abused is low; however, it is possible for the client to
specify its own configuration file and bypass the filter or substitute its own filter.
Job submissions do not even reach the server if a client-based submit filter
rejects them.
To configure msub to use the submit filter, give each submission host access to
the submit filter script and add a SUBMITFILTER parameter to the Moab
configuration file (moab.cfg) on each submission host. The following example
demonstrates how you might modify the moab.cfg file:
SUBMITFILTER /home/submitfilter/filter.pl
If you experience problems with your submit filter and want to debug its
interaction with msub, enter msub --loglevel=9. This will cause msub to print
verbose log messages to the terminal.
Server-based submit filter
Server-based submit filters run from the Moab server as the user as which the
server is running. Because the filter runs as a privileged user, you must evaluate the
script closely for security implications. A client configuration cannot bypass the
filter.
To configure Moab to automatically apply a filter to all job submissions, use the
SERVERSUBMITFILTER parameter. SERVERSUBMITFILTER specifies the path to a
global job submit filter script, which Moab will run on the head node and apply
to every job submitted.
SERVERSUBMITFILTER /opt/moab/scripts/jobFilter.pl
Moab runs jobFilter.pl, located in the /opt/moab/scripts directory, on the head node, applying the
filter to all jobs submitted.
OutputFormat XML Tag
The "OutputFormat" element is used by a job submit filter to alter the output of
the msub command when it reports the submitted job's job id. For example, if
a job submit filter performs a complex procedure on behalf of the user, such as
submitting system jobs for a pre-defined workflow to accomplish some
function, the filter can set this element to a value that permits it to return the
job ids of the system jobs it submitted in addition to the user's job id the msub
command returns (The Moab integration with Cray's SSD-based DataWarp
service does precisely this using a job submit filter).
To illustrate this element's functionality using the Moab/DataWarp integration
example: a DataWarp job submit filter submits a "DataWarp instance
creation/input data staging" script as one system job and a corresponding "output
data staging/DataWarp instance destruction" script as another system job, and
then ties them together with job dependencies in a "DataWarp job workflow".
The workflow causes the user job's execution to depend on the successful completion of
the DataWarp creation/input staging job, and the DataWarp output
staging/DataWarp destruction system job to depend on the user job,
regardless of whether it completes successfully or is cancelled. This
DataWarp three-job workflow guarantees the proper creation and destruction of
job-based DataWarp storage, all set up and accomplished by a job submit
filter.
However, users often create job workflows that have dependencies between
their own jobs and may require the job ids of all jobs to be made available in
order to build a desired job workflow; e.g., "jobB" may require "jobA" to
complete before "jobB" is able to run. For example, if jobA were a DataWarp
job and jobB should not run unless JobA successfully completes, but not until
JobA's output data files are successfully staged, jobB must depend on jobA's
job id as well as jobA's "output data staging/DataWarp instance destruction"
system job's job id. The user can indicate jobB's job dependencies when jobA is
a DataWarp job using the job submission option:
-l depend=afterok:<jobAid>:<jobAoutputSystemJobId>.
The OutputFormat XML tag provides a way for a job submit filter to pass the job
ids of additional jobs it submitted to perform a service on behalf of the user's
job.
For example, you might submit a job and a job submit filter submits two
additional jobs to assist it; the first additional job, "job11", will run before your
job, and the second additional job, "job12", needs to run after your job
finishes. If the job submit filter requires them to output in the order of "pre",
"user", and "post" job ids (which is the same order Moab outputs job ids for
user jobs with input and output data-staging options), it would return the
following OutputFormat element as the user's job id string.
<OutputFormat>moab.11 %s moab.12</OutputFormat>
msub displays the job id string as "Moab.11 Moab.13 Moab.12", with the user's job id (Moab.13 in this example) substituted for the %s.
This means that you can have all three job ids delivered to the end user, or a
job workflow generation script in an easy to read format.
Sample submit filter script
The following example is a trivial implementation that will not affect whether a
job is submitted. Use it as reference to verify that you are writing your filter
properly.
#!/usr/bin/perl
use strict;
## Simple filter example that re-directs the output to a file.
my $file = "xmllog.out";
open FILE, ">>$file" or die "Couldn't open $file: $!";
while (<>)
{
  print FILE;
  print;
}
close FILE;
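The next sketch extends the idea to a rejecting filter, illustrating the exit-code and stderr conventions described earlier; the policy it enforces (blocking interactive jobs) is purely illustrative:

#!/usr/bin/perl
use strict;
## Illustrative filter: reject interactive jobs, pass all other XML through.
my $xml = do { local $/; <> };   # slurp the job XML from stdin
if ($xml =~ m|<Interactive>TRUE</Interactive>|)
{
  print STDERR "interactive jobs are not accepted on this system\n";
  exit 1;                        # non-zero exit rejects the submission
}
print $xml;                      # unmodified XML continues on to Moab
exit 0;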
Submitting Jobs via msub in XML
The following describes the XML format used with the msub command to submit
a job to a Moab server. This information can be used to implement a filter and
modify the XML normally generated by the msub command. The XML format
described in what follows is based on a variant of the Scalable Systems
Software Job Object Specification.
Overall XML Format
The overall format of an XML request to submit a job can be shown through the
following example:
<job>
**job attribute children**
</job>
An example of a simple job element with all the required children for a job
submission is as follows:
<job>
<Owner>user</Owner>
<UserId>user</UserId>
<GroupId>group</GroupId>
<InitialWorkingDirectory>/home/user/directory</InitialWorkingDirectory>
<UMask>18</UMask>
<Executable>/full/path/to/script/or/first/line/of/stdin</Executable>
<SubmitLanguage>Resource Manager Type</SubmitLanguage>
<SubmitString>\START\23!/usr/bin/ruby\0contents\20of\20script</SubmitString>
</job>
The section that follows entitled Job Element Format describes the possible
attributes and their meanings in detail. In actuality, all that is needed to run a
job in Moab is something similar to the following:
<job>
<SubmitString>\START\23!/bin/sh\0asleep\201000</SubmitString>
</job>
This piece of XML requests Moab to submit a job using the contents of the SubmitString tag as a script,
which is in this case a simple sh script to sleep for 1000 seconds. The msub command will create default
values for all other needed attributes.
Job Element Format
The job element of the submission request contains a list of children and string
values inside the children that represent the attribute/value pairs for the job.
The earlier section, Overall XML Format, gives an example of this format. This
section explains these attributes in detail.
Arguments — The arguments to be passed to the program are normally
specified as arguments after the first argument specifying the script to be
executed.
EligibleTime — The minimum time after which the job is eligible. This is the
equivalent of the -a option in msub. Format: [[[[CC]YY]MM]DD]hhmm[.SS]
Environment — The semi-colon list of environment variables that are
exported to the job (taken from the msub command environment). The -V msub
flag, for example, adds all the environment variables present at the time msub
is invoked. Environment variables are delimited by the ~rs; characters.
Following is an example of the results of the msub -v arg1=1,arg2=2
command:
<Environment>arg1=1~rs;arg2=2~rs;</Environment>
ErrorFile — Defines the path to be used for the standard error stream of the
batch job. This is equivalent to the -e flag in msub.
Executable — This is normally either the name of the script to be executed, or
the first line of the script if it is passed to msub through standard input.
Extension — The resource manager extension string. This can be specified via
the command line in a number of ways, including the -W x= directive. Some
other requests, such as some extensions used in the -l flag, are also converted
to an extension string. The element has the following format:
<Extension>x=extension</Extension>
See Using the Extension Element to Submit Triggers for additional information
on the extension element.
GroupId — The string name of the group of the user submitting the job. This
will correspond to the user's primary group on the operating system.
Hold — Specifies that a user hold be applied to the job at submission time. This
is the equivalent to the msub flag -h. It will have the form:
<Hold>User</Hold>
InitialWorkingDirectory — Specifies in which directory the job should begin
executing. This is equivalent to the -d flag in the msub command.
<InitialWorkingDirectory>/home/user/directory</InitialWorkingDirectory>
Interactive — Specifies that the job is to be interactive. This is the equivalent
of the -I flag in msub.
<Interactive>TRUE</Interactive>
JobName — Specifies the user-specified job name attribute. This is equivalent
to the -N flag in msub.
NotificationList — Specifies the job states after which an email should be sent
and also specifies the users to be emailed. This is the equivalent of the -m and -M options in msub.
<NotificationList URI=user1:user2>JobFail,JobStart,JobEnd</NotificationList>
In this example, the command msub -m abe -M user1:user2 ran indicating that emails should be sent
when a job fails, starts, or ends, and that they should be sent to user1 and user2.
OutputFile — Defines the path to be used for the standard output stream of
the batch job. This is the equivalent of the -o flag in msub.
Priority — A user-requested priority value. This is the equivalent of the msub -p flag.
ProjectId — Defines the account associated with the job. This is equivalent to
the -A msub flag.
QueueName — The requested class of the job. This is the equivalent of the
msub -q flag.
Requested — Specifies resources and attributes the job specifically requests
and has the following form:
<Requested>
<... requested attributes>
</Requested>
See the section dedicated to requestable attributes in this element.
RMFlags — Flags that will get passed directly to the resource manager on job
submission. This is equivalent to any arguments listed after the -l msub flag.
<RMFlags>arg1 arg2 arg3</RMFlags>
ShellName — Declares the shell that interprets the job script. This is
equivalent to the msub flag -S.
SubmitLanguage — Resource manager whose language the job is using. Use
Torque to specify a Torque resource manager.
SubmitString — Contains the contents of the script to be run, retrieved either
from an actual script or from standard input. This also includes all resource
manager specific directives that may have been in the script already or added
as a result of other command line arguments.
TaskGroup — Groups a set of requested resources together. It does so by
encapsulating a Requested element. For example, the command msub -l
nodes=2+nodes=3:ppn=2 generates the following XML:
<TaskGroup>
<Requested>
<Processors>2</Processors>
</Requested>
</TaskGroup>
<TaskGroup>
<Requested>
<Processors>6</Processors>
<TPN>2</TPN>
</Requested>
</TaskGroup>
UserId — The string value of the user ID of the job owner. This will correspond
to the user's name on the operating system.
Using the Extension Element to Submit Triggers
Use the Extension element to submit triggers. With the exception of certain
characters, the syntax for trigger creation is the same for non-XML trigger
submission. See About Object Triggers for detailed information on triggers.
The ampersand (&) and less than sign (<) characters must be replaced for the
XML to be valid. The following example shows how the Extension element is
used to submit multiple triggers (separated by a semi-colon). Note that
ampersand characters are replaced with &amp; in the example:
<Job>
<UserId>user1</UserId>
<GroupId>user1</GroupId>
<Arguments>60</Arguments>
<Executable>/bin/sleep</Executable>
<Extension>x=trig:AType=exec&amp;Action="env"&amp;EType=start;trig:AType=exec&amp;Action="trig2.sh"&amp;EType=end</Extension>
<Processors>3</Processors>
<Disk>500</Disk>
<Memory>1024</Memory>
<Swap>600</Swap>
<WallclockDuration>300</WallclockDuration>
<Environment>PERL5LIB=/perl5:</Environment>
</Job>
Elements Found in Requested Element
The following describes the tags that can be found in the Requested subelement of the job element in a job submission request.
Nodes — A list of nodes that the job requests to be run on. This is the
equivalent of the -l hosts=<host-list> msub directive.
<Requested>
<Nodes>
<Node>n1:n2</Node>
</Nodes>
</Requested>
In this example, the user requested the hosts n1 and n2 with the command msub -l host=n1:n2.
Processors — The number of processors requested by the job. The following
example was generated with the command msub -l nodes=5:
<Requested>
<Processors>5</Processors>
</Requested>
TPN — Tasks per node. This is generated using the ppn resource manager
extensions. For example, from msub -l nodes=3:ppn=2, the following
results:
<Requested>
<Processors>6</Processors>
<TPN>2</TPN>
</Requested>
WallclockDuration — The requested wallclock duration of the job. This
attribute is specified in the Requested element.
<Requested>
<WallclockDuration>3600</WallclockDuration>
</Requested>
Related Topics
Applying the msub Submit Filter
SUBMITFILTER parameter
mvcctl (Moab Virtual Container Control)
Synopsis
mvcctl -a <OType>:<OName>[,<OType>:<OName>] <name>
mvcctl -c [<description>]
mvcctl -d <name>
mvcctl -m <ATTR>=<VAL>[,<ATTR>=<VAL>] <name>
mvcctl -q [<name>|ALL] [--xml][--blocking][--flags=fullxml]
mvcctl -r <OType>:<OName>[,<OType>:<OName>] <name>
mvcctl -x <action> <name>
Overview
A virtual container (VC) is a logical grouping of objects with a shared variable
space and applied policies. Containers can hold virtual machines, jobs,
reservations, and nodes. Containers can also be nested inside other
containers.
A VC can be owned by a user, group, or account. Users can only view VCs to
which they have access. Level 1 administrators (Admin1) can view and modify
all VCs. The owner can also be changed. When modifying the owner, you must
also specify the owner type:
mvcctl -m OWNER=acct:bob myvc
Adding objects to VCs at submission: You associate jobs, VMs, and reservations
with a specified VC upon submission. For example,
- mrsvctl -c ... -H <VC>
- msub ... -W x="vc=<VC>"
- mvmctl -c ...,vc=<VC>
The user who submits objects must have access to the VC or the command
is rejected.
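For example, a VC might be created and a job submitted into it as follows (a sketch; VC names are auto-generated, so the name shown is illustrative):

> mvcctl -c "batch workflow"
VC 'vc1' created
> msub -W x="vc=vc1" cmd.sh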
FullXML flag
The FullXML flag will cause the mvcctl -q command to show VCs in a hierarchical
manner. If doing a non-XML (plaintext) query, sub-VCs will be listed inside their
parent VCs. Each VC will be indented more than its parent.
VC[vc2] (vc2)
  Owner: user:jason
  VCs:
    VC[vc1] (vc1)
      Owner: user:jason
      Jobs: Moab.1
      Rsvs: system.1
      VCs:
        VC[vc3] (vc3)
          Owner: user:jason
    VC[vc4] (vc4)
      Owner: user:jason
If doing an XML query, the XML for all sub-objects (VCs, but also reservations,
jobs, etc.) will also be included in the VC.
<Data>
<vc DESCRIPTION="vc2" NAME="vc2" OWNER="user:jason">
<vc DESCRIPTION="vc1" NAME="vc1" OWNER="user:jason">
<job CmdFile="sleep 7200" Flags="GLOBALQUEUE,NORMSTART"
Group="jason" JobID="Moab.1" PAL="[base]" RM="internal"
ReqAWDuration="2:00:00" User="jason">
<req Index="0"></req>
</job>
<rsv ACL="RSV=%=system.1=;" AUser="jason"
AllocNodeList="n0,n1,n2,n3,n4,n5,n6,n7,n8,n9" HostExp="ALL"
HostExpIsSpecified="TRUE" Name="system.1" Partition="base"
ReqNodeList="n0:1,n1:1,n2:1,n3:1,n4:1,n5:1,n6:1,n7:1,n8:1,n9:1"
Resources="PROCS=[ALL]" StatCIPS="5964" SubType="Other"
Type="User" ctime="1299953557" duration="3600"
endtime="1299957157"
flags="ISCLOSED,ISGLOBAL,ISACTIVE,REQFULL"
starttime="1299953557">
<ACL aff="neutral" cmp="%=" name="system.1" type="RSV"></ACL>
<CL aff="neutral" cmp="%=" name="system.1" type="RSV"></CL>
<History>
<event state="PROCS=40" time="1299953557"></event>
</History>
</rsv>
<vc DESCRIPTION="vc3" NAME="vc3" OWNER="user:jason"></vc>
</vc>
<vc DESCRIPTION="vc4" NAME="vc4" OWNER="user:jason"></vc>
</vc>
</Data>
Note that the XML from the blocking and non-blocking commands may differ.
Virtual Container Flags
The following table indicates available virtual container (VC) flags and
associated descriptions. Note that the Deleting, HasStarted, and Workflow
flags cannot be set by a user but are helpful indicators of status.
VC Flags
DestroyObjects
When the VC is destroyed, any reservations, jobs, and VMs in the VC are also destroyed. This is recursive, so any objects in sub-VCs are also destroyed. Nodes are
not removed.
DestroyWhenEmpty
When the VC is empty, it is destroyed.
Deleting
Set by the scheduler when the VC has been instructed to be removed.
Internal flag. Administrators cannot set or clear this flag.
HasStarted
This flag is set on a VC workflow where at least one job has started.
Internal flag. Administrators cannot set or clear this flag.
HoldJobs
This flag places a hold on any job that is submitted to the VC while the flag is
set. It is not applied to already existing jobs that are added into the VC. If a job
with a workflow is submitted to the VC, all jobs within the workflow are placed on
hold.
NoReleaseWhenScheduled
Prevents Moab from lifting the UserHold on the workflow when it is scheduled.
This enables an approval method in which an administrator must release the hold
manually before the service is allowed to start as scheduled.
Workflow
Designates this VC as a VC that is for workflows. This flag is set when generated by
a job template workflow. Workflow jobs can only be attached to one workflow VC.
Internal flag. Administrators cannot set or clear this flag.
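Settable flags are applied with mvcctl -m using the flags+= syntax shown later in this section; for instance (a sketch):

mvcctl -m flags+=HOLDJOBS vc13
>>VC 'vc13' successfully modified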
Format
-a
Format
mvcctl -a <OType>:<OName>[,<OType>:<OName>] <name>
Where <OType> is one of JOB, RSV, NODE, VC, or VM.
Description
Add the given object(s).
Example
mvcctl -a JOB:Moab.45 vc13
>>job 'Moab.45' added to VC 'vc13'
-c
Format
mvcctl -c [<description>]
Description
Create a virtual container (VC). The VC name is auto-generated. It is recommended that you supply a description; otherwise the description is the same as the auto-generated name.
Example
mvcctl -c "Linux testing machine"
>>VC 'vc13' created
-d
Format
mvcctl -d <name>
Description
Destroy the VC.
Example
mvcctl -d vc13
>>VC 'vc13' destroyed
-m
Format
mvcctl -m <ATTR>=<VAL>[,<ATTR>=<VAL>] <name>
Description
Modify the VC. Attributes are flags, owner, reqstarttime, reqnodeset, and variables; note
that only the owner can modify owner. Use reqstarttime when implementing guaranteed start
time to specify when jobs should start. The reqnodeset attribute indicates the node set in
which jobs submitted to the virtual container should run.
Example
mvcctl -m variables+=HV=node8 vc13
>>VC 'vc13' successfully modified
mvcctl -m flags+=DESTROYWHENEMPTY vc1
>>VC 'vc1' successfully modified
-q
Format
mvcctl -q [<name>|ALL] [--xml][--blocking][--flags=fullxml]
Description
Query VCs
Example
mvcctl -q ALL
VC[vc13] (Linux testing machine)
Create Time: 1311027343
Creator: jdoe
Owner: user:jdoe
ACL: USER=%=jdoe+;
Jobs: Moab.45
Vars: HV=node88
Flags: DESTROYWHENEMPTY
-r
Format
mvcctl -r <OType>:<OName>[,<OType>:<OName>] <name>
Where <OType> is one of JOB, RSV, NODE, VC, or VM.
Description
Remove the given object(s) from the VC.
Example
mvcctl -r JOB:Moab.45 vc13
>>job 'Moab.45' removed from VC 'vc13'
-x
Format
mvcctl -x <action> <name>
Description
Executes the given action on the virtual container (VC).
Example
mvcctl -x schedulevc vc1
mvmctl
Synopsis
mvmctl -d [--flags=force] <vmid>
mvmctl -f <migrationPolicy> [--flags=eval [--xml]]
mvmctl -m [<options>] <vmid>
mvmctl -M dsthost=<newhost> <vmid>
mvmctl -q <vmid> [--blocking] [--xml]
mvmctl -w state=drained
Overview
mvmctl controls the modification, querying, migration, and destruction of virtual
machines (VMs).
Format
-d
Name
Destroy
Format
mvmctl -d [--flags=force] <vmid>
Description
Destroys the specified VM. When you add the force flag, Moab forces the deletion of the VM if and
only if it does not have a VM-tracking job.
Example
> mvmctl -d oldVM
> mvmctl -d --flags=force oldVM
Because oldVM does not have a VM-tracking job associated with it and you set the force
flag, Moab forces the deletion of oldVM.
-f
Name
Force Migrate
Format
mvmctl -f consolidation|overcommit [--flags=eval [--xml]]
Description
Forces the migration policy on the system. The eval flag causes Moab to run through migration
routines and report the results without actually migrating the VMs.
Example
> mvmctl -f consolidation --flags=eval
Moab returns a report like the following:
1: VM 'vm1' from 'h0' to 'h3'
2: VM 'vm2' from 'h0' to 'h5'
-m
Name
Modify
Format
mvmctl -m [<options>] <vmid>
The <options> variable is a comma-separated list of <attr>=<value> pairs.
Description
Modifies the VM.
Example
> mvmctl -m gevent=hitemp:'mymessage' myNewVM
Gevents can be set using gevent.
> mvmctl -m gmetric=bob:5.6 myNewVM
Gmetrics can be set using gmetric.
> mvmctl -m os=compute myNewVM
Reprovisioning is done by changing os.
> mvmctl -m powerstate=off myNewVM
Power management is done by modifying powerstate.
> mvmctl -m variable=user:bob+purpose:myVM myNewVM
The modify variable uses the same syntax as Create.
> mvmctl -m flags=cannotmigrate myNewVM
Prevent a VM from migrating by setting the cannotmigrate flag.
> mvmctl -m flags=canmigrate myNewVM
Allows a VM to migrate by setting the canmigrate flag.
Notes
- The variable option is a set-only operation. Previous variables will be overwritten.
-M
Name
Migrate
Format
mvmctl -M dsthost=<newhost> <vmid>
Description
Migrate the given VM to the destination host.
When you set dsthost to ANY, Moab migrates the VM to any available eligible hypervisor. For this
to work, the following conditions must be met:
- The VM reports a CPULOAD, and it is greater than 0.
- The VM's AMEMORY is less than its CMEMORY. This indicates that some memory is currently in use and tells Moab that the RM is reporting memory correctly.
- The VM's state is not "Unknown."
- All hypervisors report a CPULOAD, and it is greater than 0.
- All hypervisors report an AMEMORY, and it is less than its CMEMORY.
- All hypervisors report a hypervisor type.
Example
> mvmctl -M dsthost=node05 myNewVM
myNewVM migrates to node05.
> mvmctl -M dsthost=ANY vm42
Moab migrates vm42 to a node based on policy destination limitations (such as the
NoVMMigrations flag).
-q
Name
Query
Format
mvmctl -q <vmid> [--blocking] [--xml]
Description
Queries the specified VM; that is, it returns detailed information about the given VM. It may be used
with or without the --xml flag. ALL may also be used to display information about all VMs. This
option gathers information from the Moab cache, which prevents it from waiting for the scheduler;
use the --blocking option to bypass the cache and wait for the scheduler.
Example
> mvmctl -q myNewVM
> mvmctl -q ALL --blocking
> mvmctl -q ALL --xml
-w
Name
Constraint
Format
state=drained
Description
Overrides the HIDEDRAINED DISPLAYFLAGS attribute, allowing display of VMs in a DRAINED state.
Example
> mvmctl -q -w state=drained
showbf
Synopsis
showbf [-A] [-a account] [-c class] [-d duration] [-D] [-f features] [-g group] [-L] [-m [==|>|>=|<|<=] memory] [-n nodecount] [-p partition] [-q qos] [-u user] [-v] [--blocking]
Overview
Shows what resources are available for immediate use.
The results Moab returns do not include resources that may be freed due
to preemption.
This command can be used by any user to find out how many processors are
available for immediate use on the system. It is anticipated that users will use
this information to submit jobs that meet these criteria and thus obtain quick
job turnaround times. This command incorporates down time, reservations,
and node state information in determining the available backfill window.
If specific credentials are not specified, showbf returns information for the user and group
running the command, but with global access for other credentials. For example, if -q qos
is not specified, Moab returns resource availability information for a job as if it
were entitled to access all QOS-based resources (i.e., resources covered by reservations
with a QOS-based ACL); if -c class is not specified, the command returns
information for resources accessible by any class.
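Credential flags can also be combined to narrow the query. For instance, a hypothetical request for resources available for one hour to class batch under QoS highprio (both names illustrative) might look like the following:
> showbf -c batch -q highprio -d 1:00:00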
The showbf command incorporates node configuration, node utilization,
node state, and node reservation information into the results it reports.
This command does not incorporate constraints imposed by credential
based fairness policies on the results it reports.
Access
By default, this command can be used by any user or administrator.
Parameters
Parameter
Description
ACCOUNT
Account name.
CLASS
Class/queue required.
DURATION
Time duration specified as the number of seconds or in [DD:]HH:MM:SS notation.
FEATURELIST
Colon separated list of node features required.
GROUP
Specify particular group.
MEMCMP
Memory comparison used with the -m flag. Valid signs are >, >=, ==, <=, and <.
MEMORY
Specifies the amount of required real memory configured on the node (in MB); used with the -m flag.
NODECOUNT
Specify number of nodes for inquiry with -n flag.
PARTITION
Specify partition to check with -p flag.
QOS
Specify QOS to check with -q flag.
USER
Specify particular user to check with -u flag.
Flags
Flag
Description
-A
Show resource availability information for all users, groups, and accounts. By default, showbf uses the
default user, group, and account ID of the user issuing the command.
-a
Show resource availability information only for specified account.
--blocking
Do not use cache information in the output. The --blocking flag retrieves results exclusively from
the scheduler.
-d
Show resource availability information for specified duration.
-D
Display current and future resource availability notation.
-g
Show resource availability information only for specified group.
-h
Help for this command.
-L
Enforce hard limits when showing available resources.
-m
Allows user to specify the memory requirements for the backfill nodes of interest. It is important to
note that if the optional MEMCMP and MEMORY parameters are used, they must be enclosed in
single ticks (') to avoid interpretation by the shell. For example, enter showbf -m '==256' to
request nodes with 256 MB memory.
-n
Show resource availability information for a specified number of nodes. That is, this flag can be used
to force showbf to display only blocks of resources with at least this many nodes available.
-p
Show resource availability information for the specified partition.
-q
Show information for the specified QOS.
-r
Show resource availability for the specified processor count.
-u
Show resource availability information only for specified user.
Examples
Example 3-43:
In this example, a job requiring up to 2 processors could be submitted for
immediate execution in partition ClusterB for any duration. Additionally, a job
requiring 1 processor could be submitted for immediate execution in partition
ClusterA. Note that by default, each task is tracked and reported as a request
for a single processor.
> showbf
Partition   Tasks  Nodes  StartOffset   Duration        StartDate
---------   -----  -----  -----------   --------   --------------
ALL             3      3     00:00:00   INFINITY   11:32:38_08/19
ReqID=0
ClusterA        1      1     00:00:00   INFINITY   11:32:38_08/19
ReqID=0
ClusterB        2      2     00:00:00   INFINITY   11:32:38_08/19
ReqID=0
StartOffset is the amount of time remaining before resources will be
available.
Example 3-44:
In this example, the output verifies that a backfill window exists for jobs
requiring a 3-hour runtime and at least 16 processors. Specifying job duration
is of value when time-based access is assigned to reservations (i.e., using the
SRCFG TIMELIMIT ACL).
> showbf -r 16 -d 3:00:00
Partition   Tasks  Nodes  Duration   StartOffset       StartDate
---------   -----  -----  --------   -----------  --------------
ALL            20     20  INFINITY      00:00:00  09:22:25_07/19
Example 3-45:
In this example, a resource availability window is requested for processors
located only on nodes with at least 512 MB of memory.
> showbf -m '>=512'
Partition   Tasks  Nodes  Duration   StartOffset       StartDate
---------   -----  -----  --------   -----------  --------------
ALL            20     20  INFINITY      00:00:00  09:23:23_07/19
ClusterA       10     10  INFINITY      00:00:00  09:23:23_07/19
ClusterB       10     10  INFINITY      00:00:00  09:23:23_07/19
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
showq
mdiag -t
showq
Synopsis
showq [-b] [-g] [-l] [-c|-i|-r] [-n] [-o] [-p partition] [-R rsvid] [-u] [-v] [-w <CONSTRAINT>] [--blocking] [--noblock]
Overview
Displays information about active, eligible, blocked, and/or recently completed
jobs. Since the resource manager is not actually scheduling jobs, the job
ordering it displays is not valid. The showq command displays the actual job
ordering under the Moab Workload Manager. When used without flags, this
command displays all jobs in active, idle, and non-queued states.
Access
By default, this command can be run by any user. However, the -c, -i, and -r
flags can only be used by level 1, 2, or 3 Moab administrators.
Flags
Flag
Description
-b
Display blocked jobs only
-c
Display details about recently completed jobs (see example, JOBCPURGETIME).
-g
Display grid job and system IDs for all jobs.
-i
Display extended details about idle jobs.
-l
Display local/remote view. For use in a Grid environment, displays job usage of both local and remote
compute resources.
-n
Displays normal showq output, but lists job names under JOBID
-o
Displays jobs in the active queue in the order specified (uses format showq -o <specifiedOrder>). Valid options include REMAINING, REVERSEREMAINING, JOB, USER, STATE, and
STARTTIME. The default is REMAINING.
-p
Display only jobs assigned to the specified partition.
-r
Display extended details about active (running) jobs. (see example)
-R
Display only jobs which overlap the specified reservation.
-u
Display all running jobs for a particular user.
-v
Display local and full resource manager job IDs as well as partitions. If specified with the -i option,
will display job reservation time. The -v option displays all array subjobs. All showq commands
without the -v option show just the master jobs in an array.
-w
Display only jobs associated with the specified constraint. Valid constraints include user, group, acct,
nodefeature, class, and qos (see showq -w example.).
--blocking
Do not use cache information in the output. The --blocking flag retrieves results exclusively from
the scheduler.
--noblock
Use cache information for a faster response.
Details
Beyond job information, the showq command will also report if the scheduler is
stopped or paused or if a system reservation is in place. Further, the showq
command will also report public system messages.
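Flags can be combined with the filtering options. For example, a hypothetical query for detailed information about running jobs restricted to partition ClusterA (partition name illustrative) could be issued as follows:
> showq -r -p ClusterA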
Examples
- Default Report
  - Detailed Active/Running Job Report
  - Eligible Jobs
  - Detailed Completed Job Report
- Filtered Job Report
Example 3-46: Default Report
The output of this command is divided into three parts, Active Jobs, Eligible
Jobs, and Blocked Jobs.
> showq
active jobs------------------------
JOBID       USERNAME    STATE  PROCS   REMAINING            STARTTIME

12941       sartois     Running   25     2:44:11  Thu Sep  1 15:02:50
12954       tgates      Running    4     2:57:33  Thu Sep  1 15:02:52
12944       eval1       Running   16     6:37:31  Thu Sep  1 15:02:50
12946       tgates      Running    2  1:05:57:31  Thu Sep  1 15:02:50

4 active jobs       47 of 48 processors active (97.92%)
                    32 of 32 nodes active      (100.00%)

eligible jobs----------------------
JOBID       USERNAME    STATE  PROCS     WCLIMIT            QUEUETIME

12956       cfosdyke    Idle      32     6:40:00  Thu Sep  1 15:02:50
12969       cfosdyke    Idle       4     6:40:00  Thu Sep  1 15:03:23
12939       eval1       Idle      16     3:00:00  Thu Sep  1 15:02:50
12940       mwillis     Idle       2     3:00:00  Thu Sep  1 15:02:50
12947       mwillis     Idle       2     3:00:00  Thu Sep  1 15:02:50
12949       eval1       Idle       2     3:00:00  Thu Sep  1 15:02:50
12953       tgates      Idle      10     4:26:40  Thu Sep  1 15:02:50
12955       eval1       Idle       2     4:26:40  Thu Sep  1 15:02:50
12957       tgates      Idle      16     3:00:00  Thu Sep  1 15:02:50
12963       eval1       Idle      16  1:06:00:00  Thu Sep  1 15:02:52
12964       tgates      Idle      16  1:00:00:00  Thu Sep  1 15:02:52
12937       allendr     Idle       9  1:00:00:00  Thu Sep  1 15:02:50
12962       aacker      Idle       6    00:26:40  Thu Sep  1 15:02:50
12968       tamaker     Idle       1     4:26:40  Thu Sep  1 15:02:52

14 eligible jobs

blocked jobs-----------------------
JOBID       USERNAME    STATE  PROCS     WCLIMIT            QUEUETIME

0 blocked jobs

Total jobs:  18
The fields are as follows:
Column
Description
JOBID
Job identifier.
USERNAME
User owning job.
STATE
Job State. Current batch state of the job.
PROCS
Number of processors being used by the job.
REMAINING/WCLIMIT
For active jobs, the time the job has until it reaches its wallclock limit; for idle/blocked jobs, the amount of time requested by the job. Time specified in [DD:]HH:MM:SS
notation.
STARTTIME
Time job started running.
Active Jobs
Active jobs are those that are Running or Starting and consuming resources.
Displayed are the job id*, the job's owner, and the job state. Also displayed are
the number of processors allocated to the job, the amount of time remaining
until the job completes (given in HH:MM:SS notation), and the time the job
started. All active jobs are sorted in "Earliest Completion Time First" order.
*Job IDs may be marked with a single character to specify the following
conditions:
Character
Description
_ (underbar)
job violates usage limit
* (asterisk)
job is backfilled AND is preemptible
+ (plus)
job is backfilled AND is NOT preemptible
- (hyphen)
job is NOT backfilled AND is preemptible
Detailed active job information can be obtained using the -r flag.
Eligible Jobs
Eligible Jobs are those that are queued and eligible to be scheduled. They are
all in the Idle job state and do not violate any fairness policies or have any job
holds in place. The jobs in the Idle section display the same information as the
Active Jobs section except that the wallclock CPULIMIT is specified rather than
job time REMAINING, and job QUEUETIME is displayed rather than job
STARTTIME. The jobs in this section are ordered by job priority. Jobs in this
queue are considered eligible for both scheduling and backfilling.
Detailed eligible job information can be obtained using the -i flag.
Blocked Jobs
Blocked jobs are those that are ineligible to be run or queued. Jobs listed here
could be in a number of states for the following reasons:
State
Description
Idle
Job violates a fairness policy. Use diagnose -q for more information.
UserHold
A user hold is in place.
SystemHold
An administrative or system hold is in place.
BatchHold
A scheduler batch hold is in place (used when the job cannot be run because the requested
resources are not available in the system or because the resource manager has repeatedly failed
in attempts to start the job).
Deferred
A scheduler defer hold is in place (a temporary hold used when a job has been unable to start
after a specified number of attempts. This hold is automatically removed after a short period of
time).
NotQueued
Job is in the resource manager state NQ (indicating that the job's controlling scheduling daemon is
unavailable).
A summary of the job queue's status is provided at the end of the output.
Example 3-47: Detailed Active/Running Job Report
> showq -r
active jobs------------------------
JOBID      S PAR EFFIC  XFACTOR Q  USER     GROUP    MHOST    PROCS   REMAINING            STARTTIME

12941      R   3 100.00     1.0 -  sartois  Arches   G5-014      25     2:43:31  Thu Sep  1 15:02:50
12954      R   3 100.00     1.0 Hi tgates   Arches   G5-016       4     2:56:54  Thu Sep  1 15:02:52
12944      R   2 100.00     1.0 De eval1    RedRock  P690-016    16     6:36:51  Thu Sep  1 15:02:50
12946      R   3 100.00     1.0 -  tgates   Arches   G5-001       2  1:05:56:51  Thu Sep  1 15:02:50

4 active jobs       47 of 48 processors active (97.92%)
                    32 of 32 nodes active      (100.00%)

Total jobs:  4
The fields are as follows:
Column
Description
JOBID
Name of active job.
S
Job State. Either R for Running or S for Starting.
PAR
Partition in which job is running.
EFFIC
CPU efficiency of job.
XFACTOR
Current expansion factor of job, where XFactor = (QueueTime + WallClockLimit) / WallClockLimit
Q
Quality Of Service specified for job.
USERNAME
User owning job.
GROUP
Primary group of job owner.
MHOST
Master Host running primary task of job.
PROCS
Number of processors being used by the job.
REMAINING
Time the job has until it has reached its wallclock limit. Time specified in HH:MM:SS notation.
STARTTIME
Time job started running.
After displaying the running jobs, a summary is provided indicating the number
of jobs, the number of allocated processors, and the system utilization.
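As a worked illustration of the XFactor formula above: a job that has waited in the queue for one hour and has a two-hour wallclock limit has XFactor = (1 + 2) / 2 = 1.5. The value starts at 1.0 at submission and grows as the job waits, growing fastest for jobs with short wallclock limits.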
Column
Description
JobName
Name of active job.
S
Job State. Either R for Running or S for Starting.
CCode
Completion Code. The return/completion code given when a job completes. (Only applicable to completed jobs.)
Par
Partition in which job is running.
Effic
CPU efficiency of job.
XFactor
Current expansion factor of job, where XFactor = (QueueTime + WallClockLimit) / WallClockLimit
Q
Quality Of Service specified for job.
User
User owning job.
Group
Primary group of job owner.
Nodes
Number of processors being used by the job.
Remaining
Time the job has until it has reached its wallclock limit. Time specified in HH:MM:SS notation.
StartTime
Time job started running.
> showq -i
eligible jobs----------------------
JOBID    PRIORITY  XFACTOR  Q  USER      GROUP    PROCS     WCLIMIT  CLASS      SYSTEMQUEUETIME

12956*         20      1.0  -  cfosdyke  RedRock     32     6:40:00  batch  Thu Sep  1 15:02:50
12969*         19      1.0  -  cfosdyke  RedRock      4     6:40:00  batch  Thu Sep  1 15:03:23
12939          16      1.0  -  eval1     RedRock     16     3:00:00  batch  Thu Sep  1 15:02:50
12940          16      1.0  -  mwillis   Arches       2     3:00:00  batch  Thu Sep  1 15:02:50
12947          16      1.0  -  mwillis   Arches       2     3:00:00  batch  Thu Sep  1 15:02:50
12949          16      1.0  -  eval1     RedRock      2     3:00:00  batch  Thu Sep  1 15:02:50
12953          16      1.0  -  tgates    Arches      10     4:26:40  batch  Thu Sep  1 15:02:50
12955          16      1.0  -  eval1     RedRock      2     4:26:40  batch  Thu Sep  1 15:02:50
12957          16      1.0  -  tgates    Arches      16     3:00:00  batch  Thu Sep  1 15:02:50
12963          16      1.0  -  eval1     RedRock     16  1:06:00:00  batch  Thu Sep  1 15:02:52
12964          16      1.0  -  tgates    Arches      16  1:00:00:00  batch  Thu Sep  1 15:02:52
12937           1      1.0  -  allendr   RedRock      9  1:00:00:00  batch  Thu Sep  1 15:02:50
12962           1      1.2  -  aacker    RedRock      6    00:26:40  batch  Thu Sep  1 15:02:50
12968           1      1.0  -  tamaker   RedRock      1     4:26:40  batch  Thu Sep  1 15:02:52

14 eligible jobs

Total jobs:  14
The fields are as follows:
Column
Description
JOBID
Name of job.
PRIORITY
Calculated job priority.
XFACTOR
Current expansion factor of job, where XFactor = (QueueTime + WallClockLimit) / WallClockLimit
Q
Quality Of Service specified for job.
USER
User owning job.
GROUP
Primary group of job owner.
PROCS
Minimum number of processors required to run job.
WCLIMIT
Wallclock limit specified for job. Time specified in HH:MM:SS notation.
CLASS
Class requested by job.
SYSTEMQUEUETIME
Time job was admitted into the system queue.
An asterisk at the end of a job (job 12956* in this example) indicates that
the job has a job reservation created for it. The details of this reservation
can be displayed using the checkjob command.
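For instance, the reservation created for job 12956 in the sample output above could be examined with the following:
> checkjob 12956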
Example 3-48: Detailed Completed Job Report
> showq -c
completed jobs---------------------
JOBID      S CCODE PAR EFFIC  XFACTOR Q  USERNAME GROUP    MHOST    PROC    WALLTIME            STARTTIME

13098      C     0 bas 93.17      1.0    sartois  Arches   G5-014     25     2:43:31  Thu Sep  1 15:02:50
13102      C     0 bas 99.55      2.2 Hi tgates   Arches   G5-016      4     2:56:54  Thu Sep  1 15:02:52
13103      C     2 tes 99.30      2.9 De eval1    RedRock  P690-016   16     6:36:51  Thu Sep  1 15:02:50
13115      C     0 tes 97.04      1.0    tgates   Arches   G5-001      2  1:05:56:51  Thu Sep  1 15:02:50

3 completed jobs
The fields are as follows:
Column
Description
JOBID
Job ID of the completed job.
S
Job State. Either C for Completed or V for Vacated.
CCODE
Completion code reported by the job.
PAR
Partition in which job ran.
EFFIC
CPU efficiency of job.
XFACTOR
Expansion factor of job, where XFactor = (QueueTime + WallClockLimit) / WallClockLimit
Q
Quality of Service specified for job.
USERNAME
User owning job.
GROUP
Primary group of job owner.
MHOST
Master Host which ran the primary task of job.
PROCS
Number of processors being used by the job.
WALLTIME
Wallclock time used by the job. Time specified in [DD:]HH:MM:SS notation.
STARTTIME
Time job started running.
After displaying the active jobs, a summary is provided indicating the number
of jobs, the number of allocated processors, and the system utilization.
If the DISPLAYFLAGS parameter is set to ACCOUNTCENTRIC, job group
information will be replaced with job account information.
Example 3-49: Filtered Job Report
Show only jobs associated with user john, class benchmark, and nodefeature
bigmem.
> showq -w class=benchmark -w user=john -w nodefeature=bigmem
...
Job Array
Job arrays show the name of the job array and then, in parentheses, the number
of sub-jobs in the job array that are in the specified state.
> showq
active jobs------------------------
JOBID       USERNAME    STATE    PROCS   REMAINING            STARTTIME

Moab.1(14)  aesplin     Running     14    00:59:41  Fri May 27 14:58:57

14 active jobs      14 of 14 processors in use by local jobs (100.00%)
                    2 of 2 nodes active                      (100.00%)

eligible jobs----------------------
JOBID       USERNAME    STATE    PROCS     WCLIMIT            QUEUETIME

Moab.1(4)   aesplin     Idle         4     1:00:00  Fri May 27 14:58:52

4 eligible jobs

blocked jobs-----------------------
JOBID       USERNAME    STATE    PROCS     WCLIMIT            QUEUETIME

Moab.1(2)   aesplin     Blocked      2     1:00:00  Fri May 27 14:58:52

2 blocked jobs

Total jobs:  20
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
showbf - command to display resource availability.
mdiag -j - command to display detailed job diagnostics.
checkjob - command to check the status of a particular job.
JOBCPURGETIME - parameter to adjust the duration of time Moab preserves information about completed jobs
DISPLAYFLAGS - parameter to control what job information is displayed
showhist.moab.pl
Synopsis
showhist.moab.pl [-a accountname]
[-c classname] [-e enddate]
[-g groupname] [-j jobid] [-n days]
[-q qosname] [-s startdate]
[-u username]
Overview
The showhist.moab.pl script displays historical job information. Its purpose is
similar to the checkjob command's, but showhist.moab.pl displays
information about jobs that have already completed.
Access
By default, this script's use is limited to administrators on the head node;
however, end users can also be given power to run the script. To grant access
to the script to end users, move showhist.moab.pl from the tools directory
to the bin directory.
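A minimal sketch of granting end-user access, assuming Moab is installed under /opt/moab (the path is illustrative and varies by site):
> mv /opt/moab/tools/showhist.moab.pl /opt/moab/bin/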
Arguments
-a (Account)
Format
<ACCOUNTNAME>
Description
Displays job records matching the specified account.
Example
> showhist.moab.pl -a myAccount
Information about jobs related to the account
myAccount is displayed.
-c (Class)
Format
<CLASSNAME>
Description
Displays job records matching the specified class (queue).
Example
> showhist.moab.pl -c newClass
Information about jobs related to the class
newClass is displayed.
-e (End Date)
Format
YYYY-MM-DD
Description
Displays the records of jobs recorded before or on the specified date.
Example
> showhist.moab.pl -e 2001-01-03
Information about all jobs recorded on or before January 3,
2001 is displayed.
> showhist.moab.pl -s 2011-01-01 -e 2011-01-31
Information is displayed about all jobs recorded in January
2011.
-g (Group)
Format
<GROUPNAME>
Description
Displays job records matching the specified group.
Example
> showhist.moab.pl -g admins
Information about jobs related to the group
admins is displayed.
-j (Job ID)
Format
<JOBID>
Description
Displays job records matching the specified job id.
Example
> showhist.moab.pl -j moab01
Information about job moab01 is
displayed.
-n (Number of Days)
Format
<INTEGER>
Description
Restricts the number of past jobs to search by a specified number of days relative to today.
Example
> showhist.moab.pl -n 90 -j moab924
Displays job information for job moab924. The search is restricted to the last 90
days.
-q (QoS)
Format
<QOSNAME>
Description
Displays job records matching the specified quality of service.
Example
> showhist.moab.pl -q myQos
Information about jobs related to the QoS myQos
is displayed.
-s (Start Date)
Format
YYYY-MM-DD
Description
Displays the records of jobs that were recorded on the specified date and later.
Example
> showhist.moab.pl -s 1776-07-04
Information about all jobs recorded on July 4, 1776 and later is
displayed.
> showhist.moab.pl -s 2001-07-05 -e 2002-07-05
Information is displayed about all jobs recorded between July 5, 2001
and July 5, 2002.
-u (User)
Format
<USERNAME>
Description
Displays job records matching the specified user.
Example
> showhist.moab.pl -u bob
Information about user bob's jobs is
displayed.
Sample Output
> showhist.moab.pl
Job Id            : Moab.4
User Name         : user1
Group Name        : company
Queue Name        : NONE
Processor Count   : 4
Wallclock Duration: 00:00:00
Submit Time       : Mon Nov 21 10:48:32 2011
Start Time        : Mon Nov 21 10:49:37 2011
End Time          : Mon Nov 21 10:49:37 2011
Exit Code         : 0
Allocated Nodelist: 10.10.10.3

Job Id            : Moab.1
Executable        : 4
User Name         : user1
Group Name        : company
Account Name      : 1321897709
Queue Name        : NONE
Quality Of Service: 0M
Processor Count   : -0
Wallclock Duration: 00:01:05
Submit Time       : Mon Nov 21 10:48:29 2011
Start Time        : Mon Nov 21 10:48:32 2011
End Time          : Mon Nov 21 10:49:37 2011
Exit Code         : 0
Allocated Nodelist: 512M
Information is displayed for all completed jobs.
When a job's Start Time and End Time are the same, the job is infinite and
still running.
Related Topics
checkjob - explains how to query for a status report for a specified job.
mdiag -j command - display additional detailed information regarding jobs
showq command - display high-level job summaries
showres
Synopsis
showres [-f] [-n [-g]] [-o] [-r] [reservationid]
Overview
This command displays all reservations currently in place within Moab. The
default behavior is to display reservations on a reservation-by-reservation
basis.
Access
By default, this command can be run by any Moab administrator.
Flag
Description
-f
Show free (unreserved) resources rather than reserved resources. The -f flag cannot be used in conjunction with any other flag.
-g
When used with the -n flag, shows grep-able output with nodename on every line
-n
Display information regarding all nodes reserved by <RSVID>
-o
Display all reservations which overlap <RSVID> (in time and space)
Not supported with -n flag
-r
Display reservation timeframes in relative time mode
-v
Show verbose output. If used with the -n flag, the command will display all reservations found on nodes
contained in <RSVID>. Otherwise, it will show long reservation start dates including the reservation year.
Parameter
Description
RSVID
ID of reservation of interest — optional
Examples
Example 3-50:
> showres
ReservationID  Type S       Start         End    Duration     N/P            StartTime

12941          Job  R   -00:05:01     2:41:39     2:46:40   13/25  Thu Sep  1 15:02:50
12944          Job  R   -00:05:01     6:34:59     6:40:00   16/16  Thu Sep  1 15:02:50
12946          Job  R   -00:05:01  1:05:54:59  1:06:00:00     1/2  Thu Sep  1 15:02:50
12954          Job  R   -00:04:59     2:55:01     3:00:00     2/4  Thu Sep  1 15:02:52
12956          Job  I  1:05:54:59  1:12:34:59     6:40:00   16/32  Fri Sep  2 21:02:50
12969          Job  I     6:34:59    13:14:59     6:40:00     4/4  Thu Sep  1 21:42:50

6 reservations located
The above example shows all reservations on the system.
The fields are as follows:
Column
Description
Type
Reservation Type. This will be one of the following: Job or User.
ReservationID
This is the name of the reservation. Job reservation names are identical to the job name. User,
Group, or Account reservations are the user, group, or account name followed by a number. System reservations are given the name SYSTEM followed by a number.
S
State. This field is valid only for job reservations. It indicates whether the job is (S)tarting, (R)unning, or (I)dle.
Start
Relative start time of the reservation. Time is displayed in HH:MM:SS notation and is relative to
the present time.
End
Relative end time of the reservation. Time is displayed in HH:MM:SS notation and is relative to
the present time. Reservations that will not complete in 1,000 hours are marked with the
keyword INFINITY.
Duration
Duration of the reservation in HH:MM:SS notation. Reservations lasting more than 1,000 hours
are marked with the keyword INFINITY.
Nodes
Number of nodes involved in reservation.
StartTime
Time Reservation became active.
Example 3-51:
> showres -n
reservations on Thu Sep  1 16:49:59

NodeName   Type  ReservationID  JobState  Task       Start    Duration            StartTime

G5-001     Job   12946          Running      2    -1:47:09  1:06:00:00  Thu Sep  1 15:02:50
G5-001     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-002     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-002     Job   12953          Running      2   -00:29:37     4:26:40  Thu Sep  1 16:20:22
G5-003     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-003     Job   12953          Running      2   -00:29:37     4:26:40  Thu Sep  1 16:20:22
G5-004     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-004     Job   12953          Running      2   -00:29:37     4:26:40  Thu Sep  1 16:20:22
G5-005     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-005     Job   12953          Running      2   -00:29:37     4:26:40  Thu Sep  1 16:20:22
G5-006     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-006     Job   12953          Running      2   -00:29:37     4:26:40  Thu Sep  1 16:20:22
G5-007     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-007     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-008     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-008     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-009     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-009     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-010     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-010     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-011     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-011     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-012     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-012     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-013     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-013     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-014     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-014     Job   12939          Running      2   -00:29:37     3:00:00  Thu Sep  1 16:20:22
G5-015     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-015     Job   12949          Running      2   -00:08:57     3:00:00  Thu Sep  1 16:41:02
G5-016     Job   12956          Idle         2  1:04:12:51     6:40:00  Fri Sep  2 21:02:50
G5-016     Job   12947          Running      2   -00:08:57     3:00:00  Thu Sep  1 16:41:02
P690-001   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-002   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-003   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-004   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-005   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-006   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-007   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-008   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-009   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-010   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-011   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-012   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-013   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-013   Job   12969          Idle         1     4:52:51     6:40:00  Thu Sep  1 21:42:50
P690-014   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-014   Job   12969          Idle         1     4:52:51     6:40:00  Thu Sep  1 21:42:50
P690-015   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-015   Job   12969          Idle         1     4:52:51     6:40:00  Thu Sep  1 21:42:50
P690-016   Job   12944          Running      1    -1:47:09     6:40:00  Thu Sep  1 15:02:50
P690-016   Job   12969          Idle         1     4:52:51     6:40:00  Thu Sep  1 21:42:50

52 nodes reserved
This example shows reservations for nodes.
The fields are as follows:
Column
Description
NodeName
Node on which reservation is placed.
Type
Reservation Type. This will be one of the following: Job or User.
ReservationID
This is the name of the reservation. Job reservation names are identical to the job name. User,
Group, or Account reservations are the user, group, or account name followed by a number. System reservations are given the name SYSTEM followed by a number.
JobState
This field is valid only for job reservations. It indicates the state of the job associated with the
reservation.
Start
Relative start time of the reservation. Time is displayed in HH:MM:SS notation and is relative to
the present time.
Duration
Duration of the reservation in HH:MM:SS notation. Reservations lasting more than 1000 hours
are marked with the keyword INFINITY.
StartTime
Time Reservation became active.
Example 3-52:
> showres 12956
ReservationID  Type S       Start         End   Duration     N/P            StartTime

12956          Job  I  1:04:09:32  1:10:49:32    6:40:00   16/32  Fri Sep  2 21:02:50

1 reservation located
In this example, information for a specific reservation (job) is displayed.
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mrsvctl -c - create new reservations.
mrsvctl -r - release existing reservations.
mdiag -r - diagnose/view the state of existing reservations.
Reservation Overview - description of reservations and their use.
showstart
Synopsis
showstart {jobid|proccount[@duration]|s3jobspec} [-e {all|hist|prio|rsv}] [-f]
[-g [peer]] [-l qos=<QOS>] [--blocking] [--format=xml] [-v]
Overview
This command displays the estimated start time of a job based on a number of
analysis types. This analysis may include information based on historical usage,
earliest available reservable resources, and priority based backlog analysis.
Each type of analysis will provide somewhat different estimates based on
current cluster environmental conditions. By default, only reservation based
analysis is performed.
The start time estimate Moab returns does not account for resources that
will become available due to preemption.
Historical analysis utilizes historical queue times for jobs which match a similar
processor count and job duration profile. This information is updated on a
sliding window that is configurable within moab.cfg.
Reservation based start time estimation incorporates information regarding
current administrative, user, and job reservations to determine the earliest
time the specified job could allocate the needed resources and start running. In
essence, this estimate will indicate the earliest time the job would start
assuming this job was the highest priority job in the queue.
Priority based job start analysis determines when the queried job would fit in
the queue and determines the estimated amount of time required to complete
the jobs which are currently running or scheduled to run before this job can
start.
In all cases, if the job is running, this command will return the time the job
started. If the job already has a reservation, this command will return the start
time of the reservation.
Access
By default, this command can be run by any user.
Parameters
Parameter
Description
--blocking
Do not use cache information in the output. The --blocking flag retrieves results exclusively
from the scheduler.
DURATION
Duration of pseudo-job to be checked in format [[[DD:]HH:]MM:]SS (default duration is 1 second)
-e
Estimate method. By default, Moab will use the reservation based estimation method.
-f
Use feedback. If specified, Moab will apply historical accuracy information to improve the quality
of the estimate.
-g
Grid mode. Obtain showstart information from remote resource managers. If -g is not used and
Moab determines that the job has already been migrated, Moab obtains showstart information from
the remote Moab to which the job was migrated. All resource managers can be queried by using the
keyword "all", which returns all information in a table.
$ showstart -g all head.1
Estimated Start Times
[ Remote RM ] [ Reservation ] [ Priority ] [ Historical ]
[ c1 ] [ 00:15:35 ] [ ] [ ]
[ c2 ] [ 3:15:38 ] [ ] [ ]
-l qos=<QOS>
Specifies what QOS the job must start under, using the same syntax as the msub command. Currently, no other resource manager extensions are supported. This flag only applies to hypothetical
jobs specified using the proccount[@duration] syntax.
-v
Displays verbose information.
JOBID
Job to be checked
PROCCOUNT
Number of processors in pseudo-job to be checked
S3JOBSPEC
XML describing the job according to the Dept. of Energy Scalable Systems Software/S3 job specification.
Examples
Example 3-53:
> showstart orion.13762
job orion.13762 requires 2 procs for 0:33:20
Estimated Rsv based start in                 1:04:55 on Fri Jul 15 12:53:40
Estimated Rsv based completion in            2:44:55 on Fri Jul 15 14:33:40
Estimated Priority based start in            5:14:55 on Fri Jul 15 17:03:40
Estimated Priority based completion in       6:54:55 on Fri Jul 15 18:43:40
Estimated Historical based start in         00:00:00 on Fri Jul 15 11:48:45
Estimated Historical based completion in     1:40:00 on Fri Jul 15 13:28:45
Best Partition: fast
Example 3-54:
> showstart 12@3600
job 12@3600 requires 12 procs for 1:00:00
Earliest start in         00:01:39 on Wed Aug 31 16:30:45
Earliest completion in     1:01:39 on Wed Aug 31 17:30:45
Best Partition: 32Bit
You cannot specify job flags when running showstart, and since a job by
default can only run on one partition, showstart fails when querying for a
job requiring more nodes than the largest partition available.
Additional Information
For reservation based estimates, the information provided by this command is
more highly accurate if the job is highest priority, if the job has a reservation,
or if the majority of the jobs which are of higher priority have reservations.
Consequently, sites wishing to make decisions based on this information may
want to consider using the RESERVATIONDEPTH parameter to increase the
number of priority based reservations. This can be set so that most or even all
idle jobs receive priority reservations and make the results of this command
generally useful. The only caution with this approach is that increasing the
RESERVATIONDEPTH parameter more tightly constrains the decisions of the
scheduler and may result in slightly lower system utilization (typically less
than an 8% reduction).
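A minimal sketch of such a configuration in moab.cfg (the depth value of 10 here is illustrative only):
RESERVATIONDEPTH 10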
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
checkjob
showres
showstats -f eststarttime
showstats -f avgqtime
Job Start Estimates
showstate
Synopsis
showstate
Overview
This command provides a summary of the state of the system. It displays a list
of all active jobs and a text-based map of the status of all nodes and the jobs
they are servicing. Basic diagnostic tests are also performed and any problems
found are reported.
Access
By default, this command can be run by any Moab Administrator.
Examples
Example 3-55:
> showstate
cluster state summary for Wed Nov 23 12:00:21

     JobID            S  User       Group    Procs   Remaining            StartTime
------------------    -  ---------  -------  -----  ----------  -------------------
(A)  fr17n11.942.0    R  johns      staff       16    13:21:15      Nov 22 12:00:21
(B)  fr17n12.942.0    S  johns      staff       32    13:07:11      Nov 22 12:00:21
(C)  fr17n13.942.0    R  johns      staff        8    11:22:25      Nov 22 12:00:21
(D)  fr17n14.942.0    S  johns      staff        8    10:43:43      Nov 22 12:01:21
(E)  fr17n15.942.0    S  johns      staff        8     9:19:25      Nov 22 12:01:21
(F)  fr17n16.942.0    R  johns      staff        8     9:01:16      Nov 22 12:01:21
(G)  fr17n17.942.0    R  johns      staff        1     7:28:25      Nov 22 12:03:22
(H)  fr17n18.942.0    R  johns      staff        1     3:05:17      Nov 22 12:04:22
(I)  fr17n19.942.0    S  johns      staff       24     0:54:38      Nov 22 12:00:22

Usage Summary: 9 Active Jobs  106 Active Nodes

           [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1]
           [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6]

Frame  2:  XXXXXXXXXXXXXXXXXXXXXXXX[ ][A][C][ ][A][C][C][A]
Frame  3:  [ ][ ][ ][ ][ ][ ][A][ ][I][ ][I][ ][ ][ ][ ][ ]
Frame  4:  [ ][I][ ][ ][ ][A][ ][I][ ][ ][ ][E][ ][I][ ][E]
Frame  5:  [F][ ][E][ ][ ][ ][F][F][F][I][ ][ ][E][ ][E][E]
Frame  6:  [ ][I][I][E][I][ ][I][I][ ][I][F][I][I][I][I][F]
Frame  7:  [ ]XXX[ ]XXX[ ]XXX[ ]XXX[b]XXX[ ]XXX[ ]XXX[#]XXX
Frame  9:  [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][E][ ]
Frame 11:  [ ][ ][ ][ ][ ][ ][I][F][@][ ][A][I][ ][F][ ][A]
Frame 12:  [A][ ][ ][A][ ][ ][C][A][ ][C][A][A][ ][ ][ ][ ]
Frame 13:  [D]XXX[I]XXX[ ]XXX[ ]XXX[ ]XXX[ ]XXX[I]XXX[I]XXX
Frame 14:  [D]XXX[I]XXX[I]XXX[D]XXX[ ]XXX[H]XXX[I]XXX[ ]XXX
Frame 15:  [b]XXX[b]XXX[b]XXX[b]XXX[D]XXX[b]XXX[b]XXX[b]XXX
Frame 16:  [b]XXX[ ]XXX[b]XXX[ ]XXX[b]XXX[b]XXX[ ]XXX[b]XXX
Frame 17:  [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
Frame 21:  [ ]XXX[b]XXX[b]XXX[ ]XXX[b]XXX[b]XXX[b]XXX[b]XXX
Frame 22:  [b]XXX[b]XXX[b]XXX[ ]XXX[b]XXX[ ]XXX[b]XXX[b]XXX
Frame 27:  [b]XXX[b]XXX[ ]XXX[b]XXX[b]XXX[b]XXX[b]XXX[b]XXX
Frame 28:  [G]XXX[ ]XXX[D]XXX[ ]XXX[D]XXX[D]XXX[D]XXX[ ]XXX
Frame 29:  [A][C][A][A][C][ ][A][C]XXXXXXXXXXXXXXXXXXXXXXXX
Key: XXX:Unknown [*]:Down w/Job [#]:Down [']:Idle w/Job [ ]:Idle [@]:Busy w/No Job
[!]:Drained
Key: [a]:(Any lower case letter indicates an idle node that is assigned to a job)
Check Memory on Node fr3n07
Check Memory on Node fr4n06
Check Memory on Node fr4n09
In this example, nine active jobs are running on the system. Each job listed in the top of the output is
associated with a letter. For example, job fr17n11.942.0 is associated with the letter A. This letter can
now be used to determine where the job is currently running. By looking at the system map, it can be
found that job fr17n11.942.0 (job A) is running on nodes fr2n10, fr2n13, fr2n16, fr3n07 ...
The key at the bottom of the system map can be used to determine unusual node states. For example,
fr7n15 is currently in the state down.
After the key, a series of warning messages may be displayed indicating possible system problems. In
this case, the warning messages indicate that there are memory problems on three nodes: fr3n07, fr4n06,
and fr4n09. Warning messages also indicate that job fr15n09.1097.0 is having difficulty starting, and that node
fr11n08 is in state BUSY but has no job assigned to it (it possibly has a runaway job running on it).
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
Specifying Node Rack/Slot Location
showstats
Synopsis
showstats
showstats -a [accountid] [-v] [-t <TIMESPEC>]
showstats -c [classid] [-v] [-t <TIMESPEC>]
showstats -f <statistictype>
showstats -g [groupid] [-v] [-t <TIMESPEC>]
showstats -j [jobtemplate] [-t <TIMESPEC>]
showstats -n [nodeid] [-t <TIMESPEC>]
showstats -q [qosid] [-v] [-t <TIMESPEC>]
showstats -s
showstats -T [leafid | tree-level]
showstats -u [userid] [-v] [-t <TIMESPEC>]
Overview
This command shows various accounting and resource usage statistics for the
system. Historical statistics cover the timeframe from the most recent
execution of the mschedctl -f command.
Access
By default, this command can be run by any Moab level 1, 2, or 3
Administrator.
Parameters
Flag
Description
-a[<ACCOUNTID>]
Display account statistics. See Account statistics for an example.
-c[<CLASSID>]
Display class statistics
-f <statistictype>
Display full matrix statistics (see showstats -f for full details)
-g[<GROUPID>]
Display group statistics. See Group statistics for an example.
-j
[<JOBTEMPLATE>]
Display template statistics
-n[<NODEID>]
Display node statistics (ENABLEPROFILING must be set). See Node statistics for an example.
-q [<QOSID>]
Display QoS statistics
-s
Display general scheduler statistics.
-t
Display statistical information from the specified timeframe:
<START_TIME>[,<END_TIME>]
(ABSTIME: [HH[:MM[:SS]]][_MO[/DD[/YY]]], e.g., 14:30_06/20)
(RELTIME: -[[[DD:]HH:]MM:]SS)
See Statistics from an absolute time frame and Statistics from a relative time frame for
examples.
Profiling must be enabled for the credential type you want statistics for. See
Credential Statistics for information on how to enable profiling. Also, -t is not a
stand-alone option. It must be used in conjunction with the -a, -c, -g, -n, -q, or -u flag.
-T
Display fairshare tree statistics. See Fairshare tree statistics for an example.
-u[<USERID>]
Display user statistics. See User statistics for an example.
-v
Display verbose information. See Verbose statistics for an example.
Commands
Chapter 3 Scheduler Commands
Examples
Example 3-56: Account statistics
> showstats -a
Account Statistics Initialized Tue Aug 26 14:32:39

         |----- Running ------|--------------------------------- Completed ---------------------------------|
Account   Jobs Procs ProcHours  Jobs     %  PHReq     %   PHDed     %  FSTgt  AvgXF  MaxXF AvgQH Effic WCAcc
137651      16    92   1394.52   229 39.15  18486 45.26  7003.5 41.54 40.00   0.77   8.15  5.21 90.70 34.69
462212      11    63    855.27    43  7.35   6028 14.76  3448.4 20.45  6.25   0.71   5.40  3.14 98.64 40.83
462213       6    72    728.12    90 15.38   5974 14.63  3170.7 18.81  6.25   0.37   4.88  0.52 82.01 24.14
005810       3    24    220.72    77 13.16   2537  6.21  1526.6  9.06 -----   1.53  14.81  0.42 98.73 28.40
175436       0     0      0.00    12  2.05   6013 14.72   958.6  5.69  2.50   1.78   8.61  5.60 83.64 17.04
000102       0     0      0.00     1  0.17     64  0.16     5.1  0.03 -----  10.85  10.85 10.77 27.90  7.40
000023       0     0      0.00     1  0.17     12  0.03     0.2  0.00 -----   0.04   0.04  0.19 21.21  1.20
This example shows a statistical listing of all active accounts. The top line (Account Statistics
Initialized...) of the output indicates the beginning of the timeframe covered by the displayed statistics.
The statistical output is divided into two categories, Running and Completed. Running statistics include
information about jobs that are currently running. Completed statistics are compiled using historical
information from both running and completed jobs.
The fields are as follows:
Column
Description
Account
Account Number
Jobs
Number of running jobs
Procs
Number of processors allocated to running jobs
ProcHours
Number of proc-hours required to complete running jobs
Jobs*
Number of jobs completed
%
Percentage of total jobs that were completed by account
PHReq*
Total proc-hours requested by completed jobs
%
Percentage of total proc-hours requested by completed jobs that were requested by account
PHDed
Total proc-hours dedicated to active and completed jobs. The proc-hours dedicated to a job are calculated by multiplying the number of allocated procs by the length of time the procs were allocated, regardless of the job's CPU usage.
%
Percentage of total proc-hours dedicated that were dedicated by account
FSTgt
Fairshare target. An account's fairshare target is specified in the fs.cfg file. This value should be
compared to the account's node-hour dedicated percentage to determine if the target is being met.
AvgXF*
Average expansion factor for jobs completed. A job's XFactor (expansion factor) is calculated by the
following formula: (QueuedTime + RunTime) / WallClockLimit.
MaxXF*
Highest expansion factor received by jobs completed
AvgQH*
Average queue time (in hours) of jobs
Effic
Average job efficiency. Job efficiency is calculated by dividing the actual node-hours of CPU time
used by the job by the node-hours allocated to the job.
WCAcc*
Average wallclock accuracy for jobs completed. Wallclock accuracy is calculated by dividing a job's
actual run time by its specified wallclock limit.
A job's wallclock accuracy is capped at 100% so even if a job exceeds its requested walltime
it will report an accuracy of 100%.
* These fields are empty until an account has completed at least one job.
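For example, under this formula a job that requested a 4:00:00 wallclock limit and actually ran for 3:00:00 reports a wallclock accuracy of 3/4, or 75%.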
Example 3-57: Group statistics
> showstats -g
Group Statistics Initialized Tue Aug 26 14:32:39

              |----- Running ------|--------------------------------- Completed ---------------------------------|
GroupName GID  Jobs Procs ProcHours  Jobs     %  PHReq     %   PHDed     %  FSTgt  AvgXF  MaxXF AvgQH Effic WCAcc
     univ 214    16    92   1394.52   229 39.15  18486 45.26  7003.5 41.54 40.00   0.77   8.15  5.21 90.70 34.69
      daf 204    11    63    855.27    43  7.35   6028 14.76  3448.4 20.45  6.25   0.71   5.40  3.14 98.64 40.83
    dnavy 207     6    72    728.12    90 15.38   5974 14.63  3170.7 18.81  6.25   0.37   4.88  0.52 82.01 24.14
     govt 232     3    24    220.72    77 13.16   2537  6.21  1526.6  9.06 -----   1.53  14.81  0.42 98.73 28.40
      asp 227     0     0      0.00    12  2.05   6013 14.72   958.6  5.69  2.50   1.78   8.61  5.60 83.64 17.04
    derim 229     0     0      0.00    74 12.65    669  1.64   352.5  2.09 -----   0.50   1.93  0.51 96.03 32.60
   dchall 274     0     0      0.00     3  0.51    447  1.10   169.2  1.00 25.00   0.52   0.88  2.49 95.82 33.67
      nih 239     0     0      0.00    17  2.91    170  0.42   148.1  0.88 -----   0.95   1.83  0.14 97.59 84.31
    darmy 205     0     0      0.00    31  5.30    366  0.90    53.9  0.32  6.25   0.14   0.59  0.07 81.33 12.73
  systems  80     0     0      0.00     6  1.03     67  0.16    22.4  0.13 -----   4.07   8.49  1.23 28.68 37.34
      pdc 252     0     0      0.00     1  0.17     64  0.16     5.1  0.03 -----  10.85  10.85 10.77 27.90  7.40
    staff   1     0     0      0.00     1  0.17     12  0.03     0.2  0.00 -----   0.04   0.04  0.19 21.21  1.20
This example shows a statistical listing of all active groups. The top line (Group Statistics Initialized...) of
the output indicates the beginning of the timeframe covered by the displayed statistics.
The statistical output is divided into two categories, Running and Completed. Running statistics include
information about jobs that are currently running. Completed statistics are compiled using historical
information from both running and completed jobs.
The fields are as follows:
Column
Description
GroupName
Name of group.
GID
Group ID of group.
Jobs
Number of running jobs.
Procs
Number of procs allocated to running jobs.
ProcHours
Number of proc hours required to complete running jobs.
Jobs*
Number of jobs completed.
%
Percentage of total jobs that were completed by group.
PHReq*
Total proc-hours requested by completed jobs.
%
Percentage of total proc-hours requested by completed jobs that were requested by group.
PHDed
Total proc-hours dedicated to active and completed jobs. The proc-hours dedicated to a job are calculated by multiplying the number of allocated procs by the length of time the procs were allocated, regardless of the job's CPU usage.
%
Percentage of total proc-hours dedicated that were dedicated by group.
FSTgt
Fairshare target. A group's fairshare target is specified in the fs.cfg file. This value should be
compared to the group's node-hour dedicated percentage to determine if the target is being met.
AvgXF*
Average expansion factor for jobs completed. A job's XFactor (expansion factor) is calculated by
the following formula: (QueuedTime + RunTime) / WallClockLimit.
MaxXF*
Highest expansion factor received by jobs completed.
AvgQH*
Average queue time (in hours) of jobs.
Effic
Average job efficiency. Job efficiency is calculated by dividing the actual node-hours of CPU time
used by the job by the node-hours allocated to the job.
WCAcc*
Average wallclock accuracy for jobs completed. Wallclock accuracy is calculated by dividing a job's
actual run time by its specified wallclock limit.
A job's wallclock accuracy is capped at 100% so even if a job exceeds its requested
walltime it will report an accuracy of 100%.
* These fields are empty until a group has completed at least one job.
Example 3-58: Node statistics
> showstats -n
node stats from Mon Jul 10 00:00:00 to Mon Jul 10 16:30:00

node     CfgMem MinMem MaxMem AvgMem | CfgProcs MinLoad MaxLoad AvgLoad
node01    58368      0  21122   5841 |       32    0.00   32.76   27.62
node02   122880      0  19466    220 |       30    0.00   33.98   29.54
node03    18432      0   9533   2135 |       24    0.00   25.10   18.64
node04    60440      0  17531   4468 |       32    0.00   30.55   24.61
node05    13312      0   2597   1189 |        8    0.00    9.85    8.45
node06    13312      0   3800   1112 |        8    0.00    8.66    5.27
node07    13312      0   2179   1210 |        8    0.00    9.62    8.27
node08    13312      0   3243   1995 |        8    0.00   11.71    8.02
node09    13312      0   2287   1943 |        8    0.00   10.26    7.58
node10    13312      0   2183   1505 |        8    0.00   13.12    9.28
node11    13312      0   3269   2448 |        8    0.00    8.93    6.71
node12    13312      0  10114   6900 |        8    0.00   13.13    8.44
node13    13312      0   2616   2501 |        8    0.00    9.24    8.21
node14    13312      0   3888    869 |        8    0.00    8.10    3.85
node15    13312      0   3788    308 |        8    0.00    8.40    4.67
node16    13312      0   4386   2191 |        7    0.00   18.37    8.36
node17    13312      0   3158   1870 |        8    0.00    8.95    5.91
node18    13312      0   5022   2397 |        8    0.00   19.25    8.19
node19    13312      0   2437   1371 |        8    0.00    8.98    7.09
node20    13312      0   4474   2486 |        8    0.00    8.51    7.11
node21    13312      0   4111   2056 |        8    0.00    8.93    6.68
node22    13312      0   5136   2313 |        8    0.00    8.61    5.75
node23    13312      0   1850   1752 |        8    0.00    8.39    5.71
node24    13312      0   3850   2539 |        8    0.00    8.94    7.80
node25    13312      0   3789   3702 |        8    0.00   21.22   12.83
node26    13312      0   3809   1653 |        8    0.00    9.34    4.91
node27    13312      0   5637     70 |        4    0.00   17.97    2.46
node28    13312      0   3076   2864 |        8    0.00   22.91   10.33
Example 3-59: Verbose statistics
> showstats -v
current scheduler time: Sat Aug 18 18:23:02 2007

moab active for             00:00:01  started on Wed Dec 31 17:00:00
statistics for iteration           0  initialized on Sat Aug 11 23:55:25

Eligible/Idle Jobs:              6/8         (75.000%)
Active Jobs:                      13
Successful/Completed Jobs:       167/167     (100.000%)
Preempt Jobs:                      0
Avg/Max QTime (Hours):          0.34/2.07
Avg/Max XFactor:               1.165/3.26
Avg/Max Bypass:                 0.40/8.00

Dedicated/Total ProcHours:     4.46K/4.47K   (99.789%)
Preempt/Dedicated ProcHours:    0.00/4.46K   (0.000%)

Current Active/Total Procs:     32/32        (100.0%)
Current Active/Total Nodes:     16/16        (100.0%)

Avg WallClock Accuracy:         64.919%
Avg Job Proc Efficiency:        99.683%
Min System Utilization:         87.323% (on iteration 46)
Est/Avg Backlog:                02:14:06/03:02:567
This example shows a concise summary of the system scheduling state. Note that showstats and
showstats -s are equivalent.
The first line of output indicates the number of scheduling iterations performed by the current
scheduling process, followed by the time the scheduler started. The second line indicates the amount of
time the Moab Scheduler has been scheduling in HH:MM:SS notation followed by the statistics
initialization time.
The fields are as follows:
Column
Description
Active Jobs
Number of jobs currently active (Running or Starting).
Eligible Jobs
Number of jobs in the system queue (jobs that are considered when scheduling).
Idle Jobs
Number of jobs both in and out of the system queue that are in the LoadLeveler Idle
state.
Completed Jobs
Number of jobs completed since statistics were initialized.
Successful Jobs
Jobs that completed successfully without abnormal termination.
XFactor
Average expansion factor of all completed jobs.
Max XFactor
Maximum expansion factor of completed jobs.
Max Bypass
Maximum bypass of completed jobs.
Available ProcHours
Total proc-hours available to the scheduler.
Dedicated
ProcHours
Total proc-hours made available to jobs.
Effic
Scheduling efficiency (DedicatedProcHours / Available ProcHours).
Min Efficiency
Minimum scheduling efficiency obtained since scheduler was started.
Iteration
Iteration on which the minimum scheduling efficiency occurred.
Available Procs
Number of procs currently available.
Busy Procs
Number of procs currently busy.
Effic
Current system efficiency (BusyProcs/AvailableProcs).
WallClock Accuracy
Average wallclock accuracy of completed jobs (job-weighted average).
Job Efficiency
Average job efficiency (UtilizedTime / DedicatedTime).
Est Backlog
Estimated backlog of queued work in hours.
Avg Backlog
Average backlog of queued work in hours.
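To tie these fields to the sample output above: the Dedicated/Total ProcHours line reads 4.46K/4.47K, so the scheduling efficiency is 4.46K divided by 4.47K, or roughly 99.8%, matching the (99.789%) shown.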
Example 3-60: User statistics
> showstats -u
User Statistics Initialized Tue Aug 26 14:32:39

              |----- Running ------|--------------------------------- Completed ---------------------------------|
UserName  UID  Jobs Procs ProcHours  Jobs     %  PHReq     %   PHDed     %  FSTgt  AvgXF  MaxXF  AvgQH Effic  WCAcc
 moorejt 2617     1    16     58.80     2  0.34    221  0.54  1896.6 11.25 -----   1.02   1.04   0.14 99.52 100.00
   zhong 1767     3    24    220.72    20  3.42   2306  5.65  1511.3  8.96 -----   0.71   0.96   0.49 99.37  67.48
     lui 2467     0     0      0.00    16  2.74   1970  4.82  1505.1  8.93 -----   1.02   6.33   0.25 98.96  57.72
   evans 3092     0     0      0.00    62 10.60   4960 12.14  1464.3  8.69   5.0   0.62   1.64   5.04 87.64  30.62
  wengel 2430     2    64    824.90     1  0.17    767  1.88   630.3  3.74 -----   0.18   0.18   4.26 99.63   0.40
   mukho 2961     2    16     71.06     6  1.03    776  1.90   563.5  3.34 -----   0.31   0.82   0.20 93.15  30.28
 jimenez 1449     1    16    302.29     2  0.34    768  1.88   458.3  2.72 -----   0.80   0.98   2.31 97.99  70.30
    neff 3194     0     0      0.00    74 12.65    669  1.64   352.5  2.09  10.0   0.50   1.93   0.51 96.03  32.60
  cholik 1303     0     0      0.00     2  0.34    552  1.35   281.9  1.67 -----   1.72   3.07  25.35 99.69  66.70
jshoemak 2508     1    24    572.22     1  0.17    576  1.41   229.1  1.36 -----   0.55   0.55   3.74 99.20  39.20
    kudo 2324     1     8    163.35     6  1.03   1152  2.82   211.1  1.25 -----   0.12   0.34   1.54 96.77   5.67
  xztang 1835     1     8     18.99  ---- ------  ----- ------  176.3  1.05  10.0 ----- ------ ------ 99.62 ------
  feller 1880     0     0      0.00    17  2.91    170  0.42   148.1  0.88 -----   0.95   1.83   0.14 97.59  84.31
   maxia 2936     0     0      0.00     1  0.17    191  0.47   129.1  0.77   7.5   0.88   0.88   4.49 99.84  69.10
ktgnov71 2838     0     0      0.00     1  0.17    192  0.47    95.5  0.57 -----   0.53   0.53   0.34 90.07  51.20
This example shows a statistical listing of all active users. The top line (User Statistics Initialized...) of
the output indicates the timeframe covered by the displayed statistics.
The statistical output is divided into two statistics categories, Running and Completed. Running statistics
include information about jobs that are currently running. Completed statistics are compiled using
historical information from both running and completed jobs.
The fields are as follows:
Column
Description
UserName
Name of user.
UID
User ID of user.
Jobs
Number of running jobs.
Procs
Number of procs allocated to running jobs.
ProcHours
Number of proc-hours required to complete running jobs.
Jobs*
Number of jobs completed.
%
Percentage of total jobs that were completed by user.
PHReq*
Total proc-hours requested by completed jobs.
%
Percentage of total proc-hours requested by completed jobs that were requested by user.
PHDed
Total proc-hours dedicated to active and completed jobs. The proc-hours dedicated to a job are calculated by multiplying the number of allocated procs by the length of time the procs were allocated, regardless of the job's CPU usage.
%
Percentage of total proc-hours dedicated that were dedicated by user.
FSTgt
Fairshare target. A user's fairshare target is specified in the fs.cfg file. This value should be compared to the user's node-hour dedicated percentage to determine if the target is being met.
AvgXF*
Average expansion factor for jobs completed. A job's XFactor (expansion factor) is calculated by the
following formula: (QueuedTime + RunTime) / WallClockLimit.
MaxXF*
Highest expansion factor received by jobs completed.
AvgQH*
Average queue time (in hours) of jobs.
Effic
Average job efficiency. Job efficiency is calculated by dividing the actual node-hours of CPU time
used by the job by the node-hours allocated to the job.
Commands
Chapter 3 Scheduler Commands
Column
Description
WCAcc*
Average wallclock accuracy for jobs completed. Wallclock accuracy is calculated by dividing a job's
actual run time by its specified wallclock limit.
A job's wallclock accuracy is capped at 100% so even if a job exceeds its requested walltime
it will report an accuracy of 100%.
* These fields are empty until a user has completed at least one job.
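As a rough illustration of how the Effic and WCAcc columns are derived, the following Python sketch implements the two field definitions above. It is illustrative only, based on the descriptions in this table rather than Moab source code, and the function names are hypothetical.

    # Sketch of the Effic and WCAcc calculations described above
    # (illustrative only; based on the field definitions, not Moab source).

    def efficiency(cpu_hours_used, dedicated_hours):
        """Effic: utilized CPU time divided by dedicated (allocated) time."""
        return 100.0 * cpu_hours_used / dedicated_hours

    def wallclock_accuracy(actual_runtime, wallclock_limit):
        """WCAcc: actual run time over requested limit, capped at 100%."""
        return min(100.0 * actual_runtime / wallclock_limit, 100.0)

    # A job that used 7.5 CPU-hours of its 8 dedicated node-hours and ran
    # 3 hours against a 4-hour wallclock limit:
    print(efficiency(7.5, 8.0))             # 93.75
    print(wallclock_accuracy(3.0, 4.0))     # 75.0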
Example 3-61: Fairshare tree statistics
> showstats -T
statistics initialized Mon Jul 10 15:29:41

              |-------- Active ---------|---------------------------------- Completed -----------------------------------|
user          Jobs Procs ProcHours  Mem  Jobs      %   PHReq      %   PHDed     %  FSTgt  AvgXF  MaxXF  AvgQH   Effic  WCAcc
root             0     0      0.00    0    56 100.00   2.47K 100.00   1.58K 48.87  -----   1.22   0.00   0.24  100.00  58.84
 l1.1            0     0      0.00    0    25  44.64  845.77  34.31  730.25 22.54  -----   1.97   0.00   0.20  100.00  65.50
  Administrati   0     0      0.00    0    10  17.86  433.57  17.59  197.17  6.09  -----   3.67   0.00   0.25  100.00  62.74
  Engineering    0     0      0.00    0    15  26.79  412.20  16.72  533.08 16.45  -----   0.83   0.00   0.17  100.00  67.35
 l1.2            0     0      0.00    0    31  55.36   1.62K  65.69  853.00 26.33  -----   0.62   0.00   0.27  100.00  53.46
  Shared         0     0      0.00    0     3   5.36   97.17   3.94   44.92  1.39  -----   0.58   0.00   0.56  100.00  31.73
  Test           0     0      0.00    0     3   5.36   14.44   0.59   14.58  0.45  -----   0.43   0.00   0.17  100.00  30.57
  Research       0     0      0.00    0    25  44.64   1.51K  61.16  793.50 24.49  -----   0.65   0.00   0.24  100.00  58.82

> showstats -T 2
statistics initialized Mon Jul 10 15:29:41

              |-------- Active ---------|---------------------------------- Completed -----------------------------------|
user          Jobs Procs ProcHours  Mem  Jobs      %   PHReq      %   PHDed     %  FSTgt  AvgXF  MaxXF  AvgQH   Effic  WCAcc
Test             0     0      0.00    0    22   4.99  271.27   0.55  167.42  0.19  -----   3.86   0.00   2.89  100.00  60.76
Shared           0     0      0.00    0    59  13.38  12.30K  24.75   4.46K  5.16  -----   6.24   0.00  10.73  100.00  49.87
Research         0     0      0.00    0   140  31.75   9.54K  19.19   5.40K  6.25  -----   2.84   0.00   5.52  100.00  57.86
Administrati     0     0      0.00    0    84  19.05   7.94K  15.96   4.24K  4.91  -----   4.77   0.00   0.34  100.00  62.31
Engineering      0     0      0.00    0   136  30.84  19.67K  39.56  28.77K 33.27  -----   3.01   0.00   3.66  100.00  63.70

> showstats -T l1.1
statistics initialized Mon Jul 10 15:29:41

              |-------- Active ---------|---------------------------------- Completed -----------------------------------|
user          Jobs Procs ProcHours  Mem  Jobs      %   PHReq      %   PHDed     %  FSTgt  AvgXF  MaxXF  AvgQH   Effic  WCAcc
l1.1             0     0      0.00    0   220  49.89  27.60K  55.52  33.01K 38.17  -----   3.68   0.00   2.39  100.00  63.17
 Administrati    0     0      0.00    0    84  19.05   7.94K  15.96   4.24K  4.91  -----   4.77   0.00   0.34  100.00  62.31
 Engineering     0     0      0.00    0   136  30.84  19.67K  39.56  28.77K 33.27  -----   3.01   0.00   3.66  100.00  63.70
Example 3-62: Statistics from an absolute time frame
> showstats -c batch -v -t 00:00:01_01/01/13,23:59:59_12/31/13
statistics initialized Wed Jan  1 00:00:00

       |-------- Active ---------|--------------------------------- Completed ----------------------------------|
class   Jobs Procs ProcHours  Mem  Jobs      %  PHReq      %  PHDed      %  FSTgt  AvgXF  MaxXF  AvgQH  Effic  WCAcc
batch      0     0      0.00    0    23 100.00     15 100.00      1 100.00  -----   0.40   5.01   0.00  88.94  39.87
Moab returns information about the class batch from January 1, 2013 to December 31, 2013. For more
information about specifying absolute dates, see "Absolute Time Format" in TIMESPEC.
Example 3-63: Statistics from a relative time frame
> showstats -u bob -v -t -30:00:00:00
statistics initialized Mon Nov 11 15:30:00

      |-------- Active ---------|--------------------------------- Completed ----------------------------------|
user   Jobs Procs ProcHours  Mem  Jobs      %  PHReq      %  PHDed      %  FSTgt  AvgXF  MaxXF  AvgQH  Effic  WCAcc
bob       0     0      0.00    0    23 100.00     15 100.00      1 100.00  -----   0.40   5.01   0.00  88.94  39.87
Moab returns information about user bob from the past 30 days. For more information about specifying
relative dates, see "Relative Time Format" in TIMESPEC.
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mschedctl -f command - re-initialize statistics
showstats -f command - display full matrix statistics
showstats -f
Synopsis
showstats -f <statistictype>
Overview
Shows table of various scheduler statistics.
This command displays a table of the selected Moab Scheduler statistics, such
as expansion factor, bypass count, jobs, proc-hours, wallclock accuracy, and
backfill information.
Statistics are aggregated over time. This means statistical information is not available for specific time frames, and the -t option is not supported with showstats -f.
Access
This command can be run by any Moab Scheduler Administrator.
Parameters

AVGBYPASS      Average bypass count. Includes summary of job-weighted expansion bypass and total samples.
AVGQTIME       Average queue time. Includes summary of job-weighted queue time and total samples.
AVGXFACTOR     Average expansion factor. Includes summary of job-weighted expansion factor, processor-weighted expansion factor, processor-hour-weighted expansion factor, and total number of samples.
BFCOUNT        Number of jobs backfilled. Includes summary of job-weighted backfill job percent and total samples.
BFPHRUN        Number of proc-hours backfilled. Includes summary of job-weighted backfill proc-hour percentage and total samples.
ESTSTARTTIME   Job start time estimate for jobs meeting specified processor/duration criteria. This estimate is based on the reservation start time analysis algorithm.
JOBCOUNT       Number of jobs. Includes summary of total jobs and total samples.
MAXBYPASS      Maximum bypass count. Includes summary of overall maximum bypass and total samples.
MAXXFACTOR     Maximum expansion factor. Includes summary of overall maximum expansion factor and total samples.
PHREQUEST      Proc-hours requested. Includes summary of total proc-hours requested and total samples.
PHRUN          Proc-hours run. Includes summary of total proc-hours run and total samples.
QOSDELIVERED   Quality of service delivered. Includes summary of job-weighted quality of service success rate and total samples.
WCACCURACY     Wallclock accuracy. Includes summary of overall wallclock accuracy and total samples.
Examples
Example 3-64:
> showstats -f AVGXFACTOR
Average XFactor Grid
[ NODES ][ 00:02:00 ][ 00:04:00 ][ 00:08:00 ][ 00:16:00 ][ 00:32:00 ][ 01:04:00 ][ 02:08:00 ][ 04:16:00 ][ 08:32:00 ][ 17:04:00 ][ 34:08:00 ][  TOTAL   ]
[     1 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ]
[     2 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ]
[     4 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ 1.00   1 ][ -------- ][ 1.12   2 ][ -------- ][ -------- ][ 1.10   3 ]
[     8 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ 1.00   2 ][ 1.24   2 ][ -------- ][ -------- ][ -------- ][ 1.15   4 ]
[    16 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ 1.01   2 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ 1.01   2 ]
[    32 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ]
[    64 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ]
[   128 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ]
[   256 ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ]
[ T TOT ][ -------- ][ -------- ][ -------- ][ -------- ][ -------- ][ 1.01   2 ][ 1.00   3 ][ 1.24   2 ][ 1.12   2 ][ -------- ][ -------- ]

Job Weighted X Factor:   1.0888
Node Weighted X Factor:  1.1147
NS Weighted X Factor:    1.1900
Total Samples:           9
The showstats -f command returns a table with data for the specified STATISTICTYPE parameter. The leftmost column shows the maximum number of processors required by the jobs shown in the other
columns. The column headers indicate the maximum wallclock time (in HH:MM:SS notation) requested
by the jobs shown in the columns. The data returned in the table varies by the STATISTICTYPE
requested. For table entries with one number, it is the data requested. For table entries with two
numbers, the left number is the data requested and the right number is the number of jobs used to
calculate the average. Table entries that contain only dashes (-------) indicate no job has completed
that matches the profile associated for this inquiry. The bottom row shows the totals for each column.
Following each table is a summary, which varies by the STATISTICTYPE requested.
The column and row breakdown can be adjusted using the STATPROC* and STATTIME* parameters, respectively.
This particular example shows the average expansion factor grid. Each table entry indicates two pieces
of information — the average expansion factor for all jobs that meet this slot's profile and the number of
jobs that were used to calculate this average. For example, the XFactors of two jobs were averaged to
obtain an average XFactor of 1.24 for jobs requiring over 2 hours 8 minutes, but not more than 4 hours
16 minutes and between 5 and 8 processors. Totals along the bottom provide overall XFactor averages
weighted by job, processors, and processor-hours.
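To make the job-weighted total concrete, the following Python sketch recomputes the job-weighted XFactor from the (average, sample-count) cells shown in the grid. This is a hypothetical illustration, not Moab code, and because the displayed cell averages are rounded, the result only approximates the reported 1.0888.

    # Recompute the job-weighted XFactor from the grid's (avg, count) cells.
    # Cell averages are rounded in the display, so this only approximates
    # the reported value of 1.0888.
    cells = [(1.00, 1), (1.12, 2), (1.00, 2), (1.24, 2), (1.01, 2)]

    total_jobs = sum(count for _, count in cells)
    job_weighted_xf = sum(xf * count for xf, count in cells) / total_jobs
    print(total_jobs)          # 9 (matches "Total Samples")
    print(job_weighted_xf)     # ~1.08, close to the reported 1.0888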
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
mschedctl -f command
showstats command
STATPROCMIN parameter
STATPROCSTEPCOUNT parameter
STATPROCSTEPSIZE parameter
STATTIMEMIN parameter
STATTIMESTEPCOUNT parameter
STATTIMESTEPSIZE parameter
TIMESPEC
Relative Time Format
The relative time format specifies a time by using the current time as a
reference and specifying a time offset.
Format
+[[[DD:]HH:]MM:]SS
Examples
2 days, 3 hours and 57 seconds in the future:
+02:03:0:57
21 days (3 weeks) in the future:
+21:0:0:0
30 seconds in the future:
+30
Absolute Time Format
The absolute time format specifies a specific time in the future.
Format
[HH[:MM[:SS]]][_MO[/DD[/YY]]] (i.e., 14:30_06/20)
Examples
1 PM, March 1 (this year)
13:00_03/01
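As an illustration of how the relative format can be interpreted, the following Python sketch converts a relative timespec to a second offset. It is a hypothetical helper, not Moab's actual parser.

    # Convert a relative timespec "+[[[DD:]HH:]MM:]SS" to seconds.
    # Illustrative sketch only; not Moab's actual parser.
    def relative_to_seconds(spec):
        fields = [int(f) for f in spec.lstrip("+").split(":")]
        multipliers = (1, 60, 3600, 86400)  # SS, MM, HH, DD
        return sum(v * m for v, m in zip(reversed(fields), multipliers))

    print(relative_to_seconds("+02:03:0:57"))   # 183657 (2 days, 3 hours, 57 s)
    print(relative_to_seconds("+21:0:0:0"))     # 1814400 (21 days)
    print(relative_to_seconds("+30"))           # 30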
Deprecated Commands
canceljob
This command is deprecated. Use mjobctl -c instead.
Synopsis
canceljob jobid [jobid]...
Overview
The canceljob command is used to selectively cancel the specified job(s) (active,
idle, or non-queued) from the queue.
Access
This command can be run by any Moab Administrator and by the owner of the
job (see ADMINCFG).
-h (HELP)
    Default:     N/A
    Description: Display usage information
    Example:     > canceljob -h

JOB ID
    Format:      <STRING>
    Default:     ---
    Description: a jobid, a job expression, or the keyword ALL
    Example:     > canceljob 13001 13003
Examples
Example 3-65: Cancel job 6397
> canceljob 6397
changeparam
This command is deprecated. Use mschedctl -m instead.
Synopsis
changeparam parameter value
Overview
The changeparam command is used to dynamically change the value of any
parameter which can be specified in the moab.cfg file. The changes take effect
at the beginning of the next scheduling iteration. They are not persistent, only
lasting until Moab is shut down.
changeparam is a compact form of mschedctl -m.
Access
This command can be run by a level 1 Moab administrator.
diagnose
This command is deprecated. Use mdiag instead.
Synopsis
diagnose -a [accountid]
diagnose -b [-l policylevel] [-t partition]
diagnose -c [classid]
diagnose -C [configfile]
diagnose -f [-o user|group|account|qos|class]
diagnose -g [groupid]
diagnose -j [jobid]
diagnose -L
diagnose -m [rackid]
diagnose -n [-t partition] [nodeid]
diagnose -p [-t partition]
diagnose -q [qosid]
diagnose -r [reservationid]
diagnose -R [resourcemanagername]
diagnose -s [standingreservationid]
diagnose -S
diagnose -u [userid]
diagnose -v
diagnose -x
Overview
The diagnose command is used to display information about various aspects of
scheduling and the results of internal diagnostic tests.
releasehold
This command is deprecated. Use mjobctl -u instead.
Synopsis
releasehold [-a|-b] jobexp
Overview
Release hold on specified job(s).
This command allows you to release batch holds or all holds (system, user, and
batch) on specified jobs. Any number of jobs may be released with this
command.
Access
By default, this command can be run by any Moab Scheduler Administrator.
Parameters
JOBEXP   Job expression of job(s) to release.

Flags

-a   Release all types of holds (user, system, batch) for specified job(s).
-b   Release batch hold from specified job(s).
-h   Help for this command.
Examples
Example 3-66: releasehold -b
> releasehold -b 6443
batch hold released for job 6443
In this example, a batch hold was released from this one job.
Example 3-67: releasehold -a
> releasehold -a "81[1-6]"
holds modified for job 811
holds modified for job 812
holds modified for job 813
holds modified for job 814
holds modified for job 815
holds modified for job 816
In this example, all holds were released from the specified jobs.
Related Topics
sethold
mjobctl
releaseres
This command is deprecated. Use mrsvctl -r instead.
Synopsis
releaseres [arguments] reservationid [reservationid...]
Overview
Release existing reservation.
This command allows Moab Scheduler Administrators to release any user,
group, account, job, or system reservation. Users are allowed to release
reservations on jobs they own. Note that releasing a reservation on an active
job has no effect since the reservation will be automatically recreated.
Access
Users can use this command to release any reservation they own. Level 1 and
level 2 Moab administrators may use this command to release any reservation.
Parameters
RESERVATION ID
Name of reservation to release.
Examples
Example 3-68: Release two existing reservations
> releaseres system.1 bob.2
released User reservation 'system.1'
released User reservation 'bob.2'
resetstats
This command is deprecated. Use mschedctl -f instead.
Synopsis
resetstats
Overview
This command resets all internally-stored Moab Scheduler statistics to the initial
start-up state as of the time the command was executed.
Access
By default, this command can be run by level 1 scheduler administrators.
Examples
Example 3-69:
> resetstats
Statistics Reset at time Wed Feb 25 23:24:55 2011
Related Topics
1.1.7 (Optional) Install Moab Client - explains how to distribute this command to client nodes
runjob
This command is deprecated. Use mjobctl -x instead.
Synopsis
runjob [-c|-f|-n nodelist|-p partition|-s|-x] jobid
Overview
This command will attempt to immediately start the specified job.
runjob is a deprecated command, replaced by mjobctl.
Access
By default, this command can be run by any Moab administrator.
Parameters
JOBID   Name of the job to run.

Args             Description
-c               Clear job parameters from previous runs (used to clear PBS neednodes attribute after PBS job launch failure)
-f               Attempt to force the job to run, ignoring throttling policies
-n <NODELIST>    Attempt to start the job using the specified nodelist where nodenames are comma or colon delimited
-p <PARTITION>   Attempt to start the job in the specified partition
-s               Attempt to suspend the job
-x               Attempt to force the job to run, ignoring throttling policies, QoS constraints, and reservations
Examples
Example 3-70: Run job cluster.231
> runjob cluster.231
job cluster.231 successfully started
See Also
mjobctl
canceljob - cancel a job.
checkjob - show detailed status of a job.
showq - list queued jobs.
sethold
This command is deprecated. Use mjobctl -h instead.
Synopsis
sethold [-b] jobid [jobid...]
Overview
Set hold on specified job(s).
Permissions
This command can be run by any Moab Scheduler Administrator.
Parameters
JOB
Job number of job to hold.
Flags
-b   Set a batch hold. Typically, only the scheduler places batch holds. This flag allows an administrator to manually set a batch hold.
-h   Help for this command.
Examples
Example 3-71:
> sethold -b fr17n02.1072.0 fr15n03.1017.0
Batch Hold Placed on All Specified Jobs
In this example, a batch hold is placed on job fr17n02.1072.0 and job fr15n03.1017.0.
setqos
This command is deprecated. Use mjobctl -m instead.
Synopsis
setqos qosid jobid
Overview
Set Quality Of Service for a specified job.
This command allows users to change the QOS of their own jobs.
Access
This command can be run by any user.
Parameters
JOBID   Job name.
QOSID   QOS name.
Examples
Example 3-72:
> setqos high_priority moab.3
Job QOS Adjusted
This example sets the Quality Of Service to a value of high_priority for job moab.3.
setres
This command is deprecated. Use mrsvctl -c instead.
Synopsis
setres [arguments] resourceexpression
[ -a <ACCOUNT_LIST> ]
[ -b <SUBTYPE> ]
[ -c <CHARGE_SPEC> ]
[ -d <DURATION> ]
[ -e <ENDTIME> ]
[ -E ] // EXCLUSIVE
[ -f <FEATURE_LIST> ]
[ -g <GROUP_LIST> ]
[ -n <NAME> ]
[ -o <OWNER> ]
[ -p <PARTITION> ]
[ -q <QUEUE_LIST> ] // (i.e. CLASS_LIST)
[ -Q <QOSLIST> ]
[ -r <RESOURCE_DESCRIPTION> ]
[ -R <RESERVATION_PROFILE> ]
[ -s <STARTTIME> ]
[ -T <TRIGGER> ]
[ -u <USER_LIST> ]
[ -x <FLAGS> ]
Overview
Reserve resources for use by jobs with particular credentials or attributes.
Access
This command can be run by level 1 and level 2 Moab administrators.
Parameters
ACCOUNT_LIST
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of accounts that will be allowed access to the reserved resources.

SUBTYPE
    Format:  <STRING>
    Default: ---
    Specify the subtype for a reservation.

CHARGE_SPEC
    Format:  <ACCOUNT>[,<GROUP>[,<USER>]]
    Default: ---
    Specifies which credentials will be accountable for unused resources dedicated to the reservation.

CLASS_LIST
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of classes that will be allowed access to the reserved resource.

DURATION
    Format:  [[[DD:]HH:]MM:]SS
    Default: INFINITY
    Duration of the reservation (not needed if ENDTIME is specified).

ENDTIME
    Format:  [HH[:MM[:SS]]][_MO[/DD[/YY]]] or +[[[DD:]HH:]MM:]SS
    Default: INFINITY
    Absolute or relative time reservation will end (not required if DURATION specified).

EXCLUSIVE
    Format:  N/A
    Default: N/A
    Requests exclusive access to resources.

FEATURE_LIST
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of node features which must be possessed by the reserved resources.

FLAGS
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of reservation flags (See Managing Reservations for details).

GROUP_LIST
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of groups that will be allowed access to the reserved resources.

NAME
    Format:  <STRING>
    Default: Name set to first name listed in ACL, or SYSTEM if no ACL specified
    Name for new reservation.

OWNER
    Format:  <CREDTYPE>:<CREDID> where CREDTYPE is one of user, group, acct, class, or qos
    Default: N/A
    Specifies which credential is granted reservation ownership privileges.

PARTITION
    Format:  <STRING>
    Default: [ANY]
    Partition in which resources must be located.

QOS_LIST
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of QOS's that will be allowed access to the reserved resource.

RESERVATION_PROFILE
    Format:  Existing reservation profile ID
    Default: N/A
    Requests that default reservation attributes be loaded from the specified reservation profile (see RSVPROFILE).

RESOURCE_DESCRIPTION
    Format:  Colon delimited list of zero or more of the following <ATTR>=<VALUE> pairs: PROCS=<INTEGER>, MEM=<INTEGER>, DISK=<INTEGER>, SWAP=<INTEGER>, GRES=<STRING>
    Default: PROCS=-1
    Specifies the resources to be reserved per task. (-1 indicates all resources on node.)

RESOURCE_EXPRESSION
    Format:  ALL or TASKS{==|>=}<TASKCOUNT> or <HOST_REGEX>
    Default: Required field; no default
    Specifies the tasks to reserve. ALL indicates all resources available should be reserved. If ALL or a host expression is specified, Moab will apply the reservation regardless of existing reservations and exclusive issues. If TASKS is used, Moab will only allocate accessible resources.

STARTTIME
    Format:  [HH[:MM[:SS]]][_MO[/DD[/YY]]] or +[[[DD:]HH:]MM:]SS
    Default: NOW
    Absolute or relative time reservation will start.

TRIGGER
    Format:  <STRING>
    Default: N/A
    Comma delimited reservation trigger list following format described in the trigger format section of the reservation configuration overview.

USER_LIST
    Format:  <STRING>[:<STRING>]...
    Default: ---
    List of users that will be allowed access to the reserved resources.
Description
The setres command allows an arbitrary block of resources to be reserved for
use by jobs which meet the specified access constraints. The timeframe
covered by the reservation can be specified on either an absolute or relative
basis. Only jobs with credentials listed in the reservation ACL (i.e., USERLIST,
GROUPLIST,...) can utilize the reserved resources. However, these jobs still have
the freedom to utilize resources outside of the reservation. The reservation will
be assigned a name derived from the ACL specified. If no reservation ACL is
specified, the reservation is created as a system reservation and no jobs will be
allowed access to the resources during the specified timeframe (valuable for
system maintenance, etc.). See the Reservation Overview for more
information.
Reservations can be viewed using the showres command and can be released
using the releaseres command.
Examples
Example 3-73:
> setres -u john:mary -s +24:00:00 -d 8:00:00 TASKS==2
reservation 'john.1' created on 2 nodes (2 tasks)
node001:1
node005:1
Reserve two nodes for use by users john and mary for a period of 8 hours starting in 24 hours.
Example 3-74:
> setres -s 8:00:00_06/20 -e 17:00:00_06/22 ALL
reservation 'system.1' created on 8 nodes (8 tasks)
node001:1
node002:1
node003:1
node004:1
node005:1
node006:1
node007:1
node008:1
Schedule a system wide reservation to allow system maintenance on Jun 20, 8:00 AM until Jun 22, 5:00
PM.
Example 3-75:
> setres -r PROCS=1:MEM=512 -g staff -l interactive 'node00[3-6]'
reservation 'staff.1' created on 4 nodes (4 tasks)
node003:1
node004:1
node005:1
node006:1
Reserve one processor and 512 MB of memory on nodes node003 through node node006 for members of
the group staff and jobs in the interactive class.
setspri
This command is deprecated. Use mjobctl -p instead.
Synopsis
setspri [-r] priorityjobid
Overview
(This command is deprecated by the mjobctl command)
Set or remove absolute or relative system priorities for a specified job.
This command allows you to set or remove a system priority level for a
specified job. Any job with a system priority level set is guaranteed a higher
priority than jobs without a system priority. Jobs with higher system priority
settings have priority over jobs with lower system priority settings.
Access
This command can be run by any Moab Scheduler Administrator.
Parameters
JOB
Name of job.
PRIORITY
System priority level. By default, this priority is an absolute priority overriding the policy generated
priority value. Range is 0 to clear, 1 for lowest, 1000 for highest. The given value is added onto the
system priority (see 32-bit and 64-bit values below), except for a given value of zero. If the '-r' flag is
specified, the system priority is relative, adding or subtracting the specified value from the policy
generated priority.
If a relative priority is specified, any value in the range +/- 1,000,000,000 is acceptable.
Flags
-r
Set relative system priority on job.
Examples
Example 3-76:
> setspri 10 orion.4752
job system priority adjusted
In this example, a system priority of 10 is set for job orion.4752.
Example 3-77:
> setspri 0 clusterB.1102
job system priority adjusted
In this example, system priority is cleared for job clusterB.1102.
Example 3-78:
> setspri -r 100000 job.00001
job system priority adjusted
In this example, the job's priority will be increased by 100000 over the value determined by the configured priority policy.
showconfig
This command is deprecated. Use mschedctl -l instead.
Synopsis
showconfig [-v]
Overview
View the current configurable parameters of the Moab Scheduler.
The showconfig command shows the current scheduler version and all scheduler
parameters. These parameters are set via internal defaults, command line
arguments, environment variable settings, parameters in the moab.cfg file,
and via the mschedctl -m command. Because of the many sources of
configuration settings, the output may differ from the contents of the
moab.cfg file. The output is such that it can be saved and used as the contents
of the moab.cfg file if desired.
The showconfig command does not show credential parameters (such as
user, group class, QoS, account).
Access
This command can be run by a level 1, 2, or 3 Moab administrator.
Flags

-h   Help for this command.
-v   This optional flag turns on verbose mode, which shows all possible Moab Scheduler parameters and their current settings. If this flag is not used, this command operates in context-sensitive terse mode, which shows only certain parameter settings.
Examples
Example 3-79: showconfig
> showconfig
# moab scheduler version 4.2.4 (PID: 11080)
BACKFILLPOLICY      FIRSTFIT
BACKFILLMETRIC      NODES
ALLOCATIONPOLICY    MINRESOURCE
RESERVATIONPOLICY   CURRENTHIGHEST
...
The showconfig command without the -v flag does not show the settings of
all scheduling parameters. To show the settings of all scheduling
parameters, use the -v (verbose) flag. This will provide an extended
output. This output is often best used in conjunction with the grep
command as the output can be voluminous.
Related Topics
Use the mschedctl -m command to change the various Moab Scheduler parameters.
See the Parameters document for details about configurable parameters.
Chapter 4 Prioritizing Jobs and Allocating Resources
- Job Prioritization
- Node Allocation Policies
- Node Access Policies
- Node Availability Policies
Job Prioritization
In general, prioritization is the process of determining which of many options
best fulfills overall goals. In the case of scheduling, a site will often have
multiple, independent goals that may include maximizing system utilization,
giving preference to users in specific projects, or making certain that no job sits
in the queue for more than a given period of time. The approach used by Moab
in representing a multi-faceted set of site goals is to assign weights to the
various objectives so an overall value or priority can be associated with each
potential scheduling decision. With the jobs prioritized, the scheduler can
roughly fulfill site objectives by starting the jobs in priority order.
- Priority Overview
- Job Priority Factors
- Fairshare Job Priority Example
- Common Priority Usage
- Prioritization Strategies
- Manual Priority Management
Related Topics
mdiag -p (Priority Diagnostics)
Priority Overview
Moab's prioritization mechanism allows component and subcomponent weights
to be associated with many aspects of a job to enable fine-grained control over
this aspect of scheduling. To allow this level of control, Moab uses a simple
priority-weighting hierarchy where the contribution of each priority
subcomponent is calculated as follows:
<COMPONENT WEIGHT> * <SUBCOMPONENT WEIGHT> * <PRIORITY SUBCOMPONENT VALUE>
Each priority component contains one or more subcomponents as described in
the section titled Job Priority Factors. For example, the Resource component
consists of Node, Processor, Memory, Swap, Disk, Walltime, and PE
subcomponents. While there are numerous priority components and many
more subcomponents, a site need only focus on and configure the subset of
components related to their particular priority needs. In actual usage, few sites
use more than a small fraction (usually 5 or fewer) of the available priority
subcomponents. This results in fairly straightforward priority configurations
and tuning. By mixing and matching priority weights, sites may generally obtain
the desired job-start behavior. At any time, you can issue the mdiag -p
command to determine the impact of the current priority-weight settings on
idle jobs. Likewise, the command showstats -f can assist the administrator in
evaluating priority effectiveness on historical system usage metrics such as
queue time or expansion factor.
As mentioned above, a job's priority is the weighted sum of its activated
subcomponents. By default, the value of all component and subcomponent
weights is set to 1 and 0 respectively. The one exception is the "QUEUETIME"
subcomponent weight that is set to 1. This results in a total job priority equal to
the period of time the job has been queued, causing Moab to act as a simple
FIFO. Once the summed component weight is determined, this value is then
bounded resulting in a priority ranging between 0 and MAX_PRIO_VAL which is
currently defined as 1000000000 (one billion). In no case will a job obtain a
priority in excess of MAX_PRIO_VAL through its priority subcomponent values.
Negative priority jobs may be allowed if desired; see
ENABLENEGJOBPRIORITY and REJECTNEGPRIOJOBS for more
information.
Using the mjobctl -p command, site administrators may adjust the base
calculated job priority by either assigning a relative priority adjustment or an
absolute system priority. A relative priority adjustment causes the base priority
to be increased or decreased by a specified value. Setting an absolute system
priority, SPRIO, causes the job to receive a priority equal to MAX_PRIO_VAL +
SPRIO, and thus guaranteed to be of higher value than any naturally occurring
job priority.
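The bounding and system-priority behavior just described can be sketched as follows. This is a simplified Python illustration of the stated rules, not Moab's implementation, and the function name is hypothetical.

    # Sketch of the priority bounding rules described above
    # (simplified illustration, not Moab's implementation).
    MAX_PRIO_VAL = 1_000_000_000

    def effective_priority(weighted_sum, system_priority=None):
        if system_priority is not None:
            # An absolute system priority always outranks natural priorities.
            return MAX_PRIO_VAL + system_priority
        # Natural priorities are bounded between 0 and MAX_PRIO_VAL
        # (with ENABLENEGJOBPRIORITY, the lower bound can be relaxed).
        return max(0, min(weighted_sum, MAX_PRIO_VAL))

    print(effective_priority(5_000_000_000))              # capped at 1000000000
    print(effective_priority(1234, system_priority=10))   # 1000000010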
Related Topics
REJECTNEGPRIOJOBS parameter
Job Priority Factors
- Credential (CRED) Component
- Fairshare (FS) Component
- Resource (RES) Component
- Service (SERVICE) Component
- Target Service (TARG) Component
- Usage (USAGE) Component
- Job Attribute (ATTR) Component
Moab allows jobs to be prioritized based on a range of job related factors.
These factors are broken down into a two-tier hierarchy of priority factors and
subfactors, each of which can be independently assigned a weight. This
approach provides the administrator with detailed yet straightforward control
of the job selection process.
Each factor and subfactor can be configured with independent priority weight
and priority cap values (described later). In addition, per credential and per
QoS priority weight adjustments may be specified for a subset of the priority
factors. For example, QoS credentials can adjust the queuetime subfactor
weight and group credentials can adjust fairshare subfactor weight.
The following table highlights the factors and subfactors that make up a job's
total priority.
Factor               SubFactor   Metric
CRED                 USER        user-specific priority (See USERCFG)
(job credentials)    GROUP       group-specific priority (See GROUPCFG)
                     ACCOUNT     account-specific priority (See ACCOUNTCFG)
                     QOS         QoS-specific priority (See QOSCFG)
                     CLASS       class/queue-specific priority (See CLASSCFG)
Factor               SubFactor    Metric
FS                   FSUSER       user-based historical usage (See Fairshare Overview)
(fairshare usage)    FSGROUP      group-based historical usage (See Fairshare Overview)
                     FSACCOUNT    account-based historical usage (See Fairshare Overview)
                     FSQOS        QoS-based historical usage (See Fairshare Overview)
                     FSCLASS      class/queue-based historical usage (See Fairshare Overview)
                     FSGUSER      imported global user-based historical usage (See ID Manager and Fairshare Overview)
                     FSGGROUP     imported global group-based historical usage (See ID Manager and Fairshare Overview)
                     FSGACCOUNT   imported global account-based historical usage (See ID Manager and Fairshare Overview)
                     FSJPU        current active jobs associated with job user
                     FSPPU        current number of processors allocated to active jobs associated with job user
                     FSPSPU       current number of processor-seconds allocated to active jobs associated with job user
                     WCACCURACY   user's current historical job wallclock accuracy calculated as total processor-seconds dedicated / total processor-seconds requested

Factor values are in the range of 0.0 to 1.0.
Factor                    SubFactor          Metric
RES                       NODE               number of nodes requested
(requested job            PROC               number of processors requested
resources)                MEM                total real memory requested (in MB)
                          SWAP               total virtual memory requested (in MB)
                          DISK               total local disk requested (in MB)
                          PS                 total processor-seconds requested
                          PE                 total processor-equivalent requested
                          WALLTIME           total walltime requested (in seconds)
SERV                      QUEUETIME          time job has been queued (in minutes)
(current service          XFACTOR            minimum job expansion factor
levels)                   BYPASS             number of times job has been bypassed by backfill
                          STARTCOUNT         number of times job has been restarted
                          DEADLINE           proximity to job deadline
                          SPVIOLATION        Boolean indicating whether the active job violates a soft usage limit
                          USERPRIO           user-specified job priority
TARGET                    TARGETQUEUETIME    time until queuetime target is reached (exponential)
(target service levels)   TARGETXFACTOR      distance to target expansion factor (exponential)
Factor                      SubFactor       Metric
USAGE                       CONSUMED        processor-seconds dedicated to date
(consumed resources --      REMAINING       processor-seconds outstanding
active jobs only)           PERCENT         percent of required walltime consumed
                            EXECUTIONTIME   seconds since job started
ATTR                        ATTRATTR        Attribute priority if specified job attribute is set (attributes may be user-defined or one of preemptor or preemptee). Default is 0.
(job attribute-based        ATTRSTATE       Attribute priority if job is in specified state (see Job States). Default is 0.
prioritization)             ATTRGRES        Attribute priority if a generic resource is requested. Default is 0.
*CAP parameters (FSCAP, for example) are available to limit the maximum
absolute value of each priority component and subcomponent. If set to a
positive value, a priority cap will bound priority component values in both
the positive and negative directions.
All *CAP and *WEIGHT parameters are specified as positive or negative
integers. Non-integer values are not supported.
Credential (CRED) Component
The credential component allows a site to prioritize jobs based on political
issues such as the relative importance of certain groups or accounts. This
allows direct political priorities to be applied to jobs.
The priority calculation for the credential component is as follows:
Priority += CREDWEIGHT * (
USERWEIGHT * Job.User.Priority +
GROUPWEIGHT * Job.Group.Priority +
ACCOUNTWEIGHT * Job.Account.Priority +
QOSWEIGHT * Job.Qos.Priority +
CLASSWEIGHT * Job.Class.Priority)
All user, group, account, QoS, and class weights are specified by setting the
PRIORITY attribute of using the respective *CFG parameter (namely, USERCFG,
GROUPCFG, ACCOUNTCFG, QOSCFG, and CLASSCFG).
For example, to set user and group priorities, you might use the following:
CREDWEIGHT        1
USERWEIGHT        1
GROUPWEIGHT       1
USERCFG[john]     PRIORITY=2000
USERCFG[paul]     PRIORITY=-1000
GROUPCFG[staff]   PRIORITY=10000
Class (or queue) priority may also be specified via the resource manager
where supported (as in PBS queue priorities). However, if Moab class
priority values are also specified, the resource manager priority values will
be overwritten.
All priorities may be positive or negative.
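Putting the formula and the sample configuration together, the following Python sketch shows the credential contribution for a hypothetical job submitted by john under group staff (illustrative arithmetic only, not Moab code):

    # Credential component for a job by user 'john' in group 'staff',
    # using the weights and priorities from the sample configuration above.
    CREDWEIGHT, USERWEIGHT, GROUPWEIGHT = 1, 1, 1
    user_priority = 2000     # USERCFG[john]   PRIORITY=2000
    group_priority = 10000   # GROUPCFG[staff] PRIORITY=10000

    cred_component = CREDWEIGHT * (USERWEIGHT * user_priority +
                                   GROUPWEIGHT * group_priority)
    print(cred_component)    # 12000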
Fairshare (FS) Component
Fairshare components allow a site to favor jobs based on short-term historical
usage. The Fairshare Overview describes the configuration and use of fairshare
in detail.
The fairshare factor is used to adjust a job's priority based on current and
historical percentage system utilization of the job's user, group, account, class,
or QoS. This allows sites to steer workload toward a particular usage mix across
user, group, account, class, and QoS dimensions.
The fairshare priority factor calculation is as follows:
Priority += FSWEIGHT * MIN(FSCAP, (
FSUSERWEIGHT * DeltaUserFSUsage +
FSGROUPWEIGHT * DeltaGroupFSUsage +
FSACCOUNTWEIGHT * DeltaAccountFSUsage +
FSQOSWEIGHT * DeltaQOSFSUsage +
FSCLASSWEIGHT * DeltaClassFSUsage +
FSJPUWEIGHT * ActiveUserJobs +
FSPPUWEIGHT * ActiveUserProcs +
FSPSPUWEIGHT * ActiveUserPS +
WCACCURACYWEIGHT * UserWCAccuracy ))
All *WEIGHT parameters just listed are specified on a per partition basis in the
moab.cfg file. The Delta*Usage components represent the difference in
actual fairshare usage from the corresponding fairshare usage target. Actual
fairshare usage is determined based on historical usage over the time frame
specified in the fairshare configuration. The target usage can be a target, floor,
or ceiling value as specified in the fairshare configuration file. See the Fairshare
Overview for further information on configuring and tuning fairshare.
Additional insight may be available in the fairshare usage example. The
ActiveUser* components represent current usage by the job's user
credential.
How violated ceilings and floors affect fairshare-based priority
The FSUsageWeight is determined as described in the previous section. To account for violated ceilings and floors, Moab multiplies that weight by the FSUsagePriority, as demonstrated in the following formula:
FSPriority = FSUsagePriority * FSUsageWeight
When a ceiling or floor is violated, FSUsagePriority = 0, so FSPriority =
0. This means the job will gain no priority because of fairshare. If fairshare is
the only component of priority, then violation takes the priority to 0. For more
information, see Priority-Based Fairshare and Fairshare Targets.
Resource (RES) Component
Weighting jobs by the amount of resources requested allows a site to favor
particular types of jobs. Such prioritization may allow a site to better meet site
mission objectives, improve fairness, or even improve overall system
utilization.
Resource based prioritization is valuable when you want to favor jobs based on
the resources requested. This is good in three main scenarios: (1) when you
need to favor large resource jobs because it's part of your site's mission
statement, (2) when you want to level the response time distribution across
large and small jobs (small jobs are more easily backfilled and thus generally
have better turnaround time), and (3) when you want to improve system
utilization. While this may be surprising, system utilization actually increases as
large resource jobs are pushed to the front of the queue. This keeps the
smaller jobs in the back where they can be selected for backfill and thus
increase overall system utilization. The situation is like the story about filling a
cup with golf balls and sand. If you put the sand in first, it gets in the way and
you are unable to put in as many golf balls. However, if you put in the golf balls
first, the sand can easily be poured in around them completely filling the cup.
The calculation for determining the total resource priority factor is as follows:
Priority += RESWEIGHT * MIN(RESCAP, (
    NODEWEIGHT * TotalNodesRequested +
    PROCWEIGHT * TotalProcessorsRequested +
    MEMWEIGHT * TotalMemoryRequested +
    SWAPWEIGHT * TotalSwapRequested +
    DISKWEIGHT * TotalDiskRequested +
    WALLTIMEWEIGHT * TotalWalltimeRequested +
    PEWEIGHT * TotalPERequested))
The sum of all weighted resources components is then multiplied by the
RESWEIGHT parameter and capped by the RESCAP parameter. Memory, Swap,
and Disk are all measured in megabytes (MB). The final resource component,
PE, represents Processor Equivalents. This component can be viewed as a
processor-weighted maximum percentage of total resources factor.
For example, if a job requested 25% of the processors and 50% of the total
memory on a 128-processor system, it would have a PE value of MAX(25,50) *
128, or 64. The concept of PEs is a highly effective metric in shared resource
systems.
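The PE calculation just described can be sketched in a few lines of Python (an illustration of the stated formula; the function name is hypothetical):

    # PE = (maximum requested fraction of any resource) * total processors.
    # Illustration of the formula described above.
    def processor_equivalents(total_procs, **requested_fractions):
        return max(requested_fractions.values()) * total_procs

    # 25% of processors and 50% of memory on a 128-processor system:
    print(processor_equivalents(128, procs=0.25, mem=0.50))   # 64.0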
Ideal values for requested job processor count and walltime can be
specified using PRIORITYTARGETPROCCOUNT and
PRIORITYTARGETDURATION.
Service (SERVICE) Component
The Service component specifies which service metrics are of greatest value to
the site. Favoring one service subcomponent over another generally improves
that service metric.
The priority calculation for the service priority factor is as follows:
Priority += SERVICEWEIGHT * (
QUEUETIMEWEIGHT * <QUEUETIME> +
XFACTORWEIGHT * <XFACTOR> +
BYPASSWEIGHT * <BYPASSCOUNT> +
STARTCOUNTWEIGHT * <STARTCOUNT> +
DEADLINEWEIGHT * <DEADLINE> +
SPVIOLATIONWEIGHT * <SPBOOLEAN> +
USERPRIOWEIGHT * <USERPRIO> )
QueueTime (QUEUETIME) Subcomponent
In the priority calculation, a job's queue time is a duration measured in
minutes. Using this subcomponent tends to prioritize jobs in a FIFO order.
Favoring queue time improves queue time based fairness metrics and is
probably the most widely used single job priority metric. In fact, under the
initial default configuration, this is the only priority subcomponent enabled
within Moab. It is important to note that within Moab, a job's queue time is not
necessarily the amount of time since the job was submitted. The parameter
JOBPRIOACCRUALPOLICY allows a site to select how a job will accrue queue
time based on meeting various throttling policies. Regardless of the policy used
to determine a job's queue time, this effective queue time is used in the
calculation of the QUEUETIME, XFACTOR, TARGETQUEUETIME, and
TARGETXFACTOR priority subcomponent values.
The need for a distinct effective queue time is necessitated by the fact that
many sites have users who like to work the system, whatever system it
happens to be. A common practice at some long existent sites is for some users
to submit a large number of jobs and then place them on hold. These jobs
remain with a hold in place for an extended period of time and when the user is
ready to run a job, the needed executable and data files are linked into place
and the hold released on one of these pre-submitted jobs. The extended hold
time guarantees that this job is now the highest priority job and will be the next
to run. The use of the JOBPRIOACCRUALPOLICY parameter can prevent this
practice and prevent "queue stuffers" from doing similar things on a shorter
time scale. These "queue stuffer" users submit hundreds of jobs at once to
swamp the machine and consume use of the available compute resources. This
parameter prevents the user from gaining any advantage from stuffing the
queue by not allowing these jobs to accumulate any queue time based priority
until they meet certain idle and active Moab fairness policies (such as max job
per user and max idle job per user).
As a final note, you can adjust the QUEUETIMEWEIGHT parameter on a per
QoS basis using the QOSCFG parameter and the QTWEIGHT attribute. For
example, the line QOSCFG[special] QTWEIGHT=5000 causes jobs using the QoS
special to have their queue time subcomponent weight increased by 5000.
Expansion Factor (XFACTOR) Subcomponent
The expansion factor subcomponent has an effect similar to the queue time
factor but favors shorter jobs based on their requested wallclock run time. In
its traditional form, the expansion factor (XFactor) metric is calculated as
follows:
XFACTOR = 1 + <QUEUETIME> / <EXECUTIONTIME>
However, a couple of aspects of this calculation make its use more difficult.
First, the length of time the job will actually run—<EXECUTIONTIME>—is not
actually known until the job completes. All that is known is how much time the
job requests. Secondly, as described in the Queue Time Subcomponent
section, Moab does not necessarily use the raw time since job submission to
determine <QUEUETIME> to prevent various scheduler abuses. Consequently,
Moab uses the following modified equation:
XFACTOR = 1 + <EFFQUEUETIME> / <WALLCLOCKLIMIT>
In the equation Moab uses, <EFFQUEUETIME> is the effective queue time
subject to the JOBPRIOACCRUALPOLICY parameter and <WALLCLOCKLIMIT>
is the user—or system—specified job wallclock limit.
Using this equation, it can be seen that short running jobs will have an XFactor
that will grow much faster over time than the xfactor associated with long
running jobs. The following table demonstrates this favoring of short running
jobs:
Job Queue Time   XFactor for 1-hour job    XFactor for 4-hour job
1 hour           1 + (1 / 1)  = 2.00       1 + (1 / 4)  = 1.25
2 hours          1 + (2 / 1)  = 3.00       1 + (2 / 4)  = 1.50
4 hours          1 + (4 / 1)  = 5.00       1 + (4 / 4)  = 2.00
8 hours          1 + (8 / 1)  = 9.00       1 + (8 / 4)  = 3.00
16 hours         1 + (16 / 1) = 17.0       1 + (16 / 4) = 5.0
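The table above can be reproduced with a short Python sketch of the modified XFactor equation, under the simplifying assumption that effective queue time equals the listed queue time:

    # Reproduce the XFactor table: XFACTOR = 1 + EFFQUEUETIME / WALLCLOCKLIMIT.
    # Assumes effective queue time equals raw queue time for simplicity.
    for queue_hours in (1, 2, 4, 8, 16):
        xf_short = 1 + queue_hours / 1    # 1-hour wallclock limit
        xf_long = 1 + queue_hours / 4     # 4-hour wallclock limit
        print(f"{queue_hours:2d}h queued: 1h job XF={xf_short:5.2f}, "
              f"4h job XF={xf_long:5.2f}")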
Since XFactor is calculated as a ratio of two values, it is possible for this
subcomponent to be almost arbitrarily large, potentially swamping the value of
other priority subcomponents. This can be addressed either by using the
subcomponent cap XFACTORCAP, or by using the XFMINWCLIMIT parameter.
If the latter is used, the calculation for the XFactor subcomponent value
becomes:
XFACTOR = 1 + <EFFQUEUETIME> / MAX(<XFMINWCLIMIT>, <WALLCLOCKLIMIT>)
Using the XFMINWCLIMIT parameter allows a site to prevent very short jobs from
causing the XFactor subcomponent to grow inordinately.
Some sites consider XFactor to be a more fair scheduling performance metric
than queue time. At these sites, job XFactor is given far more weight than job
queue time when calculating job priority and job XFactor distribution
consequently tends to be fairly level across a wide range of job durations. (That
is, a flat XFactor distribution of 1.0 would result in a one-minute job being
queued on average one minute, while a 24-hour job would be queued an
average of 24 hours.)
Like queue time, the effective XFactor subcomponent weight is the sum of two
weights, the XFACTORWEIGHT parameter and the QoS-specific XFWEIGHT
setting. For example, the line QOSCFG[special] XFWEIGHT=5000 causes jobs
using the QoS special to increase their expansion factor subcomponent weight
by 5000.
Bypass (BYPASS) Subcomponent
The bypass factor is based on the bypass count of a job where the bypass count
is increased by one every time the job is bypassed by a lower priority job via
backfill. Backfill starvation has never been reported, but if encountered, use
the BYPASS subcomponent.
StartCount (STARTCOUNT) Subcomponent
Apply the startcount factor to sites with trouble starting or completing due to
policies or failures. The primary causes of an idle job having a startcount
greater than zero are resource manager level job start failure, administrator
based requeue, or requeue based preemption.
Deadline (DEADLINE) Subcomponent
The deadline factor allows sites to take into consideration the proximity of a job
to its DEADLINE. As a jobs moves closer to its deadline its priority increases
linearly. This is an alternative to the strict deadline discussed in QOS SERVICE.
Soft Policy Violation (SPVIOLATION) Subcomponent
The soft policy violation factor allows sites to favor jobs which do not violate
their associated soft resource limit policies.
User Priority (USERPRIO) Subcomponent
The user priority subcomponent allows sites to consider end-user specified job
priority in making the overall job priority calculation. Under Moab, end-user
specified priorities may only be negative and are bounded in the range 0 to -1024. See Manual Priority Usage and Enabling End-user Priorities for more
information.
User priorities can be positive, ranging from -1024 to 1023, if
ENABLEPOSUSERPRIORITY TRUE is specified in moab.cfg.
Target Service (TARG) Component
The target factor component of priority takes into account job scheduling
performance targets. Currently, this is limited to target expansion factor and
target queue time. Unlike the expansion factor and queue time factors
described earlier which increase gradually over time, the target factor
component is designed to grow exponentially as the target metric is
approached. This behavior causes the scheduler to do essentially all in its
power to make certain the scheduling targets are met.
The priority calculation for the target factor is as follows:
Priority += TARGETWEIGHT* ( TARGETQUEUETIMEWEIGHT * QueueTimeComponent +
TARGETXFACTORWEIGHT * XFactorComponent)
The queue time and expansion factor target are specified on a per QoS basis
using the XFTARGET and QTTARGET attributes with the QOSCFG parameter. The
QueueTime and XFactor component calculations are designed to produce small
values until the target value begins to approach, at which point these
components grow very rapidly. If the target is missed, this component remains
high and continues to grow, but it does not grow exponentially.
Usage (USAGE) Component
The Usage component applies to active jobs only. The priority calculation for
the usage priority factor is as follows:
Priority += USAGEWEIGHT * ( USAGECONSUMEDWEIGHT * ProcSecondsConsumed +
USAGEHUNGERWEIGHT * ProcNeededToBalanceDynamicJob +
USAGEREMAININGWEIGHT * ProcSecRemaining +
USAGEEXECUTIONTIMEWEIGHT * SecondsSinceStart +
USAGEPERCENTWEIGHT * WalltimePercent )
Job Attribute (ATTR) Component
The Attribute component allows the incorporation of job attributes into a job's
priority. The most common usage for this capability is to do one of the
following:
- adjust priority based on a job's state (favor suspended jobs)
- adjust priority based on a job's requested node features (favor jobs that request attribute pvfs)
- adjust priority based on internal job attributes (disfavor backfill or preemptee jobs)
- adjust priority based on a job's requested licenses, network consumption, or generic resource requirements
To use job attribute based prioritization, the JOBPRIOF parameter must be
specified to set corresponding attribute priorities. To favor jobs based on node
feature requirements, the parameter NODETOJOBATTRMAP must be set to
map node feature requests to job attributes.
The priority calculation for the attribute priority factor is as follows:
Priority += ATTRWEIGHT * (
    ATTRATTRWEIGHT * <ATTRPRIORITY> +
    ATTRSTATEWEIGHT * <STATEPRIORITY> +
    ATTRGRESWEIGHT * <GRESPRIORITY> +
    JOBIDWEIGHT * <JOBID> +
    JOBNAMEWEIGHT * <JOBNAME_INTEGER> )
Example 4-1:
ATTRWEIGHT         100
ATTRATTRWEIGHT     1
ATTRSTATEWEIGHT    1
ATTRGRESWEIGHT     5

# favor suspended jobs
# disfavor preemptible jobs
# favor jobs requesting 'matlab'
JOBPRIOF           STATE[Running]=100 STATE[Suspended]=1000 ATTR[PREEMPTEE]=-200 ATTR[gpfs]=30 GRES[matlab]=400

# map node features to job features
NODETOJOBATTRMAP   gpfs,pvfs
...
Related Topics
Node Allocation Priority
Per Credential Priority Weight Offsets
Managing Consumable Generic Resources
Fairshare Job Priority Example
Consider the following information associated with calculating the fairshare
factor for job X.
Job X
    Credentials: User A, Group B, Account C, QOS D, Class E

User A
    Fairshare Target: 50.0
    Current Fairshare Usage: 45.0

Group B
    Fairshare Target: [NONE]
    Current Fairshare Usage: 65.0

Account C
    Fairshare Target: 25.0
    Current Fairshare Usage: 35.0

QOS D
    Fairshare Target: 10.0+
    Current Fairshare Usage: 25.0

Class E
    Fairshare Target: [NONE]
    Current Fairshare Usage: 20.0

Priority Weights:
    FSWEIGHT          100
    FSUSERWEIGHT      10
    FSGROUPWEIGHT     20
    FSACCOUNTWEIGHT   30
    FSQOSWEIGHT       40
    FSCLASSWEIGHT     0
In this example, the Fairshare component calculation would be as follows:
Priority += 100 * ( 10 * 5 +
20 * 0 +
30 * (-10) +
40 * 0 +
0 * 0)
User A is 5% below his target, so fairshare increases the total fairshare factor accordingly. Group B has no target, so group fairshare usage is ignored. Account C is 10% above its fairshare usage target, so this component decreases the job's total fairshare factor. QOS D is 15% over its target, but the '+' in the target specification indicates that this is a 'floor' target, only influencing priority when fairshare usage drops below the target value. Thus, the QOS D fairshare usage delta does not influence the fairshare factor.
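The arithmetic of this example can be checked with a short Python sketch. It is illustrative only; the deltas follow the rules described above, with [NONE] targets and un-violated floor targets contributing zero.

    # Fairshare component for job X, per the example above.
    # Delta = target - usage; [NONE] targets and un-violated floor
    # targets contribute 0.
    FSWEIGHT = 100
    weighted_deltas = [
        (10, 50.0 - 45.0),   # user A: 5% below target
        (20, 0.0),           # group B: no target
        (30, 25.0 - 35.0),   # account C: 10% above target
        (40, 0.0),           # QOS D: floor target, usage above floor
        (0, 0.0),            # class E: no target (weight 0 anyway)
    ]
    fs_component = FSWEIGHT * sum(w * d for w, d in weighted_deltas)
    print(fs_component)      # -25000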
Fairshare is a great mechanism for influencing job turnaround time via priority
to favor a particular distribution of jobs. However, it is important to realize that
fairshare can only favor a particular distribution of jobs, it cannot force it. If
user X has a fairshare target of 50% of the machine but does not submit
enough jobs, no amount of priority favoring will get user X's usage up to 50%.
See the Fairshare Overview for more information.
Common Priority Usage
- Credential Priority Factors
- Service Level Priority Factors
- Priority Factor Caps
- User Selectable Prioritization
Site administrators vary widely in their preferred manner of prioritizing jobs.
Moab's scheduling hierarchy allows sites to meet job control needs without
requiring adjustments to dozens of parameters. Some choose to use
numerous subcomponents, others a few, and still others are content with the
default FIFO behavior. Any subcomponent that is not of interest may be safely
ignored.
Credential Priority Factors
To help clarify the use of priority weights, a brief example may help. Suppose a
site wished to maintain the FIFO behavior but also incorporate some credential
based prioritization to favor a special user. Particularly, the site would like the
user john to receive a higher initial priority than all other users. Configuring this
behavior requires two steps. First, the user credential subcomponent must be
enabled and second, john must have his relative priority specified. Take a look
at the sample moab.cfg file:
USERWEIGHT      1
USERCFG[john]   PRIORITY=300
The "USER" priority subcomponent was enabled by setting the
USERWEIGHT parameter. In fact, the parameters used to specify the
weights of all components and subcomponents follow this same
"*WEIGHT" naming convention (as in RESWEIGHT and
TARGETQUEUETIMEWEIGHT.
The second part of the example involves specifying the actual user priority for
the user john. This is accomplished using the USERCFG parameter. Why was
the priority 300 selected and not some other value? Is this value arbitrary? As
in any priority system, actual priority values are meaningless, only relative
values are important. In this case, we are required to balance user priorities
with the default queue time based priorities. Since queuetime priority is
measured in minutes queued, the user priority of 300 places a job by user john
on par with a job submitted 5 minutes earlier by another user.
Is this what the site wants? Maybe, maybe not. At the onset, most sites are
uncertain what they want in prioritization. Often, an estimate initiates
prioritization and adjustments occur over time. Cluster resources evolve, the
workload evolves, and even site policies evolve, resulting in changing priority
needs over time. Anecdotal evidence indicates that most sites establish a
relatively stable priority policy within a few iterations and make only occasional
adjustments to priority weights from that point.
Service Level Priority Factors
In another example, suppose a site administrator wants to do the following:
- favor jobs in the low, medium, and high QoSs so they will run in QoS order
- balance job expansion factor
- use job queue time to prevent jobs from starving
Under such conditions, the sample moab.cfg file might appear as follows:
QOSWEIGHT               1
XFACTORWEIGHT           1
QUEUETIMEWEIGHT         10
TARGETQUEUETIMEWEIGHT   1
QOSCFG[low]             PRIORITY=1000
QOSCFG[medium]          PRIORITY=10000
QOSCFG[high]            PRIORITY=100000
QOSCFG[DEFAULT]         QTTARGET=4:00:00
This example is a bit more complicated but is more typical of the needs of many
sites. The desired QoS weightings are established by enabling the QoS
subfactor using the QOSWEIGHT parameter while the various QoS priorities
are specified using QOSCFG. XFACTORWEIGHT is then set as this
subcomponent tends to establish a balanced distribution of expansion factors
across all jobs. Next, the queuetime component is used to gradually raise the
priority of all jobs based on the length of time they have been queued. Note
that in this case, QUEUETIMEWEIGHT was explicitly set to 10, overriding its
default value of 1. Finally, the TARGETQUEUETIMEWEIGHT parameter is used
in conjunction with the QOSCFG[DEFAULT] line to specify a queue time target of 4 hours.
Priority Factor Caps
Assume now that the site administrator is content with this priority mix but has
a problem with users submitting large numbers of very short jobs. Very short
jobs would tend to have rapidly growing XFactor values and would
consequently quickly jump to the head of the queue. In this case, a factor cap
would be appropriate. Such caps allow a site to limit the contribution of a job's
priority factor to be within a defined range. This prevents certain priority
factors from swamping others. Caps can be applied to either priority
components or subcomponents and are specified using the
<COMPONENTNAME>CAP parameter (such as QUEUETIMECAP, RESCAP, and
SERVCAP). Note that both component and subcomponent caps apply to the pre-weighted value, as in the following equation:
Priority =
    C1WEIGHT * MIN(C1CAP, SUM(
        S11WEIGHT * MIN(S11CAP, S11S) +
        S12WEIGHT * MIN(S12CAP, S12S) +
        ...)) +
    C2WEIGHT * MIN(C2CAP, SUM(
        S21WEIGHT * MIN(S21CAP, S21S) +
        S22WEIGHT * MIN(S22CAP, S22S) +
        ...)) +
    ...
Example 4-2: Priority cap
QOSWEIGHT        1
QOSCAP           10000
XFACTORWEIGHT    1
XFACTORCAP       1000
QUEUETIMEWEIGHT  10
QUEUETIMECAP     1000
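To illustrate with the numbers above: a very short job whose raw XFactor subcomponent value spiked to, say, 5000 (an arbitrary illustrative figure) would contribute only 1 * MIN(1000, 5000) = 1000 priority points, while its queuetime contribution would keep growing at 10 points per minute until reaching its own ceiling of 10 * 1000 = 10000 points.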
User Selectable Prioritization
Moab allows users to specify a priority for jobs they own or manage. This
priority may be set at job submission time or it may be dynamically modified
(using setspri or mjobctl) after submitting the job. For fairness reasons, users
may only apply a negative priority to their job and thus slide it further back in
the queue. This enables users to allow their more important jobs to run before
their less important ones without gaining unfair advantage over other users.
User priorities can be positive if ENABLEPOSUSERPRIORITY TRUE is
specified in moab.cfg.
For ENABLEPOSUSERPRIORITY to take effect, you must also change
USERPRIOWEIGHT from its default value of 0. For example:
USERPRIOWEIGHT 100
> setspri -r 100 332411
successfully modified job priority
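Putting the two parameters together, a minimal moab.cfg sketch for allowing positive user priorities might be:

# permit users to raise as well as lower their own jobs' priority
ENABLEPOSUSERPRIORITY TRUE
# give user-assigned priorities weight in the priority calculation
USERPRIOWEIGHT        100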
Specifying a user priority at job submission time is resource manager
specific. See the associated resource manager documentation for more
information.
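As an illustration only (consult your submission client's documentation), an msub user might lower a job's priority at submission time with the -p flag; job.cmd here is a hypothetical job script:

> msub -p -100 job.cmd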
User Selectable Priority w/QoS
Using the QoS facility, organizations can set up an environment in which users
can more freely select the desired priority of a given job. Organizations may
enable access to a number of QoSs, each with its own charging rate, priority,
and target service levels. Users can then assign job importance by selecting the
appropriate QoS. If desired, this can allow a user to jump ahead of other users
in the queue if they are willing to pay the associated costs.
Related Topics
User Selectable Priority
Prioritization Strategies
Each component or subcomponent may be used to accomplish different
objectives. WALLTIME can be used to favor (or disfavor) jobs based on their
duration. Likewise, ACCOUNT can be used to favor jobs associated with a
particular project while QUEUETIME can be used to favor those jobs waiting the
longest.
- Queue Time
- Expansion Factor
- Resource
- Fairshare
- Credential
- Target Metrics
Each priority factor group may contain one or more subfactors. For example,
the Resource factor consists of Node, Processor, Memory, Swap, Disk, and PE
components. From the table in Job Priority Factors section, it is apparent that
the prioritization problem is fairly complex since every site needs to prioritize a
bit differently. When calculating a priority, the various priority factors are
summed and then bounded between 0 and MAX_PRIO_VAL, which is currently
defined as 1000000000 (one billion).
The mdiag -p command assists with visualizing the priority distribution resulting
from the current job priority configuration. Also, the showstats -f command
helps indicate the impact of the current priority settings on scheduler service
distributions.
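For example, the following read-only diagnostic invocations might be used (output omitted; AVGXFACTOR is one of several statistic types accepted by showstats -f):

> mdiag -p
> showstats -f AVGXFACTOR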
Manual Job Priority Adjustment
Batch administrators regularly need to adjust the calculated priority of a
job to meet current needs. These needs generally fall into two
categories:
1. The need to run an administrator test job as soon as possible.
2. The need to pacify a disserviced user.
You can use the setspri command to handle these issues in one of two ways;
this command allows the specification of either a relative priority adjustment or
the specification of an absolute priority. Using absolute priority specification,
administrators can set a job priority guaranteed to be higher than any
calculated value. Where Moab-calculated job priorities are in the range of 0 to 1
billion, system administrator assigned absolute priorities start at 1 billion and
go up. Issuing the setspri <PRIO> <JOBID> command, for example, assigns
a priority of 1 billion + <PRIO> to the job. Thus, setspri 5 job.1294 sets
the priority of "job.1294" to 1000000005.
For more information, see Common Priority Usage - End-user Adjustment.
Node Allocation Policies
While job prioritization allows a site to determine which job to run, node
allocation policies allow a site to specify how available resources should be
allocated to each job. The algorithm used is specified by the parameter
NODEALLOCATIONPOLICY. There are multiple node allocation policies to
choose from, allowing selection based on reservation constraints, node
configuration, resource usage, resource preferences, and other factors. You can specify these
policies with a system-wide default value, on a per-partition basis, or on a per-job basis. Please note that LASTAVAILABLE is the default policy.
Available algorithms are described in detail in the following sections and include
FIRSTAVAILABLE, LASTAVAILABLE, PRIORITY, CPULOAD, MINRESOURCE,
CONTIGUOUS, MAXBALANCE, and PLUGIN.
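For example, a site could select an algorithm globally with a single moab.cfg line; a minimal sketch:

# favor the least-provisioned nodes that still satisfy each job
NODEALLOCATIONPOLICY MINRESOURCE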
- Node Allocation Overview
  - Heterogeneous Resources
  - Shared Nodes
  - Reservations or Service Guarantees
  - Non-flat Network
- Node selection factors
- Resource-Based Algorithms
  - CPULOAD
  - FIRSTAVAILABLE
  - LASTAVAILABLE
  - PRIORITY
  - MINRESOURCE
  - CONTIGUOUS
  - MAXBALANCE
- User-Defined Algorithms
  - PLUGIN
- Specifying Per Job Resource Preferences
  - Specifying Resource Preferences
  - Selecting Preferred Resources
Node Allocation Overview
Node allocation is the process of selecting the best resources to allocate to a
job from a list of available resources. Making this decision intelligently is
important in an environment that possesses one or more of the following
attributes:
- heterogeneous resources (resources which vary from node to node in terms of quantity or quality)
- shared nodes (nodes may be utilized by more than one job)
- reservations or service guarantees
- non-flat network (a network in which a perceptible performance degradation may potentially exist depending on workload placement)
Heterogeneous Resources
Moab analyzes job processing requirements and assigns resources to
maximize hardware utility.
For example, suppose two nodes are available in a system, A and B. Node A
has 768 MB of RAM and node B has 512 MB. The next two jobs in the queue are
X and Y. Job X requests 256 MB and job Y requests 640 MB. Job X is next in the
queue and can fit on either node, but Moab recognizes that job Y (640 MB) can
only fit on node A (768 MB). Instead of putting job X on node A and blocking job
Y, Moab can put job X on node B and job Y on node A.
Shared Nodes
Symmetric Multiprocessing (SMP)
When sharing SMP-based compute resources amongst tasks from more than
one job, resource contention and fragmentation issues arise. In SMP
environments, the general goal is to deliver maximum system utilization for a
combination of compute-intensive and memory-intensive jobs while
preventing overcommitment of resources.
By default, most current systems do not do a good job of logically partitioning
the resources (such as CPU, memory, and network bandwidth) available on a
given node. Consequently, contention often arises between tasks of
independent jobs on the node. This can result in a slowdown for all jobs
involved, which can have significant ramifications if large-way parallel jobs are
involved. Virtualization, CPU sets, and other techniques are maturing quickly as
methods to provide logical partitioning within shared resources.
On large-way SMP systems (> 32 processors/node), job packing can result in
intra-node fragmentation. For example, take two nodes, A and B, each with 64
processors. Assume they are currently loaded with various jobs and A has 24
and B has 12 processors free. Two jobs are submitted; job X requests 10
processors and job Y requests 20 processors. Job X can start on either node but
starting it on node A prevents job Y from running. An algorithm to handle intra-node fragmentation is straightforward for the single-resource case, but the
algorithm becomes more involved when jobs request a combination of
processors, memory, and local disk. These workload factors should be
considered when selecting a site's node allocation policy as well as identifying
appropriate policies for handling resource utilization limit violations.
Interactive Nodes
In many cases, sites are interested in allowing multiple users to simultaneously
use one or more nodes for interactive purposes. Workload is commonly not
compute intensive, consisting of intermittent tasks such as coding, compiling,
and testing. Because these jobs are highly variant in terms of resource usage
over time, sites are able to pack a larger number of these jobs onto the same
node. Consequently, a common practice is to restrict job scheduling based on
utilized, rather than dedicated resources.
Interactive Node Example
The example configuration files that follow show one method by which node
sharing can be accomplished within a Torque + Moab environment. This
example is based on a hypothetical cluster composed of 4 nodes each with 4
cores. For the compute nodes, job tasks are limited to actual cores preventing
overcommitment of resources. For the interactive nodes, up to 32 job tasks are
allowed, but the node also stops allowing additional tasks if either memory is
fully utilized or if the CPU load exceeds 4.0. Thus, Moab continues packing the
interactive nodes with jobs until carrying capacity is reached.
Example 4-3: /opt/moab/etc/moab.cfg
# constrain interactive jobs to interactive nodes
# constrain interactive jobs to 900 proc-seconds
CLASSCFG[interactive] HOSTLIST=interactive01,interactive02
CLASSCFG[interactive] MAX.CPUTIME=900
RESOURCELIMITPOLICY CPUTIME:ALWAYS:CANCEL
# base interactive node allocation on load and jobs
NODEALLOCATIONPOLICY PRIORITY
NODECFG[interactive01] PRIORITYF='-20*LOAD - JOBCOUNT'
NODECFG[interactive02] PRIORITYF='-20*LOAD - JOBCOUNT'
Example 4-4: /var/spool/torque/server_priv/nodes
interactive01 np=32
interactive02 np=32
compute01 np=4
compute02 np=4
Example 4-5: /var/spool/torque/mom_priv/config on "interactive01"
# interactive01
$max_load 4.0
Example 4-6: /var/spool/torque/mom_priv/config on "interactive02"
# interactive02
$max_load 4.0
Reservations or Service Guarantees
A reservation-based system adds the time dimension into the node allocation
decision. With reservations, node resources must be viewed in a type of two
dimension node-time space. Allocating nodes to jobs fragments this node-time
space and makes it more difficult to schedule jobs in the remaining, more
constrained node-time slots. Allocation decisions should be made in such a way
as to minimize this fragmentation and maximize the scheduler's ability to
continue to start jobs in existing slots. The following figure shows that job A and
job B are running. A reservation, X, is created some time in the future. Assume
that job A is 2 hours long and job B is 3 hours long. Again, two new single-processor jobs are submitted, C and D; job C requires 3 hours of compute time
while job D requires 5 hours. Either job will just fit in the free space located
above job A or in the free space located below job B. If job C is placed above
job A, then job D, requiring 5 hours of time, will be prevented from running by the
presence of reservation X. However, if job C is placed below job B, job D can
still start immediately above job A.
Image 4-1: Job A, Job B, and Reservation X scheduled on nodes
The preceding example demonstrates the importance of time-based
reservation information in making node allocation decisions, both at the time of
starting jobs and at the time of creating reservations. The impact of time-based
issues grows significantly with the number of reservations in place on a given
system. The LASTAVAILABLE algorithm works on this premise, locating resources
that have the smallest space between the end of a job under consideration and
the start of a future reservation.
Non-flat Network
On systems where network connections do not resemble a flat all-to-all
topology, task placement may impact performance of communication intensive
parallel jobs. If latencies and network bandwidth between any two nodes vary
significantly, the node allocation algorithm should attempt to pack tasks of a
given job as close to each other as possible to minimize impact of bandwidth
and latency differences.
Node selection factors
While the node allocation policy determines which nodes a job will use, other
factors narrow the options before the policy makes the final decision. The
following process demonstrates how Moab executes its node allocation process
and how other policies affect the decision:
1. Moab eliminates nodes that do not meet the hard resource requirements set
by the job.
2. Moab gathers affinity information, first from workload proximity rules and
then from reservation affinity rules (see Affinity for more information).
Reservation affinity rules trump workload proximity rules.
3. Moab allocates nodes using the allocation policy.
- If more than enough nodes with Required affinity exist, only they are passed down for the final sort by the node allocation policy.
- If the number of nodes with Required affinity exactly matches the number of nodes requested, the node allocation policy is skipped entirely and all of those nodes are assigned to the job.
- If too few nodes have Required affinity, all of them are assigned to the job, and the node allocation policy is then applied to the remaining eligible nodes (after Required, Moab will use Positive, then Neutral, then Negative).
Resource-Based Algorithms
Moab contains a number of allocation algorithms that address some of the
needs described earlier. You can also create allocation algorithms and interface
them with the Moab scheduling system. Each of these policies has a name and
descriptive alias. They can be configured using either one, but Moab will only
report their names.
If ENABLEHIGHTHROUGHPUT is TRUE, you must set
NODEALLOCATIONPOLICY to FIRSTAVAILABLE.
The current suite of algorithms is described in what follows:
CPULOAD (ProcessorLoad)
Nodes are selected that have the maximum amount of available, unused CPU power (<# of CPUs> - <CPU load>). CPULOAD is a good algorithm for timesharing node systems and applies to jobs starting immediately. For the purpose of future reservations, the MINRESOURCE algorithm is used.
FIRSTAVAILABLE (InReportedOrder)
Simple first come, first served algorithm where nodes are allocated in the order they are presented by the resource manager. This is a very simple and very fast algorithm.

LASTAVAILABLE (InReserveReportedOrder)
Nodes are allocated in the reverse of the order in which they are presented by the resource manager; that is, the reverse of FIRSTAVAILABLE.
PRIORITY (CustomPriority)
Allows a site to specify the priority of various static and dynamic aspects of compute nodes and allocate them with preference for higher priority nodes. It is highly flexible, allowing node attribute and usage information to be combined with reservation affinity. Using node allocation priority, you can specify the following priority components:

- ADISK - Local disk currently available to batch jobs in MB.
- AMEM - Real memory currently available to batch jobs in MB.
- APROCS - Processors currently available to batch jobs on node (configured procs - dedicated procs).
- ARCH[<ARCH>] - Processor architecture.
- ASWAP - Virtual memory currently available to batch jobs in MB.
- CDISK - Total local disk allocated for use by batch jobs in MB.
- CMEM - Total real memory on node in MB.
- COST - Based on node CHARGERATE.
- CPROCS - Total processors on node.
- CSWAP - Total virtual memory configured on node in MB.
- FEATURE[<FNAME>] - Boolean; specified feature is present on node.
- FREETIME - FREETIME is calculated as the time during which there is no reservation on the machine. It uses either the job wallclock limit (if there is a job) or 2 months. The more free time a node has within either the job wallclock limit or 2 months, the higher this value will be.
- GMETRIC[<GMNAME>] - Current value of specified generic metric on node.
- JOBCOUNT - Number of jobs currently running on node.
- JOBFREETIME - The number of seconds that the node is idle between now and when the job is scheduled to start.
- LOAD - Current 1 minute load average.
- MTBF - Mean time between failures (in seconds).
- NODEINDEX - Node's nodeindex as specified by the resource manager.
- OS - True if job compute requirements match node operating system.
- PARAPROCS - Processors currently available to batch jobs within partition (configured procs - dedicated procs).
- POWER - TRUE if node is ON.
- PREF - Boolean; node meets job specific resource preferences.
- PRIORITY - Administrator specified node priority.
- RANDOM - Per iteration random value between 0 and 1. (Allows introduction of a random allocation factor.)

Example 5: Pack tasks onto nodes with the most processors available and the lowest CPU temperature.

RMCFG[torque] TYPE=pbs
RMCFG[temp]   TYPE=NATIVE CLUSTERQUERYURL=exec://$TOOLSDIR/hwmon.pl

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT]     PRIORITYF='100*APROCS - GMETRIC[temp]'
...
MINRESOURCE (MinimumConfiguredResources)
Prioritizes nodes according to the configured memory resources on each node. Those nodes with the fewest configured memory resources that still meet the job's resource constraints are selected.

CONTIGUOUS (Contiguous)
Allocates nodes in contiguous (linear) blocks as required by the Compaq RMS system.

MAXBALANCE (ProcessorSpeedBalance)
Attempts to allocate the most balanced set of nodes possible to a job. In most cases, but not all, the metric for balance of the nodes is node procspeed. Thus, if possible, nodes with identical procspeeds are allocated to the job. If identical procspeed nodes cannot be found, the algorithm allocates the set of nodes with the minimum node procspeed span or range.
User-Defined Algorithms
User-defined algorithms allow administrators to define their own algorithms
based on factors such as their system's network topology. When node
allocation is based on topology, jobs finish faster, administrators see better
cluster productivity and users pay less for resources.
PLUGIN
This algorithm allows administrators to define their own node allocation policy
and create a plug-in that allocates nodes based on factors such as a cluster's
network topology. This has the following advantages:
- plug-ins keep the source code of the cluster's interconnect network for node allocation separate from Moab's source code (customers can implement plug-ins independent of Moab's release schedule)
- plug-ins can be independently created and tailored to specific hardware and network topology
- plug-ins can be modified without assistance from Adaptive Computing, Inc.
Specifying Per Job Resource Preferences
While the resource based node allocation algorithms can make a good guess at
what compute resources would best satisfy a job, sites often possess a subset
of jobs that benefit from more explicit resource allocation specification. For
example, one job may perform best on a particular subset of nodes due to
direct access to a tape drive, while another may be very memory intensive. Resource
preferences are distinct from node requirements: the latter describe
what a job needs to run at all, while the former describe what the job needs to run
well. In general, a scheduler must satisfy a job's node requirement
specification and then satisfy the job's resource preferences as well as possible.
Specifying Resource Preferences
A number of resource managers natively support the concept of resource
preferences (such as LoadLeveler). When using these systems, the language-specific
preferences keywords may be used.
resource preferences natively, Moab provides a resource manager extension
keyword, "PREF," which you can use to specify desired resources. This
extension allows specification of node features, memory, swap, and disk space
conditions that define whether the node is considered preferred.
Moab 5.2 (and earlier) only supports feature-based preferences.
Selecting Preferred Resources
Enforcing resource preferences is not completely straightforward. A site may
have a number of potentially conflicting requirements that the scheduler is
asked to simultaneously satisfy. For example, a scheduler may be asked to
maximize the proximity of the allocated nodes at the same time it is supposed
to satisfy resource preferences and minimize node overcommitment. To allow
site specific weighting of these varying requirements, Moab allows resource
preferences to be enabled through the PRIORITY node allocation algorithm.
For example, to use resource preferences together with node load, the
following configuration might be used:
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT]     PRIORITYF='5 * PREF - LOAD'
...
To request specific resource preferences, a user could then submit a job
indicating those preferences. In the case of a PBS job, the following can be
used:
> qsub -l nodes=4,walltime=1:00:00,pref=feature:fast
Related Topics
Generic Metrics
Per Job Node Allocation Policy Specification via Resource Manager Extensions
Node Access Policies
Moab allocates resources to jobs on the basis of a job task—an atomic
collection of resources that must be co-located on a single compute node. A
given job may request 20 tasks where each task is defined as one processor
and 128 MB of RAM. Compute nodes with multiple processors often possess
enough resources to support more than one task simultaneously. When it is
possible for more than one task to run on a node, node access policies
determine which tasks may share the compute node's resources.
Moab supports a number of distinct node access policies, which are listed in the
following table:
Policy          Description
SHARED          Tasks from any combination of jobs may use available resources.
SHAREDONLY      Only jobs requesting shared node access may use available resources.
SINGLEACCOUNT   Tasks from any jobs owned by the same account may use available resources.
SINGLECLASS     Tasks from any jobs owned by the same class may use available resources.
SINGLEGROUP     Tasks from any jobs owned by the same group may use available resources.
SINGLEJOB       Only tasks from a single job may use the node's resources.
                When enforcing limits using CLASSCFG attributes, use MAX.NODE instead of MAX.PROC. MAX.PROC enforces the requested processors, not the actual processors dedicated to the job.
SINGLETASK      Only a single task from a single job may run on the node.
SINGLEUSER      Tasks from any jobs owned by the same user may use available resources.
UNIQUEUSER      Any number of tasks from a single job may allocate resources from a node, but only if the user has no other jobs running on that node. UNIQUEUSER limits the number of jobs a single user can run on a node, allowing other users to run jobs with the remaining resources.
                This policy is useful in environments where job epilog/prolog scripts are used to clean up processes based on userid.
Configuring Node Access Policies
The global node access policies may be specified via the parameter
NODEACCESSPOLICY. This global default may be overridden on a per node
basis with the ACCESS attribute of the NODECFG parameter or on a per job
basis using the resource manager extension NACCESSPOLICY. Finally, a per
queue node access policy may also be specified by setting either the
NODEACCESSPOLICY or FORCENODEACCESSPOLICY attributes of the
CLASSCFG parameter. FORCENODEACCESSPOLICY overrides any per job
specification in all cases, whereas NODEACCESSPOLICY is overridden by per job
specification.
When multiple node access policies apply to a given job or node (for
example SINGLEJOB is configured globally but the class is configured as
SHARED), then the more restrictive policy applies. The most restrictive policy
is SINGLETASK, followed by SINGLEJOB, then the single-credential policies,
with SHARED being the least restrictive.
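A brief sketch of the per-queue and per-job forms (the class name debug and script name job.cmd are illustrative):

# force all jobs in the debug class to share nodes
CLASSCFG[debug] FORCENODEACCESSPOLICY=SHARED

> msub -l naccesspolicy=singleuser job.cmd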
By default, nodes are accessible using the setting of the system wide
NODEACCESSPOLICY parameter unless a specific ACCESS policy is specified on a per
node basis using the NODECFG parameter. Jobs may override this policy and
subsequent jobs are bound to conform to the access policies of all jobs
currently running on a given node. For example, if the NODEACCESSPOLICY
parameter is set to SHARED, a new job may be launched on an idle node with a
job specific access policy of SINGLEUSER. While this job runs, the effective node
access policy changes to SINGLEUSER and subsequent job tasks may only be
launched on this node provided they are submitted by the same user. When all
single user jobs have completed on that node, the effective node access policy
reverts to SHARED and the node can again be used in SHARED mode.
For example, to set a global policy of SINGLETASK on all nodes except nodes 13
and 14, use the following:
# by default, enforce dedicated node access on all nodes
NODEACCESSPOLICY SINGLETASK
# allow nodes 13 and 14 to be shared
NODECFG[node13] ACCESS=SHARED
NODECFG[node14] ACCESS=SHARED
You can also set SINGLEJOB using the qsub node-exclusive option (-n). For
example:
qsub -n jobscript.sh
This will set node_exclusive = True in the output of qstat -f <job ID>.
Alternately, you could also use either of the following:
qsub -l naccesspolicy=singlejob jobscript.sh
qsub -W x=naccesspolicy:singlejob jobscript.sh
Related Topics
Per job naccesspolicy specification via Resource Manager Extensions
JOBNODEMATCHPOLICY parameter
NODEAVAILABILITY parameter
Node Availability Policies
- Node Resource Availability Policies
- Node Categorization
- Node Failure/Performance Based Notification
- Node Failure/Performance Based Triggers
- Handling Transient Node Failures
- Allocated Resource Failure Policy for Jobs
Moab enables several features relating to node availability. These include
policies that determine how per node resource availability should be reported,
how node failures are detected, and what should be done in the event of a
node failure.
Node Resource Availability Policies
Moab allows a job to be launched on a given compute node as long as the node
is not full or busy. The NODEAVAILABILITYPOLICY parameter allows a site to
determine what criteria constitute a node being busy. The legal settings are
listed in the following table:
Availability policy   Description
DEDICATED             The node is considered busy if dedicated resources equal or exceed configured resources.
UTILIZED              The node is considered busy if utilized resources equal or exceed configured resources.
COMBINED              The node is considered busy if either dedicated or utilized resources equal or exceed configured resources.
The default setting for all nodes is COMBINED, indicating that a node can accept
workload so long as the jobs that the node was allocated to do not request or
use more resources than the node has available. In a load balancing
environment, this may not be the desired behavior. Setting the
NODEAVAILABILITYPOLICY parameter to UTILIZED allows jobs to be packed onto a
node even if the aggregate resources requested exceed the resources
configured. For example, assume a scenario with a 4-processor compute node
and 8 jobs requesting 1 processor each. If the resource availability policy was
set to COMBINED, this node would only allow 4 jobs to start on this node even if
the jobs induced a load of less than 1.0 each. With the resource availability
policy set to UTILIZED, the scheduler continues allowing jobs to start on the node
until the node's load average exceeds a per processor load value of 1.0 (in this
case, a total load of 4.0). To prevent a node from being overpopulated within a
single scheduling iteration, Moab artificially raises the node's load for one
scheduling iteration when starting a new job. On subsequent iterations, the
actual measured node load information is used.
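A minimal sketch of such a load-balancing configuration:

# allow packing until measured load reaches capacity
NODEAVAILABILITYPOLICY UTILIZED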
Per Resource Availability Policies
By default, the NODEAVAILABILITYPOLICY sets a global per node resource
availability policy. This policy applies to all resource types on each node such as
processors, memory, swap, and local disk. However, the syntax of this
parameter is as follows:
<POLICY>[:<RESOURCETYPE>] ...
This syntax allows per resource availability specification. For example, consider
the following:
NODEAVAILABILITYPOLICY DEDICATED:PROC COMBINED:MEM COMBINED:DISK
...
This configuration causes Moab to only consider the quantity of processing
resources actually dedicated to active jobs running on each node and ignore
utilized processor information (such as CPU load). For memory and disk, both
utilized resource information and dedicated resource information should be
combined to determine what resources are actually available for new jobs.
Node Categorization
Moab allows organizations to detect and use far richer information regarding
node status than the standard batch "idle," "busy," and "down" states commonly
found. Using node categorization, organizations can record, track, and report
on per node and cluster level status, including the following categories:
Category               Description
Active                 Node is healthy and currently executing batch workload.
BatchFailure           Node is unavailable due to a failure in the underlying batch system (such as a resource manager server or resource manager node daemon).
Benchmark              Node is reserved for benchmarking.
EmergencyMaintenance   Node is reserved for unscheduled system maintenance.
GridReservation        Node is reserved for grid use.
HardwareFailure        Node is unavailable due to a failure in one or more aspects of its hardware configuration (such as a power failure, excessive temperature, memory, processor, or swap failure).
HardwareMaintenance    Node is reserved for scheduled system maintenance.
Idle                   Node is healthy and is currently not executing batch workload.
JobReservation         Node is reserved for job use.
NetworkFailure         Node is unavailable due to a failure in its network adapter or in the switch.
Other                  Node is in an uncategorized state.
OtherFailure           Node is unavailable due to a general failure.
PersonalReservation    Node is reserved for dedicated use by a personal reservation.
Site[1-8]              Site specified usage categorization.
SoftwareFailure        Node is unavailable due to a failure in a local software service (such as automounter, a security or information service such as NIS, local databases, or other required software services).
SoftwareMaintenance    Node is reserved for software maintenance.
StandingReservation    Node is reserved by a standing reservation.
StorageFailure         Node is unavailable due to a failure in the cluster storage system or local storage infrastructure (such as failures in Lustre, GPFS, PVFS, or SAN).
UserReservation        Node is reserved for dedicated use by a particular user or group and may or may not be actively executing jobs.
Node categories can be explicitly assigned by cluster administrators using the
mrsvctl -c command to create a reservation and associate a category with that
node for a specified timeframe. Further, outside of this explicit specification,
Moab automatically mines all configured interfaces to learn about its
environment and the health of the resources it is managing. Consequently,
Moab can identify many hardware failures, software failures, and batch failures
without any additional configuration. However, it is often desirable to make
additional information available to Moab to allow it to integrate this information
into reports; automatically notify managers, users, and administrators; adjust
internal policies to steer workload around failures; and launch various custom
triggers to rectify or mitigate the problem.
You can specify the FORCERSVSUBTYPE parameter to require all
administrative reservations be associated with a node category at
reservation creation time. For example:
NODECFG[DEFAULT] ENABLEPROFILING=TRUE
FORCERSVSUBTYPE TRUE
Node health and performance information from external systems can be
imported into Moab using the native resource manager interface. This is
commonly done using generic metrics or consumable generic resources for
performance and node categories or node variables for status information.
Combined with arbitrary node messaging information, Moab can combine
detailed information from remote services and report this to other external
services.
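For example, a native resource manager interface dedicated to health data might be declared as follows (the interface name healthmon and the script path are illustrative, mirroring the RMCFG pattern shown earlier in this chapter):

RMCFG[healthmon] TYPE=NATIVE CLUSTERQUERYURL=exec://$TOOLSDIR/node.query.pl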
Use the NODECATCREDLIST parameter to generate extended node
category based statistics.
Node Failure/Performance Based Notification
Moab can be configured to cause node failures and node performance levels
that cross specified thresholds to trigger notification events. This is
accomplished using the GEVENTCFG parameter as described in the Generic
Event Overview section. For example, the following configuration can be used
to trigger an email to administrators each time a node is marked down.
GEVENTCFG[nodedown] ACTION=notify REARM=00:20:00
...
Node Failure/Performance Based Triggers
Moab supports per node triggers that can be configured to fire when specific
events are fired or specific thresholds are met. These triggers can be used to
modify internal policies or take external actions. A few examples follow:
- decrease node allocation priority if node throughput drops below threshold X
- launch local diagnostic/recovery script if parallel file system mounts become stale
- reset high performance network adapters if high speed network connectivity fails
- create general system reservation on node if processor or memory failure occurs
As mentioned, Moab triggers can be used to initiate almost any action, from
sending mail, to updating a database, to publishing data for an SNMP trap, to
driving a web service.
Handling Transient Node Failures
Since Moab actively schedules both current and future actions of the cluster, it
is often important for it to have a reasonable estimate of when failed nodes will
again be available for use.
scheduling of new jobs and management of resources in regard to backfill.
With backfill, Moab determines which resources are available for priority jobs
and when the highest priority idle jobs can run. If a node experiences a failure,
Moab should have a concept of when this node will be restored.
When Moab analyzes down nodes for allocation, one of two issues may occur
with the highest priority jobs. If Moab believes that down nodes will not be
recovered for an extended period of time, a transient node failure within a
reservation for a priority job may cause the reservation to slide far into the
future allowing other lower priority jobs to allocate and launch on nodes
previously reserved for it. Moments later, when the transient node failures are
resolved, Moab may be unable to restore the early reservation start time as
other jobs may already have been launched on previously available nodes.
In the reverse scenario, if Moab recognizes a likelihood that down nodes will be
restored too quickly, it may make reservations for top priority jobs that allocate
those nodes. Over time, Moab slides those reservations further into the future
as it determines that the reserved nodes are not being recovered. While this
does not delay the start of the top priority jobs, these unfulfilled reservations
can end up blocking other jobs that should have properly been backfilled and
executed.
Creating Automatic Reservations
If a node experiences occasional transient failures (often not associated with a
node state of down), Moab can automatically create a temporary reservation
over the node to allow the transient failure time to clear and prevent Moab
from attempting to re-use the node while the failure is active. This reservation
behavior is controlled using the NODEFAILURERESERVETIME parameter as in
the following example:
# reserve nodes for 1 minute if transient failures are detected
NODEFAILURERESERVETIME 00:01:00
Blocking Out Down Nodes
If one or more resource managers identify failures and mark nodes as down,
Moab can be configured to associate a default unavailability time with this
failure and the node state down. This is accomplished using the
NODEDOWNSTATEDELAYTIME parameter. This delay time floats; it is
measured as a fixed time into the future from the time "NOW" and is not
associated with the time the node was originally marked down. For example, if
the delay time was set to 10 minutes, and a node was marked down 20 minutes
ago, Moab would still consider the node unavailable until 10 minutes into the
future.
While it is difficult to select a good default value that works for all clusters, the
following is a general rule of thumb:
- Increase NODEDOWNSTATEDELAYTIME if jobs are getting blocked due to priority reservations sliding as down nodes are not recovered.
- Decrease NODEDOWNSTATEDELAYTIME if high priority job reservations are getting regularly delayed due to transient node failures.
# assume down nodes will not be recovered for one hour
NODEDOWNSTATEDELAYTIME 01:00:00
Allocated Resource Failure Policy for Jobs
If a failure occurs within a collection of nodes allocated to a job, Moab can
automatically re-allocate replacement resources. This can be configured with
JOBACTIONONNODEFAILURE.
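A minimal sketch:

# requeue jobs that lose one or more allocated nodes to failure
JOBACTIONONNODEFAILURE REQUEUE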
How an active job behaves when one or more of its allocated resources fail
depends on the allocated resource failure policy. Depending on the type of job,
type of resources, and type of middleware infrastructure, a site may choose to
have different responses based on the job, the resource, and the type of
failure.
Failure Responses
By default, Moab cancels a job when an allocated resource failure is detected.
However, you can specify the following actions:
Option    Policy action
CANCEL    Cancels the job.
FAIL      Terminates the job as a failed job.
HOLD      Places a hold on the job. This option is only applicable if you are using checkpointing.
IGNORE    Ignores the failed node, allowing the job to proceed.
NOTIFY    Notifies the administrator and user of the failure but takes no further action.
REQUEUE   Requeues the job and allows it to run when alternate resources become available.
Policy Precedence
For a given job, the applied policy can be set at various levels with policy
precedence applied in the job, class/queue, partition, and then system level.
The following table indicates the available methods for setting this policy:
Object       Parameter and example
Job          RESFAILPOLICY resource manager extension
             > qsub -l resfailpolicy=requeue
Class/Queue  RESFAILPOLICY attribute of the CLASSCFG parameter
             CLASSCFG[batch] RESFAILPOLICY=CANCEL
Partition    JOBACTIONONNODEFAILURE attribute of the PARCFG parameter
             PARCFG[web3] JOBACTIONONNODEFAILURE=NOTIFY
System       NODEALLOCRESFAILUREPOLICY parameter
             NODEALLOCRESFAILUREPOLICY=MIGRATE
Failure Definition
Any allocated node going down constitutes a failure. However, for certain types
of workload, responses to failures may be different depending on whether it is
the master task (task 0) or a slave task that fails. To indicate that the
associated policy should only take effect if the master task fails, the allocated
resource failure policy should be specified with a trailing asterisk (*), as in the
following example:
CLASSCFG[virtual_services] RESFAILPOLICY=requeue*
Torque Failure Details
When a node fails and becomes unresponsive, the resource manager central
daemon identifies this failure within a configurable time frame (default: 60
seconds). Detection of this failure triggers an event that causes Moab to
immediately respond. Based on the specified policy, Moab notifies
administrators, holds the job, requeues the job, allocates replacement
resources to the job, or cancels the job. If the job is canceled or requeued,
Moab sends the request to Torque, which immediately frees all non-failed
resources making them available for use by other jobs. Once the failed node is
recovered, it contacts the resource manager central daemon, determines that
the associated job has been canceled/requeued, cleans up, and makes itself
available for new workload.
Related Topics
Node State Overview
JOBACTIONONNODEFAILURE parameter
NODEFAILURERESERVETIME parameter
NODEDOWNSTATEDELAYTIME parameter (down nodes will be marked unavailable for the specified
duration)
NODEDRAINSTATEDELAYTIME parameter (offline nodes will be marked unavailable for the specified
duration)
NODEBUSYSTATEDELAYTIME parameter (nodes with unexpected background load will be marked
unavailable for the specified duration)
NODEALLOCRESFAILUREPOLICY parameter (action to take if executing jobs have one or more allocated nodes fail)
Task Distribution Policies
Under Moab, task distribution policies are specified at a global scheduler level,
a global resource manager level, or at a per job level. In addition, you can set
up some aspects of task distribution as defaults on a per class basis.
Related Topics
Node Set Overview
Node Allocation Overview
Chapter 5 Managing Fairness - Throttling Policies,
Fairshare, and Allocation Management
- Fairness Overview
- Usage Limits/Throttling Policies
- Fairshare
- Accounting, Charging, and Allocation Management
Fairness Overview
The concept of cluster fairness varies widely from person to person and site to
site. While some interpret it as giving all users equal access to compute
resources, more complicated concepts incorporating historical resource usage,
political issues, and job value are equally valid. While no scheduler can address
all possible definitions of fair, Moab provides one of the industry's most
comprehensive and flexible sets of tools, allowing most sites to
address their many and varied fairness management needs.
Under Moab, most fairness policies are addressed by a combination of the
facilities described in the following table:
Job Prioritization

Description: Specifies what is most important to the scheduler. Using service based priority factors allows a site to balance job turnaround time, expansion factor, or other scheduling performance metrics.

Example:
SERVICEWEIGHT   1
QUEUETIMEWEIGHT 10

Causes jobs to increase in priority by 10 points for every minute they remain in the queue.
Usage Limits (Throttling Policies)

Description: Specifies limits on exactly what resources can be used at any given instant.

Example:
USERCFG[john]     MAXJOB=3
GROUPCFG[DEFAULT] MAXPROC=64
GROUPCFG[staff]   MAXPROC=128

Allows john to only run 3 jobs at a time. Allows the group staff to use up to 128 total processors and all other groups to use up to 64 processors.
Fairshare

Description: Specifies usage targets to limit resource access or adjust priority based on historical cluster and grid level resource usage.

Example:
USERCFG[steve] FSTARGET=25.0+
FSWEIGHT       1
FSUSERWEIGHT   10

Enables priority based fairshare and specifies a fairshare target for user steve such that his jobs are favored in an attempt to keep his jobs using at least 25.0% of delivered compute cycles.
Allocation Management

Description: Specifies long term, credential-based resource usage limits.

Example:
AMCFG[mam] TYPE=MAM HOST=server.sys.net

Enables the Moab Accounting Manager allocation management interface. Within the accounting manager, project or account based allocations may be configured. These allocations may, for example, allow project X to use up to 100,000 processor-hours per quarter, provide various QoS sensitive charge rates, and share allocation access.
Quality of Service

Description: Specifies additional resource and service access for particular users, groups, and accounts. QoS facilities can provide special priorities, policy exemptions, reservation access, and other benefits (as well as special charge rates).

Example:
QOSCFG[orion] PRIORITY=1000 XFTARGET=1.2
QOSCFG[orion] QFLAGS=PREEMPTOR,IGNSYSTEM,RESERVEALWAYS

Gives jobs requesting the orion QoS a priority increase, an expansion factor target to improve response time, the ability to preempt other jobs, an exemption from system level job size policies, and the ability to always reserve needed resources if they cannot start immediately.
Standing Reservations

Description: Reserves blocks of resources within the cluster for specific, periodic time frames under the constraints of a flexible access control list.

Example:
SRCFG[jupiter] HOSTLIST=node01[1-4]
SRCFG[jupiter] STARTTIME=9:00:00 ENDTIME=17:00:00
SRCFG[jupiter] USERLIST=john,steve ACCOUNTLIST=jupiter

Reserves nodes node011 through node014 from 9:00 AM until 5:00 PM for use by jobs from user john or steve or from the project jupiter.
Class/Queue Constraints

Description: Associates users, resources, priorities, and limits with cluster classes or cluster queues that can be assigned to or selected by end-users.

Example:
CLASSCFG[long] MIN.WCLIMIT=24:00:00
CLASSCFG[long] PRIORITY=10000
SRCFG[jupiter] HOSTLIST=acn[1-4][0-9]
SRCFG[jupiter] CLASSLIST=long&

Assigns long jobs a high priority but only allows them to run on certain nodes.
Selecting the Correct Policy Approach
Moab supports a rich set of policy controls, in some cases allowing a particular
policy to be enforced in more than one way. For example, cycle distribution can
be controlled using usage limits, fairshare, or even queue definitions. The most
appropriate policy depends on site objectives and needs; consider the
following when making such a decision:
- Minimal end-user training
  - Does the solution use an approach familiar to or easily learned by existing users?
- End-user transparency
  - Can the configuration be enabled or disabled without impacting user behavior or job submission?
- Impact on system utilization and system responsiveness
- Solution complexity
  - Is the impact of the configuration readily intuitive, and is it easy to identify possible side effects?
- Solution extensibility and flexibility
  - Will the proposed approach allow the solution to be easily tuned and extended as cluster needs evolve?
Related Topics
Job Prioritization
Usage Limits (Throttling Policies)
Fairshare
Allocation Management
Quality of Service
Standing Reservations
Class/Queue Constraints
Usage Limits/Throttling Policies
A number of Moab policies allow an administrator to control job flow through
the system. These throttling policies work as filters allowing or disallowing a job
to be considered for scheduling by specifying limits regarding system usage for
any given moment. These policies may be specified as global constraints or as
specific constraints on a per user, group, account, QoS, or class basis.
- Fairness via Throttling Policies
  - Basic Fairness Policies
  - Multi-Dimension Fairness Policies
- Override Limits
- Idle Job Limits
- Hard and Soft Limits
- Per-partition Limits
- Usage-based limits
  - Configuring Actions
  - Specifying Hard and Soft Policy Violations
  - Constraining Walltime Usage
Fairness via Throttling Policies
Moab allows significant flexibility with usage limits, or throttling policies. At a
high level, Moab allows resource usage limits to be specified in three primary
workload categories: (1) active, (2) idle, and (3) system job limits.
Basic Fairness Policies
Workload category   Description
Active job limits   Constrain the total cumulative resources available to active jobs at a given time.
Idle job limits     Constrain the total cumulative resources available to idle jobs at a given time.
System job limits   Constrain the maximum resource requirements of any single job.
These limits can be applied to any job credential (user, group, account, QoS,
and class), or on a system-wide basis. Using the keyword DEFAULT, a site may
also specify the default setting for the desired user, group, account, QoS, and
class. Additionally, you may configure QoS to allow limit overrides to any
particular policy.
To run, a job must meet all policy limits. Limits are applied using the *CFG set of
parameters, particularly USERCFG, GROUPCFG, ACCOUNTCFG, QOSCFG,
CLASSCFG, and SYSCFG. Limits are specified by associating the desired limit to
the individual or default object. The usage limits currently supported are listed
in the following table.
MAXARRAYJOB
Units:       number of simultaneously active array job sub-jobs
Description: Limits the number of simultaneously active (starting or running) array sub-jobs a credential can have.
Example:     USERCFG[gertrude] MAXARRAYJOB=10
             Gertrude can have a maximum of 10 active job array sub-jobs.
MAXGRES
Units:       # of concurrent uses of a generic resource
Description: Limits the concurrent usage of a generic resource to a specific quantity or quantity range.
Example:     USERCFG[joe] MAXGRES[matlab]=2
             USERCFG[jim] MAXGRES[matlab]=2,4
MAXJOB
Units:       # of jobs
Description: Limits the number of jobs a credential may have active (starting or running) at any given time. Moab places a hold on all new jobs submitted by that credential once it has reached its maximum number of allowable jobs.
             MAXJOB=0 is not supported. You can, however, achieve similar results by using the HOLD attribute of the USERCFG parameter:
             USERCFG[john] HOLD=yes
Example:     USERCFG[DEFAULT] MAXJOB=8
             GROUPCFG[staff]  MAXJOB=2,4
MAXMEM
Units:       total memory in MB
Description: Limits the total amount of dedicated memory (in MB) that can be allocated by a credential's active jobs at any given time.
Example:     ACCOUNTCFG[jasper] MAXMEM=2048
MAXNODE
Units:       # of nodes
Description: Limits the total number of compute nodes that can be in use by active jobs at any given time.
             Adaptive Computing recommends that you set JOBNODEMATCHPOLICY EXACTNODE when using MAXNODE. This ensures jobs submitted using the msub/qsub "-l nodes=#" syntax will have a node count associated with the request.
             On some systems (including Torque/PBS), nodes have been softly defined rather than strictly defined; that is, a job may request 2 nodes but Torque will translate this request into 1 node with 2 processors. This can prevent Moab from enforcing a MAXNODE policy correctly for a single job. Correct behavior can be achieved using MAXPROC.
Example:     CLASSCFG[batch] MAXNODE=64
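A combined sketch of that recommendation:

JOBNODEMATCHPOLICY EXACTNODE
CLASSCFG[batch]    MAXNODE=64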
MAXPE
Units:       # of processor equivalents
Description: Limits the total number of dedicated processor-equivalents that can be allocated by active jobs at any given time.
Example:     QOSCFG[base] MAXPE=128
MAXPROC
Units:       # of processors
Description: Limits the total number of dedicated processors that can be allocated by active jobs at any given time per credential. To set MAXPROC per job, use msub -W.
Example:     CLASSCFG[debug] MAXPROC=32
MAXPS
Units:       <# of processors> * <walltime>
Description: Limits the number of outstanding processor-seconds a credential may have allocated at any given time. For example, if a user has a 4-processor job that will complete in 1 hour and a 2-processor job that will complete in 6 hours, they have 4 * 1 * 3600 + 2 * 6 * 3600 = 16 * 3600 outstanding processor-seconds. The outstanding processor-second usage of each credential is updated each scheduling iteration, decreasing as jobs approach their completion time.
Example:     USERCFG[DEFAULT] MAXPS=720000
MAXSUBMITJOBS
Units:       # of jobs
Description: Limits the number of jobs a credential may submit and have in the system at once. Moab will reject any job submitted beyond this limit.
             If you use a Torque resource manager, you should also set max_user_queuable in case the user submits jobs via qsub instead of msub. See "Queue Attributes" in the Torque 6.0.1 Administrator Guide for more information.
Example:     USERCFG[DEFAULT] MAXSUBMITJOBS=5
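On the Torque side, the corresponding queue attribute might be set with qmgr (the queue name batch is illustrative):

> qmgr -c 'set queue batch max_user_queuable = 5'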
MAXWC
Units:       job duration [[[DD:]HH:]MM:]SS
Description: Limits the cumulative remaining walltime a credential may have associated with active jobs. It behaves identically to the MAXPS limit (listed earlier), only lacking the processor weighting. Like MAXPS, the cumulative remaining walltime of each credential is also updated each scheduling iteration.
             MAXWC does not limit the maximum wallclock limit per job. For this capability, use MAX.WCLIMIT.
Example:     USERCFG[ops] MAXWC=72:00:00
The following example demonstrates a simple limit specification:
USERCFG[DEFAULT] MAXJOB=4
USERCFG[john]    MAXJOB=8

This example allows user john to run up to 8 jobs while all other users may only run up to 4.
Simultaneous limits of different types may be applied per credential and
multiple types of credentials may have limits specified. The next example
demonstrates this mixing of limits and is a bit more complicated.
USERCFG[steve]    MAXJOB=2 MAXNODE=30
GROUPCFG[staff]   MAXJOB=5
CLASSCFG[DEFAULT] MAXNODE=16
CLASSCFG[batch]   MAXNODE=32
This configuration may potentially apply multiple limits to a single job. As
discussed previously, a job may only run if it satisfies all applicable limits. Thus,
in this example, the scheduler will be constrained to allow at most 2
simultaneous user steve jobs with an aggregate node consumption of no more
than 30 nodes. However, if the job is submitted to a class other than batch, it
may be limited further. Here, only 16 total nodes may be used simultaneously
by jobs running in any given class with the exception of the class batch. If steve
submitted a job to run in the class interactive, for example, and there were
jobs already running in this class using a total of 14 nodes, his job would be
blocked by the default limit of 16 nodes per class unless it requested 2 or
fewer nodes.
Multi-Dimension Fairness Policies and Per Credential Overrides
Multi-dimensional fairness policies allow a site to specify policies based on
combinations of job credentials. A common example might be setting a
maximum number of jobs allowed per queue per user or a total number of
processors per group per QoS. As with basic fairness policies, multi-dimension
policies are specified using the *CFG parameters or through the identity
manager interface. Moab supports the most commonly used multi-dimensional
fairness policies (listed in the table below) using the following format:
*CFG[X] <LIMITTYPE>[<CRED>]=<LIMITVALUE>
*CFG is one of USERCFG, GROUPCFG, ACCOUNTCFG, QOSCFG, or CLASSCFG, the
<LIMITTYPE> policy is one of the policies listed in the table in section 6.2.1.1, and
<CRED> is of the format <CREDTYPE>[:<VALUE>] with CREDTYPE being one of
USER, GROUP, ACCT, QOS, or CLASS. The optional <VALUE> setting can be
used to specify that the policy only applies to a specific credential value. For
example, the following configuration sets limits on the class fast, controlling the
maximum number of jobs any group can have active at any given time and the
number of processors in use at any given time for user steve.
CLASSCFG[fast] MAXJOB[GROUP]=12
CLASSCFG[fast] MAXPROC[USER:steve]=50
CLASSCFG[fast] MAXIJOB[USER]=10
The following example configuration may clarify further:
# allow class batch to run up to 3 simultaneous jobs
# allow any user to use up to 8 total nodes within the class
CLASSCFG[batch] MAXJOB=3 MAXNODE[USER]=8
# allow users steve and bob to use up to 3 and 4 total processors respectively within the class
CLASSCFG[fast] MAXPROC[USER:steve]=3 MAXPROC[USER:bob]=4
Multi-dimensional policies cannot be applied on DEFAULT credentials.
The table below lists the currently implemented, multi-dimensional usage limit
permutations. The "slmt" stands for "Soft Limit" and "hlmt" stands for "Hard
Limit."
Multi-dimension usage limit permutations
ACCOUNTCFG[name]
MAXIJOB[QOS]=hlmt
MAXIJOB[QOS:qosname]=hlmt
MAXIPROC[QOS]=hlmt
MAXIPROC[QOS:qosname]=hlmt
MAXJOB[QOS]=slmt,hlmt
MAXJOB[QOS:qosname]=slmt,hlmt
MAXJOB[USER]=slmt,hlmt
MAXJOB[USER:username]=slmt,hlmt
MAXMEM[USER]=slmt,hlmt
MAXMEM[USER:username]=slmt,hlmt
MAXNODE[USER]=slmt,hlmt
MAXNODE[USER:username]=slmt,hlmt
MAXPE[QOS]=slmt,hlmt
MAXPE[QOS:qosname]=slmt,hlmt
MAXPROC[USER]=slmt,hlmt
MAXPROC[USER:username]=slmt,hlmt
MAXPROC[QOS]=slmt,hlmt
MAXPROC[QOS:qosname]=slmt,hlmt
MAXPS[QOS]=slmt,hlmt
MAXPS[QOS:qosname]=slmt,hlmt
MAXPS[USER]=slmt,hlmt
MAXPS[USER:username]=slmt,hlmt
MAXWC[USER]=slmt,hlmt
MAXWC[USER:username]=slmt,hlmt
CLASSCFG[name]
MAXIJOB[USER]=hlmt
MAXJOB[GROUP]=slmt,hlmt
MAXJOB[GROUP:groupname]=slmt,hlmt
MAXJOB[QOS:qosname]=hlmt
MAXJOB[USER]=slmt,hlmt
MAXJOB[USER:username]=slmt,hlmt
MAXMEM[GROUP]=slmt,hlmt
MAXMEM[GROUP:groupname]=slmt,hlmt
MAXMEM[QOS:qosname]=hlmt
MAXMEM[USER]=slmt,hlmt
MAXMEM[USER:username]=slmt,hlmt
MAXNODE[GROUP]=slmt,hlmt
MAXNODE[GROUP:groupname]=slmt,hlmt
MAXNODE[QOS:qosname]=hlmt
MAXNODE[USER]=slmt,hlmt
MAXNODE[USER:username]=slmt,hlmt
MAXPE[GROUP]=slmt,hlmt
MAXPE[GROUP:groupname]=slmt,hlmt
MAXPE[QOS:qosname]=hlmt
MAXPE[USER]=slmt,hlmt
MAXPE[USER:username]=slmt,hlmt
MAXPROC[GROUP]=slmt,hlmt
MAXPROC[GROUP:groupname]=slmt,hlmt
MAXPROC[QOS:qosname]=hlmt
MAXPROC[USER]=slmt,hlmt
MAXPROC[USER:username]=slmt,hlmt
MAXPS[GROUP]=slmt,hlmt
MAXPS[GROUP:groupname]=slmt,hlmt
MAXPS[QOS:qosname]=hlmt
MAXPS[USER]=slmt,hlmt
MAXPS[USER:username]=slmt,hlmt
MAXWC[GROUP]=slmt,hlmt
MAXWC[GROUP:groupname]=slmt,hlmt
MAXWC[QOS:qosname]=hlmt
MAXWC[USER]=slmt,hlmt
MAXWC[USER:username]=slmt,hlmt
GROUPCFG[name]
MAXJOB[CLASS:classname]=slmt,hlmt
MAXJOB[USER]=slmt,hlmt
MAXJOB[USER:username]=slmt,hlmt
MAXMEM[CLASS:classname]=slmt,hlmt
MAXMEM[USER]=slmt,hlmt
MAXMEM[USER:username]=slmt,hlmt
MAXNODE[CLASS:classname]=slmt,hlmt
MAXNODE[USER]=slmt,hlmt
MAXNODE[USER:username]=slmt,hlmt
MAXPE[CLASS:classname]=slmt,hlmt
MAXPE[USER]=slmt,hlmt
MAXPE[USER:username]=slmt,hlmt
MAXPROC[CLASS:classname]=slmt,hlmt
MAXPROC[USER]=slmt,hlmt
MAXPROC[USER:username]=slmt,hlmt
MAXPS[CLASS:classname]=slmt,hlmt
MAXPS[USER]=slmt,hlmt
MAXPS[USER:username]=slmt,hlmt
MAXWC[CLASS:classname]=slmt,hlmt
MAXWC[USER]=slmt,hlmt
MAXWC[USER:username]=slmt,hlmt
QOSCFG[name]
MAXIJOB[ACCT]=hlmt
MAXIJOB[ACCT:accountname]=hlmt
MAXIJOB[USER]=hlmt
MAXINODE[ACCT]=slmt,hlmt
MAXINODE[ACCT:accountname]=slmt,hlmt
MAXINODE[USER]=hlmt
MAXINODE[USER:username]=slmt,hlmt
MAXIPROC[ACCT]=hlmt
MAXIPROC[ACCT:accountname]=hlmt
MAXJOB[ACCT]=slmt,hlmt
MAXJOB[ACCT:accountname]=slmt,hlmt
MAXJOB[USER]=slmt,hlmt
MAXJOB[USER:username]=slmt,hlmt
MAXMEM[USER]=slmt,hlmt
MAXMEM[USER:username]=slmt,hlmt
MAXNODE[USER]=slmt,hlmt
MAXNODE[USER:username]=slmt,hlmt
MAXPE[ACCT]=slmt,hlmt
MAXPE[ACCT:accountname]=slmt,hlmt
MAXPE[USER]=slmt,hlmt
MAXPE[USER:username]=slmt,hlmt
MAXPROC[ACCT]=slmt,hlmt
MAXPROC[ACCT:accountname]=slmt,hlmt
MAXPROC[USER]=slmt,hlmt
MAXPROC[USER:username]=slmt,hlmt
MAXPS[ACCT]=slmt,hlmt
MAXPS[ACCT:accountname]=slmt,hlmt
MAXPS[USER]=slmt,hlmt
MAXPS[USER:username]=slmt,hlmt
MAXWC[USER]=slmt,hlmt
MAXWC[USER:username]=slmt,hlmt
USERCFG[name]
MAXJOB[GROUP]=slmt,hlmt
MAXJOB[GROUP:groupname]=slmt,hlmt
MAXMEM[GROUP]=slmt,hlmt
MAXMEM[GROUP:groupname]=slmt,hlmt
MAXNODE[GROUP]=slmt,hlmt
MAXNODE[GROUP:groupname]=slmt,hlmt
MAXPE[GROUP]=slmt,hlmt
MAXPE[GROUP:groupname]=slmt,hlmt
MAXPROC[GROUP]=slmt,hlmt
MAXPROC[GROUP:groupname]=slmt,hlmt
MAXPS[GROUP]=slmt,hlmt
MAXPS[GROUP:groupname]=slmt,hlmt
MAXWC[GROUP]=slmt,hlmt
MAXWC[GROUP:groupname]=slmt,hlmt
Override Limits
Like all job credentials, the QoS object may be associated with resource usage
limits. However, this credential can also be given special override limits that
supersede the limits of other credentials, effectively causing all other limits of
the same type to be ignored. See QoS Usage Limits and Overrides for a
complete list of policies that can be overridden. The following configuration
provides an example of this in the last line:
USERCFG[steve]    MAXJOB=2 MAXNODE=30
GROUPCFG[staff]   MAXJOB=5
CLASSCFG[DEFAULT] MAXNODE=16
CLASSCFG[batch]   MAXNODE=32
QOSCFG[hiprio]    OMAXJOB=3 OMAXNODE=64
Only 3 hiprio QoS jobs may run simultaneously, and hiprio QoS jobs may run with up to 64 nodes per
credential, ignoring other credential MAXNODE limits.
Given the preceding configuration, assume a job is submitted with the
credentials user steve, group staff, class batch, and QoS hiprio.
Such a job will start so long as all of the following conditions hold:
- Total nodes used by user steve do not exceed 64.
- Total active jobs associated with user steve do not exceed 2.
- Total active jobs associated with group staff do not exceed 5.
- Total nodes dedicated to class batch do not exceed 64.
- Total active jobs associated with QoS hiprio do not exceed 3.
While the preceding example is a bit complicated for most sites, similar
combinations may be required to enforce policies found on many systems.
Idle Job Limits
Idle (or queued) job limits control which jobs are eligible for scheduling. To be
eligible for scheduling, a job must meet the following conditions:
- Be idle as far as the resource manager is concerned (no holds).
- Have all job prerequisites satisfied (no outstanding job or data dependencies).
- Meet all idle job throttling policies.
If a job fails to meet any of these conditions, it will not be considered for
scheduling and will not accrue service based job prioritization. (See Service
(SERVICE) Component and JOBPRIOACCRUALPOLICY.) The primary purpose
of idle job limits is to ensure fairness among competing users by preventing
queue stuffing and other similar abuses. Queue stuffing occurs when a single
entity submits large numbers of jobs, perhaps thousands, all at once so they
begin accruing queue time based priority and remain first to run despite
subsequent submissions by other users.
Idle limits are specified in a manner almost identical to active job limits with the
insertion of the capital letter I into the middle of the limit name. The following
tables describe the MAXIARRAYJOB, MAXIJOB, and MAXINODE limits, which are idle
limit equivalents to MAXARRAYJOB, MAXJOB, and MAXNODE limits,
respectively.
MAXIARRAYJOB
Units
Number of simultaneous idle array job sub-jobs.
Description
Limits the number of simultaneously idle (eligible) job array sub-jobs across all job arrays submitted by a credential.
Example
USERCFG[gertrude] MAXARRAYJOB=10 MAXIARRAYJOB=5
Gertrude can have a maximum of 10 active job array sub-jobs and 5 eligible job array
sub-jobs.
MAXIJOB
Units
# of jobs
Description
Limits the number of idle (eligible) jobs a credential may have at any given time.
Example
USERCFG[DEFAULT] MAXIJOB=8
GROUPCFG[staff]  MAXIJOB=4
MAXINODE
Units
# of nodes
Description
Limits the total number of compute nodes that can be requested by jobs in the eligible/idle queue
at any time. Once the limit is exceeded, the remaining jobs will be placed in the blocked queue. The
number of nodes is determined by <tasks> / <maximumProcsOnOneNode> or, if using
JOBNODEMATCHPOLICY EXACTNODE, by the number of nodes requested.
Example
USERCFG[DEFAULT] MAXINODE=2
Idle limits can constrain the total number of jobs considered to be eligible on a
per credential basis. Further, like active job limits, idle job limits can also
constrain eligible jobs based on aggregate requested resources. This could, for
example, allow a site to indicate that for a given user, only jobs requesting up
to a total of 64 processors, or 3200 processor-seconds would be considered at
any given time. Selection is accomplished by prioritizing all idle jobs and then
adding them to the eligible list one at a time in priority order until no further
jobs can be added. This eligible job selection is done only once per scheduling
iteration; consequently, idle job limits only support a single hard limit
specification, and any specified soft limit is ignored.
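Applying the idle-limit naming rule above to that scenario gives roughly the following sketch; MAXIPROC is a documented idle limit, while MAXIPS (the idle counterpart of MAXPS) and the values shown are assumptions for illustration:
USERCFG[DEFAULT] MAXIPROC=64 MAXIPS=3200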
All single dimensional job limit types supported as active job limits are also
supported as idle job limits. In addition, Moab also supports MAXIJOB[USER]
and MAXIPROC[USER] policies on a per class basis. (See Basic Fairness
Policies.)
Example:
USERCFG[steve]   MAXIJOB=2
GROUPCFG[staff]  MAXIJOB=5
CLASSCFG[batch]  MAXIJOB[USER]=2 MAXIJOB[USER:john]=6
QOSCFG[hiprio]   MAXIJOB=3
Hard and Soft Limits
Hard and soft limit specification allows a site to balance both fairness and
utilization on a given system. Typically, throttling limits are used to constrain
the quantity of resources a given credential (such as user or group) is allowed
to consume. These limits can be very effective in enforcing fair usage among a
group of users. However, in a lightly loaded system, or one in which there are
significant swings in usage from project to project, these limits can reduce
system utilization by blocking jobs even when no competing jobs are queued.
Soft limits help address this problem by providing additional scheduling
flexibility. They allow sites to specify two tiers of limits: the more constraining
soft limits are in effect in heavily loaded situations and reflect tight fairness
constraints, while the more flexible hard limits specify how flexible the
scheduler can be in selecting jobs when there are idle resources available after
all jobs meeting the tighter soft limits have started. Soft and hard limits are
specified in the format [<SOFTLIMIT>,]<HARDLIMIT>. For example, a given
site may want to use the following configuration:
USERCFG[DEFAULT] MAXJOB=2,8
With this configuration, the scheduler would select all jobs that meet the per user MAXJOB limit of 2. It
would then attempt to start and reserve resources for all of these selected jobs. If after doing so there
still remain available resources, the scheduler would then select all jobs that meet the less constraining
hard per user MAXJOB limit of 8 jobs. These jobs would then be scheduled and reserved as available
resources allow.
If no soft limit is specified or the soft limit is less constraining than the hard limit, the soft limit is set
equal to the hard limit.
Example:
USERCFG[steve]    MAXJOB=2,4 MAXNODE=15,30
GROUPCFG[staff]   MAXJOB=2,5
CLASSCFG[DEFAULT] MAXNODE=16,32
CLASSCFG[batch]   MAXNODE=12,32
QOSCFG[hiprio]    MAXJOB=3,5 MAXNODE=32,64
Job preemption status can be adjusted based on whether the job violates
a soft policy using the ENABLESPVIOLATIONPREEMPTION parameter.
Per-partition Limits
Per-partition scheduling can set limits and enforce credentials and polices on a
per-partition basis. Configuration for per-partition scheduling is done on the
grid head. In a grid, each Moab cluster is considered a partition. Per-partition
scheduling is typically used in a Master/Slave grid.
To enable per-partition scheduling, add the following to moab.cfg:
PERPARTITIONSCHEDULING TRUE
JOBMIGRATEPOLICY JUSTINTIME
With per-partition scheduling, it is recommended that limits go on the
specific partitions and not on the global level. If limits are specified on both
levels, Moab will take the more constricting of the limits. Also, please note
that a DEFAULT policy on the global partition is not overridden by any
policy on a specific partition.
Per-partition Limits
You can configure per-job limits and credential usage limits on a per-partition
basis in the moab.cfg file. Here is a sample configuration for partitions g02 and
g03 in moab.cfg.
PARCFG[g02] CONFIGFILE=/opt/moab/parg02.cfg
PARCFG[g03] CONFIGFILE=/opt/moab/parg03.cfg
You can then add per-partition limits in each partition configuration file:
# /opt/moab/parg02.cfg
CLASSCFG[pbatch] MAXJOB=5

# /opt/moab/parg03.cfg
CLASSCFG[pbatch] MAXJOB=10
You can configure Moab so that jobs submitted to any partition besides g02 and
g03 get the default limits in moab.cfg:
CLASSCFG[pbatch] MAXJOB=2
Supported Credentials and Limits
The user, group, account, QoS, and class credentials are supported in per-partition scheduling.
The following per-job limits are supported:
- MAX.NODE
- MAX.WCLIMIT
- MAX.PROC
The following credential usage limits are supported:
- MAXJOB
- MAXNODE
- MAXPROC
- MAXWC
- MAXSUBMITJOBS
Multi-dimensional limits are supported for the listed credentials and per-job
limits. For example:
CLASSCFG[pbatch] MAXJOB[user:frank]=10
Usage-based limits
Resource usage limits constrain the amount of resources a given job may
consume. These limits are generally proportional to the resources requested
and may include walltime, any standard resource, or any specified generic
resource. The parameter RESOURCELIMITPOLICY controls which resources
are limited, what limit policy is enforced per resource, and what actions the
scheduler should take in the event of a policy violation.
Configuring Actions
The RESOURCELIMITPOLICY parameter accepts a number of policies, resources,
and actions using the format and values defined below.
If walltime is the resource to be limited, be sure that the resource
manager is configured to not interfere if a job surpasses its given
walltime. For Torque, this is done by using $ignwalltime in the
configuration on each MOM node.
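For Torque, that MOM-side setting might look like the following; the exact configuration path varies by installation and is an assumption here:
# in the MOM configuration file on each node, e.g. <TORQUE_HOME>/mom_priv/config
$ignwalltime true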
Format
RESOURCELIMITPOLICY <RESOURCE>:[<SPOLICY>,]<HPOLICY>:[<SACTION>,]<HACTION>[:[<SVIOLATIONTIME>,]<HVIOLATIONTIME>]...
Resource     Description
CPUTIME      Maximum total job proc-seconds used by any single job (allows scheduler enforcement of cpulimit).
DISK         Local disk space (in MB) used by any single job task.
JOBMEM       Maximum real memory/RAM (in MB) used by any single job. JOBMEM will only work with the MAXMEM flag.
JOBPROC      Maximum processor load associated with any single job. You must set MAXPROC to use JOBPROC.
MEM          Maximum real memory/RAM (in MB) used by any single job task.
MINJOBPROC   Minimum processor load associated with any single job (action taken if job is using 5% or less of potential CPU usage).
NETWORK      Maximum network load associated with any single job task.
PROC         Maximum processor load associated with any single job task.
SWAP         Maximum virtual memory/SWAP (in MB) used by any single job task.
WALLTIME     Requested job walltime.
Policy                Description
ALWAYS                Take action whenever a violation is detected.
EXTENDEDVIOLATION     Take action only if a violation is detected and persists for greater than the specified time limit.
BLOCKEDWORKLOADONLY   Take action only if a violation is detected and the constrained resource is required by another job.
Action       Description
CANCEL       Terminate the job.
CHECKPOINT   Checkpoint and terminate the job.
MIGRATE      Requeue the job and require a different set of hosts for execution.
NOTIFY       Notify admins and the job owner regarding the violation.
REQUEUE      Terminate and requeue the job.
SUSPEND      Suspend the job and leave it suspended for an amount of time defined by the MINADMINSTIME parameter.
Example 5-1: Notify and then cancel job if requested memory is exceeded
# if job exceeds memory usage, immediately notify owner
# if job exceeds memory usage for more than 5 minutes, cancel the job
RESOURCELIMITPOLICY MEM:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:00:05:00
Example 5-2: Checkpoint job on walltime violations
# if job exceeds requested walltime, checkpoint job
RESOURCELIMITPOLICY WALLTIME:ALWAYS:CHECKPOINT
# when checkpointing, send term signal, followed by kill 1 minute later
RMCFG[base] TYPE=PBS CHECKPOINTTIMEOUT=00:01:00 CHECKPOINTSIG=SIGTERM
Example 5-3: Cancel jobs that use 5% or less of potential CPU usage for more than 5 minutes
RESOURCELIMITPOLICY MINJOBPROC:EXTENDEDVIOLATION:CANCEL:5:00
Example 5-4: Migrating a job when it blocks other workload
RESOURCELIMITPOLICY JOBPROC:BLOCKEDWORKLOADONLY:MIGRATE
Specifying Hard and Soft Policy Violations
Moab is able to perform different actions for both hard and soft policy
violations. In most resource management systems, a mechanism does not
exist to allow the user to specify both hard and soft limits. To address this,
Moab provides the RESOURCELIMITMULTIPLIER parameter that allows per-partition
and per-resource multiplier factors to be specified to generate the
actual hard and soft limits to be used. If the factor is less than one, the soft limit
will be lower than the specified value and a Moab action will be taken before the
specified limit is reached. If the factor is greater than one, the hard limit will be
set higher than the specified limit allowing a buffer space before the hard limit
action is taken.
In the following example, job owners will be notified by email when their
memory reaches 100% of the target, and the job will be canceled if it reaches
125% of the target. For wallclock usage, the job will be requeued when it
reaches 90% of the specified limit if another job is waiting for its resources, and
it will be checkpointed when it reaches the full limit.
RESOURCELIMITPOLICY     MEM:ALWAYS,ALWAYS:NOTIFY,CANCEL
RESOURCELIMITPOLICY     WALLTIME:BLOCKEDWORKLOADONLY,ALWAYS:REQUEUE,CHECKPOINT
RESOURCELIMITMULTIPLIER MEM:1.25,WALLTIME:0.9
Constraining Walltime Usage
While Moab constrains walltime using the parameter RESOURCELIMITPOLICY
like other resources, it also allows walltime exception policies which are not
available with other resources. In particular, Moab allows jobs to exceed the
requested wallclock limit by an amount specified on a global basis using the
JOBMAXOVERRUN parameter or on a per credential basis using the
WCOVERRUN attribute of the CLASSCFG parameter.
JOBMAXOVERRUN   00:10:00
CLASSCFG[debug] WCOVERRUN=00:00:30
Related Topics
RESOURCELIMITPOLICY parameter
FSTREE parameter (set usage limits within share tree hierarchy)
Credential Overview
JOBMAXOVERRUN parameter
WCVIOLATIONACTION parameter
RESOURCELIMITMULTIPLIER parameter
Fairshare
Fairshare allows historical resource utilization information to be incorporated
into job feasibility and priority decisions. This feature allows site administrators
to set system utilization targets for users, groups, accounts, classes, and QoS
levels. Administrators can also specify the time frame over which resource
utilization is evaluated in determining whether the goal is being reached.
Parameters allow sites to specify the utilization metric, how historical
information is aggregated, and the effect of fairshare state on scheduling
behavior. You can specify fairshare targets for any credentials (such as user,
group, and class) that administrators want such information to affect.
- Fairshare Parameters
  - FSPOLICY - Specifying the Metric of Consumption
  - Specifying Fairshare Timeframe
  - Managing Fairshare Data
- Using Fairshare Information
  - Fairshare Targets
  - Fairshare Caps
  - Priority-Based Fairshare
  - Per-Credential Fairshare Weights
  - Fairshare Usage Scaling
  - Extended Fairshare Examples
- Hierarchical Fairshare/Share Trees
  - Defining the Tree
  - Controlling Tree Evaluation
Fairshare Parameters
Fairshare is configured at two levels. First, at a system level, configuration is
required to determine how fairshare usage information is to be collected and
processed. Second, some configuration is required at the credential level to
determine how this fairshare information affects particular jobs. The following
are system level parameters:
Parameter    Description
FSINTERVAL   Duration of each fairshare window.
FSDEPTH      Number of fairshare windows factored into current fairshare utilization.
FSDECAY      Decay factor applied to weighting the contribution of each fairshare window.
FSPOLICY     Metric to use when tracking fairshare usage.
Credential level configuration consists of specifying fairshare utilization targets
using the *CFG suite of parameters, including ACCOUNTCFG, CLASSCFG,
GROUPCFG, QOSCFG, and USERCFG.
If global (multi-cluster) fairshare is used, Moab must be configured to
synchronize this information with an identity manager.
Image 5-1: Effective fairshare over 7 days
FSPOLICY - Specifying the Metric of Consumption
As Moab runs, it records how available resources are used. Each iteration
(RMPOLLINTERVAL seconds) it updates fairshare resource utilization statistics.
Resource utilization is tracked in accordance with the FSPOLICY parameter
allowing various aspects of resource consumption information to be measured.
This parameter allows selection of both the types of resources to be tracked as
well as the method of tracking. It provides the option of tracking usage by
dedicated or consumed resources, where dedicated usage tracks what the
scheduler assigns to the job and consumed usage tracks what the job actually
uses.
Metric         Description
DEDICATEDPES   Usage tracked by processor-equivalent seconds dedicated to each job. This is based on the total number of dedicated processor-equivalent seconds delivered in the system. Useful in dedicated and shared node environments.
DEDICATEDPS    Usage tracked by processor seconds dedicated to each job. This is based on the total number of dedicated processor seconds delivered in the system. Useful in dedicated node environments.
DEDICATEDPS%   Usage tracked by processor seconds dedicated to each job. This is based on the total number of dedicated processor seconds available in the system.
[NONE]         Disables fairshare.
UTILIZEDPS     Usage tracked by processor seconds used by each job. This is based on the total number of utilized processor seconds delivered in the system. Useful in shared node/SMP environments.
Example 5-5:
An example may clarify the use of the FSPOLICY parameter. Assume a 4-processor
job is running a parallel /bin/sleep for 15 minutes. It will have a dedicated
fairshare usage of 1 processor-hour but a consumed fairshare usage of essentially
nothing since it did not consume anything. Most often, dedicated fairshare usage
is used on dedicated resource platforms while consumed tracking is used in shared
SMP environments.
FSPOLICY   DEDICATEDPS%
FSINTERVAL 24:00:00
FSDEPTH    28
FSDECAY    0.75
Specifying Fairshare Timeframe
When configuring fairshare, it is important to determine the proper timeframe
that should be considered. Many sites choose to incorporate historical usage
information from the last one to two weeks while others are only concerned
about the events of the last few hours. The correct setting is very site
dependent and usually incorporates both average job turnaround time and site
mission policies.
With Moab's fairshare system, time is broken into a number of distinct fairshare
windows. Sites configure the amount of time they want to consider by
specifying two parameters, FSINTERVAL and FSDEPTH. The FSINTERVAL
parameter specifies the duration of each window while the FSDEPTH parameter
indicates the number of windows to consider. Thus, the total time evaluated by
fairshare is simply FSINTERVAL * FSDEPTH.
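For instance, a site wanting fairshare to evaluate exactly two weeks of history in half-day windows could use the following illustrative settings (28 windows * 12 hours = 14 days):
FSINTERVAL 12:00:00
FSDEPTH    28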
Many sites want to limit the impact of fairshare data according to its age. The
FSDECAY parameter allows this, causing the most recent fairshare data to
contribute more to a credential's total fairshare usage than older data. This
parameter is specified as a standard decay factor, which is applied to the
fairshare data. Generally, decay factors are specified as a value between 1 and
0 where a value of 1 (the default) indicates that no decay is applied. The
smaller the number, the more rapid the decay using the calculation
WeightedValue = Value * <DECAY> ^ <N> where <N> is the window
number. The following table shows the impact of a number of commonly used
decay factors on the percentage contribution of each fairshare window.
Decay Factor   Win0   Win1   Win2   Win3   Win4   Win5   Win6   Win7
1.00           100%   100%   100%   100%   100%   100%   100%   100%
0.80           100%    80%    64%    51%    41%    33%    26%    21%
0.75           100%    75%    56%    42%    31%    23%    17%    12%
0.50           100%    50%    25%    13%     6%     3%     2%     1%
While selecting how the total fairshare time frame is broken up between the
number and length of windows is a matter of preference, it is important to note
that more windows will cause the decay factor to degrade the contribution of
aged data more quickly.
Managing Fairshare Data
Using the selected fairshare usage metric, Moab continues to update the
current fairshare window until it reaches a fairshare window boundary, at which
point it rolls the fairshare window and begins updating the new window. The
information for each window is stored in its own file located in the Moab
statistics directory. Each file is named FS.<EPOCHTIME>[.<PNAME>] where
<EPOCHTIME> is the time the new fairshare window became active (see
sample data file) and <PNAME> is only used if per-partition share trees are
configured. Each window contains utilization information for each entity as well
as for total usage.
Historical fairshare data is recorded in the fairshare file using the metric
specified by the FSPOLICY parameter. By default, this metric is processor-seconds.
Historical fairshare data can be directly analyzed and reported using the
mdiag -f -v command.
When Moab needs to determine current fairshare usage for a particular
credential, it calculates a decay-weighted average of the usage information for
that credential using the most recent fairshare intervals where the number of
windows evaluated is controlled by the FSDEPTH parameter. For example,
assume the credential of interest is user john and the following parameters are
set:
FSINTERVAL 12:00:00
FSDEPTH    4
FSDECAY    0.5
Further assume that the fairshare usage intervals have the following usage
amounts:
Fairshare interval   Total user john usage   Total cluster usage
0                    60                      110
1                    0                       125
2                    10                      100
3                    50                      150
Based on this information, the current fairshare usage for user john would be
calculated as follows:
Usage = (60 * 1 + .5^1 * 0 + .5^2 * 10 + .5^3 * 50) / (110 + .5^1 * 125 +
.5^2 * 100 + .5^3 * 150) = 68.75 / 216.25, or roughly 31.8% of delivered cycles.
The current fairshare usage is relative to the actual resources delivered by
the system over the timeframe evaluated, not the resources available or
configured during that time.
Historical fairshare data is organized into a number of data files, each file
containing the information for a length of time as specified by the
FSINTERVAL parameter. Although FSDEPTH, FSINTERVAL, and FSDECAY can
be freely and dynamically modified, such changes may result in
unexpected fairshare status for a period of time as the fairshare data files
with the old FSINTERVAL setting are rolled out.
Using Fairshare Information
Fairshare Targets
Once the global fairshare policies have been configured, the next step involves
applying resulting fairshare usage information to affect scheduling behavior. As
mentioned in the Fairshare Overview, by specifying fairshare targets, site
administrators can configure how fairshare information impacts scheduling
behavior. The targets can be applied to user, group, account, QoS, or class
credentials using the FSTARGET attribute of *CFG credential parameters. These
targets allow fairshare information to affect job priority and each target can be
independently selected to be one of the types documented in the following
table:
Target type - Ceiling
Target modifier: -
Job impact: Priority
Format: Percentage Usage
Description: Adjusts job priority down when usage exceeds target. See How violated ceilings and floors affect fairshare-based priority for more information on how ceilings affect job priority.

Target type - Floor
Target modifier: +
Job impact: Priority
Format: Percentage Usage
Description: Adjusts job priority up when usage falls below target. See How violated ceilings and floors affect fairshare-based priority for more information on how floors affect job priority.

Target type - Target
Target modifier: N/A
Job impact: Priority
Format: Percentage Usage
Description: Adjusts job priority when usage does not meet target.
Setting a fairshare target value of 0 indicates that there is no target and
that the priority of jobs associated with that credential should not be
affected by the credential's previous fairshare target. If you want a
credential's cluster usage near 0%, set the target to a very small value,
such as 0.001.
Example
The following example increases the priority of jobs belonging to user john until
he reaches 16.5% of total cluster usage. All other users have priority adjusted
both up and down to bring them to their target usage of 10%:
FSPOLICY         DEDICATEDPS
FSWEIGHT         1
FSUSERWEIGHT     100
USERCFG[john]    FSTARGET=16.5+
USERCFG[DEFAULT] FSTARGET=10
...
Fairshare Caps
Where fairshare targets affect a job's priority and position in the eligible queue,
fairshare caps affect a job's eligibility. Caps can be applied to users, accounts,
groups, classes, and QoSs using the FSCAP attribute of *CFG credential
parameters and can be configured to modify scheduling behavior. Unlike
fairshare targets, if a credential reaches its fairshare cap, its jobs can no longer
run and are thus removed from the eligible queue and placed in the blocked
queue. In this respect, fairshare targets behave like soft limits and fairshare
caps behave like hard limits. Fairshare caps can be absolute or relative as
described in the following table. If no modifier is specified, the cap is
interpreted as relative.
Absolute Cap
Cap Modifier: ^
Job Impact: Feasibility
Format: Absolute Usage
Description: Constrains job eligibility as an absolute quantity measured according to the scheduler charge metric as defined by the FSPOLICY parameter.

Relative Cap
Cap Modifier: %
Job Impact: Feasibility
Format: Percentage Usage
Description: Constrains job eligibility as a percentage of total delivered cycles measured according to the scheduler charge metric as defined by the FSPOLICY parameter.
Example
The following example constrains the marketing account to use no more than
16,500 processor seconds during any given floating one week window. At the
same time, all other accounts are constrained to use no more than 10% of the
total delivered processor seconds during any given one week window.
FSPOLICY   DEDICATEDPS
FSINTERVAL 12:00:00
FSDEPTH    14
ACCOUNTCFG[marketing] FSCAP=16500^
ACCOUNTCFG[DEFAULT]   FSCAP=10
...
Priority-Based Fairshare
The most commonly used type of fairshare is priority based fairshare. In this
mode, fairshare information does not affect whether a job can run, but rather
only the job's priority relative to other jobs. In most cases, this is the desired
behavior. Using the standard fairshare target, the priority of jobs of a
particular user who has used too many resources over the specified fairshare
window is lowered. Also, the standard fairshare target increases the priority of
jobs that have not received enough resources.
While the standard fairshare target is the most commonly used, Moab can also
specify fairshare ceilings and floors. These targets are like the default target;
however, ceilings only adjust priority down when usage is too high and floors
only adjust priority up when usage is too low.
Since fairshare usage information must be integrated with Moab's overall
priority mechanism, it is critical that the corresponding fairshare priority
weights be set. Specifically, the FSWEIGHT component weight parameter and the
applicable target type subcomponent weights (such as FSACCOUNTWEIGHT,
FSCLASSWEIGHT, FSGROUPWEIGHT, FSQOSWEIGHT, and FSUSERWEIGHT) must be specified.
If these weights are not set, the fairshare mechanism will be enabled but
have no effect on scheduling behavior. See the Job Priority Factor
Overview for more information on setting priority weights.
Example
# set relative component weighting
FSWEIGHT      1
FSUSERWEIGHT  10
FSGROUPWEIGHT 50

FSINTERVAL 12:00:00
FSDEPTH    4
FSDECAY    0.5
FSPOLICY   DEDICATEDPS

# all users should have a FS target of 10%
USERCFG[DEFAULT] FSTARGET=10.0
# user john gets extra cycles
USERCFG[john] FSTARGET=20.0
# reduce staff priority if group usage exceeds 15%
GROUPCFG[staff] FSTARGET=15.0-
# give group orion additional priority if usage drops below 25.7%
GROUPCFG[orion] FSTARGET=25.7+
Job preemption status can be adjusted based on whether the job violates
a fairshare target using the ENABLEFSVIOLATIONPREEMPTION
parameter.
Credential-Specific Fairshare Weights
Credential-specific fairshare weights can be set using the FSWEIGHT attribute
of the ACCOUNT, GROUP, and QOS credentials as in the following example:
FSWEIGHT 1000
ACCOUNTCFG[orion1] FSWEIGHT=100
ACCOUNTCFG[orion2] FSWEIGHT=200
ACCOUNTCFG[orion3] FSWEIGHT=-100
GROUPCFG[staff] FSWEIGHT=10
If specified, a per-credential fairshare weight is added to the global component
fairshare weight.
The FSWEIGHT attribute is only enabled for ACCOUNT, GROUP, and QOS
credentials.
Fairshare Usage Scaling
Moab uses the FSSCALINGFACTOR attribute of the QOS credential to derive the
calculated fairshare usage of a job.
QOSCFG[qos1] FSSCALINGFACTOR=<double>
Moab will multiply the actual fairshare usage by this value to get the calculated
fairshare usage of a job. The actual fairshare usage is calculated based on the
FSPOLICY parameter.
For example, if FSPOLICY is set to DEDICATEDPS and a job runs on two processors
for 100 seconds, the actual fairshare usage would be 200. If the job ran on a QoS
with FSSCALINGFACTOR=.5, Moab would multiply 200 * .5 = 100. If the job ran on a
partition with FSSCALINGFACTOR=2, Moab would multiply 200 * 2 = 400.
PARCFG also lets you specify the FSSCALINGFACTOR for partitions. See
Per-Partition Settings on page 484.
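Combining the two, a purely hypothetical configuration that discounts usage accrued under one QoS while doubling usage accrued on an aging partition might look like this (credential names and factors are illustrative):
QOSCFG[qos1] FSSCALINGFACTOR=0.5
PARCFG[old]  FSSCALINGFACTOR=2.0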
Extended Fairshare Examples
Example 5-6: Multi-Cred Cycle Distribution
This example represents a university setting where different schools have access
to a cluster. The Engineering department has put the most money into the
cluster and therefore has greater access to the cluster. The Math, Computer
Science, and Physics departments have also pooled their money into the
cluster and have reduced relative access. A support group also has access to
the cluster, but since they only require minimal compute time and shouldn't
block the higher-paying departments, they are constrained to five percent of
the cluster. At this time, users Tom and John have specific high-priority projects
that need increased cycles.
#global general usage limits - negative priority jobs are considered in scheduling
ENABLENEGJOBPRIORITY
TRUE
# site policy - no job can last longer than 8 hours
USERCFG[DEFAULT] MAX.WCLIMIT=8:00:00
# Note: default user FS target only specified to apply default user-to-user balance
USERCFG[DEFAULT] FSTARGET=1
# high-level fairshare config
FSPOLICY   DEDICATEDPS
FSINTERVAL 12:00:00
FSDEPTH    32  # recycle FS every 16 days
FSDECAY    0.8 # favor more recent usage info
# qos config
QOSCFG[inst]    FSTARGET=25
QOSCFG[supp]    FSTARGET=5
QOSCFG[premium] FSTARGET=70
# account config (QoS access and fstargets)
# Note: user-to-account mapping handled via accounting manager
# Note: FS targets are percentage of total cluster, not percentage of QOS
ACCOUNTCFG[cs]   QLIST=inst    FSTARGET=10
ACCOUNTCFG[math] QLIST=inst    FSTARGET=15
ACCOUNTCFG[phys] QLIST=supp    FSTARGET=5
ACCOUNTCFG[eng]  QLIST=premium FSTARGET=70
# handle per-user priority exceptions
USERCFG[tom] PRIORITY=100
USERCFG[john] PRIORITY=35
# define overall job priority
USERWEIGHT 10 # user exceptions
# relative FS weights (Note: QOS overrides ACCOUNT which overrides USER)
FSUSERWEIGHT    1
FSACCOUNTWEIGHT 10
FSQOSWEIGHT     100
# apply XFactor to balance cycle delivery by job size fairly
# Note: queuetime factor also on by default (use QUEUETIMEWEIGHT to adjust)
XFACTORWEIGHT 100
# enable preemption
PREEMPTPOLICY REQUEUE
# temporarily allow phys to preempt math
ACCOUNTCFG[phys] JOBFLAGS=PREEMPTOR PRIORITY=1000
ACCOUNTCFG[math] JOBFLAGS=PREEMPTEE
Hierarchical Fairshare/Share Trees
Moab supports arbitrary depth hierarchical fairshare based on a share tree. In
this model, users, groups, classes, and accounts can be arbitrarily organized
and their usage tracked and limited. Moab extends common share tree
concepts to allow mixing of credential types, enforcement of ceiling and floor
style usage targets, and mixing of hierarchical fairshare state with other
priority components.
Defining the Tree
The FSTREE parameter can be used to define and configure the share tree
used in fairshare configuration. This parameter supports the following
attributes:
SHARES
Format:
<COUNT>[@<PARTITION>][,<COUNT>[@<PARTITION>]]... where <COUNT> is a double and
<PARTITION> is a specified partition name.
Description:
Specifies the node target usage or share.
Example:
FSTREE[Eng]   SHARES=1500.5
FSTREE[Sales] SHARES=2800
MEMBERLIST
Format:
Comma delimited list of child nodes of the format [<OBJECT_TYPE>]:<OBJECT_ID> where
object types are only specified for leaf nodes associated with user, group, class, qos, or acct credentials.
Description:
Specifies the tree objects associated with this node.
Example:
FSTREE[root]   SHARES=100    MEMBERLIST=Eng,Sales
FSTREE[Eng]    SHARES=1500.5 MEMBERLIST=user:john,user:steve,user:bob
FSTREE[Sales]  SHARES=2800   MEMBERLIST=Sales1,Sales2,Sales3
FSTREE[Sales1] SHARES=30     MEMBERLIST=user:kellyp,user:sam
FSTREE[Sales2] SHARES=10     MEMBERLIST=user:ux43,user:ux44,user:ux45
FSTREE[Sales3] SHARES=60     MEMBERLIST=user:robert,user:tjackson
Current tree configuration and monitored usage distribution are available using
the mdiag -f -v command.
Controlling Tree Evaluation
Moab provides multiple policies to customize how the share tree is evaluated.
Policy                 Description
FSTREETIERMULTIPLIER   Decreases the value of sub-level usage discrepancies. It can be a positive or negative value. When positive, the parent's usage in the tree takes precedence; when negative, the child's usage takes precedence. The usage amount is not changed, only the coefficient used when calculating the value of fstree usage in priority. When using this parameter, it is recommended that you research how it changes the values in mdiag -p to determine the appropriate use.
FSTREECAP              Caps lower-level usage factors to prevent them from exceeding upper-tier discrepancies.
Using FS Floors and Ceilings with Hierarchical Fairshare
All standard fairshare facilities including target floors, target ceilings, and
target caps are supported when using hierarchical fairshare.
Multi-Partition Fairshare
Moab supports independent, per-partition hierarchical fairshare targets
allowing each partition to possess independent prioritization and usage
constraint settings. This is accomplished by setting the PERPARTITIONSCHEDULING
attribute of the FSTREE parameter to TRUE in moab.cfg and setting
partition="name" in your <fstree> leaf.
FSTREE[tree]
<fstree>
  <tnode partition="slave1" name="root" type="acct" share="100" limits="MAXJOB=6">
    <tnode name="accta" type="acct" share="50" limits="MAXSUBMITJOBS=2 MAXJOB=1">
      <tnode name="fred" type="user" share="1" limits="MAXWC=1:00:00">
      </tnode>
    </tnode>
    <tnode name="acctb" type="acct" share="50" limits="MAXSUBMITJOBS=4 MAXJOB=3">
      <tnode name="george" type="user" share="1">
      </tnode>
    </tnode>
  </tnode>
  <tnode partition="slave2" name="root" type="acct" share="100" limits="MAXSUBMITJOBS=6 MAXJOB=5">
    <tnode name="accta" type="acct" share="50">
      <tnode name="paul" type="user" share="1">
      </tnode>
    </tnode>
    <tnode name="acctb" type="acct" share="50">
      <tnode name="ringo" type="user" share="1">
      </tnode>
    </tnode>
  </tnode>
</fstree>
If no partition is specified for a given share value, then this value is
assigned to the global partition. If a partition exists for which there are no
explicitly specified shares for any node, this partition will use the share
distribution assigned to the global partition.
Dynamically Importing Share Tree Data
Share trees can be centrally defined within a database, flat file, information
service, or other system and this information can be dynamically imported and
used within Moab by setting the FSTREE parameter within the Identity Manager interface.
This interface can be used to load current information at startup and
periodically synchronize this information with the master source.
To create a fairshare tree in a separate XML file and import it into Moab
1. Create a file to store your fairshare tree specification. Give it a descriptive
name and store it in your Moab home directory ($MOABHOMEDIR or
$MOABHOMEDIR/etc). In this example, the file is called fstree.dat.
2. In the first line of fstree.dat, set FSTREE[myTree] to indicate that this is a
fairshare file.
3. Build a tree in XML to match your needs. For example:
FSTREE[myTree]
<fstree>
  <tnode name="root" share="100">
    <tnode name="john" type="user" share="50" limits="MAXJOB=8 MAXPROC=24 MAXWC=01:00:00"></tnode>
    <tnode name="jane" type="user" share="50" limits="MAXJOB=5"></tnode>
  </tnode>
</fstree>
This configuration creates a fairshare tree in which users share a value of 100. Users john and jane
share the value equally, because each has been given 50.
Because 100 is an arbitrary number, users john and jane could be assigned
10000 and 10000 respectively and still have a 50% share under the parent
leaf. To keep the example simple, however, it is recommended that you use
100 as your arbitrary share value and distribute the share as percentages. In
this case, john and jane each have 50%.
If the users' numbers do not add up to at least the fairshare value of 100,
the remaining value is shared among all users under the tree. For instance,
if the tree had a value of 100, user john had a value of 50, and user jane
had a value of 25, then 25% of the fairshare tree value would belong to all
other users associated with the tree. By default, tree leaves do not limit who
can run under them.
Each value specified in the tnode elements must be contained in
quotation marks.
4. Optional: Share trees defined within a flat file can be cumbersome; consider
running tidy for xml to improve readability. Sample usage:
> tidy -i -xml <filename> <output file>
# Sample output
FSTREE[myTree]
<fstree>
  <tnode name="root" share="100">
    <tnode name="john" type="user" share="50" limits="MAXJOB=8 MAXPROC=24 MAXWC=01:00:00">
    </tnode>
    <tnode name="jane" type="user" share="50" limits="MAXJOB=5">
    </tnode>
  </tnode>
</fstree>
5. Link the new file to Moab using the IDCFG parameter in your Moab
configuration file.
IDCFG[myTree] server="FILE:///$MOABHOMEDIR/etc/fstree.dat" REFRESHPERIOD=INFINITY
Moab imports the myTree fairshare tree from the fstree.dat file. Setting REFRESHPERIOD to INFINITY
causes Moab to read the file each time it starts or restarts, while setting a positive interval (e.g.
4:00:00) causes Moab to read the file more often. See Refreshing Identity Manager Data for more
information.
6. To view your fairshare tree configuration, run mdiag -f. If it is configured
correctly, the tree information will appear beneath all the information about
your fairshare settings configured in moab.cfg.
> mdiag -f
Share Tree Overview for partition 'ALL'
Name     Usage    Target             (FSFACTOR)
------------------------------------------------------------
root    100.00   100.00 of 100.00   (node: 1171.81) (0.00)
- john   16.44    50.00 of 100.00   (user: 192.65) (302.04)   MAXJOB=8 MAXPROC=24 MAXWC=3600
- jane   83.56    50.00 of 100.00   (user: 979.16) (-302.04)  MAXJOB=5
The settings you configured in fstree.dat appear in the output. The tree of 100 is shared equally
between users john and jane.
Specifying Share Tree Based Limits
Limits can be specified on internal nodes of the share tree using standard
credential limit semantics. The following credential usage limits are valid:
- MAXIJOB (Maximum number of idle jobs allowed for the credential)
- MAXJOB
- MAXMEM
- MAXNODE
- MAXPROC
- MAXSUBMITJOBS
- MAXWC
Example 5-7: FSTREE limits example
FSTREE[myTree]
<fstree>
  <tnode name="root" share="100">
    <tnode name="john" type="user" share="50" limits="MAXJOB=8 MAXPROC=24 MAXWC=01:00:00">
    </tnode>
    <tnode name="jane" type="user" share="50" limits="MAXJOB=5">
    </tnode>
  </tnode>
</fstree>
Other Uses of Share Trees
If a share tree is defined, it can be used for purposes beyond fairshare,
including organizing general usage and performance statistics for reporting
purposes (see showstats -T), enforcement of tree node based usage limits,
and specification of resource access policies.
Related Topics
mdiag -f command (provides diagnosis and monitoring of the fairshare facility)
FSENABLECAPPRIORITY parameter
ENABLEFSPREEMPTION parameter
FSTARGETISABSOLUTE parameter
Sample FairShare Data File
FS.<EPOCHTIME>
# FS Data File (Duration: 43200 seconds) Starting: Sat Jul 8 06:00:20
user   jvella          134087.910
user   reynolds         98283.840
user   gastor           18751.770
user   uannan          145551.260
user   mwillis         149279.140
...
group  DEFAULT         411628.980
group  RedRock        3121560.280
group  Summit          500327.640
group  Arches         3047918.940
acct   Administration  653559.290
acct   Engineering    4746858.620
acct   Shared           75033.020
acct   Research       1605984.910
qos    Deadline       2727971.100
qos    HighPriority   4278431.720
qos    STANDARD         75033.020
class  batch          7081435.840
sched  iCluster       7081435.840
The total usage consumed in this time interval is 7081435.840 processor-seconds. Since every job in
this example scenario had a user, group, account, and QOS assigned to it, the sum of the usage of all
members of each category should equal the total usage value: USERA + USERB + USERC + USERD =
GROUPA + GROUPB = ACCTA + ACCTB + ACCTC = QOS0 + QOS1 + QOS2 = SCHED.
Accounting, Charging, and Allocation Management
In this topic:
- Accounting Manager Overview on page 392
- Accounting Mode on page 393
- Accounting Manager Interface Types on page 393
  - MAM on page 394
  - Native on page 394
- Accounting Properties Reported to Moab Accounting Manager on page 396
- Accounting Policies on page 399
  - Charge Metrics on page 400
- Accounting Stages on page 401
- Accounting Events on page 404
- Blocking Versus Non-Blocking Accounting Actions on page 404
- Retrying Failed Charges on page 405
For a complete list of the AMCFG parameters and flags, along with additional
information, see AMCFG Parameters and Flags on page 407.
Accounting Manager Overview
An accounting manager is a software system that enables tracking and
charging for job resource usage. Moab Accounting Manager is a commercial
charge-back accounting system that has built-in integration with Moab
Workload Manager. Moab Accounting Manager can be used in a variety of
accounting modes such as for usage tracking, notional charging or allocation
enforcement.
When used for usage tracking only, the accounting manager simply records
workload usage details. When configured additionally to perform charging,
resource charge rates are used to impute a charge for each job. When
configured to enforce resource allocation limits, jobs are charged against
allocations and new jobs may be blocked from running if their account runs out
of funds. See Accounting Mode and see Select an Appropriate Accounting
Mode in the Moab Accounting Manager Administrator Guide for more details on
supported accounting modes.
In a typical allocation enforcement use case, credits are allocated to accounts
for designated time periods; establishing limits on the use of compute
resources. The base currency credits can be defined in terms of system
resource units (e.g. Processor-Seconds) or a real currency (e.g. U.S. dollars).
Charge rates are established for the use of resources. Accounts are created
and users are given access to the appropriate accounts. Deposits are made into
funds associated with the accounts, creating allocations. An allocation cycle can
be established whereby funds are reset on a regular periodic basis (such as
yearly, quarterly, or monthly) and where allocations are renewed for accepted
accounts. Before a job is started, Moab Workload Manager will verify that the
user has sufficient credits to run the job by attempting to place a hold against
their funds (referred to as a lien). When a job completes, the user's funds will
be debited via a charge, usage information will be recorded for the job, and the
lien will be removed.
Accounting Mode
The accounting mode (specified via the AMCFG[] MODE parameter) modifies
the way in which accounting-relevant job and reservation stages (e.g. create,
start, end, etc.) are processed. See Accounting Stages on page 401 for more
information on the behaviors of the different values of the accounting mode.
The following table describes the valid values for the accounting mode.
Value
Description
strict-allocation
Use this mode if you wish to strictly enforce allocation limits. Under this mode, holds (called
liens) will be placed against allocations in order to prevent multiple jobs from starting up on
the same funds. Jobs and reservations may be prevented from running if the end-users do
not have sufficient funds. This is the default.
fast-allocation
Use this mode if you wish to debit allocations, but need higher throughput by eliminating
the lien and quote operations of strict-allocation mode. Under this mode, jobs and
reservations check a cached account balance, and may be prevented from running after the
balance has become zero or negative.
If you are using fast-allocation, funds are assumed to have account-based
constraints only. Moab will reject funds having no constraints or having non-account
constraints. It is highly recommended that you set ENFORCEACCOUNTACCESS to
TRUE and AMCFG[] CREATECRED=TRUE with an appropriate refresh period (via
AMCFG[] REFRESHPERIOD) so that Moab can prevent jobs from running under
accounts that the user does not belong to (this is enforced via liens in the
strict-allocation accounting mode). Also, the configured refresh period will apply
to both credential updates and account balance updates. See Moab Parameters for more
information on the ENFORCEACCOUNTACCESS parameter.
notional-charging
Use this mode if you wish to calculate and record charges for workload usage, but not keep
track of fund balances or allocation limits.
usage-tracking
Use this mode if you wish to record workload usage details, but not to calculate a charge nor
keep track of fund balances or allocation limits.
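As an illustration of selecting a mode, the following hypothetical entries choose fast-allocation and add the companion settings recommended above; the accounting manager name mam and the refresh period are assumptions:
ENFORCEACCOUNTACCESS TRUE
AMCFG[mam] MODE=fast-allocation CREATECRED=TRUE REFRESHPERIOD=6:00:00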
Accounting Manager Interface Types
Moab Workload Manager supports two accounting manager interface types:
MAM and Native.
- When using the MAM interface type, Moab communicates directly over the network with Moab Accounting Manager using the SSS wire protocol.
- When using the Native accounting manager interface type, Moab invokes scripts which can be customized to interact with Moab Accounting Manager or other third-party accounting systems.
MAM
The MAM accounting manager interface type enables direct communication
between Moab Workload Manager and Moab Accounting Manager. This often
results in the fastest accounting performance. Use this interface type if you do
not need to customize the interaction with the accounting manager.
To configure Moab to use the MAM accounting manager interface, run
configure using the --with-am option.
Example 5-8:
./configure --with-am=mam ...
Subsequently, make install will add the essential configuration and
connection entries into the moab.cfg and moab-private.cfg files.
The following are typical entries in the Moab configuration files for using the
MAM interface:
- moab.cfg:
AMCFG[mam] TYPE=MAM HOST=localhost
- moab-private.cfg:
CLIENTCFG[AM:mam] KEY=UiW7EihzKyUyVQg6dKirDhV3
Synchronize the secret key with Moab Accounting Manager by copying the value
of the token.value parameter from the MAM_PREFIX/etc/mam-site.conf file,
which is randomly generated during the Moab Accounting Manager install
process.
When using the MAM accounting manager interface, by default Moab will
communicate directly with Moab Accounting Manager via the SSS wire protocol.
However, it is possible to enable a hybrid model and override individual
accounting actions by specifying the exec protocol and the path of a custom
script to the appropriate AMCFG[] *URL parameters.
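For example, a hybrid configuration might keep the direct SSS protocol for most
actions while overriding only the final charge with a site-local script (a sketch;
usage.end.custom.pl is a hypothetical script name):

AMCFG[mam] TYPE=MAM HOST=localhost
AMCFG[mam] ENDURL=exec://$TOOLSDIR/mam/usage.end.custom.pl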
Moab Accounting Manager should be installed, started, and initialized. See
Initial Setup in the Moab Accounting Manager Administrator Guide for
examples of how to initialize MAM for your initial mode of operation.
Native
The Native accounting manager interface type provides a customization layer
between Moab Workload Manager and Moab Accounting Manager. This
interface can be used where greater accounting customization is required. The
native interface can also be customized to interact with third-party accounting
manager systems. Moab passes job accounting details to scripts that handle
the interaction with the external system.
To configure Moab to use the Native accounting manager interface, run
configure using the --with-am=native option.
Additionally, you may need to use the --with-am-dir configure option to
specify the prefix directory for Moab Accounting Manager if MAM has been
installed in a non-default location.
Example 5-9:
./configure --with-am=native ...
Subsequently, make install will add the essential accounting manager
entries into moab.cfg and install the accounting-related scripts
($PREFIX/tools/mam/usage.*.mam.pl) in the correct locations.
Moab will default to using a set of stock scripts for the accounting stages. To
view the scripts that are currently in use, run mdiag -R -v (even more
information may be available in mdiag -R -v --xml). The following shows
sample output from running the mdiag -R -v command.
AM[mam]  Type: Native  State: 'Active'
  Timeout:          15
  Thread Pool Size: 0
  Accounting Mode:  strict-allocation
  Create URL:       exec:///opt/moab-accounting/tools/mam/usage.quote.mam.pl
  Start URL:        exec:///opt/moab-accounting/tools/mam/usage.reserve.mam.pl
  Pause URL:        exec:///opt/moab-accounting/tools/mam/usage.charge.mam.pl
  Resume URL:       exec:///opt/moab-accounting/tools/mam/usage.reserve.mam.pl
  Update URL:       exec:///opt/moab-accounting/tools/mam/usage.charge.mam.pl
  Continue URL:     exec:///opt/moab-accounting/tools/mam/usage.reserve.mam.pl
  End URL:          exec:///opt/moab-accounting/tools/mam/usage.charge.mam.pl
  Delete URL:       exec:///opt/moab-accounting/tools/mam/lien.delete.mam.pl
  Query URL:        exec:///opt/moab-accounting/tools/mam/account.query.mam.pl
  Charge Policy:    DEBITSUCCESSFULWC
Moab will invoke the native accounting manager scripts by passing the job or
reservation information via XML to the standard input of the script. You may
override any of the default scripts with a custom script by specifying the
appropriate AMCFG URL parameter in the Moab server configuration file. See
AMCFG Parameters and Flags on page 407 for more information on the CREATEURL,
STARTURL, PAUSEURL, RESUMEURL, UPDATEURL, CONTINUEURL, ENDURL, DELETEURL,
and QUERYURL values.
The XML sent to the scripts is in the form of an SSS Request that is identical to
the Request sent to MAM when you use the MAM Accounting Manager Interface
type. For example, the XML sent to the usage.charge.mam.pl script in a final
charge consists of an encapsulating Request element with an action attribute
that has a value of "Charge"; an object element with a value of "UsageRecord";
one or more optional Option elements; and a Data element. The Data element
has a single UsageRecord element with property elements describing the job or
reservation properties. For example:
<Request action="Charge"><Object>UsageRecord</Object><Option
name="Duration">1234</Option><Data><UsageRecord><Type>Job</Type><Instance>Moab.165</In
stance><User>amy</User><Group>staff</Group><Account>chemistry</Account><Class>batch</C
lass><QualityOfService>high</QualityOfService><Machine>colony</Machine><Nodes>1</Nodes
><NodeType>Fast</NodeType><NodeCharge>2.000000</NodeCharge><Partition>Torque</Partitio
n><Processors
consumptionRate="0.50">2</Processors><Memory>2048</Memory><Matlab>2</Matlab><StartTime
>1398805354</StartTime><EndTime>1398805357</EndTime><CompletionCode>0</CompletionCode>
<OpSys>CentOS 6</Opsys><Temp>87.00</Temp></UsageRecord></Data></Request> *
In the sample XML above, Matlab is an example of a generic resource, Opsys is
an example of a job variable, and Temp is an example of a generic metric.
A reservation charge, quote, or lien is very similar. For example:
<Request action="Charge"><Object>UsageRecord</Object><Option
name="Duration">7200</Option><Data><UsageRecord><Type>Reservation</Type><Instance>rese
rvation.7</Instance><User>amy</User><Machine>colony</Machine><Nodes>1</Nodes><Processo
rsconsumptionRate="0.76">12</Processors><Duration>7200</Duration><StartTime>1398797430
</StartTime><EndTime>1398804630</EndTime></UsageRecord></Data></Request>
The majority of the scripts use this same basic XML format; for instance,
usage.quote.mam.pl, usage.reserve.mam.pl, and
usage.charge.mam.pl.
The XML sent to the lien.delete.mam.pl script to clean up after a failure consists
of an encapsulating Request element with an action attribute that has a value
of "Delete"; an object element with the value of "Lien"; and a condition
(Where) element indicating the lien instance to delete. For example:
<Request action="Delete"><Object>Lien</Object><Where
name="Instance">Moab.127</Where></Request>
The script should exit with a return code of zero for success, write any data
to standard output, and write messages to standard error. A failure in
CREATEURL, STARTURL, RESUMEURL, or CONTINUEURL should result in the
application of the CREATEFAILUREACTION, STARTFAILUREACTION,
RESUMEFAILUREACTION, or CONTINUEFAILUREACTION, respectively.
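To illustrate this contract, the following is a minimal sketch of a custom native
accounting script, assuming it has been wired in via one of the AMCFG[] *URL
parameters; the log file path and the simple pass-through behavior are
hypothetical, not part of the stock scripts:

#!/usr/bin/perl
# Hypothetical custom accounting script. Moab writes an SSS Request XML
# document to standard input; the script must exit with zero on success,
# write any reply data to standard output, and write messages to standard
# error.
use strict;
use warnings;

my $request = do { local $/; <STDIN> };    # slurp the entire SSS Request XML

# Forward the request to a hypothetical site accounting log; a real script
# would translate it into a call to a third-party accounting system.
open my $log, '>>', '/var/log/site-accounting.xml'
    or do { print STDERR "cannot open accounting log: $!\n"; exit 1; };
print $log $request, "\n";
close $log;

exit 0;    # a zero return code tells Moab the accounting action succeeded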
Moab Accounting Manager should be installed, started, and initialized. The
simplest procedure is to install it on the same server as Moab Workload
Manager so that the Moab Accounting Manager can share libraries and
configuration files with the Moab Workload Manager and Moab Accounting
Manager scripts. See Initial Setup in the Moab Accounting Manager
Administrator Guide for examples of how to initialize MAM for your initial mode
of operation.
Accounting Properties Reported to Moab Accounting Manager
When you set the accounting manager TYPE to MAM, Moab can send the
following information to Moab Accounting Manager via charging actions:
For Jobs
Property name in MAM Usage Record
Description of property value recorded in MAM Usage Record
Account
Account name
Charge
If the AMCFG LOCALCOST flag is set, Moab will calculate and pass the Charge amount to
MAM. If it is not, MAM will calculate the charge based on the transmitted job properties.
Class
Class/queue name
CompletionCode*
Exit code
CPUTime
CPU time
Duration
Moab sends the wallclock time for the job charge(s) in seconds. This is aggregated in
MAM as Duration.
EndTime
Job end time
(Generic Metrics)*
The property name is the name of the generic metric. The property value is the average
value of the generic metric across the job's nodes and over time.
(Generic
Resources)*
The property name is the name of the generic resource. The property value is the
number of generic resources consumed by the job.
NUMA-specific generic resources include sockets, numanodes, cores, and threads
and represent the NUMA resources dedicated by the job.
Group
Group name
Instance
Job ID
Machine
Cluster (RM) name
Memory
Dedicated or utilized memory in megabytes
NodeCharge*
Aggregate node charge rate. See NODECHARGEPOLICY and CHARGERATE for more
information.
Nodes
Node count
NodeType*
Node type. See NODETYPE for more information.
Partition
Partition name
Processors
Processor count; this property may also have a consumptionRate attribute (scale
multiplier) indicating how much the Processors value should be scaled by when
charging for them.
QualityOfService
QoS name
Stage
Accounting stage
StartTime
Job start time
SubmitTime
Job submission time
Type
Set to "Job"
User
User name
(Variables)*
The property name is the name of the job variable. The property value is the value of the
job variable.
* For this property to be recorded in the MAM Usage Record, you must define a
custom usage record attribute in MAM for it. See Customizing the Usage Record
Object in the Moab Accounting Manager Administrator Guide for more
information.
For Reservations
Property name in MAM Usage Record
Description of property value recorded in MAM Usage Record
Account
Charge account
Duration
Moab sends the wallclock time for the reservation in seconds. This is aggregated in MAM as Duration.
EndTime
Reservation end time
Instance
Reservation ID
Machine
Cluster (RM) name
Nodes
Node count allocated to the reservation
Partition
Partition name
Processors
Processor count allocated to the reservation. This property may also have a consumptionRate
attribute, which is the ratio of idle processor seconds to total processor seconds. For
instance, for a reservation whose total processor seconds were utilized 25% by running jobs,
the consumptionRate would be transmitted as 0.75, meaning that 75% of the total reservation
processor seconds are to be charged for being idle.
Stage
Accounting stage
StartTime
Reservation start time
Type
Set to "Reservation"
User
Charge user or reservation owner
Accounting Policies
When using an accounting mode of strict-allocation, before Moab starts a job, it
contacts the accounting manager and requests an allocation reservation (or
lien) be placed on the associated account. The lien amount is equivalent to the
total amount of allocation that could be consumed by the job (based on the
job's wallclock limit) and is used to prevent the possibility of allocation
oversubscription. Moab then starts the job. When the job completes, Moab debits
the allocation by the amount actually consumed by the job and then releases
the lien.
These steps should be transparent to users. Only when an account has
insufficient allocations to run a requested job will the presence of the
accounting manager be noticed. If desired, a fallback account may be specified
for use when a job's primary account is out of allocations. This account,
specified using the AMCFG parameter's FALLBACKACCOUNT attribute, is often
associated with a low QoS privilege and priority, and is often configured to run
only when no other jobs are present.
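For example, a fallback account for out-of-allocation jobs might be configured as
follows (a sketch; the account name freecycle is illustrative, and its low-priority
QoS and class restrictions are configured separately):

AMCFG[mam] FALLBACKACCOUNT=freecycle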
The scheduler can also be configured to charge for reservations. One of the
hesitations with dedicating resources to a particular group is that if the
resources are not used by that group, they go idle and are wasted. By
configuring a reservation to be chargeable, sites can charge every idle cycle of
the reservation to a particular account. When the reservation is in use, the
consumed resources will be charged to the job using the resources. When the
resources are idle, the resources will be charged to the reservation's charge
account. In the case of standing reservations, this account is specified using the
parameter SRCFG[X], attribute CHARGEACCOUNT. In the case of administrative
reservations, this account is specified via the -S account flag to the mrsvctl -c
command (see mrsvctl).
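For example, a standing reservation whose idle cycles should be charged to a
particular account might be configured as follows (a sketch; the reservation name,
host list, and account name are illustrative):

SRCFG[biology] HOSTLIST=node01,node02 PERIOD=INFINITY
SRCFG[biology] CHARGEACCOUNT=biology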
Charge Metrics
The accounting manager interface allows a site to charge accounts in a number
of different ways. Some sites may wish to charge for all jobs regardless of
whether the job completed successfully. Sites may also want to charge based
on differing usage metrics, such as dedicated wallclock time or processors
actually used. Moab supports the following charge policies specified via the
CHARGEPOLICY attribute.
l DEBITALLWC - Charges all jobs regardless of job completion state using
  processor weighted wallclock time dedicated as the usage metric.
l DEBITALLCPU - Charges all jobs based on processors used by job.
l DEBITALLPE - Charges all jobs based on processor-equivalents dedicated
  to job.
l DEBITALLBLOCKED - Charges all jobs based on processors dedicated and
  blocked according to node access policies (see Node Access Policies on
  page 341) or QoS node exclusivity (see Quality of Service (QoS)
  Facilities on page 486).
l DEBITSUCCESSFULWC - Charges only jobs that successfully complete
  using processor weighted wallclock time dedicated as the usage metric.
  This is the default metric.
l DEBITSUCCESSFULCPU - Charges only jobs that successfully complete
  using CPU time as the usage metric.
l DEBITSUCCESSFULPE - Charges only jobs that successfully complete
  using PE weighted wallclock time dedicated as the usage metric.
l DEBITSUCCESSFULBLOCKED - Charges only jobs that successfully
  complete based on processors dedicated and blocked according to node
  access policies (see Node Access Policies on page 341) or QoS node
  exclusivity (see Quality of Service (QoS) Facilities on page 486).
DEBITALLBLOCKED or DEBITSUCCESSFULBLOCKED should only be used
with policies that allow only a single job to dedicate a node such as with a
Node Access Policy of SINGLEJOB or SINGLETASK, or using a QOS with the
DEDICATED flag. Using DEBITALLBLOCKED or
DEBITSUCCESSFULBLOCKED with any policy allowing more than one job
to dedicate a node (such as a Node Access Policy of SINGLEUSER,
SINGLECLASS, SINGLEACCOUNT or UNIQUEUSER) is not supported.
On systems where job wallclock limits are specified, jobs that exceed their
wallclock limits and are subsequently canceled by the scheduler or
resource manager are considered to have successfully completed as far as
charging is concerned, even though the resource manager may report
these jobs as having been removed or canceled.
If machine-specific allocations are created within the accounting manager,
the accounting manager machine name should be synchronized with the
Moab resource manager name as specified with the RMCFG parameter,
such as the name orion in RMCFG[orion] TYPE=PBS.
To control how jobs are charged when heterogeneous resources are
allocated and per resource charges may vary within the job, use the
NODECHARGEPOLICY attribute.
When calculating the cost of the job, Moab will use the most restrictive
node access policy. See NODEACCESSPOLICY for more information.
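For example, to charge for blocked processors on a cluster where nodes are
dedicated to a single job, the relevant parameters might be combined as follows
(a sketch assembled from the parameters described above):

NODEACCESSPOLICY SINGLEJOB
AMCFG[mam] CHARGEPOLICY=DEBITALLBLOCKED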
Accounting Stages
The accounting manager performs various actions throughout different stages
of a job or reservation lifetime. For a stock configuration (meaning you have
not overridden the accounting actions with custom scripts), the following
describes the stages and the respective actions that occur at these stages
depending on the accounting mode:
l Create stage – When a job is submitted or a chargeable reservation is
  created and either AMCFG[] VALIDATEJOBSUBMISSION is TRUE or an
  AMCFG[] FALLBACKACCOUNT or FALLBACKQOS is specified:
  o If the accounting mode is strict-allocation, Moab will check with the
    accounting manager to verify that sufficient funds exist for the job or
    reservation to run.
  o If the accounting mode is fast-allocation, Moab will check its cached
    balance for the job's or reservation's account to verify that sufficient
    funds exist for the job or reservation to run.
  o Otherwise, it does nothing.
l Start stage – When a job or a chargeable reservation is about to start:
  o If the accounting mode is strict-allocation, Moab will attempt to place a
    hold against the allocation in the accounting manager in order to
    prevent multiple jobs or reservations from starting on the same funds.
  o If the accounting mode is fast-allocation, Moab will check its cached
    balance for the job's or reservation's account to verify that sufficient
    funds exist for the job or reservation to run.
  o Otherwise, it does nothing.
l Delete stage – If a job or chargeable reservation fails to start:
  o If the accounting mode is strict-allocation and Moab has already placed
    a hold on an allocation for the job or reservation, Moab will contact
    the accounting manager to remove the lien.
  o Otherwise, it does nothing.
l Pause stage – If a job becomes suspended, Moab will make a charge for
  the resources used for the time the job has run thus far:
  o If the accounting mode is strict-allocation, the usage record will be
    updated with resource usage and charge amounts, the allocation will
    be debited, and the lien will be reduced.
  o If the accounting mode is fast-allocation, the usage record will be
    updated with resource usage and charge amounts, and the allocation
    will be debited.
  o If the accounting mode is notional-charging, the usage record will be
    updated with resource usage and charge amounts.
  o If the accounting mode is usage-tracking, the usage record will be
    updated with resource usage.
l Resume stage – If a suspended job is resumed:
  o If the accounting mode is strict-allocation, Moab will attempt to place a
    hold against the funds in the accounting manager for the smaller of
    the duration of the next charge period or the remaining duration of
    the job or reservation.
  o If the accounting mode is fast-allocation, Moab will check its cached
    balance for the job's or reservation's account to verify that sufficient
    funds exist for the job or reservation to run for the smaller of the
    duration of the next charge period or the remaining duration of the
    job or reservation.
  o Otherwise, it does nothing.
l Update stage – If AMCFG[] FLUSHINTERVAL is set and Moab has
  reached the end of a charge period, Moab will make an incremental
  charge for all running jobs and active chargeable reservations for the
  resources used during the last charge period:
  o If the accounting mode is strict-allocation, the usage record will be
    updated with resource usage and charge amounts, the allocation will
    be debited, and the lien will be reduced.
  o If the accounting mode is fast-allocation, the usage record will be
    updated with resource usage and charge amounts, and the allocation
    will be debited.
  o If the accounting mode is notional-charging, the usage record will be
    updated with resource usage and charge amounts.
  o If the accounting mode is usage-tracking, the usage record will be
    updated with resource usage.
l Continue stage – If AMCFG[] FLUSHINTERVAL is set and Moab is
  beginning a new charge period for a job or reservation:
  o If the accounting mode is strict-allocation, Moab will attempt to place a
    hold against the funds in the accounting manager for the smaller of
    the duration of the next charge period or the remaining duration of
    the job or reservation.
  o If the accounting mode is fast-allocation, Moab will check its cached
    balance for the job's or reservation's account to verify that sufficient
    funds exist for the job or reservation to run for the smaller of the
    duration of the next charge period or the remaining duration of the
    job or reservation.
  o Otherwise, it does nothing.
l End stage – If a job or chargeable reservation ends, Moab will make a
  final charge for the remainder of the resources used by the job or
  reservation:
  o If the accounting mode is strict-allocation, the usage record will be
    updated with resource usage and charge amounts, the allocation will
    be debited, and the lien will be removed.
  o If the accounting mode is fast-allocation, the usage record will be
    updated with resource usage and charge amounts, and the allocation
    will be debited.
  o If the accounting mode is notional-charging, the usage record will be
    updated with resource usage and charge amounts.
  o If the accounting mode is usage-tracking, the usage record will be
    updated with resource usage.
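For example, to activate the Update and Continue stages so that long-running
workload is charged incrementally once a day (a sketch; the 24-hour interval is
illustrative):

AMCFG[mam] FLUSHINTERVAL=1:00:00:00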
Accounting Events
You can add accounting events to the event log by specifying one or more of
the following with RECORDEVENTLIST. See Event Log Format on page 688 for
more information.
Event
Description
AMCREATE
Record accounting events triggered when an object is created; for example, when a balance check
occurs at job submission.
AMDELETE
Record accounting events triggered when an object's normal accounting lifecycle is interrupted; for
example, when the lifecycle is interrupted to clean up reservations for a failed job start.
AMEND
Record accounting events triggered when an object ends; for example, when a charge occurs at the
end of a job.
AMPAUSE
Record accounting events triggered when an object is paused; for example, when a partial charge
occurs when a job is paused.
AMQUOTE
Record accounting events triggered when an object requires a quote amount.
AMRESUME
Record accounting events triggered when an object is resumed; for example, when a lien is made
when a job is resumed.
AMSTART
Record accounting events triggered when an object is started; for example, when a lien is made
when a job starts.
AMUPDATE
Record accounting events triggered when an object continues past a flush interval; for example,
when a partial charge occurs and a new lien is made for a job.
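For example, to record the create, start, and end accounting events in the event
log (a sketch; see RECORDEVENTLIST for the full list syntax):

RECORDEVENTLIST AMCREATE,AMSTART,AMEND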
Blocking Versus Non-Blocking Accounting Actions
Moab uses a thread pool to perform non-blocking actions. Instead of blocking
the scheduling thread, the request is added to a queue that is serviced by the
accounting thread pool. Using the thread pool to perform non-blocking
accounting actions can result in faster aggregate scheduling and better client
response times, though individual actions can, in some cases, be briefly
delayed. By default, Moab uses non-blocking calls for the final charge only. The
default behavior for individual accounting actions (such as Create, Start, End)
can be overridden via the associated parameter (CONTINUEISBLOCKING,
CREATEISBLOCKING, DELETEISBLOCKING, ENDISBLOCKING, PAUSEISBLOCKING,
RESUMEISBLOCKING, STARTISBLOCKING).
For best performance when using non-blocking accounting actions, it is
recommended to specify an RM poll interval with a minimum poll time of
zero (such as RMPOLLINTERVAL=0,30). Setting a non-zero minimum poll
time can prevent Moab from responding quickly to accounting actions and
can result in increased latency in job scheduling.
When using the fast-allocation accounting mode, if the charge action is set
to be non-blocking (which is the default), Moab's account balance cache is
not updated with the effects of the charge until the iteration after the
charge is issued.
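For example, to make job starts non-blocking in addition to the default
non-blocking final charge, with the recommended zero minimum poll time (a sketch;
do not set STARTISBLOCKING=FALSE in a Peer-to-Peer grid, as noted under the
STARTISBLOCKING parameter):

AMCFG[mam] STARTISBLOCKING=FALSE
RMPOLLINTERVAL 0,30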
Retrying Failed Charges
If the AMCFG[] RETRYFAILEDCHARGES parameter is set to true (this is the
default), job charges will be retried if they have failed due to a connection
failure. When a job charge or usage record update (such as might occur when a
job is suspended, at the periodic charge interval, or when a job completes)
results in a connection failure between Moab and the accounting manager,
then the charge request will be saved to a file in SPOOLDIR/am/retrying/.
Once Moab detects that the connection with the accounting manager has been
restored, the charge will be retried up to CHARGERETRYCOUNT times.
Charges that fail due to reasons other than a connection failure, or connection
failures that surpass the CHARGERETRYCOUNT, will be saved to files in
SPOOLDIR/am/failed/. Although these failures generally represent
permanent failures, in some cases it may be possible to reissue some of these
charges with a slight modification. For example, a user may have been moved
from one account to another after the job started, causing the final charge to
fail. For such circumstances, a script has been provided
(TOOLSDIR/mam/mam-charge-retry.pl) to facilitate the re-issuance of a failed
usage charge from a failed charge retry file.
[root]# /software/moab-accounting/tools/mam/mam-charge-retry.pl --help
The mam-charge-retry.pl script mimics the mam-charge command in making
a charge to MAM. The specified command-line options will override the original
values contained in the failed charge file. The --dry-run option can be used to
issue the retry as a quote rather than a charge in order to see if the charge
would be successful. The --delete-on-success option can be used to delete the
retry file after a successful charge. This script cannot be used to rerun a
command when the accounting action uses a native script. In such cases, the
modified request XML from the charge retry file can be passed as the standard
input to the native script to reissue a charge.
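For example, assuming the native End action is handled by the stock
usage.charge.mam.pl script and that the jq utility is available to extract the
request field from the JSON retry file, a failed charge could be replayed along
these lines (a hypothetical sketch; paths vary by installation):

[root]# jq -r '.request' /opt/moab/spool/am/failed/job.250 | /opt/moab/tools/mam/usage.charge.mam.pl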
Using the Script
This section provides synopsis information and an example of using the
TOOLSDIR/mam/mam-charge-retry.pl script.
Synopsis
mam-charge-retry {[--filename] <retry_filename>} [-J <instance_name>]
[-j <usage_record_id>] [-q <quote_id>] [-l <lien_id>] [-T <usage_record_type>]
[-u <user_name>] [-g <group_name>] [-a <account_name>] [-o <organization_name>]
[-c <class_name>] [-Q <quality_of_service>] [-m <machine_name>] [-N <nodes>]
[-P <processors>] [-C <cpu_time>] [-M <memory>] [-D <disk>]
[--stage <lifecycle_stage>] [-X, --extension <property>=<value>]...
[-t <charge_duration>] [-s <charge_start_time>] [-e <charge_end_time>]
[-d <charge_description>] [-z <charge_amount>] [-f <fund_id>] [--incremental]
[-R <charge_rate_name>[{<charge_rate_value>}]=<charge_rate_amount>,...]...
[--hours] [--itemize] [--delete-on-success] [--dry-run] [--debug]
[--site <site_name>] [--help] [--man] [--quiet] [--verbose] [--version]
Reissuing a charge that has failed example
First we will list the files in the SPOOLDIR/am/failed directory to see if there
are any "permanently" failed charges that we might want to reissue.
[root]# cd /opt/moab/spool/am/failed
[root]# ls
job.250
We see there is a failed charge for job 250. It may be useful to check the
charge file and examine the message to see what went wrong.
[root]# cat job.250
{"action":"End","message":"Failure registering job End (250) with accounting manager - Unable to invoke AM request - server rejected request with status code 740 - Failed
charging 1.00 credits for instance 250 and created usage record
25\nUser amy is not a valid member of Account biology","request":"<Request
action=\"Charge\"><Object>UsageRecord</Object><Option name=\"AccountingMode\">strictallocation</Option><Option name=\"StartTime\">1432070300</Option><Option
name=\"Duration\">300</Option><Data><UsageRecord><Stage>End</Stage><Type>Job</Type><In
stance>
250</Instance><User>amy</User><Group>staff</Group><Account>biology</Account><Class>bat
ch</Class><QualityOfService>premium</QualityOfService>
<Machine>colony</Machine><Nodes>1</Nodes><Partition>colony</Partition><Processors
consumptionRate=\"1.00\">12</Processors><StartTime>1432070300</StartTime><SubmitTime>1
432070300</SubmitTime><EndTime>1432070600</EndTime>
<CompletionCode>0</CompletionCode></UsageRecord></Data></Request>"}
We can see that this charge failed because the user (amy) was not a member
of the specified account (biology). In this case, the user was a member of the
biology account when the job started, but had been moved to the account
chemistry by the time the job ended, resulting in a charge failure.
If we were to reissue the charge without modification, it would fail again, as we
can see by using the script with the --dry-run option.
[root]# /opt/moab/tools/mam/mam-charge-retry.pl job.250 --dry-run
User amy is not a valid member of Account biology
We can reissue the charge after changing the request to use her new chemistry
account.
[root]# /opt/moab/tools/mam/mam-charge-retry.pl job.250 -a chemistry --dry-run
Successfully quoted 1.00 credits for instance 250
Since that looks like it will work correctly, we'll issue the corrected charge
request and delete the charge file.
[root]# /opt/moab/tools/mam/mam-charge-retry.pl job.250 -a chemistry --delete-on-success
Successfully charged 1.00 credits for instance 250 and created usage record 35
Related Topics
AMCFG Parameters and Flags on page 407
Per Class DISABLEAM attribute
Charging for Reserved Resources on page 441
ENFORCEACCOUNTACCESS parameter
AMCFG Parameters and Flags
Moab's accounting manager policies are defined using the AMCFG[] parameter.
All AMCFG parameters must use the same accounting manager name between
the square brackets (e.g. AMCFG[mam]). The following AMCFG parameter
values are supported:
ALWAYSCHARGERESERVATIONS, BACKUPHOST, BLOCKINGACTIONS, CHARGEPOLICY,
CHARGERETRYCOUNT, CONTINUEFAILUREACTION, CONTINUEISBLOCKING, CONTINUEURL,
CREATECRED, CREATEFAILUREACTION, CREATEISBLOCKING, CREATEURL, DELETEISBLOCKING,
DELETEURL, DISABLEDACTIONS, ENDISBLOCKING, ENDURL, FALLBACKACCOUNT, FALLBACKQOS,
FLAGS, FLUSHINTERVAL, HOST, LIENGRANULARITY, MODE, NODECHARGEPOLICY,
PAUSEISBLOCKING, PAUSEURL, PORT, QUERYURL, REFRESHPERIOD, RESUMEFAILUREACTION,
RESUMEISBLOCKING, RESUMEURL, RETRYFAILEDCHARGES, SERVER, STARTFAILUREACTION,
STARTISBLOCKING, STARTURL, TIMEOUT, TYPE, UPDATEURL, VALIDATEJOBSUBMISSION
ALWAYSCHARGERESERVATIONS
Format
<BOOLEAN>
Default
FALSE
Description
If set to TRUE, idle cycles in reservations will be charged to the accounting manager by
default, even if the ChargeAccount and ChargeUser are not specified for the reservation.
For reservations that you do not want to be charged with the accounting manager, specify
the reservation Charge attribute with a value of False. If set to FALSE (the default), idle
cycles in reservations will not be charged to the accounting manager unless you specify the
reservation ChargeAccount or ChargeUser attributes or set the reservation Charge attribute with a value of True.
Example
AMCFG[mam] ALWAYSCHARGERESERVATIONS=TRUE
By default, Moab will charge for idle cycles in reservations unless overridden with
Charge=False.
BACKUPHOST
Format
STRING
Default
---
Description
Specifies the backup host name for the accounting manager server daemon.
Example
AMCFG[mam] BACKUPHOST=headnode2
Use the backup accounting manager server on headnode2 if the connection fails to the
primary accounting manager server.
BLOCKINGACTIONS
Description
This parameter is deprecated. It may be removed in a future release.
Instead, specify the corresponding AMCFG[] CREATEISBLOCKING, DELETEISBLOCKING,
ENDISBLOCKING, PAUSEISBLOCKING, RESUMEISBLOCKING, and STARTISBLOCKING parameters.
CHARGERETRYCOUNT
Format
<INTEGER> (non-negative)
Default
24
Description
Only applicable if RETRYFAILEDCHARGES is enabled.
Specifies the maximum number of times that Moab will retry charges that failed due to a
connection failure. Moab will
continue to retry until the charge succeeds, the charge fails due to a non-connection failure, or
until the CHARGERETRYCOUNT limit is reached. If set to zero, no retries will be performed, and all
charge failures will be written to files in the SPOOLDIR/am/failed/ directory.
Example
AMCFG[mam] RETRYFAILEDCHARGES=TRUE CHARGERETRYCOUNT=12
Moab will retry connection-oriented charge failures up to 12 times.
CHARGEPOLICY
Format
One of DEBITALLWC, DEBITALLCPU, DEBITALLPE, DEBITALLBLOCKED,
DEBITSUCCESSFULWC, DEBITSUCCESSFULCPU, DEBITSUCCESSFULPE, or
DEBITSUCCESSFULBLOCKED
Default
DEBITSUCCESSFULWC
Description
Specifies how consumed resources should be charged against the consumer's credentials. See
Charge Metrics on page 400 for details.
DEBITALLBLOCKED or DEBITSUCCESSFULBLOCKED should only be used with
policies that allow only a single job to dedicate a node such as with a Node Access Policy of
SINGLEJOB or SINGLETASK, or using a QOS with the DEDICATED flag. Using
DEBITALLBLOCKED or DEBITSUCCESSFULBLOCKED with any policy allowing more
than one job to dedicate a node (such as a Node Access Policy of SINGLEUSER,
SINGLECLASS, SINGLEACCOUNT, or UNIQUEUSER) is not supported.
The DEBITSUCCESSFUL* policies require Torque to work. Additionally, the job scripts
must return a negative number as the exit code on failure in order to be ignored for
charging.
Example
AMCFG[mam] CHARGEPOLICY=DEBITALLCPU
Charges are based on actual CPU usage only, not dedicated CPU resources.
If the LOCALCOST flag (AMCFG[] FLAGS=LOCALCOST) is set, Moab uses the information
gathered with CHARGEPOLICY to calculate charges. If LOCALCOST is not set, Moab sends
this information to the accounting manager to calculate charges.
CONTINUEFAILUREACTION
Format
<GeneralFailureAction>[,<FundsFailureAction>[,<ConnectionFailureAction>]] where the action
is one of CANCEL or IGNORE
Default
IGNORE,IGNORE,IGNORE
Description
If periodic charging is enabled (via the AMCFG[] FLUSHINTERVAL parameter), this parameter
specifies the action to be taken if a failure is detected when Moab performs its periodic
accounting update (e.g. to determine whether the job should be continued).
l Moab applies <ConnectionFailureAction> to a job if it is rejected due to a connection
  failure to MAM.
l Moab applies <FundsFailureAction> to a job if it is rejected due to insufficient funds.
l Moab applies <GeneralFailureAction> to a job if the account manager rejects it for any
  other reason.
l If you do not specify a <ConnectionFailureAction>, or if you do not specify a
  <FundsFailureAction>, then Moab will apply the <GeneralFailureAction> for the
  unspecified case.
If the action is set to CANCEL, Moab cancels the job; for IGNORE, Moab ignores the failure
and continues running the job.
Example
AMCFG[mam] CONTINUEFAILUREACTION=IGNORE,CANCEL,IGNORE
A job will be canceled if there are insufficient funds when Moab performs its periodic
accounting update; but will be allowed to continue running if MAM is down, or for any
other reason.
CONTINUEISBLOCKING
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, the scheduler will block while authorizing the continuation of a job with the
accounting manager. If set to FALSE, the accounting operation will be queued to the accounting
thread pool and scheduling will continue; but application of the failure action will be delayed
until a response is received.
Example
AMCFG[mam] CONTINUEISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when
checking to see if a job should be continued after a periodic accounting update.
CONTINUEURL
Format
exec://<fullPathToContinueScript> or null:
Default
exec://$TOOLSDIR/mam/usage.reserve.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
If periodic charging is enabled (via AMCFG[] FLUSHINTERVAL), when Moab performs a periodic
accounting update for a job, this script is invoked to determine whether there are sufficient
allocations for it to continue running for another period.
For jobs, the CONTINUEFAILUREACTION attribute specifies the action that Moab should take if the
authorization fails (such as for insufficient funds). If you use a DebitSuccessful
{Blocked,CPU,PE,WC} charge policy, Moab will not call the script because it does not yet know the
completion status of the job.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] CONTINUEURL=exec://$TOOLSDIR/mam/usage.continue.custom.pl
Moab calls the usage.continue.custom.pl script for authorization when checking to see if a
job should be continued after a periodic accounting update.
CREATECRED
Format
<BOOLEAN>
Default
FALSE
Description
If set to TRUE, Moab will be enabled to query accounts, users, user membership in accounts, and
users' default accounts from Moab Accounting Manager and define them in Moab. These
credentials can be manually updated by running mrmctl -R AM or automatically updated by
setting the AMCFG[] REFRESHPERIOD parameter.
If you want Moab to enforce the imported account-user memberships, you will need to set the
ENFORCEACCOUNTACCESS parameter to TRUE. See Moab Parameters on page 971 for more
information on the ENFORCEACCOUNTACCESS parameter.
Example
AMCFG[mam] CREATECRED=TRUE REFRESHPERIOD=30:00
Moab will automatically update account credential information from MAM every half
hour.
CREATEFAILUREACTION
Format
<AMFailureAction>[,<FundsFailureAction>] where the action is one of CANCEL, DEFER, HOLD,
or IGNORE
Default
IGNORE,IGNORE
Description
Before creating a job that should be tracked or charged within the accounting manager, Moab
contacts the accounting manager for authorization. If the job creation is rejected due to lack of
funds, Moab applies the FundsFailureAction to the job. For any other rejection reason including a
connection problem, Moab applies the AMFailureAction to the job. If you do not specify a
FundsFailureAction, Moab will apply the AMFailureAction for an insufficient funds failure. If the
action is set to CANCEL, Moab cancels the job; DEFER, defers the job; HOLD, puts the job on
hold; and IGNORE, ignores the failure and continues to start the job.
In order for the CREATEFAILUREACTION policy to be applied, the AMCFG[]
VALIDATEJOBSUBMISSION parameter must be set to true.
If you have either the AMCFG[] FALLBACKQOS or FALLBACKACCOUNT parameter
defined and a job is submitted that has insufficient funds to run, the fallback credential
will be applied and the configured CREATEFAILUREACTION action will be ignored.
Example
AMCFG[mam] CREATEFAILUREACTION=HOLD
A job will be placed on hold when submitted if there are insufficient funds for it to start.
CREATEISBLOCKING
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, the scheduler will block while authorizing the creation of a job with the accounting
manager. If set to FALSE, the accounting operation will be queued to the accounting thread pool
and scheduling will continue, but further consideration for the job will be delayed until a response
is received.
Example
AMCFG[mam] CREATEISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when
creating jobs.
CREATEURL
Format
exec://<fullPathToCreateScript> or null:
Default
exec://$TOOLSDIR/mam/usage.quote.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
Moab runs this script at the time a job or reservation is being created.
For jobs, the CREATEFAILUREACTION attribute specifies the action that should be taken if the
authorization fails (such as for insufficient funds).
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] CREATEURL=exec://$TOOLSDIR/mam/usage.create.custom.pl
Moab calls the usage.create.custom.pl script for authorization before starting a
job or reservation.
DELETEISBLOCKING
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, the scheduler will block while contacting the accounting manager to clean up after
a failed job start. If set to FALSE, the accounting operation will be queued to the accounting
thread pool and scheduling will continue.
Example
AMCFG[mam] DELETEISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when
cleaning up after failed job starts.
DELETEURL
Format
exec://<fullPathToDeleteScript> or null:
Default
exec://$TOOLSDIR/mam/lien.delete.mam.pl if TYPE=Native, otherwise it will make a direct call to
MAM (mam:)
Description
Moab runs this script to clean up after an interrupted job or reservation life-cycle. The default
behavior is to remove outstanding liens.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] DELETEURL=exec://$TOOLSDIR/mam/usage.delete.custom.pl
Moab calls the usage.delete.custom.pl script to clean up after an interrupted job
or reservation.
DISABLEDACTIONS
Description
This parameter is deprecated. It may be removed in a future release.
Instead, specify an empty value or a protocol of 'null:' for the corresponding AMCFG[]
CREATEURL, DELETEURL, ENDURL, PAUSEURL, RESUMEURL, STARTURL, and UPDATEURL
parameters.
ENDISBLOCKING
Format
<BOOLEAN>
Default
FALSE
Description
If set to TRUE, the scheduler will block while registering the end of a job with the accounting manager. If set to FALSE, the accounting operation will be queued to the accounting thread pool and
scheduling will continue.
Example
AMCFG[mam] ENDISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when a job
ends.
ENDURL
Format
exec://<fullPathToEndScript> or null:
Default
exec://$TOOLSDIR/mam/usage.charge.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
Moab runs this script after the end of a chargeable job or reservation in order to make a final
charge or update the accounting record. The default behavior is to make a prorated charge for the
job or reservation.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] ENDURL=exec://$TOOLSDIR/mam/usage.end.custom.pl
Moab calls the usage.end.custom.pl script to make the final charge for a job or
reservation.
FALLBACKACCOUNT
Format
<STRING>
Default
---
Description
If specified, Moab verifies adequate allocations for all new jobs. If adequate allocations are not
available in the job's primary account, Moab changes the job's credentials to use the fallback
account. If not specified, Moab places a hold on jobs that do not have adequate allocations in their
primary account.
Example
AMCFG[mam] FALLBACKACCOUNT=freecycle
Moab assigns the account freecycle to jobs that do not have adequate allocations in their
primary account.
FALLBACKQOS
Format
<STRING>
Default
---
Description
If specified, Moab verifies adequate allocations for all new jobs. If adequate allocations are not
available in the job's primary QoS, Moab changes the job's credentials to use the fallback QoS. If
not specified, Moab places a hold on jobs that do not have adequate allocations in their primary
QoS.
Example
AMCFG[mam] FALLBACKQOS=freecycle
Moab assigns the QoS freecycle to jobs that do not have adequate allocations in their
primary QoS.
FLAGS
Format
<STRING>
Default
---
Description
AMCFG flags are used to enable special services.
Example
AMCFG[mam] FLAGS=LOCALCOST
Moab calculates the charge for the job locally and sends that as a charge to the accounting
manager, which then charges that amount for the job.
FLUSHINTERVAL
Format
[[[DD:]HH:]MM:]SS or INFINITY
The former values of HOUR, DAY, WEEK, MONTH, or NONE are deprecated and may be
removed in a future release.
Default
INFINITY
Description
Indicates the amount of time between accounting manager updates for long running reservations
and jobs. If FLUSHINTERVAL is set to a positive time period, Moab will update the accounting manager (e.g. make an incremental charge) on the specified period relative to the start of the job or
reservation. If FLUSHINTERVAL is set to INFINITY, the update will only occur at the end of the job
or reservation.
Example
AMCFG[mam] FLUSHINTERVAL=1:00:00:00
Moab will make periodic accounting updates every 24 hours for long running jobs and
reservations.
HOST
Format
<STRING>
Default
localhost
Description
Specifies the host name for the accounting manager server daemon.
Example
AMCFG[mam] HOST=my-mam-server
Moab will communicate with the MAM server running on
my-mam-server.
LIENGRANULARITY
Format
One of: Partial or Combined
Default
Partial
Description
When periodic charging is enabled via AMCFG[] FLUSHINTERVAL, lien granularity controls
whether a combined lien is sought for the duration of the entire job (Combined) or whether
partial liens are sought for the duration of each periodic charge interval (Partial).
l When using a lien granularity of Partial, a job or reservation may get started if it has
  enough funds to run for the FLUSHINTERVAL, but it may trigger a
  CONTINUEFAILUREACTION if it runs out of funds before completion.
l When using a lien granularity of Combined, the funds for the entire job or reservation
  must be available before it starts, but the funds will be protected by the lien and
  consumed on a periodic interval.
Example
AMCFG[mam] LIENGRANULARITY=Combined
When using periodic charging, Moab will seek to obtain a lien for the entire duration of the
job or reservation before starting it.
MODE
Format
One of: strict-allocation, fast-allocation, notional-charging or usage-tracking
Default
strict-allocation
Description
Specifies the accounting mode. The accounting mode modifies the way in which accounting-relevant job stages (e.g. create, start, end, etc.) are processed. See Accounting Mode on page 393 for
details on the behavior of the accounting modes.
Example
AMCFG[mam] MODE=notional-charging
Configures Moab to use the notional-charging accounting mode when interacting with the
accounting manager.
NODECHARGEPOLICY
Format
One of AVG, MAX, or MIN
Default
MIN
Description
When charging for resource usage, the accounting manager will charge by node allocation according
to the specified policy. For AVG, MAX, and MIN, the accounting manager will charge by the average,
maximum, and minimum node charge rate of all allocated nodes. See CHARGEPOLICY.
If you use this feature in conjunction with the AMCFG[] LOCALCOST flag, Moab will include the
node charge in the calculation of the charge value sent to MAM. See LOCALCOST.
If you do not use this feature in conjunction with the AMCFG[] LOCALCOST flag, you must perform the
following MAM commands to include node charges in charge calculations:
1. Add NodeCharge as a usage record property.
mam-shell Attribute Create Object=UsageRecord Name=NodeCharge DataType=Float
Description="\"Node Charge\""
2. Add NodeCharge as a multiplier charge rate.
mam-create-chargerate -n NodeCharge -z "*1" -d "Node Charge Multiplier"
Example
NODECFG[node01] CHARGERATE=1.5
NODECFG[node02] CHARGERATE=1.75
AMCFG[mam] NODECHARGEPOLICY=MAX
Charge jobs by the maximum allocated node's charge rate.
PAUSEISBLOCKING
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, the scheduler will block while registering the suspension of a job with the accounting manager. If set to FALSE, the accounting operation will be queued to the accounting thread
pool and scheduling will continue.
Example
AMCFG[mam] PAUSEISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when
suspending jobs.
PAUSEURL
Format
exec://<fullPathToPauseScript> or null:
Default
exec://$TOOLSDIR/mam/usage.charge.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
Moab runs this script after preempting a job that might be resumed later. The default behavior is
to make an incremental charge but not create a fresh lien. If you use a DebitSuccessful
{Blocked,CPU,PE,WC} charge policy, Moab will not call the script because it does not yet know the
completion status of the job.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] PAUSEURL=exec://$TOOLSDIR/mam/usage.pause.custom.pl
Moab calls the usage.pause.custom.pl script after pausing a job.
PORT
Format
<INTEGER>
Default
7112
Description
Specifies the listening port for the accounting manager server daemon.
Example
AMCFG[mam] PORT=7731
Moab will communicate with the MAM server listening on
port 7731.
QUERYURL
Format
exec://<fullPathToQueryScript> or null:
Default
exec://$TOOLSDIR/mam/account.query.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
Moab runs this script to customize and forward the Moab query to the accounting manager. The
standard input to the script will be an XML Request in SSS format and is used directly between
Moab and Moab Accounting Manager. Its primary purpose is to synchronize accounts and user
information with the accounting manager if the CREATECRED parameter is specified.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] QUERYURL=exec://$TOOLSDIR/mam/cred.query.custom.pl
Moab calls the cred.query.custom.pl script in order to obtain account and user
information from the accounting manager.
REFRESHPERIOD
Format
[[[DD:]HH:]MM:]SS or INFINITY
The former values of MINUTE, HOUR, DAY or NONE are deprecated and may be removed
in a future release.
Default
INFINITY
Description
Indicates the period at which Moab will poll for updated information from Moab Accounting
Manager (MAM).
l If AMCFG[] CREATECRED is set to TRUE, Moab will update the accounting credentials from
  MAM on the specified period.
l If AMCFG[] MODE is set to fast-allocation, Moab will update the account balance cache
  from MAM on the specified period.
Moab will poll MAM for updated information when it first starts up unless REFRESHPERIOD is set
to 0. If REFRESHPERIOD is set to a positive time period, Moab will refresh the accounting credentials
on the specified period relative to the scheduler start time. If REFRESHPERIOD is set to INFINITY,
Moab will only request updated information from MAM when first started. Use mrmctl -R am to
force an immediate refresh.
Example
AMCFG[mam] REFRESHPERIOD=2:00:00
Moab will request an update from MAM every two hours.
RESUMEFAILUREACTION
Format
<GeneralFailureAction>[,<FundsFailureAction>[,<ConnectionFailureAction>]] where the action is
one of CANCEL, DEFER, HOLD, IGNORE, or RETRY
Default
IGNORE,IGNORE,IGNORE
Description
This action is applied after a failure with the accounting manager when a job is being resumed
(e.g. after being suspended).
l Moab will apply <ConnectionFailureAction> to a job if there is a connection failure
  between Moab and the accounting manager.
l Moab applies <FundsFailureAction> to the job if it is rejected due to insufficient funds.
l Moab applies <GeneralFailureAction> to a job if the account manager rejects it for any
  other reason.
l If you do not specify a <ConnectionFailureAction>, or if you do not specify a
  <FundsFailureAction>, then Moab will apply the <GeneralFailureAction> for the
  unspecified case.
If the action is set to CANCEL, Moab cancels the job; DEFER, Moab defers the job; HOLD,
Moab puts the job on hold; IGNORE, Moab ignores the failure and continues to resume the job;
RETRY, Moab does not resume the job on this attempt but will continue to try to resume the
job at the next opportunity.
Example
AMCFG[mam] RESUMEFAILUREACTION=HOLD,HOLD,IGNORE
A job will be resumed if Moab is unable to contact the accounting manager. Otherwise,
the job will be placed on hold if there is any other failure with the accounting manager
when Moab tries to resume it.
RESUMEISBLOCKING
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, the scheduler will block while authorizing the resumption of a job with the accounting manager. If set to FALSE, the accounting operation will be queued to the accounting thread
pool and scheduling will continue, but resumption of the job will be delayed until a response is
received.
Example
AMCFG[mam] RESUMEISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when
resuming jobs.
RESUMEURL
Format
exec://<fullPathToResumeScript> or null:
Default
exec://$TOOLSDIR/mam/usage.reserve.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
Moab runs this script before resuming a suspended job to determine whether it has
authorization to resume (e.g. has sufficient funds).
For jobs, the RESUMEFAILUREACTION attribute specifies the action that Moab should take if the
authorization fails (such as for insufficient funds). If you use a DebitSuccessful
{Blocked,CPU,PE,WC} charge policy, Moab will not call the script because it does not yet know the
completion status of the job.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] RESUMEURL=exec://$TOOLSDIR/mam/usage.resume.custom.pl
Moab calls the usage.resume.custom.pl script for authorization before resuming a
suspended job.
RETRYFAILEDCHARGES
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, job charges will be retried if they have failed due to a connection failure. When a
job charge or usage record update (such as might occur when a job is suspended, at the periodic
charge interval, or when a job completes) results in a connection failure between Moab and the
accounting manager, then the charge request will be saved to a file in
SPOOLDIR/am/retrying/. Once Moab detects that the connection with the accounting
manager has been restored, the charge will be retried up to CHARGERETRYCOUNT times.
Charges that fail due to reasons other than a connection failure, or connection failures that
surpass the CHARGERETRYCOUNT, will be saved to files in SPOOLDIR/am/failed/.
Example
AMCFG[mam] RETRYFAILEDCHARGES=TRUE
Moab will retry connection-oriented charge failures.
SERVER
Format
<URL>
Default
N/A
Description
Specifies the type and location of the accounting manager service.
Example
AMCFG[mam] SERVER=mam://tiny.supercluster.org:4368
STARTFAILUREACTION
Format:
<GeneralFailureAction>[,<FundsFailureAction>[,<ConnectionFailureAction>]] where the action is
one of CANCEL, DEFER, HOLD, IGNORE, or RETRY
Default:
IGNORE,IGNORE,IGNORE
Description:
Moab applies the appropriate failure action if there is a failure when registering the job start
with the accounting manager.
- Moab applies <ConnectionFailureAction> to the job if there is a communication problem with the accounting manager.
- Moab applies <FundsFailureAction> to the job if it is rejected due to insufficient funds.
- Moab applies <GeneralFailureAction> to a job if the accounting manager rejects it for any other reason.
- If you do not specify a <ConnectionFailureAction>, or if you do not specify a <FundsFailureAction>, then Moab will apply the <GeneralFailureAction> for the unspecified case.
If the action is set to CANCEL, Moab cancels the job; DEFER, Moab defers the job; HOLD, Moab
puts the job on hold; IGNORE, Moab ignores the failure and continues to start the job; and
RETRY, Moab does not start the job on this attempt but attempts to start the job at the next
opportunity.
Example:
AMCFG[mam] STARTFAILUREACTION=CANCEL,HOLD,IGNORE
A job will be placed on hold if there are insufficient funds when it is time for it to start. It
will be allowed to start if Moab is unable to reach the accounting manager. For all other
failures with the accounting manager, the job will be canceled.
STARTISBLOCKING
Format
<BOOLEAN>
Default
TRUE
Description
If set to TRUE, the scheduler will block while authorizing the starting of a job with the accounting
manager. If set to FALSE, the accounting operation will be queued to the accounting thread pool and
scheduling will continue, but the start of the job will be delayed until a response is received.
If using Moab in a Peer-to-Peer grid, do not set this parameter to FALSE. The Start
action is not supported as a non-blocking action in Peer-to-Peer grids.
Example
AMCFG[mam] STARTISBLOCKING=FALSE
Specifies that Moab should use non-blocking calls with the accounting manager when starting
jobs.
STARTURL
Format:
exec://<fullPathToStartScript> or null:
Default:
exec://$TOOLSDIR/mam/usage.reserve.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description:
Moab runs this script on a chargeable job or reservation to determine whether it should start.
For jobs, the STARTFAILUREACTION attribute specifies the action that Moab should take if the
authorization fails (such as for insufficient funds).
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example:
AMCFG[mam] STARTURL=exec://$TOOLSDIR/mam/usage.start.custom.pl
Moab calls the usage.start.custom.pl script for authorization before starting a
job or reservation.
THREADPOOLSIZE
Description
This parameter is undocumented in 9.0.
TIMEOUT
Format:
[[[DD:]HH:]MM:]SS
Default:
15
Description:
Specifies the maximum delay allowed for communications with the accounting manager.
Example:
AMCFG[mam] TIMEOUT=30
TYPE
Format
One of MAM or Native
Default
MAM
Description
Specifies the accounting manager interface type.
Example
AMCFG[mam] TYPE=MAM
Configures Moab to interact with MAM using the direct SSS
wire protocol.
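By contrast, a script-based interface would be selected with the Native type; a minimal sketch (with this setting, the URL attributes then default to the $TOOLSDIR/mam/ scripts noted throughout this table):
AMCFG[mam] TYPE=Native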
UPDATEURL
Format
exec://<fullPathToUpdateScript> or null:
Default
exec://$TOOLSDIR/mam/usage.charge.mam.pl if TYPE=Native, otherwise it will make a direct call
to MAM (mam:)
Description
If you have FLUSHINTERVAL set, Moab runs this script every flush interval for each chargeable job
or reservation to charge for the previous interval. This call is usually followed by a call to the
CONTINUEURL script, if defined, to check whether there are sufficient funds to run for the next
interval. If you use a DebitSuccessful{Blocked,CPU,PE,WC} charge policy, Moab will not call the
script because it does not yet know the completion status of the job.
To disable a script from being run at this stage, use 'null:' as the parameter value.
Example
AMCFG[mam] UPDATEURL=exec://$TOOLSDIR/mam/usage.update.custom.pl
Moab calls the usage.update.custom.pl script for authorization to continue a job
or reservation.
VALIDATEJOBSUBMISSION
Format
<BOOLEAN>
Default
FALSE
Description
If set to TRUE, when a new job is submitted, Moab will execute the CREATEURL script (for
TYPE=Native) or seek a job quote from Moab Accounting Manager (for TYPE=MAM) before
allowing the job to be submitted. Otherwise, the fund validation step is used only for reservations
and fallback account checks. If the call fails (for example, if the user's account does not have
sufficient funds or specifies an invalid account), Moab applies the CREATEFAILUREACTION.
Example
AMCFG[mam] VALIDATEJOBSUBMISSION=True CREATEFAILUREACTION=Hold
Verify jobs have sufficient funds to run at the time they are submitted.
AMCFG Flags
AMCFG flags can be used to enable special services and to disable default
services. These services are enabled or disabled by setting the AMCFG FLAGS
attribute (see FLAGS).
Flag Name
Description
ACCOUNTFAILASFUNDS
When this flag is set, logic failures within the accounting manager (for example, a job
that requests an account to which the user does not have access) are treated as fund
failures, and the job is canceled. When ACCOUNTFAILASFUNDS is not set, such failures
are treated as server failures.
LOCALCOST
Moab calculates the charge for the job locally and sends it to the accounting
manager, which charges that amount for the job.
STRICTQUOTE
Sends an estimated process count to the accounting manager when an initial quote is
requested for a newly-submitted job.
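These flags are set through the FLAGS attribute described above; a hedged sketch combining two of them, assuming the usual comma-delimited flag list:
AMCFG[mam] FLAGS=LOCALCOST,STRICTQUOTE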
Related Topics
- Accounting, Charging, and Allocation Management on page 391
Chapter 6 Controlling Resource Access - Reservations,
Partitions, and QoS Facilities
- Advance Reservations
- Partitions
- Quality of Service (QoS) Facilities
Advance Reservations
An advance reservation is the mechanism by which Moab guarantees the
availability of a set of resources at a particular time. Each reservation consists
of three major components: (1) a set of resources, (2) a time frame, and (3)
an access control list. It is the scheduler's role to ensure that the access control list
is not violated during the reservation's lifetime (that is, its time frame) on the
resources listed. For example, a reservation may specify that node002 is
reserved for user Tom on Friday. The scheduler is thus constrained to make
certain that only Tom's jobs can use node002 at any time on Friday. Advance
reservation technology enables many features including backfill, deadline-based
scheduling, grid scheduling, and QoS support.
The mrsvctl command is used to create, modify, query, and release
reservations.
- Reservation Overview
- Administrative Reservations
- Standing Reservations
- Reservation Policies
- Configuring and Managing Reservations
- Enabling Reservations for End-users
Reservation Overview
- Resources
- Time Frame
- Access Control List
- Job to Reservation Binding
- Reservation Specification
- Reservation Behavior
- Reservation Group
- Infinite Jobs and Reservations
Every reservation consists of three major components: (1) a set of resources, (2) a
time frame, and (3) an access control list. Additionally, a reservation may also
have a number of optional attributes controlling its behavior and interaction
with other aspects of scheduling. Reservation attribute descriptions follow.
Resources
Under Moab, the resources for a reservation are specified by way of a
task description. Conceptually, a task can be thought of as an atomic, or
indivisible, collection of resources. If reservation resources are unspecified, a
task is a node by default. To define a task, specify resources. The resources
may include processors, memory, swap, local disk, and so forth. For example,
a single task may consist of one processor, 2 GB of memory, and 10 GB of local
disk.
A reservation consists of one or more tasks. In attempting to locate the
resources required for a particular reservation, Moab examines all feasible
resources and locates the needed resources in groups specified by the task
description. An example may help clarify this concept:
Reservation A requires four tasks. Each task is defined as 1 processor and 1 GB
of memory.
Node X has 2 processors and 3 GB of memory available
Node Y has 2 processors and 1 GB of memory available
Node Z has 2 processors and 2 GB of memory available
When collecting the resources needed for the reservation, Moab examines
each node in turn. Moab finds that Node X can support 2 of the 4 tasks needed
by reserving 2 processors and 2 GB of memory, leaving 1 GB of memory
unreserved. Analysis of Node Y shows that it can only support 1 task reserving 1
processor and 1 GB of memory, leaving 1 processor unreserved. Note that the
unreserved memory on Node X cannot be combined with the unreserved
processor on Node Y to satisfy the needs of another task because a task
requires all resources to be located on the same node. Finally, analysis finds
that node Z can support 2 tasks, fully reserving all of its resources.
Both reservations and jobs use the concept of a task description in specifying
how resources should be allocated. It is important to note that although a task
description is used to allocate resources to a reservation, this description does
not in any way constrain the use of those resources by a job. In the above
example, a job requesting resources simply sees 4 processors and 4 GB of
memory available in reservation A. If the job has access to the reserved
resources and the resources meet the other requirements of the job, the job
could use these resources according to its own task description and needs.
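As a sketch of how this scenario might be configured, the four-task reservation above could be expressed as a standing reservation using the RESOURCES, TASKCOUNT, and HOSTLIST attributes documented later in this chapter (the reservation name and hostnames are illustrative):
SRCFG[rsvA] RESOURCES=PROCS:1,MEM:1024
SRCFG[rsvA] TASKCOUNT=4
SRCFG[rsvA] HOSTLIST=nodeX,nodeY,nodeZ
Moab would then gather the four 1-processor/1-GB tasks from the listed hosts exactly as described above.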
Currently, the resources that can be associated with reservations include
processors, memory, swap, local disk, initiator classes, and any number of
arbitrary resources. Arbitrary resources may include peripherals such as tape
drives, software licenses, or any other site-specific resource.
Time Frame
Associated with each reservation is a time frame. This specifies when the
resources will be reserved or dedicated to jobs that meet the reservation's
access control list (ACL). The time frame simply consists of a start time and an
end time. When configuring a reservation, this information may be specified as
a start time together with either an end time or a duration.
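For example, the two equivalent specifications might look like the following mrsvctl sketch (host and times are illustrative):
> mrsvctl -c -h node002 -s 8:00:00_10/03 -e 10:00:00_10/03
> mrsvctl -c -h node002 -s 8:00:00_10/03 -d 2:00:00
Both commands reserve node002 for the same two-hour window; the first gives a start/end pair, the second a start/duration pair.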
Access Control List
A reservation's access control list specifies which jobs can use a reservation.
Only jobs that meet one or more of a reservation's access criteria are allowed
to use the reserved resources during the reservation time frame. Currently,
the reservation access criteria include the following: users, groups, accounts,
classes, QOS, job attributes, job duration, and job templates.
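As an illustrative sketch, several of these criteria can be combined in a single standing reservation ACL (the credential names are hypothetical; see the Standing Reservation Attributes table later in this chapter):
SRCFG[test] USERLIST=tom GROUPLIST=staff MAXTIME=4:00:00
A job would gain access by matching any one of these criteria.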
Job to Reservation Binding
While a reservation's ACL will allow particular jobs to use reserved resources, it
does not force any job to use these resources. With each job, Moab attempts to
locate the best possible combination of available resources whether these are
reserved or unreserved. For example, in the following figure, note that job X,
which meets access criteria for both reservation A and B, allocates a portion of
its resources from each reservation and the remainder from resources outside
of both reservations.
Image 6-1: Job X uses resources from reservations A and B
Although by default, reservations make resources available to jobs that meet
particular criteria, Moab can be configured to constrain jobs to only run within
accessible reservations. This can be requested by the user on a job-by-job basis
using a resource manager extension flag, or it can be enabled administratively
via a QoS flag. For example, assume two reservations were created as follows:
> mrsvctl -c -a GROUP==staff -d 8:00:00 -h 'node[1-4]'
reservation staff.1 created
> mrsvctl -c -a USER==john -t 2
reservation john.2 created
If the user "john," who happened to also be a member of the group "staff,"
wanted to force a job to run within a particular reservation, "john" could do so
using the FLAGS resource manager extension. Specifically, in the case of a PBS
job, the following submission would force the job to run within the "staff.1"
reservation.
> msub -l nodes=1,walltime=1:00:00,flags=ADVRES:staff.1 testjob.cmd
Note that for this to work, PBS needs to have resource manager extensions
enabled as described in the PBS Resource Manager Extension Overview.
(Torque has resource manager extensions enabled by default.) If the user
wants the job to run on reserved resources but does not care which, the user
could submit the job with the following:
> msub -l nodes=1,walltime=1:00:00,flags=ADVRES testjob.cmd
To enable job to reservation mapping via QoS, the QoS flag USERESERVED
should be set in a similar manner.
Use the reservation BYNAME flag to require explicit binding for reservation
access.
To lock jobs linked to a particular QoS into a reservation or reservation group,
use the REQRID attribute.
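A hedged configuration sketch of the QoS approach (the QoS name special is hypothetical, and QFLAGS is assumed to be the QOSCFG attribute used to set QoS flags):
QOSCFG[special] QFLAGS=USERESERVED
QOSCFG[special] REQRID=staff.1
Jobs submitted under QoS special would then run only within accessible reservations and, with REQRID, be locked to the staff.1 reservation.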
Reservation Specification
There are two main types of reservations that sites typically deal with. The first,
administrative reservations, are typically one-time reservations created for
special purposes and projects. These reservations are created using the
mrsvctl or setres commands. These reservations provide an integrated
mechanism to allow graceful management of unexpected system
maintenance, temporary projects, and time critical demonstrations. This
command allows an administrator to select a particular set of resources or just
specify the quantity of resources needed. For example, an administrator could
use a regular expression to request a reservation be created on the nodes
"blue0[1-9]" or could simply request that the reservation locate the needed
resources by specifying a quantity-based request such as "TASKS==20."
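A sketch of both request styles (durations are illustrative):
> mrsvctl -c -h 'r:blue0[1-9]' -d 8:00:00
> mrsvctl -c -t 20 -d 8:00:00
The first command reserves the specific nodes matching the expression; the second asks Moab to locate 20 tasks wherever it can find them.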
The second type of reservation is called a standing reservation. It is specified
using the SRCFG parameter and is of use when there is a recurring need for a
particular type of resource distribution. Standing reservations are a powerful,
flexible, and efficient means for enabling persistent or periodic policies such as
those often enabled using classes or queues. For example, a site could use a
standing reservation to reserve a subset of its compute resources for quick
turnaround jobs during business hours on Monday thru Friday. The Standing
Reservation Overview provides more information about configuring and using
these reservations.
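A minimal sketch of such a business-hours policy (the reservation name, task count, and walltime limit are illustrative):
SRCFG[fast] PERIOD=DAY DAYS=Mon,Tue,Wed,Thu,Fri
SRCFG[fast] STARTTIME=8:00:00 ENDTIME=17:00:00
SRCFG[fast] MAXTIME=1:00:00 TASKCOUNT=16
This would reserve 16 tasks for jobs requesting no more than one hour of walltime, Monday through Friday from 8:00 AM to 5:00 PM.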
Reservation Behavior
As previously mentioned, a given reservation may have one or more access
criteria. A job can use the reserved resources if it meets at least one of these
access criteria. It is possible to stack multiple reservations on the same node.
In such a situation, a job can only use the given node if it has access to each
active reservation on the node.
Reservation Group
Reservations groups are ways of associating multiple reservations. This
association is useful for variable namespace and reservation requests. The
reservations in a group inherit the variables from the reservation group head,
but if the same variable is set locally on a reservation in the group, the local
variable overrides the inherited variable. Variable inheritance is useful for
triggers as it provides greater flexibility with automating certain tasks and
system behaviors.
Jobs may be bound to a reservation group (instead of a single reservation) by
using the resource manager extension ADVRES.
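For example, a hedged sketch binding a job to a reservation group whose head reservation is named grp1 (hypothetical):
> msub -l nodes=1,walltime=1:00:00,flags=ADVRES:grp1 testjob.cmd
The job may then run within any reservation belonging to that group rather than within one specific reservation.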
Infinite Jobs and Reservations
To allow infinite walltime jobs, you must have the following scheduler flag set:
SCHEDCFG[Moab] FLAGS=allowinfinitejobs
You can submit an infinite job by running:
msub -l walltime=INFINITY
Or create an infinite reservation by running:
mrsvctl -c -d INFINITY
Infinite jobs can run in infinite reservations. Infinite walltime also works with
job templates and advres.
Output XML for infinite jobs will print "INFINITY" in the ReqAWDuration, and
XML for infinite reservations will print "INFINITY" in duration and endtime.
<Data>
<rsv AUser="jgardner" AllocNodeCount="1" AllocNodeList="n5"
AllocProcCount="4" AllocTaskCount="1" HostExp="n5"
LastChargeTime="0" Name="jgardner.1" Partition="base"
ReqNodeList="n5:1" Resources="PROCS=[ALL]" StatCAPS="0"
StatCIPS="0" StatTAPS="0" StatTIPS="0" SubType="Other"
Type="User" cost="0.000000" ctime="1302127058"
duration="INFINITY" endtime="INFINITY" starttime="1302127058">
<ACL aff="neutral" cmp="%=" name="jgardner.1" type="RSV"></ACL>
<ACL cmp="%=" name="jgardner" type="USER"></ACL>
<ACL cmp="%=" name="company" type="GROUP"></ACL>
<ACL aff="neutral" cmp="%=" name="jgardner.1" type="RSV"></ACL>
<History>
<event state="PROCS=4" time="1302127058"></event>
</History>
</rsv>
</Data>
Related Topics
- Reservation Allocation Policies
Administrative Reservations
- Annotating Administrative Reservations
- Using Reservation Profiles
- Optimizing Maintenance Reservations
Administrative reservations behave much like standing reservations but are
generally created to address non-periodic, one-time issues. All administrative
reservations are created using the mrsvctl -c (or setres) command and are
persistent until they expire or are removed using the mrsvctl -r (or releaseres)
command.
Annotating Administrative Reservations
Reservations can be labeled and annotated using comments, allowing other
administrators, local users, portals and other services to obtain more detailed
information regarding the reservations. Naming and annotations are
configured using the -n and -D options of the mrsvctl command respectively, as
in the following example:
> mrsvctl -c -D 'testing infiniband performance' -n nettest -h 'r:agt[15-245]'
Using Reservation Profiles
You can set up reservation profiles to avoid manually and repetitively inputting
standard reservation attributes. Profiles can specify reservation names,
descriptions, ACLs, durations, hostlists, triggers, flags, and other aspects that
are commonly used. With a reservation profile defined, a new administrative
reservation can be created that uses this profile by specifying the -P flag as in
the following example.
Example 6-1:
RSVPROFILE[mtn1] TRIGGER=AType=exec,Action="/tmp/trigger1.sh",EType=start
RSVPROFILE[mtn1] USERLIST=steve,marym
RSVPROFILE[mtn1] HOSTEXP="r:50-250"
> mrsvctl -c -P mtn1 -s 12:00:00_10/03 -d 2:00:00
Example 6-2: Non-Blocking System Reservations with Scheduler Pause
RSVPROFILE[pause] TRIGGER=atype=exec,etype=start,action="/opt/moab/bin/mschedctl -p"
RSVPROFILE[pause] TRIGGER=atype=exec,etype=cancel,action="/opt/moab/bin/mschedctl -r"
RSVPROFILE[pause] TRIGGER=atype=exec,etype=end,action="/opt/moab/bin/mschedctl -r"
> mrsvctl -c -P pause -s 12:00:00_10/03 -d 2:00:00
Optimizing Maintenance Reservations
Any reservation causes some negative impact on cluster performance as it
further limits the scheduler's ability to optimize scheduling decisions. You can
mitigate this impact by using flexible ACLs and triggers.
In particular, a maintenance reservation can be configured to reduce its
effective reservation shadow by allowing overlap with
checkpointable/preemptible jobs until the time the reservation becomes
active. This can be done using a series of triggers that perform the following
actions:
- Modify the reservation to disable preemption access.
- Preempt jobs that may overlap the reservation.
- Cancel any jobs that failed to properly checkpoint and exit.
The following example highlights one possible configuration:
RSVPROFILE[adm1] JOBATTRLIST=PREEMPTEE
RSVPROFILE[adm1] DESCRIPTION="regular system maintenance"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-300,AType=internal,Action="rsv:-:modify:acl:jattr-=PREEMPTEE"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-240,AType=jobpreempt,Action="checkpoint"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-60,AType=jobpreempt,Action="cancel"
> mrsvctl -c -P adm1 -s 12:00:00_10/03 -d 8:00:00 -h ALL
This reservation reserves all nodes in the cluster for a period of eight hours.
Five minutes before the reservation starts, the reservation is modified to
remove access to new preemptible jobs. Four minutes before the reservation
starts, preemptible jobs that overlap the reservation are checkpointed. One
minute before the reservation, all remaining jobs that overlap the reservation
are canceled.
Reservations can also be used to evacuate virtual machines from a nodelist. To
do this, you can configure a reservation profile in the moab.cfg file that calls an
internal trigger to enable the evacuate VM logic. For example:
RSVPROFILE[evacvms] TRIGGER=EType=start,AType=internal,action=node:$(HOSTLIST):evacvms
> mrsvctl -c -P evacvms -s 12:00:00_10/03 -d 8:00:00 -h ALL
Please note that Moab gives its best effort in evacuating VMs; however, if other
reservations and policies prevent Moab from locating an alternate location for
the VMs to be migrated to, then no action will occur. Administrators can attach
additional triggers to the reservation profile to add evacuation logic where
needed.
If your organization uses Viewpoint 7.1 or later, there is an option when
creating reservations in Viewpoint to evacuate VMs from reserved nodes.
This functionality assumes the reservation profile in Moab is named
"evacvms." For Cloud customers, the evacvms reservation profile already
exists in your moab.cfg file configuration by default.
You can also manually create a reservation that evacuates VMs from a
nodelist by using the EVACVMS reservation flag. For example:
> mrsvctl -c -F EVACVMS -s 12:00:00_10/03 -d 8:00:00 -h ALL
Related Topics
- Backfill
- Preemption
- mrsvctl command
Standing Reservations
Standing reservations build upon the capabilities of advance reservations to
enable a site to enforce advanced usage policies in an efficient manner.
Standing reservations provide a superset of the capabilities typically found in a
batch queuing system's class or queue architecture. For example, queues can
be used to allow only particular types of jobs access to certain compute
resources. Also, some batch systems allow these queues to be configured so
that they only allow this access during certain times of the day or week.
Standing reservations allow these same capabilities but with greater flexibility
and efficiency than is typically found in a normal queue management system.
Standing reservations provide a mechanism by which a site can dedicate a
particular block of resources for a special use on a regular daily or weekly basis.
For example, node X could be dedicated to running jobs only from users in the
accounting group every Friday from 4 to 10 p.m. See the Reservation Overview
for more information about the use of reservations. The Managing
Reservations section provides a detailed explanation of the concepts and steps
involved in the creation and configuration of standing reservations.
A standing reservation is a powerful means of doing the following:
l
Controlling local credential based access to resources.
l
Controlling external peer and grid based access to resources.
l
Controlling job responsiveness and turnaround.
Related Topics
- SRCFG
- Moab Workload Manager for Grids
- mdiag -s (diagnose standing reservations)
Reservation Policies
- Controlling Priority Reservation Creation
- Managing Resource Failures
- Resource Allocation Policy
- Charging for Reserved Resources
Controlling Priority Reservation Creation
In addition to standing and administrative reservations, Moab can also create
priority reservations. These reservations are used to allow the benefits of out-of-order execution (such as is available with backfill) without the side effect of
job starvation. Starvation can occur in any system where the potential exists
for a job to be overlooked by the scheduler for an indefinite period. In the case
of backfill, small jobs may continue to run on available resources as they
become available while a large job sits in the queue, never able to find enough
nodes available simultaneously on which to run.
To avoid such situations, priority reservations are created for high priority jobs
that cannot run immediately. When making these reservations, the scheduler
determines the earliest time the job could start and then reserves these
resources for use by this job at that future time.
Priority Reservation Creation Policy
Organizations have the ability to control how priority reservations are created
and maintained. It is possible that one job can be at the top of the priority
queue for a time and then get bypassed by another job submitted later. The
parameter RESERVATIONPOLICY allows a site to determine how existing
reservations should be handled when new reservations are made.
Value
Description
HIGHEST
All jobs that have ever received a priority reservation up to the RESERVATIONDEPTH
number will maintain that reservation until they run, even if other jobs later bypass them
in priority value.
For example, suppose there are four jobs with priorities of 8, 10, 12, and 20, and the
following is configured:
RESERVATIONPOLICY HIGHEST
RESERVATIONDEPTH 3
Only jobs 20, 12, and 10 get priority reservations. Later, if a job with priority higher than
20 is submitted into the queue, it will also get a priority reservation along with the jobs
listed previously. If four jobs higher than 20 were to be submitted into the queue, only
three would get priority reservations, in accordance with the condition set in the
RESERVATIONDEPTH policy.
With HIGHEST, Moab may appear to exceed the RESERVATIONDEPTH if it has already
scheduled the maximum number of priority reservations and then users submit jobs with
higher priority than those already given a priority reservation. Moab keeps all of the
previously-created priority reservations and creates new ones for jobs with higher priority
(again up to the quantity specified with RESERVATIONDEPTH). This means that, if your
RESERVATIONDEPTH is set to 3, Moab can potentially schedule up to 3 new priority
reservations each scheduling iteration, as long as new higher-priority jobs are continually
submitted. This behavior ensures that the highest-priority jobs receive attention while the
former highest-priority jobs do not lose their priority reservation.
CURRENTHIGHEST
Only the current top <RESERVATIONDEPTH> priority jobs receive reservations. Under this
policy, all job reservations are destroyed each iteration when the queue is re-prioritized.
The top jobs in the queue are then given new reservations.
NEVER
No priority reservations are made.
Priority Reservation Depth
By default, only the highest priority job receives a priority reservation.
However, this behavior is configurable via the RESERVATIONDEPTH policy.
Moab's default behavior of only reserving the highest priority job allows backfill
to be used in a form known as liberal backfill. Liberal backfill tends to maximize
system utilization and minimize overall average job turnaround time. However,
it does lead to the potential of some lower priority jobs being indirectly delayed
and may lead to greater variance in job turnaround time. The
RESERVATIONDEPTH parameter can be set to a very large value, essentially
enabling what is called conservative backfill where every job that cannot run is
given a reservation. Most sites prefer the liberal backfill approach associated
with the default RESERVATIONDEPTH of 1 or else select a slightly higher value. It is
important to note that to prevent starvation in conjunction with reservations,
monotonically increasing priority factors such as queue time or job XFactor
should be enabled. See the Prioritization Overview for more information on
priority factors.
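A hedged sketch of a slightly more conservative configuration than the default (the depth value is illustrative):
RESERVATIONPOLICY HIGHEST
RESERVATIONDEPTH 4
The top four priority jobs that cannot start immediately would each receive, and keep, a priority reservation.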
Another important consequence of backfill and reservation depth is how they
affect job priority. In Moab, all jobs are prioritized. Backfill allows jobs to be run
out of order and thus, to some extent, job priority to be ignored. This effect,
known as priority dilution, can cause many site policies implemented via Moab
prioritization policies to be ineffective. Setting the RESERVATIONDEPTH parameter
to a higher value gives job priority more teeth at the cost of slightly lower
system utilization. This lower utilization results from the constraints of these
additional reservations, decreasing the scheduler's freedom and its ability to
find additional optimizing schedules. Anecdotal evidence indicates that these
utilization losses are fairly minor, rarely exceeding 8%.
It is difficult a priori to know the right setting for the RESERVATIONDEPTH
parameter. Surveys indicate that the vast majority of sites use the default
value of 1. Sites that do modify this value typically set it somewhere in the
range of 2 to 10. The following guidelines may be useful in determining if and
how to adjust this parameter:
Reasons to Increase RESERVATIONDEPTH
- The estimated job start time information provided by the showstart command is heavily used and the accuracy needs to be increased.
- Priority dilution prevents certain key mission objectives from being fulfilled.
- Users are more interested in knowing when their job will run than in having it run sooner.
Reasons to Decrease RESERVATIONDEPTH
- Scheduling efficiency and job throughput need to be increased.
Assigning Per-QoS Reservation Creation Rules
QoS based reservation depths can be enabled via the RESERVATIONQOSLIST
parameter. This parameter allows varying reservation depths to be associated
with different sets of job QoSs. For example, the following configuration
creates two reservation depth groupings:
RESERVATIONDEPTH[0]   8
RESERVATIONQOSLIST[0] highprio,interactive,debug
RESERVATIONDEPTH[1]   2
RESERVATIONQOSLIST[1] batch
This example causes the top 8 jobs belonging to the aggregate group of highprio, interactive, and
debug QoS jobs to receive priority reservations. Additionally, the top two batch QoS jobs will also receive
priority reservations. Use of this feature allows sites to maintain high throughput for important jobs by
guaranteeing that a significant proportion of these jobs progress toward starting through use of the
priority reservation.
By default, the following parameters are set inside Moab:
RESERVATIONDEPTH[DEFAULT]   1
RESERVATIONQOSLIST[DEFAULT] ALL
This allows one job with the highest priority to get a reservation. These values can be overwritten by
modifying the DEFAULT policy.
Managing Resource Failures
Moab allows organizations to control how best to respond to a number of real-world issues. Occasionally, when a reservation becomes active and a job
attempts to start, various resource manager race conditions or corrupt state
situations will prevent the job from starting. By default, Moab assumes the
resource manager is corrupt, releases the reservation, and attempts to re-create the reservation after a short timeout. However, in the interval between
the reservation release and the re-creation timeout, other priority reservations
may allocate the newly available resources, reserving them before the original
reservation gets an opportunity to reallocate them. Thus, when the original job
reservation is re-established, its original resource may be unavailable and the
resulting new reservation may be delayed several hours from the earlier start
time. The parameter RESERVATIONRETRYTIME allows a site that is
experiencing frequent resource manager race conditions and/or corruption
situations to tell Moab to hold on to the reserved resource for a period of time in
an attempt to allow the resource manager to correct its state.
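A minimal sketch of this parameter in moab.cfg (the five-minute value is illustrative):
RESERVATIONRETRYTIME 00:05:00
With this set, Moab holds the reserved resources for up to five minutes while retrying, rather than immediately releasing and re-creating the reservation.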
Resource Allocation Policy
By default, when a standing or administrative reservation is created, Moab
allocates nodes in accordance with the specified taskcount, node expression,
node constraints, and the MINRESOURCE node allocation policy.
Charging for Reserved Resources
If an accounting manager is configured within Moab, resources consumed by
jobs are tracked and charged by default. However, resources dedicated to a
reservation are not charged although they are recorded within the reservation
event record. In particular, total processor-seconds reserved by the
reservation are recorded as are total unused processor-seconds reserved
(processor-seconds not consumed by an active job). While this information is
available in real-time using the mdiag -r command (see the "Active PH" field), it
is not written to the event log until reservation completion.
The default behavior for reservation tracking and charging via an accounting
manager is defined by the AMCFG ALWAYSCHARGERESERVATIONS
parameter. The default value for this attribute is False, meaning that charging
will not normally occur for reservations (administrative or standing), unless
specifically requested for the individual reservation. Likewise, if
ALWAYSCHARGERESERVATIONS is set to True, idle cycles will be charged for all
reservations (administrative or standing) unless specifically disabled for the
individual reservation.
If ALWAYSCHARGERESERVATIONS is set to False (the default), charging may be
enabled for individual reservations by specifying the CHARGEACCOUNT and
CHARGEUSER attributes for the reservation. For standing reservations, these
are set via the SRCFG CHARGEACCOUNT and CHARGEUSER parameters. For
administrative reservations, these are set via the mrsvctl -S aaccount and -S auser options.
Example 6-3: Enabling charging in a standing reservation
SRCFG[foo] PERIOD=DAY DAYS=Mon,Tue,Wed,Thu,Fri DEPTH=1 USERLIST=amy
CHARGEACCOUNT=chemistry CHARGEUSER=amy RESOURCES=PROCS:1 TASKCOUNT=2
Example 6-4: Enabling charging in an administrative reservation
mrsvctl -c -a USER=amy -S aaccount=chemistry -S auser=amy -R procs=1 -t 1 -d 7200
If ALWAYSCHARGERESERVATIONS is set to True, charging may be disabled for
individual reservations by specifying the reservation Charge attribute with a
value of False. For standing reservations, this is set via the SRCFG CHARGE
parameter. For administrative reservations, this is set via the -S charge
option.
Example 6-5: Disabling charging in a standing reservation
SRCFG[foo] PERIOD=DAY DAYS=Mon,Tue,Wed,Thu,Fri DEPTH=1 USERLIST=amy CHARGE=False
RESOURCES=PROCS:1 TASKCOUNT=2
Example 6-6: Disabling charging in an administrative reservation
mrsvctl -c -a USER=amy -S charge=False -R procs=1 -t 1 -d 7200
Related Topics
Reservation Overview
Backfill
Configuring and Managing Reservations
- Reservation Attributes
  - Start/End Time
  - Access Control List (ACL)
  - Selecting Resources
  - Flags
- Configuring and Managing Standing Reservations
  - Standing Reservation Attributes
  - Standing Reservation Overview
  - Specifying Reservation Resources
  - Enforcing Policies Via Multiple Reservations
  - Affinity
  - ACL Modifiers
  - Reservation Ownership
  - Partitions
  - Resource Allocation Behavior
  - Rollback Reservations
  - Modifying Resources with Standing Reservations
- Managing Administrative Reservations
Reservation Attributes
All reservations possess a time frame of activity, an access control list (ACL),
and a list of resources to be reserved. Additionally, reservations may also
possess a number of extension attributes including epilog/prolog specification,
reservation ownership and accountability attributes, and special flags that
modify the reservation's behavior.
Start/End Time
All reservations possess a start and an end time that define the reservation's
active time. During this active time, the resources within the reservation may
only be used as specified by the reservation access control list (ACL). This active
time may be specified as either a start/end pair or a start/duration pair.
Reservations exist and are visible from the time they are created until the
active time ends at which point they are automatically removed.
Access Control List (ACL)
For a reservation to be useful, it must be able to limit who or what can access
the resources it has reserved.
By default, a reservation may allocate resources that possess credentials
that meet the submitter's ACL. In other words, a user's reservation won't
necessarily allocate only free and idle nodes. If a reservation exists that
coincides with the submitter's ACL, the nodes under that reservation are
also considered for allocation. This is referred to as ACL overlap. To make
new reservations allocate only free and idle nodes, you must use the
NOACLOVERLAP flag.
This is handled by way of an ACL. With reservations, ACLs can be based on
credentials, resources requested, or performance metrics. In particular, with a
standing reservation, the attributes USERLIST, GROUPLIST, ACCOUNTLIST,
CLASSLIST, QOSLIST, JOBATTRLIST, PROCLIMIT, MAXTIME, or TIMELIMIT
may be specified. (See Affinity and Modifiers.)
Reservation access can be adjusted based on a job's requested node
features by mapping node feature requests to job attributes as in the
following example:
NODECFG[DEFAULT]  FEATURES=ia64
NODETOJOBATTRMAP  ia64,ia32
SRCFG[pgs]        JOBATTRLIST=ia32
> mrsvctl -c -a jattr=gpfs\! -h "r:13-500"
Selecting Resources
When specifying which resources to reserve, the administrator has a number of
options. These options allow control over how many resources are reserved
and where they are reserved. The following reservation attributes allow the
administrator to define resources.
Task Description
Moab uses the task concept extensively for its job and reservation
management. A task is simply an atomic collection of resources, such as
processors, memory, or local disk, which must be found on the same node. For
example, if a task requires 4 processors and 2 GB of memory, the scheduler
must find all processors AND memory on the same node; it cannot allocate 3
processors and 1 GB on one node and 1 processor and 1 GB of memory on
another node to satisfy this task. Tasks constrain how the scheduler must
collect resources for use in a standing reservation; however, they do not
constrain the way in which the scheduler makes these cumulative resources
available to jobs. A job can use the resources covered by an accessible
reservation in whatever way it needs. If reservation X allocates 6 tasks with 2
processors and 512 MB of memory each, it could support job Y which requires
10 tasks of 1 processor and 128 MB of memory or job Z which requires 2 tasks
of 4 processors and 1 GB of memory each. The task constraints used to acquire
a reservation's resources are transparent to a job requesting use of these
resources.
Example 6-7:
SRCFG[test] RESOURCES=PROCS:2,MEM:1024
Taskcount
Using the task description, the taskcount attribute defines how many tasks
must be allocated to satisfy the reservation request. To create a reservation, a
taskcount and/or a hostlist must be specified.
Example 6-8:
SRCFG[test] TASKCOUNT=256
Hostlist
A hostlist constrains the set of resources available to a reservation. If no
taskcount is specified, the reservation attempts to reserve one task on each of
the listed resources. If a taskcount is specified that requests fewer resources
than listed in the hostlist, the scheduler reserves only the number of tasks from
the hostlist specified by the taskcount attribute. If a taskcount is specified that
requests more resources than listed in the hostlist, the scheduler reserves the
hostlist nodes first and then seeks additional resources outside of this list.
When specifying resources for a hostlist, you can specify an exact set, a superset, or
a subset of nodes on which the job must run. Use the caret (^) or asterisk (*)
characters to specify a hostlist as a superset or subset, respectively.
- An exact set is defined without a caret or asterisk. An exact set means all the hosts in the specified hostlist must be selected for the job.
- A subset means the specified hostlist is used first to select hosts for the job. If the job requires more hosts than are in the subset hostlist, they will be obtained from elsewhere if possible. If the job does not require all of the nodes in the subset hostlist, it will use only the ones it needs.
- A superset means the hostlist is the only source of hosts that should be considered for running the job. If the job can't find the necessary resources in the superset hostlist it should not run. No other hosts should be considered in allocating the job.
Example 6-9:
SRCFG[test] HOSTLIST=node01,node1[3-5]
Example 6-10: Subset
SRCFG[one] HOSTLIST=node1,node5* TASKCOUNT=5 PERIOD=DAY USERLIST=user1
Example 6-11: Superset
SRCFG[two] HOSTLIST=node1,node2,node3,node4,node5^ TASKCOUNT=3 PERIOD=DAY
USERLIST=user1
Node Features
Node features can be specified to constrain which resources are considered.
Example 6-12:
SRCFG[test] NODEFEATURES=fastos
Partition
A partition may be specified to constrain which resources are considered.
Example 6-13:
SRCFG[test] PARTITION=core3
Flags
Reservation flags allow specification of special reservation attributes or
behaviors. Supported flags are listed in the following table:
Flag Name
Description
ACLOVERLAP
Deprecated (this is now a default flag). In addition to free or idle
nodes, a reservation may also reserve resources that possess credentials
that meet the reservation's ACL. To change this behavior, set the
NOACLOVERLAP flag.
ADVRESJOBDESTROY
All jobs that have an ADVRES matching this reservation are canceled when
the reservation is destroyed.
ALLOWGRID
By default, jobs migrated from one Moab to another Moab in a grid are
not allowed within local reservations. This flag allows migrated jobs to
access local reservations when they match the ACL.
ALLOWJOBOVERLAP
A job is allowed to start in a reservation that may end before the job completes. When the reservation ends before the job completes, the job will
not be canceled but will continue to run.
BYNAME
Reservation only allows access to jobs that meet reservation ACLs and
explicitly request the resources of this reservation using the job ADVRES
flag. (See Job to Reservation Binding.)
DEDICATEDRESOURCE
(aka EXCLUSIVE)
Reservation placed only on resources that are not reserved by any other
reservation, including job, system, and user reservations. There are two
exceptions to this:
1. Reserved resources could be allocated when DEDICATEDRESOURCE is
combined with IGNJOBRSV*
2. Reserved resources could be allocated when a reservation matches
the submitter's ACL. In this case, to make DEDICATEDRESOURCE truly
exclusive, use the NOACLOVERLAP flag.
The order that SRCFG reservations are listed in the configuration
is important when using DEDICATEDRESOURCE, because
reservations made afterwards can steal resources later. During
configuration, list DEDICATEDRESOURCE reservations last to
guarantee exclusiveness.
EVACVMS
Reservation will automatically evacuate virtual machines from the
reservation nodelist.
The same action can be accomplished by using reservation
profiles. For more information, see Optimizing Maintenance
Reservations.
IGNIDLEJOBS*
Reservation can be placed on top of idle job reservations.
This flag is meant to be used in conjunction with
DEDICATEDRESOURCE.
IGNJOBRSV*
Ignores existing job reservations, allowing the reservation to be forced
onto available resources even if it conflicts with existing job reservations.
User and system reservation conflicts are still valid. It functions the same
as IGNIDLEJOBS plus allows a reservation to be placed on top of an
existing running job's reservation.
This flag is meant to be used in conjunction with
DEDICATEDRESOURCE.
IGNRSV*
Request ignores existing resource reservations allowing the reservation to
be forced onto available resources even if this conflicts with other
reservations. It functions the same as IGNJOBRSV plus allows the
reservation to be placed on top of the system reservations.
This flag is meant to be used in conjunction with
DEDICATEDRESOURCE.
IGNSTATE*
Reservation ignores node state when assigning nodes. It functions the
same as IGNRSV plus allows the reservation to be placed on nodes that are
not currently available. Also ignores resource availability on nodes.
IGNSTATE is specified by default when using a HOSTLIST to define
nodes. However, if using a HOSTLIST and a TASKCOUNT, you need
to specify IGNSTATE if you want Moab to ignore the node state
when assigning nodes to the reservation.
NOACLOVERLAP
All resources must be free or idle, with no existing reservations. Moab will
not allocate in-use resources even if they match the reservation's ACL.
mrsvctl -c -t 12 -E -F noacloverlap -a user==john
Moab looks for resources that are exclusive (free). Without the
flag, Moab would look for resources that are exclusive or that
are already running john's jobs.
This flag is meant to be used in conjunction with
DEDICATEDRESOURCE.
NOVMMIGRATION
If set on a reservation, this prevents VMs from being migrated away from
the reservation. If there are multiple reservations on the hypervisor and
at least one reservation does not have the NOVMMIGRATION flag, then
VMs will be migrated.
OWNEREXCLUSIVEBF
When the owner of the reservation has an idle job in the queue, only
owner jobs will be allowed to backfill into the reservation. This blocks non-owner jobs from backfilling into the reservation.
ENABLEPROFILING must be set for the owner credential.
OWNERPREEMPT
Jobs by the reservation owner are allowed to preempt non-owner jobs
using reservation resources.
OWNERPREEMPTIGNOREMINTIME
Allows the OWNERPREEMPT flag to "trump" the PREEMPTMINTIME
setting for jobs already running on a reservation when the owner of the
reservation submits a job. For example: without the
OWNERPREEMPTIGNOREMINTIME flag set, a job submitted by the
owner of a reservation will not preempt non-owner jobs already running
on the reservation until the PREEMPTMINTIME setting (if set) for those
jobs is passed.
With the OWNERPREEMPTIGNOREMINTIME flag set, a job submitted
by the owner of a reservation immediately preempts non-owner jobs
already running on the reservation, regardless of whether
PREEMPTMINTIME is set for the non-owner jobs.
OWNERPREEMPTQT
Specifies how much time a job from OWNER must wait in the queue
before preempting jobs within the standing reservation.
SRCFG[test] OWNERPREEMPTQT=2:00:00
OWNER jobs must wait 2 hours in the queue before preempting.
REQFULL
Reservation is only created when all resources can be allocated.
SINGLEUSE
Reservation is automatically removed after completion of the first job to
use the reserved resources.
SPACEFLEX
Deprecated (this is now a default flag). Reservation is allowed to adjust
resources allocated over time in an attempt to optimize resource utilization.
* IGNIDLEJOBS, IGNJOBRSV, IGNRSV, and IGNSTATE flags are built on
one another and form a hierarchy. IGNJOBRSV performs the function of
IGNIDLEJOBS plus its own functions. IGNRSV performs the function of
IGNJOBRSV and IGNIDLEJOBS plus its own functions. IGNSTATE performs
the function of IGNRSV, IGNJOBRSV, and IGNIDLEJOBS plus its own
functions. While you can use combinations of these flags, it is not
necessary. If you set one flag, you do not need to set other flags that fall
beneath it in the hierarchy.
Most flags can be associated with a reservation via the mrsvctl -c -F command
or the SRCFG parameter.
Configuring Standing Reservations
Standing reservations allow resources to be dedicated for particular uses. This
dedication can be configured to be permanent or periodic, recurring at a
regular time of day and/or time of week. There is extensive applicability of
standing reservations for everything from daily dedicated job runs to improved
use of resources on weekends. By default, standing reservations can overlap
other reservations. Unless you set an ignore-type flag (ACLOVERLAP,
DEDICATEDRESOURCE, IGNIDLEJOBS, or IGNJOBRSV), they are automatically
given the IGNRSV flag. All standing reservation attributes are specified via the
SRCFG parameter using the attributes listed in the table below.
Standing Reservation Attributes
ACCESS
Format
DEDICATED or SHARED
Default
---
Description
If set to SHARED, allows a standing reservation to use resources already allocated to other non-job reservations. Otherwise, these other reservations block resource access.
Example
SRCFG[test] ACCESS=SHARED
Standing reservation test may access resources allocated to existing standing and
administrative reservations.
The order in which SRCFG reservations are listed in the configuration is important when
using DEDICATED, because reservations made afterwards can steal resources later.
During configuration, list DEDICATED reservations last to guarantee exclusiveness.
ACCOUNTLIST
Format
List of valid, comma delimited account names (see ACL Modifiers).
Default
---
Description
Specifies that jobs with the associated accounts may use the resources contained within this reservation.
Example
SRCFG[test] ACCOUNTLIST=ops,staff
Jobs using the account ops or staff are granted access to the resources in standing
reservation test.
CHARGE
Format
<BOOLEAN>
Default
---
Description
Overrides the default charging behavior. If set to True, indicates that this reservation should be
charged, even if no ChargeAccount or ChargeUser are specified (this assumes your Accounting
Manager is set up to permit this). If set to False, indicates that this reservation should not be
charged. It is not necessary to specify CHARGE=True if CHARGEACCOUNT or CHARGEUSER is specified.
Example
SRCFG[sr_mam1] CHARGE=False
Prevent charges to this reservation (might be used when AMCFG[]
ALWAYSCHARGERESERVATIONS=True).
CHARGEACCOUNT
Format
Any valid account name.
Default
---
Description
Specifies that idle cycles for this reservation should be charged against the specified account (via
the Accounting Manager).
CHARGEACCOUNT must be used in conjunction with CHARGEUSER.
Example
SRCFG[sr_mam1] CHARGEACCOUNT=math
SRCFG[sr_mam1] CHARGEUSER=john
Moab charges all idle cycles within reservations supporting standing reservation sr_
mam1 to account math.
CHARGEUSER
Format
Any valid username.
Default
---
Description
Specifies that idle cycles for this reservation should be charged against the specified user (via the
Accounting Manager).
CHARGEUSER must be used in conjunction with CHARGEACCOUNT.
Example
SRCFG[sr_mam1] CHARGEACCOUNT=math
SRCFG[sr_mam1] CHARGEUSER=john
Moab charges all idle cycles within reservations supporting standing reservation sr_
mam1 to user john.
CLASSLIST
Format
List of valid, comma delimited classes/queues (see ACL Modifiers).
Default
---
Description
Specifies that jobs with the associated classes/queues may use the resources contained within this
reservation.
Example
SRCFG[test] CLASSLIST=!interactive
Jobs not using the class interactive are granted access to the resources in standing
reservation test.
CLUSTERLIST
Format
List of valid, comma-delimited peer clusters (see Moab Workload Manager for Grids).
Default
---
Description
Specifies that jobs originating within the listed clusters may use the resources contained within
this reservation.
Example
SRCFG[test] CLUSTERLIST=orion2,orion7
Moab grants jobs from the listed peer clusters access to the reserved resources.
COMMENT
Format
<STRING>
If the string contains whitespace, it should be enclosed in single (') or double quotes (").
Default
---
Description
Specifies a descriptive message associated with the standing reservation and all child reservations.
Example
SRCFG[test] COMMENT='rsv for network testing'
Moab annotates the standing reservation test and all child reservations with the specified
message. These messages show up within Moab client commands, Moab web tools, and
graphical administrator tools.
DAYS
Format
One or more of the following (comma-delimited): Mon, Tue, Wed, Thu, Fri, Sat, Sun, or [ALL]
Default
[ALL]
Description
Specifies which days of the week the standing reservation is active.
Example
SRCFG[test] DAYS=Mon,Tue,Wed,Thu,Fri
Standing reservation test is active Monday through
Friday.
DEPTH
Format
<INTEGER>
Default
2
Description
Specifies the depth of standing reservations to be created (one per period).
To satisfy the DEPTH, Moab creates new reservations at the beginning of the specified
PERIOD. If your reservation ends at the same time that a new PERIOD begins, the number
of reservations may not match the requested DEPTH. To prevent or resolve this issue, set
the ENDTIME a couple of minutes before the beginning of the next PERIOD. For example, set
the ENDTIME to 23:58 instead of 00:00.
Example
SRCFG[test] PERIOD=DAY DEPTH=6
Specifies that six reservations will be created for standing reservation test.
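A hedged sketch combining DEPTH with the ENDTIME workaround described above:
SRCFG[test] PERIOD=DAY DEPTH=6 ENDTIME=23:58:00
Ending each daily reservation at 23:58 keeps the number of existing reservations at the requested depth when the next period's reservation is created.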
DISABLE
Format
<BOOLEAN>
Default
FALSE
Description
Specifies that the standing reservation should no longer spawn child reservations.
Example
SRCFG[test] PERIOD=DAY DEPTH=7 DISABLE=TRUE
Specifies that reservations are created for standing reservation test for today and
the next six days.
ENDTIME
Format
[[[DD:]HH:]MM:]SS
Default
24:00:00
Description
Specifies the time of day the standing reservation period ends (end of day or end of week depending on PERIOD).
Example
SRCFG[test] STARTTIME=8:00:00
SRCFG[test] ENDTIME=17:00:00
SRCFG[test] PERIOD=DAY
Standing reservation test is active from 8:00 AM until 5:00 PM.
FLAGS
Format
Comma-delimited list of zero or more flags listed in the reservation flags overview.
Default
---
Description
Specifies special reservation attributes.
Example
SRCFG[test] FLAGS=BYNAME,DEDICATEDRESOURCE
Jobs may only access the resources within this reservation if they explicitly request the
reservation by name. Further, the reservation is created to not overlap with other
reservations.
GROUPLIST
Format
One or more comma-delimited group names.
Default
[ALL]
Description
Specifies the groups allowed access to this standing reservation (see ACL Modifiers).
Example
SRCFG[test] GROUPLIST=staff,ops,special
SRCFG[test] CLASSLIST=interactive
Moab allows jobs with the listed group IDs or which request the job class interactive to
use the resources covered by the standing reservation.
HOSTLIST
Format
One or more comma-delimited host names or host expressions, or the string "class:<classname>".
Default
---
Description
Specifies the set of hosts that the scheduler can search for resources to satisfy the reservation. If
specified using the "class:X" format, Moab only selects hosts that support the specified class. If
TASKCOUNT is also specified, only TASKCOUNT tasks are reserved. Otherwise, all matching hosts
are reserved.
The HOSTLIST attribute is treated as a host regular expression, so foo10 will map to foo10, foo101, foo1006, and so forth. To request an exact host match, bound the expression with the caret and dollar sign markers, as in ^foo10$.
When specifying resources for a hostlist, you can specify an exact set, a superset, or a subset of nodes on which the job must run. Use the caret (^) or asterisk (*) characters to specify a hostlist as a superset or subset, respectively. See hostlist in Selecting Resources for more information.
When using r:, ensure your node indexes are correct by customizing the NODEIDFORMAT parameter. See NODEIDFORMAT for more information.
Example
SRCFG[test] HOSTLIST=node001,node002,node003
SRCFG[test] RESOURCES=PROCS:2;MEM:512
SRCFG[test] TASKCOUNT=2
Moab reserves a total of two tasks with 2 processors and 512 MB each, using resources
located on node001, node002, and/or node003.
SRCFG[test] HOSTLIST=node01,node1[3-5]
The reservation will consume all nodes that have "node01" somewhere in their names and
all nodes that have both "node1" and either a "3," "4," or "5" in their names.
SRCFG[test] HOSTLIST=r:node[1-6]
The reservation will consume all nodes with names that begin with "node" and end with
any number 1 through 6. In other words, it will reserve node1, node2, node3, node4, node5,
and node6.
JOBATTRLIST
Format
Comma-delimited list of one or more of the following job attributes: PREEMPTEE, INTERACTIVE, or any generic attribute configured through NODECFG.
Default
---
Description
Specifies job attributes that grant a job access to the reservation.
Values can be specified with a "!=" assignment to allow only jobs NOT requesting a certain feature inside the reservation.
To enable/disable reservation access based on requested node features, use the
parameter NODETOJOBATTRMAP.
Example
SRCFG[test] JOBATTRLIST=PREEMPTEE
Preemptible jobs can access the resources reserved within this reservation.
MAXJOB
Format
<INTEGER>
Default
---
Description
Specifies the maximum number of jobs that can run in the reservation.
Example
SRCFG[test] MAXJOB=1
Only one job will be allowed to run in this reservation.
MAXTIME
Format
[[[DD:]HH:]MM:]SS[+]
Default
---
Description
Specifies the maximum allowable job duration. Can be used with affinity to attract jobs with the same MAXTIME.
Example
SRCFG[test] MAXTIME=1:00:00+
Jobs with a time of 1:00:00 are attracted to this reservation.
NODEFEATURES
Format
Comma-delimited list of node features.
Default
---
Description
Specifies the required node features for nodes that are part of the standing reservation.
Example
SRCFG[test] NODEFEATURES=wide,fddi
All nodes allocated to the standing reservation must have both the wide and fddi
node attributes.
OWNER
Format
<CREDTYPE>:<CREDID>
Where <CREDTYPE> is one of USER, GROUP, ACCT, QOS, CLASS, or CLUSTER, and <CREDID> is a valid credential ID of that type.
Default
---
Description
Specifies the owner of the reservation. Setting ownership for a reservation grants the user
management privileges, including the power to release it.
Setting a USER as the OWNER of a reservation gives that user privileges to query and
release the reservation.
For sandbox reservations, sandboxes are applied to a specific peer only if OWNER is set to
CLUSTER:<PEERNAME>.
Example
SRCFG[test] OWNER=ACCT:jupiter
Account jupiter owns the reservation, and its members may be granted special privileges associated with that ownership.
PARTITION
Format
Valid partition name.
Default
[ALL]
Description
Specifies the partition in which to create the standing reservation.
Example
SRCFG[test] PARTITION=OLD
The standing reservation will only select resources from partition OLD.
PERIOD
Format
One of DAY, WEEK, or INFINITY.
Default
DAY
Description
Specifies the period of the standing reservation.
Example
SRCFG[test] PERIOD=WEEK
Each standing reservation covers a one week period.
PROCLIMIT
Format
<QUALIFIER><INTEGER>
<QUALIFIER> may be one of the following: <, <=, ==, >=, >
Default
---
Description
Specifies the processor limit for jobs requesting access to this standing reservation.
Example
SRCFG[test] PROCLIMIT<=4
Jobs requesting 4 or fewer processors are allowed to run.
PSLIMIT
Format
<QUALIFIER><INTEGER>
<QUALIFIER> may be one of the following: <, <=, ==, >=, >
Default
---
Description
Specifies the processor-second limit for jobs requesting access to this standing reservation.
Example
SRCFG[test] PSLIMIT<=40000
Jobs requesting 40000 or fewer processor-seconds are allowed to run.
QOSLIST
Format
Zero or more valid, comma-delimited QoS names.
Default
---
Description
Specifies that jobs with the listed QoS names can access the reserved resources.
Example
SRCFG[test] QOSLIST=hi,low,special
Moab allows jobs using the listed QoSes to access the reserved resources.
REQUIREDACCTLIST
Format
One or more comma-delimited accounts.
Default
---
Description
When present, any jobs in the reservation must match one of the listed accounts.
This attribute can also be used in conjunction with REQUIREDUSERLIST. If both REQUIREDACCTLIST and REQUIREDUSERLIST are specified, all jobs in the reservation must match both.
It is recommended that any entries in the REQUIREDACCTLIST be present in the
ACCOUNTLIST attribute to handle reservation affinities.
Example
SRCFG[test] REQUIREDUSERLIST=john,bob USERLIST=john,bob
SRCFG[test] REQUIREDACCTLIST=eng,chem ACCOUNTLIST=eng,chem
A job must belong to either user "john" or "bob" AND either account "eng" or "chem".
REQUIREDTPN
Format
<QUALIFIER><INTEGER>
<QUALIFIER> may be one of the following: <, <=, ==, >=, >
Default
---
Description
Restricts access to reservations based on the job's TPN (tasks per node).
Example
SRCFG[test] REQUIREDTPN==4
Jobs with tpn=4 or ppn=4 would be allowed within the reservation, but any other TPN
value would not. (For more information, see TPN (Exact Tasks Per Node).)
REQUIREDUSERLIST
Format
One or more comma-delimited users.
Default
---
Description
When present, any jobs in the reservation must match one of the listed users.
This attribute can also be used in conjunction with REQUIREDACCTLIST. If both REQUIREDACCTLIST and REQUIREDUSERLIST are specified, all jobs in the reservation must match both.
It is recommended that any entries in the REQUIREDUSERLIST be present in the USERLIST
attribute to handle reservation affinities.
Example
SRCFG[test] REQUIREDUSERLIST=john,bob USERLIST=john,bob
SRCFG[test] REQUIREDACCTLIST=eng,chem ACCOUNTLIST=eng,chem
A job must belong to either user "john" or "bob" AND either account "eng" or "chem".
RESOURCES
Format
Semicolon-delimited <ATTR>:<VALUE> pairs where <ATTR> may be one of PROCS, MEM, SWAP, DISK, or GRES.
Default
PROCS:-1 (All processors available on node)
Description
Specifies what resources constitute a single standing reservation task. (Each task must be able to obtain all of its resources as an atomic unit on a single node.) Supported resources currently include PROCS (number of processors), MEM (real memory in MB), SWAP (virtual memory in MB), DISK (local disk in MB), and GRES (a generic resource, specified in the format GRES:<GRESNAME>[:<COUNT>]).
Example
SRCFG[test] RESOURCES=PROCS:1;MEM:512;GRES=matlab:3;GRES=fluent:12
Each standing reservation task reserves one processor, 512 MB of real memory, 3 matlab
generic resources and 12 fluent generic resources.
ROLLBACKOFFSET
Format
[[[DD:]HH:]MM:]SS
Default
---
Description
Specifies the minimum time in the future at which the reservation may start. This is a rollback offset, meaning the start time of the reservation continuously rolls into the future to maintain this offset. Rollback offsets are a good way of providing guaranteed resource access to users under the condition that they must commit their resources in the future or lose dedicated access. See QoS for more information about quality of service and service level agreements; also see Rollback Reservation Overview.
Neither credlock nor advres is compatible with jobs submitted to this reservation.
Example
SRCFG[ajax] ROLLBACKOFFSET=24:00:00 TASKCOUNT=32
SRCFG[ajax] PERIOD=INFINITY ACCOUNTLIST=ajax
The standing reservation guarantees access to up to 32 processors within 24 hours to jobs
from the ajax account.
Adding an asterisk to the ROLLBACKOFFSET value pins rollback reservation start times when an
idle reservation is created in the rollback reservation. For example:
SRCFG[staff] ROLLBACKOFFSET=18:00:00* PERIOD=INFINITY
RSVACCESSLIST
Format
<RESERVATION>[,...]
Default
---
Description
A list of reservations to which the specified reservation has access.
Example
SRCFG[test] RSVACCESSLIST=rsv1,rsv2,rsv3
RSVGROUP
Format
<STRING>
Default
---
Description
See section Reservation Group for a detailed description.
Example
SRCFG[test] RSVGROUP=rsvgrp1
SRCFG[ajax] RSVGROUP=rsvgrp1
STARTTIME
Format
[[[DD:]HH:]MM:]SS
Default
00:00:00:00 (midnight)
Description
Specifies the time of day/week the standing reservation becomes active. Whether this indicates a
time of day or time of week depends on the setting of the PERIOD attribute.
If specified within a reservation profile, a value of 0 indicates the reservation should start
at the earliest opportunity.
Example
SRCFG[test] STARTTIME=08:00:00
SRCFG[test] ENDTIME=17:00:00
SRCFG[test] PERIOD=DAY
The standing reservation will be active from 8:00 a.m. until 5:00 p.m. each day.
TASKCOUNT
Format
<INTEGER>
Default
0 (unlimited tasks)
Description
Specifies how many tasks should be reserved for the reservation.
Example
SRCFG[test] RESOURCES=PROCS:1;MEM:256
SRCFG[test] TASKCOUNT=16
Standing reservation test reserves 16 tasks worth of resources; in this case, 16 processors
and 4 GB of real memory.
TIMELIMIT
Format
[[[DD:]HH:]MM:]SS
Default
-1 (no time based access)
Description
Specifies the maximum allowed overlap between the standing reservation and a job requesting
resource access.
Example
SRCFG[test] TIMELIMIT=1:00:00
Moab allows jobs to access up to one hour of resources in the standing reservation.
TPN (Exact Tasks Per Node)
Format
<INTEGER>
Default
0 (no TPN constraint)
Description
Specifies the exact number of tasks per node that must be available on eligible nodes.
Example
SRCFG[2] TPN=4
SRCFG[2] RESOURCES=PROCS:2;MEM:256
Moab must locate four tasks on each node that is to be part of the reservation. That is,
each node included in standing reservation 2 must have 8 processors and 1 GB of memory
available.
TRIGGER
Format
See Creating a Trigger for syntax.
Default
N/A
Description
Specifies event triggers to be launched by the scheduler under the scheduler's ID. These triggers
can be used to conditionally cancel reservations, modify resources, or launch various actions at specified event offsets. See About Object Triggers for more detail.
Example
SRCFG[fast]
TRIGGER=EType=start,Offset=5:00:00,AType=exec,Action="/usr/local/domail.pl"
Moab launches the domail.pl script 5 hours after any fast reservation starts.
USERLIST
Format
Comma-delimited list of users.
Default
---
Description
Specifies which users have access to the resources reserved by this reservation (see ACL
Modifiers).
Example
SRCFG[test] USERLIST=bob,joe,mary
Users bob, joe and mary can all access the resources reserved within this reservation.
Standing Reservation Overview
A standing reservation is similar to a normal administrative reservation in that
it also places an access control list on a specified set of resources. Resources
are specified on a per-task basis and currently include processors, local disk,
real memory, and swap. The access control list supported for standing
reservations includes users, groups, accounts, job classes, and QoS levels.
Standing reservations can be configured to be permanent or periodic on a daily
or weekly basis and can accept a daily or weekly start and end time. Regardless
of whether permanent or recurring on a daily or weekly basis, standing
reservations are enforced using a series of reservations, extending a number
of periods into the future as controlled by the DEPTH attribute of the SRCFG
parameter.
The following examples demonstrate possible configurations specified with the
SRCFG parameter.
Example 6-14: Basic Business Hour Standing Reservation
SRCFG[interactive] TASKCOUNT=6 RESOURCES=PROCS:1,MEM:512
SRCFG[interactive] PERIOD=DAY DAYS=MON,TUE,WED,THU,FRI
SRCFG[interactive] STARTTIME=9:00:00 ENDTIME=17:00:00
SRCFG[interactive] CLASSLIST=interactive
When using the SRCFG parameter, attribute lists must be delimited using
the comma (,), pipe (|), or colon (:) characters; they cannot be space
delimited. For example, to specify a multi-class ACL, specify:
SRCFG[test] CLASSLIST=classA,classB
Only one STARTTIME and one ENDTIME value can be specified per reservation. If varied start and end times are desired throughout the week, complementary standing reservations should be created. For example, to establish a reservation from 8:00 p.m. until 6:00 a.m. the next day during business days, two reservations should be created: one from 8:00 p.m. until midnight, and the other from midnight until 6:00 a.m. Jobs can run across reservation boundaries, allowing these two reservations to function as a single reservation that spans the night. The following example demonstrates how to span a reservation across 2 days on the same nodes:
SRCFG[Sun] PERIOD=WEEK
SRCFG[Sun] STARTTIME=00:20:00:00 ENDTIME=01:00:00:00
SRCFG[Sun] HOSTLIST=node01,node02,node03
SRCFG[Mon] PERIOD=WEEK
SRCFG[Mon] STARTTIME=01:00:00:00 ENDTIME=01:06:00:00
SRCFG[Mon] HOSTLIST=node01,node02,node03
The preceding example fully specifies a reservation including the quantity of
resources requested using the TASKCOUNT and RESOURCES attributes. In all
cases, resources are allocated to a reservation in units called tasks where a
task is a collection of resources that must be allocated together on a single
node. The TASKCOUNT attribute specifies the number of these tasks that should
be reserved by the reservation. In conjunction with this attribute, the
RESOURCES attribute defines the reservation task by indicating what resources
must be included in each task. In this case, the scheduler must locate and
reserve 1 processor and 512 MB of memory together on the same node for
each task requested.
As mentioned previously, a standing reservation reserves resources over a
given time frame. The PERIOD attribute may be set to a value of DAY, WEEK, or
INFINITY to indicate the period over which this reservation should recur. If not
specified, a standing reservation recurs on a daily basis. If a standing
reservation is configured to recur daily, the attribute DAYS may be specified to
indicate which days of the week the reservation should exist. This attribute
takes a comma-delimited list of days where each day is specified as the first
three letters of the day in all capital letters: MON or FRI. The preceding
example specifies that this reservation is periodic on a daily basis and should
only exist on business days.
The time of day during which the requested tasks are to be reserved is specified using the STARTTIME and ENDTIME attributes. These attributes are specified in standard military time HH:MM:SS format; both the STARTTIME and ENDTIME specifications are optional, defaulting to midnight at the beginning and end of the day respectively. In the preceding example, resources are reserved from 9:00 a.m. until 5:00 p.m. on business days.
The final aspect of any reservation is the access control list indicating who or
what can use the reserved resources. In the preceding example, the CLASSLIST
attribute is used to indicate that jobs requesting the class "interactive" should
be allowed to use this reservation.
Specifying Reservation Resources
In most cases, only a small subset of standing reservation attributes must be
specified in any given case. For example, by default, RESOURCES is set to
PROCS:-1, which indicates that each task should reserve all of the processors on
the node on which it is located. This, in essence, creates a one task equals one
node mapping. In many cases, particularly on uniprocessor systems, this
default behavior may be easiest to work with. However, in SMP environments,
the RESOURCES attribute provides a powerful means of specifying an exact,
multi-dimensional resource set.
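As a minimal sketch (the reservation names are hypothetical), the two approaches might look like this:
# default: RESOURCES=PROCS:-1 is implied, so each task claims an entire node
SRCFG[wholenode] TASKCOUNT=4
# SMP slice: each task is an explicit bundle of processors and memory
SRCFG[smpslice] RESOURCES=PROCS:4;MEM:2048
SRCFG[smpslice] TASKCOUNT=8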
An examination of the parameters documentation shows that the default value of PERIOD is DAY. Thus, specifying this parameter in the preceding example was unnecessary. It was used only to introduce this parameter and indicate that other options exist beyond daily standing reservations.
Example 6-15: Host Constrained Standing Reservation
Although the first example did specify a quantity of resources to reserve, it did
not specify where the needed tasks were to be located. If this information is not
specified, Moab attempts to locate the needed resources anywhere it can find
them. The Example 6-14 reservation essentially discovers hosts where the needed
resources can be found. If the SPACEFLEX reservation flag is set, then the
reservation continues to float to the best hosts over the life of the reservation.
Otherwise, it will be locked to the initial set of allocated hosts.
If a site wanted to constrain a reservation to a subset of available resources,
this could be accomplished using the HOSTLIST attribute. The HOSTLIST attribute is
specified as a comma-separated list of hostnames and constrains the scheduler
to only select tasks from the specified list. This attribute can exactly specify
hosts or specify them using host regular expressions. The following example
demonstrates a possible use of the HOSTLIST attribute:
SRCFG[interactive] DAYS=MON,TUE,WED,THU,FRI
SRCFG[interactive] PERIOD=DAY
SRCFG[interactive] STARTTIME=10:00:00 ENDTIME=15:00:00
SRCFG[interactive] RESOURCES=PROCS:2,MEM:256
SRCFG[interactive] HOSTLIST=node001,node002,node005,node020
SRCFG[interactive] TASKCOUNT=6
SRCFG[interactive] CLASSLIST=interactive
Note that the HOSTLIST attribute specifies a non-contiguous list of hosts. Any combination of hosts may be
specified and hosts may be specified in any order. In this example, the TASKCOUNT attribute is also
specified. These two attributes both apply constraints on the scheduler with HOSTLIST specifying where
the tasks can be located and TASKCOUNT indicating how many total tasks may be allocated. In this
example, six tasks are requested but only four hosts are specified. To handle this, if adequate resources
are available, the scheduler may attempt to allocate more than one task per host. For example, assume
that each host is a quad-processor system with 1 GB of memory. In such a case, the scheduler could
allocate up to two tasks per host and even satisfy the TASKCOUNT constraint without using all of the hosts
in the hostlist.
It is important to note that even if there is a one to one mapping between
the value of TASKCOUNT and the number of hosts in HOSTLIST, the scheduler
will not necessarily place one task on each host. If, for example, node001
and node002 were 8 processor SMP hosts with 1 GB of memory, the
scheduler could locate up to four tasks on each of these hosts fully
satisfying the reservation taskcount without even partially using the
remaining hosts. (Moab will place tasks on hosts according to the policy
specified with the NODEALLOCATIONPOLICY parameter.) If the hostlist
provides more resources than what is required by the reservation as
specified via TASKCOUNT, the scheduler will simply select the needed
resources within the set of hosts listed.
Enforcing Policies Via Multiple Reservations
Single reservations enable multiple capabilities. Combinations of reservations
can further extend a site's capabilities to impose specific policies.
Example 6-16: Reservation Stacking
If HOSTLIST is specified but TASKCOUNT is not, the scheduler will pack as many
tasks as possible onto all of the listed hosts. For example, assume the site
added a second standing reservation named debug to its configuration that
reserved resources for use by certain members of its staff using the following
configuration:
SRCFG[interactive] DAYS=MON,TUE,WED,THU,FRI
SRCFG[interactive] PERIOD=DAY
SRCFG[interactive] STARTTIME=10:00:00 ENDTIME=15:00:00
SRCFG[interactive] RESOURCES=PROCS:2,MEM:256
SRCFG[interactive] HOSTLIST=node001,node002,node005,node020
SRCFG[interactive] TASKCOUNT=6
SRCFG[interactive] CLASSLIST=interactive
SRCFG[debug] HOSTLIST=node001,node002,node003,node004
SRCFG[debug] USERLIST=helpdesk
SRCFG[debug] GROUPLIST=operations,sysadmin
SRCFG[debug] PERIOD=INFINITY
The new standing reservation is quite simple. Since RESOURCES is not specified,
it will allocate all processors on each host that is allocated. Since TASKCOUNT is
not specified, it will allocate every host listed in HOSTLIST. Since PERIOD is set to
INFINITY, the reservation is always in force and there is no need to specify
STARTTIME, ENDTIME, or DAYS.
The standing reservation has two access parameters set using the attributes
USERLIST and GROUPLIST. This configuration indicates that the reservation can
be accessed if any one of the access lists specified is satisfied by the job. In
essence, reservation access is logically OR'd allowing access if the requester
meets any of the access constraints specified. In this example, jobs submitted
by either user helpdesk or any member of the groups operations or sysadmin can
use the reserved resources (See ACL Modifiers).
Unless ACL Modifiers are specified, access is granted to the logical OR of access lists specified within a standing reservation and granted to the logical AND of access lists across different standing reservations. A comparison of the standing reservations interactive and debug in the preceding example indicates that they both can allocate hosts node001 and node002. Suppose node001 had both of these reservations in place simultaneously and a job attempted to access this host during business hours when standing reservation interactive was active. The job could only use the doubly reserved resources if it requests the run class interactive and it meets the constraints of reservation debug; that is, it is submitted by user helpdesk or by a member of the group operations or sysadmin.
As a rule, the scheduler does not stack reservations unless it must. If adequate
resources exist, it can allocate reserved resources side by side in a single SMP
host rather than on top of each other. In the case of a 16 processor SMP host
with two 8 processor standing reservations, 8 of the processors on this host will
be allocated to the first reservation, and 8 to the next. Any configuration is
possible. The 16 processor hosts can also have 4 processors reserved for user
"John," 10 processors reserved for group "Staff," with the remaining 2
processors available for use by any job.
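A minimal sketch of that last split (the host and credential names are hypothetical):
# 4 of the 16 processors for user john, 10 for group staff; 2 remain free for any job
SRCFG[john] HOSTLIST=bigsmp01 RESOURCES=PROCS:4 TASKCOUNT=1 USERLIST=john
SRCFG[staff] HOSTLIST=bigsmp01 RESOURCES=PROCS:10 TASKCOUNT=1 GROUPLIST=staff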
Stacking reservations is not usually required but some site administrators
choose to do it to enforce elaborate policies. There is no problem with doing so
as long as you can keep things straight. It really is not too difficult a concept; it
just takes a little getting used to. See the Reservation Overview section for a
more detailed description of reservation use and constraints.
As mentioned earlier, by default the scheduler enforces standing reservations
by creating a number of reservations where the number created is controlled
by the DEPTH attribute. Each night at midnight, the scheduler updates its
periodic non-floating standing reservations. By default, DEPTH is set to 2,
meaning when the scheduler starts up, it will create two 24-hour reservations
covering a total of two days' worth of time: a reservation for today and one for
tomorrow. For daily reservations, at midnight, the reservations roll, meaning
today's reservation expires and is removed, tomorrow's reservation becomes
today's, and the scheduler creates a new reservation for the next day.
With this model, the scheduler continues creating new reservations in the
future as time moves forward. Each day, the needed resources are always
reserved. At first, all appears automatic but the standing reservation DEPTH
attribute is in fact an important aspect of reservation rollback, which helps
address certain site specific environmental factors. This attribute remedies a
situation that might occur when a job is submitted and cannot run immediately
because the system is backlogged with jobs. In such a case, available
resources may not exist for several days out and the scheduler must reserve
these future resources for this job. With the default DEPTH setting of two, when
midnight arrives, the scheduler attempts to roll its standing reservations but a
problem arises in that the job has now allocated the resources needed for the
standing reservation two days out. Moab cannot reserve the resources for the
standing reservation because they are already claimed by the job. The
standing reservation reserves what it can but because all needed resources are
not available, the resulting reservation is now smaller than it should be, or is
possibly even empty.
If a standing reservation is smaller than it should be, the scheduler will attempt
to add resources each iteration until it is fully populated. However, in the case
of this job, the job is not going to release its reserved resources until it
completes and the standing reservation cannot claim them until this time. The
DEPTH attribute allows a site to specify how deep into the future a standing
reservation should reserve its resources allowing it to claim the resources first
and prevent this problem. If a partial standing reservation is detected on a
system, it may be an indication that the reservation's DEPTH attribute should be
increased.
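A minimal sketch of this remedy (the reservation name is hypothetical) simply reserves more periods ahead so a backlogged job cannot claim the resources first:
SRCFG[daily] PERIOD=DAY DEPTH=4 TASKCOUNT=8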
In Example 6-16, the debug reservation's PERIOD attribute is set to INFINITY. With this setting, a single,
permanent standing reservation is created and the issues of resource
contention do not exist. While this eliminates the contention issue, infinite
length standing reservations cannot be made periodic.
Example 6-17: Multiple ACL Types
In most cases, access lists within a reservation are logically OR'd together to
determine reservation access. However, exceptions to this rule can be
specified by using the required ACL marker-the asterisk (*). Any ACL marked
with this symbol is required and a job is only allowed to use a reservation if it
meets all required ACLs and at least one non-required ACL (if specified). A
common use for this facility is in conjunction with the TIMELIMIT attribute. This
attribute controls the length of time a job may use the resources within a
standing reservation. This access mechanism can be AND'd or OR'd to the
cumulative set of all other access lists as specified by the required ACL marker.
Consider the following example configuration:
SRCFG[special] TASKCOUNT=32
SRCFG[special] PERIOD=WEEK
SRCFG[special] STARTTIME=1:08:00:00
SRCFG[special] ENDTIME=5:17:00:00
SRCFG[special] NODEFEATURES=largememory
SRCFG[special] TIMELIMIT=1:00:00*
SRCFG[special] QOSLIST=high,low,special-
SRCFG[special] ACCOUNTLIST=!projectX,!projectY
The above configuration requests 32 tasks which translate to 32 nodes. The
PERIOD attribute makes this reservation periodic on a weekly basis while the
attributes STARTTIME and ENDTIME specify the week offsets when this
reservation is to start and end (Note that the specification format has changed
to DD:HH:MM:SS.). In this case, the reservation starts on Monday at 8:00 a.m.
and runs until Friday at 5:00 p.m. The reservation is enforced as a series of
weekly reservations that only cover the specified time frame. The
NODEFEATURES attribute indicates that each of the reserved nodes must have
the node feature "largememory" configured.
As described earlier, TIMELIMIT indicates that jobs using this reservation can
only use it for one hour. This means the job and the reservation can only
overlap for one hour. Clearly jobs requiring an hour or less of wallclock time
meet this constraint. However, a four-hour job that starts on Monday at 5:00
a.m. or a 12-hour job that starts on Friday at 4:00 p.m. also satisfies this
constraint. Also, note the TIMELIMIT required ACL marker, *; it is set indicating
that jobs must not only meet the TIMELIMIT access constraint but must also
meet one or more of the other access constraints. In this example, the job can
use this reservation if it can use the access specified via QOSLIST or ACCOUNTLIST; that is, it is assigned a QoS of high, low, or special, or the submitter of the job has an account that satisfies the !projectX and !projectY criteria. See the QoS Overview for more info about QoS configuration and usage.
Affinity
Reservation ACLs allow or deny access to reserved resources but they may be
configured to also impact a job's affinity for a particular reservation. By default,
jobs gravitate toward reservations through a mechanism known as positive
affinity. This mechanism allows jobs to run on the most constrained resources
leaving other, unreserved resources free for use by other jobs that may not be
able to access the reserved resources. Normally this is a desired behavior.
However, sometimes it is desirable to reserve resources for use only as a last resort, using the reserved resources only when there are no other resources available. This last resort behavior is known as negative affinity. Note the '-'
(hyphen or negative sign) following the special in the QOSLIST values. This
special mark indicates that QoS special should be granted access to this
reservation but should be assigned negative affinity. Thus, the QOSLIST attribute
specifies that QoS high and low should be granted access with positive affinity
(use the reservation first where possible) and QoS special granted access with
negative affinity (use the reservation only when no other resources are
available).
Affinity status is granted on a per access object basis rather than a per access
list basis and always defaults to positive affinity. In addition to negative affinity,
neutral affinity can also be specified using the equal sign (=) as in QOSLIST[0]
normal= high debug= low-.
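As a minimal sketch in SRCFG form (the QoS names are hypothetical), all three affinity marks can appear in one comma-delimited list:
# '+' positive (the default), '=' neutral, '-' negative affinity
SRCFG[afftest] QOSLIST=high+,normal=,special-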
When a job matches multiple ACLs for a reservation, the final node affinity for
the node, job, and reservation combination is based on the last matching
ACL entry found in the configuration file.
For example, given the following reservation ACLs, a job matching both will
receive a negative affinity:
SRCFG[res1] USERLIST=joe+ MAXTIME<=4:00:00-
With the following reservation ACLs, a job matching both will receive a positive
affinity:
SRCFG[res1] MAXTIME<=4:00:00- USERLIST=joe+
ACL Modifiers
ACL modifiers allow a site to change the default behavior of ACL processing. By
default, a reservation can be accessed if one or more of its ACLs can be met by
the requestor. This behavior can be changed using the following modifiers:
Not
Symbol:
! (exclamation point)
Description
If attribute is met, the requestor is denied access regardless of any other satisfied ACLs.
Example
SRCFG[test] GROUPLIST=staff USERLIST=!steve
Allow access to all staff members other than steve.
Required
Symbol:
* (asterisk)
Description
All required ACLs must be satisfied for requestor access to be granted.
Example
SRCFG[test] QOSLIST=*high MAXTIME=*2:00:00
Only jobs in QoS high that request less than 2 hours of walltime are
granted access.
XOR
Symbol:
^ (caret)
Description
All attributes of the type specified other than the ones listed in the ACL satisfy the ACL.
Example
SRCFG[test] QOSLIST=^high
All jobs other than those requesting QoS high are granted access.
CredLock
Symbol:
& (ampersand)
Description
Matching jobs will be required to run on the resources reserved by this reservation. You can use this modifier on accounts, classes, groups, qualities of service, and users.
Example
SRCFG[test] USERLIST=&john
All of user john's jobs must run in this reservation.
HPEnable (hard policy enable)
Symbol:
~ (tilde)
Description
ACLs marked with this modifier are ignored during soft policy scheduling and are only considered
for hard policy scheduling once all eligible soft policy jobs start.
Example
SRCFG[johnspace] USERLIST=john CLASSLIST=~debug
All of user john's jobs are allowed to run in the reservation at any time. Debug jobs are
also allowed to run in this reservation but are only considered after all of John's jobs are
given an opportunity to start. User john's jobs are considered before debug jobs
regardless of job priority.
If the HPEnable and Not markers are used in conjunction, the specified credentials are blocked out of the reservation during soft-policy scheduling.
Note the ACCOUNTLIST values in Example 6-17 are preceded with an
exclamation point, or NOT symbol. This indicates that all jobs with accounts
other than projectX and projectY meet the account ACL. Note that if a !<X> value
(!projectX) appears in an ACL line, that ACL is satisfied by any object not
explicitly listed by a NOT entry. Also, if an object matches a NOT entry, the
associated job is excluded from the reservation even if it meets other ACL
requirements. For example, a QoS 3 job requesting account projectX is denied
access to the reservation even though the job QoS matches the QoS ACL.
Example 6-18: Binding Users to Reservations at Reservation Creation
# create a 4 node reservation for john and bind all of john's jobs to that reservation
> mrsvctl -c -a user=&john -t 4
Reservation Ownership
Reservation ownership allows a site to control who owns the reserved
resources during the reservation time frame. Depending on needs, this
ownership may be identical to, a subset of, or completely distinct from the
reservation ACL. By default, reservation ownership implies resource
accountability and resources not consumed by jobs are accounted against the
reservation owner. In addition, ownership can also be associated with special
privileges within the reservation.
Ownership is specified using the OWNER attribute in the format
<CREDTYPE>:<CREDID>, as in OWNER=USER:john. To enable john's jobs to
preempt other jobs using resources within the reservation, the SRCFG attribute
FLAGS should be set to OWNERPREEMPT. In the example below, the jupiter
project chooses to share resources with the saturn project but only when it does
not currently need them.
Example 6-19: Limited Shared Access
ACCOUNTCFG[jupiter] PRIORITY=10000
SRCFG[jupiter] HOSTLIST=node0[1-9]
SRCFG[jupiter] PERIOD=INFINITY
SRCFG[jupiter] ACCOUNTLIST=jupiter,saturn
SRCFG[jupiter] OWNER=ACCT:jupiter
SRCFG[jupiter] FLAGS=OWNERPREEMPT
Partitions
A reservation can be used in conjunction with a partition. Configuring a
standing reservation on a partition allows constraints to be (indirectly) applied
to a partition.
Example 6-20: Time Constraints by Partition
The following example places a 3-day wall-clock limit on two partitions and a 64
processor-hour limit on jobs running on partition small.
SRCFG[smallrsv] PARTITION=small MAXTIME=3:00:00:00 PSLIMIT<=230400 HOSTLIST=ALL
SRCFG[bigrsv] PARTITION=big MAXTIME=3:00:00:00 HOSTLIST=ALL
Resource Allocation Behavior
As mentioned, standing reservations can operate in one of two modes,
floating, or non-floating (essentially node-locked). A floating reservation is
created when the flag SPACEFLEX is specified. If a reservation is non-floating,
the scheduler allocates all resources specified by the HOSTLIST parameter
regardless of node state, job load, or even the presence of other standing
reservations. Moab interprets the request for a non-floating reservation as, "I
want a reservation on these exact nodes, no matter what!"
If a reservation is configured to be floating, the scheduler takes a more relaxed
stand, searching through all possible nodes to find resources meeting standing
reservation constraints. Only Idle, Running, or Busy nodes are considered and
further, only considered if no reservation conflict is detected. The reservation
attribute ACCESS modifies this behavior slightly and allows the reservation to
allocate resources even if reservation conflicts exist.
If a TASKCOUNT is specified with or without a HOSTEXPRESSION, Moab will, by
default, only consider "up" nodes for allocation. To change this behavior,
the reservation flag IGNSTATE can be specified as in the following
example:
SRCFG[nettest] GROUPLIST=sysadm
SRCFG[nettest] FLAGS=IGNSTATE
SRCFG[nettest] HOSTLIST=node1[3-8]
SRCFG[nettest] STARTTIME=9:00:00
SRCFG[nettest] ENDTIME=17:00:00
Access to existing reservations can be controlled using the reservation flag
IGNRSV.
Other standing reservation attributes not covered here include PARTITION and
CHARGEACCOUNT. These parameters are described in some detail in the
parameters documentation.
Example 6-21: Using Reservations to Guarantee Turnover
In some cases, it is desirable to make certain a portion of a cluster's resources
are available within a specific time frame. The following example creates a
floating reservation belonging to the jupiter account that guarantees 16 tasks
for use by jobs requesting up to one hour.
SRCFG[shortpool] OWNER=ACCT:jupiter
SRCFG[shortpool] FLAGS=SPACEFLEX
SRCFG[shortpool] MAXTIME=1:00:00
SRCFG[shortpool] TASKCOUNT=16
SRCFG[shortpool] STARTTIME=9:00:00
SRCFG[shortpool] ENDTIME=17:00:00
SRCFG[shortpool] DAYS=Mon,Tue,Wed,Thu,Fri
This reservation enables a capability similar to what was known in early Maui
releases as "shortpool." The reservation covers every weekday from 9:00 a.m.
to 5:00 p.m., reserving 16 tasks and allowing jobs to overlap the reservation
for up to one hour. The SPACEFLEX flag indicates that the reservation may be dynamically modified over time to relocate to more optimal resources. In the case of a reservation with the MAXTIME ACL, this would include migrating to resources that are in use but that free up within the MAXTIME time frame.
Additionally, because the MAXTIME ACL defaults to positive affinity, any jobs
that fit the ACL attempt to use available reserved resources first before looking
elsewhere.
Rollback Reservations
Rollback reservations are enabled using the ROLLBACKOFFSET attribute and
can be used to allow users guaranteed access to resources, but the guaranteed
access is limited to a time-window in the future. This functionality forces users
to commit their resources in the future or lose access.
Image 6-2: Rollback reservation over 3 iterations
Example 6-22: Rollback Reservations
SRCFG[ajax] ROLLBACKOFFSET=24:00:00 TASKCOUNT=32
SRCFG[ajax] PERIOD=INFINITY ACCOUNTLIST=ajax
Adding an asterisk to the ROLLBACKOFFSET value pins rollback reservation start
times when an idle reservation is created in the rollback reservation. For
example: SRCFG[staff] ROLLBACKOFFSET=18:00:00* PERIOD=INFINITY.
Modifying Resources with Standing Reservations
Moab can customize compute resources associated with a reservation during
the life of the reservation. This can be done generally using the TRIGGER
attribute, or it can be done for operating systems using the shortcut attribute
OS. If set, Moab dynamically reprovisions allocated reservation nodes to the
requested operating system as shown in the following example:
SRCFG[provision] PERIOD=DAY DAYS=MON,WED,FRI STARTTIME=7:00:00 ENDTIME=10:00:00
SRCFG[provision] OS=rhel4 # provision nodes to use redhat during reservation, restore when done
Managing Administrative Reservations
A default reservation with no ACL is termed an administrative reservation, but
is occasionally referred to as a system reservation. It blocks access to all jobs
because it possesses an empty access control list. It is often useful when
performing administrative tasks but cannot be used for enforcing resource
usage policies.
Administrative reservations are created and managed using the mrsvctl
command. With this command, all aspects of reservation time frame, resource
selection, and access control can be dynamically modified. The mdiag -r
command can be used to view configuration, state, allocated resource
information as well as identify any potential problems with the reservation. The
following table briefly summarizes commands used for common actions. More
detailed information is available in the command summaries.
Action                            Command
create reservation                mrsvctl -c <RSV_DESCRIPTION>
list reservations                 mrsvctl -l
release reservation               mrsvctl -r <RSVID>
modify reservation                mrsvctl -m <ATTR>=<VAL> <RSVID>
query reservation configuration   mdiag -r <RSVID>
display reservation hostlist      mrsvctl -q resources <RSVID>
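For instance, a minimal sketch of a maintenance workflow (the host name and duration are illustrative, and the reservation ID shown is hypothetical):
# block node003 for two hours of maintenance, then release the reservation when done
> mrsvctl -c -h node003 -d 2:00:00
> mrsvctl -r system.1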
Related Topics
SRCFG (configure standing reservations)
RSVPROFILE (create reservation profiles)
Personal Reservations
- Enabling Personal Reservation Management
- Reservation Accountability and Defaults
  - Setting Reservation Default Attributes
- Reservation Limits
- Reservation and Job Binding
  - Constraining a Job to Only Run in a Particular Reservation
  - Constraining a Reservation to Only Accept Certain Jobs
By default, advance reservations are only available to scheduler
administrators. While administrators may create and manage reservations to
provide resource access to end-users, end-users cannot create, modify, or
destroy these reservations. Moab extends the ability to manage reservations to
end-users and provides control facilities to keep these features manageable.
Reservations created by end-users are called personal reservations or user
reservations.
Enabling Personal Reservation Management
User, or personal, reservations can be enabled on a per QoS basis by setting
the ENABLEUSERRSV flag as in the following example:
QOSCFG[titan]    QFLAGS=ENABLEUSERRSV  # allow 'titan' QOS jobs to create user reservations
USERCFG[DEFAULT] QDEF=titan            # allow all users to access 'titan' QOS
...
If set, end-users are allowed to create, modify, cancel, and query reservations
they own. As with jobs, users may associate a personal reservation with any
QoS or account to which they have access. This is accomplished by specifying
per reservation accountable credentials as in the following example:
> mrsvctl -c -S AQOS=titan -h node01 -d 1:00:00 -s 1:30:00
Note: reservation test.126 created
As in the preceding example, a non-administrator user who wants to create a
reservation must ALWAYS specify an accountable QoS with the mrsvctl -S flag.
This specified QoS must have the ENABLEUSERRSV flag. By default, a personal
reservation is created with an ACL of only the user who created it.
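For instance, once test.126 above exists, its owner could release it using the same command set an administrator would use (a minimal sketch):
> mrsvctl -r test.126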
Example 6-23: Allow All Users in the Sales Group to Create Personal Reservations
QOSCFG[rsv]     QFLAGS=ENABLEUSERRSV  # allow 'rsv' QOS jobs to create user reservations
GROUPCFG[sales] QDEF=rsv              # allow all users in group sales to access 'rsv' QOS
...
Example 6-24: Allow Specific Users to Create Personal Reservations
# special qos has higher job priority and ability to create user reservations
QOSCFG[special] QFLAGS=ENABLEUSERRSV
QOSCFG[special] PRIORITY=1000
# allow betty and steve to use the special qos
USERCFG[betty] QDEF=special
USERCFG[steve] QLIST=fast,special,basic QDEF=rsv
...
Reservation Accountability
Personal reservations must be configured with a set of accountable credentials.
These credentials (user, group, account, and so forth) indicate who is
responsible for the resources dedicated by the reservation. If resources are
dedicated by a reservation but not consumed by a job, these resources can be
charged against the specified accountable credentials. Administrators are
allowed to create reservations and specify any accountable credentials for that
reservation. While end-users can also be allowed to create and otherwise
modify personal reservations, they are only allowed to create reservations with
accountable credentials to which they have access. Further, while
administrators may manage any reservation, end-users may only control
reservations they own.
Like jobs, reservation accountable credentials specify which credentials are
charged for reservation usage and what policies are enforced as far as usage
limits and allocation management is concerned. (See the mrsvctl command
documentation for more information on setting personal reservation
credentials.) While similar to jobs, personal reservations do have a separate
set of usage limits and different allocation charging policies.
Setting Reservation Default Attributes
Organizations can use reservation profiles to set default attributes for personal
reservations. These attributes can include reservation aspects such as
management policies, charging credentials, ACLs, host constraints, and time
frame settings.
Reservation Limits
Allowing end-users the ability to create advance reservations can lead to
potentially unfair and unproductive resource usage. This results from the fact
that by default, there is nothing to prevent a user from reserving all resources
in a given system or reserving resources during time slots that would greatly
impede the scheduler's ability to schedule jobs efficiently. Because of this, it is
highly advised that sites initially place either usage or allocation based
constraints on the use of personal reservations. This can be achieved
using Moab Accounting Manager (see the Moab Accounting Manager
Administrator Guide).
Reservation and Job Binding
Moab allows job-to-reservation binding to be configured at an administrator or
end-user level. This binding constrains how job to reservation mapping is
allowed.
Constraining a job to only run in a particular reservation
Jobs may be bound to a particular reservation at submit time (using the RM
extension ADVRES) or dynamically using the mjobctl command (See Job to
Reservation Mapping.). In either case, once bound to a reservation, a job may
only run in that reservation even if other resources may be found outside of
that reservation. The mjobctl command may also be used to dynamically
release a job from reservation binding.
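For instance, a job can be bound at submission time via the ADVRES resource manager extension (a minimal sketch; job.cmd is a placeholder script name):
> msub -l advres=grid.3 job.cmd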
Example 6-25: Bind job to reservation
> mjobctl -m flags+=advres:grid.3 job1352
Example 6-26: Release job from reservation binding
> mjobctl -m flags-=advres job1352
Constraining a Reservation to Only Accept Certain Jobs
Binding a job to a reservation is independent of binding a reservation to a job.
For example, a reservation may be created for user "steve." User "steve" may
then submit a number of jobs including one that is bound to that reservation
using the ADVRES attribute. However, this binding simply forces that one job to
use the reservation; it does not prevent the reservation from accepting other
jobs submitted by user "steve." To prevent these other jobs from using the
reserved resources, reservation to job binding must occur. This binding is
accomplished by specifying either general job binding or specific job binding.
General job binding is the most flexible form of binding. Using the BYNAME
attribute, a reservation may be created that only accepts jobs specifically
bound to it.
Specific job binding is more constraining. This form of binding causes the
reservation to only accept specific jobs, regardless of other job attributes and is
set using the JOB reservation ACL.
Example 6-27: Configure a reservation to accept only jobs that are bound to it
> mrsvctl -m flags+=byname grid.3
Example 6-28: Remove general reservation to job binding
> mrsvctl -m flags-=byname grid.3
Example 6-29: Configure a reservation to accept a specific job
> mrsvctl -m -a JOB=3456 grid.3
Example 6-30: Remove a specific reservation to job binding
> mrsvctl -m -a JOB=3456 grid.3 --flags=unset
Partitions
- Partition Overview
- Defining Partitions
- Managing Partition Access
- Requesting Partitions
- Per-Partition Settings
- Miscellaneous Partition Issues
Partition Overview
Partitions are a logical construct that divide available resources. Any single
resource (compute node) may only belong to a single partition. Often, natural
hardware or resource manager bounds delimit partitions such as in the case of
disjoint networks and diverse processor configurations within a cluster. For
example, a cluster may consist of 256 nodes containing four 64 port switches.
This cluster may receive excellent interprocess communication speeds for
parallel job tasks located within the same switch but sub-stellar performance
for tasks that span switches. To handle this, the site may choose to create four
partitions, allowing jobs to run within any of the four partitions but not span
them.
While partitions do have value, it is important to note that within Moab, the
standing reservation facility provides significantly improved flexibility and
should be used in the vast majority of politically motivated cases where
partitions may be required under other resource management systems.
Standing reservations provide time flexibility, improved access control
features, and more extended resource specification options. Also, another
Moab facility called Node Sets allows intelligent aggregation of resources to
improve per job node allocation decisions. In cases where system partitioning
is considered for such reasons, node sets may be able to provide a better
solution.
Still, one key advantage of partitions over standing reservations and node sets
is the ability to specify partition-specific policies, limits, priorities, and scheduling algorithms, although this feature is rarely required. An example of
this need may be a cluster consisting of 48 nodes owned by the Astronomy
Department and 16 nodes owned by the Mathematics Department. Each
department may be willing to allow sharing of resources but wants to specify
how their partition will be used. As mentioned, many of Moab's scheduling
policies may be specified on a per partition basis allowing each department to
control the scheduling goals within their partition.
The partition associated with each node should be specified as indicated in the
Node Location section. With this done, partition access lists may be specified on
a per job or per QoS basis to constrain which resources a job may have access
to. (See the QoS Overview for more information.) By default, QoSs and jobs
allow global partition access. Note that by default, a job may only use
resources within a single partition.
If no partition is specified, Moab creates one partition per resource manager
into which all resources corresponding to that resource manager are placed.
(This partition is given the same name as the resource manager.)
A partition may not span multiple resource managers. In addition to these resource manager partitions, a pseudo-partition named "[ALL]" is created that contains the aggregate resources of all partitions.
While the resource manager partitions are real partitions containing resources not explicitly assigned to other partitions, the "[ALL]" partition is only a convenience object and is not a real partition; thus it cannot be requested by jobs or included in configuration ACLs.
Defining Partitions
Node to partition mappings can be established directly using the NODECFG
parameter or indirectly using the FEATUREPARTITIONHEADER parameter. If
using direct mapping, this is accomplished as shown in the example that
follows.
NODECFG[node001] PARTITION=astronomy
NODECFG[node002] PARTITION=astronomy
...
NODECFG[node049] PARTITION=math
...
By default, Moab creates two partitions, "DEFAULT" and "[ALL]." These
are used internally, and consume spots in the 31-partition maximum
defined in the MMAX_PAR parameter. If more partitions are needed, you
can adjust the maximum partition count. See Adjusting Default Limits for
information on increasing the maximum number of partitions.
Managing Partition Access
Partition access can be constrained by credential ACLs and by limits based on
job resource requirements.
Credential Based Access
Determining who can use which partition is specified using the *CFG parameters
(USERCFG, GROUPCFG, ACCOUNTCFG, QOSCFG, CLASSCFG, and SYSCFG).
These parameters allow you to select a partition access list on a credential or
system wide basis using the PLIST attribute. By default, the access associated
with any given job is the logical OR of all partition access lists assigned to the
job's credentials.
For example, assume a site with two partitions, general, and test. The site
management would like everybody to use the general partition by default.
However, one user, Steve, needs to perform the majority of his work on the
test partition. Two special groups, staff and management will also need access
to use the test partition from time to time but will perform most of their work in
the general partition. The following example configuration enables the needed
user and group access and defaults for this site:
SYSCFG[base]     PLIST=general:test
USERCFG[DEFAULT] PLIST=general
USERCFG[steve]   PLIST=general:test
GROUPCFG[staff]  PLIST=general:test
GROUPCFG[mgmt]   PLIST=general:test
While using a logical OR approach allows sites to add access to certain jobs,
some sites prefer to work the other way around. In these cases, access is
granted by default and certain credentials are then restricted from accessing
various partitions. To use this model, a system partition list must be specified
as in the following example:
SYSCFG[base]    PLIST=general,test&
USERCFG[demo]   PLIST=test&
GROUPCFG[staff] PLIST=general&
In the preceding example, note the ampersand (&). This character, which can
be located anywhere in the PLIST line, indicates that the specified partition list
should be logically AND'd with other partition access lists. In this case, the
configuration limits jobs from user demo to running in partition test and jobs
from group staff to running in partition general. All other jobs are allowed to run
in either partition.
When using AND-based partition access lists, the base system access list
must be specified with SYSCFG.
Per Job Resource Limits
Access to partitions can be constrained based on the resources requested on a
per job basis with limits on both minimum and maximum resources requested.
All limits are specified using PARCFG. See Usage Limits for more information on
the available limits.
PARCFG[amd]  MAX.PROC=16
PARCFG[pIII] MAX.WCLIMIT=12:00:00 MIN.PROC=4
PARCFG[aix]  MIN.NODE=12
Requesting Partitions
Users may request to use any partition they have access to on a per job basis.
This is accomplished using the resource manager extensions since most native
batch systems do not support the partition concept. For example, on a Torque
system, a job submitted by a member of the group staff could request that the
job run in the test partition by adding the line -l partition=test to the qsub
command line. See the resource manager extension overview for more
information on configuring and using resource manager extensions.
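For example, a minimal sketch of such a submission (the resource list and script name are illustrative):

> qsub -l nodes=1,partition=test job.cmd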
Per-Partition Settings
The following settings can be specified on a per-partition basis using the
PARCFG parameter:
Setting
Description
FSSCALINGFACTOR
Moab multiplies a job's actual fairshare usage by this value to get the job's calculated
fairshare usage. The actual fairshare usage is calculated based on the FSPOLICY
parameter (page 1012).
For example, if FSPOLICY is set to DEDICATEDPS and a job runs on two processors
for 100 seconds, the actual fairshare usage is 200. If the job ran on a partition with
FSSCALINGFACTOR=.5, Moab would multiply 200*.5=100. If the job ran on a partition
with FSSCALINGFACTOR=2, Moab would multiply 200*2=400.
PARCFG[par1] FSSCALINGFACTOR=<double>
FSSECONDARYGROUPS
Maps UNIX groups to fairshare groups.
GMETRIC
Specifies a generic metric to apply to the partition. It is configured like a Moab
parameter, with the gmetric name inside square brackets. Specify multiple
gmetrics by separating each configuration with a space. For example:
PARCFG[par1] GMETRIC[GM1]=20 GMETRIC[GM2]=10
Partition par1 has a GM1 metric of 20 and a GM2 metric of 10.
JOBNODEMATCHPOLICY
Specifies the JOBNODEMATCHPOLICY to be applied to jobs that run in the specified partition.
NODEACCESSPOLICY
Specifies the NODEACCESSPOLICY to be applied to jobs that run in the specified
partition.
NODEALLOCATIONPOLICY
Specifies the NODEALLOCATIONPOLICY to be applied to jobs that run in the specified partition.
RESOURCELIMITMULTIPLIER
Specifies the RESOURCELIMITMULTIPLIER[<PARID>] (page 1087) to be applied to jobs
that run in the specified partition.
This can only be viewed with "showconfig -v".
PARCFG[A] RESOURCELIMITMULTIPLIER=PROC:1.1 RESOURCELIMITMULTIPLIER=MEM:2.0
RESOURCELIMITPOLICY
Specifies the RESOURCELIMITPOLICY to be applied to jobs that run in the
specified partition.
This can only be viewed with "showconfig -v"
PARCFG[A] RESOURCELIMITPOLICY=WALLTIME:ALWAYS:CANCEL
PARCFG[B] RESOURCELIMITPOLICY=WALLTIME:ALWAYS:REQUEUE
USETTC
Specifies whether TTC specified at submission should be used and displayed by
the scheduler.
VMCREATEDURATION
Specifies the maximum amount of time VM creation can take before Moab considers it a failure (in [HH[:MM[:SS]]] format). If no value is set, there is no maximum limit.
VMDELETEDURATION
Specifies the maximum amount of time VM deletion can take before Moab considers it a failure (in [HH[:MM[:SS]]] format). If no value is set, there is no maximum limit.
VMMIGRATEDURATION
Specifies the maximum amount of time VM migration can take before Moab considers it a failure (in [HH[:MM[:SS]]] format). If no value is set, there is no maximum limit.
Miscellaneous Partition Issues
A brief caution: use of partitions has been quite limited in recent years as
other, more effective approaches have been adopted for site scheduling policies.
Consequently, some aspects of partitions have received only minor testing.
Still, partitions are fully supported and any problems found will be
rectified.
Related Topics
Standing Reservations
Node Sets
FEATUREPARTITIONHEADER parameter
PARCFG parameter
Quality of Service (QoS) Facilities
This section describes how to do the following:
• Allow key projects access to special services (such as preemption, resource dedication, and advance reservations).
• Provide access to special resources by requested QoS.
• Enable special treatment within priority and fairshare facilities by requested QoS.
• Provide exemptions to usage limits and other policies by requested QoS.
• Specify delivered service and response time targets.
• Enable job deadline guarantees.
• Control the list of QoSs available to each user and job.
• Enable special charging rates based on requested or delivered QoS levels.
• Enable limits on the extent of use for each defined QoS.
• Monitor current and historical usage for each defined QoS.
It contains the following sub-sections:
• QoS Overview
• QoS Enabled Privileges
  ◦ Special Prioritization
  ◦ Service Access and Constraints
  ◦ Usage Limits and Overrides
  ◦ Service Access Thresholds
  ◦ QoS Metrics
  ◦ Preemption Management
• Managing QoS Access
• Requesting QoS Services at Job Submission
• Restricting Access to Special Attributes
QoS Overview
Moab's QoS facility allows a site to give special treatment to various classes of
jobs, users, groups, and so forth. Each QoS object can be thought of as a
container of special privileges ranging from fairness policy exemptions, to
special job prioritization, to special functionality access. Each QoS object also
has an extensive access list of users, groups, and accounts that can access
these privileges.
Sites can configure various QoSs, each with its own set of priorities, policy
exemptions, and special resource access settings. They can then configure
user, group, account, and class access to these QoSs. A given job will have a
default QoS and may have access to several additional QoSs. When the job is
submitted, the submitter may request a specific QoS or just allow the default
QoS to be used. Once a job is submitted, a user may adjust the QoS of the job
at any time using the setqos command. The setqos command will only allow the
user to modify the QoS of that user's jobs and only change the QoS to a QoS
that the user has access to. Moab administrators may change the QoS of any
job to any value.
Jobs can be granted access to QoS privileges if the QoS is listed in the system
default configuration QDEF (QoS default) or QLIST (QoS access list), or if the
QoS is specified in the QDEF or QLIST of a user, group, account, or class
associated with that job. Alternatively, a user may access QoS privileges if that
user is listed in the QoS's MEMBERULIST attribute.
The mdiag -q command can be used to obtain information about the current
QoS configuration including specified credential access.
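For example, to display the QoS configuration and access lists:

> mdiag -q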
QoS Enabled Privileges
The privileges enabled via QoS settings may be broken into the following
categories:
• Special Prioritization
• Service Access and Constraints
• Usage Limits and Overrides
• Service Access Thresholds
• Preemption Management
All privileges are managed via the QOSCFG parameter.
Special Prioritization
Attribute name
Description
FSTARGET
Specifies QoS fairshare target.
FSWEIGHT
Sets QoS fairshare weight offset affecting a job's fairshare priority component.
PRIORITY
Assigns priority to all jobs requesting particular QoS.
QTTARGET
Sets QoS queuetime target affecting a job's target priority component and QoS delivered.
QTWEIGHT
Sets QoS queuetime weight offset affecting a job's service priority component.
XFTARGET
Sets QoS XFactor target affecting a job's target priority component and QoS delivered.
XFWEIGHT
Sets QoS XFactor weight offset affecting a job's service priority component.
Example 6-31:
# assign priority for all qos geo jobs
QOSCFG[geo] PRIORITY=10000
Service Access and Constraints
The QoS facility can be used to enable special services and to disable default
services. These services are enabled/disabled by setting the QoS QFLAGS
attribute.
Flag Name
Description
DEADLINE
Job may request an absolute or relative completion deadline and Moab will
reserve resources to meet that deadline. (An alternative priority based deadline
behavior is discussed in the PRIORITY FACTORS section.)
DEDICATED
Moab dedicates all resources of an allocated node to the job meaning that the job
will not share a node's compute resources with any other job.
ENABLEUSERRSV
Allow user or personal reservations to be created and managed.
IGNALL
Scheduler ignores all resource usage policies for jobs associated with this QoS.
JOBPRIOACCRUALPOLICY
Specifies how Moab should track the dynamic aspects of a job's priority. The two
valid values are ACCRUE and RESET.
• ACCRUE indicates that the job will accrue queuetime-based priority from
the time it is submitted unless it violates any of the policies not specified in
JOBPRIOEXCEPTIONS.
• RESET indicates that the job will accrue priority from the time it is submitted
unless it violates any of the JOBPRIOEXCEPTIONS. However, with RESET,
if the job does violate JOBPRIOEXCEPTIONS, its queuetime-based
priority will be reset to 0.
JOBPRIOACCRUALPOLICY is a global parameter, but can be configured to
work only in QOSCFG:
QOSCFG[arrays] JOBPRIOACCRUALPOLICY=ACCRUE
The following old JOBPRIOACCRUALPOLICY values have been deprecated and
should be adjusted to the following values:
• QUEUEPOLICY = ACCRUE and JOBPRIOEXCEPTIONS SOFTPOLICY,HARDPOLICY
• QUEUEPOLICYRESET = RESET and JOBPRIOEXCEPTIONS SOFTPOLICY,HARDPOLICY
• ALWAYS = ACCRUE and JOBPRIOEXCEPTIONS ALL
• FULLPOLICY = ACCRUE and JOBPRIOEXCEPTIONS NONE
• FULLPOLICYRESET = RESET and JOBPRIOEXCEPTIONS NONE
JOBPRIOEXCEPTIONS
Specifies exceptions for calculating a job's dynamic priority (QUEUETIME,
XFACTOR, TARGETQUEUETIME). Valid values are a comma delimited list of any of
the following: DEFER, DEPENDS, SOFTPOLICY, HARDPOLICY, IDLEPOLICY,
USERHOLD, BATCHHOLD, and SYSTEMHOLD (ALL or NONE can also be
specified on their own).
Normally, when a job violates a policy, is placed on hold, or has an unsatisfied
dependency, it will not accrue priority. Exceptions can be configured to allow a job
to accrue priority in spite of any of these violations. With DEPENDS a job will
increase in priority even if there exists an unsatisfied dependency. With
SOFTPOLICY, HARDPOLICY, or IDLEPOLICY a job can accrue priority despite
violating a specific limit. With DEFER, USERHOLD, BATCHHOLD, or
SYSTEMHOLD a job can accrue priority despite being on hold.
JOBPRIOEXCEPTIONS is a global parameter, but can be configured to work
only in QOSCFG:
QOSCFG[arrays] JOBPRIOEXCEPTIONS=IDLEPOLICY
NOBF
Job is not considered for backfill.
NORESERVATION
Job should never reserve resources regardless of priority.
NTR
Job is prioritized as next to run (NTR) and backfill is disabled to prevent other
jobs from jumping in front of ones with the NTR flag.
It is important to note that jobs marked with this flag should not be
blocked. If they are, Moab will stop scheduling, because no other jobs will
be run until the flagged NTR (next to run) job starts. Consider using the
PRIORITY attribute of the QOSCFG[<QOSID>] parameter instead, when
possible. Alternatively, because you may encounter a scheduling delay
before NTR-flagged jobs start, consider using the RESERVATIONDEPTH
and RESERVATIONQOSLIST parameters to provide better scheduling flow.
See Reservation Policies (especially the section on Assigning Per-QoS
Reservation Creation Rules) for more information.
PREEMPTCONFIG
User jobs may specify options to alter how preemption impacts the job such as
minpreempttime.
PREEMPTEE
Job may be preempted by higher priority PREEMPTOR jobs.
PREEMPTFSV
Job may be preempted by higher priority PREEMPTOR jobs if it exceeds its fairshare target when started.
PREEMPTOR
Job may preempt lower priority PREEMPTEE jobs.
PREEMPTSPV
Job may be preempted by higher priority PREEMPTOR jobs if it currently violates a
soft usage policy limit.
PROVISION
If the job cannot locate available resources with the needed OS or software, the
scheduler may provision a number of nodes to meet the needed OS or software
requirements.
RESERVEALWAYS
Job should create resource reservation regardless of job priority.
RUNNOW
Boosts a job's system priority and makes the job a preemptor.
RUNNOW overrides resource restrictions such as MAXJOB or MAXPROC.
TRIGGER
The job is able to directly specify triggers.
USERESERVED[:<RSVID>]
Job may only use resources within accessible reservations. If <RSVID> is specified,
job may only use resources within the specified reservation.
Example 6-32: For lowprio QoS job, disable backfill and make job preemptible
QOSCFG[lowprio] QFLAGS=NOBF,PREEMPTEE
Example 6-33: Bind all jobs to chemistry reservation
QOSCFG[chem-b] QFLAGS=USERESERVED:chemistry
Other QoS Attributes
In addition to the flags, there are attributes that alter service access.
Attribute name
Description
SYSPRIO
Sets the system priority on jobs associated with this QoS.
Example: All jobs submitted under a QoS sample receive a system priority of 1
QOSCFG[sample] SYSPRIO=1
Once a system priority has been added to a job, either manually or through
configuration, it can only be removed manually.
REQUESTGEOMETRY
Defines the size that is requested when Elastic Computing occurs. Potential values are
"PRIORITYJOBSIZE" or "<NODECOUNT>@<DURATION>". If PRIORITYJOBSIZE is set,
then the nodecount and duration for Elastic Computing is set in realtime to whatever is
the size of the highest priority idle job.
Example:
QOSCFG[sample] REQUESTGEOMETRY=12@4:00:00:00
Per QoS Required Reservations
If desired, jobs associated with a particular QoS can be locked into a
reservation or reservation group using the REQRID attribute. For example, to
force jobs using QoS jasper to only use the resources within the failsafe standing
reservation, use the following:
QOSCFG[jasper] REQRID=failsafe
...
Usage Limits and Overrides
All credentials, including QoS, allow specification of job usage limits as
described in the Basic Fairness Policies overview. In such cases, jobs are
constrained by the most limiting of all applicable policies. With QoSs, an
override limit may also be specified and with this limit, jobs are constrained by
the override, regardless of other limits specified. The following parameters can
override the throttling policies from other credentials:
OMAXJOB, OMAXNODE, OMAXPE, OMAXPROC, OMAXPS, OMAXJPROC, OMAXJPS,
OMAXJWC, and OMAXJNODE.
(See Usage Limits/Throttling Policies Override Limits.)
Example 6-34:
# staff QoS should have a limit of 48 jobs, ignoring the user limit
USERCFG[DEFAULT] MAXJOB=10
QOSCFG[staff]    OMAXJOB=48
Service Access Thresholds
Jobs can be granted access to services such as preemption and reservation
creation, and they can be granted access to resource reservations. However,
with QoS thresholds, this access can be made conditional on the current
queuetime and XFactor metrics of an idle job. The following table lists the
available QoS service thresholds:
Threshold attribute
Description
PREEMPTQTTHRESHOLD
A job with this QoS becomes a preemptor if the specified queuetime threshold is
reached.
PREEMPTXFTHRESHOLD
A job with this QoS becomes a preemptor if the specified XFactor threshold is
reached.
RSVQTTHRESHOLD
A job with this QoS can create a job reservation to guarantee resource access if the
specified queuetime threshold is reached.
RSVXFTHRESHOLD
A job with this QoS can create a job reservation to guarantee resource access if the
specified XFactor threshold is reached.
ACLQTTHRESHOLD
A job with this QoS can access reservations with a corresponding QoS ACL only if the
specified queuetime threshold is reached.
ACLXFTHRESHOLD
A job with this QoS can access reservations with a corresponding QoS ACL only if the
specified XFactor threshold is reached.
TRIGGERQTTHRESHOLD
If a job with this QoS fails to run before this threshold is reached, any failure triggers associated with this QoS will fire.
QoS Metrics
Metric name
Description
BACKLOGCOMPLETIONTIME
The estimated run-time to all idle jobs for a certain QoS. More specifically, it is the processor
second count of all the idle jobs in the QOS, divided by the total processors on the system.
QQOSCFG[HIGH
TRIGGER=EType=threshold,AType=exec,TType=elastic,threshold=BACKLOGCOMPLETIONTIM
E>1,Action="$HOME/geometry.pl blah",timeout=5:00
In order to calculate the BacklogCompletionTime, the QoS must have
ENABLEPROFILING=TRUE, either on the QoS itself or on the DEFAULT QoS.
Preemption Management
Job preemption facilities can be controlled on a per-QoS basis using the
PREEMPTEE and PREEMPTOR flags. Jobs that are preemptible can optionally be
constrained to only be preempted in a particular manner by specifying the QoS
PREEMPTPOLICY attribute as in the following example:
QOSCFG[special] QFLAGS=PREEMPTEE PREEMPTPOLICY=CHECKPOINT
For preemption to be effective, a job must be marked as a preemptee and
must be enabled for the requested preemption type. For example, if the
PREEMPTPOLICY is set to suspend, a potential target job must be both a
preemptee and marked with the job flag SUSPENDABLE. (See suspension for
more information.) If the target job is not suspendable, it will be either
requeued or canceled. Likewise, if the PREEMPTPOLICY is set to requeue, the job
will be requeued if it is marked restartable. Otherwise, it will be canceled.
The minimum time a job must run before being considered eligible for
preemption can also be configured on a per-QoS basis using the
PREEMPTMINTIME parameter, which is analogous to the
JOBPREEMPTMINACTIVETIME. Conversely, PREEMPTMAXTIME sets a threshold
for which a job is no longer eligible for preemption; see
JOBPREEMPTMAXACTIVETIME for analogous details.
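For example, a minimal sketch (the QoS name and time value are illustrative) that makes lowprio jobs preemptible only after they have run for 30 minutes:

QOSCFG[lowprio] QFLAGS=PREEMPTEE PREEMPTMINTIME=00:30:00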
The PREEMPTEES attribute allows you to specify which QoSs that a job in a
specific QoS is allowed to preempt. The PREEMPTEES list is a comma-delimited
list of QoS IDs. When a PREEMPTEES attribute is specified, a job using that QoS
can only preempt jobs using QoSs listed in the PREEMPTEES list. In turn, those
QoSs must be flagged as PREEMPTEE as in the following example:
QOSCFG[a] QFLAGS=PREEMPTOR PREEMPTEES=b,c
QOSCFG[b] QFLAGS=PREEMPTEE
QOSCFG[c] QFLAGS=PREEMPTEE
In the example, jobs in the 'a' QoS can only preempt jobs in the b and c QoSs.
Managing QoS Access
Specifying Credential Based QoS Access
You can define the privileges allowed within a QoS by using the QOSCFG
parameter; however, in most cases access to the QoS is enabled via credential
specific *CFG parameters, specifically the USERCFG, GROUPCFG,
ACCOUNTCFG, and CLASSCFG parameters, which allow defining QoS access
lists and QoS defaults. Specify credential specific QoS access by using the QLIST
and/or QDEF attributes of the associated credential parameter.
QOS Access via Logical OR
To enable QoS access, the QLIST and/or QDEF attributes of the appropriate user,
group, account, or class/queue should be specified as in the following example:
# user john's jobs can access QOS geo, chem, or staff with geo as default
USERCFG[john] QDEF=geo QLIST=geo,chem,staff
# group system jobs can access the development qos
GROUPCFG[systems] QDEF=development
# class batch jobs can access the normal qos
CLASSCFG[batch] QDEF=normal
By default, jobs may request a QoS if access to that QoS is allowed by any of
the job's credentials. (In the previous example, a job from user john submitted
to the class batch could request QoSs geo, chem, staff, or normal).
QOS Access via Logical AND
If desired, QoS access can be masked or logically AND'd if the QoS access list is
specified with a terminating ampersand (&) as in the following example:
# user john's jobs can access QOS geo, chem, or staff with geo as default
USERCFG[john] QDEF=geo QLIST=geo,chem,staff
# group system jobs can access the development qos
GROUPCFG[systems] QDEF=development
# class batch jobs can access the normal qos
CLASSCFG[batch] QDEF=normal
# class debug jobs can only access the development or lowpri QoSs regardless of other credentials
CLASSCFG[debug] QLIST=development,lowpri&
Specifying QoS Based Access
QoS access may also be specified from within the QoS object using the QoS
MEMBERULIST attribute as in the following example:
# define qos premiere and grant access to users steve and john
QOSCFG[premiere] PRIORITY=1000 QFLAGS=PREEMPTOR MEMBERULIST=steve,john
By default, if a job requests a QoS that it cannot access, Moab places a
hold on that job. The QOSREJECTPOLICY can be used to modify this
behavior.
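A minimal sketch, assuming the IGNORE value is desired (directing Moab to ignore the invalid QoS request rather than hold the job):

QOSREJECTPOLICY IGNORE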
Requesting QoS Services at Job Submission
By default, jobs inherit a default QoS based on the user, group, class, and
account associated with the job. If a job has access to multiple QoS levels, the
submitter can explicitly request a particular QoS using the QoS resource
manager extension as in the following example:
> msub -l nodes=1,walltime=100,qos=special3 job.cmd
Restricting Access to Special Attributes
This feature is removed for Moab 9.0 and later. You can achieve the same
results using job templates.
Related Topics
Credential Overview
Allocation Management Overview
Rollback Reservations
Job Deadlines
Using QoS preemption
Chapter 7 Optimizing Scheduling Behavior – Backfill and
Node Sets
• Optimization Overview
• Backfill
• Node Set Overview
Optimization Overview
Moab optimizes cluster performance. Every policy, limit, and feature is
designed to allow maximum scheduling flexibility while enforcing the required
constraints. A driving responsibility of the scheduler is to do all in its power to
maximize system use and to minimize job response time while honoring the
policies that make up the site's mission goals.
However, as all jobs are not created equal, optimization must be abstracted
slightly further to incorporate this fact. Cluster optimization must also focus on
targeted cycle delivery. In the scientific HPC community, the true goal of a
cluster is to maximize delivered research. For businesses and other
organizations, the purposes may be slightly different, but all organizations
agree on the simple tenet that the cluster should optimize the site's mission
goals.
To obtain this goal, the scheduler has several levels of optimization it performs:
Workload Ordering
Prioritizing workload and utilizing backfill
Intelligent Resource Allocation
Selecting those resources that best meet the job's needs or best enable future jobs to run (see node allocation)
Maximizing Intra-Job Efficiency
Selecting the type of nodes, collection of nodes, and proximity of nodes required to maximize job performance by minimizing both job compute and inter-process communication time (see node sets and node allocation)
Job Preemption
Preempting jobs to allow the most important jobs to receive the best response time (see preemption)
Utilizing Flexible Policies
Using policies that minimize blocking and resource fragmentation while enforcing needed constraints (see soft throttling policies and reservations)
Backfill
• Backfill Overview
• Backfill Algorithms
• Configuring Backfill
Backfill Overview
Backfill is a scheduling optimization that allows a scheduler to make better use
of available resources by running jobs out of order. When Moab schedules, it
prioritizes the jobs in the queue according to a number of factors and then
orders the jobs into a highest priority first (or priority FIFO) sorted list. It starts
the jobs one by one stepping through the priority list until it reaches a job it
cannot start. Because all jobs and reservations possess a start time and a
wallclock limit, Moab can determine the completion time of all jobs in the
queue. Consequently, Moab can also determine the earliest the needed
resources will become available for the highest priority job to start.
Backfill operates based on this earliest job start information. Because Moab
knows the earliest the highest priority job can start, and which resources it will
need at that time, it can also determine which jobs can be started without
delaying this job. Enabling backfill allows the scheduler to start other, lower-priority jobs so long as they do not delay the highest priority job. If backfill is
enabled, Moab protects the highest priority job's start time by creating a job
reservation to reserve the needed resources at the appropriate time. Moab
then can start any job that will not interfere with this reservation.
Image 7-1: Scheduling with backfill
Backfill offers significant scheduler performance improvement. In a typical
large system, enabling backfill increases system utilization by about 20% and
improves turnaround time by an even greater amount. Because of the way it
works, essentially filling in holes in node space, backfill tends to favor smaller
and shorter running jobs more than larger and longer running ones. It is
common to see over 90% of these small and short jobs backfilled.
Consequently, sites will see marked improvement in the level of service
delivered to the small, short jobs and moderate to little improvement for the
larger, long ones.
With most algorithms and policies, there is a trade-off. Backfill is not an
exception but the negative effects are minor. Because backfill locates jobs to
run from throughout the idle job queue, it tends to diminish the influence of the
job prioritization a site has chosen and thus may negate some desired
workload steering attempts through this prioritization. Although by default the
start time of the highest priority job is protected by a reservation, there is
nothing to prevent the third priority job from starting early and possibly
delaying the start of the second priority job. This issue is addressed along with
its trade-offs later in this section.
Another problem is a little more subtle. Consider the following scenario
involving a two-processor cluster. Job A has a four-hour wallclock limit and
requires one processor. It started one hour ago (time zero) and will reach its
wallclock limit in three more hours. Job B is the highest priority idle job and
requires two processors for one hour. Job C is the next highest priority job and
requires one processor for two hours. Moab examines the jobs and correctly
determines that job A must finish in three hours and thus, the earliest job B can
start is in three hours. Moab also determines that job C can start and finish in
less than this amount of time. Consequently, Moab starts job C on the idle
processor at time one. One hour later (time two), job A completes early.
Apparently, the user overestimated the amount of time job A would need by a
few hours. Since job B is now the highest priority job, it should be able to run.
However, job C, a lower priority job was started an hour ago and the resources
needed for job B are not available. Moab re-evaluates job B's reservation and
determines that it can slide forward an hour. At time three, job B starts.
In review, backfill provided positive benefits. Job A successfully ran to
completion. Job C was started immediately. Job B was able to start one hour
sooner than its original target time, although, had backfill not been enabled,
job B would have been able to run two hours earlier.
The scenario just described occurs quite frequently because user estimates for
job duration are generally inaccurate. Job wallclock estimate accuracy, or
wallclock accuracy, is defined as the ratio of wall time required to actually run
the job divided by the wall time requested for the job. Wallclock accuracy varies
from site to site but the site average is rarely better than 50%. Because the
quality of the walltime estimate provided by the user is so low, job reservations
for high priority jobs are often later than they need to be.
Although there do exist some minor drawbacks with backfill, its net
performance impact on a site's workload is very positive. While a few of the
highest priority jobs may get temporarily delayed, their position as highest
priority was most likely accelerated by the fact that jobs in front of them were
able to start earlier due to backfill. Studies have shown that only a very small
number of jobs are truly delayed and when they are, it is only by a fraction of
their total queue time. At the same time, many jobs are started significantly
earlier than would have occurred without backfill.
Image 7-1 (above) demonstrates how Moab might schedule a queue using
backfill.
Backfill Algorithms
BACKFILLPOLICY controls which job gets selected first to be backfilled.
Backfill jobs are still placed on nodes according to the
NODEALLOCATIONPOLICY.
The algorithm behind Moab backfill scheduling is straightforward, although
there are a number of issues and parameters that should be highlighted. First
of all, Moab makes two backfill scheduling passes. For each pass, Moab selects
a list of jobs that are eligible for backfill. On the first pass, only those jobs that
meet the constraints of the soft fairness throttling policies are considered and
scheduled. The second pass expands this list of jobs to include those that meet
the hard (less constrained) fairness throttling policies.
The second important concept regarding Moab backfill is the concept of backfill
windows. The figure below shows a simple batch environment containing two
running jobs and a reservation for a third job. The present time is represented
by the leftmost end of the box with the future moving to the right. The light
gray boxes represent currently idle nodes that are eligible for backfill. For this
example, let's assume that the space represented covers 8 nodes and a 2 hour
time frame. To determine backfill windows, Moab analyzes the idle nodes
essentially looking for largest node-time rectangles. It determines that there
are two backfill windows. The first window, Window 1, consists of 4 nodes that
are available for only one hour (because some of the nodes are blocked by the
reservation for Job 3). The second window contains only one node but has no
time limit because this node is not blocked by the reservation for Job 3. It is
important to note that these backfill windows overlap.
Image 7-2: Backfillable nodes create backfill windows 1 and 2
Once the backfill windows have been determined, Moab begins to traverse
them. The current behavior is to traverse these windows widest window first
(most nodes to fewest nodes). As each backfill window is evaluated, Moab
applies the backfill algorithm specified by the BACKFILLPOLICY parameter.
If the FIRSTFIT algorithm is applied, the following steps are taken:
1. The list of feasible backfill jobs is filtered, selecting only those that will
actually fit in the current backfill window.
2. The first job is started.
3. While backfill jobs and idle resources remain, repeat step 1.
If NONE is set, the backfill policy is disabled.
Other backfill policies behave in a generally similar manner. The parameters
documentation provides further details.
Liberal versus Conservative Backfill
By default, Moab reserves only the highest priority job resulting in a liberal and
aggressive backfill. This reservation guarantees that backfilled jobs will not
delay the highest priority job, although they may delay other jobs. The
parameter RESERVATIONDEPTH controls how conservative or liberal the
backfill policy is. This parameter controls how deep down the queue priority
reservations will be made. While increasing this parameter improves
guarantees that priority jobs will not be bypassed, it reduces the freedom of
the scheduler to backfill resulting in somewhat lower system utilization. The
significance of the trade-offs should be evaluated on a site by site basis.
Configuring Backfill
Backfill Policies
Backfill is enabled in Moab by specifying the BACKFILLPOLICY parameter. The
BACKFILLPOLICY parameter is used to control which job gets selected first to
be backfilled. Once the job has been selected, it is still placed on nodes
according to the NODEALLOCATIONPOLICY you have defined. By default,
backfill is enabled in Moab using the FIRSTFIT algorithm. However, this
parameter can also be set to NONE (disabled).
The number of reservations that protect the resources required by priority jobs
can be controlled using RESERVATIONDEPTH. This depth can be distributed
across job QoS levels using RESERVATIONQOSLIST.
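For example, a minimal sketch (the depth value is illustrative) that enables first-fit backfill while protecting the start times of the top four priority jobs:

BACKFILLPOLICY   FIRSTFIT
RESERVATIONDEPTH 4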
Backfill Chunking
In a batch environment saturated with serial jobs, serial jobs will, over time,
dominate the resources available for backfill at the expense of other jobs. This
is due to the time-dimension fragmentation associated with running serial jobs.
For example, given an environment with an abundance of serial jobs, if a multiprocessor job completes freeing processors, one of three things will happen:
1. The freed resources are allocated to another job requiring the same number
of processors.
2. Additional jobs may complete at the same time allowing a larger job to
allocate the aggregate resources.
3. The freed resources are allocated to one or more smaller jobs.
In environments where scheduling iterations occur much more frequently than
job completions, case 3 occurs far more often than case 2, leading to smaller
and smaller jobs populating the system over time.
To address this issue, the scheduler incorporates the concept of chunking.
Chunking allows the scheduler to favor case 2 maintaining a more controlled
balance between large and small jobs. The idea of chunking involves
establishing a time-based threshold during which resources available for
backfill are aggregated. This threshold is set using the parameter
BFCHUNKDURATION. When resources are freed, they are made available only
to jobs of a certain size (set using the parameter BFCHUNKSIZE) or larger.
These resources remain protected from smaller jobs until either additional
resources are freed up and a larger job can use the aggregate resources, or
until the BFCHUNKDURATION threshold time expires.
Backfill chunking is only activated when a job of size BFCHUNKSIZE or larger
is blocked in backfill due to lack of resources.
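A minimal sketch (both values are illustrative) that aggregates freed resources for up to five minutes and protects them for jobs requesting at least four processors:

BFCHUNKDURATION 00:05:00
BFCHUNKSIZE     4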
It is important to note that the optimal settings for these parameters are very
site-specific and will depend on the workload (including the average job
turnaround time, job size, and mix of large to small jobs), cluster resources,
and other scheduling environmental factors. Values that are too restrictive
needlessly reduce utilization, while values that are too relaxed do not allow
the desired aggregation to occur.
Backfill chunking is only enabled in conjunction with the FIRSTFIT backfill
policy.
Virtual Wallclock Time Scaling
In most environments, users submit jobs with rough estimations of the
wallclock times. Within the HPC industry, a job typically runs for 40% of its
specified wallclock time. Virtual Wallclock Time Scaling takes advantage of this
fact to implement a form of optimistic backfilling. Jobs that are eligible for
backfilling and not restricted by other policies are virtually scaled by the
BFVIRTUALWALLTIMESCALINGFACTOR (assuming that the jobs finish before
this new virtual wallclock limit). The scaled jobs are then compared to backfill
windows to see if there is space and time for them to be scheduled. The scaled
jobs are only scheduled if there is no possibility that they will conflict with a
standing or administrator reservation. Conflicts with such reservations occur if
the virtual wallclock time overlaps a reservation, or if the original non-virtual
wallclock time overlaps a standing or administrator reservation. Jobs that can
fit into an available backfill window without having their walltime scaled are
backfilled "as-is" (meaning, without virtually scaling the original walltime).
Virtual Wallclock Time Scaling is only enabled when the
BFVIRTUALWALLTIMESCALINGFACTOR parameter is defined.
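For example, a minimal sketch (the factor is illustrative, echoing the roughly 40% average noted above):

BFVIRTUALWALLTIMESCALINGFACTOR 0.4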
If a virtually-scaled job fits into a window, and is backfilled, it will run until
completion or until it comes within one scheduling iteration (RMPOLLINTERVAL
defines the exact time of an iteration) of the virtual wallclock time expiration. In
the latter case the job's wallclock time is restored to its original time and Moab
checks and resolves conflicts caused by this "expansion." Conflicts may occur
when the backfilled job is restored to its full duration resulting in reservation
overlap. The BFVIRTUALWALLTIMECONFLICTPOLICY parameter controls how
Moab handles these conflicts.
If the BFVIRTUALWALLTIMECONFLICTPOLICY parameter is set to NONE or is not
specified, the overlapped job reservations are rescheduled.
Related Topics
BACKFILLDEPTH Parameter
BACKFILLPOLICY Parameter
BFMINVIRTUALWALLTIME
Reservation Policy Overview
Node Set Overview
• Node Set Usage Overview
• Node Set Configuration
  ◦ Node Set Policy
  ◦ Node Set Attribute
  ◦ Node Set Constraint Handling
  ◦ Node Set List
  ◦ Node Set Tolerance
  ◦ Node Set Priority
  ◦ NODESETPLUS
  ◦ Nested Node Sets
• Requesting Node Sets for Job Submission
• Configuring Node Sets for Classes
Node Set Usage Overview
While backfill improves the scheduler's performance, this is only half the battle.
The efficiency of a cluster, in terms of actual work accomplished, is a function
of both scheduling performance and individual job efficiency. In many clusters,
job efficiency can vary from node to node as well as with the node mix
allocated. Most parallel jobs written with popular message-passing systems such as MPI or PVM
do not internally load balance their workload and thus run only as fast as the
slowest node allocated. Consequently, these jobs run most effectively on
homogeneous sets of nodes. However, while many clusters start out as
homogeneous, they quickly evolve as new generations of compute nodes are
integrated into the system. Research has shown that this integration, while
improving scheduling performance due to increased scheduler selection, can
actually decrease average job efficiency.
A feature called node sets allows jobs to request sets of common resources
without specifying exactly what resources are required. Node set policy can be
specified globally or on a per-job basis. In addition to their use in forcing jobs
onto homogeneous nodes, these policies may also be used to guide jobs to one
or more types of nodes on which a particular job performs best, similar to job
preferences available in other systems. For example, an I/O intensive job may
run best on a certain range of processor speeds, running slower on slower
nodes, while wasting cycles on faster nodes. A job may specify
ANYOF:FEATURE:bigmem,fastos to request nodes with the bigmem or fastos
feature. Alternatively, if a simple feature-homogeneous node set is desired,
ONEOF:FEATURE may be specified. On the other hand, a job may request a
feature based node set with the configuration
ONEOF:FEATURE:bigmem,fastos, in which case Moab will first attempt to
locate adequate nodes where all nodes contain the bigmem feature. If such a
set cannot be found, Moab will look for sets of nodes containing the other
specified features. In highly heterogeneous clusters, the use of node sets
improves job throughput by 10 to 15%.
Node sets can be requested on a system wide or per job basis. System wide
configuration is accomplished via the NODESET* parameters while per job
specification occurs via the resource manager extensions.
The GLOBAL node is included in all feature node sets.
When creating node sets, you have the option of using a fixed configuration or
of creating node sets dynamically (by using the msub command). This topic
explains how to set up both node set use cases.
Node Set Configuration Examples
Global node sets are defined using the NODESETPOLICY,
NODESETATTRIBUTE, NODESETLIST, and NODESETISOPTIONAL parameters.
As stated before, you can create node sets dynamically (see Dynamic example)
or with a fixed configuration (see Fixed configuration example). The use of
these parameters can be best highlighted with two examples.
Fixed configuration example
In this example, a large site possesses a Myrinet based interconnect and wishes
to, whenever possible, allocate nodes within Myrinet switch boundaries. To
accomplish this, they could assign node attributes to each node indicating which
switch it was associated with (switchA, switchB, and so forth) and then use
the following system wide node set configuration:
NODESETPOLICY     ONEOF
NODESETATTRIBUTE  FEATURE
NODESETISOPTIONAL TRUE
NODESETLIST       switchA,switchB,switchC,switchD
...
Node Set Policy
In the preceding example, the NODESETPOLICY parameter is set to the policy
ONEOF and tells Moab to allocate nodes within a single attribute set. Other
node set policies are listed in the following table:
Policy
Description
ANYOF
Select resources from all sets contained in node set list. The job could span multiple node sets.
FIRSTOF
Select resources from first set to match specified constraints.
ONEOF
Select a single set that contains adequate resources to support job.
Node Set Attribute
The example's NODESETATTRIBUTE parameter is set to FEATURE, specifying
that the node sets are to be constructed along node feature boundaries.
You could also set the NODESETATTRIBUTE to VARATTR, specifying that node sets
are to be constructed according to VARATTR values on the job.
Node Set Constraint Handling
The next parameter, NODESETISOPTIONAL, indicates that Moab should not
delay the start time of a job if the desired node set is not available but
adequate idle resources exist outside of the set. Setting this parameter to
TRUE basically tells Moab to attempt to use a node set if it is available, but if
not, run the job as soon as possible anyway.
Setting NODESETISOPTIONAL to FALSE will force the job to always run in a
complete nodeset regardless of any start delay this imposes.
Node Set List
Finally, the NODESETLIST value of switchA switchB... tells Moab to only
use node sets based on the listed feature values. This is necessary since sites
will often use node features for many purposes and the resulting node sets
would be of little use for switch proximity if they were generated based on
irrelevant node features indicating things such as processor speed or node
architecture.
To add nodes to the NODESETLIST, you must configure features on your nodes
using the NODECFG FEATURES attribute.
NODECFG[node01] FEATURES=switchA
NODECFG[node02] FEATURES=switchA
NODECFG[node03] FEATURES=switchB
Nodes node01 and node02 contain the switchA feature, and node node03 contains the switchB feature.
Node Set Priority
When resources are available in more than one resource set, the
NODESETPRIORITYTYPE parameter allows control over how the best resource
set is selected. Legal values for this parameter are described in the following
table:
AFFINITY
Avoid a resource set with negative affinity.
Choosing this type causes Moab to select a node set with no negative affinity nodes (nodes that have a reservation with negative affinity). If all node sets have negative affinity, then Moab will select the first matching node set.
BESTFIT
Select the smallest resource set possible.
Choosing this type causes Moab, when selecting a node set, to eliminate sets that do not have all the required resources. From the remaining sets, Moab chooses the set with the least amount of resources. This priority type most closely matches the job requirements in order to waste the least amount of resources. This type minimizes fragmentation of larger resource sets.
FIRSTFIT
Select the first set with enough resources.
Moab will select the first node set with enough resources to satisfy the job. This is the fastest of the priority types.
MINLOSS
Select the resource set that results in the minimal wasted resources, assuming no internal job load balancing is available. (Assumes parallel jobs only run as fast as the slowest allocated node.)
Choosing this type works only when using the following configuration:
NODESETATTRIBUTE FEATURE
In a SHAREDMEM environment (see the Moab-NUMA Support Integration Guide for more information), Moab will select the node set based on NUMA properties (the smallest feasible node set).
WORSTFIT
Select the largest resource set possible.
This type causes Moab, when choosing a node set, to eliminate sets that do not have all the required resources. From the remaining sets, Moab chooses the set with the greatest amount of resources. This type minimizes fragmentation of smaller resource sets, but increases fragmentation of larger resource sets.
Dynamic example
In this example, a site wants to be able to dynamically specify which VARATTR
values the node set will be based on. To accomplish this, they could use the
following configuration in the moab.cfg file:
NODESETISOPTIONAL FALSE
NODESETPOLICY     FIRSTOF
NODESETATTRIBUTE  VARATTR
Node Set Attribute
The example's NODESETATTRIBUTE parameter is set to VARATTR specifying
that the node sets are to be constructed by job VARATTR values that are
specified dynamically in the msub command.
Node Set Policy
In the preceding example, the NODESETPOLICY parameter is set to the policy
FIRSTOF and tells Moab to allocate nodes from the first set that matches
specified constraints.
Node Set Constraint Handling
The parameter, NODESETISOPTIONAL, indicates that Moab should not delay
the start time of a job if the desired node set is not available but adequate idle
resources exist outside of the set. Setting this parameter to FALSE will force
the job to always run in a complete node set regardless of any start delay this
imposes.
msub example
With the configuration (above) set in the moab.cfg, Moab is configured for
dynamic node sets. You can create node sets dynamically by using the msub -l
command. (For more information, see Resource Manager Extensions.) Use the
following format:
msub -l nodeset=FIRSTOF:VARATTR:<var>[=<value>],...
For example, if you wanted to create a dynamic node set for the Provo
datacenter:
msub -l nodeset=FIRSTOF:VARATTR:datacenter=Provo
This command causes Moab to set datacenter=Provo as the node set.
You can specify more than one VARATTR in the command. For example, if
you want to create a dynamic node set for the Provo datacenter and the
SaltLake datacenter:
msub -l nodeset=FIRSTOF:VARATTR:datacenter=Provo:datacenter=SaltLake
If you specify only datacenter (without specifying a value, such as =Provo),
Moab will look up all possible values (values reported on the node for that
VARATTR), and then choose one. So if, for example, you have nodes that have
VARATTRs datacenter=Provo, datacenter=SaltLake, and
datacenter=StGeorge, then specifying msub -l
nodeset=FIRSTOF:VARATTR:datacenter will cause the job to run in Provo or
SaltLake or StGeorge.
Note that Moab adds the VARATTR (whether you specify it or Moab chooses it)
to the required attribute (REQATTR) of the job. For example, if you specify
datacenter=Provo as the VARATTR, datacenter=Provo will also be added to the
job REQATTR. Likewise, if you specify only datacenter, and Moab chooses
datacenter=SaltLake, then datacenter=SaltLake will be added to the job
REQATTR.
If you do not request a VARATTR in the nodeset of the msub -l command, the
job will run as if it did not use node sets at all, and nothing will be added to its
REQATTR.
If you manually specify a different REQATTR on a job (for example,
datacenter=SaltLake) from the node set VARATTR (for example,
datacenter=Provo), the job will never run.
NODESETPLUS
Moab supports additional NodeSet behavior by specifying the NODESETPLUS
parameter. Possible values when specifying this parameter are SPANEVENLY
and DELAY.
Neither SPANEVENLY nor DELAY will work with multi-req jobs or
preemption.
Value
Description
SPANEVENLY
Moab attempts to fit all jobs within one node set, or it spans any number of node sets evenly.
When a job specifies a NODESETDELAY, Moab attempts to contain the job within a single node
set; if unable to do so, it spans node sets evenly, unless doing so would delay the job beyond the
requested NODESETDELAY.
DELAY
Moab attempts to schedule the job within a nodeset for the configured NODESETDELAY. If Moab
cannot find space for the job to start within NODESETDELAY (Moab considers future workload to
determine if space will open up in time and might create a future reservation), then Moab
schedules the job and ignores the nodeset requirement.
Nested Node Sets
Moab attempts to fit jobs on node sets in the order they are specified in the
NODESETLIST. You can create nested node sets by listing your node sets in a
specific order. Here is an example of a "smallest to largest" nested node set:
NODESETPOLICY ONEOF
NODESETATTRIBUTE FEATURE
NODESETISOPTIONAL FALSE
NODESETLIST blade1a,blade1b,blade2a,blade2b,blade3a,blade3b,blade4a,blade4b,quad1a,quad1b,quad2a,quad2b,octet1,octet2,sixteen
The accompanying cluster would look like this:
Image 7-3: Octet, quad, and blade node sets on a cluster
In this example, Moab tries to fit the job on the nodes in the blade sets first. If
that doesn't work, it moves up to the nodes in the quad sets (a set of four blade
sets). If the quads are insufficient, it tries the nodes in the octet sets (a set of
four quad node sets).
Requesting Node Sets for Job Submission
On a per job basis, each user can specify the equivalent of all parameters
except NODESETDELAY. As mentioned previously, this is accomplished using the
resource manager extensions.
Configuring Node Sets for Classes
Classes can be configured with a default node set. In the configuration file,
specify DEFAULT.NODESET with the following syntax:
DEFAULT.NODESET=<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...].
For example, in a heterogeneous cluster with two different types of
processors, the following configuration confines jobs assigned to the amd class
to run on either ATHLON or OPTERON processors:
CLASSCFG[amd] DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON
...
Related Topics
• Resource Manager Extensions
• CLASSCFG
• Partition Overview
Chapter 8 Evaluating System Performance - Statistics,
Profiling and Testing
• Moab Performance Evaluation Overview
• Accounting: Job and System Statistics
• Testing New Versions and Configurations
Moab Performance Evaluation Overview
Moab Workload Manager tracks numerous performance statistics for jobs,
accounting, users, groups, accounts, classes, QoS, the system, and so forth.
These statistics can be accessed through various commands or Moab Cluster
Manager/Monitor.
Accounting: Job and System Statistics
Moab provides extensive accounting facilities that allow resource usage to be
tracked by resources (compute nodes), jobs, users, and other objects. The
accounting facilities may be used in conjunction with, and correlated with, the
accounting records provided by the resource and accounting manager.
Moab maintains both raw persistent data and a large number of processed in
memory statistics allowing instant summaries of cycle delivery and system
utilization. With this information, Moab can assist in accomplishing any of the
following tasks:
• Determining cumulative cluster performance over a fixed time frame.
• Graphing changes in cluster utilization and responsiveness over time.
• Identifying which compute resources are most heavily used.
• Charting resource usage distribution among users, groups, projects, and classes.
• Determining allocated resources, responsiveness, and failure conditions for jobs completed in the past.
• Providing real-time statistics updates to external accounting systems.
This section describes how to accomplish each of these tasks using Moab tools
and accounting information.
• Accounting Overview
• Real-Time Statistics
• FairShare Usage Statistics
Accounting Overview
Moab provides accounting data correlated to most major objects used within
the cluster scheduling environment. These records provide job and reservation
accounting, resource accounting, and credential-based accounting.
Job and Reservation Accounting
As each job or reservation completes, Moab creates a complete persistent
trace record containing information about who ran it, the time frame of all
significant events, and what resources were allocated. In addition, actual
execution environment, failure reports, requested service levels, and other
pieces of key information are also recorded. A complete description of each
accounting data field can be found within section Workload Traces.
Resource Accounting
The load on any given node is available historically allowing identification of not
only its usage at any point in time, but the actual jobs which were running on it.
Moab Cluster Manager can show load information (assuming load is configured
as a generic metric), but not the individual jobs that were running on a node at
some point in the past. For aggregated, historical statistics covering node
usage and availability, the showstats command may be run with the -n flag.
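For example:

> showstats -n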
Credential Accounting
Current and historical usage for users, groups, account, QoSs, and classes are
determined in a manner similar to that available for evaluating nodes. For
aggregated, historical statistics covering credential usage and availability, the
showstats command may be run with the corresponding credential flag.
If needed, detailed credential accounting can also be enabled globally or on a
credential by credential basis. With detailed credential accounting enabled,
real-time information regarding per-credential usage over time can be
displayed. To enable detailed per credential accounting, the ENABLEPROFILING
attribute must be specified for credentials that are to be monitored. For
example, to track detailed credentials, the following should be used:
USERCFG[DEFAULT]    ENABLEPROFILING=TRUE
QOSCFG[DEFAULT]     ENABLEPROFILING=TRUE
CLASSCFG[DEFAULT]   ENABLEPROFILING=TRUE
GROUPCFG[DEFAULT]   ENABLEPROFILING=TRUE
ACCOUNTCFG[DEFAULT] ENABLEPROFILING=TRUE
Credential level profiling operates by maintaining a number of time-based
statistical records for each credential. The parameters PROFILECOUNT and
PROFILEDURATION control the number and duration of the statistical records.
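
For example, the following moab.cfg entries (values are illustrative only) would keep 300 records of 30 minutes each for every profiled credential:

PROFILECOUNT    300
PROFILEDURATION 00:30:00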
Real-Time Statistics
Moab provides real-time statistical information about how the machine is
running from a scheduling point of view. The showstats command is actually a
suite of commands providing detailed information on an overall scheduling
basis as well as on a per-user, per-group, per-account, and per-node basis. This
command gets its information from in-memory statistics that are loaded at scheduler start
time from the scheduler checkpoint file. (See Checkpoint/Restart for more
information.) This checkpoint file is updated periodically and when the
scheduler is shut down, allowing statistics to be collected over an extended time
frame. At any time, real-time statistics can be reset using the mschedctl -f
command.
In addition to the showstats command, the showstats -f command also obtains
its information from the in-memory statistics and checkpoint file. This
command displays a processor-time based matrix of scheduling performance
for a wide variety of metrics. Information such as backfill effectiveness or
average job queue time can be determined on a job size/duration basis.
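
For example, the following displays the average queue time matrix (AVGQTIME is shown as an assumed statistic type; see the showstats -f command reference for the supported list):

> showstats -f AVGQTIME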
FairShare Usage Statistics
Regardless of whether fairshare is enabled, detailed credential-based fairshare
statistics are maintained. Like job traces, these statistics are stored in the
directory pointed to by the STATDIR parameter. Fairshare stats are maintained
in a separate statistics file using the format FS.<EPOCHTIME>
(FS.982713600, for example) with one file created per fairshare window. (See
the Fairshare Overview for more information.) These files are also flat text and
record credential-based usage statistics. Information from these files can be
seen via the mdiag -f command.
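
For example:

> mdiag -f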
Related Topics
Simulation Overview
Generic Consumable Resources
Object Variables
Generic Event Counters
Testing New Versions and Configurations
* MONITOR Mode
* INTERACTIVE Mode
MONITOR Mode
Moab supports a scheduling mode called MONITOR. In this mode, the scheduler
initializes, contacts the resource manager and other peer services, and
conducts scheduling cycles exactly as it would if running in NORMAL or
production mode. Jobs are prioritized, reservations created, policies and limits
enforced, and administrator and end-user commands enabled. The key
difference is that although live resource management information is loaded,
MONITOR mode disables Moab's ability to start, preempt, cancel, or otherwise
modify jobs or resources. Moab continues to attempt to schedule exactly as it
would in NORMAL mode but its ability to actually impact the system is disabled.
Using this mode, a site can quickly verify correct resource manager
configuration and scheduler operation. This mode can also be used to validate
new policies and constraints. In fact, Moab can be run in MONITOR mode on a
production system while another scheduler or even another version of Moab is
running on the same system. This unique ability can allow new versions and
configurations to be fully tested without any exposure to potential failures and
with no cluster downtime.
To run Moab in MONITOR mode, simply set the MODE attribute of the SCHEDCFG
parameter to MONITOR and start Moab. Normal scheduler commands can be
used to evaluate configuration and performance. Diagnostic commands can be
used to look for any potential issues. Further, the Moab log file can be used to
determine which jobs Moab attempted to start, and which resources Moab
attempted to allocate.
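
For example, in moab.cfg (assuming the scheduler is named Moab):

SCHEDCFG[Moab] MODE=MONITOR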
If another instance of Moab is running in production and a site administrator
wants to evaluate an alternate configuration or new version, this is easily done
but care should be taken to avoid conflicts with the primary scheduler. Potential
conflicts include statistics files, logs, checkpoint files, and user interface ports.
One of the easiest ways to avoid these conflicts is to create a new test directory
with its own log and stats subdirectories. The new moab.cfg file can be created
from scratch or based on the existing moab.cfg file already in use. In either
case, make certain that the PORT attribute of the SCHEDCFG parameter differs
from that used by the production scheduler by at least two ports. If testing with
the production binary executable, the MOABHOMEDIR environment variable
should be set to point to the new test directory to prevent Moab from loading
the production moab.cfg file.
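
A minimal sketch of such a test setup, with illustrative paths and port number, might look like the following:

# moab.cfg in the test directory
SCHEDCFG[Moab] MODE=MONITOR PORT=42600
LOGDIR=/opt/moab-test/log
STATDIR=/opt/moab-test/stats

# point the production binary at the test directory before starting Moab
> export MOABHOMEDIR=/opt/moab-test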
INTERACTIVE Mode
INTERACTIVE mode allows for evaluation of new versions and configurations in
a manner different from MONITOR mode. Instead of disabling all resource and
job control functions, Moab sends the desired change request to the screen and
asks for permission to complete it. For example, before starting a job, Moab
may print something like the following to the screen:
Command: start job 1139.ncsa.edu on node list test013,test017,test018,test021
Accept: (y/n) [default: n]?
The administrator must specifically accept each command request after
verifying it correctly meets desired site policies. Moab then executes the
specified command. This mode is highly useful in validating scheduler behavior
and can be used until configuration is appropriately tuned and all parties are
comfortable with the scheduler's performance. In most cases, sites will want to
set the scheduling mode to NORMAL after verifying correct behavior.
Related Topics
Testing New Releases and Policies
Side-by-Side Mode
Chapter 9 General Job Administration
* Job Holds
* Job Priority Management
* Suspend/Resume Handling
* Checkpoint/Restart Facilities
* Job Dependencies
* Job Defaults and Per Job Limits
* General Job Policies
* Using a Local Queue
* Job Deadlines
* Job Arrays
Job Holds
Holds and Deferred Jobs
Moab supports job holds applied by users (user holds), administrators (system
holds), and resource managers (batch holds). There is also a temporary hold
known as a job defer.
User Holds
User holds are very straightforward. Many, if not most, resource managers
provide interfaces by which users can place a hold on their own job that tells the
scheduler not to run the job while the hold is in place. Users may use this
capability because the job's data is not yet ready, or they want to be present
when the job runs to monitor results. Such user holds are created by, and are
under the control of, a non-privileged user and may be removed at any time by
that user. As would be expected, users can only place holds on their own jobs. Jobs
with a user hold in place will have a Moab state of Hold or UserHold depending
on the resource manager being used.
System Holds
The system hold is put in place by a system administrator either manually or by
way of an automated tool. As with all holds, the job is not allowed to run so long
as this hold is in place. A batch administrator can place and release system
holds on any job regardless of job ownership. However, unlike a user hold,
normal users cannot release a system hold even on their own jobs. System
holds are often used during system maintenance and to prevent particular jobs
from running in accordance with current system needs. Jobs with a system hold
in place will have a Moab state of Hold or SystemHold depending on the
resource manager being used.
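
For example, an administrator could place a system hold using mjobctl (job ID illustrative; see the mjobctl reference for the supported hold types):

> mjobctl -h system Moab.231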
Batch Holds
Batch holds are placed on a job by the scheduler itself when it determines that
a job cannot run. The reasons for this vary but can be displayed by issuing the
checkjob <JOBID> command. Possible reasons are included in the following
list:
* No Resources — The job requests resources of a type or amount that do not exist on the system.
* System Limits — The job is larger or longer than what is allowed by the specified system policies.
* Bank Failure — The allocations bank is experiencing failures.
* No Allocations — The job requests use of an account that is out of allocations and no fallback account has been specified.
* RM Reject — The resource manager refuses to start the job.
* RM Failure — The resource manager is experiencing failures.
* Policy Violation — The job violates certain throttling policies preventing it from running now and in the future.
* No QOS Access — The job does not have access to the QoS level it requests.
Jobs which are placed in a batch hold will show up within Moab in the state
BatchHold.
Job Defer
In most cases, a job violating these policies is not placed into a batch hold
immediately; rather, it is deferred. The parameter DEFERTIME indicates how
long it is deferred. Once this time has elapsed, the job is allowed back into the
idle queue and again considered for scheduling. If it again is unable to run at that time or at any time
in the future, it is again deferred for the timeframe specified by DEFERTIME. A
job is released and deferred up to DEFERCOUNT times at which point the
scheduler places a batch hold on the job and waits for a system administrator
to determine the correct course of action. Deferred jobs have a Moab state of
Deferred. As with jobs in the BatchHold state, the reason the job was deferred
can be determined by use of the checkjob command.
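
Both parameters are set in moab.cfg; the values below are illustrative:

DEFERTIME  1:00:00
DEFERCOUNT 24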
At any time, a job can be released from any hold or deferred state using the
releasehold command. The Moab logs should provide detailed information
about the cause of any batch hold or job deferral.
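
For example, to release all holds on a job (job ID illustrative):

> releasehold -a Moab.1045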
Under Moab, the reason a job is deferred or placed in a batch hold is
stored in memory but is not checkpointed. Thus, this information is
available only until Moab is recycled, at which point the checkjob command
no longer displays this reason information.
Related Topics
DEFERSTARTCOUNT - number of job start failures allowed before job is deferred
Job Priority Management
Job priority management is controlled via both configured and manual
intervention mechanisms.
* Priority Configuration - see Job Prioritization
* Manual Intervention with setspri
Suspend/Resume Handling
When supported by the resource manager, Moab can suspend and resume
jobs. By default, a job is suspended for one minute before it can be resumed by
Moab. You can modify this default time using the MINADMINSTIME parameter.
Moab schedules suspended jobs each iteration to see if they can be
resumed. If the node the jobs are running on is free, then Moab
automatically resumes the job.
Alternately, a user can suspend his/her own jobs, but only an administrator can
resume them. The administrator can resume jobs before the time set for Moab
to resume.
A job must be marked as suspendable for Moab to suspend and resume it. To
do so, either submit the job with the suspendable flag attached to it or configure
a credential to pass the flag to its associated jobs. These methods are
demonstrated in the examples below:
msub -l flags=suspendable
GROUPCFG[default] JOBFLAGS=SUSPENDABLE
Once the job is suspendable, Moab allows you to suspend jobs using the two
following methods: (1) manually on the command line and (2) automatically in
the moab.cfg file.
To manually suspend jobs, use the mjobctl command as demonstrated in the
following example:

> mjobctl -s job05
Moab suspends job05, preventing it from running immediately in the job queue.
If you are an administrator and want to resume a job, use the mjobctl command
as demonstrated in the following example:
> mjobctl -r job05
Moab removes job05 from a suspended state and allows it to run.
You can also configure the Moab preemption policy to suspend and resume
jobs automatically by setting the PREEMPTPOLICY parameter to SUSPEND. A
sample Moab configuration looks like this:

PREEMPTPOLICY SUSPEND
...
USERCFG[tom] JOBFLAGS=SUSPENDABLE
Moab suspends jobs submitted by user tom if necessary to make resources available for jobs with higher
priority.
If your resource manager has a native interface, you must configure
JOBSUSPENDURL to suspend and resume jobs.
For more information about suspending and resuming jobs in Moab, see the
following sections:
* manual preemption with the mjobctl command
* Job preemption
Checkpoint/Restart Facilities
Checkpointing records the state of a job, allowing for it to restart later without
interruption to the job's execution. Checkpointing can be performed manually,
as the result of triggers or events, or in conjunction with various QoS policies.
Moab's ability to checkpoint is dependent upon both the cluster's resource
manager and operating system. In most cases, two types of checkpoint are
enabled, including (1) checkpoint and continue and (2) checkpoint and
terminate. While either checkpointing method can be activated using the
mjobctl command, only the checkpoint and terminate type is used by internal
scheduling and event management facilities.
Checkpointing behavior can be configured on a per-resource manager basis
using various attributes of the RMCFG parameter.
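
For example, the checkpoint signal and timeout might be set per resource manager as follows (values illustrative; see the CHECKPOINTSIG and CHECKPOINTTIMEOUT references below):

RMCFG[base] CHECKPOINTSIG=SIGUSR1 CHECKPOINTTIMEOUT=5:00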
Related Topics
Job Preemption Overview
PREEMPTPOLICY Parameter
Resource Manager CHECKPOINTSIG Attribute
Resource Manager CHECKPOINTTIMEOUT Attribute
Job Dependencies
* Basic Job Dependency Support
  o Job Dependency Syntax

Basic Job Dependency Support
By default, basic single step job dependencies are supported through
completed/failed step evaluation. Basic dependency support does not require
special configuration and is activated by default. Dependent jobs are only
supported through a resource manager and therefore submission methods
depend upon the specific resource manager being used.
Use the -l depend=<STRING> flag for the Torque qsub command and the
Moab msub command.
Torque qsub also supports the -W x=depend=<STRING> or -W
depend=<STRING> flag. Moab msub command also supports the -W
x=depend=<STRING> flag.
For other resource managers, consult the resource manager specific
documentation.
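
For example, to submit a job that may start only after job 1001 completes successfully (job ID illustrative; the afterok dependency is described in the table below):

> msub -l depend=afterok:1001 job.cmd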
Job Dependency Syntax

after
    Format: after:<job>[:<job>]...
    Job may start at any time after specified jobs have started execution.

afterany
    Format: afterany:<job>[:<job>]...
    Job may start at any time after all specified jobs have completed regardless of completion status.

afterok
    Format: afterok:<job>[:<job>]...
    Job may start at any time after all specified jobs have successfully completed.

afternotok
    Format: afternotok:<job>[:<job>]...
    Job may start at any time after all specified jobs have completed unsuccessfully.

before
    Format: before:<job>[:<job>]...
    Job may start at any time before specified jobs have started execution.

beforeany
    Format: beforeany:<job>[:<job>]...
    Job may start at any time before all specified jobs have completed regardless of completion status.

beforeok
    Format: beforeok:<job>[:<job>]...
    Job may start at any time before all specified jobs have successfully completed.

beforenotok
    Format: beforenotok:<job>[:<job>]...
    Job may start at any time before any specified jobs have completed unsuccessfully.

on
    Format: on:<count>
    Job may start after <count> dependencies on other jobs have been satisfied.

synccount
    Format: synccount:<count>
    Job is the first in a set of jobs to be executed at the same time. <count> is the number of additional jobs in the set, which can be up to 5. synccount is valid for single-request jobs with Torque as the resource manager.

syncwith
    Format: syncwith:<job>
    Job is an additional member of a set of jobs to be executed at the same time. Moab supports up to 5 jobs. syncwith is valid for single-request jobs with Torque as the resource manager.

<job>={JOBNAME.jobname|jobid}

When using JobName dependencies, prepend "JOBNAME." to avoid ambiguity.
The before*, synccount, and syncwith dependencies do not work with
jobs submitted with msub; they work only with qsub.

Any of the dependencies containing before must be used in conjunction with
the on dependency. So, if job A must run before job B, job B must be
submitted with depend=on:1, as well as job A having depend=before:B. This
means job B cannot run until one dependency of another job on job B has been
fulfilled. This prevents job B from running until job A can be successfully
submitted.
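
A sketch of this pattern using qsub (before dependencies do not work with msub, per the note above; job IDs illustrative). Job B is submitted first so that its ID can be referenced by job A:

> qsub -l depend=on:1 jobB.cmd
101.server
> qsub -l depend=before:101.server jobA.cmd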
When you submit a dependency job and the dependency is not met, the job will
remain idle in the queue indefinitely. To configure Moab to automatically cancel
these failed dependency jobs, set the CANCELFAILEDDEPENDENCYJOBS
scheduler flag. Moab also lets you cancel all jobs that a specified <job_id>
depends on using mjobctl -c flags=follow-dependency <job_id>.
Related Topics
Job Deadlines
Job Defaults and Per Job Limits
Job Defaults
Job defaults can be specified on a per queue basis. These defaults are specified
using the CLASSCFG parameter. The following table shows the applicable
attributes:
DEFAULT.FEATURES
    Format: comma-delimited list of node features
    Example: CLASSCFG[batch] DEFAULT.FEATURES=fast,io
    Jobs submitted to class batch will request node features fast and io.

DEFAULT.WCLIMIT
    Format: [[[DD:]HH:]MM:]SS
    Example: CLASSCFG[batch] DEFAULT.WCLIMIT=1:00:00
    Jobs submitted to class batch will request one hour of walltime by default.
Per Job Maximum Limits
Job maximum limits can be specified on a per queue basis. These defaults are
specified using the CLASSCFG parameter. The following table shows the
applicable attributes:
MAX.WCLIMIT
    Format: [[[DD:]HH:]MM:]SS
    Example: CLASSCFG[batch] MAX.WCLIMIT=1:00:00
    Jobs submitted to class batch can request no more than one hour of walltime.
Per Job Minimum Limits
Furthermore, minimum job defaults can be specified with the CLASSCFG
parameter. The following table shows the applicable attributes:
MIN.PROC
    Format: <integer>
    Example: CLASSCFG[batch] MIN.PROC=10
    Jobs submitted to class batch can request no less than ten processors.
Related Topics
Usage-based Limits
General Job Policies
* Multi-Node Support
* Multi-Req Support
* Malleable Job Support
* Enabling Job User Proxy
There are a number of configurable policies that help control advanced job
functions. These policies help determine allowable job sizes and structures.
Multi-Node Support
You can configure the ability to allocate resources from multiple nodes to a job
with the MAX.NODE limit.
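
For example, to cap jobs in class batch at 64 nodes (value illustrative):

CLASSCFG[batch] MAX.NODE=64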
Multi-Req Support
Jobs can specify multiple types of resources for allocation. For example, a job
could request 4 nodes with 256 MB of memory and 8 nodes with feature fast
present.
Resources specified in a multi-req job are delimited with a plus sign (+).
Neither SPANEVENLY nor DELAY values of the NODESETPLUS parameter
will work with multi-req jobs or preemption.
Example 9-1:
-l nodes=4:ppn=1+10:ppn=5+2:ppn=2
This example requests 4 nodes with 1 proc each, 10 nodes with 5 procs each, and 2 nodes with 2 procs
each. The total number of processors requested is (4*1) + (10*5) + (2*2), or 58 processors.
Example 9-2:
-l nodes=15+1:ppn=4
The job submitted in this example requests a total of 16 nodes. 15 of these nodes have no specific
requirements, but the remaining node must have 4 processors.
Example 9-3:
-l nodes=3:fast+1:io
The job requests a total of 4 nodes: 3 nodes with the fast feature and 1 node with the io feature.
Malleable Job Support
A job can specify whether it is able to use more or fewer processors
and what effect, if any, that has on its wallclock time. For example, a job may
run for 10 minutes on 1 processor, 5 minutes on 2 processors, and 3 minutes on
3 processors. When a job is submitted with a task request list attached, Moab
determines which task request fits best and molds the job based on its
specifications. To submit a job with a task request list and allow Moab to mold it
based on the current scheduler environment, use the TRL (Format 1) or the
TRL (Format 2) flag in the Resource Manager Extension.
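
A hypothetical submission matching the example above, assuming Format 1's colon-delimited taskcount@walltime pairs with walltimes in seconds (consult the Resource Manager Extension reference for the exact syntax):

> msub -l trl=1@600:2@300:3@180 job.cmd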
Enabling Job User Proxy
By default, user proxying is disabled. To be enabled, it must be authorized
using the PROXYLIST attribute of the USERCFG parameter. This parameter can
be specified either as a comma-delimited list of users or as the keyword
validate. If the keyword validate is specified, the RMCFG attribute
JOBVALIDATEURL should be set and used to confirm that the job's owner can
proxy to the job's execution user. An example script performing this check for
ssh-based systems is provided in the tools directory (See Job Validate Tool
Overview.).
For some resource managers (RM), proxying must also be enabled at the RM
level. The following example shows how ssh-based proxying can be
accomplished in a Moab+Torque with SSH environment.
To validate proxy users, Moab must be running as root.
Example 9-4: SSH Proxy Settings
USERCFG[DEFAULT] PROXYLIST=validate
RMCFG[base] TYPE=<resource manager>
JOBVALIDATEURL=exec://$HOME/tools/job.validate.sshproxy.pl
> qmgr -c 's s allow_proxy_user=true'
> su - testuser
> qsub -I -u testuser2
qsub: waiting for job 533.igt.org to start
qsub: job 533.igt.org ready
testuser2@igt:~$
In this example, the validate tool, 'job.validate.sshproxy.pl', can verify proxying is allowed by
becoming the submit user and determining if the submit user can achieve passwordless access to the
specified execution user. However, site-specific tools can use any method to determine proxy access
including a flat file look-up, database lookup, querying of an information service such as NIS or LDAP, or
other local or remote tests. For example, if proxy validation is required but end-user accounts are not
available on the management node running Moab, the job validate service could perform the validation
test on a representative remote host such as a login host.
This feature supports qsub only.
The job validate tool is highly flexible, allowing any combination of job attributes
to be evaluated and tested using either local or remote validation tests. The
validate tool allows not only pass/fail responses but also allows the job to be
modified or rejected in a custom manner depending on the site or the nature
of the failure.
Related Topics
Usage Limits
Using a Local Queue
Moab allows jobs to be submitted directly to the scheduler. With a local queue,
Moab is able to directly manage the job or translate it for resubmission to a
standard resource manager queue. There are multiple advantages to using a
local queue:
* Jobs may be translated from one resource manager job submission language to another (such as submitting a PBS job and running it on an LSF cluster).
* Jobs may be migrated from one local resource manager to another.
* Jobs may be migrated to remote systems using Moab peer-to-peer functionality.
* Jobs may be dynamically modified and optimized by Moab to improve response time and system utilization.
* Jobs may be dynamically modified to account for system hardware failures or other issues.
* Jobs may be dynamically modified to conform to site policies and constraints.
* Grid jobs are supported.
Local Queue Configuration
A local queue is configured just like a standard resource manager queue. It
may have defaults, limits, resource mapping, and credential access
constraints. The following table describes the most common settings:
Default queue

    Format:      RMCFG[internal] DEFAULTCLASS=<CLASSID>
    Description: The job class/queue assigned to the job if one is not explicitly requested by the submitter.
                 All jobs submitted directly to Moab are initially received by the pseudo-resource manager internal. Therefore, default queue configuration may only be applied to it.
    Example:     RMCFG[internal] DEFAULTCLASS=batch
Class default resource requirements

    Format:      CLASSCFG[<CLASSID>] DEFAULT.FEATURES=<X>
                 CLASSCFG[<CLASSID>] DEFAULT.MEM=<X>
                 CLASSCFG[<CLASSID>] DEFAULT.NODE=<X>
                 CLASSCFG[<CLASSID>] DEFAULT.NODESET=<X>
                 CLASSCFG[<CLASSID>] DEFAULT.PROC=<X>
                 CLASSCFG[<CLASSID>] DEFAULT.WCLIMIT=<X>
    Description: The settings assigned to the job if not explicitly set by the submitter. Default values are available for node features, per-task memory, node count, nodeset configuration, processor count, and wallclock limit.
    Example:     CLASSCFG[batch] DEFAULT.WCLIMIT=4 DEFAULT.FEATURES=matlab
                 or
                 CLASSCFG[batch] DEFAULT.WCLIMIT=4
                 CLASSCFG[batch] DEFAULT.FEATURES=matlab
Class maximum resource limits

    Format:      CLASSCFG[<CLASSID>] MAX.FEATURES=<X>
                 CLASSCFG[<CLASSID>] MAX.NODE=<X>
                 CLASSCFG[<CLASSID>] MAX.PROC=<X>
                 CLASSCFG[<CLASSID>] MAX.WCLIMIT=<X>
    Description: The maximum node features, node count, processor count, and wallclock limit allowed for a job submitted to the class/queue. If these limits are not satisfied, the job is not accepted and the submit request fails. MAX.FEATURES indicates that only the listed features may be requested by a job.
    Example:     CLASSCFG[smalljob] MAX.PROC=4 MAX.FEATURES=slow,matlab
                 or
                 CLASSCFG[smalljob] MAX.PROC=4
                 CLASSCFG[smalljob] MAX.FEATURES=slow,matlab
Class minimum resource limits

    Format:      CLASSCFG[<CLASSID>] MIN.FEATURES=<X>
                 CLASSCFG[<CLASSID>] MIN.NODE=<X>
                 CLASSCFG[<CLASSID>] MIN.PROC=<X>
                 CLASSCFG[<CLASSID>] MIN.WCLIMIT=<X>
    Description: The minimum node features, node count, processor count, and wallclock limit allowed for a job submitted to the class/queue. If these limits are not satisfied, the job is not accepted and the submit request fails. MIN.FEATURES indicates that only the listed features may be requested by a job.
    Example:     CLASSCFG[bigjob] MIN.PROC=4 MIN.WCLIMIT=1:00:00
                 or
                 CLASSCFG[bigjob] MIN.PROC=4
                 CLASSCFG[bigjob] MIN.WCLIMIT=1:00:00
Class access

    Format:      CLASSCFG[<CLASSID>] REQUIREDUSERLIST=<USERID>[,<USERID>]...
    Description: The list of users who may submit jobs to the queue.
    Example:     CLASSCFG[math] REQUIREDUSERLIST=john,steve
Available resources

    Format:      CLASSCFG[<CLASSID>] HOSTLIST=<HOSTID>[,<HOSTID>]...
    Description: The list of nodes that jobs in the queue may use.
    Example:     CLASSCFG[special] HOSTLIST=node001,node003,node13
Class mapping between multiple sites is described in the section on Moab grid
facilities.
If a job is submitted directly to the resource manager used by the local queue,
the class default resource requirements are not applied. Also, if the job violates
a local queue limitation, the job is accepted by the resource manager, but
placed in the Blocked state.
Job Deadlines
* Deadline Overview
* Setting Job Deadlines via QoS
  o Setting Job Deadlines at Job Submission
  o Submitting a Job to a QoS with a Preconfigured Deadline
* Job Termination Date
* Conflict Policies
Deadline Overview
Job deadlines may be specified on a per job and per credential basis and are
also supported using both absolute and QoS based specifications. A job
requesting a deadline is first evaluated to determine if the deadline is
acceptable. If so, Moab adds it to the list of deadline jobs and allocates
resources to guarantee that all accepted deadline jobs are able to complete on
or before their requested deadline. Once the scheduler confirms that all
deadlines can be satisfied, it then optimizes resource allocation (in priority
order) attempting to execute all jobs at the earliest possible time.
Setting Job Deadlines via QoS
Two types of job deadlines exist in Moab. The priority-based deadline linearly
increases a job's priority as its deadline approaches (See Deadline (DEADLINE)
Subcomponent for more information). The QoS method allows you to set a job
completion time on job submission if, and only if, it requests and is allowed to
access a QoS with the DEADLINE QFLAG set. This method is more powerful
than the priority method, because Moab will attempt to make a reservation for
the job as soon as the job enters the queue in order to meet the deadline,
essentially bumping it to the front of the queue.
When a job is submitted to a QoS with the DEADLINE flag set, the job's -l
deadline attribute is honored. If such QoS access is not available, or if
resources do not exist at job submission time to allow the deadline to be
satisfied, the job's deadline request is ignored.
Two methods exist for setting deadlines with a QoS:

* Submitting a job to a deadline-enabled QoS and specifying a deadline using msub -l.
* Submitting a job to a deadline-enabled QoS with a QTTARGET specified.
Setting Job Deadlines at Job Submission
This method of setting a job deadline allows you to specify a job deadline as
you submit the job. You can set the deadline as either an exact date and time
or as an amount of time after job submission (i.e. three hours after
submission).
To specify a deadline on job submission
1. In moab.cfg, create a QoS with the DEADLINE flag enabled.
...
QOSCFG[special] QFLAGS=DEADLINE
Jobs requesting the QoS special may submit jobs with a deadline that Moab will honor.
2. Submit a job to the QoS and set a deadline. This can be either absolute or
relative.
a. For an absolute deadline, use the format hh:mm:ss_mm/dd/yy. The
following command sets a deadline for a job to finish by 8 a.m. on
March 15th, 2013.

msub -l qos=special,deadline=08:00:00_03/15/13 job.sh
The job must finish running by 8 A.M. on March 15, 2013.
b. For a relative deadline, or the completion deadline of the job relative to
its submission time, use the time format [[[DD:]HH:]MM:]SS.
msub -l qos=special,deadline=5:00:00 job.sh
The job's deadline is 5 hours after its submission.
Submitting a Job to a QoS with a Preconfigured Deadline
You may also set a relative job deadline by limiting the job's queue time. This
method allows you to pre-configure the deadline rather than giving the power
to specify a deadline to the user submitting the job. For jobs requesting these
QoSes, Moab identifies and sets job deadlines to satisfy the corresponding
response time targets.
To submit a job to a QoS with a preconfigured deadline
1. In moab.cfg, create a QoS with both the DEADLINE QFLAG and a response
time target (QTTARGET). The QTTARGET is the maximum amount of time that
Moab should allow the job to be idle in the queue.
...
QOSCFG[special2] QFLAGS=DEADLINE QTTARGET=1:00:00
Given this configuration, a job requesting QoS special2 must spend a maximum of one hour in the
queue.
2. Submit a job requesting the special2 quality of service.
msub -l qos=special2,walltime=2:00:00 job.sh
This two-hour job has a completion time deadline set to three hours after its submission (one hour of
target queue time and two hours of run time).
Job Termination Date
In addition to job completion targets, jobs may also be submitted with a
TERMTIME attribute. The scheduler attempts to complete the job prior to the
termination date, but if it is unsuccessful, it will terminate (cancel) the job once
the termination date is reached.
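
Assuming the termtime resource manager extension (the syntax shown is an assumption; consult the Resource Manager Extensions reference), such a submission might look like this:

> msub -l termtime=08:00:00_03/15/13 job.sh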
Conflict Policies

When a job's deadline cannot be satisfied, the action Moab takes is configured
using the DEADLINEPOLICY parameter. Moab does not have a default policy for
this parameter.
CANCEL
    The job is canceled and the user is notified that the deadline could not be satisfied.

HOLD
    The job has a batch hold placed on it indefinitely. The administrator can then decide what action to take.

RETRY
    The job continually retries each iteration to meet its deadline; note that when used with QTTARGET, the job's deadline continues to slide with relative time.

IGNORE
    The job has its request ignored and is scheduled as normal.
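
For example:

DEADLINEPOLICY RETRY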
Deadline scheduling may not function properly with per partition
scheduling enabled. Check that PARALLOCATIONPOLICY is disabled to ensure
DEADLINEPOLICY will work correctly.
Related Topics
QoS Facilities
Job Submission Eligible Start Time constraints
Job Arrays
* Job Array Overview
* Enabling Job Arrays
* Sub-job Definitions
* Using Environment Variables to Specify Array Index Values
  o Control
  o Reporting
* Job Array Cancellation Policies
* Examples
  o Submitting Job Arrays
Job Array Overview
You can submit an array of jobs to Moab via the msub command. Array jobs
are an easy way to submit many sub-jobs that perform the same work using
the same script, but operate on different sets of data. Sub-jobs are the jobs
created by an array job and are identified by the array job ID and an index; for
example, if 235[1] is an identifier, the number 235 is the job array ID, and 1 is
the sub-job's index.
Sub-jobs of an array are executed in sub-job index order.
Moab job arrays are different from Torque job arrays.
Enabling Job Arrays
To enable job arrays, include the ENABLEJOBARRAYS parameter in the Moab
configuration file (moab.cfg).
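
For example, in moab.cfg:

ENABLEJOBARRAYS TRUE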
Sub-job Definitions
Like a normal job, an array job submits a job script, but it additionally has a
start index (sidx) and an end index (eidx); array jobs also have increment
(incr) values, which Moab uses to create sub-jobs, all executing the same
script. The model for sub-job creation follows the formula of end index minus
start index plus increment divided by the increment value: (eidx - sidx +
incr) / incr.
To illustrate, suppose an array job has a start index of 1, an end index of 100,
and an increment of 1. This is an array job that creates (100 - 1 + 1) / 1 = 100
sub-jobs with indexes of 1, 2, 3, ..., 100. An increment of 2 produces (100 - 1 +
2) / 2 = 50 sub-jobs with indexes of 1, 3, 5, ..., 99. An increment of 2 with a
start index of 2 produces (100 - 2 + 2) / 2 = 50 sub-jobs with indexes of 2, 4, 6,
..., 100. Again, sub-jobs are jobs in their own right that have a slightly different
job naming convention, jobID[subJobIndex] (e.g. mycluster.45[37] or
45[37]).
Using Environment Variables to Specify Array Index Values
The script can use an environment variable to obtain the array index value to
form data file and/or directory names unique to an array job's particular
sub-job. The following two environment variables are supplied so job scripts can
recognize what index in the array they are in; use the msub command with the
-V option to pass the environment parameters to the resource manager, or
include the parameters in a job script; for example: #PBS -V MOAB_JOBARRAYRANGE.
MOAB_JOBARRAYINDEX
    Used to create dataset file names, directory names, and so forth, when splitting up a single problem into multiple jobs.
    For example, a user may split up a problem into 20 separate jobs, each with its own input and output data files whose names contain the numbers 1-20.
    To illustrate, assume a user submits the 20 sub-jobs using two msub commands; one to submit the ten odd-numbered jobs and one to submit the ten even-numbered jobs.

    msub -t job1.[1-20:2]
    msub -t job2.[2-20:2]

    The MOAB_JOBARRAYINDEX environment variable value would populate each of the two job arrays' ten sub-jobs as 1, 3, 5, 7, 9, 11, 13, 15, 17, and 19 for the first array job's ten sub-jobs, and 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 for the second array job's ten sub-jobs.

MOAB_JOBARRAYRANGE
    The count of jobs in the array.
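
A sketch of a job script that uses the index (file names and program are hypothetical):

#!/bin/sh
# Each sub-job processes the input file that matches its array index.
INPUT=data.${MOAB_JOBARRAYINDEX}.in
OUTPUT=data.${MOAB_JOBARRAYINDEX}.out
./solver < $INPUT > $OUTPUT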
Control
Users can control individual sub-jobs in the same manner as normal jobs. In
addition, an array job represents its group of sub-jobs and any user or
administrator commands performed on an array job apply to its sub-jobs; for
example, the command canceljob <arrayJobId> cancels all sub-jobs that
belong to the array job. For more information about job control, see the
documentation for the mjobctl command.
Reporting
In the first example below, note the fields unique to array sub-jobs, such as Parent Array ID, Array Index, and Array Range.
$ checkjob -v Moab.1[1]
job Moab.1[1]

AName: Moab
State: Running
Creds:  user:user1  group:usergroup1
WallTime:   00:00:17 of 8:20:00
SubmitTime: Thu Nov  4 11:50:03
  (Time Queued  Total: 00:00:00  Eligible: INFINITY)

StartTime: Thu Nov  4 11:50:03
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: base
Average Utilized Procs: 0.96
NodeCount:  1

Allocated Nodes:
[node010:1]

Job Group:       Moab.1
Parent Array ID: Moab.1
Array Index:     1
Array Range:     10
SystemID:        Moab
SystemJID:       Moab.1[1]

Task Distribution: node010
IWD:            /home/user1
UMask:          0000
Executable:     /opt/moab/spool/moab.job.3CvNjl
StartCount:     1
Partition List: base
SrcRM:          internal  DstRM: base  DstRMJID: Moab.1[1]
Flags:          ARRAYJOB,GLOBALQUEUE
StartPriority:  1
PE:             1.00
Reservation 'Moab.1[1]' (-00:00:19 -> 8:19:41  Duration: 8:20:00)
If the array range is not provided, the output displays all the jobs in the array.
$ checkjob -v Moab.1
job Moab.1

AName: Moab
Job Array Info:
  Name: Moab.1
  1 : Moab.1[1] : Running
  2 : Moab.1[2] : Running
  3 : Moab.1[3] : Running
  4 : Moab.1[4] : Running
  5 : Moab.1[5] : Running
  6 : Moab.1[6] : Running
  7 : Moab.1[7] : Running
  8 : Moab.1[8] : Running
  9 : Moab.1[9] : Running
  10 : Moab.1[10] : Running
  11 : Moab.1[11] : Running
  12 : Moab.1[12] : Running
  13 : Moab.1[13] : Running
  14 : Moab.1[14] : Running
  15 : Moab.1[15] : Running
  16 : Moab.1[16] : Running
  17 : Moab.1[17] : Running
  18 : Moab.1[18] : Running
  19 : Moab.1[19] : Running
  20 : Moab.1[20] : Running
Totals:
  Active:   20
  Idle:      0
  Complete:  0
You can also use showq. This displays the array master job with a count of how
many sub-jobs are in each queue.
$ showq

active jobs------------------------
JOBID       USERNAME    STATE    PROCS  REMAINING            STARTTIME

Moab.1(5)   aesplin     Running      5   00:52:41  Thu Jun 23 17:05:56
Moab.2(1)   aesplin     Running      1   00:53:41  Thu Jun 23 17:06:56

6 active jobs       6 of 6 processors in use by local jobs (100.00%)
                    1 of 1 nodes active (100.00%)

eligible jobs----------------------
JOBID       USERNAME    STATE    PROCS  WCLIMIT              QUEUETIME

Moab.2(4)   aesplin     Idle         4   1:00:00  Thu Jun 23 17:06:56

4 eligible jobs

blocked jobs-----------------------
JOBID       USERNAME    STATE    PROCS  WCLIMIT              QUEUETIME

Moab.2(1)   aesplin     Blocked      1   1:00:00  Thu Jun 23 17:06:56

1 blocked job

Total jobs: 11
Moab.1 has five sub-jobs running. Moab.2 has one sub-job running, four waiting to run, and one that is
currently blocked.
Job Array Cancellation Policies
Job arrays can be canceled based on the success or failure of the first sub-job,
the first success or failure of any sub-job, or if any sub-job exits with a specified
exit code. The job array cancellation policies are:
CancelOnFirstFailure
    Cancels the job array if the first sub-job (JOBARRAYINDEX = 1) fails. Mutually exclusive with CancelOnFirstSuccess.

    > msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstFailure

CancelOnFirstSuccess
    Cancels the job array if the first sub-job (JOBARRAYINDEX = 1) succeeds. Mutually exclusive with CancelOnFirstFailure.

    > msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstSuccess

CancelOnAnyFailure
    Cancels the job array if any sub-job fails. Mutually exclusive with the other "any sub-job" policies.

    > msub -t myarray[1-1000]%50 -l ...,flags=CancelOnAnyFailure

CancelOnAnySuccess
    Cancels the job array if any sub-job succeeds. Mutually exclusive with the other "any sub-job" policies.

    > msub -t myarray[1-1000]%50 -l ...,flags=CancelOnAnySuccess

CancelOnExitCode
    Cancels the job array if any sub-job returns the specified exit code. Mutually exclusive with the other "any sub-job" policies.

    > msub -t myarray[1-1000%50] -l ...,flags=CancelOnExitCode:<error code list>

    The syntax for the error code list is ranges specified with a dash and individual codes delimited by a plus (+) sign, such as: 1-4+9+15. Exit codes 1-387 are accepted.
Up to two cancellation policies can be specified for an array, and the two policies
must be delimited by a colon (:). The two "first sub-job" policies are mutually
exclusive, as are the three "any sub-job" policies. You can use either "first
sub-job" policy with one of the "any sub-job" policies, as shown in this example:
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstFailure:CancelOnExitCode:3-7+11
Examples
Operations can be performed on individual jobs, a selection of jobs in a job
array, or on the entire array.
Submitting Job Arrays
The syntax for submitting job arrays is: msub -t [<jobname>]<indexlist>
[%<limit>] arrayscript.sh
The <jobname> and <limit> are optional. The jobname does not override the
jobID Moab assigns to the array. When submitting an array with a jobname,
Moab returns the jobID, which is the scheduler name followed by a unique ID.
For example, if the scheduler name in moab.cfg is Moab (SCHEDCFG[Moab]),
submitting an array with a jobname responds like this:
> msub -t myarray[1-10] job.sh
Moab.6
To specify that only a certain number of sub-jobs in the array can run at a time,
use the percent sign (%) delimiter. In this example, only five sub-jobs in the
array can run at a time:
> msub -t myarray[1-1000]%5
To submit a specific set of array sub-jobs, use the comma delimiter in the array
index list:
> msub -t myarray[1,2,3,4]
> msub -t myarray[1-5,7,10]
You can use the checkjob command on either the jobID or the jobname you
specified.
> msub -t myarray[1-2] job.sh
Moab.10
$ checkjob -v myarray
job Moab.10

AName: myarray
Job Array Info:
  Name: Moab.10
  1 : Moab.10[1] : Running
  2 : Moab.10[2] : Running
Sub-jobs:      2
  Active:      2 ( 100.0% )
  Eligible:    0 (   0.0% )
  Blocked:     0 (   0.0% )
  Completed:   0 (   0.0% )

State: Idle
Creds:  user:tuser1  group:tgroup1
WallTime:   00:00:00 of 99:23:59:59
SubmitTime: Thu Jun  2 16:37:17
  (Time Queued  Total: 00:00:33  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]
TaskCount: 1  Partition: ALL
To submit a job with a step size, use a colon in the array range and specify how
many jobs to step. In the example below, a step size of 2 is requested. The
sub-jobs will be numbered according to the step size inside the index limit. The
array master job name will be the same as explained above.
$ msub -t myarray[2-10:2] job.sh

Moab.15

$ checkjob -v myarray  # or you could use 'checkjob -v Moab.15'
job Moab.15

AName: myarray
Job Array Info:
  Name: Moab.15
  2 : Moab.15[2] : Running
  4 : Moab.15[4] : Running
  6 : Moab.15[6] : Running
  8 : Moab.15[8] : Running
  10 : Moab.15[10] : Running
Sub-jobs:      5
  Active:      5 ( 100.0% )
  Eligible:    0 (   0.0% )
  Blocked:     0 (   0.0% )
  Completed:   0 (   0.0% )

State: Idle
Creds:  user:tuser1  group:tgroup1
WallTime:   00:00:00 of 99:23:59:59
SubmitTime: Thu Jun  2 16:37:17
  (Time Queued  Total: 00:00:33  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]
TaskCount: 1  Partition: ALL
Related Topics
Moab Workload Manager for Grids
Job Dependencies
Chapter 10 General Node Administration
Moab has a very flexible and generalized definition of a node. This flexible
definition, together with the fact that Moab must inter-operate with many
resource managers of varying capacities, requires that Moab must possess a
complete set of mechanisms for managing nodes that in some cases may be
redundant with resource manager facilities.
In this topic:

* Resource Manager Specified 'Opaque' Attributes
* Scheduler Specified Default Node Attributes
* Scheduler Specified Node Attributes
Resource Manager Specified 'Opaque' Attributes
Many resource managers support the concept of opaque node attributes,
allowing a site to assign arbitrary strings to a node. These strings are opaque in
the sense that the resource manager passes them along to the scheduler
without assigning any meaning to them. Nodes possessing these opaque
attributes can then be requested by various jobs. Using certain Moab
parameters, sites can assign a meaning within Moab to these opaque node
attributes and extract specific node information. For example, setting the
parameter FEATUREPROCSPEEDHEADER xps causes a node with the opaque
string xps950 to be assigned a processor speed of 950 MHz within Moab.
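
In moab.cfg, that example corresponds to the following entry:

FEATUREPROCSPEEDHEADER xps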
Scheduler Specified Default Node Attributes
Some default node attributes can be assigned on a rack or partition basis. In
addition, many node attributes can be specified globally by configuring the
DEFAULT node template using the NODECFG parameter (i.e., NODECFG
[DEFAULT] PROCSPEED=3200). Unless explicitly specified otherwise, nodes
inherit node attributes from the associated rack or partition or from the default
node template. See the Partition Overview for more information.
Scheduler Specified Node Attributes
The NODECFG parameter also allows direct per-node specification of virtually all
node attributes supported via other mechanisms and also provides a number
of additional attributes not found elsewhere. For example, a site administrator
may want to specify something like the following:
NODECFG[node031] MAXJOB=2 PROCSPEED=600 PARTITION=small
These approaches may be mixed and matched according to the site's local
needs. Precedence for the approaches generally follows the order listed
earlier in cases where conflicting node configuration information is
specified through one or more mechanisms.
Related Topics
* Node Location
* Node Attributes
* Node Specific Policies
* Managing Shared Cluster Resources (Floating Resources)
* Managing Node State
* Managing Consumable Generic Resources
* Enabling Generic Metrics
* Enabling Generic Events
Node Location
Nodes can be assigned three types of location information based on partitions,
racks, and queues.
In this topic:

* Partitions
* Racks
* Queues
  o Torque/OpenPBS Queue to Node Mapping
* Node Selection
  o Node Lists
  o Exact Lists
  o Node Range
  o Node Regular Expression
Partitions
The first form of location assignment, the partition, allows nodes to be grouped
according to physical resource constraints or policy needs. By default, jobs are
not allowed to span more than one partition, so partition boundaries are often
valuable if an underlying network topology makes certain resource allocations
undesirable. Additionally, per-partition policies can be specified to grant control
over how scheduling is handled on a partition by partition basis. See the
Partition Overview for more information.
Racks
Rack-based location information is orthogonal to the partition based
configuration and is mainly an organizational construct. In general rack based
location usage, a node is assigned both a rack and a slot number. This
approach has descended from the IBM SP2 organizational approach in which a
rack can contain any number of slots but typically contains between 1 and 99.
Using the rack and slot number combo, individual compute nodes can be
grouped and displayed in a more ordered manner in certain Moab commands
(i.e., showstate). Currently, rack information can only be specified directly by
the system via the SDR interface on SP2/Loadleveler systems. In all other
systems, this information must be specified using an information service or
specified manually using the RACK, SLOT, and SIZE attributes of the NODECFG
parameter.
Sites may arbitrarily assign nodes to racks and rack slots without
impacting scheduling behavior. Neither rack numbers nor rack slot
numbers need to be contiguous; their use is simply for convenience
purposes in displaying and analyzing compute resources.
Example 10-1:
NODECFG[node024] RACK=1 SLOT=1
NODECFG[node025] RACK=1 SLOT=2
NODECFG[node026] RACK=2 SLOT=1 PARTITION=special
...
When specifying node and rack information, slot values must be in the range of
1 to 99, and racks must be in the range of 1 to 399.
Queues
Some resource managers allow queues (or classes) to be defined and then
associated with a subset of available compute resources. With systems such as
Loadleveler or PBSPro these queue to node mappings are automatically
detected. On resource managers that do not provide this service, Moab
provides alternative mechanisms for enabling this feature.
Torque/OpenPBS Queue to Node Mapping
Under Torque, queue to node mapping can be accomplished by using the qmgr
command to set the queue acl_hosts parameter to the mapping hostlist
desired. Further, the acl_host_enable parameter should be set to False.
Setting acl_hosts and then setting acl_host_enable to True constrains
the list of hosts from which jobs may be submitted to the queue.
The following example highlights this process and maps the queue debug to the
nodes host14 through host17.
> qmgr
Max open servers: 4
Qmgr: set queue debug acl_hosts = "host14,host15,host16,host17"
Qmgr: set queue debug acl_host_enable = false
Qmgr: quit
All queues that do not have acl_hosts specified are global; that is, they
show up on every node. To constrain these queues to a subset of nodes,
each queue requires its own acl_hosts parameter setting.
Node Selection
When selecting or specifying nodes either via command line tools or via
configuration file based lists, Moab offers node expressions that can be
based on node lists, exact lists, node ranges, or regular expressions.
Node Lists
Node lists can be specified as one or more comma or whitespace delimited
node IDs. Specified node IDs can be based on either short or fully qualified
hostnames. Each element will be interpreted as a regular expression.
SRCFG[basic] HOSTLIST=cl37.icluster,ax45,ax46
...
Exact Lists
When Moab receives a list of nodes it will, by default, interpret each element as
a regular expression. To disable this and have each element interpreted as a
literal node name, prefix the list with l: as in the following example:
> setres l:n00,n01,n02
Node Range
Node lists can be specified as one or more comma or whitespace delimited
node ranges. Each node range can be expressed using either the
<STARTINDEX>-<ENDINDEX> or the <HEADER>[<STARTINDEX>-<ENDINDEX>] format. To explicitly
request a range, the node expression must be preceded with the string r: as in
the following example:
> setres r:37-472,513,516-855
When you specify a <HEADER> for the range, note that it must only contain
alphabetical characters. As always, the range must be numeric.
CLASSCFG[long] HOSTLIST=r:anc-b[37-472]
Only one expression is allowed with node ranges.
By default, Moab attempts to extract a node's node index assuming this
information is built into the node's naming convention. If needed, this
information can be explicitly specified in the Moab configuration file using
NODECFG's NODEINDEX attribute, or it can be extracted from alternately
formatted node IDs by specifying the NODEIDFORMAT parameter.
Node Regular Expression
Node lists may also be specified as one or more comma or whitespace
delimited regular expressions. Each node regular expression must be specified
in a format acceptable by the standard C regular expression libraries that allow
support for wildcard and other special characters such as the following:
* * (asterisk)
* . (period)
* [ ] (left and right bracket)
* ^ (caret)
* $ (dollar)
Node lists are by default interpreted as a regular expression but can also be
explicitly requested with the string x: as in the following examples:
# select nodes cl30 thru cl55
SRCFG[basic] HOSTLIST=x:cl[34],cl5[0-5]
...
# select nodes cl30 thru cl55
SRCFG[basic] HOSTLIST=cl[34],cl5[0-5]
...
To control node selection search ordering, set the OBJECTELIST parameter
to one of the following options: exact, range, regex, rangere, or rerange.
Node Attributes
In this topic:
* Configurable Node Attributes
* Node Features/Node Properties
Configurable Node Attributes
Nodes can possess a large number of attributes describing their configuration,
which are specified using the NODECFG parameter. The majority of these
attributes such as operating system or configured network interfaces can only
be specified by the direct resource manager interface. However, the number
and detail of node attributes varies widely from resource manager to resource
manager. Sites often have interest in making scheduling decisions based on
scheduling attributes not directly supplied by the resource manager.
Configurable node attributes are listed in the following table; click an attribute
for more detailed information:
ACCESS, ARCH, CHARGERATE, COMMENT, ENABLEPROFILING, FEATURES, FLAGS,
GRES, MAXIOIN, MAXJOB, MAXJOBPERUSER, MAXPE, MAXPEPERJOB, MAXPROC,
NETWORK, NODEAVAILABILITYPOLICY, NODEINDEX, NODETYPE, OS, OSLIST,
OVERCOMMIT, PARTITION, POWERPOLICY, PREEMPTMAXCPULOAD,
PREEMPTMINMEMAVAIL, PREEMPTPOLICY, PRIORITY, PRIORITYF, PROCSPEED,
PROVRM, RACK, RADISK, RCDISK, RCMEM, RCPROC, RCSWAP, SIZE, SLOT, SPEED,
TRIGGER, VARIABLE, VMOCTHRESHOLD
ACCESS
    Specifies the node access policy that can be one of SHARED, SHAREDONLY, SINGLEJOB, SINGLETASK, or SINGLEUSER. See Node Access Policies for more details.

    NODECFG[node013] ACCESS=singlejob

ARCH
    Specifies the node's processor architecture.

    NODECFG[node013] ARCH=opteron

CHARGERATE
    Allows a site to assign specific charging rates to the usage of particular resources. The CHARGERATE value may be specified as a floating point value and is integrated into a job's total charge (as documented in the Charging and Allocation Management section).

    NODECFG[DEFAULT] CHARGERATE=1.0
    NODECFG[node003] CHARGERATE=1.5
    NODECFG[node022] CHARGERATE=2.5

COMMENT
    Allows an organization to annotate a node via the configuration file to indicate special information regarding this node to both users and administrators. The COMMENT value may be specified as a quote delimited string as shown in the example that follows. Comment information is visible using checknode, mdiag, and Moab Cluster Manager.

    NODECFG[node013] COMMENT="Login Node"

ENABLEPROFILING
    Allows an organization to track node state over time. This information is available using showstats -n.

    NODECFG[DEFAULT] ENABLEPROFILING=TRUE

FEATURES
    Not all resource managers allow specification of opaque node features (also known as node properties). For these systems, the NODECFG parameter can be used to directly assign a list of node features to individual nodes. To append node features, use FEATURES=<X>; to overwrite or remove a node's features, you must update them in your Moab configuration file or resource manager.

    NODECFG[node013] FEATURES=gpfs,fastio

    Node node013 now has features gpfs and fastio in addition to any other features configured in this file or the resource manager.

    The total number of supported node features is limited as described in the Adjusting Default Limits section.

    If supported by the resource manager, the resource manager specific manner of requesting node features/properties within a job may be used. (Within Torque, use qsub -l nodes=<NODECOUNT>:<NODEFEATURE>.) However, if either not supported within the resource manager or if support is limited, the Moab feature resource manager extension may be used.
FLAGS
    Specifies various flags that should be set on the given node. Node flags must be set using the mschedctl -m config command. Do not set node flags in the moab.cfg file. Flags set in moab.cfg may conflict with settings controlled automatically by resource managers, Moab Web Services, or Viewpoint.

    * globalvars - The node has variables that may be used by triggers.
    * novmmigrations - Excludes this hypervisor from VM auto-migrations. This means that VMs cannot automatically migrate to or from this hypervisor while this flag is set.

      NODECFG[node1] FLAGS=NoVMMigrations

      To allow VMs to resume migrating, remove this flag using mschedctl -m config 'NODECFG[node1] FLAGS=NoVMMigrations' or use a resource manager to unset the flag.

      Because both Moab and the RM report the novmmigration flag and the RM's setting always overrides the Moab setting, you cannot remove the flag via the Moab command when the RM is reporting it.

GRES
    Many resource managers do not allow specification of consumable generic node resources. For these systems, the NODECFG parameter can be used to directly assign a list of consumable generic attributes to individual nodes or to the special pseudo-node global, which provides shared cluster (floating) consumable resources. To set/overwrite a node's generic resources, use GRES=<NAME>[:<COUNT>]. (See Managing Consumable Generic Resources.)

    NODECFG[node013] GRES=quickcalc:20
MAXIOIN
Maximum input allowed on node before it is marked busy.
MAXJOB
See Node Policies for details.
MAXJOBPERUSER
See Node Policies for details.
MAXPE
See Node Policies for details.
MAXPEPERJOB
Maximum allowed Processor Equivalent per job on this node. A job will not be
allowed to run on this node if its PE exceeds this number.
NODECFG[node024] MAXPEPERJOB=10000
...
MAXPROC
Maximum dedicated processors allowed on this node. No jobs are scheduled on this
node when this number is reached. See Node Policies for more information.
NODECFG[node024] MAXPROC=8
...
NETWORK
The ability to specify which networks are available to a given node is limited to only a
few resource managers. Using the NETWORK attribute, administrators can establish
this node to network connection directly through the scheduler. The NODECFG
parameter allows this list to be specified in a comma-delimited list.
NODECFG[node024] NETWORK=GigE
...
NODEAVAILABILITYPOLICY
Specifies how available node resources are reported.
This sets the NODEAVAILABILITYPOLICY at the local level and uses a different format from the NODEAVAILABILITYPOLICY server parameter. See NODEAVAILABILITYPOLICY on page 1054.
NODECFG[node00]
NODEAVAILABILITYPOLICY=DEDICATED:PROC,UTILIZED:MEM,COMBINED:DISK
NODEINDEX
The node's index. See Node Location for details.
NODETYPE
The NODETYPE attribute is most commonly used in conjunction with an accounting
manager such as Moab Accounting Manager. In these cases, each node is assigned a
node type and within the accounting manager, each node type is assigned a charge
rate. For example, a site administrator may want to charge users more for using large
memory nodes and may assign a node type of BIGMEM to these nodes. The
accounting manager would then charge a premium rate for jobs using BIGMEM
nodes. (See the Accounting, Charging, and Allocation Management for more
information.)
Node types are specified as simple strings. If no node type is explicitly set, the node
will possess the default node type of DEFAULT. Node type information can be
specified directly using NODECFG or through use of the FEATURENODETYPEHEADER
parameter.
NODECFG[node024] NODETYPE=BIGMEM
OS
This attribute specifies the node's operating system.
NODECFG[node013] OS=suse10
Because the Torque operating system overwrites the Moab operating system,
change the operating system with opsys instead of OS if you are using
Torque.
OSLIST
This attribute specifies the list of operating systems the node can run.
NODECFG[compute002] OSLIST=linux,windows
OVERCOMMIT
Specifies the high-water limit for over-allocation of processors or memory on a
hypervisor. This setting is used to protect hypervisors from having too many VMs
placed on them, regardless of the utilization level of those VMs. Possible attributes
include DISK, MEM, PROC, and SWAP. Usage is <attr>:<integer>.
NODECFG[node012] OVERCOMMIT=PROC:2,MEM:4
PARTITION
See Node Location for details.
POWERPOLICY
The POWERPOLICY can be set to OnDemand or STATIC. It defaults to STATIC if not
set. If set to STATIC, Moab will never automatically change the power status of a
node. If set to OnDemand, Moab will turn the machine off and on based on workload and global settings. See Green Computing for further details.
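A minimal sketch of enabling on-demand power management for a single node follows; the node name is illustrative:
NODECFG[node024] POWERPOLICY=OnDemand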
PREEMPTMAXCPULOAD
If the node CPU load exceeds the specified value, any batch jobs running on the
node are preempted using the preemption policy specified with the node's
PREEMPTPOLICY attribute. If this attribute is not specified, the global default policy
specified with PREEMPTPOLICY parameter is used. See Sharing Server Resources for
further details.
NODECFG[node024] PRIORITY=-150 COMMENT="NFS Server Node"
NODECFG[node024] PREEMPTPOLICY=CANCEL PREEMPTMAXCPULOAD=1.2
...
PREEMPTMINMEMAVAIL
If the available node memory drops below the specified value, any batch jobs
running on the node are preempted using the preemption policy specified with the
node's PREEMPTPOLICY attribute. If this attribute is not specified, the global default
policy specified with PREEMPTPOLICY parameter is used. See Sharing Server
Resources for further details.
NODECFG[node024] PRIORITY=-150 COMMENT="NFS Server Node"
NODECFG[node024] PREEMPTPOLICY=CANCEL PREEMPTMINMEMAVAIL=256
...
PREEMPTPOLICY
If any node preemption policies are triggered (such as PREEMPTMAXCPULOAD or PREEMPTMINMEMAVAIL), any batch jobs running on the node are preempted using this preemption policy if specified. If not specified, the global default preemption
policy specified with PREEMPTPOLICY parameter is used. See Sharing Server
Resources for further details.
NODECFG[node024] PRIORITY=-150 COMMENT="NFS Server Node"
NODECFG[node024] PREEMPTPOLICY=CANCEL PREEMPTMAXCPULOAD=1.2
...
PRIORITY
The PRIORITY attribute specifies the fixed node priority relative to other nodes. It is
only used if NODEALLOCATIONPOLICY is set to PRIORITY. The default node priority
is 0. A default cluster-wide node priority may be set by configuring the PRIORITY
attribute of the DEFAULT node. See Priority Node Allocation for more details.
NODEALLOCATIONPOLICY PRIORITY
NODECFG[node024] PRIORITY=120
...
PRIORITYF
The PRIORITYF attribute specifies the function to use when calculating a node's
allocation priority specific to a particular job. It is only used if
NODEALLOCATIONPOLICY is set to PRIORITY. The default node priority function
sets a node's priority exactly equal to the configured node priority. The priority
function allows a site to indicate that various environmental considerations such as
node load, reservation affinity, and ownership be taken into account as well using
the following format:
<COEFFICIENT> * <ATTRIBUTE> [ + <COEFFICIENT> * <ATTRIBUTE> ]...
<ATTRIBUTE> is an attribute from the table found in the Priority Node Allocation
section.
A default cluster-wide node priority function may be set by configuring the
PRIORITYF attribute of the DEFAULT node. See Priority Node Allocation for more
details.
NODEALLOCATIONPOLICY PRIORITY
NODECFG[node024] PRIORITYF='APROC + .01 * AMEM - 10 * JOBCOUNT'
...
PROCSPEED
Knowing a node's processor speed can help the scheduler improve intra-job
efficiencies by allocating nodes of similar speeds together. This helps reduce losses
due to poor internal job load balancing. Moab's node set scheduling policies allow a
site to control processor speed based allocation behavior.
Processor speed information is specified in MHz and can be indicated directly using
NODECFG or through use of the FEATUREPROCSPEEDHEADER parameter.
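For example, a site with 2.8 GHz nodes might record the speed directly in moab.cfg; the node name and value below are illustrative:
NODECFG[node024] PROCSPEED=2800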
PROVRM
Provisioning resource managers can be specified on a per node basis. This allows
flexibility in mixed environments. If the node does not have a provisioning resource
manager, the default provisioning resource manager will be used. The default is
always the first one listed in moab.cfg.
RMCFG[prov] TYPE=NATIVE RESOURCETYPE=PROV
RMCFG[prov] PROVDURATION=10:00
RMCFG[prov] NODEMODIFYURL=exec://$HOME/tools/os.switch.pl
...
NODECFG[node024] PROVRM=prov
RACK
The rack associated with the node's physical location. Valid values range from 1 to
400. See Node Location for details.
RADISK
Jobs can request a certain amount of disk space through the RM Extension String's
DDISK parameter. When done this way, Moab can track the amount of disk space
available for other jobs. RADISK is not a configurable value, but it is determined by
"RCDISK - <JOB USAGE>".
RCDISK
Jobs can request a certain amount of disk space (in MB) through the RM Extension
String's DDISK parameter. When done this way, Moab can track the amount of disk
space available for other jobs. The RCDISK attribute constrains the amount of disk
reported by a resource manager while the RADISK attribute specifies the amount of
disk available to jobs. If the resource manager does not report available disk, the
RADISK attribute should be used.
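As an illustration, a node offering roughly 100 GB of job-available local disk might be configured as follows (RCDISK is specified in MB; the value is hypothetical):
NODECFG[node024] RCDISK=102400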
RCMEM
Jobs can request a certain amount of real memory (RAM) in MB through the RM
Extension String's DMEM parameter. When done this way, Moab can track the
amount of memory available for other jobs. The RCMEM attribute constrains the
amount of RAM reported by a resource manager while the RAMEM attribute
specifies the amount of RAM available to jobs. If the resource manager does not report available memory, the RAMEM attribute should be used.
Please note that memory reported by the resource manager will override the
configured value unless a trailing caret (^) is used.
NODECFG[node024] RCMEM=2048
...
If the resource manager does not report any memory, then Moab will assign node024 2048 MB of memory.
NODECFG[node024] RCMEM=2048^
...
Moab will assign 2048 MB of memory to node024 regardless of what the
resource manager reports.
RCPROC
RCPROC specifies the number of processors available on a compute node.
NODECFG[node024] RCPROC=8
...
RCSWAP
Jobs can request a certain amount of swap space in MB.
RCSWAP works similarly to RCMEM. Setting RCSWAP on a node will set the
swap but can be overridden by swap reported by the resource manager. If
the trailing caret (^) is used, Moab will ignore the swap reported by the
resource manager and use the configured amount.
NODECFG[node024] RCSWAP=2048
...
If the resource manager does not report any swap, Moab will assign node024 2048 MB of swap.
NODECFG[node024] RCSWAP=2048^
...
Moab will assign 2048 MB of swap to node024 regardless of what the
resource manager reports.
SIZE
The number of slots or size units consumed by the node. This value is used in
graphically representing the cluster using showstate or Moab Cluster Manager. See
Node Location for details. For display purposes, legal size values include 1, 2, 3, 4, 6,
8, 12, and 16.
NODECFG[node024] SIZE=2
...
SLOT
The first slot in the rack associated with the node's physical location. Valid values
range from 1 to MMAX_RACKSIZE (default=64). See Node Location for details.
SPEED
Because today's processors have multiple cores and adjustable clock frequency, this
feature has no meaning and will be deprecated.
The SPEED specification must be in the range of 0.01 to 100.0.
TRIGGER
See About Object Triggers for information.
VARIABLE
Variables associated with the given node, which can be used in job scheduling. See -l PREF.
NODECFG[node024] VARIABLE=var1
...
VMOCTHRESHOLD
Specifies the high-water threshold for utilization of resources on a server (i.e.
processor and memory). This setting is used to protect hypervisors from becoming
too highly utilized and thus negatively impacting the performance of VMs running on
the hypervisor. Possible attributes include PROC and MEM.
NODECFG[node024] VMOCTHRESHOLD=PROC=2,MEM=2
Node Features/Node Properties
A node feature (or node property) is an opaque string label that is associated
with a compute node. Each compute node may have any number of node
features assigned to it, and jobs may request allocation of nodes that have
specific features assigned. Node features are labels and their association with a
compute node is not conditional, meaning they cannot be consumed or
exhausted.
Node features may be assigned by the resource manager, and this information
may be imported by Moab or node features may be specified within Moab
directly. Moab supports hyphens and underscores in node feature names.
As a convenience feature, certain node attributes can be specified via node
features using the parameters listed in the following table:
PARAMETER                 DESCRIPTION
FEATURENODETYPEHEADER     Set Node Type
FEATUREPARTITIONHEADER    Set Partition
FEATUREPROCSPEEDHEADER    Set Processor Speed
FEATURERACKHEADER         Set Rack
FEATURESLOTHEADER         Set Slot
Example 10-2:
FEATUREPARTITIONHEADER par
FEATUREPROCSPEEDHEADER cpu
Related Topics
Job Preferences
Specifying Node Features (Node Properties) in Torque
Configuring Node Features in Moab with NODECFG
Specifying Job Feature Requirements
Viewing Feature Availability Breakdown with mdiag -t
Differences between Node Features and Managing Consumable Generic Resources
Node Specific Policies
Node policies within Moab allow specification of not only how the node's load
should be managed, but who can use the node, and how the node and jobs
should respond to various events. These policies allow a site administrator to
specify on a node by node basis what the node will and will not support. Node
policies may be applied to specific nodes or applied system-wide using the
specification NODECFG[DEFAULT] ....
In this topic:
* Node Usage/Throttling Policies
  - MAXJOB
  - MAXJOBPERUSER
  - MAXJOBPERGROUP
  - MAXLOAD
  - MAXPE
  - MAXPROC
  - MAXPROCPERUSER
  - MAXPROCPERGROUP
* Node Access Policies
Node Usage/Throttling Policies
MAXJOB
This policy constrains the number of total independent jobs a given node may
run simultaneously. It can only be specified via the NODECFG parameter.
On Cray XT systems, use the NID (node id) instead of the node name. For
more information, see Configuring the moab.cfg file.
MAXJOBPERUSER
Constrains the number of total independent jobs a given node may run
simultaneously associated with any single user. It can only be specified via the
NODECFG parameter.
MAXJOBPERGROUP
Constrains the number of total independent jobs a given node may run
simultaneously associated with any single group. It can only be specified via the
NODECFG parameter.
MAXLOAD
MAXLOAD constrains the CPU load the node will support as opposed to the
number of jobs. This maximum load policy can also be applied system wide
using the parameter NODEMAXLOAD.
MAXPE
This policy constrains the number of total dedicated processor-equivalents a
given node may support simultaneously. It can only be specified via the
NODECFG parameter.
MAXPROC
This policy constrains the number of total dedicated processors a given node
may support simultaneously. It can only be specified via the NODECFG
parameter.
MAXPROCPERUSER
This policy constrains the number of total processors a given node may have
dedicated to any single user. It can only be specified via the NODECFG
parameter.
MAXPROCPERGROUP
This policy constrains the number of total processors a given node may have
dedicated to any single group. It can only be specified via the NODECFG
parameter.
Node throttling policies are used strictly as constraints. If a node is defined
as having a single processor or the NODEACCESSPOLICY is set to
SINGLETASK, and a MAXPROC policy of 4 is specified, Moab will not run
more than one task per node. A node's configured processors must be
specified so that multiple jobs may run and then the MAXJOB policy will be
effective. The number of configured processors per node is specified on a
resource manager specific basis. PBS, for example, allows this to be
adjusted by setting the number of virtual processors with the np
parameter for each node in the PBS nodes file.
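As a sketch of the PBS side of this configuration, a four-processor node entry in the Torque server_priv/nodes file might look like the following (node name illustrative):
node024 np=4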
Example 10-3:
NODECFG[node024] MAXJOB=4 MAXJOBPERUSER=2
NODECFG[node025] MAXJOB=2
NODECFG[node026] MAXJOBPERUSER=1
NODECFG[DEFAULT] MAXLOAD=2.5
...
Node Access Policies
While most sites require only a single cluster wide node access policy
(commonly set using NODEACCESSPOLICY), it is possible to specify this policy
on a node by node basis using the ACCESS attributes of the NODECFG
parameter. This attribute may be set to any of the valid node access policy
values listed in the Node Access Policies section.
Example 10-4:
To set a global policy of SINGLETASK on all nodes except nodes 13 and 14, use
the following:
# by default, enforce dedicated node access on all nodes
NODEACCESSPOLICY SINGLETASK
# allow nodes 13 and 14 to be shared
NODECFG[node13] ACCESS=SHARED
NODECFG[node14] ACCESS=SHARED
Related Topics
mnodectl
Managing Shared Cluster Resources (Floating Resources)
This section describes how to configure, request, and reserve cluster file
system space and bandwidth, software licenses, and generic cluster resources.
In this topic:
* Shared Cluster Resource Overview
* Configuring Generic Consumable Floating Resources
  - Requesting Consumable Floating Resources
* Configuring Cluster File Systems
* Configuring Cluster Licenses
* Configuring Generic Resources as Features
  - Managing Feature GRES via Moab Commands
  - Managing Feature GRES via the Resource Manager
Shared Cluster Resource Overview
Shared cluster resources such as file systems, networks, and licenses can be
managed through creating a pseudo-node. You can configure a pseudo-node
via the NODECFG parameter much as a normal node would be but additional
information is required to allow the scheduler to contact and synchronize state
with the resource.
In the following example, a license manager is added as a cluster resource by
defining the GLOBAL pseudo-node and specifying how the scheduler should
query and modify its state.
NODECFG[GLOBAL] RMLIST=NATIVE
NODECFG[GLOBAL] QUERYCMD=/usr/local/bin/flquery.sh
NODECFG[GLOBAL] MODIFYCMD=/usr/local/bin/flmodify.sh
In some cases, pseudo-node resources may be very comparable to node-locked generic resources; however, there are a few fundamental differences that determine when one method of describing resources should be used over the other. The following table contrasts the two resource types.
Node-Locked
    Pseudo-Node: No - Resources can be encapsulated as an independent node.
    Generic Resource: Yes - Must be associated with an existing compute node.

Requires exclusive batch system control over resource
    Pseudo-Node: No - Resources (such as file systems and licenses) may be consumed both inside and outside of batch system workload.
    Generic Resource: Yes - Resources must only be consumed by batch workload. Use outside of batch control results in loss of resource synchronization.

Allows scheduler level allocation of resources
    Pseudo-Node: Yes - If required, the scheduler can take external administrative action to allocate the resource to the job.
    Generic Resource: No - The scheduler can only maintain logical allocation information and cannot take any external action to allocate resources to the job.
Configuring Generic Consumable Floating Resources
Consumable floating resources are configured in the same way as node-locked
generic resources with the exception of using the GLOBAL node instead of a
particular node.
Managing Shared Cluster Resources (Floating Resources)
567
Chapter 10 General Node Administration
NODECFG[GLOBAL] GRES=tape:4,matlab:2
...
In this setup, four resources of type tape and two of type matlab are floating and available across all nodes.
Requesting Consumable Floating Resources
Floating resources are requested on a per task basis using native resource
manager job submission methods or using the GRES resource manager
extensions.
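For example, given the floating matlab resources configured above, a job might request one matlab instance using the GRES extension; this is a sketch and the script name is illustrative:
> msub -l nodes=1,walltime=1:00:00,gres=matlab job.cmd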
Configuring Cluster File Systems
Moab allows both the file space and bandwidth attributes of a cluster file system to be tracked, reserved, and scheduled. With this capability, a job or
reservation may request a particular quantity of file space and a required
amount of I/O bandwidth to this file system. While file system resources are
managed as a cluster generic resource, they are specified using the FS attribute
of the NODECFG parameter as in the following example:
NODECFG[GLOBAL] FS=PV1:10000@100,PV2:5000@100
...
In this example, PV1 defines a 10 GB file system with a maximum throughput of 100 MB/s while PV2
defines a 5 GB file system also possessing a maximum throughput of 100 MB/s.
A job may request cluster file system resources using the fs resource manager
extension. For a Torque based system, the following could be used:
> qsub -l nodes=1,walltime=1:00:00 -W x=fs:10@50
Configuring Cluster Licenses
Jobs may request and reserve software licenses using native methods or using
the GRES resource manager extension. If the cluster license manager does not
support a query interface, license availability may be specified within Moab
using the GRES attribute of the NODECFG parameter.
Example 10-5: Configure Moab to support four floating quickcalc and two floating matlab licenses.
NODECFG[GLOBAL] GRES=quickcalc:4,matlab:2
...
Example 10-6: Submit a Torque job requesting a node-locked or floating quickcalc license.
> qsub -l nodes=1,software=quickcalc,walltime=72000 testjob.cmd
Configuring Generic Resources as Features
Moab can be configured to treat generic resources as features in order to
provide more control over server access. For instance, if a node is configured
with a certain GRES and that GRES is turned off, jobs requesting the node will not
run. To turn a GRES into a feature, set the FEATUREGRES attribute of GRESCFG
to TRUE in the moab.cfg file.
GRESCFG[gres1] FEATUREGRES=TRUE
Moab now treats gres1 as a scheduler-wide feature rather than a normal generic resource.
Note that jobs are submitted normally using the same GRES syntax.
If you are running a grid, verify that FEATUREGRES=TRUE is set on all
members of the grid.
You can safely upgrade an existing cluster to use the feature while jobs
are running. If you are in a grid, upgrade all clusters at the same time.
Two methods exist for managing GRES features: via Moab commands and via
the resource manager. Using Moab commands means that feature changes
are not checkpointed; they do not remain in place when Moab restarts. Using
the resource manager causes changes to be reported by the RM, so any
changes made before a Moab restart are still present after it.
These methods are mutually exclusive. Use one or the other, but do not mix
methods.
Managing Feature GRES via Moab Commands
In the following example, gres1 and gres2 are configured in the moab.cfg file.
gres1 is not currently functioning correctly, so it is set to 0, turning the feature
off. Values above 0 and non-specified values turn the feature on.
NODECFG[GLOBAL] GRES=gres1:0
NODECFG[GLOBAL] GRES=gres2:10000
GRESCFG[gres1] FEATUREGRES=TRUE
GRESCFG[gres2] FEATUREGRES=TRUE
Moab now treats gres1 and gres2 as features.
To verify that this is set up correctly, run mdiag -S -v. It returns the following:
> mdiag -S -v
...
Scheduler FeatureGres: gres1:off,gres2:on
Once Moab has started, use mschedctl -m to modify whether the feature is
turned on or off.
mschedctl -m sched featuregres:gres1=on
INFO: FeatureGRes 'gres1' turned on
You can verify that the feature turned on or off by once again running mdiag -S -v.
If Moab restarts, it will not checkpoint the state of these changed feature generic resources. Instead, it will read the moab.cfg file to determine whether the feature GRES is on or off.
With feature GRES configured, jobs are submitted normally, requesting GRES
type gres1 and gres2. Moab ignores GRES counts and reads the feature simply
as on or off.
> msub -l nodes=1,walltime=600,gres=gres1
1012
> checkjob 1012
job 1012
AName: STDIN
State: Running
.....
StartTime: Tue Jul 3 15:33:28
Feature GRes: gres1
Total Requested Tasks: 1
If you request a feature that is currently turned off, the state is not reported as
Running, but as Idle. A message like the following returns:
BLOCK MSG: requested feature gres 'gres2' is off
Managing Feature GRES via the Resource Manager
You can automate the process of having a feature GRES turn on and off by
setting up an external tool and configuring Moab to query the tool the same
way that Moab queries a license manager. For example:
RMCFG[myRM] CLUSTERQUERYURL=file:///$HOME/tools/myRM.dat TYPE=NATIVE RESOURCETYPE=LICENSE
GRESCFG[gres1] FEATUREGRES=TRUE
GRESCFG[gres2] FEATUREGRES=TRUE
LICENSE means that the RM does not contain any compute resources and that Moab should not attempt to
use it to manage any jobs (start, cancel, submit, etc.).
The myRM.dat file should contain something like the following:
GLOBAL state=Idle cres=gres1:0,gres2:10
External tools can easily update the file based on filesystem availability.
Switching any of the feature GRES to 0 turns it off and switching it to a positive
value turns it on. If you use this external mechanism, you do not need to use
mschedctl -m to turn a feature GRES on or off. You also do not need to worry
about whether Moab has checkpointed the information or not, since the
information is provided by the RM and not by any external commands.
Related Topics
Managing Resources Directly with the Native Interface
Managing Node State
There are multiple models in which Moab can operate allowing it to either honor
the node state set by an external service or locally determine and set the node
state. This section covers the following:
* identifying meanings of particular node states
* specifying node states within locally developed services and resource managers
* adjusting node state within Moab based on load, policies, and events
In this topic:
* Node State Definitions
* Specifying Node States within Native Resource Managers
* Moab Based Node State Adjustment
* Adjusting Scheduling Behavior Based on Reported Node State
  - Down State
Node State Definitions
State     Definition
Down      Node is either not reporting status, is reporting status but failures are detected, or is reporting status but has been marked down by an administrator.
Idle      Node is reporting status, currently is not executing any workload, and is ready to accept additional workload.
Busy      Node is reporting status, currently is executing workload, and cannot accept additional workload due to load.
Running   Node is reporting status, currently is executing workload, and can accept additional workload.
Drained   Node is reporting status, currently is not executing workload, and cannot accept additional workload due to administrative action.
Draining  Node is reporting status, currently is executing workload, and cannot accept additional workload due to administrative action.
Specifying Node States within Native Resource Managers
Native resource managers can report node state implicitly and explicitly, using
NODESTATE, LOAD, and other attributes. See Managing Resources Directly with
the Native Interface for more information.
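As an illustration, a native cluster query can report node state explicitly using the STATE attribute; the node names and states below are hypothetical:
node001 STATE=Idle
node002 STATE=Down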
Moab Based Node State Adjustment
Node state can be adjusted based on reported processor, memory, or other
load factors. It can also be adjusted based on reports of one or more resource
managers in a multi-resource manager configuration. Also, both generic
events and generic metrics can be used to adjust node state.
* Torque health scripts (allow compute nodes to detect and report site specific failures), as shown in the sketch below.
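A minimal sketch of load-based state adjustment uses the NODEMAXLOAD parameter mentioned earlier; the threshold value is illustrative:
# mark a node Busy when its reported CPU load meets or exceeds 4.0
NODEMAXLOAD 4.0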
Adjusting Scheduling Behavior Based on Reported Node
State
Based on reported node state, Moab can support various policies to make
better use of available resources. For more information, see the Green
Computing Overview.
Down State
* JOBACTIONONNODEFAILURE parameter (cancel/requeue jobs if allocated nodes fail).
* Triggers (take specified action if failure is detected).
Related Topics
Managing Resources Directly with the Native Interface
License Management
Adjusting Node Availability
NODEMAXLOAD parameter
Green computing overview
Managing Consumable Generic Resources
Each time a job is allocated to a compute node, it consumes one or more types
of resources. Standard resources such as CPU, memory, disk, network adapter
bandwidth, and swap are automatically tracked and consumed by Moab.
However, in many cases, additional resources may be provided by nodes and
consumed by jobs that must be tracked. The purpose of this tracking may
include accounting, billing, or the prevention of resource over-subscription.
Generic consumable resources may be used to manage software licenses, I/O
usage, bandwidth, application connections, or any other aspect of the larger
compute environment; they may be associated with compute nodes, networks,
storage systems, or other real or virtual resources.
These additional resources can be managed within Moab by defining one or
more generic resources. The first step in defining a generic resource involves
naming the resource. Generic resource availability can then be associated with
various compute nodes and generic resource usage requirements can be
associated with jobs.
In this topic:
* Differences Between Node Features and Consumable Resources
* Configuring Node-locked Consumable Generic Resources
  - Requesting Consumable Generic Resources
  - Using Generic Resource Requests in Conjunction with other Constraints
  - Requesting Resources with No Generic Resources
  - Requesting Generic Resources Automatically within a Queue/Class
* Managing Generic Resource Race Conditions
Differences Between Node Features and Consumable
Resources
A node feature (or node property) is an opaque string label that is associated
with a compute node. Each compute node may have any number of node
features assigned to it and jobs may request allocation of nodes that have
specific features assigned. Node features are labels and their association with a
compute node is not conditional, meaning they cannot be consumed or
exhausted.
Configuring Node-locked Consumable Generic Resources
Consumable generic resources are supported within Moab using either direct
configuration or resource manager auto-detect (as when using Torque and
accelerator hardware). For direct configuration, node-locked consumable
generic resources (or generic resources) are specified using the NODECFG
parameter's GRES attribute. This attribute is specified using the format
<ATTR>:<COUNT> as in the following example:
NODECFG[titan001] GRES=tape:4
NODECFG[login32] GRES=matlab:2,prime:4
NODECFG[login33] GRES=matlab:2
...
By default, Moab supports up to 128 independent generic resource types.
Requesting Consumable Generic Resources
Generic resources can be requested on a per task or per job basis using the
GRES resource manager extension. If the generic resource is located on a
compute node, requests are by default interpreted as a per task request. If the
generic resource is located on a shared, cluster-level resource (such as a
network or storage system), then the request defaults to a per job
interpretation.
Generic resources are specified per task, not per node. When you submit a
job, each processor becomes a task. For example, a job asking for
nodes=3:ppn=4,gres=test:5 asks for 60 gres of type test ((3*4
processors)*5).
If using Torque, the GRES or software resource can be requested as in the
following examples:
Example 10-7: Per Task Requests
NODECFG[compute001] GRES=dvd:2 SPEED=2200
NODECFG[compute002] GRES=dvd:2 SPEED=2200
NODECFG[compute003] GRES=dvd:2 SPEED=2200
NODECFG[compute004] GRES=dvd:2 SPEED=2200
NODECFG[compute005] SPEED=2200
NODECFG[compute006] SPEED=2200
NODECFG[compute007] SPEED=2200
NODECFG[compute008] SPEED=2200
# submit job which will allocate only from nodes 1 through 4 requesting one dvd per task
> qsub -l nodes=2,walltime=100,gres=dvd job.cmd
In this example, Moab determines that compute nodes exist that possess the requested generic
resource. A compute node is a node object that possesses processors on which compute jobs actually
execute. License server, network, and storage resources are typically represented by non-compute
nodes. Because compute nodes exist with the requested generic resource, Moab interprets this job as
requesting two compute nodes each of which must also possess a DVD generic resource.
Example 10-8: Per Job Requests
NODECFG[network] PARTITION=shared GRES=bandwidth:2000000
# submit job which will allocate 2 nodes and 10000 units of network bandwidth
> qsub -l nodes=2,walltime=100,gres=bandwidth:10000 job.cmd
In this example, Moab determines that there exist no compute nodes that also possess the generic
resource bandwidth so this job is translated into a multiple-requirement—multi-req—job. Moab creates a
job that has a requirement for two compute nodes and a second requirement for 10000 bandwidth generic
resources. Because this is a multi-req job, Moab knows that it can locate these needed resources
separately.
Using Generic Resource Requests in Conjunction with other Constraints
Jobs can explicitly specify generic resource constraints. However, if a job also
specifies a hostlist, the hostlist constraint overrides the generic resource
constraint if the request is for per task allocation. In the Per Task Requests
example, if the job also specified a hostlist, the DVD request is ignored.
Requesting Resources with No Generic Resources
In some cases, it is valuable to allocate nodes that currently have no generic
resources available. This can be done using the special value none as in the
following example:
> qsub -l nodes=2,walltime=100,gres=none job.cmd
In this case, the job only allocates compute nodes that have no generic resources associated with them.
Requesting Generic Resources Automatically within a Queue/Class
Generic resource constraints can be assigned to a queue or class and inherited
by any jobs that do not have a gres request. This allows targeting of specific
resources, automation of co-allocation requests, and other uses. To enable
this, use the DEFAULT.GRES attribute of the CLASSCFG parameter as in the
following example:
CLASSCFG[viz] DEFAULT.GRES=graphics:2
For each node requested by a viz job, also request two graphics cards.
Managing Generic Resource Race Conditions
A software license race condition "window of opportunity" opens when Moab
checks a license server for sufficient available licenses and closes when the
user's software actually checks out the software licenses. The time between
these two events can be seconds to many minutes depending on overhead
factors such as node OS provisioning, job startup, licensed software startup,
and so forth.
During this window, another Moab-scheduled job or a user or job external to
the cluster or cloud can obtain enough software licenses that by the time the
job attempts to obtain its software licenses, there are an insufficient quantity of
available licenses. In such cases a job will sit and wait for the license, and while
it waits it occupies but does not use resources that another job could have
used. Use the STARTDELAY parameter to prevent such a situation.
GRESCFG[<license>] STARTDELAY=<window_of_opportunity>
With the STARTDELAY parameter enabled (on a per generic resource basis)
Moab blocks any idle jobs requesting the same generic resource from starting
until the <window_of_opportunity> passes. The window is defined by the
customer on a per generic resource basis.
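For instance, a site that has seen matlab license races might hold competing idle jobs back for two minutes after each start; the resource name and window are illustrative:
GRESCFG[matlab] STARTDELAY=00:02:00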
Related Topics
GRESCFG parameter
Generic Metrics
Generic Events
General Node Attributes
Floating Generic Resources
Per Class Assignment of Generic Resource Consumption
mnodectl -m command to dynamically modify node resources
Favoring Jobs Based On Generic Resource Requirements
Enabling Generic Metrics
Moab allows organizations to enable generic performance metrics. These
metrics allow decisions to be made and reports to be generated based on site
specific environmental factors. This increases Moab's awareness of what is
occurring within a given cluster environment, and allows arbitrary information
to be associated with resources and the workload within the cluster. Uses of
these metrics are widespread and can cover anything from tracking node
temperature, to memory faults, to application effectiveness. Generic metrics can be used to do the following:
* Execute triggers when specified thresholds are reached
* Modify node allocation affinity for specific jobs
* Initiate automated notifications when thresholds are reached
* Display current, average, maximum, and minimum metric values in reports and charts within Moab Cluster Manager
In this topic:
* Configuring Generic Metrics
* Example Generic Metric Usage
Configuring Generic Metrics
A new generic metric is automatically created and tracked at the server level if
it is reported by either a node or a job.
To associate a generic metric with a job or node, a native resource manager
must be set up and the GMETRIC attribute must be specified. For example, to
associate a generic metric of temp with each node in a Torque cluster, the
following could be reported by a native resource manager:
# temperature output
node001 GMETRIC[temp]=113
node002 GMETRIC[temp]=107
node003 GMETRIC[temp]=83
node004 GMETRIC[temp]=85
...
Generic metrics are tracked as floating point values allowing virtually any
number to be reported.
In the preceding example, the new metric, temp, can now be used to monitor
system usage and performance or to allow the scheduler to take action should
certain thresholds be reached. Some uses include the following:
* Executing triggers based on generic metric thresholds
* Adjusting a node's availability for accepting additional workload
* Adjusting a node's allocation priority
* Initiating administrator notification of current, minimum, maximum, or average generic metric values
* Using metrics to report resource and job performance
* Using metrics to report resource and job failures
* Using job profiles to allow Moab to learn which resources best run which applications
* Tracking effective application efficiency to identify resource brownouts even when no node failure is obvious
* Viewing current and historical cluster-wide generic metric values to identify failure, performance, and usage
* Enabling charging policies based on generic metric consumption patterns
* Viewing changes in generic metrics on nodes, jobs, and cluster-wide over time
* Submitting jobs with generic metric based node-allocation requirements

Generic metric values can be viewed using checkjob, checknode, mdiag -n, mdiag -j, or Moab Cluster Manager Charting and Reporting Features.
Historical job and node generic metric statistics can be cleared using the mjobctl and mnodectl commands.
Example Generic Metric Usage
As an example, consider a cluster with two primary purposes for generic
metrics. The first purpose is to track and adjust scheduling behavior based on
node temperature to mitigate overheating nodes. The second purpose is to
track and charge for utilization of a locally developed data staging service.
The first step in enabling a generic metric is to create probes to monitor and
report this information. Depending on the environment, this information may
be distributed or centralized. In the case of temperature monitoring, this
information is often centralized by a hardware monitoring service and available
via command line or an API. If monitoring a locally developed data staging
service, this information may need to be collected from multiple remote nodes
and aggregated to a central location. The following are popular freely available
monitoring tools:
Tool        Link
BigBrother  http://www.bb4.org
Ganglia     http://ganglia.sourceforge.net
Monit       http://www.tildeslash.com/monit
Nagios      http://www.nagios.org
Once the needed probes are in place, a native resource manager interface
must be created to report this information to Moab. Creating a native resource
manager interface should be very simple, and in most cases a script similar to
those found in the $TOOLSDIR ($PREFIX/tools) directory can be used as a
template. For this example, we will assume centralized information and will use
the RM script that follows:
#!/usr/bin/perl
# hwctl outputs information in the format '<NODEID> <TEMP>'
open(TQUERY, "/usr/sbin/hwctl -q temp |");
while (<TQUERY>)
{
  chomp;
  my ($nodeid, $temp) = split /\s+/;   # parse node id and temperature
  my $dstage = GetDSUsage($nodeid);    # site-supplied routine returning data staging usage
  print "$nodeid GMETRIC[temp]=$temp GMETRIC[dstage]=$dstage\n";
}
close(TQUERY);
With the script complete, the next step is to integrate this information into
Moab. This is accomplished with the following configuration line:
RMCFG[local] TYPE=NATIVE CLUSTERQUERYURL=file://$TOOLSDIR/node.query.local.pl
...
Moab can now be recycled and temperature and data staging usage information will be integrated into
Moab compute node reports.
If the checknode command is run, output similar to the following is reported:
> checknode cluster013
...
Generic Metrics: temp=113.2,dstage=23748
...
Moab Cluster Manager reports full current and historical generic metric information in its visual cluster
overview screen.
The next step in configuring Moab is to inform Moab to take certain actions
based on the new information it is tracking. For this example, there are two
purposes. The first purpose is to get jobs to avoid hot nodes when possible.
This is accomplished using the GMETRIC attribute of the Node Allocation Priority
function as in the following example:
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF=PRIORITY-10*GMETRIC[temp]
...
This simple priority function reduces the priority of the hottest nodes, making them less likely to be allocated. See Node Allocation Priority Factors for a complete list of available priority factors.
The example cluster is also interested in notifying administrators if the
temperature of a given node ever exceeds a critical threshold. This is
accomplished using a trigger. The following line will send email to
administrators any time the temperature of a node exceeds 120 degrees.
NODECFG[DEFAULT] TRIGGER=atype=mail,etype=threshold,threshold=gmetric[temp]>120,action='warning: node $OID temp high'
...
Related Topics
Simulation Overview
Generic Consumable Resources
Object Variables
Generic Event Counters
Enabling Generic Events
Generic events are used to identify failures and other occurrences that Moab or other systems must be made aware of. This information may result in automated resource recovery, notifications, adjustments to statistics, or changes in policy.
Generic events also have the ability to carry an arbitrary human readable
message that may be attached to associated objects or passed to
administrators or external systems. Generic events typically signify the
occurrence of a specific event as opposed to generic metrics which indicate a
change in a measured value.
Using generic events, Moab can be configured to automatically address many
failures and environmental changes improving the overall performance. Some
sample events that sites may be interested in monitoring, recording, and
taking action on include:
* Machine Room Status
  - Excessive Room Temperature
  - Power Failure or Power Fluctuation
  - Chiller Health
* Network File Server Status
  - Failed Network Connectivity
  - Server Hardware Failure
  - Full Network File System
* Compute Node Status
  - Machine Check Event (MCE)
  - Network Card (NIC) Failure
  - Excessive Motherboard/CPU Temperature
  - Hard Drive Failures
In this topic:
* Configuring Generic Events
  - Action Types
  - Named Events
  - Generic Metric (GMetric) Events
* Reporting Generic Events
  - Using Generic Events for VM Detection
* Generic Events Attributes
* Manually Creating Generic Events
Configuring Generic Events
Generic events are defined in the moab.cfg file and have several different
configuration options. The only required option is action.
The full list of configurable options for generic events is contained in the
following table:
Attribute
Description
ACTION
Comma-delimited list of actions to be processed when a new event is received.
ECOUNT
Number of events that must occur before launching the action.
The action will be launched every <ECOUNT> events if REARM is set.
REARM
Minimum time between events specified in [[[DD:]HH:]MM:]SS format.
SEVERITY
An arbitrary severity level from 1 through 4, inclusive. SEVERITY appears in the output of mdiag -n -v -v --xml.
The severity level will not be used for any other purpose.
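Combining these options, the following sketch notifies and records only after three hitemp events have been received, and re-triggers at most once every ten minutes; the event name and values are illustrative:
GEVENTCFG[hitemp] action=notify,record ecount=3 rearm=00:10:00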
Action Types
The impact of the event is controlled using the ACTION attribute of the
GEVENTCFG parameter. The ACTION attribute is comma-delimited and may
include any combination of the actions in the following table:
Value
Description
DISABLE[:<OTYPE>:<OID>]
Marks event object (or specified object) down until event report is cleared.
EXECUTE
Executes a script at the provided path. The value of EXECUTE is not contained in quotation marks. Arguments are allowed at the end of the path and are separated by question
marks (?). Trigger variables (such as $OID) are allowed.
NOTIFY
Notifies administrators of the event occurrence.
OBJECTXMLSTDIN
If the EXECUTE action type is also specified, this flag passes an XML description of the firing gevent to the script.
OFF
Powers off node or resource.
ON
Powers on node or resource.
PREEMPT[:<POLICY>]
Preempts workload associated with object (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects).
RECORD
Records events to the event log. The record action causes a line to be added to the event log
regardless of whether or not RECORDEVENTLIST includes GEVENT.
RESERVE[:<DURATION>]
Reserves node for specified duration (default: 24 hours).
RESET
Resets object (valid for nodes - causes reboot).
SIGNAL[:<SIGNO>]
Sends signal to associated jobs or services (valid for node, job, reservation, partition,
resource manager, user, group, account, class, QoS, and cluster objects).
This is an example of using objectxmlstdin with a gevent:
<gevent name="bob" statuscode="0" time="1320334763">Testing</gevent>
Named Events
In general, generic events are named, with the exception of those based on
generic metrics. Names are used primarily to differentiate between different
events and do not have any intrinsic meaning to Moab. It is suggested that the
administrator choose names that denote specific meanings within the
organization.
Example 10-9:
# Note: cpu failures require admin attention, create maintenance reservation
GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00
# Note: power failures are transient, minimize future use
GEVENTCFG[powerfail] action=notify,record rearm=00:05:00
# Note: fs full can be automatically fixed
GEVENTCFG[fsfull] action=notify,execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix
# Note: memory errors can cause invalid job results, clear node immediately
GEVENTCFG[badmem] action=notify,record,preempt,disable,reserve
Generic Metric (GMetric) Events
GMetric events are generic events based on generic metrics. They are used for
executing an action when a generic metric passes a defined threshold. Unlike
named events, GMetric events are not named and use the following format:
GEVENTCFG[GMETRIC<COMPARISON>VALUE] ACTION=...
Example 10-10:
GEVENTCFG[cputemp>150] action=off
This form of generic events uses the GMetric name, as returned by a GMETRIC
attribute in a native Resource Manager interface.
Only one generic event may be specified for any given generic metric.
Valid comparative operators are shown in the following table:

Type  Comparison                 Notes
>     greater than               Numeric values only
>=    greater than or equal to   Numeric values only
==    equal to                   Numeric values only
<     less than                  Numeric values only
<=    less than or equal to      Numeric values only
<>    not equal                  Numeric values only
Reporting Generic Events
Unlike generic metrics, generic events can be optionally configured at the
global level to adjust rearm policies, and other behaviors. In all cases, this is
accomplished using the GEVENTCFG parameter.
To report an event associated with a job or node, use the native Resource
Manager interface or the mjobctl or mnodectl commands. You can report
generic events on the scheduler with the mschedctl command.
If using the native Resource Manager interface, use the GEVENT attribute as in
the following example:
node001 GEVENT[hitemp]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs]='/var/tmp is full'
The time at which the event occurred can be passed to Moab to prevent
multiple processing of the same event. This is accomplished by specifying
the event type in the format <GEVENTID>[:<EVENTTIME>] as in what
follows:
node001 GEVENT[hitemp:1130325993]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs:1130325142]='/var/tmp is full'
Using Generic Events for VM Detection
To enable Moab to detect a virtual machine (VM) reported by a generic event,
do the following:
1. Set up your resource manager to detect virtual machine creation and to
submit a generic event to Moab.
2. Configure moab.cfg to recognize a generic event.
GEVENTCFG[NewVM] ACTION=execute:/opt/moab/AddVM.py,OBJECTXMLSTDIN
3. Report the event.
> mschedctl -c gevent -n NewVM -m "VM=newVMName"
With the ObjectXMLStdin action set, Moab sends an XML description of the generic event to the
script, so the message passes through.
The following sample Perl script submits a VMTracking job for the new VM:
#!/usr/bin/perl
# in moab.cfg: GEVENTCFG[NewVM] ACTION=execute:$TOOLSDIR/newvm_event.pl,OBJECTXMLSTDIN
# trigger gevent with: mschedctl -c gevent -n NewVM -m "VM=TestVM1"
# input to this script: <gevent name="NewVM" statuscode="0" time="1318500261">VM=TestVM1</gevent>
use strict;

my $vmidVarName = "preVMID";
my $vmTemplate  = "existingVM";
my $vmOwner     = "operator";

$ENV{MOABHOMEDIR} = '/opt/moab';

# read the gevent XML from stdin and extract the VM id
my $xml = join "", <STDIN>;
my ($vmid) = ($xml =~ m/VM=([^\<]+)\</);

if ( defined $vmid )
{
  # query the VM's attributes from Moab
  my $cmd = qq|$ENV{MOABHOMEDIR}/bin/mvmctl -q $vmid --xml|;
  my $vmxml = `$cmd`;
  my ($hv, $os, $proc, $disk, $mem);
  ($hv)   = ($vmxml =~ m/CONTAINERNODE="([^"]+)"/);
  ($os)   = ($vmxml =~ m/OS="([^"]+)"/);
  ($proc) = ($vmxml =~ m/RCPROC="([^"]+)"/);
  ($mem)  = ($vmxml =~ m/RCMEM="([^"]+)"/);
  ($disk) = ($vmxml =~ m/RCDISK="([^"]+)"/);
  die "Error parsing VM XML. Invalid VMID $vmid or $hv || $os || $proc || $mem || $disk?\n"
    if ( !defined $hv || !defined $os || !defined $proc || !defined $mem || !defined $disk );

  # submit the VMTracking job for the new VM
  $cmd = qq|$ENV{MOABHOMEDIR}/bin/msub -l hostlist=$hv,os=$os,nodes=1:ppn=$proc,mem=$mem,file=$disk,template=$vmTemplate,VAR=$vmidVarName=$vmid --proxy=$vmOwner /dev/null|;
  my $msubout = `$cmd`;
  die "Error executing msub. Output is:\n$msubout\n" if ( $? );
}
else
{
  die "Error parsing VMID from GEVENT message\n";
}
Generic Events Attributes
Each node will record the following about reported generic events:
* status - is event active
* message - human readable message associated with event
* count - number of event incidences reported since statistics were cleared
* time - time of most recent event

Each event can be individually cleared, annotated, or deleted by cluster administrators using the mnodectl command.
Generic events are only available in Moab 4.5.0 and later.
Manually Creating Generic Events
Generic events may be manually created on a physical node or VM.
To add a GEVENT with the message "hello" to node02, do the following:
> mnodectl -m gevent=event:"hello" node02
To add a GEVENT with the message "hello" to myvm, do the following:
> mvmctl -m gevent=event:"hello" myvm
Related Topics
Simulation Overview
Generic Consumable Resources
Object Variables
Generic Event Counters
Chapter 11 Resource Managers and Interfaces
* Resource Manager Overview
* Resource Manager Configuration
* Resource Manager Extensions
* Adding New Resource Manager Interfaces
* Managing Resources Directly with the Native Interface
* Utilizing Multiple Resource Managers
* License Management
* Resource Provisioning
* Resource Manager Translation
Moab provides a powerful resource management interface that enables
significant flexibility in how resources and workloads are managed. Highlights
of this interface are listed in what follows:
Support for Multiple Standard Resource Manager Interface Protocols
    Manage cluster resources and workloads via PBS, Loadleveler, SGE, LSF, or BProc based resource managers.

Support for Generic Resource Manager Interfaces
    Manage cluster resources securely via locally developed or open source projects using simple flat text interfaces or XML over HTTP.

Support for Multiple Simultaneous Resource Managers
    Integrate resource and workload streams from multiple independent sources reporting disjoint sets of resources.

Independent Workload and Resource Management
    Allow one system to manage your workload (queue manager) and another to manage your resources.

Support for Rapid Development Interfaces
    Load resource and workload information directly from a file, a URL, or from the output of a configurable script or other executable.

Resource Extension Information
    Integrate information from multiple sources to obtain a cohesive view of a compute resource. (That is, mix information from NIM, OpenPBS, FLEXlm, and a cluster performance monitor to obtain a single node image with a coordinated state and a more extensive list of node configuration and utilization attributes.)
Resource Manager Overview
For most installations, the Moab Workload Manager uses the services of a
resource manager to obtain information about the state of compute resources
(nodes) and workload (jobs). Moab also uses the resource manager to manage
jobs, passing instructions regarding when, where, and how to start or
otherwise manipulate jobs.
Moab can be configured to manage more than one resource manager
simultaneously, even resource managers of different types. Using a local
queue, jobs may even be migrated from one resource manager to another.
However, there are currently limitations regarding jobs submitted directly to a resource manager (not to the local queue). In such cases, the job is constrained to run only within the bounds of the resource manager to which it was submitted.
* Scheduler/Resource Manager Interactions
  - Resource Manager Commands
  - Resource Manager Flow
* Resource Manager Specific Details (Limitations/Special Features)
* Synchronizing Conflicting Information
* Evaluating Resource Manager Availability and Performance
Scheduler/Resource Manager Interactions
Moab interacts with all resource managers using a common set of commands and objects. Each resource manager interface obtains and translates Moab concepts regarding workload and resources into native resource manager objects, attributes, and commands.
Information on creating a new scheduler resource manager interface can be
found in the Adding New Resource Manager Interfaces section.
Resource Manager Commands
For many environments, Moab interaction with the resource manager is limited
to the following objects and functions:
Object   Function         Details
Job      Query            Collect detailed state, requirement, and utilization information about jobs
         Modify           Change job state and/or attributes
         Start            Execute a job on a specified set of resources
         Cancel           Cancel an existing job
         Preempt/Resume   Suspend, resume, checkpoint, restart, or requeue a job
Node     Query            Collect detailed state, configuration, and utilization information about compute resources
         Modify           Change node state and/or attributes
Queue    Query            Collect detailed policy and configuration information from the resource manager
Using these functions, Moab is able to fully manage workload, resources, and
cluster policies. More detailed information about resource manager specific
capabilities and limitations for each of these functions can be found in the
individual resource manager overviews. (LL, PBS, LSF, SGE, BProc, or WIKI).
Beyond these base functions, other commands exist to support advanced
features such as provisioning and cluster level resource management.
Resource Manager Flow
In general, Moab interacts with resource managers in a sequence of steps each
scheduling iteration. These steps are outlined in what follows:
1. load global resource information
2. load node specific information (optional)
3. load job information
4. load queue/policy information (optional)
5. cancel/preempt/modify jobs according to cluster policies
6. start jobs in accordance with available resources and policy constraints
7. handle user commands
Typically, each step completes before the next step is started. However, the
size and complexity of current systems mandate a more advanced, parallel
approach that provides benefits in reliability, concurrency, and
responsiveness.
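The cadence of this loop is governed by the RMPOLLINTERVAL parameter (also referenced
under Synchronizing Conflicting Information below); a hedged moab.cfg sketch with
illustrative values:

# poll resource managers no sooner than every 30 seconds
# and no later than every 60 seconds
RMPOLLINTERVAL 30,60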
Resource Manager Specific Details (Limitations/Special
Features)
- Torque
    - Torque Homepage
- SLURM/Wiki
    - SLURM Integration Guide
    - Wiki Overview
Synchronizing Conflicting Information
Moab does not trust resource manager information. Node, job, and policy
information is reloaded on each iteration and discrepancies are detected.
Synchronization issues and allocation conflicts are logged and handled where
possible. To assist sites in minimizing stale information and conflicts, a number
of policies and parameters are available, as sketched after the following list.
- Node State Synchronization Policies (see Moab Parameters)
- Stale Data Purging (see JOBPURGETIME)
- Thread Management (preventing resource manager failures from affecting scheduler operation)
- Resource Manager Poll Interval (see RMPOLLINTERVAL)
- Node Query Refresh Rate (see NODEPOLLFREQUENCY)
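As noted above, a hedged moab.cfg sketch combining two of these parameters (the values
are illustrative only, not recommendations):

# purge job records no longer reported by the resource manager after 5 minutes
JOBPURGETIME 00:05:00
# perform a node manager query only every 5 scheduling iterations
NODEPOLLFREQUENCY 5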
Evaluating Resource Manager Availability and Performance
Each resource manager is individually tracked and evaluated by Moab. Using
the mdiag -R command, a site can determine how a resource manager is
configured, how heavily it is loaded, what failures, if any, have occurred in the
recent past, and how responsive it is to requests.
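For example, assuming an interface named base has been configured, its health and
responsiveness can be inspected directly:

> mdiag -R base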
Related Topics
Resource Manager Configuration
Resource Manager Extensions
Resource Manager Configuration
- Defining and Configuring Resource Manager Interfaces
    - Resource Manager Attributes
- Resource Manager Configuration Details
    - Resource Manager Types
    - Resource Manager Name
    - Resource Manager Location
    - Resource Manager Flags
- Scheduler/Resource Manager Interactions
Defining and Configuring Resource Manager Interfaces
Moab resource manager interfaces are defined using the RMCFG parameter.
This parameter allows specification of key aspects of the interface. In most
cases, only the TYPE attribute needs to be specified; Moab then determines the
defaults required to activate and use the selected interface. In the
following example, an interface to a LoadLeveler resource manager is defined.
RMCFG[orion] TYPE=LL...
Note that the resource manager is given a label of orion. This label can be any
arbitrary site-selected string and is for local usage only. For sites with multiple
active resource managers, the labels can be used to distinguish between them
for resource manager specific queries and commands.
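As a sketch, with two labeled interfaces defined (the names are arbitrary placeholders),
the label selects the target of such a query:

RMCFG[orion] TYPE=LL
RMCFG[pluto] TYPE=PBS

> mdiag -R pluto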
Resource Manager Attributes
The following table lists the possible resource manager attributes that can be
configured.
ADMINEXEC
AUTHTYPE
BANDWIDTH
CHECKPOINTSIG
CHECKPOINTTIMEOUT
CLIENT
CLUSTERQUERYURL
CONFIGFILE
DATARM
DEFAULTCLASS
DEFAULTHIGHSPEEDADAPTER
DESCRIPTION
ENV
EPORT
FAILTIME
FBSERVER
FLAGS
FNLIST
HOST
IGNHNODES
JOBCANCELURL
JOBEXTENDDURATION
JOBIDFORMAT
JOBMODIFYURL
JOBRSVRECREATE
JOBSTARTURL
JOBSUBMITURL
JOBSUSPENDURL
JOBVALIDATEURL
MAXDSOP
MAXITERATIONFAILURECOUNT
MAXJOBPERMINUTE
MAXJOBS
MINETIME
NMPORT
NODEFAILURERSVPROFILE
NODESTATEPOLICY
OMAP
PORT
PROVDURATION
PTYSTRING
RESOURCECREATEURL
RESOURCETYPE
RMSTARTURL
RMSTOPURL
SBINDIR
SERVER
SLURMFLAGS
SOFTTERMSIG
STAGETHRESHOLD
STARTCMD
SUBMITCMD
SUBMITPOLICY
SUSPENDSIG
SYNCJOBID
SYSTEMMODIFYURL
SYSTEMQUERYURL
TARGETUSAGE
TIMEOUT
TRIGGER
TYPE
USEVNODES
VARIABLES
VERSION
VMOWNERRM
WORKLOADQUERYURL
ADMINEXEC
Format
"jobsubmit"
Default
NONE
Description
Normally, when the JOBSUBMITURL is executed, Moab will drop to the UID and GID of the user submitting the job. Specifying an ADMINEXEC of jobsubmit causes Moab to use its own UID and GID
instead (usually root). This is useful for some native resource managers where the JOBSUBMITURL
is not a user command (such as qsub) but a script that interfaces directly with the resource manager.
Example
RMCFG[base] ADMINEXEC=jobsubmit
Moab will not use the user's UID and GID for executing the JOBSUBMITURL.
AUTHTYPE
Format
One of CHECKSUM, OTHER, PKI, SECUREPORT, or NONE.
Default
CHECKSUM
Description
Specifies the security protocol to be used in scheduler-resource manager
communication.
Only valid with WIKI based interfaces.
Example
RMCFG[base] AUTHTYPE=CHECKSUM
Moab requires a secret key-based checksum associated with each
resource manager message.
BANDWIDTH
Format:
<FLOAT>[{M|G|T}]
Default:
-1 (unlimited)
Description:
Specifies the maximum deliverable bandwidth between the Moab server and the resource manager for staging jobs and data. Bandwidth is specified in units per second and defaults to a unit of
MB/s. If a unit modifier is specified, the value is interpreted accordingly (M - megabytes/sec, G - gigabytes/sec, T - terabytes/sec).
Example:
RMCFG[base] BANDWIDTH=340G
Moab will reserve up to 340 GB of network bandwidth when scheduling job and data
staging operations to and from this resource manager.
CHECKPOINTSIG
Format
One of suspend, <INTEGER>, or SIG<X>
Description
Specifies what signal to send the resource manager when a job is checkpointed. See Checkpoint
Overview.
Example
RMCFG[base] CHECKPOINTSIG=SIGKILL
Moab routes the signal SIGKILL through the resource manager to the job when a job is
checkpointed.
CHECKPOINTTIMEOUT
Format
[[[DD:]HH:]MM:]SS
Default
0 (no timeout)
Description
Specifies how long Moab waits for a job to checkpoint before canceling it. If
set to 0, Moab does not cancel the job if it fails to checkpoint. See Checkpoint
Overview.
Example
RMCFG[base] CHECKPOINTTIMEOUT=5:00
Moab cancels any job that has not exited 5 minutes after receiving
a checkpoint request.
CLIENT
Format
<PEER>
Default
Use name of resource manager for peer client lookup
Description
If specified, the resource manager will use the peer value to authenticate
remote connections. See configuring peers. If not specified, the resource
manager will search for a CLIENTCFG[<X>] entry of RM:<RMNAME> in the
moab-private.cfg file.
Example
RMCFG[clusterBI] CLIENT=clusterB
Moab will look up and use information for peer clusterB when
authenticating the clusterBI resource manager.
CLUSTERQUERYURL
Format
[file://<path> | http://<address> | <path>]
If file:// is specified, Moab treats the destination as a flat text file. If http:// is specified, Moab treats
the destination as a hypertext transfer protocol file. If just a path is specified, Moab treats the destination as an executable.
Description
Specifies how Moab queries the resource manager. See Native RM, URL Notes, and interface details.
Example
RMCFG[base] CLUSTERQUERYURL=file:///tmp/cluster.config
Moab reads /tmp/cluster.config when it queries the base resource manager.
CONFIGFILE
Format
<STRING>
Description
Specifies the resource manager specific configuration file that must be used to enable correct API
communication.
Only valid with LL- and SLURM-based interfaces.
Example
RMCFG[base] TYPE=LL CONFIGFILE=/home/loadl/loadl_config
The scheduler uses the specified file when establishing the resource manager/scheduler
interface connection.
DATARM
Format
<RM NAME>
Description
If specified, the resource manager uses the given storage resource manager to handle staging data
in and out.
Example
RMCFG[clusterB] DATARM=clusterB_storage
When data staging is required by jobs starting/completing on clusterB, Moab uses the
storage interface defined by clusterB_storage to stage and monitor the data.
DEFAULTCLASS
Format
<STRING>
Description
Specifies the class to use if jobs submitted via this resource manager interface do not have an associated class.
Example
RMCFG[internal] DEFAULTCLASS=batch
Moab assigns the class batch to all jobs from the resource manager internal that do not
have a class assigned.
If you are using PBS as the resource manager, a job will never come from PBS without a
class, and the default will never apply.
DEFAULTHIGHSPEEDADAPTER
Format:
<STRING>
Default:
sn0
Description:
Specifies the default high speed switch adapter to use when starting LoadLeveler jobs (supported in version 4.2.2 and higher of Moab and 3.2 of LoadLeveler).
Example:
RMCFG[base]
DEFAULTHIGHSPEEDADAPTER=sn1
The scheduler will start jobs requesting a high speed adapter on sn1.
DESCRIPTION
Format
<STRING>
Description
Specifies the human-readable description for the resource manager interface. If white space is
used, the description should be quoted.
Example
RMCFG[torque] DESCRIPTION='Torque RM for launching jobs'
Moab annotates the Torque resource manager accordingly.
ENV
Format
Semi-colon-delimited (;) list of <KEY>=<VALUE> pairs
Default
MOABHOMEDIR=<MOABHOMEDIR>
Description
Specifies a list of environment variables that will be passed to URLs of type exec:// for that
resource manager.
Example
RMCFG[base] ENV=HOST=node001;RETRYTIME=50
RMCFG[base] CLUSTERQUERYURL=exec:///opt/moab/tools/cluster.query.pl
RMCFG[base] WORKLOADQUERYURL=exec:///opt/moab/tools/workload.query.pl
The environment variables HOST and RETRYTIME (with values node001 and 50
respectively) are passed to /opt/moab/tools/cluster.query.pl and
/opt/moab/tools/workload.query.pl when they are executed.
EPORT
Format:
<INTEGER>
Description:
Specifies the event port to use to receive resource manager based scheduling events.
Example:
RMCFG[base] EPORT=15017
The scheduler will look for scheduling events from the resource manager host
at port 15017.
FAILTIME
Format:
[[[DD:]HH:]MM:]SS
Description:
Specifies how long a resource manager must be down before any failure triggers associated with
the resource manager fire.
Example:
RMCFG[base] FAILTIME=3:00
If the base resource manager is down for three minutes, any resource manager failure
triggers fire.
FBSERVER
Format:
<RMNAME>
Description:
Specifies the fallback server to use when talking to Moab in an HA configuration.
Example:
RMCFG[base] TYPE=MOAB SERVER=server1 FBSERVER=server1-ha
FLAGS
Format
Comma-delimited list of zero or more resource manager flags. See
Resource Manager Flags for valid values.
Description
Specifies various attributes of the resource manager.
Example
RMCFG[base] FLAGS=asyncstart
Moab directs the resource manager to start the job
asynchronously.
FNLIST
Format
Comma-delimited list of zero or more of the following: clusterquery, jobcancel, jobrequeue, jobresume, jobstart, jobsuspend, queuequery, resourcequery or workloadquery
Description
By default, a resource manager utilizes all functions supported to query and control batch objects.
If this parameter is specified, only the listed functions are used.
Example
RMCFG[base] FNLIST=queuequery
Moab only uses this resource manager interface to load queue configuration information.
HOST
Format
<STRING>
Default
localhost
Description
The host name of the machine on which the resource manager server is running.
Example
RMCFG[base] host=server1
IGNHNODES
Format
<BOOLEAN>
Default
FALSE
Description
Specifies whether to read in the PBSPro host nodes. This parameter is used in conjunction with
USEVNODES. When both are set to TRUE, the host nodes are not queried.
Example
RMCFG[pbs] IGNHNODES=TRUE
JOBCANCELURL
Format
<protocol>://[<host>[:<port>]][<path>]
Default
---
Description
Specifies how Moab cancels jobs via the resource manager. See URL Notes.
Example
RMCFG[base] JOBCANCELURL=exec:///opt/moab/job.cancel.lsf.pl
Moab executes /opt/moab/job.cancel.lsf.pl to cancel
specific jobs.
JOBEXTENDDURATION
Format
[[[DD:]HH:]MM:]SS[,[[[DD:]HH:]MM:]SS][!][<] (or <MIN TIME>[,<MAX TIME>][!])
Default
---
Description
Specifies the minimum and maximum amount of time that can be added to a job's walltime if it is
possible for the job to be extended. See MINWCLIMIT. As the job runs longer than its current
specified minimum wallclock limit (-l minwclimit, for example), Moab attempts to extend the job's
limit by the minimum JOBEXTENDDURATION. This continues until either the extension can no
longer occur (it is blocked by a reservation or job), the maximum JOBEXTENDDURATION is
reached, or the user's specified wallclock limit (-l walltime) is reached. When a job is extended, it
is marked as PREEMPTIBLE, unless the ! is appended to the end of the configuration string. If
the < is at the end of the string, however, the job is extended the maximum amount possible.
JOBEXTENDDURATION and JOBEXTENDSTARTWALLTIME TRUE cannot be configured
together. If they are in the same moab.cfg or are both active, then the
JOBEXTENDDURATION will not be honored.
For example, comment out the JOBEXTENDSTARTWALLTIME.
RMCFG[base] JOBEXTENDDURATION=30,1:00:00
#JOBEXTENDSTARTWALLTIME TRUE
Example
RMCFG[base] JOBEXTENDDURATION=30,1:00:00
Moab extends a job's walltime by 30 seconds each time the job is about to run out of
walltime until it is bound by one hour, a reservation/job, or the job's original
"maximum" wallclock limit.
JOBIDFORMAT
Format
INTEGER
Default
---
Description
Specifies that Moab should use numbers to create job IDs. This eliminates multiple job IDs associated with a single job.
Example
RMCFG[base] JOBIDFORMAT=INTEGER
Job IDs are generated as numbers.
JOBMODIFYURL
Format
<protocol>://[<host>[:<port>]][<path>]
Default
---
Description
Specifies how Moab modifies jobs via the resource manager. See URL Notes, and interface details.
Example
RMCFG[base] JOBMODIFYURL=exec://$TOOLSDIR/job.modify.dyn.pl
Moab executes /opt/moab/job.modify.dyn.pl to modify specific jobs.
JOBRSVRECREATE
Format
Boolean
Default
TRUE
Description
Specifies whether Moab will re-create a job reservation each time job information is updated by a
resource manager. See Considerations for Large Clusters for more information.
Example
RMCFG[base] JOBRSVRECREATE=FALSE
Moab only creates a job reservation once when the job first starts.
JOBSTARTURL
Format
<protocol>://[<host>[:<port>]][<path>]
Default
---
Description
Specifies how Moab starts jobs via the resource manager. See URL Notes.
Example
RMCFG[base]
JOBSTARTURL=http://orion.bsu.edu:1322/moab/jobstart.cgi
Moab triggers the jobstart.cgi script via http to start specific
jobs.
JOBSUBMITURL
Format
<protocol>://[<host>[:<port>]][<path>]
Description
Specifies how Moab submits jobs to the resource manager. See URL Notes.
Example
RMCFG[base] JOBSUBMITURL=exec://$TOOLSDIR/job.submit.dyn.pl
Moab executes $TOOLSDIR/job.submit.dyn.pl to submit jobs to the
resource manager.
JOBSUSPENDURL
Format
<protocol>://[<host>[:<port>]][<path>]
Description
Specifies how Moab suspends jobs via the resource manager. See URL Notes.
Example
RMCFG[base] JOBSUSPENDURL=EXEC://$HOME/scripts/job.suspend
Moab executes the job.suspend script when jobs are suspended.
JOBVALIDATEURL
Format
<protocol>://[<host>[:<port>]][<path>]
Description
Specifies how Moab validates newly submitted jobs. See URL Notes. If the script returns with a
non-zero exit code, the job is rejected. See User Proxying/Alternate Credentials.
Example
RMCFG[base] JOBVALIDATEURL=exec://$TOOLS/job.validate.pl
Moab executes the 'job.validate.pl' script when jobs are submitted to verify they
are acceptable.
MAXDSOP
Format
<INTEGER>
Default
-1 (unlimited)
Description
Specifies the maximum number of data staging operations that may be simultaneously active.
Example
RMCFG[ds] MAXDSOP=16
MAXITERATIONFAILURECOUNT
Format
<INTEGER>
Default
80
Description
Specifies the number of times the RM must fail within a single iteration before Moab considers it down or corrupt. When an RM is down or corrupt, Moab will not attempt to interact
with it.
Example
RMCFG[base] MAXITERATIONFAILURECOUNT=25
The RM base must fail 25 times in a single iteration for Moab to consider it down
and cease interacting with it.
MAXJOBPERMINUTE
Format
<INTEGER>
Default
-1 (unlimited)
Description
Specifies the maximum number of jobs allowed to start per minute via the resource manager.
Example
RMCFG[base] MAXJOBPERMINUTE=5
The scheduler only allows five jobs per minute to launch via the resource manager
base.
MAXJOBS
Format
<INTEGER>
Default
0 (limited only by the Moab MAXJOB setting)
Description
Specifies the maximum number of active jobs that this interface is allowed to load from the
resource manager.
Only works with Moab peer resource managers at this time.
Example
RMCFG[cluster1] SERVER=moab://cluster1 MAXJOBS=200
The scheduler loads up to 200 active jobs from the remote Moab peer cluster1.
MINETIME
Format
<INTEGER>
Default
1
Description
Specifies the minimum time in seconds between processing subsequent scheduling events.
Example
RMCFG[base] MINETIME=5
The scheduler batch-processes scheduling events that occur less than five seconds
apart.
NMPORT
Format
<INTEGER>
Default
(any valid port number)
Description
Allows specification of the resource manager's node manager port and is only required when this
port has been set to a non-default value.
Example
RMCFG[base] NMPORT=13001
The scheduler contacts the node manager located on each compute node at port 13001.
NODEFAILURERSVPROFILE
Format
<STRING>
Description
Specifies the rsv template to use when placing a reservation onto failed nodes. See also
NODEFAILURERESERVETIME.
Example
# moab.cfg
RMCFG[base] NODEFAILURERSVPROFILE=long
RSVPROFILE[long] DURATION=25:00
RSVPROFILE[long] USERLIST=john
The scheduler will use the long rsv profile when creating reservations over failed
nodes belonging to base.
NODESTATEPOLICY
Format
One of OPTIMISTIC or PESSIMISTIC
Default
PESSIMISTIC
Description
Specifies how Moab should determine the state of a node when multiple resource managers are
reporting state.
OPTIMISTIC specifies that if any resource manager reports a state of up, that state will be used.
PESSIMISTIC specifies that if any resource manager reports a state of down, that state will be
used.
Example
# moab.cfg
RMCFG[native] TYPE=NATIVE NODESTATEPOLICY=OPTIMISTIC
OMAP
Format
<protocol>://[<host>[:<port>]][<path>]
Description
Specifies an object map file that is used to map credentials and other objects when using this
resource manager peer. See Grid Credential Management for full details.
Example
# moab.cfg
RMCFG[peer1] OMAP=file:///opt/moab/omap.dat
When communicating with the resource manager peer1, objects are mapped according to
the rules defined in the /opt/moab/omap.dat file.
PORT
Format
<INTEGER>
Default
0
Description
Specifies the port on which the scheduler should contact the associated resource manager. The
value 0 specifies that the resource manager default port should be used.
Example
RMCFG[base] TYPE=PBS HOST=cws PORT=20001
Moab attempts to contact the PBS server daemon on host cws, port 20001.
PROVDURATION
Format
[[[DD:]HH:]MM:]SS
Default
2:30
Description
Specifies the upper bound (walltime) of a provisioning request. After this duration, Moab will consider the provisioning attempt failed.
Example
RMCFG[base] PROVDURATION=5:00
When RM base provisions a node for more than 5 minutes, Moab considers the
provisioning as having failed.
PTYSTRING
Format
<STRING>
Default
srun -n1 -N1 --pty
Description
When a SLURM interactive job is submitted, it builds an salloc command that gets the requested
resources and an srun command that creates a terminal session on one of the nodes. The srun
command is called the PTYString. PTYString is configured in moab.cfg.
There are two special things you can do with PTYString:
1. You can have PTYSTRING=$salloc which says to use the default salloc command
(SallocDefaultCommand, look in the slurm.conf man page) defined in slurm.conf.
Internally, Moab won't add a PTYString because SLURM will call the
SallocDefaultCommand.
2. As in the example below, you can add $SHELL. $SHELL will be expanded to either what you
request on the command line (such as msub -S /bin/tcsh -l) or to the value of $SHELL in your
current session.
PTYString works only with SLURM.
Example
RMCFG[slurm] PTYSTRING="srun -n1 -N1 --pty --preserve-env $SHELL"
RESOURCECREATEURL
Format
[exec://<path> | http://<address> | <path>]
If exec:// is specified, Moab treats the destination as an executable file; if http:// is specified,
Moab treats the destination as a hypertext transfer protocol file.
Description
Specifies a script or method that can be used by Moab to create resources dynamically, such as
creating a virtual machine on a hypervisor.
Example
RMCFG[base] RESOURCECREATEURL=exec:///opt/script/vm.provision.py
Moab invokes the vm.provision.py script, passing in data as command line
arguments, to request a creation of new resources.
RESOURCETYPE
Format
{COMPUTE|FS|LICENSE|NETWORK|PROV}
Description
Specifies which type of resource this resource manager is configured to control. See Native
Resource Managers for more information.
Example
RMCFG[base] TYPE=NATIVE RESOURCETYPE=FS
Resource manager base will function as a NATIVE resource manager and control file
systems.
RMSTARTURL
Format
[exec://<path> | http://<address> | <path>]
If exec:// is specified, Moab treats the destination as an executable file; if http:// is specified, Moab
treats the destination as a hypertext transfer protocol file.
Description
Specifies how Moab starts the resource manager.
Example
RMCFG[base] RMSTARTURL=exec:///tmp/nat.start.pl
Moab executes /tmp/nat.start.pl to start the resource manager base.
RMSTOPURL
Format
[exec://<path> | http://<address> | <path>]
If exec:// is specified, Moab treats the destination as an executable file; if http:// is specified, Moab
treats the destination as a hypertext transfer protocol file.
Description
Specifies how Moab stops the resource manager.
Example
RMCFG[base] RMSTOPURL=exec:///tmp/nat.stop.pl
Moab executes /tmp/nat.stop.pl to stop the resource manager base.
SBINDIR
Format
<PATH>
Description
For use with Torque; specifies the location of the Torque system binaries (supported in Torque
1.2.0p4 and higher).
Example
RMCFG[base] TYPE=pbs SBINDIR=/usr/local/torque/sbin
Moab tells Torque that its system binaries are located in /usr/local/torque/sbin.
SERVER
Format
<URL>
Description
Specifies the resource management service to use. If not specified, the scheduler locates the
resource manager via built-in defaults or, if available, with an information service.
Example
RMCFG[base] server=ll://supercluster.org:9705
Moab attempts to use the Loadleveler scheduling API at the specified location.
SLURMFLAGS
Format
<STRING>
Description
Specifies characteristics of the SLURM resource manager interface. The COMPRESSOUTPUT flag
instructs Moab to use the compact hostlist format for job submissions to SLURM. The flag
NODEDELTAQUERY instructs Moab to request delta node updates when it queries SLURM for
node configuration.
Example
RMCFG[slurm] SLURMFLAGS=COMPRESSOUTPUT
Moab uses the COMPRESSOUTPUT flag to determine interface characteristics with
SLURM.
SOFTTERMSIG
Format
<INTEGER>or SIG<X>
Description
Specifies what signal to send the resource manager when a job reaches its soft wallclock limit. See
JOBMAXOVERRUN.
Example
RMCFG[base] SOFTTERMSIG=SIGUSR1
Moab routes the signal SIGUSR1 through the resource manager to the job when a job
reaches its soft wallclock limit.
STAGETHRESHOLD
Format
[[[DD:]HH:]MM:]SS
Description
Specifies the maximum time a job waits to start locally before Moab considers migrating it to a
remote peer. In other words, if a job's start time on a remote cluster is less than the start time on
the local cluster, but the difference between the two is less than STAGETHRESHOLD, then the job is
scheduled locally. The aim is to avoid job/data staging overhead if the difference in start times is
minimal.
If this attribute is used, backfill is disabled for the associated resource manager.
Example
RMCFG[remote_cluster] STAGETHRESHOLD=00:05:00
Moab only migrates jobs to remote_cluster if the jobs can start five minutes sooner on the
remote cluster than they could on the local cluster.
STARTCMD
Format
<STRING>
Description
Specifies the full path to the resource manager job start client. If the resource manager API fails,
Moab executes the specified start command in a second attempt to start the job.
Moab calls the start command with the format <CMD> <JOBID> -H <HOSTLIST> unless
the environment variable MOABNOHOSTLIST is set in which case Moab will only pass the
job ID.
Example
RMCFG[base] STARTCMD=/usr/local/bin/qrun
Moab uses the specified start command if API failures occur when launching jobs.
SUBMITCMD
Format
<STRING>
Description
Specifies the full path to the resource manager job submission client.
Example
RMCFG[base] SUBMITCMD=/usr/local/bin/qsub
Moab uses the specified submit command when migrating
jobs.
SUBMITPOLICY
Format
One of NODECENTRIC or PROCCENTRIC
Default
PROCCENTRIC
Description
If set to NODECENTRIC, each specified node requested by the job is interpreted as a true compute host, not as a task or processor.
Example
RMCFG[base] SUBMITPOLICY=NODECENTRIC
Moab uses the specified submit policy when migrating jobs.
SUSPENDSIG
Format
<INTEGER> (valid UNIX signal between 1 and 64)
Default
RM-specific default
Description
If set, Moab sends the specified signal to a job when a job suspend request is issued.
Example
RMCFG[base] SUSPENDSIG=19
Moab uses the specified suspend signal when suspending jobs within the base
resource manager.
SUSPENDSIG should not be used with Torque or other PBS-based resource
managers.
SYNCJOBID
Format
<BOOLEAN>
Description
Specifies that Moab should migrate jobs to the local resource manager with the job's
Moab-assigned job ID. In a grid, the grid head will only pass dependencies to the underlying Moab if
SYNCJOBID is set. This attribute can be used with the JOBIDFORMAT attribute and
PROXYJOBSUBMISSION flag in order to synchronize job IDs between Moab and the resource
manager. For more information about all steps necessary to synchronize job IDs between Moab
and Torque, see Synchronizing Job IDs in Torque and Moab.
Example
RMCFG[slurm] TYPE=wiki:slurm SYNCJOBID=TRUE
SYSTEMMODIFYURL
Format
[exec://<path> | http://<address> | <path>]
If exec:// is specified, Moab treats the destination as an executable file; if http:// is specified, Moab
treats the destination as a hypertext transfer protocol file.
Description
Specifies how Moab modifies attributes of the system. This interface is used in data staging.
Example
RMCFG[base] SYSTEMMODIFYURL=exec:///tmp/system.modify.pl
Moab executes /tmp/system.modify.pl when it modifies system attributes in
conjunction with the resource manager base.
SYSTEMQUERYURL
Format
[file://<path> | http://<address> | <path>]
If file:// is specified, Moab treats the destination as a flat text file; if http:// is specified, Moab treats
the destination as a hypertext transfer protocol file; if just a path is specified, Moab treats the destination as an executable.
Description
Specifies how Moab queries attributes of the system. This interface is used in data staging.
Example
RMCFG[base] SYSTEMQUERYURL=file:///tmp/system.query
Moab reads /tmp/system.query when it queries the system in conjunction with the base
resource manager.
TARGETUSAGE
Format
<INTEGER>[%]
Default
90%
Description
Amount of resource manager resources to explicitly use. In the case of a storage resource manager,
indicates the target usage of data storage resources to dedicate to active data migration requests.
If the specified value contains a percent sign (%), the target value is a percent of the configured
value. Otherwise, the target value is considered to be an absolute value measured in megabytes
(MB).
Example
RMCFG[storage] TYPE=NATIVE RESOURCETYPE=storage
RMCFG[storage] TARGETUSAGE=80%
Moab schedules data migration requests to never exceed 80% usage of the storage
resource manager's disk cache and network resources.
TIMEOUT
Format
<INTEGER>
Default
30
Description
Time (in seconds) the scheduler waits for a response from the resource manager.
Example
RMCFG[base] TIMEOUT=40
Moab waits 40 seconds to receive a response from the resource manager before timing
out and giving up. Moab tries again on the next iteration.
TRIGGER
Format
<TRIG_SPEC>
Description
A trigger specification indicating behaviors to enforce in the event of certain events associated with
the resource manager, including resource manager start, stop, and failure.
Example
RMCFG[base] TRIGGER=<X>
TYPE
Format
<RMTYPE>[:<RMSUBTYPE>] where <RMTYPE> is one of the following: Torque, NATIVE, PBS, RMS,
SSS, or WIKI, and the optional <RMSUBTYPE> value is RMS.
Default
PBS
Description
Specifies type of resource manager to be contacted by the scheduler.
For TYPE WIKI, AUTHTYPE must be set to CHECKSUM. The <RMSUBTYPE> option is
currently only used to support Compaq's RMS resource manager in conjunction with PBS.
In this case, the value PBS:RMS should be specified.
Example
RMCFG[clusterA] TYPE=PBS HOST=clusterA PORT=15003
RMCFG[clusterB] TYPE=PBS HOST=clusterB PORT=15005
Moab interfaces to two different PBS resource managers, one located on server clusterA
at port 15003 and one located on server clusterB at port 15005.
USEVNODES
Format
<BOOLEAN>
Default
FALSE
Description
Specifies whether to schedule on PBS virtual nodes. When set to TRUE, Moab queries PBSPro for
vnodes and puts jobs on vnodes rather than hosts. In some systems, such as PBS + Altix, it may not
be desirable to read in the host nodes; for such situations refer to the IGNHNODES attribute.
Example
RMCFG[pbs] USEVNODES=TRUE
VARIABLES
Format
<VAR>=<VAL>[,<VAR>=<VAL>]
Description
Opaque resource manager variables.
Example
RMCFG[base] VARIABLES=SCHEDDHOST=head1
Moab associates the variable SCHEDDHOST with the value head1 on resource
manager base.
VERSION
Format
<STRING>
Default
SLURM: 10200 (i.e., 1.2.0)
Description
Resource manager-specific version string.
Example
RMCFG[base] VERSION=10124
Moab assumes that resource manager base has a version
number of 1.1.24.
VMOWNERRM
Format
<STRING>
Description
Used with provisioning resource managers that can create VMs. It specifies the resource manager
that will own any VMs created by the resource manager.
Example
RMCFG[torque]
RMCFG[prov] RESOURCETYPE=PROV VMOWNERRM=torque
WORKLOADQUERYURL
Format
[file://<path> | http://<address> | <path>]
If file:// is specified, Moab treats the destination as a flat text file; if http:// is specified, Moab
treats the destination as a hypertext transfer protocol file; if just a path is specified, Moab treats
the destination as an executable.
Description
Example
Specifies how Moab queries the resource manager for workload information. (See Native RM,
URL Notes, and interface details.)
RMCFG[Torque] WORKLOADQUERYURL=exec://$TOOLSDIR/job.query.dyn.pl
Moab executes /opt/moab/tools/job.query.dyn.pl to obtain updated
workload information from resource manager Torque.
URL notes
URL parameters can load files by using the file, exec, and http protocols.
For the protocol file, Moab loads the data directly from the text file pointed to by
path.
RMCFG[base] SYSTEMQUERYURL=file:///tmp/system.query
For the protocol exec, Moab executes the file pointed to by path and loads the
output written to STDOUT. If the script requires arguments, you can use a
question mark (?) between the script name and the arguments, and an
ampersand (&) for each space.
RMCFG[base] JOBVALIDATEURL=exec://$TOOLS/job.validate.pl
RMCFG[native] CLUSTERQUERYURL=exec://opt/moab/tools/cluster.query.pl?-group=group1&arch=x86
Synchronizing Job IDs in Torque and Moab
Unless you use an msub submit filter or you're in a grid, it is recommended
that you use your RM-specific job submission command (for instance,
qsub).
In order to synchronize your job IDs between Torque and Moab you must
perform the following steps:
1. Verify that you are using Torque version 2.5.6 or later.
2. Set SYNCJOBID to TRUE in all resource managers.
RMCFG[torque] TYPE=PBS SYNCJOBID=TRUE
3. Set the PROXYJOBSUBMISSION flag. With PROXYJOBSUBMISSION enabled,
you must run Moab as a Torque manager or operator. Verify that other
users can submit jobs using msub. Moab, as a non-root user, should still be
able to submit jobs to Torque and synchronize job IDs.
RMCFG[torque] TYPE=PBS SYNCJOBID=TRUE
RMCFG[torque] FLAGS=PROXYJOBSUBMISSION
4. Add JOBIDFORMAT=INTEGER to the internal RM. Adding this parameter
forces Moab to use only numbers as job IDs, and those numbers are
synchronized across Moab, Torque, and the entire grid. This enhances the
end-user experience as it eliminates multiple job IDs associated with a single
job.
RMCFG[torque] TYPE=PBS SYNCJOBID=TRUE
RMCFG[torque] FLAGS=PROXYJOBSUBMISSION
RMCFG[internal] JOBIDFORMAT=INTEGER
Resource Manager Configuration Details
As with all scheduler parameters, RMCFG follows the syntax described within the
Parameters Overview.
Resource Manager Types
The RMCFG parameter allows the scheduler to interface to multiple types of
resource managers using the TYPE or SERVER attributes. By specifying these
attributes, any of the resource managers listed below may be supported.
Type     Resource managers             Details
Moab     Moab Workload Manager         Use the Moab peer-to-peer (grid) capabilities to enable grids
                                       and other configurations. (See Grid Configuration.)
MWS      Moab Web Services             The MWS resource manager type is a native integration between
                                       Moab and MWS. Resource manager data is passed directly between
                                       Moab and MWS using JSON (rather than Moab's native WIKI
                                       syntax). This simplifies RM configuration for systems where
                                       one or more MWS plugins are acting as resource managers. See
                                       the "Moab Workload Manager resource manager integration"
                                       section of the MWS plugins chapter in the MWS documentation
                                       for more information.
Native   Moab Native Interface         Used for connecting directly to scripts, files, and databases.
                                       (See Managing Resources Directly with the Native Interface.)
PBS      Torque (all versions)         N/A
SSS      Scalable Systems Software     N/A
         Project version 2.0 and
         higher
WIKI     Wiki interface specification  Used for LRM, YRM, ClubMASK, BProc, SLURM, and others.
         version 1.0 and higher
Resource Manager Name
Moab can support more than one resource manager simultaneously.
Consequently, the RMCFG parameter takes an index value such as RMCFG
[clusterA]. This index value essentially names the resource manager (as
done by the deprecated parameter RMNAME). The resource manager name is
used by the scheduler in diagnostic displays, logging, and in reporting resource
consumption to the accounting manager. For most environments, the selection
of the resource manager name can be arbitrary.
Resource Manager Location
The HOST, PORT, and SERVER attributes can be used to specify how the resource
manager should be contacted. For many resource managers the interface
correctly establishes contact using default values. These parameters need only
to be specified for resource managers such as the WIKI interface (that do not
include defaults) or with resource managers that can be configured to run at
non-standard locations (such as PBS). In all other cases, the resource manager
is automatically located.
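For instance, a hedged sketch of explicitly locating a PBS server that runs at a
non-standard address (the host and port values are placeholders):

RMCFG[base] TYPE=PBS HOST=pbs-head PORT=20001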
Resource Manager Flags
The FLAGS attribute can be used to modify many aspects of a resource
manager's behavior.
AUTOSYNC, COLLAPSEDVIEW, HOSTINGCENTER, PRIVATE, REPORT,
SHARED, and STATIC are deprecated.
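Because FLAGS takes a comma-delimited list, several flags can be combined on a single
interface; a sketch (this particular pairing is illustrative, not a recommendation):

RMCFG[base] FLAGS=asyncstart,asyncdelete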
Flag
Description
ASYNCDELETE
Moab directs the resource manager to not wait for confirmation that the job
correctly cancels before the API call returns. See Large Cluster Tuning for more
information.
This flag is only applicable for Torque or Moab Native resource
managers.
ASYNCSTART
Jobs started on this resource manager start asynchronously. In this case, the
scheduler does not wait for confirmation that the job correctly starts before
proceeding. See Large Cluster Tuning for more information.
This flag is only applicable for Torque or Moab Native resource
managers.
AUTOSTART
Jobs staged to this resource manager do not need to be explicitly started by
the scheduler. The resource manager itself handles job launch.
BECOMEMASTER
Nodes reported by this resource manager will transfer ownership to this
resource manager if they are currently owned by another resource manager
that does not have this flag set.
CLIENT
A client resource manager object is created for diagnostic/statistical purposes
or to configure Moab's interaction with this resource manager. It represents an
external entity that consumes server resources or services, allows a local
administrator to track this usage, and configures specific policies related to
that resource manager. A client resource manager object loads no data and
provides no services.
CLOCKSKEWCHECKING
Setting CLOCKSKEWCHECKING allows you to configure clock skew adjustments. Most of the time it is sufficient to use an NTP server to keep the clocks
in your system synchronized.
DYNAMICCRED
The resource manager creates credentials within the cluster as needed to support workload. See Identity Manager Overview.
EnableCondensedQuery
Enables the condensed workload query.
Only applies if the Torque parameter job_full_report_time is used
(Torque Resource Manager version 5.1.x or later). See Server
Parameters in the Torque Resource Manager Administrator Guide.
EXECUTIONSERVER
The resource manager is capable of launching and executing batch workload.
FSISREMOTE
Add this flag if the working file system doesn't exist on the server to prevent
Moab from validating files and directories at migration.
FULLCP
Always checkpoint full job information (useful with Native resource managers).
IGNQUEUESTATE
The queue state reported by the resource manager should be ignored. May be
used if queues must be disabled inside of a particular resource manager to
allow an external scheduler to properly operate.
IGNWORKLOADSTATE
When this flag is applied to a native resource manager, any jobs that are
reported via that resource manager's "workload query URL" have their
reported state ignored. For example, if an RM has the IgnWorkloadState flag
and it reports that a set of jobs have a state of "Running," this state is ignored
and the jobs will either have a default state set or will inherit the state from
another RM reporting on that same set of jobs.
This flag only changes the behavior of RMs of type NATIVE.
LOCALWORKLOADEXPORT
When set, destination peers share information about local and remote jobs,
allowing job management of different clusters at a single peer. For more
information, see Workload Submission and Control.
MIGRATEALLJOBATTRIBUTES
When set, this flag causes additional job information to be migrated to the
resource manager; additional job information includes things such as node features applied via CLASSCFG[name] DEFAULT.FEATURES, the account to
which the job was submitted, and job walltime limit.
NOAUTORES
If the resource manager does not report CPU usage to Moab because CPU
usage is at 0%, Moab assumes full CPU usage. When set, Moab recognizes the
resource manager report as 0% usage. This is only valid for PBS.
NoCondensedQuery
Disables the condensed workload query. This is the default for Moab 9.0 and
later.
Only applies if the Torque parameter job_full_report_time is used
(Torque Resource Manager version 5.1.x or later). See Server
Parameters in the Torque Resource Manager Administrator Guide.
NOCREATERESOURCE
To use resources discovered from this resource manager, they must be created
by another resource manager first. For example, if you set
NOCREATERESOURCE on RM A, which reports nodes 1 and 2, and RM B
only reports node 1, then node 2 will not be created because RM B did not
report it.
PROXYJOBSUBMISSION
Enables Admin proxy job submission, which means administrators may submit
jobs on behalf of other users.
PUSHSLAVEJOBUPDATES
Enables job changes made on a grid slave to be pushed to the grid head or
master. Without this flag, jobs being reported to the grid head do not show any
changes made on the remote Moab server (via mjobctl and so forth).
RECORDGPUMETRICS
Enables the recording of GPU metrics for nodes.
RECORDMICMETRICS
Enables the recording of MIC metrics for nodes.
THREADEDQUERIES
When this flag is set for an individual RM, the queries that Moab performs to
get information from the RM are done in a separate thread from the main Moab
process. This allows Moab to remain responsive during the query and ultimately reduces the time
spent in a scheduling cycle. If multiple RMs are being used, the effect can be more significant
because all RMs will be queried in parallel.
USEPHYSICALMEMORY
Tells Moab to use a node's physical memory instead of the swap space.
For example:
If a node has 12 GB of RAM and an additional 12 GB of swap space, it has 24
GB of virtual memory. If a 4 GB job is assigned to that node, the reported
available memory shows 12 GB because the job is using the swap space, not
the physical memory. The reported available memory doesn't decrease until
the swap space is used up. When this flag is set, the 4 GB job immediately
reduces the available memory to 8 GB (physical memory - used memory).
USERSPACEISSEPARATE
Tells Moab to skip validating the user's UID and GID when that information doesn't exist on the Moab server.
Example
# resource manager 'torque' should use asynchronous job start
RMCFG[torque] FLAGS=asyncstart
Scheduler/Resource Manager Interactions
In the simplest configuration, Moab interacts with the resource manager using
the following four primary functions:
Function       Description
GETJOBINFO     Collect detailed state and requirement information about idle, running, and
               recently completed jobs.
GETNODEINFO    Collect detailed state information about idle, busy, and defined nodes.
STARTJOB       Immediately start a specific job on a particular set of nodes.
CANCELJOB      Immediately cancel a specific job regardless of job state.
Using these four simple commands, Moab enables nearly its entire suite of
scheduling functions. More detailed information about resource manager
specific requirements and semantics for each of these commands can be found
in the specific resource manager (such as WIKI) overviews.
In addition to these base commands, other commands are required to support
advanced features such as suspend/resume, gang scheduling, and scheduler
initiated checkpoint restart.
Information on creating a new scheduler resource manager interface can be
found in the Adding New Resource Manager Interfaces section.
Resource Manager Extensions
- Resource Manager Extension Specification
- Resource Manager Extension Values
- Resource Manager Extension Examples
- Configuring dynamic features in Torque and Moab
Not all resource managers are created equal. There is a wide range in what
capabilities are available from system to system. Additionally, there is a large
body of functionality that many, if not all, resource managers have no concept
of. A good example of this is job QoS. Since most resource managers do not
have a concept of quality of service, they do not provide a mechanism for users
to specify this information. In many cases, Moab is able to add capabilities at a
global level. However, a number of features require a per job specification.
Resource manager extensions allow this information to be associated with the
job.
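For example, using the extension syntax described in the table below, a QoS request can
ride along on an otherwise ordinary submission (this assumes a QoS named high has been
defined in Moab):

> msub -l nodes=2,walltime=1:00:00,qos=high testjob.cmd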
Resource Manager Extension Specification
Specifying resource manager extensions varies by resource manager. Torque,
OpenPBS, PBSPro, Loadleveler, LSF, S3, and Wiki each allow the specification
of an extension field as described in the following table:
Resource manager      Specification method
Torque 2.0+           -l
                      > qsub -l nodes=3,qos=high sleepy.cmd
Torque 1.x/OpenPBS    -W x=
                      > qsub -l nodes=3 -W x=qos:high sleepy.cmd
                      OpenPBS does not support this ability by default but can be patched as
                      described in the PBS Resource Manager Extension Overview.
Loadleveler           #@comment
                      #@nodes = 3
                      #@comment = qos:high
LSF                   -ext
                      > bsub -ext advres:system.2
PBSPro                -l
                      > qsub -l advres=system.2
                      Use of PBSPro resources requires configuring the server_priv/resourcedef
                      file to define the needed extensions as in the following example:
                      advres type=string
                      qos    type=string
                      sid    type=string
                      sjid   type=string
Wiki                  comment
                      comment=qos:high
Resource Manager Extension Values
Using the resource manager specific method, the following job extensions are
currently available:
ADVRES
BANDWIDTH
CPUCLOCK
DDISK
DEADLINE
DEPEND
DMEM
EPILOGUE
EXCLUDENODES
FEATURE
GATTR
GMETRIC
GPUs
GRES and SOFTWARE
HOSTLIST
JGROUP
JOBFLAGS (aka FLAGS)
JOBREJECTPOLICY
MAXMEM
MAXPROC
MEM
MICs
MINPREEMPTTIME
MINPROCSPEED
MINWCLIMIT
MSTAGEIN
MSTAGEOUT
NACCESSPOLICY
NALLOCPOLICY
NCPUS
NMATCHPOLICY
NODESET
NODESETCOUNT
NODESETDELAY
NODESETISOPTIONAL
OPSYS
PARTITION
PMEM
PREF
PROCS
PROLOGUE
PVMEM
QoS
QUEUEJOB
REQATTR
RESFAILPOLICY
RMTYPE
SIGNAL
GRES and SOFTWARE
SPRIORITY
TEMPLATE
TERMTIME
TPN
TRIG
TRL (Format 1)
TRL (Format 2)
VAR
VC
VMEM
ADVRES
Format
[!]<RSVID>
Description
Specifies that reserved resources are required to run the job. If <RSVID> is specified, then only
resources within the specified reservation may be allocated (see Job to Reservation Binding).
You can request to not use a specific reservation by using advres=!<reservationname>.
Example
> qsub -l advres=grid.3
Resources for the job must come from grid.3.
> qsub -l advres=!grid.5
Resources for the job must not come from grid.5.
BANDWIDTH
Format
<DOUBLE> (in MB/s)
Description
Minimum available network bandwidth across allocated resources. (See Network Management.)
Example
> bsub -ext bandwidth=120 chemjob.txt
CPUCLOCK
Format
<STRING>
Description
Specify the CPU clock frequency for each node requested for this job. A cpuclock request applies to
every processor on every node in the request. Specifying varying CPU frequencies for different
nodes or different processors on nodes in a single job request is not supported.
Not all CPUs support all possible frequencies or ACPI states. If the requested frequency is not
supported by the CPU, the nearest frequency is used.
If a job does not place any load on the node then some OSs will drop the frequency below
the requested frequency.
Using cpuclock sets NODEACCESSPOLICY to SINGLEJOB.
ALPS 1.4 or later is required when using cpuclock on Cray.
The clock frequency can be specified via:
- a number that indicates the clock frequency (with or without the SI unit suffix).
- a Linux power governor policy name. The governor names are:
    - performance: This governor instructs Linux to operate each logical processor at its
      maximum clock frequency. This setting consumes the most power and workload executes
      at the fastest possible speed.
    - powersave: This governor instructs Linux to operate each logical processor at its
      minimum clock frequency. This setting executes workload at the slowest possible
      speed. This setting does not necessarily consume the least amount of power since
      applications execute slower, and may actually consume more energy because of the
      additional time needed to complete the workload's execution.
    - ondemand: This governor dynamically switches the logical processor's clock frequency
      to the maximum value when system load is high and to the minimum value when the
      system load is low. This setting causes workload to execute at the fastest possible
      speed or the slowest possible speed, depending on OS load. The system switches
      between consuming the most power and the least power.
      The power saving benefits of ondemand might be non-existent due to frequency
      switching latency if the system load causes clock frequency changes too often. This
      has been true for older processors since changing the clock frequency required
      putting the processor into the C3 "sleep" state, changing its clock frequency, and
      then waking it up, all of which required a significant amount of time. Newer
      processors, such as the Intel Xeon E5-2600 Sandy Bridge processors, can change
      clock frequency dynamically and much faster.
    - conservative: This governor operates like the ondemand governor but is more
      conservative in switching between frequencies. It switches more gradually and uses
      all possible clock frequencies. This governor can switch to an intermediate clock
      frequency if it seems appropriate to the system load and usage, which the ondemand
      governor does not do.
- an ACPI performance state (or P-state) with or without the P prefix. P-states are a
  special range of values (0-15) that map to specific frequencies. Not all processors
  support all 16 states; however, they all start at P0. P0 sets the CPU clock frequency
  to the highest performance state which runs at the maximum frequency. P15 sets the
  CPU clock frequency to the lowest performance state which runs at the lowest frequency.
When reviewing job or node properties when cpuclock was used, be mindful of unit conversion.
The OS reports frequency in Hz, not MHz or GHz.
Example
msub -l cpuclock=1800,nodes=2 script.sh
msub -l cpuclock=1800mhz,nodes=2 script.sh
This job requests 2 nodes and specifies their CPU frequencies should be set to 1800 MHz.
msub -l cpuclock=performance,nodes=2 script.sh
This job requests 2 nodes and specifies their CPU frequencies should be set to the
performance power governor policy.
msub -l cpuclock=3,nodes=2 script.sh
msub -l cpuclock=p3,nodes=2 script.sh
This job requests 2 nodes and specifies their CPU frequencies should be set to a
performance state of 3.
DDISK
Format
<INTEGER>
Default
0
Description
Dedicated disk per task in MB.
Example
> qsub -l ddisk=2000
DEADLINE
Format
Relative time: [[[DD:]HH:]MM:]SS
Absolute time: hh:mm:ss_mm/dd/yy
Description
Either the relative completion deadline of the job (from job submission time) or an absolute
deadline in which you specify the date and time the job will finish.
Example
> qsub -l deadline=2:00:00,nodes=4 /tmp/bio3.cmd
The job's deadline is 2 hours after its submission.
DEPEND
Format
[<DEPENDTYPE>:][{jobname|jobid}.]<ID>[:[{jobname|jobid}.]<ID>]...
Description
Allows specification of job dependencies for compute or system jobs. If no ID prefix (jobname or
jobid) is specified, the ID value is interpreted as a job ID.
Example
# submit job which will run after job 1301 and 1304 complete
> msub -l depend=orion.1301:orion.1304 test.cmd
orion.1322
# submit jobname-based dependency job
> msub -l depend=jobname.data1005 dataetl.cmd
orion.1428
DMEM
Format
<INTEGER>
Default
0
Description
Dedicated memory per task in bytes.
Example
> msub -l dmem=20480
Moab will dedicate 20 MB of
memory to the task.
EPILOGUE
Format
<STRING>
Description
Specifies a user owned epilogue script which is run before the system epilogue and epilogue.user scripts at the completion of a job. The syntax is epilogue=<file>. The file can be
designated with an absolute or relative path.
This parameter works only with Torque.
Example
> msub -l epilogue=epilogue_script.sh job.sh
EXCLUDENODES
Format
{<nodeid>|<node_range>}[:...]
Description
Specifies nodes that should not be considered for the given job.
Example
> msub -l excludenodes=k1:k2:k[5-8]
# Comma separated ranges work only with SLURM
> msub -l excludenodes=k[1-2,5-8]
FEATURE
Format
<FEATURE>[{:|}<FEATURE>]...
Description
Required list of node attribute/node features.
If the pipe (|) character is used as a delimiter, the features are logically OR'd together and
the associated job may use resources that match any of the specified features.
Requesting node names as features will result in the job being blocked from running.
Example
> qsub -l feature='fastos:bigio' testjob.cmd
GATTR
Format
<STRING>
Description
Generic job attribute associated with the job. The maximum size for an attribute is 63 bytes (the
core Moab size limit of 64, including a null byte).
Example
> qsub -l gattr=bigjob
GMETRIC
Format
Generic metric requirement for allocated nodes where the requirement is specified using the
format <GMNAME>[:{lt:,le:,eq:,ge:,gt:,ne:}<VALUE>]
Description
Indicates generic constraints that must be found on all allocated nodes. If a <VALUE> is not specified, the node must simply possess the generic metric (See Generic Metrics for more information.).
Example
> qsub -l gmetric=bioversion:ge:133244 testj.txt
GPUs
Format
msub -l nodes=<VALUE>:ppn=<VALUE>:gpus=<VALUE>[:mode][:reseterr]
Where mode is one of:
exclusive - The default setting. The GPU is used exclusively by one process thread.
exclusive_thread - The GPU is used exclusively by one process thread.
exclusive_process - The GPU is used exclusively by one process regardless of process thread.
If present, reseterr resets the ECC memory bit error counters. This only resets the volatile error
counts, or errors since the last reboot. The permanent error counts are not affected.
Moab passes the mode and reseterr portion of the request to Torque for processing.
Moab does not support requesting GPUs as a GRES. Submitting msub -l gres=gpus:x
does not work.
Description
Moab schedules GPUs as a special type of node-locked generic resources. When Torque reports
GPUs to Moab, Moab can schedule jobs and correctly assign GPUs to ensure that jobs are scheduled efficiently. To have Moab schedule GPUs, configure them in Torque then submit jobs using
the "GPU" attribute. Moab automatically parses the "GPU" attribute and assigns them in the correct manner. For information about GPU metrics, see GPGPUMetrics.
Examples
> msub -l nodes=2:ppn=2:gpus=1:exclusive_process:reseterr
Submits a job that requests 2 tasks, 2 processors and 1 GPU per task (2 GPUs total). Each
GPU runs only threads related to the task and resets the volatile ECC memory bit error
counts at job start time.
> msub -l nodes=4:gpus=1,tpn=2
Submits a job that requests 4 tasks, 1 GPU per node (4 GPUs total), and 2 tasks per node.
Each GPU is dedicated exclusively to one task process and the ECC memory bit error
counters are not reset.
> msub -l nodes=4:gpus=1:reseterr
Submits a job that requests 4 tasks, 1 processor and 1 GPU per task (4 GPUs total). Each
GPU is dedicated exclusively to one task process and resets the volatile ECC memory bit
error counts at job start time.
> msub -l nodes=4:gpus=2+1:ppn=2,walltime=600
Submits a job that requests two different types of tasks: the first is 4 tasks, each with 1 processor and 2 GPUs; the second is 1 task with 2 processors. Each GPU is dedicated exclusively to one task process and the ECC memory bit error counters are not reset.
GRES and SOFTWARE
Format
Percent sign (%) delimited list of generic resources where each resource is specified using the
format <RESTYPE>[{+|:}<COUNT>]
Description
Indicates generic resources required by the job. If the generic resource is node-locked, it is a per-task count. If a <COUNT> is not specified, the resource count defaults to 1.
Example
> qsub -W x=GRES:tape+2%matlab+3 testj.txt
When specifying more than one generic resource with -l, use the percent (%) character to
delimit them.
> qsub -l gres=tape+2%matlab+3 testj.txt
> qsub -l software=matlab:2 testj.txt
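As an illustrative variant, omitting <COUNT> requests a single unit of the resource (the documented default):
> qsub -l gres=matlab testj.txt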
HOSTLIST
Format
Comma (,) or plus (+) delimited list of hostnames. Ranges and regular expressions are supported
in msub only.
Description
Indicates an exact set, superset, or subset of nodes on which the job must run. Use the caret (^) or asterisk (*) character to specify a host list as a superset or subset, respectively.
An exact set is defined without a caret or asterisk. An exact set means all the hosts in the specified
hostlist must be selected for the job.
A subset means the specified hostlist is used first to select hosts for the job. If the job requires
more hosts than are in the subset hostlist, they will be obtained from elsewhere if possible. If the
job does not require all of the nodes in the subset hostlist, it will use only the ones it needs.
A superset means the hostlist is the only source of hosts that should be considered for running the
job. If the job can't find the necessary resources in the superset hostlist it should not run. No other
hosts should be considered in allocating the job.
Torque ignores hostlist as an extension. Hostlist is only supported in Moab.
Examples
> msub -l hostlist=nodeA+nodeB+nodeE
hostlist=foo[1-5]
This is an exact set of (foo1,foo2,...,foo5). The job must run on all these nodes.
hostlist=foo1+foo[3-9]
This is an exact set of (foo1,foo3,foo4,...,foo9). The job must run on all these nodes.
hostlist=foo[1,3-9]
This is an exact set of the same nodes as the previous example.
hostlist=foo[1-3]+bar[72-79]
This is an exact set of (foo1,foo2,foo3,bar72,bar73,...,bar79). The job must run on all these
nodes.
hostlist=^node[1-50]
This is a superset of (node1,node2,...,node50). These are the only nodes that can be
considered for the job. If the necessary resources for the job are not in this hostlist, the job
is not run. If the job does not require all the nodes in this hostlist, it will use only the ones
that it needs.
hostlist=*node[15-25]
This is a subset of (node15,node16,...,node25). The nodes in this hostlist are considered first
for the job. If the necessary resources for the job are not in this hostlist, Moab tries to
obtain the necessary resources from elsewhere. If the job does not require all the nodes in
this hostlist, it will use only the ones that it needs.
JGROUP
Format
<JOBGROUPID>
Description
ID of job group to which this job belongs (different from the GID of the user running the job).
Example
> msub -l JGROUP=bluegroup
JOBFLAGS (aka FLAGS)
Format
One or more of the following colon-delimited job flags: ADVRES[:RSVID], NOQUEUE, NORMSTART, PREEMPTEE, PREEMPTOR, RESTARTABLE, or SUSPENDABLE (see the job flag overview for a complete listing).
Description
Associates various flags with the job.
Example
> qsub -l nodes=1,walltime=3600,jobflags=advres myjob.py
JOBREJECTPOLICY
Format
One or more of CANCEL, HOLD, IGNORE, MAIL, or RETRY
Default
HOLD
Description
Specifies the action to take when the scheduler determines that a job can never run. CANCEL
issues a call to the resource manager to cancel the job. HOLD places a batch hold on the job
preventing the job from being further evaluated until released by an administrator.
Administrators can dynamically alter job attributes and possibly fix the job with mjobctl -m.
With IGNORE, the scheduler will allow the job to exist within the resource manager queue but will
neither process it nor report it. MAIL will send email to both the admin and the user when rejected
jobs are detected. If RETRY is set, then Moab will allow the job to remain idle and will only attempt
to start the job when the policy violation is resolved. Any combination of attributes may be
specified.
This is a per-job policy specified with msub -l. JOBREJECTPOLICY also exists as a global parameter.
Also see QOSREJECTPOLICY.
Example
> msub -l jobrejectpolicy=cancel:mail
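Another illustrative combination (the script name is a placeholder) keeps the job idle until the violation is resolved while notifying the admin and user:
> msub -l jobrejectpolicy=retry:mail job.sh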
MAXMEM
Format
<INTEGER> (in megabytes)
Description
Maximum amount of memory the job may consume across all tasks before the JOBMEM action is
taken.
Example
> qsub -l x=MAXMEM:1000mb bw.cmd
If a RESOURCELIMITPOLICY is set for per-job memory utilization, its action will be taken
when this value is reached.
MAXPROC
Format
<INTEGER>
Description
Maximum CPU load the job may consume across all tasks before the JOBPROC action is taken.
Example
> qsub -W x=MAXPROC:4 bw.cmd
If a RESOURCELIMITPOLICY is set for per-job processor utilization, its action will be
taken when this value is reached.
MEM
Format
<INTEGER>
Description
Specifies the maximum amount of physical memory used by the job. If you do not specify MB or GB, Moab uses bytes if your resource manager is Torque and MB if your resource manager is Native.
Example
> msub -l nodes=4:ppn=2,mem=1024mb
The job must have 4 compute nodes with 2 processors per node. The job is limited to 1024
MB of memory.
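A variant sketch with an explicit GB unit (values are illustrative):
> msub -l nodes=1:ppn=1,mem=2gb job.sh
The job is limited to 2 GB of physical memory.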
MICs
Format
msub -l nodes=<VALUE>:ppn=<VALUE>:mics=<VALUE>[:mode]
Where mode is one of:
exclusive - The default setting. The MIC is used exclusively by one process thread.
exclusive_thread - The MIC is used exclusively by one process thread.
exclusive_process - The MIC is used exclusively by one process regardless of process thread.
Moab passes the mode portion of the request to Torque for processing.
Moab does not support requesting MICs as a GRES. Submitting msub -l gres=mics:x
does not work.
Description
Moab schedules MICs as a special type of node-locked generic resource. When Torque reports MICs to Moab, Moab can schedule jobs and correctly assign MICs to ensure that jobs are scheduled efficiently. To have Moab schedule MICs, configure them in Torque, then submit jobs using the MIC attribute. Moab automatically parses the MIC attribute and assigns MICs in the correct manner.
Examples
> msub -l nodes=2:ppn=2:mics=1:exclusive_process
Submits a job that requests 2 tasks, 2 processors and 1 MIC per task (2 MICs total). Each
MIC runs only threads related to the task.
> msub -l nodes=4:mics=1,tpn=2
Submits a job that requests 4 tasks, 1 MIC per task (4 MICs total), and 2 tasks per node.
Each MIC is dedicated exclusively to one task process.
> msub -l nodes=4:mics=1
Submits a job that requests 4 tasks, 1 processor and 1 MIC per task (4 MICs total). Each
MIC is dedicated exclusively to one task process.
> msub -l nodes=4:mics=2+1:ppn=2,walltime=600
Submits a job that requests two different types of tasks: the first is 4 tasks, each with 1 processor and 2 MICs; the second is 1 task with 2 processors. Each MIC is dedicated exclusively to one task process.
MINPREEMPTTIME
Format
[[[DD:]HH:]MM:]SS
Description
Minimum time job must run before being eligible for preemption.
Can only be specified if the associated QoS allows per-job preemption configuration by setting the preemptconfig flag.
Example
> qsub -l minpreempttime=900 bw.cmd
Job cannot be preempted until it has run for 15 minutes.
MINPROCSPEED
Format
<INTEGER>
Default
0
Description
Minimum processor speed (in MHz) for every node that this job will run on.
Example
> qsub -W x=MINPROCSPEED:2000 bw.cmd
Every node that runs this job must have a processor speed of at
least 2000 MHz.
MINWCLIMIT
Format
[[[DD:]HH:]MM:]SS
Default
---
Description
Minimum wallclock limit a job must run before being eligible for extension (see JOBEXTENDDURATION or JOBEXTENDSTARTWALLTIME).
Example
> qsub -l minwclimit=300,walltime=16000 bw.cmd
Job will run for at least 300 seconds but up to 16,000 seconds if possible (without
interfering with other jobs).
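The limit also accepts the [[[DD:]HH:]MM:]SS form; this sketch (values illustrative) requests at least 5 minutes, extendable up to 2 hours:
> qsub -l minwclimit=00:05:00,walltime=02:00:00 bw.cmd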
MSTAGEIN
Format
[<SRCURL>[|<SRCURL>...]%]<DSTURL>
Description
Indicates a job has data staging requirements. The source URL(s) listed will be transferred to the
execution system for use by the job. If more than one source URL is specified, the destination URL
must be a directory.
The format of <SRCURL> is: [PROTO://][HOST[:PORT]][/PATH], where the path is local.
The format of <DSTURL> is: [PROTO://][HOST[:PORT]][/PATH], where the path is remote.
PROTO can be any of the following protocols: ssh, file, or gsiftp.
HOST is the name of the host where the file resides.
PATH is the path of the source or destination file. The destination path may be a directory when sending a single file and must be a directory when sending multiple files. If a directory is specified, it must
end with a forward slash (/).
Valid variables include:
$JOBID
$HOME - Path the script was run from
$RHOME - Home dir of the user on the remote system
$SUBMITHOST
$DEST - The Moab instance where the job will run
$LOCALDATASTAGEHEAD
If no destination is given, the protocol and file name will be set to the same as the source.
The $RHOME (remote home directory) variable is for when a user's home directory on the
compute node is different than on the submission host.
Example:
> msub -W x='mstagein=file://$HOME/helperscript.sh|file:///home/dev/datafile.txt%ssh://host/home/dev/' script.sh
Copies helperscript.sh and datafile.txt from the local machine to /home/dev/ on host for use in the execution of script.sh. $HOME is a path containing a leading slash (for example, /home/adaptive).
MSTAGEOUT
Format
[<SRCURL>[|<SRCURL>...]%]<DSTURL>
Description
Indicates a job has data staging requirements. The source URL(s) listed will be transferred
from the execution system after the completion of the job. If more than one source URL is specified, the
destination URL must be a directory.
The format of <SRCURL> is: [PROTO://][HOST[:PORT]][/PATH], where the path is remote.
The format of <DSTURL> is: [PROTO://][HOST[:PORT]][/PATH], where the path is local.
PROTO can be any of the following protocols: ssh, file, or gsiftp.
HOST is the name of the host where the file resides.
PATH is the path of the source or destination file. The destination path may be a directory when sending a single file and must be a directory when sending multiple files. If a directory is specified, it must
end with a forward slash (/).
Valid variables include:
$JOBID
$HOME - Path the script was run from
$RHOME - Home dir of the user on the remote system
$SUBMITHOST
$DEST - The Moab instance where the job will run
$LOCALDATASTAGEHEAD
If no destination is given, the protocol and file name will be set to the same as the source.
The $RHOME (remote home directory) variable is for when a user's home directory on the
compute node is different than on the submission host.
Example
> msub -W x='mstageout=ssh://$DEST/$HOME/resultfile1.txt|ssh://host/home/dev/resultscript.sh%file:///home/dev/' script.sh
Copies resultfile1.txt and resultscript.sh from the execution system to /home/dev/ after the execution of script.sh is complete. $HOME is a path containing a leading slash (for example, /home/adaptive).
NACCESSPOLICY
Format
One of SHARED, SINGLEJOB, SINGLETASK, SINGLEUSER, or UNIQUEUSER
Description
Specifies how node resources should be accessed. (See Node Access Policies for more information).
The naccesspolicy option can only be used to make node access more constraining than is specified by the system, partition, or node policies. For example, if the effective node access policy is shared, naccesspolicy can be set to singleuser; if the effective node access policy is singlejob, naccesspolicy can be set to singletask.
Example
> qsub -l naccesspolicy=singleuser bw.cmd
> bsub -ext naccesspolicy=singleuser lancer.cmd
Job can only allocate free nodes or nodes running jobs by the same user.