XC™ Series Lustre® Administration Guide (CLE 6.0.UP01)

Contents
1 About this Publication.............................................................................................................................................5
2 Introduction to Lustre on CLE 6.0...........................................................................................................................8
2.1 Lustre Software Components and Node Types.......................................................................................10
2.2 Lustre Framework....................................................................................................................................11
3 Configuring Lustre File Systems...........................................................................................................................12
3.1 Configure Lustre File Systems on a Cray System...................................................................................12
3.1.1 Ensure a File System Definition is Consistent with Cray System Configurations......................12
3.1.2 File System Definition Parameters.............................................................................................13
3.1.3 Mount and Unmount Lustre Clients Manually............................................................................17
3.1.4 Unmount Lustre from Compute Node Clients Using lustre_control...................................17
3.1.5 Configure the NFS client to Mount the Exported Lustre File System.........................................17
3.1.6 Lustre Option for Panic on LBUG for CNL and Service Nodes...................................................18
3.1.7 Verify Lustre File System Configuration.....................................................................................18
3.2 Configure Striping on Lustre File Systems..............................................................................................19
3.2.1 Configuration and Performance Trade-off for Striping................................................................20
3.2.2 Override File System Striping Defaults......................................................................................20
4 LNet Routing.........................................................................................................................................................21
4.1 Recommended LNet Router Node Parameters.......................................................................................21
4.2 LNet Feature: Router Pinger...................................................................................................................22
4.3 LNet Feature: ARF Detection..................................................................................................................22
4.4 External Server Node Recommended LNet Parameters.........................................................................23
4.5 Internal Client Recommended LNet Parameters.....................................................................................23
4.6 Internal Server (DAL) Recommended LNet Parameters.........................................................................24
4.7 Configure Lustre PTLRPC ldlm_enqueue_min Parameter..................................................................24
4.8 DVS Server Node Recommended LNet Parameters..............................................................................25
4.9 External Client Recommended LNet Parameters....................................................................................25
4.10 LNet Feature: Peer Health.....................................................................................................................25
4.11 Manually Control LNet Routers.............................................................................................................26
5 Configure Fine-grained Routing with clcvt............................................................................................................27
5.1 clcvt Prerequisite Files.............................................................................................................................27
5.2 The info.file-system-identifier File............................................................................................................28
5.3 The client-system.hosts File....................................................................................................................29
5.4 The client-system.ib File..........................................................................................................................31
5.5 The cluster-name.ib File..........................................................................................................................32
5.6 The client-system.rtrIm File.....................................................................................................................33
5.7 Generate ip2nets and routes Information................................................................................................33
5.8 Create the persistent-storage File.................................................................................................34
5.9 Create ip2nets and routes Information for the Compute Nodes..............................................................34
5.10 Create ip2nets and routes information for service node Lustre clients (MOM and internal login nodes)....................................35
5.11 Create ip2nets and routes information for the LNet router nodes.........................................................36
5.12 Create ip2nets and routes Information for the Lustre Server Nodes.....................................................36
6 Lustre System Administration...............................................................................................................................38
6.1 Lustre Commands for System Administrators.........................................................................................38
6.2 Identify MDS and OSTs...........................................................................................................................38
6.3 Start Lustre..............................................................................................................................................39
6.4 Stop Lustre..............................................................................................................................................39
6.5 Add OSSs and OSTs...............................................................................................................................40
6.6 Recover From a Failed OST....................................................................................................................42
6.6.1 Deactivate a Failed OST and Remove Striped Files..................................................................42
6.6.2 Reformat a Single OST..............................................................................................................43
6.7 OSS Read Cache and Writethrough Cache............................................................................................44
6.8 Lustre 2.x Performance and max_rpcs_in_flight............................................................................45
6.9 Check Lustre Disk Usage........................................................................................................................45
6.10 Lustre User and Group Quotas.............................................................................................................46
6.11 Check the Lustre File System................................................................................................................46
6.11.1 Perform an e2fsck on All OSTs in a File System with lustre_control.............................46
6.12 Lustre liblustreapi Usage.......................................................................................................................46
6.13 Dump Lustre Log Files..........................................................................................................................47
6.14 File System Error Messages.................................................................................................................47
6.15 Lustre Users Report ENOSPC Errors......................................................................................................47
7 Lustre Failover on Cray Systems..........................................................................................................................48
7.1 Configuration Types for Failover..............................................................................................................48
7.2 Configure Manual Lustre Failover...........................................................................................................48
7.2.1 Configure DAL Failover for CLE 6.0...........................................................................................49
7.2.2 Perform Lustre Manual Failover.................................................................................................51
7.2.3 Monitor Recovery Status............................................................................................................52
7.3 Lustre Automatic Failover........................................................................................................................52
7.3.1 Lustre Automatic Failover Database Tables...............................................................................53
7.3.2 Back Up SDB Table Content......................................................................................................55
7.3.3 Use the xtlusfoadmin Command...........................................................................................56
7.3.4 System Startup and Shutdown when Using Automatic Lustre Failover.....................................57
7.3.5 Configure Lustre Failover for Multiple File Systems...................................................................59
7.4 Back Up and Restore Lustre Failover Tables..........................................................................................60
7.5 Perform Lustre Failback on CLE Systems...............................................................................................61
8 LMT Configuration for DAL...................................................................................................................................63
8.1 Configure LMT MySQL Database for DAL..............................................................................................63
8.2 Configure the LMT GUI...........................................................................................................................65
8.3 Configure LMT MySQL for Remote Access ............................................................................................66
8.4 LMT Disk Usage......................................................................................................................................67
9 LMT Overview.......................................................................................................................................................69
9.1 View and Aggregate LMT Data................................................................................................................70
9.2 Remove LMT Data..................................................................................................................................71
9.3 Stop Cerebro and LMT............................................................................................................................71
9.4 Delete the LMT MySQL Database...........................................................................................................72
9.5 LMT Database Recovery Process...........................................................................................................72
1 About this Publication
This release includes major revisions to support Lustre® 2.5.4 in CLE. Additional information about Lustre is available from: https://wiki.hpdd.intel.com/display/PUB/Documentation
Lustre information in this guide is based, in part, on documentation from Oracle®, Whamcloud®, and Intel®. Lustre
information contained in Cray publications supersedes information found in Intel publications.
Revisions to this Publication
Content in this publication was previously released in April 2015. It was included in the Manage Lustre for the
Cray Linux Environment (CLE), S-0010 and the XC CLE System Administration Guide, S-2393, which have been
combined in this single publication.
Previous Lustre content that supports CLE 5.2.UP04 or earlier releases is available from http://docs.cray.com.
Related Publications
● XC™ Series System Administration Guide
● XC™ Series System Software Installation and Configuration Guide
Typographic Conventions
Monospace
    Indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, key strokes (e.g., Enter and Alt-Ctrl-F), and other software constructs.
Monospaced Bold
    Indicates commands that must be entered on a command line or in response to an interactive prompt.
Oblique or Italics
    Indicates user-supplied values in commands or syntax definitions.
Proportional Bold
    Indicates a graphical user interface (GUI) window or element.
\ (backslash)
    At the end of a command line, indicates the Linux® shell line continuation character (lines joined by a backslash are parsed as a single line). Do not type anything after the backslash or the continuation feature will not work correctly.
Command Prompts
hostname in command prompts
    Hostnames in command prompts indicate where the command must be run.
    hostname#       Run the command on this hostname.
    smw#            Run the command on the SMW.
    boot#           Run the command on the boot node.
    sdb#            Run the command on the SDB node.
    login#          Run the command on any login node.
    smw1#
    smw2#           For a system configured with the SMW failover feature there are two SMWs—one in an active role and the other in a passive role. The SMW that is active at the start of a procedure is smw1. The SMW that is passive is smw2.
    smwactive#
    smwpassive#     In some scenarios, the active SMW is smw1 at the start of a procedure—then the procedure requires a failover to the other SMW. In this case, the documentation will continue to refer to the formerly active SMW as smw1, even though smw2 is now the active SMW. If further clarification is needed in a procedure, the active SMW will be called smwactive and the passive SMW will be called smwpassive.
account name in command prompts
    The account that must run the command is also indicated in the prompt.
    smw#
    boot#
    sdb#
    login#
    hostname#       The root or super-user account always has the # character at the end of the prompt.
    crayadm@smw>
    crayadm@login>
    user@hostname>  Any non-root account is indicated with account@hostname. A user account that is neither root nor crayadm is referred to as user.
Scope and Audience
This publication is written for experienced Cray system software administrators.
Trademarks
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and
design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: APPRENTICE2,
CHAPEL, CLUSTER CONNECT, CRAYDOC, CRAYPAT, CRAYPORT, DATAWARP, ECOPHLEX, LIBSCI,
NODEKARE. The following system family marks, and associated model number marks, are trademarks of Cray
Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from
LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Lustre is a registered
trademark of Xyratex Technology, Ltd. in the United States. Other trademarks used in this document are the
property of their respective owners.
Feedback
Visit the Cray Publications Portal at http://pubs.cray.com. Make comments online using the Contact Us button in
the upper-right corner or email pubs@cray.com. Your comments are important to us and we will respond within 24
hours.
2 Introduction to Lustre on CLE 6.0
The Lustre file system is optional on Cray systems with the Cray Linux environment (CLE 6.0). Storage RAID may be configured with other file systems as site requirements dictate. Lustre is a scalable, high-performance, POSIX-compliant file system. It consists of software subsystems, storage, and an associated network (LNet).
Lustre uses the ldiskfs file system for back-end storage. The ldiskfs file system is an extension to the Linux ext4 file system with enhancements for Lustre. The packages for the Lustre file system are installed during the CLE 6.0 software installation.
There are two different options for installing a Lustre file system on a Cray system. Direct-attached Lustre (DAL) is connected to the Aries high-speed network (HSN) and is not routed. External Lustre is typically InfiniBand (IB) based—it connects through Cray service nodes that route LNet traffic to the HSN.
Direct-attached Lustre (DAL)
DAL file system servers are Cray service nodes connected to the Aries HSN. They access the logical units
(LUNs) of storage on external RAIDs, either through Fibre Channel or IB connections.
Figure 1. Direct-Attached Lustre Block Diagram (DAL service nodes hosting the MDS/MGS, OSS1, and OSS2 sit on the Aries HSN within the Cray XC Series system and connect through HCAs and an IB switch to the MDT/MGT and OST LUNs on external RAID; compute node and service node Lustre clients access them over the HSN)
External Lustre
External Lustre file systems (such as Cray Sonexion) use Cray service nodes as LNet routers that connect the Sonexion Lustre servers, which reside on an external IB storage network, to the Aries HSN.
Figure 2. External Lustre Block Diagram (compute node, service node, and MOM node Lustre clients on the Aries HSN reach the external Lustre servers, the MDS/MGS, OSS1, and OSS2, through LNet router nodes and an IB switch; the servers connect through a SAS/FC/IB switch to the MDT/MGT and OST storage)
Cray Configuration Management Framework (CMF)
Lustre setup and management on CLE 6.0 is accomplished using the Cray configuration management framework
(CMF). The CMF comprises configuration data for the entire system, the tools to manage and distribute that data,
and the software to apply the configuration data to the running image at boot time.
The cfgset command and the configurator that it invokes are the primary tools that Cray provides for managing
configuration data. Some services are not yet supported with the configurator—such as managing Cerebro and
creating the Lustre monitoring tool (LMT) MySQL database on the SMW—and must be configured manually.
IMPS Distribution Service
The Cray image management and provisioning system (IMPS) distribution service (IDS) makes the configuration
sets that are on the management node available to all Lustre client and server nodes in DAL systems, or LNet
nodes to support Sonexion systems. At each node, config set data is consumed by Ansible plays—another component of the CMF—which act upon that data during the booting phase so that each node dynamically self-configures. Lustre services are therefore configured through settings in the configuration set, which the Lustre nodes use to self-configure at boot.
The services in the configuration set for managing Lustre are:
● cray_lustre_server – Lustre servers
● cray_lustre_client – Lustre clients
● cray_lnet – LNet routers
● cray_net – Network configuration
● cray_lmt – Lustre monitoring tool (LMT)
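Each of these services is edited with the cfgset command, using the same invocation form shown later in this guide. For example, a sketch of opening the LNet router settings for interactive editing (p0 is a placeholder for the site's config set name):
smw# cfgset update --service cray_lnet -l advanced --mode interactive p0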
IMPS Recipes
IMPS recipes are included with the system to support the various Lustre node types.
List the IMPS recipes for Lustre:
smw# recipe list |grep lustre
compute-large-lustre-2.x_cle_rhine_sles_12_x86-64_ari
compute-large-lustre-master_cle_rhine_sles_12_x86-64_ari
compute-lustre-2.x_cle_rhine_sles_12_x86-64_ari
compute-lustre-master_cle_rhine_sles_12_x86-64_ari
dal-lustre-2.x_cle_rhine_centos_6.5_x86-64_ari
dal-lustre-master_cle_rhine_centos_6.5_x86-64_ari
elogin-large-lustre-2.x_cle_rhine_sles_12_x86-64_ari
elogin-lustre-2.x_cle_rhine_sles_12_x86-64_ari
...
Show more information about a specific Lustre IMPS recipe:
smw# recipe show service-lustre-2.x_cle_rhine_sles_12_x86-64_ari --fields description
service-lustre-2.x_cle_rhine_sles_12_x86-64_ari:
description: Generic service node with Lustre 2.x, specifically excludes the login node.
2.1 Lustre Software Components and Node Types
The following Lustre software components can be implemented on selected nodes of the Cray system running
CLE 6.0 UP01.
Clients
    Services or programs that access the file system. On Cray systems, clients are typically associated with login or compute nodes.
LNet Routers
    Transfer LNet traffic between Lustre servers and Lustre clients that are on different networks. On Cray systems, LNet routers are typically used to connect external Lustre file systems on InfiniBand (IB) networks to the Cray high-speed network (HSN).
Object Storage Target (OST)
    Software interface to back-end storage volumes. There may be one or more OSTs. The client performs parallel I/O operations across multiple OSTs. Configure the characteristics of the OSTs during the Lustre setup.
Object Storage Server (OSS)
    Node that hosts the OSTs. Each OSS node, referenced by node ID (NID), has Fibre Channel or IB connections to a RAID controller. The OST is a logical device; the OSS is the physical node.
Metadata Server (MDS)
    Owns and manages information about the files in the Lustre file system. Handles namespace operations such as file creation, but does not contain any file data. Stores information about which file is located on which OSTs, how the blocks of files are striped across the OSTs, the date and time the file was modified, and so on. The MDS is consulted whenever a file is opened or closed. Because file namespace operations are done by the MDS, they do not impact operations that manipulate file data. Configure MDS characteristics during the Lustre setup.
Metadata Target (MDT)
    Software interface to back-end storage volumes for the MDS. Stores metadata for the Lustre file system.
Management Server (MGS)
    Controls the configuration information for all Lustre file systems running at a site. Clients and servers contact the MGS to retrieve or change configuration information. Cray installation and upgrade utilities automatically create a default Lustre configuration where the MGS and the MDS are co-located on a service node and share the same physical device for data storage.
2.2 Lustre Framework
The system processes (Lustre components) that run on Cray nodes are referred to as Lustre services throughout
this topic. The interactions of these services make up the framework for the Lustre file system as follows. The
metadata server (MDS) transforms client requests into journaled, batched, metadata updates on persistent
storage. The MDS can batch large numbers of requests from a single client. It can also batch large numbers of
requests generated by different clients, such as when many clients are updating a single object. After objects
have been created by the MDS, the object storage server (OSS) handles remote procedure calls from clients and
relays the transactions to the appropriate objects. The OSS read cache uses the Linux page cache to store data
on a server until it can be written. Site administrators and analysts should take this into consideration as it may
impact service node memory requirements. For more information, see OSS Read Cache and Writethrough Cache
on page 44.
The characteristics of the MDS—such as how files are stored across object storage targets (OSTs)—can be
configured as part of the Lustre setup. Each pair of subsystems acts according to protocol.
MDS-Client  The MDS interacts with the client for metadata handling such as the acquisition and updates of inodes, directory information, and security handling.
OST-Client  The OST interacts with the client for file data I/O, including the allocation of blocks, striping, and security enforcement.
MDS-OST     The MDS and OST interact to pre-allocate resources and perform recovery.
The Lustre framework enables files to be structured at file system installation to match data transfer requirements.
One MDS plus one or more OSTs make up a single instance of Lustre and are managed together. Client nodes
mount the Lustre file system over the network and access files with POSIX file system semantics. Each client
mounts Lustre, uses the MDS to access metadata, and performs file I/O directly through the OSTs.
Figure 3. Layout of Lustre File System (a user application on a compute node issues I/O through library routines and the LNet LND over the system interconnection network to the OSS, backed by RAID storage, and to the MDS, which holds the metadata)
3 Configuring Lustre File Systems
3.1 Configure Lustre File Systems on a Cray System
The Cray Linux environment (CLE) software includes Lustre control utilities from Cray. These utilities access site-specific parameters stored in a file system definition (fs_name.fs_defs) file and use that information to
interface with the Lustre MountConf system and management server (MGS). When using the Lustre control
configuration utilities, system administrators do not need to access the MGS directly. The lustre_control
command and fs_defs are used to manage direct-attached Lustre (DAL) file systems. Sonexion systems use
the cscli and other commands.
The file system definition file (fs_name.fs_defs) describes the characteristics of a file system—MDS, OST,
clients, network, and storage specifications—in combination with configuring the cray_lnet,
cray_lustre_client, cray_lustre_server, and cray_net services in the configuration set. The first task
in setting up a Lustre file system on a Cray system is to create a unique file system definition file with values
appropriate for the site. Each fs_defs file represents one file system—if there is more than one Lustre file
system, a fs_defs file must be created for each file system.
An optional file system tuning file (fs_name.fs_tune) contains commands for setting Lustre tunable
parameters. It is passed as an argument to the lustre_control set_tune command. This command can be
used to set parameters for multiple file systems. It is also available as a convenience feature for administrators
who wish to modify their file system settings.
The lustre_control utility generates the appropriate commands to manage Lustre file system operations on a
CLE system. By convention, the Lustre control utilities and example fs_defs and fs_tune files are located
in /opt/cray-xt-lustre-utils/default/etc on the SMW.
Service node and compute node clients reference Lustre like a local file system. References to Lustre are
handled transparently through the virtual file system (VFS) switch in the kernel. Lustre file systems can be
mounted and unmounted with the mount_clients and umount_clients actions of lustre_control.
The Lustre file systems are mounted on compute node clients automatically during startup. Lustre file systems
can also be manually mounted using the mount command.
Use cfgset to modify the cray_lustre_client or cray_lustre_server services in the configuration set, including the Lustre client mount points and settings for DAL.
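As a rough sketch of the overall workflow (the install argument and start action shown here are typical usage and should be confirmed against the lustre_control(8) man page; the -a and -c options follow their use elsewhere in this guide), a file system definition is installed, started, and mounted from the boot node along these lines:
boot# lustre_control install /path/to/fs_name.fs_defs
boot# lustre_control start -a
boot# lustre_control mount_clients -c -a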
3.1.1 Ensure a File System Definition is Consistent with Cray System Configurations
It is possible for /dev/sd* type device names to change upon reboot of a Cray Linux environment (CLE) system.
Host names and node identifiers (NIDs) are dynamically allocated in Cray systems running CLE and do not otherwise change.
CAUTION: Use persistent device names in the Lustre file system definition. Non-persistent device names
(for example, /dev/sdc) can change when the system reboots. If non-persistent names are specified in
the fs_name.fs_defs file, then Lustre may try to mount the wrong devices and fail to start when the
system reboots.
For more information about Lustre control utilities, see the lustre_control(8) and
lustre.fs_defs(5) man pages.
Several options within lustre_control enable an administrator to prepare for hardware and software
upgrades, link failover, and other dynamics one may encounter that can render the Lustre file system unusable. It
is possible that host names, NIDs, and/or device names of either Lustre servers or their storage targets will reflect
a configuration different than what is found in the file system definition file.
SCSI device names (/dev/sd*) are not guaranteed to be numbered the same from boot to boot. This
inconsistency can cause serious problems following a reboot—the Lustre configuration specified in the Lustre file
system definition file may differ from actual device names, resulting in a failure to start the file system. Because of
this behavior, Cray strongly recommends that persistent device names for Lustre are configured.
Cray supports and tests the /dev/disk/by-id persistent device naming conventions. The by-id names
typically include a portion of the device serial number in the name. For
example, /dev/disk/by-id/scsi-3600a0b800026e1407000192e4b66eb97.
A separate udev rule can be used to create aliases for these devices.
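For example, a udev rule along the following lines (the rules file name, serial number, and alias are hypothetical) could create a stable /dev/lustre/ost0 alias for a specific LUN:
# /etc/udev/rules.d/99-lustre-aliases.rules (hypothetical example)
# Create /dev/lustre/ost0 for the disk whose udev-reported ID_SERIAL matches the by-id name above
KERNEL=="sd*", ENV{ID_SERIAL}=="3600a0b800026e1407000192e4b66eb97", SYMLINK+="lustre/ost0"
The by-id path itself can be used directly in fs_defs files; an alias is simply a convenience.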
3.1.2 File System Definition Parameters
When the lustre_control utility is used, the first step is to create a Lustre file system definition file
(fs_name.fs_defs) for each Lustre file system. A sample file system definition file is provided
in /etc/opt/cray/lustre-utils/default/etc/example.fs_defs on the SMW.
File system definition parameters use the following conventions for node and device naming:
● nodename is a host or node name using the format nidxxxxx; for example, nid00008
● device is a device path using the format /dev/disk/by-id/ID-partN where ID is the volume identifier and partN is the partition number (if applicable); for example: /dev/disk/by-id/scsi-3600a0b800026e1400000192e4b66eb97-part2
CAUTION: Use persistent device names in the Lustre file system definition. Non-persistent device
names (for example, /dev/sdc) can change when the system reboots. If non-persistent names are
specified in the fs_name.fs_defs file, then Lustre may try to mount the wrong devices and fail to
start when the system reboots.
For more information about Lustre control utilities, see the lustre_control(8) and
lustre.fs_defs(5) man pages.
● target_type can be one of ost, mdt, or mgt (if fs_name.fs_defs parameters are changed, always run the lustre_control install command to regenerate the Lustre configuration and apply the changes)
3.1.2.1 Required File System Definitions
The following parameters must be defined in a fs_name.fs_defs file.
fs_name: example
    Specify the unique name for the Lustre file system defined by this fs_name.fs_defs file. This parameter is limited to eight characters. Used internally by Lustre.
nid_map: nodes=nid000[27-28,31] nids=[27-28,31]@gni
    Lustre server hosts to LNet nid mapping. Each line listed here should have a 1:1 mapping between the node name and its associated LNet nid. Use multiple lines for node names that are mapped to multiple LNet nids. Multiple lines are additive. Use pdsh hostlist expressions; for example, prefix[a,k-l,...] where a,k,l are integers with k < l.
3.1.2.1.1 Device Configuration
Device configuration parameters for the metadata target (MDT), management target (MGT), and object storage
targets (OSTs) must be defined. Target device descriptions can span multiple lines and they accept the
components listed in the table.
Table 1. fs_name.fs_defs Device Configuration Components
node       Specifies the primary device host.
dev        Specifies the device path.
fo_node    Specifies the backup (failover) device host.
fo_dev     Specifies the backup (failover) device path. (Only required if different from the primary device path.)
jdev       Specifies the external journal device (for OST configuration only).
index      Force a particular OST or MDT index. If this component is specified for one OST or MDT, it should be specified for all of them. By default, the index is zero-based and is assigned based on the order in which devices are defined in this file. For example, the first OST has an index value of 0 and the second has an index value of 1, etc.
mdt: node=nodename dev=device fo_node=nodename
Specify at least the node and device for the metadata target. For failover configurations, also specify the failover
node.
mgt: node=nodename dev=device fo_node=nodename
Specify at least the node and device for the management target. For failover configurations, also specify the
failover node.
ost: node=nodename dev=device fo_node=nodename
Specify at least the node and device for the OST(s). For failover configurations, also specify the failover node.
Including an index value makes managing a large number of targets much easier.
3.1.2.1.2 Mount Path Patterns
Device target mount paths must be defined in a configuration. The table describes variables that may be used in
mount path definitions.
Table 2. fs_name.fs_defs Mount Path Variables
__fs_name__    File system name defined in fs_name:.
__label__      Component label. For example, foo-OST0002.
__type__       Component type. For example, mdt, mgt, or ost.
__index__      Target index. For example, 1, 2, 36, etc.
mgt_mount_path: /tmp/lustre/__fs_name__/__type__
Specify the mount path to the MGT.
mdt_mount_path: /tmp/lustre/__fs_name__/__type__
Specify the mount path to the MDT.
ost_mount_path: /tmp/lustre/__fs_name__/__type____index__
Specify the mount path to the OSTs.
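Pulling the required parameters together, a minimal fs_name.fs_defs fragment for a hypothetical file system named lus0 might look like the following (node names, device IDs, and stripe settings are placeholders for illustration only):
fs_name: lus0
nid_map: nodes=nid000[26-28] nids=[26-28]@gni
mgt: node=nid00026 dev=/dev/disk/by-id/scsi-3600a0b8000aaaa000000000001-part1
mdt: node=nid00026 dev=/dev/disk/by-id/scsi-3600a0b8000aaaa000000000001-part2
ost: node=nid00027 dev=/dev/disk/by-id/scsi-3600a0b8000bbbb000000000002 index=0
ost: node=nid00028 dev=/dev/disk/by-id/scsi-3600a0b8000cccc000000000003 index=1
mgt_mount_path: /tmp/lustre/__fs_name__/__type__
mdt_mount_path: /tmp/lustre/__fs_name__/__type__
ost_mount_path: /tmp/lustre/__fs_name__/__type____index__
stripe_size: 1048576
stripe_count: 2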
3.1.2.2 Optional File System Definitions
routers: nodes=nid000[80-90]
Specifies the service nodes that will be Lustre routers. Use pdsh
hostlist expressions—for example, prefix[a,k-l,...] where
a,k,l are integers with k < l.
auto_fo: yes
Set this parameter to yes to enable automatic failover when failover
is configured—set to no to select manual failover. The default setting
is yes. Automatic failover on direct-attached file systems is only
supported for combined MGS/MDT configurations.
imp_rec: yes
Set this parameter to yes to configure failover with imperative
recovery. This option only applies to direct-attached Lustre file
systems.
stripe_size: 1048576
Stripe size in bytes. This is automatically added to the relevant
mkfs.lustre format parameters. Cray recommends a default value
of 1048576 (1MB).
stripe_count: 1
Integer count of the default number of OSTs used for a file. This is
automatically added to the relevant mkfs.lustre format
parameters. Valid range is 1 to the number of OSTs. A value of -1
specifies striping across all OSTs. Cray recommends a stripe count
of 2 to 4 OSTs.
journal_size: 400
Journal size, in megabytes, on underlying ldiskfs file systems.
This is automatically added to the relevant mkfs.lustre format
parameters. The default value is 400.
journal_block_size: 4096
Journal block size, in bytes, for the journal device. This is
automatically added to the relevant mkfs.lustre format
parameters. The default value is 4096 (4KB).
timeout: 300
Lustre timeout in seconds. This is automatically added to the
relevant mkfs.lustre format parameters. The default value is 300.
back_fs_type: ldiskfs
Lustre backing file system type. This is automatically added to the
relevant mkfs.lustre format parameters. The default is ldiskfs.
mgt|mdt|ost_format_params
These "catchall" options, such as --device-size or --param, are passed to mkfs.lustre. Multiple lines are additive. For more information on available options, see the mkfs.lustre man page.
mgt|mdt|ost_mkfs_mount_options:
These options are passed to mkfs.lustre via --mountfsoptions="options". Mount options specified here replace the default mount options. Multiple lines are additive. The defaults for ldiskfs are:
OST: errors=remount-ro
MGT/MDT: errors=remount-ro,iopen_nopriv,user_xattr
mgt|mdt|ost_mkfs_options:
Format options for the backing file system. ext3 options can be set here—these options are wrapped with --mkfsoptions="" and passed to mkfs.lustre. Multiple lines are additive. For more information on options to format backing ext3 and ldiskfs file systems, see the mke2fs(8) man page.
mgt|mdt|ost_mount_options
Optional arguments used when starting a target. Default is no
options. For more information on mount options for Lustre file
systems, see the mount.lustre man page. If OSTs are larger than
8TB in size, the force_over_8tb option may need to be added to
this parameter for Lustre to start properly. For Lustre 2.x, if OSTs are
larger than 128TB in size, add the force_over_128tb option.
recovery_time_hard: 900
Specifies a hard recovery timeout window for failover. The server will
incrementally extend its timeout up to a hard maximum of
recovery_time_hard seconds. The default hard recovery timeout
is set to 900 (15 minutes).
recovery_time_soft: 300
Specifies a rolling recovery timeout window for failover. This value
should be less than or equal to recovery_time_hard. Allows
recovery_time_soft seconds for clients to reconnect for
recovery after a server crash. This timeout will incrementally extend
if it is about to expire and the server is still handling new connections
from recoverable clients. The default soft recovery timeout is set to
300 (five minutes).
quota: yes
Deprecated for Lustre 2.4.0 or greater. To enable quota support, set
quota: yes (the default value is no). For more information on
quotas in Lustre file systems, see the lfs(8) man page.
quota_type: ug
Deprecated for Lustre 2.4.0 or greater. If quotas are enabled, set
quota_type to u for user quotas, g for group quotas, or ug for both
user and group quotas.
3.1.3 Mount and Unmount Lustre Clients Manually
While boot time mounting is handled automatically, Lustre clients occasionally need to be mounted or unmounted
while the system is running. The mount_clients and umount_clients actions of the lustre_control
command allow this to be done. By adding the -c option, Lustre can be mounted or unmounted from the compute
node clients. This can prevent them from flooding the system with connection RPCs (remote procedure calls)
when Lustre services on an MDS or OSS node are stopped. By default, the -c option mounts or unmounts the
Lustre file system on all compute nodes at the client_mount_point location specified in the
fs_name.fs_defs file. For more flexibility, the -m and -w options allow the mount point to be specified and a list of nodes to receive the mount or unmount commands.
For more information, see the lustre_control(8) man page.
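For example, the following commands are a sketch of unmounting Lustre from a subset of compute node clients and remounting it at a specific mount point (the hostlist and mount point are placeholders, and the exact -w list format should be confirmed in the lustre_control(8) man page):
boot# lustre_control umount_clients -c -w nid000[40-41] -m /mnt/lustre/lus0
boot# lustre_control mount_clients -c -w nid000[40-41] -m /mnt/lustre/lus0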
3.1.4 Unmount Lustre from Compute Node Clients Using lustre_control
Prerequisites
The busybox mount command available on the Cray compute nodes is not Lustre-aware, so mount.lustre
must be used to manually mount compute node clients.
Procedure
1. Unmount Lustre from all compute node clients.
boot# lustre_control umount_clients -c -a
2. Mount Lustre manually on a compute node using mount.lustre (substituting values from the particular
site).
boot# mount.lustre -o rw,flock 12@gni:/lus0 /mnt/lustre/lus0
3.1.5 Configure the NFS client to Mount the Exported Lustre File System
About this task
Depending on the site client system, the configuration may be different. This procedure contains general
information that will help configure the client system to properly mount the exported Lustre file system. Consult
the client system documentation for specific configuration instructions.
Procedure
1. As root, verify that the nfs client service is started at boot.
2. Add a line to the /etc/fstab file to mount the exported file system. The list below describes various
recommended file system mount options. For more information on NFS mount options, see the mount(8)
and nfs(5) man pages.
nfs_server:/exported/filesystem /client/mount/point nfs file_system_options 0 0
Recommended file system mount options.
rsize=1048576,wsize=1048576
Set the read and write buffer sizes from the server at 1MiB.
These options match the NFS read/write transaction to the
Lustre filesystem block size, which reduces cache/buffer
thrashing on the service node providing the NFS server
functionality.
soft,intr
Use a soft interruptible mount request.
async
Use asynchronous NFS I/O. Once the NFS server has
acknowledged receipt of an operation, let the NFS client
move along even though the physical write to disk on the
NFS server has not been confirmed. For sites that need end-to-end write-commit validation, set this option to sync
instead.
proto=tcp
Force use of TCP transport—this makes the larger rsize/
wsize operations more efficient. This option reduces the
potential for UDP retransmit occurrences, which improves
end-to-end performance.
relatime,timeo=600,local_lock=none Lock and time stamp handling, transaction timeout at 10
minutes.
nfsvers=3
Use NFSv3 specifically. NFSv4 is not supported at this time.
3. Mount the file system manually or reboot the client to verify that it mounts correctly at boot.
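As an illustration, a single /etc/fstab entry that combines the recommended options might look like the following (the server name and paths are hypothetical):
nfs-server:/lus/lus0  /mnt/lus0_nfs  nfs  rsize=1048576,wsize=1048576,soft,intr,async,proto=tcp,relatime,timeo=600,local_lock=none,nfsvers=3  0 0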
3.1.6 Lustre Option for Panic on LBUG for CNL and Service Nodes
A Lustre configuration option, panic_on_lbug, is available to control Lustre behavior when processing a fatal
file system error.
When a Lustre file system hits an unexpected data condition internally, it produces an LBUG error to guarantee
overall file system data integrity. This renders the file system on the node inoperable. In some cases, an
administrator wants the node to remain functional; for example, when there are dependencies such as a login
node that has several other mounted file systems. However, there are also cases where the desired effect is for
the LBUG to cause a node to panic. Compute nodes are good examples, because when this state is triggered by a
Lustre or system problem, a compute node is essentially useless.
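This behavior corresponds to the libcfs_panic_on_lbug setting of the cray_lustre_client service shown later in this guide. On a node configured directly through module options files, a sketch of the equivalent modprobe line to enable panic on LBUG is:
options libcfs libcfs_panic_on_lbug=1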
3.1.7 Verify Lustre File System Configuration
The lustre_control verify_config command compares the mds, mgs, and ost definitions in the file system definition file (fs_name.fs_defs) to the configured Lustre file system and reports any differences. If failover is configured, the contents of the fs_name.fs_defs file will also be verified to match the contents of the failover tables in the SDB. The failover configuration check is skipped if auto_fo: no is set in the fs_name.fs_defs file.
Verifying Lustre File System Configuration with lustre_control verify_config
Execute the following command to verify all installed Lustre file systems.
boot# lustre_control verify_config -a
Performing 'verify_config' from boot at Thu Aug  2 17:29:16 CDT 2012
No problems detected for the following file system(s):
fs_name
3.2 Configure Striping on Lustre File Systems
Striping is the process of distributing data from a single file across more than one device. To improve file system
performance for a few very large files, files can be striped across several or all OSTs.
The file system default striping pattern is determined by the stripe_count and stripe_size parameters in
the Lustre file system definition file. These parameters are defined as follows.
stripe_count The number of OSTs that each file is striped across. Any number of OSTs can be striped
across, from a single OST to all available OSTs.
stripe_size
The number of bytes in each stripe. This much data is written to each stripe before starting to
write in the next stripe. The default is 1048576.
Striping can also be overridden for individual files. See Override File System Striping Defaults
on page 20.
CAUTION: Striping can increase the rate that data files can be read or written.
However, reliability decreases as the number of stripes increases. Damage to a single
OST can cause loss of data in many files.
When configuring striping for Lustre file systems, Cray recommends:
● Striping files across one to four OSTs
● Setting stripe count value greater than 2 (this gives good performance for many types of jobs; for larger file systems, a larger stripe width may improve performance)
● Choosing the default stripe size of 1MB (1048576 bytes)
Stripe size can be increased by powers of two but there is rarely a need to configure a stripe size greater than
2MB. Stripe sizes smaller than 1MB, however, can result in degraded I/O bandwidth. They should be avoided,
even for files with writes smaller than the stripe size.
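The striping in effect for an existing file or directory can be checked with the lfs getstripe command, for example (the path is a placeholder):
$ lfs getstripe /mnt/lustre/lus0/npf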
3.2.1 Configuration and Performance Trade-off for Striping
For maximum aggregate performance, it is important to keep all OSTs occupied. The following circumstances
should be considered when striping a Lustre file system.
Single OST
When many clients in a parallel application are each creating their own files, and where the
number of clients is significantly larger than the number of OSTs, the best aggregate
performance is achieved when each object is put on only a single OST.
Multiple OSTs At the other extreme, for applications where multiple processes are all writing to one large
(sparse) file, it is better to stripe that single file over all of the available OSTs. Similarly, if a few
processes write large files in large chunks, it is a good idea to stripe over enough OSTs to keep
the OSTs busy on both the write and the read path.
3.2.2 Override File System Striping Defaults
Each Lustre file system is built with a default stripe pattern that is specified in fs_name.fs_defs. However,
users may select alternative stripe patterns for specific files or directories with the lfs setstripe command, as
shown in File striping. For more information, see the lfs(1) man page.
File Striping
The lfs setstripe command has the following syntax: lfs setstripe -s stripe_size -c stripe_count -i stripe_start filename
This example creates the file, npf, with a 2MB (2097152 bytes) stripe that starts on OST0 (0) and stripes over
two object storage targets (OSTs) (2).
$ lfs setstripe -s 2097152 -c 2 -i 0 npf
Here the -s specifies the stripe size, the -c specifies the stripe count, and the -i specifies the index of the
starting OST.
The first two megabytes, bytes 0 through 2097151, of npf are placed on OST0, and then the third and fourth
megabytes, 2097152-4194303, are placed on OST1. The fifth and sixth megabytes are again placed on OST0
and so on.
The following special values are defined for the lfs setstripe options.
stripe_size=0
Uses the file system default for stripe size.
stripe_start=-1
Uses the default behavior for setting OST values.
stripe_count=0
Uses the file system default for stripe count.
stripe_count=-1
Uses all OSTs.
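Striping set on a directory becomes the default for files subsequently created in that directory. For example, the following command (the directory path is a placeholder) sets a 1MB stripe size across four OSTs, with default OST selection, for new files created under /mnt/lustre/lus0/results:
$ lfs setstripe -s 1048576 -c 4 -i -1 /mnt/lustre/lus0/results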
4 LNet Routing
4.1 Recommended LNet Router Node Parameters
LNet routers are service nodes connected to both the Aries high speed network (HSN) and an external network,
such as InfiniBand (IB). LNet routers route Lustre traffic and are dedicated to bridging the different networks to
connect Lustre clients on the HSN—such as compute or login nodes—to Lustre servers on the external network.
Recommended LNet parameters for XC™ Series router nodes are shown below.
## ko2iblnd parameters
options ko2iblnd timeout=10
options ko2iblnd peer_timeout=40
options ko2iblnd credits=2048
options ko2iblnd ntx=2048
### Note peer_credits must be consistent across all peers on the IB network
options ko2iblnd peer_credits=126
options ko2iblnd concurrent_sends=63
options ko2iblnd peer_buffer_credits=128
## kgnilnd parameters
options kgnilnd credits=2048
options kgnilnd peer_health=1
## LNet parameters
options lnet large_router_buffers=1024
options lnet small_router_buffers=16384
large_router_buffers=1024
● Routers only
● Sets the number of large buffers (greater than one page) on a router node
● Divided equally among CPU partitions (CPTs)

small_router_buffers=16384
● Routers only
● Sets the number of small buffers (one page) on a router node
● Divided equally among CPTs
4.2 LNet Feature: Router Pinger
The router pinger determines the status of configured routers so that bad (dead) routers are not used for LNet
traffic.
The router pinger is enabled on clients and servers when the LNet module parameters
live_router_check_interval and dead_router_check_interval have values greater than 0. The
router pinger is always enabled on routers, though it is typically only used to update the status of local network
interfaces. This means it does not do any pinging. In multi-hop configurations (server->router1->router2->client),
the router pinger on a router behaves similarly to its behavior on other nodes, meaning it does do pinging.
The router_checker thread (router pinger) periodically sends traffic (an LNet ping) to each known router. Live
routers are pinged every live_router_check_interval (in seconds). Dead routers are pinged every
dead_router_check_interval (in seconds). If a response is not received from an alive route after a timeout
period, then the route is marked down and is not used for further LNet traffic. Dead routes are marked alive once
a response is received.
The router_checker is also integral in the use of asymmetric routing failure (ARF). The payload of the ping
reply contains the status (up or down) of each router network interface. This information is used to determine
whether a particular router should be used to communicate with particular remote network.
The ping timeout is determined from the router_ping_timeout, dead_router_check_interval, and live_router_check_interval module parameters. The effective maximum timeout is router_ping_timeout + MAX(dead_router_check_interval, live_router_check_interval). In the recommended tunings, 50 + MAX(60, 60) = 110 seconds.
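As a sketch consistent with the 110-second example above, those values correspond to LNet module options of the form:
options lnet router_ping_timeout=50
options lnet live_router_check_interval=60
options lnet dead_router_check_interval=60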
4.3 LNet Feature: ARF Detection
Asymmetric router failure (ARF) detection enables Lustre networking (LNet) peers to determine if a route to a
remote network is alive.
Peers use only known good routes. Attempting to send a message via a bad route generally results in a
communication failure and requires a resend. Sending via a bad route also consumes router resources that could
be utilized for other communication.
ARF detection is enabled by setting avoid_asym_router_failure=1 in the LNet module settings. This
feature piggy-backs off the router pinger feature. The ping reply sent from router to clients and servers contains
information about the status of the router network interfaces for remote networks. Clients and servers then use
this information to determine whether a particular router should be used when attempting to send a message to a
remote network.
For example, assume a router at 454@gni with an IB interface on o2ib1000 and another IB interface on o2ib1002. Suppose this router responds to a router checker ping with the following information:
o2ib1000 -> down
o2ib1002 -> up
When a client wants to send a message to a remote network, it considers each configured router in turn. When considering 454@gni, the client knows that this router can be used to send a message to o2ib1002, but not to o2ib1000.
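The routing table that results, including the up/down state of each route, can be inspected on a client or server. Assuming the /proc interface that LNet provides in this Lustre generation, for example:
login# cat /proc/sys/lnet/routes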
4.4 External Server Node Recommended LNet Parameters
The following are the recommended LNet parameters for external Lustre server nodes, used to configure the Lustre servers of external Sonexion storage systems. These parameters are set from the Sonexion management node. Refer to the Sonexion Administrator Guide for more information.
## ko2iblnd parameters
options ko2iblnd timeout=10
options ko2iblnd peer_timeout=0
options ko2iblnd keepalive=30
options ko2iblnd credits=2048
options ko2iblnd ntx=2048
### Note peer_credits must be consistent across all peers on the IB network
options ko2iblnd peer_credits=126
options ko2iblnd concurrent_sends=63
## LNet parameters
options lnet router_ping_timeout=10
options lnet live_router_check_interval=35
options lnet dead_router_check_interval=35
## Sonexion only (if off by default)
options lnet avoid_asym_router_failure=1
## ptlrpc parameters
options ptlrpc at_max=400
options ptlrpc at_min=40
options ptlrpc ldlm_enqueue_min=260
Cray recommends an object-based disk (OBD) timeout of 100 seconds, which is the default value. Set this parameter using the lctl conf_param command on the management server (MGS).
$ lctl conf_param fs_name.sys.timeout=100
For example:
$ lctl conf_param husk1.sys.timeout=100
$ cat /proc/sys/lustre/timeout
100
$
4.5 Internal Client Recommended LNet Parameters
As root, use cfgset to modify the cray_lustre_client service in the configuration set.
smw# cfgset update --service cray_lustre_client -l advanced --mode interactive partition
cray_lustre_client
  [ status: enabled ] [ validation: valid ]
--------------------------------------------------------------------------------------
  Selected #   Settings                     Value/Status (level=advanced)
--------------------------------------------------------------------------------------
             module_params
    1)         libcfs_panic_on_lbug         [ unconfigured, default=True ]
    2)         ptlrpc_at_min                [ unconfigured, default=40 ]
    3)         ptlrpc_at_max                [ unconfigured, default=400 ]
    4)         ptlrpc_ldlm_enqueue_min      [ unconfigured, default=260 ]
    5)       client_mounts
               fs_name: esfprod             [ OK ]
               fs_name: snx11023            [ OK ]
               fs_name: dal                 [ OK ]
--------------------------------------------------------------------------------------
Recommended LNet parameters for internal client settings are the default settings.
ptlrpc_at_min              [ unconfigured, default=40 ]
ptlrpc_at_max              [ unconfigured, default=400 ]
ptlrpc_ldlm_enqueue_min    [ unconfigured, default=260 ]

4.6 Internal Server (DAL) Recommended LNet Parameters
As root, use cfgset to modify the cray_lustre_server service in the configuration set.
smw# cfgset update --service cray_lustre_server -l advanced --mode interactive partition
cray_lustre_server
  [ status: enabled ] [ validation: valid ]
------------------------------------------------------------------------------------------------
  Selected #   Settings                 Value/Status (level=advanced)
------------------------------------------------------------------------------------------------
             lustre_servers
    1)         mgs                      c0-0c0s2n1
    2)         mds                      c0-0c0s2n1, c0-0c1s6n1
    3)         oss                      c0-0c0s2n2, c0-0c1s0n2, c0-0c1s6n2, c0-0c2s1n2
             ptlrpc
    4)         at_max                   [ unconfigured, default=400 ]
    5)         at_min                   [ unconfigured, default=40 ]
    6)         ldlm_enqueue_min         [ unconfigured, default=260 ]
------------------------------------------------------------------------------------------------
Recommended internal server settings are the default settings.
at_max              [ unconfigured, default=400 ]
at_min              [ unconfigured, default=40 ]
ldlm_enqueue_min    [ unconfigured, default=260 ]

4.7 Configure Lustre PTLRPC ldlm_enqueue_min Parameter
ldlm_enqueue_min=260
The ldlm_enqueue_min parameter sets the minimum amount of time a server waits to see traffic on a lock
before assuming a client is malfunctioning, revoking the lock, and evicting the client. Set this value large enough
such that clients are able to resend an RPC from scratch without being evicted in the event that the first RPC was
lost. The time it takes for an RPC to be sent is the sum of the network latency and the time it takes for the server
to process the request. Both of these variables have a lower bound of at_min. Additionally, it should be large
enough so that clients are not evicted as a result of the high-speed network (HSN) quiesce period. Thus, the
minimum value is calculated as:
ldlm_enqueue_min = max(2*net latency, net latency + quiesce duration) + 2*service time
                 = max(2*40, 40 + 140) + 2*40 = 180 + 80 = 260
The quiesce duration of 140 in the above equation was determined experimentally. It could be smaller or larger
depending on the nature of the HSN failure or the size of the system. The quiesce duration of 140 strikes a
balance between resiliency of Lustre against extended network flaps (larger ldlm_enqueue_min) and the ability
for Lustre to detect malfunctioning clients (smaller ldlm_enqueue_min).
4.8 DVS Server Node Recommended LNet Parameters
Use the default settings.
4.9 External Client Recommended LNet Parameters
The following are the recommended LNet settings for external clients, such as Cray development and login (CDL) nodes, for XC™ Series systems.
## o2iblnd parameters
options ko2iblnd timeout=10
options ko2iblnd peer_timeout=0
options ko2iblnd keepalive=30
options ko2iblnd credits=2048
options ko2iblnd ntx=2048
### Note peer_credits must be consistent across all peers on the IB network
options ko2iblnd peer_credits=126
options ko2iblnd concurrent_sends=63
4.10 LNet Feature: Peer Health
Peer health queries the network interface that a peer is on to determine whether the peer is alive or dead before allowing traffic to be sent to that peer. If a peer is dead, then LNet aborts the send. This functionality is needed to avoid communication attempts with known dead peers (which waste network interface credits, router buffer credits, and other resources that could otherwise be used to communicate with alive peers).
Enable peer health by setting these Lustre network driver (LND) module parameters:
●   gnilnd — Set the peer_health and peer_timeout parameters
●   o2iblnd — Set the peer_timeout parameter (setting this parameter to 0 disables peer health)
When a LND completes a transmit, receive, or connection setup operation for a peer, it records the current time in
a last_alive field associated with the peer. When a client of LNet (for example, ptlrpc) attempts to send
anything to a particular peer, the last_alive value for that peer is inspected and, if necessary, updated by
querying the LND. The LND query serves a dual purpose—in addition to dropping a message, it causes the LND
to attempt a new connection to a dead peer. If the last_alive is more than peer_timeout seconds (plus a
fudge factor for gnilnd), then the peer is considered dead and the message is dropped.
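For example, peer health might be enabled with Lustre network driver module options similar to the following. This is a sketch only: the module configuration file path, the kgnilnd module name, and the peer_timeout value of 40 seconds are assumptions, not Cray-recommended settings.

## /etc/modprobe.d/lustre-lnd.conf (assumed location)
## gnilnd: enable peer health and set a peer timeout
options kgnilnd peer_health=1 peer_timeout=40
## o2iblnd: a non-zero peer_timeout enables peer health
options ko2iblnd peer_timeout=40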
Routed Configurations
For routed configurations, disable peer health on clients and servers. Clients and servers always have a peer in
the middle (the router) and router aliveness is determined by the router checker feature. Peer health interferes
with the normal operation of the router checker thread by preventing the router checker pings from being sent to
dead routers. Thus, it would be impossible to determine when dead routers become alive again.
4.11 Manually Control LNet Routers
Procedure
If /etc/init.d/lnet is not provided, send the following commands to each LNet router node to control
them manually:
●   For startup:
    modprobe lnet
    lctl net up
●   For shutdown:
    lctl net down
    lustre_rmmod
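After startup, a quick way to confirm that LNet came up with the expected identifiers is to list the router's NIDs. This verification step is added here for convenience and is not part of the Cray procedure:

lctl list_nids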
5
Configure Fine-grained Routing with clcvt
The clcvt command, available on the boot node and the system management workstation (SMW), aids in the
configuration of Lustre networking (LNet) fine-grained routing (FGR). FGR is a routing scheme that aims to group
sets of Lustre servers on the file system storage array with LNet routers on the Cray system. This grouping
maximizes file system performance on larger systems by using a router-to-server ratio where the relative
bandwidth is roughly equal on both sides. FGR also minimizes the number of LNet network hops (hop count) and
file system network congestion by sending traffic to particular Lustre servers over dedicated network lanes instead
of the default round-robin configuration.
The clcvt command takes as input several file-system-specific files and generates LNet kernel module
configuration information that can be used to configure the servers, routers, and clients for that file system. The
utility can also create cable maps in HTML, CSV, and human-readable formats and validate cable connection on
installed systems. For more information, such as available options and actions for clcvt, see the clcvt(8)
man page.
5.1
clcvt Prerequisite Files
The clcvt command requires several prerequisite files in order to compute the ip2nets and routes
information for the specific configuration. Before clcvt can be executed for the first time, these files must be
placed in an empty directory on the boot node or SMW—depending on where clcvt is run.
Deciding how to assign which routers to which object storage servers (OSSs), what fine grained routing (FGR)
ratios to use, which interface on which router to use for a Lustre networking (LNet) group, and router placement
are all things that can vary greatly from site to site. LNet configuration is determined as the system is ordered and
configured; see a Cray representative for the site-specific values.
info.file-system-identifier A file with global file system information for the cluster-name server
machine and each client system that will access it.
client-system.hosts
A file that maps the client system (such as the Cray mainframe) IP
addresses to unique host names, such as the boot node /etc/hosts file.
The client-system name must match one of the clients in the
info.file-system-identifier file.
client-system.ib
A file that maps the client system LNet router InfiniBand IP addresses to
system hardware cnames. The client-system name must match one
of the clients in the info.file-system-identifier file. This file
must be created by an administrator.
clustername.ib
A file that maps the Lustre server InfiniBand IP addresses to cluster (for
example, Sonexion) host names. The clustername name must match
the clustername in the info.file-system-identifier file. This
file must be created by an administrator.
client-system.rtrIm             A file that contains rtr -Im command output (executed on the SMW) for
                                the client-system.

5.2   The info.file-system-identifier File
info.file-system-identifier is a manually-created file that contains global file system information for the
Lustre server machine and each client system that will access it. Based on the ratio of server to LNet routers in
the configuration, the [clustername] section and each [client-system] section will define which servers
and routers will belong to each InfiniBand (IB) subnet.
This file is an ini-style file, and the possible keywords in the [info] section include clustername,
ssu_count, and clients.
clustername Defines the base name used for all file system servers. For a Sonexion file system, as in the
example below, it might be something like snx11029n. Thus, all server hostnames will be
snx11029nNNN. NNN is a three-digit number starting at 000 and 001 for the primary and
secondary Cray Sonexion management servers (MGMT), 002 for the MGS, 003 for the MDS,
004 for the first OSS, and counting up from there for all remaining OSSs.
ssu_count
Defines how many SSUs make up a Sonexion file system. If this is missing, then this is not a
Sonexion file system but a CLFS installation.
clients
Defines a comma-separated list of mainframe names that front-end this file system.
The info.file-system-identifier file also needs a [client-system] section for each client system
listed in the clients line of the [info] section to describe the client systems and a [clustername] section to
describe the Lustre server system. Each of these sections contains a literal lnet_network_wildcard in the
format of LNET-name:IP-wildcard, which instructs the LNet module to match a host IP address to
IP-wildcard and, if it matches, instantiate LNet LNET-name on that host. Sample
info.file-system-identifier File: info.snx11029 on page 29 shows a sample
info.file-system-identifier configuration file.
The hostname fields in the [client-system] section of this file are fully-qualified interface specifications of the
form hostname(ibn), where (ib0) is the assumed default if not specified.
XC™ Series systems support multiple IB interfaces per router. Configure the second IB interface (see Configure
LNet Routers) and append the interface names (ibn) to the cname for the routers. See Sample
info.file-system-identifier File Using Multiple IB Interfaces Per Router on page 29 for reference. These
interface names must be appended to the client-system.ib file. IB port assignments are shown below in
XC™ Series InfiniBand Port Assignment on page 28.
Figure 4. XC™ Series InfiniBand Port Assignment
[Figure: I/O module port labels. For each node (Node 1, inner slot; Node 2, outer slot), mlx4_0 port 1 and port 2 provide ib0 and ib1, and mlx4_1 port 1 and port 2 provide ib2 and ib3.]
Sample info.file-system-identifier File: info.snx11029
# This section describes the size of this filesystem.
[info]
clustername = snx11029n
SSU_count = 6
clients = hera
[hera]
lnet_network_wildcard = gni1:10.128.*.*
# Because of our cabling assumptions and naming conventions, we only
# need to know which XIO nodes are assigned to which LNETs. From that
# our tool can actually generate a "cable map" for the installation folks.
o2ib6000: c0-0c2s2n0, c0-0c2s2n2 ; MGS and MDS
o2ib6002: c1-0c0s7n0, c1-0c0s7n1, c1-0c0s7n2, c1-0c0s7n3 ; OSSs 2/4/6
o2ib6003: c3-0c1s5n0, c3-0c1s5n1, c3-0c1s5n2, c3-0c1s5n3 ; OSSs 3/5/7
o2ib6004: c3-0c1s0n0, c3-0c1s0n1, c3-0c1s0n2, c3-0c1s0n3 ; OSSs 8/10/12
o2ib6005: c3-0c2s4n0, c3-0c2s4n1, c3-0c2s4n2, c3-0c2s4n3 ; OSSs 9/11/13
[snx11029n]
lnet_network_wildcard = o2ib6:10.10.100.*
o2ib6000: snx11029n002, snx11029n003 ; MGS and MDS
o2ib6002: snx11029n004, snx11029n006, snx11029n008 ; OSSs 2/4/6
o2ib6003: snx11029n005, snx11029n007, snx11029n009 ; OSSs 3/5/7
o2ib6004: snx11029n010, snx11029n012, snx11029n014 ; OSSs 8/10/12
o2ib6005: snx11029n011, snx11029n013, snx11029n015 ; OSSs 9/11/13
Sample info.file-system-identifier File Using Multiple IB Interfaces Per
Router
[info]
clustername = snx11014n
SSU_count = 4
clients = crystal
[crystal]
lnet_network_wildcard = gni0:10.128.*.*
o2ib1000: c2-0c0s1n1, c2-0c1s1n1 ; MGS and MDS
o2ib1002: c0-0c0s1n2(ib0), c3-0c0s0n1(ib0), c3-0c0s0n2(ib0) ; OSSs 4/6/8/10
o2ib1003: c0-0c0s1n2(ib2), c3-0c0s0n1(ib2), c3-0c0s0n2(ib2) ; OSSs 5/7/9/11
[snx11014n]
lnet_network_wildcard = o2ib:10.10.100.*
o2ib1000: snx11014n002, snx11014n003 ; MGS and MDS
o2ib1002: snx11014n004, snx11014n006, snx11014n008, snx11014n010 ; OSSs 4/6/8/10
o2ib1003: snx11014n005, snx11014n007, snx11014n009, snx11014n011 ; OSSs 5/7/9/11
5.3
The client-system.hosts File
For a typical Cray system, use the /etc/hosts file from the boot node. To use with the clcvt command, copy
the /etc/hosts file from the boot node to a working directory.
Sample client-system.hosts File:
#
# hosts         This file describes a number of hostname-to-address
#               mappings for the TCP/IP subsystem. It is mostly
#               used at boot time, when no name servers are running.
#               On small systems, this file can be used instead of a
#               "named" name server.
# Syntax:
#
# IP-Address    Full-Qualified-Hostname    Short-Hostname
#
127.0.0.1       localhost
# special IPv6 addresses
::1             ipv6-localhost localhost ipv6-loopback
fe00::0         ipv6-localnet
ff00::0         ipv6-mcastprefix
ff02::1         ipv6-allnodes
ff02::2         ipv6-allrouters
ff02::3         ipv6-allhosts
# Licenses
172.30.74.55    tic     tic.us.cray.com
172.30.74.56    tac     tac.us.cray.com
172.30.74.57    toe     toe.us.cray.com
172.30.74.206   cflls01 cflls01.us.cray.com
172.30.74.207   cflls02 cflls02.us.cray.com
172.30.74.208   cflls03 cflls03.us.cray.com
##LDAP Server Info
172.30.12.46    kingpin kingpin.us.cray.com kingpin.cray.com
172.30.12.48    kingfish kingfish.us.cray.com kingfish.cray.com
##esLogin Info
172.30.48.62    kiyi    kiyi.us.cray.com el-login0.us.cray.com
10.2.0.1        kiyi-eth1
##Networker server
#172.30.74.90   booboo booboo.us.cray.com
10.3.1.1        smw
10.128.0.1      nid00000        c0-0c0s0n0      dvs-0
10.128.0.2      nid00001        c0-0c0s0n1      boot001 boot002
10.128.0.31     nid00030        c0-0c0s0n2      #old ddn6620_mds
10.128.0.32     nid00031        c0-0c0s0n3      hera-rsip2
10.128.0.3      nid00002        c0-0c0s1n0      login
10.128.0.4      nid00003        c0-0c0s1n1      sdb001
10.128.0.29     nid00028        c0-0c0s1n2      login1
10.128.0.30     nid00029        c0-0c0s1n3      sdb002
10.128.0.5      nid00004        c0-0c0s2n0      hera
10.128.0.6      nid00005        c0-0c0s2n1
10.128.0.27     nid00026        c0-0c0s2n2
10.128.0.28     nid00027        c0-0c0s2n3
10.128.0.7      nid00006        c0-0c0s3n0
10.128.0.8      nid00007        c0-0c0s3n1
10.128.0.25     nid00024        c0-0c0s3n2
10.128.0.26     nid00025        c0-0c0s3n3
10.128.0.9      nid00008        c0-0c0s4n0
10.128.0.10     nid00009        c0-0c0s4n1
10.128.0.23     nid00022        c0-0c0s4n2      hera-rsip hera-rsip1
10.128.0.24     nid00023        c0-0c0s4n3      mds nid00023_mds
...

5.4   The client-system.ib File
The client-system.ib file contains a client-system LNet router InfiniBand (IB) IP address to cname mapping
information in a /etc/hosts style format. The hostname field in this file is a fully-qualified interface specification
of the form hostname(ibn), where (ib0) is the assumed default if not specified. This file must be created by
an administrator.
XC™ Series systems can support multiple IB interfaces per router—configure the second IB interface and append
the interface names (ibn) to the cname for the routers. The LNet router IB IP addresses should be within the
same subnet as the Lustre servers (MGS/MDS/OSS)—one possible address assignment scheme would be to use
a contiguous set of IP addresses, with ib0 and ib2 on each node having adjacent addresses. (See Sample
info.file-system-identifier File Using Multiple IB Interfaces Per Router on page 29 for reference.) These
interface names must be appended to the info.file-system-identifier file.
Sample client-system.ib File:
#
# This is the /etc/hosts-like file for Infiniband IP addresses
# on "hera".
#
10.10.100.101   c0-0c2s2n0
10.10.100.102   c0-0c2s2n2
10.10.100.103   c1-0c0s7n0
10.10.100.104   c1-0c0s7n1
10.10.100.105   c1-0c0s7n2
10.10.100.106   c1-0c0s7n3
10.10.100.107   c3-0c1s0n0
10.10.100.108   c3-0c1s0n1
10.10.100.109   c3-0c1s0n2
10.10.100.110   c3-0c1s0n3
10.10.100.111   c3-0c1s5n0
10.10.100.112   c3-0c1s5n1
10.10.100.113   c3-0c1s5n2
10.10.100.114   c3-0c1s5n3
10.10.100.115   c3-0c2s4n0
10.10.100.116   c3-0c2s4n1
10.10.100.117   c3-0c2s4n2
10.10.100.118   c3-0c2s4n3
Sample client-system.ib File Using Multiple IB Interfaces Per Router
#
# This is the /etc/hosts-like file for Infiniband IP addresses
# on "crystal".
#
10.10.100.101   c0-0c0s1n2
10.10.101.102   c0-0c0s1n2(ib2)
10.10.100.103   c3-0c0s0n1
10.10.101.104   c3-0c0s0n1(ib2)
10.10.100.105   c3-0c0s0n2
10.10.101.106   c3-0c0s0n2(ib2)
10.10.100.107   c2-0c0s1n1
10.10.101.108   c2-0c1s1n1

5.5   The cluster-name.ib File
The cluster-name.ib file contains Lustre server InfiniBand (IB) IP addresses to cluster (for example,
Sonexion) host name mapping information in a /etc/hosts style format. This file must be created by an
administrator.
Sample cluster-name.ib File: snx11029n.ib
#
# This is the /etc/hosts-like file for Infiniband IP addresses
# on the Sonexion known as "snx11029n".
#
10.10.100.1     snx11029n000    #mgmnt
10.10.100.2     snx11029n001    #mgmnt
10.10.100.3     snx11029n002    #mgs
10.10.100.4     snx11029n003    #mds
10.10.100.5     snx11029n004    #first oss, oss0
10.10.100.6     snx11029n005
10.10.100.7     snx11029n006
10.10.100.8     snx11029n007
10.10.100.9     snx11029n008
10.10.100.10    snx11029n009
10.10.100.11    snx11029n010
10.10.100.12    snx11029n011
10.10.100.13    snx11029n012
10.10.100.14    snx11029n013
10.10.100.15    snx11029n014
10.10.100.16    snx11029n015    #last oss, oss11
Sample cluster-name.ib file: snx11029n.ib Using Multiple IB Interfaces Per Router
#
# This is the /etc/hosts-like file for Infiniband IP addresses
# on the Sonexion known as "snx11029n".
#
10.10.100.5     snx11029n004    #first oss, oss0
10.10.101.6     snx11029n005
10.10.100.7     snx11029n006
10.10.101.8     snx11029n007
10.10.100.9     snx11029n008
10.10.101.10    snx11029n009
10.10.100.11    snx11029n010
10.10.101.12    snx11029n011
10.10.100.13    snx11029n012
10.10.101.14    snx11029n013
10.10.100.15    snx11029n014
10.10.101.16    snx11029n015    #last oss, oss11
5.6
The client-system.rtrIm File
About this task
The client-system.rtrIm file contains output from the rtr -Im command as executed from the SMW.
When capturing the command output to a file, use the -H option to remove the header information from rtr -Im
or open the file after capturing and delete the first two lines.
Follow this procedure to create the client-system.rtrIm file on the SMW.
Procedure
1. Log on to the SMW.
crayadm@boot> ssh smw
Password:
Last login: Sun Feb 24 23:05:29 2013 from boot
2. Run the following command to capture the rtr -Im output (without header information) to a file.
crayadm@smw> rtr -Im -H > client-system.rtrIm
3. Move the client-system.rtrIm file to the working directory from which the clcvt command will be run.
crayadm@smw> mv client-system.rtrIm /path/to/working/dir/
Sample client-system.rtrIm file:
 0    0   c0-0c0s0n0   c0-0c0s0g0   0   0   0
 1    1   c0-0c0s0n1   c0-0c0s0g0   0   0   0
 2    4   c0-0c0s1n0   c0-0c0s1g0   0   0   1
 3    5   c0-0c0s1n1   c0-0c0s1g0   0   0   1
 4    8   c0-0c0s2n0   c0-0c0s2g0   0   0   2
 5    9   c0-0c0s2n1   c0-0c0s2g0   0   0   2
 6   12   c0-0c0s3n0   c0-0c0s3g0   0   0   3
 7   13   c0-0c0s3n1   c0-0c0s3g0   0   0   3
 8   16   c0-0c0s4n0   c0-0c0s4g0   0   0   4
 9   17   c0-0c0s4n1   c0-0c0s4g0   0   0   4
10   20   c0-0c0s5n0   c0-0c0s5g0   0   0   5
11   21   c0-0c0s5n1   c0-0c0s5g0   0   0   5
12   24   c0-0c0s6n0   c0-0c0s6g0   0   0   6
13   25   c0-0c0s6n1   c0-0c0s6g0   0   0   6
14   28   c0-0c0s7n0   c0-0c0s7g0   0   0   7
15   29   c0-0c0s7n1   c0-0c0s7g0   0   0   7
30   32   c0-0c0s0n2   c0-0c0s0g1   0   1   0
31   33   c0-0c0s0n3   c0-0c0s0g1   0   1   0
28   36   c0-0c0s1n2   c0-0c0s1g1   0   1   1
29   37   c0-0c0s1n3   c0-0c0s1g1   0   1   1
...
5.7
Generate ip2nets and routes Information
When the prerequisite files have been created and gathered, the administrator can generate the
persistent-storage file with the clcvt generate action. This portable file will then be used to create
ip2nets and routes directives for the servers, routers, and clients.
The following procedures frequently use the --split-routes=4 flag, which will print information that can be
loaded into ip2nets and routes files. This method of adding modprobe.conf directives is particularly valuable
for large systems where the directives might otherwise exceed the modprobe buffer limit.
5.8
Create the persistent-storage File
Procedure
1. Move all prerequisite files to an empty directory on the boot node or SMW (the clcvt command is only
available on the boot node or the SMW).
The working directory should look similar to this when done.
crayadm@smw$ ll
total 240
-rw-rw-r-- 1 crayadm crayadm 23707 Feb  8 14:27 hera.hosts
-rw-rw-r-- 1 crayadm crayadm   548 Feb  8 14:27 hera.ib
-rw-rw-r-- 1 crayadm crayadm 36960 Feb  8 14:27 hera.rtrIm
-rw-rw-r-- 1 crayadm crayadm  1077 Feb  8 14:27 info.snx11029
-rw-rw-r-- 1 crayadm crayadm   662 Feb  8 14:27 snx11029n.ib
2. Create the persistent-storage file.
crayadm@smw$ clcvt generate
The clcvt command does not print to stdout with successful completion. If there are errors when running
the command, however, set the --debug flag to add debugging information.
5.9
Create ip2nets and routes Information for the Compute Nodes
Procedure
1. Execute the clcvt command with the compute flag to generate directives for the compute nodes.
crayadm@smw$ clcvt compute --split-routes=4
# Place the following line(s) in the appropriate 'modprobe' file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
options lnet ip2nets=/path/to/ip2nets-loading/filename
options lnet routes=/path/to/route-loading/filename
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate ip2nets-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
gni1 10.128.*.*
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate route-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
o2ib6000 1 [68,90]@gni1
o2ib6002 1 [750,751,752,753]@gni1
o2ib6003 1 [618,619,628,629]@gni1
o2ib6004 1 [608,609,638,639]@gni1
o2ib6005 1 [648,649,662,663]@gni1
o2ib6000 2
[608,609,618,619,628,629,638,639,648,649,662,663,750,751,752,753]@gni1
o2ib6002 2 [608,609,638,639]@gni1
o2ib6003 2 [648,649,662,663]@gni1
o2ib6004 2 [750,751,752,753]@gni1
o2ib6005 2 [618,619,628,629]@gni1
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2. Follow the procedures in Configure the LNet Compute Node Clients to update the compute node boot image
modprobe information using the ip2nets and routes information produced by the previous step.
5.10 Create ip2nets and routes information for service node Lustre
clients (MOM and internal login nodes)
Procedure
1. Execute the clcvt command with the login flag to generate directives for the service node Lustre clients.
crayadm@smw$ clcvt login --split-routes=4
# Place the following line(s) in the appropriate 'modprobe' file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
options lnet ip2nets=/path/to/ip2nets-loading/filename
options lnet routes=/path/to/route-loading/filename
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate ip2nets-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
gni1 10.128.*.*
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate route-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
o2ib6000 1 [68,90]@gni1
o2ib6002 1 [750,751,752,753]@gni1
o2ib6003 1 [618,619,628,629]@gni1
o2ib6004 1 [608,609,638,639]@gni1
o2ib6005 1 [648,649,662,663]@gni1
o2ib6000 2
[608,609,618,619,628,629,638,639,648,649,662,663,750,751,752,753]@gni1
o2ib6002 2 [608,609,638,639]@gni1
o2ib6003 2 [648,649,662,663]@gni1
o2ib6004 2 [750,751,752,753]@gni1
o2ib6005 2 [618,619,628,629]@gni1
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2. Follow the procedures in Specifying service node LNET routes and ip2nets directives with files to update the
modprobe information for the default view of the shared root using the ip2nets and routes information
produced by the previous step.
5.11 Create ip2nets and routes information for the LNet router nodes
Procedure
1. Execute the clcvt command with the router flag to generate directives for the LNet router nodes.
crayadm@smw$ clcvt router --split-routes=4
# Place the following line(s) in the appropriate 'modprobe' file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
options lnet ip2nets=/path/to/ip2nets-loading/filename
options lnet routes=/path/to/route-loading/filename
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate ip2nets-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
gni1 10.128.*.*
o2ib6000 10.10.100.
[101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118]
o2ib6002 10.10.100.[103,104,105,106,107,108,109,110]
o2ib6003 10.10.100.[111,112,113,114,115,116,117,118]
o2ib6004 10.10.100.[103,104,105,106,107,108,109,110]
o2ib6005 10.10.100.[111,112,113,114,115,116,117,118]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate route-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
o2ib6000 1 [68,90]@gni1
o2ib6002 1 [750,751,752,753]@gni1
o2ib6003 1 [618,619,628,629]@gni1
o2ib6004 1 [608,609,638,639]@gni1
o2ib6005 1 [648,649,662,663]@gni1
o2ib6000 2
[608,609,618,619,628,629,638,639,648,649,662,663,750,751,752,753]@gni1
o2ib6002 2 [608,609,638,639]@gni1
o2ib6003 2 [648,649,662,663]@gni1
o2ib6004 2 [750,751,752,753]@gni1
o2ib6005 2 [618,619,628,629]@gni1
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2. Follow the procedures in Specifying service node Lnet routes and ip2nets directives with files to update the
modprobe information for the LNet router view of the shared root using the ip2nets and routes
information produced by the previous step.
5.12 Create ip2nets and routes Information for the Lustre Server
Nodes
Procedure
1. Execute the clcvt command with the server flag to generate directives for the Lustre server nodes.
crayadm@smw$ clcvt server --split-routes=4
# Place the following line(s) in the appropriate 'modprobe' file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
options lnet ip2nets=/path/to/ip2nets-loading/filename
options lnet routes=/path/to/route-loading/filename
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate ip2nets-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
o2ib6 10.10.100.*
o2ib6000 10.10.100.[3,4]
o2ib6002 10.10.100.[5,7,9]
o2ib6003 10.10.100.[6,8,10]
o2ib6004 10.10.100.[11,13,15]
o2ib6005 10.10.100.[12,14,16]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Place the following line(s) in the appropriate route-loading file.
#vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
gni1 1 10.10.100.[101,102]@o2ib6000
gni1 1 10.10.100.[103,104,105,106]@o2ib6002
gni1 1 10.10.100.[111,112,113,114]@o2ib6003
gni1 1 10.10.100.[107,108,109,110]@o2ib6004
gni1 1 10.10.100.[115,116,117,118]@o2ib6005
gni1 2 10.10.100.
[103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118]@o2ib6000
gni1 2 10.10.100.[107,108,109,110]@o2ib6002
gni1 2 10.10.100.[115,116,117,118]@o2ib6003
gni1 2 10.10.100.[103,104,105,106]@o2ib6004
gni1 2 10.10.100.[111,112,113,114]@o2ib6005
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2. Update the modprobe information for the Lustre servers using the ip2nets and routes information
produced by the previous step. For more information, refer to the site-specific Lustre server documentation.
6
Lustre System Administration
6.1
Lustre Commands for System Administrators
Cray provides administrative commands that configure and maintain Lustre file systems, as shown in Lustre
Administrative Commands Provided with CLE. Man pages are accessible by using the man command on a Cray
system.
For more information about standard Lustre system administration, see the following man pages: Lustre(7),
mount(8), mkfs.lustre(8), tunefs.lustre(8), mount.lustre(8), lctl(8), and lfs(1).
Table 3. Lustre Administrative Commands Provided with CLE

Command             Function
lustre_control      Manages Lustre file systems using standard Lustre commands and a customized
                    Lustre file system definition file.
update_fs_defs      Updates an older fs_defs file to the format required in CLE 4.1 and beyond.
xtlusfoadmin        Displays the contents of the lustre_failover, lustre_service, and filesystem
                    database tables in the service database (SDB). It is also used by the system
                    administrator to update database fields to enable or disable automatic Lustre
                    service failover handled by the xt-lustre-proxy daemon. (Direct-attached
                    Lustre file systems only.)
xtlusfoevntsndr     Sends Lustre failover imperative recovery events to start the Lustre client
                    connection switch utility on login and compute nodes. (Direct-attached Lustre
                    file systems only.)
6.2
Identify MDS and OSTs
Identifying MDS and OSTs
Use the lustre_control status command to identify the OSTs and MDS for direct-attached file systems.
This command must be run as root.
boot# lustre_control status -a
To identify the OSTs and MDS on all (including external) Lustre file systems, as root, use the lfs check
servers command.
login# lfs check servers
If there is more than one Lustre file system, the lfs check servers command does not necessarily sort the
OSTs and MDSs by file system.
Checking the Status of Individual Nodes
The status of targets on an individual node can be checked with the lustre_control status command.
boot# lustre_control status -a -w nodename
6.3
Start Lustre
About this task
Lustre file systems are started at boot time by CLE boot automation files. Lustre file systems can be manually
started using the lustre_control command.
Start all installed Lustre file systems.
Procedure
1. Start the file systems using lustre_control.
boot# module load lustre-utils
boot# lustre_control start -a
2. Mount the service node clients.
boot# lustre_control mount_clients -a
3. Mount the compute node clients. (If the appropriate /etc/fstab entries for the Lustre file system are
present in the CNL boot image, then the compute nodes—at boot—will mount Lustre automatically.)
To manually mount Lustre on compute nodes that are already booted, use the following command.
boot# lustre_control mount_clients -a -c
6.4
Stop Lustre
About this task
Lustre file systems are stopped during shutdown by CLE system boot automation files. Lustre file systems can be
manually stopped using the lustre_control command.
Alternatively, if all compute node and service node clients must be unmounted and all services stopped, the
lustre_control shutdown -a command can be executed in a single step. The following procedure breaks this
process up into three steps.
Procedure
1. Unmount Lustre from the compute node clients.
boot# lustre_control umount_clients -a -c
2. Unmount Lustre from the service node clients.
boot# lustre_control umount_clients -a
3. Stop Lustre services.
boot# lustre_control stop -a
For more information, see the lustre_control(8) man page.
6.5
Add OSSs and OSTs
About this task
New object storage servers (OSSs) and object storage targets (OSTs) can be added—or new targets can be
added to existing servers—by performing the following procedure.
Procedure
1. Unmount Lustre from the compute node clients.
boot# lustre_control umount_clients -f fs_name -c
2. Unmount Lustre from the service node clients.
boot# lustre_control umount_clients -f fs_name
3. Stop Lustre services.
boot# lustre_control stop -f fs_name
4. Update the Lustre file system definition file, /etc/opt/cray/lustre-utils/fs_name.fs_defs on the
SMW node.
Add the Lustre server host to LNet nid mapping (unless there is already a nid_map listing for this OSS).
nid_map: nodes=nid00026 nids=26@gni
5. Add the OSTs.
ost: node=nid00026
dev=/dev/disk/by-id/IDa
index=n
ost: node=nid00026
dev=/dev/disk/by-id/IDb
index=n+1
The new index numbers (index=n) in this sequence must follow the pre-existing index sequence numbers.
6. Remove the existing file system configuration and rerun the lustre_control install command with the
updated fs_defs file.
smw# lustre_control remove -c p0 -f fs_name
smw# lustre_control install -c p0 /home/crayadm/fs_name.fs_defs
7. Format the new targets on the OSS.
nid00026# mkfs.lustre --fsname=filesystem --index=n \
--ost --mgsnode=12@gni /dev/disk/by-id/IDa
nid00026# mkfs.lustre --fsname=filesystem --index=n+1 \
--ost --mgsnode=12@gni /dev/disk/by-id/IDb
If desired, other file system options can be added to this command, such as --mkfsoptions, --index, or
--failnode. For more information, see the mkfs.lustre(8) man page.
8. Regenerate the Lustre configuration logs with the lustre_control script.
boot# lustre_control write_conf -f fs_name
Lustre services will be started and then stopped as part of this command to allow for proper registration of the
new OSTs with the MGS.
9. Start the Lustre file system.
boot# lustre_control start -p fs_name
10. Mount Lustre on the service node clients.
boot# lustre_control mount_clients -f fs_name
11. Check that the OST is among the active targets in the file system on the login node from the Lustre mount
point.
boot# ssh login
login# cd /fs_name
login:/fs_name# lfs check servers
fs_name-MDT0000-mdc-ffff8100f14d7c00 active.
fs_name-OST0001-osc-ffff8100f14d7c00 active.
fs_name-OST0002-osc-ffff8100f14d7c00 active.
fs_name-OST0003-osc-ffff8100f14d7c00 active.
fs_name-OST0004-osc-ffff8100f14d7c00 active.
12. Write a file to the Lustre file system from the login node to test if a new target is receiving I/O.
login:/fs_name# cd mydirectory
login:/fs_name/mydirectory# lfs setstripe testfile -s 0 -c -1 -i -1
login:/fs_name/mydirectory# dd if=/dev/zero of=testfile bs=10485760 count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.026317 seconds, 398 MB/s
Then check that the file object is stored on the new target using lfs.
login:/fs_name/mydirectory# lfs getstripe testfile
OBDS:
0: ost0_UUID ACTIVE
1: ost1_UUID ACTIVE
2: ost2_UUID ACTIVE
3: ost3_UUID ACTIVE
4: ost4_UUID ACTIVE
testfile
     obdidx      objid      objid      group
          4    1237766   0x12e306          0
          3     564292    0x89c44          0
          1     437047    0x6ab37          0
          0     720254    0xafd7e          0
          2     487517    0x7705d          0
13. Mount Lustre on the compute node clients.
boot# lustre_control mount_clients -f fs_name -c
6.6
Recover From a Failed OST
Use these procedures when an OST has failed and is not recoverable by e2fsck. In this case, the individual
OST can be reformatted and brought back into the file system. Before reformatting, the OST must be deactivated
and any striped files residing on it must be identified and removed.
6.6.1
Deactivate a Failed OST and Remove Striped Files
Procedure
1. Log in to the MDS and deactivate the failed OST in order to prevent further I/O operations on the failed
device.
nid00012# lctl --device ostidx deactivate
The ostidx is displayed in the left column of the output generated by the lctl dl command.
2. Regenerate the list of Lustre devices and verify that the state for the deactivated OST is IN (inactive) and not
UP.
nid00012# lctl dl
3. Identify the ostname for the OST by running the following command.
login> lfs df
UUID                   1K-blocks        Used   Available  Use%  Mounted on
lustre-MDT0000_UUID    358373232     1809780   336083452    0%  /lus/nid00012[MDT:0]
lustre-OST0000_UUID   2306956012  1471416476   718352736   63%  /lus/nid00018[OST:0]
lustre-OST0001_UUID   2306956012  1315772068   873988520   57%  /lus/nid00018[OST:1]
The ostname will be similar to fsname-OSTxxxx_UUID.
4. Log in to a Lustre client, such as a login node, and search for files on the failed OST.
login> lfs find /mnt/filesystem --print --obd ostname
5. Remove (unlink or rm) any striped files on the OST before reformatting.
6.6.2
Reformat a Single OST
About this task
Refer to this procedure if there is a failed OST on a Lustre file system, for example, if the OST is damaged and
cannot be repaired by e2fsck. This procedure can be used for an OST that is available and accessible;
however, prior to completing the remaining steps, Deactivate a Failed OST and Remove Striped Files on page 42
should be completed to generate a list of affected files and unlink or remove them.
Procedure
1. Unmount Lustre from the compute node clients.
boot# lustre_control umount_clients -f fs_name -c
2. Unmount Lustre from the service node clients.
boot# lustre_control umount_clients -f fs_name
3. Stop Lustre services.
boot# lustre_control stop -f fs_name
4. Reformat the OST from the OSS node that serves it.
Use values from the file system definition file for the following options: nid is the node value for the mgt,
ostidx is the index value for this OST, and ostdevice is the dev device name for this OST. If there are
any additional ost_mkfs_options in the fs_name.fs_defs file, append them to the -J size=400 value
of --mkfsoptions in the following command. Make sure to append any "catchall" options (such as --param)
to this command as well.
nid00018# mkfs.lustre --reformat --ost --fsname=fs_name --mgsnode=nid@gni \
--index=ostidx --param sys.timeout=300 --mkfsoptions="-J size=400" ostdevice
5. Regenerate the Lustre configuration logs on the servers by invoking the following command from the boot
node.
boot# lustre_control write_conf -f fs_name
6. On the MDS node, mount the MDT device as ldiskfs, and rename the lov_objid file.
nid00012# mount -t ldiskfs mdtdevice /mnt
nid00012# mv /mnt/lov_objid /mnt/lov_objid.old
nid00012# umount /mnt
7. Start Lustre on the servers.
boot# lustre_control start -f fs_name
8. Activate the newly reformatted OST on the MDS device.
a. Generate a list of all the Lustre devices with the lctl dl command. (Note the device index for the OST
that was reformatted in the far left column.)
nid00012# lctl dl
b. Activate the OST using the index from the previous step as ostidx.
nid00012# lctl --device ostidx activate
c. Regenerate the list of Lustre devices and verify that the state for the activated OST is UP and not IN.
nid00012# lctl dl
9. Mount Lustre on the clients.
6.7
OSS Read Cache and Writethrough Cache
This section describes several commands that can be used to tune, enable, and disable object storage servers
(OSS) server cache settings. If these settings are modified on a system, the modification commands must be run
on each of the OSSs.
Lustre uses the Linux page cache to provide read-only caching of data on OSSs. This strategy reduces disk
access time caused by repeated reads from an object storage target (OST). Increased cache utilization, however,
can evict more important file system metadata that subsequently needs to be read back from the disk. Very large
files are not typically read multiple times, so their pressure on the cache is unwarranted.
Administrators can control the maximum size of files that are cached on the OSSs with the
readcache_max_filesize parameter. To adjust this parameter from the default value for all OSTs on
nid00018, invoke the following command.
nid00018# lctl set_param obdfilter.*.readcache_max_filesize=value
The asterisk in the above command is a wild card that represents all OSTs on that server. If administrators wish to
affect a single target, individual OST names can be used instead.
This command sets readcache_max_filesize to value, so that files larger than value will not be cached on
the OSS servers. Administrators can specify value in bytes or shorthand such as 32MiB. Setting value to -1
will cache all files regardless of size (this is the default setting).
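For example, to cache only files smaller than 32 MiB on every OST served by nid00018 (the size shown here is illustrative, not a recommended value):

nid00018# lctl set_param obdfilter.*.readcache_max_filesize=32M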
OSS read cache is enabled by default. It can be disabled, however, by setting /proc parameters. For example,
invoke the following on the OSS.
nid00018# lctl set_param obdfilter.*.read_cache_enable=0
Writethrough cache can also be disabled. This prevents file writes from ending up in the read cache. To disable
writethrough cache, invoke the following on the OSS.
nid00018# lctl set_param obdfilter.*.writethrough_cache_enable=0
Conversely, setting read_cache_enable and writethrough_cache_enable equal to 1 will enable them.
6.8
Lustre 2.x Performance and max_rpcs_in_flight
Performance comparisons of 1.8.x and 2.x Lustre clients have brought to light large differences in direct I/O (DIO)
rates. Lustre clients use a semaphore initialized to max_rpcs_in_flight to throttle the amount of I/O RPCs to
each storage target. However, Lustre 1.8.x clients do not abide by the tunable for DIO requests, and there is no
limit to the number of DIO requests in flight. This results in increased single client DIO performance compared to
2.x clients. Aggregate performance is comparable given enough clients.
To increase single client DIO performance of Lustre 2.x, modify the max_rpcs_in_flight osc tunable. The
tunable can be configured permanently for the file system via the lctl conf_param command on the MGS.
Alternatively, for direct-attached and esFS Lustre file systems, the lustre_control set_tune can be used to
easily change it in a non-permanent fashion. The default value is 8. The maximum value is 256.
# lctl conf_param osc.*.max_rpcs_in_flight=value
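The current setting can be checked from a Lustre client before and after the change. This verification step is offered for convenience and is not part of the Cray-documented procedure:

login# lctl get_param osc.*.max_rpcs_in_flight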
Be aware that increasing the value also increases client memory consumption. This might be a concern for file
systems with a large number of object storage targets (OSTs). There will be an additional 9k/RPC/target increase
for each credit. Note that increasing the tunable to increase single client DIO performance will also allow more
RPCs in flight to each OST for all other RPC request types. That means that buffered I/O performance should
increase as well.
Also be aware that more operations in flight will put an increased load onto object storage servers (OSSs).
Increased load can result in longer service times and recovery periods during failover. Overloaded servers are
identified by slow response and/or informational messages from clients. However, even large increases in
max_rpcs_in_flight should not cause overloading. Cray has completed large-scale testing of increased
max_rpcs_in_flight values using file systems with tens of thousands of simulated clients (that were
simulated with multiple mounts per client) under extreme file system load. No scale issues have been found.
6.9
Check Lustre Disk Usage
From a login node, type the following command.
login$ df -t lustre
Filesystem       1K-blocks     Used   Available  Use%  Mounted on
12@gni:/lus0    2302839200  1928332  2183932948    1%  /lustre
The lfs df command can be used to view free space on a per-OST basis.
login$ lfs df
UUID                1K-blocks        Used   Available  Use%  Mounted on
mds1_UUID           958719056    57816156   900902900    6%  /scratch[MDT:0]
ost0_UUID          1061446736   570794228   490652508   53%  /scratch[OST:0]
ost1_UUID          1061446736   571656852   489789884   53%  /scratch[OST:1]
ost2_UUID          1061446736   604100184   457346552   56%  /scratch[OST:2]
ost3_UUID          1061446736   604444248   457002488   56%  /scratch[OST:3]
ost4_UUID          1061446736   588747532   472699204   55%  /scratch[OST:4]
ost5_UUID          1061446736   597193036   464253700   56%  /scratch[OST:5]
ost6_UUID          1061446736   575854840   485591896   54%  /scratch[OST:6]
ost7_UUID          1061446736   576749764   484696972   54%  /scratch[OST:7]
ost8_UUID          1061446736   582282984   479163752   54%  /scratch[OST:8]
ost9_UUID          1061446736   577588324   483858412   54%  /scratch[OST:9]
ost10_UUID         1061446736   571413316   490033420   53%  /scratch[OST:10]
ost11_UUID         1061446736   574388200   487058536   54%  /scratch[OST:11]
ost12_UUID         1061446736   593370792   468075944   55%  /scratch[OST:12]
ost13_UUID         1061446736   585151932   476294804   55%  /scratch[OST:13]
ost14_UUID         1061446736   564455796   496990940   53%  /scratch[OST:14]

filesystem summary: 15921701040  8738192028  7183509012   54%  /scratch
6.10 Lustre User and Group Quotas
Disk quotas provide system administrators with the ability to set the amount of disk space available to users and
groups. Cray Lustre utilities allow administrators to easily enable user and group quotas on a system.
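Once quotas are enabled, current usage and limits can be displayed with the standard lfs quota command. The user name and mount point below are placeholders:

login$ lfs quota -u username /fs_name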
6.11 Check the Lustre File System
Lustre makes use of a journaling file system for its underlying storage (OSTs and MDT). This journaling feature
allows for automatic recovery of file system data following a system crash. While not normally required for
recovery, the e2fsck command can be run on individual OSTs and the MDT to check integrity. Before Lustre is
restarted, administrators should always run an e2fsck on a target associated with any file system failures. In the
case of a catastrophic failure, a special Lustre file system check utility, lfsck, is also provided. The lfsck
command can be used to check coherency of the full Lustre file system.
The lustre_control command provides an fs_check action. It performs an e2fsck on specified devices by
label, all devices on a specified host, or even all target types in a specified file system.
6.11.1 Perform an e2fsck on All OSTs in a File System with lustre_control
Procedure
Check all object storage targets (OSTs) in fs_name with lustre_control.
boot# lustre_control fs_check -f fs_name -t OST
For more information, see the lustre_control(8), e2fsck(8) and lfsck(8) man pages.
6.12 Lustre liblustreapi Usage
Users can compile on a login node against liblustreapi.
cc program.c `pkg-config --cflags --libs cray-lustre-api-devel`
6.13 Dump Lustre Log Files
When Lustre encounters a problem, internal Lustre debug logs are generated on the MDS and OSS nodes as
log files in the /tmp directory. These files can be dumped on both server and client nodes. Log files are named by
a timestamp and PID. For example, /tmp/lustre-log-nid00135.1122323203.645.
The xtdumpsys command does not collect this data automatically, and since the files reside in /tmp, they
disappear on reboot. Create a script to retrieve the dump files from all MDS and OST nodes and store them in the
dump directory. The files can be collected by invoking the script at the end of the xtdumpsys_mid function in an
xtdumpsys plugin file.
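A minimal collection script might look like the following sketch. The node list and destination directory are assumptions and must be adapted to the site and to the xtdumpsys dump directory in use:

#!/bin/bash
# Sketch: gather Lustre debug dumps from the MDS/OSS nodes into one directory.
NODES="nid00012 nid00018 nid00026"     # assumed MDS/OSS node names
DEST=/path/to/dumpdir                  # assumed destination directory
mkdir -p "$DEST"
for node in $NODES; do
    scp "${node}:/tmp/lustre-log-*" "$DEST/" 2>/dev/null
done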
Lustre debug logging can also be enabled on compute node clients. To do this, execute the following command
on the compute nodes.
# echo 1 > /proc/sys/lustre/dump_on_eviction
Collect these logs before shutdown as they will also disappear on reboot.
6.14 File System Error Messages
Lustre errors are normally reported in both the syslog messages file and in the Cray system console log.
Found inode with zero generation or link
Free block count wrong
Free inode count wrong
If there are errors, run e2fsck to ensure that the ldiskfs file structure is intact.
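For example, a read-only check of a single target can be run from the server that owns it. The device path is a placeholder, and the target must be stopped (unmounted) before it is checked:

nid00018# e2fsck -fn /dev/disk/by-id/IDa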
6.15 Lustre Users Report ENOSPC Errors
When any of the object storage targets (OSTs) that make up a Lustre file system become filled, subsequent writes
to the OST may fail with an ENOSPC error (errno 28). Lustre reports that the file system is full, even though there
is space on other OSTs. This can be confusing to users, as the df command may report that the file system has
free space available. Although new files will not be created on a full OST, write requests to existing files will fail if
the write would extend the file on the full OST.
Use the lfs setstripe command to place files on a specific range of OSTs to avoid this problem. Disk usage
on individual OSTs can also be checked by using the lfs df command.
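For example, a directory's default striping can be pointed away from a full OST with lfs setstripe. The stripe count and starting OST index shown here are illustrative only:

login$ lfs setstripe -c 2 -i 10 /fs_name/mydirectory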
7
Lustre Failover on Cray Systems
To support Lustre failover, each LUN (logical unit) must be visible to two Lustre service nodes. For more
information about setting up the hardware storage configuration for Lustre failover, contact a Cray service
representative.
Failover is generally defined as a service that switches to a standby or secondary server when the primary system
fails or the service is temporarily shut down for maintenance. Lustre can be configured to fail over automatically
(see Lustre Automatic Failover on page 52), or failover can be set up to be done manually as described in
Configure Manual Lustre Failover on page 48.
7.1
Configuration Types for Failover
The Lustre failover configuration requires two nodes (a failover pair) that must be connected to a shared storage
device. Nodes can be configured in two ways—active/active or active/passive. An active node actively serves
data and a passive node idly stands by to take over in the event of a failure. (Automatic failover is only supported
on direct-attached Lustre file systems with combined MGS/MDS configurations.)
Active/Passive
In this configuration, only one node is actively serving data all the time. The other node takes over in
the case of failure. Failover for the Lustre metadata server (MDS) is configured this way on a Cray
system. For example, the active node is configured with the primary MDS, and the service is started
when the node is up. The passive node is configured with the backup or secondary MDS service, but
it is not started when the node is up so that the node is idling. If an active MDS fails, the secondary
service on the passive node is started and takes over the service.
Active/Active
In this configuration, both nodes actively serve data all the time on different storage devices. In the
case of a failure, one node takes over for the other, serving both storage devices. Failover for the
Lustre object storage target (OST) is configured this way on a Cray system. For example, one active
node is configured with primary OST services ost1 and ost3 and secondary services ost2 and
ost4. The other active node is configured with primary OST services ost2 and ost4 and
secondary services ost1 and ost3. In this case, both nodes are only serving their primary targets
and failover services are not active. If an active node fails, the secondary services on the other active
node are started and begin serving the affected targets.
CAUTION: To prevent severe data corruption, the physical storage must be protected from simultaneous
access by two active nodes across the same partition. For proper protection, the failed server must be
completely halted or disconnected from the shared storage device.
7.2
Configure Manual Lustre Failover
Manual failover may be attractive to administrators who require full control over Lustre servers. Manual failover is
the only option for failing over file systems with separate MGS/MDS configurations.
Configure Lustre for manual failover using Cray Lustre control utilities to interface with the standard Lustre
configuration system. For additional information about configuring Lustre for Cray systems, see XC™ Series
System Software Initial Installation and Configuration Guide (CLE 6.0.UP01).
7.2.1
Configure DAL Failover for CLE 6.0
Prerequisites
This task assumes the direct-attached Lustre (DAL) file system is configured without failover, and that the system is
currently running with the DAL file system.
About this task
Targets are configured for failover by specifying the fo_node component of the target definition in the fs_defs
file. If new servers are added, then the nid_map in the fs_defs needs to be updated. For manual failover,
auto_fo: no is set in the fs_defs. If necessary, the cray_lustre_client service is updated to specify an
additional MGS LNet nid.
Procedure
1. Add fo_node definitions to each target in the lustre_control fs_defs file (under lustre/.lctrl/) for the
config set.
/var/opt/cray/imps/config/sets/p0/lustre/.lctrl/dal.fs_defs.2015052.1440768293
2. Copy the dal.fs_defs file to a work area on the SMW.
3. Edit the file so that each Lustre target, management target (MGT), metadata target (MDT), and object storage
target (OST) has a corresponding fo_node defined.
## MGT
## Management Target
mgt: node=nid00009
+    fo_node=nid00010
     dev=/dev/disk/by-id/scsi-360080e50003ea4c20000065d5257da5d

## MDT
## MetaData Target(s)
mdt: node=nid00009
+    fo_node=nid00010
     dev=/dev/disk/by-id/scsi-360080e50003ea4c20000065f52580c00
     index=0

## OST
## Object Storage Target(s)
ost: node=nid00010
+    fo_node=nid00009
     dev=/dev/disk/by-id/scsi-360080e50003ea5ba000006605257db2f
     index=0

ost: node=nid00010
+    fo_node=nid00009
     dev=/dev/disk/by-id/scsi-360080e50003ea5ba000006625257db5c
     index=1

ost: node=nid00010
+    fo_node=nid00009
     dev=/dev/disk/by-id/scsi-360080e50003ea5ba000006645257db84
     index=2

4. Set auto_fo to yes.

+    auto_fo: yes
5. Save the fs_defs file.
6. Install the fs_defs into the config set and overwrite the existing file system definition using
lustre_control install on the SMW.
smw# lustre_control install -c p0 dal.fs_defs.20150828.1440768293
Performing 'install' from smw at Wed Oct 21 16:28:35 CDT 2015
Parsing file system definitions file: dal.fs_defs.20150828.1440768293
Parsed file system definitions file: dal.fs_defs.20150828.1440768293
A file system named "dal" has already been installed.
Would you like to overwrite the existing file system definitions with those in
dal.fs_defs.20150828.1440768293? (y|n|q)
y
Operating on file system - "dal"
The 'dal' file system definitions were successfully removed.
Failover tables need to be updated. Please execute the following command from
the boot node:
lustre_control update_db
The 'dal' file system definitions were successfully installed!
Failover tables need to be updated. Please execute the following command from
the boot node:
lustre_control update_db
7. Update the SDB failover tables by running the following command on the boot node.
boot# lustre_control update_db
8. Stop the file system.
a. Unmount all clients before stopping the file system.
boot# xtxqtcmd ALL_COMPUTE "umount $DAL_MOUNT_POINT"
b. Unmount the file system on the login nodes and any other nodes where it is mounted.
boot# lustre_control umount_clients -w $NODE_LIST -f $FS_NAME
c. Stop the file system.
boot# lustre_control stop -f $FS_NAME
9. Reboot the DAL nodes.
This step ensures DAL nodes have the most recent version of the config set containing the new version of the
fs_defs file with failover configured. Do not reboot the whole system. The default auto boot file does not run
the correct lustre_control commands needed to initially set up failover.
10. Perform a write_conf before restarting the file system.
boot# lustre_control write_conf -f $FS_NAME
11. Restart the file system.
boot# lustre_control start -p -f $FS_NAME
12. Mount the file system on clients.
a. Mount compute nodes.
boot# xtxqtcmd ALL_COMPUTE "mount $DAL_MOUNT_POINT"
b. Mount service nodes.
boot# lustre_control mount_clients -f $FS_NAME -w $NODE_LIST
7.2.2
Perform Lustre Manual Failover
About this task
If the system is set up for failover and a node fails or an object storage target (OST) is not functional, perform the
following steps to initiate Lustre manual failover.
Procedure
1. Halt the failing node with the xtcli command on the SMW.
smw# xtcli halt -f c0-0c0s6n2
The -f option is required to make sure that an alert flag does not prohibit the halt from completing.
CAUTION: Prior to starting secondary OST services ensure that the primary node is halted.
Simultaneous access from multiple server nodes can cause severe data corruption.
2. Check the status of the OSS node to make sure the halt command executed properly.
smw# xtcli status c0-0c0s6n2
Network topology: class 0
Network type: Gemini
Nodeid: Service Core Arch|  Comp state  [Flags]
--------------------------------------------------------------------------
c0-0c0s6n2:   service  IB06  OP|  halt  [noflags|]
--------------------------------------------------------------------------
3. Start the failover process with the lustre_control failover command.
Here, nid00018 is the failing primary server that was halted in step 1.
boot# lustre_control failover -w nid00018 -f lus0
The recovery process may take several minutes, depending on the number of clients. Attempting warm boots
during Lustre recovery is not advised as it will break recovery for the remaining clients. Recovery will begin
automatically as clients reconnect, unless abort_recovery is specified via lctl.
To monitor the status of recovery, see Monitor Recovery Status on page 52
4. After recovery, run the df or lfs df command on a login node to check that all services are working
properly. Applications that use Lustre on login nodes should be able to continue.
If there are a large number of clients doing Lustre I/O at the time that the failure occurs, the recovery time
may become very long. But it will not exceed the value specified by the recovery_time_hard parameter in
the fs_name.fs_defs file.
7.2.3
Monitor Recovery Status
The lustre_control status command may be used to monitor an OST that is in the recovery process. Upon
completion, the status changes from RECOVERING to COMPLETE.
OST Recovery After Failover
boot# lustre_control status -t ost -a
Performing 'status' from boot at Tue May 8 08:09:28 CDT 2012
File system: fs_name
Device              Host       Mount      OST Active   Recovery Status
fs_name-OST0000     nid00026   Unmounted  N/A          N/A
fs_name-OST0000*    nid00018   Mounted    N/A          RECOVERING
[...]
Note: '*' indicates a device on a backup server

boot# lustre_control status -t ost -a
Performing 'status' from boot at Tue May 8 08:22:38 CDT 2012
File system: fs_name
Device              Host       Mount      OST Active   Recovery Status
fs_name-OST0000     nid00026   Unmounted  N/A          N/A
fs_name-OST0000*    nid00018   Mounted    N/A          COMPLETE
Note: '*' indicates a device on a backup server
The recovery status is recorded in the following /proc entries.

For OSS:    /proc/fs/lustre/obdfilter/lus0-OST0000/recovery_status
            /proc/fs/lustre/obdfilter/lus0-OST0002/recovery_status
For MDS:    /proc/fs/lustre/mds/lus0-MDT0000/recovery_status
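For example, recovery progress for a single OST can be watched directly on the OSS that is serving it; the node name below is a placeholder:

nid00018# cat /proc/fs/lustre/obdfilter/lus0-OST0000/recovery_status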
7.3
Lustre Automatic Failover
This section describes the framework and utilities that enable Lustre services to failover automatically in the event
that the primary Lustre services fail. Lustre automatic failover is only applicable to direct-attached Lustre (DAL) file
systems.
The automatic Lustre failover framework includes the xt-lustre-proxy process, the service database, a set of
database utilities and the lustre_control command. The Lustre configuration and failover states are kept in
the service database (SDB). Lustre database utilities and the xt-lustre-proxy process are used in
conjunction with lustre_control for Lustre startup and shutdown and for failover management. The xt-lustre-proxy process is responsible for automatic Lustre failover in the event of a Lustre service failure.
To enable automatic failover for a Lustre file system, set auto_fo: yes in the file system definition file. If
automatic failover is enabled, lustre_control starts an xt-lustre-proxy process on the MDS and OSSs. It
then monitors the health of MDS and OSS services through the hardware supervisory system (HSS). If
there is a node-failure or service-failure event, HSS notifies the xt-lustre-proxy process on the secondary
node to start up the backup services.
The primary and secondary configuration is specified in the fs_name.fs_defs. The failover configuration is
stored in the SDB for use by xt-lustre-proxy. To avoid both primary and secondary services running at the
same time, the xt-lustre-proxy service on the secondary node issues a node reset command to shut
down the primary node before starting the secondary services. The proxy also marks the primary node as dead in
the SDB so that if the primary node is rebooted while the secondary system is still running, xt-lustre-proxy
will not start on the primary node.
When Lustre automatic failover is configured, the lustre_control utility starts and stops the xt-lustre-proxy
daemon each time Lustre services are started and stopped. lustre_control uses the configuration information
in the fs_name.fs_defs file to start the xt-lustre-proxy daemon with options appropriate for the
configuration. Typically, xt-lustre-proxy is not used directly by system administrators.
Services can be disabled to prevent some MDS or OST services from participating in automatic failover. See the
xtlusfoadmin(8) man page and Use the xtlusfoadmin Command on page 56 for more information on
enabling and disabling Lustre services.
The status of Lustre automatic failover is recorded in syslog messages.
7.3.1 Lustre Automatic Failover Database Tables
Three service database (SDB) tables are used by the Lustre failover framework to determine failover processes.
The lustre_control install command creates and populates the filesystem, lustre_service, and
lustre_failover database tables as described in the following sections. The lustre_control remove
command updates these tables as necessary. After the lustre_control install and the lustre_control
remove operations are performed, lustre_control update_db must be run on the boot node to modify the
three SDB tables.
7.3.1.1 The filesystem Database Table
The filesystem table stores information about file systems. The fields in the filesystem table are shown in
filesystem SDB Table Fields. For more information, see the xtfilesys2db(8) and xtdb2filesys(8) man
pages.
Table 4. filesystem SDB Table Fields

fs_fsidx
    File system index. Each file system is given a unique index number based on the time of day.
fs_name
    Character string of the internal file system name. Should match the value of fs_name as defined in the
    fs_name.fs_defs file.
fs_type
    File system type. Valid values are fs_lustre or fs_other. The fs_other value is currently not used.
fs_active
    File system status snapshot. Valid values are fs_active or fs_inactive.
fs_time
    Timestamp when any field gets updated. Format is 'yyyy-mm-dd hh:mm:ss'.
fs_conf
    Reserved for future use. Specify as a null value using ''.
7.3.1.2 The lustre_service Database Table
The lustre_service table stores the information about Lustre services. The fields in the lustre_service
table are shown in lustre_service SDB Table Fields. For more information, see the xtlustreserv2db(8) and
xtdb2lustreserv(8) man pages.
Table 5. lustre_service SDB Table Fields

lst_srvnam
    Name of the Lustre metadata server (MDS) or object storage target (OST) service.
lst_srvidx
    Service index. For an MDS, use a value of 0. For OSTs, use the index of the OSTs as defined in the
    fs_name.fs_defs file. Format is an integer number.
lst_fsidx
    File system index. Format is a character string.
lst_prnid
    Node ID (NID) of the primary node for this service. Format is an integer value.
lst_prdev
    Primary device name, such as /dev/disk/by-id/IDa, for the metadata target (MDT) or OST. Format is a
    character string.
lst_bknid
    NID of the backup or secondary node for this service. Format is an integer value.
lst_bkdev
    Backup or secondary device name, such as /dev/disk/by-id/IDa, for the MDT or OST. Format is a
    character string.
lst_ltype
    Lustre service type. Valid values are lus_type_mds or lus_type_ost.
lst_failover
    Enables failover. Valid values are lus_fo_enable to enable the failover process and lus_fo_disable to
    disable the failover process.
lst_time
    Timestamp when any field gets updated.
7.3.1.3 The lustre_failover Database Table
The lustre_failover table maintains the Lustre failover states. The fields in the lustre_failover table are
shown in lustre_failover SDB Table Fields. For more information, see the xtlustrefailover2db(8),
xtdb2lustrefailover(8) and lustre_control(5) man pages.
Table 6. lustre_failover SDB Table Fields

lft_prnid
    NID for the primary node.
lft_bknid
    NID for the backup or secondary node. A value of -1 (displayed as 4294967295) indicates there is no backup
    node for the primary node.
lft_state
    Current state for the primary node. Valid states are lus_state_down, lus_state_up, or lus_state_dead.
    The lus_state_dead state indicates that Lustre services on the node have failed and the services are now
    running on the secondary node. The services on this node should not be started until the state is changed to
    lus_state_down by a system administrator.
lft_init_state
    Initial primary node state at system boot. This state is copied to lft_state during system boot. Valid states
    are lus_state_down or lus_state_dead. For normal operations, set the state to lus_state_down. If
    Lustre services on this node should not be brought up, set the state to lus_state_dead.
lft_time
    Timestamp when any field gets updated.
7.3.2 Back Up SDB Table Content
The following set of utilities can be used to dump the database entries to a data file.
CAUTION: By default, these utilities will create database-formatted files named lustre_failover,
lustre_serv, and filesys in the current working directory. Use the -f option to override default
names.
Table 7. Lustre Automatic Failover SDB Table Dump Utilities

Command               Description
xtdb2lustrefailover   Dumps the lustre_failover table in the SDB to the lustre_failover data file.
xtdb2lustreserv       Dumps the lustre_service table in the SDB to the lustre_serv data file.
xtdb2filesys          Dumps the filesystem table in the SDB to the filesys data file.
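For example, to dump all three tables to explicitly named backup files rather than the defaults (a minimal sketch;
the /var/tmp file names are illustrative only):

boot# xtdb2lustrefailover -f /var/tmp/lustre_failover.bak
boot# xtdb2lustreserv -f /var/tmp/lustre_serv.bak
boot# xtdb2filesys -f /var/tmp/filesys.bak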
7.3.3 Use the xtlusfoadmin Command
The xtlusfoadmin command can be used to modify or display fields of a given automatic Lustre failover
database table. When it is used to make changes to database fields, failover operation is impacted accordingly.
For example, xtlusfoadmin is used to set file systems active or inactive or to enable or disable the Lustre
failover process for Lustre services. For more information, see the xtlusfoadmin(8) man page.
Use the query option (-q | --query) of the xtlusfoadmin command to display the fields of a database table.
For example:
xtlusfoadmin -q o    # display lustre_failover table
xtlusfoadmin -q s    # display lustre_service table
xtlusfoadmin -q f    # display filesystem table
xtlusfoadmin -q a    # display all three tables
Some invocations of xtlusfoadmin require the variables fs_index and service_name. The available values
for these variables can be found by invoking the xtdb2lustreserv -f - command, which prints the
lustre_service table to stdout.
7.3.3.1 Identify Lustre File System fs_index and service_name Values
Invoke the xtdb2lustreserv -f - command to display the lustre_service table contents. In this example,
which has been truncated to save space, the fs_index for this file system is 79255228, and the service_name
for the device listed is fs_name-MDT0000.
boot# xtdb2lustreserv -f -
Connected
#
# This file lists tables currently in the system database,
#
# Each line contains a record of comma-delineated pairs of the form \
field1=val1, field2=val2, etc.
#
# Note: this file is automatically generated from the system database.
#
lst_srvnam='fs_name-MDT0000',lst_srvidx=0,lst_fsidx=79255228...
Use the following commands to modify fields in the database and impact failover operation.
Enable/Disable Failover Process for Whole File System
To either enable or disable the failover process for the whole file system, use the activate (-a | --activate_fs)
or deactivate (-d | --deactivate_fs) options with the xtlusfoadmin command. These options set the value
of the fs_active field in the filesystem table to either fs_active or fs_inactive.
boot# xtlusfoadmin -a fs_index    # activate
boot# xtlusfoadmin -d fs_index    # deactivate
These options must be set before the xt-lustre-proxy process starts. If they are set while the proxy is running,
xt-lustre-proxy must be restarted to pick up the change. Always shut down xt-lustre-proxy gracefully
before restarting it; a failover can be triggered if the shutdown is not graceful. A graceful shutdown is a
successful completion of the lustre_control stop command.
Enable or Disable Failover Process for Lustre Service on Specific Node
To enable or disable the failover process for a Lustre service on a specific node, use the --enable_fo_by_nid
(-e) or --disable_fo_by_nid (-f) options.
boot# xtlusfoadmin -e nid    # enable
boot# xtlusfoadmin -f nid    # disable
Enable or Disable Failover Process for a Lustre Service by Name
To enable or disable the failover process for a Lustre service by name, use the enable (-j | --enable_fo) or
disable (-k | --disable_fo) options. These options set the value of the lst_failover field in the
lustre_service table to either lus_fo_enable or lus_fo_disable.
boot# xtlusfoadmin -j fs_index service_name    # enable
boot# xtlusfoadmin -k fs_index service_name    # disable
Change Initial State of Service Node
To change the initial state of a service node, use the --init_state (-i) option. This option sets the value of the
lft_init_state field in the lustre_failover table to either lus_state_down or lus_state_dead.
boot# xtlusfoadmin -i nid n    # down
boot# xtlusfoadmin -i nid d    # dead
Setting a node to dead prevents Lustre services from being started on that node after a reboot.
Reinitialize Current State of Service Node
To reinitialize the current state of a service node, use the --set_state (-s) option. This option is most
commonly used during failback, following a failover: use it to change the state of a primary node from dead to
down in order to fail back to the primary node. This option sets the value of the lft_state field in the
lustre_failover table to either lus_state_down or lus_state_dead.
boot# xtlusfoadmin -s nid n    # down
boot# xtlusfoadmin -s nid d    # dead
7.3.4 System Startup and Shutdown when Using Automatic Lustre Failover
Use the lustre_control command to start Lustre services. lustre_control both starts the Lustre services
and launches xt-lustre-proxy.
The following failover database information will impact startup operations as indicated.
Service Failover Enable/Disable
    A failover-disabled service does not trigger a failover process when it fails. If any service on the node has
    failover enabled, the failure of that service triggers the failover process. To prevent a failover process from
    occurring for an MDS or OSS, failover must be disabled for all the services on that node. Use the
    xtlusfoadmin command to disable failover on a service. For example, to disable failover for an entire file
    system, run this command:

    xtlusfoadmin --disable_fo fs_index

    To disable failover for all services on a node, type the following command:

    xtlusfoadmin --disable_fo_by_nid nid

Initial State
    At system startup, the current state (lft_state) of each primary MDS and OSS node is changed to the
    initial state (lft_init_state), which is usually lus_state_down.

Current State Following an Automatic Failover
    When failing back the primary services from the secondary node after an automatic failover, the primary
    node state will be lus_state_dead and will require re-initialization. The xt-lustre-proxy process
    needs the node to be in the lus_state_down state to start. Use the xtlusfoadmin command to change
    the current state of a node to lus_state_down. For example:

    xtlusfoadmin --set_state nid n
7.3.4.1 Lustre Startup Procedures for Automatic Failover
Procedure
1. Log on to the boot node as root.
2. Start Lustre services and xt-lustre-proxy.
Type the following commands for each Lustre file system that has been configured.
boot# lustre_control start -f fs_name
boot# lustre_control mount_clients -f fs_name
3. Optional: Mount the compute node Lustre clients at this time.
boot# lustre_control mount_clients -c -f fs_name
7.3.4.2 Lustre Shutdown Procedures for Automatic Failover
About this task
CAUTION: The lustre_control shutdown command gracefully shuts down the xt-lustre-proxy
process. Issuing SIGTERM will also work for a graceful shutdown. Any other method of termination, such
as sending a SIGKILL signal, triggers the failover process and results in a failure event delivered to the
secondary node. The secondary node then issues a node reset command to shut down the primary
node and starts Lustre services.
Procedure
1. Log on to the boot node.
2. Shut down the Lustre file system.
boot# lustre_control shutdown -f fs_name
7.3.5 Configure Lustre Failover for Multiple File Systems
In order to support automatic Lustre failover for multiple file systems, the following limitation must be worked
around: the lustre_control stop option terminates the xt-lustre-proxy process on all servers used by the
specified file system, including servers that are shared with other file systems.
When shutting down only a single file system, xt-lustre-proxy must be restarted on the shared servers for
the other file systems that are still active. Follow Shut Down a Single File System in a Multiple File System
Configuration on page 59 to properly shut down a single file system in a multiple file system environment.
7.3.5.1 Shut Down a Single File System in a Multiple File System Configuration
About this task
This procedure is not required to shut down all Lustre file systems. It is only needed to shut down a single file
system while leaving other file systems active.
After stopping Lustre on one file system, restart xt-lustre-proxy on the shared Lustre servers. Lustre
services are still active for the file systems not stopped. The xt-lustre-proxy daemon on the shared servers,
however, is terminated when a file system is shut down. In this example, myfs2 is shut down.
Procedure
1. Unmount Lustre from the compute node clients.
boot# lustre_control umount_clients -c -f myfs2
2. Unmount Lustre from the service node clients.
boot# lustre_control umount_clients -f myfs2
3. Stop Lustre services.
boot# lustre_control stop -f myfs2
4. Restart xt-lustre-proxy on the shared Lustre servers by using lustre_control.
The remaining active Lustre services are not affected when xt-lustre-proxy is started. The
lustre_control start command first starts any MGS services, then any OST services, and finally any MDT
services. If there is an error at any step (for example, an attempt to start an MGS that is already running), the
script exits before attempting to mount any subsequent targets. To successfully restart xt-lustre-proxy,
choose the command(s) to execute next based on the role(s) of the shared servers.
If only OSS servers are shared, execute this command.
boot# lustre_control start -w oss1_node_id,oss2_node_id,... -f myfs1
If only a combined MGS/MDT server is shared, execute this command.
boot# lustre_control start -w mgs_node_id -f myfs1
If a combined MGS/MDT server and OSS servers are shared, execute these commands.
boot# lustre_control start -w mgs_node_id -f myfs1
boot# lustre_control start -w oss1_node_id,oss2_node_id,... -f myfs1
7.4 Back Up and Restore Lustre Failover Tables
About this task
To minimize the potential impact of an event that creates data corruption in the service database (SDB), Cray
recommends creating a manual backup of the Lustre tables that can be restored after a reinitialization of the SDB.
Procedure
Manually Back Up Lustre Failover Tables
1. Log on to the boot node as root.
2. Back up the lustre_service table.
boot# mysqldump -h sdb XTAdmin lustre_service > /var/tmp/lustre_service.sql
3. Back up the lustre_failover table.
boot# mysqldump -h sdb XTAdmin lustre_failover > /var/tmp/lustre_failover.sql
Manually Restore Lustre Failover Tables
4. Log on to the boot node as root.
5. After the SDB is recreated, restore the lustre_service table.
boot# mysql -h sdb XTAdmin < /var/tmp/lustre_service.sql
6. Restore the lustre_failover table.
boot# mysql -h sdb XTAdmin < /var/tmp/lustre_failover.sql
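To confirm the restore, the tables can be dumped back to stdout for inspection (a quick check; it assumes
xtdb2lustrefailover accepts - for stdout in the same way that xtdb2lustreserv does, as shown in
Identify Lustre File System fs_index and service_name Values):

boot# xtdb2lustreserv -f -
boot# xtdb2lustrefailover -f -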
7.5 Perform Lustre Failback on CLE Systems
About this task
In this procedure, nid00018 (ost0 - /dev/disk/by-id/IDa, ost2 - /dev/disk/by-id/IDc) and
nid00026 (ost1 - /dev/disk/by-id/IDb, ost3 - /dev/disk/by-id/IDd) are failover pairs.
nid00018 failed and nid00026 is serving both the primary and backup OSTs. After these steps are completed,
ost0 and ost2 should failback to nid00018.
Procedure
1. Reset the primary node state in the SDB for an automatic failover. (There is no need to do this in manual
failover since the SDB was not involved. If the system is not configured for automatic failover, skip ahead to
step 4.)
During a failover, the failed node was set to the lus_state_dead state in the lustre_failover table.
This prevents xt-lustre-proxy from executing upon reboot of the failed node. The failed node must be
reset to the initial lus_state_down state. The following displays the current and initial states for the primary
node. In this example, nid00018 has failed and nid00026 now provides services for its targets.
nid00026# xtlusfoadmin
sdb lustre_failover table
PRNID  BKNID  STATE           INIT_STATE      TIME
12     134    lus_state_up    lus_state_down  2008-01-16 14:32:46
18     26     lus_state_dead  lus_state_down  2008-01-16 14:37:17
26     18     lus_state_up    lus_state_down  2008-01-16 14:31:32
2. Reset the state using the following command.
nid00026# xtlusfoadmin -s 18 n
lft_state in lustre_failover table was updated to lus_state_down for nid 18
Here the command option -s 18 n sets the state for the node with nid 18 to n (lus_state_down). For
more information, see the xtlusfoadmin(8) man page.
3. Run xtlusfoadmin again to verify that the state has been changed.
nid00026# xtlusfoadmin
sdb lustre_failover table
PRNID  BKNID  STATE           INIT_STATE      TIME
12     134    lus_state_up    lus_state_down  2008-01-16 14:32:46
18     26     lus_state_down  lus_state_down  2008-01-16 14:59:39
26     18     lus_state_up    lus_state_down  2008-01-16 14:31:32
4. Unmount the secondary OSTs from the remaining live OSS in the failover pair.
In this case, ost0 and ost2 are the secondary OSTs and nid00026 is the remaining live OSS.
nid00026# umount /mnt/lustre/fs_name/ost0
nid00026# umount /mnt/lustre/fs_name/ost2
It is acceptable if the node is unable to unload some Lustre modules; they are still in use by the primary OSTs
belonging to nid00026. The umount commands, however, must complete successfully before proceeding.
5. Verify that ost0 and ost2 no longer appear in the device list. The following output shows only the devices
that remain in use on nid00026.
nid00026# lctl dl
0 UP mgc MGC12@gni 59f5af70-8926-62b7-3c3e-180ef1a6d48e 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter mds1-OST0001 mds1-OST0001_UUID 9
5 UP obdfilter mds1-OST0003 mds1-OST0003_UUID 3
6. Boot the failed node.
7. Optional: Start recovering Lustre using lustre_control from the boot node with the following command.
boot# lustre_control failback -w nid00018 -f fs_name
8. Check the recovery_status to see if it has completed.
boot# lustre_control status -w nid00018 -f fs_name
8 LMT Configuration for DAL
The Lustre® monitoring tool (LMT) for direct-attached Lustre (DAL) on Cray Linux environment (CLE 6.0) requires
some manual configuration during the software installation process.
Configure Storage for the LMT Database
    At least 40 GB of storage space must be made available to the MGS node. See LMT Disk Usage on page 67.

Configure the LMT MySQL Database
    The IMPS configuration does not set up this database, so it must be configured manually for CLE 6.0.UP01.
    See Configure LMT MySQL Database for DAL on page 63.

Configure the LMT GUI (Optional)
    See Configure the LMT GUI on page 65.
Use the configurator to configure the LMT for DAL on CLE 6.0. Guidance is provided for each LMT configuration
setting in the cfgset utility.
The cray_lmt configurator template configures LMT settings for specific nodes when they are booted. The
default system configuration value for the LMT service is disabled (false). Log in to the SMW as root and use
the cfgset command to modify the cray_lmt configuration settings to configure LMT.
smw# cfgset update -s cray_lmt -m interactive CONFIG_SET
8.1 Configure LMT MySQL Database for DAL
Prerequisites
A MySQL server instance must be configured on the management server (MGS) node. All commands described
below should be executed on the MGS for the direct-attached Lustre (DAL) file system.
About this task
A MySQL server instance on the management server (MGS) node stores real-time and historical Lustre
monitoring tool (LMT) data. The configurator does not handle the initial setup of the LMT MySQL users and
database. It must, therefore, be done manually. All commands described below should be executed on the MGS
for the DAL file system.
Procedure
1. Log on to the MGS as root, where nidMGS is the node ID (NID) of the MGS node.
boot# ssh nidMGS
2. Start the MySQL server daemon (if not already running).
mgs# /sbin/service mysqld start
3. Run the mysql_secure_installation script to improve MySQL server instance security.
This sets the password for the root MySQL user, disallows remote root access to the database, removes
anonymous users, removes the test database, and reloads privileges. If this is the first time configuring LMT,
create a symlink before running mysql_secure_installation to ensure that MySQL uses the correct
socket.
mgs# ln -s /var/run/mysql/mysql.sock /var/lib/mysql/mysql.sock
mgs# mysql_secure_installation
Prompts and recommended responses generated by the script.
Enter current password for root (enter for none): <Enter>
Set root password? [Y/n] Y
New password: Enter a secure password
Re-enter new password: Enter the secure password again
Remove anonymous users? [Y/n] Y
Disallow root login remotely? [Y/n] Y
Remove test database and access to it? [Y/n] Y
Reload privilege tables now? [Y/n] Y
4. Ensure root only access to the LMT user configuration file, /usr/share/lmt/mkusers.sql.
mgs# chmod 600 /usr/share/lmt/mkusers.sql
5. Edit the LMT user configuration file /usr/share/lmt/mkusers.sql.
This file will not be used at run time by any LMT or MySQL processes. It is simply a script that will be run to
create the MySQL users on the persistent storage set up for use by the MySQL databases. Once it is run
through MySQL, it is no longer needed.
mgs# vi /usr/share/lmt/mkusers.sql
This file contains MySQL statements that create users named lwatchclient and lwatchadmin. It gives
them privileges only on databases that start with filesystem_. Cray recommends making the following
changes to mkusers.sql.
Edit the GRANT statements
    Edit the GRANT statements to grant privileges on only filesystem_fsname.*, where fsname is the name
    of the file system. This grants permissions on only the database for the file system being monitored.

Edit the passwords
    Edit the password for lwatchadmin by changing mypass to the desired password. Also add a password for
    the lwatchclient user.
CREATE USER 'lwatchclient'@'localhost' IDENTIFIED BY 'foo';
GRANT SELECT ON filesystem_scratch.* TO 'lwatchclient'@'localhost';

CREATE USER 'lwatchadmin'@'localhost' IDENTIFIED BY 'bar';
GRANT SELECT,INSERT,DELETE ON filesystem_scratch.* TO 'lwatchadmin'@'localhost';
GRANT CREATE,DROP ON filesystem_scratch.* TO 'lwatchadmin'@'localhost';
FLUSH PRIVILEGES;
6. Save the changes and execute the following command. This prompts for the MySQL root user password,
which was set when mysql_secure_installation was executed.
mgs# mysql -u root -p < /usr/share/lmt/mkusers.sql
7. Create the database for the file system to be monitored.
mgs# lmtinit -a fsname
Where fsname is the name of the DAL file system (LMT data will be inserted into the LMT MySQL database
the next time the Cerebro service is restarted on the MGS).
8. Restart Cerebro.
mgs# service cerebrod restart
9. Verify that LMT is adding data to the MySQL database.
a. Initiate the LMT shell.
mgs# lmtsh -f fsname
b. List tables.
fsname> t
c. List tables again after several seconds to verify that Row Count is increasing.

8.2 Configure the LMT GUI
About this task
The Lustre monitoring tool (LMT) graphical user interface (GUI) package is installed on login nodes. It contains a
GUI called lwatch and a command-line tool for viewing live data called lstat. The configuration file ~/.lmtrc
must be set up prior to using either tool.
Procedure
1. Log in to the MGS node as root.
2. Edit the sample configuration file /usr/share/doc/packages/lmt-gui/sample.lmtrc to reflect the
site-specific LMT configuration, where db_name is set to the name of the MySQL database used by LMT,
that is, filesystem_fsname.
# LMT Configuration File - place in $HOME/.lmtrc
filesys.1.name=<insert_fsname_here>
filesys.1.mountname=<insert_/path/to/mountpoint_here>
filesys.1.dbhost=<insert_db_host_ip_here>
filesys.1.dbport=<insert_db_port_here>
filesys.1.dbuser=<insert_db_client_username_here>
# Leave dbauth blank if the given client has no password
filesys.1.dbauth=<insert_db_client_password_here>
filesys.1.dbname=<insert_db_name_here>
3. Save the updated .lmtrc as ~/.lmtrc.
Both lwatch and lstat are now usable.
Here is an example for configuring access to the LMT database for the file system named scratch_1, which
was set up so that the user lwatchclient has no password. In this example, access is being configured on
the LMT server node, so the database is local. Thus, the db_host is localhost.
filesys.1.name=scratch_1
filesys.1.mountname=/lus/scratch_1
filesys.1.dbhost=localhost
filesys.1.dbport=3306
filesys.1.dbuser=lwatchclient
filesys.1.dbauth=
filesys.1.dbname=filesystem_scratch_1
After setting up ~/.lmtrc, lwatch and lstat can be run on this node.
To run the GUI from a remote node, the MySQL database must be configured to allow remote access for the
read-only user, lwatchclient. See Configure LMT MySQL for Remote Access on page 66.
8.3 Configure LMT MySQL for Remote Access
In order to run the Lustre monitoring tool (LMT) graphical user interface (GUI) on a separate node from the LMT
server, the MySQL server instance (running on the LMT server) must be configured to enable remote access for
the LMT read-only user, lwatchclient. These MySQL statements can be added
to /usr/share/lmt/mkusers.sql prior to executing the statements in that file. They can also be executed
directly. In these examples, FSNAME is the name of the file system being monitored.
CREATE USER 'lwatchclient'@'%' IDENTIFIED BY 'foo';
GRANT SELECT ON filesystem_FSNAME.* TO 'lwatchclient'@'%';
To execute these statements directly, log on to the DAL MGS node, open a mysql shell as the root MySQL user,
and run the statements as follows.
mgs# mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
...
mysql> CREATE USER 'lwatchclient'@'%';
Query OK, 0 rows affected (0.00 sec)
...
mysql> GRANT SELECT ON filesystem_FSNAME.* TO 'lwatchclient'@'%';
Query OK, 0 rows affected (0.00 sec)
This enables the user named lwatchclient to connect from any hostname.
To allow connections from a certain IP address, replace the '%' with an IP address in single quotes.
CREATE USER 'lwatchclient'@'10.11.255.252' IDENTIFIED BY 'foo';
GRANT SELECT ON filesystem_FSNAME.* TO 'lwatchclient'@'10.11.255.252';
8.4 LMT Disk Usage
LMT requires at least 40 GB of persistent storage attached to the LMT server (that is, the management server
(MGS)) to store historical data. If the storage becomes full, data can be deleted from the database using MySQL
delete statements.
MySQL Tables
Five tables store general file system statistics. These tables are populated by the lmt_agg.cron script.
Table 8. General File System Tables

Table Name                    On-Disk Growth Rate
FILESYSTEM_AGGREGATE_HOUR     0.8 KB/hour
FILESYSTEM_AGGREGATE_DAY      0.8 KB/day
FILESYSTEM_AGGREGATE_WEEK     0.8 KB/week
FILESYSTEM_AGGREGATE_MONTH    0.8 KB/month
FILESYSTEM_AGGREGATE_YEAR     0.8 KB/year
Table 9. MDS Aggregate Tables and Growth Rates

Table Name             Approximate On-Disk Growth Rate
MDS_AGGREGATE_HOUR     0.5 KB/hour/MDS
MDS_AGGREGATE_DAY      0.5 KB/day/MDS
MDS_AGGREGATE_WEEK     0.5 KB/week/MDS
MDS_AGGREGATE_MONTH    0.5 KB/month/MDS
MDS_AGGREGATE_YEAR     0.5 KB/year/MDS
Table 10. OST Aggregate Tables and Growth Rates

Table Name             On-Disk Growth Rate
OST_AGGREGATE_HOUR     0.7 KB/hour/OST
OST_AGGREGATE_DAY      0.7 KB/day/OST
OST_AGGREGATE_WEEK     0.7 KB/week/OST
OST_AGGREGATE_MONTH    0.7 KB/month/OST
OST_AGGREGATE_YEAR     0.7 KB/year/OST
Calculate Expected Disk Usage for a File System
Use this formula to calculate the approximate rate of disk space usage for a file system. Disregard the
AGGREGATE tables as they grow so much more slowly than the raw data tables.
(56 KB/hour/filesystem) * (# of filesystems) + (1000 KB/hour/MDS) * (# of MDSs)
  + (44 KB/hour/OSS) * (# of OSSs) + (70 KB/hour/OST) * (# of OSTs) = Total KB/hour
Calculate the Disk Usage for a File System for 1 Year
In this example, LMT is monitoring one file system with one MDS, four object storage servers (OSS), and eight
object storage targets (OST). The amount of disk space used by the LMT database is expected to grow at this
hourly rate:

56 KB/hour/filesystem * 1 filesystem + 1000 KB/hour/MDS * 1 MDS
  + 44 KB/hour/OSS * 4 OSSs + 70 KB/hour/OST * 8 OSTs = 1792 KB/hour

This translates to the following yearly rate:

1792 KB/hour * 24 hours/day * 365 days/year * 1 MB/1024 KB * 1 GB/1024 MB = 15 GB/year
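The same arithmetic can be scripted for an arbitrary configuration (a minimal sketch; the variable values below
match the example above and are meant to be adjusted to the site):

#!/bin/bash
# Estimate LMT database growth from the per-component hourly rates given above.
FILESYSTEMS=1   # number of monitored file systems
MDS=1           # number of MDS nodes
OSS=4           # number of OSS nodes
OST=8           # number of OSTs

KB_PER_HOUR=$(( 56*FILESYSTEMS + 1000*MDS + 44*OSS + 70*OST ))
GB_PER_YEAR=$(awk -v kb="$KB_PER_HOUR" 'BEGIN { printf "%.1f", kb*24*365/1024/1024 }')
echo "Estimated LMT growth: ${KB_PER_HOUR} KB/hour (~${GB_PER_YEAR} GB/year)"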
9 LMT Overview
The Lustre monitoring tool (LMT) monitors Lustre file system servers: metadata targets (MDT), object storage
targets (OST), and Lustre networking (LNet) routers. It collects data using the Cerebro monitoring system and
stores it in a MySQL database. Graphical and text clients are provided that display historical and real-time data
pulled from the database.
There is currently no support for multiple MDTs in the same file system (DNE1).
[Figure: LMT components. The MDS, OSS, and LNet router nodes run lmt-server-agent and send data over
Cerebro multicast to the management node running lmt-server; a desktop running lmt-gui accesses the data
through the MySQL API.]
View and Aggregate Data
Two commands display data provided by LMT:
● ltop displays live data
● lmtsh displays historical data from the MySQL database
Configuration of the data aggregation cron job is enabled by using the cfgset command.
smw# cfgset update -s cray_lmt -m interactive partition
Interfaces
An LMT MySQL database is accessed using a MySQL client. The database created is named filesystem_fsname,
where fsname is the name of the file system that LMT monitors.
Additional command-line interfaces (CLIs) to LMT are ltop, lmtinit, and lmtsh. These interfaces are only
available on the LMT server and lmtinit and lmtsh can only be used by root.
● ltop provides access to live data collected by LMT
● lmtinit sets up a MySQL database for LMT
● lmtsh provides access to data in the LMT MySQL database
The LMT graphical user interface (GUI) package provides two other interfaces to LMT called lwatch and lstat.
lwatch is the GUI, and lstat provides output similar to the output of ltop. Any user with network connectivity
to the LMT server and credentials for a MySQL account with read access to the LMT database can use the CLI.
LMT also provides data aggregation scripts that act on raw data in the MySQL database and calculate hourly,
daily, and monthly data. The main script for aggregation is /usr/share/lmt/cron/lmt_agg.cron.
Dependencies
The MySQL server runs on the MGS node. IMPS handles dependencies as long as the needed packages are in
the CentOS image repository.
The two-disk RAID that is currently used as the management target (MGT) must be split into two volumes in
SANtricity. The MGT volume must be 1 GB in size. The other volume must be an ext3 volume using the rest of
the space on the disks (599 GB unformatted).
The LMT GUI requires the Java runtime environment (JRE) and works best with IBM JRE. This is available on the
CentOS media for IMPS DAL.
Failover
The failover MGS can be used as the LMT server as long as all LMT agents (Lustre servers) are configured to
send Cerebro messages to both the primary and the failover MGS. The Cerebro daemon, cerebrod, runs on the
MGS and its failover partner all the time, since the failover partner is the metadata server (MDS). However,
listening on the failover MGS (the MDS) can be turned off until an MGS failover occurs. The disks used for the
MySQL database must be accessible to both the primary and failover MGS, and the nodes must be prevented
from accessing the disks at the same time using STONITH.
If any object storage server (OSS) or MDS fails over, start cerebrod on its failover partner when failover has
completed.
9.1 View and Aggregate LMT Data
View Data
There are two ways to view data provided by the Lustre monitoring tool (LMT). Data can be viewed live with
ltop. Historical data can be viewed from the MySQL database with lmtsh. These utilities are available only on
the LMT server. For CLE with direct-attached Lustre (DAL), the LMT server is the management server (MGS).
For help using ltop or lmtsh, see the man pages, or view usage information using the --help option.
Because the data is held in a MySQL database on the LMT server, the MySQL database can be directly accessed
using MySQL commands if more control is needed over how the data is presented.
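For example, a read-only query can be issued against one of the aggregate tables listed in LMT Disk Usage (a
minimal sketch; it assumes the lwatchclient account and the filesystem_fsname database created during
LMT configuration):

mgs# mysql -u lwatchclient -p filesystem_fsname -e "SELECT * FROM OST_AGGREGATE_HOUR LIMIT 10;"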
Aggregate Data
DAL configuration of the data aggregation cron job is handled through the IMPS configurator. LMT provides
scripts which aggregate data into the MySQL database aggregate tables. To run the aggregation scripts, type the
following:
mgs# /usr/share/lmt/cron/lmt_agg.cron
The first run of the command takes longer than subsequent executions. Use lmtsh to see the tables populated
by the aggregation scripts. The aggregation script can be set up to run as a cron job.
To set up the cron job:
As root, type crontab -e and then enter:
0 * * * * /usr/share/lmt/cron/lmt_agg.cron
This configures the lmt_agg.cron job to run every hour, on the hour.
9.2 Remove LMT Data
The Lustre monitoring tool (LMT) does not provide a way to clear old data from the MySQL database. The
following mysql command, run on the LMT server (the MGS on CLE systems using DAL), clears all data from the
MDS_OPS_DATA table that is older than October 4, 2013 at 15:00:00. fsname is the name of the file system
being cleared.
As root, access the MySQL database:
mysql -p -e "use filesystem_fsname;
delete MDS_OPS_DATA from
MDS_OPS_DATA inner join TIMESTAMP_INFO
on MDS_OPS_DATA.TS_ID=TIMESTAMP_INFO.TS_ID
where TIMESTAMP < '2013-10-04 15:00:00';"
Use the lmtsh shell to completely clear a table. As root, type lmtsh to start the lmtsh shell, then enter the
following at the lmtsh prompt, where TABLE_NAME is the name of the table to clear.
mgs# lmtsh
lmtsh> clr TABLE_NAME
To clear all aggregated tables:
lmtsh> clr agg
See the lmtsh man page for more information.
9.3 Stop Cerebro and LMT
To stop Cerebro from feeding data into Lustre monitoring tool (LMT), stop the Cerebro daemon (cerebrod) from
running on all Lustre servers and the LMT server as follows.
mgs# pdsh -w node-list "/sbin/service cerebrod stop"
mgs# /sbin/service cerebrod stop
This will stop the Lustre servers from sending file system data to the Cray management system (CMS). It will also
stop Cerebro from listening for data on the CMS. If required, the MySQL database can be deleted—as described
in this publication.
If cerebrod has been turned on with chkconfig, it can also be turned off with chkconfig so that it won't start
every time the system is booted. To turn off cerebrod, use the same command as for turning it on, but replace
on with off. (This does not stop cerebrod immediately—use the service command to do that, as shown
above.)
mgs# chkconfig --level 235 cerebrod off
9.4 Delete the LMT MySQL Database
Prerequisites
There must be data stored in the Lustre monitoring tool (LMT) MySQL database to delete.
About this task
This procedure deletes all LMT data.
Procedure
1. Log into the LMT server (the management server (MGS) node in direct-attached Lustre (DAL) systems).
2. Delete the LMT MySQL database where fsname is the name of the file system to be removed.
mgs# lmtinit -d fsname
3. Optional: Remove the MySQL users added by LMT.
mgs# mysql -u root -p -e "drop user 'lwatchclient'@'localhost'; drop user
'lwatchadmin'@'localhost';"
9.5 LMT Database Recovery Process
The Lustre monitoring tool (LMT) database can become corrupted when the management server (MGS)/primary
metadata server (MDS) crashes in a direct-attached Lustre (DAL) file system. The corruption can be repaired by
running mysqlcheck on the MGS/primary MDS.
Run mysqlcheck just after the primary MDS is rebooted. LMT will work as soon as the primary MDS is rebooted,
as long as the database is usable. If mysqlcheck is run after the reboot, LMT generates performance numbers
even when the secondary MDS is in use.
nid00325# mysqlcheck -r -A -p
Enter password:
filesystem_dal.EVENT_DATA                      OK
filesystem_dal.EVENT_INFO                      OK
filesystem_dal.FILESYSTEM_AGGREGATE_DAY        OK
filesystem_dal.FILESYSTEM_AGGREGATE_HOUR       OK