VAXcluster Systems Quorum

The Technical Journal for VAXcluster System Management
In This Issue:
• About the HSC Cache
• VAXcluster State Change Timing
• Building Large VAXcluster Configurations
Volume 8, Issue 1, August 1992
Quorum is published quarterly by Corporate User Information Products Group in
Marlborough, MA. It contains VAXcluster-specific technical articles and VAXcluster-related articles.
Quorum welcomes comments and suggestions from its readers. Individuals or groups
are encouraged to submit articles about their configurations and experiences. The editor
reserves the right to edit, condense, and seek further authorization of any contribution.
Please send submissions to the editor.
To obtain a Quorum subscription, contact Susan Pillsbury, (508) 467-7180 (in the USA).
To request back issues of Quorum, contact Susan Pillsbury.
Restricted Rights: Use, duplication, or disclosure by the U.S. Government is subject to
restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and
Computer Software clause at DFARS 252.227-7013.
Copyright ©1992 Digital Equipment Corporation
All Rights Reserved.
Editor
Susan Pillsbury
Assistant Editor
Brenda Tucker
Digital Equipment Corporation
MRO1-3/C8
P.O. Box 1001
Marlborough, MA 01752
The material in this document is for information purposes only. Digital believes the
information to be accurate as of its publication date; such information is subject to
change without notice. Digital is not responsible for any inadvertent errors. The
opinions expressed by non-Digital contributors in this document in no way represent the
attitudes or opinions of Digital, its employees, or management.
The performance information in this document is for guidance only. System performance
is highly dependent on many factors including system hardware, system and user
software, and user application characteristics. Customer applications should be carefully
evaluated before performance is measured. Digital does not warrant or represent that
a user can or will achieve similar performance expressed in transactions per second
(TPS) or normalized cost/performance ($/TPS). No warranty on system performance or
cost/performance is expressed or implied in this document.
The following are trademarks of Digital Equipment Corporation: ACMS, BI, CI, DBMS,
DECamds, DECintact, DECnet, DECperformance, DECram, DECterm, DECUS, DELUA,
DEQNA, DEUNA, Digital, HSC, KDM, MicroVAX, MSCP, MS780–E, MS780–H, PDP–11,
RA, Rdb/VMS, RL, RM, RP, RV, RX, SA, SBI, SPM, TA, ThinWire, TK, TU, ULTRIX,
VAX, VAX DOCUMENT, VAX Performance Advisor, VAX RMS, VAX Supercomputer,
VAX Volume Shadowing, VAXcluster, VAXft, VAXserver, VAXsimPLUS, VAXstation, VMS,
and the DIGITAL logo.
The following is a third-party trademark: PostScript is a registered trademark of Adobe
Systems, Inc.
This document was prepared using VAX DOCUMENT, Version 2.0.
VAXcluster Updates
About the HSC Cache
James Hatt
Mass Storage Engineering
Roy Davis
VAXcluster Systems Engineering
This article offers some answers to commonly asked questions about the
HSC cache option in a VAXcluster system.
Q: What does the HSC cache option do?
A: The HSC cache is a high-speed data store that can reduce the response
time for disk accesses and is available to the entire VAXcluster system.
The HSC cache reduces the response time of an I/O stream that has the
following characteristics:
• A majority of reads
• Small transfer sizes
• Repetitive reads of the same information
The HSC cache was designed to accelerate the reading of frequently used
data. An I/O stream that has more writes than reads can be impeded by
the presence of HSC cache. However, most VMS timesharing situations
have I/O streams composed of mostly reads, and the requested size
is usually less than eight sectors. Figure 1–1 shows the request size
distribution for a timeshare VAXcluster system at the Digital site in
Colorado Springs. This figure represents a typical timeshare environment
and illustrates that the bulk of I/O was small request sizes. Data accesses
must also be repetitive, instead of sequential, to benefit most from the
cache. If sequential reads of the disk are requested, the cached data is
never referenced, and it is also thrown out of the cache at the same rate
at which it is put into the cache. The HSC cache option was designed
to help the mainstream of timeshare environments. These environments
have work loads that comply with the three characteristics mentioned in
the previous list, and the HSC cache is positioned to boost the throughput
of these VAXcluster configurations.
Figure 1–1 Request Size Distribution (probability versus request size in blocks)
The HSC cache can avoid the usual long interval of a disk access by
keeping frequently requested data in its electronic memory. The size of
this data store is 32 megabytes of error correction code (ECC) memory. If
access to disk information is the constraint for a VAXcluster system, the
HSC cache can help eliminate this constraint and improve VAXcluster
performance.
The reduction in response time for a cache hit — a read request satisfied
from cache, instead of read from the disk — is significant. The HSC
cache was specifically designed to reduce the response times for typical
VAXcluster systems that run timeshare applications. Figure 1–2
indicates specification and typical response times for a three-sector
read request. The specification time represents the response time if
the drive specification is used for average seek time. The VMS operating
system tends to collocate the blocks of files, so the typical seek distance is
shorter than the drive’s specification seek distance. Note that when the
typical VMS request is satisfied from HSC cache, a large time savings is
realized. This time savings can accelerate the application issuing the I/O
requests or allow more applications to be run on the VAXcluster system.
For more information on I/O workload characterization, disk latencies,
and I/O performance issues, refer to VAX I/O Subsystems: Optimizing
Performance.1
1 Bates, Ken. VAX I/O Subsystems: Optimizing Performance. Horsham: Professional Press Books, 1991.
Figure 1–2 Response Times (seek, rotate, transfer, and HSC components of the specification response time, the typical VMS response time, and the cache hit response time, illustrating the time savings of a cache hit)
Response time and I/O rate are closely related. Figure 1–3
illustrates how the HSC subsystem affects a disk’s throughput with hit
rates as low as 25 percent. A 25 percent hit rate means that, for 100 I/O
requests, 25 were read requests satisfied from cache and the remaining
75 required disk access. As indicated in the chart, hit rates as low as 25
percent significantly increase the throughput of a disk drive. Holding the
drive response time at 50 milliseconds, the noncached performance is 30
requests per second for an RA92 disk. Throughput rates for cached disks,
at a 50-millisecond response time, are approximately 47, 72, and 147
requests per second for hit rates of 25, 50, and 75 percent, respectively.
As the hit rate increases, the effective bandwidth of the disk increases
dramatically, approaching ESE solid-state disk speeds.
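A simplified model shows why even modest hit rates help so much. This is a first-order sketch that ignores queueing at the drive, and the service times used here are assumptions for illustration, not measurements behind Figure 1–3. If a fraction h of requests is satisfied from cache in time T_c and the remainder from disk in time T_d, the average service time is

   \bar{T} = h\,T_c + (1 - h)\,T_d

With T_d of about 33 milliseconds (roughly 30 requests per second for a noncached drive) and T_c much smaller than T_d, a 50 percent hit rate roughly halves the average service time and therefore roughly doubles the achievable request rate. The measured gains in Figure 1–3 are larger still, because serving hits from cache also reduces queueing delays at the drive.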
Q: Why not use an ESE for my hot files?
A: You should use ESE disks, where appropriate. You should also consider
using the HSC cache to accelerate disk access to the main VAXcluster
storage area. The HSC cache and the ESE products are complementary
products and should be combined to provide the most effective storage
technology for VAXcluster performance. The ESE products are on the
higher end of the storage price/performance scale. The disk drive is in the
middle of the curve; tape technology is on the low end. The mechanical
disk drive offers a much larger storage area per dollar than does the
ESE, but its performance has limitations. The ESE provides much higher
performance, but has storage limitations because of its high price per
Figure 1–3 Disk Throughput (response time in milliseconds versus requests per second for a noncached RA92 disk and for cache hit rates of 25, 50, and 75 percent)
megabyte. To bridge the gap between the ESE and the disk drives, the
HSC cache was developed.
The HSC cache can accelerate the access to and the throughput of the
main storage area of the VAXcluster system, which consists mostly of
rotating mechanically actuated disk drives. Depending on the hit rate,
as seen in Figure 1–3, the throughput of a disk drive can be multiplied
several times by the use of the HSC cache. Therefore, you have the option
of accelerating the access speed to the frequently used blocks in the main
storage area.
Q: What type of storage does the HSC cache option not cache?
A: The HSC subsystem does not cache tape or the following special disk
types:
• Phase I shadowed disks
• Solid-state disks (ESE20, ESE50)
• Internal floppies
• Disks with data check enabled
Tape is not a random-access device; therefore, the HSC cache does not
benefit tape access.
The HSC subsystem does not cache any disk drive that requires special
handling. Moreover, the ESE solid-state disks and floppies are not
appropriate candidates for HSC caching. The Phase I (also known as
controller-based) volume shadowed disks require special treatment by the
HSC software. Phase II (also known as host-based) volume shadowed
disks require no special treatment by the HSC software and can be
cached. Note that Phase II Volume Shadowing is a much more robust
shadowing technology than Phase I Volume Shadowing. It will eventually
replace Phase I as the only type of volume shadowing that is supported by
the VMS operating system.
If the data check option is enabled for read or write requests on a disk,
the integrity of the data that is transferred by each such request is
verified by a second reference to the disk. This activity defeats the HSC
caching goal of improved disk performance. The data check option itself
would be further complicated if the data had to be verified on disk and in
cache. Thus, HSC caching is not used with transfers that have this option
enabled.
Q: How do I know if the HSC cache option will benefit my
VAXcluster system?
A: A tool was developed to analyze an I/O stream on an HSC subsystem to
determine if it will benefit from HSC cache.
The tool is called Cache Needs Analysis Tool (CNAT) and is available
on the distribution media for HSC Version 6.5 software. CNAT is not
supported on the HSC50. CNAT includes two programs that run on VMS:
LGCOPY.EXE and LGCACHE.EXE. The media also includes a PostScript
file, LGCACHE.PS, that has instructions on how to use CNAT to collect
and analyze data.
This tool requires a dedicated disk drive to collect the HSC I/O
information. The dedicated disk drive is not available for any other
use while the I/O information is collected. An HSC utility program is used
to acquire the MSCP commands that define the I/O stream. This utility
is DSKLOG, which was known as LGUTIL in versions prior to HSC
Version 6.0 software. Note that a reboot of the HSC subsystem is required
to activate the DSKLOG utility. Therefore, some advance planning is
required before this tool can be used.
Once the HSC subsystem is configured and ready to collect data, you must
identify the loads of interest and collect data while those loads are running. If
you have a timeshare system that is used primarily during the day, collect
data during the day, when the system load is the greatest. However, if
you want to accelerate batch jobs that are run during the early morning,
acquire the data when the batch jobs run.
After the data is acquired, analysis can take place using the LGCACHE
program. LGCOPY is provided to extract the data from the dedicated
acquisition disk to a VMS file. Once the data is in a VMS file, LGCACHE
reads the data and provides a report. The report includes cached and
noncached response times and their respective I/O rates for each of the
active disks during the measurement period.
The report gives you a profile of the I/O stream for an HSC subsystem and
can predict if this stream will benefit from the HSC cache option. Refer
to the LGCACHE.PS document for help to analyze the report information.
Using CNAT is an effective first step toward understanding whether HSC
cache is an appropriate investment for a VAXcluster system.
Q: Do I need to give up a channel card slot to accept the HSC
cache option?
A: No, there is an empty slot next to the K.pli module where the M.cache
module is installed.
While there is an empty slot available for the M.cache module in all
HSC70 class systems (these include the HSC40, HSC70, HSC60, and
HSC90 systems), the M.cache option is only available for the HSC60 and
HSC90 systems. The HSC40 and HSC70 can be upgraded to an HSC60
or an HSC90, respectively. The HSC60 and HSC90 have some different
modules than those that are used in the HSC40 and HSC70 systems.
These modules include a faster P.ioj (the module that performs the caching
function), a new M.std4 module, optional eight-port K.si modules, and a
new K.ci module set. Note, however, that all four-port channel cards from
all HSC systems are interchangeable.
The cache software looks for the proper hardware configuration and does
not run without it.
Q: How does the 16-bit PDP–11 access the blocks stored in the
32-megabyte M.cache module?
A: A reference to a cached disk is decomposed to point to data structures
that are made up of sets of 8-kilobyte PDP–11 pages that hold references
to the blocks stored in the M.cache module.
The PDP–11 processor on the P.ioj module performs the cache function
in the HSC. An example of how the PDP–11 software executes a cache
lookup is presented here. This example is a simplified representation of
the HSC cache server software.
Data for up to 48 disk units can be cached. While you determine which
disks to cache, the HSC software managing the cache automatically
assigns a cache unit number in the range of 0 to 47 to each disk unit
being cached.
Cache memory is divided into sector buffers. Each sector buffer contains
a 512-byte block of data and 8 bytes of overhead information. When a
cache sector buffer is in use, its address is kept in an HSC data structure
known as an index node or iNode. All iNodes were previously allocated
and distributed among 128 logical buckets numbered from 0 to 127. When
in use, an iNode can account for up to 16 contiguous data blocks stored in
sector buffers whose addresses are in the iNode.
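As a rough sizing sketch (this calculation is not from the article and ignores any cache memory the controller may reserve for other purposes), each sector buffer occupies 512 + 8 = 520 bytes, so a 32-megabyte cache can hold on the order of

   \frac{32 \times 2^{20}}{520} \approx 64{,}500 \text{ sector buffers}

or roughly 31.5 megabytes of cached user data.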
The specific bucket containing the iNode for a particular disk block is
determined by a formula involving the cache unit number and the cylinder
number of the disk on which the block resides.
The following equation illustrates how the bucket number, B, is calculated:

   B = (cylinder number + cache unit number) mod 128

The sum of the cylinder number and cache unit number is divided
by 128, and the remainder of this division determines the bucket.
Within each bucket, iNodes in use are kept in a linked list ordered
according to the starting block number of the range of contiguous blocks
each iNode represents. A search for an iNode representing part or all of a
range of blocks ends either with any iNode whose starting block number
is beyond the desired range or with the end of the list.
Corresponding to each bucket is a bucket header containing 48 cache
unit arrays, one for each cache unit number. The entries in each array
keep track of iNodes for a disk unit within that bucket. To see how this
is done, consider an I/O request that references a set of blocks starting
in cylinder 130 on a disk whose cache unit number is 5. Assume that
HSC parameters are set so this request qualifies for caching, regardless of
whether it is a read or a write request.
The remainder of dividing 128 into the sum of 130 and 5 is 7. Figure 1–4
shows the logical bucket pointed to by the bucket value 7. Within this
bucket, the cache unit array corresponding to cache unit number 5 is
located.
The entries in each array are maintained in starting block number order.
Thus, the array is searched for the last entry whose starting block number
is less than or equal to the starting block number of this I/O request.
Starting with the iNode to which this array entry points, the list of iNodes
that are in use within the bucket is searched.
If one or more iNodes indicate that all the data for a read request is in
cache, the requested data is retrieved from cache. For a read request
where at least part of the data is not in cache and for a write request, one
of three alternative situations prevails:
• The request does not overlap blocks accounted for by any iNode and does not fit within an iNode and remain contiguous with that iNode’s blocks. One or more new iNodes and the necessary sector buffers are allocated to the request.
• The request fits within and can be merged with a set of one or more iNodes with the result that the blocks accounted for by this set are contiguous. If more than one iNode is involved, the blocks these iNodes account for do not overlap. The request simply uses this set of iNodes, and additional sector buffers are allocated to them if necessary.
• The request overlaps blocks accounted for by a set of one or more iNodes, but cannot be merged with those iNodes in a way that satisfies both conditions just described. The iNodes and sector buffers in the set are released; one or more new iNodes and the necessary sector buffers are allocated to the request.
Figure 1–4 Bucket 7 Example (the cylinder number and cache unit number identify Bucket 7; the cache unit arrays in the bucket header point to a collection of iNodes, which in turn reference blocks of data from the disk)
Because of the way in which the HSC subsystem’s micro PDP–11 performs
memory management, all the iNodes in any one bucket are confined to
an 8-kilobyte region of memory. An HSC parameter that ranges from 1
to 16 determines the maximum number of contiguous disk blocks a single
iNode can handle. These two facts limit the number of iNodes within
a bucket. It is, therefore, possible for there to be no free iNodes in the
bucket in which some data is to be cached. When this occurs, the cache
sector buffers pointed to by the least recently used iNode in that bucket
are released, and that iNode is given to the request.
It is also possible that there may not be a sufficient number of free sector
buffers in cache, even after an iNode is released. This situation is handled
by taking sector buffers away from buckets in a round-robin order. A
round-robin pointer points to the next bucket from which buffers are to
be obtained. Sector buffers associated with the least recently used iNode
in that bucket are released, that iNode is returned to a list of free iNodes
for its bucket, and the round-robin pointer is advanced to point to the
next bucket. This procedure is repeated if more sector buffers are needed,
but the maximum number of sector buffers released in this way will not
exceed the maximum number of contiguous disk blocks that an iNode can
handle.
Q: If I have the cache installed, how do I turn it on?
A: There are SETSHO commands to enable the caching software and to
enable the disks you chose for caching.
To enable the cache server software on the HSC, use the following
SETSHO command:
SETSHO>SET SERVER DISK/CACHE
To enable a disk drive, in this case drive 123, to participate in caching,
use the following command:
SETSHO>SET D123 CACHE
No reboot is required to use either of these commands. There are several
commands used on the HSC subsystem to tune and enable the caching
function. The SET commands relevant to the HSC cache and the functions
they perform follow. All the SETSHO utility commands are documented
in Chapter 6 of the HSC Controller User Guide.
The cache server process manages the HSC cache and associated
algorithms if caching is enabled. While the cache sector buffers reside
in the 32-megabyte cache, the data structures used to manage caching
(the bucket headers, iNodes, and so forth) are kept in HSC main memory.
Thus, the disk server knows which disk units are cached and passes
operations for those units to the cache server. The cache server is not
enabled simply because a cache is present, but, rather, by an HSC utility.
HSC disk caching is controlled by the SETSHO utility. Disk caching can
be enabled and disabled dynamically without rebooting using the following
command:
SETSHO>SET SERVER DISK/[NO]CACHE
Read requests beyond a defined maximum size are not cached. This
parameter ranges from 1 to 64 blocks; as of this writing, the default value
is 8. It can be changed, with the new value specified in blocks using the
following command:
SETSHO>SET SERVER CACHE/CACHE_SIZE_THRESHOLD=n
Disks that tend to benefit the most from HSC caching are those that
satisfy all the following conditions:
• They have a high I/O rate.
• They are read intensive. Read intensive here applies to both read only and read and update environments.
• They contain frequently accessed data. The more a particular data item is accessed, the more likely it will be found in cache.
• Read operations with these disks are small. In this context, small means less than or equal to the CACHE_SIZE_THRESHOLD value.
Even though the limit on read requests that are cached can be set as
high as 64 blocks, the iNode data structure used to represent cached data
cannot account for more than 16 contiguous blocks. Thus, more than one
iNode may be required to cache a specific read request. Digital usually
recommends that the number of blocks an iNode can handle be left at
its default value of 16. However, this value can be changed using the
following command:
SETSHO>SET SERVER CACHE/XDE_SIZE=n
The MSCP ERASE command is used to clear or write zeros to one or
more disk blocks. ERASE requests with block counts exceeding a certain
threshold only clear those blocks on disk, but invalidate all the data in the
HSC cache from that disk. ERASE requests with block counts of less than
or equal to the threshold clear the blocks on disk and in cache, but leave
other data in cache from the disk unchanged. Digital recommends leaving
this threshold at its default value of 256 blocks. It can be changed using
the following command:
SETSHO>SET SERVER CACHE/ERASE_PURGE_THRESHOLD=n
Assuming that a disk eligible for HSC caching has the unit number
x, caching can be enabled or disabled for that disk with the following
command:
SETSHO>SET [DEVICE] Dx [NO]CACHE
Blocks written to a cached disk result in two separate actions: the data is
written to the disk, and the cache server adjusts its view of the data. This
adjustment ensures that no obsolete blocks reside in the cache memory.
Whether the blocks from the write request are inserted into or removed
from cache is determined by a cache server policy, known as the write
allocation policy. There are three alternative write allocation policies
relative to the caching of data for write requests. These are known as the
small write policies and can be set on a drive-by-drive basis.
• ALLOCATE — Disk blocks referenced by the write request that are already in cache are updated by the request. Disk blocks referenced by the write request that are not in cache are placed in cache.
  This policy is recommended for disks with data that is referenced frequently or soon after being written.
• INVALIDATE — Disk blocks referenced by the write request that are already in cache are invalidated by the write request. The net result is that these blocks are no longer in cache. Disk blocks referenced by a write request that are not in cache are not placed in cache by the write request. They have no effect on cache.
  This policy is recommended for disks with data that, after being written or modified, is referenced infrequently (if at all).
• UPDATE — Disk blocks referenced by the write request that are in cache are updated by the request. Disk blocks referenced by the write request that are not in cache are not placed in cache by the request and have no effect on cache. This policy is recommended for disks with data that might be referenced soon or frequently after it is written or modified.
For write operations that exceed a certain size known as the WRITE_
THRESHOLD, the write policy is known as the large write policy. The
large write policy is determined by the small write policy. Table 1–1
illustrates the relationship between small and large write policies.
Table 1–1 Small and Large Write Policies

Small Write Policy Set To:    Large Write Policy Becomes:
ALLOCATE                      UPDATE
INVALIDATE                    INVALIDATE
UPDATE                        UPDATE
The WRITE_THRESHOLD parameter, which defaults to 8 blocks, can be
set using the following command:
SETSHO>SET SERVER CACHE/WRITE_THRESHOLD=n
For write operations that do not exceed the WRITE_THRESHOLD, any of
the three policies can be applied to a disk whose unit number is x using
one of the following commands:
SETSHO>SET [DEVICE] Dx W_ALLOCATE
SETSHO>SET [DEVICE] Dx W_INVALIDATE
SETSHO>SET [DEVICE] Dx W_UPDATE
The default small write policy is W_ALLOCATE. Therefore, the default
large write policy is UPDATE. Use the following command to list the disks
enabled for caching and their small write policies:
SETSHO>SHOW DISKS
Unit   Req   Port   Type    State       Cached
 27     7     1     RA70    Available   Yes-A
 60     3     0     RA71    Available   Yes-I
 61     3     6     RA71    Available   Yes-U
182     5     3     ESE20   Available   No
  :     :     :      :         :          :
In the cached field from SHOW DISKS, Yes or No is displayed. If the
drive was enabled for caching, Yes has a single character concatenated to
it to indicate the small write policy in place. This character is A, I, or U
— abbreviations for ALLOCATE, INVALIDATE, or UPDATE.
SHOW SERVER DISK and SHOW SERVER CACHE show the status
of disk caching on the HSC. The SETSHO utility chapter in the HSC
Controller User Guide gives details of what these commands display and
is the ultimate source of documentation for the preceding commands.
Q: How do I make sure my HSC cache is properly configured?
A: Three tools are available to help you configure and tune the HSC cache:
VTDPY, DSCACHE, and DSTAT. These tools come with the HSC Version
6.5 software and higher.
VTDPY is a versatile tool if you are evaluating HSC activity. It displays
I/O rates and cache hit rates on a drive-by-drive basis. If tapes are
present on the HSC, the screen can be locked into disk-only mode by
pressing Ctrl/A on the terminal on which VTDPY runs. Up to 48 disk
drives are listed on every refresh of the screen display. The disks are
labeled D0 to D4095. The field indicated under the S column is the status
field for that drive. When a disk is cached, its status is C. When a drive is
mounted, but not cached, its status is O. If mounted ESE solid-state disks
are using the enhanced protocol, their status is E, otherwise their status
is O. A disk with caching enabled has a number printed in the hits field.
There are up to three statistics listed for each mounted disk: IO/S, IOs,
and HITs. IO/S is the number of requests per second the drive averaged
over the last measurement period. IOs indicates the total number of
requests this drive serviced during the previous refresh period. HITs
indicates the number of requests that were fulfilled from the HSC cache.
Comparing the IOs and HITs fields indicates the ratio of I/O requests to
cache hits. From this ratio, average response times can be inferred and
decisions of whether a particular disk should participate in cache can be
made. All the fields of VTDPY are described in the chapter on VTDPY in
the HSC Controller User Guide.
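For example (the counter values here are hypothetical), the hit ratio for a drive over a refresh period is simply

   \text{hit ratio} = \frac{\mathrm{HITs}}{\mathrm{IOs}}, \qquad \text{for example } \frac{250}{1000} = 25\%

which, as Figure 1–3 indicates, is already enough to increase the drive's effective throughput noticeably.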
VTDPY can give you a feel for the activity on an HSC; this is often
effective. However, there is a new tool available in HSC Version 6.5
software that presents more accurate information about the drives
participating in caching. Therefore, you can make more educated
decisions about configuring the HSC cache.
The DSCACHE program can be found in the HSC Version 6.5 software
distribution. There are two files of interest for the DSCACHE user:
DSCACHE.EXE and DSCACHE.PS. The DSCACHE program is a VMS
executable image, referred to as Cache Performance Analysis Tool (CPAT).
The documentation for the program is in the accompanying PostScript file,
DSCACHE.PS. The article explains the implications of cache configuration
changes so you understand the product and its potential in the VAXcluster
system. For details on the DSCACHE program operation, read the
documentation. This article covers only a few key highlights.
The DSCACHE program uses the HSC utility DSTAT. DSCACHE runs
on VMS and invokes the DSTAT utility to run on the target HSC. The
network connection between the VMS host and the HSC subsystem is
the Diagnostic Utility Protocol (DUP). The user must have DIAGNOSE
privilege enabled. The DSCACHE program prompts the operator for the
name of the HSC subsystem and the update interval. It establishes a
connection to the specified HSC subsystem and passes the update interval
to the DSTAT utility. At this point, DSCACHE waits for measurement
information from DSTAT.
DSTAT runs at intervals of 1 to 90 minutes. Although originally
designed to provide data for DSCACHE, DSTAT may be run by itself
if the user wants raw I/O information from the HSC. The output from
DSTAT contains only simple text characters and is well suited to
postprocessing of HSC information. Some of the DSTAT
output includes the number of reads and writes and the sum of their
sizes. From this information, read-to-write ratios, average read sizes, and
average write sizes can be easily calculated. To save the DSTAT output in
a VMS file, use the following VMS command to establish a connection to
the HSC subsystem and save DSTAT output:
$ SET HOST/HSC/LOG=DSTAT.TXT HSC002
This method of saving DSTAT output can be useful in conjunction with
DSCACHE because DSCACHE can read DSTAT output saved to a file.
This is a useful way for you to compare recent information with old
information and interpret trends.
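Using hypothetical symbols for the counters DSTAT reports (N_r reads and N_w writes, with total read and write sizes S_r and S_w), the derived quantities mentioned above are simply

   \text{read-to-write ratio} = \frac{N_r}{N_w}, \qquad \text{average read size} = \frac{S_r}{N_r}, \qquad \text{average write size} = \frac{S_w}{N_w}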
DSCACHE requires a 132-column-mode terminal for its display. It prints
three columns of disk information and is able to report on all disks.
Drives that benefit little or not at all from the HSC cache are highlighted
on the screen so that they are conspicuous. For each drive, the following
statistics are output to the screen:
Disk   S   cRsp   rdSiz   wrSiz   R/S   Hits   htSz   Pg
 44    C   69%    17.2     7.0    5.6   38%    3.0    1
Briefly, here are the definitions of the DSCACHE output fields:
• The Disk field is the drive number known by the HSC.
• The S field represents the drive status. Three characters are possible: C for a cached disk; O for an online, not cached disk; and U for an unknown drive type. Although no numerical information is displayed for a status of O or U, a drive with status O causes a hint to be printed if more than 75 percent of its requests are reads. The hint is the phrase "Consider caching."
• The cRsp field represents the relative response time of a cached disk. One hundred percent means that this disk has the response time of a noncached disk, and cache is of no help here. When a disk has a response time of greater than 95 percent, its data fields are highlighted to indicate that this disk is benefiting little from cache and should be disabled from caching. In the previous example, a 69 percent response time means that the average response time for disk 44 is 0.69 times the response time for a noncached disk.
• The rdSiz field represents the average read size. Use this field as a basis for changing the cache read threshold. In the previous example, the average read size is 17 sectors, and the HSC default read threshold is eight sectors. A change of the read threshold to 20 might be tried in this case.
• The wrSiz field represents the average write size. Like the rdSiz field, the wrSiz field is a good source of information for adjusting the write threshold.
• The R/S field represents the requests per second. R/S indicates the magnitude of traffic to this disk.
• The Hits field represents the cache hit rate. It indicates what percentage of all read and write commands issued to this drive were satisfied from cache.
• The htSz field represents the average cache hit size. If this field and the average read size field differ greatly, as in the previous example, adjustments to the read threshold may be needed.
• The Pg field represents the purge ratio. This ratio reflects the rate at which this drive is replacing blocks from cache compared with other cached drives. Numbers greater than one indicate that the drive is attempting to use more of the cache by removing more cache blocks than its fair share. It is recommended that this field be used as a last resort in the process of eliminating disks from cache.
The DSCACHE documentation covers strategies on how disks should be
included and excluded from caching. There is also a section on tuning,
which presents strategies on adjusting the cache parameters based on the
use of DSCACHE.
Q: When I want to enable caching on all my drives, do I need to
type in all 48 disks?
A: No, use D_ALL as the drive name in SETSHO. SETSHO provides a
shorthand way to enable or disable all disks for caching. The following is
an example of its use:
SETSHO>SET D_ALL CACHE
SETSHO>SET D_ALL NOCACHE
These commands enable all eligible disks for caching and then disable
them. This shorthand notation saves time when configuring the HSC
subsystem for caching.
VAXcluster State Change Timing
Tom Speake
Global Information Systems
Keith Parris
VMS Engineering
Digital Equipment Corporation
Introduction
This article is about one of the many recovery mechanisms within
VAXcluster systems. To provide resource sharing and concurrency control
for applications and users, a VAXcluster system depends on reliable, high
bandwidth, low-latency communication among all nodes. Every node
must have a direct connection to each of the other nodes. Any permanent
disturbance of this connectivity must be addressed. Every attempt is
made to compensate for the connectivity defect and allow as many of the
members as possible to continue to operate.
The result of this compensation goes by many names. The two most
popular are VAXcluster transition and VAXcluster state change. For
simplicity’s sake, we refer to this as a transition for the rest of this article.
A transition is controlled primarily by code within VMS called the
Connection Manager.
The duration of a transition and the effect of a transition on users
(applications) depends on the nature of the failure, the configuration,
the application, and the extent of the recovery necessary.
This article describes the effects of some of the more common failures and
what steps you can take to recover from them. From this discussion you
can determine the approximate time for recovery from these failures.
Detection and Recovery Phases
Every transition goes through one or more phases. The type of failure
determines the phase involved. The phases are:
• Failure detection
• Repair attempt
• Last gasp datagram processing
• New member detection
• Reconfiguration
• System recovery
• Application recovery
The names of these phases are unofficial, descriptive terms that are used
in this article to facilitate the discussion.
This article discusses each of these phases in some detail and looks at the
causes, duration, effect on the system, and what, if any, control you have
over the duration of the phases.
For our discussion, assume that you have a three-node VAXcluster system,
with the nodes named LARRY, MOE, and CURLY. At the moment, it does
not matter what the interconnect is.
In addition, assume that there are three applications running in the
system. They are:
• Pi Job — A compute-bound job that is calculating the value of pi to the first 1,000,000 decimal places. It is doing no I/O and is not requesting any locks.
• User Edit — A user editing LOGIN.COM.
• Database — One or more users accessing a shared file or database.
Detection Phase
Users usually generate internode (system communication services (SCS))
traffic as a side effect of I/O activity. However, this is by no means
guaranteed. If there is currently no traffic between MOE and CURLY, a
communication defect can go unnoticed for a considerable amount of time.
Therefore, to ensure that there is a minimum amount of message traffic,
there are periodic messages generated by each node. The exact format
and timing of these polling messages depend on the interconnect involved.
Note that the lack of message traffic indicates that there is a problem
but gives little information about the nature of the problem. It is easy
to stand in the computer room and see the loose cable, for instance, but
the operating system running inside the box has no such knowledge.
As I mentioned in a previous article on the quorum concept,1 to really
understand the dilemma of the Connection Manager, you must give up
your privileged position of outside observer and must, so to speak, climb
inside the CPU cabinet.
In VMS Version 5.5, a new System Generation Utility (SYSGEN)
parameter, TIMVCFAIL, was introduced. TIMVCFAIL is specified in
hundredths of a second. The minimum value is 100 (1 second), and
the maximum is 65535 (almost 11 minutes). The default value is 1600
(16 seconds). The duration of the detection phase is a function of the
TIMVCFAIL parameter and the type of interconnect that is used in the
configuration.
Ethernet and FDDI
The PEDRIVER device driver for the Ethernet and Fiber Distributed Data
Interface (FDDI) does not use TIMVCFAIL, unless it has a value of less
than 9 seconds.
There is only one failure mode for the Ethernet. It stops carrying
messages to and from a node or nodes.
CI and DSSI
The CI has more than one failure mode. The most obvious one is when the
CI interface breaks or loses power and it can no longer receive messages
or send acknowledgments. When this happens, the other nodes know the
interface failed as soon as they send a message to that node, because the
failed node does not return an acknowledgment.
The CI interface is microprocessor-based and can do virtual-to-physical
memory address translation. Therefore, it is possible for the CI interface
on a node to work, even though the VAX processor that it is connected to is
not processing messages. In this case, the interface receives the message,
acknowledges it, and stores it in memory. However, the CPU never
processes the message. The classic example of this is when an operator
at the console presses Ctrl/P and then enters HALT. It is also possible
that the processor hardware can fail, or an errant piece of privileged code
ties up the processor and prevents the CI timer reset code from running.
These cases are unlikely compared to power failures and shutdowns.
To detect this situation, the interface has a sanity timer that the operating
system must reset periodically. If the operating system fails to reset the
timer, the timer expires and the interface stops acknowledging messages.
This timeout value is the SYSGEN parameter PASTIMOUT.
Beginning with VMS Version 5.5, the sanity timer in the CI interface is
determined from TIMVCFAIL by the following formula:

   sanity timer (seconds) = TIMVCFAIL / 300

That is, the timer is set to one-third of the TIMVCFAIL interval (recall that
TIMVCFAIL is specified in hundredths of a second).
1 T. Speake, ‘‘The Quorum Concept,’’ VAXcluster Systems Quorum, Volume 4, Issue 4, February 1989.
The SYSGEN parameter PASTIMOUT is still present for compatibility
but cannot be altered through the SYSGEN Utility.
The following formula gives the detection time range for CI and DSSI
in terms of TIMVCFAIL. There are several cases where this time can be
shorter than the range given here:

   TIMVCFAIL/300  <=  detection time (seconds)  <=  TIMVCFAIL/100

Table 2–1 shows some nominal detection times for CI and Ethernet based
on the minimum and default values for the related SYSGEN parameter.
Table 2–1 Detection Times (in seconds)

                            CI                  Ethernet
TIMVCFAIL (seconds)   Minimum   Maximum    Minimum   Maximum
 1                      0.3        1         0.3        1
16                      5.3       16         8          9
32                     10.7       32         8          9
Imagine that there was a failure in the CI connection. The Pi Job
is unaware that there is anything wrong, assuming that the lack of
communication is not the result of the failure of the processor on which Pi
Job is running.
User Edit is not likely to be affected because it is probably not generating
traffic across the faulty connection. Its resources are probably local to the
processor on which it is running.
The Database application may notice a problem if the node on the other
end of the faulty connection is the resource manager for a lock that
is important for access to the shared database. Because of this fault
in communication, the Database application stalls waiting for the lock
request to complete.
The setting of the TIMVCFAIL value is a compromise between low values
that can produce instability and high values that result in long failure
detection times.
There are other mechanisms that can help detect failures. The CI
interface requirement for an immediate hardware acknowledgment
is one example. The software polling used to detect new members is
another example. However, the effect of these mechanisms is generally
minor. TIMVCFAIL is the dominant factor and should be used in your
calculations.
Repair Attempt
If the communication path to a node is lost, the VAXcluster system must
be reconfigured. This reconfiguration process, which is discussed in
Reconfiguration, is expensive in time and system resources. However,
there are transient losses of communication that do not warrant a system
reconfiguration. To prevent transient conditions from causing unnecessary
reconfigurations, the system waits for a period to see if the connection will
be restored.
The amount of time invested in repair attempts is determined by the
SYSGEN parameter, RECNXINTERVAL.
RECNXINTERVAL
The RECNXINTERVAL SYSGEN parameter determines how long the
system tries to reestablish communications. The connection management
code tries for RECNXINTERVAL seconds to reestablish the connection. If
it fails, the system decides that the connection is irrevocably broken. The
RECNXINTERVAL value set by the system manager is actually only a
suggested value.
The VMS operating system enforces a lower limit for the
RECNXINTERVAL parameter, that is, RECNXINTERVAL can be set
no lower than TIMVCFAIL.
There is no practical upper limit for the value of RECNXINTERVAL.1 A
high value increases the amount of time it takes before the Connection
Manager can begin reconfiguring the system. There are reasons for
picking a value higher than the minimum.
Higher values are usually used when you want to ride out network
transients, such as bridge reinitializations.
The value selected for RECNXINTERVAL, like those selected for detection
timers, is a compromise between stability and timely reaction to failures.
Last Gasp Processing
The transitions that begin with the last gasp and new-member detection
phases are probably the most important to understand, because 95 percent
of transitions do not go through the failure detection and repair phases.
When a VMS system bugchecks or is shut down, last gasp processing is the
first phase to be executed.
The VMS bugcheck code causes a datagram to be sent to each of the other nodes.
This datagram states the node’s intention to sever communications and
stop sharing resources. Since this datagram is virtually the last piece
of business transacted by the dying node, it is referred to as a last gasp
datagram.
1 The maximum allowed by SYSGEN of 32767 seconds translates to a little more than 9 hours.
If a node receives this last gasp, the failing node is removed from the
configuration. The reconfiguration phase is entered immediately. This is a
virtual phase and has no real duration.
New-Member Detection
Early in the boot sequence, polling messages are sent out by the booting
node to find VAXcluster members. The would-be member selects one
of the current members to ask for membership. This selection does not
substantively affect the detection time or the duration of the next phase.
The current member acts as the would-be member’s advocate and proposes
reconfiguring the VAXcluster system to include the new member. This
initiates the reconfiguration phase. While the new node is booting, and
until the reconfiguration phase is begun, none of the applications is
affected.
Reconfiguration
Once the communication defect is declared permanent and irreparable or a
connection to a new member needs to be established, a new configuration
must be established.
One node is chosen to be coordinator, and messages are exchanged with all
nodes to determine the optimal subcluster. The algorithm that determines
the optimal subcluster configures a VAXcluster system with the most
votes and the most nodes. During this phase, user activity is blocked.
This phase usually lasts less than 1 second.
If the new configuration has quorum, the transition moves to the next
phase. If the new configuration does not have quorum, all user activity
is stalled and the nodes wait for an event that changes the configuration,
such as a node booting. At that point, this phase is reexecuted to
determine the new configuration and if quorum exists. In our example
situation, none of our three jobs is running. This can happen not only
because the failed node’s votes were necessary, but because the quorum
disk’s votes are not valid at this point.
If quorum is lost because the failing node holds a necessary vote, there is
no way to determine how long the reconfiguration pause will last. It lasts
until quorum is regained.
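For reference, quorum is computed from the expected votes using integer division (this is the standard VMS quorum calculation, not something specific to this article):

   \mathrm{QUORUM} = \left\lfloor \frac{\mathrm{EXPECTED\_VOTES} + 2}{2} \right\rfloor

For example, with three voting members holding one vote each, quorum is 2; the configuration survives the loss of any single voter, but losing a second voter stalls all user activity until a voter returns.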
If quorum is lost because of the quorum disk, the duration can be as long
as 4*QDSKINTERVAL seconds.
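For example, assuming QDSKINTERVAL is at its default value of 10 seconds (an assumption; check the value on your systems), the stall could last up to

   4 \times \mathrm{QDSKINTERVAL} = 4 \times 10 = 40 \text{ seconds}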
System Recovery
Once a new configuration is established, the following additional tasks
are required to complete the recovery:
• Lock database rebuild
• Quorum disk validation
• Disk rebuild
Some of these tasks can be completed in parallel.
Lock Database Rebuild
Because it is distributed among all members, some portion of the lock
database may need rebuilding. There are four types of rebuild:
• Merge
• Partial
• Directory
• Full
The type of rebuild that is required depends on the value of the SYSGEN
parameter LOCKDIRWT and the reason for the rebuild as shown in the
following list:
• Merge — Performed when a node on which LOCKDIRWT = 0 boots
Since a booting node has no locks and the LOCKDIRWT table does not
need to be rebuilt, there is no need to rebuild the lock database. The
new node is simply added to the existing VAXcluster system. This is
what happens when a satellite boots. User locking is not affected in
this case.
• Partial — Done when a node on which LOCKDIRWT = 0 leaves the
VAXcluster system
In this case, the LOCKDIRWT table does not need to be rebuilt but
the departing node was probably participating in the lock activity.
A new resource manager is found for resource trees that were being
managed by the failing node. Locks held by processes on the failing
node are released and those locks are reevaluated to see if waiting
locks can be granted. If the node was shut down normally, its locks
were remastered before the node was removed from the system. This
is typically what happens when a satellite is shut down. User locking
activity is stalled while this takes place.
• Directory — Performed when a node on which LOCKDIRWT is not 0 leaves or boots
In this case, the LOCKDIRWT table needs to be rebuilt. This
necessitates redistributing the resource directory across the
VAXcluster nodes.
• Full — No longer performed in VAXcluster configurations running
VMS Version 5.2 or later.
While the lock database is rebuilt, all user lock activity is stalled. This
does not affect Pi Job. It can affect User Edit if the editor needs to acquire
a lock, to write to the journal file, for instance. It is reasonable to assume
that the Database application is making use of locks and will be affected.
There is a second side effect to the rebuild: the rebuild consumes CPU
cycles. This affects all jobs since there is serious competition for the
available processing power. Pi Job, for example, probably does not make
as much progress during the rebuild as it did before.
The duration of this phase is dependent on the following factors:
• Whether the node was shut down or failed — One of the last actions performed as part of a normal shutdown is to find a new home for locks that the departing node mastered. This can significantly reduce the work done during the rebuild.
• The type of rebuild — The more work the rebuild must do, the longer it takes.
• The total number of locks — The more locks there are to scan and possibly move, the longer the rebuild takes.
• The location (node) of the resource manager for the locks — If a failed node was the resource manager for many lock trees, then more work must be done.
• The interconnect — The lock rebuild generates a lot of messages. The speed of the interconnect affects the speed of the rebuild.
• The types of processors in the configuration — Faster processors can rebuild the lock database faster. However, since the rebuild is done in distinct steps, faster processors may have to wait for slower processors.
Table 2–2 lists some approximate times for lock rebuilds. These are only
averages under near-ideal conditions. They should be used only as an
indicator of the delays you might experience.
Table 2–2 Lock Database Rebuild Times

                   Interconnect
Rebuild Type     CI       Ethernet
Merge             0          0
Partial           0          1
Directory         5         10
Quorum Disk Validation
Immediately after the reconfiguration phase, the information in the
quorum file can be invalid. If so, the quorum disk’s votes cannot be
counted for up to 4*QDSKINTERVAL seconds. Starting with VMS Version
5.3, the quorum disk is considered invalid if the departing node did
not issue a last gasp. This is a change from previous versions where
the quorum disk was always considered invalid after a node left the
VAXcluster system.
If the system has quorum, quorum disk validation has no noticeable effect
on applications. At this point, the significance of quorum disk validation is
that the system is without the quorum disk’s votes and the loss of another
voting member can cause the system to stall for lack of quorum.
If the system is below quorum and no other system recovery is going on,
applications are stalled waiting for quorum to be regained.
System Recovery
As the system recovers, there is some residual activity that is a mixture of
the system’s recovery and user activity.
• During the transition, locking activity may be blocked. Once locking is allowed again, there is a backlog of lock requests that require attention.
• There are operator communication process (OPCOM) messages to print on the consoles.
• There is probably DECnet activity resulting from the loss of the node.
• If a node boots, it is still busy with its startup procedures. If the booting node is a satellite, it might put a larger than normal load on its disk server.
• Finally, it is at this point that the next phase, application recovery, begins.
Application Recovery
The VMS operating system has no hand in application recovery, but it
must be included when you assess the impact of a transition at your
site. This impact includes replaying a journal file, cleaning up recovery
units, and users logging in again after the terminal server fails them over
to another processor.
Disk Rebuild
If a node fails or is shut down without dismounting the disks, the disks
are marked as improperly dismounted and they need to be rebuilt. Disks
are usually rebuilt with the MOUNT command during system startup.
Because the rebuild operation allocates critical disk locks, any application
that accesses the disk from remaining nodes may be stalled until the
rebuild is finished and the locks are released.
This rebuild can be deferred. There is no disk data integrity loss. The
reason for the rebuild is to reclaim space and disk quota information that
was cached by the failed node, but never returned as would happen in a
normal dismount.
The rebuild can be prevented at boot time by using the /NOREBUILD
qualifier on the MOUNT commands in the system startup files. For
system disks, which are implicitly mounted, the ACP_REBLDSYSD
SYSGEN parameter controls the rebuild behavior. Setting ACP_
REBLDSYSD to 0 prevents the system disk from being rebuilt at boot
time.
At a more convenient time, for example, 2:00 A.M., you can run a batch
job that issues a SET VOLUME/REBUILD command on all disks. It does
not hurt to issue this command for disks that do not need rebuilding, so
this job can be scheduled to run every night. This command needs to be
issued only once for each disk in a VAXcluster system.
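The following is a minimal sketch of such a batch job. The device names DUA1, DUA2, and DUA3 are placeholders; substitute the disks in your configuration and submit the procedure to a batch queue to run at a convenient time.

$ ! NIGHTLY_REBUILD.COM - rebuild volumes after abnormal dismounts
$ DISKS = "DUA1:,DUA2:,DUA3:"           ! placeholder device names
$ I = 0
$LOOP:
$ DISK = F$ELEMENT(I, ",", DISKS)
$ IF DISK .EQS. "," THEN GOTO DONE      ! F$ELEMENT returns "," past the end of the list
$ SET VOLUME/REBUILD 'DISK'             ! harmless on disks that do not need rebuilding
$ I = I + 1
$ GOTO LOOP
$DONE:
$ EXIT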
Summary
Table 2–3 summarizes our discussion. This table assumes that quorum is
not lost and that disk rebuilds were deferred. Application recovery times
are not included.
Table 2–3 Transition Timing Summary

                                        Interconnect1
Phase                                  CI       Ethernet
Failure Detection (TIMVCFAIL)
  1600 (default)                       16       8 to 9
  100 (minimum)                         1       1
  Typical2                              0       0
Repair Attempt (RECNXINTERVAL)
  16 (default)                         16       16
  1 (minimum)                           1       1
  Typical2                              0       0
Lock Rebuild                           0 to 5   0 to 10

1 For mixed-interconnect VAXcluster systems, use the Ethernet column.
2 Shutdown or boot.
Notice that you can total the values in a column and have a result of zero
for transition time (satellites coming and going).
The following list summarizes how you can control the timing of these
phases.
• Detection phase — TIMVCFAIL can be reduced. This may introduce instability. If connections start breaking and forming, raise the values.
• Repair attempt — The value of RECNXINTERVAL can be reduced. This can also increase instability. You can experiment to find the correct values for your configuration.
• Lock rebuild
  — Avoid simply halting nodes; use SHUTDOWN whenever possible.
  — Select an appropriate value for LOCKDIRWT.
  — It is usually not practical to control the number of locks used by applications.
• Disk rebuild — Defer disk rebuilds.
• Quorum disk — Eliminate the quorum disk from the configuration as soon as practical. VAXcluster systems with three or more voting members do not require a quorum disk.
• Application — Use small run units to minimize rollbacks.
• Additional Suggestions — Although transitions generally go unnoticed, some high performance, realtime applications, such as data collection, cannot tolerate these minimal delays. In these situations, there are some additional ways to eliminate the effects of transitions.
  — Front End — As one alternative, you can use a front end processor that is not a VAXcluster member to act as a buffer and ride out the transition. This processor uses DECnet to connect to the VAXcluster members.
  — High IPL — State transition processing takes place at IPL 8. Privileged code running at a higher IPL is not affected by transition activity. However, writing privileged code is complex and must be implemented carefully.
Building Large VAXcluster Configurations
Keith Parris
VMS Engineering
Digital Equipment Corporation
Introduction
This article explores the following issues involved in configuring a large
VAXcluster system:
• What is a large VAXcluster configuration
• Reasons to build a large VAXcluster system
• Recent changes that make building larger VAXcluster systems easier
• Potential problem areas and solutions: an in-depth look at areas that require special care
What Is a Large VAXcluster Configuration?
The VAXcluster Software Product Description (SPD) for VMS Version 5.5
contains some specific limits on VAXcluster size:
• The maximum number of CPUs that is supported in a VAXcluster configuration is 96.
• The maximum number of CPUs that can be connected to a Star Coupler is 16, regardless of the Star Coupler size.
• The maximum number of VAXcluster members that can be directly connected to the FDDI, through the DEC FDDIcontroller 400 (DEMFA) controller, is 16.
Other possible definitions of a large configuration involve a large number
of VAX units of processing (VUPs), the number of Star Couplers, or the
geographical extent of the configuration.
For the purposes of this article, however, large VAXcluster configurations
are defined as systems that have a large number of nodes. Note that
some of the principles that are discussed are valid for smaller VAXcluster
configurations, as well.
Why Build a Large VAXcluster Configuration?
There are two main reasons for wanting to build very large VAXcluster
systems:
•
Shared access to resources when users need the same environment, as
in the following examples:
— Editing a data file on a workstation, then running a job on a vector
processor in the VAXcluster system or through a VAX-to-Cray
gateway to do calculations and displaying the data in graphics on
the workstation
— Shared read-write access to common application files, with locking
down to the record level
•
Easier system management, for example:
— Performing tasks, such as software installation, only once, rather
than having to repeat them for multiple VAXcluster or standalone
systems
— Adding a node to an existing VAXcluster system, rather than going
through the trouble of setting up a standalone system
Enhancements to Large VAXcluster Configurations
A number of technological advances have been made since the VAXcluster
node limit of 96 was introduced in VMS Version 5.2. The following
advances make building large VAXcluster configurations easier by
providing enhanced performance and availability:
•
Faster CPUs — Faster server CPUs are now available, including the
VAX 4000–500 and VAX 6000–600 systems and the VAX 9000 series.
•
Faster network adapters — Network interfaces with better
performance are available, such as the DEMNA XMI-to-Ethernet
adapter and the second-generation Ethernet chip (SGEC)-based
integral Ethernet adapter that is used in the VAX 4000 systems.
The DEMFA FDDI adapter provides FDDI throughput for XMI-based
systems.
•
Multiple network adapters per system — The bandwidth of a single
adapter is less of a limiting factor.
•
FDDI as a VAXcluster interconnect — With the 100 megabit-per-second speed of FDDI, local area network bandwidth is less likely to
be a problem in a VAXcluster configuration.
•
DSSI for 4000 and XMI systems — Fast DSSI adapters are available
for VAX 4000 and XMI-based systems, so these systems can support
I/O rates previously attainable only with HSC subsystems.
•
Multiple CI and DSSI adapters per system — Systems have more
bandwidth available for disk serving.
•
MSCP server load balancing — The disk serving load is spread evenly
across the available disk servers at the time of a disk mount or mount
verification operation in what is called static load balancing.
•
Faster disks — Average seek and rotational latency times are
decreasing.
•
Caching — Caching is available for some HSC subsystems, and some
disks have integral caches.
•
Host-based volume shadowing — Disk shadowing is no longer limited
to configurations with CI and HSC hardware, but is available in the
full range of possible configurations.
•
VAXcluster Software — Recent changes reduce the impact of node
addition and removal during state transitions and balance the locking
work load during normal operations.
Configuring a Large VAXcluster System
The following sections discuss factors that you should be aware of when
configuring a large VAXcluster system.
Boot Times
A large VAXcluster system must be carefully configured to ensure
sufficient capacity to boot the nodes in a reasonable amount of time.
The following factors affect this capacity and the boot time.
•
System disk throughput
According to the results of a study, which appeared in the VMS 5.2
Release Notes, it takes about 4,200 I/O operations to the system disk
to boot a satellite node. A typical disk can handle approximately 50
I/Os per second. Therefore, each satellite takes a minute and a half of
attention from the system disk.
Consider what happens when a hundred nodes need to boot at the same time:
at roughly 84 seconds of system disk attention per satellite, the satellites
alone represent more than two hours of serialized system disk I/O. The
resulting bottleneck can make rebooting take several hours.
•
Network bandwidth
An Ethernet segment can handle a finite amount of traffic, and
individual Ethernet segments and adapters can become saturated
during a mass boot.
•
Avoiding reboots
One way to minimize the problem of boot times is to avoid rebooting.
— Power protection
You can avoid reboots caused by power failures by protecting server
nodes or satellites with an Uninterruptible Power System (UPS).
For some situations, it is worth having a UPS system for every
workstation. It adds about 5 percent to the cost of the workstation
and can easily pay for itself in the amount of work time it saves.
— Careful configuration
As an example of how important a carefully configured system is,
consider a typical VAXcluster configuration that uses multiple
interconnects and has a few large CPU nodes as DECnet
Maintenance Operation Protocol (MOP) and disk servers. These
CPUs are connected to a Star Coupler and have votes. There are
also a number of satellites, none of which have votes.
If one of the large nodes in this VAXcluster system has a single
network adapter and that network adapter fails, the VAXcluster
system must reconfigure itself so that surviving members can
communicate directly with all the other VAXcluster members.
The algorithm that selects an optimal subset of surviving nodes
assigns a Figure of Merit (FOM) to each potential subset.
This algorithm gives preference to a subset of nodes with a higher
total number of votes over a subset with a larger number of nodes.
Therefore, the node that has a vote, but not a network connection,
wins out over all the satellites. As a result, the satellites do a
CLUEXIT bugcheck and leave the VAXcluster system.
To avoid this problem, give one vote to a single node that is
attached to the local area network (LAN), but not to the CI or DSSI
busses of the big nodes. This way, if the only network adapter on
one of the big nodes fails, it is that node that does the CLUEXIT
bugcheck and leaves the VAXcluster system, rather than all the
satellites.
Another solution is to provide redundant network adapters on the
server nodes.
For more information, consult the Guidelines for VAXcluster System
Configurations (Part Number EK–VAXCS–CG–005). This book contains
data on the performance, throughput, and capacity of the various pieces of
hardware that can be part of a VAXcluster configuration.
System Disk Throughput
To achieve enough system disk throughput requires some combination of
the following techniques:
•
Offload work from the system disk
Remove system files, such as SYSUAF, RIGHTSLIST, and queue files;
page and swap files; and layered products from the system disk.
Moving these files from the system disk to a separate disk eliminates
most of the write activity to the system disk. This raises the read/write
ratio and maximizes the performance of volume shadowing on
the system disk.
•
Avoid disk rebuilds at boot time
The VMS file system maintains a cache of preallocated file headers
and disk blocks. When a disk is not properly dismounted, as happens
when a system fails, this preallocated space becomes temporarily
unavailable. When the disk is mounted again, VMS scans the disk to
recover that space. This is called a disk rebuild.
User response times can be degraded during a disk rebuild operation
because most I/O activity on that disk is blocked. Avoid a disk rebuild
during prime time. It is particularly undesirable to have a small
satellite node do the rebuild, yet this is what would normally happen
if a satellite is the first to reboot after it or another node crashes.
To minimize the impact of disk rebuilds at boot time, make the
following changes:
— ACP_REBLDSYSD = 0 for system disk — Set the SYSGEN
parameter ACP_REBLDSYSD to 0, at least for the satellite nodes.
This prevents a rebuild operation on the system disk when it is
mounted implicitly by VMS early in the boot process.
— MOUNT/NOREBUILD for all user disks — Startup procedures
that mount user disks should include the /NOREBUILD qualifier
on MOUNT commands, at least on the satellite nodes.
— SET VOLUME/REBUILD during nonprime time — Once the
system is running, you can run a batch job or a command procedure
to do a SET VOLUME/REBUILD command for each disk drive.
Because the SET VOLUME/REBUILD command determines whether a
rebuild is needed, the job can execute the command for every disk.
Run this job during off hours, preferably on one of the more powerful
nodes; a sketch of such a procedure follows this list.
In large VAXcluster systems, a lot of disk space can be preallocated
to caches, and, if many CPUs abruptly leave (during a power
failure, for example), this space becomes temporarily unavailable.
If you usually run with very full disks, you might not want to
disable rebuilds on the big server nodes at boot time.
•
Multiple system disk spindles — Using more than one disk spindle for
the system disk provides additional actuators to handle the I/O load.
This can be done using the following techniques:
— Volume shadowing can be used to minimize the number of separate
system disks that must be maintained.
— Multiple system disks
There is a limit on the number of spindles that can be combined
into a single virtual unit using volume shadowing. Thus, you may
need to create separate system disks (or shadow sets) for different
groups of nodes.
Because system management work load increases in direct proportion
to the number of separate system disks that must be maintained, add
only as many system disks as are needed to provide the
required level of performance. One way is to create a system disk
(or shadow set) with roots for all VAXcluster nodes. Use this as
the master copy, and perform all software upgrades on this system
disk. For the cloned system disks, you can make a backup of the
master to the other disks, then change the volume names so they
are unique. If you have not moved system files off the system disk,
you must have SYLOGICALS.COM point to system files on the
master system disk.
Before an upgrade, be sure to save any changes you need from the
cloned disks since the last upgrade, such as MODPARAMS.DAT
and AUTOGEN feedback data, accounting files for billing, and
password history.
•
Use caching in the HSC subsystem or in RF or RZ disks to improve
the effective system disk throughput.
•
Add a solid-state disk, such as an ESE20 or ESE50, to your
configuration. These devices have lower latencies and can handle
a higher request rate than a regular magnetic disk. A solid-state
disk can be used as a system disk or to hold hot system files. You
can use VAX Software Performance Monitor (SPM), VAX Performance
Advisor (VPA), or DECperformance Solution (DECps) to identify hot
files during a mass boot.
•
Use DECram software to create RAMdisks on MOP servers to hold
copies of selected hot read-only files to improve boot times. A RAMdisk
is an area of main memory within a system that is set aside to store
data, but is accessed as if it were a disk.
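The following command procedure is a minimal sketch of the off-hours rebuild
job referenced above. The file name and device names are placeholders;
substitute your own disk list and submit the procedure to a batch queue on one
of the larger nodes.

$ ! REBUILD_DISKS.COM -- example only
$ SET NOON                                   ! continue past errors on individual disks
$ DISK_LIST = "$1$DUA1:,$1$DUA2:,$1$DUA3:"   ! placeholder device names
$ COUNT = 0
$LOOP:
$   DISK = F$ELEMENT(COUNT, ",", DISK_LIST)
$   IF DISK .EQS. "," THEN GOTO DONE
$   SET VOLUME/REBUILD 'DISK'                ! rebuilds only if the volume needs it
$   COUNT = COUNT + 1
$   GOTO LOOP
$DONE:
$ EXIT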
Network Bandwidth
A single Ethernet is unlikely to have sufficient bandwidth to meet the
needs of a large VAXcluster system. Likewise, a single Ethernet adapter
can become a bottleneck, especially for a disk server. A large VAXcluster
configuration requires careful planning to provide satisfactory LAN
performance.
Sufficient network bandwidth can be provided using some of these
techniques:
•
Fast network adapters
•
Multiple Ethernet segments
Divide the network into multiple segments, using bridges to segregate
the work load. For example, a bridge can isolate traffic so a disk
server’s network adapter and the satellites it serves are on a private
network segment.
•
Multiple network adapters on MOP and disk servers
•
FDDI network backbone
Use FDDI as a network backbone, with 10/100 bridges to connect to
Ethernet segments.
•
FDDI adapters for direct connection to the FDDI
•
Fast CPUs for MOP and disk servers
Note that a multiprocessor system provides no more disk serving
capacity than the uniprocessor version of the same CPU, because
MSCP serving work is done on the interrupt stack on the primary
processor.
•
Multiple server nodes
Configuring multiple server nodes allows CPUs to share the disk
serving load and provides better availability than that of a single disk
server node.
•
Throttling demand
Various techniques are also available to control or throttle the demand
generated by rebooting all the satellites at once.
Controlling Satellite Booting
The following list provides some ways you can control the satellite boot
process:
•
Use DECbootsync
A tool such as DECbootsync controls the number of workstations
starting up simultaneously. This uses the Distributed Lock Manager to
control the number of satellites allowed to perform startup command
procedures at once. This does not control the booting of VMS, but does
control the execution of startup command procedures and the setup
work for layered products.
•
Disable MOP service on MOP servers
You can temporarily disable boot requests by setting the DECnet
circuit to a service disabled state until the MOP server can complete
its own startup operations, shadow copies can complete, and so on.
This does not prevent the satellites from requesting a boot, but it does
prevent the MOP server from servicing them.
If a satellite is requesting a boot and gets no response, it makes a
request less often as time passes, so it can take longer than normal to
get the satellite up again once MOP service is reenabled.
•
Disable MOP service for individual satellites
You can disable boot requests on a per-node basis by temporarily
clearing a node’s information from the DECnet database on the MOP
server using Network Control Program (NCP), and resetting it to
reenable nodes as desired to control booting. Again, this does not
prevent the satellites from requesting boot service.
•
Bring satellites to console prompt on shutdown
VAX systems with model numbers in the 2000, 3100, and 4000 series
can be set up so they halt upon restoration of power or execution of a
HALT instruction, rather than rebooting at once.
A program can be used to perform a HALT instruction and cause
the satellite to enter console mode. This prevents the satellite
from immediately trying to reboot and allows more control over the
rebooting process.
This is also important for those systems that support remote triggering
of a boot, because they only allow a remote trigger operation to occur
while the system is in console mode.
•
Boot satellites using MOP TRIGGER
VAX systems with model numbers in the 3100 series and VAXstation
4000 systems can be set up so that a boot can be initiated remotely
using a DECnet TRIGGER operation. For these systems, you can
create a command procedure to trigger satellites to boot in a controlled
fashion.
The systems only respond to a trigger while they are in console mode.
Controlling MOP Service on a MOP Server
To disable MOP service during startup of a MOP server use the following
commands:
$ MCR NCP DEFINE CIRCUIT circuit SERVICE DISABLED
$ @SYS$MANAGER:STARTNET
$ MCR NCP DEFINE CIRCUIT circuit SERVICE ENABLED
In this example, circuit represents your MOP service circuit name, such
as MNA-0 or BNA-0. At this point, service is still disabled in the DECnet
volatile database.
Do not disable this permanently (in the DECnet permanent database),
because CLUSTER_CONFIG relies on MOP service enabled on a node
that is supposed to be a MOP server.
To reenable MOP service later, use the following commands:
$ MCR NCP
NCP> SET CIRCUIT circuit STATE OFF
NCP> SET CIRCUIT circuit SERVICE ENABLED
NCP> SET CIRCUIT circuit STATE ON
Do this with a command procedure, so that it is done quickly and DECnet
service to the users is not disrupted.
Controlling MOP Service on a Per-Node Basis
To disable MOP service for a given node, use the following commands:
$ MCR NCP
NCP> CLEAR NODE satellite HARDWARE ADDRESS
Here, satellite represents the DECnet node name of the satellite node for
which MOP service is to be disabled.
To reenable MOP service for that node later, use the following commands:
$ MCR NCP
NCP> SET NODE satellite ALL
Bringing Satellites to Console Prompt
Using a VAXcluster Console System (VCS) connection is the easy way to
bring a system to the console prompt. You can also press the HALT button
at the workstation. If these methods are impractical, use the following
instructions.
To set up a satellite so that it stops in console mode when a HALT
instruction is executed, do the following:
For VAX systems in the 3100 and 4000 series:
>>> SET HALT 3
For VAX systems in the 2000 series:
>>> TEST 53
2 ? >>> 3
The default action setting of 2 means to reboot on a HALT or when power
returns. You only need to do this once on each system, since the setting is
saved in nonvolatile random access memory (NVRAM).
When you want the satellite to halt after a shutdown, set things up
in NCP so a reboot loads an image that does a HALT instruction. The
READ_ADDR.SYS program, which is normally used to find the Ethernet
address of a satellite node, does a HALT instruction when it runs, and is
readily available.
To halt the satellite on the next reboot attempt, use the following
commands:
$ MCR NCP
NCP> CLEAR NODE satellite LOAD ASSIST PARAMETER
NCP> CLEAR NODE satellite LOAD ASSIST AGENT
NCP> SET NODE satellite LOAD FILE MOM$LOAD:READ_ADDR.SYS
Shut the satellite down normally, but with an immediate reboot specified,
by entering the following:
$ MCR SYSMAN
SYSMAN> SET ENVIRONMENT/NODE=satellite
SYSMAN> DO @SYS$UPDATE:AUTOGEN REBOOT
When the satellite is rebooted, the READ_ADDR.SYS program is loaded,
it prints out the Ethernet address, and the system halts in console mode.
When you want the satellite to boot normally, fix the node information in
NCP so that the satellite loads VMS, rather than the READ_ADDR.SYS
program, when it is booted.
To allow the satellite to reboot normally, use the following commands:
$ MCR NCP
NCP> CLEAR NODE satellite LOAD FILE
NCP> SET NODE satellite ALL
Using the DECnet MOP Trigger Facility to Boot Satellites
The console firmware in some systems (VAX 3100 models and VAXstation
4000 systems) allows you to trigger them remotely to boot upon demand.
You must turn on this capability before you can use it. Use the following
commands at the console to turn this on.
Enable the MOP listener.
>>> SET MOP 1
Enable remote triggering.
>>> SET TRIGGER 1
Set the password that validates a remote trigger request. The system
prompts you twice for a 16-digit hexadecimal password string.
>>> SET PSWD
Like the HALT setting, you only need to do this once on each system,
since the settings are saved in NVRAM.
The NCP utility is used to initiate a boot of the satellite node by entering
the following:
$ MCR NCP
NCP> TRIGGER NODE satellite VIA circuit SERVICE PASSWORD xxxxxxxxxxxxxxxx
Here circuit represents your MOP service circuit, and xxxxxxxxxxxxxxxx
is the 16-digit hexadecimal number specified when you did the >>> SET
PSWD command at the satellite node’s console.
The MOP server can be set to automatically run a command procedure
and trigger 5 or 10 satellites at a time to stagger the boot-time work load,
if desired. This can be done in priority order (that is, your workstation
first, your manager’s next, and so on).
Note that when the power-on/halt switch is set as indicated previously, a
power failure at the satellite results in it stopping at the console prompt,
instead of automatically rebooting when power is restored. For a mass
power failure, this is good, because it prevents server overloading.
However, if someone trips over the power cord for a single satellite, it
does not reboot automatically.
A mechanism can be provided to scan for and trigger a reboot of satellites
that go down inadvertently. This can be a batch job that runs periodically
and uses the DCL lexical function F$GETSYI to check each node that
should be in the VAXcluster system. To do this, it uses the CLUSTER_
MEMBER item code, and issues an NCP TRIGGER command for any
satellite that it finds is not currently a member of the VAXcluster system.
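The following command procedure is a minimal sketch of such a scan. The
satellite names, circuit name, and service password are placeholders, and the
procedure assumes the satellites were set up for remote triggering as
described earlier.

$ ! CHECK_SATELLITES.COM -- example only; run periodically from a batch queue
$ SATELLITES = "SAT001,SAT002,SAT003"    ! placeholder node names
$ CIRCUIT = "MNA-0"                      ! placeholder MOP service circuit
$ PSWD = "0123456789ABCDEF"              ! placeholder 16-digit service password
$ I = 0
$LOOP:
$   NODE = F$ELEMENT(I, ",", SATELLITES)
$   IF NODE .EQS. "," THEN GOTO DONE
$   IF .NOT. F$GETSYI("CLUSTER_MEMBER", NODE)
$   THEN
$       WRITE SYS$OUTPUT "Triggering reboot of ''NODE'"
$       MCR NCP TRIGGER NODE 'NODE' VIA 'CIRCUIT' SERVICE PASSWORD 'PSWD'
$   ENDIF
$   I = I + 1
$   GOTO LOOP
$DONE:
$ EXIT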
System Parameters
As the VAXcluster system grows, some data structures within VMS must
also grow to accommodate the larger number of nodes. If this is not
possible, because of a shortage of nonpaged pool, for example, it may
induce intermittent problems that are difficult to diagnose.
Digital recommends that AUTOGEN with FEEDBACK be run periodically
on all nodes, so that it can make the appropriate adjustments as the node
count increases. Repeating this every time 10 nodes are added works well.
There are a few parameters that the AUTOGEN Utility does not currently
handle well in large VAXcluster configurations. Instructions for manually
checking these parameter values are included in the next three sections.
SCSCONNCNT
The SYSGEN parameter SCSCONNCNT controls the number of
Connection Descriptor Table (CDT) entries allocated at boot time.
These are used for the different connections a VAXcluster node makes
to other VAXcluster nodes for tasks such as general VAXcluster system
coordination and disk and tape MSCP serving work.
The default value of SCSCONNCNT is 40. An additional 200 entries
are allocated in the Connection Description List (CDL) by VMS to avoid
running out. Once the initial CDTs are used, up to 200 more can be
created.
The default value typically becomes insufficient when the VAXcluster
configuration grows to between 50 to 70 nodes. A shortage is most likely
to occur on VAXcluster nodes that are disk servers or tape servers, or
both.
Symptoms of a shortage are nodes that are unable to join the VAXcluster
system and nodes that see disks or tapes served from some of the servers,
but not from others.
The System Dump Analyzer Utility (SDA) can be used to check the
number of CDT entries in use on a given VAXcluster node.
$ ANALYZE/SYSTEM
VAX/VMS System analyzer
SDA> SHOW CONNECTIONS
                        --- CDT Summary Page ---

CDT Address   Local Process       Connection ID   State    Remote Node
-----------   -------------       -------------   -----    -----------
 807C1BB0     SCS$DIRECTORY        CA840000       listen
 807C1D10     MSCP$TAPE            CA840001       listen
 807C1E70     MSCP$DISK            CA840002       listen
 807C1FD0     VMS$VAXcluster       CA840003       listen
 807C2130     SCA$TRANSPORT        CA840004       listen
 807C2290     VMS$DISK_CL_DRVR     CA880005       open     VAX1
 807C23F0     VMS$DISK_CL_DRVR     CAA00006       open     VAX2
 807C2550     VMS$VAXcluster       CA840007       open     VAX1
 807C2810     VMS$VAXcluster       CA860009       open     ALPHA1

Number of free CDTs: 31
To determine the number of connections in use, count the lines of output,
to see if the total is close to or over the value of SCSCONNCNT. Note
that the Number of free CDTs: line does not include any extra slots VMS
allocated as a cushion, unless they were used and freed over time, so it is
not very useful in determining if you are close to running out.
If the value of SCSCONNCNT is insufficient, add a line such as the
following to MODPARAMS.DAT, specifying a value appropriate for your
configuration, then run AUTOGEN, and reboot:
MIN_SCSCONNCNT = 300
The number you enter here should be the current usage for that node, or
slightly higher if you anticipate growth. However, for each unit that you
increase SCSCONNCNT, 352 bytes of memory are allocated at boot time.
(Remember that the 200-slot cushion is there if you need it and can be
used if you are not short of nonpaged pool.)
SCSRESPCNT
SCSRESPCNT controls the number of Request Descriptor Table (RDT)
entries, which keep track of requests made to other nodes that have not
yet received a response. The default value may be insufficient once a
VAXcluster configuration approaches the 96-node limit.
A shortage of entries affects performance, since requests wait for a free
RDT. To see if SCSRESPCNT is insufficient, check each system for
requests that waited because there were not enough free RDTs. The
SDA utility may be used to check this.
$ ANALYZE/SYSTEM
SDA> READ SYS$SYSTEM:SCSDEF
SDA> EXAM @SCS$GL_RDT + RDT$L_QRDT_CNT
810610E4: 00000000
"...."
If this number is nonzero and continues to increase during normal
operations, you may want to increase the value of SCSRESPCNT. Each
unit that you increase SCSRESPCNT takes only 8 bytes of memory, so
you can afford to be generous.
SCSBUFFCNT
SCSBUFFCNT controls the number of Buffer Descriptor Table (BDT)
entries, which describe data buffers used in block data transfers between
nodes. A shortage of SCSBUFFCNT is unlikely to occur in VAXcluster
systems at or below the 96-node limit. If it does occur, it would be most
likely to occur on nodes doing lots of MSCP serving. Again, a shortage of
entries affects performance. To determine if SCSBUFFCNT is insufficient,
check each system to see if it waited for BDT entries. The SDA utility
may be used to check this, or you can use the Show Cluster Utility
(SHOW CLUSTER):
$ ANALYZE/SYSTEM
SDA> READ SYS$SYSTEM:SCSDEF
SDA> EXAM @SCS$GL_BDT + CIBDT$L_QBDT_CNT
81092E2C: 00000000
"...."
SDA> EXIT
$ SHOW CLUSTER/CONTINUOUS
Command > ADD BDT_WAITS
View of Cluster from system ID 4321 node: VAX1
+--------------------+---------+-----------+
|      SYSTEMS       | MEMBERS | COUNTERS  |
+--------+-----------+---------+-----------+
|  NODE  | SOFTWARE  | STATUS  | BDT_WAITS |
+--------+-----------+---------+-----------+
| VAX1   | VMS V5.5  | MEMBER  |           |
| HSC1   | HSC V601  |         |         0 |
| VAX2   | VMS V5.5  |         |         0 |
| VAX3   | VMS V5.5  |         |         0 |
+--------+-----------+---------+-----------+
If you find a count that tends to increase over time under normal
operations, consider raising SCSBUFFCNT. For each unit that you raise
SCSBUFFCNT, only 16 bytes of memory are allocated, so overestimating
this is not a problem.
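If either count does climb, the floor values can be raised through
MODPARAMS.DAT, in the same way as SCSCONNCNT. The values below are
placeholders; size them to what you observe, then run AUTOGEN and reboot.

MIN_SCSRESPCNT = 1200     ! example value only
MIN_SCSBUFFCNT = 512      ! example value only; inexpensive to overestimate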
Network Problems
Network instability can seriously affect VAXcluster operation. The
following ideas can help minimize network problems:
•
Adjust the RECNXINTERVAL parameter
The SYSGEN parameter RECNXINTERVAL specifies the number of
seconds the VAXcluster system waits when it loses contact with a
node, before it removes that node from the configuration. Many large
VAXcluster configurations are operated with the SYSGEN parameter
RECNXINTERVAL raised to 40 seconds, compared with the default of
20 seconds (see the example following this list). This allows the VAXcluster
system to ride through most network problems while they are being investigated.
Raising RECNXINTERVAL can sometimes result in longer perceived
application hangs when a node leaves the VAXcluster system
abnormally while holding a lock on a shared resource. The longer
hangs occur because the Connection Manager software waits for
RECNXINTERVAL seconds before removing the departed node from
the VAXcluster configuration and freeing up any locks that it held.
•
Protect the network
The LAN must be treated as if it is a part of the VAXcluster system.
An environment in which a random user can disconnect a ThinWire
segment to attach a new PC, hanging 20 satellites in the process, is not a
rewarding one.
•
Adopt a divide-and-conquer strategy
Dividing the network into multiple segments, connected by bridges,
and putting barrel connectors at midpoints of Ethernet segments can
make troubleshooting much easier.
•
Choose your hardware and configuration carefully
Certain hardware is not suitable for use in a large VAXcluster system.
Some network components can appear to work well with light loads,
but be unable to operate properly under high traffic conditions.
Improper operation can result in lost or garbled packets which
require packet retransmissions. This reduces performance and can
affect the stability of the VAXcluster configuration. Beware of bridges
that cannot filter and forward at full line rates, or repeaters that
do not handle congested conditions well. Digital network hardware
is carefully tested to ensure that it will operate correctly in large
VAXcluster configurations.
•
Use the LAVC$FAILURE_ANALYSIS facility
Multiple-adapter support added to the VAXcluster software includes new
capabilities that can assist in isolating network faults. See the VMS
VAXcluster Manual for VMS Version 5.5 for information about the
LAVC$FAILURE_ANALYSIS facility, which was introduced with VMS
Version 5.4–3.
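As referenced in the first item of this list, the longer reconnection interval
can be set in MODPARAMS.DAT and applied with AUTOGEN; the value shown is the
commonly used figure mentioned above, not a universal recommendation.

RECNXINTERVAL = 40     ! seconds; helps ride through transient LAN problems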
Hot system files
In addition to the system disk throughput issues during a mass boot,
which were discussed earlier, you must pay attention to the latency of
access to system files during steady-state operations. It can affect the
response time for various user activities, such as logging in, starting up
an application, or issuing a PRINT command.
With many users, the following system files can be quite busy:
•
SYSUAF.DAT, RIGHTSLIST.DAT, and so forth
•
JBCSYSQUE.DAT (before VMS Version 5.5)
Job controller and queue manager performance can be a serious
issue for large configurations if you are running VMS Version 5.4
or earlier versions. It may be necessary to place the system queue
file JBCSYSQUE.DAT on a solid-state disk to achieve satisfactory
response times for queue operations. The Version 5.5 queue manager
solved these performance problems.
•
Application images
•
Any other hot files VPA, SPM, or DECps point out
Once you identify which files are of concern, you can apply one of several
techniques available to provide the required level of performance in
accessing these files:
•
Solid-state disks
Consider moving hot files to a solid-state disk, such as the ESE20 or
ESE50.
•
DECram for read-only files and images
Putting hot read-only files, such as commonly used executable images
and command procedures, into DECram can help.
•
Caching
All the standard disk I/O techniques help here, such as:
— HSC cache
— Integral caches in RF and RZ disks
— RMS global buffers
— The INSTALL Utility to add frequently used images to the known-file
list, using the /OPEN, /HEADER_RESIDENT, and /SHARED qualifiers
(see the example following this list)
•
Volume shadowing
Volume shadowing improves availability, but because it provides you
with multiple spindles, it also helps performance.
•
Disk striping
You can use disk striping, which is not an option for the system disk
but can be used for hot files that are located on or were moved to other
disks.
If you are already using volume shadowing, adding disk striping may not
yield as much additional gain, because the I/Os are already spread across
multiple spindles.
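As referenced in the caching item above, the following is a minimal INSTALL
sketch; the image name is a placeholder for one of your own frequently used
images.

$ MCR INSTALL
INSTALL> ADD SYS$SYSTEM:HOTAPP.EXE /OPEN /HEADER_RESIDENT /SHARED
INSTALL> EXIT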
MOP/Disk Server and System Disk Availability
Typically, a large number of satellite nodes depend on MOP and disk
servers for their operation. Building redundancy around these servers
helps a large VAXcluster system in the following situations:
•
MOP server failure
Configuring more than one system as a MOP server allows satellites
to boot, despite loss of one MOP server node.
•
Disk server or controller failure
Configuring redundant disk servers, such as dual-host DSSI systems,
or multiple disk servers on CI, avoids disruption of satellite disk
service if a single disk server becomes unavailable.
•
Power disruption
Putting at least the MOP and disk server nodes on a UPS system can
increase availability.
•
Disk failure
Because many satellites typically boot from each system disk, failure of
that disk can disrupt operations for a large number of users. Volume
shadowing can be used in conjunction with VAXsimPLUS to reduce the
chances of an outage caused by disk failure.
Shared Resource Availability
To provide a common environment, most VAXcluster systems use a
common authorization file, common queue file, and so on. Sometimes
a process takes out a lock on one of these common files and then hangs
because of a problem, such as a resource shortage. Other processes
that need to use these common files then hang waiting for a lock, other
processes can hang waiting for them, and so on. The odds of such a
problem occurring increase as the number of nodes increases.
The following tools can help with shared resource availability:
•
DECamds
The DEC Availability Manager for Distributed Systems (DECamds)
includes capabilities that assist in locating and correcting locking and
resource shortage problems.
•
DECterm window on system manager’s workstation for each major
system
Leaving a session logged in to a privileged account on each of the
major nodes is helpful during troubleshooting.
•
DECUS tools
DECUS members have written tools, such as FINDLOCK, to assist
with shared resource availability. These are available through the
DECUS library.
Cluster Alias
The DECnet cluster alias has the following node limits:
•
Support for the DECnet Phase IV cluster alias is currently limited to
64 nodes. At least one system in the VAXcluster system must be a
DECnet full-function, or routing, node.
•
Plans for DECnet Phase V include support for 144 nodes under a
cluster alias and elimination of the requirement that a VAXcluster
member be a routing node.
System Disk Space
The minimal essential files for a satellite root take up very little space,
so that more than a hundred roots can easily fit on a single system disk.
However, if you use separate dump files for each satellite node or put page
and swap files for all the satellites on the system disk, you quickly run out
of disk space.
To avoid running out of disk space, set up common dump files for all the
satellites or for groups of satellite nodes. For debugging purposes, it is
best to have separate dump files for each MOP or disk server. Also, you
can use local disks on satellite nodes to hold page and swap files, instead
of putting them on the system disk. In addition, move page and swap files
for MOP and disk servers off the system disk.
Future Directions
There are no design limitations inherent in VAXcluster technology that
limit its scalability to 96 nodes. VMS Engineering is currently involved
in testing larger VAXcluster configurations, and VAXcluster systems with
well over 100 nodes have been successfully built and operated. The intent
of this study is to stretch the limits of VAXcluster configurations in testing
with the goal of making VAXcluster systems of all sizes work better.
Feedback from this testing has already resulted in enhancements to the
System Management (SYSMAN) and MOUNT utilities, as described in
the following two sections.
MOUNT/CLUSTER Limits
The MOUNT/CLUSTER command has a limit of 96 nodes. If you issue a
MOUNT/CLUSTER command when there are more than 96 nodes in the
VAXcluster system, your node bugchecks. This limit is removed in VMS
Version 5.5–1.
As a workaround, use the SYSMAN Utility or batch jobs to enter a
MOUNT/SYSTEM command on each node.
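For example, a minimal SYSMAN sketch of this workaround; the device name and
volume label are placeholders.

$ MCR SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> DO MOUNT/SYSTEM/NOREBUILD $1$DUA10: USERDISK
SYSMAN> EXIT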
SYSMAN Limits
The SYSMAN utility has a limit of 128 nodes. As a workaround, you can
use a batch job on each node or use network tasks, such as TELL.COM.
SYSMAN executes a command on each of the nodes sequentially. This can
take a long time to complete if there are a lot of nodes in the VAXcluster
system.
System managers with large VAXcluster configurations sometimes start
multiple SYSMAN sessions at once, with each one handling a subset of
the total nodes, to speed up the process. SYSMAN supports the use of
logical names to define groups of nodes.
You can also use batch jobs, which execute in parallel, by setting up a
queue per node.
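The following sketch illustrates both approaches; the node names, logical
name, procedure name, and queue naming convention are placeholders.

$ ! One SYSMAN session per group of nodes (start several in parallel)
$ DEFINE NODE_GROUP_1 SAT001, SAT002, SAT003
$ MCR SYSMAN
SYSMAN> SET ENVIRONMENT/NODE=NODE_GROUP_1
SYSMAN> DO @SYS$MANAGER:WEEKLY_WORK.COM
SYSMAN> EXIT
$ ! Or run the work in parallel through a batch queue on each node
$ SUBMIT /QUEUE=SAT001$BATCH SYS$MANAGER:WEEKLY_WORK.COM
$ SUBMIT /QUEUE=SAT002$BATCH SYS$MANAGER:WEEKLY_WORK.COM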
Conclusions
While building a VAXcluster system with a large number of nodes is not
simple, with careful planning, a large VAXcluster configuration can be
built to provide the following benefits:
•
Support for a large number of workstations and users
•
Full shared access to common files
•
Minimal system management work load
Additional VAXcluster Information
4
P1 Revision Management Level
Kathy Thomas
VAXcluster Systems and Support Engineering
Digital Equipment Corporation
Revision Management Level P1 provides support for:
•
VAX 9000-400 system
•
VAXft 310 system
•
VAX 4000-200 system
•
VAXserver 4000-200 system
•
KFMSA adapter
•
TA91 tape subsystem
•
VMS Version 5.4-2 operating system
It is recommended that you upgrade your VAXcluster system to the latest
revision as soon as practical and over as short a period of time as possible.
Although we attempt to maintain VAXcluster system functionality and
integrity during upgrades, we cannot guarantee it in all cases.
Table 4–1 summarizes the existing revision levels. Table 4–2 details
the applicable versions for individual VAXcluster components within the
revision levels, beginning with Level L1.
Revision levels listed in this document are minimum acceptable revisions
for products to function reliably in a VAXcluster environment with the
current version of the VMS operating system. These revision levels do
not reflect the subsequent revisions from Manufacturing or the latest
engineering change order (ECO) revision from Engineering, unless these
subsequent revisions create a new minimum acceptable revision for
VAXcluster systems.
For information about revision levels or changes for components within
Table 4–2 that were made available after the Quorum print date, contact
your Digital Customer Services representative. For further information on
Revision Management Level P1, contact your Digital Customer Services
representative.
Table 4–1 Summary of Revision Management Levels

Revision Level | Feature
E1 | TA78, VMS Version 4.1
E2 | VAX 8600, update to revisions of other components
E3 | TA81, VMS Version 4.2, CI750 update (D1), CI780 update (E1), HSC50 update (D1)
E4 | VAX 8650, VMS Version 4.3, HSC70
F1 | VAX 8200, VAX 8300, VAX 8500, VAX 8800, CIBCI, CI750 update (E1), CI780 update (F1), VMS Version 4.4
H1 | VAX 8550, VAX 8700, VAXcluster Console System, Volume Shadowing, HSC50 update (E2), HSC70 update (B2)
J1 | VAX 8250, VAX 8350, VAX 8530, Update HSC50/HSC70 based on Version 3.50 microcode, LO100 E2/LO118 upgrades, SA482, VAX Performance Advisor (VPA), VAX Supercomputer Gateway, VMS Version 4.6, CIBCA VAXBI-to-CI Interface, TA79 Tape System
K1 | SA600 Storage Array, RA70 Disk, HSC Software Version 3.70, VAX 62X0, VAX 88X0, CISCE/CINLE 24-Node Star Coupler, VMS Version 5.0, CIBCA–B VAXBI-to-CI Interface, Local Area VAXcluster Systems
L1 | VMS Version 5.1, VAX 6300, DESQA, MicroVAX 3300/3400, MicroVAX 3800/3900, VAXstation 3100, VAXserver 3100, RA90, SA70, SA650, TA90, HSC40
M1 | VMS Version 5.2, VAX 6000–400, DEBNI, RV20, RV60, RV64, ESE20
N1 | VMS Version 5.4, VAX 9000–200, VAX 6000–500, VAX 4000–300, VAXserver 4000–300, CIXCD, DEMNA, KDM70, TA90E, RA90, RA92
P1 | VMS Version 5.4-2, VAX 9000–400, VAXft 310, VAX 4000–200, VAXserver 4000–200, KFMSA, TA91
P1 Revision Management Level 51
Table 4–2 Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX–11/750 Computer
System
60,80
60,88
6C,8C
60,80
60,88
6C,8C
60,90
60,98
60,9C
60, 90
60, 98
60, 9C
TU58 41 VAX–11/750
Console (BE–T204?–ME)
R
–
–
–
TU58 41 VAX–11/750
Console (BE–FK94?–ME)
–
A
A
A
VAX–11/780 Computer
System
8,8B
Note 2
8,8B
Note 2
8, 8B
Note 2
8, 8B
Note 2
RX1 VAX–11/780 Standard
Console (AS–T213?–ME)
R
R
R
R
RX41 VAX–11/780 Europe
RD Console (AS–T215?–DE)
R
R
R
R
RX4 VAX–11/780 Remote
Console (AS–T216?–DE)
R
R
R
R
VAX–11/785 Computer
System
3,3B
Note 2
3,3B
Note 2
3,3B
Note 2
3, 3B
Note 2
RX1A VAX–11/785 Console
(AS–T793?–ME)
P
P
P
P
VAX 8600 CPU Kernel
(KA86–A)
L
L
L
L
VAX 8650 CPU Kernel
(KA86–B)
D
D
D
D
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
52 P1 Revision Management Level
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX 8600/VAX 8650 Console
with diag MT
(BB–FI16?–DE)
K
K
T
T
VAX 8600/VAX 8650 Console
RL02 (BC–FI17?–ME)
K
K
T
T
VAX 8600/VAX 8650 Console
with diag RL02
(BC–FI18?–DE)
K
K
T
T
VAX 8200 CPU Kernel
(821B)
B
B
B
B
VAX 8250 CPU Kernel
(824B) (825B)
B
B
B
B
VAX 8300 CPU Kernel
(831B)
B
B
B
B
VAX 8350 CPU Kernel
(834B) (835B)
B
B
B
B
VAX 8200/VAX 8250
/VAX 8300/VAX 8350 Console
Flp (BL–FG81?–ME)
J
J
P
S
VAX 8200/VAX 8250
/VAX 8300/VAX 8350
Complete Diag
(1600 BPI MT)
(BB–FG87?–DE)
L
L
V
Y
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
P1 Revision Management Level 53
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX 8200/VAX 8250
/VAX 8300/VAX 8350
Diag Super + Auto
(BL–FG79?–ME)
K
K
U
W
VAX 8200/VAX 8250/VAX
8300/VAX 8350 Util Prog Flp
(BL–FG80?–ME)
L
L
V
Y
VAX 9000–2XX CPU Kernel
(KA920)
–
–
C
E
VAX 9000–4XX CPU Kernel
(KA940)
–
–
–
E
VAX 9000 Console Image
TK5O (AQ-PAKH?–ME)
–
–
A
B
VAX 9000 Utility and
Microcode (AQ–PAKJ?–ME)
–
–
A
B
VAX 9000 Licensed Diag
(AQ–RAKK?–DE)
–
–
A
B
VAX 9000 Field Service SDD
(AQ–PBE9?–AE)
–
–
–
B
VAX 6000–2XX CPU Kernel
(62AMB–Y), (62AMN–Y),
(62AMP–Y)
A
A
A
A
VAX 6200 Complete Diag Set
16MT9 (BB–FK03?–DE)
A
A
K
L
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
54 P1 Revision Management Level
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX 6000–200 Console TK50
(AQ–FJ77?–ME)
A
A
K
L
VAX 6000–200 EE Prom
Patch TK50 (AQ–FJ98?–ME)
A
A
F
F
VAX 6000–3XX CPU Kernel
(63AMB–Y), (63AMP–Y)
A
A
A
A
VAX 6300 Complete Diag Set
16MT9 (BB–FK65?–DE)
A
A
F
F
VAX 6000–300 Console TK50
(AQ–FK60?–ME)
A
A
F
F
VAX 6300 Console Patch
TK50 (AQ–FK97?–ME)
–
–
D
E
VAX 6000–4XX CPU Kernel
(64AMA–Y), (64AMP–Y)
–
A
A
A
VAX 6000–400 Complete
Diag 16MT9 (BB–FK89?–DE)
–
B
E
F
VAX 6000–400 Console TK50
(AQ–FK87?–ME)
–
B
E
F
VAX 6000–400 Complete
Diag TK50 (AQ–FK88?–DE)
–
B
E
F
VAX 6000–400 Console Patch
(AQ–PBD2?–ME)
–
A
B
C
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
P1 Revision Management Level 55
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX 6000-5XX CPU Kernel
(65*MA–X), (65*PA–X)
–
–
A
A
VAX 6000–500 Complete
Diag TK50
(AQ–PDWX?–DE)
–
–
A
D
VAX 6000–500 Console
CDROM
(AI–PDYQ?–BE)
–
–
–
D
VAX 6000-500 Console TK50
(AQ–PDYP?–ME)
–
–
–
D
VAX 6000–500 Complete
Diag CDROM
(AI–PDZZ?–BE)
–
–
A
D
VAX 6000–500
Complete Diag 16MT9
(BB–PDWY?–DE)
–
–
–
D
VAX 8530 CPU Kernel
(851BA–Y)
F
H
H
H
VAX 8550 CPU Kernel
(855BA–Y)
F
H
H
H
VAX 8700/VAX 8810 CPU
Kernel (871BA)
D
E
E
E
VAX 8800/VAX 8820N CPU
Kernel (882BA)
E
F
F
F
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
56 P1 Revision Management Level
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX 8500/VAX 8530
/VAX 8550/VAX 8700/VAX
8800 Console Media
(BT–ZMAAD–C3)
35
37
40
42
VAX 8500/VAX 8530
/VAX 8550/VAX 8700/VAX
8800 Diag Set (ZM920–C3)
35
37
40
42
VAX 8820/VAX 8830/VAX
8840 CPU Kernel (885BA)
C
C
C
D
VAX 8830/VAX 8840/VAX
8820 Console with Diag
TK50 (AQ–FJ79?–DE)
C
C
F
F
VAX 8820/VAX 8830/VAX
8840 Console TK50
(AQ–FJ80?–ME)
C
C
F
F
VAXft 310 CPU Kernel
(52AAA–X), (52BAA–X)
–
–
–
B
VAX–11/750 Adapter to CI
(CI750)
F,H
Note 3
F,H
Note 3
F,J
Note 3
F
Note 5
SBI Adapter to CI (CI780)
F,J
Note 3
F,J
Note 3
F,K
Note 3
H
Note 5
CI780.BIN Microcode
–
–
8.7
8.7
BI Adapter to CI (CIBCI)
with VAX 85XX, VAX 8700,
VAX 8800
B,C
Note 3
B,C
Note 3
B,D
Note 3
B
Note 5
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
P1 Revision Management Level 57
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
BI Adapter to CI (CIBCI)
with VAX 8200, VAX 8300
C,D
Note 3
C,D
Note 3
C,E
Note 3
C
Note 5
VAX BI-to-CI Interface
(CIBCA–A)
D
D
D
D
VAX CIBCA Microcode
Update Flp (B–FJ11?–ME)
H
H
M
M
CIBCA–BIN Microcode
–
–
7.5
7.5
VAX BI-to-CI Interface
(CIBCA-B)
A
A
A
A
VAX CIBCA–BA Microcode
Update Flp (BL–FK14?–ME)
B
B
E
E
CIBCB–BIN Microcode
–
–
5.2
5.2
CISCE Star Coupler
Expander
A
A
A,B
Note 4
A,B
Note 4
CINLE–AA, –AB CI7XX
Upgrade
A
A
A
A
CINLE–BA, –BB HSC CI
Upgrade
B
B
B
B
CIXCD–AA (CI-to-XMI
interface for VAX 9000)
–
–
B
B,D
Note 6
CIXCD–AA CIXCD.BIN
–
–
1.04
2.02,2.03
Note 6
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
58 P1 Revision Management Level
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
CIXCD–AA SHOW
CLUSTER RP_REVIS
–
–
24
42,43
Note 6
CIXCD–AB (CI-to-XMI
Interface for VAX 6000)
–
–
B
B
CIXCD–AB CIXCD.BIN
–
–
1.04
2.02
CIXCD–AB SHOW
CLUSTER RP_REVIS
–
–
24
42
DEBNA NI Interface
F4,H4
F4,H4
F4,H4
F4,H4
DEBNI NI Interface
–
C1
C1
C1
DELUA NI Interface
F1
F1
F1
F1
DEUNA NI Interface
E
E
E
E
DELQA NI Interface
D3,E4
D3,E4
D3,E4
D3,E4
DEQNA NI Interface
L4
K3
K,N
K,N
DESVA NI Interface
A
A
A
A
DESQA NI Interface
B
B
B
B
DEMNA NI Interface
–
–
F
F
DEMNA Microcode
–
–
6.03
6.06
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
P1 Revision Management Level 59
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
MicroVAX II
(630QB/630QE/630QY
/630QZ)
A
A
A
A
MicroVAX 3500/3600
(650QF/650QS)
A
A
A
A
MicroVAX 3300/3400
(640QS)
A
A
A
A
MicroVAX 3800/3900
(655QF/655QS)
A
A
A
A
MicroVAX 3100/VAXserver
3100 (KA41–A)
A
A
A
A
VAX/VAXserver 4000–200
CPU Kernel (660Q)
–
–
–
A
VAX/VAXserver 4000–300
CPU Kernel (670Q)
–
–
A
C
VAX Supercomputer
Gateway (825CC–**)
A
A
C
C
VAXcluster Console System
Upgrade (DJ–630C1–**)
1.2
–
–
–
VAXcluster Console
System 4-node License
(QL–V01A9–PD)
1.2
1.2
1.3
1.3
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
60 P1 Revision Management Level
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
VAX Performance Advisor
(QL–VE5A*-**)
1.2
2.0
2.1
2.1
Note 7
RA60–** — 205 MB DSA
Disk Drive
A6,A7,A8
A8,A9
A8,A9
A8,A9
RA60–** Disk Drive
Reported HV
–
–
1
1
RA60–** Disk Drive
Reported MC
–
–
5
5
RA70–** — 280 MB DSA
Disk Drive
H6,J6
K6
K6
K6
RA70–** Disk Drive
Reported HV
–
–
7
7
RA70–** Disk Drive
Reported MC
–
–
79
79
RA80–** — 121 MB DSA
Disk Drive
Note 1
Note 1
Note 1
Note 1
RA81–** — 456 MB DSA
Disk Drive
Note 1
Note 1
Note 1
Note 1
RA81–** Disk Drive
Reported HV
–
–
8
8
RA81–** Disk Drive
Reported MC
–
–
8
8
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
P1 Revision Management Level 61
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
RA82–** — 622 MB DSA
Disk Drive
Note 1
Note 1
C
C
RA82–** Disk Drive
Reported HV
–
–
2
2
RA82–** Disk Drive
Reported MC
–
–
49
49
RA90–** — 1216 MB DSA
Disk Drive (Long Arm)
70–23899–01
J
J
J
S
RA90–** Disk Drive
Reported HV
–
–
17
18,25
Note 8
RA90–** Disk Drive
Reported MC
–
–
25
26
RA90–** — 1216 MB DSA
Disk Drive (Short Arm)
70–23899–02
–
–
A
B
RA90–** Disk Drive
Reported HV
–
–
49
50
RA90–** Disk Drive
Reported MC
–
–
25
26
RA92 Disk Drive
–
–
A
B
RA92–** Disk Drive
Reported HV
–
–
81
82
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
62 P1 Revision Management Level
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
RA92–** Disk Drive
Reported MC
–
–
25
26
TA78–** Tape Subsystem
C
C
D
D
TA79–** Tape Subsystem
A
A
A
A
TA81–** Tape Subsystem
E
E
E
E
TA90–** Tape Subsystem
A
A
C4
C4
TA90E–** Tape Subsystem
–
–
A
A
TA91–** Tape Subsystem
–
–
–
A
KDM70 SI Disk and
Tape Controller
–
–
A
A
KDM70 Microcode
–
–
2.2
2.4,2.5
Note 9
KFMSA–** Dual DSSI-toXMI adapter
–
–
–
A
KFMSA–** Microcode
–
–
–
3.14
KFMSA–** SHOW
CLUSTER RP_REVIS
–
–
–
D26E
HSC40–** CI-based Disk and
Tape Controller
A1,A2
A1,A2
A1,A2
A1,A2
Notes
1.
2.
No revision level restrictions.
Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type
MS780–E or MS780–H.
3. VAXcluster systems of less than six nodes may use the lower revision indicated. VAXcluster
systems of more than five nodes or greater than 4.5 megabytes per second, use the higher
revision. A CINLE upgrade is required for node numbers greater than 16.
4. B for 24 to 32 nodes.
5. With microcode revision 8.7.
6. Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7. 2.1 plus mandatory update package (MUP).
8. 17 with HDAs above revision M12. 25 with HDAs below revision N12.
9. 2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
P1 Revision Management Level 63
Table 4–2 (Cont.) Component Revision Levels
Revision Level
Description (Part No.)
L1
M1
N1
P1
HSC40/70 Software
(BN–FNAAH–BK)
–
–
5.0A
5.0A
HSC50–** CI-based Disk and
Tape Controller
H8,H9
H8,H9
H8,H9
H8,H9
HSC50 Software
(BE–FNWAF–BK,
BE–FN04F–BK)
–
–
4.0
4.0
HSC70–** CI-based Disk and
Tape Controller
D8,D9
D8,D9
D8,D9
D8,D9
HSC40/70 Software
(BN–FNAAH–BK)
–
–
5.0A
5.0A
RV64 Optical Library
Juke Box
–
A
A
A,B
Note 10
RV60 Optical Drive
–
A
A
B
RV20 Optical Drive
–
B
B
C
ESE20 Electronic Storage
Element Microcode
–
–
Note 1
Note 1
HV
–
–
0
0
MC
–
–
17
17
VMS Operating System
5.1
5.2
5.4
5.4–2
Notes
1.  No revision level restrictions.
2.  Revision Level 8 (for VAX–11/780) or 3 (for VAX–11/785) is acceptable if the memory is not type MS780–E or MS780–H.
3.  VAXcluster systems with fewer than six nodes may use the lower revision indicated. VAXcluster systems with more than five nodes, or with throughput greater than 4.5 megabytes per second, must use the higher revision. A CINLE upgrade is required for node numbers greater than 16.
4.  B for 24 to 32 nodes.
5.  With microcode revision 8.7.
6.  Higher revision needed for the VAX 9000 system with XJA revision D05 and higher.
7.  2.1 plus mandatory update package (MUP).
8.  17 with HDAs above revision M12. 25 with HDAs below revision N12.
9.  2.4 for VAX 6000 systems. 2.5 for VAX 9000 systems and VMS Volume Shadowing Phase II.
10. Revision B for VAX 9000 systems.
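
The KFMSA entry labeled SHOW CLUSTER RP_REVIS in Table 4–2 refers to the remote port revision that the adapter reports in the SHOW CLUSTER display. As a minimal sketch of checking a reported revision from any cluster member, assuming the standard SHOW CLUSTER ADD and EXIT commands (the classes and fields available depend on your VMS version):

$ SHOW CLUSTER/CONTINUOUS
Command> ADD CIRCUITS
Command> ADD RP_REVIS
Command> EXIT

ADD CIRCUITS adds the circuit (remote port) information to the display, and ADD RP_REVIS adds the reported revision column; the value shown for an adapter can then be compared with the appropriate column of the table.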
VAXcluster Customer Configuration Database (VCCD) Questionnaire
The Digital VAXcluster Group maintains an online database of VAXcluster customer configuration data. The purpose of this database is to gather a high-quality statistical sampling of installations. The information will help in identifying progress trends, forecasting future product needs, and serving our customers better.

Because frequent changes occur at a customer site, it is sometimes difficult to capture these changes in a timely manner. For this database to meet its goals, the information must be accurate and current. The best source of this information is you, the customer.

We encourage you to participate in this update process by completing the attached questionnaire. If you have more than one VAXcluster system, please photocopy the form and complete it for each VAXcluster system.
Customer Name ______________________________   Division ______________________
Street ________________________________________________________________________
City ______________________   State/Province __________   Zip __________   Country __________
Phone ( ______ ) ______ - __________

                  Name                        Title                       Phone
MIS Manager       ________________________    ________________________    ( ______ ) __________
Other Contact     ________________________    ________________________    ( ______ ) __________
Number of CI-VAXcluster systems at this site: ______
If there is more than one, please copy the remainder of this questionnaire and complete it for each VAXcluster system.

Security of Customer Information (check the appropriate box):
  [ ] SECRET        Will not be disclosed.
  [ ] RESTRICTED    May be disclosed to a limited internal distribution.
  [ ] NONE          Unlimited internal distribution.
Sales region (check one):
  [ ] Northeast (163)      [ ] New York/NJ (1DG)     [ ] Mid-Atlantic (162)
  [ ] Southern (1DF)       [ ] Southwest (1WQ)       [ ] South Central (ASL)
  [ ] Central (161)        [ ] Western (160)         [ ] East Central (AZ3)
  [ ] Europe               [ ] GIA
1. System Manager ______________________   Department ______________________   Phone ( ______ ) __________

2. Digital Customer Services Representative ______________________   Branch Office ______________________   Phone ( ______ ) __________
Check the appropriate box:
  [ ] Digital Hardware Contract (H)
  [ ] Digital Software Contract (S)
  [ ] Self-Maintenance (Y)
  [ ] Third-Party (T)
Check Yes or No:
  VCS-VAXcluster Console System                               Yes   No
  VAXcluster system on Ethernet                               Yes   No
  VAX Performance Advisor                                     Yes   No
  VAX Volume Shadowing                                        Yes   No
  VAX RMS Journaling                                          Yes   No
  Is there an LAVc connected to this CI-VAXcluster
  (mixed interconnect)?                                       Yes   No
    If yes, what is the number of satellite nodes? ________

Average active users ________      Peak number of users ________
VAXcluster operating system (VMS or ULTRIX): ______________    Version Number: __________

CPU Type      Serial No.      Memory        Number of       Number of Star Couplers
                              Size (MB)     CI Adapters     Connected to This CPU
__________    __________      __________    __________      __________
__________    __________      __________    __________      __________
__________    __________      __________    __________      __________
HSC Type ______________   Serial No. ______________
HSC Type ______________   Serial No. ______________
Digital Drives and Tapes: Enter quantity of each type listed below.

  RA60 ____    RA70 ____    RA80 ____    RA81 ____    RA82 ____
  RA90 ____    RA92 ____    RM05 ____    RM80 ____    RP06 ____
  RP07 ____    RV20 ____    RV64 ____    SA482 ____
  1.1 GB SA550-1 ____    2.4 GB SA550-2 ____    4.2 GB SA550-3 ____
  1.2 GB SA600-1 ____    2.4 GB SA600-2 ____    4.8 GB SA600-3 ____    9.7 GB SA600-4 ____
  3.5 GB SA650-1 ____    6.0 GB SA650-2 ____    9.5 GB SA650-3 ____    2.8 GB SA705 ____
  SA800 ____    SA850 ____
  120 MB ESE20-1 ____    240 MB ESE20-2 ____
  TA78 ____    TA79 ____    TA81 ____    TA90 ____    TA91 ____
  TE16 ____    TSV05 ____   TU45 ____    TU77 ____    TU78 ____
  TU79 ____    TU80 ____    TU81 ____    TU90 ____
Non-Digital Storage Devices:
  Disk or Tape      Mfr Name              Model             Storage/MB      Qty
  ____________      ________________      ____________      __________      ____
  ____________      ________________      ____________      __________      ____
Software Products: Please check all the products used on this VAXcluster system.
  [ ] V01  VAX DBMS                           [ ] V05  Remote System Manager
  [ ] V02  VAX Rdb/VMS (full development)     [ ] V06  VAX Distributed File Server
  [ ] V03  VAX ACMS                           [ ] V07  VAX Distributed Queueing Server
  [ ] V04  VAX SPM                            [ ] V08  DECintact
Applications: Please check only the top five applications on the VAXcluster system.

Marketing/Sales/Service
  [ ] B13  Marketing Mgmt Support
  [ ] B14  Customer Information Service
  [ ] B15  Retail Operations & Channels
  [ ] B16  Sales Operations & Comm
  [ ] B17  Repair Services
  [ ] B18  Wholesale Dist Operations
  [ ] B19  Other Dist, Mktg, Sales, Svc

Finance and Administration
  [ ] B20  Billing/Project Accounting
  [ ] B21  General Ledger/Payables/Rcvg
  [ ] B22  Legal/Litigation
  [ ] B23  General Administration
  [ ] B24  Payroll
  [ ] B25  Personnel/Policy Administration
  [ ] B26  Purchasing/Procurement
  [ ] B27  Other Finance & Admin Bus

Engineering
  [ ] B28  Arch Engineering & Const
  [ ] B29  Process Engineering & Design
  [ ] B30  Electrical Engineering
  [ ] B31  Engineering Support
  [ ] B32  Mapping
  [ ] B33  Mechanical Engineering
  [ ] B34  Manufacturing Engineering
  [ ] B35  Oil Expl/Production/Mining
  [ ] B36  Computer Aided Software Eng
  [ ] B37  Other Engineering

Insurance
  [ ] B50  Agency Systems
  [ ] B51  Claims Processing
  [ ] B52  Underwriting
  [ ] B53  Insurance Policy Admin
  [ ] B54  Other Insurance

Brokerage
  [ ] B55  Portfolio Management
  [ ] B56  Retail Brokerage
  [ ] B57  Trading Systems

Banking
  [ ] B58  Demand & Time Deposit Acct
  [ ] B59  Foreign Exchange
  [ ] B60  Funds Transfer
  [ ] B61  Cash Management
  [ ] B62  Loan Processing
  [ ] B63  Other Whsle & Retail Banking

Telecommunications
  [ ] B64  Telecom Intelligent Networks
  [ ] B65  Telecom Network Management
  [ ] B66  Telecom Operational Support
  [ ] B67  Computer Integrated Telephone
  [ ] B68  Other Telecommunications
Research/Lab
  [ ] B38  Lab Information Mgmt
  [ ] B39  Scientific Data Analysis
  [ ] B40  Data Acquisition & Control
  [ ] B41  Signal Processing
  [ ] B42  Scientific Image Processing
  [ ] B43  Health & Education
  [ ] B44  Other Research/Lab

Manufacturing
  [ ] B45  Factory/Industrial Automation
  [ ] B46  Manufacturing Decision Support
  [ ] B47  Mfg Planning & Ctl Sys/MRP II
  [ ] B48  Maintenance/Facilities Mgmt
  [ ] B49  Other Manufacturing

Generic Applications
  [ ] B01  Application Design & Devel
  [ ] B02  Economic/Business Analysis
  [ ] B03  Electronic Publishing
  [ ] B04  Document Imaging
  [ ] B05  Word & Document Process
  [ ] B06  Data Network Mgmt
  [ ] B07  Modeling/Simulation
  [ ] B08  Office Automation/Electronic
  [ ] B09  Planning/Budget
  [ ] B10  Realtime Computing
  [ ] B11  Technical Documentation
  [ ] B12  Supercomputing
  [ ] B99  Other Generic

Note: An envelope is provided inside the front cover for you to return your completed questionnaire.