Digital Technical Journal

TransactionProcessing, Databases,
and Fault-tolerant Systems
Digital Technical Journal
Digital Equipment Corporation
Volume 3 Number 1
Winter 1991
Editorial
Jane C. Blake, Editor
Kathleen M. Stetson, Associate Editor
Circulation
Catherine M. Phillips, Administrator
Suzanne J. Babineau, Secretary
Production
Helen L. Patterson, Production Editor
Nancy Jones, Typographer
Peter Woodbury, Illustrator
Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Robert M. Glorioso
Richard J. Hollingsworth
John W. McCredie
Alan G. Nemeth
Mahendra R. Patel
F. Grant Saviers
Robert K. Spitz
Victor A. Vyssotsky
Gayn B. Winters
The Digital Technical Journal is published quarterly by Digital
Equipment Corporation, 146 Main Street ML01-3/868, Maynard,
Massachusetts 01754-2571. Subscriptions to the Journal are $40.00
for four issues and must be prepaid in U.S. funds. University and
college professors and Ph.D. students in the electrical engineering
and computer science fields receive complimentary subscriptions
upon request. Orders, inquiries, and address changes should be
sent to The Digital Technical Journal at the published-by address.
Inquiries can also be sent electronically to DTJ@CRL.DEC.COM.
Single copies and back issues are available for $16.00 each from
Digital Press of Digital Equipment Corporation, 12 Crosby Drive,
Bedford, MA 01730-1493.
Digital employees may send subscription orders on the ENET to
RDVAX::JOURNAL or by interoffice mail to mailstop ML01-3D68.
Orders should include badge number, cost center, site location code
and address. All employees must advise of changes of address.
Comments on the content of any paper are welcomed and may
be sent to the editor at the published-by or network address.
Copyright 1991 Digital Equipment Corporation. Copying
without fee is permitted provided that such copies are made for
use in educational institutions by faculty members and are not
distributed for commercial advantage. Abstracting with credit
of Digital Equipment Corporation's authorship is permitted. All
rights reserved.
The information in this Journal is subject to change without
notice and should not be construed as a commitment by Digital
Equipment Corporation. Digital Equipment Corporation assumes
no responsibility for any errors that may appear in this Journal.
Documentation Number EYF588E-DP
Cover Design
Transaction processing is the common theme for papers in this
issue. The automatic teller machine on our cover represents one
of the many businesses that rely on TP systems. If we could look
behind the familiar machine, we would see the products and
technologies - here symbolized by linked databases - that
support reliable and speedy processing of transactions worldwide.
The cover was designed by Dave Bryant of Digital's Media
Communications Group.
The following are trademarks of Digital Equipment Corporation:
DEC, DECforms, DECintact, DECnet, DECserver, DECtp, Digital, the
Digital logo, LAT, Rdb/VMS, TA, VAX ACMS, VAX CDD, VAX COBOL,
VAX DBMS, VAX Performance Advisor, VAX RALLY, VAX Rdb/VMS,
VAX RMS, VAX SPM, VAX SQL, VAX 6000, VAX 9000, VAXcluster,
VAXft, VAXserver, VMS.
IBM is a registered trademark of International Business Machines
Corporation.
TPC Benchmark is a trademark of the Transaction Processing
Performance Council.
Book production was done by Digital's Educational Services
Media Communications Group in Bedford, MA.
Contents
8 Foreword
Carlos G. Borgialli
Transaction Processing, Databases,
and Fault-tolerant Systems
10 DECdta -Digital's Distributed
Transaction Processing Architecture
Philip A. Bernstein, William T. Emberton, and Vijay Trehan
18 Digital's Transaction Processing Monitors
Thomas G. Speer and Mark W. Storm
33 Transaction Management Support in the
VMS Operating System Kernel
William A. Laing, James E. Johnson, and Robert V. Landau
45 Performance Evaluation of
Transaction Processing Systems
Walter H. Kohler, Yun-Ping Hsu, Thomas K. Rogers,
and Wael H. Bahaa-El-Din
Tools and Techniques for Preliminary Sizing of
Transaction Processing Applications
William Z. Zahavi, Frances A. Habib, and Kenneth J. Omahen
65 Database Availability for Transaction Processing
Ananth Raghavan and T. K. Rengarajan
70 Designing an Optimized Transaction Commit Protocol
Peter M. Spiro, Ashok M. Joshi, and T. K. Rengarajan
79 Verification of the First Fault-tolerant VAX System
William F. Bruckert, Carlos Alonso, and James M. Melvin
Editor's Introduction
Digital's transaction processing systems are integrated hardware and software products that operate
in a distributed environment to support commercial applications, such as bank cash withdrawals,
credit card transactions, and global trading. For
these applications, data integrity and continuous
access to shared resources are necessary system
characteristics; anything less would jeopardize the
revenues of business operations that depend on
these applications. Papers in this issue of the Journal
look at some of Digital's technologies and products
that provide these system characteristics in three
areas: distributed transaction processing, database
access, and system fault tolerance.
Opening the issue is a discussion of the architecture, DECdta, which ensures reliable interoperation
in a distributed environment. Phil Bernstein, Bill
Emberton, and Vijay Trehan define some transaction
processing terminology and analyze a TP application to illustrate the need for separate architectural
components. They then present overviews of each
of the components and interfaces of the distributed
transaction processing architecture, giving particular attention to transaction management.
Two products, the ACMS and DECintact monitors,
implement several of the functions defined by the
DECdta architecture and are the twin topics of a
paper by Tom Speer and Mark Storm. Although
based on different implementation strategies, both
ACMS and DECintact provide TP-specific services
for developing, executing, and managing TP applications. Tom and Mark discuss the two strategies
and then highlight the functional similarities and
differences of each monitor product.
The ACMS and DECintact monitors are layered on
the VMS operating system, which provides base
services for distributed transaction management.
Described by Bill Laing, Jim Johnson, and Bob
Landau, these VMS services, called DECdtm, are an
addition to the operating system kernel and address
the problem of integrating data from multiple systems and databases. The authors describe the three
DECdtm components, an optimized implementation of the two-phase commit protocol, and some
VAXcluster-specific optimizations.
The next two papers turn to the issues of measuring TP system performance and of sizing a system
to ensure a TP application will run efficiently. Walt
Kohler, Yun-Ping Hsu, Tom Rogers, and Wael Bahaa-El-Din discuss how Digital measures and models TP
system performance. They present an overview of
the industry-standard TPC Benchmark A and Digital's
implementation, and then describe an alternative
to benchmark measurement - a multilevel analytical model of TP system performance that simplifies
the system's complex behavior to a manageable set
of parameters. The discussion of performance continues but takes a different perspective in the paper
on sizing TP systems. Bill Zahavi, Fran Habib, and
Ken Omahen have written about a methodology
for estimating the appropriate system size for a TP
application. The tools, techniques, and algorithms
they describe are used when an application is still
in its early stages of development.
High performance must extend to the database
system. In their paper on database availability,
Ananth Raghavan and T.K. Rengarajan examine
strategies and novel techniques that minimize the
effects of downtime situations. The two databases
referenced in their discussion are the VAX Rdb/VMS
and VAX DBMS systems. Both systems use a database
kernel called KODA, which provides transaction
capabilities and commit processing. Peter Spiro,
Ashok Joshi, and T.K. Rengarajan explain the importance of commit processing relative to throughput
and describe new designs for improving the performance of group commit processing. These designs
were tested, and the results of these tests and the
authors' observations are presented.
Equally as important in TP systems as database
availability is system availability. The topic of the
final paper in this issue is a system designed to be
continuously available, the VAXft 3000 fault-tolerant
system. Authors Bill Bruckert, Carlos Alonso, and
Jim Melvin give an overview of the system and then
focus on the four-phase verification strategy devised
to ensure transparent system recovery from errors.
I thank Carlos Borgialli for his help in preparing
this issue and for writing the issue's Foreword.
Biographies
Carlos Alonso A principal software engineer, Carlos Alonso is a team leader
for the project to port the System-V operating system to the VAXft 3000.
Previously, he was the project leader for various VAXft 3000 system validation
development efforts. As a member of the research group, Carlos developed the
test bed for evaluating concurrency control algorithms using the VMS
Distributed Lock Manager, and he designed the prototype alternate lock
rebuild algorithm for cluster transitions. He holds a B.S.E.E. (1979) from Tulane
University and an M.S.C.S. (1980) from Boston University.
Wael Hilal Bahaa-El-Din Wael Bahaa-El-Din joined Digital in 1987 as a senior
consultant to the Systems Performance Group, Database Systems. He has led a
number of studies to evaluate performance of database and transaction processing systems under response time constraints. After receiving his Ph.D. (1984) in
computer and information science from Ohio State University, Wael spent
three years as an assistant professor at the University of Houston. He is
a member of ACM and IEEE, and he has written numerous articles for professional journals and conferences.
Philip A. Bernstein As a senior consultant engineer, Philip Bernstein is both
an architectural consultant in the Transaction Processing Systems Group and a
researcher at the Cambridge Research Laboratory. Prior to joining Digital in 1987,
he was a professor at Wang Institute of Graduate Studies and at Harvard University, a vice president at Sequoia Systems, and a researcher at the Computer
Corporation of America. He has published over 60 papers and coauthored two
books. Phil received a B.S. (1971) in engineering from Cornell University and a
Ph.D. (1975) in computer science from the University of Toronto.
William F. Bruckert William Bruckert is a consulting engineer who joined
Digital in 1969 after receiving a B.S.E.E. degree from the University of
Massachusetts. He received an M.S.E.E./C.E. degree from the same university
in 1981. Beginning as a worldwide product support engineer, Bill later worked
on a number of DECsystem-10/20 designs. He developed the cache, memory,
and I/O subsystem for the VAX 8600 processor and was the system architect
of the VAX 8650 processor. His most recent role was as the architect of the VAXft
3000 system. Bill currently holds seven patents.
William T. Emberton As a principal software engineer, William Emberton is
currently involved in the development of Queue Management Architecture. He
is also involved in X/Open and POSIX TP standards work and is a member
of the team that is developing the overall DECtp product architecture. Previously, he worked on the initial versions of the DECdta architecture. Before coming to Digital in 1987, Bill held positions as Director of Software Development
at National Semiconductor and Manager of Systems Development for
International Retail Systems at NCR. He was educated at London University.
Frances A. Habib Fran Habib is a principal software engineer involved with
the development of transaction processing workload characterization and sizing tools. Previously, Fran worked at Data General and GTE Laboratories as a
management science consultant. She holds an M.S. in operations research from
MIT and a B.S. in engineering and applied science from Harvard. Fran is a full
member of ORSA and belongs to ACM, IEEE, and the ACM SIGMETRICS special
interest group on modeling and performance evaluation of computer systems.
Yun-Ping Hsu Yun-Ping is currently a principal software engineer in the
Transaction Processing Systems Performance and Characterization Group. He
joined Digital in October 1987, after receiving his master's degree in electrical
and computer engineering from the University of Massachusetts at Amherst. In
his position, Yun-Ping is responsible for performance modeling and benchmark measurement of both ACMS- and DECintact-based TP systems. He also
participated in the TPC Benchmark A standardization activity during 1989. He is
a member of ACM and IEEE.
James E. Johnson A consulting software engineer, Jim Johnson has worked
for the VMS Engineering Group since joining Digital in 1984. He is currently a
project leader for VMS Engineering in Europe. Prior to this work, Jim led the
RMS project, and after relocating to the UK three years ago, he was responsible
for much of the design and implementation of the DECdtm services. At the same
time, Jim was an active participant in the transaction management architecture
review group. He has applied for a patent pertaining to the two-phase commit
protocol optimization currently used in DECdtm services.
Ashok M. Joshi Ashok Joshi is a principal software engineer interested in
database systems, transaction processing, and object-based programming. He is
presently working on the KODA subsystem, which provides record storage for
Rdb/VMS and DBMS software. For the Rdb/VMS project, he developed hash
indexing and record placement features, and he worked on optimizing the lock
protocols. Ashok came to Digital after receiving a bachelor's degree in electrical
engineering from the Indian Institute of Technology, Bombay, and a master's
degree in computer science from the University of Wisconsin, Madison.
Walter H. Kohler As a software engineering senior manager, Walt is responsible for TP system performance measurement and analysis and leads Digital's
TP benchmark standards activities. Before joining Digital in 1988, Walt was a visiting scientist and technical consultant to Digital and a professor of electrical
and computer engineering at the University of Massachusetts at Amherst. He
holds B.S., M.S., and Ph.D. degrees in electrical engineering, all from Princeton
University. Walt recently received the IEEE/CS Meritorious Service Award, and
he has published over 25 technical articles.
William A. Laing William Laing is a senior consultant engineer based in
Newbury, England. He is the technical leader for production systems support
for the VMS operating system. During five years spent in the U.S., Bill was
responsible for the design and initial development of symmetrical multiprocessing support in the VMS system. He joined Digital in 1981, after doing
research on operating systems at Edinburgh University for nine years. Bill holds
a B.Sc. (1972) in mathematics and computer science and an M.Phil. (1976) in
computer science, both from Edinburgh University.
Robert V. Landau Principal software engineer Robert Landau is a member of
the VMS Engineering Group, based in Newbury, England. He is currently the
project leader of a VMS advanced development team investigating a high-performance, transaction-based, flat file system. Before joining Digital in 1987, Bob
worked for a variety of software houses specializing in database-related products. He studied botany at London University and, subsequently, obtained a
teaching qualification from Hereford College.
James M. Melvin As a principal design engineer, Jim was responsible for the
specification of hardware error-handling mechanisms in the VAXft system and is
presently an engineering project leader for future VAXft systems. He also specified and led the implementation of the hardware system simulation platform
and the hardware design verification test plan. Jim joined Digital in 1984 and
holds a B.S.E.E. (1984) and an M.S. (1989) in engineering management from
Worcester Polytechnic Institute. He holds three patents on the VAXft 3000 system, all related to error handling in a fault-tolerant system.
Kenneth J. Omahen A principal engineer, Kenneth Omahen is developing
object-oriented queuing network solvers. He designed a variety of performance tools and performed design support studies which influenced a number
of Digital products. Prior to joining Digital, Ken worked at Bell Telephone
Laboratories, lectured at the University of Newcastle-Upon-Tyne, and was a
faculty member at Purdue University. He received a B.S. degree in science engineering from Northwestern University and M.S. and Ph.D. degrees in information sciences from the University of Chicago.
Ananth Raghavan Since joining Digital in 1988, Ananth Raghavan has been
a software engineer who has led projects for the KODA/Rdb Group. Previous to
this position, he was a teaching assistant in the computer science department
of the University of Wisconsin. Ananth holds a B.S. (1985) degree in mechanical engineering from the Indian Institute of Technology, Madras, and an M.S.
(1987) degree in computer science from the University of Wisconsin, Madison.
He has two patent applications pending for his work on undo and undo/redo
database algorithms.
T. K. Rengarajan T. K. Rengarajan has been a member of the Database
Systems Group since 1987 and works on the KODA software kernel for database
management systems. He is involved in the support for WORM devices and
global buffer management in the VAXcluster environment. His work in the areas
of boundary element methods and database management systems is reported in
several published papers and patent applications. Ranga holds an M.S. degree in
computer-aided design from the University of Kentucky and an M.S. in computer science from the University of Wisconsin.
Thomas K. Rogers Thomas Rogers is a project leader for the Transaction
Processing Systems Performance and Characterization Group. He is responsible for testing the VAX 9000 Model 210 system using the TPC Benchmark A
standard. Prior to joining Digital in January 1988, Tom worked for Sperry
Corporation as a technical specialist for the Northeast region. He received a
bachelor of science degree in mathematical sciences in 1979 from Johns
Hopkins University.
Thomas G. Speer As a principal software engineer in the DECtp/East
Engineering Group, Thomas Speer is currently leading the DECintact V2.0 project. In this position, his major responsibility is defining the requirements for
DECintact support of DECdtm services, client/server database access, and support for the DECforms product. Since joining Digital in 1981, Tom has worked
on several development projects, including FORTRAN-10/20 and LMS-20. He holds
degrees from Harvard University, Rutgers University, and Simmons College. He
is a member of Phi Beta Kappa.
Peter M. Spiro Peter Spiro, a principal software engineer, is currently
involved in optimizing database technology for RISC machines. He has worked
on database facilities such as access methods, journaling and recovery, transaction protocols, and buffer management. Peter joined Digital in 1985, after
receiving M.S. degrees in forest science and computer science from the
University of Wisconsin. He has a patent pending for a method of database journaling and recovery, and he authored a paper for an earlier issue of the Digital
Technical Journal. In addition, Peter enjoys the game of Ping-Pong.
Mark W. Storm Consulting engineer Mark Storm was one of the original
designers of the ACMS monitor, and he has been involved in the development of
TP products for more than ten years. Currently, he is acting technical director
for the East Coast Transaction Processing Engineering Group, as well as managing a small advanced development group. After joining Digital in 1976, Mark
worked on COBOL compilers for the PDP-11 systems and developed the first
native COBOL compiler for the VAX computer. He holds a B.S. (with honors) in
computer science from the University of Southern Mississippi.
Vijay Trehan Since joining Digital in 1978, Vijay Trehan has contributed to
several architecture projects. He is the technical director responsible for
DECtp architecture, design, and standards work. Prior to this assignment, Vijay
was the architect for the DECdtm protocol, architect for the DDIS data interchange format, and initiator of work on the DDIF document interchange format
and compound document strategy. He holds a B.S. (1972) in mechanical engineering from the Indian Institute of Technology and an M.S. (1974) in operations
research from Syracuse University.
William Z. Zahavi As an engineering manager, Bill is responsible for the
design and development of predictive sizing tools for transaction processing
applications. Before joining Digital in 1987, he was a technical consultant for
Sperry Corporation, specializing in systems performance analysis and capacity
planning. Bill received an M.B.A. from Northeastern University and a B.S. in
mathematics from the University of Virginia. He is an active member of the
Computer Measurement Group, and frequently presents at CMG conferences.
Foreword
Carlos G. Borgialli
Senior Manager, DECtp Software Engineering
Transaction processing is one of the largest, most
rapidly growing segments of the computer industry. Digital's strategy is to be a leader in transaction
processing, and toward that end we are making
technological advances and delivering products to
meet the evolving needs of businesses that rely on
transaction processing systems.
Because of the speed and reliability with which
transaction processing systems capture and display up-to-date information, they enable businesses
to make well-informed, timely decisions. Industries
for which transaction processing systems are a significant asset include banking, laboratory automation, manufacturing, government, and insurance.
For these industries and others, transaction processing is an information lifeline that supports the
achievement of daily business objectives and in
many instances provides a competitive advantage.
Many older transaction processing systems on
which businesses rely are centralized and tied to a
particular vendor. A great deal of money and time
has been invested in these systems to keep pace
with business expansion. As expansion continues
beyond geographic boundaries, however, the centralized, single-vendor transaction processing systems are less and less likely to offer the flexibility
needed for round-the-clock, reliable, business
operations conducted worldwide. Transaction processing technology therefore must evolve to
respond to the new business environment and at
the same time protect the investment made in
existing systems.
Our research efforts and innovative products
provide the transaction processing systems that
businesses need today. The demand for distributed
rather than centralized systems has focused attention on system management. Queuing services,
highly available systems, heterogeneous environments, security services, and computer-aided software engineering (CASE) are a few examples of
areas in which research and advanced development efforts have had and will continue to have a
major impact on the capabilities of transaction
processing systems.
Transaction processing solutions require the
application of a wide range of technology and the
integration of multiple software and hardware
products: from desktop to mainframe; from presentation services and user interfaces to TP monitors,
database systems, and computer-aided software
engineering tools; from optimization of system
performance to optimization of availability. Making
all of this technology work well together is a great
challenge, but a challenge Digital is uniquely positioned to meet.
Digital ensures broad application of its transaction processing technology by defining an
architecture, the Digital Distributed Transaction
Architecture (DECdta). DECdta, about which you will
read in this issue, defines the major components of
a Digital TP system and the way those components
can form an integrated transaction processing system. The DECdta architecture describes how data
and processing are easily distributed among multiple VAX processors, as well as how the components
can interoperate in a heterogeneous environment.
The DECdta architecture is based on the client/
server computing model, which allows Digital to
apply its traditional strengths in networking and
expandability to transaction processing system
solutions. In the DECdta client/server computing
model, the client portion interacts with the user to
create processing requests, and the server portion
performs the data manipulation and computation
to execute the processing request. This computing
model facilitates the division of a TP system into
small components in three ways. It allows for distribution of functions among VAX processors; it
partitions the work performed by one or more of
the components to allow for parallel processing;
or it replicates functions to achieve higher availability goals. These options permit the customer
to purchase the configuration that meets present
needs, confident that the system will allow smooth
expansion in the future.
Further, the DECdta architecture sets a direction
for its evolution through different products in a
coordinated manner. It provides for the cooperation and interoperation of components implemented on different platforms, and it supports the
expansion of customer applications to meet growth
requirements. The DECdta architecture is designed
to work with other Digital architectures such as the
Digital Network Architecture (DNA), the network
application services (NAS), and the Digital database
architecture (DDA). Moreover, the DECdta architecture supports industry standards that enable the
portability of applications and their interoperation in a heterogeneous environment, such as the
standard application programming interfaces being
developed by the X/Open Transaction Processing
Working Group and the IEEE POSIX. Standard wire
protocols that provide for systems interoperation
in a multivendor, heterogeneous environment are
being developed by the International Standards
Organization as part of the Open System Interconnection activities.
Among the products Digital has developed specifically for TP systems are the TP monitors. These
monitors provide the system integration "glue," if
you will. Rather than act as their own systems integrators, customers who use Digital's TP monitors
are able to spend more time on solving business
problems and less time on solving software integration problems, such as how to make forms and
database products work together smoothly.
Digital's TP monitors run on all types of hardware configurations, including local area networks
(LANs), wide area networks (WANs), and VAXcluster
systems. The DECdta client/server computing model
provides the necessary flexibility to change hardware configurations, thus allowing reconfiguration without the need for any source code changes.
The two TP monitors, DECintact and VAX ACMS,
integrate vital Digital technologies such as the
Digital Distributed Transaction Manager (DECdtm)
and products such as Digital's forms systems
(DECforms) and our Rdb/VMS or VAX DBMS database products. DECdtm uses the two-phase commit protocol to solve the complex problem of
coordinating updates to multiple data resources
or databases.
Major developments in Digital's database products have enhanced the strengths of its overall
product offerings. The two mainstream database
products noted above, Rdb/VMS and VAX DBMS,
layer on top of a database kernel called KODA, thus
providing data access independent of any data
model. The services made available by KODA,
besides its high performance, allow Digital's database products to efficiently support TP applications as well as to provide rich functionality for
general-purpose database applications.
For those TP systems that require user interfaces, DECforms provides a device-independent,
easy-to-use human interface and permits the support of multiple devices and users within a single
application.
TP systems that require high availability or continuous operations are supported by the VAX family of hardware and software. The introduction of
the fault-tolerant VAXft 3000 system, added to the
successful VAXcluster system, allows for a high
level of system availability. Performance needs
also are being met by a combination of hardware
resources, including the VAX 9000 system.
This combination of architecture, software, and
hardware technology, and support for emerging
industry standards places Digital in an excellent
position to become the industry leader for distributed, portable transaction processing systems.
The papers in this issue of the Journal provide a
view of the key elements of Digital's distributed
transaction processing technologies.
Many individuals, teams, organizations, and business partners are responsible for bringing Digital's
TP vision to fruition. Their dedication, hard work,
and creativity will continue to drive the development of new technologies that enhance our family
of products and services.
Philip A. Bernstein
William T. Emberton
Vijay Trehan
DECdta - Digital's Distributed
Transaction Processing
Architecture
Digital's Distributed Transaction Processing Architecture (DECdta) describes the
modules and interfaces that are common to Digital's transaction processing
(DECtp) products. The architecture allows easy distribution of DECtp products.
In particular, it supports client/server style applications. Distributed transaction
management is the main function that ties DECdta modules together. It ensures
that application programs, database systems, and other resource managers interoperate reliably in a distributed system.
Transaction processing (TP) is the activity of executing requests to access shared resources, typically
databases. A computer system that is configured to
execute TP applications is called a TP system.
A transaction is an execution of a set of operations on shared resources that has the following
properties:
Atomicity. Either all of the transaction's operations execute, or the transaction has no effect
at all.
Serializability. The set of all operations that execute on behalf of the transaction appears to
execute serially with respect to the set of operations executed by every other transaction.
Durability. The effects of the transaction's operations are resistant to failures.
A transaction terminates by executing the commit or abort operation. Commit tells the system to
install the effect of the transaction's operations
permanently. Abort tells the system to undo the
effects of the transaction's operations.
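To make the commit/abort contract concrete, the small C sketch below shows a transfer that either installs both of its updates or neither. The names tx_start, tx_commit, tx_abort, debit, and credit are hypothetical stand-ins, not an actual DECdta or DECdtm interface.

```c
#include <stdio.h>

/* Committed state of two accounts and the current transaction's tentative
 * updates.  The demarcation functions below are illustrative only. */
static long balance[2] = {1000, 0};
static long pending[2];

static void tx_start(void)        { pending[0] = pending[1] = 0; }
static void debit(int a, long c)  { pending[a] -= c; }
static void credit(int a, long c) { pending[a] += c; }

/* Commit installs every tentative update; abort discards them all, so the
 * transaction either takes full effect or no effect (atomicity). */
static void tx_commit(void) { balance[0] += pending[0]; balance[1] += pending[1]; }
static void tx_abort(void)  { pending[0] = pending[1] = 0; }

int main(void)
{
    tx_start();
    debit(0, 250);
    credit(1, 250);
    if (balance[0] + pending[0] < 0)
        tx_abort();    /* a constraint failed: undo both operations */
    else
        tx_commit();   /* make both operations permanent together   */
    printf("balances: %ld %ld\n", balance[0], balance[1]);
    return 0;
}
```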
For enhanced reliability and availability, a TP
application uses transactions to execute requests.
That is, the application receives a request message
(from a display, computer, or other device), executes one or more transactions to process the
request, and possibly sends a reply to the originator of the request or to some other party specified
by the originator.
TP applications are essential to the operation
of many industries, such as finance, retail, health
care, transportation, government, communications,
and manufacturing. Given the broad range of applications of TP, Digital offers a wide variety of products with which to build TP systems.
DECtp is an umbrella term that refers to Digital's
TP products. The goal of DECtp is to offer an integrated set of hardware and software products
that supports the development, execution, and
management of TP applications for enterprises of
all sizes.
DECtp systems include software components
that are specialized for TP, notably TP monitors
such as the ACMS and DECintact TP monitors, and
transaction managers such as the DECdtm transaction manager. DECtp systems also require the
integration of general-purpose hardware products
(processors, storage, communications, and terminals) and software products (operating systems,
database systems, and communication gateways).
These products are typically integrated as shown
in Figure 1.
Figure 1  Layering of Products to Support a TP Application
(The figure shows a TP application layered over a TP monitor and forms manager; beneath them are database systems and a communication system, then the operating system, and finally the processors, mass storage, and network.)
Applications on DECtp systems can be designed
using a client/server paradigm. This paradigm is
especially useful for separating the work of preparing a request from that of running transactions.
Request preparation can be done by a front-end
system, that is, one that is close to the user, in
which processor cycles are inexpensive and interactive feedback is easy to obtain. Transaction
execution can be done by a larger back-end system, that is, one that manages large databases
and may be far from the user. Back-end systems
may themselves be distributed. Each back-end
system manages a portion of the enterprise
database and executes applications, usually ones
that make heavy use of the database on that back
end. DECtp products are modularized to allow easy
distribution across front ends and back ends,
which enables them to support client/server style
applications. DECtp systems thereby simplify programming and reconfiguration in a distributed
system.
Digital's Distributed Transaction Processing
Architecture (DECdta) defines the modularization
and distribution structure that is common to DECtp
products. Distributed transaction management is
the main function that ties this structure together.
This paper describes the DECdta structure and
explains how DECdta components are integrated
by distributed transaction management.
Current versions of DECtp products implement
most, but not all, modules and interfaces in the
DECdta architecture. Gaps between the architecture and products will be filled over time. DECtp
products that currently implement DECdta components are referenced throughout the paper.
TP Application Structure
By analyzing TP applications, we can see where the
need arises for separate DECdta components. A
typical TP application is structured as follows:
Step 1: The client application interacts with a
user (a person or machine) to gather input, e.g.,
using a forms manager.
Step 2: The client maps the user's input into a
request, that is, a message that asks the system to
perform some work. The client sends the request
to a server application to process the request.
A request may be direct or queued. If direct, the
client expects a server to process the request right
away. If queued, the client deposits the request
in a queue from which a server can dequeue the
request later.
Step 3: A server processes the request by
executing one or more transactions. Each transaction may
a. Access multiple resources
b. Call programs, some of which may be remote
c. Generate requests to execute other transactions
d. Interact with a user
e. Return a reply when the transaction finishes
Step 4: If the transaction produces a reply, then
the client interacts with the user to display that
reply, e.g., using a forms manager.
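The four steps can be illustrated with a compilable sketch. All of the names below (gather_input, send_request, run_transaction) are hypothetical placeholders; in a real DECtp system a forms manager, a request manager, and a TP monitor would provide these roles.

```c
#include <stdio.h>
#include <string.h>

struct request { char action[16]; long account; long amount; };
struct reply   { int  status;     long new_balance; };

/* Step 1: the front end gathers input, e.g., through a forms manager. */
static void gather_input(struct request *r)
{
    strcpy(r->action, "withdraw");
    r->account = 42;
    r->amount  = 100;
}

/* Step 3: a back-end server processes the request in one transaction. */
static struct reply run_transaction(const struct request *r)
{
    struct reply rep;
    rep.status      = 0;
    rep.new_balance = 1000 - r->amount;   /* pretend the debit succeeded */
    return rep;
}

/* Step 2: the client sends the request.  A queued request would instead be
 * deposited in a recoverable queue for a server to dequeue later. */
static struct reply send_request(const struct request *r)
{
    return run_transaction(r);   /* direct request: processed right away */
}

int main(void)
{
    struct request req;
    struct reply   rep;

    gather_input(&req);          /* Step 1 */
    rep = send_request(&req);    /* Steps 2 and 3 */
    printf("status %d, balance %ld\n", rep.status, rep.new_balance); /* Step 4 */
    return 0;
}
```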
Each of the above steps involves the interaction
of two or more programs. In many cases, it is desirable that these programs be distributed. To distribute them conveniently, it is important that the
programs be in separate components. For example, consider the following:
The presentation service that operates the display and the application that controls which
form to display may be distributed.
One may want to off-load presentation services
and related functions to front ends, while allowing programs on back ends to control which
forms are displayed to users. This capability is
useful in Steps 1, 3d, and 4 above to gather input
and display output. To ensure that the presentation service and application can be distributed,
the presentation service should correspond to a
separate DECdta component.
The client application that sends a request and
the server application that processes the request
may be distributed. The applications may communicate through a network or a queue.
In Step 2, front-end applications may want to
send requests directly to back-end applications
or to place requests in queues that are managed
on back ends. Similarly, in Step 3c, a transaction, T, may enqueue a request to run another
transaction, where the queue resides on a different system than T. To maximize the flexibility of distributing request management, request
management should correspond to a separate
DECdta component.
Two transaction managers that want to run a
commit protocol may be distributed.
For a transaction to be distributed across different
systems, as in Step 3b, the transaction management
services must be distributed. To ensure that each
transaction is atomic, the transaction managers on
these systems must control transaction commitment using a common commit protocol. To complicate matters, there is more than one widely used
protocol for transaction commitment. To the
extent possible, a system should allow interoperation of these protocols.
To ensure that transaction managers can be distributed, the transaction manager should be a
component of DECdta. To ensure that they can
interoperate, their transaction protocol should
also be in DECdta. To ensure that different commit
protocols can be supported, the part of transaction
management that defines the protocol for interaction with remote transaction managers should
be separated from the part that coordinates transaction execution across local resources. In the
DECdta architecture, the former is called a communication manager, and the latter is called a transaction manager.
Interoperation of transaction managers and
resource managers, such as database systems, also
affects the modularization of DECdta components.
A transaction may involve different types of
resources, as in Step 3a. For example, it may update
data that is managed by different database systems.
To control transaction commitment, the transaction manager must interact with different resource
managers, possibly supplied by different vendors.
This requires that resource managers be separate
components of DECdta.
The DECdta Architecture
Having seen where the need for DECdta components arises, we are now ready to describe the
DECdta architecture as a whole, including the functions of and interfaces to each component.
Most DECdta interfaces are public. Some of the
public interfaces are controlled by official standards bodies and industry consortia; i.e., they are
"open" interfaces. Others are controlled solely by
Digital. DECdta interfaces and protocols will be
published and aligned with industry standards, as
appropriate.
DECdta components are abstract entities. They
do not necessarily map one-to-one to hardware
components, software components (e.g., programs or products), or execution environments
(e.g., a single-threaded process, a multithreaded
process, or an operating system service). Rather, a
DECdta component may be implemented as multiple software components, for example, as several
processes. Alternatively, several DECdta components may be implemented as a single software
component. For example, an operating system or
TP monitor typically offers the facilities of more
than one DECdta component.
The following are the components of DECdta:
An application program is any program that
uses services of DECdta components.
A resource manager manages resources that support transaction semantics.
A transaction manager coordinates transaction
termination (i.e., commit and abort).
A communication manager supports a transaction communication protocol between TP
systems.
A presentation manager supports device-independent interactions with a presentation device.
A request manager facilitates the submission of
requests to execute transactions.
DECdta components are layered on services that
are provided by the underlying operating system
and distributed system platform, and are not specific to TP, as shown in Figure 2.
Application Programs
We use the term application program to mean a
program that uses the services provided by other
DECdta components. An application program
could be a customer-written program, a layered
product, or a DECdta component.
In the DECdta architecture, we distinguish two
special types of application program: request initiators and transaction servers. A request initiator
is a DECdta component that prepares and submits
a request for the execution of a transaction. To
create a request, the request initiator usually interacts with a presentation manager that provides an
interface to a device, such as a terminal, a workstation, a digital private branch exchange, or an
automated teller machine.
A transaction server can demarcate a transaction, interact with one or more resource managers to access recoverable resources on behalf of
the transaction, invoke other transaction servers,
and respond to calls from request initiators.
For a simple request, a transaction server
receives the request, processes it, and optionally
returns a reply to the request initiator. A conversational request is like a simple request, except that
while processing the request, the transaction
server exchanges one or more messages with the
user, usually through the request initiator.
Figure 2
(The figure shows the DECdta components: application programs - request initiators and transaction servers - layered over TP services such as the request manager, presentation manager, transaction manager, resource managers, and communication managers, which are in turn layered over operating system and distributed system services, including the distributed name service, distributed time service, and authentication.)
In principle, a request initiator could also execute
transactions (not shown in Figure 2). That is, the distinction between request initiators and transaction
servers is for clarity only, and does not restrict an
application from performing request initiation functions in a transaction. Architecturally, this amounts
to saying that request initiation functions can execute in a transaction server.
Resource Manager
A resource manager performs operations on shared
resources. We are especially interested in recoverable resource managers, those that obey transaction
semantics. In particular, a recoverable resource
manager undoes a transaction's updates to the
resources if the transaction aborts. Other recoverable resource manager activities in support of transactions are described in the next section. In the rest
of this paper, we use "resource manager" to mean
"recoverable resource manager."
In a TP system, the most common kind of
resource manager is a database system. Some presentation managers and communication managers
may also be resource managers. A resource manager may be written by a customer, a third party,
or Digital.
o r Digital.
Each resource manager type offers a resourcemanager-specific interface that is used by application programs to access and m o d ~ f yrecoverable
resources managed by the resource manager. A description of these resource manager interfaces is
outside the scope of DECdta. However, many of
these resource manager interfaces have architectures defined by industry standards, such as SQL
(e.g., the VAX RdbrVMS product), CODASYL data manipulation language (e.g., the VAX DBMS product), and
COBOL file operations (e.g., RMS in the VMS system).
One type of resource manager that plays a special role in TI-' systems is a queue resource manager.
It manages recoverable queues, which are often
used t o store requests.' It allows application programs t o place elements into queues and retrieve
them, s o that application programs can communicate even though they execute independently and
asynchronously. For example, an application program that sends elements can communicate with
o n e that receives elements even if t h e two application programs are not operational simultaneously.
This communication arrangement improves availability and facilitates batch input of elements.
A queue resource manager interface supports
such operations as open-queue, close-queue,
enqueue, dequeue, and read-element. The ACMS
and DECintact TP monitors both have queue
resource managers as components.
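A header-style sketch of such an interface, built around the operations just listed, might look as follows. The C signatures and the queue_handle type are hypothetical; an actual queue resource manager would also tie each operation to a transaction so that enqueues and dequeues are undone if the transaction aborts.

```c
#include <stddef.h>

/* Opaque handle for a recoverable queue (hypothetical). */
typedef struct queue_handle queue_handle;

queue_handle *open_queue (const char *name);
int           close_queue(queue_handle *q);

/* Enqueue and dequeue within transaction tid: the element becomes visible
 * (or is removed) only if tid commits. */
int enqueue(queue_handle *q, int tid, const void *elem, size_t len);
int dequeue(queue_handle *q, int tid, void *elem, size_t maxlen);

/* Read an element without removing it, e.g., for browsing stored requests. */
int read_element(queue_handle *q, int tid, void *elem, size_t maxlen);
```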
Transaction Manager
A transaction manager supports the transaction
abstraction. It is responsible for ensuring the atomicity of each transaction by telling each resource
manager in a transaction when to commit. It uses
a two-phase commit protocol to ensure that either
all resource managers accessed by a transaction
commit the transaction or they all abort the transaction. To support transaction atomicity, a transaction manager provides the following functions:
Transaction demarcation operations allow application programs or resource managers to start
and commit or abort a transaction. (Resource
managers sometimes start a transaction to execute a resource operation if the caller is not
executing a transaction. The SQL standard
requires this.)
Transaction execution operations allow
resource managers and communication managers to declare themselves part of an existing
transaction.
Two-phase commit operations allow resource
managers and communication managers to
change a transaction's state (to "prepared," "committed," or "aborted").
The serializability of transactions is primarily
the responsibility of the resource managers.
Usually, a resource manager ensures serializability
by setting locks on resources accessed by each
transaction, and by releasing the locks after the
transaction manager tells the resource manager
to commit. (The latter activity makes serializability partly the responsibility of the transaction
manager.) If transactions become deadlocked, a
resource manager may detect the deadlock and
abort one of the deadlocked transactions.
The durability of transactions is a responsibility
of transaction managers and resource managers.
The transaction manager is responsible for the
durability of the commit or abort decision. A
resource manager is responsible for the durability
of operations of committed transactions. Usually,
it ensures durability by storing a description of
each transaction's resource operations and state
changes in a stable (e.g., disk-resident) log. It can
later use the log to reconstruct transactions' states
while recovering from a failure.
A detailed description of the DECdta transaction
manager component appears in the Transaction
Manager Architecture section.
Communication Manager
A communication manager provides services for
communication between named objects in a TP
system, such as application programs and transaction managers. Some communication managers
participate in coordinating the termination of a
transaction by propagating the transaction manager's two-phase commit operations as messages
to remote communication managers. Other communication managers propagate application data
and transaction context, such as a transaction identifier, from one node to another. Some do both.
A TP system can support multiple communication managers. These communication managers
can interact with other nodes using different commit protocols or message-passing protocols, and
may be part of different name spaces, security
domains, system management domains, etc.
Examples are an IBM SNA LU6.2 communication
manager or an ISO-TP communication manager.
By supporting multiple communication managers, the DECdta architecture enhances the interoperability of TP systems. Different TP systems can
interoperate by executing a transaction using different commit protocols.
A communication manager offers an interface
for application programs to communicate with
other application programs. Different communication managers may offer different communication
paradigms, such as remote procedure call or peer-to-peer message passing.
A communication manager also has an interface
to its local transaction manager. It uses this interface to tell the transaction manager when a transaction has spread to a new node and to obtain
information about transaction commitment, which
it exchanges with communication managers on
remote nodes.
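The three groups of transaction manager operations described above can be summarized in a header-style sketch. Every name and signature below is hypothetical; the point is only to show which party (application program, resource manager, or communication manager) calls which group of operations.

```c
/* All names and signatures below are illustrative assumptions. */

typedef struct { unsigned char bytes[16]; } tx_id;   /* transaction identifier */

typedef enum { TX_PREPARED, TX_COMMITTED, TX_ABORTED } tx_state;

/* Demarcation: called by an application program (or by a resource manager
 * that starts a transaction on a caller's behalf). */
tx_id tm_start(void);
int   tm_end  (tx_id t);    /* request commitment via two-phase commit */
int   tm_abort(tx_id t);

/* Execution: called by a resource manager or communication manager to
 * declare itself a subordinate of an existing transaction. */
int tm_join(tx_id t, void *subordinate);

/* Two-phase commit: issued by the transaction manager to its subordinates,
 * which report back the state each of them has reached. */
int tm_request_prepare(tx_id t, void *subordinate);
int tm_request_commit (tx_id t, void *subordinate);
int tm_request_abort  (tx_id t, void *subordinate);
int tm_report_state   (tx_id t, void *subordinate, tx_state s);
```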
Presentation Manager
A presentation manager provides an application
program with a record-oriented interface to a presentation device. Its services are used by application programs, usually request initiators. By using
presentation manager services, instead of directly
accessing a presentation device, application programs become device independent.
A forms manager is one type of presentation
manager. Just as a database system supports operations to define, open, close, and access databases, a
forms manager supports operations to define,
enable, disable, and access forms. A form includes
the definition of the fields (with different
attributes) that make up the form. It also includes
services to map the fields into device-independent
application records, to perform data validation,
and to perform data conversion to map fields onto
device-specific frames.
One presentation manager is Digital's DECforms
forms management product. The DECforms product is the first implementation of the ANSI/ISO
Forms Interface Management System standard
(CODASYL FIMS).
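The record-oriented style of interaction can be sketched as follows. The function names and the account_form record are hypothetical, and DECforms' actual interface differs; the sketch only illustrates that the application exchanges device-independent records while the forms manager handles field layout, validation, and conversion.

```c
/* Hypothetical record-oriented presentation interface. */

struct account_form {          /* device-independent application record */
    char account_no[12];
    long amount_cents;
};

int enable_form (const char *form_name, const char *device);
int disable_form(const char *form_name);

/* Display the form and return validated, converted field values. */
int receive_record(const char *form_name, struct account_form *rec);

/* Map an application record back onto the form's fields for display. */
int send_record(const char *form_name, const struct account_form *rec);
```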
Request Manager
A request manager provides services to authenticate the source of requests (a user and/or a presentation device), to submit requests, and to receive
replies from the execution of requests. It supports
such operations as send-request and receive-reply.
Send-request must provide the identity of the
source device, the identity of the user who entered
the request, the identity of the application program to be invoked, and the input data to the
program.
A request manager can either pass the request
directly to an application program, or it can store
requests in a queue. In the latter case, another
request manager can subsequently schedule the
request by dequeuing the request and invoking an
application program. The ACMS System Interface is
an example of an existing request manager interface for direct requests. The ACMS Queued Transaction Initiator is an example of a request manager
that schedules queued requests.
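A sketch of these two operations, using hypothetical C structures and signatures, shows the items that send-request must carry: the source device, the submitting user, the application program to invoke, and the input data.

```c
#include <stddef.h>

/* Hypothetical request header; not an actual DECdta or ACMS structure. */
struct request_header {
    char device_id[32];      /* authenticated source device        */
    char user_id[32];        /* authenticated submitting user      */
    char application[32];    /* application program to be invoked  */
};

/* queued == 0: pass the request directly to an application program.
 * queued != 0: store it in a recoverable queue; another request manager
 * later dequeues it and invokes the application program. */
int send_request (const struct request_header *hdr,
                  const void *input, size_t input_len,
                  int queued);
int receive_reply(void *reply_buf, size_t maxlen);
```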
Transaction Manager Architecture
DECdta components are tied together by the transaction abstraction. Transactions allow application
programs, resource managers, request managers
(indirectly through queue resource managers), and
communication managers to interoperate reliably.
Since transactions play an especially important
role in the DECdta architecture, we describe the
transaction management functions in more detail.
The DECdta architecture includes interfaces
between transaction managers and application
programs, resource managers, and communication
managers, as shown in Figure 3. It also includes a
transaction manager protocol, whose messages are
propagated by communication managers. This protocol is used by Digital's DECdtm distributed transaction manager.
Figure 3  Transaction Manager Architecture
(The figure shows a transaction manager and its interfaces to an application program, resource managers, and communication managers, which in turn communicate with other communication managers on remote nodes.)
From a transaction manager's viewpoint, a transaction consists of transaction demarcation operations, transaction execution operations, two-phase
commit operations, and recovery operations.
The transaction demarcation operations are
issued by an application program to a transaction manager and include operations to start
and either end or abort a transaction.
Transaction execution operations are issued by
resource managers and communication managers to a transaction manager. They include
operations
- For a resource manager or communication
manager to join an existing transaction
- For a communication manager to tell a transaction manager to start a new branch of a
transaction that already exists at another node
Two-phase commit operations are issued by a
transaction manager to resource managers,
communication managers, and through communication managers to other transaction managers, and vice versa. They include operations
- For a transaction manager to ask a resource
manager or communication manager to prepare, commit, or abort a transaction
- For a resource manager or communication manager to tell a transaction manager
whether it has prepared, committed, or
aborted a transaction
- For a communication manager to ask a transaction manager to prepare, commit, or abort
a transaction
- For a transaction manager to tell a communication manager whether it has prepared,
committed, or aborted a transaction
Recovery operations are issued by a resource
manager to its transaction manager to determine the state of a transaction (i.e., committed
or aborted).
In response to a start operation invoked by an
application program, the transaction manager dispenses a unique transaction identifier for the transaction. The transaction manager that processes the
start operation is that transaction's home transaction manager.
When an application program invokes an operation supported by a resource manager, the
resource manager must find out the transaction
identifier of the application program's transaction.
This can happen in different ways. For example, the
application program may tag the operation with
the transaction identifier, or the resource manager
may look up the transaction identifier in the application program's context. When a resource manager receives its first operation on behalf of a
transaction, T, it must join T, meaning that it must
tell a transaction manager that it is a subordinate
for T. Alternatively, the DECdta architecture supports a model in which a resource manager may ask
to be joined automatically to all transactions managed by its transaction manager, rather than asking
to join each transaction separately.
A transaction, T, spreads from one node, Node 1,
to another node, Node 2, by sending a message
(through a communication manager) from an application program that is executing T at Node 1 to
an application program at Node 2. When T sends
a message from Node 1 to Node 2 for the first
time, the communication managers at Node 1 and
Node 2 must perform branch registration. This
function may be performed automatically by the
communication managers. Or, it may be done manually by the application program, which tells the
communication managers at Node 1 and Node 2
that the transaction has spread to Node 2. In either
case, the result is as follows: the communication
manager at Node 1 becomes the subordinate of the
transaction manager at Node 1 for T and the superior of the communication manager at Node 2
for T; and the communication manager at Node 2
becomes the superior of the transaction manager
at Node 2 for T. This arrangement allows the commit protocol between transaction managers to be
propagated properly by communication managers.
After the transaction is done with its application
work, the application program that started transaction T may invoke an "end" operation at the home
transaction manager to commit T. This causes the
home transaction manager to ask its subordinate
resource managers and communication managers
to try to commit T. The transaction manager does
this by using a two-phase commit protocol. The
protocol ensures that either all subordinate
resource managers commit the transaction or they
all abort the transaction.
In phase 1, the home transaction manager asks
its subordinates for T toprepctre T. A subordinate
p e p a r c s T by doing what is ncccssary t o guarantee
that it can either commit I ' o r abort'rifasked t o d o
s o by its superior; this guarantee is valid even i f
it fails immediately after becoming prepared. To
prepare T,
Each subortlinate for 'I' recursjvely propagates
the prepare request to its subordinates for T
Each resource manager subordinate writes all of
'T's upclatcs to stable storage
Each resource manager and transaction manager
subordinate writes a prepare-record t o stable
storage
A subordinate for T replies with a "yes" vote if
and w h e n i t has completetl its stable writes and all
of its subordinates for T have voted "yes"; otherwise, it votes "no." If any subordinate for T does not
acknowledge t h e request t o prepare within t h e
timeout period, then t h e h o m e transaction manager aborts T; t h e effect is t h e same as issuing an
abort operation.
In phase 2, when the home transaction manager has received "yes" votes from all of its subordinates for T, it decides to commit T. It writes a commit record for T to stable storage and tells its subordinates for T to commit T. Each subordinate for T writes a commit record for T to stable storage and recursively propagates the commit request to its subordinates for T. A subordinate for T replies with an acknowledgment if and when it has committed the transaction (in the case of a resource manager subordinate) and has received acknowledgments from all subordinates for T. When the home transaction manager receives acknowledgments from all of its subordinates for T, the transaction commitment is complete.
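A minimal sketch of the protocol just described follows, written in Python and assuming that each subordinate object exposes prepare, commit, and abort operations and that the home transaction manager has a stable log. It omits timeouts, the recursive propagation through communication managers, and the optimizations discussed below.

# Minimal two-phase commit sketch (home transaction manager's view).
# Subordinate objects and the log interface are assumptions for illustration.

def two_phase_commit(home_log, subordinates, txn_id):
    # Phase 1: collect votes; a subordinate votes "yes" only after its
    # updates and prepare record are on stable storage.
    votes = []
    for sub in subordinates:
        try:
            votes.append(sub.prepare(txn_id))
        except Exception:
            votes.append("no")

    if all(vote == "yes" for vote in votes):
        # Phase 2: make the decision durable, then announce it.
        home_log.write(("commit", txn_id))
        for sub in subordinates:
            sub.commit(txn_id)      # each subordinate acks after its own commit record
        return "committed"

    home_log.write(("abort", txn_id))
    for sub in subordinates:
        sub.abort(txn_id)
    return "aborted"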
To recover from a failure, all resource managers that participated in a transaction must examine their logs on stable storage to determine what to do. If the log contains a commit or abort record for T, then T completed. No action is required. If the log contains no prepare, commit, or abort record for T, then T was active. T must be aborted. If the log contains a prepare record for T, but no commit or abort record for T, T was between phases 1 and 2. The resource manager must ask its superior transaction manager whether to commit or abort the transaction.
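These three cases reduce to a small decision function; the log record layout below is assumed for illustration.

# Sketch of the restart decision a resource manager makes from its stable log.
# Log records are assumed to be (kind, txn_id) pairs.

def recovery_action(log_records, txn_id):
    kinds = {kind for kind, t in log_records if t == txn_id}
    if "commit" in kinds or "abort" in kinds:
        return "nothing to do"        # the transaction completed
    if "prepare" not in kinds:
        return "abort"                # the transaction was still active
    return "ask superior transaction manager"   # prepared, outcome unknown here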
An inherent problem in all two-phase commit protocols is that a resource manager is blocked between phases 1 and 2, that is, after voting "yes" and before receiving the commit or abort decision. It cannot commit or abort the transaction until the transaction manager tells it which to do. If its transaction manager fails, the resource manager may be blocked indefinitely, until either the transaction manager recovers or an external agent, such as a system manager, steps in to tell the resource manager whether to commit or abort.

A transaction T may spontaneously abort due to system errors at any time during its execution. Or, an application program (prior to completing its work) or a resource manager (prior to voting "yes") may tell its transaction manager to abort T. In either case, the transaction manager then tells all of its subordinates for T to undo the effects of T's resource manager operations. Subordinate resource managers abort T, and subordinate communication managers recursively propagate the abort request to their subordinates for T.

The two-phase commit protocol is optimized for those cases in which the number of messages exchanged can be reduced below that of the general case (e.g., if there is only one subordinate resource manager, if a resource manager did not modify resources, or if the presumed-abort protocol was used to save acknowledgments).
Acknowledgments
This architecture grew from discussions with many colleagues. We thank them all for their help, especially Dieter Gawlick, Bill Laing, Dave Lomet, Bruce Mann, Barry Rubinson, Diogenes Torres, and the TP architecture group, including Edward Braginsky, Tony DellaFera, George Gajnak, Per Gyllstrom, and Yoav Raz.

References

1. T. Speer and M. Storm, "Digital's Transaction Processing Monitors," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 18-32.

2. W. Laing, J. Johnson, and R. Landau, "Transaction Management Support in the VMS Operating System Kernel," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 33-44.

3. P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems (Reading, MA: Addison-Wesley, 1987).

4. P. Bernstein, M. Hsu, and B. Mann, "Implementing Recoverable Requests Using Queues," Proceedings of the 1990 ACM SIGMOD Conference on Management of Data (May 1990).

5. FIMS Journal of Development (Norfolk, VA: CODASYL FIMS Committee, July 1990).

6. C. Mohan, B. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Transactions on Database Systems, vol. 11, no. 4 (December 1986).

Summary
We have presented an overview of the DECdta architecture. As part of this overview, we introduced the components and explained the function of each interface. We also described the DECdta transaction management architecture in some detail. Over time, many interfaces of the DECdta model will be made public via product offerings or architecture publications.
Thomas G. Speer
Mark W. Storm

Digital's Transaction Processing Monitors

Digital provides two transaction processing (TP) monitor products - ACMS (Application Control and Management System) and DECintact (Integrated Application Control). Each monitor is a unified set of transaction processing services for the application environment. These services are layered on the VMS operating system. Although there is a large functional overlap between the two, both products achieve similar goals by means of some significantly different implementation strategies. Flow control and multithreading in the ACMS monitor is managed by means of a fourth-generation language (4GL) task definition language. Flow control and multithreading in the DECintact monitor is managed at the application level by third-generation language (3GL) calls to a library of services. The ACMS monitor supports a deferred task model of queuing, and the DECintact monitor supports a message-based model. Over time, the persistent distinguishing feature between the two monitors will be their different application programming interfaces.
Transaction processing is the execution of an application that performs an administrative function by accessing a shared database. Within transaction processing, TP monitors provide the software "glue" that ties together many software components into a transaction processing system solution.

A typical transaction processing application involves interaction with many terminal users by means of a presentation manager or forms system to collect user requests. Information gathered by the presentation manager is then used to query or update one or more databases that reflect the current state of the business. A characteristic of transaction processing systems and applications is many users performing a small number of similar functions against a common database. A transaction processing monitor is a system environment that supports the efficient development, execution, and management of such applications.

TP monitors are usually built on top of, or as extensions to, the operating system and other products such as database systems and presentation services. By so doing, additional components can be integrated into a system and can fill "holes" by providing functions that are specifically needed by transaction processing applications. Some examples of these functions are application control and management, transaction-processing-specific execution environments, and transaction-processing-specific programming interfaces.
Digital provides two transaction processing monitors: the Application Control and Management System (ACMS) and the DECintact monitor. Both monitors are built on top of the VMS operating system. Each monitor provides a unified set of transaction-processing-specific services to the application environment, and a large functional overlap exists between the services each monitor provides. The distinguishing factor between the two monitors is in the area of application programming styles and interfaces - fourth-generation language (4GL) versus third-generation language (3GL). This distinction represents Digital's recognition that customers have their own styles of application programming. Those that prefer 4GL styles should be able to build transaction processing applications using Digital's TP monitors without changing their style. Similarly, those that prefer 3GL styles should also be able to build TP applications using Digital's TP monitors without changing their style.

The ACMS monitor was first introduced by Digital in 1984. The ACMS monitor addresses the requirements of large, complex transaction processing applications by making them easier to develop and manage. The ACMS monitor also creates an efficient execution environment for these applications.
The DECintact monitor (Integrated Application Control) was originally developed by a third-party vendor. Purchased and introduced by Digital in 1988, it has been installed in major financial institutions and manufacturing sites. The DECintact monitor includes its own presentation manager, support for DECforms, a recoverable queuing subsystem, a transaction manager, and a resource manager that provides its own recovery of RMS (Record Management Services) files.

This paper highlights the important similarities and differences of the ACMS and DECintact monitors in terms of goals and implementation strategies.
Development Environment
Transaction processing monitors provide a view of the transaction processing system for application development. Therefore, the ACMS and DECintact monitors must embody a style of program development.

ACMS Programming Style
A "divide and conquer" approach was used in the ACMS monitor. The work typically involved in developing a TP application was divided into logically separate functions described below. Each of these functions was then "conquered" by a special utility or approach.

In the ACMS monitor, an "application" is defined as a collection of selectable units of work called tasks. A separate application definition facility isolates the system management characteristics of the application (such as resource allocation, file location, and protection) from the logic of the application.

The specification of menus is also decoupled from the application. A nonprocedural (4GL) method of defining menu layouts is used in which the layouts are compiled into form files and data structures to be used at run-time. Each menu entry points either to another menu or to an application and a task. (Decoupling menus from the application allows user menus to be independent of how the tasks are grouped into applications.)

In addition to separate menu specification and system management characteristics, the application logic is broken down into the three logical parts of interactive TP applications:

- Exchange steps support the exchange of data with the end user. This exchange is typically accomplished by displaying a form on a terminal screen and collecting the input.
- Processing steps perform computational processing and database or file I/O through standard subroutines. The subroutines are written in any language that accepts records passed by reference.

- The task definition language defines the flow of control between processing steps and exchange steps and specifies transaction demarcation.
Work spaces are special records that the ACMS monitor provides to pass data between the task definition, exchange steps, and processing steps.

A compiler, called the application definition utility (ADU), is implemented in the ACMS monitor to compile the task definition language into binary data structures. The run-time system is table-driven by these structures, rather than interpreted.
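The following Python sketch is not ACMS task definition language; it only illustrates, under assumed names, how a table produced by the ADU could drive the alternation of exchange and processing steps while a work space record is passed between them.

# Illustrative sketch only (hypothetical names, not ACMS syntax): a compiled
# step table drives exchange and processing steps, sharing a work space.

def run_task(step_table, workspace, forms, procedure_server):
    for kind, name in step_table:
        if kind == "exchange":
            # display a form, merge the collected fields into the work space
            workspace.update(forms.display(name, workspace))
        elif kind == "processing":
            # call a user-written subroutine, passing the work space record
            procedure_server.call(name, workspace)

# Hypothetical step table for a simple order-entry task:
order_entry_steps = [
    ("exchange", "GET_ORDER_FORM"),
    ("processing", "VALIDATE_AND_STORE_ORDER"),
    ("exchange", "SHOW_CONFIRMATION"),
]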
Digital is the only vendor that supplies this "divide and conquer" solution to building large, complex TP applications. We believe this approach - unique in the industry - reduces complexity, thus making applications easier to produce and to manage.
DECintact Programming Style
The approach to application development used in the DECintact monitor provides the application developer with 3GL control over the transaction processing services required. This approach allows application prototyping and development to be done rapidly. Moreover, the application can make the most efficient use of monitor services by selecting and controlling only those services required for a particular task.

In the DECintact monitor, an application is defined as one or more programs written entirely in 3GL and supported by the VMS system. The code written by the application developer manages all flow control, user interaction, and data manipulation through the utilities and service libraries provided by the DECintact monitor. All DECintact services are callable, including most services provided by the DECintact utilities. The DECintact services are as follows:

- A library of presentation services used for all interaction with users. The application developer includes calls to these services for form manipulation and display. Forms are created with a forms editor utility and can be updated dynamically. Forms are displayed by the DECintact terminal manager in emulated block mode. Device- and terminal-dependent information is completely separated from the implementation of the application.

- The separation of the specification of menus from the application. DECintact menus are defined by means of a menu database and are compiled into data structures accessed at run-time. The menus are tree-structured. Each entry points either to another menu entry or to an executable application image. The specification of menus is linked to the DECintact monitor's security subsystem. The DECintact terminal user sees only those specific menu entries for which the user has been granted access.

- A library of services for the control of file and queue operations. In addition to layered access to the RMS file system, the DECintact monitor supports its own hash file format (a functional analog to single-keyed indexed files in RMS), which provides very fast, efficient record retrieval. The application developer includes calls to these services for managing RMS and hash file I/O operations, demarcating recovery unit boundaries, creating queues, placing data items on queues, and removing data items from queues. The queuing subsystem is typically an integral part of application design and work flow control. Application-defined DECintact recovery units ensure that RMS, hash, and queue operations can be committed or aborted atomically; that is, either all permanent effects of the recovery unit happen, or none happen (a sketch of this idea follows the list).
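As a sketch of the recovery-unit idea only, the calls below are hypothetical and are not the actual DECintact service names; they show file updates and queue operations grouped so that either all take effect or none do.

# Hypothetical illustration of a DECintact-style recovery unit; the service
# names are invented for the sketch and are not the real callable interfaces.

def move_item_to_transmit_queue(files, queues, item):
    ru = files.start_recovery_unit()                 # assumed call
    try:
        files.update("ACCOUNTS", key=item.account, delta=item.amount)
        queues.remove("VERIFY_AND_REPAIR", item.key)
        queues.insert("FEDWIRE_XMT", item)
        ru.commit()     # all RMS, hash, and queue effects become permanent together
    except Exception:
        ru.abort()      # none of the effects remain
        raise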
Because of DECintact's 3GL development environment, application programmers who are accustomed to calling procedure libraries from standard VMS languages or who are familiar with other transaction processing monitors can easily learn DECintact's services. Application prototypes can be produced quickly because only skills in 3GL are required. Further, completed applications can be produced quickly because training time is minimal.

On-line Execution Environment
Transaction processing monitors provide an execution environment tailored to the characteristics and needs of transaction processing applications. This environment generally has two aspects: on-line, for interactive applications that use terminals; and off-line, for noninteractive applications that use other devices.

Traditional VMS timesharing applications are implemented by allocating one VMS process to each terminal user when the user logs in to the system. An image activation is then done each time the terminal user invokes a new function. This method is most beneficial in simple transaction processing applications that have a relatively small number of users. However, as the number of users grows or as the application becomes larger and more complex, several problem areas may arise with this method:

- Resource use. As the number of processes grows, more and more memory is needed to run the system effectively.

- Start-up costs. Process creation, image activation, file opens, and database binds are expensive operations in terms of system resources utilized and time elapsed. These operations can degrade system performance if done frequently.

- Contention. As the number of users simultaneously accessing a database or file grows, contention for locks also increases. For many applications, lock contention is a significant factor in throughput.

- Processing location. Single-process implementations limit distribution options.
ACMS On-line Execution
To address the problems listed above, Digital implemented a client/server architecture in the ACMS monitor. (Client/server is also called request/response.) The basic run-time architecture consists of three types of processes, as shown in Figure 1: the command process, the execution controller, and procedure servers.

An agent in the ACMS monitor is a process that submits work requests to an application. In the ACMS system, the command process is a special agent responsible for interactions with the terminal user. (In terms of the DECdta architecture, the command process implements the functions of a request initiator, presentation manager, and request manager for direct requests.)¹ The command process is generally created at system start-up time, although ACMS commands allow it to be started at other times. The process is multithreaded through the use of VMS asynchronous system traps (ASTs). Thus, one command process per node is generally sufficient for all terminals handled by that node.
There are two subcomponents of the ACMS monitor within the command process:

- The system interface, which is a set of services for submitting work requests and for interacting with the ACMS application

- DECforms, Digital's forms management product, which implements the ANSI/ISO Forms Interface Management System (FIMS) and provides the presentation server for executing the exchange steps
Figure 1  Basic Run-time Architecture of the ACMS Monitor
The command process reads the menu definition for a particular terminal user and determines which menu to display. When the terminal user selects a particular menu entry, the command process calls the ACMS system interface services to submit the task. The system interface uses logical names from the VMS system to translate the application name into the address of the execution controller that represents that application. The system interface then sends a message to the execution controller. The message contains the locations of the presentation server and an index into the task definition tables for the particular task. The status of the task is returned in the response. During the course of task execution, the command process accepts callbacks from the task to display a form for interaction with the terminal user.

The execution controller executes the task definition language and creates and manages procedure servers. The controller is created at application start-up time and is multithreaded by using VMS ASTs. There is one execution controller per application. (In terms of the DECdta architecture, the execution controller and the procedure servers implement the functions of a transaction server.)¹
When the execution controller receives a request from the command process, it invokes DECdtm (Digital Distributed Transaction Manager) services to join the transaction if the agent passes the transaction identifier. If the agent does not pass a transaction identifier, there is no transaction to join, and a DECdtm or resource-manager-specific transaction is started as specified in the task definition. The execution controller then uses the task index to find the tables that represent the task. When the execution of a task reaches an exchange step, the execution controller sends a callback to the command process for a form to be displayed and the input to be collected for the task. When the request to display a form is sent to the command process, the execution controller dismisses the AST to enable other threads to execute. When the response to the request arrives from the exchange step, an AST is added to the queue for the execution controller.

When a task comes to a processing step, the execution controller allocates a free procedure server to the task. It then sends a request to the procedure server to execute the particular procedure and dismisses the AST. If no procedure server is free, the execution controller puts the request on a waiting list and dismisses the AST. When a procedure server becomes free, the execution controller checks the wait list and allocates the procedure server to the next task, if any, on the wait list.
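A small sketch of this allocation policy follows, with an assumed free list and wait list; it is illustrative only and is not the ACMS implementation.

# Sketch of procedure server allocation in the execution controller
# (illustrative data structures, not the ACMS implementation).

from collections import deque

class ExecutionController:
    def __init__(self, servers):
        self.free_servers = deque(servers)
        self.wait_list = deque()

    def run_processing_step(self, task):
        if self.free_servers:
            self.free_servers.popleft().execute(task)   # request sent; AST dismissed
        else:
            self.wait_list.append(task)                 # no free server; task waits

    def server_finished(self, server):
        if self.wait_list:
            server.execute(self.wait_list.popleft())    # hand the server to the next waiter
        else:
            self.free_servers.append(server)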
Procedure servers are created and deleted by the execution controller. Procedure servers are a collection of user-written procedures that perform computation and provide database or file accesses for the application. The procedures are written in standard languages and use no special services. The ACMS system creates a transfer vector from the server definition. This transfer vector is linked into the server image. With this vector, the ACMS system code can receive incoming messages and translate them into calls to the procedure.

A procedure server is specified with initialization and termination procedures, which are routines supplied by the user. The ACMS monitor calls these procedures whenever a procedure server is created and deleted. The initialization procedure opens files and performs database bind operations. The termination procedure does clean-up work, such as closing files prior to process exit.

The ACMS architecture addresses the problem areas discussed in the On-line Execution Environment section in several ways.
Resource Use  Because procedure servers are allocated only for the time required to execute a processing step, the servers are available for other use while a terminal user types in data for the form. Thus, the system can execute efficiently with fewer procedure servers than active terminal users. Improvement gains in resource use can vary, depending on the application. Our debit and credit benchmark experiments with the ACMS monitor and the Rdb/VMS relational database system indicated that the most improvement occurs with one procedure server for every one or two transactions per second (TPS). These benchmarks equate to 1 procedure server for every 10 to 20 active terminal users.

The use of procedure servers and the multithreaded character of the execution controller and the command process allow the architecture to reduce the number of processes and, therefore, the number of resources needed. The optimal solution for resource use would consist of one large multithreaded process that performed all processing. However, we chose to trade off some resource use in the architecture in favor of other gains.

- Ease of use - Multithreaded applications are generally more difficult to code than single-threaded applications. For this reason, procedure server subroutines in the ACMS system can be written in a standard fashion by using standard calls to Rdb/VMS and the VMS system.

- Error isolation - In one large multithreaded process, the threads are not completely protected within the process. An application logic error in one thread can corrupt data in a thread that is executing for a different user. A severe error in one thread could potentially bring down the entire application. The multithreaded processes in the ACMS architecture (i.e., the execution controller and command process) are provided by Digital. Because no application code executes directly in these processes, we can guarantee that no application coding error can affect them. Procedure servers are single-threaded. Therefore, an application logic error in a procedure server is isolated to affect only the task that is executing in the procedure server.

Start-up Costs  The run-time environment is basically "static," which means that the start-up costs (i.e., system resources and elapsed time) are incurred infrequently (i.e., at system and application start-up time). A timesharing user who is running many different applications causes image activations and rundowns by switching among images. Because the terminal user in the ACMS system is separated from the applications processes, the process of switching applications involves only changing message destinations and incurs minimal overhead.
Contention  The database accesses in the ACMS environment are channeled through a relatively few, but heavily used, processes. The typical VMS timesharing environment uses a large number of lightly used processes. By reducing the number of processes that access the database, the contention for locks is reduced.
Processing Location  Because the ACMS monitor is a multiprocess architecture, the command process and forms processing can be done close to the terminal user on small, inexpensive machines. This method takes advantage of the inexpensive processing power available on these smaller machines while the rest of the application executes on a larger VAXcluster system.
DECintact On-line Execution
Although the specific components of the DECintact monitor vary from those of the ACMS monitor, the basic architecture is very similar. Figure 2 shows the application configured locally to the front end. The run-time architecture consists of three types of DECintact system processes - the terminal manager/dispatcher, DECforms servers, and the server manager - and, typically, one or more application processes. When forms processing is distributed, the same application is configured as shown in Figure 3.

The DECintact monitor can run in multiple copies on any one VAX node. Each copy can be an independent run-time environment; or it can share data and resources, such as user security profiles and menu definitions, with other copies on the same system. Thus, independent development, testing, and production environments can reside on the same node.

In the DECintact system, the terminal manager/dispatcher process (one per copy) is responsible for the following:

- Displaying DECintact forms

- Coordinating DECforms forms display

- Interacting with local applications

- Communicating, through DECnet, with remote DECintact copies

- Maintaining security authorization, including the dynamic generation of user-specific menus

Applications designated in the local menu database as remote applications cause the front-end terminal manager/dispatcher process to communicate with the cooperating back-end terminal manager/dispatcher process through a task-to-task DECnet link. (In terms of the DECdta architecture, the terminal manager/dispatcher implements the functions of presentation manager, request initiator, and request manager for direct requests.)¹ When a user selects the remote task, that user's request is sent to the back end and is treated by the application as a local request. The terminal manager/dispatcher process is started automatically as part of a copy start-up and is multithreaded. Therefore, one such process can handle all the terminal users for a particular DECintact copy.
When the terminal user selects a menu task, one
of the following actions occurs, depending on
whether the task is local or remote and whether it
is single- or multithreaded.
If the application is local and single-threaded, a
VMS process may be created that activates the
application image associated with this task. The
terminal manager/dispatcher, upon start up, may
create a user-specified number of application shell
VMS processes to activate subsequent application
images. If such a shell exists when the user selects
a task, this process is used to run the application
image. Each user who selects a given menu entry
receives an individual VMS process and image.
Figure 2  Basic Run-time Architecture of the DECintact Monitor

Figure 3  DECintact Basic Architecture with Distributed Forms Processing
If the application is local and multithreaded, the terminal manager/dispatcher first determines whether this task has already been activated by previous users. If the task has not been activated and a shell is not available, the terminal manager/dispatcher creates a VMS process for the application and activates the image. If the task is already activated, the terminal manager/dispatcher connects the user to the active task. The user becomes another thread of execution within the image. Multithreaded applications handle many simultaneous users within the context of one VMS process and image.
Remote applications, whether single- or multithreaded, route the menu task selection to a remote terminal manager/dispatcher process. On receipt of the request, the remote terminal manager/dispatcher processes the selection locally by using the same procedures as described above.

Local DECintact forms interaction is handled in the following manner by the local terminal manager/dispatcher. The application's call to display a form sends a request to the terminal manager. The terminal manager locates the form in its database of active forms, displays the form on the user's terminal, and returns control to the application when the user has entered all data in the form. If the application is remote, form information is sent between cooperating local and remote terminal manager processes; the interface is transparent to the application.
In addition to supporting DECintact forms, the DECintact monitor also supports applications that use DECforms as their presentation service. The implementation of this support follows the same client/server model used by the ACMS system's support for DECforms and shares much of the underlying run-time interprocess communication code used by the ACMS monitor. Functionally, the two implementations of DECforms support are also similar. Both implementations offer transparent support for distributed DECforms processing, automatic forms caching (i.e., propagation of updated DECforms in a distributed environment), and DECforms session caching for increased performance.
The DECintact monitor supports application-level, single- and multithreaded environments. The DECintact monitor's threading package allows application programmers to use standard languages supported by the VMS system to write multithreaded processes. Applications declare themselves as either single- or multithreaded. With the exception of the declaration, there is little difference between the way an on-line multithreaded application and its single-threaded counterpart must be coded. For on-line applications, thread creation, deletion, and management are automatic. New threads are created when a terminal user selects the multithreaded application and are deleted when the user leaves the application.

In a single-threaded application, the following occurs:

- Each user receives an individual VMS process and image context (e.g., 200 users, 200 processes).

- All terminal and file I/O is synchronous.

- The application image normally exits when the application work is completed.

In a multithreaded on-line application, the following occurs:

- One VMS process/image can handle many simultaneous users.

- All terminal and file I/O is asynchronous.

- New threads are created automatically when new users are connected to the process.

- The application image does not exit when all currently allocated threads have completed execution but remains for use by new on-line users.

For each thread in a multithreaded application image, the DECintact system maintains thread context and state information. Each I/O request is issued asynchronously. Immediately after control is returned, but before the I/O request completes, the DECintact system saves the currently executing thread's context and schedules another thread to execute. When the thread's I/O completion AST is delivered, the thread's context is restored, and the thread is inserted on an internally maintained list of threads eligible for execution.
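A conceptual model of that thread switch follows, using Python generators to stand in for DECintact threads: a thread "issues I/O" by yielding, its saved context is simply the suspended generator, and the simulated completion AST returns it to the ready list. This is illustrative only, not the actual implementation.

# Generator-based model of cooperative thread switching around asynchronous I/O.

import collections

class Scheduler:
    def __init__(self):
        self.ready = collections.deque()       # threads eligible for execution
        self.pending_io = collections.deque()  # threads waiting for I/O completion

    def spawn(self, thread):
        self.ready.append(thread)

    def run(self):
        while self.ready or self.pending_io:
            while self.ready:
                thread = self.ready.popleft()
                try:
                    next(thread)                     # run until the thread issues I/O
                    self.pending_io.append(thread)   # its context is the suspended generator
                except StopIteration:
                    pass                             # thread finished
            if self.pending_io:
                # simulate delivery of an I/O completion AST for the oldest request
                self.ready.append(self.pending_io.popleft())

def user_thread(name):
    print(name, "reading request")
    yield            # asynchronous I/O issued; control returns to the scheduler
    print(name, "request complete, updating file")
    yield
    print(name, "done")

sched = Scheduler()
sched.spawn(user_thread("thread-1"))
sched.spawn(user_thread("thread-2"))
sched.run()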
A thread's context consists of the following:

- An internally maintained thread block containing state information

- The stack

- Standard DECintact work spaces that are allocated to each thread and that maintain terminal and file management context

- Local storage (e.g., the $LOCAL PSECT in COBOL applications) that the application has designated as thread-specific

The PSECT naming convention allows the application to decide which variable storage is thread-specific and which is process-global. Thread-specific storage is unavailable to other threads in the same process because it is saved and restored on each thread switch. Process-global storage is always available to all threads in the process and can be used when interthread communication or synchronization is desired.

The use of multithreading in the DECintact system is appropriate for higher volume multiuser applications that perform frequent I/O. Such application usage is typical in transaction processing environments. Because thread switches occur only when I/O is requested or when locking requests are issued, this environment may not be recommended for applications that perform infrequent I/O or that expect very small numbers of concurrent users, such as end-of-day accounting programs or other batch-oriented processing. These kinds of applications typically choose to declare themselves as single-threaded.
All I/O from within a multithreaded DECintact application process is asynchronous. Therefore, the DECintact system provides a client/server interface between multithreaded applications and synchronous database systems, such as VAX DBMS (Database Management System) and Rdb/VMS systems. The interface is provided because calling a synchronous database operation directly from within a multithreaded application would stall the calling thread and all other threads until the call completed. Figure 2 shows that a typical on-line DECintact application accessing Rdb/VMS, for example, is written in two pieces (a sketch of this split follows the list):

- A multithreaded, on-line piece (the client) that handles forms requests from multiple users

- A single-threaded, database server piece (a server instance) that performs the actual synchronous database I/O
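The sketch below illustrates that split in Python, with an assumed queue-based hand-off standing in for the monitor's message interface; it is not the DECintact API. The synchronous database call blocks only the server instance, never the on-line threads.

# Illustrative client/server split for database access (assumed interfaces).

import queue
import threading

work = queue.Queue()

def database_server_instance():
    """Single-threaded server piece: performs the actual synchronous database I/O."""
    while True:
        request, reply = work.get()
        if request is None:
            break
        reply.put(f"stored {request}")     # stands in for a synchronous database call

def online_client_thread(user, request):
    """Multithreaded on-line piece: hands off the request; only this thread waits."""
    reply = queue.Queue()
    work.put((request, reply))
    print(user, "->", reply.get())

server = threading.Thread(target=database_server_instance, daemon=True)
server.start()
clients = [threading.Thread(target=online_client_thread, args=(f"user{i}", f"order-{i}"))
           for i in range(3)]
for c in clients:
    c.start()
for c in clients:
    c.join()
work.put((None, None))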
This client/server approach to database access is functionally very similar to that of ACMS procedure servers and offers similar benefits. Like the ACMS monitor, the DECintact monitor offers system management facilities to define pools of servers and to adjust them dynamically at run-time in accordance with load. Similar algorithms are used in both monitors to allocate server instances to client threads and to start up new instances, as necessary. The DECintact server code, like the ACMS procedure server code, can define initialization and termination procedures to perform once-only start-up and shut-down processing. With DECintact transaction semantics, which are layered on DECdtm services, a client can declare a global transaction that the server instance will join. The server instance can also declare its own independent transaction or no transaction. (In terms of the DECdta architecture, this client/server approach implements the functions of a transaction server.)¹ The principal difference between the DECintact and ACMS approaches is that DECintact clients and servers use a message-based 3GL communications interface to send and receive work requests. Control in the ACMS monitor resides in the execution controller.
As the ACMS monitor does, the DECintact architecture addresses the problem areas discussed in the On-line Execution Environment section in several ways. Also, as with the ACMS approach, the factors we chose to trade off allowed us to achieve better efficiency, performance, and ease of use.

Resource Use  The DECintact system's multithreaded methodology economizes on VMS resources. Similar to the method used in the ACMS monitor, the system reduces process creations and image activations. A major difference between the ACMS and DECintact architectures is the way the DECintact monitor implements multithreading support. The transparent implementation of threading capabilities means that coding multithreaded applications is no more difficult than coding traditional single-threaded applications. As with any application-level threading scheme, however, the responsibility for ensuring that a logic error in one thread is isolated to that thread lies with the application. The DECintact client/server facilities for accessing databases, like those used in the ACMS monitor, can realize similar benefits in process reuse, throughput, and error isolation.

Start-up Costs  The DECintact architecture, like the ACMS architecture, distributes start-up costs (i.e., system resources and elapsed time) between two points: the start of the DECintact system and the start of applications. System start-up can involve prestarting process shells (as discussed previously) for subsequent application image activation. On-line application start-up is executed on demand when the first user selects a particular menu task. Multithreaded applications, once started, do not exit but wait for new user threads as users select the application. Thus, the DECintact terminal user can switch between application images and incur only an inexpensive thread creation.
Contention  As in the ACMS monitor, database accesses in the DECintact client/server environment are channeled through a relatively few, but heavily used, processes rather than through a large number of lightly used processes. This reduction decreases lock contention.
Processing Location  Forms processing can be off-loaded to a front end and brought closer to the terminal user. Thus smaller, less expensive CPUs can be used while the rest of the application executes on a larger back-end machine or cluster. In the DECintact monitor, the front end can consist of forms processing only or a mix of forms processing and application remote queuing work.
Off-line Execution
Many transaction processing applications are used with nonterminal devices, such as a bar code reader or a communications link used for an electronic funds transfer application. Because there is no human interaction with these applications, they have two requirements that differ from the requirements of interactive applications: tasks must be simple data entries, and the system must handle failures transparently.

ACMS Off-line Execution
The ACMS monitor's goal for off-line processing is to allow simple transaction capture to continue when the application is not available. A typical example is the continued capture of data on a manufacturing assembly line by a MicroVAX system when the application is unavailable. The ACMS monitor provides two mechanisms for supporting nonterminal devices: queuing agents and user-written agents.
Figure 4 illustrates the ACMS queuing model. A queuing system is a resource manager that processes entries, with priorities, in first-in, first-out (FIFO) order. (In terms of DECdta, this is the queue resource manager.)¹ The ACMS queuing facility is built upon RMS-indexed files. The primary goal of ACMS queuing is to provide a store-and-forward mechanism to allow task requests to be collected for later execution. By using the ACMS$ENQUEUE_TASK service, a user can write a process that captures a task request and safely stores the task on a local disk queue.

Figure 4  ACMS Queuing Agents
The ACMS monitor provides a special agent, called the queued task initiator (QTI), which takes a task entry from the queue and submits it to the appropriate execution controller. The QTI starts a DECdtm transaction, removes the task entry from the queue within that transaction, invokes the ACMS task, and passes the transaction identifier. (In the DECdta architecture, the QTI implements the functions of a request manager for queued requests.)¹ The task then joins that transaction. The removal from the queue is atomic with the commit of the task, and no task entry is lost or executed twice.
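The QTI loop can be sketched as follows; the object interfaces are assumptions for illustration, not actual ACMS or DECdtm calls. The point is that the dequeue and the task execution are bracketed by one transaction, so an entry is neither lost nor executed twice.

# Sketch of a queued-task-initiator style loop (assumed interfaces).

def queued_task_initiator(task_queue, controller, transaction_manager):
    while True:
        txn = transaction_manager.start()
        entry = task_queue.remove_next(txn)          # removal is part of the transaction
        if entry is None:
            txn.abort()                              # nothing to do right now
            break
        try:
            controller.invoke_task(entry.task, txn)  # the task joins the same transaction
            txn.commit()                             # dequeue and task effects commit together
        except Exception:
            txn.abort()                              # the entry reappears on the queue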
Figure 5 shows the ACMS user-written agent model for off-line processing. With the ACMS system interface, users may write their own versions of the command process. Note that because task requests captured by these agents are not safely stored on disk, this method is generally not as reliable as using queues. User-written agents can be used, however, with DECdtm and the fault-tolerant VAXft 3000 system to produce a reliable front-end system. To do so, a user writes an agent that captures the input for the task and then starts a DECdtm transaction. The agent uses the system interface services to invoke the ACMS task and passes the transaction identifier and the input data. When the task call completes, the agent commits the transaction. If DECdtm returns an error on the commit, the agent loops back to start another transaction and to resubmit the task. If a VAXcluster system is used for the application, this configuration will survive any single point of failure.

Figure 5  ACMS User-written Agent Model for Off-line Processing
DECintact Off-line Execution
The DECintact monitor provides several facilities for applications to perform off-line processing. These facilities allow applications to

- Interface with and process data from nonterminal devices and asynchronous events

- Control transaction capture, store and forward, interprocess communication, and business work flow through the DECintact queuing subsystem

Off-line Multithreading  Off-line, multithreaded DECintact applications are typically used to service asynchronous events, such as the arrival of an electronic funds transfer message or the addition to the queue of an item already on a DECintact queue. The application programmer explicitly controls how many threads are created, when they are created, and which execution path or paths each thread will follow. Off-line, multithreaded applications are well-suited to message switching systems and other aspects of electronic funds transfer in which each thread may be dedicated to servicing a different kind of event.
DECintact Queues  The primary goal of the DECintact queuing subsystem is to support a work flow model of business transactions. (In the DECdta architecture, the DECintact queuing subsystem implements the functions of a queue resource manager and request initiator for queued requests.)¹ In a typical DECintact application that relies on queuing, the state of the business transaction may be represented by the queue on which a particular queue item resides at the moment. An item moves from queue to queue as the item's processing state changes, much as a work item moves from desk to desk. The superset of queue items that reside on queues throughout the application at any one time represents the state of transactions currently executing. Depending on the number of programs that need to process data during the course of a transaction, a queue item may be inserted on several different queues before the transaction completes. The application also may wish to chain together several small transactions within the context of a larger business transaction. The DECintact queuing system functions throughout the application: from the front end, where queues collect and route incoming data; to the back end, where queues can be integrated with data files in recovery units; and in between, where different programs in the application can use queues to share data.

The DECintact queuing subsystem consists of a comprehensive set of callable services for the creation and manipulation of queues, queue sets, and queue items. Queue item operations performed within the context of a DECintact transaction are fully atomic along with DECintact file operations. In addition to overall work flow control, the DECintact queuing system allows the following:

- Deferred processing - An item can be queued by one process and then removed from the queue later by another process for processing. Deferred processing is useful when the volume of data entry is concentrated at particular times of day; applications can assign themselves to one or more queues and can be notified when an item is inserted on the queue.

- Store-and-forward processing - When users at the front end of the system write items to local queues, data entry can be continuous in the event of back-end system failure or whenever a program that is needed to process data is temporarily unavailable.

- Interprocess communication - Locally between applications sharing a node, and by means of the DECintact remote queuing facility, applications can use the queuing system to reliably exchange application data between processes and applications.

A fundamental difference between ACMS queues and DECintact queues is that the ACMS system inserts tasks onto the queues, and the DECintact system inserts data items. In DECintact queuing, each data item contains both user-supplied data and a header that includes an item key and other control information. The header is used by the queuing system to control the movement of the item from queue to queue. Each queue item can be assigned an item priority. Items can be removed from the queue in FIFO order, in FIFO order within item priority, or by direct access using the item key. Queues can be stopped and started for insertion, removal, or both. Queues can also be redirected transparently at the system management level to running applications.
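The three removal orders can be illustrated with a small Python sketch; the data layout is assumed for the example and is not the DECintact implementation.

# Sketch of FIFO removal, FIFO within item priority, and direct access by key.

from itertools import count

class WorkQueue:
    def __init__(self):
        self._seq = count()          # insertion order, for FIFO behavior
        self._items = {}             # item key -> (priority, sequence, data)

    def insert(self, key, data, priority=0):
        self._items[key] = (priority, next(self._seq), data)

    def remove_fifo(self):
        key = min(self._items, key=lambda k: self._items[k][1])
        return key, self._items.pop(key)[2]

    def remove_fifo_within_priority(self):
        # highest priority first; FIFO among items of equal priority
        key = min(self._items, key=lambda k: (-self._items[k][0], self._items[k][1]))
        return key, self._items.pop(key)[2]

    def remove_by_key(self, key):
        return key, self._items.pop(key)[2]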
In the DECintact monitor, alert thresholds can be specified on a queue-by-queue basis to alert the system manager when queue levels reach defined amounts. Individual queue items can be held against removal or released. Queues can be grouped together into logical entities, called queue sets, which look and behave to the application the same as individual queues. Queue sets have added facilities for broadcast insertion on all members of a queue set and a choice of removal algorithms that can weight relative item- and queue-level priorities when removing items.
DECintact queues can be automatically distributed. At the system management level, a local queue can be designated as remote outbound. That is to say, items added to this queue are shipped transparently across the network to a corresponding remote inbound queue on the destination node. The transfer is handled by the DECintact queuing system by using exactly-once semantics (i.e., the item is guaranteed to be sent once and only once). From the point of view of the application that is adding or removing items from the queue, remote queues behave exactly as local queues behave.

To better understand some of the uses for DECintact queuing, consider a simplified but representative electronic funds transfer example built on the DECintact monitor. Figure 6 shows the elements of such an application. In this application, transactions might be initiated either locally by clerks entering data into the system from user-generated documents or by an off-line application that receives data from another branch or bank. The transactions are verified or repaired by other clerks in a different department of the bank. The transactions are then sent to destination banks over one or more network services.

To implement this application, the developer uses queues to route, safely store, and synchronize data as it progresses through the system, and to prioritize data items. Data items are given priority levels, based on application-defined criteria, such as transfer amount, destination bank, or time-to-closing.
Figure 6  Elements of a DECintact Electronic Funds Transfer
As illustrated in Figure 6, the terminal manager controls terminals for the Data Entry and Verify and Repair applications. Clerks enter data from user-generated documents on-line as complete messages. Verification and repair clerks receive these messages as work items from the verify and repair queue through the Verify and Repair application. The result of verification is either a validated message, which is ultimately sent to a destination bank, or an unverifiable message, which is routed to the supervisor queue for special handling. After special handling, the message rejoins the processing flow by returning to the verify and repair queue. After validation, the messages are inserted in the Fedwire Xmt queue and sent over the network to the Federal Reserve System. The Fedwire Process application controls the physical interface to the communication line and implements the Fedwire protocol. The validated messages are also used to update a local database by means of database server programs.
The Fedwire Xmt queue could be defined as a queue set, which would permit the Fedwire Process application to remove items from the queue by a number of algorithms that bias the transfer amount by queue and item priority. Similarly, this queue set could be passively reprioritized near the close of the business day. In other words, the DECintact system administrator could use the DECintact queue utility near the end of the day to change queue-wide priorities and ensure that items with a higher priority level in the queue set would be sent over the Fedwire first, without changing any application code.
Application Management
Typically, transaction processing applications are crucial to the business running the applications. If the applications cannot perform their functions reliably or securely, business activity may have to cease altogether or be curtailed, as in the case of an inventory control application or electronic funds processing application. Therefore, the applications require additional controls to ensure that the applications and the access by users to the applications are limited to exactly what is needed for the business.

ACMS Application Management
Of the many features and tools for monitoring and controlling the system offered in the ACMS monitor, three areas are most often used:

- Controlling and restricting terminal user environments

- Controlling and restricting the application

- The ability to dynamically make changes to the application without stopping work

In addition to using the VMS user authorization file (VMS SYSUAF), the ACMS monitor provides utilities to define which users and terminals have access to the ACMS system. Controlled terminals are terminals defined by one of these utilities to be owned by the ACMS monitor. These terminals are allocated by the ACMS monitor when the ACMS system is started. When a user presses the Return key, the ACMS monitor displays its login prompt. Unless the user has login access, the VMS system cannot be accessed. The user's access is restricted to only those ACMS functions that the user is permitted to invoke. This restriction prevents a user from damaging the integrity of data on the system. The ACMS monitor also allows access support for terminals that are automatically logged in to the ACMS system, such as a terminal on a shop floor. Such access is useful for unprivileged users who are not accustomed to computers. They can enter data without understanding the process for logging in to the system.

For application control, the ACMS monitor uses a protected directory, ACMS$DIRECTORY, to store the application definition files. The application authorization utility (AAU) ensures that special authorization is required for a user to make changes to an application.

In the ACMS monitor, the application is a single point of control. The ACMS/START APPLICATION and ACMS/STOP APPLICATION commands cause the execution controller for the application to be created and deleted. An operator can control the times when an application is accessible. For example, an application can be controlled to run only on Fridays or only between certain hours. The control of access times can also be used to restrict access while changes or repairs are made to the application. This type of access control is difficult to achieve with only the VMS system because the VMS system does not provide these capabilities.

The execution controller does access-control list checking that is specified for each task. This mechanism can restrict user access by function. For example, a user could have the privilege to make a particular update to a database but not have access to read or make changes to any other parts of that database. The execution controller achieves a much finer level of control than do the mechanisms of the VMS system or the database system.
DECintact Application Management
The DECintact monitor controls access to the whole
system and to individual tasks by means of a security subsystem. The subsystem adds transactionprocessing-specific features to basic VMS security.
User security profiles specify the DE<:intact user
name and password (DECintact users are not
required to have an entry in the VMS SYSUAF
file); levels of security entitlement; inclusive and
exclusive hours of permissible sign-on; menu
entries authorized for the user. Only one user
under a given DECintact user name can be signed
on to the DECintact system at any one time on
any one node.
Dedicated terminal security profiles are used, in
conjunction with user security profiles, to provide geographic entitlement.
CAPTIVE and INITIAL_MENU user attributes
restrict users to a specific menu level of functions and prevent the user from accessing outer
levels.
User-specific menus are menu entries for which an explicit authorization has been granted in the user profile and are the only menu items visible on the menu presented to terminal users. The DECintact monitor does include an exception for users who have an auditor privilege. Auditors can see all menu functions but must be specifically authorized to execute any single function.
The subsystem provides the ability to dynamically enable or disable specific menu functions.
Password revalidation is an attribute that can be
associated with a menu function. If set, the user
must reenter the DECintact user name and password before being allowed to access the function.
The DECintact monitor supports both controlled, or dedicated, terminals and terminals assigned LAT terminal server application ports, as does the ACMS
monitor. These terminals are owned by, and allocated to, the DECintact system. When a user types
any character at these terminals, a DECintact sign-on screen is displayed, and the user is prevented
from logging in to the VMS system.
Geographic entitlement limits certain DECintact
terminal-based functions to certain terminals or
even to certain users on certain terminals. The three
elements in geographic entitlement are as follows:
The user security profile enables a function to be accessed by a certain user.
The terminal security profile enables a function
to be accessed at a certain terminal.
A GEOG attribute is associated with a menu
entry in the terminal manager/dispatcher's
menu database. This attribute, when associated
with a function, demands that there be an applicable terminal security profile before the function can be accessed.
Normally, if a function is enabled in a user
profile, the user can access the function without
further checks. If the GEOG attribute is associated
with the function, however, that function must
be enabled in the user profile and in the terminal
profile before it can be accessed.
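The rule just stated lends itself to a compact check. The sketch below, written in Python rather than any DECintact interface, assumes hypothetical sets of enabled functions (user_profile, terminal_profile, geog_functions) and simply restates the GEOG logic.

def can_access(function, user_profile, terminal_profile, geog_functions):
    """Return True if a menu function may be invoked from this terminal.

    user_profile and terminal_profile are sets of enabled function names;
    geog_functions is the set of functions carrying the GEOG attribute.
    All names here are hypothetical, for illustration only.
    """
    if function not in user_profile:
        return False                # not authorized for this user at all
    if function in geog_functions:
        # GEOG demands an applicable terminal security profile as well
        return function in terminal_profile
    return True                     # user authorization alone is sufficient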
Geographic entitlement is frequently a requirement in financial environments, which have specific and rigid security protocols. For example, a bank officer may be authorized to execute certain sensitive functions, available only at dedicated terminals, when the officer is signed in at the home office. The same officer may be authorized to execute only a subset of less sensitive functions when signed in from a branch office. Such sensitive functions can be protected by requiring that both the user profile and the dedicated terminal profile enable the function.
Applications and resources are controlled
within the context of a DECintact copy's run-time
and management environment. Multiple copies
can be established on the same VMS system.
Different groups of users can maintain a certain
level of autonomy (e.g., separate applications and
data files), but all users can also share some or all
functions and resources of a given DECintact version. A typical example of this concept, that is, the
ability to create multiple DECintact copies for isolation and partitioning, is the common practice of
establishing development, acceptance testing, and
production DECintact environments. Managing
applications and resources within a development
environment, for example, can differ from managing applications and resources within a production
environment with a different system manager.
Access to menu functions is controlled by the
INTACT MANAGE DISABLE/ENABLE command. This
command removes or restores specified functions
dynamically from all menus in the DECintact copy and disables or enables their selection by subsequent users. (Current accessors of the specified function are allowed to complete the function.)

The execution of single- and multithreaded applications or DECintact system components can be shut down by the INTACT MANAGE SHUTDOWN command. This command issues a mailbox request to the application or component, which then initiates an orderly shutdown. Access to the system by inclusive and exclusive time of day is controlled on a per-user basis through the DECintact security subsystem. In addition to these commands and functions, the queuing subsystem is managed by means of a queue management utility. This utility creates and deletes queues and queue sets, modifies queue and queue set attributes, and performs all other functions necessary for managing the DECintact queuing subsystem.

In general, the DECintact monitor's security and application control focuses on the front end by concentrating access checking at the point of system sign-in and menu generation. The ACMS system concentrates more on the back-end parts of the system by means of VMS access control lists (ACLs) on specified tasks. The ACMS approach is built on VMS security and system access (the SYSUAF file) and reflects an environment in which the VMS system and the transaction processing security functions are typically performed by the same system management agency. The DECintact monitor's system access is handled more independently of the VMS system and reflects an environment in which transaction-processing-specific security functions may be performed by a different department from the one that manages the general VMS security system.

Conclusion

The ACMS and DECintact transaction processing monitors provide a unified set of transaction-processing-specific services to the application environment. A large functional overlap exists between the services each monitor provides. Where the functions provided by each monitor are identical or similar (e.g., client/server database access and support for DECforms), the factors that distinguish one from the other are primarily a result of the use of 4GL and 3GL application programming styles and interfaces. Where notable functional differences remain (as in each product's respective queuing or security systems), the differences are primarily ones of emphasis rather than functional incompatibility. The set of common features shared by both monitors has been growing with the latest releases of the ACMS and DECintact monitors. This external convergence has been fostered and made possible by an internal convergence, which is based on sharing the underlying code that supports the common features of each monitor. As more common features are introduced and enhanced in the DECtp system, the investment in applications built on either monitor can be protected and the distinctive programming styles of both can be preserved.

Reference

1. P. A. Bernstein, W. T. Emberton, and V. Trehan, "DECdta - Digital's Distributed Transaction Processing Architecture," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 10-17.
William A. Laing
James E. Johnson
Robert R. Landau

Transaction Management Support in the VMS Operating System Kernel
Distributed transaction management support is an enhancement to the VMS operating system. This support provides services in the VMS operating system for atomic transactions that may span multiple resource managers, such as those for flat files, network databases, and relational databases. These transactions may also be distributed across multiple nodes in a network, independent of the communications mechanisms used by either the application programs or the resource managers. The Digital distributed transaction manager (DECdtm) services implement an optimized variant of the two-phase commit protocol to ensure transaction atomicity. Additionally, these services take advantage of the unique VAXcluster capabilities to greatly reduce the potential for blocking that occurs with the traditional two-phase commit protocol. These features, now part of the VMS operating system, are readily available to multiple resource managers and to many applications outside the traditional transaction processing monitor environment.
Businesses are becoming critically dependent on
the availability and integrity of data stored on computer systems. As these businesses expand and
merge, they acquire ever greater amounts of on-line
data, often on disparate computer systems and often
in disparate databases. The Digital distributed transaction manager (DECdtm) services described in
this paper address the problem of integrating data
from multiple computer systems and multiple
databases while maintaining data integrity under
transaction control.
The DECdtm services are a set of transaction processing features embedded in the VMS operating
system. These services support distributed atomic
transactions and implement an optimized variant
of the well-known, two-phase commit protocol.
Design Goals
Our overall design goal was to provide base services
on which higher layers of software could be built.
This software would support reliable and robust
applications, while maintaining data integrity.
Many researchers report that an atomic transaction is a very powerful abstraction for building
robust applications that consistently update data.
Supporting such an abstraction makes it possible
both to respond to partial failures and to maintain
data consistency. Moreover, a simplifying abstraction is crucial when one is faced with the complexity of a distributed system.
With increasingly reliable hardware and the influx of more general-purpose, fault-tolerant systems, the focus on reliability has shifted from hardware to software. Recent discussions indicate that the key requirements for building systems with a 100-year mean time between failures may be (1) software-fault containment, using processes, and (2) software-fault masking, using process checkpointing and transactions.
It was clear that we could use transactions as a
pervasive technique to increase application availability and data consistency. Further, we saw that
this technique had merit in a general-purpose operating system that supports transaction processing,
as well as timesharing, office automation, and technical computing.
The design of DECdtm services also reflects several other Digital and VMS design strategies:
Pervasive availability and reliability. As organizations become increasingly dependent on their
information systems, the need for all applications to be universally available and highly reliable increases. Features that ensure application
availability and data integrity, such as journaling
and two-phase commit, must be available to all
applications, and not limited to those traditionally thought of as "transaction processing."
Operating environment consistency. Embedding
features in the operating system that are required
by a broad range of utilities ensures consistency
in two areas: first, in the functionality across all
layered software products, and, second, in the
interface for developers. For instance, if several
distributed database products require the two-phase commit protocol, incorporating the protocol into the underlying system allows programmers to focus on providing "value-added" features for their products instead of re-creating a common routine or protocol.
Flexibility and interoperability. Our vision
includes making DECdtm interfaces available to
any developer or customer, allowing a broad
range of software products to take advantage of
the VMS environment. Future DECdtm services
are also being designed to conform to de facto
and international standards for transaction processing, thereby ensuring that VMS applications
can interoperate with applications on other
vendors' systems.
Transaction Manager - Some Definitions
To grasp the concept of transaction manager, some
basic terms must first be understood:
Resource manager. A software entity that controls both the access and recovery of a resource.
For example, a database manager serves as the
resource manager for a database.
Transaction. The execution of a set of operations with the properties of atomicity, serializability, and durability on recoverable resources.
Atomicity. Either all the operations of a transaction complete, or the transaction has no effect
at all.
Serializability. All operations that executed for
the transaction must appear to execute serially,
with respect to every other transaction.
Durability. The effects of operations that executed on behalf of the transaction are resilient
to failures.
A transaction manager supports the transaction abstraction by providing the following services:

Demarcation operations to start, commit, and abort a transaction

Execution operations for resource managers to declare themselves part of a transaction and for transaction branch managers to declare the distribution of a transaction

Two-phase commit operations for resource managers and other transaction managers to change the transaction state (to either "preparing" or "committing") or to acknowledge receipt of a request to change state
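As a rough illustration of the three groups of services just listed, the following Python sketch models a transaction manager object. The class and method names are invented for this illustration and do not correspond to the DECdtm system service names.

class TransactionManager:
    """Toy model of the service groups a transaction manager exposes."""

    def __init__(self):
        self.transactions = {}          # tid -> list of registered participants

    # Demarcation operations
    def start(self, tid):
        self.transactions[tid] = []

    def commit(self, tid):
        return self._two_phase_commit(tid)

    def abort(self, tid):
        for p in self.transactions.pop(tid, []):
            p.rollback(tid)

    # Execution operations: participants declare themselves part of a transaction
    def join(self, tid, participant):
        self.transactions[tid].append(participant)

    # Two-phase commit operations: drive participants through the state changes
    def _two_phase_commit(self, tid):
        participants = self.transactions.pop(tid, [])
        if all(p.prepare(tid) for p in participants):   # phase one: "preparing"
            for p in participants:
                p.commit(tid)                           # phase two: "committing"
            return True
        for p in participants:
            p.rollback(tid)
        return False

Each participant here stands for a resource manager or subordinate transaction manager that can prepare, commit, or roll back its part of the work.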
Benefits of Embedding Transaction Semantics in the Kernel
Several benefits are achieved by embedding transaction semantics in the kernel of the VMS operating system. Briefly, these benefits include consistency, interoperability, and flexibility. Embedding transaction semantics in the kernel makes a set of services available to different environments and products in a consistent manner. As a consequence, interoperability between products is encouraged, as well as investment in the development of "value-added" features. The inherent flexibility allows a programmer to choose a transaction processing monitor, such as VAX ACMS, and to access multiple databases anywhere in the network. The programmer may also write an application that reads a VAX DBMS CODASYL database, updates an Rdb/VMS relational database, and writes report records to a sequential VAX RMS file, all in a single transaction. Because all database and transaction processing products use DECdtm services, a failure at any point in the transaction causes all updates to be backed out and the files to be restored to their original state.
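A hedged sketch of the usage pattern described above follows. The dtm handle and the three resource-manager objects are hypothetical stand-ins, not the actual DECdtm or layered-product interfaces; the point is only that every update is bracketed by one transaction.

def transfer_report(dtm, dbms, rdb, rms_file, order):
    """Update two databases and write a report record as one atomic unit."""
    tid = dtm.start_transaction()
    try:
        dbms.update_inventory(tid, order)     # CODASYL database update
        rdb.update_ledger(tid, order)         # relational database update
        rms_file.append_record(tid, order)    # sequential report record
        dtm.end_transaction(tid)              # all three commit together
    except Exception:
        dtm.abort_transaction(tid)            # a failure backs out every update
        raise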
Two-phase Commit Protocol
DECdtm services use an optimized variant of the
technique referred to as two-phase commit. The
technique is a member of the class of protocols
known as Atomic Commit Protocols. This class
guarantees two outcomes: first, a single yes or no
decision is reached among a distributed set of participants; and, second, this decision is consistently
propagated to all participants, regardless of subsequent machine or communications failures. This
guarantee is used in transaction processing to help
achieve the atomicity property of a transaction.
The basic two-phase commit protocol is straightforward and well known. It has been the subject of
considerable research and technical literature for
several years. The following section describes
in detail this general two-phase commit protocol
for those who wish to have more information on
the subject.
The Basic Two-phase Commit
Protocol
The two-phase commit protocol occurs between
two types of participants: one coordinator and one
or more subordinates. The coordinator must arrive
at a yes or no decision (typically called the "commit decision") and propagate that decision to all
subordinates, regardless of any ensuing failures.
Conversely, the subordinates must maintain certain guarantees (as described below) and must
defer to the coordinator for the result of the commit decision. As the name suggests, two-phase
commit occurs in two distinct phases, which the
coordinator drives.
In the first phase, called the prepare phase, the
coordinator issues "requests to prepare" to all subordinates. The subordinates then vote, either a "yes
vote" or a "veto." Implicit in a "yes vote" is the guarantee that the subordinate will neither commit nor
abort the transaction (decide yes or no) without an
explicit order from the coordinator. This guarantee
must be maintained despite any subsequent failures and usually requires the subordinate to place
sufficient data on disk (prior to the "yes vote") to
ensure that the operations can be either completed
or backed out.
The second phase, called the commit phase,
begins after the coordinator receives all expected
votes. Based on the subordinate votes, the coordinator decides to commit if there are no "veto"
votes; otherwise, it decides to abort. The coordinator propagates the decision to all subordinates as
either an "order to commit" or an "order to abort."
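Restated as code, the coordinator's role reduces to the small Python sketch below. The send_prepare and send_order methods are assumed helpers, and logging and failure handling are omitted; the surrounding discussion and Figure 1 add those details.

def coordinate(subordinates):
    """Drive one two-phase commit round and return the decision."""
    # Phase one: the prepare phase
    votes = [sub.send_prepare() for sub in subordinates]   # "yes" vote or "veto"

    # Phase two: the commit phase
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    for sub in subordinates:
        sub.send_order(decision)        # order to commit or order to abort
    return decision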
Figure 1  Simple Two-phase Commit Time Line (the coordinator receives END-TRANS from the application and issues requests to prepare; each subordinate force-writes a "prepare" record and returns a yes vote; at the commit point the coordinator force-writes a "commit" record, notifies the application, and issues orders to commit; subordinates lazily write "commit" records and reply "done," and the coordinator lazily writes a "forget" record)
Because the coordinator's decision must survive failures, a record of the decision is usually stored on disk before the orders are sent to the subordinates. When the subordinates complete processing, they send an acknowledgment back to the coordinator that they are "done." This allows the coordinator to reclaim disk storage from completed transactions. Figure 1 shows a time line of the two-phase commit sequence.

A subordinate node may also function as a superior (intermediate) node to follow-on subordinates. In such cases, there is a tree-structured relationship between the coordinator and the full set of subordinates. Intermediate nodes must propagate the messages down the tree and collect responses back up the tree. Figure 2 shows a time line for a two-phase commit sequence with an intermediate node.

Most of us have had direct contact with the two-phase commit protocol. It occurs in many activities. Consider the typical wedding ceremony as presented below, which is actually a very precise two-phase commit.
Figure 2  Three-node Two-phase Commit Time Line (as in Figure 1, but with an intermediate node that force-writes its own "prepare" record before returning a yes vote to the coordinator, relays the order to commit to its subordinate, and lazily writes its own "commit" and "forget" records)
Official: Will you, Mary, take John ...?
Bride: I will.
Official: Will you, John, take Mary ...?
Groom: I will.
Official: I now pronounce you man and wife.

The above dialog can be viewed as a two-phase commit:

Coordinator: Request to Prepare?
Participant 1: Yes Vote.
Coordinator: Request to Prepare?
Participant 2: Yes Vote.
Coordinator: Commit Decision. Order to Commit.

The basic two-phase commit protocol is straightforward, survives failures, and produces a single, consistent yes or no decision. However, this protocol is rarely used in commercial products. Optimizations are often applied to minimize message exchanges and physical disk writes. These optimizations are particularly important to the transaction processing market because the market is very performance sensitive, and two-phase commit occurs after the application is complete. Thus, two-phase commit is reasonably considered an added overhead cost. We have endeavored to reduce the cost in a number of ways, resulting in low overhead and a scalable protocol embodied in the DECdtm services. Some of the optimizations are described later in another section.

Components of the DECdtm Services

The DECdtm services were developed as three separate components: a transaction manager, a log manager, and a communication manager. Together, these components provide support for distributed transaction management. The transaction manager is the central component. The log manager services enable the transaction manager to store data on nonvolatile storage. The communication manager provides a location-independent interprocess communication service used by the transaction and log managers. Figure 3 shows the relationships among these components.

The Digital Distributed Transaction Manager

As the central component of the DECdtm services, the transaction manager is responsible for the application interface to the DECdtm services. This section presents the system services the transaction manager comprises.

The transaction coordinator is the core of the transaction manager. It implements the transaction state machine and knows which resource managers and subordinate transaction managers are involved in a transaction. The coordinator also controls what is written to nonvolatile storage and manages the volatile list of active transactions.

The user services are routines that implement the START_TRANSACTION, END_TRANSACTION, and ABORT_TRANSACTION transaction system services.
Figure 3  Components of the DECdtm Services (the transaction coordinator at the center, with resource manager services, a volatile registry of active transactions, a logging interface to the hardened registry, and an external interface to remote DECdtm systems)
They validate user parameters, dispense a transaction identifier, pass state transition requests to
the transaction coordinator, and return information about the transaction outcome.
The branch management services support the
creation and demarcation of branches in the distributed transaction tree. New branches are constructed when subordinate application programs
are invoked in a distributed environment. The services are called on to attach an application program to the transaction, to demarcate the work
done in that application as part of the transaction,
and finally to return information about the transaction outcome.
The resource manager services are routines that provide the interface between the DECdtm services and the cooperating resource managers. This interface allows resource managers to declare themselves to the transaction manager and to register their involvement in the "voting" stage of the two-phase commit process of a specific transaction.
Finally, the information services routines are
the interface that allows resource managers to
query and update transaction information stored
by DECdtm services. This information is stored
in either the volatile-active transaction list or the
nonvolatile transaction log. Resource managers
may resolve and possibly modify the state of
"in-doubt" transactions through these services.
The Log Manager
The log manager provides the transaction manager
with an interface for storing sufficient information
in nonvolatile storage to ensure that the outcome
of a transaction can be consistently resolved. This
interface is available to operating system components. The log manager also supports the creation, deletion, and general management of the transaction logs used by the transaction manager. An
additional utility enables operators to examine
transaction logs and, in extreme cases, makes it
possible to change the state of any transaction.
The Communication Manager
The communication manager provides a command/
response message-passing facility to the transaction manager and the log manager. The interface
is specifically designed to offer high-performance,
low-latency services to operating system components. The command/response, connection-oriented, message-passing system allows clients
to exchange messages. The clients may reside on
the same node, within the same cluster, or within
a homogeneous VMS wide area network. The communication manager also provides highly optimized
local (that is, intranode) and intracluster transports. In addition, this service component multiplexes communication links across a single, cached
DECnet virtual circuit to improve the performance
of creating and destroying wide area links.
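One way to picture the multiplexing of links over a cached circuit is the Python sketch below. The LinkMultiplexer class, its open_circuit callable, and the tagging scheme are assumptions made for illustration and are not the internal VMS interface.

class LinkMultiplexer:
    """Toy model: many logical command/response links share one cached circuit."""

    def __init__(self, open_circuit):
        self.open_circuit = open_circuit   # callable that opens a network circuit
        self.circuits = {}                 # remote node name -> shared circuit
        self.next_link = 0

    def create_link(self, node):
        # Reuse a cached circuit to the node instead of opening a new one per link.
        circuit = self.circuits.setdefault(node, self.open_circuit(node))
        self.next_link += 1
        return (node, self.next_link, circuit)

    def transceive(self, link, command):
        node, link_id, circuit = link
        circuit.send((link_id, command))   # tag traffic with the logical link id
        return circuit.receive(link_id)    # matching response for this link

The design point the sketch illustrates is that creating and destroying a logical link becomes cheap once the expensive wide-area circuit is shared and cached.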
Transaction Processing Model
Digital's transaction processing model entails the
cooperation of several distinct elements for correct
execution of a distributed transaction. These elements are (1) the application programmer, (2) the
resource managers, (3) the integration of the
DECdtm services into the VMS operating system,
(4) transaction trees, and (5) vote-gathering and
the final outcome.
Application Programmer
The application programmer must bracket a series of operations with START_TRANSACTION and END_TRANSACTION calls. This bracketing demarcates the unit of work that the system is to treat as a single atomic unit. The application programmer may call the DECdtm services to create the branches of the distributed transaction tree.
Resource Managers
Resource managers, such as VAX RMS, VAX Rdb/VMS,
and VAX DBMS, that access recoverable resources
during a transaction inform the DECdtm services of
their involvement in the transaction. The resource
managers can then participate in the voting phase
and react appropriately to the decision on the final
outcome of the transaction. Resource managers
must also provide recovery mechanisms to restore
resources they manage to a transaction-consistent
state in the event of a failure.
Integration in the Operating System
The DECdtm services are a basic component of the
VMS operating system. These services are responsible for maintaining the overall state of the distributed transaction and for ensuring that sufficient
information is recorded on stable storage. Such
information is essential in the event of a failure so
that resource managers can obtain a consistent
view of the outcome of transactions.
Each VMS node in a network normally contains
one transaction manager object. This object maintains a list of participants in transactions that are
active on the node. This list consists of resource
managers local to the node and the transaction
manager objects located on other nodes.
Transaction Trees
The node on which the transaction originated (that is, the node on which the START_TRANSACTION service was called) may be viewed as the "root" of a distributed transaction tree. The transaction manager object on this node is usually responsible for coordinating the commit phase of the transaction. The transaction tree grows as applications call on the branch management services of the transaction manager object.

The transaction identifier dispensed by the START_TRANSACTION service is an input parameter to the branch services. This parameter identifies two concerns for the local transaction manager object: (1) to which transaction tree the new branch should be added, and (2) which transaction manager object is the immediate superior in the tree. Resource managers join specific branches in a transaction tree by calling the resource manager services of the local transaction manager object.
Vote-gathering and the Final Outcome
When the "commit" phase of the transaction is
entered (triggered by an application call to
END-TRANSACTION), each transaction manager
object involved in the transaction must gather the
"votes" of the locally registered resource managers
and the subordinate transaction manager objects.
The results are forwarded to the coordinating transaction manager object.
The coordinating transaction manager object
eventually informs the locally registered resource
managers and the subordinate transaction manager
objects of the final outcome of the transaction. The
subordinate transaction manager objects, in turn,
propagate this information to locally registered
resource managers as well as to any subordinate
transaction manager objects.
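The recursive nature of vote-gathering in the transaction tree can be summarized as in the sketch below; local_resource_managers and subordinate_tms are hypothetical accessors on a transaction manager object, used only for this illustration.

def gather_votes(tm_object, tid):
    """Collect votes from local resource managers and subordinate transaction
    manager objects, and forward a single combined vote up the tree."""
    votes = [rm.prepare(tid) for rm in tm_object.local_resource_managers(tid)]
    votes += [gather_votes(sub, tid) for sub in tm_object.subordinate_tms(tid)]
    return "yes" if all(v == "yes" for v in votes) else "veto"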
General Optimizations

The DECdtm services use several previously published optimizations and extend those optimizations with a number that are unique to VAXcluster systems. In this section we present these general optimizations, a discussion of VAXcluster considerations, and two VAXcluster-specific optimizations.

Protocol Optimizations

The following sections describe some previously published optimizations.

Presumed Abort. The DECdtm services use the "presumed abort" optimization. The optimization states that, if no information can be found for a transaction by the coordinator, the transaction aborts. This removes the need to write an abort decision to disk and to subsequently acknowledge the order to abort. In addition, subordinates that do not modify any data during the transaction (that is, they are "read only") avoid writing information to disk or participating in the commit phase.

Lazy Commit Log Write. The DECdtm services can act as intermediate nodes in a distributed transaction. In this mode, they write a "prepare" record prior to responding with a "yes vote." They also write a "commit" record upon receipt of an order to commit. This latter record is written so that the coordinator need not be asked about the commit decision should the intermediate node fail. This refinement isolates the intermediate node's recovery from communication failures between it and the coordinator.

Performance is enhanced when the DECdtm services write the commit record on an intermediate node in a "nonurgent" or "lazy" manner. The lazy write buffers the information and waits for an urgent request to trigger the group commit timer to write the data to disk. Typically, this operation avoids a disk write at the intermediate node. The increase in the length of time before the commit record is written is negligible.

One-phase Commit. A key consideration in the design of the DECdtm services was to incur minimal impact on the performance of Digital's database products. We exploited two attributes to achieve this goal. First, all current users are limited to non-distributed transactions (those that involve only a single subordinate). Second, the two-phase commit protocol requires that all subordinates respond with a "yes vote" to commit the transaction. This allows a highly optimized path for single-subordinate transactions. Such transactions require no writes to disk by the DECdtm services and execute in one phase. The subordinate is told that it is the only voting party in the transaction and, if it is willing to respond with a "yes vote," it should proceed and perform its order-to-commit processing.

VAXcluster Considerations

The optimizations listed above (and others not described here) provide the DECdtm services with a competitive two-phase commit protocol. VAXcluster technology, though, offers other untapped potential. VAXcluster systems offer several unique features, in particular, the guarantee
against partitioning, the distributed lock manager, and the ability to share disk access between CPUs.

Within a VAXcluster system, use of these unique features allows the DECdtm services to avoid a blocked condition which occurs during the short period of time when a subordinate node responds with a "yes vote" and communication with its coordinator is lost. Normally, the subordinate is unable to proceed with that transaction's commit until communications have been restored.

Outside a VAXcluster system, the DECdtm services would indeed be blocked. If, however, the subordinate and its coordinator are in the same VAXcluster system, this will not occur. If communication is lost, a subordinate node knows, as a result of the guarantee against partitioning, that its coordinator has failed.

Because a subordinate node can access the transaction log of the failed coordinator, it may immediately "host" its failed coordinator's recovery. Communications to the hosted coordinator are quickly restored, and the subordinate node is able to complete the transaction commit.
VAXcluster-specific Optimizations

Once the blocking potential was removed from intra-VAXcluster transactions, several additional protocol optimizations became practical. The optimizations described in this section are dynamically enabled if the subordinate and its coordinator are both in the same VAXcluster system.

Early Prepare Log Write. As mentioned earlier, an intermediate node must write a "prepare" record prior to responding with a "yes vote." The presence of this record in an intermediate node's log indicates that the node must get the outcome of the transaction from the coordinator and, thus, it is subject to blocking. Therefore, the prepare record is typically written after all the expected votes are returned, which adds to commit-time latency.

The DECdtm services are free from blocking concerns within a VAXcluster system, and the vast majority of transactions do commit. This factor prompted an optimization that writes a prepare record while simultaneously collecting the subordinate votes. This reduces commit-time latency.

No Commit Log Write. The lazy commit log write optimization described above causes the intermediate node's commit record to be written and, thus, minimizes the potential for blocking should the intermediate node fail. This is not a concern for the intra-VAXcluster case. Therefore, no commit record is written at the intermediate node.

Performance Evaluation

Table 1 describes the message and log write costs of the DECdtm services protocol and compares it to the basic two-phase commit protocol, as well as to the standard presumed abort variant previously described.
Table 1  Logging and Message Cost by Two-phase Commit (2PC) Protocol Variant (log writes and messages at the coordinator and at intermediate nodes for the basic 2PC protocol, the presumed abort variant, normal DECdtm, intracluster DECdtm, and DECdtm one-phase commit, each with and without a read-only intermediate)

Notes:
Log writes are total writes, forced. The table entry "2, 1 forced" means that there are two total log writes, one of which is forced. A forced write must complete before the protocol makes a transition to the next state.
RO means Read Only.
Where a message count is listed as xN, N represents the number of intermediates that fit that category.
For the intracluster intermediate, forced means that the log write is initiated optimistically; thus, it has lower latency.
Ease-of-use Evaluation
A primary goal in providing transaction processing
primitives within the VMS kernel was to supply
many disparate applications with a straightforward
interface to distributed transaction management.
This contrasts with most commercially available systems, where distributed transaction management functionality is available only from a transaction processing monitor. This latter form restricts the functionality to applications written to execute under the control of the transaction processing monitor, and it effectively precludes other applications from making use of the technology.
From the outset of development, we endeavored to provide an interface that was suitable for as many applications as possible. We made early versions of the DECdtm services available within Digital to decrease the "time to market" for software products that wished to exploit distributed transaction processing technology. As of July 1990, at least seven Digital software products have been modified to use the DECdtm services. These products are VAX Rdb/VMS, VAX DBMS, VAX RMS Journaling, VAX ACMS, DECintact, VAX RALLY, and VAX SQL.

In general, the modifications to these products have been relatively minor, as might be inferred from the short time it took to make the required changes. Based on this experience, we expect third-party software vendors to rapidly take advantage of the DECdtm services as they become available as part of the standard VMS operating system.

To incorporate the DECdtm services into a recoverable resource manager, the existing internal transaction management module must be replaced with calls to the DECdtm services. The resource manager must also be modified to respond correctly to the prepare and commit callbacks made by the DECdtm services. Further, the recovery logic of the resource manager must be modified to obtain from the DECdtm services the state of "in doubt" transactions.

Example of DECdtm Usage

The model and pseudocode shown in Figures 4a and 4b illustrate the use of DECdtm services in a simple example of a distributed transaction. The transaction spans two nodes, NODE-A and NODE-B, in a VMS network. During the course of the transaction, recoverable resources managed by resource managers RM-A and RM-B are modified. Two "application" programs, APPL-A and APPL-B, that run on NODE-A and NODE-B, respectively, make normal procedural calls to RM-A and RM-B.
Figure 4a  Model Illustrating the Use of DECdtm Services (NODE A and NODE B each run an application, a resource manager, and the DECdtm services; the key distinguishes IPC connections, remote procedure calls, local procedure calls, and system service calls)
PROGRAM APPL-A

! Establish communications with remote application
!
IPC-LINK (node="NODE-B", application="APPL-B", link=link-id);

! Exchange transaction manager names
!
LIB$GETJPI (JPI$_COMMIT_DOMAIN,,,my-cd);
IPC-TRANSCEIVE (link=link-id, send-data=my-cd, receive-data=your-cd);

! Start a transaction
!
$START-TRANSW (iosb=status, tid=tid);

! Make a procedural call to RM-A to perform an operation
!
RM-A (tid, requested-operation);

! Now create a transaction branch for the remote application
!
$ADD-BRANCHW (iosb=status, tid=tid, branch=bid, cd-name=your-cd);

! Ask APPL-B to do something as part of this transaction
!
IPC-TRANSCEIVE (link=link-id, send-data=(tid, bid, data), receive-data=status);

! And end the transaction
!
$END-TRANSW (iosb=status, tid=tid);


PROGRAM APPL-B (link-id)

! Exchange transaction manager names
!
IPC-RECEIVE (link=link-id, data=sup-cd);
LIB$GETJPI (JPI$_COMMIT_DOMAIN,,,my-cd);
IPC-REPLY (link=link-id, data=my-cd);

! Now we execute transaction requests
!
loop;
    ! Start the transaction branch created by APPL-A.
    !
    IPC-RECEIVE (link=link-id, data=(tid, bid, data));

    ! Make a procedural call to RM-B to perform an operation
    !
    RM-B (tid, requested-operation);

    ! Tell APPL-A we are done
    !
    IPC-REPLY (link=link-id, data=SS$_NORMAL);

    ! Declare that we are finished for this transaction and
    ! wait for it to complete
    !
    $READY-TO-COMMITW (iosb=status, tid=tid);
end-loop;
ROUTINE RM-A (tid, requested-operation)

! If this is the first operation, register with DECdtm services as a
! resource manager. As part of the registration we declare an event
! routine that will be called during the voting process.
!
if first time we've been called then
    $DECLARE-RMW (iosb=status, name="RM-A", evtrtn=RM-A-EVENT, rm-id=rm-handle);

! Inform DECdtm services of our interest in this transaction
!
if tid has not previously been seen then
    $JOIN-RMW (iosb=status, rm-id=rm-handle, tid=tid, part-id=participant);

! Perform the requested operation
!
DO-OPERATION (requested-operation);

RETURN


ROUTINE RM-A-EVENT (event-block)

! Select action from the DECdtm services event type
!
CASE event-block.DDTM$L_OPTYPE FROM ... TO ...

    ! Do "request to prepare" processing
    !
    [DDTM$K_PREPARE]:
        DO-PREPARE-ACTIVITY (result=status, tid=event-block.DDTM$A_TID);

    ! Do "order to commit" processing
    !
    [DDTM$K_COMMIT]:
        DO-COMMIT-ACTIVITY (result=status, tid=event-block.DDTM$A_TID);

    ! Do "order to abort" processing
    !
    [DDTM$K_ABORT]:
        DO-ABORT-ACTIVITY (result=status, tid=event-block.DDTM$A_TID);

ESAC;

! Inform the DECdtm services of the final status of our event
! processing.
!
$FINISH-RMOPW (iosb=iosb, part-id=event-block.DDTM$L_PART_ID, retsts=status);

RETURN

Figure 4b  Pseudocode Illustrating the Use of DECdtm Services
APPL-A and APPL-B use an interprocess communication mechanism to communicate information across the network. The DECdtm service calls are prefixed with a dollar sign ($). The code for the resource managers, RM-A and RM-B, is identical with respect to calls for the DECdtm services. The resource manager routine RM-A-EVENT is invoked by the DECdtm services during transaction state transitions.
Conclusions
The addition of a distributed transaction manager
to the kernel of the general-purpose VMS operating
system makes distributed transactions available
to a wide spectrum of applications. This design and implementation was accomplished with comparative ease and with quality performance. In addition to utilizing the most commonly described optimizations of the two-phase commit protocol, we have used optimizations that exploit some of the unique benefits of the VAXcluster system.
Acknowledgments

We wish to gratefully acknowledge the contributions of all the transaction processing architects involved, and in particular Vijay Trehan, for delivering to us an understandable and implementable architecture. We also extend our thanks to Phil Bernstein for his encouragement and advice, and to our initial users, Bill Wright, Peter Spiro, and Lenny Szubowicz, for their persistence and good nature.

Finally, and most importantly, we would like to thank all the DECdtm development engineers and the others who helped ship the product: Stuart Bayley, Cathy Foley, Mike Grossmith, Tom Harding, Tony Hasler, Mark Howell, Dave Marsh, Julian Palmer, Kevin Playford, and Chris Whitaker.
References

1. R. Haskin, Y. Malachi, W. Sawdon, and G. Chan, "Recovery Management in Quicksilver," ACM Transactions on Computer Systems, vol. 6, no. 1 (February 1988).

2. A. Spector et al., Camelot: A Distributed Transaction Facility for Mach and the Internet - An Interim Report (Pittsburgh: Carnegie Mellon University, Department of Computer Science, June 1987).

3. W. Bruckert, C. Alonso, and J. Melvin, "Verification of the First Fault-tolerant VAX System," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 79-85.

4. J. Gray, "A Census of Tandem System Availability between 1985 and 1990," Tandem Technical Report 90.1, part no. 33579 (January 1990).

5. P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems (Reading, MA: Addison-Wesley, 1987).

6. J. Gray, "Notes on Database Operating Systems," in Operating Systems: An Advanced Course (Berlin: Springer-Verlag, 1978).

7. B. Lampson, "Atomic Transactions," in Distributed Systems - Architecture and Implementation: An Advanced Course, edited by G. Goos and J. Hartmanis (Berlin: Springer-Verlag, 1981).

8. C. Mohan, B. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Transactions on Computer Systems, vol. 11, no. 4 (December 1986).

9. C. Mohan and B. Lindsay, "Efficient Commit Protocol for the Tree of Processes Model of Distributed Transactions," Proceedings of the 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing (Montreal, August 1983).

10. D. Duchamp, "Analysis of Transaction Management Performance," Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (special issue), vol. 23, no. 5 (December 1989): 177-190.

11. N. Kronenberg, H. Levy, and W. Strecker, "VAXclusters: A Closely-Coupled Distributed System," ACM Transactions on Computer Systems, vol. 4, no. 2 (May 1986).
Walter H. Kohler
Yun-Ping Hsu
Thomas K. Rogers
Wael H. Bahau-El-Din

Performance Evaluation of Transaction Processing Systems
Performance and price/performance are important attributes to consider when evaluating a transaction processing system. Two major approaches to performance evaluation are measurement and modeling. TPC Benchmark A is an industry standard benchmark for measuring a transaction processing system's performance and price/performance. Digital has implemented TPC Benchmark A in a distributed transaction processing environment. Benchmark measurements were performed on the VAX 9000 Model 210 and the VAX 4000 Model 300 systems. Further, a comprehensive analytical model was developed and customized to model the performance behavior of TPC Benchmark A on Digital's transaction processing platforms. This model was validated using measurement results and has proven to be an accurate performance prediction tool.
Transaction processing systems are complex in
nature and are usually characterized by a large
number of interactive terminals and users, a large
volume of on-line data and storage devices, and a
high volume of concurrent and shared database
accesses. Transaction processing systems require
layers of software components and hardware
devices to work in concert. Performance and
price/performance are two important attributes
for customers to consider when selecting transaction processing systems. Performance is important because transaction processing systems are
frequently used to operate the customer's business
or handle mission-critical tasks. Therefore, a certain
level of throughput and response time guarantee
are required from the systems during normal operation. Price/performance is the total system and
maintenance cost in dollars, normalized by the performance metric.
The performance of a transaction processing
system is often measured by its throughput in transactions per second (TPS) that satisfies a response
time constraint. For example, 90 percent of the
transactions must have a response time that is less
than 2 seconds. This throughput, qualified by the
associated response time constraint, is called the
maximum qualified throughput (MQTh). In a transaction processing environment, the most meaningful response time definition is the end-to-end
response time, i.e., the response time observed by
a user at a terminal. The end-to-end response time
represents the time required by all components
that compose the transaction processing system.
The two major approaches used for evaluating
transaction processing system performance are
measurement and modeling. The measurement
approach is the most realistic way of evaluating the
performance of a system. Performance measurement results from standard benchmarks have been
the most accepted form of performance assessment of transaction processing systems. However,
due to the complexity of transaction processing
systems, such measurements are usually very expensive, very time-consuming, and difficult to perform.
Modeling uses simulation or analytical modeling techniques. Compared to the measurement
approach, modeling makes it easier to produce
results and requires less computing resources.
Performance models are also flexible. Models can
be used to answer "what-if" types of questions and
to provide insights into the complex performance
behavior of transaction processing systems, which
is difficult (if not impossible) to observe in the
measurement environment. Performance models
are widely used in research and engineering communities to provide valuable analysis of design
alternatives, architecture evaluation, and capacity
planning. Simplifying assumptions are usually
made in the modeling approach. Therefore, performance models require validation, through detailed
simulation or measurement, before predictions
from the models are accepted.
This paper presents Digital's benchmark measurement and modeling approaches to transaction
processing system performance evaluation. The
paper includes an overview of the current industry
standard transaction processing benchmark, the
TPC Benchmark A, and a description of Digital's
implementation of the benchmark, including the
distinguishing features of the implementation and
the benchmark methodology. The performance
measurement results that were achieved by using
the TPC Benchmark A are also presented. Finally, a
multilevel analytical model of the performance
behavior of transaction processing systems with
response time constraints is presented and validated against measurement results.
TPC Benchmark A-An Overview
The TPC Benchmark A simulates a simple banking
environment and exercises key components of
the system under test (SUT) by using a simple,
update-intensive transaction type. The benchmark
is intended to simulate a class of transaction processing application environments, not the entire
range of transaction processing environments.
Nevertheless, the single transaction type specified
by the TPC Benchmark A standard provides a simple
and repeatable unit of work.
The benchmark can be run in either a local
area network (LAN) or a wide area network (WAN)
configuration. The related throughput metrics
are tpsA-Local and tpsA-Wide, respectively. The
benchmark specification defines the general application requirements, database design and scaling
rules, testing and pricing guidelines, full disclosure report requirements, and an audit checklist.
The following sections provide an overview of
the benchmark.
Application Environment
The TPC Benchmark A workload is patterned after a
simplified banking application. In this model, the
bank contains one or more branches. Each branch
has 10 tellers and 100,000 customer accounts. A
transaction occurs when a teller enters a deposit
or a withdrawal for a customer against an account
at a branch location. Each teller enters transactions
at an average rate of one every 10 seconds. Figure 1
illustrates this simplified banking environment.
Transaction Logic
The transaction logic of the TPC Benchmark A
workload can be described in terms of the bank
environment shown in Figure 1. A teller deposits
in or withdraws money from an account, updates
the current cash position of the teller and branch,
and makes an entry of the transaction in a history
file. The pseudocode shown in Figure 2 represents
the transaction.
Figure 1  TPC Benchmark A Banking Environment (a bank with one or more branches; each branch has 10 tellers and 100,000 customer accounts)
Read 100 bytes including Bid, Tid, Aid, Delta from terminal
BEGIN TRANSACTION
    Update Account where Account-ID = Aid:
        Read Account-Balance from Account
        Set Account-Balance = Account-Balance + Delta
        Write Account-Balance to Account
    Write to History:
        Aid, Tid, Bid, Delta, Time-Stamp
    Update Teller where Teller-ID = Tid:
        Set Teller-Balance = Teller-Balance + Delta
        Write Teller-Balance to Teller
    Update Branch where Branch-ID = Bid:
        Set Branch-Balance = Branch-Balance + Delta
        Write Branch-Balance to Branch
COMMIT TRANSACTION
Write 200 bytes including Aid, Tid, Bid, Delta, Account-Balance to terminal

Figure 2  TPC Benchmark A Transaction Pseudocode
Terminal Communication
For each transaction, the originating terminal is
required to transmit data to, and receive data from,
the system under test. The data sent to the system
under test must consist of at least 100 alphanumeric
data bytes, organized as at least four distinct fields:
Account-ID, Teller-ID, Branch-ID, and Delta. The
Branch-ID identifies the branch where the teller is
located. The Delta is the amount to be credited to,
or debited from, the specified account. The data
received from the system under test consists of at
least 200 data bytes, organized as the above four
input fields and the Account-Balance that results
from the successful commit operation of the
transaction.
Implementation Constraints
The TPC Benchmark A imposes several conditions
on the test environment.
The transaction processing system must support
atomicity, consistency, isolation, and durability
(ACID) properties during the test.
The tested system must preserve the effects of
committed transactions and ensure database
consistency after recovering from
- The failure of a single durable medium that
contains database or recovery log data
- The crash and reboot of the system
- The loss of all or part of memory
Eighty-five percent of the accounts processed
by a teller must belong to the home branch (the
one to which the teller belongs). Fifteen percent
of the accounts processed by a teller must be
owned by a remote branch (one to which the
teller does not belong). Accounts must be uniformly distributed and randomly selected.
Database Design
The database consists of four individual files/tables:
Branch, Teller, Account, and History, as defined in
Table 1. The overall size of the database is determined by the throughput capacity of the system.
Ten tellers, each entering transactions at an average rate of one transaction every 10 seconds, generate what is defined as a one-TPS load. Therefore,
each teller contributes one-tenth (1/10) TPS. The
history area must be large enough to store the history records generated during 90 eight-hour days
of operation at the published system TPS capacity.
For a system that has a processing capacity of
x TPS, the database is sized as shown in Table 2.
For example, to process 20 TPS, a system must
use a database that includes 20 branch records, 200
teller records, and 2,000,000 account records.
Because each teller uses a terminal, the price of the
system must include 200 terminals. A test that
results in a higher TPS rate is invalid unless the size
of the database and the number of terminals are
increased proportionately.
Table 1  Database Entities

Record    Bytes   Fields Required    Description
Branch    100     Branch-ID          Identifies the branch across the range of branches
                  Branch-Balance     Contains the branch's current cash balance
Teller    100     Teller-ID          Identifies the teller across the range of tellers
                  Branch-ID          Identifies the branch where the teller is located
                  Teller-Balance     Contains the teller's current cash balance
Account   100     Account-ID         Identifies the customer account uniquely for the entire database
                  Branch-ID          Identifies the branch where the account is held
                  Account-Balance    Contains the account's current cash balance
History   50      Account-ID         Identifies the account updated by the transaction
                  Teller-ID          Identifies the teller involved in the transaction
                  Branch-ID          Identifies the branch associated with the teller
                  Amount             Contains the amount of credit or debit (delta) specified by the transaction
                  Time-Stamp         Contains the date and time taken between the BEGIN TRANSACTION and COMMIT TRANSACTION statements
Table 2  Database Sizing (for a system with a processing capacity of x TPS)

Record Type        Number of Records
Branch records     1 × x
Teller records     10 × x
Account records    100,000 × x
History records    2,592,000 × x
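The scaling rules of Table 2 can be applied mechanically, as in the Python sketch below, which reproduces the 20-TPS example from the text; the function name and the returned dictionary are ours, not part of the benchmark definition.

def size_database(tps):
    """Return TPC Benchmark A database sizing for a system rated at `tps` TPS."""
    return {
        "branch_records":  1 * tps,
        "teller_records":  10 * tps,
        "account_records": 100_000 * tps,
        "history_records": 2_592_000 * tps,   # 90 eight-hour days at the rated TPS
        "terminals":       10 * tps,          # one terminal per teller
    }

# Example: a 20-TPS system needs 20 branches, 200 tellers, and 2,000,000 accounts.
print(size_database(20))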
Benchmark Metrics
TPC Benchmark A uses two basic metrics:
Transactions p e r second (TPS) - throughput in
TPS, subject t o a response time constraint, i.e.,
t h e MQTh, is measured while the system is in a
sustainable steady-state condition.
Price p e r TPS (K$/TPS) - t h e purchase price
and five-year maintenance costs associated w i t h
o n e TPS.
Transactions per Second  To guarantee that the tested system provides fast response to on-line users, the TPC Benchmark A imposes a specific response time constraint on the benchmark. Ninety percent of all transactions must have a response time of less than two seconds. The TPC Benchmark A standard defines transaction response time as the time interval between the transmission from the terminal of the first byte of the input message to the system under test and the arrival at the terminal of the last byte of the output message from the system under test.
The reported TPS is the total number of committed transactions that both started and completed during an interval of steady-state performance, divided by the elapsed time of the interval. The steady-state measurement interval must be at least 15 minutes, and 90 percent of the transactions must have a response time of less than 2 seconds.
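As a rough illustration of how the reported TPS and the response time constraint interact, the following Python sketch (our own illustration, not part of the standard) derives both quantities from a list of (start, end) timestamps for committed transactions.

    def tpca_metrics(txns, interval_start, interval_end):
        """txns: (start, end) times in seconds for committed transactions.
        Returns the reported TPS and the fraction finishing in under 2 s."""
        # Only transactions that both started and completed inside the
        # steady-state measurement interval are counted.
        inside = [(s, e) for (s, e) in txns
                  if s >= interval_start and e <= interval_end]
        elapsed = interval_end - interval_start       # must be >= 900 s (15 min)
        tps = len(inside) / elapsed
        response_times = [e - s for (s, e) in inside]
        under_2s = sum(1 for rt in response_times if rt < 2.0) / len(response_times)
        return tps, under_2s

    sample = [(0.0, 1.1), (0.5, 1.6), (1.0, 3.4), (2.0, 3.1)]
    tps, frac = tpca_metrics(sample, 0.0, 10.0)
    valid = frac >= 0.90    # 90 percent must have a response time under 2 seconds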
Price per TPS  The K$/TPS price/performance metric measures the total system price in thousands of dollars, normalized by the TPS rating of the system. The priced system includes all the components that a customer requires to achieve the reported performance level and is defined by the TPC Benchmark A standard as the
- Price of the system under test, including all hardware, software, and maintenance for five years.
- Price of the terminals and network components, and their maintenance for five years.
- Price of on-line storage for 90 days of history records at the published TPS rate, which amounts to 2,592,000 records per TPS. A storage medium is considered to be on-line if any record can be accessed randomly within one second.
- Price of additional products required for the operation, administration, or maintenance of the priced systems.
- Price of products required for application development.
All hardware and software used in the tested configuration must be announced and generally available to customers.
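The price/performance metric itself is a simple ratio; a minimal sketch, with hypothetical prices, is shown below.

    def price_per_tps(five_year_price_dollars, tps):
        """Price/performance in K$/TPS: total five-year cost of the priced
        configuration, in thousands of dollars, divided by the TPS rating."""
        return (five_year_price_dollars / 1000.0) / tps

    # Hypothetical numbers for illustration only:
    print(round(price_per_tps(2_500_000, 69.4), 1))   # ~36.0 K$/TPS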
TPC Benchmark A Implementation
Digital's implementation of the TPC Benchmark A goes beyond the minimum requirements of the TPC Benchmark A standard and uses Digital's distributed approach to transaction processing. For example, Digital's TPC Benchmark A implementation includes forms management and transaction processing monitor software that are required in most real transaction processing environments but are not required by the benchmark. The following sections provide an overview of Digital's approach and implementation.

Transaction Processing Software Environment
The three basic functions of a general-purpose transaction processing system are the user interface (forms processing), applications management, and database management. Digital has developed a distributed transaction architecture (DECdta) to define how the major functions are partitioned and supported by components that fit together to form a complete transaction processing system. Table 3 shows the software components in a typical Digital transaction processing environment.

Table 3  Transaction Processing Software Components

Component          Example
Operating system   VMS
Communications     LAT, DECnet
Database           VAX Rdb/VMS
TP monitor         VAX ACMS, DECintact
Forms              DECforms
Application        COBOL

Distributed Transaction Processing Approach
Digital transaction processing systems can be distributed by placing one or more of the basic system functions (i.e., user interface, application manager, database manager) on separate computers. In the simplest form of a distributed transaction processing system, the user interface component runs on a front-end processor, and the application and database components run on a back-end processor. The configuration allows terminal and forms management to be performed at a remote location, whereas the application is processed at a central location. The Digital transaction processing software components are separable because their clearly defined interfaces can be layered transparently onto a network. How these components may be partitioned in the Digital distributed transaction processing environment is illustrated in Figure 3.

Figure 3  Distributed Transaction Processing Environment

TPC Benchmark A Test Environment
The Digital TPC Benchmark A tests are implemented in a distributed transaction processing environment using the transaction processing
software components shown in Figure 3. The user interface component runs on one or more front-end processors, whereas the application and database components run on one or more back-end processors. Transactions are entered from teller terminals, which communicate with the front-end processors. The front-end processors then communicate with the back-end processors to invoke the application servers and perform database operations. The communications can take place over either a local area or a wide area network. However, to simplify testing, the TPC Benchmark A standard allows sponsors to use remote terminal emulators (RTEs) rather than real terminals. Therefore, the TPC Benchmark A tests base performance and price/performance results on two distinctly configured systems, the target system and the test system.
The target system is the configuration of hardware and software components that customers can use to perform transaction processing. With the Digital distributed transaction processing approach, user terminals initiate transactions and communicate with the front-end processors. Front-end processors communicate with a back-end processor using the DECnet protocol.
The test system is the configuration of components used in the lab to measure the performance of the target system. The test system uses RTEs, rather than user terminals, to generate the workload and measure response time. (Note: In previously published reports, based on Digital's Debitcredit benchmark, the RTE emulated front-end processors. In the TPC Benchmark A standard, the RTE emulates only the user terminals.) The RTE component
- Emulates the behavior of terminal users according to the benchmark specification (e.g., think time, transaction parameters)
- Emulates terminal devices (e.g., conversion and multiplexing into the local area transport [LAT] protocol used by the DECserver terminal servers)
- Records transaction messages and response times (e.g., the starting and ending times of individual transactions from each emulated terminal device)
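A minimal sketch of what the RTE does for one emulated teller is shown below; the think-time value, message contents, and the send and receive callables are illustrative assumptions of ours, not details of Digital's RTE implementation.

    import random, time

    def emulate_teller(send, receive, n_transactions, mean_think_time=10.0):
        """Emulate one teller terminal: wait a think time, submit a
        transaction, and record its response time in seconds.
        `send` and `receive` stand in for the terminal protocol layer."""
        response_times = []
        for _ in range(n_transactions):
            time.sleep(random.expovariate(1.0 / mean_think_time))  # think time
            request = {"account": random.randrange(100_000), "amount": 100}
            start = time.monotonic()
            send(request)          # first byte of the input message leaves the terminal
            receive()              # last byte of the output message arrives back
            response_times.append(time.monotonic() - start)
        return response_times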
Figure 4 depicts the test system configuration in
the LAN environment with one back-end processor, multiple front-end processors, and multiple
remote terminal emulators.
Figure 4  Test System Configuration
We now present the results of two TPC Benchmark A tests based on audited benchmark experiments performed on the VAX 9000 Model 210 and the VAX 4000 Model 300 systems. These two systems are representative of Digital's large and small transaction processing platforms. The benchmark was implemented using the VAX ACMS transaction processing monitor, the VAX Rdb/VMS relational database management system, and the DECforms forms management system on the VMS operating system.
Tables 4 and 5 show the back-end system configurations for the VAX 9000 Model 210 and the VAX 4000 Model 300 systems, respectively. Table 6 shows the system configuration of the front-end systems.
Measurement Results
The maximum qualified throughput and response time results for the TPC Benchmark A are summarized in Table 7 for the VAX 9000 Model 210 and the VAX 4000 Model 300 systems. Both configurations have sufficient main memory and disk drives such that the processors are effectively utilized with no other bottleneck. Both systems achieved well over 90 percent CPU utilization at the maximum qualified throughput under the response time constraint.

Table 4  VAX 9000 Model 210 Back-end System Configuration

Component          Product                Quantity
Processor          VAX 9000 Model 210     1
Memory             256 MB
Tape drive         TA81                   1
Disk controller    KDM70                  2
Disks              RA92                   16
Operating system   VMS 5.4                1
Communications     DECnet-VMS Phase IV    1
TP monitor         VAX ACMS V3.1          1
Dictionary         VAX CDD/Plus V4.1      1
Application        VAX COBOL V4.2         1
Database system    VAX Rdb/VMS V4.0       1
Forms management   DECforms V1.2          1

Table 5  VAX 4000 Model 300 Back-end System Configuration

Component          Product                Quantity
Processor          VAX 4000 Model 300     1
Memory             64 MB
Tape drive         TK70                   1
Disk controller    DSSI                   3
Disks              RF31                   18
Operating system   VMS 5.4                1
Communications     DECnet-VMS Phase IV    1
TP monitor         VAX ACMS V3.1          1
Dictionary         VAX CDD/Plus V4.1      1
Application        VAX COBOL V4.2         1
Database system    VAX Rdb/VMS V4.0       1
Forms management   DECforms V1.2          1

Table 6  Front-end Run-time System Configuration

Component          Product                      Quantity
Processor          VAXserver 3100 Model 10      10 for VAX 9000 back-end; 3 for VAX 4000 back-end
Memory             16 MB for VAX 9000 back-end; 12 MB for VAX 4000 back-end
Disks              RZ23 (104 MB)                1 for VAX 9000 back-end; 1 for VAX 4000 back-end
Operating system   VMS 5.3 / VMS 5.4            1
Communications     DECnet-VMS Phase IV          1
TP monitor         VAX ACMS V3.1                1
Forms management   DECforms V1.2                1

Table 7  Maximum Qualified Throughput

                          TPS            Response Time (seconds)
System                    (tpsA-Local)   Average   90 percent   Maximum
VAX 9000 Model 210        69.4           1.20      1.74         5.82
VAX 4000 Model 300        21.6           1.39      1.99         4.81

In addition to the throughput and response time, the TPC Benchmark A specification requires that several other data points and graphs be reported. We demonstrate these data and graphs by using the VAX 9000 Model 210 TPC Benchmark A results.
- Response Time in Relationship to TPS. Figure 5 shows the ninetieth percentile and average response times at 100 percent and approximately 80 percent and 50 percent of the maximum qualified throughput. The mean transaction response time still grows linearly with the transaction rate up to the 70 TPS level, but the ninetieth percentile response time curve has started to rise quickly due to the high CPU utilization and random arrival of transactions.
- Response Time Frequency Distribution. Figure 6 is a graphical representation of the transaction response time distribution. The average, ninetieth percentile, and maximum transaction response times are also marked on the graph.
- Transactions per Second over Time. The results shown in Figure 7 demonstrate the sustainable maximum qualified throughput. The one-minute running average transaction throughputs during the warm-up and data collection periods of the experiment are plotted on the graph. This graph shows that the throughput was steady during the period of data collection.
- Average Response Time over Time. The results shown in Figure 8 demonstrate the sustainable average response time in the experiment. The one-minute running average transaction response times during the warm-up and data collection periods of the experiment are plotted on the graph. This graph shows that the mean response time was steady during the period of data collection.

Figure 5  VAX 9000 Response Time in Relationship to Transactions per Second
Figure 6  VAX 9000 Response Time Frequency Distribution
Figure 7  VAX 9000 Transactions per Second over Time
Figure 8  VAX 9000 Average Response Time over Time
Comprehensive Analytical Model
Modeling techniques can be used as a supplement or an alternative to the measurement approach. The performance behavior of complex transaction processing systems can be characterized by a set of parameters, a set of performance metrics, and the relationships among them. These parameters can be used to describe the different resources available in the system, the database operations of transactions, and the workload that the transaction processing system undergoes. To completely represent such a system, the size of the parameter set would be too huge to manage. An analytical model simplifies, through abstraction, the complex behavior of a system into a manageable set of parameters and policies. Such a model, after proper validation, can be a powerful tool for many types of analysis, as well as a performance prediction tool. Results can be obtained quickly for any combination of parameters.
A comprehensive analytical model of the performance behavior of transaction processing systems with a response time constraint was developed and validated against measurement results. This model is hierarchical and flexible for extension. The following sections describe the basic construction of the model and the customization made to model the execution of TPC Benchmark A on Digital's transaction processing systems. The model can also be used to study different transaction processing workloads in addition to the TPC Benchmark A.

Response Time Components
The main metric used in the model is the maximum qualified throughput under a response time constraint. The response time constraint is in the form of "x percent of transaction response times are less than y seconds."
To evaluate throughput under such a response time constraint, the distribution of transaction response times is determined by first decomposing the transaction response time into nonoverlapping and independent components. The distribution of each component is then evaluated. Finally, the overall transaction response time distribution is derived from the mathematical convolution of the component response time distributions.
The logical flow of a transaction in a front-end and back-end distributed transaction processing system that is used to implement TPC Benchmark A is depicted in Figure 9. The response time of a transaction consists of three basic components: front-end processing, back-end processing, and communication delays.
Front-end processing usually includes terminal I/O processing, forms/presentation services, and communication with the back-end systems. In the benchmark experiments, no disk I/O activity was involved during the front-end processing.
Back-end processing includes the execution of application, database access, concurrency control, and transaction commit processing. The back-end processing usually involves a high degree of concurrency and many disk I/O activities.
Communication delays primarily include the communications between the user terminal and the front-end node, and the front-end and back-end interactions.
(Note: These response time components do not overlap with each other.)

Figure 9  Response Time Components

Within the back-end system, the transaction response time is further decomposed into two additional components, CPU delays and non-CPU, nonoverlapping delays. CPU delays include both the CPU service and the CPU waiting times of transactions. Non-CPU, nonoverlapping delays include:
- Logging delays, which include the time for transaction log writes and commit protocol delays
- Database I/O delays, which include both waiting and service times for accessing storage devices
- Other delays, which include delays that result from concurrency control (e.g., waiting for locks) and waiting for messages

Two-level Approach
The model is configured in a two-level hierarchy, a high level and a detailed level. The use of a hierarchy allows a complex and detailed model that considers many components and involves many parameters to be constructed easily. Because of the hierarchical approach, the model also provides flexibility for modifications and extensions, and validation of separate submodels.
The high-level model assumes the decomposition of transaction response times, as described in the Response Time Components section, and models the behavior of the transaction processing system by an open queuing system, as shown in Figure 10. The queuing system consists of servers and delay centers, which are connected in a queuing network with the following assumptions:
- The front-end processing does not involve any disk I/O operation, and the load on the front-end systems is equally balanced.
- The back-end is a shared-memory multiprocessor system with symmetrical loads on all processors (or it can be simply a uniprocessor).
- No intratransaction parallelism exists within individual transaction execution.
- No mutual dependency exists between transaction response time components.
- Transaction arrivals to the processors have a Poisson distribution.
These assumptions correspond to Digital's TPC Benchmark A testing methodology and implementation.

Figure 10  High-level Queuing Model for a Transaction Processing System
The front-end CPU is modeled as an M/M/1 queuing center, and the back-end CPU is modeled as an M/M/m queuing center. The transactions' CPU times on the front-end and back-end systems are assumed to be exponentially distributed (coefficient of variation equal to 1) due to the single type of transaction in the benchmark. (Note: An approximation of M/G/m can be used to consider a coefficient of variation other than 1 for the back-end transaction CPU service time, especially in the multiprocessor case when the bus is highly utilized.) Database I/O, logging I/O, and other delays are modeled as delay centers, with appropriate delay distributions. For the model of the TPC Benchmark A workload, the database I/O, journaling I/O, and other communication and synchronization delays are combined into one delay center, called the LOD delay center, which is represented by a 2-Erlang distribution.
The major input parameters for this high-level model are the
- Number of front-end systems and the front-end CPU service time per transaction
- Number of CPUs in the back-end system and the back-end CPU service time per transaction
- Sum of the back-end database I/O response time, journaling I/O response time, and other delay times (i.e., the mean for the LOD delay center's 2-Erlang distribution)
- Response time constraint (in the form of x percentile less than y seconds)
The main result from the high-level model is the MQTh. This high-level model presents a global picture of the performance behavior and manifests the relationship between the most important parameters of the transaction processing system and MQTh.
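A rough feel for how these inputs combine into an MQTh estimate can be had from the following Python sketch. It is only a simplified stand-in for the analytical model: it treats the back end as a single M/M/1 server rather than M/M/m, ignores communication delays, replaces the analytic convolution of component distributions with Monte Carlo sampling, and uses invented service times; all names and numbers are ours.

    import random

    def percentile_response_time(tps, fe_svc=0.005, be_svc=0.010, lod_mean=0.040,
                                 n_fe=10, pct=0.90, samples=5000):
        """Sample end-to-end response times (seconds) at a given throughput and
        return the pct-quantile.  Front end and back end are treated as M/M/1
        servers (sojourn time ~ Exp(mu - lambda)); the LOD delay center is a
        2-Erlang distribution with the given mean."""
        fe_rate = 1.0 / fe_svc - tps / n_fe      # per-front-end residual rate
        be_rate = 1.0 / be_svc - tps
        if fe_rate <= 0 or be_rate <= 0:
            return float("inf")                   # a server is saturated
        rts = []
        for _ in range(samples):
            r = random.expovariate(fe_rate)                   # front-end delay
            r += random.expovariate(be_rate)                  # back-end CPU delay
            r += random.gammavariate(2, lod_mean / 2)         # LOD (2-Erlang) delay
            rts.append(r)
        rts.sort()
        return rts[int(pct * samples)]

    # MQTh estimate: largest throughput whose 90th-percentile response time is < 2 s
    mqth = max(t for t in range(1, 100) if percentile_response_time(t) < 2.0)
    print(mqth)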
Some of the input parameters in the high-level model are dynamic. The CPU service time of a transaction may vary with the throughput or number of processors, and the database I/O or other delays may also depend on the throughput. A good example of a dynamic model is a tightly coupled multiprocessor system, with one bus interconnecting the processors and with a shared common memory (e.g., a VAX 6000 Model 440 system). Such a system would run a single copy of the symmetrical multiprocessing operating system (e.g., the VMS system). The average CPU service time of transactions is affected by both hardware and software factors, such as
- Hardware contention that results from conflicting accesses to the shared bus and main memory and that causes processor speed degradation and longer CPU service time.
- Processor synchronization overhead that results from the serialization of accesses to shared data structures. Many operating systems use spin-locks as the mechanism for processor-level synchronization, and the processor spins (i.e., busy-waits) in the case of a conflict. In the model, the busy-wait overhead is considered to be part of the transaction code path, and such contention elongates the transaction CPU service time.
Four detailed-level submodels are used to account for the dynamic behavior of these parameters: CPU-cache-bus-memory, busy-wait, I/O grouping, and LOD.
The CPU-cache-bus-memory submodel consists
of many low-level parameters associated with the
workload, processor, cache, bus, and memory components of multiprocessor systems. It models these
components by using a mixed queuing network
model that consists of both open and closed chains,
as shown in Figure 11. The most important output
from this submodel is the average number of CPU
clock cycles per instruction.
The busy-wait submodel models the spin-lock contention that is associated with the two major VMS spin-locks, called SCHED and IOLOCK8. This submodel divides the state of a processor into several nonoverlapping states and uses probability analysis to derive busy-wait time. The I/O grouping submodel models the group commit and group write mechanisms of the VAX Rdb/VMS relational database management system. This submodel affects the path length of a transaction because of the amortization of disk I/O processing among grouped transactions. The LOD submodel considers the disk I/O times and the lock contention of certain critical resources in the VAX Rdb/VMS system.
Integrating the Two Levels of the Model
The two levels of the model are integrated by using
an iterative procedure outlined in Figure 12. It
starts at the detailed-level submodels, with initial
values for the MQTh, the transaction path length,
the busy-wait overhead, and the CPU utilization.
By applying the initialized parameters to the
submodels, the values of these parameters are
refined and input to the high-level model. The output parameters from the high-level model are then
fed back to the detailed-level submodels, and this
iterative process continues until the MQTh converges. In most cases, convergence is reached
within a few iterations.
Model Predictions
The back-end portion of the model was validated
against measurement results from numerous
Debitcredit benchmarks (Digital's precursor of the
TPC Benchmark A) on many VAX computers with
the VMS operating system, running VAX ACMS and VAX Rdb/VMS software. With sufficiently detailed
parameters available (such as transaction instruction count, instruction cycle time, bus/memory
access time, cache hit ratio), the model correctly
estimated the MQTh and many intermediate results
for several multiprocessor VAX systems. The model
was then extended to include the front-end systems. In this section, we discuss applying this complete end-to-end model to the TPC Benchmark A
on two VAX platforms, the VAX 9000 Model 210 and
the VAX 4000 Model 300 systems, and then compare
the results. The benchmark environment and implementation are described in the TPC Benchmark A
Implementation section of this paper.
Figure 11  CPU-cache-bus-memory Submodel
INITIALIZE:
    TxnPL, MQTh, BusyWaitPL, CpuUtilization;
    LOD-submodel(input: MQTh; output: LOD);
REPEAT
    I/O-Grouping-submodel(input: MQTh; output: DioPerTxn, TxnPL);
    REPEAT
        REPEAT
            BusyWait-submodel(input: TxnPL, BusyWaitPL, CpuUtilization,
                              DioPerTxn; output: BusyWaitPL);
        UNTIL (BusyWaitPL converges);
        CPU-Cache-Bus-Memory-submodel(input: TxnPL, BusyWaitPL;
                                      output: CpuUtilization, AvgCpuSvcTime);
    UNTIL (CpuUtilization converges);
    REPEAT
        MQTh-model(input: AvgCpuSvcTime, LOD; output: MQTh, CpuUtilization);
        LOD-submodel(input: MQTh; output: LOD);
    UNTIL (MQTh converges);
UNTIL (MQTh converges);

Figure 12  The Iterative Procedure for Integrating Submodels
Because both the VAX 9000 Model 210 and the VAX 4000 Model 300 systems are uniprocessor systems, there is no other processor contending for the processor-memory interconnect and memory subsystems. Such contention effects can therefore be ignored when modeling a uniprocessor system. The transaction processing performance prediction for the VAX 9000 Model 210 system is a successful example of the application of our analytical model.
We needed an accurate estimate of TPC Benchmark A performance on the VAX 9000 Model 210 system before a VAX 9000 system was actually available for testing. The high-level (MQTh) model was used with estimated values for the input parameters, LOD and transaction CPU service time. The estimated LOD was based on previous measurement observations from the VAX 6000 systems. The other parameter, back-end transaction CPU service time, was derived from the
- Timing information of the VAX 9000 CPU
- Memory access time and cache miss penalty of the VAX 9000 CPU
- Prediction of the cache hit ratio of the VAX 9000 system under the TPC Benchmark A workload
- Transaction path length of the TPC Benchmark A implementation
- Instruction profile of the TPC Benchmark A implementation
The high-level model predicted a range of MQTh, with a high end of 70 TPS and with a strong probability that the high-end performance was achievable. Additional predictions were made later, when an early prototype version of the VAX 9000 Model 210 system was available for testing. A variant of the Debitcredit benchmark, much smaller in scale and easier to run, was performed on the prototype system, with the emphasis on measuring the CPU performance in a transaction processing environment. The result was used to extrapolate the CPU service time of the TPC Benchmark A transactions on the VAX 9000 Model 210 system and to refine the early estimate. The results of these modifications supported the previous high-end estimate of performance of 70 TPS and refined the low-end performance to be 62 TPS. The final, audited TPC Benchmark A measurement result of the VAX 9000 Model 210 system showed 69.4 TPS, which closely matches the prediction. Table 8 compares the results from benchmark measurement and the analytical model outputs.
Table 8  Measurement Compared to Model Predictions

System                 Measured MQTh    Modeled MQTh
VAX 9000 Model 210     69.4             70.0
VAX 4000 Model 300     21.5             20.8
The VAX 4000 Model 300 TPC Benchmark A
results were also used as a validation case. VAX 4000
Model 300 systems use the same CMOS chip as
the VAX 6000 Model 400 series and the same
28-nanosecond (ns) CPU cycle time. However, in
the VAX 4000 series, the CPU-memory interconnect
is not the XMI bus but a direct primary memory
interconnect. This direct memory interconnect
results in fast main memory access. The processor,
cache, and main memory subsystems are otherwise
the same as in the VAX 6000 Model 400 systems.
Therefore, the detailed-level model and associated
parameters for the VAX 6000 Model 410 system
can be used by ignoring the bus access time. The
TPC Benchmark A measurement results are within
7 percent of the model prediction, which means
that our assumption on the memory access time
is acceptable.
Conclusion
Performance is one of the most important attributes in evaluating a transaction processing system.
However, because of the complex nature of transaction processing systems, a universal assessment
of transaction processing system performance is
impossible. The performance of a transaction processing system is workload dependent, configuration dependent, and implementation dependent. A
standard benchmark, like TPC Benchmark A, is a
step toward a fair comparison of transaction processing performance by different vendors. But it is
only one transaction processing benchmark that
represents a limited class of applications. When
evaluating transaction processing system performance, a good understanding of the targeted application environment and requirements is essential
before using any available benchmark result.
Additional benchmarks that represent a broader
range of commercial applications are expected to
be standardized by the Transaction Processing
Performance Council (TPC) in the coming years.
Performance modeling is an attractive alternative to benchmark measurement because it is less
expensive to perform and results can be compiled
more quickly. Modeling provides more insight
into the behavior of system components that are
treated as black boxes in most measurement experiments. Modeling helps system designers to better
understand performance issues and to discover
existing or potential performance problems. Modeling also provides solutions for improving performance by modeling different tuning or design
alternatives. The analytical model presented in this
Digital Technical Journal
Vol. 3 No. 1
Winter 1991
paper was validated and used extensively in many
engineering performance studies. The model also
helped the benchmark process to size the hardware during preparation (e.g., the number of
RTE and front-end systems needed, the size of
the database) and t o provide an MQTh goal as a
sanity check and a tuning aid. The model could
be extended to represent additional distributed
configurations, such as shared-disk and "shared-nothing" back-end transaction processing systems, and could be applied to additional transaction processing workloads.
Acknowledgments
The Digital TPC Benchmark A implementation and
measurements are the result of work by many
individuals within Digital. The authors would like
especially to thank Jim McKenzie, Martha Ryan,
Hwan Shen, and Bob Tanski for their work in the
TPC Benchmark A measurement experiments; and
Per Gyllstrom and Rabah Mediouni for their contributions to the analytical model and validation.
References
1. Transaction Processing Performance Council, TPC Benchmark A Standard Specification (Menlo Park, CA: Waterside Associates, November 1989).
2. Transaction Processing Systems Handbook (Maynard: Digital Equipment Corporation, Order No. EC-~0650-57, 1990).
3. TPC Benchmark A Report for the VAX 9000 Model 210 System (Maynard: Digital Equipment Corporation, Order No. EC-N0302-57, 1990).
4. TPC Benchmark A Report for the VAX 4000 Model 300 System (Maynard: Digital Equipment Corporation, Order No. EC-N0301-57, 1990).
5. L. Wright, W. Kohler, and W. Zahavi, "The Digital Debitcredit Benchmark: Methodology and Results," Proceedings of the International Conference on Management and Performance Evaluation of Computer Systems (December 1989): 84-92.
William Z. Zahavi
Frances A. Habib
Kenneth J. Omahen

Tools and Techniques for Preliminary Sizing of Transaction Processing Applications

Sizing transaction processing systems correctly is a difficult task. By nature, transaction processing applications are not predefined and can vary from the simple to the complex. Sizing during the analysis and design stages of the application development cycle is particularly difficult. It is impossible to measure the resource requirements of an application which is not yet written or fully implemented. To make sizing easier and more accurate in these stages, a sizing methodology was developed that uses measurements from systems on which industry-standard benchmarks have been run and employs standard systems analysis techniques for acquiring sizing information. These metrics are then used to predict future transaction resource usage.
The transaction processing marketplace is dominated by commercial applications that support
businesses. These applications contribute substantially to the success or failure of a business, based on
the level of performance the application provides.
In transaction processing, poor application performance can translate directly into lost revenues.
The risk of implementing a transaction processing application that performs poorly can be minimized by estimating the proper system size in the
early stages of application development. Sizing estimation includes configuring the correct processor
and proper number of disk drives and controllers,
given the characteristics of the application.
The sizing of transaction processing systems is
a difficult activity. Unlike traditional applications
such as mail, transaction processing applications
are not predefined. Each customer's requirement
is different and can vary from simple to complex.
Therefore, Digital chose to develop a sizing methodology that specifically meets the unique requirements of transaction processing customers. The
goal of this effort was to develop sizing tools and
techniques that would help marketing groups and
design consultants in recommending configurations that meet the needs of Digital's customers.
Digital's methodology evolved over time, as experience was gained in dealing with the real-world
problems of transaction processing system sizing.
The development of Digital's transaction processing sizing methodology was guided by several principles. The first principle is that the methodology
should rely heavily upon measurements of Digital
systems running industry-standard transaction
processing benchmarks. These benchmarks provide valuable data that quantifies the performance
characteristics of different hardware and software
configurations.
The second principle is that systems analysis
methodologies should be used to provide a framework for acquiring sizing information. In particular, a multilevel view of a customer's business
is adopted. This approach recognizes that a manager's view of the business functions performed by
an organization is different from a computer analyst's view of the transaction processing activity.
The sizing methodology should accommodate both
these views.
The third principle is that the sizing methodology must employ tools and techniques appropriate
to the current stage of the customer's application
design cycle. Early in the effort to develop a sizing
methodology, it was found that a distinction must
be made between preliminary sizing and sizing
during later stages of the application development
cycle. Preliminary sizing occurs during the analysis
and design stages of the application development
cycle. Therefore, no application software exists
which can be measured. Application software does
exist in later stages of the application development
cycle, and its measurement provides valuable input
for more precise sizing activities.
For example, if a customer is in the analysis or
design stages of the application development cycle,
it is unlikely that estimates can be obtained for
such quantities as paging rates or memory usage.
However, if the application is fully implemented,
then tools such as the VAXcluster Performance
Advisor (VPA) and the DECcp capacity planning
products can be used for sizing. These tools provide facilities for measuring and analyzing data
from a running system and for using the data as
input to queuing models.
The term sizing, as used in this paper, refers to
preliminary sizing. The paper presents the metrics
and algebra used in the sizing process for DECtp
applications. It also describes the individual tools
developed as part of Digital's transaction processing sizing effort.
Sizing
The purpose of sizing tools is twofold. First, sizing
tools are used to select the appropriate system
components and to estimate the performance level
of the system in terms of device utilization and
user response times. Second, sizing tools bridge the
gap between business specialists and computer
specialists. This bridge translates the business units
into functions that are performed on the system
and, ultimately, into units of work that can be quantified and measured in terms of system resources.
In the sections that follow, a number of important
elements of the sizing methodology are described.
The first of these elements is the platform on which
the transaction processing system will be implemented. It is assumed that the customer will supply
general preferences for the software and hardware
configuration as part of the platform information.
The Levels of Business Metrics section details the
multilevel approach used to describe the work performed by the business. The Sizing Metrics and
Sizing Formulas sections describe the algorithms
that use platform and business metric information
to perform transaction processing system sizing.
Platforms
The term platform is used in transaction processing sizing methodology to encompass general customer preferences for the hardware and software
upon which the transaction processing application
will run.
The hardware platform specifies the desired
topology or processing style. For example, processing style includes a centralized configuration and a
front-end and back-end configuration as valid alternatives. The hardware platform may also include
specific hardware components within the processing style. (In this paper, the term processor refers
to the overall processing unit, which may be composed of multiple CPUs.)
The software platform identifies the set of layered
products to be used by the transaction processing
application, with each software product identified
by its name and version number. In the transaction
processing environment, a software platform is
composed of the transaction processing monitor,
forms manager, database management system, application language, and operating system.
Different combinations of software platforms
may be configured, depending on the hardware platform used. A centralized configuration contains
all the software components on the same system. A
distributed system is comprised of a front-end processor and a back-end processor; different software
platforms may exist on each processor.
Levels of Business Metrics
The term business metrics refers collectively to
the various ways to measure the work associated
with a customer's business. In this section, various
levels of business metrics are identified and the
relationship between metrics at different levels is
described. As mentioned earlier, the levels correspond to the multilevel view of business operation
typically used for systems analysis. The organization or personnel most interested in a metric in
relation to its business operation is noted in the
discussion of each metric.
The decomposition of the business application
requirements into components that can be counted
and quantified in terms of resource usage requires
that a set of metrics be defined. These metrics
reflect the business activity and the system load.
The business metrics are the foundation for the
development of several transaction processing sizing tools and for a consistent algebra that connects
the business units with the computer units.
The business metrics are natural forecasting units,
business functions, transactions, and the number
of I/Os per transaction. The relationship among
these levels is shown in Figure 1. In general, a business may have one or more natural forecasting
units. Each natural forecasting unit may drive one or
more business functions. A business function may
Transaction Processing, Databases, and Fault-tolerant Systems
Figure 1  Levels of Business Activity Characterization
have multiple transactions, and a single transaction
may be activated by different business functions.
Every transaction issues a variety of I/O operations to one or more files, which may be physically located on zero, one, or more disks. This section
discusses the business metrics but does not discuss the physical distribution of I/Os across disks,
which is an implementation-specific item.
A natural forecasting unit is a macrolevel indicator of business volume. (It is also called a key volume indicator.) A business generally uses a volume
indicator to measure the level of success of the
business. The volume is often measured in time
intervals that reflect the business cycle, such as
weekly, monthly, or quarterly. For example, if business volume indicators were "number of ticket sales
per quarter," or "monthly production of widgets,"
then the corresponding natural forecasting units
would be "ticket sales" and "widgets." Natural forecasting units are used by high-level executives to
track the health of the overall business.
Business functions are a logical unit of work performed on behalf of a natural forecasting unit. For
example, within an airline reservation system, a
common business function might be "selling airline tickets." This business function may consist
of multiple interactions with the computer (e.g.,
flight inquiry, customer credit check). The completion of the sale terminates the business function,
and "airline ticket" acts as a natural forecasting unit
for the enterprise selling the tickets. The measurement metric for business functions is the number of business function occurrences per hour.
Business functions may be used by middle-level
managers to track the activity of their departments.
A transaction is an atomic unit of work for an
application, and transaction response time is the
primary performance measure seen by a user. Each
of the interactions mentioned in the above business function is a transaction. The measurement
metric for a transaction is the number of transaction occurrences per business function. Transactions may be used by low-level managers to track
the activity of their groups.
The bulk of commercial applications involves
the maintaining and moving of information. This
information is data that is often stored on permanent storage devices such as rotational disks, solid
state disks, or tapes. An I/O operation is the process
by which a transaction accesses that data. The measurement metric for the I/O profile is the number
of I/O operations per transaction. I/O operations
by each transaction are important to programmers
or system analysts.
In addition to issuing I/Os, each transaction
requires a certain amount of CPU time to handle
forms processing. (Forms processing time is not
illustrated in Figure 1.) The measurement metric
for forms processing time is the expected number
of fields. The number of input and output fields
per form are important metrics for users of a transaction processing application or programmer/system analysts.
By collecting information about a transaction
processing application at various levels, high-level
volume indicators are mapped to low-level units
of I/O activity. This mapping is fundamental to the
transaction processing sizing methodology.
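The mapping itself reduces to a chain of multiplications. The following sketch is our own illustration of that idea; the function names and sample volumes are hypothetical, not taken from Digital's tools.

    def transactions_per_second(nfu_per_hour, bf_per_nfu, txn_per_bf):
        """Map a natural forecasting unit volume (per peak hour) through
        business functions and transactions to a transaction rate."""
        txn_per_hour = nfu_per_hour * bf_per_nfu * txn_per_bf
        return txn_per_hour / 3600.0

    def physical_ios_per_second(tps, ios_per_txn):
        """Transactions in turn drive a number of physical disk I/Os each."""
        return tps * ios_per_txn

    # Hypothetical example: 1,800 ticket sales per peak hour, 1 business
    # function per sale, 3 transactions per function, 8 I/Os per transaction.
    tps = transactions_per_second(1800, 1, 3)        # 1.5 TPS
    print(tps, physical_ios_per_second(tps, 8))      # 1.5, 12.0 I/Os per second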
Performance goals play a particularly important role in the sizing of transaction processing systems. The major categories of performance goals commonly encountered in the transaction processing marketplace are bounds for
- Device utilization(s)
- Average response time for transactions
- Response time quantiles for transactions
For example, a customer might specify a required processor utilization of less than 70 percent. Such a constraint reflects the fact that system response time typically rises dramatically at higher processor utilizations. A common performance goal for response time is to use a transaction's average response time and response time quantiles. For example, the proposed system should have an average response time of x seconds, with 95 percent of all responses completing in less than or equal to y seconds, where x is less than y. Transaction response times are crucial for businesses. Poor response times translate directly into decreased productivity and lost revenues.
When a customer generates a formal Request For Proposal (RFP), the performance goals for the transaction processing system typically are specified in detail. The specification of goals makes it easier to define the performance bounds. For customers who supply only general performance goals, it is assumed that the performance goal takes the form of bounds for device utilizations.
Overall response time consists of incremental contributions by each major component of the overall system:
- Front-end processor
- Back-end processor
- Communications network
- Disk subsystem
A main objective in this approach to sizing was to identify and use specific metrics that could be easily counted for each major component. For instance, the number of fields per form could be a metric used for sizing front-end processors because that number is specific and easily counted. As the path of a transaction is followed through the overall system, the units of work appropriate for each component become clear. These units become the metrics for sizing that particular component.
The focus of this paper is on processor sizing with bounds on processor utilization. Processors generally constitute the major expense in any proposed system solution. Mistakes in processor sizing are very expensive to fix, both in terms of customer satisfaction and cost.

Sizing Metrics
Transaction processing applications permit a large
number of users to share access to a common database crucial to the business and usually residing on
disk memory. In an interactive transaction processing environment, transactions generally involve
some number of disk I/O operations, although the
number is relatively small compared to those
generated by batch transaction processing applications. CPU processing also is generally small and
consists primarily of overhead for layered transaction processing software products. Although
these numbers are small, they did influence the
sizing methodology in several ways.
Ratings for relative processor capacity in a transaction processing environment were developed
to reflect the ability of a processor to support disk
I/O activity (as observed in benchmark tests). In
addition, empirical studies of transaction processing applications showed that, for purposes of preliminary sizing, the number of disk I/Os generated by a transaction provides a good prediction of the required amount of CPU processing. Numerous
industry-standard benchmark tests for product
positioning were run on Digital's processors. These
processors were configured as back-end processors in a distributed configuration with different
software platforms.
The base workload for this benchmark testing is currently the Transaction Processing Performance Council's TPC Benchmark A (TPC-A, formerly the Debitcredit benchmark). The most complete set of benchmark testing was run under Digital's VAX ACMS transaction processing monitor and VAX Rdb/VMS relational database. Therefore, results from this software platform on all Digital processors were used to compute the first sizing metric
called the base load factor.
The base load factor is a high-level metric that
incorporates the contribution by all layered software products on the back-end processor to the
total CPU time per I/O operation. Load factors are
computed by dividing the total CPU utilization by
the number of achieved disk I/O operations per
second. (The CPU utilization is normalized in the
event that the processor is a Symmetrical Multiprocessing [SMP] system, to ensure that its value
falls within the range of 0 to 100 percent.) The
calculation of load factor yields the total CPU time,
in centiseconds (hundredths of seconds), required
to support an application's single physical I/O
operation.
The base load factors give the CPU time per I/O required to run the base workload, TPC-A, on any Digital processor in a back-end configuration using ACMS/Rdb. The CPU time per I/O can be estimated for any workload. This generalized metric is
called the application load factor.
To relate the base load factors to workloads other
than the base, an additional metric was defined
called the intensity factor. The metric calculation
for the intensity factor is the application load
factor divided by the base load factor. The value in
using intensity factors is that, once estimated (or
calculated for running applications), intensity factors can be used to characterize any application in
a way that can be applied across all processor types
to estimate processor requirements.
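Both metrics are simple ratios, as the following sketch shows; the numbers and names are hypothetical, not measured Digital values.

    def load_factor(cpu_utilization_pct, disk_ios_per_second):
        """CPU time per disk I/O, in centiseconds: total (normalized) CPU
        utilization divided by the achieved disk I/O rate."""
        return cpu_utilization_pct / disk_ios_per_second

    def intensity_factor(application_load_factor, base_load_factor):
        """How much CPU the application needs per I/O relative to TPC-A
        on the same processor and ACMS/Rdb software platform."""
        return application_load_factor / base_load_factor

    # Hypothetical measurement: 70 percent CPU busy while sustaining 200 disk
    # I/Os per second gives 0.35 centiseconds of CPU time per I/O.
    app_lf = load_factor(70.0, 200.0)
    print(app_lf, intensity_factor(app_lf, base_load_factor=0.25))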
Intensity factors vary based on the software
platform used. If a software platform other than a
combined VAX ACMS and VAX Rdb/VMS platform is
selected, the estimate of the intensity factor must
be adjusted to reflect the resource usage characteristics of the selected DECtp software platform.
To estimate an appropriate intensity factor for a
nonexistent application, judgment and experience
with similar applications are required. However,
measured cases from a range of DECtp applications
show relatively little variation in intensity factors.
Guidelines to help determine intensity factors are
included in the documentation for Digital's internally developed transaction processing sizing tools.
The work required by any transaction processing application is composed of two parts: the
application/database and the forms management.
This division of work corresponds to what occurs
in a distributed configuration, where the forms processing is off-loaded to one or more front-end processors. Load factors and intensity factors are
metrics that were developed to size the application/
database. To estimate the amount of CPU time
required for forms management, a forms-specific
metric is required. For a first-cut approximation,
the expected number of (input) fields is used as the
sizing metric. This number is obtained easily from
the business-level description of the application.
Sizing Formulas
This section describes the underlying algebra developed for processor selection. Different formulas
to estimate the CPU time required for both the
application/database and forms management were
developed. These formulas are used separately for
sizing back-end and front-end processors in a distributed configuration. The individual contributions of the formulas are combined for sizing a
centralized configuration.
The application/database is the work that takes
place on the back-end processor of a distributed
configuration. It is a function of physical disk
accesses. To determine the minimal CPU time
required to handle this load, processor utilization
is used as the performance goal, setting up an inequality that is solved to obtain a corresponding load factor. The resulting load factor is then compared to the table of base load factors to obtain a recommendation for a processor type. To reinforce this dependence of load factors on processor types, load factor x refers to the associated processor type x in the following calculations.
One method for estimating the average CPU time
per transaction is to multiply the number of I/Os
per transaction by the load factor x and the intensity factor. This yields CPU time per transaction,
expressed in centiseconds per transaction. By multiplying this product by the transactions per second rate, an expression for processor utilization is
derived. Thus processor utilization (expressed as a percentage scaled between 0 and 100 percent) is the number of transactions per second, times the number of I/Os per transaction, times load factor x,
times the intensity factor.
The performance goal is a CPU utilization that is
less than the utilization specified by the customer.
Therefore, the calculation used to derive the load
factor is the utilization percentage provided by the
customer, divided by the number of transactions
per second, times the number of I/Os per transaction, times the intensity factor.
Once computed, the load factor is compared to
those values in the base load factor table. The base
load factor equal to or less than the computed value
is selected, and its corresponding processor type,
x, is returned as the minimal processor required to
handle this workload.
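The inequality and table lookup described above can be sketched as follows; the base load factor table is invented for illustration (real values came from Digital's benchmark measurements), and the function names are ours.

    # Hypothetical base load factors: CPU centiseconds per disk I/O for TPC-A
    # on each processor type (smaller means a faster processor).
    BASE_LOAD_FACTORS = {"Processor A": 1.20, "Processor B": 0.60, "Processor C": 0.30}

    def required_load_factor(util_goal_pct, tps, ios_per_txn, intensity):
        """Largest load factor that still keeps CPU utilization under the goal:
        utilization = TPS * I/Os per txn * load factor * intensity factor."""
        return util_goal_pct / (tps * ios_per_txn * intensity)

    def select_processor(util_goal_pct, tps, ios_per_txn, intensity):
        """Return the minimal (slowest sufficient) processor: the one whose
        base load factor is the largest value not exceeding the requirement."""
        need = required_load_factor(util_goal_pct, tps, ios_per_txn, intensity)
        candidates = {p: lf for p, lf in BASE_LOAD_FACTORS.items() if lf <= need}
        return max(candidates, key=candidates.get) if candidates else None

    # e.g., a 70 percent utilization goal, 20 TPS, 8 I/Os per transaction, and
    # an intensity factor of 1.0 give a required load factor of 0.4375.
    print(select_processor(70.0, 20, 8, 1.0))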
The four input parameters that need to be estimated for inclusion in this inequality are
- Processor utilization performance goal (traditionally set at around 70 percent, but may be set higher for Digital's newer, faster processors)
- Target transactions per second (which may be derived from Digital's multilevel mapping of business metrics)
- I/Os per transaction (estimated from application description and database expertise)
- Intensity factor (estimated from experience with similar applications)
Note: Response time performance goals do not
appear in this formula. This sizing formula deals
strictly with ensuring adequate processor capacity.
However, these performance parameters (including the CPU service time per transaction) are fed
into an analytic queuing solver embedded in some
of the transaction processing sizing tools, which
produces estimates of response times.
Forms processing is the work that occurs either
on the front-end processor of a distributed configuration or in a centralized configuration. It is not a
function of physical disk accesses; rather, forms
processing is CPU intensive. To estimate the CPU time (in seconds) required for forms processing, the following simple linear equation is used:

    y = c(a + bz)

where y equals the CPU time for forms processing; a equals the CPU time per form per transaction instance, depending on the forms manager used; b equals the CPU time per field per transaction instance, depending on the forms manager used; z equals the expected number of fields; and c equals the scaling ratio, depending on the processor type. This equation was developed by feeding the results of controlled forms testing into a linear regression model to estimate the CPU cost per form and per field (i.e., a and b). The multiplicative term, c, is used to eliminate the dependence of factors a and b on the hardware platform used to run these tests.
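For completeness, the forms equation in code form; the coefficient values below are placeholders of ours, not measured regression results.

    def forms_cpu_time(fields, a=0.010, b=0.002, c=1.0):
        """CPU seconds for forms processing per transaction: y = c * (a + b*z).
        Real values of a and b come from regression over controlled forms
        tests; c rescales them to the target processor type."""
        return c * (a + b * fields)

    print(forms_cpu_time(20))   # 0.05 s for a hypothetical 20-field form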
Sizing Tools
Several sizing tools were constructed by using the
above formulas as starting points. These tools differ in the range of required inputs and outputs, and
in the expected technical sophistication of the user.
The first tool developed is for quick, first-approximation processor sizing. Currently embodied as a DECalc spreadsheet, with one screen for processor selection and one for transactions-per-second sensitivity analysis, it can handle back-end,
front-end, or centralized sizing. The first screen
shows the range of processors required, given the
target processor utilization, target transactions
per second, expected number of fields, and the
possible intensity factors and number of I/Os per
transaction. (Because the estimation of these last
two inputs generally involves the most uncertainty, the spreadsheet allows the user to input a
range of values for each.) The second screen turns
the analysis around, showing the resulting transaction-per-second ranges that can be supported by
the processor type selected by the user, given the
target processor utilization, expected number of
fields, and possible intensity factors and number of
I/Os per transaction.
The basic sizing formula addresses issues that
deal specifically with capacity but not with performance. To predict behavior such as response
times and queue lengths, modeling techniques that
employ analytic solvers or simulators are needed.
A second tool embeds an analytic queuing solver
within itself to produce performance estimates.
This tool is an automated system (i.e., a DECtp
application) that requests information from the
user according to the multilevel workload characterization methodology. This starts from general
business-level information and proceeds to request
successively more detailed information about the
application. The tool also contains a knowledge
base of Digital's product characteristics (e.g., processor and disk) and measured DECtp applications.
The user can search through the measured cases to
find a similar case, which could then be used to
provide a starting point for estimating key application parameters. The built-in product characteristics shield the user from the numeric details of the
sizing algorithms.
A third tool is a spin-off from the second tool.
This tool is a standalone analytic queuing solver with
a simple textual interface. The tool is intended for
the sophisticated user and assumes that the user
has completed the level of analysis required to be
able to supply the necessary technical input parameters. No automatic table lookups are provided.
However, for a completely characterized application, this tool gives the sophisticated user a quick
means to obtain performance estimates and run
sensitivity analyses. The complete DECtp software
platform necessary to run the second tool is not
required for this tool.
Data Collection
To use the sizing tools fully, certain data must be
available, which allows measured workloads to be
used to establish the basic metrics. Guidance in
sizing unmeasured transaction processing applications is highly dependent on developing a knowledge base of real-world transaction processing
application descriptions and measurements. The
kinds of data that need to be stored within the
knowledge base require the data collection tools to
gather information consistent with the transaction
processing sizing algebra.
For each transaction type and for the aggregate
of all the transaction types, the following information is necessary to perform transaction processing system sizing:
CPU time per disk I/O
Disk I/O operations per transaction
Transaction rates
Logical-to-physical disk I/O ratio
The CPU to I/O ratio can be derived from Digital's
existing instrumentation products, such as the VMS
Software Performance Monitor (SPM) and VAXcluster
Performance Advisor (VPA) products.7 Both products can record and store data that reflects CPU
usage levels and physical disk I/O rates.
The DECtrace product collects event-driven data.
It can collect resource items from layered software products, including VAX ACMS monitor, the
VAX Rdb/VMS and DBMS database systems, and if
instrumented, from the application program itself.
As an event collector, the DECtrace product can be
used to track the rate at which events occur.
The methods for determining the logical-to-physical disk I/O ratio per transaction remain open
for continuing study. Physical disk I/O operations
are issued based on logical commands from the
application. The find, update, or fetch commands
from an SQL program translate into from zero to
many thousands of physical disk I/O operations,
depending upon where and how data is stored.
Characteristics that affect this ratio include the
length of the data tables, number of index keys, and
access methods used to reach individual data items
(i.e., sequential, random).
Few tools currently available can provide data
on physical I/O operations for workloads in the
design stage. A knowledge base that stores the
logical-to-physical disk I/O activity ratio is the best
method available at this time for predicting that
value. The knowledge base in the second sizing
tool is beginning to be populated with application
descriptions that include this type of information.
It is anticipated that, as this tool becomes widely
used in the field, many more application descriptions will be stored in the knowledge base. Pooling
individual application experiences into one central
repository will create a valuable source of knowledge that may be utilized to provide better information for future sizing exercises.
Acknowledgments
The authors would like to acknowledge our colleagues in the Transaction Processing Systems
Performance Group whose efforts led to the development of these sizing tools, either through product characterization, system support, objective
critique, or actual tool development. In particular,
we would like to acknowledge the contributions
made by Jim Bouhana to the development of the
sizing methodology and tools.
References
1. W. Zahavi and J. Bouhana, "Business-Level Description of Transaction Processing Applications,"
CMG '88 Proceedings (1988): 720-726.
2. K. Omahen, "Practical Strategies for Config-
uring Balanced Transaction Processing Systems:'
IEEE COMPCON Spring '89 Proceedings (1989):
554-559.
3. W; Zahavi, "A First Approximation Sizing
Technique -The I/O Operation as a Metric of
CPU Power," CMG '90Conference Proceedings
(forthcoming December 10-14,1990).
4. "TPC BENCHMARK A - Standard Specification"
(Transaction Processing Performance Council, November 1989).
5. "A Measure of Transaction Processing Power,"
Datamation, vol. 31, no. 7 (April 1, 1985): 112- 118.
6. L. Wright, W. Kohler, and W. Zahavi, "The Digital DebitCredit Benchmark: Methodology and
Results," CMG '89 Conference Proceedings (1989): 84-92.
7. F. Habib, Y. Hsu, and K. Omahen, "Software Measurement Tools for VAX/VMS Systems," CMG
Transactions (Summer 1988): 47-78.
Ananth Raghavan
T. K. Rengarajan

Database Availability for Transaction Processing
A transaction processing system relies on its database management system to supply
high availability. Digital offers a network-based product, the VAX DBMS system,
and a relational data-based product, the VAX Rdb/VMS database system, for its
transaction processing systems. These database systems have several strategies to
survive failures, disk head crashes, revectored bad blocks, database corruptions,
memory corruptions, and memory overwrites by faulty application programs.
They use base hardware technologies and also employ novel software techniques,
such as parallel transaction recovery, recovery on surviving nodes of a VAXcluster
system, restore and rollforward operations on areas of the database, on-line
backup, verification and repair utilities, and executive mode protection of trusted
database management system code.
Modern businesses store critical data in database
management systems. Much of the daily activity
of business includes manipulation of data in the
database. As businesses extend their operations
worldwide, their databases are shared among
office locations in different parts of the world.
Consequently, these businesses require transaction processing systems to be available for use at
all times. This requirement translates directly to a
goal of perfect availability for database management systems.
VAX DBMS and VAX Rdb/VMS database systems are
based on network and relational data models, respectively. Both systems use a kernel of code that is
largely responsible for providing high availability.
This layer of code is maintained by the KODA group.
KODA is the physical subsystem for VAX DBMS and
VAX Rdb/VMS database systems. It is responsible for
all I/O, buffer management, concurrency control,
transaction consistency, locking, journaling, and
access methods.
In this paper, we define database availability,
and describe downtime situations and how such
situations can be resolved. We then discuss the
mechanisms that have been implemented to provide minimal loss of availability.
Database Availability
The unit of work in transaction processing systems
is a transaction. We therefore define database availability as the ability to execute transactions. One
way the database management system provides
high availability is by guaranteeing the properties of transactions: atomicity, serializability, and
durability.1 For example, if a transaction that has
made updates to the database is aborted, other
transactions must not be allowed to see these
updates; the updates made by the aborted transaction must be removed from the database before
other transactions may use that data. Yet, data that
has not been accessed by the aborted transaction
must continue to be available to other transactions.
Downtime is the term used to refer to periods
when the database is unavailable. Downtime is
caused by either an unexpected failure (unexpected downtime) or scheduled maintenance on
the database (scheduled downtime). Such classifications of downtime are useful. Unexpected downtime is caused by factors that are beyond the
control of the transaction processing system. For
example, a disk failure is quite possible at any
time during normal processing of transactions.
However, scheduled downtime is entirely within
the control of the database administrator. High
availability demands that we eliminate scheduled
downtime and ensure fast system recovery from
unexpected failures.
The layers of the software and hardware services
which compose a transaction processing system
are dependent on one another for high availability.
The dependency among these services is illustrated in Figure 1. Each service depends on the
availability of the service in the lower layers.

Figure 1 Layers of Availability in Transaction Processing Systems (application program; database management system; operating system, VMS; hardware, CPU and disk; general environment)
Errors and failures can occur in any layer, but may
not be detected immediately. For example, in the
case of a database management system, the effects
of a database corruption may not be apparent until
long after the corruption (error) has occurred.
Hence it is difficult to deal with such errors. On the
other hand, failures are noticed immediately.
Failures usually make the system unavailable and
are the cause of unexpected downtime.
Each layer can provide only as much availability
as the immediate lower layer. Hence we can also
express the perfect-availability goal of a database
management system as the goal of matching the
availability of the immediately lower layer, which
in our case is the operating system.
At the outset, it is clear that a database management system is layered on top of an operating system
and hence can be only as available as the underlying operating system. However, a database management
system is in general not as available as the underlying layer because of the need to guarantee the
properties of transactions.
Unexpected Downtime
In this section we discuss the causes of unexpected downtime and the techniques that minimize downtime.
A database monitor must be started on a node before a user's process running on that node can access a database. The monitor oversees all database activity on the node. It allows processes to attach to and detach from databases and detects failures. On detecting a failure, the monitor starts a process to recover the transactions that did not complete because of the failure. Note that this database monitor is different from the TP monitor.2

Application Program Exceptions
Although transaction processing systems are based
on the client/server architecture, Digital's database
systems are process based. The privileged database
management system code is packaged in a shareable library and linked with the application programs. Therefore, bugs in the applications have
a good chance of affecting the consistency of the
database. Such bugs in applications are one type of
failure that can make the database unavailable.
The VAX DBMS and VAX Rdb/VMS systems guard against this class of failure by executing the database management system code in the VAX executive mode. Since application programs execute in
user mode, they do not have access to data structures used by the database management system.
When a faulty application program attempts such an access, the VMS operating system detects it and
generates an exception. This exception then forces an image rundown of the application program.
In general, when an image rundown is initiated, Digital's database management products use the
condition-handling facility of VMS to abort the transaction. Condition handling of image rundown is
performed at two levels. Two condition handlers are established, one in user mode and the other in
kernel mode. The user mode exit handler is usually invoked, which rolls back the current transaction
and unbinds it from the database. In this case, the rest of the users on the system are not affected at
all. The database remains available. The execution of the user mode exit handler is, however, not
guaranteed by the VMS operating system. Under some abnormal circumstances, the user mode exit
handlers may not be executed at all. In such circumstances, the kernel mode exit handler is
invoked by the VMS system. This handler resides in the database monitor. The monitor starts a
database recovery (DBR) process. It is the responsibility of the DBR process to roll back the effects of
the aborted transaction. To do this, the DBR process first establishes a database freeze. This freeze
prevents other processes from acquiring locks that
were held by the aborted transaction and hence from seeing and updating uncommitted data. (The VMS lock
manager releases all locks held by a process when that process dies.) The DBR process then proceeds
to roll back the aborted transaction.
Code Corruptions
It is important to prevent coding mistakes within the DBMS from irretrievably corrupting the database. To protect the database management system
from coding mistakes, internal data structure consistency is examined at different points in the
code. If any inconsistency is found, a bug-check utility is called that dumps the internal database
format to a file. The utility then raises an exception that is handled by the monitor, and the DBR
process is started as described above.
To deal with corruptions to the database that are undetected with this mechanism, an explicit utility
is provided that verifies the structural consistency of the database. This verify utility may be executed
on-line, while users are still accessing the database. Such verification may also be executed by a
database administrator (DBA) in response to a bug-check dump. Once such a corruption is detected,
an on-line utility provides the ability to repair the database.
In general, corruption in databases causes unexpected downtime. Digital provides the means of
detecting such corruptions on-line and repairing them on-line through recovery utilities.
Process Failure
In the VMS system, a process failure is always preceded by image rundown of the current image running as part of the process. Therefore, a process
failure is detected by the database monitor, which
then starts a DBR process to handle recovery.
Node Failure
Among the many mechanisms Digital provides for
availability is node failover within a cluster. When
a node fails, another node on the cluster detects
the failure and rolls back the lost transactions from
the failed node. Thus the failure of one node does
not cause transactions on other active nodes of the
cluster to come to a halt (except for the time the
DBR process enforces a freeze). It is the database
monitor that detects node failure and starts a
recovery process for every lost transaction on the
failed node. The database becomes available as
soon as recovery is complete for all the users on
the failed node.
Power Failure
Power failure is a hardware failure. As soon as
power is restored, the VMS system boots. When a
process attaches to the database, a number of messages are passed between the process that is attaching and the monitor. If the database is corrupt
(because of power failure), the monitor is so
informed by the attaching process, and again the
monitor starts recovery processes to return the
database to a consistent state. The database becomes
available as soon as recovery is complete for all
such failed users.
As described above, recovery is always accomplished by the monitor process starting DBR processes to do the recovery. The only difference in
the case of process, node, or cluster failure is the
mechanism by which the monitor is informed of
the failure.
Disk Head Crash
Some failures can result in the loss or corruption of
the data on the stable storage device (disk). Digital
has a mechanism for bringing the database back to
a consistent state in such cases.
A disk head crash is a failure of hardware that is
usually characterized by the inability to read from
or write to the disk. Hence database storage areas
residing on that disk are unavailable and possibly
irretrievable. A disk head crash automatically aborts
transactions that need to read from or write to that
disk. In addition, recovery of these aborted transactions is not possible since the recovery processes need access to the same disk. In this case,
the database is shut down and access is denied until
the storage areas on the failed disk are brought on-line. Areas are restored from backups and then
rolled forward until consistent with the rest of the
database. The after image journal (AIJ) files are used
to roll the areas forward. As soon as all the areas on
the failed disk have been restored onto a good disk
and rolled forward, the database becomes available.
Bad Disk Blocks
Bad blocks are hardware errors that often are not
detected when they happen. The bad blocks are
revectored, and the next time the disk block is
read, an error is reported. Bad blocks simply mean
that the contents of a disk block are lost forever.
The database administrator detects the problem
only when a database application fails to fetch data
on the revectored block. Such an error may cause a
certain transaction or a set of transactions to fail,
no matter how many attempts are made to execute
the transactions. This failure constitutes reduced
availability; parts of the database are unavailable to
transactions. Exactly how much of the database
remains available depends on which blocks were
revectored.
The mechanism provided to reduce the possible
downtime is early detection. Digital's database
systems provide a verification utility that can be
executed while users are running transactions.
The verification utility checks the structural consistency of the database. Once a bad block is
detected by such a utility, that area of the database
may be restored and rolled forward. These two
operations make the whole database temporarily
unavailable; however, the bad block is corrected,
and future downtime is avoided. The downtime
caused by the bad block may be traded off against
the downtime needed to restore and roll forward.
Site Failure
A site failure occurs when neither the computers nor the disks are available. A site failure is usually
caused by a natural disaster such as an earthquake.
The best recourse for recovery is archival storage.
Digital provides mechanisms to back up the database and AIJ files to tape. These tapes must then be
stored at a site away from the site at which the
database resides. Should a disaster happen, these
backup tapes can be used to restore the database.
However, the recovery may not be complete. It
cannot restore the effects of those committed transactions that were not backed up to tape.
After a disaster, the database can be restored
and rolled forward to the state of the completion of
the last AIJ that was backed up to tape. Any transactions that committed after the last AIJ was backed
up cannot be recovered at the alternate site. Such
transaction losses can be minimized by frequently
backing up the AIJ files.

Memory Errors
Memory errors are quite infrequent, and when they happen, they usually are not detected. If the error happens to a data record, it may never be detected by any utility, but may be seen as incorrect data by the user. If the verification utility is run on-line, it may also detect the errors. Again, the database may only be partially available, as in the case of bad blocks. However, it is possible to repair the database while users are still accessing the database. Digital's database management products provide explicit repair facilities for this purpose. The loss of availability during repair is not worse than the loss due to the memory error itself.
As explained previously, the database monitor plays an important part in ensuring database consistency and availability. Most unexpected failure scenarios are detected by the monitor, which then starts recovery processes. In addition, some failures might require the use of backup files to restore the database.

Scheduled Downtime
Most database systems have scheduled maintenance operations that require a database shutdown. Database backup for media recovery and verification to check structural consistency are examples of operations that may require scheduled downtime. In this section we describe ways to perform many of these operations while the database is executing transactions.

Backup
Digital's database systems allow two types of transactions: update and "snapshot." The ability to back
up data on-line depends on the snapshot transaction
capability of the database.
Database backup is a standard way of recovering
from media failures. Digital's database systems provide the ability to do transaction-consistent backups of data on-line while users continue to change
the database.
The general scheme for snapshot transactions is
as follows. The update transactions of the database
preserve the previous versions of the database
records in the snapshot file. All versions of a database record are chained. Only the current version
of the record is in the database area. The older versions are kept in the snapshot area. The versions
of the records are tagged with the transaction
numbers (TSNs). When a snapshot transaction (for
example, a database backup) needs to read a database record, it traverses the chain for that database
record and then uses the appropriate version of
the record.
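As one way to picture this version-chain scheme (this is an illustrative sketch, not KODA's actual data structures), the C fragment below walks a chained record and returns the newest version visible to a given snapshot TSN, assuming the usual rule that a snapshot sees only versions written at or before its own TSN.

#include <stddef.h>

/* Hypothetical record version, chained from the current version in the
 * database area to older versions kept in the snapshot area. */
struct record_version {
    unsigned long tsn;                  /* transaction number that wrote this version */
    struct record_version *older;       /* next older version, or NULL */
    /* ... record data ... */
};

/* One plausible visibility rule: a snapshot transaction uses the newest
 * version written by a transaction no later than its own snapshot TSN. */
const struct record_version *
snapshot_read(const struct record_version *current, unsigned long snapshot_tsn)
{
    const struct record_version *v;
    for (v = current; v != NULL; v = v->older)
        if (v->tsn <= snapshot_tsn)
            return v;                   /* appropriate version for this reader */
    return NULL;                        /* record did not yet exist for this snapshot */
}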
There are two modes of database operation with
respect to snapshot activity. In one mode, all update
transactions write snapshot copies of any records
they update. In the deferred snapshot mode, the
updates cause snapshot copies to be written only
if a snapshot transaction is active and requires old
versions of a record. In this mode, a snapshot transaction cannot start until all currently active update
transactions (which are not writing snapshot
records) have completed; that is, the snapshot
transaction must wait for a quiet point in time. If
there are either active or pending snapshot transactions when an update transaction starts, the
update transaction must write snapshot copies.
Here we see a trade-off between update transactions and snapshot transactions. The database
is completely available to snapshot transactions
if all update transactions always write snapshot
copies. On the other hand, if the deferred snapshot
mode is enabled, update transactions need not
write snapshot copies if a snapshot transaction
is not active. This approach obviously results in
some loss of availability to snapshot transactions.

AIJ Backup
The backup and the AIJ log are the two mechanisms that provide media recovery for Digital's database management products. The AIJ file is continuously written to by all user processes updating the database. We need to provide some ability to back up the AIJ file since it monotonically increases in size and eventually fills up the disk. Digital's database systems offer the ability to back up the AIJ file to tape (or another device) on-line. The only restriction is that a quiet point must be established for a short period during which the backup operation takes place. A quiet point is defined as a point when the database is quiescent, i.e., there are no active transactions.
Verification
Database corruption can also result in downtime.
Although database corruption is not probable, it
is possible. Any database system that supports
critical data must provide facilities to ensure the
consistency of the database. Digital's database management systems provide verification utilities that
scan the database to check the structural consistency of the database. These utilities may also be
executed on-line through the use of snapshot
transactions.
On-line Schema Changes
Digital's database management systems allow users
to change metadata on-line, while users are still
accessing the database. Although this may be standard for relational database management systems,
it is not standard for network databases. The VAX
DBMS system provides a utility called the database
restructuring utility (DRU) to allow for on-line
schema modifications.
Acknowledgments
Many engineers have contributed to the development of the algorithms described in this paper. We
have chosen not to enumerate all such contributions. However, we would like to recognize the contributions of Peter Spiro, Ashok Joshi, Jeff Arnold,
and Rick Anderson who, together with the authors,
are members of the KODA team.
References
1. P. Bernstein, W. Emberton, and V. Trehan, "DECdta - Digital's Distributed Transaction Processing Architecture," Digital Technical Journal,
vol. 3, no. 1 (Winter 1991, this issue): 10-17.
2. T. Speer and M. Storm, "Digital's Transaction Processing Monitors," Digital Technical Journal,
vol. 3, no. 1 (Winter 1991, this issue): 18-32.
Peter M. Spiro
Ashok M. Joshi
T. K. Rengarajan
Designing an Optimized
Transaction Commit
Protocol
Digital's database products, VAX Rdb/VMS and VAX DBMS, share the same database
kernel called KODA. KODA uses a grouping mechanism to commit many concurrent
transactions together. This feature enables high transaction rates in a transaction
processing (TP) environment. Since group commit processing affects the maximum
throughput of the transaction processing system, the KODA group designed and
implemented several grouping algorithms and studied their performance characteristics. Preliminary results indicate that it is possible to achieve up to a 66 percent
improvement in transaction throughput by using more efficient grouping designs.
Digital has two general-purpose database products,
Rdb/VMS software, which supports the relational
data model, and VAX DBMS software, which supports the CODASYL (Conference on Data Systems
Languages) data model. Both products layer on top
of a database kernel called KODA. In addition to
other database services, KODA provides the transaction capabilities and commit processing for these
two products.
In this paper, we address some of the issues relevant to efficient commit processing. We begin by
explaining the importance of commit processing
in achieving high transaction throughput. Next, we
describe in detail the current algorithm for group
commit used in KODA. We then describe and contrast several new designs for performing a group
commit. Following these discussions, we present
our experimental results. And, finally, we discuss
the possible direction of future work and some
conclusions. No attempt is made to present formal
analysis or exhaustive empirical results for commit
processing; rather, the focus is on an intuitive
understanding of the concepts and trade-offs,
along with some empirical results that support our
conclusions.
Commit Processing
To follow a discussion of commit processing, two
basic terms must first be understood. We begin this
section by defining a transaction and the "moment
of commit."
A transaction is the execution of one or more
statements that access data managed by a database
system. Generally, database management systems
guarantee that the effects of a transaction are atomic,
that is, either all updates performed within the context of the transaction are recorded in the database,
or no updates are reflected in the database.
The point at which a transaction's effects become
durable is known as the "moment of commit." This
concept is important because it allows database
recovery to proceed in a predictable manner after
a transaction failure. If a transaction terminates
abnormally before it reaches the moment of commit, then it aborts. As a result, the database system
performs transaction recovery, which removes all
effects of the transaction. However, if the transaction has passed the moment of commit, recovery
processing ensures that all changes made by the
transaction are permanent.
Transaction Profile
For the purpose of analysis, it is i~sefulto divide a
transaction processed by KODA into four phases:
t h e transaction start phase, the data manipulation
phase, t h e logging phase, and the commit p r o c e s s
ing phase. Figure 1 illustrates the phases of a transaction in time sequence. The first three phases are
collectively referred to as "the average transaction's
CPU cost (excluding t h e cost of commit)" and the
last phase (commit) as "the cost of writing a group
commit buffer."'
Figure 1 Phases in the Execution of a Transaction (start, data manipulation, logging, commit)
The transaction start phase involves acquiring
a transaction identifier and setting up control
data structures. This phase usually incurs a fixed
overhead.
The data manipulation phase involves executing
the actions dictated by an application program.
Obviously, the time spent in this phase and the
amount of processing required depend on the
nature of the application.
At some point a request is made to complete the
transaction. Accordingly, in KODA, the transaction
enters the logging phase, which involves updating
the database with the changes and writing the
undo/redo information to disk. The amount of work
done in the logging phase is usually small and constant (less than one I/O) for transaction processing.
Finally, the transaction enters the commit processing phase. In KODA, this phase involves writing
commit information to disk, thereby ensuring that
the transaction's effects are recorded in the database and now visible to other users.
For some transactions, the data manipulation
phase is very expensive, possibly requiring a large
number of I/Os and a great deal of CPU time. For
example, if 500 employees in a company were to
get a 10 percent salary increase, a transaction would
have to fetch and modify every employee/salary
record in the company database. The commit processing phase, in this example, represents 0.2 percent of the transaction duration. Thus, for this class
of transaction, commit processing is a small fraction of the overall cost. Figure 2 illustrates the profile of a transaction modifying 500 records.
Figure 2 Profile of a Transaction Modifying 500 Records
In contrast, for transaction processing applications such as hotel reservation systems, banking
applications, stock market transactions, or the
telephone system, the data manipulation phase is
usually short (requiring few I/Os). Instead, the logging and commit phases comprise the bulk of the
work and must be optimized to allow high transaction throughput. The transaction profile for a
transaction modifying one record is shown in
Figure 3. Note that the commit processing phase
represents 36 percent of the transaction duration,
in this example.
Figure 3 Profile of a Transaction Modifying One Record
Group Commit
Generally, database systems must force write information to disk in order to commit transactions. In
the event of a failure, this operation permits recovery processing to determine which failed transactions were active at the time of their termination
and which ones had reached their moment of commit. This information is often in the form of lists of
transaction identifiers, called commit lists.
Many database systems perform an optimized
version of commit processing where commit information for a group of transactions is written to disk
in one I/O operation, thereby amortizing the cost
of the I/O across multiple transactions. So, rather
than having each transaction write its own commit
list to disk, one transaction writes to disk a commit list containing the commit information for a
number of other transactions. This technique is
referred to in the literature as "group commit."
Group commit processing is essential for achieving high throughput. If every transaction that
reached the commit stage had to actually perform
an I/O to the same disk to flush its own commit
information, the throughput of the database system would be limited to the I/O rate of the disk. A
magnetic disk is capable of performing 30 I/O
operations per second. Consequently, in the
absence of group commit, the throughput of the
system is limited to 30 transactions per second
(TPS). Group commit is essential to breaking this
performance barrier.
There are several variations of the basic algorithms for grouping multiple commit lists into a
single I/O. The specific group commit algorithm
chosen can significantly influence the throughput
and response times of transaction processing. One
study reports throughput gains of as much as 25
percent by selecting an optimal group commit
algorithm.1
At high transaction throughput (hundreds of
transactions per second), efficient commit processing provides a significant performance advantage.
There is little information in the database literature about the efficiency of different methods of
performing a group commit. Therefore, we analyzed several grouping designs and evaluated their
performance benefits.
Factors Affecting Group Commit
Before proceeding to a description of the experiments, it is useful to have a better understanding of
the factors affecting the behavior of the group commit mechanism. This section discusses the group
size, the use of timers to stall transactions, and the
relationship between these two factors.
Group Size An important factor affecting group
commit is the number of transactions that participate in the group commit. There must be several
transactions in the group in order to benefit from
I/O amortization. At the same time, transactions
should not be required to wait too long for the
group to build up to a large size, as this factor
would adversely affect throughput.
It is interesting to note that the incremental
advantage of adding one more transaction to a
group decreases as the group size increases. The
incremental savings is equal to 1/(G x (G + 1)),
where G is the initial group size. For example, if
the group consists of 2 transactions, each of them
does one-half a write. If the group size increases
to 3, the incremental savings in writes will be
(1/2 - 1/3), or 1/6 per transaction. If we do the same
calculation for a group size incremented from 10
to 11, the savings will be (1/10 - 1/11), or 1/110 of a
write per transaction.
In general, if G represents the group size, and I
represents the number of I/Os per second for the
disk, the maximum transaction commit rate is I x G
TPS. For example, if the group size is 45 and the rate
is 30 I/Os per second to disk, the maximum transaction commit rate is 30 x 45, or 1350 TPS. Note that
a grouping of only 10 will restrict the maximum
TPS to 300 TPS, regardless of how powerful the
computer is. Therefore, the group size directly
affects the maximum transaction throughput of
the transaction processing system.
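The arithmetic in the two preceding paragraphs can be restated in a few lines of C; the sketch below merely evaluates the formulas 1/(G x (G + 1)) and I x G for the examples given in the text and is not taken from any database or sizing tool.

#include <stdio.h>

/* Incremental per-transaction write savings when a group grows from
 * size g to g + 1: 1/g - 1/(g+1) = 1 / (g * (g + 1)). */
static double incremental_savings(int g)
{
    return 1.0 / ((double)g * (g + 1));
}

/* Maximum transaction commit rate with group size g and a disk that
 * can perform io_per_second writes: I x G. */
static double max_commit_rate(int g, double io_per_second)
{
    return io_per_second * g;
}

int main(void)
{
    printf("Savings 2->3: %.4f writes per transaction\n", incremental_savings(2));
    printf("Savings 10->11: %.4f writes per transaction\n", incremental_savings(10));
    printf("Max TPS, group of 45 at 30 I/Os per second: %.0f\n",
           max_commit_rate(45, 30.0));
    return 0;
}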
Use of Timers to Stall Transactions One of the
mechanisms to increase the size of the commit
group is the use of timers. Timers are used to
stall the transactions for a short period of time
(on the order of tens of milliseconds) during commit processing. During the stall, more transactions
enter the commit processing phase and so the
group size becomes larger. The stalls provided by
the timers have the advantage of increasing the
group size, and the disadvantage of increasing the
response time.
Trade-offs This section discusses the trade-offs
between the size of the group and the use of timers
to stall transactions. Consider a system where there
are 50 active database programs, each repeatedly
processing transactions against a database. Assume
that on average each transaction takes between
0.4 and 0.5 seconds. Thus, at peak performance, the
database system can commit approximately 100
transactions every second, each program actually
completing two transactions in the one-second
time interval. Also, assume that the transactions
arrive at the commit point in a steady stream at different times.
If transaction commit is stalled for 0.2 seconds to allow the commit group to build up, the
group then consists of about 20 transactions
(0.2 seconds x 100 TPS). In this case, each transaction only incurs a small delay at commit time,
averaging 0.10 seconds, and the database system
should be able to approach its peak throughput of
100 TPS. However, if the mechanism delays commit
processing for one second, an entirely different
behavior sequence occurs. Since the transactions
complete in approximately 0.5 seconds, they accumulate at the commit stall and are forced to wait
until the one-second stall completes. The group
size then consists of 50 transactions, thereby maximizing the I/O amortization. However, throughput
is also limited to 50 TPS, since a group commit is
occurring only once per second.
Thus, it is necessary to balance response time
and the size of the commit group. The longer the
stall, the larger the group size; the larger the group
size, the better the I/O amortization that is achieved.
However, if the stall time is too long, it is possible
to limit transaction throughput because of wasted
CPU cycles.
Motivation for Our Work
The concept of using commit timers is discussed
in great detail by Reuter.1 However, there are significant differences between his group commit scheme
and our scheme. These differences prompted the
work we present in this paper.
In Reuter's scheme, the timer expiration triggers
the group commit for everyone. In our scheme, no
single process is in charge of commit processing
based on a timer. Our commit processing is performed by one of the processes desiring to write a
commit record. Our designs involve coordination
between the processes in order to elect the group
committer (a process).
Reuter's analysis to determine the optimum value
of the timer based on system load assumes that the
total transaction duration, the time taken for commit processing, and the time taken for performing
the other phases are the same for all transactions.
In contrast, we do not make that assumption. Our
designs strive to adapt to the execution of many different transaction types under different system
loads. Because of the complexity introduced by
allowing variations in transaction classes, we do
not attempt to calculate the optimal timer values as
does Reuter.

Cooperative Commit Processing
In this section, we present the stages in performing the group commit with cooperating processes, and we describe, in detail, the grouping design currently used in KODA, the Commit-Lock Design.

Group Committer
Assume that a number of transactions have completed all data manipulation and logging activity and are ready to execute the commit processing phase. To group the commit requests, the following steps must be performed in KODA:
1. Each transaction must make its commit information available to the group committer.
2. One of the processes must be selected as the "group committer."
3. The other members of the group need to be informed that their commit work will be completed by the group committer. These processes must wait until the commit information is written to disk by the group committer.
4. Once the group committer has written the commit information to stable storage, it must inform the other members that commit processing is completed.

Commit-Lock Design
The Commit-Lock Design uses a VMS lock to generate groups of committing transactions; the lock is
also used to choose the group committer.
Once a process completes all its updates and
wants to commit its transaction, the procedure is
as follows. Each transaction must first declare its
intent to join a group commit. In KODA, each process uses the interlocked quelle instructions of the
VAX system running VMS software to enqueue a
block of commit information, known as a commit
packet, onto a globally accessible commit queue.
The commit queue and the commit packets are
located in a shared, writeable global section.
Each process then issues a lock request for the
commit lock. At this point, a number of other
processes are assumed to be going through the
same sequence; that is, they are posting their
commit packets and making lock requests for the
commit lock. One of these processes is granted
the commit lock. For the time being, assume the
process that currently acquires the lock acts as
the group committer.
The group committer, first, counts the number
of entries on the commit queue, providing the
number of transactions that will be part of the
group commit. Because of the VAX interlocked
queue instructions, scanning to obtain a count and
concurrent queue operations by other processes
can proceed simultaneously. The group committer
uses the information in each commit packet to
format the commit block which will be written
to disk. In KODA, the commit block is used as a
commit list, recording which transactions have
committed and which ones are active. In order to
commit for a transaction, the group committer
must mark each current transaction as completed.
In addition, as an optimization, the group committer assigns a new transaction identifier for each
process's next transaction. Figure 4 illustrates a
commit block ready to be flushed to disk.
Once the commit block is modified, the group
committer writes it to disk in one atomic I/O. This
is the moment of commit for all transactions in
the group. Thus, all transactions that were active
and took part in this group commit are now stably
marked as committed. In addition, as explained
above, these transactions now have new transaction identifiers. Next, the group committer sets a
commit flag in each commit packet for all recently
committed transactions, removes all commit packets from the commit queue, and, finally, releases
the commit lock. Figure 5 illustrates a committed group with new transaction identifiers and with commit flags set.

Figure 4 Commit Block Ready to Be Flushed to Disk (a commit queue of commit packets, each holding a current transaction identifier, a next transaction identifier, and a commit flag, formatted into the commit block)

Figure 5 Committed Group
At this point, the remaining processes that were
part of the group commit are, in turn, granted
the commit lock. Because their commit flags are
already set, these processes realize they do not
need to perform a commit and, thus, release the
commit lock and proceed to the next transaction.
After all these committed processes release the
commit lock, a process that did not take part in the
group commit acquires the lock, notices it has not
been committed, and, therefore, initiates the next
group commit.
There are several interesting points about using
the VMS lock as the grouping mechanism. Even
though all the transactions are effectively committed after the commit block I/O has completed, the
transactions are still forced to proceed serially;
that is, each process is granted the lock, notices
that it is committed, and then releases the lock.
So there is a serial procession of lock enqueues/
dequeues before the next group can start.
This serial procession can be made more concurrent by, first, requesting the lock in a shared mode,
hoping that all processes committed are granted
the lock in unison. However, in practice, some processes that are granted the lock are not committed.
These processes must then request the lock in an
exclusive mode. If this lock request is mastered on
a different node in a VAXcluster system, the lock
enqueue/dequeues are very expensive.
Also, there is no explicit stall time built into
the algorithm. The latency associated with the
lock enqueue/dequeue requests allows the commit
queue to build up. This stall is entirely dependent
on the contention for the lock, which in turn
depends on the throughput.
Group Commit Mechanisms: Our New Designs
To improve on the transaction throughput provided
by the Commit-Lock Design, we developed three
different grouping designs, and we compared their
performances at high throughput. Note that the
basic paradigm of group commit for all these
designs is described in the Group Committer section. Our designs are as follows.
Commit-Stall Design
In the Commit-Stall Design, the use of the commit
lock as the grouping mechanism is eliminated.
Instead, a process inserts its commit packet onto
the commit queue and, then, checks to see if it is
the first process on the queue. If so, the process
acts as the group committer. If not, the process
schedules its own wake-up call, then sleeps. Upon
waking, the process checks to see if it has been
committed. If so, the process proceeds to its next
transaction. If not, the process again checks to see
if it is first on the commit queue. The algorithm
then repeats, as described above.
This method attempts to eliminate the serial
wake-up behavior displayed by using the commit
lock. Also, the duration for which each process
stalls can be varied per transaction to allow explicit
control of the group size. Note that if the stall time
is too small, a process may wake up and stall many
times before it is committed.
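The loop described above can be sketched roughly as follows; the queue, sleep, and commit primitives are hypothetical stand-ins for the KODA and VMS services the text mentions, not their actual interfaces.

struct commit_packet;   /* opaque commit packet; contents not shown */

/* Hypothetical primitives standing in for KODA/VMS services. */
extern void enqueue_commit_packet(struct commit_packet *p);
extern int  first_on_commit_queue(const struct commit_packet *p);
extern int  is_committed(const struct commit_packet *p);
extern void act_as_group_committer(void);   /* count queue, write commit block, set flags */
extern void sleep_milliseconds(int ms);

/* One transaction's commit request under the Commit-Stall Design. */
void commit_stall(struct commit_packet *p, int stall_ms)
{
    enqueue_commit_packet(p);                /* post commit packet on the commit queue */
    for (;;) {
        if (first_on_commit_queue(p)) {
            act_as_group_committer();        /* commits the whole group in one I/O */
            return;
        }
        sleep_milliseconds(stall_ms);        /* schedule own wake-up call, then sleep */
        if (is_committed(p))
            return;                          /* committed by another process; next transaction */
        /* not committed: check the queue again; may wake and stall several times */
    }
}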
Willing-to-Wait Design
As we have seen before, a delay in the commit
sequence is a convenient means of converting a
response time advantage into a throughput gain. If
we increase the stall time, the transaction duration
increases, which is undesirable. At the same time,
the grouping size for group commit increases,
which is desirable. The challenge is to determine
the optimal stall time. Reuter presented an analytical way of determining the optimal stall time for a
system with transactions of the same type.1
Ideally, we would like to devise a flexible scheme
that makes the trade-off we have just described in
real time and determines the optimum commit
stall time dynamically. However, we cannot determine the optimum stall time automatically, because
the database management system cannot judge
which is more important to the user in a general
customer situation - the transaction response time
or the throughput.
The Willing-to-Wait Design provides a user parameter called WTW time. This parameter represents
the amount of time the user is willing to wait for
the transaction to complete, given this wait will
benefit the complete system by increasing throughput. WTW time may be specified by the user for each
transaction. Given such a user specification, it is
easy to calculate the commit stall to increase the
group size. This stall equals the WTW time minus
the time taken by the transaction thus far, but only
if the transaction has not already exceeded the
WTW time. For example, if a transaction comes to
commit processing in 0.5 second and the WTW time
is 2.0 seconds, the stall time is then 1.5 seconds. In
addition, we can make a further improvement by
reducing the stall time by the amount of time
needed for group commit processing. This delta
time is constant, on the order of 50 milliseconds
(one I/O plus some computation).
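A minimal C sketch of this stall calculation follows; the function name is illustrative, and the 50-millisecond figure simply mirrors the delta mentioned above.

#include <stdio.h>

/* Commit stall under the Willing-to-Wait Design: wait out the remainder of the
 * WTW time, less the roughly constant cost of group commit processing itself. */
static double wtw_stall_seconds(double wtw_time, double elapsed, double commit_cost)
{
    double stall = wtw_time - elapsed - commit_cost;
    return (stall > 0.0) ? stall : 0.0;      /* no stall if WTW is already exceeded */
}

int main(void)
{
    /* Example from the text: 0.5 s elapsed, 2.0 s WTW; a ~50 ms commit cost
     * trims the 1.5 s stall to about 1.45 s. */
    printf("stall = %.2f seconds\n", wtw_stall_seconds(2.0, 0.5, 0.05));
    return 0;
}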
The WTW parameter gives the user control over
how much of the response time advantage (if any)
may be used by the system to improve transaction
throughput. The choice of an abnormally high value
of WTW by one process only affects its own transaction response time; it does not have any adverse
effect on the total throughput of the system. A low
value of WTW would cause small commit groups,
which in turn would limit the throughput. However,
this can be avoided by administrative controls on
the database that specify a minimum WTW time.
Hiber Design
The Hiber Design is similar to the Commit-Stall
Design, but, instead of each process scheduling its
own wake-up call, the group committer wakes up
all processes in the committed group. In addition,
the group committer must wake up the process
that will be the next group committer.
Note, this design exhibits a serial wake-up behavior like the Commit-Lock Design; however, the
mechanism is less costly than the VMS lock used by
the Commit-Lock Design. In the Hiber Design, if
a process is not the group committer, it simply
sleeps; it does not schedule its own wake-up call.
Therefore, each process is guaranteed to sleep and
wake up at most once per commit, in contrast to
the Commit-Stall Design. Another interesting characteristic of the Hiber Design is that the group
committer can choose to either wake up the next
group committer immediately, or it can actually
schedule the wake-up call after a delay. Such a delay
allows the next group size to become larger.
Experiments
We implemented and tested the Commit-Lock, the
Commit-Stall, and the Willing-to-Wait designs in
KODA. The objectives of our experiments were
To find out which design would yield the maximum throughput under response time constraints
To understand the performance characteristics of the designs
In the following sections, we present the details
of our experiments, the results we obtained, and
some observations.
Details of the Experiments
The hardware used for all of the following tests was
a VAX 6340 with four processors, each rated at 3.6
VAX units of performance (VUP). The total possible
CPU utilization was 400 percent and the total processing power of the computer was 14.4 VUPs. As
the commit processing becomes more significant
in a transaction (in relation to the other phases),
the impact of the grouping mechanism on the transaction throughput increases. Therefore, in order
to accentuate the performance differences between
the various designs, we performed our experiments
using a transaction that involved no database activity except to follow the commit sequence. So, for
all practical purposes, the TPS data presented
in this paper can be interpreted as "commit
sequences per second." Also, note that our system
imposed an upper limit of 50 on the grouping size.

Results
Using the Commit-Lock Design, transaction processing bottlenecked at 300 TPS. Performance greatly improved with the Commit-Stall Design; the maximum throughput was 464 TPS. The Willing-to-Wait Design provided the highest throughput, 500 TPS. Using this last design, it was possible to achieve up to a 66 percent improvement over the less-efficient Commit-Lock Design.
Although both timer schemes, i.e., the Commit-Stall and Willing-to-Wait designs, needed tuning to set the parameters and the Commit-Lock Design did not, we observed that the maximum throughput obtained using timers is much better than that obtained with the lock. These results were similar to those of Reuter.
For our Willing-to-Wait Design, the minimum transaction duration is the WTW time. Therefore, the maximum TPS, the number of servers, and the WTW stall time, measured in milliseconds, are related by the formula: number of servers x 1000/WTW = maximum TPS. For example, our maximum TPS for the WTW design was obtained with 50 servers and 90 milliseconds WTW time. Using the formula, 50 x 1000/90 = 555. The actual TPS achieved was 500, which is 90 percent of the maximum TPS. This ratio is also a measure of the effectiveness of the experiment.
During our experiments, the maximum group size observed was 45 (with the Willing-to-Wait Design). This is close to the system-imposed limit of 50 and, so, we may be able to get better grouping with higher limits on the size of the group.

Observations
In the Commit-Stall and the Willing-to-Wait designs,
given a constant stall, if the number of servers is
increased, the TPS increases and then decreases.
The rate of decrease is slower than the rate of
increase. The TPS decrease is due to CPU overloading. The TPS increase is due to more servers trying
to execute transactions and better CPU utilization.
Figure 6 illustrates how TPS varies with the number of servers, given a constant WTW stall time.
Again, in the stalling designs, for a constant number of servers, if the stall is increased, the TPS
increases and then decreases. The TPS increase is
due to better grouping and the decrease is due to
CPU underutilization. Figures 7 and 8 show the
effects on TPS when you vary the commit-stall
time or the WTW time, while keeping the number
of servers constant.

Figure 6 Transactions per Second in Relationship to the Number of Servers, Given a Constant Willing-to-Wait Time (the Willing-to-Wait stall time is a constant 100 milliseconds)
Figure 7 Transactions per Second in Relationship to the Commit-Stall Time, Given a Constant Number of Servers (the number of servers equals 50)
Figure 8 Transactions per Second in Relationship to the WTW Time, Given a Constant Number of Servers (the number of servers equals 65)
To maximize TPS with the Commit-Stall Design,
the following "mountain-climbing" algorithm was
useful. This algorithm is based on the previous two
observations. Start with a reasonable value of the
stall and the number of servers, such that the CPU
is underutilized. Then increase the number of
servers. CPU utilization and the TPS increase.
Continue until the CPU is overloaded; then, increase
the stall time. CPU utilization decreases, but the
TPS increases due to the larger group size.
This algorithm demonstrates that increasing
the number of servers and the stall by small
amounts at a time increases the TPS, but only up
to a limit. After this point, the TPS drops. When
close to the limit, the two factors may be varied
alternately in order to find the true maximum.
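The search can be sketched as a simple hill-climbing loop; measure_tps and cpu_utilization are hypothetical measurement hooks, and the step sizes and thresholds are arbitrary choices for illustration.

/* Hypothetical measurement hooks: run the benchmark once and report results. */
extern double measure_tps(int servers, int stall_ms);
extern double cpu_utilization(void);            /* percent; up to 400 on four CPUs */

/* Hill-climbing search over server count and stall time for the peak TPS. */
void mountain_climb(int *servers, int *stall_ms)
{
    double best = measure_tps(*servers, *stall_ms);
    for (;;) {
        /* Add servers while the CPU still has headroom and TPS keeps rising. */
        while (cpu_utilization() < 390.0) {
            double t = measure_tps(*servers + 5, *stall_ms);
            if (t <= best)
                break;
            best = t;
            *servers += 5;
        }
        /* CPU saturated: lengthen the stall to enlarge the commit group. */
        double t = measure_tps(*servers, *stall_ms + 10);
        if (t <= best)
            break;                              /* past the peak; stop searching */
        best = t;
        *stall_ms += 10;
    }
}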
Table 1 shows the performance measurements of
the Commit-Stall Design. Comments are included
in the table to highlight the performance behavior
the data supports.
The same mountain-climbing algorithm is modified slightly to obtain the maximum TPS with the
Willing-to-Wait Design. The performance measurements of this design are presented in Table 2. As
we have seen before, the maximum TPS with this
design is inversely proportional to the WTW time,
while CPU is not fully utilized. The first four rows
of Table 2 illustrate this behavior. The rest of the
table follows the same pattern as Table 1.
The Willing-to-Wait Design performs slightly
better than the Commit-Stall Design by adjusting
to the variations in the speed at which different
servers arrive at the commit point. Such variations
are compensated for by the variable stalls in the
Willing-to-Wait Design. Therefore, if the variation
is high and the commit sequence is a significant
portion of the transaction, we expect the Willing-to-Wait Design to perform much better than the
Commit-Stall Design.
Future Work
There is scope for more interesting work to further
optimize commit processing in the KODA database
kernel. First, we would like to perform experiments on the Hiber Design and compare it to the
other designs. Next, we would like to explore ways
of combining the Hiber Design with either of the
two timer designs, Commit-Stall or Willing-to-Wait. This may be the best design of all the above,
with a good mixture of automatic stall, low overhead, and explicit control over the total stall time.
In addition, we would like to investigate the use of
timers to ease system management. For example, a
system administrator may increase the stalls for
all transactions on the system in order to ease CPU
contention, thereby increasing the overall effectiveness of the system.
Table 1 Commit-Stall Design Performance Data
Columns: Number of Servers | Commit Stall (Milliseconds) | CPU Utilization (Percent)* | TPS | Comments
(The numeric entries are not legible in this copy; the row comments were:)
Starting point
Increased number of servers, therefore, higher TPS
Increased number of servers, therefore, CPU saturated
Increased stall, therefore, CPU less utilized
Increased number of servers, maximum TPS
"Over-the-hill" situation, same strategy of further increasing the number of servers does not increase TPS
No benefit from increasing number of servers and stall
No benefit from just increasing stall
* Four processors were used in the experiments. Thus, the total possible CPU utilization is 400 percent.
Table 2 Willing-to-Wait Performance Data
Columns: Number of Servers | Willing-to-Wait Stall (Milliseconds) | CPU Utilization (Percent)* | TPS | Comments
(The numeric entries are not legible in this copy; the row comments were:)
Starting point, CPU not saturated
Decreased stall to load CPU, CPU still not saturated
Decreased stall again
Further decreased stall, CPU almost saturated
Increased number of servers, CPU more saturated
Increased stall to lower CPU usage, maximum TPS
"Over-the-hill" situation, same strategy of further increasing number of servers does not increase TPS
No benefit from just increasing stall
* Four processors were used in the experiments. Thus, the total possible CPU utilization is 400 percent.
Conclusions
We have presented the concept of group commit
processing as well as a general analysis of various
options available, some trade-offs involved, and
some performance results indicating areas for possible improvement. It is clear that the choice of the
algorithm can significantly influence performance
at high transaction throughput. We are optimistic
that with some further investigation an optimal
commit sequence can be incorporated into Rdb/VMS
and VAX DBMS with considerable gains in transaction processing performance.
Acknowledgments
We wish to acknowledge the help provided by Rabah Mediouni in performing the experiments discussed in this paper. We would like to thank Phil Bernstein and Dave Lomet for their careful reviews of this paper. Also, we want to thank the other KODA group members for their contributions during informal discussions. Finally, we would like to acknowledge the efforts of Steve Klein who designed the original KODA group commit mechanism.
References
1. P. Helland, H. Sammer, J. Lyon, R. Carr, P. Garrett, and A. Reuter, "Group Commit Timers and High Volume Transaction Processing Systems," High Performance Transaction Systems, Proceedings of the 2nd International Workshop (September 1987).
2. D. Gawlick and D. Kinkade, "Varieties of Concurrency Control in IMS/VS Fast Path," Database Engineering (June 1985).
William F. Bruckert
Carlos Alonso
James M. Melvin
Verification of the First Fault-tolerant VAX System
The fault-tolerant character of the VAXft 3000 system required that plans be made early in the development stages for the verification and test of the system. To ensure proper test coverage of the fault-tolerant features, engineers built fault-insertion points directly into the system hardware. During the verification process, test engineers used hardware and software fault insertion in directed and random test forms. A four-phase verification strategy was devised to ensure that the VAXft system hardware and software was fully tested for error recovery that is transparent to applications on the system.
The VAXft 3000 system provides transparent fault tolerance for applications that run on the system. Because the 3000 includes fault-tolerant features, verification of the system was unlike that ordinarily conducted on VAX systems. To facilitate system test, the verification strategy outlined a four-phase approach which would require hardware to be built into the system specifically for test purposes. This paper presents a brief overview of the VAXft system architecture and then describes the methods used to verify the system's fault tolerance.
VAXft 3000 Architectural Overview
The VAXft fault-tolerant system is designed to
recover from any single point of hardware failure.
Fault tolerance is provided transparently for all
applications running under the VMS operating
system. This section reviews the implementation
of the system to provide background for the main
discussion of the verification process.
The system comprises two duplicate systems,
called zones. Each zone is a fully functional computer with enough elements to run an operating
system. These two zones, referred to as zone A and
zone B, are shown in Figure 1, which illustrates the
duplication of the system components. The two
independent zones are connected by duplicate
cross-link cables. The cabinet of each zone also
includes a battery, a power regulator, cooling fans,
and an AC power input. Each zone's hardware has
sufficient error checking to detect all single faults
within that zone.
Figure 2 is a block diagram of a single zone with
one I/O adapter. Note the portions of the zone
labeled dual-rail and single-rail. The dual-rail portions of the system have two independent sets
of hardware performing the same operations.
Correct operation is verified by comparison. The
fault-detection mechanism for the single-rail I/O
modules combines checking codes and communication protocols.
The system performs I/O operations by sending
and receiving message packets. The packets are
exchanged between the CPU and various servers,
including disks, Ethernet, and synchronous lines.
These message packets are formed and interpreted
in the dual-rail portion of the system. They are protected in the single-rail portion of the machine by
check codes which are generated and checked in
the dual-rail portion of the machine. Corrupted
packets can be retransmitted through the same or
alternate paths.
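As a rough illustration of this protection scheme (the check code, packet layout, and path functions below are invented for the example and are not the VAXft formats), a sender can append a code computed in the trusted logic, and a receiver can reject any packet whose code does not verify, prompting a retry over the same or an alternate path.

    # Illustrative only: a toy check-code scheme for message packets.
    # zlib.crc32 merely stands in for "a code generated and checked in the
    # dual-rail logic"; the real VAXft check codes are not described here.
    import zlib

    def protect(payload: bytes) -> bytes:
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def deliver(packet: bytes) -> bytes:
        payload, code = packet[:-4], packet[-4:]
        if zlib.crc32(payload).to_bytes(4, "big") != code:
            raise ValueError("corrupted packet")      # caller retries
        return payload

    def send(payload, paths):
        packet = protect(payload)
        for path in paths:                            # same path first, then alternates
            try:
                return deliver(path(packet))
            except ValueError:
                continue
        raise RuntimeError("no path delivered an uncorrupted packet")

    # Example: a path that corrupts a byte, then a clean alternate path.
    noisy = lambda p: bytes([p[0] ^ 0xFF]) + p[1:]
    clean = lambda p: p
    print(send(b"disk write request", [noisy, clean]))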
In the normal mode of fault-tolerant operation,
both zones execute the same instruction at the
same time. The four processors (two in each zone)
appear to the operating system as a single logical
CPU. The hardware supplies the detection and
recovery facilities for faults detected in the CPU
and memory portions of the system. A defective
CPU module and its memory are automatically
removed from service by the hardware, and the
remaining CPU continues processing.
Error handling for the I/O interconnections is
managed differently. The paths to and from I/O
adapters are duplicated for checking purposes. If a
fault is detected, the hardware retries the operation.
If the retry is successful, the error is logged, and
operation continues without software assistance.
Figure 1: A Dual-zone VAXft System (zone A and zone B connected by crosslink cables)
Figure 2: Single-zone Structure of a VAXft 3000 System (block diagram showing the dual-rail portion, with cache, memory control, memory interface, CPU module, and crosslink to the zone B crosslink cables, and the single-rail portion, with the module interconnect and I/O adapter)
If the retry is unsuccessful, the Fault-tolerant System Services (FTSS) software performs error recovery. FTSS is a layered software product that is utilized with every VAXft 3000 system. It provides the software necessary to complete system error recovery. For system recovery from a failed I/O device, an alternate path or device is used. All recoverable faults have an associated maximum threshold value. If this threshold is exceeded, FTSS performs appropriate device reconfiguration.
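The threshold rule in the last two sentences can be sketched in a few lines. Everything below (the class, the per-path fault counter, the failover step) is an invented illustration of the general idea, not FTSS code.

    # Hedged sketch of threshold-based error handling; not FTSS code.
    class DevicePath:
        def __init__(self, name, threshold, alternate=None):
            self.name, self.threshold, self.alternate = name, threshold, alternate
            self.recoverable_faults = 0

        def report_fault(self, log):
            self.recoverable_faults += 1
            log(f"{self.name}: recoverable fault #{self.recoverable_faults}")
            if self.recoverable_faults > self.threshold:
                log(f"{self.name}: threshold exceeded, reconfiguring")
                return self.alternate or self      # fail over if an alternate exists
            return self                            # otherwise keep using this path

    # Example: the third recoverable fault on path A triggers failover to path B.
    alternate = DevicePath("disk path B", threshold=2)
    path = DevicePath("disk path A", threshold=2, alternate=alternate)
    for _ in range(3):
        path = path.report_fault(print)
    print("now using:", path.name)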
Verification of a Fault-tolerant VAX System
This section entails a discussion of the types of system tests and the fault-insertion techniques used to ensure the correct operation of the VAXft system. In addition, the four-phase verification strategy and the procedures involved in each phase are reviewed.
There are two types of system tests: directed and random. Directed tests, which test specific hardware or software features, are used most frequently in computer system verification and follow a strict test sequence. Complex systems, however, cannot be completely verified in a directed fashion.¹ As a case in point, an operating system running on a processor has innumerable states. Directed tests verify functional operation under a particular set of conditions. They may not, however, be used to verify that same functionality under all possible system conditions.
In comparison, random testing allows multiple test processes to interact in a pseudo-random or random fashion. In random testing, test coverage is increased with additional run-time. Thus, once the proper test processes are in place, the need to develop additional tests in order to increase coverage is eliminated. This type of testing also reduces the effects of the biases of the engineers generating the tests. While directed testing can provide only a limited level of coverage, this coverage level can be well understood. Random testing offers a potentially unbounded level of coverage; however, quantifying this coverage is difficult if not impossible.
To achieve the proper level of verification, the VAXft verification utilized a balance of directed and random testing. Directed testing was used to achieve a certain base level of functionality, and random testing was used to expand the level of coverage.
To permit testing of system fault tolerance in a practical amount of time, some form of fault insertion is required. The reliability of components used in computer systems has been improving, and more importantly, the number of components used to implement any function has been dramatically decreasing. These factors have produced a corresponding reduction in system failure rates. Given the high reliability of today's machines, it is not practical from a verification standpoint to verify a system by letting it run until failures occur.
Conceptually, faults can be inserted in two ways. First, memory locations and registers can be corrupted to mimic the results of gate-level faults (software fault insertion). Second, gate-level faults may be inserted directly into the hardware (hardware fault insertion). There are advantages to both techniques. One advantage of software-implemented fault insertion is that no embedded hardware support is required.² The advantage of hardware fault insertion, on the other hand, is that faults are more representative of actual hardware failures and can reveal unanticipated side effects from a gate-level failure. To utilize hardware fault insertion, either a mechanism must be designed into the system, or an external insertion device must be developed once the hardware is available. Given the physical feature size of the components used today, it is virtually impossible to achieve adequate fault-insertion coverage through an external fault-insertion mechanism.
The error detection and recovery mechanism determines which fault-insertion technique is suitable for each component. Some examples illustrate this point. For the lockstep portion of the VAXft 3000 CPUs, software fault insertion is not suitable because the lockstep functionality prevents corruption of memory or registers when faults occur. Therefore, hardware faults cannot be mimicked by modifying memory contents. However, the software fault-insertion technique was suitable to test the I/O adapters since the system handles faults in the adapters by detecting the corruption of data. Hardware fault insertion was not suitable because the I/O adapters were implemented with standard components that did not support hardware fault insertion.
Because the verification strategy for the 3000 was considered a fundamental part of the system development effort, fault-insertion points were built directly into the system hardware. The amount of logic necessary to implement fault insertion is relatively small. The goals of the fault-insertion hardware were to
Eliminate any corruption of the environment under test that could result from fault insertion. For example, if a certain type of system write operation is required to insert a fault, then every test case will be done on a system that is in a "post-fault-insertion" state.
Enable the user to distribute faults randomly across the system.
Allow insertion of faults during system operation.
Enable testing of transient and solid faults.
The fault-insertion points are accessed through
a separate serial interface bus isolated from the
operating hardware. This separate interface ensures
that the environment under test is unbiased by
fault insertion.
Even with hardware support for fault insertion,
only a small number of fault-insertion points can
be implemented relative to the total number possible. Where the number of fault-insertion points is
small, the selection of the fault-insertion points
is important to achieve a random distribution.
Fault-insertion points were designed into most of
the custom chips in the VAXft system. When the
designers were choosing the fault-insertion points,
a single bit of a data path was considered sufficient
for data path coverage. Since a significant portion
of the chip area is consumed by data paths, a high
level of coverage of each chip was achieved with
relatively few fault-insertion points. The remaining
fault-insertion points could then be applied to the
control logic. Coverage of this logic was important
because control logic faults result in error modes
that are more unpredictable than data path failures.
The effect that a given fault has on the system
depends on the current system operation and when
in that operation the fault was inserted. In the
3000, for example, a failure of bit 3 in a data path
will have significantly different behavior depending upon whether the data bit was incorrect during
the address transmission portion of a cycle or during the succeeding data portion. Therefore, the
timing of the fault insertion was pseudo-random.
The choice of pseudo-random insertion was based
on the fact that the fault-insertion hardware operated asynchronously to the system under test. This
meant that faults could be inserted at any time,
without correlation to the activity of the system
under test.
Faults may be transient or solid in nature. For
design purposes, a solid fault was defined as a failure that will be present on retry of an operation.
A transient fault was defined as a fault that will not
be present on retry of the operation. Transient
faults do not require the removal of the device that
experienced the fault; solid faults do require device
removal. Since the system reacts differently to transient and hard faults, both types of faults had to
be verified in the VAXft system. Therefore, it was required that the fault-insertion hardware be capable of inserting solid or transient faults. Solid faults were inserted by continually applying the fault-insertion signal. Transient faults were inserted by applying the fault-insertion signal only until the machine detected an error.
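The two insertion modes can be expressed schematically. The control functions named below (asserting and releasing a fault-insertion point, polling for error detection) are placeholders for the separate serial interface described earlier, not its actual programming interface.

    # Editorial sketch of solid versus transient fault insertion.
    def insert_fault(assert_fault, release_fault, error_detected, solid):
        assert_fault()                     # drive the chosen fault-insertion point
        if solid:
            return                         # leave it asserted: still present on retry
        while not error_detected():        # transient: hold the fault only until
            pass                           # the machine reports the error ...
        release_fault()                    # ... then remove it before any retry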
As noted earlier, the verification strategy utilized
both hardware and software fault insertion. The
hardware fault-insertion mechanisms allowed faults
to be inserted into any system environment, including diagnostics, exercisers, and the VMS operating
system. As such, it was used for initial verification
as well as regression testing of the system. The verification strategy for the VAXft 3000 system involved
a multiphase effort. Each of the following four verification phases built upon the previous phase:
1. Hardware verification under simulation
2. Hardware verification with system exerciser and
fault insertion
3. System software verification with fault insertion
4. System application verification with fault
insertion
Figure 3 shows the functional layers of the VAXft 3000 system in relation to the verification phases. The numbered brackets to the right of the diagram correlate to the testing coverage of each layer. For example, the system software verification, phase 3, verified the VMS system, Fault-tolerant System Services (FTSS), and the hardware platform.
The following sections briefly describe the four phases of the VAXft verification.
Hardware Verification under Simulation
Functional design verification using software simulation is inherently slow in a design as large as the
VAXft 3000 system. To use resources most efficiently,
a verification effort must incorporate a number of
different modeling levels, which means trading off
detail to achieve other goals such as speed.³
VAXft 3000 simulation occurred at two levels: the
module level and the system level. Module-level
simulation verified the base functionality of each
module. Once this verification was complete, a system-level model was produced to validate the
intermodule functionality. The system-level model
consisted of a full dual-rail, dual-zone system with an I/O adapter in each zone. At the final stage, full system testing was performed.
Figure 3: Functional Layers of the VAXft 3000 System in Relation to the Verification Phases (layers, from the user application and host-based volume shadowing, through VMS 5.4 and the Fault-tolerant System Services, down to the VAXft 3000 hardware, with brackets showing the coverage of each test phase)
More than 500 directed error test cases were
developed for gate-level system simulation. For each
test, the test environment was set up on a fully
operational system model, and then the fault was
inserted. A simulation controller was developed to
coordinate the system operations in the simulation
environment. The simulation controller provided
the following control over the testing:
Initialization of all memory elements and certain
system registers to reduce test time
Setup of all memory data buffers to be used in
testing
Automated test execution
Automated checking of test results
Log of test results
For each test case, the test environment was
selected from the following: memory testing, I/O
register access, direct memory access (DMA) traffic, and interrupt cycles. In any given test case, any
number of the previous tests could be run. These
environments could be run with or without faults
inserted. In addition, each environment consisted
of multiple test cases. In an error handling test case,
the proper system environment required for the
test was set, and then the fault was inserted into
the system. The logic simulator used was designed
to verify logic design. When an illegal logic condition was detected, it produced an error response.
When a fault insertion resulted in an illegal logic
condition, the simulator responded by invalidating the test. Because of this, a great deal of time was
spent to ensure that faults were inserted in a way
that would not generate illegal conditions. Each
test case was considered successful only when the
system error registers contained the correct data
and the system had the ability to continue operation after the fault.
Hardware Verification with System Exerciser and Fault Insertion
After the prototypes were available, the verification
effort shifted from simulation to fault insertion on
the hardware. The goal was to insert faults using an
exerciser that induced stressful, reproducible hardware activity and that allowed us to analyze and
debug the fault easily.
Exerciser test cases were developed to stress
the various hardware functions. The tests were
designed to create maximum interrupt and data
transfer activity between the CPU and the I/O
adapters. These functions could be tested individually or simultaneously. The exerciser scheduler
provided a degree of randomness such that the
interaction of functions was representative of a
real operating system. The fault-insertion hardware
was used to achieve a random distribution of fault
cases across the system.
Because it was possible to insert initial faults
while specific functions were performed, a great
degree of reproducibility was achieved that aided
debug efforts. Once the full suite of tests worked
correctly, fault insertion was performed while the
system continually switched between all functions. This testing was more representative of actual
faults in customer environments, but was less
reproducible.
As previously mentioned, the hardware fault-insertion tool allowed the insertion of both transient and solid failures. The VAXft 3000 hardware recovers from transient failures and utilizes
software recovery for hard failures. Since the goal
of phase 2 testing was to verify the hardware, the
focus was on transient fault insertion. Two criteria
for each error case determined the success of the
test. First and foremost, the system must continue
to run and to produce correct results. Second, the
error data that the system captures must be correct
based on the fault that was inserted. Correct error data is important because it is used to identify the
failing component both for software recovery and
for servicing.
Although the simulation environment of phase 1
was substantially slower than phase 2, it provided
the designers with more information. Therefore, when problems were discovered on the prototypes used in phase 2, the failing case was transferred to the simulator for further debugging. The hardware verification also validated the models and test procedures used in the simulation environment.
System Software Verification with Fault Insertion
In parallel with hardware verification, the VAXft 3000 system software error handling capabilities were tested. This phase represented the next higher level of testing. The goal was to verify the VAX functionality of the 3000 system as well as the software recovery mechanisms.
Digital has produced various test packages to verify VAX functionality. Since the VAXft 3000 system incorporates a VAX chip set used in the VAX 6000 series, it was possible to use several standard test packages that had been used to verify that system.¹
Fault-tolerant verification, however, was not
addressed by any of the existing test packages.
Therefore, additional tests were developed by combining the existing functional test suite with the
hardware fault-insertion tool and software fault-insertion routines. Test cases used included cache
failure, clock failure, memory failure, interconnect failures, and disk failures. These failures were
applied to the system during various system operations. In addition, servicing errors were also tested
by removing cables and modules while the system
was running. The completion criteria for tests
included the following:
Detection of the fault
Isolation of the failed hardware
Continuation of the test processes without
interruption
System Application Verification with
Fault Insertion
The goals for the final phase of the VAXft 3000
verification were to run an application with fault
insertion and to demonstrate that any system
fault recovery action had no effect on the process integrity and data integrity of the application. The application used in the testing was based on the standard DebitCredit banking benchmark and was implemented using the DECintact layered product. The bank has 10 branches, 100 tellers, and 3,600 customer accounts (10 tellers and 360 accounts per branch). Traffic on the system was simulated using terminal emulation process (VAX RTE) scripts representing bank teller activity. The transaction rate was initially one transaction per second (TPS) and was varied up to the maximum TPS rate to stress the system load.
The general test process can be described as
follows:
1. Started application execution. The terminal emulation processes emulating the bank tellers were started and continued until the system was operating at the desired TPS rating.
2. Invoked fault insertion. A fault was selected at random from a table of hardware and software faults. The terminal emulation process submitted stimuli to the application before, during, and after fault insertion.
3. Stopped terminal emulation process. The application was run until a quiescent state was reached.
4. Performed result validation. The process integrity and data integrity of the application were validated. (This loop is sketched schematically below.)
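The four steps above amount to a simple harness loop. The sketch below is an editorial summary; every name in it (the emulator, system, and validation hooks) is a placeholder for the corresponding activity in the text.

    # Hedged sketch of the application-level fault test loop.
    import random

    def application_fault_test(fault_table, target_tps, emulator, system, validate):
        emulator.start()                        # 1. start bank-teller traffic
        emulator.ramp_to(target_tps)
        fault = random.choice(fault_table)      # 2. pick a hardware or software fault
        system.insert_fault(fault)              #    while traffic continues
        emulator.stop()                         # 3. let the system reach a quiescent state
        system.wait_until_quiescent()
        return validate(system)                 # 4. check process and data integrity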
All the meaningful events were logged and timestamped during the experiments. Process integrity
was proved by verifying continuity of transaction processing through failures. The time stamps on the transaction executions and the system error
logs allowed these two independent processes to
be correlated.
The proof of data integrity consisted of using the following consistency rules for transactions (a small checking sketch follows the list):
1. The sum of the account balances is equal to the
sum of the teller balances, which is equal to the
sum of the branch balances.
2. For each branch, the sum of the teller balances is
equal to the branch balance.
3. For each transaction processed, a new record
must be added to the history file.
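The three rules translate directly into a small check. The record layout assumed below (dictionaries keyed by identifier, with each teller and account carrying its branch identifier) is invented for the example; only the rules themselves come from the text.

    # Hedged sketch of the data-integrity check; the data layout is assumed.
    def data_integrity_ok(accounts, tellers, branches, history, tx_processed):
        # Rule 1: the account, teller, and branch balance totals all agree.
        if not (sum(a["balance"] for a in accounts.values())
                == sum(t["balance"] for t in tellers.values())
                == sum(b["balance"] for b in branches.values())):
            return False
        # Rule 2: within each branch, teller balances add up to the branch balance.
        for branch_id, branch in branches.items():
            teller_total = sum(t["balance"] for t in tellers.values()
                               if t["branch"] == branch_id)
            if teller_total != branch["balance"]:
                return False
        # Rule 3: one new history record per transaction processed.
        return len(history) == tx_processed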
Application verification under fault insertion
served as the final level of fault-tolerant validation.
Whereas the previous phases ensured that the various components required for fault tolerance operated properly, the system application verification
demonstrated that these components could operate together to provide a fully fault-tolerant system.
The process of verifying fault tolerance requires a strong architectural test plan. This plan must be developed early in the design cycle because hardware support for testing may be required. The verification plan must demonstrate cognizance of the capabilities and limitations at each phase of the development cycle. For example, the speed of simulation prohibits verification of software error recovery in a simulation environment. Also, when a system is implemented with VLSI technology, the ability to physically insert faults into the system by means of an external mechanical mechanism may not be adequate to properly verify the correct system error recovery. These and other issues must be addressed before the chips are fabricated or adequate error recovery verification may not be possible. Inadequate error recovery verification directly increases the risk of real, unrecoverable faults resulting in system outages.
The verification plan for the VAXft 3000 system consisted of the following phases and objectives:
1. Hardware simulation with fault insertion verified error detection, hardware recovery, and error data capture.
2. System exerciser with fault insertion enhanced the coverage of the hardware simulation effort.
3. System software with fault insertion verified software error recovery and reporting.
4. System application verification with fault insertion verified the transparency of the system error recovery to the application running on the system.
The test of any fault-tolerant system is to survive a real fault while running a customer application. Removing a module from a machine may be an impressive test, but machines rarely fail as a result of modules falling out of the backplane. The initial test of the VAXft 3000 system showed that the system would survive most of the faults introduced. Tests also revealed problems that would have resulted in system outages if left uncorrected. System enhancements were made in the areas of system recovery actions and repair call out. Whereas some of the problems were simple coding errors, others were errors in carefully reviewed and documented algorithms. Simply put, the collective wisdom of the designers was not always sufficient to reach the degree of accuracy desired for this fault-tolerant system.
As the VAXft product family evolves, performance and functional enhancements will be available. The test processes described in this paper will remain in use, so that every future release of software will be better than the previous one. The combination of hardware and software fault insertion, coupled with physical system disruption, allows testing to occur at such a greatly accelerated rate that all testing performed will be repeated for every new release.
References
1. J. Croll, L. Camilli, and A. Vaccaro, "Test and Qualification of the VAX 6000 Model 400 System," Digital Technical Journal, vol. 2, no. 2 (Spring 1990): 73-83.
2. J. Barton, E. Czeck, Z. Segall, and D. Siewiorek, "Fault Injection Experiments Using FIAT (Fault Injection-based Automated Testing)," IEEE Transactions on Computers, vol. 39, no. 4 (April 1990).
3. R. Calcagni and W. Sherwood, "VAX 6000 Model 400 CPU Chip Set Functional Design Verification," Digital Technical Journal, vol. 2, no. 2 (Spring 1990): 64-72.
Further Readings
The Digital Technical Journal
publishes papers that explore the technological foundations of Digital's major products. Each Journal focuses on at least one product area and presents a compilation of papers written by the engineers who developed the product. The content for the Journal is selected by the Journal Advisory Board.
Topics covered in previous issues of the Digital
Technical Journal are as follows:
VAX 9000 Series
Vol. 2, No. 4, Fall 1990
The technologies and processes used to build
Digital's first mainframe computer, including
papers on the architecture, microarchitecture,
chip set, vector processor, and power system,
as well as CAD and test methodologies
DECwindows Program
Vol. 2, No. 3, Summer 1990
An overview and descriptions of the enhancements
Digital's engineers have made to MIT's X Window
System in such areas as the server, toolkit, interface
language, and graphics, as well as contributions
made to related industry standards
VAX 6000 Model 400 System
Vol. 2, No. 2, Spring 1990
The highly expandable and configurable midrange
family of VAX systems that includes a vector processor, a high-performance scalar processor, and
advances in chip design and physical technology
Compound Document Architecture
Vol. 2, No. 1, Winter 1990
The CDA family of architectures and services that
support the creation, interchange, and processing
of compound documents in a heterogeneous
network environment
CVAX-based Systems
Vol. 1, No. 7, August 1988
CVAX chip set design and multiprocessing architecture of the mid-range VAX 6200 family of
systems and the MicroVAX 3500/3600 systems
Software Productivity Tools
Vol. 1, No. 6, February 1988
Tools that assist programmers in the development
of high-quality, reliable software
VAXcluster Systems
Vol. 1, No. 5, September 1987
System communication architecture, design and
implementation of a distributed lock manager,
and performance measurements
VAX 8800 Family
Vol. 1, No. 4, February 1987
The microarchitecture, internal boxes, VAXBI bus,
and VMS support for the VAX 8800 high-end multiprocessor, simulation, and CAD methodology
Networking Products
Vol. 1, No. 3, September 1986
The Digital Network Architecture (DNA), network performance, LANbridge 100, DECnet-ULTRIX and DECnet-DOS, monitor design
MicroVAX I1 System
Vol. 1, No. 2, March 1986
The implementation of the microprocessor and
floating point chips, CAD suite, MicroVAX workstation, disk controllers, and TK50 tape drive
VAX 8600 Processor
Vol. 1, No. 1, August 1985
The system design with pipelined architecture,
the I-box, F-box, packaging considerations, signal
integrity, and design for reliability
Distributed Systems
Vol. 1, No. 9, June 1989
Products that allow system resource sharing
throughout a network, the methods and tools to
evaluate product and system performance
Subscriptions to the Digital Technical Journal are available on a yearly, prepaid basis. The subscription rate is $40.00 per year (four issues). Requests should be sent to Cathy Phillips, Digital Equipment Corporation, ML01-3/B68, 146 Main Street, Maynard, MA 01754, U.S.A. Subscriptions must be paid in U.S. dollars, and checks should be made payable to Digital Equipment Corporation.
Storage Technology
Vol. 1, No. 8, February 1989
Engineering technologies used in the design, manufacture, and maintenance of Digital's storage and information management products
Single copies and past issues of the Digital
Technical Journal can be ordered from Digital
Press at a cost of $16.00 per copy.
Technical Papers and Books by Digital Authors
P. Bernstein, V. Hadzilacos, and N. Goodman,
Concurrency Control and Recovery in Database
Systems (Reading, MA: Addison-Wesley, 1987).
P. Bernstein, M. Hsu, and B. Mann, "Implementing Recoverable Requests Using Queues," Proceedings 1990 ACM SIGMOD Conference on Management of Data (May 1990).
T. K. Rengarajan, P. Spiro, and W. Wright, "High Availability Mechanisms of VAX DBMS Software," Digital Technical Journal, vol. 1, no. 8 (February
1989): 88-98.
K. Morse, "The VMS/MicroVMS Merge," DEC
Professional Magazine, vol. 7, no. 5 (May 1988).
K. Morse and R. Gamache, "VAX/SMP," DEC
Professional Magazine, vol. 7, no. 4 (April 1988).
K. Morse, "Shrinking VMS," Datamation uuly 15,
1984).
L. Frampton, J. Schriesheim, and M. Rountree, "Planning for Distributed Processing," Auerbach
Report on Communications (1989).
Digital Press
Digital Press is the book publishing group of Digital
Equipment Corporation. The Press is an international publisher of computer books and journals
on new technologies and products for users, system
and network managers, programmers and other
professionals. Press editors welcome proposals and
ideas for books in these and related areas.
VAX/VMS: Writing Real Programs in DCL
Paul C. Anagnostopoulos, 1989, softbound,
409 pages ($29.95)
X WINDOW SYSTEM TOOLKIT: The Complete Programmer's Guide and Specification
Paul J. Asente and Ralph R. Swick, 1990, softbound,
967 pages ($44.95)
UNIX FOR VMS USERS
Philip E. Bourne, 1990, softbound,
368 pages ($28.95)
VAX ARCHITECTURE REFERENCE MANUAL,
Second Edition
Richard A. Brunner, Editor, 1991, softbound,
560 pages ($44.95)
SOFTWARE DESIGN TECHNIQUES FOR LARGE
ADA SYSTEMS
William E. Byrne, 1991, hardbound,
314 pages ($44.95)
INFORMATION TECHNOLOGY STANDARDIZATION: Theory, Practice, and Organizations
Carl F. Cargill, 1989, softbound,
252 pages ($24.95)
THE DIGITAL GUIDE T O SOFTWARE
DEVELOPMENT
Corporate User Publication Group of Digital
Equipment Corporation, 1990, softbound,
239 pages ($27.95)
DIGITAL GUIDE TO DEVELOPING
INTERNATIONAL SOFTWARE
Corporate User Publication Group of Digital
Equipment Corporation, 1991, softbound,
400 pages ($28.95)
VMS INTERNALS AND DATA STRUCTURES:
Version 5 Update Xpress, Volumes 1,2,3,4,5
Ruth E. Goldenberg and Lawrence J. Kenah, 1989,
1990, 1991, all softbound ($35.00 each)
COMPUTER PROGRAMMING AND ARCHITECTURE: The VAX, Second Edition
Henry M. Levy and Richard H. Eckhouse Jr., 1989,
hardbound, 444 pages ($38.00)
USING MS-DOS KERMIT: Connecting Your PC
to the Electronic World
Christine M. Gianone, 1990, softbound,
244 pages, with Kermit Diskette ($29.95)
THE USER'S DIRECTORY OF COMPUTER
NETWORKS
Tracy L. LaQuey, 1990, softbound,
630 pages ($34.95)
SOLVING BUSINESS PROBLEMS WITH MRP II
Alan D. Luber, 1991, hardbound,
333 pages ($34.95)
VMS FILE SYSTEM INTERNALS
Kirby McCoy, 1990, softcover,
460 pages ($49.95)
TECHNICAL ASPECTS OF DATA
COMMUNICATION, Third Edition
John E. McNamara, 1988, hardbound,
383 pages ($42.00)
LISP STYLE and DESIGN
Molly M. Miller and Eric Benson, 1990, softbound,
214 pages ($26.95)
THE VMS USER'S GUIDE
James F. Peters III and Patrick J. Holmay, 1990,
softbound, 304 pages ($28.95)
X WINDOW SYSTEM, Second Edition
Robert Scheifler and James Gettys, 1990,
softbound, 851 pages ($49.95)
THE MATRIX: Computer Networks and Conferencing Systems Worldwide
John S. Quarterman, 1990, softbound,
719 pages ($49.95)
COMMON LISP: The Language, Second Edition
Guy L. Steele Jr., 1990, 1,029 pages ($38.95 in softbound, $46.95 in hardbound)
X AND MOTIF QUICK REFERENCE GUIDE
Randi J. Rost, 1990, softbound,
369 pages ($24.95)
FIFTH GENERATION MANAGEMENT:
Integrating Enterprises Through Human
Networking
Charles M. Savage, 1990, hardbound,
267 pages ($28.95)
A BEGINNER'S GUIDE T O VAX/VMS UTILITIES
AND APPLICATIONS
Ronald M. Sawey and Troy T. Stokes, 1989,
softbound, 278 pages ($26.95)
WORKING WITH WPS-PLUS
Charlotte Temple and Dolores Cordeiro, 1990,
softbound, 235 pages ($24.95)
To receive information on these or other publications from Digital Press, write:
Digital Press
Department DTJ
12 Crosby Drive
Bedford, MA 01730
617/276-1536
Or order directly by calling DECdirect at
800-DIGITAL(800-344-4825).
ISSN 0898-901X
Printed in U.S.A.
Copyright © Digital Equipment Corporation. All Rights Reserved