Tuning Informix for OLTP Workloads
Technical White Paper
Denis Sheahan
Database Engineering, SMCC
Sun Microsystems Computer Corporation
2550 Garcia Avenue
Mountain View, CA 94043
U.S.A.
© 1997 Sun Microsystems, Inc.
2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A.
All rights reserved. This product and related documentation are protected by copyright and distributed under licenses
restricting its use, copying, distribution, and decompilation. No part of this product or related documentation may be
reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.
Portions of this product may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX System
Laboratories, Inc. and the University of California, respectively. Third-party font software in this product is protected by
copyright and licensed from Sun’s Font Suppliers.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States Government is subject to the restrictions
set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.
TRADEMARKS
Sun, Sun Microsystems, Sun Microsystems Computer Corporation, the Sun logo, the Sun Microsystems Computer
Corporation logo, are trademarks or registered
trademarks of UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc. All other product names mentioned
herein are the trademarks of their respective owners.
All SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International,
Inc. SPARCstation, SPARCserver, SPARCengine, SPARCworks, and SPARCompiler are licensed exclusively to Sun
Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees.
Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user
interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface,
which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license
agreements.
X Window System is a trademark and product of the Massachusetts Institute of Technology.
TPC-C Benchmark™ is a trademark of the Transaction Processing Performance Council.
Informix ODS 7 is a registered trademark of Informix, Inc.
VERITAS, VxVM, VxVA, VxFS, and the VERITAS logo are registered trademarks of VERITAS Software Corporation.
THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE, OR NON-INFRINGEMENT.
THIS PUBLICATION COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE
PERIODICALLY ADDED TO THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW
EDITIONS OF THE PUBLICATION. SUN MICROSYSTEMS, INC. MAY MAKE IMPROVEMENTS AND/OR CHANGES IN
THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS PUBLICATION AT ANY TIME.
Contents

1. Informix Overview
    Introduction
    ODS Overview
        Virtual Processors: Dynamic Scalable Architecture
        An Informix Instance
        Private and Shared Memory

2. Database Layout
    Informix Layout basics
        Data Layout
        Rootdbs
        Logical Logs
        Online Physical Log
        Tables
        Spindle Count
    Fragmentation
    Table Size
    Volume Management
        Availability
        Striping
        Interleave Factor
        Raw vs UFS
    Tuning Existing Layouts
    Indexes
    Building Indexes
    Loading data

3. Online Tuning
    I/O Tuning
    Logging
        Physical Logging
        Logical Logging
    Connecting to the Database
    Configuring CPUVPS
    Configuring Memory
    BUFFERS and LRUs
    LOCKS

4. System Tuning
    Sample /etc/system File
    Disk
    Memory
    CPU

5. Appendix A: Informix Scripts
    File: move_log.sh

6. Appendix B: Application Tuning
    Using sqexplain
    Database Procedures
    Application errors
    Deadlock and locking
    Using PDQ
    optcompind
Informix Overview
Introduction
Currently Informix offers three different versions of its engine. Version 7, also known as Online
Dynamic Server (ODS), is its main OLTP engine. Its current revision is 7.2.x, with a planned
release of 7.3 in early 1998. Version 8, also known as Extended Parallel Server (XPS), is its
main decision support engine. Its current revision is 8.11, with a planned release of 8.2 in
December 1997. Version 9, also known as Informix Universal Server (IUS), is a merge of 7.2
and the Universal Server from Illustra. IUS is Informix's Object Relational offering, providing
extended types and datablades.
This paper explains how to tune ODS for OLTP Workloads on Sun servers running Solaris
2.5.1 and above. Most of the tuning tips are also applicable to IUS in OLTP situations. There
are three major sections dealing with data layout, generic Informix tuning issues and system
tuning. In Appendix B we also provide a tutorial on Informix application tuning.
ODS Overview
Virtual Processors: Dynamic Scalable Architecture
The concept of virtual processors (VPs) underlies the entire structure of Informix ODS, and is
called the Dynamic Scalable Architecture (DSA). ODS runs one Informix thread (rather than
one process) for every user session connected to the database. Context for threads is
maintained in shared memory, so the same thread can be serviced by different VPs if necessary,
although by preference it remains in a single VP to reduce cache-line transfers at the hardware
level.
Each VP can run many user threads plus internal threads to perform database I/O, logging I/O,
page cleaning, administrative tasks and other work. Certain Informix utilities are served by
their own special threads. VPs are divided into several classes depending on the type of work
they do. All VPs in the same class share the same code and access the same data and
processing queues in memory.
[Figure 1: Informix ODS 7 Clients, Virtual Processors and Physical CPUs]
The relationship of virtual processors, physical CPUs and clients is illustrated in Figure 1.
An Informix Instance
Every running Informix database is associated with an Informix instance. When a database is
brought up, shared memory is allocated and the virtual processor processes are started. The
combination of the shared memory and VPs is called an instance.
Multiple databases can be created within one instance. Whenever a CREATE DATABASE
command is executed, system catalogs are created to map the tables, indexes, views and other
relational objects of a logically independent database. Informix applications and utilities
always initiate a session by specifying which database within an instance they want to connect
to.
Private and Shared Memory
Informix virtual processors each have some private memory plus access to global shared
memory. The private memory holds the VP’s program text and private data, including private
pointers into shared memory. Locks and latches (mutexes) are used to manage concurrent
access to shared memory by all VPs.
Shared memory is divided into three major portions: the resident portion, the virtual portion
and the message portion. The resident portion contains the buffer pool and several internal
tables used only to track other objects in shared memory. The virtual portion contains large
I/O buffers, session data, thread data, the dictionary cache, stored procedures cache, the
sorting pool and a global pool for structures shared by many components of OnLine, especially
messages from client applications. The message portion is only used to exchange messages
with client programs executing on the same machine as the database and communicating via
shared memory interprocess messages.
For more details on the architecture of ODS, refer to the Administrator's Guide, Volumes 1 and 2.
Database Layout
This chapter describes one of the most important aspects of tuning a database application:
laying out the database on disk. A well thought out and tuned layout can do wonders for
performance. On the other hand, all the performance tricks you can try on a runtime system
won't do any good if the database is poorly laid out to start with.
The primary tool for obtaining Informix statistics is onstat. Onstat interrogates the engine and
dumps out statistics that the latter has gathered. Online statistics are zeroed at engine bringup
or with the onstat -z command. When tuning the system, wait until the environment is in a
steady state, call onstat -z, let the system run for a short period (say 5 - 10 minutes) and then
dump the required stats. This avoids inaccuracies in rate statistics such as I/O per second.
onstat -a will dump all stats.
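A minimal sketch of that workflow (the 10-minute interval is just an illustration; onstat -p
dumps the engine profile counters):

onstat -z        # zero the engine statistics
sleep 600        # let the steady-state workload run for ~10 minutes
onstat -p        # dump profile counters for the interval
onstat -g iof    # per-chunk I/O for the same interval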
Informix Layout basics
The basic unit of data in Online is a page. In 7.x this is currently 2k in size; in 8.x it is 4k. Each
page in 7.x can hold up to 255 rows of data. Multiple contiguous pages make up a chunk.
Chunks are created using the onspaces utility and can be 2GB maximum; 8.x has a 4GB
maximum. To be used in create statements chunks must be included in a dbspace. Dbspaces
are made up of one or more chunks.
To create a dbspace, specify the first chunk it contains:
onspaces -c -d oli_dbs1 -p /links/DEV/olinei_41 -o 0 -s 500000
-c indicates create
-d is the dbspace name
-p is the physical device
-o is the offset
-s is the size in kilobytes
To add more chunks to the dbspace, use the -a option:
onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 0 -s 500000
As stated earlier the limit for any chunk is 2GB. Multiple chunks can be taken from the same
device by using the offset parameter:
onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 0 -s 500000
onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 500001 -s 500000
We always recommend that the user use soft links when declaring devices in onspaces. Then
if the controller number changes on the system the link can be moved.
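For example (the device paths here are hypothetical):

ln -s /dev/rdsk/c2t1d0s4 /links/DEV/olinei_41

# if the disk later appears on another controller, repoint only the link:
rm /links/DEV/olinei_41
ln -s /dev/rdsk/c4t1d0s4 /links/DEV/olinei_41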
The 2GB limit can be a major drawback with Informix, especially on larger 9GB disks, which
require a minimum of 5 chunks to utilize the whole disk. The user can quickly run out of the
8 partitions provided by the format utility. Using Veritas can alleviate this problem as the user
can declare as many plexes as required.
Once the dbspaces are created, onstat -d can be used to display their sizes and remaining free
space. The output is in two sections: first dbspaces, then chunks.
Dbspaces
address   number  flags  fchunk  nchunks  flags  owner     name
7d778150  1       1      1       1        N      informix  rootdbs
7e085f50  2       1      2       1        N      informix  plogdbs
7e093f50  3       1      3       1        N      informix  llogdbs1
7e175978  4       1      4       1        N      informix  llogdbs2
7e175a38  5       1      5       1        N      informix  llogdbs3
7e175af8  6       1      6       1        N      informix  wdi_dbs
7e175bb8  7       1      7       4        N      informix  cd_dbs1
......
7e17cd98  114     1      268     1        N      informix  si_dbs12
114 active, 2047 maximum

Chunks
address   chk/dbs   offset  size    free    bpages  flags  pathname
7d778210  1    1    500     950000  948181          PO-    /links/DEV/root_chunk
7d7ab588  2    2    500     950000  49947           PO-    /links/DEV/plog_chunk
7d7ab668  3    3    500     950000  49947           PO-    /links/DEV/llog_chunk1
7d7ab748  4    4    500     950000  49947           PO-    /links/DEV/llog_chunk2
7d7ab828  5    5    500     950000  49947           PO-    /links/DEV/llog_chunk3
7d7ab908  6    6    0       500000  490782          PO-    /links/DEV/wdi_1
7d7ab9e8  7    7    0       250200  47              PO-    /links/DEV/custd_1
7d7abac8  8    7    0       250200  97              PO-    /links/DEV/custd_2
7d7abba8  9    7    0       250200  97              PO-    /links/DEV/custd_3
7d7abc88  10   7    0       250200  97              PO-    /links/DEV/custd_4
.....
7e175898  268  114  0       250000  170497          PO-    /links/DEV/stocki_12
268 active, 2047 maximum
Notice how cd_dbs1, whose fchunk is 7, shows 4 in its nchunks column; there are then four
chunks, numbers 7 - 10, which have 7 as their dbs number.
The size and free fields in the chunk section are specified in 2k pages. When the chunk is
initialized these two columns will be the same; as table space is allocated from the chunk, the
amount free is reduced. Online reserves a number of pages from the chunk for what are termed
its bitmap pages, which indicate the free space in the chunk. In addition, the first chunk in a
dbspace has pages allocated for what is termed the tablespace. The result is that the first
chunk in a dbspace has less usable space.
Data Layout
We recommend using raw devices for Online data. Raw devices are accessed using kernel
asynchronous I/O (kaio), which is the most efficient path. UFS requires an extra copy through
the kernel address space, consuming CPU cycles; its aging algorithm is suboptimal for database
applications, and the caching of inodes is extra overhead.
Because Veritas is an extra layer in the data access path there is a small penalty with its use. In
situations such as the striping example below, however, its advantages outweigh this small
performance degradation.
Online requires space for the following:
•  System catalogs. These are held in a special dbspace called the root dbspace,
   which is created on Online initialization.
•  Logical logs. These contain records of transactions in a database that is
   logged.
•  Physical log. This contains before-images of modified pages, which are
   used in rolling transactions back and forward.
•  Tables. User tables can be placed by default in the root dbspace or in
   user-created dbspaces.
•  Indexes. Index data can be held with the table data or in their own separate
   dbspaces.
All data can be placed in the root dbspace, but naturally this can lead to poor performance.
Rootdbs
The user must first select a partition or volume to use for Online's root dbspace. A soft link
should then be made to this partition, which is specified in the onconfig file, e.g.:

ROOTNAME   rootdbs           # Root dbspace name
ROOTPATH   /dev/online_root  # Path for device containing root dbspace
ROOTSIZE   20000             # Size of root dbspace (Kbytes)
The root dbspace is what is initialized when oninit -iy is performed. Its size must be
sufficient to hold the initial physical log and all of the logical logs, plus overhead for the system
catalogs. The logs are always created in the rootdbs initially and can be moved later. If the
logs are moved, the rootdbs will never be a hot disk and can be placed on a disk with other data
if the user is short of spindles. Usually all catalog data is cached, and the rootdbs is rarely
accessed after the first couple of minutes of operation.
Logical Logs
Most commercial installations require some level of logging to ensure the integrity of the data
in a database. The logical log in Online contains logical log records. These records are required
for a number of functions:
•  Fast recovery: If Online shuts down in an uncontrolled manner, it uses the
   log records to recover all transactions that occurred since the last checkpoint.
•  Transaction rollback: If during normal transaction processing a transaction
   must be rolled back (error, rollback command, etc.), Online uses the log
   records to reverse the changes made on behalf of the transaction.
•  Data restoration: During a data restore the user combines the backup tapes
   of the logical log files with the most recent Online dbspace backup tapes.
Online provides three logging modes: buffered, unbuffered and ANSI. Unbuffered and ANSI
are the two safest modes: with these, the logical log records are written to disk whenever any
transaction executes a commit, which guarantees that all committed transactions survive a
system failure. In buffered mode the logical log buffer is flushed only when it is full, so on
system failure a number of committed transactions may not have been written to disk.
Logging is specified on a per-database basis using the ontape utility:
•  ontape -A tpcc -s : Set ANSI logging on database tpcc
•  ontape -B tpcc -s : Set buffered logging on database tpcc
•  ontape -U tpcc -s : Set unbuffered logging on database tpcc
•  ontape -N tpcc -s : Turn off logging on database tpcc
The number and initial size of the logical logs are specified in the onconfig file with the
following parameters:

LOGFILES   3    # Number of logical log files
LOGSIZE    500  # Logical log size (Kbytes)
The user can see the state of the logical logs with the command onstat -l:
Logical Logging
Buffer  bufused  bufsize  numrecs  numpages  numwrits  recs/pages  pages/io
L-1     0        16       227      9         3         25.2        3.0

Subsystem  numrecs  Log Space used
OLDRSAM    227      12664

address  number  flags    uniqid  begin   size  used  %used
ab08990  1       U---C-L  7       300035  250   125   50.0
ab089ac  2       U-B----  5       400035  250   250   100.00
ab089c8  3       U-B----  6       500035  250   250   100.00
The size is specified in 2k pages. If the %used values are all 100%, the system will halt waiting
for the logs to be backed up, unless the TAPEDEV onconfig parameter is set to /dev/null.
After initialization the logs can be moved with the move_log.sh script given in Appendix A.
If the system moves across two logical log boundaries it automatically triggers a checkpoint.
The flag fields of interest are C, indicating the current logical log; U, indicating used; B,
indicating backed up; and L, indicating the log contains the most recent checkpoint record. The
uniqid field increases every time a new logical log is started, so the current log will have the
highest number. Each time a log is completed a message is dumped to the message log:

<<INFORMIX-OnLine Server>>> Logical Log 21 Complete.
10:53:48  Logical Log 21 Complete.
Online Physical Log
The physical log contains before-images of database pages. Before the engine modifies a page,
the page is written to the physical log. These images are used in fast recovery to bring the
database to a consistent state. After a failure, Online uses the before-images in the physical log
to restore all pages on disk to their state at the last checkpoint; this is the first phase of fast
recovery. Next, the before-images are combined with the logical log records stored since the
checkpoint, restoring the data to consistency up to the last committed transaction; this is the
second phase of fast recovery.
Because all modified pages are flushed to disk at a checkpoint, the physical log is emptied at
that time. During bringup, Online starts recovery from the last checkpoint record in the logical
log.
As a safety feature a checkpoint is initiated when the physical log becomes 75% full. It is
important to make the physical log big enough to avoid two scenarios:
•  Continual checkpointing.
•  Physical log overflow. This occurs when the log is filling up too fast for a
   checkpoint to get in and empty it. This will cause Online to shut down.
onstat -l also gives information on the physical log:
Physical Logging
Buffer  bufused  bufsize  numpages  numwrits  pages/io
P-1     10       16       78633     4909      16.02

phybegin  physize  phypos  phyused  %used
200035    900000   875077  195578   21.73
Tables
Usually the majority of pages in a database are dedicated to user data. In Online an extent is
the minimum amount of contiguous space that can be allocated for a table. Every permanent
table has two extent sizes associated with it. The initial extent size is the number of kilobytes
allocated to the table when it is first created. The next extent size is the amount in kilobytes
allocated when the initial and all other extents are full. The size of these extents is specified in
the create table statement:
CREATE TABLE item (
    i_id     INTEGER,
    i_im_id  INTEGER,
    i_name   CHAR(24),
    i_price  DECIMAL(5,2),
    i_data   CHAR(50)
) IN wdi_dbs
EXTENT SIZE 12000
NEXT SIZE 8000
wdi_dbs is the dbspace the table will be created in. After creation you can see from the onstat
-d output that 12000/2 = 6000 pages have been taken from the free column of wdi_dbs. As the
table fills, the free column will reduce in 8000/2 = 4000 page extents.
It is good practice to have the minimum number of extents possible for a table. Multiple
extents lead to fragmentation of the disk. Initialize one large chunk for the table and then
allocate an extent greater than the chunk size. Because extents cannot cross a chunk boundary,
Online will take the maximum it can allocate for the table, i.e., the entire chunk.
Online also has an internal mechanism to reduce the number of extents. After every 16 next
extents it will double the size of the next extent. In our example the 17th, 33rd and 49th extents
will be 8000, 16000 and 32000 pages respectively.
Spindle Count
Once the database size is determined, it would seem that calculating the number of disks
required should be a simple task. Simply divide the total database size by the size of each disk.
Unfortunately, for OLTP applications, this strategy could yield very poor performance. The
number of spindles required is usually much larger than what the above calculation would
dictate. However, this does not mean that we will be wasting disk space. The additional space
is used for growing tables, filesystems etc.
To determine the number of spindles, it is crucial to understand the workload. Disk access is
also closely tied to the size of the buffer cache. If your workload is such that rows updated by
some transactions are re-used by others, it is advantageous to increase the size of the buffer
cache, cutting down disk access. This scheme is best explained by means of an example.
In TPC-C, the stock table is the largest. For a 900 warehouse database, the total space
requirement for this table is over 36 GB. If we did a dumb distribution, we would need 18
2.1GB drives. But let’s look at the table accesses a little more by understanding the workload.
The Stock table is randomly accessed 20 times from the Neworder transaction (10 reads/10
updates) and 200 times from the Stocklevel transaction. However, the Stocklevel transaction
reads the data that is accessed by the Neworder transaction and should hopefully be in the
buffer cache. Therefore, the number of stock table accesses is 20/Neworder transaction. For the
900 warehouse database, we hope to achieve 10,500 Neworder transactions/minute or 175/sec.
This is a total of 175 * 20 or 3500 I/Os per second on the stock table. For good performance, each
disk should be restricted to a response time of less than 20ms, which means we will need 3500 /
40 or 88 disks. Notice that this is a far cry from the 18 drives we computed by looking at the
space usage alone. Thus, for OLTP applications, it is extremely important to consider database
access patterns before deciding on the number of spindles required.
Fragmentation
For small and lightly used tables a flat layout inside a single dbspace is sufficient. If these
tables are greater than 2GB just add extra chunks to the dbspace. For hot tables however this
scheme will lead to poor performance. These tables should be fragmented. Fragmentation lets
the user place table rows in different dbspaces based on some distribution scheme. The
advantage of fragmentation is that the optimizer can possibly eliminate whole sections of data
when executing a query. In an OLTP environment this improves concurrency as it reduces
contention on the underlying devices.
There are two types of distribution scheme: round-robin and expression-based. In round-robin,
Online uses an internal scheme to distribute the rows. It is generally a poor performer as no
fragments can be eliminated.
An expression-based distribution scheme requires the user to define a rule and include it in the
create table or create index statement; for more details see the Informix ODS Administrator's
Guide, Vol. 1.
We usually use range fragmentation for our hot table definitions, e.g.:
CREATE TABLE customer (
    c_w_id         SMALLINT,
    c_d_id         SMALLINT,
    c_id           INTEGER,
    c_ytd_payment  DECIMAL(12,2),
    c_data         CHAR(500)
    ...
)
FRAGMENT BY EXPRESSION
    c_w_id <= 100 IN cd_dbs1,
    c_w_id <= 200 AND c_w_id > 100 IN cd_dbs2,
    c_w_id <= 300 AND c_w_id > 200 IN cd_dbs3,
    c_w_id <= 400 AND c_w_id > 300 IN cd_dbs4,
    c_w_id <= 500 AND c_w_id > 400 IN cd_dbs5,
    c_w_id <= 600 AND c_w_id > 500 IN cd_dbs6,
    c_w_id <= 700 AND c_w_id > 600 IN cd_dbs7,
    c_w_id <= 800 AND c_w_id > 700 IN cd_dbs8,
    c_w_id <= 900 AND c_w_id > 800 IN cd_dbs9,
    REMAINDER IN cd_dbs10
EXTENT SIZE 500200
NEXT SIZE 500200
As rows are inserted into the table the value of the c_w_id is checked and this determines in
which dbspace to place the data.
Table Size
In order to size an Online database correctly we can make a rough calculation of the row
length of both data and indexes and the number of rows in a 2k page. If the user has already
created the table, the rowsize can be found with the SQL fragment:

select rowsize from systables where tabname = "table-name";
Each "normal" page has 28 bytes at the start taken up by the page header. At the end is a
4-byte timestamp; comparison of this timestamp with one in the header determines if a page
has been modified. Growing back from the timestamp are the slot table entries, 4 bytes for
each row of data. A slot table entry contains the offset of the row in the page and its length. So
the actual space a row takes up is rowsize + 4, and from this we can determine the number of
rows of a table a page can accommodate. Due to a 1-byte identifier in the slot table entry the
maximum number of rows is 255.
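As a worked example, take a 306-byte row (the stock table's maximum row size in the oncheck
report later in this section):

usable bytes  = 2048 - 28 (page header) - 4 (timestamp) = 2016
rows per page = 2016 / (306 + 4) = 6, rounded down

This agrees with the oncheck report below: 4,500,000 rows in 750,000 data pages is 6 rows per
page.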
It is not recommended to make a row longer than a page. There is a lot of overhead in following
the chain of pages for each row accessed, and performance is degraded. Blobpages are also
very poor performers.
If the table is already loaded oncheck can be used to determine the exact number of pages
allocated to it:
oncheck -pT tpcc:tablename
This can take some time for large tables but can give some valuable information. It is useful to
dump the oncheck data when the database is initially loaded and on a regular basis to see
actual growth needs etc.
For each dbspace a report is generated e.g.
TBLspace Report for tpcc:informix.stock

Table fragment in DBspace s_ddbs01

    Physical Address               200005
    Creation date                  03/31/97 19:16:07
    TBLspace Flags                 802    Row Locking
                                          TBLspace use 4 bit bit-maps
    Maximum row size               306
    Number of special columns      0
    Number of keys                 0
    Number of extents              4
    Current serial value           1
    First extent size              190000
    Next extent size               190000
    Number of pages allocated      760000
    Number of pages used           750187
    Number of data pages           750000
    Number of rows                 4500000
    Partition partnum              2097154
    Partition lockid               2097154

    Extents
        Logical Page    Physical Page    Size
                   0           200035  190000
              190000           300003  190000
              380000           400003  190000
              570000           500003  190000

TBLspace Usage Report for tpcc:informix.stock

    Type             Pages      Empty  Semi-Full  Full  Very-Full
    ---------------- ---------- ------ ---------- ----- ---------
    Free             9813
    Bit-Map          187
    Index            0
    Data (Home)      750000
                     ----------
    Total Pages      760000

    Unused Space Summary
        Unused data slots                     0
        Unused bytes per data page            160
        Total unused bytes in data pages      120000000

    Home Data Page Version Summary
        Version           Count
        0 (current)       750000
Volume Management
After the number of disks required is determined, you must now consider how to manage
them. In a large database, several hundred disks may be used and managing them can be a
major task. Use of a Volume Manager like Solstice DiskSuite or Veritas Volume Manager can
ease this task considerably.
Availability
There are trade-offs to be made between cost, availability and performance. How much
downtime you can tolerate will help decide this. If you want your database to be impervious to
a single disk failure, then you should consider RAID 1 or RAID 5 implementations. For the
TPC-C workload, a fully mirrored database using RAID 1 shows a 10% performance
degradation compared to a non-RAID database. Using RAID 5 further increases this
degradation to 35%. The RSM2000 is a possible solution for RAID 5. A more complete
description of the performance implications of RAID on OLTP workloads can be found in the
whitepaper "Performance Evaluation of RAID With OLTP Workloads" at
http://hot.eng/whitepapers. Note that if your workload performs many writes, the
degradation may be more severe; degradations of up to 50% are not uncommon.
Striping
Disk striping using RAID 0 is often used to configure disks for good performance. Striping
helps spread the load across disks, eliminating any hot spots in the database. The user can
use Informix striping via fragmentation or Veritas striping. In some situations, primarily when
there is skew in the access pattern of a table, Informix fragmentation is not the best solution.
As an example, let's take the stock table of the TPC-C database. Using Informix striping we
initially lay out the data on 72 disks. Each dbspace is made up of 6 chunks from 6 different
disks:
onspaces -c -d sd_dbs1 -p /device_links/stockd_1 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_2 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_3 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_4 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_5 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_6 -o 0 -s 463300
We run our benchmark and, using iostat/statit, we discover that I/Os to the 6 disks are very
uneven:
Disk I/O Statistics (per second)
Disk    util%  xfer/s  rds/s  wrts/s  rdb/xfr  wrb/xfr  wtqlen  svqlen  srv-ms
c3t3d1  42.6   59.9    28.8   31.1    2048     2048     0.00    0.62    10.4
c3t3d2  51.7   81.0    34.9   46.1    2048     2048     0.00    0.85    10.4
c3t3d3  10.9   13.7    7.4    6.3     2048     2048     0.00    0.12    8.7
c3t3d4  14.1   18.0    9.5    8.5     2048     2048     0.00    0.16    8.8
c3t4d0  26.7   36.0    18.5   17.5    2048     2048     0.00    0.34    9.3
c3t4d1  38.7   55.3    27.4   27.9    2048     2048     0.00    0.55    9.9
The first two disks are hot, the second two cold and the last two medium. To even out the I/O
we create six 6-way Veritas stripes:
/etc/vx/bin/vxdisksetup -i c3t3d1
vxdg -g rootdg adddisk stkvol1=c3t3d1
....
/etc/vx/bin/vxdisksetup -i c3t4d1
vxdg -g rootdg adddisk stkvol6=c3t4d1

vxassist -g rootdg make stk1 1000m layout=stripe columns=6 stripeunit=16k \
    stkvol1 stkvol2 stkvol3 stkvol4 stkvol5 stkvol6
....
vxassist -g rootdg make stk5 1000m layout=stripe columns=6 stripeunit=16k \
    stkvol1 stkvol2 stkvol3 stkvol4 stkvol5 stkvol6
and take the 6 chunks from these new volumes. The I/O then becomes uniform:
Disk    util%  xfer/s  rds/s  wrts/s  rdb/xfr  wrb/xfr  wtqlen  svqlen  srv-ms
c3t3d1  35.6   43.0    20.5   22.5    2048     2048     0.00    0.50    11.6
c3t3d2  36.2   43.4    20.7   22.7    2048     2048     0.00    0.50    11.6
c3t3d3  36.0   42.7    20.6   22.1    2048     2048     0.00    0.51    11.9
c3t3d4  36.6   43.6    21.0   22.6    2048     2048     0.00    0.51    11.6
c3t4d0  35.9   42.5    20.5   22.0    2048     2048     0.00    0.50    11.7
c3t4d1  35.6   43.6    20.6   22.9    2048     2048     0.00    0.50    11.4
Interleave Factor
Interlace or Interleave factor is the amount of contiguous space on one spindle. For example, if
we specify an interlace factor of 16K, the volume manager will assign the first 16K bytes from
the first disk, the next 16K from the next disk in the stripe and so on. For OLTP workloads, the
interlace factor should be small. 16K to 32K interleave factors should work well for table and
index data.
Raw vs UFS
A large number of customers use UFS for their database files for convenience. From a
performance perspective, raw will outperform UFS for OLTP workloads. We have measured a
two-fold increase in performance for raw vs UFS.
Tuning Existing Layouts
Oftentimes we don't have the luxury of laying out a database from scratch. In such cases, we
need to monitor disk performance and tune as best we can. The first step is to collect disk I/O
statistics during normal operation of the workload. System utilities such as sar and iostat can
be used. If the volume manager being used is Veritas, then the utility vxstat will provide disk
statistics at a volume level, eliminating the need to map physical disks to logical tablespaces.
Online also provides statistics such as the number of reads and writes per chunk. If the
statistics show greater than 40 I/Os per second or service times greater than 50 ms, the system
may warrant tuning. Identify which tablespace is on the problem disks. If the tablespace
contains multiple tables, more analysis needs to be done to determine which table is the culprit.
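The per-chunk counts come from onstat, sampled over a steady-state interval as described
earlier (a sketch):

onstat -z     # zero the statistics
              # ... let the workload run ...
onstat -D     # onstat -d layout plus page reads and page writes per chunk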
It may be possible to reduce disk activity by caching more data in memory. If you have
sufficient memory, try bumping the number of BUFFERS in the onconfig file. If this doesn't
reduce the I/O bottleneck, then it may be necessary to re-distribute the data onto more disks. If
the Informix chunks were all added using logical names (i.e. symbolic links to the actual
physical devices, or Veritas volume names), this re-distribution is rather straightforward:
shut down Online, re-create the volumes over a larger number of disks and copy the data over
from the old volumes to the new ones. If disk space is a constraint or actual physical device
names were used as datafile names, it may be necessary to export the table contents, then drop
and re-create the tablespace before loading the data back in.
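A sketch of the symbolic-link case (volume names and sizes are illustrative only):

onmode -ky                         # shut down the instance
vxassist -g rootdg make cust_new 2000m layout=stripe columns=6 stripeunit=16k
dd if=/dev/vx/rdsk/rootdg/cust_old of=/dev/vx/rdsk/rootdg/cust_new bs=64k
rm /links/DEV/custd_1              # repoint the chunk's soft link
ln -s /dev/vx/rdsk/rootdg/cust_new /links/DEV/custd_1
oninit                             # bring Online back up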
Indexes
Indexes can greatly improve the performance of OLTP environments. Informix uses a B+ tree
structure for indexes. All indexes start with a single root node. There may be a number of
intermediate branch levels, and the index ends in leaf nodes. The index nodes contain rows
called index items. Each index item consists of a key value (which may be a composite) and a
rowid. The rowid represents either a row in another index page or, in the case of leaf nodes, the
actual data row. The fields from the original table that make up the index are called the
keys. Keys are chosen so that the optimizer will choose the index in a particular query. If, by
choosing the keys, all the results of a query can be returned without going to the data row, the
index is said to cover the query.
With regard to performance there are a number of issues with indexes. The first is the depth
of the tree. When accessing a data row using the index, each level in the tree requires a buffer
read. This can affect the cache hit rate of the buffer pool and, in the worst case, require extra
I/Os. In the case of an update, a lock is held on all rows involved in the index search. Locks on
indexes can become quite hot if a lot of updates are occurring.
One potential solution is to fragment the index. Indexes on fragmented tables that don't
specify their own strategy are fragmented the same way as the table; the engine reserves
space in the table's dbspace for the index using an internal algorithm. These are known as
attached indexes. Alternatively the user can declare his own index fragmentation scheme; for
example he could fragment the table 4 ways:
CREATE TABLE orders (
    o_id          INTEGER,
    o_c_id        INTEGER,
    o_d_id        SMALLINT,
    o_w_id        SMALLINT,
    o_carrier_id  SMALLINT,
    o_ol_cnt      SMALLINT,
    o_all_local   SMALLINT,
    o_entry_d     DATETIME YEAR TO SECOND
)
FRAGMENT BY EXPRESSION
    o_w_id <= 250 IN od_dbs1,
    o_w_id <= 500 AND o_w_id > 250 IN od_dbs2,
    o_w_id <= 700 AND o_w_id > 500 IN od_dbs3,
    REMAINDER IN od_dbs4
EXTENT SIZE 267700
NEXT SIZE 267700
LOCK MODE ROW;
and fragment the index two ways:
CREATE UNIQUE INDEX oi1 ON orders(o_id, o_w_id, o_d_id)
FRAGMENT BY EXPRESSION
    o_w_id <= 500 IN oi1_dbs1,
    REMAINDER IN oi1_dbs2;
Fragmentation has the effect of reducing the depth of the btree. The optimizer can determine
from the fragmentation strategy which fragment of the index to traverse. If the trees in the
fragments are shallower than a single large index, then fewer I/Os are required to reach the
row. Index fragmentation does have a downside, however: it requires searching long linked
lists to get to the required fragment, which can sometimes adversely affect performance. Even
if an index is shallow, the transaction mix may make the disks hot; fragmentation can help
this situation.
To determine the depth of a tree use the Online facility oncheck described earlier.
If there is an attached index on the table there will be an “Index” line with the number of
pages allocated. For each attached index there will be an output as follows:
Index Usage Report for index customer_index on tpcc:informix.customer

                  Average    Average
Level    Total    No. Keys   Free Bytes
-----  --------   --------   ----------
    1         1         66          972
    2        66         62         1017
    3      4135         62         1027
    4    256411        116           31
-----  --------   --------   ----------
Total    260613        116           47
Here we see the customer_index has a depth of 4. Notice how the branch pages have far more
free space than the leaf pages. The number of rows in an index page is controlled by the
FILLFACTOR parameter. FILLFACTOR can be set for the whole system in the onconfig file or
overridden in the create/alter index statement. The default FILLFACTOR is 90%.
FILLFACTOR only takes effect when the index is being built, not when it is being updated.
Set FILLFACTOR higher for static indexes or indexes that are updated but rarely change size.
This will increase index performance by yielding a better cache hit rate and reduced I/O. Set
the FILLFACTOR to lower than default for indexes that are going to have a lot of inserts. This
will give the index room to grow and reduce the amount of page splitting required. For
heavily updated tables it might make sense to periodically drop the index and rebuild it with
the required FILLFACTOR to help performance.
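For instance, an insert-heavy index such as the oi1 index above might be rebuilt with a lower
fill factor (the 70% figure is only an illustration):

DROP INDEX oi1;
CREATE UNIQUE INDEX oi1 ON orders(o_id, o_w_id, o_d_id) FILLFACTOR 70;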
Another issue with indexes is the number to have on a table. There is no limit in Informix, but
multiple indexes must be maintained: if a table is heavily updated or altered, each index must
be modified for each operation.
We also recommend not choosing a varchar for the key in an index. Traversing an index
requires a number of key comparisons and varchars are generally a poor performer in this
situation.
Building Indexes
Online can build indexes serially or in parallel. Serial is the default operation and for small
tables is often sufficient. Each chunk of the table is read in sequence and the data sorted. On
machines with a higher number of CPUS parallel builds are more efficient. Parallel builds read
multiple disks into memory, sorts the data and writes the index out in parallel. To do this the
user must enable Parallel Data Query (PDQ) and set the PSORTNPROCS environment
variable. PSORTNPROCS restricts the number of sort threads that will be started in the engine
(on a CPUVP basis) PDQ is enabled by setting MAX_PDQPRIORITY in the onconfig file and
setting the PDQPRIORITY environment variable. An example script would be:
PSORT_NPROCS=10
PDQPRIORITY=100
export PSORT_NPROCS
export PDQPRIORITY
dbaccess tpcc index_second_customer
PDQ requires memory for operation, and the amount it can allocate is set by the
DS_TOTAL_MEMORY variable in the onconfig file. Each index build will get
DS_TOTAL_MEMORY / DS_MAX_QUERIES of memory to work with. So for the most
efficient parallel build, set DS_TOTAL_MEMORY to the most you can allocate and
DS_MAX_QUERIES to 1 so the build gets all of it.
It is also a good idea to set SHMVIRTSIZE high, to avoid having multiple shared segments
allocated during the index build (see the shared memory section in the Online Tuning chapter).
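A sketch of the relevant onconfig settings (the sizes are placeholders to be adapted to the
machine's memory):

MAX_PDQPRIORITY  100     # allow clients to request full PDQ
DS_MAX_QUERIES   1       # a single build gets all DS memory
DS_TOTAL_MEMORY  400000  # decision support memory (Kbytes)
SHMVIRTSIZE      500000  # initial virtual segment size (Kbytes)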
During a parallel index build, onstat -g ath (show all threads) should show psortproc threads
being scheduled:
311  cf51d20  eb61490   2  cond wait(packet_cond)  5cpu  xchg_2.82
312  cf75de8  eb61834   2  cond wait(packet_cond)  3cpu  xchg_2.83
313  cf50c58  eb61bd8   2  cond wait(packet_cond)  3cpu  xchg_2.84
314  cf50d80  eb61f7c   2  ready                   5cpu  xchg_2.85
625  e9b7d58  30b267b8  2  ready                   1cpu  psortproc
626  e9c0680  30b26b5c  2  ready                   3cpu  psortproc
627  e9c0988  30b26f00  2  ready                   5cpu  psortproc
628  e9c0c90  30b272a4  2  ready                   4cpu  psortproc
629  e9c0f98  30b27648  2  ready                   4cpu  psortproc
A parallel build also uses big buffers, so onstat -g iob should show output:
INFORMIX-OnLine Version 8.20.UA2 -- On-Line -- Up 00:33:58 -- 3608456 Kbytes

AIO big buffer usage summary:
              reads                     holes                  writes
class    pages     ops    pgs/op   holes  hl-ops  hls/op   pages  ops   pgs/op
fif      0         0      0.00     0      0       0.00     0      0     0.00
kio      1593640   53394  29.85    26     10      2.60     11470  2685  4.27
Finally, onstat -g iof (I/O to each chunk) should show parallel reads from more than one of the
table chunks.
When DS memory is exhausted the index build will overflow to the temp dbspace. Temp is
specified by the DBSPACETEMP variable in onconfig. If none is specified, /tmp is used and
the sort data is written to a cooked file. This is suboptimal as kaio cannot be employed and the
cooked-file codepath is not optimized. Always create a number of temp dbspaces and set the
DBSPACETEMP variable:
DBSPACETEMP  tmpdbs1,tmpdbs2,tmpdbs3,tmpdbs4,tmpdbs5,tmpdbs6,tmpdbs7,tmpdbs8,tmpdbs9,tmpdbs10  # Default temp dbspaces
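Temp dbspaces are created with the -t flag of onspaces; a sketch (the device link and size are
placeholders):

onspaces -c -d tmpdbs1 -t -p /links/DEV/tmp_1 -o 0 -s 200000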
The spaces will be written to in a round-robin fashion. If temp becomes hot, the user might
consider striping using Veritas; a small interleave factor (say 16k or lower) should be chosen
for the volumes.
Loading data
Informix provides a parallel loader for Online which is beyond the scope of this document. For
more information see the Guide to the High Performance Loader.
A simpler solution is to load into multiple fragments in parallel, as sketched below. Any table
that is fragmented can be loaded in this fashion, which avoids contention on the bitmap page of
the table and the thrashing inherent in two loaders going after the same table.
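A minimal sketch, assuming the customer data has been pre-split into per-fragment files whose
c_w_id ranges match the fragmentation expression (the file and script names are hypothetical):

cat load_cust_1.sql
LOAD FROM "cust_1_100.unl" INSERT INTO customer;

dbaccess tpcc load_cust_1 &
dbaccess tpcc load_cust_2 &
....
wait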
Online Tuning
I/O Tuning
On Solaris, Online's default method of I/O is kaio for all reads and writes. Each CPUVP has a
special kaio thread that performs this task. When a normal thread yields in the engine, the kaio
thread is always scheduled next. The kaio thread uses aio_read and aio_write to submit
outstanding requests to the OS and then uses aio_wait with a zero timeout to check for
completions. It passes on any completed I/O and then yields to the scheduler, which chooses
the next thread to run.
onstat -g iov displays the activity of each kaio thread:
AIO I/O vps:
class/vp  s  io/s  totalops  dskread  dskwrite  dskcopy  wakeups  io/wup
kio 0     i  0.0   43        36       7         0        89       0.5
kio 1     i  0.0   658       640      18        0        1489     0.4
The kaio thread employs an algorithm to try to coalesce I/O requests for adjacent pages
together. It then submits this larger I/O in what is called a big buffer, which is more efficient.
Big buffers are also used in index building. To see big buffer usage, use onstat -g iob:
AIO big buffer usage summary:
              reads                  holes                  writes
class    pages  ops    pgs/op   holes  hl-ops  hls/op   pages  ops  pgs/op
fif      0      0      0.00     0      0       0.00     0      0    0.00
kio      2803   2742   1.02     0      0       0.00     102    77   1.32
Normal threads submit reads and writes via the I/O queues. The kaio thread empties its
queue each time it is scheduled. To see the status of the queues, use onstat -g ioq:
AIO I/O queues:
q name/id  len  maxlen  totalops  dskread  dskwrite  dskcopy
kio 1      0    16      119642    58799    60843     0
kio 2      0    16      132167    59397    72770     0
kio 3      0    16      124531    59469    65062     0
kio 4      0    16      110924    59482    51442     0
kio 5      0    16      126967    59182    67785     0
kio 6      0    16      122345    58707    63638     0
kio 7      0    16      130828    61490    69338     0
The len field gives the current number of outstanding I/Os submitted, and maxlen is a
high-water indicator. The max is usually reached during buffer cleaning, when a cleaner
thread chooses a number of buffers from an LRU (default 16) and writes them to disk. Look at
dskread and dskwrite to ensure all kaio threads are doing roughly the same amount of I/O.
To find out which chunks are receiving the I/O requests, use onstat -g iof:
AIO global files:
gfd  pathname         totalops  dskread  dskwrite  io/s
3    root_chunk       93        45       48        0.0
4    plog_chunk       2986      0        2986      0.0
5    llog_chunk1      0         0        0         0.0
6    llog_chunk2      18411     5841     12570     0.1
7    llog_chunk3      0         0        0         0.0
8    /amir/DEV/wdi_1  172       133      39        0.0
9    custd_1          8578      5867     2711      0.0
10   custd_2          8789      6094     2695      0.1
11   custd_3          9106      6411     2695      0.1
.....
Informix generally recommends no more than 40 I/Os a second to a disk, but with today's
faster disks 56 - 60 can be OK. If the chunk is on a multi-disk stripe, however, the chunk can
obviously take 40 I/Os times the number of disks; for instance, on a 6-way stripe the I/Os can
go up to 240 a second. On RSM or SSA arrays, turning on the fast write cache can improve
write performance.
Logging
Physical Logging
As mentioned earlier, it is essential to make the physical log big enough to avoid continual
checkpointing. The size limit is 2GB in 7.x, and there can be only one physical log in an Online
instance. If a 2GB log is still filling too fast, the user will have to live with the checkpoints and
try to reduce their time instead.
The onconfig parameter PHYSBUFF (in kbytes) determines the I/O block size for the
physical log. Writes to the physical log are double-buffered: one buffer is being written to in
memory while another is being written to disk. In update-intensive OLTP environments the
physical log may become hot. In such situations, first increase PHYSBUFF to 128k to see if it
helps performance. 128k is by default the largest block size that Solaris will not break up when
writing to disk.
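In onconfig terms (following the 128k suggestion above):

PHYSBUFF  128  # Physical log buffer size (Kbytes)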
The user can also stripe the physical log, if possible, using a volume manager such as Veritas.
Make sure the interleave factor is less than PHYSBUFF, or the entire buffer will be written
to just one disk in the stripe.
Logical Logging
As with the physical log, the user should ensure that the logical logs are big enough to avoid
constant checkpoints. A checkpoint is initiated when two logical log boundaries are crossed,
so having 16 logs of 500k each is a bad scheme. Make the logical logs large (2GB is possible). If
the environment is being archived to tape, the user should provide enough log space for, say,
an 8-hour work day. When all logs are full the system will halt waiting for the backup to
complete.
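Logical logs are added with onparams; a sketch using a dbspace from the earlier layout (the
size, in Kbytes, is a placeholder):

onparams -a -d llogdbs1 -s 1000000   # add a ~1GB logical log in llogdbs1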
Again, the logical logs can be striped, but they are not usually hot enough to warrant this as the
actual I/Os tend to be sequential. They are usually mirrored, however. The user has two
options here: volume manager mirroring or Informix mirroring. Informix mirroring is
achieved using the onspaces command, either when the dbspace is created
onspaces -c -d llog_dbs -p /dbs/llog_dbs0 -o 4 -s 500000 -m /dbs/lm_dbs0 4
or afterwards with the -m option
onspaces -m llog_ddbs01 -p /dev/ifmx/s132d15vol -o 5120
We have seen no difference in performance between Informix and Veritas log mirroring.
The onconfig parameter LOGBUFF sets the I/O size of the logical log. In buffered logging
this amount of data is written out on each I/O. In unbuffered and ANSI logging this is
the size of the log buffer. Multiple transactions write log records to the log buffer; when one
transaction commits, the log buffer must be flushed. The amount of buffer actually written is
determined by the amount of "piggybacking" achieved before a commit; the length of the
transactions and their mix determines this piggybacking. The pages/io field in the onstat -l
output indicates the amount of piggybacking achieved:
Logical Logging
Buffer  bufused  bufsize  numrecs  numpages  numwrits  recs/pages  pages/io
L-1     0        16       2618502  97451     13371     26.9        7.3

Subsystem  numrecs  Log Space used
OLDRSAM    2618502  183907884
In this case each logical log write averaged 7.3 pages, or 14.6k, roughly half the 32k log buffer.
In high-volume, short-transaction environments the piggybacking will decrease as transactions
are continually committing. Use the pages/io statistic to determine the size of the log buffer.
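The buffer itself is sized in onconfig (32k here, matching the example above):

LOGBUFF  32  # Logical log buffer size (Kbytes)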
Connecting to the Database
When a database has been initialized, the next step is getting users connected to the engine.
In order to connect, either via an application or a tool such as dbaccess, a user must set the
INFORMIXSERVER environment variable to indicate the instance of the server they wish to
attach to (multiple instances can be present either on the same machine or on the network).
This name must be present in the user's sqlhosts file:
echo $INFORMIXDIR
INFORMIXSERVER=xtpcc

cat $INFORMIXDIR/etc/sqlhosts
xtpcc    ontlitcp    campi-1    7600
The default directory for sqlhosts is $INFORMIXDIR/etc but this can be changed by setting the
INFORMIXSQLHOSTS environment variable. Communication can be over the network using
tli/tcp or locally using a shared memory protocol called ipcshm. The hostname campi-1 above
must also exist in the client’s /etc/hosts file.
The engine must then provide the service to the user. The INFORMIXSERVER name must be
present in the onconfig file as either DBSERVERNAME or an alias in the DBSERVERALIASES
list. Each DBSERVERNAME or DBSERVERALIASES entry must be present in the server's
sqlhosts file:
onconfig entries:
DBSERVERNAME     rtpcc
DBSERVERALIASES  xtpcc

sqlhosts file:
rtpcc    olipcshm    campi      rtpcc
xtpcc    ontlitcp    campi-1    7600
In this example local connections use the server name rtpcc and remote connections use the
server alias xtpcc. When using tli/tcp there must be an entry in the /etc/services file for the
Online listener:
rtpcc    7600/tcp    # Informix listener
Informix must have read permission for /etc/services. When the engine comes up, netstat -a
will show whether the listener is operational:
TCP
   Local Address      Remote Address     Swind  Send-Q  Rwind  Recv-Q  State
-----------------  -----------------  -----  ------  -----  ------  -----------
campi-1.rtpcc      *.*                    0       0   8760       0  LISTEN
campi-1.rtpcc      haxx3-1.33232       8760       0      0       0  ESTABLISHED
*.rtpcc            *.*                    0       0   8576       0  BOUND
When dbaccess or an application is connected over tli/tcp, netstat will show ESTABLISHED in
the state field. When a connection terminates there will still be a netstat entry, with BOUND
in the state field.
onstat -g ses will show connections in Online:

session                                    #RSAM    total    used
id       user      tty  pid    hostname    threads  memory   memory
22       dbbench   1    2098   haxx3-1     1        32768    27624
12       informix  5    18372  campi       1        32768    27400
The hostname field indicates a local or remote connection. For more information on the
various connection options see the Administrator's Guide, Volume 1.
By using different server aliases, Online can listen on multiple networks. Each requires a
listener entry in /etc/services and an entry in the sqlhosts file specifying a different name.

onconfig:
DBSERVERNAME     thash
DBSERVERALIASES  net2,net3    # List of alternate dbservernames
sqlhosts:
thash    ontlitcp    campi-1    7600
net2     ontlitcp    campi-2    7700
net3     ontlitcp    campi-3    7800
/etc/hosts
#private nets
192.1.1.100
campi-1
192.1.2.100
campi-2
192.1.3.100
campi-3
By default Online spawns one poll thread for each nettype entry in the sqlhosts file. The
NETTYPE onconfig parameter allows the user to allocate more than one poll thread and
designate whether they run inline as part of the work of a CPUVP or as their own processes
on NETVPs (an example NETTYPE setting is sketched later in this section). onstat -g ath
displays how many threads are polling:
Threads:
tid  tcb       rstcb  prty  status              vp-class  name
7    ca0b9678  0      2     running             1cpu      sm_poll
8    ca0c3248  0      2     running             25tli     tlitcppoll
9    ca0c3788  0      2     cond wait(arrived)  26tli     tlitcppoll
Here we see one poll thread for shared memory (sm_poll) and two for tli/tcp. onstat -g glo
will indicate whether NETVPs have been started; if the poll thread is running on a NETVP
there will be a shm entry for ipcshm and a tli entry for tli/tcp:
Virtual processor summary:
class    vps   usercpu   syscpu   total
cpu      19    14.88     23.11    37.99
aio      1     0.06      0.25     0.31
tli      8     0.06      0.08     0.14
shm      8     0.06      0.03     0.09
lio      1     0.01      0.01     0.02
pio      1     0.02      0.00     0.02
adm      1     0.00      0.02     0.02
msc      1     0.00      0.01     0.01
total    40    15.09     23.51    38.60

Individual virtual processors:
vp    pid      class   usercpu   syscpu   total
26    18578    shm     0.01      0.01     0.02
33    18534    tli     0.01      0.01     0.02
The default operation is inline polling for ipcshm and a network VP for tli/tcp. In high
transaction rate OLTP environments we have found it best to use NETVPs for polling: using a
CPUVP means that process must switch from processing queries to handling network
operations, which has a detrimental effect on the processor's cache. An option to try is to have
one less CPUVP than the number of physical processors and one network VP. Bind the
CPUVPs to physical processors; the NETVP will be scheduled on the spare CPU and will not
starve for resources.
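A sketch of such a NETTYPE setup (the fields are protocol, number of poll threads, connections
per thread, and VP class; the connection counts are placeholders):

NETTYPE  ipcshm,1,50,CPU   # shared memory polling inline on a CPUVP
NETTYPE  tlitcp,1,100,NET  # tli/tcp polling on its own NETVP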
There is a known bug with the poll call in Solaris that is currently being fixed. When a lot of
connections are made on a port, poll can perform extremely poorly, because the kernel
keeps a linked list of connections that the poll system call must traverse. This is particularly
bad in Baan environments, where each Baan session starts multiple Online sessions.
onstat -g ntu, ntt, ntm, ntd, nss, nsc and nsd can give useful network statistics.
Configuring CPUVPS
All processing in Online is performed by CPUVPs. Informix recommends setting the number
of CPUVPs to one less than the number of physical processors available. The user should
always experiment with this; sometimes setting the number of CPUVPs greater than the
number of physical processors can increase performance, but this will only be true if there is
idle time on the system. Use mpstat to determine if Online is using resources efficiently.
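A sketch of the relevant onconfig parameters for a hypothetical 20-processor machine
(AFF_SPROC and AFF_NPROCS are described below):

NUMCPUVPS   19  # one less than the number of physical processors
AFF_SPROC   1   # first physical processor to bind to
AFF_NPROCS  19  # number of processors to bind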
When Online is initializing it starts a single oninit process and forks another oninit for each
VP. The state of the VPs can be seen with onstat -g glo.
The aio, pio, lio and msc VPs should have little or no CPU time allocated to them. Also,
usercpu should dominate, as Online only really uses the OS for I/O, shared memory and timing
(gettimeofday). We have seen at most 20% system time on a heavily loaded OLTP system.
For optimal performance it is better that the CPUVPS stay on the same physical cpu for as long
as possible, which reduces the cache miss rate of the process. There are two parts to this: the
first is binding the process to a CPU, and the second is extending the timeslice each process
receives.
The CPUVP processes are bound to physical cpus using the parameters AFF_SPROC, which
indicates the physical number of the first processor to bind, and AFF_NPROCS, the number of
processors to bind. If these are enabled the following messages appear in the message log:
10:01:58  Affinitied VP 1 to phys proc 1
10:01:58  Affinitied VP 3 to phys proc 4
...
If we run pbind on the system we see:
process id 4940: 4
process id 4945: 1
...
Binding should reduce the amount of migration of the oninit processes from one physical cpu
to another.
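As a sketch, on a hypothetical 20-processor system the following onconfig entries implement
the scheme described above, leaving processor 0 free for the NETVP:

NUMCPUVPS   19    # one less than the number of physical processors
AFF_SPROC   1     # first physical processor to bind
AFF_NPROCS  19    # number of processors to bind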
For details on modifying the dispatch table see http://hot.eng/dbe/tools/. Modifying the dispatch
table gives the oninit processes a longer timeslice, moves them to the highest priority and
keeps them at that priority. This reduces how often the processes are rescheduled and helps
performance.
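The table can be dumped, edited and reloaded with the standard Solaris dispadmin utility; a
minimal sketch (the actual edits to quanta and priorities are described in the referenced paper):

/usr/sbin/dispadmin -c TS -g -r 100 > /tmp/ts.config   # dump the current TS table
vi /tmp/ts.config                                      # lengthen quanta, raise priorities
/usr/sbin/dispadmin -c TS -s /tmp/ts.config            # load the modified table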
To see how many users are logged in use onstat -g ses. To see what threads are in the system
use onstat -g ath:
Threads:
tid  tcb       rstcb     prty  status             vp-class  name
6    de0e31d8  de00e018  4     sleeping secs: 1   3cpu      main_loop()
7    de0e3980  0         2     running            24tli     tlitcppoll
8    de0e3e78  0         2     running            25tli     tlitcppoll
9    de0e6478  0         3     sleeping forever   1cpu      tlitcplst
10   de0e6b98  0         3     sleeping forever   1cpu      tlitcplst
11   de0e71e8  de00e4bc  2     sleeping forever   7cpu      flush_sub(0)
12   de0e73d8  de00e960  2     sleeping forever   19cpu     flush_sub(1)
58   de0f5000  de01bed8  2     sleeping forever   11cpu     flush_sub(47)
59   de0f5378  0         4     sleeping forever   22aio     kaio
60   de0f5568  0         4     sleeping forever   1cpu      kaio
61   de1090b0  de01c37c  3     sleeping forever   3cpu      aslogflush
62   de109360  de01c820  2     sleeping secs: 30  4cpu      btclean
80   de1189f8  0         4     sleeping forever   3cpu      kaio
307  de4788f8  0         4     sleeping forever   17cpu     kaio
311  df229df8  deec32a8  2     cond wait netnorm  1cpu      sqlexec
312  dfcd4b10  deed037c  2     cond wait netnorm  1cpu      sqlexec
313  dfcd51b0  deedcb08  2     cond wait netnorm  1cpu      sqlexec
Common threads are flush_sub, the page cleaner threads; kaio, the kernel aio threads (notice
they have the highest priority); aslogflush, which flushes the logical log buffer; btclean, which
cleans up indexes; and sqlexec, the user sessions.
Configuring Memory
All procceses in an Online instance attach to a number of shared segments. Since 7.2.3 and
Solaris 2.5.1 the size of the shared memory area can be up to 3.7 GB (approx). There are 3
segment types resident, virtual and message.
There is one resident portion, of fixed size which contains the buffer cache, locks, hash tables,
log buffers etc. The size of this segment is determined by the parameters BUFFERS, LOCKS,
PHYSBUFF and LOGBUFF. There is very little point in trying to manually determine the size
of this segment as the amount each element takes can change with the release of Online.
Simply set the desired parameters and see the size of the segment allocated. You can use
onstat -g seg
Segment Summary: (resident segments are locked)
id      key         addr      size         ovhd   class  blkused  blkfree
577     1387874305  a000000   -749838336   54952  R      432754   1
578     1387874306  de000000  131072000    2592   V      6275     9725
Total:  -           -         3676200960   -      -      439029   9726
Class can be one of R (Resident), V (Virtual) or M (Message). The minus sign in the Resident
output is because the value is > 2GB; ipcs -a gives similar information but the segment size
will be displayed correctly. The resident segment is the only one that can currently be locked
down with ISM, by setting the RESIDENT flag to 1 in the onconfig file. For more information
on ISM see http://hot.eng/dbe/tuning_and_faqs/ism.html. We have seen performance gains of up
to 15% with ISM and always recommend its use.
The address at which the resident segment is placed is controlled by the SHMBASE variable in
onconfig. This is usually set to 0x0A000000L, which is 160MB up in the address space. Online
places the resident segment this high to avoid the code and data segments below it. As a last
resort this value can be reduced if you are short of address space, but be warned it can lead to
odd behaviour.
The virtual portion of shared memory contains all other structures: big buffers, sort pools,
active thread data, user session data, the database procedure cache, network message queues
and many more. It is also used by PDQ for scans, aggregations etc. It can be composed of one
or more actual shared memory segments. On startup the size of the first segment is controlled
by the onconfig parameter SHMVIRTSIZE. When this initial segment is full, extra segments of
size SHMADD are added until the maximum shared memory limit is reached. The user can
restrict the entire size of shared memory with the SHMTOTAL onconfig parameter; a value of 0
means unlimited size. If unlimited is specified and the system memory limit is reached, a query
will abort. Reducing SHMVIRTSIZE can give extra memory to the BUFFER cache, but the user
must be careful here: virtual memory structures associated with user connections are only
allocated when the connection is established. If SHMVIRTSIZE is too small and there is no
shared memory available in the system, the connection will fail.
The amount of shared memory that each user requires varies greatly with what each session is
doing. The Informix Performance Guide indicates that anything from 100k to 500k may be
required per user. For a large number of users a TP monitor, such as Tuxedo, may be required
to avoid memory exhaustion.
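To put rough numbers on this: at the guide's upper figure of 500k per session, 2,000 directly
connected users would consume about 1GB of the virtual portion before any per-query
memory is allocated, which is why concentrating connections through a TP monitor becomes
attractive at such user counts.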
The message portion of shared memory is for the ipcshm interface. It is used by ipcshm
sessions to pass messages to and from the engine. It is usually small, its size being determined
by the NETTYPE parameter for ipcshm in onconfig. The message segment is always placed at
the end of the first virtual segment. An ipcshm client attaches at default address 80000. In
older versions of Online the message segment can be misaligned and cause a severe
performance problem, manifesting itself as up to 70% system time. A workaround in this
situation is to change the client attach address with the environment variable
INFORMIXSHMBASE.
Solaris is optimized for a maximum of 5 shared segments per process. For best performance,
therefore, we recommend the user determine the maximum size of the virtual segment for his
running system and set SHMVIRTSIZE to this value; it is suboptimal to have Online add
multiple small virtual segments. In some releases the maximum SHMVIRTSIZE that can be set
is 2GB. In these cases allocate the initial 2GB and then use onmode -a to add an extra virtual
segment covering the remainder of memory. Setting the shared memory values may require
adjusting the Solaris shm values in /etc/system (see chapter 4).
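As a sketch, for a system whose virtual portion has been observed to peak at around 1GB (all
values illustrative only):

SHMVIRTSIZE  1048576   # initial virtual segment, in kbytes (1GB)
SHMADD       65536     # size of any additional virtual segments, in kbytes
SHMTOTAL     0         # 0 = no limit on total shared memory

If a release limits SHMVIRTSIZE, the remainder can be added online, for example:

onmode -a 524288       # add a 512MB virtual segment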
To see how the memory is used in the system use onstat -g mem:
Pool Summary:
name      class  addr      totalsize    freesize  #allocfrag  #freefrag
resident  R      a00e018   143630336    12144
res-buff  R      12908018  -1222950912  14192
global    V      ca002018  11100160     10009648  1029        803
mt        V      ca006018  16039936     9671264   4178        637
rsam      V      ca036018  827392       29232     1390        8
aio       V      ca072018  22650880     3036976   2537        869
458       V      ca9c2018  8192         3512      7           1
aio_fpf   V      caa16018  81920        16240     2           2
...
Blkpool Summary:
name    class  addr      size  #blks
global  V      ca004168  0     0
BUFFERS and LRUs
The buffer cache, configured with the BUFFERS onconfig parameter, is one of the most
important resources in Online. Currently each buffer is 2k and can hold one page of data. The
purpose of any buffer cache in a database is to reduce the number of I/Os to disk; disk access
takes an order of magnitude longer than memory access. By caching a page in memory the
hope is that it can be reused, thus avoiding a disk I/O. This leads to the concept of cache hit
rate. The onstat -p command gives the read and write cache hit rates. %cached is calculated as
the number of buffer reads satisfied from the cache divided by the total number of buffer
reads:
Profile
dskreads pagreads bufreads %cached dskwrits pagwrits bufwrits %cached
2676     2719     3061     12.58   2        102      44       0.00

isamtot  open     start    read    write    rewrite  delete   commit   rollbk
102      10       3        2       2        0        0        0        0

ovlock   ovuserthread  ovbuff  usercpu  syscpu   numckpts flushes
0        0             0       8.62     14.15    1        0

bufwaits lokwaits lockreqs deadlks  dltouts  ckpwaits compress seqscans
0        0        9        0        0        0        0        0

ixda-RA  idx-RA   da-RA    RA-pgsused lchwaits
1        0        0        1          44
onstat -p is one of the most important Informix OLTP statistics. The meanings of the most
relevant fields are as follows:
• dskreads: Physical reads from disk.
• pagreads: Number of pages read.
• bufreads: The number of reads from the buffer pool. This should always be significantly
  greater than dskreads or the system is critically short of memory.
• %cached: The read cache hit ratio.
• dskwrits: Physical writes to disk.
• pagwrits: Number of pages written.
• bufwrits: The number of writes to the buffer pool. This should always be significantly
  greater than dskwrits, but not to the same extent as bufreads.
• %cached: The write cache hit ratio. Generally speaking this is less than the read cache hit
  ratio in insert intensive environments but can be higher in update intensive environments.
• isamtot: Total number of isam calls.
• commit: Calls to iscommit. Informix states there is no link between this statistic and the
  number of COMMIT WORK calls, but it is a fairly good indicator of the number of
  successful transactions executed.
• rollbk: Number of rollbacks. If this starts to increase sharply there may be a lot of errors
  or deadlocks occurring in the system.
• ovlock: The number of times that Online attempted to exceed the maximum number of
  locks (LOCKS in onconfig). If this is non zero there should be errors in the message file as
  well.
• usercpu: The total user CPU time.
• syscpu: System CPU time. If this is high it may indicate a Solaris problem.
• numckpts: Number of checkpoints. If this is high the physical or logical logs might be too
  small or the checkpoint interval might be too short.
• bufwaits: The number of times a thread waited for a buffer. This might indicate too few
  LRUs, a number of hot pages or a transaction holding a buffer too long. Always try to
  keep this number low.
• lokwaits: The number of times a thread waited on a lock. Again, strive to keep this value
  low.
• lockreqs: The number of locks requested. With an isolated transaction, use this statistic
  to size the lock requirements of the system (a worked example follows this list).
• deadlks: Incremented every time a candidate is chosen and terminated to resolve a
  deadlock.
• seqscans: Increments for each sequential scan. In most OLTP environments sequential
  scans should be avoided.
• lchwaits: Increments each time a thread had to wait for a shared memory resource. A
  high number indicates a problem.
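As a worked example of sizing LOCKS with lockreqs: run each transaction type once in
isolation and note the increase in lockreqs. If the heaviest transaction takes 50 locks and up to
400 such transactions can be in flight, LOCKS should be at least 50 * 400 = 20,000, plus
generous headroom, since exceeding it (ovlock) aborts transactions. The numbers here are
purely illustrative.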
In Online buffers are arranged into groups called Least Recently Used (LRU) queues. The
number of LRUs is specified using the LRUS onconfig parameter. Each LRU is in fact two
queues, a free queue and a modified queue, and is assigned approximately BUFFERS / LRUS of
the buffers in the system. On initialization all buffers are placed on the free queues. User
threads take a buffer from the free queue and data is loaded into it from disk. Other sessions
can share this data page; individual rows are locked when they are modified, until the
transaction commits. If a buffer is modified it is placed on the modified queue.
To see the status of the LRUs use onstat -R:
64 buffer LRU queue pairs
  # f/m  length  % of   pair total
  0 f    3278    69.6%  4708
  1 m    1430    30.4%
  2 f    3223    69.2%  4658
...
126 f    3329    70.9%  4698
127 m    1369    29.1%
92742 dirty, 300000 queued, 300000 total, 524288 hash buckets, 2048 buffer size
start clean at 25% (of pair total) dirty, or 1172 buffs dirty, stop at 24%
Modified buffers are placed at the head of the queue (hence the name LRU) and are flushed to
disk in one of three ways: during a checkpoint, by a page cleaner, or with a foreground write.
A user thread becomes a page cleaner when it places a buffer on a modified LRU queue and
calculates that the percentage of buffers on that queue is greater than the onconfig parameter
LRU_MAX_DIRTY. The thread locks the queue for a short period, selects 16 buffers to flush to
disk and unlocks the queue again. The cleaner continues to flush groups of 16 buffers to disk
until the percentage on the modified queue is less than the onconfig parameter
LRU_MIN_DIRTY. Buffers being flushed are locked until cleaning is complete and are then
placed on the free queue.
The cleaned buffer is not zeroed and is placed at the head of the free queue in the hope that if
another thread hashes to it, it will not yet have been reused. Clean buffers are taken from the
tail of the free queue. In high throughput environments the gap between LRU_MAX_DIRTY
and LRU_MIN_DIRTY should be kept small, or threads will spend too long cleaning and will
be unavailable for user work.
The value of LRU_MAX_DIRTY also directly affects the duration of a checkpoint in OLTP
environments. Assuming this percentage of buffers is dirty at the time of checkpoint, Online
must flush LRU_MAX_DIRTY * (BUFFERS / 100) pages. If checkpoint time is a concern,
reducing LRU_MAX_DIRTY will help.
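For example, with BUFFERS at 300000 (as in the onstat -R output above) and LRU_MAX_DIRTY
at 25, a checkpoint arriving at the dirty threshold must flush 25 * (300000 / 100) = 75,000
pages, roughly 150MB of 2k pages; halving LRU_MAX_DIRTY halves that figure.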
The onconfig parameter CLEANERS specifies the maximum number of threads that can be
cleaning at any one time. When a modified queue is being cleaned, the small m in onstat -R is
replaced with a capital M:
  5 M    1342    28.8%
CLEANERS also affects the number of threads that are initiated to complete a checkpoint. Each
cleaner thread is given a chunk to clean; when it completes its work, the next uncleaned chunk
is assigned to it. This can affect the duration of a checkpoint, as there can be tail-off if
CLEANERS is set incorrectly.
The temptation is to set CLEANERS as close to the number of chunks in the database as
possible (the maximum value of CLEANERS is 128) to reduce checkpoint time. This can
adversely affect OLTP performance during regular page cleaning. What tends to happen is that
all LRUs are consumed at a roughly even rate. Initially they all reach LRU_MAX_DIRTY at
approximately the same time and the cleaners kick in. If CLEANERS is 128, suddenly this many
threads are cleaning, 2048 buffers are locked, and actual user work takes a severe hit. The
system can show more idle time as user threads wait for the I/O to complete.
An alternative to increasing CLEANERS to reduce checkpoint time is to determine whether any
of the chunks are taking longer to flush, thus increasing the overall length. Use iostat to find
these chunks, and use striping to reduce the overall write time. The duration of a checkpoint
can be determined from the message log:
22:40:49  Checkpoint Completed:  duration was 44 seconds.
There is no hard and fast rule for configuring LRU queues. A thread must take a lock on a
queue when taking a free buffer or returning a modified one, so the main advantage of having
more LRUs is spreading the heat on these locks. The minimum required, therefore, is the
number of active threads in the system (locks are not held across thread switches), which is
limited to the number of CPUVPS. A value of 128 should be fine in most situations.
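Putting these recommendations together, a starting point for a busy OLTP instance might be
the following onconfig values (illustrative only, to be validated against onstat -R, bufwaits and
checkpoint duration):

LRUS           128   # spread the heat on the queue locks
LRU_MAX_DIRTY  10    # start cleaning a queue at 10% dirty
LRU_MIN_DIRTY  8     # keep the gap to LRU_MAX_DIRTY small
CLEANERS       32    # well below the maximum of 128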
A foreground write occurs when a thread needs to load data from disk and all the free LRU
queues are empty. The thread initiates a single I/O to write a modified buffer to disk and must
wait until that I/O is complete. Foreground writes should be avoided at all costs as they
severely impact performance. They occur when the cleaners cannot keep up with the rate of
buffer modification; onstat -R will show all the modified queues containing 100% of the
buffers. To avoid this situation reduce LRU_MAX_DIRTY and/or increase the number of
CLEANERS.
Use onstat -F to determine if foreground writes are occurring:

Fg Writes  LRU Writes  Chunk Writes
0          0           2

address   flusher  state  data
ca038458  0        I      0 = 0X0
ca038898  1        I      0 = 0X0
LOCKS
In most OLTP environments locking is very important. The default lock mode in Online is
page level, but the user can change this to row level using the LOCK MODE clause in a
create/alter table statement. Row level locking naturally requires many more locks and can
add some overhead, but for hot tables it is generally desirable. The maximum number of locks
is determined by the LOCKS onconfig parameter; its value must be determined by
experimentation. Lock structures do not take much memory, so the user has some scope to
increase them. If a transaction cannot obtain enough locks, an error is dumped in the message
log and the transaction is aborted.
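For example, to move a hot table, such as the customer table used in the examples in
Appendix B, to row level locking:

ALTER TABLE customer LOCK MODE (ROW);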
If a lot of transactions are trying to lock the same row or page, performance can be severely
impacted. The transactions will spin waiting for the lock and eventually some may time out.
The onconfig parameter TXTIMEOUT determines the amount of time a transaction will wait
before it times out waiting for a resource. The user might want to set this low if he has long
transactions and wants a quick indication that there is congestion.
Use onstat -g spi to determine if locks are getting hot:
Spin locks with waits:
Num Waits  Num Loops  Avg Loop/Wait  Name
297        2428       8.18           vproc vp_lock, id = 1
206        1645       7.99           vproc vp_lock, id = 3
153        159        1.04           lockfr0
68         73         1.07           lockfr1
189        688        3.64           lockfr2
7          7          1.00           lockfr10
36         49         1.36           lockfr11
24         24         1.00           lockfr12
16         149        9.31           fast mutex, lru-3
15         72         4.80           fast mutex, lru-5
17         77         4.53           fast mutex, lru-7
1          500        500.00         fast mutex, lockhash[37444]
1          4          4.00           fast mutex, lockhash[63173]
1          5          5.00           fast mutex, lockhash[63174]
82         593        7.23           fast mutex, bhash[228039]
2          2          1.00           fast mutex, bhash[299083]
88         604        6.86           fast mutex, bhash[494215]
There are a number of common hot locks to look for. vproc vp_lock is a lock held on the VP
for scheduling; the lockfrN locks control the linked lists of lock structures themselves; the
lru-N locks protect the LRUs; lockhash entries are individual user level locks; and bhash
entries are locks on buffers. If the Num Waits field is high for a lock but the Avg Loop/Wait is
low, the lock is being taken regularly for short periods. If Num Waits is low but Avg Loop/Wait
is high, the lock is being held by individual threads for a long period.
If the lru locks are hot, increase the number of LRUs. If a particular bhash entry is hot then the
user has a hot page and the application may need tuning. Use onstat -k to determine who is
holding locks (see the application tuning discussion in Appendix B).
System Tuning
In this section of the paper we discuss system and Solaris issues as related to Informix OLTP
applications. We do not intend to cover all system issues at great length, but will touch on
some key things to keep an eye on.
Sample /etc/system File
The following parameters in /etc/system should bring up an Informix database with up to a
3.86GB shared memory segment. For larger numbers of users, you may have to increase the
semaphore parameters.
set shmsys:shminfo_shmmax=4026531839
set shmsys:shminfo_shmseg=64
set shmsys:shminfo_shmmni=64
set semsys:seminfo_semmns=4000
set semsys:seminfo_semmnu=4000
set semsys:seminfo_semmsl=1000
set semsys:seminfo_semmni=2000
set semsys:seminfo_semume=2000
*
* The next parameter should be used only if the database is on raw devices
set bufhwm = 100
*
* For telnet connections, set pt_cnt
set pt_cnt = 1005
*
* Set the next parameter on sun4d systems only, to prevent minor faults at
* large number of users. Value depends on memory configured
set max_nprocs=16000
Setting bufhwm to 100 limits the memory the kernel devotes to the filesystem buffer cache to
100 Kbytes, freeing up more memory for the shared memory segment.
Disk
One of the most important aspects of database system tuning is tuning the disk I/O subsystem
well; we touched on this in Chapter 2. Use the extended statistics from iostat -xc or sar -d to
gather disk I/O statistics. The disk utilization (%b column from iostat or %busy column from
sar) and the service time (svc_t column from iostat or avserv from sar) are the key statistics to
monitor. Ideally, a data disk doing random I/Os should be less than 50% busy (about 40 I/Os
per second) and have a service time less than 50ms. Service times vary with the type of disk
being used, so these numbers are by no means absolute. The log disks can sustain more I/O
(up to 60% busy) without proving to be a bottleneck; beyond this, it is better to stripe the logs.
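For example (the first sample of each command reports averages since boot and should be
discarded):

iostat -xc 5    # extended per-device statistics plus cpu, every 5 seconds
sar -d 5 12     # one minute of disk activity in 5 second samples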
Memory
The memory subsystem plays a key role in OLTP performance. Informix requires a large
shared memory segment for good performance. In addition, memory is required to run user
processes. As a general guideline, the kernel requires about 30Mb.
Use sar -pg or vmstat to gather memory statistics. A sample vmstat output is shown in Table 1.
Table 1    Sample vmstat output

 procs   memory            page                       disk         faults        cpu
 r b w   swap     free     re mf pi po fr de sr  s2 sd sd sd  in  sy  cs   us sy id
 0 2 0   20600    134128   0  1  0  0  0  0  0   0  0  0  0   434 450 690  6  2  92
 0 0 0   4985632  5082864  0  5  0  0  0  0  0   0  0  0  0   109 21  404  0  0  100
The key vmstat fields are explained in this paragraph, with the corresponding sar fields shown
in parentheses. pi (ppgin) is the number of Kbytes/sec paged in by filesystem reads; po (ppgout)
is the number of Kbytes/sec paged out to the filesystem; sr (pgscan) is the number of pages
scanned by the page daemon. If sr is consistently non-zero, it indicates a shortage of memory.
On raw-device databases, pi and po should be 0; otherwise they may indicate paging.
If you find you're short of memory, you can reduce the kernel's memory requirements by
tuning certain parameters, especially on systems with large memory. Many kernel resources
are tied to the values of maxusers and max_nprocs in /etc/system; the default values of these
parameters depend on the amount of physical memory. max_nprocs can be set to the
maximum number of processes ever expected on the system. Use caution though: if the
system hits this limit, it will not be able to fork any more processes. maxusers is not directly
related to the number of processes and must be lowered experimentally.
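For example, on a large memory system where the defaults are oversized, one might
experiment with an /etc/system entry such as the following (the value is illustrative only, and
a reboot is required):

set maxusers = 512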
The user should also ensure he has the correct memory interleaving. To determine how the
memory is interleaved, use prtdiag -v and look under the Memory section:
/usr/platform/sun4u/sbin/prtdiag -v
========================= Memory =========================
                                             Intrlv.  Intrlv.
Brd  Bank  MB    Status  Condition  Speed    Factor   With
---  ----  ----  ------  ---------  -----    -------  -------
0    0     1024  Active  OK         60ns     16-way   A
0    1     1024  Active  OK         60ns     16-way   A
2    0     1024  Active  OK         60ns     16-way   A
2    1     1024  Active  OK         60ns     16-way   A
4    0     1024  Active  OK         60ns     16-way   A
4    1     1024  Active  OK         60ns     16-way   A
6    0     1024  Active  OK         60ns     16-way   A
6    1     1024  Active  OK         60ns     16-way   A
8    0     1024  Active  OK         60ns     16-way   A
9    0     1024  Active  OK         60ns     16-way   A
10   0     1024  Active  OK         60ns     16-way   A
11   0     1024  Active  OK         60ns     16-way   A
12   0     1024  Active  OK         60ns     16-way   A
13   0     1024  Active  OK         60ns     16-way   A
14   0     1024  Active  OK         60ns     16-way   A
15   0     1024  Active  OK         60ns     16-way   A
Here we see 16GB of memory in high density 1GB SIMMS, achieving 16-way interleaving. If
the user is restricted in the amount of memory he can order for a system, it may be better to
get low density SIMMS in order to achieve a better interleave factor.
CPU
CPU utilization is highly dependent on the workload. In general, the goal of tuning should be
to reduce the time spent by a process in kernel mode. This can be achieved by better caching of
data in a larger BUFFER cache to reduce I/O, sufficient memory to ensure that processes don't
get paged out, etc. System utilities such as vmstat, sar and mpstat show CPU utilization. As
Informix uses kaio with a zero timeout, there should be little or no wt time in the mpstat
output. In a fully-loaded system running Informix OLTP workloads, we've seen usr/sys time
ratios of 75/12 with 2% idle time. This is just a rough guideline, but if you see system times of
50%, something's probably wrong.
Modifying the default TimeShare (TS) class dispatch table can help significantly when running
a large number of Informix users. See the whitepaper, Supporting Many Database Users at
http://hot.eng/dbe/whitepapers for details on how to modify the dispatch table.
Appendix A : Informix Scripts
File: move_log.sh
#!/bin/sh
## Bring down to single user mode
echo "onmode -sy"
onmode -c
onmode -sy
sleep 30
## Add 3 logical log files
echo "Adding 3 logical log files"
onparams -a -d llog_ldbs01 -s 499000
onparams -a -d llog_ldbs02 -s 499000
onparams -a -d llog_ldbs03 -s 499000
## Take a checkpoint
sleep 30
echo "checkpoint"
onmode -c
## Take a null level 0 backup
echo "Backup"
ontape -s -L 0
## Switch the current logfile pointer to the new logs (assume 3 switches)
onmode -l
onmode -l
onmode -l
## Take a checkpoint
echo "checkpoint"
onmode -c
## Take a null level 0 backup
echo "Backup again"
ontape -s -L 0
## Now drop the initial logical log files (presume 3)
echo "Dropping initial log files"
onparams -d -l 1 -y
onparams -d -l 2 -y
onparams -d -l 3 -y
## Take a checkpoint to free up the old logical logs
echo "checkpoint"
onmode -c
## Take a null level 0 backup
echo "Backup again"
ontape -s -L 0
Appendix B: Application Tuning
When tuning an application in an OLTP environment, start with the tables that will be
accessed and the SQL that will be executed on them. Generally speaking, table scans should be
avoided in OLTP, except on temporary tables or when the cardinality of the table is extremely
small. Avoid scans on ALL tables that are being modified concurrently.
To avoid scans we need to build one or more indexes on the table and ensure that these
indexes are used in the queries we perform on the table.
Using sqexplain
Once the index is built, test that the optimizer is choosing it for the query. This is achieved by
setting explain on in your SQL code and running the query, either in the application or via
dbaccess:

set explain on;
SELECT COUNT(*) FROM customer
WHERE c_w_id = 286 AND c_d_id = 1 AND c_last = "BARESEANTI";

A file sqexplain.out will be produced in the execution directory and a Query Execution Plan
will be dumped into it:
QUERY:
------
SELECT COUNT(*) FROM customer
WHERE c_w_id = 286 AND c_d_id = 1 AND c_last = "BARESEANTI"

Estimated Cost: 1
Estimated # of Rows Returned: 1

1) informix.customer: INDEX PATH
    (1) Index Keys: c_last c_w_id c_d_id c_first   (Key-Only)  (Serial, fragments: 0)
        Lower Index Filter: (informix.customer.c_w_id = 286 AND
        (informix.customer.c_d_id = 1 AND informix.customer.c_last = 'BARESEANTI' ) )
Here we see that an index is being chosen for the query. In some situations, even after an
index has been built on the table, the optimizer indicates that it is not chosen:
QUERY:
------
SELECT d_name, d_street_1, d_street_2, d_city, d_state, d_zip
FROM district
WHERE d_w_id = 286 AND d_id = 1

Estimated Cost: 828
Estimated # of Rows Returned: 100

1) informix.district: SEQUENTIAL SCAN
    Filters: (informix.district.d_w_id = 286 AND informix.district.d_id = 1 )
The optimizer bases its query execution plan on the statistics available to it. Statistics are
gathered using the UPDATE STATISTICS SQL statement. The user has three options: LOW,
MEDIUM and HIGH. For LOW the smallest amount of information is gathered; no column
distributions are gathered. For HIGH the distribution information is exact; for large tables this
can take a long time, requiring scans for all columns specified.
For MEDIUM the data for distributions is obtained by sampling. This requires one scan of the
data but is a lot faster than HIGH. One strategy for statistics gathering is to specify HIGH for
smaller tables and MEDIUM for the rest. Statistics gathered with a LOW distribution can take
only seconds to collect, whereas MEDIUM can take minutes or hours.
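Following that strategy, a statistics run might look like this (table names are illustrative):

UPDATE STATISTICS HIGH FOR TABLE item;        -- small table: exact distributions
UPDATE STATISTICS MEDIUM FOR TABLE customer;  -- large table: sampled distributions
UPDATE STATISTICS FOR PROCEDURE;              -- reoptimize database procedures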
The user should obtain query execution plans for all the major groups of SQL statements to
be executed and ensure that the indexes are correct.
Database Procedures
Most database interactions are client server: the database engine is the server and the
application is the client. This client server communication can be remote over a network or
local on the same machine. If a lot of data, such as multiple intermediate result rows, passes
back and forth between client and server, the user may consider using database procedures.
The advantage of procedures is that intermediate results need not be passed back to the client.
The disadvantage is that some of the processing that would otherwise be performed on the
client is moved to the server. The user should try both alternatives to see which performs best.
For more information on database procedures see the Informix Guide to SQL. To test a database
procedure the user can call it from a dbaccess session. For a procedure declared:
CREATE PROCEDURE payment (
    did      SMALLINT,                 -- pmt->d_id
    cid      INT,                      -- pmt->c_id
    clast    CHAR(16),                 -- pmt->c_last
    c_did    SMALLINT,                 -- pmt->c_d_id
    c_wid    SMALLINT,                 -- pmt->c_w_id
    hamount  NUMERIC(12,2),            -- pmt->h_amount / 100
    wid      SMALLINT,                 -- pmt->w_id
    byname   INT,                      -- pmt->byname
    hdate    DATETIME YEAR TO SECOND   -- pmt->pay_date
)
call the procedure:

database tpcc;
execute procedure informix.payment(6,123,"OUGHTABLEABLE",4,55,23.30,100,0,
    '1996-02-14 16:58:21');
Note that it is important to get the format correct for any DATETIME parameters. The
procedure itself can be debugged using a trace file; add the following lines to the procedure:

SET DEBUG FILE TO '/tmp/payment.trc';
TRACE ON;

This will dump extensive amounts of debug data into the file /tmp/payment.trc, including the
long form of any SQL errors found. A database procedure can also call any Solaris command
using the SYSTEM statement:

SYSTEM( "sleep 100" );

A sleep is useful to halt a procedure to determine its state, locks held, stack size etc.
Application errors
All Informix errors have two parts, an SQL error and an ISAM error. Use the Informix utility
finderr to dump the full text of both errors:

finderr 100
-100    ISAM error: duplicate value for a record with unique key.

A row that was to be inserted or updated has a key value that already exists in its index. For
C-ISAM programs, a duplicate value was presented in the last call to iswrite, isrewrite,
isrewcurr, or isaddindex. Review the program logic and the input data. For SQL products, a
duplicate key value was used in the last INSERT or UPDATE.
Deadlock and locking
There are situations where an error may not be catastrophic in an application. Error 100 above
for instance may just need some further intervention. Two other errors
-154
58
ISAM error: Lock Timeout Expired
Appendix B: Application Tuning—December 1997
-143
ISAM error: deadlock detected.
occur when the session is chosen by Online as a candidate to free a deadlock situation. The
user can simply re-submit the SQL statement in the hope that the deadlock has indeed been
cleared. In high OLTP situations deadlock timeouts often occur but an excess number can
indicate a bigger problem.
Even when timeouts are not occurring, deadlock situations are one of the main performance
problems in OLTP environments. The timeout value is set with the onconfig parameter
DEADLOCK_TIMEOUT. The default is 60 seconds; the user might want to reduce this to get a
quicker indication of problems.
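Lock wait behaviour can also be set per session from SQL; the 10 second value below is
purely illustrative:

SET LOCK MODE TO WAIT 10;   -- wait up to 10 seconds for a lock
SET LOCK MODE TO NOT WAIT;  -- return an error immediately instead of queueing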
If a lot of timeouts are occurring check the following:
• The application is not doing prepare statements on the fly
• The statistics are up to date
• All sessions do not try to lock the same row or page
• A session is not performing a table lock on a frequently accessed table
• All the indexes are created correctly and are being chosen by the optimizer, thus avoiding
  table scans
To determine what locks an application requires use onstat -k, which dumps all locks in the
system:
Locks
address  wtlist  owner     lklist   type    tblsnum  rowid   key#/bsiz
a11f070  0       cb7b5fd8  0        S       100002   203     0
a11f0a4  0       cb7ba3d8  0        S       100002   203     0
a120388  0       ca8badd8  0        S       100002   203     0
a1203f0  0       ca8bcfd8  0        S       100002   203     0
a121394  0       cb7c9618  a758c24  HDR+S   4100002  43c02   0
a1234b0  0       cb7c9618  a935bc0  HDR+S   700002   5e1303  0
a123f40  0       cb7c9618  a8915e4  HDR+S   2d00002  533d17  0
a1bef9c  0       cb7c9618  a893f54  HDR+S   2d00002  533d1d  0
a1bf414  0       cb7c9618  a1234b0  HDR+IS  4100002  0       0
a1c31d4  0       cb7c9618  a43ab80  HDR+SR  3700002  533d1b  K- 1
a25ddb8  0       cb7c9618  a4d8990  HDR+SR  3700002  533d16  K- 1
a26179c  0       cb7c9618  a43c06c  HDR+SR  4d00002  48816   K- 1
a304378  0       cb7c9618  a1c31d4  HDR+S   2d00002  533d1b  0
a4394f4  0       cb7c9618  a304378  HDR+SR  3700002  533d1c  K- 1
a43ab80  0       cb7c9618  a3a0858  HDR+S   2d00002  533d1a  0
a43c06c  0       cb7c9618  a1bf414  HDR+SR  4d00002  43c02   K- 1
a616fbc  0       cb7c9618  a6b4c2c  HDR+SR  3700002  533d19  K- 1
aa6ccec  0       cb7c9618  a4394f4  HDR+S   2d00002  533d1c  0
....
239 active, 200000 total, 65536 hash buckets
The important fields here are the type of lock held (S is a shared lock, X is an exclusive lock),
the tblsnum, which is the partition number of the table, and the rowid. The rowid indicates the
following:
• a rowid of zero is a table lock
• a rowid ending in two zeros is a page lock
• all other rowids are row level locks on tables or indexes
The tblsnum indicates the internal partition number that the lock is taken on. Use the
following fragment of SQL to determine your partitions:
select a.tabname as Table,
       HEX(a.partnum) as TablePn,
       HEX(b.partn) as FragPn,
       b.fragtype as FragType
from systables a, OUTER sysfragments b
where a.tabid = b.tabid
and a.tabid > 99 ORDER BY 1,2,3;
This produces output:
table   tablepn     fragpn      fragtype
orders  0x00000000  0x04100002  T
orders  0x00000000  0x04200002  T
....
orders  0x00000000  0x04D00002  I
orders  0x00000000  0x04E00002  I
fragpn maps to tblsnum in onstat -k; fragtype is T for a table and I for an index. From the
onstat -k output above we see that our code has taken a number of shared locks on both the
table and the index of the orders table.
To determine what SQL a session is executing, first determine the Informix internal session
number with onstat -g ses:

session                                #RSAM    total    used
id   user     tty  pid   hostname     threads  memory   memory
455  dbbench  0    1606  haxx3-1      1        147456   140448
onstat -u can then be used to dump the statistics for the session:

Userthreads
address   flags    sessid  user      tty  wait  tout  locks  nreads  nwrites
ca038018  ---P--D  1       informix  -    0     0     0      3       41
ca038458  ---P--F  0       informix  -    0     0     0      0       9136
.....
ca8a9118  ---P--D  19      informix  -    0     0     0      0       0
cb7c6b98  ---P---  455     dbbench   0    0     0     26     0       0
132 active, 384 total, 345 maximum concurrent
In a deadlock situation the wait field shows the lock that the session is waiting on; onstat -k
can then be used to determine which session is holding that lock. onstat -p can also be used to
determine how many locks each type of transaction requires.
Once the session that is causing the deadlock is determined, use onstat -g sql <session-no> to
see the SQL being executed:
INFORMIX-OnLine Version 7.24.UC1 -- On-Line -- Up 22:38:43 -- 3268344 Kbytes

Sess  SQL             Current   Iso  Lock      SQL  ISAM  F.E.
Id    Stmt type       Database  Lvl  Mode      ERR  ERR   Vers
454   EXEC PROCEDURE  tpcc      RR   Not Wait  0    0     7.24

Current statement name : slctcur
Current SQL statement :
  execute procedure informix.order_status(0,3,5,234,"")
Last parsed SQL statement :
  execute procedure informix.order_status(0,3,5,234,"")
The hot locks in the system as a whole can be seen using onstat -g spi (see Informix Tuning
section).
Using PDQ
Occasionally users will use Parallel Data Query (PDQ) in OLTP environments, performing
scans on small or temporary tables and often joining them with traditional indexes. In these
situations memory must be allocated to PDQ using the DS_TOTAL_MEMORY onconfig
parameter. The user must then achieve a balance between BUFFER requirements and decision
support requirements within the memory available.
PDQ spawns many more Informix threads to perform its parallel work than a straight SQL
session does. Use onstat -g ath to determine the number of active threads, especially scan
threads:
Threads:
tid     tcb       rstcb     prty  status             vp-class  name
2       a9e3e018  0         2     sleeping(Forever)  21lio     lio vp 0
3       a9e3e2c8  0         2     sleeping(Forever)  22pio     pio vp 0
....
188547  c01f6fc0  b31c3318  2     sleeping(Forever)  11cpu     join_2.1
188558  b5d6ad78  c0c59858  2     sleeping(secs: 3)  6cpu      scan_3.0
189086  b9f96fc8  b6557918  2     sleeping(secs: 3)  14cpu     group_1.0
....
There is a limit to the number of threads a CPUVP (and a physical processor) can sustain
before the overhead of thread switching degrades performance. We have seen optimal
performance with 8 to 10 scan threads on a 250MHz processor. Unfortunately the lower bound
of the onconfig parameter DS_MAX_SCANS is 10, but reducing this parameter can often
increase performance in PDQ situations.
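The relevant onconfig entries might look like the following (sizes purely illustrative;
DS_TOTAL_MEMORY must be traded off against BUFFERS as noted above):

DS_TOTAL_MEMORY  20480   # kbytes of the virtual portion reserved for decision support
DS_MAX_QUERIES   5       # maximum concurrent decision support queries
DS_MAX_SCANS     10      # maximum PDQ scan threads; 10 is the lower bound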
optcompind
OPTCOMPIND arises from "OPTimizer COMPare the cost of using INDices". The comment in
the onconfig file is as follows:
# OPTCOMPIND
# 0 => Nested loop joins will be preferred (where
#      possible) over sortmerge joins and hash joins.
# 1 => If the transaction isolation mode is not
#      "repeatable read", optimizer behaves as in (2)
#      below. Otherwise it behaves as in (0) above.
# 2 => Use costs regardless of the transaction isolation
#      mode. Nested loop joins are not necessarily
#      preferred. Optimizer bases its decision purely
#      on costs.

OPTCOMPIND   0   # To hint the optimizer
In OLTP environments we always set this variable to 0.