Roger Snowden
Center of Expertise, Oracle Corporation
© 2002 Oracle Corporation, all rights reserved
A holistic method for diagnosing performance problems in a complex information system is presented. The
COE Performance Method (CPM) relies on proven techniques and offers a simple, end-to-end approach to
isolating performance bottlenecks based on evidence of their actual causes. Many excellent Oracle resources
treat single, individual technology components in greater depth than this paper; the purpose of this
document is to provide a complete method of end-to-end performance analysis for an entire application of
perhaps many cooperating components. While this approach is shown in the context of a networked
enterprise database application, CPM can be readily applied to any computing environment. An explicit goal
of the COE Performance Method described here is to achieve the performance commitments of Service
Level Agreements and to quickly diagnose variances from those SLAs.
Although Oracle’s relational database products have never been exactly simple, the software’s complexity has
grown significantly, particularly in recent versions. With increased complexity has come a great deal of
confusion and misinformation regarding performance management of the server and related technologies.
Often, performance issues are treated as though they were shrouded in magic and heralded by mystery.
Bookshelves are full of offerings for “tips ‘n tricks” and secret knowledge about squeezing performance
from an Oracle database. Authors with many years of database experience suggest various parameter settings
and configurations with little expository justification. While there is some excellent material on the subject of
Oracle performance, making the best use of that information requires a methodology based on facts and
logic rather than guesswork.
The problem with the common “best guess” approach becomes apparent when the database administrator
encounters a situation where parameters are adjusted, expensive memory or disk is added— changes made
as the experts recommended, yet performance is still abysmal and Oracle appears to be the problem. What to
do? Throwing memory or CPU at a problem may not address the underlying issues at all. In some cases,
such a blanket approach may simply make things worse, until the system in question grinds to an
unimpressive halt.
Therefore, it is reasonable and proper for us to seek a rational, comprehensive approach to managing the
performance of an Oracle database without reliance on guesswork. We need some consistent, uncomplicated
method of finding and relieving bottlenecks in a complex enterprise information architecture.
The movement toward network architectures has significantly added to the complexity of the computing
environment. Years ago it was only necessary to manage a single, unified set of technology components to
achieve optimal performance. Now we have to manage multiple technology stacks — the database and its
host platform, application servers, varied client workstations and operating systems as well as the network
that glues it all together. It is no longer possible to examine a single component and perform effective
diagnostics for the system at large.
The methodology for diagnosing and analyzing performance put forth here not only encompasses all
technology stacks in the realm of an application system, but additionally does so in an orderly manner, quickly
leading the diagnostician toward a positive result. Moreover, it takes into account the disparate pieces of the
computing puzzle that other, purely component-based approaches ignore.
The study of performance and capacity analysis of complex systems relies on a mathematical discipline
known as queuing theory. Queuing theory uses statistical methods to analyze the behavior of systems of
processes, particularly as interrelated processes affect one another. While this
description suggests a level of complexity that might discourage the non-mathematician, it is not necessary to
have a mathematics background to develop a reasoned understanding of the principles involved.
The fundamental equation we need to understand is this:
Response Time = Service Time + Wait Time
Response time refers to the total time a process consumes, start to finish. In a rush hour traffic example,
response time would be measured from the time a car entered a freeway to the time it left an off-ramp. In a
retail service scenario, it might be from the time a customer gets into a bank teller’s line (to cash a check,
perhaps) to the moment cash is in hand. Service time is the amount of time consumed by the process itself –
the teller’s “busy” time. Wait time refers to the time spent in line waiting for service. Optimal processes have
minimal service and wait times. The target in the performance method discussed here is overall response time.
For the most part, the focus will be on the causes of wait time, but by no means will service time be ignored.
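To make the equation concrete in Oracle terms, both sides can be read directly from the server's own statistics. The following is a minimal sketch, assuming an Oracle 8i/9i server and a login with access to the V$ views; times in these views are reported in hundredths of a second:

$ sqlplus -s system/manager <<'EOF'
-- Service time: total CPU consumed by sessions since instance startup
select name, value from v$sysstat
 where name = 'CPU used by this session';
-- Wait time: total time waited, per wait event, since instance startup
select event, time_waited from v$system_event
 order by time_waited desc;
EOF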
Most of us already understand these concepts, and we only need to observe the events of our daily lives
to reinforce this understanding. Consider the commuter driving to work during rush hour on a typical
morning. If traffic is moving rapidly, but congestion is heavy and cars are close together, a simple near
miss caused by one car stopping suddenly can create instant havoc. As following cars are forced to
brake suddenly, even more cars further back are affected and are forced to slam on their brakes. The
effect ripples backward through the highway perhaps for miles. Even if the original incident involves no
actual damage and traffic at that initial site begins moving again immediately, the delaying after-effects
are likely to continue for perhaps an additional hour. Once congestion has set in, it seems to feed on
itself long after the cause of the bottleneck is removed. It may be impractical to attempt to solve all of the
mathematical equations demonstrating the various events and collective consequences, but certainly
the rush hour driving experience reinforces the conclusion that a relatively small event can have severe
performance consequences.
As with traffic jams, computer systems suffer similar congestion.
Service time deserves some consideration. In the case of a database application, a session’s process might be
found to spend too much service time, in the form of CPU time, processing extra data blocks because of
the lack of a proper index on a particular table. That is, the server performs a full table scan instead of an
index range scan, retrieving many more data blocks than otherwise would be necessary. While this additional
work might be initially regarded as service time— indeed, each block retrieval operation will consist of some
CPU processing time— the operation will involve even more I/O wait time as the user’s process must wait
for each additional block’s disk read requests. So, while the full table scan certainly incurs additional CPU
service time, the symptom of poor performance will most obviously be exhibited by excessive wait time
(disk I/O) rather than service (CPU) time.
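As a sketch of the scenario just described (the ORDERS table, the bind variable, and the index name are hypothetical, and PLAN_TABLE is assumed to exist, as created by the standard utlxplan.sql script), the difference is visible in the execution plan before and after the index exists:

$ sqlplus -s scott/tiger <<'EOF'
-- Without an index on CUSTOMER_ID, the optimizer must read every
-- block of the table: a full table scan.
explain plan for
  select * from orders where customer_id = :cust;
select operation, options, object_name from plan_table;

-- With an index in place, the same query can use an index range
-- scan, replacing many multiblock reads with a few single-block reads.
create index orders_cust_idx on orders (customer_id);
EOF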
Consider another example from daily life: the junk food lunch. We drop by our favorite hamburger
restaurant for a quick bite and are faced with three lines of people waiting to order food from three
employees acting as servers. Which line do we choose? Almost automatically, we choose the shortest
line available. After several minutes, we notice someone who arrived after us is being served before us.
It dawns on us the person serving our line might still be in training. It takes that person about twice as
long to fill an order as the more experienced workers. So, we intuitively understand service time— the
time it takes to actually take and fill an order— is a vital component of response time. Response time in
this case is the time it takes to get our food in hand, starting from the moment we step into line in the restaurant.
Another example of the importance of wait time as a primary measure of poor performance would be
CPU time consumed by excess SQL parsing operations. A well-designed application will not only make use
of sharable SQL and avoid hard parses, but will also avoid soft parses by keeping frequently used cursors
open for immediate execution without reparsing at all— neither hard nor soft. A poorly designed application
will certainly exhibit a high percentage of parse time CPU, but will probably also incur a disproportionate
amount of time waiting for latches, most notably the library cache latch. As such, even a highly CPU-consumptive
process is likely to cause measurable, disproportionate waits. So, while service time must be
monitored, performance problems are more likely to be quickly spotted by focusing on wait time.
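A sketch of how such a condition might be checked follows; the statistic and latch names are standard Oracle names, but the interpretation thresholds are left to the analyst. Compare hard parses to total parses, and look for sleeps on the library cache latch:

$ sqlplus -s system/manager <<'EOF'
-- A high ratio of hard parses to executions suggests unshared SQL
select name, value from v$sysstat
 where name in ('parse count (total)', 'parse count (hard)',
                'parse time cpu', 'execute count');
-- Sleeps on the library cache latch often accompany excessive parsing
select name, gets, misses, sleeps from v$latch
 where name = 'library cache';
EOF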
CPM as presented here takes a holistic approach to performance analysis and encourages the analyst to
concentrate on service time or wait time as appropriate for the situation at hand. If the real problem is
service-time related rather than wait time, it will be indicated by CPM and its cause corrected.
Although the earlier automobile traffic example is easy to understand, the importance of wait time is all too
easy to forget when dealing with the abstractions of computer software. However, that example can
highlight how a database server might have a buffer cache hit ratio of ninety-nine percent and at the same
time exhibit abysmal response time. Or, how a large parallel query might take too long to complete while
CPU consumption mysteriously drops to near-idle levels. When the CPU is not working, it is waiting.
Queuing analysis is helpful in understanding resource utilization and for optimizing service levels. In queuing
analysis, the exact timing of an event is not always known. Customer arrivals, like computer users clicking the
submit button to invoke a database request, tend not to be uniformly timed and often come in groups; this
scatter is the common statistical phenomenon known as variance. It is simpler and more effective to instead deal with the
aggregation of events and construct a mathematical model based on the probability of each event. Since
customer arrival times and hamburger preparation times vary, a model can take the form of a graph
illustrating the effects of congestion, or “busy-ness”. From that model, an analysis can be performed of
response time, throughput, or the nature of a bottleneck.
The manager of the hamburger restaurant knows from experience that people arrive at random intervals.
That is, while there might be an average of three customers per minute during the mid-morning hours,
people don’t actually arrive at exactly twenty-second intervals. They come in groups and as individuals at
unpredictable times. Thus, variances in arrival rates may have an effect on our response time.
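The classic single-queue result makes this tangible. Assuming random (Poisson) arrivals and exponentially distributed service times — the textbook M/M/1 model, offered here as an illustration rather than anything specific to this paper — response time is service time divided by (1 - utilization), which a few lines of awk can tabulate:

$ awk 'BEGIN {
    s = 20                              # service time: 20 seconds per customer
    for (lam = 1.0; lam <= 2.81; lam += 0.6) {
        rho = lam * s / 60              # utilization = arrival rate x service time
        printf "%.1f arrivals/min  util %2.0f%%  response %3.0f sec\n",
               lam, rho * 100, s / (1 - rho)
    }
}'

At one arrival per minute the response time is barely above the 20-second service time; at 2.8 arrivals per minute, just under the three-per-minute capacity, response time balloons to 300 seconds, fifteen times the service time.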
An idle resource, like an employee or a CPU, is often seen as “wasted” capacity. However, having an
occasionally idle resource may be the price one pays to provide a level of service needed to be competitive.
Similarly, the freeway we use to drive to work during rush hour may have several lanes idle at two o’clock in
the morning. During rush hour, all lanes may be full and backed up. Extra slack or capacity is traded off for
busy-time response and throughput.
In computing systems, the same tradeoff between slack and congestion shows up as idle CPU time or growing
process run queues; unused memory or virtual memory swapping; idle or saturated disks. We may not be able to determine
precisely how many users will be logged on at one time or exactly what the workload will be, so we may
have to provide some margin of extra capacity in order to get our business completed on time.
In a large enterprise, the queuing model presents itself within the measure of end-to-end application response.
A user pressing a mouse button in an office may be physically and logically miles from the data of business
interest. The total time a user waits before their screen responds with a display of data is the sum total time
for each system component between that mouse and the distant repository of data, as well as the return
trip. Each component of technology has its own process model and is a potential contributor to response
delay. We will refer to these interconnected technology components as technology stacks. Examples include
the network, database server, application server, the underlying hardware machines, and their operating systems.
With a basic understanding of queuing theory, we need to develop a way to apply it to the technology
problem at hand. We need to have access to information which tells us when system components are busy,
how busy they are, and what they are waiting for when they are not busy. Fortunately, there are numerous
sources for this information. All we need is to identify them and to find a cohesive way to bring this
information together in an understandable manner.
Although each of these stacks consists of sub-processes, each with their own queuing models, we can view
the overall stack as an aggregate process and consider its response as a unit. For the Oracle Database Server
there exists a set of statistical values available for reporting, called wait events, indicating the presence or
absence of internal bottlenecks. Measuring changes in the performance of an Oracle database involves
viewing these wait events by value of time waited and comparing these wait times to the same measure from
a different time period. Other stacks involved in the end-to-end application view typically have tools to
provide similar information. We will discuss some of those tools in more detail later. Let’s now forge on to
the practical details of diagnosing performance issues.
Certainly, the need for engineering discipline in the deployment and management of mission critical
applications is well understood. Such discipline may be currently less widely applied toward performance
management than other areas of enterprise technology, but an engineering approach to the performance of
an application is just as important as engineering the initial deployment. While practices vary from
enterprise to enterprise, certain key practices have been identified by Oracle’s Center of Expertise as essential
to effective performance management. First among these is the establishment of a Service Level Agreement
(SLA). It is beyond the scope of this paper to fully define the nature of such an agreement. Nevertheless, it is
clear that in order to declare a particular aspect of system performance as bad, one must first have a clear
definition of good. One goal of the COE Performance Method described here is to achieve the performance
commitments of the SLA and to diagnose variances from that SLA.
Since an SLA is an agreement between a technology service provider and a user, it tends to be a bottom-line
document. That is, the agreement is for a particular specification of availability and performance for a
technology-based service. As such, it tends to focus on end-to-end service and does not bother with the
interconnected details in the middle. It is up to the technology provider to understand and define the
interconnected components (stacks) and to support them. Technology stacks in a contemporary information
environment will include database servers, application servers, hardware and operating system platforms on
which to run those servers, network components such as routers, hubs, gateways and firewalls, and
workstations with user interface software for end users. Each stack has its own set of support issues and
available tools for management.
In order to be able to effectively respond to reactive performance issues, the service provider should take a
proactive approach. The tools and techniques needed to diagnose wait time versus service time for each
technology stack must be implemented and in place, and they should be well understood by the service
provider prior to any actual performance diagnostic engagement. This deployment includes not only the
tools, but also the engineering training and support to use them.
Since version 8.1.6, the Oracle Database Server has shipped with a tool called Statspack.
Statspack is specifically designed to monitor server performance and offers a high level view of server wait
events— the key to tracking down database performance bottlenecks. Operating system tools such as sar,
netstat, glance, vmstat and iostat, among others, are also available on most UNIX platforms and are quite
effective in combination with Statspack for overall proactive diagnostic monitoring. Windows NT and its
successors, Windows 2000 and Windows XP, also come packaged with performance monitoring tools. Third
party tools are also available and many are quite effective, although they generally have a price tag associated
with them. Statspack is available free of charge, as is usually the case with the operating system tools
mentioned above.
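Getting Statspack into the proactive rotation is straightforward. A minimal sketch follows, assuming PERFSTAT is the schema created by the Statspack installation script; the report script name varies slightly across versions (spreport.sql in recent releases):

$ sqlplus perfstat/perfstat <<'EOF'
-- Take a snapshot now; take another after the busy period of interest
execute statspack.snap;
-- Then produce a report between two snapshot ids:
-- @?/rdbms/admin/spreport
EOF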
Whatever our toolset choices, we need to use those tools to establish and maintain a performance metric
baseline. This takes the form of actual performance data gathered at appropriate times, using tools such as
those already mentioned, to establish some measurable norm. A baseline might consist of an elaborate set of
gathered data, or may be as simple as a benchmark timing of a standard query. The important characteristic
of the baseline is that it is consistent and offers a reasonable basis of comparison. Data gathered should
represent actual system performance taken during one or more periods of busy activity. A baseline of data
gathered while the system is idle is of little use.
The baseline will need to be maintained as the system evolves, with respect to workload, functionality and
configuration. If you add new application features, upgrade the database version or add or replace CPUs or
other hardware, the environment has changed and therefore performance may have changed. If the baseline
is not reestablished, any understanding of a future performance complaint by the user community will be
compromised and blurred— one will not be able to know if a performance change is due to a configuration
issue or is a bug introduced with a new application feature. The baseline is established for this system in this
environment and enables a comparative analysis to be made to diagnose a specific problem.
The issue of the performance complaint itself is worthy of some note. One of the problems inherent with
managing complex systems is the uncertainty of the performance metric. Performance is largely a matter of
perception. A user may decide one day that a two second response for the execution of a particular form is
acceptable, but unacceptable the next day, depending on issues like how hurried or relaxed the user feels on a
particular day. This suggests the information used for the reference baseline needs to be coordinated with the
metrics used for the SLA. Even though performance complaints may still be lodged, at least the system or
database administrator has either a defense to offer or a starting point to diagnose the issue.
One of the best features of the COE Performance Methodology is that it lends itself to performance analysis
of large systems of interconnected technology stacks. Since our premise is that a system is no faster than its
worst bottleneck, it is obviously important to be able to identify the location of that bottleneck. Moreover,
although Oracle tends to be the common denominator from the perspective of users and management alike,
we know from experience bottlenecks can just as well reside in the network, the application server, or an
operating system.
In order to identify the problem technology stack, and ultimately the actual problem itself, we need a
systematic approach. The essential steps of the CPM approach, illustrated in Figure 1, will now be discussed in turn.
The COE Performance Methodology, in a nutshell:
1. Problem Statement
2. Information Gathering / Stack Identification
3. Stack Drill-Down
4. Fix the Stack
5. Test Against Baseline
6. Repeat Until Complete
As illustrated, the basic steps of the COE Performance Methodology are straightforward. By
starting with a high-level, broad view of the enterprise system and rigorously following the steps in
an orderly manner, positive results are achieved simply, quickly, and without expensive and time-consuming guesswork.
Figure 1
A clear and unambiguous definition of both good and bad behavior is essential. The problem statement is
more than half of the battle for a solution and defines success for us. Moreover, the discipline of stating the
problem clearly and concisely often uncovers possible solutions. There is an undeniable, almost siren-like
temptation to gloss over this step, but it must be resisted so that misunderstandings
and inefficiencies are avoided. If you think you are solving one problem and the customer or user has a
different expectation, valuable time will be wasted addressing misguided issues. An example of a weak
problem definition would be, “Online queries are slow and need to be much faster”, while a good problem
statement might be, “The Customer Service Name Lookup screen normally returns query results in 3-4
seconds, but has been taking more than 20 seconds since the open of business this morning.”
Define the problem specifically and concisely, establish the measure of success with the customer and make
certain you have agreement. The accordant goal must, of course, be reasonable and realistic. The definition
needs to be quantifiable in terms conforming to the SLA metrics. The “weak” problem statement example
above is harmfully vague. How would we know when we have succeeded in finding a solution? In our
“good” example, if the SLA requires specific response times for the application function in question, we at
least have a target for success and therefore a greater probability of success.
Sometimes a clear problem statement is elusive. When things go wrong, often during critical business hours,
tempers flare and communication lines break down. Sometimes the issue is obvious while at other times we
wonder if we are simply imagining a problem that does not exist. When in doubt, ask yourself the simple
question, “What makes you think there is a problem here?” then demand of yourself a very specific answer
based on symptomatic behavior. As Winston Churchill said, “Never overlook the obvious.” It may well be
the cause of the problem is already understood or suspected. A clear description of what the problem is—
and isn’t— will go a long way toward quickly resolving both obvious and obscure problems.
Take the time to clearly define the nature of the performance symptom, the time and circumstances of
appearance or disappearance, and to establish a valid test. Say what is known about the problem, and describe
what is not known. A previously developed test case is ideal, and if one does not exist in advance, now is the
time to create one. A test case can be as simple as the execution of a procedure through SQL*Plus and then
also through the web server, with a measurement of response times.
The result of the test needs to be compared to the baseline, so the importance of a valid and current baseline
is therefore apparent. If a baseline was not established in advance, get one now so that you at least have the
current bad performance captured and have something against which to measure the impact of changes. Not
all changes are good.
Execute the test case and record the result. Gather associated performance data from all technology stacks
defined earlier, using appropriate tools. Compare the test results for each stack to the baseline for that stack
and identify the most probable stack as the source of the bottleneck.
What is needed for this critical stack identification step is a cursory check of each stack potentially involved in
the problem. For hardware platforms, it may be a straightforward tool such as sar, iostat or vmstat. Network
tools include netstat and ping. For the Oracle database server, a quick review of the alert log or error trace files
will frequently turn up critical evidence for the trained database administrator. The ideal test is the one that
yields the most information with the least effort, so proceed accordingly.
Ideally, we will gather overall system resource data as well as service and wait times for each individual stack
in order to determine which stack is the biggest bottleneck. This is one of the biggest challenges: getting a
coherent, end-to-end measure of response time through each stack. Some organizations prefer to develop
and maintain their own monitoring tools and there are plenty of open source and freeware resources
available for use, including various scripting languages such as perl and tcl. A common practice is to use
operating system command utilities such as vmstat and iostat shown in Figures 2 and 3, and to use a scripting
language such as perl to analyze the text output. The tool can then “phone home” when exceptions are
encountered or predefined thresholds exceeded.
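A minimal sketch of such a “phone home” monitor, using sh and awk rather than perl; the threshold, the sampling window and the “oncall” mail alias are assumptions for illustration:

$ cat check_runq.sh
#!/bin/sh
# Alert if the average run queue over ten 5-second samples exceeds a threshold.
THRESHOLD=${1:-8}
vmstat 5 10 | awk -v t="$THRESHOLD" '
    NR > 3 { sum += $1; n++ }          # skip headers and the since-boot sample
    END    { if (n && sum / n > t) exit 1 }' ||
  echo "average run queue above $THRESHOLD on `hostname`" |
    mailx -s "performance alert" oncall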
Having an integrated monitoring environment will facilitate rapid and accurate stack identification during a
performance crisis. While elaborate third party tools are available for such an infrastructure, off-the-shelf and
freeware tools are often entirely adequate, although any tools you choose will have to be integrated into your
environment. For example, each UNIX platform in the enterprise might have a scheduled process to gather
sar and netstat statistics on regular intervals. If Statspack snapshots are also collected at similar times, it is a
simple matter to analyze reports for those tools for a period of concern and compare the available data to
reports from, say, exactly one week or one month earlier. If the application workload is similar for both
periods, but the performance problem did not exist in the earlier period, we have a fast way to compare
“bad performance” data to baseline data. If the problem is with the underlying UNIX platform or the
network, it should be apparent immediately. Even without the baseline, a trained technician will recognize
symptoms of constraint— a high percentage of CPU wait time or process swapping activity, for example.
See Figure 2 for an example of vmstat output.
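The scheduled gathering itself can be as simple as a pair of cron entries; the following is a sketch, with paths, file naming and the business-hours schedule all assumptions to adapt:

$ crontab -l
# Hourly sar sample (one 5-minute interval) and interface statistics,
# business hours only, one file per weekday and hour
0 8-18 * * 1-5 /usr/sbin/sar -o /var/perf/sar.`date +\%a\%H` 300 1 >/dev/null 2>&1
5 8-18 * * 1-5 netstat -i > /var/perf/netstat.`date +\%a\%H` 2>&1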
If no obvious starting point presents itself, we recommend you start with the database server itself. One
obvious reason is the database administrator understands that stack best. Another advantage is the Oracle
server gathers and provides information offering clues to problems across other stacks. For example,
network problems often show up as a specific Oracle wait event, “SQL*Net more data to client”.
Knowing the response time through the database stack will allow you to determine whether most of the
overall response time is spent in the database or not. This in turn will direct your attention to the database
itself or to another stack.
$ vmstat 5 5
[five 5-second samples of Solaris vmstat output: procs (r b w), memory (swap free), page (re mf pi po fr de sr), disk (s0 s1 s2 s3), faults (in sy cs) and cpu (us sy id) columns]
This is a vmstat sample taken from a 32-processor Sun system for five intervals of five seconds each. Because the first line of vmstat
reports averages since boot rather than a current sample, we ignore it. A quick glance under the procs section tells us there is some
process run queue wait time (“r” is either 0 or 1 in this example) and some resource waiting (“b” > 0 for most interval samples). This is
generally considered good, non-bottlenecked performance, although the “b” value indicates a process blocked by an I/O wait, so disk
may need balancing if that “b” value grows. Run queues are averaged across all CPUs on Solaris.
Memory paging and swapping are not the same. Paging, even with these seemingly large numbers, is quite normal. The “sr” column
tells you how often the page scanner daemon is looking for memory pages to reclaim, shown in pages scanned per second.
Consistently high numbers here (> 200) are a good indication of a real (not virtual) memory shortage.
The fields displayed are:
procs: the number of processes in each of three states:
  r    in run queue
  b    blocked for resources (I/O, paging, and so forth)
  w    runnable but swapped
memory: usage of virtual and real memory:
  swap amount of swap space currently available (Kbytes)
  free size of the free list (Kbytes)
page: page faults and paging activity, in units per second:
  re   page reclaims
  mf   minor faults
  pi   kilobytes paged in
  po   kilobytes paged out
  fr   kilobytes freed
  de   anticipated short-term memory shortfall (Kbytes)
  sr   pages scanned by clock algorithm
disk: the number of disk operations per second, per disk unit shown.
faults: the trap/interrupt rates (per second):
  in   (non-clock) device interrupts
  sy   system calls
  cs   CPU context switches
cpu: a breakdown of percentage usage of CPU time (on MP systems, an average across all processors):
  us   user time
  sy   system time
  id   idle time
Figure 2
An important consideration when evaluating third party tools or “rolling your own” is to gather and analyze
data in a meaningful manner. For the most part, we are dealing with statistical samples when we monitor
hardware and software resources, so sampling techniques must be sensible with respect to sample size and
interval. The vmstat report shown in Figure 2 was taken at five-second intervals. While short intervals show
performance spikes quite well, they also tend to exaggerate variances in values and therefore contain statistical
noise. A better method is to take concurrent short and long samples to be able to analyze both averages and
variances to get a meaningful picture of performance.
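In practice this can be as simple as running two collectors side by side; a sketch, with the one-hour window chosen arbitrarily:

$ vmstat 5 720 > vmstat.5sec &     # fine-grained: 5-second samples for one hour
$ vmstat 300 12 > vmstat.5min &    # coarse: 5-minute averages over the same hour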
$ iostat -xtc
extended device statistics
[per-device r/s, w/s, kr/s, kw/s, wait, actv, svc_t, %w and %b columns, plus tty (tin tout) and cpu (us sy wt id) summaries]
. . .
This is an abbreviated iostat report from the same 32-processor system shown in Figure 2. The svc_t column is actually the
response time for the disk device, however misleading the name. When looking for input/output bottlenecks on disks, a rule of
thumb is to look for response time greater than 30 milliseconds for any single device. A well-buffered and managed disk system can
show response times under 10 milliseconds.
Here are the field names and their meanings:
device  name of the disk
r/s     reads per second
w/s     writes per second
kr/s    kilobytes read per second
kw/s    kilobytes written per second
wait    average number of transactions waiting for service (queue length)
actv    average number of transactions actively being serviced
svc_t   average service time, in milliseconds
%w      percent of time there are transactions waiting for service (queued)
%b      percent of time the disk is busy (transactions in progress)
Figure 3
A sudden burst of activity might cause a single disk drive to be so busy as to cause process queuing, yet may
not be of any real concern unless it becomes chronic. On the other hand, long iostat samples will average disk
service time and tend to hide frequent spikes, possibly masking a real problem. See Figure 4 for an
example of a CPU resource measurement illustrating how large variances in reported data can be misleading.
Looking at too short an interval, you might conclude CPU idle time is as high as nearly seventy percent
or as low as twenty percent. If you are trying to analyze a performance anomaly during a period of
high or low CPU usage, such a narrow slice of data can be quite helpful. On the other hand, taken as an
indication of the norm, such a microscopic view could be completely misleading.
The first priority at this early juncture is to eliminate obvious problems that can skew performance data and
blur the analysis. We are concerned with quickly ascertaining the overall health of the components of each
technology stack to make sure we know where the possible problem both is and isn’t. We do this by looking
for exceptions to what we know to be normal behavior.
CPU Idle Time
[line chart: CPU idle time sampled at fifteen-minute intervals from about 7:20 to 9:30, with a trend line and intervals marked “Low” and “High”]
CPU Idle times extracted from a sar report. The jagged line represents samples taken at fifteen-minute intervals. The trend line is
shown to illustrate the degree to which variances among individual samples can be distracting and misleading. You need both average
and variance information to get a true picture of what is happening at the hardware and operating system levels. The interval marked
“Low” is entirely different from the interval marked “High”. A narrow peek at a performance variation can be useful for analyzing
bottlenecks, but can be misleading if taken as an indication of the norm.
Figure 4
For example, perhaps we received a report that an Oracle server had severe latch free wait events during a
period of bad performance. If we respond directly to that symptom without adequate high-level analysis of
the overall platform/database technology stack, we might overlook heavy process queuing at the operating
system level. That is, the Oracle database might appear to be the problem, when the real issue is a lack of
capacity. Reports from vmstat or iostat would indicate chronic process run queues, so we would know that
the Oracle database itself is probably not the culprit, at least not the primary culprit. Once the resource limit is
addressed, by tuning the application, rescheduling processes or adding more or faster processors, we can
proceed once again with the stack analysis and identify server constraints in their proper context.
tracert mail12
Tracing route to [] over a maximum of 30 hops:
[hop-by-hop listing: the first several hops each respond in under 10 ms; one later hop shows probe times of 220 ms, 210 ms and 231 ms]
Trace complete.
Sample tracert used to identify potential network problems. Coupled with ping, a number of common issues can be quickly identified.
Ping each device shown in tracert, with the “don’t fragment” bit set and a large packet size to isolate individual segment performance.
Although tracert shows timing information, it is for very small packets and may not isolate bottlenecks, so ping is used in conjunction
with tracert.
Figure 5
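For example (Windows syntax, to match the tracert above; the router name is illustrative), a 1472-byte payload plus 28 bytes of IP and ICMP headers exactly fills a 1500-byte Ethernet MTU:

C:\> ping -f -l 1472 router1

If a segment drops these pings while smaller ones pass, a device along that path has a smaller MTU or is fragmenting traffic, and that segment deserves closer attention.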
Each technology stack is then analyzed in detail to ascertain the source of the bottleneck. Since this effort is
specific to each stack, the exact drill-down techniques are beyond the scope of this introductory paper. A
network analysis, for example, might involve the services of a network administrator and the use of a
network sniffer. Figure 5 shows an example of using the tracert utility to analyze network performance.
The specific techniques for each stack differ greatly and have to be developed and supported specifically for
each environment. For the Oracle database server, a tool such as Statspack or Oracle Enterprise Manager can
provide a focused accounting of wait events. Sample the data for a narrow, busy period. One of the most
common errors for database statistics gathering is to assume more is better and to take a sampling for too
long a period. If the performance symptoms appear for fifteen minutes each hour, then an hour-long sample
of data only averages the wait events and hides the real cause of the problem. Data gathered should represent
actual activity during the most pronounced performance symptoms for the clearest picture.
Create the associated report, such as Statspack for the database server, and review it for the top wait events,
in order of time waited. Each of those events will provide evidence of the cause of the problem and will
provide a path for further drill-down. Much information is available to discover the significance of each wait
event in the context of Oracle’s internal operations and it is up to the individual performance analyst to learn
how to interpret and respond to wait event statistics. Although many tools purport to offer “tuning advice”,
there is no substitute for individual knowledge and training. A good source of information for database wait
events is Anjo Kolk’s YAPP paper referenced in the bibliography.
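For live drill-down alongside the Statspack report, the server's wait interface can also be queried directly. A minimal sketch follows; the view and columns are standard in 8i/9i, though the idle-event filter shown is deliberately short and real systems exclude a longer list:

$ sqlplus -s system/manager <<'EOF'
-- What is each session waiting on right now?
select sid, event, p1, p2, wait_time, seconds_in_wait
  from v$session_wait
 where event not in ('SQL*Net message from client', 'pmon timer',
                     'smon timer', 'rdbms ipc message')
 order by seconds_in_wait desc;
EOF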
A note of caution is due here. There exists a blurry distinction between capacity planning and performance
management. The two subjects are tightly intertwined. One of the important techniques required when
engaging in performance analysis is to properly distinguish between a capacity problem and an actual
performance problem. If a problem crops up slowly over time in the form of gradual performance
degradation as workload grows naturally, the problem is a matter of capacity, not performance per se. A
performance issue is a technical matter to be dealt with in a primarily technical manner, while a capacity
problem quickly turns into a business decision. If a server needs additional capacity, such capacity must be
purchased or done without.
Having identified the worst bottleneck, it is now time to apply an appropriate remedy. Again, as an
introduction to the COE Performance Methodology, it is beyond our scope to attempt to list possible fixes
here. You may have identified a bug or a matter of human error, or a hardware failure. Whatever the cause,
use your engineering, and perhaps diplomatic skills to get it fixed.
Now that the single bottleneck has been identified and relieved, it is time to rerun the test case and compare
to the baseline and SLA to establish relative success. We use the term relative here to suggest the problem
might not be altogether solved. It is common to find the relief of one bottleneck only serves to reveal
another. If you have achieved success, document that fact, stop tuning and go home. You do get to go
home, don’t you?
Performance management is, of course, an ongoing process. This is not meant to suggest the diagnostician
will walk away and not continue to monitor performance. On the contrary, proactive monitoring is the best
way to avoid emergencies. It is important, however, to distinguish between reactive and proactive efforts and
not to be caught in the trap of managing one crisis into the next. After the crisis is resolved, review
performance against the baseline and update the baseline if hardware or software configurations have
changed. Continue to monitor proactively.
If success, as defined by agreement established in the problem statement, is not yet declared, go back to the
second step above and rerun the analysis to identify the stack now containing the worst bottleneck. Consider
the possibility the bottleneck has moved to another stack. It is also possible there is no ready relief for the
problem. This may be a case where a performance problem is actually a capacity issue, in which case an
investment decision may need to be made. Alternatively, the root of the problem may be a bug or a
hardware failure for which there is no immediate solution.
Often one symptom will mask another. It is not uncommon for multiple, unrelated problems to manifest
themselves at the same time. In a recent engagement involving a sudden and dramatic increase in response
time in a production database, heavy contention was discovered within the file system. Once several large
objects were moved to other, less busy disk drives, throughput increased fourfold, but response time for
individual users was still slow. Further investigation from the top down revealed certain SQL statements did
not properly use an index. Both issues surfaced at the same time because of the introduction of a new
business transaction type causing a concentration of activity on the affected disk objects, while at the same
time invoking SQL statements not previously executed. Once the SQL statement was corrected to be more
selective, performance returned to normal, acceptable levels and the engagement ended.
Performance problems are like onions: you peel them one layer at a time.
In order to perform the multiple levels of diagnostics required for each stack, a number of tools will be
needed. Commercial software and hardware products are available from various vendors and free software
tools abound. It is beyond the scope of this paper to attempt to identify all such tools, but some obvious
sources are hardware and software vendors as well as the various open source consortia. Commonly used
diagnostic tools already mentioned include sar, iostat, vmstat, netstat and ping for UNIX platforms. Some
tools offer varying degrees of comprehensiveness and integration. Naturally, an integrated tool is likely to be
more convenient to implement than a set of point-solution tools.
For Oracle servers, obvious choices include Oracle Enterprise Manager (EM), the utlbstat/utlestat scripts,
and Statspack. EM has features incorporating the basic methodology described here. Utlbstat/utlestat and
Statspack have the virtue of being included with the server at no extra charge. Statspack has been shipped
with Oracle database servers since 8.1.6 and is intended as a replacement for the utlbstat/utlestat scripts. It
offers excellent and comprehensive features for ongoing monitoring of the database. All of these tools will
report data for selected intervals and will provide a view of the wait event interface built into the Oracle
server kernel.
Besides tools to cover the technology spectrum under your domain, you will also need occasional
cooperation from other experts. One of the more common problems of the contemporary enterprise is a
direct outgrowth of the integration of disparate technologies— communication barriers. Often, the
administrators of the database, hardware platform and the network belong to entirely different management
structures. While a performance methodology such as this cannot address political turf, cooperation is
necessary to quickly diagnose potentially complex problems.
An understanding of Oracle concepts is fundamental to effective performance analysis. Have you read the
Concepts Manual lately? All components of the Oracle server are explained in that material, including Buffer
Cache operations, enqueues, latches, the Library Cache, the Shared Pool, redo and undo, and the LGWR,
DBWR and SMON background processes. Oracle9i documentation includes Oracle9i Database
Performance Methods, which along with Oracle9i Database Performance Guide and Reference provides an
in-depth discussion of server and application tuning.
For technology stacks other than the database, there is a wealth of material to read. Some excellent sources
are listed in the bibliography below. Bear in mind some of them are written from the perspective of a
particular operating system, but contain concepts applicable to all brands and flavors of platform.
Documents available on the Oracle Technology Network site explain the wait events Oracle records,
providing the queuing analysis perspective you need to apply this methodology and to tune the database
product effectively. There is a discussion of Oracle wait events, in some detail, as well as an
introduction to wait event analysis known as Yet Another Performance Profiling Method (YAPP), by Anjo
Kolk. Also, Oracle9i Database Performance Methods applies the holistic approach to the database in
particular. Both are well worth reading. See the Bibliography for details and additional reading.
The Center of Expertise Performance Methodology has been a collaborative work of many individuals.
Current and former members of COE, including Jim Viscusi, Ray Dutcher, Kevin Reardon and others,
provided much of the early research. Cary Millsap offered the theoretical foundation for this effort.
Practical Queueing Analysis, Mike Tanner, McGraw-Hill Book Company (out of print in the United States,
but a classic worth finding, available at Amazon’s United Kingdom site)
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement,
Simulation, and Modeling, Raj Jain, John Wiley & Sons
Capacity Planning for Web Performance, Daniel A. Menasce, Virgilio A. F. Almeida, Prentice Hall
Oracle8i Designing and Tuning for Performance Release 2 (8.1.6), Oracle Corporation, part A76992-01
Oracle9i Database Performance Methods, Oracle Corporation, part A87504-02
Oracle9i Database Performance Guide and Reference, Oracle Corporation, part A87503-02
Sun Performance and Tuning, Java and the Internet, Adrian Cockcroft, Richard Pettit, Sun Microsystems
Press, a Prentice Hall Title
Oracle Performance Tuning 101, Gaja Krishna Vaidyanatha, Kirtikumar Deshpande, John A. Kostelac, Jr.,
Oracle Press, Osborne/McGraw-Hill
Oracle Applications Performance Tuning Handbook, Andy Tremayne, Oracle Press, Osborne/McGraw-Hill
Yet Another Performance Profiling Method (YAPP), Anjo Kolk