PART III
Physical Schema Definition
CHAPTER 8
Introduction to Physical Database Design
Though this be madness, yet there be method in it.
—Shakespeare
You appeal to a small, select group of very confused people.
—Message in a fortune cookie
In the early 1980s, a Honeywell customer was having problems with DBMS batch jobs.
A Honeywell field-service engineer was visiting that company for a number of
management meetings. During her lunch break, she wandered into the DBA’s area where
she found the customer’s staff stymied by the processing problem. Their most important
application, the one that processed customer orders each night, was taking longer and
longer to complete its task. It now took almost 5.5 hours to run, pushing the envelope of
the nightly batch window. The Honeywell engineer looked at the database design and the
application and, after less than 20 minutes, recommended changes to about 20 lines of
code. She then returned to her meeting.
The DBA team made the recommended changes, tested the results in the test
environment, and had the changes approved and installed by that evening. The 5.5-hour job ran that night for 1 hour and 10 minutes. No other jobs, batch or online, were
adversely affected.
This example illustrates a peculiarity with databases. Computer programs either
compile or don’t. Compilers are very fussy. If you don’t have everything right, the compile
fails. DBMS compilers (precompilers, translators, interpreters, etc.) are not so picky. They
allow atrocious database designs to compile and be placed in production. And they work,
albeit not well. Database management systems have a resilience and ability to tolerate
some really poor designs. They just run slow. It might be a simple task, but if the database
design is poor, that simple task might take hours, even days, to run. DBAs might blame
other jobs or complain that the machine is too small, but it is not uncommon that a few
small changes in the design can make all the difference in the world. The application
programmer knows when his code is bad; the compiler tells him. Too often, the DBMS
compiler is silent, and that can really hurt.
Often the cause of the poor performance is relying only on the static definition of
data and not paying sufficient attention to how the data are used. Logical data modeling
correctly defines the data without reference to their use. The physical database design’s
purpose is to explain how the users and applications want to use the data.
Before converting entities, attributes, and relationships to records, data items, and
links, the designer should look at how the current crop of database management systems
came about.
Now, you say, why should I care about the history of the DBMS? Well, as they say,
“The more things change, the more they stay the same.” As you traipse through the history
of data management, you see the same concepts used, forgotten, and then rediscovered.
Like NoSQL? Then you will probably love the DBMS of the 1970s and 1980s. Want to
know what your DBMS vendor will come up with next? It just might be in that 30-year-old
manual you use to prop up your wobbling desk. Read and see.
A Short Incondite History of Automated Information Management (or, a Sequential Look at Random Access)
The mid-twentieth century rise of the computer created a need to manage not only the
machine hardware and software but also the data that changed its parlor-trick abilities into
something meaningful. Getting information in and out of a computer is still an expensive
and slow task; however, it was much more expensive and slower in the beginning.
Information Management Era 1: Sequential Processing
The punched card was invented in the eighteenth century, reinvented in the late
nineteenth century (for the 1890 census), and saw its first automated use with tabulating
machines in the early twentieth century, all before meeting up with the computer in the
1950s. Although it had many shapes and sizes, the 80-column card was certainly the most
well-known information repository of the era—allowing the storage of 80 characters, or
960 bits, of information. The cards had three significant features. First, they were easily
storable, if space was not an issue—stack ’em, rack ’em, or put them in long, low cabinet
drawers. If treated properly, cards could last centuries and were easily stored remotely
for security purposes. Second, they came in colors, which functioned as a file attribute
telling the operator what they were part of—green, new customer; yellow, existing
customer; red, customer in arrears. Third, a new card could be inserted into a deck, and
an unwanted card could be removed, quickly and easily. No automation required.
A significant disadvantage of punched cards, other than size and weight, was that
they had to be read sequentially. Combine this with computer memory being small—the
amount of data a machine could hold in its buffers was often little more than a card's worth—and the
system could read only a card or two, completing its processing task before memory
had to be flushed and reused. Sequential processing and limited memory meant that all
the data the computer needed to do its job had to be in its buffers at roughly the same
time. If an account had multiple orders, the Account and its Order records had to be
sufficiently close together that the machine could grab them as a single transaction.
The solution to the problem was to integrate the card files—to physically place all
the Order cards after their related Account card (Figure 8-1). The machine would read
the Account card and then read each Order card, one at a time, adding up the amount of
money the account owed. When the last Order card was read, the totals would be tallied,
a bill printed, and perhaps a new Account card punched. The machine then went on to
read the next Account card, and the process repeated.
Figure 8-1. Punched cards
For a monthly payroll system, the Employee card might be followed by four or five
weekly Time cards. This parent-child relationship of account-order, employee-time
card, student-grades, product-parts, and so on, became the basis for most sequential
processing. The introduction of tape, paper or magnetic, made little difference; although
faster, the files were still processed sequentially and used this parent-child model.
Tape did provide one advantage, although it was hardly a breakthrough. It allowed
the segregation of data by type. Accounts could be in one physical file, while Orders were
in another. Two tape drives could be used, one containing a tape of Accounts, sorted by
ACCOUNT NUMBER, and a second tape drive containing an Order file also sorted by
ACCOUNT NUMBER. The application would read one ACCOUNT NUMBER from Drive
0, and then, using Drive 1, see whether there were any Orders with the same ACCOUNT
NUMBER; if there were, it would process the lot until it ran out of Orders and then read
the next Account from Drive 0.
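The two-tape, master-detail pattern described above is easy to sketch. Below is a minimal Python illustration (the record layouts and values are invented); two sorted lists stand in for the Account tape on Drive 0 and the Order tape on Drive 1:

# A minimal sketch of the two-tape master-detail pattern described above.
# The "tapes" are stand-ins: lists of (ACCOUNT NUMBER, data) tuples,
# both sorted by ACCOUNT NUMBER, just as the text assumes.

accounts = [(101, "Acme Corp"), (102, "Baker Ltd"), (105, "Cogs Inc")]   # Drive 0
orders   = [(101, 250.00), (101, 75.50), (105, 10.00)]                   # Drive 1

def process_tapes(accounts, orders):
    """Read each Account once, then consume its matching Orders in sequence."""
    i = 0  # current position on the Order "tape"
    for acct_no, acct_name in accounts:
        total = 0.0
        # Consume every Order with the same ACCOUNT NUMBER.
        while i < len(orders) and orders[i][0] == acct_no:
            total += orders[i][1]
            i += 1
        print(f"Account {acct_no} ({acct_name}): amount owed {total:.2f}")

process_tapes(accounts, orders)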
Information Management Era 2: The First Random Access DBMS
Disks, with their random access capability, changed the game. Now data could be accessed
nonsequentially. The Account record might still be read sequentially; however, its
associated Orders could be read from a totally different file stored in ORDER NUMBER or
some other sequence. You just had to find it among the myriad orders.
Random access was a great advance but a messy one. How do you find one record in
a pile of a few thousand records? A simple and efficient way to retrieve a record was to use
the record’s disk address. Every record on disk has an address telling the system where it
lives. It might be something like disk 5, cylinder 3, platter 4, sector 6. Jump to that location
and your record should be there. You just had to know 5, 3, 4, 6.
How do you remember 5, 3, 4, 6? Two record retrieval approaches were, and still are,
popular. The first approach is to store the disk address of the Order record in the Account
record, and then every time you read the Account record the disk address of its Order
record is right there.
What do you do if there is more than one Order record and they are not stored in
ACCOUNT NUMBER sequence but rather are scattered all over the disk (there are always
a few unfortunate data points that challenge a good theory)? A good solution is to store
the disk address of the first Order record in the Account record, store the disk address of
the second Order record in the first Order record, store the disk address of the third Order
record in the second Order record, and so on, and so on. This is known as a linked list.
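As a rough illustration of the linked list just described, the following Python sketch fakes a disk with a dictionary keyed by made-up disk addresses; the record layouts are hypothetical:

# A rough simulation of the linked list described above. "Disk" is a dictionary
# keyed by an invented address; each Order record carries the address of the next
# Order, and the Account record carries the address of the first Order.

disk = {
    "5-3-4-6": {"type": "Account", "acct_no": 101, "first_order": "7-1-2-9"},
    "7-1-2-9": {"type": "Order", "amount": 250.00, "next_order": "2-8-1-3"},
    "2-8-1-3": {"type": "Order", "amount": 75.50,  "next_order": None},
}

def orders_for(account_address):
    """Follow the chain of embedded pointers from an Account to all its Orders."""
    addr = disk[account_address]["first_order"]
    while addr is not None:                 # stop at the end of the linked list
        record = disk[addr]
        yield record["amount"]
        addr = record["next_order"]         # jump to the next Order's disk address

print(list(orders_for("5-3-4-6")))          # -> [250.0, 75.5]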
The first popular database management systems stored each record type in its own
file and then allowed the designer to specify parent-child relationships across the files.
The parent-child relationships were carried out using pointers. It was a good system to
find Order records, but it had just one problem: how do you find the Account record if it’s
stored randomly?
The second record retrieval approach used a different tactic. If you had a randomly
stored Account record (suppose all new Account records were stored at the end of the
file), then create another smaller file, sorted by ACCOUNT NUMBER, that stores only the
ACCOUNT NUMBER and the disk location of the Account record and, to top it all off, give
this new sequential file a spiffy name such as index—because, after all, it does seem to
mimic a library card catalog index. Voilà.
Two of the remaining champions of era 2 are IBM’s Information Management
System (IMS) and CA Technologies’ IDMS. (A little trivia: the father of IDMS is Charles
Bachman, who is also the father of data modeling.)
WORD SOUP
In the word soup that is the database arena, the term data model is applied to two
very different concepts. Back in the salad days of data processing, data model referred
to the architectural approach behind the file or database management system. Using
this definition, the main types of data models were the hierarchical data model, used
by IBM’s IMS; the network data model, à la CA Technologies’ IDMS; and the relational
model as with Oracle Corporation’s Oracle. There were, of course, others.
More recently, the term data model is used to describe the abstract representation of
the definition, characterization, and relationships of data in a given environment.
In this book, data model refers to the abstract representation of data, while the
broad approach used by a DBMS to go about its business is referred to as its data
architecture or architectural approach or, more simply, just architecture. Using this
terminology, IMS uses a hierarchical architectural approach, while SQL Server uses
a relational architectural approach.
Don’t like either data model or architectural approach? You can still successfully
straddle the fence by simply using the word model as in network model or
relational model.
IMS is a hierarchical DBMS; in other words, data are stored in an inverted tree
(parent-child). You enter the database at the top, the root, and then progress down.
Account might be the top (parent), while Order is below it (the child). The tree might
have many levels, so below Order you might find the Line Item segment (segment is IMS
for record), a sort of grandchild. Everything was done with embedded pointers—the root
points to the first Account occurrence, the Account occurrence points to the first Order
occurrence, and so on.
Databases can grow quite large in terms of both record types and record
occurrences. As the functionality of a database grows, the number of record types can
blossom, resulting in multiple trees of multiple levels of quite complex structures. IMS
had a way of making it easier for the programmer by supporting application-specific
subsets of the database. Rather than seeing the entire database structure, IMS supported
a logical database description, which is a view or subschema to provide the application
with a subset of all the records and data items it required. The database objects not
needed could be left out of the view.
The hierarchical DBMS had a few advantages. First, it was fast. As long as you could
start at the root, you could find all associated records under it quite quickly. This made
it ideal for online transaction processing (OLTP). Second, relying on a single system
to manage all data (adding new data, retrieving existing data, or deleting unwanted
data) made database integrity and recovery relatively easy. Third, the logical database
description allowed applications to deal only with the data they needed and not the
entire, potentially complex, database.
However, the hierarchical model had two annoying drawbacks. First, you had to enter
at the top of the tree and then go down. You could never go up. You could enter the database
at the Order record (if you could find it), but there was no way to go up to its associated
Account record. The second problem was that it was strictly one-to-many. If your data
was many-to-many, you were out of luck.
The network architectural approach solved both of these problems with pointers
that pointed up as well as down. Want to go to the related Account record from an Order
record? No problem. Want to go from Account to Product and then to any other Accounts
that ordered the same Product (many-to-many)? No problem. Want to go horizontally
from Account to Account or Order to Order? No problem, because the network DBMS
used linked lists.
PUB TRIVIA
The network architecture was codified into a standard by the Conference/Committee
on Data Systems Languages (CODASYL). CODASYL was a volunteer standards group
that gave the world, among other things, standardized COBOL. CODASYL’s data
management group became the Data Base Task Group (DBTG), which spearheaded
numerous database standards. Some books refer to the network DBMS as the
CODASYL model, others as the DBTG model; however, they all refer to the same thing.
DBTG gave us a number of DBMS concepts and terms that still exist today, such as
schema, the specification of the database structure and how data in it is organized;
subschema, the subset of the schema that is the programmer’s or end-user’s view
of the database; Data Manipulation Language (DML), a sublanguage that defines
how database information is accessed, created, and destroyed by the programmer
or end user; and the Data Definition Language (DDL), the commands used by
the database administrator to create, modify, and delete database schemas and
subschemas.
Although not a direct line, there is a link between DBTG and the American National
Standards Institute X3/SPARC committees, which carried on its work. ANSI
introduced a three-level database model replacing schema with internal schema
and subschema with external schema and added a new layer called the conceptual
schema, which is the enterprise-wide view of the data.
The key to the network model was the set, a defined owner-member (parent-child)
relationship. Sets were limited to two levels, but the member (child) in one set could be
the owner (parent) of another set, providing a tree structure of any number of levels. The
set had another trick. The member of one set could be a member in a second set, allowing
the system to support many-to-many relationships (Figure 8-2). Invisible pointers, buried
in records, allowed the programmer to navigate from set to set, record to record.
Figure 8-2. Network sets
Navigation required that at any given time the programmer had to be aware of where
they were in the database. The current position (record, data item, or relationship) was
known as currency. Knowing the current record allowed the programmer to navigate
anywhere else in the database. CODASYL systems also supported a robust subschema
architecture minimizing the need for extensive navigation.
CURRENCY
If you have ever used a word processor, then you are familiar with currency. If you
look at a document on a computer screen and start typing, the characters you type
do not necessarily go where you are looking. Rather, the keystrokes entered go
where the cursor is located, which might be in a part of the document not even
displayed on the screen.
Database currency is similar to the word processing cursor. It is the place in the
database where the next function performed happens.
Network systems had another neat feature. When inserting a new record, you
could specify that you wanted it stored near another record occurrence. For example,
when inserting a new Order, you could specify that you wanted it stored on disk near its
Account occurrence parent, ideally on the same database page. This meant that when
you accessed an Account occurrence, there was a good chance that its associated Order
occurrences were on the same physical page.
Era 2—the first true DBMSs (from the late 1960s to today)—had a number of kudos
to its credit. First, the DBMSs were fast. You can’t beat pointers for speed, which is still
true today. The emergence of online transaction processing (OLTP) became their strong
suit. No DBMS approach got data to a computer screen faster. Second, they were reliable.
They oversaw the entire transaction, no matter how many places on disk it touched. They
guaranteed that the database always faithfully represented what was entered (which is
something many of the newest DBMSs today cannot say).
They also had some drawbacks. First, the hierarchical model was inflexible. Its one-directional nature and its one-to-many requirement sometimes made it difficult to fit
into the real world. The network architecture solved these problems, but the database
programmer required an additional ten IQ points to keep track of currency (i.e., where
they were in the database). Navigating, following pointers up, down, and sideways, was
confusing to many of our more challenged colleagues.
Modern-day versions of both IMS and IDMS are quite different, although they have
managed to maintain their best qualities. IMS can now "look up" and handle many-to-many relationships, although some would say its solution is a bit klugey. Indices allow
entry into the database at any level in the tree. IDMS had a rebirth with a number of
relational features that reduced or, in some cases, eliminated navigation.
Hierarchical and network systems became the database workhorses of the 1970s and
early 1980s and are still used today in transaction-heavy environments such as banking
and airline reservation systems.
A Small Digression: A Couple of Words About Database Access
One of the reasons hierarchical and network systems were fast was their methods of
fetching data. Both used hashing techniques and, later in their life, indices.
As with everything else in life, database access is all about costs. In computer
terms, there are two information management cost drivers: storage and speed. Storage
costs dollars for disk space. The more data you store, the more it costs. It’s rather
straightforward. If you want to cut down on storage costs, get rid of some data.
Speed is more interesting. The faster processing occurs, the sooner the machine can
do something else, so fast batch processing means that more jobs can be run in a unit
of time. Online processing is a little trickier. If an online application runs from 9 to 5, it
runs 8 hours whether it’s fast or slow. However, slow online transaction speeds can cost
a business in reduced sales (because customers get fed up waiting) or require more call
center staff, pushing up personnel costs.
Information cost structures have changed significantly over the years. When the
DBMS was a teenager, storage costs were high, and processing costs were relatively low
(compared to storage costs). For example, the disk to store 1 megabyte of data in 1955 cost
about $10,000, while that same megabyte costs less than 1/100 of a cent today to store.
That's roughly a hundred million times cheaper!
The DBMS of the 1970s worked hard to keep storage costs down; however, with
storage costs so low today, the cost focus has shifted to processing time. The effort now is
to process as much as possible as soon as possible, which brings us to disk speed.
What is the simplest accurate way to measure database processing speed? The
answer: disk I/O. The relative difference in speed of fetching information from main
memory and fetching it from disk varies based on the speed of the processor and the
speed of the disk, but as a useful round number, think 1,000 to 1. Fetching something
from main memory is arguably about 1,000 times faster than getting it from disk. So, the
important question for this millennium is how many disk I/Os does it take to fetch the
data you require and how can you reduce that number?
How many disk I/Os are needed to fetch a specific customer record from disk? If
there are 10,000 customer names on disk (and you assume each read of a customer
record requires one I/O, an assumption discussed later), then the average number of
reads to find your customer is 5,000 (number of records/2).
Hashing
Go back to the database file where each record had a physical address on disk (disk
ID, cylinder number, platter number, sector number). Imagine a file to store customer
information by CUSTOMER NAME. Also, imagine that you have a file consisting of 26
database pages. One way to store customer information is by allocating each letter of
the alphabet its own database page. Names are hashed by their first letter. Those starting
with an A are stored in page 1, all names starting with a B are stored in page 2, and so on.
When you want to fetch “Smith,” you know to go directly to database page 19. One I/O
and you have “Smith.”
Hashing consists of performing a function on the search key that always results in the
same number in the desired range. The previous example used the first letter of the name,
which is translated (hashed) into a disk location (A=1, B=2, etc.).
More complex schemes are both possible and the norm. Imagine a database with
950 pages storing information on an ACCOUNT NUMBER that can range from 1,000
to 9,999. A simple hash is to divide the account number by 950 (the number of pages)
and use the remainder as the location to store the record. In this example, ACCOUNT
NUMBER is the search key, the hash algorithm divides the search key by the number of
database pages, and the hash key is the remainder of the division.
To find the storage location for ACCOUNT NUMBER = 4560 (the search key), divide
4560 by 950 (the hash algorithm), and you get 4 and a remainder of 760. The record
should then be stored in page 761 (1 is added to the remainder to account for a remainder
of 0). It’s fast…using only a single I/O. Both IMS and IDMS used hashing techniques to
store and retrieve data.
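A minimal Python sketch of the remainder-style hash just described, assuming the 950-page file from the example (the function name is illustrative only):

PAGES = 950  # number of database pages, fixed when the file is created

def hash_page(account_number: int) -> int:
    """Map a search key to a database page using the remainder of a division.
    1 is added so a remainder of 0 still lands on a valid page (pages 1..950)."""
    return (account_number % PAGES) + 1

print(hash_page(4560))   # -> 761, matching the worked example in the text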
PUB TRIVIA
It’s an interesting question whether hashing would have been invented today if it
had not been discovered in the 1960s. Files today are easily expandable. Create a
file and the operating system allocates the space as the file requires it. It was not
always the case. Early mainframes and minicomputer operating systems required
that the size of the file be declared when it was created. The operating system then
allocated the entire declared space for the file. Need more space? You were out of
luck until newer operating system versions allowed dynamic allocation of space.
This was good for hashing because the number of database pages would be a
constant, allowing the hash algorithm to always return the same hash key for the
same search key.
Today, the number of database pages may be increased to accommodate new
records. However, when the number of pages changes, the hashing algorithm
no longer returns the same hash key for the same search key. To make hashing
work, the DBMS must, somehow, maintain the same number of pages (logically or
physically) regardless of file size.
There are probably as many different hash algorithms as there are databases, all
with the goal of producing a rather even distribution. Because even the most complex
hash algorithm rarely requires more than a knowledge of simple arithmetic, even the less
mathematically eloquent can get into the game.
When hashing works, it is great. When it doesn’t work, then things get complicated
and efficiency degrades. Take the example of storing the record “Smith” in a 26-page
database. The DBMS looks at the key “Smith,” says it should be stored on database page
19, goes to page 19, and discovers that the page is full. What does the DBMS do now?
Systems that use hashing sometimes have elaborate schemes for expanding pages or
storing information in overflow areas. Whatever overflow technique is employed, the
speed expected from hashing is compromised. Every time “Smith” is accessed, the DBMS
will go to the wrong page and, not finding “Smith,” start searching other locations.
Reorganizing the database also becomes difficult. Adding pages to the database can
require removing every record from the database, rehashing using a different algorithm
variable, and then restoring the records on their new page.
While hashing is great in certain situations, it is no access panacea. Luckily, other
access approaches are available. Enter the inverted index.
Inverted Indices
An inverted file or an inverted index is a sequential file of keys and file or database
pointers sorted in a different order than the file they support.
The concept dates to a time before the computer age. Take your average library. The
books are stored on the shelves using some classification system such as the Library of
Congress or Dewey Decimal system. Because few people know the classification number
or code of the book they want, they require a method to find the right shelf containing the
book. Enter the card catalog, which usually stored three index cards for each book—one
for the book title, one for the author’s name, and one for the subject. The title index card
was placed in a file of similar title cards sorted by title name, the author card was placed
in a file of authors sorted by author name, and the subject card was stored in a subject file
sorted by subject. If the reader knew only the title, they could find the desired card in the
title file. The card would then tell the reader where to find the book in the stacks.
The card catalog was three separate lists of all the books in the library, each sorted in
a different order than the books on the shelves. That is why it is called an inverted file or
inverted index, because the order of the cards was inverted from the order of the books.
This approach works for computer files as well. Take a customer. The actual
Customer file might be sorted on CUSTOMER NUMBER, but it could have inverted
indices on CUSTOMER NAME and CUSTOMER PHONE NUMBER. Look up a name in
the Customer Name index and find a pointer to the correct record in the Customer file.
One problem with inverted files is that adding new entries or modifying old entries
requires re-sorting the entire file, which could be a long and nasty process depending on
the size of the file.
How good is an inverted file? Well, it finds the desired record, but it is not very
efficient. An inverted file is still a sequential file with its records in sort-key order. Finding
a record still requires, on average, reading half the file. Using the previous example of the
Customer file of 10,000 entries, finding the correct index entry, assuming one disk I/O per
entry (an assumption discussed later), requires, on average, 5,001 I/Os (5,000 I/Os are
spent in the index alone, with one I/O to fetch the Customer record). Regardless of the
speed of your computer, this is “go get a cup of coffee, your data will be showing up about
the time you get back” speed.
One solution is the binary search. The binary search is also a technique used long
before automation. Go back to the library card catalog. Suppose you want a book written
by Herman Melville. You go to the author catalog and see 100 drawers containing index
cards of author names. Where do you start? Drawer 1 with Abbott? No. Because Melville
is in the middle of the alphabet, it would be smart to start in the middle with drawer 50.
However, suppose drawer 50 ends with the letter J. Now you know that Melville is not
in the first 50 drawers. Congratulations, you just eliminated half the author file. Next, go
to the middle of the remaining drawers, the Js to the Zs, drawer 75. At drawer 75, you find the first author name
starts with an S, so you have gone too far. You now know that the Melville card is between
drawers 51 and 74. Halving the distance again, you reach for drawer 62, and bingo, you
find the desired entry. This search is called a binary search because with each probe, you
eliminate half of the remaining drawers. A sequential search or scan would have required
50 reads or probes, but using a binary search, you did it in three.
Three probes seem too good? Actually, we were lucky. According to the math…

C = log₂(N) − 1 (1)

or in Microsoft Excel format…

= (LOG(N,2)) − 1 (2)

…with the worst case being…

W = log₂(N) + 1 (3)

or in Microsoft Excel format…

= FLOOR(LOG(N,2),1) + 1 (4)

where:

N = Number of entries to search
C = Average number of compares to find desired entry
W = Worst-case number of compares

…a binary search of 100 drawers should take an average of 5.6 compares or probes.
A binary search of the 10,000-record file should find a hit after 12.3 probes on average,
which is much better than the scan of 5,000 (N/2) probes.
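For the curious, here is a small Python sketch of a binary search along with the compare-count formulas above; the variable names mirror the text (N, C, W):

import math

def binary_search(sorted_keys, target):
    """Return (index, probes) for target in sorted_keys, halving the range each probe."""
    low, high, probes = 0, len(sorted_keys) - 1, 0
    while low <= high:
        probes += 1
        mid = (low + high) // 2
        if sorted_keys[mid] == target:
            return mid, probes
        if sorted_keys[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return None, probes

N = 10_000
C = math.log2(N) - 1                # average compares, formula (1)
W = math.floor(math.log2(N)) + 1    # worst-case compares, formula (4)
print(round(C, 1), W)               # -> 12.3 14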
Database Pages
Luckily, DBMS vendors identified the I/O bottleneck early on. The typical operating
system (and a few programming languages), as well as disk drive manufacturers, see the
magnetic disk as a series of rather small sectors. Sector size, often hard-coded into the
disk by hardware or software, allowed only a limited number of bytes written or retrieved
per disk I/O, some as small as 128 bytes. DBMS vendors worked around this limitation
by creating a database page, an allotment of disk real estate consisting of multiple
contiguous sectors read or written as one block. If you assume a sector of 128 bytes
and a database page of 32 sectors, then each database page can store 4,096 bytes. If the
Customer record is 800 bytes, then each database I/O can access five Customer records
(a blocking factor of 5). Finding a single customer in a 10,000-record file would then not
require on average 5,000 I/Os, but only 1,000 I/Os. That’s a significant improvement.
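The arithmetic above is easy to check; this back-of-the-envelope Python sketch plugs in the sizes assumed in the text:

SECTOR_BYTES     = 128
SECTORS_PER_PAGE = 32
RECORD_BYTES     = 800
RECORDS          = 10_000

page_bytes      = SECTOR_BYTES * SECTORS_PER_PAGE     # 4,096 bytes per database page
blocking_factor = page_bytes // RECORD_BYTES          # 5 Customer records per page
pages           = RECORDS // blocking_factor          # 2,000 pages in the file
avg_ios         = pages // 2                          # scan half the pages on average

print(blocking_factor, avg_ios)    # -> 5 1000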
LOGICAL VS. PHYSICAL I/O
There was a time when each I/O meant a trip to the I/O device. The physical record
was then read into the main memory’s buffer. The size of the allocated space in
main memory (the data buffer) was the same size as the physical record.
As main memory size increased, buffers expanded. The operating system could now
read multiple records into the buffer at a time. Buffer content was transparent to
the application, which still merrily issued an I/O request for each record. However,
now the operating system just might have the desired record in its buffer, saving
considerable resources. This led to the distinction between logical I/O (a request for
a record from secondary storage) and physical I/O (the actual fetching of data from
the storage device).
The gain for the inverted indices is even greater because each index record is
smaller, consisting of only CUSTOMER NAME and its database location. If we assume
CUSTOMER NAME is 12 bytes and the database key is 4 bytes, then 250 index entries can
be stored on a single database page (a blocking factor of 250), requiring fewer than two
physical I/Os to find an index entry.
However, there are even better ways to find a record than a binary search. Read on.
B-Trees
In the early 1970s, a few people, working independently, developed the B-tree index. A
B-tree stores index entries in a tree structure, allowing not only fast retrieval but also fast
insertion and deletion (Figure 8-3).
Figure 8-3. A simple B-tree
B-trees consist of nodes. A node is a record containing a specified number of index
entries (search key and location ID). When the index node becomes full, it is split into
three or more nodes with a parent node linking to two or more child nodes. As the index
grows and the number of nodes expands, new levels are added to the height of the tree.
The top node is called the root node, the bottom-level nodes are the leaf nodes, and the
levels in between store the branch nodes.
There are many variations of B-trees. There are binary search trees, B+-trees,
balanced trees, unbalanced trees, and many more. They differ in how they work. Some
work better with very large amounts of data, while some work better with smaller
amounts. Some B-trees specialize in highly volatile data (many inserts, updates, and
deletes); others specialize in highly skewed data (uneven distribution). Some are
useful for retrieving a single record, while others are best for fetching whole groups of
records. Some promise fast sequential searching, while others favor fast random retrieval.
Regardless, they all follow the same basic root, branch, leaf structure.
How fast are B-trees?
The answer is easily calculated:

C = log(N) / log(m) (5)

or in Microsoft Excel…

= LOG(N) / LOG(m) (6)

where:

N = Number of index entries to search
C = Average number of compares to find desired entry
m = Blocking factor of index
For the 10,000-record customer file and a blocking factor of 250, that’s fewer than 2
physical I/Os.
A main advantage of the B-tree over the binary search of an inverted file is not in
fetching data but in index maintenance. For the inverted file, every time a record is
entered or a search key modified, the entire file must be re-sorted. That’s not the case for
the B-tree. Usually, the new entry is just entered. If there is no more space in the node,
then, in most cases, three or fewer I/Os are required to add the new nodes and index
entry.
How often does a node need to split?

Percent Split = 1 / (m / 2 − 1) (7)

or in Microsoft Excel…

= 1 / ((CEILING(m,1) / 2) − 1) (8)

where:

Percent Split = Probability that the index node will need to split
Using the previous example, less than 1 percent of the time an insertion requires a
node split, so B-trees are very efficient.
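Both formulas are easy to verify numerically. The Python sketch below plugs in the 10,000-entry index and blocking factor of 250 used in the text:

import math

N = 10_000   # index entries
m = 250      # blocking factor of the index (entries per node)

compares      = math.log(N) / math.log(m)        # formula (5): node reads per lookup
percent_split = 1 / (math.ceil(m) / 2 - 1)       # formula (7): chance an insert splits a node

print(round(compares, 2))              # -> 1.67, i.e., fewer than 2 physical I/Os
print(f"{percent_split:.1%}")          # -> 0.8%, i.e., less than 1 percent of inserts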
B-trees were also retrofitted to hierarchical and network database systems, allowing
them to access even nonroot data easily and quickly.
Every major DBMS implementation today, regardless of architecture, uses some type
of B-tree, and other than minor variations, they are almost identical to those available 30
years ago.
Bitmaps
While bitmaps are usually found in the index section of every database book, their similarity
to other indices is tenuous. Bitmaps work well when the possible values of an attribute are
known and limited. Imagine an Employee record with the data item GENDER. Short of
discovering life on Mars, GENDER can have only two values, “Female” and “Male.”
There are a few assumptions about the following example. First, all Employee
records are the same fixed length (assume 1 KB). Second, assume the database is a flat
sequential file so that any record can be found if its position in the file is known. If there
are 1,000 employees at 1 KB per employee, then the database file is 1 MB long. The first
Employee record has a displacement (distance from the start of the file) of zero (it’s in the
first position). The second employee has a displacement of 1 KB and the third of 2 KB.
The 501st employee would have a displacement of 500 KB, and so on.
For a bitmap, the system creates a separate file of not 1 MB but a mere 1,000 bits—one
bit per Employee record. If the first employee in the file is a male, then the first bit is set to 1; if the
first employee is a female, then the bit is set to 0. The same is done for each of the 1,000
employees.
For the query, “How many ‘females’ work for the company,” the system adds the
total number of 0 bits in the bitmap file and you have your answer. For the query “Display
the NAME and SALARY of every ‘Female’,” the system can go to the bitmap and search
each bit for a 0. When it finds one, it then goes to the Employee file and fetches the record
with a displacement equal to the bit's position minus 1, multiplied by the length of the record
(because the first position is 0). If bit 125 is a 0, then the system should multiply
124 by 1,000 and go to the record with a displacement of 124 KB.
Bitmaps are great for systems where the query results (the result set) are usually a
large number of records.
The example can be expanded to bring it closer to how bitmaps are actually used.
Assume an Automobile database of 100 10 KB pages with a maximum of 10 Automobile
records per database page, and assume that the data item COLOR can have the values
“red” or “blue” or “green” or “black.”
Because the database can store 1,000 Automobile records (100 pages at 10 per page),
the bitmap needs to be 1,000 bits long. However, because there are four colors, four
bitmap indices are required, one for each color. (Note: Actually only two bitmaps are
needed if “No color” is excluded.)
With the four bitmaps, not only can a user find every car that is red, but, using
Boolean logic, the user can easily find every car that is both red and black by “And-ing”
the red and black bit maps.
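As a rough illustration of the bitmap indices just described, the Python sketch below builds one bitmap per COLOR value (using Python integers as bit strings) and combines two of them with Boolean logic. The sample data are invented, and because each sample car carries a single color, an OR is shown; AND works the same way for records that can carry more than one value.

# Illustrative only: one bitmap per COLOR value, stored here as Python integers
# so that bit i represents the i-th Automobile record in the file.

cars = ["red", "blue", "red", "black", "green", "red"]   # invented sample data

bitmaps = {}
for position, color in enumerate(cars):
    bitmaps[color] = bitmaps.get(color, 0) | (1 << position)   # set bit for this record

# "How many red cars?" -- count the 1 bits, never touching the data file.
print(bin(bitmaps["red"]).count("1"))          # -> 3

# Boolean logic across bitmaps: OR finds every red-or-black car; AND-ing two
# bitmaps works the same way when a record can match both values.
red_or_black = bitmaps["red"] | bitmaps["black"]
matches = [i for i in range(len(cars)) if red_or_black & (1 << i)]
print(matches)                                  # -> [0, 2, 3, 5]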
Bitmaps are best where

•	The database is used primarily for queries.

•	The query outcome is a rather large result set.

•	The queries are on attributes with a relatively small number of known values.
Bitmaps are amazingly fast, and the query can often be successfully completed
without ever going into the database file.
Associative Arrays
The associative array has been a data management mainstay almost from the creation
of the disk drive. In a traditional array or vector (a vector is a one-dimensional array),
the array contains only values. For example, in most programming languages, an array
is defined as a collection (series) of elements or values. A particular value is found using
its displacement (the first array value such as a bit, byte, string, etc., is placed in the
first vector location, the second value in the second vector location, etc.). Assume the
vector Employee has four locations, or slots, numbered 1 through 4, containing the
values “Abrams,” “Bailey,” “Collins,” and “Davis.” Fetching Employee [slot 1] would yield
“Abrams,” while fetching Employee [slot 4] would return “Davis.”
While the traditional array stores only values, the associative array stores key-value pairs. For example, the associative array Employee could store the key-value pair
EMPLOYEE NUMBER:EMPLOYEE NAME (the key is separated from the value with a
colon). The associative vector would now store (“101:Abrams,” “107:Bailey,” “231:Collins,”
and “103:Davis”). Fetching Employee[231] returns “Collins.”
In this example, the keys are integers, but they need not be. The array could have
been reversed with the key “Collins” and the value “231.” Hashed associative arrays are
often called hash tables.
An associative DBMS is a data management system whose architecture is based on
the associative array. Many consider the associative database the ultimate in flexibility. It
might offer variable-length records, consisting of a variable number of fields, each field of
variable length. For example, imagine a record occurrence consisting of the following:
Key                  Value
FIRST NAME           William
LAST NAME            Smith
EMPLOYEE NUMBER      34577
DEPENDENTS           Mary, Thomas, Roger
The associative system stores the field name with the value. On disk, the previous
data might look like the following:
FIRST NAME:William;LAST NAME:Smith;EMPLOYEE NUMBER:34577;DEPENDENTS:Mary,Thomas,Roger;;
In this example, a colon separates key from value, a semicolon indicates the end of
the variable-length field, a comma separates the multiple values in a single field, and the
double semicolon indicates end of record. If the database contains 1 million employees,
then the label FIRST NAME is on disk 1 million times. Some space could be saved by
substituting a shorter label for each field name, such as $1 for FIRST NAME.
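To make the delimiter scheme above concrete, here is a small Python sketch that parses such a string back into key-value pairs; the format (colon, semicolon, comma, double semicolon) follows the description in the text, and the helper name is made up:

raw = ("FIRST NAME:William;LAST NAME:Smith;"
       "EMPLOYEE NUMBER:34577;DEPENDENTS:Mary,Thomas,Roger;;")

def parse_record(record: str) -> dict:
    """Split an associative record into key-value pairs.
    ';;' ends the record, ';' ends a field, ':' splits key from value,
    and ',' separates multiple values within one field."""
    body = record.split(";;")[0]                      # drop the end-of-record marker
    pairs = {}
    for field in body.split(";"):
        key, value = field.split(":", 1)
        pairs[key] = value.split(",") if "," in value else value
    return pairs

print(parse_record(raw))
# -> {'FIRST NAME': 'William', 'LAST NAME': 'Smith',
#     'EMPLOYEE NUMBER': '34577', 'DEPENDENTS': ['Mary', 'Thomas', 'Roger']}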
Associative systems are excellent for variable-length data and are popular for query-based systems.
Few associative database systems are referred to as associative database
management systems. In practice, they are often referred to as hierarchical (such as DRS)
or inverted file (Model 204) systems, reflecting the trend to classify database systems by
how they link records together rather than how they store data fields.
Information Management Era 3: Inverted File Systems
Overlapping with the era 2 DBMSs was another whole DBMS animal—the inverted
file systems. Products such as Adabas (now owned by Software AG), Model 204 (now
called M204 and marketed by Rocket Software), and Datacom/DB (now marketed by
CA Technologies) followed a different approach than their era 2 cousins. These systems
removed all the pointers from the database and placed them in external indices. One
entered the system and even “navigated” around it within the indices. Only when all the
records wanted were identified did the DBMS delve into the database content to retrieve
the records.
Inverted file database management systems were feasible only because of the
advances made in indexing technology. B-tree indices, and their myriad spin-offs,
made it possible to find a record on disk with just a few more I/Os than the pointer
approach. Although not as good as era 2 hierarchical and network systems for transaction
processing, inverted file systems shined in query applications where the target was not
just one or a few records but entire groups of records. Today, most query systems, 4GLs,
document management systems, and many data warehouses use some form of inverted
DBMS technology.
Information Management Era 4: The Age of Relational
The relational model does not leave IT people bored or apathetic. It is either the most
loved or most hated DBMS model ever created. It is sanctified or vilified by academics,
theorists, practitioners, and users, making it the most difficult model to discuss without
upsetting someone. Talking to IT staff, one comes away with the belief that the relational
model is a mixture of mathematics, computer science, and religion. Era 2 and era 3
products—and there were dozens and dozens of them—were designed by software
engineers to solve software engineering problems. Their solutions, however, even the
ones that worked well, were not always elegant. That bothered Edgar (Ted) Codd, an
IBM researcher, who decided to develop an information management system that was
both mathematically elegant and useful. He used set theory and predicate logic as his
foundation. Although the relational model is technically as old as or older than many era
2 and era 3 systems, it took more than a decade for anyone to develop a viable product.
That, and the radical change inherent in relational technology, gives it its own era.
Codd looked at the database landscape and didn’t like what he saw. He thought the
then-current database systems were too complex, requiring navigating multiple levels,
dealing with pointers, handling indices, etc. He also disliked that the programs that
accessed the database had to change if the underlying structure of the database changed.
Adding new indices, pointers, or even database pages could require significant changes to
the applications using the database.
He had a few major goals for his relational model. First, he wanted to simplify the
database. He envisioned a system where data were represented in two-dimensional
tables. Elaborate structures, which looked like a Jackson Pollock painting, were to be
avoided. Artifacts that made the database lumpy (greater than two-dimensional), such as
repeating groups and group items, should be eliminated. Keep it simple.
Second, he wanted data independence. The programmer (or end user because the
new relational model was envisioned to not require programmers) should not have to
know the innards of the database. Pointers were verboten, and even indices should be
transparent to the user. In addition, how the data were used, by end user or program,
should be unaffected by any structural changes to the database. Adding an index,
modifying a relationship, etc., should all be possible without having to change how the
data were used. (Changes to the DDL should not affect the DML.)
Third, Codd wanted a solid formal foundation for the model. Then-current DBMSs
were like contemporary programming languages. They started as a simple concept,
but then were expanded, modified, and jury-rigged until they became large, complex, and
unwieldy, sometimes not resembling at all what they looked like in the beginning. His
solution would have a mathematical background centered on set theory and predicate
logic that would require little or no expansion.
Fourth, the model was to be declarative. Contemporary DBMSs were procedural in
nature, requiring the programmer to tell the system, step by step, what to do. Declarative
models have the user tell the system what is wanted and then leave it to the DBMS to
decide how to obtain the desired result.
Fifth, the new system should eliminate redundancy and data inconsistency.
The mathematical nature of the new model required new terminology not familiar to
many in IT. There were no longer files; there were now relations. Records were now tuples
(rhymes with couples). Data items were columns or attributes. Perhaps most important, at
least in hindsight, Codd separated theory from implementation. There was the relational
model, and there were vendor (IBM, Oracle, Microsoft) implementations (DB2, Oracle,
SQL Server), and as you will see, rarely the twain shall meet.
In the beginning, relational DBMSs (RDBMSs) suffered from poor performance. The
emphasis on declarative syntax and data independence, and a de-emphasis of storage
techniques, left the RDBMS relegated to use by small query applications.
However, time, and a mild shift of emphasis from the theoretical to the
implementable, moved the RDBMS into the mainstream. Today, the RDBMS has had a
long run, with more products out there than all the products from all the other DBMS
architectures combined.
You can’t talk about the relational model without talking about SQL. The
relational fathers did not create a user interface (DML or DDL), but others did. An early
implementation for a relational front end was SEQUEL, developed at IBM. It was followed
by SEQUEL2 and then, after discovering SEQUEL was trademarked by someone else,
SQL. (Old-timers still pronounce SQL as “sequel.”) SQL became the most popular RDBMS
language despite relational purists hating it. Adding insult to injury, many RDBMS
products have SQL in their name and omit relational (SQL Server, MySQL, SQLBase,
NonStop SQL, to name a few).
SQL’s popularity extends beyond relational systems. Many nonrelational DBMSs,
such as object-oriented DBMSs and NoSQL products (which are really “no relational”),
use a SQL-like interface.
The RDBMS, without question, is the most popular DBMS model in the world
today. It is the standard from which all others deviate. Look up Data Definition Language
(DDL) or Data Manipulation Language (DML), and the explanation is likely to focus on
relational DDL and relational DML without even a reference to its network origins.
It is the workhorse for many companies. If they have only one DBMS, it is likely to be
relational. If they have two or more data management products, one of them is probably
relational. It is the DBMS of choice for applications that have relatively flat files, use
simple, well-defined data types, and are query based, such as data warehouses.
Problems with Relational
The RDBMS also has some shortcomings, particularly in moving from formal model to
DBMS implementation.
First—Performance Issues
For many, the relational database management system is still not the DBMS of choice
for high-volume, short-response-time transaction processing. Even some early
relational advocates admitted it might be necessary to sacrifice performance for other
relational features, and to at least some extent, that remains true today. Those vendors
that have focused on performance have had to make some interesting theory versus
implementation trade-offs if not totally abandon the soul of the model.
Second—Not So Simple Simplicity
The simplicity aspects of the relational model have proved surprisingly complex for some.
Data Types
Vendors had to modify the relational model to accommodate nontraditional data types
(large text, audio, video, etc.) with very mixed results. Vendor-specific workarounds made
moving between relational products, or even between versions within the same product
line, problematic. In some cases, relational systems’ unfriendly attitude toward new data
types sparked whole new nonrelational database models.
Procedural Code
For some, the RDBMS is used for query processing, with SQL as a stand-alone language.
For others, the RDBMS is used for transaction processing, with a SQL sublanguage
embedded in a host programming language. In most of these cases, that host language is
a procedural one using, of all things, a procedural SQL cursor to maintain currency. Many
of these procedural addenda became industry standards.
Groups
Relational theory does not allow group attributes or multivalue attributes (repeating
groups or group data items). Implementers of the relational model are not so fussy.
Unfortunately, as seen in logical data modeling, the real world is full of group and
multivalue attributes. Price lists, tax tables, and other vectors and arrays abound in the
real world, yet they are illegal in the relational world. The same is true for group data
items such as DATE and ADDRESS.
This is at odds with most procedural languages, which provide, as a significant
feature, the ability to process arrays and group data items. And for good reason. It is naïve
to say that DATE is not an aggregate of MONTH, DAY, and YEAR. Or that EMPLOYEE
NAME does not include the data items FIRST NAME, MIDDLE INITIAL, and LAST
NAME. Programmers know this.
The argument against group attributes or multivalue attributes is that their
exclusion makes the database easier to use. This might be true, but it also makes it
less powerful and that much further away from representing the real world. And in
truth, if programmers can learn to use these features with considerable success in their
programming work, then they should be able to use them in their database access code.
Rather than benefiting from the “simplicity” of the relational model, database
programmers are forced to replicate, on their own, features (such as arrays and groups)
that were once readily available to them. Some IT shops purchase utility packages so
programmers do not have to re-create missing functionality, while others write their own
library routines. The relational model has not eliminated these real-world concepts; it
just turned automated solutions into manual ones. In any case, a database feature that
was designed to make the programmer’s life simpler actually complicated it.
In fairness, many relational model purists, even the most pure of the pure,
believe that some group data items are needed, with date being a prime example.
And most vendors support groups even if they don’t call them that. However, their
accommodations, as welcome as they are, often involve two unpleasantries. First, the
implementation is often a kluge, centering on vendor or user-defined domains or data
types. The second inconvenient awkwardness involves intellectual honesty. If you are not
allowing group data items, then don’t allow group data items; if you are going to allow
group data items, then do it straightforwardly, even if with a wink and a nod. The way it
is now, each database designer and programmer must look closely to discover how their
RDBMS vendor implements groups, if at all.
In short, the relational model simplifies the DBMS at the expense of the programmer.
Third—Communication and Language
For Codd, a major advantage of the relational model was its formal foundation. However,
the simplicity of the mathematical foundation of the model was itself problematic. It
might be clear to mathematicians but is much less so to the average programmer. Look at
this sentence from his 1970 paper:
The relations R, S, T must possess points of ambiguity with
respect to joining R with S (say point x), S with T (say y), and
T with R (say z), and, furthermore, y must be a relative of x
under S, z a relative of y under T, and x a relative of z under R.1
This is not the most complex sentence in the paper. It was chosen because it does not
require special symbols. For most programmers, it is incomprehensible. The relational
model is considerably more difficult to understand for the average database programmer
than any other database model. Many database administrators are left to dust off their
college version of Gödel’s incompleteness theorem or surrender to never understanding
why the relational model does what it does. Codd seemed to recognize this. Twenty years
after introducing the model, he wrote the following:
One reason for discussing relations in such detail is that there
appears to be a serious misunderstanding in the computer
field concerning relations.2
The ugly truth is that although programmers had difficulty understanding
the relational model, Codd had just as much difficulty understanding how
nonmathematicians in general, and programmers in particular, comprehend math.
THEY LOOK BUT THEY CANNOT SEE…WELL, SOME CAN
Peter Chen, the founder of the entity-relationship model, published a paper in 2002
illuminating the problem. In it Chen states the following:
It is correct to say that in the early 70s, most people in the academic
world worked on the relational model instead of other models. One
of the main reasons is that many professors had a difficult time
understanding the long and dry manuals of commercial database
management systems, and Codd’s relational model paper1 was
written in a much more concise and scientific style.3
He goes on to say this:
A lot of academic people worked on normalization of relations
because only mathematical skills were needed to work on this
subject.3
So why are there no better translations of Codd? What have the ocean of authors,
writing hundreds of books and papers on the relational model, done to improve the
situation? Very little. Read almost any instructions on normalization, and it’s obvious that
nonmathematical descriptions are rare. The authors seem either afraid to present the
material in a nonmathematical way or simply do not understand their audience. And that
is the fundamental issue. The intended audience for the relational model is (or should be)
not mathematicians, not end users, but IT people. If you want to communicate with IT
people, then you must speak their language. IMS, IDMS, and Adabas authors understand
their audiences. Relational authors…not so much.
WHERE ARE YOU, CARL SAGAN?
It is not impossible to make the incomprehensible somewhat fathomable. It just
takes understanding a complex subject and how people learn. A number of very
smart people have done it. Theoretical physicist Stephen Hawking, who held the
Lucasian Chair of Mathematics at Cambridge University (the chair once held by Isaac Newton), was able to write a book
for the everyday person describing black holes.4 His book was on the best-seller
list for more than four years. Theoretical physicist George Gamow, the first to give
us a mechanism for the Big Bang, wrote a number of popular books, two of which
focused on quantum theory and mathematics.5,6 Gamow, it is said, targeted his
books at the middle-school reader.
So why is it so difficult to get a simple explanation of the relational model?
Is this Codd’s fault? Not entirely. Chen’s early work on the entity-relationship
approach is quite mathematical and esoteric. However, Chen and even more so his
followers wrote material more understandable for people who wouldn’t know a Turing
from a Tarski. Relational followers have been more reluctant to translate Codd into English.
It is a shame because their reticence hides some of the beauty of the model from its users.
Fourth—Relational: Theory or DBMS?
Codd felt that the relational model was superior to its competitors, not just because it
made a good database management system but because it did a better job of describing
the real world, including doing a better job than the entity-relationship approach. Look at
this passage from his 1990 book:
With the relational approach, an executive can have a
terminal on his or her desk from which answers to questions
can be readily obtained. He or she can readily communicate
with colleagues about the information stored in the database
because that information is perceived by users in such a
simple way. The simplicity of the relational model is intended
to end the company’s dependency on the small, narrowly
trained, and highly paid group of employees.7
If programmers have trouble understanding the fundamentals of the relational
model, then business executives will be totally lost, yet Codd could not see this. In
fact, Codd saw the entity-relationship approach not as an analysis technique but as a
relational model competitor.
No data model has yet been published for the entity-relationship approach. To be comprehensive, it must support
all of the well-known requirements of database management.
Until this occurs, companies intending to acquire a DBMS
product should be concerned about the risk of investing in the
entity-relationship approach.8
The logical-physical distinction, as well as all the other database design principles
discussed in Chapter 1, is totally ignored. It's as though Codd could not see the
difference. He saw foreign keys and functional dependencies as a way of describing the
business world to an executive.
Fifth—Where Are You, Relational Model?
Lastly, the relational model Codd envisioned still does not exist. Unhappy with vendor
implementations that did not meet the standards of his relational model, Codd, in 1985,
came up with 12 rules that all RDBMS products should follow. By 1990, no vendors had
implemented all 12 rules; however, that did not stop Codd from introducing 321 more
rules. To date, almost half a century after its introduction, no RDBMS implementation
incorporates more than a handful of the final total of 333 rules (see Table 8-1).
Table 8-1. The Effectiveness of the Relational Model (Relational Goals)

Goal: Simplify
Effectiveness: Questionable. In theory, yes; in practice, the features that had to be added to make the DBMS practical, such as nonstandard data types (large text, video, etc.), cursors, and triggers, added to the complexity and difficulty of using the model.

Goal: Data independence
Effectiveness: Effective.

Goal: Solid formal foundation
Effectiveness: Mixed. The formal foundation is there but in a language understood by few in IT.

Goal: Declarative
Effectiveness: Partially. The requirement to make the model work in the real world necessitated adding a number of procedural features.

Goal: Eliminate redundancy and data inconsistency
Effectiveness: Mixed. The tool for eliminating redundancy (normalization) is a technique that can be applied to any DBMS. The need for foreign keys undercuts reducing redundancy.
Just Because It Has Failings Doesn’t Mean It’s a Failure
How successful is the relational model? Ask yourself this question: since the relational
model was introduced, how many other theoretically based database management
systems are there? Yet, despite all its failings, it is impossible to consider the relational
model a failure. It fails to live up to its inventor’s expectations, yet it endures as no other
DBMS has ever endured. Even failure can’t argue with success.
Information Management Era 5: Object Technology
Relational technology was king. Then in the 1990s, there appeared a new pretender to
the throne—object technology. An object is a structure that includes both data and the
operations (procedures) to manipulate those data. With the traditional database, an object
(for example, the record Order) only contains data (the attributes of Order). However,
the object-technology object Order contains not only the attributes of Order but also the
operations (computer code) that manipulate Order, such as Create New Order and Fulfill
Order. All the code associated with Order is in the Order object.
Object technology is, in part, a reaction against the traditional way of developing
systems—separating into two groups the tasks and even the teams that work on data and
process. Object technology says you can’t separate the two. Rather, think of a system as a
network of communicating objects that pass information or instructions to each other.
Object technology has a number of unique and defining features.
Association is the natural relationship between objects. For example, customers
place orders, so there is a natural association between Customer and Order. Associations
can have cardinality and modality and can exist between two, three, or n objects.
A child object can inherit properties from its parent object. Imagine objects
organized into multilevel inverted trees. Objects at the top are the most general, such as
Customer, while those lower down are more specific, such as Wholesale Customer. The
lower objects can inherit properties (attributes and operations) from higher-level objects.
In this example, the object Wholesale Customer inherits from Customer all of Customer’s
attributes and operations.
Although they routinely communicate with each other, the internal workings of each
object are independent of any other objects. This is called encapsulation—what goes on
inside an object stays inside the object. For example, the object Customer could contain
the operation Add New Customer, and the Customer object knows exactly what to do
when the operation is invoked, while the object Order knows nothing of, and is oblivious
to, that operation.
Object-Oriented Programming Led to Object-Oriented Analysis
and Design, Which Eventually Led to the Object-Oriented
Database Management Systems (OODBMSs)
The OODBMS is, in many ways, a throwback to the era 2 DBMS. The 1990s were a time
when computer usage was expanding into areas of nontraditional data types. The
hierarchical structure of objects is more compatible with era 2 DBMSs than relational
ones. Objects fit easily into multilevel trees stored in pointer-based systems. The database
could no longer stay in the flat tables of relational systems but had to adjust to the distinct
islands of multilevel data and code contained in the object.
Information management was no longer limited to numbers and small strings of
text such as names and addresses. Now the DBMS was called upon to store complete
documents, pictures, movies, music, graphics, X-rays, and any other type of exotic data,
including computer code. The RDBMS choked on these data types (imagine performing a
relational join on an artist name and a music video).
The OODBMS had everything in its favor (such as academic blessings and vendor
investments) except customers. For whatever reasons, although usually attributed to
massive corporate investments in relational technology, sales were weak. The solution?
Join the enemy. Underfunded OO vendors died off, and their place was taken by RDBMS
vendors adding OO features to their RDBMS offerings. The result was some strange
bedfellows (particularly if you ignore the RDBMS vendors criticizing, a decade earlier, era
2 and 3 vendors when they added relational features to their hierarchical, network, and
inverted file DBMS offerings to boost their weakening sales).
The OO-RDBMS, the multipurpose tool of the information management world,
offered two, sometimes distinct, views of the data, one relational and one object oriented.
SQL, never liked by the relational purists and becoming increasingly procedural, was
modified to accommodate object technology. And it worked. The OO components were
bolted onto the RDBMS without much loss in relational-ness.
How effective is this strategy? Well, the real question is how much real object-oriented system building is going on out there. Anecdotal information would indicate
that OO technology use is strong in vendor development shops but has been largely
abandoned (except in name) in end-user organizations.
A Small Digression (Again): The ACID Test
If you ask the question which DBMS is best, the right answer should be: for what? They
all have their strengths, and they all have their weaknesses. One fundamental way to
evaluate a DBMS is with the ACID test.
In 1981, Jim Gray, of Tandem Computers, published a paper in which he applied
a formalized definition of transaction to DBMS activity.9 In 1983, Gray’s concept was
expanded and given the acronym ACID by Haerder and Reuter.10 ACID stood for the
following:
•	Atomicity: Every part of a transaction must be executed before the transaction can be considered complete.
•	Consistency: Any change to the database must be consistent with all validation rules.
•	Isolation: Every transaction must be completed as though it were the only transaction, regardless of how many transactions there are and in what sequence they are executed. Isolation deals with the notion of concurrency control.
•	Durability: Once a transaction is committed, it stays committed. Failures from a loss of power to a computer, communications disruptions, or crashes of any type do not affect a completed transaction.
ACID is for transactions that insert, update, or delete database records. It guarantees
the integrity of data. Unfortunately, integrity does not come cheap. To work, ACID
databases require considerable support in the form of journals that store the image of
the record occurrence before the change (the before image), another image of the record
occurrence after the change (after image), and log files documenting each step of the
transaction. All of this protection is expensive in terms of space and processing time.
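To make these guarantees concrete, the sketch below shows the shape of a typical ACID transaction in generic SQL; the table, columns, and account values are hypothetical and stand in for any insert/update/delete unit of work.

-- A transfer between two accounts: both updates take effect or neither does.
START TRANSACTION;

UPDATE ACCOUNT
   SET BALANCE = BALANCE - 100.00
 WHERE ACCOUNT_NUMBER = 'A-1001';   -- debit one account

UPDATE ACCOUNT
   SET BALANCE = BALANCE + 100.00
 WHERE ACCOUNT_NUMBER = 'A-2002';   -- credit the other
                                    -- isolation: other transactions never see the half-finished state
COMMIT;                             -- atomicity and durability: the pair of changes is now permanent

-- If either statement fails, or a validation rule (consistency) rejects the change, the
-- application issues ROLLBACK instead, and the before images restore the database.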
The odd thing is that when the ACID paper was published in 1983, virtually every
major DBMS conformed to the ACID criteria, which relegated the entire concept to an
interesting academic sideline. As you will see, it was not until the advent of NoSQL, a
decade later, that ACID compliance became an important DBMS selection criterion
because many NoSQL systems do not meet ACID standards.
Information Management Era 6: NoSQL
NoSQL is an inaccurate name. It should really be called NoRelational because it is a revolt
against the constraints of the relational model, not against SQL. It is also a catchall phrase
that encompasses very different technologies targeted at very different problems.
Although NoSQL databases existed decades before the term was invented, they
became popular around the millennium when IT organizations were faced with not only
a growing number of relational-resistant data types but also big data. How big is big data?
Nobody knows, but nobody admits it. It’s just more and more of what IT has been dealing
with—lots more. Gilding the lily, big data requires doing sometimes detailed, statistical
analysis on large data sets.
Key-Value
There is no common NoSQL architecture, although one of the more interesting ones is
the key-value approach. Imagine a file cabinet full of file folders. Each file folder has a
tab stating what is in the folder. Although the contents of a single folder have something
in common, the same cannot be said for any two folders. One folder might be labeled
“bank statements,” while a second might be labeled “dog vaccination papers,” and a third
“Doonesbury cartoons.” Although a traditional file’s contents are related to the file label,
the contents of one folder might be totally different from the contents of any other
folder. NoSQL key-value systems link, in a tree structure, related contents vertically but
disparate contents horizontally.
NoSQL key-value database management systems are often an amalgam of hashing
techniques and associative arrays of key-value pairs, providing a powerful mechanism
for storage and retrieval. However, key-value databases have expanded the concept of the
associative array. The key is still the key, but the value might be an entire folder consisting
of many different data items of many different domains. Some key-value systems contain
a hierarchy of keys, with the first relating to the highest-level content and subsequent keys
to lower-level content (similar to the bank statement example).
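For readers who think in SQL, the basic shape of a key-value pair can be caricatured, very loosely, as a two-column table whose key is looked up directly (typically by hashing) and whose value is an opaque lump of bytes. The table and column names below are purely illustrative; a real key-value DBMS is not implemented as a relational table.

-- A crude relational caricature of a key-value store (illustrative names only)
CREATE TABLE KV_STORE (
    KV_KEY   VARCHAR(255) NOT NULL PRIMARY KEY,  -- the folder tab: 'bank statements', 'dog vaccination papers'
    KV_VALUE BLOB                                -- the folder contents: any bytes, any structure
);

-- Access is by key only; the engine maps KV_KEY straight to the stored value
SELECT KV_VALUE
  FROM KV_STORE
 WHERE KV_KEY = 'bank statements';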
Redis from Redis Labs and Oracle’s Oracle NoSQL are good examples of key-value
DBMSs.
Graph
Graph NoSQL databases are modern versions of the network architecture with two major
differences. First, they are tuned for high performance using techniques regularly found
in other NoSQL products. Second, they have a DML that makes it easier to navigate the
database.
Graph is a mathematical term for a structure consisting of a number of nodes. Nodes
are connected to each other by edges. Unlike trees, there is no up or down. The nodes
are, of course, records, and the edges are lines or links. Graph systems are often hybrids
combining other architectures into a single implementation, with key-value being a
favorite. Graph systems get their speed from embedded pointers linking the various
nodes.
Neo4j, from Neo Technology, Inc., is an example of an ACID-compliant graph
database management system.
Document Management
The document management system, usually listed as a separate NoSQL model, is often
a subset of the key-value approach. The key is the document name or description, and
the value is the underlying document. Each value occurrence contains not only the
document but the description (metadata, data type) of the document.
MongoDB from MongoDB Inc. is a good example of a document DBMS.
Multimodal
Multimodal systems are the strangest NoSQL animal. Multimodal systems provide a layer
on top of other database architectures. The user or program interacts with the top layer.
If the data to be stored are documents, the MMDBMS stores them using a document
manager. If the data to be stored are in tabular form, the MMDBMS stores them in a
relational format. The multimodal system is surely the turducken of the data management
world. Whatever data you have, the MMDBMS finds an appropriate way to store them.
MarkLogic from the MarkLogic Corporation and OrientDB developed and marketed
by Orient Technologies Ltd. are multimodal DBMSs.
Is That an ACID or a BASE?
You should know whether your DBMS complies with ACID even if all of its components
are not important to you. For example, ACID deals with inserting, updating, and deleting
records in a database. If you have a read-only data warehouse, then you might not care
about ACID compliance. Table 8-2 shows a simple comparison of DBMS types against the
ACID model.
Table 8-2. Database Architecture and ACID Compliance (Illustrative)

Characteristics (rows): Atomicity, Consistency, Isolation, Durability
Architectures (columns, left to right): Hierarchical, Network, Inverted File, Relational, Object, and the NoSQL group: Key-Value, Document, Multimodal, Graph
Note: XXX high, X low
The takeaway from Table 8-2 is that the older database architectures, shown on the
left, are all ACID compliant. These were, and still are, the basic information manager
workhorses that keep the enterprise going. As you move to the right, ACID compliance
drops off. These information managers tend to be special-purpose tools designed to do
one or two things well at the expense of other features, notably ACID.
A word of warning about Table 8-2. The products represented by some of the
columns are so different in their implementation that it is hard to categorize all the
products covered by a column. Look at the key-value column. Some key-value products
are ACID compliant, such as upscaledb; others are not, such as Redis; while still others
are partially compliant, such as Cassandra. Worse, because so many NoSQL products are
relatively new, their ACID compliance could be significantly different by the time you
read this.
Many NoSQL database management systems have given up the ACID guarantees for
exceptional performance in one or another area. However, because so many developers
and DBMS purchasers know about the benefits of ACID, the NoSQL community came
up with its own acronym: BASE. (Apparently, this community likes chemistry puns.)
Yes, those super-fast or super-big DBMSs that fall short on the ACID standard can now
possibly claim that they support the BASE model. What does BASE stand for? Why, basic availability, soft state, and eventual consistency!
•	Basic availability: Data requests are not guaranteed for completeness or consistency.
•	Soft state: The state of the system and its data are unknown, although it will probably be determinable at some future time.
•	Eventual consistency: The system is, or will be, consistent but cannot be guaranteed to be consistent at any specific time.
Want to ensure that your data are valid? BASE systems will eventually figure it out, if
you can wait. For all others, stick with ACID.
And the Winner Is…
It is difficult to identify winners and losers in the database management sweepstakes. For
every serious DBMS, there is some application somewhere for which that DBMS shines
like no other. And if there is no sufficiently important application to keep some old DBMS
or DBMS technology alive today, then it just might show up tomorrow. Just look at NoSQL
as an example. Almost every academic, vendor, and practitioner in the late 1980s thought
the relational model would reign, unopposed, forever. However, new applications using
new data types that did not work well with the relational model were mainstreamed, and
the DBMS landscape changed forever.
So, which of the current crop has the fortitude to last for the next few decades? Well,
it just might be the one thing that everyone can agree on, the one thing that everybody
hates: SQL.
The most hated data management language in the world is the language of choice for
many different vendors, supporting many different products, using many different data
architectures. Its resilience eclipses its ugly duckling persona. It reigns over the relational
world like a diamond in a plastic tiara. In a twist of language worthy of Monty Python, it is
the only commonality among NoSQL products.
SQL just might outlast them all.
What’s to Come
Processes are notoriously more volatile than data, which is one of the reasons for
separating any examination of the two. Many organizations update their business
processes every year or so, but they revise their data definitions far less frequently,
often going a decade without a major change. Likewise, while application code requires
frequent changes to accommodate procedural updates, the database structure can more
likely forgo frequent process-driven revisions. The database design steps introduced in
this book were created to capitalize on this reality.
The following chapters introduce the second phase of the U3D framework, which adds
how the data will be used by end users and applications (defined during logical process
modeling) to the definition of the data (defined during logical data modeling). The result
is a data-definition/usage-driven database design.
References
Charles A. Bachman, 1973 ACM Turing Award Lecture "The Programmer as Navigator," Communications of the ACM. Volume 16, Number 11, November 1973, pp. 653–657.
E. F. Codd, "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM. Volume 13, Issue 6, June 1970, pp. 377–387.
E. F. Codd, The Relational Model for Database Management Version 2. Reading, MA: Addison-Wesley Publishing Company, Inc., 1990.
Donald E. Knuth, The Art of Computer Programming, Volume 3: Searching and Sorting. Reading, MA: Addison-Wesley Publishing Company, 1973.
Notes
1. E. F. Codd, "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM. Volume 13, Issue 6, June 1970, p. 384.
2. E. F. Codd, The Relational Model for Database Management Version 2. Reading, MA: Addison-Wesley Publishing Company, Inc., 1990, p. 3.
3. Peter Chen, "Entity-Relationship Modeling: Historical Events, Future Trends, and Lessons Learned," in Manfred Broy and Ernst Denert (Editors). Software Pioneers: Contributions to Software Engineering. Springer Science & Business Media, 2002, pp. 296–310.
4. Stephen W. Hawking, A Brief History of Time: From the Big Bang to Black Holes. New York: Bantam Books, 1988.
5. George Gamow, Thirty Years that Shook Physics: The Story of Quantum Theory. Doubleday & Co. Inc., 1966.
6. George Gamow, One Two Three… Infinity: Facts and Speculations of Science. Dover Publications, 1974.
7. E. F. Codd, The Relational Model for Database Management Version 2. Reading, MA: Addison-Wesley Publishing Company, Inc., 1990, p. 3, p. 434.
8. Ibid., p. 478.
9. J. Gray, "The Transaction Concept: Virtues and Limitations." Proceedings of the 7th International Conference on Very Large Database Systems (Cannes, France), September 9–11, 1981, ACM, New York, pp. 144–154.
10. Theo Haerder and Andreas Reuter, "Principles of Transaction-Oriented Database Recovery." Computing Surveys. Volume 15, Number 4, December 1983, pp. 287–317.
CHAPTER 9
Introduction to Physical
Schema Definition
There will come a time when you believe everything is finished. That will
be the beginning.
—Louis L’Amour
In theory there’s no difference between theory and practice. In practice
there is.
—Jan L. A. van de Snepscheut (among others)
The challenge for physical database designers is to convert the logical specifications
created during requirements definition into something that is usable by the organization.
This can be a trying task because, unlike the logical data modeler, the physical database
designer must adjudicate competing requests for resources. For example, do you tune
the database to rapidly access online customer information and in the process penalize
batch order processing, or do you favor order processing and, as a result, decrease the
performance of customer service? For some, this dilemma is a no-win proposition—no
matter what you do, you will displease someone. However, if you recognize that, beyond
applying the fundamentals of physical database design, database designers spend most of
their time juggling the trade-offs of aiding one user’s data access at the cost of another’s,
you realize that the methods of measuring resource usage and arriving at the right
balance for the organization are what physical database design is all about.
There are three main sources of input to the physical database design process: (1) the
information requirements uncovered during analysis and documented in the logical data
model, (2) how the applications will use that data, and (3) the rules of the information
manager that will store the data (Figure 9-1). The job of the database designer is to create
a database design that adequately reflects these three separate inputs.
Figure 9-1. The three inputs to the physical database design
This approach is a departure from traditional database design. Some database
designers do not rely on logical data models or process models. Although they may
examine and review these documents before beginning the database design process, these
documents are rarely an integral part of that process. Rather, too many designers jump right
into physical database design by focusing on the tables and linkages dictated by a particular
database management system (DBMS) or requested by those writing design specifications
or computer code. This process is reactive and disappointingly superficial.
Other database designers simply take the logical data model and "physicalize"
it, making it conform to their DBMS, without regard to how the data will be used by
application programs or user queries.
In contrast with these two traditional approaches, this chapter lays out a framework
for database design involving three elements: the logical definition of the data (logical data
model), the business processing requirements (process model), and finally the features and
restrictions of the information manager (DBMS, file manager, etc.). See Figure 9-2.
Figure 9-2. Usage-Driven Database Design
A fitting name for this Usage-Driven Database Design phase is Physical Schema
Definition.
Usage-Driven Database Design: Physical Schema
Definition
Usage-Driven Database Design: Physical Schema Definition (U3D:PSD) involves the
evolution of the logical models into a working physical database design. U3D:PSD
consists of four steps (Table 9-1). The first step, Transformation, converts the logical
data model into a physical data model by substituting physical database objects for
logical data modeling ones. The second step, Utilization, rationalizes the physical data
model by addressing how the data will be used (read, insert, delete, update). The third
step, Formalization, modifies the rationalized physical data model to comply with the
rules/features of the DBMS (or file manager) being used—creating a functional physical
database design. The fourth and last step, Customization, focuses on improving the
performance and enhancing the usability of the database, resulting in an enhanced
physical database design.
Table 9-1. Usage-Driven Database Design: Physical Schema Definition

Step: Transformation
  Purpose: •• Translation •• Expansion
  Primary Deliverable: Physical data model

Step: Utilization
  Purpose: •• Usage analysis •• Path rationalization
  Primary Deliverable: Rationalized (application-specific) physical data model

Step: Formalization
  Purpose: •• Environment designation •• Constraint compliance
  Primary Deliverable: Functional physical database design (schema and subschemas)

Step: Customization
  Purpose: •• Resource analysis •• Performance enhancement
  Primary Deliverable: Enhanced physical database design (schema and subschemas)
The best way to understand how U3D:PSD works is to show how a designer would
create a physical database design. The examples in this chapter are based on an order
management system’s logical data model similar to the one in Figure 9-3.
Figure 9-3. Order processing system’s logical data model
The remainder of this chapter gives a quick synopsis of the overall approach using
all four steps. The examples focus on a database to support an order management system
that takes orders, bills clients, reports on sales, and sends stock replenishment notices to
the manufacturers.
Step 1: Transformation
The first step in physical database design is to transform the logical data model into a
physical data model.
Transformation (Table 9-2) consists of two tasks. The first is to translate the logical
data objects into physical data objects; the second adds physical features to those objects.
The output of the step, the physical data model, is similar to the logical data model except
that while the logical data model’s components are viewed as conceptual constructs, the
physical data model’s objects represent information potentially stored in a computerbased information system. However, it is not a database design yet—the physical data
model is still an abstract representation of a database.
Table 9-2. Step 1: Transformation

Sources:
•• E-R diagram
•• Logical data model object definitions (data dictionary)
•• Business requirements (processes, procedures, and all volumes)

Procedures:
•• Task 1.1: Translation
   •• Activity 1.1.1: Transform logical data model objects to physical data model objects
   •• Activity 1.1.2: Diagram the objects
•• Task 1.2: Expansion
   •• Activity 1.2.1: Assign keys
   •• Activity 1.2.2: Normalize model

Deliverables:
•• Physical data model (diagram)
•• Physical data model object definitions (data dictionary)
•• Transformation notes
The physical data model is not DBMS architecture specific (relational, network,
hierarchical, object, inverted, etc.), product specific (Oracle, IMS, SQL Server, Model
204, Cassandra, etc.), or release specific (Oracle 12, SQL Server 2014, DB2 10.5, etc.). In
truth, a physical data model is a rather skimpy view of stored data. While it does deal with
records and data fields, there is no capability to express such concepts as access methods
or physical storage components. These must wait until later in the phase.
Task 1.1: Translation
The first Transformation task, Translation, is relatively easy. It involves a one-for-one
substitution of a physical database design construct for its corresponding logical data
modeling one. Start with the entities and turn each into a record type. Next move on to
the attributes and turn them into data fields. Lastly, relationships become database links.
A logical data model with 6 entities, 24 attributes, and 3 relationships will, in most cases,
be translated into a physical data model with 6 record types, 24 fields, and 3 links.
Each object must be uniquely named. A record type must have a unique name,
such as Employee, to distinguish it from all other record types. Ideally, the names of the
physical objects are the same as the names of their corresponding logical data modeling
objects.
Figure 9-4 shows the translation of a simple logical data model into a physical data
model. The entities Customer and Account become the record types Customer and
Account (as a convention, record types are represented graphically by a rectangle, and
record type names start with a capital letter), while the relationship Owns becomes
the link Owns. As a convention, links are represented by a line. Linkage names start
with a capital letter. Membership class (cardinality and modality) is similarly handled.
Physical model modality is still represented by a bar or a zero. A cardinality of “one” is still
represented by a bar, but the “many” crow’s foot is replaced with an arrowhead.
Figure 9-4. Transforming the logical data model into the physical data model
TERMINOLOGY
A cornerstone of this book is that to avoid prejudicing future decisions, it is critical
to discriminate between logical concepts and physical constructs. To emphasize
this point, a sharp distinction is made between the names used to identify logical
objects and those used to identify physical objects. Logical objects include entities,
attributes, and relationships, so different words are needed to express their physical
model counterparts.
However, the file manager/DBMS is not identified/confirmed until step 3,
Formalization. Until then, physical object names need to be file manager/DBMS
independent. Using words such as table, segment, tuple, or set is DBMS prejudicial,
so such terms are avoided whenever possible.
To remain file manager independent, this book uses the terms data field, record, and
link to represent the physical equivalent of attribute, entity, and relationship. Other
terms used later in the book follow suit. Only after the file manager is chosen, and
in some cases even after the vendor and version of that manager is selected, will
model-, product-, or version-specific terminology be used.
The attributes CUSTOMER NAME, CUSTOMER NUMBER, and CUSTOMER
ADDRESS become the data fields CUSTOMER NAME, CUSTOMER NUMBER, and
CUSTOMER ADDRESS. (Note: As a convention, field names are all uppercase.)
Task 1.2: Expansion
Translation is followed by Expansion, in which the structure of each record type is
examined in further detail. The first order of business is to assign keys.
Each record type should have a unique identifier, often called a primary key, that
unambiguously identifies any occurrence of that record type. The key is usually one
field such as EMPLOYEE NUMBER but can be a concatenated key (multiple data fields)
if necessary, as in SITE ID, BUILDING NUMBER. The logical data modeler might have
uncovered a unique identifier that is used by the business for a particular entity. If at
all possible, this business-designated identifier should be used. The convention is to
underline the unique identifiers in diagrams, as in Figure 9-4.
Databases can experience synchronization anomalies when inserting new data,
updating existing data, or deleting old data. For example, if, after deleting an employee’s
time cards, accurate information about the effort applied to a particular project is no
longer present, then the database suffers from a deletion anomaly. The way to reduce, if
not eliminate, this problem is through a technique called normalization. Normalization
is a process of reducing the structure of the model to a state such that data in any
given record occurrence is totally dependent on the key of that record occurrence.
Normalization is examined in more detail in Chapter 10.
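A minimal sketch of the time-card example, written here in generic SQL purely for illustration (the record and field names are hypothetical): in the first version the project facts depend on PROJECT_ID rather than on the time card's own key, so deleting the time cards deletes the project information with them; splitting the project facts into their own record type removes the deletion anomaly.

-- Before: project facts ride along on every time card
CREATE TABLE TIME_CARD_UNNORMALIZED (
    EMPLOYEE_NUMBER CHAR(8)      NOT NULL,
    WORK_DATE       DATE         NOT NULL,
    PROJECT_ID      CHAR(6)      NOT NULL,
    PROJECT_NAME    CHAR(30),                 -- depends on PROJECT_ID, not on the time-card key
    HOURS_WORKED    DECIMAL(4,1),
    PRIMARY KEY (EMPLOYEE_NUMBER, WORK_DATE, PROJECT_ID)
);

-- After: every non-key field depends on the key of its own record type
CREATE TABLE PROJECT (
    PROJECT_ID   CHAR(6)  NOT NULL PRIMARY KEY,
    PROJECT_NAME CHAR(30)
);

CREATE TABLE TIME_CARD (
    EMPLOYEE_NUMBER CHAR(8)      NOT NULL,
    WORK_DATE       DATE         NOT NULL,
    PROJECT_ID      CHAR(6)      NOT NULL REFERENCES PROJECT,
    HOURS_WORKED    DECIMAL(4,1),
    PRIMARY KEY (EMPLOYEE_NUMBER, WORK_DATE, PROJECT_ID)
);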
Step 2: Utilization
The second step, Utilization (Table 9-3), adds information to the physical data model
about how the database will be used.
Table 9-3. Step 2: Utilization

Sources:
•• Physical data model (diagram)
•• Physical data model object definitions (data dictionary)
•• Business requirements (processes, procedures, and all volumes)
•• Transformation notes

Procedures:
•• Task 2.1: Usage Analysis
   •• Activity 2.1.1: Create usage scenarios
   •• Activity 2.1.2: Map scenarios to the physical data model
•• Task 2.2: Path Rationalization
   •• Activity 2.2.1: Reduce to simplest paths
   •• Activity 2.2.2: Simplify (rationalize) model

Deliverables:
•• Rationalized physical data model (diagram)
•• Updated physical data model definitions (data dictionary)
•• Usage scenarios
•• Usage maps
•• Combined usage map
•• Utilization notes
Utilization maps the process models for the applications that will use the database to
the physical data model.
Task 2.1: Usage Analysis
The first Utilization task, Usage Analysis, creates Usage scenarios from the application
process models. Process models (logical or physical) can be long, involved descriptions
of many actions and functions that have nothing to do with a database. Imagine a process
model describing the algorithms to calculate taxes or missile trajectories. They could
go on for pages without a single database activity. About 80 to 90 percent of a process
model is information extraneous to the database design process. To reduce confusion,
not to mention paperwork, the designer creates usage scenarios, which are shorter and
sweeter, textual or graphic, depictions of how an application will use the data. Process
models come in many forms. They could be process model fragments, such as data flow
diagrams, logical transactions, use case scenarios, or any other means for documenting
how the system will work.
A simple usage scenario might look something like the following:
Usage Scenario: Create an Order
1. Enter the database at the Account record occurrence for Account X.
2. Insert an Order record occurrence for Account X.
3. Read the Product occurrence for Product Y.
4. Insert record occurrence Line Item linked to the Product and Order occurrences.
A Usage map is a graphic representation of a usage scenario. It is created by drawing,
or mapping, the scenario onto the physical data model.
Because there are always multiple scenarios for an application, the simplest way
to create a comprehensive usage map is to first make a number of photocopies of the
physical data model. Then draw one scenario on one photocopy showing how, if the
physical data model was the final database design, the application would access the
database. Use arrows to show database entry and navigation, and use the initials E, R, I, U,
and D for the database actions of Entry, Read, Insert, Update, and Delete.
HEY! THIS LOOKS LIKE NAVIGATION—WE ARE A
RELATIONAL SHOP
Calm down. The “navigation” you see is just conceptual—to help you understand
how the data will be used given the logical data model. It does not contradict the
relational model, preclude a relational DBMS, or eliminate joins between tables. In
fact, it shows exactly what joins are needed and how they will work.
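To make that point concrete: if the eventual target does turn out to be relational, the Create an Order scenario above and the second (read) scenario surface as DML roughly like the following. This is a sketch in generic SQL; the tables, columns, and key values are illustrative, and the Order record type is written as ORDERS here only because ORDER is a reserved word in SQL.

-- Scenario 1: Create an Order
SELECT ACCOUNT_NUMBER                                             -- 1. enter at the Account occurrence for Account X
  FROM ACCOUNT
 WHERE ACCOUNT_NUMBER = 'X';

INSERT INTO ORDERS (ORDER_NUMBER, ACCOUNT_NUMBER)                 -- 2. insert an Order occurrence for Account X
VALUES ('O-501', 'X');

SELECT PRODUCT_NUMBER, LIST_PRICE                                 -- 3. read the Product occurrence for Product Y
  FROM PRODUCT
 WHERE PRODUCT_NUMBER = 'Y';

INSERT INTO LINE_ITEM (ORDER_NUMBER, PRODUCT_NUMBER, QUANTITY)    -- 4. insert the linked Line Item occurrence
VALUES ('O-501', 'Y', 3);

-- Scenario 2: for each Product, read its Line Items and their Orders (the joins the map implies)
SELECT P.PRODUCT_NUMBER, O.ORDER_NUMBER
  FROM PRODUCT P
  JOIN LINE_ITEM L ON L.PRODUCT_NUMBER = P.PRODUCT_NUMBER
  JOIN ORDERS O    ON O.ORDER_NUMBER   = L.ORDER_NUMBER;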
When you have completed one map for each scenario, you can then combine
them into a Combined usage map (Figure 9-5) by taking the individual usage maps and
combining them onto a single photocopy of the physical data model.
Figure 9-5. Mapping usage to the physical data model
Figure 9-5 shows a combined usage map for two usage scenarios. The first scenario
involves searching the database for the appropriate Account and Product occurrences
and then creating an Order occurrence and a Line Item occurrence. The second scenario
involves entering the database at Product and then for each Product reading its Line Item
and then its Order record.
The arrow with a dashed line indicates the route of access, and the number—for
example, 2.3R—tells us that scenario number 2, step 3, is a Read.
Task 2.2: Path Rationalization
In practice, the finished diagram can look a bit like a bird’s nest of lines and arrows. At first,
this can seem a daunting task, but with a little effort and some time spent examining it
(and maybe finding a bigger piece of paper), a few trends should start to emerge, the most
important of which is that not all the paths on the diagram are needed. Path Rationalization
is the task of reducing the complexity of the model to only what is needed to perform its
assigned functions. Take the two usage scenarios in Figure 9-6. The first reads the Customer
record and then moves to the Order record and finally to the Address record. The second
also starts at Customer and then moves to Address followed by Order. These scenarios are
redundant because they access the same data, albeit in a different order. The two could be
combined into one.
Figure 9-6. Redundant paths and links should be eliminated
Note that if you combine the two scenarios, it becomes obvious that the link between
Customer and Address and the link between Account and Address are not both needed.
It would be possible to eliminate the redundant link.
The Step 2 deliverable is a Rationalized physical data model showing the application-relevant record types and linkages. Deliverables also include the Usage scenarios and
Usage maps. Both of these will be useful when creating database views and subschemas.
Step 3: Formalization
In the third step, Formalization, the rationalized physical data model is made to
conform—first to the underlying file manager or DBMS architecture that will be used
to store the data (i.e., hierarchical, network, relational, object, etc.), and second to the
particular implementation of that model (product/version) such as Oracle 12 or
SQL Server 2014. (See Table 9-4.)
Table 9-4. Step 3: Formalization

Sources:
•• Rationalized physical data model (diagram)
•• Updated physical data model definitions (data dictionary)
•• Usage scenarios
•• Usage maps
•• Combined usage map
•• Transformation notes
•• Utilization notes
•• DBMS features and constraints

Procedures:
•• Task 3.1: Environment Designation: Identify/confirm the target information manager (architecture, product, version)
•• Task 3.2: Constraint Compliance
   •• Activity 3.2.1: Map rationalized physical data model to the data architecture
   •• Activity 3.2.2: Create a DBMS product/version-specific functional physical database design

Deliverables:
•• Functional physical database design (diagram)
•• Functional Data Definition Language (schema and subschema)
•• Updated physical data model definitions (data dictionary)
This is the first time the file manager or DBMS is introduced into the U3D framework.
Task 3.1: Environment Designation
Before focusing on a particular product, the database designer needs to make the
Rationalized physical data model reflect the architecture type of the DBMS that will be
used. DBMSs can be grouped into a few basic architecture categories—the products
in these categories share a large number of features. For example, relational database
products use foreign keys to link different tables together, hierarchical systems use
pointers, and some inverted file products rely on external multikey multirecord type
indices. Which features to build into the database design (foreign keys, pointers, or
indices) depends on the type of DBMS being used (relational, hierarchical, inverted,
object, NoSQL, multidimensional, etc.).
Assume the application will use a relational DBMS. This assumption dictates the
physical design because there are a number of potential database features the relational
model does not support. Figure 9-7 shows a Rationalized physical data model and the
Functional physical database design as it might look if a relational DBMS were used.
Figure 9-7. A DBMS model-specific functional physical database design fragment
The dotted arrows and numbered circles show how the Rationalized physical data
model was transformed into a Functional physical database design.
The following are architecture-specific changes to the model in Figure 9-7:
1. The Rationalized physical data model's record type Product becomes the Functional physical database design's Product table.
2. Relational systems do not directly support recursive relationships. However, a many-to-many recursive relationship can be simulated with a bill-of-material structure consisting of a new table storing two foreign keys for the many-to-many links to the Product table (see the sketch after this list).
3. Relational systems do not support many-to-many relationships. To simulate M:N relationships, a new table, often called a junction table, is inserted between the Product and Manufacturer tables.
4. The Manufacturer record type becomes the Manufacturer table.
5. The cardinality and modality of the record types become a simple relational one-to-many link (the only kind relational systems support).
6. The attributive record type Manufacturer History becomes the simple Manufacturer History table. Relational systems do not directly support attributive record types.
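A sketch of the bill-of-material structure mentioned in item 2, in generic SQL with an illustrative table name: the new table carries two foreign keys, both pointing back to the Product table, one for each side of the many-to-many recursive link.

CREATE TABLE PRODUCT_STRUCTURE (
    ASSEMBLY_PRODUCT_NUMBER  CHAR(8) NOT NULL,   -- the product that contains...
    COMPONENT_PRODUCT_NUMBER CHAR(8) NOT NULL,   -- ...the product it is built from
    PRIMARY KEY (ASSEMBLY_PRODUCT_NUMBER, COMPONENT_PRODUCT_NUMBER),
    FOREIGN KEY (ASSEMBLY_PRODUCT_NUMBER)  REFERENCES PRODUCT,
    FOREIGN KEY (COMPONENT_PRODUCT_NUMBER) REFERENCES PRODUCT
);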
DBMS-specific language can now be used to describe data objects. For a relational
system, data fields become attributes or columns, record occurrences become rows or
tuples, and record types become tables.
Task 3.2: Constraint Compliance
Constraint Compliance is the task in which the Functional physical database design
becomes product and version specific. For the first time, the database designer can apply
the rules of the particular vendor’s offering. This is also the first time the designer has a
usable database design.
The SQL Data Definition Language (DDL) is needed to generate a relational database
schema. However, before coding for SQL Server 16 or Oracle 12, the designer needs to
remember Principle 4 of the database design principles (presented in Chapter 1), the Minimal
Regression Principle—design a database so that business and technology changes minimize
database redesign. To minimize unnecessary future changes, the first draft of the DDL needs
to be DBMS version agnostic—relational to be sure—but using a generic SQL that does not
tie the design to a specific vendor or product. Table 9-5 shows a fragment of the SQL code
for an order management system using a generic form of SQL loosely based on the ISO/IEC
standard. However, the designer could just as easily have used this step to create a Functional
physical database design for a hierarchical, network, or object-oriented database system.
Table 9-5. Functional Design DDL Using Generic SQL

Order Management System: Data Definition Language (DDL) Code Fragment Using Generic SQL

CREATE TABLE PRODUCT (
    PRODUCT_NAME CHAR(30) NOT NULL,
    PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY UNIQUE,
    -- primary key assumes unique but both make the message plain
    -- even if not a primary key, keep this field unique
    PRODUCT_DESCRIPTION VARCHAR(512),
    COST_BASIS DECIMAL(8,2) NOT NULL,
    LIST_PRICE DECIMAL(8,2) NOT NULL
);
CREATE INDEX PROD_NO_IDX ON PRODUCT (PRODUCT_NUMBER);

CREATE TABLE MANUFACTURER (
    MFG_NAME CHAR(30) NOT NULL,
    MFG_ID CHAR(6) NOT NULL PRIMARY KEY UNIQUE,
    MFG_CATEGORY INTEGER DEFAULT 1 CHECK (MFG_CATEGORY IN (1, 2, 3)),
    MFG_NOTES VARCHAR(512),
    ORDER_INSTRUCTIONS VARCHAR(512)
);
CREATE INDEX MFG_ID_IDX ON MANUFACTURER (MFG_ID);
CREATE TABLE PROD_MFG_JCT (
    PRODUCT_NUMBER CHAR(8),
    MFG_ID CHAR(6),
    PRIMARY KEY (PRODUCT_NUMBER, MFG_ID),
    FOREIGN KEY (PRODUCT_NUMBER) REFERENCES PRODUCT ON UPDATE CASCADE ON DELETE CASCADE,
    FOREIGN KEY (MFG_ID) REFERENCES MANUFACTURER ON UPDATE CASCADE ON DELETE CASCADE
);
Keeping the DDL generic improves the communication value of the code and allows
the reader to focus on the structure of the database and not release idiosyncrasies. It also
gives designers and database administrators (DBAs) a source document to use when
updating the database schema to a new release or entirely new product. For example,
specifying that PRODUCT_NUMBER is both the primary key and unique is redundant in
virtually all implementations of a relational DBMS, yet it does have communicative value.
It tells those involved with the next phase of the design process that if, for some reason, a
different primary key is chosen, then this field must be kept unique.
THE LANGUAGE OF DATABASE MANAGEMENT SYSTEMS
Occasionally, the database community gets it right. Too often, academics and
vendors come up with their own proprietary words to describe commonsense
objects. But not this time. It has become almost universal to describe database
functionality using two sublanguages. The first is called the Data Definition
Language (DDL), which includes the syntax and rules used to create database
schemas and subschemas. The second is the Data Manipulation Language (DML)
used in applications to process database data, such as reading, adding, or deleting
data items. These concepts predate relational systems and originate with the
Conference/Committee on Data Systems Languages (CODASYL) or network model.
Luckily, DBMS vendors have chosen to use what has worked so well in the past
rather than continuously inventing new and often confusing terminology.
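The split is easy to see in two lines of generic SQL (names carried over from the earlier examples): the first statement is DDL because it defines a schema object, while the second is DML because it manipulates the data stored in that object.

CREATE TABLE MANUFACTURER (MFG_ID CHAR(6) NOT NULL PRIMARY KEY, MFG_NAME CHAR(30));  -- DDL: defines the schema
SELECT MFG_NAME FROM MANUFACTURER WHERE MFG_ID = 'M-0001';                           -- DML: processes the data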
There are two types of constraint compliance: structural and syntactical. Structural
compliance oversees the addition, deletion, or modification of the database’s
architectural components (record type, links, etc.) to create and maintain a valid schema.
Syntactical compliance oversees the grammar or communication value of the DDL to
ensure that the system understands what is wanted. For example, if you cannot create
a single table for Customer because there are too many fields in it, then the necessary
change is structural. However, if the problem is that your DDL compiler will not accept
table names longer than eight characters and yours is 15, that change is syntactical.
The order management system will require syntactical changes to run with the
selected DBMS—for this example Oracle is assumed. The syntactical changes are needed
to accommodate Oracle syntax rules and reserved words. For example, Oracle uses the
NUMBER data type and not the SQL standard DECIMAL type used in the example. In
addition, the chosen DBMS supports some referential integrity update constraints using
triggers or application code, not DDL declarations. Table 9-6 illustrates the changes
needed to make the schema code Oracle compliant.
Table 9-6. Functional Design DDL Generic SQL Converted to Oracle

Generic SQL:
CREATE TABLE PRODUCT (
    PRODUCT_NAME CHAR(30) NOT NULL,
    PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY UNIQUE,
    -- primary key assumes unique but both make the message plain
    -- even if not a primary key, keep this field unique
    PRODUCT_DESCRIPTION VARCHAR(512),
    PRODUCT_HISTORY VARCHAR(512),
    COST_BASIS DECIMAL(8,2) NOT NULL,
    LIST_PRICE DECIMAL(8,2) NOT NULL
);
CREATE INDEX PROD_NO_IDX ON PRODUCT (PRODUCT_NUMBER);

Changes Needed for Oracle:
    PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY,
    /* can't use UNIQUE in PK statement */
    PRODUCT_DESCRIPTION VARCHAR2(512),
    PRODUCT_HISTORY VARCHAR2(512),
    /* LONG was the standard but was dropped. VARCHAR being dropped in favor of VARCHAR2 */
    COST_BASIS NUMBER(8,2) NOT NULL,
    /* substitute NUMBER for DECIMAL */
    LIST_PRICE NUMBER(8,2) NOT NULL
    /* substitute NUMBER for DECIMAL */
    /* Oracle automatically creates an index on PRIMARY KEY columns */
Modifications are needed to support not only a particular vendor’s product but
the particular version of that product as well. For example, earlier versions of Oracle
did not support more than one column per table that was longer than 255 characters
in length, and it required use of the LONG data type. To support both the PRODUCT_
DESCRIPTION and PRODUCT_HISTORY fields, the designer would need to shorten one
of them to 255 characters or place it in a separate table.
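The second option, moving the long field to a table of its own, might be sketched as follows; the table name and the one-row-per-product assumption are illustrative, and whether the workaround is needed at all depends on the Oracle release in use.

CREATE TABLE PRODUCT_HISTORY (
    PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY REFERENCES PRODUCT,  -- one history row per product
    HISTORY_TEXT   VARCHAR2(512)                                     -- the long column moved out of PRODUCT
);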
The result of this step is a working database. How well it performs depends on a
number of factors, including the data it stores and the design of the database.
Step 4: Customization
The database design created in step 3, Formalization, should be able to support all the
tasks it is assigned. How efficiently it completes those tasks was not a concern until now.
During Customization, the performance of the database is examined and any needed
changes identified.
This sequence (Formalization then Customization) was created for two reasons.
First, it is important to confirm the design is right before making it fast. Speeding up a
database that does the wrong things is useless. Second, performance enhancement is one
of the most common post-implementation activities. There are always usage surprises
after the database is in operation as well as vendor enhancements and improvements
to implement. The majority of these fall into the performance category. The goal is to
limit functional “oops” when implementing performance improvements. Keeping the
functionalization of the database (step 3, Formalization) separate from enhancement of
its efficiency (step 4, Customization) is the best way to do that.
If the database is small, it’s not very complex, usage is low, or performance is not a
major issue, then pack up your toolkit because your work is done. The design created in
Formalization should be sufficient. However, if more is needed, then Customization is
where it happens. In step 4 (Table 9-7), the designer can apply all the tricks of the trade
from a toolkit (hardware and software) provided by the DBMS vendor, third parties, or the
in-house database management team.
Table 9-7. Step 4: Customization

Sources:
•• Functional physical database design (diagram)
•• Functional Data Definition Language (schema and subschema)
•• Updated physical data model definitions (data dictionary)
•• Usage scenarios
•• Usage maps
•• Combined usage map
•• Transformation notes
•• Utilization notes
•• Formalization notes
•• DBMS features and constraints

Procedures:
•• Task 4.1: Resource Analysis
•• Task 4.2: Performance Enhancement
   •• Activity 4.2.1: Customize hardware
   •• Activity 4.2.2: Customize software

Deliverables:
•• Enhanced physical database design (diagram)
•• Enhanced Data Definition Language (schema and subschema)
•• Updated physical data model definitions (data dictionary)
•• Customization notes
Task 4.1: Resource Analysis
Before you can fix it, you have to know what is wrong. Resource Analysis examines the
database to understand the demands placed on it and the impediments to meeting that
demand.
The DBMS has an oddity built into it. Language compilers are willing to tell you
when you made a mistake—they crash. Operating systems are intolerant of programs
they do not like—they stall. DBMSs, on the other hand, often work even when major
(particularly performance-related) mistakes have been made. The database dog that
takes hours and hours to perform a particular update completes its job in minutes after
an index is added or changed. Conversely, a decently performing database can grind to
a halt if an ill-conceived index is added. The trick is knowing what and where to make
improvements.
To illustrate this point, consider a simple database design of three record types
(Figure 9-8) consisting of 200 Product occurrences and 1,000 Order occurrences, each
linked to an average of 10 Line Items occurrences per Order. Also, assume that the DBMS
allows two methods of improving performance: (1) indices placed on certain fields and
(2) clustering of multiple occurrences of linked, but different, record types on the same
physical database page. In this case, that would mean a Line Item occurrence could be
stored either on the same physical page as its related Order occurrence or on the same
physical page as its related Product occurrence, but not both. The questions to be answered
are: (1) which fields should be indexed and (2) next to which record occurrence, Order or
Product, should the related Line Items be stored?
Figure 9-8. Physical database design trade-offs
Look at Scenario 1. The first task is to find a particular Order occurrence. Because
there are 1,000 Order occurrences, it will take, on average, 500 logical inputs/outputs
(I/Os) to find the right record occurrence. If you assume there are 10 Order occurrences
on a physical database page, then finding the right Order will require, on average,
50 physical I/Os. However, if you create an index on the ORDER_NUMBER field, the
average number of physical I/Os can be reduced to about four.
The second method of improving performance is clustering. Without design
intervention, fetching one Order occurrence and its related 10 Line Items will require
11 physical I/Os—one for the Order occurrence and 10 for the 10 related Line Item
occurrences (index I/O is ignored until Chapter 13, which covers step 4, Customization).
Fetching each Product occurrence and its associated Line Items will require an average
of 51 physical I/Os. (All calculations assume that the DBMS did not put more than one
occurrence on an individual page.)
However, if each Line Item occurrence is stored on the same page as its related Order
occurrence, then only one physical I/O is required. Storing Line Item with its associated
Product reduces the 51 physical I/Os to only one physical I/O (assuming that all the Line
Items could fit on one database page). Physical I/Os can be reduced more than 90 percent
for the Access Order Details scenario.
The performance of Usage Scenario 2, Access Product Orders, could be improved by
more than 90 percent by clustering Line Items around Product (rather than Order). But
remember, you cannot have both. Which of the two storage options should you choose?
Adjudicating this trade-off is the crux of Customization.
Of course, this example analyzed only two simple scenarios concerning three record
types. A more realistic example would involve modeling dozens of scenarios, many
requiring data from a half-dozen record types or more, against a much larger design. But
the idea is the same.
Task 4.2: Performance Enhancement
In Task 4.2, the database design code is modified to reflect the performance
improvements identified in Task 4.1. The simplest way to enhance performance with
most DBMS products is to add indices to important fields. Which fields you index is
driven by two criteria: fields you want to search the database on and fields the DBMS uses
to access other record occurrences.
With relational systems, to add indices, you simply add a statement to the DDL as
follows:
CREATE UNIQUE INDEX PRODUCT_NUMBER_IDX ON PRODUCT (PRODUCT_NUMBER);
CREATE UNIQUE INDEX ORDER_NUMBER_IDX ON ORDER (ORDER_NUMBER);
Language is also needed to cluster the related Order and Line Item occurrences
together.
CREATE CLUSTER LINE_ITEM_CLUSTER (ORDER_NUMBER CHAR(8));

CREATE TABLE ORDER (
    •
    •
    ) CLUSTER LINE_ITEM_CLUSTER (ORDER_NUMBER);

CREATE TABLE LINE_ITEM (
    •
    •
    ) CLUSTER LINE_ITEM_CLUSTER (ORDER_NUMBER);
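For readers who want to see the pieces assembled, here is a minimal, illustrative sketch in Oracle-style SQL. The column lists, data types, and index name are assumptions made for the example (the fragment above elides them), the table is spelled ORDERS because ORDER is a reserved word in most SQL dialects, the cluster is created before the tables that reference it, and a cluster index is included because Oracle-style clusters typically require one before rows can be stored.

-- Define the cluster and its cluster key, then an index on the cluster.
CREATE CLUSTER LINE_ITEM_CLUSTER (ORDER_NUMBER CHAR(8));
CREATE INDEX LINE_ITEM_CLUSTER_IDX ON CLUSTER LINE_ITEM_CLUSTER;

-- Both tables are stored in the cluster, keyed on ORDER_NUMBER, so an Order
-- and its related Line Items land on the same physical database page(s).
CREATE TABLE ORDERS (
    ORDER_NUMBER  CHAR(8) PRIMARY KEY,
    ORDER_DATE    DATE
) CLUSTER LINE_ITEM_CLUSTER (ORDER_NUMBER);

CREATE TABLE LINE_ITEM (
    ORDER_NUMBER      CHAR(8) REFERENCES ORDERS (ORDER_NUMBER),
    LINE_ITEM_NUMBER  SMALLINT,
    PRODUCT_NUMBER    CHAR(8),
    QUANTITY          SMALLINT,
    PRIMARY KEY (ORDER_NUMBER, LINE_ITEM_NUMBER)
) CLUSTER LINE_ITEM_CLUSTER (ORDER_NUMBER);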
The designer can indicate clustering on the database diagram by placing the
clustering record type name at the bottom of the record type box.
Figure 9-9 puts the small database fragment together, including the tables to
accommodate Oracle’s constraints and the clustering information, at the bottom of each
record type box.
Figure 9-9. Order management system physical database design
Summary
Many a well-designed system is brought to its knees during maintenance. The reasons
are many, but poor documentation is particularly problematic. Correctly and efficiently
modifying an application is difficult if the maintenance staff does not have an accurate
picture of what the application does and how, exactly, it does it.
Equally important is not traveling over ground that had been trod before. It does not
make sense to redesign a car simply because it needs new tires. Likewise, adding an index
to a table or moving data from one file to another should not require going back to the
business users to, once again, understand how those data are used.
When complete, Usage-Driven Database Design: Physical Schema Definition
transforms a logical data model, based on the definition of the data from the enterprise, into
a database design tuned to how the organization will use the application (Figure 9-10).
Figure 9-10. Logical data model to physical database design
The information and examples presented in this chapter are an oversimplification
of how U3D works, although they do present a realistic overview of the basic components.
The following chapters look at each of the four U3D:PSD steps in greater detail,
expanding on the points presented here.
CHAPTER 10
Transformation: Creating
the Physical Data Model
The time is a critical one, for it marks the beginning of the second half…
when a transformation occurs.
—C.G. Jung
Methods have to change. Focus has to change. Values have to change.
The sum total of those changes is transformation.
—Andy Grove (founder and CEO of Intel Corporation)
One of the more emotionally satisfying, although less technically significant, components
of the database design process is Transformation (see Table 10-1). During Transformation,
the designer takes the first real, although small, step toward creating a physical database
design. Its significance as a psychological milestone is that, from here on, the language of
requirements analysis and logical data modeling is left behind in favor of the terminology
of physical design, database management, and storage devices.
Table 10-1. Step 1: Transformation

Source
• LDM.1: E-R diagram
• LDM.2: Logical data model object definitions (data dictionary)
• LDM.3: Logical data modeling notes
• PM business requirements (processes, procedures, and all volumes)

Procedures
• Task 1.1: Translation
   • Activity 1.1.1: Transform LDM objects to PDM objects (entity to record type, attribute to data item, relationship to link, etc.)
   • Activity 1.1.2: Diagram the objects
• Task 1.2: Expansion
   • Activity 1.2.1: Assign keys
   • Activity 1.2.2: Normalize model

Deliverables
• 1.1: Physical data model (diagram)
• 1.2: Physical data model object definitions (data dictionary)
• 1.3: Transformation notes
Step 1, Transformation, is where the objects identified during logical data modeling
are transformed into physical database design objects. The result is the physical data
model.
Task 1.1: Translation
Translation consists of two activities. The first activity, Transform logical data modeling
objects to physical design modeling objects, involves a one-for-one substitution. For
example, the entity becomes a record type. The second activity, Diagram the objects,
creates the physical data model diagram.
Activity 1.1.1: Transform LDM Objects to PDM Objects
The first Transformation activity is a rather simple and mechanical step of substituting
physical design objects for logical design objects. The logical data modeling entity
becomes the physical data modeling record, the logical data modeling relationship
becomes the physical data modeling linkage or link, and the logical data
modeling attribute becomes the physical data modeling data item. Simple. However,
some situations can prove challenging.
Entities to Record Types
For decades, the basic unit of stored data has been the record. As with logical data
modeling objects, the type/occurrence distinction is still useful and should be employed
when discussing physical database design objects. Therefore, record type is the name
given to the collection of all related record occurrences, such as Employee or Customer.
A record occurrence or instance is the data that are stored as a discrete contiguous piece
of information in an information system, such as Smith’s Employee record or Thompson’s
Customer record.
In logical data modeling, there are four types of entities: proper, associative,
attributive, and S-type. At its simplest, proper entities become proper record types,
associative entities become associative record types, attributive entities become
attributive record types, and S-type entities become S-type record types. The definitions of
the four record types are almost the same as for their entity cousins (Table 10-2).
Table 10-2. Record Type Definitions

Object: Proper
   Entity type: A simple or fundamental entity type
   Record type: A simple or fundamental record type
Object: Associative
   Entity type: A relationship that has its own relationships or attributes
   Record type: A link that has its own relationships or data items
Object: Attributive
   Entity type: An entity whose existence depends on another entity
   Record type: A record type whose existence depends on another record type
Object: S-type
   Entity type: An entity (the supertype) that contains more than one role (the subtypes)
   Record type: A record type (the supertype) that contains more than one role (the subtypes)
Diagramming conventions are the same for record types as for logical entities
(Figure 10-1). The proper record type is represented by a rectangle. Associative record
types are represented by a rectangle with a diamond in them. Attributive record types are
rectangles drawn using double lines, and S-types are drawn either with the is a construct
or with the box-within-a-box graphic.
Figure 10-1. Record types
The naming convention for record types is the same as for entity types. A question
often asked is, “Why are we keeping logical data modeling names when I know that the
DBMS only allows a maximum of 18 characters, with no blank spaces and all uppercase?”
This is a fair question because DBMS restrictions can severely limit logical data
modeling conventions. For example, many information systems do not allow unlimited
name length, others restrict case use to uppercase or lowercase, and most do not allow
spaces in a name, instead requiring underscores or other special characters. The
answer to the question is that it is important to maintain a generic non-product-specific
approach as long as possible. This allows the designer to hold off restricting what we
want by what we can have. It is important to document what you would like to have, even
if your current DBMS does not support your desires, because your current DBMS might
not be your future DBMS. Without proper documentation, if you change DBMS products
or if a new version of your current DBMS includes new functionality, there will be no
evidentiary basis to support exploiting the new features.
WORD CAVIAR FORMALIZED FUNCTIONAL OSSIFICATION
AND HOW TO COMBAT IT
Creating first-generation applications (automating manual processes) was easy. The
designer figured out what the users wanted and coded away. Second-generation
applications (creating new systems to replace first-generation systems) ran into a
totally unanticipated problem: formalized functional ossification (a term made up for
this sidebar).
The technology, as well as system development know-how, was limited when the
first-generation systems were built. Users wanted a number of features that IT could
not deliver. The solution was a series of workarounds and alternatives. It was not
what the user wanted, but nonetheless, it got the job done.
The technology was more advanced, and systems staff were better versed in
application development by the time the organization was ready for a second-generation application. Many of those workarounds could now be removed, and the
original user requirements built into the new system. But then a miracle happened.
The users no longer wanted that new-fangled stuff; they wanted the system to
do it “the way it always did it” (i.e., how it did it after the first-generation system
was implemented without their desired features). The old klugey workarounds had
become fossilized—a formalized functional ossification (FFO?). The challenge for
the second-gen team was to figure out what the new application should include and
exclude, all without the constraints of FFO.
Formalized functional ossification (or fossilization if you prefer) can happen in IT as
well. The then-current version of the IT shop's DBMS, operating system, or project-management application could not do what was wanted, so workarounds were
constructed. Now, a few years later, the vendor includes exactly what IT originally
wanted. Do IT staff jump up and down with joy at the new features? Maybe not. The
klugey workaround—that software dongle so to speak—is now encapsulated so
deeply in the organization that no one knows what was originally wanted. Unless…
…unless they have the documentation detailing exactly what was wanted years
ago and why. U3D is a framework to keep that functional memory alive and ready to
implement once the moment is right.
This is an important point. Experienced database designers might be perplexed
about why they should perpetuate something that does not exist in most database
management systems, such as associative, attributive, and S-type concepts. Although
popular in logical data modeling, they rarely exist as vendor-specified constructs in most
DBMS products.
There are two answers to this question. First, it is important to understand what the
logical data modeler is trying to tell the database designer, such as there is a difference
between Address as a proper record type and Address as an attributive record type. This
becomes more obvious in later U3D steps when the physical designer can treat the two
differently, even if the DBMS doesn't. For example, an attributive record type might tell
the database designer to use a cascading delete between the Customer and Address
record types, a DBMS constraint that might not be employed if Address is seen as a proper
record type.
Second, the designer should not want to give up information (here the distinction
between the different kinds of record types) until absolutely necessary, where “absolutely
necessary” is determined either by the DBMS selected or by the designer finding some
other way to represent the construct. Therefore, the designer should hold onto this and
other important information as long as possible.
Relationships to Linkages
The logical data modeling relationship becomes the physical database design link or
linkage. A link is a way of associating two or more records together for the purposes of
retrieval or maintenance. As in logical data modeling, physical database design links have
membership class, degree, and constraint characteristics. Linkage naming conventions
should be the same as for logical relationship names although, unfortunately, many
designers do not name links at all.
Linkage Membership Class
As in logical data modeling, there are two types of membership class: cardinality and
modality. Cardinality indicates the maximum number of occurrences of one record type
that can be linked to another record type occurrence. The three types of cardinality are
one-to-one, one-to-many, and many-to-many. Modality indicates the minimum number
of record occurrences that must be linked to another record occurrence. Modality is
either mandatory or optional.
Diagramming conventions are similar but slightly different for relationships and
linkages (Table 10-3). As with logical data modeling, modality is still represented by a bar
or a zero; however, the logical data modeling crow’s foot gives way to the physical data
modeling arrowhead.
Table 10-3. Membership Class (diagramming symbols)

Cardinality: one-to-one, one-to-many, and many-to-many, shown in both the logical data model and physical data model notations
Modality: mandatory and optional, shown in both the logical data model and physical data model notations
Cardinality is, so far, the only diagrammatic difference between logical and physical
data modeling.
Linkage Degree
Degree relates to the number of different record types allowed to be linked to each other.
There are three types of degree: unary, binary, and n-ary (Figure 10-2).
Figure 10-2. Linkage degree
Unary or recursive links are one or more occurrences of a record type related to one
or more other occurrences of the same record type. An example of a unary linkage type
is “Reports to,” which links one or more occurrences of Employee to one or more other
occurrences of Employee.
Binary links are one or more occurrences of record type A linked to one or more
occurrences of record type B. This is the garden-variety link that associates occurrences
from two distinct record types. For example, Employee is related to Department.
N-ary links are one or more occurrences of record type A linked to one or more
occurrences of two or more other record types. An example would be Car, Dealer, and
Customer sharing a linkage.
Note that, so far, the rules are the same for both logical and physical objects.
Linkage Constraints
There are three types of linkage constraints: inclusion, exclusion, and conjunction.
Inclusion states that an occurrence of record type A can be linked to one or more
occurrences of record type B and/or one or more occurrences of record type C. For
example, an occurrence of record type Student can be linked to an occurrence of record
type Class and/or an occurrence of record type Major.
Exclusion states that an occurrence of record type A is linked either to one or more
occurrences of record type B or to one or more occurrences of record type C, but not
both at the same time. For example, for the link “Owns,” an occurrence of the record
type Automobile might be linked to an occurrence of the record type Dealer or to an
occurrence of the record type Customer, but not both.
Conjunction states that if an occurrence of record type A is linked to one or more
occurrences of record type B, then it must also be linked to one or more occurrences
of record type C. For example, a company might have a rule that if a Customer record
occurrence is related to a Credit Balance record occurrence, then it must also be related
to a Credit Check occurrence (Figure 10-3).
Figure 10-3. Linkage constraints
The diagrammatic conventions for linkage constraints are the same as those for
logical data modeling relationship constraints.
There is a tendency not to carry forward relationship names from the logical data
model to the physical data model. This is unfortunate but understandable: few database
management systems let you name a link, much less require one. However, linkage names
help inform/remind the physical database designer why the link was created in the first
place. Therefore, the use of linkage names, and associated documentation, is encouraged
even if your information manager does not support them.
Attributes to Data Items
The physical database design equivalent of the attribute is the data item or data field.
Just as the logical data modeling attribute is a descriptor or characteristic of an entity, a
data item is a characteristic or descriptor of a record. If the record type is Employee, then
typical data items are EMPLOYEE NAME, EMPLOYEE START DATE, and EMPLOYEE
SALARY. A data item occurrence is called a data value or just value. For example, the data
item DETECTIVE NAME could have the data item values “Sherlock Holmes,” “Hercule
Poirot,” and “Ellery Queen.” Developers often abbreviate data item type to data item and
data item occurrence to data value or just value. Note: Do not call a data item type a data
type because data type is frequently used to mean data domain.
Data Item Domain
A data item domain is the set of possible values a data item type can have. Examples of
domains include dates, text, integers, years between 1900 and 2020, real numbers with
three decimal places, abbreviations (USA, EU, UK), and so on. Domains are used to test
for acceptable values for data items. For example, if the domain of INCOME is real values
with two decimal places, then the value “Donald Trump” is unacceptable. Of course,
domains cut both ways. Many a U.S. database designer was humbled after defining the
domain of POSTAL CODE as integer and then encountering the Canadian postal code
K1A 0A9.
Domains are one of the more important components of database design, yet few
information managers support or require them. The relational model—the theology
behind the relational database management system—centers on the concepts of
domains and sets, yet both are underrepresented in many vendor products. That’s truly
unfortunate.
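When the DBMS offers nothing better, a domain can often be approximated with column data types and check constraints. The following is a minimal sketch in standard-style SQL; the table, columns, and permitted values are illustrative assumptions, not part of the order management example.

CREATE TABLE CUSTOMER (
    CUSTOMER_NUMBER  CHAR(8) PRIMARY KEY,
    -- Domain: non-negative real values with two decimal places
    INCOME           DECIMAL(11,2) CHECK (INCOME >= 0),
    -- Domain: a small set of abbreviations
    REGION_CODE      CHAR(3) CHECK (REGION_CODE IN ('USA', 'EU', 'UK')),
    -- Domain: text, not integer, so the Canadian "K1A 0A9" is acceptable
    POSTAL_CODE      VARCHAR(10)
);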
Data Item Source: Primitive and Derived
A data item source indicates how original or fundamental a data value is. A primitive
data item cannot be broken down into other data items or derived from them. A derived
data item is one whose value can be calculated from other data items. For example, the
data items UNIT PRICE and QUANTITY can be used to calculate the COST data item
by multiplying UNIT PRICE by QUANTITY. If you cannot derive a value from other
data items, then the data item is probably primitive. In this example, UNIT PRICE and
QUANTITY are probably primitive data items.
Primitive Data Item: Unique Identifiers and Descriptors
Primitive data items can be of two types: unique identifiers or descriptors. A unique
identifier is a data item that can point to or choose a single record occurrence. Examples
are EMPLOYEE NUMBER, SOCIAL SECURITY NUMBER, and PART NUMBER.
Descriptors describe or give the characteristics of the record type. Examples of
descriptors are COLOR, HEIGHT, LENGTH, WEIGHT, and LOCATION.
Be careful about calling a unique identifier a key. For many if not most DBMS
products, a key need not be unique. The interest here is uniqueness, not keyness.
Data Item Complexity: Simple and Group
Data item complexity looks at whether a data item contains any other data items. A group
data item is made up of two or more other distinct data items. For example, the data item
ADDRESS could be made up of the data items STREET NUMBER, STREET NAME, CITY
NAME, and POSTAL CODE. Data items that are not made up of other data items are
called simple or atomic.
Data Item Valuation: Single Value and Multivalue
Data item valuation indicates how many different values a data item can have at one
time. A single-value data item contains only one value at any given time. Multivalue data
items can have more than one value simultaneously. For example, the data item GENDER
contains only a single value at a time, while MONTHLY REVENUE could have 12 values,
one for each month of the year (“$1000, $1330, $2056, $1820, $9368, $1343, $1588, $1190,
$1030, $1110, $2110, $2100”).
Multivalue data items are supported by most programming languages, which call
them repeating items or repeating groups. Examples would be the OCCURS clause in
COBOL and an array of structs in C (Table 10-4).
Table 10-4. Examples of Multivalue Data Items

A multivalue data item in COBOL:

01  MONTHLY-SALES.
    05  MONTHLY-SALES-NUMB OCCURS 12 TIMES.
        10  UNITS-SOLD      PIC 999.
        10  VALUE-OF-SALES  PIC S9(5)V99.

A multivalue data item in C:

struct monthly_sales_numb {
    int   units_sold;
    float value_of_sales;
} monthly_sales[12];
Although data item source, complexity, and valuation play only small roles in
Transformation, they can take center stage later in the physical database design process,
depending on requirements and the technology environment.
Other Data Item Information
The activity Transform LDM objects to PDM objects doesn’t necessarily stop here. Other
information that needs to be converted includes data item size or length (number of
characters, bits or bytes), edit rules or masks, and any other information collected during
the logical data modeling process.
Appendix C contains examples of information that should be gathered for all record
types, links, and data items.
Activity 1.1.2: Diagram the Objects
As with logical data modeling, physical objects can be diagrammed, in most cases, on a
single page. Figure 10-4 shows the draft physical data model for an order management
system.
Figure 10-4. Draft physical data model for the order management system
The diagram in Figure 10-4 is not the task deliverable but simply a work in progress,
actually just the start, toward that deliverable. That is why the word draft is used.
Task 1.2: Expansion
Expansion, the second Transformation task, is concerned with augmenting the most
important, and most challenging, physical database design component, the record type.
After creating the draft physical data model, the designer needs to look at the structure of
each record type. The first order of business is to assign keys.
Activity 1.2.1: Assign Keys
In logical data modeling, unique identifiers were assigned to all entities for which business
staff indicated they were used. In physical design, unique identifiers become keys. A key is
one or more data items used to identify or pick out one or more record occurrences.
By its traditional definition, keys are of two types: primary and secondary. Primary
keys uniquely identify a record occurrence (the logical data model’s unique identifier).
Some authors and systems restrict a record type to having only one primary key, while,
less commonly, others allow a record type to have multiple primary keys, if by primary
key you limit your definition to unique identifier. Secondary keys are rarely ever burdened
with a uniqueness requirement. They are generally used to find record occurrences when
duplicates are allowed. A traditional record type might have a primary key of the unique
data item EMPLOYEE NUMBER but nonunique secondary keys for EMPLOYEE NAME
and EMPLOYEE TITLE.
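As a minimal sketch of that traditional arrangement in SQL terms (the data types and index names are illustrative assumptions):

CREATE TABLE EMPLOYEE (
    EMPLOYEE_NUMBER  CHAR(6) PRIMARY KEY,   -- unique identifier: the primary key
    EMPLOYEE_NAME    VARCHAR(40),
    EMPLOYEE_TITLE   VARCHAR(30)
);

-- Secondary keys: nonunique indices used to find occurrences when duplicates are allowed
CREATE INDEX EMPLOYEE_NAME_IDX  ON EMPLOYEE (EMPLOYEE_NAME);
CREATE INDEX EMPLOYEE_TITLE_IDX ON EMPLOYEE (EMPLOYEE_TITLE);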
A key can be a single data item or a concatenation of two or more data items. A key
consisting of a single data item would be something such as EMPLOYEE NUMBER, while
a concatenated key or compound key would be SITE ID, BUILDING NUMBER where the
uniqueness of building number is limited to each site.
In many cases, the logical data modeler assigned a unique identifier to an entity.
Unless there are extenuating circumstances, the logical data modeling unique identifier
should be used as the primary key. If uniqueness is not achievable using the LDM
identifier, the database designer can usually employ a concatenated key to satisfy the
uniqueness requirement.
Secondary keys can be defined now, but in most cases, it is wiser to wait until the
designer better understands how the database will be used (step 2, Utilization).
A SHORT HISTORY OF KEYS
Keys have a long history in IT although their use and the definition have changed
over time. Back in the early days of IT, call it the first key era, files were sequential,
existing on punched cards, magnetic tape, or disk, and the key was the field used
to sort the file. In those days, computers spent inordinate amounts of time sorting
files, with the typical job stream a litany of sort, application, sort, application, and so
on. The customer files might be sorted on account number for one application, then
re-sorted on customer name for another, and later sorted a third time on billing date.
Sort keys had no significance beyond their relevance in ordering the file.
The second key era came with the advent of random access technology. Now the
application could fetch any piece of data in a file if it knew its location within the
file. The key now took on a new role—that of an access key, search key, search
argument, or search criterion. This is when the database management system
really took off. The programmer simply passed the key of the desired record to the
DBMS, and the system would deliver it to the application. The DBMS associated the
key with a pointer to the record’s physical location on disk (displacement from the
beginning of the file, sector location, database page, etc.).
The relational model stood keys on their heads with the introduction of the key as a
structural database component. This is the third key era. No longer were keys used
simply for defining how you order a file (a sort key) or how you locate data in the file
(an access key); the relational model made the key part of the architecture of the
data. The foreign key linked a relational parent to its relational children. Keys were
now fundamental to the structure of the database.
As a convention, many modelers underline the primary key in their diagrams.
Unfortunately, the position of the data items in a compound key (which data item is
first, which second, etc.) is important. The concatenated key ACCOUNT NUMBER,
ORDER NUMBER is very different from the key ORDER NUMBER, ACCOUNT NUMBER.
Underlining does not always show this distinction unless the designer aligns the data
items in the correct order, but there is no way to tell whether this has been done.
Regardless, the data dictionary should contain this important information even if the
diagram does not.
Activity 1.2.2: Normalize the Model
Normalization is a process of reducing the structure of the model to a state such that data
in any given record is totally dependent on the primary key of that record. This restriction
ensures that if, for example, some data items are deleted, then all associated data items
are also deleted, while all nonassociated data items are not.
Database designers can get themselves in a rather nasty pickle if they are not careful
about how they assign data items to record types. Improper assignment can cause grave
errors or anomalies.
ANOMALY…ISN’T THAT A KIND OF FISH?
In the good old days, before data processing became information technology,
programmers worked very hard never to have more than one input file and one
output file open for any application. Rarely did a computer have more than one card
reader, and even tape drives were few and had to be shared. Therefore, if a new
data item suddenly appeared, the programmer would do almost anything to avoid
creating another file. The new data item was often shoved somewhere in some
existing file. After the application was a few years old, its file structure could look
like the bottom of a dorm room closet.
One of the consequences of this throw-it-and-see-where-it-sticks approach
is that updating a file often resulted in what early programmers described as
“unanticipated results.” Deleting one record might remove data that were needed
elsewhere; modifying a record might mean that needed information was no longer
findable. It could be a mess, so some great IT minds sat down, put their collective
brains together, and, although they didn’t solve the problem, came up with a
new name for the mess that sounded a lot better than “mess.” They called them
anomalies.
By the way, the “fish” is a sea anemone.
An anomaly results when an action produces an unintended consequence. Imagine
a database that contains information about employees and the projects they work on
(Table 10-5), where the record type Employee contains the data items EMPLOYEE NAME
(the primary key), DEPARTMENT, PROJECT, and HOURS WORKED THIS MONTH.
Table 10-5. IUD Anomalies

Employee Name   Department          Project              Hours Worked This Month
Andrews         Manufacturing       RumpMaster 2000      80
Bradley         Customer Service    Disposable Fry Pan   75
Casey           Product Design      RumpMaster 2000      20
Davidson        Customer Service    RumpMaster 2000      40
                                    Disposable Fry Pan   60
By adding up the hours worked this month by project, the user sees that a total of 140
hours were worked on the RumpMaster 2000 and 135 hours on the Disposable Fry Pan.
However, this information would be lost or incorrect if employee Davidson were deleted
from the Employee file or if Davidson moved to other projects.
An anomaly is a data integrity problem that occurs in a database when an object
that is inserted, updated, or deleted causes an unintended change in another object or
objects. For example, you cannot add a new project to this database until an employee
(the source of the primary key) is assigned to the project. This is an insertion anomaly.
Second, if you discover that Casey spells his name “Casie,” you must change every Casey
record instance. If you miss one, that is an update anomaly. Lastly, if employee Andrews
quits the project and you delete her record occurrences, you lose information on how
many hours were worked on her project. This is a deletion anomaly.
There is a solution. It is called normalization.
Normalization is the application of a set of mathematical rules to a database to
eliminate or reduce insertion, update, and deletion (IUD) anomalies. It does this by
ensuring that all data items are completely dependent on the primary key for their
existence and not on any other data item. The various levels of normalization are called
normal forms. The higher the level, the more likely any potential IUD anomalies have
been eliminated. The forms are progressive, meaning the model must be in first normal
form (1NF) before it can be in second normal form (2NF), which is a prerequisite for the
third normal form (3NF), and so on.
Normalization is closely tied to the relational model. In fact, they were created
and first presented together with the existence of one used, at least partially, to justify
the existence of the other. Although normalization is tied to the relational model, it has
a much broader use; in fact, with some adjustments, it can be used for, and benefit, any
database design for any available DBMS product, relational or not. Unfortunately, the
adjustments can sometimes be confusing and painful to make.
Adjustments Needed for Normalization: Keys—Foreign and
Domestic?
To normalize a model, every record type must have a unique key—no exceptions. U3D
and virtually every DBMS on the planet do not require every record to have a unique key,
including almost every relational DBMS. However, keys, specifically relational primary
keys (the single simple or compound unique identifier selected as the record’s sole
primary key), are the soul of normalization.
Identifying unique identifiers where they do not exist can be a challenge, but the
relational model offers a simple solution that should work in 95 percent of the key-less
cases. The designer can use the relational notion of a foreign key to create unique record
identifiers; it just takes thinking like a relational DBA. That raises the question, what
exactly is a foreign key?
One of the benefits of a DBMS is that it provides the programmer with a way
of linking together data that might physically live in different parts of the database.
For example, in an order management system, the DBMS can make it easy for the
programmer to move from any given account occurrence to the details for any product
associated with that account. How the DBMS does this is the “special sauce” that
separates one DBMS architecture or product from another. Network systems use pointers
in records, inverted systems use external indices, and relational systems use embedded
foreign keys.
That network or inverted file DBMS does not need every record to be unique because
its pointers are unique. While Order might have a unique key (say, ORDER NUMBER),
the Line Item record can get by with a LINE ITEM NUMBER that is just unique within
a given Order. (Remember the discussion of “uniqueness within context” in logical
data modeling?) If 500 Orders are all linked to two or more Line Item occurrences, then
500 Line Item occurrences have a LINE ITEM NUMBER = “1,” 500 have a LINE ITEM
NUMBER = “2,” and so on.
The relational model does not use pointers; rather, it buries in the child record the
primary key of its parent. Take the previous example of the Order and Line Item record
types. Every Line Item record occurrence associated with a particular Order occurrence
would have a special data item, a foreign key, that contained the same data value as the
Order record’s primary key. No pointers required. In practice, well, it depends on the
implementation of the RDBMS. The linkage information that hierarchical, network, and
inverted models keep hidden in pointers is visible, as foreign keys, to application
programmers and even end-user interfaces in the relational model. Doesn’t this primary key-foreign key concept require the
duplication of data? Is not the elimination of duplicate data one of the hallmarks of the
relational model? Yes, and yes; however, this is rationalized away by saying that foreign
keys are not data items at all but, well, foreign keys, which are a totally different beast.
If keys, particularly foreign keys, are not needed for all DBMSs, then why deal
with them here rather than in step 3, Formalization, where the DBMS that will be used
is identified? And if keys are just access methods, then why introduce them here and
not where you deal with how efficiently you want to access the data, which is step 4,
Customization? Why do this now?
The answer is that if you want to normalize your database, then you have to pretend
that your DBMS conforms to the relational model, which means placing keys in all record
types. Normalization does not say you need foreign keys, just that every record must have
a primary key. Foreign keys are just a way of creating primary keys where they might not
normally exist. In the example, the DBA could simply append ORDER NUMBER to LINE
ITEM NUMBER, giving Line Item a unique compound primary key.
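A minimal SQL sketch of that compound key (the data types are assumptions made for illustration) shows the borrowed ORDER NUMBER doing double duty as the foreign-key half of the primary key:

CREATE TABLE LINE_ITEM (
    ORDER_NUMBER      CHAR(8),    -- foreign key borrowed from the parent Order
    LINE_ITEM_NUMBER  SMALLINT,   -- unique only within one Order
    -- together, the two data items uniquely identify a Line Item occurrence
    PRIMARY KEY (ORDER_NUMBER, LINE_ITEM_NUMBER)
);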
Will foreign keys work for all record types? No, foreign keys work only for record
types at the many end of a one-to-many link. For other keyless record types, the designer
should find some other solution. However, in most models, the only record types without
a unique identifier are those at the many end of a one-to-many link.
The good news is that you can always remove the keys after normalization.
However, because the data architecture of the eventual database is still undecided,
the designer must do a little prenormalization work to make the physical data model
normalization friendly. In this book (and more than likely, only in this book), the
preliminary work goes under the lofty name of zero normal form (0NF). As with other
normal forms, 0NF must be completed before 1NF can begin.
Zero Normal Form
To be in zero normal form (0NF):
1. Every record must have a relational model–defined primary key.
When 0NF is complete, more traditional normalization can begin.
First Normal Form
To be in first normal form (1NF):
1. The record must be in zero normal form.
2. All multivalue data items (Codd calls them repeating groups) must be removed from the record.
The remedy for a first normal form violation is to remove the repeating group
(multivalue data items) and create a new record type to house the offending data items.
For example, given the Customer record containing CUSTOMER NAME and the
repeating group CUSTOMER PHONE NUMBER, remove CUSTOMER PHONE NUMBER.
Place it in a new record type, and call it Customer Phone, with a one-to-many link
between Customer and Customer Phone.
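A minimal sketch of that remedy in SQL, where only CUSTOMER NAME and CUSTOMER PHONE NUMBER come from the example and the key, other names, and data types are illustrative assumptions:

-- Before: one Customer record tried to hold several phone numbers.
-- After: the repeating group gets its own record type, linked one-to-many.
CREATE TABLE CUSTOMER (
    CUSTOMER_NUMBER  CHAR(8) PRIMARY KEY,
    CUSTOMER_NAME    VARCHAR(40)
);

CREATE TABLE CUSTOMER_PHONE (
    CUSTOMER_NUMBER  CHAR(8) REFERENCES CUSTOMER (CUSTOMER_NUMBER),
    PHONE_NUMBER     VARCHAR(15),
    PRIMARY KEY (CUSTOMER_NUMBER, PHONE_NUMBER)
);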
Note that 1NF does nothing toward achieving normalization’s primary goal of
reducing IUD anomalies. Rather, its purpose is to ensure conformity with the relational
model’s two-dimensionality requirement. However, because the normal forms are
progressive (you complete one before completing two, etc.), 1NF technically needs to be
adhered to in order to progress. (You’ll learn more about this later in the chapter.)
Before Getting to Second Normal Form, a Slight Digression
The key to normalization is understanding functional dependency, a curious term for a
confusing concept. Here goes.
Take the Employee record containing two data items, EMPLOYEE NUMBER and
EMPLOYEE NAME, where EMPLOYEE NUMBER is the unique identifier (primary key). If
you know EMPLOYEE NUMBER, then you can look up EMPLOYEE NAME, so EMPLOYEE
NAME is determined by EMPLOYEE NUMBER or, in relational-ese, EMPLOYEE NAME is
functionally dependent on EMPLOYEE NUMBER.
Assume there are two employees named Smith. One Smith is getting fired, and the
other one promoted. If you know the EMPLOYEE NUMBER of the person to be fired,
then you can be assured that you canned the correct Smith. However, if you only know
that the employee to be fired is Smith, then you cannot guarantee that you will fetch the
correct Smith from your database. Therefore, EMPLOYEE NUMBER is not functionally
dependent on EMPLOYEE NAME because knowing EMPLOYEE NAME does not give you
the record containing the correct EMPLOYEE NUMBER.
Functional dependency also works with compound keys. Take the record Line Item
with the data items ORDER NUMBER, LINE ITEM NUMBER, ORDER DATE, PRODUCT,
and PRICE. The primary key is the concatenation of (the foreign key) ORDER NUMBER
and LINE ITEM NUMBER. PRODUCT is functionally dependent on the concatenated key
ORDER NUMBER-LINE ITEM NUMBER. ORDER DATE is only functionally dependent on
the ORDER NUMBER, part of the concatenated key. PRICE is not dependent on either but
rather on the non-key PRODUCT. PRODUCT is fully functionally dependent on the primary
key, while ORDER DATE is only partially functionally dependent on the primary key.
One more piece of information is needed before you can continue normalizing.
Remember your college logic class when you learned about transitivity? An example
might jar your memory. Transitivity says that if A=B and B=C, then A=C. Transitive
dependency says that if A is functionally dependent on B and B is functionally dependent
on C, then A is functionally dependent on C.
In the example, PRICE is functionally dependent on PRODUCT, which is functionally
dependent on ORDER NUMBER-LINE ITEM NUMBER; therefore, PRICE is transitively
functionally dependent on ORDER NUMBER-LINE ITEM NUMBER.
If you have grasped this, you can move on to 2NF.
Second Normal Form
To be in second normal form (2NF):
1. The record must be in first normal form.
2. Every nonkey data item must be fully functionally dependent on the primary key (no partial functional dependencies).
The remedy for a second normal form violation is to remove the data items not fully
functionally dependent on the primary key and either create a new record type to house
the offending data items or place them in another existing record type.
Using the Line Item example, to make the model 2NF compliant, remove ORDER
NUMBER and ORDER DATE from Line Item and place them in the new record type Order.
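Sketched in SQL (data types assumed, and the table spelled ORDERS because ORDER is a reserved word in most dialects), the partially dependent ORDER DATE moves to the new record type keyed on ORDER NUMBER, while ORDER NUMBER itself remains in Line Item as the foreign-key component of its compound key:

CREATE TABLE ORDERS (
    ORDER_NUMBER  CHAR(8) PRIMARY KEY,
    ORDER_DATE    DATE      -- now fully dependent on the key of its own record type
);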
Third Normal Form
To be in third normal form (3NF):
1. The record must be in second normal form.
2. There can be no transitive functional dependencies.
The remedy for a third normal form violation is to remove the data items transitively
dependent on the primary key and either create a new record type to house the offending
data items or place them in another existing record type.
To make Line Item 3NF compliant, create a new record type Product containing the
data items PRODUCT and PRICE.
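In SQL terms, the transitive dependency is broken by giving PRICE a home keyed on PRODUCT; the column definitions below are illustrative assumptions:

CREATE TABLE PRODUCT (
    PRODUCT  VARCHAR(30) PRIMARY KEY,
    PRICE    DECIMAL(9,2)
);

-- Line Item now carries only PRODUCT (as a foreign key); PRICE is looked up
-- in Product, so it can no longer contradict itself across Line Items.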
Figure 10-5 shows the changes that were made to Line Item as a result of
normalization.
Figure 10-5. Normalization (before and after)
How many normal forms are there? Many. At least seven, although every now and
then someone comes up with a new one. However, most practitioners and researchers
agree that getting to third normal form is usually good enough. More can be just gilding
the lily.
WHAT’S THE BIG DEAL? VERY LITTLE CHANGED.
If the E-R model is properly constructed, then normalization should add little.
Remember, normalization was envisioned without the benefit of the E-R approach.
This is why some E-R database design authors do not require, or even recommend,
normalization.
Normalization can be a challenging subject that requires considerably more study
than presented here. This entire book could easily focus on just normalization; however,
that is unnecessary because there are many books dedicated to the subject. You are
encouraged to investigate further.
Post-Normalization—Retreat or Sally Forth?
Once normalization is completed, the database designer faces a decision—what to
do with all the relational-based changes made to the physical data model so it could
be normalized. Does the database designer restore the model to its prenormalization
pristine state or leave in all of the relational detritus?
To normalize a model, all record types must have a primary key. Some primary keys
are quite natural, such as CUSTOMER NUMBER, while others require vivid imaginations
to concoct. Foreign keys are not required for normalization because normalization
is concerned with only one record at a time and its key and nonkey data items.
Relationships are irrelevant to the Big N, so foreign keys, technically, play no role.
More troubling is the removal of multivalue data items, which were ejected for
relational model reasons rather than normalization reasons. They could have been left
in the record, and normalization would have been just as effective (although there is a
decent argument that if you are going to normalize the model, then you should follow
its steps).
Restore or not restore? It’s up to the database designer, who can restore the model
now or wait until step 3, Formalization, and either restore or not restore then based on
the DBMS selected.
Issues with Normalization
As good as normalization is, and it is useful, it would be remiss not to mention some of its
issues, problems, and misuses.
•	Normalization is a review process, not a design process. It works in medias res. It assumes that record types already exist, and the question to be answered is, “Are the data items living in this record type correct for this record type?” The database designer must already have a physical data model and then use normalization to improve/modify it.
•	Normalization is a decomposition process. Normalization breaks down or decomposes existing “compound” record types into simpler ones. However, it does not provide a method for going in the reverse direction. For example, if the abstraction of the data is too granular, there is no way, using normalization, to build up to a more appropriate level. While normalization tells you to remove a data item from a record type, it does not tell you where to put it. It provides no information on where a data item belongs, only where it does not belong.
•	Normalization says little about key suitability. Normalization is concerned with the primary key and its relationship with the record’s nonkey fields. Nothing is said about whether the record type has the proper primary key. For example, normalization would not flag as an error PART NUMBER being the primary key of Employee. In fact, it would reject EMPLOYEE NAME from Employee because it is not functionally dependent on PART NUMBER. This is consistent with normalization’s relational model roots. According to the relational model, if a table contains two unique identifiers (candidate keys in relational parlance), which one you choose as the primary key is totally arbitrary. Interestingly, this underscores the need for normalization to follow a logical data modeling process, such as the entity-relationship approach, to appropriately populate entities with relevant attributes, and any potential unique identifiers related to that entity, before undertaking normalization.
•	When do you stop? There is no agreement among the gurus about how far you need to go when normalizing a database. Third normal form? Boyce-Codd normal form? Fourth? Fifth? Sixth? Domain-key normal form? Where does it end?
•	Normalization does not provide a viable end-game strategy. The performance of a normalized database is often very poor compared with non-normalized databases, and thus fully normalized databases are rarely implemented. A common post-normalization exercise for many physical database design approaches is to denormalize the model to improve performance. Unfortunately, there are no accepted rules for denormalization that ensure IUD anomalies are not reintroduced.
•	Gobbledygook. A major selling point for normalization is its formal mathematical roots. It’s right out of formal logic and set theory, so it includes a strong mathematical pedigree. The problem with a mathematically based database design technique is that it is a mathematically based technique. It is a mathematical concept steeped in mathematical jargon, which is anathema to many IT staff. Unless you are a mathematician, it can be less understandable than an EBCDIC version of the Bhagavad Gita. If normalization gurus really want to spread the word, then they need to take a closer look at their audience and understand how those people think, how they talk, and the language that is meaningful for them.
HOW PREVALENT IS NORMALIZATION?
I received a very unscientific and inchoate answer to this question. In dealing with
dozens of database designers on five continents over the past few decades, I found
that about 90 percent said that they were familiar with normalization. Fifty percent
said that they normalized their databases. Fewer than 15 percent actually did it, and
fewer than 2 percent did it properly.
Normalization is confusing, annoying, and frustrating. It is also very useful. Wise
database designers will roll up their sleeves, wear their thinking caps, put on the coffee,
get out the college textbooks, and normalize the hell out of their database design.
Transformation Notes
There is one very important task remaining. The database designer must document all
the issues and all the decisions made during step 1, Transformation, in one or more
documents called Transformation notes. The reason is not to record history—it is difficult
to imagine a 23rd-century archaeologist rejoicing at finding some DBA’s scribblings—but
for the future of the database. Sometime in the days or years to come, some other
database designer or DBA will need to make changes to the database. Ideally, before
mucking about in the DDL, they will do some research to understand the intentions of
the original users and database designers and the reasoning behind the decisions they
made. If all the new designer has to work with is the current DDL or a few diagrams, your
clever original thinking might be misunderstood or totally ignored.
Transformation notes should include answers to these four questions:
•	Why? Why was something done? Knowing why a decision was needed and how it was made can prove very useful for future designers.
•	Where? Some decisions cover the entire model, while others apply to only a portion of it. “Where” tells future database users and supporters the context or scope of a decision.
•	When? Some decisions are time dependent. Identifying the temporal scope of a decision makes it easier to link it to other designer work products such as diagrams and procedures.
•	Results? There is a tendency for designers to document only successes. Sometimes documenting what didn’t work or what was rejected is more important than successes.
This is your chance to ensure that your legacy survives the ravages of some young
undereducated upstart or just yourself three years from now when you are trying to figure
out why you did what you did. In either case, write it down. It can only help.
Deliverables
Step 1, Transformation, produces three major deliverables.
1.1. Physical Data Model: The physical representation of the logical data model (Figure 10-6 in the next section).
1.2. Physical Data Model Object Definitions (data dictionary): Record types, data elements, linkages, keys, formats, and so on (Figures 10-7 through 10-10 in the next section). Each should include a description and relevant information on all data objects. (Appendix C contains a glossary of physical data object definitions.)
1.3. Transformation Notes: The database designer’s notes on relevant issues and decisions made during step 1, Transformation.
Examples of Deliverables
The first deliverable is a Physical data model diagram giving a graphic representation of
the physical record types and how they are related to each other (Figure 10-6).
Figure 10-6. Physical data model. The diagram includes a “changes made to the model” note: (1) all entities became record types, all relationships became links, and all attributes became data fields; (2) crow’s feet became arrowheads; and (3) no changes to the model were made as a result of normalization.
The second set of deliverables is the Physical data model object definitions that
are part of the data dictionary. Below are a set of sample Physical Data Model Object
Definition forms.
Figure 10-7. Record type definition
Not all physical data object information can be entered at this point. Some
information will have to wait until further steps. For example, keys will not be finalized
until Chapter 12, and storage issues, such as clusters and partitions, will be decided in
Chapter 13. Other information might change as you delve further into database design,
such as data items in the record type.
Figure 10-8. Data item definition
One of the most important items in the record type definition is the “Notes and
Comments” section. This is an opportunity for the database designer to convey to
future designers and DBAs important information they will need but that might not
be adequately explained elsewhere. A prudent designer will make liberal use of this
opportunity.
Figure 10-9. Domain definition
Most database management systems do not require the use of domains, although
many do allow them. This is unfortunate because domains are an effective tool for
maintaining database veracity. If the DBMS does not support them, then DBA or
applications staff should develop the necessary functions to support them.
Figure 10-10. Linkage definition
These documents are only a suggestion. How you document the model might be
quite different. Many CASE and system development tool packages include robust data
dictionaries that can store this and similar information. They are a good place to keep
such documentation and should be used when possible.
This concludes the Transformation process. Next, step 2, Utilization, examines
exactly how the database will be used and the modifications to be made to the physical
data model to accommodate that use.
CHAPTER 11
Utilization: Merging Data
and Process
Our biggest cost is not power, or servers, or people. It’s lack of utilization.
It dominates all other costs.
—Jeff Bezos
Data is a precious thing and will last longer than the systems themselves.
—Tim Berners-Lee
Both the logical data and physical data models are static, only representing the
definition of the data they contain. Many database designers stop here, never—or only
inconsistently—taking into account how the data and the database will be used. See
Table 11-1.
Table 11-1. Step 2: Utilization

Sources
• 1.1: Physical data model (diagram)
• 1.2: Physical data model object definitions (data dictionary)
• PM: Business requirements (processes, procedures, and all volumes)
• 1.3: Transformation notes

Procedures
• Task 2.1: Usage Analysis
   • Activity 2.1.1: Create usage scenarios
   • Activity 2.1.2: Map scenarios to the PDM
• Task 2.2: Path Rationalization
   • Activity 2.2.1: Reduce to simplest paths
   • Activity 2.2.2: Simplify (rationalize) model

Deliverables
• 2.1: Rationalized physical model (diagram)
• 2.2: Updated physical data model object definitions (data dictionary)
• 2.3: Usage scenarios
• 2.4: Usage maps
• 2.5: Combined usage map
• 2.6: Utilization notes
Step 2, Utilization, adds to the physical data model how the procedures defined in
the process models will store and access data. Utilization is where the formerly separate
data and process models meet to form the first hybrid definition/use model.
In this step, the database design missing link problem is finally resolved—designers
can create a database that integrates the static definition of data (the data models,
both logical and physical) with the more dynamic use of that data (the process models,
both logical and physical). Utilization is the key, resulting in a structurally resilient,
functionally rich, effective, and efficient database design.
Task 2.1: Usage Analysis
The first Utilization task is to gain an understanding of the functionality the database
will support. The database designer must examine all data usage in the process
documentation created by the application designers. In most cases, to do this effectively
involves understanding at least some forms of process modeling. Because there are plenty
of books on documenting processes, there is no need to go into depth here, although a
cursory look is useful.
Process Modeling
A process model serves a purpose similar to a data model—to document existing or
planned applications. Whereas data models represent information at rest, process models
record information as it is created, used, modified, and deleted by an application.
As with data models, there are different types of process modeling techniques, and
they can vary greatly. Some techniques stress a business-focused process requirements
analysis, while others take on a more technical bent. Logical process requirements
documentation can involve a narrative form using natural language, graphical
techniques, or a combination of both.
As is the case with data, the process side of a system can be divided into logical
process models and physical process models. Logical process models record the existing
or planned functional capabilities of an application—the what is wanted. Physical
process models focus on how the system (its hardware and software) does or should
function—the how it will work. This process-modeling what-versus-how contrast is
the complement to that presented in Chapter 1 for data modeling. The most popular
documenting techniques for logical process modeling are structured English, data flow
analysis, and, unfortunately, plain English. Physical process models are usually described
using flow charts, structure charts, pseudocode, and, interestingly, plain English.
Logical Process Modeling
When most developers think of documentation techniques, they usually think of logical
process models representing, in narrative or graphical form, the functionality of the
existing or proposed application.
Natural-Language Logical Process Modeling Techniques
Natural language is the speech used every day. It is also, for better or worse, how most
applications are described and documented. Natural languages, such as English,
Russian, and Italian, have the advantage of being understood by large numbers of
people, particularly the people who will be using the application. Unfortunately, natural
languages, such as plain English, can be verbose and error prone, leading some designers
to look for more formal approaches to give the model added structure and less bulk.
Plain English
Spend a fortune on college, go to training classes provided by your employer, read books
and journals describing the latest analysis techniques, and then receive a requirements
document written like a Lewis Carroll novel. It’s not fair, but it’s reality. The only thing the
database designer can do is to try to translate the application prose into something more
practical, as illustrated in the next section.
Oy Vey, There Has Got to Be a Better English Translation
At the opposite end of the spectrum from plain English is a formal language process
modeling technique called pseudocode, which is a version of natural language that looks
somewhat like computer code. Pseudocode rates high on the discipline scale but can
appear robotic and mechanical to users and, for that reason, is not the approach of choice
for logical process modeling. Between plain English and pseudocode are a number
of attempts at giving plain language some discipline while not giving up its ease of
understanding. The most popular language-based compromise techniques go by names
such as structured English or tight English (or Portuguese, etc.).
Structured English
Every system ever developed had at least part of its inner workings documented using
plain language. It might be in French or Japanese, but every application has been
described somewhere in the plain language of its users. Because this book is written in
English, plain language here is English.
There is probably not an analyst or designer on Earth who has not been given
development instructions something like the following:
The teacher accesses the student’s record using his or her student number.
The student’s grade for the course is entered next to the appropriate
course and section.
The problem with plain language is that it is easy to overlook important details. For
example, what should the teacher do if the system responds with a matching student
number but a different student name from the person attending his class? What if the
actual section number does not appear?
An insidious problem with plain language is that requirements that look complete
can be grossly inadequate.
Structured English is a concept introduced in the late 1970s by the proponents of
structured analysis and structured design. It consists of English language concepts used
with a little more exactness, simplicity, and rigor than found in everyday life.
Structured English consists of a set of rules, precisely defined words, and specified
sentence structures that reduce ambiguity while increasing reader comprehension.
Think of it as a system described by Mr. Spock contrasted with one dramatized by Richard
Simmons. While one is emotional and inexact, the other is more detailed and precise.
Creating structured English is relatively easy; you just have to remember that its purpose
is to unambiguously document how a system is to work.
Because there is no standard for structured English, implementations vary and are
often highly personalized. A simple structured English approach is to decompose the
plain English requirements into separate simple declarative statements. These declarative
statements are then treated to a few rules, such as statements formatted as sequential
logic or drop-through statements, decision logic or trees, or decision loops.
Sequential logic is a list of events with one following another. Here’s an example:
Read Customer record where Customer Number = "xxx"
Then Read Product record where Product Code = "yyy"
Then Insert Order record for Customer Number = "xxx" and
Product Code = "yyy"
The logic is simple; start at the first statement and go down the list, one statement at
a time.
Decision logic involves testing a condition and then taking an action based on that
decision. The simplest decisions are branches, represented by the ubiquitous If-Then statement.
If Customer Status = "Active" then go to Active Customer
If the condition is true, branch to Active Customer; if not, go on to the next
statement.
A more complex structure would be If-Then-Else.
If Customer Status = "Active" then go to Active Customer
Else Go to Inactive Customer
A variation of decision logic is appropriate when there are three or more options that
are better represented by a decision table.
IF                          THEN
Customer = "Active"         Go to Active Customer
Customer = "Inactive"       Go to Inactive Customer
Customer = "Credit Hold"    Go to Credit Problem
Customer = "National"       Go to National Accounts
Decision loops repeat a sequence of steps until a condition is met. For example, an
order can consist of many products.
Enter Product
Insert Line Item record for Product Code = "yyy"
Repeat Enter Product until Product Code = "000"
A list of agreed-upon keywords, such as If, Then, Until, Repeat, etc., can improve
comprehension, particularly if there are common (industry, organization, or even just
team) definitions. Some designers like to require that keywords be in all capital letters.
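For instance, under a team convention that capitalizes keywords, the earlier decision loop might be written as follows. This rendering is purely illustrative; there is no standard for it:
ENTER Product
INSERT Line Item record FOR Product Code = "yyy"
REPEAT Enter Product UNTIL Product Code = "000"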
Graphical Logical Process Modeling Techniques
The problem with all natural-language documentation techniques is that they tend to be
verbose and sequential and, ironically, can miss both the detail as well as the big picture.
As with logical data modeling, logical process modeling gains from the use of some
graphical techniques.
The most popular graphical logical process modeling technique is the data flow
diagram (DFD). A major advantage of the DFD, and the main reason for its popularity, is
its simplicity and universality.
DFDs consist of four objects (Figure 11-1). An external entity, represented by a
square, is an object external to the application, such as a person, department, or another
application. A process, represented by a rounded rectangle, is a procedure that acts on
data. A data flow, represented by an arrow, shows the movement of data, such as data
passed between processes or to or from an external entity. A data store, represented by
an open rectangle, is data at rest, such as a computer file, a file cabinet, a file folder, a
Rolodex, or a card catalog.
Figure 11-1. Data flow diagram symbols
The highest level of a DFD is called Level 0 (Figure 11-2), and it represents the
entire system or application. Level 0 can be decomposed into multiple Level 1 diagrams
(Figure 11-3), one for each Level 0 process, each with its own subprocesses showing
greater application detail. Level 1 can be decomposed into Level 2, and so forth, until the
entire system is documented.
Figure 11-2. Customer Orders Product Level 0 DFD
Figure 11-3. Order Product Level 1 DFD
DFDs also contain a narrative component. Each object requires a definition, and
each process, particularly those at the lowest level, requires a narrative describing the
operations it performs on the data. The difference between these process narratives and
natural language techniques is scope. A DFD process narrative describes only a small
single process, not an entire system. DFDs and structured English merge well, with the
latter being used to describe DFD processes.
Physical Process Modeling
So far, only logical process modeling techniques have been described; however, the
modeler must also be familiar with physical process modeling techniques. As with the
data side, physical process modeling techniques instruct the designers and coders on
how the internals of the application should work.
Natural-Language Physical Process Modeling Techniques
Some natural-language physical modeling techniques are holdovers from logical
processing modeling, such as plain English (again), structured English (again), and
pseudocode.
Plain English
No, this is not a mistake. The plain-language techniques used for logical process
modeling are often, unfortunately, the same techniques used for physical process
modeling. However, how they are used does often differ.
While the DFD is the most popular graphical logical process modeling technique
and the flow chart is the most popular graphical process design technique, they are
both, regrettably, eclipsed by the all-time winning documentation technique: English.
Sad to say, simply writing down what the system is to do is, by far, the most common
(if least practical) way to document both the logical and physical requirements of an
application.
Why is plain English still around? In most cases, it can be boiled down to one of two
reasons. First, the analyst/designer does not know any better technique. An amazing
number of analyst/designers have little more than a passing knowledge of modern
modeling techniques. Second, the analyst/designer is too lazy to use a more precise
documentation approach.
Structured English
The structured English of physical process modeling differs little from the structured
English of logical process modeling. The only difference might be the semantics of the
narrative. Expect physical process structured English to delve more into process control
and components internal to the application, such as data flags, branches, and loops.
Pseudocode
The philosophy behind pseudocode is to give the reader all of the specificity of computer
code without referencing a particular computer language and without unneeded linguistic
details. As with structured English, there is no pseudocode standard. Each practitioner can
create their own or agree on some local or team-wide standard set of rules. The following
pseudocode example uses only three simple rules:
1. State instructions as simple declarative or imperative sentences using well-understood and documented data object names where possible.
2. Convert conditionals to IF THEN, ELSE, or decision table form.
3. Allow iteration using DO UNTIL or similar constructs.
The result turns convoluted constructs, such as the following:
Customers are of two types. Those with annual sales averaging more than $10,000
are given a 10 percent discount. Others are given a 10 percent discount only if the order is
greater than $1,000.
into this more understandable pseudocode:
If LAST YEAR SALES > 10,000,
or YEAR TO DATE SALES > 10,000,
or ORDER AMOUNT >1,000,
then CUSTOMER DISCOUNT = 0.10,
else CUSTOMER DISCOUNT = 0.0.
Some designers like to customize their pseudocode around a particular
programming language such as C, COBOL, or Java. Others believe in a more
programming language–free pseudocode. Which is chosen is less important than
consistency.
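As one illustration, a designer in a SQL-oriented shop might flavor the same discount rule as a query. The sketch below is hypothetical; the Customer and Orders table names and their columns are assumptions made for this sketch, not part of any example in this book:
-- Hypothetical SQL-flavored rendering of the discount rule
SELECT O.ORDER_NUMBER,
       CASE
         WHEN C.LAST_YEAR_SALES    > 10000
           OR C.YEAR_TO_DATE_SALES > 10000
           OR O.ORDER_AMOUNT       > 1000
         THEN 0.10
         ELSE 0.00
       END AS CUSTOMER_DISCOUNT
FROM   CUSTOMER C
JOIN   ORDERS   O ON O.CUSTOMER_NUMBER = C.CUSTOMER_NUMBER;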
Graphic Physical Process Modeling Techniques
Graphical physical process modeling techniques predate the computer age and are
considerably more popular than graphical logical processing modeling techniques. This
section looks at the two most popular graphical process modeling techniques: flow charts
and structure charts.
Flow Charts
By far the most popular graphical physical process modeling technique is the flow
chart (Figure 11-4). Both revered and reviled, the flow chart predates the digital
computer. Invented in the 1920s, the flowchart is a general-purpose tool for graphically
representing processes as a system rather than a sequence of steps. John Von Neumann
is thought to have been the first to apply them to computer programs in the 1940s. Their
popularity exploded in the 1950s and continued right up to the early 1970s when batch
systems began to be replaced with online applications. Although they still work well for
documenting logic flow, newer techniques, such as structure charts, do a better job of
documenting many kinds of processing. However, flow charts persist to this day because
of their ease of use and their ubiquity in classroom instruction.
Figure 11-4. Customer credit status flow chart
While most everyone agrees that the flow chart has outlived its usefulness, it is still
pervasive throughout the industry. Although your college professor might threaten
to revoke your degree if you use the technique, flow charts are artifacts found in virtually
every IT shop.
Structure Charts
A structure chart is a diagrammatic physical process modeling technique that represents
the process as an inverted tree. The top of the tree is the root application or program
level. Subsequent levels are modules representing greater process granularity. The very
bottom levels usually represent program modules performing a single task. Structure
charts date back to the structured design and programming era. Each box on the chart,
called a module, represents a process with a single input and a single output. Modules are
made up of submodules, which are in turn made up of even “subbier” modules…you get
the idea. Arrows show the flow of data or control. An arrow with an empty (white) circle
shows the movement of data (up or down), while an arrow with a filled-in (black) circle
shows the passing of control such as decisions and flags (Figure 11-5).
Figure 11-5. Employee-customer structure chart
Structure charts are a popular technique with web designers because they can easily
represent the architecture of a web site down to the page level. See Table 11-2.
Table 11-2. Logical and Physical Process Modeling Techniques

Summary of Process Modeling Techniques

Logical Process Modeling
·· Structured English: Application of a regimen to natural-language English to diminish its ambiguities while adequately communicating with users
·· Data flow diagramming: A simple yet robust graphical representation of the movement of data in a system

Physical Process Modeling
·· Pseudocode: English language structured to mimic computer code
·· Flow chart: A diagrammatic technique to represent a computer algorithm
·· Structure chart: A tree structure to show the hierarchical breakdown of computer modules
At the end of this chapter is a list of sources where you can find materials on these
and other popular process modeling techniques.
Activity 2.1.1: Create Usage Scenarios
Usage scenarios document how an application uses a database. A usage scenario can be
as simple as “Fetch Customer record where CUSTOMER NUMBER is 1234” or as complex
as a subset of an application involving a significant portion of the database.
The purpose of a usage scenario is to make it easier for the database designer to
understand how the database will be used. It gives the database designer a clear and
simple document, devoid of confusing and extraneous process specifications, stating
exactly how the application will create, read, modify, or delete the information stored in
the database.
Logical process modeling can create a mountain of information, much of which is
unrelated to the database design process. The usage scenario process boils down all of
the logical process information into what is relevant to database design. To build a usage
scenario, the database designer reviews the application’s various process components,
such as requirements definitions, functional specifications, flow charts, and so on, and
culls from them all the relevant data fetching and storing information. This information
forms the basis of the usage scenario.
Clearing the Decks for Action
One thing all process modeling techniques have in common (if they have the appropriate
level of detail) is that they provide far too much information to the database designer.
Even a moderately sized system can involve hundreds of pages of text and diagrams
that explain what the system should do. These specifications, created during analysis
or design, contain considerable information beyond how the application accesses, or
uses, data. They also include detailed algorithms, user interaction, control, branching
instructions, report or screen layouts, and so on. The database designer needs only about
10 percent of this information. A good idea is to strip these components out, leaving just
the interaction between the process and the database. A usage scenario boils down the
hundreds of pages of requirements analysis to the few that are relevant to the database
design process.
The following is an example of a plain English specification.
Activity: Create a New Customer Account
The clerk enters the caller’s phone number into the system. If the caller has an account,
then the account information is displayed, and the clerk informs the customer that an
account already exists. If an account does not exist, then the credit status of the caller is
checked with the outside credit bureau. If the caller’s credit is OK, a new account is created
and the new customer informed. If the caller’s credit is Not-OK, the new account is denied
and the caller informed.
This plain English specification can be converted into a more structured format as
follows:
Activity: Add New Customer Account
1. The clerk enters the caller's phone number into the system.
2. If there is a customer account in the system, the system displays all customer and account information.
3. The clerk informs the customer that he already has an account and asks whether he wants a new account. If the customer does not want a new account, terminate the call; else go to 5.
4. If the caller is not in the system, the clerk enters the caller's information.
5. The system checks the credit status of the caller with the outside credit bureau.
6. If the credit status is OK, the system creates a customer account and informs the clerk.
7. The clerk informs the customer that a customer account was created, gives the customer all the account information, and then terminates the call.
8. If the credit status is Not-OK, the system informs the clerk, who informs the caller and terminates the call.
However, this specification contains substantial activity extraneous to the database, which the designer can ignore with impunity.
Using the data model in Figure 11-6, look at the following:
1. The clerk enters the caller's phone number into the system.
   There is no database activity here. The process is between an agent external to the application, the clerk, and the application itself.
2. If there is a customer account in the system, the system displays all customer and account information.
   This is the first interaction between the application and the database:
   (Database Action 1) Fetch Customer and Account occurrences where PHONE NUMBER matches the search argument.
3. The clerk informs the customer that he already has an account and asks whether he wants a new account. If the customer does not want a new account, terminate the call; else go to 5.
   There is no database activity here.
4. If the caller is not in the system, the clerk enters the caller's information.
   There is no database activity here.
5. The system checks the credit status of the caller with the outside credit bureau.
   There is no database activity here.
6. If the credit status is OK, the system creates a customer account and informs the clerk.
   (Database Action 2) Add/Update Customer and Add Account occurrences for new Customer
7. The clerk informs the customer that a customer account was created, gives the customer all the account information, and then terminates the call.
   There is no database activity here.
8. If the credit status is Not-OK, the system informs the clerk, who informs the caller and terminates the call.
   There is no database activity here.
Figure 11-6. Order management system
The Add New Customer Account usage scenario contains only two database-related
activities.
• Fetch Customer and Account occurrences where PHONE NUMBER matches the search argument.
• Add/Update Customer and Add Account occurrences for new customer.
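If the target turns out to be a relational DBMS, these two actions might eventually translate into SQL along the following lines. This is only a hedged sketch; the table names, column names, and host variables are assumptions, and writing real DDL and DML belongs to step 3, Formalization:
-- Database Action 1: fetch Customer and Account occurrences by phone number
SELECT C.*, A.*
FROM   CUSTOMER C
JOIN   ACCOUNT  A ON A.CUSTOMER_NUMBER = C.CUSTOMER_NUMBER
WHERE  C.PHONE_NUMBER = :phone_number;

-- Database Action 2: add/update Customer and add the new Account occurrence
INSERT INTO CUSTOMER (CUSTOMER_NUMBER, CUSTOMER_NAME, PHONE_NUMBER)
VALUES (:customer_number, :customer_name, :phone_number);

INSERT INTO ACCOUNT (ACCOUNT_NUMBER, CUSTOMER_NUMBER, ACCOUNT_STATUS)
VALUES (:account_number, :customer_number, 'ACTIVE');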
To complete this usage scenario, a little more information is required to understand
the properties of the scenario. First, each scenario should have a unique identifier and a
unique name. Second, specify the scenario processing type as online or batch, and third
specify the frequency of use. For an online scenario, the frequency might be 100 times an
hour or 5,000 times a day. For batch jobs, the frequency might be that the program is run
weekly, and each run involves an average of 20,000 invocations. Note that not every step is
executed every time. If the steps have different frequencies, then that information should
be included in the scenario.
To review, a usage scenario is a small document to tell the database designer
how the application will use the database. The sources for the usage scenarios are the
requirements definitions, process analysis, and process design documents, which could
include interview notes, narratives, process models, and process specifications.
Putting a Usage Scenario Together
There are four steps for creating a usage scenario, although the first may be skipped if
sufficiently detailed process specifications already exist.
1. Assemble all physical process documentation. Logical process documentation is a good thing to have, but if the analysts have done a good job documenting the physical characteristics of the system, then it can probably be ignored.
2. If the application processes are documented using one of the graphical or structured language methods for defining an application, then this step can probably be skipped, and the designer can go straight to step 3. However, if the application processes are defined using only plain language, then the database designer will have to reinterpret the system using a technique such as structured English or pseudocode. The database designer will have to use whatever (unstructured) information exists. Worst case (other than not having any documentation at all) is when all the designer has as a source are original end-user interview notes.
3. Strip out all non-database-related process specifications. Place the database requests in sequential order using appropriate database terminology (read, add, search argument, etc.).
4. Add the scenario properties of unique identifier, name, processing type, and frequency.
An Example
Do not underestimate the advantage of creating a usage scenario. It can be very helpful in
removing extraneous information. For example, take the following case.
Step 1 is to gather whatever process documentation exists. In the example, the only
information is the following plain English description of the application:
The system reads all product inventory records. Those falling below the
reorder threshold are placed on a possible reorder list. For those on the possible
reorder list, if the sales of the product during the last 60 days were
greater than 10 percent of the fully stocked number, then create a reorder
record for x items where x is the difference between the items on hand
and the fully stocked number. If the number of sales during the previous
60 days was less than 10 percent but greater than 5 percent, then reorder
x items where x is 50 percent of the difference between the fully stocked
number and the number on hand. If the number of sales during the past
60 days was less than 5 percent of the fully stocked number, then do not
reorder the product.
This is a plain-language usage narrative, so step 2 is to convert the plain-language
description into structured English. It might look something like the following:
Read all Product records where INVENTORY COUNT is less than REORDER
THRESHOLD.
Read all Line Item records for each Product. If CURRENT DATE minus SALES
DATE < 60 then add LINE ITEM QUANTITY to TEMP SALES COUNT.
If TEMP SALES COUNT > (FULLY STOCKED COUNT*0.10) then REORDER COUNT = FULLY
STOCKED COUNT – INVENTORY COUNT.
Else If TEMP SALES COUNT > (FULLY STOCKED COUNT*0.05) then REORDER COUNT =
(FULLY STOCKED COUNT – INVENTORY COUNT)/2.
Save the REORDER COUNT and the CURRENT DATE in the Inventory Reorder record.
Step 3 consists of two tasks. First, strip out all non-database-related process
information. Second, use more database-like terminology and create a database
sequence.
Here is the usage scenario:
(1) Enter at all Product occurrences where INVENTORY COUNT is less than REORDER THRESHOLD.
(2) For each Product occurrence, Find all related Line Item occurrences.
(3) Insert Inventory Reorder occurrence for each related Product occurrence.

In step 4, add the scenario properties.

Usage Scenario: 1  Name: Calculate Reorders  Processing type: batch  Frequency: once per night.
(1) Enter at all Product occurrences where INVENTORY COUNT is less than REORDER THRESHOLD (200 occurrences).
(2) For each Product occurrence, Find all related Line Item occurrences (approximately 1,200 occurrences).
(3) Insert Inventory Reorder occurrence for each related Product occurrence (50 occurrences).
The result is a much simpler set of database service requests.
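For comparison only, here is how the Calculate Reorders scenario might eventually look in SQL. The sketch assumes hypothetical Product, Line Item, and Inventory Reorder tables and columns; the usage scenario itself deliberately stays at the service-request level and leaves such code to step 3, Formalization:
-- Hypothetical sketch of usage scenario 1 (Calculate Reorders)
INSERT INTO INVENTORY_REORDER (PRODUCT_CODE, REORDER_COUNT, REORDER_DATE)
SELECT P.PRODUCT_CODE,
       CASE
         WHEN SUM(L.LINE_ITEM_QUANTITY) > P.FULLY_STOCKED_COUNT * 0.10
           THEN P.FULLY_STOCKED_COUNT - P.INVENTORY_COUNT
         ELSE (P.FULLY_STOCKED_COUNT - P.INVENTORY_COUNT) / 2
       END,
       CURRENT_DATE
FROM   PRODUCT   P
JOIN   LINE_ITEM L ON L.PRODUCT_CODE = P.PRODUCT_CODE
WHERE  P.INVENTORY_COUNT < P.REORDER_THRESHOLD
  AND  L.SALES_DATE > CURRENT_DATE - INTERVAL '60' DAY   -- date arithmetic varies by DBMS
GROUP  BY P.PRODUCT_CODE, P.FULLY_STOCKED_COUNT, P.INVENTORY_COUNT
HAVING SUM(L.LINE_ITEM_QUANTITY) > P.FULLY_STOCKED_COUNT * 0.05;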
The designer should create one usage scenario for each process. For example,
the order management system might have different usage scenarios for creating a new
customer, generating an order, shipping, and processing returns. The collection of
usage scenarios represents how the entire application adds, reads, updates, and deletes
information from the database.
Activity 2.1.2: Map Usage Scenarios to the PDM
The usage scenarios can then be converted to a simple diagram, called a usage map, by
drawing the actions of the usage scenario on the physical data model. Take the following
usage scenario:
Usage Scenario: 2 Name: Produce Account Bills
Processing type: batch Frequency: 300 times per night.
2.1. Enter database and Find Order occurrences where ORDER DATE equals CURRENT DATE (200 occurrences).
2.2. Find Line Item occurrences for associated Order record occurrence (average 2 occurrences).
2.3. Find Account occurrence for associated Order occurrence (average 1 occurrence).
2.4. Find Customer occurrence for associated Account occurrence (average 1 occurrence).
2.5. Update Order occurrence (1 occurrence). Comment: to update Order with billing date.
Mark up the physical data model with the usage scenario steps using the following
convention: x.y.z, where x is the scenario number, y is the step, and z is the database
action (E for enter, F for fetch, I for insert, U for update, and D for delete).
Using usage scenario 2, you have “2.1E” for step 1, enter the database, written on the
physical data model with a dashed arrow pointing at the Order record type (Figure 11-7).
An arrow from Order to Line Item labeled “2.2F” says “Find Line Items records for that
Order.” Step 3 becomes “2.3F,” step 4 is “2.4F,” and step 5 is “2.5U,” update the Order record
with the billing information.
Figure 11-7. Usage map order entry system
A usage map is the result of applying one usage scenario to the physical data model.
The designer should create one usage scenario for each process and one usage map
for each usage scenario. Start by printing out or photocopying as many copies of the
physical data model as you have usage scenarios. Then, taking one usage scenario and
one copy of the physical data model, draw the activities from the usage scenario on the
physical data model.
A single-system combined usage map is created when all of the usage scenarios,
placed on their individual copies of the physical data model, are collected onto a single
physical data model page. The result might look something like Figure 11-8, which shows
three usage scenarios (scenarios 4, 5, and 6) placed on a single combined usage map.
Figure 11-8. Combined usage map order entry system
The combined usage map is a graphical representation of how the entire application
uses the database.
Task 2.2: Path Rationalization
Usage analysis can result in a crowded combined usage map. The task of Path
Rationalization is to simplify the map without losing any important usage information.
Activity 2.2.1: Reduce to Simplest Paths
If you compare a logical data model with its actual database schema, one thing should
be obvious. While the number of entities is roughly equal to the number of record types
and the number of attributes is roughly equal to the number of fields, the number of
relationships on the logical data model can be significantly greater than the number of
linkages on the database schema. The reason? Not all of them are needed. Linkages on
a data model are similar to roads on a street map. Some roads are heavily traveled, some
are used only occasionally, and still others could easily not exist without significant
hardship.
Roads and linkages have something else in common—they are expensive.
Linkages take up space and consume processor cycles, so reducing their number can,
in some cases, improve performance while driving down cost. Of course, as with roads,
eliminating the wrong ones can create catastrophic problems. The art of the science is
finding the right ones to eliminate.
If you examine the combined usage map in Figure 11-8, what should jump out at you
is that many scenarios use the database in the same way. In this example, both scenario 4
and scenario 5 perform almost the same tasks. This should tell you two things. First, if you
design the database to accommodate scenario 4, it should also be able to accommodate
scenario 5. Second, the paths these two scenarios use are probably important because
they are used as part of two different physical processes.
In the usage map fragment in Figure 11-9, both a customer and an account can have
many addresses, and an address can be for many accounts. Assume usage scenario 7
indicates that the setup of an address for a customer is low frequency, while scenarios 8
and 9 process higher-frequency account activity. The obvious question is, do you really
need the link between Customer and Address? Can usage scenario 7 do what it has to do
without that link? If so, then you can probably remove the link entirely in your physical
data model.
Figure 11-9. Combined usage map fragment
Excluding the Customer/Address link will simplify the physical data model. Even so,
you should keep all the usage scenarios so you have the information available to undo
this change if later it proves unwise.
Activity 2.2.2: Simplify Model
The designer can now pull it all together into a single rationalized physical data model
with the final record types and all relevant keys, including necessary links between the
record types while excluding unnecessary ones. The data dictionary entries created
in step 1, Transformation, must be updated with any changes made in task 2.2, Path
Rationalization.
The model no longer represents just the definition of the data but also how those
data will be used. This final Utilization activity is the culmination of your physical model
before making the necessary compromises imposed by the selected DBMS.
Utilization Notes
The only remaining task in step 2, Utilization, is to complete the Utilization notes. As with
Transformation, the database designer should document all the relevant issues and
decisions made during Utilization.
As is the case with step 1, Transformation, the Utilization notes should illuminate all
decisions made, not made, and unmade by answering the questions surrounding why,
where, when, and results.
The other step 2 deliverables are important, but they are not enough. Future users,
designers, and database administrators need to understand the thinking that went into
this step. Without it, they are driving blind when it comes to updating the database with
additional or modified functionality or a new DBMS product or version.
Deliverables
Step 2, Utilization, should produce the following deliverables:
2.1. Rationalized physical data model: A graphical representation of the record types and links required for the application (Figure 11-10 in the next section)
2.2. Updated physical data model object definitions: The same physical definitions created in step 1, Transformation, updated with any necessary changes made during step 2, Utilization
2.3. Usage scenarios: Functional summaries describing how the database will be used by the application
2.4. Usage maps: A mapping of the individual usage scenarios onto the physical data model showing how the application must navigate the database (Figure 11-7)
2.5. Combined usage map: All of the individual usage map information on a single diagram (Figures 11-8 and 11-9)
2.6. Utilization notes: A narrative or journal created by the database designer of the activities, issues, and decisions made during step 2, Utilization
Example of Deliverables
Figure 11-10 shows the Rationalized physical data model.
Figure 11-10. Rationalized physical data model. Changes made to the model: (1) the link between the Product and the Warehouse record types was eliminated as unnecessary; (2) the Product Code record type was eliminated as unnecessary.
Further Reading
A detailed look at a number of topics in the chapter is outside the scope of this book.
Some of the following material should help those who want to investigate these subjects
further.
Structured English
Gane, Chris and Trish Sarson, Structured Systems Analysis: Tools and Techniques.
Englewood Cliffs, NJ: Prentice-Hall, 1978. This book focuses on data flow diagramming,
but it has an excellent section on structured English. This book is out of print, but used
copies are available.
Data Flow Diagramming
DeMarco, Tom, Structured Analysis and Systems Specification. Englewood Cliffs, NJ:
Prentice-Hall, 1979. Closely linked with Ed Yourdon and Larry Constantine’s structured
approach, the book is currently out of print, but used copies are available.
Gane, Chris and Trish Sarson, Structured Systems Analysis: Tools and Techniques.
Englewood Cliffs, NJ: Prentice-Hall, 1978. Also Yourdon alumni, their approach is almost
identical to DeMarco’s technique, although their diagramming conventions are easier
to use.
Hathaway, Tom and Angela Hathaway. Data Flow Diagramming by Example: Process
Modeling Techniques for Requirements Elicitation. Kindle Edition, 2015.
Flow Charts
IBM, Flowcharting Techniques, IBM Corporation, White Plains, NY, 1969. The granddaddy
of them all, this manual can still be found online. More modern interpretations of flow
charting appear in almost every system development textbook.
Pseudocode
Bailey, Therold and Kris Lundgaard, Program Design with Pseudocode. Belmont, CA:
Brooks/Cole Publishing Co., 1989.
Farrell, Joyce, Programming Logic and Design. Boston, MA: Course Technology, 2013.
Structure Charts
Dennis, Alan, Barbara Haley Wixom, and Roberta M. Roth, Systems Analysis and Design,
6th Edition. New York: John Wiley & Sons Inc., 2014.
Martin, James and Carma McClure, Diagramming Techniques for Analysts and
Programmers. Englewood Cliffs, NJ: Prentice-Hall Inc., 2000. This book deals with all the
techniques presented in this chapter, so it is a good starting point for the novice, although
sometimes at too high a level for the more experienced.
CHAPTER 12
Formalization: Creating a
Schema
The schema is…a mere product of the imagination.
—Immanuel Kant
The first draft of anything is s*#t.
—Ernest Hemingway
Step 3, Formalization, is, unfortunately, the point at which database design starts for
many designers (see Table 12-1). The first thing they do is dig out the vendor’s DBMS
manual and start coding. For the more enlightened, this is the third physical database
design step—where the Rationalized physical data model meets the DBMS that will be
used in its implementation.
Table 12-1. Step 3: Formalization

Sources
•• 2.1: Rationalized physical data model (diagram)
•• 2.2: Updated physical data model definitions (data dictionary)
•• 2.3: Usage scenarios
•• 2.4: Usage maps
•• 2.5: Combined usage map
•• 2.6: Utilization notes
•• 1.3: Transformation notes
•• DBMS features and constraints

Procedures
•• Task 3.1: Environment Designation: Identify/confirm the target information manager (architecture, product, version)
•• Task 3.2: Constraint Compliance
•• Activity 3.2.1: Map rationalized physical data model to the data architecture
•• Activity 3.2.2: Create a DBMS product/version-specific functional physical database design

Deliverables
•• 3.1: Functional database design (diagram)
•• 3.2: Functional schema data definition language
•• 3.3: Functional subschema data definition language
•• 3.4: Functional database object definitions (data dictionary)
•• 3.5: Formalization notes
Formalization consists of two tasks. The first identifies or confirms the information
manager (architecture, product, and version) that should/will be used to build the
database. The second modifies the Rationalized physical data model to conform to the
selected file manager or DBMS.
Task 3.1: Environment Designation
What database architecture is best for your application? This is the first question that
needs to be answered, even if the choice of a database architecture is out of the database
designer’s hands.
Not so long ago, this task would involve DBMS shopping—figuring out the kind
of DBMS the enterprise should acquire to build the desired applications. Nowadays,
the enterprise probably already owns a DBMS, or even more than one, so the appetite
to acquire another is minimal. In this case, the database designer will have to live with
what the company has. Nonetheless, it makes sense to undertake this task anyway for
two important reasons. First, after examining how the database is intended to be used
(Chapter 11), the database designer might conclude that it is a major mistake to use the
company’s current DBMS. For example, if the organization has a relational DBMS but the
new database application must store and retrieve video and music files, then the database
designer might conclude that a DBMS based on a different architectural approach would
be a wise purchase. The only way to find out is to compare the proposed usage with the
features of the current DBMS.
The second reason to investigate the ideal architectural approach is to document
what would work best for the application even if using the ideal database management
system is not feasible. This will be particularly useful down the road if the current
application/DBMS mix turns out to be a turkey.
It might be useful for the database designer to map the strengths and weaknesses of
the various database architectures against the organization’s information management
needs in a chart similar to Table 12-2. Unlike Table 12-2 (which is a generic chart for
illustrative purposes only), the left column in the database designer’s chart should list the
information manager characteristics most important to the application.
Table 12-2. Functional Comparison of Various Architectural Approaches (illustrative). The table rates the hierarchical, network, relational, inverted file, object, multi-modal, star, and NoSQL (key-value, document, and graph) approaches from X (low) to XXX (high) against characteristics such as atomicity, consistency, isolation, durability, OLTP, high availability, static data volume, throughput, data warehousing, structural complexity, financial/spreadsheet processing, query, complex data types, batch processing, and OLAP. Note: Hierarchical results are for IBM's IMS; network results are for CA Technologies' IDMS.
As straightforward as this seems, there are some annoying wrinkles. First, two DBMS
products sharing the same architectural approach do not necessarily have the same
strengths and weaknesses. For example, because relational systems have been around for
more than 40 years, their implementations can vary greatly, with some vendors stressing
one feature, while others stress a totally different one.
Second, many of today’s product offerings do not comply with a single architectural
approach but rather with multiple approaches. This is particularly true as products age
and new architectural approaches are developed. IDMS started out as a network DBMS
but added relational features when the market shifted. The same is true for the inverted
file products, which adopted many relational features in their later years. Oracle, the once
quintessential relational system, now includes variations with object-oriented as well as
NoSQL features.
Third, there are always new and expanded approaches, particularly in the NoSQL
ranks. It can be difficult to keep up with what is happening in this rapidly changing field.
Do your chart in pencil. There will be many changes and updates as your knowledge
of the various DBMS offerings increases and the functions the DBMS will need to support
are better understood. Even with its drawbacks, a comparison chart is a good place to
start the search for the ideal DBMS product for the current project.
Once the architecture is chosen, or more likely dictated by past purchases, the
designer must modify the rationalized physical data model to meet the requirements of
that approach. The good news is that some of this work might have already been done
when the physical data model was modified for normalization. If not, then it needs to be
done here.
There are few things you can say about all database management systems, but,
fortunately, here is what you can say:
• Many-to-many relationships: Most DBMS products do not support them, and the few that do tend to be niche players and not (yet) suitable for prime time. A junction record will be needed to mimic an m:n relationship (a sketch of this workaround appears after this list).
• Recursive relationships: The designer will be hard-pressed to find a DBMS supporting recursion. This is unfortunate because many programming languages do. Nonetheless, for at least the time being, a bill-of-material structure using a junction record will be needed instead.
• Associative, attributive, and S-type record types: The major DBMSs do not support associative, attributive, and S-type record types (although object-oriented systems support some of these features); rather, they need to be implemented as simple proper record types. Cardinality is supported, while modality is often not supported or supported only to a limited extent. Alternatives and workarounds, in the form of DDL or DML code, are sometimes available (for example, cascading delete).
Some missing features can be provided by triggers, procedural code, or application
code written either by database staff and stored within the database or by application
programmers as application code.
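To make the junction-record workaround concrete, the pseudo-DDL below sketches both cases mentioned in the list: a junction record mimicking an m:n relationship and a bill-of-material structure mimicking recursion. All table, column, and data type choices are hypothetical assumptions for this sketch:
-- m:n between Student and Course mimicked with a junction record
CREATE TABLE ENROLLMENT (
    STUDENT_NUMBER  CHAR(9)  NOT NULL REFERENCES STUDENT,
    COURSE_CODE     CHAR(6)  NOT NULL REFERENCES COURSE,
    PRIMARY KEY (STUDENT_NUMBER, COURSE_CODE)
);

-- Recursive "assembly contains component" mimicked with a bill-of-material junction record
CREATE TABLE PART_STRUCTURE (
    ASSEMBLY_PART_NUMBER   CHAR(10)  NOT NULL REFERENCES PART,
    COMPONENT_PART_NUMBER  CHAR(10)  NOT NULL REFERENCES PART,
    QUANTITY_PER_ASSEMBLY  INTEGER,
    PRIMARY KEY (ASSEMBLY_PART_NUMBER, COMPONENT_PART_NUMBER)
);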
Hierarchical Systems
In this day and age, the term hierarchical DBMS usually refers to IMS or its Fast
Path variations. IMS and particularly Fast Path have a number of restrictions and
idiosyncrasies that involve a considerable amount of physical data model morphing to
accommodate the DBMS architecture. Just getting the database language correct will
require some work as records become segments and views become logical database
descriptions.
The Rationalized physical data model must be restructured into hierarchical trees,
and many-to-many structures must be morphed into the IMS logical database structure.
The hierarchical characteristics also show up in the NoSQL ranks, particularly with
XML-based products.
Network Systems
The network model, more than likely IDMS, also requires some language adjustment, but
not as radical as for IMS. However, the network database structure can be easily derived
from the physical data model, perhaps more so than any other architecture. Designing an
IDMS database from a Rationalized physical data model is relatively easy. The additional
ten IQ points to successfully navigate the network model (mentioned in Chapter 8) are
needed by the application programmer and not the database designer.
Relational Systems
When it comes to a unique DBMS language, the relational model takes the prize. The
good news is that most of the arcane words it uses have either already entered the
database language sphere or have been dropped in favor of more common terminology
and thus no longer cause the confusion they did a few decades ago.
Keys are a major issue with relational systems, and there are many keys in the
relational model (primary, candidate, super, foreign, compound, alternate, natural,
composite, simple, etc.). The key landscape varies by the relational product and
sometimes, more insidiously, by the product version. Add to that how the keys are used,
and the complexity explodes. (Not every RDBMS requires foreign keys or even primary
keys. Some do not even support them.)
Keys aren’t the only issue. Groups have to go (both multivalue and group data items).
The good news is that most relational products support a few common groups, such as
DATE. The bad news is that they rarely support them in the same way.
Nonstandard data types are the Achilles’ heel of relational systems. The designer
must examine the actual RDBMS selected, and its version, to learn how it handles
documents, pictures, videos, and so on, if it does at all. Even text is not supported in any
consistent way.
Object-Oriented
The original object-oriented systems were unique and proprietary in design and structure,
but they have largely been supplanted by relational systems that morphed into object-relational hybrids. Some OODBMSs have hierarchical characteristics, mixed with inverted
file characteristics, mixed with relational ones. The database designer needs to be aware
that for the OODBMS that started out as a RDBMS, most features and constraints will be
similar to those of its original data architecture rather than its adopted one.
NoSQL
Unfortunately, NoSQL is a grab bag of DBMS implementations. Many, in spite of their
name, have relational-like syntax and follow relational-like rules, even if their internal
structure is completely different. Others have an object type feel about them and can
be mistaken for an OO database. A third group of NoSQL implementations follow a
particular computer language and are structured as language extensions. Java is a
common DBMS substrate. Lastly, some DBMSs, such as many key-value approaches, see
the database as a set of key and nonkey pairs that function as pointers to data residing in
a different part of the system.
The smartest way to think of NoSQL is to not think of it at all, but rather of its
underlying structure, such as key-value or document.
DBMS Product and Version Selection
OK, now it has been decided or dictated which database architecture you will be using.
However, there are still decisions concerning the DBMS product and version to use.
Vendors are clever. In their quest to attract and keep customers, they provide certain
enticements, freebies, or enhancements with their products. The first “enhancement,” and
the one you wind up paying for whether you use it or not, is the embedded DBMS. If you buy
certain applications or systems software, a DBMS comes as part of the system to manage
the data. If you are an XYZ DBMS customer but purchase an application or some system
software from ABC corporation, you still might have ABC’s DBMS automatically installed.
Some IT shops have a formal “don’t use the embedded DBMS” policy; others
don’t. Should you use it? Depends. ABC might offer a few features that are critical to the
business, while XYZ, the product you are using, doesn’t.
That brings us to the second vendor enticement: extensions. There is an old vendor
saying, “Standards attract customers; extensions keep them.” Vendors tout their ISO
compliance to get new customers. No company wants to buy some unfamiliar product that
the organization will be locked into for the foreseeable future. If you buy a standard version
of COBOL or a standard version of a DBMS, you buy two benefits. First, to stay compliant,
the vendor must update its product offering with new standards body approved features.
Second, it makes it easier to move from one product to another. Don’t like the ABC product?
It’s easier to move to the XYZ product if both comply with the same standard.
However, once a vendor attracts new customers by offering a standards-compliant
product, the vendor wants to keep them. That’s where extensions come in. Offer
customers new goodies that are not in the standard, and if the customers use them, then
they are locked into the product—or at least it’s considerably more expensive to leave.
Relational systems extensions include nonstandard data types, group items, storage
considerations such as indices and clustering, and even keys.
Extensions can be double-edged swords for vendors as well as their customers. Many
a vendor has offered an extension containing a new feature only to have a standards body
subsequently develop that feature as a new standard with characteristics that are at odds
with the vendor’s implementation. The vendor is then forced to invest in creating a new
feature that provides no new capability for its product and that makes it easier for its
customers to leave.
You might have to select a DBMS or a DBMS version based on its nonstandard
features. If you must, you must. But if you have a choice, be very wary of extensions. They
sometimes have a heavy price down the road.
Once the database architecture is identified or confirmed, then the task is to select or
confirm the target product. A good way to start is to go back to the Architectural Approach
Comparison Chart and make it a Product/Version Comparison Chart by substituting a
product and version under consideration for each column. Why version? In most cases,
the current version will be the one used. However, your IT shop might not be working on
the current version of the product or the vendor might have a beta version with features
you need. In either case, create a column for each product version under consideration.
With a little bit of work and/or by simply following what is dictated for your
organization, you now have the target environment consisting of the data architecture,
product, and product version you will be using.
Task 3.2: Constraint Compliance
In the constraint compliance task, the designer creates the first-cut database schema.
Some might think it rather strange to use the word constraint in the task title. The choice
is intentional. Data modeling, both logical and physical, deals with specifying what is
wanted. Schema design shows you what you can realistically have. The data model you
built must now comply with the rules of the selected information manager.
Constraint compliance is divided into two activities. The first, Activity 3.2.1: Map
rationalized physical data model to the data architecture, converts the Rationalized physical
data model to a data architecture–specific, although otherwise generic, database schema. The
second, Activity 3.2.2: Create a DBMS product/version-specific functional physical database
design, transforms the generic schema into a vendor/product/version-specific schema.
Why two activities? Why first convert the rationalized physical data model into a
generic schema? The same reason you created a logical data model before you created
the physical data model: you need to ensure that when the vendor’s DBMS product
changes, and it will, you understand what was wanted in the first place rather than what
you had to settle for.
Pseudocode…Again
When discussing process modeling, one of the techniques mentioned was pseudocode,
which uses a mixture of English and phony computer code to describe what the system
should do. Some designers and programmers find it very useful, others not so much. The
same can be said for schema definition; a pseudocode or pseudo-DDL or pseudo-DML
approach is sometimes useful for describing the database structure without getting into the
restrictions and idiosyncrasies of a particular DBMS. Pseudocode can be useful when trying
to make the description of what is wanted independent of product or version limitations.
For example, assume that your current DBMS has a 512-byte limit on the text field
length and a 16-character limit on field names. This is in contrast to the Rationalized
physical data model data item PRODUCT DESCRIPTION, which is a text field that can be
1,000 characters long. It is much more useful to pass the DBA the pseudocode,
shown here:
PRODUCT DESCRIPTION CHAR (1000)
than the more accurate but less descriptive example shown here:
PROD_DESCRIPT_1 CHAR (500)
PROD_DESCRIPT_2 CHAR (500)
A second advantage of pseudo-DDL is in version preparation. Imagine that the
DBMS vendor comes out with a new version of the product that now supports a text field
length of 1,024 bytes and 24-character field names. Without the pseudocode, how is the
DBA to know that the original intention was to have a single product description field and
not two separate fields? A significant advantage for the end user or the programmer could
be lost because the DBA does not know the database designer’s original intention.
A third benefit of pseudocode, although not ranking with the first two advantages,
is nonetheless just as real. Some database designers are more experienced than some
DBAs. It is not uncommon to see junior staff tasked with preparing a new DBMS version
for installation. Their closeness to the new release gives them a front-row seat for
understanding updates to the DDL, DML, and even application code supported by the
new software.
Database designers, on the other hand, are often the more experienced employees
who cut their teeth on earlier versions of the DBMS. Their DBMS knowledge might be
deep, even if their familiarity with the syntax of the latest DBMS version is weak. Using
pseudocode allows these more experienced designers to concentrate on what the system
needs to do using a pseudocode that might contain syntax from an earlier DBMS version.
When the more senior designers have completed their tasks, the more junior staff can
then focus on aligning their seniors’ pseudocode with the new version’s syntax.
While pseudocode is useful for designing a schema for any DBMS, it is even more
critical and helpful for relational database management systems. As discussed earlier,
there are dozens of RDBMSs, almost all of which use SQL and follow ISO standards but
that are, nonetheless, different—sometimes significantly different. And the differences are
not just from product to product but also from version to version.
For most new software products, there is a period of version frenzy right after
product introduction that cools off over time. In the first 24 months after product
introduction, three or four new versions may be provided to correct errors and to improve
performance. The next 36 months see a flurry of new features. At about year 5, things start
to slow down, with a new version coming out every 18 months or longer. The relational
vendors were not so lucky.
RDBMS vendors have been under pressure from all sides to make changes to their
product offerings. There is pressure to make their RDBMS more like the relational model
(remember the 333 Codd rules); there is pressure to expand beyond the 333 rules, adding
features to handle things such as group data items; there is pressure to make the DBMS
more OO-like; and now there is pressure to add some NoSQL features. This version-driven code instability means that a more stable pseudocode can go a long way to better
communicating designer intentions.
But what if the DBMS is changed and you created a SQL-like pseudo-schema
when the company decided to use a non-SQL DBMS? There’s no real problem because
the pseudocode, even SQL-like pseudocode, is sufficiently generic that most database
designers and DBAs can convert it into any legitimate DBMS DDL. It just might require
a bit more work on their part. However, the benefits of pseudocode outweigh any such
possible disadvantage.
Activity 3.2.1: Map Rationalized Physical Data Model to
the Data Architecture
In this activity, the rationalized physical data model objects become data
architecture–specific objects. It is the first time that the record type Customer or the
link Owns is made to conform to a particular DBMS. The approach is to examine the
rationalized physical data model, object by object, and make it conform to the features
and constraints of the proposed DBMS.
Record Types
During step 1, Transformation, you created four kinds of record types: proper, associative,
attributive, and S-type. Now you need to make these record types DBMS compliant.
Proper
Proper record types are supported by every DBMS. In fact, for almost every DBMS, the
definition of a record type is a proper record type.
Associative
An associative record type is a link with its own data items. Take the example of two
record types, Customer and Car, and the link Rents. Information about the rental, such as
the rental date and price, belongs to neither the Customer record nor the Car record but
rather to the rental agreement itself. You can test this by asking these questions: Can
a customer rent more than one car? Can a car be rented by more than one customer? Is
the rental agreement for a single customer and a single car? Because the answer to the
three questions is yes, yes, and no, then the rental information is itself a record linked to
the customer and car records. The database designer should create three record types,
Customer, Car, and Rental Agreement, with the Rental Agreement at the many end of two
one-to-many links.
Look at a second example. Keep the same Customer and Car records, but change
the Rents link to Purchases. Because a customer can buy many cars and a car can be
purchased by only one customer (at any one time), then the purchase information is
linked one-to-many with Customer but in a one-to-one link with Car. The one-to-one
linkage allows the purchase information to be stored in the Car record, making a separate
Purchases record unnecessary.
The way to integrate an associative record type into a DBMS that does not directly
support associativity is to look at the relationships between the proper record types.
Unless they are linked many-to-many, an associative record type’s data items can
often be merged with one of the proper records. If they are linked many-to-many, then
the associative record type becomes a database “proper” record type in a mandatory
relationship with its two partners.
Attributive
An attributive record’s existence depends on another record. Take the example of two
records, Customer and Customer Address. A Customer Address occurrence should
exist only if it is linked to a Customer occurrence. If the Customer is deleted, then all
associated Customer Address records should be deleted.
A number of DBMS products indirectly support attributive record types, although
none use the term attributive. Rather than implementing attribution as a characteristic
of the record type, they implement it as a characteristic of the link between the two
records. In some relational systems there is the foreign key option ON DELETE CASCADE
(sometimes called CASCADE ON DELETE), which tells the system that the child record
cannot exist without its associated parent record.
CREATE TABLE ADDRESS (
STREET CHARACTER VARYING(20),
TOWN CHARACTER VARYING(20),
CUST_NO CHARACTER (10),
FOREIGN KEY(CUST_NO)
REFERENCES CUSTOMER (CUST_NUMB) ON DELETE CASCADE
);
Network-based systems have a similar feature implemented as part of the set-membership definition using the RETENTION (also called DISCONNECTION) option.
SET NAME IS CUST_ADDRESS
OWNER IS CUSTOMER
MEMBER IS CUSTOMER_ADDRSS
RETENTION IS FIXED
IMS has a simple solution to the problem. When a parent record (segment) is
deleted, all its children records (dependent segments) are automatically deleted. In IMS,
if you delete the root segment (the very top of the database tree), the system automatically
deletes every bit of data in your database.
S-Type
An S-type (supertype/subtype, also called generalization/specialization) exists where
a proper record type includes different roles containing different role-specific data
and/or links. For example, take the record type Customer. A store might have two
different types of customers, retail and wholesale. Both retail and wholesale customers
have a number of data items in common (CUSTOMER NAME, CUSTOMER NUMBER,
PHONE NUMBER, ADDRESS, etc.) and a number of data items unique to their role
(DISCOUNT AGREEMENT, SHIPPING INSTRUCTIONS, CREDIT STATUS, INDUSTRY
CODE, LOYALTY PROGRAM MEMBERSHIP NUMBER, etc.). The supertype contains
the common data and links for both types of customers, while the subtype contains their
unique role information.
Object-oriented database management systems support S-types. In fact, the
supertype/subtype distinction is a fundamental feature of object technology—children
objects inherit properties, including data and processes, from their parent.
Non-object-oriented systems usually do not support S-types, so the database
designer must decide how to handle them. One solution is to create three proper record
types with a one-to-many link between the common attributed parent and the two
(or more) role-specific children.
A second solution is to have a single customer record type containing all the fields
used by both types of customer. If the customer is a retail customer, then the wholesale
data items are left blank. The same is true for a wholesale customer—any retail data items
are blank. This approach assumes the designer does not have a problem with blank, or
null, fields.
There is a simpler, third solution if the designer is comfortable that customers will
never change roles, i.e., a wholesale customer will never become a retail customer or
the reverse. If the type of customer is unchanging, then the designer can simply have
two record types, one for wholesale customer and one for retail customer with the data
item names adjusted to avoid confusion (not CUSTOMER NAME but WHOLESALE
CUSTOMER NAME and RETAIL CUSTOMER NAME, etc.).
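A sketch of the first solution in generic SQL: the supertype holds the common data while each role-specific subtype carries only its unique data items and shares (and references) the parent's key. The data item names and lengths here are illustrative:

CREATE TABLE CUSTOMER (
CUSTOMER_NUMBER CHAR(8) NOT NULL PRIMARY KEY,
CUSTOMER_NAME CHAR(30),
PHONE_NUMBER CHAR(15)
);
-- role-specific subtypes carry only their unique data items
CREATE TABLE WHOLESALE_CUSTOMER (
CUSTOMER_NUMBER CHAR(8) NOT NULL PRIMARY KEY,
DISCOUNT_AGREEMENT CHAR(20),
INDUSTRY_CODE CHAR(4),
FOREIGN KEY (CUSTOMER_NUMBER) REFERENCES CUSTOMER (CUSTOMER_NUMBER) ON DELETE CASCADE
);
CREATE TABLE RETAIL_CUSTOMER (
CUSTOMER_NUMBER CHAR(8) NOT NULL PRIMARY KEY,
CREDIT_STATUS CHAR(10),
LOYALTY_PROGRAM_MEMBERSHIP_NUMBER CHAR(12),
FOREIGN KEY (CUSTOMER_NUMBER) REFERENCES CUSTOMER (CUSTOMER_NUMBER) ON DELETE CASCADE
);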
Links
Linkages become the stickiest part of Formalization because how they are implemented
varies far more from DBMS to DBMS than record types or data items.
Membership Class: Cardinality
There are three types of cardinality: one-to-one, one-to-many, and many-to-many.
One-to-One
The one-to-one linkage is rarely directly supported, but the workaround is both
conceptually simple and easy. The first solution is to make the link one-to-many and
simply ignore that there will never be more than one child per parent. The second
solution is even simpler and easier and is the one used in 99 percent of the cases.
Combine all of the data items of the two record types in a single record type. The
combined record type is more efficient (indexes, storage location, etc.) and easier for the
programmer to deal with than the one-to-many approach.
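A sketch of the second (combined) solution, using the Customer–Car Purchases example from the previous section; the column names are illustrative:

CREATE TABLE CAR (
VIN CHAR(17) NOT NULL PRIMARY KEY,
MODEL CHAR(20),
-- purchase data items absorbed into Car because the link is one-to-one
PURCHASE_DATE DATE,
PURCHASE_PRICE DECIMAL(9,2),
CUST_NO CHAR(10),
FOREIGN KEY (CUST_NO) REFERENCES CUSTOMER (CUST_NO)
);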
One-to-Many
This is the garden-variety link supported by virtually every database management system.
It is the direct descendant of the parent-child relationship of the punched card era. It is so
fundamental that the network model is based on it. Because it is the default condition in
most every DBMS, the database designer must do little to implement it.
Many-to-Many
Few DBMSs support native many-to-many linkages. The almost universal solution for
handling this link is with two one-to-many links with two (or more) parents sharing a
common child. The child record, called a junction record, allows the structure to mimic a
many-to-many linkage.
For example, the relational model does not support many-to-many (m:n)
relationships. Embedded foreign keys make many-to-many links impossible, so they
must be “resolved.”
If two record types are in a many-to-many relationship, you resolve the relationship
by creating a third record type, traditionally called a junction record, that is at the “many”
end of two one-to-many relationships between the two original record types. Figure 12-1
shows how the Employee-Department many-to-many relationship is resolved into the
Employee-Employee/Department Junction-Department relationships.
Figure 12-1. “Resolving” a many-to-many link
Membership Class: Modality
Modality indicates whether a record type must participate in a link.
Mandatory
Mandatory links are enforced in the DBMS using its DDL linkage constraints. Relational
systems use foreign keys to enforce modality with the NOT NULL clause.
CREATE TABLE ADDRESS (
STREET CHARACTER VARYING (20),
TOWN CHARACTER VARYING (20),
CUST_NO CHARACTER (10) NOT NULL,
FOREIGN KEY (CUST_NO)
REFERENCES CUSTOMER (CUST_NUMB)
);
This ensures that a child record cannot exist without its parent.
Network systems use their set membership construct, the insertion clause, to enforce
a mandatory link, as follows:
SET NAME IS CUST_ADDRESS
OWNER IS CUSTOMER
MEMBER IS CUSTOMER_ADDRSS
INSERTION IS AUTOMATIC
Optional
Optional links are the easiest of all. For most database management systems, optional
is the default. The designer does not have to do anything specific to create a modality of
optional.
Degree
Degree indicates the number of record types that can participate in a single link.
Unary
A recursive or unary link exists when an occurrence of record type A is linked to other
occurrences of record type A. Take the example of the record type Employee and the link
Reports To. You can have Smith reporting to Jones, who reports to Williams, where Smith,
Jones, and Williams are all Employee occurrences.
The approach taken by virtually every DBMS to support a unary or recursive
relationship is through the bill-of-materials structure. As with many-to-many links,
a junction record supplies the magic that makes this linkage work. In the example, in
addition to the record type Employee, the designer creates the record type Employee
BOM. Then two one-to-many links are created between Employee and Employee BOM
(Figure 12-2). One link is Supervises, and the other is Supervised By. This allows the
database to cruise from one occurrence of Employee to the next, going one level higher or
one level lower with each new pass.
Figure 12-2. “Resolving” a recursive link
The bill-of-materials structure is a common way to handle recursion whether the
DBMS is hierarchical, relational, object-oriented, network, or NoSQL.
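A sketch of the bill-of-materials structure in generic SQL; each Employee BOM occurrence records one supervises/supervised-by pairing (the names are illustrative):

CREATE TABLE EMPLOYEE (
EMP_ID CHAR(5) NOT NULL PRIMARY KEY,
EMP_NAME CHAR(30)
);
-- the junction record: one row per "supervisor supervises subordinate" pairing
CREATE TABLE EMPLOYEE_BOM (
SUPERVISOR_ID CHAR(5) NOT NULL,
SUBORDINATE_ID CHAR(5) NOT NULL,
PRIMARY KEY (SUPERVISOR_ID, SUBORDINATE_ID),
FOREIGN KEY (SUPERVISOR_ID) REFERENCES EMPLOYEE (EMP_ID),
FOREIGN KEY (SUBORDINATE_ID) REFERENCES EMPLOYEE (EMP_ID)
);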
Binary
Binary linkages are the staid and standard linkage between record types. They are
supported by every DBMS and are often the only link supported by the DBMS.
N-ary
An n-ary link exists when one link connects three or more record types. Most database
management systems do not handle n-ary linkages very well. The standard DBMS is
designed to handle linkages that are binary, optional, and one-to-many. Any divergence
requires a special workaround. For many-to-many links, there is the junction record,
and for a bill-of-materials link, there is the bill-of-materials junction record. What do you
think is going to happen with n-ary links?
Look at the n-ary relationship among Employee, Client, and Project, where an
employee can work on one or more projects for one or more clients. The solution is a
junction record that is at the many end and links to the three proper record types. In
relational parlance, it looks like this:
CREATE TABLE EMPLOYEES (
EMP_ID CHAR(5),
NAME CHAR(20),
PRIMARY KEY(EMP_ID)
);
CREATE TABLE PROJECTS (
PROJECT_NAME CHAR(12),
BUDGET DECIMAL(8,2),
PRIMARY KEY(PROJECT_NAME)
);
CREATE TABLE CLIENT (
CLIENT_NAME CHAR(20),
ADDRESS CHAR(40),
PRIMARY KEY(CLIENT_NAME)
);
CREATE TABLE ASSIGNED_TO (
EMP CHAR(5),
PROJ CHAR(12),
CLNT CHAR(20),
PRIMARY KEY(EMP,PROJ,CLNT),
FOREIGN KEY(EMP) REFERENCES EMPLOYEES(EMP_ID),
FOREIGN KEY(PROJ) REFERENCES PROJECTS(PROJECT_NAME),
FOREIGN KEY(CLNT) REFERENCES CLIENT (CLIENT_NAME)
);
Many NoSQL systems, particularly the column-based ones, store all information
in an n-ary fashion as their default. In the example, everything about the employee,
including the projects he worked on and the clients he worked for, is stored under his
Employee record occurrence. Likewise, the Project instance includes all the employees
working on the project as well as the clients for which it was done. NoSQL systems are not
shy about duplication, and column-based systems not only allow duplication, but they
count on it.
Constraints
Linkage constraints are a problematic area for DBMS implementation. No DBMS does a
great job, some do an OK job, and many, unfortunately, fail completely.
Inclusion
The inclusion constraint exists if an occurrence of record type A can be linked to an
occurrence of record type B, to an occurrence of record type C, or to both record types B
and C. It is the standard condition between three record types (or more) and two links
(or more). This is the default case; no special symbols or graphics are needed, and no
special action is required by the database designer.
Exclusion
Exclusion is a little trickier than inclusion. It says that an occurrence of record type A
can be linked to an occurrence of record type B or to record type C, but not both at the
same time. Take the example of Customer, Dealer, and Car and the link Owns. A car can
be owned by a dealer or it can be owned by a customer but not both—at least not at the
same time. Owns is either-or, not both.
Conjunction
Conjunction says that if an occurrence of record type A is linked to an occurrence of
record type B, then it must also be linked to an occurrence of record type C.
Both exclusion and conjunction are particularly problematic because they deal with
not one but multiple links.
As mentioned in Chapter 3, there are two types of conjunction. Simple conjunction
says that given three record types, A, B, and C, and two links, A to B and A to C, every
occurrence of A must be linked to an occurrence of B and to an occurrence of C. Simple
conjunction can be handled by making the modality of both links mandatory-mandatory.
Conditional conjunction states that given three record types, A, B, and C, and two
links, one between A and B and one between A and C, if an occurrence of A is linked
to an occurrence of B, then it must also be linked to an occurrence of C. Conditional
conjunction cannot be implemented through membership class like simple conjunction
can be.
Ideally, the DBMS should accommodate exclusion and conditional conjunction
through its DDL although only a smattering of help is available here. The problem is
that the average DDL can deal with only one link at a time. Handling multiple links
simultaneously, when the status of one can affect the status of another, is beyond their
scope. Uber-links, such as exclusion and conditional conjunction, defy most DBMS
architectures and product offerings.
Failing a DDL accommodation, the DBMS should at least allow the designer a
workaround using its DML, triggers, or stored procedures. Object-oriented DBMSs
can accommodate exclusion and conjunction, to an extent, though even they fail to
do a complete job. In too many cases, enforcement of exclusion and conjunction is,
unfortunately, left to the application programmer if it is enforced at all.
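Where the application cannot be trusted to do the policing, one common relational workaround for exclusion is to carry both foreign keys in the shared record and let a table-level CHECK constraint forbid their simultaneous use; conditional conjunction can be approximated the same way (if one key is present, the other must be too). A minimal sketch of the Customer/Dealer/Car exclusion, with illustrative names:

CREATE TABLE CAR (
VIN CHAR(17) NOT NULL PRIMARY KEY,
CUST_NO CHAR(10),
DEALER_NO CHAR(10),
FOREIGN KEY (CUST_NO) REFERENCES CUSTOMER (CUST_NO),
FOREIGN KEY (DEALER_NO) REFERENCES DEALER (DEALER_NO),
-- exclusion: a car may be owned by a customer or a dealer, but never both at once
CHECK (CUST_NO IS NULL OR DEALER_NO IS NULL)
);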
Data Items
On the surface, data items are the easiest to formalize, but there are data architecture and
DBMS-specific undercurrents that can make the task a challenge.
Domains
While both group and multivalue data items have a long IT history, domains do not. It
was not until the late 1970s and early 1980s that domains were even included in, much
less required by, some newer programming languages. DBMS vendors did not start
incorporating them until a few years later. Even today, many DBMS implementations
allow domain declarations but do not require them. However, domains are a useful
way to help keep database data accurate and useful. Their use is encouraged. Database
designers should include domain information if the DBMS allows.
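Where the DBMS allows it, a domain is declared once and then used wherever the data item appears. ISO SQL defines a CREATE DOMAIN statement, although product support varies (some products substitute user-defined types or column CHECK constraints). A minimal sketch, with illustrative names:

CREATE DOMAIN CATEGORY_DOM AS INTEGER
DEFAULT 1
CHECK (VALUE IN (1, 2, 3));

CREATE TABLE MANUFACTURER (
MANUF_ID CHAR(6) NOT NULL PRIMARY KEY,
MANUF_CATEGORY CATEGORY_DOM
);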
Source: Primitive and Derived
As you should remember from logical data modeling, there are two types of data source:
primitive and derived.
Primitive Data Items
As stated in logical data modeling, a primitive data item is a single or lowest-level fact
about a record. Primitive data are the bread and butter of a database. The database
designer need only ensure that all primitive data have a home in the database schema.
Derived Data Items
Derived data are data that can be calculated from one or more primitive or derived data
items. For example, a database does not have to store the data item EMPLOYEE AGE if
it can access CURRENT DATE and EMPLOYEE DATE OF BIRTH. Age can be calculated
from the primitive data in the database.
Derived data should never be placed on an E-R diagram. Whether derived data
should be in the database is a performance question and should probably be left to
Chapter 13, which deals with database efficiency. In the meantime, derived data should
be documented but not be part of the current database design.
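For example, rather than storing EMPLOYEE AGE, the designer can document it as derived and, if it is needed later, expose it through a view computed from the primitive data. A sketch in generic SQL (the date arithmetic is simplified and its exact syntax varies by product):

CREATE TABLE EMPLOYEE (
EMP_ID CHAR(5) NOT NULL PRIMARY KEY,
EMP_NAME CHAR(30),
EMPLOYEE_DATE_OF_BIRTH DATE NOT NULL
);
-- derived, not stored: age is calculated at query time from primitive data
CREATE VIEW EMPLOYEE_WITH_AGE (EMP_ID, EMP_NAME, EMPLOYEE_AGE) AS
SELECT EMP_ID, EMP_NAME,
EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM EMPLOYEE_DATE_OF_BIRTH)
FROM EMPLOYEE;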
Complexity: Simple and Group
Data item complexity is a term that refers to the intricacy of a data item. There are two
types of data item complexity, simple and group.
A simple data item, also called an atomic data item, does not contain any other data
items. For the database designer its place in the database design is as straightforward as
it gets.
A group data item, also called an aggregate data item, contains a fixed number of
other data items. An example would be the group data item ADDRESS, which contains
the following simple data items: STREET NUMBER, STREET NAME, TOWN, STATE/
PROVINCE, POSTAL CODE, and COUNTRY.
ONE PERSON’S SIMPLE IS ANOTHER PERSON’S
GROUP—IT’S ALL IN THE CONTEXT
The complexity of a data item can be context sensitive. For example, for most of
us, COLOR is a simple or atomic data item because it cannot be broken down into
constituent parts. However, for printers and graphic artists, COLOR might contain
the data items MAGENTA, YELLOW, and CYAN, the three primary subtractive colors
that make up all other colors in printing. As was true in logical data modeling, the
designer needs to be sensitive to the context in which the data exists.
Most nonrelational DBMSs support some type of aggregation, virtually every
programming language supports group data items, and almost every relational product
supports aggregation (for example, DATE) to a limited extent. Unfortunately, how the
DBMS supports groups is not always straightforward, requiring the database designer to
delve into the DBMS product manuals.
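In a relational schema, the usual approach is simply to flatten the group into its constituent columns; a few products also offer composite types (PostgreSQL's CREATE TYPE, for example), but support is far from universal. A flattened sketch of the ADDRESS group, with illustrative lengths:

CREATE TABLE CUSTOMER (
CUST_NO CHAR(10) NOT NULL PRIMARY KEY,
-- the ADDRESS group data item flattened into its simple data items
STREET_NUMBER CHAR(10),
STREET_NAME CHAR(30),
TOWN CHAR(30),
STATE_PROVINCE CHAR(20),
POSTAL_CODE CHAR(10),
COUNTRY CHAR(20)
);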
Valuation: Single Value and Multivalue
Data item valuation describes how many values the data item can have. There are two
types of valuation, single value and multivalue. A single-value data item can have only one
value at a time. An example would be COLOR = “blue.” If COLOR is “blue,” then it cannot
be “red,” at least not at the same time.
A multivalue data item can contain a fixed or variable number of values. Examples
include cases where the subject can contain more than one color (COLOR = “blue, red”)
or DAY OF THE WEEK contains seven values, “Mon, Tue, Wed, Thu, Fri, Sat, Sun.”
This type of data item has various other names, such as repeating group and,
unfortunately, group.
Most non-RDBMSs support multivalue data items. Even relational users can get
around this constraint fairly easily. For them, it is more of a question of the IT shop
standard than of programming difficulty.
Data item complexity and data item valuation are two sides of the same coin. Both
have a history going back before database management systems existed; both are part of
many, if not most, programming languages; and both are incredibly useful, which places
pressure on the database designer to accommodate them. The only real difference is that
the data item components in multivalue data items share a single data domain while
those in a data item group need not.
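For the relational case, the usual workaround is either a fixed set of columns (COLOR_1, COLOR_2, and so on) or, more flexibly, a dependent record type holding one value per occurrence. A sketch of the latter, with illustrative names:

CREATE TABLE PRODUCT_COLOR (
PRODUCT_NUMBER CHAR(8) NOT NULL,
COLOR CHAR(12) NOT NULL,
PRIMARY KEY (PRODUCT_NUMBER, COLOR),
FOREIGN KEY (PRODUCT_NUMBER) REFERENCES PRODUCT (PRODUCT_NUMBER) ON DELETE CASCADE
);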
A MILDLY USEFUL OBSERVATION
If your design is a vanilla DBMS default structure, meaning…
1. All records are proper record types.
2. All cardinality is one-to-many.
3. All modality is optional.
4. All links are binary.
5. All linkage constraints are inclusive.
6. All data items are primitive, simple, and single.
then go out and buy a lottery ticket. You are a very lucky designer.
Or you have missed some important features that need to be included in your
database design. A review might be called for.
Vanilla DBMS default structure databases exist about 2 percent of the time in the
business world and, unfortunately, about 70 percent of the time in IT shops, leading
to a severe business/IT disconnect.
You should now have a preliminary database schema that is data architecture
compatible, although not yet product or version specific. You have “resolved” the many-to-many linkages, created the necessary junction records, and have a pseudocode DDL.
The next step is to make the design conform to the vendor’s offerings.
The work product of Activity 3.2.1, Map rationalized physical data model to the data
architecture, is a generic (although data architecture–specific) physical database design.
The next activity will convert this generic design into a fully functional database design.
Activity 3.2.2: Create a DBMS Product/Version-Specific
Functional Physical Database Design
Even though you have made your database design data architecture compliant, you
still have only a pseudo-schema. One more activity gives you a workable (compilable)
database schema and a design that complies with a DBMS product and version.
Regardless of promises of standards compliance, every vendor has proprietary
features, legacies they need to support, and downright idiosyncrasies that often defy
explanation. These constraints need to be incorporated into the schema.
Product and version constraints are of two types: structural and syntactical. To
address structural constraints, the basic database objects (record types, links, data items,
etc.) need to be modified to work with the DBMS. Ideally, you have already made most
of these modifications in Activity 3.2.1, Map rationalized physical data model to the data
architecture. Remaining structural changes usually relate to storage limitations, such as
file or record type size.
Syntactical changes are more common and usually consist of vendor DDL and DML
language accommodations. For example, the ISO SQL:2011 standard data type DECIMAL
is not supported by Oracle, which uses NUMBER instead. Other syntactical changes
might include name length and what to substitute for spaces.
Whereas structural compliance has to do with making sometimes major changes to
database components, such as record type, links, etc., syntactical compliance deals more
with the words used to describe the schema while leaving their meaning unchanged.
Table 12-3 illustrates the changes needed to make a generic SQL schema Oracle
compliant.
Table 12-3. Preliminary Design DDL: Generic SQL Converted to Oracle

Generic SQL:

CREATE TABLE PRODUCT (
PRODUCT_NAME CHAR(30) NOT NULL,
PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY UNIQUE,
-- primary key assumes unique but both make the message plain
-- even if not a primary key, keep this field unique
PRODUCT_DESCRIPTION VARCHAR(512),
COST_BASIS DECIMAL(8,2) NOT NULL,
LIST_PRICE DECIMAL(8,2) NOT NULL,
CREATE INDEX PROD_NO_IDX ON PRODUCT (PRODUCT_NUMBER)
);

Changes needed for Oracle:

PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY,
/* can't use UNIQUE in PK statement */
PRODUCT_DESCRIPTION VARCHAR2(512),
/* LONG was the standard but was dropped; VARCHAR is being dropped in favor of VARCHAR2 */
COST_BASIS NUMBER(8,2) NOT NULL,
/* substitute NUMBER for DECIMAL */
LIST_PRICE NUMBER(8,2) NOT NULL
/* substitute NUMBER for DECIMAL */
/* Oracle automatically creates an index on PRIMARY KEY columns */

Generic SQL:

CREATE TABLE MANUFACTURER (
MANUF_NAME CHAR(30) NOT NULL,
MANUF_ID CHAR(6) NOT NULL PRIMARY KEY UNIQUE,
MANUF_CATEGORY INTEGER DEFAULT 1 CHECK (MANUF_CATEGORY IN (1, 2, 3)),
MANUF_NOTES VARCHAR(512),
ORDER_INSTRUCTIONS VARCHAR(512),
CREATE INDEX MANUF_ID_IDX ON MANUFACTURER (MANUF_ID)
);

Changes needed for Oracle:

MANUF_ID CHAR(6) NOT NULL PRIMARY KEY,
/* can't use UNIQUE in PK statement */
MANUF_CATEGORY NUMBER(1,0) NOT NULL DEFAULT 1 CHECK (MANUF_CATEGORY IN (1, 2, 3)),
/* Oracle does not support the INTEGER data type; NUMBER is used instead */
MANUF_NOTES VARCHAR2(512),
/* VARCHAR2 replaces VARCHAR */
ORDER_INSTRUCTIONS VARCHAR2(512)
/* Oracle automatically creates an index on PRIMARY KEY columns */

Generic SQL:

CREATE TABLE PROD_MANUF_JCT (
PRODUCT_NUMBER CHAR(8),
MANUF_ID CHAR(6),
PRIMARY KEY (PRODUCT_NUMBER, MANUF_ID),
FOREIGN KEY (PRODUCT_NUMBER) REFERENCES PRODUCT ON UPDATE CASCADE ON DELETE CASCADE,
FOREIGN KEY (MANUF_ID) REFERENCES MANUFACTURER ON UPDATE CASCADE ON DELETE CASCADE
);

Changes needed for Oracle:

FOREIGN KEY (PRODUCT_NUMBER) REFERENCES PRODUCT (PRODUCT_NUMBER) ON DELETE CASCADE,
FOREIGN KEY (MANUF_ID) REFERENCES MANUFACTURER (MANUF_ID) ON DELETE CASCADE
/* Oracle does not support the ON UPDATE CASCADE constraint; enforce it with triggers */
Versions change far more frequently for newer products than for older ones. Even
so, most database designers/DBAs must deal with three or four new DBMS versions
during the life of the average database. Keeping the original database design as generic
as possible will help the designer or DBA incorporate useful new version features. The
generic database design will tell the DBA the difference between what was wanted
and what was settled for. For example, earlier versions of Oracle did not support
more than one column per table that was larger than 255 characters. Without proper
documentation, the DBA would never know that the original desire was for 512-character
MANUF_NOTES and ORDER_INSTRUCTIONS fields.
At this point, the database designer has a complete functional database schema.
However, formalization is not yet complete. The application programmers will need
subschemas, derived from the database schema, to do their work.
Subschema Creation
You now (ideally) have a working database schema. But more is needed. Remember all
those usage scenarios? Well, they become the basis for the needed subschemas. In the
relational model, subschemas are views and, for the most part, straightforward.
Subschemas came into their own with the network or CODASYL database model
and were part of the original ANSI standard. Subschemas provide the application
programmer with only the subset of the data (record types, data items, or sets) needed
to do the job. Extraneous information, such as unneeded record types and links, is
excluded. Subschemas can impose security by limiting what can be seen and what can be
updated. Figure 12-3 is a diagram of a simple network schema and two subschemas.
Figure 12-3. Schema and subschemas
Views are the relational version of a subschema but with considerable differences.
A relational view is a single virtual table created from one or more base tables. Like the
network subschema, a view can consist of a subset of the data items in a record type, but
unlike a subschema, a view cannot contain links to other tables.
When the view includes more than one table, then the data items of the two tables
are joined into a single flat virtual file or table. Figure 12-4 shows a relational database
with two base tables and a view consisting of the data in the two tables.
Figure 12-4. Relational view
The virtual table is a flat file with the parent information replicated for each child. If the
parent record occurrence “Smith” has four children record occurrences, then “Smith” will
appear in the virtual table four times. Some designers use views to denormalize the database.
Figure 12-4 gives an example of the un-normalized flat file nature of the relational view.
There is just one problem: not all views are updateable. The rules vary from product
to product regarding whether a view is updateable. Some RDBMSs do not allow any view
to be updated, while others allow some views to be modified. No RDBMS vendor has
figured out how to update all views. A good rule of thumb is that single base table views
are probably updateable, but multiple base table views are probably not updateable;
however, you really need to check with your vendor.
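As a sketch, a single-base-table view such as the following is usually updateable, while a multiple-table join view (such as the CUSTOMER_ORDERS view in Table 12-4 later in this section) generally is not; the table and column names here are illustrative:

CREATE VIEW WHOLESALE_CUSTOMERS AS
SELECT CUST_NO, CUST_NAME, DISCOUNT_AGREEMENT
FROM CUSTOMER
WHERE CUSTOMER_TYPE = 'W';
-- one base table, no joins or aggregates: inserts and updates usually pass through to CUSTOMER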
NoSQL databases, particularly column and key-value systems, do not have views,
or, perhaps more correctly, all of their schemas are actually views. These systems bundle
data from multiple record types into a single object that looks more like a view than a
relational base table.
Whether it’s a non-relational subschema or a relational view, the usage scenario is
the best resource for creating them (Table 12-4).
Table 12-4. Relational View from a Usage Scenario

Usage Scenario:

Usage Scenario: 21  Name: Customer Orders
(1) Enter Customer.
(2) Find Account occurrence for associated Customer occurrence.
(3) Find Order occurrences for associated Account occurrence.

Relational View:

CREATE VIEW CUSTOMER_ORDERS
(CUSTOMER_NAME, ACCOUNT_NUMBER, ORDER_NUMBER, ORDER_DATE) AS
SELECT C.CUSTOMER_NAME, A.ACCOUNT_NO, O.ORDER_NO, O.ORDER_DATE
FROM CUSTOMER C, ACCOUNT A, ORDER O
WHERE C.CUSTOMER_NO = A.CUSTOMER_NO
AND A.ACCOUNT_NO = O.ACCOUNT_NO;
In most cases, there will not be a one-to-one relationship between usage scenarios
and subschemas/views. Rather, the database designer will find that a few well-defined
subschemas will usually handle many usage scenarios.
The subschema sections of the vendor’s manuals are usually chock-full of
subschema do’s and don’ts.
Formalization Notes
The last task in step 3, Formalization, is to complete the Formalization notes. As with the
other steps, the database designer needs to record the issues and decisions made during
Formalization (why, where, when, and results) so that future designers and DBAs can
perform their jobs properly informed with what was done to the database design and why.
Deliverables
Step 3, Formalization, should produce the following deliverables:
3.1 Functional database design diagram: A database diagram showing the record types and links (Figure 12-5 in the next section).
3.2 Functional schema DDL: Two versions should be created.
• Generic DDL conforming to the database architecture
• Vendor product and version specific
3.3 Functional subschemas DDL: Two versions should be created.
• Generic DDL conforming to the database architecture
• Vendor product and version specific
3.4 Functional database object definitions: The same physical definitions created in step 1, Transformation, updated with any necessary changes made during step 2, Utilization, now need updating with step 3, Formalization, information.
3.5 Formalization notes: A narrative or journal created by the database designer of the activities, issues, and decisions made during step 3, Formalization.
Example of Deliverables
Figure 12-5 shows the functional database design.
Figure 12-5. Functional database design. Changes made to the model:
1. Eliminated the supertype Customer and the two subtypes Retail and Wholesale, replacing them with the single record type Customer
2. Changed associative and attributive record types to proper record types
3. Created a Bill-of-Materials junction record for Product
In Chapter 13, the DDL is modified to improve the performance of the database.
CHAPTER 13
Customization: Enhancing Performance
It is a bad plan that admits no modifications.
—Publilius Syrus
An expert is a person who has made all the mistakes that can be made in
a very narrow field.
—Niels Bohr
For a number of crucial reasons, enhancing the performance of the database design is
done last. First, any performance considerations need to wait until the database design
is well understood. This ensures that critical functional components are completely
identified, included, and documented before any changes are made to the database
design. If you start modifying the design before properly documenting it, you risk losing
critical functional information. To avoid this confusion, do not mix functional database
design requirements with database performance considerations—this is the reason U3D
separates steps 3 and 4.
The second reason to separate functional and performance considerations is that
vendors change the database design syntax far more frequently for performance reasons
than for functional ones. Most DBMS software maintenance releases contain at least
some performance enhancements but might not contain any functional ones at all.
Keeping the more volatile performance enhancements separate from the more stable
functional ones improves the chances of providing performance improvements without
destroying functional necessities. See Table 13-1.
Table 13-1. Step 4: Customization

Sources:
• 3.1: Functional database design (diagram)
• 3.2: Functional schema DDL
• 3.3: Functional subschemas DDL
• 3.4: Functional database object definitions (data dictionary)
• 2.3: Usage scenarios
• 2.4: Usage maps
• 2.5: Combined usage map
• 1.3: Transformation notes
• 2.6: Utilization notes
• 3.5: Formalization notes
• DB: DBMS features and constraints

Procedures:
• Task 4.1: Resource analysis
• Task 4.2: Performance enhancement
  • Activity 4.2.1: Customize hardware
  • Activity 4.2.2: Customize software

Deliverables:
• 4.1: Enhanced database design (diagram)
• 4.2: Enhanced schema DDL
• 4.3: Enhanced subschema DDL
• 4.4: Enhanced database object definitions (data dictionary)
• 4.5: Customization notes
In step 4, Customization, the designer has the option to use design techniques
or tools (vendor, third-party, or homegrown) to improve database performance while
keeping all functional components intact. This step is divided into two tasks. The first task
applies some analytical rigor to the database usage information, followed in the second
task by the actual performance-enhancing changes to the database.
Task 4.1: Resource Analysis
Because you now have a working DBMS schema, you could stop here, and if demands
on your database are minimal (few transactions and a small amount of data), you are
probably done. However, for many databases, performance is a significant driver of good
service. Without special performance-enhancing features, most of which are provided
by DBMS vendors, many transactions or queries could take minutes, or even hours, to
return results. An enhancement as simple as adding an index, often requiring fewer
than a dozen lines of code, can improve performance by one, two, or even three orders
of magnitude. The trick is understanding the trade-offs and knowing where to place the
performance-enhancing components.
Step 3, Formalization, focused on the language to translate the physical design
specifications into something the DBMS software can understand. The skills the designer
needs are concept and software related. For example, the database designer needs to
know both the functional requirements of the application and the vendor-specific DDL
and DML for creating and using a database.
Performance tuning, on the other hand, requires all of the database designer's
Formalization skills plus knowledge of how computer systems work: the hardware and
system software the database lives under, as well as any relevant DBMS functionality,
need to be understood. This is because, by far, the most significant issue associated with
database tuning is efficiently getting information to and from auxiliary storage devices.
The reason? Fetching a record from main memory can be 1,000 times faster than fetching
the same record from disk. If two records are needed and they are stored on two separate
database pages, then two physical I/Os might be required. However, if they are stored on
the same page on disk, then the second record can be fetched up to 1,000 times faster than
the first.
The Trade-Off Triangle
How do you know when you’re done? The only way to really know the answer is to
examine your database to see whether the off-the-rack implementation will work
satisfactorily. If, on the other hand, performance enhancements are needed, the designer
can dip into the DBMS tool kit and improve the database design.
There are no free rides in the DBMS world. For everything you gain, there is
something you lose. Everything (almost) is possible; however, everything (almost) has a
cost. It all comes down to trade-offs.
Just listen to medication ads on TV. “Wonder drug Nonosedripz gets rid of your
runny nose; however, side effects can include brain damage, anal leakage, and growing
extra toes.” The consumer has to analyze the trade-offs and decide whether nose relief is
worth the risk.
Trade-offs are everywhere, including database design. A good DBMS schema
involves trade-offs related to three competing performance dimensions—flexibility,
throughput, and volume.
• Flexibility is the ability of the database system to support a broad range of known and unknown services and to easily adapt to business and technology changes.
• Throughput is how quickly the database system can perform its function either in terms of response time for online applications or runtime for batch programs.
• Volume is the number of objects/actions the database system can accommodate, such as the number of record types, or occurrences, it can support or the number of concurrent online transactions it can handle.
These three dimensions can be easily represented as a triangle (Figure 13-1).
Figure 13-1. Trade-off triangle. Database designers must choose among flexibility, throughput, and volume, which compete for database resources. The criticality of each dimension is assessed on a scale of 1 (simple), 2 (average), 3 (complex), or 4 (very complex).
Trade-off decisions come at a cost. For example, design the database for flexibility
and you might have to sacrifice some of the database’s ability to handle large volumes or
perform functions quickly.
In many cases, the hardware, system software, and DBMS can accommodate all
three dimensions. However, in cases where demands are high, or extreme, the system
might be able to accommodate only one or two dimensions comfortably (Figure 13-2).
Figure 13-2. Trade-off triangle—flexibility most important. The shaded area indicates the criticality of each of the three dimensions, showing that flexibility is most important while volume support is not a major issue. Trade-off decision: design the database to be flexible. What you gain: a more robust database able to handle not only today's functionality but probably future requests as well. What you might lose: the database system's ability to handle large volumes and/or maintain processing speed.
Understanding the trade-offs helps in making decisions about design options and
tool usage (Figure 13-3). Most of all, the trade-off triangle provides a design-trade-off
perspective (Table 13-2).
Figure 13-3. Trade-off triangle—volume most important. Here the need is to handle large volumes. Trade-off decision: design the database to accommodate very large record volumes. What you gain: the ability to process large amounts of data. What you might lose: processing speed and/or database flexibility.
Table 13-2. Critical Dimensions

Throughput and volume
Design options to consider:
• Fewer larger record types, read-only
• Use of specialized storage (SSD, cache, main memory, large buffers, read-only, multiple disks, etc.)
• Use of partitioning, clustering, and hashing
Tools to consider:
• NoSQL
• DBMS middleware
• IMS/Fast Path

Throughput and flexibility
Design options to consider:
• Distribute across multiple disks
• Use of indexing, clustering, and hashing
• Robust use of links
Tools to consider:
• In-memory DBMS (no/limited update capability)
• T/P monitors

Flexibility and volume
Design options to consider:
• Use of many record types and relationships
• Partitioning
• Strong use of indices
Tools to consider:
• Parallel DBMS processing
• Distributed DBMS
• DBMS middleware
The trade-off triangle is a simple visual way to demonstrate, and gain buy-in for,
the database design. It is not a decision tool but rather a communications tool for
illuminating the decisions that need to be made. The trade-off triangle can, and should,
be customized to an organization’s situation—reflecting local data and transaction
volumes as well as functional flexibility and processing speed requirements. Table 13-3 is
an example of a trade-off triangle serviceability index tool for one IT shop.
Table 13-3. Trade-Off Triangle Serviceability Index
Each dimension (flexibility, throughput, and volume) is assigned a value 1 through 4 on
the following scale:
1. Simple
2. Average
3. Complex
4. Very complex
The values are then added together to give the serviceability index.
Here’s a sample of the serviceability index:
• An index less than or equal to 3 can be handled by almost any DBMS.
• 4 to 6 requires a good general-purpose server-based DBMS.
• 7 to 8 requires trade-offs in the design and/or implementation of the database.
• 9 to 10 requires a special-situation DBMS (specialized DBMS (OO, NoSQL, IMS Fast Path), and/or special hardware, and/or special database design).
• No database should have a serviceability index greater than 10.
Although it is certainly not scientific and could be criticized on multiple fronts,
the trade-off triangle serviceability index nonetheless gives the database designer a
framework for structuring potential challenges as well as managing expectations when
meeting with other technical staff and end users.
Task 4.2: Performance Enhancements
If you created a trade-off triangle for your database and it came out 1, 1, 1 (flexibility = 1,
throughput = 1, and volume = 1), then you are done. There is little this chapter can add
to your database design. If your database is 2, 2, 2, you are also probably done, although
reading the chapter might show you some small performance tweak that should be
applied to make your system more efficient. However, if you scored a 3 in any category,
then keep reading—there will be some tidbits here that you can use.
Activity 4.2.1: Customize Hardware
A simple, although not inexpensive, way to improve database performance is through
hardware. Faster processors and/or more memory can improve the performance of most
databases and overcome a multitude of poor database design sins. But first…
A Few Words About Secondary Storage
Before going further, you need to understand a few things about secondary storage, both
rotating and solid-state drive (SSD).
Currency is an interesting word. If you use a search engine to wander through
the Internet, you discover that nuclear weapons are the currency of power, attention is
the currency of leadership, secrets are the currency of intimacy, personal information
is the currency of the 21st century, and so on, and so on. The word currency is used to
denote how you measure something important. If you have nuclear weapons, then
you have power; more nuclear weapons = more power.
One can safely say that inputs and outputs (I/Os) are the currency of databases.
The efficiency of a database application, batch or online, can be improved—sometimes
by orders of magnitude—simply by changing how it performs its database I/O. No fairy
tales—systems that were deemed turkeys by users have become champs after changing
a dozen lines of DDL. It should be no surprise that the number-one place DBAs look to
improve database performance is I/O.
Take a simple example. Imagine a program that reads a customer file. Assume
that there are 100,000 customer records on disk, each 1 KB long. Starting with the first
customer, it takes 100,000 trips to the disk to read the complete file. If the average disk
can read a record and ship it to the computer in 8 milliseconds, then it will take almost 14
minutes to read the file. The same file in main memory would take less than 1 second to
read on a fast computer. That’s an amazing difference.
The reality is that disk is slow while main memory is fast. A second reality is that disk
is cheap while main memory is expensive (at least in the quantities needed to compete
with disk). The moral is that if you have a very small database, put it in main memory.
Your users will love you. If you have a big database, you’re stuck with disk…but there are a
few things you can do to speed things up.
Look at the typical disk. It consists of a motor rotating one or more platters of
aluminum or some other nonmagnetic substrate coated with magnetic oxide. There might be
one read/write head per platter or two (one above and one below). The platters can have
a diameter of less than 2 inches or as large as 12 inches.
It also has one or more arms, part of the actuator, which moves the read/write heads across
the platters. Modern disks spin at anywhere from 4,000 to more than 15,000 RPM.
Each platter is divided into concentric tracks. Each track is divided into multiple
sectors. If there is more than one platter, then the platters are stacked, one on top of each
other, sort of like pancakes, except there is space between each platter for an actuator
arm. All of the vertically aligned tracks are called a cylinder.
Reading the data from disk requires a series of steps. First, the request is sent to a
controller, which determines the exact location of the desired data. Second, the actuator
arm is positioned over/under the correct track. The time it takes to position the arm is
called seek time. Third, the system goes into a wait state until the correct sector rotates
under/over the read/write head. This is called rotational latency or rotational delay.
When the correct sector is in position, the data are read from the disk and transferred to
the host.
All of these steps take time. Table 13-4 gives the times for a typical database disk.
Table 13-4. Disk Data Transfer Speeds, in Milliseconds

                      1 KB Data   2 KB Data   10 KB Data   100 KB Data
Controller            0.01        0.01        0.01         0.01
Seek Time             4.0         4.4*        8.0*         44.0*
Rotational Latency    4.0         4.4**       8.0**        44.0**
Data Transfer         0.01        0.02        0.1          1
Buffer to CPU         0.0003      0.0006      0.003        0.03
Total Time (ms)       8.0         8.8         16.1         89.0

Notes: *Assumes the actuator arm needs repositioning 10 percent of the time.
**Assumes the desired sectors are contiguous 90 percent of the time.
As Table 13-4 shows, the problem is the seek time and the rotational latency
(the dreaded disk duo)—both mechanical activities. If you could eliminate both
mechanical functions, then the speeds would be considerably faster.
An SSD appears to the system as a rotating disk, but the data are stored in nonvolatile
flash memory. Table 13-5 gives typical speeds for an SSD and main memory.
Table 13-5. Nonrotating Disk Data Transfer Speeds, in Milliseconds

                          1 KB Data   2 KB Data   10 KB Data   100 KB Data
Solid-State Drive (SSD)   0.030       0.042       0.133        1.150
Main Memory               0.00013     0.00026     0.0013       0.013
SSD not fast enough for you? Then keep your data in the computer’s main memory
where speeds are even faster.
There is just one problem—the faster memory access is, the more expensive it is.
SSDs are much faster than rotating disks, but the per-megabyte cost is considerably
higher and even more so for main memory. However, the message is not “don’t use
nonrotating memory.” Rather, the message is “use your head.” Putting the customer file in
main memory might not make sense, but putting the price list there just might.
In most cases, until SSD prices come closer to those of rotating disk, rotating disk will
be where the majority of the database’s data is stored. The goal for the designer must be
to anticipate, as much as possible, the application’s data needs and fetch a large amount
of desired data from disk with each read. If 10 customer records can be fetched with
each trip to the disk, then the amount of time spent doing physical I/O is substantially
reduced. To the application program, there were 10 (logical) reads, but for the operating
system, only a single (physical) read took place. The objective of the database designer
is to maximize the amount of required data fetched with each physical I/O. And that is
the goal for this chapter. Examine the size of the data and the number of times data is
accessed and then decide where and how the data should be stored and accessed. If the
database designer understands the real cost (currency) of database performance, then
they are in a position to make informed hardware choices.
Add Disk
Imagine a rather small online transaction processing database for a multiuser application.
When the system was new, users complained about slow response time, but as time went
on the performance improved. The only difference? More data. In a stranger-than-fiction
situation, as the database got bigger, the online performance improved. Why? As the
database got larger, it outgrew its single disk. As additional disks were added, along with
their additional read/write heads, the contention caused by multiple users repositioning
the disk heads decreased. The bigger database is actually faster than the smaller one.
Multiple users or multiple applications requesting the service of a single disk can
require the constant repositioning of the read/write head as each user or application gets
its turn. The disk head can wind up “thrashing” between the requested cylinders. Adding
physical disks increases the number of read/write heads, which, in turn, reduces the very
expensive seek time and rotational delay caused by the contention.
Few organizations have only one database supporting one application. Rather than
putting five databases on five separate disks, mixing them (spreading all five databases
over the five disks) can (based on time of use, etc.) sometimes reduce disk contention,
speeding up access for all five.
Faster Disk
There was a time when a disk was a disk—they all ran at approximately the same speed
(2,400 to 3,600 RPM), and the disk platters were all about the same size (about 12 inches
in diameter). Not true today. Disk RPM can vary from as low as 4,000 up to 15,000 RPM,
and smaller disk platter size means that the read/write head travels shorter distances.
SSDs can be orders of magnitude faster than rotating disks.
Routinely fetched information, such as product or price tables, can be kept on the
smaller, more expensive, but faster disks, while less routinely accessed information can
stay on more lumbering media.
Main Memory
Nothing beats main memory for speed, but at a cost. However, if the application is reading
data in a predictable way (such as sequentially), then large buffers in main memory can be
a godsend. IBM allows disk sectors as large as 50 KB, and most DBMSs use a database page
that can be many times the size of the disk sector. Pulling large amounts of information into
a buffer in main memory can significantly reduce the number of required physical I/Os. If
there is sufficient memory, tables (such as tax and price tables) can be read once and then
kept in main memory to be shared by multiple users and applications.
Of course, big buffers are useful only if you want all the information in the buffer.
Large buffers and large database pages are not only useless but can be an impediment if
the application wants only 50 of the 5,000 bytes returned from storage.
The main memory sticking point is that it works best for read-only data. Journaling
and backup and recovery activities require nonvolatile memory such as disk.
Once again, to make an informed choice, the database designer must know how the
application will use the data.
Activity 4.2.2: Customize Software
There are a number of ways software can be used to customize a DBMS.
Indices (B-Tree, Hash, Bitmap)
Indices were discussed in Chapter 8, so a repeat is not needed here. Most of the emphasis
on indices has been on retrieval, which is where they shine. Which fields you index is
driven by two criteria: (1) which fields you want to search the database for and (2) which
fields the DBMS uses to access record occurrences.
With relational systems, to add indices you need to simply add a statement to the
DDL, as follows:
CREATE TABLE PRODUCT (
PRODUCT_NAME CHAR(30) NOT NULL,
PRODUCT_NUMBER CHAR(8) NOT NULL PRIMARY KEY,
PRODUCT_DESCRIPTION VARCHAR(512),
COST_BASIS DECIMAL(8,2) NOT NULL,
LIST_PRICE DECIMAL(8,2) NOT NULL
);
CREATE UNIQUE INDEX PRODUCT_NUMBER_IDX ON PRODUCT (PRODUCT_NUMBER);
Unfortunately, indices are rather poor performers when it comes to index updates.
Inserting, modifying, or deleting an index entry can be I/O expensive.
Clustering
Clustering is placing one record occurrence on the same database page as another record
occurrence so that the physical I/O to access one occurrence will also access the other
occurrence. A common clustering strategy involves the parent-child binary relationship
where the child record type occurrences are placed physically near the parent record type
occurrence. This increases the chance that the physical I/O to access the parent will also
access its children. (The term clustering is also used for index storage and distributed
databases. The use here is for the storage of content data within the database.)
For effective clustering, the database designer needs to identify the record types that
are functionally associated with other record types. On a grand scale, it is as simple as
saying that Customer is more closely associated with Account than with Manufacturer,
while Manufacturer is more closely associated with Distributor than with Customer.
Look at this manual example. Assume that manufacturer paperwork is stored in
the company warehouse and customer paperwork is stored in the sales office. Where
should distributor and account paperwork be stored? If, when you need distributor
information, you also usually need manufacturer information but you almost never
need customer information, then it makes sense to store the distributor information
in the warehouse and not the sales office. In addition, although you infrequently need
manufacturer paperwork and account paperwork at the same time, you often need
customer and account paperwork at the same time. Therefore, it makes sense to store the
account information with the customer information in the sales office. Using clustering
terminology, it makes sense to cluster Customer and Account information and to cluster
Distributor and Manufacturer information.
However, with an automated system and usage maps, you can go further. Which
do you typically access first, Manufacturer or Distributor? If you typically access
the Manufacturer occurrence first followed by the Distributor occurrence, then the
Distributor occurrence should be clustered around the Manufacturer occurrence; but if
you typically access Distributor data first, then Manufacturer should be clustered around
Distributor. The combined usage map tells you this. Follow the usage arrows and see
which is more common—accessing Manufacturer first or Distributor first (Figure 13-4).
If the arrows show that you typically move from Distributor to Manufacturer, then cluster
Manufacturer with Distributor. This means all Manufacturer occurrences for a specific
Distributor occurrence are stored with their Distributor occurrence, ideally on the same
database page. Typically, the cluster is named after the parent record.
Figure 13-4. Clustering example
Clustering can be of two types. The first stores the child records on the same
physical page as the parent, so by accessing the parent, the child records can be read
without an additional physical I/O (assuming there is room on the page for all the
children). The second type of clustering stores the parent and children on different
database pages, but all the children for a given parent are stored on the same physical
page. For example, record X might be stored in database file 1, page 10; while all of X’s
children are stored in file 2, page 86. All of X’s children can be fetched with a single
physical I/O (assuming there is room on the page), but not the same physical I/O used to
fetch the parent X.
Consider the order management system in Figure 13-5. If the Line Item record
occurrences are stored on the same database page as their parent Order record
occurrence, then when the Order record is read, all (or almost all) of the Line Item
occurrences are fetched with the same physical I/O.
Figure 13-5. Order management system physical database design
The downside of clustering is that records can be clustered only one way. Line Item
could be clustered around Order or Product, but not both. Clustering Line Item around
Order means that when the Order record is accessed, the Order’s Line Items are probably
also there. However, accessing Line Item from Product means that every Line Item access
probably requires a physical I/O. The database designer must understand the trade-offs
to make the best all-around decision.
Creating a cluster is quite easy with most database systems. The following is a simple
SQL clustering example:
CREATE TABLE ORDER (
ORDER_NO CHAR(5) NOT NULL,
ORDER_TYPE CHAR(1),
ACCT_NO CHAR(8),
ONUMB NUMBER(4) NOT NULL,
OAMT NUMBER(6,2)
)
CLUSTER ACCOUNT (ACCT_NO);
Note: Some SQL-based systems do not allow a table named Order, which is a reserved word, while others do.
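In DBMSs that follow this Oracle-style syntax, the cluster itself is typically defined before any table is placed in it. The following is a minimal sketch, assuming the cluster name and key from the example above; the SIZE value and the index name are illustrative assumptions:

CREATE CLUSTER ACCOUNT (ACCT_NO CHAR(8))
    SIZE 512;    -- assumed average number of bytes needed per cluster key value

CREATE INDEX ACCOUNT_CLUSTER_IDX ON CLUSTER ACCOUNT;    -- an indexed cluster needs a cluster index before rows are inserted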
You can graphically indicate clustering on the database diagram by placing the
clustering record type name at the bottom of the record type box.
Figure 13-5 shows a database design fragment including the clustering information
at the bottom of each record type box.
Example Using Indices and Clusters
Usage scenarios can help the database designer make the correct clustering choices. If
you go back to the trade-off example in Chapter 9, you can now add some additional
usage scenario–driven rigor to the solution.
Take the simple database design fragment of the three record types (Figure 13-6)
consisting of 200 Product occurrences and 1,000 Order occurrences, each linked to an
average of 10 Line Item occurrences per Order occurrence.
Figure 13-6. Physical database design trade-offs
Two software options can improve performance: (1) indices placed on certain fields
and (2) clustering of multiple occurrences of linked but different record types on the same
physical database page. In the example, Line Item occurrences could be stored either on
the same physical page as their related Order occurrence or on the same physical page as
their related Product occurrence, but not both.
Examining scenario 1, the first scenario task is to enter the database at a specified
Order occurrence. Because there are 1,000 Order occurrences, a sequential search for the
desired Order takes, on average, 500 logical I/Os to find the right record. Assuming that, on average, 10 Order occurrences fit on a physical database page, a sequential search for the desired Order instance requires, on average, 50 physical I/Os.
You can reduce the number of physical I/Os by creating an index on
ORDER_NUMBER. Of course, indices are not free. They also require storage on disk
and a number of I/Os to fetch their information. Luckily, there are simple formulas
(see Appendix D) to calculate the number of physical I/Os required to fetch the occurrence
location from the index. Using formula (5) introduced in Chapter 8, the number of physical I/Os needed to fetch a particular Order record averages 3 (2 for the index and 1 to fetch the record instance), as illustrated here:
If:
    N = Number of index entries to search
    C = Average number of compares to find desired entry
    m = Blocking factor of index
Then:
    C = log N / log m          (5)
If you assume:
    Number of occurrences (N): 1,000
    Number of entries per index page (m): 50
Then, using formula (5):
    Average physical I/Os to find the desired index entry = fewer than 2
    Average physical I/Os to retrieve record (index I/O plus fetching record) = less than 3
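Plugging the numbers into formula (5) makes the figures concrete: C = log 1,000 / log 50 ≈ 3.0 / 1.7 ≈ 1.8 index I/Os, and adding one more physical I/O to fetch the Order record itself brings the total to roughly 2.8, which is where the "fewer than 2" and "less than 3" values come from.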
Using an index on Order is, on average, more than 16 times faster than reading the
file sequentially. However, there is a cost. Indices must be maintained. While reading
indices is relatively cheap, updating them can be considerably more expensive
(see formula (7) in Chapter 8 or Appendix D). Which do you choose: faster retrieval at the
cost of updates or more efficient updates at the expense of retrievals? The answer is in the
physical I/O expended.
The second software method for improving performance is clustering. If you assume
you have the desired Order occurrence, sequentially fetching its related 10 Line Item
occurrences requires an additional average 5,000 logical I/Os. Assuming you can place
50 Line Item occurrences on a physical database page, fetching now requires, on average,
100 physical I/Os.
Adding an index is better. Using formula (5), you can fetch a Line Item occurrence in
approximately 4 I/Os—but that is per Line Item occurrence. The average of 10 occurrences
per Order would translate into (allowing for some occurrences being on the same
database page) between 30 and 40 physical I/Os per Order.
You can do better. You can cluster (store) all of Order X’s related Line Item
occurrences on a single database page or on the same physical database page that Order
X is on. Then, when you fetch Order X, you also have all of Order X’s Line Items with the
same physical I/O or all clustered together on their own database page (assuming that
they could all fit on one database page).
Scenario 2 is similar to scenario 1, except you enter the database at Product and
then traverse to Line Item. You can place an index on a Product data item to reduce the
number of physical I/Os to fetch a given Product occurrence from 2 (assuming the same
blocking factor of 50 Product occurrences per physical database page) to 2 or 3—no
savings and a potential deficit.
You can also cluster Line Item around Product. If you assume an average of 50 Line
Item instances per Product instance (10,000 divided by 200), you should be able to fetch
all of a Product’s Line Item occurrences with the same physical I/O.
However, while both Order and Product can be indexed, Line Item occurrences can
be stored around only one record. The designer must choose to cluster Line Item around
Order or around Product—clustering around both is not possible. Which do you choose?
These are the five critical questions:
1. Should Order be indexed?
2. Should Product be indexed?
3. Should Line Item be indexed?
4. Should Line Item be clustered?
5. If so, clustered around Order or Product?
To answer these questions, you need to collect all the facts and assumptions.
Number of Order occurrences: 1,000
Order occurrence size (bytes): 200
Number of Product occurrences: 200
Product occurrence size (bytes): 400
Number of Line Item occurrences: 10,000
Line Item occurrence size (bytes): 100
Database page size of 5,000 bytes
Number of scenario 1 transactions (executions) per day: 2,000
Number of scenario 2 transactions (executions) per day: 200
Question 1: Should Order Be Indexed?
Twenty-five Order records can fit on a database page (ignoring database page overhead),
which translates into 40 database pages. (This assumes that the page is dedicated to storing
Order records and that page free-space and expansion space are ignored.) Therefore, it
takes, on average, 20 physical I/Os to sequentially fetch the desired Order record.
Formula (5) says that it takes, on average, 2.15 physical I/Os to fetch the desired record address from an index (assuming that the index page is the same size as the database page and holds about 25 entries, so C = log 1,000 / log 25 ≈ 2.15). Adding an additional I/O to fetch the actual (content) record totals 3.15 physical I/Os.
Therefore, other things being equal, it is more efficient (3.15 versus 20 physical I/Os)
to fetch an Order using an index.
Question 2: Should Product Be Indexed?
Twelve Product records can fit on a database page (the caveats are the same as for the
Order case), meaning that all the Product records can fit on 17 database pages. Fetching a
Product record sequentially requires, on average, 9 physical I/Os.
Formula (5) says that fetching the Product index entry requires, on average, 2.13
physical I/Os. Adding one physical I/O to read the content results in an average of 3.13
physical I/Os per fetch. Three plus physical I/Os is certainly better than 9, but the benefit
is minimal.
Question 3: Should Line Item Be Indexed?
Fifty Line Item records fit on a database page, taking up a total of 200 database pages. A
sequential read of the database to find a specific Line Item record requires, on average,
100 physical I/Os.
An index on Line Item requires 2.35 index physical I/Os to fetch the record address
and one additional I/O for content, totaling 3.35 physical I/Os per Line Item record. It
makes sense to index Line Item.
Question 4: Should Line Item Be Clustered?
This is a nonmathematical question whose answer is dictated by the structure of the
database and the usage scenarios. Because both scenarios move from fetching a parent to
the Line Item child, it would seem that there could be significant benefit from clustering.
Question 5: Should Line Item Be Clustered Around Order or
Product?
To answer this question, you need to examine two alternatives.
For alternative 1, you need to calculate the total daily physical I/Os consumed by
scenario 1 and scenario 2 if Line Item is clustered around Order.
Alternative 2 calculates the total physical I/Os consumed by each scenario in a day if
Line Item is clustered around Product.
Alternative 1: Line Item Clustered Around Order
Scenario 1 says fetch 1 Order record and then, on average, 10 Line Item records and do
this entire process 2,000 times a day.
The physical I/O to fetch the Order record is 3.15 (from question 1).
The physical I/O to fetch the Line Items depends on the DBMS and how you chose to
store/access them. If you stored Line Items in their own file and on their own pages and
used an index to access them, then you could fetch all 10 Line Items with an additional
3.35 physical I/Os. Because all 10 records are on the same page, you need to read the Line
Item index only once to fetch all 10.
The total scenario 1 daily physical I/O count is 6.5 physical I/Os per transaction times
2,000 transactions per day, equaling 13,000 physical I/Os.
Scenario 2 says fetch one Product record and then, on average, 50 Line Item records
and do this entire process 200 times a day.
If Line Item is clustered around Order, then it cannot be clustered around Product.
Physical I/Os to fetch one Product record are 3.13. Total physical I/Os to fetch 50 Line
Items (3.35 times 50) are 167.5.
The total scenario 2 daily physical I/O count equals 170.63 times 200, which is 34,126
physical I/Os.
The total alternative 1 daily physical I/O count is 47,126.
Alternative 2: Line Item Clustered Around Product
Scenario 1 says fetch 1 Order record and then, on average, 10 Line Item records and do
this entire process 2,000 times a day.
The physical I/O to fetch the Order record is 3.15 (from question 1).
If Line Item is clustered around Product, then it cannot be clustered around Order.
From question 3, you know that it takes, on average, 3.35 physical I/Os to fetch 1 Line
Item record or 33.5 physical I/Os to fetch 10.
The total scenario 1 daily physical I/O count is 36.65 physical I/Os per transaction times 2,000 transactions per day, equaling 73,300 physical I/Os.
Scenario 2 says fetch 1 Product record and then, on average, 50 Line Item records,
and do this entire process 200 times a day.
The physical I/Os to fetch 1 Product record are 3.13.
Each Product record has an average cluster size of 50 Line Item records. The total
physical I/Os to fetch 1 Line Item is 3.35 for a transaction (execution) total of 6.48 physical
I/Os.
The total scenario 2 daily physical I/O count at 200 times a day times 6.48 is 1,296
physical I/Os.
The total alternative 2 daily physical I/O count is 74,596.
Comparing the two alternatives, clustering Line Item around Order saves more than
27,000 physical I/Os a day—a reduction of almost 40 percent.
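Pulling the arithmetic together, using the per-fetch costs computed above:

Alternative 1 (Line Item clustered around Order):
    2,000 × (3.15 + 3.35) + 200 × (3.13 + 50 × 3.35) = 13,000 + 34,126 = 47,126 physical I/Os per day
Alternative 2 (Line Item clustered around Product):
    2,000 × (3.15 + 10 × 3.35) + 200 × (3.13 + 3.35) = 73,300 + 1,296 = 74,596 physical I/Os per day
Difference: 74,596 − 47,126 = 27,470 physical I/Os per day in favor of alternative 1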
IT DOESN’T HAVE TO BE ACCURATE, IT JUST HAS TO BE
DIRECTIONALLY CORRECT
Any approach to calculating physical I/Os will run into difficulties. Many a DBA has
been surprised when the DBMS statistics-gathering function reports that a record
fetch took 5 physical I/Os while the operating system indicated that there were 25.
What’s happening?
Any effort to accurately calculate physical I/Os is problematic. The statistics
gathered by the DBA probably will not agree with those gathered by the operating
system support staff, which will probably disagree with any given by the secondary
storage subunit, which will almost certainly be greater than those predicted by the
database designer. The problem is that the various components needed to perform
an application-driven I/O need to do their own I/O as well to support the application.
Many of these additional I/Os are under-reported or not reported at all to the DBMS.
The CPU, which once managed all disk activity, now hands the task over to a
secondary storage subsystem that does something the CPU is unaware of but that
often involves the subsystem’s own I/O. The DBMS gets some information from the
operating system; however, the operating system always seems to have a few tasks
of its own that require I/O. And then there are the statistics-gathering systems. The
operating system has them, the DBMS has them, the secondary storage systems
have them, the transaction processing monitor has them—all that data gathering
involves I/Os—lots of them.
So, where does that leave the database designer and the DBA? Why perform these
physical I/O-counting exercises when the number could be off by an order of
magnitude or more? The answer is, although the designer-generated number might
under-report the actual I/O count, it is almost always directionally correct. Given two design alternatives with two different predictions, the one forecast to be more expensive will almost always require more I/Os in practice than the one forecast to be cheaper. The actual numbers might be low, but the direction the forecast indicates (greater or lesser) will almost always be correct.
The database designer might get the actual count wrong, but the conclusions drawn
from the analysis, and the associated decisions made based on that information, are
almost always correct.
The takeaway from all of this: it doesn’t have to be accurate; it just has to be
directionally correct!
The previous example examined only a single approach to storing and clustering
records in a database. Had the DBMS stored different record types on the same page
(a default with most systems), then the counts would have been different. A single record
type would have been spread across more database pages than predicted, increasing the
cost of a sequential read.
On the other hand, multiple record type page storage allows the Line Item records
to be stored on the same physical page as their parent (Order or Product), reducing I/O,
but this approach also increases the chance of page overflow, with the result that not every cluster fits on a single page.
The calculations are not different, although there might be a few more of them. The
principle, however, remains unchanged.
Partitioning
Partitioning is deciding where to locate database files on disk to reduce disk contention.
For example, database inserts, updates, and deletes require writing to the database
journals and log files. If there are sufficient updates, the journals or log files can become
bottlenecks. A simple solution is to place the journals and log files on separate disks from
the database content files. This allows the different disk seek and rotational delay times to
overlap.
Database content can also be partitioned. By looking at the use of the database, the
designer can locate different record types in different files or even the same record type
spread across multiple files partitioned by a data item value.
Partitioning can also be used in conjunction with clustering. The database designer
can cluster Line Item around Order while storing each in a different partition. Partition 1
might contain all Orders while Partition 2 all Line Items. The trick is that all the Line Items
for a particular Order are stored together, ideally on a single database page, in Partition 2.
For most systems, partitions are a DDL matter and not a DML one, making them
totally transparent to the application program. Partitioning works particularly well
when customizing hardware. The database designer can create a partition on an SSD
for the most frequently accessed record occurrences while less frequently accessed
occurrences are on slower media. Partitions also are useful when the database is spread
across multiple servers, allowing each server to support its own backup and recovery.
Independent backup and recovery is particularly useful when the volume of data is
quite large.
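The exact DDL varies by product, but a range-partitioning sketch in Oracle-style syntax might look like the following; the table, partition, and tablespace names, and the idea of backing the current partition with an SSD tablespace, are illustrative assumptions:

CREATE TABLE ORDERS (
    ORDER_NO   CHAR(5)    NOT NULL,
    ORDER_DATE DATE       NOT NULL,
    ACCT_NO    CHAR(8),
    OAMT       NUMBER(6,2)
)
PARTITION BY RANGE (ORDER_DATE) (
    -- older, rarely touched orders go to slower, cheaper storage
    PARTITION ORDERS_HISTORY VALUES LESS THAN (DATE '2017-01-01') TABLESPACE SLOW_DISK,
    -- current, frequently accessed orders go to an SSD-backed tablespace
    PARTITION ORDERS_CURRENT VALUES LESS THAN (MAXVALUE) TABLESPACE FAST_SSD
);

Application code is unchanged: queries against ORDERS are routed to the correct partition by the DBMS.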
Both partitioning and clustering information can be displayed on the physical
database design diagram (Figure 13-7).
Figure 13-7. Clustering and partitioning information on the enhanced database diagram
Derived and Duplicate Data
In logical data modeling, only primitive data are modeled. Derived data are excluded
because they are the result of one or more processes acting on primitive data. For
example, there is no need to model the total number of courses a student has taken if you
have all the courses the student has taken in the database. The application can simply
count them.
However, you might want to include a TOTAL COURSES TAKEN data item in the
Student record if (1) it would require excessive physical I/Os to calculate the number of
courses or (2) the calculated data are often required. The database designer could decide
that, for performance reasons, it makes sense to store this derived data.
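As a sketch of what that might look like (the table and column names here are hypothetical), the derived count can be carried on the Student record and refreshed from the primitive data:

ALTER TABLE STUDENT ADD TOTAL_COURSES_TAKEN INTEGER;    -- derived data item stored redundantly

UPDATE STUDENT S    -- refresh the derived value from the primitive data
SET    TOTAL_COURSES_TAKEN = (SELECT COUNT(*)
                              FROM   COURSE_TAKEN C
                              WHERE  C.STUDENT_ID = S.STUDENT_ID);

In practice, the refresh is often handled by the application or by a trigger at the moment a course record is inserted or deleted.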
There might be similar reasons to store duplicate data. Adding a few redundant data
items into different record types can reduce physical I/Os and speed up processing.
WORD SOUP
Some authors make a distinction between data duplication and data redundancy.
Duplicate data are always a no-no, while redundant data are permissible duplicate
data. Other authors think it’s a case of toMAto-TOmato.
The argument against duplicate data is the mess that can occur if not all copies of the
data are updated simultaneously. However, duplication is perfect for read-only databases.
Duplicate data are a favorite of many NoSQL systems, which sprinkle popular data items
around the database to reduce physical I/O.
Denormalization
Denormalization is another favorite of NoSQL database systems, which like to cram a lot
of data into a single record occurrence. They also like to add group data and repeating
groups back into the parent record. If few customers have more than one address, then it
might make sense to place the primary address in the customer record.
The purpose of normalization is to protect the database from ill-conceived inserts,
updates, and deletes. However, if the database is read-only, then normalization is not
needed. Data warehouses, which tend to be large and read-only, are prime candidates for
denormalization.
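As a minimal sketch of that kind of denormalization (the column names are assumptions), the primary address, logically a separate child record, is folded back into the parent:

CREATE TABLE CUSTOMER (
    CUSTOMER_NO    CHAR(8)      NOT NULL PRIMARY KEY,
    CUSTOMER_NAME  VARCHAR(100) NOT NULL,
    PRIMARY_STREET VARCHAR(100),    -- address data pulled up from a separate child record type
    PRIMARY_CITY   VARCHAR(60),
    PRIMARY_POSTAL VARCHAR(10)
);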
Get Rid of ACID
ACID (Chapter 8) is expensive. It requires that all database insert, update, and delete
transactions follow at least most, if not all, of the following steps:
1. The data to be changed (and sometimes even the data stored around it on the same database page or file) must be locked so that others cannot access them while the change is occurring.
2. An image of the existing data (before image) is written (involving one or more disk writes) to a journal file before the data are changed, and another image of the data (after image) is saved (one or more disk writes) to a journal file after the change.
3. All the transaction steps taken are recorded to a separate log file (one or more disk writes).
4. All the writes are flushed to ensure that all the changes are physically on the disk and not stored in some buffer awaiting transfer to disk.
Updating a single database record occurrence could involve more than a dozen
physical disk writes (not logical writes) before the actual record occurrence update is
completed. In terms of resource utilization, a database update could require 10 or more
times the resources of a simple database read.
Eliminating or reducing one or more of these steps can significantly speed up a
database transaction and, if all goes well, nothing is lost. This is how many of the NoSQL
systems obtain their speed. By not getting into locking, journaling, and logging the
update, the speed of a transaction can be increased tenfold.
If you can live with the proclivities and vagaries of the non-ACID world, then you
can, if your DBMS allows, turn off the ACID functions to improve performance at the cost
of guaranteed data integrity.
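Which ACID features can be relaxed, and how, depends entirely on the product. As one illustration rather than a recommendation, PostgreSQL offers two such relaxations; the table and column names below are assumptions:

CREATE UNLOGGED TABLE CLICK_EVENT (    -- skips write-ahead logging; fast, but contents are lost after a crash
    EVENT_ID   BIGINT,
    EVENT_TIME TIMESTAMP,
    PAYLOAD    TEXT
);

SET synchronous_commit = off;    -- report COMMIT before the log is flushed, trading a small durability window for speed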
Figure 13-8 is a completed Enhanced database design diagram showing clustering
and partitioning.
Figure 13-8. Physical database design diagram showing clusters and partition
information
Ideally, applying the techniques presented in this chapter should allow your general-purpose DBMS (SQL Server, IMS, Oracle, MySQL, etc.) to accommodate your productivity
requirements. Unfortunately, there are times when the standard DBMS, no matter how
you configure it, cannot handle the required load. For example, Big Data, the reams of
information generated by automated systems, are often more than the traditional DBMS
can handle. How big is big? There are no specific or even agreed-upon answers, but a
useful rule of thumb is that the delineator between traditional data and Big Data is the
practical storage and processing limits for traditional information managers. For Big Data,
you might have to use a specialty or niche DBMS, such as one of the NoSQL products.
Big Data, Big Problems, Big Solution
Big Data is one of the latest technologies to unsettle IT. Organization after organization is
in a quandary, trying to figure out what to do about the large volumes of data streaming
in from myriad sources. For example, a supermarket or chain store might record every
customer transaction, resulting in a database that could grow to petabytes in size.
How big is Big Data? Nobody knows—or everybody knows but nobody agrees. Does
“big” refer to the number of records in a database or to the number of data items in a
record or to the number of bytes that need to be stored? Perhaps it refers to the amount of
data that must be processed in a certain period of time or the number of users that need
to access it? If you read the literature, you discover that the answer is yes—which, in an entanglement of twisted logic, is also the same as saying no. For convenience, if nothing
else, Big Data is usually classified by the number of bytes that need to be stored and
accessed.
How many bytes constitute Big Data? Gigabytes might be Big Data. Petabytes are
most assuredly Big Data, and exabytes are very big Big Data. However, the label is not only
inexact, it is unnecessary. Storing such large amounts of data is easy if you can afford the
hardware—you can place as many records in a flat file as you want although the file might
consume miles of magnetic tape or span hundreds of disk drives.
Accessing the data to use it is another matter entirely. Most traditional DBMSs should handle gigabytes of data, might struggle to the point of collapse with terabytes, and might drop dead on the floor when faced with exabytes. Processing Big Data
presents some big problems (Figure 13-9). Luckily, there is a big solution available: NoSQL.
Figure 13-9. Trade-off triangle—accommodating Big Data. (Big Data drives the design. Trade-off decision: design the database to accommodate Big Data. What you gain: the ability to process very large amounts of data. What you might lose: flexibility and the ability to attain satisfactory throughput.)
To Plunge or Not to Plunge
Before diving into a Big Data solution, the IT organization needs to be comfortable that
the plunge is necessary. Big Data can sometimes be handled by traditional data managers
such as Oracle and DB2. If not, then there are a number of nontraditional solutions
that specifically target Big Data. However, choosing to use these tools is a big decision
that should not be taken lightly. Table 13-6 presents a good guideline to follow: use a
nontraditional solution only if you absolutely have to.
Table 13-6. Technology Escalation Rules
1. If the application can be adequately managed by a traditional data manager (e.g., RDBMS), then use a traditional data manager.
2. If, and only if, a traditional data manager cannot adequately manage the application's throughput or volume requirements, then look to nontraditional solutions (e.g., NoSQL).
Why choose a traditional system over a nontraditional one? Reasons include
• The pool of staff experienced with traditional DBMSs is considerably larger than the pool of staff experienced with nontraditional DBMSs.
• The availability of traditional DBMS training, documentation, consulting help, and support tools greatly exceeds that for the nontraditional DBMSs.
• Most IT organizations that support a nontraditional DBMS also support at least one traditional DBMS, often requiring duplicate staff expertise, procedures, training, development, test, and maintenance environments.
• Nontraditional DBMSs—being newer than traditional DBMSs—will likely undergo a greater rate of change (features, syntax, maintenance fixes, etc.) than traditional systems, resulting in greater instability and support costs.
However, circumstances often dictate the direction you must take, and a
nontraditional DBMS, such as NoSQL, might be the only practical solution to an
application problem.
NoSQL
NoSQL is not so much a data architecture as a collection of data architectures tuned to
solve a single problem, or at most just a few.
NoSQL products are categorized by some authors as schema-less, meaning that
there is no formal schema like you might find with a traditional DBMS. Although the
statement is technically not true, it does capture an important characteristic of NoSQL
systems. You could describe NoSQL as a series of stand-alone subschemas. This
observation is driven by two common NoSQL features—its usage-driven nature and its
single record type structure.
First, NoSQL systems place significantly more emphasis on data usage than
traditional data management systems. In fact, usage is the primary driver of database
design. Data structures, such as records, attributes, clusters, and partitions, are primarily
determined by how the data are accessed rather than by their definition.
Second, a goal of a NoSQL database design is to have each usage scenario supported
by a single NoSQL record type. Denormalization, specifically cramming all the user-required data into a single NoSQL record, is what gives NoSQL its speed and traditional
DBA weltschmerz. A single NoSQL record might contain multiple occurrences of multiple
entities. The resulting NoSQL fat record can then be accessed with a single I/O.
Cassandra is an open source NoSQL DBMS originally developed by Facebook and
now maintained by the Apache Software Foundation. Some authors refer to Cassandra as a
key-value architecture, others as a wide-column architecture, and still others as a partitioned-row store. Actually, it is all of these. As with many NoSQL products, it is an assemblage of
numerous, sometimes diverse, features. For example, key-value is how Cassandra stores data
fields, while wide-column architecture describes how records are constructed.
Cassandra’s essential features include
• Clustering: The Cassandra partition is a multiple-record type, multiple-record occurrence, basic unit of storage, allowing the retrieval of multiple record types and occurrences with the same physical I/O (the NoSQL fat record).
• Hashing: All Cassandra partitions are automatically stored based on a hash of all or part of the partition’s primary key, providing fast storage and retrieval, ideally with a single physical I/O.
• Aggregation: Both group and multivalue attributes are supported and heavily used.
In the previous section it was mentioned that when using a traditional DBMS, the
designer can improve performance by turning off ACID features. Well, Cassandra does
that for you. Cassandra adds to its lightning speed by not having to lock records and
journal activity.
While Cassandra is not ACID compliant (though it does optionally support some
ACID features), it does, like a number of NoSQL products, go half the distance. Cassandra
is BASE compliant (Chapter 8), which means it may perform these steps, just not in real time. It might write to a journal file, but not before the transaction is declared complete. If the system goes down a minute or two after the transaction is “complete,”
then you are probably safe. If it goes down a half-second after the transaction is declared
complete, well, who knows. Cassandra even calls it “eventual consistency,” reflecting its
policy of “we’ll get there when we get there.”
Modeling Big Data U3D Style
In Cassandra, the basic unit of storage is called a partition, or a column family.
(Cassandra, it seems, has at least two names for everything.) The partition is stored by
hashing the partition key, which makes up all or part of the primary key. All access to the
partition is by the hash value derived from its partition key—no indices needed. Within
the partition there can be multiple rows that can be stored in a particular order according
to a clustering column or clustering key (the second part of the primary key).
Take the following usage scenario:
Usage Scenario: 7 Name: Produce Active Employee Roster by
Department
Processing type: Query Frequency: Upon Request
7.1 Enter Department, for all occurrences (75 occurrences)
7.2 Find Employee occurrences where Employee STATUS = “Active”
(2,000 occurrences)
Cassandra has a stated goal of maintaining only one table per query (usage
scenario). Figure 13-10 shows how the two traditional record types supporting Usage
Scenario 7, Produce Active Employee Roster by Department, are stored as a single
Cassandra partition.
Figure 13-10. Creating a Cassandra partition
The DDL code in Figure 13-10 sets up a partition called department with the
partitioning (hashing) key department_name. Within the partition, rows of employee
information are sorted by the cluster key or cluster column, employee_name, in
employee_name order (giving Cassandra its wide-column designation). Think of a
parent-child relationship with the partition as the parent and the rows as its children.
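A minimal CQL sketch of the DDL the figure describes (the column names beyond the two keys, and the sample department value, are assumptions) might look like this:

CREATE TABLE department (
    department_name text,    -- partition (hashing) key
    employee_name   text,    -- clustering column: rows kept in employee_name order
    employee_status text,
    PRIMARY KEY ((department_name), employee_name)
);

SELECT employee_name, employee_status    -- Usage Scenario 7: one partition read returns a department's roster
FROM   department
WHERE  department_name = 'Accounting';

The STATUS = “Active” test can be applied by the application, or status can be folded into the clustering key so the rows arrive pre-filtered.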
Cassandra also supports aggregation (Table 13-7), which is used to store a limited number of repeating items.
Table 13-7. Cassandra-Supported Aggregation

Group Attribute:
CREATE TYPE address (    // a user defined data type
    street text,
    town text,
    state_province text,
    postal_code int
);

Multivalue Attribute:
CREATE TABLE customers (
    customer_id int PRIMARY KEY,
    first_name text,
    last_name text,
    phone_number set<text>    /* repeating group for multiple phone numbers */
);
The only way NoSQL can support each usage scenario with a single partition is with
large amounts of data duplication and denormalization. (A note on Cassandra replication
and duplication terminology: Replication is storing a partition in more than one node
[server]. The database designer has the option to store each partition on one, two, or up
to all servers in the server cluster. Duplication is storing the same data items multiple
times within a partition or across multiple partitions.)
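In Cassandra, the replication choice is typically declared at the keyspace level; a minimal sketch (the keyspace name and replication factor are assumptions):

CREATE KEYSPACE sales    -- every partition in this keyspace is stored on three nodes of the cluster
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};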
By capitalizing on the partition/row architecture and its use of aggregation and hashing, fueled by a liberal use of data duplication and denormalization, Cassandra could conceivably retrieve an account, all of its orders, and all of their line items (the unceremonious NoSQL fat record) with one physical I/O.
The features found in Cassandra are not unique to Cassandra. Other NoSQL systems
support similar concepts providing similar results.
Customization Notes
Customization notes are one of the most important deliverables coming out of step 4, Customization; in fact, they are among the most important deliverables in all of U3D.
The reason is that the issues raised and the decisions made during Customization are
some of the most volatile, debatable, and controversial ones of the entire design process.
Hardware changes, or a new operating system or DBMS release, can require significant
database updates on rather short notice.
Customization notes should include answers to these four questions:
• Why? If an index was added, or removed, it is important to document it. Equally important are changes that were discussed but not made, and the reasons they were not made.
• Where? The notes should reflect exactly where in the database design any new concepts were introduced or existing ones changed.
• When? Design changes and test results do not always align. Comparing test results with an incorrect state of the design can prove disastrous.
• Results? Document all results: the good, the bad, and the ugly. There are times when the bad and the ugly are more useful to future designers and DBAs than the successes. A report of a misapplied index or partition can save a successor from making the same mistake.
The designer or DBA will be well served with a robust set of customization notes.
Deliverables
Step 4, Customization, should produce the following deliverables:
4.1: Enhanced database design diagram: The final physical
database design diagram (Figure 13-11 in the next sections
shows an EPDDD for a traditional data manager)
4.2: Enhanced schema (DDL): Performance-enhanced version of the schema created in step 3, Formalization
4.3: Enhanced subschemas (DDL): Performance-enhanced version of the subschemas created in step 3, Formalization
4.4: Enhanced database object definitions: Update of all
database object definitions to reflect step 4, Customization,
changes (Figures 13-12, 13-13, and 13-14)
4.5: Customization notes: A narrative or journal created by the
database designer of the activities, issues, and decisions made
during step 4, Customization
Examples of Deliverables
Figure 13-11 shows the Enhanced database design diagram.
Figure 13-11. Enhanced database design diagram (traditional database design). Changes made to the model: clusters are created for Customer, Account, and Product; partitions are created for Customer and Product; indices are created for Customer, Corporate, Account, Order, Product, and Warehouse.
All clusters need to be documented, including the reasoning for having them.
Figure 13-12 shows the cluster definition.
Figure 13-12. Cluster definition
Partitions are simple for some database management systems, but for others they
involve considerable interaction with the computer’s operating system. This distinction
can result in different parties needing to be involved with the activity (DBA or system
programmer). The information each needs might vary, requiring the capture of different
documentation for partitioning. Figure 13-13 shows the partition definition.
Figure 13-13. Partition definition
Most databases contain a considerable number of indices, with the index count
easily eclipsing the number of record types, even more so for relational databases.
Having accurate information about each index is critical for good database maintenance.
Figure 13-14 shows the index definition.
Figure 13-14. Index definition
* * * * *
“No battle plan survives contact with the enemy,” said Prussian Chief of Staff
General Helmuth von Moltke. A more modern version of his quote might be that the
best battle plan is useless once the first shot is fired. A database design is questionable
as soon as the database is put into production unless it is monitored and tuned
starting on day 1. Considerable time and expense goes into database design; however,
the cost of keeping a live, breathing database functioning to specification is often an
undervalued, underfunded, and underperformed task. However, both the database
and its documentation need to be monitored and kept up to date if the database is to be
successful. If the database designer follows the steps laid out in U3D, then the difficult
task of maintenance should be easier and, more important, the efficacy of the database
that much greater.