TABLE OF CONTENTS Unit 1: Introduction to Database System 1.1

TABLE OF CONTENTS Unit 1: Introduction to Database System 1.1
TABLE OF CONTENTS
Unit 1: Introduction to Database System
1.1 Introduction
1.2 Objectives
1.3 Traditional file oriented approach
1.4 Motivation for database approach
1.5 Database Basics
1.6 Three views of data
1.7 The three level architecture of dbms
1.7.1 External level or subschema
1.7.2 Conceptual level or conceptual schema
1.7.3 Internal level or physical schema
1.7.4 Mapping between different levels
1.8 Database management system facilities
1.8.1 Data definition language
1.8.2 Data manipulation language
1.9 Elements of a database management system
1.9.1 Dml precompiled
1.9.2 Ddl compiler
1.9.3 File manager
1.9.4 Database manager
1.9.5 Query processor
1.9.6 Database administrator
1.9.7 Data dictionary
1.10 Advantages and disadvantages of dbms
1.11 Self test
1.12 Summary
Unit 2: Database Models
2.1
Introduction
2.2
Objectives
2.3
File management system
2.4
Entity-relationship (e-r) diagram
2.4.1 Intrdoction of ERD
2.4.2 Entity-relationship diagram
2.4.3 Generalization and aggregation
2.4.3.1 Aggregation
2.5
The hierarchical model
2.6 The network model
2.7 The relational model
2.8 Advantages and disadvantages of relational approach
2.9 An example of a relational model
2.10 Self test
1
Unit 3:
Unit 4:
Unit 5:
Unit 6:
2.11
Summary
File Organisation For dbms
3.1 Introduction
3.2 Objectives
3.3 File organization
3.4 Sequential file organisation
3.4.1 Index-sequential file organization
3.4.1.1 Types of indexes
3.5 B-trees
3.5.1 Advantages of b-tree indexes
3.6 Direct file organization
3.7 Need for the multiple access path
3.8 Self test
3.9 Summary
Representing Data Elements
4.1 Data elements and fields
4.2 Representing relational database elements
4.3 Records
4.4 Representing block and record addresses
4.5 Client-server systems
4.6 Logical and structured addresses
4.7 Record modifications
4.8 Index structures
4.9 Indexes on sequential files
4.10 Secondary indexes
4.11 B-trees
4.12 Hash tables
4.13 Self Test
Relational Model
5.1 Introduction
5.2 Objectives
5.3 Concepts of a relational model
5.4 Formal definition of a relation
5.5 The codd commandments
5.6 Summary
Normalization
6.1 Functional dependency
6.2 Normalization
6.2.1 First normal form
6.2.2 Second normal form
6.2.3 Third normal form
6.2.4 Boyce-codd normal form
6.2.5 Multi-valued dependency
6.2.6 Fifth normal form
6.3 Self test
6.4 Summary
2
Unit 7: Structured Query Language
7.1 Introduction of sql
7.2 Ddl statements
7.3 Dml statements
7.4 View definitions
7.5 Constraints and triggers
7.6 Keys and foreign keys
7.7 Constraints on attributes and tuples
7.8 Modification of constraints
7.9 Cursors
7.10 Dynamic sql
Unit 8: Relational Algebra
8.1 Basics of relational algebra
8.2 Set operations on relations
8.3 Extended operators of relational algebra
8.4 Constraints on relations
8.5 Self test
8.6 Summary
Unit 9: Management Considerations
9.1 Introduction
9.2 Objectives
9.3 Organisational resistance to dbms tools
9.4 Conversion from an old system to a new system
9.5 Evaluation of a dbms
9.6 Administration of a database management system
9.7 Self test
9.8 Summary
Unit 10: Concurrency Control
10.1 Serial and serializability schedules
10.2 Conflict-serializability
10.3 Enforcing serializability by locks
10.4 Locking systems with several lock modes
10.5 Architecture for a locking scheduler
10.6 Managing hierarchies of database elements
10.7 Concurrency control by timestamps
10.8 Concurrency control by validation
10.9 Summary
Unit 11: Transaction Management
11.1 Introduction of transaction management
11.2 Serializability and recoverability
11.3 View serializability
11.4 Resolving deadlocks
11.5 Distributed databases
11.6 Distributed commit
3
11.7 Distributed locking
11.8 Summary
4
UNIT 1
INTRODUCTION TO DATA BASE SYSTEM
Unit 1: Introduction to Database System
1.1
Introduction
1.2
Objectives
1.3
Traditional file oriented approach
1.4
Motivation for database approach
1.5
Database basics
1.6
Three views of data
1.7
The three level architecture of dbms
1.7.1 External level or subschema
1.7.2 Conceptual level or conceptual schema
1.7.3 Internal level or physical schema
1.7.4 Mapping between different levels
1.8
Database management system facilities
1.8.1.1 Data definition language
1.8.1.2 Data manipulation language
1.8.1.3 Data query language
1.8.1.4 Data control language
1.8.1.5 Transaction control language
1.9
Elements of a database management system
1.9.1 Dml precompiled
1.9.2 Ddl compiler
1.9.3 File manager
1.9.4 Database manager
1.9.5 Query processor
1.9.6 Database administrator
1.9.7 Data dictionary
1.10 Advantages and disadvantages of dbms
1.11 Self test
1.12 Summary
1.1 INTRODUCTION
Database Management is an important aspect of data processing. It involves, several data models
evolving into different DBMS software packages. These packages demand certain knowledge in
discipline and procedures to effectively use them in data processing applications.
5
We need to understand the relevance and scope of Database in the Data processing area. This we do
by first understanding the properties and characteristics of data and the nature of data organization.
Data structure can be defined as specification of data. Different data structures like array, stack,
queue, tree and graph are used to implement data organization in main memory. Several strategies
are used to support the organization of data in secondary memory.
A database is a collection of related information stored in a manner that it is available to many users
for different purposes. The content of a database is obtained by combining data from all the different
sources in an organization. So that data are available to all users and redundant data can be
eliminated or at least minimized. The DBMS helps create an environment in which end user have
better access to more and better managed data than they did before the DBMS become the data
management standard.
A database can handle business inventory, accounting information in its files to prepare summaries,
estimates, and other reports. There can be a database, which stores new paper articles, magazines,
books, and comics. There is already a well-defined market for specific information for highly selected
group of users on almost all subjects. The database management system is the major software
component of a database system. Some commercially available DBMS are INGRES, ORACLE, and
Sybase. A database management system, therefore, is a combination of hardware and software that
can be used to set up and monitor a database, and can manage the updating and retrieval of
database that has been stored in it. Most database management systems have the following
facilities/capabilities:
•
Creating of a file, addition to data, deletion of data, modification of data; creation, addition
and deletion of entire files.
•
Retrieving data collectively or selectively.
•
The data stored can be sorted or indexed at the user's discretion and direction.
Various reports can be produced from the system. These may be either standardized report or that
may be specifically generated according to specific user definition.
Mathematical functions can be performed and the data stored in the database can be manipulated
with these functions to perform the desired calculations.
•
To maintain data integrity and database use.
•
To create an environment for Data warehousing and Data mining.
The DBMS interprets and processes users' requests to retrieve information from a database. The
following figure shows that a DBMS serves as an interface in several forms. They may be keyed
directly from a terminal, or coded as high-level language programs to be submitted for interactive or
6
batch processing. In most cases, a query request will have to penetrate several layers of software in
the DBMS and operating system before the physical database can be accessed.
1.2 OBJECTIVES
After going through this unit, you should be able, to Appreciate the limitations of the traditional
approach to application system development; Give reasons why the database approach is now being
increasingly adopted; — Discuss different views of data; List the components of a database
management system; Enumerate the feature/capabilities of a database management system; and List
several advantages and disadvantages of DBMS.
1.3 TRADITIONAL FILE ORIENTED APPROACH
The traditional file-oriented approach to information processing has for each application a separate
master file and its own set of personal files. An organization needs flow of information across these
applications also and this requires sharing of data, which is significantly lacking in the traditional
approach. One major limitations of such a file-based approach is that the programs become
dependent on the files and the files become dependent upon the programs.
Disadvantages
•
Data Redundancy: The same piece of information may be stored in two or more files. For
example, the particulars of an individual who may be a customer or client may be stored in
two or more files. Some of this information may be changing, such as the address, the
payment maid, etc. It is therefore quite possible that while the address in the master file for
one application has been updated the address in the master file for another application may
have not been. It may be not easy to even find out as to in how many files the repeating items
such as the name occur.
•
Program/Data Dependency: In the traditional approach if a data field is to be added to a
master file, all such programs that access the master file would have to be changed to allow
for this new field which would have been added to the master record.
•
Lack of Flexibility: In view of the strong coupling between the program and the data, most
information retrieval possibilities would be limited to well-anticipated and pre-determined
requests for data, the system would normally be capable of producing scheduled records and
queries which it has been programmed to create.
1.4. MOTIVATION FOR DATABASE APPROACH
Having pointed out some difficulties that arise in a straightforward file-oriented approach towards
information system development. The work in the organization may not require significant sharing of
data or complex access. In other words the data and the way it is used in the functioning of the
organization are not appropriate to database processing. Apart from needing a more powerful
hardware platform, the software for database management systems is also quite expensive. This
means that a significant extra cost has to be incurred by an organization if it wants to adopt this
approach.
7
Advantages gained by the possibility of sharing of the data with others, also carries with it the risk of
unauthorized access of data. This may range from violation of office procedures to violation of privacy
rights of information to down right thefts. The organizations, therefore, have to be ready to cope with
additional managerial problems.
A database management processing system is complex and it could lead to a more inefficient system
than the equivalent file-based one.
The use of the database and its possibility of being shared will, therefore affect many departments
within the organization. If die integrity of the data is not maintained, it is possible that one relevant
piece of data could have been used by many programs in different applications by different users
without they are being aware of it. The impact of this therefore may be very widespread. Since data
can be input from a variety sources, the control over the quality of data become very difficult to
implement.
However, for most large organization, the difficulties in moving over to a database approach are still
worth getting over in view of the advantages that are gained, namely, avoidance of data duplication,
sharing of data by different programs, greater flexibility and data independence.
1.5 DATABASE BASICS
Since the DBMS of an organization will in some sense reflect the nature of activities in the
organization, some familiarity with the basic concepts, principles and terms used in the field are
important.
•
Data-items: The term data item is the word for what has traditionally been called the field in
data processing and is the smallest unit of data that has meaning to its users. The phrase
data element or elementary item is also sometimes used. Although the data item may be
treated as a molecule of the database, data items are grouped together to form aggregates
described by various names. For example, the data record is used to refer to a group of data
items and a program usually reads or writes the whole records. The data items could
occasionally be further broken down into what may be called an automatic level for processing
purposes.
•
Entities and Attributes: The real world would consist of occasionally a tangible object such as
an employee; a component in an inventory or a space or it may be intangible such as an
event, a job description, identification numbers, or an abstract construct. All such items
about which relevant information is stored in the database are called Entities. The qualities of
the entity that we store as information are called the attributes. An attribute may be
expressed as a number or as a text. It may even be a scanned picture, a sound sequence, and
a moving picture that is now possible in some visual and multi-media databases.
Data processing normally concerns itself with a collection of similar entities and records
information about the same attributes of each of them. In the traditional approach, a
programmer usually maintains a record about each entity and a data item in each record
relates to each attribute. Similar records are grouped into files and such a 2-dimensional
array is sometimes referred to as a flat file.
•
Logical and Physical Data: One of the key features of the database approach is to bring about
a distinction between the logical and the physical structures of the data. The term logical
structure refers to the way the programmers see it and the physical structure refers to the
8
way data are actually recorded on the storage medium. Even in the early stages of records
stored on tape, the length of the inter-record tape requires that many logical records be
grouped into one physical record to several storage places on tape. It was the software, which
separated them when used in an application program, and combined them again before
writing back on tape. In today's system the complexities are even greater and as will be seen
when one is referring to distributed databases that some records may physically be located at
significantly remote places.
•
Schema and Subschema: Having seen that the database does not focus on the logical
organization and decouples it from the physical representation of data, it is useful to have a
term to describe the logical database description. A schema is a logical database description
and is drawn as a chart of the types of data that are used. It gives the names of the entities
and attributes, and specifies the relationships between them. It is a framework into which the
values of the data item can be fitted. Like an information display system such as that giving
arrival and departure time at airports and railway stations, the schema will remain the same
though the values displayed in the system will change from time to time. The relationships
that has specified between the different entities occurring in the schema may be a one to one,
one to many, many to many, or conditional.
The term schema is used to mean an overall chart of all the data item types and record-types
stored in a database. The term sub schema refers to the same view but for the data-item types
and record types which a particular user uses in a particular application or. Therefore, many
different sub schemas can be derived from one schema.
•
Data Dictionary: It holds detailed information about the different structures and data types:
the details of the logical structure that are mapped into the different structure, details of
relationship between data items, details of all users privileges and access rights, performance
of resource with details.
1.6 THREE VIEWS OF DATA
DBMS is a collection of interrelated files and a set of programs that allow several users to access and
modify these files. A major purpose of a database system is to provide users with an abstract view of
the data. However, in order for the system to be usable, data must be retrieved efficiently.
The concern for efficiently leads to the design of complex data structure for the representation of data
in the database. By defining levels of abstract as which the database may be viewed, there are logical
view or external, conceptual view and internal view or physical view.
•
External view: This is the highest level of abstraction as seen by a user. This level of
abstraction describes only the part of entire database.
•
Conceptual view: This is the next higher level of abstraction which is the sum total of Data
Base Management System user's views. In this we consider; what data are actually stored in
the database. Conceptual level contains information about entire database in terms of a small
number of relatively simple structures.
•
Internal level: The lowest level of abstraction at which one describes how the data are
physically stored. The interrelationship of any three levels of abstraction is illustrated in figure
2.
9
Fig: The three views of data
To illustrate the distinction among different views of data, it can be compared with the concept of
data types in programming languages. Most high level programming language such as C, VC++, etc.
support the notion of a record or structure type. For example in the ‘C’ language we declare structure
(record) as follows:
struct
Emp{
char
name
[30];
char
address
[100];
}
This defines a new record called Emp with two fields. Each field has a name and data type associated
with it.
In an Insurance organization, we may have several such record types, including among others:
-Customer with fields name and Salary
-Premium paid and Due amount at what date
-Insurance agent name and salary + Commission
At the internal level, a customer, Premium account, or employee (insurance agent) can be described
as a sequence of consecutive bytes. At the conceptual level each such record is described by a type
definition, illustrated above and also die interrelation among these record types is defined and
describing the rights or privileges assign to individual customer or end-users. Finally at the external
level, we define several views of the database. For example, for preparing the Insurance checks of
Customer_details’, only information about them is required; one does not need to access information
10
about customer accounts. Similarly, tellers can access only account information. They cannot access
information concerning about the premium paid or amount received.
1.7 THE THREE LEVEL ARCHITECTURE OF DBMS
A database management system that provides these three levels of data is said to follow three-level
architecture as shown in fig. . These three levels are the external level, the conceptual level, and the
internal level.
Fig : The three level architecture for a DBM
A schema describes the view at each of these levels. A schema as mentioned earlier is an outline or a
plan that describes the records and relationships existing in the view. The schema also describes the
way in which entities at one level of abstraction can be mapped to the next level. The overall design of
the database is called the database schema. A database schema includes such information as:
Characteristics of data items such as entities and attributes Format for storage representation
Integrity parameters such as physically authorization and backup politics. Logical structure and
relationship among those data items
The concept of a database schema corresponds to programming language notion of type definition. A
variable of a given type has a particular value at a given instant in time. The concept of the value of a
variable in programming languages corresponds to the concept of an instance of a database schema.
Since each view is defined by a schema, there exists several schema in the database and these exists
several schema in the database and these schema are partitioned following three levels of data
abstraction or views. At the lower level we have the physical schema; at the intermediate level we
have the conceptual schema, while at the higher level we have a subschema. In general, database
system supports one physical schema, one conceptual schema, and several sub-schemas.
1.7.1 External Level or Subschema
The external level is at the highest level of database abstraction where only those portions of the
database of concern to a user or application program are included. Any number of user views may
exist for a given global or conceptual view.
11
Each external view is described by means of a schema called an external schema or subschema. The
external schema consists of the, definition of the logical records and the relationships in the external
view. The external schema also contains the method of deriving the objects in the external view from
the objects in the conceptual view. The objects include entities, attributes, and relationships.
1.7.2 Conceptual Level or Conceptual Schema
One conceptual view represents the entire database. The conceptual schema defines this conceptual
view. It describes all the records and relationships included in the conceptual view and, therefore, in
the database. There is only one conceptual schema per database. This schema also contains the
method of deriving the objects in the conceptual view from the objects in the internal view.
The description of data at this level is in a format independent of its physical representation. It also
includes features that specify the checks to retain data consistency and integrity.
1.7.3 Internal Level or Physical Schema
It indicates how the data will be stored and describes the data structures and access methods to be
used by the database. The internal schema, which contains the definition of the stored record, the
method of representing the data fields, expresses the internal view and the access aids used.
1.7.4 Mapping between different Levels
Two mappings are required in a database system with three different views. A mapping between the
external and conceptual level gives the correspondence among the records and the relationships of
the external and conceptual levels.
a) EXTERNAL to CONCEPTUAL: Determine how the conceptual record is viewed by the user
b) INTERNAL to CONCEPTUAL: Enable correspondence between conceptual and internal levels. It
represents how the conceptual record is represented in storage.
An internal record is a record at the internal level, not necessarily a stored record on a physical
storage device. The internal record of figure 3 may be split up into two or more physical records. The
physical database is the data that is stored on secondary storage devices. It is made up of records
with certain data structures and organized in files. Consequently, there is an additional mapping
from the internal record to one or more stored records on secondary storage devices.
1.8 DATABASE MANAGEMENT SYSTEM FACILITIES
Two main types of facilities are supported by the DBMS:
- The data definition facility or data definition language (DDL).
- The data manipulation facility or data manipulation language (DML).
- The data query facility or data query language [DQL].
- The data control facility or data control language [DCL].
- The transaction control facility or data control language [TCL].
1.8.1 Data Definition Language
12
Data Definition Language is a set of SQL commands used to create, modify and delete database
structures (not data). These commands wouldn't normally be used by a general user, who should be
accessing the database via an application. They are normally used by the DBA (to a limited extent), a
database designer or application developer. These statements are immediate; they are not susceptible
to ROLLBACK commands. You should also note that if you have executed several DML updates then
issuing any DDL command will COMMIT all the updates as every DDL command implicitly issues a
COMMIT command to the database. Anybody using DDL must have the CREATE object privilege and
a Tablespace area in which to create objects.
1.8.2 Data Manipulation Language
DML is a language that enables users to access or manipulate as organized by the appropriate data
model. Data manipulation involves retrieval of data from the database, insertion of new data into the
database, and deletion or modification of existing data. The first of these data manipulation
operations is called a query. A query is a statement in the DML that requests the retrieval of data
from the database. The DML provides commands to select and retrieve data from the database.
Commands are also provided to insert, update, and delete records.
There are basically two types of DML:
— Procedural: which requires a user to specify what data is needed and how to get it
— Nonprocedural: which requires a user to specify what data is needed without specifying how to get it
9. ELEMENTS OF A DATABASE MANAGEMENT SYSTEM
The major components of a DBMS are explained below:
1.9.1 DML Precompiled
It converts DML statement embedded in an application program to normal procedure calls in the host
language. The precompiled must interact with the query processor in order to generate the
appropriate code.
1.9.2 DDL Compiler
The DDL compiler converts the data definition statements into a set of tables. These tables contain
information concerning the database and are in a form that can be used by other components of the
DBMS.
1.9.3 File Manager
File manager manages the allocation of space on disk storage and the data structure used to
represent information stored on disk. The file manager can be implemented using an interface to the
existing file subsystem provided by the operating system of the host computer or it can include a file
subsystem written especially for the DBMS.
1.9.4 Database Manager
13
Databases typically require a large amount of storage space. Corporate databases are usually
measured in terms of gigabytes of data. Since the main memory of computers cannot store this
information, it is stored on disks. Data is moved between disk storage and main memory as needed.
Since the movement of data to and from disk is slow relative to the speed of control processing unit of
computers, it is imperative that database system structure data so as to minimize the need to move
data between disk and main memory. A database manager is a program module, which provides the
interface between the low level data stored in the database and the application programs and queries
submitted to the system. It is responsible for interfacing with file system.
One of the functions of database manager is to convert user's queries coming directly via the query
processor or indirectly via an application program from the user's logical view to the physical file
system. In addition, database manager also performs the tasks of enforcing constraints to maintain
the consistency and integrity of the data as well as its security. Synchronizing the simultaneous
operations performed by concurrent users is under the control of the data manager. It also performs
backup and recovery operations.
1.9.5 Query Processor
The database user retrieves data by formulating a query in the data manipulation language provided
with the database. The query processor is used to interpret the online user's query and convert it into
an efficient series of operations in a form capable of being sent to the data manager for execution.
The query processor uses the data dictionary to find the structure of the relevant portion of the
database and uses this information in modifying the query and preparing an optimal plan to access
the database.
1.9.6 Database Administrator
Data administration is a high level function that is responsible for overall management of data
resources in an organization including maintaining corporate wide data definitions and standards. It
is a technical function that is responsible for physical database design and for dealing with technical
issue such as security enforcement, Database performance, backup, and recovery. The person having
such control over the system is called the database administrator (DBA). The DBA administers the
three levels of the database and, in consultation with the overall user community, sets up the
definition of the global view or conceptual level of the database. The DBA further specifies the
external view of the various users and applications and is responsible for the definition and
implementation of the internal level, including the storage structure and access methods to be used
for the optimum performance of the DBMS. Changes to any of the three levels necessitated by
changes or growth in the organization and/or emerging technology are under the control of the DBA.
Mappings between the internal and the conceptual levels, as well as between the conceptual and
external levels, are also defined by the DBA. Ensuring that appropriate measures are in place to
maintain the integrity of the database and that the database is not accessible to unauthorized users
is another responsibility. The DBA is responsible for granting permission to the users of the database
and stores the profile of each user in the database. This profile describes the permissible activities of
a user on that portion of the database accessible to the user via one or more user views. The user
profile can be used by the database system to verify that a particular user can perform a given
operation on the database.
14
The DBA is also responsible for defining procedures to recover the database from failures due to
human, natural, or hardware causes with minimal loss of data. This recovery procedure should
enable the organization to continue to function and the intact portion of the database should
continue to be available.
Let us summarize the functions of DBA are:
Schema definition: The creation of the original database schema. This is accomplished by writing a
set of definition, which are translated by the DDL compiler to a set of tables that are permanently
stored in the data dictionary.
Storage Structure and access method definition: The creation of appropriate storage structure and
access method. This is accomplished by writing a set of definitions, which are translated by the data
storage and definition language compiler.
Schema and Physical organization modification: Either the modification of the database schema or
the description of the physical storage organization. These changes, although relatively rare, are
accomplished by writing a set of definition which is used by either the DDL compiler or the data
storage and definition language compiler to generate modification to the appropriate internal system
tables (for example the data dictionary).
1.9.7 Data Dictionary
Keeping a track of all the available names that are used and the purpose for which they were used
becomes more and more difficult. Of course it is possible for a programmer who has coined the
available names to bear them in mind, but should the same author come back to his program after a
significant time or should another programmer have to modify the program, it would be found that it
is extremely difficult to make a reliable account of for what purpose the data files were used.
The problem becomes even more difficult when the number of data types that an organization has in
its database increased. It has also now perceived that the data of an organization is a valuable
corporate resource and therefore some kind of an inventory and catalogue of it must be maintained
so as to assist in both the utilization and management of the resource.
It is for this purpose that a data dictionary or dictionary/directory is emerging as a major tool. An
inventory provides definitions of things. A directory tells you where to find them. A data
dictionary/directory contains information (or data) about the data. A comprehensive data dictionary
would provide the definition of data item, how they fit into the data structure, and how they relate to
other entities in the database.
The DBA uses the data dictionary in every phase of a database life cycle, starting from the embryonic
data-gathering phase to the design, implementation, and maintenance phases. Documentation
provided by a data dictionary is as valuable to end users and managers as it provided by a data
dictionary is as valuable to end users and managers as it are essential to the programmers. Users
can plan their applications with the database only if they know exactly what is stored in it. For
15
example, the description of a data item in a data dictionary may include its origin and other text
description in plain English, in addition to its data format. Thus users and managers will be able to
see exactly what is available in the database. You could consider a data dictionary to be a road map,
which guides users to access information within a large database.
Fig: DBMS Structure
A data dictionary is implemented as a database so that users can query its content by either
interactive or batch processing. Whether or not the cost of acquiring a data dictionary system is
justifiable depends on the size and complexity of the information system. The cost effectiveness of a
data dictionary increases as the complexity of an information system increases. A data dictionary can
be a great asset not only to the DBA for database design, implementation, and maintenance, but also
to managers or end users in their project planning. Figure 4 shows these components and the
connection among them.
1.10. Advantages and Disadvantages of DBMS
Database system is that the organization can exert, via the DBA, centralized management and control
over the data. The database administrator is the focus of the centralized control. Any application
requiring a change in the structure of a data record requires an arrangement with the DBA, who
makes the necessary modifications. Such modifications do not effect other applications or users of
the record in question. Therefore, these changes meet another requirement of the DBMS: data
independence. The following are the important advantages of DBMS:
Advantages
16
Reduction of Redundancies: Centralized control of data by the DBA avoids unnecessary duplication
of data and effectively reduces the total amount of data storage required. It also eliminates the extra
processing necessary to trace the required data in a large mass of data. Another advantage of
avoiding duplication is the elimination of the inconsistencies that tend to be present in redundant
data files. Any redundancies that exist in the DBMS are controlled and the system ensures that these
multiple copies are consistent.
Sharing Data: A database allows the sharing of data under its control by any number of application
programs or users.
Data Integrity: Centralized control can also ensure that adequate checks are incorporated in the
DBMS to provide data integrity. Data integrity means that the data contained in the database is both
accurate and consistent. Therefore, data values being entered for storage could be checked to ensure
that they fall within a specified range and are of the correct format. For example, the value for the age
of an employee may be in the range of 16 and 75. Another integrity check that should be
incorporated in the database is to ensure that if there is a reference to certain object, that object
must exist. In the case of an automatic teller machine, for example, a user is not allowed to transfer
funds from a nonexistent saving account to a checking account.
Data Security: Data is of vital importance to an organization and may be confidential. Unauthorized
persons must not access such confidential data. The DBA who has the ultimate responsibility for the
data in the DBMS can ensure that proper access procedures are followed, including proper
authentication schemas for access to the DBMS and additional checks before permitting access to
sensitive data. Different levels of security could be implemented for various types of data and
operations.
Conflict Resolution: DBA chooses the best file structure and access method to get optimal
Performance for the response-critical applications, while permitting less critical applications to
continue to use die database, albeit with a relatively slower response.
Data Independence: Data independence is usually considered from two points of view: physical data
independence and logical data independence. Physical data independence allows changes in the
physical storage devices or organization of the files to be made without requiring changes in the
conceptual view or any of the external views and hence in the application programs using the
database.
Disadvantages
A significant disadvantage of the DBMS system is cost. In addition to the cost of purchasing or
developing the software, the hardware has to be upgraded to allow for the extensive programs and the
workspaces required for their execution and storage. The processing overhead introduced by the
DBMS to implement security, integrity, and sharing of the data causes a degradation of the response
and through-put times. An additional cost is that of migration from a traditionally separate
application environment to an integrated one.
While centralization reduces duplication, the lack of duplication requires that the database be
adequately backed up so that in the case of failure the data can be recovered. Backup and recovery
operations are fairly complex in a DBMS environment, and this is exacerbated in a concurrent multi-
17
user database system. In further a database system requires a certain amount of controlled
redundancies and duplication to enable access to related data items.
Centralization also means that the data is accessible from a single source namely the database. This
increases the potential severity of security breaches and disruption of the operation of the
organization because of downtimes and failures. The replacement of a monolithic centralized
database by a federation of independent and cooperating distributed databases resolves some of the
problems resulting from failures and downtimes.
1.11 Self Test
1)
Define the term database?
2)
Explain levels of database with the help of suitable example
3)
Explain components of database system.
4)
Explain elements of DBMS.
1.12 Summary
•
A database is a collection of related information stored in a manner that it is available to
many users for different purposes. The content of a database is obtained by combining data
from all the different sources in an organization. So that data are available to all users and
redundant data can be eliminated or at least minimized. The DBMS helps create an
environment in which end user have better access to more and better managed data than they
did before the DBMS become the data management standard.
•
The traditional file-oriented approach to information processing has for each application a
separate master file and its own set of personal files. An organization needs flow of
information across these applications also and this requires sharing of data, which is
significantly lacking in the traditional approach.
•
A database management processing system is complex and it could lead to a more inefficient
system than the equivalent file-based one. The use of the database and its possibility of being
shared will, therefore affect many departments within the organization. If die integrity of the
data is not maintained, it is possible that one relevant piece of data could have been used by
many programs in different applications by different users without they are being aware of it.
The impact of this therefore may be very widespread. Since data can be input from a variety
sources, the control over the quality of data become very difficult to implement.
•
Data-items: The term data item is the word for what has traditionally been called the field in
data processing and is the smallest unit of data that has meaning to its users.
Entities and Attributes: The real world would consist of occasionally a tangible object such as
an employee; a component in an inventory or a space or it may be intangible such as an
event, a job description, identification numbers, or an abstract construct. All such items
about which relevant information is stored in the database are called Entities.
•
Logical and Physical Data: One of the key features of the database approach is to bring about
a distinction between the logical and the physical structures of the data.
•
A schema is a logical database description and is drawn as a chart of the types of data that
are used. It gives the names of the entities and attributes, and specifies the relationships
between them. It is a framework into which the values of the data item can be fitted. Like an
information display system such as that giving arrival and departure time at airports and
18
railway stations, the schema will remain the same though the values displayed in the system
will change from time to time.
19
UNIT 2
DATABASE MODELS
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
Introduction
Objectives
File management system
Entity-relationship (e-r) diagram
2.4.1 Intrdoction of erd
2.4.2 Entity-relationship diagram
2.4.3 Generalization and aggregation
2.4.3.1 Aggregation
The hierarchical model
The network model
The relational model
Advantages and disadvantages of relational approach
An example of a relational model
Self test
Summary
2.1 INTRODUCTION
Data modeling is the analysis of data objects that are used in a business or other context and the
identification of the relationships among these data objects. Data modeling is a first step in doing
object-oriented programming. As a result of data modeling, you can then define the classes that
provide the templates for program objects.
A simple approach to creating a data model that allows you to visualize the model is to draw a square
(or any other symbol) to represent each individual data item that you know about (for example, a
product or a product price) and then to express relationships between each of these data items with
words such as "is part of" or "is used by" or "uses" and so forth. From such a total description, you
can create a set of classes and subclasses that define all the general relationships. These then
become the templates for objects that, when executed as a program, handle the variables of new
transactions and other activities in a way that effectively represents the real world.
2.2 OBJECTIVES
After going through this unit, you should be able to:
•
Identify the structures in the different models of DBMS.
•
Convert any given data base situation to a hierarchical or relational model.
•
Discuss entity-relationship model.
20
•
State the essential features of the relational model; and
•
Discuss implementation issues of all the three models.
2.3 FILE MANAGEMENT SYSTEM
In the early days of data processing, all files were flat files. A flat file is one where each record
contains the same types of data items. One or more of these data items are designated as the key and
is used for sequencing the file and for locating and grouping records by sorting and indexing. All
these types of structures can be closed. As either trees or clause structures. However, it may be
borne in mind that all these complicated file structure can be broken down into groups of flat files
with redundant data item.
An FMS consists of a number of application programs. Because productivity enhancement in using
an FMS compared to a conventional high-level language is about 10 to 1, programmers use it. But
the case of use of an FMS also encourages end users with no previous programming experience to
perform queries with special FMS language. One of the more well known in this regard is RPG (Report
Program Generator), which was very popular for generating routine business reports. In order to use
the RPG the user would define the input fields required by filling out an input specification format.
Similarly output formats can be specified by filling out an output specification forms. The possibility
of giving a certain structure to the output and the availability of default options made the package
relatively easy to learn and use. Some well-known examples of such FMS packages are Mark-4, Data
tree, easy tree, and powerhouse. The structure of an FMS is diagrammatically illustrated below:
Fig: File management system
The FMS relies upon the basic access methods of the host operating system for data management.
But it may have its own special language to be used in performing the retrievals. This language in
some ways is more powerful than the standard high level programming languages in the way they
define the data and development applications. Therefore, the file management system may be
considered to be a level higher than mere high-level languages.
Therefore some of the advantages of FMS compared to standard high level language are:
21
•
Less software development cost - Even by experienced programmers it takes months or years
in developing a good software system in high level language.
•
Support of efficient query facility - on line queries for multiple-key retrievals are tedious to
program.
Of course one could bear in mind the limitations Of an FMS in the sense FMS cannot handle
complicated mathematical operation and array manipulations. In order to remedy the situation some
FMS provide an interface to call other programs written in a high level language or an assembly
language.
Another limitation of FMS is that for data management and access it is restricted to basic access
methods. The physical and logical links required between different files to be able to cope with
complex multiple key queries on multiple files is not possible. Even though FMS is a simple, powerful
tool it cannot replace the high level language, nor can it perform complex information retrieval like
DBMS. It is in this context that reliance on a good database management system become essential.
2.4 ENTITY-RELATIONSHIP (E-R) DIAGRAM
2.4.1 Introduction of ERD
The entity-relationship model is a tool for analyzing the semantic features of an application that are
independent of events. Entity-relationship modeling helps reduce data redundancy. This approach
includes a graphical notation which depicts entity classes as rectangles, relationships as diamonds,
and attributes as circles or ovals. For complex situations a partial entity-relationship diagram may be
used to present a summary of the entities and relationships but not include the details of the
attributes.
The entity-relationship diagram provides a convenient method for visualizing the interrelationships
among entities in a given application. This tool has proven useful in making the transition from an
information application description to a formal database schema. The entity-relationship model is
used for describing the conceptual scheme of an enterprise without attention to the efficiency of the
physical database design. The entity-relationship diagrams are later turned into a conceptual
schema in one of the other models in which the database is actually implemented.
Following are short definitions of some of the basic terms that are used for describing important
entity-relationship concepts:
• Entity - An entity is a thing that exists and is distinguishable -- an object, something in the
environment.
o Entity Instance - An instance is a particular occurrence of an entity. For example, each person
is an instance of an entity, each car is an instance of an entity, etc.
o Entity Set - A group of similar entities is called an entity class or entity class or entity type. An
Entity Class has common properties.
• Attributes - Attributes describe properties of entities and relationships.
o Simple (Scalars) - smallest semantic unit of data, atomic (no internal structure)- singular e.g. city
o Composite - group of attributes e.g. address (street, city, state, zip)
22
o
o
Multivalued (list) - multiple values e.g. degrees, courses, skills (not allowed in first normal form)
Domain - conceptual definition of attributes
a named set of scalar values all of the same type
a pool of possible values
• Relationships
A
relationship
is
a
connection
between
entity
classes.
For example, a relationship between PERSONS and AUTOMOBILES could be an "OWNS"
relationship. That is to say, automobiles are owned by people.
o The degree of a relationship indicates the number of entities involved.
o The cardinality of a relationship indicates the number of instances in entity class E1 that can or
must be associated with instances in entity class E2
One-One Relationship - For each entity in one class there is at most one associated entity in the
other class. For example, for each husband there is at most one current legal wife (in this country
at least). A wife has at most one current legal husband.
A One-to-One Association
Student
Number
ID
Student Name
524-77-9870
Tom Jones
543-87-4309
Suzy Smart
553-67-3965
Joe Blow
...
...
STUDENT-ID-NUMBER (one to one) STUDENT
* Many-One Relationships - One entity in class E2 is associated with zero or more entities in class
E1, but each entity in E1 is associated with at most one entity in E2. For example, a woman may
have many children but a child has only one birth mother.
In a one-to-many association, each member of domain B is assigned to a unique element of domain
A, but each element of domain A may be assigned to several elements from domain B. For instance,
each of the students may have a single major (accounting, finance, etc.), though each major may
contain several students. This is a "one-to-many" association between majors and students, or a
"many-to-one" association between students and majors. A one-to-many association differs from a
many-to-one association only in the order of the involved domains. Figure C9 shows a many-to-one
association, which may be written as:
Student ID Number
Major
524-77-9870
Accounting
543-87-4309
Accounting
553-67-3965
Finance
...
...
MAJORS (one to many) STUDENTS
Finally, many-to-many associations are those in which neither participant is assigned to a unique
partner. The relationship between students and classes is an example. Each student may take many
23
different classes. Similarly, each class may be taken by many students. There is no limitation on
either participant. This association is written:
STUDENTS (many to many) CLASSES
•
Many-Many Relationships - There are no restrictions on how many entities in either class are
associated with a single entity in the other. An example of a many-to-many relationship would
be students taking classes. Each student takes many classes. Each class has many students.
Can be mandatory, optional or fixed.
o
Isa Hierarchies - A special type of relationship that allows attribute inheritance. For
example, to say that a truck isa automobile and an automobile has a make, model and
serial number implies that a truck also has a make, model and serial number.
•
Keys - The Key uniquely differentiates one entity from all others in the entity class. A key is an
identifier.
o
Primary Key - Identifier used to uniquely identify one particular instance of an
entity.
Can be one or more attributes. (consider substituting a single concatenated key
attribute for multiple attribute key)
Must be unique within the domain (not just the current data set)
Value should not change over time
Must always have a value
Created when no obvious attribute exists. Each instance is assigned a value.
o
Candidate Key - when multiple possible identifiers exist, each is a candidate key
o
Concatenated Key - Key made up of parts which when combined become a unique
identifier. Multiple attribute keys are concatenated keys.
o
Borrowed Key Attributes - If an isa relationship exists, the key to the more general
class is also a key to the subclass of items. For example, if serial number is a key for
automobiles it would also be a key for trucks.
o
Foreign Keys - Foreign keys reference a related table through the primary key of that
related
table.
Referential Integrity Constraint - For every value of a foreign key there is a primary
key with that value in the referenced table e.g. if account number is to be used in a
table for journal entries then that account number must exist in the chart of accounts
table.
•
Events - Events change entities or relationships.
2.4.2 Entity-Relationship Diagram
Symbols used in entity-relationship diagrams include:
•
Rectangles represent ENTITY CLASSES
•
Circles represent ATTRIBUTES
•
Diamonds represent RELATIONSHIPS
•
Arcs - Arcs connect entities to relationships. Arcs are
also used to connect attributes to entities. Some styles
of entity-relationship diagrams use arrows and double
arrows to indicate the one and the many in
relationships. Some use forks etc.
24
• Underline - Key attributes of entities are underlined.
Fragment of an entity relationship diagram.
Entity Relationship Diagram for a simple Accounts Receivable database
2.4.3 Generalization and aggregation
There are two main abstraction mechanisms used to model information: Generalization and
aggregation. Generalization is the abstracting process of viewing set of objects as a single general
25
class by concentrating on the general characteristics of the constituent sets while suppressing or
ignoring their differences. It is the union of a number of lower-level entity types for the purpose of
producing a higher-level entity type. For instance, student is a generalization of graduate or
undergraduate; full-time or part-time students. Similarly, employee is a generalization of the classes
of objects cook, waiter, cashier, etc.. Generalization is an IS-A relationship; therefore, manager IS-An
employee, cook IS-An employee, waiter IS-An employee, and so forth. Specialization is the abstracting
process of introducing new characteristics to an existing class of objects to create one or more new
classes of objects. This involves taking a higher-level entity and, using additional characteristics,
generating lower-level entities. The lower-level entities also inherit the characteristics of the higherlevel entity. In applying the characteristic size to car we can create a full-size, mid-size, compact, or
subcompact car. Specialization may be seen as the reverse process of generalization: additional
specific properties are introduced at a lower level in a hierarchy of objects. Both processes are
illustrated in the following figure 9 wherein the lower levels of the hierarchy are disjoint.
The entity set EMPLOYEE is a generalization of the entity sets FULL__TIME-EMPLOYEE and
PART_TIME-EMPLOYEE. The former is a generalization of the entity sets faculty and staff, the latter,
that of the entity sets TEACHING and CASUAL. FACULTY and STAFF inherit the attribute Salary of
the entity set FULL_TIME_EMPLOYEE and the latter, in turn, inherits the attributes of EMPLOYEE.
FULL-_TIME-EMPLOYEE is a specialization of the entity set EMPLOYEE and is differentiated by the
additional attribute Salary. Similarly, PART_TIME_EMPLOYEE is a specialization differentiated by the
presence of the attribute Type.
Fig : Generalization and Specialization
26
In designing a database to model a segment of the real world, the data-modeling scheme must be able
to represent generalization. It allows the model to represent generic entities and treat a class of
objects and s0pecifying relationships in which the generic objects participate.
Generalization forms a hierarchy of entities and can be represented by a hierarchy of tables, which
can also be shown through following relations for conveniences.
EMPLOYEE (Empl-no, name, Date-of-birth)
FULL-TIME (Empl_no, salary)
PART-TIME (Empl-no, type)
FACULTY (Empl_no, Degree, Interest)
STAFF (Empl-no, Hour-rate)
TEACHING (Empl-no, Stipend)
Here the primary key of each entity corresponds to entries in different tables and directs one to the
appropriate row of related tables.
Another method of representing a generalization hierarchy is to have the lowest-level entities inherit
the attributes of the entities of higher levels. The top and intermediate-level entities are not included
as only those of the lowest level are represented in tabular form. For instance, the attributes of the
entity set FACULTY would be (Emp1-No, Name, Date of-Hire, Salary, Degree, Interest). A separate
table would he required for each lowest-level entity in the hierarchy. The number of different tables
required to represent these entities would be equal to the number of entities at the lowest level of the
generalization hierarchy.
2.4.3.1
Aggregation
Aggregation is the process of compiling information on an object, thereby abstracting a higher-level
object. In this manner, aggregating the characteristics name, address, and Social Security number
derives the entity person. Another form of the aggregation is abstracting a relationship between
objects and viewing the relationship as an object. As such, the ENROLLMENT relationship between
entities student and course could be viewed as entity REGISTRATION. Examples of aggregation are
shown in fig.
Consider the relationship COMPUTING of fig. Here we have a relationship among the entities
STUDENT, COURSE, and COMPUTING SYSTEM.
27
Fig : Examples of aggregation
Fig : A relationship among three entities
A student registered in a given course uses one of several computing systems to complete
assignments and projects. The relationship between the entities STUDENT and COURSE could be the
aggregated entity REGISTRATION, as discussed above. In this ease, we could view the ternary
relationship of fig as one between registration and the entity Computing system. Another method of
aggregating is to consider a relationship consisting of the entity COMPUTING Systems being assigned
to COURSES. This relationship can be aggregate as a new entity and a relationship established
between it and STUDENT. Note that the difference between a relationship involving an aggregation
and one with the three entities lies in the number of relationships. In the former case we have two
relationships; in the latter only one exists. The approach to be taken depends on what we want to
express. We would use the ternary relationship related to a COMPUTING SYSTEM.
28
2.5 THE HIERARCHICAL MODEL
A DBMS belonging to the hierarchical data model uses tree structures to represent relationship
among records. Tree structures occur naturally in many data organizations because some entities
have an intrinsic hierarchical order. For example, an institute has a number of programmers to offer.
Each program has a number of courses. Each course has a number of students registered in it. The
following figure depicts, the four entity types Institute, Program, Course and Student make up the
four different levels of hierarchical structure. The figure 12 shows an example of database occurrence
for an institute. A database is a collection of database occurrence.
Fig: A simple Hierarchy
A hierarchical database therefore consists of a collection of records, which are connected with each
other through links. Each record is a collection of fields (attributes), each of which contains one data
value. A link is an association between precisely two records.
A tree structure diagram serves the same purpose as an entity-relationship diagram; namely it
specifies the overall logical structure of the database.
The following figure shows typical database occurrence of a hierarchical structure (tree structure).
Fig : Database occurrence of a hierarchical structure
The hierarchical data model has the following features:
— Each hierarchical tree can have only one root record type and this record type does not have a
parent record type.
29
— The root cart have any number of child record types and each of which can itself be a root of a
hierarchical subtree.
— Each child record type can have only one parent record type; thus a M:N relationship cannot be
directly expressed between two record types.
— Data in a parent record applies to all its children records
— A child record occurrence must have a parent record occurrence; deleting a parent record
occurrence requires deleting its entire children record occurrence.
2.6 THE NETWORK MODEL
The Database Task Group of the Conference on Data System Language (DBTG/CODASYL) formalized
the network data model in the late 1960s. Their first report that has been revised a number of times,
contained detailed specifications for the network data model (a model conforming to these
specifications is also known as the DBTG data model). The specifications contained in the report and
its subsequent revisions have been subjected to much debate and criticism. Many of the current
database applications have been built on commercial DBMS systems using the DBTG model.
DBTG Set: The DBTG model uses two different data structures to represent the database entities and
relationships between the entities, namely record type and set type. A record type is used to represent
an entity type. It is made up of a number of data items that represent the attributes of the entity.
A set is used to represent a directed relationship between two record types, the so-called owner
record type, and the member record type. The set type, like the record type, is named and specifies
that there is a one-to-many relationship (I:M) between the owner and member record types. The set
type can have more than one re-cord type as its member, but only one record type is allowed to be
the owner in a given set type. A database could have one or more occurrences of each of its record
and set types. An occurrence of a set type consists of an occurrence of each of its record and set
types. An occurrence of a set type consists of an occurrence of the owner record type and any number
of occurrences of each of its member record types. A record type cannot be a member of two distinct
occurrences of the same set type.
Bachman introduced a graphical means called a data structure diagram to denote the logical
relationship implied by the set. Here a labelled rectangle represents the corresponding entity or
record type. An arrow that connects two labeled rectangles represents a set type. The arrow direction
is from the owner record type to the member record type. Figure shows two record types
(DEPARTMENT and ENTLOYEE) and the set type DEPIT-EMP, with DEPARTMENT as the owner
record type and EMPLOYEE as the member record type.
Fig : A DBTG set
The data structure diagrams have been extended to include field names in the record type rectangle,
and the arrow is used to clearly identify the data fields involved in the set association. A one-to-many
30
(I:M) relationship is shown by a set arrow that starts from the owner field in the owner record type.
The arrow points to the member field within the member record type. The fields that support the
relationship are clearly identified.
A logical record type with the same name represents each entity type in an E-R diagram. Data fields
of the record represent the attributes of the entity. We use the term logical record to indicate that the
actual implementation may be quite different.
The conversion of the E-R diagram into a network database consists of converting each I:M binary
relationship into a set (a 1:1 binary relationship being a special case of a 1:M relationship). If there is
a 1:M binary relationship R1 from entity type E1 to entity type E2,
Fig : Conversion of an M:N relationship into two 1:M DBTG sets
Then the binary relationship is represented by a set an instance of this would be S1 with an instance
of the record type corresponding to entity E1 as the owner and one or more instances of the record
type corresponding to entity E2 as the member. If a relationship has attributes, unless the attributes
can be assigned to the member record type, they have to be maintained in a separate logical record
type created for this purpose. The introduction of this additional record type requires that the original
set be converted into two symmetrical sets, with the record corresponding to the attributes of the
relationship as the member in both the sets and the records corresponding to the entities as the
owners.
2.7 THE RELATIONAL MODEL
A technique called normalization the entanglement observation in the tree and network structure can
be replaced by a relatively neater structure. Codd principles relate to the logical description of the
data and it is important bear in mind that this is quite independent and feasible way in which the
data is stored. It is only some years back that these concepts have emerged from the research
development test and trial stages and are being seen as commercial projects. The attractiveness of
the relational approach arouses from the simplicity in the data organization and the availability of
reasonably simple to very powerful query languages. The size of the relational database approach is
that all the data is expressed in terms of tables and nothing but tables. Therefore, all entities and
attributes have to be expressed in rows and columns
The differences that arise in the relational approach are in setting up relationships between different
tables. This actually makes use of certain mathematical operations on the relation such as
projection, union, joins, etc. These operations from relational algebra and relational calculus are
31
discussion in some more details in the second Block of this course. Similarly in order to achieve the
organization of the data in terms of tables in a satisfactory manner, a technique called normalization
is used.
A unit in the second block of this course describes in detail the processing of normalization and
various stages including the first normal forms, second normal forms, and the third normal forms. At
the moment it is sufficient to say that normalization is a technique, which helps in determining the
most appropriate grouping of data items into records, segments or tuples. This is necessary because
in the relational model the data items are arranged in tables, which indicate the structure,
relationship, and integrity in the following manner:
1. In any given column of a table, all items are of the same kind
2. Each item is a simple number or a character string
3. All rows of a table are distinct.
4. Ordering of rows within a table is immaterial
5. The columns of a table are assigned distinct names and the ordering of these columns is
immaterial
6. If a table has N columns, it is said to be of degree N. This is sometimes also referred to as the
cardinality of the table. From a few base tables it is possible by setting up relations; create views,
which provide the necessary information to the different users of the same database.
2.8 Advantages and Disadvantages of Relational Approach
Advantages of Relational approach
The popularity of the relational database approach has been apart from access of availability of a
large variety of products also because it has certain inherent advantages.
1. Ease of use: The revision of any information as tables consisting of rows and columns is quite
natural and therefore even first time users find it attractive.
2. Flexibility: Different tables from which information has to be linked and extracted can be easily
manipulated by operators such as project and join to give information in the form in which it is
desired.
3. Precision: The usage of relational algebra and relational calculus in the manipulation of he
relations between the tables ensures that there is no ambiguity, which may otherwise arise in
establishing the linkages in a complicated network type database.
4. Security: Security control and authorization can also be implemented more easily by moving
sensitive attributes in a given table into a separate relation with its own authorization controls. If
authorization requirement permits, a particular attribute could be joined back with others to enable
full information retrieval.
5. Data Independence: Data independence is achieved more easily with normalization structure used
in a relational database than in the more complicated tree or network structure.
6. Data Manipulation Language: The possibility of responding to ad-hoc query by means of a
language based on relational algebra and relational calculus is easy in the relational database
approach. For data organized in other structure the query language either becomes complex or
extremely limited in its capabilities.
32
Disadvantages of Relational Approach
A major constraint and therefore disadvantage in the use of relational database system is machine
performance. If the number of tables between which relationships to be established are large and the
tables themselves are voluminous, the performance in responding to queries is definitely degraded. It
must be appreciated that the simplicity in the relational database approach arises in the logical view.
With an interactive system, for example an operation like join would depend upon the physical
storage also. It is, therefore common in relational databases to tune the databases and in such a case
the physical data layout would be chosen so as to give good performance in the most frequently run
operations. It therefore would naturally result in the fact that the lays frequently run operations
would tend to become even more shared
While the relational database approach is a logically attractive, commercially feasible approach, but if
the data is for example naturally organized in a hierarchical manner and stored as such, the
hierarchical approach may give better results. It is helpful to have a summary view of the differences
between the relational and the non-relational approach in the following section.
2.9 An Example of a Relational Model
Let us see important features of a RDBMS through some examples as shown in figure 24. A relation
has the following properties:
1. A table is perceived as a two-dimensional structure composed of row and columns.
2. Each column has a distinct name (attribute name), and the order of columns is immaterial.
3. Each table row (tuple) represents a single data entity within the entity set.
4. Each row/column intersection represents a single data value.
5. Each row carries information describing one entity occurrence.
Fig : Example of a relational data model
As shown in fig, a tuple is the collection of values that compose one row of a relation. A tuple is
equivalent to a record instance. An n-tuple is a tuple composed of n attribute values, where n is
called the degree of the relation. PRODUCT is an example of a 4-tuple. The number of tuples in a
relation is its cardinality.
A domain is the set of possible values for an attribute. For example, the domain for QUANTITY-ONHAND in the PRODUCT relation is all integers greater than or equal to zero. The domain for CITY in
the VENDOR relation is a set of alphabetic characters strings restricted to the names of U.S. cities.
PRODUCT (PRODUCT#, DESCRIPTION, PRICE,
33
QUANTITY-ON-HAND)
VENDOR (VENDOR#, VENDOR-NAME, VENDOR-CITY)
SUPPLIES (VENDOR#, PRODUCT#, VENDOR-PRICE)
In this form, the attribute (or attributes in combination) for which no more than one tuple may have
the same (combined) value is called the primary key. (The primary key, attributes are underlined for
clarity.) The relational data model requires that a primary key of a tuple (or any component attribute
if a combined key) may not contain a null value. Although several different attributes (called
candidate keys) might serve as the primary key, only one (or one combination) is chosen. These other
keys are then called alternate keys.
The SUPPLIES relation in fig requires two attributes in combination in order to identify uniquely each
tuple. A composite or concatenated key is a key that consists of two or more attributes appended
together. Concatenated keys appear frequently in a relational database, since intersection data, like
VENDOR-PRICE, may be uniquely identified by a combination of the primary keys of the related
entities. Each component of a concatenated key can be used to identify tuples in another relation. In
fact, values for all component keys of a concatenated key must be present, although monkey
attribute values may be missing. Further, the relational model has been enhanced to indicate that a
tuple (e.g., for PRODUCT logically should exist with its key value (e.g., PRODUCT#) if that value
appears in a SUPPLIES tuple; this deals with existence dependencies.
2.10
1)
2)
3)
SELF TEST
What do you mean by the term data model?
Create and ERD of Hostel management.
Explain difference between 1:1 and 1:M relationship.
2.11
SUMMARY
•
Data modeling is the analysis of data objects that are used in a business or other context and
the identification of the relationships among these data objects. Data modeling is a first step
in doing object-oriented programming.
•
One or more of these data items are designated as the key and is used for sequencing the file
and for locating and grouping records by sorting and indexing.
•
Advantages: Less software development cost - Even by experienced programmers it takes
months or years in developing a good software system in high level language; Support of
efficient query facility - on line queries for multiple-key retrievals are tedious to program.
•
The entity-relationship model is a tool for analyzing the semantic features of an application
that are independent of events. Entity-relationship modeling helps reduce data redundancy.
34
UNIT 3
FILE ORGANISATION FOR DBMS
3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
Introduction
Objectives
File organization
Sequential file organisation
3.4.1 Index-sequential file organization
3.4.1.1 Types of indexes
B-trees
3.5.1 Advantages of b-tree indexes
Direct file organization
Need for the multiple access path
Self test
Summary
3.1 INTRODUCTION
Just as arrays, lists, trees and other data structures are used to implement data organization in main
memory, a number of strategies are used to support the organization of data in secondary memory.
Fig : File organization techniques
35
Look at different strategies of organizing data in the secondary memory. We are concerned with
obtaining data representation for files on external storage devices so that required functions (e.g.
retrieval, update) may be carried out efficiently. The particular organization most suitable for any
application will depend upon such factors as the kind of external storage available, types of queries
allowed, number of keys, mode of retrieval and mode of update.
3.2 OBJECTIVES
After going through this unit you will be able to:
- Define what a file organization is
- List file organization techniques
- Discuss implementation techniques of indexing through tree-structure
- Discuss implementation of direct file organization
- Discuss implementation of. Multi key file organization
- Discuss trade-off and comparison among file organization techniques
3.3 FILE ORGANISATION
The technique used to represent and store the records on a file is called the file organization. The four
fundamental file organization techniques that we will discuss are the following:
1. Sequential
2. Relative
3. Indexed sequential
4. Multi-key
There are two basic ways that the file organization techniques differ. First the organization
determines the file's record sequencing, which is the physical ordering of the records in storage.
Second, the file organization determines the set of operation necessary to find particular records.
Having particular values in search-key fields typically identifies individual records. This data field
may or may not have duplicate values in file; the field can be a group or elementary item. Some file
organization techniques provide rapid accessibility on a variety of search key; other techniques
support direct access only on the value of a single one.
3.4 SEQUENTIAL FILE ORGANISATION
The most basic way to organize the collection of records that from a file is to use sequential
organization. In a sequentially organized file records are written consecutively when the file is created
and must be accessed consecutively when the file is later used for input (fig).
36
Fig : Structure of sequential file
In a sequential file, records are maintained in the logical sequence of their primary key values. The
processing of a sequential file is conceptually simple but inefficient for random access. However, if
access to the file is strictly sequential, a sequential file is suitable. A sequential file could be stored on
a sequential storage device such as a magnetic tape.
Search for a given record in a sequential file requires, on average, access to half the records in the
file. Consider a system where the file is stored on a direct access device such as a disk. Suppose the
key value is separated from the rest of the record and a pointer is used to indicate the location of the
record. In such a system, the device may scan over the key values at rotation speeds and only read in
the desired record. A binary or logarithmic search technique may also be used to search for a record.
In this method, the cylinder on which the required record is stored is located by a series of decreasing
head movements. The search having been localized to a cylinder may require the reading of half the
tracks, on average, in the case where keys are embedded in the physical records, or require only a
scan over the tracks in the case where keys are also stored separately.
Updating usually requires the creation of a new file. To maintain file sequence, records are copied to
the point where amendment is required. The changes are then made and copied into the new file.
Following this, the remaining records in the original file are copied to the new file. This method of
updating a sequential file creates an automatic backup copy. It permits updates of the type U1
through U4.
Addition can be handled in a manner similar to updating. Adding a record necessitates the shifting of
all records from the appropriate point to the end of file to create space for the new record. Inversely,
deletion of a record requires a compression of the file space, achieved by the shifting of records.
Changes to an existing record may also require shifting if the record size expands or shrinks.
The basic advantage offered by a sequential file is the ease of access to the next record, the simplicity
of organization and the absence of auxiliary data structures. However, replies to simple queries are
time consuming for large files. Updates, as seen above, usually require the creation of a new file. A
single update is an expensive proposition if a new file must be created. To reduce the cost per update,
all such requests are hatched, sorted in the order of the sequen1tial file, and then used to update the
sequential file in a single pass. Such a file, containing the updates to be made to a sequential file, is
sometimes referred to a transaction file.
37
In the batched mode of updating, a transaction file of update records is made and then sorted in the
sequence of the sequential file. Ale update process requires the examination of each individual record
in the original sequential file (the old master rile). Records requiring no changes are copied directly to
a new file (the new master rile); records requiring one or Wore changes are written into the new
master file only after all necessary changes have been made. Insertions of new records are made in
the proper sequence. They are written into the new master file at the appropriate place. Records to be
deleted are not copied to the new master file. A big advantage of this method of update is the creation
of an automatic backup copy. The new master file can always be recreated by processing the old
master file and the transaction file.
Fig: A file with empty spaces for record insertions
A possible method of reducing the creation of a new file at each update run is to create the original
file with "holes" (space left for the addition of new records, as shown in the last figure). As such, if a
block could hold K records, then at initial creation it is made to contain only L * K records, where 0 <
L < 1 is known as the loading factor. Additional space may also be earmarked for records that may
"overflow" their blocks, e.g., if the record ri logically belongs to block Bi but the physical block Bi does
not contain the requisite free space. This additional free space is known as the overflow area. A
similar technique is employed in index-sequential files.
3.4.1 INDEX-SEQUENTIAL FILE ORGANISATION
The retrieval of a record from a sequential file, on average, requires access to half the records in the
file, making such inquiries not only inefficient but very time consuming for large files. To improve the
query response time of a sequential file, a type of indexing technique can be added.
An index is a set of y, address pairs. Indexing associates a set of objects to a set of orderable
quantities, which are usually smaller in number or their properties provide a mechanism for faster
search. The purpose of indexing is to expedite the search process. Indexes created from a sequential
(or sorted) set of primary keys are referred to as index sequential. Although the indices and the data
blocks are held together physically, we distinguish between them logically. We shall use the term
index file to describe the indexes and data file to refer to the data records. The index is usually small
enough to be read into the processor memory.
3.4.1.1 Types of Indexes
The idea behind an index access structure is similar to that behind the indexes used commonly in
textbooks. Along with each term, a list of page numbers where the term appears is given. We can
search the index to find a list of addresses - page numbers in this case - and use these addresses to
locate the term in the textbook by searching the specified pages. The alternative, if no other guidance
38
is given, is to sift slowly through the whole textbooks word by word to find the term we are interested
in, which corresponds to doing a linear search on a file. Of course, most books do have additional
information, such as chapter and section titles, which can help us find a term without having to
search through the whole book. However, the index is the only exact indication of where each term
occurs in the book.
An index is usually defined on a single field of a file, called an indexing Field. There are several types
of indexes. A primary index is an index specified on the ordering key field of an ordered file of records.
Recall that an ordering key field is used to physically order the file records on disk, and every record
has a unique value for that field. If the ordering field is not a key field that is, several records in the
file can have the same value for the ordering field another type of index, called a clustering index, can
be used. Notice that a file can have at most one physical ordering field, so it can have at most one
primary index or one clustering index, but not both. A third type of index, called a secondary index,
can be specified on any non-ordering field of a file. A file can have several secondary indexes in
addition to its primary access method. In the next three subsections we discuss these three types of
indexes.
•
Primary Indexes
A primary index is an ordered file whose records are of fixed length with two fields. The first field is of
the same data types as the ordering key field of the data file, and the second field is a pointer to a
disk block - a block address. The ordering key field is called the primary key of the data file there is
one index entry (or index record) in the index file for each block in the data file. Each index entry has
the value of the primary key field for the first record in a block and a pointer to other block as its two
field values. We will refer to the two field values of index entry i as K (i), P (i).
To create a primary index on the ordered file shown in figure 4, we use the Name field as the primary
key, because that is the ordering key field of the file (assuming that each value of NAME is unique).
39
Fig : Some blocks on an ordered (sequential) file of EMPLOYEE records with NAME as
the ordering field
Each entry in the index will have a NAME value and a pointer. The first three index entries would be:
< K(1) = (Aaron, Ed), P(I)= address of block 1 >
< K(2) = (Adams, John), P(I) = address of block 2 >
< K(3) = (Alexander, Fd), P(3) = address of block 3 >
40
Fig
:
Illustrating
internal
hashing
data
Structures.
(a) Array of M Positions for use in hashing. (b) Collision resolution by
chaining of records.
Fig illustrates this Primary index. The total number of entries in the index will be the same as the
number of disk block in the ordered data file. The first record in each block of the data file. The first
record in each block of the data file is called the anchor record of the block, or simple the block
anchor (a scheme called the anchor record of similar to the one described here can be used, with the
last record in each block, rather than the first as the block anchor. A primary index is an example of
what is called a non-dense index because it includes an entry for each disk block of the data file
rather than for every record in the data file. A dense index, on the other hand, contains an entry for
every record in the file.
The index file for a primary index needs substantially fewer blocks than the data file for two reasons.
First, there are fewer index entries than there are records in the data file because an entry exists for
each whole block of the data file rather than for each record. Second, each index entry is typically
smaller in size than a data record because it has only two fields, so more index entries than data
records will fit in one block. A binary search on the index file will hence require fewer block accesses
than a binary search on the data file.
41
Fig: Primary index on the ordering key field
A record whose primary key value is K will be in the block whose address is P(i), where Ki < K< (i + 1).
The ith block in the data file contains all such records because of the physical ordering of the file
records on the primary key field, we do a binary search on the index file to find the appropriate index
entry i, then retrieve the data file block whose address is P(i). Notice that the above formula would not
be correct if the data file was ordered on a non-key field that allows multiple records to have the same
ordering field value. In that case the same index value as that in the block anchor could be repeated
in the last records of the previous block. Example 1 illustrates the saving in block accesses when
using an index to search for a record.
Example 1 : Suppose we have an ordered file with r = 30,000 records stored on a disk with block size
B = 1024 bytes. File records are of fixed size and unplanned with record length R 100 bytes. The
blocking factor for the file would be bfr L[(B/R)] = [(1024/100)]=10 records per block. The number of
blocks needed for the file is b = [(r/bfr)] = [(30,000/10)]= 3000 blocks. A binary search on the data file
would need approximately [(log2b)]= [(log23000)]= 12 block accesses.
42
Now suppose the ordering key field of the file is V = 9 bytes long, a block pointer is P = 6 bytes long,
and we construct a primary index for the file. The size of each index entry is Ri=(9 + 6) = 15 bytes, so
the blocking factor for the index is bfri= [(B/R)]= [(1024/15)] = 68 entries per block.
The total number of index entries ri is equal to the number of blocks in the data file, which is 3000.
The number of blocks needed for the index is hence bi = [ri/bfri)]= [(3OOO/68)]= 45 blocks. To
perform a binary search on the index file would need [ (log2bi)] = [(log245)]=6 block accesses. To
search for a record using the index, we need one additional block access to the data file for a total of
6 + 1 = 7 block accesses - an improvement over binary search on the data file, which required 12
block accesses.
A major problem with a primary index - as with any ordered file - is insertion and deletion of records.
With a primary index, the problem is compounded because if we attempt to insert a record in its
correct position in the data file, we not only have to move records to make space for the new record
but also have to change some index entries because moving records will change the anchor records of
some blocks. We can use an unordered overflow file. Another possibility is to use a linked list of
overflow records for each block in the data file. We can keep the records within each block and its
overflow-linked list sorted to improve retrieval time. Record deletion can be handled using deletion
markers.
•
Clustering Indexes
If records of a file are physically ordered on a non-key field that does not have a distinct value, for
each record, that field is called the clustering field of the file. We can create a different type of index,
called a clustering index, to speed up retrieval of records that have the same value for the clustering
field.
43
Fig: A clustering Index on the DEPTNUMBER ordering field of an
EMPLOYEE file
A clustering index is also an ordered file with two fields; the first field is of the same type as the
clustering field of the data file and the second field is a block pointer. There is one entry in the
clustering index for each distinct value of the clustering field, containing that value and a pointer to
the first block in the data file that has a record with that value for its clustering field.
44
Fig: Clustering index with separate blocks for each group of records with
the same value for the clustering field
The record and record deletion still cause considerable problems because the data records are
physically ordered. To alleviate the problem of insertion, it is common to reserve a whole block for
each value of the clustering field; all records with that value are placed in the block. If more than one
block is needed to store the records for a particular value, additional blocks are allocated and linked
together. This makes insertion and deletion relatively straightforward. Figure 8 shows this scheme.
A clustering index is another example of a non-dense index because it has an entry for every distinct
value of the indexing field rather than for every record in the file.
45
• Secondary Indexes
A secondary index also is an ordered file with two fields, and, as in the other indexes, the second field
is a pointer to a disk block. The first field is of the same data type as some non-ordering field of the
data file. The field on which the secondary index is constructed is called an indexing field of the file,
whether its values are distinct for every record or not. There can be many secondary indexes, and
hence indexing fields, for the same file.
We again refer to the two field values of index entry i as K(i), P(i). The entries are ordered by value of
K(i), so we can use binary search on the index. Because the records of the data file are not physically
ordered by values of the secondary key field, we cannot use block
Fig: A dense secondary Index on a non ordering key field of a file
Anchors. That is why an index entry is created for each record in the data file rather than for each
block as in the case of a primary index. Figure 9 illustrates a secondary index on a key attribute of a
data file. Notice that in figure 9 the pointers P(i) in the index entries are block pointers, not record
pointers. Once the appropriate block is transferred to main memory, a search for the desired record
within the block can be carried out.
A secondary index will usually need substantially more storage space than a primary index because
of its larger number of entries. However, the improvement in search time for an arbitrary record is
much greater for a secondary index than it is for a primary index, because we would have to do a
linear search on the data file if the secondary index did not exist. For a primary index, we could still
46
use binary search on the main file even if the index did not exist because the primary key field
physically orders the records. Example 2 illustrates the improvement in number of blocks accessed
when using a secondary index to search for a record.
Example 2: Consider the file of Example 1 with r = 30,000 fixed- length records of size R 100 bytes
stored on a disk with block size B = 1024 bytes. The file has b = 3000 blocks as calculated in
Example 1. To do a linear search on the file, we would require b/2=3000/2=1500 block accesses on
the average.
Suppose we construct a secondary index on a non-ordering key field of the file that is V=9 bytes long.
As in Example 1, a block pointer is P=6 bytes long, so each index entry is R. (9+6)=15 bytes, and the
blocking factor for the index is bfri = [(B/Ri)]=[(1024/15)]=68 entries per block. In a dense secondary
index such as this, the total number of index entries is ri is equal to the number of records in the
data file, which is 30,000. The number of blocks needed for the index is hence
bi=(ri/bfri)=(30,000/68) = 442 blocks.
Compare this to the 45 blocks needed by the non-dense primary index in Example 1.
A binary search on this secondary index needs (log2bi)=(log2442)= 9 block accesses. To search for a
record using the index, we need an additional block access to the data file for a total of 9 + 1 = 10
block accesses - a vast improvement over the 1500 block accesses needed on the average for a linear
search.
Fig: A secondary index on a non key field implemented using one level of
indirection so that index entries are fixed length and have unique field
values
47
3.5 B-TREES
The basic B-tree structure was discovered by R.Bayer and E. McCreight (1970) of Boeing Scientific
Research Labs and has grown to become one of the most popular techniques for organizing an index
structure. Many variations on the basic B-tree have been developed; we cover the basic structure first
and then introduce some of the variations.
The B-tree is known as the balanced sort tree, which is useful for external sorting. There are strong
uses of B-trees in a database system as pointed out by D.Comer (1979): "While no single scheme can
be optimum for all applications, the technique of organizing a file and its index called the B -tree is,
The facto, the standard organization for indexes in a database system."
The file is a collection of records. The index refers to a unique key, associated with each record. One
application of B- trees is found in IBM's Virtual Storage Access Method (VSAM) file organization.
Many data manipulation tasks require data storage only in main memory. For applications with a
large database running on a system with limited company, the data must be stored as records on
secondary memory (disks) and be accessed in pieces. The size of a record can be quite large, as
shown below:
struct DATA
{
int ssn;
char name [80];
char address [80];
char schoold [76];
struct DATA * left; /* main memory addresses */
struct DATA * right; /* main memory addresses */
d_block d_left; /* disk block address */
d_block d_right; /* disk block address */
}
3.5.1 Advantages of B-tree indexes
Because there is no overflow problem inherent with this type of organization it is good for dynamic
table - those that suffer a great deal of insert / update / delete activity. Because it is to a large extent
self-maintaining, it is good in supporting 24-hour operation. As data is retrieved via the index it is
always presented in order. 'Get next' queries are efficient because of the inherent ordering of rows
with in the index blocks.
B-tree indexes are good for very large tables because they will need minimal reorganization. There is
predictable access time for any retrieval (of the same number of rows of course) because the B-tree
structure keeps itself balanced, so that there is always the same number of index levels for every
retrieval. Bear in mind of course, that the number of index levels does increase both with the number
of records and the length of the key value.
48
Because the rows are in order, this type of index can service range type enquiries, of the type below,
efficiently.
SELECT ... WHERE COL BETWEEN X AND Y.
Disadvantages of B-tree indexes:
For static tables, there are better organizations that require fewer I/0s. ISAM indexes are preferable
to B-tree in this type of environment.
B-tree is not really appropriate for very small tables because index look-up becomes a significant part
of the overall access time.
The index can use considerable disk space, especially in products, which allow different users to
create separate indexes on the same table/column combinations.
Because the indexes themselves are subject to modification when rows are updated, deleted or
inserted, they are also subject to locking which can inhibit concurrency.
3.6 DIRECT FILE ORGANISATION
In the index-sequential file organization considered in the previous sections, the mapping from the
search-key value to the storage location is via index entries. In direct file organization, the key value
is mapped directly to the storage location.
Fig: Mapping from a key value to an address value
The usual method of direct mapping is by performing some arithmetic manipulation of the key value.
This process is called hashing. Let us consider a hash function h that maps the key value k to the
value h(k). The value h(k) is used as an address and for our application we require that this value be
in some range. If our address area for the records lies between S1 and S2, the requirement for the
hash function h(k) is that for all values of k it should generate values between S1 and S2. It is
obvious that a hash function that maps many different key values to a single address or one that
does not map the key values uniformly is a bad hash function. A collision is said to occur when two
distinct key values are mapped to the same storage location. Collision is handled in a number of
ways.
Another problem that we have to resolve is to decide what address is represented by h(k). Let the
addresses generated by the hash function the addresses of buckets in which the y, address pair
values of records are stored. Figure shows the buckets containing the y, address pairs that allow a
reorganization of the actual data file and actual record address without affecting the hash functions.
A limited number of collisions could be handled automatically by the use of a bucket of sufficient
capacity. Obviously the space required for the buckets will be, in general, much smaller than the
actual data file. Consequently, its reorganization will not be that expensive. Once the bucket address
is generated from the key by the hash function, a search in the bucket is also required to locate the
address of the required record. However, since the bucket size is small, this overhead is small.
49
The use of the bucket reduces the problem associated with collisions. In spite of this, a bucket may
become full and the resulting overflow could be handled by providing overflow buckets and using a
pointer from the normal bucket to an entry in the overflow bucket. All such overflow entries are
linked. Multiple overflows from the same bucket results in a long list and slows down the retrieval of
these records. In an alternate scheme, the address generated by the hash function is a bucket
address and the bucket is used to store the records directly instead of using a pointer to the block
containing the record.
.
Fig: Bucket and block organization for hashing
Let s represent the value:
s = upper bucket address value - lower bucket address value + 1
Here, s gives the number of buckets. Assume that we have some mechanism to convert key values to
numeric ones. Then a simple hashing function is:
h(k) = k mod S
Where k is the numeric representation of the key and h(k) produces a bucket address. A moment's
thought tells us that this method would perform well in some cases and not in others.
It has been shown, however, that the choice of a prime number for s is usually satisfactory. A
combination of multiplicative and divisive methods can be used to advantage in many practical
situations.
50
There are innumerable ways of converting a key to a numeric value. Most keys are numeric; others
may be either alphabetic or alphanumeric. In the latter two cases, we can use the bit representation
of the alphabet to generate the numeric equivalent key. A number of simple hashing methods are
given below. Many hashing functions can be devised from these and other ways.
1. Use the low order part of the key. For keys that are consecutive integers with few gaps, this
method can be used to map the keys to the available range.
2. End folding. For long keys, we identify start, middle, and end regions, such that the sum of the
lengths of the start and end regions equals the length of the middle region. The start and end digits
are concatenated and the concatenated string of drifts is added to the middle region digits.
3.7 Need for the Multiple Access Path
Many interactive information systems require the support of multi-key files. Consider a banking
system in which there are several types of users: teller, loan officers, branch manager, bank officers,
account holders, and so forth. All have the need to access the same data, say records of the format
shown in fig various types of users need to access these records in different ways. A teller might
identify an account record by its ID value. A loan officer might need to access all account records
with a given value for OVERDRAW-LIMIT, or all account records for a given value of SOCNO. A
branch manager might access records by the BRANCH and TYPE group code. A bank officer might
want periodic reports of all accounts data, sorted by ID.
Fig: Example record format
Support by Replicating Data: One approach to being able to support all these types of access is to
have several different files; each organized to serve one type of request. For this banking example,
there might be one indexed sequential account file with key ID (to server tellers, bank officers, and
account holders), one sequential account file with records ordered by OVER-DRAW-LIMIT (to serve
loan officer), one account rile with relative organization and user-key SOCNO (to serve loan officers),
one sequential account file with records ordered by GROUP-CODF- (to serve branch managers), and
one relative account file, with user-key NAME, SOCNO, and TYPE code (to serve account holders). We
have just identified five files, all containing the same data records! The, five files differ only in their
organizations, and thus in the access paths they provide.
3.8 SELF TEST
1)
Define the term File Organisation?
2)
Explain different type of File organization.
3)
Explain Difference between index sequence and binary sequence file.
51
3.9 SUMMARY
•
External storage available, types of queries allowed, number of keys, mode of retrieval and
mode of update.
•
The technique used to represent and store the records on a file is called the file organization.
•
A sequentially organized file records are written consecutively when the file is created and
must be accessed consecutively when the file is later used for input.
•
The retrieval of a record from a sequential file, on average, requires access to half the records
in the file, making such inquiries not only inefficient but very time consuming for large files.
To improve the query response time of a sequential file, a type of indexing technique can be
added.
•
A primary index is an ordered file whose records are of fixed length with two fields. The first
field is of the same data types as the ordering key field of the data file, and the second field is
a pointer to a disk block - a block address.
52
UNIT 4
REPRESENTING DATA ELEMENTS
4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11
4.12
4.13
Data Elements and Fields
Representing Relational Database Elements
Records
Representing Block and Record Addresses
Client-Server Systems
Logical and Structured Addresses
Record Modifications
Index Structures
Indexes on Sequential Files
Secondary Indexes
B-Trees
Hash Tables
Self Test
4.1 Data Elements and Fields
In order to begin constructing the basic model, the modeler must analyze the information gathered
during the requirements analysis for the purpose of:
• Classifying data objects as either entities or attributes
• Identifying and defining relationships between entities
• Naming and defining identified entities, attributes, and relationships
• Documenting this information in the data document
To accomplish these goals the modeler must analyze narratives from users, notes from meeting,
policy and procedure documents, and, if lucky, design documents from the current information
system. Although it is easy to define the basic constructs of the ER model, it is not an easy task to
distinguish their roles in building the data model. What makes an object an entity or attribute? For
example, given the statement "employees work on projects". Should employees be classified as an
entity or attribute? Very often, the correct answer depends upon the requirements of the database. In
some cases, employee would be an entity, in some it would be an attribute.
While the definitions of the constructs in the ER Model are simple, the model does not address the
fundamental issue of how to identify them. Some commonly given guidelines are:
•
entities contain descriptive information
•
attributes either identify or describe entities
• relationships are associations between entities
These guidelines are discussed in more detail below.
53
•
Entities
•
Attributes
•
Validating Attributes
•
Derived Attributes and Code Values
•
Relationships
•
Naming Data Objects
•
Object Definition
•
Recording Information in Design Document
Entities
There are various definitions of an entity:
"Any distinguishable person, place, thing, event, or concept, about which information is kept"
"A thing which can be distinctly identified"
"Any distinguishable object that is to be represented in a database"
"...anything about which we store information (e.g. supplier, machine tool, employee, utility pole,
airline seat, etc.). For each entity type, certain attributes are stored".
These definitions contain common themes about entities:
Entities should not be used to distinguish between time periods. For example, the entities 1st
Quarter Profits, 2nd Quarter Profits, etc. should be collapsed into a single entity called Profits. An
attribute specifying the time period would be used to categorize by time not every thing the users
want to collect information about will be an entity. A complex concept may require more than one
entity to represent it. Others "things" users think important may not be entities.
Attributes
Attributes are data objects that either identify or describe entities. Attributes that identify entities are
called key attributes. Attributes that describe an entity are called non-key attributes. Key attributes
will be discussed in detail in a latter section.
The process for identifying attributes is similar except now you want to look for and extract those
names that appear to be descriptive noun phrases.
Validating Attributes
Attribute values should be atomic, that is, present a single fact. Having disaggregated data allows
simpler programming, greater reusability of data, and easier implementation of changes.
Normalization also depends upon the "single fact" rule being followed. Common types of violations
include:
Simple aggregation - a common example is Person Name which concatenates first name, middle
initial, and last name. Another is Address which concatenates, street address, city, and zip code.
When dealing with such attributes, you need to find out if there are good reasons for decomposing
them. For example, do the end-users want to use the person's first name in a form letter? Do they
want to sort by zip code?
Complex codes - these are attributes whose values are codes composed of concatenated pieces of
information. An example is the code attached to automobiles and trucks. The code represents over 10
54
different pieces of information about the vehicle. Unless part of an industry standard, these codes
have no meaning to the end user. They are very difficult to process and update.
Derived Attributes and Code Values
Two areas where data modeling experts disagree is whether derived attributes and attributes whose
values are codes should be permitted in the data model. Derived attributes are those created by a
formula or by a summary operation on other attributes. Arguments against including derived data
are based on the premise that derived data should not be stored in a database and therefore should
not be included in the data model. The arguments in favor are:
Derived data is often important to both managers and users and therefore should be included in the
data model it is just as important, perhaps more so, to document derived attributes just as you
would other attributes including derived attributes in the data model does not imply how they will be
implemented. A coded value uses one or more letters or numbers to represent a fact. For example,
the value Gender might use the letters "M" and "F" as values rather than "Male" and "Female". Those
who are against this practice cite that codes have no intuitive meaning to the end-users and add
complexity to processing data. Those in favor argue that many organizations have a long history of
using coded attributes, that codes save space, and improve flexibility in that values can be easily
added or modified by means of look-up tables.
Relationships
Relationships are associations between entities. Typically, a relationship is indicated by a verb
connecting two or more entities. For example: employees are assigned to projects. As relationships
are identified they should be classified in terms of cardinality, optionality, direction, and dependence.
As a result of defining the relationships, some relationships may be dropped and new relationships
added. Cardinality quantifies the relationships between entities by measuring how many instances of
one entity are related to a single instance of another. To determine the cardinality, assume the
existence of an instance of one of the entities. Then determine how many specific instances of the
second entity could be related to the first. Repeat this analysis reversing the entities. For example:
employees may be assigned to no more than three projects at a time; every project has at least two
employees assigned to it.
If a relationship can have a cardinality of zero, it is an optional relationship. If it must have a
cardinality of at least one, the relationship is mandatory. Optional relationships are typically
indicated by the conditional tense. For example: an employee may be assigned to a project
Mandatory relationships, on the other hand, are indicated by words such as must have. For example:
a student must register for at least three course each semester
In the case of the specific relationship form (1:1 and 1:M), there is always a parent entity and a child
entity. In one-to-many relationships, the parent is always the entity with the cardinality of one. In
one-to-one relationships, the choice of the parent entity must be made in the context of the business
being modeled. If a decision cannot be made, the choice is arbitrary.
Naming Data Objects
The names should have the following properties:
55
•
unique
•
have meaning to the end-user
• contain the minimum number of words needed to uniquely and accurately describe the object
Some authors advise against using abbreviations or acronyms because they might lead to confusion
about what they mean. Other believe using abbreviations or acronyms are acceptable provided that
they are universally used and understood within the organization.
You should also take care to identify and resolve synonyms for entities and attributes. This can
happen in large projects where different departments use different terms for the same thing.
Recording Information in Design Document
The design document records detailed information about each object used in the model. As you
name, define, and describe objects, this information should be placed in this document. If you are not
using an automated design tool, the document can be done on paper or with a word processor. There
is no standard for the organization of this document but the document should include information
about names, definitions, and, for attributes, domains.
Two documents used in the IDEF1X method of modeling are useful for keeping track of objects. These
are the ENTITY-ENTITY matrix and the ENTITY-ATTRIBUTE matrix.
The ENTITY-ENTITY matrix is a two-dimensional array for indicating relationships between entities.
The names of all identified entities are listed along both axes. As relationships are first identified, an
"X" is placed in the intersecting points where any of the two axes meet to indicate a possible
relationship between the entities involved. As the relationship is further classified, the "X" is replaced
with the notation indicating cardinality.
The ENTITY-ATTRIBUTE matrix is used to indicate the assignment of attributes to entities. It is
similar in form to the ENTITY-ENTITY matrix except attribute names are listed on the rows.
4.2 Representing Relational Database Elements
56
The relational model was formally introduced by 1970 and has evolved since then, through a series of
writings. The model provides a simple, yet rigorously defined, concept of how users perceive data. The
relational model represents data in the form of two-dimension tables. Each table represents some
real-world person, place, thing, or event about which information is collected. A relational database is
a collection of two-dimensional tables. The organization of data into relational tables is known as the
logical view of the database. That is, the form in which a relational database presents data to the
user and the programmer. The way the database software physically stores the data on a computer
disk system is called the internal view. The internal view differs from product to product and does
not concern us here.
A basic understanding of the relational model is necessary to effectively use relational
software such as Oracle, Microsoft SQL Server, or even personal database systems such as
Fox, which are based on the relational model. This document is an informal introduction to
concepts, especially as they relate to relational database design issues. It is not a
description of relational theory.
database
Access or
relational
complete
4.3 Records
Data is usually stored in the form of records. Each record consists of a collection of related data
values or items where each value is formed of one or more bytes and corresponds to a particular field
of the record. Records usually describe entities and their attributes. For example, an EMPLOYEE and
record represents an employee entity, and each field value in the record specifies some attribute of
that employee, such as NAME, BIRTHDATE, SALARY, or SUPERVISOR. A collection of field names
and their corresponding data types constitutes a record type or record format definition. A data type,
associated with each field, specifies the type of values a field can take.
The data type of a field is usually one of the standard data types used in programming. These include
numeric (integer, long integer, or floating point), string of characters (fixed-length or varying), Boolean
(having 0 and 1 or TRUE and FALSE values only), and sometimes specially coded data and time data
types. The number of bytes required for each data type is fixed for a given computer system. An
integer may require 4 bytes, a long integer 8 bytes, a real number 4 bytes, a Boolean 1 byte, a
Boolean 1 byte, a date 10 bytes (assuming a format of YYYY-MM-DD), and a fixed-length string of k
characters k bytes. Variable-length strings may require, as many bytes as there are characters in
each field value. For example, an EMPLOYEE record type may be defined–using the C programming
language notation–as the following structure:
Struct employee{
char name [30];
char ssn[9];
int salary;
int jobcode;
char department[20];
};
In recent database applications, the need may arise for storing data items that consist of large
unstructured objects, which represent images, digitized video or audio streams, or free text. These
are referred to as BLOBs (Binary Large Objects). A BLOB data item is typically stored separately from
its record in a pool in a pool of disk blocks, and a pointer to the BLOB is included in the record.
57
4.4 Representing Block and Record Addresses
Fields (attributes) need to be represented by fixed- or variable-length sequences of bytes, Fields are
put together in fixed- or variable-length collections called “records” (tuples, or objects). A collection of
records that forms a relation or the extent of a class is stored as a collection of blocks, called a file.
A tuple is stored in a record. The size is the sum of sizes of the fields in the record. The schema is
stored together with records (or a pointer to it). It contains:
_ Types of fields
_ the fields within the record
_ Size of the record
_ Timestamps (last read, last updated)
The schema is used to access specific fields in the record. Records are stored in blocks. Typically a
block only contains one kind of records.
4.5 Client-Server Systems
The principle is that the user has a client program. He asks information (or data) from the server
program. The server searches the data and sends it back to the client.[1] Putting in another way,we
can say that the user is the client, he uses a client program to start a client process, sends message
to server which is a server program, to perform a task or service. As a matter of fact a client server
system is a special case of a co-operative computer system. All such systems are characterised by the
use of multiple processes that work together to form the system solution. (There are two types of cooperative systems client-server systems and peer-to-peer systems.), The client and server systems
consist of three major components : a server with relational database, a client with user interface and
a network hardware connection in between. Client and server is an open system with number of
advantages such as interoperability, scalability, adaptability, affordability, data integrity,
accessibility, performance and security.
WHAT DO THE CLIENT PROGRAMS DO?
They usually deal with;
Managing the application's user-interface part
Confirming the data given by the user
Sending out the requests to server programs
Managing local resources, like monitor, keyboard and peripherals.
The client-based process is the application that the user interacts with. It contains solution-specific
logic and provides the interface between the user and the rest of the application system. In this sense
the graphical user interface (GUI) is one characteristic of client system. It uses tools some are as:
Administration Tool: for specifying the relevant server information, creation of users, roles and
privileges, definition of file formats, document type definitions (DTD) and document status
information.
58
Template Editor: for creating and modifying templates of documents
Document Editor: for editing instances of documents and for accessing component
information.
Document Browser: for retrieval of documents from a Document Server
Access Tool: provides the basic methods for information access.
WHAT DO THE SERVER DO?
Its purpose is fulfilling client's requests. What they do in general is;
Receive request
Execute database retrival and updates
Manage data integrity
Sent the results to client back
Act as a software engine that manage shared resources like databases, printers,
communication links..etc.
When the server do these, it can use common or complex logic. It can supply the information only
using its own resources or it can employ some other machine on the network, in master and slave
attitude.
.[2]
In other words, there can be,
Single client, single server
Multiple clients, single server
Single client, multiple servers
Multiple clients, multiple servers [6]
In a client server system, we can talk about specialization of some components for particular tasks.
The specialization can provide very fast computation servers or high throughput database servers,
resilient transaction servers or for some other purpose, nevertheless what is important is that they
are optimised for that task. It is not optimal to try to perform all of the tasks together.Actually it is
this specialization, therefore optimization that gives power to client and server systems. [10]
These different server types for different tasks can be collected in two categories:
1.Simpler ones are;
Disk server
File server: client can only request for files, but the files are sent as they are, but this needs
large bandwidth and it slows down the network.
2.Advanced ones;
Database server
Transaction server
Application server
What Makes a Design a Client/Server Design?
There are many answers about what differentiates client/server architecture from some other design.
There is no single correct answer, but generally, an accepted definition describes a client application
as the user interface to an intelligent database engine—the server. Well-designed client applications
59
do not hard code details of how or where date is physically stored, fetched, and managed, nor do they
perform low-level data manipulation. Instead, they communicate their data needs at a more abstract
level, the server performs the bulk of the processing, and the result set isn't raw data but rather an
intelligent answer.
Generally, a client/server application has these characteristics:
Centralized Intelligent Servers
That is, the server in a client/server system is more than just a file repository where multiple users
share file sectors over a network. Client/server servers are intelligent, because they carry out
commands in the form of Structured Query Language (SQL) questions and return answers in the
form of result sets.
Network Connectivity
Generally, a server is connected to a clients by way of a network. This could be a LAN, WAN, or the
Internet. However, most correctly designed client/server designs do not consume significant
resources on a wire. In general, short SQL queries are sent from a client and a server returns brief
sets of specifically focused data.
Operationally Challenged Workstations
A client computer in a client/server system need not run a particularly complex application. It must
simply be able to display results from a query and capture criteria from a user for subsequent
requests. While a client computer is often more capable than a dumb terminal, it need not house a
particularly powerful processor—unless it must support other, more hardware-hungry applications.
Applications Leverage Server Intelligence
Client/server applications are designed with server intelligence in mind. They assume that the server
will perform the heavy lifting when it comes to physical data manipulation. Then, all that is required
of the client application is to ask an intelligent question—expecting an intelligent answer from the
server. The client application's job is to display the results intelligently.
Shared Data, Procedures, Support, Hardware
A client/server system depends on the behavior of the clients—all of them—to properly share the
server's—and the clients'— resources. This includes data pages, stored procedures, and RAM as well
as network capacity.
System Resource Use Minimized
All client/server systems are designed to minimize the load on the overall system. This permits an
application to be scaled more easily—or at all.
60
Maximize Design Tradeoffs
A well-designed client/server application should be designed to minimize overall systems load, or
designed for maximum throughput, or designed to avoid really bad response times—or maybe a good
design can balance all of these.
4.6 Logical and Structured Addresses
At the logical level, a structured document is made up of a number of different parts. Some parts are
optional, others are compulsory. Many of these document structures have a required order--they
cannot be inserted at arbitrary points in the document.
For example, a document must have a title and it must be the first element in the document. The
programming example is very good because it is very simple. At the logical level, sections are used to
break up the document into parts and sub-parts that help to assist the reader to follow the structure
of the document and to navigate their way through it.
Sections are made up of a section title, followed by one or more text blocks and then, optionally, one
or more sub-sections. Sections are allowed in either a Chapter document (within chapters) or a
Simple document, and in Appendices.
Sections can contain other sections i.e. they may be nested. Only 4 levels of section nesting are
recommended. Note that once you start entering nested (sub-)sections, you cannot enter any textblocks after the sub-sections i.e.. all of the sections text blocks must come before any sub-sections.
Sections are automatically numbered. A level 1 section has one number, a level two section is
numbered N.n, a level 3 section N.n.n and so on. If the section is contained within a chapter or
appendix, the section number is prefixed with the chapter number or appendix number.
4.7 Record Modifications
The domain record modification/transfer process is the complete responsibility of the domain owner.
If you need assistance modifying your domain record, please contact your domain registrar for
technical support. ColossalHost.com will not provide excessive support resources assisting customers
in the domain record modification process. It is important to understand that ColossalHost.com has
no more power to make modifications to a Subscriber's domain record than a complete stranger
would. The domain name owner is completely responsible for the information (including the name
servers) that is contained within the domain record.
4.8 Index Structures
Index structures in object-oriented database management systems should support selections not only
with respect to physical object attributes, but also with respect to derived attributes. A simple
example arises, if we assume the object types Company , Division, and Employee, with the
relationships has division from Company to Division, and employs from Division to Employee.
61
Index structures in object-oriented database management systems should support selections not only
with respect to physical object attributes, but also with respect to derived attributes. A simple
example arises, if we assume the object types Company , Division, and Employee, with the
relationships has division from Company to Division, and employs from Division to Employee. In this
case the index structure should allow support queries for companies specifying the number of
employees of the company.
4.9 Indexes on Sequential Files
Data structure for locating records with given search key efficiently. Also facilitates a full scan of a
relation.
4.10 Secondary Indexes
Secondary indexes to provide access to subsets of records. Both databases and tables provide
automatic secondary indexes. All secondary indexes are held on a separate database file (.sr6). This is
created when the first index is created and deleted if the last index is deleted. Each secondary index
is physically very similar to a standard database. It contains index blocks and data blocks. The sizes
of these blocks are calculated in a similar way to the block size calculations for standard database
blocks to ensure reasonably efficient processing given the size of the secondary index key and the
maximum number of records of that type. Each index potentially has different block sizes. Each
record in the data block in a secondary index has the secondary key as the key and contains the
standard database key as the data. Thus the size of these data blocks is affected by the size of both
keys
All these index files (i.e., primary, secondary, and clustering indexes) are ordered files, and have two
fields of fixed length. One field contains data (in which its value is the same as a field from data file)
and the other field is a pointer to the data file, but
62
In primary indexes, the data item of the index has the value of the primary key of the first record (the
anchor record) of the block in which the pointer points to.
In secondary indexes, the data item of the index has a value of a secondary key and the pointer
points to a block, in which a record with such secondary keys is stored in.
In clustering indexes, the data item on the index has a value of a non-key field, and the pointer
points to the block in which the first record with such non-key fields is stored in.
In primary index files, for every block in a data file, one record exists in the primary index file. The
number of records in the index file is, equal to the number of blocks in the data file. Hence, the
primary indexes are non-dense.
In secondary index files, for every record in the data file, one record exists in the secondary index file.
The number of records in the secondary file is equal to the number of records in the data file. Hence,
the secondary indexes are dense.
In clustering index files, for every distinct clustering field, one record exists in the clustering index
file. The number of records in the clustering index is equal to the distinct numbers for the clustering
field in the data file. Hence, the clustering indexes are non-dense.
The Customers Table holds information on customers, such as their customer number, name and
address. Run the Database Desktop program from the Start Menu or select Tools-> Database
Desktop in Delphi. Open the Customers table copied in the previous step. By default the data in the
table is displayed as read only. Familiarise yourself with the data. Change to edit mode (Table->Edit
Data) and add a new record. View the structure of the table (Table->Info Structure). Select Table>Restructure to restructure the table. Add a secondary index to the table, by selecting Secondary
Indexes from the Table properties combo box. The secondary index is composed of LastName and
FirstName, in that order. Call the index CustomersNameIndex. The index will be used to access the
Customers table on customer name.
4.11 B-trees
A B-tree is a specialized multiway tree designed especially for use on disk. In a B-tree each node may
contain a large number of keys. The number of sub trees of each node, then, may also be large. A Btree is designed to branch out in this large number of directions and to contain a lot of keys in each
node so that the height of the tree is relatively small. This means that only a small number of nodes
must be read from disk to retrieve an item. The goal is to get fast access to the data, and with disk
drives this means reading a very small number of records. Note that a large node size (with lots of
keys in the node) also fits with the fact that with a disk drive one can usually read a fair amount of
data at once.
A multiway tree of order m is an ordered tree where each node has at most m children. For each node,
if k is the actual number of children in the node, then k - 1 is the number of keys in the node. If the
keys and subtrees are arranged in the fashion of a search tree, then this is called a multiway search
tree of order m. For example, the following is a multiway search tree of order 4. Note that the first row
63
in each node shows the keys, while the second row shows the pointers to the child nodes. Of course,
in any useful application there would be a record of data associated with each key, so that the first
row in each node might be an array of records where each record contains a key and its associated
data. Another approach would be to have the first row of each node contain an array of records where
each record contains a key and a record number for the associated data record, which is found in
another file. This last method is often used when the data records are large. The example software
will
use
the
first
method.
What does it mean to say that the keys and subtrees are "arranged in the fashion of a search tree"?
Suppose that we define our nodes as follows:
typedef struct
{
int Count;
// number of keys stored in the current node
ItemType Key[3]; // array to hold the 3 keys
long Branch[4]; // array of fake pointers (record numbers)
} NodeType;
Then a multiway search tree of order 4 has to fulfill the following conditions related to the ordering of
the keys:
•
The keys in each node are in ascending order.
•
At every given node (call it Node) the following is true:
o The subtree starting at record Node.Branch[0] has only keys that are less
Node.Key[0].
o The subtree starting at record Node.Branch[1] has only keys that are greater
Node.Key[0] and at the same time less than Node.Key[1].
o The subtree starting at record Node.Branch[2] has only keys that are greater
Node.Key[1] and at the same time less than Node.Key[2].
o The subtree starting at record Node.Branch[3] has only keys that are greater
Node.Key[2].
•
than
than
than
than
Note that if less than the full number of keys are in the Node, these 4 conditions are truncated
so that they speak of the appropriate number of keys and branches.
64
This generalizes in the obvious way to multiway search trees with other orders.
A B-tree of order m is a multiway search tree of order m such that:
•
All leaves are on the bottom level.
•
All internal nodes (except the root node) have at least ceil(m / 2) (nonempty) children.
•
The root node can have as few as 2 children if it is an internal node, and can obviously have
65
no children if the root node is a leaf (that is, the whole tree consists only of the root node).
•
Each leaf node must contain at least ceil(m / 2) - 1 keys.
Note that ceil(x) is the so-called ceiling function. It's value is the smallest integer that is greater than
or equal to x. Thus ceil(3) = 3, ceil(3.35) = 4, ceil(1.98) = 2, ceil(5.01) = 6, ceil(7) = 7, etc.
A B-tree is a fairly well-balanced tree by virtue of the fact that all leaf nodes must be at the bottom.
Condition (2) tries to keep the tree fairly bushy by insisting that each node have at least half the
maximum number of children. This causes the tree to "fan out" so that the path from root to leaf is
very short even in a tree that contains a lot of data.
Example B-Tree
The following is an example of a B-tree of order 5. This means that (other that the root node) all
internal nodes have at least ceil(5 / 2) = ceil(2.5) = 3 children (and hence at least 2 keys). Of course,
the maximum number of children that a node can have is 5 (so that 4 is the maximum number of
keys). According to condition 4, each leaf node must contain at least 2 keys. In practice B-trees
usually have orders a lot bigger than 5.
4.12 Hash Tables
Linked lists are handy ways of tying data structures together but navigating linked lists can be
inefficient. If you were searching for a particular element, you might easily have to look at the whole
list before you find the one that you need. Linux uses another technique, hashing to get around this
restriction. A hash table is an array or vector of pointers. An array, or vector, is simply a set of things
coming one after another in memory. A bookshelf could be said to be an array of books. Arrays are
accessed by an index, the index is an offset into the array. Taking the bookshelf analogy a little
further, you could describe each book by its position on the shelf; you might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in
those data structures. If you had data structures describing the population of a village then you
could use a person's age as an index. To find a particular person's data you could use their age as an
66
index into the population hash table and then follow the pointer to the data structure containing the
person's details. Unfortunately many people in the village are likely to have the same age and so the
hash table pointer becomes a pointer to a chain or list of data structures each describing people of
the same age. However, searching these shorter chains is still faster than searching all of the data
structures.
As a hash table speeds up access to commonly used data structures, Linux often uses hash tables to
implement caches. Caches are handy information that needs to be accessed quickly and are usually a
subset of the full set of information available. Data structures are put into a cache and kept there
because the kernel often accesses them. There is a drawback to caches in that they are more complex
to use and maintain than simple linked lists or hash tables. If the data structure can be found in the
cache (this is known as a cache hit, then all well and good. If it cannot then all of the relevant data
structures must be searched and, if the data structure exists at all, it must be added into the cache.
4.13
Self Test
1. Hash Tables (G&T Section 8.3, page 341)
R-8.4 Draw the 11-item hash table that results from using the hash function
h(i) = (2i+5) mod 11
to hash the keys 12, 44, 13, 88, 23, 94, 11, 39, 20, 16 and 5, assuming collisions are handled
by chaining.
R-8.5 What is the result of the previous exercise, assuming collisions are handled by linear
probing?
R-8.7 What is the result of Exercise R-8.4 assuming collisions are handled by double hashing
using a secondary hash function h'(k) = 7 - (k mod 7)?
2.
“B-Trees are the bread and butter of the database world: Virtually every database is
implemented internally as some variant of B-Trees.”. Comment on this statement.
3.
What are the differences among primary, secondary, and clustering indexes? How do these
differences affect the ways in which these indexes are implemented? Which of these indexes
are dense, and which are not?
4.
How does multi-level indexing improve the efficiency of searching an index file?
5.
Do detailed procedures exist for at least the following operations :
4.14 Record creation?
4.15 Record modification?
4.16 Record duplication?
4.17 Record destruction?
4.18 Consistent quality control?
4.19 Problem resolution?
6.
What is Client and server system. What do Client and the Server do.Client/Server
Architecture. Example:World Wide Web .What is Client/Server Computing
67
UNIT 5
RELATIONAL MODEL
5.1
5.2
5.3
5.4
5.5
5.6
Introduction
Objectives
Concepts of a relational model
Formal definition of a relation
The codd commandments
Summary
1
INTRODUCTION
One of the main advantages of the relational model is that it is conceptually simple and more
importantly based on mathematical theory of relation. It also frees the users from details of storage
structure and access methods.
The relational model like all other models consists of three basic components: A set of domains and a
set of relations Operation on relations Integrity rules
5.2 OBJECTIVES
After completing this unit, you will be able to:
— Define the concepts of relational model
— Discuss the basic operations of the relational algebra
— State the integrity rules
5.3 CONCEPTS OF A RELATIONAL MODEL
The relational model was formally introduced by Dr. E. F. Codd in 1970 and has evolved since then,
through a series of writings. The model provides a simple, yet rigorously defined, concept of how
users perceive data. The relational model represents data in the form of two-dimension tables. Each
table represents some real-world person, place, thing, or event about which information is collected.
A relational database is a collection of two-dimensional tables. The organization of data into relational
tables is known as the logical view of the database. That is, the form in which a relational database
presents data to the user and the programmer. The way the database software physically stores the
data on a computer disk system is called the internal view. The internal view differs from product to
product and does not concern us here.
A basic understanding of the relational model is necessary to effectively use relational database
software such as Oracle, Microsoft SQL Server, or even personal database systems such as Access or
Fox, which are based on the relational model.
E.F. Codd of the IBM propounded the relational model in 1972. The basic concept in the relational
model is that of a relation.
68
A relation can be viewed as a table, which has the following properties:
Property 1: It is column homogeneous. In other words, in any given column of a table, all items are of
the same kind.
Property 2: Each item is a simple number or a character string. That is, a table must be in 1NF.(First
Normal Form) which will be introduced in the second unit.
Property 3: All rows of a table are distinct.
Property 4: The ordering of rows within a table is immaterial.
Property 5: The columns of a table are assigned distinct names and the ordering of these columns is
immaterial.
Example of a valid relation
5.4.
S#
P#
SCITY
10
1
BANGALORE
10
2
BANGALORE
11
1
BANGALORE
11
2
BANGALORE
FORMAL DEFINITION OF A RELATION
Relationships are equally as important to an object-oriented model as to a relational model. Following
are some issues related to relationships that you should keep in mind when designing schema.
Although there are no specific recommendations as to how to handle many of the relationship issues
in object-oriented modeling, it is usually possible to come up with a reasonable, workable solution if
you design with these issues in mind.
In the relational model, a database is a collection of relational tables. A relational table is a flat file
composed of a set of named columns and an arbitrary number of unnamed rows. The columns of the
tables contain information about the table. The rows of the table represent occurrences of the "thing"
represented by the table. A data value is stored in the intersection of a row and column. Each named
column has a domain, which is the set of values that may appear in that column. Fig shows the
relational tables for a simple bibliographic database that stores information about book title, authors,
and publishers.
69
There are alternate names used to describe relational tables. Some manuals use the terms tables,
fields, and records to describe relational tables, columns, and rows, respectively. The formal
literature tends to use the mathematical terms, relations, attributes, and tuples. Figure 2
summarizes these naming conventions.
In This Document
Formal Terms
Many Database Manuals
Relational Table
Relation
Table
Column
Attribute
Field
Row
Tuple
Record
Relational tables can be expressed concisely by eliminating the sample data and showing just the
table name and the column names. For example,
AUTHOR
(au_id, au_lname, au_fname, address, city, state, zip)
70
TITLE
(title_id, title, type, price, pub_id)
PUBLISHER
(pub_id, pub_name, city)
AUTHOR_TITLE
(au_id, title_id)
Properties of Relational Tables
Relational tables have six properties:
Values Are Atomic: This property implies that columns in a relational table are not repeating group or
arrays. Such tables are referred to as being in the "first normal form" (1NF). The atomic value
property of relational tables is important because it is one of the cornerstones of the relational model.
The key benefit of the one value property is that it simplifies data manipulation logic.
Column Values Are of the Same Kind: In relational terms this means that all values in a column
come from the same domain. A domain is a set of values which a column may have. For
example, a Monthly_Salary column contains only specific monthly salaries. It never contains
other information such as comments, status flags, or even weekly salary.
•
Each Row is Unique: This property ensures that no two rows in a relational table are identical;
there is at least one column, or set of columns, the values of which uniquely identify each row
in the table. Such columns are called primary keys and are discussed in more detail in
Relationships and Keys.
•
The Sequence of Columns is Insignificant: This property states that the ordering of the columns
in the relational table has no meaning. Columns can be retrieved in any order and in various
sequences. The benefit of this property is that it enables many users to share the same table
without concern of how the table is organized. It also permits the physical structure of the
database to change without affecting the relational tables.
•
The Sequence of Rows is Insignificant: This property is analogous the one above but applies to
rows instead of columns. The main benefit is that the rows of a relational table can be retrieved
in different order and sequences. Adding information to a relational table is simplified and does
not affect existing queries.
•
Each Column Has a Unique Name: Because the sequence of columns is insignificant, columns
must be referenced by name and not by position. In general, a column name need not be
unique within an entire database but only within the table to which it belongs.
5.5 THE CODD COMMANDMENTS
In the most basic of definitions a DBMS can be regarded as relational only if it obeys the following
three rules:
— All information must be held in tables
— Retrieval of the data must be possible using the following types of operations:
SELECT, JOIN and PROJECT
71
— All relationships between data must be represented explicitly in that data itself.
To define the requirement more rigorously, E.F Codd formulated the 12 rules as follows
Rule 1: The information rule
All information is explicitly and logically represented in exactly one way-by data values in tables.
In simple terms this means that if an item of data doesn't reside somewhere in a table in the
database then it doesn't exist and this should be extended to the point where even such information
as table, view and column names to mention just a few, should be contained somewhere in table
form. This necessitates the provision of such facilities that allow relatively easy additions to RDBMS
of programming and CASE tools for example. This rule serves on its own to invalidate the claims of
several databases to be relational simply because of their lack of ability to store dictionary items (or
indeed metadata) in an integrated, relational form. Commonly such products implement their
dictionary information systems in some native file structure, and thus set themselves up for failing at
the first hurdle.
Rule 2: The rule of guaranteed access
Every item of data must be logically addressable by resorting to a combination of table name, primary
key value and a column name.
Whilst it is possible to retrieve individual items of data in many different ways, especially in a
relational/SQL environment, it must be true that any item can be retrieved by supplying the table
name, the primary key value of the row holding the item and column name in which it is to be found.
If you think back to the table like storage structure, this rule is saying that at the intersection of a
column and a row you will necessarily find one value of a data item (or null).
Rule 3: The systematic treatment of null values
It may surprise you to see this subject on the list of properties, but it is fundamental to the DBMS
that null values are supported in the representation of missing and inapplicable information. This
support for null values must be consistent throughout the DBMS, and independent of data type (a
null value in a CHAR field must mean the same as null in an INTEGER field for example).
It has often been the case in other product types that a character to represent missing or inapplicable
data has been allocated from the domain of characters pertinent to a particular item. We may for
example define four permissible values for a column SEX as:
M Male
F Female
X No data available
Y Not applicable
Such a solution requires careful design, and must decrease production at the very least. This
situation is particularly undesirable when very high-level languages such as SQL are used to
manipulate such data, and if such a solution is used for numeric columns all sorts of problems can
arise during aggregate functions such as SUM and AVERAGE etc.
Rule 4: The database description rule
72
A description of the database is held and maintained using the same logical structures used to define
the data, thus allowing users with appropriate authority to query such information in the same ways
and using the same languages as they would any other data in the database.
Put into easy terms, Rule 4 means that there must be a data dictionary within the RDBMS that is
constructed of tables and/or views that can be examined using SQL. This rule states therefore that a
dictionary is mandatory, and if taken in conjunction with Rule 1, there can be no doubt that the
dictionary must also consist of combinations of tables and views.
Rule 5: The comprehensive sub-language rule
There must be at least one language whose statements can be expressed as character strings
conforming to some well-defined syntax, that is comprehensive in supporting the following:
Data definition
View definition
Data manipulation
Integrity constraints
Authorization
Transaction boundaries
Again in real terms, this means that the RDBMS must be completely manageable through its own
dialect of SQL, although some products still support SQL-like languages (Ingress support of Quel for
example). This rule also sets out to scope the functionality of SQL-you will detect an implicit
requirement to support access control, integrity constraints and transaction management facilities
for example.
Rule 6: The view-updating rule
All views that can be updated in theory, can also be updated by the system.
This is quite a difficult rule to interpret, and so a word of explanation is required whilst it is possible
to create views in all sorts of illogical ways, and with all sorts of aggregates and virtual columns, it is
obviously not possible to update through some of them. As a very simple example, if you define a
virtual column in a view as A*B where A and B are columns in a base table, then how can you
perform an update on that virtual column directly? The database cannot possible break down any
number supplied, into its two component parts, without more information being supplied. To delve a
little deeper, we should consider that can be defined in terms of both tables and other views.
Particular vendors restrict the complexity of their own implementations, in some cases quite
drastically.
Even in logical terms it is often incredibly difficult to tell whether a view is theoretically updateable,
let alone delve into the practicalities of actually doing so. In fact there exists another set of rules that,
when applied to a view, can be used to determine its level of logical complexity, and it is only realistic
to apply Rule 6 to those views that are defined as simple by such criteria.
Rule 7: The insert , update and delete rule
An RDBMS must do more than just be able to retrieve relational data sets. It has to be capable of
inserting, updating and deleting data as a relational set. Many RDBMS that fail the grade fall back to
a single-record-at-a-time procedural technique when it comes time to manipulate data.
73
Rule 8: The physical independence rule
User access to the database, via monitors or application programs, must remain logically consistent
whenever changes to the storage representation, or access methods to the data, are changed.
Therefore, and by way of an example, if an index is built or destroyed by a DBA on a table, any user
should retrieve the same data from that table, albeit a little more slowly. It is this rule that demands
the clear distinction between the logical and physical layers of the database. Applications must be
limited to interfacing with the logical layer to enable the enforcement of this rule, and it is this rule
that sorts out the men from the boys in the relational market place. Looking at other architectures
already discussed, on can imagine the consequences of changing the physical structure of a network
or hierarchical system.
However there are plenty of traps waiting even in the relational world. Consider the application
designer who depends on the presence of a B-type tree index to ensure retrieval of data is in a
predefined order, only to find that the DBA dynamically drops the index. The removal of such an
index might be catastrophic. I point out these two issues because although they are serious factors, I
am not convinced that they constitute the breaking of this rule; it is for the individual to make up his
own mind.
Rule 9: The logical data independence rule
Application programs must be independent of changes made to the base tables.
Fig: TAB1 split into two fragments
This rule allows many types of database design change to be made dynamically, without users being
aware of them. To illustrate the meaning of the rule the examples on the next page shows two types
of activity, described in more detail later, that should be possible if this rule is enforced.
74
Fig : Two fragments Combined into One Table
Firstly, it should be possible to split a table vertically into more than one fragment, as long as such
splitting preserves all the original data (is non-loss), and maintain the primary key in each and every
fragment. This means in simple terms that a single table should be divided into one or more other
tables.
Secondly it should be possible to combine base tables into one by way of non-loss join. Note that if
such changes are made, then views will be required so that users and applications are unaffected by
them.
Rule 10: Integrity rules
The relational model includes two general integrity rules. These integrity rules implicitly or explicitly
define the set of consistent database states, or changes of state, or both. Other integrity constraints
can be specified, for example, in terms of dependencies during database design. In this section we
define the integrity rules formulated by Codd.
Integrity Rule 1
Integrity rule 1 is concerned with primary key values. Before we formally state the rule, let us look at
the effect of null values in prime attributes. A null value for an attribute is a value that is either not
known at the time or does not apply to a given instance of the object .It may also be possible that a
particular tuple does not have a value for an attribute; this fact could be represented by a null value.
If any attribute of a primary key (prime attribute) were permitted to have null values, then because
the attributes in the key must be non-redundant, the key cannot be used for unique identification of
tuples. This contradicts the requirements for a primary key. Consider the relation P in fig. The
attribute Id is the primary key for P.
Id
Name
101
Jones
103
Smith
104
Lalonde
107
Evan
110
Drew
112
Smith
(a)
75
P:
Id
Name
101
Jones
@
Smith
104
Lalonde
107
Evan
110
Drew
@
Lalonde
@
Smith
(b)
Fig: (a) Relation without null values and (b) relation with null values
Integrity rule 1 specifies that instances of the entities are distinguishable and thus no prime attribute
(component of a primary key) value may be null. This rule is also referred to as the entity rule. We
could state this rule formally as:
Definition: Integrity Rule 1 (Entity Integrity):
If the attribute A of relation R is a prime attribute of R, then A cannot accept null values.
Integrity Rule 2(Referential Integrity):
Integrity rule 2 is concerned with foreign keys, i.e., with attributes of a relation having domains that
are those of the primary key of another relation.
Relation (R) , may contain references to another relation (S). Relations R and S need not be distinct.
Suppose the reference in r is via a set of attributes that forms a primary key of the relation S. This set
of attributes in R is a foreign key. A valid relationship between a tuple in R to one in S requires that
the values of the attributes in the foreign key of R correspond to the primary key of a tuple in S. This
ensures that the reference from the tuple of the relation R is made unambiguously to an existing
tuple in the S relation. The referencing attribute(s) in the R relation can have null value(s);in this
case, it is not referencing any tuple in the S relation. However, if the value is not null, it must exist as
the primary attribute of a tuple of the S relation. If the referencing attribute in R has a value that is
nonexistent in S, R is attempting to refer a nonexistent tuple and hence a nonexistent instance of the
corresponding entity. This cannot be allowed. We illustrate this point in the following example.
Rule 11: Distribution independence rule:
A RDBMS must have distribution independence.
This is one of the more attractive aspects of RDBMS. Database systems built on the relational
framework are well suited to today's client/server database design.
Rule 12: No subversion rule:
If an RDBMS supports a lower level language that permits for example, row-at-a-time processing,
then this language must not be able to bypass any integrity rules or constraints of the relational
language.
Thus, not only must a RDBMS be governed by relational rules, but also those rules must be its
primary laws.
76
The beauty of the relational database is that the concepts that define it are few, easy to understand
and explicit. The 12 rules explained can be used as the basic relational design criteria, and as such
are clear indications of the purity of the relational concept.
6.
1)
2)
3)
Self Test
Define relational database system? Where we use RDBMS.
Explain different type of RDBMS
Explain Codd Rules with the help of suitable example.
5.7 SUMMARY
•
One of the main advantages of the relational model is that it is conceptually simple and more
importantly based on mathematical theory of relation. It also frees the users from details of
storage structure and access methods.
•
The model provides a simple, yet rigorously defined, concept of how users perceive data. The
relational model represents data in the form of two-dimension tables. Each table represents
some real-world person, place, thing, or event about which information is collected. A
relational database is a collection of two-dimensional tables.
•
Relationships are equally as important to an object-oriented model as to a relational model.
Following are some issues related to relationships that you should keep in mind when
designing schema.
77
UNIT 6
NORMALIZATION
6.1
6.2
6.3
6.4
Functional dependency
Normalization
6.2.1 First normal form
6.2.2 Second normal form
6.2.3 Third normal form
6.2.4 Boyce-codd normal form
6.2.5 Multi-valued dependency
6.2.6 Fifth normal form
Self test
Summary
6.1 Functional Dependency
Consider a relation R that has two attributes A and B. The attribute B of the relation is functionally
dependent on the attribute A if and only if for each value of A no more than one value of B is
associated. In other words, the value of attribute A uniquely determines the value of B and if there
were several tuples that had the same value of A then all these tuples will have an identical value of
attribute B. That is, if t1 and t2 are two tuples in the relation R and
t1(A) = t2(A) then we must have t1(B) = t2(B).
A and B need not be single attributes. They could be any subsets of the attributes of a relation R
(possibly single attributes). We may then write
R.A -> R.B
If B is functionally dependent on A (or A functionally determines B). Note that functional dependency
does not imply a one-to-one relationship between A and B although a one-to-one relationship may
exist between A and B.
A simple example of the above functional dependency is when A is a primary key of an entity (e.g.
student number) and A is some single-valued property or attribute of the entity (e.g. date of birth). A > B then must always hold. (why?)
Functional dependencies also arise in relationships. Let C be the primary key of an entity and D be
the primary key of another entity. Let the two entities have a relationship. If the relationship is oneto-one, we must have C -> D and D -> C. If the relationship is many-to-one, we would have C -> D but
not D -> C. For many-to-many relationships, no functional dependencies hold. For example, if C is
student number and D is subject number, there is no functional dependency between them. If
however, we were storing marks and grades in the database as well, we would have
(student_number, subject_number) -> marks
and we might have
78
marks -> grades
The second functional dependency above assumes that the grades are dependent only on the marks.
This may sometime not be true since the instructor may decide to take other considerations into
account in assigning grades, for example, the class average mark.
For example, in the student database that we have discussed earlier, we have the following functional
dependencies:
sno -> sname
sno -> address
cno -> cname
cno -> instructor
instructor -> office
These functional dependencies imply that there can be only one name for each sno, only one address
for each student and only one subject name for each cno. It is of course possible that several
students may have the same name and several students may live at the same address. If we consider
cno -> instructor, the dependency implies that no subject can have more than one instructor
(perhaps this is not a very realistic assumption). Functional dependencies therefore place constraints
on what information the database may store. In the above example, one may be wondering if the
following FDs hold
sname -> sno
cname -> cno
Certainly there is nothing in the instance of the example database presented above that contradicts
the above functional dependencies. However, whether above FDs hold or not would depend on
whether the university or college whose database we are considering allows duplicate student names
and subject names. If it was the enterprise policy to have unique subject names than cname -> cno
holds. If duplicate student names are possible, and one would think there always is the possibility of
two students having exactly the same name, then sname -> sno does not hold.
Functional dependencies arise from the nature of the real world that the database models. Often A
and B are facts about an entity where A might be some identifier for the entity and B some
characteristic. Functional dependencies cannot be automatically determined by studying one or more
instances of a database. They can be determined only by a careful study of the real world and a clear
understanding of what each attribute means.
We have noted above that the definition of functional dependency does not require that A and B be
single attributes. In fact, A and B may be collections of attributes. For example
(sno, cno) -> (mark, date)
When dealing with a collection of attributes, the concept of full functional dependence is an important
one. Let A and B be distinct collections of attributes from a relation R end let R.A -> R.B. B is then
fully functionally dependent on A if B is not functionally dependent on any subset of A. The above
example of students and subjects would show full functional dependence if mark and date are not
functionally dependent on either student number ( sno) or subject number ( cno) alone. The implies
that we are assuming that a student may have more than one subjects and a subject would be taken
79
by many different students. Furthermore, it has been assumed that there is at most one enrolment of
each student in the same subject.
The above example illustrates full functional dependence. However the following dependence
(sno, cno) -> instructor is not full functional dependence because cno -> instructor holds.
As noted earlier, the concept of functional dependency is related to the concept of candidate key of a
relation since a candidate key of a relation is an identifier which uniquely identifies a tuple and
therefore determines the values of all other attributes in the relation. Therefore any subset X of the
attributes of a relation R that satisfies the property that all remaining attributes of the relation are
functionally dependent on it (that is, on X), then X is candidate key as long as no attribute can be
removed from X and still satisfy the property of functional dependence. In the example above, the
attributes (sno, cno) form a candidate key (and the only one) since they functionally determine all the
remaining attributes.
Functional dependence is an important concept and a large body of formal theory has been developed
about it. We discuss the concept of closure that helps us derive all functional dependencies that are
implied by a given set of dependencies. Once a complete set of functional dependencies has been
obtained, we will study how these may be used to build normalised relations.
Rules About Functional Dependencies
•
Let F be set of FDs specified on R
•
Must be able to reason about FD’s in F
o Schema designer usually explicitly states only FD’s which are obvious
o Without knowing exactly what all tuples are, must be able to deduce other/all FD’s
that hold on R
o Essential when we discuss design of “good” relational schemas
Design of Relational Database Schemas
Problems such as redundancy that occur when we try to cram too much into a single relation are
called anomalies. The principal kinds of anomalies that we encounter are:
_ Redundancy. Information may be repeated unnecessarily in several tuples.
_ Update Anomalies. We may change information in one tuples but leave the same information
unchanged in another.
_ Deletion Anomalies. If a set of values becomes empty, we may lose other information as side effect.
6.2 Normalization
Designing a database, usually a data model is translated into relational schema. The important
question is whether there is a design methodology or is the process arbitrary. A simple answer to this
question is affirmative. There are certain properties that a good database design must possess as
dictated by Codd’s rules. There are many different ways of designing good database. One of such
methodologies is the method involving ‘Normalization’. Normalization theory is built around the
concept of normal forms. Normalization reduces redundancy. Redundancy is unnecessary repetition
of data. It can cause problems with storage and retrieval of data. During the process of normalization,
80
dependencies can be identified, which can cause problems during deletion and updation.
Normalization theory is based on the fundamental notion of Dependency. Normalization helps in
simplifying the structure of schema and tables.
For example the normal forms, we will take an example of a database of the following logical design:
Relation S { S#, SUPPLIERNAME, SUPPLYTATUS, SUPPLYCITY}, Primary Key{S#}
Relation P { P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key{P#}
Relation SP { S#, SUPPLYCITY, P#, PARTQTY},
Primary Key{S#, P#}
Foreign Key{S#} Reference S
Foreign Key{P#} Reference P
SP
S#
SUPPLYCITY
P#
PARTQTY
S1
Bombay
P1
3000
S1
Bombay
P2
2000
S1
Bombay
P3
4000
S1
Bombay
P4
2000
S1
Bombay
P5
1000
S1
Bombay
P6
1000
S2
Mumbai
P1
3000
S2
Mumbai
P2
4000
S3
Mumbai
P2
2000
S4
Madras
P2
2000
S4
Madras
P4
3000
S4
Madras
P5
4000
Let us examine the table above to find any design discrepancy. A quick glance reveals that some of
the data are being repeated. That is data redundancy, which is of course an undesirable. The fact
that a particular supplier is located in a city has been repeated many times. This redundancy causes
many other related problems. For instance, after an update a supplier may be displayed to be from
Madras in one entry while from Mumbai in another. This further gives rise to many other problems.
Therefore, for the above reasons, the tables need to be refined. This process of refinement of a given
schema into another schema or a set of schema possessing qualities of a good database is known as
Normalization.
Database experts have defined a series of Normal forms each conforming to some specified design
quality condition(s). We shall
restrict ourselves to the first five
1NF
normal forms for the simple
reason of simplicity. Each next level
2NF
of normal form adds another
condition. It is interesting to note
3NF
that
the
process
of
normalization is reversible. The
4N
following diagram depicts the
relation between various normal
5NF
forms.
81
The diagram implies that 5th Normal form is also in 4th Normal form, which itself in 3rd Normal form
and so on. These normal forms are not the only ones. There may be 6th, 7th and nth normal forms, but
this is not of our concern at this stage.
Before we embark on normalization, however, there are a few more concepts that should be
understood.
Decomposition. Decomposition is the process of splitting a relation into two or more relations. This is
nothing but projection process. Decompositions may or may not loose information. As you would
learn shortly, that normalization process involves breaking a given relation into one or more relations
and also that these decompositions should be reversible as well, so that no information is lost in the
process. Thus, we will be interested more with the decompositions that incur no loss of information
rather than the ones in which information is lost.
Lossless decomposition: The decomposition, which results into relations without loosing any
information, is known as lossless decomposition or nonloss decomposition. The decomposition that
results in loss of information is known as lossy decomposition.
Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as
shown below.
S
S#
SUPPLYSTATUS
SUPPLYCITY
S3
100
Madras
S5
100
Mumbai
Let us decompose this table into two as shown below:
(1)
SX
S#
SUPPLYSTATUS
SY
S3
100
S5
100
(2)
SX
S#
SUPPLYSTATUS
S3
100
S5
100
SY
S#
SUPPLYCITY
S3
Madras
S5
Mumbai
SUPPLYSTATUS
100
100
SUPPLYCITY
Madras
Mumbai
Let us examine these decompositions. In decomposition (1) no information is lost. We can still say
that S3’s status is 100 and location is Madras and also that supplier S5 has 100 as its status and
location Mumbai. This decomposition is therefore lossless.
In decomposition (2), however, we can still say that status of both S3 and S5 is 100. But the location
of suppliers cannot be determined by these two tables. The information regarding the location of the
suppliers has been lost in this case. This is a lossy decomposition. Certainly, lossless decomposition
is more desirable because otherwise the decomposition will be irreversible. The decomposition
process is in fact projection, where some attributes are selected from a table. A natural question
arises here as to why the first decomposition is lossless while the second one is lossy? How should a
given relation must be decomposed so that the resulting projections are nonlossy? Answer to these
questions lies in functional dependencies and may be given by the following theorem.
82
Heath’s theorem: Let R{A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies the
FD A→B, then R is equal to the join of its projections on {A, B} and {A, C}.
Let us apply this theorem on the decompositions described above. We observe that relation S satisfies
two irreducible sets of FD’s
S# → SUPPLYSTATUS
S# → SUPPLYCITY
Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, this theorem confirms that
relation S can be nonloss decomposition into its projections on {S#, SUPPLYSTATUS} and {S#,
SUPPLYCITY} . Note, however, that the theorem does not say why projections {S#, SUPPLYSTATUS}
and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that one of the FD’s is lost in this
decomposition. While the FD S#→SUPPLYSTATUS is still represented by projection on {S#,
SUPPLYSTATUS}, but the FD S#→SUPPLYCITY has been lost.
An alternative criteria for lossless decomposition is as follows. Let R be a relation schema, and let F
be a set of functional dependencies on R. let R1 and R2 form a decomposition of R. this decomposition
is a lossless-join decomposition of R if at least one of the following functional dependencies are in F+:
R1
R1 ∩ R2 →
R1 ∩ R2 →
R2
Functional Dependency Diagrams: This is a handy tool for representing function dependencies
existing in a relation.
PARTNAME
SUPPLIERNAME
S#
S#
SUPPLYSTATUS
PARTCOLOR
PARTQTY
P#
PARTWEIGHT
P#
SUPPLYCITY
SUPPLYCITY
The diagram is very useful for its eloquence and in visualizing the FD’s in a relation. Later in the Unit
you will learn how to use this diagram for normalization purposes.
6.2.1 First Normal Form
A relation is in 1st Normal form (1NF) if and only if, in every legal value of that relation, every tuple
contains exactly one value for each attribute.
Although, simplest, 1NF relations have a number of discrepancies and therefore it not the most
desirable form of a relation.
Let us take a relation (modified to illustrate the point in discussion) as
83
Rel1{S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY} Primary Key{S#, P#}
FD{SUPPLYCITY → SUPPLYSTATUS}
Note that SUPPLYSTATUS is functionally dependent on SUPPLYCITY; meaning that a supplier’s
status is determined by the location of that supplier – e.g. all suppliers from Madras must have
status of 100. The primary key of the relation Rel1 is {S#, P#}. The FD diagram is shown below:
S#
SUPPLYCITY
P#
SUPPLYSTATUS
PARTQTY
For a good design the diagram should have arrows out of candidate keys only. The additional arrows
cause trouble.
Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us
insert some sample tuples into this relation.
REL1 S#
SUPPLYSTATUS
SUPPLYCITY P#
PARTQTY
S1
200
Madras
P1
3000
S1
200
Madras
P2
2000
S1
200
Madras
P3
4000
S1
200
Madras
P4
2000
S1
200
Madras
P5
1000
S1
200
Madras
P6
1000
S2
100
Mumbai
P1
3000
S2
100
Mumbai
P2
4000
S3
100
Mumbai
P2
2000
S4
200
Madras
P2
2000
S4
200
Madras
P4
3000
S4
200
Madras
P5
4000
The redundancies in the above relation causes many problems – usually known as update anamolies,
that is in INSERT, DELETE and UPDATE operations. Let us see these problems due to supplier-city
redundancy corresponding to FD S#→SUPPLYCITY.
INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the
information regarding a supplier. Thus, a supplier located in Kolkata is missing from the relation
because he has not supplied any part so far.
DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the tuple of a
supplier (if there is a single entry for that supplier), we not only delte the fact that the supplier
supplied a particular part but also the fact that the supplier is located in a particular city. In our
case, if we delete entries corresponding to S#=S2, we loose the information that the supplier is
84
located at Mumbai. This is definitely undesirable. The problem here is there are too many
informations attached to each tuple, therefore deletion forces loosing too many informations.
UPDATE: If we modify the city of a supplier S1 to Mumbai from Madras, we have to make sure that
all the entries corresponding to S#=S1 are updated otherwise inconsistency will be introduced. As a
result some entries will suggest that the supplier is located at Madras while others will contradict this
fact.
6.2.2 Second Normal Form
A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally
dependent on the primary key. Here it has been assumed that there is only one candidate key, which
is of course primary key.
A relation in 1NF can always decomposed into an equivalent set of 2NF relations. The reduction
process consists of replacing the 1NF relation by suitable projections.
We have seen the problems arising due to the less-normalization (1NF) of the relation. The remedy is
to break the relation into two simpler relations.
REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and
REL3{S#, P#, PARTQTY}
The FD diagram and sample relation, are shown below.
SUPPLYCITY
S#
S#
REL2
REL3
SUPPLYSTATUS
P#
S#
SUPPLYSTATUS
SUPPLYCITY
S1
200
Madras
S2
100
Mumbai
S3
100
Mumbai
S4
200
Madras
S5
300
Kolkata
S1
S2
S2
S3
S4
S4
S4
PARTQTY
P6
P1
P2
P2
P2
P4
P5
S#
S1
S1
S1
S1
S1
1000
3000
4000
2000
2000
3000
4000
P#
P1
P2
P3
P4
P5
PARTQTY
3000
2000
4000
2000
1000
REL2 and REL3 are in 2NF with their {S#} and {S#, P#} respectively. This is because all nonkeys of
REL1{ SUPPLYSTATUS, SUPPLYCITY}, each is functionally dependent on the primary key that is S#.
By similar argument, REL3 is also in 2NF. Evidently, these two relations have overcome all the
update anomalies stated earlier. Now it is possible to insert the facts regarding supplier S5 even when
he is not supplied any part, which was earlier not possible. This solves insert problem. Similarly,
delete and update problems are also over now.
85
These relations in 2NF are still not free from all the anomalies. REL3 is free from most of the
problems we are going to discuss here, however, REL2 still carries some problems. The reason is that
the dependency of SUPPLYSTATUS on S# is though functional, it is transitive via SUPPLYCITY. Thus
we see that there are two dependencies S#→SUPPLYCITY and SUPPLYCITY→SUPPLYSTATUS. This
implies S#→SUPPLYSTATUS. This relation has a transitive dependency. We will see that this
transitive dependency gives rise to another set of anomalies.
INSERT: We are unable to insert the fact that a particular city has a particular status until we have
some supplier actually located in that city.
DELETE: If we delete sole REL2 tuple for a particular city, we delete the information that that city has
that particular status.
UPDATE: The status for a given city still has redundancy. This causes usual redundancy problem
related to updataion.
6.2.3 Third Normal Form
A relation is in 3NF if only if it is in 2NF and every non-key attribute is non-transitively dependent on
the primary key.
To convert the 2NF relation into 3NF, once again, the REL2 is split into two simpler relations – REL4
and REL5 as shown below.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPLLYSTATUS}
The FD diagram and sample relation, is shown below.
S#
S#
S1
S2
S3
S4
S5
SUPPLYCITY
SUPPLYCITY
Madras
Mumbai
Mumbai
Madras
Kolkata
SUPPLYCITY
SUPPLYCITY
REL4
REL5
SUPPLYCITY SUPPLYSTATUS
Madras
200
Mumbai
100
Kolakata
300
Evidently, the above relations REL4 and REL5 are in 3NF, because there is no transitive
dependencies. Every 2NF can be reduced into 3NF by decomposing it further and removing any
transitive dependency.
Dependency Preservation
The reduction process may suggest a variety of ways in which a relation may be decomposed in
lossless decomposition. Thus, REL2 can be in which there was a transitive dependency and therefore,
we split it into two 3NF projections, i.e.
86
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPLLYSTATUS}
Let us call this decomposition as decompositio-1. An alternative decomposition may be:
REL4{S#, SUPPLYCITY} and
REL5{S#, SUPPLYSTATUS}
Which we will call decomposition-2.
Both the decompositions decomposition-1 and decomposition-2 are 3NF and lossless. However,
decomposition-2 is less satisfactory than decomposition-1. For example, it is still not possible to
insert the information that a particular city has a particular status unless some supplier is located in
the city.
In the decomposition-1, the two projections are independent of each other but the same is not true in
the second decomposition. Here independence is in the sense that updates are made into the
relations without regard of the other provided the insertion is legal. Also independent decompositions
preserve the dependencies of the database and no dependence is lost in the decomposition process.
The concept of independent projections provides for choosing a particular decomposition when there
is more than one choice.
6.2.4 Boyce-Codd Normal Form
The previous normal forms assumed that there was just one candidate key in the relation and that
key was also the primary key. Another class of problems arises when this is not the case. Very often
there will be more candidate keys than one in practical database designing situation. To be precise
the 1NF, 2NF and 3NF did not deal adequately with the case of relations that had two or more
candidate keys, and that the candidate keys were composite, and
They overlapped (i.e. had at least one attribute common).
A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD has
a candiadte key as its determinant.
Or
A relation is in BCNF if and only if all the determinants are candidate keys.
In other words, the only arrows in the FD diagram are arrows out of candidate keys. It has already
been explained that there will always be arrows out of candidate keys; the BCNF definition says there
are no others, meaning there are no arrows that can be eliminated by the normalization procedure.
These two definitions are apparently different from each other. The difference between the two BCNF
definitions is that we tacitly assume in the former case determinants are "not too big" and that all
FDs are nontrivial.
It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in
that it makes no explicit reference to first and second normal forms as such, nor to the concept of
transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the case
that any given relation can be nonloss decomposed into an equivalent collection of BCNF relations.
87
Thus, relations REL1 and REL2 which were not in 3NF, are not in BCNF either; also that relations
REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three
determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key, so
REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because the determinant {SUPPLYCITY}
is not a candidate key. Relations REL3, REL4, and REL5, on the other hand, are each in BCNF,
because in each case the sole candidate key is the only determinant in the respective relations.
We now consider an example involving two disjoint - i.e., nonoverlapping - candidate keys. Suppose
that in the usual suppliers relation REL1{S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, {S#}
and {SUPPLIERNAME} are both candidate keys (i.e., for all time, it is the case that every supplier has
a unique supplier number and also a unique supplier name). Assume, however, that attributes
SUPPLYSTATUS
and
SUPPLYCITY
are
mutually
independent
i.e.,
the
FD
SUPPLYCITY→SUPPLYSTATUS no longer holds. Then the FD diagram is as shown below.
S#
SUPPLYSTATUS
SUPPLIERNAME
SUPPLYCITY
Relation REL1 is in BCNF. Although the FD diagram does look "more complex" than a 3NF diagram,
it is nevertheless still the case that the only determinants are candidate keys; i.e., the only arrows are
arrows out of candidate keys. So the message of this example is just that having more than one
candidate key is not necessarily bad.
For illustration we will assume that in our relations supplier names are unique. Consider REL6.
REL6{ S#, SUPPLIERNAME, P#, PARTQTY }.
Since it contains two determinants, S# and SUPPLIERNAME that are not candidate keys for the
relation, this relation is not in BCNF. A sample snapshot of this relation is shown below:
REL6 S#
SUPPLIERNAME
P#
PARTQTY
S1
Pooran
P1
3000
S1
Anupam
P2
2000
S1
Vishal
P3
4000
S1
Vinod
P4
2000
As is evident from the figure above, relation REL6 involves the same kind of redundancies as did
relations REL1 and REL2, and hence is subject to the same kind of update anomalies. For example,
changing the name of suppliers from Vinod to Rahul leads, once again, either to search problems or
to possibly inconsistent results. Yet REL6 is in 3NF by the old definition, because that definition did
not require an attribute to be irreducibly dependent on each candidate key if it was itself a
component of some candidate key of the relation, and so the fact that SUPPLIERNAME is not
irreducibly dependent on {S#, P#} was ignored.
The solution to the REL6 problems is, of course, to break the relation down into two projections, in
this case the projections are:
REL7{S#, SUPPLIERNAME} and
88
REL8{S#, P#, PARTQTY}
Or
REL7{S#, SUPPLIERNAME} and
REL8{SUPPLIERNAME, P#, PARTQTY}
Both of these projections are in BCNF. The original design, consisting of the single relation REL1, is
clearly bad; the problems with it are intuitively obvious, and it is unlikely that any competent
database designer would ever seriously propose it, even if he or she had no exposure to the ideas of
BCNF etc. at all.
Comparison of BCNF and 3NF
We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an
advantage to 3NF in that we know that it is always possible to obtain a 3NF design without sacrificing
a lossless join or dependency preservation. Nevertheless, there is a disadvantage to 3NF. If we do not
eliminate all transitive dependencies, we may have to use null values to represent some of the
possible meaningful relationship among data items, and there is the problem of repetition of
information. The other difficulty is the repetition of information.
If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally
preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a
high penalty in system performance or risk the integrity of the data in our database. Neither of these
alternatives is attractive. With such alternatives, the limited amount of redundancy imposed by
transitive dependencies allowed under 3NF is the lesser evil. Thus, we normally choose to retain
dependency preservation and to sacrifice BCNF.
6.2.5 Multi-valued dependency
Multi-valued dependency may be formally defined as:
Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is
multi-dependent on A - in symbols,
A →→B
(read "A multi-determines B," or simply "A double arrow B") - if and only if, in every possible legal
value of R, the set of B values matching a given A value, C value pair depends only on the A value
and is independent of the C value.
To elucidate the meaningof the above statement, let us take one example relation REL8 as shown
beolw:
REL8
COURSE
TEACHERS
BOOKS
Computer
TEACHER
BOOK
Dr. Wadhwa
Graphics
Prof. Mittal
UNIX
TEACHER
BOOK
Prof. Saxena
Relational Algebra
Prof. Karmeshu
Discrete Maths
Mathematics
Assume that for a given course there can exist any number of corresponding teachers and any
number of corresponding books. Moreover, let us also assume that teachers and books are quite
89
independent of one another; that is, no matter who actually teaches any particular course, the same
books are used. Finally, also assume that a given teacher or a given book can be associated with any
number of courses.
Let us try to eliminate the relation-valued attributes. One way to do this is simply to replace relation
REL8 by a relation REL9 with three scalar attributes COURSE, TEACHER, and BOOK as indicated
below.
REL9
COURSE
TEACHER
BOOK
Computer
Dr. Wadhwa
Graphics
Computer
Dr. Wadhwa
UNIX
Computer
Prof. Mittal
Graphics
Computer
Prof. Mittal
UNIX
Mathematics
Prof. Saxena
Relational Algebra
Mathematics
Prof. Karmeshu
Disrete Maths
Mathematics
Prof. Karmeshu
Relational Algebra
As you can see from the relation, each tuple of REL8 gives rise to m * n tuples in REL9, where m and
n are the cardinalities of the TEACHERS and BOOKS relations in that REL8 tuple. Note that the
resulting relation REL9 is "all key".
The meaning of relation REL9 is basically as follows: A tuple {COURSE:c, TEACHER:t, BOOK:x}
appears in REL9 if and only if course c can be taught by teacher t and uses book x as a reference.
Observe that, for a given course, all possible combinations of teacher and book appear: that is, REL9
satisfies the (relation) constraint
if
tuples (c, t1, x1), (c, t2, x2) both appear
then tuples (c, t1, x2), (c, t2, x1) both appear also
Now, it should be apparent that relation REL9 involves a good deal of redundancy, leading as usual
to certain update anomalies. For example, to add the information that the Computer course can be
taught by a new teacher, it is necessary to insert two new tuples, one for each of the two books. Can
we avoid such problems? Well, it is easy to see that:
1. The problems in question are caused by the fact that teachers and books are completely
independent of one another;
2. Matters would be much improved if REL9 were decomposed into its two projections call them
REL10 and REL11 - on {COURSE, TEACHER} and {COURSE, BOOK}, respectively.
To add the information that the Computer course can be taught by a new teacher, all we have to do
now is insert a single tuple into relation REL10. Thus, it does seem reasonable to suggest that there
should be a way of "further normalizing" a relation like REL9.
It is obvious that the design of REL9 is bad and the decomposition into REL10 and REL11 is better.
The trouble is, however, these facts are not formally obvious. Note in particular that REL9 satisfies no
functional dependencies at all (apart from trivial ones such as COURSE → COURSE); in fact, REL9 is
in BCNF, since as already noted it is all key-any "all key" relation must necessarily be in BCNF. (Note
90
that the two projections REL10 and REL11 are also all key and hence in BCNF.) The ideas of the
previous normalization are therefore of no help with the problem at hand.
The existence of "problem" BCNF relation like REL9 was recognized very early on, and the way to deal
with them was also soon understood, at least intuitively. However, it was not until 1977 that these
intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion of multivalued dependencies, MVDs. Multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist
MVDs that are not FDs). In the case of relation REL9 there are two MVDs that hold:
COURSE →→ TEACHER
COURSE →→ BOOK
Note the double arrows; the MVD A→→B is read as "B is multi-dependent on A" or, equivalently, "A
multi-determines B." Let us concentrate on the first MVD, COURSE→→TEACHER. Intuitively, what
this MVD means is that, although a course does not have a single corresponding teacher - i.e., the
functional dependence COURSE→TEACHER does not hold-nevertheless, each course does have a
well-defined set of corresponding teachers. By "well-defined" here we mean, more precisely, that for a
given course c and a given book x, the set of teachers t matching the pair (c, x) in REL9 depends on
the value c alone - it makes no difference which particular value of x we choose. The second MVD,
COURSE→→BOOK, is interpreted analogously.
It is easy to show that, given the relation R{A, B, C), the MVD A→→B holds if and only if the MVD
A→→C also holds. MVDs always go together in pairs in this way. For this reason it is common to
represent them both in one statement, thus:
COURSE→→TEACHER | TEXT
Now, we stated above that multi-valued dependencies are a generalization of functional dependencies,
in the sense that every FD is an MVD. More precisely, an FD is an MVD in which the set of dependent
(right-hand side) values matching a given determinant (left-hand side) value is always a singleton set.
Thus, if A→B. then certainly A→→B.
Returning to our original REL9 problem, we can now see that the trouble with relation such as REL9
is that they involve MVDs that are not also FDs. (In case it is not obvious, we point out that it is
precisely the existence of those MVDs that leads to the necessity of – for example - inserting two
tuples to add another Computer teacher. Those two tuples are needed in order to maintain the
integrity constraint that is represented by the MVD.) The two projections REL10 and REL11 do not
involve any such MVDs, which is why they represent an improvement over the original design. We
would therefore like to replace REL9 by those two projections, and an important theorem proved by
Fagin in reference allows us to make exactly that replacement:
Theorem (Fagin): Let R{A, B, C} be a relation, where A, B, and C are sets of attributes. Then R is equal
to the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVDs A→→B | C.
At this stage we are equipped to define fourth normal form:
Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the
attributes of R such that the nontrivial (An MVD A→→B is trivial if either A is a superset of B or the
91
union of R and B is the entire heading) MVD A→→B is satisfied, then all attributes of R are also
functionally dependent on A.
In other words, the only nontrivial dependencies (FDs or MVDs) in R are of the form Y→X (i.e.,
functional dependency from a superkey Y to some other attribute X). Equivalently: R is in 4NF if it is
in BCNF and all MVDs in R are in fact "FDs out of keys." Therefore, that 4NF implies BCNF.
Relation REL9 is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD "out of
a key." The two projections REL10 and REL11 are both in 4NF, however. Thus 4NF is an
improvement over BCNF, in that it eliminates another form of undesirable dependency. What is more,
4NF is always achievable; that is, any relation can be nonloss decomposed into an equivalent
collection of 4NF relations.
You may recall that a relation R{A, B, C} satisfying the FDs A→B and B→C is better decomposed into
its projections on (A, B) and {B, C} rather than into those on {A, B] and {A, C). The same holds true if
we replace the FDs by the MVDs A→→B and B→→C.
6.2.6 Fifth Normal Form
It seems from our discussion so far in that the sole operation necessary or available in the further
normalization process is the replacement of a relation in a nonloss way by exactly two of its
projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a
surprise, therefore, to discover that there exist relations that cannot be nonloss-decomposed into two
projections but can be nonloss-decomposed into three (or more). An unpleasant but convenient term,
we will describe such a relation as "n-decomposable" (for some n > 2) - meaning that the relation in
question can be nonloss-decomposed into n projections but not into m for any m < n.
A relation that can be nonloss-decomposed into two projections we will call "2-decomposable" and
similarly term “n-decomposable” may be defined. The phenomenon of n-decomposability for n > 2 was
first noted by Aho, Been, and Ullman. The particular case n = 3 was also studied by Nicolas.
Consider relation REL12 from the suppliers-parts-projects database ignoring attribute QTY for
simplicity for the moment. A sample snapshot of the same is shown below. It may be pointed out that
relation REL12 is all key and involves no nontrivial FDs or MVDs at all, and is therefore in 4NF. The
snapshot of the relations also shows:
a. The three binary projections REL13, REL14, and REL15 corresponding to the REL12 relation
value displayed on the section of the adjoining diagram;
b. The effect of joining the REL13 and REL14 projections (over P#);
c. The effect of joining that result and the REL15 projection (over J# and S#).
REL12
S#
P#
J#
S1
P1
J2
S1
P2
J1
S2
P1
J1
S1
P1
J1
REL13
S#
P#
REL14
P#
J#
REL15
J#
S#
S1
P1
P1
J2
J2
S1
S1
P2
P1
J1
J1
S1
92
S2
P1
P1
J1
J1
S2
Join Dependency:
Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R satisfies
the Join Dependency (JD)
*{ A, B, ..., Z}
(read "star A, K ..., Z") if and only if every possible legal value of R is equal to the join of its projections
on A, B,..., Z.
For example, if we agree to use SP to mean the subset (S#, P#} of the set of attributes of REL12, and
similarly for FJ and JS, then relation REL12 satisfies the JD * {SP, PJ, JS}.
We have seen, then, that relation REL12, with its JD * {REL13, REL14, REL15}, can be 3decomposed. The question is, should it be? And the answer is "Probably yes." Relation REL12 (with
its JD) suffers from a number of problems over update operations, problems that are removed when it
is 3-decomposed.
Fagin's theorem, to the effect that R{A, B, C} can be non-loss-decomposed into its projections on {A,
B} and {A, C] if and only if the MVDs A→→B and A→→C hold in R, can now be restated as follows:
R{A, B, C} satisfies the JD*{AB, AC} if and only if it satisfies the MVDs A→→B | C.
Since this theorem can be taken as a definition of multi-valued dependency, it follows that an MVD is
just a special case of a JD, or (equivalently) that JDs are a generalization of MVDs.
Thus, to put it formally, we have
A→→B | C ≡ * {AB, AC}
Note that joint dependencies are the most general form of dependency possible (using, of course, the
term "dependency" in a very special sense). That is, there does not exist a still higher form of
dependency such that JDs are merely a special case of that higher form - so long as we restrict our
attention to dependencies that deal with a relation being decomposed via projection and recomposed
via join.
Coming back to the running example, we can see that the problem with relation REL12 is that it
involves a JD that is not an MVD, and hence not an FD either. We have also seen that it is possible,
and probably desirable, to decompose such a relation into smaller components - namely, into the
projections specified by the join dependency. That decomposition process can be repeated until all
resulting relations are in fifth normal form, which we now define:
Fifth normal form: A relation R is in 5NF - also called projection-join normal torn (PJ/NF) - if and
only if every nontrivial* join dependency that holds for R is implied by the candidate keys of R.
Let us understand what it means for a JD to be "implied by candidate keys."
Relation REL12 is not in 5NF, it satisfies a certain join dependency, namely Constraint 3D, that is
93
certainly not implied by its sole candidate key (that key being the combination of all of its attributes).
Stated differently, relation REL12 is not in 5NF, because (a) it can be 3-decomposed and (b) that 3decomposability is not implied by the fact that the combination {S#, P#, J#} is a candidate key. By
contrast, after 3-decomposition, the three projections SP, PJ, and JS are each in 5NF, since they do
not involve any (nontrivial) JDs at all.
Now let us understand through an example, what it means for a JD to be implied by candidate keys.
Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and
{SUPPLIERNAME}. Then that relation satisfies several join dependencies - for example, it satisfies the
JD
*{ { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }
That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME, SUPPLYSTATUS}
and {S#, SUPPLYCITY), and hence can be nonloss-decomposed into those projections. (This fact does
not mean that it should be so decomposed, of course, only that it could be.) This JD is implied by the
fact that {S#} is a candidate key (in fact it is implied by Heath's theorem) Likewise, relation REL1 also
satisfies the JD
* {{S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY}}
This JD is implied by the fact that {S#} and { SUPPLYNAME} are both candidate keys.
To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with
respect to projection and join (which accounts for its alternative name, projection-join normal form).
That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by taking
projections. For a relation is in 5NF the only join dependencies are those that are implied by
candidate keys, and so the only valid decompositions are ones that are based on those candidate
keys. (Each projection in such a decomposition will consist of one or more of those candidate keys,
plus zero or more additional attributes.) For example, the suppliers relation REL15 is in 5NF. It can
be further decomposed in several nonloss ways, as we saw earlier, but every projection in any such
decomposition will still include one of the original candidate keys, and hence there does not seem to
be any particular advantage in that further reduction.
6.3 Self Test
Exercise
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
1
True-False Questions:
Normalisation is the process of splitting a relation into two or more relations. (T)
A functional dependency is a relationship between or among attributes. (T)
The known or given attribute is called the determinant in a functional dependency. (T)
The relationship in a functional dependency is one-to-one (1:1). (F)
A key is a group of one or more attributes that uniquely identifies a row. (T)
The selection of the attributes to use for the key is determined by the database
programmers. (F)
A deletion anomaly occurs when deleting one entity results in deleting facts about another
entity. (T)
An insertion anomaly occurs when we cannot insert some data into the database without
inserting another entity first. (T)
Modification anomalies do not occur in tables that meet the definition of a relation. (F)
A table of data that meets the minimum definition of a relation is automatically in first
94
11.
12.
13.
14.
15.
normal form. (T)
A relation is in first normal form if all of its non-key attributes are dependent on part of the
key. (F)
A relation is in second normal form if all of its non-key attributes are dependent on all of
the key. (T)
A relation can be in third normal form without being in second normal form. (F)
Fifth normal form is the highest normal form. (F)
A relation in domain/key normal form is guaranteed to have no modification anomalies.
(T)
Exercise 2
1.
2.
Multiple choice questions
A relation is analogous to a: (a)
a)
file
b)
field
c)
record
d)
row
e)
column
In a functional dependency, the determinant: (d)
a)
will be paired with one value of the dependent attribute
b)
may be paired with one or more values of the dependent attribute
c)
may consist of more than one attribute
d)
a and c
e)
b and c
6.4 Summary
•
Describes the relationship between attributes in a relation. For Example if A and B are attributes
of a relation R, B is functionally dependent and A (denoted Aà B), if each value of A is associated
with exactly one value of B. (A and B may each consist of one or more attributes)
•
The concept of database normalization is not unique to any particular Relational Database
Management System. It can be applied to any of several implications of relational databases
including Microsoft Access, dBase, Oracle, etc.
•
Normalization may have the effect of duplicating data within the database and often results in
the creation of additional tables. (While normalization tends to increase the duplication of data, it
does not introduce redundancy, which is unnecessary duplication.) Normalization is typically a
refinement process after the initial exercise of identifying the data objects that should be in the
database, identifying their relationships, and defining the tables required and the columns within
each table.
95
UNIT 7
STRUCTURED QUERY LANGUAGE
7.1
7.2
7.3
7.4
7.5
7.6
7.7
7.8
7.9
7.10
Introduction of sql
Ddl statements
Dml statements
View definitions
Constraints and triggers
Keys and foreign keys
Constraints on attributes and tuples
Modification of constraints
Cursors
Dynamic sql
7.1 Introduction Of SQL
At the heart of every DBMS is a language that is similar to a programming language, but different
in that it is designed specifically for communicating with a database. One powerful language is
SQL. IBM developed SQL in the late 1970s and early 1980s as a way to standardize query
language across the many mainframe and microcomputer platforms that company produced.
SQL differs significantly from programming languages. Most programming languages are still
procedural. Procedural language consists of commands that tell the computer what to do -instruction by instruction, step by step. SQL is not a programming language itself, it is a data
access language. SQL may be embedded in traditional procedural programming languages (like
COBOL). SQL statement is not really command to the computer. Rather, it is a description of
some of the data contained in a database. SQL is nonprocedural because it does not give step-bystep commands to the computer or database. SQL describes data, and instructs the database to
do something with the data.
For example
SELECT [Name], [Company_Name]
FROM Contacts
WHERE ((City = "Kansas City", and ([Name] ="R.."))
7.2
DDL Statements
Data Definition Language is a set of SQL commands used to create, modify and delete database
structures (not data). These commands wouldn't normally be used by a general user, who should be
accessing the database via an application. They are normally used by the DBA (to a limited extent), a
database designer or application developer. These statements are immediate, they are not susceptible
to ROLLBACK commands. You should also note that if you have executed several DML updates then
96
issuing any DDL command will COMMIT all the updates as every DDL command implicitly issues a
COMMIT command to the database. Anybody using DDL must have the CREATE object privilege and
a Tablespace area in which to create objects.
In an Oracle database objects can be created at any time, whether users are on-line or not. Table
space need not be specified as Oracle will pick up the user defaults (defined by the DBA) or the
system defaults. Tables will expand automatically to fill disk partitions (provided this has been set up
in advance by the DBA). Table structures may be modified on-line although this can have dire effects
on an application so be careful.
Creating our two example tables
CREATE TABLE BOOK (
ISBN NUMBER(10),
TITLE VARCHAR2(200),
AUTHOR VARCHAR2(50),
COST NUMBER(8,2),
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3)
)
CREATE TABLE SECTION (
SECTION_ID NUMBER(3),
SECTION_NAME CHAR(30),
BOOK_COUNT NUMBER(6)
)
The two commands above create our two sample tables and demonstrate the basic table creation
command. The CREATE keyword is followed by the type of object that we want created (TABLE,
VIEW, INDEX etc.), and that is followed by the name we want the object to be known by. Between the
outer brackets lie the parameters for the creation, in this case the names, datatypes and sizes of each
field.
A NUMBER is a numeric field, the size is not the maximum externally displayed number but the size
of the internal binary field set aside for the field (10 can hold a very large number). A number size
split with a comma denotes the field size followed by the number of digits following the decimal point
(in this case a currency field has two significant digits)
A VARCHAR2 is a variable length string field from 0-n where n is the specified size. Oracle only takes
up the space required to hold any value in the field, it doesn't allocate the entire storage space unless
required to by a maximum sized field value (Max size 2000).
A CHAR is a fixed length string field (Max size 255).
A DATE is an internal date/time field (normally 7 bytes long).
97
A LONG or LONG RAW field (not shown) is used to hold large binary objects (Word documents, AVI
files etc.). No size is specified for these field types. (Max size 2Gb).
Creating our two example tables with constraints
Constraints are used to enforce table rules and prevent data dependent deletion (enforce database
integrity). You may also use them to enforce business rules (with some imagination).
Our two example tables do have some rules which need enforcing, specifically both tables need to
have a prime key (so that the database doesn't allow replication of data). And the Section ID needs to
be linked to each book to identify which library section it belongs to (the foreign key). We also want to
specify which columns must be filled in and possibly some default values for other columns.
Constraints can be at the column or table level.
Constraint
Description
NULL / NOT NOT NULL specifies that a column must have some value. NULL (default) allows NULL
NULL
values in the column.
DEFAULT
Specifies some default value if no value entered by user.
UNIQUE
Specifies that column(s) must have unique values
PRIMARY
KEY
Specifies that column(s) are the table prime key and must have unique values. Index
is automatically generated for column.
FOREIGN
KEY
Specifies that column(s) are a table foreign key and will use referential uniqueness of
parent table. Index is automatically generated for column. Foreign keys allow deletion
cascades and table / business rule validation.
CHECK
Applies a condition to an input column value.
DISABLE
You may suffix DISABLE to any other constraint to make Oracle ignore the
constraint, the constraint will still be available to applications / tools and you can
enable the constraint later if required.
CREATE TABLE SECTION (
SECTION_ID NUMBER(3) CONSTRAINT S_ID CHECK (SECTION_ID > 0),
SECTION_NAME CHAR(30) CONSTRAINT S_NAME NOT NULL,
BOOK_COUNT NUMBER(6),
CONSTRAINT SECT_PRIME PRIMARY KEY (SECTION_ID))
CREATE TABLE BOOK (
ISBN NUMBER(10) CONSTRAINT B_ISBN CHECK (ISBN BETWEEN 1 AND 2000),
TITLE VARCHAR2(200) CONSTRAINT B_TITLE NOT NULL,
AUTHOR VARCHAR2(50) CONSTRAINT B_AUTH NOT NULL,
COST NUMBER(8,2) DEFAULT 0.00 DISABLE,
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3),
CONSTRAINT BOOK_PRIME PRIMARY KEY (ISBN),
98
CONSTRAINT BOOK_SECT FOREIGN KEY (SECTION_ID) REFERENCES SECTION(SECTION_ID))
We have now created our tables with constraints. Column level constraints go directly after the
column definition to which they refer, table level constraints go after the last column definition. Table
level constraints are normally used (and must be used) for compound (multi column) foreign and
prime key definitions, the example table level constraints could have been placed as column
definitions if that was your preference (there would have been no difference to their function). The
CONSTRAINT keyword is followed by a unique constraint name and then the constraint definition.
The constraint name is used to manipulate the constraint once the table has been created, you may
omit the CONSTRAINT keyword and constraint name if you wish but you will then have no easy way
of enabling / disabling the constraint without deleting the table and rebuilding it, Oracle does give
default names to constraints not explicitly name - you can check these by selecting from the
USER_CONSTRAINTS data dictionary view. Note that the CHECK constraint implements any clause
that would be valid in a SELECT WHERE clause (enclosed in brackets), any value inbound to this
column would be validated before the table is updated and accepted / rejected via the CHECK clause.
Note that the order that the tables are created in has changed, this is because we now reference the
SECTION table from the BOOK table. The SECTION table must exist before we create the BOOK table
else we will receive an error when we try to create the BOOK table. The foreign key constraint cross
references the field SECTION_ID in the BOOK table to the field (and primary key) SECTION_ID in the
SECTION table (REFERENCES keyword).
If we wish we can introduce cascading validation and some constraint violation logging to our tables.
CREATE TABLE AUDIT (
ROWID ROWID,
OWNER VARCHAR2,
TABLE_NAME VARCHAR2,
CONSTRAINT VARCHAR2)
CREATE TABLE SECTION (
SECTION_ID NUMBER(3) CONSTRAINT S_ID CHECK (SECTION_ID > 0),
SECTION_NAME CHAR(30) CONSTRAINT S_NAME NOT NULL,
BOOK_COUNT NUMBER(6),
CONSTRAINT SECT_PRIME PRIMARY KEY (SECTION_ID), EXCEPTIONS INTO AUDIT)
CREATE TABLE BOOK (
ISBN NUMBER(10) CONSTRAINT B_ISBN CHECK (ISBN BETWEEN 1 AND 2000),
TITLE VARCHAR2(200) CONSTRAINT B_TITLE NOT NULL,
AUTHOR VARCHAR2(50) CONSTRAINT B_AUTH NOT NULL,
COST NUMBER(8,2) DEFAULT 0.00 DISABLE,
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3),
CONSTRAINT BOOK_PRIME PRIMARY KEY (ISBN),
CONSTRAINT BOOK_SECT FOREIGN KEY (SECTION_ID) REFERENCES SECTION(SECTION_ID) ON
DELETE CASCADE)
99
Oracle (and any other decent RDBMS) would not allow us to delete a section which had books
assigned to it as this breaks integrity rules. If we wanted to get rid of all the book records assigned to
a particular section when that section was deleted we could implement a DELETE CASCADE. The
delete cascade operates across a foreign key link and removes all child records associated with a
parent record (we would probably want to reassign the books rather than delete them in the real
world).
To log constraint violations I have created a new table (AUDIT) and stated that all exceptions on the
SECTION table should be logged in this table, you can then view the contents of this table with
standard SELECT statements. The AUDIT table must have the shown structure but can be called
anything.
It is possible to record a description or comment against a newly created or existing table or
individual column by using the COMMENT command. The comment command writes your table /
column description into the data dictionary. You can query column comments by selecting against
dictionary views ALL_COL_COMMENTS and USER_COL_COMMENTS.
You can query table comments by selecting against dictionary views ALL_TAB_COMMENTS and
USER_TAB_COMMENTS. Comments can be up to 255 characters long.
Altering tables and constraints
Modification of database object structure is executed with the ALTER statement.
You can modify a constraint as follows :Add new constraint to column or table.
Remove constraint.
Enable / disable constraint.
You cannot change a constraint definition.
You can modify a table as follows :Add new columns.
Modify existing columns.
You cannot delete an existing column.
An example of adding a column to a table is given below :ALTER TABLE JD11.BOOK ADD (REVIEW VARCHAR2(200))
This statement adds a new column (REVIEW) to our book table, to enable library members to browse
the database and read short reviews of the books.
If we want to add a constraint to our new column we can use the following ALTER statement :ALTER TABLE JD11.BOOK MODIFY(REVIEW NOT NULL)
Note that we can't specify a constraint name with the above statement. If we wanted to further modify
a constraint (other than enable / disable) we would have to drop the constraint and then re apply it
specifying any changes.
Assuming that we decide that 200 bytes is insufficient for our review field we might then want to
increase its size. The statement below demonstrates this :ALTER TABLE JD11.BOOK MODIFY (REVIEW VARCHAR2(400))
We could not decrease the size of the column if the REVIEW column contained any data.
100
ALTER TABLE JD11.BOOK DISABLE CONSTRAINT B_AUTH
ALTER TABLE JD11.BOOK ENABLE CONSTRAINT B_AUTH
The above statements demonstrate disabling and enabling a constraint, note that if, between
disabling a constraint and re enabling it, data was entered to the table that included NULL values in
the AUTHOR column, then you wouldn't be able to re enable the constraint. This is because the
existing data would break the constraint integrity. You could update the column to replace NULL
values with some default and then re enable the constraint.
Dropping (deleting) tables and constraints
To drop a constraint from a table we use the ALTER statement with a DROP clause. Some examples
follow :ALTER TABLE JD11.BOOK DROP CONSTRAINT B_AUTH
The above statement will remove the not null constraint (defined at table creation) from the AUTHOR
column. The value following the CONSTRAINT keyword is the name of constraint.
ALTER TABLE JD11.BOOK DROP PRIMARY KEY
The above statement drops the primary key constraint on the BOOK table.
ALTER TABLE JD11.SECTION DROP PRIMARY KEY CASCADE
The above statement drops the primary key on the SECTION table. The CASCADE option drops
foreign key constraint on the BOOK table at the same time.
Use the DROP command to delete database structures like tables. Dropping a table removes
structure, data, privileges, views and synonyms associated with the table (you cannot rollback
DROP so be careful). You can specify a CASCADE option to ensure that constraints refering to
dropped table within other tables (foreign keys) are also removed by the DROP.
the
the
the
the
DROP TABLE SECTION
The above statement drops the table SECTION but leaves the foreign key reference within the BOOK
table.
DROP TABLE SECTION CASCADE CONSTRAINTS
7.3
DML Statements
Data manipulation language is the area of SQL that allows you to change data within the database. It
consists of only three command statement groups, they are INSERT, UPDATE and DELETE.
Inserting new rows into a table
We insert new rows into a table with the INSERT INTO command. A simple example is given below.
INSERT INTO JD11.SECTION VALUES (SECIDNUM.NEXTVAL, 'Computing', 0)
The INSERT INTO command is followed by the name of the table (and owning schema if required),
this in turn is followed by the VALUES keyword which denotes the start of the value list. The value
list comprises all the values to insert into the specified columns. We have not specified the columns
we want to insert into in this example so we must provide a value for each and every column in the
correct order. The correct order of values can be determined by doing a SELECT * or DESCRIBE
101
against the required table, the order that the columns are displayed is the order of the values that
you specify in the value list. If we want to specify columns individually (when not filling all values in a
new row) we can do this with a column list specified before the VALUES keyword. Our example is
reworked below, note that we can specify the columns in any order - our values are now in the order
that we specified for the column list.
INSERT INTO JD11.SECTION
SECIDNUM.NEXTVAL)
(SECTION_NAME,
SECTION_ID)
VALUES
('Computing',
In the above example we haven't specified the BOOK_COUNT column so we don't provide a value for
it, this column will be set to NULL which is acceptable since we don't have any constraint on the
column that would prevent our new row from being inserted.
The SQL required to generate the data in the two test tables is given below.
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Fiction', 10);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Romance', 5);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Science Fiction', 6);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Science', 7);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Reference', 9);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Law', 11);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(21, 'HELP', 'B.Baker', 20.90, '20-AUG-97', NULL, 10, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(87, 'Killer Bees', 'E.F.Hammond', 29.90, NULL, NULL, NULL, 9);
102
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(90, 'Up the creek', 'K.Klydsy', 15.95, '15-JAN-97', '21-JAN-97', 1, 10);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(22, 'Seven seas', 'J.J.Jacobs', 16.00, '21-DEC-97', NULL, 19, 5);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(91, 'Dirty steam trains', 'J.SP.Smith', 8.25, '14-JAN-98', NULL, 98, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(101, 'The story of trent', 'T.Wilbury', 17.89, '10-JAN-98', '16-JAN-98', 12, 6);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(8, 'Over the past again', 'K.Jenkins', 19.87, NULL, NULL, NULL, 10);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(79, 'Courses for horses', 'H.Harriot', 10.34, '17-JAN-98', NULL, 12, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT,
VALUES
(989, 'Leaning on a tree', 'M.Kilner', 19.41, '12-NOV-97', '22-NOV-97', 56, 11);
SECTION_ID)
SECTION_ID)
SECTION_ID)
SECTION_ID)
SECTION_ID)
SECTION_ID)
SECTION_ID)
Changing row values with UPDATE
The UPDATE command allows you to change the values of rows in a table, you can include a WHERE
clause in the same fashion as the SELECT statement to indicate which row(s) you want values
changed in. In much the same way as the INSERT statement you specify the columns you want to
update and the new values for those specified columns. The combination of WHERE clause (row
selection) and column specification (column selection) allows you to pinpoint exactly the value(s) you
want changed. Unlike the INSERT command the UPDATE command can change multiple rows so you
should take care that you are updating only the values you want changed (see the transactions
discussion for methods of limiting damage from accidental updates).
An example is given below, this example will update a single row in our BOOK table :UPDATE JD11.BOOK SET TITLE = 'Leaning on a wall', AUTHOR = 'J.Killner', TIMES_LENT = 0,
LENT_DATE = NULL, RETURNED_DATE = NULL WHERE ISBN = 989
We specify the table to be updated after the UPDATE keyword. Following the SET keyword we specify
a comma delimited list of column names / new values, each column to be updated must be specified
here (note that you can set columns to NULL by using the NULL keyword instead of a new value). The
WHERE clause follows the last column / new value specification and is constructed in the same way
103
as for the SELECT statement, use the WHERE clause to pinpoint which rows to be updated. If you
don't specify a WHERE clause on an UPDATE command all rows will be updated (this may or may not
be the desired result).
Deleting rows with DELETE
The DELETE command allows you to remove rows from a table, you can include a WHERE clause in
the same fashion as the SELECT statement to indicate which row(s) you want deleted - in nearly all
cases you should specify a WHERE clause, running a DELETE without a WHERE clause deletes ALL
rows from the table. Unlike the INSERT command the DELETE command can change multiple rows
so you should take great care that you are deleting only the rows you want removed (see the
transactions discussion for methods of limiting damage from accidental deletions).
An example is given below, this example will delete a single row in our BOOK table :DELETE FROM JD11.BOOK WHERE ISBN = 989
The DELETE FROM command is followed by the name of the table from which a row will be deleted,
followed by a WHERE clause specifying the column / condition values for the deletion.
DELETE FROM JD11.BOOK WHERE ISBN <> 989
This delete removes all records from the BOOK table except the one specified. Remember that if you
omit the WHERE clause all rows will be deleted.
7.4 View Definitions
DBMaker provides several convenient methods of customizing and speeding up access to your data.
Views and synonyms are supported to allow user-defined views and names for database objects.
Indexes provide a much faster method of retrieving data from a table when you use a column with an
index in a query.
Managing Views
DBMaker provides the ability to define a virtual table, called a view, which is based on existing tables
and is stored in the database as a definition and a user-defined view name. The view definition is
stored persistently in the database, but the actual data that you will see in the view is not physically
stored anywhere. Rather, the data is stored in the base tables from which the view's rows are derived.
A view is defined by a query which references one or more tables (or other views).
Views are a very helpful mechanism for using a database. For example, you can define complex
queries once and use them repeatedly without having to re-invent them over and over. Furthermore,
views can be used to enhance the security of your database by restricting access to a predetermined
set of rows and/or columns of a table.
Since views are derived from querying tables, you can not determine the rows of the tables to update.
Due to this limitation views can only be queried. Users can not update, insert into, or delete from
views.
Creating Views
Each view is defined by a name together with a query that references tables or other views. You can
specify a list of column names for the view different from those in the original table when creating a
104
view. If you do not specify any new column names, the view will use the column names from the
underlying tables.
For example, if you want users to see only three columns of the table Employees, you can create a
view with the SQL command shown below. Users can then view only the FirstName, LastName and
Telephone columns of the table Employees through the view empView.
dmSQL> create view empView (FirstName, LastName, Telephone) as
select FirstName, LastName, Phone from Employees;
The query that defines a view cannot contain the ORDER BY clause or UNION operator.
Dropping Views
You can drop a view when it is no longer required. When you drop a view, only the definition stored in
system catalog is removed. There is no effect on the base tables that the view was derived from. To
drop a view, execute the following command:
dmSQL> DROP VIEW empView;
Managing Synonyms
A synonym is an alias, or alternate name, for any table or view. Since a synonym is simply an alias, it
requires no storage other than its definition in the system catalog.
Synonyms are useful for simplifying a fully qualified table or view name. DBMaker normally identifies
tables and views with fully qualified names that are composites of the owner and object names. By
using a synonym anyone can access a table or view through the corresponding synonym without
having to use the fully qualified name. Because a synonym has no owner name, all synonyms in the
database must be unique so DBMaker can identify them.
Creating Synonyms
You can create a synonym with the following SQL command:
dmSQL> create synonym Employees for Employees;
If the owner of the table Employees is the SYSADM, this command creates an alias named Employees
for the table SYSADM.Employees. All database users can directly reference the table
SYSADM.Employees through the synonym Employees.
Dropping Synonyms
You can drop a synonym that is no longer required. When you drop a synonym, only its definition is
removed from the system catalog.
The following SQL command drops the synonym Employees:
dmSQL> drop synonym Employees;
Managing Indexes
An index provides support for fast random access to a row. You can build indexes on a table to speed
up searching. For example, when you execute the query SELECT NAME FROM EMPLOYEES WHERE
NUMBER = 10005, it is possible to retrieve the data in a much shorter time if there is an index
created on the NUMBER column.
An index can be composed of more than one column, up to a maximum of 16 columns. Although a
table can have up to 252 columns, only the first 127 columns can be used in an index.
105
An index can be unique or non-unique. In a unique index, no more than one row can have the same
key value, with the exception that any number of rows may have NULL values. If you create a unique
index on a non-empty table, DBMaker will check whether all existing keys are distinct or not. If there
are duplicate keys, DBMaker will return an error message. After creating a unique index on a table,
you can insert a row in this table and DBMaker will certify that there is no existing row that already
has the same key as the new row.
When creating an index, you can specify the sort order of each index column as ascending or
descending. For example, suppose there are five keys in a table with the values 1, 3, 9, 2, and 6. In
ascending order the sequence of keys in the index is 1, 2, 3, 6, and 9, and in descending order the
sequence of keys in the index is 9, 6, 3, 2, and 1.
When you implement a query, the index order will occasionally affect the order of the data output.
For example, if you have a table name friends with NAME and AGE columns, the output will appear
as below when you execute the query SELECT NAME, AGE FROM FRIEND_TABLE WHERE AGE > 20
using a descending index on the AGE column.
name
age
---------------- ---------------Jeff
49
Kevin
40
Jerry
38
Hughes
30
Cathy
22
As for tables, when you create an index you can specify the fillfactor for it. The fill factor denotes how
dense the keys will be in the index pages. The legal fill factor values are in the range from 1% to
100%, and the default is 100%. If you often update data after creating the index, you can set a loose
fill factor in the index, for example 60%. If you never update the data in this table, you can leave the
fill factor at the default value of 100%.
Creating Indexes
To create an index on a table, you must specify the index name and index columns. You can specify
the sort order of each column as ascending (ASC) or descending (DESC). The default sort order is
ascending.
For example, the following SQL command creates an index IDX1 on the column NUMBER of table
EMPLOYEES in descending order.
dmSQL> create index idx1 on Employees (Number desc);
Also, if you want to create a unique index you have to explicitly specify it. Otherwise DBMaker
implicitly creates non-unique indexes. The following example shows you how to create a unique index
idx1 on the column Number of the table Employees:
dmSQL> create unique index idx1 on Employees (Number);
The next example shows you how to create an index with a specified fill factor:
dmSQL> create index idx2 on Employees(Number, LastName DESC) fillfactor 60;
Dropping Indexes
106
You can drop indexes using the DROP INDEX statement. In general, you might need to drop an index
if it becomes fragmented, which reduces its efficiency. Rebuilding the index will create a denser,
unfragmented index.
If the index is a primary key and is referred to by other tables, it cannot be dropped.
The following SQL command drops the index idx1 from the table Employees.
dmSQL> drop index idx1 from Employees;
7.5 Constraints and Triggers
Constraints are declaractions of conditions about the database that must remain true.
Triggers are a special PL/SQL construct similar to procedures. However, a procedure is executed
explicitly from another block via a procedure call, while a trigger is executed implicitly whenever the
triggering event happens. The triggering event is either a INSERT, DELETE, or UPDATE command.
The timing can be either BEFORE or AFTER. The trigger can be either row-level or statement-level,
where the former fires once for each row affected by the triggering statement and the latter fires once
for the whole statement
Constraints are declaractions of conditions about the database that must remain true. These include
attributed-based, tuple-based, key, and referential integrity constraints. The system checks for the
violation of the constraints on actions that may cause a violation, and aborts the action accordingly.
Information on SQL constraints can be found in the textbook. The Oracle implementation of
constraints differs from the SQL standard, as documented in Oracle 9i SQL versus Standard SQL.
Triggers are a special PL/SQL construct similar to procedures. However, a procedure is executed
explicitly from another block via a procedure call, while a trigger is executed implicitly whenever the
triggering event happens. The triggering event is either a INSERT, DELETE, or UPDATE command.
The timing can be either BEFORE or AFTER. The trigger can be either row-level or statement-level,
where the former fires once for each row affected by the triggering statement and the latter fires once
for the whole statement.
Deferring Constraint Checking
Sometimes it is necessary to defer the checking of certain constraints, most commonly in the
"chicken-and-egg" problem. Suppose we want to say:
CREATE TABLE chicken (cID INT PRIMARY KEY,
eID INT REFERENCES egg(eID));
CREATE TABLE egg(eID INT PRIMARY KEY,
cID INT REFERENCES chicken(cID));
But if we simply type the above statements into Oracle, we'll get an error. The reason is that the
CREATE TABLE statement for chicken refers to table egg, which hasn't been created yet! Creating egg
won't help either, because egg refers to chicken.
To work around this problem, we need SQL schema modification commands. First, create chicken
and egg without foreign key declarations:
CREATE TABLE chicken(cID INT PRIMARY KEY,
eID INT);
CREATE TABLE egg(eID INT PRIMARY KEY,
107
cID INT);
Then, we add foreign key constraints:
ALTER TABLE chicken ADD CONSTRAINT chickenREFegg
FOREIGN KEY (eID) REFERENCES egg(eID)
INITIALLY DEFERRED DEFERRABLE;
ALTER TABLE egg ADD CONSTRAINT eggREFchicken
FOREIGN KEY (cID) REFERENCES chicken(cID)
INITIALLY DEFERRED DEFERRABLE;
INITIALLY DEFERRED DEFERRABLE tells Oracle to do deferred constraint checking. For example, to
insert (1, 2) into chicken and (2, 1) into egg, we use:
INSERT INTO chicken VALUES(1, 2);
INSERT INTO egg VALUES(2, 1);
COMMIT;
Because we've declared the foreign key constraints as "deferred", they are only checked at the commit
point. (Without deferred constraint checking, we cannot insert anything into chicken and egg,
because the first INSERT would always be a constraint violation.)
Finally, to get rid of the tables, we have to drop the constraints first, because Oracle won't allow us to
drop a table that's referenced by another table.
ALTER TABLE egg DROP CONSTRAINT eggREFchicken;
ALTER TABLE chicken DROP CONSTRAINT chickenREFegg;
DROP TABLE egg;
DROP TABLE chicken;
Basic Trigger Syntax
Below is the syntax for creating a trigger in Oracle (which differs slightly from standard SQL syntax):
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE} ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
Some important points to note:
•
You can create only BEFORE and AFTER triggers for tables. (INSTEAD OF triggers are only
available for views; typically they are used to implement view updates.)
•
You may specify up to three triggering events using the keyword OR. Furthermore, UPDATE
can be optionally followed by the keyword OF and a list of attribute(s) in <table_name>. If
present, the OF clause defines the event to be only an update of the attribute(s) listed after
OF. Here are some examples:
•
... INSERT ON R ...
•
•
... INSERT OR DELETE OR UPDATE ON R ...
•
108
... UPDATE OF A, B OR INSERT ON R ...
•
If FOR EACH ROW option is specified, the trigger is row-level; otherwise, the trigger is
statement-level.
•
Only for row-level triggers:
o The special variables NEW and OLD are available to refer to new and old tuples
respectively. Note: In the trigger body, NEW and OLD must be preceded by a colon
(":"), but in the WHEN clause, they do not have a preceding colon! See example below.
o The REFERENCING clause can be used to assign aliases to the variables NEW and
OLD.
o A trigger restriction can be specified in the WHEN clause, enclosed by parentheses.
The trigger restriction is a SQL condition that must be satisfied in order for Oracle to
fire the trigger. This condition cannot contain subqueries. Without the WHEN clause,
the trigger is fired for each row.
•
<trigger_body> is a PL/SQL block, rather than sequence of SQL statements. Oracle has placed
certain restrictions on what you can do in <trigger_body>, in order to avoid situations where
one trigger performs an action that triggers a second trigger, which then triggers a third, and
so on, which could potentially create an infinite loop. The restrictions on <trigger_body>
include:
o You cannot modify the same relation whose modification is the event triggering the
trigger.
o You cannot modify a relation connected to the triggering relation by another constraint
such as a foreign-key constraint.
Trigger Example
We illustrate Oracle's syntax for creating a trigger through an example based on the following two
tables:
CREATE TABLE T4 (a INTEGER, b CHAR(10));
CREATE TABLE T5 (c CHAR(10), d INTEGER);
We create a trigger that may insert a tuple into T5 when a tuple is inserted into T4. Specifically, the
trigger checks whether the new tuple has a first component 10 or less, and if so inserts the reverse
tuple into T5:
CREATE TRIGGER trig1
AFTER INSERT ON T4
REFERENCING NEW AS newRow
FOR EACH ROW
WHEN (newRow.a <= 10)
BEGIN
INSERT INTO T5 VALUES(:newRow.b, :newRow.a);
END trig1;
.
run;
Notice that we end the CREATE TRIGGER statement with a dot and run, as for all PL/SQL
statements in general. Running the CREATE TRIGGER statement only creates the trigger; it does not
execute the trigger. Only a triggering event, such as an insertion into T4 in this example, causes the
trigger to execute.
109
Displaying Trigger Definition Errors
As for PL/SQL procedures, if you get a message
Warning: Trigger created with compilation errors.
you can see the error messages by typing
show errors trigger <trigger_name>;
Alternatively, you can type, SHO ERR (short for SHOW ERRORS) to see the most recent compilation
error. Note that the reported line numbers where the errors occur are not accurate.
Viewing Defined Triggers
To view a list of all defined triggers, use:
select trigger_name from user_triggers;
For more details on a particular trigger:
select trigger_type, triggering_event, table_name, referencing_names, trigger_body
from user_triggers
where trigger_name = '<trigger_name>';
Dropping Triggers
To drop a trigger:
drop trigger <trigger_name>;
Disabling Triggers
To disable or enable a trigger:
alter trigger <trigger_name> {disable|enable};
Aborting Triggers with Error
Triggers can often be used to enforce contraints. The WHEN clause or body of the trigger can check
for the violation of certain conditions and signal an error accordingly using the Oracle built-in
function RAISE_APPLICATION_ERROR. The action that activated the trigger (insert, update, or delete)
would be aborted. For example, the following trigger enforces the constraint Person.age >= 0:
create table Person (age int);
CREATE TRIGGER PersonCheckAge
AFTER INSERT OR UPDATE OF age ON Person
FOR EACH ROW
BEGIN
IF (:new.age < 0) THEN
RAISE_APPLICATION_ERROR(-20000, 'no negative age allowed');
END IF;
END;
.
RUN;
If we attempted to execute the insertion:
insert into Person values (-3);
we would get the error message:
ERROR at line 1:
ORA-20000: no negative age allowed
ORA-06512: at "MYNAME.PERSONCHECKAGE", line 3
ORA-04088: error during execution of trigger 'MYNAME.PERSONCHECKAGE'
and nothing would be inserted. In general, the effects of both the trigger and the triggering statement
are rolled back.
110
7.6 Keys and Foreign Keys
The word "key" is much used and abused in the context of relational database design. In prerelational databases (hierarchtical, networked) and file systems (ISAM, VSAM, et al) "key" often
referred to the specific structure and components of a linked list, chain of pointers, or other physical
locator outside of the data. It is thus natural, but unfortunate, that today people often associate "key"
with a RDBMS "index". We will explain what a key is and how it differs from an index.
According to Codd, Date, and all other experts, a key has only one meaning in relational theory: it is
a set of one or more columns whose combined values are unique among all occurrences in a given
table. A key is the relational means of specifying uniqueness.
Why Keys Are Important
Keys are crucial to a table structure for the following reasons:
They ensure that each record in a table is precisely identified. As you already know, a table represents
a singular collection of similar objects or events. (For example, a CLASSES table represents a
collection of classes, not just a single class.) The complete set of records within the table constitutes
the collection, and each record represents a unique instance of the table's subject within that
collection. You must have some means of accurately identifying each instance, and a key is the device
that allows you to do so.
They help establish and enforce various types of integrity. Keys are a major component of table-level
integrity and relationship-level integrity. For instance, they enable you to ensure that a table has
unique records and that the fields you use to establish a relationship between a pair of tables always
contain matching values.
Candidate Key
As stated above, a candidate key is any set of one or more columns whose combined values are
unique among all occurrences (i.e., tuples or rows). Since a null value is not guaranteed to be unique,
no component of a candidate key is allowed to be null.
There can be any number of candidate keys in a table. Relational pundits are not in agreement
whether zero candidate keys is acceptable, since that would contradict the (debatable) requirement
that there must be a primary key.
Primary Key
The primary key of any table is any candidate key of that table which the database designer
arbitrarily designates as "primary". The primary key may be selected for convenience, comprehension,
performance, or any other reasons. It is entirely proper (albeit often inconvenient) to change the
selection of primary key to another candidate key.
Alternate Key
The alternate keys of any table are simply those candidate keys which are not currently selected as
the primary key. According to {Date95} (page 115), "... exactly one of those candidate keys [is] chosen
as the primary key [and] the remainder, if any, are then called alternate keys." An alternate key is a
function of all candidate keys minus the primary key.
7.7 Constraints on Attributes and Tuples
Not Null Constraints
111
presC# INT REFERENCES MovieExec(cert#) NOT NULL
Attribute-Based CHECK Constraints
presC# INT REFERENCES MovieExec(cert#)
CHECK (presC# >= 100000)
gender CHAR(1) CHECK (gender IN (‘F’, ‘M’)),
presC# INT CHECK (presC# IN (SELECT cert# FROM MovieExec))
Tuple-Based CHECK Constraints
CREATE TABLE MovieStar(name CHr(30) PRIMARY KEY,address VARCHAR(255),
gender CHAR(1),birthdate DATE,CHECK(gender = ‘F’ OR name NOT ‘Ms.%’));
7.8 Modification of Constraints
Constraints can be considered as part of the corresponding ER models; constraint definitions are
stored in meta data tables and separated from stored procedures (in fact, the SQL Server stores the
Transact-SQL creation script in the syscomments table for each view, rule, default, trigger, CHECK
constraint, DEFAULT constraint, and stored procedure); for instance, the CHECK column constraint
on column f1 will be stored in syscomments.text field as a SQL statement: ([f1] > 1) ; constraints
implementation can be modified independently from stored procedures implementation and, by
providing a proper design, modification of constraints does not affect implementation of stored
procedures (or related Transact-SQL scripts).
Moreover, our ER model and corresponding constraints can be mapped to any other RDBMS that
supports a similar metadada format (which is, basically, true for most of the database
7.9 Cursors
Cursor is a bit image on the screen that indicates either the movement of a pointing device or the
place where text will next appear. Xlib enables clients to associate a cursor with each window they
create. After making the association between cursor and window, the cursor is visible whenever it is
in the window. If the cursor indicates movement of a pointing device, the movement of the cursor in
the window automatically reflects the movement of the device.
Xlib and VMS DECwindows provide fonts of predefined cursors. Clients that want to create their own
cursors can either define a font of shapes and masks or create cursors using pixmaps.
This section describes the following:
•
Creating cursors using the Xlib cursor font, a font of shapes and masks, and pixmaps
•
Associating cursors with windows
•
Managing cursors
•
Freeing memory allocated to cursors when clients no longer need them
Create CURSOR
Xlib enables clients to use predefined cursors or to create their own cursors. To create a predefined
112
Xlib cursor, use the CREATE FONT CURSOR routine. Xlib cursors are predefined in
ECW$INCLUDE:CURSORFONT.H. See the X and Motif Quick Reference Guide for a list of the
constants that refer to the predefined Xlib cursors.
The following example creates a sailboat cursor, one of the predefined Xlib cursors, and associates
the cursor with a window:
Cursor fontcursor;
.
.
.
fontcursor = XCreateFontCursor(dpy, XC_sailboat);
XDefineCursor(dpy, win, fontcursor);
The DEFINE CURSOR routine makes the sailboat cursor automatically visible when the pointer is in
window win.
To create client-defined cursors, either create a font of cursor shapes or define cursors using
pixmaps. In each case, the cursor consists of the following components:
•
Shape---Defines the cursor as it appears without modification in a window
•
Mask---Acts as a clip mask to define how the cursor actually appears in a window
•
Background color---Specifies RGB values used for the cursor background
•
Foreground color---Specifies RGB values used for the cursor foreground
•
Hotspot---Defines the position on the cursor that reflects movements of the pointing device
7.10
Dynamic SQL
Dynamic SQL is an enhanced form of Structured Query Language (SQL) that, unlike standard (or
static) SQL, facilitates the automatic generation and execution of program statements. This can be
helpful when it is necessary to write code that can adjust to varying databases, conditions, or servers.
It also makes it easier to automate tasks that are repeated many times. Dynamic SQL statements are
stored as strings of characters that are entered when the program runs. They can be entered by the
programmer or generated by the program itself, but unlike static SQL statements, they are not
embedded in the source program. Also in contrast to static SQL statements, dynamic SQL statements
can change from one execution to the next.
Let's go back and review the reasons we use stored procedure and what happens when we use
dynamic SQL. As a starting point we will use this procedure:
CREATE PROCEDURE general_select @tblname nvarchar(127),
@key
key_type AS -- key_type is char(3)
EXEC('SELECT col1, col2, col3
FROM ' + @tblname + '
WHERE keycol = ''' + @key + '''')
The SELECT statement in client code and send this directly to SQL Server.
113
1. Permissions
If you cannot give users direct access to the tables, you cannot use dynamic SQL, it is as simple as
that. In some environments, you may assume that users can be given SELECT access. But unless
you know for a fact that permissions is not an issue, don't use dynamic SQL for INSERT, UPDATE
and DELETE statements. I should hasten to add this applies to permanent tables. If you are only
accessing temp tables, there are never any permission issues.
2. Caching Query Plans
As we have seen, SQL Server caches the query plans for both bare SQL statements and stored
procedures, but is somewhat more accurate in reusing query plans for stored procedures. In SQL 6.5
you could clearly say that dynamic SQL was slower, because there was a recompile each time. In
later versions of SQL Server, the waters are more muddy.
3. Minimizing Network Traffic
In the two previous sections we
than bare SQL statements from
any network cost for dynamic
general_select, neither is there
procedure call.
have seen that dynamic SQL in a stored procedure is not any better
the client. With network traffic it is a different matter. There is never
SQL in a stored procedure. If we look at our example procedure
much to gain. The bare SQL code takes about as many bytes as the
But say that you have a complex query which joins six tables with complex conditions, and one of the
table is one of sales0101, sales0102 etc depending on which period the user wants data about. This
is a bad table design, that we will return to, but assume for the moment that you are stuck with this.
If you solve this with a stored procedure with dynamic SQL, you only need to pass the period as a
parameter and don't have to pass the query each time. If the query is only passed once an hour the
gain is negligible. But if the query is passed every fifth second and the network is so-so, you are likely
to notice a difference.
4. Using Output Parameters
If you write a stored procedure only to gain the benefit of an output parameter, you do not in any way
affect this by using dynamic SQL. Then again, you can get OUTPUT parameters without writing your
own stored procedures, since you can call sp_executesql directly from the client.
5. Encapsulating Logic
There is not much to add to what we said in our first round on stored procedures. I like to point out,
however, that once you have decided to use stored procedure, you should have all secrets about SQL
in stored procedures, so passing table names as in general select is not a good idea. (The exception
here being sysadmin utilities.)
6. Keeping Track of what Is Used
Dynamic SQL is contradictory to this aim. Any use of dynamic SQL will hide a reference, so that it
will not show up in sysdepends. Neither will the reference reveal itself when you build the database
without the referenced object. Still, if you refrain from passing table or column names as parameters,
you at least only have to search the SQL code to find out whether a table is used. Thus, if you use
dynamic SQL, confine table and column names to the procedure code.
114
UNIT 8
RELATIONAL ALGEBRA
8.1
8.2
8.3
8.4
8.5
8.6
Basics of Relational Algebra
Set Operations on Relations
Extended Operators of Relational Algebra
Constraints on Relations
Self Test
Summary
8.1 Basics of Relational Algebra
Relational algebra is a procedural query language, which consists of a set of operations that take one
or two relations as input and produce a new relation as their result. The fundamental operations that
will be discussed in this tutorial are: select, project, union, and set difference. Besides the
fundamental operations, the following additional operations will be discussed: set-intersection.
Each operation will be applied to tables of a sample database. Each table is otherwise known as a
relation and each row within the table is refered to as a tuple. The sample database consists of tables
in which one might see in a bank. The sample database consists of the following 6 relations:
The account relation
branch-name
Downtown
Mianus
Perryridge
Round
Brighton
Redwood
Brighton
account-number
A-101
A-215
A-102
Hill A-305
A-201
A-222
A-217
balance
500
700
400
350
900
700
750
8.2 Set Operations on Relations
The select operation is a unary operation, which means it operates on one relation. Its function is to
select tuples that satisfy a given predicate. To denote selection, the lowercase Greek letter sigma ( ) is
used. The predicate appears as a subscript to . The argument relation is given in parentheses
following the .
For example, to select those tuples of the loan relation where the branch is "Perryridge," we write:
branch-home = "Perryridge" (loan)
The results of the query are the following:
115
branch-name
loan-number
amount
Perryridge
Perryridge
L-15
L-16
1500
1300
Comparisons like =, , <, >, can also be used in the selection predicate. An example query using a
comparison is to find all tuples in which the amount lent is more than $1200 would be written:
amount > 1200 (loan)
b) Project
The project operation is a unary operation that returns its argument relation with certain attributes
left out. Since a relation is a set, any duplicate rows are eliminated. Projection is denoted by the
Greek letter pi ( ). The attributes that wish to be appear in the result are listed as a subscript to .
The argument relation follows in parentheses. For example, the query to list all loan numbers and the
amount of the loan is written as:
loan-number, amount (loan)
The result of the query is the following:
loan-number
amount
L-17
L-23
L-15
L-14
L-93
L-11
L-16
1000
2000
1500
1500
500
900
1300
Another more complicated example query is to find those customers who live in Harrison is written
as:
customer-name ( customer-city = "Harrison" (customer))
c) Union
The union operation yields the results that appear in either or both of two relations. It is a binary
operation denoted by the symbol
.
An example query would be to find the name of all bank customers who have either an account or a
loan or both. To find this result we will need the information in the depositor relation and in the
borrower relation. To find the names of all customers with a loan in the bank we would write:
customer-name (borrower) and to find the names of all customers with an account in the bank, we
would write customer-name (depositor)
Then by using the union operation on these two queries we have the query we need to obtain the
wanted results. The final query is written as:
customer-name (borrower)
customer-name (depositor)
116
The result of the query is the following:
customer-name
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Adams
d) Set Difference
The set-difference operation, denoted by the -, results in finding tuples taht are in one relation but
are not in another. The expression r - s results in a relation containing those tuples in r abut not in s.
For example, the query to find all customers of the bank who have an account but not a loan, is
written as:
customer-name (depositor) - customer-name (borrower)
The result of the query is the following:
customer-name
Johnson
Turner
Lindsay
For a set difference to be valid, it must be taken between compatible operations just as in the union
operation.
8.3 Extended Operators of Relational Algebra
The set intersection operation is denoted by the symbol . It is not a fundamental operation, however
it is a more convenient way to write r - (r - s).
An example query of the operation to find all customers who have both a loan and and account can
be written as:
customer-name (borrower)
customer-name (depositor)
The results of the query are the following:
customer-name
Hayes
Jones
Smith
It has been shown that the set of relational algebra operations {σ, π, U, –, x} is a complete set; that is,
any of the other relational algebra operations can be expressed as a sequence of operations from this
set. For example, the INTERSECTION operation can be expressed by using UNION and DIFFERENCE
as follows:
117
R ∩ S ≡ ( R ∪ S ) – ((R – S ) ∪ ( S – R ))
Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this complex
expression every time we wish to specify an intersection. As another example, a JOIN operation can
be specified as a CARTESIAN PRODUCT followed by a SELECT operation, as we discussed:
R
<condition>S
≡σ
<condition>
(R x S )
Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT proceeded by RENAME and
followed by SELECT and PROJECT operations. Hence, the various JOIN operations are also not
strictly necessary for the expressive power of the relational algebra; however, they are very important
because they are convenient to use and are very commonly applied in database applications
which is used to assign expressions to a
- the operation denoted by
temporary relation variable.
- the operation denoted by a cross (X) allows for combination of
information from any two relations.
- the operation denoted by and used in queries wanting to find results
including the phrase "for all".
- the operation that pertains to a query that involves a Cartestian product
includes a selection operation on the result of the Cartesian product.
- the operation denoted by the Greek letter pi ( ), which is used to return
an argument with certain attributes left out.
- the operation denoted by the Greek letter rho ( ), which allows the results
of a relational-algebra expression to be assigned a name, which can later
be used to refer to them.
- the operation denoted by the Greek letter sigma (), which enables a
selection of tuples that satisfy a given predicate.
- the operation denoted by - allows for finding tuples that are in one
relation but are not in another.
- the operation denoted by which results in the tuples that are in both
relations the operation is applying to.
- an operation on relations that yields the relation of all tuples shared by
two or more relations. Denoted by the symbol:
assignment
Cartesian product
division
natural join
project
rename
select
Set difference
Set-intersection
union
8.4 Constraints on Relations
Representation of Relations
We can regard a relation in two ways: as a set of values and as a set of maps from attributes from
values.
Let
be the schema of a relation R, and let
relation. Then R is a subset of
from each of the domains
be the domain of the
and each tuple of the relation contains a set of values, one drawn
, each of which contains a unique null element, denoted
118
.
We can also regard each element of the relation as a map from R to
element of an instance r of R, we can write
Integrity Constraints on Relations
, so that if
and get the expected result.
of the attributes
We define a (candidate) key on a relation schema R to be a subset
of R such that for any instance
for all
we have
key is a candidate key in which none of the
primary key of R,
. We write
is an
for
. A primary
. We designate one candidate key to be the
to signify the projection of t onto the primary key
. Where
there is no chance of confusion we will write for
.
We require every relation to satisfy its primary key, and all its candidates keys. Let C be the set of all
candidate keys of R and let
be the primary key of R: we require
to satisfy
and
Operations on Relations
Taking the view of a relation as a set we can apply the normal set operations to relations over the
same domain. If the domain differ this is not possible. We have, of course, the normal algebraic
structure to the operations: the null relation over a domain is well defined, and the null tuple is the
sole element of the null relation.
We also have three relational operators we wish to consider: select, join and project.
of conditional expressions on relations, which
First we define for each relation R the domain
map an element of an instance
Select
Now we define
to true or false.
by
is a relation over the same domain as R and is a subset of R. We notice that we can use the
same primary key for both R and
and that
then R does not satisfy
must satisfy this key, since if there exist
.
Join
The join operation
is defined by
where we have
119
What is the schema for this? The key? Does it satisfy it?
Project
We define
by the post-condition
and by requiring that if
is the domain of R then the domain of
is
.
If we view R as a set of map we can view the projection operator as restricting each of these maps to a
smaller domain.
The schema of
is A. If
then
forms a key for A. Otherwise, if there are no nulls in any
tuple of the projection of R we can use A as a primary key. Otherwise we can not in general identify a
primary key.
Insertion
For each relation R we define an insertion operation:
The insertion operation satisfies the invariant, since it will refuse to break it.
Update
For each relation R we define an update operation.
Update also preserves the invariant.
Deletion
We define the deletion of a tuple by:
8.5 Self Test
1. Simple selection and projection
i. Who are the patients 10 years old or younger?
ii. Who are the surgeons
iii. What are the phone numbers of doctors
iv. What are the phone numbers of surgeons
2. Set Operations
i. Re-state the expression 10 60( ) age age patients s £ Ú ³ using set operations.
ii. Re-state the expression oculist( ) rank surgeon rank doctors s ¹ Ù ¹ using set operations without ¹
and Ù
iii. Find all the patients who saw doctor 801 but not 802 (i.e. dnum=801, dnum ¹802)
3. Cartesian Product and Join
i. Form peer groups for patients, where a peer group is a pair of patients where age difference is less
than 10 years (can use the rename operator rA(R) ).
120
ii. Who are the surgeons who visited the patient 101 (i.e. pnum = 101)?
iii. Who has seen a surgeon in the past two years?
iv. Is there any non-surgeon doctors who performed a surgeon (a doctor performed a surgeon if the
visit record shows diagnosis=”operation” for him)?
4. Divison
i. Who has seen all the surgeons in the past two months?
ii. Find all patients excepts for the youngest ones. patients (pnum, pname, age)
doctors (dnum, dname, rank) visits (pnum, dnum, date, diagnosis)
8.6 Summary
• Relational algebra is a procedural query language, which consists of a set of operations that
take one or two relations as input and produce a new relation as their result. The fundamental
operations that will be discussed in this tutorial are: select, project, union, and set difference.
• Set Operations on Relations: selection, projection, join and set theory.
• The set intersection operation is denoted by the symbol
however it is a more convenient way to write r - (r - s).
121
. It is not a fundamental operation,
UNIT 9
MANAGEMENT CONSIDERATIONS
9.1
9.2
9.3
9.4
9.5
9.6
9.7
9.8
9.1
Introduction
Objectives
Organisational Resistance To Dbms Tools
Conversion From An Old System To A New System
Evaluation Of A Dbms
Administration Of A Database Management System
Self Test
Summary
INTRODUCTION
This unit will comprise issues of organizational resistance, the methodology for conversion from an
old system to a new system, the importance of adopting a decentralized distributed approach and
evaluation and administration of such systems.
9.2 OBJECTIVES
After going through this unit, you should be able to:
•
Identify the factors causing resistance to the induction of new DBMS tools;
•
Determine the path that must be chosen in converting from an old existing system to a new
system;
•
List the various factors that are important in evaluating a DBMS system;
•
Formulate a simple evaluation methodology for DBMS selection and acquisition;
•
Enumerate the functions of the database administrator; and
•
List the checkpoints and principles, which must be, adhered to in order that information
quality is assured.
9.3 ORGANISATIONAL RESISTANCE TO DBMS TOOLS
In practice, this does not happen and organizations react to information systems by offering
resistance. This is a part of an inherent opposition to change. There are some aspects of change
related to information system that are those great passions. This arose because of some of the
following factors:
•
Political observation: The officers and managers at different levels of an organization feel
threatened with the nice long standing political equations and relationships which have
122
enjoyed their otherwise upward movement within the organization, and may be threatened by
a new intervention into their styles of working.
•
Information transparency: In the absence of an electronic computer-based efficient information
system, many functionaries in an organization have access to information, which they control
and pass on giving it the color that would suit them. The availability of information through
computer-based systems to almost all who would have an interest in it makes this authority
disappear. It is therefore naturally resented.
•
Fear of future potential: The very fact that computers can store information in a very compact
manner and it can be collated and analyzed very speedily gives rise to apprehensions of future
adverse use of this information against an individual. Mistakes n decision-making can now be
highlighted and analyzed in detail after learning spells of time. It would not have been possible
in manual file-based systems or any system where the data does not flow so readily.
Inter-departmental rivalry, fear of personal inadequacy, in comprehensive of the new regime the loss
of ones own power and the greater freedom to others and difference in work styles all add up to
produce resistance to the induction of new information processing tools. Apart from these general
considerations, there are reasons to resist installation of a new DBMS.
There are several points of resistance to new DBMS tools:
•
Resistance to acquiring a new tool
•
Resistance to choosing to use a new tool
•
Resistance to learning how to use a new tool
•
Resistance to using a new tool.
The selection and acquisition of a DBMS and related tools is one of the most important computerrelated decisions made in an organization. It is also one of the most difficult. There are many systems
from which to choose and it is very difficult to obtain the necessary information to make a good
decision. Vendors always have great things to say, convincing argument for their systems, and often
many satisfied customers. Published literature and software listing services are too cursory to provide
sufficient information on which to base a decision. The mere difficulty in gathering information and
making the selection is one point of resistance to acquiring the new DBMS tools.
The initial cost may also be a barrier to acquisition. However, the subsequent investment in training
people, developing applications, and entering and maintaining data will be much time more. Selection
of an inadequate system can greatly increase these subsequent costs to the point where the initial
acquisition cost becomes irrelevant.
In spite of the apparent resistance to acquisition, the projections for the database industry are
forecasting a multi-billion dollar industry in the 1990's. Even though an organization may acquire a
DBMS, there are still several additional points of resistance to overcome.
Simply having a DBMS does not mean that it will be used. Several factors may contribute to the lack
of use of new DBMS tools.
Lack of familiarity with the tools and what it can do
System developers used to writing COBOL (or other language) programs prefer to build systems using
the tools they already know
123
The pressure to get new application development projects completed dictates using established tools
and techniques
Systems development personnel have not been thoroughly trained in the use of the new tool
DBMS tools
DP management is afraid of run away demand on the computing facilities if they allow users to
directly access the data on the host computer using an easy to use, high-level retrieval facility
Organizational policies, which do not demand, appropriate justification for the tools chosen (or not
chosen) for each system development project.
Having pointed out the transactions from which utility can arise to the intervention of a new DBMS,
it may be useful to have a summary of a few pointers, which would possibly lead to a greater success
in such an endeavor.
Reasons for success
Reasons for failure
Appreciation for information
is a valuable corporate
reasons and its management
must be given special
importance.
Perception by the barrens in
the organization that the MIS
design is a menace
conflicting interest to
prevent the success.
Focusing on most beneficial
usage of database, which
relate to the bottom level.
over sailing NUS to top,
management and chosen
applications for their
challenge to the programmer
members.
An incremental approach to
building applications with
each new step being reasonably
small and relatively easy to
implement.
A grant design for creation
of an impressive system
that can be a pinata for
all information problems
Cooperate-wide planning by a
high level, empowered,
competent data administrator.
Fragmented plans by noncommunicating and not
eventually response groups.
Conversion planning which
Permits all the systems to coexit with the new.
A situation which may put
into the new system and
attempts to re-write to
many old programs.
Awareness education and
involvement of all persons at
a level appropriate to their
functions.
Apathy by most people to
implementation of the new
system.
124
Good understanding of the
technical issues and tight
technical control by the
database administrators.
Inadequate computing power,
incorrect assignment of
throughput and assigned
time and failure to monitor
usage and performance.
Recognition of the importance
of a data dictionary and
standards for naming, update
control and version
synchronization.
Casual approach to data
standards and documentation.
Simplicity.
Confused thinking.
A proper mix of centralized
guidance and decentralized
implementation.
Indifference of the central
system and proliferation of
incompatible systems.
Proven work-free software.
The latest software wonder.
9.4 CONVERSION FROM AN OLD SYSTEM TO A NEW SYSTEM
Management is also concerned with long-term corporate strategy. The database selected has to be
consistent with the commitments of that corporate strategy. But if the organization does not have a
corporate database, then one has to be developed before conversion is to take place. Selecting a
database has to be from the top down: data flow diagrams, representing the organization's business
functions, processes and activities should be drawn up first followed by entity-relation charts
detailing the relationships between different business information; and then finally by data modeling.
•
Corporate Strategic Decisions: The database approach to information systems is a long-term
investment. It requires a large-scale commitment of an Organization's resources in compatible
hardware and software, skilled personnel and management support. Accompanying costs are the
education and training of the personnel, conversion, and documentation. It is essential for an
organization to fully appreciate, if not understand, the problems of converting from an existing,
file- based system to a database system, and to accept the implications of its operation before the
conversion.
•
Hardware Requirements and Processing Time: The database approach should be in a position to
delegate to the database management system some of the functions that were previously
performed by the application programmer. As a result of this delegation, a computer with a large
internal memory and greater processing power is needed. Powerful computer systems were once
the luxury enjoyed by those database users who could afford such systems but fortunately, this
trend is now changing. A recent development in hardware technology has made it possible to
acquire powerful, yet affordable system.
125
For some applications, the need for high-volume transaction processing may force a company to
engineer one or even several systems designed to satisfy this need. This sacrifices certain flexibility
for the system to respond to ad-hoc requests. And it is also argued that because of the easier access
to data in the database, the frequency of access will become higher. Such overuse of computing
resources will cause slips in performance, resulting in an increased demand for computing capacity.
It is sometimes difficult to determine if the increased access to the database is rally necessary.
The database approach offers a number of important and practical advantages to an organization.
Reducing data redundancy improves consistency of the data as well as making savings in storage
space. Sharing data often enables new application to be developed without having to create new data
files. Less redundancy and greater sharing also result in less confusion between organizational units
and less file spent by people resolving inconsistencies in reports. Centralized control over data
standards, security restrictions, and so on, facilitates the evolution of information systems and
organizations in response to changing business needs and strategies. Now a day, users with little or
no previous programming experience can, with the aid of powerful user- friendly query languages,
manipulate data to satisfy ad-hoc queries. Data independence helps case program development and
maintenance, raising programmer productivity. All the benefits of the database approach contribute
to reduce costs of application development and improved quality of managerial decisions.
In general, converting to a database involves the following:
•
Inventories current systems such as data volume, user satisfaction, present condition, and
the cost to maintain or redevelop.
•
Determining conversion priority in strategic information system plans, building block systems
and critical needs to replace system.
•
Obtain commitment from senior/top management.
•
Appoint qualified database-administration staff.
•
Education management information systems staff.
•
Select suitable and appropriate software.
•
Install data dictionary first.
•
Involve and educate users.
•
Redesign and implement new data structures.
•
Write software utility tools to convert the files or database in the old system to the new
database.
•
Modify all application programs to make use of the new data structures.
•
Design a simple database first for pilot testing.
•
Implement all software.
•
Update policies and procedures.
• Install the new database on production.
In the recent trend of database development, a common front- end to the various database
management systems will often be constructed in such a way that the original systems and the
programs on them are not modified, but their databases can be mapped to each other through a
single uniform language.
Another approach is to unify various database structures by applying the database standards laid
down by the International Standards Organization for data definition and data manipulation. Public
126
acceptance of these standard database structures will ensure a more rapid development of additional
conversion tools, such as automatic functions for loading and unloading databases into standard
forms for model-to-model database mapping.
If an organization after weighing all the relevant factors decides to make an investment in a good
database management system, it has to develop a product planned for doing so. Many of the steps
required are more or less along the lines that are required when an organization first moves in
towards the use of computer-based information system. One would immediately note the similarity to
the steps referred to in the course on "System Analysis and Design". In the interest of briefing
therefore the reference would be only to those factors, which are of greater consequences for the
problem at hand. It may, however, be useful to bear in mind that a detailed implementation plan
would be more or less along the fines of creation of a computer information system for the first time.
9.5 EVALUATION OF A DBMS
The evaluation is not simply a matter of comparison or description of one system against another
independent system, and surveying sometimes available through publication do describe and
compare the features of available systems, but a value of an organization depends upon its own
problem environment. An organization must therefore look at these own needs to evaluation of the
available systems.
It is worthwhile putting some attention to who should do this. In a small organization it is possible
that a single individual would be able to do the job, but larger organizations need to formally
establish an evaluation team. Even this team's composition would somewhat change as -the
evaluation process moves on. A good role in the initial stage would be played by users and
management focus on the organizational needs. Computers and Information technology professionals
then evaluate the technical gaps of several candidate systems and finally financial and accounting
personnel examine the cost estimates, payment alternatives, tax consequences, personnel
requirements, and contract negotiations.
The reasons which inspire the organization to acquire a DBMS should be clearly documented and
used to determine the properties and help in making trade offs between conflicting objectives and in
the selection of various features that the candidate DBMS may have, depending upon the end-user
requirements. The evaluation team should also be aware of technical and administrative issues.
These technical criteria could be the following:
a.
SQL
implementation
b.
Transaction
management
c.
Programming
interface
d.
Database
server
environment
e.
Data
storage
features
f.
Database
administration
g.
Connectivity
h.
DBMS integrity
Similarly there could be administrative criteria such as:
127
1.
2.
3.
4.
5.
6.
Required
Documentation
Vendor's
hardware
platform
financial
Vendor
stability
support
cost
Initial
Recurring cost
Each of these, especially the technical criteria could be further broken into sub-criteria. For example
the data storage features can be further sub-classified into:
a.
Lost
database
segments
b.
Clustered
indexes
c.
Clustered tables
Once this level of detailing is done, the list of features become quite large and may even run into
hundreds. If a dozen products are to be evaluated, we are talking of a fairly large matrix.
At this point, it is important for the evaluation wars and especially its technical members to segregate
these features into those, which are mandatory. Mandatory features would be those which if not
present in the candidate system, the system need not be considered further. For example, does
DBMS provide facilities for programming and non-programming users? Can be considered as one
among several mandatory conditions. Mandatory requirement may also flow from a desire to preserve
the previous investment in information systems made by an organization. The presence of the
mandatory condition means that the system is a candidate for the rating procedure.
Having done the first stage of creating a feature list, one of the simplest ways could be to develop a
table where the features and its related information for each candidate system is listed to in a tabular
form against the desired feature. Such forms can be chosen to compare the various systems and
although this cannot be enough to conclude an evaluation, it is a useful method for at least broadly
ranking and short-listing the systems. A quantitative flavour can be given to the above approach by
awarding points for features, which are in simple Yes and No type. If all the features are not equally
important to the organization, then the summing up of the points awarded for each of the features for
any of the system is not quite appropriate. In
such a case a rating factor can be assigned to each
feature to reflect the relevant level of importance of that feature to the organization. Of course such
rating or scoring should be done after the first condition of mandatory requirements have been met
by the proposed system. Sometimes the mandatory characteristics may be expressed in the negative
as something, which the system must not have.
The points of the rates is a contentious issue and must be decided looking only to the needs of the
organization and with reference to the characteristics of any specific candidate system one of the
approaches used towards arriving at a suitable set of rating factors is to follow the Delphi method. In
brief, the Delphi approach requires key people who may be expected to be knowledgeable to make
suggestions as to what, would be the appropriate rating factor. These are collected, compiled,
averages taken and deviation from averages pointed out This data is then re-circulated to the same
set of people for wanting to change their opinions where their own views were varying largely from the
average. The details can then be carried out and it has been found that in about as few as 3 to 4
iterations in good consensus emerges.
128
One of the weaknesses of the methodologies discussed so far is that they are focusing on the
systems but not on the cost benefit aspects. A good evaluation methodology should be possibly to
suggest the most cost effective solution to the problem. For example, if a system is twice as good as
another system, but costs only 40% more than it ought to be a preferred solution.
In order to carry a cost after analysis one has to use a rating function with each feature to
normalize the sequence. Rather than having an approach where a feature is characterized as a
Yes/No, the attribute corresponding to its presence or absence, which in marks term could be 0 or 1,
a mark can be given on a scale which is appropriate to the feature. This can arise in issues such as
the number of terminals that are supported or the amount of main memory required. Rating
functions can be of several types of which 4 are illustrated in Figure 1.
Linear: In a linear rating function the rating increases in proportion to higher marks starting
from 0. Broken linear: There are situations where the minimum threshold is essential and similarly
there is a saturated value above which no additional value is given. Typically in general concurrent
access, few or 3 would be includable value and more than 9 is of no additional value.
Binary: This is of course a Yes/No type where a system either has or does not have the feature
or some minimum value for the feature.
Inverse: There are some attributes where a higher mark actually implies a lower rating. For
example in accessing the time to process a standard query, the mark may be simply the time scale in
an appropriate manner. Therefore, a shorter time actually has a higher rating.
For each feature, the rating function uses an appropriate and convenient scale of
measurement for determining a system's feature mark. The rating function transforms a system's
feature mark into a normalized rating indicating its value relative to a nominal mark for that feature.
The nominal mark for each feature has a nominal rating of one.
The use of rating function is more sophisticated and costly to apply than the simplified
methodologies. The greater objectivity and precision obtained must be weighted against the overall
benefits of DBMS acquisition and use. Some features will have no appropriate objective scale on
which to mark the feature. The analyst could use a five point scale with a linear rating function as
follows:
Feature evaluation
Rating point
Excellent (A)
5
Good (B)
4
Average (C)
3
Fair (D)
2
Poor (E)
1
Variations can expand or contract the rating scale, using a nonlinear rating function, or expand the
points in the feature evaluation scale to achieve greater resolution. In extreme cases, the analyst
could simply use subjective judgment to arrive at a rating directly, remembering that a feature rating
of one applies to a nominal or average system.
Having converted all the marks to ratings, the system score is the product of the rating and the
weight summed across all features, just as before. The overall score for a nominal system would he
one (since all weights sum to one and all nominal ratings are one). This is important for determining
cost effectiveness, the ratio between the value of a system and its cost. The organization first
determines the value of a system which cams a nominal mark for all features. This is called the
nominal value. Then the actual value of a given system is the product of the overall system score and
the nominal value. The cost effectiveness of a system is the actual value divided by the cost of the
129
system. System cost is the present value cost of acquisition, operation and maintenance over the
estimated life of the system.
.
Figure : Sample Feature Rating Functions
With a cost-effectiveness measure for several candidate systems, the organization would tentatively
select the system with the highest cost-effectiveness ratio.
Of course there may be intangible factors other than the technical and administrative criteria
referred to earlier, which may influence the final selection based upon political judgments of the
130
management or some other considerations. It would of course be possible to even to build these up of
that can be explicitly so illustrated into the evaluation process.
9.6 ADMINISTRATION OF A DATABASE MANAGEMENT SYSTEM
Acquiring a DBMS is not sufficient for successful data management. The role of database
administrator provides the human focus of responsibility to make it all happen. One person or
several persons may fill the DBA role.
Whenever people share the use of a common resource such as data, the potential for conflict exists.
The database administrator role is fundamentally a people-oriented function to mediate the conflicts
and seek compromise for the global good for the organization.
Within an organization, database administration generally begins as a support function within the
systems development unit. Sometimes it is in a technical support unit associated with operations.
Eventually, it should be separate from both development and operations, residing in a collection of
support functions reporting directly to the director of information systems. Such a position has some
stature, some independence, and can work directly with users to capture their data requirements.
Database administration works with development, operations, and users to coordinate the response
to data needs. The database administrator is the key link in establishing and maintaining
management and user confidence in the database and in the system facilities, which make it
available and control its integrity.
While the 'doing' of database system design and development can be decentralized to several
development projects in the Data Processing Department or the user organizations, planning and
control of database development should be centralized. In this way an organization can provide more
consistent and coherent information to successively higher levels of management.
The functions associated with the role of database administration include:
•
Definition, creation, revision, and retirement of data formally collected and stored within a
shared corporate database.
•
Making the database available to the using environment through tools such as a DBMS and
related query languages and report writers.
•
Informing and advising users on the data resources currently available, the proper
interpretation of the data, and the use of the availability tools. This includes educational
materials, training sessions, participation on projects, and special assistance.
•
Maintaining database integrity including existence control (backup and recovery), definition
control, quality control, update control, concurrency control, and access control.
•
Monitor and improve operations and performance, and maintain an audit trail of database
activities.
•
The data dictionary is one of the more important tools for the database administrator. It is
used to maintain information relating to the various resources used in information systems
(hence sometimes called an information resource dictionary)--data, input transactions, output
reports, programs, application systems, and users. It can:
•
Assist the process of system analysis and design.
131
9.7
9.8
•
Provide a more complete definition of the data stored in the database (than is maintained by
the DBMS).
•
Enable an organization to assess the impact of a suggested change within the information
system or the database.
•
Help in establishing and maintaining standards for example, of data names.
•
Facilitate human communication through more complete and accurate documentation.
•
Several data dictionary software packages are commercially available.
•
The DBA should also have tools to monitor the performance of the database system to indicate
the need for reorganization or revision of the database.
SELF TEST
1) Determine the path that must be chosen in converting from an old existing system to a new
system;
2) List the various factors that are important in evaluating a DBMS system;
3) Explain role of DBA with the help of an example.
SUMMARY
•
There are some aspects of change related to information system that are those great
passions such as Political observation and information transparency etc.
•
Organizational policies, which do not demand, appropriate justification for the tools
chosen (or not chosen) for each system development project.
•
The initial cost may also be a barrier to acquisition. However, the subsequent investment
in training people, developing applications, and entering and maintaining data will be
much time more. Selection of an inadequate system can greatly increase these
subsequent costs to the point where the initial acquisition cost becomes irrelevant.
•
Broken linear: There are situations where the minimum threshold is essential and
similarly there is a saturated value above which no additional value is given. Typically in
general concurrent access, few or 3 would be includable value and more than 9 is of no
additional value.
132
UNIT 10
CONCURRENCY CONTROL
10.1
10.2
10.3
10.4
10.5
10.6
10.7
10.8
10.9
10.1
Serial and Serializability Schedules
Conflict-Serializability
Enforcing Serializability by Locks
Locking Systems With Several Lock Modes
Architecture for a Locking Scheduler
Managing Hierarchies of Database Elements
Concurrency Control by Timestamps
Concurrency Control by Validation
Summary
Serial and Serializable Schedules
When two or more transactions are running concurrently, the steps of the transactions would
normally be interleaved. The interleaved execution of transactions is decided by the database
scheduler, which receives a stream of user requests that arise from the active transactions. A
particular sequencing (usually interleaved) of the actions of a set of transactions is called a schedule.
A serial schedule is a schedule in which all the operations of one transaction are completed before
another transaction can begin (that is, there is no interleaving).
Serial execution means no overlap of transaction operations.
If T1 and T2 transactions are executed serially:
RT1(X) WT1(X) RT1(Y) WT1(Y) RT2(X) WT2(X)
or
RT2(X) WT2(X) RT1(X) WT1(X) RT1(Y) WT1(Y)
The database is left in a consistent state.
The basic idea:
Each transaction leaves the database in a consistent state if run individually
If the transactions are run one after the other, then the database will be consistent
For the first schedule:
Database
T1
T2
--- x=100, y=50 --- read(x) --- x=100 --x:=x*5 --- x=500 ----- x=500, y=50 --- write(X)
read(Y) --- y=50 ---
133
Y:=Y-5 --- y=45 ----- x=500, y=45 --- write(Y)
read(x) --- x=500 --x:=x+8 --- x=508 --write(X)
--- x=508, y=45 --For the second schedule:
Database
T1
T2
read(x) --- x=100 ---
--- x=100, y=50 ---
x:=x+8 --- x=108 --write(x)
--- x=108, y=50 --read(x) --- x=108 --x:=x*5 --- x=540 ----- x=540, y=50 --- write(x)
read(y) --- y=50 --y:=y-5 --- y=45 ----- x=540, y=45 --- write(y)
Serializable Schedules
Let T be a set of n transactions T1, T2, ..., Tn . If the n transactions are executed serially (call this
execution S), we assume they terminate properly and leave the database in a consistent state. A
concurrent execution of the n transactions in T (call this execution C) is called serializable if the
execution is computationally equivalent to a serial execution. There may be more than one such serial
execution. That is, the concurrent execution C always produces exactly the same effect on the
database as some serial execution S does. (Note that S is some serial execution of T, not necessarily
the order T1, T2, ..., Tn ). A serial schedule is always correct since we assume transactions do not
depend on each other and furthermore, we assume, that each transaction when run in isolation
transforms a consistent database into a new consistent state and therefore a set of transactions
executed one at a time (i.e. serially) must also be correct.
Example
1. Given the following schedule, draw a serialization (or precedence) graph and find if the
schedule is serializable.
134
Solution:
There is a simple technique for testing a given schedule S for serializability. The testing is
based on constructing a directed graph in which each of the transactions is represented by
one node and an edge between
in the schedule:
executes WRITE( X) before
executes READ( X) before
and
exists if any of the following conflict operations appear
executes READ( X), or
executes WRITE( X)
executes WRITE( X) before executes WRITE( X).
If the graph has a cycle, the schedule is not serializable.
10.2
Conflict-Serializability
•
Two operations conflict if:
o They are issued by different transactions,
o They operate on the same data item, and
o At least one of them is a write operation
•
Two executions are conflict-equivalent, if in both executions all conflicting operations have the
same order
•
An execution is conflict-serializable if it is conflict-equivalent to a serial history
Conflict graph
Execution is conflict-serializable iff the conflict graph is acyclic
T1
T2
T3
W1(a) R2(a) R3(b) W2(c) R3(c) W3(b) R1(b)
Example
Schedule
Conflict
135
Graph
Nodes:
transactions
Directed edges:
conflicts
between
operations
Serializablity (examples)
•
H1: w1(x,1), w2(x,2), w3(x,3), w2(y,1),r1(y)
•
H1 is view-serializable, since it is view- equivalent to H2 below:
o H2: w2(x,2), w2(y,1), w1(x,1), r1(y), w3(x,3)
•
However, H1 is not conflict-serializable, since its conflict graph contains a cycle: w1(x,1)
occurs before w2(x,2), but w2(x,2), w2(y,1) occurs before r1(y)
•
No serial schedule that is conflict-equivalent to H1 exists
Execution Order vs. Serialization Order
•
Consider the schedule H3 below: H3: w1(x,1), r2(x), c2, w3(y,1), C3, r1(y), C1
•
H3 is conflict equivalent to a serial execution in which T3 is followed by T1, followed by T2
•
This is despite the fact that T2 was executed completely and committed, before T3 even
started
Recoverability of a Schedule
•
A transaction T1 reads from transaction T2, if T1 reads a value of a data item that was written
into the database by T2
•
A schedule H is recoverable, iff no transaction in H is committed, before every transaction it
read from is committed
•
The schedule below is serializable, but not recoverable: H4: r1(x), w1(x), r2(x), w2(y) C2, C1
Cascadelessness of a Schedule
•
A schedule H is cascadeless (avoids cascading aborts), iff no transaction in H reads a value
that was written by an uncommitted transaction
•
The schedule below is recoverable, but not cascadeless: H4: r1(x), w1(x), r2(x), C1, w2(y) C2
Strictness of a Schedule
•
A schedule H is strict if it is cascadeless and no transaction in H writes a value that was
previously written by an uncommitted transaction
•
The schedule below is cascadeless, but not strict: H5: r1(x), w1(x), r2(y), w2(x), C1, C2
•
Strictness permits the recovery from before images logs
Strong Recoverability of a Schedule
•
A schedule H is strongly recoverable, iff for every pair of transactions in H, their commitment
order is the same as the order of their conflicting operations.
•
The schedule below is strict, but not strongly recoverable: H6: r1(x) w2(x) C2 C1
136
Rigorousness of a Schedule
•
A schedule H is rigorous, if it is strict and no transaction in H reads a data item untils all
transactions that previously read this item either commit or abort
•
The schedule below is strongly recoverable, but not rigorous: H7: r1(x) w2(X) C1 C2
•
A rigorous schedule is serializable and has all properties defined above
10.3 Enforcing Serializability by Locks
Database servers support transactions: sequences of actions that are either all processed or none at
all, i.e. atomic. To allow multiple concurrent transactions access to the same data, most database
servers use a two-phase locking protocol. Each transaction locks sections of the data that it reads or
updates to prevent others from seeing its uncommitted changes. Only when the transaction is
committed or rolled back can the locks be released. This was one of the earliest methods of
concurrency control, and is used by most database systems.
Transactions should be isolated from other transactions. The SQL standard's default isolation level is
serialisable. This means that a transaction should appear to run alone and it should not see changes
made by others while they are running. Database servers that use two-phase locking typically have to
reduce their default isolation level to read committed because running a transaction as serialisable
would mean they'd need to lock entire tables to ensure the data remained consistent, and such tablelocking would block all other users on the server. So transaction isolation is often traded for
concurrency. But losing transaction isolation has implications for the integrity of your data. For
example, if we start a transaction to read the amounts in a ledger table without isolation, any totals
calculated would include amounts updated, inserted or deleted by other users during our reading of
the rows, giving an unstable result.
Database research in the early 1980s discovered a better way of allowing concurrent access to
data .Storing multiple versions of rows would allow transactions to see a stable snapshot of the data.
It had the advantage of allowing isolated transactions without the drawback of locks. While one
transaction was reading a row, another could be updating the row by creating a new version. This
solution at the time was thought to be impractical: storage space was expensive, memory was small,
and storing multiple copies of the data seemed unthinkable.
Of course, Moore's Law has meant that disk space is now inexpensive and memory sizes have
dramatically increased. This, together with improvements in processor power, has meant that today
we can easily store multiple versions and gain the benefits of high concurrency and transaction
isolation without locking.
Unfortunately the locking protocols of popular database systems, many of which were designed well
over a decade ago, form the core of those systems and replacing them seems to have been impossible,
despite recent research again finding that storing multiple versions is better than a single
version with locks
137
10.4
Locking Systems With Several Lock Modes
Several Object Orientated Databases, which were more recently developed, have incorporated OCC
within their designs to gain the performance advantages inherent within this technological approach.
Though optimistic methods were originally developed for transaction management the concept is
equally applicable for more general problems of sharing resources and data. The methods have been
incorporated into several recently developed Operating Systems, and many of the newer hardware
architectures provide instructions to support and simplify the implementation of these methods.
Optimistic Concurrency Control does not involve any locking of rows as such, and therefore cannot
involve any deadlocks. Instead it works by dividing the transaction into phases.
•
Build-up commences the start of the transaction. When a transaction is started a consistent
view of the database is frozen based on the state after the last committed transaction. This means
that the application will see this consistent view of the database during the entire transaction. This
is accomplished by the use of an internal Transaction Cache, which contains information about all
ongoing transactions in the system. The application "sees" the database through the Transaction
Cache. During the Build-up phase the system also builds up a Read Set documenting the accesses
to the database, and a Write Set of changes to be made, but does not apply any of these changes to
the database. The Build-up phase ends with the calling of the COMMIT command by the application.
•
The Commit involves using the Read Set and the Transaction Cache to detect access
conflicts with other transactions. A conflict occurs when another transaction alters data in a way
that would alter the contents of the Read Set for the transaction that is checked. Other transactions
that were committed during the checked transaction's Build-up phase or during this check phase
can cause a conflict. If a transaction conflict is detected, the checked transaction is aborted. No
rollback is necessary, as no changes have been made to the database. An error code is returned to
the application, which can then take appropriate action. Often this will be to retry the transaction
without the user being aware of the conflict.
•
If no conflicts are detected the operations in the Write Set for the transaction are moved to
another structure, called the Commit Set that is to be secured on disk. All operations for one
transaction are stored on the same page in the Commit Set (if the transaction is not very large).
Before the operations in the Commit Set are secured on permanent storage, the system checks if
there is any other committed transactions that can be stored on the same page in the Commit Set.
After this, all transactions stored on the Commit Set page are written to disk (to the transaction
databank TRANSDB) in one single I/O operation. This behavior is called a Group Commit, which
means that several transactions are secured simultaneously. When the Commit Set has been
secured on disk (in one I/O operation), the application is informed of the success of the COMMIT
command and can resume its operations.
•
During the Apply phase the changes are applied to the database, i.e. the databanks and the
shadows are updated. The Background threads in the Database Server carry out this phase. Even
though the changes are applied in the background, the transaction changes are visible to all
applications through the Transaction Cache. Once this phase is finished the transaction is fully
complete. If there is any kind of hardware failure that means that SQL is unable to complete this
phase, it is automatically restarted as soon as the cause of the failure is corrected.
138
Most other DBMSs offer pessimistic concurrency control. This type of concurrency control protects a
user's reads and updates by acquiring locks on rows (or possibly database pages, depending on the
implementation), this leads to applications becoming 'contention bound' with performance limited by
other transactions. These locks may force other users to wait if they try to access the locked items.
The user that 'owns' the locks will usually complete their work, committing the transaction and
thereby freeing the locks so that the waiting users can compete to attempt to acquire the locks.
Optimistic Concurrency Control (OCC) offers a number of distinct advantages including:
•
Complicated locking overhead is completely eliminated. Scalability is affected in locking
systems as many simultaneous users cause locking graph traversal costs to escalate.
•
Deadlocks cannot occur, so the performance overheads of deadlock detection are avoided as
well as the need for possible system administrator intervention to resolve them.
•
Programming is simplified as transaction aborts only occur at the Commit command whereas
deadlocks can occur at any point during a transaction. Also it is not necessary for the
programmer to take any action to avoid the potentially catastrophic effects of deadlocks, such
as carrying out database accesses in a particular order. This is particularly important as
potential deadlock situations are rarely detected in testing, and are only discovered when
systems go live.
•
Data cannot be left inaccessible to other users as a result of a user taking a break or being
excessively slow in responding to prompts. Locking systems leave locks set in these
circumstances denying other users access to the data.
•
Data cannot be left inaccessible as a result of client processes failing or losing their
connections to the server.
•
Delays caused by locking systems being overly cautious are avoided. This can arise as a result
of larger than necessary lock granularity, but there are also several other circumstances when
locking causes unnecessary delays even when using fine granularity locking.
•
Removes the problems associated with the use of ad-hoc tools.
•
Through the Group Commit concept, which is applied in SQL, the number of I/Os needed to
secure committed transactions to the disk is reduced to a minimum. The actual updates to
the database are performed in the background, allowing the originating application to
continue.
•
The ROLLBACK statement is supported but, because nothing is written to the actual database
during the transaction Build-up phase, this involves only a re-initialization of structures used
by the transaction control system.
•
Another significant transaction feature in SQL is the concept of Read-Only transactions,
which can be used for transactions that only perform read operations to the database. When
performing a Read-Only transaction, the application will always see a consistent view of the
database. Since consistency is guaranteed during a Read-Only transaction no transaction
check is needed and internal structures used to perform transaction checks (i.e. the Read Set)
is not needed, and for this reason no Read Set is established for a Read-Only transaction. This
has significant positive effects on performance for these transactions. This means that a ReadOnly transaction always succeeds, unaffected of changes performed by other transactions. A
Read-Only transaction also never disturbs any other transactions going on in the system. For
example, a complicated long-running query can execute in parallel with OLTP transactions.
139
10.5
Architecture for a Locking Scheduler
Architecture Features
•
Memory Usage
•
Shared Memory
File system
•
Page Replacement Problems
•
Page eviction
•
Simplistic NRU replacement
•
Clock algorithm can evict accessed pages
•
Sub-optimal reaction to variable load or load
Spikes after inactivity
•
Improvements:
•
Finer-grained SMP Locking
•
Unification of buffer and page caches
•
Support for larger memory configurations
•
SYSV shared memory code replaced
•
Page aging reintroduced
•
Active & inactive page lists
•
Optimized page flushing
•
Controlled background page aging
•
Aggressive read ahead
SMP locking optimizations, Use of global “kernel_lock” was minimized. More subsystem based
spinlock are used. More spinlocks embedded in data structures.
Semaphores used to serialize address space access.
More of a spinlock hierarchy established. Spinlock granularity tradeoffs.
Kernel multi-threading improvements
Multiple threads can access address space data structures simultaneously.
Single mem->msem semaphore was replaced with multiple reader/single writer semaphore.
Reader lock is now acquired for reading per address space data structures.
Exclusive write lock is acquired when altering per address space data structures.
32 bit UIDs and GIDs
Increase from 16 to 32 bit UIDs allow up to 4.2 billion users.
Increase from 16 to 32 bit GIDs allow up to 4.2 billion groups.
64 bit virtual address space, Architectural limit of the virtual address space was
expanded to a full 64 bits.
IA64 currently implements 51 bits (16 disjoint 47 bit regions)
Alpha currently implements 43 bits (2 disjoint 42 bit regions)
S/390 currently implements 42 bits
Future Alpha is expanded to 48 bits (2 disjoint 47 bit regions)
140
Unified file system cache
Single page cache was unified from previous
Page cache read write functionality. Reduces memory consumption by eliminating double buffered
copies of file system data.
Eliminates overhead of searching two levels of data cache.
10.6
Managing Hierarchies of Database Elements
These storage type are sometimes called the storage hierarchy. It contains of the archival
storage. It consist of the archival database, physical database, archival log, and current log.
Physical database: this is the online copy of the database that is stored in nonvolatile storage and
used by all active transactions.
Current Database: the current version of the database is made up of physical database plus
modifications implied by buffer in the volatile storage.
Database users
Program code
And buffer
in volatile
storage
Applicationi
Data Buffer
Applicationi
Log Buffers
physical
database on
nonvolatile
storage
current log,
check point
on stable storage
archive copy
of database
on stable
storage
archive log
on stable
storage
Archival database in stable storage: this is the copy of the database at a given time, stored. it contain
the entire database in a quiescent mode and could have been made by simple dump routine to dump
the physical database on to stable storage. all transaction that have been executed on the database
from the time of archiving have to be redline in a global recovery database is a copy of the database in
a quiescent state, and only the committed transaction since the time of archiving are applied to this
database.
Current log: the log information required for recovery from system failure involving loss of volatile
information.
Archival log: is used for failure involving if loss of nonvolatile information.
141
The online or current database is made up of all the records that are accessible to the DBMS during
its operation. The current database consist of the data stored in nonvolatile storage and not yet
propagated tot the nonvolatile storage.
10.7 Concurrency Control by Timestamps
One of the important transactions is that their effect on shared data is serially equivalent. This
means that any data that is touched by a set of transactions must be in such a state that the results
could have been obtained if all the transactions executed serially (one after another) in some order (it
does not matter which). What is invalid, is for the data to be in some form that cannot be the result of
serial execution (e.g. two transactions modifying data concurrently). One easy way of achieving this
guarantee is to ensure that only one transaction executes at a time. We can accomplish this by using
mutual exclusion and having a “transaction” resource that each transaction must have access to.
However, this is usually overkill and does not allow us to take advantage of the concurrency that we
may get in distributed systems (for instance, it is obviously overkill if two transactions don’t even
access the same data). What we would like to do is allow multiple transactions to execute
simultaneously but keep them out of each other’s way and ensure serializability. This is called
concurrency control.
Locking
We can use exclusive locks on a resource to serialize execution of transactions that share resources.
A transaction locks an object that it is about to use. If another transaction requests the same object
and it is locked, the transaction must wait until the object is unlocked. To implement this in a
distributed system, we rely on a lock manager - a server that issues locks on resources. This is
exactly the same as a centralized mutual exclusion server: a client can request a lock and then send
a message releasing a lock on a resource (by resource in this context, we mean some specific block of
data that may be read or written). One thing to watch out for, is that we still need to preserve serial
execution: if two transactions are accessing the same set of objects, the results must be the same as
if the transactions executed in some order (transaction A cannot modify some data while transaction
B modifies some other data and then transaction A accesses that modified data -- this is concurrent
modification). To ensure serial ordering on resource access, we impose a restriction that states that a
transaction is not allowed to get any new locks after it has released a lock. This is known as twophase locking. The first phase of the transaction is a growing phase in which it acquires the locks it
needs. The second phase is the shrinking phase where locks are released.
Strict two-phase locking
A problem with two-phase locking is that if a transaction aborts, some other transaction may have
already used data from an object that the aborted transaction modified and then unlocked. If this
happens, any such transactions will also have to be aborted. This situation is known as cascading
aborts. To avoid this, we can strengthen our locking by requiring that a transaction will hold all its
locks to the very end: until it commits or aborts rather than releasing the lock when the object is no
longer needed.
This is known as strict two-phase locking.
142
Locking granularity
A typical system will have many objects and typically a transaction will access only a small amount of
data at any given time (and it will frequently be the case that a transaction will not clash with other
transactions). The granularity of locking affects the amount of concurrency we can achieve. If we can
have a smaller granularity (lock smaller objects or pieces of objects) then we can generally achieve
higher concurrency. For example, suppose that all of a bank’s customers are locked for any
transaction that needs to modify a single customer datum: concurrency is severely limited because
any other transactions that need to access any customer data will be blocked. If, however, we use a
customer record as the granularity of locking, Transactions that access different customer records
will be capable of running concurrently.
Multiple readers/single writer
There is no harm having multiple transactions read from the same object as long as it has not been
modified by any of the transactions. This way we can increase concurrency by having multiple
transactions run concurrently if they are only reading from an object. However, only one transaction
should be allowed to write to an object. Once a transaction has modified an object, no other
transactions should be allowed to read or write the modified object. To support this, we now use two
locks: read locks and write locks. Read locks are also known as shared locks (since they can be
shared by multiple transactions) If a transaction needs to read an object, it will request a read lock
from the lock manager. If a transaction needs to modify an object, it will request a write lock from the
lock manager. If the lock manager cannot grant a lock, then the transaction will wait until it can get
the lock (after the transaction with the lock committed or aborted). To summarize lock granting: If a
transaction has: another transaction may obtain: no locks read lock or write lock read lock read lock
(wait for write lock) write lock wait for read or write locks
Increasing concurrency: two-version locking
Two-version locking is an optimistic concurrency control scheme that allows one transaction to write
tentative versions of objects while other transactions read from committed versions of the same
objects. Read operations only wait if another transaction is currently committing the same object.
This scheme allows more concurrency than read-write locks, but writing transactions risk waiting (or
rejection) when they attempt to commit. Transactions cannot commit their write operations
immediately if other uncommitted transactions have read the same objects. Transactions that request
to commit in this situation have to wait until the reading transactions have completed.
Two-version locking
The two-version locking scheme requires three types of locks: read, write, and commit locks. Before
an object is read, a transaction must obtain a read lock. Before an object is written, the transaction
must obtain a write lock (same as with two-phase locking). Neither of these locks will be granted if
there is a commit lock on the object. When the transaction is ready to commit:
- all of the transaction’s write locks are changed to commit locks
- if any objects used by the transaction have outstanding read locks, the transaction
must wait until the transactions that set these locks have completed and the locks are released.
If we compare the performance difference between two-version locking and strict two-phase locking
(read/write locks):
143
- read operations in two-version locking are delayed only while transactions are being committed
rather than during the entire execution of transactions (usually the commit protocol takes far less
time than the time to perform the transaction)
- but… read operations of one transaction can cause a delay in the committing of
other transactions.
Problems with locking
Locks are not without drawbacks Locks have an overhead associated with them: a lock manager is
needed to keep track of locks - there is overhead in requesting them. Even read-only operations must
still request locks. The use of locks can result in deadlock. We need to have software in place to
detect or avoid deadlock. Locks can decrease the potential concurrency in a system by having a
transaction hold locks for the duration of the transaction (until a commit or abort).
Optimistic concurrency control
King and Robinson (1981) proposed an alternative technique for achieving concurrency control, called
optimistic concurrency control. This is based on the observation that, in most applications, the
chance of two transactions accessing the same object is low. We will allow transactions to proceed as
if there were no possibility of conflict with other transactions: a transaction does not have to obtain or
check for locks. This is the working phase. Each transaction has a tentative version (private
workspace) of the objects it updates - copy of the most recently committed version. Write operations
record new values as tentative values. Before a transaction can commit, a validation is performed on
all the data items to see whether the data conflicts with operations of other transactions. This is the
validation phase.
If the validation fails, then the transaction will have to be aborted and restarted later. If the
transaction succeeds, then the changes in the tentative version are made permanent. This is the
update phase. Optimistic control is deadlock free and allows for maximum parallelism (at the
expense of possibly restarting transactions)
Timestamp ordering
Reed presented another approach to concurrency control in 1983. This is called timestamp
ordering. Each transaction is assigned a unique timestamp when it begins (can be from a physical or
logical clock). Each object in the system has a read and write timestamp associated with it (two
timestamps per object). The read timestamp is the timestamp of the last committed transaction that
read the object. The write timestamp is the timestamp of the last committed transaction that modified
the object (note - the timestamps are obtained from the transaction timestamp - the start of that
transaction)
The rule of timestamp ordering is:
- if a transaction wants to write an object, it compares its own timestamp with the object’s read and
write timestamps. If the object’s timestamps are older, then the ordering is good.
- if a transaction wants to read an object, it compares its own timestamp with the object’s write
timestamp. If the object’s write timestamp is older than the current transaction, then the ordering is
good.
144
If a transaction attempts to access an object and does not detect proper ordering, the transaction is
aborted and restarted (improper ordering means that a newer transaction came in and modified data
before the older one could access the data or read data that the older one wants to modify).
10.8 Concurrency Control by Validation
Validation or certification techniques. A transaction proceeds without waiting and all updates are
applied to local copies. At the end, a validation phase check if any updates violate serializability. If
certified, the transaction is committed and updates made permanent. If not certified, the transaction
is aborted and restarted later.
Three phases:
read phase
validation phase
write phase
Validation Test
Each T is associated with three TS's:
start(T): T started
val(T): T finished read and started its validation
finish(T): T finished its write phase
Validation test for Ti: for each Tj that is committed or in its validation phase, at least one of the
following holds:
finish(Tj) < start(Ti)
writeset(Tj) ∩ readset(Ti) = finish(Tj) < val(Ti)
writeset(Tj) ∩ readset(Ti) =..
writeset(Tj) ∩ writeset(Ti) = ..
val(Tj) < val(Ti)
10.9
Summary
•
When two or more transactions are running concurrently, the steps of the transactions
would normally be interleaved. The interleaved execution of transactions is decided by the
database scheduler, which receives a stream of user requests that arise from the active
transactions. A particular sequencing (usually interleaved) of the actions of a set of
transactions is called a schedule. A serial schedule is a schedule in which all the
operations of one transaction are completed before another transaction can begin (that is,
there is no interleaving).
•
Two operations conflict if:
o they are issued by different transactions,
o they operate on the same data item, and
o at least one of them is a write operation
Two executions are conflict-equivalent, if in both executions all conflicting operations have the
same order.
145
•
To allow multiple concurrent transactions access to the same data, most database servers use
a two-phase locking protocol. Each transaction locks sections of the data that it reads or
updates to prevent others from seeing its uncommitted changes. Only when the transaction is
committed or rolled back can the locks be released. This was one of the earliest methods of
concurrency control, and is used by most database systems.
•
several Object Orientated Databases, which were more recently developed, have incorporated
OCC within their designs to gain the performance advantages inherent within this
technological approach.
•
Memory Usage, Shared Memory
146
UNIT 11
TRANSACTION MANAGEMENT
11.1
11.2
11.3
11.4
11.5
11.6
11.7
11.8
Introduction of Transaction management
Serializability and Recoverability
View Serializability
Resolving Deadlocks
Distributed Databases
Distributed Commit
Distributed Locking
Summary
11.1 Introduction of Transaction Management
The synchronization primitives we have seen so far are not as high-level as we might want them to be
since they require programmers to explicitly synchronize, avoid deadlocks, and abort if necessary.
Moreover, the high-level constructs such as monitors and path expressions do not give users of
shared objects flexibility in defining the unit of atomicity. We will study here a high-level technique,
called concurrency control, which automatically ensures that concurrently interacting users do not
execute inconsistent commands on shared objects. A variety of concurrency models defining different
notions of consistency have been proposed. These models have been developed in the context of
database management systems, operating systems, CAD tools, collaborative software engineering,
and collaboration systems. We will focus here on the classical database models and the relatively
newer operating system models.
A type of computer processing in which the computer responds immediately to user requests. Each
request is considered to be a transaction. Automatic teller machines for banks are an example of
transaction processing.
The opposite of transaction processing is batch processing, in which a batch of requests is stored and
then executed all at one time. Transaction processing requires interaction with a user, whereas batch
processing can take place without a user being present.
The RDBMS must be able to support a centralized warehouse containing detail data, provide direct
access for all users, and enable heavy-duty, ad hoc analysis. Yet, for many companies just starting a
warehouse project, it seems a natural choice to simply use the corporate standard database that has
already proven itself for mission-critical work. This approach was especially common in the early
days of data warehousing, when most people expected a warehouse to do little more than provide
canned reports.
But decision-support requirements have evolved far beyond canned reports and known queries.
Today's data warehouses must give organizations the in-depth and accurate information they need to
147
personalize customer interactions at all touch points and convert browsers to buyers. An RDBMS
designed for transaction processing can't keep up with the demands placed on data warehouses:
support for high concurrency, mixed-workload, detail data, fast query response, fast data load, ad
hoc queries, and high-volume data mining.
The notion of concurrency control is closely tied to the notion of a ``transaction''. A transaction
defines a set of ``indivisible'' steps, that is, commands with the Atomicity, Consistency, Isolation, and
Durability (ACID) properties:
Atomicity: Either all or none of the steps of the transaction occur so that the invariants of the shared
objects are maintained. A transaction is typically aborted by the system in response to failures but it
may be aborted also by a user to ``undo'' the actions. In either case, the user is informed about the
success or failure of the transaction.
Consistency: A transaction takes a shared object from one legal state to another, that is, maintains
the invariant of the shared object.
Isolation: Events within a transaction are hidden from other concurrently executing transactions.
Techniques for achieving isolation are called synchronization schemes. They determine how these
transactions are scheduled, that is, what the relationships are between the times the different steps
of these transactions. Isolation is required to ensure that concurrent transactions do not cause an
illegal state in the shared object and to prevent cascaded rollbacks when a transaction aborts.
Durability: Once the system tells the user that a transaction has completed successfully, it ensures
that values written by the database system persist until they are explicitly overwritten by other
transactions.
Consider the schedules S1, S2, S3, S4 and S5 given below. Draw the precedence graphs for each
schedule and state whether each schedule is (conflict) serializable or not. If a schedule is serializable,
write down the equivalent serial schedule(s).
S1 : read1(X), read3(X), write1(X), read2(X), write3(X).
S2 : read1(X), read3(X), write3(X), write1(X), read2(X).
S3 : read3(X), read2(X), write3(X), read1(X), write1(X).
S4 : read3(X), read2(X), read1(X), write3(X), write1(X).
S5 : read2(X), write3(X), read2(Y), write4(Y), write3(Z), read1(Z),
write4(X), read1(X), write2(Y),
read1(Y).
11.2 Serializability and Recoverability
Serializability is the classical concurrency scheme. It ensures that a schedule for executing
concurrent transactions is equivalent to one that executes the transactions serially in some order. It
assumes that all accesses to the database are done using read and write operations. A schedule is
called ``correct'' if we can find a serial schedule that is ``equivalent'' to it. Given a set of transactions
T1...Tn, two schedules S1 and S2 of these transactions are equivalent if the following conditions are
satisfied:
148
Read-Write Synchronization: If a transaction reads a value written by another transaction in one
schedule, then it also does so in the other schedule.
Write-Write Synchronization: If a transaction overwrites the value of another transaction in one
schedule, it also does so in the other schedule.
Recoverability for changes to the other control file records sections is provided by maintaining all
the information in duplicate. Two physical blocks represent each logical block. One contains the
current information, and the other contains either an old copy of the information, or a pending
version that is yet to be committed. To keep track of which physical copy of each logical block
contains the current information, Oracle maintains a block version bitmap with the database
information entry in the first record section of the control file.
Recovery is an algorithmic process and should be kept as simple as possible, since complex
algorithms are likely to introduce errors. Therefore, an encoding scheme should be designed around a
set of principles intended to make recovery possible with simple algorithms. For processes such as
tag removal, simple mappings are more straightforward and less error prone than, say, algorithms
which require rearrangement of the sequence of elements, or which are context-dependent, etc.
Therefore, in order to provide a coherent and explicit set of recovery principles, various recovery
algorithms and related encoding principles need to be worked out, taking into account such things
as:
The role and nature of mappings (tags to typography, normalized characters, spellings, etc., with the
original ...);
•
The encoding of rendition characters and rendition text;
•
Definitions and separability of the source and annotation (such as linguistic annotation,
notes, etc.);
•
Linkage of different views or versions of a text; -0.5ex
11.3 View Serializability
Serializability In this paper, we assume serializability is the underlying consistency criterion.
Serializability requires that concurrent execution of transactions have the same effect as a serial
schedule. In a serial schedule, transactions are executed one at a time. Given that the database is
initially consistent and each transaction is a unit of consistency, serial execution of all transactions
will give each transaction a consistent view of the database and will leave the database in a
consistent state. Since serializable execution is computationally equivalent to serial execution,
serializable execution is also deemed correct.
View Serializability A subclass of serializability, called View Serializability, is identified based on
the observation that two schedules are equivalent if they have the same effects. The effects of a
schedule are the values they produce, which are functions of values they read. Two schedules H
1 and H
2 are defined to be view equivalent if
3 Runtime overhead in hard real-time systems effectively translates into increased tasks' execution
times, which in turn affects an algorithm's schedulability.
149
Database Concurrency Control ffl Multiple users. ffl Concurrent accesses. ffl Problems could arise if
there is no control. ffl Example:
- Transaction 1: withdraw $500 from account A. - Transaction 2: deposit $800 to account A - T
1 : read(A)
A = A \Gamma 500 write(A) - T
2 : read(A)
A = A + 800 write(A) - Initially A = 1000 - T
1T
2 A = 1300 - T
2T
1 A = 1300 - T
1 : read(A)
T 1 : A = A \Gamma 500
T 2 : read(A)
T 2 : A = A + 800
T 2 : write(A)
T 1 : write(A)
A = 500, inconsistent
T 2 : read(A)
T 1 : read(A)
T 1 : A = A \Gamma 500
T 1 : write(A)
T 2 : A = A + 800
T 2 : write(A)
A = 1800, inconsistent
Transactions and Schedules ffl A transaction is a sequence of operations. ffl A transaction is atomic.
ffl T = fT 1 ; : : : ; T n g is a set of transactions. ffl A schedule of a sequence of operations in T 1 ; : : : ;
T n such that for each 1 ^ i ^ n; 1. each operation in Ti appears exactly once, and 2. operations in Ti
appear in the same order as in Ti ffl A schedule of T is serial if 8i8j; i 6= j ) either 1. all the operations
in Ti appear before all the operations in Tj , or 2. all the operations in Tj appear before all the
operations in Ti ffl Assumption: Each Ti is correct when executed individually. ffl Any serial schedule
is valid. ffl Objective: Accept schedules "equivalent" to a serial schedule (serializable schedules). ffl ?
What do we mean by "equivalent" ?
View Serializability: ffl equivalent: same effects ffl The effects of a history are the values produced by
the Write operations of unaborted transactions. ffl We don't know anything about the computation of
each transactions. ffl Assume that if each transactions' Reads read the same value in two histories,
then all W rites write the same values in both histories. ffl If for each data item x, the final Write on x
is the same in both histories, then the final value of all data items will be the same in both histories.
ffl Two histories H and H 0 are view equivalent if
1. They are over the same set of transactions and have the same operations;
2. For any unaborted Ti , Tj and for any x, if Ti reads x from Tj in H then Ti reads x from Tj in H0 ,
and 3. For each x, if wi [x] is the final write of x in H then it is also the final write of x in H0. ffl
Assume that there is a transaction (Tb) which initializes the values for all the data objects. ffl A
schedule is view serializable if it is view equivalent
to a serial schedule. ffl r3 [x]w4 [x]w3 [x]w6 [x] - T3 read-from Tb.
- The final write for x is w6 [x].
- View equivalent to T3 T4 T6.
ffl r3 [x]w4 [x]r7 [x]w3 [x]w7 [x]
- T3 read-from Tb.
- T7 read-from T4.
- The final write for x is w7 [x].
- View equivalent to T3 T4 T7.
150
ffl r3 [x]w4 [x]w3 [x]
- T3 read-from Tb.
- The final write for x is w3 [x].
- Not serializable.
ffl w1 [x]r2 [x]w2 [x]r1 [x]
- T 2 read-from T1.
- T1 read-from T2.
- The final write for x is w2 [x].
- Not serializable.
ffl Test for view serializability. ffl Tb issues writes for all data objects (first transaction). ffl Tf read the
values for all data objects (last transaction).
ffl Construction of labeled precedence graph
1. Add an edge Ti. 0 ! Tj, if transaction Tj.
Reads from Ti.
2. For each data item Q such that - Tj
read-from Ti.- T k executes write(Q) and T k 6= Tb. (i 6= j 6= k) do the followings: (a) If Ti = Tb and Tj
6= T f , then insert the edge Tj 0 ! T k .
11.4 Resolving Deadlocks
Use the Monitor Display to understand and resolve deadlocks. This section demonstrates how this is
done. The steps assume that the deadlock is easy to recreate with the tested application running
within Optimize It Thread Debugger. If this is not the case, use the Monitor Usage Analyzer instead.
To resolve a deadlock:
1.
Recreate the deadlock.
2.
Switch to Monitor Display.
3. Identify which thread is not making progress. Usually, the thread is yellow because it is
blocking on a monitor. Call that thread the blocking thread.
4. Select the Connection button to identify where the blocking thread is not making progress.
Double-click on the method to display the source code for the blocking method, as well as
methods calling the blocking method. This provides some context for where the deadlock
occurs.
5. Identify which thread owns the unavailable monitor. Call this the locking thread.
6. Identify why the locking thread does not release the monitor. This can happen in the
following cases:
• The locking thread is itself trying to acquire a monitor owned directly or indirectly by the
blocking thread. In this case, a bug exists since both the locking and the blocking threads
enter monitors in a different order. Changing the code to always enter monitors in the
same order will resolve this deadlock.
• The locking thread is not releasing the monitor because it remains busy executing the
code. In this case, the locking thread is green because it uses some CPU. This type of bug
is not a real deadlock. It is an extreme contention issue caused by the locking thread
holding the monitor for too long, sometimes called thread starvation.
• The locking thread is waiting for an I/O operation. In this case the locking thread is
purple. It is dangerous for a thread to perform an I/O operation while holding a monitor,
unless the only purpose of that monitor is to protect the objects used to perform the I/O.
151
A blocking I/O operation may never occur, causing the program to hang. Often these
situations can be resolved by releasing the monitor before performing the I/O.
• The locking thread is waiting for another monitor. In this case, the locking thread is red.
It is equally dangerous to wait for a monitor while holding another monitor. The monitor
may never be notified, causing a deadlock. Often this situation can be resolved by
releasing the monitor that the blocking thread wants to acquire before waiting on the
monitor.
11.5
Distributed Databases
To support data collection needs of large networks, a distributed multitier architecture is
necessary. The advantage of this approach is that massive scalability can be implemented in an
incremental fashion, easing the operations and financial burden. Web NMS Server has been designed
to scale from a single server solution to a distributed architecture supporting very large scale needs.
The architecture supports one or more relational databases (RDBMS).
A single back-end server collects the data and stores it in a local or remote database. The system
readily supports tens of thousands of managed objects. For example, on a 400 MHz Pentium
Windows NT system, the polling engine can collect over 4000 collected variables/minute, including
storage in the MySQL database on Windows.
The bottleneck for data collection is usually the database inserts, which limits the number of entries
per second that can be inserted into the database. As we discuss below, with a distributed database,
considerably higher performance is possible through distributing the storage of data into multiple
databases.
Based on tests with different modes, one central database on commodity hardware can handle up to
100 collected variables/second; with distributed databases, this can be scaled much higher. The
achievable rate depends on the number of databases and the number and type of servers used. With
distributed databases, there is often a need to aggregate data in a single central store for reporting
and other purposes. Multiple approaches are feasible here:
•
Roll-up data periodically to the central database from the different distributed databases.
•
Use Native database distribution for centralized views, e.g. Oracle SQL Net. This is vendor
dependent, but can provide easy consolidation of data from multiple databases.
•
Aggregate data using JDBC only when creating a report. This would require the report writer
to take care of collecting the data from the different databases for the report.
The need arises to
•
To reduce the burden of Poll Engine in Web NMS Server.
•
To facilitate faster data collection
The solution is Distributed Polling. You can adopt this technique when you are able to distinguish
the network elements geographically. You can form a group of network elements and decide to have
one Distributed Poller for them.
152
This section describes Distributed Polling architecture available with Web NMS Server. It discusses
the design, and the choices available in implementing the distributed solution. It provides guidelines
on setting up the components of the distributed system.
Architecture is very simple.
•
You have Web NMS server running in one machine and Distributed Poller running in other
machines, one in each.
•
Each Poller is identified by a name and has an associated database (labelled as Secondary
RDBMS in the diagram)
•
You create PolledData and specify the Poller name if you want to perform data collection for
that PolledData via the distributed poller. In case you want Web NMS Polling Engine to collect
data you don't specify any Poller name. By default, PolledData will not be associated with any
of the Pollers.
•
Once you associate the PolledData with the Poller and start the Poller , data collection is done
by poller and collected data is stored in Poller database (Secondary RDBMS).
Major features of a DDB are:
o
Data stored at a number of sites, each site logically single processor
o
Sites are interconnected by a network rather than a multiprocessor configuration
o
DDB is logically a single database (although each site is a database site)
o
DDBMS has full functionality of a DBMS
To the user, the distributed database system should appear exactly like a non-distributed database
system.
Advantages of distributed database systems are:
o Local autonomy (in enterprises that are distributed already)
o Improved performance (since data is stored close to where needed and a query may be split
over several sites and executed in parallel)
o Improved reliability/availability (should one site go down)
o Economics
o Expandability
o Shareability
Disadvantages of distributed database systems are:
153
o Complexity (greater potential for bugs in software)
o Cost (software development can be much more complex and therefore costly. Also, exchange of
messages and additional computations involve increased overheads)
o Distribution of control (no single database administrator controls the ddb)
o Security (since the system is distributed the chances of security lapses are greater)
o Difficult to change (since all sites have control of the their own sites)
o Lack of experience (enough experience is not available in developing distributed systems)
Replication improves availability since the system would continue to be fully functional even if a site
goes down. Replication also allows increased parallelism since several sites could be operating on the
same relations at the same time. Replication does result in increased overheads on update.
Fragmentation may be horizontal, vertical or hybrid (or mixed). Horizontal fragmentation splits a
relation by assigning each tuple of the relation to a fragment of the relation. Often horizontal
fragmentation is based on predicates defined on that relation.
Vertical fragmentation splits the relation by decomposing a relation into several subsets of the
each of which contains a subset of
attributes. Relation R produces fragments
attributes of R as well as the primary key of R. Aim of vertical fragmentation is to put together those
attributes that are accessed together.
Mixed fragmentation uses both vertical and horizontal fragmentation.
To obtain a sensible fragmentation design, it is necessary to know some information about the
database as well as about applications. It is useful to know the predicates used in the application
queries - at least the 'important' ones.
Aim is to have applications using only one fragment.
Fragmentation must provide completeness (all information in a relation must be available in the
fragments), reconstruction (the original relation should be able to be reconstructed from the
fragments) and disjointedness (no information should be stored twice unless absolutely essential, for
example, the key needs to be duplicated in vertical fragmentation).
Transparency involves the user not having to know how a relation is stored in the DDB; it is the
system capability to hide the details of data distribution from the user.
Autonomy is the degree to which a designer or administrator of one site may be independent of the
remainder of the distributed system.
It is clearly undesirable for the users to have to know which fragment of the relation they require to
process the query that they are posing. Similarly the users should not need to know which copy of a
replicated relation or fragment they need to use. It should be upto the system to figure out which
fragment or fragments of a relation a query requires and which copy of a fragment the system will use
to process the query. This is called replication and fragmentation transparency.
A user should also not need to know where the data is located and should be able to refer to a
relation by name which could then be translated by the system into full name that includes the
location of the relation. This is location transparency.
Global query optimization is complex because of
154
•
cost models
•
fragmentation and replication
•
large solution space from which to choose
Computing cost itself can be complex since the cost is a weighted combination of the I/O, CPU and
communications costs. Often one of the two cost models are used; one may wish to minimize the total
cost (time) or the response time. Fragmentation and replication add another complexity to finding an
optimum query plan.
Date's 12 Rules for Distributed Databases
RDBMS in all other respects should behave like a non distributed RDBMS. This is sometimes called
Rule 0.
Distributed Database Characteristics
According to Oracle, these are the database characteristics and how Oracle 7 technology meets each
point:
1. Local autonomy The data is owned and managed locally. Local operations remain purely local. One
site (node) in the distributed system does not depend on another site to function successfully.
2. No reliance on a central site All sites are treated as equals. Each site has its own data dictionary.
3. Continuous operation Incorporating a new site has no effect on existing applications and does not
disrupt service.
4. Location independence Users can retrieve and update data independent of the site.
5. Partitioning [fragmentation] independence Users can store parts of a table at different locations.
Both horizontal and vertical partitioning of data is possible.
6. Replication independence Stored copies of data can be located at multiple sites. Snapshots, a type
of database object, can provide both read-only and updatable copies of tables. Symmetric replication
using triggers makes readable and writable replication possible.
7. Distributed query processing Users can query a database residing on another node. The query is
executed at the node where the data is located.
8. Distributed transaction management A transaction can update, insert, or delete data from multiple
databases. The two-phase commit mechanism in Oracle ensures the integrity of distributed
transactions. Row-level locking ensures a high level of data concurrency.
9. Hardware independence Oracle7 runs on all major hardware platforms.
10. Operating system independence A specific operating system is not required. Oracle7 runs under a
variety of operating systems.
11. Network independence The Oracle's SQL*Net supports most popular networking software.
Network independence allows communication across homogeneous and heterogeneous networks.
Oracle's MultiProtocol Interchange enables applications to communicate with databases across
multiple network protocols.
12. DBMS independence DBMS independence is the ability to integrate different databases. Oracle's
Open Gateway technology supports ODBC-enahled connections to non-Oracle databases.
11.6
Distributed Commit
155
To create a new user, test, and a corresponding default schema you must be connected as the ADMIN
user and then use:
CREATE USER test;
CREATE SCHEMA test AUTHORIZATION test; --sets the default schema
COMMIT;
and then connect to the new user/schema using:
CONNECT TO '' USER 'test'
Notice that the COMMIT was needed before the CONNECT because re-connecting would otherwise
rollback any uncommitted changes.
In this example the sequence of events is as follows:
The coordinator at Client A registers automatically with the Transaction Manager database at Server
B, using TM_DATABASE=TMB.
The application requester at Client A issues a DUOW request to Servers C and E. For example, the
following REXX script illustrates this:
/**/
'set DB2OPTIONS=+c' /* in order to turn off autocommit */
'db2 set client connect 2 syncpoint twophase'
'db2 connect to DBC user USERC using PASSWRDC'
'db2 create table twopc (title varchar(50) artno smallint not null)'
'db2 insert into twopc (title,artno) values("testCCC",99)'
'db2 connect to DBE user USERE using PASSWRDE'
'db2 create table twopc (title varchar(50) artno smallint not null)'
'db2 insert into twopc (title,artno) values("testEEE",99)'
'commit'
exit (0);
When the commit is issued, the coordinator at the application requester sends prepare requests to
the SPM for the updates requested at servers C and E.
The SPM is running on Server D, as part of DB2 Connect, and it sends the prepare requests to
servers C and E. Servers C and E in turn acknowledge the prepare requests.
The SPM sends back an acknowledgement to the coordinator at the application requester.
The coordinator at the application requester sends a request to the transaction manager at Server B
for the servers that have acknowledged, and the transaction manager decides whether to commit or
roll-back. The transaction manager logs the commit decision, and the updates are guaranteed from
this point. The coordinator issues commit requests, which are processed by the SPM, and forwarded
to servers C and E, as were the prepare requests. Servers C and E commit and report success to the
SPM. SPM then returns the commit result to the coordinator, which updates the TMB with the
commit results.
156
Two-phase
11.7
Commit
RDBMS
Scenario
Distributed Locking
The intent of this white paper is to convey information regarding database locks as they apply to
transactions in general and the more specific case of how they are implemented by the Progress
server. We’ll begin with a general overview discussing why locks are needed and how they affect
transactions. Transactions and locking are outlined in the SQL standard so no introduction would
be complete without discussing the guidelines set forth here. Once we have a grasp on the general
concepts of locking we’ll dive into lock modes, such as table and record locks and their effect on
different types of database operations. Next, the subject of timing will be introduced, when locks are
obtained and when they are released. From here we’ll get into lock contention and deadlocks, which
are multiple operations or transactions all attempting to get locks on the same resource at the same
time. And to conclude our discussion on locking we’ll take a look at how we can see locks in our
application so we know which transactions obtain which types of locks. Finally, this white paper
describes differences in locking behavior between previous and current versions of Progress and
differences in locking behavior when both 4GL and SQL92 clients are accessing the same resources.
Locks
The answer to why we lock is simple; if we didn’t there would be no consistency. Consistency
provides us with successive, reliable, and uniform results without which applications such as
banking and reservation systems, manufacturing, chemical, and industrial data collection and
processing could not exist. Imagine a banking application where two clerks attempt to update an
account balance at the same time: one credits the account and the other debits the account. While
one clerk reads the account balance of $200 to credit the account $100, the other clerk has already
completed the debit of $100 and updated the account balance to $100. When the first clerk finishes
the credit of $100 to the balance of $200 and updates the balance to $300 it will be as if the debit
never happened. Great for the customer; however the bank wouldn’t be in business for long.
What objects are we locking?
What database objects get locked is not as simple to answer as why they’re locked. From a user
perspective, objects such as the information schema, user tables, and user records are locked while
157
being accessed to maintain consistency. There are other lower level objects that require locks that are
handled by the RDBMS; however, they are not visible to the user. For the purposes of this discussion
we will focus on the objects that the user has visibility of and control over.
Transactions
Now that we know why and what we lock, let’s talk a bit about when we lock. A transaction is a unit
of work; there is a well-defined beginning and end to each unit of work. At the beginning of each
transaction certain locks are obtained and at the end of each transaction they are released. During
any given transaction, the RDBMS, on behalf of the user, can escalate, deescalate, and even release
locks as required. We’ll talk about this in more detail later when we discuss lock modes. The
aforementioned is all-true in the case of a normal, successful transaction; however in the case of an
abnormally terminated transaction things are handled a bit differently. When a transaction fails, for
any reason, the action performed by the transaction needs to be backed out, the change undone. To
accomplish this most RDBMS use what are known as “save points.” A save point marks the last
known good point prior to the abnormal termination; typically this is the beginning of the
transaction. It’s the RDBMS’s job to undo the changes back to the previous save point as well as
ensuring the proper locks are held until the transaction is completely undone. So, as you can see,
transactions that are in the process to be undone (rolled back) are still transactions nonetheless and
still need locks to maintain data consistency. Locking certain objects for the duration of a
transaction ensures database consistency and isolation from other concurrent transactions,
preventing the banking situation we described previously. Transactions are the basis for the ACID
• ATOMICITY guarantees that all operations within a transaction are performed or none of them are
performed.
• CONSISTENCY is the concept that allows an application to define consistency points and validate
the correctness of data transformations from one state to the next.
• ISOLATION guarantees that concurrent transactions have no effect on each other.
• DURABILITY guarantees that all transaction updates are preserved.
11.8
Summary
A type of computer processing in which the computer responds immediately to user requests. Each
request is considered to be a transaction. Automatic teller machines for banks are an example of
transaction processing. The opposite of transaction processing is batch processing, in which a batch
of requests is stored and then executed all at one time. Transaction processing requires interaction
with a user, whereas batch processing can take place without a user being present.
Serializability is the classical concurrency scheme. It ensures that a schedule for executing
concurrent transactions is equivalent to one that executes the transactions serially in some order. It
assumes that all accesses to the database are done using read and write operations. A schedule is
called ``correct'' if we can find a serial schedule that is ``equivalent'' to it. Given a set of transactions
T1...Tn, two schedules S1 and S2 of these transactions are equivalent if the following conditions are
satisfied: Read-Write Synchronization: If a transaction reads a value written by another transaction in
one schedule, then it also does so in the other schedule. Write-Write Synchronization: If a transaction
overwrites the value of another transaction in one schedule, it also does so in the other schedule.
158
Was this manual useful for you? yes no
Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement